'How can I fetch data from another Excel tab using a formula as reference using Python and Pandas (or something like)
First, I'm not that sure if pandas is the right approach to this, it may be better done with VBA or another lib like openpyxl.
I have a excel sheet which has two different tabs (tab1 has a name and a value, which is a formula like: ='tab2'!H10, for instance, tab2 has said value (or sum of values) and other bunch of information).
I want to get information from the value column on tab1, which may have reference for more than one cell on the second tab ='tab2'!H10 + 'tab2'!H12 + 'tab2'!H20 on row = Name1. Extract those ROWS (row 10, 12 and 20 on this example) and fetch information from 3 columns on tab2, for those rows.
Then, I want to "join" (not sure if a join) the name on tab1 with those 3 columns from tab2 on said lines. Something like this as the end result:
| Name 1 (from tab 1 - line) | Column 1 (from tab2) | Column 2 | Column 3 | from row 10
| Name 1 (from tab 1 - line) | Column 1 (from tab2) | Column 2 | Column 3 | from row 12
| Name 1 (from tab 1 - line) | Column 1 (from tab2) | Column 2 | Column 3 | from row 20
Code that I'm trying and it's not currently working, error ValueError: cannot join with no overlapping index names
import numpy as np
import pandas as pd
from IPython.display import display
from openpyxl import Workbook
from openpyxl import load_workbook
wbx = load_workbook(filename= 'test.xlsx')
sheet_names = wbx.sheetnames
name1 = sheet_names[0]
sheet_ranges1 = wbx[name1]
df1 = pd.DataFrame(sheet_ranges1.values)
name2 = sheet_names[1]
sheet_ranges2 = wbx[name2]
df2 = pd.DataFrame(sheet_ranges2.values)
pd.set_option("display.max_rows", None, "display.max_columns", None)
c1 = df1.iloc[:,[1]]
c2 = df1.iloc[:,24]
print(c1.dtypes)
res = c2.str.extractall(r"!H(?P<line>\d+)?")
res2 = c1.merge(pd.DataFrame(res), how='left', left_index=True, right_index=True)
Solution 1:[1]
hope it helps:
import pandas as pd
df1 = pd.read_excel(r'.\foldername\filename.xlsx', sheet_name='sheet1')
df2 = pd.read_excel(r'.\foldername\filename.xlsx', sheet_name='sheet2')
df3 = pd.read_excel(r'.\foldername\filename.xlsx', sheet_name='sheet3')
# drop columns as needed that are not to include in merged result, or to avoid duplicate column that will be col_x and col_y
df1 = df1.drop(columns=['col2', 'col3'], index=False)
# join table
dfx = df1.merge(df2, how="inner", left_on="col1", right_on="col2)
merged = dfx.merge(df3, how="left", left_on="col7", right_on="col3)
print(merged.head())
you can do as well in VBA
Sub JoinTables()
Dim connection As ADODB.Connection
Set connection = New ADODB.Connection
With connection
.Provider = "Microsoft.Jet.OLEDB.4.0"
.ConnectionString = "Data Source=" & ThisWorkbook.FullName & ";" & "Extended Properties=Excel 8.0;"
.Open
End With
Dim recordset As ADODB.Recordset
Set recordset = New ADODB.Recordset
recordset.Open "SELECT * FROM [Sheet1$] INNER JOIN [Sheet2$] ON [Sheet1$].[type] = " & "[Sheet2$].[type]", connection
With Worksheets("Sheet3")
.Cells(2, 1).CopyFromRecordset recordset
End With
recordset.Close
connection.Close
End Sub
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 |
