'How can I fetch data from another Excel tab using a formula as reference using Python and Pandas (or something like)

First, I'm not that sure if pandas is the right approach to this, it may be better done with VBA or another lib like openpyxl.

I have a excel sheet which has two different tabs (tab1 has a name and a value, which is a formula like: ='tab2'!H10, for instance, tab2 has said value (or sum of values) and other bunch of information).

I want to get information from the value column on tab1, which may have reference for more than one cell on the second tab ='tab2'!H10 + 'tab2'!H12 + 'tab2'!H20 on row = Name1. Extract those ROWS (row 10, 12 and 20 on this example) and fetch information from 3 columns on tab2, for those rows.

Then, I want to "join" (not sure if a join) the name on tab1 with those 3 columns from tab2 on said lines. Something like this as the end result:

| Name 1 (from tab 1 - line) | Column 1 (from tab2) | Column 2 | Column 3 | from row 10

| Name 1 (from tab 1 - line) | Column 1 (from tab2) | Column 2 | Column 3 | from row 12

| Name 1 (from tab 1 - line) | Column 1 (from tab2) | Column 2 | Column 3 | from row 20

Code that I'm trying and it's not currently working, error ValueError: cannot join with no overlapping index names

import numpy as np
import pandas as pd
from IPython.display import display
from openpyxl import Workbook
from openpyxl import load_workbook

wbx = load_workbook(filename= 'test.xlsx')

sheet_names = wbx.sheetnames

name1 = sheet_names[0]
sheet_ranges1 = wbx[name1]

df1 = pd.DataFrame(sheet_ranges1.values)

name2 = sheet_names[1]
sheet_ranges2 = wbx[name2]

df2 = pd.DataFrame(sheet_ranges2.values)

pd.set_option("display.max_rows", None, "display.max_columns", None)

c1 = df1.iloc[:,[1]]
c2 = df1.iloc[:,24]
print(c1.dtypes)

res = c2.str.extractall(r"!H(?P<line>\d+)?")
res2 = c1.merge(pd.DataFrame(res), how='left', left_index=True, right_index=True)


Solution 1:[1]

hope it helps:

import pandas as pd
df1 = pd.read_excel(r'.\foldername\filename.xlsx', sheet_name='sheet1')
df2 = pd.read_excel(r'.\foldername\filename.xlsx', sheet_name='sheet2')
df3 = pd.read_excel(r'.\foldername\filename.xlsx', sheet_name='sheet3')

# drop columns as needed that are not to include in merged result, or to avoid duplicate column that will be col_x and col_y
df1 = df1.drop(columns=['col2', 'col3'], index=False)

# join table
dfx = df1.merge(df2, how="inner", left_on="col1", right_on="col2)
merged = dfx.merge(df3, how="left", left_on="col7", right_on="col3)
print(merged.head())

you can do as well in VBA

Sub JoinTables()

 Dim connection As ADODB.Connection
 Set connection = New ADODB.Connection

 With connection
     .Provider = "Microsoft.Jet.OLEDB.4.0"
     .ConnectionString = "Data Source=" & ThisWorkbook.FullName & ";" & "Extended Properties=Excel 8.0;"
     .Open
 End With

 Dim recordset As ADODB.Recordset
 Set recordset = New ADODB.Recordset

 recordset.Open "SELECT * FROM [Sheet1$] INNER JOIN [Sheet2$] ON [Sheet1$].[type] = " & "[Sheet2$].[type]", connection

 With Worksheets("Sheet3")
     .Cells(2, 1).CopyFromRecordset recordset
 End With

 recordset.Close
 connection.Close

 End Sub

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1