'How to compare tables with python petl
TL;DR
How can I resolve differences between two petl tables that appear to have mostly matching records, but are being returned as not matching.
Details
I am trying to rebuild data integrations/pipelines with pythons petl module. The integrations are currently built in SnapLogic. I'm on the last step of one integration where I have to compare two tables and get the records from one table that are not in the other table(new records or deletions) and I also need to get the records that have been updated between the tables(updates).
The first table is parsed from a pipe delimited .txt file and is transformed to fit the structure of the database that this data is to be inserted into.
The second table is all the records that are currently on the database that the first table will be written to. I am doing a sql call with pandas to get the data, converting the data to a pandas dataframe and then converting that to a petl table with the petl.fromdataframe() method. The database it's coming from is an mssql 2016 server.
df = pd.read_sql_query(
'SELECT * FROM db', connection)
table = etl.fromdataframe(df)
So essentially the database table is kept up to date by the file that is converted into the first table and is compared to the current state of the database.
As far as I can tell, the two tables are identical as far as datatypes and number of columns.
There are around 36 columns and 1500 rows.
When I try using the petl.complement(table1, table2) method, I would expect to see all the records from table1 that are not in table2, however that doesn't happen. If I switch the tables in the method, I get the exact same records as with the original order. There should be only a few records that are not in both tables but I get almost all the records for each combination of the complement method.
I've also tried petl.recordcomplement and it gives the same results. petl.intersection, which is supposed to show the records that are the same from each table, gives around 200 records.
I've visually looked over excel files of both tables and I can't see any differences in a handful of the records I've been able to look at.
When I test these methods with the examples from the petl docs, it works fine.
My first assumption was the datatypes are different. I've created loops that go over each tables records and uses the petl.typecounts and the petl.valuecounts methods to see differences. There doesn't seem to be any difference in datatypes and there are differences in values but not 1500 records worth.
The mappings for table1 that comes from the pipe delimited file:
mappings = OrderedDict()
mappings['ID'] = 'ID', lambda i: math.nan if i is None or i == "" else (
str(int(i))).zfill(7)
mappings['CLASSIFICATION'] = 'CLASSIFICATION'
mappings['FIRST_NAME'] = 'FIRST_NAME'
mappings['LAST_NAME'] = 'LAST_NAME'
mappings['NICKNAME'] = 'NICKNAME'
mappings['NAME'] = 'NAME'
mappings['TITLE'] = 'TITLE'
mappings['USERNAME'] = 'USERNAME', lambda u: "" if "." in u else u
mappings['CAMPUS_EMAIL'] = 'CAMPUS_EMAIL'
mappings['ZIP'] = 'ZIP', lambda z: str(
int(z))[0:5] if z != None and math.isnan(z) != True and z != "" else z
mappings['PHONE'] = 'PHONE'
mappings['DOB'] = 'DOB', lambda d: datetime.fromisoformat(
f'{str(d)[0:10]}' + 'T' + f'{str(d)[11:]}') if str(d) != "nan" else datetime.fromisoformat("1900-01-01T00:00:00.000")
mappings['OFF_CAMPUS_EMAIL'] = 'OFF_CAMPUS_EMAIL'
mappings['BUILDING'] = 'BUILDING CODE'
mappings['ROOM'] = 'ROOM NUMBER'
mappings['CAMPUS_PHONE'] = 'CAMPUS_PHONE'
mappings['DEPARTMENT'] = 'DEPARTMENT DESC'
mappings['COMPANY'] = 'COMPANY'
mappings['OFFICE'] = 'CAMPUS BUILDING DESC'
mappings['SUPERVISOR_ID'] = 'SUPERVISOR_ID', lambda s: (str(s)).zfill(
7) if s != None and s != "" else s
mappings['SUPERVISOR_FIRST_NAME'] = 'SUPERVISOR_FIRST_NAME'
mappings['SUPERVISOR_LAST_NAME'] = 'SUPERVISOR_LAST_NAME'
mappings['SUPERVISOR_EMAIL'] = 'SUPERVISOR_EMAIL'
mappings['GENDER'] = 'GENDER'
mappings['MIDDLE_NAME'] = 'MIDDLE_NAME'
mappings['FULL_TIME'] = 'FULL_TIME'
mappings['PREFIX'] = 'PREFIX'
mappings['START_DATE'] = 'START_DATE', lambda d: datetime.fromisoformat(f'{str(d)[0:10]}' + 'T' + f'{str(d)[11:]}') if str(
d) != "nan" else datetime.fromisoformat("1900-01-01T00:00:00.000")
mappings['CAMPUS_BOX'] = 'CAMPUS_BOX'
mappings['SYSTEM_ACCESS_START_DATE'] = 'SYSTEM_ACCESS_START_DATE', lambda d: datetime.fromisoformat(f'{str(d)[0:10]}' + 'T' + f'{str(d)[11:]}') if str(
d) != "nan" else datetime.fromisoformat("1900-01-01T00:00:00.000")
mappings['SYSTEM_ACCESS_END_DATE'] = 'SYSTEM_ACCESS_END_DATE', lambda d: datetime.fromisoformat(f'{str(d)[0:10]}' + 'T' + f'{str(d)[11:]}') if str(
d) != "nan" else datetime.fromisoformat("1900-01-01T00:00:00.000")
mappings['PER_POS_START_DATE'] = 'PER_POS_START_DATE', lambda d: datetime.fromisoformat(f'{str(d)[0:10]}' + 'T' + f'{str(d)[11:]}') if str(
d) != "nan" else datetime.fromisoformat("1900-01-01T00:00:00.000")
mappings['PER_POS_END_DATE'] = 'PER_POS_END_DATE', lambda d: datetime.fromisoformat(f'{str(d)[0:10]}' + 'T' + f'{str(d)[11:]}') if str(
d) != "nan" else datetime.fromisoformat("1900-01-01T00:00:00.000")
mappings['HR_EFFECT_TERM_DATE'] = 'HRP_EFFECT_TERM_DATE', lambda d: datetime.fromisoformat(f'{str(d)[0:10]}' + 'T' + f'{str(d)[11:]}') if str(
d) != "nan" else datetime.fromisoformat("1900-01-01T00:00:00.000")
mappings['HR_EFFECT_RETIRE_DATE'] = 'HRP_EFFECT_RETIRE_DATE', lambda d: datetime.fromisoformat(f'{str(d)[0:10]}' + 'T' + f'{str(d)[11:]}') if str(
d) != "nan" else datetime.fromisoformat("1900-01-01T00:00:00.000")
mappings['COLLEAGUE_ID'] = 'ALT_ID', lambda i: math.nan if i is None or i == "" or i == math.nan else (
str(int(i))).zfill(7)
mapped = etl.fieldmap(conditioned, mappings)
After doing the mappings, I started to compare the tables and I found that the empty cells in the delimited file were showing up as nan in the ipython notebook representation of the table and the table that came from the queried database were just blank or possibly null.
So I went about mapping the queried database table to try and make everything match:
mappings = OrderedDict()
mappings['ID'] = 'ID'
mappings['CLASSIFICATION'] = 'CLASSIFICATION', lambda oc: math.nan if oc == "" or oc == None else oc
mappings['FIRST_NAME'] = 'FIRST_NAME', lambda oc: math.nan if oc == "" or oc == None else oc
mappings['LAST_NAME'] = 'LAST_NAME', lambda oc: math.nan if oc == "" or oc == None else oc
mappings['NICKNAME'] = 'NICKNAME', lambda oc: math.nan if oc == "" or oc == None else oc
mappings['NAME'] = 'NAME', lambda oc: math.nan if oc == "" or oc == None else oc
mappings['TITLE'] = 'TITLE', lambda oc: math.nan if oc == "" or oc == None else oc
mappings['USERNAME'] = 'USERNAME', lambda oc: math.nan if oc == "" or oc == None else oc
mappings['CAMPUS_EMAIL'] = 'CAMPUS_EMAIL', lambda oc: math.nan if oc == "" or oc == None else oc
mappings['ZIP'] = 'ZIP', lambda oc: math.nan if oc == "" or oc == None else oc
mappings['PHONE'] = 'PHONE', lambda oc: math.nan if oc == "" or oc == None else oc
mappings['DOB'] = 'DOB', lambda d: datetime.fromisoformat(
f'{str(d)[0:10]}' + 'T' + f'{str(d)[11:]}') if str(d) != "nan" else datetime.fromisoformat("1900-01-01T00:00:00.000")
mappings['OFF_CAMPUS_EMAIL'] = 'OFF_CAMPUS_EMAIL', lambda oc: math.nan if oc == "" or oc == None else oc
mappings['BUILDING'] = 'BUILDING', lambda oc: math.nan if oc == "" or oc == None else oc
mappings['ROOM'] = 'ROOM', lambda oc: math.nan if oc == "" or oc == None else oc
mappings['CAMPUS_PHONE'] = 'CAMPUS_PHONE', lambda oc: math.nan if oc == "" or oc == None else oc
mappings['DEPARTMENT'] = 'DEPARTMENT', lambda oc: math.nan if oc == "" or oc == None else oc
mappings['COMPANY'] = 'COMPANY', lambda oc: math.nan if oc == "" or oc == None else oc
mappings['OFFICE'] = 'OFFICE', lambda oc: math.nan if oc == "" or oc == None else oc
mappings['SUPERVISOR_ID'] = 'SUPERVISOR_ID', lambda oc: math.nan if oc == "" or oc == None else oc
mappings['SUPERVISOR_FIRST_NAME'] = 'SUPERVISOR_FIRST_NAME', lambda oc: math.nan if oc == "" or oc == None else oc
mappings['SUPERVISOR_LAST_NAME'] = 'SUPERVISOR_LAST_NAME', lambda oc: math.nan if oc == "" or oc == None else oc
mappings['SUPERVISOR_EMAIL'] = 'SUPERVISOR_EMAIL', lambda oc: math.nan if oc == "" or oc == None else oc
mappings['GENDER'] = 'GENDER', lambda oc: math.nan if oc == "" or oc == None else oc
mappings['MIDDLE_NAME'] = 'MIDDLE_NAME', lambda oc: math.nan if oc == "" or oc == None else oc
mappings['FULL_TIME'] = 'FULL_TIME', lambda oc: math.nan if oc == "" or oc == None else oc
mappings['PREFIX'] = 'PREFIX', lambda oc: math.nan if oc == "" or oc == None else oc
mappings['START_DATE'] = 'START_DATE', lambda d: datetime.fromisoformat(f'{str(d)[0:10]}' + 'T' + f'{str(d)[11:]}') if str(
d) != "nan" else datetime.fromisoformat("1900-01-01T00:00:00.000")
mappings['CAMPUS_BOX'] = 'CAMPUS_BOX', lambda oc: math.nan if oc == "" or oc == None else oc
mappings['SYSTEM_ACCESS_START_DATE'] = 'SYSTEM_ACCESS_START_DATE', lambda d: datetime.fromisoformat(f'{str(d)[0:10]}' + 'T' + f'{str(d)[11:]}') if str(
d) != "nan" else datetime.fromisoformat("1900-01-01T00:00:00.000")
mappings['SYSTEM_ACCESS_END_DATE'] = 'SYSTEM_ACCESS_END_DATE', lambda d: datetime.fromisoformat(f'{str(d)[0:10]}' + 'T' + f'{str(d)[11:]}') if str(
d) != "nan" else datetime.fromisoformat("1900-01-01T00:00:00.000")
mappings['PER_POS_START_DATE'] = 'PER_POS_START_DATE', lambda d: datetime.fromisoformat(f'{str(d)[0:10]}' + 'T' + f'{str(d)[11:]}') if str(
d) != "nan" else datetime.fromisoformat("1900-01-01T00:00:00.000")
mappings['PER_POS_END_DATE'] = 'PER_POS_END_DATE', lambda d: datetime.fromisoformat(f'{str(d)[0:10]}' + 'T' + f'{str(d)[11:]}') if str(
d) != "nan" else datetime.fromisoformat("1900-01-01T00:00:00.000")
mappings['HR_EFFECT_TERM_DATE'] = 'HR_EFFECT_TERM_DATE', lambda d: datetime.fromisoformat(f'{str(d)[0:10]}' + 'T' + f'{str(d)[11:]}') if str(
d) != "nan" else datetime.fromisoformat("1900-01-01T00:00:00.000")
mappings['HR_EFFECT_RETIRE_DATE'] = 'HR_EFFECT_RETIRE_DATE', lambda d: datetime.fromisoformat(f'{str(d)[0:10]}' + 'T' + f'{str(d)[11:]}') if str(
d) != "nan" else datetime.fromisoformat("1900-01-01T00:00:00.000")
mappings['COLLEAGUE_ID'] = 'COLLEAGUE_ID', lambda oc: math.nan if oc == "" or oc is None else oc
mapped = etl.fieldmap(wkdy_idm_table, mappings)
| ID | CLASSIFICATION | FIRST_NAME | LAST_NAME | NICKNAME | TITLE | USERNAME | CAMPUS_EMAIL |
|---|---|---|---|---|---|---|---|
| 20009 | Staff | Noone | Nobody | No | Manager of No | nnobody | [email protected] |
| ID | CLASSIFICATION | FIRST_NAME | LAST_NAME | NICKNAME | TITLE | USERNAME | CAMPUS_EMAIL |
|---|---|---|---|---|---|---|---|
| 20009 | Staff | Noone | Nobody | No | Manager of No | nnobody | [email protected] |
The above example looks exactly the same from each table, but these records would be returned by the petl.complement method as if they are different in some way. I can't understand this.
Again, I can't seem to find any real differences between the records that are being returned by the petl.complement method. Can anyone think of why this comparison isn't working properly or if I should just try a different module to do the comparison. If the datatypes are the same, and the tables have the same columns, in the same order, why would I get records that appear to match from each table when they shouldn't be included in the comparison? Thanks for any help.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|
