Easiest way to ignore or drop one header row from first page, when parsing table spanning several pages
I am parsing a PDF with tabula-py. I need to ignore the first two tables, then parse the rest of the tables as one and export them to a CSV. In the first relevant table (index 2) the first row is a header row, and I want to leave it out of the CSV.
See my code below, including my attempt at dropping the relevant row from the Pandas frame.
What is the easiest/most elegant way of achieving this?
import tabula

tables = tabula.read_pdf('input.pdf', pages='all', multiple_tables=True)
f = open('output.csv', 'w')
# tables[2].drop(index=0) # tried this, but makes no difference
for df in tables[2:]:
    df.to_csv(f, index=False, sep=';')
f.close()
Solution 1:[1]
Given the following toy dataframes:
import pandas as pd
tables = [
    pd.DataFrame([[1, 3], [2, 4]]),
    pd.DataFrame([["a", "b"], [1, 3], [2, 4]]),
]
for table in tables:
    print(table)
# Output
   0  1
0  1  3
1  2  4
   0  1
0  a  b  <<< Unwanted row in tables[1]
1  1  3
2  2  4
You can drop the first row of the second dataframe either by assigning the result of drop() back to the list (the preferable way):
tables[1] = tables[1].drop(index=0)
Or in place:
tables[1].drop(index=0, inplace=True)
And so, in both cases:
print(tables[1])
# Output
   0  1
1  1  3
2  2  4
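Applied to the situation in the question, the commented-out attempt most likely made no difference because drop() returns a new dataframe and the result was never assigned back. Below is a minimal sketch of the whole flow, assuming toy dataframes in place of the list returned by tabula.read_pdf (the data is made up) and an in-memory buffer in place of 'output.csv'. Passing header=False is also an assumption: it suppresses the column-label line, which may or may not be what you want depending on how tabula parsed the tables.

```python
import io

import pandas as pd

# Toy stand-ins for the list returned by tabula.read_pdf(...); the data is made up.
tables = [
    pd.DataFrame([["skip", "me"]]),
    pd.DataFrame([["skip", "too"]]),
    pd.DataFrame([["a", "b"], [1, 3], [2, 4]]),  # first row is the unwanted header row
    pd.DataFrame([[5, 7], [6, 8]]),
]

# drop() returns a new DataFrame, so the result has to be assigned back.
tables[2] = tables[2].drop(index=0)

# Stack the remaining tables into a single frame and renumber the rows.
combined = pd.concat(tables[2:], ignore_index=True)

# Write to an in-memory buffer here; pass a path such as 'output.csv' instead
# to write a real file.
buffer = io.StringIO()
combined.to_csv(buffer, index=False, header=False, sep=";")
csv_text = buffer.getvalue()
print(csv_text)
```

Concatenating first and calling to_csv once also avoids managing the file handle by hand and keeps any header handling in a single place.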
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Laurent |
