Having trouble extracting info from an online PDF
I've been trying to extract data (names, addresses, dates, numbers) from a PDF. I've tried pdfplumber, PyPDF2, camelot, and tabula, but I don't have much experience with any of them. I've managed to extract the data, but my issue is that text belonging to a single cell wraps onto multiple lines in the PDF, and as a result each wrapped line ends up as its own row, with NaN in the other columns. Below is the code from my tabula attempt and the result. I'm hoping someone can give me some advice on which library would be best to use and how to go about this. Thanks!
from tabula import read_pdf
from tabulate import tabulate
import pandas
servicemembers_cases = read_pdf("https://www.mass.gov/doc/servicemember-cases/download",
                                pages=3, output_format="dataframe")
servicemembers_cases
#resulting data frame below
[ 22 SM 000324 02/03/2022 Chicopee 233 Asselin Street \
0 NaN NaN NaN NaN
1 NaN NaN NaN NaN
2 NaN NaN NaN NaN
3 22 SM 000325 02/03/2022 Southbridge 663 North Woodstock Road
4 NaN NaN NaN NaN
5 NaN NaN NaN NaN
6 22 SM 000326 02/03/2022 Plymouth 41 Goldfinch Lane
7 22 SM 000327 02/04/2022 Westfield 31 Bristol Street
8 NaN NaN NaN NaN
9 NaN NaN NaN NaN
10 22 SM 000328 02/04/2022 Wilmington 19 Cedar Street
11 NaN NaN NaN NaN
12 NaN NaN NaN NaN
13 NaN NaN NaN NaN
14 22 SM 000329 02/04/2022 Somerset 341 Lawton Street
15 22 SM 000330 02/04/2022 Norton 13 Ledge Road
16 22 SM 000331 02/04/2022 Worcester 106 Mill Street
17 NaN NaN NaN NaN
18 NaN NaN NaN NaN
19 NaN NaN NaN NaN
20 NaN NaN NaN NaN
21 NaN NaN NaN NaN
22 NaN NaN NaN NaN
23 NaN NaN NaN NaN
24 22 SM 000332 02/04/2022 Brockton 192 Algonquin Street
25 22 SM 000333 02/04/2022 West Springfield 121 Garden Street
26 NaN NaN NaN NaN
27 NaN NaN NaN NaN
28 NaN NaN NaN NaN
29 22 SM 000334 02/04/2022 Abington 422 Tamarack Lane
Longbridge Financial, LLC Sandra D. Daletto also known as
0 NaN Sandra D. Gagnon aka Charlsie
1 NaN E. Gagnon aka Charlsie
2 NaN Elizabeth Daletto et al
3 TIAA, FSB d/b/a TIAA Bank Heirs, Devisees And Legal
4 f/k/a EverBank Representatives Ofthe Estate Of
5 NaN Patricia A. Rondeau et al
6 LoanCare, LLC Amy Lynne Ten Berge et al
7 Freedom Mortgage Corporation Heirs, Devisees And Legal
8 NaN Representatives Of The Estate Of
9 NaN Robert Larry Brueno et al
10 Wells Fargo Bank, N.A. Montana Rose Cole, Individually
11 NaN and as Personal Representative
12 NaN of the Estate of Julie C. Foshay
13 NaN et al
14 LoanCare, LLC Tyler J. Root
15 Lakeview Loan Servicing, Llc Randy D. Sawmiller
16 The Bank Of New York Mellon Richard R. Beaupre
17 Fka The Bank Of New York, As NaN
18 Trustee For The NaN
19 Certificateholders Of The Cwalt, NaN
20 Inc. Alternative Loan Trust NaN
21 2007-16cb Mortgage NaN
22 Pass-Through Certificates, NaN
23 Series 2007-16cb NaN
24 M & T Bank Rosalee E. Robinson
25 Pnc Bank National Association, Marcia A. Printz
26 Successor By Merger To NaN
27 National City Mortgage, A NaN
28 Division Of National City Bank NaN
29 Lakeview Loan Servicing, Llc Kimberly Laidlaw ]
pandas.DataFrame.from_records(servicemembers_cases)
#result of trying to use pandas on data frame below (it only shows one row)
0 22 SM 000324 02/03/2022 Chicopee 233 Asselin Street Longbridge Financial, LLC Sandra D. Daletto also known as
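One possible way to deal with the wrapped lines, sketched here under some assumptions: in the output above, a new record seems to start wherever the case-number column is filled in, and the NaN rows after it look like continuations of that record. If that holds, the continuation rows can be folded back into the row above them with plain pandas. The function name `merge_wrapped_rows` and the column labels in the sample are placeholders invented for this illustration (the real PDF headers are not shown above); the sample rows are copied from the output.

```python
import pandas as pd


def merge_wrapped_rows(df: pd.DataFrame, key_col) -> pd.DataFrame:
    """Fold continuation rows into the record that precedes them.

    Assumption: a row starts a new record when `key_col` is non-null, and
    every following row with a null key is wrapped text of that record.
    """
    # Running record number: increases by one at each non-null key.
    record_id = df[key_col].notna().cumsum().to_numpy()

    def join_cell_lines(lines: pd.Series) -> str:
        # Join the non-null fragments of one cell in their original order.
        return " ".join(str(value) for value in lines.dropna())

    merged = df.groupby(record_id).agg(join_cell_lines)
    return merged.reset_index(drop=True)


# Rows 6-9 of the output above; the column labels are hypothetical.
sample = pd.DataFrame(
    [
        ["22 SM 000326", "02/03/2022", "Plymouth", "41 Goldfinch Lane",
         "LoanCare, LLC", "Amy Lynne Ten Berge et al"],
        ["22 SM 000327", "02/04/2022", "Westfield", "31 Bristol Street",
         "Freedom Mortgage Corporation", "Heirs, Devisees And Legal"],
        [None, None, None, None, None, "Representatives Of The Estate Of"],
        [None, None, None, None, None, "Robert Larry Brueno et al"],
    ],
    columns=["case_no", "date", "town", "address", "party_a", "party_b"],
)

merged = merge_wrapped_rows(sample, "case_no")
print(merged.to_string())
```

Two notes on applying this to the tabula result. The printed output is wrapped in `[ ... ]`, which suggests `read_pdf` returned a list of data frames here, so the table itself would be `servicemembers_cases[0]`; that would also explain why `pandas.DataFrame.from_records(servicemembers_cases)` shows a single row. The first record also appears to have been consumed as the column header; passing `pandas_options={"header": None}` to `read_pdf` should keep it as data, in which case `key_col` would be `0`.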
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
