'Regex for bank transaction parsing
How do I parse and extract the 4 important columns from a text table of the following format? These are bank transaction line items extracted from a PDF using Ruby's pdf-reader package - as you can see the spacing between columns is very irregular between various columns.
11/4 Stripe Transfer St-XYZ Agnostic Computers 582.30
11/4 Recurring Payment authorized on 11/01 Digitalocean.Com 12.00
11/4 Purchase authorized on 11/01 Google *Gsuite_Get 24.00
11/4 Purchase authorized on 11/02 Amazon Web Service 460.15
11/4 Purchase authorized on 11/02 Amazon Web Service 8.07 2,903.09
11/5 Recurring Payment authorized on 11/03 Atlassian 15.00 2,888.09
11/6 Recurring Payment authorized on 11/04 Pipedrive Inc NY NY 24.00 2,864.09
11/12 Foobar Retail Dis 211011 ABCDEFGH 8,031.44
11/12 Wire Trans Svc Charge - Sequence: 999999999999 Srf# 45.00
11/12 WT 211012-999999 ABCD Bank Limited /Bnf=FOOBARINC 5,000.00 5,850.53
11/14 Purchase authorized on 11/13 Microconf Microconf.Com MN 100.00 5,702.53
The above transactions are extracted from a bank PDF with the following visual layout

Need to parse the bold colums via a regexp:
- Date - dd/mm format - always present
- Check number - always empty and may be ignored (alphanumeric single word?)
- Description - Text with dates, numbers, special characters - always present
- Credits - currency amount (only for deposits)
- Debits - currency amount (only for payments)
- Balance - currency amount (appears sporadically, not important)
I have only been able to get as far as /^(\d{1,2}\/\d{1,2})\s+/mg for extracting the mm/dd. Should I start chomping the amounts from the right, but then there are no clear seperator patterns!
Solution 1:[1]
Presuming your target is as csv/spreadsheet entries

It is best to tackle the task in stages, and my preferred target format is CSV for a spreadsheet
TL;DR see last comment
11/4 Stripe Transfer St-XYZ Agnostic Computers 582.30
11/4 Recurring Payment authorized on 11/01 Digitalocean.Com 12.00
11/4 Purchase authorized on 11/01 Google *Gsuite_Get 24.00
11/4 Purchase authorized on 11/02 Amazon Web Service 460.15
11/4 Purchase authorized on 11/02 Amazon Web Service 8.07 2,903.09
11/5 Recurring Payment authorized on 11/03 Atlassian 15.00 2,888.09
11/6 Recurring Payment authorized on 11/04 Pipedrive Inc NY NY 24.00 2,864.09
11/12 Foobar Retail Dis 211011 ABCDEFGH 8,031.44
11/12 Wire Trans Svc Charge - Sequence: 999999999999 Srf# 45.00
11/12 WT 211012-999999 ABCD Bank Limited /Bnf=FOOBARINC 5,000.00 5,850.53
11/14 Purchase authorized on 11/13 Microconf Microconf.Com MN 100.00 5,702.53
1st we can target the bigger gaps so pick a suitable width, don't worry about the staggers they will be resolved later.
to become ??.??
We either need to protect existing commas so replace those with another unused symbol say ~ or for currency best to delete them from between numbers.
Replace all line ends with dummy extensions, it will not matter if there are too many columns if not digits so use
??.?? ??.?? (yes in this case we assume under 1000 and cannot use , # or *)
thus 11/4 Stripe Transfer St-XYZ Agnostic Computers 582.30
becomes 11/4 Stripe Transfer St-XYZ Agnostic Computers 582.30 ??.?? ??.??
11/4 Stripe Transfer St-XYZ Agnostic Computers 582.30 ??.?? ??.??
11/4 Recurring Payment authorized on 11/01 Digitalocean.Com ??.?? 12.00 ??.?? ??.??
11/4 Purchase authorized on 11/01 Google *Gsuite_Get ??.?? 24.00 ??.?? ??.??
11/4 Purchase authorized on 11/02 Amazon Web Service ??.?? 460.15 ??.?? ??.??
11/4 Purchase authorized on 11/02 Amazon Web Service ??.?? 8.07 2903.09 ??.?? ??.??
11/5 Recurring Payment authorized on 11/03 Atlassian ??.?? 15.00 2888.09 ??.?? ??.??
11/6 Recurring Payment authorized on 11/04 Pipedrive Inc NY NY ??.?? 24.00 2864.09 ??.?? ??.??
11/12 Foobar Retail Dis 211011 ABCDEFGH 8031.44 ??.?? ??.??
11/12 Wire Trans Svc Charge - Sequence: 999999999999 Srf# ??.?? 45.00 ??.?? ??.??
11/12 WT 211012-999999 ABCD Bank Limited /Bnf=FOOBARINC ??.?? 5000.00 5850.53 ??.?? ??.??
11/14 Purchase authorized on 11/13 Microconf Microconf.Com MN ??.?? 100.00 5702.53 ??.?? ??.??
Now we can target the remaining irregular white space so replace all larger spaces with 2 or 3 spaces as appropriate (usually 2 will do, but beware any description with double spaces.)
11/4 Stripe Transfer St-XYZ Agnostic Computers 582.30 ??.?? ??.??
11/4 Recurring Payment authorized on 11/01 Digitalocean.Com ??.?? 12.00 ??.?? ??.??
11/4 Purchase authorized on 11/01 Google *Gsuite_Get ??.?? 24.00 ??.?? ??.??
11/4 Purchase authorized on 11/02 Amazon Web Service ??.?? 460.15 ??.?? ??.??
11/4 Purchase authorized on 11/02 Amazon Web Service ??.?? 8.07 2903.09 ??.?? ??.??
11/5 Recurring Payment authorized on 11/03 Atlassian ??.?? 15.00 2888.09 ??.?? ??.??
11/6 Recurring Payment authorized on 11/04 Pipedrive Inc NY NY ??.?? 24.00 2864.09 ??.?? ??.??
11/12 Foobar Retail Dis 211011 ABCDEFGH 8031.44 ??.?? ??.??
11/12 Wire Trans Svc Charge - Sequence: 999999999999 Srf# ??.?? 45.00 ??.?? ??.??
11/12 WT 211012-999999 ABCD Bank Limited /Bnf=FOOBARINC ??.?? 5000.00 5850.53 ??.?? ??.??
11/14 Purchase authorized on 11/13 Microconf Microconf.Com MN ??.?? 100.00 5702.53 ??.?? ??.??
Finally add headers, replace with comma separators and remove the ??.??
Date,Description,Credits,Debits,Balance,,,
11/4,Stripe Transfer St-XYZ Agnostic Computers,582.30,,
11/4,Recurring Payment authorized on 11/01 Digitalocean.Com,,12.00,,
11/4,Purchase authorized on 11/01 Google *Gsuite_Get,,24.00,,
11/4,Purchase authorized on 11/02 Amazon Web Service,,460.15,,
11/4,Purchase authorized on 11/02 Amazon Web Service,,8.07,2903.09,,
11/5,Recurring Payment authorized on 11/03 Atlassian,,15.00,2888.09,,
11/6,Recurring Payment authorized on 11/04 Pipedrive Inc NY NY,,24.00,2864.09,,
11/12,Foobar Retail Dis 211011 ABCDEFGH,8031.44,,
11/12,Wire Trans Svc Charge - Sequence: 999999999999 Srf#,,45.00,,
11/12,WT 211012-999999 ABCD Bank Limited /Bnf=FOOBARINC,,5000.00,5850.53,,
11/14,Purchase authorized on 11/13 Microconf Microconf.Com MN,,100.00,5702.53,,
On import to the spreadsheet the headers and possibly currency need a style.
In hindsight I realised all you need to do is
- Remove commas
- Inject a dummy column 3 in big white spaces (even a single ~)
- Pare down spaces to 2x space then replace those 2 spaces with comma
- Remove dummy entry e.g. the ~
- Add header
Date,Description,Credits,Debits,Balance
The rest will take care of itself.
Solution 2:[2]
TL;DR
Your main problem is that if you're dealing with string data after you've parsed it from the PDF, so it's difficult to determine which positional elements correspond with which field. You really ought to open a separate question about how to solve this at PDF parse time, rather than trying to parse the text after the PDF parsing phase. That said, below is a solution that works with the limited sample you provided, and that should at least get you started with the string parsing you're attempting to do.
Assumptions and Examples
From your example, it looks like there are some implicit business rules to your format:
- Some fields are always present (e.g. date and description).
- Only one debit or credit will be be on each line.
- Each line will have a maximum of 4/5 populated fields.
However, even if "balance" isn't important, you can't really tell whether something is a credit or a debit without reference to some existing balance or clearly-defined number of spaces in the parsed output, so you either need to fix your input data or PDF parsing to ensure you always have a balance (which you could possibly calculate at PDF parsing time) or ensure you know the specific field widths that are in the PDF layout or the parsed output from the PDF.
While only a partial solution which you will need to update for your real use cases, you could create a Struct or other object to hold your data, and then make additional parsing decisions based on how many fields or spaces between fields that each transaction holds. A potential solution follows.
Example Using Your String from the PDF Parse
NB: The code examples below have been aggressively wrapped to 60 characters whenever it won't affect the results to reduce side-scrolling within StackOverflow code blocks. Feel free to reflow the code to suit your own style choices.
We'll start by storing the parsed text you provided in your original post in a here-document for exercising the rest of this code example.
text_extracted_from_pdf = <<~'EXTRACTED_TEXT'
11/4 Stripe Transfer St-XYZ Agnostic Computers 582.30
11/4 Recurring Payment authorized on 11/01 Digitalocean.Com 12.00
11/4 Purchase authorized on 11/01 Google *Gsuite_Get 24.00
11/4 Purchase authorized on 11/02 Amazon Web Service 460.15
11/4 Purchase authorized on 11/02 Amazon Web Service 8.07 2,903.09
11/5 Recurring Payment authorized on 11/03 Atlassian 15.00 2,888.09
11/6 Recurring Payment authorized on 11/04 Pipedrive Inc NY NY 24.00 2,864.09
11/12 Foobar Retail Dis 211011 ABCDEFGH 8,031.44
11/12 Wire Trans Svc Charge - Sequence: 999999999999 Srf# 45.00
11/12 WT 211012-999999 ABCD Bank Limited /Bnf=FOOBARINC 5,000.00 5,850.53
11/14 Purchase authorized on 11/13 Microconf Microconf.Com MN 100.00 5,702.53
EXTRACTED_TEXT
We'll also define some constants that we'll use to parse your extracted text after the PDF parse, and a Struct class to hold the results of parsing each line of text. You may need to adjust these based on your real data.
# This describes what a currency item looks like after your
# PDF parse.
MONEY_FMT = /\b[\d,]+\.\d{2}\b/
# Make some assumptions about fixed-width fields. These
# values seem reliable given the sample string data from
# your original post.
LN_START_TO_LAST_CRED_CHR = /^.{92}\.\d{2}$?/
LN_START_TO_END_OF_DEBIT = /^.{93,}#{MONEY_FMT}$?/
Transaction = Struct.new(:date, :description, :credit,
:debit, :balance, keyword_init:
true)
Now we read the output from the PDF parse to try to analyze the resulting string. Using Ruby 3.1.1, and with code aggressively wrapped to minimize side-scrolling on StackOverflow:
transactions = []
text_extracted_from_pdf.each_line do
fields = _1.split /\s{2,}/
date, description = fields.shift 2
balance = fields.pop.chomp if fields.count == 2
# This violates our rule of 4/5 populated fields.
raise "too many fields remaining: #{fields.count}" unless
fields.count == 1
# Match on characters from start of line to end of credit.
credit =
fields.pop.chomp if _1.match? LN_START_TO_LAST_CRED_CHR
# Match on characters from start of line to end of debit.
debit =
fields.pop.chomp if _1.match? LN_START_TO_END_OF_DEBIT
transactions << Transaction.new({date: date, description:
description, credit:
credit, debit: debit,
balance: balance})
end
Expected Result
The transactions array should now hold a collection of Transaction objects which you can iterate over as needed. For example, the example code above populates the transactions Array with the following Struct objects:
transactions
#=>
[#<struct Transaction date="11/4", description="Stripe Transfer St-XYZ Agnostic Computers", credit="582.30", debit=nil, balance=nil>,
#<struct Transaction date="11/4", description="Recurring Payment authorized on 11/01 Digitalocean.Com", credit=nil, debit="12.00", balance=nil>,
#<struct Transaction date="11/4", description="Purchase authorized on 11/01 Google *Gsuite_Get", credit=nil, debit="24.00", balance=nil>,
#<struct Transaction date="11/4", description="Purchase authorized on 11/02 Amazon Web Service", credit=nil, debit="460.15", balance=nil>,
#<struct Transaction date="11/4", description="Purchase authorized on 11/02 Amazon Web Service", credit=nil, debit="8.07", balance="2,903.09">,
#<struct Transaction date="11/5", description="Recurring Payment authorized on 11/03 Atlassian", credit=nil, debit="15.00", balance="2,888.09">,
#<struct Transaction date="11/6", description="Recurring Payment authorized on 11/04 Pipedrive Inc NY NY", credit=nil, debit="24.00", balance="2,864.09">,
#<struct Transaction date="11/12", description="Foobar Retail Dis 211011 ABCDEFGH", credit="8,031.44", debit=nil, balance=nil>,
#<struct Transaction date="11/12", description="Wire Trans Svc Charge - Sequence: 999999999999 Srf#", credit=nil, debit="45.00", balance=nil>,
#<struct Transaction date="11/12", description="WT 211012-999999 ABCD Bank Limited /Bnf=FOOBARINC", credit=nil, debit="5,000.00", balance="5,850.53">,
#<struct Transaction date="11/14", description="Purchase authorized on 11/13 Microconf Microconf.Com MN", credit=nil, debit="100.00", balance="5,702.53">]
Validate Your String Parse
A lot of things can go wrong when people make assumptions about either formatting or their code's logic. If you want to validate your Struct objects, you can iterate over the collection to identify bad parses, or you could choose to log, warn, or raise an exception inside your parsing loop above.
# If you have parsed both a credit and a debit on the same line,
# something's wrong.
transactions.map do
warn "bad parse for #{_1}" if _1.credit && _1.debit
end.compact!
#=> []
Instead of simply raising a warning here, you could also use Array#reject! to remove items directly from transactions which didn't parse properly, assuming you don't simply skip adding them to the collection in the first place within the #each_line loop above. How you choose to identify and handle a bad parse is really up to you; this is just one of many approaches, and is meant to illustrate that you need to validate the results of each PDF or string parse somewhere in your code.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | |
| Solution 2 |
