'How to use a variable in look ahead/after regex in python?
I have a data that I extracted from a pdf to text, and it is in this format :
text =
1
Address line 1
Address line 2
Zipe Code
Phone number
ID number
Date
2
Address line 1
Address line 2
Zipe Code
Phone number
ID number
Date
3
Address line 1
Address line 2
Zipe Code
Phone number
ID number
Date
....
The number 1,2,3 are supposed to be indexes. Now I want to write a loop that put in a data frame for each index, the information that follows it. So the result would be a table like :
| index | Address | ZipCode | PhoneNumber | IdNumber | Date |
|---|
I started writing the code but im stuck at the regex part.. How to put a variable to iterate in the look ahead or after part ? Any solution ?
import re
import pandas as pd
indexes = re.findall(r'(?<=\n)\d{1}(?=\n)', text)
# convert string to integer
for i in range(0, len(indexes)):
indexes[i] = int(indexes[i])
# extract the data
text
indexes
data = {}
for index in indexes:
next_index = index+1
index_value = re.search(r'(?<={index}).*(?={next_index})', text).group()
data[index] = index_value
Thanks!
Solution 1:[1]
The slippery slope of regex: if you think you are going to solve your problem with regex, you may soon have 2 problems.
You data is much easier to parse. Each record occupies 7 lines and you already know which line contains which piece of the data:
first_index = None
data = []
current = {}
for i, line in enumerate(text.split("\n")):
if first_index is None:
first_index = i if re.match("\d+", line) else None
if first_index is None or line == "":
continue
delta = (i - first_index) % 7
if delta == 0:
current["index"] = int(line)
elif delta == 1:
current["Address"] = line
elif delta == 2:
current["Address"] += f" {line}"
elif delta == 3:
current["ZipCode"] = line
elif delta == 4:
current["PhoneNumber"] = line
elif delta == 5:
current["IdNumber"] = line
elif delta == 6:
current["Date"] = line
data.append(current)
current = {}
df = pd.DataFrame(data)
A more panda-y solution, once again relying on the 7-row-per-record structure of your text:
from io import StringIO
col_names = ["index", "Address1", "Address2", "ZipCode", "PhoneNumber", "IdNumber", "Date"]
df = (
pd.read_csv(StringIO(text), header=None)
# pd.read_csv("data.txt", header=None) # alternative: read it directly from the file
.assign(a=lambda x: x.index // 7, b=lambda x: x.index % 7)
.set_index(["a", "b"])
.unstack()
.set_axis(col_names, axis=1)
.rename_axis(None)
)
a is the record number (similar idea to index), b is the position of the line within the record.
The downside of this approach is that all columns are strings and you must manually convert them to the appropriate datatype.
Solution 2:[2]
Combine with string formatting:
rf'(?<={index}).*(?={next_index})'
>>> import re
>>> index=4
>>> next_index=5
>>> re.search(rf'(?<={index}).*(?={next_index})', '...4something5...')
<re.Match object; span=(4, 13), match='something'>
If you end up having to use the braces in your pattern, double them up:
>>> a="a"
>>> re.search(rf'{a}{{3}}', 'aaaa')
<re.Match object; span=(0, 3), match='aaa'>
Solution 3:[3]
Use:
s = pd.read_csv('file.txt', header=None)
is_digit = s[0].str.isdigit()
index = (is_digit & (~is_digit.shift(-1, fill_value=False))).cumsum()
columns = index.groupby(index).cumcount()
new_df = \
s.pivot_table(index=index,
columns=columns,
aggfunc='first', values=0)\
.set_axis(['index', 'Address 1', 'Address 2',
'ZipeCode', 'Phone Number',
'ID Number', 'Date'], axis=1)
new_df = new_df.assign(Address=new_df['Address 1'].str.cat(new_df['Address 2'], ' '))\
.drop(['Address 1', 'Address 2'], axis=1)
print(new_df)
index ZipeCode Phone Number ID Number Date \
0
1 1 Zipe Code Phone number ID number Date
2 2 Zipe Code Phone number ID number Date
3 3 Zipe Code Phone number ID number Date
Address
0
1 Address line 1 Address line 2
2 Address line 1 Address line 2
3 Address line 1 Address line 2
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | |
| Solution 2 | Richard Dodson |
| Solution 3 | ansev |
