How to use a variable in a lookahead/lookbehind regex in Python?

I have data that I extracted from a PDF to text, and it is in this format:

text =
1
Address line 1
Address line 2
Zipe Code
Phone number
ID number
Date
2
Address line 1
Address line 2
Zipe Code
Phone number
ID number
Date
3
Address line 1
Address line 2
Zipe Code
Phone number
ID number
Date
....

The numbers 1, 2, 3 are supposed to be indexes. Now I want to write a loop that puts, for each index, the information that follows it into a data frame. So the result would be a table like:

index Address ZipCode PhoneNumber IdNumber Date

I started writing the code but I'm stuck at the regex part. How can I put a variable that changes on each iteration into the lookahead or lookbehind part? Any solution?

import re
import pandas as pd
indexes = re.findall(r'(?<=\n)\d{1}(?=\n)', text)
# convert string to integer
for i in range(0, len(indexes)):
    indexes[i] = int(indexes[i])
# extract the data
text
indexes
data = {}
for index in indexes: 
    next_index = index+1 
    index_value = re.search(r'(?<={index}).*(?={next_index})', text).group()
    data[index] = index_value

Thanks!



Solution 1:[1]

The slippery slope of regex: if you think you are going to solve your problem with regex, you may soon have 2 problems.

Your data is much easier to parse than that. Each record occupies 7 lines and you already know which line contains which piece of the data:

first_index = None
data = []
current = {}

for i, line in enumerate(text.split("\n")):
    if first_index is None:
        # raw string so that \d is not treated as an invalid string escape
        first_index = i if re.match(r"\d+", line) else None

    if first_index is None or line == "":
        continue

    delta = (i - first_index) % 7
    if delta == 0:
        current["index"] = int(line)
    elif delta == 1:
        current["Address"] = line
    elif delta == 2:
        current["Address"] += f" {line}"
    elif delta == 3:
        current["ZipCode"] = line
    elif delta == 4:
        current["PhoneNumber"] = line
    elif delta == 5:
        current["IdNumber"] = line
    elif delta == 6:
        current["Date"] = line
        data.append(current)
        current = {}

df = pd.DataFrame(data)

A more panda-y solution, once again relying on the 7-row-per-record structure of your text:

from io import StringIO

col_names = ["index", "Address1", "Address2", "ZipCode", "PhoneNumber", "IdNumber", "Date"]
df = (
    pd.read_csv(StringIO(text), header=None)
    # pd.read_csv("data.txt", header=None)   # alternative: read it directly from the file
    .assign(a=lambda x: x.index // 7, b=lambda x: x.index % 7)
    .set_index(["a", "b"])
    .unstack()
    .set_axis(col_names, axis=1)
    .rename_axis(None)
)

a is the record number (similar idea to index), b is the position of the line within the record.

The downside of this approach is that all columns are strings and you must manually convert them to the appropriate datatype.
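As a rough sketch of that manual conversion: the frame below is a hypothetical stand-in for the result above, with made-up zip codes and dates in an assumed YYYY-MM-DD format (the question does not show the real values). Adjust the format string to whatever your data actually uses.

```python
import pandas as pd

# Hypothetical stand-in for the string-only frame; values are invented.
df = pd.DataFrame({
    "index": ["1", "2"],
    "ZipCode": ["75001", "69002"],
    "Date": ["2021-01-15", "2021-02-20"],
})

typed = df.assign(
    # the record number becomes an integer
    index=lambda d: d["index"].astype(int),
    # assumed date format; change it to match the real data
    Date=lambda d: pd.to_datetime(d["Date"], format="%Y-%m-%d"),
)
# ZipCode is left as a string on purpose, so leading zeros are not lost.
```

Identifier-like columns such as zip codes, phone numbers and ID numbers are usually safer kept as strings.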

Solution 2:[2]

Combine the raw string with an f-string (the rf prefix):

rf'(?<={index}).*(?={next_index})'

>>> import re
>>> index=4
>>> next_index=5
>>> re.search(rf'(?<={index}).*(?={next_index})', '...4something5...')
<re.Match object; span=(4, 13), match='something'>

If you end up needing literal braces in your pattern (for example, a {3} quantifier), double them up:

>>> a="a"
>>> re.search(rf'{a}{{3}}', 'aaaa')
<re.Match object; span=(0, 3), match='aaa'>
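Applied to the multi-line text from the question, the pattern needs a little more care: by default . does not match a newline, and a bare digit can also occur inside a phone number or an ID. The following is only a sketch under assumptions. The sample text and the helper name extract_record are made up for illustration, and it assumes every index sits alone on its own line.

```python
import re

# Hypothetical two-record sample in the question's layout.
text = (
    "\n1\nAddress line 1\nAddress line 2\nZipe Code\nPhone number\nID number\nDate"
    "\n2\nAddress line 1\nAddress line 2\nZipe Code\nPhone number\nID number\nDate\n"
)

def extract_record(text, index):
    """Return the lines that follow `index`, up to the next index line."""
    start = re.escape(str(index))
    end = re.escape(str(index + 1))
    # re.MULTILINE makes ^ and $ match at line boundaries, so the index
    # must fill a whole line; re.DOTALL lets . cross newlines; the lazy
    # .*? stops at the next index line or at the end of the text (\Z).
    pattern = rf"^{start}\n(.*?)(?=^{end}$|\Z)"
    match = re.search(pattern, text, flags=re.MULTILINE | re.DOTALL)
    return match.group(1).strip().split("\n") if match else None

records = {i: extract_record(text, i) for i in (1, 2)}
```

re.escape is not strictly needed for plain integers, but it keeps the pattern safe if the interpolated value ever contains regex metacharacters. A field that consists only of the next index number would still end the match early, which is one reason the fixed 7-line structure is the more robust thing to rely on.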

Solution 3:[3]

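Use pandas: a line of digits followed by a non-digit line marks the start of a record, cumsum numbers the records, and pivot_table turns each record into a row: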

s = pd.read_csv('file.txt', header=None)
is_digit = s[0].str.isdigit()
index = (is_digit & (~is_digit.shift(-1, fill_value=False))).cumsum()
columns = index.groupby(index).cumcount()

new_df = \
s.pivot_table(index=index, 
              columns=columns, 
              aggfunc='first', values=0)\
.set_axis(['index', 'Address 1', 'Address 2',
           'ZipeCode', 'Phone Number', 
           'ID Number', 'Date'], axis=1)

new_df = new_df.assign(Address=new_df['Address 1'].str.cat(new_df['Address 2'], ' '))\
    .drop(['Address 1', 'Address 2'], axis=1)
print(new_df)

  index   ZipeCode  Phone Number  ID Number  Date  \
0                                                   
1     1  Zipe Code  Phone number  ID number  Date   
2     2  Zipe Code  Phone number  ID number  Date   
3     3  Zipe Code  Phone number  ID number  Date  

                         Address  
0                                 
1  Address line 1 Address line 2  
2  Address line 1 Address line 2  
3  Address line 1 Address line 2
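The grouping step in Solution 3 may be easier to follow in isolation. This sketch rebuilds only the record_id and position helpers on a hypothetical two-record sample. The names lines, starts, record_id and position are chosen here for illustration and are not from the answer.

```python
import pandas as pd

# Hypothetical two-record sample in the question's layout.
lines = [
    "1", "Address line 1", "Address line 2", "Zipe Code",
    "Phone number", "ID number", "Date",
    "2", "Address line 1", "Address line 2", "Zipe Code",
    "Phone number", "ID number", "Date",
]
s = pd.Series(lines)

is_digit = s.str.isdigit()
# A record starts on an all-digit line whose next line is not all digits.
starts = is_digit & ~is_digit.shift(-1, fill_value=False)
# The running count of starts gives every line its record number.
record_id = starts.cumsum()
# The position of each line inside its record (0 to 6) becomes the column.
position = record_id.groupby(record_id).cumcount()
```

Note that this rule is a heuristic: with real data, an all-digit field (say, an ID number) that is followed by a non-digit line would also be flagged as a record start, so check it against your actual values.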

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1
Solution 2 Richard Dodson
Solution 3 ansev