Slow processing of Python list

I have a file with around 440K lines of data. I need to read this data and find the actual "table" in the text file. Part of the text file looks like this:

[BEGIN] 2022/4/8 14:00:05
<Z0301IPBBPE03>screen-length 0 temporary                          
Info: The configuration takes effect on the current user terminal interface only.
<Z0301IPBBPE03>display bgp vpnv4 vpn-instance Charging_VRF routing-table
 
 BGP Local router ID is 10.12.24.19
 Status codes: * - valid, > - best, d - damped, x - best external, a - add path,
               h - history,  i - internal, s - suppressed, S - Stale
               Origin : i - IGP, e - EGP, ? - incomplete
 RPKI validation codes: V - valid, I - invalid, N - not-found

    
 VPN-Instance Charging_VRF, Router ID 10.12.24.19:

 Total Number of Routes: 2479
        Network            NextHop                       MED        LocPrf    PrefVal Path/Ogn

 *>i    10.0.19.0/24       10.12.8.21                     0          100        300     ?
 * i                       10.12.8.22                     0          100        0       ?
 *>i    10.0.143.0/24      10.12.8.21                     0          100        300     ?
 * i                       10.12.8.22                     0          100        0       ?
 *>i    10.0.144.128/25    10.12.8.21                     0          100        300     ?
 * i                       10.12.8.22                     0          100        0       ?
 *>i    10.0.148.80/32     10.12.8.21                     0          100        300     ?
 * i                       10.12.8.22                     0          100        0       ?
 *>i    10.0.148.81/32     10.12.8.21                     0          100        300     ?
 * i                       10.12.8.22                     0          100        0       ?
 *>i    10.0.201.16/28     10.12.8.21                     0          100        300     ?
 * i                       10.12.8.22                     0          100        0       ?
 *>i    10.0.201.64/29     10.12.8.21                     0          100        300     ?
 * i                       10.12.8.22                     0          100        0       ?
 *>i    10.0.201.94/32     10.12.8.21                     0          100        300     ?
 * i                       10.12.8.22                     0          100        0       ?
...
<Z0301IPBBPE03>display bgp vpnv4 vpn-instance Gb_VRF routing-table
 
 BGP Local router ID is 10.12.24.19
 Status codes: * - valid, > - best, d - damped, x - best external, a - add path,
               h - history,  i - internal, s - suppressed, S - Stale
               Origin : i - IGP, e - EGP, ? - incomplete
 RPKI validation codes: V - valid, I - invalid, N - not-found

    
 VPN-Instance Gb_VRF, Router ID 10.12.24.19:

 Total Number of Routes: 1911
        Network            NextHop                       MED        LocPrf    PrefVal Path/Ogn

 *>i    10.1.133.192/30    10.12.8.63                     0          100        300     ?
 * i                       10.12.8.63                     0          100        0       ?
 *>i    10.1.133.216/30    10.12.8.64                     0          100        300     ?
 * i                       10.12.8.64                     0          100        0       ?
 *>i    10.1.160.248/29    10.12.40.7                     0          100        300     ?
 * i                       10.12.40.7                     0          100        0       ?
 *>i    10.1.161.0/29      10.12.40.8                     0          100        300     ?
 * i                       10.12.40.8                     0          100        0       ?
 *>i    10.1.161.248/32    10.12.40.7                     2          100        300     ?
 * i                       10.12.40.7                     2          100        0       ?
 *>i    10.1.161.249/32    10.12.40.7                     2          100        300     ?
 * i                       10.12.40.7                     2          100        0       ?
 *>i    10.1.164.248/29    10.12.40.7                     0          100        300     ?
 * i                       10.12.40.7                     0          100        0       ?
 *>i    10.1.165.0/29      10.12.40.8                     0          100        300     ?
 * i                       10.12.40.8                     0          100        0       ?
 *>i    10.1.165.248/32    10.12.40.7                     2          100        300     ?
 * i                       10.12.40.7                     2          100        0       ?

The text file is long and contains plenty of garbage lines that I do not want, so I look for the keyword (display bgp vpnv4 vpn-instance) and start reading once I find it. The code below does this and then converts the table into a dataframe.

My problem is that reading these 440K lines and converting them into a dataframe takes almost half an hour. Is there a more efficient way to do this? Thank you!

import ipaddress
import pandas as pd

bgp_df = pd.DataFrame()
vrf_list = ['Charging_VRF', 'Gb_VRF', 'Gn_VRF']

def generate_bgp_network_list(block, vrf):
    global bgp_df  # this function rebinds the module-level dataframe
    ip_address_list = block.split('\n')
    ip_addresses = [[address for address in ip_address.strip().split(' ') if address] for ip_address in ip_address_list if ip_address] # generate list of lines
    ip_addresses = [address for address in ip_addresses if len(address) > 0]        # remove empty list
    # validate_ipaddress is defined elsewhere (not shown)
    ip_addresses = [(ipaddress.IPv4Network(ip_address[1], False), ip_address[-1]) for ip_address in ip_addresses if validate_ipaddress(ip_address[1])]

    bgp_data = [{'ip_network': address, 'vrf': vrf, 'as_number': as_number} for address, as_number in ip_addresses]
    bgp_df = pd.concat([bgp_df, pd.DataFrame(bgp_data)], ignore_index=True)

def read_bgp_file(file):
    if file == '':
        return


    file = open(file, encoding=get_encoding_type(file))
    lines = file.readlines()
    start = False
    block = ''
    lines = iter(lines)
    for line in lines:
        if '<' in line and len(block) > 0:
            generate_bgp_network_list(block, vrf)
            start = False
            block = ''
        if f'display bgp vpnv4 vpn-instance' in line:
            vrf = line.strip().split(' ')[-2]
            if vrf in vrf_list:
                start = True
        if start:
            block += line


Solution 1:[1]

It looks to me like you only need the lines starting with *>i. If that is the case, how about this simple approach:

import pandas as pd

def input_file_to_dataframe(file_name: str):
    result = []
    prefix = '*>i'
    
    with open(file_name, "r") as file:
        lines = file.readlines()
        
        for line in lines:
            line = line.strip()
            if line.startswith(prefix):
                line = line.replace(prefix, '').split()
                result.append(line)

    return pd.DataFrame(data=result)

Run with ~50k lines:

input_file_to_dataframe('file.txt')
# 46.3 ms ± 3.02 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
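
Building on the same prefix filter, here is a minimal sketch that also keeps the VRF name the question asks for. It assumes, as above, that only `*>i` lines matter, that the VRF name is the first token after `display bgp vpnv4 vpn-instance`, and that any other line starting with `<` ends the current table. The names `parse_bgp_routes`, `COMMAND` and `VRF_NAMES` are made up for illustration; the dictionary keys are copied from the question.

```python
import ipaddress

COMMAND = 'display bgp vpnv4 vpn-instance'
VRF_NAMES = {'Charging_VRF', 'Gb_VRF', 'Gn_VRF'}  # VRFs listed in the question


def parse_bgp_routes(lines, vrf_names=VRF_NAMES):
    """Return one dict per '*>i' route line found inside a wanted VRF table.

    `lines` can be any iterable of strings, including an open file object.
    """
    rows = []
    current_vrf = None  # None means "not inside a table we care about"
    for line in lines:
        if COMMAND in line:
            # Assumed format: '<host>display bgp vpnv4 vpn-instance NAME routing-table'
            rest = line.split(COMMAND, 1)[1].split()
            name = rest[0] if rest else None
            current_vrf = name if name in vrf_names else None
            continue
        if line.lstrip().startswith('<'):
            # Assumption: any other command prompt ends the current table.
            current_vrf = None
            continue
        if current_vrf is None:
            continue
        fields = line.split()
        if len(fields) < 3 or fields[0] != '*>i':
            continue
        try:
            network = ipaddress.ip_network(fields[1], strict=False)
        except ValueError:
            continue  # second column is not a network, skip the line
        rows.append({'ip_network': network,
                     'vrf': current_vrf,
                     'as_number': fields[-1]})
    return rows
```

The rows can then be turned into a dataframe in a single call, for example `pd.DataFrame(parse_bgp_routes(the_file))` with an open file object, so the dataframe is built once rather than grown block by block.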

Solution 2:[2]

To me, readlines is the major issue here, because it loads all lines into memory at once.

If you iterate directly over the file instead, it is read line by line, which I expect to be faster:

with open(file_name, "r") as the_file:
    for line in the_file:
        ...  # process each line here
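
To make the pattern concrete, here is a small self-contained sketch that iterates over the file object directly. The function name `count_best_routes` and the default prefix are assumptions for illustration only. The certain benefit of this form is lower peak memory, since only one line is held at a time; whether it is also noticeably faster for a given file would need to be measured.

```python
def count_best_routes(file_name, prefix='*>i'):
    """Count lines that start with `prefix`, reading one line at a time."""
    count = 0
    with open(file_name, 'r') as the_file:
        # A text file object is its own iterator and yields lines lazily,
        # so the whole file is never held in a list.
        for line in the_file:
            if line.lstrip().startswith(prefix):
                count += 1
    return count
```

The same loop body could append parsed fields to a list instead of counting, as in Solution 1.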

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution      Source
Solution 1    Rafaó
Solution 2    Floh