Slow processing of Python list
I have a file with around 440K lines of data. I need to read this data and find the actual "table" sections in the text file. Part of the file looks like this:
```
[BEGIN] 2022/4/8 14:00:05
<Z0301IPBBPE03>screen-length 0 temporary
Info: The configuration takes effect on the current user terminal interface only.
<Z0301IPBBPE03>display bgp vpnv4 vpn-instance Charging_VRF routing-table
BGP Local router ID is 10.12.24.19
Status codes: * - valid, > - best, d - damped, x - best external, a - add path,
              h - history, i - internal, s - suppressed, S - Stale
Origin : i - IGP, e - EGP, ? - incomplete
RPKI validation codes: V - valid, I - invalid, N - not-found
VPN-Instance Charging_VRF, Router ID 10.12.24.19:
Total Number of Routes: 2479
     Network            NextHop         MED        LocPrf    PrefVal Path/Ogn
 *>i 10.0.19.0/24       10.12.8.21      0          100       300     ?
 * i                    10.12.8.22      0          100       0       ?
 *>i 10.0.143.0/24      10.12.8.21      0          100       300     ?
 * i                    10.12.8.22      0          100       0       ?
 *>i 10.0.144.128/25    10.12.8.21      0          100       300     ?
 * i                    10.12.8.22      0          100       0       ?
 *>i 10.0.148.80/32     10.12.8.21      0          100       300     ?
 * i                    10.12.8.22      0          100       0       ?
 *>i 10.0.148.81/32     10.12.8.21      0          100       300     ?
 * i                    10.12.8.22      0          100       0       ?
 *>i 10.0.201.16/28     10.12.8.21      0          100       300     ?
 * i                    10.12.8.22      0          100       0       ?
 *>i 10.0.201.64/29     10.12.8.21      0          100       300     ?
 * i                    10.12.8.22      0          100       0       ?
 *>i 10.0.201.94/32     10.12.8.21      0          100       300     ?
 * i                    10.12.8.22      0          100       0       ?
 ...
<Z0301IPBBPE03>display bgp vpnv4 vpn-instance Gb_VRF routing-table
BGP Local router ID is 10.12.24.19
Status codes: * - valid, > - best, d - damped, x - best external, a - add path,
              h - history, i - internal, s - suppressed, S - Stale
Origin : i - IGP, e - EGP, ? - incomplete
RPKI validation codes: V - valid, I - invalid, N - not-found
VPN-Instance Gb_VRF, Router ID 10.12.24.19:
Total Number of Routes: 1911
     Network            NextHop         MED        LocPrf    PrefVal Path/Ogn
 *>i 10.1.133.192/30    10.12.8.63      0          100       300     ?
 * i                    10.12.8.63      0          100       0       ?
 *>i 10.1.133.216/30    10.12.8.64      0          100       300     ?
 * i                    10.12.8.64      0          100       0       ?
 *>i 10.1.160.248/29    10.12.40.7      0          100       300     ?
 * i                    10.12.40.7      0          100       0       ?
 *>i 10.1.161.0/29      10.12.40.8      0          100       300     ?
 * i                    10.12.40.8      0          100       0       ?
 *>i 10.1.161.248/32    10.12.40.7      2          100       300     ?
 * i                    10.12.40.7      2          100       0       ?
 *>i 10.1.161.249/32    10.12.40.7      2          100       300     ?
 * i                    10.12.40.7      2          100       0       ?
 *>i 10.1.164.248/29    10.12.40.7      0          100       300     ?
 * i                    10.12.40.7      0          100       0       ?
 *>i 10.1.165.0/29      10.12.40.8      0          100       300     ?
 * i                    10.12.40.8      0          100       0       ?
 *>i 10.1.165.248/32    10.12.40.7      2          100       300     ?
 * i                    10.12.40.7      2          100       0       ?
```
The text file goes on for a long way and contains plenty of garbage lines that I don't want, so I look for the keyword (`display bgp vpnv4 vpn-instance`) and start reading once I find it. The code looks like this; it converts each table into my dataframe.

My problem is that reading these 440K lines and converting them into a dataframe takes almost half an hour to complete. I am here to seek help to see if there is a better way to improve the efficiency. Thank you!
```python
import ipaddress
import pandas as pd

bgp_df = pd.DataFrame()
vrf_list = ['Charging_VRF', 'Gb_VRF', 'Gn_VRF']

def generate_bgp_network_list(block, vrf):
    global bgp_df  # the frame is accumulated at module level
    ip_address_list = block.split('\n')
    # Split each non-empty line into whitespace-separated fields
    ip_addresses = [[address for address in ip_address.strip().split(' ') if address]
                    for ip_address in ip_address_list if ip_address]
    ip_addresses = [address for address in ip_addresses if len(address) > 0]  # drop empty lists
    # Keep rows whose second field is a network, e.g. '*>i 10.0.19.0/24 ...'
    ip_addresses = [(ipaddress.IPv4Network(ip_address[1], False), ip_address[-1])
                    for ip_address in ip_addresses
                    if validate_ipaddress(ip_address[1])]  # validate_ipaddress: helper defined elsewhere
    bgp_data = [{'ip_network': address, 'vrf': vrf, 'as_number': as_number}
                for address, as_number in ip_addresses]
    bgp_df = bgp_df.append(bgp_data, ignore_index=True)

def read_bgp_file(file):
    if file == '':
        return
    file = open(file, encoding=get_encoding_type(file))  # get_encoding_type: helper defined elsewhere
    lines = file.readlines()
    start = False
    block = ''
    vrf = None
    for line in lines:
        # A '<hostname>' prompt marks the end of the current table block
        if '<' in line and len(block) > 0:
            generate_bgp_network_list(block, vrf)
            start = False
            block = ''
        if 'display bgp vpnv4 vpn-instance' in line:
            vrf = line.strip().split(' ')[-2]
            if vrf in vrf_list:
                start = True
        if start:
            block += line
    if block:  # flush the final block in case the file does not end with a prompt line
        generate_bgp_network_list(block, vrf)
```
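A note on the code above: besides the file reading, `DataFrame.append` inside a loop is a known pandas anti-pattern, since each call copies the entire frame, so the total work grows roughly quadratically with the number of rows (the method was deprecated in pandas 1.4 for this reason). A minimal sketch of the usual alternative, collecting plain dicts and constructing the frame once at the end (the function name and the `(block, vrf)` pairs are illustrative, standing in for the parsing in `generate_bgp_network_list`):

```python
import pandas as pd

def build_bgp_dataframe(blocks):
    """Collect rows from all (block_text, vrf_name) pairs, then build the frame once."""
    rows = []
    for block, vrf in blocks:
        for line in block.splitlines():
            fields = line.split()
            # Best routes carry the network in the second field, e.g. '*>i 10.0.19.0/24 ...'
            if len(fields) >= 2 and fields[0] == '*>i':
                rows.append({'ip_network': fields[1], 'vrf': vrf, 'as_number': fields[-1]})
    return pd.DataFrame(rows)  # one construction instead of one copy per block
```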
Solution 1:[1]
It looks to me like you only need the lines starting with `*>i`. If that is the case, how about a simple approach like this:
```python
import pandas as pd

def input_file_to_dataframe(file_name: str):
    result = []
    prefix = '*>i'
    with open(file_name, "r") as file:
        lines = file.readlines()
        for line in lines:
            line = line.strip()
            if line.startswith(prefix):
                # Drop the status-code prefix and split the row into fields
                line = line.replace(prefix, '').split()
                result.append(line)
    return pd.DataFrame(data=result)
```
Run with ~50K lines:

```python
input_file_to_dataframe('file.txt')
# 46.3 ms ± 3.02 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
```
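The resulting frame has default integer column names (0, 1, 2, ...). If named columns are wanted, they can be supplied at construction time; the names below are illustrative, taken from the routing-table header in the question:

```python
import pandas as pd

# Illustrative column names matching the table header in the question
columns = ['network', 'next_hop', 'med', 'loc_prf', 'pref_val', 'path_ogn']
rows = [['10.0.19.0/24', '10.12.8.21', '0', '100', '300', '?']]  # one parsed sample row
df = pd.DataFrame(data=rows, columns=columns)
```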
Solution 2:[2]
For me, the `readlines` call is the major issue here, because it loads all lines into memory at once.
If you iterate directly over the file object instead, it reads line by line, which I expect to give a faster result:
```python
with open(file_name, "r") as the_file:
    for line in the_file:  # the file object yields lines lazily, one at a time
        ...                # process each line here
```
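Applied to the reader in the question, a sketch might look like this (assuming the same `generate_bgp_network_list` and `vrf_list` from above; the encoding argument is simplified to a literal here, where the question detects it with a helper):

```python
def read_bgp_file(path):
    vrf = None
    start = False
    block = ''
    with open(path, encoding='utf-8') as f:  # encoding assumed for the sketch
        for line in f:                       # lazy, line-by-line iteration
            if '<' in line and block:        # a prompt line ends the current table block
                generate_bgp_network_list(block, vrf)
                start = False
                block = ''
            if 'display bgp vpnv4 vpn-instance' in line:
                vrf = line.strip().split(' ')[-2]
                start = vrf in vrf_list
            if start:
                block += line
    if block:                                # flush a trailing block, if any
        generate_bgp_network_list(block, vrf)
```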
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Rafaó |
| Solution 2 | Floh |
