How to find specific values in a text file in Python

Good morning guys, my question is: I have a text file in this format:

1 00:00:00,000 --> 00:00:00,033 <font size="36">FrameCnt: 1, DiffTime:
33ms 2022-05-19 16:15:57,729,790 [iso : 110] [shutter : 1/640.0] [fnum
: 280] [ev : 0] [ct : 5284] [color_md : default] [focal_len : 240]
[dzoom_ratio: 20088, delta:10088],[latitude: 38.259025] [longtitude:
15.598678] [rel_alt: 9.737 abs_alt: 99.324] [Drone: Yaw:51.4,
Pitch:-1.8, Roll:-1.3] </font>

2 00:00:00,033 --> 00:00:00,066 <font size="36">FrameCnt: 2, DiffTime:
33ms 2022-05-19 16:15:57,762,098 [iso : 110] [shutter : 1/640.0] [fnum
: 280] [ev : 0] [ct : 5284] [color_md : default] [focal_len : 240]
[dzoom_ratio: 20088, delta:0],[latitude: 38.259030] [longtitude:
15.598689] [rel_alt: 9.737 abs_alt: 99.324] [Drone: Yaw:51.4,
Pitch:-1.8, Roll:-1.3] </font>

My intention is to retrieve the FrameCnt, latitude, and longitude values for each block of 6 rows (note that the file spells the last label "longtitude"). This is my desired output:

1, 38.259025, 15.598678
    
2, 38.259030, 15.598689

How can I do this in Python? Thank you very much in advance.



Solution 1:[1]

You can do this with a regular expression:

import re

regex = (r".*\[latitude: (.*)\] \[longtitude:\n"
    r"(.*)\] \[rel_alt.*")

test_str = ("1 00:00:00,000 --> 00:00:00,033 <font size=\"36\">FrameCnt: 1, DiffTime:\n"
    "33ms 2022-05-19 16:15:57,729,790 [iso : 110] [shutter : 1/640.0] [fnum\n"
    ": 280] [ev : 0] [ct : 5284] [color_md : default] [focal_len : 240]\n"
    "[dzoom_ratio: 20088, delta:10088],[latitude: 38.259025] [longtitude:\n"
    "15.598678] [rel_alt: 9.737 abs_alt: 99.324] [Drone: Yaw:51.4,\n"
    "Pitch:-1.8, Roll:-1.3] </font>\n\n"
    "2 00:00:00,033 --> 00:00:00,066 <font size=\"36\">FrameCnt: 2, DiffTime:\n"
    "33ms 2022-05-19 16:15:57,762,098 [iso : 110] [shutter : 1/640.0] [fnum\n"
    ": 280] [ev : 0] [ct : 5284] [color_md : default] [focal_len : 240]\n"
    "[dzoom_ratio: 20088, delta:0],[latitude: 38.259030] [longtitude:\n"
    "15.598689] [rel_alt: 9.737 abs_alt: 99.324] [Drone: Yaw:51.4,\n"
    "Pitch:-1.8, Roll:-1.3] </font>")

# The pattern has no ^ or $ anchors, so re.MULTILINE is not needed.
matches = re.finditer(regex, test_str)

# match_num is the running match count, not the FrameCnt value read from the text.
for match_num, match in enumerate(matches, start=1):
    latitude, longitude = match.groups()
    print(f"{match_num}, {latitude}, {longitude}")
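
Solution 1 prints the running match number rather than the FrameCnt stored in the file, and it relies on the line break falling right after "longtitude:". A possible variation, offered only as a sketch, captures all three values with one pattern. The names `BLOCK_RE` and `extract_rows` are made up for illustration, and the sample string simply repeats the two blocks from the question.

```python
import re

# Sample text in the question's format (two hard-wrapped subtitle blocks).
sample = """1 00:00:00,000 --> 00:00:00,033 <font size="36">FrameCnt: 1, DiffTime:
33ms 2022-05-19 16:15:57,729,790 [iso : 110] [shutter : 1/640.0] [fnum
: 280] [ev : 0] [ct : 5284] [color_md : default] [focal_len : 240]
[dzoom_ratio: 20088, delta:10088],[latitude: 38.259025] [longtitude:
15.598678] [rel_alt: 9.737 abs_alt: 99.324] [Drone: Yaw:51.4,
Pitch:-1.8, Roll:-1.3] </font>

2 00:00:00,033 --> 00:00:00,066 <font size="36">FrameCnt: 2, DiffTime:
33ms 2022-05-19 16:15:57,762,098 [iso : 110] [shutter : 1/640.0] [fnum
: 280] [ev : 0] [ct : 5284] [color_md : default] [focal_len : 240]
[dzoom_ratio: 20088, delta:0],[latitude: 38.259030] [longtitude:
15.598689] [rel_alt: 9.737 abs_alt: 99.324] [Drone: Yaw:51.4,
Pitch:-1.8, Roll:-1.3] </font>
"""

# \s* after each label accepts a space or a line break before the value.
# .*? combined with re.DOTALL skips lazily across lines between the fields.
BLOCK_RE = re.compile(
    r"FrameCnt:\s*(\d+).*?"
    r"\[latitude:\s*(-?[\d.]+)\].*?"
    r"\[longtitude:\s*(-?[\d.]+)\]",
    re.DOTALL,
)


def extract_rows(text):
    """Return a list of (frame, latitude, longitude) tuples of strings."""
    return BLOCK_RE.findall(text)


rows = extract_rows(sample)
for frame, lat, lon in rows:
    print(f"{frame}, {lat}, {lon}")
```

Because the whitespace after each label is matched with `\s*`, this should work whether or not the file is really hard-wrapped as displayed in the question. One caveat: if a block lacked a latitude or longitude field, the lazy `.*?` would run on into the next block, so the sketch assumes every block carries all three labels.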

Solution 2:[2]

I think this is a good use for a regex lookbehind. A lookbehind is a part of the pattern that must appear immediately before the match for it to be valid, but it isn't included in the returned match. The syntax is (?<=<lookbehind_regex>).

We are looking for the strings after "FrameCnt: ", "latitude: " and "longtitude: " (the file spells it that way), so our patterns will start with (?<=FrameCnt: ) and so on.

Next, we want to find a number, which may or may not have a decimal point. This can be matched using [0-9.]+. The [0-9.] part means any digit character or a period. The + means that we want [0-9.] one or more times. This needs to be included in the match, so we place it outside the lookbehind.

Our patterns will thus look like (?<=FrameCnt: )[0-9.]+, (?<=latitude: )[0-9.]+ and (?<=longtitude: )[0-9.]+.
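
To make the effect of the lookbehind concrete, here is a small sketch on a single made-up line (the variable names are illustrative only). It compares a pattern that includes the label with one that moves the label into a lookbehind.

```python
import re

line = "FrameCnt: 2, DiffTime: 33ms [latitude: 38.259030]"

# Without a lookbehind, the label is part of the returned match.
with_label = re.findall(r"FrameCnt: [0-9.]+", line)

# With a lookbehind, the label must precede the match but is not returned.
without_label = re.findall(r"(?<=FrameCnt: )[0-9.]+", line)
latitude = re.findall(r"(?<=latitude: )[0-9.]+", line)

print(with_label)     # ['FrameCnt: 2']
print(without_label)  # ['2']
print(latitude)       # ['38.259030']
```

Note that [0-9.] does not include a minus sign, so a negative coordinate would lose its sign with these patterns; something like -?[0-9.]+ would be needed for that case.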

We could now hardcode these patterns with the three different words, but what if you decide tomorrow that you also need another value? That's why I would use a for loop and dynamically construct the pattern from a base pattern.

Here's the code:

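# The standard-library re module supports this fixed-width lookbehind,
# so the third-party regex package is not required.
from re import findall

pattern = "(?<=%s)[0-9.]+"
tofind = ["FrameCnt", "latitude", "longtitude"]
found = []

with open("filename.txt", "r") as file:
    txt = file.read()

# Build one pattern per label and collect all of its matches.
for string in tofind:
    newpattern = pattern % (string + ": ")
    found.append(findall(newpattern, txt))
print(found)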

Output:

[['1', '2'], ['38.259025', '38.259030'], ['15.598678', '15.598689']]

Now we still have to convert the values to float and put them into a DataFrame.

from pandas import DataFrame as df

frame=df(found,dtype=float,index=tofind)
print(frame)

Output:

                    0          1
FrameCnt     1.000000   2.000000
latitude    38.259025  38.259030
longtitude  15.598678  15.598689
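
The question asked for one line per frame rather than one row per field. As a rough sketch, assuming `found` has the shape printed above (one list per label, all of equal length), the lists can be transposed with zip and printed in the requested format. The names `rows` and `lines` are illustrative.

```python
# `found` mirrors the output shown above: one list of strings per label.
found = [["1", "2"], ["38.259025", "38.259030"], ["15.598678", "15.598689"]]

# zip(*found) transposes the per-label lists into per-frame tuples.
rows = list(zip(*found))

# Join each tuple into the "FrameCnt, latitude, longitude" layout from the question.
lines = [", ".join(row) for row in rows]
for line in lines:
    print(line)
```

Keep in mind that zip stops at the shortest list, so if one label were missing from some blocks the rows would be silently truncated or misaligned.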

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 nfn
Solution 2 The_spider