Filter a list with multiple regex values from a JSON file
I am trying to build a filtered list using regex: I want to keep only the entries with specific location codes. I can read the results from a JSON file, but I am stuck on how to apply multiple regex patterns to filter them.
This is how far I have got:
import json
import re
file_path = './response.json'
result = []
with open(file_path) as f:
    data = json.loads(f.read())
for d in data:
    result.append(d['location_code'])
result = list(dict.fromkeys(result))
re_list = ['.*dk*', '.*se*', '.*fi*', '.*no*']
matches = []
for r in re_list:
    matches += re.findall( r, result)
# r = re.compile('.*denmark*', '', '', '')
# filtered_list = list(filter(r.match, result))
print(matches)
This is the output of the first step, after reading the JSON file. I need to filter on country codes such as dk, no, lv, fi, ee, etc., and keep only the entries that contain the specified country codes.
[
'2e4efc13-6a6a-45ba-a6aa-ec4eb1f4fb2b|europe.northern-europe.dk.na.copenhagen|chromium|74',
'2e4efc13-6a6a-45ba-a6aa-ec4eb1f4fb2b|europe.northern-europe.dk.na.copenhagen|chromium|87',
'2e4efc13-6a6a-45ba-a6aa-ec4eb1f4fb2b|europe.western-europe.nl.na.amsterdam|firefox|28',
'2e4efc13-6a6a-45ba-a6aa-ec4eb1f4fb2b|europe.eastern-europe.bg.na.sofia|chromium|74',
'2e4efc13-6a6a-45ba-a6aa-ec4eb1f4fb2b|europe.eastern-europe.bg.na.sofia|chromium|87',
...
'2e4efc13-6a6a-45ba-a6aa-ec4eb1f4fb2b|europe.western-europe.de.na.frankfurt.amazon|chromium|87'
]
Would appreciate any help. Thanks!
Solution 1:[1]
Here is one approach that might work:
Set up multiple patterns.
For the first pattern you could use (note that the literal | has to be escaped, otherwise it acts as alternation):
"2e4efc13-6a6a-45ba-a6aa-ec4eb1f4fb2b\|([^"]+)"
or
"2e4efc13-6a6a-45ba-a6aa-ec4eb1f4fb2b\|.*"
or
for text:
.*?text"\s?:\s?"([\w\s]+)
for names:
.*?name"\s?:\s?"([\w\s]+)
Let me know if this works for you.
Solution 2:[2]
It looks like regex is not the best tool here. For example, .*fi.* will match sofia, which is probably not wanted. Even if we insist on periods before and after the code, all of the example rows contain .na., yet they probably should not match a search for Namibia.
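The two pitfalls above can be reproduced with a short sketch. It uses two rows from the question's sample output; the variable names (`rows`, `loose`, `dotted_fi`, `dotted_na`) are made up for illustration.

```python
import re

rows = [
    '2e4efc13-6a6a-45ba-a6aa-ec4eb1f4fb2b|europe.northern-europe.dk.na.copenhagen|chromium|74',
    '2e4efc13-6a6a-45ba-a6aa-ec4eb1f4fb2b|europe.eastern-europe.bg.na.sofia|chromium|74',
]

# A loose pattern finds "fi" anywhere in the row, including inside "sofia".
loose = [row for row in rows if re.match(r'.*fi.*', row)]

# Requiring a dot on both sides no longer matches "sofia" ...
dotted_fi = [row for row in rows if re.search(r'\.fi\.', row)]

# ... but ".na." appears in every sample row, so a search for "na"
# would still select all of them.
dotted_na = [row for row in rows if re.search(r'\.na\.', row)]

print(loose)      # only the Sofia (bg) row
print(dotted_fi)  # empty list
print(dotted_na)  # both rows
```

Neither pattern knows which field holds the country code, which is why extracting that field explicitly is more reliable.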
A better way is probably to parse the string more carefully, using one or more of (a) the csv module (if the fields can contain quoting and escaping), (b) the split method, and/or (c) regular expressions, to retrieve the country code from each row. Once we have the country code, we can compare it explicitly.
For example, using the split method:
DATA = [
    '2e4efc13-6a6a-45ba-a6aa-ec4eb1f4fb2b|europe.northern-europe.dk.na.copenhagen|chromium|74',
    '2e4efc13-6a6a-45ba-a6aa-ec4eb1f4fb2b|europe.northern-europe.dk.na.copenhagen|chromium|87',
    '2e4efc13-6a6a-45ba-a6aa-ec4eb1f4fb2b|europe.western-europe.nl.na.amsterdam|firefox|28',
    '2e4efc13-6a6a-45ba-a6aa-ec4eb1f4fb2b|europe.eastern-europe.bg.na.sofia|chromium|74',
    '2e4efc13-6a6a-45ba-a6aa-ec4eb1f4fb2b|europe.eastern-europe.bg.na.sofia|chromium|87',
    '2e4efc13-6a6a-45ba-a6aa-ec4eb1f4fb2b|europe.western-europe.de.na.frankfurt.amazon|chromium|87'
]
COUNTRIES = ['dk', 'se', 'fi', 'no']
def extract_country(row):
    geo = row.split('|')[1]
    country = geo.split('.')[2]
    return country
filtered = [
    row for row in DATA
    if extract_country(row) in COUNTRIES
]
print(filtered)
Or, if you prefer one-liners, you can skip the extract_country function:
filtered = [
    row for row in DATA
    if row.split('|')[1].split('.')[2] in COUNTRIES
]
Both of these split the row on | and take the second column to get the geographical area, then split the geo area on . and take the third item, which seems to be the country code. If you have documentation for your data source, you will be able to check whether this is true.
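To make the two splits concrete, here is a small walk-through on one row from the sample data. The intermediate names (`columns`, `parts`) are only for illustration.

```python
row = '2e4efc13-6a6a-45ba-a6aa-ec4eb1f4fb2b|europe.northern-europe.dk.na.copenhagen|chromium|74'

# First split: four pipe-separated columns; the second is the geo area.
columns = row.split('|')
geo = columns[1]

# Second split: dot-separated parts of the geo area; the third one is
# assumed to be the country code.
parts = geo.split('.')
country = parts[2]

print(geo)      # europe.northern-europe.dk.na.copenhagen
print(parts)    # ['europe', 'northern-europe', 'dk', 'na', 'copenhagen']
print(country)  # dk
```

Rows with extra trailing parts, such as the frankfurt.amazon row, still carry the country code at index 2 as long as the leading fields keep this layout.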
One additional check might be to verify that the extracted country code has exactly two letters, as a partial check for irregularities in the data:
import re

def extract_country(row):
    geo = row.split('|')[1]
    country = geo.split('.')[2]
    # Reject anything that is not exactly two lowercase letters
    if not re.match('^[a-z]{2}$', country):
        raise ValueError(
            'Expected a two-letter country code, got "%s" in row %s'
            % (country, row)
        )
    return country
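If you would rather skip irregular rows than raise an error, option (c), a regular expression anchored to the row layout, could look like the sketch below. It assumes every valid row has the form id|continent.region.country.… with a two-letter lowercase country code in third position. `ROW_PATTERN` and `country_or_none` are hypothetical names, not part of the original answer.

```python
import re

# Assumed layout: <id>|<continent>.<region>.<country>.<rest>|<browser>|<version>
# The single capture group is the two-letter country code.
ROW_PATTERN = re.compile(r'^[^|]*\|[^.|]+\.[^.|]+\.([a-z]{2})\.')

COUNTRIES = {'dk', 'se', 'fi', 'no'}

def country_or_none(row):
    """Return the country code, or None if the row does not fit the layout."""
    match = ROW_PATTERN.match(row)
    return match.group(1) if match else None

rows = [
    '2e4efc13-6a6a-45ba-a6aa-ec4eb1f4fb2b|europe.northern-europe.dk.na.copenhagen|chromium|74',
    '2e4efc13-6a6a-45ba-a6aa-ec4eb1f4fb2b|europe.eastern-europe.bg.na.sofia|chromium|74',
    'malformed row without the expected fields',
]

# Rows that do not fit the layout yield None, which is never in COUNTRIES,
# so they are dropped silently instead of raising.
filtered = [row for row in rows if country_or_none(row) in COUNTRIES]
print(filtered)  # only the Copenhagen (dk) row
```

Whether dropping malformed rows silently is acceptable depends on the data source; raising, as in the version above, makes irregularities visible sooner.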
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Sarim Sikander |
| Solution 2 |
