'REGEXP_EXTRACT in Impala

I am trying to figure out how to extract customer ID from string that looks loke this:

{"param":"success","value":"10","level":"0","error_code":"101","customer_id":"5b0e9b23e423b0d33c9f7ddfd", "purchases": "13", "last_activity_ts": "123523465"}

I am trying to extract customer ID from strings that contain error code 101 with following code:

select regexp_extract(field, '\"customer_id":"(.*)', 0) from table_name
where field rlike '"error_code":"101"'

But this gives me a following result:

"customer_id":"5b0e9b23e423b0d33c9f7ddfd", "purchases": "13", "last_activity_ts": "123523465"}

Expected result:

5b0e9b23e423b0d33c9f7ddfd

Could you please help me with this?



Solution 1:[1]

You can use below regex:

"customer_id":"(\w+)"

Demo : https://regex101.com/r/MEOGw8/39

Test:

{"param":"success","value":"10","level":"0","error_code":"101","customer_id":"5b0e9b23e423b0d33c9f7ddfd", "purchases": "13", "last_activity_ts": "123523465"}

Match:

Match 1
Full match  63-104  `"customer_id":"5b0e9b23e423b0d33c9f7ddfd"`
Group 1.    78-103  `5b0e9b23e423b0d33c9f7ddfd`

SQL Statement:

select regexp_extract(field, '"customer_id":"(\w+)"',1, 1) from table_name
where field rlike '"error_code":"101"'

Solution 2:[2]

Your regex matches from "customer_id":" until the end of the line because you use .* which will match any charcter zero or more times and you use 0 as the last parameter of regexp_extract. which refers to the entire extracted string.

To match what is between the double quotes, you could match not a double quote and capture that in a group ([^"]+) using a negated character class:

"customer_id":"([^"]+)"

Or you might specify the character ranges in a character class, repeat it one or more times ([a-f0-9]+) and capture that in a group:

"customer_id":"([a-f0-9]+)"

Your value is in the first capturing group which I think you could specify using 1 as the third parameter for regexp_extract.

regexp_extract(field, '"customer_id":"([a-f0-9]+)"', 1)

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Wiktor Stribiżew
Solution 2