How to parse JSON in Snowflake to get a count based on a certain field in the JSON
[
{
"type": "ENTITLEMENT",
},
{
"type": "ENTITLEMENT",
},
{
"type": "ROLE"
},
{
"type": "ACCESS_PROFILE"
}
]
I have a column with the above JSON value in Snowflake. I am trying to get the count for a particular type. For example, I want the count for type = ENTITLEMENT, which should be 2.
I could flatten this data, but that would produce a lot of duplicate data in the other columns (I have a table with a large number of columns).
Looking for other ways to parse this JSON.
Things I have tried:
- Wrote a Java UDF to parse the content, for example:
CREATE OR REPLACE FUNCTION IDN_DATA.entitlement_counter(access array)
RETURNS NUMBER
LANGUAGE java handler = 'JsonCounterWithFilter.entitlement_counter'
as
$$
public class JsonCounterWithFilter {
    public int entitlement_counter(String[] access) {
        int counter = 0;
        for (String acc : access) {
            if (acc.contains("ENTITLEMENT")) {
                counter++;
            }
        }
        return counter;
    }
}
$$;
This does not work because it looks for the word ENTITLEMENT anywhere in each element, not just in the type field. Also, I don't have access to the Jackson library to parse this JSON properly.
Solution 1:[1]
A short JS UDF can do this count:
create or replace function count_object_in_array(A array, T string, V string)
returns string
language javascript
as
$$
  return A.reduce((count, x) => count + (x[T] == V ? 1 : 0), 0)
$$;
For example:
create or replace temp table stst as
select parse_json('[
{
"type": "ENTITLEMENT",
},
{
"type": "ENTITLEMENT",
},
{
"type": "ROLE"
},
{
"type": "ACCESS_PROFILE"
}
]') a;
select *, count_object_in_array(a, 'type', 'ENTITLEMENT')
from stst;
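To apply this UDF to a real table without flattening, a minimal usage sketch (the table MY_TABLE and column ACCESS_JSON here are hypothetical stand-ins for your wide table) could look like:
-- one scalar count per row, so the other columns are never duplicated
select t.*,
       count_object_in_array(t.access_json::array, 'type', 'ENTITLEMENT') as entitlement_count
from my_table t;
Because the UDF returns one value per row, the wide table keeps its shape and only gains a count column.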
Solution 2:[2]
Here's a pure SQL approach:
create or replace temp table T1 as
select parse_json($$
[
{
"type": "ENTITLEMENT",
},
{
"type": "ENTITLEMENT",
},
{
"type": "ROLE"
},
{
"type": "ACCESS_PROFILE"
}
]
$$) COL1;
select VALUE:type::string as TYPE
,count(*) as CT
from t1, table(flatten(t1.col1))
group by TYPE
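If you only want the single number from the question (the count of type = ENTITLEMENT, which should be 2), the same flatten can be filtered instead of grouped. A minimal variant of the query above, using the same T1 table:
-- count only the array elements whose type is ENTITLEMENT
select count(*) as entitlement_ct
from t1, table(flatten(t1.col1))
where value:type::string = 'ENTITLEMENT';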
Solution 3:[3]
Lateral flatten should not produce duplicates (at least not based on your input), and you can aggregate directly over the flattened JSON. See if this works for your data:
WITH JSON_DATA AS(
SELECT
PARSE_JSON('[
{
"type": "ENTITLEMENT",
},
{
"type": "ENTITLEMENT",
},
{
"type": "ROLE"
},
{
"type": "ACCESS_PROFILE"
}
]' ) as json)
SELECT value:type::string as type, count(*) FROM JSON_DATA, lateral flatten(input => json)
GROUP BY type;
results:
TYPE | COUNT(*) |
---|---|
ENTITLEMENT | 2 |
ROLE | 1 |
ACCESS_PROFILE | 1 |
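If you would rather not flatten at all, and your Snowflake account supports the higher-order array functions FILTER and ARRAY_SIZE (treat their availability and exact lambda syntax as an assumption here), the count can be computed as a scalar expression per row. A sketch against the same sample data:
with json_data as (
    select parse_json('[ {"type":"ENTITLEMENT"}, {"type":"ENTITLEMENT"}, {"type":"ROLE"}, {"type":"ACCESS_PROFILE"} ]') as json
)
-- FILTER keeps only the matching elements, ARRAY_SIZE counts them
select array_size(filter(json::array, a variant -> a:type::string = 'ENTITLEMENT')) as entitlement_ct
from json_data;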
Solution 4:[4]
So if we have "more" data, as you imply you have, rather than the single demo row:
with data as (
select column1 as id, parse_json(column2) as json from values
('a', '[ {"type":"ENTITLEMENT"},{"type":"ENTITLEMENT"},{"type":"ROLE"},{"type":"ACCESS_PROFILE"}]'),
('b', '[ {"type":"ENTITLEMENT"},{"type":"ROLE"},{"type":"ACCESS_PROFILE"}]'),
('c', '[ {"type":"ENTITLEMENT"},{"type":"ENTITLEMENT"},{"type":"ENTITLEMENT"},{"type":"ROLE"},{"type":"ACCESS_PROFILE"}]')
)
If you then just flatten that as if it were one huge blob of data, you will get "mixed up" values:
SELECT
j.value:type::string as type, count(*)
FROM data as d, lateral flatten(input => d.json) j
GROUP BY type;
TYPE | COUNT(*) |
---|---|
ENTITLEMENT | 6 |
ROLE | 3 |
ACCESS_PROFILE | 3 |
The classic answer is to use some "extra" per-row information, like ID, to also group by:
SELECT
d.id,
j.value:type::string as type,
count(*) as count
FROM data as d, lateral flatten(input => d.json) j
GROUP BY 1,2;
ID | TYPE | COUNT |
---|---|---|
a | ENTITLEMENT | 2 |
a | ROLE | 1 |
a | ACCESS_PROFILE | 1 |
b | ENTITLEMENT | 1 |
b | ACCESS_PROFILE | 1 |
c | ENTITLEMENT | 3 |
c | ROLE | 1 |
c | ACCESS_PROFILE | 1 |
b | ROLE | 1 |
Another way, if you have no per-row value like ID (or it is not unique per row), is to use the SEQ column returned by FLATTEN:
SELECT
j.SEQ,
j.value:type::string as type,
count(*) as count
FROM data as d, lateral flatten(input => d.json) j
GROUP BY 1,2;
SEQ | TYPE | COUNT |
---|---|---|
1 | ENTITLEMENT | 2 |
1 | ROLE | 1 |
1 | ACCESS_PROFILE | 1 |
2 | ENTITLEMENT | 1 |
2 | ROLE | 1 |
2 | ACCESS_PROFILE | 1 |
3 | ROLE | 1 |
3 | ACCESS_PROFILE | 1 |
3 | ENTITLEMENT | 3 |
Or you can drop SEQ from the SELECT list and keep it only in the GROUP BY, but that might confuse people later ("why do we have duplicate values?"):
SELECT
j.value:type::string as type,
count(*) as count
FROM data as d, lateral flatten(input => d.json) j
GROUP BY j.seq, 1;
TYPE | COUNT |
---|---|
ENTITLEMENT | 2 |
ROLE | 1 |
ACCESS_PROFILE | 1 |
ENTITLEMENT | 1 |
ROLE | 1 |
ACCESS_PROFILE | 1 |
ROLE | 1 |
ACCESS_PROFILE | 1 |
ENTITLEMENT | 3 |
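If, as in the original question, you only need the ENTITLEMENT count per row, the grouped output above can be collapsed with COUNT_IF so that each source row yields exactly one number. A sketch reusing the same data CTE from this solution:
SELECT
    d.id,
    -- count only the elements whose type is ENTITLEMENT
    count_if(j.value:type::string = 'ENTITLEMENT') as entitlement_count
FROM data as d, lateral flatten(input => d.json) j
GROUP BY 1;
For the sample rows this should return 2 for a, 1 for b, and 3 for c.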
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | Felipe Hoffa |
Solution 2 | Greg Pavlik |
Solution 3 | P Needleman |
Solution 4 | Simeon Pilgrim |