Return elements of Redshift JSON array on separate rows

I have a Redshift table that looks like this:

 id | metadata
---------------------------------------------------------------------------
 1  | [{"pet":"dog"},{"country":"uk"}]
 2  | [{"pet":"cat"}]
 3  | []
 4  | [{"country":"germany"},{"education":"masters"},{"country":"belgium"}]
  • All array elements have just one field.
  • There is no guarantee that a particular field will feature in any of an array's elements.
  • A field name can be repeated in an array.
  • The array elements can be in any order.

I want to get back a table that looks like this:

 id |   field   |  value
------------------------
 1  | pet       | dog
 1  | country   | uk
 2  | pet       | cat
 4  | country   | germany
 4  | education | masters
 4  | country   | belgium

I can then combine this with my queries on the rest of the input table.

I have tried playing around with the Redshift JSON functions, but without being able to write functions, use loops, or have variables in Redshift, I really can't see a way to do this!

Please let me know if I can clarify anything else.

Solution 1:[1]

There is a generic version of the CREATE VIEW seq_0_to_3 approach; let's call it CREATE VIEW seq_0_to_n. It can be generated with:

CREATE VIEW seq_0_to_n AS (
    SELECT row_number() OVER (ORDER BY TRUE)::integer - 1 AS i
    FROM <insert_large_enough_table>
    LIMIT <number_less_than_table_entries>
);

This helps in generating large sequences as a view.
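
As a sketch of how to apply it to the question's table (assuming the table is named my_table and that n is at least the length of the longest array), you can cross-join the sequence against each row's array length and pull out one element per row with Redshift's JSON functions:

SELECT
    t.id,
    json_extract_array_element_text(t.metadata, s.i) AS element -- e.g. {"pet":"dog"}
FROM my_table t
JOIN seq_0_to_n s
    ON s.i < json_array_length(t.metadata)
ORDER BY t.id, s.i;

Each returned element is still a single-field JSON object, so a further parsing step is needed to split it into separate field/value columns; the empty array for id 3 simply yields no rows.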

Solution 2:[2]

It's now possible in Redshift to treat strings in either array format [] or JSON format {} as parsable JSON structures. First, let's make a temp table based on your data:

create temporary table #t1 (id int, json_str varchar(100));
insert into #t1 values (1, '[{"pet":"dog"},{"country":"uk"}]');
insert into #t1 values (2, '[{"pet":"cat"}]');
insert into #t1 values (3, '[]');
insert into #t1 values (4, '[{"country":"germany"},{"education":"masters"},{"country":"belgium"}]');

The following common table expression (CTE) implicitly converts the json_str field into a formal JSON structure of SUPER type. If the table's field were already SUPER type, we could skip this step.

drop table if exists #t2;
create temporary table #t2 as
with cte as 
    (select 
        x.id,
        json_parse(x.json_str) as json_str -- convert string to SUPER structure 
    from
        #t1 x
    )
select
    x.id
    ,unnested
from
    cte x, x.json_str as unnested -- aliases for both the CTE and x.json_str are required!
order by 
    id
;

Now we have an exploded list of key/value pairs to easily extract:

select 
    t2.id
    ,json_key -- this is the extracted key
    ,cast(json_val as varchar) as json_val -- eliminates the double quote marks
from
    #t2 t2, unpivot t2.unnested as json_val at json_key --"at some_label" (e.g. json_key) will extract the key
order by
    id
;

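Run against the sample data, this produces exactly the rows the question asks for (the empty array for id 3 contributes no rows; row order within an id may vary):

 id | json_key  | json_val
----+-----------+----------
 1  | pet       | dog
 1  | country   | uk
 2  | pet       | cat
 4  | country   | germany
 4  | education | masters
 4  | country   | belgium
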
A different way to render the info is to allow the parsing engine to turn keys into columns. This isn't what you asked for, but potentially interesting:

select 
    id
    ,cast(t2.unnested.country as varchar) as country -- data is already parsed into rows, so it's directly addressable now
    ,cast(t2.unnested.education as varchar) as education
    ,cast(t2.unnested.pet as varchar) as pet
from
    #t2 t2
;

If you want more info on this, search for parsing the SUPER data type in the Redshift documentation. If the data already existed as SUPER in the Redshift table, these latter two queries would work natively against the table, with no need for a temp table.
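
For example, if metadata were already stored as SUPER in a table (here assumed to be named my_table), a single statement along the following lines combines both steps, with a CTE standing in for the temp table #t2 above. This is a sketch under those assumptions, not a tested query:

with exploded as
    (select
        t.id,
        u as unnested -- one row per array element
    from
        my_table t, t.metadata as u -- unnest the SUPER array
    )
select
    e.id,
    json_key as field,
    cast(json_val as varchar) as value
from
    exploded e, unpivot e.unnested as json_val at json_key
order by
    e.id
;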

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

 Solution   | Source
------------+---------------------
 Solution 1 | João Luiz Carabetta
 Solution 2 |