'Filter jsonb results in subselect

I'm building a hierarchical JSON result from several tables. These are just examples but should be sufficient for the purpose of this demonstration to get the idea:

CREATE TABLE book (
    id INTEGER PRIMARY KEY NOT NULL,
    data JSONB
);
CREATE TABLE author (
    id INTEGER PRIMARY KEY NOT NULL,
    data JSONB
);
CREATE TABLE book_author (
    id INTEGER PRIMARY KEY NOT NULL,
    author_id INTEGER,
    book_id INTEGER
);
CREATE UNIQUE INDEX pk_unique ON book_author (author_id, book_id);

Test data:

INSERT INTO book (id, data) VALUES
  (1, '{"pages": 432, "title": "2001: A Space Odyssey"}')
, (2, '{"pages": 300, "title": "The City And The City"}')
, (3, '{"pages": 143, "title": "Unknown Book"}');

INSERT INTO author (id, data) VALUES
  (1, '{"age": 90, "name": "Arthur C. Clarke"}')
, (2, '{"age": 43, "name": "China Miéville"}');

INSERT INTO book_author (id, author_id, book_id) VALUES
  (1, 1, 1)
, (2, 1, 2);

I've created the following function:

CREATE OR REPLACE FUNCTION public.book_get()
  RETURNS json AS
$BODY$
DECLARE
    result json;
BEGIN
      SELECT to_json(array_agg(_b)) INTO result
      FROM (
        SELECT
          book.id id,
          book.data->>'title' title,
          book.data->>'pages' pages,
          (
            SELECT to_json(array_agg(_a))
            FROM (
              SELECT
                author.id id,
                author.data->>'name' "name",
                author.data->>'age' age
              FROM
                author, book_author ba
              WHERE
                ba.author_id = author.id AND 
                  ba.book_id = book.id
              ORDER BY id
            ) _a
          ) authors
        FROM
          book
        ORDER BY id ASC
      ) _b;
        
    RETURN result;
END;
$BODY$ LANGUAGE plpgsql VOLATILE;

Executing the function book_get

SELECT book_get();

Produces the following results

[
   {
      "id":1,
      "title":"2001: A Space Odyssey",
      "pages":432,
      "authors":[
         {
            "id":1,
            "name":"Arthur C. Clarke",
            "age":90
         }
      ]
   },
   {
      "id":2,
      "title":"The City And The City",
      "pages":300,
      "authors":[
         {
            "id":2,
            "name":"China Miéville",
            "age":43
         }
      ]
   },
   {
      "id":3,
      "title":"Unknown Book",
      "pages":143,
      "authors":null
   }
]

Now I'm able to filter the data with a WHERE clause, e.g.

SELECT to_json(array_agg(_b)) INTO result
FROM (
 ...
) _b
-- give me the book with id 1
WHERE _b.id = 1;
-- or give me all titles with the occurrence of 'City' anywhere
WHERE _b.title LIKE '%City%';
-- or has more than 200 pages
WHERE _b.pages > 200;

How would I make it possible to filter on authors? E.g. something equivalent to WHERE _b.authors.'name' = 'Arthur C. Clarke'.

I have absolutely no idea what type authors becomes? Or is? Is it still a record (array)? Is it JSON already? I guess because I can access id, title and pages accessing _b.authors isn't such a problem?

Accessing _b.authors gives me ERROR: missing FROM-clause entry for table "authors"

Accessing with JSON operators _b.authors->>.. or _b->authors->>.. gives me

operator does not exist: record -> json
Hint: No operator matches the given name and argument type(s). You might need to add explicit type casts.

I remember using GROUP BY with the HAVING clause:

GROUP BY _b.authors
HAVING _b.authors->>'name' = 'Arthur C. Clarke';

But it gives me the error:

ERROR: could not identify an equality operator for type json

To make it a bit more clear:

      SELECT to_json(array_agg(_b)) INTO result
      FROM (
        ...
      ) _b
      WHERE _b.authors->0->>'name' = 'Arthur C. Clarke';

Will basically do what I need, this only matches if the author on index 0 is Arthur C. Clarke. If he has cowritten the book and he would be on second place (index 1) then there wouldn't have been a match. So what I try to find is the correct syntax on scanning _b.authors which happens to be an JSON array filled with authors. It just doesn't accept any attempt. As far as I understand are the @> and #> only supported for JSONB. So how do I get the correct syntax on selecting _b.authors on any column against a value.

Update 2

Ok read the docs again ... it seems I didn't get the part from Postgres documentation that there is a difference between JSON and JSONB regarding functions, I thought it was only concerning the data type. Replacing to_json with to_jsonb seems to do the trick with using operators like @> etc in the where clause.

Update 3

@ErwinBrandstetter: Makes sense. LATERAL wasn't known to me yet, good to know it exists. I got the hang on JSON/JSONB's functions and operators, makes a lot of sense to me now. What isn't clear to me is finding occurrences with LIKE for example in the WHERE clause.

If I need to use jsonb_array_elements to unnest objects in an array (because in the final WHERE clause, the content of b.authors is of JSONB datatype). I could then do

SELECT * FROM jsonb_array_elements('[
  {"age": 90, "name": "the Arthur C. Clarke"},
  {"age": 43, "name": "China Miéville"},
  {"age": null, "name": "Erwin the Brandstetter"}
]'::jsonb) author
WHERE 
  author->>'name' LIKE '%the%';

and get the desired result,

1: {"age": 90, "name": "the Arthur C. Clarke"}
2: {"age": null, "name": "Erwin the Brandstetter"}

but whats the approach to achieve this in the final (last) WHERE clause in my example? Pointing out the final WHERE clause because I'd like to filter the complete set of results, and not partially filter somewhere in the middle of a sub-select. So in general I would like to filter out books in the final result set that have an author with a middle name 'C.' or a first name 'Arthur'.

Update 4

In the FROM clause of course. I'll have to do performance tweaking at the end when I figured out all possibilities, but this is what I came up with.

SELECT json_agg(_b) INTO result
FROM (
...
) _b,
jsonb_array_elements(_b.authors) AS arrauthors
WHERE arrauthors->>'name' LIKE 'Arthur %';

Will give all books with an author name starting with 'Arthur'. I still appreciate comments or updates to this approach.



Solution 1:[1]

How would I make it possible to filter on authors? E.g. something equivalent to WHERE _b.authors.'name' = 'Arthur C. Clarke'.

You are on the right track in your question updates with jsonb and the "contains" operator @>. The best approach depends on what and how you want to filter exactly.

Basic function

Your basic function can be simpler:

CREATE OR REPLACE FUNCTION public.book_get()
  RETURNS jsonb
  LANGUAGE sql STABLE AS
$func$
SELECT jsonb_agg(books)
FROM  (
   SELECT b.data || jsonb_build_object('id', b.id, 'authors', a.authors) AS books
   FROM   book b
   LEFT   JOIN (  -- LEFT JOIN to include books without authors
      SELECT book_id, jsonb_agg(data_plus) AS authors
      FROM  (
         SELECT ba.book_id, jsonb_set(a.data, '{id}', to_jsonb(a.id)) AS data_plus
         FROM   book_author ba
         JOIN   author a ON a.id = ba.author_id
         ORDER  BY ba.book_id, ba.author_id
         ) a0
      GROUP  BY 1
      ) a ON a.book_id = b.id
   ORDER  BY b.id
   ) b0
$func$;

Major points

  • Make it SQL, simpler. No need for plpgsql.
  • Make it STABLE.
  • Don't omit the keyword AS for column aliases.
  • Use jsonb_agg()

If you just want to add the id column as key to your data, there are simpler ways:

  1. With the new jsonb_set() in Postgres 9.5:

         jsonb_set(data, '{id}', to_jsonb(id))
    

    This adds the object or updates the value of an existing object with the same key - the equivalent of an UPSERT in SQL. You can also constrain the operation to UPDATE only, see the manual.
    I use this in the inner subquery to add a single key.

  2. Concatenate two jsonb values:

         b.data || jsonb_build_object('id', b.id, 'authors', a.authors) 
    

Again, existing keys on the same level in the left value are replaced with keys in the right value. I build the object with jsonb_build_object(). See:

I use this in the outer subquery, simpler to add multiple keys. (And to demonstrate both options.

Your original query converted all values to text, which is probably not intended. This query preserves original data types for all jsonb values.

Test result

To test the result of your function for the existence of the author:

SELECT public.book_get() @> '[{"authors": [{"name":"Arthur C. Clarke"}]}]';

You have match the JSON structure in the pattern. And it only works for an exact match. Or you can use jsonb_array_elements() like you added in your last update for partial matches.

Either method is expensive since you test after building a JSON document from three whole tables.

Filter first

To actually filter books that have (possibly among others!) the given author, adapt your underlying query. You ask to filter books that ...

have an author with a middle name 'C.' or a first name 'Arthur'.

SELECT jsonb_agg(b.data || jsonb_build_object('id', b.id, 'authors', a.authors) ORDER BY b.id) AS books
FROM   book b
     , LATERAL (  -- CROSS JOIN since we filter before the join
   SELECT jsonb_agg(jsonb_set(a.data, '{id}', to_jsonb(a.id)) ORDER BY a.id) AS authors
   FROM   book_author ba 
   JOIN   author a ON a.id = ba.author_id
   WHERE  ba.book_id = b.id
   ) a
WHERE  EXISTS (
   SELECT 1                                 -- one of the authors matches
   FROM   book_author ba
   JOIN   author a ON a.id = ba.author_id
   WHERE  ba.book_id = b.id
   AND   (a.data->>'name' LIKE '% C. %' OR  -- middle name 'C.'
          a.data->>'name' LIKE 'Arthur %')  -- or a first name 'Arthur'.
   );

Filter books that have at least one matching author before you build the result.

Note how I use ORDER BY as modifier to the jsob_agg() aggregate function instead of a subquery to sort results like in the previous example. This is typically slower but shorter. And good enough for a small result set. Consider:

If your tables are big and you need the queries fast, use indexes! For this particular query a function trigram GIN index like this should work wonders for big tables:

CREATE INDEX author_special_idx ON author USING gin ((data->>'name') gin_trgm_ops);

Detailed explanation / instructions:

Solution 2:[2]

A nice tutorial on JSOn in postgresql is recommended. If you create your data in this way:

CREATE TABLE json_test (
  id serial primary key,
  data jsonb
);
INSERT INTO json_test (data) VALUES 
  ('{"id":1,"title":"2001: A Space Odyssey","pages":432,"authors":[{"id":1,"fullname":"Arthur C. Clarke"}]}'),
  ('{"id":2,"title":"The City And The City","pages":300,"authors":[{"id":2,"fullname":"China Miéville"}]}'),
  ('{"id":3,"title":"Unknown Book","pages":143,"authors":null}');

You can select with a specific id

SELECT * FROM json_test
WHERE data @> '{"id":2}';

Or look for a specific name in a sub-array:

SELECT * FROM json_test
WHERE data -> 'authors' @> '[{"fullname": "Arthur C. Clarke"}]'

Or find book with over 200 pages:

SELECT * FROM json_test
WHERE (data -> 'pages')::text::int > 200

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1
Solution 2 Joost Döbken