'Simple way to parse SQL subqueries

I am working on a SQL analysis tool that, given a RAW SQL SELECT query, can give some sort of analysis. The first version of the tool is finished and can analyze simple RAW queries. However, when the query contains a subquery it breaks.

So I am looking for a simple but reliable way to parse queries and subqueries. My tool must analyze every subquery individually so for example:

Suppose this is the query that the tool is given as input:

SELECT name, email
FROM (SELECT * FROM user WHERE email IS NOT NULL)
WHERE id IN (SELECT cID FROM customer WHERE points > 5)

Then I would like to get a list of queries like so:

queries = [
    "SELECT name, EMAIL FROM <subquery> WHERE id in <subquery>"
    "SELECT * FROM user WHERE email IS NOT NULL"
    "SELECT cID FROM customer WHERE points > 5)"
]

In my first attempt, I am using the fact that subqueries are always written between brackets. So I scan the initial query for brackets. This works when subqueries aren't nested i.e. the are no subqueries inside subqueries. I also experimented a bit with AST, but felt that it was probably a bit too complicated and that there are probably more simple ways.

Anyone who's able to guide me in the right direction? I am using Python, but examples in other languages are also much appreciated.



Solution 1:[1]

You can use sqlparse:

import sqlparse
def queries(d):
  if type(d) != sqlparse.sql.Token:
     paren = isinstance(d, sqlparse.sql.Parenthesis)
     v = [queries(i) for i in (d if not paren else d[1:-1])]
     subseq, qrs = ''.join(str(i[0]) for i in v), [x for _, y in v for x in y]
     if [*d][paren].value == 'SELECT':
        return '<subquery>', [subseq]+qrs
     return subseq, qrs
  return d, []

s="""SELECT name, email
     FROM (SELECT * FROM user WHERE email IS NOT NULL)
     WHERE id IN (SELECT cID FROM customer WHERE points > 5)
"""
_, subqueries = queries(sqlparse.parse(s)[0])

Output:

['SELECT name, email\n     FROM <subquery>\n     WHERE id IN <subquery>\n', 'SELECT * FROM user WHERE email IS NOT NULL', 'SELECT cID FROM customer WHERE points > 5']

Using the sqlparse library, you can parse a SQL input string into a tokenized stream of keywords, statements, and values. The function queries above takes in a sqlparse.sql.Statement object and searches for any occurrence of a SELECT statement in the query, reformatting the original input along the way to remove subqueries, per the desired output sample.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1