Parsing SQL with Python

I want to create a SQL interface on top of a non-relational data store. The store itself is non-relational, but it makes sense to access the data in a relational manner.

I am looking into using ANTLR to produce an AST that represents the SQL as a relational algebra expression, then returning data by evaluating/walking that tree.

I have never implemented a parser before, and I would therefore like some advice on how to best implement a SQL parser and evaluator.

  • Does the approach described above sound about right?
  • Are there other tools/libraries I should look into, such as PLY or pyparsing?
  • Pointers to articles, books, or source code that will help me would be appreciated.

Update:

I implemented a simple SQL parser using pyparsing. Combined with Python code that implements the relational operations against my data store, this was fairly simple.
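The evaluator half of this approach can be surprisingly small. Here is a minimal stdlib-only sketch (the function names and dict-based rows are illustrative assumptions, not the poster's actual code) of relational operators that a parsed query tree could be evaluated against, with each operator consuming and yielding iterables of rows:

```python
def select(rows, predicate):
    """Relational selection: keep only rows matching the predicate."""
    return (row for row in rows if predicate(row))

def project(rows, columns):
    """Relational projection: keep only the named columns."""
    return ({c: row[c] for c in columns} for row in rows)

# Rows from the non-relational store, modeled as dicts.
data = [
    {"name": "ada", "age": 36},
    {"name": "bob", "age": 17},
]

# Equivalent to: SELECT name FROM data WHERE age > 18
result = list(project(select(data, lambda r: r["age"] > 18), ["name"]))
print(result)  # [{'name': 'ada'}]
```

Walking the AST then amounts to mapping each node (WHERE clause, column list, and so on) onto a composition of such operators.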

As I said in one of the comments, the point of the exercise was to make the data available to reporting engines. To do this, I will probably need to implement an ODBC driver, which is likely a lot of work.



Solution 1:[1]

This reddit post suggests python-sqlparse as an existing implementation, among a couple other links.

Solution 2:[2]

TwoLaid's Python SQL Parser works very well for my purposes. It's written in C and needs to be compiled. It is robust. It parses out individual elements of each clause.

https://github.com/TwoLaid/python-sqlparser

I'm using it to parse out query column names to use in report headers. Here is an example.

import sqlparser

def get_query_columns(sql):
   '''Return a list of column headers from the given SQL's SELECT clause'''

   columns = []

   parser = sqlparser.Parser()

   # Parser does not like new lines
   sql2 = sql.replace('\n', ' ')

   # Check for syntax errors
   if parser.check_syntax(sql2) != 0:
      raise Exception('get_query_columns: SQL invalid.')

   stmt = parser.get_statement(0)
   root = stmt.get_root()
   qcolumns = root.__dict__['resultColumnList']
   for qcolumn in qcolumns.list:
      if qcolumn.aliasClause:
         alias = qcolumn.aliasClause.get_text()
         columns.append(alias)
      else:
         name = qcolumn.get_text()
         name = name.split('.')[-1] # remove table alias
         columns.append(name)

   return columns

sql = '''
SELECT 
   a.a,
   replace(coalesce(a.b, 'x'), 'x', 'y') as jim,
   a.bla as sally  -- some comment
FROM
   table_a as a
WHERE
   c > 20
'''

print(get_query_columns(sql))

# output: ['a', 'jim', 'sally']
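If the data can be staged in SQLite, a stdlib-only alternative (my suggestion, not part of the original answer) sidesteps parsing entirely: prepare the query against an in-memory database and read the output column names from the cursor description. The `table_a` schema below is assumed for illustration:

```python
import sqlite3

def get_query_columns(sql):
    """Return output column names by executing the query against an
    in-memory SQLite database containing the referenced tables."""
    conn = sqlite3.connect(":memory:")
    # The tables named in the query must exist; columns need no types.
    conn.execute("CREATE TABLE table_a (a, b, bla, c)")
    cur = conn.execute(sql)
    # cursor.description holds one 7-tuple per output column;
    # the first element is the column name (alias if one was given).
    return [d[0] for d in cur.description]

sql = '''
SELECT
   a.a,
   replace(coalesce(a.b, 'x'), 'x', 'y') as jim,
   a.bla as sally  -- some comment
FROM
   table_a as a
WHERE
   c > 20
'''

print(get_query_columns(sql))  # ['a', 'jim', 'sally']
```

The trade-off is that the schema must be mirrored in SQLite and the SQL must be a dialect SQLite accepts, so this works best when the queries are already known to be valid.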

Solution 3:[3]

Of course, it may be best to leverage python-sqlparse on Google Code.

UPDATE: Now I see that this has already been suggested above. I concur that it is worthwhile.

Solution 4:[4]

I am using python-sqlparse with great success.

In my case I am working with queries that are already validated, so my AST-walking code can make some sane assumptions about their structure.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution 1: Dylan Hogg
Solution 2:
Solution 3: Mathieu Longtin
Solution 4: Overbryd