Using temp tables in PySpark queries

I need to read some data into Spark using SQL. For performance reasons, the query really requires a temp table. When I try to use a query that includes a temp table, as outlined below, however, Spark insists that there is a syntax error, even though the SQL itself is valid.

For example, this works fine:

# pre-emptive stuff that loads connection parameters and whatnot

query = """
select top 50
  *
from DW.Dim.Sales
"""

df = spark.read\
    .format("jdbc")\
    .option("url", url)\
    .option("query", query)\
    .option("user", 'svcDataBricks_DEV')\
    .option("password", sql_password)\
    .load()

But this returns an error:

query = """
select top 50
  *
into #MyData
from DW.Dim.Sales
"""

df = spark.read\
    .format("jdbc")\
    .option("url", url)\
    .option("query", query)\
    .option("user", 'svcDataBricks_DEV')\
    .option("password", sql_password)\
    .load()

Specifically this error:

com.microsoft.sqlserver.jdbc.SQLServerException: Incorrect syntax near the keyword 'into'.

I guess the problem is that spark.read is not designed to perform any kind of write action, but the documentation is somewhat opaque, so I can't quite work out how to make this work.



Solution 1:[1]

See my answer on this related SO question.

In short, it's not working because the driver wraps your SQL: it prepends SELECT * FROM ( and appends ) spark_generated_alias. Since T-SQL doesn't allow SELECT ... INTO inside a derived table, what you wrote becomes syntactically invalid once wrapped. You have to get creative with your own custom "SQL injection attack" to get what you want.
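As a sketch of a cleaner way out: newer Spark releases (3.4+) document a prepareQuery option on the JDBC reader for exactly this case. The prefix you pass is sent ahead of the wrapped query rather than inside it, so it can populate the temp table first. The following is a minimal sketch, assuming Spark 3.4+ and reusing the url and sql_password variables from the question; the alias name in the comment is illustrative.

# With the "query" option, the connector executes roughly:
#
#   SELECT * FROM ( <your query> ) spark_generated_alias
#
# T-SQL forbids SELECT ... INTO inside a derived table, hence the
# "Incorrect syntax near the keyword 'into'" error.
#
# Workaround sketch, assuming Spark 3.4+: "prepareQuery" is prepended
# to the generated statement outside the subquery wrapping, so it can
# create and fill the temp table before the wrapped SELECT runs.

prepare = "(SELECT TOP 50 * INTO #MyData FROM DW.Dim.Sales)"

df = spark.read\
    .format("jdbc")\
    .option("url", url)\
    .option("prepareQuery", prepare)\
    .option("query", "SELECT * FROM #MyData")\
    .option("user", 'svcDataBricks_DEV')\
    .option("password", sql_password)\
    .load()

On older Spark versions you are left with the injection route: write the query so that, once the driver adds its prefix and suffix, the full batch is still valid T-SQL. That approach is brittle, because Spark issues more than one statement against the source (a schema probe, then the actual read) and #temp tables are scoped to the session, so test it carefully before relying on it.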

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

[1] Solution 1: Granger