'Dynamic alias and table name in PYSPARK

I have the following sql code:

SELECT DISTINCT
c.test AS test1
,c.hello AS hello1

FROM
(
  (
  SELECT a.*,b.hello
  FROM tabletest_system1 a
  LEFT JOIN
    (
    SELECT hello FROM tablepeople_system1 
    ) b
   ON a.id = b.id 

  UNION

  SELECT x.*,y.hello
  FROM tabletest_system2 x
  LEFT JOIN
    (
    SELECT hello FROM tablepeople_system2
    ) y
    ON x.id = y.id
  )
) c

This creates a certain table. The reason for using UNION is due to the fact that I extract the same two tables from two different systems. Both tables should first be joined and the joined tables should then be stacked using UNION to have one big table consisting of system 1 and 2.

I need to recreate the FROM (line 5 onwards) logic using pyspark and make it dynamic.

My attempt so far in Databricks/pyspark:

%python
systems = "system1,system2"
query = ""
counter = 1
for system in systems.split(','):
  query = query + """SELECT a.*,b.hello
  FROM tabletest_system1 a 
  LEFT JOIN
    (
    SELECT hello from tablepeople_system1
    ) b
     ON a.id = b.id"""
  if counter < len(kernalName.split(',')):
      query =  query + " UNION "
      counter =counter + 1
print(query)

gives the following output

SELECT a.*,b.hello
  FROM tabletest_system1 a 
  LEFT JOIN
    (
    SELECT hello from tablepeople_system1
    ) b
     ON a.id = b.id UNION SELECT a.*,b.hello
  FROM tabletest_system1 a 
  LEFT JOIN
    (
    SELECT hello from tablepeople_system1
    ) b
     ON a.id = b.id

My question is, does anyone know how to transform the python code in such a way that the second portion AFTER THE UNION QUERY does not return the same alias a and b but returns x and y as provided in the first chunk of code.

Hope someone can help out!

Thanks in advance



Solution 1:[1]

You can use f-strings to substitute your dynamic query with variables.

Here is your solution:

systems = "system1,system2"
aliases = ['a','b','x','y']
query = ""
counter = 1
for system in systems.split(','):
  alias1 = aliases[counter*2-2]
  alias2 = aliases[counter*2-1]
  query = query + f"""SELECT {alias1}.*,{alias2}.hello
  FROM tabletest_{system} {alias1} 
  LEFT JOIN
    (
    SELECT hello from tablepeople_{system}
    ) {alias2}
     ON {alias1}.id = {alias2}.id"""
  if counter < len(systems.split(',')):
      query =  query + " UNION "
      counter =counter + 1
print(query)

Output:

  SELECT a.*,b.hello
  FROM tabletest_system1 a 
  LEFT JOIN
    (
    SELECT hello from tablepeople_system1
    ) b
     ON a.id = b.id UNION SELECT x.*,y.hello
  FROM tabletest_system2 x 
  LEFT JOIN
    (
    SELECT hello from tablepeople_system2
    ) y
     ON x.id = y.id

For more tables, you can just add the aliases to the alias list on line 2.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 partha_devArch