How to process JSON data in a column using Python/PySpark?
Trying to process JSON data in a column on Databricks. Below is sample data from a table (it holds weather device records):
| JSON_Info | 
|---|
| {"sampleData":"dataDetails: 1 001 2010/01/02 01:09:10 [device_info(1)] Weather=65F Wind Speed(mph)=12 UV Index=0_2 "} | 
| {"sampleData":"dataDetails: 2 002 2010/01/02 01:10:03 [device_info(1)] Weather=66F Wind Speed(mph)=13 UV Index=0_2 "} | 
| {"sampleData":"dataDetails: 3 003 2010/01/02 01:11:14 [device_info(1)] Weather=67F Wind Speed(mph)=14 UV Index=0_2 "} | 
| {"sampleData":"dataDetails: 4 004 2010/01/02 01:12:23 [device_info(1)] Weather=68F Wind Speed(mph)=15 UV Index=0_2 "} | 
Every record starts with "sampleData" as the key; the value is one long string such as: "dataDetails: 1 001 2010/01/02 01:09:10 [device_info(1)] Weather=65F Wind Speed(mph)=12 UV Index=0_2 ".
Ideally, I would like to split the value (dataDetails) into separate columns like below:
| Index | SI | Date | Time | DeviceNumber | WeatherDegree | WindSpeed | UVIndex | 
|---|---|---|---|---|---|---|---|
| 1 | 001 | 2010/01/02 | 01:09:10 | [device_info(1)] | 65F | 12 | 0_2 | 
| 2 | 002 | 2010/01/02 | 01:10:03 | [device_info(1)] | 66F | 13 | 0_2 | 
| 3 | 003 | 2010/01/02 | 01:11:14 | [device_info(1)] | 67F | 14 | 0_2 | 
| 4 | 004 | 2010/01/02 | 01:12:23 | [device_info(1)] | 68F | 15 | 0_2 | 
Below are some of my thoughts (but I am not sure how to approach it):
- Once I have the string "1 001 2010/01/02 01:09:10 [device_info(1)] Weather=65F Wind Speed(mph)=12 UV Index=0_2", split it on spaces to get most of the fields (before splitting, the spaces in "Wind Speed" and "UV Index" need to be removed so they become "WindSpeed" and "UVIndex").
- Then use the left side of the "=" sign as the column name (where an equal sign is present).
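The two steps above can be sketched in plain Python first, before worrying about PySpark. This is only an illustration of the idea (the function name and the hard-coded positional fields are my own assumptions, not part of the question):

```python
def parse_sample_data(value: str) -> dict:
    # Collapse the two-word keys so that splitting on spaces keeps each
    # key=value pair in a single token, as suggested in the question.
    normalized = (value
                  .replace("Wind Speed(mph)=", "WindSpeed=")
                  .replace("UV Index=", "UVIndex="))
    tokens = normalized.split()
    # tokens[0] is the 'dataDetails:' prefix; the next five are positional.
    record = {
        "Index": tokens[1],
        "SI": tokens[2],
        "Date": tokens[3],
        "Time": tokens[4],
        "DeviceNumber": tokens[5],
    }
    # For key=value tokens, the left side of '=' becomes the column name.
    for token in tokens[6:]:
        key, _, val = token.partition("=")
        record[key] = val
    return record

row = parse_sample_data(
    "dataDetails: 1 001 2010/01/02 01:09:10 [device_info(1)] "
    "Weather=65F Wind Speed(mph)=12 UV Index=0_2 ")
# row -> {'Index': '1', 'SI': '001', 'Date': '2010/01/02', 'Time': '01:09:10',
#         'DeviceNumber': '[device_info(1)]', 'Weather': '65F',
#         'WindSpeed': '12', 'UVIndex': '0_2'}
```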
In short: how can I use Python/PySpark to extract multiple fields from the value inside a JSON data column?
Can someone please help?
Thanks a lot.
Solution 1:[1]
Assuming your column already holds a JSON string, you can parse it with from_json and then use the from_csv function with the option sep set to ' ' so that blank space acts as the separator:
from pyspark.sql import functions as F
(df
    .withColumn('JSON_Info', F.from_json('JSON_Info', 'sampleData string'))
    .select(F.from_csv('JSON_Info.sampleData', 'c1 string, Index string, SI string, Date string, Time string, DeviceNumber string, WeatherDegree string, c8 string, Wind string, c10 string, UVIndex string', {'sep': ' '}).alias('csv'))
    .select('csv.*')
    .drop('c1', 'c8', 'c10')
    .withColumn('WeatherDegree', F.split('WeatherDegree', '=')[1])
    .withColumn('Wind', F.split('Wind', '=')[1])
    .withColumn('UVIndex', F.split('UVIndex', '=')[1])
    .show()
)
+-----+---+----------+--------+----------------+-------------+----+-------+
|Index|SI |Date      |Time    |DeviceNumber    |WeatherDegree|Wind|UVIndex|
+-----+---+----------+--------+----------------+-------------+----+-------+
|1    |001|2010/01/02|01:09:10|[device_info(1)]|65F          |12  |0_2    |
|2    |002|2010/01/02|01:10:03|[device_info(1)]|66F          |13  |0_2    |
|3    |003|2010/01/02|01:11:14|[device_info(1)]|67F          |14  |0_2    |
|4    |004|2010/01/02|01:12:23|[device_info(1)]|68F          |15  |0_2    |
+-----+---+----------+--------+----------------+-------------+----+-------+
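If the purely positional split feels brittle, the same fields can also be pulled out with a single regular expression. Here is a plain-Python sketch of that pattern (the group names are illustrative choices of mine, not part of the answer above); the same regex could be ported to PySpark via regexp_extract, one call per capture group:

```python
import json
import re

# Regex mirroring the record layout shown above; group names are illustrative.
PATTERN = re.compile(
    r"dataDetails:\s+(?P<Index>\d+)\s+(?P<SI>\d+)\s+"
    r"(?P<Date>\S+)\s+(?P<Time>\S+)\s+(?P<DeviceNumber>\S+)\s+"
    r"Weather=(?P<WeatherDegree>\S+)\s+"
    r"Wind Speed\(mph\)=(?P<WindSpeed>\S+)\s+"
    r"UV Index=(?P<UVIndex>\S+)"
)

raw = ('{"sampleData":"dataDetails: 1 001 2010/01/02 01:09:10 '
       '[device_info(1)] Weather=65F Wind Speed(mph)=12 UV Index=0_2 "}')
# Parse the JSON wrapper, then run the regex over the inner string.
fields = PATTERN.search(json.loads(raw)["sampleData"]).groupdict()
# fields -> {'Index': '1', 'SI': '001', 'Date': '2010/01/02',
#            'Time': '01:09:10', 'DeviceNumber': '[device_info(1)]',
#            'WeatherDegree': '65F', 'WindSpeed': '12', 'UVIndex': '0_2'}
```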
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | |
