How to convert dataframe column to dictionary
Firstly, I want to thank everybody in advance for any help! I have 4 tables; I joined them and got a PySpark DataFrame. One of the DataFrame columns looks like this, and it has about 200 000 records:
{"table_name":"BTR.DAILY_BTR.JSC_MON","login":"0015471"}
{"table_name":"BTR.DAILY_BTR.ESHOP.JSC_MON","login":"0015471"}
The type of this column is 'string'. I need to get the value for the key table_name. I tried to use the json method .loads:
sparam = t1.select(col('ADD_PARAMS'))
json.loads(sparam)
But I got an error:
TypeError: the JSON object must be str, bytes or bytearray, not DataFrame
Then I tried to change the column type:
sparam = t1.select(col('ADD_PARAMS').cast('string'))
type(sparam)
It shows that the type is dataframe:
pyspark.sql.dataframe.DataFrame
Anyway, I tried to use the "loads" method again:
json.loads(sparam)
But I got the same error:
TypeError: the JSON object must be str, bytes or bytearray, not DataFrame
I tried different options to get the value of table_name, ranging from converting to JSON, to dict, to regex and splits, but nothing helped.
UPD
Here is some of the code that is used:
import io
import sys
import pandas as pd
import numpy as np
import json
import findspark
findspark.init('/mnt/nfs-spark/spark-2.3.3/')
findspark.find()
import pyspark
import pyspark.sql.functions as F
from pyspark.sql.functions import *
from pyspark.sql.types import *
from pyspark.sql import Window, HiveContext
from pyspark.sql.functions import col, lit
from pyspark.sql import Row
from pyspark.sql.functions import udf
survey_requests = spark.read.parquet('/mnt/gluster-storage/etl/download/surv/survey_requests/*')
channels = spark.read.parquet('/mnt/gluster-storage/etl/download/surv/channels')
channel_segments = spark.read.parquet('/mnt/gluster-storage/etl/download/surv/channel_segments')
channel_branches = spark.read.parquet('/mnt/gluster-storage/etl/download/surv/channel_branches')
channel_touchpoints = spark.read.parquet('/mnt/gluster-storage/etl/download/surv/channel_touchpoints')
t1 = (c.join(cs, c.IDD_SEGMENT == cs.ID_SEGMENT, "left")
      .join(ct, ct.ID_TOUCHPOINT == c.IDD_TOUCHPOINT, "left")
      .join(cb, cb.ID_BRANCH == c.IDD_BRANCH, "left")
      .join(survey_requests, c.ID_CHANNEL == survey_requests.CHANNEL_ID, "right"))
sparam = t1.select(col('ADD_PARAMS').cast('string'))
type(sparam)
pyspark.sql.dataframe.DataFrame
json.loads(sparam)
TypeError: the JSON object must be str, bytes or bytearray, not DataFrame
Solution 1:[1]
It's .cast('string'), not .cast('str')
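As a side note on the TypeError itself: json.loads parses one Python string at a time, while t1.select(...) always returns a DataFrame, so passing the selection to json.loads can never work, regardless of the column's cast. Parsing has to happen per value. A minimal sketch using the two sample strings from the question (plain Python, outside Spark):

```python
import json

# Example values as they appear in the ADD_PARAMS column
rows = [
    '{"table_name":"BTR.DAILY_BTR.JSC_MON","login":"0015471"}',
    '{"table_name":"BTR.DAILY_BTR.ESHOP.JSC_MON","login":"0015471"}',
]

# json.loads works on one string at a time, not on a DataFrame
table_names = [json.loads(s)["table_name"] for s in rows]
print(table_names)
# ['BTR.DAILY_BTR.JSC_MON', 'BTR.DAILY_BTR.ESHOP.JSC_MON']
```

For 200 000 records it is cheaper to stay inside Spark and use the built-in get_json_object from pyspark.sql.functions, e.g. t1.select(get_json_object(col('ADD_PARAMS'), '$.table_name').alias('table_name')) — a sketch assuming ADD_PARAMS holds valid JSON strings.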
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | pltc |
