I am new to PySpark. I am trying to one-hot encode an IoT dataset for a deep learning implementation.

When I try to fit the pipeline, I get an error like this:

Py4JJavaError: An error occurred while calling o1380.fit.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 6 in stage 71.0 failed 1 times, most recent failure: Lost task 6.0 in stage 71.0 (TID 682) (LAPTOP-DMUSUVBM executor driver): java.lang.IllegalArgumentException: requirement failed: Vector should have dimension larger than zero.

Here is the code I tried:

def select_features_to_scale(df1, lower_skew=-2, upper_skew=2, dtypes='int32'):
    # Convert once: toPandas() collects the whole DataFrame to the driver
    pdf = df1.toPandas().select_dtypes(include=[dtypes])

    selected_features = []
    for feature in pdf.columns:
        # Keep features whose kurtosis falls outside [lower_skew, upper_skew]
        kurt = pdf[feature].kurtosis()
        if kurt < lower_skew or kurt > upper_skew:
            selected_features.append(feature)

    return selected_features
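
The selection rule in select_features_to_scale can be tried out without Spark. The sketch below is an illustration only: the helper names and the toy columns are hypothetical, and the kurtosis formula is the bias-corrected Fisher estimator, which is assumed here to match what pandas computes for Series.kurtosis(). It shows that the function returns an empty list when no column falls outside the thresholds.

```python
import math


def sample_excess_kurtosis(values):
    """Bias-corrected excess kurtosis (Fisher definition, normal == 0)."""
    n = len(values)
    if n < 4:
        return math.nan  # the estimator needs at least four observations
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    if m2 == 0:
        return math.nan  # undefined for a constant column
    g2 = m4 / m2 ** 2 - 3.0
    return (n - 1) / ((n - 2) * (n - 3)) * ((n + 1) * g2 + 6.0)


def select_by_kurtosis(columns, lower=-2.0, upper=2.0):
    """Return the names of columns whose kurtosis lies outside [lower, upper]."""
    selected = []
    for name, values in columns.items():
        kurt = sample_excess_kurtosis(values)
        # NaN fails both comparisons, so undefined columns are never selected
        if kurt < lower or kurt > upper:
            selected.append(name)
    return selected


# Hypothetical toy columns, not taken from the IoT dataset
toy = {
    "flat": [1, 2, 3, 4, 5],
    "spiky": [0, 0, 0, 0, 0, 0, 0, 0, 0, 100],
}

print(select_by_kurtosis(toy))
print(select_by_kurtosis({"flat": toy["flat"]}))
```

If the second call's situation applies to the real data (for example, because no column has the requested dtype after the pandas conversion), the list passed on to later stages is empty.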

Here I am creating the Spark pipeline:

from pyspark.ml.feature import OneHotEncoder, StandardScaler, StringIndexer, VectorAssembler

cat_features = ['duration', 'orig_bytes', 'resp_bytes', 'orig_pkts', 'proto_icmp', 'proto_tcp', 'proto_udp']
label = 'label'
stages = []

# Loop for StringIndexer and OHE for categorical variables
for features in cat_features:
    string_indexer = StringIndexer(inputCol=features, outputCol=features + "_index")

    # One-hot encode categorical features
    encoder = OneHotEncoder(inputCols=[string_indexer.getOutputCol()],
                            outputCols=[features + "_class_vec"])

    # Append inside the loop so every feature's indexer and encoder is kept
    stages += [string_indexer, encoder]


# Index the label column
label_str_index = StringIndexer(inputCol=label, outputCol="label_index")

# Assemble and scale the numeric features selected by kurtosis
unscaled_features = select_features_to_scale(df1=df1, lower_skew=-2, upper_skew=2, dtypes='int32')
unscaled_assembler = VectorAssembler(inputCols=unscaled_features, outputCol="unscaled_features")
scaler = StandardScaler(inputCol="unscaled_features", outputCol="scaled_features")

stages += [unscaled_assembler, scaler]

# Assemble the one-hot encoded categorical vectors
assembler_inputs = [feature + "_class_vec" for feature in cat_features]
assembler = VectorAssembler(inputCols=assembler_inputs, outputCol="assembled_inputs")

stages += [label_str_index, assembler]

# Combine the scaled numeric and encoded categorical features into one vector
assembler_final = VectorAssembler(inputCols=["scaled_features", "assembled_inputs"], outputCol="features")

stages += [assembler_final]
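
The error message says a vector has zero dimension. One way that could happen, which is an assumption and not something the question confirms, is that the list passed as inputCols to unscaled_assembler is empty, so the scaler receives empty vectors. A minimal sketch of a guard that fails early with a clearer message follows; require_columns is a hypothetical helper, not part of PySpark.

```python
def require_columns(columns, stage_name):
    """Return the columns as a list, or raise if none were selected."""
    columns = list(columns)
    if not columns:
        # Assumption: an assembler built from no columns may emit empty vectors
        raise ValueError(
            f"{stage_name}: no input columns were selected; "
            "check the dtype filter and the kurtosis thresholds"
        )
    return columns


# A non-empty selection passes through unchanged
checked = require_columns(["orig_bytes", "resp_bytes"], "unscaled_assembler")

# An empty selection is reported before any Spark stage is built
try:
    require_columns([], "unscaled_assembler")
    message = None
except ValueError as exc:
    message = str(exc)

print(checked)
print(message)
```

In the pipeline code, the result of select_features_to_scale could be wrapped in this check before it is handed to VectorAssembler. Printing unscaled_features is an even simpler way to see whether the list is empty.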

pyspark apache-spark-mllib

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
