Condensing a bunch of columns into one array column mapped to a key

I'm doing a project that analyzes covid data, and I'm trying to clean it of null values and the like, but first I need to make it usable. It currently has an individual column for every date, holding the number of new cases that day. The Combined_Key column is unique, so that's what I was going to map the dates and cases to. Every column is also of type String, so I imagine I'll need to insert the data into a DataFrame that's set up with the correct types, but I don't know how to do that without typing out 450 date columns separately. Even more exciting, I haven't found an inherent date type in Spark/Scala, so I'm not sure how to handle that either.

UID,iso2,iso3,code3,FIPS,Admin2,Province_State,Country_Region,Lat,Long_,Combined_Key,1/22/20,1/23/20,1/24/20,1/25/20,1/26/20,1/27/20,1/28/20,1/29/20,1/30/20,1/31/20,2/1/20,2/2/20,2/3/20,2/4/20,2/5/20,2/6/20,2/7/20,2/8/20,2/9/20,2/10/20,2/11/20,2/12/20,2/13/20,2/14/20,2/15/20,2/16/20,2/17/20,2/18/20,2/19/20,2/20/20,2/21/20,2/22/20,2/23/20,2/24/20,2/25/20
84001001,US,USA,840,1001.0,Autauga,Alabama,US,32.53952745,-86.64408227,"Autauga, Alabama, US",0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,5,6,6,6,6,8,8,10,12,12,12,12,12,12,12,17,18,19,19,19,23,24,24,24,25,26,28,30,32,33,36,36,36,37,39,41,42,43,47,51,54,54,56,58,62,63,72,81

That's part of the top two rows of the data; a whole lot of date columns have been left out. I'm working in the spark shell. After turning the data into a table, I've tried the attempts below, but they fail with "error: 5 more arguments than can be applied to method ->: (y: B)(String, B)" and "error: type mismatch;" respectively.

import scala.collection.mutable.ArrayBuffer

var covidMap = scala.collection.mutable.Map[String, ArrayBuffer[Int]]()
table.foreach{x => covidMap += (x(10)).toString -> (x(11),x(20),x(30),x(40),x(50),x(60))}
table.foreach{x => covidMap += (x(10)).toString -> (x(11))}

Honestly, I don't know if these are even close to what I need to be doing. I've been coding for 5 weeks in a training program and it's been incredibly difficult for me so far, so here I am. Any help is appreciated!



Solution 1:[1]

Starting with an example DataFrame (taking your first two date columns and adding today's date to show the approach will keep working as new date columns appear):

// in the spark shell, toDF is available via the pre-imported spark.implicits._
val df = List(
    (84001001,"US","USA",840,1001.0,"Autauga","Alabama","US",32.53952745,-86.64408227,"Autauga, Alabama, US",0,0,50)
)
.toDF("UID","iso2","iso3","code3","FIPS","Admin2","Province_State","Country_Region","Lat","Long_","Combined_Key","1/22/20","1/23/20","4/2/22")

// note: show() returns Unit, so call it separately rather than chaining it
// onto the assignment, or df won't be a DataFrame
df.show()

gives:

+--------+----+----+-----+------+-------+--------------+--------------+-----------+------------+--------------------+-------+-------+------+
|     UID|iso2|iso3|code3|  FIPS| Admin2|Province_State|Country_Region|        Lat|       Long_|        Combined_Key|1/22/20|1/23/20|4/2/22|
+--------+----+----+-----+------+-------+--------------+--------------+-----------+------------+--------------------+-------+-------+------+
|84001001|  US| USA|  840|1001.0|Autauga|       Alabama|            US|32.53952745|-86.64408227|Autauga, Alabama, US|      0|      0|    50|
+--------+----+----+-----+------+-------+--------------+--------------+-----------+------------+--------------------+-------+-------+------+
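As an aside: since the real file is a CSV whose columns all arrive as strings, you don't have to declare 450 typed columns by hand; Spark can infer the types at read time. A small sketch, where the file path is only a placeholder:

// header: treat the first row as column names
// inferSchema: scan the file so the count columns come back as integers
val covidDf = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("path/to/covid_data.csv") // placeholder path

The examples below stick with the hand-built df for clarity.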

We can then create a new column, which I've called dates (you can easily rename it). Here the array function is used to combine the values of all the date columns into a single column whose type is an array:

import org.apache.spark.sql.functions.{array, col}

val dateRegex = "\\d+/\\d+/\\d+" // matches all columns in x/y/z format
val dateColumns = df.columns.filter(_.matches(dateRegex))

df
  // select all date columns and combine into a new column: `dates`
  .withColumn("dates", array(dateColumns.map(df(_)): _*))
  // drop the original date columns, keeping `dates`
  .drop(dateColumns: _*)
  .show(false)

gives:

+--------+----+----+-----+------+-------+--------------+--------------+-----------+------------+--------------------+----------+
|UID     |iso2|iso3|code3|FIPS  |Admin2 |Province_State|Country_Region|Lat        |Long_       |Combined_Key        |dates     |
+--------+----+----+-----+------+-------+--------------+--------------+-----------+------------+--------------------+----------+
|84001001|US  |USA |840  |1001.0|Autauga|Alabama       |US            |32.53952745|-86.64408227|Autauga, Alabama, US|[0, 0, 50]|
+--------+----+----+-----+------+-------+--------------+--------------+-----------+------------+--------------------+----------+
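If what you ultimately want is the plain Scala Map from the question (Combined_Key mapped to the case counts), one option is to collect the condensed DataFrame back to the driver, which is reasonable for a county-level dataset of this size. A sketch, reusing df and dateColumns from above; note that with the real all-string CSV you'd cast the counts first, e.g. df(c).cast("int"):

// build the condensed frame once so we can collect from it
val condensed = df
  .withColumn("dates", array(dateColumns.map(df(_)): _*))
  .drop(dateColumns: _*)

// pull the key and array columns to the driver and build the Map
val covidMap: Map[String, Seq[Int]] =
  condensed
    .select("Combined_Key", "dates")
    .collect()
    .map(row => row.getString(0) -> row.getSeq[Int](1))
    .toMap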

A downside to this array approach is that the dates themselves are lost: the output keeps only the counts, in column order, with no record of which date each count belongs to.
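One way to keep the dates is map_from_arrays (available in Spark 2.4+), which pairs each date, taken from the column name, with its count in a single map column. A minimal sketch along those lines, again reusing dateColumns from above; the cases_by_date name is just an illustration:

import org.apache.spark.sql.functions.{array, lit, map_from_arrays}

df
  .withColumn(
    "cases_by_date",
    map_from_arrays(
      array(dateColumns.map(c => lit(c)): _*), // keys: the date column names
      array(dateColumns.map(df(_)): _*)        // values: the matching counts
    )
  )
  .drop(dateColumns: _*)
  .show(false)

And on the date-type worry in the question: Spark does have a proper DateType, so if you want real dates rather than strings, to_date with the format "M/d/yy" should parse these column names.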

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

[1] Solution 1 by tjheslin1