'SQL in Databricks: Only keep oldest records from each month
I have a table tab1 with several columns including a date column. I want to create a view v1 with only the oldest records for each month and remove any records that came before. Duplicates should be kept and a new month_year column created.
INPUT TABLE tab1
| Provider | report_date | other columns |
|---|---|---|
| Apple | 20220101 | |
| Apple | 20220105 | |
| Apple | 20220120 | |
| Apple | 20220120 | |
| Apple | 20220205 | |
| Apple | 20220210 |
OUTPUT VIEW v1
| Provider | report_date | month_year | other columns |
|---|---|---|---|
| Apple | 20220120 | 01/2022 | |
| Apple | 20220120 | 01/2022 | |
| Apple | 20220210 | 02/2022 |
I've used the below code, but it does not work because report_date (yyyymmdd) is an int.
CREATE OR REPLACE TEMPORARY VIEW v1 AS
SELECT
*
FROM
tab1
WHERE
report_date IN (
SELECT MAX(report_date)
FROM tab1
GROUP BY MONTH(report_date), YEAR(report_date)
)
I've tried to cast report_date to char in the SELECT statement part so that the above works, but that gives me an error as well:
CAST(dataset_day_id AS varchar),
I've also tried CONVERT function, but I get the following error:
AnalysisException: Undefined function: 'CONVERT'. This function is neither a registered temporary function nor a permanent function registered in the database 'default'
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|
