'AWS Athena/Presto SQL: Having trouble getting null values

I am doing a query in aws Athena where I want to get some total values, however I am having issues getting a column where the values are null, this column sometimes contains the value of [] that is consider also as null

Sample

My query

SELECT COUNT() AS total_rows,
    COUNT(DISTINCT sfattachmentid) AS total_attachments,
    (SELECT COUNT(DISTINCT salesforce_opportunity_id) FROM "athena_decisionengine"."transactions") AS total_opps,
    (SELECT COUNT(DISTINCT salesforce_opportunity_id) FROM "athena_decisionengine"."transactions" WHERE (oldcategory IS NOT NULL OR oldcategory != '[]')) AS opp_w_changed,
    (SELECT COUNT(DISTINCT salesforce_opportunity_id) FROM "athena_decisionengine"."transactions" WHERE (oldcategory IS NULL OR oldcategory = '[]')) AS opp_without_changed,
    SUM(CASE WHEN oldcategory != '' THEN 1 ELSE 0 END) AS oldCategory_changed,
    SUM(CASE WHEN oldcategory IS NULL THEN 1 ELSE 0 END) AS oldCategory_blank
FROM "athena_decisionengine"."transactions" 

Is giving the following results

SQL Results

However, the value of opp_without_changed seems wrong, becuase if I have total_opps of 1282 and opp_w_changed as 1110 I should expect opp_without_changed to be 172, but is showing me 1282 that seems to be the total of unique salesforce_opportunity_id, so it is like if the filter:

(oldcategory IS NULL OR oldcategory = '[]'))

Was not working



Solution 1:[1]

There are two problems in your query

  1. wrong boolean expressions
  2. wrong assumption that
    coumnt(distinct) = count(distinct NULL) + count(distinct NOT NULL)

This boolean expression oldcategory IS NOT NULL OR oldcategory != '[]' allows any value except NULL, it allows '[]' as well because '[]' is not NULL. If you want to filter out NULLs and '[]' then correct expression should be oldcategory != '[]' it does not allow NULLs as well because NULL can not be equal or not equal to something. Also it can be empty strings, not NULLs, with empty strings filtered also it will be

oldcategory not in ('[]','') --does not allow NULL, '[]', ''

Second expression including empty rows will be:

oldcategory IS NULL OR oldcategory in ('[]','')  --allows NULL, '[]', '' only

Also you are counting DISTINCT salesforce_opportunity_id, not just rows satisfying the WHERE condition, the same salesforce_opportunity_id can possibly have records with NULL, empty, '[]' and other values, so these datasets can intersect and you should NOT expect that

count (distinct salesforce_opportunity_id  ) = count(distinct salesforce_opportunity_id  where oldcategory is NULL) + count (distinct salesforce_opportunity_id  where oldcategory is NOT NULL)

DISTINCT counts are not additive. If you want check that TOTAL = NULLs + NOT NULLs, count everything without DISTINCT and it should match.

Solution 2:[2]

For most use case?give a default value to those NULL value should be ok.

coalesce(oldcategory,'[]') not in (a,b,c,d)

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1
Solution 2 suiwenfeng