Counting the number of occurrences of unique values using Pig Latin

I am trying to find the 5 most downloaded RStudio packages on December 1, 2019 (from http://cran-logs.rstudio.com/) using Apache Pig Latin. The columns I need are 'r_os' and 'package'. Here is my code:

A = load '2019-12-01.csv' USING org.apache.pig.piggybank.storage.CSVExcelStorage(',', 'NO_MULTILINE', 'UNIX', 'SKIP_INPUT_HEADER');
B = FOREACH A GENERATE r_os,package;
C = DISTINCT B;
D = GROUP C BY package;
result = FOREACH C GENERATE flatten($0), COUNT($1) as package_distr;

I'm getting the following result, which is wrong:

(magrittr,10)
(htmltools,10)
(httr,10)
(lubridate,10)
(ellipsis,10)

Every count comes out as 10, but the real numbers of occurrences should be much higher. My desired output should look approximately like:

(magrittr,10000)
(htmltools,9876)
(httr,8700)
(lubridate,5320)
(ellipsis,3000)

Any idea what I'm doing wrong?



Solution 1:[1]

Iterate over the grouped relation D rather than C:

result = FOREACH D GENERATE group, COUNT(C) as package_distr;

Here group is the package name (the grouping key), and C is the name of the bag that GROUP C BY package produces for each key, which COUNT then counts.
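For reference, here is a minimal end-to-end sketch of how a top-5 count could be put together. It rests on several assumptions that are not in the question: the piggybank.jar path, the column list given in the AS clause (the ten columns the CRAN log files are generally documented to have, with the unused ones renamed here for illustration), and the alias names counts, sorted and top5. It also leaves out the DISTINCT step, because DISTINCT collapses repeated (r_os, package) rows, so a count taken after it reflects the number of distinct r_os values per package rather than the number of downloads.

```pig
-- Assumption: piggybank.jar is in the working directory; adjust the path to your install.
REGISTER piggybank.jar;

-- Assumption: the file has these ten columns in this order. Without an AS clause,
-- fields can only be referenced by position ($5, $6), not by name.
A = LOAD '2019-12-01.csv'
    USING org.apache.pig.piggybank.storage.CSVExcelStorage(',', 'NO_MULTILINE', 'UNIX', 'SKIP_INPUT_HEADER')
    AS (dl_date:chararray, dl_time:chararray, dl_size:long, r_version:chararray,
        r_arch:chararray, r_os:chararray, package:chararray, pkg_version:chararray,
        country:chararray, ip_id:long);

-- Keep only the column being counted; every log row is treated as one download.
B = FOREACH A GENERATE package;

-- One group per package; each group holds a bag named B with that package's rows.
D = GROUP B BY package;

-- Count the rows in each bag.
counts = FOREACH D GENERATE group AS package, COUNT(B) AS downloads;

-- Sort by count, largest first, and keep the first five.
sorted = ORDER counts BY downloads DESC;
top5 = LIMIT sorted 5;

DUMP top5;
```

If a per-OS breakdown is what is actually wanted, grouping by (r_os, package) instead of package alone would be the analogous change.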

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 saph_top