I have been learning Big Data tools, Apache Pig among them. Initially I thought that once the data was loaded, applying SUM inside a FOREACH would return the total amount, but that is not the case. SUM is an aggregate function that works on a bag of values, so the data has to be grouped first (with GROUP ALL for a global total); otherwise SUM gives an error.
We will run the script below:
-- This PIG script sums the total amount of sales

-- First load the data from sales.txt file
data = LOAD 'sales.txt' USING PigStorage(',') AS (name:chararray, price:int, country:chararray);

-- Group the data
grouped = GROUP data ALL;

-- Once grouped generate total sum of all sales
total = FOREACH grouped GENERATE SUM(data.price);

-- Print to screen
DUMP total;
Save the above code in a file with a .pig extension, for example totalsales.pig. The test data is loaded from sales.txt, which contains:
Alice,3000,us
Alice,2000,us
Bob,500,ca
Juan,500,mx
Hans,2000,de
Joan,1000,fr
Piero,6000,it
Then execute the script in local mode:
pig -x local totalsales.pig
17/01/04 14:28:57 WARN mapreduce.Counters: Group org.apache.hadoop.mapred.Task$Counter is deprecated. Use org.apache.hadoop.mapreduce.TaskCounter instead
17/01/04 14:28:57 INFO util.ProcessTree: setsid exited with exit code 0
17/01/04 14:28:58 WARN mapreduce.Counters: Group org.apache.hadoop.mapred.Task$Counter is deprecated. Use org.apache.hadoop.mapreduce.TaskCounter instead
(15000)
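To see why the grouping step matters, it can help to mimic the three Pig statements in plain Python. This is only a rough sketch of the semantics, not how Pig works internally: the helper names load, group_all and total_price are made up for this illustration, and the sales data is inlined instead of being read from sales.txt.

```python
import csv
import io

# Same rows as the sales.txt test file (name, price, country), inlined for the sketch.
SALES_TXT = """\
Alice,3000,us
Alice,2000,us
Bob,500,ca
Juan,500,mx
Hans,2000,de
Joan,1000,fr
Piero,6000,it
"""

def load(text):
    """Rough stand-in for LOAD ... USING PigStorage(',') AS (name, price:int, country)."""
    return [(name, int(price), country)
            for name, price, country in csv.reader(io.StringIO(text))]

def group_all(rows):
    """Rough stand-in for GROUP data ALL: a single record whose bag holds every row."""
    return [("all", rows)]

def total_price(grouped):
    """Rough stand-in for FOREACH grouped GENERATE SUM(data.price)."""
    return [(sum(price for _, price, _ in bag),) for _, bag in grouped]

data = load(SALES_TXT)
grouped = group_all(data)      # one record: ("all", [all seven rows])
total = total_price(grouped)   # SUM runs once, over the whole bag
print(total)                   # [(15000,)]
```

Without the grouping step there is no bag to aggregate: a FOREACH over the ungrouped data visits one row at a time, and each row carries a single price, not a collection of prices.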
Reference:
Thomas Henson