AVG, MIN, MAX error in Pig while COUNT works
I'm trying to use AVG, MIN, and MAX in Pig. MIN and MAX hang during execution and AVG throws the error below, but COUNT works fine.
org.apache.pig.backend.executionengine.ExecException: ERROR 0: Scalar has more than one row in the output. 1st : (GRADE 2 TEACHER,{(65587.90)}), 2nd :(GRADE 4 TEACHER,{(56567.24)})
My code:
register 'pig/contrib/piggybank/java/piggybank.jar';
define Replace org.apache.pig.piggybank.evaluation.string.REPLACE();
A = LOAD '/user/hduser/salaryTravel.csv' using org.apache.pig.piggybank.storage.CSVLoader() AS (name:chararray,job:chararray,salary:chararray,TA:chararray,type:chararray,org:chararray,year:int);
B = foreach A generate name,job,REPLACE(salary,',','') as salary:float, REPLACE(TA,',','') as TA:float, type, org, year;
C = filter B by type=='LBOE';
D = filter C by year==2010;
E = group D by job;
number = foreach E generate group,COUNT(D.salary);
average = foreach E generate group,AVG(D.salary);
minim = foreach E generate group,MIN(D.salary);
maxim = foreach E generate group,MAX(D.salary);
Sample Data
(ABBOTT,DEEDEE W,GRADES 9-12 TEACHER,52,122.10,0,LBOE,ATLANTA INDEPENDENT SCHOOL SYSTEM,2010)
(ABBOTT,RYAN V,GRADE 4 TEACHER,56,567.24,0,LBOE,ATLANTA INDEPENDENT SCHOOL SYSTEM,2010)
(ABBOUD,CLAUDIA MORA,GRADES K-5 TEACHER,63,957.50,0,LBOE,ATLANTA INDEPENDENT SCHOOL SYSTEM,2010)
(ABDUL-JABBAR,KHADEEJA ,GRADES 9-12 TEACHER,16,791.73,0,LBOE,ATLANTA INDEPENDENT SCHOOL SYSTEM,2010)
(ABDUL-RAZACQ,SALAHUD-DIN ,INSTRUCTIONAL SPECIALIST P-8,45,832.92,0,LBOE,ATLANTA INDEPENDENT SCHOOL SYSTEM,2010)
(ABDULLAH,DIANA ,SPECIAL ED PARAPRO/AIDE,10,934.94,0,LBOE,ATLANTA INDEPENDENT SCHOOL SYSTEM,2010)
(ABDULLAH,NADIYAH W,GRADES 6-8 TEACHER,75,109.92,0,LBOE,ATLANTA INDEPENDENT SCHOOL SYSTEM,2010)
(ABDULLAH,RHONDALYN Y,SPECIAL ED PARAPRO/AIDE,28,649.34,0,LBOE,ATLANTA INDEPENDENT SCHOOL SYSTEM,2010)
(OSBORNE,CHRISTINE L,INSTRUCTIONAL SUPERVISOR,78,875.59,3,265.71,LBOE,COBB COUNTY SCHOOL DISTRICT,2010)
(OSBORNE,DORIS A,OCCUPATIONAL THERAPIST ,65,421.79,1,156.05,LBOE,COBB COUNTY SCHOOL DISTRICT,2010)
Sample data after the GROUP operation in line 7.
(GRADE 2 TEACHER,{(OSBORNE,VIRGINIA E,GRADE 2 TEACHER,65587.90,0,LBOE,COBB COUNTY SCHOOL DISTRICT,2010)})
(GRADE 4 TEACHER,{(ABBOTT,RYAN V,GRADE 4 TEACHER,56567.24,0,LBOE,ATLANTA INDEPENDENT SCHOOL SYSTEM,2010)})
(MAINTENANCE PERSONNEL,{(BROOKS,RICHARD M,MAINTENANCE PERSONNEL,72655.52,0,LBOE,FULTON COUNTY BOARD OF EDUCATION,2010),(SUMNER,ROBERT O,MAINTENANCE PERSONNEL,72655.53,0,LBOE,FULTON COUNTY BOARD OF EDUCATION,2010),(MCCULLOUGH,ALVIN J,MAINTENANCE PERSONNEL,72655.52,0,LBOE,FULTON COUNTY BOARD OF EDUCATION,2010),(DALTON,JAMES E,MAINTENANCE PERSONNEL,72655.52,2124.60,LBOE,FULTON COUNTY BOARD OF EDUCATION,2010),(SMITH,KEVIN W,MAINTENANCE PERSONNEL,72655.52,0,LBOE,FULTON COUNTY BOARD OF EDUCATION,2010),(MANGHAM,LARRY G,MAINTENANCE PERSONNEL,72655.52,0,LBOE,FULTON COUNTY BOARD OF EDUCATION,2010)})
Is it a bug in Pig? Please help me.
Here is the updated Pig Script.
register 'pig/contrib/piggybank/java/piggybank.jar';
define Replace org.apache.pig.piggybank.evaluation.string.REPLACE();
A = LOAD '/user/hduser/salaryTravel.csv' using org.apache.pig.piggybank.storage.CSVLoader() AS (name:chararray,job:chararray,salary:chararray,TA:chararray,type:chararray,org:chararray,year:int);
B = foreach A generate name,job,REPLACE(salary,',','') as salary, REPLACE(TA,',','') as TA, type, org, year;
B1 = foreach B generate name, job, (double)salary, (double)TA, type, org, year;
C = filter B1 by type=='LBOE';
D = filter C by year==2010;
E = group D by job;
number = foreach E generate group,COUNT(D.salary);
average = foreach E generate group,AVG(D.salary);
minim = foreach E generate group,MIN(D.salary);
maxim = foreach E generate group,MAX(D.salary);
The issue was that salary and TA need an explicit cast, as in relation B1 of the updated script, instead of only a type declared in the AS clause.
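The same idea can be sketched outside Pig. This is a minimal Python illustration, not part of the original answer: the rows are a hypothetical subset of the sample data, and the function name summarize is made up for this example. Counting works on the raw strings, but averages, minimums, and maximums are only meaningful once the thousands separator is stripped and the value is converted to a number.

```python
from collections import defaultdict

# Hypothetical (job, salary) pairs taken from the sample rows. The salaries
# are still strings with thousands separators, as they are in the CSV.
rows = [
    ("GRADES 9-12 TEACHER", "52,122.10"),
    ("GRADE 4 TEACHER", "56,567.24"),
    ("GRADES 9-12 TEACHER", "16,791.73"),
    ("SPECIAL ED PARAPRO/AIDE", "10,934.94"),
    ("SPECIAL ED PARAPRO/AIDE", "28,649.34"),
]


def summarize(rows):
    """Group salaries by job and compute count, average, min, and max."""
    groups = defaultdict(list)
    for job, salary in rows:
        # Strip the separator, then cast explicitly before aggregating.
        groups[job].append(float(salary.replace(",", "")))
    return {
        job: {
            "count": len(values),
            "avg": sum(values) / len(values),
            "min": min(values),
            "max": max(values),
        }
        for job, values in groups.items()
    }


stats = summarize(rows)
for job, summary in stats.items():
    print(job, summary)
```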
Related
Kusto: Apply function on multiple column values during bag_unpack
Given a dynamic field, say, milestones, with a value like {"ta": 1655859586546, "tb": 1655859586646}, how do I print a table with columns like "ta", "tb" etc., with the single row as unixtime_milliseconds_todatetime(tolong(taValue)), unixtime_milliseconds_todatetime(tolong(tbValue)) etc.?
I figured that I'll need to write a function that I can call, so I created this:
let f = view(a:string ){ unixtime_milliseconds_todatetime(tolong(a)) };
I can use this function with a normal column as: project f(columnName). However, in this case it's a dynamic field, and the number of items in the list is large, so I do not want to enter the fields manually. This is what I have so far.
log_table
| take 1
| evaluate bag_unpack(milestones, "m_") // This gives me fields as columns
// | project-keep m_* // This would work, if I just wanted the value, however, I want `view(columnValue)`
| project-keep f(m_*) // This of course doesn't work, but explains the idea.
Based on the mv-apply operator
// Generate data sample. Not part of the solution.
let log_table = materialize(range record_id from 1 to 10 step 1
| mv-apply range(1, 1 + rand(5), 1) on (summarize milestones = make_bag(pack_dictionary(strcat("t", make_string(to_utf8("a")[0] + toint(rand(26)))), 1600000000000 + rand(60000000000)))));
// Solution Starts here.
log_table
| mv-apply kv = milestones on
(
    extend k = tostring(bag_keys(kv)[0])
    | extend v = unixtime_milliseconds_todatetime(tolong(kv[k]))
    | summarize milestones = make_bag(pack_dictionary(k, v))
)
| evaluate bag_unpack(milestones)
record_id ta tb tc td te tf tg th ti tk tl tm to tp tr tt tu tw tx tz
1 2021-07-06T20:24:47.767Z
2 2021-05-09T07:21:08.551Z 2022-07-28T20:57:16.025Z 2022-07-28T14:21:33.656Z 2020-11-09T00:54:39.71Z 2020-12-22T00:30:13.463Z
3 2021-12-07T11:07:39.204Z 2022-05-16T04:33:50.002Z 2021-10-20T12:19:27.222Z
4 2022-01-31T23:24:07.305Z 2021-01-20T17:38:53.21Z
5 2022-04-27T22:41:15.643Z
7 2022-01-22T08:30:08.995Z 2021-09-30T08:58:46.47Z
8 2022-03-14T13:41:10.968Z 2022-03-26T10:45:19.56Z 2022-08-06T16:50:37.003Z
10 2021-03-03T11:02:02.217Z 2021-02-28T09:52:24.327Z 2021-04-09T07:08:06.985Z 2020-12-28T20:18:04.973Z
9 2022-02-17T04:55:35.468Z
6 2022-08-02T14:44:15.414Z 2021-03-24T10:22:36.138Z 2020-12-17T01:14:40.652Z 2022-01-30T12:45:54.28Z 2022-03-31T02:29:43.114Z
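The per-key conversion that the query performs can be illustrated in Python. This is only a sketch of the idea (apply one conversion to every value of a dictionary without naming the keys), not a Kusto equivalent; the function name convert_milestones is hypothetical, and the sample dictionary is the one from the question.

```python
from datetime import datetime, timedelta, timezone

# Unix epoch as a timezone-aware datetime, so the results are in UTC.
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def convert_milestones(milestones):
    """Return a new dict with every millisecond timestamp as a UTC datetime."""
    return {
        key: EPOCH + timedelta(milliseconds=int(value))
        for key, value in milestones.items()
    }


milestones = {"ta": 1655859586546, "tb": 1655859586646}
converted = convert_milestones(milestones)
for key, value in converted.items():
    print(key, value.isoformat())
```

Adding a timedelta to the epoch avoids dividing by 1000 and going through a float, so the millisecond part is preserved exactly.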
Data Scraping with list in excel
I have a list in Excel. One code in Column A and another in Column B. There is a website in which I need to input both the details in two different boxes and it takes to another page. That page contains certain details which I need to scrape in Excel. Any help in this?
Ok. Give this a shot:
import pandas as pd
import requests

df = pd.read_excel('C:/test/data.xlsx')
url = 'http://rla.dgft.gov.in:8100/dgft/IecPrint'

results = pd.DataFrame()
for row in df.itertuples():
    # row[1] is the IEC code (column A), row[2] is the name (column B)
    payload = {
        'iec': '%010d' % row[1],
        'name': row[2]}
    response = requests.post(url, params=payload)
    print('IEC: %010d\tName: %s' % (row[1], row[2]))
    try:
        dfs = pd.read_html(response.text)
    except ValueError:
        # read_html raises ValueError when the page contains no tables
        print('The name Given By you does not match with the data OR you have entered less than three letters')
        temp_df = pd.DataFrame([['%010d' % row[1], row[2], 'ERROR']],
                               columns=['IEC', 'Party Name and Address', 'ERROR'])
        # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
        results = pd.concat([results, temp_df], sort=False).reset_index(drop=True)
        continue

    generalData = dfs[0]
    generalData = generalData.iloc[:, [0, -1]].set_index(generalData.columns[0]).T.reset_index(drop=True)

    directorData = dfs[1]
    directorData = directorData.iloc[:, [-1]].T.reset_index(drop=True)
    directorData.columns = ['director_%02d' % (each + 1) for each in directorData.columns]

    try:
        branchData = dfs[2]
        branchData = branchData.iloc[:, [-1]].T.reset_index(drop=True)
        branchData.columns = ['branch_%02d' % (each + 1) for each in branchData.columns]
    except IndexError:
        # no third table on the page means there is no branch data
        branchData = pd.DataFrame()
        print('No Branch Data.')

    temp_df = pd.concat([generalData, directorData, branchData], axis=1)
    results = pd.concat([results, temp_df], sort=False).reset_index(drop=True)

results.to_excel('path.new_file.xlsx', index=False)
Output:
print (results.to_string())
IEC IEC Allotment Date File Number File Date Party Name and Address Phone No e_mail Exporter Type IEC Status Date of Establishment BIN (PAN+Extension) PAN ISSUE DATE PAN ISSUED BY Nature Of Concern Banker Detail director_01 director_02 director_03 branch_01 branch_02 branch_03 branch_04 branch_05 branch_06 branch_07 branch_08 branch_09
0 0305008111 03.05.2005 04/04/131/51473/AM20/ 20.08.2019 NISSAN MOTOR INDIA PVT. LTD. PLOT-1A,SIPCOT IN... 918939917907 shailesh.kumar#rnaipl.com 5 Merchant/Manufacturer Valid IEC 2005-02-07 AACCN0695D FT001 NaN NaN 3 Private Limited STANDARD CHARTERED BANK A/C Type:1 CA A/C No :... HARDEEP SINGH BRAR GURMEL SINGH BRAR HOUSE NO ... JEROME YVES MARIE SAIGOT THIERRY SAIGOT A9/2, ... KOJI KAWAKITA KIHACHI KAWAKITA 3-21-3, NAGATAK... Branch Code:165TH FLOOR ORCHID BUSINESS PARK,S... Branch Code:14NRPDC , WAREHOUSE NO.B -2A,PATAU... Branch Code:12EQUINOX BUSINESS PARK TOWER 3 4T... Branch Code:8GRAND PALLADIUM,5TH FLR.,B WING,,... Branch Code:6TVS LOGISTICS SERVICES LTD.SING,C... Branch Code:2PLOT 1A SIPCOT INDUL PARK,ORAGADA... Branch Code:5BLDG.NO.3 PART,124A,VALLAM A,SRIP... Branch Code:15SURVEY NO. 678 679 680 681 682 6... Branch Code:10INDOSPACE SKCL INDL.PARK,BULD.NO...
Syntax error when using count in loop
I am trying to run a loop where I count the total in each file under the variable _merge, and then count certain outcomes of _merge, such as _merge=1 and so on. I then want to calculate percentages by dividing each instance of _merge by the total under _merge. Below is my code:
/*define local list*/
local ward_names B C D E FN FS GS HE
/*loop for each dbase*/
foreach file of local ward_names {
    use "../../../cleaning/sra/output/`file'_ward_CTS_Merged.dta", clear
    count if _merge
    local ward_count=r(N)
    count if _merge==1
    local count_master=r(N)
    count if _merge==2
    local count_using=r(N)
    count if _merge==3
    local count_match=r(N)
    clear
    set obs 1
    g ward_count='ward_count'
    g count_master=`count_master'
    g count_using=`count_using'
    g count_match=`count_match'
    g ward= "`file'"
    save "../temp/`file'_collapsed_diagnostics.dta", replace
    clear
The code was running fine until I tried to add the total count for each ward file:
g ward_count='ward_count'
'ward_count' invalid name
Is this a syntax error or something more severe?
You need to open a local macro reference with a backtick (`) instead of an apostrophe ('); the closing character stays an apostrophe:
generate ward_count = `ward_count'
EDIT:
As per @NickCox's recommendation you can improve your code by using the tabulate command with its matcell() option to get the counts all at once:
tabulate _merge, matcell(A)

                 _merge |      Freq.     Percent        Cum.
------------------------+-----------------------------------
        master only (1) |          1       16.67       16.67
            matched (3) |          5       83.33      100.00
------------------------+-----------------------------------
                  Total |          6      100.00

matrix list A

A[2,1]
    c1
r1   1
r2   5

So you could then do the following:
generate count_master = A[1,1]
generate count_match = A[2,1]
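The count-then-percentage calculation the question is after can be sketched in Python for comparison. This is an illustration only, not Stata code: the list of _merge values is hypothetical and chosen to match the tabulate output in the answer (one master-only row, five matched rows), and merge_diagnostics is a made-up name.

```python
from collections import Counter

# Hypothetical _merge values for one file: 1 = master only, 3 = matched.
merge_values = [1, 3, 3, 3, 3, 3]


def merge_diagnostics(values):
    """Count each merge outcome once and derive its share of the total."""
    counts = Counter(values)
    total = sum(counts.values())
    percentages = {code: 100.0 * n / total for code, n in counts.items()}
    return total, counts, percentages


total, counts, percentages = merge_diagnostics(merge_values)
for code in sorted(counts):
    print(code, counts[code], round(percentages[code], 2))
```

As with matcell(), all outcome counts come from a single pass over the data rather than one count per outcome.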
Crystal Reports Difference of group total
I have a report which has two groups. Group B always has only 2 values. I want to show the difference between the totals of Item Type 01 and Item Type 02 in the Group B footer (Tot type01 - tot type02). Help me to achieve this. I tried a few formulas but none of them work for me.

                      Month01   Month2
Group A
  Group B
    Item Type 01
      ab                   10       10
      ac                   20       30
      ad                   30       30
      **Total**            60       70
    Item Type 02
      ab                   10       20
      ac                   10       15
      ad                   20        5
      **Total**            40       30
    **Difference**         20       40

I want something like this:
NumberVar sum01 := 0;
Numbervar sum02 := 0;
GroupName ({DataTable1.IncomeType}) = Type 01 Then
sum01 := Sum ({DataTable1.Month01}, {DataTable1.IncomeType})
if GroupName ({DataTable1.IncomeType}) = Type 02 Then
sum02 := Sum ({DataTable1.Month01}, {DataTable1.IncomeType})
sum01 - sum02
I know this isn't correct. I used it to explain my question as clearly as possible. I really appreciate your guidance.
You can do this using arrays. Take 2 arrays, store the values for Month1 and Month2 in them, and in the group footer retrieve and subtract those.
Create a formula @Month1Array and place it in the Item Type group footer after the Month1 summary:
Shared Numbervar array x;
x:=x+sum(Month1,Item GRoup);
1;
Create a formula @Month2Array and place it in the Item Type group footer after the Month2 summary:
Shared Numbervar array y;
y:=y+sum(Month2,Item GRoup);
1;
Now, in the footer where you want to see the difference, write the formulas below.
Create a formula @Month1:
Shared Numbervar array x;
x[1]-x[2]
Create a formula @Month2:
Shared Numbervar array y;
y[1]-y[2]
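The logic behind the array approach (collect one total per Item Type, then subtract the second from the first) can be sketched in Python. This is an illustration of the arithmetic only, not Crystal Reports syntax; it uses the Month01 values from the question, and the variable names are hypothetical.

```python
# Month01 detail values per item type, copied from the question's example.
month01 = {
    "Type 01": [10, 20, 30],
    "Type 02": [10, 10, 20],
}

# One total per item type, in the order the group footers would run.
totals = [sum(values) for values in month01.values()]

# Difference shown in the outer group footer: first total minus second.
difference = totals[0] - totals[1]
print(totals, difference)
```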
PIG: Do not understand why AVG does not work when COUNT does
I am running the following set of commands in Pig. My data set has one row for each student in a class and each student has a number of grades. Student name is tab separated from grades for that student. The scores for each student are comma separated. I need to find the average grade for each student. After grouping, I can successfully get the count of grades for each student but I cannot get the average score for each student. Pig complains it cannot find the iterator when it is averaging. I am confused since the iterator for both aggregate functions, COUNT and AVG, is the same. I am not sure what I am missing. Any help is appreciated.
Scripts:
grunt> A = LOAD 'grades.txt' USING PigStorage('\t') AS (f1:chararray,f2:chararray);
grunt> dump A;
(s14,59,94,81)
(s15,60,77)
(s16,77,77)
(s17,76,76)
(s18,19,61,72)
(s20,34,35)
grunt> B = foreach A generate f1 as stu, Flatten(TOKENIZE(f2)) as (grade:int);
grunt> describe B;
B: {stu: chararray,grade: int}
grunt> dump B;
(s14,59)
(s14,94)
(s14,81)
(s15,60)
(s15,77)
(s16,77)
(s16,77)
(s17,76)
(s17,76)
(s18,19)
(s18,61)
(s18,72)
(s20,34)
(s20,35)
grunt> grp = group B by stu;
grunt> cnt = foreach grp generate group, COUNT(B.grade);
grunt> dump cnt;
(s14,3)
(s15,2)
(s16,2)
(s17,2)
(s18,3)
(s20,2)
grunt> avg = foreach grp generate group, AVG(B.grade);
grunt> dump avg;
2015-03-20 21:56:30,900 ERROR org.apache.pig.tools.pigstats.PigStatsUtil: 1 map reduce job(s) failed!
2015-03-20 21:56:30,907 ERROR org.apache.pig.tools.grunt.Grunt: ERROR 1066: Unable to open iterator for alias avg
Details at logfile: /home/training/pig/pig_1426902869706.log
grunt>
As mentioned in the comments, a workaround was found. Change
B = foreach A generate f1 as stu, Flatten(TOKENIZE(f2)) as (grade:int)
to
B = foreach A generate f1 as stu, Flatten(TOKENIZE(f2)) as grade
and then cast grade explicitly in a new relation:
C = foreach B generate stu as stu, (int)grade as grade;
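The same tokenize, cast, then average sequence can be sketched in Python. This is only an illustration of why the explicit cast matters, not a Pig translation; the input lines mirror the dump of A in the question, and average_grades is a hypothetical name.

```python
# Lines in the same shape as grades.txt: student, a tab, comma-separated grades.
lines = [
    "s14\t59,94,81",
    "s15\t60,77",
    "s16\t77,77",
    "s17\t76,76",
    "s18\t19,61,72",
    "s20\t34,35",
]


def average_grades(lines):
    """Split each line, cast every grade token to int, and average per student."""
    averages = {}
    for line in lines:
        student, grade_field = line.split("\t")
        # The tokens are strings until they are cast; averaging needs numbers.
        grades = [int(token) for token in grade_field.split(",")]
        averages[student] = sum(grades) / len(grades)
    return averages


averages = average_grades(lines)
for student, avg in averages.items():
    print(student, avg)
```

Counting the tokens would work without the cast, which matches the symptom in the question: COUNT succeeds while AVG fails.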