PIG - Scalar has more than one row in the output

I have a data set in the following format:
100000853384|RETAIL|OTHER|4.625|280000|360|02/2012|04/2012|31|31|1|23|801|NO|CASH-OUT REFINANCE|SF|1|INVESTOR|CA|945||FRM
100003735682|RETAIL|SUNTRUST MORTGAGE INC.|3.99|466000|360|01/2012|03/2012|80|80|2|30|788|NO|PURCHASE|SF|1|PRINCIPAL|MD|208||FRM
100006367485|CORRESPONDENT|PHH MORTGAGE CORPORATION|4|229000|360|02/2012|04/2012|67|67|2|36|794|NO|NO CASH-OUT REFINANCE|SF|1|PRINCIPAL|CA|959||FRM
The 4th field is ORIGINAL_INTEREST_RATE.
Now my question is:
What is the interest rate at which the most people have taken a loan?
I wrote the following code.
Load the data set:
loanAqiData = LOAD 'hdfs://masterNode:8020/home/hadoop/hadoop_data/LOAN_Acquisition_DATA/Acquisition_2012Q1.txt'
USING PigStorage('|')
AS
(
LOAN_IDENTIFIER:chararray
, CHANNEL:chararray
, SELLER_NAME:chararray
, ORIGINAL_INTEREST_RATE:float
, ORIGINAL_UNPAID_PRINCIPAL_BALANCE:float
, ORIGINAL_LOAN_TERM:float
, ORIGINATION_DATE:chararray
, FIRST_PAYMENT_DATE:chararray
, ORIGINAL_LOAN_TO_VALUE:float
, ORIGINAL_COMBINED_LOAN_TO_VALUE:float
, NUMBER_OF_BORROWERS:float
, DEBT_TO_INCOME_RATIO:float
, CREDIT_SCORE:float
, FIRST_TIME_HOME_BUYER_INDICATOR:chararray
, LOAN_PURPOSE:chararray
, PROPERTY_TYPE:chararray
, NUMBER_OF_UNITS:chararray
, OCCUPANCY_STATUS:chararray
, PROPERTY_STATE:chararray
, ZIP:chararray
, MORTGAGE_INSURANCE_PERCENTAGE:float
, PRODUCT_TYPE:chararray
);
-- Group by interest rate
grouped_by_interest_rate = group loanAqiData by ORIGINAL_INTEREST_RATE;
Count the number of people for each interest rate:
count_for_specific_interest = FOREACH grouped_by_interest_rate GENERATE group as INTEREST_RATE, COUNT(loanAqiData) as NO_OF_PEOPLE;
Dump:
dump count_for_specific_interest;
Output:
(3.625,1)
(3.75,2)
(3.875,26)
(3.99,8)
(4.0,21)
(4.1,1)
(4.125,15)
(4.25,16)
(4.375,15)
(4.376,26)
(4.5,10)
(4.625,3)
But I want to get only
(3.875,26) and (4.376,26)
How can I get that?
Also, how do I get the interest rate for which the minimum number of people have taken a loan?

I'd suggest you use the MAX() function (http://pig.apache.org/docs/r0.11.0/func.html#max) to determine the highest number of people and then filter by this number. MAX() needs all the counts in a single bag, so group the relation with ALL first. Here is an example of code that should work (not tested):
-- put every (rate, count) tuple into one bag so MAX can scan them all
grouped_all = GROUP count_for_specific_interest ALL;
max_rel = FOREACH grouped_all GENERATE MAX(count_for_specific_interest.NO_OF_PEOPLE) AS max_value;
-- max_rel has a single row, so it can be used as a scalar
RESULT = FILTER count_for_specific_interest BY NO_OF_PEOPLE == max_rel.max_value;
For the min you can use exactly the same script, replacing MAX() with MIN().
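For instance, reusing the aliases from the sketch above (again untested):
-- same pattern, with MIN in place of MAX
min_rel = FOREACH grouped_all GENERATE MIN(count_for_specific_interest.NO_OF_PEOPLE) AS min_value;
RESULT_MIN = FILTER count_for_specific_interest BY NO_OF_PEOPLE == min_rel.min_value;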

Finally, this is resolved.
Let me write down the steps.
1) Load the data set (as above).
2) Group by interest rate:
grp = group loanAqiData by ORIGINAL_INTEREST_RATE;
3) Count the number of people for each interest rate:
cntForEachGrp = FOREACH grp GENERATE group as INTEREST_RATE, COUNT(loanAqiData) as NO_OF_PEOPLE;
Output:
(3.625,1) (3.75,2) (3.875,26) (3.99,8) (4.0,21) (4.1,1) (4.125,15) (4.25,16) (4.375,15) (4.376,26) (4.5,10) (4.625,3)
4) Group them all to put them in the same bag:
grpALL = GROUP cntForEachGrp ALL;
(all,{(3.625,1),(3.75,2),(3.875,26),(3.99,8),(4.0,21),(4.1,1),(4.125,15),(4.25,16),(4.375,15),(4.376,1),(4.5,10),(4.625,3),(4.75,5),(4.875,4),(5.0,2),(5.25,1)})
5) Calculate the max number of people from the bag:
maxVal = FOREACH grpALL {
    max_value = MAX(cntForEachGrp.NO_OF_PEOPLE);
    GENERATE cntForEachGrp.INTEREST_RATE, cntForEachGrp.NO_OF_PEOPLE, max_value as max_no;
}
grunt> describe maxVal;
maxVal: {{(INTEREST_RATE: float)},{(NO_OF_PEOPLE: long)},max_no: long}
dump maxVal;
({(3.625),(3.75),(3.875),(3.99),(4.0),(4.1),(4.125),(4.25),(4.375),(4.376),(4.5),(4.625),(4.75),(4.875),(5.0),(5.25)},{(1),(2),(26),(8),(21),(1),(15),(16),(15),(1),(10),(3),(5),(4),(2),(1)},26)
6) Filter out the loan interest having the max number of people:
RESULT = FILTER cntForEachGrp BY NO_OF_PEOPLE == maxVal.max_no;
After the dump we get: interest rate 3.875 has the max number of people, 26.
Why do we have to do
grpALL = GROUP cntForEachGrp ALL;
and what is the inner meaning of the nested FOREACH in step (5)?

Python ArcPy - Print Layer with highest field value

I have some python code that goes through layers in my ArcGIS project and prints out the layer names and their corresponding highest value within the field "SUM_USER_VisitCount".
(screenshot: current output)
What I want the code to do is only print out the layer name and SUM_USER_VisitCount field value for the one layer with the absolute highest value.
(screenshot: desired output)
I have been unable to figure out how to achieve this and can't find anything online either. Can someone help me achieve my desired output?
Sorry if the code layout is a little weird; it got messed up when I pasted it into the "code sample".
Here is my code:
import arcpy
import datetime
from datetime import timedelta
import time

#Document Start Time in-order to calculate Run Time
time1 = time.clock()

#assign project and map frame
p = arcpy.mp.ArcGISProject(r'E:\arcGIS_Shared\Python\CumulativeHeatMaps.aprx')
m = p.listMaps('Map')[0]

Markets = [3000]

### Centers to loop through
CA_Centers = ['Castro', 'ColeValley', 'Excelsior', 'GlenPark',
              'LowerPacificHeights', 'Marina', 'NorthBeach', 'RedwoodCity',
              'SanBruno', 'DalyCity']

for Market in Markets:
    print(Market)
    for CA_Center in CA_Centers:
        Layers = m.listLayers("CumulativeSumWithin{0}_{1}_Jun2018".format(Market, CA_Center))
        fields = ['SUM_USER_VisitCount']
        for Layer in Layers:
            print(Layer)
            sqlClause = (None, 'ORDER BY ' + 'SUM_USER_VisitCount')  # + 'DESC'
            with arcpy.da.SearchCursor(in_table=Layer, field_names=fields,
                                       sql_clause=sqlClause) as searchCursor:
                print(max(searchCursor))
You can create a dictionary that stores the result from each query and then print out the highest one at the end.
results_dict = {}
for Market in Markets:
    print(Market)
    for CA_Center in CA_Centers:
        Layers = m.listLayers("CumulativeSumWithin{0}_{1}_Jun2018".format(Market, CA_Center))
        fields = ['SUM_USER_VisitCount']
        for Layer in Layers:
            print(Layer)
            sqlClause = (None, 'ORDER BY ' + 'SUM_USER_VisitCount')
            with arcpy.da.SearchCursor(in_table=Layer, field_names=fields,
                                       sql_clause=sqlClause) as searchCursor:
                # A search cursor can only be iterated once, so call max() a single time
                highest = max(searchCursor)
                print(highest)
                results_dict[Layer] = highest
# get key for dictionary item with the highest value
highest_count_layer = max(results_dict, key=results_dict.get)
print(highest_count_layer)
print(results_dict[highest_count_layer])

R Language: getCommentReplies() error:

Despite reading the existing answers, I still don't know how to fix this problem.
In the first phase I extract the comments for each post, which works successfully. In the second phase, for each comment, I extract the corresponding replies to that comment (i.e. in my program, when i=1 [1st post] and j=1 [1st comment]).
However, by the time getCommentReplies() tries to extract the very first reply for the very first comment of the first post, it throws the following error:
Error in data.frame(from_id = json$from$id, from_name = json$from$name, :
arguments imply differing number of rows: 0, 1
My program:
load("fb_oauth")
# Extract the latest n=130 posts (excluding NULL rows) from the page into fb_page_no_nullz
fb_page_no_nullz <- getPage(page="gtbank", token=fb_oauth, n=130, since='2018/3/10',
                            until='2018/3/12', feed=TRUE, api='v2.11')
no_of_rows = na.omit(nrow(fb_page_no_nullz))  # Count the number of rows without NULLs
i = 1
all_comments <- NULL
while (i <= no_of_rows)
{
  # Extract n comments for each post
  postt <- getPost(post=fb_page_no_nullz$id[i], n=200, token=fb_oauth, comments=TRUE,
                   likes=FALSE, api="v2.11")
  no_of_row_c = na.omit(nrow(postt$comments))
  if (no_of_row_c != 0)  # If there are no comments for this post, pick the next post
  {
    comment_details <- postt$comments[, 1:7]
    comment_details$from_id <- comment_details$from_name <- NULL  # Drop the from_id and from_name columns
    j = 1
    while (j <= no_of_row_c)
    {
      repl <- NULL
      repl <- getCommentReplies(comment_details$id[i], token=fb_oauth, n=200,
                                replies=TRUE, likes=FALSE, n.replies=100)
      j = j + 1
    }
  }
  #all_comments$from_id <- all_comments$from_name <- NULL
  all_comments <- rbind(all_comments, comment_details)  # Cumulatively append all comments for all posts
  i = i + 1
}
#allPC <- merge(all_comments, fb_page_no_nullz, by.x=substr(c("id"),1,14), by.y=substr(c("id"),14,30), all.x=TRUE)

Bosun: how to add series with different tags?

I'm trying to add 4 series using Bosun expressions. They are from 1, 2, 3, and 4 weeks ago. I shifted them using shift() so they align with the current time, but I can't add them since they carry the shift=1w etc. tags. How can I add these series together?
Thank you
Edit: here's the query for 2 weeks:
$period = d("1w")
$duration = d("30m")
$week1end = tod(1 * $period )
$week1start = tod(1 * $period + $duration )
$week2end = tod(2 * $period )
$week2start = tod(2 * $period + $duration )
$q1 = q("avg:1m-avg:os.cpu{host=myhost}", $week1start, $week1end)
$q2 = q("avg:1m-avg:os.cpu{host=myhost}", $week2start, $week2end)
$shiftedq1 = shift($q1, "1w")
$shiftedq2 = shift($q2, "2w")
$shiftedq1+ $shiftedq2
Edit: here's what Bosun said:
The problem is similar to: How do I add the series present in the output of an over query:
over("avg:1m-avg:os.cpu{host=myhost}", "30m", "1w", 2)
There is a new function called addtags that is pending documentation (see https://raw.githubusercontent.com/bosun-monitor/bosun/master/docs/expressions.md for a draft) which seems to work when combined with rename. Changing the last line to:
$shiftedq1+addtags(rename($shiftedq2,"shift=shiftq2"),"shift=1w")
should generate a single result group like { host=hostname, shift=1w, shiftq2=2w }. If you add additional queries for q3 and q4, you probably need to rename the shift tag for those to unique values like shiftq3 and shiftq4.
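For example, a sketch of the full four-week sum under that approach (assuming $shiftedq3 and $shiftedq4 are defined the same way as the first two; untested):
$total = $shiftedq1 + addtags(rename($shiftedq2,"shift=shiftq2"),"shift=1w") + addtags(rename($shiftedq3,"shift=shiftq3"),"shift=1w") + addtags(rename($shiftedq4,"shift=shiftq4"),"shift=1w")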
If you were using numbersets instead of seriessets, then the transpose function t() would let you "drop" the unwanted tags. This is useful when generating alerts, since crit and warn need a single number value, not a series set:
$average_per_q = avg(merge($shiftedq1,$shiftedq2))
$sum_over_all = sum(t($average_per_q,"host"))
Result: { host=hostname } 7.008055555555557
Side note: you probably want to use a counter for os.cpu instead of a gauge. Example: $q1 = q("avg:1m-avg:rate{counter,,1}:os.cpu{. Without that rate section you are using the raw counter values instead of the gauge value.

PIG: Do not understand why AVG does not work when COUNT does

I am running the following set of commands in Pig. My data set has one row for each student in a class, and each student has a number of grades. The student name is tab-separated from the grades for that student, and the scores for each student are comma-separated. I need to find the average grade for each student.
After grouping, I can successfully get the count of grades for each student, but I cannot get the average score. Pig complains it cannot open the iterator when it is averaging. I am confused, since the iterator for both aggregate functions COUNT and AVG is the same. I am not sure what I am missing. Any help is appreciated.
Scripts:
grunt> A = LOAD 'grades.txt' USING PigStorage('\t') AS
(f1:chararray,f2:chararray);
grunt> dump A;
(s14,59,94,81)
(s15,60,77)
(s16,77,77)
(s17,76,76)
(s18,19,61,72)
(s20,34,35)
grunt> B = foreach A generate f1 as stu, Flatten(TOKENIZE(f2)) as (grade:int);
grunt> describe B;
B: {stu: chararray,grade: int}
grunt> dump B;
(s14,59)
(s14,94)
(s14,81)
(s15,60)
(s15,77)
(s16,77)
(s16,77)
(s17,76)
(s17,76)
(s18,19)
(s18,61)
(s18,72)
(s20,34)
(s20,35)
grunt> grp = group B by stu;
grunt> cnt = foreach grp generate group, COUNT(B.grade);
grunt> dump cnt;
(s14,3)
(s15,2)
(s16,2)
(s17,2)
(s18,3)
(s20,2)
grunt> avg = foreach grp generate group, AVG(B.grade);
grunt> dump avg;
2015-03-20 21:56:30,900 ERROR org.apache.pig.tools.pigstats.PigStatsUtil:
1 map reduce job(s) failed!
2015-03-20 21:56:30,907 ERROR org.apache.pig.tools.grunt.Grunt: ERROR 1066:
Unable to open iterator for alias avg
Details at logfile: /home/training/pig/pig_1426902869706.log
grunt>
As mentioned in the comments, a workaround was found: TOKENIZE produces chararray tokens, so declaring AS (grade:int) on the FLATTEN only relabels them without converting; at runtime AVG receives strings and fails, while COUNT works because it never inspects the values.
Changed
B = foreach A generate f1 as stu, Flatten(TOKENIZE(f2)) as (grade:int);
to
B = foreach A generate f1 as stu, Flatten(TOKENIZE(f2)) as grade;
and then cast explicitly in a separate step:
C = foreach B generate stu as stu, (int)grade as grade;
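Putting it together, a minimal end-to-end sketch of the fixed script (untested; it reuses the file and alias names from the transcript above):
A = LOAD 'grades.txt' USING PigStorage('\t') AS (f1:chararray, f2:chararray);
-- TOKENIZE splits f2 on the commas and yields chararray tokens
B = FOREACH A GENERATE f1 AS stu, FLATTEN(TOKENIZE(f2)) AS grade;
-- cast explicitly so AVG receives numbers instead of strings
C = FOREACH B GENERATE stu, (int)grade AS grade;
grp = GROUP C BY stu;
avg = FOREACH grp GENERATE group, AVG(C.grade);
DUMP avg;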

Two indices with one value in a Lua table

I am very new to Lua and my plan is to create a table. This table (I call it test) has 200 entries, and each entry has the same subentries (in this example, the subentries money and age):
This is a sort of pseudocode:
table test = {
Entry 1: money=5 age=32
Entry 2: money=-5 age=14
...
Entry 200: money=999 age=72
}
How can I write this in Lua? Is this possible? The other way would be to write each subentry as a separate table:
table money = { }
table age = { }
But for me, this isn't a nice way, so maybe you can help me.
Edit:
This question, Table inside a table, is related, but I cannot write this 200x.
Try this syntax:
test = {
{ money = 5, age = 32 },
{ money = -5, age = 14 },
...
{ money = 999, age = 72 }
}
Examples of use:
-- money of the second entry:
print(test[2].money) -- prints "-5"
-- age of the last entry:
print(test[200].age) -- prints "72"
You can also turn the problem on its side and have two sequences in test, money and age, where each entry has the same index in both arrays.
test = {
  money = {1000, 100, 0, 50},
  age = {40, 30, 20, 25}
}
This will have better performance since you only have the overhead of 3 tables instead of n+1 tables, where n is the number of entries.
Anyway, you have to enter your data one way or another. What you'd typically do is use some easily parsed format like CSV, XML, etc. and convert that to a table. Like this:
s = [[
1000 40
100 30
0 20
50 25]]
test = { money = {}, age = {} }
n = 1
for balance, age in s:gmatch('([%d.]+)%s+([%d.]+)') do
  -- gmatch captures are strings, so convert them to numbers
  test.money[n], test.age[n] = tonumber(balance), tonumber(age)
  n = n + 1
end
You mean you do not want to write "money" and "age" 200x?
There are several solutions but you could write something like:
local test0 = {
  5, 32,
  -5, 14,
  ...
}
local test = {}
for i = 1, #test0/2 do
  test[i] = { money = test0[2*i - 1], age = test0[2*i] }
end
Otherwise you could always use metatables and create a class that behaves exactly like you want.
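For instance, a minimal sketch of that idea (the Entry class and its isAdult method are invented for illustration):
-- a tiny "class": all instances share their methods through a metatable
local Entry = {}
Entry.__index = Entry

function Entry.new(money, age)
  return setmetatable({ money = money, age = age }, Entry)
end

function Entry:isAdult()
  return self.age >= 18
end

local test = {}
for i = 1, 200 do
  test[i] = Entry.new(0, 0) -- placeholder values; fill in real data
end

test[1].money, test[1].age = 5, 32
print(test[1].money)     --> 5
print(test[1]:isAdult()) --> true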
