R Language: getCommentReplies() error

Despite reading the existing answers, I still don't know how to fix this problem.
In the first phase I extract the comments for each post, which works. In the second phase I try to extract, for each comment, the replies to that comment (i.e. in my program when i=1 [1st post] and j=1 [1st comment]).
However, as soon as getCommentReplies() tries to extract the first reply to the first comment of the first post, it throws the following error:
Error in data.frame(from_id = json$from$id, from_name = json$from$name, :
arguments imply differing number of rows: 0, 1
My program:
load("fb_oauth")
# Extract up to n = 130 posts from the gtbank page feed between the two dates into fb_page_no_nullz.
fb_page_no_nullz <- getPage(page = "gtbank", token = fb_oauth, n = 130, since = '2018/3/10', until = '2018/3/12', feed = TRUE, api = 'v2.11')
no_of_rows <- na.omit(nrow(fb_page_no_nullz)) # Count the number of rows and store it in no_of_rows
i <- 1
all_comments <- NULL
while (i <= no_of_rows)
{
  # Extract up to 200 comments for each post
  postt <- getPost(post = fb_page_no_nullz$id[i], n = 200, token = fb_oauth, comments = TRUE, likes = FALSE, api = "v2.11")
  no_of_row_c <- na.omit(nrow(postt$comments))
  if (no_of_row_c != 0) # If there are no comments for this post, move on to the next post.
  {
    comment_details <- postt$comments[, 1:7]
    comment_details$from_id <- comment_details$from_name <- NULL # Remove the columns from_id and from_name from the data frame
    j <- 1
    while (j <= no_of_row_c)
    {
      # Index the comments by j (the comment counter), not by i (the post counter)
      repl <- getCommentReplies(comment_details$id[j], token = fb_oauth, n = 200, replies = TRUE, likes = FALSE, n.replies = 100)
      j <- j + 1
    }
    #all_comments$from_id<-all_comments$from_name<-NULL # This line removes the columns from_id AND from_name from the data frame
    # Cumulatively append the comments of every post to all_comments.
    # Kept inside the if so a post without comments does not re-append the previous post's comments.
    all_comments <- rbind(all_comments, comment_details)
  }
  i <- i + 1
}
#allPC<-merge(all_comments,fb_page_no_nullz, by.x= substr(c("id"),1,14), by.y=substr(c("id"),14,30),all.x = TRUE)

Related

Using R- My for loop only works on first iteration

I have a loop that makes at least one and at most two API GET requests on each iteration.
I loop through a list of movies: for each one I do a GET request to get the movie's ID in the API database, then a second request to get the reviews using that ID.
When I run the code it works for the first movie, but all the other movies remain empty, even when the first movie has no reviews.
Here is the code:
for (i in 1:9964) {
  id_url <- paste(url, search_title, key, COPY$Film[i], sep = "")
  # GET request for the movie
  api_call_id <- GET(id_url)
  # make readable
  read <- rawToChar(api_call_id$content)
  # turn JSON into an object
  JSON <- fromJSON(read, flatten = TRUE)
  if (is.null(JSON$results$id[1])) {
    break;
  } else {
    # get the movie id from the JSON
    id <- JSON$results$id[1]
  }
  # URL to get the movie's ratings
  raiting_url <- paste(url, raiting, key, id, sep = "")
  # call
  api_call_raiting <- GET(raiting_url)
  # make readable
  read <- rawToChar(api_call_raiting$content)
  # JSON
  JSON <- fromJSON(read, flatten = TRUE)
  if (is.null(JSON$rottenTomatoes)) {
    # set column value
    COPY[i, 'rTomatoes'] <- "No Review"
  } else {
    # set column value
    COPY[i, 'rTomatoes'] <- JSON$rottenTomatoes
  }
}
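One possible cause, offered only as a guess since the API responses are not shown: break leaves the whole loop the first time a movie ID is missing, so every later row is never filled in, whereas R's next would skip only that movie. The difference can be sketched in Python, where continue plays the role of next. The lookup functions and the sample data are hypothetical stand-ins for the two GET requests.

```python
def collect_ratings(titles, lookup_id, lookup_rating):
    """Return {title: rating}; a title without an ID is skipped, not fatal."""
    ratings = {}
    for title in titles:
        movie_id = lookup_id(title)
        if movie_id is None:
            # `continue` (R: `next`) skips this title only;
            # `break` would abandon every remaining title.
            ratings[title] = "No Review"
            continue
        rating = lookup_rating(movie_id)
        ratings[title] = rating if rating is not None else "No Review"
    return ratings


# Hypothetical in-memory stand-ins for the ID search and the rating request.
fake_ids = {"Alien": 1, "Heat": 3}
fake_ratings = {1: 97}

result = collect_ratings(["Alien", "Unknown Film", "Heat"],
                         fake_ids.get, fake_ratings.get)
print(result)
```

With break in place of continue, "Heat" would never be visited after "Unknown Film" failed its ID lookup.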

How to correct the output generated through str_detect/str_contains in R

I have a column "methods_discussed" in a CSV file (link: https://github.com/pandas-dev/pandas/files/3496001/multiple_responses.zip)
multi<- read.csv("multiple_responses.csv", header = T)
This column holds the names of family planning methods, for example:
methods_discussed
emergency female_sterilization male_sterilization iud NaN injectables male_condoms -77 male_condoms female_sterilization male_sterilization injectables iud male_condoms
I have created a vector of the 8 family planning methods (everything except -77 and NaN):
method_names = c('female_condoms', 'emergency', 'male_condoms', 'pill', 'injectables', 'iud', 'male_sterilization', 'female_sterilization')
I want to create a new indicator variable for each name in the vector (method_names) in the existing data frame multi2. For this I used (I)
for (abc in method_names) {
multi2[abc]<- as.integer(str_detect(multi2$methods_discussed, fixed(abc)))
}
(II)
for (abc in method_names) {
multi2[abc]<- as.integer(str_contains(abc,multi2$methods_discussed))
}
(III) I also tried
for (abc in method_names) {
multi2[abc]<- as.integer(stri_detect_fixed(multi2$methods_discussed, abc))
}
but the output does not match what I expect. Probably because male_sterilization is a substring of female_sterilization, it shows 1 (TRUE) for male_sterilization when only female_sterilization is present. This is shown in the Actual Output at row 2: it should show 0 (FALSE), since female_sterilization is what appears in the methods_discussed column at row 2. I also don't want to generate 0/1 (FALSE/TRUE) for rows where methods_discussed is -77 or blank; those cells should stay blank (all are highlighted in the Expected Output).
Actual Output
Expected Output
There is no error in the code, only in the output.
You can add word boundaries to fix that issue.
multi<- read.csv("multiple_responses.csv", header = T)
method_names = c('female_condoms', 'emergency', 'male_condoms', 'pill', 'injectables', 'iud', 'male_sterilization', 'female_sterilization')
for (abc in method_names) {
multi[abc]<- as.integer(grepl(paste0('\\b', abc, '\\b'), multi$methods_discussed))
}
multi[multi$methods_discussed %in% c('', -77), method_names] <- ''
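The effect of the word boundaries can be illustrated with a small Python sketch. It is not the R answer itself: the sample row and the helper name indicators are made up for illustration. Because "f", "e" and "m" are all word characters, there is no \b inside "female_sterilization", so the pattern for male_sterilization no longer matches there.

```python
import re

method_names = ["male_sterilization", "female_sterilization"]


def indicators(cell, names):
    """Return {name: 0 or 1}, counting whole-word matches only."""
    return {
        name: int(re.search(r"\b" + re.escape(name) + r"\b", cell) is not None)
        for name in names
    }


# Made-up row that mentions female_sterilization but not male_sterilization.
row = "emergency female_sterilization iud"

substring_hit = int("male_sterilization" in row)  # plain substring test
whole_word = indicators(row, method_names)        # word-boundary test
print(substring_hit, whole_word)
```

The plain substring test reports a hit for male_sterilization on this row, while the word-boundary version does not.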

How to check if subset is empty in R

I have a data set of weights over time (t). I need to identify weight outliers for each time (t) and then send a notification email.
I'm using boxplot(...)$out to identify the outliers. It seems to work, but I'm not sure about the following:
Is this the correct way to use boxplot?
I can't detect whether the boxplot has no outliers or whether it is empty (or maybe I'm using the wrong technique).
Possibly the subset itself is empty (this could be the root cause).
For now, I just need to trap the empty subset and check whether the out variable is empty or not.
Below is my R script:
#i am a comment, and the compiler doesn't care about me
#load our libraries
library(ggplot2)
library(mailR)
#some variables to be used later
from<-""
to<-""
getwd()
setwd("C:\\Temp\\rwork")
#read the data file into a data(d) variable
d<-read.csv("testdata.csv", header=TRUE) #file
#get the current time(t)
t <-format(Sys.time(),"%H")
#create a subset of d based on t
sbset<-subset(d,Time==t)
#identify if outlier exists then send an email report
out<-boxplot(sbset$weight)$out
if (length(out) != 0) {
  # create a boxplot of the subset
  boxplot(sbset$weight)
  subject = paste("Attention: An Outlier is detected for Scheduled Job Run on Hour ", t)
  message = toString(out) #sort(out)
} else {
  subject = paste("No Outlier Identified")
  message = ""
}
email <- send.mail(from = from,
                   to = to,
                   subject = subject,
                   body = message,
                   html = TRUE,
                   smtp = list(host.name = "smtp.gmail.com",
                               port = 465,
                               user.name = from,
                               passwd = "", # password of sender email
                               ssl = TRUE),
                   authenticate = TRUE,
                   send = TRUE)
DATA
weight,Time,Chick,x
42,0,1,1
51,2,1,1
59,4,1,1
64,6,1,1
76,8,1,1
93,10,1,1
106,12,1,1
125,14,1,1
149,16,1,1
171,18,1,1
199,20,1,1
205,21,1,1
40,0,2,1
49,2,2,1
58,4,2,1
72,6,2,1
84,8,2,1
103,10,2,1
122,12,2,1
138,14,2,1
162,16,2,1
187,18,2,1
209,20,2,1
215,21,2,1
43,0,3,1
39,2,3,1
55,4,3,1
67,6,3,1
84,8,3,1
99,10,3,1
115,12,3,1
138,14,3,1
163,16,3,1
187,18,3,1
198,20,3,1
202,21,3,1
42,0,4,1
49,2,4,1
56,4,4,1
67,6,4,1
74,8,4,1
87,10,4,1
102,12,4,1
108,14,4,1
136,16,4,1
154,18,4,1
160,20,4,1
157,21,4,1
41,0,5,1
42,2,5,1
48,4,5,1
60,6,5,1
79,8,5,1
106,10,5,1
141,12,5,1
164,14,5,1
197,16,5,1
199,18,5,1
220,20,5,1
223,21,5,1
41,0,6,1
49,2,6,1
59,4,6,1
74,6,6,1
97,8,6,1
124,10,6,1
141,12,6,1
148,14,6,1
155,16,6,1
160,18,6,1
160,20,6,1
157,21,6,1
41,0,7,1
49,2,7,1
57,4,7,1
71,6,7,1
89,8,7,1
112,10,7,1
146,12,7,1
174,14,7,1
218,16,7,1
250,18,7,1
288,20,7,1
305,21,7,1
42,0,8,1
50,2,8,1
61,4,8,1
71,6,8,1
84,8,8,1
93,10,8,1
110,12,8,1
116,14,8,1
126,16,8,1
134,18,8,1
125,20,8,1
42,0,9,1
51,2,9,1
59,4,9,1
68,6,9,1
85,8,9,1
96,10,9,1
90,12,9,1
92,14,9,1
93,16,9,1
100,18,9,1
100,20,9,1
98,21,9,1
41,0,10,1
44,2,10,1
52,4,10,1
63,6,10,1
74,8,10,1
81,10,10,1
89,12,10,1
96,14,10,1
101,16,10,1
112,18,10,1
120,20,10,1
124,21,10,1
43,0,11,1
51,2,11,1
63,4,11,1
84,6,11,1
112,8,11,1
139,10,11,1
168,12,11,1
177,14,11,1
182,16,11,1
184,18,11,1
181,20,11,1
175,21,11,1
41,0,12,1
49,2,12,1
56,4,12,1
62,6,12,1
72,8,12,1
88,10,12,1
119,12,12,1
135,14,12,1
162,16,12,1
185,18,12,1
195,20,12,1
205,21,12,1
41,0,13,1
48,2,13,1
53,4,13,1
60,6,13,1
65,8,13,1
67,10,13,1
71,12,13,1
70,14,13,1
71,16,13,1
81,18,13,1
91,20,13,1
96,21,13,1
41,0,14,1
49,2,14,1
62,4,14,1
79,6,14,1
101,8,14,1
128,10,14,1
164,12,14,1
192,14,14,1
227,16,14,1
248,18,14,1
259,20,14,1
266,21,14,1
41,0,15,1
49,2,15,1
56,4,15,1
64,6,15,1
68,8,15,1
68,10,15,1
67,12,15,1
68,14,15,1
41,0,16,1
45,2,16,1
49,4,16,1
51,6,16,1
57,8,16,1
51,10,16,1
54,12,16,1
42,0,17,1
51,2,17,1
61,4,17,1
72,6,17,1
83,8,17,1
89,10,17,1
98,12,17,1
103,14,17,1
113,16,17,1
123,18,17,1
133,20,17,1
142,21,17,1
39,0,18,1
35,2,18,1
43,0,19,1
48,2,19,1
55,4,19,1
62,6,19,1
65,8,19,1
71,10,19,1
82,12,19,1
88,14,19,1
106,16,19,1
120,18,19,1
144,20,19,1
157,21,19,1
41,0,20,1
47,2,20,1
54,4,20,1
58,6,20,1
65,8,20,1
73,10,20,1
77,12,20,1
89,14,20,1
98,16,20,1
107,18,20,1
115,20,20,1
117,21,20,1
40,0,21,2
50,2,21,2
62,4,21,2
86,6,21,2
125,8,21,2
163,10,21,2
217,12,21,2
240,14,21,2
275,16,21,2
307,18,21,2
318,20,21,2
331,21,21,2
41,0,22,2
55,2,22,2
64,4,22,2
77,6,22,2
90,8,22,2
95,10,22,2
108,12,22,2
111,14,22,2
131,16,22,2
148,18,22,2
164,20,22,2
167,21,22,2
43,0,23,2
52,2,23,2
61,4,23,2
73,6,23,2
90,8,23,2
Your first use of boxplot unnecessarily creates a plot. You can use
out <- boxplot.stats(sbset$weight)$out
for a little efficiency.
You are interested in the presence of rows, but length(sbset) returns the number of columns of a data frame. I suggest nrow or NROW instead.
if (NROW(out) > 0) {
  boxplot(sbset$weight)
  # ...
} else {
  # ...
}
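The same guard can be sketched in Python as an illustration only: check the size of the subset before asking for outliers, and treat an empty subset as having none. The helper names and sample rows are made up, and the quartile convention of statistics.quantiles is an assumption that is not guaranteed to match R's boxplot.stats on every data set.

```python
import statistics


def iqr_outliers(values):
    """Return the values beyond 1.5 * IQR from the quartiles; [] for tiny inputs."""
    if len(values) < 2:
        # An empty (or single-row) subset has nothing to compare against.
        return []
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    spread = 1.5 * (q3 - q1)
    return [v for v in values if v < q1 - spread or v > q3 + spread]


def weights_at(rows, hour):
    """Select the weights recorded at the given hour."""
    return [weight for weight, time in rows if time == hour]


# Made-up (weight, Time) rows; hour 5 has no rows at all.
rows = [(42, 0), (40, 0), (43, 0), (41, 0), (90, 0), (51, 2)]

empty_subset = weights_at(rows, 5)
outliers_at_0 = iqr_outliers(weights_at(rows, 0))
print(len(empty_subset), iqr_outliers(empty_subset), outliers_at_0)
```

An empty subset yields an empty outlier list without any special handling at the call site, so the email logic only needs to test whether the list is empty.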

IndexError: list index out of range, scores.append( (fields[0], fields[1]))

I'm trying to read a file and put its contents in a list. I have done this many times before and it has worked, but this time it throws the error "list index out of range".
The code is:
with open("File.txt") as f:
    scores = []
    for line in f:
        fields = line.split()
        scores.append((fields[0], fields[1]))
print(scores)
The text file is in the format:
Alpha:[0, 1]
Bravo:[0, 0]
Charlie:[60, 8, 901]
Foxtrot:[0]
I can't see why it is giving me this problem. Is it because I have more than one value for each item? Or is it the fact that I have a colon in my text file?
How can I get around this problem?
Thanks
If I understand you correctly, this code will print the desired result:
import re
with open("File.txt") as f:
    # Build a dictionary of scores: {name: scores}.
    scores = {}
    # Regular expressions to parse the team name and team scores from a line.
    patternScore = r'\[([^\]]+)\]'
    patternName = r'(.*):'
    for line in f:
        # Find the team name and its scores.
        fields = re.search(patternScore, line).groups()[0].split(', ')
        name = re.search(patternName, line).groups()[0]
        # Update the dictionary with the new value.
        scores[name] = fields
# Print the first score of each entry, followed by its name.
for key in scores:
    print(scores[key][0] + ':' + key)
You will receive the following output (the order of the lines depends on dictionary ordering; on Python 3.7 and later they appear in file order):
60:Charlie
0:Alpha
0:Bravo
0:Foxtrot
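As for why the original code fails: line.split() splits on whitespace, and a line such as Foxtrot:[0] contains none, so it yields a single field and fields[1] does not exist. A minimal alternative sketch, assuming every non-blank line has the form Name:[a, b, ...], splits on the colon and lets ast.literal_eval read the list. The sample lines are copied from the question; parse_line is a made-up helper name.

```python
import ast

sample_lines = ["Alpha:[0, 1]", "Bravo:[0, 0]", "Charlie:[60, 8, 901]", "Foxtrot:[0]"]

# Whitespace splitting gives only one field for "Foxtrot:[0]",
# which is why fields[1] raises IndexError on that line.
field_counts = [len(line.split()) for line in sample_lines]


def parse_line(line):
    """Split 'Name:[a, b, ...]' into (name, list_of_ints)."""
    name, _, raw_scores = line.strip().partition(":")
    return name, ast.literal_eval(raw_scores)


# Skip blank lines, which would otherwise fail to parse.
scores = [parse_line(line) for line in sample_lines if line.strip()]
print(field_counts)
print(scores)
```

This keeps the scores as integers rather than strings, which may or may not be what the later code expects.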

PIG - Scalar has more than one row in the output

I have a data set in the following format:
100000853384|RETAIL|OTHER|4.625|280000|360|02/2012|04/2012|31|31|1|23|801|NO|CASH-OUT REFINANCE|SF|1|INVESTOR|CA|945||FRM
100003735682|RETAIL|SUNTRUST MORTGAGE INC.|3.99|466000|360|01/2012|03/2012|80|80|2|30|788|NO|PURCHASE|SF|1|PRINCIPAL|MD|208||FRM
100006367485|CORRESPONDENT|PHH MORTGAGE CORPORATION|4|229000|360|02/2012|04/2012|67|67|2|36|794|NO|NO CASH-OUT REFINANCE|SF|1|PRINCIPAL|CA|959||FRM
The 4th field is ORIGINAL_INTEREST_RATE.
My question is:
What is the interest rate at which the largest number of people have taken a loan?
I wrote the following code:
LOAD DATA SET
loanAqiData = LOAD 'hdfs://masterNode:8020/home/hadoop/hadoop_data/LOAN_Acquisition_DATA/Acquisition_2012Q1.txt'
USING PigStorage('|')
AS
(
LOAN_IDENTIFIER:chararray
, CHANNEL:chararray
, SELLER_NAME:chararray
, ORIGINAL_INTEREST_RATE:float
, ORIGINAL_UNPAID_PRINCIPAL_BALANCE :float
, ORIGINAL_LOAN_TERM :float
, ORIGINATION_DATE:chararray
, FIRST_PAYMENT_DATE:chararray
, ORIGINAL_LOAN_TO_VALUE:float
, ORIGINAL_COMBINED_LOAN_TO_VALUE :float
, NUMBER_OF_BORROWERS:float
, DEBT_TO_INCOME_RATIO:float
, CREDIT_SCORE:float
, FIRST_TIME_HOME_BUYER_INDICATOR:chararray
, LOAN_PURPOSE:chararray
, PROPERTY_TYPE:chararray
, NUMBER_OF_UNITS:chararray
, OCCUPANCY_STATUS:chararray
, PROPERTY_STATE:chararray
, ZIP:chararray
, MORTGAGE_INSURANCE_PERCENTAGE:float
, PRODUCT_TYPE:chararray
);
-- Group by interest rate
grouped_by_interest_rate = group loanAqiData by ORIGINAL_INTEREST_RATE;
No of Counts for individual Interest Rate
count_for_specific_interest = FOREACH grouped_by_interest_rate GENERATE group as INTEREST_RATE, COUNT(loanAqiData) as NO_OF_PEOPLE;
Dump
dump count_for_specific_interest
Output
(3.625,1)
(3.75,2)
(3.875,26)
(3.99,8)
(4.0,21)
(4.1,1)
(4.125,15)
(4.25,16)
(4.375,15)
(4.376,26)
(4.5,10)
(4.625,3)
But I want to get
(3.875,26) and (4.376,26)
How can I get this?
Also, what if I want the interest rate at which the minimum number of people have taken a loan?
I'd suggest you use the MAX() function (http://pig.apache.org/docs/r0.11.0/func.html#max) to determine the highest number of people and then filter by this number.
Here is an example of code that should work (not tested):
FOREACH count_for_specific_interest {
  max_value = MAX($1.NO_OF_PEOPLE);
  GENERATE INTEREST_RATE, NO_OF_PEOPLE, max_value;
}
RESULT = FILTER count_for_specific_interest BY NO_OF_PEOPLE==max_value;
For the minimum, you should be able to use exactly the same script, replacing MAX() with MIN().
This is finally resolved.
Let me write down the steps:
1) Load
2) Group by interest rate
grp = group loanAqiData by ORIGINAL_INTEREST_RATE;
3) Count the number of people for each interest rate
cntForEachGrp = FOREACH grp GENERATE group as
INTEREST_RATE, COUNT(loanAqiData) as NO_OF_PEOPLE;
Output
(3.625,1) (3.75,2) (3.875,26) (3.99,8) (4.0,21) (4.1,1) (4.125,15) (4.25,16) (4.375,15) (4.376,26) (4.5,10) (4.625,3)
4) Group them all to put them in the same bag
grpALL = GROUP cntForEachGrp ALL;
(all,{(3.625,1),(3.75,2),(3.875,26),(3.99,8),(4.0,21),(4.1,1),(4.125,15),(4.25,16),(4.375,15),(4.376,1),(4.5,10),(4.625,3),(4.75,5),(4.875,4),(5.0,2),(5.25,1)})
5) Calculate the max number of people from the bag
maxVal = FOREACH grpALL {
  max_value = MAX(cntForEachGrp.NO_OF_PEOPLE);
  GENERATE cntForEachGrp.INTEREST_RATE, cntForEachGrp.NO_OF_PEOPLE, max_value as max_no;
}
grunt> describe maxVal;
maxVal: {{(INTEREST_RATE: float)},{(NO_OF_PEOPLE: long)},max_no: long}
dump maxVal;
({(3.625),(3.75),(3.875),(3.99),(4.0),(4.1),(4.125),(4.25),(4.375),(4.376),(4.5),(4.625),(4.75),(4.875),(5.0),(5.25)},{(1),(2),(26),(8),(21),(1),(15),(16),(15),(1),(10),(3),(5),(4),(2),(1)},26)
6) Filter the interest rates having the max number of people
RESULT=FILTER cntForEachGrp BY NO_OF_PEOPLE == maxVal.max_no ;
After dump, we get interest rate 3.875 with the max number of people, 26.
Why do we have to do
grpALL = GROUP cntForEachGrp ALL;
and
what is the meaning of the nested FOREACH in (5)?
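A rough way to picture it, sketched in Python rather than Pig: MAX is an aggregate that works on a bag, so all (rate, count) rows first have to be gathered into one bag, which is what GROUP ... ALL does. The nested FOREACH then runs once over that single group and produces one value, and the final FILTER compares every row with it. The counts below are the ones from the question's first dump; the variable names are made up.

```python
counts = [
    (3.625, 1), (3.75, 2), (3.875, 26), (3.99, 8), (4.0, 21), (4.1, 1),
    (4.125, 15), (4.25, 16), (4.375, 15), (4.376, 26), (4.5, 10), (4.625, 3),
]

# GROUP ... ALL: put every (rate, count) row into one collection so that
# an aggregate can see all of them at once.
all_rows = list(counts)

# Nested FOREACH over that single group: compute one aggregate value.
max_no = max(n for _, n in all_rows)
min_no = min(n for _, n in all_rows)

# FILTER: keep each row whose count equals the aggregate.
most_common = [(rate, n) for rate, n in counts if n == max_no]
least_common = [(rate, n) for rate, n in counts if n == min_no]
print(most_common)
print(least_common)
```

Because the aggregate is a single value, ties are kept automatically: every rate whose count equals it passes the filter.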
