Using R - My for loop only works on the first iteration

I have a loop that on each iteration makes at least one and at most two API GET requests.
I loop through a list of movies; for each one I do a GET request to find the movie's ID in the API database, then a second request to get its reviews using that ID.
When I run the code it works for the first movie, but all the other movies remain empty, even when the first movie has no reviews.
Here is the code:
for (i in 1:9964) {
  id_url <- paste(url, search_title, key, COPY$Film[i], sep = "")
  # GET request for the movie
  api_call_id <- GET(id_url)
  # make readable
  read <- rawToChar(api_call_id$content)
  # turn the JSON into an object
  JSON <- fromJSON(read, flatten = TRUE)
  if (is.null(JSON$results$id[1])) {
    break
  } else {
    # get the movie id from the JSON
    id <- JSON$results$id[1]
  }
  # url to get the movie's ratings
  raiting_url <- paste(url, raiting, key, id, sep = "")
  # call
  api_call_raiting <- GET(raiting_url)
  # readable
  read <- rawToChar(api_call_raiting$content)
  # JSON
  JSON <- fromJSON(read, flatten = TRUE)
  if (is.null(JSON$rottenTomatoes)) {
    # set column value
    COPY[i, 'rTomatoes'] <- "No Review"
  } else {
    # set column value
    COPY[i, 'rTomatoes'] <- JSON$rottenTomatoes
  }
}
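One thing worth noting, as an assumption about the likely cause rather than a confirmed diagnosis: break exits the for loop entirely the first time a search returns no results, so every movie after that point stays empty. If the intent is to skip a movie with no match and carry on, next moves to the following iteration instead. A minimal sketch of that change:

  if (is.null(JSON$results$id[1])) {
    # no search result for this title: skip it and continue with the next movie
    next
  }
  # get the movie id from the JSON
  id <- JSON$results$id[1]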

Related

Web Scraping: How do I return specific user input forms in Python?

I'm having trouble getting the forms to return an exact match for the user input.
Emphasoft developer challenge:
Taking a list of tax form names (e.g. "Form W-2", "Form 1095-C"),
search the website and return some informational results.
Specifically, you must return the "Product Number", the "Title", and
the maximum and minimum years the form is available for download.
Taking a tax form name (e.g. "Form W-2") and a range of years
(inclusive; 2018-2020 should fetch three years), download all PDFs
available within that range.
import json
import os
import sys

import requests
from bs4 import BeautifulSoup

URL = 'https://apps.irs.gov/app/picklist/list/priorFormPublication.html?resultsPerPage=200&sortColumn=sortOrder&indexOfFirstRow=0&{param.strip}&isDescending=false'


def get_forms(list_tax_form: list):
    """
    function to get responses from irs.gov with all forms content
    :param list_tax_form: list of form names that we want to get info about
    :return: list of responses, one per form name
    """
    response_list = []  # list for all responses of form names
    with requests.session() as session:
        for param in list_tax_form:
            request_params = {'value': param,
                              'criteria': 'formNumber',
                              'submitSearch': 'Find',
                              }
            res = session.get(URL, params=request_params).content
            response_list.append(res)
    return response_list


def parse_responses(list_tax_form: list):
    """
    function to get all form names, titles and years from the previous func's return
    :param list_tax_form: list of form names that we want to get info about
    :return: lists of form names, titles, years
    """
    responses = get_forms(list_tax_form)
    # empty lists to fill with the received information for all names, years, and titles
    td_form_name, td_form_title, td_form_rev_year = [], [], []
    for response in responses:
        soup = BeautifulSoup(response, 'lxml')
        td_name = soup.find_all('td', {'class': 'LeftCellSpacer'})
        td_title = soup.find_all('td', {'class': 'MiddleCellSpacer'})
        td_rev_year = soup.find_all('td', {'class': 'EndCellSpacer'})
        td_form_name.extend(td_name)
        td_form_title.extend(td_title)
        td_form_rev_year.extend(td_rev_year)
    return td_form_name, td_form_title, td_form_rev_year


def format_responses(list_tax_form: list):
    """
    function to format all responses for all forms we got (task 1)
    :param list_tax_form: list of form names that we want to get info about
    :return: formatted names, links, years
    """
    td_names, td_titles, td_years = parse_responses(list_tax_form)
    names = [name.text.strip() for name in td_names]
    links = [link.find('a')['href'] for link in td_names]
    titles = [title.text.strip() for title in td_titles]
    years = [int(year.text.strip()) for year in td_years]
    set_names = set(names)
    final_dict = []
    # loop to create a dictionary of result information with years of tax form available to download
    for name in set_names:
        max_year = 0
        min_year = max(years)
        dict1 = {'form_number': name}
        for index, p_name in enumerate(names):
            if p_name == name:
                if years[index] > max_year:
                    max_year = years[index]
                elif years[index] < min_year:
                    min_year = years[index]
                dict1['form_title'] = titles[index]
        dict1['max_year'] = max_year
        dict1['min_year'] = min_year
        final_dict.append(dict1)
    print(json.dumps(final_dict, indent=2))
    return names, links, years


def download_files(list_tax_form):
    """
    Module to download PDF files of the form_name the user inputs (task 2).
    :param list_tax_form: list of form names that we want to get info about
    :return: message to the user on whether the files were created
    """
    names, links, years = format_responses(list_tax_form)
    form_name = input('enter form name: ')
    if form_name in names:
        print('form exists. enter years range')
        form_year1 = int(input('start year to analysis: '))
        form_year2 = int(input('end year to analysis: '))
        try:
            os.mkdir(form_name)
        except FileExistsError:
            pass
        # indices to define the range of this form name in the list of all tax form names
        r_index = names.index(form_name)  # index of first form_name mention in the list
        l_index = names.index(form_name)  # index of last form_name mention in the list
        for name in names:
            if name == form_name:
                r_index += 1
        years = years[l_index:r_index]
        if form_year1 < form_year2:
            range_years = range(form_year1, form_year2 + 1)
            for year in range_years:
                if year in years:
                    link = links[years.index(year)]
                    form_file = requests.get(link, allow_redirects=True)
                    open(f'{form_name}/{form_name}_{str(year)}.pdf', 'wb').write(form_file.content)
            print(f'files saved to {form_name}/ directory!')
    else:
        print('input correct form name!')


if __name__ == '__main__':
    tax_list = sys.argv[1:]  # form names
    download_files(tax_list)
(ex: "Form W-2" should not return "Form W-2 P")
When this file is ran, it is displaying other unrelated results.
How can I resolve this issue to display only specified user requests?

The New York Times API with R

I'm trying to get article information using The New York Times API. The csv file I get doesn't reflect my filter query. For example, I restricted the source to 'The New York Times', but the file I got contains other sources as well.
I would like to ask why the filter query doesn't work.
Here's the code.
if (!require("jsonlite")) install.packages("jsonlite")
library(jsonlite)
api = "apikey"
nytime = function () {
url = paste('http://api.nytimes.com/svc/search/v2/articlesearch.json?',
'&fq=source:',("The New York Times"),'AND type_of_material:',("News"),
'AND persons:',("Trump, Donald J"),
'&begin_date=','20160522&end_date=','20161107&api-key=',api,sep="")
#get the total number of search results
initialsearch = fromJSON(url,flatten = T)
maxPages = round((initialsearch$response$meta$hits / 10)-1)
#try with the max page limit at 10
maxPages = ifelse(maxPages >= 10, 10, maxPages)
#creat a empty data frame
df = data.frame(id=as.numeric(),source=character(),type_of_material=character(),
web_url=character())
#save search results into data frame
for(i in 0:maxPages){
#get the search results of each page
nytSearch = fromJSON(paste0(url, "&page=", i), flatten = T)
temp = data.frame(id=1:nrow(nytSearch$response$docs),
source = nytSearch$response$docs$source,
type_of_material = nytSearch$response$docs$type_of_material,
web_url=nytSearch$response$docs$web_url)
df=rbind(df,temp)
Sys.sleep(5) #sleep for 5 second
}
return(df)
}
dt = nytime()
write.csv(dt, "trump.csv")
Here's the csv file I got.
It seems you need to put the () inside the quotes, not outside. Like this:
url = paste('http://api.nytimes.com/svc/search/v2/articlesearch.json?',
            '&fq=source:',"(The New York Times)",'AND type_of_material:',"(News)",
            'AND persons:',"(Trump, Donald J)",
            '&begin_date=','20160522&end_date=','20161107&api-key=',api,sep="")
https://developer.nytimes.com/docs/articlesearch-product/1/overview
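Even with the parentheses moved inside the quotes, the fq value still contains spaces, which can trip up some HTTP clients when left raw in the URL. A small sketch of one way to guard against that, assuming the same api variable as above (it uses base R's URLencode and is only an illustration, not part of the original answer):

fq <- 'source:("The New York Times") AND type_of_material:("News") AND persons:("Trump, Donald J")'
url <- paste0('http://api.nytimes.com/svc/search/v2/articlesearch.json?',
              'fq=', URLencode(fq, reserved = TRUE),
              '&begin_date=20160522&end_date=20161107&api-key=', api)
# the encoded URL can then be used exactly as before
initialsearch <- fromJSON(url, flatten = TRUE)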

How to check if subset is empty in R

I have a set of data with weight over time (t). I need to identify weight outliers for every time (t), after which I need to send a notification email.
I'm using boxplot(...)$out to identify the outliers. It seems to work, but I'm not sure whether:
it's the correct way to use the boxplot;
I can detect if the boxplot has no outliers or is empty (or maybe I'm using the wrong technique);
or possibly the subset itself is empty (which could be the root cause).
For now, I just need to trap the empty subset and check whether the out variable is empty or not.
Below is my R script code:
# i am a comment, and the compiler doesn't care about me
# load our libraries
library(ggplot2)
library(mailR)

# some variables to be used later
from <- ""
to <- ""

getwd()
setwd("C:\\Temp\\rwork")

# read the data file into a data (d) variable
d <- read.csv("testdata.csv", header = TRUE)
# get the current time (t)
t <- format(Sys.time(), "%H")
# create a subset of d based on t
sbset <- subset(d, Time == t)

# identify if an outlier exists, then send an email report
out <- boxplot(sbset$weight)$out
if (length(out) != 0) {
  # create a boxplot of the subset
  boxplot(sbset$weight)
  subject = paste("Attention: An Outlier is detected for Scheduled Job Run on Hour ", t)
  message = toString(out)  # sort(out)
} else {
  subject = paste("No Outlier Identified")
  message = ""
}

email <- send.mail(from = from,
                   to = to,
                   subject = subject,
                   body = message,
                   html = T,
                   smtp = list(host.name = "smtp.gmail.com",
                               port = 465,
                               user.name = from,
                               passwd = "",  # password of sender email
                               ssl = TRUE),
                   authenticate = TRUE,
                   send = TRUE)
DATA
weight,Time,Chick,x
42,0,1,1
51,2,1,1
59,4,1,1
64,6,1,1
76,8,1,1
93,10,1,1
106,12,1,1
125,14,1,1
149,16,1,1
171,18,1,1
199,20,1,1
205,21,1,1
40,0,2,1
49,2,2,1
58,4,2,1
72,6,2,1
84,8,2,1
103,10,2,1
122,12,2,1
138,14,2,1
162,16,2,1
187,18,2,1
209,20,2,1
215,21,2,1
43,0,3,1
39,2,3,1
55,4,3,1
67,6,3,1
84,8,3,1
99,10,3,1
115,12,3,1
138,14,3,1
163,16,3,1
187,18,3,1
198,20,3,1
202,21,3,1
42,0,4,1
49,2,4,1
56,4,4,1
67,6,4,1
74,8,4,1
87,10,4,1
102,12,4,1
108,14,4,1
136,16,4,1
154,18,4,1
160,20,4,1
157,21,4,1
41,0,5,1
42,2,5,1
48,4,5,1
60,6,5,1
79,8,5,1
106,10,5,1
141,12,5,1
164,14,5,1
197,16,5,1
199,18,5,1
220,20,5,1
223,21,5,1
41,0,6,1
49,2,6,1
59,4,6,1
74,6,6,1
97,8,6,1
124,10,6,1
141,12,6,1
148,14,6,1
155,16,6,1
160,18,6,1
160,20,6,1
157,21,6,1
41,0,7,1
49,2,7,1
57,4,7,1
71,6,7,1
89,8,7,1
112,10,7,1
146,12,7,1
174,14,7,1
218,16,7,1
250,18,7,1
288,20,7,1
305,21,7,1
42,0,8,1
50,2,8,1
61,4,8,1
71,6,8,1
84,8,8,1
93,10,8,1
110,12,8,1
116,14,8,1
126,16,8,1
134,18,8,1
125,20,8,1
42,0,9,1
51,2,9,1
59,4,9,1
68,6,9,1
85,8,9,1
96,10,9,1
90,12,9,1
92,14,9,1
93,16,9,1
100,18,9,1
100,20,9,1
98,21,9,1
41,0,10,1
44,2,10,1
52,4,10,1
63,6,10,1
74,8,10,1
81,10,10,1
89,12,10,1
96,14,10,1
101,16,10,1
112,18,10,1
120,20,10,1
124,21,10,1
43,0,11,1
51,2,11,1
63,4,11,1
84,6,11,1
112,8,11,1
139,10,11,1
168,12,11,1
177,14,11,1
182,16,11,1
184,18,11,1
181,20,11,1
175,21,11,1
41,0,12,1
49,2,12,1
56,4,12,1
62,6,12,1
72,8,12,1
88,10,12,1
119,12,12,1
135,14,12,1
162,16,12,1
185,18,12,1
195,20,12,1
205,21,12,1
41,0,13,1
48,2,13,1
53,4,13,1
60,6,13,1
65,8,13,1
67,10,13,1
71,12,13,1
70,14,13,1
71,16,13,1
81,18,13,1
91,20,13,1
96,21,13,1
41,0,14,1
49,2,14,1
62,4,14,1
79,6,14,1
101,8,14,1
128,10,14,1
164,12,14,1
192,14,14,1
227,16,14,1
248,18,14,1
259,20,14,1
266,21,14,1
41,0,15,1
49,2,15,1
56,4,15,1
64,6,15,1
68,8,15,1
68,10,15,1
67,12,15,1
68,14,15,1
41,0,16,1
45,2,16,1
49,4,16,1
51,6,16,1
57,8,16,1
51,10,16,1
54,12,16,1
42,0,17,1
51,2,17,1
61,4,17,1
72,6,17,1
83,8,17,1
89,10,17,1
98,12,17,1
103,14,17,1
113,16,17,1
123,18,17,1
133,20,17,1
142,21,17,1
39,0,18,1
35,2,18,1
43,0,19,1
48,2,19,1
55,4,19,1
62,6,19,1
65,8,19,1
71,10,19,1
82,12,19,1
88,14,19,1
106,16,19,1
120,18,19,1
144,20,19,1
157,21,19,1
41,0,20,1
47,2,20,1
54,4,20,1
58,6,20,1
65,8,20,1
73,10,20,1
77,12,20,1
89,14,20,1
98,16,20,1
107,18,20,1
115,20,20,1
117,21,20,1
40,0,21,2
50,2,21,2
62,4,21,2
86,6,21,2
125,8,21,2
163,10,21,2
217,12,21,2
240,14,21,2
275,16,21,2
307,18,21,2
318,20,21,2
331,21,21,2
41,0,22,2
55,2,22,2
64,4,22,2
77,6,22,2
90,8,22,2
95,10,22,2
108,12,22,2
111,14,22,2
131,16,22,2
148,18,22,2
164,20,22,2
167,21,22,2
43,0,23,2
52,2,23,2
61,4,23,2
73,6,23,2
90,8,23,2
Your first use of boxplot unnecessarily creates a plot; you can use
out <- boxplot.stats(sbset$weight)$out
for a little more efficiency.
You are interested in the presence of rows, but length(sbset) will return the number of columns. I suggest nrow or NROW instead.
if (NROW(out) > 0) {
boxplot(sbset$weight)
# ...
} else {
# ...
}
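Putting both suggestions together, here is a minimal sketch (assuming the same sbset, t, subject and message variables as in the question) that first traps an empty subset and only then looks for outliers:

if (nrow(sbset) == 0) {
  # hour t has no rows at all, so there is nothing to test
  subject <- paste("No data found for Scheduled Job Run on Hour", t)
  message <- ""
} else {
  out <- boxplot.stats(sbset$weight)$out  # outliers without drawing a plot
  if (NROW(out) > 0) {
    boxplot(sbset$weight)
    subject <- paste("Attention: An Outlier is detected for Scheduled Job Run on Hour", t)
    message <- toString(out)
  } else {
    subject <- paste("No Outlier Identified")
    message <- ""
  }
}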

RSelenium - Storing data in an array

I'm extracting event descriptions from a list of events on my website.
Each event is an href link that goes to another page where we can find the image and the description of the event. I'm trying to store the image URL and the description of all events in an array, so I used the code below at the end of my loop, but I only get the image and the description of the last event looped:
m <- c(images_of_events)
n <- c(description_of_events)

cc <- remDr$findElement(using = "css", "[class = '_24er']")
cc <- remDr$getPageSource()
page_events <- read_html(cc[[1]][1])
links_events_data <- html_nodes(page_events, '._24er > table > tbody > tr > td > div > div._4dmk > a ')
events_urls <- html_attr(links_events_data, "href")

# the loop over each event
for (i in events_urls) {
  remDr$navigate(paste("localhost://www.mywebsite", i, sep = ""))
  # get the image
  imagewebElem <- remDr$findElement(using = "class", "scaledImageFitWidth")
  images_of_events <- imagewebElem$getElementAttribute("src")
  descriptionwebElem <- remDr$findElement(using = "css", "[class = '_63ew']")
  descriptionwebElem <- remDr$getPageSource()
  page_event_description <- read_html(descriptionwebElem[[1]][1])
  events_desc <- html_nodes(page_event_description, '._63ew > span')
  description_of_events <- html_text(events_desc)
  m <- c(images_of_events)
  n <- c(description_of_events)
}
To save values in an array in R you have to either:
1) create the array/data.frame first, dta <- data.frame(m = c(), n = c()), and then assign into it with dta[i, 1] <- images_of_events and dta[i, 2] <- description_of_events, where i is a numeric iterator; or
2) create the array/data.frame and use rbind to append values, like dta <- rbind(dta, data.frame(m = images_of_events, n = description_of_events)).
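A minimal sketch of option 2 applied to the loop above, reusing the variable names and selectors from the question (collapsing the description spans into one string per event is an assumption about the page structure):

events <- data.frame(image = character(), description = character(), stringsAsFactors = FALSE)
for (i in events_urls) {
  remDr$navigate(paste("localhost://www.mywebsite", i, sep = ""))
  imagewebElem <- remDr$findElement(using = "class", "scaledImageFitWidth")
  images_of_events <- imagewebElem$getElementAttribute("src")[[1]]
  page_event_description <- read_html(remDr$getPageSource()[[1]])
  events_desc <- html_nodes(page_event_description, '._63ew > span')
  description_of_events <- paste(html_text(events_desc), collapse = " ")
  # append one row per event instead of overwriting m and n on every iteration
  events <- rbind(events,
                  data.frame(image = images_of_events,
                             description = description_of_events,
                             stringsAsFactors = FALSE))
}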

R Language: getCommentReplies() error:

Despite reading the existing answers, I still don't know how to fix this problem.
In the first phase I extract the comments for each post, which works. In the second phase, for each comment, I try to extract the corresponding replies (i.e. in my program when i=1 [1st post] and j=1 [1st comment]).
However, as soon as getCommentReplies() tries to extract the very first reply of the very first comment of the first post, it throws the following error:
Error in data.frame(from_id = json$from$id, from_name = json$from$name, :
arguments imply differing number of rows: 0, 1
my program:
load ("fb_oauth")
fb_page_no_nullz<-getPage(page="gtbank", token=fb_oauth,n=130, since= '2018/3/10', until= '2018/3/12',feed=TRUE,api = 'v2.11') #Extract THE LATEST n=7 FCMB posts excluding Null rows from FCMB page# into variable/vector fb_page .
no_of_rows=na.omit(nrow(fb_page_no_nullz)) #Count the number of rows without NULLS and store in var no_of_rows
i=1
all_comments<-NULL
while (i<=no_of_rows)
{
postt <- getPost(post=fb_page_no_nullz$id[i], n=200, token=fb_oauth, comments = TRUE, likes=FALSE, api= "v2.11" ) #Extract N comments for each post
no_of_row_c=na.omit(nrow(postt$comments))
if(no_of_row_c!=0) #If their are no comments for each post then pick the next post.
{
comment_details<-postt$comments[,1:7]
comment_details$from_id<-comment_details$from_name<-NULL # This line removes the columns from_id AND from_name from the v data Frame
j =1
while (j<=no_of_row_c)
{
repl<-NULL
repl<-getCommentReplies(comment_details$id[i],token=fb_oauth,n=200,replies=TRUE,likes=FALSE,n.replies=100)
j=j+1
}
}
#all_comments$from_id<-all_comments$from_name<-NULL # This line removes the columns from_id AND from_name from the v data Frame
all_comments<-rbind(all_comments,comment_details) # Cummutatively append all comments for all posts into the data frame all_comments
i=i+1
}
#allPC<-merge(all_comments,fb_page_no_nullz, by.x= substr(c("id"),1,14), by.y=substr(c("id"),14,30),all.x = TRUE)
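The error message suggests the package is trying to build a data frame from a reply whose from fields are missing, so one pragmatic workaround, sketched here as an assumption rather than a confirmed fix, is to wrap the call so that a single unparseable comment does not abort the whole run (note also that the index passed to getCommentReplies is probably meant to be j, the comment counter, rather than i):

while (j <= no_of_row_c) {
  repl <- tryCatch(
    getCommentReplies(comment_details$id[j], token = fb_oauth, n = 200,
                      replies = TRUE, likes = FALSE, n.replies = 100),
    error = function(e) {
      # skip replies that cannot be parsed instead of stopping the loop
      message("Could not fetch replies for comment ", comment_details$id[j], ": ", conditionMessage(e))
      NULL
    }
  )
  j = j + 1
}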
