Remove items from list while iterating through it in R

I have a data frame with observations on YouTube video_ids. When used in an API call, these ids allow me to fetch data on certain videos that I use to enrich my dataset.
First I created a list of unique video_ids with the below script. This returns a large list of 6350 unique elements.
video_ids <- list()
index <- 1
for (i in unique(df$video_id)) {
  video_ids[[index]] <- list(
    video_id = i
  )
  index <- index + 1
}
The API documentation asks for a comma-separated list of video ids. I did that by using unlist(video_ids), which returns a large vector. I cannot use this vector in the API call, because it is way too long.
The maximum amount of ids I can process in one API call is 50.
library(httr)

api_key <- "xxxx"
process_ids <- unlist(video_ids[1:50])  # pass the first 50 elements of the video_ids list
url <- modify_url("https://www.googleapis.com/youtube/v3/videos",
                  query = list(
                    "part" = "snippet",
                    "id" = paste(process_ids, collapse = ","),
                    "key" = api_key)
)
output <- content(GET(url), as = "parsed", type = "application/json")
What is the best approach for this in R? Can I loop through my list of 6350 elements by 50 items each loop, removing these items from the list when the loop completes?
My current script below loops through each video id in the list and fetches the data I need from the output of the API response. This works, but it is very slow and requires a lot of loops / API calls (6,350 of them). It can't be the most efficient way to approach this.
result <- list()
index <- 1
for (id in video_ids) {
  api_key <- "xxxx"
  url <- modify_url("https://www.googleapis.com/youtube/v3/videos",
                    query = list(
                      "part" = "snippet",
                      "id" = paste(id, collapse = ","),
                      "key" = api_key)
  )
  output <- content(GET(url), as = "parsed", type = "application/json")
  # Adds what I need from the output to a list called result
  for (t in output$items) {
    result[[index]] <- list(
      video_id = t$id,
      channel_id = t$snippet$channelId
    )
  }
  index <- index + 1
}

You can try the following:
Split the video ids into chunks of 50 values and pass each chunk to the API.
vec <- unlist(video_ids)
result <- lapply(split(vec, ceiling(seq_along(vec) / 50)), function(x) {
  url <- modify_url("https://www.googleapis.com/youtube/v3/videos",
                    query = list(
                      "part" = "snippet",
                      "id" = paste(x, collapse = ","),
                      "key" = api_key))
  content(GET(url), as = "parsed", type = "application/json")
})
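If you also want the same video_id / channel_id pairs your original per-video loop collected, you can flatten the batched responses afterwards. A sketch, assuming each parsed response keeps the $items structure shown in the question and that api_key is defined as above:
# Pull video_id / channel_id out of every batched response and stack them into one data frame
rows <- lapply(result, function(resp) {
  do.call(rbind, lapply(resp$items, function(t) {
    data.frame(video_id = t$id,
               channel_id = t$snippet$channelId,
               stringsAsFactors = FALSE)
  }))
})
video_channel_df <- do.call(rbind, rows)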

Related

Pass array in query params httr

How do I pass an array into the query of httr?
The request URL should look like this:
https://www.example.com/xyz?type=3&type=5
My current code looks like this:
POST(url,
     query = data.frame("something" = "somethingElse", type = ),  # <- how do I pass multiple type values here?
     add_headers(.headers = c("token" = token)),
     encode = "json")
How do I add those types from the url example to my R example?
The default encoding for httr doesn't like to use the same name multiple times, but it is possible to separate your values into lists which have duplicate names. Here's a helper function I've used that can help:
flattenbody <- function(x) {
  # A form/query can only have one value per name, so take
  # any values that contain vectors of length > 1 and
  # split them up
  # list(x = 1:2, y = "a") becomes list(x = 1, x = 2, y = "a")
  if (all(lengths(x) <= 1)) return(x)
  do.call("c", mapply(function(name, val) {
    if (length(val) == 1 || any(c("form_file", "form_data") %in% class(val))) {
      x <- list(val)
      names(x) <- name
      x
    } else {
      x <- as.list(val)
      names(x) <- rep(name, length(val))
      x
    }
  }, names(x), x, USE.NAMES = FALSE, SIMPLIFY = FALSE))
}
Then you could use it with something like
POST(url,
     query = flattenbody(list(something = "somethingElse", type = c(3, 5))),
     add_headers(.headers = c("token" = token)),
     encode = "json"
)
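To check what the flattened query turns into, you can also build the URL with httr::modify_url (a quick sketch using the example URL from the question); the duplicate names survive, so it should produce something like https://www.example.com/xyz?something=somethingElse&type=3&type=5:
library(httr)
# Inspect the query string produced by the flattened list with duplicate names
modify_url("https://www.example.com/xyz",
           query = flattenbody(list(something = "somethingElse", type = c(3, 5))))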

Want to write a for loop that calls an API with some conditions (if/else)

I want to call an API that gives me all my shop orders. I have a total of 86,000 orders, where the ID of the first order is 2 and the ID of the most recent order is 250,000. Order IDs are obviously not consecutive, for some reason I don't know yet, but that doesn't matter.
I started a simple script with a for loop where the ID gets updated in every loop, like this:
library(jsonlite)
library(httr)

user <- "my_name"
token <- "xyz"
y <- 0
urls <- rep("api.com/orders/", 250000)
for (i in urls) {
  y <- y + 1
  url <- paste0(i, y)
  a <- httr::GET(url, authenticate(user, token))
  a_content <- httr::content(a, as = "text", encoding = "UTF-8")
  a_json <- jsonlite::fromJSON(a_content, flatten = TRUE)
  ...
}
The problem here is that whenever there is an ID with no order, the loop stops with "{\"success\":false,\"message\":\"Order by id 1 not found\"}". So I somehow have to expand the code with some if/else statements, like "if the id does not match an order, proceed to the next order". Also, I want to write all orders into a new list.
Any help appreciated.
Maybe tryCatch can solve the problem. This url_exists function posted by user #hrbrmstr is called in the loop below.
Without testing, I would do something along the lines of:
base_url <- "api.com/orders/"
last_order_number <- 250000
for (i in seq.int(last_order_number)) {
  url <- paste0(base_url, i)
  if (url_exists(url)) {
    a <- httr::GET(url, authenticate(user, token))
    a_content <- httr::content(a, as = "text", encoding = "UTF-8")
    a_json <- jsonlite::fromJSON(a_content, flatten = TRUE)
  }
}
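Since the failing IDs apparently return a JSON body with a success flag rather than an HTTP error, another option is to wrap the request in tryCatch, skip anything that doesn't come back as a valid order, and collect the rest into a list. An untested sketch; the success field name is taken from the error message in the question:
library(httr)
library(jsonlite)

orders <- list()  # every successfully fetched order ends up here
for (i in seq_len(last_order_number)) {
  url <- paste0(base_url, i)
  a_json <- tryCatch({
    a <- httr::GET(url, authenticate(user, token))
    jsonlite::fromJSON(httr::content(a, as = "text", encoding = "UTF-8"), flatten = TRUE)
  }, error = function(e) NULL)

  # skip IDs that error out or that the API flags as not found
  if (is.null(a_json)) next
  if (!is.null(a_json$success) && identical(a_json$success, FALSE)) next
  orders[[as.character(i)]] <- a_json
}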

Passing many values to an API using R

I wish to scale my working API query to query many IDs and to store this in a nice rectangular data frame.
I need some help understanding how I can scale my code to take many input variables and then how to store them.
My working code is as follows:
pacman::p_load(tidyverse, httr, jsonlite, purrr)

path <- "https://npiregistry.cms.hhs.gov/api/?"
request <- httr::GET(url = path,
                     query = list(version = "2.0",
                                  number = 1154328938))
response <- content(request, as = "text", encoding = "UTF-8")
df <- jsonlite::fromJSON(response, flatten = TRUE) %>%
  data.frame()
providerData <- df %>%
  select(results.number,
         results.basic.name,
         results.basic.gender,
         results.basic.credential,
         results.taxonomies) %>%
  unnest_wider(results.taxonomies) %>%
  rename(Provider_NPI = results.number,
         Provider_Name = results.basic.name,
         Provider_Gender = results.basic.gender,
         Provider_Credentials = results.basic.credential,
         Provider_Taxonomy = desc,
         Provider_State = state) %>%
  select(-code, -license, -primary)
I now wish to query these 4 IDs and to store them in the same data format as the example above.
I have tried using lapply and building my own function but I don't fully understand how to create objects that store returned values.
My function looks as follows:
getNPI <- function(object) {
  httr::GET(url = path,
            query = list(version = "2.0",
                         number = object))
}
providerIDs <- c('1073666335',
                 '1841395357',
                 '1104023381',
                 '1477765634')
test <- lapply(providerIDs, getNPI)
I'm pretty certain I need some sort of object like a list or data frame to store the values of httr::GET but this is where I am falling down. The other piece is how to pull the appropriate values from the returned objects and to store them in a neat data frame.
Your help would be greatly appreciated.
You have to add the "cleaning" steps and return a data frame inside your getNPI function; then you can use do.call to combine all the data into a final data frame:
Example
getNPI <- function(object) {
  request <- httr::GET(url = path,
                       query = list(version = "2.0",
                                    number = object))
  df <- content(request, as = "text", encoding = "UTF-8") %>%
    jsonlite::fromJSON(., flatten = TRUE) %>%
    data.frame()
  df %>%
    select(results.number,
           results.basic.name,
           results.basic.gender,
           results.basic.credential,
           results.taxonomies) %>%
    unnest_wider(results.taxonomies)
  # Add more selections, mutations as needed
}
test <- lapply(providerIDs, getNPI)
# Use do.call with rbind to make the final df
final_df <- do.call("rbind", test)
Hope this can help you.
NOTE: For rbind via do.call to work as expected, all the column names have to be the same.
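If the column sets ever differ between IDs, an alternative (not from the original answer) is dplyr::bind_rows, which comes with tidyverse and fills missing columns with NA:
# Alternative to do.call("rbind", ...) that tolerates differing column sets
final_df <- dplyr::bind_rows(test)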

R web scraping: Unable to read the main table

I am new to web scraping. I am trying to scrape a table with the following code, but I am unable to get it. The source of the data is
https://www.investing.com/stock-screener/?sp=country::6|sector::a|industry::a|equityType::a|exchange::a%3Ceq_market_cap;1
library(XML)

url <- "https://www.investing.com/stock-screener/?sp=country::6|sector::a|industry::a|equityType::a|exchange::a%3Ceq_market_cap;1"
urlYAnalysis <- paste(url, sep = "")
webpage <- readLines(urlYAnalysis)
html <- htmlTreeParse(webpage, useInternalNodes = TRUE, asText = TRUE)
tableNodes <- getNodeSet(html, "//table")
Tab <- readHTMLTable(tableNodes[[1]])
I copied this approach from the link (Web scraping of key stats in Yahoo! Finance with R), where it is applied to Yahoo Finance data.
In my opinion, readHTMLTable(tableNodes[[12]]) should give me table 12, but when I try tableNodes[[12]] it always gives me an error:
Error in do.call(data.frame, c(x, alis)) :
  variable names are limited to 10000 bytes
Please suggest a way to extract the table and combine the data from the other tabs as well (Fundamental, Technical and Performance).
This data is returned dynamically as JSON. In R (which behaves differently from Python requests here) you get HTML, from which you can extract a given page's results as JSON. A page includes the info for all tabs and 50 records. The first page gives you the total record count, so you can calculate the total number of pages to loop over to get all results, combining them into a final dataframe as you go; in each new POST request you alter the pn param of the XHR POST body to the appropriate page number. There are two required headers.
It is probably a good idea to write a function that accepts a page number in its signature and returns that page's JSON as a dataframe, then apply it via a tidyverse package to handle the loop and the combining of results into the final dataframe.
library(httr)
library(jsonlite)
library(magrittr)
library(rvest)
library(stringr)

headers <- c(
  'User-Agent' = 'Mozilla/5.0',
  'X-Requested-With' = 'XMLHttpRequest'
)

data <- list(
  'country[]' = '6',
  'sector' = '7,5,12,3,8,9,1,6,2,4,10,11',
  'industry' = '81,56,59,41,68,67,88,51,72,47,12,8,50,2,71,9,69,45,46,13,94,102,95,58,100,101,87,31,6,38,79,30,77,28,5,60,18,26,44,35,53,48,49,55,78,7,86,10,1,34,3,11,62,16,24,20,54,33,83,29,76,37,90,85,82,22,14,17,19,43,89,96,57,84,93,27,74,97,4,73,36,42,98,65,70,40,99,39,92,75,66,63,21,25,64,61,32,91,52,23,15,80',
  'equityType' = 'ORD,DRC,Preferred,Unit,ClosedEnd,REIT,ELKS,OpenEnd,Right,ParticipationShare,CapitalSecurity,PerpetualCapitalSecurity,GuaranteeCertificate,IGC,Warrant,SeniorNote,Debenture,ETF,ADR,ETC,ETN',
  'exchange[]' = '109',
  'exchange[]' = '127',
  'exchange[]' = '51',
  'exchange[]' = '108',
  'pn' = '1',  # this is page number and should be altered in a loop over all pages. 50 results per page i.e. rows
  'order[col]' = 'eq_market_cap',
  'order[dir]' = 'd'
)

r <- httr::POST(url = 'https://www.investing.com/stock-screener/Service/SearchStocks', httr::add_headers(.headers = headers), body = data)
s <- r %>% read_html() %>% html_node('p') %>% html_text()
page1_data <- jsonlite::fromJSON(str_match(s, '(\\[.*\\])')[1, 2])
total_rows <- str_match(s, '"totalCount\":(\\d+),')[1, 2] %>% as.integer()
num_pages <- ceiling(total_rows / 50)
Here is my current attempt at combining, on which I would welcome feedback. It keeps all the returned columns for all pages, and I have to handle missing columns and different column ordering, as well as one column that is itself a data.frame. As the number of columns returned is far greater than those visible on the page, you could simply revise this to subset the returned columns with a mask for just the columns present in the tabs.
library(httr)
library(jsonlite)
library(magrittr)
library(rvest)
library(stringr)
library(tidyverse)
library(data.table)

headers <- c(
  'User-Agent' = 'Mozilla/5.0',
  'X-Requested-With' = 'XMLHttpRequest'
)

data <- list(
  'country[]' = '6',
  'sector' = '7,5,12,3,8,9,1,6,2,4,10,11',
  'industry' = '81,56,59,41,68,67,88,51,72,47,12,8,50,2,71,9,69,45,46,13,94,102,95,58,100,101,87,31,6,38,79,30,77,28,5,60,18,26,44,35,53,48,49,55,78,7,86,10,1,34,3,11,62,16,24,20,54,33,83,29,76,37,90,85,82,22,14,17,19,43,89,96,57,84,93,27,74,97,4,73,36,42,98,65,70,40,99,39,92,75,66,63,21,25,64,61,32,91,52,23,15,80',
  'equityType' = 'ORD,DRC,Preferred,Unit,ClosedEnd,REIT,ELKS,OpenEnd,Right,ParticipationShare,CapitalSecurity,PerpetualCapitalSecurity,GuaranteeCertificate,IGC,Warrant,SeniorNote,Debenture,ETF,ADR,ETC,ETN',
  'exchange[]' = '109',
  'exchange[]' = '127',
  'exchange[]' = '51',
  'exchange[]' = '108',
  'pn' = '1',  # this is page number and should be altered in a loop over all pages. 50 results per page i.e. rows
  'order[col]' = 'eq_market_cap',
  'order[dir]' = 'd'
)

get_data <- function(page_number) {
  data['pn'] <- page_number
  r <- httr::POST(url = 'https://www.investing.com/stock-screener/Service/SearchStocks', httr::add_headers(.headers = headers), body = data)
  s <- r %>% read_html() %>% html_node('p') %>% html_text()
  if (page_number == 1) { return(s) }
  else { return(data.frame(jsonlite::fromJSON(str_match(s, '(\\[.*\\])')[1, 2]))) }
}

clean_df <- function(df) {
  interim <- df['viewData']
  df_minus <- subset(df, select = -c(viewData))
  df_clean <- cbind.data.frame(c(interim, df_minus))
  return(df_clean)
}

initial_data <- get_data(1)
df <- clean_df(data.frame(jsonlite::fromJSON(str_match(initial_data, '(\\[.*\\])')[1, 2])))
total_rows <- str_match(initial_data, '"totalCount\":(\\d+),')[1, 2] %>% as.integer()
num_pages <- ceiling(total_rows / 50)

dfs <- map(.x = 2:num_pages,
           .f = ~clean_df(get_data(.)))

r <- rbindlist(c(list(df), dfs), use.names = TRUE, fill = TRUE)
write_csv(r, 'data.csv')

R: Loop to capture weather data from multiple station

I wanted to write a loop to capture weather data from multiple stations using the code below:
library(rwunderground)

sample_df <- data.frame(airportid = c("K6A2",
                                      "KAPA",
                                      "KASD",
                                      "KATL",
                                      "KBKF",
                                      "KBKF",
                                      "KCCO",
                                      "KDEN",
                                      "KFFC",
                                      "KFRG"),
                        stringsAsFactors = FALSE)
history_range(set_location(airport_code = sample_df$airportid),
              date_start = "20170815", date_end = "20170822",
              limit = 10, no_api = FALSE, use_metric = FALSE, key = get_api_key(),
              raw = FALSE, message = TRUE)
It won't work.
Currently, you are passing the entire vector (multiple character values) into the history_range call. Simply use lapply to pass the vector values iteratively; it even returns a list of history_range() return objects. Below, a defined function is used to pass the parameter; extend the function as needed to perform other operations.
capture_weather_data <- function(airport_id) {
  data <- history_range(set_location(airport_code = airport_id),
                        date_start = "20170815", date_end = "20170822",
                        limit = 10, no_api = FALSE, use_metric = FALSE, key = get_api_key(),
                        raw = FALSE, message = TRUE)
  write.csv(data, paste0("/path/to/output/", airport_id, ".csv"))
  return(data)
}

data_list <- lapply(sample_df$airportid, capture_weather_data)
Also, name each item in the list with the corresponding airport_id character value:
data_list <- setNames(data_list, sample_df$airportid)
data_list$K6A2 # 1st ITEM
data_list$KAPA # 2nd ITEM
data_list$KASD # 3rd ITEM
...
In fact, with sapply (the wrapper to lapply) you can generate the list and name each item in the same call, but the input vector must be a character type (not factor):
data_list <- sapply(as.character(sample_df$airportid), capture_weather_data,
                    simplify = FALSE, USE.NAMES = TRUE)
names(data_list)
I think this history_range function that you brought up, from the rwunderground package as I understand it, requires a Weather Underground API key. I went to the site and even signed up for it, but the email validation process needed to get a key (https://www.wunderground.com/weather/api) doesn't seem to be working correctly at the moment.
Instead I went to the CRAN mirror (https://github.com/cran/rwunderground/blob/master/R/history.R) and, from what I understand, the function accepts only one string as the set_location argument. The example provided in the documentation is
history(set_location(airport_code = "SEA"), "20130101")
So what you should be doing as a "loop", instead, is
airport_codes <- sample_df$airportid  # pull the ids out as a character vector
results <- vector("list", length(airport_codes))
for (i in seq_along(airport_codes)) {
  results[[i]] <- history_range(
    set_location(airport_code = airport_codes[[i]]),
    date_start = "20170815", date_end = "20170822",
    limit = 10, no_api = FALSE, use_metric = FALSE,
    key = get_api_key(),
    raw = FALSE, message = TRUE)
}
If this doesn't work, let me know. (Ack, somebody also gave another answer to this question while I was typing this up.)
