Faster download for paginated nested JSON data from API in R? - r

I am downloading nested json data from the UN's SDG Indicators API, but using a loop for 50,006 paginations is waaayyy too slow to ever complete. Is there a better way?
https://unstats.un.org/SDGAPI/swagger/#!/Indicator/V1SdgIndicatorDataGet
I'm working in RStudio on a Windows laptop. Getting to the json nested data and structuring into a dataframe was a hard-fought win, but dealing with the paginations has me stumped. No response from the UN statistics email.
Maybe an 'apply' would do it? I only need data from 2004, 2007, and 2011 - maybe I can filter, but I don't think that would help the fundamental issue.
I'm probably misunderstanding the API structure - I can't see how querying 50,006 pages individually can be functional for anyone. Thanks for any insight!
library(dplyr)
library(httr)
library(jsonlite)
#Get data from the first page, and initialize dataframe and data types
page1 <- fromJSON("https://unstats.un.org/SDGAPI/v1/sdg/Indicator/Data", flatten = TRUE)
#Get number of pages from the field of the first page
pages <- page1$totalPages
SDGdata<- data.frame()
for(j in 1:25){
SDGdatarow <- rbind(page1$data[j,1:16])
SDGdata <- rbind(SDGdata,SDGdatarow)
}
SDGdata[1] <- as.character(SDGdata[[1]])
SDGdata[2] <- as.character(SDGdata[[2]])
SDGdata[3] <- as.character(SDGdata[[3]])
#Loop through all the rest of the pages
baseurl <- ("https://unstats.un.org/SDGAPI/v1/sdg/Indicator/Data")
for(i in 2:pages){
mydata <- fromJSON(paste0(baseurl, "?page=", i), flatten=TRUE)
message("Retrieving page ", i)
for(j in 1:25){
SDGdatarow <- rbind(mydata$data[j,1:16])
rownames(SDGdatarow) <- as.numeric((i-1)*25+j)
SDGdata <- rbind.data.frame(SDGdata,SDGdatarow)
}
}
I do get the data I want, and in a nice dataframe, but inevitably the query has a connection issue after a few hundred pages, or my laptop shuts down etc. It's about 5 seconds a page. 5*50,006/3600 ~= 70 hours.

I think I (finally) figured out a workable solution: I can set the # of elements per page, resulting in a manageable number of pages to call. (I also filtered for just the 3 years I want which reduces the data). Through experimentation I figured out that about 1/10th of the elements download ok, so I set the call to 1/10 per page, with a loop for 10 pages. Takes about 20 minutes, but better than 70 hours, and works without losing the connection.
#initiate the df
SDGdata<- data.frame()
# call to get the # elements with the years filter
page1 <- fromJSON("https://unstats.un.org/SDGAPI/v1/sdg/Indicator/Data?timePeriod=2004&timePeriod=2007&timePeriod=2011", flatten = TRUE)
perpage <- ceiling(page1$totalElements/10)
ptm <- proc.time()
for(i in 1:10){
SDGpage <- fromJSON(paste0("https://unstats.un.org/SDGAPI/v1/sdg/Indicator/Data?timePeriod=2004&timePeriod=2007&timePeriod=2011&pageSize=",perpage,"&page=",i), flatten = TRUE)
message("Retrieving page ", i, " :", (proc.time() - ptm) [3], " seconds")
SDGdata <- rbind(SDGdata,SDGpage$data[,1:16])
}

Related

efficient data collection from API using R

I am trying to get data from the UN Stats API for a list of indicators (https://unstats.un.org/SDGAPI/swagger/).
I have constructed a loop that can be used to get the data for a single indicator (code is below). The loop can be applied to multiple indicators as needed. However, this is likely to cause problems relating to large numbers of requests, potentially being perceived as a DDoS attack and taking far too long.
Is there an alternative way to get data for an indicator for all years and countries without making a ridiculous number of requests or in a more efficient manner than below? I suppose this question likely applies more generally to other similar APIs as well. Any help would be most welcome.
Please note: I have seen the post here (Faster download for paginated nested JSON data from API in R?) but it is not quite what I am looking for.
Minimal working example
# libraries
library(jsonlite)
library(dplyr)
library(purrr)
# get the meta data
page = ("https://unstats.un.org/SDGAPI//v1/sdg/Series/List")
sdg_meta = fromJSON(page) %>% as.data.frame()
# parameters
PAGE_SIZE =100000
N_PAGES = 5
FULL_DF = NULL
my_code = "SI_COV_SOCINS"
# loop to go over pages
for(i in seq(1,N_PAGES,1)){
ind = which(sdg_meta$code == my_code)
cat(paste0("Processing : ", my_code, " ", i, " of ",N_PAGES, " \n"))
my_data_page <- c(paste0("https://unstats.un.org/SDGAPI/v1/sdg/Series/Data?seriesCode=",my_code,"&page=",i,"pageSize=",PAGE_SIZE))
df <- fromJSON(my_data_page) #depending on the data you are calling, you will get a list
df= df$data %>% as.data.frame() %>% distinct()
# break the loop when no more to add
if(is_empty(df)){
break
}
FULL_DF = rbind(FULL_DF,df)
Sys.sleep(5) # sleep to avoid any issues
}

Can't scrape all the rows using rvest

My goal is to scrape all this diamond data from bluenile.com. I've got some code that seems to be doing that, but it only grabs the first 61 rows.
By the way, I am using the "SelectorGadget" chrome plugin to get the CSS selectors. If I scroll down a little, the highlighting stops. Is it something to do with the website?
library('rvest')
le_url <- "https://www.bluenile.com/diamonds/round-cut?track=DiaSearchRDmodrn"
webpage <- read_html(le_url)
shape_data_html <- html_nodes(webpage,'.shape')
price_data_html <- html_nodes(webpage,'.price')
carat_data_html <- html_nodes(webpage,'.carat')
cut_data_html <- html_nodes(webpage,'.cut')
color_data_html <- html_nodes(webpage,'.color')
clarity_data_html <- html_nodes(webpage,'.clarity')
#Converting data to text
shape_data <- html_text(shape_data_html)
price_data <- html_text(price_data_html)
carat_data <- html_text(carat_data_html)
cut_data <- html_text(cut_data_html)
color_data <- html_text(color_data_html)
clarity_data <- html_text(clarity_data_html)
# make a data.frame
le_mat <- cbind(shape_data, price_data, carat_data, cut_data, color_data, clarity_data)
le_df <- le_mat[-1,]
colnames(le_df) <- le_mat[1,]
Data is dynamically added via API call as you scroll down page. The API call has a query string that allows you to specify startIndex (start row) and number of results per page (pageSize). The results per page max seems to be 1000. The return is json from which you can extract all the info you want including the total number of rows; accessed via key of countRaw. So, you can make a request for the initial 1000, parse out the total row count, countRaw, and perform a loop, adjusting the row startIndex parameter until you have all the results.
You can use a json parser e.g. jsonlite to handle the json response.
Example API endpoint call for first 1000 results:
https://www.bluenile.com/api/public/diamond-search-grid/v2?startIndex=0&pageSize=1000&_=1562612289615&sortDirection=asc&sortColumn=default&shape=RD&hasVisualization=true&isFiltersExpanded=false&astorFilterActive=false&country=USA&language=en-us&currency=USD&productSet=BN&skus=
library(jsonlite)
url <- 'https://www.bluenile.com/api/public/diamond-search-grid/v2?startIndex=0&pageSize=1000&_=1562612289615&sortDirection=asc&sortColumn=default&shape=RD&hasVisualization=true&isFiltersExpanded=false&astorFilterActive=false&country=USA&language=en-us&currency=USD&productSet=BN&skus='
r <- jsonlite::fromJSON(url)
print(r$countRaw)
You get a list of 8 elements from each call. r$results is a dataframe containing info of main interest.
Part of response:
Given the indicated result count I was expecting I could do something like (bearing in mind my limited R experience):
total <- r$countRaw
url2 <- 'https://www.bluenile.com/api/public/diamond-search-grid/v2?startIndex=placeholder&pageSize=1000&_=1562612289615&sortDirection=asc&sortColumn=default&shape=RD&hasVisualization=true&isFiltersExpanded=false&astorFilterActive=false&country=USA&language=en-us&currency=USD&productSet=BN&skus='
if(total > 1000){
for(i in seq(1000, total + 1, by = 1000)){
newUrl <- gsub("placeholder", i , url2)
newdf <- jsonlite::fromJSON(newUrl)$results
# do something with df e.g. merge
}
}
However, it seems that there are only results for first two calls i.e. the initial df from r$results shown above and then:
url2 <- 'https://www.bluenile.com/api/public/diamond-search-grid/v2?startIndex=1000&pageSize=1000&_=1562612289615&sortDirection=asc&sortColumn=default&shape=RD&hasVisualization=true&isFiltersExpanded=false&astorFilterActive=false&country=USA&language=en-us&currency=USD&productSet=BN&skus='
r <- jsonlite::fromJSON(url2)
df2 <- r$results
Searching the page with css selector .row yields 1002 results versus the indicated total All diamonds number; so, I think there is some exploration to do around filters.

A continuation of... Extracting data from an API using R

I'm a super new at this and working on R for my thesis. The code in this answer finally worked for me (Extracting data from an API using R), but I can't figure out how to add a loop to it. I keep getting the first page of the API when I need all 3360.
Here's the code:
library(httr)
library(jsonlite)
r1 <- GET("http://data.riksdagen.se/dokumentlista/?
sok=&doktyp=mot&rm=&from=2000-01-01&tom=2017-12- 31&ts=&bet=&tempbet=&nr=&org=&iid=&webbtv=&talare=&exakt=&planering=&sort=rel&sortorder=desc&rapport=&utformat=json&a=s#soktraff")
r2 <- rawToChar(r1$content)
class(r2)
r3 <- fromJSON(r2)
r4 <- r3$dokumentlista$dokument
By the time I reach r4, it's already a data frame.
Please and thank you!
Edit: originally, I couldn't get a url that had the page as info within it. Now I have it (below). I still haven't been able to loop it.
"http://data.riksdagen.se/dokumentlista/?sok=&doktyp=mot&rm=&from=2000-01-01&tom=2017-12-31&ts=&bet=&tempbet=&nr=&org=&iid=&webbtv=&talare=&exakt=&planering=&sort=rel&sortorder=desc&rapport=&utformat=json&a=s&p="
I think you can extract the url of the next page from r3 as follows:
next_url <- r3$dokumentlista$`#nasta_sida`
# you need to re-check this, but sometimes I'm getting white spaces within the url,
# you may not face this problem, but in any case this line of code solved the issue
next_url <- gsub(' ', '', n_url)
GET(next_url)
Update
I tried the url with the page number with 10 pages and it worked
my_dfs <- lapply(1:10, function(i){
my_url <- paste0("http://data.riksdagen.se/dokumentlista/?sok=&doktyp=mot&rm=&from=2000-01-01&tom=2017-12-31&ts=&bet=&tempbet=&nr=&org=&iid=&webbtv=&talare=&exakt=&planering=&sort=rel&sortorder=desc&rapport=&utformat=json&a=s&p=", i)
r1 <- GET(my_url)
r2 <- rawToChar(r1$content)
r3 <- fromJSON(r2)
r4 <- r3$dokumentlista$dokument
return(r4)
})
Update 2:
The extracted data frames are complex (e.g. some columns are lists of data frames) which is why a simple rbind will not work here, you'll have to do some pre-processing before you stack up the data together, something like this would work
my_dfs %>% lapply(function(df_0){
# Do some stuff here with the data, and choose the variables you need
# I chose the first 10 columns to check that I got 200 different observations
df_0[1:10]
}) %>% do.call(rbind, .)

Recommendation when using Sys.sleep() in R with rvest

I am scraping thousands of webpages using the R package rvest. In order not to overload the server, I timed the Sys.sleep() function with 5 seconds.
It works nice until we reach a value of ~400 webpages scraped. However, beyond this value, I get nothing and all data is empty, although an error is not thrown.
I am wondering whether there is any possibility to modify the Sys.sleep() function to scrape 350 webpages by 5 seconds each, then wait for instance 5 minuts, then continue with another 350 webpages... and so on.
I was checking the Sys.sleep() function documentation, and only time appears as an argument. So, if this is not possible to be done with this function, is there any other possibility or function to deal with this problem when scraping a huge bunch of pages?
UPDATE WITH AN EXAMPLE
This is part of my code. The object links is composed of more than 8 thousand links.
title <- vector("character", length = length(links))
short_description <- vector("character", length = length(links))
for(i in seq_along(links)){
Sys.sleep(5)
aff_link <- read_html(links[i])
title[i] <- aff_link %>%
html_nodes("title") %>%
html_text()
short_description[i] <- aff_link %>%
html_nodes(".clp-lead__headline") %>%
html_text()
}
You could add a check on the modulus of a loop variable and do an extra sleep every N iterations. Example:
> for(i in 1:100){
message("Getting page ",i)
Sys.sleep(5)
if((i %% 10) == 0){
message("taking a break")
Sys.sleep(10)
}
}
Every 10 iterations the expression i %% 10 is TRUE and you get an extra 10 seconds sleep.
I can think of more complex solutions but this might work for you.
One other possibility is to check if a page returns any data, and if not, sleep twice as long and try again, repeating this a number of times. Here's some semi-pseudocode:
get_page = function(page){
sleep = 5
for(try in 1:5){
html = get_content(page)
if(download_okay(html)){
return(html)
}
sleep = sleep * 2
Sys.sleep(sleep)
}
return("I tried - but I failed!")
}
Some web page getters like CURL will do this automatically with the right options - there may be a way to wangle that into your code too.

Not sure which way of combining my loop results I should be using

To make a long story short I'm trying to gather information on 6500 user, so I wrote a loop. Below you can find an example of 10 artists. In this loop I'm trying to use a call to gather information on all tracks of a user.
test <- fromJSON(getURL('http://api.soundcloud.com/users/52880864/tracks?client_id=0ab2657a7e5b63b6dbc778e13c834e3d&limit=200&offset=1&linked_partitioning=1'))
This short example shows a dataframe with all the tracks uploaded by a user. When I use my loop I'd like to add all the dataframes together so that I can process them with tapply. This way I can for instance see how what the sum of all track likes are. However, two things are going wrong. First, when I run the loop, each users only shows one uploaded track. Second, I think I'm not combining the dataframes properly. Could somebody please explain to me what I'm doing wrong?
id <- c(20376298, 63320169, 3806325, 12231483, 18838035, 117385796, 52880864, 32704993, 63975320, 95667573)
Partition1 <- paste0("'http://api.soundcloud.com/users/", id, "/tracks?client_id=0ab2657a7e5b63b6dbc778e13c834e3d&limit=200&offset=1&linked_partitioning=1'")
results <- vector(mode = "list", length = length(Partition1))
for (i in seq_along(Partition1)){
message(paste0('Query #',i))
tryCatch({
result_i <- fromJSON((getURL(str_replace_all(Partition1[i],"'",""))))
clean_i <- function(x)ifelse(is.null(x),NA,ifelse(length(x)==0,NA,x))
results[[i]] <- plyr::llply(result_i, clean_i) %>% as_data_frame
if( i == 4 ) {
stop('stop')
}
}, error = function(e){
beepr::beep(1)
}
)
Sys.sleep(0.5)
}

Resources