R memory issues while webscraping with rvest

I am using rvest to webscrape in R, and I'm running into memory issues. I have a 28,625 by 2 data frame of strings called urls that contains the links to the pages I'm scraping. A row of the frame contains two related links. I want to generate a 28,625 by 4 data frame Final with information scraped from the links. One piece of information is from the second link in a row, and the other three are from the first link. The xpaths to the three pieces of information are stored as strings in the vector xpaths. I am doing this with the following code:
library(rvest)

# Preallocate a character vector to hold the four scraped values per row
data <- rep("", 4 * 28625)
k <- 1
for (i in 1:28625) {
  # One value comes from the row's second link...
  name <- html(urls[i, 2]) %>%
    html_node(xpath = '//*[@id="seriesDiv"]/table') %>%
    html_table(fill = TRUE)
  data[k] <- name[4, 3]
  # ...and the other three come from the row's first link
  data[k + 1:3] <- html(urls[i, 1]) %>%
    html_nodes(xpath = xpaths) %>%
    html_text()
  k <- k + 4
}
dim(data) <- c(4, 28625)
Final <- as.data.frame(t(data))
It works well enough, but when I open the task manager, I see that my memory usage has been monotonically increasing and is currently at 97% after about 340 iterations. I'd like to just start the program and come back in a day or two, but all of my RAM will be exhausted before the job is done. I've done a bit of research on how R allocates memory, and I've tried my best to preallocate memory and modify in place, to keep the code from making unnecessary copies of things, etc.
Why is this so memory intensive? Is there anything I can do to resolve it?

Rvest has been updated to resolve this issue. See here:
http://www.r-bloggers.com/rvest-0-3-0/
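For anyone hitting the same wall: after updating, the main user-facing change is that html() becomes read_html() (rvest 0.3.0 is rebuilt on xml2). A minimal sketch of the question's loop against the updated package, reusing the urls and xpaths objects defined in the question:

library(rvest)  # >= 0.3.0

data <- rep("", 4 * 28625)
k <- 1
for (i in 1:28625) {
  # read_html() replaces the deprecated html() in rvest >= 0.3.0
  name <- read_html(urls[i, 2]) %>%
    html_node(xpath = '//*[@id="seriesDiv"]/table') %>%
    html_table(fill = TRUE)
  data[k] <- name[4, 3]
  data[k + 1:3] <- read_html(urls[i, 1]) %>%
    html_nodes(xpath = xpaths) %>%
    html_text()
  k <- k + 4
}
dim(data) <- c(4, 28625)
Final <- as.data.frame(t(data))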

Related

efficient data collection from API using R

I am trying to get data from the UN Stats API for a list of indicators (https://unstats.un.org/SDGAPI/swagger/).
I have constructed a loop that can be used to get the data for a single indicator (code is below). The loop can be applied to multiple indicators as needed. However, doing so would involve a very large number of requests, which could be perceived as a DDoS attack and would take far too long.
Is there an alternative way to get data for an indicator for all years and countries without making a ridiculous number of requests or in a more efficient manner than below? I suppose this question likely applies more generally to other similar APIs as well. Any help would be most welcome.
Please note: I have seen the post here (Faster download for paginated nested JSON data from API in R?) but it is not quite what I am looking for.
Minimal working example
# libraries
library(jsonlite)
library(dplyr)
library(purrr)

# get the meta data
page     <- "https://unstats.un.org/SDGAPI//v1/sdg/Series/List"
sdg_meta <- fromJSON(page) %>% as.data.frame()

# parameters
PAGE_SIZE <- 100000
N_PAGES   <- 5
FULL_DF   <- NULL
my_code   <- "SI_COV_SOCINS"

# loop to go over pages
for (i in seq(1, N_PAGES, 1)) {
  ind <- which(sdg_meta$code == my_code)
  cat(paste0("Processing : ", my_code, " ", i, " of ", N_PAGES, " \n"))
  my_data_page <- paste0("https://unstats.un.org/SDGAPI/v1/sdg/Series/Data?seriesCode=",
                         my_code, "&page=", i, "&pageSize=", PAGE_SIZE)
  # depending on the data you are calling, you will get a list
  df <- fromJSON(my_data_page)
  df <- df$data %>% as.data.frame() %>% distinct()
  # break the loop when there is nothing more to add
  if (is_empty(df)) {
    break
  }
  FULL_DF <- rbind(FULL_DF, df)
  Sys.sleep(5)  # sleep to avoid hammering the API
}
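One way to cut the number of requests down (a rough sketch only; it assumes the Series/Data endpoint reports a totalElements field the same way the Indicator/Data endpoint used in the question further down does) is to ask the API how many records the series has and then fetch them in a single, appropriately sized page:

library(jsonlite)
library(dplyr)

my_code  <- "SI_COV_SOCINS"
base_url <- "https://unstats.un.org/SDGAPI/v1/sdg/Series/Data"

# Tiny probe request just to learn how many records the series has
# (totalElements here is an assumption carried over from the Indicator/Data endpoint)
probe <- fromJSON(paste0(base_url, "?seriesCode=", my_code, "&pageSize=1"))
n_rec <- probe$totalElements

# One request sized to the series instead of N_PAGES fixed-size requests
full_df <- fromJSON(paste0(base_url, "?seriesCode=", my_code, "&pageSize=", n_rec))$data %>%
  as.data.frame() %>%
  distinct()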

Rvest'ing using 'for' loops in R

My goal is to get the weather data from one of the websites. I already created (with a little help from kind Stack users, thank you) a vector of 1440 links, and decided to try a 'for' loop to iterate over them.
Additionally, every page has the weather for a whole week, so there are 7 rows of data (one for each day) that I have to obtain, marked as num0/num1/num2/num3.
That's what I came up with:
Links <- # here are the 1440 links I need to iterate over
library("rvest")

for (index in seq(from = 1, to = length(Links), by = 1)) {
  link <- paste(Links[index])
  # here I tried to create an embedded for loop to iterate 7 times over the various nodes holding the data
  for (num in 0:7) {
    node_date       <- paste(".num", num, " .date", sep = "")
    node_conditions <- paste(".num", num, " .cond span", sep = "")
    page <- read_html(link)
    DayOfWeek  <- page %>% html_nodes(node_date) %>% html_text()
    Conditions <- page %>% html_nodes(node_conditions) %>% html_text()
  }
}
For now I get an error
Error in open.connection(x, "rb") : HTTP error 502
and I'm really quite confused about what I should do now.
Are there other ways to accomplish this goal? Or maybe I'm making some rookie mistakes in here?
Thank you in advance!
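A 502 is the server failing or refusing the request rather than a bug in the R code, so it mostly has to be tolerated and retried. Two rookie-mistake-ish things are worth changing, though: the page is currently downloaded again on every pass of the inner loop (up to eight times per link), and DayOfWeek/Conditions are overwritten instead of stored. A rough sketch, assuming the seven day blocks run .num0 through .num6 and keeping the selectors from the question:

library(rvest)

results <- vector("list", length(Links))

for (index in seq_along(Links)) {
  link <- Links[index]

  # Retry transient server errors such as the HTTP 502
  page <- NULL
  for (attempt in 1:3) {
    page <- tryCatch(read_html(link), error = function(e) NULL)
    if (!is.null(page)) break
    Sys.sleep(5)
  }
  if (is.null(page)) next  # give up on this link after three failed attempts

  # Download once per link, then pull all seven day blocks from the same page
  results[[index]] <- lapply(0:6, function(num) {
    list(
      DayOfWeek  = page %>% html_nodes(paste0(".num", num, " .date")) %>% html_text(),
      Conditions = page %>% html_nodes(paste0(".num", num, " .cond span")) %>% html_text()
    )
  })

  Sys.sleep(1)  # be polite to the server between links
}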

Faster download for paginated nested JSON data from API in R?

I am downloading nested json data from the UN's SDG Indicators API, but using a loop for 50,006 paginations is waaayyy too slow to ever complete. Is there a better way?
https://unstats.un.org/SDGAPI/swagger/#!/Indicator/V1SdgIndicatorDataGet
I'm working in RStudio on a Windows laptop. Getting to the json nested data and structuring into a dataframe was a hard-fought win, but dealing with the paginations has me stumped. No response from the UN statistics email.
Maybe an 'apply' would do it? I only need data from 2004, 2007, and 2011 - maybe I can filter, but I don't think that would help the fundamental issue.
I'm probably misunderstanding the API structure - I can't see how querying 50,006 pages individually can be functional for anyone. Thanks for any insight!
library(dplyr)
library(httr)
library(jsonlite)

# Get data from the first page, and initialize dataframe and data types
page1 <- fromJSON("https://unstats.un.org/SDGAPI/v1/sdg/Indicator/Data", flatten = TRUE)

# Get number of pages from the field of the first page
pages <- page1$totalPages

SDGdata <- data.frame()
for (j in 1:25) {
  SDGdatarow <- rbind(page1$data[j, 1:16])
  SDGdata    <- rbind(SDGdata, SDGdatarow)
}
SDGdata[1] <- as.character(SDGdata[[1]])
SDGdata[2] <- as.character(SDGdata[[2]])
SDGdata[3] <- as.character(SDGdata[[3]])

# Loop through all the rest of the pages
baseurl <- "https://unstats.un.org/SDGAPI/v1/sdg/Indicator/Data"
for (i in 2:pages) {
  mydata <- fromJSON(paste0(baseurl, "?page=", i), flatten = TRUE)
  message("Retrieving page ", i)
  for (j in 1:25) {
    SDGdatarow <- rbind(mydata$data[j, 1:16])
    rownames(SDGdatarow) <- as.numeric((i - 1) * 25 + j)
    SDGdata <- rbind.data.frame(SDGdata, SDGdatarow)
  }
}
I do get the data I want, and in a nice dataframe, but inevitably the query has a connection issue after a few hundred pages, or my laptop shuts down etc. It's about 5 seconds a page. 5*50,006/3600 ~= 70 hours.
I think I (finally) figured out a workable solution: I can set the number of elements per page, resulting in a manageable number of pages to call. (I also filtered for just the three years I want, which reduces the data.) Through experimentation I found that about 1/10th of the elements download fine in one call, so I request 1/10th per page in a loop over 10 pages. It takes about 20 minutes, but that's better than 70 hours, and it works without losing the connection.
# initiate the df
SDGdata <- data.frame()

# call to get the # of elements with the years filter
page1   <- fromJSON("https://unstats.un.org/SDGAPI/v1/sdg/Indicator/Data?timePeriod=2004&timePeriod=2007&timePeriod=2011", flatten = TRUE)
perpage <- ceiling(page1$totalElements / 10)

ptm <- proc.time()
for (i in 1:10) {
  SDGpage <- fromJSON(paste0("https://unstats.un.org/SDGAPI/v1/sdg/Indicator/Data?timePeriod=2004&timePeriod=2007&timePeriod=2011&pageSize=", perpage, "&page=", i), flatten = TRUE)
  message("Retrieving page ", i, " : ", (proc.time() - ptm)[3], " seconds")
  SDGdata <- rbind(SDGdata, SDGpage$data[, 1:16])
}
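If the run time ever needs trimming further, one small change worth considering (a sketch of the same ten calls; nothing about the API usage changes) is to collect each page in a pre-allocated list and bind once at the end, so the growing data frame is not copied on every rbind():

library(jsonlite)

base_url <- "https://unstats.un.org/SDGAPI/v1/sdg/Indicator/Data?timePeriod=2004&timePeriod=2007&timePeriod=2011"
page1    <- fromJSON(base_url, flatten = TRUE)
perpage  <- ceiling(page1$totalElements / 10)

# Collect the pages in a list, then bind them in one go at the end
pages <- vector("list", 10)
for (i in 1:10) {
  SDGpage    <- fromJSON(paste0(base_url, "&pageSize=", perpage, "&page=", i), flatten = TRUE)
  pages[[i]] <- SDGpage$data[, 1:16]
}
SDGdata <- do.call(rbind, pages)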

Number of items to replace is not a multiple of replacement length. Rvest scraping

I know I have some problems with my for loop, but I need somebody to spot where the problem is.
These are the two pages I would like to scrape, 100 links from each. Note that you need credentials to get in there, but I include the URLs here so you can see all the code:
urls <- c("http://cli.linksynergy.com/cli/publisher/links/linkfinder.php?mode=basic&keyword=linux&exact=&any=&exclude=&mid=-1&cat=&sort=&retailprice_sort=&productname_sort=&shortdesp_sort=&categoryname_sort=&keyword_sort=&linklang=pt_BR&currec=1&max=100",
"http://cli.linksynergy.com/cli/publisher/links/linkfinder.php?mode=basic&keyword=linux&exact=&any=&exclude=&mid=-1&cat=&sort=&retailprice_sort=&productname_sort=&shortdesp_sort=&categoryname_sort=&keyword_sort=&linklang=pt_BR&currec=101&max=100")
I use the rvest package to scrape them. This is the for loop:
library(rvest)

enlaces <- vector("character", length = length(urls))
for (i in seq_along(urls)) {
  Sys.sleep(1)
  # session is the logged-in rvest session created earlier (not shown)
  derby <- read_html(jump_to(session, urls[i]))
  enlaces[i] <- derby %>%
    html_nodes(".td_auto_left a:nth-child(1)") %>%
    html_attr('href')
}
Ideally, I would get a vector of 200 links, 100 scraped from each of the links stored in urls.
However, I get the error Number of items to replace is not a multiple of replacement length.
I think the problem might be that enlaces expects only one object per iteration, whereas each iteration produces 100, and I don't know how to proceed. Any idea?
I finally solved it by creating a list instead of a vector and using double brackets in the for loop.
enlaces <- list()
for (i in seq_along(urls)) {
  Sys.sleep(1)
  derby <- read_html(jump_to(session, urls[i]))
  enlaces[[i]] <- derby %>%
    html_nodes(".td_auto_left a:nth-child(1)") %>%
    html_attr('href')
}
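If a single flat vector of the roughly 200 links is still wanted at the end (as originally intended), the per-page list can simply be collapsed afterwards:

# Flatten the per-page results into one character vector of links
all_links <- unlist(enlaces)
length(all_links)  # about 200 if each page yields 100 links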

R: looping through a list of links

I have some code that scrapes data off this link (http://stats.ncaa.org/team/stats?org_id=575&sport_year_ctl_id=12280) and runs some calculations.
What I want to do is cycle through every team, collecting the data and running the manipulations for each one. I have a data frame with every team's link, like the one above.
Pseudo code:
for (link in teamlist)
{scrape, manipulate, put into a table}
However, I can't figure out how to loop through the links.
I've tried doing URL = teamlist$link[i], but I get an error when using readHTMLTable(). I have no trouble when I manually paste each team's individual URL into the script, only when trying to pull it from a table.
Current code:
library(XML)
library(gsubfn)

URL <- 'http://stats.ncaa.org/team/stats?org_id=575&sport_year_ctl_id=12280'
tx  <- readLines(URL)
tx2 <- gsub("</tbody>", "", tx)
tx2 <- gsub("<tfoot>", "", tx2)
tx2 <- gsub("</tfoot>", "</tbody>", tx2)
Player_Stats <- readHTMLTable(tx2, asText = TRUE, header = TRUE, which = 2, stringsAsFactors = FALSE)
Thanks.
I agree with @ialm that you should check out the rvest package, which makes it very fun and straightforward to loop through links. I will create some example code here using similar subject matter for you to check out.
Here I am generating a list of links that I will iterate through
rm(list = ls())
library(rvest)

mainweb <- "http://www.basketball-reference.com/"
urls <- html("http://www.basketball-reference.com/teams") %>%
  html_nodes("#active a") %>%
  html_attrs()
Now that the list of links is complete I iterate through each link and pull a table from each
teamdata <- c()
j <- 1
for (i in urls) {
  bball <- html(paste(mainweb, i, sep = ""))
  teamdata[j] <- bball %>%
    html_nodes(paste0("#", gsub("/teams/([A-Z]+)/$", "\\1", urls[j], perl = TRUE))) %>%
    html_table()
  j <- j + 1
}
Please see the code below, which basically builds off your code and loops through two different team pages as identified by the vector team_codes. The tables are returned in a list where each list element corresponds to a team's table. However, the tables look like they will need more cleaning.
library(XML)
library(gsubfn)

Player_Stats <- list()
j <- 1
team_codes <- c(575, 580)

for (code in team_codes) {
  URL <- paste0('http://stats.ncaa.org/team/stats?org_id=', code, '&sport_year_ctl_id=12280')
  tx  <- readLines(URL)
  tx2 <- gsub("</tbody>", "", tx)
  tx2 <- gsub("<tfoot>", "", tx2)
  tx2 <- gsub("</tfoot>", "</tbody>", tx2)
  Player_Stats[[j]] <- readHTMLTable(tx2, asText = TRUE, header = TRUE, which = 2, stringsAsFactors = FALSE)
  j <- j + 1
}
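To drive either loop from the question's teamlist data frame instead of hard-coded codes, something along these lines should work. It is only a sketch: it assumes the full URLs sit in a column named link (the URL = teamlist$link[i] attempt suggests that, but the name is otherwise a guess), and as.character() guards against the column having been read in as a factor, a common cause of errors when the URL is pulled from a table rather than pasted in by hand.

library(XML)

# teamlist and its link column come from the question; the column name is assumed
team_urls <- as.character(teamlist$link)

Player_Stats <- vector("list", length(team_urls))
for (j in seq_along(team_urls)) {
  tx  <- readLines(team_urls[j])
  tx2 <- gsub("</tbody>", "", tx)
  tx2 <- gsub("<tfoot>", "", tx2)
  tx2 <- gsub("</tfoot>", "</tbody>", tx2)
  Player_Stats[[j]] <- readHTMLTable(tx2, asText = TRUE, header = TRUE,
                                     which = 2, stringsAsFactors = FALSE)
}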
