Rvest'ing using 'for' loops in R

My goal is to get the weather data from one of the websites. I (with a little help from kind Stack users, thank you) have already created a vector of 1440 links and decided to try a 'for' loop to iterate over them.
Additionally, every page holds a week of weather, so there are 7 rows of data (one for each day) that I have to obtain; the rows are marked as num0/num1/num2/num3.
That's what I came up with:
Links <- # here are the 1440 links I need to iterate over
library("rvest")

for (index in seq(from = 1, to = length(Links), by = 1)) {
  link <- paste(Links[index])
  for (num in 0:7) {
    node_date       <- paste(".num", num, " .date", sep = "")
    node_conditions <- paste(".num", num, " .cond span", sep = "")
    # here I tried to create an embedded 'for' loop to iterate 7 times over the various nodes containing the data
    page <- read_html(link)
    DayOfWeek  <- page %>% html_nodes(node_date) %>% html_text()
    Conditions <- page %>% html_nodes(node_conditions) %>% html_text()
  }
}
For now I get an error:
Error in open.connection(x, "rb") : HTTP error 502
and I'm really quite confused about what I should do now.
Are there other ways to accomplish this goal? Or am I making some rookie mistakes here?
Thank you in advance!
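A minimal sketch of one way to make the loop more robust, under a few assumptions not stated in the question: the 502 is an intermittent server-side rejection, the seven day blocks are numbered .num0 through .num6, and each selector returns at most a handful of text nodes. The idea is to read each page only once, wrap read_html() in tryCatch() so one failed request does not kill the whole run, pause between requests, and store each day's result instead of overwriting it.

library(rvest)

results <- vector("list", length(Links))

for (index in seq_along(Links)) {
  # read the page once per link; skip it (after a pause) if the server errors out
  page <- tryCatch(read_html(Links[index]), error = function(e) NULL)
  if (is.null(page)) {
    Sys.sleep(10)   # back off before moving on; HTTP 502 is a server-side error
    next
  }

  days <- vector("list", 7)
  for (num in 0:6) {                       # assumption: 7 days map to .num0 ... .num6
    node_date       <- paste0(".num", num, " .date")
    node_conditions <- paste0(".num", num, " .cond span")
    date_txt <- page %>% html_nodes(node_date) %>% html_text()
    cond_txt <- page %>% html_nodes(node_conditions) %>% html_text()
    days[[num + 1]] <- data.frame(
      DayOfWeek  = paste(date_txt, collapse = " "),   # collapse keeps this length 1 even if the node is missing
      Conditions = paste(cond_txt, collapse = " "),
      stringsAsFactors = FALSE
    )
  }
  results[[index]] <- do.call(rbind, days)

  Sys.sleep(1)   # be polite to the server between pages
}

weather <- do.call(rbind, results)

The pauses and the tryCatch() are the important parts here: a 502 is not something the R code can fix, but slowing down and retrying later usually gets past it.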

Related

efficient data collection from API using R

I am trying to get data from the UN Stats API for a list of indicators (https://unstats.un.org/SDGAPI/swagger/).
I have constructed a loop that can be used to get the data for a single indicator (code is below). The loop can be applied to multiple indicators as needed. However, that is likely to cause problems: it means a large number of requests, which could be perceived as a DDoS attack, and it takes far too long.
Is there an alternative way to get data for an indicator, for all years and countries, without making a ridiculous number of requests, or at least a more efficient approach than the one below? I suppose this question applies more generally to other similar APIs as well. Any help would be most welcome.
Please note: I have seen the post here (Faster download for paginated nested JSON data from API in R?) but it is not quite what I am looking for.
Minimal working example
# libraries
library(jsonlite)
library(dplyr)
library(purrr)

# get the metadata
page <- "https://unstats.un.org/SDGAPI//v1/sdg/Series/List"
sdg_meta <- fromJSON(page) %>% as.data.frame()

# parameters
PAGE_SIZE <- 100000
N_PAGES <- 5
FULL_DF <- NULL
my_code <- "SI_COV_SOCINS"

# loop to go over pages
for (i in seq(1, N_PAGES, 1)) {
  ind <- which(sdg_meta$code == my_code)
  cat(paste0("Processing : ", my_code, " ", i, " of ", N_PAGES, " \n"))
  my_data_page <- paste0("https://unstats.un.org/SDGAPI/v1/sdg/Series/Data?seriesCode=",
                         my_code, "&page=", i, "&pageSize=", PAGE_SIZE)
  df <- fromJSON(my_data_page) # depending on the data you are calling, you will get a list
  df <- df$data %>% as.data.frame() %>% distinct()
  # break the loop when there is nothing more to add
  if (is_empty(df)) {
    break
  }
  FULL_DF <- rbind(FULL_DF, df)
  Sys.sleep(5) # sleep to avoid any issues
}
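A hedged sketch of one way to avoid guessing N_PAGES: ask the API how much data there is before looping. This assumes the JSON response carries a totalPages field alongside data (an assumption about this particular API's response shape, not something confirmed above); if it does, you can loop exactly the reported number of times and stop.

library(jsonlite)
library(dplyr)

my_code   <- "SI_COV_SOCINS"
PAGE_SIZE <- 100000

# first request: fetch page 1 and look at the paging metadata the API returns
# (the field name totalPages is an assumption about the response shape)
first_url  <- paste0("https://unstats.un.org/SDGAPI/v1/sdg/Series/Data?seriesCode=",
                     my_code, "&page=1&pageSize=", PAGE_SIZE)
first_resp <- fromJSON(first_url)
n_pages    <- first_resp$totalPages

pages <- vector("list", n_pages)
pages[[1]] <- as.data.frame(first_resp$data)

if (n_pages > 1) {
  for (i in 2:n_pages) {
    url <- paste0("https://unstats.un.org/SDGAPI/v1/sdg/Series/Data?seriesCode=",
                  my_code, "&page=", i, "&pageSize=", PAGE_SIZE)
    pages[[i]] <- as.data.frame(fromJSON(url)$data)
    Sys.sleep(5)  # keep the polite pause between requests
  }
}

FULL_DF <- bind_rows(pages) %>% distinct()

With a large PAGE_SIZE the reported page count will often be 1, so the whole series comes back in a single request per indicator.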

Number of items to replace is not a multiple of replacement length. Rvest scraping

I know I have some problems with my for loop, but I need somebody to spot where the problem is.
These are two pages from which I would like to scrape 100 links each. Note that you need credentials to get in there, but I include the URLs here so you can see all the code:
urls <- c("http://cli.linksynergy.com/cli/publisher/links/linkfinder.php?mode=basic&keyword=linux&exact=&any=&exclude=&mid=-1&cat=&sort=&retailprice_sort=&productname_sort=&shortdesp_sort=&categoryname_sort=&keyword_sort=&linklang=pt_BR&currec=1&max=100",
"http://cli.linksynergy.com/cli/publisher/links/linkfinder.php?mode=basic&keyword=linux&exact=&any=&exclude=&mid=-1&cat=&sort=&retailprice_sort=&productname_sort=&shortdesp_sort=&categoryname_sort=&keyword_sort=&linklang=pt_BR&currec=101&max=100")
I use the rvest package to scrape them. This is the for loop:
enlaces <- vector("character", length = length(urls))
for (i in seq_along(urls)) {
  Sys.sleep(1)
  derby <- read_html(jump_to(session, urls[i]))
  enlaces[i] <- derby %>%
    html_nodes(".td_auto_left a:nth-child(1)") %>%
    html_attr('href')
}
Ideally, I would get a vector of 200 links, 100 scraped from each of the links stored in urls.
However, I get the error Number of items to replace is not a multiple of replacement length.
I think the problem is that each element of enlaces can hold only one object per iteration, while each iteration produces 100, and I didn't know how to proceed. Any idea?
I finally solved it by creating a list instead of a vector and using double brackets in the for loop.
enlaces <- list()
for (i in seq_along(urls)) {
  Sys.sleep(1)
  derby <- read_html(jump_to(session, urls[i]))
  enlaces[[i]] <- derby %>%
    html_nodes(".td_auto_left a:nth-child(1)") %>%
    html_attr('href')
}
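A small hedged follow-up: each list element now holds the ~100 hrefs from one page, so if the single flat vector of roughly 200 links described above is still wanted, the list can be collapsed afterwards.

# flatten the per-page character vectors into one vector of links
all_links <- unlist(enlaces)
length(all_links)  # about 200 if each page yields 100 hrefs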

For loop fills only the last row in R

I'm a beginner in R. I have code that scrapes some data from a website (rvest package) and then stores it as text in a matrix (which I'll later combine with other matrices to form a data frame).
To populate the matrix I'm using a for loop. The code had been working fine, but when I opened it today it started filling in only the last iteration, and I don't understand what the problem is:
all = matrix(nrow = length(urls))
for (i in length(urls)) {
  p = read_html(urls[i]) %>%
    html_nodes("p") %>%
    html_text() %>%
    as.character()
  all[i,] = paste(p, collapse = "_")
}
Any help would be much appreciated! Thanks (:
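A hedged aside on a likely culprit (not something stated in the question itself): length(urls) is a single number, so for (i in length(urls)) runs exactly one iteration, with i equal to the last index, which is why only the last row ever gets filled. Iterating over seq_along(urls) visits every index:

all <- matrix(nrow = length(urls), ncol = 1)

for (i in seq_along(urls)) {     # seq_along(urls) is 1, 2, ..., length(urls)
  p <- read_html(urls[i]) %>%
    html_nodes("p") %>%
    html_text() %>%
    as.character()
  all[i, ] <- paste(p, collapse = "_")
}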

Iteratively Create Dataset Using rvest

I am pretty new to R, but I am really interested in learning how to use it (specifically the new package rvest) to screen-scrape information from articles for research papers, etc.
I want to create a dataset of all the ratings and directors of movies on IMDb. I have code that can get ONE rating at a time:
library(rvest)
HG_Movie <- html("http://www.imdb.com/title/tt01781922")
score <- HG_Movie %>%
  html_node("strong span") %>%
  html_text() %>%
  as.numeric()
print(score)
That works, and I print the score at the end to make sure it is correct (6.9).
So, now, the hard part. I want to iterate over many IMDb pages, collect the rating and the name of the director as well, and write these into a dataset of some type (it doesn't matter whether it is .csv or .txt or whatever). The finished dataset would look something like:
Title   Score   Director
XX      YY      HH
AA      BB      CC
and so on. It would be amazing to learn to do this both with a list of all the URLs and without one, using some sort of loop over a certain range of values. Any help would be greatly appreciated!
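A minimal sketch of one way such a loop could look, assuming a character vector of IMDb title URLs is available. The score selector ("strong span") comes from the question above; the title and director selectors ("h1" and ".credit_summary_item a") are guesses about IMDb's markup and may need adjusting; read_html()/html_node() are the current rvest equivalents of the html() call shown earlier.

library(rvest)

movie_urls <- c("http://www.imdb.com/title/tt01781922")   # add the rest of your title URLs here

movies <- data.frame(Title = character(0), Score = numeric(0), Director = character(0),
                     stringsAsFactors = FALSE)

for (url in movie_urls) {
  page <- read_html(url)

  title <- page %>% html_node("h1") %>% html_text(trim = TRUE)
  score <- page %>% html_node("strong span") %>% html_text() %>% as.numeric()
  # the director selector below is a guess about IMDb's markup and may need updating
  director <- page %>% html_node(".credit_summary_item a") %>% html_text(trim = TRUE)

  movies <- rbind(movies,
                  data.frame(Title = title, Score = score, Director = director,
                             stringsAsFactors = FALSE))
}

write.csv(movies, "imdb_movies.csv", row.names = FALSE)

Growing the data frame with rbind() is fine for a modest number of titles; for thousands of pages it would be worth preallocating a list and binding once at the end.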

R memory issues while webscraping with rvest

I am using rvest to webscrape in R, and I'm running into memory issues. I have a 28,625 by 2 data frame of strings called urls that contains the links to the pages I'm scraping. A row of the frame contains two related links. I want to generate a 28,625 by 4 data frame Final with information scraped from the links. One piece of information is from the second link in a row, and the other three are from the first link. The xpaths to the three pieces of information are stored as strings in the vector xpaths. I am doing this with the following code:
data <- rep("", 4 * 28625)
k <- 1
for (i in 1:28625) {
  name <- html(urls[i, 2]) %>%
    html_node(xpath = '//*[@id="seriesDiv"]/table') %>%
    html_table(fill = T)
  data[k] <- name[4, 3]
  data[k + 1:3] <- html(urls[i, 1]) %>%
    html_nodes(xpath = xpaths) %>%
    html_text()
  k <- k + 4
}
dim(data) <- c(4, 28625)
Final <- as.data.frame(t(data))
It works well enough, but when I open the task manager, I see that my memory usage has been monotonically increasing and is currently at 97% after about 340 iterations. I'd like to just start the program and come back in a day or two, but all of my RAM will be exhausted before the job is done. I've done a bit of research on how R allocates memory, and I've tried my best to preallocate memory and modify in place, to keep the code from making unnecessary copies of things, etc.
Why is this so memory intensive? Is there anything I can do to resolve it?
Rvest has been updated to resolve this issue. See here:
http://www.r-bloggers.com/rvest-0-3-0/
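As a hedged illustration of what that update means for the code above: rvest 0.3.0 replaced html() with read_html() (and moved the parsing to the xml2 package). Assuming that change is what resolves the memory growth, the same loop written against the newer API, with urls and xpaths defined as in the question, would look roughly like this:

library(rvest)

# urls (28,625 x 2 data frame of links) and xpaths (vector of three XPath strings)
# are assumed to be defined exactly as described in the question
data <- rep("", 4 * 28625)
k <- 1
for (i in 1:28625) {
  name <- read_html(urls[i, 2]) %>%
    html_node(xpath = '//*[@id="seriesDiv"]/table') %>%
    html_table(fill = TRUE)
  data[k] <- name[4, 3]
  data[k + 1:3] <- read_html(urls[i, 1]) %>%
    html_nodes(xpath = xpaths) %>%
    html_text()
  k <- k + 4
}
dim(data) <- c(4, 28625)
Final <- as.data.frame(t(data))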
