Automate web scraping with R

I have managed to scrape content for a single url, but am struggling to automate it for multiple urls.
Here is how it is done for a single page:
library(XML); library(data.table)
theurl <- paste("http://google.com/",url,"/ul",sep="")
convertUTF <- htmlParse(theurl, encoding = "UTF-8")
tables <- readHTMLTable(convertUTF)
n.rows <- unlist(lapply(tables, function(t) dim(t)[1]))
table <- tables[[which.max(n.rows)]]
TableData <- data.table(table)
Now I have a vector of urls and want to scrape each for the corresponding table:
Here, I read in data comprising multiple http links:
ur.l <- data.frame(read.csv(file.choose(), header=TRUE, fill=TRUE))
theurl <- matrix(NA, nrow=nrow(ur.l), ncol=1)
for(i in 1:nrow(ur.l)){
  url <- as.character(ur.l[i, 2])  # this just overwrites `url` on each pass, so only the last url is kept
}

Each of the three additional urls that you provided refers to a page that contains no tables, so it's not a particularly useful example dataset. However, a simple way to handle errors is with tryCatch. Below I've defined a function that reads in the tables from url u, calculates the number of rows in each table at that url, and then returns the table with the most rows as a data.table.
You can then use sapply to apply this function to each url (or, in your case, each org ID, e.g. 36245119) in a vector.
library(XML); library(data.table)
scrape <- function(u) {
  tryCatch({
    # read all tables on the page and keep the one with the most rows
    tabs <- readHTMLTable(file.path("http://finstat.sk", u, "suvaha"),
                          encoding = "utf-8")
    tab <- tabs[[which.max(sapply(tabs, nrow))]]
    data.table(tab)
  }, error = function(e) e)
}
urls <- c('36245119', '46894853', '46892460', '46888721')
res <- sapply(urls, scrape)
Take a look at ?tryCatch if you want to improve the error handling. Presently the function simply returns the errors themselves.
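For example, a minimal sketch of one alternative (just an option, not the only way): return NULL on failure instead of the error object, then drop the failures afterwards so the result only contains tables that were actually scraped.
scrape2 <- function(u) {
  tryCatch({
    tabs <- readHTMLTable(file.path("http://finstat.sk", u, "suvaha"), encoding = "utf-8")
    data.table(tabs[[which.max(sapply(tabs, nrow))]])
  }, error = function(e) NULL)  # swallow the error and return NULL instead
}
res_ok <- Filter(Negate(is.null), lapply(urls, scrape2))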

Related

Can't scrape all the rows using rvest

My goal is to scrape all this diamond data from bluenile.com. I've got some code that seems to be doing that, but it only grabs the first 61 rows.
By the way, I am using the "SelectorGadget" chrome plugin to get the CSS selectors. If I scroll down a little, the highlighting stops. Is it something to do with the website?
library('rvest')
le_url <- "https://www.bluenile.com/diamonds/round-cut?track=DiaSearchRDmodrn"
webpage <- read_html(le_url)
shape_data_html <- html_nodes(webpage,'.shape')
price_data_html <- html_nodes(webpage,'.price')
carat_data_html <- html_nodes(webpage,'.carat')
cut_data_html <- html_nodes(webpage,'.cut')
color_data_html <- html_nodes(webpage,'.color')
clarity_data_html <- html_nodes(webpage,'.clarity')
#Converting data to text
shape_data <- html_text(shape_data_html)
price_data <- html_text(price_data_html)
carat_data <- html_text(carat_data_html)
cut_data <- html_text(cut_data_html)
color_data <- html_text(color_data_html)
clarity_data <- html_text(clarity_data_html)
# combine into a table (note: cbind of character vectors gives a matrix, not a data.frame)
le_mat <- cbind(shape_data, price_data, carat_data, cut_data, color_data, clarity_data)
le_df <- le_mat[-1,]
colnames(le_df) <- le_mat[1,]
Data is dynamically added via an API call as you scroll down the page. The API call has a query string that lets you specify startIndex (start row) and pageSize (results per page); the maximum pageSize appears to be 1000. The return is JSON, from which you can extract everything you want, including the total number of rows, available under the key countRaw. So you can request the initial 1000, parse out countRaw, and loop, adjusting startIndex until you have all the results.
You can use a JSON parser, e.g. jsonlite, to handle the response.
Example API endpoint call for first 1000 results:
https://www.bluenile.com/api/public/diamond-search-grid/v2?startIndex=0&pageSize=1000&_=1562612289615&sortDirection=asc&sortColumn=default&shape=RD&hasVisualization=true&isFiltersExpanded=false&astorFilterActive=false&country=USA&language=en-us&currency=USD&productSet=BN&skus=
library(jsonlite)
url <- 'https://www.bluenile.com/api/public/diamond-search-grid/v2?startIndex=0&pageSize=1000&_=1562612289615&sortDirection=asc&sortColumn=default&shape=RD&hasVisualization=true&isFiltersExpanded=false&astorFilterActive=false&country=USA&language=en-us&currency=USD&productSet=BN&skus='
r <- jsonlite::fromJSON(url)
print(r$countRaw)
You get a list of 8 elements from each call. r$results is a dataframe containing info of main interest.
Given the indicated result count, I was expecting I could do something like this (bearing in mind my limited R experience):
total <- r$countRaw
url2 <- 'https://www.bluenile.com/api/public/diamond-search-grid/v2?startIndex=placeholder&pageSize=1000&_=1562612289615&sortDirection=asc&sortColumn=default&shape=RD&hasVisualization=true&isFiltersExpanded=false&astorFilterActive=false&country=USA&language=en-us&currency=USD&productSet=BN&skus='
if(total > 1000){
  for(i in seq(1000, total + 1, by = 1000)){
    newUrl <- gsub("placeholder", i, url2)
    newdf <- jsonlite::fromJSON(newUrl)$results
    # do something with newdf e.g. merge/rbind with the earlier results
  }
}
However, it seems that there are only results for the first two calls, i.e. the initial df from r$results shown above and then:
url2 <- 'https://www.bluenile.com/api/public/diamond-search-grid/v2?startIndex=1000&pageSize=1000&_=1562612289615&sortDirection=asc&sortColumn=default&shape=RD&hasVisualization=true&isFiltersExpanded=false&astorFilterActive=false&country=USA&language=en-us&currency=USD&productSet=BN&skus='
r <- jsonlite::fromJSON(url2)
df2 <- r$results
Searching the page with the CSS selector .row yields 1002 results, versus the total indicated by the "All diamonds" count; so I think there is some exploration to do around the filters.
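Putting the pieces together, a rough paging sketch (assuming the pages share the same result columns and that the endpoint simply returns no rows once you run past the end; nested columns in the JSON may need flattening before binding):
library(jsonlite)
tpl <- 'https://www.bluenile.com/api/public/diamond-search-grid/v2?startIndex=placeholder&pageSize=1000&_=1562612289615&sortDirection=asc&sortColumn=default&shape=RD&hasVisualization=true&isFiltersExpanded=false&astorFilterActive=false&country=USA&language=en-us&currency=USD&productSet=BN&skus='
pages <- list()
start <- 0
repeat {
  page <- jsonlite::fromJSON(gsub("placeholder", start, tpl))
  if (is.null(page$results) || nrow(page$results) == 0) break  # no more rows returned
  pages[[length(pages) + 1]] <- page$results
  start <- start + 1000
}
all_results <- do.call(rbind, pages)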

Reading nodes from multiple html and storing result as a vector

I have a list of locally saved html files. I want to extract multiple nodes from each html and save the results in a vector. Afterwards, I would like to combine them in a dataframe. Now, I have a piece of code for 1 node, which works (see below), but it seems quite long and inefficient if I apply it for ~ 20 variables. Also, something really strange happens when saving to the vector (XXX_name): it starts with the last observation and then continues with the first, second, and so on. Do you have any suggestions for simplifying the code / making it more efficient?
# Extracts name variable and stores in a vector
XXX_name <- c()
for (i in 1:216) {
  # note: `name` is appended *before* it is computed for this iteration,
  # which is why the resulting vector appears shifted by one observation
  XXX_name <- c(XXX_name, name)
  mydata <- read_html(files[i], encoding = "latin-1")
  reads_name <- html_nodes(mydata, 'h1')
  name <- html_text(reads_name)
  #print(i)
  #print(name)
}
Many thanks!
You can put the workings inside a function, then apply that function to each of your variables with map.
First, create the function:
read_names <- function(var, node) {
  mydata <- read_html(files[var], encoding = "latin-1")
  reads_name <- html_nodes(mydata, node)
  html_text(reads_name)
}
Then we create a data frame with all possible combinations of the inputs and apply the function to it:
library(tidyverse)
inputs <- crossing(var = 1:216, node = vector_of_nodes)
output <- map2(inputs$var, inputs$node, read_names)
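To get from there to the data frame you mentioned, a minimal sketch (assuming each selector matches exactly one node per file, and that vector_of_nodes holds your ~20 CSS selectors): keep the results as a list-column on inputs, then reshape to one row per file.
results <- inputs %>%
  mutate(text = map2(var, node, read_names))   # list-column: one character vector per file/selector pair
wide <- results %>%
  mutate(text = map_chr(text, 1)) %>%          # take the single match per pair
  pivot_wider(names_from = node, values_from = text)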

R, script/function for retrieving more stocks

I'm a newbie in R and I've seen several posts about downloading multiple stocks, but for one reason or another they don't work as suggested.
My purpose is to download a vector of stocks and create a single xts matrix containing only the Close prices for every stock (so n observations x 3 columns).
Anyway, I'd like to start from a basic script that doesn't work properly:
library(quantmod)
ticker <- c("KO","AAPL","^GSPC")
for (i in 1:length(ticker)) {
  simbol <- as.xts(na.omit(getSymbols(ticker[i], from="2016-01-01", auto.assign=FALSE)))
  new <- Cl(simbol)
  merge(new[i])  # the result of merge() is never stored, so nothing accumulates across iterations
}
It would be even better to write a function(symbols) that I can call whenever I need it, just changing the names of the stocks to download.
Thanks to everyone
This is how I would do what you want with a function wrapper (which is a pretty common kind of manipulation with xts):
ticker=c("KO","AAPL","^GSPC")
collect_close_series <- function(ticker) {
  # Preallocate a list to store the result from each loop iteration
  # (note: lapply is another alternative to a direct loop)
  lst <- vector("list", length(ticker))
  for (i in 1:length(ticker)) {
    symbol <- na.omit(getSymbols(ticker[i], from = "2016-01-01", auto.assign = FALSE))
    lst[[i]] <- Cl(symbol)
  }
  # You now have a list of close-price series. Combine the objects in the list
  # compactly using do.call; this is a common "data manipulation pattern" with xts objects.
  rr <- do.call(what = merge, lst)
  rr
}
out <- collect_close_series(ticker)
More advanced (better code design): write a function that handles a single symbol (rather than one that wraps and passes in all the symbols together) and then run lapply over the tickers:
per_sym_close <- function(tick) {
  symbol <- na.omit(getSymbols(tick, from = "2016-01-01", auto.assign = FALSE))
  Cl(symbol)
}
out2 <- do.call(merge, lapply(X = ticker, FUN = per_sym_close))
This gives the same result.
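As an optional tidy-up (not part of the original answer), you could label the merged columns with the tickers, since the list is built in the same order as ticker:
colnames(out2) <- ticker
head(out2)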
Hope this helps get you started toward writing good R code!

ReadLines using multiple sources in R

I'm trying to use readLines() to scrape .txt files hosted by the Census and compile them into one .txt/.csv file. I am able to use it to read individual pages but I'd like to have it so that I can just run a function that will go out and readLines() based on a csv with urls.
My knowledge of looping and function properties isn't great, but here are the pieces of my code that I'm trying to incorporate:
Here is how I build my matrix of urls which I can add to and/or turn into a csv and have a function read it that way.
MasterList <- matrix( data = c("%20Region/ne0001y.txt", "%20Region/ne0002y.txt", "%20Region/ne0003y.txt"), ncol = 1)
urls <- sprintf("http://www2.census.gov/econ/bps/Place/Northeast%s", MasterList)
Here's the function (riddled with problems) I started writing:
Scrape <- function(x){
  for (i in x){
    URLS <- i
    headers <- readLines(URLS, n=2)
    bod <- readLines(URLS)
    bodclipped <- bod[-c(1,2,3)]
    Totes <- c(headers, bodclipped)
    write(Totes, file = "[Directory]/ScrapeTest.txt")  # overwrites the file on every iteration
    return(head(Totes))  # return() exits the function after the first url
  }
}
The idea being that I would run Scrape(urls), which would generate a combined file from the 3 urls in my "urls" matrix/csv, with the Census' built-in headers removed from all files except the first one (headers vs. bodclipped).
I've tried applying lapply() to "urls" with readLines, but that only generates text based on the last url, not all three, and each text file still has its headers, which I could just remove and then reattach at the end.
Any help would be appreciated!
As all of these documents are csv files with 38 columns, you can combine them very easily using:
MasterList <- c("%20Region/ne0001y.txt", "%20Region/ne0002y.txt", "%20Region/ne0003y.txt")
urls <- sprintf("http://www2.census.gov/econ/bps/Place/Northeast%s", MasterList)
raw_dat <- lapply(urls, read.csv, skip = 3, header = FALSE)
dat <- do.call(rbind, raw_dat)
What happens here, and how is this looping?
The lapply function creates a list with 3 (= length(urls)) entries and populates them with the result of read.csv(urls[i], skip = 3, header = FALSE). So raw_dat is a list of 3 data.frames containing your data, and do.call(rbind, raw_dat) binds them together.
The header row seems somehow broken, which is why I use skip = 3, header = FALSE; this is equivalent to your bod[-c(1,2,3)].
If all the scraped data fits into memory you can combine it this way and in the end write it into a file using:
write.csv(dat, "[Directory]/ScrapeTest.txt")
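If you also want to keep track of which Census file each row came from (an optional extra, not in the original answer), one simple sketch, reusing raw_dat and dat from above:
dat$source_url <- rep(urls, times = sapply(raw_dat, nrow))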

R: looping through a list of links

I have some code that scrapes data off this link (http://stats.ncaa.org/team/stats?org_id=575&sport_year_ctl_id=12280) and runs some calculations.
What I want to do is cycle through every team and collect and run the manipulations on every team. I have a dataframe with every team link, like the one above.
Pseudo code:
for (link in teamlist)
{scrape, manipulate, put into a table}
However, I can't figure out how to loop through the links.
I've tried doing URL = teamlist$link[i], but I get an error when using readHTMLTable(). I have no trouble manually pasting each team's individual URL into the script; the problem only occurs when trying to pull it from a table.
Current code:
library(XML)
library(gsubfn)
URL= 'http://stats.ncaa.org/team/stats?org_id=575&sport_year_ctl_id=12280'
tx<- readLines(URL)
tx2<-gsub("</tbody>","",tx)
tx2<-gsub("<tfoot>","",tx2)
tx2<-gsub("</tfoot>","</tbody>",tx2)
Player_Stats = readHTMLTable(tx2,asText=TRUE, header = T, which = 2,stringsAsFactors = F)
Thanks.
I agree with @ialm that you should check out the rvest package, which makes it very fun and straightforward to loop through links. I will create some example code here using similar subject matter for you to check out.
Here I am generating a list of links that I will iterate through
rm(list=ls())
library(rvest)
mainweb <- "http://www.basketball-reference.com/"
urls <- html("http://www.basketball-reference.com/teams") %>%
  html_nodes("#active a") %>%
  html_attrs()  # returns one attribute set per link; html_attr("href") would give just the href strings
Now that the list of links is complete, I iterate through each link and pull a table from each:
teamdata <- c()
j <- 1
for(i in urls){
  bball <- html(paste(mainweb, i, sep=""))
  teamdata[j] <- bball %>%
    html_nodes(paste0("#", gsub("/teams/([A-Z]+)/$", "\\1", urls[j], perl=TRUE))) %>%
    html_table()
  j <- j + 1
}
Please see the code below, which basically builds off your code and loops through two different team pages as identified by the vector team_codes. The tables are returned in a list where each list element corresponds to a team's table. However, the tables look like they will need more cleaning.
library(XML)
library(gsubfn)
Player_Stats <- list()
j <- 1
team_codes <- c(575, 580)
for(code in team_codes) {
  URL <- paste0('http://stats.ncaa.org/team/stats?org_id=', code, '&sport_year_ctl_id=12280')
  tx <- readLines(URL)
  tx2 <- gsub("</tbody>","",tx)
  tx2 <- gsub("<tfoot>","",tx2)
  tx2 <- gsub("</tfoot>","</tbody>",tx2)
  Player_Stats[[j]] <- readHTMLTable(tx2, asText=TRUE, header = TRUE, which = 2, stringsAsFactors = FALSE)
  j <- j + 1
}
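If you then want one combined table across teams (an optional extra, assuming the per-team tables share the same columns; the org_id column name below just mirrors the url parameter), you could tag each element and bind the list:
names(Player_Stats) <- team_codes
combined <- do.call(rbind, Map(function(df, code) {df$org_id <- code; df}, Player_Stats, team_codes))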
