Web scraping data in R when there are a lot of links - r

I am trying to scrape baseball data from baseball-reference (e.g., https://www.baseball-reference.com/teams/NYY/2017.shtml). I have a huge vector of URLs that I created using a for loop, since the links follow a specific pattern. However, I am having trouble running my code, probably because I have to make too many connections within R. There are over 17,000 elements in my vector, and my code stops working once it gets to around 16,000. Is there an easier and perhaps more efficient way to do what my code does?
library(Lahman)
library(XML)  # provides htmlParse() and readHTMLTable()
teams <- unique(Teams$franchID)
years <- 1871:2017
urls <- matrix(0, length(teams), length(years))
for(i in 1:length(teams)) {
  for(j in 1:length(years)) {
    urls[i, j] <- paste0("https://www.baseball-reference.com/teams/",
                         teams[i], "/", years[j], ".shtml")
  }
}
url_vector <- as.vector(urls)
list_of_batting <- list()
list_of_pitching <- list()
for(i in 1:length(url_vector)) {
  url <- url_vector[i]
  res <- try(readLines(url), silent = TRUE)
  ## check if website exists
  if(inherits(res, "try-error")) {
    list_of_batting[[i]] <- NA
    list_of_pitching[[i]] <- NA
  } else {
    urltxt <- readLines(url)
    urltxt <- gsub("-->", "", gsub("<!--", "", urltxt))
    doc <- htmlParse(urltxt)
    tables_full <- readHTMLTable(doc)
    tmp1 <- tables_full$players_value_batting
    tmp2 <- tables_full$players_value_pitching
    list_of_batting[[i]] <- tmp1
    list_of_pitching[[i]] <- tmp2
  }
  print(i)
  closeAllConnections()
}
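The loop above calls readLines() twice for every URL that exists, so one thing worth trying is to download each page only once and reuse the result. The following is only a minimal sketch of that idea, not a confirmed fix for the failure near 16,000: the two team codes and two years are hypothetical stand-ins for the Lahman data, fetch_page() and strip_comments() are helper names made up for illustration, and the one-second pause is a guess.

```r
# Hypothetical short inputs standing in for unique(Teams$franchID) and 1871:2017
teams <- c("NYY", "BOS")
years <- 2016:2017

# expand.grid() enumerates every team/year pair, so no nested loop is needed
grid <- expand.grid(team = teams, year = years, stringsAsFactors = FALSE)
url_vector <- paste0("https://www.baseball-reference.com/teams/",
                     grid$team, "/", grid$year, ".shtml")

# Download a page at most once; return NULL instead of stopping on failure
fetch_page <- function(url) {
  tryCatch(readLines(url, warn = FALSE), error = function(e) NULL)
}

# Remove HTML comment markers so tables hidden inside comments become parseable
strip_comments <- function(lines) gsub("<!--|-->", "", lines)

# Sketch of the main loop (left commented out because it needs network access):
# pages <- vector("list", length(url_vector))
# for (i in seq_along(url_vector)) {
#   txt <- fetch_page(url_vector[i])
#   if (!is.null(txt)) pages[[i]] <- strip_comments(txt)
#   Sys.sleep(1)  # pause between requests; the delay length is a guess
# }
```

Each element of pages that is not NULL could then be passed to htmlParse() and readHTMLTable() exactly as in the original loop.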

Related

R: Creating new variables in a for loop and assign values

I would like to create a new variable, assign a list of values, and write into a hierarchical data frame. I tried the below but it is not writing in.
for(i in 1:sample){
  for(j in 1:10){
    x[,j]<-0
    name <- paste("hierdata[[i]]$Test", j, sep = "_")
    assign(name, rowSums(alpha+beta+x))
  }
}
Appreciate some help.
To address @Moody_Mudskipper's comment:
The best way to declare hierarchical data is to use indexed lists:
hierdata <- lapply(1:sample, function(iterator) {
  temp_list <- lapply(1:10, function(j) {
    x[, j] <- 0
    value <- rowSums(alpha + beta + x)
    return(value)
  })
  names(temp_list) <- paste0("temp_", 1:10)
  return(temp_list)
})
Not really a "one-liner", but it contains all the good "stuff": lapply returns a list by default, so this simply nests a list inside each element of the outer list.
Hope you enjoy some "good practice". :)
In my first try I just wanted to address your question precisely.
Try using
for(i in 1:sample){
  hierdata[[i]] <- list()
  for(j in 1:10){
    code_j_init <- paste0("hierdata[[",i,"]]$Test_",j,"<- list()")
    eval(parse(text = code_j_init))
    hierdata[[i]][[j]] <- list(1,2,3)
  }
}
or
for(i in 1:sample){
  hierdata[[i]] <- list()
  names <- c()
  for(j in 1:10){
    hierdata[[i]][[j]] <- list(1,2,3)
    names <- c(names,paste0("Test_",j))
  }
  names(hierdata[[i]]) <- names
}
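For comparison, here is a minimal sketch that builds the same kind of nested, named list without assign() or eval(parse()). The matrices alpha, beta and x are toy inputs invented for illustration, and n_groups plays the role of sample from the question.

```r
# Toy inputs (assumptions): three 4 x 10 numeric matrices
alpha <- matrix(1, nrow = 4, ncol = 10)
beta  <- matrix(2, nrow = 4, ncol = 10)
x     <- matrix(3, nrow = 4, ncol = 10)
n_groups <- 2  # plays the role of `sample` in the question

hierdata <- lapply(seq_len(n_groups), function(i) {
  tests <- lapply(1:10, function(j) {
    x_j <- x
    x_j[, j] <- 0                 # zero out column j in a local copy only
    rowSums(alpha + beta + x_j)
  })
  setNames(tests, paste0("Test_", 1:10))  # name the elements Test_1 ... Test_10
})

# Access a value by group index and test name
hierdata[[1]]$Test_3
```

Because each element is named, hierdata[[i]]$Test_j works directly, which is what the assign() call in the question was trying to achieve.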

How can I tell R to apply functions to multiple data?

I have been stuck on this for quite a long time; I tried different approaches but couldn't succeed.
What I want is to apply the following 4 functions to 30 different datasets (data1, data2, ..., data30) within a for loop or similar in R. These datasets have the same number of columns (10) and different numbers of rows.
This is the code I wrote for the first dataset (data1). It works well.
for(i in 1:nrow(data1)){
  data1$simp <-diversity(data1$sp, "simpson")
  data1$shan <-diversity(data1$sp, "shannon")
  data1$E <- E(data1$sp)
  data1$D <- D(data1$sp)
}
I want to apply this code to the other 29 datasets so that I don't have to repeat the process 29 times.
The following code is what I am trying now, but it is still not right.
data.list <- list(data1, data2,data3,data4,data5)
for(i in data.list){
  data2 <- NULL
  i$simp <-diversity(i$sp, "simpson")
  i$shan <-diversity(i$sp, "shannon")
  i$E <- E(i$sp)
  i$D <- D(i$sp)
  data2 <- rbind(data2, i)
  print(data2)
}
So how can I tell R to apply these functions to the other 29 datasets?
Thanks in advance!
You can do this with Map.
fun <- function(DF){
  for(i in 1:nrow(DF)){
    DF$simp <-diversity(DF$sp, "simpson")
    DF$shan <-diversity(DF$sp, "shannon")
    DF$E <- E(DF$sp)
    DF$D <- D(DF$sp)
  }
  DF
}
result.list <- Map(fun, data.list)
Or, if you don't want to have a function fun in the .GlobalEnv, with lapply.
result.list <- lapply(data.list, function(DF){
  for(i in 1:nrow(DF)){
    DF$simp <-diversity(DF$sp, "simpson")
    DF$shan <-diversity(DF$sp, "shannon")
    DF$E <- E(DF$sp)
    DF$D <- D(DF$sp)
  }
  DF
})
If I understand the question, you're ultimately asking about your 'data2' variable and how to merge these all together? I think the issue you're having is that you're setting data2 <- NULL on each loop iteration. The proposed solution below moves this definition outside the loop, so the call to rbind() should now append all your data frames together and return the consolidated dataset.
data.list <- list(data1, data2,data3,data4,data5) #all 29 can go here
data2 <- NULL
for(i in data.list){
  i$simp <-diversity(i$sp, "simpson")
  i$shan <-diversity(i$sp, "shannon")
  i$E <- E(i$sp)
  i$D <- D(i$sp)
  data2 <- rbind(data2, i)
}
print(data2)
I am assuming that your data1, ..., dataN are files stored in a directory and that you're reading them one at a time. I also assume they have the same header.
What you can do is import them one at a time and then perform the operations you want, as you mentioned:
files <- list.files(directoryPath) #maybe you can grep() some specific files
for (f in files){
  data <- read.table(f) #choose header, sep and so on...
  for(i in 1:nrow(data)){
    data$simp <-diversity(data$sp, "simpson")
    data$shan <-diversity(data$sp, "shannon")
    data$E <- E(data$sp)
    data$D <- D(data$sp)
  }
}
Be careful: you must either be in the working directory that contains the files or add the path to the file name when reading the tables (i.e. paste(path, f, sep="")).
There are plenty of options, here's one using only base functions:
data.list <- list(data1, data2, data3, data4, data5)
changed_data <- lapply(data.list, function(my_data) {
  my_data$simp <-diversity(my_data$sp, "simpson")
  my_data$shan <-diversity(my_data$sp, "shannon")
  my_data$E <- E(my_data$sp)
  my_data$D <- D(my_data$sp)
  my_data
})
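To make the lapply pattern concrete, here is a small self-contained sketch. The functions simpson() and shannon() are hand-written stand-ins for diversity() (the E() and D() functions from the question are not reproduced), the two data frames are toy data, and mget() is used only to show one way of collecting data1, data2, ... into a list without typing every name.

```r
# Hand-written stand-ins (assumptions) for diversity(x, "simpson") and diversity(x, "shannon")
simpson <- function(counts) { p <- counts / sum(counts); 1 - sum(p^2) }
shannon <- function(counts) { p <- counts / sum(counts); -sum(p * log(p)) }

# Toy datasets; in the question there would be data1 ... data30
data1 <- data.frame(sp = c(10, 10))
data2 <- data.frame(sp = c(5, 5, 5, 5))

# Collect the objects by name into a named list
data.list <- mget(paste0("data", 1:2))

# The indices are computed once per data frame, so no loop over rows is needed
add_indices <- function(DF) {
  DF$simp <- simpson(DF$sp)
  DF$shan <- shannon(DF$sp)
  DF
}
result.list <- lapply(data.list, add_indices)

# Optionally stack everything into one data frame
combined <- do.call(rbind, result.list)
```

Keeping the results in a named list means result.list$data2 still gives you an individual dataset, while combined holds all rows together.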

How to loop through a list of strings in r

I have some bloated code in R which I'm trying to streamline. I'm trying to read spreadsheets into a dataframe and then transpose each one.
I have a list as follows
var <- c("amp_genes.annotated.BLCA.txt","amp_genes.annotated.BRCA.txt")
for (i in var) {
  var[i] <- readWorksheet(wk, sheet="var[i]", header=T)
  var[i] <- as.data.frame(var[i])
  var[i] <- t(var1[i][3:ncol(var1[i]),])
}
The sheet = line has to have double quotes around the string variable.
This just tells me I have an unexpected }
Maybe try this; not sure it will work as I don't have your spreadsheets, but give it a try and let me know... And maybe if it doesn't work right out, it can hopefully unblock you wherever you're stuck.
library(XLConnect)
wk <- loadWorkbook("workbookname.xls")
sheetnames <- getSheets(object = wk)
content.tr <- list()
# To access sheets by their names
for (sheetname in sheetnames) {
  content <- readWorksheet(wk, sheet=sheetname, header=T)
  content.tr[[sheetname]] <- t(content[3:ncol(content),])
}
# To access sheets by their position
for (pos in c(1,2)) {
  content <- readWorksheet(wk, sheet=pos, header=T)
  content.tr[[sheetnames[pos]]] <- t(content[3:ncol(content),])
}
To access the dataframes:
names(content.tr)
spreadsheet1 <- content.tr[[1]]
spreadsheet2 <- content.tr[[2]]
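The same looping pattern can be tried without a workbook. In the sketch below a named list of toy data frames stands in for the worksheets (an assumption, since the real spreadsheets are not available), and columns 3 onward are transposed, which is a guess at the intent. Note that in a for loop over a character vector the loop variable already holds the string itself, so it is passed as is; writing "var[i]" in quotes would pass that literal text instead.

```r
# Toy stand-ins for the worksheets (assumption): a named list of data frames
sheets <- list(
  BLCA = data.frame(id = 1:2, gene = c("a", "b"), s1 = c(0.1, 0.2), s2 = c(0.3, 0.4)),
  BRCA = data.frame(id = 1:2, gene = c("c", "d"), s1 = c(0.5, 0.6), s2 = c(0.7, 0.8))
)

transposed <- list()
for (sheetname in names(sheets)) {
  content <- sheets[[sheetname]]   # sheetname is already the string we need
  # Keep columns 3 onward and transpose them (the column choice is an assumption)
  transposed[[sheetname]] <- t(content[, 3:ncol(content)])
}

names(transposed)
```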

Array of tables

I want to write some R tables into an Excel file. So I have the following:
data <- list.files(path=getwd())
n <- length(data)
for (i in 1:n)
{
  data1 <- read.csv(data[i])
  outline <- data1[,2]
  outline <- as.table(outline)
  print(outline) # this prints all n tables
  write.csv(outline, 'Test.csv') # this only writes the last table
}
But I only get the last file written into the csv file. Not all of them. How would I fix this?
You're writing to Test.csv on every iteration, so each write overwrites the previous one. You need to change the file name at each step to keep the different files.
Try:
data <- list.files(path=getwd())
n <- length(data)
for (i in 1:n)
{
  data1 <- read.csv(data[i])
  outline <- data1[,2]
  outline <- as.table(outline)
  print(outline) # this prints all n tables
  name <- paste0(i,"X.csv")
  write.csv(outline, name)
}
Looking at your code perhaps you want this instead:
data <- list.files(path=getwd())
n <- length(data)
for (i in 1:n)
{
  data1 <- read.csv(data[i])
  outline <- data1[,2]
  outline <- as.data.frame(table(outline))
  print(outline) # this prints all n tables
  name <- paste0(i,"X.csv")
  write.csv(outline, name)
}
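If the goal is a single output file rather than one file per input, another option is to collect the tables in a list and write them once at the end. This is a minimal sketch under assumptions: the two input CSV files are toy data written to a temporary directory, and the source column is added only so that rows can be traced back to their input file.

```r
# Toy CSV inputs written to a temporary directory (assumption)
in_dir <- file.path(tempdir(), "tables_in")
dir.create(in_dir, showWarnings = FALSE)
write.csv(data.frame(id = 1:3, grp = c("a", "b", "a")),
          file.path(in_dir, "f1.csv"), row.names = FALSE)
write.csv(data.frame(id = 1:2, grp = c("b", "b")),
          file.path(in_dir, "f2.csv"), row.names = FALSE)

files <- list.files(in_dir, pattern = "\\.csv$", full.names = TRUE)

# Build one frequency table per file and remember where it came from
counts <- lapply(files, function(f) {
  outline <- read.csv(f)[, 2]
  out <- as.data.frame(table(outline), stringsAsFactors = FALSE)
  out$source <- basename(f)
  out
})

# Stack the tables and write a single file after the loop
all_counts <- do.call(rbind, counts)
out_file <- file.path(tempdir(), "Test.csv")
write.csv(all_counts, out_file, row.names = FALSE)
```

Writing once after the loop avoids the overwrite problem entirely, because write.csv() is never called inside the iteration.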

How to Read Multiple HTML Tables in R

I am trying to automate pulling HTML tables with readHTMLTable and saving each one to a data frame. I am an R newbie and am having trouble writing a loop around the code below, which works when I run it for one URL at a time.
library('XML')
urls<-c("http://www.basketball-reference.com/teams/ATL/","http://www.basketball-reference.com/teams/BOS/")
theurl<-urls[2] #Pick second link (celtics)
tables <- readHTMLTable(theurl)
n.rows <- unlist(lapply(tables, function(t) dim(t)[1]))
BOS <-tables[[which.max(n.rows)]]
Team.History<-write.csv(BOS,"Bos.csv")
Any and all help would be very appreciated!
I think this combines the best of both answers (and tidies up a little).
library(RCurl)
library(XML)
stem <- "http://www.basketball-reference.com/teams/"
teams <- htmlParse(getURL(stem), asText=T)
teams <- xpathSApply(teams,"//*/a[contains(@href,'/teams/')]", xmlAttrs)[-1]
teams <- gsub("/teams/(.*)/", "\\1", teams)
urls <- paste0(stem, teams)
names(teams) <- NULL # get rid of the "href" labels
names(urls) <- teams
results <- data.frame()
for(team in teams){
  tables <- readHTMLTable(urls[team])
  n.rows <- unlist(lapply(tables, function(t) dim(t)[1]))
  team.results <- tables[[which.max(n.rows)]]
  write.csv(team.results, file=paste0(team, ".csv"))
  team.results$TeamCode <- team
  results <- rbind(results, team.results)
  rm(team.results, n.rows, tables)
}
rm(stem, team)
write.csv(results, file="AllTeams.csv")
I'm assuming you want to loop over your urls vector? I'd try something like this:
library('XML')
url_base <- "http://www.basketball-reference.com/teams/"
teams <- c("ATL", "BOS")
# better still, get the full list of teams as in
# http://stackoverflow.com/a/11804014/1543437
results <- data.frame()
for(team in teams){
  theurl <- paste0(url_base, team, "/") # url_base already ends in "/"
  tables <- readHTMLTable(theurl)
  n.rows <- unlist(lapply(tables, function(t) dim(t)[1]))
  team.results <-tables[[which.max(n.rows)]]
  write.csv(team.results, file=paste0(team, ".csv"))
  team.results$TeamCode <- team
  results <- rbind(results, team.results)
}
write.csv(results, file="AllTeams.csv")
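Both answers pick the table with the most rows via which.max(n.rows). One caveat: if the list of tables contains a NULL element, unlist() drops it and the position returned by which.max() no longer lines up with the original list. The helper below, largest_table(), is a hypothetical name introduced for illustration, and the list of tables is toy data rather than real readHTMLTable() output.

```r
# Return the element with the most rows, counting NULL entries as zero rows
largest_table <- function(tables) {
  n.rows <- vapply(tables,
                   function(t) if (is.null(t)) 0L else nrow(t),
                   integer(1))
  tables[[which.max(n.rows)]]
}

# Toy list standing in for the result of readHTMLTable() (assumption)
tables <- list(
  small = data.frame(x = 1:2),
  empty = NULL,
  big   = data.frame(x = 1:5)
)

biggest <- largest_table(tables)
nrow(biggest)
```

Because vapply() returns exactly one count per list element, the index from which.max() always refers to the right position in tables.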
