How to Read Multiple HTML Tables in R

I am trying to automate pulling in data with readHTMLTable and saving the result to a data frame. I am an R newbie and am having trouble figuring out how to write a loop that would automate this function, which works if you do it one URL at a time.
library('XML')
urls<-c("http://www.basketball-reference.com/teams/ATL/","http://www.basketball-reference.com/teams/BOS/")
theurl<-urls[2] #Pick second link (celtics)
tables <- readHTMLTable(theurl)
n.rows <- unlist(lapply(tables, function(t) dim(t)[1]))
BOS <-tables[[which.max(n.rows)]]
Team.History<-write.csv(BOS,"Bos.csv")
Any and all help would be very appreciated!

I think this combines the best of both answers (and tidies up a little).
library(RCurl)
library(XML)
stem <- "http://www.basketball-reference.com/teams/"
teams <- htmlParse(getURL(stem), asText=T)
teams <- xpathSApply(teams, "//*/a[contains(@href, '/teams/')]", xmlAttrs)[-1]
teams <- gsub("/teams/(.*)/", "\\1", teams)
urls <- paste0(stem, teams)
names(teams) <- NULL # get rid of the "href" labels
names(urls) <- teams
results <- data.frame()
for(team in teams){
  tables <- readHTMLTable(urls[team])
  n.rows <- unlist(lapply(tables, function(t) dim(t)[1]))
  team.results <- tables[[which.max(n.rows)]]
  write.csv(team.results, file=paste0(team, ".csv"))
  team.results$TeamCode <- team
  results <- rbind(results, team.results)
  rm(team.results, n.rows, tables)
}
rm(stem, team)
write.csv(results, file="AllTeams.csv")

I'm assuming you want to loop over your urls vector? I'd try something like this:
library('XML')
url_base <- "http://www.basketball-reference.com/teams/"
teams <- c("ATL", "BOS")
# better still, get the full list of teams as in
# http://stackoverflow.com/a/11804014/1543437
results <- data.frame()
for(team in teams){
  theurl <- paste0(url_base, team)  # url_base already ends in "/"
  tables <- readHTMLTable(theurl)
  n.rows <- unlist(lapply(tables, function(t) dim(t)[1]))
  team.results <- tables[[which.max(n.rows)]]
  write.csv(team.results, file=paste0(team, ".csv"))
  team.results$TeamCode <- team
  results <- rbind(results, team.results)
}
write.csv(results, file="AllTeams.csv")
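As a side note, growing results with rbind() inside the loop works fine for a couple of teams but slows down as the team list grows. A rough equivalent that collects the per-team tables in a list first (a sketch, assuming every team's largest table has the same columns):
team.list <- lapply(teams, function(team){
  tables <- readHTMLTable(paste0(url_base, team))
  n.rows <- unlist(lapply(tables, function(t) dim(t)[1]))
  team.results <- tables[[which.max(n.rows)]]
  team.results$TeamCode <- team
  team.results
})
results <- do.call(rbind, team.list)  # bind all teams into one data frame
write.csv(results, file="AllTeams.csv")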

Related

How can I tell R to apply functions to multiple data?

I have been stuck on this task for quite a long time and have tried different approaches, but couldn't succeed.
What I want is to apply the following four functions to 30 different data frames (data1, data2, ..., data30) within a for loop (or whatever works) in R. These datasets have the same 10 columns but different numbers of rows.
This is the code I wrote for the first data frame (data1). It works well.
for(i in 1:nrow(data1)){
  data1$simp <- diversity(data1$sp, "simpson")
  data1$shan <- diversity(data1$sp, "shannon")
  data1$E <- E(data1$sp)
  data1$D <- D(data1$sp)
}
I want to apply this code to the other 29 data frames so that I don't have to repeat the process 29 times.
The following code is what I am trying now, but it is still not right.
data.list <- list(data1, data2,data3,data4,data5)
for(i in data.list){
  data2 <- NULL
  i$simp <- diversity(i$sp, "simpson")
  i$shan <- diversity(i$sp, "shannon")
  i$E <- E(i$sp)
  i$D <- D(i$sp)
  data2 <- rbind(data2, i)
  print(data2)
}
So my question is: how can I tell R to apply these functions to the other 29 data frames?
Thanks in advance!
You can do this with Map.
fun <- function(DF){
  for(i in 1:nrow(DF)){
    DF$simp <- diversity(DF$sp, "simpson")
    DF$shan <- diversity(DF$sp, "shannon")
    DF$E <- E(DF$sp)
    DF$D <- D(DF$sp)
  }
  DF
}
result.list <- Map(fun, data.list)
Or, if you don't want to have a function fun in the .GlobalEnv, use lapply:
result.list <- lapply(data.list, function(DF){
  for(i in 1:nrow(DF)){
    DF$simp <- diversity(DF$sp, "simpson")
    DF$shan <- diversity(DF$sp, "shannon")
    DF$E <- E(DF$sp)
    DF$D <- D(DF$sp)
  }
  DF
})
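If the end goal is a single combined table rather than a list, the pieces can be bound back together afterwards; a minimal sketch, assuming all 30 data frames share the same columns (as stated in the question):
combined <- do.call(rbind, result.list)  # stack the processed data frames into one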
If I understand the question, you're ultimately asking about your 'data2' variable and how to merge these all together. I think the issue you're having is that you're setting data2 <- NULL with each loop iteration. The proposed solution below moves this definition outside the loop, and the call to rbind() should now append all your data frames together to return the consolidated dataset.
data.list <- list(data1, data2,data3,data4,data5) #all 29 can go here
data2 <- NULL
for(i in data.list){
  i$simp <- diversity(i$sp, "simpson")
  i$shan <- diversity(i$sp, "shannon")
  i$E <- E(i$sp)
  i$D <- D(i$sp)
  data2 <- rbind(data2, i)
}
print(data2)
I am assuming that your data1, ..., dataN are files stored in a directory that you're reading one at a time, and that they all have the same header.
What you can do is import them one at a time and then perform the operations you want, as you mentioned:
files <- list.files(directoryPath) #maybe you can grep() some specific files
for (f in files){
  data <- read.table(f) # choose header, sep and so on...
  for(i in 1:nrow(data)){
    data$simp <- diversity(data$sp, "simpson")
    data$shan <- diversity(data$sp, "shannon")
    data$E <- E(data$sp)
    data$D <- D(data$sp)
  }
}
Be careful: you must either be in the working directory or add the path to the file name while reading the tables (i.e. paste(path, f, sep="") or file.path(path, f)).
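Note that as written the loop overwrites data on every pass and keeps nothing, so if you also want to hold on to each processed file, one sketch (assuming the same directoryPath and column names as above) is to fill a named list:
results <- vector(mode="list", length(files))
names(results) <- files
for (f in files){
  data <- read.table(file.path(directoryPath, f), header=TRUE)  # adjust sep etc. as needed
  data$simp <- diversity(data$sp, "simpson")
  data$shan <- diversity(data$sp, "shannon")
  data$E <- E(data$sp)
  data$D <- D(data$sp)
  results[[f]] <- data  # keep the processed table, keyed by file name
}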
There are plenty of options, here's one using only base functions:
data.list <- list(data1, data2, data3, data4, data5)
changed_data <- lapply(data.list, function(my_data) {
  my_data$simp <- diversity(my_data$sp, "simpson")
  my_data$shan <- diversity(my_data$sp, "shannon")
  my_data$E <- E(my_data$sp)
  my_data$D <- D(my_data$sp)
  my_data
})
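A small follow-up on the design: if the input list is given names, lapply() carries them through, which makes it easy to tell the results apart later:
data.list <- list(data1=data1, data2=data2, data3=data3, data4=data4, data5=data5)
# after running the lapply() above, the results are available as
# changed_data$data1, changed_data$data2, and so on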

Web scraping data in R when there are a lot of links

I am trying to scrape baseball data from baseball-reference (e.g., https://www.baseball-reference.com/teams/NYY/2017.shtml). I have a huge vector of URLs that I created using a for loop, since the links follow a specific pattern. However, I am having trouble running my code, probably because I have to make too many connections within R. There are over 17,000 elements in my vector, and my code stops working once it gets to around 16,000. Is there an easier and perhaps more efficient way to do this?
require(Lahman)
library(XML)  # needed for htmlParse() and readHTMLTable() below
teams <- unique(Teams$franchID)
years <- 1871:2017
urls <- matrix(0, length(teams), length(years))
for(i in 1:length(teams)) {
  for(j in 1:length(years)) {
    urls[i, j] <- paste0("https://www.baseball-reference.com/teams/",
                         teams[i], "/", years[j], ".shtml")
  }
}
url_vector <- as.vector(urls)
list_of_batting <- list()
list_of_pitching <- list()
for(i in 1:length(url_vector)) {
  url <- url_vector[i]
  res <- try(readLines(url), silent = TRUE)
  ## check if website exists
  if(inherits(res, "try-error")) {
    list_of_batting[[i]] <- NA
    list_of_pitching[[i]] <- NA
  } else {
    urltxt <- readLines(url)
    urltxt <- gsub("-->", "", gsub("<!--", "", urltxt))
    doc <- htmlParse(urltxt)
    tables_full <- readHTMLTable(doc)
    tmp1 <- tables_full$players_value_batting
    tmp2 <- tables_full$players_value_pitching
    list_of_batting[[i]] <- tmp1
    list_of_pitching[[i]] <- tmp2
  }
  print(i)
  closeAllConnections()
}
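Part of the trouble may be that each page is downloaded twice: once in try(readLines(url)) and again inside the else branch, which doubles the number of connections. A sketch of the same loop that reuses the text already read and pauses between requests (the two-second delay is an arbitrary choice, not a documented limit):
for(i in seq_along(url_vector)) {
  res <- try(readLines(url_vector[i]), silent = TRUE)
  if(inherits(res, "try-error")) {
    list_of_batting[[i]] <- NA
    list_of_pitching[[i]] <- NA
  } else {
    urltxt <- gsub("-->", "", gsub("<!--", "", res))  # reuse the lines already downloaded
    doc <- htmlParse(paste(urltxt, collapse = "\n"), asText = TRUE)
    tables_full <- readHTMLTable(doc)
    list_of_batting[[i]] <- tables_full$players_value_batting
    list_of_pitching[[i]] <- tables_full$players_value_pitching
  }
  print(i)
  Sys.sleep(2)  # be gentle with the server between requests
}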

Creating a function to parse multiple XML files in R

I have code to parse a single XML file that comes from a data feed of a football match. However, I have over 300 games' worth of data, and I want to apply this code to all of those feeds, since doing it by hand would take a long time. I'm new to data science, and although I have seen other posts about parsing multiple XML files, I don't really know how to change the code so that it suits this data structure.
library(XML)
library(plyr)
library(gdata)
library(reshape)
f24 <- file.choose() #XML FILE TO BE PARSED
grabAll <- function(XML.parsed, field){
  parse.field <- xpathSApply(XML.parsed, paste("//", field, "[@*]", sep=""))
  results <- t(sapply(parse.field, function(x) xmlAttrs(x)))
  if(typeof(results)=="list"){
    do.call(rbind.fill, lapply(lapply(results, t), data.frame,
                               stringsAsFactors=F))
  } else {
    as.data.frame(results, stringsAsFactors=F)
  }
}
#Play-by-Play Parsing
pbpParse <- xmlInternalTreeParse(f24)
eventInfo <- grabAll(pbpParse, "Event")
eventParse <- xpathSApply(pbpParse, "//Event")
NInfo <- sapply(eventParse, function(x) sum(names(xmlChildren(x)) == "Q"))
QInfo <- grabAll(pbpParse, "Q")
EventsExpanded <- as.data.frame(lapply(eventInfo[,1:2], function(x) rep(x, NInfo)), stringsAsFactors=F)
QInfo <- cbind(EventsExpanded, QInfo)
names(QInfo)[c(1,3)] <- c("Eid", "Qid")
QInfo$value <- ifelse(is.na(QInfo$value), 1, QInfo$value)
Qual <- cast(QInfo, Eid ~ qualifier_id)
#FINAL DATA FOR ONE GAME
events <- merge(eventInfo, Qual, by.x="id", by.y="Eid", all.x=T, suffixes=c("", "Q"))
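To run this for all 300+ games, one option (a sketch, assuming the feed files sit together in one folder and share the structure above) is to wrap the single-game steps in a function and lapply() it over the file names instead of calling file.choose():
parseGame <- function(f24){
  pbpParse <- xmlInternalTreeParse(f24)
  eventInfo <- grabAll(pbpParse, "Event")
  eventParse <- xpathSApply(pbpParse, "//Event")
  NInfo <- sapply(eventParse, function(x) sum(names(xmlChildren(x)) == "Q"))
  QInfo <- grabAll(pbpParse, "Q")
  EventsExpanded <- as.data.frame(lapply(eventInfo[,1:2], function(x) rep(x, NInfo)), stringsAsFactors=F)
  QInfo <- cbind(EventsExpanded, QInfo)
  names(QInfo)[c(1,3)] <- c("Eid", "Qid")
  QInfo$value <- ifelse(is.na(QInfo$value), 1, QInfo$value)
  Qual <- cast(QInfo, Eid ~ qualifier_id)
  merge(eventInfo, Qual, by.x="id", by.y="Eid", all.x=T, suffixes=c("", "Q"))
}
# "feeds" is a hypothetical folder name; adjust the path and pattern to your files
files <- list.files("feeds", pattern="\\.xml$", full.names=TRUE)
all.games <- lapply(files, parseGame)  # one merged data frame per game
names(all.games) <- basename(files)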

Feeding R a list of webpages through a CSV

I have created a function which scrapes information and adds it to a data.frame. I want to feed this function a list of URLs from a .csv, but it does not seem to work when I turn it into a function.
IMDB <- function(wp){
  for (i in wp){
    raw_data <- getURL(i)
    data <- fromJSON(raw_data)
    data <- as.list(data)
    length(data)
    final_data <- do.call(rbind, data)
    Title <- final_data[c("Title"),]
    ScreenWriter <- final_data[c("Writer"),]
    Fdata <- cbind(Title, ScreenWriter)
    Authors <- rbind(Authors, Fdata)
  }
}
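The likely problem is that Authors is never initialized or returned, so nothing survives outside the function. A minimal sketch of one fix, assuming getURL() from RCurl and a fromJSON() function as in the original, plus a hypothetical urls.csv with a column named url:
library(RCurl)
library(jsonlite)  # any fromJSON() implementation should behave the same here
IMDB <- function(wp){
  Authors <- NULL  # start with an empty result
  for (i in wp){
    raw_data <- getURL(i)
    data <- as.list(fromJSON(raw_data))
    final_data <- do.call(rbind, data)
    Title <- final_data[c("Title"),]
    ScreenWriter <- final_data[c("Writer"),]
    Fdata <- cbind(Title, ScreenWriter)
    Authors <- rbind(Authors, Fdata)  # accumulate one row per URL
  }
  Authors  # return the combined table
}
# urls <- read.csv("urls.csv", stringsAsFactors=FALSE)$url  # hypothetical file and column
# movies <- IMDB(urls)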

Looping Issue - Storing data that is in a different format

I am having some trouble storing the data after the code runs. It picks the files up correctly and runs the forecast model, but it only stores the value for the last file; all the others are lost. Is there any way I can have all the results stored in a separate array? The problem is that the output is in "forecast" format, and because of that I am getting stuck. I have looked through plenty of websites but couldn't find anything like this.
Here is the code:
library(forecast)
library(quantmod)
library(forecast)
fileList <-as.array(length(50))
Forecast1 <- as.array(length(50))
fileList<-list.files(path ='C:\\Users\\User\\Downloads\\wOOLWORTHS\\',recursive =T, pattern = ".csv")
i<- integer()
j<-integer()
i=1
setwd("C:\\Users\\User\\Downloads\\wOOLWORTHS\\")
while (i<51)
{
  a <- fileList[i]
  print(a)
  a <- read.csv(a)
  fileSales <- a$sales
  fileTransform <- log(fileSales)
  plot.ts(fileTransform)
  result1 <- HoltWinters(fileTransform, beta = FALSE, gamma = FALSE, seasonal = "multiplicative", optim.control = TRUE)
  result2 <- forecast.HoltWinters(result1, h = 1)
  summary(result1)
  accuracy(result2)
  #Forecast1[i] <- result2(forecast)
  #print(Forecast1[i])
  i = i + 1
}
It may just be how you are storing your results. Try filling an empty list instead (e.g. Forecast1):
setwd("C:\\Users\\User\\Downloads\\wOOLWORTHS\\")
library(forecast)
library(quantmod)
fileList <- list.files(path = 'C:\\Users\\User\\Downloads\\wOOLWORTHS\\', recursive = T, pattern = ".csv")
Forecast1 <- vector(mode = "list", 50)
for(i in seq_along(fileList)){
  a <- fileList[[i]]
  #print(a)
  a <- read.csv(a)
  fileSales <- a$sales
  fileTransform <- log(fileSales)
  plot.ts(fileTransform)
  result1 <- HoltWinters(fileTransform, beta = FALSE, gamma = FALSE, seasonal = "multiplicative", optim.control = TRUE)
  result2 <- forecast(result1, h = 1)  # forecast() dispatches to the HoltWinters method
  #summary(result1)
  #accuracy(result2)
  Forecast1[[i]] <- result2
  #print(Forecast1[i])
  print(paste(i, "of", length(fileList), "completed"))
}
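Each element of Forecast1 is then a forecast object, so the one-step-ahead point forecasts can be pulled back out afterwards; a short sketch, assuming every file produced a forecast:
point.forecasts <- sapply(Forecast1, function(f) as.numeric(f$mean))  # numeric vector of h=1 forecasts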
