Web Scraping (in R) - readHTMLTable error

I have a file called Schedule.csv, which is structured as follows:
URLs
http://www.basketball-reference.com/friv/dailyleaders.cgi?month=10&day=27&year=2015
http://www.basketball-reference.com/friv/dailyleaders.cgi?month=10&day=28&year=2015
I am trying to use the explanation provided in the following question to scrape the html tables but it isn't working: How to scrape HTML tables from a list of links
My current code is as follows:
library(XML)
schedule<-read.csv("Schedule.csv")
stats <- list()
for(i in seq_along(schedule))
{
  print(i)
  total <- readHTMLTable(schedule[i])
  n.rows <- unlist(lapply(total, function(t) dim(t)[1]))
  stats[[i]] <- as.data.frame(total[[which.max(n.rows)]])
}
I get an error when I run this code as follows:
Error in (function (classes, fdef, mtable) : unable to find an inherited method for function ‘readHTMLTable’ for signature ‘"data.frame"’
If I manually type the URLs in a vector as per below, I get exactly what I want when I run the readHTMLTable code.
schedule<-c("http://www.basketball-reference.com/friv/dailyleaders.cgi?month=10&day=27&year=2015","http://www.basketball-reference.com/friv/dailyleaders.cgi?month=10&day=28&year=2015")
Can someone please explain to me why the read.csv is not giving me a usable vector of information to input into the readHTMLTable function?

read.csv creates a data.frame in your schedule variable. You then need to access it by rows (seq_along and schedule[i] work along the columns of the data frame, not its rows).
In your case you can do:
for (i in 1:nrow(schedule)) {
  total <- readHTMLTable(schedule[i, 1])
As I understand it, you want the first column of your data.frame; change the , 1] index or use column names otherwise.
Also notice that read.csv will read your first column as a factor so you may prefer to read it as a character:
schedule<-read.csv("Schedule.csv", as.is = TRUE)
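Putting the pieces together, a minimal sketch of the corrected loop (assuming the URLs sit in the first column of Schedule.csv):
library(XML)
schedule <- read.csv("Schedule.csv", as.is = TRUE)
stats <- list()
for (i in 1:nrow(schedule)) {
  # schedule[i, 1] is a single character string, which readHTMLTable accepts
  total <- readHTMLTable(schedule[i, 1])
  n.rows <- unlist(lapply(total, function(t) dim(t)[1]))
  stats[[i]] <- as.data.frame(total[[which.max(n.rows)]])
}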
Another alternative, if your file has a single column, is to use readLines, and then you can keep your loop as it was...
schedule <- readLines("Schedule.csv")
stats <- list()
for(i in seq_along(schedule))
{
  print(i)
  total <- readHTMLTable(schedule[i])
  ...
but be careful with the column name, because the header will end up as the first element of your schedule vector.
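For example, a one-line sketch that drops that header element on read:
schedule <- readLines("Schedule.csv")[-1]  # [-1] drops the 'URLs' header line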

Related

Using for loop to scrape webpages in R

I am trying to scrape multiple webpages by using the list of URLs (a csv file)
This is my dataset: https://www.mediafire.com/file/9qh516tdcto7is7/nyp_data.csv/file
The url column includes all the links that I am trying to use and scrape.
I tried to use a for() loop:
library(readr)
library(rvest)
news_urls <- read_csv("nyp_data.csv")
content_list <- vector()
for (i in 1:nrow(news_urls)) {
  nyp_url <- news_urls[i, 'url']
  nyp_html <- read_html(nyp_url)
  nyp_nodes <- nyp_html %>%
    html_elements(".single__content")
  tag_name = ".single__content"
  nyp_texts <- nyp_html %>%
    html_elements(tag_name) %>%
    html_text()
  content_list[i] <- nyp_texts[1]
}
However, I am getting an error that says:
Error in UseMethod("read_xml") : no applicable method for
'read_xml' applied to an object of class "c('tbl_df', 'tbl',
'data.frame')"
I believe the links that I have work well; they aren't broken and I can access them by clicking an individual link.
If a for loop isn't what I should be using here, do you have any other idea to scrape the content?
I also tried:
urls <- news_urls[,5] # identify the column with the urls
url_xml <- try(apply(urls, 1, read_html)) # apply the function read_html() to the `url` vector
textScraper <- function(x) {
  html_text(html_nodes(x, ".single__content")) %>% # in this data, my text is in a node called ".single__content"
    str_replace_all("\n", "") %>%
    str_replace_all("\t", "") %>%
    paste(collapse = '')
}
article_text <- lapply(url_xml, textScraper)
article_text[1]
but it kept giving me an error:
Error in open.connection(x, "rb") : HTTP error 404
The error occurs in this line:
nyp_html <- read_html(nyp_url)
As the error message tells you, the argument to read_xml (which is what is called internally by read_html) is a data.frame (amongst other classes, as it is actually a tibble).
This is because in this line:
nyp_url <- news_urls[i, 'url']
you are using single brackets to subset your data. Single brackets do return a data.frame containing the filtered data. You can avoid this by using double brackets like this:
nyp_url <- news_urls[[i, 'url']]
or this (which I usually find more readable):
nyp_url <- news_urls[i, ]$url
Either should fix your problem.
If you want to read more about using these notations you could look at this answer.
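To see the difference, here is a quick sketch on a toy tibble (the column name url mirrors the question; the URL values are placeholders):
library(tibble)
df <- tibble(url = c("https://example.com/a", "https://example.com/b"))
class(df[1, "url"])   # "tbl_df" "tbl" "data.frame" -- still a tibble
class(df[[1, "url"]]) # "character" -- a plain string, safe to pass to read_html()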

XML to data.frame with multiple files in R

I have a file containing multiple XML declarations which I was able to detect and individually read them from this post: Parseing XML by R always return XML declaration error . The data comes from: https://www.google.com/googlebooks/uspto-patents-applications-text.html.
### read xml document with more than one <?xml declaration in R
library(XML)
lines <- readLines("pa020829.xml")
start <- grep('<?xml version="1.0" encoding="UTF-8"?>', lines, fixed = T)
end <- c(start[-1] - 1, length(lines))
get.xml <- function(i) {
  txt <- paste(lines[start[i]:end[i]], collapse = "\n")
  # print(i)
  xmlTreeParse(txt, asText = T)
  # return(i)
}
docs <- lapply(1:10,get.xml)
> class(docs)
[1] "list"
> class(docs[1])
[1] "list"
> class(docs[[1]])
[1] "XMLDocument" "XMLAbstractDocument"
The list docs contains 10 similar documents called docs[[1]], docs[[2]], ... . I managed to extract the root of a single doc and to insert it into a matrix:
root <- xmlRoot(docs[[1]])
d <- rbind(unlist(xmlSApply(root[[1]], function(x) xmlSApply(x, xmlValue))))
However, I need to write code that would automatically retrieve the data of all 10 documents and attach them to a single data frame.
I tried the code below but it only retrieves the data of the first document's root and attaches it multiple times to the matrix.
d <- lapply(docs, function(x) rbind(unlist(xmlSApply(root, function(x) xmlSApply(x, xmlValue)))))
I guess I need to change the way I call the root in the function.
Any idea on how to create a matrix with the data from all the documents?
The following code will return a matrix containing the data from all the documents:
getXmlInternal <- function(x) {
  rbind(unlist(xmlSApply(xmlRoot(x), function(y) xmlSApply(y, xmlValue))))
}
d <- do.call(rbind, lapply(docs, getXmlInternal))
This fixes the xmlRoot issue you mention by running that command on each of the documents supplied by the lapply command. The lapply call is combined with do.call(rbind, ...) so that the rows from all the documents are stacked into a single matrix, as requested.
The getXmlInternal function is included to make the answer a little more readable.
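If you would rather end up with a data frame than a matrix, one extra step converts the result (a hedged sketch; column types are left as character):
d_df <- as.data.frame(d, stringsAsFactors = FALSE)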

R: Error when looping to replace synonyms on corpus

I'm very new to R. Using the TM package, I'm trying to clean a set of txt documents by replacing synonyms.
As I will be working with a lot of data, I have tried to set up a table in Excel where the words in the first column will be replaced with the words in the second column, and to perform a loop to replace the words in my corpus. My code is as follows:
library(tm)
docs <- Corpus(DirSource("C:....txt files"))
list <- read.csv("C:\\.....synonyms list.csv", header=F)
for(s in 1:length(docs)){
  for(x in 1:nrow(list)){
    docs[[s]]$content <- gsub(list[x,1], list[x,2], docs[[s]])
  }
}
However, I got the error: Error in [.data.frame(x$dmeta, tag) : undefined columns selected
Does anyone know what went wrong?
Thanks!
Maybe instead of docs[[s]]$content <- gsub(list[x,1], list[x,2], docs[[s]]) you need docs[[s]]$content <- gsub(list[x,1], list[x,2], docs[[s]]$content). I say maybe because I cannot really test it without any data.
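For reference, a minimal sketch of the loop with that fix applied (the paths here are placeholders standing in for the truncated ones in the question):
library(tm)
docs <- Corpus(DirSource("path/to/txt/files"))
synonyms <- read.csv("path/to/synonyms.csv", header = FALSE, stringsAsFactors = FALSE)
for (s in seq_along(docs)) {
  for (x in seq_len(nrow(synonyms))) {
    # operate on the document's character content, not on the whole document object
    docs[[s]]$content <- gsub(synonyms[x, 1], synonyms[x, 2], docs[[s]]$content)
  }
}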

getting error: 'incorrect number of dimensions' when executing a function in R 3.1.2. What does it mean and how to fix it?

Let me first describe the function and what I have to process.
Basically there's this folder containing some 300 comma-separated value files. Each file has an ID associated with it, as in 200.csv has ID 200 in it, and contains some data pertaining to sulphate and nitrate pollutants. What I have to do is calculate the mean of these particles for either one ID or a range of IDs. For example, calculating the mean of sulphate for ID 5, or calculating the same thing for IDs 5:10.
This is my procedure for processing the data, but I'm getting a silly error at the end.
I have a list vector of these .csv files, and a master data frame combining all these files (I used the data.table package for this).
Time to describe the function:
pollutantmean <- function(specdata, pollutant, id) {
  specdata <- rbindlist(filelist)
  setkey(specdata, ID) ## because ID needs to be sorted out
  for(i in id)
    if(pollutant == 'sulphate'){
      return(mean(specdata[, 'sulphate'], na.rm = TRUE))
    }else{
      if(pollutant == 'nitrate'){
        return(mean(specdata[, 'nitrate'], na.rm = TRUE))
      }else{
        print('NA')
      }
    }
}
Now the function is very simple. I defined specdata, and I defined the for loop to calculate data for each id. I get no error when the function is defined, but there is one error that is the last obstacle:
'Error in "specdata"[, "sulfate"] : incorrect number of dimensions'
It appears when I execute the function. Could someone elaborate?
I think this is the kind of task where the plyr package will be really helpful.
More specifically, I think you should use ldply and a bespoke function.
ldply: takes a list as the input and gives a data frame as the output. The list should be the directory contents, and the output will be the summary values from each of the csv files.
Your example isn't fully reproducible so the code below is just an example structure:
require(plyr)
require(stringr)
dir_with_files <- "dir_with_files" # placeholder directory name
files_to_extract <- list.files(dir_with_files, pattern = ".csv$")
files_to_extract <- str_replace(files_to_extract, ".csv$", "")
fn <- function(this_file_name){
  file_loc <- paste0(dir_with_files, "/", this_file_name, ".csv")
  full_data <- read.csv(file_loc)
  out <- data.frame(
    variable_name = this_file_name,
    variable_summary = mean(full_data$variable_to_summarise)
  )
  return(out)
}
summarised_output <- ldply(files_to_extract, fn)
Hope that helps. It probably won't work first time, and you might want to add some additional conditions and so on to handle files that don't have the expected contents. Happy to discuss what it's doing, but probably best to read this, as once you understand the approach it makes all kinds of tasks much easier.
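For comparison, a base-R sketch of the same list-in, data-frame-out pattern, under the same assumptions about the directory and column names:
files <- list.files("dir_with_files", pattern = "\\.csv$", full.names = TRUE)
rows <- lapply(files, function(f) {
  full_data <- read.csv(f)
  data.frame(variable_name = sub("\\.csv$", "", basename(f)),
             variable_summary = mean(full_data$variable_to_summarise, na.rm = TRUE))
})
summarised_output <- do.call(rbind, rows)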

non-numeric argument error while running FFT on a matrix

I read each line of a csv file and save the first element of each line in a list, then I want to run an FFT on this list, but I get this error:
Error in fft(x) : non-numeric argument
In my example here I read 4 rows:
con <- file("C:\\bla\\test.csv", "r")
datalist <- list()
m <- list()
for(i in 1:4)
{
  line <- readLines(con, n = 1, warn = FALSE)
  m <- list(as.integer(unlist(strsplit(line, split = ","))))
  datalist <- c(datalist, sapply(m, "[[", 1))
}
datalist
close(con)
fftfun <- function(x) { fft(x) }
fft_amplitude <- function(x) { sqrt((Re(fft(x)))^2 + (Im(fft(x)))^2) }
apply(as.matrix(datalist), 2, FUN = fftfun)
What should I do to solve this problem?
EDIT
My rows in csv file:
12,85,365,145,23
13,84,364,144,21
14,86,366,143,24
15,83,363,146,22
16,85,365,145,23
17,80,361,142,21
Your code seems overly complicated. Why don't you just do something like this:
df <- read.csv("test.csv", header=FALSE)
x <- df[,1]
fft(x)
Or, if you really want to read line by line:
con <- file("test.csv","r")
data <- NULL
for (i in 1:4) {
  line <- readLines(con, n = 1, warn = FALSE)
  data <- c(data, as.numeric(strsplit(line, split = ",")[[1]][1]))
}
close(con)
fft(data)
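If you also want amplitudes, the fft_amplitude helper from the question works directly on this numeric vector; note it is equivalent to the built-in Mod():
fft_amplitude <- function(x) { sqrt((Re(fft(x)))^2 + (Im(fft(x)))^2) }
fft_amplitude(data) # same values as Mod(fft(data))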
Let's assume your real question is: what happened to make apparently numeric data become non-numeric? Rather than slogging through the incredible number of type coercions in your code (csv to matrix to list to other list to as.matrix), I'm going to recommend you start by just plain reading one file into R and checking the typeof and class of each column. If anything turns out to be a factor rather than numeric, you may need to add the argument colClasses='character'.
If the data as read are numeric, then you're fouling it up in your subsequent conversions. Try simplifying the code as much as possible.
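A quick sketch of that check, using the sample file from the question:
df <- read.csv("C:\\bla\\test.csv", header = FALSE)
sapply(df, class) # any 'factor' or 'character' column means non-numeric data
str(df)           # compact overview of every column's type
# if a column comes in as a factor, re-read with:
# df <- read.csv("C:\\bla\\test.csv", header = FALSE, colClasses = "character")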
