I have code that I built to scrape player data from Yahoo's fantasy football player page so I can get a list of players and the rank that Yahoo gives them.
The code worked fine last year, but now I am getting an error when I run the separate function:
> temp <- separate(temp,two,c('Note', 'Player','a','b','c','Opp'), sep="\n", remove=TRUE)
Error in `[.data.frame`(x, x_vars) : undefined columns selected
In addition: Warning message:
Expected 6 pieces. Missing pieces filled with `NA` in 1 rows [1].
I cannot figure out why it is giving this error; the column I am trying to separate looks correct. I have another script that uses this function to do something similar, and it worked fine when I tried it there.
The "Missing pieces filled with `NA`" warning shouldn't be a problem; the issue is that the code won't run because of the undefined columns error.
The minimal code that I use to get to where I am is this:
library(rvest) # for read_html()
library(tidyr) # for separate()
#scrapes the data
url <- 'https://football.fantasysports.yahoo.com/f1/107573/players?status=A&pos=O&cut_type=9&stat1=S_S_2017&myteam=0&sort=PR&sdir=1&count=0'
web <- read_html(url)
table = html_nodes(web, 'table')
temp <- html_table(table)[[2]]
#
colnames(temp) <- c('one','two',3:26)
temp <- separate(temp,two,c('Note', 'Player','a','b','c','Opp'), sep="\n", remove=TRUE)
The data is scraped without column names, so I quickly assign names, including spelling out the name of the column in question so it works with the separate function. I have tried using quotation marks around two in separate, but it gives the same error.
After removing the first row of temp, your code works.
library(dplyr)
colnames(temp) <- c('one','two',3:ncol(temp))
# Use ncol(temp) to make sure the column number is correct
temp2 <- temp %>%
filter(row_number() > 1) %>%
separate(two, c('Note', 'Player','a','b','c','Opp'), sep="\n", remove=TRUE)
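To see which rows would produce the "Expected 6 pieces" warning before calling separate, one option is to count the newline-separated pieces per row. The sketch below uses base R only and a made-up stand-in for the scraped column (the values in `temp` are invented for illustration, not real Yahoo output); the first row imitates a header leftover with a single piece.

```r
# Toy stand-in for the scraped table; all values are invented for illustration
temp <- data.frame(
  one = c("", "a", "b"),
  two = c("Offense",
          "No new player Notes\nPlayer A\nTeam - QB\n\n\nSun 1:00 pm @ NYJ",
          "New Player Note\nPlayer B\nTeam - RB\n\n\nSun 4:25 pm vs DAL"),
  stringsAsFactors = FALSE
)

# Count how many pieces each cell yields when split on newlines
pieces <- lengths(strsplit(temp$two, "\n", fixed = TRUE))

# Keep only rows that split into the expected six pieces
clean <- temp[pieces >= 6, ]
```

Filtering on the piece count is a little more general than dropping row 1 by position, since it also catches any other malformed rows, but it assumes every valid row has exactly six pieces.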
I am trying to scrape multiple webpages by using the list of URLs (a csv file)
This is my dataset: https://www.mediafire.com/file/9qh516tdcto7is7/nyp_data.csv/file
The url column includes all the links that I am trying to use and scrape.
I tried to use a for() loop:
news_urls <- read_csv("nyp_data.csv")
library(rvest)
content_list <- vector()
for (i in 1:nrow(news_urls)) {
nyp_url <- news_urls[i, 'url']
nyp_html <- read_html(nyp_url)
nyp_nodes <- nyp_html %>%
html_elements(".single__content")
tag_name = ".single__content"
nyp_texts <- nyp_html %>%
html_elements(tag_name) %>%
html_text()
{ content_list[i] <- nyp_texts[1]
}}
However, I am getting an error that says:
Error in UseMethod("read_xml") : no applicable method for
'read_xml' applied to an object of class "c('tbl_df', 'tbl',
'data.frame')"
I believe the links work; they aren't broken, and I can access each one by clicking it.
If a for loop isn't what I should be using here, do you have any other idea for scraping the content?
I also tried:
urls <- news_urls[,5] #identify the column with the urls
url_xml <- try(apply(urls, 1, read_html)) #apply the function read_html() to the `url` vector
textScraper <- function(x) {
html_text(html_nodes (x, ".single__content")) %>% #in this data, my text is in a node called ".single__content"
str_replace_all("\n", "") %>%
str_replace_all("\t", "") %>%
paste(collapse = '')
}
article_text <- lapply(url_xml, textScraper)
article_text[1]
but it kept giving me an error:
Error in open.connection(x, "rb") : HTTP error 404
The error occurs in this line:
nyp_html <- read_html(nyp_url)
The error message tells you that the argument passed to read_xml (which read_html calls internally) is a data.frame (among other classes, as it is actually a tibble).
This is because in this line:
nyp_url <- news_urls[i, 'url']
you are using single brackets to subset your data. Single brackets on a tibble always return another tibble (a data frame), even when the result is a single cell. You can avoid this by using double brackets like this:
nyp_url <- news_urls[[i, 'url']]
or this (which I usually find more readable):
nyp_url <- news_urls[i, ]$url
Either should fix your problem.
If you want to read more about using these notations you could look at this answer.
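The difference between the three forms can be checked without touching the network. This is a minimal sketch on a toy data frame with placeholder example.com URLs (both assumptions). Because a base data.frame drops a single cell to a vector by default, `drop = FALSE` is used here to imitate what a tibble returns for `[i, "url"]`.

```r
# Toy data; the URLs are placeholders, not real article links
news_urls <- data.frame(
  id  = 1:2,
  url = c("https://example.com/a", "https://example.com/b"),
  stringsAsFactors = FALSE
)
i <- 1

# Single brackets keeping the data frame shape (what a tibble always does)
as_table <- news_urls[i, "url", drop = FALSE]

# Double brackets extract the bare character value
as_string <- news_urls[[i, "url"]]

# Row subset followed by $ gives the same bare value
also_string <- news_urls[i, ]$url
```

Only the last two results are plain character strings, which is what read_html expects as a URL.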
Hello to all professionals out here,
I have created a csv which consists of cities and the corresponding Tripadvisor URLs. If I search for a specific link in my list, for example the one for Munich, the subset function returns the URL. I then try to read this URL, which is stored in search_url, using read_html. Unfortunately without success.
The relevant part of my code is the following.
search_url <- subset(data, city %in% "München", select = url)
pages <- read_html(search_url)
pages <- pages %>%
html_nodes("._15_ydu6b") %>%
html_attr('href')
When I run search_url I get the following output:
https://www.tripadvisor.de/Restaurants-g187323-Berlin.html
But when I use the above code and want to execute read_html, the following error occurs:
Error in UseMethod("read_xml") :
no applicable method for 'read_xml' applied to an object of class "data.frame"
I have now spent several hours on it, but unfortunately I have not received a suitable tip anywhere. It would be wonderful if you could help me here.
That's because the result of subset() is a data frame here, although the real result is simply one string. Check this simple example with mtcars:
# this will be data.frame although the result is one numeric value 21.4
class(subset(mtcars, disp == 258, select = mpg))
# [1] "data.frame"
So you probably can use
pages <- read_html(as.character(search_url))
if you are sure that your subset returns only 1 character value, otherwise
pages <- read_html(search_url[1, 1])
should work as well for the first result of your subset.
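As a small illustration of the same idea, the sketch below builds a toy city/URL table (the cities and example.com URLs are placeholders, not the asker's data) and pulls the matching URL out of the one-cell data frame with `[[`, which yields a plain character vector whether or not the column was read in as a factor.

```r
# Toy lookup table; values are placeholders for illustration only
data <- data.frame(
  city = c("Berlin", "Hamburg"),
  url  = c("https://example.com/berlin", "https://example.com/hamburg"),
  stringsAsFactors = FALSE
)

# subset() keeps the data frame structure even for a single cell
search_df <- subset(data, city %in% "Berlin", select = url)

# Extract the column as a character vector before handing it to read_html()
search_url <- as.character(search_df[[1]])
```

If the subset can match more than one city, `search_url` will have several elements, so it may be worth checking `length(search_url) == 1` before reading the page.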
At the end of a script, I save many results to a one-line vector and append it to a csv file. One of the results is a single-cell string containing the predicted values collapsed with semicolons (";"). This started to cause trouble when some of the predicted values are negative (I think): when that happens, the cell simply fails to include all the predicted values. It seems to happen more often when the first value is negative.
With this sample data, it happens to me every time around the 15th or 16th value.
#create a blank csv
write.csv(x= data.frame(year=NA, location=NA, side=NA, bias=NA, test_predicted=NA, observed=NA, adj_r2=NA)[-1,], file="Results/my_save_file.csv", row.names = FALSE)
#do this more than a thousand times
for(i in c(5:1005)){
#dummy data
set.seed(45+i)
test_predic_data <- data.frame(testset=c(1:4), observed=rnorm(mean=8,sd=2,n=80), test_predicted=rnorm(mean=8, sd=10,n=80))
year<-(2016 + i)
location <- "outdoors"
side<-"left"
bias<-0.00658
adj_r2<-0.21
#flip the sign of the first predicted value
test_predic_data[1,3] <- test_predic_data[1,3]*-1
#compile results
result_line<-paste(year,
location,
side,
bias,
paste0(test_predic_data$test_predicted, collapse=";"),
paste0(test_predic_data$observed, collapse=";"),
adj_r2,
sep=",")
#then I save the result line (to my already created csv) by appending it to the bottom:
write(result_line,file="Results/my_save_file.csv",append=TRUE)
}
UPDATE: With my real dataset I can check the error by opening the csv in Excel (or with the R code below) and converting the cell from text to data; for some weird reason it only has some of the predicted values. Two days ago this sample data was throwing the error, and there were no NAs in the sample data. However, today I am not getting the error with the sample data. Maybe I don't know how to recreate the problem. I am running R 3.4.4 on Windows 10.
Checking issue in R after writing the line several times...
#read in file
my_save_file_df <- read.csv(file="Results/my_save_file.csv")
library(tidyverse)
#split the results
split_results <-my_save_file_df %>%
select(., year, observed, test_predicted) %>%
mutate(observed = strsplit(as.character(observed), ";")) %>%
mutate(test_predicted = strsplit(as.character(test_predicted), ";")) %>%
unnest(.)
This error message pops up when there is a problem
Error: All nested columns must have the same number of elements.
Call `rlang::last_error()` to see a backtrace
Is there a character limit to the number of items paste0() can work with? Am I running out of memory, do vectors or write.table() have a limit to collapsed items/characters in a single cell?
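The unnest error above means that, for at least one row, the observed and test_predicted cells split into different numbers of values. One way to locate those rows, sketched here on invented toy data rather than the real results file, is to count the semicolon-separated pieces in each cell and compare the counts.

```r
# Toy stand-in for the saved results; the second row is deliberately one value short
saved <- data.frame(
  year           = c(2021, 2022),
  observed       = c("1.5;2.5;3.5", "4;5;6"),
  test_predicted = c("-0.2;1.1;0.7", "-3.4;2.2"),
  stringsAsFactors = FALSE
)

# Number of values stored in each collapsed cell
n_obs  <- lengths(strsplit(saved$observed, ";", fixed = TRUE))
n_pred <- lengths(strsplit(saved$test_predicted, ";", fixed = TRUE))

# Years whose two cells disagree in length
mismatch <- saved$year[n_obs != n_pred]
```

Running the same counts on the string produced by paste0 before it is written, and again after reading the file back, would show whether values go missing at write time or at read time.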
I have multiple text files that I'm trying to merge together into one dataframe.
Within each file I'm attempting to skip the first 10 rows, as well as the first column (there are 15 columns total, including the first one I'm trying to skip)
Here's code I'm currently using based on different pieces found online and on stack overflow:
for (x in list.files(pattern="*.txt", recursive=TRUE))
{
all_content <- readLines(x)
skip = all_content[-c(1:10)]
input <- read.table(textConnection(skip),
header = FALSE,
colClasses = c(rep("NULL", 1),
rep(NA, 14)),
sep="\t", stringsAsFactors = FALSE)
df <- rbind(df, input)
}
However I'm getting the "Error in rep(xi, length.out = nvar) :
attempt to replicate an object of type 'closure'" error and I can't seem to figure out what's causing it. The code was working the last time I tried it...not sure if I accidentally changed something.
Thanks all.
The error arises because df is never created before the loop. On the first pass, rbind(df, input) therefore finds R's built-in function df (the F-distribution density from the stats package), and a function is an object of type 'closure', which cannot be replicated into rows.
Let me know what happens when you add this before your for loop.
df <- NULL
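An alternative that avoids growing df inside the loop is to read each file into a list and bind once at the end. The sketch below is self-contained: it uses two in-memory character vectors in place of real files (an assumption for illustration), skips one header line instead of ten, and drops the first of three columns instead of the first of fifteen.

```r
# In a fresh session the name `df` refers to stats::df, which is a function
df_is_function <- is.function(stats::df)

# Toy stand-ins for the lines of two tab-separated files
raw_files <- list(
  c("skip me", "x\t1\t2", "y\t3\t4"),
  c("skip me", "z\t5\t6")
)

# Read each "file": drop the header line, discard column 1 via colClasses "NULL"
pieces <- lapply(raw_files, function(lines) {
  read.table(text = lines[-1], header = FALSE, sep = "\t",
             colClasses = c("NULL", NA, NA), stringsAsFactors = FALSE)
})

# Bind all pieces into one data frame in a single step
combined <- do.call(rbind, pieces)
```

With real files, the list would come from `lapply(list.files(...), ...)` and the skipped lines and column counts would match the original code.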
I am trying to scrape player data from the Baseball Reference website, using a function to loop through multiple years (variable "year") for each player notated by "playerid."
library(plyr)
library(XML)
fetch_stats <- function(playerid, year) {
url <- paste0("http://www.baseball-reference.com/players/gl.cgi?id=",playerid,"&t=b&year=",year)
data <- readHTMLTable(url, stringsAsFactors = FALSE)
data <- data[[3]]
data$Year <- year
data$PlayerId <- playerid
data
}
This function works perfectly well when it is applied to a single year's worth of data, as seen here:
AdrianGonzales <- ldply("gonzaad01", fetch_stats, year= 2008, .progress="text")
However, as soon as I actually use the function to loop through the multiple years in a player's career, it always spits out the following error:
AdrianGonzales <- ldply("gonzaad01", fetch_stats, year= 2009:2004, .progress="text")
Error in data[[3]] : subscript out of bounds
In addition: Warning message:
XML content does not seem to be XML: 'http://www.baseball-reference.com/players/gl.cgi?id=gonzaad01&t=b&year=2009
http://www.baseball-reference.com/players/gl.cgi?id=gonzaad01&t=b&year=2008
http://www.baseball-reference.com/players/gl.cgi?id=gonzaad01&t=b&year=2007
http://www.baseball-reference.com/players/gl.cgi?id=gonzaad01&t=b&year=2006
http://www.baseball-reference.com/players/gl.cgi?id=gonzaad01&t=b&year=2005
http://www.baseball-reference.com/players/gl.cgi?id=gonzaad01&t=b&year=2004'
From what I have been able to find, the "subscript out of bounds" error happens when you exceed the limits of a defined dataset within R. For this particular function, I may just be dumb, but I don't see how that would apply in this case- or why it would work for a single year, but not for several at a time.
I'm open to any and all suggestions. Thanks ahead of time.
You could just use lapply as in the following way below. I put in a minor fix to fetch_stats as it seems that the 6th column returned has no name. You can do what you like with it, as it is just to show how you can use lapply instead.
library(plyr)
library(XML)
# Minor change made to get function working (naming column 6)
fetch_stats <- function(playerid, year) {
url <- paste0("http://www.baseball-reference.com/players/gl.cgi?id=",playerid,"&t=b&year=",year)
data <- readHTMLTable(url, stringsAsFactors = FALSE)
data <- data[[3]]
data$Year <- year
data$PlayerId <- playerid
### Column six name is empty.
names(data)[6] <- 'EMPTY'
data
}
res <- lapply(2009:2004, function(x) fetch_stats("gonzaad01", x))
resdf <- ldply(res)
This will create a list of 6 elements, one for each year, then convert the list to a data.frame
The way ldply is applied in your code, it does not give the function one year at a time; it gives it the entire vector of years all at once.
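The effect of passing the whole vector can be seen without hitting the website. In this sketch `fake_fetch` is a hypothetical stand-in for fetch_stats that returns a one-row data frame instead of scraping; it only illustrates how paste0 vectorises over the years and how iterating over them one at a time avoids that.

```r
# paste0 is vectorised: six years produce six URLs in one character vector,
# which is what readHTMLTable received in the failing call
urls <- paste0("http://www.baseball-reference.com/players/gl.cgi?id=",
               "gonzaad01", "&t=b&year=", 2009:2004)

# Hypothetical stand-in for fetch_stats(); no scraping is performed
fake_fetch <- function(playerid, year) {
  stopifnot(length(year) == 1)  # one year per call, as the real function assumes
  data.frame(PlayerId = playerid, Year = year, stringsAsFactors = FALSE)
}

# Iterate over the years so each call sees a single value, then bind the rows
res   <- lapply(2009:2004, function(y) fake_fetch("gonzaad01", y))
resdf <- do.call(rbind, res)
```

Here `do.call(rbind, res)` plays the role of `ldply(res)` so the sketch runs with base R alone.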
EDIT
After looking a little closer, here is a solution using ldply
new_res <- ldply(.data = 2009:2004,
.fun = function(x) fetch_stats("gonzaad01", x),
.progress="text")
This gave me the same results as the other method above.