I am new to R and got stuck on a for loop for my web scraping project using rvest.
I am trying to extract the notes (nested within each scorecard URL) from the ESPNcricinfo website. My code is this:
library(dplyr)
library(rvest)
get_notes = function(score){
score_page = read_html(score)
score_notes = score_page %>% html_nodes(".ds-mt-3 .ds-mb-4 .ds-p-4,
.ds-mb-4~ .ds-mb-4+ .ds-mb-4 .ds-p-4,
.ds-mt-3 .ds-text-typo-title .ds-text-tight-s") %>% html_text()
}
notesdata = data.frame()
for (page_result in c(2019,2020,2021)){
link = paste0("https://stats.espncricinfo.com/ci/engine/records/team/match_results.html?class=2;id=",
page_result,";type=year")
pages = read_html(link)
scorecard = pages %>% html_nodes("td:nth-child(7) .data-link") %>% html_text()
match_url = pages %>% html_nodes("td:nth-child(7) .data-link") %>%
html_attr("href") %>%
paste("https://www.espncricinfo.com/",., sep="")
notes = sapply(match_url, FUN = get_notes, USE.NAMES = FALSE)
notesdata = rbind(notesdata,
data.frame(t(notes),
desperse.level = 0)
)
print(paste("page:", page_result))
}
When I run this code, I get the following error message:
Error in rbind(deparse.level, ...) :
numbers of columns of arguments do not match
Can someone help me create a data frame (or anything that I can turn into csv file)? Thanks a lot!
It is not easy to run your "reproducible" example because the get_notes function returns very large matrices. However, I think your problem is related to this line:
notes = sapply(match_url, FUN = get_notes, USE.NAMES = FALSE)
That line returns a matrix of dimensions 3 x L, where L is the length of match_url.
The other objects (team_one, team_two, Match_dates, ground and scorecard) are of length L.
When you run data.frame(team_one, team_two, Match_dates, ground, scorecard, notes), the data.frame function expects vectors of the same length (L) or matrices with as many rows as the vectors have elements (L x #).
So, my suggestion would be to change line 32 of your script from this:
notesdata = rbind(notesdata,
data.frame(team_one,
team_two,
Match_dates,
ground,
scorecard,
notes))
To this:
notesdata = rbind(notesdata,
data.frame(team_one,
team_two,
Match_dates,
ground,
scorecard,
t(notes)),
deparse.level = 0)
Again, I couldn't run your script because of the output of get_notes, so I don't know whether this will solve your issue.
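To illustrate the shape problem described above, here is a minimal sketch with made-up data. get_notes_stub, the URLs and the scorecard labels are hypothetical stand-ins for the real scraper output; only the dimensions matter.

```r
# Hypothetical stand-in for get_notes(): always returns three notes per match
get_notes_stub <- function(url) c("toss note", "series note", "match days note")

match_url <- c("url1", "url2", "url3", "url4")                  # L = 4 made-up URLs
scorecard <- c("Test # 1", "Test # 2", "Test # 3", "Test # 4")  # made-up labels

# sapply() simplifies equal-length results into a matrix with one column per URL
notes <- sapply(match_url, FUN = get_notes_stub, USE.NAMES = FALSE)
dim(notes)      # 3 rows, 4 columns

# Transposing gives one row per match, which lines up with the length-4 vector
page_df <- data.frame(scorecard, t(notes))
dim(page_df)    # 4 rows, 4 columns
```

If the real get_notes returns a different number of notes for some matches, sapply no longer produces a matrix, so it may be worth checking the lengths of the results before transposing.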
I'm aiming to get a list of all files in a Google Drive folder, as well as the associated metadata for those files. When I use drive_ls, it returns 3 columns {name, id, drive_resource}. drive_resource is structured like this: list(kind = "drive#file", id = "abc",...). However, some of the list is not enclosed in quotation marks, and commas also occasionally appear where they are not separators.
Any ideas how I might properly unlist this? I can't find anywhere in the package that can handle this.
Using the package 'googledrive', I can get a list of all the files
a <- drive_ls(path = "abc", recursive = TRUE)
The attempt below gets close, but it fails to get the column names and also splits some values in the wrong place when the string itself contains a comma.
a$drive_resource <- vapply(a$drive_resource, paste, collapse = ",", character(1L))
abcd <- a%>% separate(drive_resource, sep = ",", into = c("1","2","3","4","5","6","7","8","9","10","11","12","13","14","15","16","17","18","19","20","21","22","23","24","25","26","27","28","29","30") )
You can try the following approach. It's an example with only four elements of the list (the selected names are specified in the function). The function maps the list contained in each row to a tibble, so you can then unnest it.
require(googledrive)
require(dplyr)
f <- function(l){
l[c("version","webContentLink","viewedByMeTime","mimeType")] %>% as_tibble()
}
dr_content <- drive_ls(path = "<path>", recursive = TRUE)
dr_content <- dr_content %>% mutate(drive_resource = purrr::map(drive_resource, f))
dr_content <- dr_content %>% tidyr::unnest(drive_resource)
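The same map-then-unnest idea can be tried without a Google Drive connection. The sketch below builds a small tibble by hand that only imitates the shape of drive_ls() output; the file names and field values are invented, and pick_fields is a hypothetical helper name.

```r
library(dplyr)
library(purrr)
library(tidyr)

# Hand-made imitation of drive_ls() output: one list of metadata per file
files <- tibble(
  name = c("a.txt", "b, with comma.csv"),
  drive_resource = list(
    list(kind = "drive#file", mimeType = "text/plain", version = "3"),
    list(kind = "drive#file", mimeType = "text/csv", version = "7")
  )
)

# Keep only the fields of interest and turn each list into a one-row tibble
pick_fields <- function(l) as_tibble(l[c("mimeType", "version")])

flat <- files %>%
  mutate(drive_resource = map(drive_resource, pick_fields)) %>%
  unnest(drive_resource)

flat   # columns: name, mimeType, version
```

Because the values are never pasted into one string, commas inside a value (as in the second file name) cannot cause a wrong split.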
I've got a simple Map function that scrapes text files from a blog site. It's pretty easy to get a scraper that gets all of the text files and downloads them to my working directory. My goal: use an ifelse() or a plain if statement to only scrape a file based on a certain date.
E.g., if four files were posted on 1/31/19, and I pointed my ifelse at that date, the function would return those four files. Code:
library(tidyverse)
library(rvest)
# URL set up
url <- "https://www.example-blog/posts.aspx"
page <- html_session(url, config(ssl_verifypeer = FALSE))
# Picking elements
links <- page %>%
html_nodes("td") %>%
html_nodes("a") %>%
html_attr("href")
# Getting date elements
dates <- page %>%
html_nodes("node.dates") %>%
html_text()
dates <- parse_date_time(dates, "%m/%d/%Y", tz = "EST",
locale = Sys.getlocale("LC_TIME"))
# Function
out <- Map(function(ln) {
fun1 <- html_session(URLencode(
paste0("https://www.example-blog", ln)),
config(ssl_verifypeer = FALSE))
write <- writeBin(fun1$response$content)
ifelse(dates == '2019-01-31', write, "He's dead, Jim")
}, links)
I've tried various ways to get that if statement in there, and also moving the writeBin around. (Usually the writeBin would not be vectorized - I did it for easy viewing in my ifelse). Error:
Error in ans[test & ok] <- rep(yes, length.out = length(ans))[test & ok] :
replacement has length zero
If I leave out the if code, everything works great, it just returns many text files, when I only want the ones from the specified date.
Based on the description, it seems you want to check the corresponding element of 'dates' for each element of 'links' and then apply the if/else. If that is the case, we can pass two arguments to Map, so the function receives one link and its matching date at a time:
Map(function(ln, y) {
fun1 <- html_session(URLencode(
paste0("https://www.example-blog", ln)),
config(ssl_verifypeer = FALSE))
write <- writeBin(fun1$response$content)
if(y == '2019-01-31') {
write
} else "He's dead, Jim"
},
links, dates)
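A small offline sketch of the two-argument Map pattern, under the assumption that links and dates have the same length and order. The links, the dates (kept as plain character strings for simplicity) and fake_download are all made up for illustration.

```r
links <- c("/files/a.txt", "/files/b.txt", "/files/c.txt")
dates <- c("2019-01-31", "2019-02-01", "2019-01-31")
target <- "2019-01-31"

# Hypothetical stand-in for the download step
fake_download <- function(ln) paste("downloaded", ln)

# Map() walks over both vectors in parallel: ln and d always belong together,
# so a scalar if () is enough and the vectorised ifelse() is not needed
out <- Map(function(ln, d) {
  if (d == target) fake_download(ln) else NA_character_
}, links, dates)

# Alternative: subset the links first, so nothing unwanted is ever requested
wanted <- links[dates == target]
```

Filtering before the loop, as in the last line, avoids fetching pages that would be thrown away anyway.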
I'm new-ish to R and am having some trouble iterating through values.
For context: I have data on 60 people over time, and each person has his/her own dataset in a folder (I received the data with id #s 00:59). For each person, there are 2 values I need - time of response and picture response given (a number 1 - 16). I need to convert this data from wide to long format for each person, and then eventually append all of the datasets together.
My problem is that I'm having trouble writing a loop that will do this for each person (i.e. each dataset). Here's the code I have so far:
pam[x] <- fromJSON(file = "PAM_u[x].json")
pam[x]df <- as.data.frame(pam[x])
#Creating long dataframe for times
pam[x]_long_times <- gather(
select(pam[x]df, starts_with("resp")),
key = "time",
value = "resp_times"
)
#Creating long dataframe for pic_nums (affect response)
pam[x]_long_pics <- gather(
select(pam[x]df, starts_with("pic")),
key = "picture",
value = "pic_num"
)
#Combining the two long dataframes so that I have one df per person
pam[x]_long_fin <- bind_cols(pam[x]_long_times, pam[x]_long_pics) %>%
select(resp_times, pic_num) %>%
add_column(id = [x], .before = 1)
If you replace [x] in the above code with a person's id# (e.g. 00), the code will run and will give me the dataframe I want for that person. Any advice on how to do this so I can get all 60 people done?
Thanks!
EDIT
So, using library(jsonlite) rather than library(rjson) set up the files in the format I needed without having to do all of the manipulation. Thanks all for the responses, but the solution was apparently much easier than I'd thought.
I don't know the structure of your JSON files. If your working directory is not the folder that contains the JSON files, try this:
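library(jsonlite)
# setup - read files
json_folder <- "U:/test/" # adjust your folder here
# full.names = TRUE returns paths that work from any working directory
files <- list.files(path = json_folder, pattern = "\\.json$", full.names = TRUE)
# import data
pam <- vector("list", length(files))
pam_df <- vector("list", length(files))
for (i in seq_along(files)) {
# jsonlite::fromJSON takes the path as its first argument (it has no `file` argument)
pam[[i]] <- fromJSON(files[i])
pam_df[[i]] <- as.data.frame(pam[[i]])
}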
Here you list all JSON files in the folder, which gives a vector of length 60.
Then you loop along that vector and read each file.
I assume that at the end you can use bind_rows, or add your code inside the for loop. But remember to set the data frames to NULL before the loop starts, e.g. pam_long_pics <- NULL
Hope that helped? Let me know.
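For a version that can be run without the real data, here is a sketch that first writes two tiny JSON files to a temporary folder and then reads, reshapes and stacks them. The file names follow the PAM_u[x].json pattern from the question, but the field names (resp1, pic1, ...) and values are assumptions about what the real files contain.

```r
library(jsonlite)

# Create two made-up input files in a temporary folder
json_dir <- file.path(tempdir(), "pam_demo")
dir.create(json_dir, showWarnings = FALSE)
write_json(list(resp1 = 1.2, resp2 = 3.4, pic1 = 5, pic2 = 9),
           file.path(json_dir, "PAM_u00.json"), auto_unbox = TRUE)
write_json(list(resp1 = 0.8, resp2 = 2.1, pic1 = 16, pic2 = 2),
           file.path(json_dir, "PAM_u01.json"), auto_unbox = TRUE)

files <- list.files(json_dir, pattern = "^PAM_u[0-9]+\\.json$", full.names = TRUE)

# One long data frame per person: id, response time, picture number
per_person <- lapply(files, function(f) {
  wide <- as.data.frame(fromJSON(f))
  id <- sub("^PAM_u([0-9]+)\\.json$", "\\1", basename(f))
  data.frame(
    id = id,
    resp_times = unlist(wide[startsWith(names(wide), "resp")], use.names = FALSE),
    pic_num = unlist(wide[startsWith(names(wide), "pic")], use.names = FALSE),
    stringsAsFactors = FALSE
  )
})

# Stack all people into one data frame
pam_long <- do.call(rbind, per_person)
```

Taking the id from the file name keeps the leading zero (00, 01, ...) as text, which a numeric loop counter would lose.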
Something along these lines could work:
library("tidyverse")
library("jsonlite")
# pattern is a regular expression, not a glob, so match the extension explicitly
file_list <- list.files(pattern = "\\.json$", full.names = TRUE)
Data_raw <- tibble(File_name = file_list) %>%
mutate(File_contents = map(File_name, fromJSON)) %>% # This should result in a nested tibble
mutate(File_contents = map(File_contents, as_tibble))
Data_raw %>%
# .x is the per-file tibble that map() hands to gather()
mutate(Long_times = map(File_contents, ~ gather(.x, key = "time", value = "resp_times", starts_with("resp"))),
Long_pics = map(File_contents, ~ gather(.x, key = "picture", value = "pic_num", starts_with("pic")))) %>%
unnest(cols = c(Long_times, Long_pics)) %>%
select(File_name, resp_times, pic_num)
EDIT: you may or may not need to include as_tibble() after reading in the JSON files, depending on what your data looks like.
This is a follow-up to a prior thread. The code works fantastically for a single value, but when I try to pass more than one value I get the following error about the length of the result:
Error in vapply(elements, encode, character(1)) :
values must be length 1,
but FUN(X[1]) result is length 3
Here is a sample of the code. In most instances I have been able just to name an object and scrape that way.
library(httr)
library(rvest)
library(dplyr)
b<-c('48127','48180','49504')
POST(
url = "http://www.nearestoutlet.com/cgi-bin/smi/findsmi.pl",
body = list(zipcode = b),
encode = "form"
) -> res
I was wondering if a loop to insert the values into the form would be the right way to go. However, my loop-writing skills are still in development and I am unsure where to place it; in addition, when I call the loop it doesn't print line by line, it just returns NULL.
#d isn't listed in the above code as it returns null
d<-for(i in 1:3){nrow(b)}
Here is an approach to send multiple POST requests
library(httr)
library(rvest)
b <- c('48127','48180','49504')
For each element in b perform a function that will send the appropriate POST request
res <- lapply(b, function(x){
res <- POST(
url = "http://www.nearestoutlet.com/cgi-bin/smi/findsmi.pl",
body = list(zipcode = x),
encode = "form"
)
res <- read_html(content(res, as="raw"))
})
Now for each element of the list res you should do the parsing steps explained by hrbrmstr: How can I Scrape a CGI-Bin with rvest and R?
library(tidyverse)
I will use hrbrmstr's code since he is king and it is already clear to you. The only thing we are doing here is performing it on each element of the res list.
res_list = lapply(res, function(x){
rows <- html_nodes(x, "table[width='300'] > tr > td")
ret <- tibble( # tibble() replaces the deprecated data_frame()
record = !is.na(html_attr(rows, "bgcolor")),
text = html_text(rows, trim=TRUE)
) %>%
mutate(record = cumsum(record)) %>%
filter(text != "") %>%
group_by(record) %>%
summarise(x = paste0(text, collapse="|")) %>%
separate(x, c("store", "address1", "city_state_zip", "phone_and_or_distance"), sep="\\|", extra="merge")
return(ret)
}
)
or using map from purrr
res %>%
map(function(x){
rows <- html_nodes(x, "table[width='300'] > tr > td")
tibble( # tibble() replaces the deprecated data_frame()
record = !is.na(html_attr(rows, "bgcolor")),
text = html_text(rows, trim=TRUE)
) %>%
mutate(record = cumsum(record)) %>%
filter(text != "") %>%
group_by(record) %>%
summarise(x = paste0(text, collapse="|")) %>%
separate(x, c("store", "address1", "city_state_zip", "phone_and_or_distance"),
sep="\\|", extra="merge") -> ret
return(ret)
}
)
If you would like this in a data frame:
res_df <- data.frame(do.call(rbind, res_list), # row-binds the list elements
b = rep(b, times = sapply(res_list, nrow))) # adds a column with the zip code each row came from
You can put the values inside the post as below,
b<-c('48127','48180','49504')
for(i in 1:length(b)) {
POST(
url = "http://www.nearestoutlet.com/cgi-bin/smi/findsmi.pl",
body = list(zipcode =b[i]),
encode = "form"
) -> res
# YOUR CODES HERE (for getting content of the page etc.)
}
But since "res" is overwritten for every zipcode value, you need to put the rest of the code inside the area I commented. Otherwise you only get the result for the last value.
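One way to keep every iteration's result is to store it in a list and combine the pieces afterwards. The sketch below avoids the network entirely: fetch_stub and fake_counts are hypothetical stand-ins for the POST request and parsing steps, returning a made-up number of stores per zip code.

```r
b <- c("48127", "48180", "49504")

# Hypothetical stand-in for POST + parsing: a data frame of fake stores per zip
fake_counts <- c("48127" = 2, "48180" = 1, "49504" = 3)
fetch_stub <- function(zip) {
  data.frame(store = paste0("store_", seq_len(fake_counts[[zip]])),
             stringsAsFactors = FALSE)
}

# Pre-allocate a list and fill one slot per zip code.
# A for loop itself evaluates to NULL, so assigning the loop to a variable
# (as in d <- for (...) ...) can never capture its results.
results <- vector("list", length(b))
for (i in seq_along(b)) {
  results[[i]] <- fetch_stub(b[i])
}

# Combine and record which zip code each row came from
combined <- do.call(rbind, results)
combined$zipcode <- rep(b, times = sapply(results, nrow))
```

The repeat counts use nrow rather than length, because length() of a data frame is its number of columns.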
data_different_tech_count <- data_different_tech %>%
group_by(tech) %>%
summarise(count(tech))
Now, this gives me a data.frame as output, but I am unable to save it to a file. When I try to change the colnames, it shows me:
colnames(data1)[c(1,2)]<- c("tech","count")
Error in colnames<-(*tmp*, value = c("tech", "count")) :
'names' attribute [2] must be the same length as the vector [1]
When I am using
colnames(data_different_count_tech)
It says that I have only one column.
When I am using the
summary(data_different_count_tech)
it shows two columns.
When I am trying to write this file to my directory it returns the following error.
write.csv(file=data_different_tech_count,"tech.csv")
Error in matrix(unlist(value, recursive = FALSE, use.names = FALSE), nrow = nr, :
length of 'dimnames' [2] not equal to array extent
Are you trying to get a count of the number of times each value of tech appears? I can't get your example to work without having a reproducible example.
If so, here are a few alternatives that will give you what you want:
Using dplyr
data_different_tech_count <- data_different_tech %>% group_by(tech) %>% summarise(count = n())
Using Base R
data_different_tech_count <- as.data.frame(table(data_different_tech$tech))
colnames(data_different_tech_count) <- c("tech","count")
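A quick way to check either alternative, and the saving step, is to run it on a few made-up rows. The tech values below are invented; the CSV is written to a temporary folder so nothing in the working directory is touched.

```r
library(dplyr)

# Made-up input with a single tech column
data_different_tech <- data.frame(
  tech = c("solar", "wind", "solar", "hydro", "solar"),
  stringsAsFactors = FALSE
)

# n() counts the rows in each group and returns a plain integer column
data_different_tech_count <- data_different_tech %>%
  group_by(tech) %>%
  summarise(count = n())

# write.csv() takes the data first and the file path second
out_file <- file.path(tempdir(), "tech.csv")
write.csv(data_different_tech_count, file = out_file, row.names = FALSE)

# Read it back to confirm there are two ordinary columns
back <- read.csv(out_file)
```

Note the argument order in write.csv: in the question the data frame was passed as file, which is why the save step failed even apart from the column problem.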