I am currently experimenting with web scraping my own Stack Overflow profile (while logged out) using rvest. To find the CSS tags I use the SelectorGadget extension for Google Chrome. To start, I would like to extract the numbers and headers under the Stats section of my profile, which the extension highlights in green and yellow in the picture below:
This gives me the following CSS selectors: .md\:fl-auto and .fc-dark. The .fc-dark selector is for the numbers and .md\:fl-auto for the headers (reputation, reached, etc.). Extracting the numbers works, but when extracting the headers I get the following error: Error: '\:' is an unrecognized escape in character string starting "".md\:". Is it possible to extract this selector and save both outputs in a data frame? Here is a reproducible example:
library(rvest)
library(dplyr)
link <- "https://stackoverflow.com/users/14282714/quinten"
profile <- read_html(link)
numbers <- profile %>% html_nodes(".fc-dark") %>% html_text()
numbers
[1] "12,688" "49k" "847" "9"
headers <- profile %>% html_nodes(".md\:fl-auto") %>% html_text()
Error: '\:' is an unrecognized escape in character string starting "".md\:"
I am open to better options for web scraping my StackOverflow profile!
library(rvest)
library(dplyr)
library(stringr)
profile %>%
  html_nodes(".md\\:fl-auto") %>%
  html_text() %>%
  stringr::str_squish() %>%
  as_tibble() %>%
  tidyr::separate(value, into = c("number", "header"), sep = "\\s") %>%
  mutate(number = stringr::str_remove(number, "\\,") %>%
           sub("k", "000", ., fixed = TRUE) %>%
           as.numeric())
Output:
# A tibble: 4 x 2
number header
<dbl> <chr>
1 12688 reputation
2 49000 reached
3 847 answers
4 10 questions
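If you prefer to avoid the escaping issue altogether, a minimal sketch is to match the class with XPath instead of a CSS selector; note that contains() may over-match if another class contains the same substring, and this assumes the class name on the live page is still md:fl-auto.
# select the md:fl-auto nodes by XPath so the colon needs no escaping
profile %>%
  html_nodes(xpath = "//*[contains(@class, 'md:fl-auto')]") %>%
  html_text() %>%
  stringr::str_squish()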
I am trying to scrape a website with a script in R in order to gather details for pictures.
What I need is:
Image name (1.jpg)
Image caption ("A recruit demonstrates the proper use of a CO2 portable extinguisher to put out a small outside fire.")
Photo credit ("Photo courtesy of: James Fortner")
There are over 16,000 files, and thankfully the web URL goes "...asp?photo=1, 2, 3, 4", so there is a base URL which doesn't change, just the last section with the image number. I would like the script either to loop over a set range (I tell it where to start) or to simply stop when it reaches a page which doesn't exist.
Using the code below, I can get the caption of the photo, but only one line. I would like to get the photo credit, which is on a separate line; there are three line breaks between the main caption and the photo credit. I'd be fine if the table which is generated had two or three blank columns to account for the lines, as I can delete them later.
library(rvest)
library(dplyr)
link = "http://fallschurchvfd.org/photovideo.asp?photo=1"
page = read_html(link)
caption = page %>% html_nodes(".text7 i") %>% html_text()
info = data.frame(caption, stringsAsFactors = FALSE)
write.csv(info, "photos.csv")
Scraping with rvest and tidyverse
library(tidyverse)
library(rvest)
get_picture <- function(page) {
  cat("Scraping page", page, "\n")
  page <- str_c("http://fallschurchvfd.org/photovideo.asp?photo=", page) %>%
    read_html()
  tibble(
    image_name = page %>%
      html_element(".text7 img") %>%
      html_attr("src"),
    caption = page %>%
      html_element(".text7") %>%
      html_text() %>%
      str_split(pattern = "\r\n\t\t\t\t") %>%
      unlist() %>%
      nth(1),
    credit = page %>%
      html_element(".text7") %>%
      html_text() %>%
      str_split(pattern = "\r\n\t\t\t\t") %>%
      unlist() %>%
      nth(3)
  )
}
# Get the first 50 pages
df <- map_dfr(1:50, possibly(get_picture, otherwise = tibble()))
# A tibble: 42 × 3
image_name caption credit
<chr> <chr> <chr>
1 /photos/1.jpg Recruit Clay Hamric demonstrates the use… James…
2 /photos/2.jpg A recruit demonstrates the proper use of… James…
3 /photos/3.jpg Recruit Paul Melnick demonstrates the pr… James…
4 /photos/4.jpg Rescue 104 James…
5 /photos/5.jpg Rescue 104 James…
6 /photos/6.jpg Rescue 104 James…
7 /photos/15.jpg Truck 106 operates a ladder pipe from Wi… Jim O…
8 /photos/16.jpg Truck 106 operates a ladder pipe as heav… Jim O…
9 /photos/17.jpg Heavy fire vents from the roof area of t… Jim O…
10 /photos/18.jpg Arlington County Fire and Rescue Associa… James…
# … with 32 more rows
# ℹ Use `print(n = ...)` to see more rows
For the images, you can use the command-line tool curl. For example, to download images 1.jpg through 100.jpg:
curl -O "http://fallschurchvfd.org/photos/[0-100].jpg"
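If you would rather stay in R than shell out to curl, a rough equivalent (assuming the /photos/<number>.jpg pattern holds for every id) is:
# download 1.jpg through 100.jpg into a local photos/ folder; missing ids are skipped
dir.create("photos", showWarnings = FALSE)
for (i in 1:100) {
  url  <- paste0("http://fallschurchvfd.org/photos/", i, ".jpg")
  dest <- file.path("photos", paste0(i, ".jpg"))
  tryCatch(
    download.file(url, dest, mode = "wb", quiet = TRUE),
    error = function(e) message("No image found for id ", i)
  )
}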
For the R code, if you grab the whole .text7 section, you can then split it into caption and photo credit:
extractedtext <- page %>% html_nodes(".text7") %>% html_text()
caption <- str_split(extractedtext, "\r\n\t\t\t\t")[[1]][1]
credit <- str_split(extractedtext, "\r\n\t\t\t\t")[[1]][3]
As a loop
library(rvest)
library(tidyverse)
df <- data.frame(id = 1:20,
                 image = NA,
                 caption = NA,
                 credit = NA)
for (i in 1:20){
  cat(i, " ") # to monitor progress and debug
  link <- paste0("http://fallschurchvfd.org/photovideo.asp?photo=", i)
  tryCatch({ # this avoids stopping on an error for missing pages
    page <- read_html(link)
    df$image[i]   <- page %>% html_nodes(".text7 img") %>% html_attr("src")
    extractedtext <- page %>% html_nodes(".text7") %>% html_text()
    df$caption[i] <- str_split(extractedtext, "\r\n\t\t\t\t")[[1]][1] # first element of the first (and only) list item
    df$credit[i]  <- str_split(extractedtext, "\r\n\t\t\t\t")[[1]][3]
  },
  error = function(e){cat("ERROR:", conditionMessage(e), "\n")})
}
I get inconsistent results with this current code; for example, page 15 has more line breaks than page 1.
TODO: enhance string extraction; switch to an 'append' method of adding data to a data.frame (vs pre-allocate and insert).
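One hedged way to address the "enhance string extraction" item is to split the .text7 text on any line break, drop the blank pieces, and take the first piece as the caption and the last as the credit. This assumes the credit is always the last non-empty line, which may not hold on every page.
# inside the loop, a more forgiving split than the fixed "\r\n\t\t\t\t" pattern
parts <- extractedtext %>%
  str_split("\r?\n") %>%
  unlist() %>%
  str_squish() %>%
  purrr::discard(~ .x == "")

df$caption[i] <- parts[1]
df$credit[i]  <- parts[length(parts)]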
I am trying to check periodically for the date of the latest downloadable file added to the page https://github.com/mrc-ide/global-lmic-reports/tree/master/data, where the file names look like 2021-05-22_v8.csv.zip.
There is a code snippet in "Using R to scrape the link address of a downloadable file from a web page?" that, with a tweak, identifies the date of the first (earliest) downloadable file on the page, shown below.
library(rvest)
library(stringr)
library(xml2)
page <- read_html("https://github.com/mrc-ide/global-lmic-reports/tree/master/data")
page %>%
  html_nodes("a") %>%           # find all links
  html_attr("href") %>%         # get the url
  str_subset("\\.csv.zip") %>%  # keep those that end in .csv.zip
  .[[1]]                        # look at the first one
Returns:
[1] "/mrc-ide/global-lmic-reports/blob/master/data/2020-04-28_v1.csv.zip"
The question is: what would be the code to identify the date of the latest .csv.zip file, e.g. 2021-05-22_v8.csv.zip as of 2021-06-01?
The purpose is that if that date (i.e. 2021-05-22) is later than the latest update I have created in https://github.com/pourmalek/covir2 (e.g. IMPE 20210522 in https://github.com/pourmalek/covir2/tree/main/20210528), then a new update needs to be created.
You can convert the links to dates and use which.max to get the latest one.
library(rvest)
library(stringr)
library(xml2)
page <- read_html("https://github.com/mrc-ide/global-lmic-reports/tree/master/data")
page %>%
  html_nodes("a") %>%              # find all links
  html_attr("href") %>%            # get the url
  str_subset("\\.csv.zip") -> tmp  # keep those that end in .csv.zip

tmp[tmp %>%
      basename() %>%
      substr(1, 10) %>%
      as.Date() %>%
      which.max()]
#[1] "/mrc-ide/global-lmic-reports/blob/master/data/2021-05-22_v8.csv.zip"
To get just the latest date you can use:
tmp %>%
  basename() %>%
  substr(1, 10) %>%
  as.Date() %>%
  max()
#[1] "2021-05-22"
I am trying to apply a function that extracts a table from a list of scraped links. I am at the final stage, where I am applying the get_injury_data function to the links, and I have been having issues executing it successfully. I get the following error:
Error in matrix(unlist(values), ncol = width, byrow = TRUE) :
'data' must be of a vector type, was 'NULL'
I wonder if anyone can help me spot where I am going wrong. The code is as follows:
library(tidyverse)
library(rvest)
# create a function to grab the team links
get_team_links <- function(url){
  url %>%
    read_html() %>%
    html_nodes('td.hauptlink a') %>%
    html_attr('href') %>%
    .[. != '#'] %>%                                     # remove rows with the # string
    paste0('https://www.transfermarkt.com', .) %>%      # paste the site root onto the relative urls
    unique() %>%                                        # keep only unique links
    as_tibble() %>%                                     # turn the strings into a tibble
    rename("links" = "value") %>%                       # rename the value column
    filter(!grepl('profil', links)) %>%                 # remove player profile links
    filter(!grepl('spielplan', links)) %>%              # remove links to additional team pages
    mutate(links = gsub("startseite", "kader", links))  # change link to go to the detailed squad page
}
# create a function to grab the player links
get_player_links <- function(url){
  url %>%
    read_html() %>%
    html_nodes('td.hauptlink a') %>%
    html_attr('href') %>%
    .[. != '#'] %>%                                      # remove rows with the # string
    paste0('https://www.transfermarkt.com', .) %>%       # paste the site root onto the relative urls
    unique() %>%                                         # keep only unique links
    as_tibble() %>%                                      # turn the strings into a tibble
    rename("links" = "value") %>%                        # rename the value column
    filter(grepl('profil', links)) %>%                   # keep only player profile links
    mutate(links = gsub("profil", "verletzungen", links)) # change link to go to the injury page
}
# create a function to get the injury dataset
get_injury_data <- function(url){
  url %>%
    read_html() %>%
    html_nodes('#yw1') %>%
    html_table()
}
# get team links and save it as team_links
team_links <- get_team_links('https://www.transfermarkt.com/premier-league/startseite/wettbewerb/GB1')
# get player links by mapping the function onto the team links
# and then unnest the list of lists into a long tibble
player_injury_links <- team_links %>%
  mutate(links = map(team_links$links, get_player_links)) %>%
  unnest(links)
# using the player_injury_links list, create a dataset by web scraping the player injury pages
player_injury_data <- map(player_injury_links$links, get_injury_data)
Solution
So the issue I was having was that some of the links I was scraping did not have any data.
To overcome this, I used the possibly function from the purrr package, which helped me create a new, error-free call.
The fixed version of the line that was giving me trouble is as follows:
player_injury_data <- player_injury_links$links %>%
  purrr::map(purrr::possibly(get_injury_data, otherwise = NULL, quiet = TRUE))
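Because possibly() returns NULL for the pages without data, a short follow-up (assuming you only want the non-empty results) is to drop those entries with purrr::compact():
# remove the NULL elements left behind by possibly()
player_injury_data <- purrr::compact(player_injury_data)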
I built a simple scrape to get a data frame with NFL draft results for 2020. I intend to use this code to map over several years of results, but for some reason, when I change the single-page scrape to any year other than 2020, I get the error at the bottom.
library(tidyverse)
library(rvest)
library(httr)
library(curl)
This scrape for 2020 works flawlessly, although the column names are in row 1, which isn't a big deal to me as I can deal with it later (mentioning it in case it has something to do with the problem):
x <- "https://www.pro-football-reference.com/years/2020/draft.htm"
df <- read_html(curl(x, handle = curl::new_handle("useragent" = "Mozilla/5.0"))) %>%
  html_nodes("table") %>%
  html_table() %>%
  as.data.frame()
Below, the URL is changed from 2020 to 2019, which is an active page with a table of the same format. For some reason, the same call as above does not work:
x <- "https://www.pro-football-reference.com/years/2019/draft.htm"
df <- read_html(curl(x, handle = curl::new_handle("useragent" = "Mozilla/5.0"))) %>%
  html_nodes("table") %>%
  html_table() %>%
  as.data.frame()
Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE, :
arguments imply differing number of rows: 261, 2
There are two tables at the URL provided: the core draft (table 1, id = "drafts") and the supplemental draft (table 2, id = "drafts_supp").
The as.data.frame() call fails because it is trying to combine the two tables, but they have different columns in both name and number. You can direct rvest to read just the specific table you are interested in by providing html_node() with either the XPath or the CSS selector. You can find these by inspecting the specific table you are interested in (right-click > Inspect in Chrome/Firefox). Note that for the CSS selector to use the id you'll need #drafts, not just drafts, and for XPath you typically have to wrap the expression in single quotes.
This works: html_node(xpath = '//*[@id="drafts"]')
This doesn't, because of the double quotes: html_node(xpath = "//*[@id="drafts"]")
Note that I believe the html_nodes("table") used in your example is unnecessary, as html_table() already selects only tables.
x <- "https://www.pro-football-reference.com/years/2019/draft.htm"
raw_html <- read_html(x)
# use xpath
raw_html %>%
  html_node(xpath = '//*[@id="drafts"]') %>%
  html_table()

# use selector
raw_html %>%
  html_node("#drafts") %>%
  html_table()
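To extend this to the several years mentioned in the question, a hedged sketch (assuming the /years/<year>/draft.htm pattern and the "drafts" table id hold for every season) wraps the working call in a small function:
get_draft <- function(year) {
  paste0("https://www.pro-football-reference.com/years/", year, "/draft.htm") %>%
    read_html() %>%
    html_node("#drafts") %>%
    html_table()
}

# one table per season; clean up the header row before binding the years together
drafts <- purrr::map(2018:2020, get_draft)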
I'm trying to programmatically pull all of the box scores for a given day from NBA Reference (I used January 4th, 2020, which has multiple games). I started by creating a vector of integers to denote the number of box scores to pull:
games <- c(1:3)
Then I used the developer tools in my browser to determine what each table contains (you can also use SelectorGadget):
#content > div.game_summaries > div:nth-child(1) > table.team
Then I used purrr::map to create a list of the tables to pull, using games:
map_list <- map(.x = '', paste, '#content > div.game_summaries > div:nth-child(', games, ') > table.teams',
                sep = "")
# check map_list
map_list
Then I tried to run this list through a for loop to generate three tables, using tidyverse and rvest, which delivered an error:
for (i in map_list){
  read_html('https://www.basketball-reference.com/boxscores/') %>%
    html_node(map_list[[1]][i]) %>%
    html_table() %>%
    glimpse()
}
Error in selectr::css_to_xpath(css, prefix = ".//") :
Zero length character vector found for the following argument: selector
In addition: Warning message:
In selectr::css_to_xpath(css, prefix = ".//") :
NA values were found in the 'selector' argument, they have been removed
For reference, if I explicitly write out the selector or call the exact item from map_list, the code works as intended (run the items below for reference):
read_html('https://www.basketball-reference.com/boxscores/') %>%
  html_node('#content > div.game_summaries > div:nth-child(1) > table.teams') %>%
  html_table() %>%
  glimpse()

read_html('https://www.basketball-reference.com/boxscores/') %>%
  html_node(map_list[[1]][1]) %>%
  html_table() %>%
  glimpse()
How do I make this work with a list? I have looked at other threads but even though they use the same site, they're not the same issue.
Using your current map_list, if you want to use a for loop, this is what you should use:
library(rvest)
for (i in seq_along(map_list[[1]])){
  read_html('https://www.basketball-reference.com/boxscores/') %>%
    html_node(map_list[[1]][i]) %>%
    html_table() %>%
    glimpse()
}
But I think this is simpler, as you don't need map to create map_list since paste is vectorized:
map_list <- paste0('#content > div.game_summaries > div:nth-child(', games, ') > table.teams')
url <- 'https://www.basketball-reference.com/boxscores/'
webpage <- url %>% read_html()
purrr::map(map_list, ~webpage %>% html_node(.x) %>% html_table)
#[[1]]
# X1 X2 X3
#1 Indiana 111 Final
#2 Atlanta 116
#[[2]]
# X1 X2 X3
#1 Toronto 121 Final
#2 Brooklyn 102
#[[3]]
# X1 X2 X3
#1 Boston 111 Final
#2 Chicago 104
This page is reasonably straightforward to scrape. Here is a possible solution: first scrape the game summary nodes (div with class="game_summary"). This provides a list of all of the games played. It also allows the use of the html_node function, which guarantees a return for each node and keeps the list sizes equal.
Each game summary is made up of three sub-tables; the first and third can be scraped directly. The second table does not have a class assigned, which makes it trickier to retrieve.
library(rvest)
page <- read_html('https://www.basketball-reference.com/boxscores/')

# find all of the game summaries on the page
games <- page %>% html_nodes("div.game_summary")

# Each game summary has 3 sub-tables:
#   the game score is table 1, class = "teams"
#   the stats are in table 3, class = "stats"
#   the quarterly score is the second table and does not have a class defined
table1  <- games %>% html_node("table.teams") %>% html_table()
stats   <- games %>% html_node("table.stats") %>% html_table()
quarter <- sapply(games, function(g){
  g %>% html_nodes("table") %>% .[2] %>% html_table()
})
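If a single flat table of the final scores is more convenient, a small follow-up (assuming dplyr is loaded) stacks the per-game teams tables with a game index:
# stack the list of team/score tables into one data frame with a game id column
scores <- dplyr::bind_rows(table1, .id = "game")
scores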