Manipulating two characters in a URL with the purrr package for scraping purposes - r

I'm having difficulties writing a scraping function with the purrr package (first time). I want to scrape multiple pages by changing two characters of the designated URL. The following code works for only one season of football player data.
library(rvest)
library(dplyr)
library(tibble)
library(purrr)

page_func <- function(page) {
  cat(".")  # progress indicator
  read_html(paste0("http://www.voetbal.com/spelerslijst/ned-eredivisie-2017-2018/nach-name/",
                   page)) %>%
    html_nodes("table") %>%
    html_table() %>%
    as.data.frame() %>%
    as.tbl() %>%
    select(Speler, Team, Geboren, Lengte, Positie) %>%
    add_column(seizoen = "2017-2018")
}
raw_seizoen_17_18 <- map_df(1:11, page_func)
Output:
# A tibble: 541 x 6
Speler Team Geboren Lengte Positie seizoen
<chr> <chr> <chr> <chr> <chr> <chr>
1 Amir Absalem FC Groningen 19.06.1997 ??? VD 2017-2018
2 Asumah Abubakar Willem II 10.05.1997 183 cm AV 2017-2018
3 Ragnar Ache Sparta Rotterdam 28.07.1998 182 cm AV 2017-2018
4 Marouane Afaker SBV Excelsior 09.05.1999 ??? AV 2017-2018
5 Gor Agbaljan Heracles Almelo 25.04.1997 183 cm MV 2017-2018
6 Thomas Agyepong NAC Breda 10.10.1995 168 cm AV 2017-2018
Now I want to scrape all seasons from 1956-1957 until 2017-2018 in one function, but I can't yet figure out how to manipulate these two variables with purrr.
page_season_func <- function(seizoen, page) {
  cat(".")  # progress indicator
  read_html(paste0("http://www.voetbal.com/spelerslijst/ned-eredivisie-",
                   seizoen,
                   "/nach-name/",
                   page)) %>%
    html_nodes("table") %>%
    html_table() %>%
    as.data.frame() %>%
    as.tbl() %>%
    select(Speler, Team, Geboren, Lengte, Positie) %>%
    add_column(year = seizoen)
}

seasons <-
  1956:2017 %>%
  paste(., . + 1, sep = "-")

res <-
  cross2(seasons, 1:11) %>%
  transpose() %>%
  pmap_df(page_season_func)

You can use map2_dfr, with the .id argument to record the year in your output:
page_span <- 1:11
year_span <- 1956:1958
years <- sort(rep(year_span, length(page_span)))
names(years) <- years # need to name years for .id to work
pages <- rep(page_span, length(year_span))
map2_dfr(years, pages, page_season_func, .id="year")
Output:
# A tibble: 6 x 6
year Speler Team Geboren Lengte Positie
<chr> <chr> <chr> <chr> <chr> <chr>
1 1956 Sjeng Adang Roda JC Kerk… 04.07.19… ??? MV
2 1956 Wim Anderiesen jr. AFC Ajax 02.09.19… ??? VD
3 1956 Wim Andriesen AFC Ajax 09.02.19… ??? MV
4 1956 Aad Bak Feyenoord 18.06.19… ??? MV
5 1956 Huub Bisschops Roda JC Kerk… 22.01.19… ??? AV
6 1956 Wim Bleijenberg AFC Ajax 05.11.19… ??? AV
A couple of changes to page_season_func():
- seizoen2 is created, which builds the y1-y2 format from the plain y1 input
- there is no need to add a year column, now that map2_dfr's .id argument handles it
page_season_func <- function(seizoen, page) {
  cat(".")  # progress indicator
  seizoen2 <- paste(seizoen, seizoen + 1, sep = "-")
  read_html(paste0("http://www.voetbal.com/spelerslijst/ned-eredivisie-",
                   seizoen2,
                   "/nach-name/",
                   page)) %>%
    html_nodes("table") %>%
    html_table(fill = TRUE) %>%
    as.data.frame() %>%
    as.tbl() %>%
    select(Speler, Team, Geboren, Lengte, Positie)
}
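As an aside, if you prefer the cross-product idea from your own attempt, building the grid as a data frame and feeding it to pmap_dfr() also works, because pmap matches the grid's column names to the function's argument names. A minimal sketch, assuming tidyr is available for crossing() and using your original page_season_func() (the version that takes the "y1-y2" string and adds the year column itself):
library(tidyr)  # for crossing()

# every (seizoen, page) combination as a two-column data frame;
# the column names must match the function's argument names
grid <- crossing(seizoen = paste(1956:2017, 1957:2018, sep = "-"),
                 page = 1:11)
res <- pmap_dfr(grid, page_season_func)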

Related

map_df -- Argument 1 must be a data frame or a named atomic vector

I am an infectious diseases physician and have set myself the challenge of creating a dataframe with the UK cumulative published cases of monkeypox, so I can graph it as a running tally or a choropleth map, as there is no nice dashboard at present for this.
All the data is published as HTML webpages rather than as a nice CSV, so I am trying to scrape it all off the internet using the rvest package.
Data is only published intermittently (about twice per week) with the cumulative totals for each of the 4 home nations in UK.
I have managed to get working code to pull data from each of the separate webpages, and testing it on the first 2 pages in my mpx_gov_uk_pages list works well, giving a small example tibble:
library(tidyverse)
library(lubridate)
library(rvest)
library(janitor)
# load in overview page url which has links to each date of published cases
mpx_gov_uk_overview_page <- c("https://www.gov.uk/government/publications/monkeypox-outbreak-epidemiological-overview")

# extract urls for each date page
mpx_gov_uk_pages <- mpx_gov_uk_overview_page %>%
  read_html %>%
  html_nodes(".govuk-link") %>%
  html_attr('href') %>%
  str_subset("\\d{1,2}-[a-z]+-\\d{4}") %>%
  paste0("https://www.gov.uk", .) %>%
  as.character()
# make table for home nations for each date
table1 <- mpx_gov_uk_pages[1] %>%
  read_html() %>%
  html_table() %>%
  .[[1]] %>%
  janitor::clean_names() %>%
  rename(area = starts_with(c("uk", "devolved")),
         cases = matches(c("total", "confirmed_cases"))) %>%
  separate(cases, c("cases", NA), sep = "\\s\\(") %>%
  mutate(date = dmy(str_extract(mpx_gov_uk_pages[1], "\\d{1,2}-[a-z]+-\\d{4}")),
         cases = as.numeric(gsub(",", "", cases))) %>%
  select(date, area, cases) %>%
  filter(!area %in% c("Total"))

table2 <- mpx_gov_uk_pages[2] %>%
  read_html() %>%
  html_table() %>%
  .[[1]] %>%
  janitor::clean_names() %>%
  rename(area = starts_with(c("uk", "devolved")),
         cases = matches(c("total", "confirmed_cases"))) %>%
  separate(cases, c("cases", NA), sep = "\\s\\(") %>%
  mutate(date = dmy(str_extract(mpx_gov_uk_pages[2], "\\d{1,2}-[a-z]+-\\d{4}")),
         cases = as.numeric(gsub(",", "", cases))) %>%
  select(date, area, cases) %>%
  filter(!area %in% c("Total"))
#> Warning: Expected 2 pieces. Missing pieces filled with `NA` in 1 rows [4].
# Combine tables
bind_rows(table1, table2)
#> # A tibble: 8 × 3
#> date area cases
#> <date> <chr> <dbl>
#> 1 2022-08-02 England 2638
#> 2 2022-08-02 Northern Ireland 24
#> 3 2022-08-02 Scotland 65
#> 4 2022-08-02 Wales 32
#> 5 2022-07-29 England 2436
#> 6 2022-07-29 Northern Ireland 19
#> 7 2022-07-29 Scotland 61
#> 8 2022-07-29 Wales 30
I want to automate this by creating a generic function and passing the list of urls to purrr::map_df, as there will be an ever-growing number of pages (there are already 13):
pull_first_table <- function(x){
  x %>%
    read_html() %>%
    html_table() %>%
    .[[1]] %>%
    janitor::clean_names() %>%
    rename(area = starts_with(c("uk", "devolved")),
           cases = matches(c("total", "confirmed_cases"))) %>%
    separate(cases, c("cases", NA), sep = "\\s\\(") %>%
    mutate(date = dmy(str_extract({{x}}, "\\d{1,2}-[a-z]+-\\d{4}")),
           cases = as.numeric(gsub(",", "", cases))) %>%
    select(date, area, cases) %>%
    filter(!area %in% c("Total"))
}
summary_table <- map_df(mpx_gov_uk_pages, ~ pull_first_table)
Error in `dplyr::bind_rows()`:
! Argument 1 must be a data frame or a named atomic vector.
Run `rlang::last_error()` to see where the error occurred.
The generic function seems to work OK when I supply it with a single element, e.g. mpx_gov_uk_pages[2], but I cannot seem to get map_df to work properly even though the web scraping is producing tibbles.
All help and pointers greatly welcomed.
We just need the function itself, not a lambda expression: with ~ pull_first_table, each iteration returns the function object instead of calling it, which is why bind_rows() complains that argument 1 is not a data frame.
map_dfr(mpx_gov_uk_pages, pull_first_table)
Output:
# A tibble: 52 × 3
date area cases
<date> <chr> <dbl>
1 2022-08-02 England 2638
2 2022-08-02 Northern Ireland 24
3 2022-08-02 Scotland 65
4 2022-08-02 Wales 32
5 2022-07-29 England 2436
6 2022-07-29 Northern Ireland 19
7 2022-07-29 Scotland 61
8 2022-07-29 Wales 30
9 2022-07-26 England 2325
10 2022-07-26 Northern Ireland 18
# … with 42 more rows
If we do use a lambda expression, the function has to actually be called with its argument:
map_dfr(mpx_gov_uk_pages, ~ pull_first_table(.x))
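One more defensive pattern worth knowing, since gov.uk page layouts do change over time: purrr::possibly() lets a single malformed page return a fallback instead of aborting the whole map. A minimal sketch, where safe_pull is a hypothetical wrapper name:
# safe_pull returns an empty tibble for any page that errors,
# so the remaining pages still bind together
safe_pull <- possibly(pull_first_table, otherwise = tibble())
summary_table <- map_dfr(mpx_gov_uk_pages, safe_pull)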

R web scraping output "character (empty)"

I am new to R.
I need help assigning web scraping data to "salary". Somehow, my variable "salary" is showing character (empty) in my environment. I have used SelectorGadget to find the html nodes.
Would really appreciate it if someone can explain it to me. Thanks!
library(rvest)
library(tidyverse)
library(magrittr)
nba_player_salaries <- read_html("https://hoopshype.com/salaries/players/2018-2019/")
salary <- nba_player_salaries %>%
  html_nodes("tbody .hh-salaries-sorted") %>%
  html_text2()
You can extract the table directly from the page:
library(rvest)
library(dplyr)
url <- 'https://hoopshype.com/salaries/players/2018-2019/'
url %>%
  read_html() %>%
  html_table() %>%
  .[[1]] %>%
  setNames(.[1, ]) %>% # since column names are in the 1st row
  slice(-1) %>%        # remove 1st row
  select(-1)           # remove 1st column
# Player `2018/19` `2018/19(*)`
# <chr> <chr> <chr>
# 1 Stephen Curry $37,457,154 $38,320,489
# 2 Russell Westbrook $35,665,000 $36,487,029
# 3 Chris Paul $35,654,150 $36,475,929
# 4 LeBron James $35,654,150 $36,475,929
# 5 Kyle Lowry $32,700,000 $33,453,690
# 6 Blake Griffin $31,873,932 $32,608,582
# 7 Gordon Hayward $31,214,295 $31,933,741
# 8 James Harden $30,570,000 $31,274,596
# 9 Paul George $30,560,700 $31,265,082
#10 Mike Conley $30,521,115 $31,224,584
# … with 566 more rows
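If you want the salaries as numbers rather than strings, readr::parse_number() strips the "$" and "," for you. A minimal sketch building on the pipeline above (unlist() is a small safety tweak when taking column names from the first row):
library(readr)  # for parse_number()

url %>%
  read_html() %>%
  html_table() %>%
  .[[1]] %>%
  setNames(unlist(.[1, ])) %>%           # column names are in the 1st row
  slice(-1) %>%
  select(-1) %>%
  mutate(across(-Player, parse_number))  # "$37,457,154" -> 37457154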

URL substitution using vectors

I'm trying to substitute the years 2000-2020 into the URL by using vectors. The error I get is: Can't combine 1$Rnk and 4$Rnk. How can I fix this?
TDF_wtables <- function(url){
  url %>%
    read_html() %>%
    # extract the table part of the html code
    html_node("table") %>%
    # create R dataset from webpage contents
    html_table() %>%
    # only these four columns are of interest in our analysis
    select(c("Rnk", "Rider", "Team", "Time")) %>%
    as_tibble()
}
Year <- 2000:2020
TDFurls <- str_c("https://www.procyclingstats.com/race/tour-de-france/",Year,"/gc")
Maps <- map_dfr(TDFurls, TDF_wtables, .id = "Year")
Maps
You get that error because some of the tables contain rows whose Rnk value is not a plain number. For any table that contains rows like that, html_table() cannot naturally tell whether the Rnk column should be of a character type or an integer type, so the column comes back as character for some years and integer for others, and map_dfr() refuses to combine them. The simplest solution would be just converting everything into character:
TDF_wtables <- function(url){
  url %>%
    read_html() %>%
    # extract the table part of the html code
    html_node("table") %>%
    # create R dataset from webpage contents
    html_table() %>%
    # only these four columns are of interest in our analysis
    select(c("Rnk", "Rider", "Team", "Time")) %>%
    as_tibble() %>%
    # all columns as character
    mutate(across(.fns = as.character))
}
Output:
> Maps <- map_dfr(TDFurls, TDF_wtables)
> Maps
# A tibble: 3,225 x 4
Rnk Rider Team Time
<chr> <chr> <chr> <chr>
1 1 Zanini StefanoMapei - Quickstep Mapei - Quickstep 3:12:363:12:36
2 2 Zabel ErikTeam Telekom Team Telekom ,,0:00
3 3 Vainšteins RomānsVini Caldirola - Sidermec Vini Caldirola - Sidermec ,,0:00
4 4 Rodriguez FredMapei - Quickstep Mapei - Quickstep ,,0:00
5 5 van Heeswijk MaxMapei - Quickstep Mapei - Quickstep ,,0:00
6 6 Magnien EmmanuelLa Française des Jeux La Française des Jeux ,,0:00
7 7 Simon FrançoisBonjour - Toupargel Bonjour - Toupargel ,,0:00
8 8 McEwen RobbieFarm Frites Farm Frites ,,0:00
9 9 Commesso SalvatoreSaeco Saeco ,,0:00
10 10 Piziks ArvisMemoryCard - Jack & Jones MemoryCard - Jack & Jones ,,0:00
# ... with 3,215 more rows
Update
That Time column has two spans nested in each row. They have the same text, so you have to trim off one of them to avoid repetition. Also, I just realised that your code does not get you the table you want. You only want the table on the "GC" tab, right? Consider the following function:
TDF_wtables <- function(url){
  gc_table <- url %>%
    read_html() %>%
    html_node("div[class$='resultCont '] > table")
  timeff <- gc_table %>%
    html_nodes("tbody > tr > td > span.timeff") %>%
    html_text()
  gc_table %>%
    html_table() %>%
    select(c("Rnk", "Rider", "Team", "Time")) %>%
    as_tibble() %>%
    mutate(Time = timeff, across(.fns = as.character))
}
Output:
# A tibble: 3,222 x 4
Rnk Rider Team Time
<chr> <chr> <chr> <chr>
1 NA Armstrong LanceUS Postal Service US Postal Service " 92:33:08"
2 2 Ullrich JanTeam Telekom Team Telekom "6:02"
3 3 Beloki JosebaFestina - Lotus Festina - Lotus "10:04"
4 4 Moreau ChristopheFestina - Lotus Festina - Lotus "10:34"
5 5 Heras RobertoKelme - Costa Blanca Kelme - Costa Blanca "11:50"
6 6 Virenque RichardPolti Polti "13:26"
7 7 Botero SantiagoKelme - Costa Blanca Kelme - Costa Blanca "14:18"
8 8 Escartín FernandoKelme - Costa Blanca Kelme - Costa Blanca "17:21"
9 9 Mancebo FranciscoBanesto Banesto "18:09"
10 10 Nardello DanieleMapei - Quickstep Mapei - Quickstep "18:25"
# ... with 3,212 more rows
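Finally, to get the Year column you originally wanted from .id: map_dfr() uses the names of its input for .id when they are present, so naming the URL vector with the years does the trick. A small sketch:
library(purrr)
library(stringr)

Year <- 2000:2020
TDFurls <- str_c("https://www.procyclingstats.com/race/tour-de-france/", Year, "/gc")
names(TDFurls) <- Year  # map_dfr() uses these names for .id instead of "1".."21"
Maps <- map_dfr(TDFurls, TDF_wtables, .id = "Year")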

How to scrape text from an HTML body

I've never scraped. Would it be straightforward to scrape the text in the main, big gray box only from the link below (starting with the header SRUS43 KMSR 271039 and ending with .END)? My end goal is to basically have three tidy columns of data from all that text: the five-digit codes, the values in inches, and the basin elevation descriptions, so any pointers on processing the text format are welcome, too.
https://www.nohrsc.noaa.gov/shef_archive/index.html?rfc=cnrfc&product=swe&year=2019&month=3&day=27&hour=6
Thank you for any help.
Reading in the text is fairly easy (see @DiceBoyT's answer below). Cleaning up the format into three columns is a bit more involved. The code below could use some clean-up (especially the regex), but it gets the job done:
library(tidyverse)
library(rvest)

text <- read_html("https://www.nohrsc.noaa.gov/shef_archive/index.html?rfc=cnrfc&product=swe&year=2019&month=3&day=27&hour=6") %>%
  html_node(".notes") %>%
  html_text()

df <- tibble(txt = read_lines(text))

df %>%
  mutate(
    row = row_number(),
    # lines that start with a station code followed by a value
    with_code = str_extract(txt, "^[A-Za-z0-9]{5}\\s+\\d+(\\.)?\\d"),
    # continuation lines that carry a value but no code
    wo_code = str_extract(txt, "^:?\\s+\\d+(\\.)?\\d") %>% str_extract("[:digit:]+\\.?[:digit:]"),
    # the basin description sits on the line above the first coded value
    basin_desc = if_else(!is.na(with_code), lag(txt, 1), NA_character_) %>% str_sub(start = 2)
  ) %>%
  separate(with_code, c("code", "val"), sep = "\\s+") %>%
  mutate(
    combined_val = case_when(
      !is.na(val) ~ val,
      !is.na(wo_code) ~ wo_code,
      TRUE ~ NA_character_
    ) %>% as.numeric
  ) %>%
  filter(!is.na(combined_val)) %>%
  mutate(
    # carry the last seen code/description down to the value rows
    code = zoo::na.locf(code),
    basin_desc = zoo::na.locf(basin_desc)
  ) %>%
  select(code, combined_val, basin_desc)
#> # A tibble: 643 x 3
#> code combined_val basin_desc
#> <chr> <dbl> <chr>
#> 1 ACSC1 0 San Antonio Ck - Sunol
#> 2 ADLC1 0 Arroyo De La Laguna
#> 3 ADOC1 0 Santa Ana R - Prado Dam
#> 4 AHOC1 0 Arroyo Honda nr San Jose
#> 5 AKYC1 41 SF American nr Kyburz
#> 6 AKYC1 3.2 SF American nr Kyburz
#> 7 AKYC1 42.2 SF American nr Kyburz
#> 8 ALQC1 0 Alamo Canal nr Pleasanton
#> 9 ALRC1 0 Alamitos Ck - Almaden Res
#> 10 ANDC1 0 Coyote Ck - Anderson Res
#> # ... with 633 more rows
Created on 2019-03-27 by the reprex package (v0.2.1)
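As a small aside, tidyr::fill() does the same forward-fill as zoo::na.locf() without the extra dependency. A toy sketch with a made-up frame standing in for the parsed lines:
library(dplyr)
library(tidyr)

toy <- tibble(code = c("AKYC1", NA, NA), val = c(41, 3.2, 42.2))
toy %>% fill(code, .direction = "down")  # carries "AKYC1" down, like na.locf()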
This is pretty straightforward to scrape with rvest:
library(rvest)
text <- read_html("https://www.nohrsc.noaa.gov/shef_archive/index.html?rfc=cnrfc&product=swe&year=2019&month=3&day=27&hour=6") %>%
  html_node(".notes") %>%
  html_text()

Replacing missing value when web scraping (rvest)

I'm trying to write a script that will go through a list of players provided by the website Transfermarkt and gather some information about them. For that, I've created the script below, but I faced a problem with 1 of the 29 players in the list. Because one page is arranged differently than the others, the code outputs a list of only 28 players, since it can't find the information on that page.
I understand why the code I've written doesn't find any information on the given page and thus gives me a list of 28, but I don't know how to rewrite the code to achieve what I want:
I want the script to simply insert a "-" when it does not find anything (in this case a nationality) for the node on a particular page, and to return a full list of 29 players with all the other info in it.
The player page in question is this one, and while the other pages have the node used in the code for nationality, on this page it's ".dataValue span".
I'm still quite new to R and it might be an easy fix, but at the moment I can't figure it out. Any help or advice is appreciated.
URL <- "http://www.transfermarkt.de/fc-bayern-munchen/leistungsdaten/verein/27/reldata/%262016/plus/1"
WS <- read_html(URL)
Team <- WS %>% html_nodes(".spielprofil_tooltip") %>% html_attr("href") %>% as.character()
Team <- paste0("http://www.transfermarkt.de", Team)
Catcher <- data.frame(Name = character(), Nat = character(), Vertrag = character())
for (i in Team) {
  WS1 <- read_html(i)
  Name <- WS1 %>% html_nodes("h1") %>% html_text() %>% as.character()
  Nat <- WS1 %>% html_nodes(".hide-for-small+ p .dataValue span") %>% html_text() %>% as.character()
  Vertrag <- WS1 %>% html_nodes(".dataValue:nth-child(9)") %>% html_text() %>% as.character()
  if (length(Nat) > 0) {
    temp <- data.frame(Name, Nat, Vertrag)
    Catcher <- rbind(Catcher, temp)
  }
  cat("*")
}
num_Rows <- nrow(Catcher)
odd_indexes <- seq(1, num_Rows, 2)
Catcher <- data.frame(Catcher[odd_indexes, ])
It's honestly easier to scrape the whole table, just in case things move around. I find purrr is a helpful complement for rvest, allowing you to iterate over URLs and node lists and easily coerce results to data.frames:
library(rvest)
library(purrr)

# build dynamically if you like
urls <- c(boateng = 'http://www.transfermarkt.de/jerome-boateng/profil/spieler/26485',
          friedl = 'http://www.transfermarkt.de/marco-friedl/profil/spieler/156990')

# scrape once, parse iteratively
html <- urls %>% map(read_html)

df <- html %>%
  map(html_nodes, '.dataDaten p') %>%
  map_df(map_df,
         ~list(
           variable = .x %>% html_node('.dataItem') %>% html_text(trim = TRUE),
           value = .x %>% html_node('.dataValue') %>% html_text(trim = TRUE) %>% gsub('\\s+', ' ', .)
         ),
         .id = 'player')
df
#> # A tibble: 17 × 3
#> player variable value
#> <chr> <chr> <chr>
#> 1 boateng Geb./Alter: 03.09.1988 (28)
#> 2 boateng Geburtsort: Berlin
#> 3 boateng Nationalität: Deutschland
#> 4 boateng Größe: 1,92 m
#> 5 boateng Position: Innenverteidiger
#> 6 boateng Vertrag bis: 30.06.2021
#> 7 boateng Berater: SAM SPORTS
#> 8 boateng Nationalspieler: Deutschland
#> 9 boateng Länderspiele/Tore: 67/1
#> 10 friedl Geb./Alter: 16.03.1998 (19)
#> 11 friedl Nationalität: Österreich
#> 12 friedl Größe: 1,87 m
#> 13 friedl Position: Linker Verteidiger
#> 14 friedl Vertrag bis: 30.06.2021
#> 15 friedl Berater: acta7
#> 16 friedl Akt. Nationalspieler: Österreich U19
#> 17 friedl Länderspiele/Tore: 6/0
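If you'd rather have one row per player with the variables as columns, a pivot on df gets you there. A sketch, assuming tidyr is available:
library(tidyr)

# df from above: one row per (player, variable, value) triple
df_wide <- df %>%
  pivot_wider(names_from = variable, values_from = value)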
Alternatively, that particular piece of data is in three places on those pages, so if one is inconsistent there's a chance the others are better. Or grab the countries from the table for the whole team: they are not printed, but they're in the title attribute of the flag images, which can be grabbed with html_attr:
html <- read_html('http://www.transfermarkt.de/fc-bayern-munchen/leistungsdaten/verein/27/reldata/%262016/plus/1')

team <- html %>%
  html_nodes('tr.odd, tr.even') %>%
  map_df(~list(player = .x %>% html_node('a.spielprofil_tooltip') %>% html_text(),
               nationality = .x %>% html_nodes('img.flaggenrahmen') %>% html_attr('title') %>% toString()))
team
#> # A tibble: 29 × 2
#> player nationality
#> <chr> <chr>
#> 1 Manuel Neuer Deutschland
#> 2 Sven Ulreich Deutschland
#> 3 Tom Starke Deutschland
#> 4 Jérôme Boateng Deutschland
#> 5 David Alaba Österreich
#> 6 Mats Hummels Deutschland
#> 7 Javi Martínez Spanien
#> 8 Juan Bernat Spanien
#> 9 Philipp Lahm Deutschland
#> 10 Rafinha Brasilien, Deutschland
#> # ... with 19 more rows
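One practical note when iterating over a few dozen player pages: a short pause between requests is kinder to the server. A minimal sketch, where slow_read is a hypothetical wrapper around read_html:
# hypothetical wrapper: plain read_html with a one-second pause per request
slow_read <- function(url) {
  Sys.sleep(1)
  read_html(url)
}
html <- urls %>% map(slow_read)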
