how to scrape text from an icon - R - r

I'm trying to scrape all the data from this website. There are icons over some of the competitors names indicating that the person was disqualified for being a 'no-show'.
I would like create a data frame with all the competitors while also specifying who was disqualified, but I'm running into two issues:
(1) trying to add the disclaimer next to the persons name produces the error cannot coerce class ‘"xml_nodeset"’ to a data.frame.
(2) trying to extract the text from just the icon (and not the competitor names) produces a blank data frame.
library(rvest); library(tidyverse)
html = read_html('https://web.archive.org/web/20220913034642/https://www.bjjcompsystem.com/tournaments/1869/categories/2053162')
dq = data.frame(winner = html %>%
html_nodes('.match-card__competitor--red') %>%
html_text(trim = TRUE),
opponent = html %>%
html_nodes('hr+ .match-card__competitor'),
dq = html %>%
html_nodes('.match-card__disqualification') %>%
html_text())

This approach generally works only on tabular data where you can be sure that the number of matches for each of those selectors are constant and order is also fixed. In your example you have:
127 matches for .match-card__competitor--red
127 matches for hr+ .match-card__competitor
14 matches for .match-card__disqualification (you get no results for this because you should use html_attr("title") for title attribute instead of html_text())
Basically you are trying to combine columns of different lengths into the same dataframe. Even if it would work, you'd just add DSQ for 14 first matches.
As you'd probably want to keep information about matched, participants, results and disqualifications instead of just having a list of participants, I'd suggest to work with a list of match cards, i.e. extract all required information from a single card while not breaking relations and then move to the next card.
My purrr is far from perfect, but perhaps something like this:
library(rvest)
library(magrittr)
library(purrr)
library(dplyr)
library(tibble)
library(tidyr)
# helpers -----------------------------------------------------------------
# to keep matches with details (when/where) in header
is_valid_match <- function(element){
return(length(html_elements(element, ".bracket-match-header")) > 0)
}
# detect winner
is_winner <- function(element){
return(length(html_elements(element, ".match-competitor--loser")) < 1 )
}
# extract data from competitor sections
comp_details <- function(comp_card, prefix="_"){
l = lst()
l[paste(prefix, "n", sep = "")] <- comp_card %>% html_element(".match-card__competitor-n") %>% html_text()
l[paste(prefix, "name", sep = "")] <- comp_card %>% html_element(".match-card__competitor-name") %>% html_text()
l[paste(prefix, "club", sep = "")] <- comp_card %>% html_element(".match-card__club-name") %>% html_text()
l[paste(prefix, "dq", sep = "")] <- comp_card %>% html_element(".match-card__disqualification") %>% html_attr("title")
l[paste(prefix, "won", sep = "")] <- comp_card %>% html_element(".match-competitor--loser") %>% length() == 0
return(l)
}
# scrape & process --------------------------------------------------------
html <- read_html('https://web.archive.org/web/20220913034642/https://www.bjjcompsystem.com/tournaments/1869/categories/2053162')
html %>%
# collect all match cards
html_elements("div.tournament-category__match") %>%
keep(is_valid_match) %>%
# apply anonymous function to every item in the list of match cards
map(function(match_card){
match_id <- match_card %>% html_element(".tournament-category__match-card") %>% html_attr("id")
where <- match_card %>% html_element(".bracket-match-header__where") %>% html_text()
when <- match_card %>% html_element(".bracket-match-header__when") %>% html_text()
competitors <- html_nodes(match_card, ".match-card__competitor")
# extract competitior data
comp01 <- competitors[[1]] %>% comp_details(prefix = "comp01_")
comp02 <- competitors[[2]] %>% comp_details(prefix = "comp02_")
winner_idx <- competitors %>% detect_index(is_winner)
# lst for creating a named list
l <- lst(match_id, where, when, winner_idx)
# combine all items and comp lists into single list
l <- c(l,comp01, comp02)
return(l)
}) %>%
# each resulting list item into single-row tibble
map(as_tibble) %>%
# reduce list of tibbles into single tibble
reduce(bind_rows)
Result:
#> # A tibble: 65 × 14
#> match_id where when winne…¹ comp0…² comp0…³ comp0…⁴ comp0…⁵ comp0…⁶ comp0…⁷
#> <chr> <chr> <chr> <int> <chr> <chr> <chr> <chr> <lgl> <chr>
#> 1 match-1-1 FIGH… Sat … 2 58 Christ… Rodrig… <NA> FALSE 66
#> 2 match-1-9 FIGH… Sat … 2 6 Melvin… GF Team Disqua… FALSE 66
#> 3 match-1-… FIGH… Sat … 2 47 Eric R… Atos J… <NA> FALSE 66
#> 4 match-1-… FIGH… Sat … 1 47 Eric R… Atos J… <NA> TRUE 10
#> 5 match-1-… FIGH… Sat … 2 42 Ivan M… CheckM… <NA> FALSE 66
#> 6 match-1-… FIGH… Sat … 2 18 Joel S… Gracie… <NA> FALSE 47
#> 7 match-1-… FIGH… Sat … 1 42 Ivan M… CheckM… <NA> TRUE 26
#> 8 match-1-… FIGH… Sat … 2 34 Matthe… Super … <NA> FALSE 18
#> 9 match-2-9 FIGH… Sat … 1 62 Bryan … Team J… <NA> TRUE 4
#> 10 match-2-… FIGH… Sat … 2 22 Steffe… Six Bl… <NA> FALSE 30
#> # … with 55 more rows, 4 more variables: comp02_name <chr>, comp02_club <chr>,
#> # comp02_dq <chr>, comp02_won <lgl>, and abbreviated variable names
#> # ¹​winner_idx, ²​comp01_n, ³​comp01_name, ⁴​comp01_club, ⁵​comp01_dq,
#> # ⁶​comp01_won, ⁷​comp02_n
Created on 2022-09-19 with reprex v2.0.2
Also note that not all matches have a winner and both participants can be disqualified (screenshot), so splitting them to winners & opponents might not be optimal.

Related

Cleaning an oddly structured dataframe from an excel file (any recommendations on functions also appreciated)

I'm trying to make a dataframe pulled from an excel file more user-friendly by creating a "Type" column.
The data can be found here: https://www.dmo.gov.uk/data/pdfdatareport?reportCode=D1A (direct download excel link here: https://www.dmo.gov.uk/umbraco/surface/DataExport/GetDataExport?reportCode=D1A&exportFormatValue=xls&parameters=%26COBDate%3D11%2F04%2F2011)
As you can probably see, the type of data is all grouped together in column A, like so:
What I'd like to do is is change title "Conventional Gilts" to being "Name", and create a "Type" column that has the different categories pulled from their grouped title. In the linked file, the "Types" would be: "Ultra-Short", "Short", "Medium", "Long", "Index-linked Gilts (3-month Indexation Lag)", "Undated Gilts (non "rump")", and ""Rump" Gilts".
While I feel I would need to do some form of pattern recognition using a package like grepl, I'm not sure how I can achieve this from a 'dynamic' perspective (changing if new categories are created).
Any advice on how to achieve this (or even achieve this in a function) would be greatly appreciated.
I don't know about a single function to do all this; the data is haphazardly arranged and needs to be fixed "manually", for example:
library(readxl)
library(tidyverse)
gilts <- read_xls("C:/Users/Administrator/Documents/gilts.xls")
gilts %>%
filter(!apply(gilts, 1, function(x) all(is.na(x)))) %>%
filter(seq(nrow(.)) < 44) %>%
select(1:7) %>%
filter(seq(nrow(.)) != 1) %>%
setNames(unlist(slice(., 1))) %>%
filter(seq(nrow(.)) != 1) %>%
mutate(splitter = cumsum(is.na(`ISIN Code`))) %>%
group_by(splitter) %>%
mutate(Type = first(`Conventional Gilts`)) %>%
summarize(across(everything(), ~.x[-1])) %>%
ungroup() %>%
select(-1) %>%
select(c(8, 1:7)) %>%
rename(Name = `Conventional Gilts`) %>%
mutate(across(c(4, 5, 7),
~ as.Date(as.numeric(.x), origin = "1899-12-30"))) %>%
mutate(across(contains("million"), as.numeric))
#> `summarise()` has grouped output by 'splitter'. You can override using the
#> `.groups` argument.
#> # A tibble: 37 x 8
#> Type Name ISIN ~1 Redempti~2 First Is~3 Divid~4 Current/~5 Total~6
#> <chr> <chr> <chr> <date> <date> <chr> <date> <dbl>
#> 1 Ultra-Short 9% Conv~ GB0002~ 2011-07-12 1987-07-12 12 Jan~ 2011-07-01 7312.
#> 2 Ultra-Short 3¼% Tre~ GB00B3~ 2011-12-07 2008-11-14 7 Jun/~ 2011-05-26 15747
#> 3 Ultra-Short 5% Trea~ GB0030~ 2012-03-07 2001-05-25 7 Mar/~ 2011-08-26 26867.
#> 4 Ultra-Short 5¼% Tre~ GB00B1~ 2012-06-07 2007-03-16 7 Jun/~ 2011-05-26 25612.
#> 5 Ultra-Short 4½% Tre~ GB00B2~ 2013-03-07 2008-03-05 7 Mar/~ 2011-08-26 33787.
#> 6 Ultra-Short 8% Trea~ GB0008~ 2013-09-27 1993-04-01 27 Mar~ 2011-09-16 8378.
#> 7 Ultra-Short 2¼% Tre~ GB00B3~ 2014-03-07 2009-03-20 7 Mar/~ 2011-08-26 29123.
#> 8 Short 5% Trea~ GB0031~ 2014-09-07 2002-07-25 7 Mar/~ 2011-08-26 36579.
#> 9 Short 2¾% Tre~ GB00B4~ 2015-01-22 2009-11-04 22 Jan~ 2011-07-13 28181.
#> 10 Short 4¾% Tre~ GB0033~ 2015-09-07 2003-09-26 7 Mar/~ 2011-08-26 33650.
#> # ... with 27 more rows, and abbreviated variable names 1: `ISIN Code`,
#> # 2: `Redemption Date`, 3: `First Issue Date`, 4: `Dividend Dates`,
#> # 5: `Current/Next \nEx-dividend Date`,
#> # 6: `Total Amount in Issue \n(£ million nominal)`
Created on 2022-10-30 with reprex v2.0.2
Different approach, premised on the fact that all the gilts start with numbers and the types do not. Makes use of janitor which has super helpful functions for cleaning up messy imported data like this.
library(tidyverse)
library(readxl)
library(janitor)
import_gilts <- read_excel("20221031 - Gilts in Issue.xls.xls", skip = 7)
gilts <- import_gilts %>%
filter(!str_detect(1, "^Note|^Page")) %>%
rename(Name = `Conventional Gilts`) %>%
remove_empty(which = "rows") %>%
mutate(Type = case_when(str_detect(Name, "^[^0-9]") ~ Name,
TRUE ~ NA_character_),
.before = Name) %>%
fill(Type, .direction = "down") %>%
arrange(desc(...9)) %>%
row_to_names(row_number = 2) %>%
rename(Type = 1,
Name = 2) %>%
filter(Type != Name)
Quick draft so there's certainly room for improvement.
Should be able to be turned into a function as long as the number of imported columns and number of rows to skip reading in the file stay the same.

Rvest and SelectorGadget results in empty table

I am trying to download several tables from this website using rvest and SelectorGadget.
The css selector is "#main li" as can be seen from the screenshot below.
When I run the following code, unfortunately an empty table results.
library(rvest)
psh <- read_html("https://pershingsquareholdings.com/performance/net-asset-value-and-returns/")
psh.node <- html_node(psh, "#main li")
psh.table = html_table(psh, fill = TRUE)
I guess the site prevents scraping, but it would otherwise be great if an alternative way could be recommended to get the data.
Thanks in advance!
Problem is that it is not an html <table> but a list:
library(rvest)
library(purrr)
psh <- read_html("https://pershingsquareholdings.com/performance/net-asset-value-and-returns/")
psh.node <- html_node(psh, "#main .psh_table")
headers <- psh.node %>% html_element(".psh_table_row.headings") %>%
html_elements("li") %>% html_text()
table <- psh.node %>% html_elements("ul.psh_table_row") %>%
map_dfr(~ html_elements(., "li") %>% html_text() %>% set_names(headers))
table
#> # A tibble: 44 × 10
#> `As of Date` Period USDNAV…¹ Euron…² GBPNA…³ LSE G…⁴ LSE U…⁵ MTDRe…⁶ QTDRe…⁷
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 As of Date Period USDNAV/… Eurone… GBPNAV… LSE GB… LSE US… MTDRet… QTDRet…
#> 2 25 October Weekly $51.01 $32.60 £44.47 £28.45 $32.63 12.0% 12.0%
#> 3 18 October Weekly $47.81 $30.25 £42.22 £26.70 $30.23 5.0% 5.0%
#> 4 11 October Weekly $44.96 $29.35 £40.93 £26.35 $29.55 -1.3% -1.3%
#> 5 30 September Monthly $45.55 $30.00 £40.79 £27.00 $30.25 -4.5% 8.2%
#> 6 27 September Weekly $45.49 $30.25 £42.44 £28.00 $30.23 -4.6% 8.1%
#> 7 20 September Weekly $47.88 $32.00 £42.06 £27.80 $31.75 0.4% 13.8%
#> 8 13 September Weekly $49.11 $32.00 £42.71 £27.70 $32.13 2.7% 16.4%
#> 9 6 September Weekly $47.80 $32.25 £41.51 £28.10 $32.48 -0.1% 13.3%
#> 10 31 August Monthly $47.83 $32.70 £41.17 £27.90 $32.83 2.9% 13.4%
#> # … with 34 more rows, 1 more variable: YTDReturn <chr>, and abbreviated
#> # variable names ¹​`USDNAV/Share`, ²​`EuronextPrice/Share`, ³​`GBPNAV/Share`,
#> # ⁴​`LSE GBPPrice/Share`, ⁵​`LSE USDPrice/Share`, ⁶​MTDReturn, ⁷​QTDReturn
Edit
To find all similar tables:
psh.nodes <- html_elements(psh, "#main .psh_table")
tables <- map(psh.nodes, function(psh.node){
headers <- psh.node %>% html_element(".psh_table_row.headings") %>%
html_elements("li") %>% html_text()
table <- psh.node %>% html_elements("ul.psh_table_row") %>%
map_dfr(~ html_elements(., "li") %>% html_text() %>% set_names(headers))
table
})

map_df -- Argument 1 must be a data frame or a named atomic vector

I am an infectious diseases physician and have set myself the challenge of creating a dataframe with the UK cumulative published cases of monkeypox, so I can graph it as a runing tally or a chloropleth map as there is no nice dashboard at present for this.
All the data is published as html webpages rather than as a nice csv so I am trying to scrape it all off the internet using the rvest package.
Data is only published intermittently (about twice per week) with the cumulative totals for each of the 4 home nations in UK.
I have managed to get working code to pull data from each of the separate webpages and testing it on the first 2 pages in my mpx_gov_uk_pages list works well giving a small example tibble:
library(tidyverse)
library(lubridate)
library(rvest)
library(janitor)
# load in overview page url which has links to each date of published cases
mpx_gov_uk_overview_page <- c("https://www.gov.uk/government/publications/monkeypox-outbreak-epidemiological-overview")
# extract urls for each date page
mpx_gov_uk_pages <- mpx_gov_uk_overview_page %>%
read_html %>%
html_nodes(".govuk-link") %>%
html_attr('href') %>%
str_subset("\\d{1,2}-[a-z]+-\\d{4}") %>%
paste0("https://www.gov.uk", .) %>%
as.character()
# make table for home nations for each date
table1 <- mpx_gov_uk_pages[1] %>%
read_html() %>%
html_table() %>%
.[[1]] %>%
janitor::clean_names() %>%
rename(area = starts_with(c("uk", "devolved")),
cases = matches(c("total", "confirmed_cases"))) %>%
separate(cases, c("cases", NA), sep = "\\s\\(") %>%
mutate(date = dmy(str_extract(mpx_gov_uk_pages[1], "\\d{1,2}-[a-z]+-\\d{4}")),
cases = as.numeric(gsub(",", "", cases))) %>%
select(date, area, cases) %>%
filter(!area %in% c("Total"))
table2 <- mpx_gov_uk_pages[2] %>%
read_html() %>%
html_table() %>%
.[[1]] %>%
janitor::clean_names() %>%
rename(area = starts_with(c("uk", "devolved")),
cases = matches(c("total", "confirmed_cases"))) %>%
separate(cases, c("cases", NA), sep = "\\s\\(") %>%
mutate(date = dmy(str_extract(mpx_gov_uk_pages[2], "\\d{1,2}-[a-z]+-\\d{4}")),
cases = as.numeric(gsub(",", "", cases))) %>%
select(date, area, cases) %>%
filter(!area %in% c("Total"))
#> Warning: Expected 2 pieces. Missing pieces filled with `NA` in 1 rows [4].
# Combine tables
bind_rows(table1, table2)
#> # A tibble: 8 × 3
#> date area cases
#> <date> <chr> <dbl>
#> 1 2022-08-02 England 2638
#> 2 2022-08-02 Northern Ireland 24
#> 3 2022-08-02 Scotland 65
#> 4 2022-08-02 Wales 32
#> 5 2022-07-29 England 2436
#> 6 2022-07-29 Northern Ireland 19
#> 7 2022-07-29 Scotland 61
#> 8 2022-07-29 Wales 30
I want to automate this by creating a generic function and passing the list of urls to purrr::map_df as there will be an ever growing number of pages (there's already 13):
pull_first_table <- function(x){
x %>%
read_html() %>%
html_table() %>%
.[[1]] %>%
janitor::clean_names() %>%
rename(area = starts_with(c("uk", "devolved")),
cases = matches(c("total", "confirmed_cases"))) %>%
separate(cases, c("cases", NA), sep = "\\s\\(") %>%
mutate(date = dmy(str_extract({{x}}, "\\d{1,2}-[a-z]+-\\d{4}")),
cases = as.numeric(gsub(",", "", cases))) %>%
select(date, area, cases) %>%
filter(!area %in% c("Total"))
}
summary_table <- map_df(mpx_gov_uk_pages, ~ pull_first_table)
Error in `dplyr::bind_rows()`:
! Argument 1 must be a data frame or a named atomic vector.
Run `rlang::last_error()` to see where the error occurred.
The generic function seems to work ok when I supply it with a single element e.g. mpx_gov_uk_cases[2] but I cannot seem to get map_df to work properly even though the webscraping is producing tibbles.
All help and pointers greatly welcomed.
We just need the function and not a lambda expression.
map_dfr(mpx_gov_uk_pages, pull_first_table)
-output
# A tibble: 52 × 3
date area cases
<date> <chr> <dbl>
1 2022-08-02 England 2638
2 2022-08-02 Northern Ireland 24
3 2022-08-02 Scotland 65
4 2022-08-02 Wales 32
5 2022-07-29 England 2436
6 2022-07-29 Northern Ireland 19
7 2022-07-29 Scotland 61
8 2022-07-29 Wales 30
9 2022-07-26 England 2325
10 2022-07-26 Northern Ireland 18
# … with 42 more rows
If we use the lambda expression,
map_dfr(mpx_gov_uk_pages, ~ pull_first_table(.x))

Web scraping with R (rvest)

I'm new to R and am having some trouble to create a good web scraper with R.... It has been only 5 days since I started to study this language. So, any help I'll appreciate!
Idea
I'm trying to web scraping the classification table of "Campeonato Brasileiro" from 2003 to 2021 on Wikipedia to group the teams later to analyze some stuff.
Explanation and problem
I'm scraping the page of the 2002 championship. I read the HTML page to extract the HTML nodes that I select with the "SelectorGadget" extension at Google Chrome. There is some considerations:
The page that I'm trying to access is from the 2002 championship. I done that because it was easier to extract the links of the tables that are present on a board in the final of the page, selecting just one selector for all (tr:nth-child(9) div a) to access their links by HTML attribute "href";
The selected CSS was from 2003 championship page.
So, in my twisted mind I thought: "Hey! I'm going to create a function to extract the tables from those pages and I'll save them in a data frame!". However, it went wrong and I'm not understanding why... When I tried to ran the "tabelageral" line, the following error returned : "Error in UseMethod("xml_find_all") : no applicable method for 'xml_find_all' applied to an object of class "character"". I think that it is reading a string instead of a xml. What am I misunderstanding here? Where is my error? The "sapply" method? Since now, thanks!
The code
library("dplyr")
library("rvest")
link_wikipedia <- "https://pt.wikipedia.org/wiki/Campeonato_Brasileiro_de_Futebol_de_2002"
pagina_wikipedia <- read_html(link_wikipedia)
links_temporadas <- pagina_wikipedia %>%
html_nodes("tr:nth-child(9) div a") %>%
html_attr("href") %>%
paste("https://pt.wikipedia.org", ., sep = "")
tabela <- function(link){
pagina_tabela <- read_html(link)
tabela_wiki = link %>%
html_nodes("table.wikitable") %>%
html_table() %>%
paste(collapse = "|")
}
tabela_geral <- sapply(links_temporadas, FUN = tabela, USE.NAMES = FALSE)
tabela_final <- data.frame(tabela_geral)
You can use :contains to target the appropriate table by class and then a substring that the table contains. Furthermore, you can use html_table() to extract in tabular format from matched node. You can then subset on a vector of desired columns. I don't know the correct football terms so have guessed the columns to subset on. You can adjusted the columns vector.
If you wrap the years and constructed urls to make requests to inside of a map2_dfr() call you can return a single DataFrame for all desired years.
library(tidyverse)
library(rvest)
years <- 2003:2021
urls <- paste("https://pt.wikipedia.org/wiki/Campeonato_Brasileiro_de_Futebol_de_", years, sep = "")
columns <- c("Pos.", "Equipes", "GP", "GC", "SG")
df <- purrr::map2_dfr(urls, years, ~
read_html(.x, encoding = "utf-8") %>%
html_element('.wikitable:contains("ou rebaixamento")') %>%
html_table() %>%
.[columns] %>%
mutate(year = .y, SG = as.character(SG)))
You can get all the tables from those links by doing this:
tabela <- function(link){
read_html(link) %>% html_nodes("table.wikitable") %>% html_table()
}
all_tables = lapply(links_temporadas, tabela)
names(all_tables)<-2003:2022
This gives you a list of length 20, named 2003 to 2022 (i.e. one element for each of those years). Each element is itself a list of tables (i.e. the tables that are available at that link of links_temporadas. Note that the number of tables avaialable at each link varies.
lengths(all_tables)
2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020 2021 2022
6 5 10 9 10 12 11 10 12 11 13 14 17 16 16 16 16 15 17 7
You will need to determine which table(s) you are interested in from each of these years.
Here is a way. It's more complicated than your function because those pages have more than one table so the function returns only the tables with a column names matching "Pos.".
Then, before rbinding the tables, keep only the common columns since the older tables have one less column, column "M".
suppressPackageStartupMessages({
library("dplyr")
library("rvest")
})
link_wikipedia <- "https://pt.wikipedia.org/wiki/Campeonato_Brasileiro_de_Futebol_de_2002"
pagina_wikipedia <- read_html(link_wikipedia)
links_temporadas <- pagina_wikipedia %>%
html_nodes("tr:nth-child(9) div a") %>%
html_attr("href") %>%
paste("https://pt.wikipedia.org", ., sep = "")
tabela <- function(link){
pagina_tabela <- read_html(link)
lista_wiki <- pagina_tabela %>%
html_elements("table.wikitable") %>%
html_table()
i <- sapply(lista_wiki, \(x) "Pos." %in% names(x))
i <- which(i)[1]
lista_wiki[[i]]
}
tabela_geral <- sapply(links_temporadas, FUN = tabela, USE.NAMES = FALSE)
sapply(tabela_geral, ncol)
#> [1] 12 12 12 12 12 12 13 13 13 13 13 13 13 13 13 13 13 13 13 13
#sapply(tabela_geral, names)
common_names <- Reduce(intersect, lapply(tabela_geral, names))
tabela_reduzida <- lapply(tabela_geral, `[`, common_names)
tabela_final <- do.call(rbind, tabela_reduzida)
head(tabela_final)
#> # A tibble: 6 x 12
#> Pos. Equipes P J V E D GP GC SG `%`
#> <int> <chr> <chr> <int> <int> <int> <int> <int> <int> <chr> <int>
#> 1 1 Cruzeiro 100 46 31 7 8 102 47 +55 72
#> 2 2 Santos 87 46 25 12 9 93 60 +33 63
#> 3 3 São Paulo 78 46 22 12 12 81 67 +14 56
#> 4 4 São Caetano 742 46 19 14 13 53 37 +16 53
#> 5 5 Coritiba 73 46 21 10 15 67 58 +9 52
#> 6 6 Internacional 721 46 20 10 16 59 57 +2 52
#> # ... with 1 more variable: `Classificação ou rebaixamento` <chr>
Created on 2022-04-03 by the reprex package (v2.0.1)
To have all columns, including the "M" columns:
data.table::rbindlist(tabela_geral, fill = TRUE)

Replacing missing value when web scraping (rvest)

I'm trying to write a script that will go through a list of players provided by the website Transfermarkt and gathering some information about them. For that, I've created the script below, but faced a problem with 1 of the 29 players in the list. Due to one page being arranged differently than the others, the code outputs a list of only 28 players since it can't find information on the aforementioned page.
I understand why the code I've written doesn't find any information on the given page and thus giving me a list of 28, but I don't know how to rewrite a code in order to achieve what I want:
for the script to simply replace the entry with a "-" if it does not find anything, in this case a nationality, for the node on a particular page and return a full list with 29 players with all the other info in it.
The player page in question is this and while the other pages has the node used in the code for nationality, here it's ".dataValue span".
I'm still quite new to R and it might be an easy fix, but atm I can't figure it out. Any help or advise is appreciated.
URL <- "http://www.transfermarkt.de/fc-bayern-munchen/leistungsdaten/verein/27/reldata/%262016/plus/1"
WS <- read_html(URL)
Team <- WS %>% html_nodes(".spielprofil_tooltip") %>% html_attr("href") %>% as.character()
Team <- paste0("http://www.transfermarkt.de",Team)
Catcher <- data.frame(Name=character(),Nat=character(),Vertrag=character())
for (i in Team) {
WS1 <- read_html(i)
Name <- WS1 %>% html_nodes("h1") %>% html_text() %>% as.character()
Nat <- WS1 %>% html_nodes(".hide-for-small+ p .dataValue span") %>% html_text() %>% as.character()
Vertrag <- WS1 %>% html_nodes(".dataValue:nth-child(9)") %>% html_text() %>% as.character()
if (length(Nat) > 0) {
temp <- data.frame(Name,Nat,Vertrag)
Catcher <- rbind(Catcher,temp)
}
else {}
cat("*")
}
num_Rows <- nrow(Catcher)
odd_indexes <- seq(1,num_Rows,2)
Catcher <- data.frame(Catcher[odd_indexes,])
It's honestly easier to scrape the whole table, just in case things move around. I find purrr is a helpful complement for rvest, allowing you to iterate over URLs and node lists and easily coerce results to data.frames:
library(rvest)
library(purrr)
# build dynamically if you like
urls <- c(boateng = 'http://www.transfermarkt.de/jerome-boateng/profil/spieler/26485',
friedl = 'http://www.transfermarkt.de/marco-friedl/profil/spieler/156990')
# scrape once, parse iteratively
html <- urls %>% map(read_html)
df <- html %>%
map(html_nodes, '.dataDaten p') %>%
map_df(map_df,
~list(
variable = .x %>% html_node('.dataItem') %>% html_text(trim = TRUE),
value = .x %>% html_node('.dataValue') %>% html_text(trim = TRUE) %>% gsub('\\s+', ' ', .)
),
.id = 'player')
df
#> # A tibble: 17 × 3
#> player variable value
#> <chr> <chr> <chr>
#> 1 boateng Geb./Alter: 03.09.1988 (28)
#> 2 boateng Geburtsort: Berlin
#> 3 boateng Nationalität: Deutschland
#> 4 boateng Größe: 1,92 m
#> 5 boateng Position: Innenverteidiger
#> 6 boateng Vertrag bis: 30.06.2021
#> 7 boateng Berater: SAM SPORTS
#> 8 boateng Nationalspieler: Deutschland
#> 9 boateng Länderspiele/Tore: 67/1
#> 10 friedl Geb./Alter: 16.03.1998 (19)
#> 11 friedl Nationalität: Österreich
#> 12 friedl Größe: 1,87 m
#> 13 friedl Position: Linker Verteidiger
#> 14 friedl Vertrag bis: 30.06.2021
#> 15 friedl Berater: acta7
#> 16 friedl Akt. Nationalspieler: Österreich U19
#> 17 friedl Länderspiele/Tore: 6/0
Alternately, that particular piece of data is in three places on those pages, so if one is inconsistent there's a chance the others are better. Or grab them from the table with the whole team—countries are not printed, but they're in the title attribute of the flag images, which can be grabbed with html_attr:
html <- read_html('http://www.transfermarkt.de/fc-bayern-munchen/leistungsdaten/verein/27/reldata/%262016/plus/1')
team <- html %>%
html_nodes('tr.odd, tr.even') %>%
map_df(~list(player = .x %>% html_node('a.spielprofil_tooltip') %>% html_text(),
nationality = .x %>% html_nodes('img.flaggenrahmen') %>% html_attr('title') %>% toString()))
team
#> # A tibble: 29 × 2
#> player nationality
#> <chr> <chr>
#> 1 Manuel Neuer Deutschland
#> 2 Sven Ulreich Deutschland
#> 3 Tom Starke Deutschland
#> 4 Jérôme Boateng Deutschland
#> 5 David Alaba Österreich
#> 6 Mats Hummels Deutschland
#> 7 Javi Martínez Spanien
#> 8 Juan Bernat Spanien
#> 9 Philipp Lahm Deutschland
#> 10 Rafinha Brasilien, Deutschland
#> # ... with 19 more rows

Resources