Web scraping with R (rvest)

I'm new to R and am having some trouble creating a good web scraper with R... I started studying this language only 5 days ago, so I'll appreciate any help!
Idea
I'm trying to scrape the classification tables of the "Campeonato Brasileiro" from 2003 to 2021 on Wikipedia, so I can group the teams later and analyze some things.
Explanation and problem
I'm scraping the page of the 2002 championship. I read the HTML page to extract the HTML nodes that I selected with the "SelectorGadget" extension in Google Chrome. There are some considerations:
The page that I'm accessing is from the 2002 championship. I did that because it was easier to extract the links of the tables, which are listed in a box at the bottom of that page, using just one selector for all of them (tr:nth-child(9) div a) and reading their links from the HTML attribute "href";
The selected CSS was from the 2003 championship page.
So, in my twisted mind I thought: "Hey! I'm going to create a function to extract the tables from those pages and I'll save them in a data frame!". However, it went wrong and I'm not understanding why... When I tried to run the "tabela_geral" line, the following error was returned: "Error in UseMethod("xml_find_all") : no applicable method for 'xml_find_all' applied to an object of class "character"". I think that it is reading a string instead of an XML document. What am I misunderstanding here? Where is my error? In the "sapply" call? Thanks in advance!
The code
library("dplyr")
library("rvest")
link_wikipedia <- "https://pt.wikipedia.org/wiki/Campeonato_Brasileiro_de_Futebol_de_2002"
pagina_wikipedia <- read_html(link_wikipedia)
links_temporadas <- pagina_wikipedia %>%
html_nodes("tr:nth-child(9) div a") %>%
html_attr("href") %>%
paste("https://pt.wikipedia.org", ., sep = "")
tabela <- function(link){
pagina_tabela <- read_html(link)
tabela_wiki = link %>%
html_nodes("table.wikitable") %>%
html_table() %>%
paste(collapse = "|")
}
tabela_geral <- sapply(links_temporadas, FUN = tabela, USE.NAMES = FALSE)
tabela_final <- data.frame(tabela_geral)

You can use :contains to target the appropriate table by class plus a substring that the table contains. Furthermore, you can use html_table() to extract the matched node in tabular format. You can then subset on a vector of desired columns. I don't know the correct football terms, so I have guessed the columns to subset on; you can adjust the columns vector.
If you wrap the years and the constructed URLs you make requests to inside a map2_dfr() call, you can return a single data frame for all desired years.
library(tidyverse)
library(rvest)

years <- 2003:2021
urls <- paste("https://pt.wikipedia.org/wiki/Campeonato_Brasileiro_de_Futebol_de_", years, sep = "")
# the columns to keep from each classification table
columns <- c("Pos.", "Equipes", "GP", "GC", "SG")

df <- purrr::map2_dfr(urls, years, ~
  read_html(.x, encoding = "utf-8") %>%
    # match the table via its class plus a substring it contains
    html_element('.wikitable:contains("ou rebaixamento")') %>%
    html_table() %>%
    .[columns] %>%
    # tag each row with its year; SG as character so types bind across years
    mutate(year = .y, SG = as.character(SG)))

Your error, by the way, comes from piping link (a character string) into html_nodes() inside your function, instead of pagina_tabela, the page you parsed with read_html(). With that fixed, you can get all the tables from those links by doing this:
tabela <- function(link){
  read_html(link) %>% html_nodes("table.wikitable") %>% html_table()
}

all_tables <- lapply(links_temporadas, tabela)
names(all_tables) <- 2003:2022
This gives you a list of length 20, named 2003 to 2022 (i.e. one element for each of those years). Each element is itself a list of tables (i.e. the tables that are available at that link of links_temporadas). Note that the number of tables available at each link varies.
lengths(all_tables)
2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020 2021 2022
   6    5   10    9   10   12   11   10   12   11   13   14   17   16   16   16   16   15   17    7
You will need to determine which table(s) you are interested in from each of these years.
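If what you're after is the classification table each year, one option is to filter each year's list by a column you expect that table to have. A minimal sketch, assuming the classification tables carry a "Pos." column (as the answer below also relies on):
classificacao <- lapply(all_tables, function(tabs) {
  # keep only the tables whose columns include "Pos."
  tabs[sapply(tabs, function(t) "Pos." %in% names(t))]
})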

Here is a way. It's more complicated than your function because those pages have more than one table, so the function returns only the first table whose column names include "Pos.".
Then, before rbinding the tables, keep only the common columns, since the older tables have one column less (column "M").
suppressPackageStartupMessages({
  library("dplyr")
  library("rvest")
})

link_wikipedia <- "https://pt.wikipedia.org/wiki/Campeonato_Brasileiro_de_Futebol_de_2002"
pagina_wikipedia <- read_html(link_wikipedia)

links_temporadas <- pagina_wikipedia %>%
  html_nodes("tr:nth-child(9) div a") %>%
  html_attr("href") %>%
  paste("https://pt.wikipedia.org", ., sep = "")

tabela <- function(link){
  pagina_tabela <- read_html(link)
  lista_wiki <- pagina_tabela %>%
    html_elements("table.wikitable") %>%
    html_table()
  # keep the first table that has a "Pos." column
  i <- sapply(lista_wiki, \(x) "Pos." %in% names(x))
  i <- which(i)[1]
  lista_wiki[[i]]
}

tabela_geral <- sapply(links_temporadas, FUN = tabela, USE.NAMES = FALSE)
sapply(tabela_geral, ncol)
#> [1] 12 12 12 12 12 12 13 13 13 13 13 13 13 13 13 13 13 13 13 13
#sapply(tabela_geral, names)
common_names <- Reduce(intersect, lapply(tabela_geral, names))
tabela_reduzida <- lapply(tabela_geral, `[`, common_names)
tabela_final <- do.call(rbind, tabela_reduzida)
head(tabela_final)
#> # A tibble: 6 x 12
#> Pos. Equipes P J V E D GP GC SG `%`
#> <int> <chr> <chr> <int> <int> <int> <int> <int> <int> <chr> <int>
#> 1 1 Cruzeiro 100 46 31 7 8 102 47 +55 72
#> 2 2 Santos 87 46 25 12 9 93 60 +33 63
#> 3 3 São Paulo 78 46 22 12 12 81 67 +14 56
#> 4 4 São Caetano 742 46 19 14 13 53 37 +16 53
#> 5 5 Coritiba 73 46 21 10 15 67 58 +9 52
#> 6 6 Internacional 721 46 20 10 16 59 57 +2 52
#> # ... with 1 more variable: `Classificação ou rebaixamento` <chr>
Created on 2022-04-03 by the reprex package (v2.0.1)
To have all columns, including the "M" column:
data.table::rbindlist(tabela_geral, fill = TRUE)

Related

Can I scrape a single list that spans across multiple pages when webscraping with R?

Okay so I'm trying to scrape the table with dog temperaments from this website: https://atts.org/breed-statistics/statistics-page1/
However the table spans across 8 pages in total (and therefore 8 unique urls)
Currently, for page 1 of the table, I have written the following code:
url <- "https://atts.org/breed-statistics/statistics-page1/"
webpage <- read_html(url)
bn_data_html <- html_nodes(webpage, "td:nth-child(1)")
bn_data <- html_text(bn_data_html)
nt_data_html <- html_nodes(webpage, "td:nth-child(2)")
nt_data <- html_text(nt_data_html)
passed_data_html <- html_nodes(webpage, "td:nth-child(3)")
passed_data <- html_text(passed_data_html)
failed_data_html <- html_nodes(webpage, "td:nth-child(4)")
failed_data <- html_text(failed_data_html)
percent_data_html <- html_nodes(webpage, "td:nth-child(5)")
percent_data <- html_text(percent_data_html)
breeds <- data.frame(Breed = bn_data, Number_tested = nt_data, Passed = passed_data, Failed = failed_data, Percent = percent_data)
This works wonderfully to scrape the data from the first page. However, in order to scrape the entire table, the only way I can think of is to replace the original url and rerun the chunk of code eight times, once for each page of the table. Is there a way to do this without having to rerun it eight times? Say the table spanned 100 pages and rerunning the code that many times just wasn't feasible?
This is how you get them dogs into a data frame, scraping pages 1 to 8. Note the usage of html_table().
library(tidyverse)
library(rvest)
get_dogs <- function(page) {
  str_c("https://atts.org/breed-statistics/statistics-page", page) %>%
    read_html() %>%
    html_table() %>%
    getElement(1) %>%
    janitor::row_to_names(1) %>%
    janitor::clean_names()
}

dogs_df <- map_dfr(1:8, get_dogs)
# A tibble: 250 x 5
breed_name tested passed failed percent
<chr> <chr> <chr> <chr> <chr>
1 AFGHAN HOUND 165 120 45 72.7%
2 AIREDALE TERRIER 110 86 24 78.2%
3 AKBASH DOG 16 14 2 87.5%
4 AKITA 598 465 133 77.8%
5 ALAPAHA BLUE BLOOD BULLDOG 12 9 3 75.0%
6 ALASKAN KLEE KAI 2 1 1 50.0%
7 ALASKAN MALAMUTE 244 207 37 84.8%
8 AMERICAN BANDAGGE 1 1 0 100.0%
9 AMERICAN BULLDOG 214 186 28 86.9%
10 AMERICAN ESKIMO 86 71 15 82.6%
# ... with 240 more rows
# i Use `print(n = ...)` to see more rows
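Because the page's real header row arrived as a data row (hence the row_to_names(1) step), every column comes back as character. If you want numbers to work with, a small follow-up sketch (column names taken from the output above):
dogs_df <- dogs_df %>%
  mutate(across(c(tested, passed, failed), as.integer),
         # parse_number() drops the trailing "%" sign
         percent = readr::parse_number(percent))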

how to scrape text from an icon - R

I'm trying to scrape all the data from this website. There are icons over some of the competitors names indicating that the person was disqualified for being a 'no-show'.
I would like to create a data frame with all the competitors while also specifying who was disqualified, but I'm running into two issues:
(1) trying to add the disclaimer next to the person's name produces the error cannot coerce class ‘"xml_nodeset"’ to a data.frame.
(2) trying to extract the text from just the icon (and not the competitor names) produces a blank data frame.
library(rvest); library(tidyverse)

html = read_html('https://web.archive.org/web/20220913034642/https://www.bjjcompsystem.com/tournaments/1869/categories/2053162')

dq = data.frame(winner = html %>%
                  html_nodes('.match-card__competitor--red') %>%
                  html_text(trim = TRUE),
                opponent = html %>%
                  html_nodes('hr+ .match-card__competitor'),
                dq = html %>%
                  html_nodes('.match-card__disqualification') %>%
                  html_text())
This approach generally works only on tabular data, where you can be sure that the number of matches for each of those selectors is constant and the order is also fixed. In your example you have:
127 matches for .match-card__competitor--red
127 matches for hr+ .match-card__competitor
14 matches for .match-card__disqualification (you get no results for this because you should use html_attr("title") to read the title attribute instead of html_text())
Basically you are trying to combine columns of different lengths into the same data frame. Even if it worked, you'd just add DSQ to the first 14 matches.
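You can confirm the mismatch yourself with length() (counts as reported above):
length(html_nodes(html, '.match-card__competitor--red'))   # 127
length(html_nodes(html, 'hr+ .match-card__competitor'))    # 127
length(html_nodes(html, '.match-card__disqualification'))  # 14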
As you'd probably want to keep information about matches, participants, results and disqualifications instead of just having a list of participants, I'd suggest working with a list of match cards, i.e. extracting all required information from a single card without breaking relations, then moving to the next card.
My purrr is far from perfect, but perhaps something like this:
library(rvest)
library(magrittr)
library(purrr)
library(dplyr)
library(tibble)
library(tidyr)

# helpers -----------------------------------------------------------------

# to keep matches with details (when/where) in header
is_valid_match <- function(element){
  return(length(html_elements(element, ".bracket-match-header")) > 0)
}

# detect winner
is_winner <- function(element){
  return(length(html_elements(element, ".match-competitor--loser")) < 1)
}

# extract data from competitor sections
comp_details <- function(comp_card, prefix="_"){
  l = lst()
  l[paste(prefix, "n", sep = "")] <- comp_card %>% html_element(".match-card__competitor-n") %>% html_text()
  l[paste(prefix, "name", sep = "")] <- comp_card %>% html_element(".match-card__competitor-name") %>% html_text()
  l[paste(prefix, "club", sep = "")] <- comp_card %>% html_element(".match-card__club-name") %>% html_text()
  l[paste(prefix, "dq", sep = "")] <- comp_card %>% html_element(".match-card__disqualification") %>% html_attr("title")
  l[paste(prefix, "won", sep = "")] <- comp_card %>% html_element(".match-competitor--loser") %>% length() == 0
  return(l)
}

# scrape & process --------------------------------------------------------

html <- read_html('https://web.archive.org/web/20220913034642/https://www.bjjcompsystem.com/tournaments/1869/categories/2053162')

html %>%
  # collect all match cards
  html_elements("div.tournament-category__match") %>%
  keep(is_valid_match) %>%
  # apply anonymous function to every item in the list of match cards
  map(function(match_card){
    match_id <- match_card %>% html_element(".tournament-category__match-card") %>% html_attr("id")
    where <- match_card %>% html_element(".bracket-match-header__where") %>% html_text()
    when <- match_card %>% html_element(".bracket-match-header__when") %>% html_text()
    competitors <- html_nodes(match_card, ".match-card__competitor")
    # extract competitor data
    comp01 <- competitors[[1]] %>% comp_details(prefix = "comp01_")
    comp02 <- competitors[[2]] %>% comp_details(prefix = "comp02_")
    winner_idx <- competitors %>% detect_index(is_winner)
    # lst for creating a named list
    l <- lst(match_id, where, when, winner_idx)
    # combine all items and comp lists into a single list
    l <- c(l, comp01, comp02)
    return(l)
  }) %>%
  # turn each resulting list into a single-row tibble
  map(as_tibble) %>%
  # reduce list of tibbles into a single tibble
  reduce(bind_rows)
Result:
#> # A tibble: 65 × 14
#> match_id where when winne…¹ comp0…² comp0…³ comp0…⁴ comp0…⁵ comp0…⁶ comp0…⁷
#> <chr> <chr> <chr> <int> <chr> <chr> <chr> <chr> <lgl> <chr>
#> 1 match-1-1 FIGH… Sat … 2 58 Christ… Rodrig… <NA> FALSE 66
#> 2 match-1-9 FIGH… Sat … 2 6 Melvin… GF Team Disqua… FALSE 66
#> 3 match-1-… FIGH… Sat … 2 47 Eric R… Atos J… <NA> FALSE 66
#> 4 match-1-… FIGH… Sat … 1 47 Eric R… Atos J… <NA> TRUE 10
#> 5 match-1-… FIGH… Sat … 2 42 Ivan M… CheckM… <NA> FALSE 66
#> 6 match-1-… FIGH… Sat … 2 18 Joel S… Gracie… <NA> FALSE 47
#> 7 match-1-… FIGH… Sat … 1 42 Ivan M… CheckM… <NA> TRUE 26
#> 8 match-1-… FIGH… Sat … 2 34 Matthe… Super … <NA> FALSE 18
#> 9 match-2-9 FIGH… Sat … 1 62 Bryan … Team J… <NA> TRUE 4
#> 10 match-2-… FIGH… Sat … 2 22 Steffe… Six Bl… <NA> FALSE 30
#> # … with 55 more rows, 4 more variables: comp02_name <chr>, comp02_club <chr>,
#> # comp02_dq <chr>, comp02_won <lgl>, and abbreviated variable names
#> # ¹​winner_idx, ²​comp01_n, ³​comp01_name, ⁴​comp01_club, ⁵​comp01_dq,
#> # ⁶​comp01_won, ⁷​comp02_n
Created on 2022-09-19 with reprex v2.0.2
Also note that not all matches have a winner and both participants can be disqualified (screenshot), so splitting them into winners & opponents might not be optimal.
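Since purrr::detect_index() returns 0 when no element satisfies the predicate, such no-winner matches are easy to flag afterwards. A sketch, assuming the pipeline above was assigned to a tibble called matches (a name I'm making up):
# matches where neither competitor was marked as winner
matches %>% filter(winner_idx == 0)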

Change data type of all columns in list of data frames before using `bind_rows()`

I have a list of data frames, e.g. from the following code:
"https://en.wikipedia.org/wiki/List_of_accidents_and_disasters_by_death_toll" %>%
rvest::read_html() %>%
html_nodes(css = 'table[class="wikitable sortable"]') %>%
html_table(fill = TRUE)
I would now like to combine the data frames into one, e.g. with dplyr::bind_rows(), but get the Error: Can't combine ..1$Deaths <integer> and ..5$Deaths <character>. (the answer suggested here doesn't do the trick).
So I need to convert the data types before row binding. I would like to do this inside a pipe (a tidyverse solution would be ideal) and not loop through the data frames, due to the structure of the remaining project, but instead use something vectorized like lapply(., function(x) {lapply(x %>% mutate_all, as.character)}) (which doesn't work) to convert all values to character.
Can someone help me with this?
You can change all the column classes to characters and bind them together with map_df.
library(tidyverse)
library(rvest)
"https://en.wikipedia.org/wiki/List_of_accidents_and_disasters_by_death_toll" %>%
rvest::read_html() %>%
html_nodes(css = 'table[class="wikitable sortable"]') %>%
html_table(fill = TRUE) %>%
map_df(~.x %>% mutate(across(.fns = as.character)))
# Deaths Date Attraction `Amusement park` Location Incident Injuries
# <chr> <chr> <chr> <chr> <chr> <chr> <chr>
# 1 28 14 Feb… Transvaal Park (entire … Transvaal Park Yasenevo, Mosc… NA NA
# 2 15 27 Jun… Formosa Fun Coast music… Formosa Fun Coast Bali, New Taip… NA NA
# 3 8 11 May… Haunted Castle; a fire … Six Flags Great … Jackson Townsh… NA NA
# 4 7 9 June… Ghost Train; a fire at … Luna Park Sydney Sydney, Austra… NA NA
# 5 7 14 Aug… Skylab; a crane collide… Hamburger Dom Hamburg, (Germ… NA NA
# 6 6 13 Aug… Virginia Reel; a fire a… Palisades Amusem… Cliffside Park… NA NA
# 7 6 29 Jun… Eco-Adventure Valley Sp… OCT East Yantian Distri… NA NA
# 8 5 30 May… Big Dipper; the roller … Battersea Park Battersea, Lon… NA NA
# 9 5 23 Jun… Kuzuluk Aquapark swimmi… Kuzuluk Aquapark Akyazi, Turkey… NA NA
#10 4 24 Jul… Big Dipper; a bolt came… Krug Park Omaha, Nebrask… NA NA
# … with 1,895 more rows
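If you need proper types again after binding, one option is to let readr re-guess the column classes. A sketch, where combined stands for the result of the pipe above:
# re-parse character columns into their most likely types
combined %>% readr::type_convert()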

Struggling to Create a Pivot Table in R

I am very, very new to any type of coding language. I am used to pivot tables in Excel and am trying to replicate in R a pivot I have done in Excel. I have spent a long time searching the internet/YouTube, but I just can't get it to work.
I am looking to produce a table in which the left-hand column shows a number of locations, and across the top of the table it shows the different pages that have been viewed. I want the table to show the number of views per location for each of these pages.
The data frame 'specificreports' shows all views over the past year for different pages on an online platform. I want to filter for the month of October, and then pivot the different Employee Teams against the number of views for different pages.
specificreports <- readxl::read_excel("Multi-Tab File - Dashboard Usage.xlsx",
                                      sheet = "Specific Reports")
specificreportsLocal <- tbl_df(specificreports)
specificreportsLocal %>% filter(Month == "October") %>%
group_by("Employee Team") %>%
This bit works, in that it groups the different team names and filters entries for the month of October. After this I have tried using the summarise function to summarise the number of hits, but I can't get it to work at all. I keep getting errors regarding data type. I keep getting confused because the solutions I look up keep using different packages.
I would appreciate any help, using the simplest way of doing this as I am a total newbie!
Thanks in advance,
Holly
Let's see if I can help a bit. It's hard to know what your data looks like from the info you gave us, so I'm going to guess and make some fake data for us to play with. It's worth noting that having field names with spaces in them is going to make your life really hard. You should start by renaming your fields to something more manageable, for example (a sketch, assuming the Employee Team column from your own code):
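specificreportsLocal <- specificreportsLocal %>%
  rename(employee_team = `Employee Team`)  # backticks needed for names with spaces
Since I'm just making data up, I'll give my fields names without spaces: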
library(tidyverse)

## this makes some fake data
## a data frame with 3 fields: month, team, value
n <- 100
specificreportsLocal <-
  data.frame(
    month = sample(1:12, size = n, replace = TRUE),
    team = letters[1:5],
    value = sample(1:100, size = n, replace = TRUE)
  )
That's just a data frame called specificreportsLocal with three fields: month, team, value
Let's do some things with it:
# This will give us total values by team when month = 10
specificreportsLocal %>%
  filter(month == 10) %>%
  group_by(team) %>%
  summarize(total_value = sum(value))
#> # A tibble: 4 x 2
#> team total_value
#> <fct> <int>
#> 1 a 119
#> 2 b 172
#> 3 c 67
#> 4 d 229
I think that's sort of like what you already did, except I added the summarize to show how it works.
Now let's use all months and reshape it from 'long' to 'wide'
# if I want to see all months I leave out the filter and
# add a group_by month
specificreportsLocal %>%
  group_by(team, month) %>%
  summarize(total_value = sum(value)) %>%
  head(5) # this just shows the first 5 values
#> # A tibble: 5 x 3
#> # Groups: team [1]
#> team month total_value
#> <fct> <int> <int>
#> 1 a 1 17
#> 2 a 2 46
#> 3 a 3 91
#> 4 a 4 69
#> 5 a 5 83
# to make this 'long' data 'wide', we can use the `spread` function
specificreportsLocal %>%
  group_by(team, month) %>%
  summarize(total_value = sum(value)) %>%
  spread(team, total_value)
#> # A tibble: 12 x 6
#> month a b c d e
#> <int> <int> <int> <int> <int> <int>
#> 1 1 17 122 136 NA 167
#> 2 2 46 104 158 94 197
#> 3 3 91 NA NA NA 11
#> 4 4 69 120 159 76 98
#> 5 5 83 186 158 19 208
#> 6 6 103 NA 118 105 84
#> 7 7 NA NA 73 127 107
#> 8 8 NA 130 NA 166 99
#> 9 9 125 72 118 135 71
#> 10 10 119 172 67 229 NA
#> 11 11 107 81 NA 131 49
#> 12 12 174 87 39 NA 41
Created on 2018-12-01 by the reprex package (v0.2.1)
Now I'm not really sure if that's what you want. So feel free to make a comment on this answer if you need any of this clarified.
Welcome to Stack Overflow!
I'm not sure I correctly understand your need without a data sample, but this may work for you:
library(rpivotTable)
# filter first, then feed the filtered data to rpivotTable
specificreportsLocal <- specificreportsLocal %>% filter(Month == "October")
rpivotTable(specificreportsLocal, rows = "Employee Team", cols = "page", vals = "views", aggregatorName = "Sum")
Otherwise, if you do not need it interactive (as the Pivot Tables in Excel), this may work as well:
specificreportsLocal %>%
  filter(Month == "October") %>%
  group_by_at(c("Employee Team", "page")) %>%
  summarise(nr_views = sum(views, na.rm = TRUE))
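And if you want the Excel-style wide layout from that summary, you could spread the pages into columns, as in the first answer. A sketch, assuming the page and views column names from above:
specificreportsLocal %>%
  filter(Month == "October") %>%
  group_by_at(c("Employee Team", "page")) %>%
  summarise(nr_views = sum(views, na.rm = TRUE)) %>%
  tidyr::spread(page, nr_views)  # one column per page, one row per team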

Reshaping Data into panel form

I have data where each object's name is a variable name like EPS, Profit, etc. (around 25 such distinct objects).
The data is arranged like this:
EPS <- read.table(text = "
Year Microsoft Facebook
2001 12 20
2002 15 23
2003 16 19
", header = TRUE)
Profit <- read.table(text = "
Year Microsoft Facebook
2001 15 36
2002 19 40
2003 25 45
", header = TRUE)
I want output like this :
Year Co_Name EPS Profit
2001 Microsoft 12 15
2002 Microsoft 15 19
2003 Microsoft 16 25
2001 Facebook 20 36
2002 Facebook 23 40
2003 Facebook 19 45
How can it be done? Is there any way to arrange the data of all variables as a single object? The data for each variable is imported into R from a csv file like EPS.csv, Profit.csv, etc. Is there any way to create a loop that goes from importing to arranging the data in the desired format?
Just for fun we can also achieve the same result using dplyr, tidyr and purrr.
library(dplyr)
library(tidyr)
library(readr)
library(purrr)
list_of_csv <- list.files(path = ".", pattern = ".csv", full.names = TRUE)
file_name <- gsub(".csv", "", basename(list_of_csv))
list_of_csv %>%
  map(~ read_csv(.)) %>%
  map(~ gather(data = ., key = co_name, value = value, -year)) %>%
  reduce(inner_join, by = c("year", "co_name")) %>%
  setNames(., c("year", "co_name", file_name))
## Source: local data frame [6 x 4]
## year co_name eps profit
## (int) (fctr) (int) (int)
## 1 2001 microsoft 12 15
## 2 2002 microsoft 15 19
## 3 2003 microsoft 16 25
## 4 2001 facebook 20 36
## 5 2002 facebook 23 40
## 6 2003 facebook 19 45
We can get the datasets in a list. If we have already created 'EPS' and 'Profit' as objects, use mget to get those in a list, convert it to a single data.table with rbindlist, melt to long format, and reshape it back to 'wide' with dcast.
library(data.table) # v1.9.6+
DT <- rbindlist(mget(c('EPS', 'Profit')), idcol = TRUE)
DT1 <- dcast(melt(DT, id.var = c('.id', 'Year'), variable.name = 'Co_Name'),
             Year + Co_Name ~ .id, value.var = 'value')
DT1
# Year Co_Name EPS Profit
#1: 2001 Microsoft 12 15
#2: 2001 Facebook 20 36
#3: 2002 Microsoft 15 19
#4: 2002 Facebook 23 40
#5: 2003 Microsoft 16 25
#6: 2003 Facebook 19 45
If we need to arrange it, use order
DT1[order(factor(Co_Name, levels=unique(Co_Name)))]
