Change data type of all columns in list of data frames before using `bind_rows()` - r

I have a list of data frames, e.g. from the following code:
"https://en.wikipedia.org/wiki/List_of_accidents_and_disasters_by_death_toll" %>%
rvest::read_html() %>%
html_nodes(css = 'table[class="wikitable sortable"]') %>%
html_table(fill = TRUE)
I would now like to combine the data frames into one, e.g. with dplyr::bind_rows(), but I get the error: Can't combine `..1$Deaths` <integer> and `..5$Deaths` <character>. (The answer suggested here doesn't do the trick.)
So I need to convert the data types before row binding. I would like to do this inside a pipe (a tidyverse solution would be ideal) and, because of the structure of the rest of the project, without an explicit loop over the data frames. Instead I'd like something functional such as lapply(., function(x) {lapply(x %>% mutate_all, as.character)}) (which doesn't work) to convert all values to character.
Can someone help me with this?

You can convert every column to character and bind the tables together with map_df().
library(tidyverse)
library(rvest)
"https://en.wikipedia.org/wiki/List_of_accidents_and_disasters_by_death_toll" %>%
rvest::read_html() %>%
html_nodes(css = 'table[class="wikitable sortable"]') %>%
html_table(fill = TRUE) %>%
map_df(~ .x %>% mutate(across(everything(), as.character)))
# Deaths Date Attraction `Amusement park` Location Incident Injuries
# <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#1 28 14 Feb… Transvaal Park (entire … Transvaal Park Yasenevo, Mosc… NA NA
#2 15 27 Jun… Formosa Fun Coast music… Formosa Fun Coast Bali, New Taip… NA NA
#3 8 11 May… Haunted Castle; a fire … Six Flags Great … Jackson Townsh… NA NA
#4 7 9 June… Ghost Train; a fire at … Luna Park Sydney Sydney, Austra… NA NA
#5 7 14 Aug… Skylab; a crane collide… Hamburger Dom Hamburg, (Germ… NA NA
# 6 6 13 Aug… Virginia Reel; a fire a… Palisades Amusem… Cliffside Park… NA NA
# 7 6 29 Jun… Eco-Adventure Valley Sp… OCT East Yantian Distri… NA NA
# 8 5 30 May… Big Dipper; the roller … Battersea Park Battersea, Lon… NA NA
# 9 5 23 Jun… Kuzuluk Aquapark swimmi… Kuzuluk Aquapark Akyazi, Turkey… NA NA
#10 4 24 Jul… Big Dipper; a bolt came… Krug Park Omaha, Nebrask… NA NA
# … with 1,895 more rows
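The same idea can be tried offline on toy data. The sketch below is only an illustration: the two small tibbles are made-up stand-ins for the scraped tables, and it uses map() followed by bind_rows() instead of map_df(), which should give an equivalent result here.

```r
library(dplyr)
library(purrr)

# Hypothetical stand-ins for two scraped tables whose Deaths columns
# have different types (integer vs. character)
tables <- list(
  tibble(Deaths = c(28L, 15L), Location = c("Moscow", "New Taipei")),
  tibble(Deaths = c("1,500+", "300"), Incident = c("Fire", "Flood"))
)

# Convert every column of every table to character, then bind the rows;
# columns missing from a table are filled with NA
combined <- tables %>%
  map(~ mutate(.x, across(everything(), as.character))) %>%
  bind_rows()

combined
```

Once the tables are combined, columns such as Deaths can be parsed back to numbers where that makes sense, for example with readr::parse_number().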

Related

How to Scrape multi page website using R language

I want to scrape the contents of a multi-page website using R. Currently I'm able to scrape the first page. How do I scrape all pages and store them in a csv file?
Here's my code so far:
library(rvest)
library(tibble)
library(tidyr)
library(dplyr)
df = 'https://www.taneps.go.tz/epps/viewAllAwardedContracts.do?d-3998960-p=1&selectedItem=viewAllAwardedContracts.do&T01_ps=100' %>%
read_html() %>% html_table()
df
write.csv(df,"Contracts_test_taneps.csv")
Scrape multiple pages. Change 1:2 to 1:N, where N is the number of pages you want.
library(tidyverse)
library(rvest)
get_taneps <- function(page) {
str_c("https://www.taneps.go.tz/epps/viewAllAwardedContracts.do?d-3998960-p=",
page, "&selectedItem=viewAllAwardedContracts.do&T01_ps=100") %>%
read_html() %>%
html_table() %>%
getElement(1) %>%
janitor::clean_names()
}
map_dfr(1:2, get_taneps)
# A tibble: 200 x 7
tender_no procuring_entity suppl~1 award~2 award~3 lot_n~4 notic~5
<chr> <chr> <chr> <chr> <chr> <chr> <lgl>
1 AE/005/2022-2023/MOROGORO/FA/G/01 Morogoro Municipal Council SHIBAM~ 08/11/~ "66200~ N/A NA
2 AE/005/2022-2023/DODOMA/FA/NC/02 Ministry of Livestock and Fish~ NINO G~ 04/11/~ "46511~ N/A NA
3 LGA/014/2022/2023/G/01 UTAWALA Bagamoyo District Council VILANG~ 02/11/~ "90000~ N/A NA
4 LGA/014/014/2022/2023/G/01 FEDHA 3EPICAR Bagamoyo District Council VILANG~ 02/11/~ "88100~ N/A NA
5 LGA/014/2022/2023/G/01/ARDHI Bagamoyo District Council VILANG~ 31/10/~ "16088~ N/A NA
6 LGA/014/2022/2023/G/11 VIFAA VYA USAFI SOKO LA SAMAKI Bagamoyo District Council MBUTUL~ 31/10/~ "10000~ N/A NA
7 DCD - 000899- 400E - ANIMAL FEEDS Kibaha Education Centre ALOYCE~ 29/10/~ "82400~ N/A NA
8 AE/005/2022-2023/MOROGORO/FA/G/01 Morogoro Regional Referral Hos~ JIGABH~ 02/11/~ "17950~ N/A NA
9 IE/023/2022-23/HQ/G/13 Commission for Mediation and A~ AKO GR~ 27/10/~ "42500~ N/A NA
10 AE/005/2022-2023/MOROGORO/FA/G/05 Morogoro Municipal Council THE GR~ 01/11/~ "17247~ N/A NA
# ... with 190 more rows, and abbreviated variable names 1: supplier_name, 2: award_date, 3: award_amount, 4: lot_name,
# 5: notice_pdf
# i Use `print(n = ...)` to see more rows
Write the combined result to a .csv file
map_dfr(1:2, get_taneps) %>% write_csv("Contracts_test_taneps.csv")
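When many pages are requested, one failing request stops the whole map_dfr() call. One possible safeguard is to wrap the page function in purrr::possibly(), so that a failed page yields NULL and is skipped. This is a sketch under assumptions: taneps_url() is an illustrative helper, and fake_fetch() is a hypothetical stand-in for get_taneps() so that the example runs without network access.

```r
library(purrr)
library(dplyr)

# Illustrative helper (not part of the answer): build the URL for one page
taneps_url <- function(page) {
  paste0("https://www.taneps.go.tz/epps/viewAllAwardedContracts.do?d-3998960-p=",
         page, "&selectedItem=viewAllAwardedContracts.do&T01_ps=100")
}

# Hypothetical stand-in for get_taneps(): fails on page 2, succeeds otherwise
fake_fetch <- function(page) {
  if (page == 2) stop("simulated timeout")
  data.frame(page = page, url = taneps_url(page))
}

# possibly() returns `otherwise` instead of raising the error
safe_fetch <- possibly(fake_fetch, otherwise = NULL)

# Fetch pages 1 to 3, drop the failed ones, and bind the rest
result <- map(1:3, safe_fetch) %>%
  compact() %>%
  bind_rows()

result$page
```

With a real scraper, replacing fake_fetch with get_taneps and adding a short Sys.sleep() between requests would be the natural next step.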

Web scraping with R (rvest)

I'm new to R and am having some trouble creating a good web scraper with R. I started studying this language only 5 days ago, so I'd appreciate any help!
Idea
I'm trying to scrape the classification table of the "Campeonato Brasileiro" from 2003 to 2021 on Wikipedia, so that I can group the teams later and analyze them.
Explanation and problem
I'm scraping the page of the 2002 championship. I read the HTML page and extract the HTML nodes that I selected with the "SelectorGadget" extension for Google Chrome. Some considerations:
The page that I'm accessing is the 2002 championship page. I did that because it was easier to extract the links to the tables from a box at the bottom of the page, using a single selector for all of them (tr:nth-child(9) div a) and reading their HTML attribute "href";
The selected CSS was from the 2003 championship page.
So, in my twisted mind I thought: "Hey! I'm going to create a function to extract the tables from those pages and I'll save them in a data frame!" However, it went wrong and I don't understand why. When I tried to run the "tabela_geral" line, the following error was returned: "Error in UseMethod("xml_find_all") : no applicable method for 'xml_find_all' applied to an object of class "character"". I think it is reading a string instead of an XML document. What am I misunderstanding here? Where is my error? The "sapply" call? Thanks in advance!
The code
library("dplyr")
library("rvest")
link_wikipedia <- "https://pt.wikipedia.org/wiki/Campeonato_Brasileiro_de_Futebol_de_2002"
pagina_wikipedia <- read_html(link_wikipedia)
links_temporadas <- pagina_wikipedia %>%
html_nodes("tr:nth-child(9) div a") %>%
html_attr("href") %>%
paste("https://pt.wikipedia.org", ., sep = "")
tabela <- function(link){
pagina_tabela <- read_html(link)
tabela_wiki = link %>%
html_nodes("table.wikitable") %>%
html_table() %>%
paste(collapse = "|")
}
tabela_geral <- sapply(links_temporadas, FUN = tabela, USE.NAMES = FALSE)
tabela_final <- data.frame(tabela_geral)
You can use :contains to target the appropriate table by its class plus a substring that the table contains. You can then use html_table() to extract the matched node in tabular format and subset it on a vector of desired columns. I don't know the correct football terms, so I have guessed the columns to subset on; adjust the columns vector as needed.
If you wrap the years and the constructed URLs in a map2_dfr() call, you get back a single data frame for all desired years.
library(tidyverse)
library(rvest)
years <- 2003:2021
urls <- paste("https://pt.wikipedia.org/wiki/Campeonato_Brasileiro_de_Futebol_de_", years, sep = "")
columns <- c("Pos.", "Equipes", "GP", "GC", "SG")
df <- purrr::map2_dfr(urls, years, ~
read_html(.x, encoding = "utf-8") %>%
html_element('.wikitable:contains("ou rebaixamento")') %>%
html_table() %>%
.[columns] %>%
mutate(year = .y, SG = as.character(SG)))
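As for the error in the question: in the original tabela() function, the URL string link is piped into html_nodes() instead of the parsed page pagina_tabela, so rvest receives a character vector rather than a document. A minimal offline sketch of the difference follows, assuming rvest 1.0 or later (for minimal_html() and html_elements()); the HTML snippet is made up.

```r
library(rvest)

# Made-up HTML standing in for a Wikipedia season page
html <- '<table class="wikitable">
  <tr><th>Pos.</th><th>Equipes</th></tr>
  <tr><td>1</td><td>Cruzeiro</td></tr>
</table>'
page <- minimal_html(html)

# Correct: select nodes from the parsed document
tabs <- page %>%
  html_elements("table.wikitable") %>%
  html_table()

# Incorrect: a plain string is not a parsed document, so this raises an error
bad <- try(html_elements("https://example.org", "table.wikitable"), silent = TRUE)
inherits(bad, "try-error")
```

In the question's function, changing `link %>%` to `pagina_tabela %>%` is the corresponding fix.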
You can get all the tables from those links by doing this:
tabela <- function(link){
read_html(link) %>% html_nodes("table.wikitable") %>% html_table()
}
all_tables = lapply(links_temporadas, tabela)
names(all_tables)<-2003:2022
This gives you a list of length 20, named 2003 to 2022 (i.e. one element for each of those years). Each element is itself a list of tables (i.e. the tables that are available at that link in links_temporadas). Note that the number of tables available at each link varies.
lengths(all_tables)
2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020 2021 2022
6 5 10 9 10 12 11 10 12 11 13 14 17 16 16 16 16 15 17 7
You will need to determine which table(s) you are interested in from each of these years.
Here is a way. It's more complicated than your function because those pages have more than one table, so the function returns only the first table with a column named "Pos.".
Then, before rbinding the tables, keep only the common columns, since the older tables have one column fewer (column "M").
suppressPackageStartupMessages({
library("dplyr")
library("rvest")
})
link_wikipedia <- "https://pt.wikipedia.org/wiki/Campeonato_Brasileiro_de_Futebol_de_2002"
pagina_wikipedia <- read_html(link_wikipedia)
links_temporadas <- pagina_wikipedia %>%
html_nodes("tr:nth-child(9) div a") %>%
html_attr("href") %>%
paste("https://pt.wikipedia.org", ., sep = "")
tabela <- function(link){
pagina_tabela <- read_html(link)
lista_wiki <- pagina_tabela %>%
html_elements("table.wikitable") %>%
html_table()
i <- sapply(lista_wiki, \(x) "Pos." %in% names(x))
i <- which(i)[1]
lista_wiki[[i]]
}
tabela_geral <- sapply(links_temporadas, FUN = tabela, USE.NAMES = FALSE)
sapply(tabela_geral, ncol)
#> [1] 12 12 12 12 12 12 13 13 13 13 13 13 13 13 13 13 13 13 13 13
#sapply(tabela_geral, names)
common_names <- Reduce(intersect, lapply(tabela_geral, names))
tabela_reduzida <- lapply(tabela_geral, `[`, common_names)
tabela_final <- do.call(rbind, tabela_reduzida)
head(tabela_final)
#> # A tibble: 6 x 12
#> Pos. Equipes P J V E D GP GC SG `%`
#> <int> <chr> <chr> <int> <int> <int> <int> <int> <int> <chr> <int>
#> 1 1 Cruzeiro 100 46 31 7 8 102 47 +55 72
#> 2 2 Santos 87 46 25 12 9 93 60 +33 63
#> 3 3 São Paulo 78 46 22 12 12 81 67 +14 56
#> 4 4 São Caetano 742 46 19 14 13 53 37 +16 53
#> 5 5 Coritiba 73 46 21 10 15 67 58 +9 52
#> 6 6 Internacional 721 46 20 10 16 59 57 +2 52
#> # ... with 1 more variable: `Classificação ou rebaixamento` <chr>
Created on 2022-04-03 by the reprex package (v2.0.1)
To have all columns, including the "M" columns:
data.table::rbindlist(tabela_geral, fill = TRUE)
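The "keep only the common columns" step can be checked on toy data without scraping anything. In this base R sketch, the two small data frames are made-up stand-ins for an older table (without "M") and a newer one (with "M").

```r
# Hypothetical stand-ins for an older and a newer season table;
# check.names = FALSE keeps the column name "Pos." untouched
old_tab <- data.frame(`Pos.` = 1:2, Equipes = c("A", "B"), P = c(10, 8),
                      check.names = FALSE)
new_tab <- data.frame(`Pos.` = 1:2, Equipes = c("C", "D"), P = c(9, 7),
                      M = c(1, 0), check.names = FALSE)
tabs <- list(old_tab, new_tab)

# Column names present in every table
common_names <- Reduce(intersect, lapply(tabs, names))

# Subset each table to those columns, then stack the rows
reduced <- lapply(tabs, `[`, common_names)
final <- do.call(rbind, reduced)

final
```

Reduce(intersect, ...) folds intersect() over the list of name vectors, so a column survives only if it appears in every table.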

approximate character matching using R

I have two data files. One contains a single column with the name of the company (usually a hospital), and the other contains a list of companies with their respective addresses. The problem is that the company names do not match exactly. How can I match them approximately?
> dput(head(HOSPITALS[130:140,], 10))
I would like to obtain one data file in which each company is matched with an address, if one is available in the address file.
Check out the fuzzyjoin package and the stringdist_join functions.
Here's a starting point. In your example data, ignore_case = TRUE solves the matching problem. Depending on what the full data looks like, you will have to experiment with the arguments (e.g. max_dist) and possibly filter the result until you achieve what you want.
library(dplyr)
library(fuzzyjoin)
HOSPITALS %>%
stringdist_left_join(GH_MY,
by = c("hospital" = "hospital_name"),
ignore_case = TRUE,
max_dist = 2,
distance_col = "dist")
Result:
# A tibble: 10 x 6
hospital hospital_name adress district town dist
<chr> <chr> <chr> <chr> <chr> <dbl>
1 HOSPITAL PAPAR Hospital Papar Peti Surat No. 6, Papar Sabah 0
2 HOSPITAL PARIT BUNT~ Hospital Parit ~ Jalan Sempadan Parit Bun~ Perak 0
3 HOSPITAL PEKAN Hospital Pekan 26600 Pekan Pekan Pahang 0
4 HOSPITAL PENAWAR SD~ NA NA NA NA NA
5 HOSPITAL PORT DICKS~ Hospital Port D~ KM 11, Jalan Pantai Port Dick~ Negeri ~ 0
6 HOSPITAL PULAU PINA~ Hospital Pulau ~ Jalan Residensi Pulau Pin~ Pulau P~ 0
7 HOSPITAL PUSRAWI SD~ NA NA NA NA NA
8 HOSPITAL PUSRAWI SM~ NA NA NA NA NA
9 HOSPITAL PUTRAJAYA Hospital Putraj~ Pusat Pentadbiran Ker~ Putrajaya WP Putr~ 0
10 HOSPITAL QUEEN ELIZ~ NA NA NA NA NA
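To see why ignore_case matters with max_dist = 2, the edit distance between two spellings of one name can be checked directly. This sketch uses base R's adist() purely as an illustration; fuzzyjoin relies on the stringdist package, whose distance methods may differ in detail from adist().

```r
# Two spellings of the same hospital, as in the example data
a <- "HOSPITAL PAPAR"
b <- "Hospital Papar"

# Case-sensitive edit distance: every differing letter counts as an edit
d_case <- adist(a, b)[1, 1]

# Case-insensitive edit distance
d_nocase <- adist(a, b, ignore.case = TRUE)[1, 1]

c(case_sensitive = d_case, case_insensitive = d_nocase)

# Only the case-insensitive distance falls within a threshold of 2
d_nocase <= 2
d_case <= 2
```

An alternative with a similar effect is to normalise both name columns with toupper() or tolower() before joining.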

How to create a Markdown table with different column lengths based on a dataframe in long format in R?

I'm working on an R Markdown file that I would like to submit as a manuscript to an academic journal. I would like to create a table that shows which three words (item2) co-occur most frequently with some keywords (item1). Note that some keywords have more than three co-occurring words. The data that I am currently working with:
item1 <- c("water","water","water","water","water","sun","sun","sun","sun","moon","moon","moon")
item2 <- c("tree","dog","cat","fish","eagle","bird","table","bed","flower","house","desk","tiger")
n <- c("200","83","34","34","34","300","250","77","77","122","46","46")
df <- data.frame(item1,item2,n)
Which gives this dataframe:
item1 item2 n
1 water tree 200
2 water dog 83
3 water cat 34
4 water fish 34
5 water eagle 34
6 sun bird 300
7 sun table 250
8 sun bed 77
9 sun flower 77
10 moon house 122
11 moon desk 46
12 moon tiger 46
Ultimately, I would like to pass the data to the function papaja::apa_table, which requires a data.frame (or a matrix / list). I therefore need to reshape the data.
My question:
How can I reshape the data (preferably with dplyr) to get the following structure?
water_item2 water_n sun_item2 sun_n moon_item2 moon_n
1 tree 200 bird 300 house 122
2 dog 83 table 250 desk 46
3 cat 34 bed 77 tiger 46
4 fish 34 flower 77 <NA> <NA>
5 eagle 34 <NA> <NA> <NA> <NA>
We can borrow an approach from an old answer of mine to a different question, and modify a classic gather(), unite(), spread() strategy by creating unique identifiers by group to avoid duplicate identifiers, then dropping that variable:
library(dplyr)
library(tidyr)
item1 <- c("water","water","water","water","water","sun","sun","sun","sun","moon","moon","moon")
item2 <- c("tree","dog","cat","fish","eagle","bird","table","bed","flower","house","desk","tiger")
n <- c("200","83","34","34","34","300","250","77","77","122","46","46")
# Owing to Richard Telford's excellent comment,
# I use data_frame() (or equivalently for our purposes,
# data.frame(..., stringsAsFactors = FALSE))
# to avoid turning the strings into factors
df <- data_frame(item1,item2,n)
df %>%
group_by(item1) %>%
mutate(id = 1:n()) %>%
ungroup() %>%
gather(temp, val, item2, n) %>%
unite(temp2, item1, temp, sep = '_') %>%
spread(temp2, val) %>%
select(-id)
# A tibble: 5 x 6
moon_item2 moon_n sun_item2 sun_n water_item2 water_n
<chr> <chr> <chr> <chr> <chr> <chr>
1 house 122 bird 300 tree 200
2 desk 46 table 250 dog 83
3 tiger 46 bed 77 cat 34
4 NA NA flower 77 fish 34
5 NA NA NA NA eagle 34
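In current tidyr, gather() and spread() are superseded by pivot_longer() and pivot_wider(). A possible equivalent of the approach above is sketched here, assuming tidyr 1.2.0 or later (for the names_vary argument); it has not been checked against papaja::apa_table.

```r
library(dplyr)
library(tidyr)

df <- tibble(
  item1 = c("water","water","water","water","water","sun","sun","sun","sun","moon","moon","moon"),
  item2 = c("tree","dog","cat","fish","eagle","bird","table","bed","flower","house","desk","tiger"),
  n     = c("200","83","34","34","34","300","250","77","77","122","46","46")
)

wide <- df %>%
  group_by(item1) %>%
  mutate(id = row_number()) %>%   # position within each keyword
  ungroup() %>%
  pivot_wider(
    id_cols     = id,
    names_from  = item1,
    values_from = c(item2, n),
    names_glue  = "{item1}_{.value}",  # e.g. water_item2, water_n
    names_vary  = "slowest"            # intended to keep each keyword's columns adjacent
  ) %>%
  select(-id)

wide
```

Keywords with fewer rows are padded with NA, as in the gather()/spread() version.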

How can you create custom headers using Table function in R?

I have a data frame for each team that looks like nebraska below. However, some of these poor teams don't have a single win, so their $Outcome column has nothing but L in them.
> nebraska
Teams Away/Home Score Outcome
1 Arkansas State Away 36
2 Nebraska Home 43 W
3 Nebraska Away 35 L
4 Oregon Home 42
5 Northern Illinois Away 21
6 Nebraska Home 17 L
7 Rutgers Away 17
8 Nebraska Home 27 W
9 Nebraska Away 28 W
10 Illinois Home 6
11 Wisconsin Away 38
12 Nebraska Home 17 L
13 Ohio State Away 56
14 Nebraska Home 14 L
When I run table(nebraska$Outcome), it gives me my expected outcome:
table(nebraska$Outcome)
L W
7 4 3
However, for the teams that don't have a single win (like Baylor), or only have wins, it only gives me something like:
table(baylor$Outcome)
L
7 7
I'd like to specify custom headers for the table function so that I get something like this output:
table(baylor$Outcome)
L W
7 7 0
I've tried passing the argument dnn to the table function call, but it throws an error with the following code:
> table(baylor$Outcome,dnn = c("W","L",""))
Error in names(dn) <- dnn :
'names' attribute [3] must be the same length as the vector [1]
Can someone tell me how I can tabulate these wins and losses correctly?
Try this:
with(rle(sort(nebraska$Outcome)),
data.frame(W = max(0, lengths[values == "W"]),
L = max(0, lengths[values == "L"])))
# W L
#1 3 4
I don't think this has to be that complicated. Just make baylor$Outcome a factor and then table. E.g.:
# example data
baylor <- data.frame(Outcome = c("L","L","L"))
Then it is just:
baylor$Outcome <- factor(baylor$Outcome, levels=c("","L","W"))
table(baylor$Outcome)
# L W
#0 3 0
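The factor approach can be wrapped in a small helper so that every team is tabulated with the same headers. In this sketch, count_outcomes() is a hypothetical name and the sample vector is made up; restricting the levels to "W" and "L" turns the empty strings into NA, which table() drops by default.

```r
# Hypothetical helper: always report both W and L, even when one is absent
count_outcomes <- function(outcome) {
  table(factor(outcome, levels = c("W", "L")))
}

# Made-up Outcome column for a team without a single win
baylor_outcome <- c("", "L", "", "L", "L")

res <- count_outcomes(baylor_outcome)
res
```

Keeping "" in the levels, as in the answer above, additionally reports the count of rows without an outcome.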
Following a tidy workflow, I offer...
library(dplyr)
library(tidyr)
df <- nebraska %>%
group_by(Teams, Outcome) %>%
summarise(n = n()) %>%
spread(Outcome, n) %>%
select(-c(`<NA>`))
# # A tibble: 8 x 3
# # Groups: Teams [8]
# Teams L W
# * <chr> <int> <int>
# 1 Arkansas State NA NA
# 2 Illinois NA NA
# 3 Nebraska 4 3
# 4 Northern Illinois NA NA
# 5 Ohio State NA NA
# 6 Oregon NA NA
# 7 Rutgers NA NA
# 8 Wisconsin NA NA
...and I couldn't help but make it pretty with knitr::kable and kableExtra:
library(knitr)
library(kableExtra)
df %>%
kable("html") %>%
kable_styling(bootstrap_options = c("striped", "hover"))
