Find HTML table name and scrape in R

I'm trying to scrape a table from a web page that has multiple tables. I'd like to get the "FIPS Codes for the States and the District of Columbia" table from https://www.census.gov/geo/reference/ansi_statetables.html. I think XML::readHTMLTable() is the right way to go, but when I try the following I get an error:
url = "https://www.census.gov/geo/reference/ansi_statetables.html"
readHTMLTable(url, header = T, stringsAsFactors = F)
named list()
Warning message:
XML content does not seem to be XML: 'https://www.census.gov/geo/reference/ansi_statetables.html'
This is not surprising, of course, because I'm not giving the function any indication of which table I'd like to read. I've dug around in "Inspect" for quite a while, but I'm not connecting the dots on how to be more precise. The table doesn't seem to have a name or class analogous to other examples I've found in documentation or on SO. Thoughts?

Consider using readLines() to scrape the html page content and use the result in readHTMLTable():
url = "https://www.census.gov/geo/reference/ansi_statetables.html"
webpage <- readLines(url)
readHTMLTable(webpage, header = T, stringsAsFactors = F) # LIST OF 3 TABLES
# $`NULL`
# Name FIPS State Numeric Code Official USPS Code
# 1 Alabama 01 AL
# 2 Alaska 02 AK
# 3 Arizona 04 AZ
# 4 Arkansas 05 AR
# 5 California 06 CA
# 6 Colorado 08 CO
# 7 Connecticut 09 CT
# 8 Delaware 10 DE
# 9 District of Columbia 11 DC
# 10 Florida 12 FL
# 11 Georgia 13 GA
# 12 Hawaii 15 HI
# 13 Idaho 16 ID
# 14 Illinois 17 IL
# ...
To return a specific data frame:
fipsdf <- readHTMLTable(webpage, header = T, stringsAsFactors = F)[[1]]
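If you would rather not rely on the table's position in the list, you can also pick it out by a header it is known to contain. A sketch, assuming "Official USPS Code" appears only in the FIPS table:
tables <- readHTMLTable(webpage, header = T, stringsAsFactors = F)
fipsdf <- Filter(function(tbl) "Official USPS Code" %in% names(tbl), tables)[[1]]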

Another solution, using rvest instead of XML:
require(rvest)
read_html("https://www.census.gov/geo/reference/ansi_statetables.html") %>%
html_table %>% .[[1]]
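If positional indexing worries you here as well, rvest's CSS selectors also support the non-standard :contains() pseudo-class, so you can match on text the target table contains. A sketch, assuming the string "FIPS Codes for the States" actually occurs inside the table element:
read_html("https://www.census.gov/geo/reference/ansi_statetables.html") %>%
  html_node('table:contains("FIPS Codes for the States")') %>%
  html_table()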

Related

Web scraping with R (rvest)

I'm new to R and am having some trouble creating a good web scraper with R... It has been only 5 days since I started studying this language, so I'll appreciate any help!
Idea
I'm trying to scrape the classification tables of the "Campeonato Brasileiro" from 2003 to 2021 on Wikipedia, to group the teams later and analyze some stuff.
Explanation and problem
I'm scraping the page of the 2002 championship. I read the HTML page to extract the HTML nodes that I selected with the "SelectorGadget" extension in Google Chrome. There are some considerations:
The page that I'm accessing is for the 2002 championship. I did that because it was easier to extract the links to the tables, which sit on a board at the end of the page, by using a single selector for all of them (tr:nth-child(9) div a) and reading their links from the HTML attribute "href";
The CSS selector itself was taken from the 2003 championship page.
So, in my twisted mind I thought: "Hey! I'm going to create a function to extract the tables from those pages and save them in a data frame!". However, it went wrong and I don't understand why... When I tried to run the "tabela_geral" line, the following error was returned: "Error in UseMethod("xml_find_all") : no applicable method for 'xml_find_all' applied to an object of class "character"". I think it is reading a string instead of an XML document. What am I misunderstanding here? Where is my error? The "sapply" call? Thanks in advance!
The code
library("dplyr")
library("rvest")
link_wikipedia <- "https://pt.wikipedia.org/wiki/Campeonato_Brasileiro_de_Futebol_de_2002"
pagina_wikipedia <- read_html(link_wikipedia)
links_temporadas <- pagina_wikipedia %>%
  html_nodes("tr:nth-child(9) div a") %>%
  html_attr("href") %>%
  paste("https://pt.wikipedia.org", ., sep = "")
tabela <- function(link){
  pagina_tabela <- read_html(link)
  tabela_wiki = link %>%
    html_nodes("table.wikitable") %>%
    html_table() %>%
    paste(collapse = "|")
}
tabela_geral <- sapply(links_temporadas, FUN = tabela, USE.NAMES = FALSE)
tabela_final <- data.frame(tabela_geral)
You can use :contains to target the appropriate table by class plus a substring that the table contains. You can then use html_table() to extract the matched node in tabular format, and subset on a vector of desired columns. I don't know the correct football terms, so I have guessed the columns to subset on; you can adjust the columns vector.
If you wrap the years and constructed urls in a map2_dfr() call, you can return a single data frame for all desired years.
library(tidyverse)
library(rvest)
years <- 2003:2021
urls <- paste("https://pt.wikipedia.org/wiki/Campeonato_Brasileiro_de_Futebol_de_", years, sep = "")
columns <- c("Pos.", "Equipes", "GP", "GC", "SG")
df <- purrr::map2_dfr(urls, years, ~
  read_html(.x, encoding = "utf-8") %>%
    html_element('.wikitable:contains("ou rebaixamento")') %>%
    html_table() %>%
    .[columns] %>%
    mutate(year = .y, SG = as.character(SG)))
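The SG = as.character(SG) step is worth noting: the goal-difference column apparently parses as a different type in different years (e.g. "+55" as character versus a plain integer), and map2_dfr() can only row-bind columns whose types agree.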
You can get all the tables from those links by doing this:
tabela <- function(link){
  read_html(link) %>% html_nodes("table.wikitable") %>% html_table()
}
all_tables <- lapply(links_temporadas, tabela)
names(all_tables) <- 2003:2022
This gives you a list of length 20, named 2003 to 2022 (i.e. one element for each of those years). Each element is itself a list of tables (i.e. the tables that are available at that link of links_temporadas). Note that the number of tables available at each link varies.
lengths(all_tables)
2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020 2021 2022
6 5 10 9 10 12 11 10 12 11 13 14 17 16 16 16 16 15 17 7
You will need to determine which table(s) you are interested in from each of these years.
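For instance, a quick sketch to flag which table in each year looks like a classification table (assuming, as the next answer does, that the standings table always carries a "Pos." column):
standings_idx <- lapply(all_tables, function(tabs)
  which(sapply(tabs, function(t) "Pos." %in% names(t))))
standings_idx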
Here is a way. It's more complicated than your function because those pages have more than one table, so the function returns only the table whose column names include "Pos.".
Then, before rbinding the tables, keep only the common columns, since the older tables have one column fewer: they lack column "M".
suppressPackageStartupMessages({
  library("dplyr")
  library("rvest")
})
link_wikipedia <- "https://pt.wikipedia.org/wiki/Campeonato_Brasileiro_de_Futebol_de_2002"
pagina_wikipedia <- read_html(link_wikipedia)
links_temporadas <- pagina_wikipedia %>%
  html_nodes("tr:nth-child(9) div a") %>%
  html_attr("href") %>%
  paste("https://pt.wikipedia.org", ., sep = "")
tabela <- function(link){
  pagina_tabela <- read_html(link)
  lista_wiki <- pagina_tabela %>%
    html_elements("table.wikitable") %>%
    html_table()
  i <- sapply(lista_wiki, \(x) "Pos." %in% names(x))
  i <- which(i)[1]
  lista_wiki[[i]]
}
tabela_geral <- sapply(links_temporadas, FUN = tabela, USE.NAMES = FALSE)
sapply(tabela_geral, ncol)
#> [1] 12 12 12 12 12 12 13 13 13 13 13 13 13 13 13 13 13 13 13 13
#sapply(tabela_geral, names)
common_names <- Reduce(intersect, lapply(tabela_geral, names))
tabela_reduzida <- lapply(tabela_geral, `[`, common_names)
tabela_final <- do.call(rbind, tabela_reduzida)
head(tabela_final)
#> # A tibble: 6 x 12
#> Pos. Equipes P J V E D GP GC SG `%`
#> <int> <chr> <chr> <int> <int> <int> <int> <int> <int> <chr> <int>
#> 1 1 Cruzeiro 100 46 31 7 8 102 47 +55 72
#> 2 2 Santos 87 46 25 12 9 93 60 +33 63
#> 3 3 São Paulo 78 46 22 12 12 81 67 +14 56
#> 4 4 São Caetano 742 46 19 14 13 53 37 +16 53
#> 5 5 Coritiba 73 46 21 10 15 67 58 +9 52
#> 6 6 Internacional 721 46 20 10 16 59 57 +2 52
#> # ... with 1 more variable: `Classificação ou rebaixamento` <chr>
Created on 2022-04-03 by the reprex package (v2.0.1)
To keep all columns, including the "M" column:
data.table::rbindlist(tabela_geral, fill = TRUE)
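With fill = TRUE, rbindlist() pads the missing "M" column with NA for the older seasons instead of failing on the mismatched column sets.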

Can you use a dataframe to assist with "find and replace" in R

I am trying to clean some Census data where all the states are given a FIPS code instead of the state abbreviation. I want to run something that goes through the column of FIPS codes and converts them to state abbreviations: find all the 1's and convert them to AL, all the 2's to AK, and so on. I know I can do this with ifelse statements, but I was wondering if there is a more efficient way without writing 51 ifelse statements. Thank you all for your assistance.
Here's a try. I'll use data from https://www.census.gov/library/reference/code-lists/ansi/ansi-codes-for-states.html for valid FIPS codes, and make a fake "bad data" frame.
FIPS <- read.table("https://www2.census.gov/geo/docs/reference/state.txt",
sep = "|", header = TRUE, colClasses = "character")
head(FIPS)
# STATE STUSAB STATE_NAME STATENS
# 1 01 AL Alabama 01779775
# 2 02 AK Alaska 01785533
# 3 04 AZ Arizona 01779777
# 4 05 AR Arkansas 00068085
# 5 06 CA California 01779778
# 6 08 CO Colorado 01779779
baddata <- data.frame(stateabbr = c("AL", "AK", "22"))
baddata
# stateabbr
# 1 AL
# 2 AK
# 3 22
Base R
fixeddata <- merge(baddata, FIPS[,c("STATE", "STUSAB")],
by.x = "stateabbr", by.y = "STATE", all.x = TRUE)
fixeddata
# stateabbr STUSAB
# 1 22 LA
# 2 AK <NA>
# 3 AL <NA>
fixeddata$stateabbr <- ifelse(is.na(fixeddata$STUSAB), fixeddata$stateabbr, fixeddata$STUSAB)
fixeddata$STUSAB <- NULL
fixeddata
#   stateabbr
# 1        LA
# 2        AK
# 3        AL
dplyr
library(dplyr)
left_join(baddata, FIPS[,c("STATE", "STUSAB")], by = c("stateabbr" = "STATE")) %>%
  mutate(stateabbr = coalesce(STUSAB, stateabbr)) %>%
  select(-STUSAB)
# stateabbr
# 1 AL
# 2 AK
# 3 LA
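A lighter-weight base R alternative is a named-vector lookup; a sketch, assuming the FIPS codes in your data are zero-padded the same way as the Census file (e.g. "01" rather than 1):
abbr <- setNames(FIPS$STUSAB, FIPS$STATE)
mapped <- abbr[baddata$stateabbr]
baddata$stateabbr <- ifelse(is.na(mapped), baddata$stateabbr, mapped)
baddata
#   stateabbr
# 1        AL
# 2        AK
# 3        LA
Codes with no match in the lookup simply fall through unchanged.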

Opening .bcp files in R

I have been trying to convert UK charity commission data, which is in .bcp file format, into .csv format that can then be read into R. The data I am referring to is available here: http://data.charitycommission.gov.uk/. What I am trying to do is turn these .bcp files into usable data frames that I can clean and analyze in R.
There are suggestions on how to do this through python on this github page https://github.com/ncvo/charity-commission-extract but unfortunately I haven't been able to get these options to work.
I am wondering if there are any packages or syntax that would allow me to open these data directly in R? I haven't been able to find any.
Another option would be to simply read the files into R as a single character vector using readLines. I have done this, and the files are delimited with #**# for columns and *##* for rows (see here: http://data.charitycommission.gov.uk/data-definition.aspx). Is there an R command that would allow me to create a data frame from a long character string, defining delimiters for both rows and columns?
R solution (edited version)
Not sure if all .bcp files are in the same format... I downloaded the dataset you mentioned and tried a solution for the smallest file, extract_aoo_ref.bcp:
library(data.table)
#read the file as-is
text <- readChar("./extract_aoo_ref.bcp",
nchars = file.info( "./extract_aoo_ref.bcp" )$size,
useBytes = TRUE)
# protect existing literal semicolons, since ";" will become the field separator
text <- gsub( ";", ":", text)
# replace the column and row separators
text <- gsub( "#\\*\\*#", ";", text)
text <- gsub( "\\*##\\*", "\n", text, perl = TRUE)
#read the results
result <- data.table::fread( text,
header = FALSE,
sep = ";",
fill = TRUE,
quote = "",
strip.white = TRUE)
head(result,10)
# V1 V2 V3 V4 V5 V6
# 1: A 1 THROUGHOUT ENGLAND AND WALES At least 10 authorities in England and Wales N NA
# 2: B 1 BRACKNELL FOREST BRACKNELL FOREST N NA
# 3: D 1 AFGHANISTAN AFGHANISTAN N 2
# 4: E 1 AFRICA AFRICA N NA
# 5: A 2 THROUGHOUT ENGLAND At least 10 authorities in England only N NA
# 6: B 2 WEST BERKSHIRE WEST BERKSHIRE N NA
# 7: D 2 ALBANIA ALBANIA N 3
# 8: E 2 ASIA ASIA N NA
# 9: A 3 THROUGHOUT WALES At least 10 authorities in Wales only Y NA
# 10: B 3 READING READING N NA
The same approach works for the trickier file, extract_charity.bcp:
head(result[,1:3],10)
# V1 V2 V3
# 1: 200000 0 HOMEBOUND CRAFTSMEN TRUST
# 2: 200001 0 PAINTERS' COMPANY CHARITY
# 3: 200002 0 THE ROYAL OPERA HOUSE BENEVOLENT FUND
# 4: 200003 0 HERGA WORLD DISTRESS FUND
# 5: 200004 0 THE WILLIAM GOLDSTEIN LAY STAFF BENEVOLENT FUND (ROYAL HOSPITAL OF ST BARTHOLOMEW)
# 6: 200005 0 DEVON AND CORNWALL ROMAN CATHOLIC DEVELOPMENT SOCIETY
# 7: 200006 0 THE HORLEY SICK CHILDREN'S FUND
# 8: 200007 0 THE HOLDENHURST OLD PEOPLE'S HOME TRUST
# 9: 200008 0 LORNA GASCOIGNE TRUST FUND
# 10: 200009 0 THE RALPH LEVY CHARITABLE COMPANY LIMITED
So... it looks like it is working :)
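If the remaining files share the same delimiters, the steps above can be wrapped into a small helper. A sketch, assuming every .bcp file in the download really does use #**# and *##* consistently:
read_bcp <- function(path) {
  text <- readChar(path, nchars = file.info(path)$size, useBytes = TRUE)
  text <- gsub(";", ":", text)          # protect literal semicolons
  text <- gsub("#\\*\\*#", ";", text)   # column delimiter
  text <- gsub("\\*##\\*", "\n", text)  # row delimiter
  data.table::fread(text, header = FALSE, sep = ";",
                    fill = TRUE, quote = "", strip.white = TRUE)
}
result <- read_bcp("./extract_aoo_ref.bcp")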

How can you create custom headers using the table function in R?

I have a data frame for each team that looks like nebraska below. However, some of these poor teams don't have a single win, so their $Outcome column has nothing but L in it.
> nebraska
Teams Away/Home Score Outcome
1 Arkansas State Away 36
2 Nebraska Home 43 W
3 Nebraska Away 35 L
4 Oregon Home 42
5 Northern Illinois Away 21
6 Nebraska Home 17 L
7 Rutgers Away 17
8 Nebraska Home 27 W
9 Nebraska Away 28 W
10 Illinois Home 6
11 Wisconsin Away 38
12 Nebraska Home 17 L
13 Ohio State Away 56
14 Nebraska Home 14 L
When I run table(nebraska$Outcome), it gives me my expected outcome:
table(nebraska$Outcome)
      L   W 
  7   4   3 
However, for the teams that don't have a single win (like Baylor), or that only have wins, it gives me something like:
table(baylor$Outcome)
      L 
  7   7 
I'd like to specify custom headers for the table function so that I can get something like this output:
table(baylor$Outcome)
      L   W 
  7   7   0 
I've tried passing the argument dnn to the table function call, but it throws an error with the following code:
> table(baylor$Outcome,dnn = c("W","L",""))
Error in names(dn) <- dnn :
'names' attribute [3] must be the same length as the vector [1]
Can someone tell me how I can tabulate these wins and losses correctly?
Try this:
with(rle(sort(nebraska$Outcome)),
     data.frame(W = max(0, lengths[values == "W"]),
                L = max(0, lengths[values == "L"])))
# W L
#1 3 4
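This works because rle() on the sorted vector counts the length of each run of identical outcomes, so lengths[values == "W"] is the win count; when a team has no wins that subset is empty, and max(0, ...) supplies the zero that a plain table() call cannot.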
I don't think this has to be that complicated. Just make baylor$Outcome a factor and then call table(). E.g.:
# example data
baylor <- data.frame(Outcome = c("L","L","L"))
Then it is just:
baylor$Outcome <- factor(baylor$Outcome, levels=c("","L","W"))
table(baylor$Outcome)
# L W
#0 3 0
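Because the factor now carries all three levels, table() reports a zero for any level with no observations, which is exactly the behaviour the question asks for.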
Following a tidy workflow, I offer...
library(dplyr)
library(tidyr)
df <- nebraska %>%
  group_by(Teams, Outcome) %>%
  summarise(n = n()) %>%
  spread(Outcome, n) %>%
  select(-c(`<NA>`))
# # A tibble: 8 x 3
# # Groups: Teams [8]
# Teams L W
# * <chr> <int> <int>
# 1 Arkansas State NA NA
# 2 Illinois NA NA
# 3 Nebraska 4 3
# 4 Northern Illinois NA NA
# 5 Ohio State NA NA
# 6 Oregon NA NA
# 7 Rutgers NA NA
# 8 Wisconsin NA NA
...and I couldn't help myself but prettify it with knitr::kable and kableExtra:
library(knitr)
library(kableExtra)
df %>%
  kable("html") %>%
  kable_styling(bootstrap_options = c("striped", "hover"))

Reshaping Data into panel form

I have data where each object's name is a variable name like EPS, Profit, etc. (around 25 such distinct objects).
The data is arranged like this:
EPS <- read.table(text = "
Year Microsoft Facebook
2001 12 20
2002 15 23
2003 16 19
", header = TRUE)
Profit <- read.table(text = "
Year Microsoft Facebook
2001 15 36
2002 19 40
2003 25 45
", header = TRUE)
I want output like this :
Year Co_Name EPS Profit
2001 Microsoft 12 15
2002 Microsoft 15 19
2003 Microsoft 16 25
2001 Facebook 20 36
2002 Facebook 23 40
2003 Facebook 19 45
How can it be done? Is there a way to arrange the data for all variables as a single object? The data for each variable is imported into R from a csv file like EPS.csv, Profit.csv, etc. Is there a way to create a loop from importing through arranging the data in the desired format?
Just for fun we can also achieve the same result using dplyr, tidyr and purrr.
library(dplyr)
library(tidyr)
library(readr)
library(purrr)
list_of_csv <- list.files(path = ".", pattern = ".csv", full.names = TRUE)
file_name <- gsub(".csv", "", basename(list_of_csv))
list_of_csv %>%
  map(~ read_csv(.)) %>%
  map(~ gather(data = ., key = co_name, value = value, -year)) %>%
  reduce(inner_join, by = c("year", "co_name")) %>%
  setNames(., c("year", "co_name", file_name))
## Source: local data frame [6 x 4]
## year co_name eps profit
## (int) (fctr) (int) (int)
## 1 2001 microsoft 12 15
## 2 2002 microsoft 15 19
## 3 2003 microsoft 16 25
## 4 2001 facebook 20 36
## 5 2002 facebook 23 40
## 6 2003 facebook 19 45
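One caveat: reduce(inner_join, ...) keeps only the year/co_name pairs present in every file, so if some CSVs cover extra years or companies, those rows are silently dropped; full_join would keep them with NAs instead.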
We can get the datasets in a list. If we have already created 'EPS' and 'Profit' as objects, use mget to get them into a list, convert to a single data.table with rbindlist, melt to long format, and reshape back to 'wide' with dcast.
library(data.table)#v1.9.6+
DT <- rbindlist(mget(c('EPS', 'Profit')), idcol = TRUE)
DT1 <- dcast(melt(DT, id.var = c('.id', 'Year'), variable.name = 'Co_Name'),
             Year + Co_Name ~ .id, value.var = 'value')
DT1
# Year Co_Name EPS Profit
#1: 2001 Microsoft 12 15
#2: 2001 Facebook 20 36
#3: 2002 Microsoft 15 19
#4: 2002 Facebook 23 40
#5: 2003 Microsoft 16 25
#6: 2003 Facebook 19 45
If we need to arrange it, use order
DT1[order(factor(Co_Name, levels=unique(Co_Name)))]
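For completeness, a base R sketch of the same reshape, assuming just the two objects EPS and Profit are in memory (merge sorts the result by Year, so reorder afterwards if you need the companies grouped together):
long <- function(df, value_name) {
  out <- data.frame(Year = rep(df$Year, times = ncol(df) - 1),
                    Co_Name = rep(names(df)[-1], each = nrow(df)),
                    unlist(df[-1], use.names = FALSE))
  names(out)[3] <- value_name
  out
}
merge(long(EPS, "EPS"), long(Profit, "Profit"), by = c("Year", "Co_Name"))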
