mutate with gsub to clean out commas from a number - table scraped with rvest - r

I'm practising scraping and data cleaning, and have a table which I've scraped from wikipedia. I'm trying to mutate the table to create a column which cleans out the commas from an existing column to return the number. All I'm getting is a column of NAs.
This is my output:
> library(dplyr)
> library(rvest)
>
> pg <- read_html("https://en.wikipedia.org/wiki/Rugby_World_Cup")
> rugby <- pg %>% html_table(., fill = T)
>
> rugby_table <- rugby[[3]]
>
> rugby_table
# A tibble: 9 x 8
Year `Host(s)` `Total attend­ance` Matches `Avg attend­ance` `% change in avg att.` `Stadium capacity` `Attend­ance as % o~
<int> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 1987 Australia New Zealand 604,500 32 20,156 — 1,006,350 60%
2 1991 England France Ireland Scotland Wales 1,007,760 32 31,493 +56% 1,212,800 79%
3 1995 South Africa 1,100,000 32 34,375 +9% 1,423,850 77%
4 1999 Wales 1,750,000 41 42,683 +24% 2,104,500 83%
5 2003 Australia 1,837,547 48 38,282 –10% 2,208,529 83%
6 2007 France 2,263,223 48 47,150 +23% 2,470,660 92%
7 2011 New Zealand 1,477,294 48 30,777 –35% 1,732,000 85%
8 2015 England 2,477,805 48 51,621 +68% 2,600,741 95%
9 2019 Japan 1,698,528 45† 37,745 –27% 1,811,866 90%
>
> rugby_table2 <- rugby %>%
+ .[[3]] %>%
+ tbl_df %>%
+ mutate(Attendance=as.numeric(gsub("[^0-9.-]+","",'Total attendance')))
>
> rugby_table2
# A tibble: 9 x 9
Year `Host(s)` `Total attend­ance` Matches `Avg attend­ance` `% change in avg~ `Stadium capaci~ `Attend­ance as~ Attendance
<int> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl>
1 1987 Australia New Zealand 604,500 32 20,156 — 1,006,350 60% NA
2 1991 England France Ireland Scotland Wales 1,007,760 32 31,493 +56% 1,212,800 79% NA
3 1995 South Africa 1,100,000 32 34,375 +9% 1,423,850 77% NA
4 1999 Wales 1,750,000 41 42,683 +24% 2,104,500 83% NA
5 2003 Australia 1,837,547 48 38,282 –10% 2,208,529 83% NA
6 2007 France 2,263,223 48 47,150 +23% 2,470,660 92% NA
7 2011 New Zealand 1,477,294 48 30,777 –35% 1,732,000 85% NA
8 2015 England 2,477,805 48 51,621 +68% 2,600,741 95% NA
9 2019 Japan 1,698,528 45† 37,745 –27% 1,811,866 90% NA
Any ideas?

The difficulty here is that gsub is interpreting 'Total attendance' as a character string, not a column name. My natural reaction was to use backticks instead of single quotes, but then I get a message that the object could not be found. I'm not sure what the problem is there, but you can resolve it using across():
rugby_table2 <- rugby_table %>%
  mutate(Attendance = across(contains("Total"),
                             function(x) as.numeric(gsub(",", "", x))),
         Attendance = Attendance[[1]])
rugby_table2$Attendance
#> [1] 604500 1007760 1100000 1750000 1837547 2263223 1477294 2477805 1698528
EDIT
Ronak Shah has identified the problem, which is that there is an invisible character in the name brought across from the web page, which means the column isn't recognised. So an alternative solution would be:
names(rugby_table)[3] <- "Total attendance"
rugby_table2 <- rugby_table %>%
  mutate(Attendance = as.numeric(gsub(",", "", `Total attendance`)))
rugby_table2$Attendance
#> [1] 604500 1007760 1100000 1750000 1837547 2263223 1477294 2477805 1698528
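For reference, the invisible character in question is a Unicode soft hyphen (U+00AD) that Wikipedia embeds in long headers such as "Total attend­ance". A quick sketch of how to detect and strip it (the `\u00ad` escape is an assumption about what html_table() returns for this page):

```r
# The scraped header only *looks* like "Total attendance":
nm <- "Total attend\u00adance"   # soft hyphen hidden between "attend" and "ance"
nm == "Total attendance"         # FALSE: the strings differ by one code point
utf8ToInt("\u00ad")              # 173, the soft hyphen's code point

# Strip soft hyphens from every column name in one pass, e.g.
# names(rugby_table) <- gsub("\u00ad", "", names(rugby_table))
gsub("\u00ad", "", nm) == "Total attendance"   # TRUE
```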

The gsub function replaces every match of the provided pattern. If you are going to remove all the commas with gsub, refer to the column with backticks rather than quotes (a quoted name is just a literal string):
rugby_table2 <- rugby %>%
  .[[3]] %>%
  as_tibble() %>%
  mutate(Attendance = as.numeric(gsub(",", "", `Total attendance`)))
Edit:
rugby_table <- structure(list(Year = c(1987L, 1991L, 1995L, 1999L, 2003L, 2007L,
2011L, 2015L, 2019L), `Host(s)` = c("AustraliaNewZealand", "EnglandFranceIrelandScotlandWales",
"SouthAfrica", "Wales", "Australia", "France", "NewZealand",
"England", "Japan"), `Total attendance` = c("604,500", "1,007,760",
"1,100,000", "1,750,000", "1,837,547", "2,263,223", "1,477,294",
"2,477,805", "1,698,528"), Matches = c("32", "32", "32", "41",
"48", "48", "48", "48", "45+"), `Avg attendance` = c("20,156",
"31,493", "34,375", "42,683", "38,282", "47,150", "30,777", "51,621",
"37,745"), `% change in avg att` = c("—", "56%", "9%", "24%",
"–10%", "23%", "–35%", "68%", "–27%"), `Stadium capacity` = c("1,006,350",
"1,212,800", "1,423,850", "2,104,500", "2,208,529", "2,470,660",
"1,732,000", "2,600,741", "1,811,866"), `Attendance as % o~` = c("60%",
"79%", "77%", "83%", "83%", "92%", "85%", "95%", "90%")), row.names = c(NA,
-9L), class = c("tbl_df", "tbl", "data.frame"))
library(dplyr)
rugby_table %>%
  mutate(Attendance = as.numeric(gsub(",", "", `Total attendance`))) %>%
  select(Attendance)
#> # A tibble: 9 x 1
#> Attendance
#> <dbl>
#> 1 604500
#> 2 1007760
#> 3 1100000
#> 4 1750000
#> 5 1837547
#> 6 2263223
#> 7 1477294
#> 8 2477805
#> 9 1698528
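As an aside, if readr is available (it is installed with the tidyverse), parse_number() strips the grouping commas and converts in one step, so the gsub() call can be dropped entirely; a sketch on a couple of the values above:

```r
library(readr)

# parse_number() drops grouping marks and returns a double
parse_number(c("604,500", "1,007,760"))
#> [1]  604500 1007760
```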

Related

dplyr arrange is not working while order is fine

I am trying to obtain the largest 10 investors in a country but obtain confusing result using arrange in dplyr versus order in base R.
head(fdi_partner)
give the following results
# A tibble: 6 x 3
`Main counterparts` `Number of projects` `Total registered capital (Mill. USD)(*)`
<chr> <chr> <chr>
1 TOTAL 1818 38854.3
2 Singapore 231 11358.66
3 Korea Rep.of 377 7679.9
4 Japan 204 4325.79
5 Netherlands 24 4209.64
6 China, PR 216 3001.79
and
fdi_partner %>%
  rename("Registered capital" = "Total registered capital (Mill. USD)(*)") %>%
  mutate_at(c("Number of projects", "Registered capital"), as.numeric) %>%
  arrange("Number of projects") %>%
  head()
give almost the same result
# A tibble: 6 x 3
`Main counterparts` `Number of projects` `Registered capital`
<chr> <dbl> <dbl>
1 TOTAL 1818 38854.
2 Singapore 231 11359.
3 Korea Rep.of 377 7680.
4 Japan 204 4326.
5 Netherlands 24 4210.
6 China, PR 216 3002.
while the following code is working fine with base R
head(fdi_partner)
fdi_numeric <- fdi_partner %>%
  rename("Registered capital" = "Total registered capital (Mill. USD)(*)") %>%
  mutate_at(c("Number of projects", "Registered capital"), as.numeric)
head(fdi_numeric[order(fdi_numeric$"Number of projects", decreasing = TRUE), ], n=11)
which gives
# A tibble: 11 x 3
`Main counterparts` `Number of projects` `Registered capital`
<chr> <dbl> <dbl>
1 TOTAL 1818 38854.
2 Korea Rep.of 377 7680.
3 Singapore 231 11359.
4 China, PR 216 3002.
5 Japan 204 4326.
6 Hong Kong SAR (China) 132 2365.
7 United States 83 783.
8 Taiwan 66 1464.
9 United Kingdom 50 331.
10 F.R Germany 37 131.
11 Thailand 36 370.
Can anybody help explain what's wrong with my code?
dplyr (and tidyverse packages more generally) expects unquoted variable names; a quoted name is treated as a constant string, so arrange() leaves the row order unchanged. If your variable name has a space in it, you must wrap it in backticks:
library(dplyr)
test <- data.frame(`My variable` = c(3, 1, 2), var2 = c(1, 1, 1), check.names = FALSE)
test
#> My variable var2
#> 1 3 1
#> 2 1 1
#> 3 2 1
# Your code (doesn't work)
test %>%
  arrange("My variable")
#> My variable var2
#> 1 3 1
#> 2 1 1
#> 3 2 1
# Solution
test %>%
  arrange(`My variable`)
#> My variable var2
#> 1 1 1
#> 2 2 1
#> 3 3 1
Created on 2023-01-05 with reprex v2.0.2
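Since the original goal was the largest 10 investors, note also that arrange() sorts ascending by default; wrap the backticked name in desc() to mirror order(..., decreasing = TRUE), and take the top rows with slice_head() (dplyr >= 1.0). A sketch on the same toy data:

```r
library(dplyr)

test <- data.frame(`My variable` = c(3, 1, 2), var2 = c(1, 1, 1),
                   check.names = FALSE)

# Descending sort, then keep the top rows (here the `My variable` values 3 and 2)
test %>%
  arrange(desc(`My variable`)) %>%
  slice_head(n = 2)
```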

I lose the constant variables (including id) when using pivot_longer with multiple variables

I try to reshape the following
country   region      abc2001   abc2002   xyz2001   xyz2002
Japan     East Asia   1         2         4.5       5.5
to the following
country   region      year   abc   xyz
Japan     East Asia   2001   1     4.5
Japan     East Asia   2002   2     5.5
actually there are five more variables in the same way.
I use the following code:
long <- data %>%
  pivot_longer(cols = c(-country, -region),
               names_to = c(".value", "year"),
               names_pattern = "([^\\.]*)\\.*(\\d{4})")
The result is the long version of the data, except that I lose the country and region variables. What am I doing wrong? Or how else can I do this better?
Thank you in advance.
We may change the regex pattern to match one or more non-digits (\\D+) as the first capture group and one or more digits (\\d+) as the second one:
library(tidyr)
pivot_longer(data, cols = c(-country, -region),
             names_to = c(".value", "year"), names_pattern = "(\\D+)(\\d+)")
-output
# A tibble: 2 × 5
country region year abc xyz
<chr> <chr> <chr> <int> <dbl>
1 Japan East Asia 2001 1 4.5
2 Japan East Asia 2002 2 5.5
data
data <- structure(list(country = "Japan", region = "East Asia", abc2001 = 1L,
abc2002 = 2L, xyz2001 = 4.5, xyz2002 = 5.5),
class = "data.frame", row.names = c(NA,
-1L))
Update: see comments. As @akrun noted, here is a better regex with lookarounds:
library(stringr)
data %>%
  rename_with(~ str_replace(.x, "(?<=\\D)(?=\\d)", "_"))
First answer:
Here is a version with names_sep. The challenge was to add an underscore to the column names. The preferred answer is that of @akrun:
(.*) - Group 1: any zero or more chars as many as possible
(\\d{4}$) - Group 2: for digits at the end
library(dplyr)
library(tidyr)
data %>%
  rename_with(~ sub("(.*)(\\d{4}$)", "\\1_\\2", .x)) %>%
  pivot_longer(-c(country, region),
               names_to = c(".value", "Year"),
               names_sep = "_")
country region Year abc xyz
<chr> <chr> <chr> <int> <dbl>
1 Japan East Asia 2001 1 4.5
2 Japan East Asia 2002 2 5.5
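If the new year column should come out numeric rather than character, pivot_longer() can coerce it during the reshape via names_transform (available since tidyr 1.1), avoiding a separate mutate() afterwards:

```r
library(tidyr)

# Same example data as in the question
data <- data.frame(country = "Japan", region = "East Asia",
                   abc2001 = 1L, abc2002 = 2L,
                   xyz2001 = 4.5, xyz2002 = 5.5)

# year is converted to integer as part of the pivot
pivot_longer(data, cols = c(-country, -region),
             names_to = c(".value", "year"),
             names_pattern = "(\\D+)(\\d+)",
             names_transform = list(year = as.integer))
```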

How to rearrange data shape with R

Current shape of the data:
# A tibble: 92 × 9
state category `1978` `1983` `1988` `1993` `1999` `2006` `2013`
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 Alabama Jail population 3,642 3,838 4,363 7,873 9,327 11,430 10,436
2 Alabama Percent pre-trial 28% 47% 57% 48% 58% 70% 68%
Wanted shape of the data:
state jail pretrial year
Alabama 3642 28% 1978
Alabama 3838 47% 1983
Alabama 4363 57% 1988
Alabama 7873 48% 1993
Alabama 9327 58% 1999
Alabama 11430 70% 2006
Alabama 10436 68% 2013
I've made various attempts using dplyr with pivot_wider / pivot_longer and have gotten close, but have spent a lot of time trying to figure this one out. I'm not looking so much for a code recipe, but I haven't been able to find a similar example to go by. If there is a name for this operation, please share it and I can look it up and figure out the code myself; I'm just unsure what to search for.
Here's a dplyr suggestion:
library(dplyr)
library(tidyr) # pivot_*
dat %>%
  mutate(category = gsub(" .*", "", category)) %>%
  pivot_longer(-c(state, category), names_to = "year") %>%
  pivot_wider(id_cols = c(state, year), names_from = category, values_from = value)
# # A tibble: 7 x 4
# state year Jail Percent
# <chr> <chr> <chr> <chr>
# 1 Alabama 1978 3,642 28%
# 2 Alabama 1983 3,838 47%
# 3 Alabama 1988 4,363 57%
# 4 Alabama 1993 7,873 48%
# 5 Alabama 1999 9,327 58%
# 6 Alabama 2006 11,430 70%
# 7 Alabama 2013 10,436 68%
You may want to clean up the columns a bit (for numbers, etc), perhaps
dat %>%
  mutate(category = gsub(" .*", "", category)) %>%
  pivot_longer(-c(state, category), names_to = "year") %>%
  pivot_wider(id_cols = c(state, year), names_from = category, values_from = value) %>%
  mutate(across(c(year, Jail), ~ as.integer(gsub("\\D", "", .))))
# # A tibble: 7 x 4
# state year Jail Percent
# <chr> <int> <int> <chr>
# 1 Alabama 1978 3642 28%
# 2 Alabama 1983 3838 47%
# 3 Alabama 1988 4363 57%
# 4 Alabama 1993 7873 48%
# 5 Alabama 1999 9327 58%
# 6 Alabama 2006 11430 70%
# 7 Alabama 2013 10436 68%
(There are many ways to deal with cleaning it up.)
Data
dat <- structure(list(state = c("Alabama", "Alabama"), category = c("Jail population", "Percent pre-trial"), `1978` = c("3,642", "28%"), `1983` = c("3,838", "47%"), `1988` = c("4,363", "57%"), `1993` = c("7,873", "48%"), `1999` = c("9,327", "58%"), `2006` = c("11,430", "70%"), `2013` = c("10,436", "68%")), row.names = c("1", "2"), class = "data.frame")

How to download precipitation data using rnoaa

I am new to the 'rnoaa' R package. I am wondering how to find station IDs to identify stations. I am interested in downloading hourly or daily precipitation data from 2011 to 2020 for the Prince William Sound, Alaska area. I looked here: https://www.ncdc.noaa.gov/cdo-web/search but it seems to have data only up to 2014. Could someone give me a hint on which rnoaa function to use to download the desired rainfall data?
I read about the following rnoaa function:
cpc_prcp(date = "1998-04-23", drop_undefined = TRUE)
However, I don't know what to include inside the function to get the data that I am looking for, nor how to specify the range of dates (2011 to 2020).
You could try this workflow:
An internet search gives the latitude and longitude for Prince William Sound Alaska area.
library(rnoaa)

# create a data frame for Prince William latitude and longitude
lat_lon_df <- data.frame(id = "pw",
                         lat = 60.690545,
                         lon = -147.097055)

# find 10 closest monitors to Prince William
mon_near_pw <-
  meteo_nearby_stations(
    lat_lon_df = lat_lon_df,
    lat_colname = "lat",
    lon_colname = "lon",
    var = "PRCP",
    year_min = 2011,
    year_max = 2020,
    limit = 10
  )
mon_near_pw
#> $pw
#> # A tibble: 10 x 5
#> id name latitude longitude distance
#> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 USC00501240 CANNERY CREEK 61.0 -148. 42.9
#> 2 USC00509747 WALLY NOERENBERG HATCHERY 60.8 -148. 55.1
#> 3 USS0048L06S Esther Island 60.8 -148. 55.3
#> 4 USC00505604 MAIN BAY 60.5 -148. 57.6
#> 5 USS0046M04S Sugarloaf Mtn 61.1 -146. 61.1
#> 6 USC00509687 VALDEZ 61.1 -146. 62.4
#> 7 USW00026442 VALDEZ WSO 61.1 -146. 63.4
#> 8 US1AKVC0005 VALDEZ 3.6 ENE 61.1 -146. 66.3
#> 9 USC00509685 VALDEZ AIRPORT 61.1 -146. 67.3
#> 10 USC00502179 CORDOVA WWTP 60.5 -146. 74.0
# extract precipitation data for the first location
pw_prcp_dat <-
  meteo_pull_monitors(
    monitors = mon_near_pw$pw$id[1],
    date_min = "2011-01-01",
    date_max = "2020-12-31",
    var = "PRCP"
  )
head(pw_prcp_dat)
#> # A tibble: 6 x 3
#> id date prcp
#> <chr> <date> <dbl>
#> 1 USC00501240 2011-01-01 704
#> 2 USC00501240 2011-01-02 742
#> 3 USC00501240 2011-01-03 211
#> 4 USC00501240 2011-01-04 307
#> 5 USC00501240 2011-01-05 104
#> 6 USC00501240 2011-01-06 0
# out of curiosity have plotted monthly summary of precipitation.
# For metadata see: https://www1.ncdc.noaa.gov/pub/data/ghcn/daily/readme.txt
# PRCP = Precipitation (tenths of mm)
library(dplyr)
library(lubridate)
library(ggplot2)

pw_prcp_dat %>%
  mutate(year = year(date),
         month = month(date)) %>%
  group_by(year, month) %>%
  summarise(prcp = sum(prcp, na.rm = TRUE) / 10) %>%
  ggplot(aes(factor(month), prcp)) +
  geom_col() +
  facet_wrap(~year) +
  labs(y = "Precipitation [mm]",
       x = "Month") +
  theme_bw()
Created on 2021-08-22 by the reprex package (v2.0.0)

ratio calculation and sort the calculated rates

df <- read.csv('https://raw.githubusercontent.com/ulklc/covid19-timeseries/master/countryReport/raw/rawReport.csv',
               stringsAsFactors = FALSE)
df8 <- read.csv('https://raw.githubusercontent.com/hirenvadher954/Worldometers-Scraping/master/countries.csv',
                stringsAsFactors = FALSE)
install.packages("tidyverse")
library(tidyverse)
df %>%
  left_join(df8, by = c("countryName" = "country_name")) %>%
  mutate(population = as.numeric(str_remove_all(population, ","))) %>%
  group_by(countryName) %>%
  unique() %>%
  summarize(population = sum(population, na.rm = TRUE),
            confirmed = sum(confirmed, na.rm = TRUE),
            recovered = sum(recovered, na.rm = TRUE),
            death = sum(death, na.rm = TRUE),
            death_prop = paste0(as.character(death), "/", as.character(population)))
In this code the death/population rate was calculated. I want to find the 10 countries with the highest death/population rate; confirmed and recovered don't need to be kept.
# A tibble: 10 x 6
countryName population confirmed recovered death death_prop
<chr> <dbl> <int> <int> <int> <chr>
1 Afghanistan 4749258212 141652 16505 3796 3796/4749258212
2 Albania 351091234 37233 22518 1501 1501/351091234
3 Algeria 5349827368 206413 88323 20812 20812/5349827368
4 Andorra 9411324 38518 18054 2015 2015/9411324
5 Angola 4009685184 1620 435 115 115/4009685184
6 Anguilla 1814018 161 92 0 0/1814018
7 Antigua and Barbuda 11947338 1230 514 128 128/11947338
8 Argentina 5513884428 232975 66155 10740 10740/5513884428
9 Armenia 361515646 121702 46955 1626 1626/361515646
10 Aruba 13025452 5194 3135 91 91/13025452
The data shown is only an example; the values are not correct.
The data is in cumulative format, meaning today's values already include all values up to yesterday. So take only the max value of each column and then calculate death_prop.
library(dplyr)
library(stringr)

df %>%
  left_join(df8, by = c("countryName" = "country_name")) %>%
  mutate(population = as.numeric(str_remove_all(population, ","))) %>%
  group_by(countryName) %>%
  summarise_at(vars(population:death), max, na.rm = TRUE) %>%
  mutate(death_prop = death/population * 100) %>%
  arrange(desc(death_prop))
# A tibble: 215 x 5
# countryName population year death death_prop
# <chr> <dbl> <dbl> <int> <dbl>
# 1 San Marino 33860 2019 42 0.124
# 2 Belgium 11589623 2020 9312 0.0803
# 3 Andorra 77142 2019 51 0.0661
# 4 Spain 46754778 2020 28752 0.0615
# 5 Italy 60461826 2020 32877 0.0544
# 6 United Kingdom 67886011 2020 36914 0.0544
# 7 France 65273511 2020 28432 0.0436
# 8 Sweden 10099265 2020 4029 0.0399
# 9 Sint Maarten 42388 2019 15 0.0354
#10 Netherlands 17134872 2020 5830 0.0340
# … with 205 more rows
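To answer the "highest 10" part directly: with dplyr >= 1.0, the arrange(desc(...)) plus a head() can be replaced by slice_max(), which keeps only the n largest rows. A sketch on toy data (the column names mirror the pipeline above):

```r
library(dplyr)

rates <- tibble(countryName = c("A", "B", "C", "D"),
                death_prop  = c(0.05, 0.12, 0.08, 0.02))

# Rows with the 2 largest death_prop values, already sorted descending
rates %>% slice_max(death_prop, n = 2)
```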
