How to rearrange data shape with R

Current shape of the data:
# A tibble: 92 × 9
  state   category          `1978` `1983` `1988` `1993` `1999` `2006` `2013`
  <chr>   <chr>             <chr>  <chr>  <chr>  <chr>  <chr>  <chr>  <chr>
1 Alabama Jail population   3,642  3,838  4,363  7,873  9,327  11,430 10,436
2 Alabama Percent pre-trial 28%    47%    57%    48%    58%    70%    68%
Wanted shape of the data:
state   jail  pretrial year
Alabama 3642  28%      1978
Alabama 3838  47%      1983
Alabama 4363  57%      1988
Alabama 7873  48%      1993
Alabama 9327  58%      1999
Alabama 11430 70%      2006
Alabama 10436 68%      2013
I've tried various combinations of dplyr with pivot_wider / pivot_longer and have gotten close, but I've spent a lot of time trying to figure this one out. I'm not looking so much for a code recipe telling me how to do it, but I haven't even been able to find a similar example to go by. If there is a name for this operation, please share it and I can do the lookup and figure out the code myself; I'm just unsure what to search for.

Here's a dplyr suggestion. (If you want a name to search for: this is a two-step reshape, pivoting the year columns longer, then pivoting the two categories wider.)
library(dplyr)
library(tidyr) # pivot_longer(), pivot_wider()

dat %>%
  # keep only the first word of category: "Jail", "Percent"
  mutate(category = gsub(" .*", "", category)) %>%
  pivot_longer(-c(state, category), names_to = "year") %>%
  pivot_wider(id_cols = c(state, year),   # id_cols must be named in tidyr >= 1.2
              names_from = category, values_from = value)
# # A tibble: 7 x 4
#   state   year  Jail   Percent
#   <chr>   <chr> <chr>  <chr>
# 1 Alabama 1978  3,642  28%
# 2 Alabama 1983  3,838  47%
# 3 Alabama 1988  4,363  57%
# 4 Alabama 1993  7,873  48%
# 5 Alabama 1999  9,327  58%
# 6 Alabama 2006  11,430 70%
# 7 Alabama 2013  10,436 68%
You may want to clean up the columns a bit (convert to numbers, etc.), perhaps:
dat %>%
  mutate(category = gsub(" .*", "", category)) %>%
  pivot_longer(-c(state, category), names_to = "year") %>%
  pivot_wider(id_cols = c(state, year),
              names_from = category, values_from = value) %>%
  # strip every non-digit, then convert to integer
  mutate(across(c(year, Jail), ~ as.integer(gsub("\\D", "", .))))
# # A tibble: 7 x 4
#   state    year  Jail Percent
#   <chr>   <int> <int> <chr>
# 1 Alabama  1978  3642 28%
# 2 Alabama  1983  3838 47%
# 3 Alabama  1988  4363 57%
# 4 Alabama  1993  7873 48%
# 5 Alabama  1999  9327 58%
# 6 Alabama  2006 11430 70%
# 7 Alabama  2013 10436 68%
(There are many ways to deal with cleaning it up.)
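For instance, here is a sketch of the same cleanup using readr::parse_number(), which drops grouping commas for you (readr is an extra dependency here, not something the pipeline above requires):
library(readr)
dat %>%
  mutate(category = gsub(" .*", "", category)) %>%
  pivot_longer(-c(state, category), names_to = "year") %>%
  pivot_wider(id_cols = c(state, year),
              names_from = category, values_from = value) %>%
  mutate(year = as.integer(year),
         Jail = parse_number(Jail))  # "3,642" -> 3642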
Data
dat <- structure(list(state = c("Alabama", "Alabama"), category = c("Jail population", "Percent pre-trial"), `1978` = c("3,642", "28%"), `1983` = c("3,838", "47%"), `1988` = c("4,363", "57%"), `1993` = c("7,873", "48%"), `1999` = c("9,327", "58%"), `2006` = c("11,430", "70%"), `2013` = c("10,436", "68%")), row.names = c("1", "2"), class = "data.frame")

Related

mutate with gsub to clean out commas from a number - table scraped with rvest

I'm practising scraping and data cleaning, and have a table which I've scraped from wikipedia. I'm trying to mutate the table to create a column which cleans out the commas from an existing column to return the number. All I'm getting is a column of NAs.
This is my output:
> library(dplyr)
> library(rvest)
>
> pg <- read_html("https://en.wikipedia.org/wiki/Rugby_World_Cup")
> rugby <- pg %>% html_table(., fill = T)
>
> rugby_table <- rugby[[3]]
>
> rugby_table
# A tibble: 9 x 8
Year `Host(s)` `Total attend­ance` Matches `Avg attend­ance` `% change in avg att.` `Stadium capacity` `Attend­ance as % o~
<int> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 1987 Australia New Zealand 604,500 32 20,156 — 1,006,350 60%
2 1991 England France Ireland Scotland Wales 1,007,760 32 31,493 +56% 1,212,800 79%
3 1995 South Africa 1,100,000 32 34,375 +9% 1,423,850 77%
4 1999 Wales 1,750,000 41 42,683 +24% 2,104,500 83%
5 2003 Australia 1,837,547 48 38,282 –10% 2,208,529 83%
6 2007 France 2,263,223 48 47,150 +23% 2,470,660 92%
7 2011 New Zealand 1,477,294 48 30,777 –35% 1,732,000 85%
8 2015 England 2,477,805 48 51,621 +68% 2,600,741 95%
9 2019 Japan 1,698,528 45† 37,745 –27% 1,811,866 90%
>
> rugby_table2 <- rugby %>%
+ .[[3]] %>%
+ tbl_df %>%
+ mutate(Attendance=as.numeric(gsub("[^0-9.-]+","",'Total attendance')))
>
> rugby_table2
# A tibble: 9 x 9
Year `Host(s)` `Total attend­ance` Matches `Avg attend­ance` `% change in avg~ `Stadium capaci~ `Attend­ance as~ Attendance
<int> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl>
1 1987 Australia New Zealand 604,500 32 20,156 — 1,006,350 60% NA
2 1991 England France Ireland Scotland Wales 1,007,760 32 31,493 +56% 1,212,800 79% NA
3 1995 South Africa 1,100,000 32 34,375 +9% 1,423,850 77% NA
4 1999 Wales 1,750,000 41 42,683 +24% 2,104,500 83% NA
5 2003 Australia 1,837,547 48 38,282 –10% 2,208,529 83% NA
6 2007 France 2,263,223 48 47,150 +23% 2,470,660 92% NA
7 2011 New Zealand 1,477,294 48 30,777 –35% 1,732,000 85% NA
8 2015 England 2,477,805 48 51,621 +68% 2,600,741 95% NA
9 2019 Japan 1,698,528 45† 37,745 –27% 1,811,866 90% NA
Any ideas?
The difficulty here is that gsub is interpreting 'Total attendance' as a character string, not a column name. My natural reaction was to use backticks instead of single quotes, but then I got a message that the object could not be found. I'm not sure what the problem is there, but you can resolve it using across():
rugby_table2 <- rugby_table %>%
  mutate(Attendance = across(contains("Total"),
                             function(x) as.numeric(gsub(",", "", x))),
         Attendance = Attendance[[1]])
rugby_table2$Attendance
#> [1] 604500 1007760 1100000 1750000 1837547 2263223 1477294 2477805 1698528
EDIT
Ronak Shah has identified the problem, which is that there is an invisible character in the name brought across from the web page, which means the column isn't recognised. So an alternative solution would be:
names(rugby_table)[3] <- "Total attendance"
rugby_table2 <- rugby_table %>%
  mutate(Attendance = as.numeric(gsub(",", "", `Total attendance`)))
rugby_table2$Attendance
#> [1] 604500 1007760 1100000 1750000 1837547 2263223 1477294 2477805 1698528
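The invisible character appears to be a soft hyphen (U+00AD) that Wikipedia embeds in long headers such as "Total attend­ance". As a sketch (assuming the soft hyphen is the only offending character), you could strip it from every column name right after scraping and then refer to the clean names directly:
library(dplyr)
library(rvest)

pg <- read_html("https://en.wikipedia.org/wiki/Rugby_World_Cup")
rugby_table <- html_table(pg, fill = TRUE)[[3]]

# drop soft hyphens from all scraped column names
names(rugby_table) <- gsub("\u00ad", "", names(rugby_table))

rugby_table2 <- rugby_table %>%
  mutate(Attendance = as.numeric(gsub(",", "", `Total attendance`)))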
The gsub function does replacements on all matches of the provided pattern. If you are going to remove all the commas with gsub, the column must be referenced with backticks rather than quoted as a string (which, as noted above, still assumes the column name contains no hidden characters):
rugby_table2 <- rugby %>%
  .[[3]] %>%
  as_tibble() %>%   # tbl_df() is deprecated in favour of as_tibble()
  mutate(Attendance = as.numeric(gsub(",", "", `Total attendance`)))
Edit:
rugby_table <- structure(list(Year = c(1987L, 1991L, 1995L, 1999L, 2003L, 2007L,
2011L, 2015L, 2019L), `Host(s)` = c("AustraliaNewZealand", "EnglandFranceIrelandScotlandWales",
"SouthAfrica", "Wales", "Australia", "France", "NewZealand",
"England", "Japan"), `Total attendance` = c("604,500", "1,007,760",
"1,100,000", "1,750,000", "1,837,547", "2,263,223", "1,477,294",
"2,477,805", "1,698,528"), Matches = c("32", "32", "32", "41",
"48", "48", "48", "48", "45+"), `Avg attendance` = c("20,156",
"31,493", "34,375", "42,683", "38,282", "47,150", "30,777", "51,621",
"37,745"), `% change in avg att` = c("—", "56%", "9%", "24%",
"–10%", "23%", "–35%", "68%", "–27%"), `Stadium capacity` = c("1,006,350",
"1,212,800", "1,423,850", "2,104,500", "2,208,529", "2,470,660",
"1,732,000", "2,600,741", "1,811,866"), `Attendance as % o~` = c("60%",
"79%", "77%", "83%", "83%", "92%", "85%", "95%", "90%")), row.names = c(NA,
-9L), class = c("tbl_df", "tbl", "data.frame"))
library(dplyr)
rugby_table %>%
mutate(Attendance = as.numeric(gsub(",", "", `Total attendance`))) %>%
select(Attendance)
#> # A tibble: 9 x 1
#> Attendance
#> <dbl>
#> 1 604500
#> 2 1007760
#> 3 1100000
#> 4 1750000
#> 5 1837547
#> 6 2263223
#> 7 1477294
#> 8 2477805
#> 9 1698528
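If several columns need the same comma cleanup, here is a sketch using readr::parse_number() across them (this assumes the dput data above, whose names are free of soft hyphens):
library(dplyr)
library(readr)

rugby_table %>%
  mutate(across(c(`Total attendance`, `Avg attendance`, `Stadium capacity`),
                parse_number))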

Combining rows and generating category counts

I want to be able to first combine rows with a similar attribute into one (for example, one row for each City/Year), and then find the specific counts for types of categories for each of those rows.
For example, with this as the original data:
City Year Type of Death
NYC  1995 Homicide
NYC  1996 Homicide
NYC  1996 Suicide
LA   1995 Suicide
LA   1995 Homicide
LA   1995 Suicide
I want to be able to produce something like this:
City Year n_Total n_Homicides n_Suicides
NYC  1995       1           1          0
NYC  1996       2           1          1
LA   1995       3           1          2
I've tried something like the below, but it only gives me the n_Total and doesn't take into account the splits for n_Homicides and n_Suicides:
library(dplyr)
total_deaths <- data %>%
  group_by(city, year) %>%
  summarize(n_Total = n())
You may do this
library(tidyverse, warn.conflicts = F)
df <- read.table(header = T, text = 'City Year TypeofDeath
NYC 1995 Homicide
NYC 1996 Homicide
NYC 1996 Suicide
LA 1995 Suicide
LA 1995 Homicide
LA 1995 Suicide')
df %>%
  pivot_wider(names_from = TypeofDeath, values_fn = length, values_from = TypeofDeath,
              values_fill = 0, names_prefix = 'n_') %>%
  # cur_data() is superseded by pick() in dplyr >= 1.1
  mutate(n_total = rowSums(select(cur_data(), starts_with('n_'))))
#> # A tibble: 3 x 5
#>   City   Year n_Homicide n_Suicide n_total
#>   <chr> <int>      <int>     <int>   <dbl>
#> 1 NYC    1995          1         0       1
#> 2 NYC    1996          1         1       2
#> 3 LA     1995          1         2       3
Created on 2021-07-05 by the reprex package (v2.0.0)
If you don't have too many types of death, then something simple (albeit a little "manual") like this might have some appeal.
library(dplyr, warn.conflicts = FALSE)
df <- read.table(header = TRUE, text = 'City Year TypeofDeath
NYC 1995 Homicide
NYC 1996 Homicide
NYC 1996 Suicide
LA 1995 Suicide
LA 1995 Homicide
LA 1995 Suicide')
df %>%
  group_by(City, Year) %>%
  summarize(n_Total = n(),
            n_Suicide = sum(TypeofDeath == "Suicide"),
            n_Homicide = sum(TypeofDeath == "Homicide"))
#> `summarise()` has grouped output by 'City'. You can override using the `.groups` argument.
#> # A tibble: 3 x 5
#> # Groups:   City [2]
#>   City   Year n_Total n_Suicide n_Homicide
#>   <chr> <int>   <int>     <int>      <int>
#> 1 LA     1995       3         2          1
#> 2 NYC    1995       1         0          1
#> 3 NYC    1996       2         1          1
Created on 2021-07-05 by the reprex package (v2.0.0)
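A minor tweak to the code above: passing .groups = "drop" to summarize() suppresses the regrouping message and returns an ungrouped tibble:
df %>%
  group_by(City, Year) %>%
  summarize(n_Total    = n(),
            n_Suicide  = sum(TypeofDeath == "Suicide"),
            n_Homicide = sum(TypeofDeath == "Homicide"),
            .groups = "drop")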
You can first dummify your factor variable using the fastDummies package, then summarise(). This is a more general and versatile approach that can be used seamlessly with any number of unique types of death.
If you only have two types of death and will settle for a simpler (though more "manual") approach, you can use the other suggestions with summarise(x = ..., y = ..., total = n()).
library(dplyr)
library(fastDummies)
df %>%
  fastDummies::dummy_cols('TypeofDeath', remove_selected_columns = TRUE) %>%
  group_by(City, Year) %>%
  summarise(across(contains('Type'), sum),
            total_deaths = n())
# A tibble: 3 x 5
# Groups:   City [2]
#   City   Year TypeofDeath_Homicide TypeofDeath_Suicide total_deaths
#   <chr> <int>                <int>               <int>        <int>
# 1 LA     1995                    1                   2            3
# 2 NYC    1995                    1                   0            1
# 3 NYC    1996                    1                   1            2
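Another route (a sketch, not taken from the answers above) is to count first and then widen, which avoids values_fn entirely:
library(dplyr)
library(tidyr)

df %>%
  count(City, Year, TypeofDeath) %>%
  pivot_wider(names_from = TypeofDeath, values_from = n,
              values_fill = 0, names_prefix = "n_") %>%
  mutate(n_Total = n_Homicide + n_Suicide)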

ratio calculation and sort the calculated rates

df <- read.csv('https://raw.githubusercontent.com/ulklc/covid19-timeseries/master/countryReport/raw/rawReport.csv',
               stringsAsFactors = FALSE)
df8 <- read.csv('https://raw.githubusercontent.com/hirenvadher954/Worldometers-Scraping/master/countries.csv',
                stringsAsFactors = FALSE)
install.packages("tidyverse")
library(tidyverse)
df %>%
  left_join(df8, by = c("countryName" = "country_name")) %>%
  mutate(population = as.numeric(str_remove_all(population, ","))) %>%
  group_by(countryName) %>%
  unique() %>%
  summarize(population = sum(population, na.rm = TRUE),
            confirmed = sum(confirmed, na.rm = TRUE),
            recovered = sum(recovered, na.rm = TRUE),
            death = sum(death, na.rm = TRUE),
            death_prop = paste0(as.character(death), "/", as.character(population)))
In this code the death/population rate is calculated. I want to find the 10 countries with the highest death/population rate; the confirmed and recovered columns won't be needed.
# A tibble: 10 × 6
   countryName         population confirmed recovered death death_prop
   <chr>                    <dbl>     <int>     <int> <int> <chr>
 1 Afghanistan         4749258212    141652     16505  3796 3796/4749258212
 2 Albania              351091234     37233     22518  1501 1501/351091234
 3 Algeria             5349827368    206413     88323 20812 20812/5349827368
 4 Andorra                9411324     38518     18054  2015 2015/9411324
 5 Angola              4009685184      1620       435   115 115/4009685184
 6 Anguilla               1814018       161        92     0 0/1814018
 7 Antigua and Barbuda   11947338      1230       514   128 128/11947338
 8 Argentina           5513884428    232975     66155 10740 10740/5513884428
 9 Armenia              361515646    121702     46955  1626 1626/361515646
10 Aruba                 13025452      5194      3135    91 91/13025452
The data shown is just an example; the values themselves are not correct.
The data is cumulative, meaning each day's values already include everything up to the previous day. So take only the max value of each column and then calculate death_prop.
library(dplyr)
library(stringr)  # str_remove_all()

df %>%
  left_join(df8, by = c("countryName" = "country_name")) %>%
  mutate(population = as.numeric(str_remove_all(population, ","))) %>%
  group_by(countryName) %>%
  summarise_at(vars(population:death), max, na.rm = TRUE) %>%
  mutate(death_prop = death / population * 100) %>%
  arrange(desc(death_prop))
# A tibble: 215 x 5
# countryName population year death death_prop
# <chr> <dbl> <dbl> <int> <dbl>
# 1 San Marino 33860 2019 42 0.124
# 2 Belgium 11589623 2020 9312 0.0803
# 3 Andorra 77142 2019 51 0.0661
# 4 Spain 46754778 2020 28752 0.0615
# 5 Italy 60461826 2020 32877 0.0544
# 6 United Kingdom 67886011 2020 36914 0.0544
# 7 France 65273511 2020 28432 0.0436
# 8 Sweden 10099265 2020 4029 0.0399
# 9 Sint Maarten 42388 2019 15 0.0354
#10 Netherlands 17134872 2020 5830 0.0340
# … with 205 more rows
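Since summarise_at() is superseded in current dplyr and the question asks specifically for the top 10, here is a sketch of the same pipeline with across() and slice_head() (same df/df8 assumed):
library(dplyr)
library(stringr)

df %>%
  left_join(df8, by = c("countryName" = "country_name")) %>%
  mutate(population = as.numeric(str_remove_all(population, ","))) %>%
  group_by(countryName) %>%
  summarise(across(population:death, ~ max(.x, na.rm = TRUE))) %>%
  mutate(death_prop = death / population * 100) %>%
  arrange(desc(death_prop)) %>%
  slice_head(n = 10)  # the 10 highest death/population rates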

How do I use the usmap package to convert the state variable to the FIPS code?

I also want to change the state column to be in terms of the FIPS code. Just not sure what parameters to use and how to do this since I am new to R.
Here are the parameters given by R:
plot_usmap(regions = c("states", "state", "counties", "county"),
           include = c(), data = data.frame(), values = "values",
           theme = theme_map(), lines = "black", labels = FALSE,
           label_color = "black")
It is unclear exactly what you are trying to achieve without an example, but here is how I was able to convert a column state in a data.frame from the abbreviation to the FIPS code:
> library(usmap)
> df <- statepop[1:5, -1]
> names(df)[1] <- 'state'
> df
# A tibble: 5 x 3
state full pop_2015
<chr> <chr> <dbl>
1 AL Alabama 4858979
2 AK Alaska 738432
3 AZ Arizona 6828065
4 AR Arkansas 2978204
5 CA California 39144818
> df$fips <- fips(df$state)
> df
# A tibble: 5 x 4
state full pop_2015 fips
<chr> <chr> <dbl> <chr>
1 AL Alabama 4858979 01
2 AK Alaska 738432 02
3 AZ Arizona 6828065 04
4 AR Arkansas 2978204 05
5 CA California 39144818 06
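Note that fips() also accepts full state names and, if you ever need them, county codes via its county argument; a quick sketch:
library(usmap)

fips("NJ")                          # "34"
fips("California")                  # "06"
fips("CA", county = "Los Angeles")  # "06037"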

R: conditional aggregate based on factor level and year

I have a dataset in R which I am trying to aggregate by column level and year which looks like this:
City  State Year Status   Year_repealed PolicyNo
Pitt  PA    2001 InForce                       6
Phil. PA    2001 Repealed 2004                 9
Pitt  PA    2002 InForce                       7
Pitt  PA    2005 InForce                       2
What I would like is, for each Year, to aggregate PolicyNo across cities within each state, taking into account the year each policy was repealed. The result I would then get is:
Year State PolicyNo
2001 PA          15
2002 PA          22
2003 PA          22
2004 PA          13
2005 PA          15
I am not sure how to go about splitting and aggregating the data conditional on the repeal date, and was wondering if there is a way to achieve this in R easily.
It may help you to break this up into two distinct problems:
1. Get a table that shows the change in PolicyNo in every city-state-year.
2. Summarize that table to show the PolicyNo in each state-year.
To accomplish (1) we add the missing years with NA PolicyNo, and add repeals as negative PolicyNo observations.
library(dplyr)
df = structure(list(City = c("Pitt", "Phil.", "Pitt", "Pitt"), State = c("PA", "PA", "PA", "PA"), Year = c(2001L, 2001L, 2002L, 2005L), Status = c("InForce", "Repealed", "InForce", "InForce"), Year_repealed = c(NA, 2004L, NA, NA), PolicyNo = c(6L, 9L, 7L, 2L)), .Names = c("City", "State", "Year", "Status", "Year_repealed", "PolicyNo"), class = "data.frame", row.names = c(NA, -4L))
repeals = df %>%
  filter(!is.na(Year_repealed)) %>%
  mutate(Year = Year_repealed, PolicyNo = -1 * PolicyNo)
repeals
# City State Year Status Year_repealed PolicyNo
# 1 Phil. PA 2004 Repealed 2004 -9
all_years = expand.grid(City = unique(df$City), State = unique(df$State),
                        Year = 2001:2005)
df = bind_rows(df, repeals, all_years)
# City State Year Status Year_repealed PolicyNo
# 1 Pitt PA 2001 InForce NA 6
# 2 Phil. PA 2001 Repealed 2004 9
# 3 Pitt PA 2002 InForce NA 7
# 4 Pitt PA 2005 InForce NA 2
# 5 Phil. PA 2004 Repealed 2004 -9
# 6 Pitt PA 2001 <NA> NA NA
# 7 Phil. PA 2001 <NA> NA NA
# 8 Pitt PA 2002 <NA> NA NA
# 9 Phil. PA 2002 <NA> NA NA
# 10 Pitt PA 2003 <NA> NA NA
# 11 Phil. PA 2003 <NA> NA NA
# 12 Pitt PA 2004 <NA> NA NA
# 13 Phil. PA 2004 <NA> NA NA
# 14 Pitt PA 2005 <NA> NA NA
# 15 Phil. PA 2005 <NA> NA NA
Now the table shows every city-state-year and incorporates repeals. This is a table we can summarize.
df = df %>%
  group_by(Year, State) %>%
  summarize(annual_change = sum(PolicyNo, na.rm = TRUE))
df
# Source: local data frame [5 x 3]
# Groups: Year [?]
#
# Year State annual_change
# <int> <chr> <dbl>
# 1 2001 PA 15
# 2 2002 PA 7
# 3 2003 PA 0
# 4 2004 PA -9
# 5 2005 PA 2
That gets us PolicyNo change in each state-year. A cumulative sum over the changes gets us levels.
df = df %>%
  ungroup() %>%
  mutate(PolicyNo = cumsum(annual_change))
df
# # A tibble: 5 × 4
# Year State annual_change PolicyNo
# <int> <chr> <dbl> <dbl>
# 1 2001 PA 15 15
# 2 2002 PA 7 22
# 3 2003 PA 0 22
# 4 2004 PA -9 13
# 5 2005 PA 2 15
With the data.table package you could do it as follows:
melt(setDT(dat),
measure.vars = c(3,5),
value.name = 'Year',
value.factor = FALSE)[!is.na(Year)
][variable == 'Year_repealed', PolicyNo := -1*PolicyNo
][CJ(Year = min(Year):max(Year), State = State, unique = TRUE), on = .(Year, State)
][is.na(PolicyNo), PolicyNo := 0
][, .(PolicyNo = sum(PolicyNo)), by = .(Year, State)
][, .(Year, State, PolicyNo = cumsum(PolicyNo))]
The result of the above code:
Year State PolicyNo
1: 2001 PA 15
2: 2002 PA 22
3: 2003 PA 22
4: 2004 PA 13
5: 2005 PA 15
As you can see, several steps are needed to reach the desired end result:
First you convert to a data.table (setDT(dat)), reshape it into long format, and remove the rows with no Year.
Then you make PolicyNo negative for the rows that came from 'Year_repealed'.
With a cross-join (CJ) you make sure that all the years are present for each state, and convert the NA values in the PolicyNo column to zero.
Finally, you summarise by year and state and take a cumulative sum of the result.
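For what it's worth, the gap-filling in the dplyr answer above could also be done with tidyr's complete() and full_seq() instead of expand.grid() + bind_rows(); a sketch, starting again from the original df:
library(dplyr)
library(tidyr)

repeals <- df %>%
  filter(!is.na(Year_repealed)) %>%
  mutate(Year = Year_repealed, PolicyNo = -PolicyNo)

bind_rows(df, repeals) %>%
  group_by(State, Year) %>%
  summarise(annual_change = sum(PolicyNo), .groups = "drop") %>%
  # fill in the missing years (e.g. 2003) with a zero change
  complete(State, Year = full_seq(Year, 1),
           fill = list(annual_change = 0)) %>%
  group_by(State) %>%
  mutate(PolicyNo = cumsum(annual_change)) %>%
  ungroup()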
