Group and summarize census data in R

I have census data that is listed by country and separated by wards. There is also a variable for continent. Here is a sample dataset.
df1 <- data.frame(country = c("Brazil", "Colombia", "Croatia", "France"),
                  ward_1 = c(45, 35, 15, 80),
                  ward_2 = c(25, 55, 10, 145),
                  ward_23 = c(105, 65, 25, 85),
                  continent = c("Americas", "Americas", "Europe", "Europe"))
I need to sum by continent for each of the wards. This is the output I am trying to achieve:
df2 <- data.frame(continent = c("Americas", "Europe"),
                  ward_1 = c(80, 95),
                  ward_2 = c(80, 155),
                  ward_23 = c(170, 110))
I think I have to use group_by(continent) but then how do you output the sum for each ward?

What you need is summarise() after group_by().
Inside across(), starts_with("ward") selects every column whose name begins with "ward", and sum() is applied to each of them.
library(dplyr)

df1 %>%
  group_by(continent) %>%
  summarize(across(starts_with("ward"), sum))
# A tibble: 2 x 4
  continent ward_1 ward_2 ward_23
  <chr>      <dbl>  <dbl>   <dbl>
1 Americas      80     80     170
2 Europe        95    155     110
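For comparison, the same per-continent sums can be computed in base R with aggregate(), no packages required:

```r
# Sample data from the question
df1 <- data.frame(country = c("Brazil", "Colombia", "Croatia", "France"),
                  ward_1 = c(45, 35, 15, 80),
                  ward_2 = c(25, 55, 10, 145),
                  ward_23 = c(105, 65, 25, 85),
                  continent = c("Americas", "Americas", "Europe", "Europe"))

# Sum every ward column within each continent
aggregate(cbind(ward_1, ward_2, ward_23) ~ continent, data = df1, FUN = sum)
#   continent ward_1 ward_2 ward_23
# 1  Americas     80     80     170
# 2    Europe     95    155     110
```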


Add new column and aggregate data

I am a novice at R programming and stuck with a problem.
Here's a sample dataset:
df <- data.frame(
  area_id = c(31, 34, 36, 33, 28, 35, 31, 34, 36, 33, 28, 35),
  description = c('paramount', 'sony', 'star', 'miramax', 'pixar', 'zee',
                  'paramount', 'sony', 'star', 'miramax', 'pixar', 'zee'),
  footfall = c(200, 354, 543, 123, 456, 634, 356, 765, 345, 235, 657, 524),
  income = c(21000, 19000, 35000, 18000, 12000, 190000,
             21000, 19000, 35000, 18000, 12000, 190000),
  year = c(2019, 2019, 2019, 2019, 2019, 2019, 2020, 2020, 2020, 2020, 2020, 2020))
Now, I have two requirements:
Adding a column named "region" with values based on "area_id";
So, areas with "area_id" = 28, 34, 36 should have value as "West" in "region" column.
Similarly, areas with "area_id" = 31, 33, 35 should have value as "East" in "region" column.
Finally, I want a summary table stratified by year and aggregated region-wise, with one row per region and measure.
Can anyone please help me out?
You can do it like this:
library(tidyverse)

west <- c(28, 34, 36)

df2 <- df %>%
  mutate(region = case_when(area_id %in% west ~ "West",
                            TRUE ~ "East")) %>%
  pivot_longer(cols = c(footfall, income), names_to = "Header", values_to = "val") %>%
  group_by(region, Header, year) %>%
  summarise(val = sum(val)) %>%
  pivot_wider(id_cols = c(region, Header), names_from = year, values_from = val) %>%
  mutate(Total = `2019` + `2020`)

df2
# A tibble: 4 x 5
# Groups:   region, Header [4]
  region Header   `2019` `2020`  Total
  <chr>  <chr>     <dbl>  <dbl>  <dbl>
1 East   footfall    957   1115   2072
2 East   income   229000 229000 458000
3 West   footfall   1353   1767   3120
4 West   income    66000  66000 132000
The result above is assigned to df2; if you check its class
class(df2)
[1] "tbl_df"     "tbl"        "data.frame"
it will be the same as class(df).
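If you only need region-by-year totals rather than the wide year columns, the pivoting can be skipped entirely. A sketch using across() on the question's data:

```r
library(dplyr)

df <- data.frame(
  area_id = c(31, 34, 36, 33, 28, 35, 31, 34, 36, 33, 28, 35),
  description = c('paramount', 'sony', 'star', 'miramax', 'pixar', 'zee',
                  'paramount', 'sony', 'star', 'miramax', 'pixar', 'zee'),
  footfall = c(200, 354, 543, 123, 456, 634, 356, 765, 345, 235, 657, 524),
  income = c(21000, 19000, 35000, 18000, 12000, 190000,
             21000, 19000, 35000, 18000, 12000, 190000),
  year = c(2019, 2019, 2019, 2019, 2019, 2019, 2020, 2020, 2020, 2020, 2020, 2020))

west <- c(28, 34, 36)

# Tag each area with its region, then sum both measures per region and year
df %>%
  mutate(region = if_else(area_id %in% west, "West", "East")) %>%
  group_by(region, year) %>%
  summarise(across(c(footfall, income), sum), .groups = "drop")
#   region year footfall income
# 1 East   2019      957 229000
# 2 East   2020     1115 229000
# 3 West   2019     1353  66000
# 4 West   2020     1767  66000
```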

rename_with but predicate based on value in other variable

Is there a way to use rename_with(), but where the predicate function operates on a value in another variable instead of on the column name?
Say I have a dataset as follows:
data <- tibble(home_team = c("SF", "KC", "JAX", "WAS", "BUF"),
               away_team = c("GB", "CAR", "HOU", "NYG", "SEA"),
               home_total = c(21, 25, 30, 22, 23.5),
               home_plays = c(65, 64, 63, 57, 60),
               away_total = c(30, 22, 25, 22, 25),
               away_plays = c(56, 62, 66, 59, 62))
And I am trying to get it to look something like:
finalized_data <- tibble(team = c("SF", "KC", "JAX", "WAS", "BUF", "GB", "CAR", "HOU", "NYG", "SEA"),
                         total = c(21, 25, 30, 22, 23.5, 30, 22, 25, 22, 25),
                         plays = c(65, 64, 63, 57, 60, 56, 62, 66, 59, 62))
Currently the best way I know is a mutate() call, which gets long when there are a lot of variables. There must be a cleaner way, since it's essentially a rename I'm doing based on a variable in the data.
library(stringr)  # for str_detect()

current_way <- data %>%
  pivot_longer(c(home_team, away_team), names_to = "team_type", values_to = "team") %>%
  mutate(total = ifelse(str_detect(team_type, "home_team"), home_total, away_total),
         plays = ifelse(str_detect(team_type, "home_team"), home_plays, away_plays)) %>%
  select(team, total, plays)
Any thoughts, or is there even a way to do it in the pivot function that I am missing?
Here is an option with pivot_longer, making use of the pattern in the column names ("home_*"/"away_*") to split them into separate columns:
library(dplyr)
library(tidyr)
data %>%
  pivot_longer(cols = everything(), names_to = c("grp", ".value"),
               names_sep = "_") %>%
  arrange(desc(grp)) %>%
  select(-grp)
Output:
# A tibble: 10 x 3
# team total plays
# <chr> <dbl> <dbl>
# 1 SF 21 65
# 2 KC 25 64
# 3 JAX 30 63
# 4 WAS 22 57
# 5 BUF 23.5 60
# 6 GB 30 56
# 7 CAR 22 62
# 8 HOU 25 66
# 9 NYG 22 59
#10 SEA 25 62
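An equivalent, more explicit alternative is to split the home and away columns manually and stack them with bind_rows(); it avoids pivoting at the cost of repeating the column list once per prefix:

```r
library(dplyr)

data <- tibble(home_team = c("SF", "KC", "JAX", "WAS", "BUF"),
               away_team = c("GB", "CAR", "HOU", "NYG", "SEA"),
               home_total = c(21, 25, 30, 22, 23.5),
               home_plays = c(65, 64, 63, 57, 60),
               away_total = c(30, 22, 25, 22, 25),
               away_plays = c(56, 62, 66, 59, 62))

# select() renames while selecting, so each half gets the common names
bind_rows(
  data %>% select(team = home_team, total = home_total, plays = home_plays),
  data %>% select(team = away_team, total = away_total, plays = away_plays)
)
```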

In dplyr, how to select and filter on different columns, depending on whether a certain column is in the dataframe?

I need to extract a few rows & columns from a dataframe:
library(dplyr)
foo <- structure(list(iso3c = c("SWZ", "SVN", "NZL", "JAM", "ESP", "LSO",
"ATG", "GEO", "GIB", "BHS"), country = c("Eswatini", "Slovenia",
"New Zealand", "Jamaica", "Spain", "Lesotho", "Antigua & Barbuda",
"Georgia", "Gibraltar", "Bahamas"), confirmed = c(1, 141, 1522, 0, 148220, 4,
19, 794, NA, 102), deaths = c(0, 0, 22, 0, 14792, 0, 2, 12, NA,
11)), row.names = c(NA, -10L), class = c("tbl_df", "tbl", "data.frame"
))
compute_epidemic_curve_data <- function(df, country_code) {
  epidemic_curve_data <- df %>%
    filter(iso3c == country_code) %>%
    select(iso3c, country, confirmed)
  return(epidemic_curve_data)
}
print(result <- compute_epidemic_curve_data(foo, "SVN"))
However, the data can come from different sources, which means that sometimes the dataframe will have a different structure. Basically, column iso3c is called id, column country is called admin_region and an additional column called tests is present. E.g.:
bar <- structure(list(id = c("SWZ", "SVN", "NZL", "JAM", "ESP", "LSO",
"ATG", "GEO", "GIB", "BHS"), admin_region = c("Eswatini", "Slovenia",
"New Zealand", "Jamaica", "Spain", "Lesotho", "Antigua & Barbuda",
"Georgia", "Gibraltar", "Bahamas"), confirmed = c(1, 141, 1522, 0, 148220, 4,
19, 794, NA, 102), deaths = c(0, 0, 22, 0, 14792, 0, 2, 12, NA,
11), tests = c(2, 282, 3044, 0, 296440, 8, 38, 1588, NA, 204)), row.names = c(NA,
-10L), class = c("tbl_df", "tbl", "data.frame"))
Now, compute_epidemic_curve_data must also return tests, i.e., it becomes:
compute_epidemic_curve_data <- function(df, country_code) {
  epidemic_curve_data <- df %>%
    filter(id == country_code) %>%
    select(id, admin_region, date, confirmed, tests)
  return(epidemic_curve_data)
}
The barbaric way to solve this would be:
compute_epidemic_curve_data <- function(df, country_code) {
  if ("id" %in% colnames(df)) {
    epidemic_curve_data <- df %>%
      filter(id == country_code) %>%
      select(id, admin_region, date, confirmed, tests)
  } else {
    epidemic_curve_data <- df %>%
      filter(iso3c == country_code) %>%
      select(iso3c, country, date, confirmed)
  }
  return(epidemic_curve_data)
}
but it seems a bad idea to duplicate so much code. Is it possible to have the same function handle two data sources, while at the same time reducing code duplication?
We can also use filter_at with matches:
compute_epidemic_curve_data <- function(df, country_code) {
  df %>%
    filter_at(vars(matches('iso3c|id')), ~ . == country_code) %>%
    # or with across:
    # filter(across(matches('iso3c|id'), ~ . == country_code)) %>%
    select(matches('iso3c|id'), everything(), -deaths)
}
Testing:
compute_epidemic_curve_data(foo, "SVN")
# A tibble: 1 x 3
# iso3c country confirmed
# <chr> <chr> <dbl>
#1 SVN Slovenia 141
compute_epidemic_curve_data(bar, "SVN")
# A tibble: 1 x 4
# id admin_region confirmed tests
# <chr> <chr> <dbl> <dbl>
#1 SVN Slovenia 141 282
The idiomatic way to choose between possible column names dynamically within a tidyverse selecting function is to use tidyselect::any_of:
compute_epidemic_curve_data <- function(df, country_code) {
  df <- if ("iso3c" %in% names(df))
    filter(df, iso3c == country_code)
  else
    filter(df, id == country_code)
  select(df, tidyselect::any_of(c("id", "iso3c", "country", "confirmed",
                                  "admin_region", "date", "tests")))
}
Resulting in
print(result <- compute_epidemic_curve_data(foo, "SVN"))
#> # A tibble: 1 x 3
#> iso3c country confirmed
#> <chr> <chr> <dbl>
#> 1 SVN Slovenia 141
print(result <- compute_epidemic_curve_data(bar, "SVN"))
#> # A tibble: 1 x 4
#> id confirmed admin_region tests
#> <chr> <dbl> <chr> <dbl>
#> 1 SVN 141 Slovenia 282
This avoids some duplication, but whether it is more elegant or readable is debatable:
library(dplyr)
compute_epidemic_curve_data <- function(df, country_code) {
  if ("id" %in% colnames(df)) {
    id <- "id"
    sel <- c(id, "admin_region", "tests")
  } else {
    id <- "iso3c"
    sel <- c(id, "country")
  }
  epidemic_curve_data <- df %>%
    filter(!!sym(id) == country_code) %>%
    select(all_of(sel), confirmed)
  return(epidemic_curve_data)
}
compute_epidemic_curve_data(bar, "SVN")
#> # A tibble: 1 x 4
#> id admin_region tests confirmed
#> <chr> <chr> <dbl> <dbl>
#> 1 SVN Slovenia 282 141
compute_epidemic_curve_data(foo, "SVN")
#> # A tibble: 1 x 3
#> iso3c country confirmed
#> <chr> <chr> <dbl>
#> 1 SVN Slovenia 141
Created on 2020-07-11 by the reprex package (v0.3.0)
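Another way to cut the duplication is to normalize the column names up front with rename(any_of()), which takes a named vector of new = old names and silently skips names that are absent, so only one filter/select path is needed. A sketch, assuming dplyr >= 1.0:

```r
library(dplyr)

compute_epidemic_curve_data <- function(df, country_code) {
  df %>%
    # Rename id -> iso3c and admin_region -> country where those columns exist;
    # any_of() ignores names that are not present in this source
    rename(any_of(c(iso3c = "id", country = "admin_region"))) %>%
    filter(iso3c == country_code) %>%
    # tests only exists in one of the sources, so select it conditionally too
    select(iso3c, country, confirmed, any_of("tests"))
}
```

With the foo and bar dataframes above, compute_epidemic_curve_data(foo, "SVN") returns the three-column result, and compute_epidemic_curve_data(bar, "SVN") additionally includes tests.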

How to plot this picture using ggplot2?

I have a simple dataset. It shows the GDP per capita of the richest and the poorest regions in nine countries in 2000 and 2015, as well as the gap in GDP per capita between the poorest and richest regions. Below is a reproducible version of this dataset:
structure(list(Country = c("Britain", "Germany", "United State",
"France", "South Korea", "Italy", "Japan", "Spain", "Sweden"),
Poor2000 = c(69, 50, 74, 52, 79, 50, 80, 80, 90), Poor2015 = c(61,
48, 73, 50, 73, 52, 78, 84, 82), Rich2000 = c(848, 311, 290,
270, 212, 180, 294, 143, 148), Rich2015 = c(1150, 391, 310,
299, 200, 198, 290, 151, 149)), row.names = c(NA, -9L), class = c("tbl_df",
"tbl", "data.frame"))
I wanna make a plot like this:
In this plot I just want to show the GDP per capita of the poorest regions in the nine countries in 2000 and 2015 (the draft picture has just three countries for convenience). But I don't know how to do it using ggplot, because it seems I need to set the x-axis to "Country" and the y-axis to the two variables "Poor2000" and "Poor2015". Thanks in advance.
Here is a possible solution. Starting from your dataframe, first reshape it into a longer format. To do that, use the pivot_longer function from the tidyr package:
library(tidyr)
library(dplyr)
library(ggplot2)

DF <- df %>%
  select(Country, Poor2000, Poor2015) %>%
  mutate(Diff = Poor2015 - Poor2000) %>%
  pivot_longer(-Country, names_to = "Poor", values_to = "value")
# A tibble: 27 x 3
Country Poor value
<fct> <chr> <dbl>
1 Britain Poor2000 69
2 Britain Poor2015 61
3 Britain Diff -8
4 Germany Poor2000 50
5 Germany Poor2015 48
6 Germany Diff -2
7 United States Poor2000 74
8 United States Poor2015 73
9 United States Diff -1
10 France Poor2000 52
# … with 17 more rows
We will also create a second dataframe that will contain the difference of values between Poor2000 and Poor2015:
DF_second_label <- df %>%
  select(Country, Poor2000, Poor2015) %>%
  group_by(Country) %>%
  mutate(Diff = Poor2015 - Poor2000, ypos = max(Poor2000, Poor2015))
# A tibble: 9 x 5
# Groups: Country [9]
Country Poor2000 Poor2015 Diff ypos
<fct> <dbl> <dbl> <dbl> <dbl>
1 Britain 69 61 -8 69
2 Germany 50 48 -2 50
3 United States 74 73 -1 74
4 France 52 50 -2 52
5 South Korea 79 73 -6 79
6 Italy 50 52 2 52
7 Japan 80 78 -2 80
8 Spain 80 84 4 84
9 Sweden 90 82 -8 90
Then we can plot both new dataframes in ggplot2, keeping only the countries of interest via the subset function:
ggplot(subset(DF, Poor != "Diff" & Country %in% c("Britain", "South Korea", "Sweden")),
       aes(x = Country, y = value, fill = Poor)) +
  geom_col(position = position_dodge()) +
  geom_text(aes(label = value), position = position_dodge(0.9), vjust = -0.5,
            show.legend = FALSE) +
  geom_text(inherit.aes = FALSE,
            data = subset(DF_second_label, Country %in% c("Britain", "South Korea", "Sweden")),
            aes(x = Country, y = ypos + 10, label = Diff),
            color = "darkgreen", size = 6, show.legend = FALSE) +
  labs(x = "", y = "GDP per Person", title = "Poor in 2000 & 2015") +
  theme(plot.title = element_text(hjust = 0.5))
And you get:
Reproducible example
df <- data.frame(Country = c("Britain", "Germany", "United States", "France", "South Korea",
                             "Italy", "Japan", "Spain", "Sweden"),
                 Poor2000 = c(69, 50, 74, 52, 79, 50, 80, 80, 90),
                 Poor2015 = c(61, 48, 73, 50, 73, 52, 78, 84, 82),
                 Rich2000 = c(848, 311, 290, 270, 212, 180, 294, 143, 148),
                 Rich2015 = c(1150, 391, 310, 299, 200, 198, 290, 151, 149))
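As a side note, the reshape and the year handling can be combined by letting pivot_longer() split names like "Poor2000" into a group part and a year part via names_pattern. A sketch on a three-country subset of the data:

```r
library(tidyr)
library(dplyr)

df <- data.frame(Country = c("Britain", "Germany", "United States"),
                 Poor2000 = c(69, 50, 74),
                 Poor2015 = c(61, 48, 73))

# The regex captures the letter prefix ("Poor") and the digits ("2000"/"2015")
# into separate columns, one per capture group
df %>%
  pivot_longer(-Country,
               names_to = c("group", "year"),
               names_pattern = "([A-Za-z]+)(\\d+)",
               values_to = "value")
```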

How can I count the number of grouped pairs in which one row's column value is greater than another?

I have a dataset (df1) with a number of paired values. One row of the pair is for one year (e.g., 2014), the other for a different year (e.g., 2013). For each pair is a value in the column G. I need a count of the number of pairs in which the G value for the higher year is less than the G value for the lesser year.
Here is my dput for the dataset df1:
structure(list(Name = c("A.J. Ellis", "A.J. Ellis", "A.J. Pierzynski",
"A.J. Pierzynski", "Aaron Boone", "Adam Kennedy", "Adam Melhuse",
"Adrian Beltre", "Adrian Beltre", "Adrian Gonzalez", "Alan Zinter",
"Albert Pujols", "Albert Pujols"), Age = c(37, 36, 37, 36, 36,
36, 36, 37, 36, 36, 36, 37, 36), Year = c(2018, 2017, 2014, 2013,
2009, 2012, 2008, 2016, 2015, 2018, 2004, 2017, 2016), Tm = c("SDP",
"MIA", "TOT", "TEX", "HOU", "LAD", "TOT", "TEX", "TEX", "NYM",
"ARI", "LAA", "LAA"), Lg = c("NL", "NL", "ML", "AL", "NL", "NL",
"ML", "AL", "AL", "NL", "NL", "AL", "AL"), G = c(66, 51, 102,
134, 10, 86, 15, 153, 143, 54, 28, 149, 152), PA = c(183, 163,
362, 529, 14, 201, 32, 640, 619, 187, 40, 636, 650)), row.names = c(NA,
13L), class = "data.frame")
Here is a tibble that shows the look of the rows to be checked:
https://www.dropbox.com/s/3nbfi9le568qb3s/grouped-pairs.png?dl=0
Here is the code I used to create the tibble:
df1 %>%
  group_by(Name) %>%
  filter(n() > 1)
We could arrange the data by Name and Age, check for each name whether the last value of G is less than the first, and count those occurrences with sum (note the data is df1, not df):
library(dplyr)

df1 %>%
  arrange(Name, Age) %>%
  group_by(Name) %>%
  summarise(check = last(G) < first(G)) %>%
  pull(check) %>%
  sum(na.rm = TRUE)
#[1] 2
If you want to see the pairs in which the G value for the higher year is less than the G value for the lesser year, we could use filter:
df1 %>%
  arrange(Name, Age) %>%
  group_by(Name) %>%
  filter(last(G) < first(G))
#  Name              Age  Year Tm    Lg        G    PA
#  <chr>           <dbl> <dbl> <chr> <chr> <dbl> <dbl>
#1 A.J. Pierzynski    36  2013 TEX   AL      134   529
#2 A.J. Pierzynski    37  2014 TOT   ML      102   362
#3 Albert Pujols      36  2016 LAA   AL      152   650
#4 Albert Pujols      37  2017 LAA   AL      149   636
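An alternative count that does not depend on sort order is to index G directly by Year within each group. A sketch, using a subset of the question's rows and columns for brevity:

```r
library(dplyr)

# Three of the paired names from the question's dput, Name/Year/G only
df1 <- data.frame(
  Name = c("A.J. Ellis", "A.J. Ellis", "A.J. Pierzynski", "A.J. Pierzynski",
           "Albert Pujols", "Albert Pujols"),
  Year = c(2018, 2017, 2014, 2013, 2017, 2016),
  G = c(66, 51, 102, 134, 149, 152))

# For each name, compare G in the later year against G in the earlier year
df1 %>%
  group_by(Name) %>%
  filter(n() > 1) %>%
  summarise(declined = G[which.max(Year)] < G[which.min(Year)]) %>%
  summarise(n = sum(declined))
# n = 2
```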
