I am a novice at R programming and am stuck on a problem.
Here's a sample dataset:
df <- data.frame(
  area_id = c(31, 34, 36, 33, 28, 35, 31, 34, 36, 33, 28, 35),
  description = c('paramount', 'sony', 'star', 'miramax', 'pixar', 'zee',
                  'paramount', 'sony', 'star', 'miramax', 'pixar', 'zee'),
  footfall = c(200, 354, 543, 123, 456, 634, 356, 765, 345, 235, 657, 524),
  income = c(21000, 19000, 35000, 18000, 12000, 190000, 21000, 19000, 35000, 18000, 12000, 190000),
  year = c(2019, 2019, 2019, 2019, 2019, 2019, 2020, 2020, 2020, 2020, 2020, 2020))
Now, I have two requirements:
First, add a column named "region" with values based on "area_id": areas with area_id 28, 34, or 36 should get the value "West", and areas with area_id 31, 33, or 35 should get the value "East".
Second, I want a summary table stratified by year and aggregated region-wise. The final table should look like the output shown below.
Can anyone please help me out?
You can do it like this:
library(tidyverse)

west <- c(28, 34, 36)

df2 <- df %>%
  mutate(region = case_when(area_id %in% west ~ "West",
                            TRUE ~ "East")) %>%
  pivot_longer(cols = c(footfall, income), names_to = "Header", values_to = "val") %>%
  group_by(region, Header, year) %>%
  summarise(val = sum(val)) %>%
  pivot_wider(id_cols = c(region, Header), names_from = year, values_from = val) %>%
  mutate(Total = `2019` + `2020`)
# A tibble: 4 x 5
# Groups: region, Header [4]
region Header `2019` `2020` Total
<chr> <chr> <dbl> <dbl> <dbl>
1 East footfall 957 1115 2072
2 East income 229000 229000 458000
3 West footfall 1353 1767 3120
4 West income 66000 66000 132000
Since the result has been assigned to df2, you can check its class:
class(df2)
[1] "tbl_df" "tbl" "data.frame"
which is the same as class(df).
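As an aside, a named lookup vector is another common way to derive region from area_id; unlike the TRUE ~ "East" fallback, any unexpected area_id becomes NA rather than being silently labelled "East". A minimal sketch using the same df:

```r
library(dplyr)

# Named lookup vector: names are area_ids, values are regions
region_lookup <- c(`28` = "West", `34` = "West", `36` = "West",
                   `31` = "East", `33` = "East", `35` = "East")

df_regions <- df %>%
  mutate(region = unname(region_lookup[as.character(area_id)]))
```

This is just an alternative way to express the mapping; the case_when() version above works equally well.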
I have census data that is listed by country and separated by wards. There is also a variable for continent. Here is a sample dataset.
df1 <- data.frame(
  country = c("Brazil", "Colombia", "Croatia", "France"),
  ward_1 = c(45, 35, 15, 80),
  ward_2 = c(25, 55, 10, 145),
  ward_23 = c(105, 65, 25, 85),
  continent = c("Americas", "Americas", "Europe", "Europe"))
I need to sum by continent for each of the wards. This is the output I am trying to achieve:
df2 <- data.frame(continent = c("Americas", "Europe"), ward_1 = c(80, 95), ward_2 = c(80, 155), ward_23 = c(170, 110))
I think I have to use group_by(continent) but then how do you output the sum for each ward?
What you need is summarise() after group_by().
Inside across(), starts_with("ward") selects every column whose name starts with "ward", and each of those columns is summed.
library(dplyr)
df1 %>%
  group_by(continent) %>%
  summarize(across(starts_with("ward"), ~ sum(.)))
# A tibble: 2 x 4
continent ward_1 ward_2 ward_23
<chr> <dbl> <dbl> <dbl>
1 Americas 80 80 170
2 Europe 95 155 110
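For comparison, the same ward sums can be computed in base R with aggregate(); this is just a sketch on the df1 from the question, and the across() version above scales better when there are many ward columns, since it selects them by prefix rather than listing each one:

```r
# Sum each ward column within each continent (base R)
aggregate(cbind(ward_1, ward_2, ward_23) ~ continent, data = df1, FUN = sum)
```

This produces the same continent-by-ward totals as the expected df2 in the question.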
I need to extract a few rows & columns from a dataframe:
library(dplyr)
foo <- structure(list(iso3c = c("SWZ", "SVN", "NZL", "JAM", "ESP", "LSO",
"ATG", "GEO", "GIB", "BHS"), country = c("Eswatini", "Slovenia",
"New Zealand", "Jamaica", "Spain", "Lesotho", "Antigua & Barbuda",
"Georgia", "Gibraltar", "Bahamas"), confirmed = c(1, 141, 1522, 0, 148220, 4,
19, 794, NA, 102), deaths = c(0, 0, 22, 0, 14792, 0, 2, 12, NA,
11)), row.names = c(NA, -10L), class = c("tbl_df", "tbl", "data.frame"
))
compute_epidemic_curve_data <- function(df, country_code) {
  epidemic_curve_data <- df %>%
    filter(iso3c == country_code) %>%
    select(iso3c, country, confirmed)
  return(epidemic_curve_data)
}
print(result <- compute_epidemic_curve_data(foo, "SVN"))
However, the data can come from different sources, which means that sometimes the dataframe will have a different structure. Basically, column iso3c is called id, column country is called admin_region and an additional column called tests is present. E.g.:
bar <- structure(list(id = c("SWZ", "SVN", "NZL", "JAM", "ESP", "LSO",
"ATG", "GEO", "GIB", "BHS"), admin_region = c("Eswatini", "Slovenia",
"New Zealand", "Jamaica", "Spain", "Lesotho", "Antigua & Barbuda",
"Georgia", "Gibraltar", "Bahamas"), confirmed = c(1, 141, 1522, 0, 148220, 4,
19, 794, NA, 102), deaths = c(0, 0, 22, 0, 14792, 0, 2, 12, NA,
11), tests = c(2, 282, 3044, 0, 296440, 8, 38, 1588, NA, 204)), row.names = c(NA,
-10L), class = c("tbl_df", "tbl", "data.frame"))
Now, compute_epidemic_curve_data must also return tests, i.e., it becomes:
compute_epidemic_curve_data <- function(df, country_code) {
  epidemic_curve_data <- df %>%
    filter(id == country_code) %>%
    select(id, admin_region, date, confirmed, tests)
  return(epidemic_curve_data)
}
The barbaric way to solve this would be:
compute_epidemic_curve_data <- function(df, country_code) {
  if ("id" %in% colnames(df)) {
    epidemic_curve_data <- df %>%
      filter(id == country_code) %>%
      select(id, admin_region, date, confirmed, tests)
  } else {
    epidemic_curve_data <- df %>%
      filter(iso3c == country_code) %>%
      select(iso3c, country, date, confirmed)
  }
  return(epidemic_curve_data)
}
but it seems a bad idea to duplicate so much code. Is it possible to have the same function handle two data sources, while at the same time reducing code duplication?
We can also use filter_at with matches
compute_epidemic_curve_data <- function(df, country_code) {
  df %>%
    filter_at(vars(matches('iso3c|id')), ~ . == country_code) %>%
    # or with across
    # filter(across(matches('iso3c|id'), ~ . == country_code)) %>%
    select(matches('iso3c|id'), everything(), -deaths)
}
Testing:
compute_epidemic_curve_data(foo, "SVN")
# A tibble: 1 x 3
# iso3c country confirmed
# <chr> <chr> <dbl>
#1 SVN Slovenia 141
compute_epidemic_curve_data(bar, "SVN")
# A tibble: 1 x 4
# id admin_region confirmed tests
# <chr> <chr> <dbl> <dbl>
#1 SVN Slovenia 141 282
The idiomatic way to choose between possible column names dynamically within a tidyverse selecting function is to use tidyselect::any_of:
compute_epidemic_curve_data <- function(df, country_code) {
  df <- if ("iso3c" %in% names(df))
    filter(df, iso3c == country_code)
  else
    filter(df, id == country_code)
  select(df, tidyselect::any_of(c("id", "iso3c", "country", "confirmed",
                                  "admin_region", "date", "tests")))
}
Resulting in
print(result <- compute_epidemic_curve_data(foo, "SVN"))
#> # A tibble: 1 x 3
#> iso3c country confirmed
#> <chr> <chr> <dbl>
#> 1 SVN Slovenia 141
print(result <- compute_epidemic_curve_data(bar, "SVN"))
#> # A tibble: 1 x 4
#> id confirmed admin_region tests
#> <chr> <dbl> <chr> <dbl>
#> 1 SVN 141 Slovenia 282
This avoids some duplication, but whether it is more elegant or readable is debatable:
library(dplyr)
compute_epidemic_curve_data <- function(df, country_code) {
  if ("id" %in% colnames(df)) {
    id <- "id"
    sel <- c(id, "admin_region", "tests")
  } else {
    id <- "iso3c"
    sel <- c(id, "country")
  }
  epidemic_curve_data <- df %>%
    filter(!!sym(id) == country_code) %>%
    select(all_of(sel), confirmed)
  return(epidemic_curve_data)
}
compute_epidemic_curve_data(bar, "SVN")
#> # A tibble: 1 x 4
#> id admin_region tests confirmed
#> <chr> <chr> <dbl> <dbl>
#> 1 SVN Slovenia 282 141
compute_epidemic_curve_data(foo, "SVN")
#> # A tibble: 1 x 3
#> iso3c country confirmed
#> <chr> <chr> <dbl>
#> 1 SVN Slovenia 141
Created on 2020-07-11 by the reprex package (v0.3.0)
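Yet another way to cut the duplication is to normalize the column names up front and keep a single code path. This is a sketch under the assumption that id/admin_region can be treated as plain aliases of iso3c/country; rename() accepts tidyselect helpers in recent dplyr, and any_of() skips names that are absent from a given source:

```r
library(dplyr)

compute_epidemic_curve_data <- function(df, country_code) {
  df %>%
    # Map the alternative schema onto the canonical one; names absent
    # from this particular source are simply ignored by any_of().
    rename(any_of(c(iso3c = "id", country = "admin_region"))) %>%
    filter(iso3c == country_code) %>%
    select(iso3c, country, confirmed, any_of("tests"))
}
```

With foo this returns iso3c/country/confirmed; with bar the extra tests column is kept as well, and both sources share one filter/select pipeline.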
I have a dataset (df1) with a number of paired values. One row of the pair is for one year (e.g., 2014), the other for a different year (e.g., 2013). Each row of the pair has a value in column G. I need a count of the number of pairs in which the G value for the higher year is less than the G value for the lesser year.
Here is my dput for the dataset df1:
structure(list(Name = c("A.J. Ellis", "A.J. Ellis", "A.J. Pierzynski",
"A.J. Pierzynski", "Aaron Boone", "Adam Kennedy", "Adam Melhuse",
"Adrian Beltre", "Adrian Beltre", "Adrian Gonzalez", "Alan Zinter",
"Albert Pujols", "Albert Pujols"), Age = c(37, 36, 37, 36, 36,
36, 36, 37, 36, 36, 36, 37, 36), Year = c(2018, 2017, 2014, 2013,
2009, 2012, 2008, 2016, 2015, 2018, 2004, 2017, 2016), Tm = c("SDP",
"MIA", "TOT", "TEX", "HOU", "LAD", "TOT", "TEX", "TEX", "NYM",
"ARI", "LAA", "LAA"), Lg = c("NL", "NL", "ML", "AL", "NL", "NL",
"ML", "AL", "AL", "NL", "NL", "AL", "AL"), G = c(66, 51, 102,
134, 10, 86, 15, 153, 143, 54, 28, 149, 152), PA = c(183, 163,
362, 529, 14, 201, 32, 640, 619, 187, 40, 636, 650)), row.names = c(NA,
13L), class = "data.frame")
Here is a tibble that shows the look of the rows to be checked:
https://www.dropbox.com/s/3nbfi9le568qb3s/grouped-pairs.png?dl=0
Here is the code I used to create the tibble:
df1 %>%
group_by(Name) %>%
filter(n() > 1)
We could arrange the data by Name and Age, check for each name whether the last value of G (the higher year, since rows are sorted by Age) is less than the first, and count those occurrences with sum.
library(dplyr)
df1 %>%
  arrange(Name, Age) %>%
  group_by(Name) %>%
  summarise(check = last(G) < first(G)) %>%
  pull(check) %>%
  sum(., na.rm = TRUE)
#[1] 2
If you want the pairs in which the G value for the higher year is less than the G value for the lesser year we could use filter.
df1 %>%
  arrange(Name, Age) %>%
  group_by(Name) %>%
  filter(last(G) < first(G))
# Name Age Year Tm Lg G PA
# <chr> <dbl> <dbl> <chr> <chr> <dbl> <dbl>
#1 A.J. Pierzynski 36 2013 TEX AL 134 529
#2 A.J. Pierzynski 37 2014 TOT ML 102 362
#3 Albert Pujols 36 2016 LAA AL 152 650
#4 Albert Pujols 37 2017 LAA AL 149 636
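Alternatively, each pair can be spread onto a single row and the two seasons compared directly. A sketch with pivot_wider(), assuming (as in this dput) that every pair consists of the age-36 and age-37 seasons:

```r
library(dplyr)
library(tidyr)

df1 %>%
  select(Name, Age, G) %>%
  pivot_wider(names_from = Age, values_from = G, names_prefix = "G_") %>%
  # Players with only one season get NA in one column; na.rm drops them
  summarise(n_declined = sum(G_37 < G_36, na.rm = TRUE))
```

This gives the same count of 2 as the summarise() approach above, and the intermediate wide table makes it easy to eyeball which pairs declined.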
I have a dataset of pairs of cities, V1 and V2. Each city has a population, v1_pop2015 and v2_pop2015.
I would like to create a new dataset with only the cityCode of the bigger city of each pair, together with its population plus the population of the smaller one.
I was able to create the output I want with a for loop. For educational purposes, I tried to do it using tidyverse tools, without success.
This is a working sample:
library(tidyverse)
## Sample dataset
pairs_pop <- structure(list(cityCodeV1 = c(20073, 20888, 20222, 22974, 23792,
20779), cityCodeV2 = c(20063, 204024, 20183, 20406, 23586, 23595
), v1_pop2015 = c(414, 682, 497, 3639, 384, 596), v2_pop2015 = c(384,
757, 5716, 315, 367, 1303)), row.names = c(NA, 6L), class = c("tbl_df",
"tbl", "data.frame"))
pairs_pop
#> # A tibble: 6 x 4
#> cityCodeV1 cityCodeV2 v1_pop2015 v2_pop2015
#> * <dbl> <dbl> <dbl> <dbl>
#> 1 20073 20063 414 384
#> 2 20888 204024 682 757
#> 3 20222 20183 497 5716
#> 4 22974 20406 3639 315
#> 5 23792 23586 384 367
#> 6 20779 23595 596 1303
#### This is working !!!
clean_df <- setNames(data.frame(matrix(ncol = 2, nrow = dim(pairs_pop)[1])),
                     c("to_keep", "to_keep_pop"))

# For each row, determine which city is the biggest and add the two populations
for (i in 1:dim(pairs_pop)[1]) {
  if (pairs_pop$v1_pop2015[i] > pairs_pop$v2_pop2015[i]) {
    clean_df$to_keep[i] <- pairs_pop$cityCodeV1[i]
  } else {
    clean_df$to_keep[i] <- pairs_pop$cityCodeV2[i]
  }
  clean_df$to_keep_pop[i] <- pairs_pop$v1_pop2015[i] + pairs_pop$v2_pop2015[i]
}
clean_df
#> to_keep to_keep_pop
#> 1 20073 798
#> 2 204024 1439
#> 3 20183 6213
#> 4 22974 3954
#> 5 23792 751
#> 6 23595 1899
This is where I'm stuck:
### trying to tidy it with rowwise, mutate and a function
v1_sup_tov2 <- function(x){
print(x)
if(x$v1_pop2015 > x$v2_pop2015){
return (TRUE)
}
return(FALSE)
}
to_clean_df2 <- pairs_pop %>%
rowwise() %>%
mutate_if(v1_sup_tov2,
to_keep = cityCodeV1,
to_delete= cityCodeV2,
to_keep_pop = v1_pop2015 + v2_pop2015)
The expected output is a dataframe with 2 colums like this:
to_keep: cityCode of the city I want to keep
to_keep_pop: population of that city
clean_df
#> to_keep to_keep_pop
#> 1 20073 798
#> 2 204024 1439
#> 3 20183 6213
#> 4 22974 3954
#> 5 23792 751
#> 6 23595 1899
What about this?
library(dplyr)
## Sample dataset
pairs_pop <- structure(
list(cityCodeV1 = c(20073, 20888, 20222, 22974, 23792, 20779),
cityCodeV2 = c(20063, 204024, 20183, 20406, 23586, 23595),
v1_pop2015 = c(414, 682, 497, 3639, 384, 596),
v2_pop2015 = c(384, 757, 5716, 315, 367, 1303)),
row.names = c(NA, 6L), class = c("tbl_df", "tbl", "data.frame"))
clean_df <- transmute(pairs_pop,
                      to_keep = if_else(v1_pop2015 > v2_pop2015, cityCodeV1, cityCodeV2),
                      to_keep_pop = v1_pop2015 + v2_pop2015)
Just in case you one day get multiple cities (v1, v2, v3, ...): keep all the information in your dataframe so that you know which value relates to which city, i.e. a tidy dataframe.
library(dplyr)
## Sample dataset
pairs_pop <- structure(
list(cityCodeV1 = c(20073, 20888, 20222, 22974, 23792, 20779),
cityCodeV2 = c(20063, 204024, 20183, 20406, 23586, 23595),
v1_pop2015 = c(414, 682, 497, 3639, 384, 596),
v2_pop2015 = c(384, 757, 5716, 315, 367, 1303)),
row.names = c(NA, 6L), class = c("tbl_df", "tbl", "data.frame"))
# Tidy dataset with all information that was in columns
library(dplyr)
library(tidyr)
library(stringr)
tidy_pairs <- pairs_pop %>%
mutate(city = 1:n()) %>%
gather("key", "value", -city) %>%
mutate(ville = str_extract(key, "([[:digit:]])"),
key = case_when(
grepl("cityCode", key) ~ "cityCode",
grepl("pop", key) ~ "pop",
TRUE ~ "other"
)) %>%
spread(key, value)
And then you can apply the test you want
tidy_pairs %>%
group_by(city) %>%
summarise(to_keep = cityCode[pop == max(pop)],
to_keep_pop = sum(pop))
So I have the following data set (this is a small sample/example of what it looks like, with the original being 7k rows and 30 columns over 7 decades):
Year,Location,Population Total, Median Age, Household Total
2000, Adak, 220, 45, 67
2000, Akiachak, 567, NA, 98
2000, Rainfall, 2, NA, 11
1990, Adak, NA, 33, 56
1990, Akiachak, 456, NA, 446
1990, Tioga, 446, NA, NA
I want to create a summary table that indicates how many years of data are available by location for each variable. So something like this would work (for the small example above):
Location,Population Total, Median Age, Household Total
Adak,1,2,2
Akiachak,2,0,2
Rainfall,1,0,1
Tioga,1,0,0
I'm new to R and haven't used these two commands together, so I'm unsure of the syntax. Any help or alternatives would be wonderful.
A solution with summarize_all from dplyr (using the ~ lambda syntax, since funs() is deprecated in current dplyr):
library(dplyr)
df %>%
  group_by(Location) %>%
  summarize_all(~ sum(!is.na(.))) %>%
  select(-Year)
Or you can use summarize_at:
df %>%
  group_by(Location) %>%
  summarize_at(vars(-Year), ~ sum(!is.na(.)))
Result:
# A tibble: 4 x 4
Location PopulationTotal MedianAge HouseholdTotal
<chr> <int> <int> <int>
1 Adak 1 2 2
2 Akiachak 2 0 2
3 Rainfall 1 0 1
4 Tioga 1 0 0
Data:
df = read.table(text = "Year,Location,PopulationTotal, MedianAge, HouseholdTotal
2000, Adak, 220, 45, 67
2000, Akiachak, 567, NA, 98
2000, Rainfall, 2, NA, 11
1990, Adak, NA, 33, 56
1990, Akiachak, 456, NA, 446
1990, Tioga, 446, NA, NA", header = TRUE, sep = ",", stringsAsFactors = FALSE)
library(dplyr)
df <- df %>%
  mutate_at(vars(PopulationTotal:HouseholdTotal), as.numeric)
You can do something like this:
x %>%
group_by(Location) %>%
summarise(count_years = n(),
count_pop_total = sum(!is.na(Population_Total)),
count_median_age = sum(!is.na(Median_Age)),
count_house_total = sum(!is.na(Household_Total)))
where you can replace sum(!is.na(...)) with whatever operation you want to perform. You should take a look at the dplyr vignettes for more general solutions.
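Note that the scoped verbs summarize_all()/summarize_at() shown above are superseded in current dplyr (>= 1.0); the idiomatic replacement is across(). A roughly equivalent version, using the df from the first answer's Data block:

```r
library(dplyr)

# Count non-NA values per variable within each Location; across(-Year, ...)
# applies the function to every remaining column except Year
df %>%
  group_by(Location) %>%
  summarize(across(-Year, ~ sum(!is.na(.x))))
```

The output is the same count table, without needing a separate select(-Year) step afterwards.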