Cumsum table with grouping - r

How can I get a cumsum table grouped by both Gender and State?
Gender = sample(c('male', 'female'), 100, replace=TRUE)
State = sample(c('CA', 'WA', 'NV', 'OR', "AZ"), 100, replace=TRUE)
Number = sample(1:8, size=100, replace=TRUE)
df <- data.frame(Gender,State, Number)

If we are looking for a cumsum table, then
library(data.table)
dcast(setDT(df)[, .N, .(Gender, State, Number)
][, perc := round(100*N/sum(N), 2), .(Gender, State)],
Gender + State ~Number, value.var = 'perc', fill = 0, drop = FALSE)[,
(3:10) := lapply(Reduce(`+`, .SD, accumulate = TRUE),
function(x) paste0(x, "%")), .SDcols = -(1:2)][]
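
A comparable cumulative-percentage table can be sketched with dplyr and tidyr. This is only an illustration under some assumptions: the seed and the name cum_tbl are made up here, the percentages are accumulated from counts before rounding (the data.table version above adds up already rounded percentages), and the values stay numeric instead of getting a "%" suffix.

```r
library(dplyr)
library(tidyr)

set.seed(1)  # assumption: fixed seed so the sketch is reproducible
df <- data.frame(
  Gender = sample(c("male", "female"), 100, replace = TRUE),
  State  = sample(c("CA", "WA", "NV", "OR", "AZ"), 100, replace = TRUE),
  Number = sample(1:8, size = 100, replace = TRUE)
)

cum_tbl <- df %>%
  count(Gender, State, Number) %>%                      # rows per Gender/State/Number
  complete(nesting(Gender, State), Number = 1:8,
           fill = list(n = 0L)) %>%                     # add missing Numbers as 0
  group_by(Gender, State) %>%
  mutate(cum_perc = round(100 * cumsum(n) / sum(n), 2)) %>%
  ungroup() %>%
  select(-n) %>%
  pivot_wider(names_from = Number, values_from = cum_perc)

cum_tbl
```

Each row is one Gender/State combination, and the column named 8 reaches 100 because every observation has a Number of at most 8.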

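For a simpler approach, I would recommend dplyr, which is loaded along with several other helpful packages when you load tidyverse. Note that this adds a running total of Number within each Gender/State group rather than building a wide table.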
library(tidyverse)
Gender = sample(c('male', 'female'), 100, replace=TRUE)
State = sample(c('CA', 'WA', 'NV', 'OR', "AZ"), 100, replace=TRUE)
Number = sample(1:8, size=100, replace=TRUE)
df <- data.frame(Gender,State, Number)
df <- df %>%
group_by(Gender, State) %>%
mutate(Number_CumSum = cumsum(Number)) %>%
ungroup() %>%
arrange(State, Gender)
head(df)
# A tibble: 6 x 4
Gender State Number Number_CumSum
<fctr> <fctr> <int> <int>
1 female AZ 8 8
2 female AZ 3 11
3 female AZ 4 15
4 female AZ 5 20
5 female AZ 2 22
6 female AZ 7 29

Related

Paste element of a vector into dplyr function

I have the following dataset:
df_x <- data.frame(year = c(2000, 2000, 2000, 2001, 2001, 2001, 2002, 2002, 2002),
a = c(7, 3, 5),
b = c(5, 8, 1),
c = c(8, 4, 3))
and this vector:
v <- c("a", "b", "c")
Now I want to create a new dataset and summarise a, b, and c by creating new variables (y_a, y_b, and y_c) that calculate the mean of each variable grouped by year.
The code for doing this is the following:
y <- df_x %>% group_by(year) %>% dplyr::summarise(y_a = mean(a, na.rm = TRUE),
y_b = mean(b, na.rm = TRUE),
y_c = mean(c, na.rm = TRUE))
However, I want to use the vector v to read the respective variable from it and paste it into the summarise function:
y <- df_x %>% group_by(year) %>% dplyr::summarise(as.name(paste0("y_", v[1])) = mean(as.name(v[1]), na.rm = TRUE),
as.name(paste0("y_", v[2])) = mean(as.name(v[1]), na.rm = TRUE),
as.name(paste0("y_", v[3])) = mean(as.name(v[1]), na.rm = TRUE))
Doing so, I receive the following error message:
Error: unexpected '=' in "y <- df_x %>% group_by(year) %>% dplyr::summarise(as.name(paste0("y_", v[1])) ="
How can I paste the value of a vector in this summarise function so that it works?
To define a new variable on the left-hand side from an expression, you need := instead of =. Because you create the name with paste0, you need !! to inject the result so that it is evaluated correctly. To access an existing column in dplyr through a string stored in a variable, the .data pronoun is the easiest way.
library(dplyr)
df_x <- data.frame(year = c(2000, 2000, 2000, 2001, 2001, 2001, 2002, 2002, 2002),
a = c(7, 3, 5),
b = c(5, 8, 1),
c = c(8, 4, 3))
v <- c("a", "b", "c")
df_x %>% group_by(year) %>%
dplyr::summarise(!!paste0("y_", v[1]) := mean(.data[[v[1]]], na.rm = TRUE),
!!paste0("y_", v[2]) := mean(.data[[v[2]]], na.rm = TRUE),
!!paste0("y_", v[3]) := mean(.data[[v[3]]], na.rm = TRUE))
#> # A tibble: 3 × 4
#> year y_a y_b y_c
#> <dbl> <dbl> <dbl> <dbl>
#> 1 2000 5 4.67 5
#> 2 2001 5 4.67 5
#> 3 2002 5 4.67 5
Created on 2022-12-21 by the reprex package (v1.0.0)
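The same idea can be wrapped in a small helper so that it does not have to be written out once per element of v. This is a sketch, not part of the original answer: mean_by_year is a hypothetical name, and it relies on the glue-style "y_{var}" := syntax that dplyr 1.0 and later accept for computed column names.

```r
library(dplyr)

df_x <- data.frame(year = c(2000, 2000, 2000, 2001, 2001, 2001, 2002, 2002, 2002),
                   a = c(7, 3, 5),
                   b = c(5, 8, 1),
                   c = c(8, 4, 3))
v <- c("a", "b", "c")

# Hypothetical helper: mean of one column (given as a string) per year
mean_by_year <- function(data, var) {
  data %>%
    group_by(year) %>%
    summarise("y_{var}" := mean(.data[[var]], na.rm = TRUE))
}

# One summary per element of v, joined back together on year
y <- Reduce(function(x, z) left_join(x, z, by = "year"),
            lapply(v, mean_by_year, data = df_x))
y
```

The result has the columns year, y_a, y_b and y_c, one row per year.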
Here is a one-liner via base R,
aggregate(. ~ year, cbind.data.frame(year = df_x$year, df_x[v]), FUN = \(i)mean(i, na.rm = TRUE))
year a b c
1 2000 5 4.666667 5
2 2001 5 4.666667 5
3 2002 5 4.666667 5
It would be easier with across and modifying the names with .names
library(dplyr)
df_x %>%
group_by(year) %>%
summarise(across(all_of(v), ~ mean(.x, na.rm = TRUE), .names = "y_{.col}"))
Output:
# A tibble: 3 × 4
year y_a y_b y_c
<dbl> <dbl> <dbl> <dbl>
1 2000 5 4.67 5
2 2001 5 4.67 5
3 2002 5 4.67 5

Performing pivot_longer() over multiple sets of columns

I am stuck in performing pivot_longer() over multiple sets of columns. Here is the sample dataset
df <- data.frame(
id = c(1, 2),
uid = c("m1", "m2"),
germ_kg = c(23, 24),
mineral_kg = c(12, 17),
perc_germ = c(45, 34),
perc_mineral = c(78, 10))
I need the output dataframe to look like this
out <- data.frame(
id = c(1, 1, 2, 2),
uid = c("m1", "m1", "m2", "m2"),
crop = c("germ", "mineral", "germ", "mineral"),
kg = c(23, 12, 24, 17),
perc = c(45, 78, 34, 10))
library(tidyverse)
df %>%
rename_with(~str_replace(.x,'(.*)_kg', 'kg_\\1')) %>%
pivot_longer(-c(id, uid), names_to = c('.value', 'crop'), names_sep = '_')
# A tibble: 4 x 5
id uid crop kg perc
<dbl> <chr> <chr> <dbl> <dbl>
1 1 m1 germ 23 45
2 1 m1 mineral 12 78
3 2 m2 germ 24 34
4 2 m2 mineral 17 10
If you were to use data.table:
library(data.table)
melt(setDT(df), c('id', 'uid'), patterns(kg = 'kg', perc = 'perc'))
id uid variable kg perc
1: 1 m1 1 23 45
2: 2 m2 1 24 34
3: 1 m1 2 12 78
4: 2 m2 2 17 10
I suspect there might be a simpler way using pivot_longer_spec, but one tricky thing here is that your column names don't have a consistent ordering of their semantic components. @Onyambu's answer deals with this nicely by fixing it upstream.
library(tidyverse)
df %>%
pivot_longer(-c(id, uid)) %>%
separate(name, c("col1", "col2")) %>% # only needed
mutate(crop = if_else(col2 == "kg", col1, col2), # because name
meas = if_else(col2 == "kg", col2, col1)) %>% # structure
select(id, uid, crop, meas, value) %>% # is
pivot_wider(names_from = meas, values_from = value) # inconsistent
# A tibble: 4 x 5
id uid crop kg perc
<dbl> <chr> <chr> <dbl> <dbl>
1 1 m1 germ 23 45
2 1 m1 mineral 12 78
3 2 m2 germ 24 34
4 2 m2 mineral 17 10
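Since pivot_longer_spec was mentioned, here is a rough sketch of what that could look like. It assumes you are willing to write the spec by hand: every original column name is mapped to the output column it feeds (.value) and to a crop label, which sidesteps the inconsistent name ordering. The names spec and out are illustrative only.

```r
library(tidyr)
library(tibble)

df <- data.frame(
  id = c(1, 2),
  uid = c("m1", "m2"),
  germ_kg = c(23, 24),
  mineral_kg = c(12, 17),
  perc_germ = c(45, 34),
  perc_mineral = c(78, 10))

# Hand-written spec: one row per input column
spec <- tribble(
  ~.name,         ~.value, ~crop,
  "germ_kg",      "kg",    "germ",
  "mineral_kg",   "kg",    "mineral",
  "perc_germ",    "perc",  "germ",
  "perc_mineral", "perc",  "mineral"
)

out <- pivot_longer_spec(df, spec)
out
```

This is more typing than the rename_with approach, but it makes the mapping explicit and does not depend on the column names following any pattern.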

Convert logical data from wide to long format in R

I have the following data:
ID cancer cancer_date stroke stroke_date diabetes diabetes_date
1 1 Feb2017 0 Jan2015 1 Jun2015
2 0 Feb2014 1 Jan2015 1 Jun2015
I would like to get
ID condition date
1 cancer xx
1 diabetes xx
2 stroke xx
2 diabetes xx
I tried reshape and gather, but they did not do what I want. Any ideas how I can do this?
This should do it. The key to make it work easily is to change the names of cancer, stroke and diabetes to x_val and then you can use pivot_longer() from tidyr to do the work.
library(tidyr)
library(dplyr)
dat <- tibble::tribble(
~ID, ~cancer, ~cancer_date, ~stroke, ~stroke_date, ~diabetes, ~diabetes_date,
1, 1, "Feb2017", 0, "Jan2015", 1, "Jun2015",
2, 0, "Feb2014", 1, "Jan2015", 1, "Jun2015")
dat %>%
rename("cancer_val" = "cancer",
"stroke_val" = "stroke",
"diabetes_val" = "diabetes") %>%
pivot_longer(cols=-ID,
names_to = c("diagnosis", ".value"),
names_pattern="(.*)_(.*)") %>%
filter(val == 1)
# # A tibble: 4 x 4
# ID diagnosis val date
# <dbl> <chr> <dbl> <chr>
# 1 1 cancer 1 Feb2017
# 2 1 diabetes 1 Jun2015
# 3 2 stroke 1 Jan2015
# 4 2 diabetes 1 Jun2015
library(data.table)
data <- data.table(ID = c(1, 2), cancer = c(1, 0), cancer_date = c("Feb2017", "Feb2014"), stroke = c(0, 1), stroke_date = c("Jan2015", "Jan2015"), diabetes = c(1, 1), diabetes_date = c("Jun2015", "Jun2015"))
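# melt the three date columns into long format, keeping the 0/1 flags as id columns
datalong <-
melt(data, id.vars = c("ID", "cancer", "stroke", "diabetes"),
measure.vars = c("cancer_date", "stroke_date", "diabetes_date"))
# keep only the rows whose flag is 1 and strip the "_date" suffix from the condition name
datalong[(cancer == 1 & variable == "cancer_date") |
(stroke == 1 & variable == "stroke_date") |
(diabetes == 1 & variable == "diabetes_date"), .(ID, condition = sub("_date$", "", variable), date = value)]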
Try this solution using pivot_longer() and a flag variable to filter the desired states. After the first pivot, drop the conditions whose value is zero; after the second pivot, keep only the date that belongs to each remaining condition. Here is the code:
library(tidyverse)
#Code
df2 <- df %>% pivot_longer(cols = -c(ID,contains('_'))) %>%
filter(value!=0) %>% rename(condition=name) %>% select(-value) %>%
pivot_longer(-c(ID,condition)) %>%
separate(name,c('v1','v2'),sep='_') %>%
mutate(Flag=ifelse(condition==v1,1,0)) %>%
filter(Flag==1) %>% select(-c(v1,v2,Flag)) %>%
rename(date=value)
Output:
# A tibble: 4 x 3
ID condition date
<int> <chr> <chr>
1 1 cancer Feb2017
2 1 diabetes Jun2015
3 2 stroke Jan2015
4 2 diabetes Jun2015
Some data used:
#Data
df <- structure(list(ID = 1:2, cancer = 1:0, cancer_date = c("Feb2017",
"Feb2014"), stroke = 0:1, stroke_date = c("Jan2015", "Jan2015"
), diabetes = c(1L, 1L), diabetes_date = c("Jun2015", "Jun2015"
)), class = "data.frame", row.names = c(NA, -2L))
If the first option seems complex, here is another choice:
#Code 2
df2 <- df %>% mutate(across(everything(),~as.character(.))) %>%
pivot_longer(cols = -c(ID)) %>%
separate(name,c('condition','v2'),sep = '_') %>%
replace(is.na(.),'val') %>%
pivot_wider(names_from = v2,values_from=value) %>%
filter(val==1) %>% select(-val)
Output:
# A tibble: 4 x 3
ID condition date
<chr> <chr> <chr>
1 1 cancer Feb2017
2 1 diabetes Jun2015
3 2 stroke Jan2015
4 2 diabetes Jun2015
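For comparison, the same reshaping can be sketched in base R without any packages. The assumption here is that every condition column x has a matching x_date column; the names conditions and long are made up for the illustration.

```r
df <- data.frame(ID = 1:2,
                 cancer = c(1, 0), cancer_date = c("Feb2017", "Feb2014"),
                 stroke = c(0, 1), stroke_date = c("Jan2015", "Jan2015"),
                 diabetes = c(1, 1), diabetes_date = c("Jun2015", "Jun2015"))

conditions <- c("cancer", "stroke", "diabetes")

# For each condition, keep the IDs flagged with 1 and pick up the matching date
long <- do.call(rbind, lapply(conditions, function(cond) {
  keep <- df[[cond]] == 1
  data.frame(ID = df$ID[keep],
             condition = rep(cond, sum(keep)),
             date = df[[paste0(cond, "_date")]][keep])
}))

# Sort by ID; ties keep the order of the conditions vector
long <- long[order(long$ID), ]
rownames(long) <- NULL
long
```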

Find proportion of values for two levels that share a common level

I have a dataframe that looks like this:
group <- c('a', 'b', 'a', 'b')
year <- c(1990, 1990, 2000, 2000)
freq <- c(100, 120, 130, 170)
df <- data.frame(group, year, freq)
For each distinct year, I'd like to find the freq value for the row with group a divided by the freq value for the row with group b and add these proportion values to the dataframe. The resulting dataframe should look like this:
group <- c('a', 'b', 'c', 'a', 'b', 'c')
year <- c(1990, 1990, 1990, 2000, 2000, 2000)
freq <- c(100, 120, 100/120, 130, 170, 130/170)
df <- data.frame(group, year, freq)
I tried to get this going with the ugliest of loops below but have taken the train off of the rails. If anyone can help show me how to accomplish this elementary task in R, I'd be grateful!
for (year in unique(df$year)) {
a = df[ which(df$group == 'a' & df$year == year), ]
b = df[ which(df$group == 'b' & df$year == year), ]
proportion = a$freq / b$freq
row = c('c', year, proportion)
rbind(df, row)
}
Here is a tidyverse option
library(tidyverse)
df %>%
spread(group, freq) %>%
mutate(c = a / b) %>%
gather(group, freq, -year) %>%
arrange(year, group)
# year group freq
#1 1990 a 100.0000000
#2 1990 b 120.0000000
#3 1990 c 0.8333333
#4 2000 a 130.0000000
#5 2000 b 170.0000000
#6 2000 c 0.7647059
Explanation: We spread data from long to wide, add a column c = a / b and gather data from wide to long before re-ordering rows to reproduce your expected output.
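In current tidyr, spread() and gather() are superseded by pivot_wider() and pivot_longer(). A sketch of the same three steps with the newer functions, assuming the four-row df from the question (res is just an illustrative name):

```r
library(dplyr)
library(tidyr)

df <- data.frame(group = c("a", "b", "a", "b"),
                 year = c(1990, 1990, 2000, 2000),
                 freq = c(100, 120, 130, 170))

res <- df %>%
  pivot_wider(names_from = group, values_from = freq) %>%     # long to wide: columns a, b
  mutate(c = a / b) %>%                                        # ratio per year
  pivot_longer(-year, names_to = "group", values_to = "freq") %>%  # wide back to long
  arrange(year, group)
res
```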
Split original by year with function split (result is a list).
foo <- split(df, df$year)
For each entry in list foo bind original entry x with new data.frame which has calculated freq
bar <- lapply(foo, function(x)
rbind(x, data.frame(group = "c",
year = x$year[1],
freq = x$freq[1] / x$freq[2])))
# Bind back final result as it's a list (lapply result)
do.call(rbind, bar)
Here is an option using data.table. Convert the 'data.frame' to 'data.table' (setDT(df)), grouped by 'year', concatenate 'group' with 'c' as well as 'freq' with the ratio of 'freq' elements correspondingly
library(data.table)
setDT(df)[, .(group = c(group, 'c'), freq = c(freq, freq[1]/freq[2])), .(year)]
# year group freq
#1: 1990 a 100.0000000
#2: 1990 b 120.0000000
#3: 1990 c 0.8333333
#4: 2000 a 130.0000000
#5: 2000 b 170.0000000
#6: 2000 c 0.7647059
Or rbind the summarised dataset with the original
rbind(setDT(df), df[, .(freq = Reduce(`/`, freq), group = 'c'), .(year)])
Or using tidyverse
library(tidyverse)
df %>%
group_by(year) %>%
summarise(group = list(c(group, 'c')),
freq = list(c(freq, freq[1]/freq[2]))) %>%
unnest(cols = c(group, freq))
# A tibble: 6 x 3
# year group freq
# <dbl> <chr> <dbl>
#1 1990 a 100
#2 1990 b 120
#3 1990 c 0.833
#4 2000 a 130
#5 2000 b 170
#6 2000 c 0.765
data
df <- structure(list(group = c("a", "b", "a", "b"), year = c(1990,
1990, 2000, 2000), freq = c(100, 120, 130, 170)), row.names = c(NA,
-4L), class = "data.frame")
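The loop from the question can also be made to work. The sketch below rests on two observations about the original attempt: the result of rbind() was never assigned, and c('c', year, proportion) coerces everything to character. Building a one-row data.frame and assigning the result avoids both; the names result and yr are illustrative.

```r
df <- data.frame(group = c("a", "b", "a", "b"),
                 year = c(1990, 1990, 2000, 2000),
                 freq = c(100, 120, 130, 170))

result <- df
for (yr in unique(df$year)) {
  a <- df$freq[df$group == "a" & df$year == yr]
  b <- df$freq[df$group == "b" & df$year == yr]
  # a one-row data.frame keeps year and freq numeric
  result <- rbind(result, data.frame(group = "c", year = yr, freq = a / b))
}

result <- result[order(result$year, result$group), ]
rownames(result) <- NULL
result
```

Growing a data frame inside a loop is fine for a handful of years, but the vectorised answers above will scale better.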

Condense factor variables for duplicated IDs in a data frame

I have a data frame with duplicated IDs. An ID stands for a specific entity. The IDs are duplicated because the dataset refers to a process that every entity can go through multiple times.
Here is a small example dat:
library(dplyr)
glimpse(dat)
Observations: 6
Variables: 3
$ ID <dbl> 1, 1, 1, 2, 2, 2
$ Amount <dbl> 10, 70, 80, 50, 10, 10
$ Product <fct> A, B, C, B, E, A
ID stands for the entity, Amount stands for the amount of money the entity has spent and Product stands for the good the entity bought.
The issue is that I have to "condense" this data. So, every ID / entity may occur only once. For the continuous variable, this is not an issue because I can simply calculate the mean per ID.
library(tidyr)
dat_con_ID <- dat %>%
select(ID) %>%
unique()
dat_con_Amount <- dat %>%
group_by(ID) %>%
summarise(Amount = mean(Amount))
dat_con <- inner_join(dat_con_ID, dat_con_Amount, by = "ID")
glimpse(dat_con)
Observations: 2
Variables: 2
$ ID <dbl> 1, 2
$ Amount <dbl> 53.33333, 23.33333
The problem is that I can't calculate the mean of Product because it's a categorical variable. An option would be to make a dummy variable out of this factor and calculate the mean. But since the original data frame is really huge, this is not a good solution. Any idea how to handle this problem?
Maybe you are trying to do this:
I am using data.table library. I also modified your data by adding one extra row for ID = 1, so that you can see the difference in the output.
Data:
library('data.table')
dat <- data.table(ID =as.double(c(1, 1, 1, 2, 2, 2,1)),
Amount = as.double(c( 10, 70, 80, 50, 10, 10, 20)),
Product = factor( c('A', 'B', 'C', 'B', 'E', 'A', 'A')))
Code:
# average amount per id
dat[, .(avg_amt = mean(Amount)), by = .(ID) ]
# ID avg_amt
# 1: 1 45.00000
# 2: 2 23.33333
# average product per id
dat[, .SD[, .N, by = Product ][, .( avg_pdt = N/sum(N), Product)], by = .(ID) ]
# ID avg_pdt Product
# 1: 1 0.5000000 A
# 2: 1 0.2500000 B
# 3: 1 0.2500000 C
# 4: 2 0.3333333 B
# 5: 2 0.3333333 E
# 6: 2 0.3333333 A
# combining average amount and average product per id
dat[, .SD[, .N, by = Product ][, .( Product,
avg_pdt = N/sum(N),
avg_amt = mean(Amount))],
by = .(ID) ]
# ID Product avg_pdt avg_amt
# 1: 1 A 0.5000000 45.00000
# 2: 1 B 0.2500000 45.00000
# 3: 1 C 0.2500000 45.00000
# 4: 2 B 0.3333333 23.33333
# 5: 2 E 0.3333333 23.33333
# 6: 2 A 0.3333333 23.33333
edit
Another idea would be to count 'Product' as per 'ID', calculating the mean of 'Amount' and the relative frequencies for each product. spread the data by 'Product' to end up with the data in wide format. So, every ID / entity may occur only once.
dat %>%
add_count(Product, ID) %>%
group_by(ID) %>%
mutate(Amount = mean(Amount),
n = n / n()) %>%
unique() %>%
spread(Product, n, sep = "_") %>%
ungroup()
# A tibble: 2 x 6
# ID Amount Product_A Product_B Product_C Product_E
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 1. 45.0 0.500 0.250 0.250 NA
#2 2. 23.3 0.333 0.333 NA 0.333
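The relative frequencies per ID can also be sketched in base R with table() and prop.table(). This is an illustration rather than the answer's method: the names rel_freq, avg_amt, freq_wide and condensed are made up, and products an ID never bought come out as 0 here instead of NA.

```r
dat <- data.frame(ID = c(1, 1, 1, 2, 2, 2, 1),
                  Amount = c(10, 70, 80, 50, 10, 10, 20),
                  Product = factor(c("A", "B", "C", "B", "E", "A", "A")))

# Share of each Product within each ID (rows sum to 1)
rel_freq <- prop.table(table(dat$ID, dat$Product), margin = 1)

# Mean Amount per ID
avg_amt <- tapply(dat$Amount, dat$ID, mean)

# One row per ID: mean Amount plus one column per Product
freq_wide <- as.data.frame.matrix(rel_freq)
names(freq_wide) <- paste0("Product_", names(freq_wide))
condensed <- cbind(ID = as.numeric(rownames(freq_wide)),
                   Amount = as.vector(avg_amt[rownames(freq_wide)]),
                   freq_wide)
rownames(condensed) <- NULL
condensed
```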
My first attempt, not what OP was looking for but in case someone is interested:
As suggested by @steveb in the comments, you could summarise Product as a string.
library(dplyr)
dat %>%
group_by(ID) %>%
summarise(Amount = mean(Amount),
Product = toString( sort(unique(Product)))
)
# A tibble: 2 x 3
# ID Amount Product
# <dbl> <dbl> <chr>
#1 1. 45.0 A, B, C
#2 2. 23.3 A, B, E
data
dat <- structure(list(ID = c(1, 1, 1, 2, 2, 2, 1), Amount = c(10, 70,
80, 50, 10, 10, 20), Product = structure(c(1L, 2L, 3L, 2L, 4L,
1L, 1L), .Label = c("A", "B", "C", "E"), class = "factor")), .Names = c("ID",
"Amount", "Product"), row.names = c(NA, -7L), class = c("tbl_df",
"tbl", "data.frame"))
