Use list names inside purrr::map_dfr function - r

I was trying something relatively simple, but having some struggles. Let's say I have two dataframes df1 and df2:
df1:
id expenditure
1 10
2 20
1 30
2 50
df2:
id expenditure
1 30
2 50
1 60
2 10
I also added them to a named list:
table_list = list()
table_list[['a']] = df1
table_list[['b']] = df2
And now I want to perform some summary operation through a function and then bind those rows:
get_summary = function(table){
final_table = table %>% group_by(id) %>% summarise(total_expenditure= sum(expenditure))
}
And then apply this through map_dfr:
summary = table_list %>% map_dfr(get_summary, .id='origin_table')
So, this will create almost what I'm looking for:
origin_table id total_expenditure
a 1 40
a 2 70
b 1 90
b 2 60
But, what if I would like to do something specific depending on the element of the list that is being passed, something like this:
get_summary = function(table, name){
dummy_list = c(TRUE, FALSE)
names(dummy_list) = c('a', 'b')
final_table = table %>% group_by(id) %>% summarise(total_expenditure= sum(expenditure))
is_true = dummy_list[[name]] # Want to use the original name to call another list
if(is_true) final_table = final_table %>% mutate(total_expenditure = total_expenditure + 1)
return(final_table)
}
This would bring something like this:
origin_table id total_expenditure
a 1 41
a 2 71
b 1 90
b 2 60
So is there any way to use list names inside my function? Or any way to identify which element of my list I'm working with? Maybe map_dfr is too restricted and I have to use something else?
Edit: changed example so it is more grounded in reality

Instead of map, use imap, which passes each element's name to the function as .y
library(purrr)
library(dplyr)
get_summary = function(dat, name){
dat %>%
group_by(id) %>%
summarise(total_expenditure= sum(expenditure, na.rm = TRUE),
.groups = "drop") %>%
mutate(total_expenditure = if(name=='a')
total_expenditure + 1 else total_expenditure)
}
-testing
> table_list %>%
imap_dfr(~ get_summary(.x, name = .y), .id = 'origin_table')
# A tibble: 4 × 3
origin_table id total_expenditure
<chr> <int> <dbl>
1 a 1 41
2 a 2 71
3 b 1 90
4 b 2 60
data
table_list <- list(a = structure(list(id = c(1L, 2L, 1L, 2L),
expenditure = c(10L,
20L, 30L, 50L)), class = "data.frame", row.names = c(NA, -4L)),
b = structure(list(id = c(1L, 2L, 1L, 2L), expenditure = c(30L,
50L, 60L, 10L)), class = "data.frame", row.names = c(NA,
-4L)))
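The question originally looked up a flag in a named vector rather than hard-coding the name 'a'. A minimal sketch of that variant follows, assuming a hypothetical lookup vector called add_one (not part of the original posts); since imap calls the function with the element first and its name second, get_summary can be passed directly:

```r
library(purrr)
library(dplyr)

table_list <- list(
  a = data.frame(id = c(1L, 2L, 1L, 2L), expenditure = c(10L, 20L, 30L, 50L)),
  b = data.frame(id = c(1L, 2L, 1L, 2L), expenditure = c(30L, 50L, 60L, 10L))
)

# Hypothetical lookup: which tables receive the +1 adjustment
add_one <- c(a = TRUE, b = FALSE)

get_summary <- function(dat, name) {
  out <- dat %>%
    group_by(id) %>%
    summarise(total_expenditure = sum(expenditure), .groups = "drop")
  # `name` is the list name supplied by imap() as the second argument
  if (add_one[[name]]) {
    out <- out %>% mutate(total_expenditure = total_expenditure + 1)
  }
  out
}

result <- table_list %>%
  imap(get_summary) %>%             # get_summary(element, name)
  bind_rows(.id = "origin_table")   # list names become the id column
```

Using imap followed by bind_rows keeps the row-binding step separate from the per-table logic, which may be easier to debug than doing both in one call.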

I managed to do it by adding origin_table as a column to each dataframe beforehand:
df1 = df1 %>% mutate(origin_table = 'a')
df2 = df2 %>% mutate(origin_table = 'b')
Then I can extract the origin by doing the following:
get_summary = function(table){
dummy_list = c(TRUE, FALSE)
names(dummy_list) = c('a', 'b')
origin = table %>% distinct(origin_table) %>% pull
final_table = table %>% group_by(id) %>% summarise(total_expenditure= sum(expenditure))
is_true = dummy_list[[origin ]] # Want to use the original name to call another list
if(is_true) final_table = final_table %>% mutate(total_expenditure = total_expenditure + 1)
return(final_table)
}

Related

Clean way to select variable for calculations depending on other variable value in R

I'm working with a dataframe with the following structure:
ID origin value1 value2
1 A 100 50
1 A 200 100
2 B 10 2
2 B 150 30
So each row can have different origins and I need to make some calculations by ID, but the value variable I'm using depends on the origin variable. So if origin == 'A' I should use value1 and if it's B I should use value2. My code without taking this last condition into account looks like this:
df2 <- df %>%
group_by(ID) %>%
mutate(mean_value = mean(value1, na.rm = TRUE),
sd_value = sd(value1, na.rm = TRUE),
median_value = median(value1, na.rm = TRUE),
cv_value = sd_value/mean_value,
p25_value = quantile(value1, 0.25, na.rm = TRUE),
p75_value = quantile(value1, 0.75, na.rm = TRUE))
I know I could add an if_else statement to each line, but I think my code would lose some readability (in my actual data there are multiple origins, which makes this a bit more cumbersome). So I was thinking of creating a custom function, maybe using map or grouping by origin, but I can't find a good way to implement these options. Any ideas? My desired dataframe would look like this (I'll add only the first mutate column for simplicity):
ID origin value1 value2 mean_value
1 A 100 50 150
1 A 200 100 150
2 B 10 2 16
2 B 150 30 16
So the first mean value is (100 + 200) / 2 (from value1) and the second is (30 + 2) / 2 (from value2).
Thanks!
We could create a temporary column first and then compute the mean afterwards. This way, ifelse/case_when is needed only once
library(dplyr)
df %>%
mutate(valuenew = case_when(origin == 'A' ~ value1,
TRUE ~ value2)) %>%
group_by(ID) %>%
mutate(mean_value = mean(valuenew, na.rm = TRUE), .keep = "unused") %>%
ungroup
-output
# A tibble: 4 × 5
ID origin value1 value2 mean_value
<int> <chr> <int> <int> <dbl>
1 1 A 100 50 150
2 1 A 200 100 150
3 2 B 10 2 16
4 2 B 150 30 16
data
df <- structure(list(ID = c(1L, 1L, 2L, 2L), origin = c("A", "A", "B",
"B"), value1 = c(100L, 200L, 10L, 150L), value2 = c(50L, 100L,
2L, 30L)), class = "data.frame", row.names = c(NA, -4L))
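The same temporary-column idea extends to the full set of statistics from the question. The following is only a sketch: the column name value is a placeholder chosen here, and each additional origin in the real data would need its own case_when branch:

```r
library(dplyr)

df <- data.frame(
  ID = c(1L, 1L, 2L, 2L),
  origin = c("A", "A", "B", "B"),
  value1 = c(100L, 200L, 10L, 150L),
  value2 = c(50L, 100L, 2L, 30L)
)

res <- df %>%
  # Pick the source column once, based on origin
  mutate(value = case_when(origin == "A" ~ value1,
                           origin == "B" ~ value2)) %>%
  group_by(ID) %>%
  mutate(mean_value = mean(value, na.rm = TRUE),
         sd_value = sd(value, na.rm = TRUE),
         median_value = median(value, na.rm = TRUE),
         cv_value = sd_value / mean_value,
         p25_value = quantile(value, 0.25, na.rm = TRUE),
         p75_value = quantile(value, 0.75, na.rm = TRUE)) %>%
  ungroup() %>%
  select(-value)  # drop the helper column
```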

Replace all specific values in data.frame with values from another data.frame sequentially R

I have a data.frame (df1) and I want to include a single, most recent age for each of my samples from another data.frame (df2):
df1$age <- df2$age_9[match(df1$Sample_ID, df2$Sample_ID)]
The problem is that in df2 there are 9 columns for age, as each one indicates the age at a specific check-up date (age_1 is from the first visit, age_9 is the age at the 9th visit) and patients don't make all their visits.
How do I add the most recently obtained age from a non empty check up date?
aka, if age_9 == "." replace "." with age_8 then if age_8 == "." replace "." with age_7 ... etc
From this:
View(df1)
Sample Age
1 50
2 .
3 .
To:
View(df1)
Sample Age
1 50
2 49
3 30
From the data df2
View(df2)
Sample Age_1 Age_2 Age_3
1 40 42 44
2 35 49 .
3 30 . .
This is my attempt:
df1$age[which(df1$age == ".")] <- df2$age_8[match(df1$Sample_ID, df2$Sample_ID)]
With base R, we can use max.col to get, for each row of 'df2', the index of the last 'Age' column that is not ".". We then cbind that with the sequence of rows to build a row/column index, extract those elements, and assign them to the 'Age' column of 'df1' wherever 'Age' is "."
df1$Age <- ifelse(df1$Age == ".", df2[-1][cbind(seq_len(nrow(df2)),
max.col(df2[-1] != ".", "last"))], df1$Age)
df1 <- type.convert(df1, as.is = TRUE)
-output
df1
# Sample Age
#1 1 50
#2 2 49
#3 3 30
or using tidyverse by reshaping into 'long' format and then doing a join after slicing the last row grouped by 'Sample'
library(dplyr)
library(tidyr)
df2 %>%
mutate(across(starts_with('Age'), as.integer)) %>%
pivot_longer(cols = starts_with('Age'), values_drop_na = TRUE) %>%
group_by(Sample) %>%
slice_tail(n = 1) %>%
ungroup %>%
select(-name) %>%
right_join(df1) %>%
transmute(Sample, Age = coalesce(as.integer(Age), value))
-output
# A tibble: 3 x 2
# Sample Age
# <int> <int>
#1 1 50
#2 2 49
#3 3 30
data
df1 <- structure(list(Sample = 1:3, Age = c("50", ".", ".")),
class = "data.frame",
row.names = c(NA,
-3L))
df2 <- structure(list(Sample = 1:3, Age_1 = c(40L, 35L, 30L), Age_2 = c("42",
"49", "."), Age_3 = c("44", ".", ".")), class = "data.frame",
row.names = c(NA,
-3L))
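Another way to express "take the latest non-missing age" is dplyr::coalesce, listing the age columns from newest to oldest. This is a sketch on the three-column example data; with the nine columns described in the question, the coalesce call would presumably list them from the ninth visit down to the first:

```r
library(dplyr)

df1 <- data.frame(Sample = 1:3, Age = c("50", ".", "."))
df2 <- data.frame(Sample = 1:3,
                  Age_1 = c(40L, 35L, 30L),
                  Age_2 = c("42", "49", "."),
                  Age_3 = c("44", ".", "."))

# Turn "." into NA and make every age column integer
ages <- df2 %>%
  mutate(across(starts_with("Age"),
                ~ as.integer(na_if(as.character(.x), "."))))

# First non-NA value, scanning from the most recent visit backwards
latest <- ages %>%
  transmute(Sample, Age = coalesce(Age_3, Age_2, Age_1))

# Fill only the ages that are missing in df1
result <- df1 %>%
  mutate(Age = as.integer(na_if(Age, "."))) %>%
  left_join(latest, by = "Sample", suffix = c("", "_latest")) %>%
  transmute(Sample, Age = coalesce(Age, Age_latest))
```

Joining by Sample avoids relying on both data frames having their rows in the same order.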

Creating an interval for frequency table in R

I have a dataframe I've created in the form
FREQ CNT
0 5
1 20
2 1000
3 3
4 3
I want to further group my results to be in the following form:
CUT CNT
0+1 25
2+3 1003
4+5 ...
.....
I've tried using dplyr's between and base R's cut, but that just adds a new interval column to my dataframe. Can anyone give me a good indication of where to go to achieve this?
Here is a way to do it in dplyr:
library(dplyr)
df <- df %>%
mutate(id = 1:n()) %>%
mutate(new_freq = ifelse(id %% 2 != 0, paste0(FREQ, "+", lead(FREQ, 1)), paste0(lag(FREQ, 1), "+", FREQ)))
df <- df %>%
group_by(new_freq) %>%
mutate(new_cnt = sum(CNT))
unique(df[, 4:5])
# A tibble: 2 x 2
# Groups: new_freq [2]
# new_freq new_cnt
# <chr> <int>
#1 0+1 25
#2 2+3 1003
data
df <- structure(list(FREQ = 0:3, CNT = c(5L, 20L, 1000L, 3L)), class = "data.frame", row.names = c(NA, -4L))
A non-elegant solution using dplyr... probably a better way to do this.
library(dplyr)
dat <- data.frame(FREQ = c(0,1,2,3,4), CNT = c(5,20,1000, 3, 3))
dat2 <- dat %>%
mutate(index = 0:(nrow(dat)-1)%/%2) %>%
group_by(index)
dat2 %>%
summarise(new_CNT = sum(CNT)) %>%
left_join(dat2 %>%
mutate(CUT = paste0(FREQ[1], "+", FREQ[2])) %>%
distinct(index, CUT),
by = "index") %>%
select(-index)
# A tibble: 3 x 2
new_CNT CUT
<dbl> <chr>
1 25 0+1
2 1003 2+3
3 3 4+NA
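If the bins should depend on the FREQ values themselves rather than on row position, integer division gives the bin directly. This is a sketch that assumes FREQ holds non-negative integers and that bins are the pairs 0+1, 2+3, and so on; the helper column bin is a name chosen here:

```r
library(dplyr)

dat <- data.frame(FREQ = 0:4, CNT = c(5, 20, 1000, 3, 3))

res <- dat %>%
  mutate(bin = FREQ %/% 2,                            # 0,0,1,1,2
         CUT = paste0(2 * bin, "+", 2 * bin + 1)) %>% # label from the bin
  group_by(bin, CUT) %>%                              # bin keeps numeric order
  summarise(CNT = sum(CNT), .groups = "drop") %>%
  select(CUT, CNT)
```

Because the label is built from the bin number, an incomplete last bin is labelled 4+5 instead of 4+NA, and gaps in FREQ do not shift later rows into the wrong pair.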

Looping through columns and duplicating data in R

I am trying to iterate through columns: if a column is a whole year, it should be duplicated four times and renamed to quarters.
So this
2000 Q1-01 Q2-01 Q3-01
1 2 3 3
Should become this:
Q1-00 Q2-00 Q3-00 Q4-00 Q1-01 Q2-01 Q3-01
1 1 1 1 2 3 3
Any ideas?
We can use stringr::str_detect to look for colnames with 4 digits then take the last two digits from those columns
library(dplyr)
library(tidyr)
library(stringr)
df %>% gather(key,value) %>% group_by(key) %>%
mutate(key_new = ifelse(str_detect(key,'\\d{4}'),paste0('Q',1:4,'-',str_extract(key,'\\d{2}$'),collapse = ','),key)) %>%
ungroup() %>% select(-key) %>%
separate_rows(key_new,sep = ',') %>% spread(key_new,value)
PS: I hope you don't have a large dataset
Since you want repeated columns, you can just re-index your data frame and then update the column names
df <- structure(list(`2000` = 1L, Q1.01 = 2L, Q2.01 = 3L, Q3.01 = 3L,
`2002` = 1L, Q1.03 = 2L, Q2.03 = 3L, Q3.03 = 3L), row.names = c(NA,
-1L), class = "data.frame")
#> df
#2000 Q1.01 Q2.01 Q3.01 2002 Q1.03 Q2.03 Q3.03
#1 1 2 3 3 1 2 3 3
# Get indices of columns that consist of 4 numbers
col.ids <- grep('^[0-9]{4}$', names(df))
# For each of those, create new names, and for the rest preserve the old names
new.names <- lapply(seq_along(df), function(i) {
if (i %in% col.ids)
return(paste(substr(names(df)[i], 3, 4), c('Q1', 'Q2', 'Q3', 'Q4'), sep = '.'))
return(names(df)[i])
})
# Now repeat each of those columns 4 times
df <- df[rep(seq_along(df), ifelse(seq_along(df) %in% col.ids, 4, 1))]
# ...and finally set the column names to the desired new names
names(df) <- unlist(new.names)
#> df
#00.Q1 00.Q2 00.Q3 00.Q4 Q1.01 Q2.01 Q3.01 02.Q1 02.Q2 02.Q3 02.Q4 Q1.03 Q2.03 Q3.03
#1 1 1 1 1 2 3 3 1 1 1 1 2 3 3
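The re-indexing approach can also produce the Q1-00 style names shown in the question. A sketch, assuming the columns were read with their original names (check.names = FALSE) and that a whole-year column is one whose name is exactly four digits:

```r
df <- data.frame(`2000` = 1L, `Q1-01` = 2L, `Q2-01` = 3L, `Q3-01` = 3L,
                 check.names = FALSE)

# TRUE for columns whose name is a four-digit year
is_year <- grepl("^[0-9]{4}$", names(df))

# Year columns expand to four quarter names; others keep their name
new_names <- unlist(Map(function(nm, yr) {
  if (yr) paste0("Q", 1:4, "-", substr(nm, 3, 4)) else nm
}, names(df), is_year), use.names = FALSE)

# Repeat year columns four times, everything else once, then rename
out <- df[rep(seq_along(df), ifelse(is_year, 4L, 1L))]
names(out) <- new_names
```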

Embed nested list of data.frames in R

Setup:
I have a tibble (named data) with an embedded list of data.frames.
df1 <- data.frame(name = c("columnName1","columnName2","columnName3"),
value = c("yes", 1L, 0L),
stringsAsFactors = F)
df2 <- data.frame(name = c("columnName1","columnName2","columnName3"),
value = c("no", 1L, 1L),
stringsAsFactors = F)
df3 <- data.frame(name = c("columnName1","columnName2","columnName3"),
value = c("yes", 0L, 0L),
stringsAsFactors = F)
responses = list(df1,
df2,
df3)
data <- tibble(ids = c(23L, 42L, 84L),
responses = responses)
Note this is a simplified example of the data. The original data is from a flat json file and loaded with jsonlite::stream_in() function.
Objective:
My goal is to convert this tibble to another tibble where the embedded data.frames are spread (transposed) as columns; for example, my goal tibble is:
goal <- tibble(ids = c(23L, 42L, 84L),
columnName1 = c("yes","no","yes"),
columnName2 = c(1L, 1L, 0L),
columnName3 = c(0L, 1L, 0L))
# goal tibble
> goal
# A tibble: 3 x 4
ids columnName1 columnName2 columnName3
<int> <chr> <int> <int>
1 23 yes 1 0
2 42 no 1 1
3 84 yes 0 0
My inelegant solution:
Use dplyr::bind_rows() and tidyr::spread():
rdf <- dplyr::bind_rows(data$responses, .id = "id") %>%
tidyr::spread(key = "name", -id)
goal2 <- cbind(ids = data$ids, rdf[,-1]) %>%
as.tibble()
Comparing my solution to the goal:
# produced tibble
> goal2
# A tibble: 3 x 4
ids columnName1 columnName2 columnName3
* <int> <chr> <chr> <chr>
1 23 yes 1 0
2 42 no 1 1
3 84 yes 0 0
Overall, my solution works but has a few problems:
I don't know how to pass the unique ids through bind_rows() which forces me to create a dummy id ("id") which can't match to the original id ("ids"). This forces me to use a cbind() (which I don't like) and manually remove the dummy id (using -1 slicing on rdf).
The format of the columns are lost as my approach converts the integer columns to characters.
Any suggestions on how to improve my solution (especially using tidyverse based packages like tidyjson or tidyr)?
We can loop over the 'responses' column with map, spread each data.frame to 'wide' format with convert = TRUE so that the column types are converted automatically, store the result as a list column with transmute, and then unnest
library(tidyverse)
data %>%
transmute(ids, ind = map(responses, ~.x %>%
spread(name, value, convert = TRUE))) %>%
unnest(cols = ind)
# A tibble: 3 x 4
# ids columnName1 columnName2 columnName3
# <int> <chr> <int> <int>
#1 23 yes 1 0
#2 42 no 1 1
#3 84 yes 0 0
Or using the OP's code, we set the names of the list with 'ids' column, do the bind_rows and then spread
bind_rows(setNames(data$responses, data$ids), .id = 'ids') %>%
spread(name, value, convert = TRUE)
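With more recent tidyr versions, where spread is superseded by pivot_wider, the same result can be sketched by unnesting first and pivoting afterwards. pivot_wider has no convert argument, so this sketch uses base R's type.convert to recover the integer columns; that choice is an assumption of the sketch, not part of the answers above:

```r
library(dplyr)
library(tidyr)

nm <- c("columnName1", "columnName2", "columnName3")
data <- tibble(
  ids = c(23L, 42L, 84L),
  responses = list(
    data.frame(name = nm, value = c("yes", "1", "0")),
    data.frame(name = nm, value = c("no", "1", "1")),
    data.frame(name = nm, value = c("yes", "0", "0"))
  )
)

res <- data %>%
  unnest(cols = responses) %>%                          # one row per id/name pair
  pivot_wider(names_from = name, values_from = value) %>%
  mutate(across(-ids, ~ type.convert(.x, as.is = TRUE)))  # chr -> int where possible
```

Because ids stays in the tibble throughout, there is no dummy id to create and no cbind step.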
