User-defined function with mutate and case_when in R

I would like to know whether, and how, I can turn the call below into a function for a task that I do fairly often with my data. I can't figure out how to design a function from a call that involves mutate and case_when; both come from the dplyr package and require a number of additional arguments.
Alternatively, the call itself seems redundant to me, with so many case_when calls; perhaps it's possible to reduce how many times it's used.
Any help or information about alternative approaches is welcome.
The call looks like this:
library(dplyr)
library(stringr)
test_data %>%
mutate(
multipleoptions_o1 = case_when(
str_detect(multipleoptions, "option1") ~ 1,
is.na(multipleoptions) ~ NA_real_,
TRUE ~ 0),
multipleoptions_o2 = case_when(
str_detect(multipleoptions, "option2") ~ 1,
is.na(multipleoptions) ~ NA_real_,
TRUE ~ 0),
multipleoptions_o3 = case_when(
str_detect(multipleoptions, "option3") ~ 1,
is.na(multipleoptions) ~ NA_real_,
TRUE ~ 0),
multipleoptions_o4 = case_when(
str_detect(multipleoptions, "option4") ~ 1,
is.na(multipleoptions) ~ NA_real_,
TRUE ~ 0)
)
Sample data:
structure(list(multipleoptions = c("option1", "option2", "option3",
NA, "option2,option3", "option4")), row.names = c(NA, -6L), class = c("tbl_df",
"tbl", "data.frame"))
Desired output of the function:
structure(list(multipleoptions = c("option1", "option2", "option3",
NA, "option2,option3", "option4"), multipleoptions_o1 = c(1,
0, 0, NA, 0, 0), multipleoptions_o2 = c(0, 1, 0, NA, 1, 0), multipleoptions_o3 = c(0,
0, 1, NA, 1, 0), multipleoptions_o4 = c(0, 0, 0, NA, 0, 1)), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -6L))
The function's arguments should probably be: data (the input dataset); multipleoptions (the column of data containing the answer options, always a single column); patterns_to_look_for (the str_detect patterns to look up in multipleoptions); number_of_options (ideally the number of options can be more or fewer than 4, though I am not sure that is achievable); and output_columns (the names of the new columns, always the name of the original column followed by the option number or option name).

You can avoid the lengthy case_when code by splitting the options into separate elements, taking advantage of nesting/unnesting to get a single column of options, and then spreading to get a separate column for each option.
Updated Answer
library(tidyverse)
# Arguments
# data A data frame
# patterns Regular expression giving the pattern(s) at which to split the options strings
# ... Grouping columns, the first of which must be the "options" column.
# If options has repeated values, then there must be a second grouping
# column (an "ID" column) to differentiate these repeated values.
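fnc = function(data, patterns, ...) {
col = quos(...)
data %>%
# Split each options string, then unnest to one row per chosen option
mutate(option=str_split(!!col[[1]], patterns)) %>%
unnest(option) %>%
mutate(value=1) %>%
group_by(!!!col) %>%
# Count the options chosen in each row; a missing answer counts as 0
mutate(num_chosen = ifelse(is.na(!!col[[1]]), 0, sum(value))) %>%
spread(option, value, fill=0) %>%
# Drop the spurious NA column created by missing answers
select_at(vars(-matches("NA")))
}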
fnc(test_data, ",", multipleoptions)
multipleoptions num_chosen option1 option2 option3 option4
1 option1 1 1 0 0 0
2 option2 1 0 1 0 0
3 option2,option3 2 0 1 1 0
4 option3 1 0 0 1 0
5 option4 1 0 0 0 1
6 <NA> 0 0 0 0 0
# Fake data
ops = paste0("option",1:4)
set.seed(2)
d = tibble(var=replicate(20, paste(sample(ops, sample(1:4,1, prob=c(10,8,5,1))), collapse=",")))
# Add missing values
d = bind_rows(d[1:5,], data.frame(var=rep(NA,3)), d[6:nrow(d),])
fnc(d %>% mutate(ID=1:n()), ",", var, ID)
var ID num_chosen option1 option2 option3 option4
1 option1 17 1 1 0 0 0
2 option1,option2 12 2 1 1 0 0
3 option1,option2,option3 5 3 1 1 1 0
4 option1,option2,option4,option3 9 4 1 1 1 1
5 option1,option3 2 2 1 0 1 0
6 option1,option3,option4 3 3 1 0 1 1
7 option1,option4,option2 20 3 1 1 0 1
8 option1,option4,option3,option2 13 4 1 1 1 1
9 option2 11 1 0 1 0 0
10 option2,option3 23 2 0 1 1 0
11 option2,option3,option4 21 3 0 1 1 1
12 option3 1 1 0 0 1 0
13 option3 15 1 0 0 1 0
14 option3,option1 4 2 1 0 1 0
15 option3,option2,option4 14 3 0 1 1 1
16 option3,option4,option2,option1 22 4 1 1 1 1
17 option4 10 1 0 0 0 1
18 option4 16 1 0 0 0 1
19 option4 18 1 0 0 0 1
20 option4,option2,option3 19 3 0 1 1 1
21 <NA> 6 0 0 0 0 0
22 <NA> 7 0 0 0 0 0
23 <NA> 8 0 0 0 0 0
Original Answer
test_data %>%
filter(!is.na(multipleoptions)) %>%
mutate(option=str_split(multipleoptions, ",")) %>%
unnest %>%
mutate(value=1) %>%
spread(option, value)
multipleoptions option1 option2 option3 option4
<chr> <dbl> <dbl> <dbl> <dbl>
1 option1 1 NA NA NA
2 option2 NA 1 NA NA
3 option2,option3 NA 1 1 NA
4 option3 NA NA 1 NA
5 option4 NA NA NA 1
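The same reshaping can also be sketched with tidyr's newer verbs, separate_rows() and pivot_wider(), assuming tidyr 1.0 or later is installed. This is only an illustration on the question's sample data, not a replacement for the code above; the names wide, option and value are arbitrary.

```r
library(dplyr)
library(tidyr)

# Sample data from the question
test_data <- tibble(multipleoptions = c("option1", "option2", "option3",
                                        NA, "option2,option3", "option4"))

wide <- test_data %>%
  filter(!is.na(multipleoptions)) %>%
  # Copy the column so the original string survives the split
  mutate(option = multipleoptions, value = 1) %>%
  # One row per chosen option
  separate_rows(option, sep = ",") %>%
  # One column per option; unchosen options are filled with 0 instead of NA
  pivot_wider(names_from = option, values_from = value, values_fill = 0)

wide
```

As with spread, rows that share the same multipleoptions value would need an ID column to keep them apart.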
Packaging this into a function:
fnc = function(data, col, patterns) {
col = enquo(col)
data %>%
filter(!is.na(!!col)) %>%
mutate(option=str_split(!!col, patterns)) %>%
unnest %>%
mutate(value=1) %>%
spread(option, value)
}
fnc(test_data, multipleoptions, ",")
If your real data has more than one row with the same value of multipleoptions, then this code will work only if there is also an ID column that distinguishes those rows. For example:
# Fake data
ops = paste0("option",1:4)
set.seed(2)
d = data.frame(var=replicate(20, paste(sample(ops, sample(1:4,1, prob=c(10,8,5,1))), collapse=",")))
fnc(d, var, ",")
Error: Duplicate identifiers for rows (1, 27), (16, 28, 30)
# Add unique row identifier
fnc(d %>% mutate(ID = 1:n()), var, ",")
var ID option1 option2 option3 option4
1 option1 14 1 NA NA NA
2 option1,option2 9 1 1 NA NA
3 option1,option2,option3 5 1 1 1 NA
4 option1,option2,option4,option3 6 1 1 1 1
5 option1,option3 2 1 NA 1 NA
6 option1,option3,option4 3 1 NA 1 1
7 option1,option4,option2 17 1 1 NA 1
8 option1,option4,option3,option2 10 1 1 1 1
9 option2 8 NA 1 NA NA
10 option2,option3 20 NA 1 1 NA
11 option2,option3,option4 18 NA 1 1 1
12 option3 1 NA NA 1 NA
13 option3 12 NA NA 1 NA
14 option3,option1 4 1 NA 1 NA
15 option3,option2,option4 11 NA 1 1 1
16 option3,option4,option2,option1 19 1 1 1 1
17 option4 7 NA NA NA 1
18 option4 13 NA NA NA 1
19 option4 15 NA NA NA 1
20 option4,option2,option3 16 NA 1 1 1
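For comparison, here is a minimal sketch that stays closer to the signature the question asks for (data, column, patterns, output names). The helper name add_option_flags and its suffixes argument are hypothetical, chosen only for illustration. It relies on str_detect() returning NA for missing input, so NA answers stay NA without a case_when.

```r
library(dplyr)
library(stringr)

# Hypothetical helper: adds one 0/1 indicator column per pattern
add_option_flags <- function(data, col, patterns,
                             suffixes = paste0("o", seq_along(patterns))) {
  col_name <- rlang::as_name(rlang::enquo(col))
  x <- data[[col_name]]
  for (i in seq_along(patterns)) {
    # TRUE/FALSE/NA becomes 1/0/NA
    data[[paste0(col_name, "_", suffixes[i])]] <-
      as.numeric(str_detect(x, patterns[i]))
  }
  data
}

test_data <- tibble(multipleoptions = c("option1", "option2", "option3",
                                        NA, "option2,option3", "option4"))
res <- add_option_flags(test_data, multipleoptions, paste0("option", 1:4))
res
```

The number of options is simply the length of patterns. Note that a pattern such as "option1" would also match "option10", so with ten or more options the patterns would need word boundaries or similar.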

Related

Filling in NA values with a sequence by group

I have a data set that looks like the following:
ID Count
1 0
1 1
1 NA
1 2
1 NA
1 NA
1 NA
1 NA
1 NA
2 0
2 NA
2 NA
2 3
The first row of each ID starts with 0. I want to fill the NA values with sequential values by group. If there are values before and after the NA values, I need to fill the NA values with a sequence counting up to the first value after the NA values. If there are no values after the NA values, I need to fill the NA values with a sequence counting up from the last value before the NA value. The output should look like following:
ID Count
1 0
1 1
1 1
1 2
1 3
1 4
1 5
1 6
1 7
2 0
2 1
2 2
2 3
This is a little complicated, but I think it does what you want. I left all my helper columns in so you can see what's happening; the unneeded columns can be dropped at the end.
library(dplyr)
library(vctrs)
df %>%
group_by(ID, na_group = cumsum(!is.na(Count))) %>%
mutate(n_til_non_na = ifelse(is.na(Count), rev(row_number()), 0L)) %>%
group_by(ID) %>%
mutate(
fill_down = vec_fill_missing(Count, direction = "down"),
fill_up = vec_fill_missing(Count, direction = "up"),
result = case_when(
is.na(fill_up) ~ fill_down + cumsum(is.na(fill_up)),
is.na(Count) ~ fill_up - n_til_non_na,
TRUE ~ Count
)
) %>%
ungroup()
# # A tibble: 13 × 7
# ID Count na_group n_til_non_na fill_down fill_up result
# <int> <int> <int> <int> <int> <int> <int>
# 1 1 0 1 0 0 0 0
# 2 1 1 2 0 1 1 1
# 3 1 NA 2 1 1 2 1
# 4 1 2 3 0 2 2 2
# 5 1 NA 3 5 2 NA 3
# 6 1 NA 3 4 2 NA 4
# 7 1 NA 3 3 2 NA 5
# 8 1 NA 3 2 2 NA 6
# 9 1 NA 3 1 2 NA 7
# 10 2 0 4 0 0 0 0
# 11 2 NA 4 2 0 3 1
# 12 2 NA 4 1 0 3 2
# 13 2 3 5 0 3 3 3
Using this sample data:
df = read.table(text = 'ID Count
1 0
1 1
1 NA
1 2
1 NA
1 NA
1 NA
1 NA
1 NA
2 0
2 NA
2 NA
2 3', header = T)
You can use purrr::accumulate(), first backwards, then forward. While going backwards, replace each missing value with the previous value - 1 to count down; then while moving forwards, replace remaining missing values with the previous value + 1 to count up.
library(dplyr)
library(purrr)
df %>%
group_by(ID) %>%
mutate(
Count = accumulate(
Count,
\(x, y) ifelse(is.na(x), y - 1, x),
.dir = "backward"
),
Count = accumulate(
Count,
\(x, y) ifelse(is.na(y), x + 1, y)
)
) %>%
ungroup()
# A tibble: 13 × 2
ID Count
<dbl> <dbl>
1 1 0
2 1 1
3 1 1
4 1 2
5 1 3
6 1 4
7 1 5
8 1 6
9 1 7
10 2 0
11 2 1
12 2 2
13 2 3
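To see how the two passes behave, here is a small sketch on a single vector, outside any grouping. The vector x is made up for illustration. It assumes purrr 0.3.0 or later, where a backward accumulate() passes the current element first and the value accumulated from the right second.

```r
library(purrr)

# Made-up example: a gap between known values, then trailing NAs
x <- c(0, NA, NA, 3, NA, NA)

# Backward pass: count down from the next known value
back <- accumulate(
  x,
  \(cur, acc) ifelse(is.na(cur), acc - 1, cur),
  .dir = "backward"
)

# Forward pass: count up from the last known value
filled <- accumulate(
  back,
  \(acc, cur) ifelse(is.na(cur), acc + 1, cur)
)

back    # 0 1 2 3 NA NA
filled  # 0 1 2 3 4 5
```

The trailing NAs survive the backward pass because there is nothing to their right to count down from; only the forward pass can fill them.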

Counting previous occurrences of certain variable per group and storing as new column

I want to create five new columns that count how often a certain value of "stars" has happened for this business before this particular row (i.e., summing up over all rows with a smaller rolingcount but holding the business constant).
For the first row of each business (i.e., where rolingcount == 1), it should be NA, because there have been no previous occurrences for this business.
Here is an exemplary dataset:
business <-c("aaa","aaa","aaa","bbb","bbb","bbb","bbb","bbb","ccc","ccc","ccc","ccc","ccc","ccc","ccc","ccc")
rolingcount <- c(1,2,3,1,2,3,4,5,1,2,3,4,5,6,7,8)
stars <- c(5,5,3,5,5,1,2,3,5,1,2,3,4,5,5,5)
df <- cbind(business, rolingcount, stars)
I feel my problem is related to this one, but with a twist that I can't get to work: Numbering rows within groups in a data frame
I also unsuccessfully experimented with while loops.
Ideally, something like this will be the output. (I leave out previousthree, previoustwo, previousone, because I believe they will work identically.)
business <- c("aaa","aaa","aaa","bbb","bbb","bbb","bbb","bbb","ccc","ccc","ccc","ccc","ccc","ccc","ccc","ccc")
rolingcount <- c(1,2,3,1,2,3,4,5,1,2,3,4,5,6,7,8)
stars <- c(5,5,3,5,5,1,2,3,5,1,2,3,4,5,5,5)
previousfives <- c(NA,1,2,NA,1,2,2,2,NA,1,1,1,1,1,2,3)
previousfours <- c(NA,0,0,NA,0,0,0,0,NA,0,0,0,0,1,1,1)
df <- cbind(business, rolingcount, stars, previousfives, previousfours)
Since I will have to do this for more than 10 M rows, a fast option would be cool. Your help is much appreciated! :)
I don't know whether this option is really fast; I'm not used to dealing with that many rows.
Here is a solution using the dplyr package from the tidyverse:
library(tidyverse)
df %>%
as.data.frame() %>%
group_by(business) %>%
mutate(stars = as.numeric(stars),
lag_stars = lag(stars, 1, default = 0),
previousfives = ifelse(lag_stars == 0, NA_real_, cumsum(lag_stars == 5)),
previousfours = ifelse(lag_stars == 0, NA_real_, cumsum(lag_stars == 4)),
previousthrees = ifelse(lag_stars == 0, NA_real_, cumsum(lag_stars == 3)),
previoustwos = ifelse(lag_stars == 0, NA_real_, cumsum(lag_stars == 2)),
previousones = ifelse(lag_stars == 0, NA_real_, cumsum(lag_stars == 1))) %>%
ungroup() %>%
select(-lag_stars)
Output :
# A tibble: 16 x 8
business rolingcount stars previousfives previousfours previousthrees previoustwos previousones
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 aaa 1 5 NA NA NA NA NA
2 aaa 2 5 1 0 0 0 0
3 aaa 3 3 2 0 0 0 0
4 bbb 1 5 NA NA NA NA NA
5 bbb 2 5 1 0 0 0 0
6 bbb 3 1 2 0 0 0 0
7 bbb 4 2 2 0 0 0 1
8 bbb 5 3 2 0 0 1 1
9 ccc 1 5 NA NA NA NA NA
10 ccc 2 1 1 0 0 0 0
11 ccc 3 2 1 0 0 0 1
12 ccc 4 3 1 0 0 1 1
13 ccc 5 4 1 0 1 1 1
14 ccc 6 5 1 1 1 1 1
15 ccc 7 5 2 1 1 1 1
16 ccc 8 5 3 1 1 1 1
Basically, group_by performs the operation separately for each business, and cumsum over the lagged stars column gives the running count of previous occurrences.
Maybe it will lead you to a faster idea if this is too slow.
Hope it helped.
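A slightly shorter variant of the same idea, sketched on a made-up subset of the data: take the running count first and lag it afterwards, so the first row of each business becomes NA through lag()'s default rather than through a sentinel 0. Only previousfives is shown; the other columns would follow the same pattern. Whether this is fast enough for 10 M rows is untested.

```r
library(dplyr)

# Made-up subset of the question's data
df_small <- tibble(
  business = c("aaa", "aaa", "aaa", "bbb", "bbb"),
  stars    = c(5, 5, 3, 5, 1)
)

res <- df_small %>%
  group_by(business) %>%
  # cumsum() counts fives up to and including this row;
  # lag() shifts the count down so only earlier rows are counted
  mutate(previousfives = lag(cumsum(stars == 5))) %>%
  ungroup()

res
```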

How to recode multiple choice answers in R using dplyr?

I have a dataset with answers to a large number of multiple choice questions. I now want to recode these answers as either true (1) or false (0). The data look like this:
`#ID q1 q2 q3 cq1 cq2 cq3
#1 1 2 1 NA NA NA
#2 1 2 2 NA NA NA
#3 2 2 2 NA NA NA
#4 1 2 1 NA NA NA`
what I want is this:
`#ID q1 q2 q3 cq1 cq2 cq3
#1 1 2 1 0 0 0
#2 1 2 2 0 0 1
#3 2 2 2 1 0 1
#4 1 2 1 0 0 0`
I know that I could write out all answers like this:
`data_re <- data %>%
mutate(cq1 = if_else(q1==2, 1, 0),
cq2 = if_else(q2==1, 1, 0),
cq3 = if_else(q3==2, 1, 0))`
But is there any way to do this automatically (similar to this approach: How to mutate_at multiple columns on a condition on each value?)
However, I would have to generate the variablename of the conditional variable automatically. I tried this:
`names_answer_two_correct <- c("q1", "q3")
cnames_answer_two_correct <- paste0("c", names_answer_two_correct)
for (i in 1:length(names_answer_two_correct)) {
data_re <- data %>%
mutate(names_answer_two_correct[i] = if_else(cnames_answer_two_correct[i]== 2, 1, 0))
}`
But I get "Error: unexpected '=' in:"
Does anyone know a solution?
You can use across to apply the function to multiple columns.
library(dplyr)
names_answer_two_correct <- c("q1", "q3")
data %>%
mutate(across(all_of(names_answer_two_correct),
~as.integer(.==2), .names = 'c{col}'),
cq2 = as.integer(q2==1)) -> data_re
data_re
# ID q1 q2 q3 cq1 cq2 cq3
#1 1 1 2 1 0 0 0
#2 2 1 2 2 0 0 1
#3 3 2 2 2 1 0 1
#4 4 1 2 1 0 0 0
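If the correct answer differs per question, one possible sketch keeps an answer key in a named vector and looks it up with cur_column() inside across. The key vector below is an assumption reconstructed from the question's if_else calls, and the data are the four rows shown in the question.

```r
library(dplyr)

data <- tibble(ID = 1:4,
               q1 = c(1, 1, 2, 1),
               q2 = c(2, 2, 2, 2),
               q3 = c(1, 2, 2, 1))

# Assumed answer key: question name -> correct answer
key <- c(q1 = 2, q2 = 1, q3 = 2)

data_re <- data %>%
  mutate(across(all_of(names(key)),
                # compare each column with its own entry in the key
                ~ as.integer(.x == key[[cur_column()]]),
                .names = "c{.col}"))

data_re
```

This handles all questions in a single across call, so no column needs to be written out by hand.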

Replace NA with 0 depending on group (rows) and variable names (column)

I have a large data set and want to replace many NAs, but not all.
In one group I want to replace all NAs with 0.
In the other group I want to replace all NAs with 0, but only in variables that do not include a certain part of the variable name, e.g. 'b'.
Here is an example:
group <- c(1,1,2,2,2)
abc <- c(1,NA,NA,NA,NA)
bcd <- c(2,1,NA,NA,NA)
cde <- c(5,NA,NA,1,2)
df <- data.frame(group,abc,bcd,cde)
group abc bcd cde
1 1 1 2 5
2 1 NA 1 NA
3 2 NA NA NA
4 2 NA NA 1
5 2 NA NA 2
This is what I want:
group abc bcd cde
1 1 1 2 5
2 1 0 1 0
3 2 NA NA 0
4 2 NA NA 1
5 2 NA NA 2
This is what I tried:
#set 0 in first group: this works fine
df[is.na(df) & df$group==1] <- 0
#set 0 in second group but only if the variable name does not include b: does not work
df[is.na(df) & df$group==2 & !grepl('b',colnames(df))] <- 0
Both dplyr and base R solutions are welcome.
For the second group, create a logical column index with grepl and use it to subset the data while assigning:
j1 <- !grepl('b',colnames(df))
df[j1][df$group == 2 & is.na(df[j1])] <- 0
df
# group abc bcd cde
#1 1 1 2 5
#2 1 0 1 0
#3 2 NA NA 0
#4 2 NA NA 1
#5 2 NA NA 2
Using dplyr::mutate_at you can also do:
library(dplyr)
vars_mutate_1 <- names(df)[-1]
vars_mutate_2 <- grep(x = names(df)[-1], pattern = '^(?!.*b).*$', perl = TRUE, value = TRUE)
df %>%
# funs() is deprecated in current dplyr; a formula lambda does the same job
mutate_at(.vars = vars_mutate_1, .funs = ~ if_else(group == 1 & is.na(.), 0, .)) %>%
mutate_at(.vars = vars_mutate_2, .funs = ~ if_else(group == 2 & is.na(.), 0, .))
group abc bcd cde
1 1 1 2 5
2 1 0 1 0
3 2 NA NA 0
4 2 NA NA 1
5 2 NA NA 2
Alternatively, you can use:
library(dplyr)
df2 <- df %>% mutate_at(vars(names(df)[-1]),
function(x) case_when((group==1 & is.na(x) ) ~ 0,
(group==2 & is.na(x) & !grepl("b",deparse(substitute(x)))) ~ 0,
TRUE ~ x))
> df2
group abc bcd cde
1 1 1 2 5
2 1 0 1 0
3 2 NA NA 0
4 2 NA NA 1
5 2 NA NA 2
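With dplyr 1.0 or later, the same two-step logic can be sketched with across and tidyselect helpers instead of mutate_at. This is an illustrative variant on the question's data, not a tested drop-in for a large data set.

```r
library(dplyr)

df <- data.frame(group = c(1, 1, 2, 2, 2),
                 abc   = c(1, NA, NA, NA, NA),
                 bcd   = c(2, 1, NA, NA, NA),
                 cde   = c(5, NA, NA, 1, 2))

res <- df %>%
  mutate(
    # group 1: replace NA with 0 in every column except group
    across(!group, ~ if_else(group == 1 & is.na(.x), 0, .x)),
    # group 2: replace NA with 0 only in columns whose name has no "b"
    across(!group & !contains("b"), ~ if_else(group == 2 & is.na(.x), 0, .x))
  )

res
```

Here only cde lacks a "b" in its name, so it is the only column changed for group 2.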

Spreading multiple columns over a column in R

I'm sorry, as this question has been asked several times, but I'm still having trouble wrapping my head around this problem.
So I have a dataframe, of the form:
ID Val Type
1 10 A
2 11 A
2 10 C
3 10 B
3 12 C
4 9 B
In case it helps, you can reproduce it with:
library(tidyr)
test <- data.frame(ID = c(1,2,2,3,3,4),
Val = c(10,11,10,10,12,9),
Type = c('A', 'A', 'C', 'B', 'C', 'B'))
I would like to split it to obtain:
ID A.Type B.Type C.Type A.Val B.Val C.Val
1 1 0 0 10 0 0
2 1 0 1 11 0 10
3 0 1 1 0 10 12
4 0 1 0 0 9 0
I know how to get columns 1:4 using:
table(test[, c(1, 3)]) %>% as.data.frame() %>% spread(Type, Freq)
It's the last three I need help with because in the actual data-frame values are continuous and table can not be used.
You are trying to reshape your data with multiple value variables where the ones are actually implicit, so in order to get the type_... columns, you will need to create a new type variable with ones and then use dcast from data.table package:
library(data.table)
setDT(test)
dcast(test[, type := 1][], ID ~ Type, value.var = c("type", "Val"),fill = 0)
# ID type_A type_B type_C Val_A Val_B Val_C
# 1: 1 1 0 0 10 0 0
# 2: 2 1 0 1 11 0 10
# 3: 3 0 1 1 0 10 12
# 4: 4 0 1 0 0 9 0
Or you can use reshape from base R, where NA has to be manually replaced:
test$type = 1
reshape(test, idvar = "ID", timevar = "Type", direction = "wide")
# ID Val.A type.A Val.C type.C Val.B type.B
# 1 1 10 1 NA NA NA NA
# 2 2 11 1 10 1 NA NA
# 4 3 NA NA 12 1 10 1
# 6 4 NA NA NA NA 9 1
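A tidyr equivalent can be sketched with pivot_wider, which accepts several value columns at once. It assumes tidyr 1.1.0 or later (for names_sort); the helper column type plays the same role as in the dcast call above.

```r
library(dplyr)
library(tidyr)

test <- data.frame(ID   = c(1, 2, 2, 3, 3, 4),
                   Val  = c(10, 11, 10, 10, 12, 9),
                   Type = c("A", "A", "C", "B", "C", "B"))

wide <- test %>%
  # make the implicit "this type is present" indicator explicit
  mutate(type = 1) %>%
  pivot_wider(id_cols = ID,
              names_from = Type,
              values_from = c(type, Val),
              values_fill = 0,     # absent combinations become 0 rather than NA
              names_sort = TRUE)   # columns ordered A, B, C

wide
```

The resulting columns are named type_A, type_B, type_C, Val_A, Val_B and Val_C, matching the dcast output.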
