I have a dataframe with two columns:
df <- data.frame (a = c(NA, 0, NA, NA, NA, NA, 0, 0, NA),
b = c(1, 2, 5, 3, 6, 3, 2, 1, 4))
a b
1 NA 1
2 0 2
3 NA 5
4 NA 3
5 NA 6
6 NA 3
7 0 2
8 0 1
9 NA 4
When the value in column a is 0, I want to replace the value in column b; desired end result is:
a b
1 NA 1
2 0 0
3 NA 5
4 NA 3
5 NA 6
6 NA 3
7 0 0
8 0 0
9 NA 4
I tried various combinations of mutate with ifelse and case_when, and all but one replace every value in column b with the value from column a, NA as well as 0.
Failed attempts:
df %>%
mutate(b = case_when(a == 0 ~ 0))
df %>%
mutate(b = case_when(a == 0 ~ 0,
TRUE ~ as.numeric(as.character(a))))
df %>%
mutate(b = ifelse(a==0, a, b))
All result in:
a b
1 NA NA
2 0 0
3 NA NA
4 NA NA
5 NA NA
6 NA NA
7 0 0
8 0 0
9 NA NA
After much consternation, I finally found a solution that produces the result I'm after:
df <- df %>%
mutate(b = ifelse(is.na(a), b, a))
a b
1 NA 1
2 0 0
3 NA 5
4 NA 3
5 NA 6
6 NA 3
7 0 0
8 0 0
9 NA 4
But I'm still perplexed as to why the others did not work as expected. Would love some insight here.
Using %in% instead of == can be useful where there are NA values.
In base R the following will give you what you want.
df$b[df$a %in% 0] <- 0
Using this in dplyr is slightly more complicated than base R, but simpler than the previous solutions:
library(dplyr)
df <- df %>% mutate(b = if_else(a %in% 0, 0, b))
The reason for the problems is that NA == 0 gives NA, not FALSE. NA %in% 0 gives FALSE.
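To see the difference at the console, here is a small sketch. The vectors a and b are just the first three values of the question's columns, and the names eq_test, in_test, res_eq and res_in are made up for the illustration.

```r
a <- c(NA, 0, NA)
b <- c(1, 2, 5)

eq_test <- a == 0     # comparing NA with == yields NA
in_test <- a %in% 0   # %in% never returns NA

print(eq_test)
print(in_test)

# ifelse() carries an NA in the test through to the result
res_eq <- ifelse(eq_test, 0, b)
res_in <- ifelse(in_test, 0, b)

print(res_eq)
print(res_in)
```

The same thing explains the case_when attempts: rows that match no condition get NA when no default is supplied, and in the second attempt the default itself was column a, which is NA on those rows.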
A possible solution:
library(dplyr)
df %>%
mutate(b = if_else(a == 0 & !is.na(a), 0, b))
#> a b
#> 1 NA 1
#> 2 0 0
#> 3 NA 5
#> 4 NA 3
#> 5 NA 6
#> 6 NA 3
#> 7 0 0
#> 8 0 0
#> 9 NA 4
In general any operation on an NA becomes an NA, so when comparing vectors that have NA the results will be NA where either of the original items was NA.
If you're willing to eschew dplyr you can do this in base R:
df$b <- ifelse(
is.na(df$a),
df$b,
ifelse(
df$a == 0,
0,
df$b
)
)
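A shorter base R variant, offered as a sketch rather than a replacement for the nested ifelse above: which() returns only the positions where the test is TRUE, so the NA comparisons are dropped before the assignment.

```r
df <- data.frame(a = c(NA, 0, NA, NA, NA, NA, 0, 0, NA),
                 b = c(1, 2, 5, 3, 6, 3, 2, 1, 4))

# which() keeps only the TRUE positions; NA results of a == 0 are ignored
zero_rows <- which(df$a == 0)
df$b[zero_rows] <- 0

print(df)
```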
I would like to make some new variables in R based on multiple (>100) other variables.
My dataset looks like this
sub_id diag_1_ais diag_2_ais diag_3_ais diag_4_ais diag_5_ais diag_1_br diag_2_br diag_3_br diag_4_br diag_5_br
1 1 1 1 2 2 1 6 0 1 6 1
2 2 2 3 2 5 1 3
3 3 0 0 <NA> 4 1 0 0 <NA> 2 2
4 4 NA 1 2 2 NA 1 1 4
5 5 NA 4 2 3 5 NA 4 3 4 3
The variables diag_x_ais can take integers from 0-6, and diag_x_br can take integers between 1-6.
I would like to make 6 new variables corresponding to the 6 possible diag_x_br values, i.e. the new variables would be called br_1, br_2 ... br_6. These new variables shall then be filled with the maximum value of the corresponding diag_x_ais variables, i.e.
if diag_1_br, diag_2_br, and diag_4_br are all 3, then br_3 should take the maximum value of diag_1_ais, diag_2_ais, and diag_4_ais.
Please also see the example dataset below:
sub_id diag_1_ais diag_2_ais diag_3_ais diag_4_ais diag_5_ais diag_1_br diag_2_br diag_3_br diag_4_br diag_5_br br_1 br_2 br_3 br_4 br_5 br_6
1 1 1 1 2 2 1 6 0 1 6 1 2 NA NA NA NA 2
2 2 2 1 4 3 5 5 2 2 1 3 3 4 5 NA 2 NA
3 3 0 0 NA 4 1 0 0 NA 2 2 NA 4 NA NA NA NA
4 4 NA 1 2 2 NA 1 1 4 2 NA NA 2 NA NA
5 5 NA 4 2 3 5 NA 4 3 4 3 NA NA 5 4 NA NA
Hereafter, I would like a final variable which calculates the sum of the up to three largest br_x variables, example displayed below:
sub_id diag_1_ais diag_2_ais diag_3_ais diag_4_ais diag_5_ais diag_1_br diag_2_br diag_3_br diag_4_br diag_5_br br_1 br_2 br_3 br_4 br_5 br_6 sum3
1 1 1 1 2 2 1 6 0 1 6 1 2 NA NA NA NA 2 4
2 2 2 1 4 3 5 5 2 2 1 3 3 4 5 NA 2 NA 12
3 3 0 0 NA 4 1 0 0 NA 2 2 NA 4 NA NA NA NA 4
4 4 NA 1 2 2 NA 1 1 4 2 NA NA 2 NA NA 4
5 5 NA 4 2 3 5 NA 4 3 4 3 NA NA 5 4 NA NA 9
My actual dataset has 60 diag_x_ais variables and 60 diag_x_br variables and 4000 rows.
I hope that someone can help me do this in R. Thank you!
I think you could use the following solution. I made a slight modification so that we only sum the first 3 max values:
library(dplyr)
library(purrr)
df %>%
bind_cols(as.data.frame(t(map_dfr(1:6, function(a) pmap_dfc(df, ~ {x <- c(...)[grepl("br", names(df))]
inds <- which(x == a)
if(length(inds) != 0) {
y <- c(...)[grepl("ais", names(df))]
max(y[inds])
} else {
NA
}})))) %>%
setNames(paste0("br", 1:6))) %>%
rowwise() %>%
mutate(sum = sum(sort(as.numeric(c_across(starts_with("br"))), decreasing = TRUE)[1:3], na.rm = TRUE)) %>%
select(starts_with("br"), sum)
Resulting output
# A tibble: 5 x 7
# Rowwise:
br1 br2 br3 br4 br5 br6 sum
<chr> <chr> <chr> <chr> <chr> <chr> <dbl>
1 2 NA NA NA NA 2 4
2 3 4 5 NA 2 NA 12
3 NA 4 NA NA NA NA 4
4 2 NA NA 2 NA NA 4
5 NA NA 5 4 NA NA 9
You could use some heavy data transformation, although it is most likely not very efficient on large datasets. There are some empty values, NA and 0 in your dataset. I didn't handle them (and replaced the empty values with NA to make importing easier).
library(tidyr)
library(dplyr)
data %>%
pivot_longer(-sub_id,
names_to = c("name", "cat"),
names_pattern = ".*_(\\d+)_(.*)") %>%
pivot_wider(names_from = "cat") %>%
group_by(sub_id, br) %>%
summarise(value = max(ais), .groups = "drop") %>%
filter(br %in% 1:6) %>%
group_by(sub_id) %>%
mutate(sum = sum(tail(sort(value), 3))) %>%
pivot_wider(names_from = br,
names_glue = "br_{br}") %>%
select(sub_id, paste0("br_", 1:6), sum)
This returns
# A tibble: 5 x 8
sub_id br_1 br_2 br_3 br_4 br_5 br_6 sum
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 2 NA NA NA NA 2 4
2 2 3 4 5 NA 2 NA 12
3 3 NA 4 NA NA NA NA 4
4 4 2 NA NA 2 NA NA 4
5 5 NA NA 5 4 NA NA 9
Piping an additional right_join(data, by = "sub_id") gives you your example output (apart from the order of your columns).
I took an idea from this answer.
Data
data <- structure(list(sub_id = c(1, 2, 3, 4, 5), diag_1_ais = c(1, 2,
0, NA, NA), diag_2_ais = c(1, 1, 0, NA, 4), diag_3_ais = c(2,
4, NA, 1, 2), diag_4_ais = c(2, 3, 4, 2, 3), diag_5_ais = c(1,
5, 1, 2, 5), diag_1_br = c(6, 5, 0, NA, NA), diag_2_br = c(0,
2, 0, NA, 4), diag_3_br = c(1, 2, NA, 1, 3), diag_4_br = c(6,
1, 2, 1, 4), diag_5_br = c(1, 3, 2, 4, 3)), row.names = c(NA,
-5L), class = c("tbl_df", "tbl", "data.frame"))
For the first part:
data <- data.frame(sub_id = c(1,2,3,4,5),
diag_1_ais = c(1,2,0,NA,NA),
diag_2_ais = c(1,1,0,NA,4),
diag_3_ais = c(2,4,NA,1,2),
diag_4_ais = c(2,3,4,2,3),
diag_5_ais = c(1,5,1,2,5),
diag_1_br = c(6,5,0,NA,NA),
diag_2_br = c(0,2,0,NA,4),
diag_3_br = c(1,2,NA,1,3),
diag_4_br = c(6,1,2,1,4),
diag_5_br = c(1,3,2,4,3))
calc_br <- function(data, value, firstBr, lastBr) {
br <- c()
for (i in 1:nrow(data)){
if (length(which(data[i,c(firstBr:lastBr)] %in% value))!=0){
br <- c(br, c(max(data[i,which(data[i,c(firstBr:lastBr)] %in% value)+1])))
}
else {
br <- c(br, c(NA))
}
}
result <- br
}
firstBr = 7
lastBr = 11
data$br_1 <- calc_br(data,1,firstBr,lastBr)
data$br_2 <- calc_br(data,2,firstBr,lastBr)
data$br_3 <- calc_br(data,3,firstBr,lastBr)
data$br_4 <- calc_br(data,4,firstBr,lastBr)
data$br_5 <- calc_br(data,5,firstBr,lastBr)
data$br_6 <- calc_br(data,6,firstBr,lastBr)
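This should yield the same results as in your example. For your real data you should only have to change firstBr and lastBr (to 62 and 121, I would guess, if sub_id is the first column, followed by the 60 diag_x_ais columns and then the 60 diag_x_br columns).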
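Rather than guessing the positions, they can be looked up from the column names. This is only a sketch: it assumes the real data has sub_id first, then diag_1_ais to diag_60_ais, then diag_1_br to diag_60_br, and the stand-in data frame wide below is made up to mimic that layout.

```r
# Hypothetical stand-in with the assumed column layout of the real data
nms <- c("sub_id",
         paste0("diag_", 1:60, "_ais"),
         paste0("diag_", 1:60, "_br"))
wide <- as.data.frame(matrix(NA_real_, nrow = 1, ncol = length(nms)))
names(wide) <- nms

# Positions of the diag_x_br columns, found by name
br_cols <- grep("^diag_[0-9]+_br$", names(wide))
firstBr <- min(br_cols)
lastBr  <- max(br_cols)

print(c(firstBr = firstBr, lastBr = lastBr))
```

Note that calc_br relies on the ais column for a diagnosis sitting at its position within the br block plus one, so the columns have to be in this order for the lookup to be meaningful.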
For the second part this should do the trick:
br_sum <- c()
for (i in 1:nrow(data)){
br_sum <- c(br_sum, sum(data[i,lastBr+tail(order(data[i,c((lastBr+1):(lastBr+6))], na.last = NA), 3)]))
}
data$br_sum <- br_sum
For completeness, here are my results:
sub_id diag_1_ais diag_2_ais diag_3_ais diag_4_ais diag_5_ais diag_1_br
1 1 1 1 2 2 1 6
2 2 2 1 4 3 5 5
3 3 0 0 NA 4 1 0
4 4 NA NA 1 2 2 NA
5 5 NA 4 2 3 5 NA
diag_2_br diag_3_br diag_4_br diag_5_br br_1 br_2 br_3 br_4 br_5 br_6 br_sum
1 0 1 6 1 2 NA NA NA NA 2 4
2 2 2 1 3 3 4 5 NA 2 NA 12
3 0 NA 2 2 NA 4 NA NA NA NA 4
4 NA 1 1 4 2 NA NA 2 NA NA 4
5 4 3 4 3 NA NA 5 4 NA NA 9
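For comparison, the same two steps can be sketched with matrices in base R, without looping over rows by hand. The helper name max_for and the objects ais, br, br_max and sum3 are made up for this sketch. One assumption differs from the answers above: max() is called with na.rm = TRUE, so a missing ais value among the matching diagnoses is ignored instead of turning the result into NA.

```r
data <- data.frame(
  sub_id     = 1:5,
  diag_1_ais = c(1, 2, 0, NA, NA),
  diag_2_ais = c(1, 1, 0, NA, 4),
  diag_3_ais = c(2, 4, NA, 1, 2),
  diag_4_ais = c(2, 3, 4, 2, 3),
  diag_5_ais = c(1, 5, 1, 2, 5),
  diag_1_br  = c(6, 5, 0, NA, NA),
  diag_2_br  = c(0, 2, 0, NA, 4),
  diag_3_br  = c(1, 2, NA, 1, 3),
  diag_4_br  = c(6, 1, 2, 1, 4),
  diag_5_br  = c(1, 3, 2, 4, 3)
)

# Two matrices with matching column order: diagnosis i sits in column i of each
ais <- as.matrix(data[grep("_ais$", names(data))])
br  <- as.matrix(data[grep("_br$",  names(data))])

# Per row: highest ais value among the diagnoses whose br equals k
max_for <- function(k) {
  masked <- ifelse(!is.na(br) & br == k, ais, NA)
  apply(masked, 1, function(r) {
    if (all(is.na(r))) NA_real_ else max(r, na.rm = TRUE)
  })
}

br_max <- sapply(1:6, max_for)
colnames(br_max) <- paste0("br_", 1:6)

# Sum of the (up to) three largest br_x values per row; sort() drops NA
sum3 <- apply(br_max, 1, function(r) sum(head(sort(r, decreasing = TRUE), 3)))

result <- cbind(data, br_max, sum3 = sum3)
print(result[c("sub_id", colnames(br_max), "sum3")])
```

Because the columns are selected by name pattern, the same code should carry over to 60 diagnoses as long as the ais and br columns appear in the same order.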
I have a dataframe df of integers across 6 variables.
a <- c(NA, NA, NA, 0, 0, 1, 1, 1)
b <- c(NA, NA, NA, 2, 2, 3, 3, 3)
c <- c(NA, NA, NA, 2, 2, 3, 3, 3)
d <- c(NA, NA, NA, 1, 1, 2, 2, 2)
e <- c(NA, NA, NA, 0, 0, 1, 1, 1)
f <- c(NA, NA, NA, 0, 0, 1, 1, 1)
df <- data.frame(a, b, c, d, e, f)
print(df)
a b c d e f
1 NA NA NA NA NA NA
2 NA NA NA NA NA NA
3 NA NA NA NA NA NA
4 0 2 2 1 0 0
5 0 2 2 1 0 0
6 1 3 3 2 1 1
7 1 3 3 2 1 1
8 1 3 3 2 1 1
I would like to add 1 to each row that contains a zero, resulting in:
a b c d e f
1 NA NA NA NA NA NA
2 NA NA NA NA NA NA
3 NA NA NA NA NA NA
4 1 3 3 2 1 1
5 1 3 3 2 1 1
6 1 3 3 2 1 1
7 1 3 3 2 1 1
8 1 3 3 2 1 1
I've been able to test if a row contains a zero with the following code, which adds a new column of "TRUE" or "FALSE".
df$cont0 <- apply(df, 1, function(x) any(x %in% "0"))
I thought I would use this new value to then add 1 to each row where df$cont0 == "TRUE"
ifelse(df$cont0 == "TRUE", df + 1, df)
This ends up creating a nested list that still does not perform the correct operation. I understand that ifelse is already vectorized, but other than that I'm not sure how to approach this issue. I am open to splitting apart the df into "TRUE" and "FALSE" conditions, then performing the operation on df$cont0 == "TRUE", but they need to be re-merged in the original order as the data are chronological and row order therefore matters. However I suspect there's an easier solution. Thank you!
Create a logical index with rowSums on the logical matrix and use that as row index to add
i1 <- rowSums(df == 0, na.rm = TRUE) > 0
df[i1,] <- df[i1, ] + 1
Output
> df
a b c d e f
1 NA NA NA NA NA NA
2 NA NA NA NA NA NA
3 NA NA NA NA NA NA
4 1 3 3 2 1 1
5 1 3 3 2 1 1
6 1 3 3 2 1 1
7 1 3 3 2 1 1
8 1 3 3 2 1 1
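Regarding the use of ifelse on a logical vector: ifelse returns a value with the same shape as its test argument and recycles the yes and no arguments to that length. In the OP's case the test is a vector with one element per row while yes and no are whole data frames, so the result is not a row-wise selection.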
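A small sketch of that behaviour, on a made-up three-row data frame rather than the OP's data (has_zero and res are names chosen for the illustration):

```r
df <- data.frame(a = c(NA, 0, 1), b = c(NA, 2, 3))
has_zero <- rowSums(df == 0, na.rm = TRUE) > 0

# ifelse() shapes its result like the test (3 elements), picking from the
# data frames as if they were lists of columns, so the result is a list
res <- ifelse(has_zero, df + 1, df)
print(class(res))
print(length(res))

# Row indexing applies the change to whole rows and keeps the row order
df[has_zero, ] <- df[has_zero, ] + 1
print(df)
```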
Just get the row index first:
index <- rowSums(df == 0, na.rm = TRUE) > 0
df[index, ] <- df[index, ] + 1
It should work.
I have a dataset with answers to a large number of multiple choice questions. I now want to recode these answers as either true (1) or false (0). The data look like this:
#ID q1 q2 q3 cq1 cq2 cq3
#1 1 2 1 NA NA NA
#2 1 2 2 NA NA NA
#3 2 2 2 NA NA NA
#4 1 2 1 NA NA NA
What I want is this:
#ID q1 q2 q3 cq1 cq2 cq3
#1 1 2 1 0 0 0
#2 1 2 2 0 0 1
#3 2 2 2 1 0 1
#4 1 2 1 0 0 0
I know that I could write out all answers like this:
data_re <- data %>%
mutate(cq1 = if_else(q1==2, 1, 0),
cq2 = if_else(q2==1, 1, 0),
cq3 = if_else(q3==2, 1, 0))
But is there a way to do this automatically (similar to this approach: How to mutate_at multiple columns on a condition on each value?)?
However, I would have to generate the variable name of the conditional variable automatically. I tried this:
names_answer_two_correct <- c("q1", "q3")
cnames_answer_two_correct <- paste0("c", names_answer_two_correct)
for (i in 1:length(names_answer_two_correct)) {
data_re <- data %>%
mutate(names_answer_two_correct[i] = if_else(cnames_answer_two_correct[i]== 2, 1, 0))
}
But I get "Error: unexpected '=' in:"
Does anyone know a solution?
You can use across to apply the function to multiple columns.
library(dplyr)
names_answer_two_correct <- c("q1", "q3")
data %>%
mutate(across(all_of(names_answer_two_correct),
~as.integer(.==2), .names = 'c{col}'),
cq2 = as.integer(q2==1)) -> data_re
data_re
# ID q1 q2 q3 cq1 cq2 cq3
#1 1 1 2 1 0 0 0
#2 2 1 2 2 0 0 1
#3 3 2 2 2 1 0 1
#4 4 1 2 1 0 0 0
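If the correct answer differs per question, one option is to keep an answer key and score every column against it. The following is a base R sketch, not part of the across solution above; the named vector key and the object scored are hypothetical, and the sample data is rebuilt without the empty cq columns.

```r
data <- data.frame(ID = 1:4,
                   q1 = c(1, 1, 2, 1),
                   q2 = c(2, 2, 2, 2),
                   q3 = c(1, 2, 2, 1))

# Hypothetical answer key: question name -> correct option
key <- c(q1 = 2, q2 = 1, q3 = 2)

# Compare each question column with its own correct answer
scored <- as.data.frame(
  Map(function(col, correct) as.integer(col == correct),
      data[names(key)], key)
)
names(scored) <- paste0("c", names(key))

data_re <- cbind(data, scored)
print(data_re)
```

Adding a question then only means adding one entry to key.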
I would like to know if/how I can turn the call below into a function that can be used in a task that I do fairly often with my data. Sadly, I can't figure out how to design a function from a call that involves mutate and case_when; both of these functions come from the dplyr package and require a number of additional arguments.
Alternatively, the call itself seems redundant to me with so many case_when calls; perhaps it's possible to reduce how many times it's used.
Any help and information about alternative approaches is welcomed.
The call looks like this:
library(dplyr)
library(stringr)
test_data %>%
mutate(
multipleoptions_o1 = case_when(
str_detect(multipleoptions, "option1") ~ 1,
is.na(multipleoptions) ~ NA_real_,
TRUE ~ 0),
multipleoptions_o2 = case_when(
str_detect(multipleoptions, "option2") ~ 1,
is.na(multipleoptions) ~ NA_real_,
TRUE ~ 0),
multipleoptions_o3 = case_when(
str_detect(multipleoptions, "option3") ~ 1,
is.na(multipleoptions) ~ NA_real_,
TRUE ~ 0),
multipleoptions_o4 = case_when(
str_detect(multipleoptions, "option4") ~ 1,
is.na(multipleoptions) ~ NA_real_,
TRUE ~ 0)
)
Sample data:
structure(list(multipleoptions = c("option1", "option2", "option3",
NA, "option2,option3", "option4")), row.names = c(NA, -6L), class = c("tbl_df",
"tbl", "data.frame"))
Desired output of the function:
structure(list(multipleoptions = c("option1", "option2", "option3",
NA, "option2,option3", "option4"), multipleoptions_o1 = c(1,
0, 0, NA, 0, 0), multipleoptions_o2 = c(0, 1, 0, NA, 1, 0), multipleoptions_o3 = c(0,
0, 1, NA, 1, 0), multipleoptions_o4 = c(0, 0, 0, NA, 0, 1)), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -6L))
Arguments of the function should probably be: data (i.e., the input dataset), multipleoptions (i.e., the column from data containing the answer options, always one column), patterns_to_look_for (i.e., the str_detect patterns to look up in multipleoptions), number_of_options (ideally the number of options can be more or less than 4, though I am not sure if that's achievable), and output_columns (i.e., the names of the new columns, which are always the name of the original column followed by the option number or option name).
You can avoid the lengthy case_when code by splitting the options into separate elements, taking advantage of nesting/unnesting to get a single column of options, and then spreading to get a separate column for each option.
Updated Answer
library(tidyverse)
# Arguments
# data A data frame
# patterns Regular expression giving the pattern(s) at which to split the options strings
# ... Grouping columns, the first of which must be the "options" column.
# If options has repeated values, then there must be a second grouping
# column (an "ID" column) to differentiate these repeated values.
fnc = function(data, patterns, ...) {
col = quos(...)
data %>%
mutate(option=str_split(!!!col[[1]], patterns)) %>%
unnest %>%
mutate(value=1) %>%
group_by(!!!col) %>%
mutate(num_chosen = ifelse(is.na(!!!col[[1]]), 0, sum(value))) %>%
spread(option, value, fill=0) %>%
select_at(vars(-matches("NA")))
}
fnc(test_data, ",", multipleoptions)
multipleoptions num_chosen option1 option2 option3 option4
1 option1 1 1 0 0 0
2 option2 1 0 1 0 0
3 option2,option3 2 0 1 1 0
4 option3 1 0 0 1 0
5 option4 1 0 0 0 1
6 <NA> 0 0 0 0 0
# Fake data
ops = paste0("option",1:4)
set.seed(2)
d = data_frame(var=replicate(20, paste(sample(ops, sample(1:4,1, prob=c(10,8,5,1))), collapse=",")))
# Add missing values
d = bind_rows(d[1:5,], data.frame(var=rep(NA,3)), d[6:nrow(d),])
fnc(d %>% mutate(ID=1:n()), ",", var, ID)
var ID num_chosen option1 option2 option3 option4
1 option1 17 1 1 0 0 0
2 option1,option2 12 2 1 1 0 0
3 option1,option2,option3 5 3 1 1 1 0
4 option1,option2,option4,option3 9 4 1 1 1 1
5 option1,option3 2 2 1 0 1 0
6 option1,option3,option4 3 3 1 0 1 1
7 option1,option4,option2 20 3 1 1 0 1
8 option1,option4,option3,option2 13 4 1 1 1 1
9 option2 11 1 0 1 0 0
10 option2,option3 23 2 0 1 1 0
11 option2,option3,option4 21 3 0 1 1 1
12 option3 1 1 0 0 1 0
13 option3 15 1 0 0 1 0
14 option3,option1 4 2 1 0 1 0
15 option3,option2,option4 14 3 0 1 1 1
16 option3,option4,option2,option1 22 4 1 1 1 1
17 option4 10 1 0 0 0 1
18 option4 16 1 0 0 0 1
19 option4 18 1 0 0 0 1
20 option4,option2,option3 19 3 0 1 1 1
21 <NA> 6 0 0 0 0 0
22 <NA> 7 0 0 0 0 0
23 <NA> 8 0 0 0 0 0
Original Answer
test_data %>%
filter(!is.na(multipleoptions)) %>%
mutate(option=str_split(multipleoptions, ",")) %>%
unnest %>%
mutate(value=1) %>%
spread(option, value)
multipleoptions option1 option2 option3 option4
<chr> <dbl> <dbl> <dbl> <dbl>
1 option1 1 NA NA NA
2 option2 NA 1 NA NA
3 option2,option3 NA 1 1 NA
4 option3 NA NA 1 NA
5 option4 NA NA NA 1
Packaging this into a function:
fnc = function(data, col, patterns) {
col = enquo(col)
data %>%
filter(!is.na(!!col)) %>%
mutate(option=str_split(!!col, patterns)) %>%
unnest %>%
mutate(value=1) %>%
spread(option, value)
}
fnc(test_data, multipleoptions, ",")
If your real data has more than one row with the same value of multipleoptions, then this code will work only if there's also an ID column that distinguishes different rows with the same value of multipleoptions. For example:
# Fake data
ops = paste0("option",1:4)
set.seed(2)
d = data.frame(var=replicate(20, paste(sample(ops, sample(1:4,1, prob=c(10,8,5,1))), collapse=",")))
fnc(d, var, ",")
Error: Duplicate identifiers for rows (1, 27), (16, 28, 30)
# Add unique row identifier
fnc(d %>% mutate(ID = 1:n()), var, ",")
var ID option1 option2 option3 option4
1 option1 14 1 NA NA NA
2 option1,option2 9 1 1 NA NA
3 option1,option2,option3 5 1 1 1 NA
4 option1,option2,option4,option3 6 1 1 1 1
5 option1,option3 2 1 NA 1 NA
6 option1,option3,option4 3 1 NA 1 1
7 option1,option4,option2 17 1 1 NA 1
8 option1,option4,option3,option2 10 1 1 1 1
9 option2 8 NA 1 NA NA
10 option2,option3 20 NA 1 1 NA
11 option2,option3,option4 18 NA 1 1 1
12 option3 1 NA NA 1 NA
13 option3 12 NA NA 1 NA
14 option3,option1 4 1 NA 1 NA
15 option3,option2,option4 11 NA 1 1 1
16 option3,option4,option2,option1 19 1 1 1 1
17 option4 7 NA NA NA 1
18 option4 13 NA NA NA 1
19 option4 15 NA NA NA 1
20 option4,option2,option3 16 NA 1 1 1
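If the output needs to match the question's layout exactly (columns named multipleoptions_o1 and so on, with NA kept as NA), a plain base R function is another possibility. This is only a sketch: the function name add_option_flags and its arguments are made up here, and it uses fixed-string matching with grepl instead of str_detect.

```r
test_data <- data.frame(
  multipleoptions = c("option1", "option2", "option3",
                      NA, "option2,option3", "option4"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: one 0/1 column per pattern, NA where the input is NA
add_option_flags <- function(data, column, patterns,
                             suffixes = paste0("o", seq_along(patterns))) {
  x <- data[[column]]
  for (i in seq_along(patterns)) {
    flag <- as.numeric(grepl(patterns[i], x, fixed = TRUE))
    flag[is.na(x)] <- NA   # grepl() returns FALSE for NA, so restore the NA
    data[[paste0(column, "_", suffixes[i])]] <- flag
  }
  data
}

out <- add_option_flags(test_data, "multipleoptions", paste0("option", 1:4))
print(out)
```

The number of options is simply the length of patterns, so it can be more or fewer than four. One caveat: with plain substring matching, "option1" would also match "option10", so with ten or more options a stricter pattern would be needed.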