Disclaimer: I suspect there is a much more efficient solution (perhaps an anonymous function with a list, or the *apply functions?), which is why I have come to you much more experienced people for help!
The data
Let's say I have a df with participant responses to 3 question As and 3 question Bs e.g.
qa1, qa2, qa3, qb1, qb2, qb3
1, 3, 1, 2, 4, 4
1, 3, 2, 2, 1, 4
2, 3, 1, 2, 1, 4
1, 3, 2, 1, 1, 3
EDIT: df also contains other columns with irrelevant data!
I have a vector with correct answers to each of qa1-3 and qb1-3 in sequence with the columns.
correct_answer <- c(1,3,2,2,1,4)
(i.e. for qa1,qa2,qa3,qb1,qb2,qb3)
Desired manipulation
I want to create a new column per question (e.g. qa1_correct), coding for whether the participant has responded correctly (1) or incorrectly (0) based on matching each response in df with corresponding answer in correct_answer. Ideally I would end up with:
qa1, qa2, qa3, qb1, qb2, qb3, qa1_correct, qa2_correct, qa3_correct ...
1, 3, 1, 2, 4, 4, 1, 1, 0, ...
1, 3, 2, 2, 1, 4, 1, 1, 1, ...
2, 3, 1, 2, 1, 4, 0, 1, 0, ...
1, 3, 2, 1, 1, 3, 1, 1, 1, ...
Failed Attempt
This is my attempt for question As only (would repeat for Bs) but it doesn't work (maybe wrong function paste0()?):
index <- c(1:3)
for (i in index) {
df <- df %>% mutate(paste0("qa",i,"_correct") =
case_when(paste0("qa"i) == correct_answer[i] ~ 1,
paste0("qa"i) != correct_answer[i] ~ 0))
}
Many thanks for any guidance!
You can combine mutate and across.
Code 1: Correct_answer as vector
df %>%
mutate(across(everything(),
~as.numeric(.x == correct_answer[names(df) == cur_column()]),
.names = "{.col}_correct"))
Code 2: Correct_answer as data.frame (df_correct)
correct_answer <- c(1,3,2,2,1,4)
df_correct <- data.frame(
matrix(correct_answer, ncol = length(correct_answer))
)
colnames(df_correct) <- names(df)
df %>%
mutate(across(everything(),
.fn = ~as.numeric(.x == df_correct[,cur_column()]),
.names = "{.col}_correct"))
Output
qa1 qa2 qa3 qb1 qb2 qb3 qa1_correct qa2_correct qa3_correct qb1_correct qb2_correct qb3_correct
1 1 3 1 2 4 4 1 1 0 1 0 1
2 1 3 2 2 1 4 1 1 1 1 1 1
3 2 3 1 2 1 4 0 1 0 1 1 1
4 1 3 2 1 1 3 1 1 1 0 1 0
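Since the question's edit mentions that df also holds unrelated columns, everything() would try to score those as well. Below is a minimal sketch of one way around that, assuming the answer key is stored as a named vector. The id column and the names on correct_answer are illustrative additions, not part of the original post.

```r
library(dplyr)

# Hypothetical data: the six question columns plus one unrelated column (id).
df <- data.frame(
  id  = 1:4,
  qa1 = c(1, 1, 2, 1), qa2 = c(3, 3, 3, 3), qa3 = c(1, 2, 1, 2),
  qb1 = c(2, 2, 2, 1), qb2 = c(4, 1, 1, 1), qb3 = c(4, 4, 4, 3)
)

# Assumption: the key is named, so each column can look up its own answer.
correct_answer <- c(qa1 = 1, qa2 = 3, qa3 = 2, qb1 = 2, qb2 = 1, qb3 = 4)

res <- df %>%
  mutate(across(all_of(names(correct_answer)),           # only the scored columns
                ~ as.numeric(.x == correct_answer[[cur_column()]]),
                .names = "{.col}_correct"))
res
```

Looking the answer up by name rather than by position also means the result does not depend on the column order of df.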
This may also be an alternative. It needs R 4.1.0 or later for the \(x) lambda shorthand (that release also gave apply a new simplify argument, TRUE by default).
df <- read.table(header = T, text = 'qa1, qa2, qa3, qb1, qb2, qb3
1, 3, 1, 2, 4, 4
1, 3, 2, 2, 1, 4
2, 3, 1, 2, 1, 4
1, 3, 2, 1, 1, 3', sep = ',')
df
#> qa1 qa2 qa3 qb1 qb2 qb3
#> 1 1 3 1 2 4 4
#> 2 1 3 2 2 1 4
#> 3 2 3 1 2 1 4
#> 4 1 3 2 1 1 3
correct_answer <- c(1,3,2,2,1,4)
cbind(df,
setNames(as.data.frame(t(apply(df, 1,
\(x) +(x == correct_answer)))),
paste0(names(df), '_correct')))
#> qa1 qa2 qa3 qb1 qb2 qb3 qa1_correct qa2_correct qa3_correct qb1_correct
#> 1 1 3 1 2 4 4 1 1 0 1
#> 2 1 3 2 2 1 4 1 1 1 1
#> 3 2 3 1 2 1 4 0 1 0 1
#> 4 1 3 2 1 1 3 1 1 1 0
#> qb2_correct qb3_correct
#> 1 0 1
#> 2 1 1
#> 3 1 1
#> 4 1 0
Created on 2021-07-23 by the reprex package (v2.0.0)
You can also use the following solution in base R:
cbind(df,
do.call(cbind, mapply(function(x, y) as.data.frame({+(x == y)}),
df, correct_answer, SIMPLIFY = FALSE)) |>
setNames(paste0(names(df), "_corr")))
qa1 qa2 qa3 qb1 qb2 qb3 qa1_corr qa2_corr qa3_corr qb1_corr qb2_corr qb3_corr
1 1 3 1 2 4 4 1 1 0 1 0 1
2 1 3 2 2 1 4 1 1 1 1 1 1
3 2 3 1 2 1 4 0 1 0 1 1 1
4 1 3 2 1 1 3 1 1 1 0 1 0
Or a potential tidyverse solution could be:
library(dplyr)
library(tidyr)
library(purrr)
df %>%
mutate(output = pmap(df, ~ setNames(+(c(...) == correct_answer),
paste0(names(df), "_corr")))) %>%
unnest_wider(output)
qa1 qa2 qa3 qb1 qb2 qb3 qa1_corr qa2_corr qa3_corr qb1_corr qb2_corr qb3_corr
1 1 3 1 2 4 4 1 1 0 1 0 1
2 1 3 2 2 1 4 1 1 1 1 1 1
3 2 3 1 2 1 4 0 1 0 1 1 1
4 1 3 2 1 1 3 1 1 1 0 1 0
Try this:
df_new <- cbind(df, t(apply(df, 1, function(x) as.numeric(x == correct_answer))))
EDIT: I found a related solution in "Paste variable name in mutate (dplyr)". On its own it only pasted 0's, but it works with the addition of sym():
for (i in index) {
df <- df %>% mutate( !!paste0("qa",i,"_correct") :=
case_when(!!sym(paste0("qa",i)) == correct_answer[i] ~ 1,
!!sym(paste0("qa",i)) != correct_answer[i] ~ 0))
}
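The original attempt produced only 0's because paste0() returns a string, so the comparison was between the text "qa1" and a number rather than between the column and the number. As an alternative spelling of the same loop, here is a small sketch that uses the .data pronoun and glue-style names on the left-hand side. It assumes dplyr 1.0 or later, and the three-column df is a cut-down stand-in for the real data.

```r
library(dplyr)

# Cut-down example data: question A columns only.
df <- data.frame(qa1 = c(1, 1, 2, 1),
                 qa2 = c(3, 3, 3, 3),
                 qa3 = c(1, 2, 1, 2))
correct_answer <- c(1, 3, 2)

for (i in seq_along(correct_answer)) {
  col <- paste0("qa", i)  # column name held as a string
  df <- df %>%
    # "{col}_correct" builds the new name; .data[[col]] fetches the column
    mutate("{col}_correct" := as.numeric(.data[[col]] == correct_answer[i]))
}
df
```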
I have a series of 14 Boolean variables and I would like to find the top 3 combinations of 3 or more variables (where the value == 1).
Sample data:
df <- data.frame(ID = c(1, 2, 3, 4, 5, 6, 7, 8),
var1 = c(0, 0, 1, 1, 1, 0, 0, 1),
var2 = c(1, 0, 0, 1, 1, 1, 1, 0),
var3 = c(0, 0, 1, 1, 1, 1, 0, 0),
var4 = c(1, 1, 1, 1, 1, 0, 1, 1),
var5 = c(0, 0, 0, 1, 1, 0, 1, 1)
)
df
> df
ID var1 var2 var3 var4 var5
1 1 0 1 0 1 0
2 2 0 0 0 1 0
3 3 1 0 1 1 0
4 4 1 1 1 1 1
5 5 1 1 1 1 1
6 6 0 1 1 0 0
7 7 0 1 0 1 1
8 8 1 0 0 1 1
I found a solution to bring all column names together per unique occurrence:
library(reshape2)  # melt()
library(plyr)      # ddply()
library(magrittr)  # %>%
# Bring to long format
df_long <- df %>%
melt(id.vars = "ID")
# Collapse the variables that have a '1' together per row
df_combo <- ddply(df_long, "ID", summarize,
combos = paste(variable[value == 1], collapse = "/"))
> df_combo
ID combos
1 1 var2/var4
2 2 var4
3 3 var1/var3/var4
4 4 var1/var2/var3/var4/var5
5 5 var1/var2/var3/var4/var5
6 6 var2/var3
7 7 var2/var4/var5
8 8 var1/var4/var5
If I only wanted counts on unique combinations this would be fine, but I would like to know the number of times each combination of 3 or more variables occurs, even in cases where other variables also occur. The combination (var1/var4/var5) occurs 3 times in the above example, but twice it occurs next to two other variables.
There must be an easy way to extract this information, just can't think of it. Thank you for your help!!
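An attempt, using combn as the workhorse function. In the output, X1 to X3 are column positions within df[-1], so 1 stands for var1, 2 for var2, and so on.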
arr <- which(df[-1] == 1, arr.ind=TRUE)
tmp <- tapply(arr[,"col"], arr[,"row"],
FUN=function(x) if (length(x) >= 3) combn(x,3, simplify=FALSE) )
tmp <- data.frame(do.call(rbind, unlist(tmp, recursive=FALSE)))
aggregate(count ~ . , cbind(tmp, count=1), sum)
## X1 X2 X3 count
##1 1 2 3 2
##2 1 2 4 2
##3 1 3 4 3
##4 2 3 4 2
##5 1 2 5 2
##6 1 3 5 2
##7 2 3 5 2
##8 1 4 5 3
##9 2 4 5 3
##10 3 4 5 2
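To read the result in terms of variable names and pick the most frequent triples, the same combn idea can be applied to the names directly. This is only a sketch: like the answer above it counts combinations of exactly 3 variables, and the helper names (vars, combos, counts, top3) are made up for illustration.

```r
df <- data.frame(ID = c(1, 2, 3, 4, 5, 6, 7, 8),
                 var1 = c(0, 0, 1, 1, 1, 0, 0, 1),
                 var2 = c(1, 0, 0, 1, 1, 1, 1, 0),
                 var3 = c(0, 0, 1, 1, 1, 1, 0, 0),
                 var4 = c(1, 1, 1, 1, 1, 0, 1, 1),
                 var5 = c(0, 0, 0, 1, 1, 0, 1, 1))

vars <- names(df)[-1]

# For every row, list each 3-variable combination among the variables set to 1.
combos <- unlist(lapply(seq_len(nrow(df)), function(i) {
  on <- vars[unlist(df[i, -1]) == 1]
  if (length(on) < 3) return(NULL)       # rows with fewer than 3 ones add nothing
  combn(on, 3, FUN = paste, collapse = "/")
}))

# Count how often each combination occurs and keep the three most frequent.
counts <- sort(table(combos), decreasing = TRUE)
top3 <- head(counts, 3)
top3
```

On the sample data three combinations occur 3 times each and all others occur twice, so the top three are unambiguous here; with ties at the cut-off, head() would keep an arbitrary subset.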
I want to find a way to replace the run of consecutive identical values at the beginning of each trial with 0; once the value has changed, the replacing should stop and the remaining values should be kept. This should happen for every trial of every subject.
For example, the first subject has multiple trials (1, 2, etc.). At the beginning of each trial, there may be some consecutive rows with the same value (e.g., 1, 1, 1). I would like to replace these values with 0. However, once the value has changed from 1 to 0, I want to keep the values in the rest of the trial (e.g., 0, 0, 1).
subject <- c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1)
trial <- c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2)
value <- c(1, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1)
df <- data.frame(subject, trial, value)
Thus, from the original data frame, I would like to have a new variable (value_new) like below.
subject trial value value_new
1 1 1 1 0
2 1 1 1 0
3 1 1 1 0
4 1 1 0 0
5 1 1 0 0
6 1 1 1 1
7 1 2 1 0
8 1 2 1 0
9 1 2 0 0
10 1 2 1 1
11 1 2 1 1
12 1 2 1 1
I was thinking of using dplyr with group_by(subject, trial) and mutate to create a new variable with a conditional statement, but I have no idea how to do that. I guess I need to use rle(), but again, I have no clue how to replace the consecutive values with 0, stop replacing once the value has changed, and keep the rest of the values.
Any suggestions or advice would be really appreciated!
You can use rleid from data.table :
library(data.table)
setDT(df)[, new_value := value * +(rleid(value) > 1), .(subject, trial)]
df
# subject trial value new_value
# 1: 1 1 1 0
# 2: 1 1 1 0
# 3: 1 1 1 0
# 4: 1 1 0 0
# 5: 1 1 0 0
# 6: 1 1 1 1
# 7: 1 2 1 0
# 8: 1 2 1 0
# 9: 1 2 0 0
#10: 1 2 1 1
#11: 1 2 1 1
#12: 1 2 1 1
You can also do this with dplyr (rleid still comes from data.table):
library(dplyr)
library(data.table)  # for rleid()
df %>%
group_by(subject, trial) %>%
mutate(new_value = value * +(rleid(value) > 1))
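If data.table is not available, the same "zero out the first run" logic can be sketched in base R with ave() and cumsum(). This is an alternative technique rather than what the answer uses, and the helper name zero_first_run is made up for illustration.

```r
subject <- c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1)
trial   <- c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2)
value   <- c(1, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1)
df <- data.frame(subject, trial, value)

# Keep a value only once the series has changed at least once within the group.
zero_first_run <- function(v) {
  changed <- cumsum(c(FALSE, diff(v) != 0)) > 0  # FALSE during the first run
  v * changed
}

# ave() applies the helper separately to each subject/trial combination.
df$value_new <- ave(df$value, df$subject, df$trial, FUN = zero_first_run)
df
```

The cumsum(...) > 0 test plays the role of rleid(value) > 1: both are FALSE for the first run and TRUE afterwards.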
Being a novice in R, I have trouble writing the appropriate code (I suspect it must include if/else commands and a loop).
In concrete terms, I would like to compare two pieces of information, "Monthly_category" and "Ref_category" (see the simplified example below; my actual database is rather long). The "Ref_category" to use is the one calculated, using the mode, at the 5th period of each element (Element_id); after that we move on to the next element.
Months Element_Id Monthly_Category Ref_Category Expected_output
1 1 3 NA 0
2 1 2 NA 0
3 1 2 NA 1
4 1 1 NA 1
5 1 3 3 0
1 2 6 2 0
2 2 6 6 1
3 2 NA 1 0
4 2 NA 6 0
5 2 1 1 0
More precisely, I would like to put 1 as soon as the "Monthly_category" differs from the selected "Ref_category" (which is calculated every 5 observations) for 2 periods in a row. Otherwise, set 0.
In addition, I would like the lines where Monthly_category = NA to give 0 directly, because in the end I will only take into account lines where I have 1s (NA doesn't interest me).
For each element (1 element = 5 lines), the reference category is calculated at the end of the 5 periods using the mode. However, by stretching the formula, we have values in each line while I have to consider each time only the last value (so every 5 lines). That's why I thought we needed 2 loops: one to check each line for the monthly category and one to check the reference category every 5 lines.
Do you have any idea of the code that could allow me to do this?
A very big thank you if someone can enlighten me,
Vanie
First of all, please have a look at the questions that @John Coleman and I asked you in the comments, because my solution may change depending on your answers.
Anyway, you don't need an explicit for loop or an explicit if/else to get the job done.
In R, you usually avoid writing for loops directly and use a functional like lapply instead. In this case the dplyr package takes care of the looping implicitly.
df <- tibble::tribble(~Months, ~Element_Id, ~Monthly_Category, ~Ref_Category, ~Expected_output,
1 , 1, 3, NA, 0,
2 , 1, 2, NA, 0,
3 , 1, 2, NA, 1,
4 , 1, 1, NA, 1,
5 , 1, 3, 3, 0,
1 , 2, 6, 2, 0,
2 , 2, 6, 6, 1,
3 , 2, 1, 1, 0,
4 , 2, 1, 6, 0,
5 , 2, 1, 1, 0)
library(dplyr)
library(purrr)
df %>%
# check if elements are equal
mutate(Real_Expected_output = !map2_lgl(Monthly_Category, Ref_Category, identical)) %>%
# sort by Element_Id and Months just in case your data is messy
arrange(Element_Id, Months) %>%
# For each Element_Id ...
group_by(Element_Id) %>%
# ... define your Expected Output
mutate(Real_Expected_output = as.integer(lag(Real_Expected_output, default = FALSE) &
lag(Real_Expected_output, 2, default = FALSE))) %>%
ungroup()
# Months Element_Id Monthly_Category Ref_Category Expected_output Real_Expected_output
# <dbl> <dbl> <dbl> <dbl> <dbl> <int>
# 1 1 3 NA 0 0
# 2 1 2 NA 0 0
# 3 1 2 NA 1 1
# 4 1 1 NA 1 1
# 5 1 3 3 0 1
# 1 2 6 2 0 0
# 2 2 6 6 1 0
# 3 2 1 1 0 0
# 4 2 1 6 0 0
# 5 2 1 1 0 0
Real_Expected_output is not the same as your Expected_output because I believe your expected result contradicts your written request, as I said in one of the comments.
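A note on why the comparison goes through identical() rather than !=: with missing values, != returns NA, whereas identical() always returns TRUE or FALSE. The short sketch below shows the difference on made-up vectors; mapply() is used here only as a base-R stand-in for purrr::map2_lgl().

```r
# Made-up values: one mismatch against NA, one match, one NA against NA.
mon <- c(3, 2, NA)
ref <- c(NA, 2, NA)

# Plain comparison propagates NA.
neq <- mon != ref
neq

# Elementwise identical() never returns NA; NA vs NA counts as identical.
same <- mapply(identical, mon, ref)
same

# identical() is also type-sensitive: a double and an integer differ.
identical(1, 1L)
```

The type sensitivity matters if one column is stored as integer and the other as double, in which case identical() would report every pair as different.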
EDIT:
Based on your comment, I suppose this is what you're looking for.
Again: no loops; you just need to make good use of the tools that the dplyr package already provides, i.e. last, group_by and mutate.
df %>%
# sort by Element_Id and Months just in case your data is messy
arrange(Element_Id, Months) %>%
# For each Element_Id ...
group_by(Element_Id) %>%
# ... check if Monthly Category is equal to the last Ref_Category
mutate(Real_Expected_output = !map2_lgl(Monthly_Category, last(Ref_Category), identical)) %>%
# ... and define your Expected Output
mutate(Real_Expected_output = as.integer(Real_Expected_output &
lag(Real_Expected_output, default = FALSE))) %>%
ungroup()
# Months Element_Id Monthly_Category Ref_Category Expected_output Real_Expected_output
# <dbl> <dbl> <dbl> <dbl> <dbl> <int>
# 1 1 3 NA 0 0
# 2 1 2 NA 0 0
# 3 1 2 NA 1 1
# 4 1 1 NA 1 1
# 5 1 3 3 0 0
# 1 2 6 2 0 0
# 2 2 6 6 1 1
# 3 2 1 1 0 0
# 4 2 1 6 0 0
# 5 2 1 1 0 0
EDIT 2:
I'll edit it again based on your request. At this point I'd suggest creating a separate function to handle your problem; it looks cleaner.
df <- tibble::tribble(~Months, ~Element_Id, ~Monthly_Category, ~Ref_Category, ~Expected_output,
1 , 1, 3, NA, 0,
2 , 1, 2, NA, 0,
3 , 1, 2, NA, 1,
4 , 1, 1, NA, 1,
5 , 1, 3, 3, 0,
1 , 2, 6, 2, 0,
2 , 2, 6, 6, 1,
3 , 2, NA, 1, 0,
4 , 2, NA, 6, 0,
5 , 2, 1, 1, 0)
library(dplyr)
library(purrr)
get_output <- function(mon, ref){
# set here your condition
exp <- !is.na(mon) & !map2_lgl(mon, last(ref), identical)
# check exp and lag(exp), then convert to integer
as.integer(exp & lag(exp, default = FALSE))
}
df %>%
# sort by Element_Id and Months just in case your data is messy
arrange(Element_Id, Months) %>%
# For each Element_Id ...
group_by(Element_Id) %>%
# ... launch your function
mutate(Real_Expected_output = get_output(Monthly_Category, Ref_Category)) %>%
ungroup()
# # A tibble: 10 x 6
# Months Element_Id Monthly_Category Ref_Category Expected_output Real_Expected_output
# <dbl> <dbl> <dbl> <dbl> <dbl> <int>
# 1 1 1 3 NA 0 0
# 2 2 1 2 NA 0 0
# 3 3 1 2 NA 1 1
# 4 4 1 1 NA 1 1
# 5 5 1 3 3 0 0
# 6 1 2 6 2 0 0
# 7 2 2 6 6 1 1
# 8 3 2 NA 1 0 0
# 9 4 2 NA 6 0 0
# 10 5 2 1 1 0 0
Hi I would really appreciate some help for this, I really couldn't find the solution in previous questions.
I have a tibble in long format (rows grouped by id and arranged by time).
I want to create a variable "eleg" based on "varx". The condition would be that "eleg" = 1 if "varx" in the previous 3 rows == 0 and in the current row varx == 1, if not = 0, for each ID. If possible using dplyr.
id <- c(1,1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,3)
time <- c(1,2,3,4,5,6,7,1,2,3,4,5,6,1,2,3,4)
varx <- c(0,0,0,0,1,1,0,0,1,1,1,1,1,0,0,0,1)
eleg <- c(0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1)
table <- data.frame(id, time, varx, eleg)
In my real dataset the condition is "in the previous 24 rows" and the same ID could have eleg == 1 more than one time if it suits the condition.
Thank you.
One approach could be
library(dplyr)
m <- 3 # number of previous rows to look back at
df %>%
group_by(id) %>%
mutate(eleg = ifelse(rowSums(sapply(1:m, function(k) lag(varx, n = k, order_by = time, default = 1) == 0)) == m & varx == 1,
1,
0)) %>%
data.frame()
which gives
id time varx eleg
1 1 1 0 0
2 1 2 0 0
3 1 3 0 0
4 1 4 0 0
5 1 5 1 1
6 1 6 1 0
7 1 7 0 0
8 2 1 0 0
9 2 2 1 0
10 2 3 1 0
11 2 4 1 0
12 2 5 1 0
13 2 6 1 0
14 3 1 0 0
15 3 2 0 0
16 3 3 0 0
17 3 4 1 1
Sample data:
df <- structure(list(id = c(1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2,
3, 3, 3, 3), time = c(1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6,
1, 2, 3, 4), varx = c(0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1,
0, 0, 0, 1)), .Names = c("id", "time", "varx"), row.names = c(NA,
-17L), class = "data.frame")
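library(dplyr)       # for %>%, group_by() and mutate()
library(data.table)  # for shift()
df %>%
  group_by(id) %>%   # look back within each id only
  mutate(elegnew = ifelse(Reduce("+", shift(varx, 1:3, fill = 1)) == 0 & varx == 1, 1, 0))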
id time varx eleg elegnew
1 1 1 0 0 0
2 1 2 0 0 0
3 1 3 0 0 0
4 1 4 0 0 0
5 1 5 1 1 1
6 1 6 1 0 0
7 1 7 0 0 0
8 2 1 0 0 0
9 2 2 1 0 0
10 2 3 1 0 0
11 2 4 1 0 0
12 2 5 1 0 0
13 2 6 1 0 0
14 3 1 0 0 0
15 3 2 0 0 0
16 3 3 0 0 0
17 3 4 1 1 1
Here's another approach, using dplyr and zoo:
library(dplyr)
library(zoo)
df %>%
group_by(id) %>%
mutate(elegnew = as.integer(varx == 1 &
rollsum(varx == 1, k = 4, align = "right", fill = 0) == 1))
# # A tibble: 17 x 5
# # Groups: id [3]
# id time varx eleg elegnew
# <dbl> <dbl> <dbl> <dbl> <int>
# 1 1. 1. 0. 0. 0
# 2 1. 2. 0. 0. 0
# 3 1. 3. 0. 0. 0
# 4 1. 4. 0. 0. 0
# 5 1. 5. 1. 1. 1
# 6 1. 6. 1. 0. 0
# 7 1. 7. 0. 0. 0
# 8 2. 1. 0. 0. 0
# 9 2. 2. 1. 0. 0
# 10 2. 3. 1. 0. 0
# 11 2. 4. 1. 0. 0
# 12 2. 5. 1. 0. 0
# 13 2. 6. 1. 0. 0
# 14 3. 1. 0. 0. 0
# 15 3. 2. 0. 0. 0
# 16 3. 3. 0. 0. 0
# 17 3. 4. 1. 1. 1
The idea is to group by id and then check a) whether varx is 1 and b) whether the sum of varx=1 events in the previous 3 plus current row (k=4) is 1 (which means all previous 3 must be 0). I assume that varx is either 0 or 1.
You have asked for a dplyr solution, preferably.
The following is a base R one, with a function that you can adapt to "in the previous 24 rows", just pass n = 24 to the function.
fun <- function(DF, crit = "varx", new = "eleg", n = 3){
DF[[new]] <- 0
for(i in seq_len(nrow(DF))[-seq_len(n)]){
if(all(DF[[crit]][(i - n):(i - 1)] == 0) && DF[[crit]][i] == 1)
DF[[new]][i] <- 1
}
DF
}
sp <- split(table[-4], table[-4]$id)
new_df <- do.call(rbind, lapply(sp, fun))
row.names(new_df) <- NULL
identical(table, new_df)
#[1] TRUE
Note that if you are creating a new column, eleg, you would probably not need to split table[-4], just table since the 4th column wouldn't exist yet.
You could do do.call(rbind, lapply(sp, fun, n = 24)) and the rest would be the same.
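For a long data set and n = 24, the row-by-row loop can be replaced by a cumulative-sum window. The following is a sketch of that alternative, not the answer's own method. It assumes varx only contains 0 and 1 with no NA, and the names prev_all_zero, dat and eleg_new are made up for illustration.

```r
id   <- c(1,1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,3)
time <- c(1,2,3,4,5,6,7,1,2,3,4,5,6,1,2,3,4)
varx <- c(0,0,0,0,1,1,0,0,1,1,1,1,1,0,0,0,1)
eleg <- c(0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1)
dat  <- data.frame(id, time, varx)

# TRUE where the n rows before the current one exist and are all 0.
prev_all_zero <- function(x, n) {
  cs  <- c(0, cumsum(x))          # cs[i] = sum of x[1 .. i-1]
  idx <- seq_along(x)
  lo  <- idx - n                  # first row of the look-back window
  ok  <- lo >= 1                  # a full window of n previous rows exists
  out <- rep(FALSE, length(x))
  out[ok] <- (cs[idx[ok]] - cs[lo[ok]]) == 0
  out
}

# Apply per id; the flag is 1 when the current row is 1 after n zeros.
dat$eleg_new <- ave(dat$varx, dat$id,
                    FUN = function(x) as.numeric(x == 1 & prev_all_zero(x, n = 3)))
dat
```

Switching to the real condition would only mean passing n = 24 instead of n = 3.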
I have the following sample data frame:
df <- data.frame("Alpha" = c(NA, NA, 6, 5, 4, 6, 5, 3), "Beta" = c(3, 3, 4, 2, 6, NA, NA, NA), "Gamma" =c(6, 2, 3, 1, NA, NA, 5, 4))
From this data, I would like to get a count of all values between 0 and 6 for each column. The data frame does not contain all values between 0 and 6, so the final output would look something like this:
result <- data.frame("value"=c(0, 1, 2, 3, 4, 5, 6),
"Alpha"=c(0, 0, 0, 1, 1, 2, 2),
"Beta"=c(0, 0, 1, 2, 1, 0, 1),
"Gamma"=c(0, 1, 1, 1, 1, 1, 1))
value Alpha Beta Gamma
0 0 0 0
1 0 0 1
2 0 1 1
3 1 2 1
4 1 1 1
5 2 0 1
6 2 1 1
My first inclination was to reiterate the distinct() function in dplyr. I was thinking of using something like this:
df.alpha <- df %>% distinct(Alpha)
df.beta <- df %>% distinct(Beta)
df.gamma <- df %>% distinct(Gamma)
Afterward, I would bind them together. However, I encounter three issues:
1. There's a lot of copying and pasting here (there are more columns in my real data frame);
2. The results do not have the same length, which makes binding difficult; and
3. "0" is not a value in the original table, so it does not get counted in the results.
I found a similar question in this stackoverflow post on counting a specific value in multiple columns at once. However, unlike that post, the issue I have here is that there is no variable to "group by".
Do folks have any suggestions on how I can produce a count of values between a range of integers for all columns? Thanks so much!
Maybe something like this:
> df[] <- lapply(df,function(x) factor(x,levels = 0:6))
> data.frame(lapply(df,tabulate))
Alpha Beta Gamma
1 0 0 0
2 0 0 1
3 0 1 1
4 1 2 1
5 1 1 1
6 2 0 1
7 2 1 1
A one-liner similar to joran's answer is
cbind.data.frame(values=0:6, sapply(df, function(x) table(factor(x, levels=0:6))))
this returns
values Alpha Beta Gamma
0 0 0 0 0
1 1 0 0 1
2 2 0 1 1
3 3 1 2 1
4 4 1 1 1
5 5 2 0 1
6 6 2 1 1
Replacing table with the tabulate function should speed up the result and also simplify the output.
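A sketch of what that tabulate variant might look like, assuming the values are whole numbers from 0 to 6. Since tabulate() counts the bins 1, 2, ..., nbins, the values are shifted by one so that 0 lands in the first bin; the names counts and res are illustrative.

```r
df <- data.frame("Alpha" = c(NA, NA, 6, 5, 4, 6, 5, 3),
                 "Beta"  = c(3, 3, 4, 2, 6, NA, NA, NA),
                 "Gamma" = c(6, 2, 3, 1, NA, NA, 5, 4))

# Shift 0..6 to bins 1..7; tabulate() silently skips NA values.
counts <- sapply(df, function(x) tabulate(x + 1, nbins = 7))

res <- data.frame(value = 0:6, counts)
res
```

Fixing nbins at 7 keeps every column the same length even when the largest value is missing from a column.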
Another idea with tidyverse:
library(dplyr)
library(purrr)
df %>%
mutate_all(factor, levels = 0:6) %>%
map_dfc(~ c(table(.))) %>%
cbind(values = 0:6, .)
Result:
values Alpha Beta Gamma
1 0 0 0 0
2 1 0 0 1
3 2 0 1 1
4 3 1 2 1
5 4 1 1 1
6 5 2 0 1
7 6 2 1 1