Disclaimer: I suspect there is a much more efficient solution (perhaps an anonymous function with a list, or the *apply functions?), which is why I am asking more experienced people for help!
The data
Let's say I have a df with participant responses to 3 question As and 3 question Bs e.g.
qa1, qa2, qa3, qb1, qb2, qb3
1, 3, 1, 2, 4, 4
1, 3, 2, 2, 1, 4
2, 3, 1, 2, 1, 4
1, 3, 2, 1, 1, 3
EDIT df also contains other columns with other irrelevant data!
I have a vector with correct answers to each of qa1-3 and qb1-3 in sequence with the columns.
correct_answer <- c(1,3,2,2,1,4)
(i.e. for qa1,qa2,qa3,qb1,qb2,qb3)
Desired manipulation
I want to create a new column per question (e.g. qa1_correct), coding for whether the participant has responded correctly (1) or incorrectly (0) based on matching each response in df with corresponding answer in correct_answer. Ideally I would end up with:
qa1, qa2, qa3, qb1, qb2, qb3, qa1_correct, qa2_correct, qa3_correct ...
1, 3, 1, 2, 4, 4, 1, 1, 0, ...
1, 3, 2, 2, 1, 4, 1, 1, 1, ...
2, 3, 1, 2, 1, 4, 0, 1, 0, ...
1, 3, 2, 1, 1, 3, 1, 1, 1, ...
Failed Attempt
This is my attempt for question As only (would repeat for Bs) but it doesn't work (maybe wrong function paste0()?):
index <- c(1:3)
for (i in index) {
df <- df %>% mutate(paste0("qa",i,"_correct") =
case_when(paste0("qa"i) == correct_answer[i] ~ 1,
paste0("qa"i) != correct_answer[i] ~ 0))
}
Many thanks for any guidance!
You can combine mutate and across.
Code 1: Correct_answer as vector
df %>%
mutate(across(everything(),
~as.numeric(.x == correct_answer[names(df) == cur_column()]),
.names = "{.col}_correct"))
Code 2: Correct_answer as data.frame (df_correct)
correct_answer <- c(1,3,2,2,1,4)
df_correct <- data.frame(
matrix(correct_answer, ncol = length(correct_answer))
)
colnames(df_correct) <- names(df)
df %>%
mutate(across(everything(),
.fn = ~as.numeric(.x == df_correct[,cur_column()]),
.names = "{.col}_correct"))
Output
qa1 qa2 qa3 qb1 qb2 qb3 qa1_correct qa2_correct qa3_correct qb1_correct qb2_correct qb3_correct
1 1 3 1 2 4 4 1 1 0 1 0 1
2 1 3 2 2 1 4 1 1 1 1 1 1
3 2 3 1 2 1 4 0 1 0 1 1 1
4 1 3 2 1 1 3 1 1 1 0 1 0
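Since the question's EDIT says df also holds unrelated columns, everything() would try to score those too. A minimal sketch of one way around that, assuming the answers are stored as a named vector (the names on correct_answer and the extra id column are my assumptions, not part of the original data):

```r
library(dplyr)

df <- data.frame(
  id  = c("p1", "p2", "p3", "p4"),   # hypothetical unrelated column
  qa1 = c(1, 1, 2, 1), qa2 = c(3, 3, 3, 3), qa3 = c(1, 2, 1, 2),
  qb1 = c(2, 2, 2, 1), qb2 = c(4, 1, 1, 1), qb3 = c(4, 4, 4, 3)
)

# Assumption: answers are named after the question columns they belong to
correct_answer <- c(qa1 = 1, qa2 = 3, qa3 = 2, qb1 = 2, qb2 = 1, qb3 = 4)

scored <- df %>%
  mutate(across(all_of(names(correct_answer)),            # only the question columns
                ~ as.integer(.x == correct_answer[[cur_column()]]),
                .names = "{.col}_correct"))
```

Looking answers up by name also means the result no longer depends on the column order in df.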
This may also be an alternative. It needs R 4.1.0 or later because it uses the \(x) lambda shorthand (the same release also gave apply a new simplify argument, which defaults to TRUE).
df <- read.table(header = T, text = 'qa1, qa2, qa3, qb1, qb2, qb3
1, 3, 1, 2, 4, 4
1, 3, 2, 2, 1, 4
2, 3, 1, 2, 1, 4
1, 3, 2, 1, 1, 3', sep = ',')
df
#> qa1 qa2 qa3 qb1 qb2 qb3
#> 1 1 3 1 2 4 4
#> 2 1 3 2 2 1 4
#> 3 2 3 1 2 1 4
#> 4 1 3 2 1 1 3
correct_answer <- c(1,3,2,2,1,4)
cbind(df,
setNames(as.data.frame(t(apply(df, 1,
\(x) +(x == correct_answer)))),
paste0(names(df), '_correct')))
#> qa1 qa2 qa3 qb1 qb2 qb3 qa1_correct qa2_correct qa3_correct qb1_correct
#> 1 1 3 1 2 4 4 1 1 0 1
#> 2 1 3 2 2 1 4 1 1 1 1
#> 3 2 3 1 2 1 4 0 1 0 1
#> 4 1 3 2 1 1 3 1 1 1 0
#> qb2_correct qb3_correct
#> 1 0 1
#> 2 1 1
#> 3 1 1
#> 4 1 0
Created on 2021-07-23 by the reprex package (v2.0.0)
You can also use the following solution in base R:
cbind(df,
do.call(cbind, mapply(function(x, y) as.data.frame({+(x == y)}),
df, correct_answer, SIMPLIFY = FALSE)) |>
setNames(paste0(names(df), "_corr")))
qa1 qa2 qa3 qb1 qb2 qb3 qa1_corr qa2_corr qa3_corr qb1_corr qb2_corr qb3_corr
1 1 3 1 2 4 4 1 1 0 1 0 1
2 1 3 2 2 1 4 1 1 1 1 1 1
3 2 3 1 2 1 4 0 1 0 1 1 1
4 1 3 2 1 1 3 1 1 1 0 1 0
Or a potential tidyverse solution could be:
library(dplyr) # for %>% and mutate()
library(tidyr)
library(purrr)
df %>%
mutate(output = pmap(df, ~ setNames(+(c(...) == correct_answer),
paste0(names(df), "_corr")))) %>%
unnest_wider(output)
qa1 qa2 qa3 qb1 qb2 qb3 qa1_corr qa2_corr qa3_corr qb1_corr qb2_corr qb3_corr
1 1 3 1 2 4 4 1 1 0 1 0 1
2 1 3 2 2 1 4 1 1 1 1 1 1
3 2 3 1 2 1 4 0 1 0 1 1 1
4 1 3 2 1 1 3 1 1 1 0 1 0
Try this:
df_new <- cbind(df, t(apply(df, 1, function(x) as.numeric(x == correct_answer))))
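Note that the one-liner above appends unnamed columns, because as.numeric() drops the names. A small sketch of a column-wise variant that keeps names, using sweep() instead of a row-wise apply (the _correct suffix follows the question; df is rebuilt here only so the snippet runs on its own):

```r
df <- data.frame(
  qa1 = c(1, 1, 2, 1), qa2 = c(3, 3, 3, 3), qa3 = c(1, 2, 1, 2),
  qb1 = c(2, 2, 2, 1), qb2 = c(4, 1, 1, 1), qb3 = c(4, 4, 4, 3)
)
correct_answer <- c(1, 3, 2, 2, 1, 4)

# MARGIN = 2 lines up one answer with each column
is_correct <- sweep(as.matrix(df), 2, correct_answer, FUN = "==")

flags <- as.data.frame(+is_correct)            # unary + turns TRUE/FALSE into 1/0
names(flags) <- paste0(names(df), "_correct")

df_new <- cbind(df, flags)
```

This assumes df contains only the question columns, in the same order as correct_answer.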
EDIT: it works once sym() is added.
I found a related solution in "Paste variable name in mutate (dplyr)", but on its own it only produced 0s:
for (i in index) {
df <- df %>% mutate( !!paste0("qa",i,"_correct") :=
case_when(!!sym(paste0("qa",i)) == correct_answer[i] ~ 1,
!!sym(paste0("qa",i)) != correct_answer[i] ~ 0))
}
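For context on why the first attempt only gave 0s: paste0("qa", i) == correct_answer[i] compares the string "qa1" with a number, which is always FALSE, so every row falls through to the 0 branch. A hedged sketch of the same loop written with the .data pronoun and a glue-style column name, which needs a reasonably recent dplyr/rlang (the three-column df here is only for illustration):

```r
library(dplyr)

df <- data.frame(qa1 = c(1, 1, 2, 1), qa2 = c(3, 3, 3, 3), qa3 = c(1, 2, 1, 2))
correct_answer <- c(1, 3, 2)

for (i in seq_along(correct_answer)) {
  col <- paste0("qa", i)   # name of the column to score
  df <- df %>%
    # .data[[col]] fetches the column by its name; "{col}_correct" builds the new name
    mutate("{col}_correct" := as.integer(.data[[col]] == correct_answer[i]))
}
```

This is equivalent in spirit to the sym() version; it just avoids unquoting with !!.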
I want to replace the run of identical values at the beginning of each trial with 0, but stop replacing as soon as the value changes and keep the remaining values as they are. This should happen for every trial of every subject.
For example, the first subject has multiple trials (1, 2, etc.). At the beginning of each trial there may be some consecutive rows with the same value (e.g., 1, 1, 1), and I would like to replace these with 0. However, once the value has changed from 1 to 0, I want to keep the values in the rest of the trial (e.g., 0, 0, 1).
subject <- c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1)
trial <- c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2)
value <- c(1, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1)
df <- data.frame(subject, trial, value)
Thus, from the original data frame, I would like to have a new variable (value_new) like below.
subject trial value value_new
1 1 1 1 0
2 1 1 1 0
3 1 1 1 0
4 1 1 0 0
5 1 1 0 0
6 1 1 1 1
7 1 2 1 0
8 1 2 1 0
9 1 2 0 0
10 1 2 1 1
11 1 2 1 1
12 1 2 1 1
I was thinking of using dplyr's group_by(subject, trial) and mutate to build the new variable with a conditional statement, but I have no idea how to do that. I guess I need rle(), but again I have no clue how to replace the leading consecutive values with 0, stop replacing once the value has changed, and keep the rest of the values.
Any suggestions or advice would be really appreciated!
You can use rleid from data.table :
library(data.table)
setDT(df)[, new_value := value * +(rleid(value) > 1), .(subject, trial)]
df
# subject trial value new_value
# 1: 1 1 1 0
# 2: 1 1 1 0
# 3: 1 1 1 0
# 4: 1 1 0 0
# 5: 1 1 0 0
# 6: 1 1 1 1
# 7: 1 2 1 0
# 8: 1 2 1 0
# 9: 1 2 0 0
#10: 1 2 1 1
#11: 1 2 1 1
#12: 1 2 1 1
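To see why this works, here is a tiny sketch on the first trial only: rleid() numbers the runs of identical values, so the first run is always run 1 and everything after the first change has a run id above 1.

```r
library(data.table)

value <- c(1, 1, 1, 0, 0, 1)       # first trial from the question

run_id <- rleid(value)             # run number for each element
keep   <- +(run_id > 1)            # 0 for the leading run, 1 afterwards
value_new <- value * keep          # leading run zeroed, rest unchanged
```

Multiplying by keep leaves later values untouched, which is what stops the replacement once the value has changed.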
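You can also do this with dplyr, still borrowing rleid from data.table:
library(dplyr)
library(data.table) # rleid() comes from data.table, not dplyr
df %>%
  group_by(subject, trial) %>%
  mutate(new_value = value * +(rleid(value) > 1))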
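If you would rather not load data.table just for rleid, a possible dplyr-only sketch uses cumany(), which turns TRUE at the first row that differs from the trial's first value and stays TRUE from there on. The column name value_new follows the question:

```r
library(dplyr)

df <- data.frame(
  subject = rep(1, 12),
  trial   = rep(1:2, each = 6),
  value   = c(1, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1)
)

out <- df %>%
  group_by(subject, trial) %>%
  # FALSE during the leading run, TRUE from the first change onwards
  mutate(value_new = value * cumany(value != first(value))) %>%
  ungroup()
```

This assumes value has no NAs; a missing value inside the leading run would need separate handling.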
As a novice in R, I am having trouble writing the appropriate code (I would guess that it needs if/else statements and a loop).
In concrete terms, I would like to compare two pieces of information (see simplified example, because my actual database is rather long): "Monthly_category" and "Ref_category". The "Ref_category" to be taken into consideration is calculated only at the 5th period for each element (because then we move to the next element), thanks to the mode formula, for each element (Element_id).
Months Element_Id Monthly_Category Ref_Category Expected_output
1 1 3 NA 0
2 1 2 NA 0
3 1 2 NA 1
4 1 1 NA 1
5 1 3 3 0
1 2 6 2 0
2 2 6 6 1
3 2 NA 1 0
4 2 NA 6 0
5 2 1 1 0
More precisely, I would like to put 1 as soon as the "Monthly_category" differs 2 periods in a row from the selected "Ref_category" which is calculated every 5 observations. Otherwise, set 0.
In addition, I would like the rows where Monthly_category is NA to give 0 directly, because in the end I will only take into account the rows where I have 1s (the NA rows don't interest me).
For each element (1 element = 5 lines), the reference category is calculated at the end of the 5 periods using the mode. However, by stretching the formula, we have values in each line while I have to consider each time only the last value (so every 5 lines). That's why I thought we needed 2 loops: one to check each line for the monthly category and one to check the reference category every 5 lines.
Do you have any idea of the code that could allow me to do this?
A very big thank you if someone can enlighten me,
Vanie
First of all, please have a look at the questions that @John Coleman and I asked you in the comments, because my solution may change depending on your answers.
Anyway, you don't need an explicit for loop or an explicit if/else to get the job done.
In R, you usually avoid writing for loops directly and use a functional such as lapply instead. In this case the dplyr package takes care of the implicit looping.
df <- tibble::tribble(~Months, ~Element_Id, ~Monthly_Category, ~Ref_Category, ~Expected_output,
1 , 1, 3, NA, 0,
2 , 1, 2, NA, 0,
3 , 1, 2, NA, 1,
4 , 1, 1, NA, 1,
5 , 1, 3, 3, 0,
1 , 2, 6, 2, 0,
2 , 2, 6, 6, 1,
3 , 2, 1, 1, 0,
4 , 2, 1, 6, 0,
5 , 2, 1, 1, 0)
library(dplyr)
library(purrr)
df %>%
# check if elements are equal
mutate(Real_Expected_output = !map2_lgl(Monthly_Category, Ref_Category, identical)) %>%
# sort by Element_Id and Months just in case your data is messy
arrange(Element_Id, Months) %>%
# For each Element_Id ...
group_by(Element_Id) %>%
# ... define your Expected Output
mutate(Real_Expected_output = as.integer(lag(Real_Expected_output, default = FALSE) &
lag(Real_Expected_output, 2, default = FALSE))) %>%
ungroup()
# Months Element_Id Monthly_Category Ref_Category Expected_output Real_Expected_output
# <dbl> <dbl> <dbl> <dbl> <dbl> <int>
# 1 1 3 NA 0 0
# 2 1 2 NA 0 0
# 3 1 2 NA 1 1
# 4 1 1 NA 1 1
# 5 1 3 3 0 1
# 1 2 6 2 0 0
# 2 2 6 6 1 0
# 3 2 1 1 0 0
# 4 2 1 6 0 0
# 5 2 1 1 0 0
Real_Expected_output is not the same as your Expected_output because I believe your expected result contradicts your written requirements, as I said in one of the comments.
EDIT:
Based on your comment, I suppose this is what you're looking for.
Again: no loops. You just need to make good use of the tools that dplyr already provides, i.e. last, group_by and mutate.
df %>%
# sort by Element_Id and Months just in case your data is messy
arrange(Element_Id, Months) %>%
# For each Element_Id ...
group_by(Element_Id) %>%
# ... check if Monthly Category is equal to the last Ref_Category
mutate(Real_Expected_output = !map2_lgl(Monthly_Category, last(Ref_Category), identical)) %>%
# ... and define your Expected Output
mutate(Real_Expected_output = as.integer(Real_Expected_output &
lag(Real_Expected_output, default = FALSE))) %>%
ungroup()
# Months Element_Id Monthly_Category Ref_Category Expected_output Real_Expected_output
# <dbl> <dbl> <dbl> <dbl> <dbl> <int>
# 1 1 3 NA 0 0
# 2 1 2 NA 0 0
# 3 1 2 NA 1 1
# 4 1 1 NA 1 1
# 5 1 3 3 0 0
# 1 2 6 2 0 0
# 2 2 6 6 1 1
# 3 2 1 1 0 0
# 4 2 1 6 0 0
# 5 2 1 1 0 0
EDIT 2:
I'll edit it again based on your request. At this point I'd suggest creating a separate function to handle your problem. It looks cleaner.
df <- tibble::tribble(~Months, ~Element_Id, ~Monthly_Category, ~Ref_Category, ~Expected_output,
1 , 1, 3, NA, 0,
2 , 1, 2, NA, 0,
3 , 1, 2, NA, 1,
4 , 1, 1, NA, 1,
5 , 1, 3, 3, 0,
1 , 2, 6, 2, 0,
2 , 2, 6, 6, 1,
3 , 2, NA, 1, 0,
4 , 2, NA, 6, 0,
5 , 2, 1, 1, 0)
library(dplyr)
library(purrr)
get_output <- function(mon, ref){
# set here your condition
exp <- !is.na(mon) & !map2_lgl(mon, last(ref), identical)
# check exp and lag(exp), then convert to integer
as.integer(exp & lag(exp, default = FALSE))
}
df %>%
# sort by Element_Id and Months just in case your data is messy
arrange(Element_Id, Months) %>%
# For each Element_Id ...
group_by(Element_Id) %>%
# ... launch your function
mutate(Real_Expected_output = get_output(Monthly_Category, Ref_Category)) %>%
ungroup()
# # A tibble: 10 x 6
# Months Element_Id Monthly_Category Ref_Category Expected_output Real_Expected_output
# <dbl> <dbl> <dbl> <dbl> <dbl> <int>
# 1 1 1 3 NA 0 0
# 2 2 1 2 NA 0 0
# 3 3 1 2 NA 1 1
# 4 4 1 1 NA 1 1
# 5 5 1 3 3 0 0
# 6 1 2 6 2 0 0
# 7 2 2 6 6 1 1
# 8 3 2 NA 1 0 0
# 9 4 2 NA 6 0 0
# 10 5 2 1 1 0 0
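For what it's worth, the same rule can probably be written without purrr, since != is already vectorised. This is only a sketch: get_output_vec is a hypothetical name, and it assumes the last Ref_Category of each element is never NA (with an NA reference the comparison would return NA rather than FALSE).

```r
library(dplyr)

df <- data.frame(
  Months           = rep(1:5, 2),
  Element_Id       = rep(1:2, each = 5),
  Monthly_Category = c(3, 2, 2, 1, 3, 6, 6, NA, NA, 1),
  Ref_Category     = c(NA, NA, NA, NA, 3, 2, 6, 1, 6, 1)
)

get_output_vec <- function(mon, ref) {
  ref_final <- last(ref)                       # reference from the 5th period
  differs <- !is.na(mon) & mon != ref_final    # NA months never count as differing
  # 1 only when this month and the previous one both differ
  as.integer(differs & lag(differs, default = FALSE))
}

res <- df %>%
  arrange(Element_Id, Months) %>%
  group_by(Element_Id) %>%
  mutate(Real_Expected_output = get_output_vec(Monthly_Category, Ref_Category)) %>%
  ungroup()
```

One difference from the identical() version: == treats 3 and 3L as equal, whereas identical() does not.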
Hi I would really appreciate some help for this, I really couldn't find the solution in previous questions.
I have a tibble in long format (rows grouped by id and arranged by time).
I want to create a variable "eleg" based on "varx". The condition would be that "eleg" = 1 if "varx" in the previous 3 rows == 0 and in the current row varx == 1, if not = 0, for each ID. If possible using dplyr.
id <- c(1,1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,3)
time <- c(1,2,3,4,5,6,7,1,2,3,4,5,6,1,2,3,4)
varx <- c(0,0,0,0,1,1,0,0,1,1,1,1,1,0,0,0,1)
eleg <- c(0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1)
table <- data.frame(id, time, varx, eleg)
In my real dataset the condition is "in the previous 24 rows" and the same ID could have eleg == 1 more than one time if it suits the condition.
Thank you.
One approach could be
library(dplyr)
m <- 3 # number of previous rows to look back
df %>%
  group_by(id) %>%
  # default = 1 makes rows with fewer than m predecessors ineligible
  mutate(eleg = ifelse(rowSums(sapply(1:m, function(k) lag(varx, n = k, order_by = time, default = 1) == 0)) == m & varx == 1,
                       1,
                       0)) %>%
  data.frame()
which gives
id time varx eleg
1 1 1 0 0
2 1 2 0 0
3 1 3 0 0
4 1 4 0 0
5 1 5 1 1
6 1 6 1 0
7 1 7 0 0
8 2 1 0 0
9 2 2 1 0
10 2 3 1 0
11 2 4 1 0
12 2 5 1 0
13 2 6 1 0
14 3 1 0 0
15 3 2 0 0
16 3 3 0 0
17 3 4 1 1
Sample data:
df <- structure(list(id = c(1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2,
3, 3, 3, 3), time = c(1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6,
1, 2, 3, 4), varx = c(0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1,
0, 0, 0, 1)), .Names = c("id", "time", "varx"), row.names = c(NA,
-17L), class = "data.frame")
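library(data.table)
library(dplyr)
# group by id so the look-back never crosses ids; fill = 1 keeps rows with a short history at 0
df %>%
  group_by(id) %>%
  mutate(elegnew = ifelse(Reduce("+", shift(varx, 1:3, fill = 1)) == 0 & varx == 1, 1, 0))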
id time varx eleg elegnew
1 1 1 0 0 0
2 1 2 0 0 0
3 1 3 0 0 0
4 1 4 0 0 0
5 1 5 1 1 1
6 1 6 1 0 0
7 1 7 0 0 0
8 2 1 0 0 0
9 2 2 1 0 0
10 2 3 1 0 0
11 2 4 1 0 0
12 2 5 1 0 0
13 2 6 1 0 0
14 3 1 0 0 0
15 3 2 0 0 0
16 3 3 0 0 0
17 3 4 1 1 1
Here's another approach, using dplyr and zoo:
library(dplyr)
library(zoo)
df %>%
group_by(id) %>%
mutate(elegnew = as.integer(varx == 1 &
rollsum(varx == 1, k = 4, align = "right", fill = 0) == 1))
# # A tibble: 17 x 5
# # Groups: id [3]
# id time varx eleg elegnew
# <dbl> <dbl> <dbl> <dbl> <int>
# 1 1. 1. 0. 0. 0
# 2 1. 2. 0. 0. 0
# 3 1. 3. 0. 0. 0
# 4 1. 4. 0. 0. 0
# 5 1. 5. 1. 1. 1
# 6 1. 6. 1. 0. 0
# 7 1. 7. 0. 0. 0
# 8 2. 1. 0. 0. 0
# 9 2. 2. 1. 0. 0
# 10 2. 3. 1. 0. 0
# 11 2. 4. 1. 0. 0
# 12 2. 5. 1. 0. 0
# 13 2. 6. 1. 0. 0
# 14 3. 1. 0. 0. 0
# 15 3. 2. 0. 0. 0
# 16 3. 3. 0. 0. 0
# 17 3. 4. 1. 1. 1
The idea is to group by id and then check a) whether varx is 1 and b) whether the sum of varx=1 events in the previous 3 plus current row (k=4) is 1 (which means all previous 3 must be 0). I assume that varx is either 0 or 1.
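The same windowed-sum idea can be sketched in base R with a cumulative sum, which may scale better to the "previous 24 rows" case than building 24 lags. flag_onset is a hypothetical helper name, and like the answer above it assumes varx is only ever 0 or 1 with no NAs:

```r
flag_onset <- function(x, n = 3) {
  i   <- seq_along(x)
  cs0 <- c(0, cumsum(x == 1))          # cs0[j + 1] = number of 1s among x[1..j]
  has_history <- i > n                 # rows with at least n earlier rows
  ones_before <- rep(NA_real_, length(x))
  # number of 1s in the n rows immediately before row i
  ones_before[has_history] <- cs0[i[has_history]] - cs0[i[has_history] - n]
  as.integer(has_history & x == 1 & ones_before %in% 0)
}

df <- data.frame(
  id   = c(1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3),
  time = c(1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6, 1, 2, 3, 4),
  varx = c(0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1)
)

# ave() applies the helper separately within each id
df$eleg <- ave(df$varx, df$id, FUN = function(v) flag_onset(v, n = 3))
```

For the real data, passing n = 24 should be the only change needed, provided rows are already sorted by time within id.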
You have asked for a dplyr solution, preferably.
The following is a base R one, with a function that you can adapt to "in the previous 24 rows", just pass n = 24 to the function.
fun <- function(DF, crit = "varx", new = "eleg", n = 3){
DF[[new]] <- 0
for(i in seq_len(nrow(DF))[-seq_len(n)]){
if(all(DF[[crit]][(i - n):(i - 1)] == 0) && DF[[crit]][i] == 1)
DF[[new]][i] <- 1
}
DF
}
sp <- split(table[-4], table[-4]$id)
new_df <- do.call(rbind, lapply(sp, fun))
row.names(new_df) <- NULL
identical(table, new_df)
#[1] TRUE
Note that if you are creating a new column, eleg, you would probably not need to split table[-4], just table since the 4th column wouldn't exist yet.
You could do do.call(rbind, lapply(sp, fun, n = 24)) and the rest would be the same.
I am using panel data with multiple subjects (id) and have an event (first_occurrence) that occurs on different days. My goal is to create a new variable (result) that is 1 on the 2 days preceding the first occurrence, the day of the first occurrence, and the 2 days following the first occurrence.
Here is an example that includes both the sample data and the desired output:
data <- structure(list(id = c(1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2,
2, 3, 3, 3, 3, 3, 3, 3), day = c(0, 1, 2, 3, 4, 5, 6, 7, 0, 1,
2, 3, 4, 5, 0, 1, 2, 3, 4, 5, 6), first_occurrence = c(0, 0,
1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1), desired_output = c(1,
1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1)), .Names = c("id",
"day", "first_occurrence", "desired_output"), row.names = c(NA,
-21L), class = "data.frame")
Although this may not be the most efficient solution, I managed to get the code working in Stata (please see below for Stata code), although I would like to get it working in R as well and would appreciate any thoughts folks have.
Thanks!
Stata code:
tsset id day
gen run = .
by id: replace run = cond(L.run == ., 1, L.run + 1)
gen test = .
replace test = run if(first_occurrence == 1)
gen test2 = .
by id: replace test2 = test[_n-1]
gen test3 = .
by id: replace test3 = test[_n-2]
gen test4 = .
by id: replace test4 = test[_n+1]
gen test5 = .
by id: replace test5 = test[_n+2]
egen test_sum = rowtotal(test test2 test3 test4 test5)
replace test_sum = 1 if(test_sum >= 1)
rename test_sum result
drop run test test2 test3 test4 test5
Here's another approach using the package dplyr:
library(dplyr) # load the package (install it first if needed)
data %>%
  arrange(id, day) %>% # sort by id and day; drop this line if the data is already sorted
  group_by(id) %>%
  mutate(n = 1:n(),
         result = ifelse(abs(n - n[first_occurrence == 1]) <= 2, 1, 0)) %>%
  select(-n)
# id day first_occurrence desired_output result
#1 1 0 0 1 1
#2 1 1 0 1 1
#3 1 2 1 1 1
#4 1 3 0 1 1
#5 1 4 0 1 1
#6 1 5 0 0 0
#7 1 6 0 0 0
#8 1 7 0 0 0
#9 2 0 0 0 0
#10 2 1 0 0 0
#11 2 2 0 1 1
#12 2 3 0 1 1
#13 2 4 1 1 1
#14 2 5 0 1 1
#15 3 0 0 0 0
#16 3 1 0 0 0
#17 3 2 0 0 0
#18 3 3 0 0 0
#19 3 4 0 1 1
#20 3 5 0 1 1
#21 3 6 1 1 1
What the code does is, first group by id and then it will add another column (n) where it counts the rows per group from 1 to the number of rows per group. Then it creates another column result with an ifelse that will check the absolute difference between the current n (for each row) and the n where first_occurrence is 1. If that difference is equal to or less than 2, result will be 1 otherwise 0. The last line removes the column n.
Edit:
It would probably be more efficient to place the mutate(n = 1:n()) before the group_by:
data %>%
  arrange(id, day) %>% # sort by id and day; drop this line if the data is already sorted
  mutate(n = 1:n()) %>%
  group_by(id) %>%
  mutate(result = ifelse(abs(n - n[first_occurrence == 1]) <= 2, 1, 0)) %>%
  select(-n)
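Two things the row-counter version relies on are that every id has exactly one first_occurrence and that there are no gaps in day. A hedged sketch that measures the window in day itself and returns 0 for ids without any event (the extra id 4 and the event_day helper column are my additions for illustration; whether a gap in days should shrink the window is an assumption you would need to confirm):

```r
library(dplyr)

data <- data.frame(
  id  = c(rep(1, 8), rep(2, 6), rep(3, 7), rep(4, 3)),   # id 4: hypothetical id with no event
  day = as.numeric(c(0:7, 0:5, 0:6, 0:2)),
  first_occurrence = c(0, 0, 1, 0, 0, 0, 0, 0,
                       0, 0, 0, 0, 1, 0,
                       0, 0, 0, 0, 0, 0, 1,
                       0, 0, 0)
)

res <- data %>%
  group_by(id) %>%
  mutate(
    # day of the event for this id, or NA when the id never has one
    event_day = if (any(first_occurrence == 1)) day[first_occurrence == 1][1] else NA_real_,
    result    = as.integer(!is.na(event_day) & abs(day - event_day) <= 2)
  ) %>%
  ungroup() %>%
  select(-event_day)
```

On the question's sample data (ids 1 to 3) this gives the same values as desired_output.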
Here's one way. You can use ave to look by group, and then you can use which.max to find the first occurrence and then calculate the distance from that value for all the other values
close<-(with(data, ave(first_occurrence, id, FUN=function(x)
abs(seq_along(x)-which.max(x)))
)<=2)+0
Here I use +0 to turn the logical values into 0/1 values. Now you can combine that with your existing data
cbind(data, close)
And that gives
id day first_occurrence desired_output close
1 1 0 0 1 1
2 1 1 0 1 1
3 1 2 1 1 1
4 1 3 0 1 1
5 1 4 0 1 1
6 1 5 0 0 0
7 1 6 0 0 0
8 1 7 0 0 0
9 2 0 0 0 0
10 2 1 0 0 0
11 2 2 0 1 1
12 2 3 0 1 1
13 2 4 1 1 1
14 2 5 0 1 1
15 3 0 0 0 0
16 3 1 0 0 0
17 3 2 0 0 0
18 3 3 0 0 0
19 3 4 0 1 1
20 3 5 0 1 1
21 3 6 1 1 1
as desired. Note that this method assumes that the data is sorted by day.