Mutate a column in data.table with ifelse and group by - r

I have some dplyr code I'm moving to data.table; this is a problem I just ran into. I want the difference from one row to the next in b stored in column c when a is greater than or equal to 3. However, after running this code:
library(data.table)
df = data.frame(a = c(1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3),
                b = c(0, 1, 0, 1, 0, 1, 1, 0, 3, 4, 5))
setDT(df)
df[ , c := ifelse(a >= 3, c(0, diff(b)), b), by = .(a)]
all the elements in c are 0. Why is this?
df
a b c
1: 1 0 0
2: 1 1 0
3: 1 0 0
4: 1 1 0
5: 2 0 0
6: 2 1 0
7: 2 1 0
8: 3 0 0
9: 3 3 0
10: 3 4 0
11: 3 5 0
What I thought was the equivalent dplyr:
library(dplyr)
df = data.frame(a = c(1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3),
                b = c(0, 1, 0, 1, 0, 1, 1, 0, 3, 4, 5))
df %>%
  group_by(a) %>%
  mutate(c = ifelse(a >= 3, c(0, diff(b)), b))

From the help for ifelse(test, yes, no), it should return...
A vector of the same length and attributes (including dimensions and "class") as test and data values from the values of yes or no. The mode of the answer will be coerced from logical to accommodate first any values taken from yes and then any values taken from no.
However:
> df %>% group_by(a) %>% do(print(.$a))
[1] 1 1 1 1
[1] 2 2 2
[1] 3 3 3 3
> data.table(df)[, print(a), by=a]
[1] 1
[1] 2
[1] 3
As explained in the help pages, since the first argument has a length of one, if you pass vectors for the other parts, only their first element is used:
> ifelse(TRUE, 1:10, eleventy + million)
[1] 1
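For comparison, with a longer test the result tracks its length, taking elements positionally from yes or no (a small illustrative aside, not from the original post):
> ifelse(c(TRUE, FALSE, TRUE), 1:10, -(1:10))
[1]  1 -2  3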
You should probably use if ... else ... instead, since the condition is a single (constant) value within each group, like...
> data.table(df)[, c := if (a >= 3) c(0, diff(b)) else b, by=a]
or even better, in this case you can assign to a subset:
> data.table(df)[a >= 3, c := c(0, diff(b)), by=a]
Regarding why a has length 1 for the data.table idiom, see its FAQ question "Inside each group, why are the group variables length-1?"
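A quick way to see that length-1 behaviour (a minimal sketch, assuming data.table is loaded and df is the data above) is to compare the lengths of the grouping column and an ordinary column inside each group:
> data.table(df)[, .(len_a = length(a), len_b = length(b)), by = a]
   a len_a len_b
1: 1     1     4
2: 2     1     3
3: 3     1     4
Because a has length 1 inside each group, ifelse(a >= 3, ...) returns a single value, which is then recycled across the whole group.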

To illustrate better, I am creating a dataset with non-zero values for b as the first element of each group of a. In your dataset the first value of b in each group was zero, and c(0, diff(b)) also starts with zero, so it was hard to tell the two apart.
What happens here is that the output of ifelse is a vector of length 1 (because the grouping variable a has length 1 inside each group).
library(data.table)
df = data.frame(a = c(1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3),
b = c(10, 1, 0, 1, 0, 1, 1, 0, 3, 4, 5))
Look below:
setDT(df)[ , c := ifelse(a >= 3, c(0, diff(b)), b), by = .(a)][]
#> a b c
#> 1: 1 10 10
#> 2: 1 1 10
#> 3: 1 0 10
#> 4: 1 1 10
#> 5: 2 0 0
#> 6: 2 1 0
#> 7: 2 1 0
#> 8: 3 0 0
#> 9: 3 3 0
#> 10: 3 4 0
#> 11: 3 5 0
Now let's look at another example; here I am using a simple vector of length 4 instead of c(0, diff(b)):
setDT(df)[ , c := ifelse(a >= 3L, c(20,2,3,4), -999), by=a][]
#> a b c
#> 1: 1 10 -999
#> 2: 1 1 -999
#> 3: 1 0 -999
#> 4: 1 1 -999
#> 5: 2 0 -999
#> 6: 2 1 -999
#> 7: 2 1 -999
#> 8: 3 0 20
#> 9: 3 3 20
#> 10: 3 4 20
#> 11: 3 5 20
You can see that, again, only the first element gets assigned to all the rows of c within each group of a.
A work-around is to use diff on a to detect when it is not changing (i.e. diff(a) == 0) and use that as a pseudo-grouping together with the other condition, like below:
setDT(df)[, c := ifelse(a >= 3 & c(F,diff(a)==0), c(0,diff(b)), b)][]
#> a b c
#> 1: 1 10 10
#> 2: 1 1 1
#> 3: 1 0 0
#> 4: 1 1 1
#> 5: 2 0 0
#> 6: 2 1 1
#> 7: 2 1 1
#> 8: 3 0 0
#> 9: 3 3 3
#> 10: 3 4 1
#> 11: 3 5 1
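For comparison, the grouped if ... else ... idiom from the first answer gives the same result on this modified data without the pseudo-grouping trick (a small sketch, not part of the original post):
setDT(df)[, c := if (a >= 3) c(0, diff(b)) else b, by = a]
# result: same c column as above (10, 1, 0, 1, 0, 1, 1, 0, 3, 1, 1)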

Related

How to find the most commonly occurring combinations of Boolean variables by row in R

I have a series of 14 Boolean variables and I would like to find the top 3 combinations of 3 or more variables (where the value == 1).
Sample data:
df <- data.frame(ID = c(1, 2, 3, 4, 5, 6, 7, 8),
                 var1 = c(0, 0, 1, 1, 1, 0, 0, 1),
                 var2 = c(1, 0, 0, 1, 1, 1, 1, 0),
                 var3 = c(0, 0, 1, 1, 1, 1, 0, 0),
                 var4 = c(1, 1, 1, 1, 1, 0, 1, 1),
                 var5 = c(0, 0, 0, 1, 1, 0, 1, 1))
df
> df
ID var1 var2 var3 var4 var5
1 1 0 1 0 1 0
2 2 0 0 0 1 0
3 3 1 0 1 1 0
4 4 1 1 1 1 1
5 5 1 1 1 1 1
6 6 0 1 1 0 0
7 7 0 1 0 1 1
8 8 1 0 0 1 1
I found a solution to bring all the column names together per unique occurrence:
library(reshape2)  # for melt()
library(plyr)      # for ddply()
library(dplyr)     # for %>%

# Bring to long format
df_long <- df %>%
  melt(id.vars = "ID")
# Collapse the variables that have a '1' together per row
df_combo <- ddply(df_long, "ID", summarize,
                  combos = paste(variable[value == 1], collapse = "/"))
> df_combo
ID combos
1 1 var2/var4
2 2 var4
3 3 var1/var3/var4
4 4 var1/var2/var3/var4/var5
5 5 var1/var2/var3/var4/var5
6 6 var2/var3
7 7 var2/var4/var5
8 8 var1/var4/var5
If I only wanted counts of unique combinations this would be fine, but I would like to know the number of times each combination of 3 or more variables occurs, even when other variables also occur alongside it. The combination var1/var4/var5 occurs 3 times in the example above, but in two of those cases it appears alongside two other variables.
There must be an easy way to extract this information, I just can't think of it. Thank you for your help!
An attempt, using combn as the workhorse function.
# row/column indices of every 1 in the Boolean columns
arr <- which(df[-1] == 1, arr.ind=TRUE)
# for each row with 3 or more 1s, list every combination of 3 columns
tmp <- tapply(arr[,"col"], arr[,"row"],
              FUN=function(x) if (length(x) >= 3) combn(x, 3, simplify=FALSE))
tmp <- data.frame(do.call(rbind, unlist(tmp, rec=FALSE)))
# count how often each 3-column combination occurs
aggregate(count ~ . , cbind(tmp, count=1), sum)
## X1 X2 X3 count
##1 1 2 3 2
##2 1 2 4 2
##3 1 3 4 3
##4 2 3 4 2
##5 1 2 5 2
##6 1 3 5 2
##7 2 3 5 2
##8 1 4 5 3
##9 2 4 5 3
##10 3 4 5 2
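If you'd rather see the combinations by variable name and sorted by frequency, a possible follow-up (an assumption, not part of the original answer; res and combo are just illustrative names):
res <- aggregate(count ~ . , cbind(tmp, count = 1), sum)
# map the column indices back to var1..var5 via names(df)[-1]
res$combo <- apply(res[1:3], 1, function(i) paste(names(df)[-1][i], collapse = "/"))
res[order(-res$count), c("combo", "count")]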

Using R dplyr::mutate() with a for loop and dynamic variables

Disclaimer: I think there is a much more efficient solution (perhaps an anonymous function with a list, or the *apply functions?), which is why I have come to you much more experienced people for help!
The data
Let's say I have a df with participant responses to 3 question As and 3 question Bs e.g.
qa1, qa2, qa3, qb1, qb2, qb3
1, 3, 1, 2, 4, 4
1, 3, 2, 2, 1, 4
2, 3, 1, 2, 1, 4
1, 3, 2, 1, 1, 3
EDIT: df also contains other columns with other, irrelevant data!
I have a vector with correct answers to each of qa1-3 and qb1-3 in sequence with the columns.
correct_answer <- c(1,3,2,2,1,4)
(i.e. for qa1,qa2,qa3,qb1,qb2,qb3)
Desired manipulation
I want to create a new column per question (e.g. qa1_correct), coding for whether the participant has responded correctly (1) or incorrectly (0) based on matching each response in df with corresponding answer in correct_answer. Ideally I would end up with:
qa1, qa2, qa3, qb1, qb2, qb3, qa1_correct, qa2_correct, qa3_correct ...
1, 3, 1, 2, 4, 4, 1, 1, 0, ...
1, 3, 2, 2, 1, 4, 1, 1, 1, ...
2, 3, 1, 2, 1, 4, 0, 1, 0, ...
1, 3, 2, 1, 1, 3, 1, 1, 1, ...
Failed Attempt
This is my attempt for the question As only (I would repeat it for the Bs), but it doesn't work (maybe paste0() is the wrong function?):
index <- c(1:3)
for (i in index) {
df <- df %>% mutate(paste0("qa",i,"_correct") =
case_when(paste0("qa"i) == correct_answer[i] ~ 1,
paste0("qa"i) != correct_answer[i] ~ 0))
}
Many thanks for any guidance!
You can combine mutate and across.
Code 1: correct_answer as a vector
df %>%
  mutate(across(everything(),
                ~ as.numeric(.x == correct_answer[names(df) == cur_column()]),
                .names = "{.col}_correct"))
Code 2: correct_answer as a data.frame (df_correct)
correct_answer <- c(1,3,2,2,1,4)
df_correct <- data.frame(
matrix(correct_answer, ncol = length(correct_answer))
)
colnames(df_correct) <- names(df)
df %>%
  mutate(across(everything(),
                .fns = ~ as.numeric(.x == df_correct[, cur_column()]),
                .names = "{.col}_correct"))
Output
qa1 qa2 qa3 qb1 qb2 qb3 qa1_correct qa2_correct qa3_correct qb1_correct qb2_correct qb3_correct
1 1 3 1 2 4 4 1 1 0 1 0 1
2 1 3 2 2 1 4 1 1 1 1 1 1
3 2 3 1 2 1 4 0 1 0 1 1 1
4 1 3 2 1 1 3 1 1 1 0 1 0
This may also be an alternative (it relies on R 4.1.0 or later, which introduced the \(x) lambda shorthand used below):
df <- read.table(header = T, text = 'qa1, qa2, qa3, qb1, qb2, qb3
1, 3, 1, 2, 4, 4
1, 3, 2, 2, 1, 4
2, 3, 1, 2, 1, 4
1, 3, 2, 1, 1, 3', sep = ',')
df
#> qa1 qa2 qa3 qb1 qb2 qb3
#> 1 1 3 1 2 4 4
#> 2 1 3 2 2 1 4
#> 3 2 3 1 2 1 4
#> 4 1 3 2 1 1 3
correct_answer <- c(1,3,2,2,1,4)
cbind(df,
setNames(as.data.frame(t(apply(df, 1,
\(x) +(x == correct_answer)))),
paste0(names(df), '_correct')))
#> qa1 qa2 qa3 qb1 qb2 qb3 qa1_correct qa2_correct qa3_correct qb1_correct
#> 1 1 3 1 2 4 4 1 1 0 1
#> 2 1 3 2 2 1 4 1 1 1 1
#> 3 2 3 1 2 1 4 0 1 0 1
#> 4 1 3 2 1 1 3 1 1 1 0
#> qb2_correct qb3_correct
#> 1 0 1
#> 2 1 1
#> 3 1 1
#> 4 1 0
Created on 2021-07-23 by the reprex package (v2.0.0)
You can also use the following solution in base R:
cbind(df,
do.call(cbind, mapply(function(x, y) as.data.frame({+(x == y)}),
df, correct_answer, SIMPLIFY = FALSE)) |>
setNames(paste0(names(df), "_corr")))
qa1 qa2 qa3 qb1 qb2 qb3 qa1_corr qa2_corr qa3_corr qb1_corr qb2_corr qb3_corr
1 1 3 1 2 4 4 1 1 0 1 0 1
2 1 3 2 2 1 4 1 1 1 1 1 1
3 2 3 1 2 1 4 0 1 0 1 1 1
4 1 3 2 1 1 3 1 1 1 0 1 0
Or a potential tidyverse solution could be:
library(dplyr)
library(tidyr)
library(purrr)
df %>%
  mutate(output = pmap(df, ~ setNames(+(c(...) == correct_answer),
                                      paste0(names(df), "_corr")))) %>%
  unnest_wider(output)
qa1 qa2 qa3 qb1 qb2 qb3 qa1_corr qa2_corr qa3_corr qb1_corr qb2_corr qb3_corr
1 1 3 1 2 4 4 1 1 0 1 0 1
2 1 3 2 2 1 4 1 1 1 1 1 1
3 2 3 1 2 1 4 0 1 0 1 1 1
4 1 3 2 1 1 3 1 1 1 0 1 0
Try this:
df_new <- cbind(df, t(apply(df, 1, function(x) as.numeric(x == correct_answer))))
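Note that the appended columns come back without informative names; if you want the _correct suffix from the desired output, one possible tweak of the line above (an assumption, not from the original answer):
res <- t(apply(df, 1, function(x) as.numeric(x == correct_answer)))
colnames(res) <- paste0(names(df), "_correct")
df_new <- cbind(df, res)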
EDIT: this works with the addition of sym().
I found a related solution here: Paste variable name in mutate (dplyr) but it only pastes 0's.
for (i in index) {
  df <- df %>% mutate(!!paste0("qa", i, "_correct") :=
                        case_when(!!sym(paste0("qa", i)) == correct_answer[i] ~ 1,
                                  !!sym(paste0("qa", i)) != correct_answer[i] ~ 0))
}

Replace the same values in the consecutive rows and stop replacing once the value has changed in R

I want to find a way to replace consecutive identical values with 0 at the beginning of each trial, but once the value has changed it should stop replacing and keep the values. This should happen for every trial within each subject.
For example, the first subject has multiple trials (1, 2, etc.). At the beginning of each trial there may be some consecutive rows with the same value (e.g., 1, 1, 1). I would like to replace these with 0. However, once the value has changed from 1 to 0, I want to keep the values for the rest of the trial (e.g., 0, 0, 1).
subject <- c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1)
trial <- c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2)
value <- c(1, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1)
df <- data.frame(subject, trial, value)
Thus, from the original data frame, I would like to have a new variable (value_new) like below.
subject trial value value_new
1 1 1 1 0
2 1 1 1 0
3 1 1 1 0
4 1 1 0 0
5 1 1 0 0
6 1 1 1 1
7 1 2 1 0
8 1 2 1 0
9 1 2 0 0
10 1 2 1 1
11 1 2 1 1
12 1 2 1 1
I was thinking of using group_by(subject, trial) and mutating a new variable with a conditional statement, but I have no idea how to do that. I guess I need to use rle(), but again I have no clue how to replace the consecutive values with 0, stop replacing once the value has changed, and keep the rest of the values.
Any suggestions or advice would be really appreciated!
You can use rleid from data.table:
library(data.table)
setDT(df)[, new_value := value * +(rleid(value) > 1), .(subject, trial)]
df
# subject trial value new_value
# 1: 1 1 1 0
# 2: 1 1 1 0
# 3: 1 1 1 0
# 4: 1 1 0 0
# 5: 1 1 0 0
# 6: 1 1 1 1
# 7: 1 2 1 0
# 8: 1 2 1 0
# 9: 1 2 0 0
#10: 1 2 1 1
#11: 1 2 1 1
#12: 1 2 1 1
You can also do this with dplyr (rleid still comes from data.table):
library(dplyr)
df %>%
  group_by(subject, trial) %>%
  mutate(new_value = value * +(rleid(value) > 1))
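If you'd rather stay entirely in dplyr, a sketch without the data.table dependency (an assumption, not from the original answer) uses a cumulative count of departures from the first value in each trial:
library(dplyr)
df %>%
  group_by(subject, trial) %>%
  # 0 until the value first differs from the value at the start of the trial
  mutate(new_value = value * (cumsum(value != first(value)) > 0)) %>%
  ungroup()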

Apply a condition to all variables in R

I want to make a table consisting of 0s and 1s: if a value is larger than 0 it becomes 1, otherwise 0.
As the dataset has over 1,000 columns, should I use the sapply function for this? How do I write the code?
We can specify the condition and replace the values directly on the data frame. No "apply"-family function is needed.
# Create an example data frame
dt <- data.frame(A = c(0, 1, 2, 3, 4),
                 B = c(4, 6, 8, 0, 7),
                 C = c(0, 0, 5, 5, 2))
# View dt
dt
# A B C
# 1 0 4 0
# 2 1 6 0
# 3 2 8 5
# 4 3 0 5
# 5 4 7 2
# Replace values larger than 0 with 1
dt[dt > 0] <- 1
# View dt again
dt
# A B C
# 1 0 1 0
# 2 1 1 0
# 3 1 1 1
# 4 1 0 1
# 5 1 1 1
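If you do want a column-wise version anyway, for example to leave non-numeric columns such as an ID untouched, a sketch (an assumption beyond the original answer, applied to the original dt before the in-place replacement):
# 0/1 copy without modifying dt in place
dt01 <- as.data.frame(+(dt > 0))
# or apply the rule only to numeric columns
dt[] <- lapply(dt, function(x) if (is.numeric(x)) as.integer(x > 0) else x)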

R: Transform unordered long data to wide data

I am looking to transform unordered long data to wide data.
mydata <- data.frame(cat = c('a','a','a','b','c','c','c','c'),
                     color = c( 1, 1, 1, 2, 1, 1, 1, 1),
                     hat = c( 1, 1, 2, 2, 1, 2, 1, 2),
                     shoe = c( 0, 1, 1, 2, 1, 1, 1, 3))
cat is the ID variable, while color is a descriptive statistic that does not change within cat.
mydata
cat color hat shoe
1 a 1 1 0
2 a 1 1 1
3 a 1 2 1
4 b 2 2 2
5 c 1 1 1
6 c 1 2 1
7 c 1 1 1
8 c 1 2 3
Final Output
cat color hat1 shoe1 hat2 shoe2 hat3 shoe3 hat4 shoe4
1 a 1 1 0 1 1 2 1 NA NA
2 b 2 2 2 NA NA NA NA NA NA
3 c 1 1 1 2 1 1 1 2 3
The challenge I seem to be facing is that there is no "time variable".
Add in a counter by cat and then you can use that as your time variable:
library(data.table)
mydata <- data.table(cat = c('a','a','a','b','c','c','c','c'),
                     color = c( 1, 1, 1, 2, 1, 1, 1, 1),
                     hat = c( 1, 1, 2, 2, 1, 2, 1, 2),
                     shoe = c( 0, 1, 1, 2, 1, 1, 1, 3))
mydata[, "dummy.id" := seq(.N), by=cat]
widedata <- reshape(mydata, idvar='cat', timevar='dummy.id', direction='wide')
We can use dcast from data.table for this (casting multiple value.var columns was added in the then-devel version v1.9.5 and is part of the CRAN releases since). We create a sequence variable ('indx') grouped by the 'cat' and 'color' columns, then dcast from 'long' to 'wide', specifying the value.var columns.
library(data.table)#v1.9.5+
mydata[, indx:=1:.N, by = .(cat, color)]
dcast(mydata, cat+color~indx, value.var=c('hat', 'shoe'))
# cat color hat_1 hat_2 hat_3 hat_4 shoe_1 shoe_2 shoe_3 shoe_4
#1: a 1 1 1 2 NA 0 1 1 NA
#2: b 2 2 NA NA NA 2 NA NA NA
#3: c 1 1 2 1 2 1 1 1 3
This can be made more compact by combining it with getanID (to create the sequence variable) from splitstackshape:
library(splitstackshape)
dcast(getanID(mydata, c('cat', 'color')),
cat+color~.id, value.var=c('hat', 'shoe'), sep='')
# cat color hat1 hat2 hat3 hat4 shoe1 shoe2 shoe3 shoe4
#1: a 1 1 1 2 NA 0 1 1 NA
#2: b 2 2 NA NA NA 2 NA NA NA
#3: c 1 1 2 1 2 1 1 1 3
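For completeness, a tidyverse sketch of the same reshape (an assumption, not part of the original answers; pivot_wider needs tidyr 1.0 or later):
library(dplyr)
library(tidyr)
mydata %>%
  select(cat, color, hat, shoe) %>%   # drop any helper id columns added above
  group_by(cat, color) %>%
  mutate(indx = row_number()) %>%     # counter within cat/color, as above
  ungroup() %>%
  pivot_wider(names_from = indx, values_from = c(hat, shoe), names_sep = "")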
