I am looking to transform unordered long data to wide data.
mydata <- data.frame(cat = c('a','a','a','b','c','c','c','c'),
color = c( 1, 1, 1, 2, 1, 1, 1, 1),
hat = c( 1, 1, 2, 2, 1, 2, 1, 2),
shoe = c( 0, 1, 1, 2, 1, 1, 1, 3))
cat is the ID variable, while color is a descriptive variable that does not change within cat.
mydata
cat color hat shoe
1 a 1 1 0
2 a 1 1 1
3 a 1 2 1
4 b 2 2 2
5 c 1 1 1
6 c 1 2 1
7 c 1 1 1
8 c 1 2 3
Final Output
cat color hat1 shoe1 hat2 shoe2 hat3 shoe3 hat4 shoe4
1 a 1 1 0 1 1 2 1 NA NA
2 b 2 2 2 NA NA NA NA NA NA
3 c 1 1 1 2 1 1 1 2 3
The challenge I seem to be facing is that there is no "time variable".
Add in a counter by cat and then you can use that as your time variable:
library(data.table)
mydata <- data.table(cat = c('a','a','a','b','c','c','c','c'),
color = c( 1, 1, 1, 2, 1, 1, 1, 1),
hat = c( 1, 1, 2, 2, 1, 2, 1, 2),
shoe = c( 0, 1, 1, 2, 1, 1, 1, 3))
mydata[, "dummy.id" := seq(.N), by=cat]
widedata <- reshape(mydata, idvar='cat', timevar='dummy.id', direction='wide')
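A base R sketch of the same counter idea, in case data.table is not available. It assumes color really is constant within cat, so both are passed as idvar; otherwise reshape treats color as one more measured variable and repeats it per time point. The sep = '' argument is only there so the names come out as hat1, shoe1, and so on.

```r
mydata <- data.frame(cat   = c('a','a','a','b','c','c','c','c'),
                     color = c( 1,  1,  1,  2,  1,  1,  1,  1),
                     hat   = c( 1,  1,  2,  2,  1,  2,  1,  2),
                     shoe  = c( 0,  1,  1,  2,  1,  1,  1,  3))

# Running counter within each cat; this plays the role of the missing time variable
mydata$dummy.id <- ave(seq_along(mydata$cat), mydata$cat, FUN = seq_along)

# Long to wide: one row per cat/color, one hat/shoe pair per counter value
widedata <- reshape(mydata, idvar = c('cat', 'color'), timevar = 'dummy.id',
                    direction = 'wide', sep = '')
widedata
```

Groups with fewer rows than the largest group are padded with NA, as in the desired output.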
We can use dcast from the devel version of data.table, i.e. v1.9.5+, for this. We create a sequence variable ('indx') grouped by the 'cat' and 'color' columns, then dcast from 'long' to 'wide', specifying both value.var columns.
library(data.table)#v1.9.5+
mydata[, indx:=1:.N, by = .(cat, color)]
dcast(mydata, cat+color~indx, value.var=c('hat', 'shoe'))
# cat color hat_1 hat_2 hat_3 hat_4 shoe_1 shoe_2 shoe_3 shoe_4
#1: a 1 1 1 2 NA 0 1 1 NA
#2: b 2 2 NA NA NA 2 NA NA NA
#3: c 1 1 2 1 2 1 1 1 3
NOTE: Instructions to install the devel version are here
This can be made compact by combining with getanID(to create the sequence variable) from splitstackshape
library(splitstackshape)
dcast(getanID(mydata, c('cat', 'color')),
cat+color~.id, value.var=c('hat', 'shoe'), sep='')
# cat color hat1 hat2 hat3 hat4 shoe1 shoe2 shoe3 shoe4
#1: a 1 1 1 2 NA 0 1 1 NA
#2: b 2 2 NA NA NA 2 NA NA NA
#3: c 1 1 2 1 2 1 1 1 3
I have a series of 14 Boolean variables and I would like to find the top 3 combinations of 3 or more variables (where the value == 1).
Sample data:
df <- data.frame(ID = c(1, 2, 3, 4, 5, 6, 7, 8),
var1 = c(0, 0, 1, 1, 1, 0, 0, 1),
var2 = c(1, 0, 0, 1, 1, 1, 1, 0),
var3 = c(0, 0, 1, 1, 1, 1, 0, 0),
var4 = c(1, 1, 1, 1, 1, 0, 1, 1),
var5 = c(0, 0, 0, 1, 1, 0, 1, 1)
)
df
> df
ID var1 var2 var3 var4 var5
1 1 0 1 0 1 0
2 2 0 0 0 1 0
3 3 1 0 1 1 0
4 4 1 1 1 1 1
5 5 1 1 1 1 1
6 6 0 1 1 0 0
7 7 0 1 0 1 1
8 8 1 0 0 1 1
I found a solution to bring all column names together per unique occurrence:
library(magrittr) # %>%
library(reshape2) # melt
library(plyr) # ddply, summarize
# Bring to long format
df_long <- df %>%
melt(id.vars = "ID")
# Collapse the variables that have a '1' together per row
df_combo <- ddply(df_long, "ID", summarize,
combos = paste(variable[value == 1], collapse = "/"))
> df_combo
ID combos
1 1 var2/var4
2 2 var4
3 3 var1/var3/var4
4 4 var1/var2/var3/var4/var5
5 5 var1/var2/var3/var4/var5
6 6 var2/var3
7 7 var2/var4/var5
8 8 var1/var4/var5
If I only wanted counts on unique combinations this would be fine, but I would like to know the number of times each combination of 3 or more variables occurs, even in cases where other variables also occur. The combination (var1/var4/var5) occurs 3 times in the above example, but twice it occurs next to two other variables.
There must be an easy way to extract this information, just can't think of it. Thank you for your help!!
An attempt, using combn as the workhorse function.
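# Row/column positions of every 1 (column 1 is var1, column 2 is var2, ...)
arr <- which(df[-1] == 1, arr.ind=TRUE)
# For each row with at least 3 ones, list every 3-column combination
tmp <- tapply(arr[,"col"], arr[,"row"],
FUN=function(x) if (length(x) >= 3) combn(x,3, simplify=FALSE) )
# Stack the combinations and count how often each one occurs
tmp <- data.frame(do.call(rbind, unlist(tmp, recursive=FALSE)))
aggregate(count ~ . , cbind(tmp, count=1), sum)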
## X1 X2 X3 count
##1 1 2 3 2
##2 1 2 4 2
##3 1 3 4 3
##4 2 3 4 2
##5 1 2 5 2
##6 1 3 5 2
##7 2 3 5 2
##8 1 4 5 3
##9 2 4 5 3
##10 3 4 5 2
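A variant of the same combn idea that keeps the variable names instead of column positions and sorts the counts, so the top 3 can be read off directly. This is only a sketch; the names vars, combos and counts are made up for the example.

```r
df <- data.frame(ID   = 1:8,
                 var1 = c(0, 0, 1, 1, 1, 0, 0, 1),
                 var2 = c(1, 0, 0, 1, 1, 1, 1, 0),
                 var3 = c(0, 0, 1, 1, 1, 1, 0, 0),
                 var4 = c(1, 1, 1, 1, 1, 0, 1, 1),
                 var5 = c(0, 0, 0, 1, 1, 0, 1, 1))

vars <- names(df)[-1]

# For every row, build all 3-variable combinations among the variables equal to 1
combos <- unlist(lapply(seq_len(nrow(df)), function(i) {
  on <- vars[df[i, -1] == 1]
  if (length(on) >= 3) combn(on, 3, FUN = paste, collapse = "/") else character(0)
}))

# Count each combination and sort, most frequent first
counts <- sort(table(combos), decreasing = TRUE)
head(counts, 3)
```

Here three combinations tie at 3 occurrences (var1/var3/var4, var1/var4/var5 and var2/var4/var5), so with ties the "top 3" depends on how you want to break them.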
Disclaimer: I think there is a much more efficient solution (perhaps an anonymous function with a list or *apply functions?) hence why I have come to you much more experienced people for help!
The data
Let's say I have a df with participant responses to 3 question As and 3 question Bs e.g.
qa1, qa2, qa3, qb1, qb2, qb3
1, 3, 1, 2, 4, 4
1, 3, 2, 2, 1, 4
2, 3, 1, 2, 1, 4
1, 3, 2, 1, 1, 3
EDIT df also contains other columns with other irrelevant data!
I have a vector with correct answers to each of qa1-3 and qb1-3 in sequence with the columns.
correct_answer <- c(1,3,2,2,1,4)
(i.e. for qa1,qa2,qa3,qb1,qb2,qb3)
Desired manipulation
I want to create a new column per question (e.g. qa1_correct), coding for whether the participant has responded correctly (1) or incorrectly (0) based on matching each response in df with corresponding answer in correct_answer. Ideally I would end up with:
qa1, qa2, qa3, qb1, qb2, qb3, qa1_correct, qa2_correct, qa3_correct ...
1, 3, 1, 2, 4, 4, 1, 1, 0, ...
1, 3, 2, 2, 1, 4, 1, 1, 1, ...
2, 3, 1, 2, 1, 4, 0, 1, 0, ...
1, 3, 2, 1, 1, 3, 1, 1, 1, ...
Failed Attempt
This is my attempt for question As only (would repeat for Bs) but it doesn't work (maybe wrong function paste0()?):
index <- c(1:3)
for (i in index) {
df <- df %>% mutate(paste0("qa",i,"_correct") =
case_when(paste0("qa"i) == correct_answer[i] ~ 1,
paste0("qa"i) != correct_answer[i] ~ 0))
}
Many thanks for any guidance!
You can combine mutate and across.
Code 1: Correct_answer as vector
df %>%
mutate(across(everything(),
~as.numeric(.x == correct_answer[names(df) == cur_column()]),
.names = "{.col}_correct"))
Code 2: Correct_answer as data.frame (df_correct)
correct_answer <- c(1,3,2,2,1,4)
df_correct <- data.frame(
matrix(correct_answer, ncol = length(correct_answer))
)
colnames(df_correct) <- names(df)
df %>%
mutate(across(everything(),
.fns = ~as.numeric(.x == df_correct[,cur_column()]),
.names = "{.col}_correct"))
Output
qa1 qa2 qa3 qb1 qb2 qb3 qa1_correct qa2_correct qa3_correct qb1_correct qb2_correct qb3_correct
1 1 3 1 2 4 4 1 1 0 1 0 1
2 1 3 2 2 1 4 1 1 1 1 1 1
3 2 3 1 2 1 4 0 1 0 1 1 1
4 1 3 2 1 1 3 1 1 1 0 1 0
This may also be an alternative (it needs R 4.1.0 or later for the \(x) lambda shorthand; from that version apply also has a simplify argument, TRUE by default)
df <- read.table(header = T, text = 'qa1, qa2, qa3, qb1, qb2, qb3
1, 3, 1, 2, 4, 4
1, 3, 2, 2, 1, 4
2, 3, 1, 2, 1, 4
1, 3, 2, 1, 1, 3', sep = ',')
df
#> qa1 qa2 qa3 qb1 qb2 qb3
#> 1 1 3 1 2 4 4
#> 2 1 3 2 2 1 4
#> 3 2 3 1 2 1 4
#> 4 1 3 2 1 1 3
correct_answer <- c(1,3,2,2,1,4)
cbind(df,
setNames(as.data.frame(t(apply(df, 1,
\(x) +(x == correct_answer)))),
paste0(names(df), '_correct')))
#> qa1 qa2 qa3 qb1 qb2 qb3 qa1_correct qa2_correct qa3_correct qb1_correct
#> 1 1 3 1 2 4 4 1 1 0 1
#> 2 1 3 2 2 1 4 1 1 1 1
#> 3 2 3 1 2 1 4 0 1 0 1
#> 4 1 3 2 1 1 3 1 1 1 0
#> qb2_correct qb3_correct
#> 1 0 1
#> 2 1 1
#> 3 1 1
#> 4 1 0
Created on 2021-07-23 by the reprex package (v2.0.0)
You can also use the following solution in base R:
cbind(df,
do.call(cbind, mapply(function(x, y) as.data.frame({+(x == y)}),
df, correct_answer, SIMPLIFY = FALSE)) |>
setNames(paste0(names(df), "_corr")))
qa1 qa2 qa3 qb1 qb2 qb3 qa1_corr qa2_corr qa3_corr qb1_corr qb2_corr qb3_corr
1 1 3 1 2 4 4 1 1 0 1 0 1
2 1 3 2 2 1 4 1 1 1 1 1 1
3 2 3 1 2 1 4 0 1 0 1 1 1
4 1 3 2 1 1 3 1 1 1 0 1 0
Or a potential tidyverse solution could be:
library(dplyr)
library(tidyr)
library(purrr)
df %>%
mutate(output = pmap(df, ~ setNames(+(c(...) == correct_answer),
paste0(names(df), "_corr")))) %>%
unnest_wider(output)
qa1 qa2 qa3 qb1 qb2 qb3 qa1_corr qa2_corr qa3_corr qb1_corr qb2_corr qb3_corr
1 1 3 1 2 4 4 1 1 0 1 0 1
2 1 3 2 2 1 4 1 1 1 1 1 1
3 2 3 1 2 1 4 0 1 0 1 1 1
4 1 3 2 1 1 3 1 1 1 0 1 0
Try this:
df_new <- cbind(df, t(apply(df, 1, function(x) as.numeric(x == correct_answer))))
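Since the question's EDIT says df also has unrelated columns, here is a base R sketch that scores only the question columns. It assumes correct_answer is given names matching those columns; the id column and the names q_cols and scored are invented for the example.

```r
df <- data.frame(id  = c("p1", "p2", "p3", "p4"),   # stand-in for the irrelevant columns
                 qa1 = c(1, 1, 2, 1), qa2 = c(3, 3, 3, 3), qa3 = c(1, 2, 1, 2),
                 qb1 = c(2, 2, 2, 1), qb2 = c(4, 1, 1, 1), qb3 = c(4, 4, 4, 3))

# Name the answers so they are matched to columns by name, not by position
correct_answer <- c(qa1 = 1, qa2 = 3, qa3 = 2, qb1 = 2, qb2 = 1, qb3 = 4)
q_cols <- names(correct_answer)

# sweep() compares every column with its own answer; + 0L turns TRUE/FALSE into 1/0
scored <- sweep(as.matrix(df[q_cols]), 2, correct_answer[q_cols], FUN = "==") + 0L
colnames(scored) <- paste0(q_cols, "_correct")

df_new <- cbind(df, scored)
df_new
```

Matching by name means the other columns can sit anywhere in df without shifting the comparison.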
EDIT works with addition of sym()
Found a related solution here Paste variable name in mutate (dplyr) but it only pastes 0's
for (i in index) {
df <- df %>% mutate( !!paste0("qa",i,"_correct") :=
case_when(!!sym(paste0("qa",i)) == correct_answer[i] ~ 1,
!!sym(paste0("qa",i)) != correct_answer[i] ~ 0))
}
Being a novice on R, I have trouble setting up the appropriate code (I would still say that it must include if/else commands and a loop).
In concrete terms, I would like to compare two pieces of information (see simplified example, because my actual database is rather long): "Monthly_category" and "Ref_category". The "Ref_category" to be taken into consideration is calculated only at the 5th period for each element (because then we move to the next element), thanks to the mode formula, for each element (Element_id).
Months Element_Id Monthly_Category Ref_Category Expected_output
1 1 3 NA 0
2 1 2 NA 0
3 1 2 NA 1
4 1 1 NA 1
5 1 3 3 0
1 2 6 2 0
2 2 6 6 1
3 2 NA 1 0
4 2 NA 6 0
5 2 1 1 0
More precisely, I would like to put 1 as soon as the "Monthly_category" differs 2 periods in a row from the selected "Ref_category" which is calculated every 5 observations. Otherwise, set 0.
In addition, I would like the lines where Monthly_Category = NA to give 0 directly because in the end, I will only take into account lines where I have 1s (and NA doesn't interest me).
For each element (1 element = 5 lines), the reference category is calculated at the end of the 5 periods using the mode. However, by stretching the formula, we have values in each line while I have to consider each time only the last value (so every 5 lines). That's why I thought we needed 2 loops: one to check each line for the monthly category and one to check the reference category every 5 lines.
Do you have any idea of the code that could allow me to do this?
A very big thank you if someone can enlighten me,
Vanie
First of all, please have a look at the questions that @John Coleman and I asked you in the comments, because my solution may change based on your answers.
Anyway, you don't need an explicit for loop or an explicit if else to get the job done.
In R, you usually avoid writing for loops directly and use a functional such as lapply instead. In this case the dplyr package takes care of the implicit looping.
df <- tibble::tribble(~Months, ~Element_Id, ~Monthly_Category, ~Ref_Category, ~Expected_output,
1 , 1, 3, NA, 0,
2 , 1, 2, NA, 0,
3 , 1, 2, NA, 1,
4 , 1, 1, NA, 1,
5 , 1, 3, 3, 0,
1 , 2, 6, 2, 0,
2 , 2, 6, 6, 1,
3 , 2, 1, 1, 0,
4 , 2, 1, 6, 0,
5 , 2, 1, 1, 0)
library(dplyr)
library(purrr)
df %>%
# check if elements are equal
mutate(Real_Expected_output = !map2_lgl(Monthly_Category, Ref_Category, identical)) %>%
# sort by Element_Id and Months just in case your data is messy
arrange(Element_Id, Months) %>%
# For each Element_Id ...
group_by(Element_Id) %>%
# ... define your Expected Output
mutate(Real_Expected_output = as.integer(lag(Real_Expected_output, default = FALSE) &
lag(Real_Expected_output, 2, default = FALSE))) %>%
ungroup()
# Months Element_Id Monthly_Category Ref_Category Expected_output Real_Expected_output
# <dbl> <dbl> <dbl> <dbl> <dbl> <int>
# 1 1 3 NA 0 0
# 2 1 2 NA 0 0
# 3 1 2 NA 1 1
# 4 1 1 NA 1 1
# 5 1 3 3 0 1
# 1 2 6 2 0 0
# 2 2 6 6 1 0
# 3 2 1 1 0 0
# 4 2 1 6 0 0
# 5 2 1 1 0 0
Real_Expected_output is not the same as your Expected_output because I believe your expected result contradicts your written requirements, as I said in one of the comments.
EDIT:
Based on your comment, I suppose this is what you're looking for.
Again: no loops, you just need to make good use of the tools that the dplyr package already provides, i.e. last, group_by, mutate
df %>%
# sort by Element_Id and Months just in case your data is messy
arrange(Element_Id, Months) %>%
# For each Element_Id ...
group_by(Element_Id) %>%
# ... check if Monthly Category is equal to the last Ref_Category
mutate(Real_Expected_output = !map2_lgl(Monthly_Category, last(Ref_Category), identical)) %>%
# ... and define your Expected Output
mutate(Real_Expected_output = as.integer(Real_Expected_output &
lag(Real_Expected_output, default = FALSE))) %>%
ungroup()
# Months Element_Id Monthly_Category Ref_Category Expected_output Real_Expected_output
# <dbl> <dbl> <dbl> <dbl> <dbl> <int>
# 1 1 3 NA 0 0
# 2 1 2 NA 0 0
# 3 1 2 NA 1 1
# 4 1 1 NA 1 1
# 5 1 3 3 0 0
# 1 2 6 2 0 0
# 2 2 6 6 1 1
# 3 2 1 1 0 0
# 4 2 1 6 0 0
# 5 2 1 1 0 0
EDIT 2:
I'll edit it again based on your request. At this point I'd suggest putting the logic into a separate function. It looks cleaner.
df <- tibble::tribble(~Months, ~Element_Id, ~Monthly_Category, ~Ref_Category, ~Expected_output,
1 , 1, 3, NA, 0,
2 , 1, 2, NA, 0,
3 , 1, 2, NA, 1,
4 , 1, 1, NA, 1,
5 , 1, 3, 3, 0,
1 , 2, 6, 2, 0,
2 , 2, 6, 6, 1,
3 , 2, NA, 1, 0,
4 , 2, NA, 6, 0,
5 , 2, 1, 1, 0)
library(dplyr)
library(purrr)
get_output <- function(mon, ref){
# set here your condition
exp <- !is.na(mon) & !map2_lgl(mon, last(ref), identical)
# check exp and lag(exp), then convert to integer
as.integer(exp & lag(exp, default = FALSE))
}
df %>%
# sort by Element_Id and Months just in case your data is messy
arrange(Element_Id, Months) %>%
# For each Element_Id ...
group_by(Element_Id) %>%
# ... launch your function
mutate(Real_Expected_output = get_output(Monthly_Category, Ref_Category)) %>%
ungroup()
# # A tibble: 10 x 6
# Months Element_Id Monthly_Category Ref_Category Expected_output Real_Expected_output
# <dbl> <dbl> <dbl> <dbl> <dbl> <int>
# 1 1 1 3 NA 0 0
# 2 2 1 2 NA 0 0
# 3 3 1 2 NA 1 1
# 4 4 1 1 NA 1 1
# 5 5 1 3 3 0 0
# 6 1 2 6 2 0 0
# 7 2 2 6 6 1 1
# 8 3 2 NA 1 0 0
# 9 4 2 NA 6 0 0
# 10 5 2 1 1 0 0
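For comparison, the same rule can be sketched in base R without dplyr or purrr. The names flag_group and flag are made up here, and unlike identical() the != comparison assumes the last Ref_Category of each element is not NA.

```r
df <- data.frame(Months           = rep(1:5, times = 2),
                 Element_Id       = rep(1:2, each = 5),
                 Monthly_Category = c(3, 2, 2, 1, 3, 6, 6, NA, NA, 1),
                 Ref_Category     = c(NA, NA, NA, NA, 3, 2, 6, 1, 6, 1))

flag_group <- function(mon, ref) {
  ref_last <- ref[length(ref)]                   # reference = value on the element's last row
  differs  <- !is.na(mon) & mon != ref_last      # NA months never count as a difference
  prev     <- c(FALSE, differs[-length(differs)])  # same flag shifted down one row
  as.integer(differs & prev)                     # 1 when this row and the previous one both differ
}

# Sort first so rows within each element are in month order
df <- df[order(df$Element_Id, df$Months), ]
out <- lapply(split(df, df$Element_Id),
              function(g) flag_group(g$Monthly_Category, g$Ref_Category))
df$flag <- unlist(out, use.names = FALSE)
df
```

Because df is sorted by Element_Id before split(), the pieces come back in the same row order and can be written straight into the new column.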
I have some dplyr code I'm moving to data.table, and this is a problem I just ran into. I want the difference from one row to the next in b stored in column c if a is greater than or equal to 3. However, after running this code:
df = data.frame(a = c(1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3),
b = c(0, 1, 0, 1, 0, 1, 1, 0, 3, 4, 5))
setDT(df)
df[ , c := ifelse(a >= 3, c(0, diff(b)), b), by = .(a)]
all the elements in c are 0. Why is this?
df
a b c
1: 1 0 0
2: 1 1 0
3: 1 0 0
4: 1 1 0
5: 2 0 0
6: 2 1 0
7: 2 1 0
8: 3 0 0
9: 3 3 0
10: 3 4 0
11: 3 5 0
What I thought was the equivalent dplyr:
df = data.frame(a = c(1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3),
b = c(0, 1, 0, 1, 0, 1, 1, 0, 3, 4, 5))
df %>%
group_by(a) %>%
mutate(c = ifelse( a >= 3, c(0, diff(b)), b))
From the help for ifelse(test, yes, no), it should return...
A vector of the same length and attributes (including dimensions and "class") as test and data values from the values of yes or no. The mode of the answer will be coerced from logical to accommodate first any values taken from yes and then any values taken from no.
However:
> df %>% group_by(a) %>% do(print(.$a))
[1] 1 1 1 1
[1] 2 2 2
[1] 3 3 3 3
> data.table(df)[, print(a), by=a]
[1] 1
[1] 2
[1] 3
As explained in the help pages, since the first argument has a length of one, if you pass vectors for the other parts, only their first element is used:
> ifelse(TRUE, 1:10, eleventy + million)
[1] 1
You should probably use if ... else ... when working with a constant value, like...
> data.table(df)[, b := if (a >= 3) c(0, diff(b)) else b, by=a]
or even better, in this case you can assign to a subset:
> data.table(df)[a >= 3, b := c(0, diff(b)), by=a]
Regarding why a has length 1 for the data.table idiom, see its FAQ question "Inside each group, why are the group variables length-1?"
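The length behaviour can be checked without any grouping at all. A small sketch in plain R, where a_scalar stands in for the length-1 group value that data.table passes for a:

```r
b <- c(0, 3, 4, 5)
a_scalar <- 3   # inside a data.table group, the grouping variable has length 1

# ifelse() returns something shaped like its test, so only the first element survives
res_ifelse <- ifelse(a_scalar >= 3, c(0, diff(b)), b)

# if ... else ... returns the whole chosen branch
res_if <- if (a_scalar >= 3) c(0, diff(b)) else b

res_ifelse
res_if
```

The single value from ifelse is then recycled over every row of the group, which is why a whole group ends up with the same number.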
I am using a dataset in which the first b of the first group is non-zero, to illustrate this better. In your dataset the first b of every group was zero, and c(0,diff(b)) also starts with zero, so the two cases were hard to tell apart.
What happens here is that, because a has length 1 inside each group, the output of ifelse is also a vector of length 1.
library(data.table)
df = data.frame(a = c(1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3),
b = c(10, 1, 0, 1, 0, 1, 1, 0, 3, 4, 5))
Look below:
setDT(df)[ , c := ifelse(a >= 3, c(0, diff(b)), b), by = .(a)][]
#> a b c
#> 1: 1 10 10
#> 2: 1 1 10
#> 3: 1 0 10
#> 4: 1 1 10
#> 5: 2 0 0
#> 6: 2 1 0
#> 7: 2 1 0
#> 8: 3 0 0
#> 9: 3 3 0
#> 10: 3 4 0
#> 11: 3 5 0
Now, let's look at some other examples; here I am using a simple vector of length 4 (instead of c(0,diff(b))):
setDT(df)[ , c := ifelse(a >= 3L, c(20,2,3,4), -999), by=a][]
#> a b c
#> 1: 1 10 -999
#> 2: 1 1 -999
#> 3: 1 0 -999
#> 4: 1 1 -999
#> 5: 2 0 -999
#> 6: 2 1 -999
#> 7: 2 1 -999
#> 8: 3 0 20
#> 9: 3 3 20
#> 10: 3 4 20
#> 11: 3 5 20
You see that still the first element is getting assigned to all the rows of c for that group of a.
A work-around is using diff on a to see when it's not changing (i.e. diff(a)==0) and use that as a pseudo-grouping along with the other condition; like below:
setDT(df)[, c := ifelse(a >= 3 & c(FALSE,diff(a)==0), c(0,diff(b)), b)][]
#> a b c
#> 1: 1 10 10
#> 2: 1 1 1
#> 3: 1 0 0
#> 4: 1 1 1
#> 5: 2 0 0
#> 6: 2 1 1
#> 7: 2 1 1
#> 8: 3 0 0
#> 9: 3 3 3
#> 10: 3 4 1
#> 11: 3 5 1
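Another way to avoid the length-1 problem, sketched in base R: compute the within-group difference for every row first with ave, then let ifelse pick row by row. The name grp_diff is made up for the example.

```r
df <- data.frame(a = c(1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3),
                 b = c(10, 1, 0, 1, 0, 1, 1, 0, 3, 4, 5))

# Difference from the previous row within each group of a (0 on each group's first row)
grp_diff <- ave(df$b, df$a, FUN = function(x) c(0, diff(x)))

# Now test, yes and no all have one value per row, so ifelse works as intended
df$c <- ifelse(df$a >= 3, grp_diff, df$b)
df
```

One difference from the diff(a) workaround: the first row of a qualifying group gets 0 here, whereas the workaround leaves b there. With this data both give 0 because b is 0 on that row.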
I want to make a table consisting of 0s and 1s.
If a value is larger than 0, it should become 1, otherwise 0.
As the dataset has over 1,000 columns, I think I should use something like sapply for this.
How do I write the code?
We can specify the condition and replace the value for a data frame. No "apply" family function is needed.
# Create an example data frame
dt <- data.frame(A = c(0, 1, 2, 3, 4),
B = c(4, 6, 8, 0, 7),
C = c(0, 0, 5, 5, 2))
# View dt
dt
# A B C
# 1 0 4 0
# 2 1 6 0
# 3 2 8 5
# 4 3 0 5
# 5 4 7 2
# Replace values larger than 0 to be 1
dt[dt > 0] <- 1
# View dt again
dt
# A B C
# 1 0 1 0
# 2 1 1 0
# 3 1 1 1
# 4 1 0 1
# 5 1 1 1
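Note that dt[dt > 0] <- 1 leaves any negative values as they are. If the data can contain negatives and everything that is not larger than 0 should become 0, a column-wise sketch like the following may be closer to what was asked (bin is just a name chosen for the example):

```r
dt <- data.frame(A = c(0, 1, 2, 3, 4),
                 B = c(4, 6, 8, 0, 7),
                 C = c(0, 0, 5, 5, 2))

bin <- dt
# Turn each column into 1 (value > 0) or 0 (anything else); bin[] keeps the data frame shape
bin[] <- lapply(dt, function(col) as.integer(col > 0))
bin
```

Missing values stay NA with this approach, and it works the same way for 1,000 columns as for three.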