I'm trying to create a synthetic dataset, but I'm struggling a bit.
Is there a way to create a column based on the values in another column?
I have a between-subjects design, and my participants are divided into two conditions
(condition 1 = 0, condition 2 = 1).
I want to make a column "Trial_1" (0 = Absence, 1 = Presence), but only for the participants in one of the conditions.
df <- data.frame(
Id = seq(1, 10, by=1),
Age = sample(1:5, 10, replace = TRUE),
Condition = sample(0:1, 10, replace = TRUE),
Trial_1 = sample(0:1, 10, replace = TRUE, prob = c(0.3, 0.7)))
## BUT I want Trial_1 filled in only for participants in Condition == 1
And if there is an easy way to make the probability based on age, that would be amazing!
Thanks in advance!
You can create df with the Id, Age, and Condition columns first, and then use rowwise() and mutate() (both from the dplyr package) to create Trial_1.
library(dplyr)
df %>%
rowwise() %>%
mutate(Trial_1 = sample(0:1, 1, prob=c(1-Age/10, Condition*Age/10)))
Here, note that the probabilities of 0 and 1 are 1-Age/10 and Age/10, respectively, to make them age-dependent (these are valid probabilities because Age is between 1 and 5 here); change this to whatever dependence on age you would like.
Also, note that I multiply the probability corresponding to 1 by Condition, which ensures that Condition=0 rows always get 0.
Output:
Id Age Condition Trial_1
<dbl> <int> <int> <int>
1 1 1 0 0
2 2 3 1 1
3 3 1 0 0
4 4 4 1 0
5 5 3 1 0
6 6 5 1 1
7 7 4 1 0
8 8 5 1 0
9 9 3 0 0
10 10 2 1 0
If you prefer those rows to be NA, then do something like this instead:
df %>%
rowwise() %>%
mutate(Trial_1 = if_else(Condition==1, sample(0:1, 1, prob=c(1-Age/10, Age/10)), NA_integer_))
Output:
Id Age Condition Trial_1
<dbl> <int> <int> <int>
1 1 1 0 NA
2 2 3 1 1
3 3 1 0 NA
4 4 4 1 1
5 5 3 1 0
6 6 5 1 1
7 7 4 1 0
8 8 5 1 0
9 9 3 0 NA
10 10 2 1 0
Input:
structure(list(Id = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10), Age = c(1L,
3L, 1L, 4L, 3L, 5L, 4L, 5L, 3L, 2L), Condition = c(0L, 1L, 0L,
1L, 1L, 1L, 1L, 1L, 0L, 1L)), class = "data.frame", row.names = c(NA,
-10L))
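If rowwise() turns out to be slow on a larger dataset, the same idea can be sketched without it: rbinom() is vectorised over prob, so each row gets its own Bernoulli draw. This is only a sketch, assuming the same Age/10 dependence as above; the set.seed() call and the Trial_1_na column name are additions for illustration.

```r
set.seed(1)  # assumption: fixed seed only so the sketch is reproducible

df <- data.frame(
  Id = 1:10,
  Age = c(1L, 3L, 1L, 4L, 3L, 5L, 4L, 5L, 3L, 2L),
  Condition = c(0L, 1L, 0L, 1L, 1L, 1L, 1L, 1L, 0L, 1L)
)

# One draw per row: P(Trial_1 = 1) is Age/10 when Condition == 1 and 0 otherwise
df$Trial_1 <- rbinom(nrow(df), size = 1, prob = df$Condition * df$Age / 10)

# Variant with NA instead of 0 for participants in Condition == 0
df$Trial_1_na <- ifelse(df$Condition == 1, df$Trial_1, NA_integer_)
```

Multiplying by Condition plays the same role as in the rowwise() version: rows with Condition == 0 have probability 0 of getting a 1.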
I'd do it in two steps - first create the dataframe and then the Trial column. My solution isn't super elegant, but it's straightforward and doesn't require anything but base R. I hope it helps.
df <- data.frame(
Id = seq(1, 10, by = 1),
Age = sample(1:5, 10, replace = TRUE),
Condition = sample(0:1, 10, replace = TRUE)
)
df$Trial[df$Condition == 1] <- sample(0:1, sum(df$Condition), prob = c(0.3, 0.7), replace = TRUE)
# more generally, if you want to assign to Trial only when Condition is x
# df$Trial[df$Condition == x] <- sample(0:1, sum(df$Condition == x), prob = c(0.3, 0.7), replace = TRUE)
I have a data set in which subjects have made choices between A and B for 13 different B's. Below is a simplified example of what the data looks like with 4 subjects and 5 choices. (1 is A, 2 is B).
subject choice1 choice2 choice3 choice4 choice5
1 1 1 1 1 2 2
2 2 1 1 2 2 2
3 3 1 2 1 2 2
4 4 1 2 2 2 2
I would like to find the questions at which subjects switch from option A to B, e.g. for subject 1 this would be choice4.
In a previous study we did this by computing the number of times the subject chose option A and then selecting the corresponding option B from a separate matrix. See the code below.
However, the difference now is that instead of choosing 1 switching point, subjects were asked the questions in a randomized order, and thus there is the possibility of having multiple switching points. For example in the table above, subject 3 switches to B at choice2 and again at choice4.
I would like to find both the first time the subject switches to option B, and the last time (before sticking with B for the rest of the choices).
sure_amounts <- matrix(nrow = 4, ncol = 13) # 4 treatments, 13 questions
sure_amounts[1, ] <- c(0, 2, 3, 3.5, 4, 4.5, 5, 5.5, 6, 6.5, 7, 8, 10) # Option B's
sure_amounts[2, ] <- seq(2, 14, 1)
sure_amounts[3, ] <- seq(2, 14, 1)
sure_amounts[4, ] <- seq(2, 14, 1)
b_choice <- matrix(nrow = 201, ncol = 4)
switch_choice <- matrix(nrow = 201, ncol = 4) # switching point from A to B
for(j in 1:4){ # number of treatments
for(i in 1:201){ # number of subjects
choice = NULL
fl = data$ID == i
k = 1 + 36*(j-1) # 36 before going to the next treatment (due to other questions)
choice = c(data[fl,k:(k+12)])
b_choice[i,j] = length(choice[choice==1])
temp = b_choice[i,j]
switch_choice[i,j] <- ifelse(temp==0, 0, sure_amounts[j, temp])
}
}
Does anyone have any tips on how to approach this? Thanks in advance!
I am not sure what you want your expected output to look like, but you can try getting the data into long format and, for each subject, selecting the rows where they switch from 1 -> 2.
library(dplyr)
df %>%
tidyr::pivot_longer(cols = -subject) %>%
group_by(subject) %>%
filter(value == 2 & lag(value) == 1 |
value == 1 & lead(value) == 2)
# subject name value
# <int> <chr> <int>
# 1 1 choice3 1
# 2 1 choice4 2
# 3 2 choice2 1
# 4 2 choice3 2
# 5 3 choice1 1
# 6 3 choice2 2
# 7 3 choice3 1
# 8 3 choice4 2
# 9 4 choice1 1
#10 4 choice2 2
Here we can see that subject 1 moves from 1 -> 2 from choice3 -> choice4 and so on.
data
df <- structure(list(subject = 1:4, choice1 = c(1L, 1L, 1L, 1L), choice2 = c(1L,
1L, 2L, 2L), choice3 = c(1L, 2L, 1L, 2L), choice4 = c(2L, 2L,
2L, 2L), choice5 = c(2L, 2L, 2L, 2L)), class = "data.frame",
row.names = c(NA, -4L))
A Base R solution:
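Essentially, this code subtracts a lagged copy of the decisions from the decisions and detects where the difference is not equal to zero. Note that this flags every change, including a switch back from B to A (as for subject 3 below).
Code:
lapply(as.data.frame(t(df_1)[-1,]), function(x){
t <- x - c(x[-1], 0) # each choice minus the next choice
z <- which(t[-length(t)] != 0) # positions where the difference is non-zero (last value dropped)
z + 1 # shift by one to undo the lag
})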
# $`1`
# [1] 4
# $`2`
# [1] 3
# $`3`
# [1] 2 3 4
# $`4`
# [1] 2
Data:
df_1 <- read.table(text = " subject choice1 choice2 choice3 choice4 choice5
1 1 1 1 1 2 2
2 2 1 1 2 2 2
3 3 1 2 1 2 2
4 4 1 2 2 2 2 ", header = T)
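The question also asks for the first and the last switch to B. A possible sketch building on the same difference idea, assuming choices are coded 1 = A and 2 = B and that only A-to-B steps count as switches (the column names first and last are made up for illustration):

```r
df_1 <- data.frame(
  subject = 1:4,
  choice1 = c(1L, 1L, 1L, 1L),
  choice2 = c(1L, 1L, 2L, 2L),
  choice3 = c(1L, 2L, 1L, 2L),
  choice4 = c(2L, 2L, 2L, 2L),
  choice5 = c(2L, 2L, 2L, 2L)
)

choices <- as.matrix(df_1[, -1])

switch_summary <- t(apply(choices, 1, function(x) {
  # diff(x) == 1 marks a step from A (1) to B (2);
  # adding 1 gives the index of the choice where B was picked
  to_b <- which(diff(x) == 1) + 1
  if (length(to_b) == 0) {
    c(first = NA_real_, last = NA_real_)  # subject never switched to B
  } else {
    c(first = to_b[1], last = to_b[length(to_b)])
  }
}))

result <- cbind(df_1["subject"], switch_summary)
result
```

For subject 3 this gives first = 2 and last = 4, i.e. choice2 and choice4; subjects with a single switch get the same index in both columns.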
An alternative approach:
library(dplyr)
library(stringr)
library(purrr)
df %>%
mutate(g = paste0(choice1, choice2, choice3, choice4, choice5),
switches = as.character(map(g, ~pluck(str_locate_all(.x, "12"), 1)))) %>%
select(-g)
#> subject choice1 choice2 choice3 choice4 choice5 switches
#> 1 1 1 1 1 2 2 3:4
#> 2 2 1 1 2 2 2 2:3
#> 3 3 1 2 1 2 2 c(1, 3, 2, 4)
#> 4 4 1 2 2 2 2 1:2
data
df <- structure(list(subject = 1:4, choice1 = c(1L, 1L, 1L, 1L), choice2 = c(1L,
1L, 2L, 2L), choice3 = c(1L, 2L, 1L, 2L), choice4 = c(2L, 2L,
2L, 2L), choice5 = c(2L, 2L, 2L, 2L)), class = "data.frame", row.names = c("1",
"2", "3", "4"))
Created on 2020-07-10 by the reprex package (v0.3.0)
I have this df:
> df <- data.frame(Adults = sample(0:5, 10, replace = TRUE),
+ Children = sample(0:2, 10, replace = TRUE),
+ Teens = sample(1:3, 10, replace = TRUE),
+ stringsAsFactors = FALSE)
> df
Adults Children Teens
1 5 0 1
2 5 1 2
3 5 2 3
4 5 2 2
5 0 1 2
6 5 1 3
7 0 2 3
8 4 2 1
9 4 0 1
10 1 2 1
We can see that Children doesn't have 3,4,5 values and Teens doesn't have 0,4,5 values. However, we know that Adults, Children, and Teens could have from 0 to 5.
When I use group_by() with summarise(), summarise drops the columns I'm not grouping. The code:
df %>%
group_by(Adults) %>% mutate(n_Adults = n()) %>%
group_by(Teens) %>% mutate(n_Teens = n()) %>%
group_by(Children) %>% mutate(n_Children = n())
And when I group by c(0,1,2,3,4,5) (in order to have all the possible values) it gives me this error:
Error in mutate_impl(.data, dots) : Column `c(0, 1, 2, 3, 4, 5)` must be length 10 (the number of rows) or one, not 6
I'm looking for this output:
Values n_Adults n_Children n_Teens p_Adults p_Children p_Teens
0 2 2 0 0.2 0.2 0
1 1 3 4 0.1 0.3 0.4
2 0 5 3 0 0.5 0.3
3 0 0 3 0 0 0.3
4 2 0 0 0.2 0 0
5 5 0 0 0.5 0 0
Where n_ is the count for the respective column and p_ is the proportion for the respective column.
We can gather the data into 'long' format, convert 'value' to a factor with levels specified as 0:5, get the frequencies with count, spread back to 'wide' format, and create the 'p' columns by dividing each count column by its sum. If needed, the count columns can be renamed with rename_at.
library(tidyverse)
gather(df) %>%
count(key, value = factor(value, levels = 0:5)) %>%
spread(key, n, fill = 0) %>%
mutate_at(2:4, list(p = ~./sum(.)))%>%
rename_at(2:4, ~ paste0(.x, "_n"))
data
df <- structure(list(Adults = c(1L, 1L, 4L, 3L, 3L, 5L, 1L, 4L, 4L,
1L), Children = c(1L, 1L, 2L, 2L, 0L, 2L, 0L, 0L, 1L, 0L), Teens = c(1L,
2L, 3L, 1L, 1L, 3L, 1L, 2L, 2L, 1L)), class = "data.frame", row.names = c(NA,
-10L))
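For comparison, here is a rough base R sketch of the same idea, using the data above. The key step is again the factor with explicit levels 0:5, so that values which never occur are counted as 0. The output column names follow the n_/p_ pattern from the question; everything else is an illustrative assumption.

```r
df <- data.frame(
  Adults   = c(1L, 1L, 4L, 3L, 3L, 5L, 1L, 4L, 4L, 1L),
  Children = c(1L, 1L, 2L, 2L, 0L, 2L, 0L, 0L, 1L, 0L),
  Teens    = c(1L, 2L, 3L, 1L, 1L, 3L, 1L, 2L, 2L, 1L)
)

# table() on a factor with fixed levels keeps the unobserved values as zero counts
counts <- sapply(df, function(x) table(factor(x, levels = 0:5)))

# Column-wise proportions (each column sums to 1)
props <- prop.table(counts, margin = 2)

out <- data.frame(Values = 0:5, counts, props, row.names = NULL)
names(out)[-1] <- c(paste0("n_", names(df)), paste0("p_", names(df)))
out
```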
library(reprex)
library(tidyverse)
set.seed(20)
df <- data.frame(Adults = sample(0:5, 10, replace = TRUE),
Children = sample(0:2, 10, replace = TRUE),
Teens = sample(1:3, 10, replace = TRUE),
stringsAsFactors = FALSE)
df
#> Adults Children Teens
#> 1 5 2 2
#> 2 4 2 1
#> 3 1 0 2
#> 4 3 2 1
#> 5 5 0 1
#> 6 5 1 1
#> 7 0 0 3
#> 8 0 0 3
#> 9 1 0 1
#> 10 2 2 3
df_adults <- df %>%
count(Adults) %>%
rename( n_Adults = n)
df_children <- df %>%
count(Children) %>%
rename( n_Children = n)
df_teens <- df %>%
count(Teens) %>%
rename( n_Teens = n)
df_new <- data.frame(unique_id = 0:5)
df_new <- left_join(df_new,df_adults, by = c("unique_id"="Adults"))
df_new <- left_join(df_new,df_children, by = c("unique_id"="Children"))
df_new <- left_join(df_new,df_teens, by = c("unique_id"="Teens"))
df_new <- df_new %>%
replace_na(list( n_Adults=0, n_Children=0, n_Teens=0))
df_new %>%
mutate(p_Adults = n_Adults/sum(n_Adults),p_Children = n_Children/sum(n_Children), p_Teens = n_Teens/sum(n_Teens))
#> unique_id n_Adults n_Children n_Teens p_Adults p_Children p_Teens
#> 1 0 2 5 0 0.2 0.5 0.0
#> 2 1 2 1 5 0.2 0.1 0.5
#> 3 2 1 4 2 0.1 0.4 0.2
#> 4 3 1 0 3 0.1 0.0 0.3
#> 5 4 1 0 0 0.1 0.0 0.0
#> 6 5 3 0 0 0.3 0.0 0.0
Created on 2019-02-25 by the reprex package (v0.2.1)
We have some tidy data with treatments (multiple samples and control), time points, and measured values. I want to normalize all the samples by dividing by the corresponding time point in the control variable.
I know how I would do this with each value in its own column, but can't figure out how to use a combination of gather, mutate, summarise, etc. from tidyr or dplyr to do this in a straightforward way.
Here is a sample data frame definition:
structure(list(time = c(1, 2, 3, 1, 2, 3, 1, 2, 3),
value = c(10, 20, 15, 100, 210, 180, 110, 180, 140),
as.factor.treat. = structure(c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L),
.Label = c("c", "t1", "t2"), class = "factor")),
.Names = c("time", "value", "treat"),
row.names = c(NA, -9L), class = "data.frame")
Data frame looks like this:
time value treat
1 10 c
2 20 c
3 15 c
1 100 t1
2 210 t1
3 180 t1
1 110 t2
2 180 t2
3 140 t2
Expected output: the same, but with a normvalue column containing c(1, 1, 1, 10, 10.5, 12, 11, 9, 9.333333).
I'd like to get a column of normalized values for each treatment and time point using tidyverse procedures...
If you group by time (assuming that, as in the example, it is the grouping variable for time-point) then we can use bracket notation in a mutate statement to search only within the group. We can use that to access the control value for each group and then divide the un-normalized value by that:
df %>%
group_by(time) %>%
mutate(value.norm = value / value[treat == 'c'])
# A tibble: 9 x 4
# Groups: time [3]
time value treat value.norm
<dbl> <dbl> <fct> <dbl>
1 1 10 c 1
2 2 20 c 1
3 3 15 c 1
4 1 100 t1 10
5 2 210 t1 10.5
6 3 180 t1 12
7 1 110 t2 11
8 2 180 t2 9
9 3 140 t2 9.33
All this does is take the value column of each row and divide it by the value for the control sample with the same time value. As you can see, it doesn't care if sample t1 is missing an observation for time == 1:
df <- structure(list(time = c(1, 2, 3, 2, 3, 1, 2, 3),
value = c(10, 20, 15, 210, 180, 110, 180, 140),
as.factor.treat. = structure(c(1L, 1L, 1L, 2L, 2L, 3L, 3L, 3L),
.Label = c("c", "t1", "t2"), class = "factor")),
.Names = c("time", "value", "treat"),
row.names = c(NA, -8L), class = "data.frame")
df %>%
group_by(time) %>%
mutate(value.norm = value / value[treat == 'c'])
# A tibble: 8 x 4
# Groups: time [3]
time value treat value.norm
<dbl> <dbl> <fct> <dbl>
1 1 10 c 1
2 2 20 c 1
3 3 15 c 1
4 2 210 t1 10.5
5 3 180 t1 12
6 1 110 t2 11
7 2 180 t2 9
8 3 140 t2 9.33
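One caveat: value[treat == 'c'] assumes there is exactly one control row per time point. A sketch of an alternative in base R that looks up the control value by time with match() and yields NA where a time point has no control. The extra t1 row at time 4 is invented here purely to show that case.

```r
df <- data.frame(
  time  = c(1, 2, 3, 1, 2, 3, 4),
  value = c(10, 20, 15, 100, 210, 180, 50),
  treat = c("c", "c", "c", "t1", "t1", "t1", "t1")  # time 4 has no control row
)

# Control rows only, used as a lookup table keyed by time
control <- df[df$treat == "c", ]

# match() finds the control row for each time; unmatched times give NA
df$value.norm <- df$value / control$value[match(df$time, control$time)]
df
```

If a time point could have several control rows, match() would silently use the first one, so it may be worth checking for duplicates beforehand.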
I have a data frame like this:
CriterionVar Var1 Var2 Var3
3 0 0 0
1 0 0 0
2 0 0 0
5 0 0 0
I want to recode the values of Var1, Var2, and Var3 based on the value of CriterionVar. In pseudocode, it would be something like this:
for each row
if (CriterionVar.value >= Var1.index) Var1 = 1
if (CriterionVar.value >= Var2.index) Var2 = 1
if (CriterionVar.value >= Var3.index) Var3 = 1
The recoded data frame would look like this:
CriterionVar Var1 Var2 Var3
3 1 1 1
1 1 0 0
2 1 1 0
5 1 1 1
Obviously, that is not the way to get it done because (1) the number of VarN columns is determined by a data value, and (2) it's just ugly.
Any help is appreciated.
For more general values of CriterionVar, you can use outer to construct a logical matrix which you can use for indexing like this:
dat[2:4][outer(dat$CriterionVar, seq_along(names(dat)[-1]), ">=")] <- 1
In this example, this returns
dat
CriterionVar Var1 Var2 Var3
1 3 1 1 1
2 1 1 0 0
3 2 1 1 0
4 5 1 1 1
A second method using col, which returns a matrix of the column index, is a tad bit more direct:
dat[2:4][dat$CriterionVar >= col(dat[-1])] <- 1
and returns the desired result.
data
dat <-
structure(list(CriterionVar = c(3L, 1L, 2L, 5L), Var1 = c(0L,
0L, 0L, 0L), Var2 = c(0L, 0L, 0L, 0L), Var3 = c(0L, 0L, 0L, 0L
)), .Names = c("CriterionVar", "Var1", "Var2", "Var3"), class = "data.frame",
row.names = c(NA, -4L))
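To see what the logical index looks like, here is a small sketch of the intermediate matrix, using the CriterionVar values from the example (the names mask, n_vars and recoded are only for illustration):

```r
criterion <- c(3L, 1L, 2L, 5L)
n_vars <- 3  # number of VarN columns

# mask[i, j] is TRUE when criterion[i] >= j
mask <- outer(criterion, seq_len(n_vars), ">=")
mask

# Multiplying by 1L turns TRUE/FALSE into 1/0, i.e. the recoded Var columns
recoded <- mask * 1L
recoded
```

Each row of mask corresponds to one row of dat and each column to one VarN column, which is why it can be used directly to index dat[2:4].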
df[,-1] = lapply(2:NCOL(df), function(i) as.numeric(df[,1] >= (i-1)))
df
# CriterionVar Var1 Var2 Var3
#1 3 1 1 1
#2 1 1 0 0
#3 2 1 1 0
#4 5 1 1 1
DATA
df = structure(list(CriterionVar = c(3L, 1L, 2L, 5L), Var1 = c(1,
1, 1, 1), Var2 = c(1, 0, 1, 1), Var3 = c(1, 0, 0, 1)), .Names = c("CriterionVar",
"Var1", "Var2", "Var3"), row.names = c(NA, -4L), class = "data.frame")
I'm a big proponent of vapply: it's fast, and you know the shape of what it'll return. The only problem is the resulting matrix is usually the "sideways" version of what you want. But t() fixes that easily enough.
n_var_cols <- 3
truncated_criterion <- pmin(dat[["CriterionVar"]], n_var_cols)
row_template <- rep_len(0, n_var_cols)
replace_up_to_index <- function(index) {
replace(row_template, seq_len(index), 1)
}
over_matrix <- vapply(
X = truncated_criterion,
FUN = replace_up_to_index,
FUN.VALUE = row_template
)
over_matrix <- t(over_matrix)
dat[, -1] <- over_matrix
dat
# CriterionVar Var1 Var2 Var3
# 1 3 1 1 1
# 2 1 1 0 0
# 3 2 1 1 0
# 4 5 1 1 1
There was some bookkeeping in the first three lines, but nothing too bad. I used pmin() to restrict the criteria values to be no greater than the number of VarN columns.
I've been using the dplyr package to create aggregated data tables, for example using the following code:
agg_data <- df %>%
select(calc.method, price1, price2) %>%
group_by(calc.method) %>%
summarize(
count = n(),
mean_price1 = round(mean(price1, na.rm = TRUE),2),
mean_price2 = round(mean(price2, na.rm = TRUE),2))
However, I would like to only calculate the mean over the distinct values of price1 and price2 within groups
e.g:
Price1: 1 1 2 1 2 2 1
Goes to (before aggregation):
Price1: 1 2 1 2 1
(In general, price1 and price2 won't have the same number of values left after removal.) I would also like to calculate a count for each of price1 and price2, counting only the distinct values within groups. (Groups are defined as two or more identical values adjacent to each other.)
I have tried:
agg_data <- df %>%
select(calc.method, price1, price2) %>%
group_by(calc.method) %>%
summarize(
count = n(),
mean_price1 = round(mean(distinct(price1), na.rm = TRUE),2),
mean_price2 = round(mean(distinct(price2), na.rm = TRUE),2))
And also tried wrapping the columns within the select function with distinct(), but both these throw errors.
Is there a way to do this using dplyr or another similar package without having to write something from scratch?
To satisfy your requirement for distinct, we need to remove successive values that are the same. For numeric vectors, this can be accomplished by:
x <- x[c(1, which(diff(x) != 0)+1)]
The default use of diff computes the difference between adjoining elements in the vector. We use this to detect successive values that are different, for which diff(x) != 0. Since the output differences are lagged by 1, we add 1 to the indices of these distinct elements, and we also want the first element as distinct. For example:
x <- c(1,1,2,1,2,2,1)
x <- x[c(1, which(diff(x) != 0)+1)]
##[1] 1 2 1 2 1
We can then use this with dplyr:
agg_data <- df %>% group_by(calc.method) %>%
summarize(count = n(),
count_non_rep_1 = length(price1[c(1,which(diff(price1) != 0)+1)]),
mean_price1 = round(mean(price1[c(1,which(diff(price1) != 0)+1)], na.rm=TRUE),2),
count_non_rep_2 = length(price2[c(1,which(diff(price2) != 0)+1)]),
mean_price2 = round(mean(price2[c(1,which(diff(price2) != 0)+1)], na.rm=TRUE),2))
or, better yet, define the function:
remove.repeats <- function(x) {
x[c(1,which(diff(x) != 0)+1)]
}
and use it with dplyr:
agg_data <- df %>% group_by(calc.method) %>%
summarize(count = n(),
count_non_rep_1 = length(remove.repeats(price1)),
mean_price1 = round(mean(remove.repeats(price1), na.rm=TRUE),2),
count_non_rep_2 = length(remove.repeats(price2)),
mean_price2 = round(mean(remove.repeats(price2), na.rm=TRUE),2))
Using this on some example data that is hopefully similar to yours:
df <- structure(list(calc.method = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("A", "B"), class = "factor"),
price1 = c(1, 1, 2, 1, 2, 2, 1, 1, 1, 2, 2, 2, 2, 1, 3),
price2 = c(1, 1, 1, 1, 1, 1, 1, 2, 1, 2, 1, 2, 1, 2, 1)),
.Names = c("calc.method", "price1", "price2"), row.names = c(NA, -15L), class = "data.frame")
## calc.method price1 price2
##1 A 1 1
##2 A 1 1
##3 A 2 1
##4 A 1 1
##5 A 2 1
##6 A 2 1
##7 A 1 1
##8 B 1 2
##9 B 1 1
##10 B 2 2
##11 B 2 1
##12 B 2 2
##13 B 2 1
##14 B 1 2
##15 B 3 1
We get:
print(agg_data)
### A tibble: 2 x 6
## calc.method count count_non_rep_1 mean_price1 count_non_rep_2 mean_price2
## <fctr> <int> <int> <dbl> <int> <dbl>
##1 A 7 5 1.40 1 1.0
##2 B 8 4 1.75 8 1.5
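As a side note, base R's rle() (run-length encoding) collapses runs of identical adjacent values in much the same way. A minimal sketch of an alternative helper, with remove_repeats_rle being a made-up name; I have not compared how the two approaches behave with NA values, so treat that as an open assumption.

```r
# rle() splits a vector into runs of equal adjacent values;
# $values holds one entry per run, $lengths the length of each run
remove_repeats_rle <- function(x) {
  rle(x)$values
}

x <- c(1, 1, 2, 1, 2, 2, 1)
collapsed <- remove_repeats_rle(x)
collapsed          # 1 2 1 2 1
length(collapsed)  # 5 values remain after collapsing runs
mean(collapsed)    # 1.4
```

It can be dropped into the summarize() call in place of remove.repeats().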