How to mutate new variables with different conditions in R

Say I have a df.
df = data.frame(status = c(1, 0, 0, 0, 1, 0, 0, 0),
stratum = c(1,1,1,1, 2,2,2,2),
death = 1:8)
> df
status stratum death
1 1 1 1
2 0 1 2
3 0 1 3
4 0 1 4
5 1 2 5
6 0 2 6
7 0 2 7
8 0 2 8
I want to create a new variable named weights with mutate. It should meet the following conditions:
weights should be computed within each stratum group.
weights should take the death value of the row where status is 1.
What I expect looks like this:
df_wanted = data.frame(status = c(1, 0, 0, 0, 1, 0, 0, 0),
stratum = c(1,1,1,1, 2,2,2,2),
death = 1:8,
weights = c(1,1,1,1, 5,5,5,5))
> df_wanted
status stratum death weights
1 1 1 1 1
2 0 1 2 1
3 0 1 3 1
4 0 1 4 1
5 1 2 5 5
6 0 2 6 5
7 0 2 7 5
8 0 2 8 5
I do not know how to write the code.
Any help will be highly appreciated!

You may get the death value where status = 1.
library(dplyr)
df %>%
group_by(stratum) %>%
mutate(weights = death[status == 1]) %>%
ungroup
The above works because there is exactly one row in each group where status == 1. If a group can have zero rows, or more than one row, where status == 1, then a better option is to use match, which returns NA when there is no such row and the first matching death value when there are several.
df %>%
group_by(stratum) %>%
mutate(weights = death[match(1, status)]) %>%
ungroup
# status stratum death weights
# <dbl> <dbl> <int> <int>
#1 1 1 1 1
#2 0 1 2 1
#3 0 1 3 1
#4 0 1 4 1
#5 1 2 5 5
#6 0 2 6 5
#7 0 2 7 5
#8 0 2 8 5
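To check how the match version behaves in the two edge cases mentioned above, here is a minimal sketch on made-up data. The data frame df_edge is hypothetical and not from the question: stratum 1 has two status == 1 rows and stratum 2 has none.

```r
library(dplyr)

# Hypothetical data: stratum 1 has two status == 1 rows, stratum 2 has none
df_edge <- data.frame(status  = c(1, 0, 1, 0, 0, 0),
                      stratum = c(1, 1, 1, 2, 2, 2),
                      death   = 1:6)

res <- df_edge %>%
  group_by(stratum) %>%
  # match(1, status) is the position of the first status == 1, or NA if there is none
  mutate(weights = death[match(1, status)]) %>%
  ungroup()

res$weights
# 1 1 1 NA NA NA
```

Stratum 1 takes the death value of its first status == 1 row, and stratum 2 gets NA instead of an error.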

Related

Counting Frequencies of Sequences

Suppose there are two students. Each student takes an exam multiple times (e.g. result_id = 1 is the first exam, result_id = 2 is the second exam, etc.). The student can either "pass" (1) or "fail" (0).
The data looks something like this:
library(data.table)
my_data = data.frame(id = c(1,1,1,1,1,1,2,2,2,2,2,2,2,2,2), results = c(0,1,0,1,0,0,1,1,1,0,1,1,0,1,0), result_id = c(1,2,3,4,5,6,1,2,3,4,5,6,7,8,9))
my_data = setDT(my_data)
id results result_id
1: 1 0 1
2: 1 1 2
3: 1 0 3
4: 1 1 4
5: 1 0 5
6: 1 0 6
7: 2 1 1
8: 2 1 2
9: 2 1 3
10: 2 0 4
11: 2 1 5
12: 2 1 6
13: 2 0 7
14: 2 1 8
15: 2 0 9
I am interested in counting the number of times that a student passes an exam, given that the student passed the previous two exams.
I tried to do this with the following code:
my_data$current_exam = shift(my_data$results, 0)
my_data$prev_exam = shift(my_data$results, 1)
my_data$prev_2_exam = shift(my_data$results, 2)
# Count the number of exam results for each record
out <- my_data[!is.na(prev_exam), .(tally = .N), by = .(id, current_exam, prev_exam, prev_2_exam)]
out = na.omit(out)
My code produces the following results:
> out
id current_exam prev_exam prev_2_exam tally
1: 1 0 1 0 2
2: 1 1 0 1 1
3: 1 0 0 1 1
4: 2 1 0 0 1
5: 2 1 1 0 2
6: 2 1 1 1 1
7: 2 0 1 1 2
8: 2 1 0 1 2
9: 2 0 1 0 1
However, I do not think that my code is correct.
For example, with Student_ID = 2 :
My code says that "Current_Exam = 1, Prev_Exam = 1, Prev_2_Exam = 0" happens 1 time, but looking at the actual data - this does not happen at all
Can someone please show me what I am doing wrong and how I can correct this?
Note: I think that this should be the expected output:
> expected_output
id current_exam prev_exam prev_2_exam tally
1: 1 0 1 0 2
2: 1 1 0 1 1
3: 1 0 0 1 1
4: 2 1 0 0 1
5: 2 1 1 0 1
6: 2 1 1 1 1
7: 2 0 1 1 2
8: 2 1 0 1 2
9: 2 0 1 0 0
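You shifted the results over the whole column, so the last results of one id leak into the first rows of the next id. The shift has to be done within each id, with NA placed in the rows that have no previous exam.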
. <- my_data[order(my_data$id, my_data$result_id),] #sort if needed
.$p1 <- ave(.$results, .$id, FUN = \(x) c(NA, x[-length(x)]))
.$p2 <- ave(.$p1, .$id, FUN = \(x) c(NA, x[-length(x)]))
aggregate(list(tally=.$p1), .[c("id","results", "p1", "p2")], length)
# id results p1 p2 tally
#1 1 0 1 0 2
#2 2 0 1 0 1
#3 2 1 1 0 1
#4 1 0 0 1 1
#5 1 1 0 1 1
#6 2 1 0 1 2
#7 2 0 1 1 2
#8 2 1 1 1 1
.
# id results result_id p1 p2
#1 1 0 1 NA NA
#2 1 1 2 0 NA
#3 1 0 3 1 0
#4 1 1 4 0 1
#5 1 0 5 1 0
#6 1 0 6 0 1
#7 2 1 1 NA NA
#8 2 1 2 1 NA
#9 2 1 3 1 1
#10 2 0 4 1 1
#11 2 1 5 0 1
#12 2 1 6 1 0
#13 2 0 7 1 1
#14 2 1 8 0 1
#15 2 0 9 1 0
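Since the question already uses data.table, the same per-student shifting could also be sketched with shift() and by = id. This is an assumed alternative to the ave() code above, not part of the original answer. It assumes the rows are already sorted by result_id within each id, and it drops rows that do not have two previous exams before counting.

```r
library(data.table)

my_data <- data.table(
  id        = c(1,1,1,1,1,1,2,2,2,2,2,2,2,2,2),
  results   = c(0,1,0,1,0,0,1,1,1,0,1,1,0,1,0),
  result_id = c(1,2,3,4,5,6,1,2,3,4,5,6,7,8,9)
)

# Shift within each id so one student's results never reach the next student's rows
my_data[, c("prev_exam", "prev_2_exam") := shift(results, 1:2), by = id]

# Tally each (current, previous, previous-but-one) pattern per student
out <- my_data[!is.na(prev_2_exam),
               .(tally = .N),
               by = .(id, current_exam = results, prev_exam, prev_2_exam)]
out
```

Patterns that never occur are simply absent from out rather than listed with a tally of 0.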
An option would be to use filter (stats::filter, not dplyr::filter) to compute a running sum over the current and the two previous exams; a value of 3 marks three passes in a row.
cbind(., n=ave(.$results, .$id, FUN = \(x) filter(x, c(1,1,1), sides=1)))
# id results result_id n
#1 1 0 1 NA
#2 1 1 2 NA
#3 1 0 3 1
#4 1 1 4 2
#5 1 0 5 1
#6 1 0 6 1
#7 2 1 1 NA
#8 2 1 2 NA
#9 2 1 3 3
#10 2 0 4 2
#11 2 1 5 2
#12 2 1 6 2
#13 2 0 7 2
#14 2 1 8 2
#15 2 0 9 1
If only the number of times that a student passes an exam, given that the student passed the previous two exams, is needed:
sum(ave(.$results, .$id, FUN = \(x) filter(x, c(1,1,1))==3), na.rm=TRUE)
#[1] 1
sum(ave(.$results, .$id, FUN = \(x)
x==1 & c(x[-1], 0) == 1 & c(x[-1:-2], 0, 0) == 1))
#[1] 1
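As a small illustration of what the running-sum filter does, here is a sketch on the results of student 2 alone. The call is written as stats::filter so that it still works when dplyr is attached, since dplyr::filter masks the base function; the variable names are only illustrative.

```r
# Results of student 2 from the question
x <- c(1, 1, 1, 0, 1, 1, 0, 1, 0)

# One-sided filter: each value is the sum of the current and the two previous results
roll <- stats::filter(x, rep(1, 3), sides = 1)
as.numeric(roll)
# NA NA 3 2 2 2 2 2 1

# Number of exams passed directly after two passed exams
sum(roll == 3, na.rm = TRUE)
# 1
```

The first two positions are NA because there are not yet two previous exams to sum over.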
When trying to count events that happen in series, cumsum() comes in quite handy. As opposed to creating multiple lagged variables, this scales well to counts across a larger number of events:
library(tidyverse)
d <- my_data |>
group_by(id) |> # group to cumulate within student only
mutate(
csum = cumsum(results), # cumulative sum of results
i = csum - lag(csum, 3, 0) # subtract the cumulative sum from 3 observations before. This gives the number of exams passed in the current and previous 2 observations.
)
# Ungroup to get global count
d |>
ungroup() |>
count(i == 3) # Count the number of cases where the number of exams passed within 3 observations equals 3
#> # A tibble: 2 × 2
#> `i == 3` n
#> <lgl> <int>
#> 1 FALSE 14
#> 2 TRUE 1
# Retaining the group gives counts by student
d |>
count(i == 3) # Count the number of cases where the number of exams passed within 3 observations equals 3
#> # A tibble: 3 × 3
#> # Groups: id [2]
#> id `i == 3` n
#> <dbl> <lgl> <int>
#> 1 1 FALSE 6
#> 2 2 FALSE 8
#> 3 2 TRUE 1
Since you provided the data as data.table, here is how to do the same in that ecosystem:
my_data[ , csum := cumsum(results), .(id)]
my_data[ , i := csum - shift(csum, 3, fill = 0), .(id)] # shift() is data.table's own lag, so dplyr is not needed here
my_data[ , .(n_cases = sum(i ==3)), id]
#> id n_cases
#> 1: 1 0
#> 2: 2 1
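The cumulative-sum trick generalizes to any window length. Below is a minimal base R sketch; window_sum is a hypothetical helper name introduced here for illustration, and k is the window length (3 in the question).

```r
# Hypothetical helper: number of passes in the current and the previous k - 1 exams
window_sum <- function(x, k) {
  cs <- cumsum(x)
  # cumulative sum k positions earlier, padded with 0 at the start
  lagged <- c(rep(0, min(k, length(x))), head(cs, -k))
  cs - lagged
}

results <- c(0,1,0,1,0,0, 1,1,1,0,1,1,0,1,0)
id      <- rep(1:2, times = c(6, 9))

# Apply the window within each student, then count full windows per student
w      <- ave(results, id, FUN = function(x) window_sum(x, 3))
counts <- tapply(w == 3, id, sum)
counts
# student 1: 0, student 2: 1
```

Changing the 3 in both places to another k counts runs of k consecutive passes without adding more lagged columns.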
Here's an approach using dplyr. It uses the lag function to look back 1 and 2 results. If the sum together with the current result is 3, then the condition is met. In the example you provided, the condition is only met once.
my_data %>%
group_by(id) %>%
mutate(threex = ifelse(results + lag(results,1) + lag(results, 2) == 3, 1, 0)) %>%
filter(!is.na(threex))
id results result_id threex
<dbl> <dbl> <dbl> <dbl>
1 1 0 3 0
2 1 1 4 0
3 1 0 5 0
4 1 0 6 0
5 2 1 3 1
6 2 0 4 0
7 2 1 5 0
8 2 1 6 0
9 2 0 7 0
10 2 1 8 0
11 2 0 9 0
If you then just want to capture the cases when the condition is met, add a filter.
my_data %>%
group_by(id) %>%
mutate(threex = ifelse(results + lag(results,1) + lag(results, 2) == 3, 1, 0)) %>%
filter(threex == 1)
id results result_id threex
<dbl> <dbl> <dbl> <dbl>
1 2 1 3 1
If you are looking to understand how many times the condition is met per id, you can do this.
my_data %>%
group_by(id) %>%
mutate(threex = ifelse(results + lag(results,1) + lag(results, 2) == 3, 1, 0)) %>%
filter(threex == 1) %>%
select(id) %>%
summarize(count = n())
id count
<dbl> <int>
1 2 1
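Note that filtering first drops every id that never meets the condition. If ids with a count of zero should stay in the result, one possible variation (a sketch, not part of the answer above) is to sum the condition directly inside summarize:

```r
library(dplyr)

my_data <- data.frame(
  id        = c(1,1,1,1,1,1,2,2,2,2,2,2,2,2,2),
  results   = c(0,1,0,1,0,0,1,1,1,0,1,1,0,1,0),
  result_id = c(1,2,3,4,5,6,1,2,3,4,5,6,7,8,9)
)

res <- my_data %>%
  group_by(id) %>%
  # the first two rows of each id compare against NA and are ignored via na.rm
  summarize(count = sum(results + lag(results, 1) + lag(results, 2) == 3,
                        na.rm = TRUE))
res
# id 1 has count 0, id 2 has count 1
```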

How to track changes in a long format with dplyr?

let's say I had the following dataset in long format, where the id variable represents the participants (group variable):
id wave car household
 1    1   0         1
 1    2   1         1
 1    3   0         1
 1    4   1         2
 2    1   0         1
 2    2   1         2
 2    3   1         3
 2    4   0         1
 3    1   0         1
 3    2   0         1
 3    3   1         2
 3    4   1         1
 4    1   0         1
 4    2   1         1
 4    3   1         1
 4    4   1         2
The variable "car" tells whether someone owns a car or not. The variable "household" indicates how many people live in the same household. As you can see, all participants start without owning a car and living alone in the household.
I now want to determine the changes longitudinally so that I end up with
a) only those subjects who own a car and
b) only those subjects who own a car + live with only one other person (not two or more people) in the household.
In each case, only the first change should be counted and as soon as the car is sold (or more than two people live in the household), the data points should be excluded.
So condition a) would be fulfilled, for example, by subject id 1 at Wave 2. However, only this change should be counted, since subject id 1 no longer owns a car in Wave 3 and the subsequent car purchase in Wave 4 is a second purchase.
Condition b) would be fulfilled, for example, by subject id 2 at Wave 2, but from Wave 3 onwards there are three people in the household, which is why the data points from Wave 3 onwards are to be excluded. Similarly, if another person moves into the household and the subject already has a car, condition b) should become a missing value.
Whether condition a) and/ or condition b) apply is to be calculated in two separate binary variables (yes/no), named, for instance, "cond-a" and "cond-b".
Does anyone know how to do this most cleverly, for example with dplyr (or other R packages)?
I would be extremely grateful for an answer!
I know, that I probably can do this with the group_by() function from dplyr, right?
Here is the code of the data.frame used in this example:
id <- c(1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4)
wave <- c(1,2,3,4,1,2,3,4,1,2,3,4,1,2,3,4)
car <- c(0,1,0,1,0,1,1,0,0,0,1,1,0,1,0,1)
household <- c(1,1,1,2,1,2,3,1,1,1,2,1,1,2,1,2)
df <- data.frame(id,wave,car,household)
The expected output should be like:
id wave car household Cond-A Cond-B
 1    1   0         1      0      0
 1    2   1         1      1      0
 1    3   0         1      0      0
 1    4   1         2     NA     NA
 2    1   0         1      0      0
 2    2   1         2      0      1
 2    3   1         3     NA     NA
 2    4   0         1      0      0
 3    1   0         1      0      0
 3    2   0         1      0      0
 3    3   1         2      0      1
 3    4   1         1      1     NA
 4    1   0         1      0      0
 4    2   1         1      1      0
 4    3   1         1      1      0
 4    4   1         2     NA     NA
Edit: Subject 1, Wave 4 is NA because she/ he had already owned a car before (see Wave 2). If car = 1 before and then car = 0 again in the meantime, the data points should be excluded from the second time car = 1 (for both Cond-A and Cond-B). Id 2, Wave 2 shows: If the change from car = 0 is not car = 1 & household = 1, but directly car = 1 & household = 2, then Cond-B shall apply, but not Cond-A. So Cond-A shall only apply if a change from car = 0 & household = 1 is to car = 1 & household = 1. I know this is a tricky question, but if anyone knows the answer, it's probably here! :)
I revised my approach; I think it now comes very close to the desired outcome.
The last piece of logic that I don't understand is id 4, wave 4: why double NA?
The core logic is built into a temporary variable called car_id, which is NA if someone doesn't have a car at wave t and otherwise holds the id of the car (1 for the first car, a larger number for any later one).
library(dplyr)
df %>%
group_by(id) %>%
mutate(car_id = rank(
ifelse(car == 1,
data.table::rleid(car == 0),
NA),
ties.method = "min",
na.last = "keep"),
condition_a = case_when(
car_id == 1 & household == 1 ~ 1,
car_id > 1 | household > 2 ~ NA_real_,
(lag(car) == 0) & car_id == 1 &
(lag(household) == 1) & household == 2 ~ 0,
TRUE ~ 0
),
condition_b =
case_when(
lag(household) != 2 & household == 2 & car_id == 1 ~ 1,
car_id > 1 | household > 2 ~ NA_real_,
lag(household == 2) & household != 2 ~ NA_real_,
household != 0 ~ 0
)
) %>%
select(!car_id)
#> # A tibble: 16 × 6
#> # Groups: id [4]
#> id wave car household condition_a condition_b
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 1 0 1 0 0
#> 2 1 2 1 1 1 0
#> 3 1 3 0 1 0 0
#> 4 1 4 1 2 NA NA
#> 5 2 1 0 1 0 0
#> 6 2 2 1 2 0 1
#> 7 2 3 1 3 NA NA
#> 8 2 4 0 1 0 0
#> 9 3 1 0 1 0 0
#> 10 3 2 0 1 0 0
#> 11 3 3 1 2 0 1
#> 12 3 4 1 1 1 NA
#> 13 4 1 0 1 0 0
#> 14 4 2 1 1 1 0
#> 15 4 3 1 1 1 0
#> 16 4 4 1 2 0 1
Created on 2023-01-22 with reprex v2.0.2
The data used are taken from the table shown in the question, not from the df code given there (the two differ for id 4):
id <- c(1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4)
wave <- c(1,2,3,4,1,2,3,4,1,2,3,4,1,2,3,4)
car <- c(0,1,0,1,0,1,1,0,0,0,1,1,0,1,1,1)
household <- c(1,1,1,2,1,2,3,1,1,1,2,1,1,1,1,2)
df <- data.frame(id,wave,car,household)
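To make the car_id step easier to follow, here is a minimal sketch that applies the same expression to a single subject's car vector. The wrapper car_spell_id is a hypothetical helper introduced only for this illustration.

```r
library(data.table)  # for rleid()

# Hypothetical helper reproducing the car_id expression for one subject
car_spell_id <- function(car) {
  # run number while a car is owned, NA while no car is owned
  run <- ifelse(car == 1, rleid(car == 0), NA)
  # the first ownership run gets rank 1, any later run gets a rank above 1
  rank(run, ties.method = "min", na.last = "keep")
}

car_spell_id(c(0, 1, 0, 1))  # NA 1 NA 2: the wave-4 car is a second purchase
car_spell_id(c(0, 1, 1, 1))  # NA 1 1 1: one uninterrupted ownership spell
```

Because of ties.method = "min", a later spell is not always numbered consecutively, but it is always greater than 1, which is all the car_id > 1 conditions rely on.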

R Loops under conditions

I am trying to achieve the following. I have been reading a lot about the tidyverse, and it has made some tasks in R easy for me. I am sure it could solve my problem below, but I do not understand how:
Data <- data.frame(
Month = c(1,1,2,2,3,3,4,5),
Buyer = c("A","A","A","A","A","A","A","A"),
Seller = c("C","D","C","D","C","D","D","D"),
Exit = c("0","0","0","0","1","0","0","1"),
End = c("0","0","0","0","0","0","0","1"),
Begin = c("1","1","0","0","0","0","0","0"),
Dist_fromBegin = c("0","0","1","1","2","2","3","4"),
Dist_fromExit = c("4","4","3","3","2","2","1","0")
)
View(Data)
I have the following code for Begin:
setDT(Data)[order(Month, Buyer, Seller), Begin:= {
r <- rowid(Buyer, Seller)
+(r==1L)
}]
But I cannot manage to calculate Exit (1 if this is the last date of A dealing with C or D), End (1 if this is the last date for the Buyer "A"), DistanceBegin (Date Exit - Date Begin) and Distance Exit (Date End - Date Exit).
Date End is the last date on which we see the Buyer (here A), while Exit is the last time we record the relationship between "A" and "C" (or "D").
I have tried several things with dplyr and mutate, grouping by Buyer, Seller and Month, but with no result so far.
Thank you in advance.
If we consider the initial data frame df1 to have only Month, Buyer and Seller information, most of your calculations can be done as shown in the code below by just using mutate.
I added one more buyer and one more seller for the second buyer just to generalize the solution. The code below should meet your needs. You can add more variables as needed.
library(dplyr)
df1 <- data.frame(
Month = c(1,1,2,2,3,3,4,5),
Buyer = c("A","A","A","A","A","A","A","A"),
Seller = c("C","D","C","D","C","D","D","D")
)
df2 <- data.frame(
Month = c(1,1,2,3,3,3,4,4),
Buyer = c(rep("B",8)),
Seller = c("C","D","C","D","E","D","D","E")
)
df3 <- rbind(df1,df2)
dfe <- df3 %>%
group_by(Buyer,Seller) %>%
slice(tail(row_number(), 1)) %>% mutate(Exit=1, DateExit=Month)
dfe1 <- dfe %>% select(Month,Buyer,Seller,Exit)
dfe2 <- dfe %>% select(Buyer,Seller,DateExit)
df4 <- merge(df3,dfe1, by=c("Month","Buyer","Seller"), all=TRUE)
df <- left_join(df4,dfe2, by=c("Buyer","Seller")) # left_join() has no all argument; all rows of df4 are kept anyway
df[is.na(df)] <- 0
dfa <- df %>% group_by(Buyer) %>%
mutate(minM=min(Month),maxM=max(Month)) %>%
mutate(Begin = ifelse(minM==Month,1,0),
End = ifelse(maxM==Month,1,0),
Dist_fromBegin = (Month - minM),
Dist_fromEnd = (maxM - Month),
Dist_Begin2Exit = (DateExit - minM),
Dist_Exit2End = (maxM - DateExit),
Distance2Exit = (DateExit - Month)
) %>% select(-minM,-maxM)
dfb <- dfa[order(dfa$Buyer,dfa$Month,dfa$Seller,dfa$Exit),]
dfb
> dfb
# A tibble: 16 x 12
# Groups: Buyer [2]
Month Buyer Seller Exit DateExit Begin End Dist_fromBegin Dist_fromEnd Dist_Begin2Exit Dist_Exit2End Distance2Exit
<dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 A C 0 3 1 0 0 4 2 2 2
2 1 A D 0 5 1 0 0 4 4 0 4
3 2 A C 0 3 0 0 1 3 2 2 1
4 2 A D 0 5 0 0 1 3 4 0 3
5 3 A C 1 3 0 0 2 2 2 2 0
6 3 A D 0 5 0 0 2 2 4 0 2
7 4 A D 0 5 0 0 3 1 4 0 1
8 5 A D 1 5 0 1 4 0 4 0 0
9 1 B C 0 2 1 0 0 3 1 2 1
10 1 B D 0 4 1 0 0 3 3 0 3
11 2 B C 1 2 0 0 1 2 1 2 0
12 3 B D 0 4 0 0 2 1 3 0 1
13 3 B D 0 4 0 0 2 1 3 0 1
14 3 B E 0 4 0 0 2 1 3 0 1
15 4 B D 1 4 0 1 3 0 3 0 0
16 4 B E 1 4 0 1 3 0 3 0 0
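The same flags can also be sketched without the intermediate merge and join, by switching the grouping inside one pipeline. This is an assumed simplification of the approach above, shown on the original single-buyer data only; the result name res is illustrative.

```r
library(dplyr)

df1 <- data.frame(
  Month  = c(1,1,2,2,3,3,4,5),
  Buyer  = c("A","A","A","A","A","A","A","A"),
  Seller = c("C","D","C","D","C","D","D","D")
)

res <- df1 %>%
  group_by(Buyer) %>%                       # buyer-level dates
  mutate(Begin          = as.integer(Month == min(Month)),
         End            = as.integer(Month == max(Month)),
         Dist_fromBegin = Month - min(Month)) %>%
  group_by(Buyer, Seller) %>%               # relationship-level dates
  mutate(DateExit = max(Month),
         Exit     = as.integer(Month == DateExit)) %>%
  ungroup()

res$Exit
# 0 0 0 0 1 0 0 1
```

Calling group_by a second time replaces the previous grouping, so the Exit columns are computed per Buyer and Seller pair while Begin and End stay per Buyer.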

Create Counter with Binary Variable

I am trying to create a counter variable that starts over at 1 every time there is a change in a binary variable.
bin <- c(1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0)
df <- as.data.frame(bin)
df <- df %>%
group_by(bin) %>%
mutate(cntr = row_number())
I would like to get the following results:
bin cntr
1 1
0 1
0 2
1 1
1 2
1 3
...
But instead I'm getting:
1 1
0 1
0 2
1 2
1 3
1 4
I understand why this is ... I just don't know how to get my desired results. Any help would be appreciated.
You can easily do this by combining sequence and rle. No packages required.
data.frame(bin, cntr = sequence(rle(bin)$lengths))
# bin cntr
#1 1 1
#2 0 1
#3 0 2
#4 1 1
#5 1 2
#6 1 3
#7 1 4
#8 1 5
#9 0 1
#10 0 2
#11 0 3
#12 0 4
#13 1 1
#14 0 1
#15 1 1
#16 0 1
We need a run-length id to group adjacent equal elements into a single group. It can be created with rleid from data.table, or by building a logical index and taking its cumulative sum: cumsum(bin != lag(bin, default = first(bin))).
library(data.table)
library(dplyr)
df %>%
group_by(grp = rleid(bin)) %>%
mutate(cntr = row_number()) %>%
ungroup %>%
select(-grp)
# A tibble: 16 x 2
# bin cntr
# <dbl> <int>
# 1 1 1
# 2 0 1
# 3 0 2
# 4 1 1
# 5 1 2
# 6 1 3
# 7 1 4
#..
In data.table, this can be done more compactly, since := adds the column by reference:
library(data.table)
setDT(df)[, cntr := rowid(rleid(bin))]
df
# bin cntr
# 1: 1 1
# 2: 0 1
# 3: 0 2
# 4: 1 1
# 5: 1 2
# 6: 1 3
# 7: 1 4
#..
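The cumulative-sum idea mentioned above can also be written in base R only. The following is a minimal sketch; grp is an illustrative name for the run id.

```r
bin <- c(1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0)

# A new run starts at the first element and wherever the value changes
grp <- cumsum(c(TRUE, diff(bin) != 0))

# Count up within each run
cntr <- ave(bin, grp, FUN = seq_along)

data.frame(bin, cntr)
```

This gives the same counter as sequence(rle(bin)$lengths), but keeps the run id around in case it is needed for other grouped calculations.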

Wide to long data conversion

I have a dataset in 'wide' format that I would like to convert to a non-standard long format. At least, that is how I would characterize this problem.
The original dataset mimics the following:
d1 <- data.frame('id' = c(1,2),
'Q1' = c(2,3),
'Q2' = c(1,3),
'Q3' = c(3,1))
d1
id Q1 Q2 Q3
1 1 2 1 3
2 2 3 3 1
In this example, there are two individuals who have answered three questions. The answer to each question takes one of the values {1,2,3}. So, in this example, individual 1 answered 2 to Q1, 1 to Q2, and 3 to Q3. I now need to convert to a 'long' format with one row for each individual, each question and each possible answer:
d2 <- data.frame('id'= rep(seq(1:2),each=9),
'question' = rep(seq(1:3), each=3),
'option' = rep(seq(1:3)),
'choice' = 0)
d2
id question option choice
1 1 1 1 0
2 1 1 2 0
3 1 1 3 0
4 1 2 1 0
5 1 2 2 0
6 1 2 3 0
7 1 3 1 0
8 1 3 2 0
9 1 3 3 0
10 2 1 1 0
11 2 1 2 0
12 2 1 3 0
13 2 2 1 0
14 2 2 2 0
15 2 2 3 0
16 2 3 1 0
17 2 3 2 0
18 2 3 3 0
The part of I am struggling with is how to 'merge' or 'reshape' the data from d1 into d2 so that the final outcome would look like the following with the choice column reflecting the answers given in dataframe d1:
id question option choice
1 1 1 1 0
2 1 1 2 1
3 1 1 3 0
4 1 2 1 1
5 1 2 2 0
6 1 2 3 0
7 1 3 1 0
8 1 3 2 0
9 1 3 3 1
10 2 1 1 0
11 2 1 2 0
12 2 1 3 1
13 2 2 1 0
14 2 2 2 0
15 2 2 3 1
16 2 3 1 1
17 2 3 2 0
18 2 3 3 0
Individual 1 did not choose option 1 or 3 in question 1, but DID choose option 2, as indicated by the dummy coding in the choice column.
Any thoughts on this would be greatly appreciated.
d3 is the final output.
d1 <- data.frame('id' = c(1,2),
'Q1' = c(2,3),
'Q2' = c(1,3),
'Q3' = c(3,1))
library(dplyr)
library(tidyr)
d2 <- d1 %>%
gather(question, option, -id)
d3 <- d2 %>%
complete(id, question, option) %>%
left_join(d2, by = c("id", "question")) %>%
mutate(question = sub("Q", "", question)) %>%
mutate(option.y = ifelse(option.y == option.x, 1, 0)) %>%
rename(option = option.x, choice = option.y)
Update
Here is a more concise approach. d2 is the final output.
d2 <- d1 %>%
gather(question, option, -id) %>%
mutate(choice = 1) %>%
complete(id, question, option, fill = list("choice" = 0)) %>%
mutate(question = sub("Q", "", question))
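In current tidyr, gather() is superseded by pivot_longer(). Assuming a tidyr version that provides pivot_longer(), the concise approach could be sketched as follows; d_long is an illustrative name, and names_prefix strips the "Q" so no separate sub() step is needed.

```r
library(dplyr)
library(tidyr)

d1 <- data.frame(id = c(1, 2),
                 Q1 = c(2, 3),
                 Q2 = c(1, 3),
                 Q3 = c(3, 1))

d_long <- d1 %>%
  pivot_longer(-id, names_to = "question", values_to = "option",
               names_prefix = "Q") %>%
  mutate(choice = 1) %>%                                   # the observed answers
  complete(id, question, option, fill = list(choice = 0))  # add the options not chosen

d_long
```

Note that question ends up as a character column here; wrap it in as.integer() inside a mutate if a numeric column is required.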
