cumsum by participant and reset on 0 in R [duplicate]

This question already has answers here:
R cumulative sum by condition with reset (3 answers)
Cumulative sum that resets when 0 is encountered (4 answers)
Closed 1 year ago.
I have a data frame that looks like the one below. I need to count correct trials cumulatively by participant, resetting the count at each 0.
Participant TrialNumber Correct
118 1 1
118 2 1
118 3 1
118 4 1
118 5 1
118 6 1
118 7 1
118 8 0
118 9 1
118 10 1
120 1 1
120 2 1
120 3 1
120 4 1
120 5 0
120 6 1
120 7 0
120 8 1
120 9 1
120 10 1
I've tried using splitstackshape:
library(splitstackshape)
df$Count <- getanID(cbind(df$Participant, cumsum(df$Correct)))[, .id]
But it doesn't produce a running count that resets at 0, and it isn't computed per participant:
Participant TrialNumber Correct Count
118 1 1 1
118 2 1 1
118 3 1 1
118 4 1 1
118 5 1 1
118 6 1 1
118 7 1 1
118 8 0 2
118 9 1 1
118 10 1 1
120 1 1 1
120 2 1 1
120 3 1 1
120 4 1 1
120 5 0 2
120 6 1 1
120 7 0 2
120 8 1 1
120 9 1 1
120 10 1 1
I then tried using dplyr:
df %>%
  group_by(Participant) %>%
  mutate(Count = cumsum(Correct)) %>%
  ungroup() %>%
  as.data.frame()
Participant TrialNumber Correct Count
118 1 1 1
118 2 1 2
118 3 1 3
118 4 1 4
118 5 1 5
118 6 1 6
118 7 1 7
118 8 0 7
118 9 1 8
118 10 1 9
120 1 1 1
120 2 1 2
120 3 1 3
120 4 1 4
120 5 0 4
120 6 1 5
120 7 0 5
120 8 1 6
120 9 1 7
120 10 1 8
That gets me closer, but the counter still doesn't reset when it reaches a 0. If anyone has any suggestions it would be greatly appreciated, thank you.

Does this work?
library(dplyr)
library(data.table)

df %>%
  mutate(grp = rleid(Correct)) %>%   # run-length id: a new group starts at every change in Correct
  group_by(Participant, grp) %>%
  mutate(Count = cumsum(Correct)) %>%
  select(-grp)                       # grp is a grouping variable, so dplyr keeps it in the output below
# A tibble: 10 x 4
# Groups: Participant, grp [6]
grp Participant Correct Count
<int> <chr> <dbl> <dbl>
1 1 A 1 1
2 1 A 1 2
3 1 A 1 3
4 2 A 0 0
5 3 A 1 1
6 3 B 1 1
7 3 B 1 2
8 4 B 0 0
9 5 B 1 1
10 5 B 1 2
Toy data:
df <- data.frame(
  Participant = c(rep("A", 5), rep("B", 5)),
  Correct = c(1, 1, 1, 0, 1, 1, 1, 0, 1, 1)
)
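For comparison, the same reset-at-zero count can be written in base R: cumsum(Correct == 0) acts as a run id (a new run starts at every 0), and ave() applies cumsum within each participant/run combination. A minimal sketch, assuming df has the Participant and Correct columns shown above:
# Base R sketch of the same idea: group on participant plus a run id and take
# the cumulative sum inside each run; each 0 starts a new run, so it counts as 0.
df$Count <- ave(
  df$Correct,
  df$Participant,
  cumsum(df$Correct == 0),
  FUN = cumsum
)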

Related

Change column value based on final condition- but groups by previous week's IDs

Trying to figure out how to code something simple.
I have a dataset that has observations for individuals (small invertebrates) in my experiment over time, including the week, individual's id #, and the observation data of interest (parasite counts). I also have a cumulative total over time for the parasite counts, grouped by the individual's ID, which is what I will actually want per week.
I would like to drop individuals that, by the end of the experiment, never had a sample that was positive for parasites, because they were not successfully infected. My plan was to add a binary indicator column flagging whether an individual had a positive sample by the end of the experiment, based on the final cumulative total per individual id (an individual could give a positive sample one week but not the next, so checking the cumulative total is safer than checking the last observation). Then I would simply subset the data by the binary column, removing individuals who were never positive.
A very simplified version of my dataframe would look something like:
library(dplyr)

time = c(rep(1, 4), rep(2, 4), rep(3, 4), rep(4, 4))
ids = rep(101:104, 4)
observations = rep(c(25, 25, 0, 0), 4)
df = data.frame(time, ids, observations)
df2 = df %>%
  group_by(ids) %>%
  mutate(cumtot = cumsum(observations))
df2
time ids observations cumtot
<dbl> <dbl> <dbl> <dbl>
1 1 101 25 25
2 1 102 25 25
3 1 103 0 0
4 1 104 0 0
5 2 101 25 50
6 2 102 25 50
7 2 103 0 0
8 2 104 0 0
9 3 101 25 75
10 3 102 25 75
11 3 103 0 0
12 3 104 0 0
13 4 101 25 100
14 4 102 25 100
15 4 103 0 0
16 4 104 0 0
(I will eventually aggregate these data into means/SEMs by week and treatment group.)
What I have tried so far creates a binary "infected" column, but it only flags the rows from the final week whose cumulative sum is 0. What I want is for that binary outcome to be applied to all of an individual's rows in every week (so that I can drop that individual from each week's aggregate data). Not sure how to do that...
# Make a column that indicates if a snail has not shed by experiment end
df_dropped = df2 %>%
  group_by(ids) %>%
  mutate(infected = ifelse(time == max(time) & cumtot == 0, 0, 1))
df_dropped
time ids observations cumtot infected
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 101 25 25 1
2 1 102 25 25 1
3 1 103 0 0 1
4 1 104 0 0 1
5 2 101 25 50 1
6 2 102 25 50 1
7 2 103 0 0 1
8 2 104 0 0 1
9 3 101 25 75 1
10 3 102 25 75 1
11 3 103 0 0 1
12 3 104 0 0 1
13 4 101 25 100 1
14 4 102 25 100 1
15 4 103 0 0 0
16 4 104 0 0 0
I want the output to be:
time ids observations cumtot infected
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 101 25 25 1
2 1 102 25 25 1
3 1 103 0 0 0
4 1 104 0 0 0
5 2 101 25 50 1
6 2 102 25 50 1
7 2 103 0 0 0
8 2 104 0 0 0
9 3 101 25 75 1
10 3 102 25 75 1
11 3 103 0 0 0
12 3 104 0 0 0
13 4 101 25 100 1
14 4 102 25 100 1
15 4 103 0 0 0
16 4 104 0 0 0
Thanks.
You can just use any():
library(tidyverse)
df_dropped <- df2 %>%
  group_by(ids) %>%
  mutate(infected = as.numeric(any(observations > 0)))
df_dropped
#> # A tibble: 16 x 5
#> # Groups: ids [4]
#> time ids observations cumtot infected
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 101 25 25 1
#> 2 1 102 25 25 1
#> 3 1 103 0 0 0
#> 4 1 104 0 0 0
#> 5 2 101 25 50 1
#> 6 2 102 25 50 1
#> 7 2 103 0 0 0
#> 8 2 104 0 0 0
#> 9 3 101 25 75 1
#> 10 3 102 25 75 1
#> 11 3 103 0 0 0
#> 12 3 104 0 0 0
#> 13 4 101 25 100 1
#> 14 4 102 25 100 1
#> 15 4 103 0 0 0
#> 16 4 104 0 0 0
Created on 2022-02-28 by the reprex package (v2.0.1)
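From there, the subsetting step the question describes is a single filter(). A minimal sketch (df_kept is a placeholder name, not from the original answer):
# Drop individuals that never had a positive sample
df_kept <- df_dropped %>%
  ungroup() %>%
  filter(infected == 1)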

Keep previous value if it is under a certain threshold

I would like to create a variable called treatment_cont that is grouped by group as follows:
ID day day_diff treatment treatment_cont
1 0 NA 1 1
1 14 14 1 1
1 20 6 2 2
1 73 53 1 1
2 0 NA 1 1
2 33 33 1 1
2 90 57 2 2
2 112 22 3 2
2 152 40 1 1
2 178 26 4 1
Treatment_cont is the same as treatment but we want to keep the same treatment regime only when the day_diff, the difference in days between treatments, is lower than 30.
I have tried many approaches in dplyr, manipulating the table, but I cannot figure out how to do it efficiently.
Probably, a conditional mutate using case_when() and lag() might work:
df %>%
  mutate(treatment_cont = case_when(day_diff < 30 ~ treatment,
                                    TRUE ~ lag(treatment)))
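As posted, that keeps the current treatment on short gaps and lags over the whole frame. To reproduce the output shown in the next answer (previous regime carried forward on short gaps, computed per ID), the condition likely needs flipping and grouping; a hedged sketch of that variant, my reading rather than the original answerer's code:
library(dplyr)

df %>%
  group_by(ID) %>%
  mutate(treatment_cont = case_when(
    day_diff < 30 ~ lag(treatment, default = first(treatment)),  # short gap: carry the previous regime
    TRUE ~ treatment        # long gap, or first row (NA day_diff): take the current treatment
  )) %>%
  ungroup()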
You are probably looking for lag() (and perhaps its brother, lead()):
library(dplyr)
library(tidyr)  # for replace_na()

df %>%
  replace_na(list(day_diff = 0)) %>%
  group_by(ID) %>%
  arrange(day) %>%
  mutate(
    # lag(treatment), not lag(treatment_cont): the new column can't be
    # lagged while it is still being created
    treatment_cont = ifelse(day_diff < 30,
                            lag(treatment, default = treatment[1]),
                            treatment)
  ) %>%
  ungroup() %>%
  arrange(ID, day)
# A tibble: 10 x 5
      ID   day day_diff treatment treatment_cont
   <int> <int>    <dbl>     <int>          <int>
 1     1     0        0         1              1
 2     1    14       14         1              1
 3     1    20        6         2              1
 4     1    73       53         1              1
 5     2     0        0         1              1
 6     2    33       33         1              1
 7     2    90       57         2              2
 8     2   112       22         3              2
 9     2   152       40         1              1
10     2   178       26         4              1

Sum 1:n by group

I have a dataset in which, for each row, I need a sum over rows 1:i within the row's group:
demo <- data.frame(
  th = c(c(0, 24, 26), c(0, 1, 2, 4)),
  hs = c(rep(220, 3), rep(240, 4)),
  seq = c(1:3, 1:4),
  group = c(rep(1, 3), rep(2, 4))
)
Here's what that looks like:
> demo
th hs seq group
1 0 220 1 1
2 24 220 2 1
3 26 220 3 1
4 0 240 1 2
5 1 240 2 2
6 2 240 3 2
7 4 240 4 2
I need a new column based on the hs, seq, and th columns: for each row, the running sum within the group of hs raised to the power seq, multiplied by th.
demo[1,"an"]<- demo[1,"hs"]^demo[1,"seq"] * demo[1,"th"]
demo[2,"an"]<-sum(demo[1,"hs"]^demo[1,"seq"] * demo[1,"th"],
demo[2,"hs"]^demo[2,"seq"] * demo[2,"th"] )
demo[3,"an"]<-sum(demo[1,"hs"]^demo[1,"seq"] * demo[1,"th"],
demo[2,"hs"]^demo[2,"seq"] * demo[2,"th"],
demo[3,"hs"]^demo[3,"seq"] * demo[3,"th"])
demo[6,"an"]<-sum(demo[4,"hs"]^demo[4,"seq"] * demo[4,"th"],
demo[5,"hs"]^demo[5,"seq"] * demo[5,"th"],
demo[6,"hs"]^demo[6,"seq"] * demo[6,"th"])
Here's what that new column (an) should look like
> demo
th hs seq group an
1 0 220 1 1 0
2 24 220 2 1 1161600
3 26 220 3 1 278009600
4 0 240 1 2 NA
5 1 240 2 2 NA
6 2 240 3 2 27705600
7 4 240 4 2 NA
Ignore the NAs in this MRE; those need to be filled in too.
Libraries
library(tidyverse)
Sample data
df <-
  read.csv(
    text =
"th hs seq group
0 220 1 1
24 220 2 1
26 220 3 1
0 240 1 2
1 240 2 2
2 240 3 2
4 240 4 2",
    sep = " ", header = TRUE
  )
Code
df %>%
  # Grouping by group
  group_by(group) %>%
  # Applying a cumulative sum of the formula, by group
  mutate(an = cumsum(hs^seq * th))
Output
th hs seq group an
<int> <int> <int> <int> <dbl>
1 0 220 1 1 0
2 24 220 2 1 1161600
3 26 220 3 1 278009600
4 0 240 1 2 0
5 1 240 2 2 57600
6 2 240 3 2 27705600
7 4 240 4 2 13298745600
We can use data.table:
library(data.table)
setDT(df)[, an := cumsum(hs^seq * th), group]
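The same running sum also works in base R with ave(); a quick sketch, assuming the df above:
# Base R: cumulative sum of hs^seq * th within each group
df$an <- ave(df$hs^df$seq * df$th, df$group, FUN = cumsum)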

How can I create a lag difference variable within group relative to baseline?

I would like a variable that is the difference from the within-group baseline (the first score within each id). I have panel data that I have balanced.
my_data <- data.frame(
  id = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
  group = c(1, 2, 3, 1, 2, 3, 1, 2, 3),
  score = as.numeric(c(0, 150, 170, 80, 100, 110, 75, 100, 0))
)
id group score
1 1 1 0
2 1 2 150
3 1 3 170
4 2 1 80
5 2 2 100
6 2 3 110
7 3 1 75
8 3 2 100
9 3 3 0
I would like it to look like this:
id group score lag_diff_baseline
1 1 1 0 NA
2 1 2 150 150
3 1 3 170 170
4 2 1 80 NA
5 2 2 100 20
6 2 3 110 30
7 3 1 75 NA
8 3 2 100 25
9 3 3 0 -75
The data.table version of @Liam's answer:
library(data.table)
setDT(my_data)
# id is returned by the by= grouping, so it isn't repeated in j
my_data[, .(group, score, lag_diff_baseline = score - first(score)), by = id]
I missed the easy answer:
library(dplyr)
my_data %>%
  group_by(id) %>%
  mutate(lag_diff_baseline = score - first(score))
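For completeness, a base R sketch of the same per-id baseline difference (like the dplyr version, it gives 0 rather than NA on the baseline row):
# Subtract the first score within each id
my_data$lag_diff_baseline <- ave(my_data$score, my_data$id,
                                 FUN = function(x) x - x[1])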

How to rank a column with a condition

I have a data frame :
dt <- read.table(text = "
1 390
1 366
1 276
1 112
2 97
2 198
2 400
2 402
3 110
3 625
4 137
4 49
4 9
4 578 ")
The first column is Index and the second is distance.
I want to add a column that ranks the distances within each Index in descending order (the highest distance is ranked first).
The result will be :
dt <- read.table(text = "
1 390 1
1 66 4
1 276 2
1 112 3
2 97 4
2 198 3
2 300 2
2 402 1
3 110 2
3 625 1
4 137 2
4 49 3
4 9 4
4 578 1")
Another base R approach:
dt$Rank <- unlist(tapply(-dt$V2, dt$V1, rank))
A tidyverse solution:
dt %>%
  group_by(V1) %>%
  mutate(Rank = rank(-V2))
transform(dt, s = ave(-V2, V1, FUN = rank))
V1 V2 s
1 1 390 1
2 1 66 4
3 1 276 2
4 1 112 3
5 2 97 4
6 2 198 3
7 2 300 2
8 2 402 1
9 3 110 2
10 3 625 1
11 4 137 2
12 4 49 3
13 4 9 4
14 4 578 1
You could group, arrange, and use row_number(). The result is a bit easier on the eyes than a plain rank, I think, and so worth the extra step.
dt %>%
  group_by(V1) %>%
  arrange(V1, desc(V2)) %>%
  mutate(rank = row_number())
# A tibble: 14 x 3
# Groups: V1 [4]
V1 V2 rank
<int> <int> <int>
1 1 390 1
2 1 366 2
3 1 276 3
4 1 112 4
5 2 402 1
6 2 400 2
7 2 198 3
8 2 97 4
9 3 625 1
10 3 110 2
11 4 578 1
12 4 137 2
13 4 49 3
14 4 9 4
An alternative that leaves the rows in their original (scrambled) order is min_rank():
dt %>%
  group_by(V1) %>%
  mutate(rank = min_rank(desc(V2)))
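One difference between rank() and min_rank() shows up only when distances tie: rank() averages tied ranks by default, while min_rank() gives every tied value the minimum rank. A tiny illustration:
rank(-c(10, 10, 5))             # 1.5 1.5 3.0  (ties averaged)
dplyr::min_rank(-c(10, 10, 5))  # 1 1 3        (ties share the minimum rank)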
