Counting the rows using group_by of two other columns in R

I have data as below. I would like to add a new column that increments whenever the value of code changes, and resets to 1 and starts counting again whenever ID changes.
ID code
1 10
1 10
1 11
1 11
1 21
1 21
2 10
2 10
2 11
2 11
2 11
2 14
2 15
result:
ID code counter
1 10 1
1 10 1
1 11 2
1 11 2
1 21 3
1 21 3
2 10 1
2 10 1
2 11 2
2 11 2
2 11 2
2 14 3
2 15 4
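For a reproducible example, the input above can be built as a plain data frame — a minimal sketch, assuming both columns are numeric:

df <- data.frame(
  ID   = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2),
  code = c(10, 10, 11, 11, 21, 21, 10, 10, 11, 11, 11, 14, 15)
)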

We may use cumsum along with duplicated as in
df %>% group_by(ID) %>% mutate(counter = cumsum(!duplicated(code)))
# A tibble: 13 x 3
# Groups: ID [2]
# ID code counter
# <int> <int> <int>
# 1 1 10 1
# 2 1 10 1
# 3 1 11 2
# 4 1 11 2
# 5 1 21 3
# 6 1 21 3
# 7 2 10 1
# 8 2 10 1
# 9 2 11 2
# 10 2 11 2
# 11 2 11 2
# 12 2 14 3
# 13 2 15 4
If code reverted, say from 11 back to 10, the counter wouldn't increase, because duplicated marks any value already seen within the group. But presumably either that isn't possible in your case or it is even the desired effect.
Here's how duplicated works in this case:
cbind(df[df$ID == 1, "code"], !duplicated(df[df$ID == 1, "code"]))
# [,1] [,2]
# [1,] 10 1
# [2,] 10 0
# [3,] 11 1
# [4,] 11 0
# [5,] 21 1
# [6,] 21 0
Whenever a new value in code appears, it gives a one, and then cumsum finishes the job.

You can do this with dplyr, using lag to find rows where code changes:
library(dplyr)
df %>%
  group_by(ID) %>%
  mutate(counter = cumsum(c(1, tail(code != lag(code), -1))))
Result:
ID code counter
<int> <int> <dbl>
1 1 10 1
2 1 10 1
3 1 11 2
4 1 11 2
5 1 21 3
6 1 21 3
7 2 10 1
8 2 10 1
9 2 11 2
10 2 11 2
11 2 11 2
12 2 14 3
13 2 15 4
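If data.table is an option, rleid() gives a similar counter in one step — a sketch, with the caveat that rleid() increments on every change of code, so a value reverting to an earlier one would get a new number, unlike the duplicated() approach above:

library(data.table)

# rleid(code) increments each time code changes; by = ID restarts the count per ID
setDT(df)[, counter := rleid(code), by = ID]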

Exclude rows where value used in another row

Imagine you have the following data set:
df <- data.frame(ID = c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20),
                 gender = c(1,2,1,2,2,2,2,1,1,2,1,2,1,2,2,2,2,1,1,2),
                 PID = c(1,1,2,2,3,3,4,4,5,5,6,6,7,7,8,8,9,9,10,10))
How can I write code that removes the rows of df whose combination of gender and PID occurs more than once, i.e. keeps only the combinations that appear exactly once? The real data frame is over 1000 rows long, so the solution should find the rows to exclude automatically.
base R
df[ave(rep(TRUE, nrow(df)), df[, c("gender", "PID")], FUN = function(z) !any(duplicated(z))), ]
# ID gender PID
# 1 1 1 1
# 2 2 2 1
# 3 3 1 2
# 4 4 2 2
# 7 7 2 4
# 8 8 1 4
# 9 9 1 5
# 10 10 2 5
# 11 11 1 6
# 12 12 2 6
# 13 13 1 7
# 14 14 2 7
# 17 17 2 9
# 18 18 1 9
# 19 19 1 10
# 20 20 2 10
dplyr
library(dplyr)
df %>%
  group_by(gender, PID) %>%
  filter(!any(duplicated(cbind(gender, PID)))) %>%
  ungroup()
In base R, we may use subset, keeping only the observations where the group count for 'gender' and 'PID' is 1:
subset(df, ave(seq_along(gender), gender, PID, FUN = length) == 1)
Or with duplicated
df[!(duplicated(df[-1])|duplicated(df[-1], fromLast = TRUE)),]
Output
ID gender PID
1 1 1 1
2 2 2 1
3 3 1 2
4 4 2 2
7 7 2 4
8 8 1 4
9 9 1 5
10 10 2 5
11 11 1 6
12 12 2 6
13 13 1 7
14 14 2 7
17 17 2 9
18 18 1 9
19 19 1 10
20 20 2 10
Here is one more: :-)
library(dplyr)
df %>%
  group_by(gender, PID) %>%
  filter(is.na(ifelse(n() > 1, 1, NA)))
ID gender PID
<dbl> <dbl> <dbl>
1 1 1 1
2 2 2 1
3 3 1 2
4 4 2 2
5 7 2 4
6 8 1 4
7 9 1 5
8 10 2 5
9 11 1 6
10 12 2 6
11 13 1 7
12 14 2 7
13 17 2 9
14 18 1 9
15 19 1 10
16 20 2 10
Another dplyr option could be:
df %>%
  filter(with(rle(paste0(gender, PID)), rep(lengths == 1, lengths)))
ID gender PID
1 1 1 1
2 2 2 1
3 3 1 2
4 4 2 2
5 7 2 4
6 8 1 4
7 9 1 5
8 10 2 5
9 11 1 6
10 12 2 6
11 13 1 7
12 14 2 7
13 17 2 9
14 18 1 9
15 19 1 10
16 20 2 10
If the duplicated values can occur also between non-consecutive rows:
df %>%
  arrange(gender, PID) %>%
  filter(with(rle(paste0(gender, PID)), rep(lengths == 1, lengths)))
Using aggregate
na.omit(aggregate(. ~ gender + PID, df, function(x)
  ifelse(length(x) == 1, x, NA)))
gender PID ID
1 1 1 1
2 2 1 2
3 1 2 3
4 2 2 4
6 1 4 8
7 2 4 7
8 1 5 9
9 2 5 10
10 1 6 11
11 2 6 12
12 1 7 13
13 2 7 14
15 1 9 18
16 2 9 17
17 1 10 19
18 2 10 20
With dplyr
library(dplyr)
df %>%
  group_by(gender, PID) %>%
  filter(n() == 1) %>%
  ungroup()
# A tibble: 16 × 3
ID gender PID
<dbl> <dbl> <dbl>
1 1 1 1
2 2 2 1
3 3 1 2
4 4 2 2
5 7 2 4
6 8 1 4
7 9 1 5
8 10 2 5
9 11 1 6
10 12 2 6
11 13 1 7
12 14 2 7
13 17 2 9
14 18 1 9
15 19 1 10
16 20 2 10
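For completeness, a data.table sketch of the same idea, assuming that package is acceptable — keep only the (gender, PID) groups whose row count .N is exactly 1 (note the grouping columns come first in the result):

library(data.table)

# .SD[.N == 1] returns a group's rows only when the group has a single row
setDT(df)[, .SD[.N == 1], by = .(gender, PID)]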

How to create another column in a data frame based on repeated observations in another column?

So basically I have a data frame that looks like this:
BX BY
 1 12
 1 12
 1 12
 2 14
 2 14
 3  5
I want to create another column, ID, which will have the same number for rows with the same values of BX and BY. The table would then look like this:
BX BY ID
 1 12  1
 1 12  1
 1 12  1
 2 14  2
 2 14  2
 3  5  3
Here is a base R way.
Subset the data.frame by the grouping columns, find the duplicated rows and use a standard cumsum trick.
df1 <- 'BX BY
1 12
1 12
1 12
2 14
2 14
3 5'
df1 <- read.table(textConnection(df1), header = TRUE)
cumsum(!duplicated(df1[c("BX", "BY")]))
#> [1] 1 1 1 2 2 3
df1$ID <- cumsum(!duplicated(df1[c("BX", "BY")]))
df1
#> BX BY ID
#> 1 1 12 1
#> 2 1 12 1
#> 3 1 12 1
#> 4 2 14 2
#> 5 2 14 2
#> 6 3 5 3
Created on 2022-10-12 with reprex v2.0.2
You can do the following with interaction (here the data frame is called dat):
transform(dat, ID = as.numeric(interaction(dat, drop = TRUE, lex.order = TRUE)))
BX BY ID
1 1 12 1
2 1 12 1
3 1 12 1
4 2 14 2
5 2 14 2
6 3 5 3
Or if you prefer dplyr:
library(dplyr)
dat %>%
  group_by(across()) %>%
  mutate(ID = cur_group_id()) %>%
  ungroup()
# A tibble: 6 × 3
BX BY ID
<dbl> <dbl> <int>
1 1 12 1
2 1 12 1
3 1 12 1
4 2 14 2
5 2 14 2
6 3 5 3
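If the repeated rows are always consecutive, dplyr (>= 1.1.0) also provides consecutive_id(), which wraps the same cumsum/duplicated idea — a sketch, assuming that version of dplyr is available:

library(dplyr)

# consecutive_id() starts a new id whenever the (BX, BY) combination changes
dat %>%
  mutate(ID = consecutive_id(BX, BY))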

identify whenever values repeat in r

I have a dataframe like this.
data <- data.frame(Condition = c(1,1,2,3,1,1,2,2,2,3,1,1,2,3,3))
I want to populate a new variable Sequence which identifies whenever Condition starts again from 1.
So the new dataframe would look like this.
Thanks in advance for the help!
data <- data.frame(Condition = c(1,1,2,3,1,1,2,2,2,3,1,1,2,3,3),
Sequence = c(1,1,1,1,2,2,2,2,2,2,3,3,3,3,3))
base R
data$Sequence2 <- cumsum(c(TRUE, data$Condition[-1] == 1 & data$Condition[-nrow(data)] != 1))
data
# Condition Sequence Sequence2
# 1 1 1 1
# 2 1 1 1
# 3 2 1 1
# 4 3 1 1
# 5 1 2 2
# 6 1 2 2
# 7 2 2 2
# 8 2 2 2
# 9 2 2 2
# 10 3 2 2
# 11 1 3 3
# 12 1 3 3
# 13 2 3 3
# 14 3 3 3
# 15 3 3 3
dplyr
library(dplyr)
data %>%
  mutate(
    Sequence2 = cumsum(Condition == 1 & lag(Condition != 1, default = TRUE))
  )
# Condition Sequence Sequence2
# 1 1 1 1
# 2 1 1 1
# 3 2 1 1
# 4 3 1 1
# 5 1 2 2
# 6 1 2 2
# 7 2 2 2
# 8 2 2 2
# 9 2 2 2
# 10 3 2 2
# 11 1 3 3
# 12 1 3 3
# 13 2 3 3
# 14 3 3 3
# 15 3 3 3
This took a while. Finally I found this solution:
library(dplyr)
data %>%
  group_by(Sequnce = cumsum(
    ifelse(Condition == 1, lead(Condition) + 1, Condition)
    - Condition == 1)
  )
Condition Sequnce
<dbl> <int>
1 1 1
2 1 1
3 2 1
4 3 1
5 1 2
6 1 2
7 2 2
8 2 2
9 2 2
10 3 2
11 1 3
12 1 3
13 2 3
14 3 3
15 3 3
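A data.table sketch of the same lag idea, assuming that package is an option:

library(data.table)

# shift() lags Condition; fill = 0 makes the first row start sequence 1
setDT(data)[, Sequence2 := cumsum(Condition == 1 & shift(Condition, fill = 0) != 1)]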

Replace row value in a data frame group by the smallest value in that group

I have the following data set:
time <- c(0,1,2,3,4,5,0,1,2,3,4,5,0,1,2,3,4,5)
value <- c(10,8,6,5,3,2,12,10,6,5,4,2,20,15,16,9,2,2)
group <- c(1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,3,3,3)
data <- data.frame(time, value, group)
I want to create a new column called data$diff that is equal to data$value minus the value of data$value when data$time == 0 within each group.
I am beginning with the following code
for(i in 1:nrow(data)){
for(n in 1:max(data$group)){
if(data$group[i] == n) {
data$diff[i] <- ???????
}
}
}
But cannot figure out what to put in place of the question marks. The desired output would be this table: https://i.stack.imgur.com/1bAKj.png
Any thoughts are appreciated.
Since in your example data$time == 0 is always the first element of the group, you can use this data.table approach.
library(data.table)
setDT(data)
data[, diff := value[1] - value, by = group]
In case that data$time == 0 is not the first element in each group you can use this:
data[, diff := value[time==0] - value, by = group]
Output:
> data
time value group diff
1: 0 10 1 0
2: 1 8 1 2
3: 2 6 1 4
4: 3 5 1 5
5: 4 3 1 7
6: 5 2 1 8
7: 0 12 2 0
8: 1 10 2 2
9: 2 6 2 6
10: 3 5 2 7
11: 4 4 2 8
12: 5 2 2 10
13: 0 20 3 0
14: 1 15 3 5
15: 2 16 3 4
16: 3 9 3 11
17: 4 2 3 18
18: 5 2 3 18
Here is a base R approach.
within(data, diff <- ave(
  seq_along(value), group,
  FUN = \(i) value[i][time[i] == 0] - value[i]
))
Output
time value group diff
1 0 10 1 0
2 1 8 1 2
3 2 6 1 4
4 3 5 1 5
5 4 3 1 7
6 5 2 1 8
7 0 12 2 0
8 1 10 2 2
9 2 6 2 6
10 3 5 2 7
11 4 4 2 8
12 5 2 2 10
13 0 20 3 0
14 1 15 3 5
15 2 16 3 4
16 3 9 3 11
17 4 2 3 18
18 5 2 3 18
Here is a short way to do it with dplyr.
library(dplyr)
data %>%
  group_by(group) %>%
  mutate(diff = value[which(time == 0)] - value)
Which gives
# Groups: group [3]
time value group diff
<dbl> <dbl> <dbl> <dbl>
1 0 10 1 0
2 1 8 1 2
3 2 6 1 4
4 3 5 1 5
5 4 3 1 7
6 5 2 1 8
7 0 12 2 0
8 1 10 2 2
9 2 6 2 6
10 3 5 2 7
11 4 4 2 8
12 5 2 2 10
13 0 20 3 0
14 1 15 3 5
15 2 16 3 4
16 3 9 3 11
17 4 2 3 18
18 5 2 3 18
library(dplyr)
vals2use <- data %>%
  group_by(group) %>%
  filter(time == 0) %>%
  select(c(2, 3)) %>%
  rename(value4diff = value)
dataNew <- merge(data, vals2use, all = TRUE)
dataNew$diff <- dataNew$value4diff - dataNew$value
dataNew <- dataNew[, c(1, 2, 3, 5)]
dataNew
group time value diff
1 1 0 10 0
2 1 1 8 2
3 1 2 6 4
4 1 3 5 5
5 1 4 3 7
6 1 5 2 8
7 2 0 12 0
8 2 1 10 2
9 2 2 6 6
10 2 3 5 7
11 2 4 4 8
12 2 5 2 10
13 3 0 20 0
14 3 1 15 5
15 3 2 16 4
16 3 3 9 11
17 3 4 2 18
18 3 5 2 18
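If some group could lack a time == 0 row, a slightly more defensive dplyr variant (a hypothetical guard, not part of the answers above) avoids subtracting a length-zero baseline:

library(dplyr)

data %>%
  group_by(group) %>%
  mutate(
    # baseline is NA when the group has no time == 0 row (assumed possible here)
    baseline = if (any(time == 0)) value[time == 0][1] else NA_real_,
    diff = baseline - value
  ) %>%
  ungroup()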

Conditional difference between two data.frame columns

I have a tidy data.frame of experimental data with subjects ID who were measured three times (Trial) at a varying(!) number of time points (Session) in two different conditions (Direction) on a dependent continuous variable, say LC:
set.seed(5)
nSubjects <- 4
nDirections <- 2
nTrials <- 3
# Between 1 and 3 sessions per subject:
nSessions <- round(runif(nSubjects, min = 1, max = 3))
mydat <- data.frame(
  ID = do.call(rep, args = list(1:nSubjects,
                                times = nSessions * nDirections * nTrials)),
  Session = rep(sequence(nSessions), each = nDirections * nTrials),
  Trial = rep(rep(1:nTrials, each = nDirections), times = sum(nSessions)),
  Direction = rep(c("up", "down"), times = nTrials * sum(nSessions)),
  LC = 1:(nDirections * nTrials * sum(nSessions))
)
What I would like to calculate is a vector of length nrow(mydat) that contains, for each row, the current session's (absolute) LC minus the (absolute) LC from session == 1 of the same ID, trial and direction, like this (for the sake of simplicity I chose LC to be monotonically increasing):
# ID Session Trial Direction LC LC_diff
# 7 2 1 1 up 7 0
# 8 2 1 2 down 8 0
# 9 2 1 3 up 9 0
# 10 2 1 1 down 10 0
# 11 2 1 2 up 11 0
# 12 2 1 3 down 12 0
# 13 2 2 1 up 13 6
# 14 2 2 2 down 14 6
# 15 2 2 3 up 15 6
# 16 2 2 1 down 16 6
# 17 2 2 2 up 17 6
# 18 2 2 3 down 18 6
I thought the following code would yield the desired result:
library(dplyr)
ordered <- group_by(mydat, ID, Session, Trial, Direction)
mydat$LC_diff <- summarise(ordered,
                           Diff = sum(abs(LC[Trial != 1]),
                                      -abs(LC[Trial == 1])))$Diff
But, alas:
mydat[7:18, ]
# ID Session Trial Direction LC LC_diff
# 7 2 1 1 up 7 -8
# 8 2 1 2 down 8 -7
# 9 2 1 3 up 9 10
# 10 2 1 1 down 10 9
# 11 2 1 2 up 11 12
# 12 2 1 3 down 12 11
# 13 2 2 1 up 13 -14
# 14 2 2 2 down 14 -13
# 15 2 2 3 up 15 16
# 16 2 2 1 down 16 15
# 17 2 2 2 up 17 18
# 18 2 2 3 down 18 17
I am at a complete loss here and would appreciate any pointers to where my code is wrong.
I'm not sure this is what you meant, but with data.table it would look like this:
library(data.table)
setDT(mydat)[, new := abs(LC) - abs(LC[1]), by = .(ID, Trial, Direction)]
mydat[ID == 2, ]
ID Session Trial Direction LC new
1: 2 1 1 up 7 0
2: 2 1 1 down 8 0
3: 2 1 2 up 9 0
4: 2 1 2 down 10 0
5: 2 1 3 up 11 0
6: 2 1 3 down 12 0
7: 2 2 1 up 13 6
8: 2 2 1 down 14 6
9: 2 2 2 up 15 6
10: 2 2 2 down 16 6
11: 2 2 3 up 17 6
12: 2 2 3 down 18 6
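The same computation in dplyr, for readers not using data.table — a sketch assuming every ID/Trial/Direction group contains a Session == 1 row:

library(dplyr)

mydat %>%
  group_by(ID, Trial, Direction) %>%
  # subtract the group's Session == 1 value, matching the data.table answer above
  mutate(new = abs(LC) - abs(LC[Session == 1][1])) %>%
  ungroup()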
