I would like to get the sum of the rows after the first occurrence of a certain number within each group. In this case the number is 10.
I thought that if we know the row number just after the first occurrence and the last row number of that group, we can sum the values in between.
I can get the first occurrence of 10 in each group, but I don't know how to get the sum of the rows after it.
df <- data.frame(gr=rep(c(1,2),c(7,9)),
y_value=c(c(0,0,10,8,8,6,0),c(0,0,10,10,5,4,2,0,0)))
> df
gr y_value
1 1 0
2 1 0
3 1 10
4 1 8
5 1 8
6 1 6
7 1 0
8 2 0
9 2 0
10 2 10
11 2 10
12 2 5
13 2 4
14 2 2
15 2 0
16 2 0
My initial attempt is below. It is not working for some reason, not even the grouping part :(
library(dplyr)
df%>%
group_by(gr)%>%
mutate(check1=any(y_value==10),row_sum=which(y_value == 10)[1])
Expected output (sum_rows_range is the sum of the rows after the first 10 / the number of those rows):
> df
gr y_value sum_rows_range
1 1 0 22/4
2 1 0 22/4
3 1 10 22/4
4 1 8 22/4
5 1 8 22/4
6 1 6 22/4
7 1 0 22/4
8 2 0 21/6
9 2 0 21/6
10 2 10 21/6
11 2 10 21/6
12 2 5 21/6
13 2 4 21/6
14 2 2 21/6
15 2 0 21/6
16 2 0 21/6
A dplyr solution:
library(dplyr)
df %>%
group_by(gr) %>%
slice(if(any(y_value == 10)) (which.max(y_value == 10)+1):n() else row_number()) %>%
summarize(sum = sum(y_value),
rows = n()) %>%
inner_join(df)
Notes:
The main idea is to slice on the rows after the first 10 occurs. any(y_value == 10) and else row_number() are just to take care of the case where there are no 10's in y_value.
Reading the documentation for ?which.max, you will notice that when it is applied to a logical vector, in this case y_value == 10, "with both FALSE and TRUE values, which.min(x) and which.max(x) return the index of the first FALSE or TRUE, respectively, as FALSE < TRUE."
In other words, which.max(y_value == 10) will give the index of the first occurrence of 10. By adding 1 to it, I can start slicing from the value right after the first occurrence of 10.
Result:
# A tibble: 16 × 4
gr sum rows y_value
<dbl> <dbl> <int> <dbl>
1 1 22 4 0
2 1 22 4 0
3 1 22 4 10
4 1 22 4 8
5 1 22 4 8
6 1 22 4 6
7 1 22 4 0
8 2 21 6 0
9 2 21 6 0
10 2 21 6 10
11 2 21 6 10
12 2 21 6 5
13 2 21 6 4
14 2 21 6 2
15 2 21 6 0
16 2 21 6 0
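To see what the notes above describe, here is a minimal sketch on the first group's values only. The names first_ten and after are made up for this illustration. It also shows why the any() guard matters: on an all-FALSE vector which.max() still returns 1.

```r
y <- c(0, 0, 10, 8, 8, 6, 0)       # y_value for gr == 1

first_ten <- which.max(y == 10)    # index of the first TRUE
first_ten                          # 3

# rows strictly after the first 10
after <- y[seq_along(y) > first_ten]
sum(after)                         # 22
length(after)                      # 4

# with no 10 at all, every element is FALSE and which.max() returns 1,
# so without the any() check the slice would silently start at row 2
which.max(c(0, 5, 3) == 10)        # 1
```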
It's a bit convoluted, and I'm not positive it's what you're looking for, but it does match your output.
df %>%
group_by(gr) %>%
mutate(is_ten = cumsum(y_value == 10)) %>%
filter(is_ten > 0) %>%
filter(!(y_value == 10 & is_ten == 1)) %>%
group_by(gr) %>%
summarize(sum_rows_range = paste(sum(y_value), n(), sep = "/")) %>%
right_join(df)
# A tibble: 16 x 3
gr sum_rows_range y_value
<dbl> <chr> <dbl>
1 1 22/4 0
2 1 22/4 0
3 1 22/4 10
4 1 22/4 8
5 1 22/4 8
6 1 22/4 6
7 1 22/4 0
8 2 21/6 0
9 2 21/6 0
10 2 21/6 10
11 2 21/6 10
12 2 21/6 5
13 2 21/6 4
14 2 21/6 2
15 2 21/6 0
16 2 21/6 0
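The same cumsum idea can probably be written without a join. The following is only a sketch in base R, assuming the df from the question; the helper column names sum and rows are chosen to match the first answer, and the variable after is hypothetical.

```r
df <- data.frame(gr = rep(c(1, 2), c(7, 9)),
                 y_value = c(0, 0, 10, 8, 8, 6, 0,
                             0, 0, 10, 10, 5, 4, 2, 0, 0))

# TRUE for rows strictly after the first 10 within each group
after <- as.logical(ave(df$y_value, df$gr, FUN = function(v) {
  seen <- cumsum(v == 10)        # how many 10s have been seen so far
  c(0, head(seen, -1)) > 0       # shift by one row: was a 10 seen before this row?
}))

# per-group sum and count of the flagged rows, repeated on every row
df$sum  <- ave(df$y_value * after, df$gr, FUN = sum)
df$rows <- ave(as.numeric(after), df$gr, FUN = sum)
df
```

Groups without any 10 get sum 0 and rows 0 here, which differs from the first answer's else branch; adjust if that case matters.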
Imagine you have the following data set:
df = data.frame(ID = c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20), gender= c(1,2,1,2,2,2,2,1,1,2,1,2,1,2,2,2,2,1,1,2),
PID = c(1,1,2,2,3,3,4,4,5,5,6,6,7,7,8,8,9,9,10,10))
How can I write code that removes the rows in df that share both gender and PID with another row (see picture)? Please imagine that the data set is over 1000 rows long (so it should be a solution that automatically finds the rows to exclude).
base R (the code and output in this answer and the next refer to the PID column as paar)
df[ave(rep(TRUE, nrow(df)), df[,c("gender","paar")], FUN = function(z) !any(duplicated(z))),]
# ID gender paar
# 1 1 1 1
# 2 2 2 1
# 3 3 1 2
# 4 4 2 2
# 7 7 2 4
# 8 8 1 4
# 9 9 1 5
# 10 10 2 5
# 11 11 1 6
# 12 12 2 6
# 13 13 1 7
# 14 14 2 7
# 17 17 2 9
# 18 18 1 9
# 19 19 1 10
# 20 20 2 10
dplyr
library(dplyr)
df %>%
group_by(gender, paar) %>%
filter(!any(duplicated(cbind(gender, paar)))) %>%
ungroup()
In base R, we may use subset to keep only the observations where the group count for 'gender' and 'paar' is 1
subset(df, ave(seq_along(gender), gender, paar, FUN = length) == 1)
Or with duplicated
df[!(duplicated(df[-1])|duplicated(df[-1], fromLast = TRUE)),]
-output
ID gender paar
1 1 1 1
2 2 2 1
3 3 1 2
4 4 2 2
7 7 2 4
8 8 1 4
9 9 1 5
10 10 2 5
11 11 1 6
12 12 2 6
13 13 1 7
14 14 2 7
17 17 2 9
18 18 1 9
19 19 1 10
20 20 2 10
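The duplicated() variant works because scanning from both ends flags every member of a repeated group, not just the later copies. A small sketch on a made-up vector x:

```r
x <- c("a", "b", "a", "c")

duplicated(x)                     # FALSE FALSE  TRUE FALSE (second "a" only)
duplicated(x, fromLast = TRUE)    #  TRUE FALSE FALSE FALSE (first "a" only)

# combining both directions marks all copies, so only unique values remain
x[!(duplicated(x) | duplicated(x, fromLast = TRUE))]   # "b" "c"
```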
Here is one more: :-)
library(dplyr)
df %>%
group_by(gender, PID) %>%
filter(is.na(ifelse(n()>1, 1, NA)))
ID gender PID
<dbl> <dbl> <dbl>
1 1 1 1
2 2 2 1
3 3 1 2
4 4 2 2
5 7 2 4
6 8 1 4
7 9 1 5
8 10 2 5
9 11 1 6
10 12 2 6
11 13 1 7
12 14 2 7
13 17 2 9
14 18 1 9
15 19 1 10
16 20 2 10
Another dplyr option could be:
df %>%
filter(with(rle(paste(gender, PID, sep = "_")), rep(lengths == 1, lengths)))
ID gender PID
1 1 1 1
2 2 2 1
3 3 1 2
4 4 2 2
5 7 2 4
6 8 1 4
7 9 1 5
8 10 2 5
9 11 1 6
10 12 2 6
11 13 1 7
12 14 2 7
13 17 2 9
14 18 1 9
15 19 1 10
16 20 2 10
If the duplicated values can occur also between non-consecutive rows:
df %>%
arrange(gender, PID) %>%
filter(with(rle(paste(gender, PID, sep = "_")), rep(lengths == 1, lengths)))
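Why the sorting step matters: rle() only sees consecutive runs, so a repeated key that is split up looks unique. A minimal sketch with a made-up key vector (keep_unique_runs is a hypothetical helper, not part of any package):

```r
keep_unique_runs <- function(key) {
  r <- rle(key)
  rep(r$lengths == 1, r$lengths)   # TRUE where the run has length 1
}

key <- c("a", "a", "b", "a")

keep_unique_runs(key)         # FALSE FALSE TRUE TRUE: the last "a" is wrongly kept
keep_unique_runs(sort(key))   # FALSE FALSE FALSE TRUE: only "b" survives after sorting
```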
Using aggregate
na.omit(aggregate(. ~ gender + PID, df, function(x)
ifelse(length(x) == 1, x, NA)))
gender PID ID
1 1 1 1
2 2 1 2
3 1 2 3
4 2 2 4
6 1 4 8
7 2 4 7
8 1 5 9
9 2 5 10
10 1 6 11
11 2 6 12
12 1 7 13
13 2 7 14
15 1 9 18
16 2 9 17
17 1 10 19
18 2 10 20
With dplyr
library(dplyr)
df %>%
group_by(gender, PID) %>%
filter(n() == 1) %>%
ungroup()
# A tibble: 16 × 3
ID gender PID
<dbl> <dbl> <dbl>
1 1 1 1
2 2 2 1
3 3 1 2
4 4 2 2
5 7 2 4
6 8 1 4
7 9 1 5
8 10 2 5
9 11 1 6
10 12 2 6
11 13 1 7
12 14 2 7
13 17 2 9
14 18 1 9
15 19 1 10
16 20 2 10
I have a dataframe like this:
df <- data.frame(
id = 1:19,
Area_l = c(1,2,0,0,0,2,3,1,2,0,0,0,0,3,4,0,0,0,0),
Area_r = c(3,2,2,0,0,2,3,1,0,0,0,1,3,3,4,3,0,0,0)
)
I need to filter the dataframe so that all rows that fulfill both of the following conditions are omitted:
(i): Area_l and Area_r are both 0
(ii): the paired 0 values in Area_l and Area_r are the last values in the columns.
I really have no clue how to implement these two conditions using dplyr. The desired result is this:
df
id Area_l Area_r
1 1 1 3
2 2 2 2
3 3 0 2
4 4 0 0
5 5 0 0
6 6 2 2
7 7 3 3
8 8 1 1
9 9 2 0
10 10 0 0
11 11 0 0
12 12 0 1
13 13 0 3
14 14 3 3
15 15 4 4
16 16 0 3
Any help?
Reverse the order of the dataframe, filter with a cumany condition, then reverse it back.
library(dplyr)
library(purrr)  # map_df() comes from purrr
df %>%
map_df(rev) %>%
filter(cumany(Area_l + Area_r != 0)) %>%
map_df(rev)
output
# A tibble: 16 x 3
id Area_l Area_r
<int> <dbl> <dbl>
1 1 1 3
2 2 2 2
3 3 0 2
4 4 0 0
5 5 0 0
6 6 2 2
7 7 3 3
8 8 1 1
9 9 2 0
10 10 0 0
11 11 0 0
12 12 0 1
13 13 0 3
14 14 3 3
15 15 4 4
16 16 0 3
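For comparison, a base R sketch of the same idea: find the last row where either column is non-zero and keep everything up to it. It assumes at least one such row exists; the names nonzero, last_keep and res are made up here.

```r
df <- data.frame(
  id = 1:19,
  Area_l = c(1,2,0,0,0,2,3,1,2,0,0,0,0,3,4,0,0,0,0),
  Area_r = c(3,2,2,0,0,2,3,1,0,0,0,1,3,3,4,3,0,0,0)
)

# TRUE where at least one of the two columns is non-zero
nonzero <- df$Area_l != 0 | df$Area_r != 0

# index of the last such row (assumes there is at least one)
last_keep <- max(which(nonzero))

# keep rows 1..last_keep, which drops only the trailing all-zero block
res <- df[seq_len(last_keep), ]
res
```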
We may use rle
library(dplyr)
df %>%
filter(!if_all(starts_with("Area"), ~ .x == 0 &
inverse.rle(within.list(rle(.x == 0), values[-length(values)] <- FALSE))))
-output
id Area_l Area_r
1 1 1 3
2 2 2 2
3 3 0 2
4 4 0 0
5 5 0 0
6 6 2 2
7 7 3 3
8 8 1 1
9 9 2 0
10 10 0 0
11 11 0 0
12 12 0 1
13 13 0 3
14 14 3 3
15 15 4 4
16 16 0 3
Or another option is
df %>%
filter(if_any(starts_with("Area"),
~ row_number() <= max(row_number() * (.x != 0))))
Or another option is revcumsum from spatstat.utils
library(spatstat.utils)
df %>%
filter(!if_all(starts_with("Area"), ~ revcumsum(.x != 0) <1))
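If installing spatstat.utils is not an option, a reverse cumulative sum can be sketched in base R as rev(cumsum(rev(x))). The helper name rev_cumsum below is hypothetical, and treating it as a stand-in for revcumsum is an assumption.

```r
df <- data.frame(
  id = 1:19,
  Area_l = c(1,2,0,0,0,2,3,1,2,0,0,0,0,3,4,0,0,0,0),
  Area_r = c(3,2,2,0,0,2,3,1,0,0,0,1,3,3,4,3,0,0,0)
)

# hypothetical helper: number of TRUE/non-zero entries from this row to the end
rev_cumsum <- function(x) rev(cumsum(rev(x)))

# a row is in the trailing zero block if neither column has a non-zero value
# at or after it
trailing_zero <- rev_cumsum(df$Area_l != 0) < 1 & rev_cumsum(df$Area_r != 0) < 1

res <- df[!trailing_zero, ]
res
```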
I am trying to create an additional variable (flag) that numbers the repeated observations of each id, starting from 0.
dataset <- data.frame(id = c(1,1,1,2,2,4,6,6,6,7,7,7,7,8))
The intended result looks like:
id flag
1 0
1 1
1 2
2 0
2 1
4 0
6 0
6 1
6 2
7 0
7 1
7 2
7 3
8 0
Thank You!
You may try
dataset$flag <- unlist(lapply(rle(dataset$id)$lengths, function(x) seq_len(x) - 1))
id flag
1 1 0
2 1 1
3 1 2
4 2 0
5 2 1
6 4 0
7 6 0
8 6 1
9 6 2
10 7 0
11 7 1
12 7 2
13 7 3
14 8 0
data.table:
library(data.table)
setDT(dataset)[, flag := rowid(id) - 1]
dataset
id flag
1: 1 0
2: 1 1
3: 1 2
4: 2 0
5: 2 1
6: 4 0
7: 6 0
8: 6 1
9: 6 2
10: 7 0
11: 7 1
12: 7 2
13: 7 3
14: 8 0
Base R:
dataset$flag = sequence(rle(dataset$id)$lengths) - 1
dataset
id flag
1 1 0
2 1 1
3 1 2
4 2 0
5 2 1
6 4 0
7 6 0
8 6 1
9 6 2
10 7 0
11 7 1
12 7 2
13 7 3
14 8 0
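A short sketch of what the rle/sequence combination does step by step, plus one caveat: rle() counts consecutive runs, so it only matches the per-id numbering when equal ids sit next to each other. The vector id2 is made up to show the difference.

```r
id <- c(1,1,1,2,2,4,6,6,6,7,7,7,7,8)

runs <- rle(id)
runs$lengths                      # 3 2 1 3 4 1
sequence(runs$lengths) - 1        # 0 1 2 0 1 0 0 1 2 0 1 2 3 0

# caveat: with non-adjacent repeats the run-based count restarts
id2 <- c(1, 2, 1)
sequence(rle(id2)$lengths) - 1    # 0 0 0
ave(id2, id2, FUN = seq_along) - 1  # 0 0 1 (counts per id regardless of position)
```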
Another base option:
transform(dataset,
flag = Reduce(function(x, y) y * x + y, duplicated(id), accumulate = TRUE))
id flag
1 1 0
2 1 1
3 1 2
4 2 0
5 2 1
6 4 0
7 6 0
8 6 1
9 6 2
10 7 0
11 7 1
12 7 2
13 7 3
14 8 0
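The Reduce step above can be read as "add one if this row is a duplicate, otherwise reset to zero". A sketch with the update written out (step is a hypothetical name), and a made-up id2 showing that this relies on equal ids being adjacent:

```r
# same update as function(x, y) y * x + y, written as an if/else
step <- function(count, is_dup) if (is_dup) count + 1 else 0

id <- c(1, 1, 1, 2, 2, 4)
Reduce(step, duplicated(id), accumulate = TRUE)    # 0 1 2 0 1 0

# duplicated() looks at the whole vector, so interleaved ids are miscounted
id2 <- c(1, 2, 2, 1)
Reduce(step, duplicated(id2), accumulate = TRUE)   # 0 0 1 2
ave(id2, id2, FUN = seq_along) - 1                 # 0 0 1 1
```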
dplyr -
library(dplyr)
dataset %>% group_by(id) %>% mutate(flag = row_number() - 1)
# id flag
# <dbl> <dbl>
# 1 1 0
# 2 1 1
# 3 1 2
# 4 2 0
# 5 2 1
# 6 4 0
# 7 6 0
# 8 6 1
# 9 6 2
#10 7 0
#11 7 1
#12 7 2
#13 7 3
#14 8 0
Base R with similar logic
transform(dataset, flag = ave(id, id, FUN = seq_along) - 1)
Another way to get what you expect, though it takes a little more code:
library(dplyr)
x <- dataset %>%
group_by(id) %>%
summarise(nreg=n())
df <- data.frame()
for(i in 1:nrow(x)){
flag <- data.frame(id = rep( x$id[i], x$nreg[i] ),
flag = seq(0, x$nreg [i] -1 )
)
df <- rbind(df, flag)
}
I have the following data set:
time <- c(0,1,2,3,4,5,0,1,2,3,4,5,0,1,2,3,4,5)
value <- c(10,8,6,5,3,2,12,10,6,5,4,2,20,15,16,9,2,2)
group <- c(1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,3,3,3)
data <- data.frame(time, value, group)
I want to create a new column called data$diff that is equal to data$value minus the value of data$value when data$time == 0 within each group.
I am beginning with the following code
for(i in 1:nrow(data)){
for(n in 1:max(data$group)){
if(data$group[i] == n) {
data$diff[i] <- ???????
}
}
}
But I cannot figure out what to put in place of the question marks. The desired output would be this table: https://i.stack.imgur.com/1bAKj.png
Any thoughts are appreciated.
Since in your example data$time == 0 is always the first element of the group, you can use this data.table approach.
library(data.table)
setDT(data)
data[, diff := value[1] - value, by = group]
In case that data$time == 0 is not the first element in each group you can use this:
data[, diff := value[time==0] - value, by = group]
Output:
> data
time value group diff
1: 0 10 1 0
2: 1 8 1 2
3: 2 6 1 4
4: 3 5 1 5
5: 4 3 1 7
6: 5 2 1 8
7: 0 12 2 0
8: 1 10 2 2
9: 2 6 2 6
10: 3 5 2 7
11: 4 4 2 8
12: 5 2 2 10
13: 0 20 3 0
14: 1 15 3 5
15: 2 16 3 4
16: 3 9 3 11
17: 4 2 3 18
18: 5 2 3 18
Here is a base R approach.
within(data, diff <- ave(
seq_along(value), group,
FUN = \(i) value[i][time[i] == 0] - value[i]
))
Output
time value group diff
1 0 10 1 0
2 1 8 1 2
3 2 6 1 4
4 3 5 1 5
5 4 3 1 7
6 5 2 1 8
7 0 12 2 0
8 1 10 2 2
9 2 6 2 6
10 3 5 2 7
11 4 4 2 8
12 5 2 2 10
13 0 20 3 0
14 1 15 3 5
15 2 16 3 4
16 3 9 3 11
17 4 2 3 18
18 5 2 3 18
Here is a short way to do it with dplyr.
library(dplyr)
data %>%
group_by(group) %>%
mutate(diff = value[which(time == 0)] - value)
Which gives
# Groups: group [3]
time value group diff
<dbl> <dbl> <dbl> <dbl>
1 0 10 1 0
2 1 8 1 2
3 2 6 1 4
4 3 5 1 5
5 4 3 1 7
6 5 2 1 8
7 0 12 2 0
8 1 10 2 2
9 2 6 2 6
10 3 5 2 7
11 4 4 2 8
12 5 2 2 10
13 0 20 3 0
14 1 15 3 5
15 2 16 3 4
16 3 9 3 11
17 4 2 3 18
18 5 2 3 18
library(dplyr)
vals2use <- data %>%
group_by(group) %>%
filter(time==0) %>%
select(c(2,3)) %>%
rename(value4diff=value)
dataNew <- merge(data, vals2use, all = TRUE)
dataNew$diff <- dataNew$value4diff-dataNew$value
dataNew <- dataNew[,c(1,2,3,5)]
dataNew
group time value diff
1 1 0 10 0
2 1 1 8 2
3 1 2 6 4
4 1 3 5 5
5 1 4 3 7
6 1 5 2 8
7 2 0 12 0
8 2 1 10 2
9 2 2 6 6
10 2 3 5 7
11 2 4 4 8
12 2 5 2 10
13 3 0 20 0
14 3 1 15 5
15 3 2 16 4
16 3 3 9 11
17 3 4 2 18
18 3 5 2 18
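The merge above can probably be replaced by a named lookup vector, which keeps the original row order and column layout. This is only a sketch in base R; base_rows and baseline are made-up names, and it assumes exactly one time == 0 row per group.

```r
time  <- c(0,1,2,3,4,5,0,1,2,3,4,5,0,1,2,3,4,5)
value <- c(10,8,6,5,3,2,12,10,6,5,4,2,20,15,16,9,2,2)
group <- c(1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,3,3,3)
data  <- data.frame(time, value, group)

# one baseline value per group, named by group ("1", "2", "3")
base_rows <- data[data$time == 0, ]
baseline  <- setNames(base_rows$value, base_rows$group)

# look up each row's baseline by group name, then subtract as in the answers above
data$diff <- unname(baseline[as.character(data$group)]) - data$value
data
```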
Consider the following sample dataset. Id is an individual identifier.
rm(list=ls()); set.seed(1)
n<-100
X<-rbinom(n, 1, 0.5) #binary covariate
j<-rep (1:n)
dat<-data.frame(id=1:n, X)
ntp<- rep(4, n)
mat<-matrix(ncol=3,nrow=1)
m=0; w <- mat
for(l in ntp)
{
m=m+1
ft<- seq(from = 2, to = 8, length.out = l)
# ft<- seq(from = 1, to = 9, length.out = l)
ft<-sort(ft)
seq<-rep(ft,each=2)
seq<-c(0,seq,10)
matid<-cbind( matrix(seq,ncol=2,nrow=l+1,byrow=T ) ,m)
w<-rbind(w,matid)
}
d<-data.frame(w[-1,])
colnames(d)<-c("time1","time2","id")
D <- round( merge(d,dat,by="id") ,2) #merging dataset
nr<-nrow(D)
D$Survival_time<-round(rexp(nr, 0.1)+1,3)
head(D,15)
id time1 time2 X Survival_time
1 1 0 2 0 21.341
2 1 2 4 0 18.987
3 1 4 6 0 4.740
4 1 6 8 0 13.296
5 1 8 10 0 6.397
6 2 0 2 0 10.566
7 2 2 4 0 2.470
8 2 4 6 0 14.907
9 2 6 8 0 8.620
10 2 8 10 0 13.376
11 3 0 2 1 45.239
12 3 2 4 1 11.545
13 3 4 6 1 11.352
14 3 6 8 1 19.760
15 3 8 10 1 7.547
How can I obtain the value of Survival_time at the first row per individual where Survival_time is less than time2? I should end up with the following values:
id Survival_time
1 4.740
2 2.470
3 7.547
Also, how can I subset the data so that each individual's rows stop when this condition first occurs, i.e. obtain
id time1 time2 X Survival_time
1 1 0 2 0 21.341
2 1 2 4 0 18.987
3 1 4 6 0 4.740
6 2 0 2 0 10.566
7 2 2 4 0 2.470
11 3 0 2 1 45.239
12 3 2 4 1 11.545
13 3 4 6 1 11.352
14 3 6 8 1 19.760
15 3 8 10 1 7.547
Using data.table
library(data.table)
setDT(D)[, .SD[seq_len(.N) <= which(Survival_time < time2)[1]], id]
-output
id time1 time2 X Survival_time
1: 1 0 2 0 21.341
2: 1 2 4 0 18.987
3: 1 4 6 0 4.740
4: 2 0 2 0 10.566
5: 2 2 4 0 2.470
6: 3 0 2 1 45.239
7: 3 2 4 1 11.545
8: 3 4 6 1 11.352
9: 3 6 8 1 19.760
10: 3 8 10 1 7.547
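A base R sketch of the same logic, using only the 15 rows printed in the question (ids 1 to 3, typed in by hand). The names hit, pos, first_hit, first_rows and truncated are made up, and keeping all rows of an id that never meets the condition is an assumption of this sketch.

```r
D <- data.frame(
  id = rep(1:3, each = 5),
  time2 = rep(c(2, 4, 6, 8, 10), times = 3),
  Survival_time = c(21.341, 18.987, 4.740, 13.296, 6.397,
                    10.566, 2.470, 14.907, 8.620, 13.376,
                    45.239, 11.545, 11.352, 19.760, 7.547)
)

hit <- D$Survival_time < D$time2

# position of each row within its id
pos <- ave(seq_along(hit), D$id, FUN = seq_along)

# within-id position of the first hit (NA if the id has none), repeated per row
first_hit <- ave(seq_along(hit), D$id, FUN = function(i) match(TRUE, hit[i]))

# first qualifying Survival_time per id
first_rows <- D[which(pos == first_hit), c("id", "Survival_time")]
first_rows

# rows up to and including the first hit; ids without a hit are kept in full
truncated <- D[is.na(first_hit) | pos <= first_hit, ]
truncated
```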
Slight variation:
library(dplyr)
D %>% # Take D, and then
group_by(id) %>% # group by id, and then
filter(Survival_time < time2) %>% # keep Survival times < time2, and then
slice(1) %>% # keep the first row per id, and then
ungroup() # ungroup
You can use -
library(dplyr)
D %>%
group_by(id) %>%
summarise(Survival_time = Survival_time[match(TRUE, Survival_time < time2)])
#Also using which.max (note: it returns 1 when no row satisfies the condition)
#summarise(Survival_time = Survival_time[which.max(Survival_time < time2)])
# id Survival_time
# <int> <dbl>
#1 1 4.74
#2 2 2.47
#3 3 7.55
To select the rows up to that point you may use -
D %>%
group_by(id) %>%
filter(row_number() <= match(TRUE, Survival_time < time2)) %>%
ungroup
# id time1 time2 X Survival_time
# <int> <int> <int> <int> <dbl>
# 1 1 0 2 0 21.3
# 2 1 2 4 0 19.0
# 3 1 4 6 0 4.74
# 4 2 0 2 0 10.6
# 5 2 2 4 0 2.47
# 6 3 0 2 1 45.2
# 7 3 2 4 1 11.5
# 8 3 4 6 1 11.4
# 9 3 6 8 1 19.8
#10 3 8 10 1 7.55