R, find average length of consecutive time-steps in a data.frame

I have the following data.frame with time column sorted in ascending order:
colA=c(1,2,5,6,7,10,13,16,19,20,25,40,43,44,50,51,52,53,68,69,77,79,81,82)
colB=rnorm(24)
df=data.frame(time=colA, x=colB)
How can I count the runs of consecutive time-steps in the time column and take the average of their lengths?
In detail, I need to group the rows by runs of consecutive values in the time column, e.g. 1,2 and 5,6,7 and 19,20 and 43,44, etc., and then take the average of the lengths of these groups.

You can group clusters of consecutive observations like this:
df$group <- c(0, cumsum(diff(df$time) != 1)) + 1
Which gives:
df
#> time x group
#> 1 1 0.7443742 1
#> 2 2 0.1289818 1
#> 3 5 1.4882743 2
#> 4 6 -0.6626820 2
#> 5 7 -1.1606550 2
#> 6 10 0.3587742 3
#> 7 13 -0.1948464 4
#> 8 16 -0.2952820 5
#> 9 19 0.4966404 6
#> 10 20 0.4849128 6
#> 11 25 0.0187845 7
#> 12 40 0.6347746 8
#> 13 43 0.7544441 9
#> 14 44 0.8335890 9
#> 15 50 0.9657613 10
#> 16 51 1.2938800 10
#> 17 52 -0.1365510 10
#> 18 53 -0.4401387 10
#> 19 68 -1.2272839 11
#> 20 69 -0.2376531 11
#> 21 77 -0.9268582 12
#> 22 79 0.4112354 13
#> 23 81 -0.1988646 14
#> 24 82 -0.5574496 14
You can get the lengths of these groups by doing:
rle(df$group)$lengths
#> [1] 2 3 1 1 1 2 1 1 2 4 2 1 1 2
And the average length of the consecutive groups is:
mean(rle(df$group)$lengths)
#> [1] 1.714286
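Equivalently, since group is already a run identifier, tabulating it gives the same lengths; a small aside, not part of the original answer:
mean(table(df$group))
#> [1] 1.714286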
And you can get the average of x within each group using:
tapply(df$x, df$group, mean)
#> 1 2 3 4 5 6 7
#> 0.4366780 -0.1116876 0.3587742 -0.1948464 -0.2952820 0.4907766 0.0187845
#> 8 9 10 11 12 13 14
#> 0.6347746 0.7940166 0.4207379 -0.7324685 -0.9268582 0.4112354 -0.3781571

Another way, doing a kind of run-length encoding of the consecutive steps by hand:
l = diff(c(0, which(diff(df$time) != 1), nrow(df)))
l
# [1] 2 3 1 1 1 2 1 1 2 4 2 1 1 2
mean(l)
#[1] 1.714286
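If you are already working with dplyr, the same computation can be written as one pipe. This is only a sketch of the approach above, assuming the package is loaded:
library(dplyr)
df %>%
  mutate(group = cumsum(c(TRUE, diff(time) != 1))) %>% # new group whenever the gap is not 1
  count(group) %>%                                     # group lengths
  summarise(avg_len = mean(n))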


filter() rows from dataframe with condition on previous and next row, keeping NA values

I have a dataframe like this:
AA<-c(1,2,4,5,6,7,10,11,12,13,14,15)
BB<-c(32,21,21,NA,27,31,31,12,28,NA,48,7)
df<- data.frame(AA,BB)
I want to remove rows where the BB value is equal to the previous or next row, keeping only the first and last occurrence of each run of values in the BB column. I also want to keep NA rows. I arrived at this code, which is not far from what I want:
lighten_df <- df %>% filter(BB!=lag(BB) | BB!=lead(BB) | is.na(BB) )
which gives me:
> lighten_df
AA BB
1 1 32
2 2 21
3 5 NA
4 6 27
5 7 31
6 10 31
7 11 12
8 12 28
9 13 NA
10 14 48
11 15 7
My problem is that I would like to keep both the first and the last occurrence of the value 21 in column BB. This is the result I expect:
AA BB
1 1 32
2 2 21
3 4 21
4 5 NA
5 6 27
6 7 31
7 10 31
8 11 12
9 12 28
10 13 NA
11 14 48
12 15 7
Any idea?
I would suggest a different approach: define a grouping variable and keep the first and last rows within each group:
df %>%
  group_by(grp = data.table::rleid(BB)) %>%
  slice(unique(c(1, n())))
# # A tibble: 12 × 3
# # Groups: grp [10]
# AA BB grp
# <dbl> <dbl> <int>
# 1 1 32 1
# 2 2 21 2
# 3 4 21 2
# 4 5 NA 3
# 5 6 27 4
# 6 7 31 5
# 7 10 31 5
# 8 11 12 6
# 9 12 28 7
# 10 13 NA 8
# 11 14 48 9
# 12 15 7 10
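For comparison, here is a base-R sketch of the same idea (an addition, not from the original answer): build the run ids with rle(), then keep each run's first and last row. The ifelse() turns NA into a comparable string so the runs are well defined:
key <- ifelse(is.na(df$BB), "NA", as.character(df$BB)) # make NA comparable
r <- rle(key)
grp <- rep(seq_along(r$lengths), r$lengths)            # run id per row
idx <- unlist(tapply(seq_len(nrow(df)), grp, function(i) unique(c(i[1], i[length(i)]))))
lighten_df <- df[sort(idx), ]                          # first and last row of every run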

Is there a way to remove duplicates based on two columns but keep the one with the highest number in the third column? [duplicate]

This question already has answers here: Select the row with the maximum value in each group
I would like to take this dataset and remove the rows that have the same ID and Age (duplicates), keeping the one with the highest Month number.
ID Age Month
1 25 7
1 25 12
2 18 10
2 18 11
3 12 10
3 25 10
4 19 10
5 10 2
5 10 3
And I want the outcome to be:
ID Age Month
1 25 12
2 18 11
3 12 10
3 25 10
4 19 10
5 10 3
Note that it removed the duplicates but kept the version with the highest month number.
One solution option:
library(tidyverse)
df <- read.table(text = "ID Age Month
1 25 7
1 25 12
2 18 10
2 18 11
3 12 10
3 25 10
4 19 10
5 10 2
5 10 3", header = T)
df %>%
  group_by(ID, Age) %>%
  slice_max(Month)
#> # A tibble: 6 x 3
#> # Groups: ID, Age [6]
#> ID Age Month
#> <int> <int> <int>
#> 1 1 25 12
#> 2 2 18 11
#> 3 3 12 10
#> 4 3 25 10
#> 5 4 19 10
#> 6 5 10 3
Created on 2021-02-11 by the reprex package (v1.0.0)
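A data.table sketch of the same group-wise filter (added here for comparison; it assumes the data.table package and, like the filter approach below, keeps ties):
library(data.table)
setDT(df)[, .SD[Month == max(Month)], by = .(ID, Age)]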
Using the dplyr package, the solution:
df %>%
  group_by(ID, Age) %>%
  filter(Month == max(Month))
# A tibble: 6 x 3
# Groups: ID, Age [6]
ID Age Month
<dbl> <dbl> <dbl>
1 1 25 12
2 2 18 11
3 3 12 10
4 3 25 10
5 4 19 10
6 5 10 3
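If you prefer base R, a common sketch (an addition, not from the original answers) is to sort so the largest Month comes first within each ID/Age pair and then drop duplicates; note this keeps exactly one row per pair, whereas slice_max() and filter() keep ties:
df_sorted <- df[order(df$ID, df$Age, -df$Month), ]  # largest Month first within each pair
df_sorted[!duplicated(df_sorted[c("ID", "Age")]), ] # keep the first (max-Month) row per pair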

Finding cumulative second max per group in R

I have a dataset where I would like to create a new variable that is the cumulative second-largest value of another variable, and I would like to perform this calculation per group.
Let's say I create the following example data frame:
(df1 <- data.frame(patient = rep(1:5, each=8), visit = rep(1:2,each=4,5), trial = rep(1:4,10), var1 = sample(1:50,20,replace=TRUE)))
This is pretend data that represents 5 patients who each had 2 study visits, and each visit had 4 trials with a measurement taken (var1).
> head(df1,n=20)
patient visit trial var1
1 1 1 1 25
2 1 1 2 23
3 1 1 3 48
4 1 1 4 37
5 1 2 1 41
6 1 2 2 45
7 1 2 3 8
8 1 2 4 9
9 2 1 1 26
10 2 1 2 14
11 2 1 3 41
12 2 1 4 35
13 2 2 1 37
14 2 2 2 30
15 2 2 3 14
16 2 2 4 28
17 3 1 1 34
18 3 1 2 19
19 3 1 3 28
20 3 1 4 10
I would like to create a new variable, cum2ndmax, that is the cumulative 2nd largest value of var1 and I would like to group this variable by patient # and visit #.
I figured out how to calculate the cumulative 2nd max number like so:
df1$cum2ndmax <- sapply(seq_along(df1$var1),function(x){sort(df1$var1[seq(x)],decreasing=TRUE)[2]})
df1
However, this calculates the cumulative 2nd max across the whole dataset, not for each group. I have attempted to calculate this variable on grouped data like so, after installing and loading the dplyr package:
library(dplyr)
df2 <- df1 %>%
  group_by(patient, visit) %>%
  mutate(cum2ndmax = sapply(seq_along(df1$var1), function(x) {sort(df1$var1[seq(x)], decreasing = TRUE)[2]}))
But I get an error: Error: Problem with `mutate()` input `cum2ndmax`. x Input `cum2ndmax` can't be recycled to size 4.
Ideally, my result would look something like this:
patient visit trial var1 cum2ndmax
1 1 1 25 NA
1 1 2 23 23
1 1 3 48 25
1 1 4 37 37
1 2 1 41 NA
1 2 2 45 41
1 2 3 8 41
1 2 4 9 41
2 1 1 26 NA
2 1 2 14 14
2 1 3 41 26
2 1 4 35 35
… … … … …
Any help in getting this to work in R would be much appreciated! Thank you!
One dplyr and purrr option could be:
df1 %>%
  group_by(patient, visit) %>%
  mutate(cum_second_max = map_dbl(.x = seq_along(var1),
                                  ~ ifelse(.x == 1, NA, var1[dense_rank(-var1[1:.x]) == 2])))
patient visit trial var1 cum_second_max
<int> <int> <int> <int> <dbl>
1 1 1 1 25 NA
2 1 1 2 23 23
3 1 1 3 48 25
4 1 1 4 37 37
5 1 2 1 41 NA
6 1 2 2 45 41
7 1 2 3 8 41
8 1 2 4 9 41
9 2 1 1 26 NA
10 2 1 2 14 14
11 2 1 3 41 26
12 2 1 4 35 35
13 2 2 1 37 NA
14 2 2 2 30 30
15 2 2 3 14 30
16 2 2 4 28 30
17 3 1 1 34 NA
18 3 1 2 19 19
19 3 1 3 28 28
20 3 1 4 10 28
Here is an Rcpp solution.
cum_second_max is a modification of cummax which keeps track of the second maximum.
library(tidyverse)
Rcpp::cppFunction("
NumericVector cum_second_max(NumericVector x) {
  // Track the largest and second-largest values seen so far.
  double max_value = R_NegInf, max_value2 = NA_REAL;
  NumericVector result(x.length());
  for (int i = 0; i < x.length(); ++i) {
    if (x[i] > max_value) {
      // New maximum: the previous maximum becomes the second maximum.
      max_value2 = max_value;
      max_value = x[i];
    }
    else if (x[i] < max_value && x[i] > max_value2) {
      max_value2 = x[i];
    }
    // Until two distinct values have been seen, max_value2 is -Inf: report NA.
    result[i] = isinf(max_value2) ? NA_REAL : max_value2;
  }
  return result;
}
")
df1 %>%
  group_by(patient, visit) %>%
  mutate(c2max = cum_second_max(var1))
#> # A tibble: 20 x 5
#> # Groups: patient, visit [5]
#> patient visit trial var1 c2max
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 1 1 25 NA
#> 2 1 1 2 23 23
#> 3 1 1 3 48 25
#> 4 1 1 4 37 37
#> 5 1 2 1 41 NA
#> 6 1 2 2 45 41
#> 7 1 2 3 8 41
#> 8 1 2 4 9 41
#> 9 2 1 1 26 NA
#> 10 2 1 2 14 14
#> 11 2 1 3 41 26
#> 12 2 1 4 35 35
#> 13 2 2 1 37 NA
#> 14 2 2 2 30 30
#> 15 2 2 3 14 30
#> 16 2 2 4 28 30
#> 17 3 1 1 34 NA
#> 18 3 1 2 19 19
#> 19 3 1 3 28 28
#> 20 3 1 4 10 28
Thanks so much everyone! I really appreciate it and could not have solved this without your help. In the end, I used an approach similar to the one suggested by tmfmnk, since I was already using dplyr. Interestingly, for some reason the posted code gave me a column of values that just repeated the first row's number; with a small tweak changing dense_rank to order, I got exactly what I wanted, like this:
df1 %>%
  group_by(patient, visit) %>%
  mutate(cum_second_max = map_dbl(.x = seq_along(var1),
                                  ~ ifelse(.x == 1, NA, var1[order(-var1[1:.x])[2]])))
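For completeness, the sapply() idea from the question can also be applied per group in base R via ave(); a minimal sketch added here, assuming the same df1:
cum2nd <- function(v) sapply(seq_along(v), function(i) sort(v[1:i], decreasing = TRUE)[2])
df1$cum2ndmax <- ave(df1$var1, df1$patient, df1$visit, FUN = cum2nd)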

Duplicate clustered observations and create unique identifiers for the duplicated clusters

Consider the small dataset df1. There are 5 clusters identified by ID, row_numbers contains a unique value for each observation, and weights specifies how many copies of each cluster we want.
df1 <- data.frame(ID = c("10","20","30","30","30","40","40","50","50","50","50"), row_numbers = c(1,2,3,4,5,6,7,8,9,10,11), weights = c(4,3,2,2,2,3,3,2,2,2,2))
df1
#> ID row_numbers weights
#> 1 10 1 4
#> 2 20 2 3
#> 3 30 3 2
#> 4 30 4 2
#> 5 30 5 2
#> 6 40 6 3
#> 7 40 7 3
#> 8 50 8 2
#> 9 50 9 2
#> 10 50 10 2
#> 11 50 11 2
The expected output is df2. Its most important part is the new variable newID, which stores the unique identifiers for the duplicated clusters as consecutive integers starting from 1.
df2 <- data.frame(ID = c("10","10","10","10","20","20","20","30","30","30","30","30","30","40","40","40","40","40","40","50","50","50","50","50","50","50","50"), row_numbers = c(1,1,1,1,2,2,2,3,3,4,4,5,5,6,6,6,7,7,7,8,8,9,9,10,10,11,11), weights = c(4,4,4,4,3,3,3,2,2,2,2,2,2,3,3,3,3,3,3,2,2,2,2,2,2,2,2), newID = c(1,2,3,4,5,6,7,8,8,8,9,9,9,10,10,11,11,12,12,13,13,13,13,14,14,14,14))
df2
#> ID row_numbers weights newID
#> 1 10 1 4 1
#> 2 10 1 4 2
#> 3 10 1 4 3
#> 4 10 1 4 4
#> 5 20 2 3 5
#> 6 20 2 3 6
#> 7 20 2 3 7
#> 8 30 3 2 8
#> 9 30 3 2 8
#> 10 30 4 2 8
#> 11 30 4 2 9
#> 12 30 5 2 9
#> 13 30 5 2 9
#> 14 40 6 3 10
#> 15 40 6 3 10
#> 16 40 6 3 11
#> 17 40 7 3 11
#> 18 40 7 3 12
#> 19 40 7 3 12
#> 20 50 8 2 13
#> 21 50 8 2 13
#> 22 50 9 2 13
#> 23 50 9 2 13
#> 24 50 10 2 14
#> 25 50 10 2 14
#> 26 50 11 2 14
#> 27 50 11 2 14
Here's a solution using a split-apply-bind approach:
df3 <- do.call(rbind, lapply(split(df1, df1$ID), function(x) {
  group_size <- nrow(x)
  n_groups <- x$weights[1]
  if (is.na(n_groups)) n_groups <- 1
  if (n_groups < 1) n_groups <- 1
  group_labels <- rep(paste(x$ID[1], seq(n_groups)), each = group_size)
  x <- x[rep(seq(group_size), each = n_groups), ]
  x$newID <- group_labels
  x
}))
df3$newID <- as.numeric(as.factor(df3$newID))
df3 <- `rownames<-`(df3, seq(nrow(df3)))
Which matches your expected output:
df3
#> ID row_numbers weights newID
#> 1 10 1 4 1
#> 2 10 1 4 2
#> 3 10 1 4 3
#> 4 10 1 4 4
#> 5 20 2 3 5
#> 6 20 2 3 6
#> 7 20 2 3 7
#> 8 30 3 2 8
#> 9 30 3 2 8
#> 10 30 4 2 8
#> 11 30 4 2 9
#> 12 30 5 2 9
#> 13 30 5 2 9
#> 14 40 6 3 10
#> 15 40 6 3 10
#> 16 40 6 3 11
#> 17 40 7 3 11
#> 18 40 7 3 12
#> 19 40 7 3 12
#> 20 50 8 2 13
#> 21 50 8 2 13
#> 22 50 9 2 13
#> 23 50 9 2 13
#> 24 50 10 2 14
#> 25 50 10 2 14
#> 26 50 11 2 14
#> 27 50 11 2 14
And we can show this is identical to your desired result:
identical(df2, df3)
#> [1] TRUE
A solution with data.table:
library(data.table)
df1 <- data.frame(ID = c("10","20","30","30","30","40","40","50","50","50","50"), row_numbers = c(1,2,3,4,5,6,7,8,9,10,11), weights = c(4,3,2,2,2,3,3,2,2,2,2))
dt1 <- data.table(df1)
# .x is a data.table with columns: ID, row_numbers (integer), weights (integer)
duplicate_weight <- function(.x) {
  # get the part to keep unchanged
  untouched <- list(
    .x[is.na(weights), .(ID, row_numbers, weights = 1, repetition_position = 1)],
    .x[weights == 0, .(ID, row_numbers, weights = 1, repetition_position = 1)],
    .x[weights == 1, .(ID, row_numbers, weights = 1, repetition_position = 1)]
  )
  # list of the weights > 1
  weights_list <- sort(unique(.x[['weights']]))
  weights_list <- weights_list[weights_list > 1]
  # repeat rows according to their weights
  repeated <- lapply(weights_list, # for each weight
    function(.y) {
      rbindlist( # make a data.table
        lapply(1:.y, # repeated .y times
          function(.z) {
            .x[weights == .y, .(ID, row_numbers, weights = 1, repetition_position = .z)]
          }
        )
      )
    }
  )
  result <- rbindlist(c(untouched, repeated))
  setorder(result, ID, repetition_position)
  result[, new_id := .GRP, by = .(ID, repetition_position)]
  result[, repetition_position := NULL]
  result
}
duplicate_weight(dt1)
It's similar to @Allan Cameron's approach.
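As a quick sanity check (a sketch added here, not part of the original answer): every row should come out duplicated weights times, so the result should have sum(df1$weights) rows:
res <- duplicate_weight(dt1)
nrow(res) == sum(df1$weights)
#> [1] TRUE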

calculate difference between rows, but keep the raw value by group

I have a dataframe with cumulative values by group that I need to convert back to raw values. The lag function works pretty well here, but for the first row of each group I get back either NA or the difference relative to the previous group.
How can I get the first number of each group instead of an NA value or a between-group difference?
My dummy data:
# make example
df <- data.frame(id = rep(1:3, each = 5),
                 hour = rep(1:5, 3),
                 value = sample(1:15))
First I calculate the cumulative values, then convert them back to raw values, i.e. value should equal valBack. The suggestion mutate(valBack = c(cumsum[1], (cumsum - lag(cumsum))[-1])) just replaces the first (NA) value with the correct value, but it does not work for the first number of each group:
df %>%
  group_by(id) %>%
  dplyr::mutate(cumsum = cumsum(value)) %>%
  mutate(valBack = c(cumsum[1], (cumsum - lag(cumsum))[-1])) # skip the first value in the lag vector
Which results:
# A tibble: 15 x 5
# Groups: id [3]
id hour value cumsum valBack
<int> <int> <int> <int> <int>
1 1 1 10 10 10 # this works
2 1 2 13 23 13
3 1 3 8 31 8
4 1 4 4 35 4
5 1 5 9 44 9
6 2 1 12 12 -32 # here the new group starts; the number should be 12, instead it is -32??
7 2 2 14 26 14
8 2 3 5 31 5
9 2 4 15 46 15
10 2 5 1 47 1
11 3 1 2 2 -45 # here should be 2 instead of -45
12 3 2 3 5 3
13 3 3 6 11 6
14 3 4 11 22 11
15 3 5 7 29 7
I want a safe calculation that makes valBack equal to value. (Of course, in the real data I don't have a value column, just the cumsum column.)
Try:
library(dplyr)
df %>%
  group_by(id) %>%
  mutate(
    cumsum = cumsum(value),
    valBack = c(cumsum[1], (cumsum - lag(cumsum))[-1])
  )
Giving:
# A tibble: 15 x 5
# Groups: id [3]
id hour value cumsum valBack
<int> <int> <int> <int> <int>
1 1 1 10 10 10
2 1 2 13 23 13
3 1 3 8 31 8
4 1 4 4 35 4
5 1 5 9 44 9
6 2 1 12 12 12
7 2 2 14 26 14
8 2 3 5 31 5
9 2 4 15 46 15
10 2 5 1 47 1
11 3 1 2 2 2
12 3 2 3 5 3
13 3 3 6 11 6
14 3 4 11 22 11
15 3 5 7 29 7
While the accepted answer works, it is more complicated than it needs to be. If you look at the lag function's signature, you will see it has more arguments:
dplyr::lag(x, n = 1L, default = NA, order_by = NULL, ...)
Here we can set default to 0 to get the desired output:
library(dplyr)
df %>%
  group_by(id) %>%
  mutate(cumsum = cumsum(value),
         rawdata = cumsum - lag(cumsum, default = 0))
#> # A tibble: 15 x 5
#> # Groups: id [3]
#> id hour value cumsum rawdata
#> <int> <int> <int> <int> <dbl>
#> 1 1 1 2 2 2
#> 2 1 2 1 3 1
#> 3 1 3 13 16 13
#> 4 1 4 15 31 15
#> 5 1 5 10 41 10
#> 6 2 1 3 3 3
#> 7 2 2 8 11 8
#> 8 2 3 4 15 4
#> 9 2 4 12 27 12
#> 10 2 5 11 38 11
#> 11 3 1 14 14 14
#> 12 3 2 6 20 6
#> 13 3 3 5 25 5
#> 14 3 4 7 32 7
#> 15 3 5 9 41 9
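The same default-fill idea carries over to data.table via shift(); a minimal sketch added here, assuming the package is installed:
library(data.table)
setDT(df)[, cumsum := cumsum(value), by = id]   # cumulative values per group
df[, rawdata := cumsum - shift(cumsum, fill = 0), by = id] # fill = 0 plays the role of default = 0
df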
