calculate difference between rows, but keep the raw value by group - r

I have a dataframe with cumulative values by group that I need to convert back to raw values. The lag function works well here, but for the first number in each sequence I get back either NA or the difference between the last value of the previous group and the first value of the current one.
How can I get the first number in each group instead of NA or the difference between groups?
My dummy data:
# make example
df <- data.frame(id = rep(1:3, each = 5),
hour = rep(1:5, 3),
value = sample(1:15))
First I calculate the cumulative values, then convert them back to raw values, i.e. value should equal valBack. The suggestion mutate(valBack = c(cumsum[1], (cumsum - lag(cumsum))[-1])) replaces the first (NA) value with the correct one, but it does not work for the first row of each subsequent group:
df %>%
group_by(id) %>%
dplyr::mutate(cumsum = cumsum(value)) %>%
mutate(valBack = c(cumsum[1], (cumsum - lag(cumsum))[-1])) # skip the first value in a lag vector
Which results in:
# A tibble: 15 x 5
# Groups: id [3]
id hour value cumsum valBack
<int> <int> <int> <int> <int>
1 1 1 10 10 10 # this works
2 1 2 13 23 13
3 1 3 8 31 8
4 1 4 4 35 4
5 1 5 9 44 9
6 2 1 12 12 -32 # a new group starts here; this should be 12, not -32
7 2 2 14 26 14
8 2 3 5 31 5
9 2 4 15 46 15
10 2 5 1 47 1
11 3 1 2 2 -45 # this should be 2 instead of -45
12 3 2 3 5 3
13 3 3 6 11 6
14 3 4 11 22 11
15 3 5 7 29 7
I want a safe calculation that makes valBack equal to value. (Of course, in the real data I don't have the value column, just the cumsum column.)

Try:
library(dplyr)
df %>%
group_by(id) %>%
mutate(
cumsum = cumsum(value),
valBack = c(cumsum[1], (cumsum - lag(cumsum))[-1])
)
Giving:
# A tibble: 15 x 5
# Groups: id [3]
id hour value cumsum valBack
<int> <int> <int> <int> <int>
1 1 1 10 10 10
2 1 2 13 23 13
3 1 3 8 31 8
4 1 4 4 35 4
5 1 5 9 44 9
6 2 1 12 12 12
7 2 2 14 26 14
8 2 3 5 31 5
9 2 4 15 46 15
10 2 5 1 47 1
11 3 1 2 2 2
12 3 2 3 5 3
13 3 3 6 11 6
14 3 4 11 22 11
15 3 5 7 29 7

While the accepted answer works, it is more complicated than it needs to be. The lag function takes several arguments:
dplyr::lag(x, n = 1L, default = NA, order_by = NULL, ...)
Here we can set default to 0, so that the first row of each group is compared against 0 instead of NA, which gives the desired output:
library(dplyr)
df %>%
group_by(id) %>%
mutate(cumsum = cumsum(value),
rawdata = cumsum - lag(cumsum, default = 0))
#> # A tibble: 15 x 5
#> # Groups: id [3]
#> id hour value cumsum rawdata
#> <int> <int> <int> <int> <dbl>
#> 1 1 1 2 2 2
#> 2 1 2 1 3 1
#> 3 1 3 13 16 13
#> 4 1 4 15 31 15
#> 5 1 5 10 41 10
#> 6 2 1 3 3 3
#> 7 2 2 8 11 8
#> 8 2 3 4 15 4
#> 9 2 4 12 27 12
#> 10 2 5 11 38 11
#> 11 3 1 14 14 14
#> 12 3 2 6 20 6
#> 13 3 3 5 25 5
#> 14 3 4 7 32 7
#> 15 3 5 9 41 9
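Since the real data only has the cumulative column, here is a minimal sketch of the same idea applied to such data. The data frame df_cum and its values are made up for illustration; the base R line with ave() and diff() is an alternative that is not part of the answers above.

```r
library(dplyr)

# Hypothetical data: only the cumulative column is available
df_cum <- data.frame(
  id     = rep(1:2, each = 3),
  cumsum = c(10, 23, 31, 12, 26, 31)
)

# dplyr: the first row of each group is compared against 0 instead of NA
res <- df_cum %>%
  group_by(id) %>%
  mutate(raw = cumsum - dplyr::lag(cumsum, default = 0)) %>%
  ungroup()

# Base R alternative: prepend 0 within each group and take successive differences
raw_base <- ave(df_cum$cumsum, df_cum$id, FUN = function(x) diff(c(0, x)))

res$raw
#> [1] 10 13  8 12 14  5
```

Both versions rely on the rows already being in cumulative order within each id.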


Changing number of observation in a dataset by IDs according to a given value

I have this dataset in R:
I want to expand the data according to the nb variable, so that ID = 1 has 5 rows and ID = 2 has 12 rows, as shown below:
Is there an R function I could use to transform my data this way?
Thanks in advance
We can use uncount from tidyr to expand the rows based on the 'nb' column. By default it removes that column (.remove = TRUE), so set .remove = FALSE to keep it, then create nb_long with row_number() after grouping by ID.
library(dplyr)
library(tidyr)
df1 %>%
uncount(nb, .remove = FALSE) %>%
group_by(ID) %>%
mutate(nb_long = row_number()) %>%
ungroup()
-output
# A tibble: 17 x 3
ID nb nb_long
<int> <dbl> <int>
1 1 5 1
2 1 5 2
3 1 5 3
4 1 5 4
5 1 5 5
6 2 12 1
7 2 12 2
8 2 12 3
9 2 12 4
10 2 12 5
11 2 12 6
12 2 12 7
13 2 12 8
14 2 12 9
15 2 12 10
16 2 12 11
17 2 12 12
data
df1 <- structure(list(ID = 1:2, nb = c(5, 12)),
class = "data.frame", row.names = c(NA,
-2L))
Here is another option. We map each nb to the sequence from 1 to nb, which gives a list column, and then unnest that column into rows.
#packages
library(tidyverse)
#data
df1 <- structure(list(ID = 1:2, nb = c(5, 12)),
class = "data.frame", row.names = c(NA,
-2L))
#solution
df1 %>%
mutate(nums = map(nb, ~seq(1, .x, by = 1))) %>%
unnest_longer(nums)
#> # A tibble: 17 x 3
#> ID nb nums
#> <int> <dbl> <dbl>
#> 1 1 5 1
#> 2 1 5 2
#> 3 1 5 3
#> 4 1 5 4
#> 5 1 5 5
#> 6 2 12 1
#> 7 2 12 2
#> 8 2 12 3
#> 9 2 12 4
#> 10 2 12 5
#> 11 2 12 6
#> 12 2 12 7
#> 13 2 12 8
#> 14 2 12 9
#> 15 2 12 10
#> 16 2 12 11
#> 17 2 12 12
We can try the following data.table option (using df1 as defined above):
> library(data.table)
> setDT(df1)[, .(nb_long = 1:nb), by = .(ID, nb)]
ID nb nb_long
1: 1 5 1
2: 1 5 2
3: 1 5 3
4: 1 5 4
5: 1 5 5
6: 2 12 1
7: 2 12 2
8: 2 12 3
9: 2 12 4
10: 2 12 5
11: 2 12 6
12: 2 12 7
13: 2 12 8
14: 2 12 9
15: 2 12 10
16: 2 12 11
17: 2 12 12
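For comparison, the same expansion can be sketched in base R without extra packages. This is only an illustration, not taken from the answers above: rep() repeats each row index nb times and sequence() builds the within-ID counter.

```r
df1 <- data.frame(ID = 1:2, nb = c(5, 12))

# Repeat each row index nb times, then index the data frame with it
out <- df1[rep(seq_len(nrow(df1)), times = df1$nb), ]

# sequence(c(5, 12)) gives 1:5 followed by 1:12
out$nb_long <- sequence(df1$nb)
rownames(out) <- NULL

nrow(out)
#> [1] 17
```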

Matching node IDs of trees with the same topology

I have two phylogenetic trees which have the same topology (except for branch lengths):
In R using ape:
t1 <- ape::read.tree(file="",text="(((HS:72,((CP:30,CL:30.289473923):32,RN:62):10):2,(CS:63,BS:63):11):5,LA:79);")
t2 <- ape::read.tree(file="",text="(((((CP:39,CL:39):29,RN:68):9,HS:77):5,(BS:63,CS:63):19):14,LA:96);")
> ape::all.equal.phylo(t1,t2,use.edge.length = F,use.tip.label = T)
[1] TRUE
I want to compute the mean branch lengths across the two trees. The problem is that, although their topologies are identical, their nodes are not listed in the same order, and not all tree nodes are labeled tips, so I don't think there's a simple join solution:
> head(tidytree::as_tibble(t1))
# A tibble: 6 x 4
parent node branch.length label
<int> <int> <dbl> <chr>
1 10 1 72 HS
2 12 2 30 CP
3 12 3 30.3 CL
4 11 4 62 RN
5 13 5 63 CS
6 13 6 63 BS
> tail(tidytree::as_tibble(t1))
# A tibble: 6 x 4
parent node branch.length label
<int> <int> <dbl> <chr>
1 8 8 NA NA
2 8 9 5 NA
3 9 10 2 NA
4 10 11 10 NA
5 11 12 32 NA
6 9 13 11 NA
> head(tidytree::as_tibble(t2))
# A tibble: 6 x 4
parent node branch.length label
<int> <int> <dbl> <chr>
1 12 1 39 CP
2 12 2 39 CL
3 11 3 68 RN
4 10 4 77 HS
5 13 5 63 BS
6 13 6 63 CS
> tail(tidytree::as_tibble(t2))
# A tibble: 6 x 4
parent node branch.length label
<int> <int> <dbl> <chr>
1 8 8 NA NA
2 8 9 14 NA
3 9 10 5 NA
4 10 11 9 NA
5 11 12 29 NA
6 9 13 19 NA
So it's not clear to me how to pair up corresponding branch lengths in order to take their mean.
Any idea how to match them, or how to reorder t2 according to t1?
Supposedly phytools' matchNodes function is meant for this, but it doesn't seem to get it right:
phytools::matchNodes(t1, t2,method = "descendants")
tr1 tr2
[1,] 8 8
[2,] 9 9
[3,] 10 10
[4,] 11 11
[5,] 12 12
[6,] 13 13
At the very least I'd expect it to match the tips correctly, i.e.:
dplyr::left_join(dplyr::filter(tidytree::as_tibble(t1),!is.na(label)) %>% dplyr::select(node,label) %>% dplyr::rename(t1.node=node),
+ dplyr::filter(tidytree::as_tibble(t2),!is.na(label)) %>% dplyr::select(node,label) %>% dplyr::rename(t2.node=node))
Joining, by = "label"
# A tibble: 7 x 3
t1.node label t2.node
<int> <chr> <int>
1 1 HS 4
2 2 CP 1
3 3 CL 2
4 4 RN 3
5 5 CS 6
6 6 BS 5
7 7 LA 7
But that's not happening.
Ultimately the information needed for matching is in these tree tibbles, because they list the parent of each node, but in practice using that information to match the nodes probably requires some recursive steps.
It seems that ape's makeNodeLabel with method = "md5sum", which labels each internal node based on its descendant tip labels, achieves that:
t1 <- ape::read.tree(file="",text="(((HS:72,((CP:30,CL:30.289473923):32,RN:62):10):2,(CS:63,BS:63):11):5,LA:79);")
t2 <- ape::read.tree(file="",text="(((((CP:39,CL:39):29,RN:68):9,HS:77):5,(BS:63,CS:63):19):14,LA:96);")
dplyr::left_join(tidytree::as_tibble(ape::makeNodeLabel(t1, method = "md5sum")) %>% dplyr::select(node,label) %>% dplyr::rename(t1.node=node),
tidytree::as_tibble(ape::makeNodeLabel(t2, method = "md5sum")) %>% dplyr::select(node,label) %>% dplyr::rename(t2.node=node))
Joining, by = "label"
# A tibble: 13 x 3
t1.node label t2.node
<int> <chr> <int>
1 1 HS 4
2 2 CP 1
3 3 CL 2
4 4 RN 3
5 5 CS 6
6 6 BS 5
7 7 LA 7
8 8 da5f57f0a757f7211fcf84c540d9531a 8
9 9 9bffe86cf0a2650b6a3f0d494c0183a9 9
10 10 bcf7b41992a064acd2e3e66fee7fe2d4 10
11 11 d50e0698114c621b49322697267900b7 11
12 12 f0a8c7fa67831514e65cdadbc68c3d31 12
13 13 82ab4cf8ae4a4df14cf87a48dc2638e0 13
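Building on these matched labels, the mean branch lengths asked about in the question could be computed roughly as follows. This is a sketch using only ape; edge_lengths_by_label is a hypothetical helper introduced here, and it assumes each branch can be identified by the label of its child node.

```r
library(ape)

t1 <- read.tree(text = "(((HS:72,((CP:30,CL:30.289473923):32,RN:62):10):2,(CS:63,BS:63):11):5,LA:79);")
t2 <- read.tree(text = "(((((CP:39,CL:39):29,RN:68):9,HS:77):5,(BS:63,CS:63):19):14,LA:96);")

# Hypothetical helper: branch lengths named by the label of each edge's child node
edge_lengths_by_label <- function(tr) {
  tr <- makeNodeLabel(tr, method = "md5sum")
  labels <- c(tr$tip.label, tr$node.label)   # node numbers index into this vector
  setNames(tr$edge.length, labels[tr$edge[, 2]])
}

len1 <- edge_lengths_by_label(t1)
len2 <- edge_lengths_by_label(t2)

# Align t2's branches to t1's edge order by label, then average
mean_len <- (len1 + len2[names(len1)]) / 2

# Copy of t1 carrying the averaged branch lengths
t_mean <- t1
t_mean$edge.length <- unname(mean_len)

mean_len[c("HS", "CP", "LA")]
#>   HS   CP   LA 
#> 74.5 34.5 87.5
```

The root has no incoming branch, so it does not appear in mean_len.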

Is there a way to remove duplicates based on two columns but keep the one with highest number in the third column? [duplicate]

I would like to take this dataset and remove rows that are duplicates on ID and Age, keeping only the one with the highest Month number.
ID|Age|Month|
1 25 7
1 25 12
2 18 10
2 18 11
3 12 10
3 25 10
4 19 10
5 10 2
5 10 3
And have the outcome be
ID|Age|Month
1 25 12
2 18 11
3 12 10
3 25 10
4 19 10
5 10 3
Note that it removed the duplicates but kept the version with the highest month number.
One option is slice_max() after grouping by ID and Age:
library(tidyverse)
df <- read.table(text = "ID Age Month
1 25 7
1 25 12
2 18 10
2 18 11
3 12 10
3 25 10
4 19 10
5 10 2
5 10 3", header = T)
df %>%
group_by(ID, Age) %>%
slice_max(Month)
#> # A tibble: 6 x 3
#> # Groups: ID, Age [6]
#> ID Age Month
#> <int> <int> <int>
#> 1 1 25 12
#> 2 2 18 11
#> 3 3 12 10
#> 4 3 25 10
#> 5 4 19 10
#> 6 5 10 3
Created on 2021-02-11 by the reprex package (v1.0.0)
Another solution using the dplyr package, filtering each group to its maximum Month:
df %>%
+ group_by(ID, Age) %>%
+ filter(Month == max(Month))
# A tibble: 6 x 3
# Groups: ID, Age [6]
ID Age Month
<dbl> <dbl> <dbl>
1 1 25 12
2 2 18 11
3 3 12 10
4 3 25 10
5 4 19 10
6 5 10 3
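Note that both answers above keep every row that ties for the maximum Month within a group (slice_max() has with_ties = TRUE by default). If exactly one row per ID and Age is required, a possible sketch is to sort and then keep the first row of each combination; this variant is an assumption about what may be wanted, not part of the original answers.

```r
library(dplyr)

df <- data.frame(
  ID    = c(1, 1, 2, 2, 3, 3, 4, 5, 5),
  Age   = c(25, 25, 18, 18, 12, 25, 19, 10, 10),
  Month = c(7, 12, 10, 11, 10, 10, 10, 2, 3)
)

# Sort so the highest Month comes first within each ID/Age pair,
# then keep only the first row of each pair
res <- df %>%
  arrange(ID, Age, desc(Month)) %>%
  distinct(ID, Age, .keep_all = TRUE)

res$Month
#> [1] 12 11 10 10 10  3
```

An equivalent within the grouped pipeline would be slice_max(Month, n = 1, with_ties = FALSE).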

Finding cumulative second max per group in R

I have a dataset where I would like to create a new variable that is the cumulative second largest value of another variable, and I would like to perform this function per group.
Let's say I create the following example data frame:
(df1 <- data.frame(patient = rep(1:5, each=8), visit = rep(1:2,each=4,5), trial = rep(1:4,10), var1 = sample(1:50,20,replace=TRUE)))
This is pretend data that represents 5 patients who each had 2 study visits, and each visit had 4 trials with a measurement taken (var1).
> head(df1,n=20)
patient visit trial var1
1 1 1 1 25
2 1 1 2 23
3 1 1 3 48
4 1 1 4 37
5 1 2 1 41
6 1 2 2 45
7 1 2 3 8
8 1 2 4 9
9 2 1 1 26
10 2 1 2 14
11 2 1 3 41
12 2 1 4 35
13 2 2 1 37
14 2 2 2 30
15 2 2 3 14
16 2 2 4 28
17 3 1 1 34
18 3 1 2 19
19 3 1 3 28
20 3 1 4 10
I would like to create a new variable, cum2ndmax, that is the cumulative 2nd largest value of var1 and I would like to group this variable by patient # and visit #.
I figured out how to calculate the cumulative 2nd max number like so:
df1$cum2ndmax <- sapply(seq_along(df1$var1),function(x){sort(df1$var1[seq(x)],decreasing=TRUE)[2]})
df1
However, this calculates the cumulative 2nd max across the whole dataset, not for each group. I have attempted to calculate this variable using grouped data like so after installing and loading package dplyr:
library(dplyr)
df2 <- df1 %>%
group_by(patient,visit) %>%
mutate(cum2ndmax = sapply(seq_along(df1$var1),function(x){sort(df1$var1[seq(x)],decreasing=TRUE)[2]}))
But I get an error: Error: Problem with mutate() input cum2ndmax. x Input cum2ndmax can't be recycled to size 4.
Ideally, my result would look something like this:
patient visit trial var1 cum2ndmax
1 1 1 25 NA
1 1 2 23 23
1 1 3 48 25
1 1 4 37 37
1 2 1 41 NA
1 2 2 45 41
1 2 3 8 41
1 2 4 9 41
2 1 1 26 NA
2 1 2 14 14
2 1 3 41 26
2 1 4 35 35
… … … … …
Any help in getting this to work in R would be much appreciated! Thank you!
One dplyr and purrr option could be:
library(dplyr)
library(purrr)
df1 %>%
group_by(patient, visit) %>%
mutate(cum_second_max = map_dbl(.x = seq_along(var1),
~ ifelse(.x == 1, NA, var1[dense_rank(-var1[1:.x]) == 2])))
patient visit trial var1 cum_second_max
<int> <int> <int> <int> <dbl>
1 1 1 1 25 NA
2 1 1 2 23 23
3 1 1 3 48 25
4 1 1 4 37 37
5 1 2 1 41 NA
6 1 2 2 45 41
7 1 2 3 8 41
8 1 2 4 9 41
9 2 1 1 26 NA
10 2 1 2 14 14
11 2 1 3 41 26
12 2 1 4 35 35
13 2 2 1 37 NA
14 2 2 2 30 30
15 2 2 3 14 30
16 2 2 4 28 30
17 3 1 1 34 NA
18 3 1 2 19 19
19 3 1 3 28 28
20 3 1 4 10 28
Here is an Rcpp solution.
cum_second_max is a modification of cummax which keeps track of the second maximum.
library(tidyverse)
Rcpp::cppFunction("
NumericVector cum_second_max(NumericVector x) {
double max_value = R_NegInf, max_value2 = NA_REAL;
NumericVector result(x.length());
for (int i = 0 ; i < x.length() ; ++i) {
if (x[i] > max_value) {
max_value2 = max_value;
max_value = x[i];
}
else if (x[i] < max_value && x[i] > max_value2) {
max_value2 = x[i];
}
result[i] = isinf(max_value2) ? NA_REAL : max_value2;
}
return result;
}
")
df1 %>%
group_by(patient, visit) %>%
mutate(
c2max = cum_second_max(var1)
)
#> # A tibble: 20 x 5
#> # Groups: patient, visit [5]
#> patient visit trial var1 c2max
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 1 1 25 NA
#> 2 1 1 2 23 23
#> 3 1 1 3 48 25
#> 4 1 1 4 37 37
#> 5 1 2 1 41 NA
#> 6 1 2 2 45 41
#> 7 1 2 3 8 41
#> 8 1 2 4 9 41
#> 9 2 1 1 26 NA
#> 10 2 1 2 14 14
#> 11 2 1 3 41 26
#> 12 2 1 4 35 35
#> 13 2 2 1 37 NA
#> 14 2 2 2 30 30
#> 15 2 2 3 14 30
#> 16 2 2 4 28 30
#> 17 3 1 1 34 NA
#> 18 3 1 2 19 19
#> 19 3 1 3 28 28
#> 20 3 1 4 10 28
Thanks so much everyone! I really appreciate it and could not have solved this without your help. In the end I used an approach similar to the one suggested by tmfmnk, since I was already using dplyr. With the code as suggested, for some reason I got a column that just repeated the first row's number. With a small tweak, changing dense_rank to order, I got exactly what I wanted:
df1 %>%
group_by(patient, visit) %>%
mutate(cum_second_max = map_dbl(.x = seq_along(var1),
~ ifelse(.x == 1, NA, var1[order(-var1[1:.x])[2]])))
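For reference, the original grouped attempt fails because df1$var1 inside mutate() refers to the whole column rather than the current group's values, so the result cannot be recycled to the group size; using the bare column name var1 avoids that. The same per-group calculation can also be sketched in base R with ave(). The helper name cum_second_max and the small data set below (values taken from the expected output in the question) are for illustration only.

```r
# Small fixed data in the shape of the question
df1 <- data.frame(
  patient = rep(1, 8),
  visit   = rep(1:2, each = 4),
  trial   = rep(1:4, 2),
  var1    = c(25, 23, 48, 37, 41, 45, 8, 9)
)

# Hypothetical helper: second largest value seen so far (NA for the first element)
cum_second_max <- function(x) {
  sapply(seq_along(x), function(i) sort(x[seq_len(i)], decreasing = TRUE)[2])
}

# ave() applies the helper separately within each patient/visit group
df1$cum2ndmax <- ave(df1$var1, df1$patient, df1$visit, FUN = cum_second_max)

df1$cum2ndmax
#> [1] NA 23 25 37 NA 41 41 41
```

Like the order() version above, this counts repeated values separately, so a tied maximum is reported as the second largest.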

Rolling sum in dplyr

set.seed(123)
df <- data.frame(x = sample(1:10, 20, replace = T), id = rep(1:2, each = 10))
For each id, I want to create a column which has the sum of previous 5 x values.
df %>% group_by(id) %>% mutate(roll.sum = c(x[1:4], zoo::rollapply(x, 5, sum)))
# Groups: id [2]
x id roll.sum
<int> <int> <int>
3 1 3
8 1 8
5 1 5
9 1 9
10 1 10
1 1 36
6 1 39
9 1 40
6 1 41
5 1 37
10 2 10
5 2 5
7 2 7
6 2 6
2 2 2
9 2 39
3 2 32
1 2 28
4 2 25
10 2 29
The 6th row should be 35 (3 + 8 + 5 + 9 + 10), the 7th row should be 33 (8 + 5 + 9 + 10 + 1) and so on.
However, the above function is also including the row itself for calculation. How can I fix it?
library(zoo)
df %>% group_by(id) %>%
mutate(Sum_prev = rollapply(x, list(-(1:5)), sum, fill=NA, align = "right", partial=F))
# you can use rollapply(x, list((1:5)), sum, fill=NA, align = "left", partial=F)
# to sum the next 5 elements, skipping the current one
x id Sum_prev
1 3 1 NA
2 8 1 NA
3 5 1 NA
4 9 1 NA
5 10 1 NA
6 1 1 35
7 6 1 33
8 9 1 31
9 6 1 35
10 5 1 32
11 10 2 NA
12 5 2 NA
13 7 2 NA
14 6 2 NA
15 2 2 NA
16 9 2 30
17 3 2 29
18 1 2 27
19 4 2 21
20 10 2 19
There is the rollify function in the tibbletime package that you could use. You can read about it in this vignette: Rolling calculations in tibbletime.
library(tibbletime)
library(dplyr)
rollig_sum <- rollify(.f = sum, window = 5)
df %>%
group_by(id) %>%
mutate(roll.sum = lag(rollig_sum(x))) #added lag() here
# A tibble: 20 x 3
# Groups: id [2]
# x id roll.sum
# <int> <int> <int>
# 1 3 1 NA
# 2 8 1 NA
# 3 5 1 NA
# 4 9 1 NA
# 5 10 1 NA
# 6 1 1 35
# 7 6 1 33
# 8 9 1 31
# 9 6 1 35
#10 5 1 32
#11 10 2 NA
#12 5 2 NA
#13 7 2 NA
#14 6 2 NA
#15 2 2 NA
#16 9 2 30
#17 3 2 29
#18 1 2 27
#19 4 2 21
#20 10 2 19
If you want the NAs replaced by some other value, you can use, for example, if_else:
df %>%
group_by(id) %>%
mutate(roll.sum = lag(rollig_sum(x))) %>%
mutate(roll.sum = if_else(is.na(roll.sum), x, roll.sum))
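The same "sum of the previous 5 values" can also be sketched without zoo or tibbletime, using stats::filter() for the rolling window and dplyr::lag() to exclude the current row. The column names roll_incl and sum_prev are made up for this sketch, and the x values are the ones shown in the output above.

```r
library(dplyr)

df <- data.frame(
  x  = c(3, 8, 5, 9, 10, 1, 6, 9, 6, 5, 10, 5, 7, 6, 2, 9, 3, 1, 4, 10),
  id = rep(1:2, each = 10)
)

res <- df %>%
  group_by(id) %>%
  mutate(
    # rolling sum of the current and 4 previous values (NA for the first 4 rows)
    roll_incl = as.numeric(stats::filter(x, rep(1, 5), sides = 1)),
    # shift by one row so the window covers only the 5 previous values
    sum_prev  = dplyr::lag(roll_incl)
  ) %>%
  ungroup()

res$sum_prev[6:10]
#> [1] 35 33 31 35 32
```

stats::filter() is called with its namespace because dplyr masks filter() once it is loaded.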
