Matching node IDs of trees with the same topology - r

I have two phylogenetic trees that have the same topology (except for branch lengths):
In R using ape:
t1 <- ape::read.tree(file="",text="(((HS:72,((CP:30,CL:30.289473923):32,RN:62):10):2,(CS:63,BS:63):11):5,LA:79);")
t2 <- ape::read.tree(file="",text="(((((CP:39,CL:39):29,RN:68):9,HS:77):5,(BS:63,CS:63):19):14,LA:96);")
> ape::all.equal.phylo(t1,t2,use.edge.length = F,use.tip.label = T)
[1] TRUE
I want to compute the mean branch lengths across the two trees. The problem is that, although their topologies are identical, their nodes are not numbered in the same order, and only the tips are labeled, so I don't think there's a simple join solution:
> head(tidytree::as_tibble(t1))
# A tibble: 6 x 4
parent node branch.length label
<int> <int> <dbl> <chr>
1 10 1 72 HS
2 12 2 30 CP
3 12 3 30.3 CL
4 11 4 62 RN
5 13 5 63 CS
6 13 6 63 BS
> tail(tidytree::as_tibble(t1))
# A tibble: 6 x 4
parent node branch.length label
<int> <int> <dbl> <chr>
1 8 8 NA NA
2 8 9 5 NA
3 9 10 2 NA
4 10 11 10 NA
5 11 12 32 NA
6 9 13 11 NA
> head(tidytree::as_tibble(t2))
# A tibble: 6 x 4
parent node branch.length label
<int> <int> <dbl> <chr>
1 12 1 39 CP
2 12 2 39 CL
3 11 3 68 RN
4 10 4 77 HS
5 13 5 63 BS
6 13 6 63 CS
> tail(tidytree::as_tibble(t2))
# A tibble: 6 x 4
parent node branch.length label
<int> <int> <dbl> <chr>
1 8 8 NA NA
2 8 9 14 NA
3 9 10 5 NA
4 10 11 9 NA
5 11 12 29 NA
6 9 13 19 NA
So it's not clear to me how to pair up corresponding branch lengths in order to take their mean.
Any idea how to match them, or how to reorder t2 according to t1?
Supposedly phytools' matchNodes function is meant for this, but it doesn't seem to get it right:
phytools::matchNodes(t1, t2,method = "descendants")
tr1 tr2
[1,] 8 8
[2,] 9 9
[3,] 10 10
[4,] 11 11
[5,] 12 12
[6,] 13 13
At the very least I'd expect it to match the tips correctly, meaning:
dplyr::left_join(dplyr::filter(tidytree::as_tibble(t1),!is.na(label)) %>% dplyr::select(node,label) %>% dplyr::rename(t1.node=node),
+ dplyr::filter(tidytree::as_tibble(t2),!is.na(label)) %>% dplyr::select(node,label) %>% dplyr::rename(t2.node=node))
Joining, by = "label"
# A tibble: 7 x 3
t1.node label t2.node
<int> <chr> <int>
1 1 HS 4
2 2 CP 1
3 3 CL 2
4 4 RN 3
5 5 CS 6
6 6 BS 5
7 7 LA 7
But that's not happening.
Ultimately the information needed for matching is in these tree tibbles, because they list the parent of each node, but using that information to match the nodes in practice probably requires some recursive steps.

ape's makeNodeLabel with method = "md5sum" seems to achieve this: it labels each internal node with the MD5 digest of its descendant tip labels, so the same clade gets the same label in both trees and the nodes can be joined by label:
t1 <- ape::read.tree(file="",text="(((HS:72,((CP:30,CL:30.289473923):32,RN:62):10):2,(CS:63,BS:63):11):5,LA:79);")
t2 <- ape::read.tree(file="",text="(((((CP:39,CL:39):29,RN:68):9,HS:77):5,(BS:63,CS:63):19):14,LA:96);")
dplyr::left_join(tidytree::as_tibble(ape::makeNodeLabel(t1, method = "md5sum")) %>% dplyr::select(node,label) %>% dplyr::rename(t1.node=node),
tidytree::as_tibble(ape::makeNodeLabel(t2, method = "md5sum")) %>% dplyr::select(node,label) %>% dplyr::rename(t2.node=node))
Joining, by = "label"
# A tibble: 13 x 3
t1.node label t2.node
<int> <chr> <int>
1 1 HS 4
2 2 CP 1
3 3 CL 2
4 4 RN 3
5 5 CS 6
6 6 BS 5
7 7 LA 7
8 8 da5f57f0a757f7211fcf84c540d9531a 8
9 9 9bffe86cf0a2650b6a3f0d494c0183a9 9
10 10 bcf7b41992a064acd2e3e66fee7fe2d4 10
11 11 d50e0698114c621b49322697267900b7 11
12 12 f0a8c7fa67831514e65cdadbc68c3d31 12
13 13 82ab4cf8ae4a4df14cf87a48dc2638e0 13
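With the md5sum labels in place, averaging the branch lengths becomes a lookup by label. A minimal sketch, assuming only the ape package; the helper edge_labels and the names len1, len2, mean_len and t_mean are illustrative and not part of ape:

```r
library(ape)

t1 <- read.tree(text = "(((HS:72,((CP:30,CL:30.289473923):32,RN:62):10):2,(CS:63,BS:63):11):5,LA:79);")
t2 <- read.tree(text = "(((((CP:39,CL:39):29,RN:68):9,HS:77):5,(BS:63,CS:63):19):14,LA:96);")

# Label each internal node by the MD5 digest of its descendant tip labels,
# so that identical clades get identical labels in both trees.
t1 <- makeNodeLabel(t1, method = "md5sum")
t2 <- makeNodeLabel(t2, method = "md5sum")

# Hypothetical helper: the label of the child node at the end of every edge.
# Tips are numbered 1..Ntip and internal nodes follow, so the two label
# vectors can be concatenated and indexed by node number.
edge_labels <- function(tr) c(tr$tip.label, tr$node.label)[tr$edge[, 2]]

len1 <- setNames(t1$edge.length, edge_labels(t1))
len2 <- setNames(t2$edge.length, edge_labels(t2))

# Align t2's branch lengths to t1's edge order by label, then average.
mean_len <- (len1 + len2[names(len1)]) / 2

# A copy of t1 carrying the averaged branch lengths.
t_mean <- t1
t_mean$edge.length <- unname(mean_len)
```

For example, mean_len["HS"] is the average of 72 and 77. The root has no incoming edge, so it does not appear in either vector.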

Related

filter() rows from dataframe with condition on previous and next row, keeping NA values

I have a dataframe like this:
AA<-c(1,2,4,5,6,7,10,11,12,13,14,15)
BB<-c(32,21,21,NA,27,31,31,12,28,NA,48,7)
df<- data.frame(AA,BB)
I want to remove rows whose BB value equals that of both the previous and the next row, keeping only the first and last occurrence in each run of identical BB values. I also want to keep NA rows. I arrived at this code, which is close to what I want:
lighten_df <- df %>% filter(BB!=lag(BB) | BB!=lead(BB) | is.na(BB) )
which gives me:
> lighten_df
AA BB
1 1 32
2 2 21
3 5 NA
4 6 27
5 7 31
6 10 31
7 11 12
8 12 28
9 13 NA
10 14 48
11 15 7
My problem is that I would like to keep both the first and the last 21 in column BB. This is the result I expect:
AA BB
1 1 32
2 2 21
3 4 21
4 5 NA
5 6 27
6 7 31
7 10 31
8 11 12
9 12 28
10 13 NA
11 14 48
12 15 7
Any idea?
I would suggest a different approach: define a grouping variable and keep the first and last rows within each group:
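library(dplyr)
df %>%
group_by(grp = data.table::rleid(BB)) %>% # one group per run of identical consecutive BB values
slice(unique(c(1, n()))) # first and last row of each run; unique() avoids duplicating single-row runs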
# # A tibble: 12 × 3
# # Groups: grp [10]
# AA BB grp
# <dbl> <dbl> <int>
# 1 1 32 1
# 2 2 21 2
# 3 4 21 2
# 4 5 NA 3
# 5 6 27 4
# 6 7 31 5
# 7 10 31 5
# 8 11 12 6
# 9 12 28 7
# 10 13 NA 8
# 11 14 48 9
# 12 15 7 10
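For context on why the filter() attempt in the question drops the row with AA = 4: for that row BB != lag(BB) is FALSE, BB != lead(BB) is NA (the next BB is missing) and is.na(BB) is FALSE, so the whole condition is NA, and filter() drops rows whose condition is NA. A minimal sketch of an NA-safe version of the same idea; the helper same_as is hypothetical, not a dplyr function:

```r
library(dplyr)

AA <- c(1, 2, 4, 5, 6, 7, 10, 11, 12, 13, 14, 15)
BB <- c(32, 21, 21, NA, 27, 31, 31, 12, 28, NA, 48, 7)
df <- data.frame(AA, BB)

# Hypothetical helper: TRUE only when both values are present and equal,
# FALSE (never NA) otherwise.
same_as <- function(x, y) !is.na(x) & !is.na(y) & x == y

# Keep NA rows, plus any row that differs from its previous or next neighbour.
lighten_df <- df %>%
  filter(is.na(BB) | !same_as(BB, lag(BB)) | !same_as(BB, lead(BB)))
```

With this data every run has at most two rows, so all 12 rows are kept; in a run of three or more identical values only the interior rows would be removed.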

rbind dataframes by filling missing rows from the first dataframe

I have 4 datasets from 4 rounds of a survey, with the first round containing 5 variables and the next ones containing only 3. This is because the ID (same sample) and the other two variables (v1 and v2) are fixed over time.
df1 <- data.frame(id = c(1:5), round=1, v1 = c(6:10), v2 = c(11:15), v3=c(16:20))
df2 <- data.frame(id = c(1:5), round=2, v3=c(26:30))
df3 <- data.frame(id = c(1:5), round=3, v3=c(36:40))
df4 <- data.frame(id = c(1:5), round=4, v3=c(46:50))
rbind:
list(df1, df2, df3, df4) %>%
bind_rows(.id = 'grp') %>%
group_by(id)
Now when I rbind them, I end up with missing values for the two fixed variables in rounds 2 to 4:
grp id round v1 v2 v3
<chr> <int> <dbl> <int> <int> <int>
1 1 1 1 6 11 16
2 1 2 1 7 12 17
3 1 3 1 8 13 18
4 1 4 1 9 14 19
5 1 5 1 10 15 20
6 2 1 2 NA NA 26
7 2 2 2 NA NA 27
8 2 3 2 NA NA 28
9 2 4 2 NA NA 29
10 2 5 2 NA NA 30
11 3 1 3 NA NA 36
12 3 2 3 NA NA 37
13 3 3 3 NA NA 38
14 3 4 3 NA NA 39
15 3 5 3 NA NA 40
16 4 1 4 NA NA 46
17 4 2 4 NA NA 47
18 4 3 4 NA NA 48
19 4 4 4 NA NA 49
20 4 5 4 NA NA 50
but I need v1 and v2 to be filled for the next rounds as well by matching the respective ID.
Please let me know if there is any way to do this in R (or in Python).
Thank you.
library(dplyr)
library(tidyr)
list(df1, df2, df3, df4) %>%
bind_rows(.id = 'grp') %>%
group_by(id) %>%
fill(v1:v3) # from tidyr; fills NAs downwards within each id
#fill(4:6) # alternative syntax: columns 4-6
#fill(-c(1:3)) # alternative syntax: everything except columns 1:3
#fill(everything()) # alternative syntax: fill NAs in all columns
grp id round v1 v2 v3
<chr> <int> <dbl> <int> <int> <int>
1 1 1 1 6 11 16
2 1 2 1 7 12 17
3 1 3 1 8 13 18
4 1 4 1 9 14 19
5 1 5 1 10 15 20
6 2 1 2 6 11 26
7 2 2 2 7 12 27
8 2 3 2 8 13 28
9 2 4 2 9 14 29
10 2 5 2 10 15 30
11 3 1 3 6 11 36
12 3 2 3 7 12 37
13 3 3 3 8 13 38
14 3 4 3 9 14 39
15 3 5 3 10 15 40
16 4 1 4 6 11 46
17 4 2 4 7 12 47
18 4 3 4 8 13 48
19 4 4 4 9 14 49
20 4 5 4 10 15 50
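fill() works here because the round-1 row comes first within each id. An alternative that does not depend on row order is to look up the fixed columns by id. A minimal sketch, assuming dplyr; the names fixed, later and all_rounds are illustrative:

```r
library(dplyr)

df1 <- data.frame(id = 1:5, round = 1, v1 = 6:10, v2 = 11:15, v3 = 16:20)
df2 <- data.frame(id = 1:5, round = 2, v3 = 26:30)
df3 <- data.frame(id = 1:5, round = 3, v3 = 36:40)
df4 <- data.frame(id = 1:5, round = 4, v3 = 46:50)

# Time-invariant columns, one row per id, taken from round 1.
fixed <- df1 %>% select(id, v1, v2)

# Stack the later rounds and attach the fixed columns by id.
later <- bind_rows(df2, df3, df4) %>%
  left_join(fixed, by = "id")

# Combine with round 1 and put the columns in a consistent order.
all_rounds <- bind_rows(df1, later) %>%
  select(id, round, v1, v2, v3) %>%
  arrange(round, id)
```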

calculate difference between rows, but keep the raw value by group

I have a dataframe with cumulative values by group that I need to convert back to raw values. The lag function works well here, but for the first number in each sequence I get back either NA or the difference from the previous group's last value.
How can I get the first number in each group instead of NA or a between-group difference?
My dummy data:
# make example
df <- data.frame(id = rep(1:3, each = 5),
hour = rep(1:5, 3),
value = sample(1:15))
First I calculate the cumulative values, then convert them back to raw values, i.e. value should equal valBack. The suggestion mutate(valBack = c(cumsum[1], (cumsum - lag(cumsum))[-1])) replaces the first (NA) value with the correct one, but it does not work for the first number of each subsequent group:
df %>%
group_by(id) %>%
dplyr::mutate(cumsum = cumsum(value)) %>%
mutate(valBack = c(cumsum[1], (cumsum - lag(cumsum))[-1])) # skip the first value in a lag vector
Which results:
# A tibble: 15 x 5
# Groups: id [3]
id hour value cumsum valBack
<int> <int> <int> <int> <int>
1 1 1 10 10 10 # this works
2 1 2 13 23 13
3 1 3 8 31 8
4 1 4 4 35 4
5 1 5 9 44 9
6 2 1 12 12 -32 # here the new group starts. The number should be 12, but it is -32
7 2 2 14 26 14
8 2 3 5 31 5
9 2 4 15 46 15
10 2 5 1 47 1
11 3 1 2 2 -45 # here it should be 2 instead of -45
12 3 2 3 5 3
13 3 3 6 11 6
14 3 4 11 22 11
15 3 5 7 29 7
I want a safe calculation that makes valBack equal to value. (Of course, in the real data I don't have the value column, only the cumsum column.)
Try:
library(dplyr)
df %>%
group_by(id) %>%
mutate(
cumsum = cumsum(value),
valBack = c(cumsum[1], (cumsum - lag(cumsum))[-1])
)
Giving:
# A tibble: 15 x 5
# Groups: id [3]
id hour value cumsum valBack
<int> <int> <int> <int> <int>
1 1 1 10 10 10
2 1 2 13 23 13
3 1 3 8 31 8
4 1 4 4 35 4
5 1 5 9 44 9
6 2 1 12 12 12
7 2 2 14 26 14
8 2 3 5 31 5
9 2 4 15 46 15
10 2 5 1 47 1
11 3 1 2 2 2
12 3 2 3 5 3
13 3 3 6 11 6
14 3 4 11 22 11
15 3 5 7 29 7
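The code in this answer is the same as the code in the question, so the different result presumably comes from the session rather than from the code. One possible cause, stated here as an assumption because the question does not list the loaded packages, is that another package's mutate() (plyr's, for example) was masking dplyr's, in which case the grouping is ignored. A sketch that sidesteps this by calling dplyr explicitly, using fixed example values copied from the output above in place of sample():

```r
library(dplyr)

df <- data.frame(id = rep(1:3, each = 5),
                 hour = rep(1:5, 3),
                 value = c(10, 13, 8, 4, 9, 12, 14, 5, 15, 1, 2, 3, 6, 11, 7))

res <- df %>%
  dplyr::group_by(id) %>%
  dplyr::mutate(cumsum = cumsum(value),
                # first cumulative value, then successive differences within the group
                valBack = c(cumsum[1], diff(cumsum))) %>%
  dplyr::ungroup()
```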
While the accepted answer works, it is more complicated than it needs to be. The lag function takes several arguments:
dplyr::lag(x, n = 1L, default = NA, order_by = NULL, ...)
Setting default = 0 makes the first lagged value in each group 0 instead of NA, so the subtraction returns the first cumulative value unchanged:
library(dplyr)
df %>%
group_by(id) %>%
mutate(cumsum = cumsum(value),
rawdata = cumsum - lag(cumsum, default = 0))
#> # A tibble: 15 x 5
#> # Groups: id [3]
#> id hour value cumsum rawdata
#> <int> <int> <int> <int> <dbl>
#> 1 1 1 2 2 2
#> 2 1 2 1 3 1
#> 3 1 3 13 16 13
#> 4 1 4 15 31 15
#> 5 1 5 10 41 10
#> 6 2 1 3 3 3
#> 7 2 2 8 11 8
#> 8 2 3 4 15 4
#> 9 2 4 12 27 12
#> 10 2 5 11 38 11
#> 11 3 1 14 14 14
#> 12 3 2 6 20 6
#> 13 3 3 5 25 5
#> 14 3 4 7 32 7
#> 15 3 5 9 41 9
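Subtracting a lag with default = 0 is the same as prepending 0 to each group's cumulative values and differencing. A minimal base R sketch of that equivalence, using fixed example values in place of sample():

```r
df <- data.frame(id = rep(1:3, each = 5),
                 value = c(10, 13, 8, 4, 9, 12, 14, 5, 15, 1, 2, 3, 6, 11, 7))

# Cumulative sum within each id.
df$cumsum <- ave(df$value, df$id, FUN = cumsum)

# Back to raw values: difference against a leading 0 within each id.
df$rawdata <- ave(df$cumsum, df$id, FUN = function(x) diff(c(0, x)))
```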

R how to fill in NA with rules

data=data.frame(person=c(1,1,1,2,2,2,2,3,3,3,3),
t=c(3,NA,9,4,7,NA,13,3,NA,NA,12),
WANT=c(3,6,9,4,7,10,13,3,6,9,12))
I want to create a new variable 'WANT' that takes the previous value of t and adds 3 to it; if there are several NAs in a row, it keeps adding 3 to each filled value in turn. My attempt is:
library(dplyr)
data %>%
group_by(person) %>%
mutate(WANT_TRY = fill(t) + 3)
Here's one way -
data %>%
group_by(person) %>%
mutate(
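# cs = cumsum(!is.na(t)), # creates index for reference value; uncomment if interested
w = case_when(
# sequence(rle(...)$lengths) counts the position within each run of NAs
# t[!is.na(t)][...] picks the last non-NA value before each position;
# assumes each group starts with a non-NA t
is.na(t) ~ t[!is.na(t)][cumsum(!is.na(t))] + 3*sequence(rle(is.na(t))$lengths),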
TRUE ~ t
)
) %>%
ungroup()
# A tibble: 11 x 4
person t WANT w
<dbl> <dbl> <dbl> <dbl>
1 1 3 3 3
2 1 NA 6 6
3 1 9 9 9
4 2 4 4 4
5 2 7 7 7
6 2 NA 10 10
7 2 13 13 13
8 3 3 3 3
9 3 NA 6 6
10 3 NA 9 9
11 3 12 12 12
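The pieces of that expression are easier to follow on a single vector. A step-by-step sketch on an illustrative vector x (not from the question), assuming the vector starts with a non-missing value; the sketch looks the reference value up among the non-missing values, x[!is_na]:

```r
x <- c(3, NA, 9, NA, NA)

is_na <- is.na(x)

# Index of the most recent non-missing value at each position: 1 1 2 2 2
anchor_idx <- cumsum(!is_na)

# That most recent non-missing value itself: 3 3 9 9 9
anchor <- x[!is_na][anchor_idx]

# Position within each run of equal is_na values: 1 1 1 1 2
# (only the entries at NA positions are used below)
steps <- sequence(rle(is_na)$lengths)

# Add 3 once per step into the NA run; keep observed values as they are.
filled <- ifelse(is_na, anchor + 3 * steps, x)
filled
# 3 6 9 12 15
```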
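Here is another way: linear interpolation with the imputeTS package. Note that this reproduces the +3 rule only because the known values on either side of each gap in this data are already spaced consistently with it.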
library(dplyr)
library(imputeTS)
data2 <- data %>%
group_by(person) %>%
mutate(WANT2 = na_interpolation(t)) %>% # interpolate t, the column with NAs; na.interpolation() is the older name of this function
ungroup()
data2
# # A tibble: 11 x 4
# person t WANT WANT2
# <dbl> <dbl> <dbl> <dbl>
# 1 1 3 3 3
# 2 1 NA 6 6
# 3 1 9 9 9
# 4 2 4 4 4
# 5 2 7 7 7
# 6 2 NA 10 10
# 7 2 13 13 13
# 8 3 3 3 3
# 9 3 NA 6 6
# 10 3 NA 9 9
# 11 3 12 12 12
This is harder than it seems because of the double NA at the end. If it weren't for that, then the following:
ifelse(is.na(data$t), c(0, data$t[-nrow(data)])+3, data$t)
...would give you what you want. The simplest way, which uses the same logic but doesn't look very clever (sorry!), would be:
.impute <- function(x) ifelse(is.na(x), c(0, x[-length(x)])+3, x)
.impute(.impute(data$t))
...which just cheats by doing it twice. Does that help?
You can use functional programming from purrr and "NA-safe" addition from hablar:
library(hablar)
library(dplyr)
library(purrr)
data %>%
group_by(person) %>%
mutate(WANT2 = accumulate(t, ~.x %plus_% 3))
Result
# A tibble: 11 x 4
# Groups: person [3]
person t WANT WANT2
<dbl> <dbl> <dbl> <dbl>
1 1 3 3 3
2 1 NA 6 6
3 1 9 9 9
4 2 4 4 4
5 2 7 7 7
6 2 NA 10 10
7 2 13 13 13
8 3 3 3 3
9 3 NA 6 6
10 3 NA 9 9
11 3 12 12 12
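The rule itself ("previous value plus 3, repeated along a run of NAs") can also be written directly as a short loop. A minimal base R sketch; fill_plus is a hypothetical helper name, and a leading NA in a group would stay NA because there is no previous value to build on:

```r
data <- data.frame(person = c(1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3),
                   t = c(3, NA, 9, 4, 7, NA, 13, 3, NA, NA, 12),
                   WANT = c(3, 6, 9, 4, 7, 10, 13, 3, 6, 9, 12))

# Hypothetical helper: walk along x and replace each NA with the
# (possibly just filled) previous value plus step.
fill_plus <- function(x, step = 3) {
  for (i in seq_along(x)[-1]) {
    if (is.na(x[i])) x[i] <- x[i - 1] + step
  }
  x
}

# Apply the helper separately within each person.
data$w <- ave(data$t, data$person, FUN = fill_plus)
```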

SMA for multiple items in the same column

I'm trying to compute a simple moving average (SMA) for multiple items in the same column. Here's an example of the data I'm working with.
Person Time Value
<chr> <dbl> <dbl>
1 A 1 14
2 A 2 13
3 A 3 17
4 A 4 9
5 A 5 20
6 A 6 5
7 B 1 17
8 B 2 11
9 B 3 18
10 B 4 10
11 B 5 10
12 B 6 20
13 C 1 5
14 C 2 5
15 C 3 11
16 C 4 12
17 C 5 12
18 C 6 9
What I'd like to be able to do is to create another column with the SMA formula for each person (A,B,C, etc.). In this case let's say SMA2. While it works for Person A, I can't get the formula to restart at Person B. Rather Person B's first SMA2 value has Person A's values with it.
Right now I've used this which does give me the SMA I want, just not restarted at each new person:
DataSet$SMA2<-SMA(DataSet$Value, 2)
Any help would be appreciated.
DataSet <- DataSet %>%
group_by(Person) %>%
mutate(sma2 = TTR::SMA(Value,2))
Still came up with this:
# A tibble: 18 x 4
# Groups: Person [3]
Person Time Value sma2
<chr> <dbl> <dbl> <dbl>
1 A 1 14 NA
2 A 2 13 13.5
3 A 3 17 15
4 A 4 9 13
5 A 5 20 14.5
6 A 6 5 12.5
7 B 1 17 11
8 B 2 11 14
9 B 3 18 14.5
10 B 4 10 14
11 B 5 10 10
12 B 6 20 15
13 C 1 5 12.5
14 C 2 5 5
15 C 3 11 8
16 C 4 12 11.5
17 C 5 12 12
18 C 6 9 10.5
Using dplyr, group_by Person and then mutate, so that the SMA is computed separately for each person.
library(dplyr)
DataSet <- DataSet %>%
group_by(Person) %>%
mutate(sma2 = TTR::SMA(Value, 2))
# A tibble: 18 x 4
# Groups: Person [3]
Person Time Value sma2
<chr> <int> <int> <dbl>
1 A 1 14 NA
2 A 2 13 13.5
3 A 3 17 15
4 A 4 9 13
5 A 5 20 14.5
6 A 6 5 12.5
7 B 1 17 NA
8 B 2 11 14
9 B 3 18 14.5
10 B 4 10 14
11 B 5 10 10
12 B 6 20 15
13 C 1 5 NA
14 C 2 5 5
15 C 3 11 8
16 C 4 12 11.5
17 C 5 12 12
18 C 6 9 10.5
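The same per-person reset can be sketched without TTR, which also makes it easy to check the numbers by hand. In this sketch sma_n is a hypothetical helper built on stats::filter, and the dplyr:: prefixes guard against another package's mutate() masking dplyr's, which is one possible reason (an assumption, since the loaded packages are not shown) for grouped code returning ungrouped results as in the question's update:

```r
library(dplyr)

DataSet <- data.frame(
  Person = rep(c("A", "B", "C"), each = 6),
  Time = rep(1:6, 3),
  Value = c(14, 13, 17, 9, 20, 5, 17, 11, 18, 10, 10, 20, 5, 5, 11, 12, 12, 9)
)

# Hypothetical helper: trailing n-point mean, NA until n values are available.
# stats::filter is called explicitly because dplyr masks filter().
sma_n <- function(x, n = 2) as.numeric(stats::filter(x, rep(1 / n, n), sides = 1))

res <- DataSet %>%
  dplyr::group_by(Person) %>%
  dplyr::mutate(sma2 = sma_n(Value, 2)) %>%
  dplyr::ungroup()
```

The first sma2 value for each person is NA, because a 2-point average needs two values from the same group.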
