R filter removes unexpected values

I want to filter out all rows with row_number > 12 from a data frame like this:
head(dat1)
# A tibble: 6 × 7
date order_id product_id row_number shelf_number shelf_level position
<date> <chr> <chr> <dbl> <chr> <chr> <dbl>
1 2020-01-02 ES100025694747 000072489501 6 01 C 51
2 2020-01-02 ES100025694747 000058155401 2 39 B 51
3 2020-01-02 ES100025694747 000067694201 21 28 B 51
4 2020-01-02 ES100025699052 000057235001 9 05 B 31
5 2020-01-02 ES100025699052 000050456101 5 29 D 31
6 2020-01-02 ES100025699052 000067091601 2 17 D 11
The row_number column originally contains values like this:
dat1 %>% distinct(row_number)
# A tibble: 15 × 1
row_number
<dbl>
1 6
2 2
3 21
4 9
5 5
6 1
7 10
8 3
9 4
10 8
11 7
12 20
13 22
14 11
15 12
I filtered like this: dat1 <- dat1 %>% filter(row_number < '13')
The result: instead of keeping all values < 13, it also removes the values from 2 to 9.
dat1 %>% distinct(row_number)
# A tibble: 4 × 1
row_number
<dbl>
1 1
2 10
3 11
4 12
What's wrong with my code?
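The problem is the quoted '13': when a numeric column is compared to a character value, R coerces the numbers to character and compares them lexicographically, so "2" sorts after "13" and is dropped along with 3 through 9. A minimal fix (a sketch against the dat1 frame above):
library(dplyr)

# Compare against a number, not a string, so the comparison stays numeric
dat1 <- dat1 %>% filter(row_number < 13)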

Related

How can I extract information of one group based on the filtrates of another group in dplyr

My data frame looks like this, but with thousands of entries:
type <- rep(c("A","B","C"),4)
time <- c(0,0,0,1,1,1,2,2,2,3,3,3)
counts <- c(0,30,15,30,30,10,31,30,8,30,8,0)
df <- data.frame(time,type,counts)
df
time type counts
1 0 A 0
2 0 B 30
3 0 C 15
4 1 A 30
5 1 B 30
6 1 C 10
7 2 A 31
8 2 B 30
9 2 C 8
10 3 A 30
11 3 B 8
12 3 C 0
At each time point greater than 0, I want to extract all the types that have counts == 30,
and then extract, for these types, their counts at the next time point.
I want my data to look like this
time type counts time_after type_after counts_after
1 A 30 2 A 31
1 B 30 2 B 30
2 B 30 3 B 8
Any help or guidance is appreciated.
Not very elegant, but it should do the job:
library(dplyr)
type <- rep(c("A","B","C"),4)
time <- c(0,0,0,1,1,1,2,2,2,3,3,3)
counts <- c(0,30,15,30,30,10,31,30,8,30,8,0)
df <- tibble(time,type,counts)
df
#> # A tibble: 12 x 3
#> time type counts
#> <dbl> <chr> <dbl>
#> 1 0 A 0
#> 2 0 B 30
#> 3 0 C 15
#> 4 1 A 30
#> 5 1 B 30
#> 6 1 C 10
#> 7 2 A 31
#> 8 2 B 30
#> 9 2 C 8
#> 10 3 A 30
#> 11 3 B 8
#> 12 3 C 0
thirties <- df %>%
  filter(counts == 30 & time != 0) %>%
  mutate(time_after = time + 1)

inner_join(thirties, df, by = c("time_after" = "time",
                                "type" = "type")) %>%
  select(time,
         type,
         counts = counts.x,
         time_after,
         type_after = type,
         count_after = counts.y)
#> # A tibble: 3 x 6
#> time type counts time_after type_after count_after
#> <dbl> <chr> <dbl> <dbl> <chr> <dbl>
#> 1 1 A 30 2 A 31
#> 2 1 B 30 2 B 30
#> 3 2 B 30 3 B 8
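For completeness, a shorter grouped alternative with lead() is possible (a sketch, assuming each type is observed at every consecutive time point, as in the example):
library(dplyr)

df %>%
  group_by(type) %>%                      # look ahead within each type
  mutate(time_after = lead(time),
         type_after = type,
         counts_after = lead(counts)) %>%
  ungroup() %>%
  filter(counts == 30, time != 0, !is.na(counts_after))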

Matching node IDs of trees with the same topology

I have two phylogenetic trees with the same topology (except for branch lengths):
In R using ape:
t1 <- ape::read.tree(file="",text="(((HS:72,((CP:30,CL:30.289473923):32,RN:62):10):2,(CS:63,BS:63):11):5,LA:79);")
t2 <- ape::read.tree(file="",text="(((((CP:39,CL:39):29,RN:68):9,HS:77):5,(BS:63,CS:63):19):14,LA:96);")
> ape::all.equal.phylo(t1,t2,use.edge.length = F,use.tip.label = T)
[1] TRUE
I want to compute the mean branch lengths across the two trees. The problem is that although their topologies are identical, the order in which their nodes are represented is not, and not all tree nodes are labeled tips, so I don't think there's a simple join solution:
> head(tidytree::as_tibble(t1))
# A tibble: 6 x 4
parent node branch.length label
<int> <int> <dbl> <chr>
1 10 1 72 HS
2 12 2 30 CP
3 12 3 30.3 CL
4 11 4 62 RN
5 13 5 63 CS
6 13 6 63 BS
> tail(tidytree::as_tibble(t1))
# A tibble: 6 x 4
parent node branch.length label
<int> <int> <dbl> <chr>
1 8 8 NA NA
2 8 9 5 NA
3 9 10 2 NA
4 10 11 10 NA
5 11 12 32 NA
6 9 13 11 NA
> head(tidytree::as_tibble(t2))
# A tibble: 6 x 4
parent node branch.length label
<int> <int> <dbl> <chr>
1 12 1 39 CP
2 12 2 39 CL
3 11 3 68 RN
4 10 4 77 HS
5 13 5 63 BS
6 13 6 63 CS
> tail(tidytree::as_tibble(t2))
# A tibble: 6 x 4
parent node branch.length label
<int> <int> <dbl> <chr>
1 8 8 NA NA
2 8 9 14 NA
3 9 10 5 NA
4 10 11 9 NA
5 11 12 29 NA
6 9 13 19 NA
So it's not clear to me how I'd correspond between any pair of branch lengths in order to take their mean.
Any idea how to match them or reorder t2 according to t1?
Supposedly phytools' matchNodes method is meant for that, but it doesn't seem to get it right:
phytools::matchNodes(t1, t2,method = "descendants")
tr1 tr2
[1,] 8 8
[2,] 9 9
[3,] 10 10
[4,] 11 11
[5,] 12 12
[6,] 13 13
At least I'd expect it to correspond the tips correctly, meaning:
dplyr::left_join(dplyr::filter(tidytree::as_tibble(t1), !is.na(label)) %>% dplyr::select(node, label) %>% dplyr::rename(t1.node = node),
                 dplyr::filter(tidytree::as_tibble(t2), !is.na(label)) %>% dplyr::select(node, label) %>% dplyr::rename(t2.node = node))
Joining, by = "label"
# A tibble: 7 x 3
t1.node label t2.node
<int> <chr> <int>
1 1 HS 4
2 2 CP 1
3 3 CL 2
4 4 RN 3
5 5 CS 6
6 6 BS 5
7 7 LA 7
But that's not happening.
Ultimately the information needed for matching is in these tree tibbles, because they list the parent of each node, but practically using that information for matching the nodes probably requires some recursive steps.
It seems that ape's makeNodeLabel with "md5sum" as the method argument, which labels each internal node by a hash of the tip labels it subtends, achieves that:
t1 <- ape::read.tree(file="",text="(((HS:72,((CP:30,CL:30.289473923):32,RN:62):10):2,(CS:63,BS:63):11):5,LA:79);")
t2 <- ape::read.tree(file="",text="(((((CP:39,CL:39):29,RN:68):9,HS:77):5,(BS:63,CS:63):19):14,LA:96);")
dplyr::left_join(tidytree::as_tibble(ape::makeNodeLabel(t1, method = "md5sum")) %>% dplyr::select(node, label) %>% dplyr::rename(t1.node = node),
                 tidytree::as_tibble(ape::makeNodeLabel(t2, method = "md5sum")) %>% dplyr::select(node, label) %>% dplyr::rename(t2.node = node))
Joining, by = "label"
# A tibble: 13 x 3
t1.node label t2.node
<int> <chr> <int>
1 1 HS 4
2 2 CP 1
3 3 CL 2
4 4 RN 3
5 5 CS 6
6 6 BS 5
7 7 LA 7
8 8 da5f57f0a757f7211fcf84c540d9531a 8
9 9 9bffe86cf0a2650b6a3f0d494c0183a9 9
10 10 bcf7b41992a064acd2e3e66fee7fe2d4 10
11 11 d50e0698114c621b49322697267900b7 11
12 12 f0a8c7fa67831514e65cdadbc68c3d31 12
13 13 82ab4cf8ae4a4df14cf87a48dc2638e0 13
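From there, computing the mean branch lengths is a single join (a sketch building on the md5sum-labelled trees above; note that branch.length belongs to the edge leading into each node, so the root's value is NA):
library(dplyr)

tb1 <- tidytree::as_tibble(ape::makeNodeLabel(t1, method = "md5sum"))
tb2 <- tidytree::as_tibble(ape::makeNodeLabel(t2, method = "md5sum"))

# Match nodes by label, then average the two branch lengths
inner_join(tb1, tb2, by = "label", suffix = c(".t1", ".t2")) %>%
  mutate(mean.branch.length = (branch.length.t1 + branch.length.t2) / 2) %>%
  select(label, branch.length.t1, branch.length.t2, mean.branch.length)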

Is there a way to remove duplicates based on two columns but keep the one with highest number in the third column? [duplicate]

This question already has answers here:
Select the row with the maximum value in each group
(19 answers)
Closed 2 years ago.
I would like to take this dataset and remove the rows that have the same ID and Age (duplicates), but keep the one with the highest Month number.
ID|Age|Month|
1 25 7
1 25 12
2 18 10
2 18 11
3 12 10
3 25 10
4 19 10
5 10 2
5 10 3
And have the outcome be
ID|Age|Month
1 25 12
2 18 11
3 12 10
3 25 10
4 19 10
5 10 3
Note that it removed the duplicates but kept the version with the highest month number.
As a solution option:
library(tidyverse)
df <- read.table(text = "ID Age Month
1 25 7
1 25 12
2 18 10
2 18 11
3 12 10
3 25 10
4 19 10
5 10 2
5 10 3", header = T)
df %>%
  group_by(ID, Age) %>%
  slice_max(Month)
#> # A tibble: 6 x 3
#> # Groups: ID, Age [6]
#> ID Age Month
#> <int> <int> <int>
#> 1 1 25 12
#> 2 2 18 11
#> 3 3 12 10
#> 4 3 25 10
#> 5 4 19 10
#> 6 5 10 3
Created on 2021-02-11 by the reprex package (v1.0.0)
Using dplyr package, the solution:
df %>%
  group_by(ID, Age) %>%
  filter(Month == max(Month))
# A tibble: 6 x 3
# Groups: ID, Age [6]
ID Age Month
<dbl> <dbl> <dbl>
1 1 25 12
2 2 18 11
3 3 12 10
4 3 25 10
5 4 19 10
6 5 10 3
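Both answers keep all tied rows when a group has several rows sharing the maximum Month. If exactly one row per group is wanted, slice_max() takes a with_ties argument (a sketch, assuming dplyr >= 1.0.0):
library(dplyr)

df %>%
  group_by(ID, Age) %>%
  slice_max(Month, n = 1, with_ties = FALSE) %>%  # break ties, keep one row
  ungroup()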

calculate difference between rows, but keep the raw value by group

I have a dataframe with cumulative values by group that I need to convert back to raw values. The lag function works pretty well here, but instead of the first number in each sequence, I get back either NA or the difference between two groups.
How can I get the first number in each group instead of the NA values or the between-group differences?
My dummy data:
# make example
df <- data.frame(id = rep(1:3, each = 5),
                 hour = rep(1:5, 3),
                 value = sample(1:15))
First calculate the cumulative values, then convert them back to raw values; i.e., value should equal valBack. The suggestion mutate(valBack = c(cumsum[1], (cumsum - lag(cumsum))[-1])) replaces only the first (NA) value with the correct value, but does not work for the first number of each group:
df %>%
  group_by(id) %>%
  dplyr::mutate(cumsum = cumsum(value)) %>%
  mutate(valBack = c(cumsum[1], (cumsum - lag(cumsum))[-1])) # skip the first value in a lag vector
Which results:
# A tibble: 15 x 5
# Groups: id [3]
id hour value cumsum valBack
<int> <int> <int> <int> <int>
1 1 1 10 10 10 # this works
2 1 2 13 23 13
3 1 3 8 31 8
4 1 4 4 35 4
5 1 5 9 44 9
6 2 1 12 12 -32 # here a new group starts; the number should be 12, instead it is -32
7 2 2 14 26 14
8 2 3 5 31 5
9 2 4 15 46 15
10 2 5 1 47 1
11 3 1 2 2 -45 # here it should be 2 instead of -45
12 3 2 3 5 3
13 3 3 6 11 6
14 3 4 11 22 11
15 3 5 7 29 7
I want a safe calculation that makes valBack equal to value. (Of course, in real data I don't have the value column, just the cumsum column.)
Try:
library(dplyr)
df %>%
  group_by(id) %>%
  mutate(
    cumsum = cumsum(value),
    valBack = c(cumsum[1], (cumsum - lag(cumsum))[-1])
  )
Giving:
# A tibble: 15 x 5
# Groups: id [3]
id hour value cumsum valBack
<int> <int> <int> <int> <int>
1 1 1 10 10 10
2 1 2 13 23 13
3 1 3 8 31 8
4 1 4 4 35 4
5 1 5 9 44 9
6 2 1 12 12 12
7 2 2 14 26 14
8 2 3 5 31 5
9 2 4 15 46 15
10 2 5 1 47 1
11 3 1 2 2 2
12 3 2 3 5 3
13 3 3 6 11 6
14 3 4 11 22 11
15 3 5 7 29 7
While the accepted answer works, it is more complicated than it needs to be. If you look at the lag function, you will see that it has additional arguments:
dplyr::lag(x, n = 1L, default = NA, order_by = NULL, ...)
Here we can use default and set it to 0 to get the desired output:
library(dplyr)
df %>%
  group_by(id) %>%
  mutate(cumsum = cumsum(value),
         rawdata = cumsum - lag(cumsum, default = 0))
#> # A tibble: 15 x 5
#> # Groups: id [3]
#> id hour value cumsum rawdata
#> <int> <int> <int> <int> <dbl>
#> 1 1 1 2 2 2
#> 2 1 2 1 3 1
#> 3 1 3 13 16 13
#> 4 1 4 15 31 15
#> 5 1 5 10 41 10
#> 6 2 1 3 3 3
#> 7 2 2 8 11 8
#> 8 2 3 4 15 4
#> 9 2 4 12 27 12
#> 10 2 5 11 38 11
#> 11 3 1 14 14 14
#> 12 3 2 6 20 6
#> 13 3 3 5 25 5
#> 14 3 4 7 32 7
#> 15 3 5 9 41 9
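Base R's diff() gives an equivalent one-liner per group (a sketch; prepending 0 makes the first difference equal the first cumulative value):
library(dplyr)

df %>%
  group_by(id) %>%
  mutate(cumsum = cumsum(value),
         valBack = diff(c(0, cumsum)))  # first element is cumsum[1] - 0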

Unnest vector in dataframe but add list indices column

say I have a tibble such as this:
df <- tibble(x = 22:23, y = list(4:6, 4:7))
df
# A tibble: 2 × 2
x y
<int> <list>
1 22 <int [3]>
2 23 <int [4]>
I would like to convert it into a new, larger tibble by unnesting the lists (e.g. with unnest), which would give me a tibble with 7 rows. However, I want a new column that tells me, for each y-value in a row after unnesting, the index that value had when it was in list form. Here's what the above would look like after doing this:
# A tibble: 7 × 2
x y index
<int> <int> <int>
1 22 4 1
2 22 5 2
3 22 6 3
4 23 4 1
5 23 5 2
6 23 6 3
7 23 7 4
You can map over the y column and bind the index to each element before unnesting:
library(tidyverse)

df %>%
  mutate(y = map(y, ~ data.frame(y = .x, index = seq_along(.x)))) %>%
  unnest(y)
# A tibble: 7 x 3
# x y index
# <int> <int> <int>
#1 22 4 1
#2 22 5 2
#3 22 6 3
#4 23 4 1
#5 23 5 2
#6 23 6 3
#7 23 7 4
Here is another version using lengths:
df %>%
  mutate(index = lengths(y)) %>%
  unnest(y) %>%
  mutate(index = sequence(unique(index)))
# A tibble: 7 x 3
# x index y
# <int> <int> <int>
#1 22 1 4
#2 22 2 5
#3 22 3 6
#4 23 1 4
#5 23 2 5
#6 23 3 6
#7 23 4 7
Using unnest and group_by:
library(tidyr)
library(dplyr)
df %>%
  unnest(y) %>%
  group_by(x) %>%
  mutate(index = row_number())
# A tibble: 7 x 3
# Groups: x [2]
x y index
<int> <int> <int>
1 22 4 1
2 22 5 2
3 22 6 3
4 23 4 1
5 23 5 2
6 23 6 3
7 23 7 4
You can also try rowwise and do.
library(tidyverse)
tibble(x = 22:23, y = list(4:6, 4:7)) %>%
  rowwise() %>%
  # .$y is a length-one list inside do(), so build the index from the
  # unlisted vector rather than from length(.$y)
  do(tibble(x = .$x, y = unlist(.$y), index = seq_along(unlist(.$y))))
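With tidyr 1.0 or later, unnest_longer() produces the index column directly (a sketch using the df from the question; for unnamed vectors, indices_to records each element's position):
library(tidyr)

df %>%
  unnest_longer(y, indices_to = "index")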
