In R: Extract similar trajectory patterns from a data table

I have a data table that contains several patterns for going from a to c. These patterns belong to different expeditions, and I want to group together the expedition_ids that share the same pattern.
library(data.table)
dt <- data.table(departure     = c('a', 'a', 'a', 'b', 'a', 'd', 'a', 'b'),
                 arrival       = c('a', 'a', 'b', 'c', 'd', 'c', 'b', 'c'),
                 expedition_id = c(1, 2, 1, 1, 3, 3, 2, 2))
> dt
   departure arrival expedition_id
1:         a       a             1
2:         a       a             2
3:         a       b             1
4:         b       c             1
5:         a       d             3
6:         d       c             3
7:         a       b             2
8:         b       c             2
The result I am trying to get is a separate data table for each unique pattern:
> dt1
   departure arrival expedition_list
1:         a       a             1,2
2:         a       b             1,2
3:         b       c             1,2
> dt2
   departure arrival expedition_list
1:         a       d               3
2:         d       c               3
I'd appreciate your help on this one.

You can try:
library(data.table)
dt <- dt[, .(expedition_list = toString(expedition_id)), by = .(departure, arrival)]
dt_list <- split(dt, dt$expedition_list)
list2env(
  setNames(
    dt_list,
    paste0('dt', seq_along(dt_list))
  ),
  .GlobalEnv
)
Output:
dt1
   departure arrival expedition_list
1:         a       a            1, 2
2:         a       b            1, 2
3:         b       c            1, 2
dt2
   departure arrival expedition_list
1:         a       d               3
2:         d       c               3

You asked for data.table, but for others this dplyr version might also be helpful:
library(dplyr)
dt %>%
  group_by(departure, arrival) %>%
  summarise(expedition_list = paste(expedition_id, collapse = ","))
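To mirror the dt1/dt2 split on the dplyr side (my addition, not part of the original answer), group_split() can break the summary apart on the expedition_list key:
library(dplyr)
dt %>%
  group_by(departure, arrival) %>%
  summarise(expedition_list = paste(expedition_id, collapse = ","),
            .groups = "drop") %>%
  group_split(expedition_list)
This returns a list of tibbles, one per unique pattern, rather than assigning dt1/dt2 into the global environment.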

Related

In R, from dataframe with sourceID and partnerID columns, order dataframe by partner

I'm quite new to R. I have a large dataframe approximating the following:
df <- data.frame(
  source  = c('a', 'b', 'c', 'e'),
  partner = c('b', 'c', 'e', 'a'),
  info    = c(1, 2, 3, 4)
)
For each row in the dataframe I want to get the info column from the partner and concatenate it to the source row. I'm doing this by building a second dataframe in the following way:
prt <- unlist(df$partner)
collect_partner <- function(x, df) {
  df[df[, 'source'] == x, 'info']
}
prt_df <- do.call("rbind", lapply(prt, collect_partner, df)) # slow
final_df <- cbind(df, prt_df)
However, this approach is very slow and I'm sure there must be a better way. Unfortunately I'm finding it hard to articulate what I'm trying to do, so solutions aren't forthcoming from googling etc. Any suggestions would be much appreciated!
If you work with the tidyverse, I'd use a left_join of the data frame basically with itself. I first create a data frame that contains only source and info. To make sure there is only one value per unique entry in source, I use distinct (not strictly needed here).
Then, I join the data to the original data frame:
library(dplyr)
df <- data.frame(
  source  = c('a', 'b', 'c', 'e'),
  partner = c('b', 'c', 'e', 'a'),
  info    = c(1, 2, 3, 4)
)
source_info <- df %>%
  select(source, prt_df = info) %>%
  distinct(source, .keep_all = TRUE)
df %>%
  left_join(source_info, by = c("partner" = "source"))
#> source partner info prt_df
#> 1 a b 1 2
#> 2 b c 2 3
#> 3 c e 3 4
#> 4 e a 4 1
With base R using sapply and ==
df$prt_df <- sapply(df$partner, function(x) which(x == df$source))
df
source partner info prt_df
1 a b 1 2
2 b c 2 3
3 c e 3 4
4 e a 4 1
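One caveat worth noting (my addition): which() returns the matching row position, and it equals the desired info value here only because info happens to be 1:4. A variant that looks up the info value itself:
# match() finds the partner's row in source; index info with it
df$prt_df <- df$info[match(df$partner, df$source)]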
Using data.table
library(data.table)
dt <- as.data.table(df)
dt[, prt_df := lapply(partner, function(x) which(x == source))]
dt
source partner info prt_df
1: a b 1 2
2: b c 2 3
3: c e 3 4
4: e a 4 1
On a slightly modified data set dt_m with repeated and missing values (see the modified data below):
dt_m[, prt_df := lapply(partner, function(x) which(x == source))]
dt_m
source partner info prt_df
1: a b 1
2: a c 2 3
3: c e 3 4
4: e a 4 1,2
modified data
dt_m <- structure(list(source  = c("a", "a", "c", "e"),
                       partner = c("b", "c", "e", "a"),
                       info    = c(1, 2, 3, 4)),
                  row.names = c(NA, -4L),
                  class = c("data.table", "data.frame"))

na.locf in data.table when completing by group

I have a data.table in which I'd like to complete a column to fill in some missing values; however, I'm having some trouble filling in the other columns.
library(data.table)
library(zoo) # for na.locf
dt = data.table(a = c(1, 3, 5), b = c('a', 'b', 'c'))
dt[, .(a = seq(min(a), max(a), 1), b = na.locf(b))]
# a b
# 1: 1 a
# 2: 2 b
# 3: 3 c
# 4: 4 a
# 5: 5 b
However, I'm looking for something more like this:
library(tidyr)
library(dplyr)
dt %>%
  complete(a = seq(min(a), max(a), 1)) %>%
  mutate(b = na.locf(b))
# # A tibble: 5 x 2
# a b
# <dbl> <chr>
# 1 1 a
# 2 2 a
# 3 3 b
# 4 4 b
# 5 5 c
where the last value is carried forward.
Another possible solution, using only the (rolling) join capabilities of data.table:
dt[.(min(a):max(a)), on = .(a), roll = Inf]
which gives:
a b
1: 1 a
2: 2 a
3: 3 b
4: 4 b
5: 5 c
On large datasets this will probably outperform every other solution.
Courtesy to @Mako212, who gave the hint by using seq in his answer.
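If you want to check that performance claim on data of your own, here is a sketch using the microbenchmark package (the sizes are illustrative, not from the original answer):
library(data.table)
library(zoo)
library(microbenchmark)
# 1e5 known points scattered over 1e6 positions
big <- data.table(a = sort(sample(1e6, 1e5)), b = sample(letters, 1e5, TRUE))
microbenchmark(
  roll = big[.(min(a):max(a)), on = .(a), roll = Inf],
  locf = big[big[, .(a = min(a):max(a))], on = "a"][, .(a, b = na.locf(b, na.rm = FALSE))],
  times = 10
)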
The first posted solution, which works but gives a warning:
dt[dt[, .(a = Reduce(":", a))], on = .(a), roll = Inf]
data.table recycles observations by default when you try dt[, .(a = seq(min(a), max(a), 1))] so it never generates any NA values for na.locf to fill. Pretty sure you need to use a join here to "complete" the cases, and then you can use na.locf to fill.
dt[dt[, .(a = min(a):max(a))], on = 'a'][, .(a, b = na.locf(b))]
Not sure if there's a way to skip building the separate join table, but this gives you the desired result.
a b
1: 1 a
2: 2 a
3: 3 b
4: 4 b
5: 5 c
And I'll borrow @Jaap's min/max line to avoid creating the second table. So basically you can either use his rolling join solution, or if you want to use na.locf this gets the same result.

Merge R data frame or data table and overwrite values of multiple columns

How do you merge two data tables (or data frames) in R, keeping the non-NA values from each matching column? The question "Merge data frames and overwrite values" provides a solution if each individual column is specified explicitly (as far as I can tell, at least). But I have over 40 common columns between the two data tables, and it is somewhat random which of the two has an NA versus a valid value, so writing ifelse statements for 40 columns seems inefficient.
Below is a simple example, where I'd like to join (merge) the two data.tables by the id and date columns:
library(data.table)
dt_1 <- data.table(id = "abc",
                   date = "2018-01-01",
                   a = 3,
                   b = NA_real_,
                   c = 4,
                   d = 6,
                   e = NA_real_)
setkey(dt_1, id, date)
> dt_1
id date a b c d e
1: abc 2018-01-01 3 NA 4 6 NA
dt_2 <- data.table(id = "abc",
                   date = "2018-01-01",
                   a = 3,
                   b = 5,
                   c = NA_real_,
                   d = 6,
                   e = NA_real_)
setkey(dt_2, id, date)
> dt_2
id date a b c d e
1: abc 2018-01-01 3 5 NA 6 NA
Here is my desired output:
> dt_out
id date a b c d e
1: abc 2018-01-01 3 5 4 6 NA
I've also tried the dplyr::anti_join solution from "left_join two data frames and overwrite" without success.
I'd probably put the data in long form and drop dupes:
k = key(dt_1)
DTList = list(dt_1, dt_2)
DTLong = rbindlist(lapply(DTList, function(x) melt(x, id.vars = k)))
setorder(DTLong, na.last = TRUE)
unique(DTLong, by=c(k, "variable"))
id date variable value
1: abc 2018-01-01 a 3
2: abc 2018-01-01 b 5
3: abc 2018-01-01 c 4
4: abc 2018-01-01 d 6
5: abc 2018-01-01 e NA
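To get from this long form back to the wide dt_out shape (my addition, not part of the original answer), cast the deduplicated table:
# k is c("id", "date") from above; one row per key, one column per variable
dcast(unique(DTLong, by = c(k, "variable")),
      id + date ~ variable, value.var = "value")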
You can do this by using dplyr::coalesce, which will return the first non-missing value from vectors.
(EDIT: you can use dplyr::coalesce directly on the data frames also, no need to create the function below. Left it there just for completeness, as a record of the original answer.)
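The EDIT above doesn't show code for the direct approach; here is a minimal sketch of one way to read it (my interpretation, coalescing column by column with data.table::set(), assuming both tables are keyed identically so rows align):
library(dplyr)
library(data.table)
# for each shared value column, take dt_1's value and fall back to dt_2's
dt_out <- copy(dt_1)
for (col in setdiff(names(dt_1), c("id", "date"))) {
  set(dt_out, j = col, value = coalesce(dt_1[[col]], dt_2[[col]]))
}
dt_out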
Credit where it's due: this code is mostly from a blog post; it builds a function that will take two data frames and do what you need (taking values from the x data frame if they are present).
coalesce_join <- function(x,
                          y,
                          by,
                          suffix = c(".x", ".y"),
                          join = dplyr::full_join, ...) {
  joined <- join(x, y, by = by, suffix = suffix, ...)
  # names of desired output
  cols <- union(names(x), names(y))
  to_coalesce <- names(joined)[!names(joined) %in% cols]
  suffix_used <- suffix[ifelse(endsWith(to_coalesce, suffix[1]), 1, 2)]
  # remove suffixes and deduplicate
  to_coalesce <- unique(substr(
    to_coalesce,
    1,
    nchar(to_coalesce) - nchar(suffix_used)
  ))
  coalesced <- purrr::map_dfc(to_coalesce, ~dplyr::coalesce(
    joined[[paste0(.x, suffix[1])]],
    joined[[paste0(.x, suffix[2])]]
  ))
  names(coalesced) <- to_coalesce
  dplyr::bind_cols(joined, coalesced)[cols]
}
We can use {powerjoin}, do a left join and deal with the conflicts using coalesce_xy() (which is pretty much dplyr::coalesce()).
library(powerjoin)
power_left_join(dt_1, dt_2, by = "id", conflict = coalesce_xy)
# id date a b c d e
# 1 abc 2018-01-01 3 5 4 6 NA
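Since the example tables are keyed on both id and date, it may be safer to join on both columns (my note; it gives the same result here):
power_left_join(dt_1, dt_2, by = c("id", "date"), conflict = coalesce_xy)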

Counting the result of a left join using dplyr

What is the proper way to count the result of a left outer join using dplyr?
Consider the two data frames:
a <- data.frame( id=c( 1, 2, 3, 4 ) )
b <- data.frame( id=c( 1, 1, 3, 3, 3, 4 ), ref_id=c( 'a', 'b', 'c', 'd', 'e', 'f' ) )
a specifies four different IDs. b specifies six records that reference IDs in a. If I want to see how many times each ID is referenced, I might try this:
a %>% left_join( b, by='id' ) %>% group_by( id ) %>% summarise( refs=n() )
Source: local data frame [4 x 2]
id refs
(dbl) (int)
1 1 2
2 2 1
3 3 3
4 4 1
However, the result is misleading because it indicates that ID 2 was referenced once when, in reality, it was never referenced (in the intermediate data frame, ref_id was NA for ID 2). I would like to avoid introducing a separate library such as sqldf.
With data.table, you can do
library(data.table)
setDT(a); setDT(b)
b[a, .N, on="id", by=.EACHI]
id N
1: 1 2
2: 2 0
3: 3 3
4: 4 1
Here, the syntax is x[i, j, on, by=.EACHI].
.EACHI refers to each row of i=a.
j=.N uses a special variable for the number of rows.
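For contrast (my addition, not from the original answer): without by=.EACHI, j is evaluated once over the whole join result, so .N collapses to a single total:
b[a, .N, on = "id"]
# [1] 7
# 6 matched rows plus the NA row produced for the unmatched id 2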
There are already some good answers, but since the question asks to avoid extra packages, here is a base R one. We perform a left join on a and b and append a refs column which is TRUE if ref_id is not NA. Then we use aggregate to sum over the refs column:
m <- transform(merge(a, b, all.x = TRUE), refs = !is.na(ref_id))
aggregate(refs ~ id, m, sum)
giving:
id refs
1 1 2
2 2 0
3 3 3
4 4 1
It does require another package, but I'd feel remiss not mentioning tidylog, which provides reports for a wide range of tidyverse verbs. In your case, it would produce a report like:
library(dplyr)
library(tidylog)
a <- data.frame(id = c(1, 2, 3, 4))
b <- data.frame(id = c(1, 1, 3, 3, 3, 4), ref_id = c('a', 'b', 'c', 'd', 'e', 'f'))
a %>% left_join(b, by = 'id')
left_join: added one column (ref_id)
           > rows only in x    1
           > rows only in y   (0)
           > matched rows      6    (includes duplicates)
           >                  ===
           > rows total        7
id ref_id
1 1 a
2 1 b
3 2 <NA>
4 3 c
5 3 d
6 3 e
7 4 f
I'm having a hard time deciding if this is a hack or the proper way to count references, but this returns the expected result:
a %>% left_join( b, by='id' ) %>% group_by( id ) %>% summarise( refs=sum( !is.na( ref_id ) ) )
Source: local data frame [4 x 2]
id refs
(dbl) (int)
1 1 2
2 2 0
3 3 3
4 4 1

Get number of same individuals for different groups

I have a data set with individuals (ID) that can be part of more than one group.
Example:
library(data.table)
DT <- data.table(
  ID = rep(1:5, c(3:1, 2:3)),
  Group = c("A", "B", "C", "B",
            "C", "A", "A", "C",
            "A", "B", "C")
)
DT
# ID Group
# 1: 1 A
# 2: 1 B
# 3: 1 C
# 4: 2 B
# 5: 2 C
# 6: 3 A
# 7: 4 A
# 8: 4 C
# 9: 5 A
# 10: 5 B
# 11: 5 C
I want to count the individuals shared by each pair of groups.
The result should look like this:
Group.1 Group.2 Sum
A B 2
A C 3
B C 3
Where Sum indicates the number of individuals the two groups have in common.
Here's my version:
# size-1 IDs can't contribute; skip
DT[ , if (.N > 1)
      # simplify = FALSE returns a list;
      # transpose() turns the 3-length list of 2-length vectors
      # into a length-2 list of 3-length vectors (efficiently)
      transpose(combn(Group, 2L, simplify = FALSE)),
    by = ID
][ , .(Sum = .N), keyby = .(Group.1 = V1, Group.2 = V2)]
With output:
# Group.1 Group.2 Sum
# 1: A B 2
# 2: A C 3
# 3: B C 3
As of version 1.9.8 (on CRAN 25 Nov 2016), data.table has gained the ability to do non-equi joins. So, a self non-equi join can be used:
library(data.table) # v1.9.8+
setDT(DT)[, Group:= factor(Group)]
DT[DT, on = .(ID, Group < Group), nomatch = 0L, .(ID, x.Group, i.Group)][
, .N, by = .(x.Group, i.Group)]
x.Group i.Group N
1: A B 2
2: A C 3
3: B C 3
Explanation
The non-equi join on ID, Group < Group is a data.table version of combn() (but applied group-wise):
DT[DT, on = .(ID, Group < Group), nomatch = 0L, .(ID, x.Group, i.Group)]
ID x.Group i.Group
1: 1 A B
2: 1 A C
3: 1 B C
4: 2 B C
5: 4 A C
6: 5 A B
7: 5 A C
8: 5 B C
We self-join the dataset with itself on 'ID', subset the rows where the 'Group' columns differ, count rows (.N) grouped by the 'Group' columns, sort 'Group.1' and 'Group.2' within each row using pmin/pmax, and take the unique value of 'N'.
library(data.table) # v1.9.6+
DT[DT, on = 'ID', allow.cartesian = TRUE][Group != i.Group, .N, .(Group, i.Group)][,
   list(Sum = unique(N)), .(Group.1 = pmin(Group, i.Group), Group.2 = pmax(Group, i.Group))]
# Group.1 Group.2 Sum
#1: A B 2
#2: A C 3
#3: B C 3
Or, as mentioned in the comments by @MichaelChirico and @Frank, we can convert 'Group' to factor class, subset the rows based on as.integer(Group) < as.integer(i.Group), group by 'Group' and 'i.Group', and get the nrow (.N):
DT[, Group:= factor(Group)]
DT[DT, on='ID', allow.cartesian=TRUE][as.integer(Group) < as.integer(i.Group), .N,
by = .(Group.1= Group, Group.2= i.Group)]
Great answers above.
Just an alternative using dplyr in case you or someone else is interested.
library(dplyr)
cmb = combn(unique(DT$Group), 2)
data.frame(g1 = cmb[1, ],
           g2 = cmb[2, ]) %>%
  group_by(g1, g2) %>%
  summarise(l = length(intersect(DT[DT$Group == g1, ]$ID,
                                 DT[DT$Group == g2, ]$ID)))
# g1 g2 l
# (fctr) (fctr) (int)
# 1 A B 2
# 2 A C 3
# 3 B C 3
Yet another solution (base R):
tmp <- split(DT, DT$Group)
ans <- apply(combn(LETTERS[1:3], 2), 2, FUN = function(ind) {
  # intersect the ID vectors of the two groups
  out <- length(intersect(tmp[[ind[1]]]$ID, tmp[[ind[2]]]$ID))
  c(group1 = ind[1], group2 = ind[2], sum_ = out)
})
data.frame(t(ans))
# group1 group2 sum_
#1 A B 2
#2 A C 3
#3 B C 3
First split the data into a list of groups, then for each unique pairwise combination of two groups, count how many subjects they have in common using length(intersect(...)).
