dplyr - Join tables based on a date difference

dplyr - Join tables based on a date difference - r

It's late and I can't figure this out. I'm using lubridate and dlypr.
My data is as follows:
table1 =1 observation per subject with a date
table2= 1 or more observations per subject with associated dates
When I left join I actually add observations. This is because I have multiple records in table 2 that match the key. How can I simply make this a conditional join so that only 1 matching record from table 2 is joined given that its date is closest to the date in table 1.
Sorry if this was verbose.

Use the data.table-package to join. Use roll = "nearest" to get the nearest match..
library(data.table)
dt1 <- data.table( id = 1:10, date = 1:10, stringsAsFactors = FALSE )
dt2 <- data.table( date = 6:15, letter = letters[1:10], stringsAsFactors = FALSE )
dt1[, letter := dt2[dt1, letter, on = "date", roll = "nearest"] ][]
# id date letter
# 1: 1 1 a
# 2: 2 2 a
# 3: 3 3 a
# 4: 4 4 a
# 5: 5 5 a
# 6: 6 6 a
# 7: 7 7 b
# 8: 8 8 c
# 9: 9 9 d
# 10: 10 10 e

Related

How to drop all rows of data frame after specific row number or date in R?

I have a data frame with a certain number of rows.
Would like to drop all rows after a specific row number or after a date.
Any suggestions?
Could not find anything on the web that works for me for the moment...

Here's a way how you can do this:
df <- df[1:2, ] ## one way of selecting rows from first row to row number you want in a data frame
# a b c date
#1 1 2 3 2017-01-01
#2 1 2 3 2017-01-02
df <- df[-(3:nrow(df)), ] ## another way of filtering rows from starting from row which you don't want to total number of rows in a data frame
# a b c date
#1 1 2 3 2017-01-01
#2 1 2 3 2017-01-02
df <- df[df$date < "2017-01-03", ] ## subset based on a date value
# a b c date
#1 1 2 3 2017-01-01
#2 1 2 3 2017-01-02
data
df = data.frame(a = c(1,1,4,4), b = c(2,2,5,5), c = c(3,3,6,6),
date = seq(from = as.Date("2017-01-01"), to = as.Date("2017-01-04"), by = 'day')) ## creating a dummy data frame

We can use head
n <- 5
df2 <- head(df1, n)
df2
# date col2
#1 2019-01-01 -0.5458808
#2 2019-02-01 0.5365853
#3 2019-03-01 0.4196231
#4 2019-04-01 -0.5836272
#5 2019-05-01 0.8474600
Or create a logical vector
df1[seq_len(nrow(df1)) <= n, ]
Or another option is slice
library(dplyr)
df1 %>%
slice(seq_len(n))
Or with data.table
library(data.table)
setDT(df1)[seq_len(n)]
If it is based on a date value
date1 <- as.Date("2019-05-01")
subset(df1, date <= date1)
data
set.seed(24)
df1 <- data.frame(date = seq(as.Date("2019-01-01"), length.out = 10,
by = "month"), col2 = rnorm(10))

na.locf in data.table when completing by group

I have a data.table in which I'd like to complete a column to fill in some missing values, however I'm having some trouble filling in the other columns.
dt = data.table(a = c(1, 3, 5), b = c('a', 'b', 'c'))
dt[, .(a = seq(min(a), max(a), 1), b = na.locf(b))]
# a b
# 1: 1 a
# 2: 2 b
# 3: 3 c
# 4: 4 a
# 5: 5 b
However looking for something more like this:
dt %>%
complete(a = seq(min(a), max(a), 1)) %>%
mutate(b = na.locf(b))
# # A tibble: 5 x 2
# a b
# <dbl> <chr>
# 1 1 a
# 2 2 a
# 3 3 b
# 4 4 b
# 5 5 c
where the last value is carried forward

Another possible solution with only the (rolling) join capabilities of data.table:
dt[.(min(a):max(a)), on = .(a), roll = Inf]
which gives:
a b
1: 1 a
2: 2 a
3: 3 b
4: 4 b
5: 5 c
On large datasets this will probably outperform every other solution.
Courtesy to #Mako212 who gave the hint by using seq in his answer.
First posted solution which works, but gives a warning:
dt[dt[, .(a = Reduce(":", a))], on = .(a), roll = Inf]

data.table recycles observations by default when you try dt[, .(a = seq(min(a), max(a), 1))] so it never generates any NA values for na.locf to fill. Pretty sure you need to use a join here to "complete" the cases, and then you can use na.locf to fill.
dt[dt[, .(a = min(a):max(a))], on = 'a'][, .(a, b = na.locf(b))]
Not sure if there's a way to skip the separate t1 line, but this gives you the desired result.
a b
1: 1 a
2: 2 a
3: 3 b
4: 4 b
5: 5 c
And I'll borrow #Jaap's min/max line to avoid creating the second table. So basically you can either use his rolling join solution, or if you want to use na.locf this gets the same result.

Collapse redundant rows in data table

I have a data table in the format:
myTable <- data.table(Col1 = c("A", "A", "A", "B", "B", "B"), Col2 = 1:6)
print(myTable)
Col1 Col2
1: A 1
2: A 2
3: A 3
4: B 4
5: B 5
6: B 6
I want show only the highest result for each category in Col1, then collapse all others and present their sum in Col2. It should look like this:
print(myTable)
Col1 Col2
1: A 3
2: Others 3
3: B 6
4: Others 9
I managed to do it with the following code:
unique <- unique(myTable$Col1) # unique values in Col1
myTable2 <- data.table() # empty data table to populate
for(each in unique){
temp <- myTable[Col1 == each, ] # filter myTable for unique Col1 values
temp <- temp[order(-Col2)] # order filtered table increasingly
sumCol2 <- sum(temp$Col2) # sum of values in filtered Col2
temp <- temp[1, ] # retain only first element
remSum <- sumCol2 - sum(temp$Col2) # remaining sum in Col2 (without first element)
temp <- rbindlist(list(temp, data.table("Others", remSum))) # rbind first element and remaining elements
myTable2 <- rbindlist(list(myTable2, temp)) # populate data table from beginning
}
This works, but I am trying to shorten a very large data table, so it takes forever.
Is there any better way to approach this?
Thanks.
UPDATE: Actually my procedure is a little bit more complicated. I figured I would be able to develop it myself after the basics were mastered but it seems I will need further help instead. I want to display the 5 highest values in Col1, and collapse the others, but some entries in Col1 do not have 5 values; in these case, all entries should be displayed, and no "Others" row should be added.

Here the data is split into groups according to the value of Col1 (by = Col1). .N is the index of the last row in the given group, so c(Col2[.N], sum(Col2) - Col2[.N])) gives the last value of Col2, and the sum of Col2 minus the last value. The newly created variables are surrounded by .() because .() is an alias for the list() function when using data.table, and the created columns need to go in a list.
library(data.table)
setDT(df)
df[, .(Col1 = c(Col1, 'Others'),
Col2 = c(Col2[.N], sum(Col2) - Col2[.N]))
, by = Col1][, -1]
# Col1 Col2
# 1: A 3
# 2: Others 3
# 3: B 6
# 4: Others 9

If it just a matter of displaying things you could the 'tables' packages :
others <- function(x) sum(x)-last(x)
df %>% tabular(Col1*(last+others) ~ Col2, .)
# Col1 Col2
# A last 3
# others 3
# B last 6
# others 9

do.call(
rbind, lapply(split(myTable, factor(myTable$Col1)), function(x) rbind(x[which.max(x$Col2),], list("Other", sum(x$Col2[-which.max(x$Col2)]))))
)
# Col1 Col2
#1: A 3
#2: Other 3
#3: B 6
#4: Other 9

I did it! I made a new myTable to illustrate. I want to retain only the 4 highest values by category, and collapse the others.
set.seeed(123)
myTable <- data.table(Col1 = c(rep("A", 3), rep("B", 5), rep("C", 4)), Col2 = sample(1:12, 12))
print(myTable)
Col1 Col2
1: A 8
2: A 5
3: A 2
4: B 7
5: B 10
6: B 9
7: B 12
8: B 11
9: C 4
10: C 6
11: C 3
12: C 1
# set key to Col2, it will sort it increasingly
setkey(myTable, Col2)
# if there are more than 4 entries by Col1 category, will return all information, otherwise will return 4 entries completing with NA
myTable <- myTable[,.(Col2 = Col2[1:max(c(4, .N))]) , by = Col1]
# will print in Col1: 4 entries of Col1 category, then "Other"
# will print in Col2: 4 last entries of Col2 in that category, then the remaining sum
myTable <- myTable[, .(Col1 = c(rep(Col1, 4), "Other"), Col2 = c(Col2[.N-3:0], sum(Col2) - sum(Col2[.N-3:0]))), by = Col1]
# removes rows with NA inserted in first step
myTable <- na.omit(myTable)
# removes rows where Col2 = 0, inserted because that Col1 category had exactly 4 entries
myTable <- myTable[Col2 != 0]
Owooooo!

Here's a base R solution and the dplyr equivalent:
res <- aggregate(Col2 ~.,transform(
myTable, Col0 = replace(Col1,duplicated(Col1,fromLast = TRUE), "Other")), sum)
res[order(res$Col1),-1]
# Col0 Col2
# 1 A 3
# 3 Other 3
# 2 B 6
# 4 Other 9
myTable %>%
group_by(Col0= Col1, Col1= replace(Col1,duplicated(Col1,fromLast = TRUE),"Other")) %>%
summarize_at("Col2",sum) %>%
ungroup %>%
select(-1)
# # A tibble: 4 x 2
# Col1 Col2
# <chr> <int>
# 1 A 3
# 2 Other 3
# 3 B 6
# 4 Other 9

Get number of same individuals for different groups

I have a data set with individuals (ID) that can be part of more than one group.
Example:
library(data.table)
DT <- data.table(
ID = rep(1:5, c(3:1, 2:3)),
Group = c("A", "B", "C", "B",
"C", "A", "A", "C",
"A", "B", "C")
)
DT
# ID Group
# 1: 1 A
# 2: 1 B
# 3: 1 C
# 4: 2 B
# 5: 2 C
# 6: 3 A
# 7: 4 A
# 8: 4 C
# 9: 5 A
# 10: 5 B
# 11: 5 C
I want to know the sum of identical individuals for 2 groups.
The result should look like this:
Group.1 Group.2 Sum
A B 2
A C 3
B C 3
Where Sum indicates the number of individuals the two groups have in common.

Here's my version:
# size-1 IDs can't contribute; skip
DT[ , if (.N > 1)
# simplify = FALSE returns a list;
# transpose turns the 3-length list of 2-length vectors
# into a length-2 list of 3-length vectors (efficiently)
transpose(combn(Group, 2L, simplify = FALSE)), by = ID
][ , .(Sum = .N), keyby = .(Group.1 = V1, Group.2 = V2)]
With output:
# Group.1 Group.2 Sum
# 1: A B 2
# 2: A C 3
# 3: B C 3

As of version 1.9.8 (on CRAN 25 Nov 2016), data.table has gained the ability to do non-equi joins. So, a self non-equi join can be used:
library(data.table) # v1.9.8+
setDT(DT)[, Group:= factor(Group)]
DT[DT, on = .(ID, Group < Group), nomatch = 0L, .(ID, x.Group, i.Group)][
, .N, by = .(x.Group, i.Group)]
x.Group i.Group N
1: A B 2
2: A C 3
3: B C 3
Explanantion
The non-equi join on ID, Group < Group is a data.table version of combn() (but applied group-wise):
DT[DT, on = .(ID, Group < Group), nomatch = 0L, .(ID, x.Group, i.Group)]
ID x.Group i.Group
1: 1 A B
2: 1 A C
3: 1 B C
4: 2 B C
5: 4 A C
6: 5 A B
7: 5 A C
8: 5 B C

We self-join with the same dataset on 'ID', subset the rows where the 'Group' columns are different, get the nrows (.N), grouped by the 'Group' columns, sort the 'Group.1' and 'Group.2' columns by row using pmin/pmax and get the unique value of 'N'.
library(data.table)#v1.9.6+
DT[DT, on='ID', allow.cartesian=TRUE][Group!=i.Group, .N ,.(Group, i.Group)][,
list(Sum=unique(N)) ,.(Group.1=pmin(Group, i.Group), Group.2=pmax(Group, i.Group))]
# Group.1 Group.2 Sum
#1: A B 2
#2: A C 3
#3: B C 3
Or as mentioned in the comments by #MichaelChirico and #Frank, we can convert 'Group' to factor class, subset the rows based on as.integer(Group) < as.integer(i.Group), group by 'Group', 'i.Group' and get the nrow (.N)
DT[, Group:= factor(Group)]
DT[DT, on='ID', allow.cartesian=TRUE][as.integer(Group) < as.integer(i.Group), .N,
by = .(Group.1= Group, Group.2= i.Group)]

Great answers above.
Just an alternative using dplyr in case you, or someone else, is interested.
library(dplyr)
cmb = combn(unique(dt$Group),2)
data.frame(g1 = cmb[1,],
g2 = cmb[2,]) %>%
group_by(g1,g2) %>%
summarise(l=length(intersect(DT[DT$Group==g1,]$ID,
DT[DT$Group==g2,]$ID)))
# g1 g2 l
# (fctr) (fctr) (int)
# 1 A B 2
# 2 A C 3
# 3 B C 3

yet another solution (base R):
tmp <- split(DT, DT[, 'Group'])
ans <- apply(combn(LETTERS[1 : 3], 2), 2, FUN = function(ind){
out <- length(intersect(tmp[[ind[1]]][, 1], tmp[[ind[2]]][, 1]))
c(group1 = ind[1], group2 = ind[2], sum_ = out)
}
)
data.frame(t(ans))
# group1 group2 sum_
#1 A B 2
#2 A C 3
#3 B C 3
first split data into list of groups, then for each unique pairwise combinations of two groups see how many subjects in common they have, using length(intersect(....

data.table aggregation with rolling subset on date

I have a set of data along these lines
d1 <- data.frame(
cat1 = sample(c('a', 'b', 'c'), 100, replace = TRUE),
date = rep(Sys.Date() - sample(1:100)),
val = rnorm(100, 50, 5)
)
require(data.table)
d2 <- data.table(d1)
I can get a daily sum without problem
d2[ , list(.N, sum(val)), by = c("cat1", "date")]
I want to get a sum over 2 days (and then 7 days)
This works:
d.list <- sort(unique(d2$date))
o.list <- list()
for(i in seq_along(d.list)){
o.list[[i]] <- d2[d2$date >= d.list[i] - 1 & d2$date <= d.list[i], list(.N, sum(val), max(date)), by = c("cat1")]
}
do.call(rbind, o.list)
But slows down on a bigger data set, and doesn't seem to be the best use of data.table.
Is there a more efficient way?

This is a bit faster:
First we join for exact matches and obtain the last index (in case of multiple matches)
setkey(d2, cat1, date)
tmp1 = d2[unique(d2, by=key(d2)), which=TRUE, mult="last", allow.cartesian=TRUE]
Then, we construct a copy of d2 and change date to date-1 by reference. Then, we perform a join with roll=-Inf - which is next observation carried backwards. In other words, if there's no exact match, it'll fill the next available value.
d3 = copy(d2)[, date := date-1]
setkey(d3, cat1, date)
tmp2 = d2[unique(d3, by=key(d2)), roll=-Inf, which=TRUE, allow.cartesian=TRUE]
From here, we put together the indices:
idx1 = tmp1-tmp2+1L
idx2 = data.table:::vecseq(tmp2, idx1, sum(idx1))
Subset d2 from idx2 and generate unique ids from idx1:
ans1 = d2[idx2][, grp := rep(seq_along(idx1), idx1)]
Finally aggregate by grp and get the desired result:
ans1 = ans1[, list(cat1=cat1[1L], date=date[.N],
N = .N, val=sum(val)), by=grp][, grp:=NULL]
> head(ans1, 10L)
# cat1 date N val
# 1: a 2014-01-20 1 47.69178
# 2: a 2014-01-25 1 52.01006
# 3: a 2014-02-01 1 46.82132
# 4: a 2014-02-06 1 44.62404
# 5: a 2014-02-11 1 49.63218
# 6: a 2014-02-14 1 48.80676
# 7: a 2014-02-22 1 49.27800
# 8: a 2014-02-23 2 96.17617
# 9: a 2014-02-26 1 49.20623
# 10: a 2014-02-28 1 46.72708
The results are identical as in your solution. This one took 0.02 seconds on my laptop, where as yours took 0.58 seconds.
For 7 days, just change:
d3 = copy(d2)[, date := date-1]
to
d3 = copy(d2)[, date := date-6]

It's very poorly explained in the OP what you want, but this seems to be it:
# generate the [date-1,date] sequences for each date
# adjust length.out to suit your needs
dates = d2[, list(date.seq = seq(date, by = -1, length.out = 2)), by = date]
setkey(dates, date.seq)
setkey(d2, date)
# merge and extract info needed
dates[d2][, list(.N, sum(val), date.seq[.N]), by = list(date, cat1)][, !"date"]
# cat1 N V2 V3
# 1: a 1 38.95774 2014-01-21
# 2: a 1 38.95774 2014-01-21
# 3: c 1 55.68445 2014-01-22
# 4: c 2 102.20806 2014-01-23
# 5: c 1 46.52361 2014-01-23
# ---
#164: c 1 50.17986 2014-04-27
#165: b 1 51.43489 2014-04-28
#166: b 2 100.91982 2014-04-29
#167: b 1 49.48493 2014-04-29
#168: c 1 54.93311 2014-04-30

Would it be possible to set up a binned date, and then do by on that?
d2$day7 <- as.integer(d2$date) %/% 7
d2[ , list(.N, sum(val)), by = c("cat1", "day7")]
That would give a binned value - if you want a sliding 7 day window, I'd need to think again. Also, for a binned approach, you might need to subtract an offset before doing the %/% if you want to chose the day of the week the groups start at.

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

dplyr - Join tables based on a date difference - r

Related

How to drop all rows of data frame after specific row number or date in R?

na.locf in data.table when completing by group

Collapse redundant rows in data table

Get number of same individuals for different groups

data.table aggregation with rolling subset on date

Categories

Resources