Is it possible to compute any window join in data.table - r

I came across a lot of posts asking about window joins
rolling median
rolling regression
Since data.table 1.8.8 and the roll parameter my understanding is that we can do those things. Say we have X and Y with same keys say x,y,t, we want to be able to get for each line of X
all the rows of Y where x,y of Y are matching those of X AND where X$t in [Y$t-w1,Y$t+w2]
Here is an example with (w1,w2)=(1,5)
library(data.table)
A <- data.table(x=c(1,1,1,2,2),y=c(F,F,T,T,T),t=c(407,286,788,882,942),key='x,y,t')
X <- copy(A)
Y <- data.table(x=c(1,1,1,2,2,2,2),y=c(F,F,T,T,T,T,T),u=c(417,285,788,882,941,942,945),IDX=1:7,key='x,y,u')
R) X
x y t
1: 1 FALSE 286
2: 1 FALSE 407
3: 1 TRUE 788
4: 2 TRUE 882
5: 2 TRUE 942
R) Y
x y u IDX
1: 1 FALSE 285 2 # match line 1 as (x,y) ok and 285 in [286-1,286+5]
2: 1 FALSE 417 1 # match no line as (x,y) ok against X[c(1,2),] but 417 is too big
3: 1 TRUE 788 3 # match row 3
4: 2 TRUE 882 4 # match row 4
5: 2 TRUE 941 5 # match row 5
6: 2 TRUE 942 6 # match row 5
7: 2 TRUE 945 7 # match row 5
We cannot do Y[setkey(X[,list(x,y,t)],x,y,t),roll=1] because if we have a perfect match on (x,y,t) data.table will discard potential partial matches with X$t in [Y$t-w1,X$t[.
#get the lower bounds and upper bounds for t
X[,`:=`(lowT=t-1,upT=t+5)]
#we get the first line where Y$u >= X$t-1 but Y$u <= X$t+5
X <- setnames(copy(Y),c('u','IDX'),c('lowT','lowIDX'))[setkey(X,x,y,lowT),roll=-6,rollends=T]
#we get the last line where Y$u <= X$t+5 ...
X <- setnames(copy(Y),c('u','IDX'),c('upT','upIDX'))[setkey(X,x,y,upT),roll=6]
#we get the matching IDX
X[!is.na(lowIDX) & !is.na(upIDX), allIDX:=mapply(`seq`,from=lowIDX,to=upIDX)]
R) X
x y upT upIDX lowT lowIDX t allIDX
1: 1 FALSE 291 2 285 2 286 2
2: 1 FALSE 412 NA 406 NA 407
3: 1 TRUE 793 3 787 3 788 3
4: 2 TRUE 887 4 881 4 882 4
5: 2 TRUE 947 7 941 5 942 5,6,7
My questions are:
Am I correct to think that window joins could not be achieve easily before roll?
Can we solve the pb if we want X$t in ]Y$t-w1,Y$t+w2[ (not compact set anymore) ?

Related

Add column of repeating sequence of sequential numbers

I would like to add a column repeating the numbers 1 to 577 to a dataframe of over 15,000 rows.
This is the dataframe:
> head(corr_dat_cond1)
participant search_difficulty key_resp.corr key_resp.rt target_position distractor1_colour
1 1010 difficult 1 1.0820000 left [0.82,0.31,0]
2 1010 no_search 1 0.5400000 left [-1,-1,-1]
3 1010 difficult 1 0.5119998 down [0.82,0,0.31]
4 1010 no_search 1 0.7079999 right [-1,-1,-1]
5 1010 difficult 1 1.0249999 up [0.82,0.31,0]
6 1010 no_search 1 0.4889998 left [-1,-1,-1]
distractor2_colour non_target_colour non_target_pos cue_uposition target_char non_target_char cue_time
1 [0.82,0,0.31] [0.82,0.31,0] [0.328,0] up = x 1.1
2 [-1,-1,-1] [-1,-1,-1] [0.328,0] right x = 1.2
3 [0.82,0.31,0] [0.82,0,0.31] [0.328,0] down x = 1.0
4 [-1,-1,-1] [-1,-1,-1] [0,0.328] left = x 1.4
5 [0.82,0,0.31] [0.82,0.31,0] [0,-0.328] left x = 1.4
6 [-1,-1,-1] [-1,-1,-1] [0,-0.328] up x = 1.0
cue_colour n cue_validity mrt stdev low_cutoff high_cutoff cond trial_num
1 Mismatch (Onset) cue 577 FALSE 0.7639095 0.2481090 0.0195825 1.5082365 1 1
2 Mismatch (Onset) cue 577 FALSE 0.5530880 0.1243826 0.1799402 0.9262358 1 2
3 Mismatch (Onset) cue 577 TRUE 0.7639095 0.2481090 0.0195825 1.5082365 1 3
4 Match (Color) cue 577 FALSE 0.5530880 0.1243826 0.1799402 0.9262358 1 4
5 Match (Color) cue 577 FALSE 0.7639095 0.2481090 0.0195825 1.5082365 1 5
6 Mismatch (Onset) cue 577 FALSE 0.5530880 0.1243826 0.1799402 0.9262358 1 6
The trial_num column is my initial attempt at trying adding a column of sequential numbers. This is the code I used:
corr_dat_cond1$trial_num <- 1:nrow(corr_dat_cond1)
However, I'd like the numbers to repeat every 577 rows instead of counting all the way up to the number of rows in the dataframe.
Any help would be appreciated! Thank you.
You can use the rep_len function.
trial_num <- rep_len(1:577, nrow(corr_dat_cond1))
This is the same as calling rep with the length.out specified.

Creating a table of results over multiple variables in R

I am using a large dataset that contains multiple variables that contain similar information. The variables range from PR1 through PR25. Each contains information regarding a procedure code. in short the dataframe looks like this:
Obs PR1 PR2 PR3
1 527 1422 222
2 1600 527 569
3 341 222 341
4 222 569 1422
5 569 341 1660
Where PR1 through PR25 values are factors.
I am looking for a way to make a table of information across all of these variables. For instance, I would like to make a table that shows a count of total number of value "527" for PR1:PR25. I would like to do this for multiple values of interest.
For instance
PR Tot
#222 3
#341 3
#527 2
#569 3
#1600 1
#1660 1
However, I only want to retrieve the frequency for a very specific set of values such as only extracting the frequency of 527 or 1600.
I have initially tried using a simple function like length(which(PR1=="527")), which works but is tedious.
I used the method suggested by Soren using:
library(plyr)
all_codes <- data.frame(codes=unlist(lapply(df,levels),use.names=F))
result <- ddply(all_codes,.(codes),summarize,count=length(codes))
result[which(result$codes %in% c("527", "5251", "5252", "5253", "5259",
"526", "521", "529", "8512", "8521", "344", "854", "8523", "8541", "8546",
"8542", "8547" , "8544", "8545", "8543", "639",
"064","065","063","0650","0651", "0652", "062", "066", "4040", "4041",
"4042", "0721", "0712","0701", "0702", "070", "0741", "435","436", "4399",
"439", "438", "437", "4381", "4391", "4342", "5122", "5121", "5124", "5123",
"518", "519", "503", "5022", "5012")),]
And got the following output (abbreviated):
codes count
92 062 5
95 064 8
96 0650 2
769 526 8
770 527 8
However, I had a feeling that was incorrect. When I checked it against the output from sapply(df, function(PR1) length(which(PR1 == "527")))
I get the following:
PR1 PR2 PR3 PR4 PR5 PR6 PR7 PR8 ...
1152 36 6 1 2 1 1 1
Which is the correct number of "527" cases in the dataframe. Any suggestions why the first method is giving incorrect sums of factor levels?
Thanks for any help, and let me know if I can provide more info
You can use sapply() or lapply() function to get count of a some value over all columns.
Create data frame df
df <- data.frame(A = 1:4, B = c(4,4,4,4), C = c(2,3,4,4), D = 9:12)
df
# A B C D
# 1 1 4 2 9
# 2 2 4 3 10
# 3 3 4 4 11
# 4 4 4 4 12
Frequency of value "4" in each column A, B, C, and D using sapply() function
sapply(df, function(x) length(which(x == 4)))
A B C D
1 4 2 0
Frequency of value "4" in each column A, B, C, and D using lapply() function
lapply(df, function(x) length(which(x == 4)))
# $A
# [1] 1
# $B
# [1] 4
# $C
# [1] 2
# $D
# [1] 0
The following takes your example and returns an output that may be generalized across all 25 columns. The "plyr" library is used to create the aggregated counts
Scripted as follows:
library(plyr)
df <- data.frame(PR1=c("527","1600","341","222","569"),PR2=c("1422","527","222","569","341"),PR3=c("222","569","341","1422","1660"),stringsAsFactors = T)
all_codes <- data.frame(codes=unlist(lapply(df,levels),use.names=F))
result <- ddply(all_codes,.(codes),summarize,count=length(codes))
result[which(result$codes %in% c('527','222')),]
Explained as follows:
Create the data frame as specified above. As OP noted values are factors, stringsAsFactors is set to TRUE
df <- data.frame(
PR1=c("527","1600","341","222","569"),
PR2=c("1422","527","222","569","341"),
PR3=c("222","569","341","1422","1660"),
stringsAsFactors = T)
Reviewing results of df
df
PR1 PR2 PR3
1 527 1422 222
2 1600 527 569
3 341 222 341
4 222 569 1422
5 569 341 1660
As OP asks to combine all the codes across PR1:PR25 a these are unified into a single list by using lapply to loop across all the columns. However, as these are factors -- and it seems that the interest in the in the level value of the factor and not its underlying numeric representation, lapply(df,levels) returns these values. To merge into a single list PR1:PR25 it's simply unlist() and since the column names are seemingly not useful in this case, use.names is set to FALSE. Finally, a data.frame is created with the single column called codes, which is later fed into the ddply() function to get the counts.
all_codes <- data.frame(codes=unlist(lapply(df,levels),use.names=F))
all_codes
codes
1 1600
2 222
3 341
4 527
5 569
6 1422
7 222
8 341
9 527
10 569
11 1422
12 1660
13 222
14 341
15 569
Uisng ddply() to split() the data.frame on df$codes value and then take the length() of each vector returned by split in ddply()
result <- ddply(all_codes,.(codes),summarize,count=length(codes))
result
Reviewing the result gives the PR1:PR25 aggregated count of all the level values of each factor in the original data.frame
codes count
1 1422 2
2 1600 1
3 1660 1
4 222 3
5 341 3
6 527 2
7 569 3
And since we're only interested in specific values (527 given in OP, but here two values of interest are exemplified, 527 and 222:
result[which(result$codes %in% c('527','222')),]
codes count
4 222 3
6 527 2

Conditional data.table merge with .EACHI

I have been playing around with the newer data.table conditional merge feature and it is very cool. I have a situation where I have two tables, dtBig and dtSmall, and there are multiple row matches in both datasets when this conditional merge takes place. Is there a way to aggregate these matches using a function like max or min for these multiple matches? Here is a reproducible example that tries to mimic what I am trying to accomplish.
Set up environment
## docker run --rm -ti rocker/r-base
## install.packages("data.table", type = "source",repos = "http://Rdatatable.github.io/data.table")
Create two fake datasets
A create a "big" table with 50 rows (10 values for each ID).
library(data.table)
set.seed(1L)
# Simulate some data
dtBig <- data.table(ID=c(sapply(LETTERS[1:5], rep, 10, simplify = TRUE)), ValueBig=ceiling(runif(50, min=0, max=1000)))
dtBig[, Rank := frank(ValueBig, ties.method = "first"), keyby=.(ID)]
ID ValueBig Rank
1: A 266 3
2: A 373 4
3: A 573 5
4: A 909 9
5: A 202 2
---
46: E 790 9
47: E 24 1
48: E 478 2
49: E 733 7
50: E 693 6
Create a "small" dataset similar to the first, but with 10 rows (2 values for each ID)
dtSmall <- data.table(ID=c(sapply(LETTERS[1:5], rep, 2, simplify = TRUE)), ValueSmall=ceiling(runif(10, min=0, max=1000)))
ID ValueSmall
1: A 478
2: A 862
3: B 439
4: B 245
5: C 71
6: C 100
7: D 317
8: D 519
9: E 663
10: E 407
Merge
I next want to perform a merge by ID and needs to merge only where ValueSmall is greater than or equal to ValueBig. For the matches, I want to grab the max ranked value in dtBig. I tried doing this two different ways. Method 2 gives me the desired output, but I am unclear why the output is different at all. It seems like it is just returning the last matched value.
## Method 1
dtSmall[dtBig, RankSmall := max(i.Rank), by=.EACHI, on=.(ID, ValueSmall >= ValueBig)]
## Method 2
setorder(dtBig, ValueBig)
dtSmall[dtBig, RankSmall2 := max(i.Rank), by=.EACHI, on=.(ID, ValueSmall >= ValueBig)]
Results
ID ValueSmall RankSmall RankSmall2 DesiredRank
1: A 478 1 4 4
2: A 862 1 7 7
3: B 439 3 4 4
4: B 245 1 2 2
5: C 71 1 1 1
6: C 100 1 1 1
7: D 317 1 2 2
8: D 519 3 5 5
9: E 663 2 5 5
10: E 407 1 1 1
Is there a better data.table way of grabbing the max value in another data.table with multiple matches?
I next want to perform a merge by ID and needs to merge only where ValueSmall is greater than or equal to ValueBig. For the matches, I want to grab the max ranked value in dtBig.
setorder(dtBig, ID, ValueBig, Rank)
dtSmall[, r :=
dtBig[.SD, on=.(ID, ValueBig <= ValueSmall), mult="last", x.Rank ]
]
ID ValueSmall r
1: A 478 4
2: A 862 7
3: B 439 4
4: B 245 2
5: C 71 1
6: C 100 1
7: D 317 2
8: D 519 5
9: E 663 5
10: E 407 1
I imagine it is considerably faster to sort dtBig and take the last matching row rather than to compute the max by .EACHI, but am not entirely sure. If you don't like sorting, just save the previous sort order so it can be reverted to afterwards.
Is there a way to aggregate these matches using a function like max or min for these multiple matches?
For this more general problem, .EACHI works, just making sure you're doing it for each row of the target table (dtSmall in this case), so...
dtSmall[, r :=
dtBig[.SD, on=.(ID, ValueBig <= ValueSmall), max(x.Rank), by=.EACHI ]$V1
]

Add a row of zeros in a data frame created with ddply if there are no observations

I used the function ddply (package plyr) to calculate the mean of a response variable for each group "Trial" and "Treatment". I get this data frame:
Trial Treatment N Mean
1 A 458 125.258
1 B 459 168.748
2 A 742 214.266
2 B 142 475.786
3 A 247 145.689
3 B 968 234.129
4 A 436 456.287
This data frame suggests that in the trial 4 and treatment B, there are no observations for the response variable (as no row is specified in the data frame). So, is it possible to automatically add a row of zeros in the data frame (built with the function “ddply”) when there are no observations for a given response variable?
I would like to get this data frame:
Trial Treatment N Mean
1 A 458 125.258
1 B 459 168.748
2 A 742 214.266
2 B 142 475.786
3 A 247 145.689
3 B 968 234.129
4 A 436 456.287
4 B 0 0
We can merge the original dataset with another data.frame created with the full combination of unique values in 'Trial', and 'Treatment'. It will give an output with the missing combinations filled with NA. If needed, this can be changed to 0 (but it is better to have the missing combination as NA).
res <- merge(expand.grid(lapply(df1[1:2], unique)), df1, all.x=TRUE)
is.na(res) <- res==0
Or with dplyr/tidyr, we can use complete (from tidyr)
library(dplyr)
library(tidyr)
df1 %>%
complete(Trial, Treatment, fill= list(N=0, Mean=0))
# Trial Treatment N Mean
# (int) (chr) (dbl) (dbl)
#1 1 A 458 125.258
#2 1 B 459 168.748
#3 2 A 742 214.266
#4 2 B 142 475.786
#5 3 A 247 145.689
#6 3 B 968 234.129
#7 4 A 436 456.287
#8 4 B 0 0.000

Row Differences in Dataframe by Group

My problem has to do with finding row differences in a data frame by group. I've tried to do this a few ways. Here's an example. The real data set is several million rows long.
set.seed(314)
df = data.frame("group_id"=rep(c(1,2,3),3),
"date"=sample(seq(as.Date("1970-01-01"),Sys.Date(),by=1),9,replace=F),
"logical_value"=sample(c(T,F),9,replace=T),
"integer"=sample(1:100,9,replace=T),
"float"=runif(9))
df = df[order(df$group_id,df$date),]
I ordered it by group_id and date so that the diff function can find the sequential differences, which results in time ordered differences of the logical, integer, and float variables. I could easily do some sort of apply(df,2,diff), but I need it by group_id. Hence, doing apply(df,2,diff) results in extra unneeded results.
df
group_id date logical_value integer float
1 1 1974-05-13 FALSE 4 0.03472876
4 1 1979-12-02 TRUE 45 0.24493995
7 1 1980-08-18 TRUE 2 0.46662253
5 2 1978-12-08 TRUE 56 0.60039164
2 2 1981-12-26 TRUE 34 0.20081799
8 2 1986-05-19 FALSE 60 0.43928929
6 3 1983-05-22 FALSE 25 0.01792820
9 3 1994-04-20 FALSE 34 0.10905326
3 3 2003-11-04 TRUE 63 0.58365922
So I thought I could break up my data frame into chunks by group_id, and pass each chunk into a user defined function:
create_differences = function(data_group){
apply(data_group, 2, diff)
}
But I get errors using the code:
diff_df = lapply(split(df,df$group_id),create_differences)
Error in r[i1] - r[-length(r):-(length(r) - lag + 1L)] : non-numeric argument to binary operator
by(df,df$group_id,create_differences)
Error in r[i1] - r[-length(r):-(length(r) - lag + 1L)] : non-numeric argument to binary operator
As a side note, the data is nice, no NAs, nulls, blanks, and every group_id has at least 2 rows associated with it.
Edit 1: User alexis_laz correctly pointed out that my function needs to be sapply(data_group, diff).
Using this edit, I get a list of data frames (one list entry per group).
Edit 2:
The expected output would be a combined data frame of differences. Ideally, I would like to keep the group_id, but if not, it's not a big deal. Here is what the sample output should be like:
diff_df
group_id date logical_value integer float
[1,] 1 2029 1 41 0.2102112
[2,] 1 260 0 -43 0.2216826
[1,] 2 1114 0 -22 -0.3995737
[2,] 2 1605 -1 26 0.2384713
[1,] 3 3986 0 9 0.09112507
[2,] 3 3485 1 29 0.47460596
I think regarding the fact that you have millions of rows you can move to the data.table suitable for by group actions.
library(data.table)
DT <- as.data.table(df)
## this will order per group and per day
setkeyv(DT,c('group_id','date'))
## for all column apply diff
DT[,lapply(.SD,diff),group_id]
# group_id date logical_value integer float
# 1: 1 2029 days 1 41 0.21021119
# 2: 1 260 days 0 -43 0.22168257
# 3: 2 1114 days 0 -22 -0.39957366
# 4: 2 1604 days -1 26 0.23847130
# 5: 3 3987 days 0 9 0.09112507
# 6: 3 3485 days 1 29 0.47460596
It certainly won't be as quick compared to data.table but below is an only slightly ugly base solution using aggregate:
result <- aggregate(. ~ group_id, data=df, FUN=diff)
result <- cbind(result[1],lapply(result[-1], as.vector))
result[order(result$group_id),]
# group_id date logical_value integer float
#1 1 2029 1 41 0.21021119
#4 1 260 0 -43 0.22168257
#2 2 1114 0 -22 -0.39957366
#5 2 1604 -1 26 0.23847130
#3 3 3987 0 9 0.09112507
#6 3 3485 1 29 0.47460596

Resources