Group a data.table by the continuity of a variable in R

I want to split a data.table into groups based on the continuity of one variable.
That is, starting from this data.table:
DT <- data.table(Var1 = c(1:5, 7:10))
I want it to be grouped like this:
# Var1 group
# 1: 1 1 # 1 to 5 is continuous with a maximal difference of 1
# 2: 2 1
# 3: 3 1
# 4: 4 1
# 5: 5 1
# 6: 7 2 # 7 to 10 is continuous again
# 7: 8 2
# 8: 9 2
# 9: 10 2
The allowed difference in Var1 should not be fixed at one as in this minimal example, but adjustable, so that DT <- data.table(Var1 = c(seq(1,10, 2), seq(13,30, 2))) is also split into two groups when the maximal difference is set to 2.
EDIT:
To clarify: a 'maximal difference' of 2 (or more) means that differences in Var1 smaller than that threshold should be treated as 'continuous'. Furthermore, Var1 should not be limited to integer values. The latter could be worked around by multiplying, e.g., 0.14 by 100 to get 14 and scaling the 'maximal difference' by 100 as well.

DT[, group := rleid(cumprod(c(1, diff(Var1))))]
# Var1 group
#1: 1 1
#2: 2 1
#3: 3 1
#4: 4 1
#5: 5 1
#6: 7 2
#7: 8 2
#8: 9 2
#9: 10 2
step <- 2
DT <- data.table(Var1 = c(seq(1,10, 2), seq(13,30, 2)))
DT[, group := rleid(cumsum(c(FALSE, diff(Var1) != step)))]
# Var1 group
# 1: 1 1
# 2: 3 1
# 3: 5 1
# 4: 7 1
# 5: 9 1
# 6: 13 2
# 7: 15 2
# 8: 17 2
# 9: 19 2
#10: 21 2
#11: 23 2
#12: 25 2
#13: 27 2
#14: 29 2
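If you need the clarified semantics literally (any gap of at most a chosen maximal difference counts as continuous, and Var1 may be non-integer), the rleid is not even needed; a cumsum over the gap test is enough. A minimal sketch, with a made-up threshold and data:
maxdiff <- 0.05
DT <- data.table(Var1 = c(0.10, 0.12, 0.14, 0.30, 0.33, 0.35))
# start a new group whenever the gap to the previous value exceeds maxdiff
DT[, group := cumsum(c(TRUE, diff(Var1) > maxdiff))]
#    Var1 group
# 1: 0.10     1
# 2: 0.12     1
# 3: 0.14     1
# 4: 0.30     2
# 5: 0.33     2
# 6: 0.35     2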

A base R solution.
foo <- function(x){
  gr <- which(!(duplicated(diff(x)) | duplicated(diff(x), fromLast = TRUE)))
  if(length(gr) == 1){
    cbind(Var1 = x, group = rep(1:(length(gr)+1), c(min(gr), length(x)-max(gr))))
  } else {
    cbind(Var1 = x, group = rep(1:(length(gr)+1), c(min(gr), diff(gr), length(x)-max(gr))))
  }
}
All kinds of differences work:
foo(c(seq(1,10, 2), seq(13,30, 2)))
Var1 group
[1,] 1 1
[2,] 3 1
[3,] 5 1
[4,] 7 1
[5,] 9 1
[6,] 13 2
[7,] 15 2
[8,] 17 2
[9,] 19 2
[10,] 21 2
[11,] 23 2
[12,] 25 2
[13,] 27 2
[14,] 29 2
Three groups work as well:
foo(c(1:5, 7:10, 13:20))
Var1 group
[1,] 1 1
[2,] 2 1
[3,] 3 1
[4,] 4 1
[5,] 5 1
[6,] 7 2
[7,] 8 2
[8,] 9 2
[9,] 10 2
[10,] 13 3
[11,] 14 3
[12,] 15 3
[13,] 16 3
[14,] 17 3
[15,] 18 3
[16,] 19 3
[17,] 20 3
For a data.table you can try:
foo <- function(x){
  gr <- which(!(duplicated(diff(x)) | duplicated(diff(x), fromLast = TRUE)))
  if(length(gr) == 1){
    rep(1:(length(gr)+1), c(min(gr), length(x)-max(gr)))
  } else {
    rep(1:(length(gr)+1), c(min(gr), diff(gr), length(x)-max(gr)))
  }
}
DT[, group := foo(Var1)]

Related

What is the best way to tidy a matrix in R

Is there a best practice means of "tidying" a matrix/array? By "tidy" in this context I mean
one row per element of the matrix
one column per dimension; the elements of these columns give you the "coordinates" of the matrix element stored in that row
I have an example here for a 2d matrix, but ideally this would work with an array also (This example works for mm <- array(1:18, c(3,3,3)), but I thought that would be too much to paste in here)
mm <- matrix(1:9, nrow = 3)
mm
#> [,1] [,2] [,3]
#> [1,] 1 4 7
#> [2,] 2 5 8
#> [3,] 3 6 9
inds <- which(mm > -Inf, arr.ind = TRUE)
cbind(inds, value = mm[inds])
#> row col value
#> [1,] 1 1 1
#> [2,] 2 1 2
#> [3,] 3 1 3
#> [4,] 1 2 4
#> [5,] 2 2 5
#> [6,] 3 2 6
#> [7,] 1 3 7
#> [8,] 2 3 8
#> [9,] 3 3 9
as.data.frame.table
One way to convert from wide to long is the following. See ?as.data.frame.table for more information. No packages are used.
mm <- matrix(1:9, 3)
long <- as.data.frame.table(mm)
The code gives this data.frame:
> long
Var1 Var2 Freq
1 A A 1
2 B A 2
3 C A 3
4 A B 4
5 B B 5
6 C B 6
7 A C 7
8 B C 8
9 C C 9
numbers
If you prefer row and column numbers:
long[1:2] <- lapply(long[1:2], as.numeric)
giving:
> long
Var1 Var2 Freq
1 1 1 1
2 2 1 2
3 3 1 3
4 1 2 4
5 2 2 5
6 3 2 6
7 1 3 7
8 2 3 8
9 3 3 9
names
Note that above it used A, B, C, ... because there were no row or column names. They would have been used if present. That is, had there been row and column names and dimension names, the output would look like this:
mm2 <- array(1:9, c(3, 3), dimnames = list(A = c("a", "b", "c"), B = c("x", "y", "z")))
as.data.frame.table(mm2, responseName = "Val")
giving:
A B Val
1 a x 1
2 b x 2
3 c x 3
4 a y 4
5 b y 5
6 c y 6
7 a z 7
8 b z 8
9 c z 9
3d
Here is a 3d example:
as.data.frame.table(array(1:8, c(2,2,2)))
giving:
Var1 Var2 Var3 Freq
1 A A A 1
2 B A A 2
3 A B A 3
4 B B A 4
5 A A B 5
6 B A B 6
7 A B B 7
8 B B B 8
2d only
For 2d one can alternately use row and col:
sapply(list(row(mm), col(mm), mm), c)
or
cbind(c(row(mm)), c(col(mm)), c(mm))
Either of these gives this matrix:
[,1] [,2] [,3]
[1,] 1 1 1
[2,] 2 1 2
[3,] 3 1 3
[4,] 1 2 4
[5,] 2 2 5
[6,] 3 2 6
[7,] 1 3 7
[8,] 2 3 8
[9,] 3 3 9
Another method is to use arrayInd together with cbind like this.
# a 3 X 3 X 2 array
mm <- array(1:18, dim=c(3,3,2))
Similar to your code, but with the more natural arrayInd function, we have
# get array in desired format
myMat <- cbind(c(mm), arrayInd(seq_along(mm), .dim=dim(mm)))
# add column names
colnames(myMat) <- c("values", letters[24:26])
which returns
myMat
values x y z
[1,] 1 1 1 1
[2,] 2 2 1 1
[3,] 3 3 1 1
[4,] 4 1 2 1
[5,] 5 2 2 1
[6,] 6 3 2 1
[7,] 7 1 3 1
[8,] 8 2 3 1
[9,] 9 3 3 1
[10,] 10 1 1 2
[11,] 11 2 1 2
[12,] 12 3 1 2
[13,] 13 1 2 2
[14,] 14 2 2 2
[15,] 15 3 2 2
[16,] 16 1 3 2
[17,] 17 2 3 2
[18,] 18 3 3 2
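If a data.frame is preferred, as in the "tidy" framing of the question, the same arrayInd idea wraps up in a couple of lines; a small sketch (the column names dim1, dim2, dim3 are arbitrary):
mm <- array(1:18, dim = c(3, 3, 2))
# one row of indices per element, with the element value alongside
tidy <- data.frame(arrayInd(seq_along(mm), dim(mm)), value = c(mm))
names(tidy)[1:3] <- c("dim1", "dim2", "dim3")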

Create numbered sequence for occurrences of a given nesting variable

I'm hoping to add a variable to a data set that numbers the occurrences of a certain grouping variable. For example:
ids <- c(rep(1,4),rep(2,6),rep(3,2))
I want another variable that counts the instances in which each id appears, creating a vector like this:
1,2,3,4,1,2,3,4,5,6,1,2
With them combined looking something like this:
ids count
1 1 1
2 1 2
3 1 3
4 1 4
5 2 1
6 2 2
7 2 3
8 2 4
9 2 5
10 2 6
11 3 1
12 3 2
Any ideas? Many thanks!
I suggest ave with seq_along
ids <- c(rep(1,4),rep(2,6),rep(3,2))
count <- ave(ids,ids, FUN=seq_along)
cbind(ids, count)
# ids count
# [1,] 1 1
# [2,] 1 2
# [3,] 1 3
# [4,] 1 4
# [5,] 2 1
# [6,] 2 2
# [7,] 2 3
# [8,] 2 4
# [9,] 2 5
# [10,] 2 6
# [11,] 3 1
# [12,] 3 2
Or if it is ordered
cbind(ids, count=sequence(unname(table(ids))))
# ids count
# [1,] 1 1
# [2,] 1 2
# [3,] 1 3
# [4,] 1 4
# [5,] 2 1
# [6,] 2 2
# [7,] 2 3
# [8,] 2 4
# [9,] 2 5
# [10,] 2 6
# [11,] 3 1
# [12,] 3 2
Or
cbind(ids,within.list(rle(ids), lengths <- sequence(lengths))$lengths)
Or
library(data.table)
dt <- as.data.table(ids)
dt[,count:=seq_len(.N), by=ids]
Or
library(dplyr)
dat <- data.frame(ids)
dat %>%
  group_by(ids) %>%
  mutate(count = row_number())
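If your data.table version provides rowid() (it was added after these answers were written), the within-group counter can also be done without an explicit by=; a sketch:
library(data.table)
ids <- c(rep(1,4), rep(2,6), rep(3,2))
dt <- data.table(ids)
# rowid() numbers the occurrences of each distinct ids value in order
dt[, count := rowid(ids)]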

Removing duplicates on subset of columns in R

I have a table which is
[,1] [,2] [,3] [,4] [,5]
[1,] 1 5 10 0.00040803 0.00255277
[2,] 1 11 3 0.01765470 0.01584580
[3,] 1 6 2 0.15514850 0.15509000
[4,] 1 8 14 0.02100531 0.02572320
[5,] 1 9 4 0.04748648 0.00843252
[6,] 2 5 10 0.00040760 0.06782680
[7,] 2 11 3 0.01765480 0.01584580
[8,] 2 6 2 0.15514810 0.15509000
[9,] 2 8 14 0.02100491 0.02572320
[10,] 2 9 4 0.04748608 0.00843252
[11,] 3 5 10 0.00040760 0.06782680
[12,] 3 11 3 0.01765480 0.01584580
[13,] 3 8 14 0.02100391 0.02572320
[14,] 3 9 4 0.04748508 0.00843252
[15,] 4 5 10 0.00040760 0.06782680
[16,] 4 11 3 0.01765480 0.01584580
[17,] 4 8 14 0.02100391 0.02572320
[18,] 4 9 4 0.04748508 0.00843252
[19,] 5 8 14 0.02100391 0.02572320
[20,] 5 9 4 0.04748508 0.00843252
I want to remove duplicates from this table. However, only columns 2, 3 and 4 matter. Example: rows 1, 6, 11 and 15 are identical if only columns 2, 3 and 4 are considered. A note on column 4: is it possible to treat two values as the same as long as they are within 10e-5 of each other? Then rows 1 and 6 would be considered identical although the value in column 4 differs slightly (within the tolerance I mentioned).
Then it would be great to get an output like:
column 2 value | column 3 value | column 1 value at which the pair was first observed (with the tolerance; 1 in the example) | column 1 value at which the pair was last observed (with the tolerance; 4 in the example) | value of column 4 at the first appearance (0.00040803 in the example)
This is a way of thinking about it, but I'm not sure it's what you're looking for. The logic should be able to get you started though.
dat <- YOUR DATA SET
dat
V1 V2 V3 V4 V5
1 1 5 10 0.00040803 0.00255277
2 1 11 3 0.01765470 0.01584580
3 1 6 2 0.15514850 0.15509000
4 1 8 14 0.02100531 0.02572320
5 1 9 4 0.04748648 0.00843252
# TRUNCATED
dat <- dat[, c(2, 3, 4)]
dat$V4 <- round(dat$V4, 5)
unique(dat)
V2 V3 V4
1 5 10 0.00041
2 11 3 0.01765
3 6 2 0.15515
4 8 14 0.02101
5 9 4 0.04749
9 8 14 0.02100
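If you want to keep all five columns rather than just the key columns, a hedged variant filters the full table with duplicated() on a rounded key (dat_full is a stand-in name for your untrimmed data):
# dat_full: your original, untrimmed data (hypothetical name)
# build a key from the columns that matter, with V4 rounded to the tolerance
key <- data.frame(dat_full$V2, dat_full$V3, round(dat_full$V4, 5))
dat_full[!duplicated(key), ]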
You could do something like this:
# read your data
yy <- read.csv('your-data.csv', header=F)
## V1 V2 V3 V4 V5
## 1 1 5 10 0.00040803 0.00255277
## 2 1 11 3 0.01765470 0.01584580
## 3 1 6 2 0.15514850 0.15509000
## 4 1 8 14 0.02100531 0.02572320
# create a logical matrix indicating value is within tolerance
mat.eq.tol <- sapply(yy$V4, function(x) abs(yy$V4-x) < 1E-5)
# minimum index
eq.min <- apply(mat.eq.tol, 1, function(x) min(which(x)))
# maximum index
eq.max <- apply(mat.eq.tol, 1, function(x) max(which(x)))
# combine result
res <- cbind(yy$V2, yy$V3, yy$V1[eq.min], yy$V1[eq.max], yy$V4[eq.min])
## [,1] [,2] [,3] [,4] [,5]
## [1,] 5 10 1 4 0.00040803
## [2,] 11 3 1 4 0.01765470
## [3,] 6 2 1 2 0.15514850
## [4,] 8 14 1 5 0.02100531
## [5,] 9 4 1 5 0.04748648
## [6,] 5 10 1 4 0.00040803
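To collapse this to one row per (V2, V3) pair, as the desired output describes, one possible final step is:
# drop repeated (V2, V3) combinations, keeping the first row for each
res.unique <- res[!duplicated(res[, 1:2]), ]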

"Loop through" data.table to calculate conditional averages

I want to "loop through" the rows of a data.table and calculate an average for each row. The average should be calculated based on the following mechanism:
Look up the identifier ID in row i (ID(i))
Look up the value of T2 in row i (T2(i))
Calculate the average over the Data1 values in all rows j, which meet these two criteria: ID(j) = ID(i) and T1(j) = T2(i)
Enter the calculated average in the column Data2 of row i
DF = data.frame(ID = rep(c("a","b"), each = 6),
                T1 = rep(1:2, each = 3), T2 = c(1,2,3), Data1 = c(1:12))
DT = data.table(DF)
DT[ , Data2:=NA_real_]
ID T1 T2 Data1 Data2
[1,] a 1 1 1 NA
[2,] a 1 2 2 NA
[3,] a 1 3 3 NA
[4,] a 2 1 4 NA
[5,] a 2 2 5 NA
[6,] a 2 3 6 NA
[7,] b 1 1 7 NA
[8,] b 1 2 8 NA
[9,] b 1 3 9 NA
[10,] b 2 1 10 NA
[11,] b 2 2 11 NA
[12,] b 2 3 12 NA
For this simple example the result should look like this:
ID T1 T2 Data1 Data2
[1,] a 1 1 1 2
[2,] a 1 2 2 5
[3,] a 1 3 3 NA
[4,] a 2 1 4 2
[5,] a 2 2 5 5
[6,] a 2 3 6 NA
[7,] b 1 1 7 8
[8,] b 1 2 8 11
[9,] b 1 3 9 NA
[10,] b 2 1 10 8
[11,] b 2 2 11 11
[12,] b 2 3 12 NA
I think one way of doing this would be to loop through the rows, but I think that is inefficient. I've had a look at the apply() function, but I'm not sure it would solve my problem. I could also use data.frame instead of data.table if this would make it much more efficient or much easier. The real dataset contains approximately 1 million rows.
The rule of thumb is to aggregate first, and then join to that.
agg = DT[,mean(Data1),by=list(ID,T1)]
setkey(agg,ID,T1)
DT[, Data2 := {JT = J(ID, T2); agg[JT, V1][[3]]}]
ID T1 T2 Data1 Data2
[1,] a 1 1 1 2
[2,] a 1 2 2 5
[3,] a 1 3 3 NA
[4,] a 2 1 4 2
[5,] a 2 2 5 5
[6,] a 2 3 6 NA
[7,] b 1 1 7 8
[8,] b 1 2 8 11
[9,] b 1 3 9 NA
[10,] b 2 1 10 8
[11,] b 2 2 11 11
[12,] b 2 3 12 NA
As you can see it's a bit ugly in this case (but it will be fast). It's planned to add drop, which will avoid the [[3]] bit, and maybe a way to tell [.data.table to evaluate i in the calling scope (i.e. no self join), which would avoid the JT= bit that is needed here because ID is in both agg and DT.
keyby has been added to v1.8.0 on R-Forge so that avoids the need for the setkey, too.
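With more recent data.table versions the same aggregate-then-join can be written as an update join using on=, which avoids the setkey and the [[3]] indexing; a sketch (assuming data.table 1.9.6 or later):
agg <- DT[, .(Data2 = mean(Data1)), by = .(ID, T1)]
# match each row's (ID, T2) against agg's (ID, T1) and pull the group mean across
DT[agg, Data2 := i.Data2, on = .(ID, T2 = T1)]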
A somewhat faster alternative to iterating over rows would be a solution which employs vectorization.
R> d <- data.frame(ID=rep(c("a","b"),each=6), T1=rep(1:2,each=3), T2=c(1,2,3), Data1=c(1:12))
R> d
ID T1 T2 Data1
1 a 1 1 1
2 a 1 2 2
3 a 1 3 3
4 a 2 1 4
5 a 2 2 5
6 a 2 3 6
7 b 1 1 7
8 b 1 2 8
9 b 1 3 9
10 b 2 1 10
11 b 2 2 11
12 b 2 3 12
R> rowfunction <- function(i) with(d, mean(Data1[which(T1==T2[i] & ID==ID[i])]))
R> d$Data2 <- sapply(1:nrow(d), rowfunction)
R> d
ID T1 T2 Data1 Data2
1 a 1 1 1 2
2 a 1 2 2 5
3 a 1 3 3 NaN
4 a 2 1 4 2
5 a 2 2 5 5
6 a 2 3 6 NaN
7 b 1 1 7 8
8 b 1 2 8 11
9 b 1 3 9 NaN
10 b 2 1 10 8
11 b 2 2 11 11
12 b 2 3 12 NaN
Also, I'd prefer to preprocess the data before getting it into R. That is, if you are retrieving the data from an SQL server, it might be a better choice to let the server calculate the averages, as it will very likely do a better job of this.
R is actually not very good at this kind of number crunching, for several reasons, but it's excellent for doing statistics on data that has already been preprocessed.
Using tapply and part of another recent post:
DF = data.frame(ID=rep(c("a","b"),each=6), T1=rep(1:2,each=3), T2=c(1,2,3), Data1=c(1:12))
EDIT: Actually, most of the original function is redundant and was intended for something else. Here, simplified:
ansMat <- tapply(DF$Data1, DF[, c("ID", "T1")], mean)
i <- cbind(match(DF$ID, rownames(ansMat)), match(DF$T2, colnames(ansMat)))
DF<-cbind(DF,Data2 = ansMat[i])
# ansMat <- tapply(seq_len(nrow(DF)), DF[, c("ID", "T1")], function(x) {
#   curSub <- DF[x, ]
#   myIndex <- which(DF$T2 == curSub$T1 & DF$ID == curSub$ID)
#   meanData1 <- mean(curSub$Data1)
#   return(meanData1 = meanData1)
# })
The trick was doing tapply over ID and T1 instead of ID and T2. Anything speedier?

Create dataframe of all array indices in R

Using R, I'm trying to construct a dataframe of the row and col numbers of a given matrix. E.g., if
a <- matrix(c(1:15), nrow=5, ncol=3)
then I'm looking to construct a dataframe that gives:
row col
1 1
1 2
1 3
. .
5 1
5 2
5 3
What I've tried:
row <- matrix(row(a), ncol=1, nrow=dim(a)[1]*dim(a)[2], byrow=T)
col <- matrix(col(a), ncol=1, nrow=dim(a)[1]*dim(a)[2], byrow=T)
out <- cbind(row, col)
colnames(out) <- c("row", "col")
results in:
row col
[1,] 1 1
[2,] 2 1
[3,] 3 1
[4,] 4 1
[5,] 5 1
[6,] 1 2
[7,] 2 2
[8,] 3 2
[9,] 4 2
[10,] 5 2
[11,] 1 3
[12,] 2 3
[13,] 3 3
[14,] 4 3
[15,] 5 3
This isn't what I'm looking for, as the sequence of rows and cols is suddenly reversed, even though I specified "byrow=T". I don't see if and where I'm making a mistake, but would hugely appreciate suggestions to overcome this problem. Thanks in advance!
I'd use expand.grid on the vectors 1:ncol and 1:nrow, then flip the columns with [,2:1] to get them in the order you want:
> expand.grid(seq(ncol(a)),seq(nrow(a)))[,2:1]
Var2 Var1
1 1 1
2 1 2
3 1 3
4 2 1
5 2 2
6 2 3
7 3 1
8 3 2
9 3 3
10 4 1
11 4 2
12 4 3
13 5 1
14 5 2
15 5 3
Use row and col, but manipulate their output ordering directly, since they return the corresponding indices in place for the input array. Use t to get the non-default order you want in the end:
data.frame(row = as.vector(t(row(a))), col = as.vector(t(col(a))))
row col
1 1 1
2 1 2
3 1 3
4 2 1
5 2 2
6 2 3
7 3 1
8 3 2
9 3 3
10 4 1
11 4 2
12 4 3
13 5 1
14 5 2
15 5 3
Or, as a matrix not a data.frame:
cbind(as.vector(t(row(a))), as.vector(t(col(a))))
[,1] [,2]
[1,] 1 1
[2,] 1 2
[3,] 1 3
[4,] 2 1
[5,] 2 2
[6,] 2 3
[7,] 3 1
[8,] 3 2
[9,] 3 3
[10,] 4 1
[11,] 4 2
[12,] 4 3
[13,] 5 1
[14,] 5 2
[15,] 5 3
You may want to have a look at ?expand.grid, which does just about exactly what you want to achieve.
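For the record, a small sketch of that expand.grid route with the columns labelled as in the question:
a <- matrix(1:15, nrow = 5, ncol = 3)
# expand.grid varies its first argument fastest, so put col first,
# then reorder the columns to get rows in row-major order
out <- expand.grid(col = seq_len(ncol(a)), row = seq_len(nrow(a)))[, c("row", "col")]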
Since there are many ways to skin a cat, I'll chip in with yet another variant based on rep:
data.frame(row=rep(seq(nrow(a)), each=ncol(a)), col=rep(seq(ncol(a)), nrow(a)))
...but to announce a "winner", I think you need to time the solutions:
# Make up a huge matrix...
a <- matrix(runif(1e7), 1e4)
system.time( a1 <- data.frame(row = as.vector(t(row(a))),
                              col = as.vector(t(col(a)))) )        # 0.68 secs
system.time( a2 <- expand.grid(col = seq(ncol(a)),
                               row = seq(nrow(a)))[, 2:1] )        # 0.49 secs
system.time( a3 <- data.frame(row = rep(seq(nrow(a)), each = ncol(a)),
                              col = rep(seq(ncol(a)), nrow(a))) )  # 0.59 secs
identical(a1, a2) && identical(a1, a3) # TRUE
...so it seems @Spacedman has the speediest solution!
