After creating a key on a data.table:
library(data.table)
set.seed(12345)
DT <- data.table(x = sample(LETTERS[1:3], 10, replace = TRUE),
                 y = sample(LETTERS[1:3], 10, replace = TRUE))
setkey(DT, x, y)
DT
# x y
# [1,] A B
# [2,] A B
# [3,] B B
# [4,] B B
# [5,] C A
# [6,] C A
# [7,] C A
# [8,] C A
# [9,] C C
# [10,] C C
I would like to get an integer vector giving for each row the corresponding "key index". I hope the expected output (column i) below will help clarify what I mean:
# x y i
# [1,] A B 1
# [2,] A B 1
# [3,] B B 2
# [4,] B B 2
# [5,] C A 3
# [6,] C A 3
# [7,] C A 3
# [8,] C A 3
# [9,] C C 4
# [10,] C C 4
I thought about using something like cumsum(!duplicated(DT[, key(DT), with = FALSE])), but I'm hoping there is a better solution. I suspect this vector may be part of the table's internal representation; is there a way to access it? Even if there isn't, what would you suggest?
Update: From v1.8.3, you can simply use the built-in special symbol .GRP:
DT[ , i := .GRP, by = key(DT)]
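Since the table is keyed, equal key values sit in contiguous runs, so newer data.table versions (assuming rleid() is available in yours) offer an equivalent one-liner:
DT[ , i := rleid(x, y)]   # run-length group id over the key columns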
See history for older answers.
I'd probably just do this, since I'm fairly confident that no index counter is available from within the call to [.data.table():
ii <- unique(DT)
ii[ , i := seq_len(nrow(ii))]
DT[ii]
# x y i
# 1: A B 1
# 2: A B 1
# 3: B B 2
# 4: B B 2
# 5: C A 3
# 6: C A 3
# 7: C A 3
# 8: C A 3
# 9: C C 4
# 10: C C 4
You could make this a one-liner, at the expense of an additional call to unique.data.table():
DT[unique(DT)[ , i := seq_len(nrow(unique(DT)))]]
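As a sanity check, the cumsum(!duplicated(...)) idea from the question yields the same vector, because keying sorts the rows so each new key value starts a new run:
i2 <- cumsum(!duplicated(DT, by = key(DT)))
identical(i2, DT$i)   # TRUE after the .GRP assignment above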
Is there a general function to build a matrix from smaller blocks, i.e. build matrix
A B
C D
from matrices A, B, C, D?
Of course there is the obvious way of creating an empty big matrix and filling it by sub-indexing, but isn't there anything simpler, easier, and possibly faster?
Here are some base R solutions. Maybe you can use
M <- rbind(cbind(A,B),cbind(C,D))
or
u <- list(list(A,B),list(C,D))
M <- do.call(rbind,Map(function(x) do.call(cbind,x),u))
Example
A <- matrix(1:4,nrow = 2)
B <- matrix(1:6,nrow = 2)
C <- matrix(1:6,ncol = 2)
D <- matrix(1:9,nrow = 3)
such that
> M
[,1] [,2] [,3] [,4] [,5]
[1,] 1 3 1 3 5
[2,] 2 4 2 4 6
[3,] 1 4 1 4 7
[4,] 2 5 2 5 8
[5,] 3 6 3 6 9
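If you assemble block matrices often, the Map/do.call pattern wraps up naturally in a small helper (the name blockMatrix is my own, not an existing function):
# build a block matrix from a list of rows, each row a list of blocks
blockMatrix <- function(rows) {
  do.call(rbind, lapply(rows, function(r) do.call(cbind, r)))
}
M2 <- blockMatrix(list(list(A, B), list(C, D)))
identical(M, M2)  # TRUE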
I have a matrix
mat_a <- matrix(data = c(c(rep(1, 3), rep(2, 3), rep(3, 3)),
                         rep(seq(1, 300, 100), 3),
                         runif(9, 0, 1)),
                ncol = 3)
[,1] [,2] [,3]
[1,] 1 1 0.8393401
[2,] 1 101 0.5486805
[3,] 1 201 0.4449259
[4,] 2 1 0.3949137
[5,] 2 101 0.4002575
[6,] 2 201 0.3288861
[7,] 3 1 0.7865035
[8,] 3 101 0.2581155
[9,] 3 201 0.8987769
that I compare to another, larger matrix:
mat_b <- matrix(data = c(c(rep(1, 3), rep(2, 3), rep(3, 3), rep(4, 3)),
                         rep(seq(1, 300, 100), 4),
                         rep(3:5, 4)),
                ncol = 3)
[,1] [,2] [,3]
[1,] 1 1 3
[2,] 1 101 4
[3,] 1 201 5
[4,] 2 1 3
[5,] 2 101 4
[6,] 2 201 5
[7,] 3 1 3
[8,] 3 101 4
[9,] 3 201 5
[10,] 4 1 3
[11,] 4 101 4
[12,] 4 201 5
I need to extract the rows of mat_a where column 2 of both matrices matches; for those matches, column 1 also has to match. In addition, column 3 of mat_b must be greater than or equal to 4.
I cannot find a vectorized solution; I only came up with a loop-based one.
output <- NULL
for (i in 1:nrow(mat_a)) {
  if (mat_a[i, 2] %in% mat_b[, 2][mat_b[, 3] >= 4]) {
    rows <- which(mat_b[, 2] %in% mat_a[i, 2])
    row  <- which(mat_b[, 1][rows] == mat_a[i, 1])
    if (mat_b[, 3][rows[row]] >= 4) {
      output <- rbind(output, mat_a[i, ])
    }
  }
}
This works but is extremely slow; it took just under an hour to run. mat_a has 9 columns and 40,000 rows (could go higher), and mat_b has 5 columns and around 1.2 million rows.
Any idea?
It is better to work with data frames when comparing tables like this; that plays to R's strengths instead of working against them. We use a simple merge to match on the key columns, after subsetting b with the necessary condition, b$V3 >= 4. At the end, [-4] drops the surplus column so the output more closely matches your desired output:
a <- as.data.frame(mat_a)
b <- as.data.frame(mat_b)
merge(a,b[b$V3 >= 4,], by=c("V1","V2"))[-4]
# V1 V2 V3.x
# 1 1 101 0.1118960
# 2 1 201 0.1543351
# 3 2 101 0.3950491
# 4 2 201 0.5688684
# 5 3 201 0.4749941
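If you prefer to stay with plain matrices, the same lookup can be vectorized by pasting the two key columns into a composite key. This is just a sketch; it assumes paste()'s default space separator cannot produce ambiguous keys with your values:
keep  <- mat_b[, 3] >= 4                        # condition on column 3 of mat_b
key_a <- paste(mat_a[, 1], mat_a[, 2])          # composite key from columns 1 and 2
key_b <- paste(mat_b[keep, 1], mat_b[keep, 2])
mat_a[key_a %in% key_b, , drop = FALSE]         # matching rows of mat_a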
How can I get all the combinations of a list, including duplicates? By duplicates I mean an element paired with itself. I am building a symmetric matrix.
names.list<-c("A","B","C")
as.data.frame(t(combn(names.list,2)))
Result is:
V1 V2
1 A B
2 A C
3 B C
When I want:
V1 V2
1 A A
2 A B
3 A C
4 B B
5 B C
6 C C
Or even:
V1 V2
1 A A
2 A B
3 A C
4 B A
5 B B
6 B C
7 C A
8 C B
9 C C
But my matrices are large, so I would like to keep the number of combinations to a minimum (hence, preferably the second result), since more combinations = more computations = longer run times.
Thanks.
It sounds like you're looking for expand.grid instead of combn:
expand.grid(names.list, names.list)
# Var1 Var2
# 1 A A
# 2 B A
# 3 C A
# 4 A B
# 5 B B
# 6 C B
# 7 A C
# 8 B C
# 9 C C
Update
There's also combinations from "gtools" which would give you your preferred output.
library(gtools)
combinations(3, 2, names.list, repeats.allowed = TRUE)
# [,1] [,2]
# [1,] "A" "A"
# [2,] "A" "B"
# [3,] "A" "C"
# [4,] "B" "B"
# [5,] "B" "C"
# [6,] "C" "C"
I have a data frame in R and want to create all possible unique subsets of it, where each subset consists of a unique pairwise combination of two columns from the original data frame. This means that if the number of columns in the original data frame is Y, the number of unique subsets I should get is Y*(Y-1)/2. I also want the column names in each subset to be the names used in the original data frame. How do I do it?
colpairs <- function(d) {
  # return one data frame per unordered pair of columns
  apply(combn(ncol(d), 2), 2, function(x) d[, x])
}
x <- colpairs(iris)
sapply(x, head, n=2)
## [[1]]
## Sepal.Length Sepal.Width
## 1 5.1 3.5
## 2 4.9 3.0
##
## [[2]]
## Sepal.Length Petal.Length
## 1 5.1 1.4
## 2 4.9 1.4
...
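As a quick check against the Y*(Y-1)/2 formula from the question: iris has 5 columns, so we expect 10 pairs.
length(x)  # 10, i.e. 5 * 4 / 2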
I'd use combn to make the indices of your columns, and lapply to take subsets of your data.frame and store them in a list structure. e.g.
# Example data
set.seed(1)
df <- data.frame(a = sample(2, 4, replace = TRUE),
                 b = runif(4),
                 c = sample(letters, 4),
                 d = sample(LETTERS, 4))
# Use combn to get indices
ind <- combn( x = 1:ncol(df) , m = 2 , simplify = FALSE )
# ind is a list of column-index pairs; with simplify = TRUE it would
# instead be this matrix (pairs in columns):
#      [,1] [,2] [,3] [,4] [,5] [,6]
# [1,]    1    1    1    2    2    3
# [2,]    2    3    4    3    4    4
# Make subsets, combine in list
out <- lapply( ind , function(x) df[,x] )
# [[1]]
#   a         b
# 1 1 0.2016819
# 2 1 0.8983897
# 3 2 0.9446753
# 4 2 0.6607978
#
# [[2]]
#   a c
# 1 1 q
# 2 1 b
# 3 2 e
# 4 2 x
#
# [[3]]
#   a d
# 1 1 R
# 2 1 J
# 3 2 S
# 4 2 L
#
# [[4]]
#           b c
# 1 0.2016819 q
# 2 0.8983897 b
# 3 0.9446753 e
# 4 0.6607978 x
#
# [[5]]
#           b d
# 1 0.2016819 R
# 2 0.8983897 J
# 3 0.9446753 S
# 4 0.6607978 L
#
# [[6]]
#   c d
# 1 q R
# 2 b J
# 3 e S
# 4 x L
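An optional refinement (my own suggestion, not part of the question): name each list element after its column pair, so subsets can be looked up by name.
names(out) <- sapply(ind, function(x) paste(names(df)[x], collapse = "_"))
out[["a_c"]]  # same as out[[2]]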
I am having a hard time trying to subset a data.table (the package) in R. Given the following example:
library(data.table)
x = c(rep("a", 6), rep("b", 5))
y = c(0,2,1,0,1,2, 0,1,0,2,1)
z = c(1:6,1:5) + rnorm(11, 0.02, 0.1)
DT = data.table(ind = x, cond = y, dist = z)
ind cond dist
[1,] a 0 1.078966
[2,] a 2 1.987159
[3,] a 1 3.143391
[4,] a 0 3.937058
[5,] a 1 5.037681
[6,] a 2 6.036432
[7,] b 0 1.057809
[8,] b 1 2.144755
[9,] b 0 3.010903
[10,] b 2 3.937765
[11,] b 1 4.976273
I want to subset everything after the first 1 in the cond column. In other words, everything with dist larger than 3.143391 for a and 2.144755 for b (in this example).
DT.sub <- DT[cond == 1, ]                    # Please combine this row
DT.sub[ , .SD[dist == min(dist)], by = ind]  # with this one to make the code shorter, if you can.
ind cond dist
[1,] a 1 3.143391
[2,] b 1 2.144755
The result should look like this:
ind cond dist
[1,] a 0 3.937058
[2,] a 1 5.037681
[3,] a 2 6.036432
[4,] b 0 3.010903
[5,] b 2 3.937765
[6,] b 1 4.976273
How about:
DT[,.SD[seq(match(1,cond)+1,.N)],by=ind]
ind cond dist
[1,] a 0 3.937058
[2,] a 1 5.037681
[3,] a 2 6.036432
[4,] b 0 3.010903
[5,] b 2 3.937765
[6,] b 1 4.976273
Btw, it's good to set.seed(1) first so we can work with the same random data.
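One caveat with seq(match(1,cond)+1,.N): if the first 1 happens to be a group's last row, seq() counts backwards and duplicates rows instead of returning nothing. Negative indexing sidesteps that (a sketch, assuming every group contains at least one 1; match() would return NA otherwise):
# drop everything up to and including the first 1 in each group;
# the group comes back empty if its first 1 is the last row
DT[ , .SD[-seq_len(match(1, cond))], by = ind]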