R - joining more than 2^31 rows with data.table

I have an igraph network graph with 103,887 nodes and 4,795,466 ties.
This can be structured as an edgelist in a data.table with almost 9 million rows.
I can find the common neighbors in this network, following @chinsoon12's answer here. See the example below.
This works beautifully for smaller networks, but runs into problems in my use-case because the merge results in more than 2^31 rows.
Are there efficient alternatives on how to deal with this?
Can I split the data and do the computation in steps? The results will be used to query about common neighbors.
Example, modified from @chinsoon12's answer:
library(data.table)
library(igraph)
g <- random.graph.game(10, p=0.10)
adjSM <- as(get.adjacency(g), "dgTMatrix")
# @i and @j are the zero-based row/column index slots of the sparse matrix
adjDT <- data.table(V1=adjSM@i+1, V2=adjSM@j+1)
res <- adjDT[adjDT, nomatch=0, on="V2", allow.cartesian=TRUE
][V1 < i.V1, .(Neighbours=paste(V2, collapse=",")),
by=.(V1, i.V1)]
res
V1 i.V1 Neighbours
1: 4 5 8
2: 4 10 8
3: 5 10 8

If you just want to query the common neighbors, I don't suggest you build up a huge look-up table. Instead, you can use the following code to get the result for your query:
find_common_neighbors <- function(g, Vs) {
# a vertex is a common neighbor if it is at distance 1 from every vertex in Vs
which(colSums(distances(g, Vs) == 1) == length(Vs))
}
such that
> find_common_neighbors(g, c(4, 8))
> find_common_neighbors(g, c(4, 5))
[1] 8
If you need a look-up table, an alternative is to use Neighbours as the key to search its associated node, e.g.,
res <- transform(
data.frame(Neighbours = which(degree(g) >= 2)),
Nodes = sapply(
Neighbours,
function(x) toString(neighbors(g, x))
)
)
Previous Answer
I think you can use ego over g directly to generate res, e.g.,
res <- setNames(
data.frame(t(do.call(cbind, lapply(
Filter(function(x) length(x) > 2, ego(g, 1)),
function(x) {
rbind(combn(x[-1], 2), x[1])
}
)))),
c("V1", "V2", "Neighbours")
)
which gives
V1 V2 Neighbours
1 4 5 8
2 4 10 8
3 5 10 8

common neighbors
Can I split the data and do the computation in steps?
You can split by V1 to avoid running into the big-merge issue:
neighDT = adjDT[, if (.N > 1) {
cb = combn(V2, 2)
.(a = cb[1, ], b = cb[2, ])
}, by=.(neighbor = V1)]
which gives
neighbor a b
1: 8 4 5
2: 8 4 10
3: 8 5 10
(The OP found gRbase::combnPrim to be faster than combn here.)
How can we collapse all the common neighbors (separated with a comma) for the same combination into one observation?
neighDT_agg = neighDT[order(neighbor),
.(neighbors = toString(neighbor))
, keyby=.(a,b)]
The order ensures that the string is sorted alphabetically. The keyby ensures that the table is sorted by pairs {a,b} and facilitates a simple fast lookup for multiple pairs at once:
# single query
neighDT_agg[.(4,10), neighbors]
# [1] "8"
# multi query
pairs_queryDT = data.table(a = c(4,5,8), b = c(5,10,10))
neighDT_agg[pairs_queryDT, neighbors]
# [1] "8" "8" NA
I have an igraph network graph with 103,887 nodes and 4,795,466 ties.
Each call to combn will be making a 2-by-choose(.N, 2) matrix. If a node is connected to all other nodes, then it is a common neighbor to all pairs of other nodes and you'll be facing choose(103887-1, 2) of these pairs. I guess this is more an issue with the way the problem is defined than with the approach to solving it.
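To get a feel for the size before running anything, the number of pairs can be estimated from the node degrees alone. A minimal base-R sketch, assuming a small made-up edge list (`edges`, a hypothetical stand-in for adjDT that mirrors the example graph):

```r
# Hypothetical toy edge list, stored in both directions like adjDT.
edges <- data.frame(V1 = c(4, 5, 10, 8, 8, 8),
                    V2 = c(8, 8, 8, 4, 5, 10))

# Degree of each node = number of rows in which it appears as V1.
deg <- table(edges$V1)

# Each node with degree d contributes choose(d, 2) pairs of neighbors.
# as.numeric() avoids integer overflow on large graphs.
n_pairs <- sum(choose(as.numeric(deg), 2))
n_pairs

# Worst case from the text: one node connected to all 103,886 others.
worst_case <- choose(103887 - 1, 2)
worst_case > .Machine$integer.max
```

If `n_pairs` comes out near or above `.Machine$integer.max` for the real graph, the full table will not fit in a single data.table regardless of how the computation is split.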
The results will be used to query about common neighbors.
For the approach above, you'll need to compute the full neighbors table first.
If you just have a few ad hoc queries about intersecting neighbors:
find_neighbors <- function(a, b){
# stack the neighbors of a and b; a value seen twice is a shared neighbor
adjDT[.(c(a, b)), on=.(V1), V2[duplicated(V2)]]
}
find_neighbors(4, 10)
# [1] 8
This can similarly be wrapped in toString to collapse the values.
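For comparison, the same ad hoc query can be sketched in base R with intersect. This assumes a plain data frame `edges` (a made-up stand-in for adjDT) and a hypothetical helper name `common_neighbors`:

```r
# Hypothetical toy edge list, stored in both directions like adjDT.
edges <- data.frame(V1 = c(4, 5, 10, 8, 8, 8),
                    V2 = c(8, 8, 8, 4, 5, 10))

# Neighbors shared by nodes a and b: intersect their neighbor sets.
common_neighbors <- function(edges, a, b) {
  intersect(edges$V2[edges$V1 == a], edges$V2[edges$V1 == b])
}

# Collapse the result into a single comma-separated string.
collapsed <- toString(common_neighbors(edges, 4, 10))
collapsed
```

This scans the whole edge list on every call, so it is only reasonable for a handful of queries; the keyed data.table lookup above should scale better.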


R: Is there a way to get unique, closest matches with the rows in the same data.table based on multiple columns?

In R, I want to get unique, closest matches for the rows in a data.table which are identified by unique ids based on values in two columns. Here, I provide a toy example and the code I'm using to achieve this.
library(data.table)
dt <- data.table(id = letters,
value_1 = as.integer(runif(26,1,20)),
value_2 = as.integer(runif(26,1,10)))
pairs <- data.table()
while(nrow(dt) >= 2){
k <- dt[c(1)]
m <- dt[-1]
t <- m[k, roll = "nearest",on = .(value_1,value_2)]
pairs <- rbind(pairs,t)
dt <- dt[!dt$id %in% pairs$id & !dt$id %in% pairs$i.id]
}
pairs <- pairs[,-c(2,3)]
This gives me a data.table with the matched ids and the ones that do not get any matches.
id i.id
1 NA a
2 NA b
3 m c
4 v d
5 y e
6 i f
Is there a way to do this without the loop? I intend to implement this on a data.table with more than 20 million observations, and a loop is clearly far too slow at that scale. I was wondering if the roll join command can be run on a copy of the main data.table by introducing an exception condition, so as not to match the same ids with each other. Maybe something like this:
m <- dt
t <- m[dt, roll = "nearest",on = .(value_1,value_2)]
Without the exception, this command merely generates matches of ids with themselves. Also, this does not ensure unique matches.
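The self-matching behaviour is easy to reproduce on a tiny table. A minimal sketch, assuming made-up data in which every (value_1, value_2) pair is unique:

```r
library(data.table)

# Hypothetical toy data with unique (value_1, value_2) pairs.
dt <- data.table(id = c("a", "b", "c"),
                 value_1 = c(1L, 1L, 2L),
                 value_2 = c(3L, 7L, 5L))

# Self-join: value_1 must match exactly, value_2 rolls to the nearest value.
res <- dt[dt, roll = "nearest", on = .(value_1, value_2)]

# Each row's own value_2 is an exact match, so it is also the nearest one.
self_matched <- all(res$id == res$i.id)
self_matched
```

Note also that only the last column in `on` is rolled; the earlier columns are matched exactly.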

Shuffling string (non-randomly) for maximal difference

After trying for an embarrassingly long time and extensive searches online, I come to you with a problem.
I am looking for a method to (non-randomly) shuffle a string to get a string which has the maximal ‘distance’ from the original one, while still containing the same set of characters.
My particular case is for short nucleotide sequences (4-8 nt long), as represented by these example sequences:
For each sequence, I would like to get a scramble sequence which contains the same nucleobase count, but in a different order.
A favourable scramble sequence for seq_3 could be something like;
,where none of the sequence positions 1-7 has the same nucleobase, but the overall nucleobase count is the same (A =1, C = 2, G= 2, T=2). Naturally it would not always be possible to get a completely different string, but these I would just flag in the output.
I am not particularly interested in randomising the sequence and would prefer a method which makes these scramble sequences in a consistent manner.
Do you have any ideas?
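This is in Python, since I don't know R, but the basic solution is as follows:
import itertools

def calcDistance(originalString, newString):
    # Hamming distance: number of positions at which the two sequences differ
    d = 0
    i = 0
    while i < len(originalString):
        if originalString[i] != newString[i]: d = d + 1
        i = i + 1
    return d

s = "ACTG"
d_max = 0
s_final = ""
# try every permutation and keep the one farthest from the original
for combo in itertools.permutations(s):
    if calcDistance(s, combo) > d_max:
        d_max = calcDistance(s, combo)
        s_final = "".join(combo)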
Give this a try. Rather than return a single string that fits your criteria, I return a data frame of all strings sorted by their string-distance score. The score is calculated using stringdist(..., ..., method="hamming"), which counts the number of substitutions required to convert string A to B.
library(combinat)   # permn
library(stringdist) # stringdist
library(dplyr)      # %>%, distinct, arrange
myfun <- function(S) {
vec <- unlist(strsplit(S, "", fixed=TRUE))
P <- sapply(permn(vec), function(i) paste(i, collapse=""))
Dist <- c(stringdist(S, P, method="hamming"))
df <- data.frame(seq = P, HD = Dist) %>%
distinct(seq, HD) %>%
arrange(desc(HD))
df
}
head(myfun(seq_3), 10)
# seq HD
# 10 GATCCTG 7
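Enumerating every permutation grows factorially with sequence length, so here is a deterministic alternative that avoids enumeration. It is only a sketch, not part of the answers above: sort the characters, then rotate the sorted order by the size of the largest character group. The function name `scramble` and the test sequence are made up for illustration.

```r
scramble <- function(s) {
  chars <- strsplit(s, "", fixed = TRUE)[[1]]
  n <- length(chars)
  ord <- order(chars)          # positions grouped by character
  k <- max(table(chars))       # size of the largest group of equal characters
  # Rotate the grouped positions by k, so each position receives the
  # character that sits k places further along the sorted order.
  shifted <- ord[((seq_len(n) + k - 1) %% n) + 1]
  out <- character(n)
  out[ord] <- chars[shifted]
  list(seq = paste(out, collapse = ""),
       unchanged = sum(out == chars))   # positions that could not be changed
}

r <- scramble("ACGTGCT")   # hypothetical sequence: A = 1, C = 2, G = 2, T = 2
r$seq
r$unchanged
```

When no base makes up more than half of the sequence, every position changes; otherwise `unchanged` reports how many positions had to stay the same, which can serve as the flag asked for in the question. The result is the same on every run because nothing is randomised.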

R: Quickly Performing Operations on Subsets of a Data Frame, then Re-aggregating the Result Without an Inner Function

We have a very large data frame df that can be split by factors. On each subset of the data frame created by this split, we need to perform an operation to increase the number of rows of that subset until it's a certain length. Afterwards, we rbind the subsets to get a bigger version of df.
Is there a way of doing this quickly without using an inner function?
Let's say our subset operation (in a separate .R file) is:
foo <- function(df) { magic }
We've come up with a few ways of doing this:
# First approach: split, apply foo to each piece
df <- split(df, factor)
df <- lapply(df, foo)
# Second approach: let foo.list collect the pieces in the global environment
assign('list.df', list(), envir=.GlobalEnv)
assign('i', 1, envir=.GlobalEnv)
dplyr::group_by(df, factor)
dplyr::mutate(df, foo.list(df.col))
df <- rbindlist(list.df)
rm('list.df', envir=.GlobalEnv)
rm('i', envir=.GlobalEnv)
(In a separate file)
foo.list <- function(df.cols) {
list.df[[i]] <<- magic.df
i <<- i + 1
}
The issue with the first approach is time. The lapply simply takes too long to really be desirable (on the order of an hour with our data set).
The issue with the second approach is the extremely undesirable side-effect of tampering with the user's global environment. It's significantly faster, but this is something we'd rather avoid if we can.
We've also tried passing in the list and count variables and then trying to substitute them with the variables in the parent environment (A sort of hack to get around R's lack of pass-by-reference).
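One way to keep the accumulate-as-you-go pattern without touching .GlobalEnv is a closure, whose enclosing environment holds the list. A minimal sketch, with hypothetical names (`make_collector`, `add`, `collect`) and the built-in mtcars data standing in for the real data frame:

```r
make_collector <- function() {
  items <- list()                       # lives in this function's environment
  add <- function(x) {
    items[[length(items) + 1]] <<- x    # updates 'items' above, not .GlobalEnv
    invisible(NULL)
  }
  collect <- function() items
  list(add = add, collect = collect)
}

collector <- make_collector()

# Stand-in for the per-group operation: keep the first two rows of each group.
for (piece in split(mtcars, mtcars$cyl)) {
  collector$add(head(piece, 2))
}

result <- do.call(rbind, collector$collect())
nrow(result)
```

The `<<-` assignment finds `items` in the closure's environment, so nothing leaks into the user's workspace. Whether this is as fast as the global-environment version on the real data would need to be measured.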
We've looked at a number of possibly relevant SO questions (R applying a function to a subset of a data frame, Calculations on subsets of a data frame, R: Pass by reference, etc.) but none of them addressed our question well.
If you want to run code, here's something you can copy and paste:
x <- runif(n=10, min=0, max=3)
y <- sample(x=10, replace=FALSE)
factors <- runif(n=10, min=0, max=2)
factors <- floor(factors)
df <- data.frame(factors, x, y)
df now looks like this (length 10):
## We group by factor, then run foo on the groups.
foo <- function(df.subset) {
min <- min(df.subset$y)
max <- max(df.subset$y)
## We fill out df.subset to have everything between the min and
## max values of y. Then we assign the old values of df.subset
## to the corresponding spots.
df.fill <- data.frame(x=rep(0, max-min+1),
factors=rep(df.subset$factors[1], max-min+1))
df.fill$x[which(df.subset$y %in%(min:max))] <- df.subset$x
df.fill
}
So I can take my sample code in the first approach to build a new df (length 18):
Using data.table this doesn't take long due to speedy functionality. If you can, rewrite your function to work with specific variables. The split-apply-combine processing may get a performance boost:
df2 <- setDT(df)[,foo(df), factors]
# user system elapsed
# 1.63 0.39 2.03
Another variation using data.table: first build the min(y):max(y) part for each group, then join and update:
ans = setDT(df)[, .(x=0, y=min(y):max(y)), by=factors
][df, x := i.x, on=c("factors", "y")][]
# factors x y
# 1: 0 1.25104362 1
# 2: 0 0.16729068 2
# 3: 0 0.00000000 3
# 4: 0 0.02533907 4
# 5: 0 0.00000000 5
# 6: 0 0.00000000 6
# 7: 0 1.80547980 7
# 8: 1 0.34043937 3
# 9: 1 0.00000000 4
# 10: 1 1.51742163 5
# 11: 1 0.15709287 6
# 12: 1 0.00000000 7
# 13: 1 1.26282241 8
# 14: 1 2.88292354 9
# 15: 1 1.78573288 10
Pierre and Roland have already provided nice solutions.
If you need scalability in memory as well as in timing, you can spread the data across a number of remote R instances.
The most basic setup requires only Rserve/RSclient, so there are no non-CRAN dependencies.
Spread data across R instances
For easier reproducibility, the example below starts two R instances on a single localhost machine. For real scalability you need to start the Rserve nodes on remote machines.
# start R nodes
library(Rserve)
library(RSclient)
port = 6311:6312
invisible(sapply(port, function(port) Rserve(debug = FALSE, port = port, args = c("--no-save"))))
# populate data
x = runif(n=5e6,min=0, max=3)
y = sample(x=5e6,replace=FALSE)
factors = runif(n=5e6, min=0, max=2)
factors = floor(factors)
df = data.frame(factors, x, y)
# connect Rserve nodes
rscl = sapply(port, function(port) RS.connect(port = port))
# assign chunks to R nodes
sapply(seq_along(rscl), function(i) RS.assign(rscl[[i]], name = "x", value = df[df$factors == (i-1),]))
# assign magic function to R nodes
foo = function(df) df
sapply(rscl, RS.assign, name = "foo", value = foo)
All processing on the remote machines can be run in parallel (using wait=FALSE and RS.collect), which further reduces computing time.
Using lapply + RS.eval
# sequentially
l = lapply(rscl, RS.eval, foo(x))
# in parallel
invisible(sapply(rscl, RS.eval, foo(x), wait=FALSE))
l = lapply(rscl, RS.collect)
Using big.data.table::rscl.*
The big.data.table package provides a few wrappers around the RSclient::RS.* functions that allow them to accept a list of connections to R nodes.
They don't use data.table in any way, so they can be applied to a data.frame, a vector, or any R type that can be split into chunks. The example below uses a basic data.frame.
# sequentially
l = rscl.eval(rscl, foo(x), simplify=FALSE)
# in parallel
invisible(rscl.eval(rscl, foo(x), wait=FALSE))
l = rscl.collect(rscl, simplify=FALSE)
Using big.data.table
This example requires the data on the nodes to be stored as data.tables, but it gives a convenient API and a lot of other features.
rscl.require(rscl, "data.table")
rscl.eval(rscl, is.data.table(setDT(x))) # is.data.table to suppress collection of `setDT` results
bdt = big.data.table(rscl = rscl)
# in parallel by default
bdt[, foo(.SD), factors]
# considering we have data partitioned using `factors` field, the `by` is redundant in that case
bdt[, foo(.SD)]
# optionally use `[[` to access R nodes environment directly
bdt[[expr = foo(x)]]
Clean workspace
# disconnect
# shutdown nodes started from R
l = lapply(setNames(nm = port), function(port) tryCatch(RSconnect(port = port), error = function(e) e, warning = function(w) w))
invisible(lapply(l, function(rsc) if(inherits(rsc, "sockconn")) RSshutdown(rsc)))
I don't think your function works as intended. It relies on y being ordered.
Try using a data.table join with grouping:
df2 <- df[, .SD[data.table(y=seq(.SD[, min(y)], .SD[, max(y)], by = 1)), .SD,
on = "y"], #data.table join
by = factors] #grouping
df2[is.na(x), x:= 0]
setkey(df2, factors, y, x)
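The ordering problem noted above can also be seen, and avoided, in base R by placing each x at the position of its own y with match(). A minimal sketch on made-up data; `foo_fixed` is a hypothetical replacement for the question's foo:

```r
foo_fixed <- function(df.subset) {
  # Every integer y between the group's minimum and maximum.
  y_full <- seq(min(df.subset$y), max(df.subset$y))
  df.fill <- data.frame(factors = df.subset$factors[1], y = y_full, x = 0)
  # match() finds where each original y sits in y_full, so row order is irrelevant.
  df.fill$x[match(df.subset$y, y_full)] <- df.subset$x
  df.fill
}

# Hypothetical toy data with y deliberately unordered in group 0.
df <- data.frame(factors = c(0, 0, 1, 1),
                 x = c(1.5, 2.5, 0.5, 3.5),
                 y = c(4, 1, 2, 3))

res <- do.call(rbind, lapply(split(df, df$factors), foo_fixed))
res
```

In group 0 the value 2.5 lands at y = 1 and 1.5 at y = 4, with zeros in between, even though the input rows arrive in the opposite order.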

R: subsetting dataframe using elements from a vector

I have a data frame which includes a vector of individual identifiers (which are 6 letters) and vectors of numbers
I would like to subset it using a vector of elements (again 6-letters identifiers) taken from another dataframe
Here is what I did (in a simplified version, my dataframe has over 200 columns and 64 rows)
n = c(2, 3, 5, 7, 8, 1)
i = c("abazzz", "bbaxxx", "ccbeee","dddfre", "sdtyuo", "loatvz" )
c = c(10, 2, 10, 2, 12, 34)
df1 = data.frame(n, i, c)
This is the vector whose elements I want to use for subsetting:
v<- c("abazzz", "ccbeee", "lllaaa")
This is what I do to subset
df2<-example[, i==abazzz | ccbeee | lllaaa]
This does not work, the error I get is "abazzz" not found ( I tried with and without "", I tried using the command subset, same error appears)
Moreover I would like to avoid the or operator as the vector I need to use for subsetting has about 50 elements. So, in words, what I would like to do is to subset df2 in order to extract only those individuals who already appear in df1 using their identifiers (column in df1)
Writing this makes me think this must be very easy to do, but I can't figure it out by myself, I tried looking up similar questions but could not find what I was looking for. I hope someone can help me, suggest other posts or manuals so I can learn. Thanks!
Here's another nice option using data.table's binary search (for efficiency)
setkey(setDT(df1), i)[J(v), nomatch = 0]
# n i c
# 1: 2 abazzz 10
# 2: 5 ccbeee 10
Or if you don't want to reorder the data set and keep the syntax similar to base R, you could set a secondary index instead (contributed by @Arun). In current data.table versions the function is setindex; it was formerly called set2key.
setindex(setDT(df1), i)
df1[i %in% v]
Or dplyr (for simplicity)
df1 %>% filter(i %in% v)
# n i c
# 1: 2 abazzz 10
# 2: 5 ccbeee 10
As a side note: as mentioned in the comments, never use attach.
Instead of
df2<-df1[, i==abazzz | ccbeee | lllaaa]
use
df2 <- with(df1, df1[i=="abazzz" | i=="ccbeee" | i=="lllaaa", ])
or, more compactly,
with(df1, df1[i %in% v, ])
Both yield
# n i c
# 1 2 abazzz 10
# 3 5 ccbeee 10

apply a function over groups of columns

How can I use apply or a related function to create a new data frame that contains the results of the row averages of each pair of columns in a very large data frame?
I have an instrument that outputs n replicate measurements on a large number of samples, where each single measurement is a vector (all measurements are the same length vectors). I'd like to calculate the average (and other stats) on all replicate measurements of each sample. This means I need to group n consecutive columns together and do row-wise calculations.
For a simple example, with three replicate measurements on two samples, how can I end up with a data frame that has two columns (one per sample), one that is the average of each row of the replicates in dat$a, dat$b and dat$c and one that is the average of each row for dat$d, dat$e and dat$f?
Here's some example data
dat <- data.frame( a = rnorm(16), b = rnorm(16), c = rnorm(16), d = rnorm(16), e = rnorm(16), f = rnorm(16))
a b c d e f
1 -0.9089594 -0.8144765 0.872691548 0.4051094 -0.09705234 -1.5100709
2 0.7993102 0.3243804 0.394560355 0.6646588 0.91033497 2.2504104
3 0.2963102 -0.2911078 -0.243723116 1.0661698 -0.89747522 -0.8455833
4 -0.4311512 -0.5997466 -0.545381175 0.3495578 0.38359390 0.4999425
5 -0.4955802 1.8949285 -0.266580411 1.2773987 -0.79373386 -1.8664651
6 1.0957793 -0.3326867 -1.116623982 -0.8584253 0.83704172 1.8368212
7 -0.2529444 0.5792413 -0.001950741 0.2661068 1.17515099 0.4875377
8 1.2560402 0.1354533 1.440160168 -2.1295397 2.05025701 1.0377283
9 0.8123061 0.4453768 1.598246016 0.7146553 -1.09476532 0.0600665
10 0.1084029 -0.4934862 -0.584671816 -0.8096653 1.54466019 -1.8117459
11 -0.8152812 0.9494620 0.100909570 1.5944528 1.56724269 0.6839954
12 0.3130357 2.6245864 1.750448404 -0.7494403 1.06055267 1.0358267
13 1.1976817 -1.2110708 0.719397607 -0.2690107 0.83364274 -0.6895936
14 -2.1860098 -0.8488031 -0.302743475 -0.7348443 0.34302096 -0.8024803
15 0.2361756 0.6773727 1.279737692 0.8742478 -0.03064782 -0.4874172
16 -1.5634527 -0.8276335 0.753090683 2.0394865 0.79006103 0.5704210
I'm after something like this
X1 X2
1 -0.28358147 -0.40067128
2 0.50608365 1.27513471
3 -0.07950691 -0.22562957
4 -0.52542633 0.41103139
5 0.37758930 -0.46093340
6 -0.11784382 0.60514586
7 0.10811540 0.64293184
8 0.94388455 0.31948189
9 0.95197629 -0.10668118
10 -0.32325169 -0.35891702
11 0.07836345 1.28189698
12 1.56269017 0.44897971
13 0.23533617 -0.04165384
14 -1.11251880 -0.39810121
15 0.73109533 0.11872758
16 -0.54599850 1.13332286
which I did with this, but it is obviously no good for my much larger data frame...
data.frame(cbind(
apply(cbind(dat$a, dat$b, dat$c), 1, mean),
apply(cbind(dat$d, dat$e, dat$f), 1, mean)
))
I've tried apply and loops and can't quite get it together. My actual data has some hundreds of columns.
This may be more generalizable to your situation in that you pass a list of indices. If speed is an issue (large data frame) I'd opt for lapply with do.call rather than sapply:
x <- list(1:3, 4:6)
do.call(cbind, lapply(x, function(i) rowMeans(dat[, i])))
Works if you just have col names too:
x <- list(c('a','b','c'), c('d', 'e', 'f'))
do.call(cbind, lapply(x, function(i) rowMeans(dat[, i])))
Just happened to think maybe you want to automate this to do every three columns. I know there's a better way but here it is on a 100 column data set:
dat <- data.frame(matrix(rnorm(16*100), ncol=100))
n <- 1:ncol(dat)
ind <- matrix(c(n, rep(NA, 3 - ncol(dat)%%3)), byrow=TRUE, ncol=3)
ind <- data.frame(t(na.omit(ind)))
do.call(cbind, lapply(ind, function(i) rowMeans(dat[, i])))
Still not happy with the indexing. I think there's a better/faster way to pass the indexes. Here's a second, though still unsatisfying, method:
n <- 1:ncol(dat)
ind <- data.frame(matrix(c(n, rep(NA, 3 - ncol(dat)%%3)), byrow=F, nrow=3))
nonna <- sapply(ind, function(x) all(!is.na(x)))
ind <- ind[, nonna]
do.call(cbind, lapply(ind, function(i)rowMeans(dat[, i])))
A similar question was asked here by @david: averaging every 16 columns in r (now closed), which I answered by adapting @TylerRinker's answer above, following a suggestion by @joran and @Ben. Because the resulting function might be of help to OP or future readers, I am copying that function here, along with an example for OP's data.
# Function to apply 'fun' to object 'x' over every 'by' columns
# Alternatively, 'by' may be a vector of groups
byapply <- function(x, by, fun, ...)
{
# Create index list
if (length(by) == 1)
{
nc <- ncol(x)
split.index <- rep(1:ceiling(nc / by), each = by, length.out = nc)
} else # 'by' is a vector of groups
{
nc <- length(by)
split.index <- by
}
index.list <- split(seq(from = 1, to = nc), split.index)
# Pass index list to fun using sapply() and return object
sapply(index.list, function(i)
{
do.call(fun, list(x[, i], ...))
})
}
Then, to find the mean of the replicates:
byapply(dat, 3, rowMeans)
Or, perhaps the standard deviation of the replicates:
byapply(dat, 3, apply, 1, sd)
by can also be specified as a vector of groups:
byapply(dat, c(1,1,1,2,2,2), rowMeans)
mean for rows from vectors a,b,c
means for rows from vectors d,e,f
all in one call you get
if you only know the names of the columns and not the order then you can use:
#I dont know how much damage this does to speed but should still be quick
The rowMeans solution will be faster, but for completeness here's how you might do this with apply:
t(apply(dat,1,function(x){ c(mean(x[1:3]),mean(x[4:6])) }))
Inspired by @joran's suggestion I came up with this (actually a bit different from what he suggested, though the transposing suggestion was especially useful):
Make a data frame of example data with p cols to simulate a realistic data set (following @TylerRinker's answer above and unlike my poor example in the question)
p <- 99 # how many columns?
dat <- data.frame(matrix(rnorm(4*p), ncol = p))
Rename the columns in this data frame to create groups of n consecutive columns, so that if I'm interested in the groups of three columns I get column names like 1,1,1,2,2,2,3,3,3, etc or if I wanted groups of four columns it would be 1,1,1,1,2,2,2,2,3,3,3,3, etc. I'm going with three for now (I guess this is a kind of indexing for people like me who don't know much about indexing)
n <- 3 # how many consecutive columns in the groups of interest?
names(dat) <- rep(seq(1:(ncol(dat)/n)), each = n, len = (ncol(dat)))
Now use apply and tapply to get row means for each of the groups
dat.avs <- data.frame(t(apply(dat, 1, tapply, names(dat), mean)))
The main downsides are that the column names in the original data are replaced (though this could be overcome by putting the grouping numbers in a new row rather than the colnames) and that the column names are returned by the apply-tapply function in an unhelpful order.
Further to @joran's suggestion, here's a data.table solution:
p <- 99 # how many columns?
dat <- data.frame(matrix(rnorm(4*p), ncol = p))
dat.t <- data.frame(t(dat))
n <- 3 # how many consecutive columns in the groups of interest?
dat.t$groups <- as.character(rep(seq(1:(ncol(dat)/n)), each = n, len = (ncol(dat))))
DT <- data.table(dat.t)
setkey(DT, groups)
dat.av <- DT[, lapply(.SD,mean), by=groups]
Thanks everyone for your quick and patient efforts!
There is a beautifully simple solution if you are interested in applying a function to each unique combination of columns, which is a combinatorics problem.
combinations <- combn(colnames(df),2,function(x) rowMeans(df[x]))
To calculate statistics for every unique combination of three columns, etc., just change the 2 to a 3. Because rowMeans is vectorized, each combination is computed without an explicit row loop, which is usually faster than calling mean row by row through apply as in some of the answers above. If the order of the columns matters, you instead need a permutation algorithm that produces ordered sets, such as combinat::permn.
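The matrix returned by combn has no column names, so it helps to label each column with the pair it came from. A minimal sketch on a made-up three-column data frame:

```r
# Hypothetical toy data frame.
df <- data.frame(a = c(1, 2), b = c(3, 4), c = c(5, 6))

# All unordered pairs of column names, one pair per column of 'pairs'.
pairs <- combn(colnames(df), 2)

# Row means for each pair; the result has one column per pair.
res <- combn(colnames(df), 2, function(x) rowMeans(df[x]))

# Label the columns so each one can be traced back to its pair.
colnames(res) <- apply(pairs, 2, paste, collapse = "_")
res
```

Both combn calls generate the pairs in the same order, which is what makes the labels line up with the columns of `res`.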
