dynamic number of columns in a data table with unique random draws - r

Suppose I have the following matrix, for arbitrary J:
set.seed(1)
J=2
n = 100
BB = data.table(r=1:n)
BB[, (paste0("a",seq(J))) := rnorm(n,1,7) ]
So the output is...
> BB
r a1 a2
1: 1 -3.38517668 -3.38517668
2: 2 2.28550327 2.28550327
3: 3 -4.84940029 -4.84940029
...
How come the two columns are identical and not different rnorm draws?

The right-hand side of `:=` is evaluated once, and the same vector is then assigned to every column named on the left-hand side, which is why the columns are identical. To draw each column separately, you can use the super-fast for-set combination:
for(i in seq(J))
  set(x = BB, j = paste0('a', i), value = rnorm(n, 1, 7))
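If you prefer a single `:=` call, here is a minimal sketch (assuming the same BB, J and n as above): wrapping the draws in replicate(..., simplify = FALSE) produces a list of J independent vectors, so each column gets its own draws.
# replicate() returns a list of J separate rnorm vectors;
# assigning a list to several columns gives each column its own element
BB[, (paste0("a", seq(J))) := replicate(J, rnorm(n, 1, 7), simplify = FALSE)]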

Related

R remove duplicated elements for two sparse matrices

We have a sparse matrix using library Matrix:
library(Matrix)
M = sparseMatrix(i = uidx,j = midx,x = freq)
suppose the matrix M is like:
i j x
1 2 0.2
1 3 0.3
1 15 0.15
2 7 0.1
...
280 2 0.6
281 7 0.25
and after some calculation we got another sparse matrix Q
i j x
1 2 18
1 4 16
1 9 8
2 10 19
...
I want to use Q as the base matrix and remove from Q those (i,j) pairs that already exist in M,
something like a set minus:
Q-M
In my example this would produce a result like:
i j x
1 4 16
1 9 8
...
# we have (1, 2, 18) in the original Q, but index (1, 2) already exists in M (as 1 2 0.2), so that row is removed from Q.
Is there an efficient way or existing function to do this?
To reproduce this case you can run the following code:
library(Matrix)
M = sparseMatrix(i = c(1,1,1),j = c(2,3,15),x = c(0.2,0.3,0.15))
Q = sparseMatrix(i = c(1,1,1),j = c(2,4,9),x = c(18,16,8))
#result should produce a sparse matrix like:
#R = sparseMatrix(i = c(1,1),j = c(4,9),x = c(16,8))
You can get there by using the summary function from the Matrix package. This gives a full overview of the sparse matrix (while keeping it sparse). Based on this, you can compare the (i, j) indices of the two matrices directly and select the rows of Q to keep. I expanded the example a bit to check that other values are kept / removed as expected. The result matches what you expect from your R matrix.
library(Matrix)
M = sparseMatrix(i = c(1,1,1,1, 2, 2),
j = c(2,3,15, 16, 4, 8),
x = c(0.2,0.3,0.15, 0.16, 0.2, 0.08))
Q = sparseMatrix(i = c(1,1,1,1, 2),
j = c(2,4,9,16, 4),
x = c(18,16,8,50, 40))
#result should produce a sparse matrix like:
R = sparseMatrix(i = c(1,1),
j = c(4,9),
x = c(16, 8))
# creates a summary of the sparse matrices (summary is coming from Matrix)
summary_m <- summary(M)
summary_q <- summary(Q)
# which records to remove:
# all Q records whose (i, j) index pair also occurs in M.
# Matching on a pasted "i,j" key works even when the two summaries have a
# different number of rows (a row-by-row == comparison would error here).
remove <- which(paste(summary_q$i, summary_q$j) %in% paste(summary_m$i, summary_m$j))
# build summary sparse matrix from summary_q to keep all Q records that do not match M
q_left <- summary_q[-remove, ]
# build full sparse matrix
result <- sparseMatrix(i = q_left$i, j = q_left$j, x = q_left$x)
identical(result, R)
[1] TRUE
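A matrix-level alternative, as a minimal sketch: it assumes M and Q are built with identical dimensions (forced here via dims), zeroes out every Q entry whose position is non-zero in M, and then drops the explicit zeros with drop0() so the result stays sparse.
library(Matrix)
# common shape so the positions line up (an assumption of this sketch)
M <- sparseMatrix(i = c(1,1,1), j = c(2,3,15), x = c(0.2,0.3,0.15), dims = c(1,15))
Q <- sparseMatrix(i = c(1,1,1), j = c(2,4,9), x = c(18,16,8), dims = c(1,15))
Q[which(M != 0)] <- 0  # blank out positions already present in M
result <- drop0(Q)     # drop the explicit zeros, keeping the matrix sparse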

Find combinations of genome without any package

I want to find how many nucleotide combinations occur in a sequence. I mean, for 2-letter combinations: AA, AT, AG, AC, ... (16 such combinations); or for 3-letter combinations: ATG, ACG, ... (64 such combinations). I know how to do that with a package, and I will write it down here, but I want to create my own code to perform this.
The seqinr package is perfect for this job. This is the code I used:
install.packages('seqinr')
library(seqinr)
m = read.fasta(file='sequence.fasta')
mseq = m[[1]]
count(mseq,2) # gives how many binary combinations are found in the seq
count(mseq,3) # gives how many 3-elemented combinations are found in the seq
This is a slow way to do it; I am certain the Bioconductor packages do it faster.
# some practice data
mseq = paste(sample(c("A", "C", "G", "T"), 1000, rep=T), collapse="")
# define a function called count
count = function(mseq, n){
  # split the sequence into every possible (overlapping) sub-sequence of length n
  x = sapply(1:(nchar(mseq) - n + 1), function(i) substr(mseq, i, i + n - 1))
  # how many unique sub-sequences of length n are there?
  length(table(x))
}
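If you want the frequencies themselves rather than just the number of distinct n-mers, the table is already computed inside; a small variant (a sketch, with the hypothetical name count_table) returns it whole:
count_table = function(mseq, n){
  # same overlapping split as above, but return the full frequency table
  x = sapply(1:(nchar(mseq) - n + 1), function(i) substr(mseq, i, i + n - 1))
  table(x)
}
head(count_table(mseq, 2))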
Actually just checked and this is pretty much how they did it:
function (seq, wordsize, start = 0, by = 1, freq = FALSE, alphabet = s2c("acgt"),
          frame = start)
{
    if (!missing(frame))
        start = frame
    istarts <- seq(from = 1 + start, to = length(seq), by = by)
    oligos <- seq[istarts]
    oligos.levels <- levels(as.factor(words(wordsize, alphabet = alphabet)))
    if (wordsize >= 2) {
        for (i in 2:wordsize) {
            oligos <- paste(oligos, seq[istarts + i - 1], sep = "")
        }
    }
    counts <- table(factor(oligos, levels = oligos.levels))
    if (freq == TRUE)
        counts <- counts/sum(counts)
    return(counts)
}
If you want to find the source code of a function, use getAnywhere():
getAnywhere(count)
The simple thing to do is just something like this:
# Generate a test sequence
set.seed(1234)
testSeq <- paste(sample(LETTERS[1:3], 100, replace = T), collapse = "")
# Split string into chunks of size 2 and then count occurrences
testBigram <- substring(testSeq, seq(1, nchar(testSeq), 2), seq(2, nchar(testSeq), 2))
table(testBigram)
AA AB AC BA BB BC CA CB CC
10 10 14 3 3 2 2 5 1
Here is a way using a "function factory" (https://adv-r.hadley.nz/function-factories.html).
The 2-element and 3-element combinations are n-grams of size 2 and 3. So we make this n-gram function factory.
# Generate a function to create a function
ngram <- function(size) {
  function(myvector) {
    substring(myvector, seq(1, nchar(myvector), size), seq(size, nchar(myvector), size))
  }
}
# Assign the functions names (optional)
bigram <- ngram(2)
trigram <- ngram(3)
# 2 element combinations
table(bigram(testSeq))
AA AB AC BA BB BC CA CB CC
10 10 14 3 3 2 2 5 1
# count of 2 element combinations
length(unique(bigram(testSeq)))
[1] 9
# counting function
count <- function(mseq, n) length(unique(ngram(n)(mseq)))
count(testSeq, 2)
[1] 9
# and if we wanted to do this with 3 element combinations
table(trigram(testSeq))
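Note that these factory-made functions split the string into non-overlapping chunks. An overlapping variant (closer to what seqinr::count counts; a sketch, stepping the window by one character) would be:
ngram_overlap <- function(size) {
  function(myvector) {
    starts <- seq(1, nchar(myvector) - size + 1)
    # every window of `size` characters, advancing one character at a time
    substring(myvector, starts, starts + size - 1)
  }
}
table(ngram_overlap(2)(testSeq))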

R Counting the frequency of combinations of three digits efficiently

I have a data.frame where each ID has exactly 3 attributes. For simplification I put only 100 rows here, although my real dataset has around 1,000,000. There are around 50 different possible attributes, a mixture of numbers and characters.
library(dplyr)  # for %>% and as_tibble()
data <- data.frame(id = 1:100,
                   a1 = sample(letters, 100, replace = T),
                   a2 = sample(letters, 100, replace = T),
                   a3 = sample(letters, 100, replace = T),
                   stringsAsFactors = FALSE) %>%
  as_tibble()
I want to know which combinations are the most frequent (the order does not matter).
So the outcome is supposed to be something like this:
pattern | frequency
a,a,a | 10
A,b,c | 5
a,e,c | 4
... | ....
First I started to create a vector which contains all possible combinations:
possible_combinations <- combn(c(letters, LETTERS), 3) %>%
  t() %>%
  as_tibble() %>%
  unite("combination", sep = "") %>%  # unite() is from tidyr
  pull()
Then I wrote this nested loop to count the frequencies:
library(stringr)  # for str_count()
counter = 0
inner_counter = 0
combination_counter = vector(mode = "numeric", length = length(possible_combinations))
for (j in 1:length(possible_combinations)){
  for (i in 1:nrow(data)){
    # inner counter counts when an attribute of one ID is in one combination
    inner_counter = inner_counter + str_count(possible_combinations[j], data[[i,2]])
    inner_counter = inner_counter + str_count(possible_combinations[j], data[[i,3]])
    inner_counter = inner_counter + str_count(possible_combinations[j], data[[i,4]])
    # if all three attributes are in a combination, then the counter increases by one
    if(inner_counter == 3) {
      counter = counter + 1
    }
    inner_counter = 0
  }
  # combination_counter is a vector which saves the frequency with
  # which a combination occurred across all ids
  combination_counter[[j]] = counter
  counter = 0
}
I know this is really not very R-like, but I don't know how to do it differently. The runtime is bad even for my little toy example, and it's almost infeasible for my real data.
You could as well do this with base R:
table(apply(data[,2:4], 1, function(x) paste0(sort(x), collapse = ",")))
The problem that you are going to run into is dealing with the massive number of combinations. Even if you try to apply a simple solution of sorting each row, this will cost a lot of time for the number of rows you are dealing with.
Take the following example with the straightforward approach offered by @Lennyy:
set.seed(123)
n <- 1e7
data <- data.frame(id = 1:n,
                   a1 = sample(letters, n, replace = T),
                   a2 = sample(letters, n, replace = T),
                   a3 = sample(letters, n, replace = T),
                   stringsAsFactors = FALSE)
system.time(t2 <- table(apply(data[,2:4], 1, function(x) paste0(sort(x), collapse = ","))))
user system elapsed
373.281 1.695 375.445
That's a long time...
Here is the output for reference:
head(t2)
a,a,a a,a,b a,a,c a,a,d a,a,e a,a,f
603 1657 1620 1682 1759 1734
We need to somehow code each row quickly without worrying about which column a particular element came from. Additionally, we need to do this in a way that will guarantee uniqueness.
What about a hash table? We can easily do this with Rcpp.
#include <Rcpp.h>
#include <unordered_map>
using namespace Rcpp;
// [[Rcpp::plugins(cpp11)]]
// [[Rcpp::export]]
IntegerVector countCombos(IntegerMatrix myMat, int numAttr, CharacterVector myAttr) {
    unsigned long int numRows = myMat.nrow();
    unsigned long int numCols = myMat.ncol();
    std::unordered_map<std::string, int> mapOfVecs;

    for (std::size_t i = 0; i < numRows; ++i) {
        // frequency vector: one slot per attribute, counting occurrences in this row
        std::vector<int> testVec(numAttr, 0);
        for (std::size_t j = 0; j < numCols; ++j) {
            ++testVec[myMat(i, j) - 1];
        }
        // the raw bytes of the frequency vector form the hash key
        std::string myKey(testVec.begin(), testVec.end());
        auto it = mapOfVecs.find(myKey);
        if (it == mapOfVecs.end()) {
            mapOfVecs.insert({myKey, 1});
        } else {
            ++(it->second);
        }
    }

    std::size_t count = 0;
    IntegerVector out(mapOfVecs.size());
    CharacterVector myNames(mapOfVecs.size());

    for (const auto& elem : mapOfVecs) {
        std::size_t i = 0;
        // decode the key back into a readable name, e.g. "ddz"
        for (auto myChar : elem.first) {
            while (myChar) {
                myNames[count] += myAttr[i];
                --myChar;
            }
            ++i;
        }
        out[count++] = elem.second;
    }

    out.attr("names") = myNames;
    return out;
}
This offers a great efficiency gain over any of the other solutions posted:
myRows <- 1:nrow(data)
attrCount <- 26
matOfInts <- vapply(2:ncol(data), function(x) {
  match(data[, x], letters)
}, myRows, USE.NAMES = FALSE)
system.time(t <- countCombos(matOfInts, attrCount, letters))
user system elapsed
2.570 0.007 2.579
That's over 100 times faster!!!!
And here is the output:
head(t)
jkk ddd qvv ttu aaq ccd
1710 563 1672 1663 1731 1775
Testing equality (the output is in different order, so we must sort first):
identical(sort(unname(t)), as.integer(sort(unname(t2))))
[1] TRUE
Explanation
The countCombos function accepts a matrix of integers. This matrix represents the indices of elements of the unique attributes (in our example, this would be represented by letters).
As we are dealing with combinations with repetition, we can easily represent them as an indexing frequency vector.
The template vector is:
a  b  c  d  e  ...  y  z
|  |  |  |  |       |  |
v  v  v  v  v       v  v
(0, 0, 0, 0, 0, ... 0, 0)
And here is how certain combinations get mapped:
aaa -->> (3, rep(0, 25))
zdd -->> dzd -->> ddz -->> (0, 0, 0, 2, rep(0, 21), 1)
Once we have created our vector, we convert it to a string, so ddz becomes:
ddz -->> c(0, 0, 0, 2, rep(0, 21), 1) -->> `00020000000000000000000001`
And this is the key that is used in our hash.
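For illustration, the same key can be built at the R level (a sketch; key is a hypothetical helper, assuming the attributes are the 26 lowercase letters):
# tabulate() builds the 26-slot frequency vector; pasting it gives the hash key
key <- function(row) paste(tabulate(match(row, letters), nbins = 26), collapse = "")
key(c("z", "d", "d"))
#> [1] "00020000000000000000000001"
Note that the C++ code stores the counts as raw bytes rather than digit characters, but the idea is the same.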
If I've understood you correctly, the ordering of the attributes doesn't matter, so aba is the same as aab and baa. You also have 50 different attributes, and the other solutions seem to rely on typing these in manually.
The following code creates a column that is the concatenation of all attribute columns, sorted within each row to ignore the order of the attributes, and then calculates the count per group:
library(dplyr)
library(rlang)
cnames <- colnames(data)
cnames <- cnames[2:length(cnames)] # assuming the first column is the only non-attribute column;
                                   # remove any other non-attribute columns as necessary
# !!!syms(cnames) outputs the columns themselves rather than text, taken from here:
# https://stackoverflow.com/questions/44613279/dplyr-concat-columns-stored-in-variable-mutate-and-non-standard-evaluation?rq=1
data %>%
  rowwise() %>%
  # sort within each row so the attribute order is ignored
  mutate(comb = paste0(sort(c(!!!syms(cnames))), collapse = "")) %>%
  group_by(comb) %>%
  summarise(cnt = n())
You can use dplyr to do this efficiently. First use group_by to group variables a1, a2, and a3, then use summarize and n() to count frequencies:
set.seed(100)
N = 1e5
data <- data.frame(id = 1:N,
                   a1 = sample(letters[1:5], N, replace = T),
                   a2 = sample(letters[1:5], N, replace = T),
                   a3 = sample(letters[1:5], N, replace = T),
                   stringsAsFactors = FALSE)
data %>%
  group_by(a1, a2, a3) %>%
  summarize(count = n()) %>%
  arrange(count)
## A tibble: 125 x 4
## Groups: a1, a2 [25]
# a1 a2 a3 count
# <chr> <chr> <chr> <int>
# 1 b a d 735
# 2 c b d 741
# 3 a d e 747
# 4 d a e 754
# 5 d e e 754
# 6 d e c 756
# 7 e a d 756
# 8 d c d 757
# 9 c c c 758
#10 d a b 759
## ... with 115 more rows
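Note that group_by(a1, a2, a3) treats each ordering as a distinct group. If order should be ignored, as in the question, one sketch is to sort the attributes within each row before grouping:
# sort the three attributes within each row, then group as before
sorted <- t(apply(data[, c("a1", "a2", "a3")], 1, sort))
data %>%
  mutate(a1 = sorted[, 1], a2 = sorted[, 2], a3 = sorted[, 3]) %>%
  group_by(a1, a2, a3) %>%
  summarize(count = n()) %>%
  arrange(desc(count))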

Extract the combinations of cells without repeating the index

I am trying to calculate the combinations of elements of a matrix where each element should appear only once.
The (real) matrix is symmetric and can have more than 5 elements (up to ~2000):
o <- matrix(runif(25), ncol = 5, nrow = 5)
dimnames(o) <- list(LETTERS[1:5], LETTERS[1:5])
# A B C D E
# A 0.4400317 0.1715681 0.7319108946 0.3994685 0.4466997
# B 0.5190471 0.1666164 0.3430245044 0.3837903 0.9322599
# C 0.3249180 0.6122229 0.6312876740 0.8017402 0.0141673
# D 0.1641411 0.1581701 0.0001703419 0.7379847 0.8347536
# E 0.4853255 0.5865909 0.6096330935 0.8749807 0.7230507
I want to calculate the product of all the combinations of pairs (if possible, all elements should appear: AB, CD, EF if the matrix has 6 elements), where for each pair one letter is the column and the other is the row. Here are some valid combinations:
AB, CD, E
AC, BD, E
AD, BC, E
AE, BC, D
AE, BD, C
The value of the single unpaired element is taken as 1.
Combinations not desired:
AB, BC: Element B appears twice
AB, AC: Element A appears twice
Things I tried:
I thought about removing the unwanted part of the matrix:
out <- which(upper.tri(o), arr.ind = TRUE)
out <- cbind.data.frame(out, value = o[upper.tri(o)])
out[, 1] <- colnames(o)[out[, 1]]
out[, 2] <- colnames(o)[out[, 2]]
# row col value
# 1 A B 0.1715681
# 2 A C 0.7319109
# 3 B C 0.3430245
# 4 A D 0.3994685
# 5 B D 0.3837903
# 6 C D 0.8017402
# 7 A E 0.4466997
# 8 B E 0.9322599
# 9 C E 0.0141673
# 10 D E 0.8347536
My attempt involves the following process:
Make a copy of the matrix (out)
Store first value of the first row.
Remove all the pairs that involve either element of that pair.
Select the next pair of the resulting matrix
Repeat until all rows are removed from the matrix
Repeat 2:5 starting from a different row
However, this method has one big problem: it doesn't guarantee that all the combinations are stored, and it could store the same combination several times.
My expected output is a vector, where each element is the product of the values in the cells selected by the combination:
AB, CD: 0.137553
How can I extract all those combinations efficiently?
This might work. I tested this on N elements = 5 and 6.
Note that this is not optimised, and hopefully it can provide a framework for you to work from. With a much larger array, I can see the steps involving apply and combn becoming a bottleneck.
The idea here is to generate a collection of unique sets first before calculating the product of the sets from another data.frame that stores values of sets.
Unique sets are identified by counting the number of unique elements in all combination pairs. For example, if N elements = 6, we expect length(unique(unlist(combination))) == 6. The same holds if N elements = 7 (there will only be 3 pairs plus a remainder element). In cases where N elements is odd, we can ignore the remaining, unpaired element, since it is constrained by the other elements.
library(dplyr)
library(reshape2)
## some functions
unique_by_n <- function(inlist, N){
  ## select unique combinations by count
  ## (if unique, expect n = 6 when n elements = 6)
  if(N %% 2) N <- N - 1 ## for odd numbers
  return(length(unique(unlist(inlist))) == N)
}
get_combs <- function(x, xall){
  ## format and catch the remainder if the matrix has an odd number of elements
  xu <- unlist(x)
  remainder <- setdiff(xall, xu) ## catch remainder if any
  xset <- unlist(lapply(x, paste0, collapse=''))
  finalset <- c(xset, remainder)
  return(finalset)
}
## make dataset
set.seed(0) ## set reproducible example
#o <- matrix(runif(25), ncol = 5, nrow = 5) ## uncomment to test 5
#dimnames(o) <- list(LETTERS[1:5], LETTERS[1:5])
o <- matrix(runif(36), ncol = 6, nrow = 6)
dimnames(o) <- list(LETTERS[1:6], LETTERS[1:6])
o[lower.tri(o)] <- t(o)[lower.tri(o)] ## make matrix symmetric
n_elements = nrow(o)
#### get matrix
dat <- melt(o, varnames = c('Rw', 'Cl'), as.is = TRUE)
dat$Set <- apply(dat, 1, function(x) paste0(sort(unique(x[1:2])), collapse = ''))
## get unique sets (since your matrix is symmetric)
dat <- subset(dat, !duplicated(Set))
#### get sets
elements <- rownames(o)
allpairs <- expand.grid(Rw = elements, Cl = elements) %>%
  filter(Rw != Cl) ## get all pairs
uniqpairsgrid <- unique(t(apply(allpairs,1,sort)))
uniqpairs <- split(uniqpairsgrid, seq(nrow(uniqpairsgrid))) ## get unique pairs
allpaircombs <- combn(uniqpairs,floor(n_elements/2)) ## get combinations of pairs
uniqcombs <- allpaircombs[,apply(allpaircombs, 2, unique_by_n, N = n_elements)] ## remove pairs with repeats
finalcombs <- apply(uniqcombs, 2, get_combs, xall=elements)
#### calculate results
res <- apply(finalcombs, 2, function(x) prod(subset(dat, Set %in% x)$value)) ## calculate product
names(res) <- apply(finalcombs, 2, paste0, collapse=',') ## add names
resdf <- data.frame(Sets = names(res), Products = res, stringsAsFactors = FALSE, row.names = NULL)
print(resdf)
#> Sets Products
#> 1 AB,CD,EF 0.130063454
#> 2 AB,CE,DF 0.171200062
#> 3 AB,CF,DE 0.007212619
#> 4 AC,BD,EF 0.012494787
#> 5 AC,BE,DF 0.023285088
#> 6 AC,BF,DE 0.001139712
#> 7 AD,BC,EF 0.126900247
#> 8 AD,BE,CF 0.158919605
#> 9 AD,BF,CE 0.184631344
#> 10 AE,BC,DF 0.042572488
#> 11 AE,BD,CF 0.028608495
#> 12 AE,BF,CD 0.047056905
#> 13 AF,BC,DE 0.003131029
#> 14 AF,BD,CE 0.049941770
#> 15 AF,BE,CD 0.070707311
Created on 2018-07-23 by the [reprex package](http://reprex.tidyverse.org) (v0.2.0.9000).
Maybe the following does what you want.
Note that I was more interested in being right than in performance.
Also, I have set the RNG seed, to have reproducible results.
set.seed(9840) # Make reproducible results
o <- matrix(runif(25), ncol = 5, nrow = 5)
dimnames(o) <- list(LETTERS[1:5], LETTERS[1:5])
cmb <- combn(LETTERS[1:5], 2)
n <- ncol(cmb)
res <- NULL
nms <- NULL
for(i in seq_len(n)){
  for(j in seq_len(n)[-seq_len(i)]){
    x <- unique(c(cmb[, i], cmb[, j]))
    if(length(x) == 4){
      res <- c(res, o[cmb[1, i], cmb[2, i]] * o[cmb[1, j], cmb[2, j]])
      nms <- c(nms, paste0(cmb[1, i], cmb[2, i], '*', cmb[1, j], cmb[2, j]))
    }
  }
}
names(res) <- nms
res
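A more compact sketch of the same idea, using combn over the pair indices and keeping only disjoint pairs of pairs (this reuses cmb, n and o from above):
# all ways to choose 2 of the n pairs, kept only when the 4 letters are distinct
pair_idx <- combn(n, 2, simplify = FALSE)
keep <- Filter(function(ij) length(unique(c(cmb[, ij[1]], cmb[, ij[2]]))) == 4, pair_idx)
res2 <- sapply(keep, function(ij)
  o[cmb[1, ij[1]], cmb[2, ij[1]]] * o[cmb[1, ij[2]], cmb[2, ij[2]]])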

Integer partitions of size K

Given a vector v of F non-negative integers, I want to create, one by one, all possible sets of K vectors with size F whose sum is v. I call C the matrix of these K vectors; the row sum of C gives v.
For instance, the vector (1,2) of size F=2, if we set K=2, can be decomposed into:
# all sets of K vectors such that their sum is (1,2)
C_1 = 1,0   C_2 = 1,0   C_3 = 1,0   C_4 = 0,1   C_5 = 0,1   C_6 = 0,1
      2,0         1,1         0,2         2,0         1,1         0,2
The goal is to apply some function to each possible C. Currently, I use this code, where I pre-compute all possible C and then go through them.
library(partitions)
K <- 3
F <- 5
v <- 1:F
partitions <- list()
for(f in 1:F){
  partitions[[f]] <- compositions(n = v[f], m = K)
}
# Each v[f] has multiple partitions. Now we create an index to consider
# all possible combinations of partitions for the whole vector v.
npartitions <- sapply(partitions, ncol)
indices <- lapply(npartitions, function(x) 1:x)
grid <- as.matrix(do.call(expand.grid, indices)) # breaks if too big
for(n in 1:nrow(grid)){
  selected <- c(grid[n,])
  C <- t(sapply(1:F, function(f) partitions[[f]][,selected[f]]))
  # Do something with C
  # ...
  print(C)
}
However, when the dimensions are too big, i.e. when F or K are large, the number of combinations explodes and expand.grid can't deal with it.
I know that, for a given entry v[f], I can create one composition at a time:
partition <- firstcomposition(n=v[f],m=K)
nextcomposition(partition, v[f],m=K)
But how can I use this to generate all possible C as in the above code?
npartitions <- ......
indices <- lapply(npartitions, function(x) 1:x)
grid <- as.matrix(do.call(expand.grid, indices))
You can avoid the generation of grid and successively generate its rows thanks to a Cantor expansion.
Here is the function returning the Cantor expansion of the integer n:
aryExpansion <- function(n, sizes){
  l <- c(1, cumprod(sizes))
  nmax <- tail(l, 1) - 1
  if(n > nmax){
    stop(sprintf("n cannot exceed %d", nmax))
  }
  epsilon <- numeric(length(sizes))
  while(n > 0){
    k <- which.min(l <= n)
    e <- floor(n / l[k-1])
    epsilon[k-1] <- e
    n <- n - e * l[k-1]
  }
  return(epsilon)
}
For example:
expand.grid(1:2, 1:3)
## Var1 Var2
## 1 1 1
## 2 2 1
## 3 1 2
## 4 2 2
## 5 1 3
## 6 2 3
aryExpansion(0, sizes = c(2,3)) + 1
## [1] 1 1
aryExpansion(1, sizes = c(2,3)) + 1
## [1] 2 1
aryExpansion(2, sizes = c(2,3)) + 1
## [1] 1 2
aryExpansion(3, sizes = c(2,3)) + 1
## [1] 2 2
aryExpansion(4, sizes = c(2,3)) + 1
## [1] 1 3
aryExpansion(5, sizes = c(2,3)) + 1
## [1] 2 3
So, instead of generating grid:
npartitions <- ......
indices <- lapply(npartitions, function(x) 1:x)
grid <- as.matrix(do.call(expand.grid, indices))
for(n in 1:nrow(grid)){
  selected <- grid[n,]
  ......
}
you can do:
npartitions <- ......
for(n in seq_len(prod(npartitions))){
  selected <- 1 + aryExpansion(n-1, sizes = npartitions)
  ......
}
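Putting it together with the original example, a minimal sketch (it combines the question's setup with aryExpansion; the compositions are still pre-computed per entry of v, only the grid is avoided):
library(partitions)
K <- 3
F <- 5
v <- 1:F
partitions <- lapply(v, compositions, m = K)
npartitions <- sapply(partitions, ncol)
# enumerate the rows of the implicit grid one at a time
for(n in seq_len(prod(npartitions))){
  selected <- 1 + aryExpansion(n - 1, sizes = npartitions)
  C <- t(sapply(1:F, function(f) partitions[[f]][, selected[f]]))
  # Do something with C
}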
