Efficiently preserving array dimensions in R

The R code below fills an array of specified dimensions with positive integers sampled at random according to a probability vector.
subset.haps <- NULL
haps <- 1:4
num.specs <- 100
probs <- rep(1/4, 4)
perms <- 10000
K <- 1
gen.perms <- function() {
  if (is.null(subset.haps)) {
    sample(haps, size = num.specs, replace = TRUE, prob = probs)
  } else {
    resample <- function(x, ...) x[sample.int(length(x), ...)]
    resample(subset.haps, size = num.specs, replace = TRUE, prob = probs[subset.haps])
  }
}
pop <- array(dim = c(perms, num.specs, K))
for (i in 1:K) {
  pop[, , i] <- replicate(perms, gen.perms())
}
However, profiling the above code suggests that improvements can be made.
The for loop can be eliminated using rep():
rep(replicate(perms, gen.perms()), K)
However, this returns a plain vector rather than an array, so the dimensions are lost.
Wrapping the call in as.array() does return an array, but only a one-dimensional one, not the perms x num.specs x K structure I need.
My question
How can I ensure the array structure (i.e., the dimensions) is preserved?

You just make pop then set its dim attribute afterwards:
pop <- rep(replicate(perms, gen.perms()), K)
dim(pop) <- c(perms, num.specs, K)
And to prove it:
class(pop)
# [1] "array"
dim(pop)
# [1] 10000 100 1
pop[2020,23,1]
# [1] 2
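A small-scale sketch of the same idea may help. The toy sizes (perms = 3, num.specs = 5, K = 2) and the names pop1 and pop2 are chosen here purely for illustration. It also shows one caveat: rep() copies the same draws into every slice, whereas the original loop draws fresh samples for each i. The array() variant is one possible way to keep the slices independent, assuming that is what is wanted.

```r
set.seed(1)
haps <- 1:4
num.specs <- 5   # toy size, much smaller than in the question
probs <- rep(1/4, 4)
perms <- 3       # toy size
K <- 2           # toy size

gen.perms <- function() sample(haps, size = num.specs, replace = TRUE, prob = probs)

# Option 1 (as in the answer): build the data, then attach the dimensions
pop1 <- rep(replicate(perms, gen.perms()), K)
dim(pop1) <- c(perms, num.specs, K)

# Option 2 (assumption: independent draws per slice are wanted):
# replicate perms * K times and let array() set the dimensions in one step
pop2 <- array(replicate(perms * K, gen.perms()), dim = c(perms, num.specs, K))

dim(pop1)
dim(pop2)
identical(pop1[, , 1], pop1[, , 2])  # TRUE: rep() repeats the same draws in each slice
```

Both objects are three-dimensional arrays; they differ only in whether the K slices share the same random draws.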

Related

Workaround for error when creating a correlation matrix of a huge matrix? [duplicate]

I have a matrix of size 55422 x 888, and so the rcorr function is producing this error:
M1<-matrix(rnorm(36),nrow=55422, ncol=888)
cor <- rcorr(t(M1), type = "pearson")
Error in double(p * p) : vector size cannot be NA
In addition: Warning message:
In p * p : NAs produced by integer overflow
Is there anything I can do to solve this?
This isn't a complete/working solution, but will give you an idea of some of the issues.
Your correlation matrix will contain n*(n-1)/2 = 1535771331 unique elements. If each correlation takes one microsecond to compute, computing the correlation matrix will take (n^2-n)/2/(1e6*3600) = 0.43 hours and require (n^2-n)/2*8/(2^30) = 11.4 GB of storage. These requirements are not impossible if you have a lot of RAM and time ...
In fact it's a little bit worse than this, since rcorr returns its results as a symmetric matrix (i.e., not taking advantage of the symmetry), and returns the n and P matrices as well, so the storage requirement will be approximately 5 times as great as stated above (double for the full matrix, x 2.5 because we have two double-precision and one integer matrix).
Getting to your specific question, the section on long vectors in the R internals manual discusses the maximum sizes of objects in R. The 'standard' limitation is that the total number of elements of the matrix should be less than 2^31 ((n^2-n)/2/(2^31-1) = 0.72), but the redundancy in the matrix gets you in trouble (as would the storage of the correlation, p-values, and the sample sizes).
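The figures above can be reproduced with a few lines of R. This is only a back-of-the-envelope sketch: the rate of one million correlations per second is an assumed figure, not a measurement, and the variable names are made up for this illustration. The last lines show why rcorr reports an integer overflow in p * p.

```r
n <- 55422                        # number of variables being correlated

pairs <- n * (n - 1) / 2          # unique off-diagonal correlations
hours <- pairs / (1e6 * 3600)     # assumed rate: 1e6 correlations per second
gib   <- pairs * 8 / 2^30         # 8 bytes per double-precision value

full  <- n^2                      # elements in the full (redundant) n x n matrix

pairs                             # 1535771331
pairs / .Machine$integer.max      # below 1: the unique values alone would fit
full > .Machine$integer.max       # TRUE: the full matrix exceeds 2^31 - 1 elements

# With integer arithmetic the product overflows to NA, which is what
# rcorr runs into (the warning is suppressed here to keep the output clean)
overflow <- suppressWarnings(55422L * 55422L)
is.na(overflow)                   # TRUE
```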
If you still want to go ahead, here is an implementation by A.N. Spiess, copied from here, that breaks the problem into blocks and stores the results in a disk-backed array (i.e., not in RAM). This won't get you the p-values (and it's still not clear what you're going to do with all those values ...), but it works at least up to 40,000 columns (takes about a minute).
However, it fails on your actual problem size (888 x 55422) with a too-large length. I'd have to look more closely to see whether there is a limitation here that we can get around ... It seems that we are actually still limited by the matrix dimensions (the maximum dimension of a square matrix is sqrt(2^31-1), approx. 46341). With more work, we could still do the block-diagonal thing and split this into several components ...
bigcor <- function(
  x,
  y = NULL,
  fun = c("cor", "cov"),
  size = 2000,
  verbose = TRUE,
  ...)
{
  if (!require("ff")) stop("please install the ff package")
  fun <- match.arg(fun)
  if (fun == "cor") FUN <- cor else FUN <- cov
  if (fun == "cor") STR <- "Correlation" else STR <- "Covariance"
  if (!is.null(y) && NROW(x) != NROW(y)) stop("'x' and 'y' must have compatible dimensions!")

  NCOL <- ncol(x)
  if (!is.null(y)) YCOL <- NCOL(y)

  ## calculate remainder, largest 'size'-divisible integer and block size
  REST <- NCOL %% size
  LARGE <- NCOL - REST
  NBLOCKS <- NCOL %/% size

  ## preallocate square matrix of dimension
  ## ncol(x) in 'ff' double format
  if (is.null(y)) {
    resMAT <- ff(vmode = "double", dim = c(NCOL, NCOL))
  } else {
    resMAT <- ff(vmode = "double", dim = c(NCOL, YCOL))
  }

  ## split column numbers into 'nblocks' groups + remaining block
  GROUP <- rep(1:NBLOCKS, each = size)
  if (REST > 0) GROUP <- c(GROUP, rep(NBLOCKS + 1, REST))
  SPLIT <- split(1:NCOL, GROUP)

  ## create all unique combinations of blocks
  COMBS <- expand.grid(1:length(SPLIT), 1:length(SPLIT))
  COMBS <- t(apply(COMBS, 1, sort))
  COMBS <- unique(COMBS)
  if (!is.null(y)) COMBS <- cbind(1:length(SPLIT), rep(1, length(SPLIT)))

  ## initiate time counter
  timeINIT <- proc.time()

  ## iterate through each block combination, calculate correlation matrix
  ## between blocks and store them in the preallocated matrix on both
  ## symmetric sides of the diagonal
  for (i in 1:nrow(COMBS)) {
    COMB <- COMBS[i, ]
    G1 <- SPLIT[[COMB[1]]]
    G2 <- SPLIT[[COMB[2]]]

    ## if y = NULL
    if (is.null(y)) {
      if (verbose) cat(sprintf("#%d: %s of Block %s and Block %s (%s x %s) ... ", i, STR, COMB[1],
                               COMB[2], length(G1), length(G2)))
      RES <- FUN(x[, G1], x[, G2], ...)
      resMAT[G1, G2] <- RES
      resMAT[G2, G1] <- t(RES)
    } else { ## if y = smaller matrix or vector
      if (verbose) cat(sprintf("#%d: %s of Block %s and 'y' (%s x %s) ... ", i, STR, COMB[1],
                               length(G1), YCOL))
      RES <- FUN(x[, G1], y, ...)
      resMAT[G1, ] <- RES
    }

    if (verbose) {
      timeNOW <- proc.time() - timeINIT
      cat(timeNOW[3], "s\n")
    }

    gc()
  }

  return(resMAT)
}

## call at the problem size from the question, run after bigcor() is defined
## (this is the size that fails with the too-large length)
set.seed(101)
nc <- 55422
nr <- 888
d <- matrix(rnorm(nr*nc), ncol = nc)
t1 <- system.time(b1 <- bigcor(d))
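To make the block-wise idea concrete without the ff dependency, here is a minimal in-memory sketch on toy data. The sizes (50 x 12), the block width and the variable names are choices made for this illustration. It is not the bigcor implementation, only the same splitting scheme, checked against a direct call to cor().

```r
set.seed(101)
x <- matrix(rnorm(50 * 12), ncol = 12)   # toy data: 50 rows, 12 columns
size <- 5                                # block width (arbitrary for the demo)

# Split the column indices into consecutive blocks of at most `size` columns
cols <- seq_len(ncol(x))
blocks <- split(cols, ceiling(cols / size))

# Fill the result block by block, mirroring each block across the diagonal
res <- matrix(NA_real_, ncol(x), ncol(x))
for (i in seq_along(blocks)) {
  for (j in i:length(blocks)) {
    g1 <- blocks[[i]]
    g2 <- blocks[[j]]
    r <- cor(x[, g1, drop = FALSE], x[, g2, drop = FALSE])
    res[g1, g2] <- r
    res[g2, g1] <- t(r)
  }
}

# The block-wise result should agree with the direct computation
all.equal(res, cor(x), check.attributes = FALSE)
```

Only one block of correlations has to be held in memory at a time, which is why the result matrix can be moved to disk-backed storage when it no longer fits in RAM.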

storing multiple images (3D arrays) into a 4D array, in R, with for loop

I am trying to get an average of 3 pictures. I made a function to do that, but I have an issue with storing each of the images after some image manipulation. I am getting these warnings:
Warning messages:
1: In b$b_arr[i] <- m : number of items to replace is not a multiple of replacement length
2: In b$b_arr[i] <- m : number of items to replace is not a multiple of replacement length
3: In b$b_arr[i] <- m : number of items to replace is not a multiple of replacement length
This is the code that I used. I know I can do this manually, but I want to write a function (and learn what my issue with this for loop is).
library(OpenImageR)
imgs <- c("img1.png", "img2.png", "img3.png")
b <- data.frame(conc = imgs,
                b_arr = array(dim = c(length(imgs), 831, 651, 3)))
base_fun <- function(imgs) {
  for (i in 1:length(imgs)) {
    m <- readImage(imgs[i])
    m[ , , 2] = 0
    m[ , , 3] = 0
    m <- cropImage(m, new_width = 250:1080,
                   new_height = 650:1300,
                   type = 'user_defined')
    b$b_arr[i] <<- m
  }
  avg_b <<- (b$b_arr[1,,,] + b$b_arr[2,,,] + b$b_arr[3,,,])/3
}
base_fun(imgs)
The problem in your code is that b is a data.frame of dimension 3 x 2, which means that the b_arr column is actually three 4-dimensional arrays.
I don't have any images lying around, but something like the code below should work.
library(OpenImageR)
imgs <- c("img1.png", "img2.png", "img3.png")
b <- lapply(imgs, function(x){
  img <- readImage(x)
  cropImage(img, new_width = 250:1080, new_height = 650:1300, type = 'user_defined')
})
# Convert list to array
b_arr <- array(dim = c(length(imgs), 831, 651, 3))
for (i in seq_along(imgs)) {
  # image i fills the i-th slice along the first dimension
  b_arr[i, , , ] <- b[[i]]
}
# average the images pixel by pixel: take the mean over the first dimension
# while keeping dimensions 2 to 4
avg_b <- apply(b_arr, c(2, 3, 4), mean)
Oliver's answer worked (with small modification), but I figured out how to do it in a function (helps with scalability and readability).
b <- function(imgs) {
  b <<- array(dim = c(length(imgs), w_to-w_from+1, h_to-h_from+1, 3))
  for (i in seq(length(imgs))) {
    m <- readImage(imgs[i])
    m[ , , 2] = 0
    m[ , , 3] = 0
    m <- cropImage(m, new_width = w_from:w_to,
                   new_height = h_from:h_to, type = 'user_defined')
    b[i,,,] <<- m
  }
  m_avg <<- (b[1,,,] + b[2,,,] + b[3,,,])/3
}
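The stacking and averaging step can be tried without any image files. The sketch below uses small random arrays as stand-ins for the cropped images; the sizes and the names stack, avg_apply and avg_reduce are assumptions made for this illustration. It fills a 4D array one image at a time and then averages pixel by pixel in two equivalent ways.

```r
set.seed(42)
h <- 4; w <- 3; n_img <- 3   # toy dimensions instead of 831 x 651

# Stand-ins for the cropped images: a list of h x w x 3 arrays
imgs <- lapply(seq_len(n_img), function(i) array(runif(h * w * 3), dim = c(h, w, 3)))

# Store image i in the i-th slice along the first dimension
stack <- array(NA_real_, dim = c(n_img, h, w, 3))
for (i in seq_along(imgs)) {
  stack[i, , , ] <- imgs[[i]]
}

# Pixel-wise mean: collapse dimension 1, keep dimensions 2 to 4
avg_apply <- apply(stack, c(2, 3, 4), mean)

# Same result without building the 4D array at all
avg_reduce <- Reduce(`+`, imgs) / length(imgs)

dim(avg_apply)
all.equal(avg_apply, avg_reduce)
```

Working on a list with Reduce() avoids global assignment with <<- and works for any number of images, not just three.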

How to create matrix of all 2^n binary sequences of length n using recursion in R?

I know I can use expand.grid for this, but I am trying to learn actual programming. My goal is to take what I have below and use recursion to get all 2^n binary sequences of length n.
I can do this for n = 1, but I don't understand how to use the same function recursively to get the answer for higher dimensions.
Here is for n = 1:
binseq <- function(n){
  binmat <- matrix(nrow = 2^n, ncol = n)
  r <- 0 #row counter
  for (i in 0:1) {
    r <- r + 1
    binmat[r,] <- i
  }
  return(binmat)
}
I know I probably have to use cbind in the return statement. My intuition says the return statement should be something like cbind(binseq(n-1), binseq(n)). But, honestly, I'm completely lost at this point.
The desired output should basically recursively produce this for n = 3:
binmat <- matrix(nrow = 8, ncol = 3)
r <- 0 # current row of binmat
for (i in 0:1) {
  for (j in 0:1) {
    for (k in 0:1) {
      r <- r + 1
      binmat[r,] <- c(i, j, k)
    }
  }
}
binmat
It should just be a matrix as binmat is being filled recursively.
I quickly wrote this function to generate all N^K permutations of length K for given N characters. Hope it will be useful.
gen_perm <- function(str=c(""), lst=5, levels = c("0", "1", "2")){
  if (nchar(str) == lst){
    cat(str, "\n")
    return(invisible(NULL))
  }
  for (i in levels){
    gen_perm(str = paste0(str,i), lst=lst, levels=levels)
  }
}
# sample call
gen_perm(lst = 3, levels = c("x", "T", "a"))
I will return to your problem when I get more time.
UPDATE
I modified the code above to work for your problem. Note that the matrix being populated lives in the global environment. The function also uses the tmp variable to pass rows to the global environment. This was the easiest way for me to solve the problem. Perhaps, there are other ways.
levels <- c(0,1)
nc <- 3
m <- matrix(numeric(0), ncol = nc)
gen_perm <- function(row=numeric(), lst=nc, levels = levels){
  if (length(row) == lst){
    assign("tmp", row, .GlobalEnv)
    with(.GlobalEnv, {m <- rbind(m, tmp); rownames(m) <- NULL})
    return(invisible(NULL))
  }
  for (i in levels){
    gen_perm(row=c(row,i), lst=lst, levels=levels)
  }
}
gen_perm(lst=nc, levels=levels)
UPDATE 2
To get the expected output you provided, run
m <- matrix(numeric(0), ncol = 3)
gen_perm(lst = 3, levels = c(0,1))
m
Here levels specifies the values to draw from (binary in our case), m is the initially empty matrix to fill, gen_perm generates the rows and appends them to m, and lst is the length of each permutation (it matches the number of columns of the matrix).
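For comparison, the recursion can also return the matrix directly, without touching the global environment. The following is a minimal sketch along the lines of the cbind intuition in the question, not the code from the answer above: all sequences of length n are the sequences of length n - 1 prefixed once with 0 and once with 1.

```r
binseq <- function(n) {
  stopifnot(n >= 1)
  # Base case: the two sequences of length 1
  if (n == 1) return(matrix(c(0, 1), ncol = 1))
  # Recursive case: all 2^(n-1) sequences of length n - 1 ...
  prev <- binseq(n - 1)
  # ... prefixed with 0 (top half) and with 1 (bottom half)
  rbind(cbind(0, prev), cbind(1, prev))
}

binseq(3)
```

The rows come out in the same order as the triple for loop in the question (000, 001, 010, ..., 111), because the leading digit changes slowest.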

Prime numbers from random samples in R

I wrote the following code to find all the prime numbers in a randomly generated data set. Sadly, something seems to have gone wrong. Could anybody help me?
set.seed(20171106)
n <- 10000
num <- sample(1:100000, n, replace = TRUE)
findPrime <- function(x){
  apple <- c()
  n <- length(x)
  for(i in n){
    if(any(x[i]%%(1:(x[i]-1))!=0)) apple <- c(apple,x[i])
  }
  return(apple)
}
To get the results, call:
findPrime(num)
This is the warning message:
Warning message:
In if (x[i]%%(1:(x[i] - 1)) == 0) apple <- c(apple, x[i]) :
the condition has length > 1 and only the first element will be used
So how can I fix the problem?
An if statement accepts only a single logical value, but the condition in your code evaluates to a whole vector. I have rewritten your function using an ifelse expression wrapped inside an sapply loop.
I hope this works for you.
findPrime <- function(x = 0){
  primes <- c()
  # Prime finder
  primes <- sapply(X = x, FUN = function(x) {
    ifelse(any(x %% (1:(x - 1)) != 0), T, F)}
  )
  # Select primes
  primes <- num[primes]
  return(primes)
}
findPrime(num)
I have spotted another silly mistake... Inside the function, replace num with x in the select-primes step and invert the F, T outcomes. It should look like this:
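findPrime <- function(x = 0){
  primes <- c()
  # Prime finder
  primes <- sapply(X = x, FUN = function(x) {
    # 2:(x - 1) counts downwards when x < 3, so values below 4 are handled
    # explicitly: 2 and 3 are prime, 0 and 1 are not
    if (x < 4) return(x > 1)
    ifelse(any(x %% (2:(x - 1)) == 0), F, T)}
  )
  # Select primes
  primes <- x[primes]
  return(primes)
}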
I have just tried it and it works fine.
Alternatively, use the gmp package. Its isprime function returns 0 for non-prime numbers and 2 for numbers that are certainly prime (1 means "probably prime"), so you can subset the data on that result.
Say you have a vector a = c(1:10):
a = c(1:10)
b = gmp::isprime(a)
c = cbind(a,b)
c = as.data.frame(c)
c = c[c$b==2,]
a1 = c$a
a1
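The error in your code is in the loop header: for(i in n) visits only the single value n, so only the last element is tested. It should be for(i in 1:n).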
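Putting the points from the answers together, here is a minimal base R sketch. It is one possible rewrite, not the code of any answer above, and the names is_prime and find_primes are made up for this illustration. It tests every element, returns a single TRUE or FALSE per number, and only tries divisors up to sqrt(v), which is enough because any larger factor must be paired with a smaller one.

```r
is_prime <- function(v) {
  if (v < 2) return(FALSE)   # 0 and 1 are not prime
  if (v < 4) return(TRUE)    # 2 and 3 are prime
  # Trial division by every integer from 2 up to floor(sqrt(v))
  all(v %% (2:floor(sqrt(v))) != 0)
}

find_primes <- function(x) {
  # vapply guarantees exactly one logical value per element of x
  x[vapply(x, is_prime, logical(1))]
}

find_primes(1:30)
# [1]  2  3  5  7 11 13 17 19 23 29
```

For the data in the question the call would be find_primes(num); duplicated values in num stay duplicated in the result.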

Populating a vector with a for loop

I am trying to fill a vector pred_pos with the result pred on each iteration of the for loop. However, my pred_pos vector is never filled. The my_vec object is a list of large character vectors which I don't believe needs to be reproduced for this problem as it is most likely a fundamental indexing error. I just need to know how to populate a vector from this for loop. I can't seem to work out a solution.
pred_pos <- vector("numeric" , 2)
for(i in my_vec) {
  for(r in pred_pos) {
    inserts <- sapply(i, function(n) { n <- cond_probs_neg[n] } )
    pred <- sum(unlist(inserts) , na.rm = T) * apriori_neg
    pred_pos[r] <- pred
  }
}
Assuming that the rest of your code works, there is no need to explicitly state:
pred_pos <- vector("numeric" , 2)
That creates a numeric vector of length two. You ought to be able to write:
pred_pos <- vector()
Now when you wish to append to the vector you can simply use:
vector[length(vector)+1] <- someData
I believe your code should work if it is adjusted:
pred_pos <- vector()
for(i in my_vec) {
  inserts <- sapply(i, function(n) { n <- cond_probs_neg[n] } )
  pred <- sum(unlist(inserts) , na.rm = T) * apriori_neg
  pred_pos[length(pred_pos)+1] <- pred
}
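The reason the original loop never fills pred_pos is that for(r in pred_pos) iterates over the values of the vector (0 and 0), not over its positions, and assigning to index 0 changes nothing. The sketch below shows that effect and two common alternatives to appending: preallocating and looping over positions with seq_along(), or using vapply(). The contents of cond_probs_neg, apriori_neg and my_vec are made-up stand-ins, since the real objects are not shown in the question.

```r
# Hypothetical stand-ins for the objects in the question
cond_probs_neg <- c(bad = 0.4, awful = 0.3, good = 0.05)
apriori_neg <- 0.5
my_vec <- list(c("bad", "awful"), c("good", "unknown"))

# Assigning to index 0 silently does nothing
v <- numeric(2)
v[0] <- 9
v   # still 0 0

# Preallocate, then loop over positions rather than values
pred_pos <- numeric(length(my_vec))
for (k in seq_along(my_vec)) {
  inserts <- cond_probs_neg[my_vec[[k]]]   # NA for words without a probability
  pred_pos[k] <- sum(inserts, na.rm = TRUE) * apriori_neg
}

# The same computation without an explicit loop
pred_pos2 <- vapply(
  my_vec,
  function(words) sum(cond_probs_neg[words], na.rm = TRUE) * apriori_neg,
  numeric(1)
)

pred_pos
all.equal(pred_pos, pred_pos2)
```

Preallocating avoids growing the vector on every iteration, which matters once my_vec becomes long.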
