Is there a way to assign vector elements to multiple subarrays in R, using sample() or split() (or a combination of both functions)?
Essentially, what I need is a function that randomly assigns values to multiple subarrays.
Here's my full specific code:
K <- 2        # number of subarrays
N <- 100
Hstar <- 10
perms <- 10000
probs <- rep(1/Hstar, Hstar)
K1 <- c(1:5)
K2 <- c(6:10)
specs <- 1:N
pop <- array(dim = c(perms, N, K))
haps <- as.character(1:Hstar)

for (j in 1:perms) {
  for (i in 1:K) {
    if (i == 1) {
      pop[j, specs, i] <- sample(haps, size = N, replace = TRUE, prob = probs)
    } else {
      pop[j, , 1] <- sample(haps[K1], size = N, replace = TRUE, prob = probs[K1])
      pop[j, , 2] <- sample(haps[K2], size = N, replace = TRUE, prob = probs[K2])
    }
  }
}
pop[j, , 1] is the first subarray in pop, while pop[j, , 2] is the second subarray in pop.
If I have 20 subarrays, using sample() 20 times is tedious. I just want a way to assign values to any number of subarrays quickly and easily.
Any ideas?
It depends on whether you want replacement (the possibility of duplicated/omitted elements). Regardless, it's a one-liner:
sample(x, length(x), replace = FALSE)
Not 100% clear on the whole multiple subarray thing, but my approach would be something like:
num.intervals <- 5
interval.size <- length(x)/num.intervals  # need to make sure this is evenly divisible, I suppose
arr.master <- rep(NA, 0)
for (i in 1:num.intervals) {
  arr.master <- rbind(arr.master, sample(x, interval.size, replace = TRUE))
}
Basically, just take samples and keep mashing them together? Would this accomplish your goal?
Do you want the total number of elements across all subarrays to equal the number of elements in the original array? If so, then it's just a random sorting problem (really easy), and you can cut the result up afterwards into any number of subarrays. If not, you could fix the number of elements in all subarrays in advance, randomly sample a new vector of that size from the original, and then partition it into arbitrary subarrays.
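A minimal sketch of the "randomly shuffle, then cut" idea (the vector x and the number of subarrays K below are illustrative, not taken from the question):
x <- 1:100
K <- 20
shuffled <- sample(x)   # random permutation, no replacement
subarrays <- split(shuffled, cut(seq_along(shuffled), K, labels = FALSE))
lengths(subarrays)      # K pieces of (roughly) equal size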
I have a matrix of size 55422 x 888, and so the rcorr function is producing this error:
library(Hmisc)  # rcorr() comes from the Hmisc package
M1 <- matrix(rnorm(36), nrow = 55422, ncol = 888)
cor <- rcorr(t(M1), type = "pearson")
Error in double(p * p) : vector size cannot be NA
In addition: Warning message:
In p * p : NAs produced by integer overflow
Is there anything I can do to solve this?
This isn't a complete/working solution, but will give you an idea of some of the issues.
Your correlation matrix will contain n*(n-1)/2 = 1535771331 unique elements. If each correlation takes one microsecond to compute, computing the correlation matrix will take (n^2-n)/2/(1e6*3600) = 0.42 hours and require (n^2-n)/2*8/(2^30) = 11.4 GB of storage. These requirements are not impossible if you have a lot of RAM and time ...
In fact it's a little bit worse than this, since rcorr returns its results as a symmetric matrix (i.e., not taking advantage of the symmetry), and returns the n and P matrices as well, so the storage requirement will be approximately 5 times as great as stated above (double for the full matrix, x 2.5 because we have two double-precision and one integer matrix).
Getting to your specific question, the section on long vectors in the R internals manual discusses the maximum sizes of objects in R. The 'standard' limitation is that the total number of elements of the matrix should be less than 2^31 ((n^2-n)/2/(2^31-1) = 0.72), but the redundancy in the matrix gets you in trouble (as would the storage of the correlation, p-values, and the sample sizes).
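For reference, the arithmetic above as a quick check in R (here n = 55422, the number of variables, i.e. the number of rows of your original matrix):
n <- 55422
n * (n - 1) / 2                # 1535771331 unique correlations
n * (n - 1) / 2 * 8 / 2^30     # ~11.4 GB for one double-precision triangle
n^2 > 2^31 - 1                 # TRUE: a full n x n matrix exceeds 2^31 - 1 elements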
If you still want to go ahead, here is an implementation by A.N. Spiess, copied from here, that breaks the problem into blocks and stores the results in a disk-backed array (i.e., not in RAM). This won't get you the p-values (and it's still not clear what you're going to do with all those values ...), but it works at least up to 40,000 columns (takes about a minute).
However, it seems to crap out on your actual problem size (888 x 55422) with a too-large length. I'd have to look more closely to see if there is a limitation here we can get around ... It seems that we are actually still limited by the matrix dimensions (the maximum matrix dimension is sqrt(2^31-1), approximately 46341). With more work, we could still do the block-diagonal thing and split the problem into several components ...
set.seed(101)
nc <- 55422
nr <- 888
d <- matrix(rnorm(nr*nc), ncol = nc)
t1 <- system.time(b1 <- bigcor(d))  # bigcor() is defined below
bigcor <- function(
  x,
  y = NULL,
  fun = c("cor", "cov"),
  size = 2000,
  verbose = TRUE,
  ...)
{
  if (!require("ff")) stop("please install the ff package")
  fun <- match.arg(fun)
  if (fun == "cor") FUN <- cor else FUN <- cov
  if (fun == "cor") STR <- "Correlation" else STR <- "Covariance"
  if (!is.null(y) & NROW(x) != NROW(y)) stop("'x' and 'y' must have compatible dimensions!")

  NCOL <- ncol(x)
  if (!is.null(y)) YCOL <- NCOL(y)

  ## calculate remainder, largest 'size'-divisible integer and block size
  REST <- NCOL %% size
  LARGE <- NCOL - REST
  NBLOCKS <- NCOL %/% size

  ## preallocate square matrix of dimension
  ## ncol(x) in 'ff' single format
  if (is.null(y)) resMAT <- ff(vmode = "double", dim = c(NCOL, NCOL))
  else resMAT <- ff(vmode = "double", dim = c(NCOL, YCOL))

  ## split column numbers into 'nblocks' groups + remaining block
  GROUP <- rep(1:NBLOCKS, each = size)
  if (REST > 0) GROUP <- c(GROUP, rep(NBLOCKS + 1, REST))
  SPLIT <- split(1:NCOL, GROUP)

  ## create all unique combinations of blocks
  COMBS <- expand.grid(1:length(SPLIT), 1:length(SPLIT))
  COMBS <- t(apply(COMBS, 1, sort))
  COMBS <- unique(COMBS)
  if (!is.null(y)) COMBS <- cbind(1:length(SPLIT), rep(1, length(SPLIT)))

  ## initiate time counter
  timeINIT <- proc.time()

  ## iterate through each block combination, calculate correlation matrix
  ## between blocks and store them in the preallocated matrix on both
  ## symmetric sides of the diagonal
  for (i in 1:nrow(COMBS)) {
    COMB <- COMBS[i, ]
    G1 <- SPLIT[[COMB[1]]]
    G2 <- SPLIT[[COMB[2]]]

    ## if y = NULL
    if (is.null(y)) {
      if (verbose) cat(sprintf("#%d: %s of Block %s and Block %s (%s x %s) ... ", i, STR, COMB[1],
                               COMB[2], length(G1), length(G2)))
      RES <- FUN(x[, G1], x[, G2], ...)
      resMAT[G1, G2] <- RES
      resMAT[G2, G1] <- t(RES)
    } else ## if y = smaller matrix or vector
    {
      if (verbose) cat(sprintf("#%d: %s of Block %s and 'y' (%s x %s) ... ", i, STR, COMB[1],
                               length(G1), YCOL))
      RES <- FUN(x[, G1], y, ...)
      resMAT[G1, ] <- RES
    }

    if (verbose) {
      timeNOW <- proc.time() - timeINIT
      cat(timeNOW[3], "s\n")
    }

    gc()
  }

  return(resMAT)
}
I'm new to R. I have tried many ways and still cannot solve this. Can anyone help check?
I am trying to produce 3 sets of 100 random values that follow a chi-square distribution. The console says "number of items to replace is not a multiple of replacement length". Any hint on how to fix it?
for (i in 1:3) {
  x1[i] <- rchisq(100, df = 2)
  n1[i] <- length(x1[i])
}
As an explanation of your problem: you are trying to store a vector of 100 elements into a single element, the ith element, of a vector x1. To illustrate, you can put a vector of values into a vector of the same length:
x <- rnorm(6, 0, 1)
x[1:3] <- c(1,2,3)
x
## [1] 1.0000000 2.0000000 3.0000000 -0.8652300 1.3776699 -0.8817483
You could store them in a list instead; each element of a list is a vector that can be of any length. You will need double square brackets.
x1 <- list()
n1 <- numeric()  # holds the length of each sample
for (i in 1:3) {
  x1[[i]] <- rchisq(100, df = 2)
  n1[i] <- length(x1[[i]])
}
Lists and vectors are different types of data structures in R; you can read a lot about them in Advanced R.
It depends on what container you want to use. There are two containers that come to mind: either a list or a matrix.
# list format
x1 <- list()
n1 <- vector()
for (i in 1:3) {
  x1[[i]] <- rchisq(100, df = 2)
  n1[i] <- length(x1[[i]])
}
Note the double brackets [[i]], as mentioned in the comments.
# matrix format
x1 <- matrix(NA, nrow = 100, ncol = 3)
n1 <- vector()
for (i in 1:3) {
  x1[, i] <- rchisq(100, df = 2)
  n1[i] <- length(x1[, i])
}
The R code below fills an array of specified dimensions with positive integers generated randomly via a probability vector.
subset.haps <- NULL
haps <- 1:4
num.specs <- 100
probs <- rep(1/4, 4)
perms <- 10000
K <- 1
gen.perms <- function() {
  if (is.null(subset.haps)) {
    sample(haps, size = num.specs, replace = TRUE, prob = probs)
  } else {
    resample <- function(x, ...) x[sample.int(length(x), ...)]
    resample(subset.haps, size = num.specs, replace = TRUE, prob = probs[subset.haps])
  }
}

pop <- array(dim = c(perms, num.specs, K))

for (i in 1:K) {
  pop[, , i] <- replicate(perms, gen.perms())
}
However, profiling the above code suggests that improvements can be made.
The 'for' loop can be eliminated using rep()
rep(replicate(perms, gen.perms()), K)
However, this method does not produce an array, nor does it preserve the array dimensions.
Of course, wrapping the modified code above in as.array() will fix the second issue, but the output still does not resemble a typical array in structure.
My question
How can I ensure the array structure (i.e., dimensions) is preserved?
You can just make pop and then set its dim attribute afterwards:
pop <- rep(replicate(perms, gen.perms()), K)
dim(pop) <- c(perms, num.specs, K)
And to prove it:
class(pop)
# [1] "array"
dim(pop)
# [1] 10000 100 1
pop[2020,23,1]
# [1] 2
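Equivalently, since an array is just a vector with a dim attribute, the same thing can be done in one step (using the same objects as above):
pop <- array(rep(replicate(perms, gen.perms()), K), dim = c(perms, num.specs, K))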
Is there some way to make this loop faster in R?
V <- array(NA, dim = c(nrow(pixDF), n))
for (i in 1:n) {
  sdC <- sqrt(det(Cov[, i, ]))
  iC <- inv(Cov[, i, ])
  V[, i] <- apply(pixDF, 1, function(x) sdC * exp(-0.5 * ((x - Mean[i, ]) %*% iC %*% as.matrix(x - Mean[i, ]))))
}
where, in this case, pixDF is a matrix with 490000 rows and 4 columns filled with doubles, n = 5, Cov is a (4, 5, 4) array of doubles, and Mean is a (5, 4) matrix of doubles as well.
This loop was taking about 30 min on my computer (before editing). Right now it's taking 1 min.
As Ronak notes, it is hard to help without a reproducible example, but I think that apply can be avoided. Something like this COULD work:
V <- array(NA, dim = c(nrow(pixDF), n))
tpixDF <- t(pixDF)
for (i in 1:n) {
  x <- Cov[, i, ]
  sdC <- sqrt(det(x))
  iC <- solve(x)
  mi <- Mean[i, ]
  k <- t(tpixDF - mi)                                 # every row of pixDF centered at Mean[i, ]
  V[, i] <- sdC * exp(-0.5 * rowSums(k %*% iC * k))   # rowSums(k %*% iC * k) is the quadratic form for each row at once
}
Also, as Roland mentions, inv() is probably equivalent to solve().
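To try this without the original data, one could generate inputs matching the stated dimensions (purely illustrative values; identity covariance matrices are used here just so det() and solve() behave):
set.seed(1)
n <- 5
pixDF <- matrix(rnorm(490000 * 4), ncol = 4)
Mean <- matrix(rnorm(n * 4), nrow = n)        # 5 x 4
Cov <- array(0, dim = c(4, n, 4))
for (i in 1:n) Cov[, i, ] <- diag(4)          # one 4 x 4 identity matrix per component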
I don't have a background in programming (except for wrestling with R to get things done), and I'm trying to verbalize what the function for the greatest common divisor in the R {numbers} package is doing at each step. I need help understanding the flow of steps within the function:
function (n, m)
{
    stopifnot(is.numeric(n), is.numeric(m))
    if (length(n) != 1 || floor(n) != ceiling(n) || length(m) != 1 || floor(m) != ceiling(m))
        stop("Arguments 'n', 'm' must be integer scalars.")
    if (n == 0 && m == 0)
        return(0)
    n <- abs(n)
    m <- abs(m)
    if (m > n) {
        t <- n
        n <- m
        m <- t
    }
    while (m > 0) {
        t <- n
        n <- m
        m <- t%%m
    }
    return(n)
}
<environment: namespace:numbers>
For instance, in the if (m > n) {} part, n becomes t and ultimately it becomes m? I'm afraid to ask, because it may be painfully obvious, but I don't know what is going on. The same applies, I guess, to the rest of the function, with %% perhaps being the modulo operator.
What it says is:
Stop if either m or n is not numeric, is more than one number, or has decimals, and return the message "Arguments 'n', 'm' must be integer scalars."
If they are both zero, return zero.
Use absolute values from now on.
Make sure that n > m, because of the algorithm we'll apply in the next step. If this is not the case, flip them: first place n in the temporary variable t, then assign m to n, so that the larger number now sits at the beginning of the (n, m) pair. At this point both n and m hold the original m. Finish up by retrieving the value from the temporary variable and assigning it to m.
Now the modified Euclidean algorithm is applied to find the GCD: a more efficient version of the algorithm that shortcuts the repeated subtractions by replacing the larger of the two numbers with its remainder when divided by the smaller of the two.
The number that is smaller at the start of an iteration will be the larger one in the next iteration, so it is assigned to n to get ready for that iteration. To do so, though, the current n first has to be moved out of the way by assigning it to the temporary variable t. After that, the remainder from dividing the original larger number (now stored in t) by the smaller number m replaces the value stored in m.
As long as there is a remainder (modulo), the process goes on, this time with the initially smaller number m playing the role of the big guy. When there is no remainder, the smaller of the two numbers in that particular iteration is returned.
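To make the loop concrete, here is a small trace for an illustrative pair of inputs (n = 48, m = 18; values chosen only for the example):
n <- 48; m <- 18
while (m > 0) {
  t <- n
  n <- m
  m <- t %% m
  cat("n =", n, " m =", m, "\n")
}
# n = 18  m = 12
# n = 12  m = 6
# n = 6   m = 0
n  # 6, the greatest common divisor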
ADDENDUM:
Now that I know how to read this function, I see that it is limited to two numbers as input. So I entertained myself putting together a function that works with three integers as input:
require(numbers)
GCF <- function(x, y, z){
  tab.x <- tabulate(primeFactors(x))
  tab.y <- tabulate(primeFactors(y))
  tab.z <- tabulate(primeFactors(z))
  max.len <- max(length(tab.x), length(tab.y), length(tab.z))
  tab_x <- c(tab.x, rep(0, max.len - length(tab.x)))
  tab_y <- c(tab.y, rep(0, max.len - length(tab.y)))
  tab_z <- c(tab.z, rep(0, max.len - length(tab.z)))
  GCD_elem <- numeric()
  for (i in 1:max.len){
    GCD_elem[i] <- i^min(tab_x[i], tab_y[i], tab_z[i])  # factor i raised to its smallest shared exponent
  }
  GrCD <- prod(GCD_elem)
  print(GrCD)
}
Also for the LCM:
LCM <- function(x, y, z){
  tab.x <- tabulate(primeFactors(x))
  tab.y <- tabulate(primeFactors(y))
  tab.z <- tabulate(primeFactors(z))
  max.len <- max(length(tab.x), length(tab.y), length(tab.z))
  tab_x <- c(tab.x, rep(0, max.len - length(tab.x)))
  tab_y <- c(tab.y, rep(0, max.len - length(tab.y)))
  tab_z <- c(tab.z, rep(0, max.len - length(tab.z)))
  LCM_elem <- numeric()
  for (i in 1:max.len){
    LCM_elem[i] <- i^(max(tab_x[i], tab_y[i], tab_z[i]))
  }
  LCM_elem <- LCM_elem[!LCM_elem == 0]
  LCM <- prod(LCM_elem)
  print(LCM)
}
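A quick check of both functions (using the definitions above; for 12, 18, and 24 the greatest common divisor is 6 and the least common multiple is 72):
GCF(12, 18, 24)
# [1] 6
LCM(12, 18, 24)
# [1] 72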