I am not able to understand why this is happening. I have a data matrix of dimensions 64 x 6830. When I do the following:
pr.out=prcomp(data,scale=TRUE)
dim(pr.out$rotation)
# [1] 6830 64
I do not understand why the rotation matrix is not 6830 x 6830. When I take a subset of the data like this:
data1=data[1:nrow(data),1:10]
pr.data=prcomp(data1,scale=TRUE)
dim(pr.data$rotation)
# [1] 10 10
So on a smaller subset of the same data the result looks correct, but I am clueless as to why the rotation matrix has different dimensions when prcomp is run on the whole dataset.
The function prcomp is based on the function svd:
svd(x, nu = min(n, p), nv = min(n, p), LINPACK = FALSE)
From edit(stats:::prcomp.default), we see:
s <- svd(x, nu = 0)
This means that the left singular vectors are not computed. Thus, in the case of prcomp, svd only returns "a vector containing the singular values of x, of length min(n, p)" and "a matrix whose columns contain the right singular vectors of x [...]. Dimension c(p, nv)"
If we go back to the call of svd, nv is defined as nv = min(n, p) (the minimum of n and p), where n = nrow(x) and p = ncol(x).
In the case of "data", n = 64 and p = 6830, so nv = 64 and pr.out$rotation is a 6830 x 64 (p x nv) matrix.
In the case of "data1", n = 10 and p = 10, so nv = 10 and pr.data$rotation is a 10 x 10 (p x nv) matrix.
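To see this in action, here is a quick check with a small random matrix (a stand-in for your data, not the actual dataset): the rotation returned by prcomp has the same dimensions as the right-singular-vector matrix from svd on the scaled data.
set.seed(1)
x <- matrix(rnorm(64 * 200), nrow = 64, ncol = 200)  # n = 64 rows, p = 200 columns
dim(prcomp(x, scale = TRUE)$rotation)                # [1] 200 64, i.e. p x min(n, p)
dim(svd(scale(x), nu = 0)$v)                         # [1] 200 64 as well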
I want to solve an optimization problem that searches for the best weights for groups of vectors. Could you give some suggestions on how to solve it in R? Thanks very much.
The problem is as follows.
There are N groups, and we know the similarity matrix S among these N groups. The dimension of S is N x N.
Each group contains K vectors, and each vector has M elements whose values are 0 or 1.
From the K vectors in each group we can compute an average vector.
Based on these average vectors, we can calculate the correlation among the groups.
The objective is to minimize the difference between the correlation matrix C and the known similarity matrix S.
Because you didn't provide any data, I will generate random data and demonstrate one way you can approach your problem.
Similarity matrix:
N <- 6
S <- matrix(runif(N^2, -1, 1), ncol = N, nrow = N)
similarity_matrix <- (S + t(S)) / 2
N is the number of groups. Each value of the similarity matrix is between -1 and 1 and the matrix is symmetric (because you want to compare it to a covariance matrix, this makes sense).
Group vectors:
M <- 10
K <- 8
group_vectors <- replicate(N, replicate(K, sample(c(0, 1), M, TRUE)), FALSE)
M is the dimension of each vector and K is the number of binary vectors in each group.
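As a quick structure check (using the N, K and M defined above), each group is stored as one M x K binary matrix in a list:
length(group_vectors)    # N = 6 groups
dim(group_vectors[[1]])  # M x K, i.e. 10 x 8, one binary matrix per group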
Fitness function:
fitness <- function(W, group_vectors, similarity_matrix){
W <- as.data.frame(matrix(W, nrow = K, ncol = N))
SS <- cov(
mapply(function(x,y) rowSums(sweep(x, 2, y, "*")), group_vectors, W)
)
sum(abs(SS - similarity_matrix))
}
For the given weights, fitness calculates the covariance matrix of the weighted average vectors described above and returns its element-wise absolute distance from similarity_matrix.
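As a quick sanity check before optimizing, you can evaluate fitness at a random weight vector (note that, as written above, the function picks up K and N from the global environment rather than from its arguments):
set.seed(42)
fitness(runif(K * N, -1, 1), group_vectors, similarity_matrix)  # a single non-negative value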
Differential evolution approach:
res <- DEoptim::DEoptim(
fn = fitness,
lower = rep(-1, K*N),
upper = rep(1, K*N),
group_vectors = group_vectors,
similarity_matrix = similarity_matrix,
control = DEoptim::DEoptim.control(VTR = 0, itermax = 1000, trace = 50, NP = 100)
)
W <- matrix(res$optim$bestmem, nrow = K, ncol = N)
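If you want to check how close the optimizer got, one option is to compare the reported best objective value with the fitness recomputed at the returned weights:
res$optim$bestval                                             # best objective value found
fitness(res$optim$bestmem, group_vectors, similarity_matrix)  # same value, recomputed directly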
Genetic algorithm approach:
res <- GA::ga(
type = "real-valued",
fitness = function(W, ...) -fitness(W, ...),
lower = rep(-1, K*N),
upper = rep(1, K*N),
group_vectors = group_vectors,
similarity_matrix = similarity_matrix,
maxiter = 10000,
run = 200
)
W <- matrix(res@solution[1, ], nrow = K, ncol = N)
I have a matrix of size 55422 x 888, and the rcorr function (from the Hmisc package) produces this error:
library(Hmisc)
M1 <- matrix(rnorm(36), nrow = 55422, ncol = 888)  # rnorm(36) is recycled to fill the matrix
cor <- rcorr(t(M1), type = "pearson")
Error in double(p * p) : vector size cannot be NA
In addition: Warning message:
In p * p : NAs produced by integer overflow
Is there anything I can do to solve this?
This isn't a complete/working solution, but will give you an idea of some of the issues.
Your correlation matrix will contain n*(n-1)/2 = 1535771331 unique elements. If each correlation takes one microsecond to compute, computing the correlation matrix will take (n^2-n)/2/(1e6*3600) = 0.42 hours and require (n^2-n)/2*8/(2^30) = 11.4 GB of storage. These requirements are not impossible if you have a lot of RAM and time ...
In fact it's a little bit worse than this, since rcorr returns its results as a symmetric matrix (i.e., not taking advantage of the symmetry), and returns the n and P matrices as well, so the storage requirement will be approximately 5 times as great as stated above (double for the full matrix, x 2.5 because we have two double-precision and one integer matrix).
Getting to your specific question, the section on long vectors in the R internals manual discusses the maximum sizes of objects in R. The 'standard' limitation is that the total number of elements of a matrix should be less than 2^31. The unique correlations alone would fit ((n^2-n)/2/(2^31-1) = 0.72), but the redundancy of storing the full n x n matrix gets you over the limit (as does storing the correlations, p-values, and sample sizes as separate full matrices).
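The integer overflow in the warning is easy to reproduce: ncol() returns an integer, and 55422 squared as an integer exceeds .Machine$integer.max, which is where the NA inside double(p * p) comes from.
p <- 55422L                              # an integer, as returned by ncol()
p * p                                    # NA, with "NAs produced by integer overflow"
as.numeric(p)^2 > .Machine$integer.max   # TRUE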
If you still want to go ahead, here is an implementation by A.N. Spiess, copied from here, that breaks the problem into blocks and stores the results in a disk-backed array (i.e., not in RAM). This won't get you the p-values (and it's still not clear what you're going to do with all those values ...), but it works at least up to 40,000 columns (takes about a minute).
However, it seems to crap out on your actual problem size (888 x 55422) with a too-large length. I'd have to look more closely to see if there is a limitation here we can get around ... It seems that we are actually still limited by the matrix dimensions: the maximum dimension of a square matrix under the standard limit is sqrt(2^31-1), approximately 46341. With more work, we could still do the block-diagonal thing and split this into several components ...
set.seed(101)
nc <- 55422
nr <- 888
d <- matrix(rnorm(nr*nc), ncol = nc)
t1 <- system.time(b1 <- bigcor(d))
bigcor <- function(
  x,
  y = NULL,
  fun = c("cor", "cov"),
  size = 2000,
  verbose = TRUE,
  ...)
{
  if (!require("ff")) stop("please install the ff package")
  fun <- match.arg(fun)
  if (fun == "cor") FUN <- cor else FUN <- cov
  if (fun == "cor") STR <- "Correlation" else STR <- "Covariance"
  if (!is.null(y) & NROW(x) != NROW(y)) stop("'x' and 'y' must have compatible dimensions!")
  NCOL <- ncol(x)
  if (!is.null(y)) YCOL <- NCOL(y)
  ## calculate remainder, largest 'size'-divisible integer and block size
  REST <- NCOL %% size
  LARGE <- NCOL - REST
  NBLOCKS <- NCOL %/% size
  ## preallocate square matrix of dimension
  ## ncol(x) in 'ff' double format
  if (is.null(y)) resMAT <- ff(vmode = "double", dim = c(NCOL, NCOL))
  else resMAT <- ff(vmode = "double", dim = c(NCOL, YCOL))
  ## split column numbers into 'nblocks' groups + remaining block
  GROUP <- rep(1:NBLOCKS, each = size)
  if (REST > 0) GROUP <- c(GROUP, rep(NBLOCKS + 1, REST))
  SPLIT <- split(1:NCOL, GROUP)
  ## create all unique combinations of blocks
  COMBS <- expand.grid(1:length(SPLIT), 1:length(SPLIT))
  COMBS <- t(apply(COMBS, 1, sort))
  COMBS <- unique(COMBS)
  if (!is.null(y)) COMBS <- cbind(1:length(SPLIT), rep(1, length(SPLIT)))
  ## initiate time counter
  timeINIT <- proc.time()
  ## iterate through each block combination, calculate correlation matrix
  ## between blocks and store them in the preallocated matrix on both
  ## symmetric sides of the diagonal
  for (i in 1:nrow(COMBS)) {
    COMB <- COMBS[i, ]
    G1 <- SPLIT[[COMB[1]]]
    G2 <- SPLIT[[COMB[2]]]
    ## if y = NULL
    if (is.null(y)) {
      if (verbose) cat(sprintf("#%d: %s of Block %s and Block %s (%s x %s) ... ", i, STR, COMB[1],
                               COMB[2], length(G1), length(G2)))
      RES <- FUN(x[, G1], x[, G2], ...)
      resMAT[G1, G2] <- RES
      resMAT[G2, G1] <- t(RES)
    } else { ## if y = smaller matrix or vector
      if (verbose) cat(sprintf("#%d: %s of Block %s and 'y' (%s x %s) ... ", i, STR, COMB[1],
                               length(G1), YCOL))
      RES <- FUN(x[, G1], y, ...)
      resMAT[G1, ] <- RES
    }
    if (verbose) {
      timeNOW <- proc.time() - timeINIT
      cat(timeNOW[3], "s\n")
    }
    gc()
  }
  return(resMAT)
}
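Assuming a run that completes (e.g., the 40,000-column case mentioned above), the result is an ff matrix backed by a file on disk rather than an ordinary in-memory matrix, so you read back small pieces of it rather than printing the whole thing:
b1[1:5, 1:5]  # pull a small block of correlations back into RAM
dim(b1)       # matches ncol(d) in both dimensions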
I am trying to write code to solve the following problem (as stated in Homework 5 of the Caltech course Learning from Data):
In this problem you will create your own target function f (a probability in this case) and data set D to see how Logistic Regression works. For simplicity, we will take f to be a 0/1 probability, so y is a deterministic function of x. Take d = 2 so you can visualize the problem, and let X = [-1, 1] × [-1, 1] with uniform probability of picking each x ∈ X. Choose a line in the plane as the boundary between f(x) = 1 (where y has to be +1) and f(x) = 0 (where y has to be -1) by taking two random, uniformly distributed points from X and taking the line passing through them as the boundary between y = ±1. Pick N = 100 training points at random from X, and evaluate the outputs yn for each of these points xn. Run Logistic Regression with Stochastic Gradient Descent to find g, and estimate Eout (the cross-entropy error) by generating a sufficiently large, separate set of points to evaluate the error. Repeat the experiment for 100 runs with different targets and take the average. Initialize the weight vector of Logistic Regression to all zeros in each run. Stop the algorithm when |w(t-1) - w(t)| < 0.01, where w(t) denotes the weight vector at the end of epoch t. An epoch is a full pass through the N data points (use a random permutation of 1, 2, ..., N to present the data points to the algorithm within each epoch, and use different permutations for different epochs). Use a learning rate of 0.01.
I am required to calculate the value nearest to Eout for N = 100, and the average number of epochs needed to satisfy the stopping criterion.
I wrote and ran the code but I'm not getting the right answers (according to the solutions, Eout should be near 0.1 and the number of epochs near 350). With the stated threshold of 0.01 on the change in w, the required number of epochs comes out far too small (around 10), leaving the error far too big (around 2). I then tried replacing the criterion with |w(t-1) - w(t)| < 0.001 (rather than 0.01); the average number of epochs then came to about 250 and the out-of-sample error to about 0.35.
Is there something wrong with my code/solution, or is it possible that the answers provided are faulty? I've added comments to indicate what I intend to do at each step. Thanks in advance.
library(pracma)
h<- 0 # h will later be updated to number of required epochs
p<- 0 # p will later be updated to Eout
C <- matrix(ncol=10000, nrow=2) # Testing set, used to calculate out of sample error
d <- matrix(ncol=10000, nrow=1)
for(i in 1:10000){
C[, i] <- c(runif(2, min = -1, max = 1)) # Sample data
d[1, i] <- sign(C[2, i] - f(C[1, i]))
}
for(g in 1:100){ # 100 runs of the experiment
x <- runif(2, min = -1, max = 1)
y <- runif(2, min = -1, max = 1)
fit = (lm(y~x))
t <- summary(fit)$coefficients[,1]
f <- function(x){ # Target function
t[2]*x + t[1]
}
A <- matrix(ncol=100, nrow=2) # Sample data
b <- matrix(ncol=100, nrow=1)
norm_vec <- function(x) {sqrt(sum(x^2))} # vector norm calculator
w <- c(0,0) # weights initialized to zero
for(i in 1:100){
A[, i] <- c(runif(2, min = -1, max = 1)) # Sample data
b[1, i] <- sign(A[2, i] - f(A[1, i]))
}
q <- matrix(nrow = 2, ncol = 1000) # q tracks the weight vector at the end of each epoch
l= 1
while(l < 1001){
E <- function(z){ # cross entropy error function
x = z[1]
y = z[2]
v = z[3]
return(log(1 + exp(-v*t(w)%*%c(x, y))))
}
err <- function(xn1, xn2, yn){ #gradient of error function
return(c(-yn*xn1, -yn*xn2)*(exp(-yn*t(w)*c(xn1,xn2))/(1+exp(-yn*t(w)*c(xn1,xn2)))))
}
e = matrix(nrow = 2, ncol = 100) # e will track the required gradient at each data point
e[,1:100] = 0
perm = sample(100, 100, replace = FALSE, prob = NULL) # Random permutation of the data indices
for(j in 1:100){ # One complete Epoch
r = A[,perm[j]] # pick the perm[j]th entry in A
s = b[perm[j]] # pick the perm[j]th entry in b
e[,perm[j]] = err(r[1], r[2], s) # Gradient of the error
w = w - 0.01*e[,perm[j]] # update the weight vector according to the formula involving step size and gradient
}
q[,l] = w # the lth entry is the weight vector at the end of the lth epoch
if(l > 1 & norm_vec(q[,l] - q[,l-1])<0.001){ # given criterion to terminate the algorithm
break
}
l = l+1 # move to the next epoch
}
for(n in 1:10000){
p[g] = mean(E(c(C[1,n], C[2, n], d[n]))) # average over 10000 data points, of the error function, in experiment no. g
}
h[g] = l #gth entry in the vector h, tracks the number of epochs in the gth iteration of the experiment
}
mean(h) # Mean number of epochs needed
mean(p) # average Eout, over 100 experiments
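For reference, here is a minimal sketch of a single stochastic-gradient step for the cross-entropy error e(w) = log(1 + exp(-y w·x)), assuming labels y in {-1, +1} and an input augmented with an explicit bias term; the function name sgd_step and the variable names are purely illustrative.
# One SGD step for logistic regression with labels y in {-1, +1};
# x is the augmented input c(1, x1, x2), so w[1] plays the role of the bias term.
sgd_step <- function(w, x, y, eta = 0.01) {
  grad <- -y * x / (1 + exp(y * sum(w * x)))  # gradient of log(1 + exp(-y * w.x)) at this point
  w - eta * grad                              # gradient-descent update with learning rate eta
}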
I'd like to generate an N by n matrix, with each column having a different p value:
N=100000
n=100
p= seq(0.005,0.995,.01)
xsim=rbinom(N, size = n, prob = p)
ysim=rbinom(N, size = n, prob = p)
This does not appear to have the correct dimensions; I want each column to draw from a different p. Is there a way to do this with apply or otherwise? Thanks.
sapply(p, rbinom, n = N, size = n)
rbinom doesn't vectorize over prob the way you want: a vector prob is recycled across the draws instead of producing one set of draws per probability, so you need to call it once per p value (here via sapply).
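A quick check of the result, using the N, n and p from the question:
xsim <- sapply(p, rbinom, n = N, size = n)
dim(xsim)  # [1] 100000 100: one column per probability in p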
When I run
weights <- 1:50
profits <- 1:50
library(adagio)
knapsack(w = weights, p = profits, cap = 30)
I get the error
Error in F[, k] <- G :
number of items to replace is not a multiple of replacement length
In addition: Warning message:
In pmax(G, H) : an argument will be fractionally recycled
but when I run smaller sized vectors, like
weights <- 1:20
profits <- 1:20
knapsack(w = weights, p = profits, cap = 30)
it runs fine. Does knapsack() just slow down (and prevent running) for larger sets? I'm looking to use lengths in the thousands eventually.
This is an issue with passing elements with weight exceeding the total capacity. To see the issue, let's look at the first few lines of the knapsack function:
function (w, p, cap)
{
  n <- length(w)
  x <- logical(n)
  F <- matrix(0, nrow = cap + 1, ncol = n)
  G <- matrix(0, nrow = cap + 1, ncol = 1)
  for (k in 1:n) {
    F[, k] <- G
    H <- c(numeric(w[k]), G[1:(cap + 1 - w[k]), 1] + p[k])
    G <- pmax(G, H)
  }
When iteratively filling the F matrix one column at a time, the algorithm creates a vector H with the following command (and then immediately computing pmax(G, H)):
H <- c(numeric(w[k]), G[1:(cap + 1 - w[k]), 1] + p[k])
numeric(w[k]) has length w[k], and when w[k] <= cap, G[1:(cap + 1 - w[k]), 1] + p[k] has length cap + 1 - w[k], meaning the entire vector H has length cap+1, matching the size of G. On the other hand, when w[k] == cap + 1 we will end up with an H vector of size cap+2, which doesn't match the size of G and gives us trouble, and with w[k] > cap + 1 we will get an error for mixing positive and negative indices.
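You can see the length mismatch directly in a toy example where one item's weight is cap + 1 (numbers chosen just for illustration):
cap <- 30
G <- matrix(0, nrow = cap + 1, ncol = 1)
wk <- cap + 1                                    # an item whose weight exceeds the capacity
H <- c(numeric(wk), G[1:(cap + 1 - wk), 1] + 1)  # the vector built inside knapsack()
length(G)  # 31
length(H)  # 32: pmax(G, H) recycles with a warning, and the next F[, k] <- G then errors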
Getting back to your example function call, you have weights up to 50 but only a capacity of 30, yielding an error:
weights <- 1:50
profits <- 1:50
knapsack(w = weights, p = profits, cap = 30)
# Error in F[, k] <- G :
# number of items to replace is not a multiple of replacement length
# In addition: Warning message:
# In pmax(G, H) : an argument will be fractionally recycled
However when you limit to elements with weight not exceeding the capacity, you get no errors:
knapsack(w = weights[weights <= 30], p = profits[weights <= 30], cap = 30)
# $capacity
# [1] 30
#
# $profit
# [1] 30
#
# $indices
# [1] 1 2 3 4 5 7 8
Ideally, the knapsack function would gracefully drop any element whose weight exceeds the capacity (since no such element could ever appear in a feasible solution) and return a solution for the code you posted, but as a workaround you can simply remove those elements from the input yourself.
I received the same error (which is how I found this SO post). I think the adagio knapsack function doesn't like profits or weights with fractional values. I used rnorm() to generate profits and weights in order to compare the results with a knapsack function I had written myself. Even with a capacity several times larger than all the weights put together, I was getting the 'recycling' error. However, when I rounded the rnorm() vectors before passing them as arguments to knapsack, there were no problems.
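For example, a minimal sketch of that rounding workaround, using arbitrary simulated weights and profits:
set.seed(1)
w <- round(runif(10, 1, 10))              # integer weights
p <- round(runif(10, 5, 25))              # integer profits
adagio::knapsack(w = w, p = p, cap = 25)  # runs without the recycling error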