I noticed when using a dummy coding for fitting my linear models R excludes certain parameters when forming model matrix. What is the R algorithm for doing this?
It is not well documented, but it goes back to whatever pivoting algorithm the underlying LAPACK code uses:
from the source code of lm.fit:
z <- .Call(C_Cdqrls, x, y, tol, FALSE)
...
coef <- z$coefficients
pivot <- z$pivot
...
r2 <- if(z$rank < p) (z$rank+1L):p else integer()
if (is.matrix(y)) {
....
} else {
coef[r2] <- NA
## avoid copy
if(z$pivoted) coef[pivot] <- coef
...
}
If you want to dig back further, you need to look into dqrdc2.f, which says (for what it's worth):
c dqrdc2 uses householder transformations to compute the qr
c factorization of an n by p matrix x. a limited column
c pivoting strategy based on the 2-norms of the reduced columns
c moves columns with near-zero norm to the right-hand edge of
c the x matrix. this strategy means that sequential one
c degree-of-freedom effects can be computed in a natural way.
In practice I have generally found that R eliminates the last (rightmost) column of a set of collinear predictor variables ...
Related
I am currently in an online class in genomics, coming in as a wetlab physician, so my statistical knowledge is not the best. Right now we are working on PCA and SVD in R. I got a big matrix:
head(mat)
ALL_GSM330151.CEL ALL_GSM330153.CEL ALL_GSM330154.CEL ALL_GSM330157.CEL ALL_GSM330171.CEL ALL_GSM330174.CEL ALL_GSM330178.CEL ALL_GSM330182.CEL
ENSG00000224137 5.326553 3.512053 3.455480 3.472999 3.639132 3.391880 3.282522 3.682531
ENSG00000153253 6.436815 9.563955 7.186604 2.946697 6.949510 9.095092 3.795587 11.987291
ENSG00000096006 6.943404 8.840839 4.600026 4.735104 4.183136 3.049792 9.736803 3.338362
ENSG00000229807 3.322499 3.263655 3.406379 9.525888 3.595898 9.281170 8.946498 3.473750
ENSG00000138772 7.195113 8.741458 6.109578 5.631912 5.224844 3.260912 8.889246 3.052587
ENSG00000169575 7.853829 10.428492 10.512497 13.041571 10.836815 11.964498 10.786381 11.953912
Those are just the first few columns and rows, it has 60 columns and 1000 rows. Columns are cancer samples, rows are genes
The task is to:
removing the eigenvectors and reconstructing the matrix using SVD, then we need to calculate the reconstruction error as the difference between the original and the reconstructed matrix. HINT: You have to use the svd() function and equalize the eigenvalue to $0$ for the component you want to remove.
I have been all over google, but can't find a way to solve this task, which might be because I don't really get the question itself.
so i performed SVD on my matrix m:
d <- svd(mat)
Which gives me 3 matrices (Eigenassays, Eigenvalues and Eigenvectors), which i can access using d$u and so on.
How do I equalize the eigenvalue and ultimately calculate the error?
https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/svd
the decomposition expresses your matrix mat as a product of 3 matrices
mat = d$u x diag(d$d) x t(d$v)
so first confirm you are able to do the matrix multiplications to get back mat
once you are able to do this, set the last couple of elements of d$d to zero before doing the matrix multiplication
It helps to create a function that handles the singular values.
Here, for instance, is one that zeros out any singular value that is too small compared to the largest singular value:
zap <- function(d, digits = 3) ifelse(d < 10^(-digits) * max(abs(d))), 0, d)
Although mathematically all singular values are guaranteed non-negative, numerical issues with floating point algorithms can--and do--create negative singular values, so I have prophylactically wrapped the singular values in a call to abs.
Apply this function to the diagonal matrix in the SVD of a matrix X and reconstruct the matrix by multiplying the components:
X. <- with(svd(X), u %*% diag(zap(d)) %*% t(v))
There are many ways to assess the reconstruction error. One is the Frobenius norm of the difference,
sqrt(sum((X - X.)^2))
You have a set of N=400 objects, each having its own coordinates in a, say, 19-dimensional space.
You calculate the (Euclidean) distance matrix (all pairwise distances).
Now you want to select n=50 objects, such that the sum of all pairwise distances between the selected objects is maximal.
I devised a way to solve this by linear programming (code below, for a smaller example), but it seems inefficient to me, because I am using N*(N-1)/2 binary variables, corresponding to all the non-redundant elements of the distance matrix, and then a lot of constraints to ensure self-consistency of the solution vector.
I suspect there must be a simpler approach, where only N variables are used, but I can't immediately think of one.
This post briefly mentions some 'Bron–Kerbosch' algorithm, which apparently addresses the distance sum part.
But in that example the sum of distances is a specific number, so I don't see a direct application to my case.
I had a brief look at quadratic programming, but again I could not see the immediate parallel with my case, although the 'b %*% bT' matrix, where b is the (column) binary solution vector, could in theory be used to multiply the distance matrix, etc.; but I'm really not familiar with this technique.
Could anyone please advise (/point me to other posts explaining) if and how this kind of problem can be solved by linear programming using only N binary variables?
Or provide any other advice on how to tackle the problem more efficiently?
Thanks!
PS: here's the code I referred to above.
require(Matrix)
#distmat defined manually for this example as a sparseMatrix
distmat <- sparseMatrix(i=c(rep(1,4),rep(2,3),rep(3,2),rep(4,1)),j=c(2:5,3:5,4:5,5:5),x=c(0.3,0.2,0.9,0.5,0.1,0.8,0.75,0.6,0.6,0.15))
N = 5
n = 3
distmat_summary <- summary(distmat)
distmat_summary["ID"] <- 1:NROW(distmat_summary)
i.mat <- xtabs(~i+ID,distmat_summary,sparse=T)
j.mat <- xtabs(~j+ID,distmat_summary,sparse=T)
ij.mat <- rbind(i.mat,"5"=rep(0,10))+rbind("1"=rep(0,10),j.mat)
ij.mat.rowSums <- rowSums(ij.mat)
ij.diag.mat <- .sparseDiagonal(n=length(ij.mat.rowSums),-ij.mat.rowSums)
colnames(ij.diag.mat) <- dimnames(ij.mat)[[1]]
mat <- rbind(cbind(ij.mat,ij.diag.mat),cbind(ij.mat,ij.diag.mat),c(rep(0,NCOL(ij.mat)),rep(1,NROW(ij.mat)) ))
dir <- c(rep("<=",NROW(ij.mat)),rep(">=",NROW(ij.mat)),"==")
rhs <- c(rep(0,NROW(ij.mat)),1-unname(ij.mat.rowSums),n)
obj <- xtabs(x~ID,distmat_summary)
obj <- c(obj,setNames(rep(0, NROW(ij.mat)), dimnames(ij.mat)[[1]]))
if (length(find.package(package="Rsymphony",quiet=TRUE))==0) install.packages("Rsymphony")
require(Rsymphony)
LP.sol <- Rsymphony_solve_LP(obj,mat,dir,rhs,types="B",max=TRUE)
items.sol <- (names(obj)[(1+NCOL(ij.mat)):(NCOL(ij.mat)+NROW(ij.mat))])[as.logical(LP.sol$solution[(1+NCOL(ij.mat)):(NCOL(ij.mat)+NROW(ij.mat))])]
items.sol
ID.sol <- names(obj)[1:NCOL(ij.mat)][as.logical(LP.sol$solution[1:NCOL(ij.mat)])]
as.data.frame(distmat_summary[distmat_summary$ID %in% ID.sol,])
This problem is called the p-dispersion-sum problem. It can be formulated using N binary variables, but using quadratic terms. As far as I know, it is not possible to formulate it with only N binary variables in a linear program.
This paper by Pisinger gives the quadratic formulation and discusses bounds and a branch-and-bound algorithm.
Hope this helps.
I'm a complete beginner with R and I need to perform regressions on some data sets. My problem is, I'm not sure, how to rewrite the model into the mathematical formula.
Most confusing are interactions and poly function.
Can they be understood like a product and a polynomial?
Example
Let's have following model, both a and b are vectors of numbers:
y ~ poly(a, 2):b
Can it be rewritten mathematically like this?
y = a*b + a^2 * b
Example 2
And when I get a following expression from fit summary
poly(a, 2)2:b
is it equal to the following formula?
a^2 * b
Your question has two fold:
what does poly do;
what does : do.
For the first question, I refer you to my answer https://stackoverflow.com/a/39051154/4891738 for a complete explanation of poly. Note that for most users, it is sufficient to know that it generates a design matrix of degree number or columns, each of which being a basis function.
: is not a misery. In your case where b is also a numeric, poly(a, 2):b will return
Xa <- poly(a, 2) # a matrix of two columns
X <- Xa * b # row scaling to Xa by b
So your guess in the question is correct. But note that poly gives you orthogonal polynomial basis, so it is not as same as I(a) and I(a^2). You can set raw = TRUE when calling poly to get ordinary polynomial basis.
Xa has column names. poly(a,2)2 just means the 2nd column of Xa.
Note that when b is a factor, there will be a design matrix, say Xb, for b. Obviously this is a 0-1 binary matrix as factor variables are coded as dummy variables. Then poly(a,2):b forms a row-wise Kronecker product between Xa and Xb. This sounds tricky, but is essentially just pair-wise multiplication between all columns of two matrices. So if Xa has ka columns and Xb has kb columns, the resulting matrix has ka * kb columns. Such mixing is called 'interaction'.
The resulting matrix also has column names. For example, poly(a, 2)2:b3 means the product of the 2nd column of Xa and the dummy column in Xb for the third level of b. I am not saying 'the 3rd column of Xb' as this is false if b is contrasted. Usually a factor will be contrasted so if b has 5 levels, Xb will have 4 columns. Then the dummy column for third level will be the 2nd column of Xb, if the first factor level is the reference level (hence not appearing in Xb).
Objective function to be maximized : pos%*%mu where pos is the weights row vector and mu is the column vector of mean returns of d stocks
Constraints: 1) ones%*%pos = 1 where ones is a row vector of 1's of size 1*d (d is the number of stocks)
2) pos%*%cov%*%t(pos) = rb^2 # where cov is the covariance matrix of size d*d and rb is risk budget which is the free parameter whose values will be changed to draw the efficient frontier
I want to write a code for this optimization problem in R but I can't think of any function or library for help.
PS: solve.QP in library quadprog has been used to minimize covariance subject to a target return . Can this function be also used to maximize return subject to a risk budget ? How should I specify the Dmat matrix and dvec vector for this problem ?
EDIT :
library(quadprog)
mu <- matrix(c(0.01,0.02,0.03),3,1)
cov # predefined covariance matrix of size 3*3
pos <- matrix(c(1/3,1/3,1/3),1,3) # random weights vector
edr <- pos%*%mu # expected daily return on portfolio
m1 <- matrix(1,1,3) # constraint no.1 ( sum of weights = 1 )
m2 <- pos%*%cov # constraint no.2
Amat <- rbind(m1,m2)
bvec <- matrix(c(1,0.1),2,1)
solve.QP(Dmat= ,dvec= ,Amat=Amat,bvec=bvec,meq=2)
How should I specify Dmat and dvec ? I want to optimize over pos
Also, I think I have not specified constraint no.2 correctly. It should make the variance of portfolio equal to the risk budget.
(Disclaimer: There may be a better way to do this in R. I am by no means an expert in anything related to R, and I'm making a few assumptions about how R is doing things, notably that you're using an interior-point method. Also, there is likely an R package for what you're trying to do, but I don't know what it is or how to use it.)
Minimising risk subject to a target return is a linearly-constrained problem with a quadratic objective, looking like this:
min x^T Q x
subject to sum x_i = 1
sum ret_i x_i >= target
(and x >= 0 if you want to be long-only).
Maximising return subject to a risk budget is quadratically-constrained, however; it looks like this:
max ret^T x
subject to sum x_i = 1
x^T Q x <= riskbudget
(and maybe x >= 0).
Convex quadratic terms in the objective impose less of a computational cost in an interior-point method compared to introducing a convex quadratic constraint. With a quadratic objective term, the Q matrix just shows up in the augmented system. With a convex quadratic constraint, you need to optimise over a more complicated cone containing a second-order cone factor and you need to be careful about how you solve the linear systems that arise.
I would suggest you use the risk-minimisation formulation repeatedly, doing a binary search on the target parameter until you've found a portfolio approximately maximising return subject to your risk budget. I am suggesting this approach because it is likely sufficient for your needs.
If you really want to solve your problem directly, I would suggest using an interface Todd, Toh, and Tutuncu's SDPT3. This really is overkill; SDPT3 permits you to formulate and solve symmetric cone programs of your choosing. I would also note that portfolio optimisation problems are particularly special cases of symmetric cone programs; other approaches exist that are reportedly very successful. Unfortunately, I'm not studied up on them.
I want to minimize function FlogV (working with a multinormal distribution, Z is data matrix NxC; SIGMA it´s a square matrix CxC of var-covariance of data, R a vector with length C)
FLogV <- function(P){
(here I define parameters, P, within R and SIGMA)
logC <- (C/2)*N*log(2*pi)+(1/2)*N*log(det(SIGMA))
SOMA.t <- 0
for (j in 1:N){
SOMA.t <- SOMA.t+sum(t(Z[j,]-R)%*%solve(SIGMA)%*%(Z[j,]-R))
}
MlogV <- logC + (1/2)*SOMA.t
return(MlogV)
}
minLogV <- optim(P,FLogV)
All this is part of an extend code which was already tested and works well, except in the most important thing: I can´t optimize because I get this error:
“Error in solve.default(SIGMA) :
system is computationally singular: reciprocal condition number = 3.57726e-55”
If I use ginv() or pseudoinverse() or qr.solve() I get:
“Error in svd(X) : infinite or missing values in 'x'”
The thing is: if I take the SIGMA matrix after the error message, I can solve(SIGMA), the eigen values are all positive and the determinant is very small but positive
det(SIGMA)
[1] 3.384674e-76
eigen(SIGMA)$values
[1] 0.066490265 0.024034173 0.018738777 0.015718562 0.013568884 0.013086845
….
[31] 0.002414433 0.002061556 0.001795105 0.001607811
I already read several papers about change matrices like SIGMA (which are close to singular), did several transformations on data scale and form but I realized that, for a 34x34 matrix like the example, after det(SIGMA) close to e-40, R assumes it like 0 and calculation fails; also I can´t reduce matrix dimensions and can´t input in my function correction algorithms to singular matrices because R can´t evaluate it working with this optimization functions like optim. I really appreciate any suggestion to this problem.
Thanks in advance,
Maria D.
It isn't clear from your post whether the failure is coming from det() or solve()
If its just the solve in the quadratic term, you may want to try the two argument version of solve, it can be a bit more stable. solve(X,Y) is the same as solve(X) %*% Y
If you can factor sigma using chol(), you will get a triangular matrix such that LL'=Sigma. The determinant is the product of the diagonals, and you might try this for the quadratic term:
crossprod( backsolve(L, Z[j,]-R))