OLS estimator in R

I'm trying to compute OLS estimators manually in R for given vectors and matrices, but when I apply the formula beta = (x'x)^-1 (x'y), R tells me there is a dimension issue, and I can't figure out why.
My code is
nr = 100
nc = 1000
x=matrix(rnorm(nr * nc, mean=1, sd=1), nrow = nr)
epsilon=matrix(rnorm(nr * nc, mean=0, sd=1), nrow = nr)
k=c(1,2,4,8)
eta1=((epsilon^1-mean(epsilon^1))/(mean(epsilon^(1*2))-mean(epsilon^1)^2)^(1/2))
eta2=((epsilon^2-mean(epsilon^2))/(mean(epsilon^(2*2))-mean(epsilon^2)^2)^(1/2))
eta4=((epsilon^4-mean(epsilon^4))/(mean(epsilon^(4*2))-mean(epsilon^4)^2)^(1/2))
eta8=((epsilon^8-mean(epsilon^8))/(mean(epsilon^(8*2))-mean(epsilon^8)^2)^(1/2))
y1=x+eta1
y2=x+eta2
y4=x+eta4
y8=x+eta8
beta1=inv(t(x)*x)*(t(x)*y1)
beta2=inv(t(x)*x)*(t(x)*y2)
beta4=inv(t(x)*x)*(t(x)*y4)
beta8=inv(t(x)*x)*(t(x)*y8)
Also, I feel that there should be a way to loop through the values of k to get this automated, instead of doing each eta by hand. So, a bit of help in this area would also be appreciated.
The output I'm looking for is a vector of betas for each of the different values of k.

You have several issues. Firstly, y and epsilon should be n x 1 matrices, but you have n x m matrices for them instead. Secondly, you should use matrix multiplication, which is %*% in R, i.e. t(x) %*% y1; you used elementwise multiplication (*) instead.
For the sake of simplicity, let's create a matrix with 5 columns. My approach is to create a dependent variable that is related to the columns of x (the independent variables, or the feature matrix in machine-learning terminology).
nr = 100
nc = 5
x=matrix(rnorm(nr * nc, mean=1, sd=1), nrow = nr)
epsilon=matrix(rnorm(nr, mean=0, sd=1), nrow = nr) # it should be nx1
k=c(1,2,4,8)
eta1=((epsilon^1-mean(epsilon^1))/(mean(epsilon^(1*2))-mean(epsilon^1)^2)^(1/2))
eta2=((epsilon^2-mean(epsilon^2))/(mean(epsilon^(2*2))-mean(epsilon^2)^2)^(1/2))
eta4=((epsilon^4-mean(epsilon^4))/(mean(epsilon^(4*2))-mean(epsilon^4)^2)^(1/2))
eta8=((epsilon^8-mean(epsilon^8))/(mean(epsilon^(8*2))-mean(epsilon^8)^2)^(1/2))
To check the output we should create the y values wisely. So, let's define some betas and create the y values with respect to them. At the end, we can compare the output with the inputs we defined. Note that you should have 5 betas for 5 columns.
# made up betas
beta1_real <- 1:5
beta2_real <- -4:0
beta4_real <- 7:11
beta8_real <- seq(0.25,1.25,0.25)
To create the y values,
y1= 10 + x %*% matrix(beta1_real) + eta1
y2= 20 + x %*% matrix(beta2_real) + eta2
y4= 30 + x %*% matrix(beta4_real) + eta4
y8= 40 + x %*% matrix(beta8_real) + eta8
Here, I also added a constant term for each y. To recover the constant term at the end, we should add a column of ones at the beginning of our x matrix:
x <- cbind(matrix(1,nrow = nr),x)
The rest is almost the same as yours. The only differences are that I used solve() instead of inv() and matrix multiplication (%*%):
beta1=solve(t(x)%*%x)%*%(t(x)%*%y1)
beta2=solve(t(x)%*%x)%*%(t(x)%*%y2)
beta4=solve(t(x)%*%x)%*%(t(x)%*%y4)
beta8=solve(t(x)%*%x)%*%(t(x)%*%y8)
If we compare the outputs,
beta1_real was,
# [1] 1 2 3 4 5
and the output of beta1 is,
# [,1]
# [1,] 10.0049631
# [2,] 0.9632124
# [3,] 1.8987402
# [4,] 2.9816673
# [5,] 4.2111817
# [6,] 4.9529084
The results are similar. The 10 at the beginning is the constant term I added, and the remaining differences stem from the error term applied (the etas).
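To answer the looping question as well: a minimal sketch, assuming the corrected setup above (x already includes the column of ones and epsilon is the n x 1 error vector), that loops over the values of k with lapply and collects one beta vector per k. The coefficient vector c(10, 1:5) is made up for illustration.
k <- c(1, 2, 4, 8)
betas <- lapply(k, function(kk) {
  ek  <- epsilon^kk
  eta <- (ek - mean(ek)) / sqrt(mean(ek^2) - mean(ek)^2)  # standardize epsilon^k
  y   <- x %*% matrix(c(10, 1:5)) + eta                   # made-up true coefficients
  solve(t(x) %*% x) %*% (t(x) %*% y)                      # OLS estimate for this k
})
names(betas) <- paste0("beta", k)
Each element of the list betas then holds the estimated coefficient vector for one value of k.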

Related

How to generate samples from MVN model?

I am trying to run some R code based on example 5.1 of this paper. I want to simulate the following:
My background in R isn't great, so I have the code below. How can I generate samples and a histogram from this?
xseq<-seq(0, 100, 1)
n<-100
Z<- pnorm(xseq,0,1)
U<- pbern(xseq, 0.4, lower.tail = TRUE, log.p = FALSE)
Beta <- (-1)^U*(4*log(n)/(sqrt(n)) + abs(Z))
Some demonstrations of tools that will be of use:
rnorm(1) # generates one standard normal variable
rnorm(10) # generates 10 standard normal variables
rnorm(1, 5, 6) # generates 1 normal variable with mu = 5, sigma = 6
# not needed for this problem, but perhaps worth saying anyway
rbinom(5, 1, 0.4) # generates 5 Bernoulli variables that are 1 w/ prob. 0.4
So, to generate one instance of a beta:
n <- 100 # using the value you gave; I have no idea what n means here
u <- rbinom(1, 1, 0.4) # make one Bernoulli variable
z <- rnorm(1) # make one standard normal variable
beta <- (-1)^u * (4 * log(n) / sqrt(n) + abs(z))
But now, you'd like to do this many times for a Monte Carlo simulation. One way you might do this is by building a function, having beta be its output, and using the replicate() function, like this:
n <- 100 # putting this here because I assume it doesn't change
genbeta <- function(){ # output of this function will be one copy of beta
u <- rbinom(1, 1, 0.4)
z <- rnorm(1)
return((-1)^u * (4 * log(n) / sqrt(n) + abs(z)))
}
# note that we don't need to store beta anywhere directly;
# rather, it is just the return()ed value of the function we defined
betadraws <- replicate(5000, genbeta())
hist(betadraws)
This will have the effect of making 5000 copies of your beta variable and putting them in a histogram.
There are other ways to do this -- for instance, one might just make a big matrix of the random variables and work directly with it -- but I thought this would be the clearest approach for starting out.
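For what it's worth, a sketch of that vectorized alternative (the names n_draws, u_all, z_all are made up; it relies on n being defined as above):
n_draws <- 5000
u_all <- rbinom(n_draws, 1, 0.4)   # all the Bernoulli draws at once
z_all <- rnorm(n_draws)            # all the standard normal draws at once
betadraws2 <- (-1)^u_all * (4 * log(n) / sqrt(n) + abs(z_all))
hist(betadraws2)                   # same shape as the replicate() version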
EDIT: I realized that I ignored the second equation entirely, which you probably didn't want.
We've now made a vector of beta values, and you can control the length of the vector in the first parameter of the replicate() function above. I'll leave it as 5000 in my continued example below.
To get random samples of the Y vector, you could use something like:
x <- replicate(5000, rnorm(17))
# makes a 17 x 5000 matrix of independent standard normal variables
epsilon <- rnorm(17)
# vector of 17 standard normals
y <- x %*% betadraws + epsilon
# y is now a 17 x 1 matrix (morally equivalent to a vector of length 17)
and if you wanted to get many of these, you could wrap that inside another function and replicate() it.
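For instance, a sketch of that wrapping (genY is a hypothetical helper name; it reuses betadraws from above and redraws x and epsilon each time):
genY <- function() {
  x <- replicate(5000, rnorm(17))   # fresh 17 x 5000 matrix of standard normals
  epsilon <- rnorm(17)              # fresh vector of 17 standard normals
  x %*% betadraws + epsilon         # one 17 x 1 draw of Y
}
Ydraws <- replicate(100, genY())    # 17 x 1 x 100 array of Y draws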
Alternatively, if you didn't want the Y vector, but just a single Y_i component:
x <- rnorm(5000)
# x is a vector of 5000 iid standard normal variables
epsilon <- rnorm(1)
# epsilon_i is a single standard normal variable
y <- t(x) %*% betadraws + epsilon
# t() is the transpose function; y is now a 1 x 1 matrix

Generate random values in R with a defined correlation in a defined range

For a science project, I am looking for a way to generate random data in a certain range (e.g. min=0, max=100000) with a certain correlation with another variable which already exists in R. The goal is to enrich the dataset a little so I can produce some more meaningful graphs (no worries, I am working with fictional data).
For example, I want to generate random values correlating with r=-.78 with the following data:
var1 <- rnorm(100, 50, 10)
I already came across some pretty good solutions (e.g. https://stats.stackexchange.com/questions/15011/generate-a-random-variable-with-a-defined-correlation-to-an-existing-variable), but I only get very small values, which I cannot transform so that they make sense in the context of the other, original values.
Following the example:
var1 <- rnorm(100, 50, 10)
n <- length(var1)
rho <- -0.78
theta <- acos(rho)
x1 <- var1
x2 <- rnorm(n, 50, 50)
X <- cbind(x1, x2)
Xctr <- scale(X, center=TRUE, scale=FALSE)
Id <- diag(n)
Q <- qr.Q(qr(Xctr[ , 1, drop=FALSE]))
P <- tcrossprod(Q) # = Q Q'
x2o <- (Id-P) %*% Xctr[ , 2]
Xc2 <- cbind(Xctr[ , 1], x2o)
Y <- Xc2 %*% diag(1/sqrt(colSums(Xc2^2)))
var2 <- Y[ , 2] + (1 / tan(theta)) * Y[ , 1]
cor(var1, var2)
What I get for var2 are values ranging between -0.5 and 0.5 with a mean of 0. I would like to have much more spread-out data, so I could simply transform it by adding 50 and get a range quite similar to my first variable.
Does anyone know a way to generate this kind of, more or less, meaningful data?
Thanks a lot in advance!
Starting with var1, renamed to A, and using 10,000 points:
set.seed(1)
A <- rnorm(10000,50,10) # Mean of 50
First convert values in A to have the new desired mean 50,000 and have an inverse relationship (ie subtract):
B <- 1e5 - (A*1e3) # Note that { mean(A) * 1000 = 50,000 }
This alone gives r = -1. Add some noise to achieve the desired r:
B <- B + rnorm(10000,0,8.15e3) # Note this noise has mean = 0
# the amount of noise, 8.15e3, was found through parameter-search
This has your desired correlation:
cor(A,B)
[1] -0.7805972
View with:
plot(A,B)
Caution
Your B values might fall outside your range of 0 to 100,000. You might need to filter out values outside your range if you use a different seed or generate more numbers.
That said, the current range is fine:
range(B)
[1] 1668.733 95604.457
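If some values did land outside the range, a simple sketch of the filtering mentioned above (dropping the out-of-range pairs and rechecking the correlation):
keep <- B >= 0 & B <= 1e5     # flag the in-range values
A_f  <- A[keep]
B_f  <- B[keep]
cor(A_f, B_f)                 # recheck: filtering can nudge the correlation slightly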
If you're happy with the correlation and the marginal distribution (i.e., shape) of the generated values, multiply the values (which fall between -0.5 and +0.5) by 100,000 and add 50,000.
> c(-0.5, 0.5) * 100000 + 50000
[1] 0e+00 1e+05
Edit: this approach, or anything else where 100,000 and 50,000 are exchanged for different numbers, is an example of the 'linear transformation' recommended by @gregor-de-cillia.
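A quick sketch of that transformation applied to the var2 produced by the code in the question, just to confirm the correlation survives the rescaling (var2_scaled is a made-up name):
var2_scaled <- var2 * 1e5 + 5e4   # linear transformation onto the 0..100,000 scale
cor(var1, var2_scaled)            # unchanged, still about -0.78
range(var2_scaled)                # now on the desired scale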

Solve indeterminate equation system in R

I have an equation system and I want to solve it using numerical methods. I want to get a solution close to a given starting seed. Let me explain.
I have a vector of constants, X, with values:
X <- (c(1,-2,3,4))
and a vector W of weights:
W <- (c(0.25,0.25,0.25,0.25))
I want the sum of the components of W to be 1 (sum(W) = 1), and the sum of the element-wise product of X and W to be a given number N (sum(W*X) = N).
Is there an easy way to do this in R? I have it in Excel, using Solver, but I need to automate it.
Here is your constant and your target value:
x <- c(1, -2, 3, 4)
n <- 10
You need a function to minimize. The first line of its body contains each of your conditions, and the second line provides a measure of how to combine the errors into a single score. You may want to change the second line. For example, you could weight one error term more heavily than the other using sum(c(1, 5) * errs ^ 2); a sketch of this variant follows the function below.
fn <- function(w)
{
errs <- c(sum(w) - 1, sum(x * w) - n)
sum(errs ^ 2)
}
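As a concrete sketch of that weighted variant (fn_weighted is a hypothetical name, not part of the original answer): the second error term, which measures how far sum(x * w) is from n, counts five times as much as the sum-to-one constraint.
fn_weighted <- function(w)
{
  errs <- c(sum(w) - 1, sum(x * w) - n)
  sum(c(1, 5) * errs ^ 2)   # weight 1 on the sum-to-one error, 5 on the target error
}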
The simplest thing is to start with all the weights the same value.
init_w <- rep.int(1 / length(x), length(x))
Use optim to optimize.
optim(init_w, fn)
## $par
## [1] 0.1204827 -1.2438883 1.1023338 1.0212406
##
## $value
## [1] 7.807847e-08
##
## $counts
## function gradient
## 111 NA
##
## $convergence
## [1] 0
##
## $message
## NULL
The par element contains your weights.
There is no unique solution for this problem. If you try other initial values for w you will most likely get different results from optim.
The problem can be formulated as solving an underdetermined system of linear equations.
A <- matrix(c(rep(1,4),x), nrow=2,byrow=TRUE)
b <- matrix(c(1,n), nrow=2)
We seek a solution that satisfies A %*% w = b, but which one? The minimum-norm solution? Or maybe some other one? There are infinitely many solutions. Solutions can be expressed using the pseudo-inverse of the matrix A; use the package MASS for this.
library(MASS)
Ag <- ginv(A)
The minimum norm solution is
wmnorm <- Ag %*% b
And check with A %*% wmnorm - b and fn(wmnorm).
See the section "Matrix solutions" of the Wikipedia page "System of linear equations".
The solutions are given by
Az <- diag(nrow=nrow(Ag)) - Ag %*% A
w <- wmnorm + Az %*% z
where z is an arbitrary vector of ncol(Az) elements.
And now generate some solutions and check
xb <- wmnorm
z <- runif(4)
wsol.2 <- xb + Az %*% z
wsol.2
A %*% wsol.2 - b
fn(wsol.2)
z <- runif(4)
wsol.3 <- xb + Az %*% z
wsol.3
A %*% wsol.3 - b
fn(wsol.3)
You'll see that these two solutions are valid when given as arguments to fn, and that they are quite different from the solution found by optim. You could test this further by choosing a different starting point for init_w, for example init_w1 <- runif(4)/4, as sketched below.
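A sketch of that test, reusing fn, x and n from above (set.seed is only there to make the example reproducible):
set.seed(42)
init_w1 <- runif(4) / 4          # a different, random starting point
res1 <- optim(init_w1, fn)
res1$par                         # generally a different weight vector...
sum(res1$par)                    # ...that still sums to (about) 1
sum(x * res1$par)                # ...and still gives (about) n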

PCA Feature selection using R

I am a biologist. The output of my experiment contains a large number of features, stored as columns, with 563 rows. There are 8603 feature columns, which is quite a lot.
So, when I try to do a PCA analysis in R, it gives "out of memory" errors.
I have also tried doing princomp in pieces, but it does not seem to work for our approach.
I tried using the Script given in the link...
http://www.r-bloggers.com/introduction-to-feature-selection-for-bioinformaticians-using-r-correlation-matrix-filters-pca-backward-selection/
But it still does not work :(
I am trying to use the following code
bumpus <- read.table("http://www.ndsu.nodak.edu/ndsu/doetkott/introsas/rawdata/bumpus.html",
skip=20, nrows=49,
col.names=c("id","total","alar","head","humerus","sternum"))
boxplot(bumpus, main="Boxplot of Bumpus' data") ## in this step it is showing the ERROR
# we first standardize the data:
bumpus.scaled <- data.frame( apply(bumpus,2,scale) )
boxplot(bumpus.scaled, main="Boxplot of standardized Bumpus' data")
pca.res <- prcomp(bumpus.scaled, retx=TRUE)
pca.res
# note:
# PC.1 is some kind of average of all the measurements
# => measure of size of the bird
# PC.2 has a negative weight for 'sternum'
# and positive weights for 'alar', 'head' and 'humerus'
# => measure of shape of the bird
# first two principal components:
pca.res$x[,1:2]
plot(pca.res$x[,1:2], pch="", main="PC.1 and PC.2 for Bumpus' data (blue=survived, red=died)")
text(pca.res$x[,1:2], labels=c(1:49), col=c(rep("blue",21),rep("red",28)))
abline(v=0, lty=2)
abline(h=0, lty=2)
# compare to segment plot:
windows()
palette(rainbow(12, s = 0.6, v = 0.75))
stars(bumpus, labels=c(1:49), nrow=6, key.loc=c(20,-1),
main="Segment plot of Bumpus' data", draw.segment=TRUE)
# compare to biplot:
windows()
biplot(pca.res, scale=0)
# what do the arrows mean?
# consider the arrow for sternum:
abline(0, pca.res$rotation[5,2]/pca.res$rotation[5,1])
# consider the arrow for head:
abline(0, pca.res$rotation[3,2]/pca.res$rotation[3,1])
But second line
boxplot(bumpus, main="Boxplot of Bumpus' data") ## shows an error
The error is
Error: cannot allocate vector of size 1.4 Mb
In addition: There were 27 warnings (use warnings() to see them)
Please help!
In cases where the number of features is either huge or exceeds the number of observations, it is well advised to calculate the principal components based on the transposed dataset. This is especially true in your case because the default implies calculation of an 8603 x 8603 covariance matrix, which itself already consumes about 500 MB of memory (oh well, this isn't too much, but hey...). Assuming that the rows of your matrix X correspond to observations and the columns correspond to features, center your data and then perform PCA on the transpose of the centered X. There won't be more eigenpairs than the number of observations anyway. Finally, multiply each resulting eigenvector by X^T. You do not need to do the latter for the eigenvalues (see further below for a detailed explanation).
What you want
This code demonstrates the implementation of PCA on the transposed dataset and compares the results of prcomp and the "transposed PCA":
pca.reduced <- function(X, center=TRUE, retX=TRUE) {
  # Note that the data must first be centered on the *original* dimensions
  # because the centering of the 'transposed covariance' is meaningless for
  # the dataset. This is also why Sigma must be computed dependent on N
  # instead of simply using cov().
  if (center) {
    mu <- colMeans(X)
    X <- sweep(X, 2, mu, `-`)
  }
  # From now on we're looking at the transpose of X:
  Xt <- t(X)
  aux <- svd(Xt)
  V <- Xt %*% aux$v
  # Normalize the columns of V.
  V <- apply(V, 2, function(x) x / sqrt(sum(x^2)))
  # Done.
  list(X = if (retX) X %*% V else NULL,
       V = V,
       sd = aux$d / sqrt(nrow(X)-1),
       mean = if (center) mu else NULL)
}
# Example data (low-dimensional, but sufficient for this example):
X <- cbind(rnorm(1000), rnorm(1000) * 5, rnorm(1000) * 3)
original <- prcomp(X, scale=FALSE)
transposed <- pca.reduced(X)
# See what happens:
> print(original$sdev)
[1] 4.6468136 2.9240382 0.9681769
> print(transposed$sd)
[1] 4.6468136 2.9240382 0.9681769
>
> print(original$rotation)
PC1 PC2 PC3
[1,] -0.0055505001 0.0067322416 0.999961934
[2,] -0.9999845292 -0.0004024287 -0.005547916
[3,] 0.0003650635 -0.9999772572 0.006734371
> print(transposed$V)
[,1] [,2] [,3]
[1,] 0.0055505001 0.0067322416 -0.999961934
[2,] 0.9999845292 -0.0004024287 0.005547916
[3,] -0.0003650635 -0.9999772572 -0.006734371
Details
To see why it is possible to work on the transposed matrix consider the
following:
The general form of the eigenvalue equation is
A x = λ x (1)
Without loss of generality, let M be a centered "copy" of your original
dataset X. Substitution of M^T M for A yields
M^T M x = λ x (2)
Multiplication of this equation by M yields
M M^T M x = λ M x (3)
Consequent substitution of y = M x yields
M M^T y = λ y (4)
One can already see that y corresponds to an eigenvector of the "covariance" matrix of the transposed dataset (note that M M^T is not a true covariance matrix, as the dataset X was centered along its columns and not its rows; also, scaling must be done by the number of samples, i.e. the rows of M, and not the number of features, i.e. the columns of M or, equivalently, the rows of M^T).
It can also be seen that the eigenvalues are the same for M M^T and M^T M.
Finally, one last multiplication by M^T results in
(M^T M) M^T y = λ M^T y (5)
where M^T M is the original covariance matrix.
From equation (5) it follows that M^T y is an eigenvector of M^T M with
eigenvalue λ.
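A tiny numerical sketch of the shared-eigenvalue claim, using made-up dimensions and base R's eigen:
# Toy example: 6 observations, 4 features, centered along the columns
set.seed(1)
M <- scale(matrix(rnorm(24), nrow = 6), center = TRUE, scale = FALSE)
ev_small <- eigen(t(M) %*% M)$values   # 4 eigenvalues of M^T M
ev_big   <- eigen(M %*% t(M))$values   # 6 eigenvalues of M M^T, the extra ones are ~0
round(ev_small, 6)
round(ev_big, 6)                       # the nonzero values match ev_small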

How to do MA and loess normalization in R?

Attempting to do loess on two variables x and y in R using MA normalization (http://en.wikipedia.org/wiki/MA_plot) like this:
> x = rnorm(100) + 5
> y = x + 0.6 + rnorm(100)*0.8
> m = log2(x/y)
> a = 0.5*log(x*y)
I want to normalize x and y in such a way that the average m is 0, as in standard MA normalization, and then back-calculate the correct x and y values. First running loess on MA:
> l = loess(m ~ a)
What is the way to get corrected m values then? Is this correct?
> mc <- predict(l, a)
# original MA plot
> plot(a,m)
# corrected MA plot
> plot(a,m-mc)
It's not clear to me what predict actually does in the case of loess objects and how it differs from using l$residuals from the object l returned by loess. Can someone explain?
Finally, how can I back-calculate new x and y values based on this correction?
First, yes, your proposed method gets the corrected m values.
Regarding the predict function: yes, l$residuals, m - fitted(l), and m - predict(l) all give the same result: the corrected m values. However, the predict function is more general: it will take any new values as input. This is useful if you want to use only a subset of the data to fit the loess, and then predict on the totality of the data (for example, when using spiked-in standards).
Finally, how can you back-calculate new x and y values based on this correction? If you transform your data into log-space by creating two new variables, x1 <- log2(x) and y1 <- log2(y), it becomes easier to see. Since we're in log-space, calculating m and a is simpler:
m <- x1 - y1
a <- (x1 + y1)/2
Now, for correcting your data based on the fitted loess model, instead of updating the m variable by your mc correction, you can update x1 and y1 instead. Set:
x1 <- x1 - mc / 2
y1 <- y1 + mc / 2
This update has the same effect as updating m <- m - mc (because m will be recomputed as the difference between the updated x1 and y1) and has no effect on the a value.
To get your corrected data out, transform them by returning 2^x1 and 2^y1.
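Putting the steps together, a minimal end-to-end sketch under these same assumptions (simulated x and y as in the question; names like x1c, y1c, x_corrected are made up for illustration):
set.seed(1)
x <- rnorm(100) + 5
y <- x + 0.6 + rnorm(100) * 0.8
x1 <- log2(x); y1 <- log2(y)   # work in log2-space
m <- x1 - y1
a <- (x1 + y1) / 2
fit <- loess(m ~ a)
mc <- predict(fit, a)          # fitted trend to remove
x1c <- x1 - mc / 2             # split the correction between the two variables
y1c <- y1 + mc / 2
x_corrected <- 2^x1c           # back-transform to the original scale
y_corrected <- 2^y1c
plot(a, x1c - y1c)             # corrected MA plot, now centered around 0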
This is the method used by the authors of the normalize.loess function in the affy package, as originally described here (and it includes the capability to cyclically loop over all pairs of variables, as opposed to the single pair in this case): http://web.mit.edu/~r/current/arch/i386_linux26/lib/R/library/limma/html/normalizeCyclicLoess.html
