How to generate a matrix with a certain rank in R

Does anyone know how to generate a matrix with a given rank in R?
Ultimately I want to create a data matrix Y = X + E,
where rank(X) = k and the entries of E are i.i.d. N(0, sigma^2).

The easiest choice is the identity matrix, which always has full rank. For example:
k <- 10
mymatrix <- diag(k)
Here the number of rows and the number of columns both equal the rank you specify.
I suppose you want to mimic a regression model, so you might want more rows (observations) than columns (variables). The following code lets you specify both:
k <- 5 # rank of your matrix
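nobs <- 10 # number of rows in X
X <- rbind(diag(k), matrix(0, nrow = nobs - k, ncol = k))
# draw one independent N(0, 1) value per entry; X + rnorm(nobs) would recycle the same nobs values down every column
y <- X + matrix(rnorm(nobs * k), nrow = nobs)
Note that X (and, almost surely, y) now has full column rank, so there is no multicollinearity in this 'model'.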
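The identity-based construction above always gives full column rank. If X should instead have a rank k that is smaller than both of its dimensions, one common approach is to multiply an n x k matrix by a k x p matrix; with Gaussian entries the product has rank k with probability 1. A minimal sketch, where the helper name make_rank_k and the values of n, p, and sigma are illustrative assumptions rather than anything given in the question:

```r
# Sketch: n x p matrix of rank k built as a product of two random Gaussian factors.
# make_rank_k, n, p and sigma are illustrative names/values (assumptions).
make_rank_k <- function(n, p, k) {
  stopifnot(k <= min(n, p))
  A <- matrix(rnorm(n * k), nrow = n, ncol = k)  # n x k factor
  B <- matrix(rnorm(k * p), nrow = k, ncol = p)  # k x p factor
  A %*% B
}

set.seed(42)
n <- 10; p <- 8; k <- 3; sigma <- 0.1
X <- make_rank_k(n, p, k)
# i.i.d. N(0, sigma^2) noise, one draw per entry
E <- matrix(rnorm(n * p, mean = 0, sd = sigma), nrow = n, ncol = p)
Y <- X + E

qr(X)$rank  # numerical rank of the signal matrix
qr(Y)$rank  # numerical rank after adding noise
```

The noise matrix E generically has full rank, so Y will usually be full rank even though X is not.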

Related

Generating multinomial random data in R

I am trying to generate data from a multinomial distribution in R using the function rmultinom, but I am having some problems.
What I want is a data frame with 50 rows and 20 columns and a total sum of the outcomes equal to 3 times n*p.
I am using this code:
p <- 20
n <- 50
N <- 3*(n*p)
prob_true <- rep(1/p, p)
a <- rmultinom(50, N, prob_true)
But I get some very strange results and a data frame with 20 rows and 50 columns.
How can I solve this problem?
Thanks in advance!
The help available at ?rmultinom says that n in rmultinom(n, size, prob) is:
"number of random vectors to draw"
And size is:
"specifying the total number of objects that are put into K boxes in the typical multinomial experiment"
And the help says that the output is:
"For rmultinom(), an integer K x n matrix where each column is a random vector generated according to the desired multinomial law, and hence summing to size"
So you're asking for 50 vectors/variables, each with a total number of "objects" equal to 3000: every column is drawn as a vector that sums to 3000.
Indeed, every element of colSums(a) is 3000.
Do you want your vectors/variables as rows? Then this would work just by transposing a:
t(a)
But if you want 20 columns, each of which is its own variable, you need to swap n and p (I also substituted n for the hard-coded 50 in the rmultinom call):
n <- 20
p <- 50
N <- 3*(n*p)
prob_true <- rep(1/p, p)
a <- rmultinom(n, N, prob_true)
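If the goal is instead that the whole 50 x 20 table sums to N (one possible reading of the question, not confirmed by the asker), a sketch is to draw a single multinomial vector over all n*p cells and reshape it. The names cells and tab are illustrative:

```r
n <- 50   # rows
p <- 20   # columns
N <- 3 * (n * p)

set.seed(1)
# one multinomial draw over n*p equally likely cells
cells <- rmultinom(1, size = N, prob = rep(1 / (n * p), n * p))
# reshape the single draw into an n x p table
tab <- matrix(cells, nrow = n, ncol = p)

dim(tab)  # rows and columns of the table
sum(tab)  # grand total over all cells
```

Here the individual rows and columns no longer have fixed totals; only the grand total is fixed at N.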

coefficient not estimable model.matrix

I'm doing a differential expression (DE) analysis using limma, and I have a lot of samples. I am trying to build the design matrix and then call lmFit(). But when I call lmFit() it returns
coefficient not estimable
referring to the last coefficient in the design matrix. Some rows had NA values, so I deleted them before computing the design matrix. This is the code:
delete = rownames(x$samples)[!complete.cases(x$samples)]
x$samples = x$samples[!rownames(x$samples) %in% delete,]
x$counts = x$counts[,!colnames(x$counts) %in% delete]
design <- model.matrix(~0+group+gender+y+age, data=x$samples)
colnames(design) <- gsub("group", "", colnames(design))
v <- voom(x, design)
vfit <- lmFit(v, design)
where x is a DGE object. y is a numeric variable whose values are almost all distinct. When I create the design matrix it contains a lot of columns, because the variable y has a lot of levels. So maybe I should create a new variable y2 that divides the values of y into a few categories using the ifelse() function.
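One thing that may be worth checking: a truly numeric covariate contributes a single column to a design matrix, so one column per distinct value suggests that y is stored as a factor or character. The sketch below uses made-up data (the data frame samples and its values are assumptions, not the asker's data) and base R only, comparing the rank of the design matrix with its number of columns in both cases:

```r
# Hypothetical sample sheet; column names mirror the question, values are made up.
samples <- data.frame(
  group = factor(rep(c("A", "B"), each = 4)),
  y     = factor(c("1.2", "3.4", "2.2", "5.1", "4.8", "0.7", "2.9", "3.3"))
)

# y treated as a factor: one dummy column per level (minus one)
design_factor <- model.matrix(~ 0 + group + y, data = samples)
ncol(design_factor)
qr(design_factor)$rank   # lower than ncol: some coefficients are not estimable

# y converted back to a number: a single slope column
samples$y_num <- as.numeric(as.character(samples$y))
design_numeric <- model.matrix(~ 0 + group + y_num, data = samples)
ncol(design_numeric)
qr(design_numeric)$rank  # equal to ncol: full column rank
```

Whether this applies depends on how y is actually stored in x$samples, which str(x$samples) would show.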

How to combine data from different columns, e.g. mean of surrounding columns for a given column

I am trying to smooth a matrix by assigning to each column the mean value of a window covering n columns around it. I've managed to do it, but since I am relying on for loops I'd like to see what 'the R way' of doing it would be. Is there a way to get this using apply or some function of the same family?
Example:
# create a toy matrix: 100 rows, each a random permutation of 1:200
# (starting from matrix(ncol=200) and rbind-ing would leave an all-NA first row)
mat <- t(replicate(100, sample(1:200, 200)))
# quick visualization
image(t(mat))
This is the matrix before smoothing:
I wrote the function smooth_row_mat that takes a matrix and the length of the smoothing kernel:
smooth_row_mat <- function(k, k.d=5){
  # columns for which the full window fits inside the matrix
  k.range <- (k.d + 2):(ncol(k) - k.d - 1)
  k.smooth <- NULL
  for (i in k.range) {
    if (i %% 10 == 0) cat('\r', round(i / max(k.range), 2))
    # mean over the window centred on column i; the upper bound is i + 1 + k.d
    # (the original i + 1 - k.d made the right-hand half run backwards)
    k.smooth <- cbind(k.smooth, rowMeans(k[, (i - 1 - k.d):(i + 1 + k.d)]))
  }
  return(k.smooth)
}
Now we use smooth_row_mat() with mat
mat.smooth <- smooth_row_mat(mat)
And we have successfully smoothed, on a row basis, the content of the matrix.
This is the matrix after:
This method is fine for such a small matrix. My real matrices are around 40,000 x 400; it still works on those, but I'd like to improve my R skills.
Thanks!
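You can apply a filter (running mean) across each row of your matrix as follows. Note that apply() returns one column per row of the input, so the result is transposed back to the original orientation, and that k.d is the full window width here:
t(apply(k, 1, stats::filter, rep(1/k.d, k.d)))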
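As a small self-contained illustration of the running-mean approach, here is a sketch on a toy matrix. The matrix size and the window width k.d = 5 are arbitrary choices for the example, and k.d is assumed to be odd so that the window is centred:

```r
set.seed(1)
# toy matrix: 100 rows, 200 columns
mat <- matrix(sample(1:200, 100 * 200, replace = TRUE), nrow = 100)

k.d <- 5  # full window width (assumed odd)
# stats::filter computes a centred moving average along each row;
# apply() stacks the results as columns, hence the transpose
smoothed <- t(apply(mat, 1, stats::filter, rep(1 / k.d, k.d)))

dim(smoothed)        # same shape as mat
smoothed[1, 1:4]     # the first (k.d - 1) / 2 columns are NA (incomplete window)
```

Column j of smoothed holds the mean of columns j - 2 to j + 2 of mat; columns too close to either edge are NA rather than dropped.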
Here's how I'd do it, with the raster package.
First, create a matrix filled with random data and coerce it to a raster object.
library(raster)
r <- raster(matrix(sample(200, 200*200, replace=TRUE), ncol=200))
plot(r)
Then use the focal function to calculate a neighbourhood mean over a window of n cells centred on the focal cell, that is, (n-1)/2 cells on either side of it. The values in the matrix of weights you provide to the focal function determine how much the value of each cell contributes to the focal summary. For a mean, we want each cell to contribute 1/n, so we fill a one-row matrix of n columns with the value 1/n. Note that n must be an odd number, and the cell in the centre of the matrix is considered the focal cell.
n <- 3
smooth_r <- focal(r, matrix(1/n, ncol=n))
plot(smooth_r)

nls() in R using entire matrix

I have data which I want to fit to the following equation using R:
Z(u,w)=z0*F(w)*[1-exp((-b*u)/F(w))]
where z0 and b are constants and F(w), w=0,...,9 is a decreasing step function that depends on w with F(0)=1 and u=1,...,50.
Z(u,w) is an observed set of data in the form of a 50x10 matrix (u=50,...,1 down the rows and w=0,...,9 along the columns). For example, since I haven't explained that very well, Z(42,3) is the element in the 9th row down and the 4th column along.
Using F(0)=1 I was able to get estimates of b and z0 using just the first column (ie w=0) with the code:
options(digits=10) # set print precision separately; as a positional argument it was taken as nls's data argument
n0 <- nls(zuw~z0*(1-exp(-b*u)),start=list(z0=283,b=0.03))
I then found F(w) for w=1,...,9 by going through each column and using the values of b and z0 I found.
However, I was wanting to find a way to estimate all the 12 parameters at once (b, z0 and the 10 values of F(w)) as b and z0 should be fitted to all the data, not just the first column.
Does anyone know of any way of doing this? All help would be greatly appreciated!
Thanks
James
This may be a case where the formula interface of the nls(...) function works against you. As an alternative, you can use nls.lm(...) in the minpack.lm package to perform non-linear regression with a programmatically defined function. To demonstrate this, first we create an artificial dataset which follows your functional form by design, with random error added (error ~ N[0,1]).
u <- 1:50
w <- 0:9
z0 <- 100
b <- 0.02
F <- 10/(10+w^2)
# matrix containing data, in OP's format: rows are u, cols are w
m <- do.call(cbind,lapply(w,function(w)
z0*F[w+1]*(1-exp(-b*u/F[w+1]))+rnorm(length(u),0,1)))
So now we have a matrix m, which is equivalent to your dataset. This matrix is in the so-called "wide" format: the responses for different values of w are in different columns. We need it in "long" format: all responses in a single column, with separate columns identifying u and w. We do this using melt(...) in the reshape2 package.
# prepend values of u
df.wide <- data.frame(u=u, m)
library(reshape2)
# reshape to long format: col1 = u, col2=w, col3=z
df <- melt(df.wide,id="u",variable.name="w", value.name="z")
df$w <- as.numeric(substr(df$w,2,4))-1
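For readers without reshape2, the same wide-to-long step can be sketched in base R. This relies on as.vector() reading a matrix column by column; the placeholder matrix m below is a deterministic stand-in for the data (an assumption for illustration), laid out with rows indexed by u = 1,...,50 and columns by w = 0,...,9:

```r
u <- 1:50
w <- 0:9
# placeholder data: entry for (u, w) is u + 100 * w, so it is easy to check by eye
m <- outer(u, w, function(u, w) u + 100 * w)

# as.vector(m) stacks the columns of m, so u cycles fastest and w changes once per column
df_long <- data.frame(
  u = rep(u, times = length(w)),
  w = rep(w, each = length(u)),
  z = as.vector(m)
)

head(df_long)
```

Each row of df_long then pairs one observation z with its u and w, which is the layout the fitting code expects.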
Now we have a data frame df with columns u, w, and z. The nls.lm(...) function needs par, a vector of initial estimates of the fit parameters, and fn, a function that calculates the residuals at each step. Any further named arguments are passed on to fn; here they are observed, the dependent variable (z), and xx, a vector or matrix containing the independent variables (u, w).
Next we define a function, f(pars, xx), where pars is an 11-element vector. The first two elements contain estimates of z0 and b. The next 9 contain estimates of F(w), w=1:9. This is because you state that F(0) is known to be 1. xx is a matrix with two columns: the values for u and w respectively. f(pars, xx) then calculates the estimated response z for all values of u and w, given the parameter estimates.
library(minpack.lm)
# model function
f <- function(pars, xx) {
z0 <- pars[1]
b <- pars[2]
F <- c(1,pars[3:11])
u <- xx[,1]
w <- xx[,2]
z <- z0*F[w+1]*(1-exp(-b*u/F[w+1]))
return(z)
}
# residual function
resids <- function(p, observed, xx) {observed - f(p,xx)}
Next we perform the regression using nls.lm(...), which uses a highly robust fitting algorithm (Levenberg-Marquardt). Consequently, we can set the par argument (containing the initial estimates of z0, b, and F) to all 1's, which is fairly distant from the values used in creating the dataset (the "actual" values). nls.lm(...) returns a list with several components (see the documentation). The par component contains the final estimates of the fit parameters.
# initial parameter estimates; all 1's
par.start <- c(z0=1, b=1, rep(1,9))
# fit using Levenberg-Marquardt algorithm
nls.out <- nls.lm(par=par.start,
fn = resids, observed = df$z, xx = df[,c("u","w")],
control=nls.lm.control(maxiter=10000, ftol=1e-6, maxfev=1e6))
par.final <- nls.out$par
results <- rbind(predicted=c(par.final[1:2],1,par.final[3:11]),actual=c(z0,b,F))
print(results,digits=5)
#               z0        b
# predicted 102.71 0.019337 1 0.90456 0.70788 0.51893 0.37804 0.27789 0.21204 0.16199 0.13131 0.10657
# actual    100.00 0.020000 1 0.90909 0.71429 0.52632 0.38462 0.28571 0.21739 0.16949 0.13514 0.10989
So the regression has done an excellent job of recovering the "actual" parameter values. Finally, we plot the results using ggplot just to make sure this is all correct. I can't overemphasize how important it is to plot the final results.
df$pred <- f(par.final,df[,c("u","w")])
library(ggplot2)
ggplot(df,aes(x=u, color=factor(w)))+
geom_point(aes(y=z))+ geom_line(aes(y=pred))

Random sample from given bivariate discrete distribution

Suppose I have a bivariate discrete distribution, i.e. a table of probability values P(X=i,Y=j), for i=1,...n and j=1,...m. How do I generate a random sample (X_k,Y_k), k=1,...N from such distribution? Maybe there is a ready R function like:
sample(100,prob=biprob)
where biprob is 2 dimensional matrix?
One intuitive way to sample is the following. Suppose we have a data.frame
dt=data.frame(X=x,Y=y,P=pij)
Where x and y come from
expand.grid(x=1:n,y=1:m)
and pij are the P(X=i,Y=j).
Then we get our sample (Xs,Ys) of size N, the following way:
set.seed(1000)
Xs <- sample(dt$X,size=N,prob=dt$P)
set.seed(1000)
Ys <- sample(dt$Y,size=N,prob=dt$P)
I use set.seed() to simulate the "bivariateness". Intuitively I should get something similar to what I need. I am not sure that this is the correct way, though. Hence the question :)
Another way is to use Gibbs sampling, marginal distributions are easy to compute.
I tried googling, but nothing really relevant came up.
You are almost there. Assuming you have the data frame dt with the x, y, and pij values, just sample the rows!
dt <- expand.grid(X=1:3, Y=1:2)
dt$p <- runif(6)
dt$p <- dt$p / sum(dt$p) # get fake probabilities
idx <- sample(1:nrow(dt), size=8, replace=TRUE, prob=dt$p)
sampled.x <- dt$X[idx]
sampled.y <- dt$Y[idx]
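If the probabilities are already stored as a matrix, as in the question's biprob, the same idea can be sketched without building a data frame first: sample linear cell indices and convert them back to (row, column) pairs with arrayInd(). The helper name sample_bivariate and the example probabilities are illustrative assumptions:

```r
# Sketch: draw N pairs (i, j) from a matrix of joint probabilities P(X=i, Y=j).
# sample_bivariate is an illustrative helper name, not an existing R function.
sample_bivariate <- function(biprob, N) {
  # sample linear cell indices with probability proportional to each entry
  idx <- sample(length(biprob), size = N, replace = TRUE, prob = as.vector(biprob))
  # convert linear indices back to (row, column) positions
  ij <- arrayInd(idx, dim(biprob))
  data.frame(X = ij[, 1], Y = ij[, 2])
}

set.seed(1000)
# made-up 3 x 2 joint distribution; the entries sum to 1
biprob <- matrix(c(0.10, 0.20, 0.30, 0.15, 0.05, 0.20), nrow = 3, ncol = 2)
draws <- sample_bivariate(biprob, N = 8)
draws
```

Each row of draws is one sampled pair, so X and Y stay linked, which is what the two separate sample() calls with a shared seed do not guarantee.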
It's not clear to me why you should care that it is bivariate. The probabilities sum to one and the outcomes are discrete, so you are just sampling from a categorical distribution. The only difference is that you are indexing the observations using rows and columns rather than a single position. This is just notation.
In R, you can therefore easily sample from your distribution by reshaping your data and sampling from a categorical distribution. Sampling from a categorical can be done using rmultinom and using which to select the index, or, as Aniko suggests, using sample to sample the rows of the reshaped data. Some bookkeeping can take care of your exact case.
Here's a solution:
library(reshape)
# Reshape data to long format.
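data <- matrix(data = c(.25,.5,.1,.4), nrow=2, ncol=2)
data <- data / sum(data) # the values above sum to 1.25, so normalise them to a proper distribution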
pmatrix <- melt(data)
# Sample categorical n times.
rcat <- function(n, pmatrix) {
rows <- which(rmultinom(n,1,pmatrix$value)==1, arr.ind=TRUE)[,'row']
indices <- pmatrix[rows, c('X1','X2')]
colnames(indices) <- c('i','j')
rownames(indices) <- seq(1,nrow(indices))
return(indices)
}
rcat(3,pmatrix)
This returns 3 random draws from your matrix, reporting the i and j of the rows and columns:
i j
1 1 1
2 2 2
3 2 2
