Regression with multivariate vectors and coefficient extraction in R

I want to create 1000 samples of 200 bivariate normally distributed vectors
set.seed(42) # for sake of reproducibility
mu <- c(1, 1)
S <- matrix(c(0.56, 0.4,
              0.4,  1.0), nrow=2, ncol=2, byrow=TRUE)
bivn <- mvrnorm(200, mu=mu, Sigma=S)
so that I can run OLS regressions on each sample and therefore get 1000 estimators. I tried this
library(MASS)
bivn_1000 <- replicate(1000, mvrnorm(200, mu=mu, Sigma=S), simplify=FALSE)
but I am stuck there, because I don't know how to proceed to run the regression on each sample.
I would appreciate help with running these 1,000 regressions and then extracting the coefficients.

We could write a custom regression function.
regFun1 <- function(x) summary(lm(x[, 1] ~ x[, 2]))
which we can apply over the data with lapply:
l1 <- lapply(bivn_1000, regFun1)
The coefficients are saved inside a list and can be extracted like so:
l1[[1]]$coefficients # for the first regression
#              Estimate Std. Error   t value     Pr(>|t|)
# (Intercept) 0.5554601 0.06082924  9.131466 7.969277e-17
# x[, 2]      0.4797568 0.04255711 11.273246 4.322184e-23
Edit:
If we only want the estimates without the accompanying statistics, we adjust the function's output accordingly.
regFun2 <- function(x) summary(lm(x[, 1] ~ x[, 2]))$coef[, 1]
Since we may want the output in matrix form we use sapply next.
m2 <- t(sapply(bivn_1000, regFun2))
head(m2)
#      (Intercept)    x[, 2]
# [1,]   0.6315558 0.4389721
# [2,]   0.5514555 0.4840933
# [3,]   0.6782464 0.3250800
# [4,]   0.6350999 0.3848747
# [5,]   0.5899311 0.3645237
# [6,]   0.6263678 0.3825725
where
dim(m2)
# [1] 1000 2
assures us that we have our 1,000 estimates.
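As a quick sanity check (my addition, not part of the original answer): with mu = c(1, 1) and the S above, the population regression of the first component on the second has slope 0.4/1 = 0.4 and intercept 1 - 0.4*1 = 0.6, so the 1,000 estimates should centre near these values.
colMeans(m2)      # should land near c(0.6, 0.4)
apply(m2, 2, sd)  # spread of the sampling distribution
hist(m2[, 2], main = "Sampling distribution of the slope", xlab = "Estimate")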

Related

How to simplify code in R (normality test): different sample sizes in 1 line or 2 lines of code?

I want to make my normality-test code a little cleaner and run a simulation (repeating the test 1000 times).
sample <- c(10,30,50,100,500)
shapiro.test(rnorm(sample))
Shapiro-Wilk normality test
data: rnorm(sample)
W = 0.90644, p-value = 0.4465
This only gives one output, as you can observe above. How do I get 5 outputs? Is there something I am missing here?
Using the replicate function gives me 1000 statistics per sample size, while I am only interested in the p-values, which I then relate to a significance level. For the individual normality tests I used the following code (thanks to user StupidWolf, in my previously posted questions on Stack Overflow):
replicate_sw10 = replicate(1000,shapiro.test(rnorm(10)))
table(replicate_sw10["p.value",]<0.10)/1000
# which gave the following output
# FALSE  TRUE
# 0.896 0.104
You may simply use $p.value. The code below yields a matrix with 1,000 rows for the repetitions and 5 columns for the sample sizes (I use smpl as the variable name to avoid masking base R's sample function). If you want a list as the result, just use lapply instead of sapply.
smpl <- c(10, 30, 50, 100, 500)
set.seed(42) ## for sake of reproducibility
res <- sapply(smpl, function(x) replicate(1e3, shapiro.test(rnorm(x))$p.value))
head(res)
#            [,1]      [,2]       [,3]      [,4]      [,5]
# [1,] 0.43524553 0.5624891 0.02116901 0.8972087 0.8010757
# [2,] 0.67500688 0.1417968 0.03722656 0.7614192 0.7559309
# [3,] 0.52777713 0.6728819 0.67880178 0.1455375 0.7734797
# [4,] 0.55618980 0.1736095 0.69879316 0.4950400 0.5181642
# [5,] 0.93774782 0.9077292 0.58930787 0.2687687 0.8435223
# [6,] 0.01444456 0.1214157 0.07042380 0.4479121 0.7982574
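To relate the p-values to a significance level, as asked, the empirical rejection rates are then just column means of a logical matrix; a short sketch (my addition):
## share of p-values below 0.10, per sample size
setNames(colMeans(res < 0.10), smpl)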
Using the purrr package:
library(purrr)
map(sample, function(x) shapiro.test(rnorm(x)))
which gives
[[1]]
Shapiro-Wilk normality test
data: rnorm(x)
W = 0.92567, p-value = 0.4067
[[2]]
Shapiro-Wilk normality test
data: rnorm(x)
W = 0.95621, p-value = 0.247
[[3]]
Shapiro-Wilk normality test
data: rnorm(x)
W = 0.96144, p-value = 0.1021
[[4]]
Shapiro-Wilk normality test
data: rnorm(x)
W = 0.98654, p-value = 0.4077
[[5]]
Shapiro-Wilk normality test
data: rnorm(x)
W = 0.99597, p-value = 0.2324
Edit: after your edit, you are asking for a table. This doesn't work the way it does in your replicate_sw10 example, because that is a matrix, while map (or lapply, for that matter) returns a list. So again you want to use map (or lapply) to apply the same transformation to every part of the list.
replicate_swall <- map(sample, function(x) replicate(1000, shapiro.test(rnorm(x))))
replicate_pvalue_extract <- map(replicate_swall, function(x) x["p.value", ]) %>% unlist(., recursive = FALSE)
table(replicate_pvalue_extract < 0.10) / length(replicate_pvalue_extract)
This will give you:
FALSE TRUE
0.896 0.104
Another option is using the magrittr package for the extraction. Your code would then look like:
## the trailing TRUE selects all columns, i.e. x["p.value", TRUE]
replicate_pvalue_extract <- map(replicate_swall, magrittr::extract, "p.value", TRUE) %>% unlist(., recursive = FALSE)
table(replicate_pvalue_extract < 0.10) / length(replicate_pvalue_extract)
In the code above I assumed that you want to pool all replicates into one table, regardless of the input (by input I mean 10, 30, 50, 100, or 500). If you do care about the input, you can keep the sample sizes separate; I give the code for that below. Also note that I used length rather than your hardcoded /1000. That way the code is far more generic: if you change the number of replicates, the number you divide the table by changes automatically. Otherwise you would have to make the change in multiple places (especially if someone else uses your code), which could easily lead to mistakes.
replicate_pvalue_extract <- map(replicate_swall, function(x) x["p.value", ])
map(replicate_pvalue_extract, function(x) table(x < 0.10) / length(x))
Or you can combine them:
map(map(replicate_swall, function(x) x["p.value",]), function(x) table(x < 0.10) / length(x))
This is why I gave you the magrittr option, as I do not like writing function(x) twice. With magrittr it would look like:
map(map(replicate_swall, magrittr::extract, "p.value", TRUE), function(x) table(x < 0.10) / length(x))
which would result in:
[[1]]
FALSE TRUE
0.896 0.104
[[2]]
FALSE TRUE
0.889 0.111
[[3]]
FALSE TRUE
0.904 0.096
[[4]]
FALSE TRUE
0.9 0.1
[[5]]
FALSE TRUE
0.891 0.109
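A more compact variant of the same idea (a sketch on my part, assuming purrr is loaded as above; set_names is re-exported by purrr) keeps the sample sizes as list names so the output is self-describing:
pvals <- map(set_names(sample), ~ replicate(1000, shapiro.test(rnorm(.x))$p.value))
map(pvals, ~ table(.x < 0.10) / length(.x))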

How can I get the contribution by each predictor to the final regression prediction in lm

Using R, when I use rlm or lm I would like to get the contribution of each predictor to the model's final prediction.
The problem occurs when I have interaction terms, as I think they are not stored in the lm object.
Below is sample data (I am looking for a way that generalizes to any number of predictors).
Sample data:
set.seed(1)
y <- rnorm(10)
m <- data.frame(v1=rnorm(10), v2=rnorm(10), v3=rnorm(10))
lmObj <- lm(formula=y~0+v1*v3+v2*v3, data=m)
betaHat <- coefficients(lmObj)
betaHat
      v1       v3       v2    v1:v3    v3:v2
 0.03455 -0.50224 -0.57745  0.58905 -0.65592
# How do I get the data.frame or matrix with columns (v1, v3, v2, v1:v3, v3:v2)
# worth [m$v1*v1, ..., (m$v3*m$v2)*v3:v2]?
At first I thought that by "contribution" you wanted the explained variance of each term (which an ANOVA table helps with), but actually you want the term-wise prediction:
predict(lmObj, type = "terms")
See ?predict.lm.
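As a quick consistency check (my addition, not from the original answer), the term-wise columns plus the matrix's "constant" attribute should add back up to the fitted values:
tm <- predict(lmObj, type = "terms")
all.equal(unname(rowSums(tm) + attr(tm, "constant")), unname(fitted(lmObj)))  # should be TRUE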
Actually, I got it from lm itself; the trick is to ask for x=TRUE:
lmObj <- lm(formula=y~0+v1*v3+v2*v3, data=m, x=TRUE)
lmObj$x %*% diag(lmObj$coefficients)
         [,1]     [,2]     [,3]     [,4]     [,5]
1   0.0522305 -0.68238 -0.53066  1.20993 -0.81898
2   0.0134687  0.05162 -0.45164 -0.02360  0.05273
3  -0.0214632 -0.19470 -0.04306 -0.14187 -0.01896
4  -0.0765156  0.02702  1.14875  0.07019 -0.07021
5   0.0388652  0.69161 -0.35792 -0.91250  0.55985
6  -0.0015524  0.20843  0.03241  0.01098 -0.01528
7  -0.0005594  0.19803  0.08996  0.00376 -0.04029
8   0.0326086  0.02979  0.84928 -0.03298 -0.05722
9   0.0283723 -0.55248  0.27611  0.53213  0.34500
10  0.0205187 -0.38330 -0.24134  0.26699 -0.20921
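Two small refinements (my additions): diag() drops the coefficient names, so it is worth restoring them, and the rows of the contribution matrix should sum to the fitted values:
contrib <- lmObj$x %*% diag(coef(lmObj))
colnames(contrib) <- names(coef(lmObj))  # diag() drops the names
all.equal(unname(rowSums(contrib)), unname(fitted(lmObj)))  # should be TRUE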

Generate multivariate normal r.v.'s with rank-deficient covariance via Pivoted Cholesky Factorization

I'm just beating my head against the wall trying to get a Cholesky decomposition to work in order to simulate correlated price movements.
I use the following code:
cormat <- as.matrix(read.csv("http://pastebin.com/raw/qGbkfiyA"))
cormat <- cormat[,2:ncol(cormat)]
rownames(cormat) <- colnames(cormat)
cormat <- apply(cormat,c(1,2),FUN = function(x) as.numeric(x))
chol(cormat)
#Error in chol.default(cormat) :
# the leading minor of order 8 is not positive definite
cholmat <- chol(cormat, pivot=TRUE)
#Warning message:
# In chol.default(cormat, pivot = TRUE) :
# the matrix is either rank-deficient or indefinite
rands <- array(rnorm(ncol(cholmat)), dim = c(10000,ncol(cholmat)))
V <- t(t(cholmat) %*% t(rands))
#Check for similarity
cor(V) - cormat ## Not all zeros!
#Check the standard deviations
apply(V,2,sd) ## Not all ones!
I'm not really sure how to properly use the pivot = TRUE option to generate my correlated movements; the results look totally bogus.
Even when I take a simple matrix and try out pivoting, I get bogus results...
cormat <- matrix(c(1,.95,.90,.95,1,.93,.90,.93,1), ncol=3)
cholmat <- chol(cormat)
# No Error
cholmat2 <- chol(cormat, pivot=TRUE)
# No warning... pivot changes column order
rands <- array(rnorm(ncol(cholmat)), dim = c(10000,ncol(cholmat)))
V <- t(t(cholmat2) %*% t(rands))
#Check for similarity
cor(V) - cormat ## Not all zeros!
#Check the standard deviations
apply(V,2,sd) ## Not all ones!
There are two errors in your code.
First, you did not use the pivoting index to revert the pivoting applied to the Cholesky factor. Note that pivoted Cholesky factorization of a positive semi-definite matrix A does:
P'AP = R'R
where P is a column pivoting matrix and R is an upper triangular matrix. To recover A from R, we need to apply the inverse of P (i.e., P'):
A = PR'RP' = (RP')'(RP')
A multivariate normal sample with covariance matrix A is then generated by:
XRP'
where X is a multivariate normal sample with zero mean and identity covariance.
Second, your generation of X
X <- array(rnorm(ncol(R)), dim = c(10000,ncol(R)))
is wrong. The number of columns should be the numerical rank of A, denoted r, not ncol(R) (only the first r rows of R carry information). Moreover, you are recycling rnorm(ncol(R)) along the columns, so the resulting matrix is not random at all; as a consequence, cor(X) is never close to an identity matrix. The correct code is:
X <- matrix(rnorm(10000 * r), 10000, r)
As a model implementation of the above theory, consider your toy example:
A <- matrix(c(1,.95,.90,.95,1,.93,.90,.93,1), ncol=3)
We compute the upper triangular factor (suppressing the possible rank-deficiency warning) and extract the reverse pivoting index and the numerical rank:
R <- suppressWarnings(chol(A, pivot = TRUE))
piv <- order(attr(R, "pivot")) ## reverse pivoting index
r <- attr(R, "rank") ## numerical rank
Then we generate X. For a better result we centre X so that its column means are 0.
X <- matrix(rnorm(10000 * r), 10000, r)
## for best effect, we centre `X`
X <- sweep(X, 2L, colMeans(X), "-")
Then we generate target multivariate normal:
## compute `V = RP'`
V <- R[1:r, piv]
## compute `Y = X %*% V`
Y <- X %*% V
We can verify that Y has the target covariance A:
cor(Y)
#           [,1]      [,2]      [,3]
# [1,] 1.0000000 0.9509181 0.9009645
# [2,] 0.9509181 1.0000000 0.9299037
# [3,] 0.9009645 0.9299037 1.0000000
A
#      [,1] [,2] [,3]
# [1,] 1.00 0.95 0.90
# [2,] 0.95 1.00 0.93
# [3,] 0.90 0.93 1.00
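The steps generalize to any positive semi-definite covariance matrix, so they can be packaged into a small helper; a minimal sketch (the function name rmvn_pivchol is my own):
rmvn_pivchol <- function(n, A) {
  R <- suppressWarnings(chol(A, pivot = TRUE))
  piv <- order(attr(R, "pivot"))       ## reverse pivoting index
  r <- attr(R, "rank")                 ## numerical rank
  X <- matrix(rnorm(n * r), n, r)
  X <- sweep(X, 2L, colMeans(X), "-")  ## centre the columns
  X %*% R[1:r, piv]                    ## Y = X R P'
}
## e.g. Y <- rmvn_pivchol(10000, A); cor(Y) should then be close to A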

Pairwise Correlation Table

I'm new to R, so I apologize if this is a straightforward question, however I've done quite a bit of searching this evening and can't seem to figure it out. I've got a data frame with a whole slew of variables, and what I'd like to do is create a table of the correlations among a subset of these, basically the equivalent of "pwcorr" in Stata, or "correlations" in SPSS. The one key to this is that not only do I want the r, but I also want the significance associated with that value.
Any ideas? This seems like it should be very simple, but I can't seem to figure out a good way.
Bill Venables offers this solution in an answer on the R mailing list, to which I've made some slight modifications:
cor.prob <- function(X, dfr = nrow(X) - 2) {
  R <- cor(X)
  above <- row(R) < col(R)
  r2 <- R[above]^2
  Fstat <- r2 * dfr / (1 - r2)
  R[above] <- 1 - pf(Fstat, 1, dfr)
  cor.mat <- t(R)
  cor.mat[upper.tri(cor.mat)] <- NA
  cor.mat
}
So let's test it out:
set.seed(123)
data <- matrix(rnorm(100), 20, 5)
cor.prob(data)
          [,1]      [,2]      [,3]      [,4] [,5]
[1,] 1.0000000        NA        NA        NA   NA
[2,] 0.7005361 1.0000000        NA        NA   NA
[3,] 0.5990483 0.6816955 1.0000000        NA   NA
[4,] 0.6098357 0.3287116 0.5325167 1.0000000   NA
[5,] 0.3364028 0.1121927 0.1329906 0.5962835    1
Note that with these modifications the lower triangle holds the p-values (the correlations themselves still come from cor(X)). Does that line up with cor.test?
cor.test(data[,2], data[,3])
Pearson's product-moment correlation
data: data[, 2] and data[, 3]
t = 0.4169, df = 18, p-value = 0.6817
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.3603246 0.5178982
sample estimates:
cor
0.09778865
Seems to work ok.
Here is something that I just made. I stumbled on this post because I was looking for a way to take every pair of variables and get a tidy n-by-4 data frame: columns 1 and 2 hold the variable names, and columns 3 and 4 hold the absolute and the signed correlation. Just pass the function a data frame of numeric and integer values.
pairwiseCor <- function(dataframe){
  pairs <- combn(names(dataframe), 2, simplify=FALSE)
  df <- data.frame(Variable1=rep(0,length(pairs)), Variable2=rep(0,length(pairs)),
                   AbsCor=rep(0,length(pairs)), Cor=rep(0,length(pairs)))
  for(i in 1:length(pairs)){
    df[i,1] <- pairs[[i]][1]
    df[i,2] <- pairs[[i]][2]
    df[i,3] <- round(abs(cor(dataframe[,pairs[[i]][1]], dataframe[,pairs[[i]][2]])),4)
    df[i,4] <- round(cor(dataframe[,pairs[[i]][1]], dataframe[,pairs[[i]][2]]),4)
  }
  pairwiseCorDF <- df[order(df$AbsCor, decreasing=TRUE),]
  row.names(pairwiseCorDF) <- 1:length(pairs)
  pairwiseCorDF <<- pairwiseCorDF  ## side effect: also saves the result to the global environment
  pairwiseCorDF
}
This is what the output is:
> head(pairwiseCorDF)
         Variable1        Variable2 AbsCor     Cor
1        roll_belt     accel_belt_z 0.9920 -0.9920
2 gyros_dumbbell_x gyros_dumbbell_z 0.9839 -0.9839
3        roll_belt total_accel_belt 0.9811  0.9811
4 total_accel_belt     accel_belt_z 0.9752 -0.9752
5       pitch_belt     accel_belt_x 0.9658 -0.9658
6 gyros_dumbbell_z  gyros_forearm_z 0.9491  0.9491
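For a self-contained test (my addition), the function can also be pointed at the simulated data from the cor.prob example above, converted to a data frame:
head(pairwiseCor(as.data.frame(data)))  ## columns are named V1..V5 by as.data.frame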
I've found that the R package picante does a nice job dealing with the problem that you have. You can easily pass your dataset to the cor.table function and get a table of correlations and p-values for all of your variables. You can specify Pearson's r or Spearman in the function. See this link for help:
http://www.inside-r.org/packages/cran/picante/docs/cor.table
Also remember to remove any non-numeric columns from your dataset prior to running the function. Here's an example piece of code:
install.packages("picante")
library(picante)
#Insert the name of your dataset in the code below
cor.table(dataset, cor.method="pearson")
You can use the sjt.corr function from the sjPlot package, which gives you a nicely formatted correlation table, ready for use in your Office application.
Simplest function call is just to pass the data frame:
sjt.corr(df)
See examples here.

How to create a linear regression with R?

I have a simple matrix like:
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8    9
[4,]   10   11   12
I have to calculate a linear regression of these columns, like lm(x ~ y), where the first column is the X and the others are the Y. I mean... can I do something to use the others with one variable (y), or do I have to use something like lm(x ~ y + z + c + b), etc.?
Thank you
Yes, but I wouldn't really recommend it:
> set.seed(2)
> mat <- matrix(runif(12), ncol = 3, byrow = TRUE)
> mat
          [,1]      [,2]      [,3]
[1,] 0.1848823 0.7023740 0.5733263
[2,] 0.1680519 0.9438393 0.9434750
[3,] 0.1291590 0.8334488 0.4680185
[4,] 0.5499837 0.5526741 0.2388948
> mod <- lm(mat[,1] ~ mat[,-1])
> mod
Call:
lm(formula = mat[, 1] ~ mat[, -1])
Coefficients:
(Intercept)   mat[, -1]1   mat[, -1]2
     1.0578      -1.1413       0.1177
Why is this not recommended? Well, you are abusing the formula interface here; it works but the model coefficients have odd names and you are incurring a lot of overhead of working with the formula interface, which is designed for extracting response/covariates from a data frame or list object referenced in the symbolic formula.
The usual way of working is:
df <- data.frame(mat)
names(df) <- c("Y","A","B")
## specify all terms:
lm(Y ~ A + B, data = df)
## or use the `.` shortcut
lm(Y ~ ., data = df)
If you don't want to go via the data frame, then you can call the workhorse function behind lm(), lm.fit(), directly with a simple manipulation:
lm.fit(cbind(rep(1, nrow(mat)), mat[,-1]), mat[, 1])
Here we bind a vector of 1s onto columns 2 and 3 of mat (cbind(rep(1, nrow(mat)), mat[,-1])); this is the model matrix, and mat[, 1] is the response. Whilst lm.fit doesn't return an "lm"-classed object, it is very quick and can relatively easily be converted to one if that matters.
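If you want named coefficients out of lm.fit, naming the model-matrix columns does the trick; a small sketch (my addition, reusing the same mat):
fit <- lm.fit(cbind(Intercept = 1, B = mat[, 2], C = mat[, 3]), mat[, 1])
fit$coefficients  ## named after the model-matrix columns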
By the way, you have the usual notation back to front. Y is usually the response, with X indicating the covariates used to model or predict Y.
