Problem in creating a model.matrix of quantitative predictors in R

I need to run a lasso regression with the glmnet package, but I am having trouble generating my x model.matrix.
My data.frame has 108 observations, a response variable Y, and 24 predictors. Here is an overview:
CONVENTIONAL_HUmin CONVENTIONAL_HUmean CONVENTIONAL_HUstd CONVENTIONAL_HUmax
1 37.9400539686119 63.4903779286635 11.7592095845857 85.2375439991287
2 23.8400539686119 80.5903779286635 15.0592095845857 125.837543999129
3 19.3035945249441 73.2764716205565 12.8816244173147 130.24141901586
CONVENTIONAL_HUQ1 CONVENTIONAL_HUQ2 CONVENTIONAL_HUQ3 HISTO_Skewness HISTO_Kurtosis
1 54.9938390994964 65.4873070322704 72.8863025473031 -0.203420585259268 2.25208159159488
2 70.8938390994964 80.3873070322704 91.4863025473031 -0.117420585259268 2.91208159159488
3 64.4689755423307 73.8666609177099 81.7351818199415 -0.0908104900456161 2.8751327713366
HISTO_ExcessKurtosis HISTO_Entropy_log10 HISTO_Entropy_log2 HISTO_Energy...Uniformity.
1 -0.751917020142877 0.701345471328916 2.32782599847774 0.219781577333287
2 -0.0887170201428774 0.793345471328916 2.63782599847774 0.184781577333287
3 -0.127231561113029 0.738530858918985 2.45445652190669 0.206887426065656
GLZLM_SZE GLZLM_LZE GLZLM_LGZE GLZLM_HGZE GLZLM_SZLGE
1 0.366581916604228 35.7249100350856 8.7285612359045e-05 11497.6407737833 3.22615226279017e-05
2 0.693581916604228 984.424910035086 8.5685612359045e-05 11697.6407737833 5.98615226279017e-05
3 0.622711792823853 1103.10288991619 8.5573088970709e-05 11571.7421733917 5.33303855950858e-05
GLZLM_SZHGE GLZLM_LZLGE GLZLM_LZHGE GLZLM_GLNU GLZLM_ZLNU
1 4164.91570215061 0.00314512237564268 405585.990838764 2.66964898745512 2.47759091065361
2 8064.91570215061 0.0835651223756427 11581585.9908388 12.9796489874551 38.5375909106536
3 7295.45317481887 0.0949686480587339 12926109.9421091 15.0930512668698 37.6083347285291
GLZLM_ZP Y
1 0.219643444043173 1
2 0.112643444043173 0
3 0.104031438564764 0
My code for the model.matrix:
x = model.matrix(Y ~ ., data = data.det)
It generates a very large model.matrix with 244,728 elements! It seems to have duplicated each of the 24 predictors about a hundred times!
Here's an overview of the model.matrix:
(Intercept) CONVENTIONAL_HUmin-10.5599460313881
CONVENTIONAL_HUmin-117.359946031388 CONVENTIONAL_HUmin-13.0599460313881
CONVENTIONAL_HUmin-154.359946031388 CONVENTIONAL_HUmin-17.6599460313881
CONVENTIONAL_HUmin-18.3599460313881 CONVENTIONAL_HUmin-2.87994603138811
CONVENTIONAL_HUmin-21.281710504529 CONVENTIONAL_HUmin-28.3599460313881
CONVENTIONAL_HUmin-3.44994603138811 CONVENTIONAL_HUmin-3.89640547505594
CONVENTIONAL_HUmin-67.0599460313881 CONVENTIONAL_HUmin-682.359946031388
CONVENTIONAL_HUmin-9.08171050452898 CONVENTIONAL_HUmin1.04428949547101
CONVENTIONAL_HUmin1.63928949547101 CONVENTIONAL_HUmin10.8400539686119
CONVENTIONAL_HUmin10.968289495471 CONVENTIONAL_HUmin11.5400539686119
CONVENTIONAL_HUmin11.618289495471 CONVENTIONAL_HUmin11.6400539686119
CONVENTIONAL_HUmin12.518289495471 CONVENTIONAL_HUmin12.5400539686119
CONVENTIONAL_HUmin13.4400539686119 CONVENTIONAL_HUmin13.6400539686119
CONVENTIONAL_HUmin13.7400539686119 CONVENTIONAL_HUmin13.818289495471
CONVENTIONAL_HUmin14.5400539686119 CONVENTIONAL_HUmin14.6693017607572
CONVENTIONAL_HUmin14.8400539686119 CONVENTIONAL_HUmin16.9400539686119
CONVENTIONAL_HUmin17.0400539686119 CONVENTIONAL_HUmin17.618289495471
CONVENTIONAL_HUmin18.2400539686119 CONVENTIONAL_HUmin18.8400539686119
CONVENTIONAL_HUmin19.3035945249441 CONVENTIONAL_HUmin20.0400539686119
CONVENTIONAL_HUmin20.818289495471 CONVENTIONAL_HUmin21.0400539686119
CONVENTIONAL_HUmin21.118289495471 CONVENTIONAL_HUmin21.3400539686119
CONVENTIONAL_HUmin21.5400539686119 CONVENTIONAL_HUmin21.9400539686119
...
attr(,"contrasts")$CONVENTIONAL_HUmin
[1] "contr.treatment"
This is not workable at all: I end up with far more predictors in the input x for the lasso regression, which makes the variable selection even more unstable.
Do you have any idea what is causing this behaviour? Any suggestion to fix it?

Try this: you want a plain matrix, not a model matrix...
# make a matrix of your predictors minus your outcome (Y is column 25)
x <- as.matrix(data.det[-25])
# put the Y column in a vector
y <- data.det$Y
# run it
fit.lasso <- glmnet(x, y, family = "binomial", alpha = 1)
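That said, the column names in your model.matrix output (one column per distinct value, with "contr.treatment" contrasts) suggest the predictors were read in as factors rather than numerics. A minimal sketch, assuming the columns are factors that merely store numbers as text and that Y sits in column 25:
# Check how the columns were read in
str(data.det)
# If the predictors are factors/characters holding numbers as text,
# convert them back to numeric (all columns except Y)
data.det[-25] <- lapply(data.det[-25], function(col) as.numeric(as.character(col)))
# model.matrix now yields one column per predictor; [, -1] drops the intercept
x <- model.matrix(Y ~ ., data = data.det)[, -1]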

Related

"variable lengths differ" error while running regressions in a loop

I am trying to run a regression loop based on code that I found in a previous answer (How to Loop/Repeat a Linear Regression in R), but I keep getting an error. My outcomes (dependent variables) are 940 variables (metabolites) and my exposures (independent variables) are "bmi", "Age", "sex", "lpa2c", and "smoking", where BMI and Age are continuous. BMI is the main exposure, and I am controlling for the others.
So I'm testing the effect of BMI on 940 metabolites.
Also, I would like to know how I can extract coefficient, p-value, standard error, and confidence interval for BMI only and when it is significant.
This is the code I have used:
y <- c(1653:2592) # response
x1 <- c("bmi", "Age", "sex", "lpa2c", "smoking") # predictor
for (i in x1){
  model <- lm(paste("y ~", i[[1]]), data = QBB_clean)
  print(summary(model))
}
And this is the error:
Error in model.frame.default(formula = paste("y ~", i[[1]]), data = QBB_clean, :
variable lengths differ (found for 'bmi').
y1 y2 y3 y4 bmi age sex lpa2c smoking
1 0.2875775201 0.59998896 0.238726027 0.784575267 24 18 1 0.470681834 1
2 0.7883051354 0.33282354 0.962358936 0.009429905 12 20 0 0.365845473 1
3 0.4089769218 0.48861303 0.601365726 0.779065883 18 15 0 0.121272054 0
4 0.8830174040 0.95447383 0.515029727 0.729390652 16 21 0 0.046993681 0
5 0.9404672843 0.48290240 0.402573342 0.630131853 18 28 1 0.262796304 1
6 0.0455564994 0.89035022 0.880246541 0.480910830 13 13 0 0.968641168 1
7 0.5281054880 0.91443819 0.364091865 0.156636851 11 12 0 0.488495482 1
8 0.8924190444 0.60873498 0.288239281 0.008215520 21 23 0 0.477822030 0
9 0.5514350145 0.41068978 0.170645235 0.452458394 18 17 1 0.748792881 0
10 0.4566147353 0.14709469 0.172171746 0.492293329 20 15 1 0.667640231 1
If you want to loop over responses you will want something like this:
respvars <- names(QBB_clean[1653:2592])
predvars <- c("bmi","Age", "sex","lpa2c", "smoking")
results <- list()
for (v in respvars) {
  form <- reformulate(predvars, response = v)
  results[[v]] <- lm(form, data = QBB_clean)
}
You can then print the results with something like lapply(results, summary), extract coefficients, etc. (I have a little trouble seeing how it's going to be useful to just print the results of 940 regressions ... are you really going to inspect them all?)
If you want coefficients etc. for BMI, I think this should work (not tested):
t(sapply(results, function(m) coef(summary(m))["bmi",]))
Or for confidence intervals:
t(sapply(results, function(m) confint(m)["bmi",]))
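Building on those two pieces, here is a minimal sketch (untested against the real data) that collects the BMI estimate, standard error, p-value, and confidence interval for every response and keeps only the significant ones:
# Combine the BMI row of each coefficient table with its confidence interval
bmi_stats <- t(sapply(results, function(m) {
  c(coef(summary(m))["bmi", ],  # Estimate, Std. Error, t value, Pr(>|t|)
    confint(m)["bmi", ])        # 2.5 % and 97.5 % limits
}))
# Keep only the responses where BMI is significant at the 5% level
bmi_signif <- bmi_stats[bmi_stats[, "Pr(>|t|)"] < 0.05, , drop = FALSE]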

Multinomial regression: how to show all coefficients without L, Q and C?

I have this data frame to which I applied the multinom function:
df = data.frame(x = c('a','a','b','b','c','c','d','d','d','e','e','f','f',
                      'f','f','g','g','g','h','h','h','h','i','i','j','j'),
                y = c(1,2,1,3,1,2,1,4,5,1,2,2,3,4,5,1,1,2,1,2,2,3,2,2,3,4))
df$y = factor(df$y, ordered = TRUE)
nnet::multinom(y ~ x, data = df)
When checking the output, I see all the variables with their coefficients (meaning everything is fine):
Coefficients:
(Intercept) xb xc xd xe xf
2 -6.045294e-05 -31.83512 3.800915e-05 -36.67053 3.800915e-05 25.00515
3 -1.613311e+01 16.13310 -1.725649e+01 -21.06832 -1.725649e+01 41.13825
4 -1.692352e+01 -14.71119 -1.428100e+01 16.92351 -1.428100e+01 41.92865
5 -2.129358e+01 -10.49359 -1.002518e+01 21.29353 -1.002518e+01 46.29867
xg xh xi xj
2 -0.6931261 0.6932295 40.499799 -25.311410
3 -24.0387863 16.1332150 -8.876562 45.191730
4 -20.2673490 -16.0884760 -6.394423 45.982129
5 -15.1755064 -11.8589447 -4.563793 -6.953942
But my original data frame (I will share only the output), whose dependent and independent variables are coded like those of the df data frame (that is, as ordinal factors), runs through the analysis fine, yet when it comes to interpretation I get this output:
Coefficients:
(Intercept) FIES_R.L FIES_R.Q FIES_R.C FIES_R^4 FIES_R^5
2 -0.09594409 -1.303256 0.03325169 -0.1753022 -0.46026668 -0.282463422
3 -0.18587599 -1.469957 0.42005569 -0.2977628 0.00508412 0.003068678
4 -0.58189239 -2.875183 0.33128994 -0.6787992 0.11145099 0.239368520
5 -2.68727952 -10.178604 -5.12515249 -5.8454920 -3.13775961 -1.820629143
FIES_R^6 FIES_R^7 FIES_R^8
2 -0.2179067 -0.1000471 -0.1489342
3 0.1915476 -0.5483707 -0.2565626
4 0.2585801 0.3821566 -0.2679774
5 -0.5562958 -0.6335412 -0.7205215
I don't want FIES_R.L, FIES_R.Q and FIES_R.C. I want them as: FIES_R_1, FIES_R_2, FIES_R_3, FIES_R_4, FIES_R_5, FIES_R_6, FIES_R_7, FIES_R_8.
Why do I get such an output, given that both data frames contain ordinal categorical variables and both the x variable and the FIES variable have many categories? Thanks.
I just figured it out: it is because the independent variable is an ordinal factor. FIES in my dataset is an ordinal factor, so R applies polynomial contrasts (.L, .Q, .C, ^4, ...). When I used the argument ordered = FALSE, the problem was solved.
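For illustration, here is that fix sketched on the toy df (whose x is already unordered; in the real data the same recoding would apply to the FIES variable):
# Recode the predictor as an unordered factor so multinom uses
# treatment (dummy) coding instead of polynomial contrasts
df$x <- factor(df$x, ordered = FALSE)
nnet::multinom(y ~ x, data = df)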
You can change the coefnames "by hand":
mod <- nnet::multinom(y ~ x, data = df)
# Rename all terms except the intercept: first 6 characters plus a running index
mod$vcoefnames <- c("(Intercept)", paste0(substr(mod$vcoefnames[-1], 1, 6),
                                          "_", seq_along(mod$vcoefnames[-1])))

Unique intercepts approach for categorical variables in "rstanarm" package in R

Background:
McElreath (2016), in his Statistical Rethinking book (pages 158-159), uses an index variable instead of dummy coding for a 4-category variable called "clade" to predict "kcal.per.g" (linear regression).
Question: I was wondering if we could apply the same approach in "rstanarm"? I have provided data and R code for a possible demonstration below.
library("rethinking") # A github package not on CRAN
data(milk)
d <- milk
d$clade_id <- coerce_index(d$clade) # Index variable maker
#[1] 4 4 4 4 4 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 1 1 1 1 1 1 1 1 1 # index variable
# Model Specification:
fit1 <- map(
  alist(
    kcal.per.g ~ dnorm(mu, sigma),
    mu <- a[clade_id],
    a[clade_id] ~ dnorm(0.6, 10),
    sigma ~ dunif(0, 10)
  ),
  data = d)
The most analogous way to do this using the rstanarm package is with
library(rstanarm)
fit1 <- stan_glmer(kcal.per.g ~ 1 + (1 | clade_id), data = milk,
                   prior_intercept = normal(0.6, 1, autoscale = FALSE),
                   prior_aux = exponential(rate = 1/5, autoscale = FALSE),
                   prior_covariance = decov(shape = 10, scale = 1))
However, this is not exactly the same, for the following reasons:
- Bounded uniform priors on sigma are not implemented because they are not a good idea, so I have used an exponential distribution with an expectation of 5 instead.
- Fixing the standard deviation on a is not implemented either, so I have used a gamma distribution with an expectation of 10.
- Hierarchical models in rstanarm (and lme4) are parameterized with deviations from common parameters, so rather than using an expectation of 0.6 for a, I have used an expectation of 0.6 for the global intercept, and the prior on a is normal with an expectation of zero. This means you need to call coef(fit1) rather than ranef(fit1) to see the "intercepts" as they are parameterized in the original model.
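To illustrate that last point (a quick sketch, assuming fit1 from above has been fit):
# Group-level "intercepts" on the outcome scale: global intercept + deviation
coef(fit1)$clade_id
# The zero-centered deviations alone
ranef(fit1)$clade_id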

Is there a function to return the matching response vector to model.matrix?

In glmnet() I have to specify the raw X matrix and response vector Y (unlike lm, where you can specify a model formula). model.matrix() will correctly remove incomplete observations from the X matrix, but it doesn't include the response in the output object. So I will have something like this:
glmnet(y = mydf$response, x = model.matrix(myformula, mydf)[,-1], ...)
When model.matrix removes observations the y and x dimensions won't match. Is there a function to align y data to x?
Try using model.frame and model.response.
> d <- data.frame(y=rnorm(3), x=c(1,NA,2), z=c(NA, NA, 1))
> d
y x z
1 -0.6257260 1 NA
2 -0.4979723 NA NA
3 -1.2233772 2 1
> form <- y~x
> mf <- model.frame(form, data=d)
> model.response(mf)
1 3
-0.625726 -1.223377
> model.matrix(form, mf)
(Intercept) x
1 1 1
3 1 2
attr(,"assign")
[1] 0 1
I'm not familiar with glmnet, but it might be the case that mf is sufficient, just passing y = mf[, 1] and x = mf[, -1] (indexing columns, not rows).
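Putting the pieces together for glmnet, a minimal sketch (myformula and mydf are the names from the question; glmnet wants x as a matrix):
library(glmnet)
mf <- model.frame(myformula, data = mydf)   # drops incomplete observations
y  <- model.response(mf)                    # response aligned with the kept rows
x  <- model.matrix(myformula, mf)[, -1]     # predictor matrix; drop the intercept
fit <- glmnet(x, y)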

Lasso R code - what is wrong with it?

I am attempting to carry out lasso regression using the lars package but cannot seem to get the lars call to work. I have entered this code:
diabetes<-read.table("diabetes.txt", header=TRUE)
diabetes
library(lars)
diabetes.lasso = lars(diabetes$x, diabetes$y, type = "lasso")
However, I get this error message:
Error in rep(1, n) : invalid 'times' argument.
I have tried entering it like this:
diabetes<-read.table("diabetes.txt", header=TRUE)
library(lars)
data(diabetes)
diabetes.lasso = lars(age+sex+bmi+map+td+ldl+hdl+tch+ltg+glu, y, type = "lasso")
But then I get the error message:
'Error in lars(age+sex + bmi + map + td + ldl + hdl + tch + ltg + glu, y, type = "lasso") :
object 'age' not found'
Where am I going wrong?
EDIT: Data - as below but with another 5 columns.
ldl hdl tch ltg glu
1 -0.034820763 -0.043400846 -0.002592262 0.019908421 -0.017646125
2 -0.019163340 0.074411564 -0.039493383 -0.068329744 -0.092204050
3 -0.034194466 -0.032355932 -0.002592262 0.002863771 -0.025930339
4 0.024990593 -0.036037570 0.034308859 0.022692023 -0.009361911
5 0.015596140 0.008142084 -0.002592262 -0.031991445 -0.046640874
I think some of the confusion may have to do with the fact that the diabetes data set that comes with the lars package has an unusual structure.
library(lars)
data(diabetes)
sapply(diabetes,class)
## x y x2
## "AsIs" "numeric" "AsIs"
sapply(diabetes,dim)
## $x
## [1] 442 10
##
## $y
## NULL
##
## $x2
## [1] 442 64
In other words, diabetes is a data frame containing "columns" which are themselves matrices. In this case, with(diabetes,lars(x,y,type="lasso")) or lars(diabetes$x,diabetes$y,type="lasso") work fine. (But just lars(x,y,type="lasso") won't, because R doesn't know to look for the x and y variables within the diabetes data frame.)
However, if you are reading in your own data, you'll have to separate the response variable and the predictor matrix yourself, something like
X <- as.matrix(mydiabetes[, names(mydiabetes) != "y"])  # select columns, not rows
mydiabetes.lasso = lars(X, mydiabetes$y, type = "lasso")
Or you might be able to use
X <- model.matrix(y ~ ., data = mydiabetes)[, -1]  # drop the intercept column
lars::lars does not appear to have a formula interface, which means you cannot use the formula specification for the column names (and furthermore it does not accept a "data=" argument). For more information on this and other "data mining" topics, you might want to get a copy of the classic text: "Elements of Statistical Learning". Try this:
# this obviously assumes require(lars) and data(diabetes) have been executed.
> diabetes.lasso = with( diabetes, lars(x, y, type = "lasso"))
> summary(diabetes.lasso)
LARS/LASSO
Call: lars(x = x, y = y, type = "lasso")
Df Rss Cp
0 1 2621009 453.7263
1 2 2510465 418.0322
2 3 1700369 143.8012
3 4 1527165 86.7411
4 5 1365734 33.6957
5 6 1324118 21.5052
6 7 1308932 18.3270
7 8 1275355 8.8775
8 9 1270233 9.1311
9 10 1269390 10.8435
10 11 1264977 11.3390
11 10 1264765 9.2668
12 11 1263983 11.0000
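Afterwards, to pull coefficients at a particular point on the lasso path, something like this should work (a sketch; s and mode are standard arguments of coef.lars/predict.lars):
# Coefficients at 50% of the maximum L1 norm along the path
coef(diabetes.lasso, s = 0.5, mode = "fraction")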
