How to run Beta Regression with lots of independent variables? - r

Why is beta regression, whose response is bounded between 0 and 1, unable to handle many independent variables as regressors? I have around 30 independent variables that I am trying to fit, and I get an error like:
Error in optim(par = start, fn = loglikfun, gr = gradfun, method = method, :
  non-finite value supplied by optim
It only accepts a few variables. If I combine all these independent variables into X <- (df$x1 + … + df$x30), make the dependent variable Y <- df$y, and then run the beta regression, it works, but I won't get coefficients for the individual independent variables, which is what I want:
betareg(Y ~ X, data = df)
So, what’s the solution?

The model probably did not converge because of multicollinearity. In many cases, regression models cannot be estimated properly when a large number of variables is included. You can overcome this with an appropriate variable-selection procedure based on information criteria.
You can benefit from the gamlss package in R: fit the model with gamlss(..., family = BE) and use the stepGAIC() function for variable selection.
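A minimal sketch of that approach (assuming a data frame df with the response y in (0, 1) and candidate predictors x1, ..., x30, as in the question):

library(gamlss)

# start from an intercept-only beta (BE) model
m0 <- gamlss(y ~ 1, family = BE, data = df)

# stepwise selection by GAIC over the candidate predictors
# (list every candidate in the upper scope; only three are shown here)
m_step <- stepGAIC(m0,
                   scope = list(lower = ~ 1,
                                upper = ~ x1 + x2 + x3),
                   direction = "both")

summary(m_step)  # coefficients for the selected predictors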

Related

Is it possible to perform a zero inflated poisson regression model in R with more than 4 variables?

This is my first time posting on here, so I apologize if this isn't the correct format/info. But I'm attempting to run a model in R with the zeroinfl function (pscl package). My data consists of insect count data and 5 different variables, which are 5 different habitat types.
Zero-inflated Poisson model
summary(m1 <- zeroinfl(count~Hab_1+Hab_2+Hab_3+Hab_4+Hab_5, data = insect_data))
I'm able to run the model when I only use 4 variables in the equation, but when I add the fifth variable it gives me this error code:
Error in optim(fn = loglikfun, gr = gradfun, par = c(start$count, start$zero, :
non-finite value supplied by optim
Is there a way to run a zero inflated model using all 5 of these variables or am I missing something? Any input would be greatly appreciated, thank you!

Error in glsEstimate(object, control = control) : computed "gls" fit is singular, rank 19

First time asking in the forums, this time I couldn't find the solutions in other answers.
I'm just starting to learn to use R, so I can't help but think this has a simple solution I'm failing to see.
I'm analyzing the relationship between different insect species (SP) and temperature (T), explanatory variables
and the area of the femur of the resulting adult (Femur.area) response variable.
This is my linear model:
ModeloP <- lm(Femur.area ~ T * SP, data=Datos)
No error, but when I want to model variance with gls,
modelo_varPower <- gls(Femur.area ~ T * SP,
                       weights = varPower(),
                       data = Datos)
I get the following errors...
Error in glsEstimate(object, control = control) :
computed "gls" fit is singular, rank 19
The linear model barely passes the Shapiro-Wilk test of normality; could this be the issue?
Shapiro-Wilk normality test
data: re
W = 0.98269, p-value = 0.05936
Strangely, I've run this model with another explanatory variable and had no errors. Everything I can find in the forums has to do with repeated sampling over a period of time, and that's not my case.
Since the only difference is the response variable, I'm uploading an image of what the table looks like in case it helps.
You have some missing cells in your SP:T interaction. lm() tolerates these (if you look at coef(lm(Femur.area ~ SP * T, data = Datos)) you'll see some NA values for the missing interactions); gls() does not. One way to deal with this is to create an interaction variable and drop the missing levels, then fit the model as (effectively) a one-way rather than a two-way ANOVA. (I called the data dd rather than Datos.)
dd3 <- transform(na.omit(dd), SPT = droplevels(interaction(SP, T)))
library(nlme)
gls(Femur.area ~ SPT, weights = varPower(form = ~ fitted(.)), data = dd3)
If you want the main effects, the interaction term, and the power-law variance, that's possible, but it's harder.
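A rough sketch of that harder route (my own illustration under the same setup, not code from the answer): build the design matrix yourself, drop the columns lm() could not estimate (the NA coefficients for the missing cells), and hand the reduced matrix to gls() as ordinary numeric predictors.

fit_lm <- lm(Femur.area ~ T * SP, data = dd)
X <- model.matrix(fit_lm)                      # full two-way design matrix
X <- X[, !is.na(coef(fit_lm)), drop = FALSE]   # drop unestimable (missing-cell) terms
X <- X[, colnames(X) != "(Intercept)", drop = FALSE]

dd_red <- data.frame(Femur.area = model.response(model.frame(fit_lm)), X)
form <- reformulate(setdiff(names(dd_red), "Femur.area"), response = "Femur.area")
gls(form, weights = varPower(form = ~ fitted(.)), data = dd_red)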

Conduct quantile regression with several dependent variables in R

I'm interested in doing a multivariate regression in R, looking at the effects of a grouping variable (2 levels) on several dependent variables. However, due to my data being non-normal and the 2 groups not having homogenous variances, I'm looking to use a quantile regression instead. I'm using the rq function from the quantreg toolbox to do this.
My code is as follows
# Generate some fake data
DV = matrix(rnorm(40*5),ncol=5) #construct matrix for dependent variables
IV = matrix(rep(1:2,20)) #matrix for grouping factor
library(quantreg)
model.q = rq(DV~IV,
tau = 0.5)
I get the following error message when this is run:
Error in y - x %*% z$coef : non-conformable arrays
In addition: Warning message:
In rq.fit.br(x, y, tau = tau, ...) : Solution may be nonunique
I believe this is due to my having several DVs, as the model works fine when I try using a DV of one column. Is there a specific way I should be formatting my data? Or perhaps there is another function I may be able to use?
Thank you!
If you just want to run several regressions, each with the same set of independent variables, but with a different dependent variable, you could write a function and then apply it to all columns of your DV matrix and save the models in a list:
reg <- function(col_number) {
  rq(DV[, col_number] ~ IV, tau = 0.5)
}
model_list <- lapply(1:ncol(DV), reg)
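For example (an extra step of my own, not part of the original answer), you could then pull the coefficient estimates out of that list:

coef_matrix <- sapply(model_list, coef)                   # one column of coefficients per DV
colnames(coef_matrix) <- paste0("DV", seq_len(ncol(DV)))  # label the columns
coef_matrix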
However, as pointed out in the comments, it might be that you want a multivariate model that accounts for the correlation between the outcomes; in that case I do not think the rq method would be appropriate.
If you have multiple responses, what you most likely need is:
DV = matrix(rnorm(40*5),ncol=5) #construct matrix for dependent variables
IV = matrix(rep(1:2,20)) #matrix for grouping factor
library(quantreg)
rqs.fit(x=IV, y=DV, tau=0.5, tol = 0.0001)
Unfortunately, there's really not a lot of documentation about how this works. I can update if I do find it.

LASSO analysis (glmnet package). Can I loop the analysis and the results extraction?

I'm using the glmnet package. I need to run several LASSO analyses for the calibration of a large number of variables (% reflectance for each wavelength throughout the spectrum) against one dependent variable. I have a couple of doubts about the procedure and the results that I wish to resolve. I show my provisional code below.
First I split my data in training (70% of n) and testing sets.
smp_size <- floor(0.70 * nrow(mydata))
set.seed(123)
train_ind <- sample(seq_len(nrow(mydata)), size = smp_size)
train <- mydata[train_ind, ]
test <- mydata[-train_ind, ]
Then I separate the target trait (y) and the independent variables (x) for each set as follows:
vars.train <- train[3:2153]
vars.test <- test[3:2153]
x.train <- data.matrix(vars.train)
x.test <- data.matrix(vars.test)
y.train <- train$X1
y.test <- test$X1
Afterwards, I run a cross-validated LASSO model for the training set and extract and write the non-zero coefficients for lambda.min. This is because one of my concerns here is to note which variables (wavebands of the reflectance spectrum) are selected by the model.
install.packages("glmnet")
library(glmnet)
cv.lasso.1 <- cv.glmnet(y = y.train, x = x.train, family = "gaussian",
                        nfolds = 5, standardize = TRUE, alpha = 1)
coef(cv.lasso.1, s = cv.lasso.1$lambda.min)  # using lambda.min
(cv.lasso.1)                                 # print the fitted cv.glmnet object
install.packages("broom")
library(broom)
c <- tidy(coef(cv.lasso.1, s = "lambda.min"))
write.csv(c, file = "results")
Finally, I use the predict function, applying the fitted model cv.lasso.1 to the predictors of the testing set (x.test), to get predictions of the dependent variable, and I run the correlation between the predicted and actual values of y for the testing set.
predict.1.2 <- predict(cv.lasso.1, newx = x.test, type = "response", s = "lambda.min")
cor.test(x = c(predict.1.2), y = c(y.test))
This is simplified code and it has run without problems so far. The point is that I would like to loop the whole procedure (one hundred repetitions) and, for each repetition, get the non-zero coefficients of the cross-validated model as well as the correlation coefficient of the predicted vs. actual values for the testing set. I've tried but couldn't get any clear results. Can someone give me some hint?
Thanks!
In general, running repeated analyses of the same type over and over on the same data can be tricky, and in your case it may not be necessary the way you have outlined it.
If you are trying to find the most predictive variables, you can use PCA (Principal Component Analysis) to select the variables with the most variation within and between variables. However, PCA does not consider your outcome at all, so with a poor model design it may pick the least correlated variables in your data even though they are not predictive; you should be very aware of all the variables in the set. This is a way of reducing the dimensionality of your data before a linear or logistic regression of some sort.
You can read about it here
yourPCA <- prcomp(yourData,
                  center = TRUE,
                  scale. = TRUE)
Scaling and centering are essential to making these models work right: they put your variables on a common scale, with means of 0 and standard deviations of 1. Unless you know what you are doing, I would leave those settings as they are. If you have skewed or kurtotic data, you might need to address that prior to PCA. Run this ONLY on your predictors; keep your target/outcome variable out of the data set.
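You could then, for instance, inspect how much variance each component captures (an illustrative extra step):

summary(yourPCA)   # proportion of variance explained by each component
head(yourPCA$x)    # the component scores (the rotated data)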
If you have a classification problem you are looking to resolve with a lot of data, try LDA (Linear Discriminant Analysis), which reduces variables by optimizing the variance of each predictor with respect to the OUTCOME variable; it specifically considers your outcome.
require(MASS)
yourLDA <- lda(formula = outcome ~ .,
               data = yourdata)
You can also set the prior probabilities in LDA if you know the global probability for each class, or you can leave them out and lda will estimate the class probabilities from the training set. You can read about that here:
LDA from MASS package
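For instance (illustrative values of my own, assuming a two-class outcome):

yourLDA <- lda(outcome ~ ., data = yourdata, prior = c(0.7, 0.3))  # fixed class priors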
So this gets you headed in the right direction for reducing the complexity of your data via feature selection in a computationally solid way. Building the most robust model via repeated model fitting is known as cross-validation, and the cv.glm function in the boot package can take care of this for you in a safe way.
You can use the following as a rough guide:
require(boot)
# assumes yourData contains the outcome plus all predictor variables
yourGLM <- glm(outcomeVariable ~ ., data = yourData, family = "gaussian")
yourCVGLM <- cv.glm(data = yourData, glmfit = yourGLM, K = 100)
Here K = 100 specifies 100 cross-validation folds, i.e. 100 models fit on randomly sampled subsets of your OBSERVATIONS, not your variables.
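You can then look at the cross-validated prediction error, e.g.:

yourCVGLM$delta  # raw and adjusted cross-validation estimates of prediction error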
So the process is twofold: reduce variables using one of the two methods above, then use cross-validation to build a single model from repeated trials, without cumbersome loops!
Read about cv.glm here
Try starting on page 41, but look over the whole thing. The repeated sampling you are after is called bootstrapping; it is powerful and available for many different model types.
Not as much code as you might hope for, but it should point you in a decent direction.

Obtaining predicted (i.e. expected) values from the orm function (Ordinal Regression Model) from rms package in R

I've run a simple model using orm (i.e. reg <- orm(formula = y ~ x)) and I'm having trouble understanding how to get predicted values for Y. I've never worked with models that use multiple intercepts. I want to know for each and every value of Y in my dataset what the predicted value from the model would be. I tried predict(reg, type="mean") and this produced values that are close to the predicted values from an OLS regression, but I'm not sure if this is what I want. I really just want something analogous to OLS where you can obtain the E(Y) given a set of predictors. If possible, please provide code I can run to do this with a very short explanation.
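A minimal sketch of the calls described above (y and x stand in for the actual variables):

library(rms)
reg <- orm(y ~ x)                          # ordinal model with multiple intercepts
pred_mean <- predict(reg, type = "mean")   # the call tried above; values close to OLS fits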
