Avoiding singularity matrix in regression - r

I want to run quantile regression on the following high-dimensional data set (n = 123, p = 945), but I get an error saying the design matrix is singular and the fit cannot be computed. Below is how I extracted the data using the PRIMsrc package.
library(PRIMsrc)
data(Real.2)
df<-Real.2
y<-df$y
X<-data.matrix(df[2:946])
library(quantreg)
rq1 <- rq(y ~ X, tau = 1)
This seems to be an issue with the data itself, so I tried to add some noise to the response using jitter() from base R, but this didn't solve the issue. Any ideas?

You have too many variables to fit with too few observations, so it is not possible to get an estimate for all of them. In your case, I suggest you try a lasso:
library(PRIMsrc)
data(Real.2)
df<-Real.2
y<-df$y
X<-data.matrix(df[2:946])
library(quantreg)
rq1 <- rq.fit.lasso(y = y, x = X, tau = 0.5)
The coefficients can be found under:
rq1$coefficients
For predictions, this should work fine. For inference based on the coefficients, you might need to give it more thought.
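As a minimal sketch of what I mean by predictions (assuming the call above, where X contains no intercept column, so fitted values are just X times the coefficient vector):
# Minimal sketch: in-sample fitted values from the lasso quantile fit.
# Assumes rq1 came from rq.fit.lasso(y = y, x = X, tau = 0.5) as above.
fitted_vals <- X %*% rq1$coefficients
head(cbind(observed = y, fitted = fitted_vals))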

Related

R, mitools::MIcombine, what is the reason for no p-values?

I am currently running a simple linear regression model with 5 multiply imputed datasets in R.
E.g. model <- with(imp, lm(outcome ~ exposure))
To pool the summary estimates I could use the command summary(mitools::MIcombine(model)) from the mitools package. However, this does not give results for p-values. I could also use the command summary(pool(model)) from the mice package and this does give results for p-values.
Because of this, I am wondering if there is a specific reason why MIcombine does not produce p-values?
After looking through the documentation, there doesn't seem to be a particular reason that the mitools library omits p-values, although the package's focus is on imputation rather than on model results.
However, you don't need either of these packages to see your results, along with the per-model p-values. I started writing this as a comment but decided to include the code. If you weren't aware, you can use base R's summary(). I realize that the output of mice, like that of mitools, pools across imputations, but I thought this was worth mentioning as well.
If the output of your call is model, then this will work.
library(tidyverse)
# print the model summary for each imputed data set's fit
map(seq_along(model), ~ summary(model[[.x]]))
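If you do want p-values attached to the pooled MIcombine result itself, a rough sketch is to compute them from the pooled estimates and standard errors (this assumes the returned object exposes coef(), vcov() and a df component, which you should verify against your version of mitools):
# Rough sketch; assumption: the MIcombine result exposes coef(), vcov() and $df
pooled <- mitools::MIcombine(model)
est <- coef(pooled)
se  <- sqrt(diag(vcov(pooled)))
p   <- 2 * pt(-abs(est / se), df = pooled$df)
cbind(estimate = est, std.error = se, p.value = p)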

R: gls error "false convergence (8)" and glsControl function

I've seen that a common error when running a generalized least squares (gls) fit from the nlme package in R is "false convergence (8)". I am trying to run gls models to account for the spatial dependence of my residuals, but I got stuck on the same problem. For example:
library(nlme)
set.seed(2)
samp.sz<-400
lat<-runif(samp.sz,-4,4)
lon<-runif(samp.sz,-4,4)
exp1<-rnorm(samp.sz)
exp2<-rnorm(samp.sz)
resp<-1+4*exp1-3*exp2-2*lat+rnorm(samp.sz)
mod.cor<-gls(resp~exp1+exp2,correlation=corGaus(form=~lat,nugget=TRUE))
Error in gls(resp ~ exp1 + exp2, correlation = corGaus(form = ~lat, nugget = TRUE)) :
false convergence (8)
(the above data simulation was copied from here because it yields the same problem I am facing).
Then, I read that the function glsControl has some parameters (maxIter, msMaxIter, returnObject) that can be set prior to running the analysis, which can solve this error. In an attempt to understand what was going on, I set the three parameters above to 500, 2000 and TRUE and ran the same code again, but the error still shows up. I think glsControl didn't work at all, because no result was returned even though I asked for one.
glsControl(maxIter = 500, msMaxIter=2000, returnObject = TRUE)
mod.cor<-gls(resp~exp1+exp2,correlation=corGaus(form=~lat,nugget=TRUE))
For comparison, if I run different models with the same variables, it works fine and no error is shown.
For example, models containing only one explanatory variable.
mod.cor2<-gls(resp~exp1,correlation=corGaus(form=~lat,nugget=TRUE))
mod.cor3<-gls(resp~exp2,correlation=corGaus(form=~lat,nugget=TRUE))
I really dug into several sites, forums and books in a desperate search to solve it, and I came to learn that "false convergence" is a recurrent error that many users have faced. However, none of the previous posts seems to solve it for me. I really thought glsControl could provide an alternative, but it didn't. Do you have a clue about how I can solve this?
I really appreciate any help. Thanks in advance.
The problem is that the nugget effect is very small. Provide better starting values:
mod.cor <- gls(resp ~ exp1 + exp2,
correlation = corGaus(c(200, 0.1), form = ~lat, nugget = TRUE))
summary(mod.cor)
#<snip>
#Correlation Structure: Gaussian spatial correlation
# Formula: ~lat
# Parameter estimate(s):
# range nugget
#2.947163e+02 5.209379e-06
#</snip>
Note that this model may be sensitive to starting values even if there is no error or warning.
I would like to add a quote from library(lme4); help("convergence"):
The lme4 package uses general-purpose nonlinear optimizers (e.g.
Nelder-Mead or Powell's BOBYQA method) to estimate the
variance-covariance matrices of the random effects. Assessing reliably
whether such algorithms have converged is difficult.
I believe something similar applies here. This model is clearly problematic and you should be grateful for getting this error. You should at least check how the fit changes with different starting values and try increasing the number of iterations or decreasing the tolerance. In the end, I would suggest looking for a model that better fits the data (we know that this would be an OLS model including lat as a linear predictor here).
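As a sketch of that advice (note that glsControl() settings only take effect when passed to gls() through its control argument; the two sets of starting values here are illustrative, not recommendations):
# Sketch: pass the control settings to gls() explicitly and compare fits
# obtained from two different (illustrative) sets of starting values.
ctrl <- glsControl(maxIter = 500, msMaxIter = 2000, tolerance = 1e-8,
                   returnObject = TRUE)
fit1 <- gls(resp ~ exp1 + exp2,
            correlation = corGaus(c(200, 0.1), form = ~lat, nugget = TRUE),
            control = ctrl)
fit2 <- gls(resp ~ exp1 + exp2,
            correlation = corGaus(c(50, 0.5), form = ~lat, nugget = TRUE),
            control = ctrl)
cbind(coef(fit1), coef(fit2))  # do the fixed-effect estimates agree?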
PS: A good coding style uses blanks where appropriate.

How do you correctly perform a glmmPQL on non-normal data?

I ran a model using glmer looking at the effect that Year and Treatment had on the number of points covered with wood, then plotted the residuals to check for normality; the resulting graph is slightly skewed to the right. Is this normally distributed?
model <- glmer(Number~Year*Treatment(1|Year/Treatment), data=data,family=poisson)
This site recommends using glmmPQL if your data is not normal: http://ase.tufts.edu/gsc/gradresources/guidetomixedmodelsinr/mixed%20model%20guide.html
library(MASS)
library(nlme)
model1 <- glmmPQL(Number ~ Year * Treatment, ~1 | Year/Treatment,
                  family = gaussian(link = "log"),
                  data = data, start = coef(lm(Log ~ Year * Treatment)),
                  na.action = na.pass, verbose = FALSE)
summary(model1)
plot(model1)
Now do you transform the data in the Excel document or in the R code (Number1 <- log(Number)) before running this model? Does the link="log" imply that the data is already log transformed or does it imply that it will transform it?
If you have data with zeros, is it acceptable to add 1 to all observations to make it more than zero in order to log transform it: Number1<-log(Number+1)?
Is fit<-anova(model,model1,test="Chisq") sufficient to compare both models?
Many thanks for any advice!
tl;dr your diagnostic plots look OK to me, you can probably proceed to interpret your results.
This formula:
Number~Year*Treatment+(1|Year/Treatment)
might not be quite right (besides the missing + between the terms above). In general you shouldn't include the same term in both the random and the fixed effects, although there is one exception: if Year has more than a few values and there are multiple observations per year, you can include it as a continuous covariate in the fixed effects and as a grouping factor in the random effects, so this might be correct.
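For illustration, a hedged sketch of what the corrected call might look like (assuming your data frame and column names are as in the question; whether Year also belongs in the fixed effects depends on the point above):
# Hedged sketch of a corrected glmer call: the "+" is restored and Year is
# kept only as a grouping factor here; adjust per the discussion above.
library(lme4)
model <- glmer(Number ~ Treatment + (1 | Year/Treatment),
               data = data, family = poisson)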
I'm not crazy about the linked introduction; at a quick skim there's nothing horribly wrong with it, but there seem to be a lot of minor inaccuracies and confusions. "Use glmmPQL if your data aren't Normal" is really shorthand for "you might want to use a GLMM if your data aren't Normal". Your glmer model should be fine.
Interpreting diagnostic plots is a bit of an art, but the degree of deviation that you show above doesn't look like a problem.
Since you don't need to log-transform your data, you don't need to get into the slightly messy issue of how to log-transform data containing zeros. In general, log(1+x) transformations for count data are reasonable, but, again, unnecessary here.
anova() in this context does a likelihood ratio test, which is a reasonable way to compare models.

R: Estimating model variance

In the demo for ROC, there are models that, when plotted, have a spread, like hiv.svm$predictions, which contains 10 estimates of the response. Can someone remind me how to calculate N estimates of a model? I'm using rpart and a neural network to estimate a single output (true/false). How can I run 10 different samplings of the training data to get 10 different model responses to the input? I think the technique is called bootstrapping, but I don't know how to implement it.
I need to do this outside of caret, because when I use caret I keep getting the message "Error in tab[1:m, 1:m] : subscript out of bounds". Is there a "simple" bootstrap function?
Obviously this answer is too late, but you could have used caret simply by renaming the levels of your factor, because caret doesn't work if your binary response is of type logical. For example:
factor(responseWithTrueFalseLevel,
       levels = c(TRUE, FALSE),
       labels = c("myTrueLevel", "myFalseLevel"))
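If you still want to do this outside caret, a minimal hand-rolled bootstrap sketch could look like the following (the names train, test and the formula response ~ . are assumptions, not taken from the question):
# Minimal sketch of a manual bootstrap: fit rpart on 10 resampled training
# sets and collect the predictions (train/test/formula are assumed names).
library(rpart)
set.seed(1)
preds <- replicate(10, {
  idx <- sample(nrow(train), replace = TRUE)       # bootstrap resample
  fit <- rpart(response ~ ., data = train[idx, ])  # refit the model
  predict(fit, newdata = test)                     # predict on the test set
})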

When to choose nls() over loess()?

If I have some (x,y) data, I can easily draw straight-line through it, e.g.
f=glm(y~x)
plot(x,y)
lines(x,f$fitted.values)
But for curvy data I want a curvy line. It seems loess() can be used:
f=loess(y~x)
plot(x,y)
lines(x,f$fitted)
This question has evolved as I've typed and researched it. I started off wanting a simple function to fit curvy data (where I know nothing about the data), and wanting to understand how to use nls() or optim() to do that. That was what everyone seemed to be suggesting in similar questions I found. But now that I've stumbled upon loess(), I'm happy. So, now my question is: why would someone choose to use nls or optim instead of loess (or smooth.spline)? Using the toolbox analogy, is nls a screwdriver and loess a power screwdriver (meaning I'd almost always choose the latter as it does the same thing but with less of my effort)? Or is nls a flat-head screwdriver and loess a cross-head screwdriver (meaning loess is a better fit for some problems, but for others it simply won't do the job)?
For reference, here is the play data I was using that loess gives satisfactory results for:
x=1:40
y=(sin(x/5)*3)+runif(x)
And:
x=1:40
y=exp(jitter(x,factor=30)^0.5)
Sadly, it does less well on this:
x=1:400
y=(sin(x/20)*3)+runif(x)
Can nls(), or any other function or library, cope with both this and the previous exp example, without being given a hint (i.e. without being told it is a sine wave)?
UPDATE: Some useful pages on the same theme on stackoverflow:
Goodness of fit functions in R
How to fit a smooth curve to my data in R?
smooth.spline "out of the box" gives good results on my 1st and 3rd examples, but terrible (it just joins the dots) on the 2nd example. However f=smooth.spline(x,y,spar=0.5) is good on all three.
UPDATE #2: gam() (from the mgcv package) is great so far: it gives a similar result to loess() when that was better, and a similar result to smooth.spline() when that was better. And all without hints or extra parameters. The docs were so far over my head I felt like I was squinting at a plane flying overhead, but a bit of trial and error found:
#f=gam(y~x) #Works just like glm(). I.e. pointless
f=gam(y~s(x)) #This is what you want
plot(x,y)
lines(x,f$fitted)
Nonlinear least squares is a means of fitting a model that is non-linear in the parameters. By fitting a model, I mean there is some a priori specified form for the relationship between the response and the covariates, with some unknown parameters that are to be estimated. As the model is non-linear in these parameters, NLS is a means to estimate values for those coefficients by minimising a least-squares criterion in an iterative fashion.
LOESS was developed as a means of smoothing scatterplots. It has a much less well-defined concept of a "model" that is fitted (IIRC there is no "model"). LOESS works by trying to identify patterns in the relationship between response and covariates without the user having to specify what form that relationship takes; it works out the relationship from the data themselves.
These are two fundamentally different ideas. If you know the data should follow a particular model then you should fit that model using NLS. You could always compare the two fits (NLS vs LOESS) to see if there is systematic variation from the presumed model etc - but that would show up in the NLS residuals.
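As a hedged illustration of comparing the two on the questioner's first toy data set (the sine model and its starting values are assumptions supplied by me, which is exactly the kind of "hint" nls needs; poor starting values can make nls fail to converge):
# Sketch: fit an explicit sine model with nls() and overlay the loess smooth.
set.seed(1)
x <- 1:40
y <- sin(x/5) * 3 + runif(x)
fit_nls   <- nls(y ~ a * sin(x/b) + c, start = list(a = 2, b = 4, c = 0))
fit_loess <- loess(y ~ x)
plot(x, y)
lines(x, fitted(fit_nls))             # parametric fit
lines(x, fitted(fit_loess), lty = 2)  # nonparametric smooth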
Instead of LOESS, you might consider Generalized Additive Models (GAMs) fitted via gam() in recommended package mgcv. These models can be viewed as a penalised regression problem but allow for the fitted smooth functions to be estimated from the data like they are in LOESS. GAM extends GLM to allow smooth, arbitrary functions of covariates.
loess() is non-parametric, meaning you don't get a set of coefficients you can use later - it's not a model, just a fit line. nls() will give you coefficients you could use to build an equation and predict values with a different but similar data set - you can create a model with nls().
