Standardization and inclusion of intercept in sparse lasso GLM - r

I found some problems while practicing the sparse group lasso method using the cvSGL function forom the SGL package.
My questions are as follows:
Looking at the code for SGL:::center_scale, it doesn't seem to consider the sample size of the data.
SGL:::center_scale
#Output
function (X, standardize) {
means <- apply(X, 2, mean)
X <- t(t(X) - means)
X.transform <- list(X.means = means)
if (standardize == TRUE) {
var <- apply(X, 2, function(x) (sqrt(sum(x^2))))
X <- t(t(X)/var)
X.transform$X.scale <- var
}
else {
X.transform$X.scale <- 1
}
return(list(x = X, X.transform = X.transform))
}
Therefore, the sample standard deviation of the predicted variable is measured somewhat larger.
Is my understanding correct that this may cause the coefficients to be excessively large?
whether the model can be estimated by SGL package with intercept term (or constant term)
The SGL package does not seem to provide a function for estimating by including a intercept term.
In cvFit[["fit"]], I can see only the beta of the predict variables according to the lambda's except for the constant term. The value of cvFit[["fit"]][["intercept"]] is the mean of the y variable.
It can be estimated by adding 1 to first column of predict variable X, but in this case, it is expected to cause problems in centering and standardizing predict variables.
In addition, the SPG package seems to add a penalty to all predict variables. Even if the estimation is performed by adding 1 to the first column of the explanatory variable X as described above, the constant term may be estimated as 0.

Related

How to find the minimum floating-point value accepted by betareg package?

I'm doing a beta regression in R, which requires values between 0 and 1, endpoints excluded, i.e. (0,1) instead of [0,1].
I have some 0 and 1 values in my dataset, so I'd like to convert them to the smallest possible neighbor, such as 0.0000...0001 and 0.9999...9999. I've used .Machine$double.xmin (which gives me 2.225074e-308), but betareg() still gives an error:
invalid dependent variable, all observations must be in (0, 1)
If I use 0.000001 and 0.999999, I got a different set of errors:
1: In betareg.fit(X, Y, Z, weights, offset, link, link.phi, type, control) :
failed to invert the information matrix: iteration stopped prematurely
2: In sqrt(wpp) :
Error in chol.default(K) :
the leading minor of order 4 is not positive definite
Only if I use 0.0001 and 0.9999 I can run without errors. Is there any way I can improve this minimum values with betareg? Or should I just be happy with that?
Try it with eps (displacement from 0 and 1) first equal to 1e-4 (as you have here) and then with 1e-3. If the results of the models don't differ in any way you care about, that's great. If they are, you need to be very careful, because it suggests your answers will be very sensitive to assumptions.
In the example below the dispersion parameter phi changes a lot, but the intercept and slope parameter don't change very much.
If you do find that the parameters change by a worrying amount for your particular data, then you need to think harder about the process by which zeros and ones arise, and model that process appropriately, e.g.
a censored-data model: zero/one arise through a minimum/maximum detection threshold, models the zero/one values as actually being somewhere in the tails or
a hurdle/zero-one inflation model: zeros and ones arise through a separate process from the rest of the data, use a binomial or multinomial model to characterize zero vs. (0,1) vs. one, then use a Beta regression on the (0,1) component)
Questions about these steps are probably more appropriate for CrossValidated than for SO.
sample data
set.seed(101)
library(betareg)
dd <- data.frame(x=rnorm(500))
rbeta2 <- function(n, prob=0.5, d=1) {
rbeta(n, shape1=prob*d, shape2=(1-prob)*d)
}
dd$y <- rbeta2(500,plogis(1+5*dd$x),d=1)
dd$y[dd$y<1e-8] <- 0
trial fitting function
ss <- function(eps) {
dd <- transform(dd,
y=pmin(1-eps,pmax(eps,y)))
m <- try(betareg(y~x,data=dd))
if (inherits(m,"try-error")) return(rep(NA,3))
return(coef(m))
}
ss(0) ## fails
ss(1e-8) ## fails
ss(1e-4)
## (Intercept) x (phi)
## 0.3140810 1.5724049 0.7604656
ss(1e-3) ## also fails
ss(1e-2)
## (Intercept) x (phi)
## 0.2847142 1.4383922 1.3970437
ss(5e-3)
## (Intercept) x (phi)
## 0.2870852 1.4546247 1.2029984
try it for a range of values
evec <- seq(-4,-1,length=51)
res <- t(sapply(evec, function(e) ss(10^e)) )
library(ggplot2)
ggplot(data.frame(e=10^evec,reshape2::melt(res)),
aes(e,value,colour=Var2))+
geom_line()+scale_x_log10()

weird svm behavior in R (e1071)

I ran the following code for a binary classification task w/ an SVM in both R (first sample) and Python (second example).
Given randomly generated data (X) and response (Y), this code performs leave group out cross validation 1000 times. Each entry of Y is therefore the mean of the prediction across CV iterations.
Computing area under the curve should give ~0.5, since X and Y are completely random. However, this is not what we see. Area under the curve is frequently significantly higher than 0.5. The number of rows of X is very small, which can obviously cause problems.
Any idea what could be happening here? I know that I can either increase the number of rows of X or decrease the number of columns to mediate the problem, but I am looking for other issues.
Y=as.factor(rep(c(1,2), times=14))
X=matrix(runif(length(Y)*100), nrow=length(Y))
library(e1071)
library(pROC)
colnames(X)=1:ncol(X)
iter=1000
ansMat=matrix(NA,length(Y),iter)
for(i in seq(iter)){
#get train
train=sample(seq(length(Y)),0.5*length(Y))
if(min(table(Y[train]))==0)
next
#test from train
test=seq(length(Y))[-train]
#train model
XX=X[train,]
YY=Y[train]
mod=svm(XX,YY,probability=FALSE)
XXX=X[test,]
predVec=predict(mod,XXX)
RFans=attr(predVec,'decision.values')
ansMat[test,i]=as.numeric(predVec)
}
ans=rowMeans(ansMat,na.rm=TRUE)
r=roc(Y,ans)$auc
print(r)
Similarly, when I implement the same thing in Python I get similar results.
Y = np.array([1, 2]*14)
X = np.random.uniform(size=[len(Y), 100])
n_iter = 1000
ansMat = np.full((len(Y), n_iter), np.nan)
for i in range(n_iter):
# Get train/test index
train = np.random.choice(range(len(Y)), size=int(0.5*len(Y)), replace=False, p=None)
if len(np.unique(Y)) == 1:
continue
test = np.array([i for i in range(len(Y)) if i not in train])
# train model
mod = SVC(probability=False)
mod.fit(X=X[train, :], y=Y[train])
# predict and collect answer
ansMat[test, i] = mod.predict(X[test, :])
ans = np.nanmean(ansMat, axis=1)
fpr, tpr, thresholds = roc_curve(Y, ans, pos_label=1)
print(auc(fpr, tpr))`
You should consider each iteration of cross-validation to be an independent experiment, where you train using the training set, test using the testing set, and then calculate the model skill score (in this case, AUC).
So what you should actually do is calculate the AUC for each CV iteration. And then take the mean of the AUCs.

Summary statistics for imputed data from Zelig & Amelia

I'm using Amelia to impute the missing values.
While I'm able to use Zelig and Amelia to do some calculations...
How do I use these packages to find the pooled means and standard deviations of the newly imputed data?
library(Amelia)
library(Zelig)
n= 100
x1= rnorm(n,0,1) #random normal distribution
x2= .4*x1+rnorm(n,0,sqrt(1-.4)^2) #x2 is correlated with x1, r=.4
x1= ifelse(rbinom(n,1,.2)==1,NA,x1) #randomly creating missing values
d= data.frame(cbind(x1,x2))
m=5 #set 5 imputed data frames
d.imp=amelia(d,m=m) #imputed data
summary(d.imp) #provides summary of imputation process
I couldn't figure out how to format the code in a comment so here it is.
foo <- function(x, fcn) apply(x, 2, fcn)
lapply(d.imp$imputations, foo, fcn = mean)
lapply(d.imp$imputations, foo, fcn = sd)
d.imp$imputations gives a list of all the imputed data sets. You can work with that list however you are comfortable with to get out the means and sds by column and then pool as you see fit. Same with correlations.
lapply(d.imp$imputations, cor)
Edit: After some discussion in the comments I see that what you are looking for is how to combine results using Rubin's rules for, for example, the mean of imputed data sets generated by Amelia. I think you should clarify in the title and body of your post that what you are looking for is how to combine results over imputations to get appropriate standard errors with Rubin's rules after imputing with package Amelia. This was not clear from the title or original description. "Pooling" can mean different things, particularly w.r.t. variances.
The mi.meld function is looking for a q matrix of estimates from each imputation, an se matrix of the corresponding se estimates, and a logical byrow argument. See ?mi.meld for an example. In your case, you want the sample means and se_hat(sample means) for each of your imputed data sets in the q and se matrices to pass to mi_meld, respectively.
q <- t(sapply(d.imp$imputations, foo, fcn = mean))
se <- t(sapply(d.imp$imputations, foo, fcn = sd)) / sqrt(100)
output <- mi.meld(q = q, se = se, byrow = TRUE)
should get you what you're looking for. For other statistics than the mean, you will need to get an SE either analytically, if available, or by, say, bootstrapping, if not.

Errors running Maximum Likelihood Estimation on a three parameter Weibull cdf

I am working with the cumulative emergence of flies over time (taken at irregular intervals) over many summers (though first I am just trying to make one year work). The cumulative emergence follows a sigmoid pattern and I want to create a maximum likelihood estimation of a 3-parameter Weibull cumulative distribution function. The three-parameter models I've been trying to use in the fitdistrplus package keep giving me an error. I think this must have something to do with how my data is structured, but I cannot figure it out. Obviously I want it to read each point as an x (degree days) and a y (emergence) value, but it seems to be unable to read two columns. The main error I'm getting says "Non-numeric argument to mathematical function" or (with slightly different code) "data must be a numeric vector of length greater than 1". Below is my code including added columns in the df_dd_em dataframe for cumulative emergence and percent emergence in case that is useful.
degree_days <- c(998.08,1039.66,1111.29,1165.89,1236.53,1293.71,
1347.66,1387.76,1445.47,1493.44,1553.23,1601.97,
1670.28,1737.29,1791.94,1849.20,1920.91,1967.25,
2036.64,2091.85,2152.89,2199.13,2199.13,2263.09,
2297.94,2352.39,2384.03,2442.44,2541.28,2663.90,
2707.36,2773.82,2816.39,2863.94)
emergence <- c(0,0,0,1,1,0,2,3,17,10,0,0,0,2,0,3,0,0,1,5,0,0,0,0,
0,0,0,0,1,0,0,0,0,0)
cum_em <- cumsum(emergence)
df_dd_em <- data.frame (degree_days, emergence, cum_em)
df_dd_em$percent <- ave(df_dd_em$emergence, FUN = function(df_dd_em) 100*(df_dd_em)/46)
df_dd_em$cum_per <- ave(df_dd_em$cum_em, FUN = function(df_dd_em) 100*(df_dd_em)/46)
x <- pweibull(df_dd_em[c(1,3)],shape=5)
dframe2.mle <- fitdist(x, "weibull",method='mle')
Here's my best guess at what you're after:
Set up data:
dd <- data.frame(degree_days=c(998.08,1039.66,1111.29,1165.89,1236.53,1293.71,
1347.66,1387.76,1445.47,1493.44,1553.23,1601.97,
1670.28,1737.29,1791.94,1849.20,1920.91,1967.25,
2036.64,2091.85,2152.89,2199.13,2199.13,2263.09,
2297.94,2352.39,2384.03,2442.44,2541.28,2663.90,
2707.36,2773.82,2816.39,2863.94),
emergence=c(0,0,0,1,1,0,2,3,17,10,0,0,0,2,0,3,0,0,1,5,0,0,0,0,
0,0,0,0,1,0,0,0,0,0))
dd <- transform(dd,cum_em=cumsum(emergence))
We're actually going to fit to an "interval-censored" distribution (i.e. probability of emergence between successive degree day observations: this version assumes that the first observation refers to observations before the first degree-day observation, you could change it to refer to observations after the last observation).
library(bbmle)
## y*log(p) allowing for 0/0 occurrences:
y_log_p <- function(y,p) ifelse(y==0 & p==0,0,y*log(p))
NLLfun <- function(scale,shape,x=dd$degree_days,y=dd$emergence) {
prob <- pmax(diff(pweibull(c(-Inf,x), ## or (c(x,Inf))
shape=shape,scale=scale)),1e-6)
## multinomial probability
-sum(y_log_p(y,prob))
}
library(bbmle)
I should probably have used something more systematic like the method of moments (i.e. matching the mean and variance of a Weibull distribution with the mean and variance of the data), but I just hacked around a bit to find plausible starting values:
## preliminary look (method of moments would be better)
scvec <- 10^(seq(0,4,length=101))
plot(scvec,sapply(scvec,NLLfun,shape=1))
It's important to use parscale to let R know that the parameters are on very different scales:
startvals <- list(scale=1000,shape=1)
m1 <- mle2(NLLfun,start=startvals,
control=list(parscale=unlist(startvals)))
Now try with a three-parameter Weibull (as originally requested) -- requires only a slight modification of what we already have:
library(FAdist)
NLLfun2 <- function(scale,shape,thres,
x=dd$degree_days,y=dd$emergence) {
prob <- pmax(diff(pweibull3(c(-Inf,x),shape=shape,scale=scale,thres)),
1e-6)
## multinomial probability
-sum(y_log_p(y,prob))
}
startvals2 <- list(scale=1000,shape=1,thres=100)
m2 <- mle2(NLLfun2,start=startvals2,
control=list(parscale=unlist(startvals2)))
Looks like the three-parameter fit is much better:
library(emdbook)
AICtab(m1,m2)
## dAIC df
## m2 0.0 3
## m1 21.7 2
And here's the graphical summary:
with(dd,plot(cum_em~degree_days,cex=3))
with(as.list(coef(m1)),curve(sum(dd$emergence)*
pweibull(x,shape=shape,scale=scale),col=2,
add=TRUE))
with(as.list(coef(m2)),curve(sum(dd$emergence)*
pweibull3(x,shape=shape,
scale=scale,thres=thres),col=4,
add=TRUE))
(could also do this more elegantly with ggplot2 ...)
These don't seem like spectacularly good fits, but they're sane. (You could in principle do a chi-squared goodness-of-fit test based on the expected number of emergences per interval, and accounting for the fact that you've fitted a three-parameter model, although the values might be a bit low ...)
Confidence intervals on the fit are a bit of a nuisance; your choices are (1) bootstrapping; (2) parametric bootstrapping (resample parameters assuming a multivariate normal distribution of the data); (3) delta method.
Using bbmle::mle2 makes it easy to do things like get profile confidence intervals:
confint(m1)
## 2.5 % 97.5 %
## scale 1576.685652 1777.437283
## shape 4.223867 6.318481
dd <- data.frame(degree_days=c(998.08,1039.66,1111.29,1165.89,1236.53,1293.71,
1347.66,1387.76,1445.47,1493.44,1553.23,1601.97,
1670.28,1737.29,1791.94,1849.20,1920.91,1967.25,
2036.64,2091.85,2152.89,2199.13,2199.13,2263.09,
2297.94,2352.39,2384.03,2442.44,2541.28,2663.90,
2707.36,2773.82,2816.39,2863.94),
emergence=c(0,0,0,1,1,0,2,3,17,10,0,0,0,2,0,3,0,0,1,5,0,0,0,0,
0,0,0,0,1,0,0,0,0,0))
dd$cum_em <- cumsum(dd$emergence)
dd$percent <- ave(dd$emergence, FUN = function(dd) 100*(dd)/46)
dd$cum_per <- ave(dd$cum_em, FUN = function(dd) 100*(dd)/46)
dd <- transform(dd)
#start 3 parameter model
library(FAdist)
## y*log(p) allowing for 0/0 occurrences:
y_log_p <- function(y,p) ifelse(y==0 & p==0,0,y*log(p))
NLLfun2 <- function(scale,shape,thres,
x=dd$degree_days,y=dd$percent) {
prob <- pmax(diff(pweibull3(c(-Inf,x),shape=shape,scale=scale,thres)),
1e-6)
## multinomial probability
-sum(y_log_p(y,prob))
}
startvals2 <- list(scale=1000,shape=1,thres=100)
m2 <- mle2(NLLfun2,start=startvals2,
control=list(parscale=unlist(startvals2)))
summary(m2)
#graphical summary
windows(5,5)
with(dd,plot(cum_per~degree_days,cex=3))
with(as.list(coef(m2)),curve(sum(dd$percent)*
pweibull3(x,shape=shape,
scale=scale,thres=thres),col=4,
add=TRUE))

Using anova() on gamma distributions gives seemingly random p-values

I am trying to determine whether there is a significant difference between two Gamm distributions. One distribution has (shape, scale)=(shapeRef,scaleRef) while the other has (shape, scale)=(shapeTarget,scaleTarget). I try to do analysis of variance with the following code
n=10000
x=rgamma(n, shape=shapeRef, scale=scaleRef)
y=rgamma(n, shape=shapeTarget, scale=scaleTarget)
glmm1 <- gam(y~x,family=Gamma(link=log))
anova(glmm1)
The resulting p values keep changing and can be anywhere from <0.1 to >0.9.
Am I going about this the wrong way?
Edit: I use the following code instead
f <- gl(2, n)
x=rgamma(n, shape=shapeRef, scale=scaleRef)
y=rgamma(n, shape=shapeTarget, scale=scaleTarget)
xy <- c(x, y)
anova(glm(xy ~ f, family = Gamma(link = log)),test="F")
But, every time I run it I get a different p-value.
You will indeed get a different p-value every time you run this, if you pick different realizations every time. Just like your data values are random variables, which you'd expect to vary each time you ran an experiment, so is the p-value. If the null hypothesis is true (which was the case in your initial attempts), then the p-values will be uniformly distributed between 0 and 1.
Function to generate simulated data:
simfun <- function(n=100,shapeRef=2,shapeTarget=2,
scaleRef=1,scaleTarget=2) {
f <- gl(2, n)
x=rgamma(n, shape=shapeRef, scale=scaleRef)
y=rgamma(n, shape=shapeTarget, scale=scaleTarget)
xy <- c(x, y)
data.frame(xy,f)
}
Function to run anova() and extract the p-value:
sumfun <- function(d) {
aa <- anova(glm(xy ~ f, family = Gamma(link = log),data=d),test="F")
aa["f","Pr(>F)"]
}
Try it out, 500 times:
set.seed(101)
r <- replicate(500,sumfun(simfun()))
The p-values are always very small (the difference in scale parameters is easily distinguishable), but they do vary:
par(las=1,bty="l") ## cosmetic
hist(log10(r),col="gray",breaks=50)

Resources