Ridge regression within a loop - r

I am new to coding, so I still struggle with simple things such as loops, subsetting, and data frames vs. matrices.
I am trying to fit a ridge regression for a multivariable X (X1 = Marker 1, X2 = Marker 2, X3 = Marker 3, ..., X1333 = Marker 1333), shown in the first image, as a predictor of Y, shown in the second image.
I want to compute the sum of the squared errors (SSE) for varying tuning parameter λ (between 1 and 20). My code is the following:
#install.packages("MASS")
library(MASS)
fitridge <- function(x, y) {
  fridge <- lm.ridge(y ~ x, lambda = seq(0, 20, 2)) # fit a ridge regression for a grid of lambda values
  sum(residuals(fridge)^2)                          # this gives the SSE
}
all_gcv <- apply(as.matrix(genmark_new), 2, fitridge, y = as.matrix(coleslev_new))
However, it returns this error, and I don't know what to do anymore. I have tried converting the data set into a matrix, a data frame, changing the order of rows and columns...
Error in colMeans(X[, -Inter]) : 'x' must be an array of at least two dimensions.
I would just like to take each set of marker values (first picture), pass it to my fitridge function, which fits a ridge regression against the Y from the second data set (second picture), and then extract the SSEs and their corresponding lambda values.

You cannot fit a ridge regression with only one independent variable; it is not meant for that. In your case, you most likely want to do something like the following (using simulated data with the same structure as yours):
# simulate data with the same structure as in the question
genmark_new <- data.frame(matrix(sample(0:1, 1333 * 100, replace = TRUE), ncol = 1333))
colnames(genmark_new) <- paste0("Marker_", 1:ncol(genmark_new))
coleslev_new <- data.frame(NormalizedCholesterol = rnorm(100))
Y <- coleslev_new$NormalizedCholesterol

library(MASS)
# one ridge fit on all markers at once, over a grid of lambda values
fit <- lm.ridge(y ~ ., data = data.frame(genmark_new, y = Y), lambda = seq(0, 20, 2))
And calculate the sum of squared residuals for each lambda:
apply(fit$coef, 2, function(i) sum((Y - as.matrix(genmark_new) %*% i)^2))
       0        2        4        6        8       10       12       14
26.41866 27.88029 27.96360 28.04675 28.12975 28.21260 28.29530 28.37785
      16       18       20
28.46025 28.54250 28.62459
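If, as in the question, you then want to pull out the lambda that gives the smallest SSE, you can store the SSE vector computed above and index into it; lm.ridge also reports generalized cross-validation (GCV) scores, which are commonly used to pick lambda instead of the raw SSE:
sse <- apply(fit$coef, 2, function(i) sum((Y - as.matrix(genmark_new) %*% i)^2))

# lambda value giving the smallest SSE
fit$lambda[which.min(sse)]

# GCV scores per lambda, as reported by lm.ridge
fit$GCV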
If you need to fit each variable separately, you can consider using a linear model:
fitlm <- function(x, y) {
  fridge <- lm(y ~ x)
  sum(residuals(fridge)^2)
}
all_gcv <- apply(genmark_new, 2, fitlm, y = Y)
A suggestion: check out some notes or introductions to ridge regression; it is meant for multivariate regression, that is, more than one independent variable.

Related

Standardization and inclusion of intercept in sparse lasso GLM

I ran into some problems while practicing the sparse group lasso method using the cvSGL function from the SGL package.
My questions are as follows:
Looking at the code for SGL:::center_scale, it doesn't seem to consider the sample size of the data.
SGL:::center_scale
#Output
function (X, standardize) {
    means <- apply(X, 2, mean)
    X <- t(t(X) - means)
    X.transform <- list(X.means = means)
    if (standardize == TRUE) {
        var <- apply(X, 2, function(x) (sqrt(sum(x^2))))
        X <- t(t(X)/var)
        X.transform$X.scale <- var
    }
    else {
        X.transform$X.scale <- 1
    }
    return(list(x = X, X.transform = X.transform))
}
The divisor it uses, sqrt(sum(x^2)), is therefore somewhat larger than the sample standard deviation of the predictor variable.
Is my understanding correct that this may cause the coefficients to be excessively large?
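As a small check of this point (base R only, not the SGL package): for a centered vector, sqrt(sum(x^2)) equals sd(x) * sqrt(n - 1), so the divisor grows with the sample size relative to the usual sample standard deviation.
set.seed(1)
n <- 100
x <- rnorm(n, sd = 2)
x <- x - mean(x)        # centered, as center_scale does

sqrt(sum(x^2))          # divisor used when standardize = TRUE
sd(x) * sqrt(n - 1)     # the same value: the sample sd inflated by sqrt(n - 1)
sd(x)                   # the usual sample standard deviation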
Can the model be estimated by the SGL package with an intercept (constant) term?
The SGL package does not seem to provide a way to estimate a model that includes an intercept term.
In cvFit[["fit"]] I can only see the betas of the predictor variables for each lambda, with no constant term. The value of cvFit[["fit"]][["intercept"]] is simply the mean of the y variable.
The intercept could be estimated by adding a column of 1s as the first column of the predictor matrix X, but in that case problems are expected when centering and standardizing the predictors.
In addition, the SGL package seems to apply the penalty to all predictor variables, so even if the estimation is performed with a column of 1s added to X as described above, the constant term may be shrunk to 0.

AR(1) simulation with non-zero mean

I can't seem to find the correct way to simulate an AR(1) time series with a mean that is not zero.
I need 53 data points, rho = .8, mean = 300.
However, arima.sim(list(order=c(1,0,0), ar=.8), n=53, mean=300, sd=21)
gives me values in the 1500s. For example:
1480.099 1480.518 1501.794 1509.464 1499.965 1489.545 1482.367 1505.103 (and so on)
I have also tried arima.sim(n=52, model=list(ar=c(.8)), start.innov=300, n.start=1)
but then it just counts down like this:
238.81775870 190.19203239 151.91292491 122.09682547 96.27074057 77.17105923 63.15148491 50.04211711 39.68465916 32.46837830 24.78357345 21.27437183 15.93486092 13.40199333 10.99762449 8.70208879 5.62264196 3.15086491 2.13809323 1.30009732
and I have tried arima.sim(list(order=c(1,0,0), ar=.8), n=53,sd=21) + 300 which seems to give a correct answer. For example:
280.6420 247.3219 292.4309 289.8923 261.5347 279.6198 290.6622 295.0501
264.4233 273.8532 261.9590 278.0217 300.6825 291.4469 291.5964 293.5710
285.0330 274.5732 285.2396 298.0211 319.9195 324.0424 342.2192 353.8149
and so on..
However, I am in doubt whether this is doing the correct thing. Is the series still autocorrelated with the correct coefficient?
Your last option is okay to get the desired mean, "mu". It generates data from the model:
(y[t] - mu) = phi * (y[t-1] - mu) + epsilon[t],   epsilon[t] ~ N(0, sigma = 21),   t = 1, 2, ..., n.
Your first approach sets an intercept, "alpha", rather than a mean:
y[t] = alpha + phi * y[t-1] + epsilon[t].
With alpha = 300 and phi = 0.8, the implied mean of the stationary series is alpha / (1 - phi) = 1500, which is why your simulated values are in the 1500s.
Your second option sets the starting value y[0] equal to 300. As long as |phi|<1 the influence of this initial value will vanish after a few periods and will have no effect
on the level of the series.
Edit
The value of the standard deviation that you observe in the simulated data is correct. Be aware that the variance of the AR(1) process, y[t], is not equal to the variance of the innovations, epsilon[t]. The variance of the AR(1) process, sigma^2_y, can be obtained as follows:
Var(y[t]) = Var(alpha) + phi^2 * Var(y[t-1]) + Var(epsilon[t])
As the process is stationary, Var(y[t]) = Var(y[t-1]), which we call sigma^2_y. Thus, we get:
sigma^2_y = 0 + phi^2 * sigma^2_y + sigma^2_epsilon
sigma^2_y = sigma^2_epsilon / (1 - phi^2)
For the values of the parameters that you are using you have:
sigma_y = sqrt(21^2 / (1 - 0.8^2)) = 35.
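A quick simulation check of this (plain arima.sim with a large n just to stabilise the estimates):
set.seed(1)
y <- arima.sim(list(order = c(1, 0, 0), ar = 0.8), n = 1e5, sd = 21) + 300
mean(y)   # close to 300
sd(y)     # close to 21 / sqrt(1 - 0.8^2) = 35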
Use the rGARMA function in the ts.extend package
You can generate random vectors from any stationary Gaussian ARMA model using the ts.extend package. This package generates random vectors directly from the multivariate normal distribution using the computed autocorrelation matrix for the random vector, so it gives random vectors from the exact distribution and does not require "burn-in" iterations. Here is an example of generating multiple independent time-series vectors, all from an AR(1) model.
#Load the package
library(ts.extend)
#Set parameters
MEAN <- 300
ERRORVAR <- 21^2
AR <- 0.8
m <- 53
#Generate n = 16 random vectors from this model
set.seed(1)
SERIES <- rGARMA(n = 16, m = m, mean = MEAN, ar = AR, errorvar = ERRORVAR)
#Plot the series using ggplot2 graphics
library(ggplot2)
plot(SERIES)
As you can see, the generated time-series vectors in this plot use the appropriate mean and error variance that were specified in the inputs.

Using anova() on gamma distributions gives seemingly random p-values

I am trying to determine whether there is a significant difference between two Gamma distributions. One distribution has (shape, scale) = (shapeRef, scaleRef) while the other has (shape, scale) = (shapeTarget, scaleTarget). I try to do an analysis of variance with the following code:
library(mgcv)  # gam() is not in base R; the mgcv package is assumed here
# shapeRef, scaleRef, shapeTarget and scaleTarget are assumed to be defined already
n <- 10000
x <- rgamma(n, shape = shapeRef, scale = scaleRef)
y <- rgamma(n, shape = shapeTarget, scale = scaleTarget)
glmm1 <- gam(y ~ x, family = Gamma(link = log))
anova(glmm1)
The resulting p values keep changing and can be anywhere from <0.1 to >0.9.
Am I going about this the wrong way?
Edit: I use the following code instead
f <- gl(2, n)
x=rgamma(n, shape=shapeRef, scale=scaleRef)
y=rgamma(n, shape=shapeTarget, scale=scaleTarget)
xy <- c(x, y)
anova(glm(xy ~ f, family = Gamma(link = log)),test="F")
But, every time I run it I get a different p-value.
You will indeed get a different p-value every time you run this, if you pick different realizations every time. Just like your data values are random variables, which you'd expect to vary each time you ran an experiment, so is the p-value. If the null hypothesis is true (which was the case in your initial attempts), then the p-values will be uniformly distributed between 0 and 1.
Function to generate simulated data:
simfun <- function(n = 100, shapeRef = 2, shapeTarget = 2,
                   scaleRef = 1, scaleTarget = 2) {
  f <- gl(2, n)
  x <- rgamma(n, shape = shapeRef, scale = scaleRef)
  y <- rgamma(n, shape = shapeTarget, scale = scaleTarget)
  xy <- c(x, y)
  data.frame(xy, f)
}
Function to run anova() and extract the p-value:
sumfun <- function(d) {
  aa <- anova(glm(xy ~ f, family = Gamma(link = log), data = d), test = "F")
  aa["f", "Pr(>F)"]
}
Try it out, 500 times:
set.seed(101)
r <- replicate(500,sumfun(simfun()))
The p-values are always very small (the difference in scale parameters is easily distinguishable), but they do vary:
par(las=1,bty="l") ## cosmetic
hist(log10(r),col="gray",breaks=50)
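To see the behaviour described above under the null hypothesis, you can rerun the same machinery with identical Gamma distributions in both groups (a minimal sketch reusing simfun and sumfun); the p-values should then be roughly uniform on (0, 1):
set.seed(102)
r0 <- replicate(500, sumfun(simfun(scaleTarget = 1)))  # both groups now Gamma(shape = 2, scale = 1)
hist(r0, col = "gray", breaks = 20)
mean(r0 < 0.05)   # should be close to 0.05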

Solution of varying coefficients ODE

I have a set of observed raw data and use a 2nd-order ODE to fit the data:
y'' + b1(t) y' + b0(t) y = 0
The coefficients b1 and b0 are time-dependent, and I use principal differential analysis (PDA) (R package fda, function pda.fd) to get estimates of b1(t) and b0(t).
To check the validity of the estimates of b1(t) and b0(t), I use a collocation method (R package bvpSolve, function bvpcol) to get the numerical solution of the ODE and compare that solution with a smooth curve fit of the raw data.
My problem is that the numerical solution from bvpcol captures the shape of the fitted curve but not the values of the function; the two differ by some constant multiple.
(Since I am not allowed to post images, please see the link for the figure.)
See the figure of my output. The gray dots are my raw data, the red line is the Fourier expansion of the raw data, the green line is the numerical solution from bvpcol, and the blue line is the green line divided by 1.62. The green line captures the shape, but its values are a constant multiple of the Fourier expansion.
I fit several other data sets and see a similar situation, but with a different constant each time. Is this a problem with the numerical solution of the ODE or something else, and how can I solve it to get good agreement between the numerical solution (green) and the true Fourier expansion?
Any help and idea is appreciated!
Here are the raw data and code:
RData is here
library(fda)
library(bvpSolve)
# load the data
load('y.RData')
tvec = 1:length(y)
tvec = (tvec-min(tvec))/(max(tvec)-min(tvec))
# create basis
# NOTE: nbasis for the Fourier basis is assumed to have been defined earlier
fbasis = create.fourier.basis(c(0,1), nbasis = nbasis)
bbasis = create.bspline.basis(c(0,1), norder = 8, nbasis = 47)
bfdPar = fdPar(bbasis)
yfd = smooth.basis(tvec,y,fbasis)$fd
yfdlist = list(yfd)
bwtlist = rep(list(bfdPar),2)
# PDA fit
bwt = pda.fd(yfdlist,bwtlist)$bwtlist
# output of estimated coefficients
beta0.fd<-bwt[[1]]$fd
beta1.fd<-bwt[[2]]$fd
# define the vary-coef function in terms of t
fbeta0<-function(t)eval.fd(t,beta0.fd)
fbeta1<-function(t)eval.fd(t,beta1.fd)
# define 2nd order ODE
fun2 <- function(t, y, pars) {
  with(as.list(c(y, pars)), {
    beta0 = pars[[1]]
    beta1 = pars[[2]]
    dy1 = y[2]
    dy2 = -beta1(t) * y[2] - beta0(t) * y[1]
    return(list(c(dy1, dy2)))
  })
}
# BVP (p1 is assumed to be the vector of observed raw data, same length as tvec)
yinit <- c(p1[1], NA)
yend <- c(p1[length(p1)], NA)
t <- seq(tvec[1], tvec[length(tvec)], 0.005)
col <- bvpcol(yini = yinit, yend = yend, x = t, func = fun2,
              parms = c(fbeta0, fbeta1), atol = 1e-5, islin = TRUE)
# plot output
plot(col[,1],col[,2],col='green',type='l')
points(tvec,p1,col='darkgray')
lines(yfd,col='red',lwd=2)
lines(col[,1],col[,2],col='green',type='l')
lines(col[,1],col[,2]/1.62,col='blue',type='l',lwd=2,lty=4)
legend('topleft',col=c('green','darkgray','red','blue'),
legend=c('ODE solution','raw data','basis curve fitting','ODE solution/1.62'),lty=1)

Plot multiple fits and predictions for logistic regression

I am running a logistic regression many times on more than 1000 samples taken from a dataset. My question is: what is the best way to show my results? How can I plot my outputs for both the fit and the prediction curve?
This is an example of what I am doing, using the baseball dataset from R. For example, I want to fit and predict the model 5 times. Each time I take one subsample for the fit and another for the prediction.
library(corrgram)
data(baseball)
#Exclude rows with NA values
dataset=baseball[complete.cases(baseball),]
# Create a vector replacing the League (A or N) by 1 or 0
PA = rep(0, dim(dataset)[1])
PA[which(dataset[,2]=="A")] = 1
# Model whether the player is in league A as a function of Hits, Runs, Errors and Salary
fit_glm_list=list()
prd_glm_list=list()
for (k in 1:5){
  sp = sample(seq(1:length(PA)), 30, replace=FALSE)
  fit_glm <- glm(PA[sp[1:15]] ~ baseball$Hits[sp[1:15]] + baseball$Runs[sp[1:15]] + baseball$Errors[sp[1:15]] + baseball$Salary[sp[1:15]])
  prd_glm <- predict(fit_glm, baseball[sp[16:30], c(6,8,20,21)])
  fit_glm_list[[k]] = fit_glm; prd_glm_list[[k]] = fit_glm
}
There are a number of issues here.
PA is a subset of baseball$League but the model is constructed on columns from the whole baseball data frame, i.e. they do not match.
PA is treated as a continuous response because the default family (gaussian) is used; it should be a factor, modelled with the binomial family.
prd_glm_list[[k]]=fit_glm should probably be prd_glm_list[[k]]=prd_glm
You must save the true class labels for the predictions otherwise you have nothing to compare to.
My take on your code looks like this.
library(corrgram)
data(baseball)
dataset <- baseball[complete.cases(baseball),]
fits <- preds <- truths <- vector("list", 5)
for (k in 1:5){
sp <- sample(nrow(dataset), 30, replace=FALSE)
fits[[k]] <- glm(League ~ Hits + Runs + Errors + Salary,
family="binomial", data=dataset[sp[1:15],])
preds[[k]] <- predict(fits[[k]], dataset[sp[16:30],], type="response")
truths[[k]] <- dataset$League[sp[16:30]]
}
plot(unlist(truths), unlist(preds))
The model performs poorly but at least the code runs without problems. The y-axis in the plot shows the estimated probabilities that the examples belong to league N, i.e. ideally the left box should be close to 0 and the right close to 1.
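If you also want a numeric summary of the predictions, one option (using a hypothetical probability cutoff of 0.5, which is not part of the original code) is to turn the predicted probabilities into class labels and tabulate them against the saved truths:
# classify as league "N" when the predicted probability of N exceeds 0.5
pred_class <- ifelse(unlist(preds) > 0.5, "N", "A")
table(truth = unlist(truths), predicted = pred_class)

# overall accuracy across the 5 repetitions
mean(pred_class == as.character(unlist(truths)))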
