Computing a ridge estimate manually in R

I'm trying to learn about ridge regression, and I am using R. From what I understand, beta.r1 and beta.r2 in the code below should be the same:
library(MASS)
n=50
v1=runif(n)
v2=v1+2
V=cbind(1,v1,v2)
w=3+v1+v2
I=diag(3)
lambda=2 #arbitrarily chosen
beta.r1=solve(t(V)%*%V+lambda*I)%*%t(V)%*%w
#Using library(MASS)
fit=lm.ridge(w~v1+v2,lambda=2, Inter=FALSE)
beta.r2=coef(fit)
#Shouldn't beta.r1 and beta.r2 be the same?

I think it is the variable scaling performed inside lm.ridge (which you can inspect by typing lm.ridge at your R console) that causes the differences. The code scales each variable by its root-mean-square value:
Xscale <- drop(rep(1/n, n) %*% X^2)^0.5
X <- X/rep(Xscale, rep(n, p))
Your code does not perform any variable scaling.
The variable scaling is hinted at on the ?lm.ridge help page in the description of what is returned by lm.ridge:
scales: scalings used on the X matrix.
Therefore you can access the scaling used by lm.ridge:
fit$scales
# v1 v2
# 0.2650311 0.2650311
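To see where this comes from, you can reproduce the scaling step yourself. A minimal sketch (if I read the lm.ridge source correctly, it also centres the variables before scaling because the formula contains an intercept, which your manual computation does not do either):
X <- cbind(v1, v2)
Xc <- scale(X, center = TRUE, scale = FALSE)   # lm.ridge centres the predictors first
Xscale <- drop(rep(1/n, n) %*% Xc^2)^0.5       # same line as in the lm.ridge source
all.equal(unname(Xscale), unname(fit$scales))  # should be TRUE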

Related

Why do R and PROCESS give different results for a mediation model (one is significant, the other is not)?

As a newcomer who has just started with R, I am confused about the result of a mediation analysis.
My model is simple: IV 'T1Incivi', mediator 'T1Envied', DV 'T2PSRB'. I ran the same model in SPSS using PROCESS, and the indirect effect was not significant there; in R, however, it is significant. Since I am not that familiar with R, could you please help me see if there is anything wrong with my code, and tell me why the result is significant in R but not in SPSS? Thanks a bunch!
My code in R:
# X predicts M
apath <- lm(T1Envied ~ T1Incivi, data = dat)
summary(apath)
# X and M predict Y
bpath <- lm(T2PSRB ~ T1Envied + T1Incivi, data = dat)
summary(bpath)
# Bootstrapping for the indirect effect
getindirect <- function(dataset, random){
  d <- dataset[random, ]
  apath <- lm(T1Envied ~ T1Incivi, data = d)
  bpath <- lm(T2PSRB ~ T1Envied + T1Incivi, data = dat)
  indirect <- apath$coefficients["T1Incivi"] * bpath$coefficients["T1Envied"]
  return(indirect)
}
library(boot)
set.seed(6452234)
Ind1 <- boot(data = dat,
             statistic = getindirect,
             R = 5000)
boot.ci(Ind1,
        conf = .95,
        type = "norm")
In your function getindirect, all of the regressions should be fit on the freshly resampled data in d.
However, there is the line
bpath <- lm(T2PSRB ~ T1Envied + T1Incivi, data = dat)
which refers to the original data frame dat; dat should not be used inside this function at all. That alone can explain the inconsistent results.
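A corrected sketch of the bootstrap statistic, with both regressions fit on the resampled data d:
getindirect <- function(dataset, random){
  d <- dataset[random, ]
  apath <- lm(T1Envied ~ T1Incivi, data = d)
  bpath <- lm(T2PSRB ~ T1Envied + T1Incivi, data = d)  # was data = dat
  apath$coefficients["T1Incivi"] * bpath$coefficients["T1Envied"]
}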

Allowing for aliased coefficients when running `grangertest()` in R

I'm currently trying to run a Granger causality analysis in R/RStudio. I am receiving errors about aliased coefficients when using the function grangertest(). From my understanding, this occurs because there is perfect multicollinearity between the variables.
Because I have a very large number of pairwise comparisons (200+), I would like to simply run the Granger test with the aliased coefficients as normal rather than getting an error. According to one answer here, the solution is (or was) to set singular.ok=TRUE, but either I am doing it incorrectly or the answer is out of date. I've tried checking the documentation, but have come up empty. Any help would be appreciated.
library(lmtest)
x <- c(0,1,2,3)
y <- c(0,3,6,9)
grangertest(x,y,1) # I want this to run successfully even if there are aliased coefficients.
grangertest(x,y,1, singular.ok=TRUE) # this also doesn't work
"Error in waldtest.lm(fm, 2, ...) :
there are aliased coefficients in the model"
Additionally is there a way to flag x and y are actually aliased variables? There seem to be some answers like here but I'm having issues getting it working properly.
alias((x~ y))
Thanks in advance.
After some investigation and emailing the maintainer of the lmtest package (which provides grangertest), I was sent the solution below. It should run on aliased variables where grangertest() does not, and when the variables are not aliased it should give the same values as the normal Granger test.
library(lmtest)
library(dynlm)
# Some data that is multicollinear
x <- c(0, 1, 2, 3, 4)
y <- c(0, 3, 6, 9, 12)
# Some data that is not multicollinear
# x <- c(0, 125, 200, 230, 777)
# y <- c(0, 3, 6, 9, 200)
# Convert to time series (this is an important step)
x <- ts(x)
y <- ts(y)
# This will run even when the data is multicollinear (but also when it is not),
# and is functionally the same as running the Granger test (which by default uses a Wald test)
m1 <- dynlm(x ~ L(x, 1:1) + L(y, 1:1))
m2 <- dynlm(x ~ L(x, 1:1))
result <- anova(m1, m2, test = "F")
# This will fail if the data is multicollinear or aliased, but should otherwise
# give the same results as the anova above (F value, p value, etc.)
# grangertest(y, x, 1)
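If you are running this over 200+ pairs, you will probably want to pull the F statistic and p-value out of the model comparison programmatically. A small sketch, assuming the standard anova() column names (note that for the perfectly collinear toy data above the values may be degenerate, so try it on real data):
# The second row of the anova table holds the comparison statistics
result$F[2]
result[["Pr(>F)"]][2]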

predict.lm with arbitrary coefficients in R

I'm trying to predict an lm object using predict.lm. However, I would like to use manually inserted coefficients.
To do this I tried:
model$coefficients <- coeff
(where "coeff" is a vector of correct coefficients)
which would indeed modify the coefficients as I want. Nevertheless, when I execute
predict.lm(model, new.data)
I just get predictions calculated with the "old" parameters. Is there a way I could force predict.lm to use the new ones?
Post Scriptum: I need to do this to fit a bin-smooth (also called regressogram).
In addition, when I predict "by hand" (i.e. using matrix multiplication) the results are fine, so I'm quite sure the problem lies in predict.lm not recognizing my new coefficients.
Thanks in advance for the help!
Hacking the $coefficients element does indeed seem to work. Can you show what doesn't work for you?
dd <- data.frame(x=1:5,y=1:5)
m1 <- lm(y~x,dd)
m1$coefficients <- c(-2,1)
m1
## Call:
## lm(formula = y ~ x, data = dd)
##
## Coefficients:
## [1] -2 1
predict(m1,newdata=data.frame(x=7)) ## 5 = -2+1*7
predict.lm(...) gives the same results.
I would be very careful with this approach, checking each time you do something different with the hacked model.
It would be nice if predict and simulate methods took a newparams argument, but in general they don't ...
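As a quick sanity check of the "predict by hand" comparison mentioned in the question, you can verify that the hacked model and a manual matrix multiplication agree (a minimal sketch):
newdat <- data.frame(x = 7)
manual <- cbind(1, newdat$x) %*% m1$coefficients   # -2 + 1*7 = 5
all.equal(drop(manual), unname(predict(m1, newdata = newdat)))  # should be TRUE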

kernel matrix computation outside SVM training in kernlab

I was developing a new algorithm that generates a modified kernel matrix for training an SVM and encountered a strange problem.
For testing purposes I was comparing the SVM models learned using the kernelMatrix interface and the normal kernel interface. For example,
# Model with kernelMatrix computation within ksvm
svp1 <- ksvm(x, y, type="C-svc", kernel=vanilladot(), scaled=F)
# Model with kernelMatrix computed outside ksvm
K <- kernelMatrix(vanilladot(), x)
svp2 <- ksvm(K, y, type="C-svc")
identical(nSV(svp1), nSV(svp2))
Note that I have turned scaling off, as I am not sure how to perform scaling on a kernel matrix.
From my understanding, both svp1 and svp2 should return the same model. However, I observed that this is not true for a few datasets, for example glass0 from KEEL.
What am I missing here?
I think this has to do with the same issue posted here. kernlab appears to treat the ksvm calculation differently when vanilladot() is used explicitly, because its class is 'vanillakernel' instead of 'kernel'.
If you define your own vanilla-dot kernel with class 'kernel' instead of 'vanillakernel', the code will be equivalent for both:
kfunction.k <- function() {
  k <- function(x, y) crossprod(x, y)
  class(k) <- "kernel"
  k
}
l <- 0.1; C <- 1/(2*l)
svp1 <- ksvm(x, y, type = "C-svc", kernel = kfunction.k(), scaled = FALSE)
K <- kernelMatrix(kfunction.k(), x)
svp2 <- ksvm(K, y, type = "C-svc", kernel = "matrix", scaled = FALSE)
identical(nSV(svp1), nSV(svp2))
It's worth noting that svp1 and svp2 are both different from their values in the original code because of this change.
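If you want to convince yourself that the hand-rolled kernel computes the same values as vanilladot() and that only the class attribute differs, a quick check (a sketch, assuming x is the same data matrix used above):
library(kernlab)
# The two kernel matrices should be numerically identical; only the class
# of the kernel function differs
max(abs(kernelMatrix(vanilladot(), x) - kernelMatrix(kfunction.k(), x)))  # expect 0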

How to extract info from package in R and use in function?

I apologize for the vague question title. What I want to do is run a regression in R using geeglm from the geepack package, then use information from that fit to calculate a quasi-likelihood information criterion (QIC; Pan 2001). I can do this fairly easily for single models, but I would like to write a general function that can handle a variety of different model types. I guess my real question is whether there is a better alternative than a long series of nested ifelse statements?
Here's my current code:
library(geepack)
data(dietox) #data from the geepack package
# Run gee regression
dietox$Cu <- as.factor(dietox$Cu)
mf <- formula(Weight ~ Cu * (Time + I(Time^2) + I(Time^3)))
gee1 <- geeglm(mf, data = dietox, id = Pig, family = gaussian, corstr = "ar1")
Then I can run a function to calculate the quasilikelihood:
QlogLik.normal <- function(model.R) {
  library(MASS)
  mu.R <- model.R$fitted.values
  y <- model.R$y
  # Quasi-likelihood for the normal distribution
  quasi.R <- sum(((y - mu.R)^2)/-2)
  quasi.R
}
However, I would like to write a function that is more general, because the quasi-likelihood function is different for every distribution. The above function works for gee1 because it has a gaussian (normal) distribution. If I wanted to generalize it to a variety of distributions I could use a series of nested ifelse statements (below), but I don't know if this is the best way to do it. Does anyone have other options or a better solution? This just doesn't seem very elegant, to say the least (clearly I don't have much programming or R experience).
QlogLik <- function(model.R) {
  library(MASS)
  mu.R <- model.R$fitted.values
  y <- model.R$y
  ifelse(model.R$modelInfo$variance == "poisson",
         # Quasi-likelihood for Poisson
         quasi.R <- sum((y*log(mu.R)) - mu.R),
         ifelse(model.R$modelInfo$variance == "gaussian",
                # Quasi-likelihood for normal
                quasi.R <- sum(((y - mu.R)^2)/-2),
                ifelse(model.R$modelInfo$variance == "binomial",
                       # Quasi-likelihood for binomial
                       quasi.R <- sum(y*log(mu.R/(1 - mu.R)) + log(1 - mu.R)),
                       quasi.R <- "Error: distribution not recognized")))
  quasi.R
}
In this example, I used the model output from geeglm to extract the type of distribution used to model the variance
model.R$modelInfo$variance
but there may be other ways to determine what distribution was used in the geeglm model. Any help would be appreciated.
You should be able to rewrite your function like this:
QlogLik <- function(model.R) {
  library(MASS)
  mu.R <- model.R$fitted.values
  y <- model.R$y
  type <- family(model.R)$family
  switch(type,
         poisson = sum((y*log(mu.R)) - mu.R),
         gaussian = sum(((y - mu.R)^2)/-2),
         binomial = sum(y*log(mu.R/(1 - mu.R)) + log(1 - mu.R)),
         stop("Error: distribution not recognized"))
}
As #baptise points out, switch is useful in these cases. You use family(model.R)$family to automatically detect which family type should be passed to switch.
Also, if your commands for what to do in different cases run beyond one line, you can wrap the lines with curly brackets ({ do something here }) instead.
switch(type,
       type1 = { something <- do(this)
                 thisis(something) },
       type2 = do(that))
I hope this helps!
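For example, with the gee1 model fitted above (family = gaussian), the rewritten function can be called directly; a quick usage sketch:
QlogLik(gee1)  # family(gee1)$family is "gaussian", so this uses the gaussian branch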
You may also use model.R$family$family, which gives the type of distribution used to model the variance, but I don't know of a way to eliminate those ifelse statements entirely. The quasi-likelihood quasi.R in your code differs among distributions, so you have to define each of them separately.
By the way, it is a good question and thanks for posting it: I have had similar situations in the past, and hope to get some advice on how to write such code more efficiently.
