get equation of linear SVM regression line - r

I'm struggling to find a way to get the equation for a linear SVM model in the regression case, since most of the questions deal with classification... I have fit it with caret package.
1- univariate case
set.seed(1)
fit=train(mpg~hp, data=mtcars, method="svmLinear")
plot(x=mtcars$hp, y=predict(fit, mtcars), pch=15)
points(x=mtcars$hp, y=mtcars$mpg, col="red")
abline(lm(mpg~hp, mtcars), col="blue")
Which gives the plot with red=actual, black=fitted, and blue line is classic regression. In this case I know I could manually calculate the SVM prediction line from 2 points, but is there a way to get directly the equation from the model structure? I actually need the equation like this y=a+bx (here mpg=?+?*hp ) with values in the original scale.
2-multivariate
Same question, but with 2 independent variables (mpg~hp+wt)
Thanks,

Yes, I believe there is. Take a look at this answer, which is similar but does not use the caret library. If you add svp = fit$finalModel to the example, you should be able to follow it almost exactly. I applied a similar technique to your data below. I scaled the data so that it fits nicely on the plot of the vectors, since the library scales the data at runtime.
require(caret)
library(kernlab) # provides coef(), alphaindex() and b() for the ksvm object
set.seed(1)
x = model.matrix(data=mtcars, mpg ~ scale(hp)) #set up data
y = mtcars$mpg
fit=train(x, y, method="svmLinear") #train
svp = fit$finalModel #extract s4 model object
plot(x, xlab="", ylab="")
w <- colSums(coef(svp)[[1]] * x[unlist(alphaindex(svp)),])
b <- b(svp)
abline(b/w[1],-w[2]/w[1], col='red')
abline((b+1)/w[1],-w[2]/w[1],lty=2, col='red')
abline((b-1)/w[1],-w[2]/w[1],lty=2, col='red')
And your second question:
x = model.matrix(data=mtcars, mpg ~ scale(hp) + scale(wt) - 1) #set up data
fit=train(x, y, method="svmLinear") #train
svp = fit$finalModel #extract s4 model object
plot(x, xlab="", ylab="")
w <- colSums(coef(svp)[[1]] * x[unlist(alphaindex(svp)),])
b <- b(svp)
abline(b/w[1],-w[2]/w[1], col='red')
abline((b+1)/w[1],-w[2]/w[1],lty=2, col='red')
abline((b-1)/w[1],-w[2]/w[1],lty=2, col='red')
Edit
The above answer concerns plotting a boundary, not the linear SVM regression line. To answer the question, one easy way to get the line is to extract the predicted values and plot the regression. You actually only need a couple of points to get the line, but for simplicity, I used the following code.
abline(lm(predict(fit, newdata=mtcars) ~ mtcars$hp), col='green')
or
abline(lm(predict(fit) ~ mtcars$hp), col='green')
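The same idea gives the equation itself, not just the line on the plot. Below is a minimal sketch for the two-predictor case using base R only. The `stand_in` model is an assumption made so the example runs without caret: it is an ordinary `lm` on scaled predictors, standing in for any fit whose predictions are linear in `hp` and `wt`. With the caret model you would presumably replace `pred` with `predict(fit, newdata = mtcars)`.

```r
# Stand-in for the fitted linear SVM (assumption, not the caret model itself):
# any model whose predictions are a linear function of hp and wt.
stand_in <- lm(mpg ~ scale(hp) + scale(wt), data = mtcars)
pred <- predict(stand_in, newdata = mtcars)

# Regress the predictions on the raw predictors. Because the predictions are
# exactly linear in hp and wt, this fit is exact, and its coefficients are the
# intercept and slopes on the original scale: mpg = a + b1*hp + b2*wt.
eq <- coef(lm(pred ~ hp + wt, data = mtcars))
eq
```

The names of `eq` are `(Intercept)`, `hp` and `wt`, so the equation can be read off directly. The univariate case works the same way with `pred ~ hp`.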


Why abline won't show line from glm with Gamma family?

I have the following data, which I'm trying to model via GLM, using Gamma function. It works, except that abline won't show any line. What am I doing wrong?
y <- c(0.00904977380111,0.009174311972687,0.022573363475789,0.081632653008122,0.005571030584803,1e-04,0.02375296916921,0.004962779106823,0.013729977117333,0.00904977380111,0.004514672640982,0.016528925619835,1e-04,0.027855153258277,0.011834319585449,0.024999999936719,1e-04,0.026809651528869,0.016348773841071,1e-04,0.009345794439034,0.00457665899303,0.004705882305772,0.023201856194357,1e-04,0.033734939711656,0.014251781472007,0.004662004755245,0.009259259166667,0.056872037917387,0.018518518611111,0.014598540145986,0.009478673032951,0.023529411811211,0.004819277060357,0.018691588737881,0.018957345923721,0.005390835525461,0.056179775223141,0.016348773841071,0.01104972381185,0.010928961639344,1e-04,1e-04,0.010869565271444,0.011363636420778,0.016085790883856,0.016,0.005665722322786,0.01117318441372,0.028818443860841,1e-04,0.022988505862069,0.01010101,1e-04,0.018083182676638,0.00904977380111,0.00961538466323,0.005390835525461,0.005763688703004,1e-04,0.005571030584803,1e-04,0.014388489208633,0.005633802760722,0.005633802760722,1e-04,0.005361930241431,0.005698005811966,0.013986013986014,1e-04,1e-04)
x <- c(600,600,600,600,600,600,600,600,600,600,600,600,600,600,600,600,600,600,600,600,3500,3500,3500,3500,3500,3500,3500,3500,3500,3500,3500,3500,3500,3500,3500,3500,3500,3500,3500,3500,3500,3500,3500,3500,3500,3500,3500,3500,3500,3500,3500,3500,3500,3500,3500,3500,744.47,744.47,744.47,744.47,744.47,744.47,744.47,630.42,630.42,630.42,630.42,630.42,630.42,630.42,630.42,630.42)
hist(y,breaks=15)
plot(y~x)
fit <- glm(y~x,family='Gamma'(link='log'))
abline(fit)
abline plots linear functions, from a simple linear regression, say. A GLM with a Gamma family and a log link is nonlinear on the original scale. To visualize the fit of such a model, you could use predict (an example is given below). Several packages (e.g. effects or visreg) for R exist that feature functions that allow you to directly plot the fit on the original scale including confidence intervals.
Here is an example using visreg using your data and model:
library(visreg)
y <- c(0.00904977380111,0.009174311972687,0.022573363475789,0.081632653008122,0.005571030584803,1e-04,0.02375296916921,0.004962779106823,0.013729977117333,0.00904977380111,0.004514672640982,0.016528925619835,1e-04,0.027855153258277,0.011834319585449,0.024999999936719,1e-04,0.026809651528869,0.016348773841071,1e-04,0.009345794439034,0.00457665899303,0.004705882305772,0.023201856194357,1e-04,0.033734939711656,0.014251781472007,0.004662004755245,0.009259259166667,0.056872037917387,0.018518518611111,0.014598540145986,0.009478673032951,0.023529411811211,0.004819277060357,0.018691588737881,0.018957345923721,0.005390835525461,0.056179775223141,0.016348773841071,0.01104972381185,0.010928961639344,1e-04,1e-04,0.010869565271444,0.011363636420778,0.016085790883856,0.016,0.005665722322786,0.01117318441372,0.028818443860841,1e-04,0.022988505862069,0.01010101,1e-04,0.018083182676638,0.00904977380111,0.00961538466323,0.005390835525461,0.005763688703004,1e-04,0.005571030584803,1e-04,0.014388489208633,0.005633802760722,0.005633802760722,1e-04,0.005361930241431,0.005698005811966,0.013986013986014,1e-04,1e-04)
x <- c(600,600,600,600,600,600,600,600,600,600,600,600,600,600,600,600,600,600,600,600,3500,3500,3500,3500,3500,3500,3500,3500,3500,3500,3500,3500,3500,3500,3500,3500,3500,3500,3500,3500,3500,3500,3500,3500,3500,3500,3500,3500,3500,3500,3500,3500,3500,3500,3500,3500,744.47,744.47,744.47,744.47,744.47,744.47,744.47,630.42,630.42,630.42,630.42,630.42,630.42,630.42,630.42,630.42)
fit <- glm(y~x,family='Gamma'(link='log'))
visreg(fit, scale = "response")
And here is the example using R base graphics and predict:
pred_frame <- data.frame(
x = seq(min(x), max(x), length.out = 1000)
)
pred_frame$fit <- predict(fit, newdata = pred_frame, type = "response")
plot(y~x, pch = 16, las = 1, cex = 1.5)
lines(fit~x, data = pred_frame, col = "steelblue", lwd = 3)
You are not being consistent here since you chose to model on the log scale but you are plotting on the raw scale. Mind you many, many published plots do the same. You need to plot the points in log space or transform the coefficients and pass them to abline() explicitly.
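To illustrate the log-space option, here is a small sketch on simulated data (the names `x_sim`, `y_sim` and `fit_sim`, and the simulation parameters, are made up for this example). With a log link the linear predictor is a straight line in log(y), so the fitted coefficients can be passed to abline() once the points are plotted on the log scale.

```r
set.seed(42)

# Simulated Gamma data whose mean is log-linear in x (illustrative values only)
x_sim <- runif(60, 600, 3500)
mu    <- exp(-4 - 3e-4 * x_sim)
y_sim <- rgamma(60, shape = 2, rate = 2 / mu)

fit_sim <- glm(y_sim ~ x_sim, family = Gamma(link = "log"))
cf <- coef(fit_sim)

# On the log scale the fit is a straight line: log(mean) = cf[1] + cf[2] * x
plot(log(y_sim) ~ x_sim)
abline(a = cf[1], b = cf[2], col = "steelblue", lwd = 2)
```

On the raw scale the same coefficients describe the curve exp(cf[1] + cf[2] * x), which is why abline(fit) draws nothing useful there.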

Plotting a 95% confidence interval for a lm object

How can I calculate and plot a confidence interval for my regression in R? So far I have two numerical vectors of equal length (x, y) and a regression object (lm.out). I have made a scatterplot of y given x and added the regression line to this plot. I am looking for a way to add a 95% prediction confidence band for lm.out to the plot. I've tried using the predict function, but I don't even know where to start with that :/. Here is my code at the moment:
x=c(1,2,3,4,5,6,7,8,9,0)
y=c(13,28,43,35,96,84,101,110,108,13)
lm.out <- lm(y ~ x)
plot(x,y)
regression.data = summary(lm.out) #save regression summary as variable
names(regression.data) #get names so we can index this data
a= regression.data$coefficients["(Intercept)","Estimate"] #grab values
b= regression.data$coefficients["x","Estimate"]
abline(a,b) #add the regression line
Thank you!
Edit: I've taken a look at the proposed duplicate and can't quite get to the bottom of it.
You have to use predict for a new vector of data, here newx.
x=c(1,2,3,4,5,6,7,8,9,0)
y=c(13,28,43,35,96,84,101,110,108,13)
lm.out <- lm(y ~ x)
newx = seq(min(x),max(x),by = 0.05)
conf_interval <- predict(lm.out, newdata=data.frame(x=newx), interval="confidence",
level = 0.95)
plot(x, y, xlab="x", ylab="y", main="Regression")
abline(lm.out, col="lightblue")
lines(newx, conf_interval[,2], col="blue", lty=2)
lines(newx, conf_interval[,3], col="blue", lty=2)
EDIT
As mentioned in the comments by Ben, this can be done with matlines as follows:
plot(x, y, xlab="x", ylab="y", main="Regression")
abline(lm.out, col="lightblue")
matlines(newx, conf_interval[,2:3], col = "blue", lty=2)
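Since the question mentions a "prediction" band, it may help to see both kinds of interval side by side. The sketch below reuses the question's x and y; `conf_band`, `pred_band` and `new_df` are names chosen for this example. interval="confidence" describes uncertainty in the fitted mean, while interval="prediction" also accounts for the scatter of new observations and is therefore wider.

```r
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 0)
y <- c(13, 28, 43, 35, 96, 84, 101, 110, 108, 13)
lm.out <- lm(y ~ x)

newx   <- seq(min(x), max(x), by = 0.05)
new_df <- data.frame(x = newx)

# Columns of both matrices are "fit", "lwr" and "upr"
conf_band <- predict(lm.out, newdata = new_df, interval = "confidence", level = 0.95)
pred_band <- predict(lm.out, newdata = new_df, interval = "prediction", level = 0.95)

plot(x, y, xlab = "x", ylab = "y", main = "Regression",
     ylim = range(y, pred_band))
abline(lm.out, col = "lightblue")
matlines(newx, conf_band[, c("lwr", "upr")], col = "blue", lty = 2) # mean response
matlines(newx, pred_band[, c("lwr", "upr")], col = "red",  lty = 3) # new observations
```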
I'm going to add a tip that would have saved me a lot of frustration when trying the method given by @Alejandro Andrade: If your data are in a data frame, then when you build your model with lm(), use the data= argument rather than $ notation. E.g., use
lm.out <- lm(y ~ x, data = mydata)
rather than
lm.out <- lm(mydata$y ~ mydata$x)
If you do the latter, then this statement
predict(lm.out, newdata=data.frame(x=newx), interval="confidence", level = 0.95)
seems to either ignore the new values passed using newdata= or there's a silent error. Either way, the output is the predictions from the original data, not the new data. (The model terms are then named mydata$x and mydata$y, so predict() finds no matching column in newdata and falls back to the original vectors.)
Also, be sure your x variable gets the same name in the new data frame that it had in the original. That's easier to figure out because you do get an error, but knowing it ahead of time might save you a round of debugging.
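A small sketch of the difference, using a made-up data frame `mydata` (the values are illustrative only). The model built with $ notation returns one prediction per original row, whatever is passed as newdata.

```r
mydata <- data.frame(
  x = 1:10,
  y = c(2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1, 18.0, 20.1)
)
newx <- c(11, 12)

good <- lm(y ~ x, data = mydata)   # terms are named y and x
bad  <- lm(mydata$y ~ mydata$x)    # terms are named mydata$y and mydata$x

p_good <- predict(good, newdata = data.frame(x = newx))

# R emits a warning about the row mismatch here; it is suppressed only to keep
# the example output clean.
p_bad <- suppressWarnings(predict(bad, newdata = data.frame(x = newx)))

length(p_good)  # one prediction per new x value
length(p_bad)   # one prediction per ORIGINAL row: newdata was not used
```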
Note: Tried to add this as a comment, but don't have enough reputation points.

`gam` package: extra shift spotted when sketching data on `plot.gam`

I try to fit a GAM using the gam package (I know mgcv is more flexible, but I need to use gam here). I now have the problem that the model looks good, but in comparison with the original data it seems to be offset along the y-axis by a constant value, for which I cannot figure out where this comes from.
This code reproduces the problem:
library(gam)
data(gam.data)
x <- gam.data$x
y <- gam.data$y
fit <- gam(y ~ s(x,6))
fit$coefficients
#(Intercept) s(x, 6)
# 1.921819 -2.318771
plot(fit, ylim = range(y))
points(x, y)
points(x, y -1.921819, col=2)
legend("topright", pch=1, col=1:2, legend=c("Original", "Minus intercept"))
Chambers, J. M. and Hastie, T. J. (1993) Statistical Models in S (Chapman & Hall) shows that there should not be an offset, and this is also intuitively correct (the smooth should describe the data).
I noticed something comparable in mgcv, which can be solved by providing the shift parameter with the intercept value of the model (because the smooth is seemingly centred). I thought the same could be true here, so I subtracted the intercept from the original data-points. However, the plot above shows this idea wrong. I don't know where the extra shift comes from. I hope someone here may be able to help me.
(R version 3.3.1; gam version 1.12)
I think I should first explain various output in the fitted GAM model:
library(gam)
data(gam.data)
x <- gam.data$x
y <- gam.data$y
fit <-gam(y ~ s(x,6), model = FALSE)
## coefficients for parametric part
## this includes intercept and null space of spline
beta <- coef(fit)
## null space of spline smooth (a linear term, just `x`)
nullspace <- fit$smooth.frame[,1]
nullspace - x ## all 0
## smooth space that are penalized
## note, the backfitting procedure guarantees that this is centred
pensmooth <- fit$smooth[,1]
sum(pensmooth) ## centred
# [1] 5.89806e-17
## estimated smooth function (null space + penalized space)
smooth <- nullspace * beta[2] + pensmooth
## centred smooth function (this is what `plot.gam` is going to plot)
c0 <- mean(smooth)
censmooth <- smooth - c0
## additive predictors (this is just fitted values in Gaussian case)
addpred <- beta[1] + smooth
You can first verify that addpred is what fit$additive.predictors gives, and since we are fitting an additive model with a Gaussian response, this is also the same as fit$fitted.values.
What plot.gam does, is to plot censmooth:
plot.gam(fit, col = 4, ylim = c(-1.5,1.5))
points(x, censmooth, col = "gray")
Remember that
addpred = beta[1] + censmooth + c0
If you want to shift the original data y to match this plot, you need to subtract not only the intercept (beta[1]) but also c0 from y:
points(x, y - beta[1] - c0)
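The same centring shows up in base R, which may make the extra shift easier to see without the gam package. This is only an analogue, not the gam internals: the sketch uses a plain lm on simulated data, where predict(type = "terms") returns centred term contributions and stores the removed constant as an attribute. All names and values here are made up for illustration.

```r
set.seed(1)
x <- runif(40, 5, 10)
y <- 1 + 0.8 * x + rnorm(40, sd = 0.5)
fit_lm <- lm(y ~ x)

tt    <- predict(fit_lm, type = "terms")
term  <- tt[, "x"]               # centred contribution of x (mean zero)
const <- attr(tt, "constant")    # everything removed by centring

beta <- coef(fit_lm)
c0   <- mean(beta[2] * x)        # mean of the uncentred term, the "extra shift"

# const equals intercept + c0, so the data must be shifted by both
ord <- order(x)
plot(x[ord], term[ord], type = "l", col = 4, xlab = "x", ylab = "centred term")
points(x, y - beta[1] - c0, col = "gray")
```

Subtracting only beta[1] would leave the points offset by c0, which mirrors the behaviour described above for plot.gam.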

How to find "y" values of the already estimated monotone function of the non-monotone regression curve corresponding to the original "x" points?

The title sounds complicated but that is what I am looking for. Focus on the picture.
## data
x <- c(1.009648,1.017896,1.021773,1.043659,1.060277,1.074578,1.075495,1.097086,1.106268,1.110550,1.117795,1.143573,1.166305,1.177850,1.188795,1.198032,1.200526,1.223329,1.235814,1.239068,1.243189,1.260003,1.262732,1.266907,1.269932,1.284472,1.307483,1.323714,1.326705,1.328625,1.372419,1.398703,1.404474,1.414360,1.415909,1.418254,1.430865,1.431476,1.437642,1.438682,1.447056,1.456152,1.457934,1.457993,1.465968,1.478041,1.478076,1.485995,1.486357,1.490379,1.490719)
y <- c(0.5102649,0.0000000,0.6360097,0.0000000,0.8692671,0.0000000,1.0000000,0.0000000,0.4183691,0.8953987,0.3442624,0.0000000,0.7513169,0.0000000,0.0000000,0.0000000,0.0000000,0.1291901,0.4936121,0.7565551,1.0085108,0.0000000,0.0000000,0.1655482,0.0000000,0.1473168,0.0000000,0.0000000,0.0000000,0.1875293,0.4918018,0.0000000,0.0000000,0.8101771,0.6853480,0.0000000,0.0000000,0.0000000,0.0000000,0.4068802,1.1061434,0.0000000,0.0000000,0.0000000,0.0000000,0.0000000,0.0000000,0.0000000,0.0000000,0.0000000,0.6391678)
fit1 <- c(0.5102649100,0.5153380934,0.5177234836,0.5255544980,0.5307668662,0.5068087080,0.5071001179,0.4825657520,0.4832969250,0.4836378194,0.4842147729,0.5004039310,0.4987301366,0.4978800742,0.4978042478,0.4969807064,0.5086987191,0.4989497612,0.4936121200,0.4922210302,0.4904593166,0.4775197108,0.4757040857,0.4729265271,0.4709141776,0.4612406896,0.4459316517,0.4351338346,0.4331439717,0.4318664278,0.3235179189,0.2907908968,0.1665721429,0.1474035158,0.1443999345,0.1398517097,0.1153991839,0.1142140393,0.1022584672,0.1002410843,0.0840033244,0.0663669309,0.0629119398,0.0627979240,0.0473336492,0.0239237481,0.0238556876,0.0084990298,0.0077970954,0.0000000000,-0.0006598571)
fit2 <- c(-0.0006598571,0.0153328298,0.0228511733,0.0652889427,0.0975108758,0.1252414661,0.1270195143,0.1922510501,0.2965234797,0.3018551305,0.3108761043,0.3621749370,0.4184150225,0.4359301495,0.4432114081,0.4493565757,0.4510158144,0.4661865431,0.4744926045,0.4766574718,0.4796937554,0.4834718810,0.4836125426,0.4839450098,0.4841092849,0.4877317306,0.4930561638,0.4964939389,0.4970089201,0.4971376528,0.4990394601,0.5005881678,0.5023814257,0.5052125977,0.5056691690,0.5064254338,0.5115481820,0.5117259449,0.5146054557,0.5149729419,0.5184178197,0.5211542908,0.5216215426,0.5216426533,0.5239797875,0.5273573222,0.5273683002,0.5293994824,0.5295130266,0.5306236672,0.5307303109)
## picture
plot(x, y)
## red regression curve
points(x, fit1, col=2); lines(x, fit1, col=2)
## blue monotonic curve to the regression
points(min(x) + cumsum(c(0, rev(diff(x)))), rev(fit2), col="blue"); lines(min(x) + cumsum(c(0, rev(diff(x)))), rev(fit2), col="blue")
## "x" original point matches with the regression estimated point
## but not with the estimated (fit2=estimate) monotonic curve
abline(v=1.223329, lty=2, col="grey")
Focus on the dashed grey line. The idea is to get y value of the monotonic blue curve corresponding to x original value. The grey line should cross three points (the original one "black", the regression estimate "red", the adjusted regression estimate "blue"). Can we do this?
Methodology:
The object "fit2" is the output of the function rearrangement(). It is always monotonically increasing.
library(Rearrangement)
fit2 <- rearrangement(x=as.data.frame(x), y=fit1)
It sounds like you might be interested in approxfun
fn <- approxfun(x=min(x) + cumsum(c(0, rev(diff(x)))), y=rev(fit2))
fn(1.223329)
It's not very fancy, but it will do a basic linear interpolation between observed points for unobserved x values. The code above will estimate a y value for the value x=1.223329 using the existing points. You can then use the fn function to estimate other points as well.
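For reference, a tiny sketch of how approxfun behaves, on made-up points (`xs`, `ys` and the function names are illustrative). By default it returns NA outside the range of the observed x values. Passing rule = 2 makes it return the nearest endpoint value instead.

```r
xs <- c(1.0, 1.2, 1.5)
ys <- c(0.5, 0.3, 0.0)

fn <- approxfun(xs, ys)                # linear interpolation, NA outside range
fn_clamped <- approxfun(xs, ys, rule = 2)  # nearest endpoint value outside range

inside  <- fn(1.1)          # halfway between 0.5 and 0.3
outside <- fn(2)            # beyond max(xs)
clamped <- fn_clamped(2)    # value at the right endpoint
```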
I found a way to do this without approxfun. The rearrangement function returns a monotonically increasing result. Your fitted curve is decreasing, and you can do a simple trick to get what you want (also what you wanted in your earlier question).
## Apply rearrangement on minus fit1
fit3 <- rearrangement(x=as.data.frame(x), y = - fit1)
## Plot the minus rearranged result
plot(x, y)
lines(x, - fit3, col="green"); points(x, - fit3, col="green")
lines(x, fit1, col="red"); points(x, fit1, col="red")
So, the result is a monotonically decreasing curve, with x values equal to that of your data and your fit.
ind <- which(x == 1.223329)
-fit3[ind]
## 0.4857717
Hope it helps,
alex

dividing data and fitting in R

I've the following rainfall data for the year 1951:
dat.1951=c(122,122,122,122,122,122,122,122,122,122,122,121,121,121,121,120,119,119,117,117,117,115,115,115,114,112,112,111,110,109,106,105,104,103,102,99,97,95,91,89,88,86,84,83,83,82,82,79,77,77,76,74,74,72,72,71,70,69,66,65,64,61,61,58,56,56,54,53,51,49,48,47,46,46,46,45,42,40,39,38,37,36,36,35,34,33,33,32,30,30,29,28,28,27,25,25,23,22,21,20,20,20,20,20,19,19,18,18,18,16,16,15,15,15,15,15,14,14,14,14,14,14,14,14,14,14,14,13,13,12,12,11,11,11,11,11,11,11,11,11,11,11,11,11,11,10,10,10,9,8,8,8,8,8,8,8,8,8,8,8,8,7,7,6,6,6,6,5,5,5,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,3,3,3,3,3,2,2,2,2,2,2,2,2,2,2,2,1,1)
I want to fit this data. I break the data into 2 regions (head and tail): the head, where points are less than 100, and the tail, which holds the rest (>= 100). I could fit an exponential to the head portion (see the code below). For the tail, I want to fit a curve, and I want to plot both portions in a single plot along with the data. Can anyone help?
dat.1951<-dat.1951[dat.1951 > 0]
dat.1951.tail<-dat.1951[dat.1951 >= 100]
dat.1951.head<-dat.1951[dat.1951 < 100]
x.head<-seq(1,length(dat.1951.head))
log.data<-log(dat.1951.head)
idf.head<-data.frame(x.head,dat.1951.head)
idf.head$dat.1951.head<-log(idf.head$dat.1951.head)
model=lm(dat.1951.head ~ x.head,data=idf.head)
summary(model)
plot(dat.1951.head)
lines(idf.head$x.head,exp(fitted(model)),col="blue")
I'm not sure why you want to (1) break the data into two regions, (2) eliminate records where there was no rainfall, and (3) fit the model you describe. You may want to consult with a statistician on these matters.
To answer your question, though, I came up with an example for a second model and show the fits from both models on the same plot.
x <- seq(dat.1951)
sel <- dat.1951 >= 100
model1 <- lm(dat.1951[sel] ~ poly(x[sel], 2))
model2 <- lm(log(dat.1951[!sel]) ~ x[!sel])
plot(dat.1951, cex=1.5)
lines(x[sel], fitted(model1), col="blue", lwd=3)
lines(x[!sel], exp(fitted(model2)), col="navy", lwd=3)
Just for grins, I added a third model that fits all of the data with a generalized additive model using the function gam() from the package mgcv.
library(mgcv)
model3 <- gam(dat.1951 ~ s(x))
lines(x, fitted(model3), col="orange", lwd=3, lty=2)
