abline not working with a linear regression model in R

I have some data in R that I want to test on various models. I have split the data into two sets: 80% training and 20% testing. What I want to do is train a linear model on the training set and then predict on the test set.
This is what I have done so far:
temp<-lm(formula = cityMpg ~ peakRpm+horsePower+wheelBase , data=train)
temp_test<- predict(temp,test)
plot(temp_test)
Here I get the scatter plot. Now I just want to add a line to this scatter plot.
When I use abline(temp_test), I get an error.
I want the line to be drawn automatically; I do not wish to specify the coordinates.
The error is:
Error in int_abline(a = a, b = b, h = h, v = v, untf = untf, ...) :
invalid a=, b= specification

As pointed out above, this is a bit tricky for a multi-dimensional model.
Get some data (you neglected to include a reproducible example: see http://tinyurl.com/reproducible-000 ...)
library(foreign)
dat <- read.arff(url("http://www.cs.umb.edu/~rickb/files/UCI/autos.arff"))
Split into training and test data sets:
train <- dat[1:150,]
test <- dat[151:nrow(dat),]
The variable names are a bit awkward for R (the dashes are interpreted as minus operators, so we have to use back-quotes to protect the names):
fit <- lm(`city-mpg` ~ `peak-rpm`+horsepower+`wheel-base`,data=train)
temp_test <- predict(fit,test)
Plot the predictions vs peak RPM:
par(las=1,bty="l") ## cosmetic
plot(test[["peak-rpm"]],temp_test,xlab="peak rpm",ylab="predicted")
In order to add the line, we have to adjust the intercept according to some baseline values of the other parameters: we'll use the mean (another alternative is to center all the predictor variables before fitting the model):
cf <- coef(fit)
abline(a=cf["(Intercept)"]+
         mean(test$horsepower)*cf["horsepower"]+
         mean(test$`wheel-base`)*cf["`wheel-base`"],
       b=coef(fit)["`peak-rpm`"])
Another way to do this is to use predict():
newdat <- with(test,
               data.frame(horsepower=mean(horsepower),
                          "wheel-base"=mean(`wheel-base`),
                          "peak-rpm"=seq(min(`peak-rpm`),
                                         max(`peak-rpm`),
                                         length=41),
                          check.names=FALSE))
newdat["city-mpg"] <- predict(fit,newdat)
with(newdat,lines(`peak-rpm`,`city-mpg`,col=4))
(41 points is silly for a straight line -- we could have used just 2 -- but will work well if you want to plot something curved, like confidence intervals or a nonlinear fit.)
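For instance, a small sketch (reusing fit and newdat from above; my addition, not part of the original answer) that adds pointwise confidence bands around the prediction line -- exactly the case where more than two points matters:
ci <- predict(fit, newdat, interval="confidence")    # matrix with columns fit, lwr, upr
matlines(newdat[["peak-rpm"]], ci[,c("lwr","upr")], lty=2, col=4)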
Alternatively you could just fit the marginal model, but the actual fitted line is somewhat different (it will only be the same if all the predictors are orthogonal to each other):
fit2 <- lm(`city-mpg` ~ `peak-rpm`,data=train)
abline(fit2,col="red")

Related

Predict Future values using polynomial regression in R

I was trying to predict the future values of a sample using polynomial regression in R. The y values within the sample form a wave pattern.
For example:
x = 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16
y= 1,2,3,4,5,4,3,2,1,0,1,2,3,4,5,4
But when the graph is plotted for future values, the resulting y values are completely different from what I expected. Instead of a wave pattern, I get a graph where the y values keep increasing.
futurY = 17,18,19,20,21,22
I tried different degrees of polynomial regression, but the predicted results for futurY were drastically different from what was expected.
The following is the sample R code that was used to get the results:
dfram <- data.frame('x'=c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16))
dfram$y <- c(1,2,3,4,5,4,3,2,1,0,1,2,3,4,5,4)
plot(dfram$x, dfram$y, type="l", lwd=3)
pred <- data.frame('x'=c(17,18,19,20,21,22))
myFit <- lm(y ~ poly(x,5), data=dfram)
newdata <- predict(myFit, pred)
print(newdata)
plot(pred[,1],data.frame(newdata)[,1],type="l",col="red", lwd=3)
Is this the correct technique to be used for predicting the unknown future y values OR should I be using other techniques like forecasting?
# Reproducing your data frame
dfram <- data.frame("x" = c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16),
"y" = c(1,2,3,4,5,4,3,2,1,0,1,2,3,4,5,4))
From your graph I read off the phase and period of the signal by eye. There are better ways of calculating that automatically (one rough sketch follows the phase/period block below).
# Phase and period
fase = 1
per = 10
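As mentioned above, here is a rough sketch (my own addition, not from the original answer) of estimating the dominant period automatically with a periodogram; with only 16 samples the estimate is coarse:
sp <- spectrum(dfram$y, plot=FALSE)          # periodogram of the observed series
est_per <- 1/sp$freq[which.max(sp$spec)]     # period in x units (sampling step = 1)
est_per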
In the linear model formula I've put the equations of the triangular signal:
fit <- lm(y ~ I((((trunc((x-fase)/(per/2))%%2)*2)-1) * (x-fase)%%(per/2))
            + I((((trunc((x-fase)/(per/2))%%2)*2)-1) * ((per/2)-((x-fase)%%(per/2)))),
          data=dfram)
# Predict the old data
p_olddata <- predict(fit,type="response")
# Predict the new data
newdata <- data.frame('x'=c(17,18,19,20,21,22))
p_newdata <- predict(fit,newdata,type="response")
# Plotting old and new data
plot(x=c(dfram$x,newdata$x),
     y=c(p_olddata,p_newdata),
     col=c(rep("blue",length(p_olddata)),rep("green",length(p_newdata))),
     xlab="x",
     ylab="y")
lines(dfram)
Where the black line is the original signal, the blue circles are the prediction for the original points and the green circles are the prediction for the new data.
The graph shows a perfect fit for the model because there is no noise in the data. A real dataset will have noise, so the fit will not look as clean as this.
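As a quick illustration of that point (my own sketch, not from the original answer), here is the same model refit after adding a little Gaussian noise, so the fit is no longer perfect:
set.seed(1)
dfram_noisy <- transform(dfram, y = y + rnorm(length(y), sd=0.3))  # add noise to y
fit_noisy <- update(fit, data=dfram_noisy)                         # refit the triangular model
plot(dfram_noisy$x, dfram_noisy$y)
lines(dfram_noisy$x, predict(fit_noisy), col="blue")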

Generating predictive simulations from a multilevel model with random intercepts

I am struggling to understand how, in R, to generate predictive simulations for new data using a multilevel linear regression model with a single set of random intercepts. Following the example on pp. 146-147 of this text, I can execute this task for a simple linear model with no random effects. What I can't wrap my head around is how to extend the set-up to accommodate random intercepts for a factor added to that model.
I'll use iris and some fake data to show where I'm getting stuck. I'll start with a simple linear model:
mod0 <- lm(Sepal.Length ~ Sepal.Width, data = iris)
Now let's use that model to generate 1,000 predictive simulations for 250 new cases. I'll start by making up those cases:
set.seed(20912)
fakeiris <- data.frame(Sepal.Length = rnorm(250, mean(iris$Sepal.Length), sd(iris$Sepal.Length)),
                       Sepal.Width = rnorm(250, mean(iris$Sepal.Length), sd(iris$Sepal.Length)),
                       Species = sample(as.character(unique(iris$Species)), 250, replace = TRUE),
                       stringsAsFactors = FALSE)
Following the example in the aforementioned text, here's what I do to get 1,000 predictive simulations for each of those 250 new cases:
library(arm)
n.sims = 1000 # set number of simulations
n.tilde = nrow(fakeiris) # set number of cases to simulate
X.tilde <- cbind(rep(1, n.tilde), fakeiris[,"Sepal.Width"]) # create matrix of predictors describing those cases; need column of 1s to multiply by intercept
sim.fakeiris <- sim(mod0, n.sims) # draw the simulated coefficients
y.tilde <- array(NA, c(n.sims, n.tilde)) # build an array to hold results
for (s in 1:n.sims) { y.tilde[s,] <- rnorm(n.tilde, X.tilde %*% sim.fakeiris@coef[s,], sim.fakeiris@sigma[s]) } # use matrix multiplication to fill that array
That works fine, and now we can do things like colMeans(y.tilde) to inspect the central tendencies of those simulations, and cor(colMeans(y.tilde), fakeiris$Sepal.Length) to compare them to the (fake) observed values of Sepal.Length.
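For example (just spelling out the checks described above):
sim.means <- colMeans(y.tilde)          # per-case central tendency across the 1,000 simulations
cor(sim.means, fakeiris$Sepal.Length)   # compare to the (fake) observed values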
Now let's try an extension of that simple model in which we assume that the intercept varies across groups of observations --- here, species. I'll use lmer() from the lme4 package to estimate a simple multilevel/hierarchical model that matches that description:
library(lme4)
mod1 <- lmer(Sepal.Length ~ Sepal.Width + (1 | Species), data = iris)
Okay, that works, but now what? I run:
sim.fakeiris.lmer <- sim(mod1, n.sims)
When I use str() to inspect the result, I see that it is an object of class sim.merMod with three components:
@fixef, a 1,000 x 2 matrix with simulated coefficients for the fixed effects (the intercept and Sepal.Width)
@ranef, a 1,000 x 3 matrix with simulated coefficients for the random effects (the three species)
@sigma, a vector of length 1,000 containing the sigmas associated with each of those simulations
I can't wrap my head around how to extend the matrix construction and multiplication used for the simple linear model to this situation, which adds another dimension. I looked in the text, but I could only find an example (pp. 272-275) for a single case in a single group (here, species). The real-world task I'm aiming to perform involves running simulations like these for 256 new cases (pro football games) evenly distributed across 32 groups (home teams). I'd greatly appreciate any assistance you can offer.
Addendum. Stupidly, I hadn't looked at the details on simulate.merMod() in lme4 before posting this. I have now. It seems like it should do the trick, but when I run simulate(mod0, nsim = 1000, newdata = fakeiris), the result has only 150 rows. The values look sensible, but there are 250 rows (cases) in fakeiris. Where is that 150 coming from?
One possibility is to use the predictInterval function from the merTools package. The package is about to be submitted to CRAN, but the current development release is available from GitHub:
install.packages("devtools")
devtools::install_github("jknowles/merTools")
To get the median and a 95% credible interval of 100 simulations:
mod1 <- lmer(Sepal.Length ~ Sepal.Width + (1 | Species), data = iris)
out <- predictInterval(mod1, newdata=fakeiris, level=0.95,
                       n.sims=100, stat="median")
By default, predictInterval includes the residual variation, but you can turn that feature off with:
out2 <- predictInterval(mod1, newdata=fakeiris, level=0.95,
                        n.sims=100, stat="median",
                        include.resid.var=FALSE)
Hope this helps!
This might help: it doesn't use sim(); instead it uses mvrnorm() to draw the new coefficients from the sampling distribution of the fixed-effect parameters, and then uses a bit of internal machinery (setBeta0) to reassign the internal values of the fixed-effect coefficients. The random-effect coefficients are automatically resampled by simulate.merMod via its default argument re.form=NA. However, the residual variance is not resampled; it is held fixed across the simulations, which isn't 100% realistic.
In your use case, you would specify newdata=fakeiris.
library(lme4)
mod1 <- lmer(Sepal.Length ~ Sepal.Width + (1 | Species), data = iris)
simfun <- function(object, n=1, newdata=NULL, ...) {
    v <- vcov(object)                            # covariance matrix of the fixed-effect estimates
    b <- fixef(object)
    betapars <- MASS::mvrnorm(n, mu=b, Sigma=v)  # draw n sets of fixed-effect coefficients
    npred <- if (is.null(newdata)) {
        length(predict(object))
    } else nrow(newdata)
    res <- matrix(NA, npred, n)
    for (i in 1:n) {
        object@pp$setBeta0(betapars[i,])         # plug the drawn coefficients into the fitted object
        res[,i] <- simulate(object, newdata=newdata, ...)[[1]]
    }
    return(res)
}
ss <- simfun(mod1,100)
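And, as noted above, for your use case you would pass the new data (a usage sketch):
ss.new <- simfun(mod1, 100, newdata=fakeiris)
dim(ss.new)   # 250 cases x 100 simulations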

Modifying a curve to prevent singular gradient matrix at initial parameter estimates

I want to use y=a^(b^x) to fit the data below,
y <- c(1.0385, 1.0195, 1.0176, 1.0100, 1.0090, 1.0079, 1.0068, 1.0099, 1.0038)
x <- c(3,4,5,6,7,8,9,10,11)
data <- data.frame(x,y)
When I use the non-linear least squares procedure,
f <- function(x,a,b) {a^(b^x)}
(m <- nls(y ~ f(x,a,b), data = data, start = c(a=1, b=0.5)))
it produces an error: singular gradient matrix at initial parameter estimates. The result is roughly a = 1.1466, b = 0.6415, so there shouldn't be a problem with the initial parameter estimates, which I have defined as a=1 and b=0.5.
I have read in other topics that it is convenient to modify the curve. I was thinking about something like log y=log a *(b^x), but I don't know how to deal with function specification. Any idea?
I will expand my comment into an answer.
If I use the following:
y <- c(1.0385, 1.0195, 1.0176, 1.0100, 1.0090, 1.0079, 1.0068, 1.0099, 1.0038)
x <- c(3,4,5,6,7,8,9,10,11)
data <- data.frame(x,y)
f <- function(x,a,b) {a^b^x}
(m <- nls(y ~ f(x,a,b), data = data, start = c(a=0.9, b=0.6)))
or
(m <- nls(y ~ f(x,a,b), data = data, start = c(a=1.2, b=0.4)))
I obtain:
Nonlinear regression model
  model: y ~ f(x, a, b)
   data: data
     a      b
1.0934 0.7242
 residual sum-of-squares: 0.0001006
Number of iterations to convergence: 10
Achieved convergence tolerance: 3.301e-06
I always obtain an error if I use 1 as a starting value for a, perhaps because 1 raised to any power is 1, so at a = 1 the gradient with respect to b is exactly zero and the gradient matrix is singular.
As for automatically generating starting values, I am not familiar with a procedure to do that. One method I have read about is to simulate curves and use starting values that generate a curve that appears to approximate your data.
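One rough sketch of that idea (my own addition, not from the original answer): try a small grid of candidate starting values, keep the fits that converge, and pick the one with the lowest residual sum of squares.
starts <- expand.grid(a=c(0.9, 1.05, 1.2), b=c(0.4, 0.6, 0.8))   # avoid a = 1 exactly
fits <- lapply(seq_len(nrow(starts)), function(i)
    tryCatch(nls(y ~ f(x,a,b), data=data, start=as.list(starts[i,])),
             error=function(e) NULL))
ok <- Filter(Negate(is.null), fits)               # drop starting values that failed
best <- ok[[which.min(sapply(ok, deviance))]]     # smallest residual sum of squares
coef(best)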
Here is the plot generated with the above parameter estimates, using the following code. I admit that the lower-right portion of the line could perhaps fit a little better:
setwd('c:/users/mmiller21/simple R programs/')
jpeg(filename = "nlr.plot.jpeg")
plot(x,y)
curve(1.0934^(0.7242^x), from=0, to=11, add=TRUE)
dev.off()

How to manually specify outer knots for smoother in gam (mgcv package)

I am fitting GAM models to data using the mgcv package in R. Some of my predictors are circular, so I am using a periodic smoother. I run into an issue in cross-validation where my holdout dataset can contain values outside the range of the training data. Since gam() automatically chooses knots for the smooths, this leads to an error (see my related question here -- thanks to @nograpes and @DWin for their explanations of the errors there).
How can I manually specify the outer knots in a periodic smooth?
Example code
The first block generates some data.
library(mgcv)
set.seed(223) # produces error.
# set.seed(123) # no error.
# generate data:
x <- runif(100,min=-pi,max=pi)
linPred <- 2*cos(x) # value of the linear predictor
theta <- 1 / (1 + exp(-linPred)) # inverse-logit transform
y <- rbinom(100,1,theta)
plot(x,theta)
df <- data.frame(x=x,y=y)
The next block fits the GAM model with the periodic smooth:
gamFit <- gam(y ~ s(x,bs="cc",k=5),data=df,family=binomial())
summary(gamFit)
plot(gamFit)
I'm sure it will be somewhere in the specification of the smoother term s(x,bs="cc",k=5) that knots can be set, but this is not obvious to me from the gam help pages or from googling.
This block will fit some holdout data and produce the error if you set the seed as above:
# predict y values for new data:
x.2 <- runif(100,min=-pi,max=pi)
df.2 <- data.frame(x=x.2)
predict(gamFit,newdata=df.2)
Ideally, I would only set the outer knots and let gam pick the rest.
Apologies if this question is better for CrossValidated than SO.
Try this:
gamFit <- gam(y ~ s(x,bs="cc",k=5),
knots=list( x=seq(-pi,pi, len=5) ),
data=df, family=binomial())
You will find a worked example at:
?smooth.construct.cr.smooth.spec
I learned in testing this code that the k parameter in s() needs to match the len of the seq() supplied for x in the knots list. I had thought, incorrectly, that the knots argument would get passed through to s().
This has been possible in {mgcv} for some years now (though perhaps not at the time the question was posed and answered). Using the model in @IRTFM's answer, one can specify just the outer knots for a cyclic cubic regression spline:
gamFit <- gam(y ~ s(x, bs = "cc"),
knots = list(x = c(-pi, pi)),
data = df, family = binomial())
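A quick check (my addition): with the outer knots pinned at -pi and pi, re-running the holdout prediction from the question no longer errors:
predict(gamFit, newdata=df.2)   # every value of x.2 now lies inside the knot range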

Fitting logarithmic curve in R

If I have a set of points in R that are linear I can do the following to plot the points, fit a line to them, then display the line:
x=c(61,610,1037,2074,3050,4087,5002,6100,7015)
y=c(0.401244, 0.844381, 1.18922, 1.93864, 2.76673, 3.52449, 4.21855, 5.04368, 5.80071)
plot(x,y)
Estimate = lm(y ~ x)
abline(Estimate)
Now, if I have a set of points for which a logarithmic curve fit looks more appropriate, such as the following:
x=c(61,610,1037,2074,3050,4087,5002,6100,7015)
y=c(0.974206,1.16716,1.19879,1.28192,1.30739,1.32019,1.35494,1.36941,1.37505)
I know I can get the standard regression fit against the log of the x values with the following:
logEstimate = lm(y ~ log(x))
But then how do I transform that logEstimate back to normal scaling and plot the curve against my linear curve from earlier?
Hmmm, I'm not quite sure what you mean by "plot the curve against my linear curve from earlier".
d <- data.frame(x,y) ## need to use data in a data.frame for predict()
logEstimate <- lm(y~log(x),data=d)
Here are three ways to get predicted values:
(1) Use predict:
plot(x,y)
xvec <- seq(0,7000,length=101)
logpred <- predict(logEstimate,newdata=data.frame(x=xvec))
lines(xvec,logpred)
(2) Extract the numeric coefficient values:
coef(logEstimate)
## (Intercept) log(x)
## 0.6183839 0.0856404
curve(0.61838+0.08564*log(x),add=TRUE,col=2)
(3) Use with() magic (you need back-quotes around the parameter estimate names because they contain parentheses)
with(as.list(coef(logEstimate)),
     curve(`(Intercept)`+`log(x)`*log(x),add=TRUE,col=4))
Maybe what you want is
est1 <- predict(lm(y~x,data=d),newdata=data.frame(x=xvec))
plot(est1,logpred)
... although I'm not sure why ...
I'm not exactly sure what you mean either... but I guessed a little differently. I think you want to fit two models to those points, one linear and one logged, and then plot the points along with the functional form of both models. Here is the code for that:
x=c(61,610,1037,2074,3050,4087,5002,6100,7015)
y=c(0.974206,1.16716,1.19879,1.28192,1.30739,1.32019,1.35494,1.36941,1.37505)
Estimate = lm(y ~ x)
logEstimate = lm(y ~ log(x))
plot(x,predict(Estimate),type='l',col='blue')
lines(x,predict(logEstimate),col='red')
points(x,y)
In response to your second question in the comment, linear regression does always return a linear combination of your predictors, but that doesn't necessarily mean that it is a straight line. Think about what your log transformation really means: If you fit,
y = log(x)
that is the same as fitting
exp(y) = x
which means that as y increases linearly, x increases exponentially; equivalently, y grows only logarithmically in x, which is clearly not a 'straight line'. However, if you transformed your x-axis to the log scale, then the displayed line would be straight.
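A quick way to see that last point (a sketch, reusing x, y and logEstimate from above):
plot(log(x), y)        # x-axis on the log scale
abline(logEstimate)    # intercept and slope are with respect to log(x), so the line is straight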
