predict() gives wrong matrix for bs(); how to predict linear regression?

I've run into a problem with the bs() function.
library(ISLR)
library(ggplot2)
library(caret)
data(Wage)
#summary(Wage)
set.seed(123)
inTrain <- createDataPartition(Wage$wage, p = 0.7, list = F)
training <- Wage[inTrain,]
testing <- Wage[-inTrain,]
library(splines)
bsBasis <- bs(training$age, df=3)
bsBasis[1:12,]
lm1 <- lm(wage ~ bsBasis, data=training)
lm1$coefficients
## (Intercept) bsBasis1 bsBasis2 bsBasis3
## 60.22 93.39 51.05 47.28
plot(training$age, training$wage, pch=19, cex=0.5)
points(training$age, predict(lm1, newdata=training), col="red", pch=19, cex=0.5)
predict(bsBasis, age=testing$age)
The dimensions of predict(bsBasis, age=testing$age) are 2012x3, while testing$age has only 988 rows. Moreover, the result of predict(bsBasis, age=testing$age) is identical to bsBasis.
My questions are:
What is predict(bsBasis, age=testing$age) actually doing?
How do I use this bsBasis to correctly predict wage on the TEST data?

Your question 1
Use newx. Check ?predict.bs for its arguments.
x <- runif(100)
b <- bs(x, df = 3)
predict(b, newx = c(0.2, 0.5))
Different predict methods take different argument names. Here, no matter which variable you passed to bs() (age, sex, height, etc.), the new values must be supplied as newx to predict.bs(). An argument named age is not in the method's signature, so it is silently ignored and the function simply returns the original basis matrix, which is why the result is identical to bsBasis and has 2012 rows.
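For example, a quick illustrative check with the toy basis b above shows the difference (the dimensions refer to this sketch, not to the Wage data):
dim(predict(b, age = c(0.2, 0.5)))   # 100 x 3: 'age' is not an argument of predict.bs, so it is ignored
dim(predict(b, newx = c(0.2, 0.5)))  # 2 x 3: one basis row for each new value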
Your question 2
You don't really need to form bsBasis explicitly. When you use splines in a regression formula, lm() and predict.lm() hide the construction and prediction of the spline basis from you.
lm1 <- lm(wage ~ bs(age, df = 3), data = training)
predict(lm1, newdata = testing)
Note the argument in predict.lm is newdata.
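Applied to the original Wage example, a minimal sketch of the whole workflow (reusing the training/testing split from the question; lm2 is just a fresh name for the refit model) looks like this:
library(splines)
lm2 <- lm(wage ~ bs(age, df = 3), data = training)
pred <- predict(lm2, newdata = testing)  # the spline basis is rebuilt for testing$age
plot(testing$age, testing$wage, pch = 19, cex = 0.5)
points(testing$age, pred, col = "red", pch = 19, cex = 0.5)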

Related

How to get a plot from nlrq in R?

The following workflow for nonlinear quantile regression seems to work. However, I don't know how to plot the resulting curve.
By the way, I'd prefer to use the function graphics::curve() instead of graphics::lines().
require(quantreg)
# load sample data
dat <- DNase
# introduce variable
x <- DNase$conc
y <- DNase$density
# introduce function
f <- function(a, b, x) {(a*x/(b+x))}
# fit on the log scale first, just to obtain starting values
fm0 <- nls(log(y) ~ log(f(a,b,x)), dat, start = c(a = 1, b = 1))
# fit the nonlinear least-squares regression
fit <- nls(y ~ f(a,b,x), dat, start = coef(fm0))
# extract the coefficients
co <- coef(fit)
a=co[1]
b=co[2]
# plot
plot(y~x)
# add curve
curve((a*x/(b+x)), add=T)
# then fit the median using nlrq
dat.nlrq <- nlrq(y ~ SSlogis(x, Asym, mid, scal), data=dat, tau=0.5)
# add curve
???
EDIT: What I'm looking for is a way to plot various quantile regressions of a formula, like a*x/(b+x).
Inserting the formula leads me to the question of what to pass as the 'start' argument:
dat.nlrq.075 <- nlrq(formula=fit, data = dat, start=???, tau = 0.75)
curve() uses lines() internally, so there is really no reason to use curve() when it's easier to use lines() directly.
First ensure that the data is sorted so the plots come out right. Then fit with nls or nlrq and use fitted() for the fitted line.
library(quantreg)
dat <- DNase[order(DNase$conc), ]
fit.nlrq <- nlrq(density ~ SSlogis(conc, Asym, mid, scal), data = dat, tau = 0.5)
plot(density ~ conc, dat)
lines(fitted(fit.nlrq) ~ conc, dat)
If you want to plot the fit at a different number of equally spaced points such as 250 then do the same except use predict instead of fitted:
x <- seq(min(dat$conc), max(dat$conc), length = 250)
lines(predict(fit.nlrq, list(conc = x)) ~ x, lty = 2, col = "red")
The same style works with nls.
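For instance, a minimal sketch of the nls version, assuming the same sorted dat as above:
fit.nls <- nls(density ~ SSlogis(conc, Asym, mid, scal), data = dat)
lines(fitted(fit.nls) ~ conc, dat, col = "blue", lty = 2)  # add to the existing plot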
Note that if you use require(), its return value should be checked. If you don't want to do that, use library() instead.

Plotting binomial glm with interactions in numeric variables

I want to know if it is possible to plot a binomial GLM with interactions in numeric variables. In my case:
##Data set artificial
set.seed(20)
d <- data.frame(
mating=sample(0:1, 200, replace=T),
behv = scale(rpois(200,10)),
condition = scale(rnorm(200,5))
)
#Binomial GLM fitted
model<-glm(mating ~ behv + condition, data=d, family=binomial)
summary(model)
In a situation where behv and condition are significant in the model
#Plotting first for behv
x<-d$behv ###Take behv values
x2<-rep(mean(d$condition),length(d[,1])) ##Fixed mean condition
# Points
plot(d$mating~d$behv)
#Curve
curve(exp(model$coefficients[1]+model$coefficients[2]*x+model$coefficients[3]*x2)
/(1+exp(model$coefficients[1]+model$coefficients[2]*x+model$coefficients[3]*x2)))
But it doesn't work! Is there another, correct approach?
Thanks
It seems like your desired output is a plot of the conditional means (or best-fit line). You can do this by computing predicted values with the predict function.
I'm going to change your example a bit, to get a nicer looking result.
d$mating <- ifelse(d$behv > 0, rbinom(200, 1, .8), rbinom(200, 1, .2))
model <- glm(mating ~ behv + condition, data = d, family = binomial)
summary(model)
Now, we make a newdata dataframe with your desired values:
newdata <- d
newdata$condition <- mean(newdata$condition)
newdata$yhat <- predict(model, newdata, type = "response")
Finally, we sort newdata by the x-axis variable (if not, we'll get lines that zig-zag all over the plot), and then plot:
newdata <- newdata[order(newdata$behv), ]
plot(newdata$mating ~ newdata$behv)
lines(x = newdata$behv, y = newdata$yhat)
Output: a scatterplot of mating against behv with the fitted conditional-mean line overlaid.
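As an aside, the manual inverse-logit curve from the question can also be drawn with plogis(); this is only a sketch, assuming condition is held fixed at its mean:
cf <- coef(model)
curve(plogis(cf[1] + cf[2]*x + cf[3]*mean(d$condition)),
      from = min(d$behv), to = max(d$behv), add = TRUE, lty = 2)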

Comparing GLM models using predict

Suppose I have two models created by calling glm() on the same data but with different formulas and/or families. Now I want to compare which model is better by predicting on unknown data. Something like this:
mod1 <- glm(formula1, family1, data)
mod2 <- glm(formula2, family2, data)
mu1 <- predict(mod1, newdata, type = "response")
mu2 <- predict(mod2, newdata, type = "response")
How can I tell which of the predictions mu1 or mu2 is better?
Is there some simple command to compute the log likelihood of a prediction?
It would be easier to answer this with a reproducible example.
It often makes more sense to choose a family a priori rather than according to goodness of fit; for example, if you have count (non-negative integer) responses with no obvious upper bound, your only real choice that lies strictly within the exponential family is Poisson.
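For instance, a count response would a priori be modelled along these lines (purely illustrative; cd, counts and x here are hypothetical names, not part of the comparison below):
cd <- data.frame(x = runif(100), counts = rpois(100, lambda = 5))
mod_pois <- glm(counts ~ x, family = poisson(link = "log"), data = cd)
Returning to the comparison, first simulate Gamma-distributed data: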
set.seed(101)
x <- runif(1000)
mu <- exp(1+2*x)
y <- rgamma(1000,shape=3,scale=mu/3)
d <- data.frame(x,y)
New data:
nd <- data.frame(x=runif(100))
nd$y <- rgamma(100,shape=3,scale=exp(1+2*nd$x)/3)
Fit Gamma and Gaussian:
mod1 <- glm(y~x,family=Gamma(link="log"),data=d)
mod2 <- glm(y~x,family=gaussian(link="log"),data=d)
Predictions:
mu1 <- predict(mod1, newdata=nd, type="response")
mu2 <- predict(mod2, newdata=nd, type="response")
Extract the Gaussian residual SD and the Gamma shape parameter:
sigma <- sqrt(summary(mod2)$dispersion)
shape <- MASS::gamma.shape(mod1)$alpha
Root mean squared error:
rmse <- function(x1,x2) sqrt(mean((x1-x2)^2))
rmse(mu1,nd$y) ## 5.845
rmse(mu2,nd$y) ## 5.842
Negative log likelihoods:
-sum(dgamma(nd$y,shape=shape,scale=mu1/shape,log=TRUE)) ## 276.84
-sum(dnorm(nd$y,mean=mu2,sd=sigma,log=TRUE)) ## 318.4
The RMSE values are nearly identical, but the negative log likelihood is clearly lower for the Gamma model, so by that criterion the Gamma fit describes the new data better.

Why is leave-one-out cross-validation of GLM model (package=boot) failing when data contains NaN's?

This is a fairly simple procedure: refit the GLM model on a subset of the data (the training set) and calculate the accuracy of the prediction on the remaining data. I am trying to run a "leave-one-out" strategy on a data set (i.e. the training subset has length n-1) using the cv.glm function of the boot package.
Am I doing something wrong, or is it really the case that the function doesn't handle NAs? I'm guessing that this is fairly easy to program on my own, but I would appreciate any advice if there is some other mistake that I am making. Cheers.
Example:
require(boot)
#create data
n <- 100
x <- runif(n)
e <- rnorm(n, sd=100)
a <- 5
b <- 3
y <- exp(a + b*x) + e
plot(y ~ x)
plot(y ~ x, log="y")
#make some y's NaN
set.seed(1)
y[sample(n, 0.1*n)] <- NaN
#fit glm model
df <- data.frame(y=y, x=x)
glm.fit <- glm(y ~ x, data=df, family=gaussian(link="log"))
summary(glm.fit)
#calculate mean error of prediction (leave-one-out cross-validation)
cv.res <- cv.glm(df, glm.fit)
cv.res$delta
[1] NA NA
You're right. The function is not set up to handle NAs. The various options for the na.action argument of the glm() function don't really help, either. The easiest way to deal with it is to remove the NAs from the data frame at the outset.
sub <- df[!is.na(df$y), ]
glm.fit <- glm(y ~ x, data=sub, family=gaussian(link="log"))
summary(glm.fit)
# calculate mean error of prediction (leave-one-out cross-validation)
cv.res <- cv.glm(sub, glm.fit)
cv.res$delta
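Equivalently, na.omit() drops every incomplete row (not only those with a missing y) in one step:
sub <- na.omit(df)  # removes all rows containing NA/NaN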

Example of Time Series Prediction using Neural Networks in R

Does anyone have a quick, short educational example of how to use neural networks (nnet in R) for prediction?
Here is an example, in R, of a time series
T = seq(0,20,length=200)
Y = 1 + 3*cos(4*T+2) +.2*T^2 + rnorm(200)
plot(T,Y,type="l")
Many thanks
David
I think you can use the caret package, and especially the train function:
This function sets up a grid of tuning parameters for a number
of classification and regression routines.
require(quantmod)
require(nnet)
require(caret)
T = seq(0,20,length=200)
y = 1 + 3*cos(4*T+2) +.2*T^2 + rnorm(200)
dat <- data.frame( y, x1=Lag(y,1), x2=Lag(y,2))
names(dat) <- c('y','x1','x2')
dat <- dat[c(3:200),] #delete first 2 observations
#Fit model
model <- train(y ~ x1+x2 ,
dat,
method='nnet',
linout=TRUE,
trace = FALSE)
ps <- predict(model, dat)
#Examine results
plot(T,y,type="l",col = 2)
lines(T[-c(1:2)],ps, col=3)
legend(5, 70, c("y", "pred"), cex=1.5, fill=2:3)
The solution proposed by @agstudy is useful, but in-sample fits are not a reliable guide to out-of-sample forecasting accuracy. The gold standard in forecasting accuracy measurement is to use a holdout sample: remove the last 5, 10 or 20 observations (depending on the length of the time series) from the training sample, fit your models to the rest of the data, use the fitted models to forecast the holdout sample, and simply compare accuracies on the holdout, using Mean Absolute Deviations (MAD) or weighted Mean Absolute Percentage Errors (wMAPE).
So to do this you can change the code above in this way:
require(quantmod)
require(nnet)
require(caret)
t = seq(0,20,length=200)
y = 1 + 3*cos(4*t+2) +.2*t^2 + rnorm(200)
dat <- data.frame( y, x1=Lag(y,1), x2=Lag(y,2))
names(dat) <- c('y','x1','x2')
train_set <- dat[c(3:185),]
test_set <- dat[c(186:200),]
#Fit model
model <- train(y ~ x1+x2 ,
train_set,
method='nnet',
linout=TRUE,
trace = FALSE)
ps <- predict(model, test_set)
#Examine results
plot(t,y,type="l",col = 2)
lines(t[c(186:200)],ps, col=3)
legend(5, 70, c("y", "pred"), cex=1.5, fill=2:3)
This last line outputs the wMAPE of the forecasts from the model:
sum(abs(ps - test_set$y)) / sum(abs(test_set$y))
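A MAD check on the same holdout, as mentioned above, is just as simple:
mean(abs(ps - test_set$y))  # mean absolute deviation on the holdout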
