Using nnet for prediction, am I doing it right? - r

I'm still pretty new to R and AI / ML techniques. I would like to use a neural net for prediction, and since I'm new I would just like to see if this is how it should be done.
As a test case, I'm predicting values of sin(), based on the 2 previous values. For training I create a data frame with y = sin(x), x1 = sin(x-1), x2 = sin(x-2), then use the formula y ~ x1 + x2.
It seems to work, but I am just wondering if this is the right way to do it, or if there is a more idiomatic way.
This is the code:
require(quantmod) #for Lag()
require(nnet)
x <- seq(0, 20, 0.1)
y <- sin(x)
te <- data.frame(y, Lag(y), Lag(y,2))
names(te) <- c("y", "x1", "x2")
p <- nnet(y ~ x1 + x2, data=te, linout=TRUE, size=10)
ps <- predict(p, x1=y)
plot(y, type="l")
lines(ps, col=2)
Thanks
[edit]
Is this better for the predict call?
t2 <- data.frame(sin(x), Lag(sin(x)))
names(t2) <- c("x1", "x2")
vv <- predict(p, t2)
plot(vv)
I guess I'd like to see that the nnet is actually working by looking at its predictions (which should approximate a sine wave).
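For reference, a minimal sketch of a predict call whose newdata matches the training column names (reusing the p and y defined above):
#newdata must carry the same columns (x1, x2) the formula was trained on
nd <- data.frame(Lag(y), Lag(y, 2))
names(nd) <- c("x1", "x2")
ps <- predict(p, newdata = nd)
plot(y, type = "l")
lines(ps, col = 2)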

I really like the caret package, as it provides a nice, unified interface to a variety of models, such as nnet. Furthermore, it automatically tunes hyperparameters (such as size and decay) using cross-validation or bootstrap re-sampling. The downside is that all this re-sampling takes some time.
#Load Packages
require(quantmod) #for Lag()
require(nnet)
require(caret)
#Make toy dataset
y <- sin(seq(0, 20, 0.1))
te <- data.frame(y, x1=Lag(y), x2=Lag(y,2))
names(te) <- c("y", "x1", "x2")
#Fit model
model <- train(y ~ x1 + x2, data = te, method = 'nnet',
               linout = TRUE, trace = FALSE,
               na.action = na.omit, #drop the NA rows Lag() creates
               #Grid of tuning parameters to try:
               tuneGrid = expand.grid(size = c(1, 5, 10), decay = c(0, 0.001, 0.1)))
ps <- predict(model, te)
#Examine results
model
plot(y)
lines(ps, col=2)
It also predicts on the proper scale, so you can directly compare results. If you are interested in neural networks, you should also take a look at the neuralnet and RSNNS packages. caret can currently tune nnet and neuralnet models, but does not yet have an interface for RSNNS.
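For instance, here is a rough sketch of fitting a neuralnet model through the same train() interface; the tuning-parameter names (layer1, layer2, layer3) are taken from caret's model list and should be checked against your caret version, and the NA rows that Lag() creates are dropped first:
#same train() call, different engine (neuralnet handles regression only)
require(neuralnet)
model2 <- train(y ~ x1 + x2, data = na.omit(te), method = 'neuralnet',
                tuneGrid = expand.grid(layer1 = c(2, 4), layer2 = 0, layer3 = 0))
ps2 <- predict(model2, na.omit(te))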
/edit: caret now has an interface for RSNNS. It turns out if you email the package maintainer and ask that a model be added to caret he'll usually do it!
/edit: caret also now supports Bayesian regularization for feed-forward neural networks from the brnn package. Furthermore, caret now also makes it much easier to specify your own custom models, to interface with any neural network package you like!

Related

How does the kernel SVM in e1071 predict?

I am trying to understand the way the e1071 package obtains its SVM predictions in a two-class classification framework. Consider the following toy example.
library(mvtnorm)
library(e1071)
n <- 50
### Gaussians
eps <- 0.05
data1 <- as.data.frame(rmvnorm(n, mean = c(0,0), sigma=diag(rep(eps,2))))
data2 <- as.data.frame(rmvnorm(n, mean = c(1,1), sigma=diag(rep(eps,2))))
### Train Model
data_df <- as.data.frame(rbind(data1, data2))
data <- as.matrix(data_df)
data_df$y <- as.factor(c(rep(-1,n), rep(1,n)))
svm <- svm(y ~ ., data = data_df, kernel = "radial", gamma=1, type = "C-classification", scale = FALSE)
Having trained the SVM, I would like to write a function that uses the coefficients and the intercept to predict on a new data point.
Recall that the kernel trick guarantees that we can write the prediction on a new point as the weighted sum of the kernel evaluated at the support vectors and the new point itself (plus some intercept).
In other words: How to combine the following three terms
supportv <- svm$SV
coefs <- svm$coefs
intercept <- svm$rho
to get the prediction associated with the corresponding SVM?
If this is not possible, or too complicated, I would also switch to a different package.
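For what it's worth, a minimal sketch of that combination for the radial kernel, valid here because scale = FALSE was used; note that the sign convention depends on the factor level ordering, so check it against predict(svm, newdata, decision.values = TRUE):
#decision value = weighted kernel sum over support vectors, minus rho
rbf <- function(u, v, gamma) exp(-gamma * sum((u - v)^2))
decision_value <- function(model, newx, gamma = 1) {
  k <- apply(model$SV, 1, rbf, v = newx, gamma = gamma)
  sum(model$coefs * k) - model$rho
}
decision_value(svm, c(0, 0)) #near the first Gaussian
decision_value(svm, c(1, 1)) #near the second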

Cannot generate predictions in mgcv when using discretization (discrete=T)

I am fitting a generalized additive model with a random site-level effect, implemented in the mgcv package for R. I had been doing this with the function gam(); however, to speed things up I need to shift to the bam() framework, which is basically the same as gam(), but faster. I further sped up fitting by passing the options bam(nthreads = N, discrete = TRUE), where nthreads is the number of cores on my machine. However, when I use the discretization option and then try to make predictions on new data while ignoring the random effect, I consistently get an error.
Here is code to generate example data and reproduce the error.
library(mgcv)
#generate data.
N <- 10000
x <- runif(N,0,1)
y <- (0.5*x / (x + 0.2)) + rnorm(N)*0.1 #non-linear relationship between x and y.
#uninformative random effect.
random.x <- as.factor(do.call(paste0, replicate(2, sample(LETTERS, N, TRUE), FALSE)))
#fit models.
fit1 <- gam(y ~ s(x) + s(random.x, bs = 're')) #this one takes ~1 minute to fit, rest faster.
fit2 <- bam(y ~ s(x) + s(random.x, bs = 're'))
fit3 <- bam(y ~ s(x) + s(random.x, bs = 're'), discrete = T, nthreads = 2)
#make predictions on new data.
newdat <- data.frame(runif(200, 0, 1))
colnames(newdat) <- 'x'
test1 <- predict(fit1, newdata=newdat, exclude = c("s(random.x)"), newdata.guaranteed = T)
test2 <- predict(fit2, newdata=newdat, exclude = c("s(random.x)"), newdata.guaranteed = T)
test3 <- predict(fit3, newdata=newdat, exclude = c("s(random.x)"), newdata.guaranteed = T)
Making predictions with the third model, which uses discretization, throws this error (the other two do not):
Error in model.frame.default(object$dinfo$gp$fake.formula[-2], newdata) :
variable lengths differ (found for 'random.x')
In addition: Warning message:
'newdata' had 200 rows but variables found have 10000 rows
How can I go about making predictions for a new dataset using the model fit with discretization?
newdata.guaranteed doesn't seem to be working for bam() models with discrete = TRUE. You could email the author and maintainer of mgcv and send him the reproducible example so he can take a look. See ?bug.reports.mgcv.
You probably want
names(newdat) <- "x"
as data frames have names.
But the workaround is just to pass in something for random.x
newdat <- data.frame(x = runif(200, 0, 1), random.x = random.x[[1]])
and then do your call to generate test3 and it will work.
The warning message and error are the result of you not specifying random.x in the newdata; mgcv then looks for random.x and finds it in the global environment. You should really gather those variables into a data frame and use the data argument when fitting your models, and try not to leave similarly named objects lying around in your global environment.
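Putting that together, a sketch of the recommended pattern, with the variables gathered into a data frame and a placeholder level supplied for the excluded random effect:
#gather variables into a data frame and fit with the data argument
dat <- data.frame(y = y, x = x, random.x = random.x)
fit3 <- bam(y ~ s(x) + s(random.x, bs = 're'), data = dat,
            discrete = TRUE, nthreads = 2)
#any existing level works for random.x, since the term is excluded
newdat <- data.frame(x = runif(200, 0, 1), random.x = dat$random.x[1])
test3 <- predict(fit3, newdata = newdat, exclude = "s(random.x)")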

Bootcov in rms package not working when cluster variable included in regression as fixed effect

I'm trying to use bootcov to get clustered standard errors for a regression analysis on panel data. In the analysis, I'm including the cluster variable as a fixed effect to address cluster-level confounding. However, including the cluster variable as a fixed effect causes bootcov to throw an error ("Warning message:...fit failure in 200 resamples. Might try increasing maxit"). I imagine this is because the coefficient matrix varies over bootstrap replications depending on which clusters are selected (here's a similar issue and solution in Stata).
Does anyone know a way around this problem? If not, I can try to manually edit the function myself. Unfortunately, I can't use the cluster option in robcov because my analysis actually requires the Glm function rather than the ols function. Furthermore, I want to stick with the rms package because my analysis involves restricted cubic splines, which rms makes easy to visualize and test via ANOVA (although I'm open to other suggestions).
Thanks for the help. I copied an example below.
#load package
library(rms)
#make df
x <- rnorm(1000)
y <- sample(c(1:100),1000, replace=TRUE)
z <- factor(rep(1:50, 20))
df <- data.frame(y,x,z)
#set datadist
dd <- datadist(df)
options(datadist='dd')
#works when cluster variable isn't included as fixed effect in regression
reg <- ols(x ~ y, df, x=TRUE, y=TRUE)
reg_clus <- bootcov(reg, df$z)
summary(reg_clus)
#doesn't work when cluster variable included as fixed effect in regression
reg2 <- ols(x ~ y + z, df, x=TRUE, y=TRUE)
reg_clus2 <- bootcov(reg2, df$z)
summary(reg_clus2)

How to manually specify outer knots for smoother in gam (mgcv package)

I am fitting GAM models to data using the mgcv package in R. Some of my predictors are circular, so I am using a periodic smoother. I run into an issue in cross-validation where my holdout dataset can contain values outside the range of the training data. Since gam automatically chooses knots for the smooths, this leads to an error (see my related question here -- thanks to @nograpes and @DWin for their explanations of the errors there).
How can I manually specify the outer knots in a periodic smooth?
Example code
The first block generates some data.
library(mgcv)
set.seed(223) # produces error.
# set.seed(123) # no error.
# generate data:
x <- runif(100,min=-pi,max=pi)
linPred <- 2*cos(x) # value of the linear predictor
theta <- 1 / (1 + exp(-linPred)) # inverse-logit link
y <- rbinom(100,1,theta)
plot(x,theta)
df <- data.frame(x=x,y=y)
The next block fits the GAM model with the periodic smooth:
gamFit <- gam(y ~ s(x,bs="cc",k=5),data=df,family=binomial())
summary(gamFit)
plot(gamFit)
I'm sure it will be somewhere in the specification of the smoother term s(x,bs="cc",k=5) that the knots can be set, but how to do so is not obvious to me from the gam help or from googling.
This block will fit some holdout data and produce the error if you set the seed as above:
# predict y values for new data:
x.2 <- runif(100,min=-pi,max=pi)
df.2 <- data.frame(x=x.2)
predict(gamFit,newdata=df.2)
Ideally, I would only set the outer knots and let gam pick the rest.
Apologies if this question is better for CrossValidated than SO.
Try this:
gamFit <- gam(y ~ s(x,bs="cc",k=5),
knots=list( x=seq(-pi,pi, len=5) ),
data=df, family=binomial())
You will find a worked example at:
?smooth.construct.cr.smooth.spec
I learned in testing this code that the k parameter in s() needs to match the length (len) of the seq() vector supplied for x in the knots argument. I had incorrectly thought that the knots argument would get passed to s().
You can do this in mgcv now, and have been able to for some years (though perhaps not at the time the question was posed and answered). Using the model from @IRTFM's answer, one can specify just the outer knots for a cyclic cubic regression spline:
gamFit <- gam(y ~ s(x, bs = "cc"),
knots = list(x = c(-pi, pi)),
data = df, family = binomial())
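With the boundary knots pinned at -pi and pi, the prediction step from the question should now run without the out-of-range error:
#new data anywhere in [-pi, pi] falls inside the knot range
x.2 <- runif(100, min = -pi, max = pi)
predict(gamFit, newdata = data.frame(x = x.2))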

SVM in R for regression

I have 4 dimensions of data. In R, I'm using plot3d with the 4th dimension being color. I'd now like to use SVM to find the regression that gives me the best correlation, basically a best-fit hyperplane dependent on the color dimension. How can I do this?
This is the basic idea (of course the specific formula will vary depending on your variable names and which one is the dependent variable):
library(e1071)
data = data.frame(matrix(rnorm(100*4), nrow=100))
fit = svm(X1 ~ ., data=data)
Then you can use the regular summary, plot, predict, etc. functions on the fit object. Note that with SVMs, the hyperparameters usually need to be tuned for best results; you can do this with the tune wrapper, as sketched below. Also check out the caret package, which I think is great.
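A quick sketch of such tuning with e1071's tune wrapper (the grid values are just illustrative):
#grid-search cost and gamma by cross-validation
tuned <- tune(svm, X1 ~ ., data = data,
              ranges = list(cost = 10^(-1:2), gamma = 10^(-2:0)))
summary(tuned)
fit <- tuned$best.model #refit with the best parameters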
Take a look at the svm function in the e1071 package.
You can also consider the kernlab, klaR or svmpath packages.
EDIT: @CodeGuy, John has provided you with an example. I suppose your 4 dimensions are features that you use to classify your data, and that you also have another variable that is the real class.
y <- gl(4, 5)
x1 <- c(0,1,2,3)[y]
x2 <- c(0,5,10,15)[y]
x3 <- c(1,3,5,7)[y]
x4 <- c(0,0,3,3)[y]
d <- data.frame(y,x1,x2,x3,x4)
library(e1071)
svm01 <- svm(y ~ x1 + x2 + x3 + x4, data=d)
ftable(predict(svm01), y) # tells you how your svm performs
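Since the original question asks about regression rather than classification, here is a minimal regression sketch along the same lines (toy data; type = "eps-regression" is also the svm default for a numeric response):
x <- data.frame(matrix(rnorm(100 * 3), nrow = 100)) #columns X1, X2, X3
x$y <- x$X1 + 2 * x$X2 - x$X3 + rnorm(100, sd = 0.1)
svm02 <- svm(y ~ ., data = x, type = "eps-regression")
cor(predict(svm02, x), x$y) #how closely the fit tracks the response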
