I have 4 dimensions of data. In R, I'm using plot3d with the 4th dimension being color. I'd like to now use SVM to find the best regression line to give me the best correlation. Basically, a best fit hyperplane dependent on the color dimension. How can I do this?
This is the basic idea (of course the specific formula will vary depending on your variable names and which is the dependent):
library(e1071)
data = data.frame(matrix(rnorm(100*4), nrow=100))
fit = svm(X1 ~ ., data=data)
Then you can use regular summary, plot, predict, etc. functions on the fit object. Note that with SVMs, the hyper-parameters usually need to be tuned for best results. you can do this with the tune wrapper. Also check out the caret package, which I think is great.
Take a look on svm function in the e1071 package.
You can also consider the kernelab, klaR or svmpath packages.
EDIT: #CodeGuy, John has provided you an example. I suppose your 4 dimensions are features that you use to classify your data, and that you have also another another variable that is the real class.
y <- gl(4, 5)
x1 <- c(0,1,2,3)[y]
x2 <- c(0,5,10,15)[y]
x3 <- c(1,3,5,7)[y]
x4 <- c(0,0,3,3)[y]
d <- data.frame(y,x1,x2,x3,x4)
library(e1071)
svm01 <- svm(y ~ x1 + x2 + x3 + x4, data=d)
ftable(predict(svm01), y) # Tells you how your svm performance
Related
I'm trying to use bootcov to get clustered standard errors for a regression analysis on panel data. In the analysis, I'm including the cluster variable as a fixed effect to address cluster-level confounding. However, including the cluster variable as a fixed effect causes bootcov to throw an error ("Warning message:...fit failure in 200 resamples. Might try increasing maxit"). I imagine this is because the coefficient matrix varies over bootstrap replications depending on which clusters are selected (here's a similar issue and solution in Stata).
Does anyone know a way around this problem? If not, I can try to manually edit the function myself. Unfortunately, I can't use the cluster option in robcov because my analysis actually requires the Glm function rather than the ols function. Furthermore, I want to stick with the rms package because my analysis involves restricted cubic splines, which rms makes easy to visualize, and test via ANOVA (although I'm open to other suggestions).
Thanks for the help. I copied an example below.
#load package
library(rms)
#make df
x <- rnorm(1000)
y <- sample(c(1:100),1000, replace=TRUE)
z <- factor(rep(1:50, 20))
df <- data.frame(y,x,z)
#set datadist
dd <- datadist(df)
options(datadist='dd')
#works when cluster variable isn't included as fixed effect in regression
reg <- ols(x ~ y, df, x=TRUE, y=TRUE)
reg_clus <- bootcov(reg, df$z)
summary(reg_clus)
#doesn't work when cluster variable included as fixed effect in regression
reg2 <- ols(x ~ y + z, df, x=TRUE, y=TRUE)
reg_clus2 <- bootcov(reg2, df$z)
summary(reg_clus2)
Would anyone be able to explain how to specify optimization methods in the SparkR operation glm? When I try to fit an OLS model with glm, I can only specify "normal" or "auto" as the solver type. SparkR isn't able to interpret the solver specification "l-bfgs", leading me to believe that when I do specify "auto", SparkR simply assumes "normal" and then estimates the model coefficients analytically, using the LS normal equation.
Is fitting GLMs with stochastic gradient descent and L-BFGS not available in SparkR, or am I writing the following evaluation incorrectly?
m <- SparkR::glm(y ~ x1 + x2 + x3, data = df, solver = "l-bfgs")
There's plenty of documentation in Spark about using iterative methods to fit GLMs, e.g. LogisticRegressionWithLBFGS and LinearRegressionWithSGD (discussed here), but I haven't been able to find any such documentation for the R API. Is this simply not available in SparkR (i.e. are SparkR users constrained to solving analytically and, therefore, constrained in the size of our data), or am I missing something essential here? If it isn't currently available in SparkR, is it supposed to come out with SparkR 2.0.0?
Below, I create a toy data set and fit three models, each with a different solver specification:
x1 <- rnorm(n=200, mean=10, sd=2)
x2 <- rnorm(n=200, mean=17, sd=3)
x3 <- rnorm(n=200, mean=8, sd=1)
y <- 1 + .2 * x1 + .4 * x2 + .5 * x3 + rnorm(n=200, mean=0, sd=.1)
dat <- cbind.data.frame(y, x1, x2, x3)
df <- as.DataFrame(sqlContext, dat)
m1 <- SparkR::glm(y ~ x1 + x2 + x3, data = df, solver = "normal")
m2 <- SparkR::glm(y ~ x1 + x2 + x3, data = df, solver = "auto")
m3 <- SparkR::glm(y ~ x1 + x2 + x3, data = df, solver = "l-bfgs")
The first and second model result in the same parameter estimation values (supporting my assumption that SparkR is solving the normal equation when fitting both models and, consequently, the models are equivalent). SparkR is able to fit the third model, but when I try to print a summary of the GLM, I receive the following error:
For reference, I am doing this through AWS and have tried different versions of EMR, including the most recent (in case that makes a difference). Also, I am using Spark 1.6.1 (R API).
Spark 1.6.2 API documentation is here
solver:
The solver algorithm used for optimization, this can be "l-bfgs", "normal" and "auto". "l-bfgs" denotes Limited-memory BFGS which is a limited-memory quasi-Newton optimization method. "normal" denotes using Normal Equation as an analytical solution to the linear regression problem. The default value is "auto" which means that the solver algorithm is selected automatically.
To me - this looks worthy of a bug report on the Apache Spark Jira site.
I have a data set with response variable ADA, and independent variables LEV, ROA, and ROAL. The data is called dt. I used the following code to get coefficients for latent classes.
m1 <- stepFlexmix(ADA ~ LEV+ROA+ROAL,data=dt,control= list(verbose=0),
k=1:5,nrep= 10);
m1 <- getModel(m1, "BIC");
All was fine until I read the following from http://rss.acs.unt.edu/Rdoc/library/flexmix/html/flexmix.html
model Object of FLXM of list of FLXM objects. Default is the object returned by calling FLXMRglm().
Which I think says that default model call is generalized linear model, while I am interested in linear model. How can I use linear model rather than GLM? I searched for it for quite a while, bit could't get it except this example from
http://www.inside-r.org/packages/cran/flexmix/docs/flexmix, which I couldn't make sense of:
data("NPreg", package = "flexmix")
## mixture of two linear regression models. Note that control parameters
## can be specified as named list and abbreviated if unique.
ex1 <- flexmix(yn~x+I(x^2), data=NPreg, k=2,
control=list(verb=5, iter=100))
ex1
summary(ex1)
plot(ex1)
## now we fit a model with one Gaussian response and one Poisson
## response. Note that the formulas inside the call to FLXMRglm are
## relative to the overall model formula.
ex2 <- flexmix(yn~x, data=NPreg, k=2,
model=list(FLXMRglm(yn~.+I(x^2)),
FLXMRglm(yp~., family="poisson")))
plot(ex2)
Someone please let me know how to use linear regression instead of GLM. Or am I already using LM and just got confused because of the "default model line"? Please explain. Thanks.
I did a numerical analysis to understand if
m1 <- stepFlexmix(ADA ~ LEV+ROA+ROAL,data=dt,control= list(verbose=0)
does produce results from linear regression. To do the experiment, I ran the following code and found that yes the estimated parameters are indeed from linear regression. Experiment helped me to allay my reservations.
x1 <- c(1:200);
x2 <- x1*x1;
x3 <- x1*x2;
e1 <- rnorm(200,0,1);
e2 <- rnorm(200,0,1);
y1 <- 5+12*x1+20*x2+30*x3+e1;
y2 <- 18+5*x1+10*x2+15*x3+e2;
y <- c(y1,y2)
x11 <- c(x1,x1)
x22 <- c(x2,x2)
x33 <- c(x3,x3)
d <- data.frame(y,x11,x22,x33)
m <- stepFlexmix(y ~ x11+x22+x33, data =d, control = list(verbose=0), k=1:5, nrep = 10);
m <- getModel(m, "BIC");
parameters(m);
plotEll(m, data = d)
m.refit <- refit(m);
summary(m.refit)
I am fitting GAM models to data using the mgcv package in R. Some of my predictors are circular, so I am using a periodic smoother. I run into an issue in cross validation where my holdout dataset can contain values outside the range of the training data. Since the gam package automatically chooses knots for the smooths, this leads to an error (see my related question here -- thanks to #nograpes and #DWin for their explanations of the errors there).
How can I manually specify the outer knots in a periodic smooth?
Example code
The first block generates some data.
library(mgcv)
set.seed(223) # produces error.
# set.seed(123) # no error.
# generate data:
x <- runif(100,min=-pi,max=pi)
linPred <- 2*cos(x) # value of the linear predictor
theta <- 1 / (1 + exp(-linPred)) #
y <- rbinom(100,1,theta)
plot(x,theta)
df <- data.frame(x=x,y=y)
The next block fits the GAM model with the periodic smooth:
gamFit <- gam(y ~ s(x,bs="cc",k=5),data=df,family=binomial())
summary(gamFit)
plot(gamFit)
It will be somewhere in the specification of the smoother term s(x,bs="cc",k=5) where I'm sure you'll be able to set some knots, but this is not obvious to me from the help of gam or from googling.
This block will fit some holdout data and produce the error if you set the seed as above:
# predict y values for new data:
x.2 <- runif(100,min=-pi,max=pi)
df.2 <- data.frame(x=x.2)
predict(gamFit,newdata=df.2)
Ideally, I would only set the outer knots and let gam pick the rest.
Apologies if this question is better for CrossValidated than SO.
Try this:
gamFit <- gam(y ~ s(x,bs="cc",k=5),
knots=list( x=seq(-pi,pi, len=5) ),
data=df, family=binomial())
You will find a worked example at:
?smooth.construct.cr.smooth.spec
I learned in testing this code that the 'k' parameter in s() needs to match the 'len' parameter in the 'x'-seq() value passed to knots(). I thought incorrectly that the knots argument would get passed to s().
You can do this in {mgcv} now and for some years (but perhaps not at the time the question was posed and answered). Using the model in #IRTFM's answer, one can just specify the outer knots for a cyclic CRS:
gamFit <- gam(y ~ s(x, bs = "cc"),
knots = list(x = c(-pi, pi)),
data = df, family = binomial())
I'm still pretty new to R and AI / ML techniques. I would like to use a neural net for prediction, and since I'm new I would just like to see if this is how it should be done.
As a test case, I'm predicting values of sin(), based on 2 previous values. For training I create a data frame withy = sin(x), x1 = sin(x-1), x2 = sin(x-2), then use the formula y ~ x1 + x2.
It seems to work, but I am just wondering if this is the right way to do it, or if there is a more idiomatic way.
This is the code:
require(quantmod) #for Lag()
requre(nnet)
x <- seq(0, 20, 0.1)
y <- sin(x)
te <- data.frame(y, Lag(y), Lag(y,2))
names(te) <- c("y", "x1", "x2")
p <- nnet(y ~ x1 + x2, data=te, linout=TRUE, size=10)
ps <- predict(p, x1=y)
plot(y, type="l")
lines(ps, col=2)
Thanks
[edit]
Is this better for the predict call?
t2 <- data.frame(sin(x), Lag(sin(x)))
names(t2) <- c("x1", "x2")
vv <- predict(p, t2)
plot(vv)
I guess I'd like to see that the nnet is actually working by looking at its predictions (which should approximate a sin wave.)
I really like the caret package, as it provides a nice, unified interface to a variety of models, such as nnet. Furthermore, it automatically tunes hyperparameters (such as size and decay) using cross-validation or bootstrap re-sampling. The downside is that all this re-sampling takes some time.
#Load Packages
require(quantmod) #for Lag()
require(nnet)
require(caret)
#Make toy dataset
y <- sin(seq(0, 20, 0.1))
te <- data.frame(y, x1=Lag(y), x2=Lag(y,2))
names(te) <- c("y", "x1", "x2")
#Fit model
model <- train(y ~ x1 + x2, te, method='nnet', linout=TRUE, trace = FALSE,
#Grid of tuning parameters to try:
tuneGrid=expand.grid(.size=c(1,5,10),.decay=c(0,0.001,0.1)))
ps <- predict(model, te)
#Examine results
model
plot(y)
lines(ps, col=2)
It also predicts on the proper scale, so you can directly compare results. If you are interested in neural networks, you should also take a look at the neuralnet and RSNNS packages. caret can currently tune nnet and neuralnet models, but does not yet have an interface for RSNNS.
/edit: caret now has an interface for RSNNS. It turns out if you email the package maintainer and ask that a model be added to caret he'll usually do it!
/edit: caret also now supports Bayesian regularization for feed-forward neural networks from the brnn package. Furthermore, caret now also makes it much easier to specify your own custom models, to interface with any neural network package you like!