SparkR MLlib & spark.ml: least squares and glm optimization - r

Would anyone be able to explain how to specify optimization methods in the SparkR operation glm? When I try to fit an OLS model with glm, I can only specify "normal" or "auto" as the solver type. SparkR isn't able to interpret the solver specification "l-bfgs", leading me to believe that when I do specify "auto", SparkR simply assumes "normal" and then estimates the model coefficients analytically, using the LS normal equation.
Is fitting GLMs with stochastic gradient descent and L-BFGS not available in SparkR, or am I writing the following evaluation incorrectly?
m <- SparkR::glm(y ~ x1 + x2 + x3, data = df, solver = "l-bfgs")
There's plenty of documentation in Spark about using iterative methods to fit GLMs, e.g. LogisticRegressionWithLBFGS and LinearRegressionWithSGD (discussed here), but I haven't been able to find any such documentation for the R API. Is this simply not available in SparkR (i.e. are SparkR users constrained to solving analytically and, therefore, constrained in the size of our data), or am I missing something essential here? If it isn't currently available in SparkR, is it supposed to come out with SparkR 2.0.0?
Below, I create a toy data set and fit three models, each with a different solver specification:
x1 <- rnorm(n=200, mean=10, sd=2)
x2 <- rnorm(n=200, mean=17, sd=3)
x3 <- rnorm(n=200, mean=8, sd=1)
y <- 1 + .2 * x1 + .4 * x2 + .5 * x3 + rnorm(n=200, mean=0, sd=.1)
dat <- cbind.data.frame(y, x1, x2, x3)
df <- as.DataFrame(sqlContext, dat)
m1 <- SparkR::glm(y ~ x1 + x2 + x3, data = df, solver = "normal")
m2 <- SparkR::glm(y ~ x1 + x2 + x3, data = df, solver = "auto")
m3 <- SparkR::glm(y ~ x1 + x2 + x3, data = df, solver = "l-bfgs")
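To compare the fits, the coefficient tables can be pulled out and checked against base R's analytical least-squares solution on the local data (a quick sketch; this assumes summary() on a SparkR 1.6 glm model returns a list with a $coefficients matrix):
# Sketch: compare the two SparkR fits and base R's normal-equation solution
summary(m1)$coefficients                  # solver = "normal"
summary(m2)$coefficients                  # solver = "auto"
coef(lm(y ~ x1 + x2 + x3, data = dat))    # local analytical least-squares fit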
The first and second models produce the same parameter estimates (supporting my assumption that SparkR solves the normal equation when fitting both models and that, consequently, the models are equivalent). SparkR is able to fit the third model, but when I try to print a summary of the GLM, I receive an error.
For reference, I am doing this through AWS and have tried different versions of EMR, including the most recent (in case that makes a difference). Also, I am using Spark 1.6.1 (R API).

Spark 1.6.2 API documentation is here
solver:
The solver algorithm used for optimization, this can be "l-bfgs", "normal" and "auto". "l-bfgs" denotes Limited-memory BFGS which is a limited-memory quasi-Newton optimization method. "normal" denotes using Normal Equation as an analytical solution to the linear regression problem. The default value is "auto" which means that the solver algorithm is selected automatically.
To me, this looks worthy of a bug report on the Apache Spark Jira site.

Related

Delta Method in GLM.jl

Is hypothesis testing of linear and non-linear functions of coefficients of GLM supported in Julia's GLM.jl?
I am looking for a Julia equivalent of the marginaleffects package in R, which provides the deltamethod() function, or of the nlcom post-estimation command in Stata.
Thanks!
Sample R code:
library(marginaleffects)  # deltamethod() comes from the marginaleffects package
eq = lm(y ~ x1 + x2, data)
deltamethod(eq, "x1 / x2 = 1")
Sample Stata code:
reg y x1 x2
nlcom _b[x1]/_b[x2]
The Julia Package, 'TargetedLearning' uses the delta method, see:
( https://lendle.github.io/TargetedLearning.jl/user-guide/influencecurves/#the-delta-method )

Using margins command in R with quadratic term and interacted dummy variables

My objective is to create marginal effects and a plot similar to what's done in this post under "marginal effects": https://www.drbanderson.com/myresources/interpretinglogisticregressionpartii/
Since I cannot provide the actual model or actual data (data is sensitive), I will provide a generic example.
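For instance, a made-up data frame with the same structure (all names, levels, and values below are simulated purely for illustration):
# Hypothetical stand-in for the sensitive data described below
set.seed(42)
n <- 500
dataFrame <- data.frame(
  x1 = sample(c("morning", "afternoon", "night"), n, replace = TRUE),
  x2 = rnorm(n, mean = 5, sd = 2),
  x3 = rnorm(n, mean = 10, sd = 3),
  x4 = sample(c("left", "right"), n, replace = TRUE),
  x5 = sample(c("up", "down"), n, replace = TRUE)
)
dataFrame$y <- rbinom(n, size = 1, prob = 0.5)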
I have the following model created using the glm function:
model = glm(y ~ as.factor(x1) + x2 + I(x2^2) + x3 + as.factor(x4):as.factor(x5), data = dataFrame, family = "binomial")
x2 is a continuous variable for which I want to calculate marginal effects, holding the other continuous variable, x3, at its average and x1, x4, and x5 at pre-defined values. For further simplification, assume x1 is categorical with levels morning, afternoon, or night (thus producing two coefficients in the logit model), x4 is categorical with levels left or right, and x5 is categorical with levels up or down (thus x4:x5 produces coefficients for left:up, left:down, and right:up, with right:down as the excluded interaction).
Similar to what is done in the post, I run the following code:
x2.inc <- seq(min(dataFrame$x2), max(dataFrame$x2), by = .1)
to get a sequence of x2 values at which to evaluate the marginal effect. Finally, I attempt to run the margins command:
x2.margins.df <- as.data.frame(summary(margins(model, at = list(x2 = x2.inc, x3 = mean(dataFrame$x3), x1 = 'morning', x4 = 'left', x5 = 'right'))))
However, running this produced the following error:
Error in attributes(.Data) <- c(attributes(.Data), attrib) :
'names' attribute [1] must be the same length as the vector [0]
Any thoughts on how I can successfully run the margins command given a) the quadratic nature of x2 in my model, and b) the interaction of terms in the model?
As a side note: I know I can calculate these things manually if I wanted to. However, for the sake of having less code and ease of reproducibility, I'd like to make this method work. Thank you for the assistance!
The readme of margins (https://cran.r-project.org/web/packages/margins/readme/README.html) says that it supports logit models, so why implement something manually?
library("car")
library("plm")
data("LaborSupply", package = "plm")
model <- glm(disab ~ kids*age + kids*I(age^2), data = LaborSupply, family="binomial")
summary(margins(model))
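If the goal is to evaluate the effect at specific values, as in the question, the same at= argument can be added here; a sketch using the age variable from the LaborSupply data:
# Marginal effect of age evaluated at a few fixed ages,
# averaging over the remaining covariates
summary(margins(model, variables = "age", at = list(age = c(30, 40, 50))))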

Multiple variables with different relationships in nls() in R

I would like to build a model to predict Y based on several variables. First, I had a look at a scatterplot matrix and correlation map in R.
It appears that Y has an exponential relationship with X1, a logistic-growth relationship with X2, and linear relationships with X3 and X4. So I was wondering whether it is possible to use nls() to build a model that covers all of these relationships. Below is my try:
First I modelled Y ~ X2 on its own with nls() to get the phi parameters, then:
fit <- nls(Y ~ c1 * exp(-k1 * X1) + SSlogis(Y, phi1, phi2, phi3) + X3 + X4,
           start = list(c1 = Y[1], k1 = 0, phi1 = 15.07, phi2 = 1082.67, phi3 = 55.47))
Error: minFactor
Then I tried it differently:
fit <- nls(Y ~ c1 * exp(-k1 * X1) + c2 / (1 + b2 * exp(-k2 * X2)) + X3 + X4,
           start = list(c1 = Y[1], k1 = 0, c2 = 1, b2 = 0, k2 = 111))
Error: singularity
Q1: Can I mix variables like this in one nls() formula? If so, any thoughts on how to fix the errors above?
Q2: Any thoughts on model selection (other candidate models)?
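One likely source of the singularity error is the choice of starting values: with k1 = 0 and b2 = 0, both the exponential term and the logistic term reduce to constants at the starting point, so their gradient columns are collinear. In the first attempt, SSlogis is also given Y rather than X2 as its input, and in both attempts X3 + X4 enter with coefficients fixed at 1. A minimal sketch with simulated data, free coefficients b3 and b4 for the linear parts, and non-degenerate starting values (all parameter values and names below are invented for illustration):
# Sketch with simulated data matching the described relationships
set.seed(1)
n  <- 200
X1 <- runif(n, 0, 5)
X2 <- runif(n, 0, 10)
X3 <- rnorm(n)
X4 <- rnorm(n)
Y  <- 3 * exp(-0.5 * X1) + 5 / (1 + 2 * exp(-0.8 * X2)) +
      1.5 * X3 + 0.7 * X4 + rnorm(n, sd = 0.1)
fit <- nls(Y ~ c1 * exp(-k1 * X1) + c2 / (1 + b2 * exp(-k2 * X2)) +
               b3 * X3 + b4 * X4,
           start = list(c1 = 1, k1 = 0.3, c2 = 4, b2 = 1, k2 = 0.5,
                        b3 = 1, b4 = 1),
           control = nls.control(maxiter = 200))
summary(fit)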

lme4: Random slopes shared by all observations

I'm using R's lme4. Suppose I have a mixed-effects logistic-regression model where I want some random slopes shared by every observation. They're supposed to be random in the sense that these random slopes should all come from a single normal distribution. This is essentially the same thing as ridge regression, but without choosing a penalty size with cross-validation.
I tried the following code:
library(lme4)
ilogit = function(v)
    1 / (1 + exp(-v))
set.seed(20)
n = 100
x1 = rnorm(n)
x2 = rnorm(n)
x3 = rnorm(n)
x4 = rnorm(n)
x5 = rnorm(n)
y.p = ilogit(.5 + x1 - x2)
y = rbinom(n = n, size = 1, prob = y.p)
m1 = glm(
    y ~ x1 + x2 + x3 + x4 + x5,
    family = binomial)
print(round(d = 2, unname(coef(m1))))
m2 = glmer(
    y ~ ((x1 + x2 + x3 + x4 + x5)|1),
    family = binomial)
print(round(d = 2, unname(coef(m2))))
This yields:
Loading required package: Matrix
[1] 0.66 1.14 -0.78 -0.01 -0.16 0.25
Error: (p <- ncol(X)) == ncol(Y) is not TRUE
Execution halted
What did I do wrong? What's the right way to do this?
Looks like lme4 can't do this as-is. Here's what #amoeba said in stats.SE chat:
What Kodi wants to do is definitely a mixed model, in the sense of Bates et al. see e.g. eq (2) here https://cran.r-project.org/web/packages/lme4/vignettes/lmer.pdf As far as I can see, X and Z design matrices are equal in this case. However, there is no way one can use lme4 to fit this (without hacking into the code): it allows only particular Z matrices that arise from the model formulas of the type (formula|factor).
See https://stat.ethz.ch/pipermail/r-sig-mixed-models/2011q1/015581.html "We intend to allow lmer to be able to use more flexible model matrices for the random effects although, at present, that requires a certain amount of tweaking on the part of the user"
And https://stat.ethz.ch/pipermail/r-sig-mixed-models/2009q2/002351.html "I view the variance-covariance structures available in the lme4 package as being related to random-effects terms in the model matrix. A random-effects term is of the form (LMexpr | GrpFac). The expression on the right of the vertical bar is evaluated as a factor, which I call the grouping factor. The expression on the left is evaluated as a linear model expression."
That's all quotes from Bates. He does say "In future versions of lme4 I plan to allow for extensions of the unconditional variance-covariance structures" (in 2009), but I don't think this was implemented.
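To make the quoted point concrete: for the model described in the question, the random-effects design matrix Z would have to be essentially the same as the fixed-effects design matrix X, with a single "group" containing every observation, and that is not a matrix the (LMexpr | GrpFac) syntax can produce. A tiny illustration, reusing the x1 through x5 simulated in the question's code:
# The Z this model calls for equals the fixed-effects design matrix
# (minus the intercept), with no grouping factor at all
X <- model.matrix(~ x1 + x2 + x3 + x4 + x5)   # fixed-effects design
Z <- X[, -1]                                  # desired random-effects design
dim(X)
dim(Z)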

SVM in R for regression

I have 4 dimensions of data. In R, I'm using plot3d with the 4th dimension being color. I'd like to now use SVM to find the best regression line to give me the best correlation. Basically, a best fit hyperplane dependent on the color dimension. How can I do this?
This is the basic idea (of course the specific formula will vary depending on your variable names and which is the dependent):
library(e1071)
data = data.frame(matrix(rnorm(100*4), nrow=100))
fit = svm(X1 ~ ., data=data)
Then you can use the regular summary, plot, predict, etc. functions on the fit object. Note that with SVMs, the hyper-parameters usually need to be tuned for best results. You can do this with the tune wrapper. Also check out the caret package, which I think is great.
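For instance, a quick sketch of tuning cost and gamma with e1071's tune() on the toy data frame from the example above:
# Grid-search cost and gamma; tune() does 10-fold cross-validation by default
tuned <- tune(svm, X1 ~ ., data = data,
              ranges = list(cost = 10^(-1:2), gamma = 10^(-2:1)))
summary(tuned)
best.fit <- tuned$best.model   # model refit with the best hyper-parameters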
Take a look at the svm function in the e1071 package.
You can also consider the kernlab, klaR, or svmpath packages.
EDIT: #CodeGuy, John has provided you an example. I suppose your 4 dimensions are features that you use to classify your data, and that you also have another variable that is the real class.
y <- gl(4, 5)
x1 <- c(0,1,2,3)[y]
x2 <- c(0,5,10,15)[y]
x3 <- c(1,3,5,7)[y]
x4 <- c(0,0,3,3)[y]
d <- data.frame(y,x1,x2,x3,x4)
library(e1071)
svm01 <- svm(y ~ x1 + x2 + x3 + x4, data=d)
ftable(predict(svm01), y) # cross-tabulates predictions against y to show how the svm performs
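Since the original question asks for a regression fit (a best-fit hyperplane) rather than classification, note that e1071::svm() switches to eps-regression automatically when the response is numeric; a small sketch with made-up data:
# With a numeric response, svm() defaults to type = "eps-regression"
library(e1071)
set.seed(1)
d <- data.frame(x1 = rnorm(50), x2 = rnorm(50), x3 = rnorm(50))
d$y <- 2 * d$x1 - d$x2 + rnorm(50, sd = 0.3)
fit.reg <- svm(y ~ x1 + x2 + x3, data = d)
cor(predict(fit.reg), d$y)   # how closely the fitted values track y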
