I would like to build a model to predict Y based on several variables. First, I had a look at the scatterplot and correlation map in R (see below).
[image: scatterplot and correlation map]
It appears that Y has an exponential relationship with X1, a logistic growth relationship with X2, and linear relationships with X3 and X4. So I was wondering if it is possible to use nls() to build a model that covers the above relationships. Below is my attempt:
I first modelled Y ~ X2 on its own in nls() to get the phi starting values, then tried:
fit <- nls(Y ~ c1 * exp(-k1 * X1) + SSlogis(Y, phi1, phi2, phi3) + X3 + X4,
           start = list(c1 = Y[1], k1 = 0, phi1 = 15.07, phi2 = 1082.67, phi3 = 55.47))
Error: minFactor
Then I tried it differently:
fit <- nls(Y ~ c1 * exp(-k1 * X1) + c2 / (1 + b2 * exp(-k2 * X2)) + X3 + X4,
           start = list(c1 = Y[1], k1 = 0, c2 = 1, b2 = 0, k2 = 111))
Error: singularity
Q1: Can I mix variables like this? If so, any thoughts on how to fix the errors?
Q2: Any thoughts on model selection (other models)?
I am new to modeling in R, so I'm stumbling a bit...
I have a model in EViews that I have to translate to R and then extend.
The model is a multiple OLS regression with an AR(1) term on the residuals.
I implemented it like this:
model1 <- lm(y ~ x1 + x2 + x3, data)
data$e <- dplyr::lag(residuals(model1), 1)
model2 <- lm(y ~ x1 + x2 + x3 + e, data)
My issue is the same as in this thread, and I expected it: while the parameter estimates are similar, they differ enough that I cannot use them.
I am planning on using arima() from the stats package, but the problem is the implementation. How do I put an AR(1) on the residuals while keeping the other variables as they are?
Provided I understood you correctly, you can supply external regressors to your arima model through the xreg argument.
You don't provide sample data so I don't have anything to play with, but your model should translate to something like:
model <- arima(data$y, xreg = as.matrix(data[, c("x1", "x2", "x3")]), order = c(1, 0, 0))
Explanation: The first argument data$y contains your time series data. xreg contains your external regressors as a matrix, with every column containing as many observations for that regressor as you have time points. order = c(1, 0, 0) defines an AR(1) model.
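As a rough sanity check (the simulated data and coefficient values below are made up for illustration), you can confirm that arima() recovers both the regression coefficients and the AR(1) coefficient:
set.seed(1)
n  <- 200
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
e  <- arima.sim(list(ar = 0.6), n)              # AR(1) errors
y  <- 2 + 1.5 * x1 - 0.5 * x2 + x3 + e          # arbitrary true coefficients
fit <- arima(y, xreg = cbind(x1, x2, x3), order = c(1, 0, 0))
fit  # estimates should be close to ar1 = 0.6, intercept = 2, then 1.5, -0.5, 1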
When we plot a GAM fitted with the mgcv package using isotropic smoothers, we get a contour plot that looks something like this:
x axis for one predictor,
y axis for another predictor,
and the main title is the function s(x1, x2) (the isotropic smoother).
Suppose that in this model we have many other isotropic smoothers like:
y ~ s(x1, x2) + s(x3, x4) + s(x5, x6)
My doubts are: when interpreting the contour plot for s(x1, x2), what happens to the other isotropic smoothers? Are they "fixed at their medians"? Can we interpret the s(x1, x2) plot separately?
Because this model is additive in the functions you can interpret the functions (the separate s() terms) separately, but not necessarily as separate effects of covariates on the response. In your case there is no overlap between the covariates in each of the bivariate smooths, so you can also interpret them as the effects of the covariates on the response separately from the other smoothers.
All of the smooth functions are typically subject to a sum to zero constraint to allow the model constant term (the intercept) to be an identifiable parameter. As such, the 0 line in each plot is the value of the model constant term (on the scale of the link function or linear predictor).
The plots shown in the output from plot.gam(model) are partial effects plots or partial plots. You can essentially ignore the other terms if you are interested in understanding the effect of that term on the response as a function of the covariates for the term.
If you have other terms in the model that might include one or more covariates that also appear in other terms, and you want to look at how the response changes as you vary that term or covariate, then you should predict from the model over the range of the variables you are interested in, whilst holding the other variables at some representative values, say their means or medians.
For example if you had
model <- gam(y ~ s(x, z) + s(x, v), data = foo, method = 'REML')
and you want to know how the response varied as a function of x only, you would fix z and v at representative values and then predict over a range of values for x:
newdf <- with(foo, expand.grid(x = seq(min(x), max(x), length = 100),
                               z = median(z),
                               v = median(v)))
newdf <- cbind(newdf, fit = predict(model, newdata = newdf, type = 'response'))
plot(fit ~ x, data = newdf, type = 'l')
Also, see ?vis.gam in the mgcv package as a means of preparing plots like this but where it does the hard work for you.
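For instance, a contour plot of the fitted surface over x and z, holding v at its median (this reuses the hypothetical model and data frame foo from above; cond is how vis.gam fixes the remaining covariates):
vis.gam(model, view = c("x", "z"), plot.type = "contour",
        cond = list(v = median(foo$v)), too.far = 0.05)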
My objective is to create marginal effects and a plot similar to what's done in this post under "marginal effects": https://www.drbanderson.com/myresources/interpretinglogisticregressionpartii/
Since I cannot provide the actual model or actual data (data is sensitive), I will provide a generic example.
I have the following model created using the glm function:
model <- glm(y ~ as.factor(x1) + x2 + I(x2^2) + x3 + as.factor(x4):as.factor(x5), data = dataFrame, family = "binomial")
x2 is a continuous variable for which I want to calculate marginal effects at the average of the other continuous variable, x3, and at pre-defined values for x1, x4, and x5. For further simplification, assume x1 is categorical with levels morning, afternoon, or night (thus producing two coefficients in the logit model), x4 is categorical with levels left or right, and x5 is categorical with levels up or down (thus x4:x5 produces coefficients for left and up, left and down, and right and up, with right and down as the excluded interaction).
Similar to what is done in the post, I run the following code:
x2.inc <- seq(min(dataFrame$x2), max(dataFrame$x2), by = .1)
to get a sequence of x2 values at which to evaluate the marginal effect. Finally, I attempt to run the margins command:
x2.margins.df <- as.data.frame(summary(margins(model, at = list(x2 = x2.inc, x3 = mean(dataFrame$x3), x1 = 'morning', x4 = 'left', x5 = 'right'))))
However, running this produced the following error:
Error in attributes(.Data) <- c(attributes(.Data), attrib) :
'names' attribute [1] must be the same length as the vector [0]
Any thoughts on how I can successfully run the margins command given a) the quadratic nature of x2 in my model, and b) the interaction of terms in the model?
As a side note: I know I can calculate these things manually if I wanted to. However, for the sake of having less code and ease of reproducibility, I'd like to make this method work. Thank you for the assistance!
The readme of margins says:
https://cran.r-project.org/web/packages/margins/readme/README.html
that it supports logit models. So why implement something manually?
library("car")
library("plm")
data("LaborSupply", package = "plm")
model <- glm(disab ~ kids*age + kids*I(age^2), data = LaborSupply, family="binomial")
summary(margins(model))
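And if, as in your question, you want the effects evaluated at specific covariate values, the same at argument should work, e.g. (the ages below are arbitrary):
summary(margins(model, at = list(age = c(30, 40, 50))))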
Would anyone be able to explain how to specify optimization methods in the SparkR operation glm? When I try to fit an OLS model with glm, I can only specify "normal" or "auto" as the solver type. SparkR isn't able to interpret the solver specification "l-bfgs", leading me to believe that when I do specify "auto", SparkR simply assumes "normal" and then estimates the model coefficients analytically, using the LS normal equation.
Is fitting GLMs with stochastic gradient descent and L-BFGS not available in SparkR, or am I writing the following evaluation incorrectly?
m <- SparkR::glm(y ~ x1 + x2 + x3, data = df, solver = "l-bfgs")
There's plenty of documentation in Spark about using iterative methods to fit GLMs, e.g. LogisticRegressionWithLBFGS and LinearRegressionWithSGD (discussed here), but I haven't been able to find any such documentation for the R API. Is this simply not available in SparkR (i.e. are SparkR users constrained to solving analytically and, therefore, constrained in the size of our data), or am I missing something essential here? If it isn't currently available in SparkR, is it supposed to come out with SparkR 2.0.0?
Below, I create a toy data set and fit three models, each with a different solver specification:
x1 <- rnorm(n=200, mean=10, sd=2)
x2 <- rnorm(n=200, mean=17, sd=3)
x3 <- rnorm(n=200, mean=8, sd=1)
y <- 1 + .2 * x1 + .4 * x2 + .5 * x3 + rnorm(n=200, mean=0, sd=.1)
dat <- cbind.data.frame(y, x1, x2, x3)
df <- as.DataFrame(sqlContext, dat)
m1 <- SparkR::glm(y ~ x1 + x2 + x3, data = df, solver = "normal")
m2 <- SparkR::glm(y ~ x1 + x2 + x3, data = df, solver = "auto")
m3 <- SparkR::glm(y ~ x1 + x2 + x3, data = df, solver = "l-bfgs")
The first and second models result in the same parameter estimates (supporting my assumption that SparkR solves the normal equation when fitting both models and that, consequently, the models are equivalent). SparkR is able to fit the third model, but when I try to print a summary of the GLM, I receive the following error:
For reference, I am doing this through AWS and have tried different versions of EMR, including the most recent (in case that makes a difference). Also, I am using Spark 1.6.1 (R API).
Spark 1.6.2 API documentation is here
solver:
The solver algorithm used for optimization, this can be "l-bfgs", "normal" and "auto". "l-bfgs" denotes Limited-memory BFGS which is a limited-memory quasi-Newton optimization method. "normal" denotes using Normal Equation as an analytical solution to the linear regression problem. The default value is "auto" which means that the solver algorithm is selected automatically.
To me, this looks worthy of a bug report on the Apache Spark JIRA site.
I'm using the ordinal package in R to run an ordinal logistic regression on a dependent variable based on a 1-5 Likert scale, and I'm trying to figure out how to test the proportional odds assumption.
My current model is y ~ x1 + x2 + x3 + x4 + x2*x3 + (1|ID) + (1|form) where x1 and x2 are dichotomous and x3 and x4 are continuous variables. (92 subjects, 4 forms).
As far as I know,
-"nominal" is not implemented in the more recent version of clmm.
-clmm2 (the older version) does not accept more than one random variable
-nominal_test() only appears to work for clm2 (without random effects at all)
For a different dv (that only has one random term and no interaction), I had used:
m1 <- clmm2(y ~ x1 + x2 + x3, random = ID, Hess = TRUE, data = d)
m1.nom <- clmm2(y ~ x1 + x2, random = ID, Hess = TRUE, nominal = ~ x3, data = d)
m2.nom <- clmm2(y ~ x2 + x3, random = ID, Hess = TRUE, nominal = ~ x1, data = d)
m3.nom <- clmm2(y ~ x1 + x3, random = ID, Hess = TRUE, nominal = ~ x2, data = d)
anova(m1.nom, m1)
anova(m2.nom, m1)
anova(m3.nom, m1)  # as well as considering the output of summary(m#.nom)
But I'm not sure how to modify this approach to handle the current model (two random terms and an interaction of the fixed effects), nor am I sure that this is actually a correct way to test the proportional odds assumption in the first place. (The example in the package tutorial only has two fixed effects.)
I'm open to other approaches (be they other packages, software, or graphical approaches) that would let me test this. Any suggestions?
Even in the case of the most basic ordinal logistic regression models, the diagnostic tests for the proportional odds assumption are known to frequently reject the null hypothesis that the coefficients are the same across the levels of the ordered factor. The statistician Frank Harrell suggests here a general graphical method for examining the proportional odds assumption, which is probably your best bet. In this approach you'd just graph the linear predictions from a logit model (with random effects) for each level of the outcome and one predictor variable at a time.
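A minimal sketch of that kind of check (my own reconstruction, not from Harrell's post), assuming a data frame d containing your ordinal outcome y coded numerically 1-5 plus the predictors and grouping factors from your model, and using lme4::glmer (rather than ordinal::clmm) for the binary fits: dichotomize the outcome at each cutoff, refit the mixed-effects logit, and compare each predictor's coefficient across cutoffs; under proportional odds they should be roughly constant.
library(lme4)

cutoffs <- 2:5                                    # binary splits y >= 2, ..., y >= 5
coefs <- sapply(cutoffs, function(j) {
  fit <- glmer(as.numeric(y >= j) ~ x1 + x2 * x3 + x4 + (1 | ID) + (1 | form),
               data = d, family = binomial)      # assumes y is stored numerically 1-5
  fixef(fit)                                     # fixed-effect logit coefficients
})

k <- nrow(coefs) - 1                             # number of predictors (drop the intercept)
matplot(cutoffs, t(coefs[-1, , drop = FALSE]), type = "b", pch = 16,
        col = seq_len(k), lty = seq_len(k),
        xlab = "cutoff (y >= j)", ylab = "logit coefficient")
legend("topright", legend = rownames(coefs)[-1],
       col = seq_len(k), lty = seq_len(k), bty = "n")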