Variable selection with the earth package for MARS in R

I am working on a MARS model using the earth package in R. I want to force all the variables into the model. I have 14 predictors, but in the result I am getting only 13. Below is my code.
library(earth)
mdl <- earth(x, y, nprune = 10000, nk = 1000, degree = 1, thresh = 1e-7,
             linpreds = c(1:14), penalty = -1, ponly = TRUE, trace = 0)
Here are my questions:
Is it possible to force variables into the model instead of having the algorithm select them? If yes, how?
Once I start exploring hinges in the data, is it possible to manually fix the knots and then get the estimates based on them?

You cannot force predictors into the earth/MARS model. Fundamental to the MARS algorithm is automatic selection of predictors.
But it is possible to increase the likelihood of all predictors entering the model by subverting the normal algorithm: set thresh=0 and penalty=-1. See the earth help page, and search for "thresh" and "penalty" in the earth vignette for details and examples.
However, to quote that vignette: "It’s usually best not to subvert the standard MARS algorithm by toying with tuning parameters such as thresh, penalty, and endspan. Remember that we aren’t seeking a model that best fits the training data, but rather a model that best fits the underlying distribution."
You can't manually fix the knots. Once again, automatic selection of knots is inherent in the MARS algorithm. However, the minspan and endspan arguments give you some flexibility in the automatic placement of knots; for example, minspan=-3 will allow up to 3 equally spaced knots per predictor.
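A minimal sketch of these suggestions, with x and y standing in for your own data (thresh, penalty, and minspan are the arguments discussed above):
library(earth)
# thresh = 0 and penalty = -1 make it more likely that every predictor enters the model;
# minspan = -3 allows up to 3 equally spaced candidate knots per predictor
mdl <- earth(x, y, degree = 1, thresh = 0, penalty = -1, minspan = -3)
summary(mdl)
evimp(mdl)   # check which predictors were actually used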


Downsizing a lm object for plotting

I'd like to use check_model() from {performance}, but I'm working with a few million data points, which makes plotting too costly. Is it possible to take a sample from an lm() model without affecting everything else (e.g., its coefficients)?
# defining a model
model = lm(mpg ~ wt + am + gear + vs * cyl, data = mtcars)
# checking model assumptions
performance::check_model(model)
Created on 2022-08-23 by the reprex package (v2.0.1)
Alternative: is downsizing OK? In a ML workflow I'd downsample for tuning, feature selection and feature engineering, for example. But I don't know if that's usual in classic linear regression modelling (is it OK to test for heteroskedasticity on a downsized sample and then estimate the coefficients with the full sample?).
Speeding up check_model
The documentation (?check_model) explains a few things you can do to speed up the function/plotting without subsampling:
For models with many observations, or for more complex models in
general, generating the plot might become very slow. One reason might
be that the underlying graphic engine becomes slow for plotting many
data points. In such cases, setting the argument show_dots = FALSE
might help. Furthermore, look at the check argument and see if some of
the model checks could be skipped, which also increases performance.
Accordingly, you can turn off plotting of the individual data points with check_model(model, show_dots = FALSE). You can also restrict the output to the specific checks you are interested in (reducing computation time). For example, you could get only samples from the posterior predictive distribution with check_model(model, check = "pp_check").
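A minimal sketch of those two speed-ups, using the model from the question:
library(performance)
check_model(model, show_dots = FALSE)    # skip drawing each individual data point
check_model(model, check = "pp_check")   # run only the posterior predictive check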
Implications of Downsampling
Choosing a subset of observations (and/or draws from the posterior if you're using a Bayesian model) will always change the results of anything that conditions on the data. Both your model parameters and post-estimation summaries conditioning on the data will change. Just how much it will change depends on variability of your observations and sample size. With millions of observations, it's probably unlikely to change much -- but maybe some rare data patterns can heavily influence your results during (post)-estimation.
Plotting for heteroskedasticity based on a different model than the one you estimated makes little sense, but your mileage may vary because the models may differ little. You're looking to evaluate how well your model approximates the Gauss-Markov variance assumptions, not how well another model does. From a computational perspective, it would also be puzzling to do so: the residuals are part of estimation -- if you can fit the model, you can presumably also show the residuals in various ways.
That being said, these plots are also approximations to the actual distribution of interest anyway (i.e. you're implicitly estimating test statistics with some of these plots), and since the central limit theorem applies, things would look roughly the same if you cut out some observations, given that your data are sufficiently large.
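As a middle ground, a minimal sketch (the subset size is arbitrary): keep the model fitted to the full data, but plot only a random subset of its residuals by hand.
set.seed(1)
idx <- sample(seq_len(nobs(model)), min(10000, nobs(model)))  # random subset of observations
plot(fitted(model)[idx], resid(model)[idx],
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)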

How should you use scaled weights with the svydesign() function in the survey package in R?

I am using the survey package in R to analyse the "Understanding Society" social survey. The main user guide for the survey specifies (on page 45) that the weights have been scaled to have a mean of 1. When using the svydesign() function, I am passing the weight variable to the weight argument.
In the survey package documentation, under the surveysummary() function, it states:
Note that the design effect will be incorrect if the weights have been rescaled so that they are not reciprocals of sampling probabilities.
Will I therefore get incorrect estimates and/or standard errors when using functions such as svyglm() etc?
This came to my attention because, when using the psrsq() function to get the Pseudo R-Squared of a model, I received the following warning:
Weights appear to be scaled: rsquared may be wrong
Any help would be greatly appreciated! Thanks!
No, you don't need to worry
The warning is only about design effect estimation (which most people don't need to do), and only about without-replacement design effects (DEFF rather than DEFT). Most people just need estimates and standard errors, and these are fine; there is no problem.
If you want to estimate the design effects, R needs to estimate the standard errors (which is fine) and also estimate what the standard errors would be under simple random sampling without replacement, with the same sample size. That second part is the problem: working out the variance under SRSWoR requires knowing the population size. If you have scaled the weights, R can no longer work out the population size.
If you do need design effects (eg, to do a power calculation for another survey), you can still get the DEFT design effects that compare to simple random sampling with replacement. It's only if you want design effects compared to simple random sampling without replacement that you need to worry about the scaling of weights. Very few people are in that situation.
As a final note, surveysummary isn't a function; it's a help page.
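A minimal sketch of the point above, assuming a hypothetical data frame df with scaled weights wt_scaled and design variables psu and strata (not the actual Understanding Society variable names): estimates, standard errors, and with-replacement design effects are unaffected by the scaling.
library(survey)
des <- svydesign(ids = ~psu, strata = ~strata, weights = ~wt_scaled,
                 data = df, nest = TRUE)
svymean(~income, des)                    # estimates and SEs: fine with scaled weights
svymean(~income, des, deff = "replace")  # design effect vs SRS *with* replacement: also fine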

lavaan - problems with WLSMV estimator

I am running a SEM using lavaan that includes 5 latent variables. Also, I have 5 regression equations (Y~...) where outcomes are manifest variables and regressors are a mix of latents and indicators.
When I use maximum likelihood estimation the model runs without problems. But when I switch to WLSMV estimation (adding the argument estimator = "WLSMV") I run into two problems. The first is that execution becomes extremely slow, taking several hours to run a single model. Any idea why this is happening and whether there is a way to fix it?
The second problem is that when I try to fit multigroup SEMs and start constraining the model I get the following warning:
lavaan WARNING: the optimizer (NLMINB) claimed the model converged,
but not all elements of the gradient are (near) zero;
the optimizer may not have found a local solution
use check.gradient = FALSE to skip this check.
Any idea what this means? What are the implications? Is this a problem? How do I fix it? Should I simply stay with maximum likelihood?
IMPORTANT: when I remove the regressions and keep only the measurement part (the five latent variables), the function executes quickly and I stop getting the warning message. Does this mean that WLSMV should not be used when the CFA becomes a SEM?
Thanks in advance!
You have a big model for a small sample, I bet, and the sample is particularly small for the DWLS estimator with a mean- and variance-adjusted (MV) chi-squared test statistic, i.e. WLSMV. You can try to simplify your model, increase your sample, or use a different estimator, such as "MLR", maximum likelihood estimation with robust (Huber-White) standard errors.
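A minimal sketch of that estimator switch (model_syntax and mydata are placeholders, not your actual model or data):
library(lavaan)
fit_wlsmv <- sem(model_syntax, data = mydata, estimator = "WLSMV")  # DWLS + MV test statistic; needs a large sample
fit_mlr   <- sem(model_syntax, data = mydata, estimator = "MLR")    # ML with robust (Huber-White) SEs
summary(fit_mlr, fit.measures = TRUE)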
I suggest that you check the chapter by Finney, DiStefano, and Kopp (2016).
Finney, S. J., DiStefano, C., & Kopp, J. P. (2016). Overview of estimation methods and preconditions for their application with structural equation modeling. In K. Schweizer & C. DiStefano (Eds.), Principles and methods of test construction: Standards and recent advances (pp. 135-165). Hogrefe Publishing. https://doi.org/10.1027/00449-000

How to fit the best hybrid Gibbs model with spatstat?

I want to fit the best hybrid model to a ppp object with more than 100 points in a rectangular window, so I am doing trial and error. It seems natural to find the relevant r for a Geyer model by looking at the centred L(r) plot, where there are several peaks and troughs. So I draw the centred L(r) function first:
library(spatstat)
L.inhom <- Linhom(my.X)
plot(L.inhom, iso - r ~ r, legend = FALSE)
Then I decide whether to include a hardcore component (the first trough suggests a hardcore) and a Geyer component (the following peaks or troughs).
According to the spatstat book, one can always use profilepl() to find the best (r, sat) for a Geyer saturation model:
df <- expand.grid(r = seq(45, 200, by = 1), sat = c(1, 2, 3, 4, 5))
p.Geyer <- profilepl(df, Geyer, my.X ~ 1, correction = "translate", aic = TRUE)
as.ppm(p.Geyer)   # the selected (r=, sat=)
I feel that the (r, sat) found by profilepl() is good enough for the global pattern of my ppp (is that right?). Indeed, a hybrid model including a hardcore and a Geyer component with these (r, sat) parameters has a much lower AIC than the corresponding pure hardcore model and Geyer saturation model.
After this, I fit a second hybrid model with a hardcore and two Geyer components, which is even better according to AIC. I think this second hybrid model is also valid for my data, as the L(r) plot does present more than one peak and trough.
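A minimal sketch of what these hybrid fits might look like (the hardcore distance and the r/sat values below are placeholders, not the values actually returned by profilepl()):
library(spatstat)
hybrid.1 <- ppm(my.X ~ 1, Hybrid(H = Hardcore(hc = 10), G1 = Geyer(r = 60, sat = 2)))
hybrid.2 <- ppm(my.X ~ 1, Hybrid(H = Hardcore(hc = 10), G1 = Geyer(r = 60, sat = 2),
                                 G2 = Geyer(r = 120, sat = 1)))
AIC(hybrid.1); AIC(hybrid.2)   # compare, as described above (pseudolikelihood-based for Gibbs models)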
I am thinking of using an envelope of the fitted model to judge its validity: if the fitted L(r) is bounded by the simulated envelope, then the model is good enough for the data.
env.hybrid.final <- envelope(hybrid.1, Linhom,
                             simulate = expression(rpoispp(den.X)),
                             use.theory = TRUE, nsim = 19, global = TRUE)
plot(env.hybrid.final, . - r ~ r, shade = NULL, legend = FALSE,
     xlab = "r", ylab = "L(r) - r", main = "")
Here come the questions:
Q1: Is it right to use an L(r) envelope plot to judge the validity of a fitted model, as I did? If not, how should one judge the best fitted model, or even a "good" model?
Q2: When you can fit several models to a dataset, how do you decide when a model is good enough, especially when there is no clear mechanism known? AIC?

How to consider different costs for different types of errors in SVM using R

Let Y be a binary variable.
If we use logistic regression for modelling, then we can use cv.glm for cross-validation, where we can specify the cost function in the cost argument. By specifying the cost function, we can assign different unit costs to different types of errors: predicted Yes when the reference is No, or predicted No when the reference is Yes.
I am wondering if I could achieve the same in SVM. In other words, is there a way for me to specify a cost(loss) function instead of using built-in loss function?
Besides the answer by Yueguoguo, there are three more solutions: the standard wrapper approach, hyperplane tuning, and the one in e1071.
The wrapper approach (available out of the box, for example, in Weka) is applicable to almost all classifiers. The idea is to over- or undersample the data in accordance with the misclassification costs. The learned model, if trained to optimise accuracy, is then optimal under the costs.
The second idea is frequently used in text mining. In an SVM, classifications are derived from the distance to the hyperplane. For linearly separable problems this distance is {1, -1} for the support vectors. The classification of a new example is then basically whether the distance is positive or negative. However, one can also shift this threshold and not make the decision at 0 but move it, for example, towards 0.8. That way the classifications are shifted in one or the other direction, while the general shape of the data is not altered.
Finally, some machine learning toolkits have a built-in parameter for class-specific costs, like class.weights in the e1071 implementation. The name is due to the fact that the term cost is already taken (it is the SVM's regularisation parameter).
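A minimal sketch of that option, assuming a hypothetical data frame dat with a binary factor outcome y whose levels are "No" and "Yes": errors on the "Yes" class are penalised five times more heavily.
library(e1071)
fit <- svm(y ~ ., data = dat, kernel = "radial",
           class.weights = c(No = 1, Yes = 5))   # asymmetric misclassification costs
table(predicted = predict(fit, dat), reference = dat$y)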
The loss function for the SVM hyperplane parameters is tuned automatically thanks to the beautiful theoretical foundation of the algorithm. SVM applies cross-validation for tuning hyperparameters. Say an RBF kernel is used; cross-validation then selects the optimal combination of C (cost) and gamma (kernel parameter) for the best performance, measured by a chosen metric (e.g., mean squared error). In e1071, the performance can be obtained by using the tune() function, where the range of hyperparameters as well as the cross-validation settings (i.e., 5-, 10- or more-fold cross-validation) can be specified.
To obtain comparable cross-validation results using an area-under-curve type of error measurement, one can train different models with different hyperparameter configurations and then validate each model against sets of pre-labelled data.
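A minimal sketch of the tuning step described above, again assuming a hypothetical data frame dat with outcome y (the hyperparameter ranges are arbitrary):
library(e1071)
tuned <- tune(svm, y ~ ., data = dat,
              ranges = list(cost = 10^(-1:2), gamma = 10^(-2:0)),
              tunecontrol = tune.control(cross = 10))   # 10-fold cross-validation
summary(tuned)            # cross-validated performance for each (cost, gamma) combination
tuned$best.parameters     # the selected hyperparameters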
Hope the answer helps.
