GLM & LM assumptions and interpretation - r

I am currently fitting linear models with lm and lmer (with replicate as a random effect) for continuous data, and glm and glmer (again with replicate as a random effect) for count data.
I was wondering whether lm, lmer, glm and glmer all need the data to be normally distributed and, if not, whether I need an alternative test.
Also, I have run a glm and looked at the pairwise differences, but when reporting them I don't know what I should report other than P<0.001, as my glm output doesn't really give me much. Thanks!
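For what it's worth, here is a minimal base-R sketch of the kind of checks and reportable quantities involved. The data are simulated and all names (dat, treatment, growth, count) are made up for illustration. The sketch relies on two standard facts: for lm the normality assumption concerns the residuals rather than the raw response, and a Poisson glm does not assume normality at all but should be checked for overdispersion. The mixed-model versions (lmer, glmer) are not shown.

```r
# Simulated data (hypothetical): a treatment factor, a continuous response and a count response
set.seed(1)
dat <- data.frame(treatment = factor(rep(c("a", "b", "c"), each = 30)))
dat$growth <- 10 + 2 * as.integer(dat$treatment) + rnorm(nrow(dat))
dat$count  <- rpois(nrow(dat), lambda = exp(0.5 + 0.4 * as.integer(dat$treatment)))

# Linear model: check the residuals, not the raw response, for normality
fit_lm    <- lm(growth ~ treatment, data = dat)
shapiro_p <- shapiro.test(residuals(fit_lm))$p.value

# Poisson GLM for counts: no normality assumption, but check for overdispersion
# (Pearson chi-square / residual df; values well above 1 suggest overdispersion)
fit_glm    <- glm(count ~ treatment, family = poisson, data = dat)
dispersion <- sum(residuals(fit_glm, type = "pearson")^2) / df.residual(fit_glm)

# Quantities to report besides the p-value: estimates, standard errors,
# and rate ratios with Wald confidence intervals
coef_table  <- summary(fit_glm)$coefficients
rate_ratios <- exp(cbind(estimate = coef(fit_glm), confint.default(fit_glm)))
```

Printing coef_table and rate_ratios gives the estimate, standard error, z value and a confidence interval for each contrast against the reference level, which is usually more informative than the p-value alone.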

Related

Is there an R function to fit a GLMM for count data with range 0-15? (Possible right censoring needed?)

We are due to collect some survey data with a score ranging from 0 to 15 that is potentially skewed and has a multilevel structure (repeated measures and clustering). I'm anticipating that fitting a linear mixed model with lmer in R will be problematic given the outcome distribution.
I am considering whether some sort of right-censored generalized linear mixed model (Poisson) may be a solution but I'm struggling to find something to fit this model.
I think the closest that I can find is the VGAM::vglm with family = cens.poisson but, as far as I can tell, it cannot include multilevel structure?
Does anyone know of any R functions that would permit this model? If so, is there an equivalent power calculation function, or would this have to be written as a simulation?

SVM for data prediction in R

I'd like to use the 'e1071' library for fitting an SVM model. So far, I've made a model that fits a regression curve to the data set
(take a look at the purple curve):
However, I want the SVM model to "follow" the data, such that the prediction for each value is as close as possible to the actual data. I think this is possible because of this graph, which shows how an SVM (model 2) can be similar to an ARIMA model (model 1):
I tried changing the kernel to no avail. Any help will be much appreciated.
Fine-tuning an SVM is no easy task. Have you considered other models, for example GAMs (generalized additive models)? These work well on very curvy data.
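To illustrate the GAM suggestion, here is a minimal sketch using mgcv, which ships with standard R installations. The data are simulated, since the original data set is not shown, and the names dat, fit_gam and the choice k = 30 are assumptions made for the example.

```r
library(mgcv)  # recommended package, shipped with standard R installations

# Simulated wiggly series (hypothetical stand-in for the real data)
set.seed(42)
x   <- seq(0, 10, length.out = 200)
y   <- sin(x) + 0.3 * cos(3 * x) + rnorm(length(x), sd = 0.2)
dat <- data.frame(x = x, y = y)

# k sets the basis dimension of the smooth; a larger k lets the
# fitted curve follow the data more closely
fit_gam    <- gam(y ~ s(x, k = 30), data = dat, method = "REML")
dat$fitted <- as.numeric(predict(fit_gam, newdata = dat))

# In-sample root mean squared error as a rough measure of how closely the fit follows the data
rmse <- sqrt(mean((dat$y - dat$fitted)^2))
```

The fitted curve can be drawn over the points with plot(dat$x, dat$y) followed by lines(dat$x, dat$fitted). If you stay with e1071::svm(), the gamma, cost and epsilon arguments are the ones that control how tightly the regression follows the data.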

R - Testing for homo/heteroscedasticity and collinearity in a multivariate regression model

I'm trying to optimize a multivariate linear regression model lmMod=lm(depend_var~var1+var2+var3+var4....,data=df) and I'm presently checking the assumptions of the model: constant variance of the residuals and absence of autocorrelation. For this I'm using:
Breusch-Pagan test for homo/heteroscedasticity: lmtest::bptest(lmMod)
Durbin-Watson test for autocorrelation: car::durbinWatsonTest(lmMod)
I found examples which are testing either one independent variable at a time:
example for Breusch-Pagan test – one independent variable:
https://datascienceplus.com/how-to-detect-heteroscedasticity-and-rectify-it/
example for Durbin Watson test - one independent variable:
http://math.furman.edu/~dcs/courses/math47/R/library/lmtest/html/dwtest.html
or the whole model with several independent variables at a time:
example for Durbin Watson test – multiple independent variables:
https://www.rdocumentation.org/packages/car/versions/2.1-6/topics/durbinWatsonTest
Here are the questions:
1. Can durbinWatsonTest() and bptest() be fed a whole multivariate model?
2. If the answer to 1 is yes, how can I determine which variable is causing the heteroscedasticity or autocorrelation in order to fix it, given that each of those tests gives only one p-value for the entire multivariate model?
3. If the answer to 1 is no, the tests should then be performed with one independent variable at a time. But homoscedasticity can only be tested AFTER a particular regression has been fitted. Hence the pattern of homo/heteroscedasticity in a univariate regression model lmMod_1=lm(depend_var~var1, data=df) will differ from the pattern in a multivariate regression model lmMod_2=lm(depend_var~var1+var2+var3+var4....,data=df).
Thank you very much in advance for your help!
Here is a first attempt at an answer.
To your first question: yes, you can use the Breusch-Pagan test and the Durbin-Watson test on models with several independent variables. (I have always used dwtest() from lmtest rather than durbinWatsonTest() from car, though.)
Also note that dwtest() checks only first-order autocorrelation. Unfortunately, I do not know how to find out which variable is causing the heteroscedasticity or autocorrelation. However, if you encounter these problems, one possible solution is to use robust standard errors: following Newey and West with coeftest(regression_model, vcov = NeweyWest) for autocorrelation, or coeftest(regression_model, vcov = vcovHC) for heteroscedasticity. coeftest() comes from lmtest, and NeweyWest() and vcovHC() come from sandwich; loading the AER package attaches both.
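Put together, the calls mentioned above might look like the sketch below. It assumes lmtest and sandwich are installed and uses the built-in mtcars data purely as a stand-in for df. The varformula argument of bptest() is one possible way to relate the error variance to a single regressor; treat that line as a suggestion to explore, not as an established answer to question 2.

```r
library(lmtest)    # bptest(), dwtest(), coeftest()
library(sandwich)  # NeweyWest(), vcovHC()

# mtcars is a built-in data set, used here only as a stand-in for df
lmMod <- lm(mpg ~ wt + hp + disp, data = mtcars)

# Breusch-Pagan test on the whole model: one p-value for all regressors
bp_all <- bptest(lmMod)

# Same test with the variance modelled on a single regressor (here wt)
bp_wt <- bptest(lmMod, varformula = ~ wt, data = mtcars)

# Durbin-Watson test for first-order autocorrelation
dw <- dwtest(lmMod)

# Coefficient tests with robust covariance estimators
robust_hc  <- coeftest(lmMod, vcov. = vcovHC(lmMod, type = "HC3"))  # heteroscedasticity
robust_hac <- coeftest(lmMod, vcov. = NeweyWest(lmMod))             # autocorrelation
```

Repeating the varformula call for each regressor in turn gives one p-value per variable, which can be compared with the whole-model result in bp_all.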

How to calculate marginal effects for plm models

I wish to obtain marginal effects for covariates that are in my plm models in first differences with interacted variables. For my lm and glm models I am using the margins package and its functions. However, this method does not seem to work with panel models.
What alternatives do I have?
Thank you.
EDIT:
In the absence of any viable solution, I would calculate the marginal effects by hand at specific values of the relevant covariates. Ideally I would like to do so in a faster way.
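The by-hand calculation can at least be wrapped in a small helper. The sketch below is only an illustration: marginal_effect() is a made-up function, the data are simulated, and lm() is used as a stand-in so that the example runs without extra packages. Because it only uses coef() and vcov(), which also have methods for plm objects, the same helper should in principle apply to a plm fit, provided the coefficient names match those in the plm output. For a model y ~ x1 * x2 the marginal effect of x1 at a given value of x2 is b1 + b3 * x2, with the standard error obtained from the coefficient covariance matrix.

```r
# Simulated data with an interaction (hypothetical)
set.seed(7)
n   <- 200
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- 1 + 0.5 * dat$x1 - 0.3 * dat$x2 + 0.8 * dat$x1 * dat$x2 + rnorm(n)

fit <- lm(y ~ x1 * x2, data = dat)  # stand-in; a plm fit would go here instead

# Marginal effect of `var` at chosen values `at` of the interacting variable `by`
marginal_effect <- function(model, var, by, at) {
  b <- coef(model)
  V <- vcov(model)
  # The interaction term may be named "var:by" or "by:var"
  int <- intersect(c(paste(var, by, sep = ":"), paste(by, var, sep = ":")), names(b))
  est <- b[[var]] + b[[int]] * at
  se  <- sqrt(V[var, var] + at^2 * V[int, int] + 2 * at * V[var, int])
  data.frame(at = at, effect = est, se = se,
             lower = est - 1.96 * se, upper = est + 1.96 * se)
}

me <- marginal_effect(fit, var = "x1", by = "x2", at = c(-1, 0, 1))
```

The intervals use a normal approximation (plus or minus 1.96 standard errors), and at x2 = 0 the effect reduces to the plain coefficient on x1.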

Can we get probabilities from a random forest the same way that we get them from logistic regression?

I have a data set with a binary 0-1 outcome (click & purchase; click & not-purchase) and a vector of attributes. I used logistic regression to get the probabilities of purchase. How can I use Random Forest to get the same probabilities? Is it by using Random Forest regression, or is it Random Forest classification with type='prob' in R, which gives the probability of each class?
It won't give you the same result, since the structures of the two methods are different. Logistic regression is defined by an explicit linear specification, whereas RF is a collective vote over many randomized trees. If the specification and input features are properly tuned for both, they can produce comparable results. Here are the major differences between the two:
RF tends to give a fit that is more robust to noise, outliers, overfitting and multicollinearity, which are common pitfalls in regression-type solutions. Basically, if you don't know or don't want to know much about what's going on with the input data, RF is a good start.
Logistic regression is a good choice if you know the data well and know how to specify the equation properly, or if you want to control how the fit/prediction works. The explicit form of the GLM specification allows you to do that.
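As for the mechanics, a minimal sketch of the classification route with type = "prob" follows. It assumes the randomForest package is installed. The data are simulated and the names clicks, price and visits are invented for the example.

```r
library(randomForest)  # assumed to be installed

# Simulated click data (hypothetical): purchase is a 0/1 factor
set.seed(123)
n      <- 500
clicks <- data.frame(price = runif(n, 5, 50), visits = rpois(n, 3))
p_true <- plogis(1 - 0.08 * clicks$price + 0.4 * clicks$visits)
clicks$purchase <- factor(rbinom(n, 1, p_true), levels = c(0, 1))

# Logistic regression: predicted probability of purchase
fit_glm <- glm(purchase ~ price + visits, family = binomial, data = clicks)
p_glm   <- predict(fit_glm, type = "response")

# Random forest classification: with a factor response, type = "prob"
# returns the share of tree votes per class; without newdata these are
# out-of-bag estimates for the training rows
fit_rf <- randomForest(purchase ~ price + visits, data = clicks, ntree = 500)
p_rf   <- predict(fit_rf, type = "prob")[, "1"]
```

The random forest values are vote fractions, so they behave like probabilities but are not guaranteed to match the logistic regression estimates; comparing p_glm with p_rf (for example with cor(p_glm, p_rf)) shows how close the two are on a given data set.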
