How to make GAM work with binary variables?

I am running this piece of code:
model.runs <- BIOMOD_Modeling(run.data,
                              models = models,
                              NbRunEval = runs.CV,
                              DataSplit = 80,
                              VarImport = 20,
                              SaveObj = TRUE,
                              Yweights = NULL,
                              rescal.all.models = FALSE,
                              do.full.models = FALSE,
                              models.options = BIOMOD_ModelingOptions(
                                GAM = list(k = 3),
                                MAXENT.Phillips = list(path_to_maxent.jar = "F:/xxx/xxx/")))
save(model.runs, file = "model.runs")
and I got this error:
Error in smooth.construct.tp.smooth.spec(object, data, knots) :
A term has fewer unique covariate combinations than specified maximum
degrees of freedom
After some research, I understood that GAM did not like my binary variables, so I took them out and it worked fine.
Therefore, my question is simple: I would like to keep all of my environmental variables, including the binary ones; is this doable in some way?
Sorry if it is trivial, but I do not use GAM often and I did not find the answer anywhere else.
Cheers.

You can’t smooth binary or categorical variables, only continuous ones.
You can create an interaction between a smooth and a categorical variable (the by argument in mgcv), and you can use random-effect “smooths” for categorical variables, but you can’t just smooth binary or categorical variables. You would need to arrange for biomod to include those variables as linear factor terms. If you code them as 0/1, then R, biomod, and mgcv will think those variables are numeric. Make sure they are coerced to factors and then retry.
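A minimal sketch of the idea in plain mgcv (all data and variable names here are made up for illustration; with biomod you would coerce the columns of your explanatory data frame before formatting the data):

library(mgcv)
set.seed(1)
d <- data.frame(temp   = runif(200, 0, 30),
                precip = runif(200, 0, 2000),
                bin1   = factor(rbinom(200, 1, 0.5)))  # 0/1 variable stored as a factor
d$presence <- rbinom(200, 1, plogis(0.1 * d$temp - 0.001 * d$precip + (d$bin1 == "1")))

# Smooth only the continuous covariates; the factor enters as a linear term,
# so smooth.construct never sees it and the "fewer unique covariate
# combinations" error cannot occur.
fit <- gam(presence ~ s(temp, k = 3) + s(precip, k = 3) + bin1,
           family = binomial, data = d)
summary(fit)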

Related

How do you fix incorrect predicted values ("pred") being used in the plot_model function in R?

I am trying to fix a plot that I generated that has the incorrect predicted values along the y-axis. The predicted values should be "Score," my outcome variable. For some reason, the "id" variable is along the y-axis instead. Everything else is plotted correctly. I checked my model, and I can't see where the issue is coming from. I will post my regression model syntax and plot syntax below. The model is a multivariate regression with two outcomes; "scale" labels each of those two outcomes, whose values are held in the Score variable. Both predictors are two-level categorical variables, and there is also an interaction between them.
If anyone has any ideas, I would greatly appreciate it!
multireg4 <- gls(Score ~ 0 + scale + scale:DrinkStat + scale:ACEHx + scale:DrinkStat:ACEHx,
                 data = R34_Long,
                 na.action = na.omit,
                 weights = varIdent(form = ~ 1 | scale),
                 correlation = corSymm(form = ~ as.numeric(scale) | id))

plot_model(multireg4, type = "pred", terms = c("DrinkStat", "ACEHx", "scale")) +
  theme_sjplot2()
(Image of the plot omitted.)
I've tried adding limits to the scale variable in the plot_model call, but the issue is that the wrong predicted values are being pulled. The prediction seems to happen automatically, so I am not exactly sure where to edit the syntax (in the regression model or in the plot call) to get R to use the correct predicted values.

How to run Beta Regression with lots of independent variables?

Why is it that beta regression, whose response is bounded between 0 and 1, is unable to handle lots of independent variables as regressors? I have around 30 independent variables that I am trying to fit, and it shows an error like:
Error in optim(par = start, fn = loglikfun, gr = gradfun, method =
method, : non-finite value supplied by optim
It accepts only a few variables. If I combine all these independent variables as X <- (df$x1 + … + df$x30), make the dependent variable Y <- df$y, and then run the beta regression, it works, but I won't get coefficients for the individual independent variables, which is what I want.
betareg(Y ~ X, data = df)
So, what’s the solution?
The model probably did not converge because of multicollinearity. In most cases, regression models cannot be estimated properly when lots of variables are considered. You can overcome this problem with an appropriate variable-selection procedure based on information criteria.
The gamlss package in R can help here: fit the model with gamlss(..., family = BE), and use the stepGAIC() function to do the selection during modeling.
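A minimal sketch of that suggestion (the data here are simulated for illustration; in your case df would hold y and x1 … x30):

library(gamlss)
set.seed(1)
df <- data.frame(x1 = rnorm(200), x2 = rnorm(200), x3 = rnorm(200))
df$y <- plogis(0.8 * df$x1 - 0.5 * df$x2 + rnorm(200, sd = 0.5))  # response strictly inside (0, 1)

full <- gamlss(y ~ x1 + x2 + x3, family = BE, data = df)  # beta-distributed response
sel  <- stepGAIC(full, direction = "backward")            # GAIC/AIC-based selection
summary(sel)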

k-nearest-neighbors in caret with categorical data

I want to predict a binary variable from some other variables, some of which are categorical. I've set up the code and everything seemed to work fine immediately; the predictions were quite similar to those of a logistic regression and a random forest. This is my code (I don't think there is anything wrong with it):
knn.Fit <- train(Y ~ .,
                 data = Data,
                 method = "knn",
                 trControl = trainControl(method = "repeatedcv",
                                          repeats = 5,
                                          number = 5),
                 tuneLength = 20)
Now my question is: how is this done with categorical variables? For example, if I have a categorical variable with values a, b and c, does the function create three (or two?) dummy variables in the background and calculate the distance with them? And are the numeric variables standardized automatically? Otherwise the dummy variables would barely count whenever one or more numeric variables have much bigger standard deviations. I thought I would have to do quite a lot of data preparation before running the algorithm...
EDIT:
I've seen that I can standardize with the preProcess argument:
preProcess = c("center", "scale")
Indeed, my numeric variables didn't have a big SD.
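For reference, a sketch of the same call with standardization added (reusing the placeholder Data and Y from above):

knn.Fit <- train(Y ~ .,
                 data = Data,
                 method = "knn",
                 preProcess = c("center", "scale"),  # standardize predictors before computing distances
                 trControl = trainControl(method = "repeatedcv",
                                          repeats = 5,
                                          number = 5),
                 tuneLength = 20)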
For categorical variables, k-nearest neighbors won't work, since there is no natural distance between categories. Try k-modes (the related k-medoids idea is described at https://en.wikipedia.org/wiki/K-medoids), which is available via the kmodes() function in the klaR package. The two variable types typically cannot be combined in one distance, however.
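A minimal sketch of k-modes on purely categorical data (all names here are made up for illustration):

library(klaR)
set.seed(42)
cat_data <- data.frame(colour = sample(c("red", "green", "blue"), 100, replace = TRUE),
                       shape  = sample(c("round", "square"), 100, replace = TRUE))
fit <- kmodes(cat_data, modes = 2)  # 2 clusters under simple-matching dissimilarity
fit$cluster                         # cluster assignment for each row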

Conduct quantile regression with several dependent variables in R

I'm interested in doing a multivariate regression in R, looking at the effects of a grouping variable (2 levels) on several dependent variables. However, because my data are non-normal and the 2 groups do not have homogeneous variances, I'm looking to use a quantile regression instead. I'm using the rq function from the quantreg package to do this.
My code is as follows:

# Generate some fake data
DV <- matrix(rnorm(40 * 5), ncol = 5)  # matrix of dependent variables
IV <- matrix(rep(1:2, 20))             # matrix for the grouping factor

library(quantreg)
model.q <- rq(DV ~ IV, tau = 0.5)
I get the following error message when this is run:
Error in y - x %*% z$coef : non-conformable arrays
In addition: Warning message:
In rq.fit.br(x, y, tau = tau, ...) : Solution may be nonunique
I believe this is due to my having several DVs, as the model works fine when I use a single-column DV. Is there a specific way I should be formatting my data? Or perhaps there is another function I could use?
Thank you!
If you just want to run several regressions, each with the same set of independent variables, but with a different dependent variable, you could write a function and then apply it to all columns of your DV matrix and save the models in a list:
reg <- function(col_number) {
  rq(DV[, col_number] ~ IV, tau = 0.5)
}
model_list <- lapply(1:ncol(DV), reg)
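To pull out, say, the coefficients of each fit afterwards (a small usage note, nothing model-specific):

coef_list <- lapply(model_list, coef)  # one coefficient vector per DV column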
However, as pointed out in the comments, it may be that you want a multivariate model accounting for the correlation between the outcomes, but then I do not think the rq method would be appropriate.
If you have multiple responses, what you most likely need is:
DV <- matrix(rnorm(40 * 5), ncol = 5)  # matrix of dependent variables
IV <- matrix(rep(1:2, 20))             # matrix for the grouping factor

library(quantreg)
rqs.fit(x = IV, y = DV, tau = 0.5, tol = 0.0001)
Unfortunately, there's really not a lot of documentation about how this works. I can update if I find it.

set random forest to classification

I am attempting a random forest on some data where the class variable is binary (either 1 or 0). Here is the code I'm running:
forest.model <- randomForest(x = ticdata2000[, 1:85],
                             y = ticdata2000[, 86],
                             ntree = 500,
                             mtry = 9,
                             importance = TRUE,
                             norm.votes = TRUE,
                             na.action = na.roughfix,
                             replace = FALSE)
But when the forest gets to the end, I get the following warning:
Warning message:
In randomForest.default(x = ticdata2000[, 1:85], y = ticdata2000[, :
The response has five or fewer unique values. Are you sure you want to do regression?
The answer, of course, is no. I don't want to do regression. I have a single, discrete variable that only takes on 2 classes. And indeed, when I run predictions with this model, I get continuous numbers when I want a list of zeroes and ones. Can someone tell me what I'm doing wrong that makes this run as regression rather than classification?
Change your response column to a factor using as.factor (or just factor). Since you've stored that variable as numeric 0's and 1's, R rightly interprets it as a numeric variable. If you want R to treat it differently, you have to tell it so.
This is mentioned in the documentation under the y argument:
A response vector. If a factor, classification is assumed, otherwise
regression is assumed. If omitted, randomForest will run in
unsupervised mode.
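A minimal sketch of the fix (assuming ticdata2000 from the question, with the response in column 86):

library(randomForest)
ticdata2000[, 86] <- as.factor(ticdata2000[, 86])  # 0/1 response coerced to a factor

forest.model <- randomForest(x = ticdata2000[, 1:85],
                             y = ticdata2000[, 86],  # factor response => classification
                             ntree = 500, mtry = 9, importance = TRUE)

predict(forest.model)  # out-of-bag predictions, now the factor levels "0"/"1"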
