randomForest() machine learning in R

I am experimenting with the function randomForest() in R, and several articles I found all suggest using logic similar to the code below, where the response variable is column 30 and the predictors are everything else except column 30:
dat.rf <- randomForest(dat[, -30],
                       dat[, 30],
                       proximity = TRUE,
                       mtry = 3,
                       importance = TRUE,
                       do.trace = 100,
                       na.action = na.omit)
When I tried this, I got the following error messages:
Error in randomForest.default(dat[, -30], dat[, 30], proximity = TRUE, :
NA not permitted in predictors
In addition: Warning message:
In randomForest.default(dat[, -30], dat[, 30], proximity = TRUE, :
The response has five or fewer unique values. Are you sure you want to do regression?
However, I was able to get it to work when I listed the independent variables one by one while keeping all the other parameters the same.
dat.rf <- randomForest(as.factor(Y) ~ X1 + X2 + X3 + X4 + X5 + X6 + X7 + X8 + X9 + X10 + ......,
                       data = dat,
                       proximity = TRUE,
                       mtry = 3,
                       importance = TRUE,
                       do.trace = 100,
                       na.action = na.omit)
Could someone help me debug the simpler command, where I don't have to list each predictor one by one?

The error message gives you a clue to two problems:
First, you need to remove any row that has an NA anywhere. Removing NAs should be easy enough, and I'll leave that to you as an exercise.
Second, it looks like you need to do classification (which predicts a response taking one of a few discrete levels) rather than regression (which predicts a continuous response). If the response is numeric rather than a factor, randomForest() automatically applies regression.
So, how do you force randomForest() to use classification? As you noticed in your first try, randomForest() accepts the predictors and the response as separate arguments instead of a formula. To force classification, make sure that the value you are trying to predict (the response, dat[,30]) is a factor. Remember to explicitly name the x and y arguments. This is easy to do:
randomForest(x = dat[, -30],
             y = factor(dat[, 30]),
             ...)
This way your output can only take one of the levels given in y.
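Putting the two fixes together, here is a minimal sketch (assuming dat is your data frame with the response in column 30):
library(randomForest)
dat2 <- na.omit(dat)                         # drop every row that contains an NA
dat.rf <- randomForest(x = dat2[, -30],
                       y = factor(dat2[, 30]),  # a factor response forces classification
                       proximity = TRUE,
                       mtry = 3,
                       importance = TRUE,
                       do.trace = 100)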
This is all buried in the description of the arguments x and y: see ?randomForest.

Related

lmer failing with na.pass

When I run a lmer model with lme4 using na.pass as the na.action, I get the following error:
R: NA/NaN/Inf in foreign function call (arg 1)
I run the model like this:
model1 <- lme4::lmer(agg_dv_singing ~ GMS.Musical.Training +
JAJ.ability + MDT.ability + MPT.ability + PDCT.ability +
PIAT.ability + agg_dv_long_note + demographics.age +
aggiv_entropy + aggiv_interval_complexity +
aggiv_rhythmic_complexity + aggiv_tonal_complexity +
log.freq + length + (1|p_id),
data = dat, na.action = na.pass)
summary(dat) indicates that there are no Inf or NaN values, although yes, there are many NA values.
Running na.pass outside of lmer on the same data set does not give an error:
na.pass(dat)
So what could be going wrong within lmer?
Comments on a previous question of yours attempted to explain that, in general, mixed-model machinery cannot estimate from cases with missing values in the predictors; it just doesn't work that way. If you want to fit mixed models with missing data, you need to do some form of imputation, i.e. filling in values for the missing predictors (e.g. see the mice package, which is more or less the state of the art, at least as far as the R ecosystem is concerned; a sketch appears at the end of this answer). Here is what the four standard na.* actions do in the context of mixed models:
na.fail(): fail immediately if there are missing values in the data (predictors or response). This is frustrating, but alerts you immediately to the fact that you have missing data, and lets you decide what to do about it.
na.omit(): drop non-complete cases from the data before fitting.
na.exclude(): like na.omit(), but keep track of the locations of the excluded cases. When using predict() or residuals() (or any function that produces results per observation), reconstitute a complete data set with NA values for the non-complete cases in the original data set. (I usually find this setting to be the most useful default; see the short sketch below this list.)
na.pass(): do not remove NA values, but attempt to continue with the fitting procedure. As you found out, this usually doesn't work at all! It will just pass the NA values down through the code until something goes wrong. Typically one of two things happens at this point:
if the entire estimation procedure is written using R functions that can handle and propagate missing values, then you'll usually get a fitted model object with NA/NaN for all coefficients, likelihoods, etc. etc. (because the missing values contaminate the entire fitting procedure);
if some step of the estimation procedure can't handle NA/NaN values (as in this case), you get an inscrutable error from the first point in the procedure that fails.
If you look at the source code of na.pass() (by typing na.pass at the R prompt), you'll see that in fact all it does is return the same object, unchanged. To be honest, I'm not really sure why na.pass even exists, except for completeness ... (or compatibility with S)
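A quick illustration of the na.omit()/na.exclude() difference, using the built-in airquality data (Ozone is missing in 37 of 153 rows):
fit_excl <- lm(Ozone ~ Wind, data = airquality, na.action = na.exclude)
fit_omit <- lm(Ozone ~ Wind, data = airquality, na.action = na.omit)
length(residuals(fit_excl))  # 153: NAs are padded back in for the incomplete cases
length(residuals(fit_omit))  # 116: incomplete cases are silently dropped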
Your NA value was not in a variable used in a random-effects term: if it had been, you would have gotten a more interpretable error message:
library(lme4)
ss <- sleepstudy
ss[1,"Days"] <- NA
lmer(Reaction ~ Days + (Days|Subject), ss, na.action=na.pass)
Error in lme4::lFormula(formula = Reaction ~ Days + (Days | Subject), :
NA in Z (random-effects model matrix): please use "na.action='na.omit'" or "na.action='na.exclude'"
If I fit a model with (1|Subject) instead, so that the NA value only affects the fixed effects,
lmer(Reaction ~ Days + (1|Subject), ss, na.action=na.pass)
then we get your error message:
Error in qr.default(X, tol = tol, LAPACK = FALSE) :
NA/NaN/Inf in foreign function call (arg 1)
traceback() tells me that this happens in the internal chkRank.drop.cols() function, where R is trying to figure out if any of your fixed-effect columns are collinear. There should probably be a check for missing values there ...
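If you do want to use the incomplete cases, multiple imputation is the standard route. A minimal sketch with mice, shown on its built-in nhanes example data (your own model and data would go in the with() call):
library(mice)
imp  <- mice(nhanes, m = 5, seed = 1, printFlag = FALSE)  # create 5 imputed data sets
fits <- with(imp, lm(bmi ~ age + chl))                    # refit the model in each one
summary(pool(fits))                                       # pool estimates via Rubin's rules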

R Cross Validation lm predict function [duplicate]

I am trying to convert Absorbance (Abs) values to Concentration (ng/mL), based on an established linear model & standard curve. I planned to do this by using the predict() function. I am having trouble getting predict() to return the desired results. Here is a sample of my code:
Standards<-data.frame(ng_mL=c(0,0.4,1,4),
Abs550nm=c(1.7535,1.5896,1.4285,0.9362))
LM.2<-lm(log(Standards[['Abs550nm']])~Standards[['ng_mL']])
Abs<-c(1.7812,1.7309,1.3537,1.6757,1.7409,1.7875,1.7533,1.8169,1.753,1.6721,1.7036,1.6707,
0.3903,0.3362,0.2886,0.281,0.3596,0.4122,0.218,0.2331,1.3292,1.2734)
predict(object=LM.2,
newdata=data.frame(Concentration=Abs[1]))#using Abs[1] as an example, but I eventually want predictions for all values in Abs
Running those last lines gives this output:
> predict(object=LM.2,
+ newdata=data.frame(Concentration=Abs[1]))
1 2 3 4
0.5338437 0.4731341 0.3820697 -0.0732525
Warning message:
'newdata' had 1 row but variables found have 4 rows
This does not seem to be the output I want. I am trying to get a single predicted Concentration value for each Absorbance (Abs) entry. It would be nice to be able to predict all of the entries at once and add them to an existing data frame, but I can't even get it to give me a single value correctly. I've read many threads on here, webpages found on Google, and all of the help files, and for the life of me I cannot understand what is going on with this function. Any help would be appreciated, thanks.
You must have a variable in newdata that has the same name as that used in the model formula used to fit the model initially.
You have two errors:
You don't use a variable in newdata with the same name as the covariate used to fit the model, and
You make the problem much more difficult to resolve because you abuse the formula interface.
Don't fit your model like this:
mod <- lm(log(Standards[['Abs550nm']])~Standards[['ng_mL']])
fit your model like this
mod <- lm(log(Abs550nm) ~ ng_mL, data = Standards)
Isn't that so much more readable?
To predict you would need a data frame with a variable ng_mL:
predict(mod, newdata = data.frame(ng_mL = c(0.5, 1.2)))
Now you may have a third error. You appear to be trying to predict with new values of Absorbance, but the way you fitted the model, Absorbance is the response variable. You would need to supply new values for ng_mL.
The behaviour you are seeing is what happens when R can't find a correctly-named variable in newdata; it returns the fitted values from the model, i.e. the predictions at the observed data.
This makes me think you have the formula back to front. Did you mean:
mod2 <- lm(ng_mL ~ log(Abs550nm), data = Standards)
?? In which case, you'd need
predict(mod2, newdata = data.frame(Abs550nm = c(1.7812,1.7309)))
say. Note you don't need to include the log() bit in the name. R recognises that as a function and applies it to the variable Abs550nm for you.
If the model really is log(Abs550nm) ~ ng_mL and you want to find values of ng_mL for new values of Abs550nm you'll need to invert the fitted model in some way.
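Since the fitted model is log(Abs550nm) = a + b * ng_mL, the inversion is just algebra: ng_mL = (log(Abs550nm) - a) / b. A minimal sketch using the objects from the question (ignoring retransformation subtleties):
cf   <- coef(LM.2)                   # intercept a and slope b from the standard curve
conc <- (log(Abs) - cf[1]) / cf[2]   # a predicted ng_mL value for every entry in Abs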

TukeyHSD or glht in R, ANCOVA

I'm wondering if I can use the function TukeyHSD() to perform all pairwise comparisons for an aov() model with one factor (e.g., GROUP) and one continuous covariate (e.g., AGE). For example, I did:
library(multcomp)
data('litter', package = 'multcomp')
litter.aov <- aov(weight ~ gesttime + dose, data = litter)
TukeyHSD(litter.aov, which = 'dose')
and I get a warning message like this:
Warning message:
In replications(paste("~", xx), data = mf): non-factor ignored: gesttime
Is the process above correct? What does the warning message mean? And does TukeyHSD() apply to badly unbalanced designs?
In addition, is there any difference between the processes above and below?
litter.mc <- glht(litter.aov, linfct = mcp(dose = 'Tukey'))
summary(litter.mc)
Best, Sue
There's no difference. TukeyHSD() is just a bit more eager to tell you about potential problems. Notice that it's a warning message, not an error, meaning that the results might not be what you expect, but they'll still be returned so you can judge for yourself.
As for what it means, it means what it says: non-factor variables are ignored. Remember that you are comparing differences between groups, and grouping is done using factors, so factors are all TukeyHSD() cares about. In your case you explicitly tell the function to only care about dose, which is a factor, so the warning might be seen as overly cautious.
One way of avoiding the warning is to convert gesttime into a factor, and since it consists of only four levels it makes some sense to do so.
data('litter', package = 'multcomp')
litter$gesttime <- as.factor(litter$gesttime)
litter.aov <- aov(weight ~ gesttime + dose, data = litter)
TukeyHSD(litter.aov, which = 'dose')
I know this is an old thread but I'm not sure the existing answers are quite right...
I've been trying both functions with my own data and have a similar situation to Sue, where TukeyHSD gives a warning message about ignoring non-factor covariates, while glht() does not.
Contrary to the other answer, they do not appear to be doing the same thing. The results are different, and it appears that TukeyHSD() is not marginalizing over the non-factor covariates (as the warning states). glht(), by contrast, appears to correctly use the mean value of the non-factor covariates to compute the marginal means of the groups of interest, since its point estimates match those obtained from lsmeans().
So TukeyHSD() does not seem overly cautious; it simply can't handle non-factor covariates, while glht() can. To me, glht() is the correct function to use in this case.
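A hedged way to check this on the litter example, comparing both against covariate-adjusted marginal means (emmeans is shown here as the successor to lsmeans):
library(multcomp)
library(emmeans)
data('litter', package = 'multcomp')
litter.aov <- aov(weight ~ gesttime + dose, data = litter)
summary(glht(litter.aov, linfct = mcp(dose = 'Tukey')))  # adjusts for gesttime
emmeans(litter.aov, pairwise ~ dose)                     # covariate-adjusted marginal means
TukeyHSD(litter.aov, which = 'dose')                     # ignores gesttime
If the glht() and emmeans() point estimates agree while TukeyHSD()'s differ, that supports this answer.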

glmmLasso error and warning

I am trying to perform variable selection in a generalized linear mixed model using glmmLasso, but I am getting an error and a warning that I cannot resolve. The dataset is unbalanced, with some participants (PTNO) having more samples than others; there is no missing data. My dependent variable is binary; all other variables (besides the ID variable PTNO) are continuous.
I suspect something very generic is happening, but I obviously fail to see it, and I have not found any solution in the documentation or on the web.
The code, which is basically just adapted from the glmmLasso soccer example is:
glm8 <- glmmLasso(Group ~ NDUFV2_dCTABL + GPER1_dCTABL + ESR1_dCTABL + ESR2_dCTABL + KLF12_dCTABL + SP4_dCTABL + SP1_dCTABL + PGAM1_dCTABL + ANK3_dCTABL + RASGRP1_dCTABL + AKT1_dCTABL + NUDT1_dCTABL + POLG_dCTABL + ADARB1_dCTABL + OGG_dCTABL + PDE4B_dCTABL + GSK3B_dCTABL + APOE_dCTABL + MAPK6_dCTABL,
                  rnd = list(PTNO = ~1),
                  family = poisson(link = log),
                  data = stackdata,
                  lambda = 100,
                  control = list(print.iter = TRUE, start = c(1, rep(0, 29)), q.start = 0.7))
The error message is displayed below. Specifically, I do not believe there are any NAs in the dataset, and I am unsure about the meaning of the warning regarding the factor variable.
Iteration 1
Error in grad.lasso[b.is.0] <- score.beta[b.is.0] - lambda.b * sign(score.beta[b.is.0]) :
NAs are not allowed in subscripted assignments
In addition: Warning message:
In Ops.factor(y, Mu) : ‘-’ not meaningful for factors
An abbreviated dataset containing the necessary variables is available in R format and can be downloaded here.
I hope I can be guided a bit as to how to go on with the analysis. Please let me know if there is anything wrong with the dataset or you cannot download it. ANY help is much appreciated.
Just to follow up on @Kristofersen's comment above: it is indeed the start vector that messes up your analysis.
If I run
glm8 <- glmmLasso(Group~NDUFV2_dCTABL+GPER1_dCTABL+ ESR1_dCTABL+ESR2_dCTABL+KLF12_dCTABL+SP4_dCTABL+SP1_dCTABL+ PGAM1_dCTABL+ANK3_dCTABL+RASGRP1_dCTABL+AKT1_dCTABL+NUDT1_dCTABL+ POLG_dCTABL+ ADARB1_dCTABL+OGG_dCTABL+ PDE4B_dCTABL+ GSK3B_dCTABL+ APOE_dCTABL+ MAPK6_dCTABL,
rnd = list(PTNO=~1),
family = binomial(),
data = stackdata,
lambda=100,
control = list(print.iter=TRUE))
then everything is fine and dandy (i.e., it converges and produces a solution). You copied the example, which uses Poisson regression, so you need to tweak the code to your situation. I have no idea whether the output makes sense.
Quick note: I used the binomial distribution in the code above since your outcome is binary. If it makes sense to estimate relative risks then Poisson may be reasonable (and it also converges), but you need to recode your outcome, since the two groups are coded as 1 and 2, and that will certainly mess up the Poisson regression.
In other words do a
stackdata$Group <- stackdata$Group-1
before you run the analysis.
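A quick sanity check after recoding (illustrative; Group is the outcome variable from the question):
table(stackdata$Group)  # should now show counts for 0 and 1, not 1 and 2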

Model runs with glm but not bigglm

I was trying to run a logistic regression on 320,000 rows of data (6 variables). Stepwise model selection on a sample of the data (10,000 rows) gave a rather complex model with 5 interaction terms: Y ~ X1 + X2*X3 + X2*X4 + X2*X5 + X3*X6 + X4*X5. The glm() function could fit this model with 10,000 rows of data, but not with the whole dataset (320,000).
Using bigglm to read data chunk by chunk from a SQL server resulted in an error, and I couldn't make sense of the results from traceback():
fit <- bigglm(Y~X1+ X2*X3+ X2*X4+ X2*X5+ X3*X6+ X4*X5,
data=sqlQuery(myconn,train_dat),family=binomial(link="logit"),
chunksize=1000, maxit=10)
Error in coef.bigqr(object$qr) :
NA/NaN/Inf in foreign function call (arg 3)
> traceback()
11: .Fortran("regcf", as.integer(p), as.integer(p * p/2), bigQR$D,
bigQR$rbar, bigQR$thetab, bigQR$tol, beta = numeric(p), nreq = as.integer(nvar),
ier = integer(1), DUP = FALSE)
10: coef.bigqr(object$qr)
9: coef(object$qr)
8: coef.biglm(iwlm)
7: coef(iwlm)
6: bigglm.function(formula = formula, data = datafun, ...)
5: bigglm(formula = formula, data = datafun, ...)
4: bigglm(formula = formula, data = datafun, ...)
bigglm was able to fit a smaller model with fewer interaction terms, but it was not able to fit the same model even with the small dataset (10,000 rows).
Has anyone run into this problem before? Is there any other approach to running a complex logistic model with big data?
I've run into this problem many times, and it was always caused by the fact that the chunks processed by bigglm did not contain all the levels of a categorical (factor) variable.
bigglm crunches data in chunks, and the default chunk size is 5000. If your categorical variable has, say, 5 levels (a, b, c, d, e) and your first chunk (rows 1:5000) contains only (a, b, c, d) but no "e", you will get this error.
What you can do is increase the "chunksize" argument and/or cleverly reorder your data frame so that each chunk contains ALL the levels.
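A hedged sketch of both workarounds (f stands in for your factor variable and the level set is illustrative):
dat$f <- factor(dat$f, levels = c("a", "b", "c", "d", "e"))  # make the full level set explicit
set.seed(1)
dat <- dat[sample(nrow(dat)), ]  # shuffle rows so each chunk likely contains every level
# and/or increase the chunk size, e.g. bigglm(..., chunksize = 20000)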
Hope this helps (at least somebody).
OK, so we were able to find the cause of this problem:
For one category in one of the interaction terms, there were no observations. glm() was able to run and reported NA as the estimated coefficient, but bigglm doesn't like that. bigglm was able to run the model once I dropped this interaction term.
I'll do more research on how to deal with this kind of situation.
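A hedged way to spot such empty cells before fitting (X2 and X3 stand in for any pair of interacting variables from the model):
with(dat, table(X2, X3))  # a zero in any cell means that combination has no observations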
I met this error before and thought it came from randomForest instead of biglm. The reason could be that the function cannot handle character variables, so you need to convert characters to factors. Hope this can help you.
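If that is the issue, this one-liner (illustrative) converts every character column of a data frame to a factor:
dat[] <- lapply(dat, function(x) if (is.character(x)) factor(x) else x)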
