I would like to use random forests to impute missing values. I have read papers claiming that MICE with random forests performs better than parametric MICE.
In my case, I have already run a model with the default mice method, got the results, and worked with them. However, when I chose the random forest method, I got an error and I'm not sure why. I've seen some questions about errors with random forest and mice, but those don't match my case. My variables have more than a single NA.
imp <- mice(data1, m=70, pred=quickpred(data1), method="pmm", seed=71152, printFlag=TRUE)
impRF <- mice(data1, m=70, pred=quickpred(data1), method="rf", seed=71152, printFlag=TRUE)
iter imp variable
1 1 Vac
Error in if (n == 0) stop("data (x) has 0 rows") : argument is of length zero
Does anyone have any idea why I'm getting this error?
EDIT
I tried changing all variables to numeric instead of using dummy variables, and it returned the same error plus some warnings:
impRF <- mice(data, m=70, pred=quickpred(data), method="rf", seed=71152, printFlag=TRUE)
iter imp variable
1 1 Vac CliForm
Error in if (n == 0) stop("data (x) has 0 rows") : argument is of length zero
In addition: There were 50 or more warnings (use warnings() to see the first 50)
50: In randomForest.default(x = xobs, y = yobs, ntree = 1, ... :
The response has five or fewer unique values. Are you sure you want to do regression?
EDIT1
I've also tried with only 5 imputations and a smaller subset of the data, only 2000 rows, and got a few different errors:
> imp <- mice(data2, m=5, pred=quickpred(data2), method="rf", seed=71152, printFlag=TRUE)
iter imp variable
1 1 Vac Radio Origin Job Alc Smk Drugs Prison Commu Hmless Symp
Error in randomForest.default(x = xobs, y = yobs, ntree = 1, ...) :
  NAs in foreign function call (arg 11)
In addition: Warning messages:
1: In randomForest.default(x = xobs, y = yobs, ntree = 1, ...) : invalid mtry: reset to within valid range
2: In max(ncat) : no non-missing arguments to max; returning -Inf
3: In randomForest.default(x = xobs, y = yobs, ntree = 1, ...) : NAs introduced by coercion
I also encountered this error when I had only one fully observed variable, which I'm guessing is the cause in your case too. My colleague Anoop Shah provided me with a fix (below), and Prof. van Buuren (mice's author) has said he will include it in the next update of the package.
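As far as I can tell, the mechanism is R's default dimension dropping: with a single fully observed predictor, x[ry, ] collapses a one-column data frame to a bare vector. nrow() of a vector is NULL, so the check if (n == 0) inside randomForest fails with exactly the "argument is of length zero" message above, which is why the corrected function below wraps the subsets in as.matrix(). A short illustration:
x  <- data.frame(a = 1:10)     # a single predictor column
ry <- rep(c(TRUE, FALSE), 5)   # indicator of observed responses, as in mice
x[ry, ]                        # drops to a bare vector; nrow() is NULL
as.matrix(x[ry, ])             # a 5 x 1 matrix, which randomForest accepts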
In R, type the following to enable you to redefine the rf impute function.
fixInNamespace("mice.impute.rf", "mice")
The corrected function to paste in is then:
mice.impute.rf <- function (y, ry, x, ntree = 100, ...) {
  ntree <- max(1, ntree)
  # as.matrix() keeps a single predictor column from being dropped to a bare vector
  xobs <- as.matrix(x[ry, ])
  xmis <- as.matrix(x[!ry, ])
  yobs <- y[ry]
  onetree <- function(xobs, xmis, yobs, ...) {
    # fit a single tree and find the terminal node of each observed and missing case
    fit <- randomForest(x = xobs, y = yobs, ntree = 1, ...)
    leafnr <- predict(object = fit, newdata = xobs, nodes = TRUE)
    nodes <- predict(object = fit, newdata = xmis, nodes = TRUE)
    # donors for a missing case: observed y values that share its terminal node
    donor <- lapply(nodes, function(s) yobs[leafnr == s])
    return(donor)
  }
  forest <- sapply(1:ntree, FUN = function(s) onetree(xobs, xmis, yobs, ...))
  # impute each missing value by drawing one donor at random across all trees
  impute <- apply(forest, MARGIN = 1, FUN = function(s) sample(unlist(s), 1))
  return(impute)
}
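If you prefer not to edit the function interactively, here is a sketch of applying the same patch with assignInNamespace(), assuming the corrected mice.impute.rf above has been defined in your session:
# Non-interactive alternative to fixInNamespace(): overwrite the package's
# copy of mice.impute.rf with the corrected version defined above
library(mice)
library(randomForest)
assignInNamespace("mice.impute.rf", mice.impute.rf, ns = "mice")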
I'm trying to fit a differential equation by least squares (FME package).
However, I keep getting an error that I don't know how to tackle.
A reproducible example:
library(deSolve)  # ode()
library(FME)      # modCost(), modFit()

times <- seq(0, 4, by = 0.5)
dat <- data.frame(time = seq(1, 4),
                  Tick = c(128, 52.5, 28, 121))
N <- 10

# the ODE: a function of time, state and parameters
tick.model <- function(time, y, params, ...) {
  with(as.list(c(y, params)), {
    dTick <- (30 - s.t * Tick) * Tick * 0.3 * N - delta.t * Tick
    return(list(c(dTick)))
  })
}

y <- c(Tick = 82.375)

cost1 <- function(p) {
  out <- ode(y, times, tick.model, p)
  modCost(out, dat, weight = "none")
}

params <- c(s.t = 0.1, delta.t = 1)
fit <- modFit(f = cost1, p = params, lower = rep(0, 2),
              upper = c(10, 5))
summary(fit)
The result comes out like this:
Parameters:
Estimate Std. Error t value Pr(>|t|)
s.t 0.3641876 NA NA NA
delta.t 0.0001417 NA NA NA
Residual standard error: 60.92 on 2 degrees of freedom
Error in cov2cor(x$cov.unscaled) : 'V' is not a square numeric matrix
In addition: Warning message:
In summary.modFit(fit) : Cannot estimate covariance; system is singular
Also, the fitted model doesn't look right (plot omitted).
I have no idea what I could have done wrong.
I am getting the following error when I run a linear mixed model.
Error in validObject(.Object) :
invalid class “corMatrix” object: 'sd' slot has non-finite entries
In addition: Warning message:
In vcov.merMod(object, use.hessian = use.hessian) :
Computed variance-covariance matrix problem: not a positive definite matrix;
returning NA matrix
I have created a loop that runs through the columns of my data frame and fits an LMM for each column. I have run the same process on 2 other data frames and had no errors. The other data frames have similar properties (i.e. they contain 0s etc.). I have also tried excluding rows that contain zero values, and this has not fixed the error.
Can anyone explain what this error means?
My loop (I am running LMMs with a likelihood ratio test):
library(lme4)

uni_HPOS_total1_A <- data.frame(metab = character(), beta = double(),
                                pvalue = double(), stringsAsFactors = FALSE)

for (i in 1:length(colnames(HPOS_XCMS_metab_0_new))) {
  # null model: random intercept for the technical covariate only
  model0 <- lmer(HPOS_XCMS_metab_0_new[, i] ~ (1 | `technical covariate`),
                 REML = FALSE, data = HPOS_XCMS_metab_0_covars)
  # full model: adds the fixed effect of the outcome
  model1 <- lmer(HPOS_XCMS_metab_0_new[, i] ~ (1 | `technical covariate`) + outcome,
                 REML = FALSE, data = HPOS_XCMS_metab_0_covars)
  beta <- summary(model1)$coefficients["outcome", 1]
  res <- anova(model0, model1)  # likelihood ratio test
  pval <- res$`Pr(>Chisq)`[2]
  uni_HPOS_total1_A[nrow(uni_HPOS_total1_A) + 1, ] <-
    list(colnames(HPOS_XCMS_metab_0)[i], beta, pval)
}
Thanks
I have a training_predictors set with 56 columns, all of which are numeric. training_labels is a factor vector of 0s and 1s.
I am using the following list of subset sizes to be tested:
subset_sizes <- c(1:5, 10, 15, 20, 25)
The following are my modified rfFuncs functions:
rfRFE <- list(summary = defaultSummary,
              fit = function(x, y, first, last, ...) {
                library(randomForest)
                randomForest(x, y, importance = first, ...)
              },
              pred = function(object, x) predict(object, x),
              rank = function(object, x, y) {
                vimp <- varImp(object)
                vimp <- vimp[order(vimp$Overall, decreasing = TRUE), , drop = FALSE]
                vimp$var <- rownames(vimp)
                vimp
              },
              selectSize = pickSizeBest,
              selectVar = pickVars)
I have declared the control function as:
rfeCtrl <- rfeControl(functions = rfRFE,
                      method = "cv",
                      number = 10,
                      verbose = TRUE)
But when I run the rfe function as shown below,
rfProfile <- rfe(training_predictors,
                 training_labels,
                 sizes = subset_sizes,
                 rfeControl = rfeCtrl)
I get this error:
Error in { : task 1 failed - "argument 1 is not a vector"
I also tried changing the vector subset_sizes, but still no luck. What am I doing wrong?
Update: I tried running these steps one by one, and the problem seems to be with the rank function. But I am still unable to figure out what is wrong.
Update: I found the problem. varImp in the rank function does not contain $Overall; instead it contains columns named 0 and 1. Why is that? What do 0 and 1 signify (both columns have exactly the same values, by the way)? Also, how can I make varImp return an $Overall column? [As a temporary solution, I am creating a new $Overall column and attaching it to vimp in the rank function; see the sketch below.]
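For reference, a sketch of that temporary workaround (how exactly to build $Overall is my own choice here; I average the two identical per-class columns, whose names 0 and 1 come from the output I described):
rank = function(object, x, y) {
  vimp <- varImp(object)
  # varImp() returned per-class columns named "0" and "1" instead of
  # $Overall, so rebuild Overall from them (both columns are identical)
  vimp$Overall <- rowMeans(vimp[, c("0", "1")])
  vimp <- vimp[order(vimp$Overall, decreasing = TRUE), , drop = FALSE]
  vimp$var <- rownames(vimp)
  vimp
}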
Using 0 and 1 as factor levels is problematic, since those are not valid R column names. In your other SO post you probably would have received a message about using these as factor levels for your output.
Try using a factor outcome with more informative levels that can be translated into valid R column names (for class probabilities).
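A minimal sketch of that recoding (the level names "neg" and "pos" are just placeholders):
# Recode the 0/1 outcome into syntactically valid level names before calling rfe()
training_labels <- factor(training_labels,
                          levels = c(0, 1),
                          labels = c("neg", "pos"))
# sanity check: valid names survive make.names() unchanged, so this should be TRUE
all(levels(training_labels) == make.names(levels(training_labels)))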
I found a solution to this same issue when fitting a logistic regression model in rfe with caret. The solution is below:
glmFuncs$rank <- function (object, x, y) {
  vimp <- varImp(object, scale = FALSE)
  library(dplyr)  # attaches %>%, mutate() and arrange(); loadNamespace() alone would not
  vimp <- vimp$importance %>%
    mutate(var = row.names(.)) %>%
    arrange(-Overall)
  vimp <- vimp[order(vimp$Overall, decreasing = TRUE), , drop = FALSE]
  vimp
}
I've made a minimal reproducible example for the problem I'm facing.
Data for Y (the monthly dependent variable):
monthlytest <- c(-.035, 0.455)
ytest <- ts(monthlytest, start=c(2008,8), frequency=12)
Data for X (the daily explanatory variable):
lol1 <- paste(2008, sprintf("%02s",rep(1:12, each=30)), sprintf("%02s", 1:30), sep="-") [211:270]
lol2 <- seq(0.015, 0.078, length.out=60)
xtest <- zoo(lol2, order.by = lol1)
Load the packages:
library(midasr)
library(zoo)
Run regression:
beta <- midas_r(ytest ~ mls(ytest, 1, 1) + mls(xtest, 3:30, 30))
When this final line of code is run, I get the error below. What am I doing wrong?
Error in matrix(NA, nrow = n - nrow(X), ncol = ncol(X)) :
invalid 'nrow' value (< 0)
The error is produced by the function mls:
> mls(xtest, 3:30, 30)
Error in matrix(NA, nrow = n - nrow(X), ncol = ncol(X)) :
  invalid 'nrow' value (< 0)
This happens because mls expects a numeric argument, whereas xtest is a zoo object. Converting xtest to numeric solves that problem:
xtn <- as.numeric(xtest)
beta <- midas_r(ytest ~ mls(ytest, 1, 1) + mls(xtn, 3:30, 30))
Which produces another error:
Error in prepmidas_r(y, X, mt, Zenv, cl, args, start, Ofunction, user.gradient, :
  argument "start" is missing, with no default
This means that you did not specify start, which is mandatory for the function midas_r. Your model is an unrestricted MIDAS model, so you either need to use the function midas_u or supply start = NULL. But even that does not help:
> beta <- midas_r(ytest ~ mls(ytest, 1, 1) + mls(xtn, 3:30, 30),start=NULL)
Error in midas_r.fit(prepmd) :
  Not possible to estimate MIDAS model, more parameters than observations
You have two low-frequency observations, which in theory allows you to estimate two parameters, yet your model has 29. So you need at least 30 low-frequency observations (you lose one observation due to the lagged dependent variable) to estimate this model.
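To illustrate with made-up numbers (simulated data, not the data from the question): with 40 monthly observations of y and 30 daily x values per month, the same unrestricted specification estimates without error via midas_u:
# Simulated data only, to show the required sample size
set.seed(1)
ylong <- ts(rnorm(40), start = c(2008, 8), frequency = 12)
xlong <- rnorm(40 * 30)  # frequency ratio m = 30
# midas_u fits the unrestricted model, so no start values are needed
beta <- midas_u(ylong ~ mls(ylong, 1, 1) + mls(xlong, 3:30, 30))
summary(beta)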
In an assignment, we are asked to perform cross-validation on a CART model. I have tried using the cvFit function from cvTools but got a strange error message. Here's a minimal example:
library(rpart)
library(cvTools)
data(iris)
cvFit(rpart(formula=Species~., data=iris))
The error I'm seeing is:
Error in nobs(y) : argument "y" is missing, with no default
And the traceback():
5: nobs(y)
4: cvFit.call(call, data = data, x = x, y = y, cost = cost, K = K,
R = R, foldType = foldType, folds = folds, names = names,
predictArgs = predictArgs, costArgs = costArgs, envir = envir,
seed = seed)
3: cvFit(call, data = data, x = x, y = y, cost = cost, K = K, R = R,
foldType = foldType, folds = folds, names = names, predictArgs = predictArgs,
costArgs = costArgs, envir = envir, seed = seed)
2: cvFit.default(rpart(formula = Species ~ ., data = iris))
1: cvFit(rpart(formula = Species ~ ., data = iris))
It looks like y is mandatory for cvFit.default. But:
> cvFit(rpart(formula=Species~., data=iris), y=iris$Species)
Error in cvFit.call(call, data = data, x = x, y = y, cost = cost, K = K, :
'x' must have 0 observations
What am I doing wrong? Which package would allow me to do a cross-validation with a CART tree without having to code it myself? (I am sooo lazy...)
The caret package makes cross validation a snap:
> library(caret)
> data(iris)
> tc <- trainControl("cv",10)
> rpart.grid <- expand.grid(.cp=0.2)
>
> (train.rpart <- train(Species ~., data=iris, method="rpart",trControl=tc,tuneGrid=rpart.grid))
150 samples
4 predictors
3 classes: 'setosa', 'versicolor', 'virginica'
No pre-processing
Resampling: Cross-Validation (10 fold)
Summary of sample sizes: 135, 135, 135, 135, 135, 135, ...
Resampling results
Accuracy Kappa Accuracy SD Kappa SD
0.94 0.91 0.0798 0.12
Tuning parameter 'cp' was held constant at a value of 0.2
Finally, I was able to get it to work. As Joran noted, the cost parameter needs to be adapted. In my case I am using 0/1 loss, which means I use a simple function that evaluates != instead of - between y and yHat. Also, predictArgs must include c(type = 'class'); otherwise the predict call used internally will return a vector of probabilities instead of the most probable classification. To sum up:
library(rpart)
library(cvTools)
data(iris)
cvFit(rpart, formula = Species ~ ., data = iris,
      cost = function(y, yHat) (y != yHat) + 0,
      predictArgs = c(type = 'class'))
(This uses another variant of cvFit, called on the model-fitting function itself. Additional arguments to rpart can be passed by setting the args= parameter; see the sketch below.)
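For instance, a small sketch of that (the rpart.control settings here are only an illustration, not part of the original problem):
# Extra arguments for rpart go through cvFit's args= parameter
cvFit(rpart, formula = Species ~ ., data = iris,
      args = list(control = rpart.control(minsplit = 10)),
      cost = function(y, yHat) (y != yHat) + 0,
      predictArgs = c(type = 'class'))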