R random forest using cforest, how to plot tree

I have created a random forest model using cforest:
library("party")
crs$rf <- cforest(as.factor(Censor) ~ .,
                  data = crs$dataset[crs$sample, c(crs$input, crs$target)],
                  controls = cforest_unbiased(ntree = 500, mtry = 4))
cf <- crs$rf
tr <- party:::prettytree(cf@ensemble[[1]], names(cf@data@get("input")))
#tr
plot(new("BinaryTree", tree = tr, data = cf@data, responses = cf@responses))
I get an error when plotting the tree:
Error: no string supplied for 'strwidth/height' unit
Any help on how to overcome this error?

Looking at your code, I assume that crs is referring to a data frame. The dollar sign may be a problem (specifically crs$rf). If crs is indeed a data frame, then the $ would tell R to extract items from inside the dataframe by using the usual list indexing device. This may be conflicting with the call to generate the random forest, and be causing the error. You could fix this by starting with:
crs_rf <- cforest(as.factor(Censor) ~ ., .....
Which would create the random forest object. This would replace:
crs$rf <- cforest(as.factor(Censor) ~ ., ......
As a reference, in case this doesn't fix it, I wanted to refer you to a great guide from Stanford that covers random forests. It has examples showing how the party package is used. To make troubleshooting easier, I would recommend pulling apart the call as they do in the guide. For example (taken from the guide), first set the controls:
data.controls <- cforest_unbiased(ntree=1000, mtry=3)
Then make the call:
data.cforest <- cforest(Resp ~ x + y + z…, data = mydata, controls=data.controls)
Then generate the plot once the call works. If you need help plotting trees from cforest(), you might look at this excellent resource (as I believe the root of the problem is your plotting function call): http://www.r-bloggers.com/a-brief-tour-of-the-trees-and-forests/
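As a quick sanity check for the plotting side, a single conditional inference tree built with ctree() from the same package plots with no prettytree workaround. This is only a sketch on stand-in data (iris), not your Censor model:

```r
library(party)

# Fit one conditional inference tree on stand-in data (iris).
single_tree <- ctree(Species ~ ., data = iris)

# BinaryTree objects have a working plot method out of the box.
plot(single_tree)
```

If this plots fine on your machine, the problem is isolated to extracting and re-wrapping individual ensemble members from cforest().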

Related

Back transformation of emmeans in lmer

I had to transform a variable response (e.g. Variable 1) to fulfil the assumptions of linear models in lmer using an approach suggested here https://www.r-bloggers.com/2020/01/a-guide-to-data-transformation/ for heavy-tailed data and demonstrated below:
TransformVariable1 <- sqrt(abs(Variable1 - median(Variable1)))
I then fit the data to the following example model:
fit <- lmer(TransformVariable1 ~ x + y + (1|z), data = dataframe)
Next, I update the reference grid to account for the transformation as suggested here Specifying that model is logit transformed to plot backtransformed trends:
rg <- update(ref_grid(fit), tran = "TransformVariable1")
Nevertheless, the emmeans are not back-transformed to the original scale after using the following command:
fitemm <- as.data.frame(emmeans(rg, ~ x + y, type = "response"))
My question is: How can I back transform the emmeans to the original scale?
Thank you in advance.
There are two major problems here.
The lesser of them is in specifying tran. You need to either specify one of a handful of known transformations, such as "log", or a list with the needed functions to undo the transformation and implement the delta method. See the help for make.link, make.tran, and vignette("transformations", "emmeans").
The much more serious issue is that the transformation used here is not a monotone function, so it is impossible to back-transform the results. Each transformed response value corresponds to two possible values on either side of the median of the original variable. The model we have here does not estimate effects on the given variable, but rather effects on the dispersion of that variable. It's like trying to use the speedometer as a substitute for a navigation system.
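To see the fold concretely, here is a minimal illustration with made-up numbers (median taken to be 5): two different responses on either side of the median map to the same transformed value, so no inverse exists.

```r
# Made-up values symmetric around a median of 5.
x <- c(3, 7)

# Both map to sqrt(2) -- the transformation folds the data at the median.
sqrt(abs(x - 5))
```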
I would suggest using a different model, or at least a different response variable.
A possible remedy
Looking again at this, I wonder if what was meant was the symmetric square-root transformation -- what is shown multiplied by sign(Variable1 - median(Variable1)). This transformation is available in emmeans::make.tran(). You will need to re-fit the model.
What I suggest is creating the transformation object first, then using it throughout:
require(lme4)
require(emmeans)
symsqrt <- make.tran("sympower", param = c(0.5, median(Variable1)))
fit <- with(symsqrt,
            lmer(linkfun(Variable1) ~ x + y + (1|z), data = dataframe)
)
emmeans(fit, ~ x + y, type = "response")
symsqrt comprises a list of functions needed to implement the transformation. The transformation itself is symsqrt$linkfun, and the emmeans package knows to look for the other stuff when the response transformation is named linkfun.
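If it helps to convince yourself, the object returned by make.tran() carries both directions of the transformation, so the symmetric square root can be undone. A quick sketch with made-up numbers, assuming a median of 5:

```r
library(emmeans)

# Symmetric square-root transform centered at a made-up median of 5.
symsqrt <- make.tran("sympower", param = c(0.5, 5))

x <- c(1, 3, 5, 7, 9)

# linkfun transforms; linkinv undoes it, so this should recover x.
symsqrt$linkinv(symsqrt$linkfun(x))
```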
BTW, please break the habit of wrapping emmeans() in as.data.frame(). That renders invisible some important annotations, and also disables the possibility of following up with contrasts and comparisons. If you think you want to see more precision than is shown, you can precede the call with emm_options(opt.digits = FALSE); but really, you are kidding yourself if you think those extra digits give you useful information.

R error when using a dataframe with named samples - 'termlabels' must be a character vector of length at least one

I have a database of famous people's voices and I'm currently trying to build a model which predicts whether a voice sounds male or female using the randomForest function. This works fine when I delete the celebrity's name from the dataframe before building the model; however, I want to be able to find out which voices the model gets wrong, so I've tried creating the same model without deleting the voice name first. This is the code:
library(randomForest)
# Build a random forest model
rf_model <- randomForest(Gender ~ -VoiceName, data = dataset)
# Compute the accuracy of the random forest on a validation set
validation$pred <- predict(rf_model, validation)
mean(validation$pred == validation$Gender)
However, when I run this, I see the following error:
Error in reformulate(attributes(Terms)$term.labels) :
'termlabels' must be a character vector of length at least one
This makes no sense to me, because each of these voice samples has a VoiceName which returns TRUE when I run is.character(dataset$VoiceName). Does anybody have any tips on what's going wrong here? Thank you.
You need to put a dot between ~ and -VoiceName, so that the formula means "model Gender on all other columns except VoiceName":
rf_model <- randomForest(Gender ~ . - VoiceName, data = dataset)
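For reference, here is a minimal self-contained sketch of the corrected formula on made-up data (VoiceName, Pitch, and Resonance are invented columns, not your real ones):

```r
library(randomForest)

set.seed(1)
# Made-up stand-in data for illustration.
dataset <- data.frame(
  VoiceName = paste0("voice", 1:100),
  Pitch     = rnorm(100),
  Resonance = rnorm(100),
  Gender    = factor(sample(c("M", "F"), 100, replace = TRUE)),
  stringsAsFactors = FALSE
)

# The dot expands to "all other columns"; - VoiceName then drops the
# identifier from the predictors while keeping it in the data frame,
# so you can still look up which voices were misclassified afterwards.
rf_model <- randomForest(Gender ~ . - VoiceName, data = dataset)
```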

Subtracting a fitted polynomial from a dataset in R

I have one curve, a scatterplot, which is the plot of the data set I am working with (named 'mydata') and the other curve which is the fitted 2nd degree polynomial curve that I obtained from the data set.
The scatterplot was obtained with a simple plot function:
plot(mydata)
The code I used for the fitting is:
fit<-lm(mydata$Volts ~ poly(mydata$Frequency, 2, raw=TRUE),data=mydata)
#summary(fit)
lines(mydata$Frequency, predict(fit))
Now, I would like to subtract the fitted polynomial from the dataset. Following was my approach:
given<-plot(mydata)
fit<-lm(mydata$Volts ~ poly(mydata$Frequency, 2, raw=TRUE),data=mydata)
new<-lines(mydata$Frequency, predict(fit))
corrected<-given-new
plot(corrected)
The error I received was:
Error in plot(corrected) : object 'corrected' not found
How do I correct this?
Looks like you are trying to subtract graphical elements. You should perform any math/operations on your data before trying to plot it. Something like the following may work; however, without sample data this is just an educated guess.
given <- mydata$Volts
fit <- lm(mydata$Volts ~ poly(mydata$Frequency, 2, raw=TRUE),data=mydata)
new <- predict(fit)
corrected <- given-new
plot(mydata$Frequency, corrected)
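As a side note, the subtraction is exactly what residuals() already computes, so the last two steps can be collapsed. A sketch on made-up data standing in for mydata:

```r
set.seed(42)
# Made-up stand-in for mydata.
mydata <- data.frame(Frequency = runif(50, 30, 90))
mydata$Volts <- 220 + 0.01 * mydata$Frequency^2 + rnorm(50)

fit <- lm(Volts ~ poly(Frequency, 2, raw = TRUE), data = mydata)

# residuals(fit) equals mydata$Volts - predict(fit).
corrected <- residuals(fit)
plot(mydata$Frequency, corrected)
```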
I ran a reprex on nonsense data (technically a true reprex would need a random seed, but given the actual issue with the code, that doesn't matter here).
volts <- rnorm(50, mean = 220, sd = 5)
frequency <- runif(50, min = 30, max = 90)
mydata <- data.frame(Volts = volts, Frequency = frequency)
given <- plot(mydata)
fit <- lm(mydata$Volts ~ poly(mydata$Frequency, 2, raw = TRUE), data = mydata)
new <- lines(mydata$Frequency, predict(fit))
corrected <- given - new
plot(corrected)
The scope of my answer is strictly to explain why the not found error showed up. Daniel's code shows you the fix.
I'm not sure why the response of Daniel O was not chosen, because it worked. I know it is frustrating when you have clearly defined something and your source code is right in front of you, yet the interpreter says NOT FOUND. The lesson here: when you hit this situation, check whether your objects are NULL. It's a good habit in general in R.
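The underlying reason is that base graphics functions draw as a side effect and return NULL (invisibly), so there is nothing numeric to subtract. A minimal demonstration:

```r
# plot() and lines() only draw; their return value is NULL (invisible).
p <- plot(1:10)
l <- lines(1:10)

is.null(p)  # TRUE
is.null(l)  # TRUE
```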

Undoing createDataPartition ordering

I'm new to data science and trying to finish up this project. I have a data frame (from here https://www.kaggle.com/c/house-prices-advanced-regression-techniques) with assigned train and test sets (1:1460, 1461:2919), and it was suggested that I use createDataPartition() because of an error I was getting when trying to predict:
> predSale <- as.data.frame(predict(model, newdata = test))
Error in model.frame.default(Terms, newdata, na.action = na.action, xlev =
object$xlevels) : factor MSSubClass has new levels 150
But now when using createDataPartition, it's mixing up my original train and test sets, which I need in a specific order for the Kaggle submission. I've read the vignette and it looks like there is a returnTrain argument. I'm not sure if this could be used (I don't fully understand it), but ultimately I'd like to know if there is a way to undo the ordering so I can submit my project with the original ordered set.
test$SalePrice <- NA
combined <- rbind(train, test)
train <- combined[1:1460, ]
test <- combined[1461:2919, ]
#____________Models____________
set.seed(666)
index <- createDataPartition(paste(combined$MSSubClass,combined$Neighborhood,
combined$Condition2,combined$LotConfig,
combined$Exterior1st,combined$Exterior2nd,
combined$RoofMatl,combined$MiscFeature,combined$SaleType))$Resample
train <- combined[index,]
test <- combined[-index,]
model <- lm(SalePrice ~., train)
predSale <- as.data.frame(predict(model, newdata = test))
SampleSubmission <- round(predSale, digits = 1)
write.csv(SampleSubmission, "SampleSubmission.csv")
Thanks!! If there is anything you need answered please let me know; I think I've provided everything (I'm not sure if you need the full code or what; I'll be happy to edit with whatever more is needed).
You do not use createDataPartition on a combined Kaggle dataset. You need to keep these data sets separate; that's why Kaggle provides them. If you want to combine them, then you have to split them again as they were after you have done your data cleaning.
But the issue you have is that there are factor levels in the test dataset that are not seen by the model. There are multiple posts on Kaggle about this issue, though Kaggle's search engine makes them hard to find.
In Kaggle competitions some people use the following code on character columns to turn them into numeric (e.g. for using the data with xgboost). This code assumes that you loaded the data sets with stringsAsFactors = FALSE.
for (f in feature.names) {
  if (class(train[[f]]) == "character") {
    levels <- unique(c(train[[f]], test[[f]]))
    test[[f]]  <- as.integer(factor(test[[f]], levels = levels))
    train[[f]] <- as.integer(factor(train[[f]], levels = levels))
  }
}
Others use a version of the following to create all the level names in the training data set.
levels(xtrain[,x]) <- union(levels(xtest[,x]),levels(xtrain[,x]))
There are more ways of dealing with this.
Of course these solutions are fine for Kaggle, as they might give you a better score. But in a strict sense this is a sort of data leakage, and using it in a production setting is asking for trouble. There are many situations in which you might not know all of the possible values in advance, and when encountering a new value, returning a missing value instead of a prediction is the more sensible choice. But that discussion fills entire research articles.
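The "return a missing value" behaviour mentioned above comes for free in base R: factor() maps values outside the supplied levels to NA. A small sketch with made-up level names:

```r
# Levels observed during training (made-up values).
train_levels <- c("20", "30", "60")

# "150" was never seen in training...
test_values <- c("20", "150", "60")

# ...so factor() turns it into NA instead of inventing a new level.
factor(test_values, levels = train_levels)
```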

Number at risk table using survplot with cph() object

This is pretty specific to the rms package. When using survplot() with a cph object, the number-at-risk table is not stratified according to the covariate; using npsurv() does this correctly.
library(survival); library(rms)
data(lung)
fit <- cph(Surv(time, status == 2) ~ sex, data = lung, surv = TRUE, x = TRUE, y = TRUE)
survplot(fit, n.risk = TRUE)
# compared to:
survplot(npsurv(fit$sformula, data = lung), conf = "none", n.risk = TRUE)
For usual use cases I guess re-running the model with npsurv will do, but for others, like when wanting to use fit.mult.impute (my specific use case), you want to use the "original" fit created by cph().
Is there a work-around/way to fix this?
So I should have used
strat(sex)
as the term instead of sex when fitting the cph() model.
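Putting it together with the question's example, a sketch of the corrected fit (same lung data as above):

```r
library(survival); library(rms)

data(lung)

# Declaring sex as a stratification term via strat() makes survplot()
# split the number-at-risk table by group.
fit <- cph(Surv(time, status == 2) ~ strat(sex), data = lung,
           surv = TRUE, x = TRUE, y = TRUE)
survplot(fit, n.risk = TRUE)
```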
