R - randomForest with two outcome variables

Fairly new to using the randomForest package here.
I'm trying to fit a model with 2 response variables and 7 predictor variables, but it fails, apparently because of the lengths of the response variables and/or the way a model with 2 response variables has to be specified.
Let's assume this is my data and model:
> table(data$y1)
0 1 2 3 4
23 43 75 47 21
> length(data$y1)
[1] 209
> table(data$y2)
0 2 3 4
104 30 46 29
> length(data$y2)
[1] 209
m1<-randomForest(cbind(y1,y2)~a+b+c+d+e+f+g, data, mtry=7, importance=TRUE)
When I run this model, I receive this error:
Error in randomForest.default(m, y, ...) :
length of response must be the same as predictors
I did some troubleshooting and found that cbind() simply puts the values of the two response variables together in one object, doubling the original length, which possibly results in the error above. As an example,
> length(cbind(data$y1, data$y2))
[1] 418
> sapply(data, length)
  a   b   c   d   e   f   g  y1  y2
209 209 209 209 209 209 209 209 209
I then tried to solve this issue by running randomForest individually on each of the response variables and then applying combine() to the resulting regression models, but came across these issues:
m2<-randomForest(y1~a+b+c+d+e+f+g, data, mtry=7, importance=TRUE)
m3<-randomForest(y2~a+b+c+d+e+f+g, data, mtry=7, importance=TRUE)
combine(m2,m3)
Warning message:
In randomForest.default(m, y, ...) :
The response has five or fewer unique values. Are you sure you want to do regression?
I then decided to treat the randomForest models as classification models and applied as.factor() to both response variables before running randomForest, but then came across this new issue:
m4<-randomForest(as.factor(y1)~a+b+c+d+e+f+g, data, mtry=7, importance=TRUE)
m5<-randomForest(as.factor(y2)~a+b+c+d+e+f+g, data, mtry=7, importance=TRUE)
combine(m4,m5)
Error in rf$votes + ifelse(is.na(rflist[[i]]$votes), 0, rflist[[i]]$votes) :
non-conformable arrays
My guess is that I can't combine() classification models.
I hope this inquiry about running a multivariate random forest model makes sense. Let me know if there are further questions; I can also go back and make adjustments.

Combine your columns outside the randomForest formula, and convert the result to a factor so that randomForest treats it as a classification problem:
data[["y3"]] <- factor(paste0(data$y1, data$y2))
randomForest(y3~a+b+c+d+e+f+g, data, mtry=7, importance=TRUE)
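If you later need the two responses separately, you can split each predicted label back apart. A minimal sketch, assuming both responses are single-digit codes as in the tables above:
m <- randomForest(y3~a+b+c+d+e+f+g, data, mtry=7, importance=TRUE)
pred <- as.character(predict(m, data))
# each combined label is one digit of y1 followed by one digit of y2
pred_y1 <- substr(pred, 1, 1)
pred_y2 <- substr(pred, 2, 2)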


Warning message 'newdata' had 1 row but variables found have 16 rows in R

I am supposed to use the predict function to predict when fjbjor is 5.5, but I always get this warning message. I have tried many ways and it always appears, so can anyone see what I am doing wrong here?
This is my code
fit.lm <- lm(fjbjor~amagn, data=bjor)
summary(fit.lm)
new.bjor<- data.frame(fjbjor=5.5)
predict(fit.lm,new.bjor)
and this comes out
1 2 3 4 5 6 7 8 9 10 11
5.981287 2.864521 9.988559 5.758661 4.645530 2.419269 4.645530 5.313409 6.871792 3.309773 4.200278
12 13 14 15 16
3.755026 5.981287 5.536035 1.974016 3.755026
Warning message: 'newdata' had 1 row but variables found have 16 rows
If anyone can see what is wrong I would be really thankful for the help.
Your model is fjbjor ~ amagn, where fjbjor is the response and amagn is the covariate, yet your newdata is data.frame(fjbjor=5.5).
newdata should provide covariates rather than the response. predict only retains the covariate columns of newdata; for your specified newdata, that leaves nothing. As a result, predict uses the internal model frame for prediction, which returns the fitted values.
The warning message is fairly clear: predict determines the expected number of predictions from nrow(newdata), which is 1, but what I described above happens, so 16 fitted values are returned. That mismatch produces the warning.
Looks like the model you really want is: amagn ~ fjbjor.
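A minimal sketch of that corrected model, assuming the same bjor data as in the question:
fit.lm <- lm(amagn ~ fjbjor, data=bjor)
new.bjor <- data.frame(fjbjor=5.5)
predict(fit.lm, new.bjor)  # one prediction, and no warning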

Lasso: Cross-validation for glmnet

I am using cv.glmnet() to perform cross-validation, 10-fold by default:
library(Matrix)
library(tm)
library(glmnet)
library(e1071)
library(SparseM)
library(ggplot2)
trainingData <- read.csv("train.csv", stringsAsFactors=FALSE,sep=",", header = FALSE)
testingData <- read.csv("test.csv",sep=",", stringsAsFactors=FALSE, header = FALSE)
x = model.matrix(as.factor(V42)~.-1, data = trainingData)
crossVal <- cv.glmnet(x=x, y=trainingData$V42, family="multinomial", alpha=1)
plot(crossVal)
I am getting the following error message:
Error in lognet(x, is.sparse, ix, jx, y, weights, offset, alpha, nobs, :
one multinomial or binomial class has 1 or 0 observations; not allowed
But as shown below, I don't seem to have any class with a count of 0 or 1.
> table(trainingData$V42)
back buffer_overflow ftp_write guess_passwd imap ipsweep land loadmodule multihop
956 30 8 53 11 3599 18 9 7
neptune nmap normal perl phf pod portsweep rootkit satan
41214 1493 67343 3 4 201 2931 10 3633
smurf spy teardrop warezclient warezmaster
2646 2 892 890 20
Any pointers?
cv.glmnet does N-fold cross-validation with N=10 by default. This means it splits your data into 10 subsets, then trains a model on 9 of the 10 and tests it on the remaining 1. It repeats this, leaving out each subset in turn.
Some of your classes are rare enough that, in a given training subset, they can end up with 1 or 0 observations, which is the problem encountered here (and in your previous question). The best solution is to reduce the number of classes in your response by combining the rarer ones (do you really need a predicted probability for spy or perl, for example?).
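As a minimal sketch of that approach, the following collapses every class with fewer than 10 observations into a single "other" level before cross-validating (the threshold of 10 is an arbitrary choice for illustration):
y <- as.character(trainingData$V42)
rare <- names(which(table(y) < 10))  # spy, perl, phf, multihop, ftp_write, loadmodule
y[y %in% rare] <- "other"
crossVal <- cv.glmnet(x=x, y=factor(y), family="multinomial", alpha=1)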
Also, if you're doing glmnet cross-validation and constructing a model matrix, you could use the glmnetUtils package I wrote to streamline the process.
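For example, a sketch assuming glmnetUtils is installed (its formula method builds the model matrix internally, so the model.matrix() step is unnecessary):
library(glmnetUtils)
crossVal <- cv.glmnet(factor(V42) ~ ., data=trainingData,
                      family="multinomial", alpha=1)
plot(crossVal)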

R: rms error using validate

I'm building a linear model using ols() from the rms package in R with:
model<-ols(nallSmells ~ rcs(size, 5) + rcs(minor,5)+rcs(change_churn,3)
+rcs(review_rate,0), data=quality,x=T, y=T)
When I want to validate my model using:
validate(model,B=100)
I get the following error:
Error in lsfit(x, y) : only 0 cases, but 2 variables
In addition: Warning message:
In lsfit(x, y) : 1164 missing values deleted
But if I decrease B, e.g. B=10, it works. Why can't I iterate more? I also noticed that the seed has an effect when I use this method.
Can someone give me some advice?
UPDATE:
I'm using rcs(review_rate,0) because I want to assign zero knots to this predictor, according to my DOF budget. I noticed that the problem is with the data in review_rate. Even if I omit the knots argument in rcs() and just put in the name of the predictor, I get errors. This is the frequency of the data in review_rate:
> count(quality$review_rate)
          x freq
1 0.8571429    1
2 0.9483871    1
3 0.9789474    1
4 0.9887640    1
5 0.9940476    1
6 1.0000000 1159
I wonder if there is a relationship with the values of this vector? Because when I built the OLS model, I got the following warning:
Warning message:
In rcspline.eval(x, nk = nknots, inclx = TRUE, pc = pc, fractied = fractied) :
5 knots requested with 6 unique values of x. knots set to 4 interior values.
The values in the other predictors are positive reals, but if I omit the review_rate predictor I don't get any warning or error.
Thanks for your support.
I've added a link to a sample of 100 rows of my data for replication:
https://www.dropbox.com/s/oks2ztcse3l8567/examplestackoverflow.csv?dl=0
X represents the dependent variable and Y4 the predictor that is giving me problems.
require(rms)
Data <- read.csv("examplestackoverflow.csv")
testmodel <- ols(X ~ rcs(Y1)+rcs(Y2)+rcs(Y3)+rcs(Y4), data=Data, x=T, y=T)
validate(testmodel, B=1000)
Kind regards,

Error message when performing Gamma glmer in R - PIRLS step-halvings failed to reduce deviance in pwrssUpdate

I am trying to perform a glmer in R using the Gamma error family. I get the error message:
"Error: (maxstephalfit) PIRLS step-halvings failed to reduce deviance in pwrssUpdate"
My response variable is flower mass. My fixed effects are base mass, F1 treatment, and fertilisation method. My random effects are line, and maternal ID nested within line.
When I perform the same analysis using an integer response (i.e. flower number), this error does not occur.
Here is a sample of my data:
LINE MATERNAL_ID F1TREAT SO FLWR_MASS BASE_MASS
17 81 stress s 2.7514 9.488
5 41 control o 0.3042 1.809
37 89 control o 2.3749 6.694
5 41 stress s 3.6140 9.729
9 5 control s 0.5020 7.929
13 7 stress s 0.4914 0.969
35 88 stress s 0.4418 1.840
1 57 control o 2.1531 6.673
13 7 stress s 3.0191 7.131
Here is the code I am using:
library(lme4)
m <- glmer(FLWR_MASS~BASE_MASS*F1TREAT*SO+(1|LINE/MATERNAL_ID),
           data=mydata, family=Gamma)
(I am using R 3.0.3 for Windows.)
@HongOoi answered this question in the comments, but I will repeat it here for anyone else having this issue. He suggested changing
family=Gamma
to
family=Gamma(link=log)
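With that change, the call from the question becomes:
m <- glmer(FLWR_MASS~BASE_MASS*F1TREAT*SO+(1|LINE/MATERNAL_ID),
           data=mydata, family=Gamma(link=log))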

Help fitting a poisson glmer (getting an error message)

I am trying to fit a Poisson glmer model in R, to determine if 4 experimental
treatments affected the rate at which plants developed new branches over time.
New branches were counted after 35, 70 and 83 days and data were organised as follows:
treatment replicate.plant time branches
a ID4 35 0
a ID4 70 1
a ID4 83 1
a ID12 35 1
a ID12 70 3
a ID12 83 8
Loading the package lme4, I ran the following model:
mod<-glmer(branches ~ treatment + (1|time),
family=poisson,
data=dataset)
but I obtain the following error message:
Error in get(name, envir = asNamespace(pkg), inherits = FALSE) :
object '.setDummyField' not found
Can anyone please give me an indication of why I am getting this error and what it means?
Any advice on how to make this model run will be greatly appreciated.
This is a known issue, see here: https://github.com/lme4/lme4/issues/54
The problem seems to be limited to R version 3.0.0. You should update to a more recent version.
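You can check which version you are running with:
R.version.string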
