I am using the randomForest package in R to build several species distribution models. My response variable is binary (0 = absence, 1 = presence) and quite unbalanced: for some species the ratio of absences to presences is 37:1. This imbalance (or zero-inflation) leads to questionable out-of-bag (OOB) error estimates; the larger the ratio of absences to presences, the lower my OOB error estimate.
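For example, at a 37:1 ratio a model that always predicts absence already looks very accurate, so a low OOB error on its own tells me little (a toy calculation, not from my actual analysis):
# Misclassification rate of a trivial "always predict absence" rule at 37:1
1 / (37 + 1)   # ~0.026, i.e. about 2.6% error without using any predictors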
To compensate for this imbalance, I wanted to implement stratified sampling so that each tree in the random forest is grown from an equal (or at least less imbalanced) number of observations from the presence and absence categories. I was surprised to find that there doesn't seem to be any difference between the stratified and unstratified models' OOB error estimates. See my code below:
Without stratification
> set.seed(25)
> HHrf<- randomForest(formula = factor(HH_Pres) ~ SST + Dist2Shr+ DaylightHours + Bathy + Slope + MoonPhase + factor(Region) + Chla , data = bll_HH, ntree = 500, replace = FALSE, importance = TRUE, na.action = na.omit)
> HHrf
Call:
randomForest(formula = factor(HH_Pres) ~ SST + Dist2Shr + DaylightHours + Bathy + Slope + MoonPhase + factor(Region) + Chla, data = bll_HH, ntree = 500, replace = FALSE, importance = TRUE, na.action = na.omit)
Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 2
OOB estimate of error rate: 19.1%
Confusion matrix:
    0  1 class.error
0 422 18  0.04090909
1  84 10  0.89361702
With stratification
> HHrf_strata<- randomForest(formula = factor(HH_Pres) ~ SST + Chla + Dist2Shr+ DaylightHours + Bathy + Slope + MoonPhase + factor(Region), data = bll_HH, ntree = 500, replace = FALSE, importance = TRUE, strata = bll_HH$HH_Pres, sampsize = ceiling(.632*nrow(bll_HH)))
> HHrf
Call:
randomForest(formula = factor(HH_Pres) ~ SST + Dist2Shr + DaylightHours + Bathy + Slope + MoonPhase + factor(Region) + Chla, data = bll_HH, ntree = 500, replace = FALSE, importance = TRUE, na.action = na.omit)
Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 2
OOB estimate of error rate: 19.1%
Confusion matrix:
    0  1 class.error
0 422 18  0.04090909
1  84 10  0.89361702
Is there a reason that I am getting the same results in both cases? For the strata argument, I specify my response variable, HH_Pres. For the sampsize argument, I specify that it should just be 63.2% of the entire dataset.
Anyone know what I am doing wrong? Or is this to be expected?
Thanks,
Liza
To reproduce this problem:
Sample data: https://docs.google.com/file/d/0B-JMocik79JzY3B4U3NoU3kyNW8/edit
Code:
bll <- read.csv("bll_Nov2013_NMV.csv", header = TRUE)
HH_Pres <- bll$HammerHeadALL_Presence
Slope <- bll$Slope
Dist2Shr <- bll$Dist2Shr
Bathy <- bll$Bathy2
Chla <- bll$GSM_Chl_Daily_MF
SST <- bll$SST_PF_daily
Region <- bll$Region
MoonPhase <- bll$MoonPhase
DaylightHours <- bll$DaylightHours
bll_HH <- data.frame(HH_Pres, Slope, Dist2Shr, Bathy, Chla, SST, DaylightHours, MoonPhase, Region)
set.seed(25)
HHrf<- randomForest(formula = factor(HH_Pres) ~ SST + Dist2Shr+ DaylightHours + Bathy + Slope + MoonPhase + factor(Region) + Chla , data = bll_HH, ntree = 500, replace = FALSE, importance = TRUE, na.action = na.omit)
HHrf
set.seed(25)
HHrf_strata<- randomForest(formula = factor(HH_Pres) ~ SST + Chla + Dist2Shr+ DaylightHours + Bathy + Slope + MoonPhase + factor(Region), data = bll_HH, strata = bll_HH$HH_Pres, sampsize = c(100, 50), ntree = 500, replace = FALSE, importance = TRUE)
HHrf
As far as I know, the sampsize argument should be a vector the same length as the number of classes in your data set. If you specify a factor variable in the strata argument, then sampsize should be a vector whose length equals the number of levels of that factor. I am not sure it performs as you describe in your question, but it has been a while since I have used the randomForest function.
From the help files, it says:
strata
A (factor) variable that is used for stratified sampling.
sampsize:
Size(s) of sample to draw. For classification, if sampsize is a vector
of the length the number of strata, then sampling is stratified by
strata, and the elements of sampsize indicate the numbers to be drawn
from the strata.
For example, since your classification has 2 distinct classes, you need to give sampsize a vector of length 2 that specifies how many observations you want to sample from each class during training time.
e.g. sampsize=c(100,50)
Furthermore, you can specify the names of the groups to be extra clear.
e.g. sampsize=c('0'=100, '1'=50)
An example from the help files that uses the sampsize argument, to clarify:
## stratified sampling: draw 20, 30, and 20 of the species to grow each tree.
data(iris)
(iris.rf2 <- randomForest(iris[1:4], iris$Species, sampsize=c(20, 30, 20)))
EDIT: Added some notes about the strata argument in randomForest.
EDIT: Make sure the strata argument is given a factor variable!
e.g. try strata = factor(HH_Pres), sampsize = c(...), where c(...) is a vector whose length equals length(levels(factor(bll_HH$HH_Pres)))
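For your data that would look something like this (just a sketch; the per-class counts of 100 and 50 are placeholders you would tune):
library(randomForest)
set.seed(25)
HHrf_strata <- randomForest(factor(HH_Pres) ~ SST + Dist2Shr + DaylightHours + Bathy +
                              Slope + MoonPhase + factor(Region) + Chla,
                            data = bll_HH,
                            strata = factor(bll_HH$HH_Pres),
                            sampsize = c('0' = 100, '1' = 50),
                            ntree = 500, replace = FALSE, importance = TRUE)
HHrf_strata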
EDIT:
OK, I tried running the code with your data, and it works for me.
# Fix up the data set to have HH_Pres and Region as factors
bll_HH$Region <- factor(bll_HH$Region)
bll_HH$HH_Pres <- factor(bll_HH$HH_Pres)
# Original RF code
set.seed(25)
HHrf <- randomForest(formula=HH_Pres ~ SST + Dist2Shr + DaylightHours + Bathy +
Slope + MoonPhase + Chla + Region,
data=bll_HH, ntree = 500, replace = FALSE,
importance = TRUE, na.action = na.omit)
HHrf
# Output
# OOB estimate of error rate: 18.91%
# Confusion matrix:
# 0 1 class.error
# 0 425 15 0.03409091
# 1 86 8 0.91489362
# Take 63.2% from each class
mySampSize <- ceiling(table(bll_HH$HH_Pres) * 0.632)
set.seed(25)
HHrf <- randomForest(formula=HH_Pres ~ SST + Dist2Shr + DaylightHours + Bathy +
Slope + MoonPhase + Chla + Region,
data=bll_HH, ntree = 500, replace = FALSE,
importance = TRUE, na.action = na.omit,
sampsize=mySampSize)
HHrf
# Output
# OOB estimate of error rate: 18.91%
# Confusion matrix:
# 0 1 class.error
# 0 424 16 0.03636364
# 1 85 9 0.90425532
Note that the OOB error estimate is the same in this case, even though we only use 63.2% of the data from each of the classes in our bootstrap samples. This is probably because the sample sizes are proportional to the class distribution in your training data, and because your data set is relatively small. Let's try changing mySampSize to make sure it really worked.
# Change mySampSize. Sample 100 from class 0 and 50 from class 1
mySampSize[1] <- 100
mySampSize[2] <- 50
set.seed(25)
HHrf <- randomForest(formula=HH_Pres ~ SST + Dist2Shr + DaylightHours + Bathy +
Slope + MoonPhase + Chla + Region,
data=bll_HH, ntree = 500, replace = FALSE,
importance = TRUE, na.action = na.omit,
sampsize=mySampSize)
HHrf
# Output
# OOB estimate of error rate: 21.16%
# Confusion matrix:
# 0 1 class.error
# 0 382 58 0.1318182
# 1 55 39 0.5851064
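If you want to double-check that the per-class sample sizes really are being used, keep.inbag = TRUE lets you count the in-bag presences in each tree (a quick sketch reusing the objects above; complete cases are taken so the rows of the in-bag matrix line up with the data):
bll_cc <- bll_HH[complete.cases(bll_HH), ]
set.seed(25)
HHrf_check <- randomForest(HH_Pres ~ ., data = bll_cc, ntree = 50,
                           replace = FALSE, sampsize = mySampSize,
                           keep.inbag = TRUE)
# Each column is one tree; each count should match the class-1 sample size (here 50)
colSums(HHrf_check$inbag[bll_cc$HH_Pres == "1", ])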
This syntax seems to work fine for me on your data. The OOB error is 32.21% and the class errors are 0.32 and 0.29. I did increase the number of trees to 1000. I always recommend using indexing (the x/y interface) to define a randomForest model; in certain circumstances the formula (symbolic) interface seems to be unstable.
require(randomForest)
HHrf <- read.csv("bll_HH.csv")
set.seed(25)
( rf.mdl <- randomForest( y=as.factor(HHrf[,"HH_Pres"]), x=HHrf[,2:ncol(HHrf)],
strata=as.factor(HHrf[,"HH_Pres"]), sampsize=c(50,50),
ntree=1000) )
I ran into this problem too. What I noticed is that my error rate changes significantly when I use importance = TRUE together with stratified sampling; it is not the same as when I leave stratified sampling out.
For me it ended up being a trade-off: I gave up the importance/accuracy scores for my classification model. It appears to be one of many bugs in this implementation.
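One way to see whether this happens on a given data set is to fit the same stratified forest twice, once with importance = TRUE and once without, under the same seed (a sketch using iris rather than my own data):
library(randomForest)
data(iris)
set.seed(25)
rf_imp   <- randomForest(iris[1:4], iris$Species, strata = iris$Species,
                         sampsize = c(20, 20, 20), importance = TRUE,  ntree = 500)
set.seed(25)
rf_noimp <- randomForest(iris[1:4], iris$Species, strata = iris$Species,
                         sampsize = c(20, 20, 20), importance = FALSE, ntree = 500)
# Compare the final OOB error rates of the two runs
rf_imp$err.rate[500, "OOB"]
rf_noimp$err.rate[500, "OOB"]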
Related
Hi, I tried to print the confusion matrix for a dataset using R. Here are my results.
My class variable contains binary values: Medv is my class variable, binarized so that a house with a Medv value greater than 230k is 1, and 0 otherwise. At the end of the confusion matrix output, the positive class is reported as 0. What does this mean? Do these results misrepresent my data?
My R code so far:
# Load CART packages
library(rpart)
library(rpart.plot)
library(caTools)
library(caret)
library(pROC)
housing_data = read.csv('housing.csv')
summary(housing_data)
housing_data = na.omit(housing_data)
# CART model
latlontree = rpart(Medv ~ Crim + Rm, data=housing_data , method = "class")
# Plot the tree using prp command defined in rpart.plot package
prp(latlontree)
# Split the data for Machine Learning
set.seed(123)
split = sample.split(housing_data$Medv, SplitRatio = 0.8)
train = subset(housing_data, split==TRUE)
test = subset(housing_data, split==FALSE)
#print (train)
#print (test)
# Create a CART model
tree = rpart(Medv ~ Crim + Zn + Indus + Chas + Nox + Rm + Age + Dis + Rad + Tax + Ptratio + B + Lstat , data=train , method = "class")
prp(tree)
#Decision tree prediction
#tree.pred = predict(tree, test)
pred = predict(tree,test, type="class")
#print (pred)
table(pred, test$Medv)
table(factor(pred, levels=min(test$Medv):max(test$Medv)),
factor(test$Medv, levels=min(test$Medv):max(test$Medv)))
# If p exceeds threshold of 0.5, M else R: m_or_r
#m_or_r <- ifelse(p > 0.5, 1, 0)
#print (m_or_r)
# Convert to factor: p_class
#p_class <- factor(m_or_r, levels = test$Medv)
# Create confusion matrix
confusionMatrix(table(factor(pred, levels=min(test$Medv):max(test$Medv)),
factor(test$Medv, levels=min(test$Medv):max(test$Medv))))
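# Note on the "Positive class: 0" line in the confusionMatrix output: caret's
# confusionMatrix() treats the first factor level as the positive class by default.
# It can be overridden via the positive argument (illustrative, not part of my run):
# confusionMatrix(table(factor(pred,      levels = min(test$Medv):max(test$Medv)),
#                       factor(test$Medv, levels = min(test$Medv):max(test$Medv))),
#                 positive = "1")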
#print (tree.sse)
#ROC Curve
#Obtaining predicted probabilites for Test data
tree.probs=predict(tree,
test,
type="prob")
head(tree.probs)
#Calculate ROC curve
rocCurve.tree <- roc(test$Medv,tree.probs[,2])
#plot the ROC curve
plot(rocCurve.tree,col=c(4))
auc <- auc (test$Medv,tree.probs[,2])
print (auc)
#creating a dataframe with a single row
x <- data.frame("Crim"= c(0.03), "Zn"=c(13), "Indus"=c(3.5), "Chas"=c(0.3), "Nox"=c(0.58), "Rm"=c(4.1), "Age"=c(68), "Dis"=c(4.98), "Rad" =c(3), "Tax"=c(225), "Ptratio"=c(17), "B"=c(396), "Lstat"=c(7.56))
#Obtaining predicted probabilites for Test data
probability2=predict(tree,
x,
type="prob")
print (probability2)
#Obtaining predicted class for Test data
probability3=predict(tree,
x,
type="class")
print (probability3)
Image of the dataset
I'm trying to fit a mixed-effects model to assess effects on the rate of germinated pollen grains. I started with a binomial distribution and a model structure like this:
glmer(cbind(NGG,NGNG) ~ RH3*Altitude + AbH + Date3 + (1 | Receptor/Code/Plant) +
(1 | Mountain/Community), data=database, family="binomial",
control = glmerControl(optimizer="bobyqa"))
Where NGG is the number of successes (germinated grains per stigma; can vary from 0 to e.g. 55) and NGNG is the number of failures (non-germinated grains, 0 to e.g. 80). The issue is that, after seeing the results, the data seem to be over-dispersed, as indicated by this function (found in http://rstudio-pubs-static.s3.amazonaws.com/263877_d811720e434d47fb8430b8f0bb7f7da4.html):
overdisp_fun <- function(model) {
vpars <- function(m) {
nrow(m)*(nrow(m)+1)/2
}
model.df <- sum(sapply(VarCorr(model), vpars)) + length(fixef(model))
rdf <- nrow(model.frame(model))-model.df
rp <- residuals(model, type = "pearson") # computes pearson residuals
Pearson.chisq <- sum(rp^2)
prat <- Pearson.chisq/rdf
pval <- pchisq(Pearson.chisq, df = rdf, lower.tail = FALSE)
c(chisq = Pearson.chisq, ratio = prat, rdf = rdf, p = pval)
}
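I applied it to the binomial fit above (a sketch; m_binom is just an illustrative name for the object returned by the glmer() call, with lme4 loaded):
overdisp_fun(m_binom)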
The output was:
chisq = 1.334567e+04, ratio = 1.656201e+00, rdf = 8.058000e+03, p = 3.845911e-268
So I decided to try a beta-binomial model in glmmTMB as follows (it's important to keep this hierarchical structure):
glmmTMB(cbind(NGG,NGNG) ~ RH3*Altitude + AbH + Date3 + (1 | Receptor/Code/Plant) +
(1 | Mountain/Community), data=database,
family=betabinomial(link = "logit"), na.action = na.omit, weights=NGT)
When I run it, it says:
Error in nlminb(start = par, objective = fn, gradient = gr, control = control$optCtrl) : (converted from warning) NA/NaN function evaluation
Is there something wrong with how the model is written? I already checked for possible issues in http://rstudio-pubs-static.s3.amazonaws.com/263877_d811720e434d47fb8430b8f0bb7f7da4.html but have not found a solution yet.
Thanks.
I tried a neural net in R on the Boston data set.
data("Boston",package="MASS")
data <- Boston
Retaining only the variables we want to use:
keeps <- c("crim", "indus", "nox", "rm" , "age", "dis", "tax" ,"ptratio", "lstat" ,"medv" )
data <- data[keeps]
In this case the formula is stored in an R object called f.
The response variable medv is to be “regressed” against the remaining nine attributes. I have done it as below:
f <- medv ~ crim + indus + nox + rm + age + dis + tax + ptratio + lstat
To set up the training sample, 400 of the 506 rows of data are drawn without replacement using the sample function:
set.seed(2016)
n = nrow(data)
train <- sample(1:n, 400, FALSE)
The neuralnet function is then fitted:
library(neuralnet)
fit<- neuralnet(f, data = data[train ,], hidden=c(10 ,12 ,20),
algorithm = "rprop+", err.fct = "sse", act.fct = "logistic",
threshold =0.1, linear.output=TRUE)
But a warning message is displayed saying the algorithm did not converge.
Warning message:
algorithm did not converge in 1 of 1 repetition(s) within the stepmax
I tried prediction using compute:
pred <- compute(fit,data[-train, 1:9])
The following error message is displayed:
Error in nrow[w] * ncol[w] : non-numeric argument to binary operator
In addition: Warning message:
In is.na(weights) : is.na() applied to non-(list or vector) of type 'NULL'
Why does this error come up, and how can I recover from it so I can make predictions? I want to use the neuralnet function on that data set.
When neuralnet doesn't converge, the resulting neural network is not complete. You can tell by calling attributes(fit)$names. When training converges, it will look like this:
[1] "call" "response" "covariate" "model.list" "err.fct"
[6] "act.fct" "linear.output" "data" "net.result" "weights"
[11] "startweights" "generalized.weights" "result.matrix"
When it doesn't, some attributes will not be defined:
[1] "call" "response" "covariate" "model.list" "err.fct" "act.fct" "linear.output"
[8] "data"
That explains why compute doesn't work.
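A quick check (a sketch, assuming fit is the object returned by neuralnet):
# "weights" only appears among the attributes after training has converged
"weights" %in% attributes(fit)$names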
When training doesn't converge, the first thing to try is increasing stepmax (default 100000). You can also add lifesign = "full" to get better insight into the training process.
Also, looking at your code, I would say three layers with 10, 12 and 20 neurons is too much. I would start with one layer with the same number of neurons as the number of inputs, in your case 9.
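For example (a sketch only; the stepmax value is arbitrary, and f, data and train are the objects defined in the question):
library(neuralnet)
set.seed(2016)
fit <- neuralnet(f, data = data[train, ], hidden = 9,
                 algorithm = "rprop+", err.fct = "sse", act.fct = "logistic",
                 threshold = 0.1, linear.output = TRUE,
                 stepmax = 1e6,     # allow many more iterations before giving up
                 lifesign = "full") # print progress while training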
EDIT:
With scaling (remember to scale both training and test data, and to 'de-scale' compute results), it converges much faster. Also note that I reduced the number of layers and neurons, and still lowered the error threshold.
data("Boston",package="MASS")
data <- Boston
keeps <- c("crim", "indus", "nox", "rm" , "age", "dis", "tax" ,"ptratio", "lstat" ,"medv" )
data <- data[keeps]
f <- medv ~ crim + indus + nox + rm + age + dis + tax + ptratio + lstat
set.seed(2016)
n = nrow(data)
train <- sample(1:n, 400, FALSE)
# Scale data. Scaling parameters are stored in this matrix for later.
scaledData <- scale(data)
fit<- neuralnet::neuralnet(f, data = scaledData[train ,], hidden=9,
algorithm = "rprop+", err.fct = "sse", act.fct = "logistic",
threshold = 0.01, linear.output=TRUE, lifesign = "full")
pred <- neuralnet::compute(fit,scaledData[-train, 1:9])
scaledResults <- pred$net.result * attr(scaledData, "scaled:scale")["medv"] +
                 attr(scaledData, "scaled:center")["medv"]
cleanOutput <- data.frame(Actual = data$medv[-train],
Prediction = scaledResults,
diff = abs(scaledResults - data$medv[-train]))
# Show some results
summary(cleanOutput)
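To reduce this to a single number, one could also compute the test-set RMSE from cleanOutput (a small sketch):
# Root mean squared error of the de-scaled predictions on the test set
sqrt(mean((cleanOutput$Actual - cleanOutput$Prediction)^2))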
The problem seems to be in your argument linear.output = TRUE.
With your data, but changing the code a bit (not defining the formula and adding some explanatory comments):
library(neuralnet)
fit <- neuralnet(formula = medv ~ crim + indus + nox + rm + age + dis + tax + ptratio + lstat,
                 data = data[train,],
                 hidden = c(10, 12, 20),  # number of vertices (neurons) in each hidden layer
                 rep = 10,                # train the network 10 times (matches the output below)
                 algorithm = "rprop+",    # resilient backpropagation with weight backtracking
                 err.fct = "sse",         # error computed as the sum of squared errors
                 act.fct = "logistic",    # logistic activation applied to the cross product of covariates and weights
                 threshold = 0.1,         # stopping threshold for the partial derivatives of the error function
                 linear.output = FALSE)   # act.fct is also applied to the output neurons
print(fit)
Call: neuralnet(formula = medv ~ crim + indus + nox + rm + age + dis + tax + ptratio + lstat, data = data[train, ], hidden = c(10, 12, 20), threshold = 0.1, rep = 10, algorithm = "rprop+", err.fct = "sse", act.fct = "logistic", linear.output = FALSE)
10 repetitions were calculated.
Error Reached Threshold Steps
1 108955.0318 0.03436116236 4
5 108955.0339 0.01391790099 8
3 108955.0341 0.02193379592 3
9 108955.0371 0.01705056758 6
8 108955.0398 0.01983134293 8
4 108955.0450 0.02500006437 5
6 108955.0569 0.03689097762 5
7 108955.0677 0.04765829189 5
2 108955.0705 0.05052776877 5
10 108955.1103 0.09031966778 7
# now compute will work
pred <- compute(fit, data[-train, 1:9])
This post follows this question: https://stackoverflow.com/questions/31234329/rpart-user-defined-implementation
I'm very interested in tools that can handle tree growing with customized criteria, so that I can test different models.
I tried to use the partykit R package to grow a tree in which the split rule is given by the negative log-likelihood of a Cox model (which is actually a log quasi-likelihood in the case of the Cox model) and a Cox model is fitted in each leaf.
As I understand from reading the vignette about the mob() function, there are two ways to implement my own split criterion: have the fit function return either a list or a model object.
For my purpose, I tried both solutions, but I failed to make them work.
Solution 1: return a list object
I take as an example the breast cancer data set, as in the "mob" vignette.
I tried this:
cox1 = function(y,x, start = NULL, weights = NULL, offset = NULL, ...,
estfun = FALSE, object = TRUE){
res_cox = coxph(formula = y ~ x )
list(
coefficients = res_cox$coefficients,
objfun = - res_cox$loglik[2],
object = res_cox)
}
mob(formula = Surv(time, cens) ~ horTh + pnodes - 1 | age + tsize + tgrade + progrec +
estrec + menostat ,
data = GBSG2 ,
fit = cox1,
control = mob_control(alpha = 0.0001) )
There is a warning about the singularity of the X matrix, and the mob function returns a tree with a single node (even with smaller values for alpha).
Note that there is no singularity problem with the X matrix when running the coxph function directly:
res_cox = coxph( formula = Surv(time, cens) ~ horTh + pnodes ,
data = GBSG2 )
Solution 2: return a coxph object
I tried this:
cox2 = function(y,x, start = NULL, weights = NULL, offset = NULL, ... ){
res_cox = coxph(formula = y ~ x )
}
logLik.cox2 <- function(object, ...)
structure( - object$loglik[2], class = "logLik")
mob(formula = Surv(time, cens) ~ horTh + pnodes - 1 | age + tsize + tgrade + progrec +
estrec + menostat ,
data = GBSG2 ,
fit = cox2,
control = mob_control(alpha = 0.0001 ) )
So this time I get a split along the "progrec" variable:
Model-based recursive partitioning (cox2)
Model formula:
Surv(time, cens) ~ horTh + pnodes - 1 | age + tsize + tgrade +
progrec + estrec + menostat
Fitted party:
[1] root
| [2] progrec <= 21: n = 281
| xhorThno xhorThyes xpnodes
| 0.19306661 NA 0.07832756
| [3] progrec > 21: n = 405
| xhorThno xhorThyes xpnodes
| 0.64810352 NA 0.04482348
Number of inner nodes: 1
Number of terminal nodes: 2
Number of parameters per node: 3
Objective function: 1531.132
Warning message:
In coxph(formula = y ~ x) : X matrix deemed to be singular; variable 2
I would like to know what's wrong with my Solution 1.
I also tried a similar thing for a regression problem and got the same result, ending with a single leaf:
data("BostonHousing", package = "mlbench")
BostonHousing <- transform(BostonHousing,
chas = factor(chas, levels = 0:1, labels = c("no", "yes")),
rad = factor(rad, ordered = TRUE))
linear_reg = function(y,x, start = NULL, weights = NULL, offset = NULL, ...,
estfun = FALSE, object = TRUE){
res_lm = glm(formula = y ~ x , family = "gaussian")
list(
coefficients = res_lm$coefficients,
objfun = res_lm$deviance,
object = res_lm )
}
mob( formula = medv ~ log(lstat) + I(rm^2) | zn + indus + chas + nox +
+ age + dis + rad + tax + crim + b + ptratio,
data = BostonHousing ,
fit = linear_reg)
I would also like to know whether there is any problem with using the same variable both to fit the model in a node and to make a split.
Thank you in advance.
I will probably have other questions about partykit functioning.
The problem with the cox1() and linear_reg() functions you have set up is that you do not supply the estimating functions, a.k.a. score contributions. As these are the basis for the inference that selects the splitting variable, the algorithm does not split at all if they are not provided. See this recent answer for some discussion of this issue.
But for coxph() objects (unlike the fitdistr() example in the discussion linked above) it is very easy to obtain these estimating functions or scores because there is an estfun() method available. So your cox2() approach is the easier route to go here.
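For illustration (a small sketch; it assumes the sandwich package, which provides an estfun() method for coxph objects):
library("survival")
library("sandwich")
data("GBSG2", package = "TH.data")
cox_fit <- coxph(Surv(time, cens) ~ horTh + pnodes, data = GBSG2)
head(estfun(cox_fit))   # per-observation score contributions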
The reason that the latter doesn't work correctly is due to the special handling of intercepts in coxph(). Internally, this always forces the intercept into the model but then omits the first column from the design matrix. When interfacing this through mob() you need to be careful not to mess this up because mob() sets up its own model matrix. And because you exclude the intercept, mob() thinks that it can estimate both levels of horTh. But this is not the case because the intercept is not identified in the Cox-PH model.
The best solution in this case (IMO) is the following: You let mob() set up an intercept but then exclude it again when passing the model matrix to coxph(). Because there are coef(), logLik(), and estfun() methods for the resulting objects, one can use the simple setup of your cox2() function.
Packages and data:
library("partykit")
library("survival")
data("GBSG2", package = "TH.data")
Fitting function:
cox <- function(y, x, start = NULL, weights = NULL, offset = NULL, ... ) {
x <- x[, -1]
coxph(formula = y ~ 0 + x)
}
Fitting of the MOB tree to the GBSG2 data:
mb <- mob(formula = Surv(time, cens) ~ horTh + pnodes | age + tsize + tgrade + progrec + estrec + menostat,
data = GBSG2, fit = cox)
mb
## Model-based recursive partitioning (cox)
##
## Model formula:
## Surv(time, cens) ~ horTh + pnodes | age + tsize + tgrade + progrec +
## estrec + menostat
##
## Fitted party:
## [1] root: n = 686
## xhorThyes xpnodes
## -0.35701115 0.05768026
##
## Number of inner nodes: 0
## Number of terminal nodes: 1
## Number of parameters per node: 2
## Objective function: 1758.86
I'm trying to create a model using the MCMCglmm package in R.
The data are structured as follows, where dyad, focal, and other are random effects, r and present are predictor variables, and resp1-resp5 are outcome variables that capture the number of observed behaviors of different subtypes:
dyad focal other r present village resp1 resp2 resp3 resp4 resp5
1 10101 14302 0.5 3 1 0 0 4 0 5
2 10405 11301 0.0 5 0 0 0 1 0 1
…
So a model with only one outcome (teaching) is as follows:
prior_overdisp_i <- list(R=list(V=diag(2),nu=0.08,fix=2),
G=list(G1=list(V=1,nu=0.08), G2=list(V=1,nu=0.08), G3=list(V=1,nu=0.08), G4=list(V=1,nu=0.08)))
m1 <- MCMCglmm(teaching ~ trait-1 + at.level(trait,1):r + at.level(trait,1):present,
random= ~idh(at.level(trait,1)):focal + idh(at.level(trait,1)):other +
idh(at.level(trait,1)):X + idh(at.level(trait,1)):village,
rcov=~idh(trait):units, family = "zipoisson", prior=prior_overdisp_i,
data = data, nitt = nitt.1, thin = 50, burnin = 15000, pr = TRUE, pl = TRUE, verbose = TRUE, DIC = TRUE)
Hadfield's course notes (Ch 5) give an example of a multinomial model that uses only a single outcome variable with 3 levels (sheep horns of 3 types). Similar treatment can be found here: http://hlplab.wordpress.com/2009/05/07/multinomial-random-effects-models-in-r/ This is not quite right for what I'm doing, but contains helpful background info.
Another reference (Hadfield 2010) gives an example of a multi-response MCMCglmm that follows the same format but uses cbind() to predict a vector of responses, rather than a single outcome. The same model with multiple responses would look like this:
m1 <- MCMCglmm(cbind(resp1, resp2, resp3, resp4, resp5) ~ trait-1 +
at.level(trait,1):r + at.level(trait,1):present,
random= ~idh(at.level(trait,1)):focal + idh(at.level(trait,1)):other +
idh(at.level(trait,1)):X + idh(at.level(trait,1)):village,
rcov=~idh(trait):units,
family = cbind("zipoisson","zipoisson","zipoisson","zipoisson","zipoisson"),
prior=prior_overdisp_i,
data = data, nitt = nitt.1, thin = 50, burnin = 15000, pr = TRUE, pl = TRUE, verbose = TRUE, DIC = TRUE)
I have two programming questions here:
How do I specify a prior for this model? I've looked at the materials mentioned in this post but just can't figure it out.
I've run a similar version with only two response variables, but I only get one slope, whereas I thought I should get a different slope for each resp variable. Where am I going wrong, or have I misunderstood the model?
Answer to my first question, based on the HLP post and some help from a colleague/stats consultant:
# values for prior
k <- 5 # originally: length(levels(dative$SemanticClass)), so k = # of outcomes for SemanticClass aka categorical outcomes
I <- diag(k-1) #should make matrix of 0's with diagonal of 1's, dimensions k-1 rows and k-1 columns
J <- matrix(rep(1, (k-1)^2), c(k-1, k-1)) # should make k-1 x k-1 matrix of 1's
And for my model, using the multinomial5 family and 5 outcome variables, the prior is:
prior = list(
  R = list(fix = 1, V = 0.5 * (I + J), n = 4),
  G = list(
    G1 = list(V = diag(4), n = 4)))
For my second question, I need to add an interaction term to the fixed effects in this model:
m <- MCMCglmm(cbind(Resp1, Resp2...) ~ -1 + trait*predictorvariable,
...
The result gives both main effects for the Response variables and posterior estimates for the Response/Predictor interaction (the effect of the predictor variable on each response variable).