Type parameter of the predict() function - r

What is the difference between type="class" and type="response" in the predict function?
For instance between:
predict(modelName, newdata=testData, type = "class")
and
predict(modelName, newdata=testData, type = "response")

type = "response" gives you the numerical result (e.g. a predicted probability), while type = "class" gives you the label assigned to that value.
type = "response" lets you determine your own threshold. For instance,
glm.fit = glm(Direction~., data=data, family = binomial, subset = train)
glm.probs = predict(glm.fit, test, type = "response")
In glm.probs we have numerical values between 0 and 1. Now we can choose a threshold value, let's say 0.6. Direction has two possible outcomes, Up or Down.
glm.pred = rep("Down", length(glm.probs))
glm.pred[glm.probs > .6] = "Up"
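As a quick illustration (hypothetical, assuming the test data also contain the true Direction column), you can then tabulate the thresholded labels against the truth:
# Confusion matrix of predicted vs. actual direction
table(glm.pred, test$Direction)
# Overall accuracy at the 0.6 threshold
mean(glm.pred == test$Direction)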

type = "response" is used in glm models and type = "class" is used in rpart models(CART).
See:
predict.glm
predict.rpart
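For rpart, a minimal sketch using the built-in iris data (not from the question) shows the two modes side by side:
library(rpart)
fit <- rpart(Species ~ ., data = iris)
# type = "class" returns the predicted factor level for each row
head(predict(fit, iris, type = "class"))
# type = "prob" returns a matrix of class probabilities instead
head(predict(fit, iris, type = "prob"))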

see ?predict.lm:
predict.lm produces a vector of predictions or a matrix of predictions and bounds with column names fit, lwr, and upr if interval is set. For type = "terms" this is a matrix with a column per term and may have an attribute "constant".
> d <- data.frame(x1=1:10,x2=rep(1:5,each=2),y=1:10+rnorm(10)+rep(1:5,each=2))
> l <- lm(y~x1+x2,d)
> predict(l)
1 2 3 4 5 6 7 8 9 10
2.254772 3.811761 4.959634 6.516623 7.664497 9.221486 10.369359 11.926348 13.074222 14.631211
> predict(l,type="terms")
x1 x2
1 -7.0064511 0.8182315
2 -5.4494620 0.8182315
3 -3.8924728 0.4091157
4 -2.3354837 0.4091157
5 -0.7784946 0.0000000
6 0.7784946 0.0000000
7 2.3354837 -0.4091157
8 3.8924728 -0.4091157
9 5.4494620 -0.8182315
10 7.0064511 -0.8182315
attr(,"constant")
[1] 8.442991
I.e., predict(l) is the row sums of predict(l, type="terms") plus the constant.
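You can check this directly with the objects from above:
tt <- predict(l, type = "terms")
# Row sums of the per-term contributions plus the stored constant
# reproduce the ordinary predictions
all.equal(unname(predict(l)), unname(rowSums(tt) + attr(tt, "constant")))
# [1] TRUE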

dimension of predicted results is lower than given matrix

I have a dataset of 17 columns and 500,000 rows, and I want to predict one of these columns for 250,000 of the rows, so my training dataset has 250,000 rows. After splitting into training and testing sets, I ran "gbm" and "lm" models on it:
modellm <- train(DARAMAD ~ ., data = training, method = "lm", na.action = na.pass)
modelgbm <- train(DARAMAD ~ ., data = training, method = "gbm", na.action = na.omit)
The problem is that when I predict, I only receive a vector of 9976 elements, while I am trying to predict 250,000 elements.
z <- predict(modelgbm, newdata = forPredict)
z <- predict(modellm, newdata = forPredict)
forPredict and the training dataset both have 250,000 rows.
Your code didn't work for me, but I counted the NAs as follows:
naCountFunc <- function(x) sum(is.na(x))
naCount <- sapply(trainData, naCountFunc)
as.data.frame(table(naCount))
naCount Freq
1 0 12
2 1 1
3 100 2
4 187722 1
5 188664 1
These two columns with a large number of NAs are not the one I want to predict; the "DARAMAD" column doesn't have any NAs.
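This points to the likely cause: with na.action = na.omit, rows that have an NA in any predictor are dropped before prediction, so only the complete cases come back. A hedged sketch (object and column names taken from the question; the data themselves aren't available here):
# Number of complete rows in the prediction set - this is most likely
# what the 9976 returned predictions correspond to
sum(complete.cases(forPredict))
# Keep row alignment by passing NAs through: rows with missing
# predictors then yield NA predictions instead of being dropped
z <- predict(modelgbm, newdata = forPredict, na.action = na.pass)
length(z)   # 250000, with NAs where predictors were missing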

MCMCglmm binomial model prior

I want to estimate a binomial model with the R package MCMCglmm. The model shall incorporate an intercept and a slope - both as fixed and random parts. How do I have to specify an accepted prior? (Note, here is a similar question, but in a much more complicated setting.)
Assume the data have the following form:
y x cluster
1 0 -0.56047565 1
2 1 -0.23017749 1
3 0 1.55870831 1
4 1 0.07050839 1
5 0 0.12928774 1
6 1 1.71506499 1
In fact, the data have been generated by
set.seed(123)
nj <- 15 # number of individuals per cluster
J <- 30 # number of clusters
n <- nj * J
x <- rnorm(n)
y <- rbinom(n, 1, prob = 0.6)
cluster <- factor(rep(1:J, each = nj))  # J clusters with nj individuals each
dat <- data.frame(y = y, x = x, cluster = cluster)
The information about the model in the question suggests specifying fixed = y ~ 1 + x and random = ~ us(1 + x):cluster. With us() you allow the random effects to be correlated (cf. section 3.4 and table 2 in Hadfield's 2010 Journal of Statistical Software article).
First of all, as you only have one dependent variable (y), the G part of the prior (cf. equation 4 and section 3.6 in Hadfield's 2010 Journal of Statistical Software article) for the random-effects variance(s) only needs to have one list element, called G1. This list element isn't the actual prior distribution - that was specified by Hadfield to be an inverse-Wishart distribution. With G1 you specify the parameters of this inverse-Wishart distribution, namely the scale matrix (Ψ in Wikipedia notation, V in MCMCglmm notation) and the degrees of freedom (ν in Wikipedia notation, nu in MCMCglmm notation). As you have two random effects (the intercept and the slope), V has to be a 2 x 2 matrix. A frequent choice is the two-dimensional identity matrix diag(2). Hadfield often uses nu = 0.002 for the degrees of freedom (cf. his course notes).
Now you also have to specify the R part of the prior, for the residual variance. Here again an inverse-Wishart distribution was specified by Hadfield, leaving the user to specify its parameters. As we only have one residual variance, V has to be a scalar (let's say V = 0.5). An optional element of R is fix. With this element you specify whether the residual variance shall be fixed to a certain value (then you write fix = TRUE or fix = 1) or not (then fix = FALSE or fix = 0). Note that you do not fix the residual variance at 0.5 by writing fix = 0.5! So when you find fix = 1 in Hadfield's course notes, read it as fix = TRUE and look at which value of V the residual variance was fixed.
All together, we set up the prior as follows:
prior0 <- list(G = list(G1 = list(V = diag(2), nu = 0.002)),
R = list(V = 0.5, nu = 0.002, fix = FALSE))
With this prior we can run MCMCglmm:
library("MCMCglmm") # for MCMCglmm()
set.seed(123)
mod0 <- MCMCglmm(fixed = y ~ 1 + x,
random = ~ us(1 + x):cluster,
data = dat,
family = "categorical",
prior = prior0)
The draws from the Gibbs-sampler for the fixed effects are found in mod0$Sol, the draws for the variance parameters in mod0$VCV.
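Both components are coda mcmc objects, so the usual coda tools can be used to inspect them, e.g.:
summary(mod0$Sol)   # posterior summaries of the intercept and slope
summary(mod0$VCV)   # posterior summaries of the (co)variance components
plot(mod0$Sol)      # traceplots and densities for the fixed effects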
Normally a binomial model requires the residual variance to be fixed, so this time we fix the residual variance at 0.5:
set.seed(123)
prior1 <- list(G = list(G1 = list(V = diag(2), nu = 0.002)),
R = list(V = 0.5, nu = 0.002, fix = TRUE))
mod1 <- MCMCglmm(fixed = y ~ 1 + x,
random = ~ us(1 + x):cluster,
data = dat,
family = "categorical",
prior = prior1)
The difference can be seen by comparing mod0$VCV[, 5] to mod1$VCV[, 5]. In the latter case, all entries are 0.5, as specified.
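A quick way to see this (the residual column of VCV is named "units"):
summary(mod0$VCV[, "units"])   # varies from draw to draw
unique(mod1$VCV[, "units"])    # a single value, 0.5, because it was fixed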

Problems with apply R

I have a problem with using the apply function in R. I made the following function:
TrainSupportVectorMachines <- function(trainingData, kernel, G, C){
  #### train the model
  fit <- svm(Device~., data=trainingData, kernel=kernel, probability=TRUE,
             gamma=G, costs=C)
  return(fit)
}
I want to train the model with different values of cost (C). Therefore, I tried the following command:
cst = matrix(2^(-4:-2),ncol=3)
kernl = "sigmoid"
fitSVMBP <- apply(cst,2,function(x)TrainSupportVectorMachines(dtr1,kernl,0.625,x))
My expectation is that fitSVMBP becomes a list of SVM models fitted with different values for cost. But I get a list of SVM models that all have a cost of 1.
Does anybody know what I do wrong?
EDIT:
I use the e1071 package.
And the dataset looks like:
> head(dtr1)
Device Geslacht Leeftijd Invultijd Type Maanden.geleden
1 pc M 45 16.0 A 15
2 pc V 43 27.5 A 3
3 pc V 28 16.0 A 15
4 pc V 17 10.0 A 13
5 pc M 56 16.0 A 15
6 pc M 50 27.5 A 3
You have called the argument costs and not cost. Here's an example using the sample data in ?svm so you can try this:
model <- svm(Species ~ ., data = iris, cost=.6)
model$cost
# [1] 0.6
model <- svm(Species ~ ., data = iris, costs=.6)
model$cost
# [1] 1
R will do partial matching (so in this case cos=.6 would work), but if you add extra characters to an argument name it no longer matches.
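For instance, with the same iris example, the abbreviation is matched while the over-long spelling above was not:
model <- svm(Species ~ ., data = iris, cos = .6)
model$cost
# [1] 0.6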
Nor will it always complain if you give it an argument it doesn't expect:
> model <- svm(Species ~ ., data = iris, costs=.6, asjkdakjshd=1)
>
Because unmatched args get caught in the ... argument.
If you take this too far, you get:
> model <- svm(Species ~ ., data = iris, c=.122)
Error in svm.default(x, y, scale = scale, ..., na.action = na.action) :
argument 4 matches multiple formal arguments
because c matches cost, coef0, class.weights and cachesize.
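So the fix is simply to spell the argument cost inside the helper; a minimal corrected sketch of the code from the question:
TrainSupportVectorMachines <- function(trainingData, kernel, G, C){
  # 'cost' (not 'costs') is the argument that svm() actually matches
  svm(Device~., data=trainingData, kernel=kernel, probability=TRUE,
      gamma=G, cost=C)
}
fitSVMBP <- apply(cst, 2, function(x) TrainSupportVectorMachines(dtr1, kernl, 0.625, x))
sapply(fitSVMBP, function(m) m$cost)   # 0.0625 0.1250 0.2500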

MCMCglmm multinomial model in R

I'm trying to create a model using the MCMCglmm package in R.
The data are structured as follows, where dyad, focal, other are all random effects, predict1-2 are predictor variables, and response 1-5 are outcome variables that capture # of observed behaviors of different subtypes:
dyad focal other r present village resp1 resp2 resp3 resp4 resp5
1 10101 14302 0.5 3 1 0 0 4 0 5
2 10405 11301 0.0 5 0 0 0 1 0 1
…
So a model with only one outcome (teaching) is as follows:
prior_overdisp_i <- list(R=list(V=diag(2),nu=0.08,fix=2),
G=list(G1=list(V=1,nu=0.08), G2=list(V=1,nu=0.08), G3=list(V=1,nu=0.08), G4=list(V=1,nu=0.08)))
m1 <- MCMCglmm(teaching ~ trait-1 + at.level(trait,1):r + at.level(trait,1):present,
random= ~idh(at.level(trait,1)):focal + idh(at.level(trait,1)):other +
idh(at.level(trait,1)):X + idh(at.level(trait,1)):village,
rcov=~idh(trait):units, family = "zipoisson", prior=prior_overdisp_i,
data = data, nitt = nitt.1, thin = 50, burnin = 15000, pr = TRUE, pl = TRUE, verbose = TRUE, DIC = TRUE)
Hadfield's course notes (Ch 5) give an example of a multinomial model that uses only a single outcome variable with 3 levels (sheep horns of 3 types). Similar treatment can be found here: http://hlplab.wordpress.com/2009/05/07/multinomial-random-effects-models-in-r/ This is not quite right for what I'm doing, but contains helpful background info.
Another reference (Hadfield 2010) gives an example of a multi-response MCMCglmm that follows the same format but uses cbind() to predict a vector of responses, rather than a single outcome. The same model with multiple responses would look like this:
m1 <- MCMCglmm(cbind(resp1, resp2, resp3, resp4, resp5) ~ trait-1 +
at.level(trait,1):r + at.level(trait,1):present,
random= ~idh(at.level(trait,1)):focal + idh(at.level(trait,1)):other +
idh(at.level(trait,1)):X + idh(at.level(trait,1)):village,
rcov=~idh(trait):units,
family = cbind("zipoisson","zipoisson","zipoisson","zipoisson","zipoisson"),
prior=prior_overdisp_i,
data = data, nitt = nitt.1, thin = 50, burnin = 15000, pr = TRUE, pl = TRUE, verbose = TRUE, DIC = TRUE)
I have two programming questions here:
How do I specify a prior for this model? I've looked at the materials mentioned in this post but just can't figure it out.
I've run a similar version with only two response variables, but I only get one slope, where I thought I should get a different slope for each response variable. Where am I going wrong, or have I misunderstood the model?
Answer to my first question, based on the HLP post and some help from a colleague/stats consultant:
# values for prior
k <- 5 # originally length(levels(dative$SemanticClass)), i.e. the number of categorical outcomes
I <- diag(k-1) # (k-1) x (k-1) identity matrix
J <- matrix(rep(1, (k-1)^2), c(k-1, k-1)) # (k-1) x (k-1) matrix of 1's
And for my model, using the multinomial5 family and 5 outcome variables, the prior is:
prior = list(
  R = list(fix=1, V=0.5 * (I + J), n = 4),
  G = list(
    G1 = list(V = diag(4), n = 4)))
For my second question, I need to add an interaction term to the fixed effects in this model:
m <- MCMCglmm(cbind(Resp1, Resp2...) ~ -1 + trait*predictorvariable,
...
The result gives both main effects for the Response variables and posterior estimates for the Response/Predictor interaction (the effect of the predictor variable on each response variable).
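To see that you now get a separate slope per response, inspect the posterior summaries of the fixed effects; a short sketch reusing the fitted object m (the exact column names depend on your trait and predictor names):
# one row per fixed effect, including a trait:predictor interaction
# term for each response variable
summary(m)$solutions
# or look at the names of the posterior draws directly
colnames(m$Sol)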

"Factor has new levels" error for variable I'm not using

Consider a simple dataset, split into a training and testing set:
dat <- data.frame(x=1:5, y=c("a", "b", "c", "d", "e"), z=c(0, 0, 1, 0, 1))
train <- dat[1:4,]
train
# x y z
# 1 1 a 0
# 2 2 b 0
# 3 3 c 1
# 4 4 d 0
test <- dat[5,]
test
# x y z
# 5 5 e 1
When I train a logistic regression model to predict z using x and obtain test-set predictions, all is well:
mod <- glm(z~x, data=train, family="binomial")
predict(mod, newdata=test, type="response")
# 5
# 0.5546394
However, this fails on an equivalent-looking logistic regression model with a "Factor has new levels" error:
mod2 <- glm(z~.-y, data=train, family="binomial")
predict(mod2, newdata=test, type="response")
# Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) :
# factor y has new level e
Since I removed y from my model equation, I'm surprised to see this error message. In my application, dat is very wide, so z~.-y is the most convenient model specification. The simplest workaround I can think of is removing the y variable from my data frame and then training the model with the z~. syntax, but I was hoping for a way to use the original dataset without the need to remove columns.
You could try updating mod2$xlevels[["y"]] in the model object
mod2 <- glm(z~.-y, data=train, family="binomial")
mod2$xlevels[["y"]] <- union(mod2$xlevels[["y"]], levels(test$y))
predict(mod2, newdata=test, type="response")
# 5
#0.5546394
Another option would be to exclude (but not remove) "y" from the training data
mod2 <- glm(z~., data=train[,!colnames(train) %in% c("y")], family="binomial")
predict(mod2, newdata=test, type="response")
# 5
#0.5546394
We may generalize @matt_k's great solution to apply it to high-dimensional data where there are multiple factors with different levels in the training and test sets, like these:
dat2
# x y1 y2 z
# 1 1 a A 0
# 2 2 b B 0
# 3 3 c C 1
# 4 4 d D 0
# 5 5 e E 1
When we divide into test and training as before,
train <- dat2[1:4, ]
test <- dat2[5, ]
both y1 and y2 test levels will differ from those of train and we get the error.
mod <- glm(z ~ ., data=train, family="binomial")
predict(mod, newdata=test, type="response")
# Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) :
# factor y1 has new level e
With high-dimensional data it's rather tedious to correct every single failing factor, so we might want to loop over them.
The problematic columns are either of class "factor" or of class "character" (as in our case). Since these are the ones whose levels need to be included in xlevels, we use a small helper that identifies them,
is.prone <- function(x) is.factor(x) | is.character(x)
and use it to update the xlevels with Map.
id <- sapply(dat2, is.prone)
mod$xlevels <- Map(union, mod$xlevels, lapply(dat2[id], unique))
Then it should work.
predict(mod, newdata=test, type="response")
# 5
# 5.826215e-11
# Warning message:
# In predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
# prediction from a rank-deficient fit may be misleading
dat2 <- structure(list(x = 1:5, y1 = c("a", "b", "c", "d", "e"), y2 = c("a",
"b", "c", "d", "e"), z = c(0, 0, 1, 0, 1)), class = "data.frame", row.names = c(NA,
-5L))
I was confused about this issue for a long time, but there was a simple solution. One of my variables, "traffic type", had 20 levels, and for one level (17) there was only a single row, so that row could end up in either the training data or the test data. In my case it landed in the test data, hence the error: factor "traffic type" has a new level 17, because there is no row with level 17 in the training data. I deleted this row from the dataset and the model runs perfectly fine.
