Like the author of this post, I'm struggling with the notation of MCMCglmm, especially with what is meant by trait. My code is the following:
library("MCMCglmm")
set.seed(123)
y <- sample(letters[1:3], size = 100, replace = TRUE)
x <- rnorm(100)
id <- rep(1:10, each = 10)
dat <- data.frame(y, x, id)
mod <- MCMCglmm(fixed = y ~ x, random = ~ us(x):id,
                data = dat,
                family = "categorical")
This gives me the error message "For error structures involving catgeorical data with more than 2 categories pleasue use trait:units or variance.function(trait):units" (sic). If I generate dichotomous data with letters[1:2] instead, everything works fine. So what does this error message mean in general, and "trait" in particular?
Edit 2016-09-29:
From the linked question I copied rcov = ~ us(trait):units into my call of MCMCglmm, and from https://stat.ethz.ch/pipermail/r-sig-mixed-models/2010q3/004006.html I took (and slightly modified) the prior
prior <- list(R = list(V = diag(2), fix = 1), G = list(G1 = list(V = diag(2), nu = 1, alpha.mu = c(0, 0), alpha.V = diag(2) * 100)))
Now my model actually gives results:
MCMCglmm(fixed = y ~ 1 + x, random = ~ us(1 + x):id,
         rcov = ~ us(trait):units, prior = prior, data = dat,
         family = "categorical")
But I still lack an understanding of what is meant by trait (and by units, of the notation of the prior, of how us() compares to idh(), and so on).
Edit 2016-11-17:
I think trait is synonymous with "target variable" or "response" in general, i.e. y in this case. In the formula for random there is nothing on the left side of ~ "because the response is known from the fixed effect specification." So the rationale behind requiring trait:units in rcov could be that the fixed formula already defines what trait is (y in this case).
units is the response variable value, and trait is the response variable name, which corresponds to the categories. By specifying rcov = ~us(trait):units, you are allowing the residual variance to be heterogeneous across "traits" (response categories) so that all elements of the residual variance-covariance matrix will be estimated.
In Section 5.1 of Hadfield's MCMCglmm Course Notes (vignette("CourseNotes", "MCMCglmm")) you can read an explanation for the reserved variables trait and units.
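To build intuition about us() versus idh(), here is a sketch of my own (reusing dat and prior from above; an illustration, not a canonical recipe): applied to the random term, us(1 + x):id estimates a full 2x2 (co)variance matrix for random intercepts and slopes, while idh(1 + x):id estimates only the two variances and fixes their covariance at zero.
## Sketch: contrast us() and idh() on the random term (same data and prior as above)
m_us  <- MCMCglmm(y ~ 1 + x, random = ~ us(1 + x):id,
                  rcov = ~ us(trait):units, prior = prior,
                  data = dat, family = "categorical")
m_idh <- MCMCglmm(y ~ 1 + x, random = ~ idh(1 + x):id,
                  rcov = ~ us(trait):units, prior = prior,
                  data = dat, family = "categorical")
colnames(m_us$VCV)   # intercept/slope variances AND their covariance
colnames(m_idh$VCV)  # intercept/slope variances only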
I am trying to use kfold CV as a means of evaluating a model run using brms and I feel like I'm missing something. As a reproducible example, my data are structured as a binary response (0, 1) dependent on the length of an individual. Here is some code to generate and plot data similar to those I am working with:
library(brms)
library(tidyverse)
library(loo)
length <- seq(0, 100, by = 1)
n_fish_per_length <- 10
a0 <- -48
a1 <- 2
a2 <- -0.02
prob <- plogis(a0 + a1 * length + a2 * length^2)
plot(length, prob, type = 'l')
sim_data <-
  expand_grid(fish_id = seq_len(n_fish_per_length),
              length = length) %>%
  mutate(prob_use = plogis(a0 + a1 * length + a2 * length^2)) %>%
  mutate(is_carp = rbinom(n = n(), size = 1, prob = prob_use))
ggplot(sim_data, aes(x = length, y = is_carp)) +
  geom_jitter(width = 0, height = 0.05) +
  geom_smooth(method = "glm", formula = y ~ x + I(x^2),
              method.args = list(family = binomial(link = "logit")))
I then use brms to run my model.
Bayes_Model_Binary <- brm(formula = is_carp ~ length + I(length^2),
                          data = sim_data,
                          family = bernoulli(link = "logit"),
                          warmup = 2500,
                          iter = 5000,
                          chains = 4,
                          inits = "0",
                          cores = 4,
                          seed = 123)
summary(Bayes_Model_Binary)
I'd like to use kfold CV to evaluate the model. I can use something like this:
kfold(Bayes_Model_Binary, K = 10, chains = 1, save_fits = T)
but the response in my data is highly imbalanced (~18% = 1, ~82% = 0) and my reading suggests that I need to use stratified kfold CV to account for this. If I use:
sim_data$fold <- kfold_split_stratified(K = 10, x = sim_data$is_carp)
the data are split the way I would expect but I'm not sure what the best way is to move forward with the CV process from here. I saw this post https://mc-stan.org/loo/articles/loo2-elpd.html, but I'm not sure how to modify this to work with a brmsfit object. Alternatively, it appears that I should be able to use:
kfold(Bayes_Model_Binary, K = 10, folds = 'stratified', group = sim_data$is_carp)
but this throws an error, likely because is_carp is the response rather than a predictor in the model. What would my group be in this context? Am I missing or misinterpreting something here? I'm assuming there is a very simple solution that I'm overlooking, but I'd appreciate any thoughts.
After some additional digging and learning how to access information about each fold in the analysis, I was able to determine that the structure of the data (proportion of 0s and 1s in the response) is maintained using the default settings in the kfold() function. To do this I used the following code.
First, save the kfold CV analysis as an object.
kfold1 <- kfold(Bayes_Model_Binary, K = 10, save_fits = T)
kfold1$fits holds, for each fold, the fitted model and the indices of the observations held out (omitted) as the test data set.
From this information, I created a loop to print the proportion of observations in each training data set where is_carp = 1 (could also do this for each test data set) with the following code.
for (i in 1:10) {
  print(length(which(sim_data$is_carp[-kfold1$fits[i, ]$omitted] == 1)) /
          nrow(sim_data[-kfold1$fits[i, ]$omitted, ]))
}
[1] 0.1859186
[1] 0.1925193
[1] 0.1991199
[1] 0.1914191
[1] 0.1881188
[1] 0.1848185
[1] 0.1936194
[1] 0.1980198
[1] 0.190319
[1] 0.1870187
and it's easy to then compare these proportions with the proportion of observations where is_carp = 1 from the original data set.
length(which(sim_data$is_carp == 1)) / nrow(sim_data)
[1] 0.1910891
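If you still want explicitly stratified folds, one route (a sketch based on my reading of ?kfold in brms, whose folds argument also appears to accept a vector of fold indices; verify against your brms version) is to pass the kfold_split_stratified() output directly:
folds_strat <- kfold_split_stratified(K = 10, x = sim_data$is_carp)
kfold_strat <- kfold(Bayes_Model_Binary, folds = folds_strat,
                     chains = 1, save_fits = TRUE)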
Hi, I'm working on a decision tree:
tree1 <- tree(League.binary ~ TME.factor + APM.factor + Wmd.factor, data = starcraft)
The tree shows a partitioning based solely on APM.factor, and the leaves aren't pure.
I tried creating a tree with a subset of 300 of the 3,395 observations, and it used more than one variable. What went wrong in the first case? Did it simply not need the other two variables, so it used only one?
Try playing with the tree.control() parameters, for example setting minsize = 2 so that you can end up with a single observation in each leaf (overfitting), e.g.:
model <- tree(y ~ X1 + X2, data = data, control = tree.control(nobs = n, minsize = 2, mindev = 0))
Also try the same thing with the rpart package, which is the newer version of tree, and see what results you get. You can also inspect the importance of the variables. Here is a syntax example:
install.packages("rpart")
install.packages("rpart.plot")
library(rpart)
library(rpart.plot)
## fit tree
### alt1: class
model = rpart(y ~ X1 + X2, data=data, method = "class")
### alt2: reg
model = rpart(y ~ X1 + X2, data=data, control = rpart.control(maxdepth = 30, minsplit = 1, minbucket = 1, cp=0))
## show model
print(model)
rpart.plot(model, cex=0.5)
## importance
model$variable.importance
Note that since trees make binary splits, it is possible for a single variable to explain most or all of the SSR (for regression). Try plotting the response against each regressor, and see if there's any significant relation to anything but the variable you're getting.
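For example, a quick look at y against each regressor (a sketch using the data simulated below):
par(mfrow = c(1, 2))
plot(data$X1, data$y, pch = 16, cex = 0.3, main = "y vs X1")
plot(data$X2, data$y, pch = 16, cex = 0.3, main = "y vs X2")
par(mfrow = c(1, 1))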
In case you want to run the examples above, here is a data simulation (put it at the beginning of the code):
n = 12000
X1 = runif(n, -100, 100)
X2 = runif(n, -100, 100)
## 1. SQUARE DATA
# y = ifelse( (X1< -50) | (X1>50) | (X2< -50) | (X2>50), 1, 0)
## 2. CIRCLE DATA
y = ifelse(sqrt(X1^2+X2^2)<=50, 0, 1)
## 3. LINEAR BOUNDARY DATA
# y = ifelse(X2<=-X1, 0, 1)
# Create
color = ifelse(y==0,"red","green")
data = data.frame(y,X1,X2,color)
# Plot
data$color <- as.character(data$color)  # ensure plain character (may be a factor in older R)
plot(data$X2 ~ data$X1, col = data$color, type='p', pch=15)
I would like to fit an exponential decay function in R to the following data:
data <- structure(list(x = 0:38, y = c(0.991744340878828, 0.512512332368168,
0.41102449265681, 0.356621905557202, 0.320851602373477, 0.29499198506227,
0.275037747162642, 0.25938850981822, 0.245263623938863, 0.233655093612007,
0.224041426946405, 0.214152907133301, 0.207475138903635, 0.203270738895484,
0.194942528735632, 0.188107106969046, 0.180926819430008, 0.177028560207711,
0.172595416846822, 0.166729221891201, 0.163502461048814, 0.159286528409165,
0.156110097827889, 0.152655498715612, 0.148684858095915, 0.14733605355542,
0.144691873223729, 0.143118852619617, 0.139542186417186, 0.137730138713745,
0.134353615271572, 0.132197800438632, 0.128369567159113, 0.124971834736476,
0.120027536018095, 0.117678812415655, 0.115720611113327, 0.112491329844252,
0.109219168085624)), class = "data.frame", row.names = c(NA,
-39L), .Names = c("x", "y"))
I've tried fitting with nls but the generated curve is not close to the actual data.
It would be very helpful if anyone could explain how to work with such nonlinear data and find a function of best fit.
Try y ~ .lin / (b + x^c). Note that when using "plinear" one omits the .lin linear parameter when specifying the formula to nls and also omits a starting value for it.
Also note that the .lin and b parameters are approximately 1 at the optimum so we could also try the one parameter model y ~ 1 / (1 + x^c). This is the form of a one-parameter log-logistic survival curve. The AIC for this one parameter model is worse than for the 3 parameter model (compare AIC(fm1) and AIC(fm3)) but the one parameter model might still be preferable due to its parsimony and the fact that the fit is visually indistinguishable from the 3 parameter model.
opar <- par(mfcol = 2:1, mar = c(3, 3, 3, 1), family = "mono")
# data = data.frame with x & y col names; fm = model fit; main = string shown above plot
Plot <- function(data, fm, main) {
  plot(y ~ x, data, pch = 20)
  lines(fitted(fm) ~ x, data, col = "red")
  legend("topright", bty = "n", cex = 0.7, legend = capture.output(fm))
  title(main = paste(main, "- AIC:", round(AIC(fm), 2)))
}
# 3 parameter model
fo3 <- y ~ 1/(b + x^c) # omit .lin parameter; plinear will add it automatically
fm3 <- nls(fo3, data = data, start = list(b = 1, c = 1), alg = "plinear")
Plot(data, fm3, "3 parameters")
# one parameter model
fo1 <- y ~ 1 / (1 + x^c)
fm1 <- nls(fo1, data, start = list(c = 1))
Plot(data, fm1, "1 parameter")
par(opar)
AIC
Adding the solutions in the other answers, we can compare the AIC values. We have labelled each solution by the number of parameters it uses (the degrees of freedom would be one greater than that). We have also reworked the log-log solution to use nls instead of lm, with y on the left-hand side, since one cannot compare AIC values of models that have different left-hand sides or that use different optimization routines (the log-likelihood constants used could differ).
fo2 <- y ~ exp(a + b * log(x+1))
fm2 <- nls(fo2, data, start = list(a = 1, b = 1))
fo4 <- y ~ SSbiexp(x, A1, lrc1, A2, lrc2)
fm4 <- nls(fo4, data)
aic <- AIC(fm1, fm2, fm3, fm4)
aic[order(aic$AIC), ]
giving from best AIC (i.e. fm3) to worst AIC (i.e. fm2):
df AIC
fm3 4 -329.35
fm1 2 -307.69
fm4 5 -215.96
fm2 3 -167.33
A biexponential model would fit much better, though still not perfect. This would indicate that you might have two simultaneous decay processes.
fit <- nls(y ~ SSbiexp(x, A1, lrc1, A2, lrc2), data = data)
#A1*exp(-exp(lrc1)*x)+A2*exp(-exp(lrc2)*x)
plot(y ~x, data = data)
curve(predict(fit, newdata = data.frame(x)), add = TRUE)
If the measurement error depends on magnitude, you could consider using it for weighting.
However, you should consider carefully what kind of model you'd expect from your domain knowledge. Just selecting a non-linear model empirically is usually not a good idea. A non-parametric fit might be a better option.
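For instance (a sketch: the 1/y^2 weights assume the error standard deviation scales with y, and loess is just one nonparametric option):
fit_w  <- nls(y ~ SSbiexp(x, A1, lrc1, A2, lrc2), data = data, weights = 1/y^2)
fit_np <- loess(y ~ x, data = data)
plot(y ~ x, data = data, pch = 20)
lines(data$x, fitted(fit_w),  col = "red")   # weighted biexponential
lines(data$x, fitted(fit_np), col = "blue")  # loess smoother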
# Do this because the log of 0 is not possible to calculate
data$x <- data$x + 1
fit <- lm(log(y) ~ log(x), data = data)
plot(data$x, data$y)
# back-transform: log(y) = a + b*log(x)  =>  y = exp(a) * x^b
lines(data$x, exp(fit$coefficients[1]) * data$x^fit$coefficients[2], col = "red")
This did a lot better than using the nls formula, and when plotted the fit seems to do fairly well.
I am trying to use XGBoost to model claims frequency of data generated from unequal length exposure periods, but have been unable to get the model to treat the exposure correctly. I would normally do this by setting log(exposure) as an offset - are you able to do this in XGBoost?
(A similar question was posted here: xgboost, offset exposure?)
To illustrate the issue, the R code below generates some data with the fields:
x1, x2 - factors (either 0 or 1)
exposure - length of policy period on observed data
frequency - mean number of claims per unit exposure
claims - number of observed claims ~Poisson(frequency*exposure)
The goal is to predict frequency using x1 and x2 - the true model is: frequency = 2 if x1 = x2 = 1, frequency = 1 otherwise.
Exposure can't be used to predict the frequency as it is not known at the outset of a policy. The only way we can use it is to say: expected number of claims = frequency * exposure.
The code tries to predict this using XGBoost by:
Setting exposure as a weight in the model matrix
Setting log(exposure) as an offset
Below these, I've shown how I would handle the situation for a tree (rpart) or gbm.
set.seed(1)
size <- 10000
d <- data.frame(
  x1 = sample(c(0, 1), size, replace = TRUE, prob = c(0.5, 0.5)),
  x2 = sample(c(0, 1), size, replace = TRUE, prob = c(0.5, 0.5)),
  exposure = runif(size, 1, 10) * 0.3
)
d$frequency <- 2^(d$x1==1 & d$x2==1)
d$claims <- rpois(size, lambda = d$frequency * d$exposure)
#### Try to fit using XGBoost
require(xgboost)
param0 <- list(
"objective" = "count:poisson"
, "eval_metric" = "logloss"
, "eta" = 1
, "subsample" = 1
, "colsample_bytree" = 1
, "min_child_weight" = 1
, "max_depth" = 2
)
## 1 - set weight in xgb.Matrix
xgtrain = xgb.DMatrix(as.matrix(d[,c("x1","x2")]), label = d$claims, weight = d$exposure)
xgb = xgb.train(
nrounds = 1
, params = param0
, data = xgtrain
)
d$XGB_P_1 <- predict(xgb, xgtrain)
## 2 - set as offset in xgb.Matrix
xgtrain.mf <- model.frame(as.formula("claims~x1+x2+offset(log(exposure))"),d)
xgtrain.m <- model.matrix(attr(xgtrain.mf,"terms"),data = d)
xgtrain <- xgb.DMatrix(xgtrain.m,label = d$claims)
xgb = xgb.train(
nrounds = 1
, params = param0
, data = xgtrain
)
d$XGB_P_2 <- predict(xgb, xgtrain)
#### Fit a tree
require(rpart)
d[,"tree_response"] <- cbind(d$exposure,d$claims)
tree <- rpart(tree_response ~ x1 + x2,
data = d,
method = "poisson")
d$Tree_F <- predict(tree, newdata = d)
#### Fit a GBM
require(gbm)
gbm <- gbm(claims ~ x1 + x2 + offset(log(exposure)),
           data = d,
           distribution = "poisson",
           n.trees = 1,
           shrinkage = 1,
           interaction.depth = 2,
           bag.fraction = 0.5)
d$GBM_F <- predict(gbm, newdata = d, n.trees = 1, type="response")
At least with the glm function in R, modeling count ~ x1 + x2 + offset(log(exposure)) with family=poisson(link='log') is equivalent to modeling I(count/exposure) ~ x1 + x2 with family=poisson(link='log') and weights = exposure. That is, normalize your count by exposure to get frequency, and model the frequency with exposure as the weight. The estimated coefficients should be the same in both cases when using glm for Poisson regression. Try it for yourself using a sample data set.
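For example, with the simulated data d from the question (a sketch; glm warns about non-integer responses in the weighted version, but the coefficients agree):
glm_offset <- glm(claims ~ x1 + x2 + offset(log(exposure)),
                  family = poisson(link = "log"), data = d)
glm_weight <- glm(I(claims / exposure) ~ x1 + x2,
                  family = poisson(link = "log"), weights = exposure, data = d)
cbind(offset = coef(glm_offset), weighted = coef(glm_weight))  # should match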
I'm not exactly sure what objective='count:poisson' corresponds to, but I would expect setting your target variable as frequency (count/exposure) and using exposure as the weight in xgboost would be the way to go when exposures are varying.
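A sketch of that suggestion, reusing d and param0 from the question (a starting point under those assumptions, not an established recipe):
# label = frequency (claims / exposure), weight = exposure
xgtrain_freq <- xgb.DMatrix(as.matrix(d[, c("x1", "x2")]),
                            label = d$claims / d$exposure,
                            weight = d$exposure)
xgb_freq <- xgb.train(nrounds = 1, params = param0, data = xgtrain_freq)
d$XGB_P_freq <- predict(xgb_freq, xgtrain_freq)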
I have now worked out how to do this, using setinfo to change the base_margin attribute to be the offset (as a linear predictor), i.e.:
setinfo(xgtrain, "base_margin", log(d$exposure))
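Putting it together (a sketch; as I understand it, predictions then include the base_margin, so they are expected counts and need dividing by exposure to recover frequency):
xgtrain_off <- xgb.DMatrix(as.matrix(d[, c("x1", "x2")]), label = d$claims)
setinfo(xgtrain_off, "base_margin", log(d$exposure))
xgb_off <- xgb.train(nrounds = 1, params = param0, data = xgtrain_off)
d$XGB_F <- predict(xgb_off, xgtrain_off) / d$exposure  # back to frequency scale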
I'm having trouble adding a constraint to my nonlinear model. Suppose I have the following data that is roughly an integrated Gaussian:
x = 1:100
y = pnorm(x, mean = 50, sd = 15) + rnorm(length(x), mean = 0, sd = 0.03)
model <- nls(y ~ pnorm(x, mean = a, sd = b), start = list(a = 50, b = 15))
I can fit the data with nls, but I would like to add the constraint that my fit must fit the data exactly (i.e. have no residual) at y = 0.25 (or whatever point is closest to 0.25). I assume that I need to use glmc for this, but I can't figure out how to use it.
I know it's not necessarily kosher to make the fit adhere to the data like that, but I'm trying to replicate another person's work and this is what they did.
You could impose the restriction somewhat manually. That is, for any parameter b we can solve for the unique a (the cdf of the normal distribution is strictly monotone in its mean) such that the restriction holds:
getA <- function(b, x, y)
  optim(x, function(a) (pnorm(x, mean = a, sd = b) - y)^2, method = "BFGS")$par
Then, after finding (tx,ty), the observation of interest, with
idx <- which.min(abs(y - 0.25))
tx <- x[idx]
ty <- y[idx]
we can fit the model with a single parameter:
model <- nls(y ~ pnorm(x, mean = getA(b, tx, ty), sd = b), start = list(b = 15))
and check that the restriction is satisfied:
resid(model)[idx]
# [1] -2.440452e-07
and the coefficient a is
getA(coef(model), tx, ty)
# [1] 51.00536
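As a quick visual check (a sketch) that the constrained fit passes through the anchored observation:
plot(x, y, pch = 20)
lines(x, fitted(model), col = "red")
points(tx, ty, col = "blue", pch = 19)  # the observation the fit is pinned to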