nls singular gradient matrix - fit parameters in integral's upper limits - r

I am trying to run an nls fit for a somewhat complicated expression that includes two integrals, with two of the fit parameters appearing in the integrals' upper limits.
I get the error
"Error in nlsModel(formula, mf, start, wts) : singular gradient
matrix at initial parameter estimates".
I have already searched the previous answers, but they didn't help. The parameter initialization seems to be OK; I have tried changing the starting values, but nothing works. If my function has just one integral, everything works very nicely, but adding a second integral term produces the error. I don't believe the function is over-parametrized, as I have performed other fits with many more parameters and they worked. Some data to reproduce the problem are included below.
The minimal example is the following:
integrand <- function(X) {
  return(X^4 / (2 * sinh(X/2))^2)
}

fitting <- function(T1, T2, N, D, x) {
  int1 <- integrate(integrand, lower = 0, upper = T1)$value
  int2 <- integrate(integrand, lower = 0, upper = T2)$value
  return(N * (D/x)^2 * (exp(D/x) / (1 + exp(D/x))^2) +
         448.956 * (x/T1)^3 * int1 +
         299.304 * (x/T2)^3 * int2)
}

fit <- nls(y ~ fitting(T1, T2, N, D, x), data = dat,
           start = list(T1 = 400, T2 = 200, N = 0.01, D = 2))
For reference, the fit that worked is the following:
integrand <- function(X) {
  return(X^4 / (2 * sinh(X/2))^2)
}

fitting <- function(T1, N, D, x) {
  int <- integrate(integrand, lower = 0, upper = T1)$value
  return(N * (D/x)^2 * (exp(D/x) / (1 + exp(D/x))^2) +
         748.26 * (x/T1)^3 * int)
}

fit <- nls(y ~ fitting(T1, N, D, x), data = dat,
           start = list(T1 = 400, N = 0.01, D = 2))
Data to illustrate the problem:
dat<- read.table(text="x y
0.38813 0.0198
0.79465 0.02206
1.40744 0.01676
1.81532 0.01538
2.23105 0.01513
2.64864 0.01547
3.05933 0.01706
3.47302 0.01852
3.88791 0.02074
4.26301 0.0256
4.67607 0.03028
5.08172 0.03507
5.48327 0.04283
5.88947 0.05017
6.2988 0.05953
6.7022 0.07185
7.10933 0.08598
7.51924 0.0998
7.92674 0.12022
8.3354 0.1423
8.7384 0.16382
9.14656 0.19114
9.55062 0.22218
9.95591 0.25542", header=TRUE)
I cannot figure out what is happening. I ultimately need to perform this fit with three integral components, but even with two I get this problem. I appreciate your help very much. Thank you.

You could try some other optimizers:
fitting1 <- function(par, x, y) {
  sum((fitting(par[1], par[2], par[3], par[4], x) - y)^2)
}

library(optimx)
res <- optimx(c(400, 200, 0.01, 2),
              fitting1,
              x = dat$x, y = dat$y,  # dat is the data frame read in above
              control = list(all.methods = TRUE))
print(res)
# p1 p2 p3 p4 value fevals gevals niter convcode kkt1 kkt2 xtimes
#BFGS 409.7992 288.6416 -0.7594461 39.00871 1.947484e-03 101 100 NA 1 NA NA 0.22
#CG 401.1281 210.9087 -0.9026459 20.80900 3.892929e-01 215 101 NA 1 NA NA 0.25
#Nelder-Mead 414.6402 446.5080 -1.1298606 -227.81280 2.064842e-03 89 NA NA 0 NA NA 0.02
#L-BFGS-B 412.4477 333.1338 -0.3650530 37.74779 1.581643e-03 34 34 NA 0 NA NA 0.06
#nlm 411.8639 333.4776 -0.3652356 37.74855 1.581644e-03 NA NA 45 0 NA NA 0.04
#nlminb 411.9678 333.4449 -0.3650271 37.74753 1.581643e-03 50 268 48 0 NA NA 0.07
#spg 422.0394 300.5336 -0.5776862 38.48655 1.693119e-03 1197 NA 619 0 NA NA 1.06
#ucminf 412.7390 332.9228 -0.3652029 37.74829 1.581644e-03 45 45 NA 0 NA NA 0.05
#Rcgmin NA NA NA NA 8.988466e+307 NA NA NA 9999 NA NA 0.00
#Rvmmin NA NA NA NA 8.988466e+307 NA NA NA 9999 NA NA 0.00
#newuoa 396.3071 345.1165 -0.3650286 37.74754 1.581643e-03 3877 NA NA 0 NA NA 1.02
#bobyqa 410.0392 334.7074 -0.3650289 37.74753 1.581643e-03 7866 NA NA 0 NA NA 2.07
#nmkb 569.0139 346.0856 282.6526588 -335.32320 2.064859e-03 75 NA NA 0 NA NA 0.01
#hjkb 400.0000 200.0000 0.0100000 2.00000 3.200269e+00 1 NA 0 9999 NA NA 0.01
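To pull results out of this table, optimx supports ordering the summary by objective value, and coef() returns the parameter matrix (a small usage sketch):
summary(res, order = value)  # methods sorted by residual sum of squares
coef(res)["nlminb", ]        # parameters from one of the well-converged methods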
Levenberg-Marquardt converges too, but nlsLM fails when it tries to create an nls model object from the result because the gradient matrix is singular:
library(minpack.lm)
fit <- nlsLM(y ~ fitting(T1, T2, N, D, x),
             start = list(T1 = 412, T2 = 333, N = -0.36, D = 38),
             data = dat, trace = TRUE)
#It. 0, RSS = 0.00165827, Par. = 412 333 -0.36 38
#It. 1, RSS = 0.00158186, Par. = 417.352 329.978 -0.3652 37.746
#It. 2, RSS = 0.00158164, Par. = 416.397 330.694 -0.365025 37.7475
#It. 3, RSS = 0.00158164, Par. = 416.618 330.568 -0.365027 37.7475
#It. 4, RSS = 0.00158164, Par. = 416.618 330.568 -0.365027 37.7475
#Error in nlsModel(formula, mf, start, wts) :
# singular gradient matrix at initial parameter estimates
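If you only need the converged parameter estimates, you can call the underlying Levenberg-Marquardt routine nls.lm directly; it returns the fit without ever constructing an nls model object, so the singular-gradient check is never triggered. A minimal sketch, assuming dat and fitting() as defined above:
library(minpack.lm)
# residual function required by nls.lm: observed minus model
resid_fun <- function(par, x, y) {
  y - fitting(par[1], par[2], par[3], par[4], x)
}
fit_lm <- nls.lm(par = c(T1 = 412, T2 = 333, N = -0.36, D = 38),
                 fn = resid_fun, x = dat$x, y = dat$y)
coef(fit_lm)     # converged parameter values
fit_lm$deviance  # residual sum of squares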


Nonlinear constrained optimization with optimx

I'm trying to use optimx for a constrained nonlinear problem, but I just can't find an example online that I can adapt (I'm not an R programmer). I found that I should be using something like the following to test a few algorithms:
optimx(par, fn, lower=low, upper=up, method=c("CG", "L-BFGS-B", "spg", "nlm"))
I understand par is just an example of a feasible solution. So, if I have two variables and (0, 3) is feasible, I can just do par <- c(0, 3). If I want to minimise
2x + 3y
subject to
2x^2 + 3y^2 <= 100
x <= 3
-x <= 0
-y <= -3
I guess I can set fn like
fn <- function(x) { return(2*x[1] + 3*x[2]) }
but how do I set lower and upper for my constraints?
Many thanks!
1) We can incorporate the constraints into the objective function by returning a large number if any constraint is violated.
For most methods (but not Nelder-Mead) the requirement is that the objective function be continuous and differentiable and that the starting value be in the interior of the feasible region, not on the boundary. These requirements are not satisfied for f below, but we will try it anyway.
library(optimx)
f <- function(z, x = z[1], y = z[2]) {
  if (2*x^2 + 3*y^2 <= 100 && x <= 3 && -x <= 0 && -y <= -3) 2*x + 3*y else 1e10
}
optimx(c(0, 3), f, method = c("Nelder", "CG", "L-BFGS-B", "spg", "nlm"))
## p1 p2 value fevals gevals niter convcode kkt1 kkt2 xtime
## Nelder-Mead 0 3 9 187 NA NA 0 FALSE FALSE 0.00
## CG 0 3 9 41 1 NA 0 FALSE FALSE 0.00
## L-BFGS-B 0 3 9 21 21 NA 52 FALSE FALSE 0.00
## spg 0 3 9 1077 NA 1 0 FALSE FALSE 0.05
## nlm 0 3 9 NA NA 1 0 FALSE FALSE 0.00
1a) This also works with optim, where Nelder-Mead is the default (or you could try constrOptim, which explicitly supports inequality constraints).
optim(c(0, 3), f)
## $par
## [1] 0 3
##
## $value
## [1] 9
##
## $counts
## function gradient
## 187 NA
## $convergence
## [1] 0
##
## $message
## NULL
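Since constrOptim was mentioned: the three linear constraints can be passed via ui and ci (feasible points satisfy ui %*% par - ci >= 0), dropping the inactive quadratic constraint as noted in 2) below. A sketch, assuming a strictly interior start such as c(1, 4):
ui <- rbind(c( 1, 0),   # x >= 0
            c(-1, 0),   # x <= 3, i.e. -x >= -3
            c( 0, 1))   # y >= 3
ci <- c(0, -3, 3)
constrOptim(c(1, 4),                      # strictly interior start
            function(z) 2*z[1] + 3*z[2],  # objective
            function(z) c(2, 3),          # its gradient
            ui = ui, ci = ci)$par
# approaches the lower bounds c(0, 3) up to the barrier tolerance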
2) Above we notice that the 2x^2 + 3y^2 <= 100 constraint is not active, so we can drop it. Since the objective function is increasing in both x and y independently, it is clear that we want to set both of them to their lower bounds, so c(0, 3) is the answer.
If we want to use optimx anyway, we just use the upper= and lower= arguments for those methods that support them.
f2 <- function(z, x = z[1], y = z[2]) 2*x + 3*y
optimx(c(0, 3), f2, lower = c(0, 3), upper = c(3, Inf),
       method = c("L-BFGS-B", "spg", "nlminb"))
## p1 p2 value fevals gevals niter convcode kkt1 kkt2 xtime
## L-BFGS-B 0 3 9 1 1 NA 0 FALSE NA 0.00
## spg 0 3 9 1 NA 0 0 FALSE NA 0.01
## nlminb 0 3 9 1 2 1 0 FALSE NA 0.00
## Warning message:
## In BB::spg(par = par, fn = ufn, gr = ugr, lower = lower, upper = upper, :
## convergence tolerance satisified at intial parameter values.

Issue with a multiple regression model in R

First let me apologize: I'm a biologist starting out in bioinformatics, and therefore in R programming and statistics.
I have to do an analysis with a multiple linear regression model using the Penta data from library(mvdalab).
I have to try different models, including the PLS model that is normally used for this data set (https://rdrr.io/cran/mvdalab/f/README.md).
However, we are asked to fit several more models to the data, and I'm very lost, as the data always seem to give me errors:
1) Normal multiple regression model:
> mod2<-mod1<-lm(Penta1$log.RAI~.,Penta1)
> summary(mod2)
Call:
lm(formula = Penta1$log.RAI ~ ., data = Penta1)
Residuals:
ALL 30 residuals are 0: no residual degrees of freedom!
Coefficients: (15 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.000e-01 NA NA NA
Obs.NameAAWAA 8.500e-01 NA NA NA
Obs.NameAAYAA 5.600e-01 NA NA NA
Obs.NameEKWAP 1.400e+00 NA NA NA
Obs.NameFEAAK 4.000e-01 NA NA NA
Obs.NameFSPFR 7.400e-01 NA NA NA
Obs.NameGEAAK -4.200e-01 NA NA NA
Obs.NameLEAAK 5.000e-01 NA NA NA
Obs.NamePGFSP 1.000e+00 NA NA NA
Obs.NameRKWAP 2.080e+00 NA NA NA
Obs.NameRYLPT 5.000e-01 NA NA NA
Obs.NameVAAAK 1.114e-15 NA NA NA
Obs.NameVAAWK 3.300e-01 NA NA NA
Obs.NameVAWAA 1.530e+00 NA NA NA
Obs.NameVAWAK 1.550e+00 NA NA NA
Obs.NameVEAAK 6.100e-01 NA NA NA
Obs.NameVEAAP 2.800e-01 NA NA NA
Obs.NameVEASK 3.000e-01 NA NA NA
Obs.NameVEFAK 1.670e+00 NA NA NA
Obs.NameVEGGK -9.000e-01 NA NA NA
Obs.NameVEHAK 1.630e+00 NA NA NA
Obs.NameVELAK 6.900e-01 NA NA NA
Obs.NameVESAK 3.800e-01 NA NA NA
Obs.NameVESSK 1.000e-01 NA NA NA
Obs.NameVEWAK 2.830e+00 NA NA NA
Obs.NameVEWVK 1.810e+00 NA NA NA
Obs.NameVKAAK 2.100e-01 NA NA NA
Obs.NameVKWAA 1.810e+00 NA NA NA
Obs.NameVKWAP 2.450e+00 NA NA NA
Obs.NameVWAAK 1.400e-01 NA NA NA
S1 NA NA NA NA
L1 NA NA NA NA
P1 NA NA NA NA
S2 NA NA NA NA
L2 NA NA NA NA
P2 NA NA NA NA
S3 NA NA NA NA
L3 NA NA NA NA
P3 NA NA NA NA
S4 NA NA NA NA
L4 NA NA NA NA
P4 NA NA NA NA
S5 NA NA NA NA
L5 NA NA NA NA
P5 NA NA NA NA
Residual standard error: NaN on 0 degrees of freedom
Multiple R-squared: 1, Adjusted R-squared: NaN
F-statistic: NaN on 29 and 0 DF, p-value: NA
2) Study the reduced model provided by the stepwise method. The aim is to compare the RMSE of the reduced model and the complete model for the training group and for the test group.
step(lm(log.RAI~.,data = penta),direction = "backward")
Error in step(lm(log.RAI ~ ., data = penta), direction = "backward") :
AIC is -infinity for this model, so 'step' cannot proceed
3) Find the best model by the AIC criterion and by the adjusted R².
4) PLS model, which is what fits the data, following https://rdrr.io/cran/mvdalab/f/README.md
5) Also study it with the ridge regression method, using the lm.ridge() function or similar.
6) Finally, study the LASSO method with the lars() function from the lars package.
I'm very lost as to why the data frame gives those errors and also how to develop the analysis. Any help with any of the parts would be much appreciated.
Kind regards
OK, after reading the vignette: Penta is data obtained from drug discovery, and its first column is a unique identifier. To do regression or downstream analysis you need to exclude this column. For the steps below, I simply use Penta[,-1] as the input data.
For the first part, this works:
library(mvdalab)
data(Penta)
summary(lm(log.RAI~.,data = Penta[,-1]))
Call:
lm(formula = log.RAI ~ ., data = Penta[, -1])
Residuals:
Min 1Q Median 3Q Max
-0.39269 -0.12958 -0.05101 0.07261 0.63414
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.80263 0.92495 -0.868 0.40016
S1 -0.09783 0.03895 -2.512 0.02489 *
L1 0.03236 0.04973 0.651 0.52576
P1 -0.10795 0.08521 -1.267 0.22587
S2 0.08670 0.04428 1.958 0.07043 .
Second part for AIC is ok as well:
step(lm(log.RAI~.,data = Penta[,-1]),direction="backward")
Start: AIC=-57.16
log.RAI ~ S1 + L1 + P1 + S2 + L2 + P2 + S3 + L3 + P3 + S4 + L4 +
P4 + S5 + L5 + P5
Df Sum of Sq RSS AIC
- P3 1 0.00150 1.5374 -59.132
- L4 1 0.00420 1.5401 -59.080
If you want to select a model by AIC, the one above works. For adjusted R², I think there are most likely packages out there that do this; see the sketch below.
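For instance, one option for the adjusted R² criterion (an illustrative sketch, not part of the original answer) is an exhaustive subset search with the leaps package:
library(leaps)
subs <- regsubsets(log.RAI ~ ., data = Penta[,-1], nvmax = 15)
ss <- summary(subs)
best <- which.max(ss$adjr2)  # model size with the highest adjusted R^2
coef(subs, best)             # coefficients of that model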
For lm.ridge, do the same:
library(MASS)
fit <- lm.ridge(log.RAI ~ ., data = Penta[,-1])
For lars lasso, you need the predictors etc. in a matrix, so let's do:
library(lars)
data <- as.matrix(Penta[,-1])
fit <- lars(x = data[,-ncol(data)], y = data[,"log.RAI"], type = "lasso")
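To decide how far along the lasso path to go, the lars package also provides cross-validation; a short sketch using the same matrix:
set.seed(1)
cvres <- cv.lars(x = data[,-ncol(data)], y = data[,"log.RAI"],
                 K = 5, type = "lasso")
cvres$index[which.min(cvres$cv)]  # path fraction minimising CV error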

Timeseries Crossvalidation in R: using tsCV() with tslm()-Models

I am currently trying to evaluate a tslm model using time series cross-validation. I want to use a fixed model (without parameter re-estimation) and look at the 1- to 3-step-ahead forecasts for the evaluation period of the last year.
I am having trouble getting tsCV() and tslm() from the forecast library to work well together. What am I missing?
library(forecast)
library(ggfortify)
AirPassengers_train <- head(AirPassengers, 100)
AirPassengers_test <- tail(AirPassengers, 44)
## Holdout Evaluation
n_train <- length(AirPassengers_train)
n_test <- length(AirPassengers_test)
pred_train <- ts(rnorm(n_train))
pred_test <- ts(rnorm(n_test))
fit <- tslm(AirPassengers_train ~ trend + pred_train)
forecast(fit, newdata = data.frame(pred_train = pred_test)) %>%
accuracy(AirPassengers_test)
#> ME RMSE MAE MPE MAPE MASE
#> Training set 1.135819e-15 30.03715 23.41818 -1.304311 10.89785 0.798141
#> Test set 3.681350e+01 76.39219 55.35298 6.513998 11.96379 1.886546
#> ACF1 Theil's U
#> Training set 0.6997632 NA
#> Test set 0.7287923 1.412804
## tsCV Evaluation
fc_reg <- function(x) forecast(x, newdata = data.frame(pred_train = pred_test),
                               h = h, model = fit)
tsCV(AirPassengers_test, fc_reg, h = 1)
#> Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
#> 1957 NA NA NA NA NA NA NA NA
#> 1958 NA NA NA NA NA NA NA NA NA NA NA NA
#> 1959 NA NA NA NA NA NA NA NA NA NA NA NA
#> 1960 NA NA NA NA NA NA NA NA NA NA NA NA
forecast(AirPassengers_test, newdata = data.frame(pred_train = pred_test),
h = 1, model = fit)
#> Error in forecast.ts(AirPassengers_test, newdata = data.frame(pred_train = pred_test),
#> : Unknown model class
I have a feeling that https://gist.github.com/robjhyndman/d9eb5568a78dbc79f7acc49e22553e96 is relevant. How would I apply it to the scenario above?
For time series cross-validation, you should be fitting a separate model to every training set, not passing an existing model. With predictor variables, the function needs to be able to grab the relevant elements when fitting each model, and other elements when producing forecasts.
The following will work.
fc <- function(y, h, xreg)
{
  if (NROW(xreg) < length(y) + h)
    stop("Not enough xreg data for forecasting")
  X <- xreg[seq_along(y), ]
  fit <- tslm(y ~ X)
  X <- xreg[length(y) + seq(h), ]
  forecast(fit, newdata = X)
}
# Predictors of the same length as the data
# and with the same time series characteristics.
pred <- ts(rnorm(length(AirPassengers)), start = start(AirPassengers),
           frequency = frequency(AirPassengers))
# Now pass the whole time series and the corresponding predictors
tsCV(AirPassengers, fc, xreg=pred)
If you have more than one predictor variable, then xreg should be a matrix.
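For example (a small usage sketch building on fc above; x1 and x2 are made-up predictors):
# two made-up predictors sharing the time index of the data
pred2 <- ts(cbind(x1 = rnorm(length(AirPassengers)),
                  x2 = rnorm(length(AirPassengers))),
            start = start(AirPassengers),
            frequency = frequency(AirPassengers))
e <- tsCV(AirPassengers, fc, xreg = pred2)
sqrt(mean(e^2, na.rm = TRUE))  # overall one-step RMSE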
I came here to post my ugly workaround for the same problem (and possibly find out what is wrong with it):
myxreg <- regmat[, c("xvar1", "xvar2")]
flm_xreg <- function(x, h, xreg, newxreg) {
  forecast(Arima(x, order = c(0, 0, 0), xreg = xreg), xreg = newxreg)
}
e <- tsCV(regmat[, "yvar"], flm_xreg, h = 14, xreg = myxreg)
I ended up using a function to forecast a trend. I'm not sure whether this is correctly specified, but the RMSE looks about right.
flm <- function(y, h) { forecast(tslm(y ~ trend, lambda=0), h=h) }
e <- tsCV(tsDF, flm, h=6)
sqrt(mean(e^2, na.rm=TRUE))
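Since e is a matrix with one column per horizon when h > 1, a per-horizon RMSE (e.g. to look at the 1- to 3-step-ahead accuracy asked about above) is:
sqrt(colMeans(e^2, na.rm = TRUE))  # one RMSE per forecast horizon h = 1..6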

Going through the xgboostExplainer package: running into errors from github page

I am currently trying to work with the new xgboostExplainer package.
I am following the GitHub page here: https://github.com/AppliedDataSciencePartners/xgboostExplainer/blob/master/R/explainPredictions.R
On line 34 of that page, the xgboost model is run:
xgb.model <- xgboost(param = param, data = xgb.train.data, nrounds = 3)
However, at line 43 I run into some problems:
explainer = buildExplainer(xgb.model, xgb.train.data, type = "binary", base_score = 0.5, n_first_tree = xgb.model$best_ntreelimit - 1)
I understand that n_first_tree is deprecated, but I cannot seem to access the xgb.model$best_ntreelimit - 1 part.
The elements I can access in the xgboost model are:
handle, raw, niter, evaluation_log, call, params, callbacks, feature_names
but not best_ntreelimit.
Has somebody else run into this issue?
EDIT:
Output of showWaterfall():
Extracting the breakdown of each prediction...
|=============================================================| 100%
DONE!
Prediction: NA
Weight: NA
Breakdown
                intercept    cap-shape=bell    cap-shape=conical
                       NA                NA                   NA
[... the breakdown value for every remaining feature column (cap-surface, cap-color, bruises?, odor, gill-attachment, gill-spacing, gill-size, gill-color, stalk-shape, stalk-root, stalk-surface/color above and below ring, veil-type, veil-color, ring-number, ring-type, spore-print-color, population, habitat) is likewise NA; a row of logit-scale axis tick labels from the waterfall plot followed and is omitted here ...]
Error in if (abs(values[i]) > put_rect_text_outside_when_value_below) { :
missing value where TRUE/FALSE needed
EDIT: Here is the code I ran:
library(xgboost)
data(agaricus.train, package='xgboost')
data(agaricus.test, package='xgboost')
train <- agaricus.train
test <- agaricus.test
xgb.train.data <- xgb.DMatrix(train$data, label = train$label)
xgb.test.data <- xgb.DMatrix(test$data, label = test$label)
param <- list(objective = "binary:logistic")
model.cv <- xgb.cv(param = param,
                   data = xgb.train.data,
                   nrounds = 500,
                   early_stopping_rounds = 10,
                   nfold = 3)
model.cv$best_ntreelimit
xgb.model <- xgboost(param = param, data = xgb.train.data, nrounds = 10)
explained <- buildExplainer(xgb.model, xgb.train.data, type="binary", base_score = 0.5, n_first_tree = 9)
pred.breakdown <- explainPredictions(xgb.model,
                                     explained,
                                     xgb.test.data)
showWaterfall(xgb.model,
              explained,
              xgb.test.data, test$data, 2, type = "binary")
I tested the code on the linked page.
best_ntreelimit is a value returned by xgb.cv when early_stopping_rounds is set. From the help of xgb.cv:
best_ntreelimit the ntreelimit value corresponding to the best
iteration, which could further be used in predict method (only
available with early stopping).
You can get to it by using xgb.cv:
library(xgboost)
data(agaricus.train, package='xgboost')
data(agaricus.test, package='xgboost')
train <- agaricus.train
test <- agaricus.test
xgb.train.data <- xgb.DMatrix(train$data, label = train$label)
param <- list(objective = "binary:logistic")
model.cv <- xgb.cv(param = param,
                   data = xgb.train.data,
                   nrounds = 500,
                   early_stopping_rounds = 10,
                   nfold = 3)
model.cv$best_ntreelimit
# output
# [1] 9
However, the output of xgb.cv cannot be used to build an explainer.
So you need:
xgb.model <- xgboost(param = param, data = xgb.train.data, nrounds = 10)
and set the n_first_tree to an integer:
explained <- buildExplainer(xgb.model, xgb.train.data, type="binary", base_score = 0.5, n_first_tree = 9)
EDIT: I forgot to paste the following code:
xgb.test.data <- xgb.DMatrix(test$data, label = test$label)
pred.breakdown <- explainPredictions(xgb.model,
                                     explained,
                                     xgb.test.data)
and now you can do:
showWaterfall(xgb.model,
              explained,
              xgb.test.data, test$data, 2, type = "binary")
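Putting the pieces together, you could also let the cross-validation pick the tree count for the final model (a sketch reusing the objects created above; using best_ntreelimit - 1 as n_first_tree is an assumption carried over from the question):
best_n <- model.cv$best_ntreelimit
xgb.model <- xgboost(param = param, data = xgb.train.data, nrounds = best_n)
explained <- buildExplainer(xgb.model, xgb.train.data, type = "binary",
                            base_score = 0.5, n_first_tree = best_n - 1)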

Trend analysis for ANOVA with both btw-Ss and within-Ss factors

I want to do a trend analysis for an ANOVA that has both between-subjects (btw-Ss) and within-subjects (within-Ss) factors.
The between-subjects factor is "treatments".
The within-subjects factor is "trials".
tab20.09 <- data.frame(sid = rep(paste0("s", 1:10), each = 4),  # ten subjects, five per treatment
                       treatments = rep(c("a1", "a2"), each = 20),
                       trials = rep(c("b1", "b2", "b3", "b4"), 10),
                       responses = c(3,5,9,6,7,11,12,11,9,13,14,12,4,8,11,7,1,3,5,4,
                                     5,6,11,7,10,12,18,15,10,15,15,14,6,9,13,9,3,5,9,7))
The ANOVA matches the one in the textbook (Keppel, 1973) exactly:
aov.model.1 <- aov(responses ~ treatments*trials + Error(sid/trials), data=tab20.09)
What I am having trouble with is the trend analysis. I want to look at the linear, quadratic, and cubic trends for "trials". It would also be nice to look at the same trends for "treatments x trials".
I have set up the contrasts for the trend analyses as:
contrasts(tab20.09$trials) <- cbind(c(-3, -1, 1, 3), c(1, -1, -1, 1), c(-1, 3, -3, 1))
contrasts(tab20.09$trials)
[,1] [,2] [,3]
b1 -3 1 -1
b2 -1 -1 3
b3 1 -1 -3
b4 3 1 1
for the linear, quadratic, and cubic trends.
According to Keppel the results for the trends should be:
TRIALS:
SS df MS F
(Trials) (175.70) 3
Linear 87.12 1 87.12 193.60
Quadratic 72.90 1 72.90 125.69
Cubic 15.68 1 15.68 9.50
TREATMENTS X TRIALS
SS df MS F
(Trtmt x Trials)
(3.40) 3
Linear 0.98 1 0.98 2.18
Quadratic 0.00 1 0.00 <1
Cubic 2.42 1 2.42 1.47
ERROR TERMS
(21.40) (24)
Linear 3.60 8 0.45
Quadratic 4.60 8 0.58
Cubic 13.20 8 1.65
I have faith in his answers, as once upon a time I had to derive them myself using a six-function calculator supplemented by paper and pencil. However, when I do this:
contrasts(tab20.09$trials) <- cbind(c(-3, -1, 1, 3), c(1, -1, -1, 1), c(-1, 3, -3, 1))
aov.model.2 <- aov(responses ~ treatments*trials + Error(sid/trials), data = tab20.09)
summary(lm(aov.model.2))
what I get seems not to make sense.
summary(lm(aov.model.2))
Call:
lm(formula = aov.model.2)
Residuals:
ALL 40 residuals are 0: no residual degrees of freedom!
Coefficients: (4 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.750e+00 NA NA NA
treatmentsa2 3.500e+00 NA NA NA
trials1 6.500e-01 NA NA NA
trials2 -1.250e+00 NA NA NA
trials3 -4.500e-01 NA NA NA
sids10 -3.250e+00 NA NA NA
sids2 4.500e+00 NA NA NA
sids3 6.250e+00 NA NA NA
sids4 1.750e+00 NA NA NA
sids5 -2.500e+00 NA NA NA
sids6 -2.000e+00 NA NA NA
sids7 4.500e+00 NA NA NA
sids8 4.250e+00 NA NA NA
sids9 NA NA NA NA
treatmentsa2:trials1 2.120e-16 NA NA NA
treatmentsa2:trials2 -5.000e-01 NA NA NA
treatmentsa2:trials3 5.217e-16 NA NA NA
trials1:sids10 1.500e-01 NA NA NA
trials2:sids10 7.500e-01 NA NA NA
trials3:sids10 5.000e-02 NA NA NA
trials1:sids2 -1.041e-16 NA NA NA
trials2:sids2 -2.638e-16 NA NA NA
trials3:sids2 5.000e-01 NA NA NA
trials1:sids3 -1.500e-01 NA NA NA
trials2:sids3 -2.500e-01 NA NA NA
trials3:sids3 4.500e-01 NA NA NA
trials1:sids4 -5.000e-02 NA NA NA
trials2:sids4 -7.500e-01 NA NA NA
trials3:sids4 1.500e-01 NA NA NA
trials1:sids5 -1.000e-01 NA NA NA
trials2:sids5 5.000e-01 NA NA NA
trials3:sids5 3.000e-01 NA NA NA
trials1:sids6 -1.000e-01 NA NA NA
trials2:sids6 5.000e-01 NA NA NA
trials3:sids6 -2.000e-01 NA NA NA
trials1:sids7 4.000e-01 NA NA NA
trials2:sids7 5.000e-01 NA NA NA
trials3:sids7 -2.000e-01 NA NA NA
trials1:sids8 -5.000e-02 NA NA NA
trials2:sids8 2.500e-01 NA NA NA
trials3:sids8 6.500e-01 NA NA NA
trials1:sids9 NA NA NA NA
trials2:sids9 NA NA NA NA
trials3:sids9 NA NA NA NA
Residual standard error: NaN on 0 degrees of freedom
Multiple R-squared: 1, Adjusted R-squared: NaN
F-statistic: NaN on 39 and 0 DF, p-value: NA
Any ideas what I am doing wrong? I suspect there is some problem with applying lm to the ANOVA object, but I don't know what, and I don't know how to put in my trend analyses.
###### MORE DETAILS in response to ssdecontrol's answer
Well, "trials" is a factor as it codes four levels of experience that are being manipulated. Likewise "sid" is the "subject identification number" that is definitely "nominal" not "ordinal" or "interval". Subjects are pretty much always treated as Factors in ANOVAS.
However, I did try both of these changes, but it greatly distorted the ANOVA (try it yourself and compare). Likewise, it didn't seem to help. PERHAPS MORE DIRECTLY RELEVANT, when I try to create and apply my contrasts I am told that it cannot be done as my numerics need to be factors:
contrasts(tab20.09$trials) <- cbind(c(-3, -1, 1, 3), c(1, -1, -1, 1), c(-1, 3, -3, 1))
Error in `contrasts<-`(`*tmp*`, value = c(-3, -1, 1, 3, 1, -1, -1, 1, :
contrasts apply only to factors
STARTING OVER
I seem to make more progress using contr.poly as in
contrasts(tab20.09$trials) <- contr.poly(levels(tab20.09$trials))
The ANOVA doesn't change at all. So that is good and when I do:
lm.model <- lm(responses ~ trials, data = tab20.09)
summary.lm(lm.model)
I get basically the same pattern as Keppel.
BUT, as I am interested in the linear trend of the interaction (treatments x trials), not just of trials, I tried this:
lm3 <- lm(responses ~ treatments*trials, data = tab20.09)
summary.lm(lm3)
and the main effect of "trials" goes away . . .
In Keppel's treatment, he calculated a separate error term for each contrast (i.e., linear, quadratic, and cubic) and used it for both the main effect of "trials" and the "treatments x trials" interaction.
I certainly could hand-calculate all of these things again. Perhaps I could even write R functions for the general case; however, it seems hard to believe that such a basic contrast for experimental psychology has not yet found an R implementation!
Any help or suggestions would be greatly appreciated. Thanks. W
It looks like trials and sid are factors, but you are intending for them to be numeric/integer. Run sapply(tab20.09, class) to see if that's the case. That's what the output means: instead of fitting a continuous/count interaction, it's fitting a dummy variable for each level of each variable and computing all of the interactions between them.
To fix it, just reassign tab20.09$trials <- as.numeric(tab20.09$trials) and tab20.09$sid <- as.numeric(tab20.09$sid) in list syntax, or use matrix syntax like tab20.09[, c("trials", "sid")] <- apply(tab20.09[, c("trials", "sid")], 2, as.numeric). The first one is easier in this case, but you should be aware of the second as well.
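For completeness, one way to reproduce Keppel's separate error term per contrast in base R is to reduce each subject's four trial scores to linear/quadratic/cubic contrast scores and analyse those between subjects; a sketch, assuming tab20.09 as built above with rows ordered b1..b4 within each subject:
w <- contr.poly(4)  # orthogonal linear, quadratic, cubic weights

# one row per subject: contrast scores across the four trials
sc <- aggregate(responses ~ sid + treatments, data = tab20.09,
                FUN = function(r) c(lin  = sum(w[, 1] * r),
                                    quad = sum(w[, 2] * r),
                                    cub  = sum(w[, 3] * r)))

# With sum-to-zero coding, the intercept row tests the trials trend and the
# treatments row tests the treatments x trials trend, each against its own
# 8-df error term (t^2 reproduces Keppel's F, since F is scale-invariant).
for (k in c("lin", "quad", "cub")) {
  fit <- lm(sc$responses[, k] ~ treatments, data = sc,
            contrasts = list(treatments = contr.sum))
  print(summary(fit)$coefficients)
}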
