Going through the xgboostExplainer package: running into errors from github page - r

I am currently trying to work with the new xgboostExplainer package.
I am following the GitHub page here: https://github.com/AppliedDataSciencePartners/xgboostExplainer/blob/master/R/explainPredictions.R
On line 34 there, the xgboost model is run:
xgb.model <- xgboost(param = param, data = xgb.train.data, nrounds = 3)
However, on line 43 I run into problems:
explainer = buildExplainer(xgb.model, xgb.train.data, type = "binary", base_score = 0.5, n_first_tree = xgb.model$best_ntreelimit - 1)
I understand that n_first_tree is deprecated, but I cannot seem to access the xgb.model$best_ntreelimit - 1 part.
The components I can access on the xgboost model are:
handle, raw, niter, evaluation_log, call, params, callbacks, feature_names
but not best_ntreelimit.
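For reference, checking which components exist is plain base R:
# list the components the fitted booster exposes
names(xgb.model)
# [1] "handle"         "raw"            "niter"          "evaluation_log"
# [5] "call"           "params"         "callbacks"      "feature_names"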
Has somebody else run into this issue?
EDIT:
Output of the showWaterfall() call:
Extracting the breakdown of each prediction...
|=============================================================| 100%
DONE!
Prediction: NA
Weight: NA
Breakdown
intercept cap-shape=bell
NA NA
[... and so on: the intercept and all 126 one-hot feature contributions, from cap-shape=bell through habitat=woods, print as NA]
-3.89182 -3.178054 -2.751535 -2.442347 -2.197225 -1.99243 -1.81529 -1.658228 -1.516347 -1.386294 -1.265666 -1.15268 -1.045969 -0.9444616 -0.8472979 -0.7537718 -0.6632942 -0.5753641 -0.4895482 -0.4054651 -0.3227734 -0.2411621 -0.1603427 -0.08004271 0 0.08004271 0.1603427 0.2411621 0.3227734 0.4054651 0.4895482 0.5753641 0.6632942 0.7537718 0.8472979 0.9444616 1.045969 1.15268 1.265666 1.386294 1.516347 1.658228 1.81529 1.99243 2.197225 2.442347 2.751535 3.178054 3.89182
Error in if (abs(values[i]) > put_rect_text_outside_when_value_below) { :
missing value where TRUE/FALSE needed
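That error is R's generic complaint when an if() condition evaluates to NA rather than TRUE or FALSE; with every breakdown value NA, the comparison inside showWaterfall() cannot succeed. A minimal illustration:
if (NA > 0.5) print("put label outside")
# Error in if (NA > 0.5) print("put label outside") :
#   missing value where TRUE/FALSE needed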
EDIT: Here is the code I ran:
library(xgboost)
library(xgboostExplainer)  # provides buildExplainer, explainPredictions, showWaterfall
data(agaricus.train, package = 'xgboost')
data(agaricus.test, package = 'xgboost')
train <- agaricus.train
test <- agaricus.test
xgb.train.data <- xgb.DMatrix(train$data, label = train$label)
xgb.test.data <- xgb.DMatrix(test$data, label = test$label)
param <- list(objective = "binary:logistic")
model.cv <- xgb.cv(param = param,
                   data = xgb.train.data,
                   nrounds = 500,
                   early_stopping_rounds = 10,
                   nfold = 3)
model.cv$best_ntreelimit
xgb.model <- xgboost(param = param, data = xgb.train.data, nrounds = 10)
explained <- buildExplainer(xgb.model, xgb.train.data, type = "binary",
                            base_score = 0.5, n_first_tree = 9)
pred.breakdown <- explainPredictions(xgb.model,
                                     explained,
                                     xgb.test.data)
showWaterfall(xgb.model,
              explained,
              xgb.test.data, test$data, 2, type = "binary")

I tested the code on the linked page.
best_ntreelimit is a value returned by xgb.cv when early_stopping_rounds is set. From the help for xgb.cv:
    best_ntreelimit: the ntreelimit value corresponding to the best iteration, which could further be used in the predict method (only available with early stopping).
You can get to it by using xgb.cv:
library(xgboost)
data(agaricus.train, package = 'xgboost')
data(agaricus.test, package = 'xgboost')
train <- agaricus.train
test <- agaricus.test
xgb.train.data <- xgb.DMatrix(train$data, label = train$label)
param <- list(objective = "binary:logistic")
model.cv <- xgb.cv(param = param,
                   data = xgb.train.data,
                   nrounds = 500,
                   early_stopping_rounds = 10,
                   nfold = 3)
model.cv$best_ntreelimit
# output
9
However, the output of xgb.cv cannot itself be used to build an explainer, so you need an actual xgboost model:
xgb.model <- xgboost(param = param, data = xgb.train.data, nrounds = 10)
and you set n_first_tree to an integer:
explained <- buildExplainer(xgb.model, xgb.train.data, type = "binary", base_score = 0.5, n_first_tree = 9)
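If you prefer not to hard-code the 9, you can wire the CV result straight in; a sketch that keeps the same nrounds / n_first_tree pairing used above:
best_n <- model.cv$best_ntreelimit                 # 9 in the run above
xgb.model <- xgboost(param = param, data = xgb.train.data, nrounds = best_n + 1)
explained <- buildExplainer(xgb.model, xgb.train.data, type = "binary",
                            base_score = 0.5, n_first_tree = best_n)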
EDIT: I failed to paste the following code:
xgb.test.data <- xgb.DMatrix(test$data, label = test$label)
pred.breakdown <- explainPredictions(xgb.model,
                                     explained,
                                     xgb.test.data)
and now you can do:
showWaterfall(xgb.model,
              explained,
              xgb.test.data, test$data, 2, type = "binary")
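As a quick sanity check before plotting, you can confirm the breakdown contains no missing values; plain R, and it would have flagged the all-NA output in the question:
sum(is.na(pred.breakdown))   # should be 0; the waterfall error above appears when it is not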

Related

PSCL Returning NAs for zeroinfl negbin model

I am trying to run a zero-inflated negative binomial count model on data containing the number of campaign visits by a politician, by county. (Log-likelihood tests indicate negative binomial is correct; a Vuong test suggests zero-inflated, though that could be thrown off by the fact that my zero-inflated model is clearly not converging.) I am using the pscl package in R. The problem is the output I get when I run the model:
Call:
zeroinfl(formula = Sanders_Adjacent_Clinton_Visit ~ Relative_Divisiveness + Obama_General_Percent_12 +
Percent_Over_65 + Percent_F + Percent_White + Percent_HS + Per_Capita_Income +
Poverty_Rate + MRP_Ideology_Mean + Swing_State, data = Unity_Data, dist = "negbin")
Pearson residuals:
Min 1Q Median 3Q Max
-0.96406 -0.24339 -0.11744 -0.03183 16.21356
Count model coefficients (negbin with log link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.216e+01 NA NA NA
Relative_Divisiveness -3.831e-01 NA NA NA
Obama_General_Percent_12 1.904e+00 NA NA NA
Percent_Over_65 -4.848e-02 NA NA NA
Percent_F 1.737e-01 NA NA NA
Percent_White 2.980e+00 NA NA NA
Percent_HS -3.563e-02 NA NA NA
Per_Capita_Income 7.413e-05 NA NA NA
Poverty_Rate -2.273e-02 NA NA NA
MRP_Ideology_Mean -8.316e-01 NA NA NA
Swing_State 1.580e+00 NA NA NA
Log(theta) 9.595e+00 NA NA NA
Zero-inflation model coefficients (binomial with logit link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.024e+02 NA NA NA
Relative_Divisiveness -3.265e+00 NA NA NA
Obama_General_Percent_12 -2.300e+01 NA NA NA
Percent_Over_65 -7.768e-02 NA NA NA
Percent_F 2.873e+00 NA NA NA
Percent_White 5.156e+00 NA NA NA
Percent_HS -5.097e-01 NA NA NA
Per_Capita_Income 2.831e-04 NA NA NA
Poverty_Rate 1.391e-02 NA NA NA
MRP_Ideology_Mean -2.569e+00 NA NA NA
Swing_State 5.075e-01 NA NA NA
Theta = 14696.9932
Number of iterations in BFGS optimization: 94
Log-likelihood: -596.5 on 23 Df
Obviously, all of those NAs are less than helpful to me. Any advice would be greatly appreciated! I'm pretty novice at R, StackOverflow, and statistics, but trying to learn. I'm trying to provide everything needed for a minimal reproducible example, but I don't see anywhere to share my actual data... so if that's something you need in order to answer the question, let me know where I can put it!
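A note on where such NAs come from (this is generic to maximum-likelihood fits, not specific to pscl): the Std. Error column is derived from the variance-covariance matrix of the estimates, which requires inverting the Hessian; when the fit has not really converged (note the enormous Theta above), that inversion fails and the matrix fills with NAs. A quick way to inspect it, where zinb_fit is a placeholder name for the zeroinfl() fit shown above:
sqrt(diag(vcov(zinb_fit)))   # the values behind the Std. Error column; expect NAs here too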

Issue with a multiple regression model in R

First, let me apologize: I'm a biologist just starting out in bioinformatics, and therefore in R programming and statistics.
I have to do an analysis with a multiple linear regression model on the Penta data from library(mvdalab).
I have to try different models, including the PLS model, which is the one normally used for this data set (https://rdrr.io/cran/mvdalab/f/README.md).
However, we are asked to fit more models to the data, and I'm very lost, as the data always seems to give me errors:
1) Normal multiple regression model:
> mod2<-mod1<-lm(Penta1$log.RAI~.,Penta1)
> summary(mod2)
Call:
lm(formula = Penta1$log.RAI ~ ., data = Penta1)
Residuals:
ALL 30 residuals are 0: no residual degrees of freedom!
Coefficients: (15 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.000e-01 NA NA NA
Obs.NameAAWAA 8.500e-01 NA NA NA
Obs.NameAAYAA 5.600e-01 NA NA NA
Obs.NameEKWAP 1.400e+00 NA NA NA
Obs.NameFEAAK 4.000e-01 NA NA NA
Obs.NameFSPFR 7.400e-01 NA NA NA
Obs.NameGEAAK -4.200e-01 NA NA NA
Obs.NameLEAAK 5.000e-01 NA NA NA
Obs.NamePGFSP 1.000e+00 NA NA NA
Obs.NameRKWAP 2.080e+00 NA NA NA
Obs.NameRYLPT 5.000e-01 NA NA NA
Obs.NameVAAAK 1.114e-15 NA NA NA
Obs.NameVAAWK 3.300e-01 NA NA NA
Obs.NameVAWAA 1.530e+00 NA NA NA
Obs.NameVAWAK 1.550e+00 NA NA NA
Obs.NameVEAAK 6.100e-01 NA NA NA
Obs.NameVEAAP 2.800e-01 NA NA NA
Obs.NameVEASK 3.000e-01 NA NA NA
Obs.NameVEFAK 1.670e+00 NA NA NA
Obs.NameVEGGK -9.000e-01 NA NA NA
Obs.NameVEHAK 1.630e+00 NA NA NA
Obs.NameVELAK 6.900e-01 NA NA NA
Obs.NameVESAK 3.800e-01 NA NA NA
Obs.NameVESSK 1.000e-01 NA NA NA
Obs.NameVEWAK 2.830e+00 NA NA NA
Obs.NameVEWVK 1.810e+00 NA NA NA
Obs.NameVKAAK 2.100e-01 NA NA NA
Obs.NameVKWAA 1.810e+00 NA NA NA
Obs.NameVKWAP 2.450e+00 NA NA NA
Obs.NameVWAAK 1.400e-01 NA NA NA
S1 NA NA NA NA
L1 NA NA NA NA
P1 NA NA NA NA
S2 NA NA NA NA
L2 NA NA NA NA
P2 NA NA NA NA
S3 NA NA NA NA
L3 NA NA NA NA
P3 NA NA NA NA
S4 NA NA NA NA
L4 NA NA NA NA
P4 NA NA NA NA
S5 NA NA NA NA
L5 NA NA NA NA
P5 NA NA NA NA
Residual standard error: NaN on 0 degrees of freedom
Multiple R-squared: 1, Adjusted R-squared: NaN
F-statistic: NaN on 29 and 0 DF, p-value: NA
2) Study the reduced model provided by the stepwise method. The aim is to compare the RMSE of the reduced model and the complete model for the training group and for the test group.
step(lm(log.RAI~.,data = penta),direction = "backward")
Error in step(lm(log.RAI ~ ., data = penta), direction = "backward") :
AIC is -infinity for this model, so 'step' cannot proceed
3) Find the best model by the AIC criterion and by adjusted R².
4) PLS model, i.e. what fits the data, following https://rdrr.io/cran/mvdalab/f/README.md
5) Also study it with the ridge regression method, using the lm.ridge() function or similar.
6) Finally, study the LASSO method with the lars() function from the lars package.
I'm very lost as to why the data frame gives those errors, and also as to how to develop the analysis. Any help with any of the parts would be much appreciated.
Kind regards
OK, after reading the vignette: Penta is data obtained from drug discovery, and the first column (Obs.Name) is a unique identifier. To do regression or downstream analysis you need to exclude this column, so for the steps below I simply use Penta[, -1] as the input data.
For the first part, this works:
library(mvdalab)
data(Penta)
summary(lm(log.RAI~.,data = Penta[,-1]))
Call:
lm(formula = log.RAI ~ ., data = Penta[, -1])
Residuals:
Min 1Q Median 3Q Max
-0.39269 -0.12958 -0.05101 0.07261 0.63414
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.80263 0.92495 -0.868 0.40016
S1 -0.09783 0.03895 -2.512 0.02489 *
L1 0.03236 0.04973 0.651 0.52576
P1 -0.10795 0.08521 -1.267 0.22587
S2 0.08670 0.04428 1.958 0.07043 .
The second part, for AIC, is OK as well:
step(lm(log.RAI~.,data = Penta[,-1]),direction="backward")
Start: AIC=-57.16
log.RAI ~ S1 + L1 + P1 + S2 + L2 + P2 + S3 + L3 + P3 + S4 + L4 +
P4 + S5 + L5 + P5
Df Sum of Sq RSS AIC
- P3 1 0.00150 1.5374 -59.132
- L4 1 0.00420 1.5401 -59.080
If you want to select a model by AIC, the call above works. For adjusted R², I think there are most likely packages out there that do this.
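For example (assuming the leaps package suits you; it does best-subset selection and reports adjusted R² per model size):
library(leaps)
sub <- regsubsets(log.RAI ~ ., data = Penta[, -1], nvmax = 15)
summary(sub)$adjr2                # adjusted R^2 of the best model at each size
which.max(summary(sub)$adjr2)     # model size that maximizes adjusted R^2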
For lm.ridge, do the same:
library(MASS)
fit <- lm.ridge(log.RAI ~ ., data = Penta[, -1])
For lasso with lars, you need the predictors etc. in a matrix, so let's do:
library(lars)
data <- as.matrix(Penta[, -1])
fit <- lars(x = data[, -ncol(data)], y = data[, "log.RAI"], type = "lasso")
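The fitted lars object can then be inspected with the usual accessors:
plot(fit)                                # coefficient paths along the lasso sequence
coef(fit, s = 0.5, mode = "fraction")    # coefficients at an illustrative point on the path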

mlogit() outputs NA's for nested logit model in R

I am trying to estimate a nested logit model of company siting choices, with nests = countries and alternatives = provinces, based on a number of alternative-specific characteristics as well as some company-specific characteristics. I formatted my data to a "long" structure using:
data <- mlogit.data(DB, choice="Occurrence", shape="long", chid.var="IDP", varying=6:ncol(DB), alt.var="Prov")
Here's a sample of the data:
IDP Occurrence From Prov ToC Dist Price Yield
5p1.APY 5p1 FALSE Sao Paulo APY PY 0.0000000 0.3698913 0.0000000
5p1.BOQ 5p1 FALSE Sao Paulo BOQ PY 0.6495493 0.3698913 0.0000000
5p1.CHA 5p1 FALSE Sao Paulo CHA AR 0.7870593 0.4622464 0.4461496
5p1.COR 5p1 FALSE Sao Paulo COR AR 0.3747480 0.4622464 0.5536546
5p1.FOR 5p1 FALSE Sao Paulo FOR AR 0.6822188 0.4622464 0.4402772
5p1.JUY 5p1 FALSE Sao Paulo JUY AR 1.0000000 0.4622464 0.3617038
Note that I've reduced the table to a few variables for clarity but would normally use more.
The code I use for the nested logit is the following:
nests <- list(Bolivia="SCZ",Paraguay=c("PHY","BOQ","APY"),Argentina=c("CHA","COR","FOR","JUY","SAL","SFE","SDE"))
nml <- mlogit(Occurrence ~ DistComp + PriceComp + YieldComp, data=data, nests=nests, unscaled=T)
summary(nml)
When running this model, I get the following output:
> summary(nml)
Call:
mlogit(formula = Occurrence ~ DistComp + PriceComp + YieldComp,
data = data, nests = nests, unscaled = T)
Frequencies of alternatives:
      APY       BOQ       CHA       COR       FOR       JUY       PHY       SAL       SCZ       SDE       SFE
0.1000000 0.0666667 0.1333333 0.0250000 0.0750000 0.0083333 0.0083333 0.1166667 0.2583333 0.1750000 0.0333333
bfgs method
1 iterations, 0h:0m:0s
g'(-H)^-1g = 1E+10
last step couldn't find higher value
Coefficients :
Estimate Std. Error t-value Pr(>|t|)
BOQ:(intercept) -0.29923 NA NA NA
CHA:(intercept) -1.25406 NA NA NA
COR:(intercept) -1.76020 NA NA NA
FOR:(intercept) -1.97083 NA NA NA
JUY:(intercept) -4.14476 NA NA NA
PHY:(intercept) -2.63961 NA NA NA
SAL:(intercept) -1.72047 NA NA NA
SCZ:(intercept) -0.15714 NA NA NA
SDE:(intercept) -0.57449 NA NA NA
SFE:(intercept) -2.47345 NA NA NA
DistComp 2.44322 NA NA NA
PriceComp 2.45202 NA NA NA
YieldComp 3.15611 NA NA NA
iv.Bolivia 1.00000 NA NA NA
iv.Paraguay 1.00000 NA NA NA
iv.Argentina 1.00000 NA NA NA
Log-Likelihood: -221.84
McFadden R^2: 0.10453
Likelihood ratio test : chisq = 51.79 (p.value = 2.0552e-09)
I don't understand what causes the NAs in the output, considering that I prepared the data using mlogit.data(). Any help on this would be greatly appreciated.
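The only clue I can see is in the header: the optimizer stops after a single BFGS iteration ("last step couldn't find higher value"), so perhaps the standard errors are unavailable because the fit never converged. The matrix they are derived from can be inspected directly:
vcov(nml)   # covariance matrix behind the Std. Error column; expect NAs when the optimizer stalls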
Best,
Yann

nls singular gradient matrix - fit parameters in integral's upper limits

I am trying to make an nls fit for a somewhat complicated expression that includes two integrals, with two of the fit parameters in their upper limits.
I got the error
"Error in nlsModel(formula, mf, start, wts) : singular gradient
matrix at initial parameter estimates".
I have already searched the previous answers, but they didn't help. The parameter initialization seems to be OK; I have tried changing the parameters, but nothing works. If my function has just one integral, everything works very nicely, but when I add a second integral term I just get the error. I don't believe the function is over-parametrized, as I have performed other fits with many more parameters and they worked. Below I have written out some data.
The minimal example is the following:
integrand <- function(X) {
  return(X^4 / (2 * sinh(X/2))^2)
}
fitting <- function(T1, T2, N, D, x) {
  int1 <- integrate(integrand, lower = 0, upper = T1)$value
  int2 <- integrate(integrand, lower = 0, upper = T2)$value
  return(N * (D/x)^2 * (exp(D/x) / (1 + exp(D/x))^2) +
         (448.956 * (x/T1)^3 * int1) + (299.304 * (x/T2)^3 * int2))
}
fit <- nls(y ~ fitting(T1, T2, N, D, x),
           data = dat,  # dat as read in below
           start = list(T1 = 400, T2 = 200, N = 0.01, D = 2))
For reference, the fit that worked is the following:
integrand <- function(X) {
  return(X^4 / (2 * sinh(X/2))^2)
}
fitting <- function(T1, N, D, x) {
  int <- integrate(integrand, lower = 0, upper = T1)$value
  return(N * (D/x)^2 * (exp(D/x) / (1 + exp(D/x))^2) + 748.26 * (x/T1)^3 * int)
}
fit <- nls(y ~ fitting(T1, N, D, x), start = list(T1 = 400, N = 0.01, D = 2))
Data to illustrate the problem:
dat<- read.table(text="x y
0.38813 0.0198
0.79465 0.02206
1.40744 0.01676
1.81532 0.01538
2.23105 0.01513
2.64864 0.01547
3.05933 0.01706
3.47302 0.01852
3.88791 0.02074
4.26301 0.0256
4.67607 0.03028
5.08172 0.03507
5.48327 0.04283
5.88947 0.05017
6.2988 0.05953
6.7022 0.07185
7.10933 0.08598
7.51924 0.0998
7.92674 0.12022
8.3354 0.1423
8.7384 0.16382
9.14656 0.19114
9.55062 0.22218
9.95591 0.25542", header=TRUE)
I cannot figure out what happens. I need to perform this fit with three integral components, but even with two I have this problem. I would appreciate your help very much. Thank you.
You could try some other optimizers:
fitting1 <- function(par, x, y) {
sum((fitting(par[1], par[2], par[3], par[4], x) - y)^2)
}
library(optimx)
res <- optimx(c(400, 200, 0.01, 2),
              fitting1,
              x = dat$x, y = dat$y,
              control = list(all.methods = TRUE))
print(res)
# p1 p2 p3 p4 value fevals gevals niter convcode kkt1 kkt2 xtimes
#BFGS 409.7992 288.6416 -0.7594461 39.00871 1.947484e-03 101 100 NA 1 NA NA 0.22
#CG 401.1281 210.9087 -0.9026459 20.80900 3.892929e-01 215 101 NA 1 NA NA 0.25
#Nelder-Mead 414.6402 446.5080 -1.1298606 -227.81280 2.064842e-03 89 NA NA 0 NA NA 0.02
#L-BFGS-B 412.4477 333.1338 -0.3650530 37.74779 1.581643e-03 34 34 NA 0 NA NA 0.06
#nlm 411.8639 333.4776 -0.3652356 37.74855 1.581644e-03 NA NA 45 0 NA NA 0.04
#nlminb 411.9678 333.4449 -0.3650271 37.74753 1.581643e-03 50 268 48 0 NA NA 0.07
#spg 422.0394 300.5336 -0.5776862 38.48655 1.693119e-03 1197 NA 619 0 NA NA 1.06
#ucminf 412.7390 332.9228 -0.3652029 37.74829 1.581644e-03 45 45 NA 0 NA NA 0.05
#Rcgmin NA NA NA NA 8.988466e+307 NA NA NA 9999 NA NA 0.00
#Rvmmin NA NA NA NA 8.988466e+307 NA NA NA 9999 NA NA 0.00
#newuoa 396.3071 345.1165 -0.3650286 37.74754 1.581643e-03 3877 NA NA 0 NA NA 1.02
#bobyqa 410.0392 334.7074 -0.3650289 37.74753 1.581643e-03 7866 NA NA 0 NA NA 2.07
#nmkb 569.0139 346.0856 282.6526588 -335.32320 2.064859e-03 75 NA NA 0 NA NA 0.01
#hjkb 400.0000 200.0000 0.0100000 2.00000 3.200269e+00 1 NA 0 9999 NA NA 0.01
Levenberg-Marquardt converges too, but nlsLM fails when it tries to create an nls model object from the result because the gradient matrix is singular:
library(minpack.lm)
fit <- nlsLM(y ~ fitting(T1, T2, N, D, x),
             start = list(T1 = 412, T2 = 333, N = -0.36, D = 38),
             data = dat, trace = TRUE)
#It. 0, RSS = 0.00165827, Par. = 412 333 -0.36 38
#It. 1, RSS = 0.00158186, Par. = 417.352 329.978 -0.3652 37.746
#It. 2, RSS = 0.00158164, Par. = 416.397 330.694 -0.365025 37.7475
#It. 3, RSS = 0.00158164, Par. = 416.618 330.568 -0.365027 37.7475
#It. 4, RSS = 0.00158164, Par. = 416.618 330.568 -0.365027 37.7475
#Error in nlsModel(formula, mf, start, wts) :
# singular gradient matrix at initial parameter estimates
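If all you need are the converged parameter values and the fitted curve, rather than a full nls object with its summary machinery, you can simply reuse the estimates the converged methods agree on (taken from the table and trace above):
p <- c(T1 = 416.6, T2 = 330.6, N = -0.365, D = 37.75)   # from the nlminb / nlsLM runs
yhat <- fitting(p[["T1"]], p[["T2"]], p[["N"]], p[["D"]], dat$x)
sum((yhat - dat$y)^2)   # about 1.58e-3, the RSS the converged optimizers report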

Trend analysis for ANOVA with both btw-Ss and within-Ss factors

I want to do a trend analysis for an ANOVA that has both between-subjects and within-subjects factors.
The between-subjects factor is "treatments".
The within-subjects factor is "trials".
tab20.09 <- data.frame(sid = rep(paste0("s", 1:10), each = 4),
                       treatments = rep(c("a1", "a2"), each = 20),
                       trials = rep(c("b1", "b2", "b3", "b4"), 10),
                       responses = c(3,5,9,6,7,11,12,11,9,13,14,12,4,8,11,7,1,3,5,4,
                                     5,6,11,7,10,12,18,15,10,15,15,14,6,9,13,9,3,5,9,7))
The ANOVA matches the one in the textbook (Keppel, 1973) exactly:
aov.model.1 <- aov(responses ~ treatments*trials + Error(sid/trials), data=tab20.09)
What I am having trouble with is the trend analysis. I want to look at the linear, quadratic, and cubic trends for "trials". It would also be nice to look at the same trends for "treatments x trials".
I have set up the contrasts for the trend analyses as:
contrasts(tab20.09$trials) <- cbind(c(-3, -1, 1, 3), c(1, -1, -1, 1), c(-1, 3, -3, 1))
contrasts(tab20.09$trials)
[,1] [,2] [,3]
b1 -3 1 -1
b2 -1 -1 3
b3 1 -1 -3
b4 3 1 1
for the linear, quadratic, and cubic trends.
According to Keppel, the results for the trends should be:
TRIALS
              SS      df      MS        F
(Trials)   (175.70)    3
Linear       87.12     1     87.12   193.60
Quadratic    72.90     1     72.90   125.69
Cubic        15.68     1     15.68     9.50

TREATMENTS X TRIALS
                     SS    df     MS      F
(Trtmt x Trials)   (3.40)   3
Linear               0.98   1    0.98   2.18
Quadratic            0.00   1    0.00     <1
Cubic                2.42   1    2.42   1.47

ERROR TERMS
              SS      df     MS
            (21.40)  (24)
Linear        3.60     8    0.45
Quadratic     4.60     8    0.58
Cubic        13.20     8    1.65
I have faith in his answers, as once upon a time I had to derive them myself using a six-function calculator supplemented by paper and pencil. However, when I do this:
contrasts(tab20.09$trials) <- cbind(c(-3, -1, 1, 3), c(1, -1, -1, 1), c(-1, 3, -3, 1))
aov.model.2 <- aov(responses ~ treatments*trials + Error(sid/trials), data=tab20.09)
summary(lm(aov.model.2))
what I get does not seem to make sense.
summary(lm(aov.model.2))
Call:
lm(formula = aov.model.2)
Residuals:
ALL 40 residuals are 0: no residual degrees of freedom!
Coefficients: (4 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.750e+00 NA NA NA
treatmentsa2 3.500e+00 NA NA NA
trials1 6.500e-01 NA NA NA
trials2 -1.250e+00 NA NA NA
trials3 -4.500e-01 NA NA NA
sids10 -3.250e+00 NA NA NA
sids2 4.500e+00 NA NA NA
sids3 6.250e+00 NA NA NA
sids4 1.750e+00 NA NA NA
sids5 -2.500e+00 NA NA NA
sids6 -2.000e+00 NA NA NA
sids7 4.500e+00 NA NA NA
sids8 4.250e+00 NA NA NA
sids9 NA NA NA NA
treatmentsa2:trials1 2.120e-16 NA NA NA
treatmentsa2:trials2 -5.000e-01 NA NA NA
treatmentsa2:trials3 5.217e-16 NA NA NA
trials1:sids10 1.500e-01 NA NA NA
trials2:sids10 7.500e-01 NA NA NA
trials3:sids10 5.000e-02 NA NA NA
trials1:sids2 -1.041e-16 NA NA NA
trials2:sids2 -2.638e-16 NA NA NA
trials3:sids2 5.000e-01 NA NA NA
trials1:sids3 -1.500e-01 NA NA NA
trials2:sids3 -2.500e-01 NA NA NA
trials3:sids3 4.500e-01 NA NA NA
trials1:sids4 -5.000e-02 NA NA NA
trials2:sids4 -7.500e-01 NA NA NA
trials3:sids4 1.500e-01 NA NA NA
trials1:sids5 -1.000e-01 NA NA NA
trials2:sids5 5.000e-01 NA NA NA
trials3:sids5 3.000e-01 NA NA NA
trials1:sids6 -1.000e-01 NA NA NA
trials2:sids6 5.000e-01 NA NA NA
trials3:sids6 -2.000e-01 NA NA NA
trials1:sids7 4.000e-01 NA NA NA
trials2:sids7 5.000e-01 NA NA NA
trials3:sids7 -2.000e-01 NA NA NA
trials1:sids8 -5.000e-02 NA NA NA
trials2:sids8 2.500e-01 NA NA NA
trials3:sids8 6.500e-01 NA NA NA
trials1:sids9 NA NA NA NA
trials2:sids9 NA NA NA NA
trials3:sids9 NA NA NA NA
Residual standard error: NaN on 0 degrees of freedom
Multiple R-squared: 1, Adjusted R-squared: NaN
F-statistic: NaN on 39 and 0 DF, p-value: NA
Any ideas what I am doing wrong? I suspect there is some problem with lm() and the ANOVA, but I don't know what, and I don't know how to put in my trend analyses.
MORE DETAILS, in response to ssdecontrol's answer:
Well, "trials" is a factor as it codes four levels of experience that are being manipulated. Likewise "sid" is the "subject identification number" that is definitely "nominal" not "ordinal" or "interval". Subjects are pretty much always treated as Factors in ANOVAS.
However, I did try both of these changes, but it greatly distorted the ANOVA (try it yourself and compare). Likewise, it didn't seem to help. PERHAPS MORE DIRECTLY RELEVANT, when I try to create and apply my contrasts I am told that it cannot be done as my numerics need to be factors:
contrasts(tab20.09$trials) <- cbind(c(-3, -1, 1, 3), c(1, -1, -1, 1), c(-1, 3, -3, 1))
Error in `contrasts<-`(`*tmp*`, value = c(-3, -1, 1, 3, 1, -1, -1, 1, :
contrasts apply only to factors
STARTING OVER
I seem to make more progress using contr.poly as in
contrasts(tab20.09$trials) <- contr.poly(levels(tab20.09$trials))
The ANOVA doesn't change at all, so that is good, and when I do:
lm.model <- lm(responses ~ trials, data = tab20.09)
summary.lm(lm.model)
I get basically the same pattern as Keppel.
BUT, as I am interested in the linear trend of the interaction (treatments x trials), not just of trials, I tried this:
lm3 <- lm(responses ~ treatments*trials, data = tab20.09)
summary.lm(lm3)
and the main effect of "trials" goes away . . .
In Keppel's treatment, he calculated separate error terms for each contrast (i.e., linear, quadratic, and cubic) and used them for both the main effect of "trials" and the "treatments x trials" interaction.
I certainly could hand-calculate all of these things again. Perhaps I could even write R functions for the general case; however, it seems hard to believe that such a basic, core contrast for experimental psychology has not yet found an R implementation!?
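The closest built-in mechanism I have found so far, offered as a sketch I have not yet checked against Keppel's separate error terms, is the split argument of summary() for aov objects, which partitions each term's sums of squares by contrast within every Error stratum:
contrasts(tab20.09$trials) <- contr.poly(levels(tab20.09$trials))
aov.model.3 <- aov(responses ~ treatments * trials + Error(sid/trials), data = tab20.09)
summary(aov.model.3,
        split = list(trials = list(Linear = 1, Quadratic = 2, Cubic = 3)))
# the split applies to every term involving trials, so the
# treatments:trials interaction is partitioned as well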
Any help or suggestions would be greatly appreciated. Thanks. W
It looks like trials and sids are factors, but you are intending for them to be numeric/integer. Run sapply(tab20.09, class) to see if that's the case. That's what the output means; instead of fitting a continuous/count interaction, it's fitting a dummy variable for each level of each variable and computing all of the interactions between them.
To fix it, just reassign tab20.09$trials <- as.numeric(tab20.09$trials) and tab20.09$sids <- as.numeric(tab20.09$sids) in $ (list) syntax, or you can use matrix syntax like tab20.09[, c("trials", "sids")] <- apply(tab20.09[, c("trials", "sids")], 2, as.numeric). The first one is easier in this case, but you should be aware of the second as well.
