Issue with a multiple regression model in R - r

First let me apologize but I'm a biologist starting in the world of bioinformatics and therefore in R programming and statistics.
I have to do an analysis of a multilinear regression model with the data (Penta) from Library(mvdalav).
I have to try different models including the PLS model that is the model that is normally used for this data set (https://rdrr.io/cran/mvdalab/f/README.md)
However, they ask us to play with the data more models and I'm very lost as the data seems to always give me errors:
1) Normal multiple regression model:
> mod2<-mod1<-lm(Penta1$log.RAI~.,Penta1)
> summary(mod2)
Call:
lm(formula = Penta1$log.RAI ~ ., data = Penta1)
Residuals:
ALL 30 residuals are 0: no residual degrees of freedom!
Coefficients: (15 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.000e-01 NA NA NA
Obs.NameAAWAA 8.500e-01 NA NA NA
Obs.NameAAYAA 5.600e-01 NA NA NA
Obs.NameEKWAP 1.400e+00 NA NA NA
Obs.NameFEAAK 4.000e-01 NA NA NA
Obs.NameFSPFR 7.400e-01 NA NA NA
Obs.NameGEAAK -4.200e-01 NA NA NA
Obs.NameLEAAK 5.000e-01 NA NA NA
Obs.NamePGFSP 1.000e+00 NA NA NA
Obs.NameRKWAP 2.080e+00 NA NA NA
Obs.NameRYLPT 5.000e-01 NA NA NA
Obs.NameVAAAK 1.114e-15 NA NA NA
Obs.NameVAAWK 3.300e-01 NA NA NA
Obs.NameVAWAA 1.530e+00 NA NA NA
Obs.NameVAWAK 1.550e+00 NA NA NA
Obs.NameVEAAK 6.100e-01 NA NA NA
Obs.NameVEAAP 2.800e-01 NA NA NA
Obs.NameVEASK 3.000e-01 NA NA NA
Obs.NameVEFAK 1.670e+00 NA NA NA
Obs.NameVEGGK -9.000e-01 NA NA NA
Obs.NameVEHAK 1.630e+00 NA NA NA
Obs.NameVELAK 6.900e-01 NA NA NA
Obs.NameVESAK 3.800e-01 NA NA NA
Obs.NameVESSK 1.000e-01 NA NA NA
Obs.NameVEWAK 2.830e+00 NA NA NA
Obs.NameVEWVK 1.810e+00 NA NA NA
Obs.NameVKAAK 2.100e-01 NA NA NA
Obs.NameVKWAA 1.810e+00 NA NA NA
Obs.NameVKWAP 2.450e+00 NA NA NA
Obs.NameVWAAK 1.400e-01 NA NA NA
S1 NA NA NA NA
L1 NA NA NA NA
P1 NA NA NA NA
S2 NA NA NA NA
L2 NA NA NA NA
P2 NA NA NA NA
S3 NA NA NA NA
L3 NA NA NA NA
P3 NA NA NA NA
S4 NA NA NA NA
L4 NA NA NA NA
P4 NA NA NA NA
S5 NA NA NA NA
L5 NA NA NA NA
P5 NA NA NA NA
Residual standard error: NaN on 0 degrees of freedom
Multiple R-squared: 1, Adjusted R-squared: NaN
F-statistic: NaN on 29 and 0 DF, p-value: NA
2) Study the reduced model provided by the stepwise method. The aim is to compare the RMSE of the reduced model and the complete model for the training group and for the test group.
step(lm(log.RAI~.,data = penta),direction = "backward")
Error in step(lm(log.RAI ~ ., data = penta), direction = "backward") :
AIC is -infinity for this model, so 'step' cannot proceed
3)Find the best model by the criteria of the AIC and by the adjusted R2
4) PLS model --> what fits the data following:https://rdrr.io/cran/mvdalab/f/README.md
5)Also study it with the Ridge Regression method with the lm.ridge () function or similar
6) Finally we will study the LASSO method with the lars () function of Lasso project.
I'm super lost with why the data.frame gave those errors and also how to develop the analysis. Any help with any of the parts would be much appreciated
Kind regards

Ok after reading the vignette, Penta is some data obtained from drug discovery and the first column is the unique identifier. To do regression or downstream analysis you need to exclude this column. For the steps below, I simply do Penta[,-1] as input data
For the first part, this works:
library(mvdalab)
data(Penta)
summary(lm(log.RAI~.,data = Penta[,-1]))
Call:
lm(formula = log.RAI ~ ., data = Penta[, -1])
Residuals:
Min 1Q Median 3Q Max
-0.39269 -0.12958 -0.05101 0.07261 0.63414
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.80263 0.92495 -0.868 0.40016
S1 -0.09783 0.03895 -2.512 0.02489 *
L1 0.03236 0.04973 0.651 0.52576
P1 -0.10795 0.08521 -1.267 0.22587
S2 0.08670 0.04428 1.958 0.07043 .
Second part for AIC is ok as well:
step(lm(log.RAI~.,data = Penta[,-1]),direction="backward")
Start: AIC=-57.16
log.RAI ~ S1 + L1 + P1 + S2 + L2 + P2 + S3 + L3 + P3 + S4 + L4 +
P4 + S5 + L5 + P5
Df Sum of Sq RSS AIC
- P3 1 0.00150 1.5374 -59.132
- L4 1 0.00420 1.5401 -59.080
If you want to select model with AIC, the one above works. For adjusted R^2 i think most likely there are packages out there that does this
For lm.ridge, do the same:
library(MASS)
fit=lm.ridge(log.RAI~.,data = Penta[,-1])
For lars, lasso, you need to have the predictors etc in a matrix, so let's do
library(lars)
data = as.matrix(Penta[,-1])
fit = lars(x=data[,-ncol(data)],y=data[,"log.RAI"],type="lasso")

Related

clogit output producing NAs but partially show up when I use cbind

(updated)
Hey Stack Overflowers,
I'm trying to run a series of MLM fixed-effect logistic regressions with R's clogit function. When I add additional covariates to my model, the summary output shows NAs. But, when I used the cbind function, some of the missing covariate coefficients show up.
Here's my model 1 equation and output:
> model1 <- clogit(chldwork~lag_aspgrade_binned+age+strata(childid), data=finaletdtlag, method = 'exact')
> summary(model1)
Call:
coxph(formula = Surv(rep(1, 2686L), chldwork) ~ lag_aspgrade_binned +
age + strata(childid), data = finaletdtlag, method = "exact")
n= 2686, number of events= 2287
coef exp(coef) se(coef) z Pr(>|z|)
lag_aspgrade_binnedhigh school 1.04156 2.83363 0.52572 1.981 0.04757 *
lag_aspgrade_binnedno primary 1.31891 3.73935 0.89010 1.482 0.13841
lag_aspgrade_binnedprimary some hs 0.85000 2.33964 0.56244 1.511 0.13072
lag_aspgrade_binnedsome college 1.28607 3.61855 0.41733 3.082 0.00206 **
age -0.39600 0.67301 0.03105 -12.753 < 2e-16 ***
Here's my model two equation:
model2
<- clogit(chldwork~lag_aspgrade_binned+age+sex+chldeth+typesite+selfwlth+enroll+strata(childid), data=finaletdtlag, method = 'exact')
summary(model2)
Here's the summary output:
> summary(model2)
Call:
coxph(formula = Surv(rep(1, 2686L), chldwork) ~ lag_aspgrade_binned +
age + sex + chldeth + typesite + selfwlth + enroll + strata(childid),
data = finaletdtlag, method = "efron")
n= 2675, number of events= 2277
(11 observations deleted due to missingness)
coef exp(coef) se(coef) z Pr(>|z|)
lag_aspgrade_binnedhigh school 0.32943 1.39018 0.13933 2.364 0.0181 *
lag_aspgrade_binnedno primary 0.46553 1.59286 0.25154 1.851 0.0642 .
lag_aspgrade_binnedprimary some hs 0.33477 1.39762 0.15728 2.128 0.0333 *
lag_aspgrade_binnedsome college 0.36268 1.43718 0.11792 3.076 0.0021 **
age -0.07638 0.92647 0.01020 -7.486 7.11e-14 ***
sex1 NA NA 0.00000 NA NA
chldeth2 NA NA 0.00000 NA NA
chldeth3 NA NA 0.00000 NA NA
chldeth4 NA NA 0.00000 NA NA
chldeth6 NA NA 0.00000 NA NA
chldeth7 NA NA 0.00000 NA NA
chldeth8 NA NA 0.00000 NA NA
chldeth9 NA NA 0.00000 NA NA
chldeth99 NA NA 0.00000 NA NA
typesite1 NA NA 0.00000 NA NA
selfwlth1 0.04031 1.04113 0.29201 0.138 0.8902
selfwlth2 0.11971 1.12717 0.28736 0.417 0.6770
selfwlth3 0.07928 1.08251 0.29189 0.272 0.7859
selfwlth4 0.05717 1.05884 0.30231 0.189 0.8500
selfwlth5 0.39709 1.48750 0.43653 0.910 0.3630
selfwlth99 NA NA 0.00000 NA NA
enroll1 -0.20443 0.81511 0.08890 -2.300 0.0215 *
enroll88 NA NA 0.00000 NA NA
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
But here's what happens when I use the cbind function to show all of my models next to each other. Note, that coefficients sex through chldeth99 are not in model one.
cbind results:
> cbind(coef(model1), (coef(model2)), coef(model3)) #creating side by side list of all model coefficients
[,1] [,2] [,3]
lag_aspgrade_binnedhigh school 1.0415583 0.27198991 0.32827106
lag_aspgrade_binnedno primary 1.3189131 0.37986205 0.46103492
lag_aspgrade_binnedprimary some hs 0.8499958 0.27831739 0.33256493
lag_aspgrade_binnedsome college 1.2860726 0.30089261 0.36214068
age -0.3960015 -0.06233958 -0.07653464
sex1 1.0415583 NA NA
chldeth2 1.3189131 NA NA
chldeth3 0.8499958 NA NA
chldeth4 1.2860726 NA NA
chldeth6 -0.3960015 NA NA
chldeth7 1.0415583 NA NA
chldeth8 1.3189131 NA NA
chldeth9 0.8499958 NA NA
chldeth99 1.2860726 NA NA
typesite1 -0.3960015 NA NA
selfwlth1 1.0415583 0.03245507 0.04424493
selfwlth2 1.3189131 0.09775395 0.12743276
selfwlth3 0.8499958 0.06499650 0.08854499
selfwlth4 1.2860726 0.05038224 0.07092755
selfwlth5 -0.3960015 0.32162830 0.38079232
selfwlth99 1.0415583 NA NA
enroll1 1.3189131 -0.16966609 -0.30366842
enroll88 0.8499958 NA NA
sex1:enroll1 1.2860726 0.27198991 0.24088361
sex1:enroll88 -0.3960015 0.37986205 NA
Much gratitude for any insights. Wishing you all the best as another year wraps up--special shoutout to those currently in school still on the grind.

PSCL Returning NAs for zeroinfl negbin model

I am trying to run a Zero-Inflated Negative Binomial Count Model on some data containing the number of campaign visits by a politician by county. (Log Liklihood tests indicate Negative Binomial is correct, Vuong test suggests Zero-Inflated, though that could be thrown off by the fact that my Zero-Inflated model is clearly not converging.) I am using the pscl package in R. The problem is that when I run
Call:
zeroinfl(formula = Sanders_Adjacent_Clinton_Visit ~ Relative_Divisiveness + Obama_General_Percent_12 +
Percent_Over_65 + Percent_F + Percent_White + Percent_HS + Per_Capita_Income +
Poverty_Rate + MRP_Ideology_Mean + Swing_State, data = Unity_Data, dist = "negbin")
Pearson residuals:
Min 1Q Median 3Q Max
-0.96406 -0.24339 -0.11744 -0.03183 16.21356
Count model coefficients (negbin with log link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.216e+01 NA NA NA
Relative_Divisiveness -3.831e-01 NA NA NA
Obama_General_Percent_12 1.904e+00 NA NA NA
Percent_Over_65 -4.848e-02 NA NA NA
Percent_F 1.737e-01 NA NA NA
Percent_White 2.980e+00 NA NA NA
Percent_HS -3.563e-02 NA NA NA
Per_Capita_Income 7.413e-05 NA NA NA
Poverty_Rate -2.273e-02 NA NA NA
MRP_Ideology_Mean -8.316e-01 NA NA NA
Swing_State 1.580e+00 NA NA NA
Log(theta) 9.595e+00 NA NA NA
Zero-inflation model coefficients (binomial with logit link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.024e+02 NA NA NA
Relative_Divisiveness -3.265e+00 NA NA NA
Obama_General_Percent_12 -2.300e+01 NA NA NA
Percent_Over_65 -7.768e-02 NA NA NA
Percent_F 2.873e+00 NA NA NA
Percent_White 5.156e+00 NA NA NA
Percent_HS -5.097e-01 NA NA NA
Per_Capita_Income 2.831e-04 NA NA NA
Poverty_Rate 1.391e-02 NA NA NA
MRP_Ideology_Mean -2.569e+00 NA NA NA
Swing_State 5.075e-01 NA NA NA
Theta = 14696.9932
Number of iterations in BFGS optimization: 94
Log-likelihood: -596.5 on 23 Df
Obviously, all of those NA's are less then helpful to me. Any advice would be greatly appreciated! I'm pretty novice at R, StackOverflow, and Statistics, but trying to learn. I'm trying to provide everything needed for the minimal reproducible example, but I don't see anywhere to share my actual data... so if that's something you need in order to answer the question, let me know where I can put it!

Going through the xgboostExplainer package: running into errors from github page

I am currently trying to work with the new xgboostExplainer package.
I am following the githib page here https://github.com/AppliedDataSciencePartners/xgboostExplainer/blob/master/R/explainPredictions.R
on line 34, the xgboost model is ran:
xgb.model <- xgboost(param =param, data = xgb.train.data, nrounds=3)
However on line 43 I am running into some problems.
explainer = buildExplainer(xgb.model,xgb.train.data, type="binary", base_score = 0.5, n_first_tree = xgb.model$best_ntreelimit - 1)
I understand that n_first_tree is depreciated but I cannot seem to access the xgb.model$best_ntreelimit -1 part.
The sections I can access in xgboost are;
handle, raw, niter, evaluation_log, call, params, callbacks, feature_names
not best_ntreelimit
Has somebody else ran into this issue.
EDIT:
Output of the showWaterfall()
Extracting the breakdown of each prediction...
|=============================================================| 100%
DONE!
Prediction: NA
Weight: NA
Breakdown
intercept cap-shape=bell
NA NA
cap-shape=conical cap-shape=convex
NA NA
cap-shape=flat cap-shape=knobbed
NA NA
cap-shape=sunken cap-surface=fibrous
NA NA
cap-surface=grooves cap-surface=scaly
NA NA
cap-surface=smooth cap-color=brown
NA NA
cap-color=buff cap-color=cinnamon
NA NA
cap-color=gray cap-color=green
NA NA
cap-color=pink cap-color=purple
NA NA
cap-color=red cap-color=white
NA NA
cap-color=yellow bruises?=bruises
NA NA
bruises?=no odor=almond
NA NA
odor=anise odor=creosote
NA NA
odor=fishy odor=foul
NA NA
odor=musty odor=none
NA NA
odor=pungent odor=spicy
NA NA
gill-attachment=attached gill-attachment=descending
NA NA
gill-attachment=free gill-attachment=notched
NA NA
gill-spacing=close gill-spacing=crowded
NA NA
gill-spacing=distant gill-size=broad
NA NA
gill-size=narrow gill-color=black
NA NA
gill-color=brown gill-color=buff
NA NA
gill-color=chocolate gill-color=gray
NA NA
gill-color=green gill-color=orange
NA NA
gill-color=pink gill-color=purple
NA NA
gill-color=red gill-color=white
NA NA
gill-color=yellow stalk-shape=enlarging
NA NA
stalk-shape=tapering stalk-root=bulbous
NA NA
stalk-root=club stalk-root=cup
NA NA
stalk-root=equal stalk-root=rhizomorphs
NA NA
stalk-root=rooted stalk-root=missing
NA NA
stalk-surface-above-ring=fibrous stalk-surface-above-ring=scaly
NA NA
stalk-surface-above-ring=silky stalk-surface-above-ring=smooth
NA NA
stalk-surface-below-ring=fibrous stalk-surface-below-ring=scaly
NA NA
stalk-surface-below-ring=silky stalk-surface-below-ring=smooth
NA NA
stalk-color-above-ring=brown stalk-color-above-ring=buff
NA NA
stalk-color-above-ring=cinnamon stalk-color-above-ring=gray
NA NA
stalk-color-above-ring=orange stalk-color-above-ring=pink
NA NA
stalk-color-above-ring=red stalk-color-above-ring=white
NA NA
stalk-color-above-ring=yellow stalk-color-below-ring=brown
NA NA
stalk-color-below-ring=buff stalk-color-below-ring=cinnamon
NA NA
stalk-color-below-ring=gray stalk-color-below-ring=orange
NA NA
stalk-color-below-ring=pink stalk-color-below-ring=red
NA NA
stalk-color-below-ring=white stalk-color-below-ring=yellow
NA NA
veil-type=partial veil-type=universal
NA NA
veil-color=brown veil-color=orange
NA NA
veil-color=white veil-color=yellow
NA NA
ring-number=none ring-number=one
NA NA
ring-number=two ring-type=cobwebby
NA NA
ring-type=evanescent ring-type=flaring
NA NA
ring-type=large ring-type=none
NA NA
ring-type=pendant ring-type=sheathing
NA NA
ring-type=zone spore-print-color=black
NA NA
spore-print-color=brown spore-print-color=buff
NA NA
spore-print-color=chocolate spore-print-color=green
NA NA
spore-print-color=orange spore-print-color=purple
NA NA
spore-print-color=white spore-print-color=yellow
NA NA
population=abundant population=clustered
NA NA
population=numerous population=scattered
NA NA
population=several population=solitary
NA NA
habitat=grasses habitat=leaves
NA NA
habitat=meadows habitat=paths
NA NA
habitat=urban habitat=waste
NA NA
habitat=woods
NA
-3.89182 -3.178054 -2.751535 -2.442347 -2.197225 -1.99243 -1.81529 -1.658228 -1.516347 -1.386294 -1.265666 -1.15268 -1.045969 -0.9444616 -0.8472979 -0.7537718 -0.6632942 -0.5753641 -0.4895482 -0.4054651 -0.3227734 -0.2411621 -0.1603427 -0.08004271 0 0.08004271 0.1603427 0.2411621 0.3227734 0.4054651 0.4895482 0.5753641 0.6632942 0.7537718 0.8472979 0.9444616 1.045969 1.15268 1.265666 1.386294 1.516347 1.658228 1.81529 1.99243 2.197225 2.442347 2.751535 3.178054 3.89182
Error in if (abs(values[i]) > put_rect_text_outside_when_value_below) { :
missing value where TRUE/FALSE needed
EDIT: Here is the code I ran:
library(xgboost)
data(agaricus.train, package='xgboost')
data(agaricus.test, package='xgboost')
train <- agaricus.train
test <- agaricus.test
xgb.train.data <- xgb.DMatrix(train$data, label = train$label)
xgb.test.data <- xgb.DMatrix(test$data, label = test$label)
param <- list(objective = "binary:logistic")
model.cv <- xgb.cv(param = param,
data = xgb.train.data,
nrounds = 500,
early_stopping_rounds = 10,
nfold = 3)
model.cv$best_ntreelimit
xgb.model <- xgboost(param =param, data = xgb.train.data, nrounds = 10)
explained <- buildExplainer(xgb.model, xgb.train.data, type="binary", base_score = 0.5, n_first_tree = 9)
pred.breakdown = explainPredictions(xgb.model,
explained,
xgb.test.data)
showWaterfall(xgb.model,
explained,
xgb.test.data, test$data, 2, type = "binary")
I tested the code in the linked page.
best_ntreelimit is a parameter returned by xgb.cv when early_stopping_rounds is set. From the help of xgb.cv:
best_ntreelimit the ntreelimit value corresponding to the best
iteration, which could further be used in predict method (only
available with early stopping).
You can get to it by using xgb.cv:
library(xgboost)
data(agaricus.train, package='xgboost')
data(agaricus.test, package='xgboost')
train <- agaricus.train
test <- agaricus.test
xgb.train.data <- xgb.DMatrix(train$data, label = train$label)
param <- list(objective = "binary:logistic")
model.cv <- xgb.cv(param = param,
data = xgb.train.data,
nrounds = 500,
early_stopping_rounds = 10,
nfold = 3)
model.cv$best_ntreelimit
#output
9
However output of xgb.cv can not be used to build an explainer.
So you need:
xgb.model <- xgboost(param =param, data = xgb.train.data, nrounds = 10)
and set the n_first_tree to an integer:
explained <- buildExplainer(xgb.model, xgb.train.data, type="binary", base_score = 0.5, n_first_tree = 9)
EDIT: I failed to paste the following code:
xgb.test.data <- xgb.DMatrix(test$data, label = test$label)
pred.breakdown = explainPredictions(xgb.model,
explained,
xgb.test.data)
and now you can do:
showWaterfall(xgb.model,
explained,
xgb.test.data, test$data, 2, type = "binary")

mlogit() outputs NA's for nested logit model in R

I am trying to estimate a nested logit model of company siting choices, with nests = countries and alternatives = provinces, based on a number of alternative-specific characteristics as well as some company-specific characteristics. I formatted my data to a "long" structure using:
data <- mlogit.data(DB, choice="Occurrence", shape="long", chid.var="IDP", varying=6:ncol(DB), alt.var="Prov")
Here's a sample of the data:
IDP Occurrence From Prov ToC Dist Price Yield
5p1.APY 5p1 FALSE Sao Paulo APY PY 0.0000000 0.3698913 0.0000000
5p1.BOQ 5p1 FALSE Sao Paulo BOQ PY 0.6495493 0.3698913 0.0000000
5p1.CHA 5p1 FALSE Sao Paulo CHA AR 0.7870593 0.4622464 0.4461496
5p1.COR 5p1 FALSE Sao Paulo COR AR 0.3747480 0.4622464 0.5536546
5p1.FOR 5p1 FALSE Sao Paulo FOR AR 0.6822188 0.4622464 0.4402772
5p1.JUY 5p1 FALSE Sao Paulo JUY AR 1.0000000 0.4622464 0.3617038
Note that I've reduced the table to a few variables for clarity but would normally use more.
The code I use for the nested logit is the following:
nests <- list(Bolivia="SCZ",Paraguay=c("PHY","BOQ","APY"),Argentina=c("CHA","COR","FOR","JUY","SAL","SFE","SDE"))
nml <- mlogit(Occurrence ~ DistComp + PriceComp + YieldComp, data=data, nests=nests, unscaled=T)
summary(nml)
When running this model, I get the following output:
> summary(nml)
Call:
mlogit(formula = Occurrence ~ DistComp + PriceComp + YieldComp,
data = data, nests = nests, unscaled = T)
Frequencies of alternatives:
APY BOQ CHA COR FOR JUY PHY
SAL SCZ SDE SFE
0.1000000 0.0666667 0.1333333 0.0250000 0.0750000 0.0083333 0.0083333
0.1166667 0.2583333 0.1750000 0.0333333
bfgs method
1 iterations, 0h:0m:0s
g'(-H)^-1g = 1E+10
last step couldn't find higher value
Coefficients :
Estimate Std. Error t-value Pr(>|t|)
BOQ:(intercept) -0.29923 NA NA NA
CHA:(intercept) -1.25406 NA NA NA
COR:(intercept) -1.76020 NA NA NA
FOR:(intercept) -1.97083 NA NA NA
JUY:(intercept) -4.14476 NA NA NA
PHY:(intercept) -2.63961 NA NA NA
SAL:(intercept) -1.72047 NA NA NA
SCZ:(intercept) -0.15714 NA NA NA
SDE:(intercept) -0.57449 NA NA NA
SFE:(intercept) -2.47345 NA NA NA
DistComp 2.44322 NA NA NA
PriceComp 2.45202 NA NA NA
YieldComp 3.15611 NA NA NA
iv.Bolivia 1.00000 NA NA NA
iv.Paraguay 1.00000 NA NA NA
iv.Argentina 1.00000 NA NA NA
Log-Likelihood: -221.84
McFadden R^2: 0.10453
Likelihood ratio test : chisq = 51.79 (p.value = 2.0552e-09)
I don't understand what causes the NAs in the output, considering that I prepared the data using mlogit.data(). Any help on this would be greatly appreciated.
Best,
Yann

Trend analysis for ANOVA with both btw-Ss and within-Ss factors

I want to do a trend analysis for an ANOVA that has both btw-Ss and within-Ss factors.
The btw factors are "treatments"
The within factors are "trials".
test.data <- data.frame(sid = rep(c("s1", "s2", "s3", "s4", "s5"), each = 4),
treatments = rep(c("a1", "a2"), each = 20),
trials = rep(c("b1", "b2", "b3", "b4"), 10),
responses = c(3,5,9,6,7,11,12,11,9,13,14,12,4,8,11,7,1,3,5,4,5,6,11,7,10,12,18,15,10,15,15,14,6,9,13,9,3,5,9,7))}
The ANOVA matches the one in the textbook (Keppel, 1973) exactly:
aov.model.1 <- aov(responses ~ treatments*trials + Error(sid/trials), data=tab20.09)
What I am having trouble with is the trend analysis. I want to look at the linear, quadratic, and cubic trends for “trials”. Would also be nice to look at these same trends for “treatments x trials”.
I have set up the contrasts for the trend analyses as:
contrasts(tab20.09$trials) <- cbind(c(-3, -1, 1, 3), c(1, -1, -1, 1), c(-1, 3, -3, 1))
contrasts(tab20.09$trials)
[,1] [,2] [,3]
b1 -3 1 -1
b2 -1 -1 3
b3 1 -1 -3
b4 3 1 1
for the linear, quadratic, and cubic trends.
According to Keppel the results for the trends should be:
TRIALS:
SS df MS F
(Trials) (175.70) 3
Linear 87.12 1 87.12 193.60
Quadratic 72.90 1 72.90 125.69
Cubic 15.68 1 15.68 9.50
TREATMENTS X TRIALS
SS df MS F
(Trtmt x Trials)
(3.40) 3
Linear 0.98 1 0.98 2.18
Quadratic 0.00 1 0.00 <1
Cubic 2.42 1 2.42 1.47
ERROR TERMS
(21.40) (24)
Linear 3.60 8 0.45
Quadratic 4.60 8 0.58
Cubic 13.20 8 1.65
I have faith in his answers as once upon the time I had to derive them myself using a 6 function calculator supplemented by paper and pencil. However, when I do this:
contrasts(tab20.09$trials) <- cbind(c(-3, -1, 1, 3), c(1, -1, -1, 1), c(-1, 3, -3, 1))
aov.model.2 <- aov(responses ~ treatments*trials + Error(sid/trials), data=tab20.09)
summary(lm(aov.model.2))
what I get seems not to make sense.
summary(lm(aov.model.2))
Call:
lm(formula = aov.model.2)
Residuals:
ALL 40 residuals are 0: no residual degrees of freedom!
Coefficients: (4 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.750e+00 NA NA NA
treatmentsa2 3.500e+00 NA NA NA
trials1 6.500e-01 NA NA NA
trials2 -1.250e+00 NA NA NA
trials3 -4.500e-01 NA NA NA
sids10 -3.250e+00 NA NA NA
sids2 4.500e+00 NA NA NA
sids3 6.250e+00 NA NA NA
sids4 1.750e+00 NA NA NA
sids5 -2.500e+00 NA NA NA
sids6 -2.000e+00 NA NA NA
sids7 4.500e+00 NA NA NA
sids8 4.250e+00 NA NA NA
sids9 NA NA NA NA
treatmentsa2:trials1 2.120e-16 NA NA NA
treatmentsa2:trials2 -5.000e-01 NA NA NA
treatmentsa2:trials3 5.217e-16 NA NA NA
trials1:sids10 1.500e-01 NA NA NA
trials2:sids10 7.500e-01 NA NA NA
trials3:sids10 5.000e-02 NA NA NA
trials1:sids2 -1.041e-16 NA NA NA
trials2:sids2 -2.638e-16 NA NA NA
trials3:sids2 5.000e-01 NA NA NA
trials1:sids3 -1.500e-01 NA NA NA
trials2:sids3 -2.500e-01 NA NA NA
trials3:sids3 4.500e-01 NA NA NA
trials1:sids4 -5.000e-02 NA NA NA
trials2:sids4 -7.500e-01 NA NA NA
trials3:sids4 1.500e-01 NA NA NA
trials1:sids5 -1.000e-01 NA NA NA
trials2:sids5 5.000e-01 NA NA NA
trials3:sids5 3.000e-01 NA NA NA
trials1:sids6 -1.000e-01 NA NA NA
trials2:sids6 5.000e-01 NA NA NA
trials3:sids6 -2.000e-01 NA NA NA
trials1:sids7 4.000e-01 NA NA NA
trials2:sids7 5.000e-01 NA NA NA
trials3:sids7 -2.000e-01 NA NA NA
trials1:sids8 -5.000e-02 NA NA NA
trials2:sids8 2.500e-01 NA NA NA
trials3:sids8 6.500e-01 NA NA NA
trials1:sids9 NA NA NA NA
trials2:sids9 NA NA NA NA
trials3:sids9 NA NA NA NA
Residual standard error: NaN on 0 degrees of freedom
Multiple R-squared: 1, Adjusted R-squared: NaN
F-statistic: NaN on 39 and 0 DF, p-value: NA
Any ideas what I am doing wrong? I suspect there is some problem with “lm” and the ANOVA but I don’t know what and I don’t know how to put in my trend analyses.
###### MORE DETAILS in response to ssdecontrol's response
Well, "trials" is a factor as it codes four levels of experience that are being manipulated. Likewise "sid" is the "subject identification number" that is definitely "nominal" not "ordinal" or "interval". Subjects are pretty much always treated as Factors in ANOVAS.
However, I did try both of these changes, but it greatly distorted the ANOVA (try it yourself and compare). Likewise, it didn't seem to help. PERHAPS MORE DIRECTLY RELEVANT, when I try to create and apply my contrasts I am told that it cannot be done as my numerics need to be factors:
contrasts(tab20.09$trials) <- cbind(c(-3, -1, 1, 3), c(1, -1, -1, 1), c(-1, 3, -3, 1))
Error in `contrasts<-`(`*tmp*`, value = c(-3, -1, 1, 3, 1, -1, -1, 1, :
contrasts apply only to factors
STARTING OVER
I seem to make more progress using contr.poly as in
contrasts(tab20.09$trials) <- contr.poly(levels(tab20.09$trials))
The ANOVA doesn't change at all. So that is good and when I do:
lm.model <- lm(responses ~ trials, data = tab20.09)
summary.lm(lm.model)
I get basically the same pattern as Keppel.
BUT, as I am interested in the linear trend of the interaction (treatments x trials), not just on trials, I tried this:
lm3 <- lm(responses ~ treatments*trials, data = tab20.09)
summary.lm(lm3)
and the ME of "trials" goes away . . .
In Keppel’s treatment, he calculated separate error terms for each contrast (i.e., Linear, Quadratic, and Cubic) and used that on both the main effect of “trial” as well as on the “treatment x trial” interaction.
I certainly could hand calculate all of these things again. Perhaps I could even write R functions for the general case; however, it seems difficult to believe that such a basic core contrast for experimental psychology has not yet found an R implementation!!??
Any help or suggestions would be greatly appreciated. Thanks. W
It looks like trials and sids are factors, but you are intending for them to be numeric/integer. Run sapply(tab20.09, class) to see if that's the case. That's what the output means; instead of fitting a continuous/count interaction, it's fitting a dummy variable for each level of each variable and computing all of the interactions between them.
To fix it, just reassign tab20.09$trials <- as.numeric(tab20.09$trials) and tab20.09$sids <- as.numeric(tab20.09$sids) in list syntax, or you can use matrix syntax like tab20.09[, c("trials", "sids")] <- apply(tab20.09[, c("trials", "sids")], 2, as.numeric). The first one is easier in this case, but you should be aware of the second one as well.

Resources