calculating adjusted R-Squared in R

I have the following dataset and I would like to calculate the adjusted R-squared from it.
I have the formula for adjusted R-squared: Adjusted R2 = 1 - [(1 - R2) * (n - 1) / (n - k - 1)],
where:
R2: the R-squared
n: the number of observations, here the rows of "DV.obs"
k: the number of predictor variables, here "nParam" (which is either 0, 1, 2, or 3)
The R code to calculate it is below. It is grouped by "ITER" (iterations), and we have 4 iterations, so the idea is to calculate the adjusted R-squared per iteration:
for iteration 1 nParam should only be 0, for iteration 2 nParam should only be 1, and so on, instead of using every nParam value in the dataset, since nParam is exactly the same within each iteration.
The output should be only 4 rows (one for every iteration, as it is grouped by ITER) and 2 columns (R2 and adjusted R-squared), not one result for every row in the data.
I hope I have explained myself well.
library(dplyr)
ff <- df %>%
  group_by(ITER) %>%
  summarise(
    Rsq = cor(x = DV.obs, y = DV.sim)^2,
    adjRsq = 1 - ((1 - Rsq) * (length(DV.obs) - 1) / (length(DV.obs) - nParam - 1))
  )
ff
However, this formula uses every row's value of nParam, so I get a result for every row instead of one per iteration. Here is the data:
df<-structure(list(CASE = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L), ITER = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L,
3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L,
4L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L,
3L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L,
4L), nParam = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L
), DV.obs = c(0.101483807, 0.069196694, 0.053869542, 0.043831971,
0.030330271, 0.023612088, 0.01978679, 0.014310351, 0.01164389,
0.007267871, 0.004536453, 0.002873573, 0.002408037, 0.001417053,
0.001136154, 0.101483807, 0.069196694, 0.053869542, 0.043831971,
0.030330271, 0.023612088, 0.01978679, 0.014310351, 0.01164389,
0.007267871, 0.004536453, 0.002873573, 0.002408037, 0.001417053,
0.001136154, 0.101483807, 0.069196694, 0.053869542, 0.043831971,
0.030330271, 0.023612088, 0.01978679, 0.014310351, 0.01164389,
0.007267871, 0.004536453, 0.002873573, 0.002408037, 0.001417053,
0.001136154, 0.101483807, 0.069196694, 0.053869542, 0.043831971,
0.030330271, 0.023612088, 0.01978679, 0.014310351, 0.01164389,
0.007267871, 0.004536453, 0.002873573, 0.002408037, 0.001417053,
0.001136154, 0.000116054, 0.003829787, 0.01206963, 0.02088975,
0.027388781, 0.03423598, 0.037833661, 0.037369438, 0.035164408,
0.034584139, 0.02947776, 0.023210831, 0.014622821, 0.009632495,
0.006731141, 0.0027853, 0.000116054, 0.003829787, 0.01206963,
0.02088975, 0.027388781, 0.03423598, 0.037833661, 0.037369438,
0.035164408, 0.034584139, 0.02947776, 0.023210831, 0.014622821,
0.009632495, 0.006731141, 0.0027853, 0.000116054, 0.003829787,
0.01206963, 0.02088975, 0.027388781, 0.03423598, 0.037833661,
0.037369438, 0.035164408, 0.034584139, 0.02947776, 0.023210831,
0.014622821, 0.009632495, 0.006731141, 0.0027853, 0.000116054,
0.003829787, 0.01206963, 0.02088975, 0.027388781, 0.03423598,
0.037833661, 0.037369438, 0.035164408, 0.034584139, 0.02947776,
0.023210831, 0.014622821, 0.009632495, 0.006731141, 0.0027853
), DV.sim = c(0, 0.0889808909410658, 0.0947484349571132, 0.0798169790285827,
0.0574006922793388, 0.0505799935506284, 0.0468774569150804, 0.0417447990739346,
0.0375742405164242, 0.0306761993989349, 0.0251120797996223, 0.0205737193532288,
0.0168649279846251, 0.0138327510148287, 0.0113531698574871, 0,
0.0829660195227578, 0.0876380159497916, 0.0723450386112931, 0.0464863987773657,
0.0380595525625348, 0.0343245102453232, 0.0307144539731741, 0.0283392784461379,
0.0245820489723981, 0.0214487023548782, 0.0187365858632326, 0.0163729577744008,
0.0143107050991059, 0.0125108672587574, 0, 0.0762191578459362,
0.0737615750578683, 0.0549565160764756, 0.0280085518714786, 0.0206076781625301,
0.0172540310333669, 0.0134899928846955, 0.0108952926749736, 0.00728254194885496,
0.00491441482789815, 0.00332488210681827, 0.00225250494349749,
0.00152820673925803, 0.00103880306820386, 0, 0.0329456788891303,
0.0365534415712808, 0.03318406650424, 0.0278133129626513, 0.0238151342895627,
0.0205330317793787, 0.0155563822799921, 0.0119589968463779, 0.0072024345056713,
0.00437676923945547, 0.00266755578568207, 0.00162810577310623,
0.000994532813206324, 0.000607859854716811, 0, 0.00238890872602278,
0.02000716184065, 0.0509446502289174, 0.0907202677155637, 0.173563302880525,
0.223891823887825, 0.2226231635499, 0.19175603264451, 0.168494781267643,
0.150974664176703, 0.136206244819164, 0.111464575245381, 0.0913691590994598,
0.0749306779146197, 0.0504548476848009, 0, 0.00141190656836649,
0.0124264488774641, 0.0328390336436031, 0.0603613019163447, 0.123470497330427,
0.172404586815834, 0.178024356626272, 0.151606226187945, 0.130227694458962,
0.117105708281994, 0.107832603356838, 0.0935153502613309, 0.081651206263304,
0.0713645335614684, 0.0545446672743561, 0, 0.00122455342249632,
0.00957195676775054, 0.0233009280455857, 0.0398901057214595,
0.069490838356018, 0.0753487069702148, 0.0619427798080445, 0.0388082119899989,
0.0282194718351961, 0.0223033058814705, 0.0181158699408174, 0.012206885059923,
0.00828045272134247, 0.00562572468560191, 0.00260434861259537,
0, 0.00337575118759914, 0.0123247819279197, 0.0212808990854769,
0.0292664165479362, 0.0407316533482074, 0.0457373328155279, 0.0440263413557409,
0.0350818961969019, 0.0268987657874823, 0.0206920115460456, 0.0160182394650579,
0.00970028643496338, 0.00590740063816313, 0.00360522091817113,
0.00134665597468616)), row.names = c(NA, 124L), class = "data.frame")

You could add distinct(ITER, .keep_all = TRUE) to keep just one row per iteration:
library(tidyverse)
df %>%
  group_by(ITER) %>%
  summarise(
    Rsq = cor(x = DV.obs, y = DV.sim)^2,
    adjRsq = 1 - ((1 - Rsq) * (length(DV.obs) - 1) / (length(DV.obs) - nParam - 1))
  ) %>%
  distinct(ITER, .keep_all = TRUE)
#> `summarise()` has grouped output by 'ITER'. You can override using the
#> `.groups` argument.
#> # A tibble: 4 × 3
#> # Groups: ITER [4]
#> ITER Rsq adjRsq
#> <int> <dbl> <dbl>
#> 1 1 0.113 0.113
#> 2 2 0.116 0.0858
#> 3 3 0.334 0.286
#> 4 4 0.268 0.187

The issue is that you get a value per row because you are using the nParam column to compute the adjusted R^2 without any aggregating operation. This can be fixed by using unique(nParam) to "aggregate" nParam to just one value per group:
library(dplyr)
df %>%
  group_by(ITER) %>%
  summarise(
    Rsq = cor(x = DV.obs, y = DV.sim)^2,
    adjRsq = 1 - ((1 - Rsq) * (n() - 1) / (n() - unique(nParam) - 1))
  )
#> # A tibble: 4 × 3
#> ITER Rsq adjRsq
#> <int> <dbl> <dbl>
#> 1 1 0.113 0.113
#> 2 2 0.116 0.0858
#> 3 3 0.334 0.286
#> 4 4 0.268 0.187

Related

getting Error in svd(X) : infinite or missing values in 'x' when using summary of regression model

I am getting the error
Error in svd(X) : infinite or missing values in 'x'
when calling summary() on an ordinal regression model. This is my code:
library(MASS)
a <- dget('dput.txt')
lep <- polr(bmicat2 ~ Leptin, data = a, Hess = TRUE)
summary(lep)
Error in svd(X) : infinite or missing values in 'x'
Sample data is given below to replicate the error. Can someone please help?
a <- structure(list(bmicat2 = structure(c(1L, 3L, 2L, 1L, 1L, 1L,
1L, 1L, 3L, 3L, 3L, 3L, 3L, 1L, 2L, 3L, 2L, 3L, 3L, 3L, 3L, 1L,
3L, 3L, 1L, 1L, 2L, 1L, 2L, 2L, 3L, 1L, 1L, 1L, 1L, 1L, 2L, 3L,
2L, 1L, 1L, 3L, 3L, 2L, 1L, 1L, 1L, 2L, 1L, 3L, 1L, 3L, 1L, 1L,
1L, 2L, 1L, 3L, 2L, 1L, 1L, 2L, 1L, 3L, 1L, 1L, 1L, 3L, 1L, 3L,
3L, 3L, 2L, 3L, 1L, 3L, 3L, 3L, 2L, 2L, 1L, 2L, 2L, 1L, 3L, 1L,
1L, 1L, 2L, 2L, 1L, 1L, 1L, 3L, 3L, 3L, 3L, 2L, 3L, 1L, 2L, 1L,
2L, 1L, 1L, 1L, 1L, 3L, 1L, 1L, 3L, 1L, 1L, 2L, 2L, 2L, 1L, 3L,
3L, 3L, 3L, 3L, 2L, 3L, 1L, 2L, 1L, 1L, 2L, 1L, 2L, 1L, 1L, 1L,
1L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 3L, 1L, 1L, 1L, 1L, 2L, 3L, 2L,
2L, 1L, 1L, 1L, 3L, 2L, 1L, 3L, 2L, 2L, 2L, 2L, 1L, 3L, 1L, 3L,
1L, 1L, 1L, 3L, 1L, 1L, 1L, 3L, 1L, 3L, 3L, 1L, 1L, 1L, 1L, 3L,
1L, 2L, 1L, 2L, 1L, 1L, 3L, 1L, 2L, 1L, 2L, 1L, 3L, 2L, 1L, 1L,
1L, 3L, 1L, 1L, 2L, 2L, 3L, 1L, 2L, 1L, 1L, 1L, 3L, 1L, 1L, 3L,
1L, 3L, 1L, 3L, 3L, 3L, 1L, 2L, 1L, 3L, 1L, 3L, 2L, 1L, 3L, 3L,
1L, 2L, 3L, 3L, 1L, 2L, 1L, 3L, 1L, 3L, 1L, 1L, 3L, 1L, 1L, 1L,
1L, 1L, 1L, 3L, 1L, 3L, 3L, 1L, 1L, 1L, 3L, 2L, 3L, 2L, 1L, 1L,
3L, 3L, 2L, 1L, 3L, 2L, 3L, 3L, 3L, 2L, 1L, 3L, 3L, 3L, 2L, 1L,
3L, 1L, 3L, 3L, 1L, 1L, 1L, 1L, 2L, 1L, 2L, 2L, 2L, 1L, 1L, 1L,
3L, 3L, 1L, 3L, 3L, 1L, 2L, 1L, 1L, 1L, 2L, 1L, 3L, 1L, 3L, 3L,
2L, 3L, 2L, 3L, 3L, 1L, 3L, 3L, 3L, 3L, 2L, 3L, 3L, 3L, 1L, 3L,
1L, 1L, 1L, 3L, 1L, 2L, 3L, 1L, 1L, 3L, 2L, 1L, 3L, 3L, 2L, 2L,
1L, 1L, 3L, 2L, 3L, 3L, 1L, 1L, 2L, 1L, 1L, 2L, 1L, 2L, 1L, 3L,
3L, 3L, 2L, 1L, 1L, 3L, 3L, 1L, 2L, 3L, 3L, 1L, 3L, 3L, 1L, 1L,
1L, 3L, 1L, 3L, 3L, 2L, 3L, 2L, 1L, 3L, 2L, 1L, 3L, 1L, 1L, 1L,
3L, 2L, 2L, 2L, 1L, 3L, 1L, 3L, 1L, 3L, 2L, 1L, 2L, 3L, 1L, 3L,
1L, 1L, 1L, 1L, 2L, 1L, 3L, 1L, 1L, 1L, 3L, 2L, 2L, 2L, 1L, 3L,
1L, 1L, 3L), .Label = c("Normal", "Overweight", "Obesity"), class = "factor"),
Leptin = c(47710.88, 200022.04, 161067.35, 55561.76, 100728.06,
69783.61, 54981.01, 58801.34, 128313.05, 157378.46, 292366.57,
121608.04, 206046.88, 54981.01, 154969.33, 516791.75, 104766.73,
440134.48, 286576.47, 343513.87, 40020.2, 30077.63, 359266.48,
290381.41, 23005.7, 48080.73, 134741.37, 114631.03, 49644.25,
139956.69, 138242.54, 19862, 64541.08, 57119.32, 115382.48,
7238.99, 154969.33, 82321.93, 85406.54, 19170.47, 57208.08,
277488.74, 290791.3, 206703.97, 25333.82, 20134.62, 32823.3,
231036.03, 111986.18, 352190.59, 128041.35, 185025.96, 63451.72,
143404.56, 71163.46, 252067.35, 46223.39, 185077.75, 172339.07,
41381.36, 91498.49, 233969.82, 24245.94, 248133.29, 145890.48,
196431.01, 146690.84, 218617.65, 151333.68, 245695.08, 336242.88,
266936.45, 64105.63, 301181.31, 150192.02, 253863.48, 314169.03,
406059.04, 68228.37, 335171.04, 37547.56, 123713.8, 75034,
45708.91, 67449.43, 15920.57, 38444.37, 19170.47, 174853.97,
236689.18, 22879.68, 34599.46, 57562.83, 177486.58, 244481.84,
122637.48, 58094.2, 82921.65, 382788.21, 119733.17, 64192.75,
8787.33, 17146.98, 21986.45, 13077.71, 18320.19, 119777.74,
61615.67, 5708.97, 24307.05, 244118.62, 10780.13, 12158.23,
80265.64, 70215.06, 122189.66, 48219.23, 156702.5, 128313.05,
115072.95, 152956.29, 107776.73, 108914.21, 85835.09, 13608.28,
24853.42, 58359.56, 29967.69, 168944.14, 22435.55, 67709.18,
17444.89, 51058.87, 21072.27, 34702.75, 9711.01, 43870.71,
113571.74, 26863.91, 294914.16, 15920.57, 23381.51, 309409.6,
19587.14, 99905.3, 79494.37, 90768.21, 131129.25, 411527.63,
132269.4, 83735.58, 92014.19, 57030.54, 62578.38, 131676.19,
238711.8, 48080.73, 429691.15, 88750.76, 32770.45, 163022.26,
98045.93, 8421.27, 113659.95, 27210.74, 421265.4, 225005.8,
39431.91, 18748.27, 224660.83, 13695.33, 11186.36, 42727.62,
863581.67, 18031.36, 250895.16, 326547.74, 69351.93, 34288.82,
64932.66, 23381.51, 392858.01, 32399.47, 176370.64, 31651.9,
17592.65, 93304.24, 71938.68, 297130.98, 58624.7, 311565.88,
143685.29, 204518.07, 28689.43, 612308, 119688.61, 116888.21,
60738.48, 274462.8, 307122.48, 202886.86, 119777.74, 194409.2,
259555.37, 479766.12, 97527.66, 177029.67, 17146.98, 70560.06,
50922.4, 213003.8, 142470.09, 26747.8, 235563.59, 49460.99,
185181.36, 27210.74, 156220.34, 284284.42, 254982.61, 67059.61,
46736.22, 97657.21, 399497.2, 13433.03, 385197.12, 143685.29,
46503.31, 333111.93, 228589.14, 40117.95, 86563.73, 459114.82,
334559.84, 39824.45, 90295.84, 19309.96, 25333.82, 15604.37,
548683.37, 13954.22, 63974.92, 283344.56, 47061.73, 14794.69,
56852.91, 63320.83, 21335.74, 18176.13, 44060.31, 31436.88,
779337.47, 177842.32, 15683.81, 1804.18, 18031.36, 150572.24,
69956.22, 353788.11, 42823.25, 25927.76, 23131.34, 351871.83,
130355.47, 48034.54, 9711.01, 296020.99, 77994.24, 106641.08,
NA, 68876.79, 64323.41, 37947.12, 166071.05, 239189.29, 376318.03,
132041.15, 48818.15, 209454.17, 25572.25, 232324.09, 142750.23,
34185.02, 21204.24, 6040.1, 5708.97, 252190.95, 129219.75,
50922.4, 97786.76, 117464.95, 84978.04, 15122, 6356.95, NA,
299220.74, 28801.64, 535105.27, 421922.32, 9486.52, 159801.65,
139492.77, 40410.71, 28011.68, 67189.58, 10780.13, 173997.11,
15283.91, 192081.41, 169590.94, 29747.24, 57961.45, 55606.38,
351712.55, 157233.52, 117553.73, 201046.58, NA, 204409.11,
468179.87, 201640.99, 374968.45, 330911.94, 153817.84, 45098.7,
83050.16, NA, 20134.62, 20606.55, 328498.24, 22115.29, 40313.21,
196751.22, 6356.95, 8042.41, 252005.56, 124522.47, 9711.01,
168745.34, 110580.08, 78722.96, 104897.35, 43728.34, 20270.11,
339321.01, 170488.25, 55829.38, 173292.85, 29967.69, 5708.97,
30132.53, 12715.53, 180648.71, 146219.86, 33716.28, 85835.09,
110404.53, 327146.88, 259303.38, 164396.3, 206156.31, 21204.24,
105158.65, 152478.41, 327897.07, 29025.44, 36237.65, 426358.52,
265525.58, 52464.02, 287117.75, 658217.42, 67709.18, 107645.6,
46829.29, 186790.78, 68920, 363915.23, 269385.74, 126324.01,
146361.1, 153243.28, 101161.4, 318839.47, 132223.74, 96190,
143966.19, 162141.4, 63495.34, 35371.14, 261070.81, 197071.7,
240146.08, 73100.38, 63713.39, 248622.82, 92616.07, 163120.25,
95026.25, 266807.98, 153434.72, 145937.52, 127950.82, 376487.04,
81208.15, 311998.55, 41767.47, 59595.08, 13256.19, 35011.88,
99083.27, 47571.98, 174450.51, 8296.48, 35524.66, 68747.15,
214064.04, 127272.35, 70603.17, 99256.27, 19862, 145373.38,
184560.15, 57828.63, 115426.71)), row.names = c(NA, -425L
), class = c("tbl_df", "tbl", "data.frame"))
It is an issue of lep$Hessian having NA values. If you do this:
lep <- polr(bmicat2 ~ Leptin, data = a, Hess = TRUE)
lep$Hessian[1, 1] <- 0
lep$Hessian[1, 2] <- 0
lep$Hessian[1, 3] <- 0
lep$Hessian[2, 1] <- 0
lep$Hessian[3, 1] <- 0
summary(lep)
you do not have any problems, and the output from lep and summary(lep) are pretty much the same, I think. Is it the imputation of 0 for the missing values in the Hessian that is disturbing?

How to randomly change 50% of a column observations based on another column condition?

I'm analyzing a survey and I need to do an interaction.plot() between the variables DISCLOSURE_1 and TYPE_1 to see how they affect a third variable, ADTRUST.
The survey randomly showed a different scenario to each participant. DISCLOSURE_1 codes the type of disclosure that was shown to a respondent (B = Before, D = During, A = After, N = None).
TYPE_1 indicates the terminology used (DF = Deepfake, SM = Synthetic Media).
When creating the survey I dumbly used only DF for N, because I thought there was no need to create an SM scenario if no disclosure was shown (so no difference in terminology used). That still makes sense logically, but the variable N does not appear in the interaction plot. And since the study wants to analyze:
how disclosure impacts ADTRUST
how disclosure positioning impacts ADTRUST
how different terminology used in disclosure impacts ADTRUST
I need to randomly substitute SM for DF in 50% of the observations under TYPE_1, but ONLY where the column DISCLOSURE_1 == N.
I have no clue how to do that. Could somebody please help?
NOTE: the structure below is part of a bigger dataset. The dput was done only for columns [25:26], so keep in mind I need to be able to select the columns precisely in the code.
Thank you!
df <- structure(list(DISCLOSURE_1 = structure(c(4L, 3L, 1L, 3L, 1L,
1L, 4L, 3L, 4L, 3L, 4L, 1L, 3L, 3L, 4L, 1L, 3L, 3L, 1L, 3L, 3L,
4L, 2L, 1L, 1L, 4L, 2L, 3L, 3L, 1L, 4L, 1L, 1L, 4L, 4L, 1L, 3L,
2L, 2L, 1L, 4L, 1L, 1L, 1L, 4L, 3L, 4L, 3L, 2L, 3L, 1L, 1L, 3L,
1L, 1L, 2L, 1L, 2L, 2L, 2L, 3L, 2L, 1L, 3L, 3L, 3L, 3L, 2L, 2L,
3L, 3L, 2L, 1L, 2L, 3L, 1L, 2L, 2L, 1L, 1L, 1L, 3L, 2L, 2L, 3L,
3L, 2L, 3L, 3L, 2L, 1L, 3L, 1L, 1L, 4L, 4L, 1L, 2L, 4L, 3L, 1L,
1L, 1L, 3L, 2L, 3L, 1L, 2L, 2L, 3L, 4L, 2L, 4L, 2L, 3L, 3L, 2L,
4L, 4L, 3L, 4L, 1L, 1L, 3L, 3L, 1L, 3L, 2L, 3L, 3L, 1L, 2L, 1L,
4L, 1L, 2L, 3L, 3L, 1L, 4L, 2L, 3L, 2L, 1L, 1L, 2L, 2L, 1L, 4L,
2L, 4L, 1L, 4L, 2L, 1L, 1L, 1L, 3L, 2L, 3L, 2L, 4L, 3L, 1L, 3L,
1L, 1L, 3L, 2L, 3L, 2L, 4L, 3L, 3L, 1L, 1L, 3L, 2L, 2L, 1L, 1L,
3L, 3L, 4L, 2L, 2L, 3L, 3L, 1L, 2L, 1L, 2L, 3L, 2L, 3L, 3L, 3L,
3L, 4L, 3L, 2L, 1L, 1L, 2L, 2L, 2L, 1L, 4L, 1L, 2L, 2L, 3L, 3L,
1L, 4L, 4L, 1L, 2L, 4L, 2L, 1L, 1L, 4L, 1L, 1L, 2L, 1L, 2L, 2L,
3L, 3L, 3L, 3L, 2L, 1L, 4L, 1L, 1L, 2L, 2L, 4L, 2L, 3L, 1L, 2L,
3L, 3L, 2L, 4L, 3L, 2L, 2L, 4L, 4L, 2L, 1L, 1L, 2L, 2L, 3L, 1L,
1L, 4L, 3L, 1L, 3L, 3L, 3L, 2L, 2L, 2L, 4L, 2L, 4L, 4L, 4L, 1L,
3L, 1L, 3L, 1L, 3L, 1L, 2L, 4L, 3L, 2L, 2L, 2L, 1L, 2L, 2L, 3L,
2L, 1L, 2L, 4L, 1L, 2L, 3L, 2L, 1L, 2L, 4L, 4L, 2L, 3L, 2L, 1L,
2L, 2L, 2L, 4L, 2L, 1L, 3L, 1L, 3L, 3L, 4L, 1L, 2L, 3L, 2L, 2L,
3L, 4L, 3L, 2L, 3L, 3L, 2L, 1L, 1L, 3L, 3L, 3L, 3L, 1L, 2L, 2L,
1L, 2L, 2L, 1L, 1L, 3L, 3L, 1L, 2L, 1L, 1L, 3L, 2L, 1L, 2L, 3L,
2L, 3L, 1L, 2L, 3L, 4L, 2L, 1L, 1L, 3L), .Label = c("A", "B",
"D", "N"), class = "factor"), TYPE_1 = structure(c(1L, 1L, 2L,
1L, 2L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 2L, 1L, 1L, 1L, 2L, 2L, 1L,
2L, 2L, 1L, 1L, 2L, 2L, 1L, 2L, 1L, 2L, 2L, 1L, 2L, 1L, 1L, 1L,
2L, 2L, 1L, 1L, 2L, 1L, 2L, 2L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L,
2L, 1L, 1L, 2L, 1L, 1L, 2L, 1L, 1L, 1L, 2L, 2L, 1L, 1L, 1L, 1L,
2L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 2L, 1L, 1L, 1L, 1L, 2L, 2L,
1L, 1L, 1L, 2L, 1L, 2L, 2L, 1L, 1L, 2L, 2L, 1L, 1L, 1L, 1L, 1L,
2L, 1L, 2L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L,
2L, 2L, 1L, 1L, 2L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 2L, 2L, 1L, 2L,
1L, 1L, 1L, 2L, 1L, 1L, 2L, 2L, 1L, 1L, 2L, 2L, 1L, 2L, 2L, 1L,
2L, 1L, 1L, 1L, 2L, 1L, 2L, 1L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L,
1L, 2L, 2L, 1L, 2L, 2L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 1L, 2L,
2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 1L, 2L, 1L,
2L, 2L, 1L, 1L, 2L, 1L, 1L, 2L, 2L, 1L, 1L, 2L, 1L, 2L, 1L, 2L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 2L, 1L, 2L, 1L,
2L, 1L, 2L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 2L,
2L, 2L, 1L, 1L, 2L, 1L, 1L, 1L, 2L, 1L, 1L, 2L, 2L, 1L, 1L, 2L,
1L, 1L, 1L, 1L, 2L, 2L, 1L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 2L, 1L, 2L, 2L, 1L, 1L, 2L, 1L, 1L, 2L, 2L, 1L,
2L, 1L, 1L, 1L, 2L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 2L,
1L, 1L, 2L, 2L, 1L, 1L, 1L, 2L, 2L, 1L, 2L, 2L, 1L, 1L, 1L, 1L,
2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 2L, 1L, 2L, 1L,
2L, 2L, 1L, 2L, 2L, 1L, 2L, 1L, 2L, 2L, 1L, 1L, 1L, 2L, 2L, 2L,
2L, 1L, 2L, 1L, 1L, 2L, 2L, 1L, 1L, 2L, 2L, 2L), .Label = c("DF",
"SM"), class = "factor")), row.names = c(NA, -367L), class = c("tbl_df",
"tbl", "data.frame"))
With data as your data.frame, this will replace TYPE_1 with "SM" for exactly half (rounded down) of the rows where DISCLOSURE_1 is "N" and TYPE_1 is "DF":
blnN <- data$DISCLOSURE_1 == "N" & data$TYPE_1 == "DF"
data$TYPE_1[sample(which(blnN), sum(blnN) / 2)] <- "SM"
If the 50% requirement is approximate, you can use runif() > 0.5:
library(dplyr)
table(df)
            TYPE_1
DISCLOSURE_1 DF SM
           A 52 51
           B 52 53
           D 55 51
           N 53  0
mut <- df |>
  mutate(TYPE_1 = ifelse(DISCLOSURE_1 == "N" &
                           TYPE_1 == "DF" &
                           runif(n()) > 0.5,
                         "SM",
                         as.character(TYPE_1)))
table(mut)
            TYPE_1
DISCLOSURE_1 DF SM
           A 52 51
           B 52 53
           D 55 51
           N 27 26

How to separate compact letter display (CLD) in multcomp by group without changing the p-value adjustment method?

Problem
I would like to plot estimated marginal means from a three-way factorial experiment with letters indicating significantly different means, adjusted for multiple comparisons. My current workflow is to fit the model with lmer(), calculate estimated marginal means with emmeans(), then implement the compact letter display algorithm with cld().
My problem is that the graph is too busy when you plot all three-way interactions on the same plot. So I would like to split up the plot and generate different sets of letters for each subplot, starting with "a". The problem is that when I use the by argument in cld to split it up, it does a separate correction for multiple comparisons within each by group. Because there are now fewer tests within each group, this results in a less conservative correction. But if I try to manually split up the output of cld() without a by group, I would have to manually re-implement the letter algorithm for each subplot. I guess I could do that but it seems cumbersome. I am trying to share this code with a client for him to modify later, so that solution would probably be too complex. Does anyone have an easy way to either:
Get the output of cld() to use one combined correction for all by groups.
Using a relatively simple method, reduce the compact letter display for each subgroup to the minimal necessary number of letters.
Reproducible example
Load packages and data.
library(lme4)
library(emmeans)
library(multcomp)
dat <- structure(list(y = c(2933.928571, 930.3571429, 210.7142857, 255.3571429,
2112.5, 1835.714286, 1358.928571, 1560.714286, 9192.857143, 3519.642857,
2771.428571, 7433.928571, 4444.642857, 3025, 3225, 2103.571429,
3876.785714, 925, 1714.285714, 3225, 1783.928571, 2223.214286,
2537.5, 2251.785714, 7326.785714, 5130.357143, 2539.285714, 6116.071429,
5808.928571, 3341.071429, 2212.5, 7562.5, 3907.142857, 3241.071429,
1294.642857, 4325, 4487.5, 2551.785714, 5648.214286, 3198.214286,
1075, 335.7142857, 394.6428571, 1605.357143, 658.9285714, 805.3571429,
1580.357143, 1575, 2037.5, 1721.428571, 1014.285714, 2994.642857,
2116.071429, 800, 2925, 3955.357143, 9075, 3917.857143, 2666.071429,
6141.071429, 3925, 1626.785714, 2864.285714, 7271.428571, 3432.142857,
1826.785714, 514.2857143, 1319.642857, 1782.142857, 2637.5, 1355.357143,
3328.571429, 1914.285714, 817.8571429, 1896.428571, 2121.428571,
521.4285714, 360.7142857, 1114.285714, 1139.285714, 7042.857143,
2371.428571, 2287.5, 4967.857143, 2180.357143, 1944.642857, 2408.928571,
5289.285714, 7028.571429, 3080.357143, 5394.642857, 5973.214286,
7323.214286, 1419.642857, 1455.357143, 4657.142857, 7069.642857,
2451.785714, 4319.642857, 5562.5, 3953.571429, 1182.142857, 1957.142857,
3796.428571, 1773.214286, 400, 871.4285714, 842.8571429, 657.1428571,
1360.714286, 1853.571429, 1826.785714, 3405.357143, 2605.357143,
5983.928571, 4935.714286, 4105.357143, 7666.071429, 3619.642857,
5085.714286, 1592.857143, 1751.785714, 5992.857143, 2987.5, 794.6428571,
3187.5, 825, 3244.642857), f1 = structure(c(4L, 4L, 4L, 4L, 4L,
4L, 4L, 4L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L,
3L, 3L, 3L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 3L, 3L, 3L, 3L, 3L,
3L, 3L, 3L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 3L, 3L, 3L, 3L, 3L,
3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L), .Label = c("A",
"B", "C", "D"), class = "factor"), f2 = structure(c(2L, 2L, 2L,
2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 2L, 2L, 2L,
2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 1L, 1L, 1L,
1L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 1L, 1L, 1L,
1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 2L, 2L, 2L,
2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L), .Label = c("foo",
"bar"), class = "factor"), f3 = structure(c(4L, 3L, 2L, 1L, 3L,
4L, 1L, 2L, 4L, 2L, 1L, 3L, 3L, 2L, 4L, 1L, 3L, 1L, 4L, 2L, 2L,
4L, 3L, 1L, 2L, 4L, 1L, 3L, 2L, 3L, 1L, 4L, 3L, 4L, 1L, 2L, 3L,
2L, 4L, 1L, 2L, 1L, 3L, 4L, 1L, 2L, 4L, 3L, 2L, 1L, 3L, 4L, 3L,
1L, 4L, 2L, 4L, 2L, 3L, 1L, 1L, 3L, 2L, 4L, 3L, 4L, 1L, 2L, 1L,
4L, 3L, 2L, 3L, 1L, 4L, 2L, 1L, 3L, 4L, 2L, 4L, 3L, 1L, 2L, 1L,
3L, 4L, 2L, 3L, 1L, 4L, 2L, 4L, 1L, 3L, 2L, 2L, 3L, 4L, 1L, 4L,
1L, 2L, 3L, 4L, 1L, 3L, 2L, 1L, 2L, 4L, 3L, 1L, 2L, 4L, 3L, 1L,
4L, 2L, 3L, 1L, 3L, 4L, 2L, 1L, 3L, 2L, 4L), .Label = c("L1",
"L2", "L3", "L4"), class = "factor"), block = structure(c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L,
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L,
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 4L,
4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L,
4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L), .Label = c("1",
"2", "3", "4"), class = "factor")), row.names = c(NA, -128L), class = "data.frame")
Fit model and get estimated marginal means.
fit <- lmer(log10(y) ~ f1 * f2 * f3 + (1 | block), data = dat)
emm <- emmeans(fit, ~ f1 + f2 + f3, mode = 'Kenward-Roger', type = 'response')
Version 1
In this version, I take the CLD as a whole, which correctly uses the Sidak adjustment for 496 tests. However, let's say I wanted to plot only those rows where f2 == 'bar'. The letters are no longer correct because some are redundant (fewer than 8 are needed). Is there any function that can reduce the letters down?
cldisplay1 <- cld(emm, adjust = 'sidak', Letters = letters)
subset(as.data.frame(cldisplay1), f2 == 'bar') # correct comparisons but contains redundant letters
output
f1 f2 f3 response SE df lower.CL upper.CL .group
8 D bar L1 365.6732 76.1231 96 185.9699 719.0244 a
24 D bar L3 582.8573 121.3349 96 296.4229 1146.0742 ab
16 D bar L2 682.9238 142.1659 96 347.3136 1342.8353 ab
7 C bar L1 898.1560 186.9714 96 456.7740 1766.0470 abcd
6 B bar L1 1627.7069 338.8438 96 827.8006 3200.5652 bcdefg
15 C bar L2 1635.4393 340.4534 96 831.7330 3215.7694 bcdefg
32 D bar L4 1746.6052 363.5951 96 888.2685 3434.3552 bcdefg
31 C bar L4 2348.6629 488.9270 96 1194.4562 4618.1832 cdefgh
21 A bar L3 2499.6772 520.3640 96 1271.2573 4915.1230 cdefgh
5 A bar L1 2545.4594 529.8946 96 1294.5407 5005.1448 cdefgh
23 C bar L3 2561.0138 533.1326 96 1302.4512 5035.7294 cdefgh
30 B bar L4 3158.6969 657.5538 96 1606.4140 6210.9556 efgh
22 B bar L3 3364.9438 700.4887 96 1711.3047 6616.4994 efgh
14 B bar L2 3411.4009 710.1598 96 1734.9313 6707.8482 efgh
13 A bar L2 3769.4223 784.6900 96 1917.0098 7411.8269 efgh
29 A bar L4 7006.3740 1458.5342 96 3563.2217 13776.6551 h
Version 2
In this version, I use the by argument to cld() to split by f2. This reduces the letters within each group, but the Sidak adjustment is now less conservative. For example, rows 8 and 16 are not significantly different at the adjusted alpha level in the comparison above, but now they are. I do not want to change the tests used, though, just to plot only a subset of the data. Is there a way to specify the number of tests I'm adjusting for as a whole, even though cld is split up with by groups?
cldisplay2 <- cld(emm, adjust = 'sidak', by = 'f2', Letters = letters)
subset(as.data.frame(cldisplay2), f2 == 'bar')
output
f1 f2 f3 response SE df lower.CL upper.CL .group
8 D bar L1 365.6732 76.1231 96 185.9699 719.0244 a
24 D bar L3 582.8573 121.3349 96 296.4229 1146.0742 ab
16 D bar L2 682.9238 142.1659 96 347.3136 1342.8353 abc
7 C bar L1 898.1560 186.9714 96 456.7740 1766.0470 abcd
6 B bar L1 1627.7069 338.8438 96 827.8006 3200.5652 bcde
15 C bar L2 1635.4393 340.4534 96 831.7330 3215.7694 bcde
32 D bar L4 1746.6052 363.5951 96 888.2685 3434.3552 cde
31 C bar L4 2348.6629 488.9270 96 1194.4562 4618.1832 de
21 A bar L3 2499.6772 520.3640 96 1271.2573 4915.1230 def
5 A bar L1 2545.4594 529.8946 96 1294.5407 5005.1448 def
23 C bar L3 2561.0138 533.1326 96 1302.4512 5035.7294 def
30 B bar L4 3158.6969 657.5538 96 1606.4140 6210.9556 ef
22 B bar L3 3364.9438 700.4887 96 1711.3047 6616.4994 ef
14 B bar L2 3411.4009 710.1598 96 1734.9313 6707.8482 ef
13 A bar L2 3769.4223 784.6900 96 1917.0098 7411.8269 ef
29 A bar L4 7006.3740 1458.5342 96 3563.2217 13776.6551 f
With the two separate tables (or plots?) you are displaying a total of 90 + 90 = 180 comparisons. If you want an overall multiplicity adjustment for all 180 of these comparisons, you need to be considerably less conservative than for 496 comparisons. However, it is possible to specify a different value of alpha so that the Sidak adjustment works out correctly. For example, if you want the overall alpha to be 0.05, use
cld(emm, adjust = 'sidak', by = 'f2', Letters = letters,
alpha = 1 - sqrt(0.95))
With this, you are specifying alpha = 0.02532. Note that if
p.adj = 1 - (1 - p)^90 < 1 - sqrt(.95)
then
(1 - p)^90 > sqrt(.95)
so that
(1 - p)^180 > .95
thus
1 - (1 - p)^180 < .05
That is, by splitting the CLD table into two parts showing 90 comparisons each, we correctly apply the Sidak adjustment to correct for the 180 comparisons total at a significance level of .05.
Enhancement
Another idea based on this that results in a less conservative adjustment is to specify the Tukey adjustment instead:
cld(emm, adjust = 'tukey', by = 'f2', Letters = letters,
alpha = 1 - sqrt(0.95))
Thus, each separate table has an exact familywise error rate of 1 - sqrt(0.95); and we used the Sidak adjustment (slightly conservative) so that the error rate for the whole family of 180 tests is less than 0.05.

Bar chart for factorial designs in R

I'm currently trying to create a clustered bar chart using ggplot2. It's basically just mean response times for a 2x2x2 factorial design. The three factors are load, compatibility, and salience. I'm having a hard time fitting the third factor (salience) in there, though; it shouldn't be a stacked graph.
This is what I currently have
bar + stat_summary(fun.y = mean, geom = "bar", position = "dodge") +
  stat_summary(fun.data = mean_cl_normal, geom = "errorbar", position = position_dodge(width = 0.90), width = 0.2) +
  labs(x = "Compatibility", y = "Mean RT", fill = "Load")
Here's a small sample of the data I'm trying to graph:
ID load comp sal rt
1 1 High Incompatible Non_Salient 787
2 1 Low Compatible Salient 754
3 2 High Incompatible Salient 654
I've seen graphs like these numerous times before but I have no idea how to get ggplot2 to display three independent variables at the same time.
I've tried splitting the graphs by adding
+ facet_wrap( ~ sal)
but that doesn't work either. It just says "Invalid argument to unary operator"
Any help would be appreciated.
Is this the kind of plot you are looking for?
I used the Wii data from the book "Discovering Statistics Using R", which is in a similar format to yours.
Wii <- structure(list(athlete = structure(c(1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L), .Label = c("Athlete", "Non-Athlete"), class = "factor"),
stretch = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L), .Label = c("No Stretching", "Stretching"
), class = "factor"), wii = structure(c(1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("Playing Wii",
"Watching Wii"), class = "factor"), injury = c(2L, 2L, 1L,
2L, 0L, 1L, 2L, 0L, 2L, 2L, 2L, 1L, 4L, 2L, 2L, 0L, 0L, 3L,
3L, 3L, 2L, 1L, 0L, 2L, 2L, 3L, 2L, 2L, 3L, 1L, 2L, 4L, 1L,
2L, 2L, 2L, 1L, 4L, 4L, 1L, 2L, 3L, 3L, 3L, 3L, 2L, 3L, 2L,
2L, 2L, 1L, 0L, 3L, 3L, 2L, 1L, 2L, 4L, 1L, 2L, 5L, 5L, 3L,
6L, 4L, 3L, 4L, 5L, 5L, 2L, 6L, 4L, 4L, 4L, 3L, 4L, 3L, 2L,
1L, 4L, 3L, 2L, 2L, 1L, 3L, 1L, 1L, 3L, 4L, 2L, 7L, 8L, 6L,
9L, 4L, 7L, 5L, 9L, 6L, 4L, 8L, 5L, 4L, 7L, 10L, 1L, 3L,
2L, 1L, 3L, 3L, 2L, 3L, 4L, 2L, 0L, 1L, 3L, 2L, 0L)),
.Names = c("athlete", "stretch", "wii", "injury"),
class = "data.frame", row.names = c(NA, -120L))
Here is how to produce the plot.
library(ggplot2)
library(Hmisc)
ggplot(data = Wii, aes(x = stretch, y = injury, fill = wii)) +
  facet_wrap(~ athlete) +
  stat_summary(fun.y = mean, geom = "bar", position = "dodge") +
  stat_summary(fun.data = mean_cl_normal, geom = "errorbar",
               position = position_dodge(width = 0.90), width = 0.2)

What is the meaning of the warning message about log(P) when calculating a polychoric correlation with 'hetcor'?

When calculating a polychoric correlation in R (library(polycor), function hetcor) I get the warning message In log(P) : NaNs produced. I wasn't able to figure out what this warning message means. I suppose it has to do with the calculation of the p-values for testing bivariate normality.
Thus my questions are:
What characteristics of this dataset result in this warning?
What's the meaning of this warning?
Is this warning problematic in terms of using the polychoric correlation matrix for further analyses?
Data subset:
foo <- structure(list(item1 = structure(c(4L, 4L, 4L, 2L, 2L, 2L,
2L, 2L, 4L, 2L, 2L, 3L, 2L, 3L, 2L, 2L, 2L, 3L, 2L, 2L, 3L, 1L,
2L, 2L, 3L, 3L, 3L, 2L, 2L, 1L, 1L, 2L, 3L, 2L, 2L, 3L, 2L, 3L,
2L, 2L, 2L, 2L, 3L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 3L, 3L, 2L, 3L, 3L, 3L, 2L, 2L, 2L, 1L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 1L, 3L, 2L, 2L, 1L, 2L, 2L, 2L, 1L, 2L,
1L, 2L, 2L, 4L, 2L, 4L, 2L, 2L, 3L, 1L, 2L, 1L, 2L, 2L, 2L, 1L,
2L, 2L, 3L, 2L, 2L, 2L, 3L, 1L, 2L, 2L, 2L, 2L, 4L, 2L, 2L, 2L,
2L, 2L, 2L, 4L, 2L, 2L, 1L, 2L, 2L, 2L, 1L, 2L, 1L, 2L, 3L, 3L,
1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 2L, 3L, 3L, 3L
), .Label = c("0", "1", "2", "3"), class = c("ordered", "factor"
)), item2 = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L,
1L, 2L, 2L, 1L, 1L, 1L, 1L, 2L, 3L, 2L, 1L, 3L, 2L, 1L, 1L, 3L,
1L, 3L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 3L, 2L, 2L, 1L,
3L, 2L, 1L, 2L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 3L, 1L, 1L,
2L, 3L, 2L, 1L, 2L, 2L, 3L, 1L, 2L, 2L, 2L, 2L, 2L, 1L, 2L, 1L,
1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 2L, 2L, 1L, 2L, 1L, 1L, 1L,
2L, 2L, 3L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 2L,
2L, 1L, 2L, 1L, 2L, 1L, 3L, 2L, 1L, 3L, 1L, 1L, 1L, 2L, 2L, 1L,
2L, 1L, 3L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 4L, 1L, 1L, 1L,
1L, 2L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 4L, 1L, 1L, 3L), .Label = c("0",
"1", "2", "3"), class = c("ordered", "factor")), item3 = structure(c(4L,
4L, 4L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 4L, 1L, 2L, 1L, 1L, 1L,
1L, 2L, 1L, 4L, 2L, 2L, 1L, 3L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 2L, 2L, 2L, 2L, 1L, 1L, 2L, 3L, 1L, 1L, 1L, 2L, 1L, 1L,
2L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 1L, 1L, 2L, 2L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 2L, 1L, 2L, 1L, 1L, 2L, 2L, 1L, 2L, 1L, 1L, 1L, 1L, 1L,
1L, 3L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 2L, 1L, 2L, 2L, 1L, 2L, 3L,
1L, 3L, 1L, 1L, 1L, 1L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 2L, 3L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
2L, 1L, 3L, 2L, 1L), .Label = c("0", "1", "2", "3"), class = c("ordered",
"factor")), item4 = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
2L, 1L, 1L, 1L, 3L, 1L, 2L, 1L, 1L, 1L, 2L, 2L, 1L, 3L, 2L, 1L,
1L, 3L, 1L, 3L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 2L, 2L,
2L, 1L, 2L, 2L, 1L, 2L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 2L,
1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 3L, 1L, 2L, 3L, 2L, 1L, 1L, 1L,
1L, 1L, 2L, 1L, 1L, 1L, 2L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 2L, 1L,
1L, 2L, 1L, 2L, 3L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L,
1L, 2L, 2L, 2L, 3L, 1L, 1L, 2L, 2L, 2L, 1L, 3L, 1L, 1L, 1L, 2L,
2L, 1L, 1L, 1L, 2L, 1L, 3L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 4L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 4L, 1L, 2L, 3L), .Label = c("0",
"1", "2", "3"), class = c("ordered", "factor")), item5 = structure(c(4L,
4L, 4L, 1L, 1L, 1L, 1L, 2L, 3L, 2L, 2L, 4L, 2L, 3L, 2L, 1L, 1L,
3L, 3L, 3L, 4L, 3L, 2L, 1L, 3L, 3L, 4L, 1L, 2L, 1L, 1L, 1L, 2L,
2L, 2L, 3L, 3L, 3L, 3L, 1L, 1L, 3L, 4L, 2L, 1L, 2L, 2L, 2L, 2L,
3L, 1L, 2L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 4L, 3L, 3L, 1L,
2L, 1L, 1L, 3L, 1L, 2L, 2L, 1L, 3L, 2L, 1L, 2L, 2L, 1L, 1L, 2L,
1L, 2L, 4L, 2L, 2L, 1L, 2L, 2L, 4L, 2L, 4L, 1L, 1L, 2L, 1L, 1L,
1L, 2L, 2L, 2L, 2L, 3L, 2L, 3L, 2L, 1L, 3L, 2L, 1L, 1L, 3L, 3L,
1L, 4L, 1L, 1L, 1L, 1L, 2L, 3L, 3L, 3L, 2L, 1L, 3L, 2L, 1L, 1L,
1L, 1L, 2L, 3L, 4L, 1L, 1L, 1L, 1L, 2L, 1L, 2L, 1L, 1L, 3L, 1L,
3L, 3L, 4L, 3L, 3L), .Label = c("0", "1", "2", "3"), class = c("ordered",
"factor"))), .Names = c("item1", "item2", "item3", "item4",
"item5"))
Computation of correlation matrix:
hetcor(foo)
Comment: the real dataset contains about 2500 rows (and more variables), but looking at the contingency tables, sparse cells don't seem to be an issue.
A short (and belated) answer to a very old question. The warning arises because some of the cells in the cross-tabulation of the variables (for example, variables 1 and 2) have 0 counts. This can lead to problems in estimation.
The polychoric (and tetrachoric) correlations are normal-theory approximations of what would happen if bivariate normal (and continuous) data were converted into categorical data (dichotomous for tetrachorics, polytomous for polychorics). The normal-theory approximation assumes that all cells have some value. The correlations can still be found with 0-count cells, but with a warning. The resulting correlations are correct but unstable, in that if we add a small correction for continuity (i.e., add .1 or .5 to the 0 cells), the values change a great deal. This problem is discussed by Gunther and Hofler for the case of tetrachoric correlations, where they compare solutions with and without the correction for continuity.
(See A. Gunther and M. Hofler, "Different results on tetrachorical correlations in mplus and stata - stata announces modified procedure," Int J Methods Psychiatr Res, 15(3):157-66, 2006, for a discussion of this problem with tetrachoric correlations.)
Using the polychoric function in the psych package, we find the same answer as the hetcor function from polycor if we do not apply the correction for continuity, but somewhat different values if we do correct for continuity. I recommend the correction.
See the help function for polychoric in psych for a longer discussion of this problem.
