I have Likert-type data, some on a 4-point scale and others on a 3-point scale. For example:
variable.x <- c(5,2,4,5,3,1)
variable.y <- c(0,1,1,0,2,1)
my.data <- cbind(variable.x, variable.y)
library(psych)
polychoric(my.data)
#Call: polychoric(x = my.data)
#Polychoric correlations
# vrbl.x vrbl.y
#variable.x 1.00
#variable.y -0.25 1.00
# with tau of
# 0 1 2 3 4
#variable.x -0.97 -0.43 0 0.43 Inf
#variable.y -0.43 0.97 Inf Inf Inf
How can I obtain a significance value for the correlation of -0.25?
Strangely, the polycor package gives completely different results:
library(polycor)
polychor(variable.x, variable.y, ML=TRUE, std.err=TRUE)
#Polychoric Correlation, ML est. = -0.7599 (0.2588)
#Test of bivariate normality: Chisquare = 9.088, df = 7, p = 0.2464
# Row Thresholds
# Threshold Std.Err.
#1 -0.9388 0.5519
#2 -0.5774 0.5040
#3 -0.1906 0.5267
#4 0.3979 0.5690
# Column Thresholds
# Threshold Std.Err.
#1 -0.3891 0.5683
#2 0.9692 0.5568
Now the correlation coefficient is -0.76, not -0.25. Again, how would I find its significance?
I'm looking at age- and sex-adjusted group differences in a continuous variable of interest. As done in other studies in my field, I want to calculate Cohen's d based on contrasts extracted from a multiple linear regression model.
The original formula (Nakagawa & Cuthill, 2007) is as follows:
d = t (n1 + n2) / (sqrt(n1 n2) sqrt(df'))
where
n1 = sample size in Group 1
n2 = sample size in Group 2
df' = degrees of freedom used for a corresponding t value in a linear model
t = t-statistic corresponding to the contrast of interest
So far I've attempted to apply this in R, but the results are looking strange (much larger effect sizes than expected).
Here's some simulated data:
library(broom)
df = data.frame(ID = c(1001, 1002, 1003, 1004, 1005, 1006,1007, 1008, 1009, 1010),
Group = as.numeric(c('0','1','0','0','1','1','0','1','0','1')),
age = as.numeric(c('23','28','30','15','7','18','29','27','14','22')),
sex = as.numeric(c('1','0','1','0','0','1','1','0','0','1')),
test_score = as.numeric(c('18','20','19','15','20','23','19','25','10','14')))
# run lm and extract regression coefficients
model <- lm(test_score ~ Group + age + sex, data = df)
tidy_model <- tidy(model)
tidy_model
# A tibble: 4 x 5
#term estimate std.error statistic p.value
#<chr> <dbl> <dbl> <dbl> <dbl>
# 1 (Intercept) 11.1 4.41 2.52 0.0451
# 2 Group 4.63 2.65 1.75 0.131
# 3 age 0.225 0.198 1.13 0.300
# 4 sex 0.131 2.91 0.0452 0.965
t_statistic <- tidy_model$statistic[2] # = 1.76 (plain numeric rather than a 1 x 1 tibble)
n <- 5 #(equal n of participants in Group1 as in Group2)
cohens_d <- t_statistic*(n + n)/(sqrt(n * n) * sqrt(1)) # 1 dof for 1 estimated parameter (group contrast)
cohens_d # = 3.518096
Could you please flag up where I'm going wrong?
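You have set the degrees of freedom to 1. However, the formula calls for the degrees of freedom of the t statistic, i.e. the residual degrees of freedom of the model. Here that is 6 (10 observations minus 4 estimated coefficients), which you can see if you type: summary(model).
If you set your degrees of freedom to 6, your Cohen's d becomes 3.518/sqrt(6), i.e. about 1.44, which should be more in line with what you expect.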
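As a minimal sketch of that correction, the snippet below recomputes d from the question's simulated data using the residual degrees of freedom. It uses base R only (coef(summary(model)) instead of broom), and the names t_stat, n1, n2 and df_resid are illustrative choices, not part of the original code.

```r
# Simulated data from the question
df <- data.frame(
  Group      = c(0, 1, 0, 0, 1, 1, 0, 1, 0, 1),
  age        = c(23, 28, 30, 15, 7, 18, 29, 27, 14, 22),
  sex        = c(1, 0, 1, 0, 0, 1, 1, 0, 0, 1),
  test_score = c(18, 20, 19, 15, 20, 23, 19, 25, 10, 14)
)
model <- lm(test_score ~ Group + age + sex, data = df)

# t statistic for the Group contrast, as a plain number
t_stat <- coef(summary(model))["Group", "t value"]

# Group sizes and residual degrees of freedom (observations minus coefficients)
n1 <- sum(df$Group == 0)
n2 <- sum(df$Group == 1)
df_resid <- df.residual(model)   # 10 - 4 = 6

# d = t (n1 + n2) / (sqrt(n1 n2) sqrt(df'))
cohens_d <- t_stat * (n1 + n2) / (sqrt(n1 * n2) * sqrt(df_resid))
cohens_d
```

Taking the group sizes and degrees of freedom from the data and the fitted model, rather than hard-coding them, keeps the calculation correct if covariates are added or removed.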
I have a dataset that is in a long format with 200 variables, 94 subjects, and each subject has anywhere from 1 to 3 measurements for each variable.
Eg:
ID measurement var1 var2 . . .
1 1 2 6
1 2 3 8
1 3 6 12
2 1 3 9
2 2 4 4
2 3 5 3
3 1 1 11
3 2 1 4
. . . .
. . . .
. . . .
However, some variables have missing values for one of three measurements. It was suggested to me that before imputing missing values with the mean for the subject, I should use a repeated measures ANOVA or mixed model in order to confirm the repeatability of measurements.
The first thing I found to calculate the ICC was the ICC() function from the psych package. However, from what I can tell this requires that the data have one row per subject and one column per measurement, which would be further complicated by the fact that I have 200 variables I need to calculate the ICC for individually. I did go ahead and calculate the ICC for a single variable, and obtained this output:
Intraclass correlation coefficients
type ICC F df1 df2 p lower bound upper bound
Single_raters_absolute ICC1 0.38 2.8 93 188 0.00000000067 0.27 0.49
Single_random_raters ICC2 0.38 2.8 93 186 0.00000000068 0.27 0.49
Single_fixed_raters ICC3 0.38 2.8 93 186 0.00000000068 0.27 0.49
Average_raters_absolute ICC1k 0.65 2.8 93 188 0.00000000067 0.53 0.74
Average_random_raters ICC2k 0.65 2.8 93 186 0.00000000068 0.53 0.74
Average_fixed_raters ICC3k 0.65 2.8 93 186 0.00000000068 0.53 0.74
Number of subjects = 94 Number of Judges = 3
Next, I tried to calculate the ICC using a mixed model. Using this code:
m1 <- lme(var1 ~ measurement, random=~1|ID, data=mydata, na.action=na.omit)
summary(m1)
The output looks like this:
Linear mixed-effects model fit by REML
Data: mydata
AIC BIC logLik
-1917.113 -1902.948 962.5564
Random effects:
Formula: ~1 | ORIGINAL_ID
(Intercept) Residual
StdDev: 0.003568426 0.004550419
Fixed effects: var1 ~ measurement
Value Std.Error DF t-value p-value
(Intercept) 0.003998953 0.0008388997 162 4.766902 0.0000
measurement 0.000473053 0.0003593452 162 1.316429 0.1899
Correlation:
(Intr)
measurement -0.83
Standardized Within-Group Residuals:
Min Q1 Med Q3 Max
-3.35050264 -0.30417725 -0.03383329 0.25106803 12.15267443
Number of Observations: 257
Number of Groups: 94
Is this the correct model to use to assess ICC? It is not clear to me what the correlation (Intr) is measuring, and it is different from the ICC obtained using ICC().
This is my first time calculating and using intraclass correlation, so any help is appreciated!
Using a mock dataset...
set.seed(42)
n <- 6
dat <- data.frame(id=rep(1:n, 2),
group= as.factor(rep(LETTERS[1:2], n/2)),
V1 = rnorm(n),
V2 = runif(n*2, min=0, max=100),
V3 = runif(n*2, min=0, max=100),
V4 = runif(n*2, min=0, max=100),
V5 = runif(n*2, min=0, max=100))
Loading some libraries...
library(lme4)
library(purrr)
library(tidyr)
# Add list of variable names to the vector below...
var_list <- c("V1","V2","V3","V4","V5")
map_dfr() is from the purrr library. I use lme4::VarCorr() to get the variances at each level.
map_dfr(var_list,
function(x){
formula_mlm = as.formula(paste0(x,"~ group + (1|id)"));
model_fit = lmer(formula_mlm,data=dat);
re_variances = VarCorr(model_fit,comp="Variance") %>%
data.frame() %>%
dplyr::mutate(variable = x);
return(re_variances)
}) %>%
dplyr::select(variable,grp,vcov) %>%
pivot_wider(names_from="grp",values_from="vcov") %>%
dplyr::mutate(icc = id/(id+Residual))
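The same ratio can be applied by hand to the lme() output shown in the question. The sketch below simply copies the two standard deviations from the "Random effects" block. Note that the "Correlation: (Intr)" entry in that output is the correlation between the fixed-effect estimates, not an intraclass correlation. The variable names are illustrative.

```r
# Standard deviations copied from the "Random effects" block of the lme() output
sd_id       <- 0.003568426   # between-subject (Intercept)
sd_residual <- 0.004550419   # within-subject (Residual)

# ICC = between-subject variance / (between-subject + residual variance)
icc <- sd_id^2 / (sd_id^2 + sd_residual^2)
round(icc, 2)
```

This gives roughly 0.38, in line with the single-measurement ICC values reported by ICC() in the question. With the fitted object itself, the same two components should be available from VarCorr(m1) rather than being typed in by hand.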
I computed simple slopes for an interaction in two ways: with the sim_slopes() function from the interactions package and with the emtrends() function from the emmeans package. The results (both the estimates and the standard errors) differ slightly, even though both computations are based on the same linear model fitted with lm(). Is there an explanation for this? I've pasted the code and output below. x is a continuous variable and z is a categorical variable with 3 levels.
model1 <- lm(DV ~ z * x, data = data)
> sim_slopes(model1, pred = x, modx = z, johnson_neyman = FALSE)
SIMPLE SLOPES ANALYSIS
Slope of x when z = 3:
Est. S.E. t val. p
------ ------ -------- ------
0.50 0.10 4.89 0.00
Slope of x when z = 2:
Est. S.E. t val. p
------ ------ -------- ------
0.74 0.09 7.83 0.00
Slope of x when z = 1:
Est. S.E. t val. p
------ ------ -------- ------
0.33 0.10 3.37 0.00
> emtrends(model1, ~ z, var="x")
NOTE: Results may be misleading due to involvement in interactions
z x.trend SE df lower.CL upper.CL
1 0.290 0.0669 1016 0.158 0.421
2 0.618 0.0611 1016 0.498 0.738
3 0.411 0.0612 1016 0.291 0.531
I am trying to reproduce your results.
I tried interactions between z and x by using iris data, where
DV <- iris$Sepal.Length
z <- iris$Species # z is a categorical variable with 3 levels : setosa, virginica, and versicolor
x <- iris$Petal.Length #x is a continuous variable.
The resulting slope estimates and corresponding standard errors from sim_slopes and emtrends are almost identical.
model1 <- lm(DV ~ z * x, data = iris)
sim_slopes(model1, pred = x, modx = z, johnson_neyman = FALSE)
#SIMPLE SLOPES ANALYSIS
#Slope of x when z = virginica:
# Est. S.E. t val. p
#------ ------ -------- ------
# 1.00 0.09 11.43 0.00
# Slope of x when z = versicolor:
# Est. S.E. t val. p
#------ ------ -------- ------
# 0.83 0.10 8.10 0.00
# Slope of x when z = setosa:
# Est. S.E. t val. p
# ------ ------ -------- ------
# 0.54 0.28 1.96 0.05
emtrends(model1, ~ z, var="x")
# z x.trend SE df lower.CL upper.CL
# setosa 0.542 0.2768 144 -0.00476 1.09
# versicolor 0.828 0.1023 144 0.62611 1.03
# virginica 0.996 0.0871 144 0.82360 1.17
# Confidence level used: 0.95
However, the message "NOTE: Results may be misleading due to involvement in interactions" does not appear in my trial.
This result suggests that both packages compute the slopes in the same way. Given the note that appeared in your output, I would suggest checking your z and x data again and reconsidering how sensible the interaction of these variables is.
My suggestion is based on the explanation of the emmeans author here:
https://mran.microsoft.com/snapshot/2018-03-30/web/packages/emmeans/vignettes/interactions.html
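One way to see which package agrees with the model itself is to compute the simple slopes by hand from the lm() coefficients. The sketch below does this for the iris example using base R only. It is an independent check under the usual treatment-contrast coding, not a reproduction of what either package does internally.

```r
# Same model as above, written with the iris columns directly
fit <- lm(Sepal.Length ~ Species * Petal.Length, data = iris)
b <- coef(fit)

# Slope of Petal.Length within each species:
# the reference level (setosa) uses the main effect alone,
# the other levels add their interaction coefficient
slopes <- c(
  setosa     = unname(b["Petal.Length"]),
  versicolor = unname(b["Petal.Length"] + b["Speciesversicolor:Petal.Length"]),
  virginica  = unname(b["Petal.Length"] + b["Speciesvirginica:Petal.Length"])
)
round(slopes, 3)

# Cross-check: separate regressions within each species give the same slopes
by_group <- sapply(
  split(iris, iris$Species),
  function(d) coef(lm(Sepal.Length ~ Petal.Length, data = d))[["Petal.Length"]]
)
all.equal(slopes, by_group)
```

If the hand-computed slopes for your own model match one package but not the other, that narrows down where the discrepancy comes from (for example, a different model object or different data being passed in).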
According to the documentation of the mice package, if we want to impute data when we're interested in interaction terms, we need to use passive imputation, which is done the following way.
library(mice)
nhanes2.ext <- cbind(nhanes2, bmi.chl = NA)
ini <- mice(nhanes2.ext, max = 0, print = FALSE)
meth <- ini$meth
meth["bmi.chl"] <- "~I((bmi-25)*(chl-200))"
pred <- ini$pred
pred[c("bmi", "chl"), "bmi.chl"] <- 0
imp <- mice(nhanes2.ext, meth = meth, pred = pred, seed = 51600, print = FALSE)
It is said that
Imputations created in this way preserve the interaction of bmi with chl
Here, a new variable called bmi.chl is created in the original dataset. The meth step tells how this variable needs to be imputed from the existing ones. The pred step says we don't want to predict bmi and chl from bmi.chl. But now, if we want to apply a model, how do we proceed? Is the product defined by "~I((bmi-25)*(chl-200))" just a way to control for the imputed values of the main effects, i.e. bmi and chl?
If the model we want to fit is glm(hyp~chl*bmi, family="binomial"), what is the correct way to specify this model from the imputed data? fit1 or fit2?
fit1 <- with(data=imp, glm(hyp~chl*bmi, family="binomial"))
summary(pool(fit1))
Or do we have to use somehow the imputed values of the new variable created, i.e. bmi.chl?
fit2 <- with(data=imp, glm(hyp~chl+bmi+bmi.chl, family="binomial"))
summary(pool(fit2))
With passive imputation, it does not matter if you use the passively imputed variable, or if you re-calculate the product term in your call to glm.
The reason that fit1 and fit2 yield different results in your example is that you are not just doing passive imputation for the product term.
Instead, you are transforming the two variables before multiplying (i.e., you calculate bmi-25 and chl-200). As a result, the passively imputed variable bmi.chl does not represent the product term bmi*chl but rather (bmi-25)*(chl-200).
If you just calculate the product term, then fit1 and fit2 yield the same results like they should:
library(mice)
nhanes2.ext <- cbind(nhanes2, bmi.chl = NA)
ini <- mice(nhanes2.ext, max = 0, print = FALSE)
meth <- ini$meth
meth["bmi.chl"] <- "~I(bmi*chl)"
pred <- ini$pred
pred[c("bmi", "chl"), "bmi.chl"] <- 0
pred[c("hyp"), "bmi.chl"] <- 1
imp <- mice(nhanes2.ext, meth = meth, pred = pred, seed = 51600, print = FALSE)
fit1 <- with(data=imp, glm(hyp~chl*bmi, family="binomial"))
summary(pool(fit1))
# > round(summary(pool(fit1)),2)
# est se t df Pr(>|t|) lo 95 hi 95 nmis fmi lambda
# (Intercept) -23.94 38.03 -0.63 10.23 0.54 -108.43 60.54 NA 0.41 0.30
# chl 0.10 0.18 0.58 9.71 0.58 -0.30 0.51 10 0.43 0.32
# bmi 0.70 1.41 0.49 10.25 0.63 -2.44 3.83 9 0.41 0.30
# chl:bmi 0.00 0.01 -0.47 9.67 0.65 -0.02 0.01 NA 0.43 0.33
fit2 <- with(data=imp, glm(hyp~chl+bmi+bmi.chl, family="binomial"))
summary(pool(fit2))
# > round(summary(pool(fit2)),2)
# est se t df Pr(>|t|) lo 95 hi 95 nmis fmi lambda
# (Intercept) -23.94 38.03 -0.63 10.23 0.54 -108.43 60.54 NA 0.41 0.30
# chl 0.10 0.18 0.58 9.71 0.58 -0.30 0.51 10 0.43 0.32
# bmi 0.70 1.41 0.49 10.25 0.63 -2.44 3.83 9 0.41 0.30
# bmi.chl 0.00 0.01 -0.47 9.67 0.65 -0.02 0.01 25 0.43 0.33
This is not surprising because the ~I(bmi*chl) in mice and the bmi*chl in glm do the exact same thing. They merely calculate the product of the two variables.
Remark:
Note that I added a line saying that bmi.chl should be used as a predictor when imputing hyp. Without this step, passive imputation has no purpose because the imputation model would neglect the product term, thus being incongruent with the analysis model.
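The effect of shifting the variables before multiplying can be illustrated on complete data, without any imputation. The sketch below is an assumption-laden stand-in: it uses mtcars and lm() rather than nhanes2 and a logistic glm, and the shift values 3 and 150 are arbitrary. It shows that a product of shifted variables leaves the fit and the product-term coefficient unchanged but alters the intercept and main effects.

```r
# Complete-data illustration with mtcars (a stand-in, not the nhanes2 example)
fit_raw <- lm(mpg ~ wt * hp, data = mtcars)

# Same model, but with a product of shifted variables as the third term
mtcars2 <- transform(mtcars, wt.hp = (wt - 3) * (hp - 150))
fit_shift <- lm(mpg ~ wt + hp + wt.hp, data = mtcars2)

# Identical fitted values: both designs span the same column space
all.equal(fitted(fit_raw), fitted(fit_shift))

# The product-term coefficient is the same ...
all.equal(unname(coef(fit_raw)["wt:hp"]), unname(coef(fit_shift)["wt.hp"]))

# ... but the intercept and main effects are not
cbind(raw = coef(fit_raw), shifted = coef(fit_shift))
```

So when comparing a model with chl*bmi to one with a shifted product such as (bmi-25)*(chl-200), differences in the main-effect coefficients are expected even before any imputation is involved.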
Suppose I have to estimate coefficients a,b in regression:
y=a*x+b*z+c
I know in advance that y always satisfies y>=0 and y<=x, but the regression model sometimes produces predictions of y outside this range.
Sample data:
mydata<-data.frame(y=c(0,1,3,4,9,11),x=c(1,3,4,7,10,11),z=c(1,1,1,9,6,7))
round(predict(lm(y~x+z,data=mydata)),2)
1 2 3 4 5 6
-0.87 1.79 3.12 4.30 9.34 10.32
The first predicted value is <0.
I tried a model without an intercept: all predictions are >0, but the third prediction of y is >x (4.03>4)
round(predict(lm(y~x+z-1,data=mydata)),2)
1 2 3 4 5 6
0.76 2.94 4.03 4.67 8.92 9.68
I also considered to model proportion y/x instead of y:
mydata$y2x<-mydata$y/mydata$x
round(predict(lm(y2x~x+z,data=mydata)),2)
1 2 3 4 5 6
0.15 0.39 0.50 0.49 0.97 1.04
round(predict(lm(y2x~x+z-1,data=mydata)),2)
1 2 3 4 5 6
0.08 0.33 0.46 0.47 0.99 1.07
But now the sixth prediction is >1, whereas a proportion should be in the range [0,1].
I also tried to apply method where glm is used with offset option: Regression for a Rate variable in R
and
http://en.wikipedia.org/wiki/Poisson_regression#.22Exposure.22_and_offset
but this was not successful.
Please note that in my data the dependent variable, the proportion y/x, is both zero-inflated and one-inflated.
Any idea what a suitable approach would be to build such a model in R ('glm', 'lm')?
You're on the right track: if 0 ≤ y ≤ x then 0 ≤ (y/x) ≤ 1. This suggests fitting y/x to a logistic model in glm(...). Details are below, but considering that you've only got 6 points, this is a pretty good fit.
The major concern is that the model is not valid unless the error in (y/x) is Normal with constant variance (or, equivalently, the error in y increases with x). If this is true then we should get a (more or less) linear Q-Q plot, which we do.
One nuance: the interface to the glm logistic model wants two columns for y: "number of successes (S)" and "number of failures (F)". It then calculates the probability as S/(S+F). So we have to provide two columns which mimic this: y and x-y. Then glm(...) will calculate y/(y+(x-y)) = y/x.
Finally, the fit summary suggests that x is important and z may or may not be. You might want to try a model that excludes z and see if that improves AIC.
fit = glm(cbind(y,x-y)~x+z, data=mydata, family=binomial(logit))
summary(fit)
# Call:
# glm(formula = cbind(y, x - y) ~ x + z, family = binomial(logit),
# data = mydata)
# Deviance Residuals:
# 1 2 3 4 5 6
# -0.59942 -0.35394 0.62705 0.08405 -0.75590 0.81160
# Coefficients:
# Estimate Std. Error z value Pr(>|z|)
# (Intercept) -2.0264 1.2177 -1.664 0.0961 .
# x 0.6786 0.2695 2.518 0.0118 *
# z -0.2778 0.1933 -1.437 0.1507
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# (Dispersion parameter for binomial family taken to be 1)
# Null deviance: 13.7587 on 5 degrees of freedom
# Residual deviance: 2.1149 on 3 degrees of freedom
# AIC: 15.809
par(mfrow=c(2,2))
plot(fit) # residuals, Q-Q, Scale-Location, and Leverage Plots
mydata$pred <- predict(fit, type="response")
par(mfrow=c(1,1))
plot(mydata$y/mydata$x,mydata$pred,xlim=c(0,1),ylim=c(0,1), xlab="Actual", ylab="Predicted")
abline(0,1, lty=2, col="blue")
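The suggestion to try a model without z can be sketched as follows. The names fit_full, fit_reduced and pred_y are illustrative. The snippet refits both models on the sample data, compares their AIC, and checks that the predictions implied for y respect the 0 <= y <= x constraint.

```r
mydata <- data.frame(y = c(0, 1, 3, 4, 9, 11),
                     x = c(1, 3, 4, 7, 10, 11),
                     z = c(1, 1, 1, 9, 6, 7))

# Full model from above and a reduced model without z
fit_full    <- glm(cbind(y, x - y) ~ x + z, data = mydata, family = binomial(logit))
fit_reduced <- glm(cbind(y, x - y) ~ x,     data = mydata, family = binomial(logit))

# Lower AIC indicates the preferred model
AIC(fit_full, fit_reduced)

# Predicted proportions lie in (0, 1) by construction,
# so the implied predictions for y stay within [0, x]
pred_y <- predict(fit_full, type = "response") * mydata$x
all(pred_y >= 0 & pred_y <= mydata$x)
```

With only 6 observations, any AIC difference should be read cautiously. The comparison is meant as a starting point rather than a definitive model choice.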