Interpretation of Wald test in modelling periodic data (cosinor package) - r

I'm using the cosinor library to fit a model using the built-in dataset
library(cosinor)
fit <- cosinor.lm(Y ~ time(time) + X + amp.acro(X), data = vitamind, period = 12)
Now to test if the X variable contributes to the model I used
test_cosinor(fit, "X", param = "amp")
test_cosinor(fit, "X", param = "acr")
As explained in the documentation
https://cran.r-project.org/web/packages/cosinor/cosinor.pdf
This function performs a Wald test comparing the group with co-variates equal to 1 to the group with covariates equal to 0.
If I understand it right, if p < 0.05 the X variable does not contribute to the model. So, for example, if X = 1 are men and X = 0 are women, this means that the model is "similar" for both men and women, i.e. that men and women do not follow a different pattern during the period studied. Is this correct?
And my second question is what would be the interpretation if p < 0.05 for "amp" and p > 0.05 for "acr". I think that both should be significant for the variable to contribute to the model, is this right?

I am not familiar with the cosinor library, but I am pretty sure the p-value can be interpreted the same as for most other statistical methods.
In statistics, the p-value is the probability of obtaining results at least as extreme as the observed results of a statistical hypothesis test, assuming that the null hypothesis is correct. (Source: Investopedia)
A p-value of 0.05 means that, if the null hypothesis is true, the probability of observing results at least as extreme as these is 5%.
So if the p-value is smaller than 0.05 we often reject the null hypothesis, because results this extreme would be unlikely if it were true. Note that the p-value is not the probability that the null hypothesis itself is true.
In general, if p > 0.05 it means that x does not have a statistically significant impact on y. On the other hand, if p < 0.05, x does have a statistically significant impact on y.
So if X = 1 are men, X = 0 are women and p < 0.05, there is a statistically significant impact of gender on y.
If p < 0.05 for amp, this would mean that the amplitude differs significantly between the X = 1 and X = 0 groups. If the p-value for acr is higher than 0.05, there is no statistically significant difference in acrophase between the groups. The two tests address different parameters, so one can be significant without the other.
Be aware though that 0.05 is just a threshold that is often arbitrarily chosen and became common practice with time.
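To make the mechanics of a Wald test concrete, here is a minimal base-R sketch. It is not the cosinor package's internal code: the helper `wald_test` and the estimates and standard errors passed to it are hypothetical, chosen only to show how one parameter can come out significant while another does not.

```r
# Generic Wald test: is an estimated difference distinguishable from a null value?
# (hypothetical helper, not part of the cosinor package)
wald_test <- function(estimate, std_error, null_value = 0) {
  z <- (estimate - null_value) / std_error  # standardized distance from the null
  p <- 2 * pnorm(-abs(z))                   # two-sided p-value, normal approximation
  c(z = z, p.value = p)
}

# Made-up group differences (X = 1 minus X = 0), for illustration only
amp_diff <- wald_test(estimate = 1.2, std_error = 0.4)  # amplitude difference
acr_diff <- wald_test(estimate = 0.1, std_error = 0.3)  # acrophase difference

print(rbind(amplitude = amp_diff, acrophase = acr_diff))
```

With these made-up numbers the amplitude difference is significant at the 0.05 level and the acrophase difference is not, which would be read as "the groups differ in how large the oscillation is, but there is no evidence that they differ in when it peaks".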

Related

Output from Linear Mixed Models differs from Estimated Marginal Means

I have a query about the output statistics gained from linear mixed models (using the lmer function) relative to the output statistics taken from the estimated marginal means gained from this model
Essentially, I am running an LMM comparing the within-subjects effect of different contexts (with "Negative" coded as the baseline) on enjoyment ratings. The LMM output suggests that the difference between negative and polite contexts is not significant, with a p-value of .35. See the screenshot below with the relevant line highlighted:
LMM output
However, when I then run the lsmeans function on the same model (with the Holm correction), the p-value for the comparison between Negative and Polite context categories is now .05, and all of the other statistics have changed too. Again, see the screenshot below with the relevant line highlighted:
LSMeans output
I'm probably being dense because my understanding of LMMs isn't hugely advanced, but I've tried to Google the reason for this and yet I can't seem to find out why? I don't think it has anything to do with the corrections because the smaller p-value is observed when the Holm correction is used. Therefore, I was wondering why this is the case, and which value I should report/stick with and why?
Thank you for your help!
Regression coefficients and marginal means are not one and the same. Once you learn these concepts it'll be easier to figure out which one is more informative and therefore which one you should report.
After we fit a regression by estimating its coefficients, we can predict the outcome yi given the m input variables Xi = (Xi1, ..., Xim). If the inputs are informative about the outcome, the predicted yi is different for different Xi. If we average the predictions yi for examples with Xij = xj, we get the marginal effect of the jth feature at the value xj. It's crucial to keep track of which inputs are kept fixed (and at what values) and which inputs are averaged over (aka marginalized out).
In your case, contextCatPolite in the coefficients summary is the difference between Polite and Negative when smileType is set to its reference level (no reward, I'd guess). In the emmeans contrasts, Polite - Negative is the average difference over all smileTypes.
Interactions have a way of making interpretation more challenging and your model includes an interaction between smileType and contextCat. See Interaction analysis in emmeans.
To add to @dipetkov's answer, the coefficients in your LMM are based on treatment coding (sometimes called 'dummy' coding). With the interactions in the model, these coefficients are no longer "main effects" in the traditional sense of factorial ANOVA. For instance, if you have:
y = b_0 + b_1(X_1) + b_2(X_2) + b_3 (X_1 * X_2)
...b_1 is "the effect of X_1" only when X_2 = 0:
y = b_0 + b_1(X_1) + b_2(0) + b_3 (X_1 * 0)
y = b_0 + b_1(X_1)
Thus, as @dipetkov points out, 1.625 is not the difference between Negative and Polite on average across all other factors (which you get from emmeans). Instead, this coefficient is the difference between Negative and Polite specifically when smileType = 0, i.e. at its reference level.
If you use sum-to-zero (deviation) contrast coding instead of treatment coding, then the coefficients from the regression output would reflect effects averaged across smile types, in line with the estimated marginal means, because a value of 0 for smileType would then correspond to the average across smile types. The coding scheme thus has a huge effect on the estimated values and statistical significance of regression coefficients, but it should not affect F-tests based on the reduction in deviance/variance (because no matter how you code it, a given variable explains the same amount of variance).
https://stats.oarc.ucla.edu/spss/faq/coding-systems-for-categorical-variables-in-regression-analysis/
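The effect of the coding scheme can be checked on simulated data. The sketch below uses plain `lm()` rather than `lmer()` to stay in base R, and the data frame, factor names (`context`, `smile`) and effect sizes are all made up for illustration; only the contrast logic carries over.

```r
set.seed(1)

# Balanced 2 x 2 design, 25 observations per cell (hypothetical data)
d <- expand.grid(context = c("Negative", "Polite"),
                 smile   = c("none", "reward"),
                 rep     = 1:25)
# Polite - Negative is 0.5 without a reward smile and 2.5 with one
d$y <- 5 + 0.5 * (d$context == "Polite") + 1 * (d$smile == "reward") +
  2 * (d$context == "Polite" & d$smile == "reward") + rnorm(nrow(d))

# Treatment coding (R's default) for both factors
fit_trt <- lm(y ~ context * smile, data = d)
# Sum-to-zero coding for smile only; context stays treatment coded
fit_sum <- lm(y ~ context * smile, data = d,
              contrasts = list(smile = "contr.sum"))

# Observed cell means and the two differences of interest
cell <- with(d, tapply(y, list(context, smile), mean))
diff_at_ref <- cell["Polite", "none"] - cell["Negative", "none"]
diff_avg    <- mean(cell["Polite", ] - cell["Negative", ])

b_trt <- unname(coef(fit_trt)["contextPolite"])  # equals diff_at_ref
b_sum <- unname(coef(fit_sum)["contextPolite"])  # equals diff_avg
print(c(treatment = b_trt, at_reference = diff_at_ref,
        sum_coded = b_sum, averaged = diff_avg))
```

Under treatment coding the `contextPolite` coefficient reproduces the difference at the reference smile type, while with `smile` sum-coded it reproduces the difference averaged over smile types, which is the quantity an averaged marginal-means contrast reports.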

Change significance level MannKendall trend test -- R

I want to perform Mann-Kendall test at 99% and 90% confidence interval (CI). When running the lines below the analysis will be based on a 95% CI. How to change the code to perform it on 99 and 90% CI?
library(Kendall)  # provides MannKendall()
vec = c(1,2,3,4,5,6,7,8,9,10)
MannKendall(vec)
I cannot comment yet, but I have a question: what do you mean when you say that you need to perform the analysis on a 99 and 90% CI? Do you want to know if your value is significant at the 99 and 90% confidence levels?
If you just need to know if your score is significant at 99 and 90%, then r2evans was right: the alpha or significance level is just an arbitrary threshold that you use to define how small the p-value must be before you reject the assumption that there "is no effect", or in this case that the observations are independent. More importantly, the calculation of the p-value is independent of the confidence level you select, so if you want to know if your result is significant at different confidence levels, just compare your p-value with the corresponding alpha (0.01 for 99%, 0.10 for 90%).
I checked how the function works and did not see any indication that the alpha level selected is going to affect the results. If you check the source code of MannKendall(x) (by typing MannKendall without parentheses or anything), you can see that it is just Kendall(1:length(x), x). The function Kendall calculates a statistic tau, which "measures the strength of monotonic association between the vectors x and y", then it returns a p-value by calculating how likely your observed tau is under the assumption that there is no relation between 1:length(x) and x. In other words, it tells you how likely it is that you would obtain that tau just by chance. As you can see, this does not depend on the confidence level at all. The confidence level only matters at the end, when you are deciding how small the probability of your tau should be for you to conclude that it was not obtained just by chance.
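A small sketch of that comparison follows. To keep it free of extra packages it uses base R's `cor.test(..., method = "kendall")` as a stand-in for `Kendall::MannKendall()`; the two may report slightly different p-values in general (for example with ties), so treat this as an illustration of the decision step rather than a drop-in replacement.

```r
vec <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

# Kendall's tau between the time index and the series (computed once)
res <- cor.test(seq_along(vec), vec, method = "kendall")

# The same p-value is then compared against each significance level
alphas <- c("90%" = 0.10, "95%" = 0.05, "99%" = 0.01)
significant <- res$p.value < alphas

print(res$p.value)
print(significant)
```

Only the last comparison changes between confidence levels; the test statistic and p-value are computed a single time.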

how to decide two variables are correlated

Running the below command in R:
cor.test(loandata$Age,loandata$Losses.in.Thousands)
loandata is the name of the dataset
Age is the independent Variable
Losses.in.Thousands is the dependent variable
Below is the result in R:
Pearson's product-moment correlation
data: loandata$Age and loandata$Losses.in.Thousands
t = -61.09, df = 15288, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.4556139 -0.4301315
sample estimates:
cor
-0.4429622
How to decide whether Age is correlated with Losses.in.Thousand ?
How do we decide by looking at the p-value with alpha = 0.05?
As stated in the other answer, the correlation coefficient produced by cor.test() in the OP is -0.4429. The Pearson correlation coefficient is a measure of the linear association between two variables. It varies between -1.0 (perfect negative linear association) and 1.0 (perfect positive linear association). Its magnitude is the absolute value of the coefficient, i.e. its distance from 0 (no association).
The t-test indicates whether the correlation is significantly different from zero, given its magnitude relative to its standard error. In this case, the probability value for the t-test, p < 2.2e-16, indicates that we should reject the null hypothesis that the correlation is zero.
That said, the OP question:
How to decide whether Age is correlated with Losses.in.Thousands?
has two elements: statistical significance and substantive meaning.
From the perspective of statistical significance, the t-test indicates that the correlation is non-zero. Since the standard error of a correlation varies inversely with degrees of freedom, the very large number of degrees of freedom listed in the OP (15,288) means that a much smaller correlation would still result in a statistically significant t-test. This is why one must consider substantive significance in addition to statistical significance.
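The dependence on degrees of freedom can be made explicit with the standard relation between a Pearson correlation and its t statistic, t = r * sqrt(df) / sqrt(1 - r^2). The sketch below plugs in the values from the output above; the variable names are chosen here for illustration.

```r
r  <- -0.4429622   # sample correlation from the cor.test() output
df <- 15288        # degrees of freedom from the output (n - 2)

# t statistic implied by r and df; reproduces the t value in the output (about -61.09)
t_stat <- r * sqrt(df) / sqrt(1 - r^2)

# Smallest |r| that would still be significant at alpha = 0.05 (two-sided)
t_crit <- qt(0.975, df)
r_crit <- t_crit / sqrt(t_crit^2 + df)

print(c(t = t_stat, r_crit = r_crit))
```

With this many observations, a correlation of roughly 0.016 in magnitude would already be statistically significant, which is why the size of the coefficient matters more than the p-value here.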
From a substantive significance perspective, interpretations vary. Hemphill 2003 cites Cohen's (1988) rule of thumb for correlation magnitudes in psychology studies:
0.10 - low
0.30 - medium
0.50 - high
Hemphill goes on to conduct a meta analysis of correlation coefficients in psychology studies that he summarized into the following table.
As we can see from the table, Hemphill's empirical guidelines are much less stringent than Cohen's prior recommendations.
Alternative: coefficient of determination
As an alternative, the coefficient of determination, r^2 can be used as a proportional reduction of error measure. In this case, r^2 = 0.1962, and we can interpret it as "If we know one's age, we can reduce our error in predicting losses in thousands by approximately 20%."
Reference: Burt Gerstman's Statistics Primer, San Jose State University.
Conclusion: Interpretation varies by domain
Given the problem domain, if the literature accepts a correlation magnitude of 0.45 as "large," then treat it as large, as is the case in many of the social sciences. In other domains, however, a much higher magnitude is required for a correlation to be considered "large."
Sometimes, even a "small" correlation is substantively meaningful as Hemphill 2003 notes in his conclusion.
For example, even though the correlation between aspirin taking and preventing a heart attack is only r=0.03 in magnitude, (see Rosenthal 1991, p. 136) -- small by most statistical standards -- this value may be socially important and nonetheless influence social policy.
To know if the variables are correlated, the value to look at is cor = -0.4429
In your case, the values are negatively correlated, however the magnitude of correlation isn't very high.
A simpler, less confusing way to check whether two variables are correlated is:
cor(loandata$Age,loandata$Losses.in.Thousands)
[1] -0.4429622
The null hypothesis of the Pearson test is that the two variables are not correlated: H0: rho = 0
The p-value is the probability, under the null hypothesis, that the test statistic (or its absolute value for a two-tailed test) would be at least as extreme as the actual observed result (or its absolute value for a two-tailed test). You can reject the null hypothesis if the p-value is smaller than the significance level alpha (here 0.05). This is the case in your test, which means the variables are significantly correlated.

Fitting GLM (family = inverse.gaussian) on simulated AR(1)-data.

I am encountering quite an annoying and, to me, incomprehensible problem, and I hope some of you can help me. I am trying to estimate the autoregression (the influence of previous measurements of variable X on the current measurement of X) for 4 groups that have a positively skewed distribution to various degrees. The theory is that more positively skewed distributions have less variance, and since the relationship between 2 variables depends on the amount of shared variance, positively skewed distributions have a smaller autoregression than more normally distributed variables.
I use simulations to investigate this, and generate data as follows: I simulate data for n people with tp time points. I use a fixed autoregressive parameter, phi (at .3 so we have a stationary process). To generate positively skewed distributions I use a chi-square distributed error. Individuals differ in the degrees of freedom that is used for the chi2 distributed errors. In other words, degrees of freedom is a level 2 variable (and is in itself chi2(1)-distributed). Individuals with a very low df get a very skewed distribution whereas individuals with a higher df get a more normal distribution.
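n <- 100; tp <- 50; burn <- 100            # Illustrative sizes (not given in the post).
phi <- rep(0.3, n)                         # Fixed autoregressive parameter per person.
df <- rchisq(n, 1)                         # Person-level degrees of freedom, chi2(1).
chi <- matrix(NA_real_, n, tp + burn)      # Persons in rows, time points in columns.
for (i in 1:n) {                           # Loop over persons.
  chi[i, 1] <- rchisq(1, df[i])            # Set initial value.
  for (t in 2:(tp + burn)) {               # Loop over time points.
    chi[i, t] <- phi[i] * chi[i, t - 1] +  # Autoregressive effect.
      rchisq(1, df[i])                     # Chi-square distributed error.
  }                                        # End loop over time points.
}                                          # End loop over persons.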
Now that I have the outcome variable generated, I put it in long format, I create a lagged predictor, and I person mean center the predictor (or group mean center, or cluster mean center, all the same). I call this lagged and centered predictor chi.pred. I make the subgroups based on the degrees of freedom of individuals. The 25% with a lowest df goes in subgroup 1, 26% - 50% in subgroup 2, etc.
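The reshaping, lagging and person-mean centering steps described above could look roughly like this in base R. This is a sketch on a small toy matrix rather than the simulated data from the post; the column names `id`, `time`, `chi.lag` and `chi.pred` are assumptions based on the description.

```r
set.seed(1)
# Toy stand-in for the simulated outcome: 3 persons (rows) x 5 time points (columns)
chi <- matrix(rchisq(3 * 5, df = 2), nrow = 3)

# Long format: as.vector() reads the matrix column by column (time within column)
long <- data.frame(id   = rep(1:3, times = 5),
                   time = rep(1:5, each = 3),
                   chi  = as.vector(chi))
long <- long[order(long$id, long$time), ]

# Lag the outcome by one time point within each person (first time point gets NA)
long$chi.lag <- ave(long$chi, long$id,
                    FUN = function(x) c(NA, head(x, -1)))

# Person-mean center the lagged predictor
long$chi.pred <- long$chi.lag -
  ave(long$chi.lag, long$id, FUN = function(x) mean(x, na.rm = TRUE))

print(head(long))
```

A quick sanity check is that `chi.pred` averages to zero within every person and that each person's first time point is missing.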
The problem is this: fitting a multilevel (i.e. mixed or random effects model) autoregressive(1) model with family = inverse.gaussian and link = 'identity', using glmer() from the lme4 package gives me quite a lot of warnings. E.g. "degenerate Hessian", "large eigen value/ratio", "failed to converge with max|grad", etc.. I just don't get why.
The models I fit are:
# Random intercept, but fixed slope with subgroups as level 2 predictor of slope.
lmer(chi ~ chi.pred + chi.pred:factor(sub.df.noise) + (1|id), data = sim.data, control = lmerControl(optimizer = 'bobyqa'))
# Random intercept and slope.
lmer(chi ~ chi.pred + (1 + chi.pred|id), data = sim.data, control = lmerControl(optimizer = 'bobyqa'))
The reason I use inverse gaussian is because it is said to work better on skewed data.
Does anybody have any clue why I can't fit the models? I have tried increasing sample size and time points, different optimizers, I have double-double-double checked if lagging and centering the data is correct, increased the number of iterations, added some noise to the subgroups (since otherwise they are 1 on 1 related to degree of freedom) etc.

R: Calculate and interpret odds ratio in logistic regression

I am having trouble interpreting the results of a logistic regression. My outcome variable is Decision and is binary (0 or 1, not take or take a product, respectively).
My predictor variable is Thoughts and is continuous, can be positive or negative, and is rounded up to the 2nd decimal point.
I want to know how the probability of taking the product changes as Thoughts changes.
The logistic regression equation is:
glm(Decision ~ Thoughts, family = binomial, data = data)
According to this model, Thoughts has a significant impact on probability of Decision (b = .72, p = .02). To determine the odds ratio of Decision as a function of Thoughts:
exp(coef(results))
Odds ratio = 2.07.
Questions:
How do I interpret the odds ratio?
Does an odds ratio of 2.07 imply that a .01 increase (or decrease) in Thoughts affect the odds of taking (or not taking) the product by 0.07 OR
Does it imply that as Thoughts increases (decreases) by .01, the odds of taking (not taking) the product increase (decrease) by approximately 2 units?
How do I convert odds ratio of Thoughts to an estimated probability of Decision?
Or can I only estimate the probability of Decision at a certain Thoughts score (i.e. calculate the estimated probability of taking the product when Thoughts == 1)?
The coefficients returned by a logistic regression in R are on the logit (log-odds) scale. To convert a coefficient to an odds ratio, you can exponentiate it, as you've done above. To convert logits to probabilities, you can use the function exp(logit)/(1+exp(logit)). However, there are some things to note about this procedure.
First, I'll use some reproducible data to illustrate
library('MASS')
data("menarche")
m<-glm(cbind(Menarche, Total-Menarche) ~ Age, family=binomial, data=menarche)
summary(m)
This returns:
Call:
glm(formula = cbind(Menarche, Total - Menarche) ~ Age, family = binomial,
data = menarche)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.0363 -0.9953 -0.4900 0.7780 1.3675
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -21.22639 0.77068 -27.54 <2e-16 ***
Age 1.63197 0.05895 27.68 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 3693.884 on 24 degrees of freedom
Residual deviance: 26.703 on 23 degrees of freedom
AIC: 114.76
Number of Fisher Scoring iterations: 4
The coefficients displayed are for logits, just as in your example. If we plot these data and this model, we see the sigmoidal function that is characteristic of a logistic model fit to binomial data
#predict gives the predicted value in terms of logits
plot.dat <- data.frame(prob = menarche$Menarche/menarche$Total,
age = menarche$Age,
fit = predict(m, menarche))
#convert those logit values to probabilities
plot.dat$fit_prob <- exp(plot.dat$fit)/(1+exp(plot.dat$fit))
library(ggplot2)
ggplot(plot.dat, aes(x=age, y=prob)) +
geom_point() +
geom_line(aes(x=age, y=fit_prob))
Note that the change in probabilities is not constant - the curve rises slowly at first, then more quickly in the middle, then levels out at the end. The difference in probabilities between 10 and 12 is far less than the difference in probabilities between 12 and 14. This means that it's impossible to summarise the relationship of age and probabilities with one number without transforming probabilities.
To answer your specific questions:
How do you interpret odds ratios?
The exponentiated intercept is the odds of a "success" (in your data, this is the odds of taking the product) when x = 0 (i.e. zero thoughts). The exponentiated coefficient is the odds ratio: the factor by which those odds are multiplied when x increases by one whole unit (i.e. x = 1; one thought). Using the menarche data:
exp(coef(m))
(Intercept) Age
6.046358e-10 5.113931e+00
We could interpret this as the odds of menarche occurring at age = 0 being about 0.0000000006 (6 x 10^-10). Or, basically impossible. Exponentiating the age coefficient tells us the expected multiplicative increase in the odds of menarche for each unit of age. In this case, it's just over a quintupling. An odds ratio of 1 indicates no change, whereas an odds ratio of 2 indicates a doubling, etc.
Your odds ratio of 2.07 implies that a 1 unit increase in 'Thoughts' increases the odds of taking the product by a factor of 2.07.
How do you convert odds ratios of thoughts to an estimated probability of decision?
You need to do this for selected values of thoughts, because, as you can see in the plot above, the change is not constant across the range of x values. If you want the probability of some value for thoughts, get the answer as follows:
exp(intercept + coef*THOUGHT_Value)/(1 + exp(intercept + coef*THOUGHT_Value))
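As a rough illustration of both conversions, the sketch below fits a logistic regression to simulated data and computes the probability at a chosen predictor value three equivalent ways. The data, the variable names `thoughts` and `decision`, and the true coefficients used to simulate are all assumptions made up for this example.

```r
set.seed(42)
# Simulated stand-in for the asker's data (hypothetical values)
thoughts <- round(rnorm(200), 2)
decision <- rbinom(200, 1, plogis(-0.3 + 0.72 * thoughts))

fit <- glm(decision ~ thoughts, family = binomial)
b0 <- coef(fit)[["(Intercept)"]]
b1 <- coef(fit)[["thoughts"]]

x <- 1  # predictor value at which to estimate the probability

# Three equivalent ways to get the estimated probability at thoughts == x
p_manual  <- exp(b0 + b1 * x) / (1 + exp(b0 + b1 * x))
p_plogis  <- plogis(b0 + b1 * x)
p_predict <- predict(fit, newdata = data.frame(thoughts = x), type = "response")

# Odds ratios for a one-unit and for a 0.01-unit increase in the predictor
or_unit  <- exp(b1)
or_small <- exp(0.01 * b1)

print(c(manual = p_manual, plogis = p_plogis, predict = unname(p_predict)))
print(c(or_per_unit = or_unit, or_per_0.01 = or_small))
```

Note that the odds ratio refers to a one-unit change in the predictor. For a 0.01 change the factor is exp(0.01 * b), which is much closer to 1, and applying it 100 times in a row recovers the one-unit odds ratio.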
Odds and probability are two different measures, both addressing the same aim of measuring the likeliness of an event to occur. They should not be compared to each other, only among themselves!
While odds of two predictor values (while holding others constant) are compared using "odds ratio" (odds1 / odds2), the same procedure for probability is called "risk ratio" (probability1 / probability2).
In general, odds are preferred over probability when it comes to ratios, since probability is limited to between 0 and 1 while odds range from 0 to +inf (and log odds from -inf to +inf).
To easily calculate odds ratios including their confidence intervals, see the oddsratio package:
library(oddsratio)
fit_glm <- glm(admit ~ gre + gpa + rank, data = data_glm, family = "binomial")
# Calculate OR for specific increment step of continuous variable
or_glm(data = data_glm, model = fit_glm,
incr = list(gre = 380, gpa = 5))
  predictor oddsratio CI.low (2.5 %) CI.high (97.5 %)          increment
1       gre     2.364          1.054            5.396                380
2       gpa    55.712          2.229         1511.282                  5
3     rank2     0.509          0.272            0.945 Indicator variable
4     rank3     0.262          0.132            0.512 Indicator variable
5     rank4     0.212          0.091            0.471 Indicator variable
Here you can simply specify the increment of your continuous variables and see the resulting odds ratios. In this example, the odds of the response admit are about 55 times higher when the predictor gpa is increased by 5.
If you want to predict probabilities with your model, simply use type = "response" when predicting from your model. This will automatically convert log odds to probability. You can then calculate risk ratios from the calculated probabilities. See ?predict.glm for more details.
I found the epiDisplay package, which works fine! It might be useful for others, but note that your confidence intervals or exact results will vary according to the package used, so it is good to read the package details and choose the one that works well for your data.
Here is a sample code:
library(epiDisplay)
data(Wells, package="carData")
glm1 <- glm(switch~arsenic+distance+education+association,
family=binomial, data=Wells)
logistic.display(glm1)
The formula given above for converting logits to probabilities, exp(logit)/(1+exp(logit)), is only meaningful when it is applied to a full log-odds value (intercept plus coefficient times predictor value). Applied to a single coefficient or to an odds ratio it does not give a meaningful probability, because in logistic regression an odds ratio is a ratio between two odds values (which are themselves ratios). To interpret the odds ratio itself, it may be more useful to subtract 1 from it and read the result as a percentage: the odds of the outcome increase or decrease by that percentage for a one-unit increase in the predictor. For example, an odds ratio of 2.07 corresponds to a 107% increase in the odds.
