Ljung-Box p-value is less than threshold - r

My data is cleaned and has no outliers, but I am still getting a p-value of 2.2e-16 from the Ljung-Box test. Can anyone help me figure out what the problem could be?
I need a Ljung-Box p-value greater than 0.05.
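For context, the Ljung-Box test checks for autocorrelation, not for outliers, so a tiny p-value usually means the tested series is still serially correlated. The following is a minimal sketch under stated assumptions: the simulated AR(1) series, the lag of 10 and the AR(1) model are illustrative and not taken from the question.

```r
set.seed(1)
# Simulated AR(1) series (illustrative; not the asker's data)
x <- arima.sim(model = list(ar = 0.8), n = 500)

# Ljung-Box on the raw series: strong autocorrelation gives a tiny p-value
raw_test <- Box.test(x, lag = 10, type = "Ljung-Box")

# Fit an AR(1) model and test its residuals instead;
# fitdf = 1 accounts for the one estimated AR coefficient
fit <- arima(x, order = c(1, 0, 0))
res_test <- Box.test(residuals(fit), lag = 10, type = "Ljung-Box", fitdf = 1)

raw_test$p.value
res_test$p.value
```

If the test is applied to the residuals of a model that captures the dependence, the p-value is typically much larger than on the raw series.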

Related

Interpretation of Wald test in modelling periodic data (cosinor package)

I'm using the cosinor library to fit a model using the built-in dataset
library(cosinor)
fit <- cosinor.lm(Y ~ time(time) + X + amp.acro(X), data = vitamind, period = 12)
Now to test if the X variable contributes to the model I used
test_cosinor(fit, "X", param = "amp")
test_cosinor(fit, "X", param = "acr")
As explained in the documentation
https://cran.r-project.org/web/packages/cosinor/cosinor.pdf
This function performs a Wald test comparing the group with co-variates equal to 1 to the group with covariates equal to 0.
If I understand it correctly, if p > 0.05 the X variable does not contribute to the model. So, for example, if X = 1 are men and X = 0 are women, this means that the model is "similar" for both men and women, i.e. that men and women do not follow a different pattern during the period studied. Is this correct?
And my second question is what would be the interpretation if p < 0.05 for "amp" and p > 0.05 for "acr". I think that both should be significant for the variable to contribute to the model, is this right?
I am not familiar with the cosinor library, but I am pretty sure the p-value can be interpreted the same as for most other statistical methods.
"In statistics, the p-value is the probability of obtaining results at least as extreme as the observed results of a statistical hypothesis test, assuming that the null hypothesis is correct." (Investopedia)
A p-value of 0.05 means that, if the null hypothesis is true, the probability of observing results at least as extreme as these is 5%.
So if the p-value is smaller than 0.05 we often reject the null hypothesis, because results this extreme would be unlikely (less than a 5% chance) if it were true. Note that the p-value is not the probability that the null hypothesis itself is true.
In general if p>0.05 it means that x does not have a statistically significant impact on y. On the other hand if p<0.05 x does have a statistically significant impact on y
So if X=1 are men, X=0 are women and p<0.05 there is a statistically significant impact of gender on y.
If p < 0.05 for amp, this would mean that the amplitude differs significantly between the two groups. Since the p-value for acr is higher than 0.05, there is no statistically significant evidence that the acrophase differs between them.
Be aware though that 0.05 is just a threshold that is often arbitrarily chosen and became common practice with time.
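As a rough illustration of how a Wald-type p-value is obtained and compared with the threshold, here is a minimal sketch. The estimate and standard error are hypothetical numbers, and this is a generic two-sided Wald test, not necessarily what test_cosinor does internally.

```r
# Hypothetical estimate and standard error (not from the cosinor fit)
estimate <- 0.5
std_error <- 0.2

# Wald z-statistic and two-sided p-value under a standard normal reference
z <- estimate / std_error
p_value <- 2 * pnorm(-abs(z))

# Decision at the conventional 0.05 threshold
reject_null <- p_value < 0.05
p_value
reject_null
```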

Calculating exact p-values from a Pearson's correlation test (manually or in R)

(What I believe is) a very simple question. I have just performed a Pearson's correlation test in R, and I'd like to know the exact p-value. However, the p-value is so small R (or tdist in Excel, or any other online calculate-it software) tells me the p-value is <2.2e-16 or 0. I suspect it has something to do with the large number of observations I have (n = 11001).
Here's the output I get from running a pairwise correlation
cor.test(mets$s_M48153,mets$s_M48152)
Pearson's product-moment correlation
data: mets$s_M48153 and mets$s_M48152
t = 88.401, df = 10999, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.6334378 0.6552908
sample estimates:
cor
0.6444959
"cor.test(mets$s_M48153,mets$s_M48152)$p.value" also gives me a p-value of 0.
Because of this, I'd like to manually calculate the exact p-value using the t-statistic and degrees of freedom, but I can't find the formula anywhere. Does anyone know the formula, or can tell me how to extract exact p-values from R (if possible)?
Thanks everyone for your suggestions and advice, and sorry for not replying sooner. I've been juggling a few things around until recently. However, I did ask a statistician within my department about this, and he agreed with what r2evans said. If the p-value is smaller than 10^-16, there's little point in reporting an 'exact' value, since the point is that there is strong evidence that the result differs from the null hypothesis.
One case when p-values might be important is when you want to rank by order of significance, but you could get around this by using z-scores to rank instead.
To address the original question, I defer to this guide, which I found long after posting this question: https://stats.stackexchange.com/questions/315311/how-to-find-p-value-using-estimate-and-standard-error.
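For the manual calculation asked about in the question, a minimal sketch follows. It assumes a two-sided test and takes the t-statistic and degrees of freedom from the cor.test output above; working on the log scale avoids the underflow to 0.

```r
t_stat <- 88.401   # from the cor.test output above
df <- 10999

# Direct computation underflows to 0 in double precision
p_direct <- 2 * pt(-abs(t_stat), df)

# Work on the log scale instead: log(p) = log(2) + log P(T <= -|t|)
log_p <- log(2) + pt(-abs(t_stat), df, log.p = TRUE)
log10_p <- log_p / log(10)

# Split into mantissa and exponent so that p = mantissa * 10^exponent
exponent <- floor(log10_p)
mantissa <- 10^(log10_p - exponent)

cat(sprintf("p is approximately %.2f x 10^%d\n", mantissa, as.integer(exponent)))
```

The log10 value can also be used directly for ranking results by significance.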

Calculating F-Ratio (F-value) for multiple regression in R

In a statistics course for management and economics I am taking we were given a dataset consisting of information about job applicants (age, school and bachelor grades, number of internships performed, time spent abroad, performance during an interview) and an assessment of their job performance after they were hired. We are supposed to analyze which variables predict later job performance using multiple linear regressions. In total we are supposed to fit four models that subsequently add more and more predictors and compare their R^2, F-value, p-value for the F-test, and explain if the model holds explanatory value.
So far I have the following code:
#import function for calculating linear regression models with heteroscedasticity-robust standard errors
library(RCurl) #provides getURL()
url_robust <-"https://raw.githubusercontent.com/IsidoreBeautrelet/economictheoryblog/master/robust_summary.R"
eval(parse(text = getURL(url_robust, ssl.verifypeer = FALSE)), envir=.GlobalEnv)
#calculate linear models
lm_1 <- lm(performance ~ age + sex, data_A)
summary(lm_1, robust = T)
lm_2 <- lm(performance ~ age + sex + school + bachelor, data_A)
summary(lm_2, robust = T)
lm_3 <- lm(performance ~ age + sex + school + bachelor + abroad + internships, data_A)
summary(lm_3, robust = T)
lm_4 <- lm(performance ~ age + sex + school + bachelor + abroad + internships + interview, data_A)
summary(lm_4, robust = T)
Here is my output from the summary call for lm_4:
Call:
lm(formula = performance ~ age + sex + school + bachelor + abroad +
internships + interview, data = data_A)
Residuals:
Min 1Q Median 3Q Max
-1.26292 -0.31620 -0.00085 0.29548 1.51859
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.073422 1.046419 1.981 0.0527 .
age 0.011915 0.023007 0.518 0.6067
sex 0.046603 0.149346 0.312 0.7562
school -0.192044 0.123340 -1.557 0.1254
bachelor 0.243797 0.111866 2.179 0.0338 *
abroad 0.015325 0.009493 1.614 0.1124
internships 0.037040 0.041557 0.891 0.3768
interview 0.339886 0.216935 1.567 0.1231
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.4776 on 53 degrees of freedom
(5 observations deleted due to missingness)
Multiple R-squared: 0.3195, Adjusted R-squared: 0.2297
F-statistic: 4.18 on 7 and 53 DF, p-value: 0.0009912
Additionally I have the solutions we are supposed to obtain for each of the parameters from our model as well as the coefficients of model 4 (the most complex model with highest number of predictors).
So far the models I fitted seem to be correct, since I obtain the same coefficients with the same t- and p-values. My issue is that in the F-test on the multiple linear model I obtain the F-statistic instead of the F-value. Based on my research, I believe that the difference between the F-statistic and the F-value (also sometimes called the F-ratio) is that the F-statistic is the critical F-value (i.e. the value below which the F-test is no longer statistically significant) and the F-value/F-ratio is the actual test value. Is this correct?
In R I would also like to obtain the F-value. Our professor uses gretl instead of R to solve his problems, and gretl apparently always gives you the F-value for linear regression models instead of the F-statistic. I have tried many different methods of getting this F-value, but I can't get any of them to work. I will post my failed approach below. Any suggestions on how I can get the F-value for each of these models would be highly appreciated! Also, if anyone could explain how exactly the F-value would in theory be calculated for one of my models (lm_1 would probably be most practical since it has the fewest predictors), that would really help me, since I think my current issue is that I have misunderstood something about calculating the F-value.
(For simplicity's sake I always started with the least complex model, wanting to find a solution for that and then work my way up to the more complex ones.)
Approach 1: perform an anova on my model
anova(lm_1)
Console Output:
Analysis of Variance Table
Response: performance
Df Sum Sq Mean Sq F value Pr(>F)
age 1 0.1212 0.121247 0.4054 0.5266
sex 1 0.0170 0.016995 0.0568 0.8124
Residuals 63 18.8425 0.299087
Here I obtain two different F-values. As far as I understand the ANOVA, the first one, in the row for age, tests only the effect of age without sex, and the second one tests the effect of sex after removing the effects analyzed in the row above? I am still a bit confused about this.
Since F value = variance of the group means (Mean Square Between) / mean of the within-group variances (Mean Squared Error), I wanted to manually calculate the F-value from the ANOVA output, but I wasn't sure how to proceed. None of the solutions I tried led to the F-value I should be obtaining. Could someone explain how I would go about calculating it?
In my research I also often came across the phrase nested models and that this would then give me the F-value which analyses the entire model. Sadly I couldn't figure out what a nested model exactly is and how I would have to input this into R.
Thanks for your help!
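To illustrate the manual calculation and the nested-model comparison mentioned above, here is a minimal sketch. It assumes classical (non-robust) standard errors and uses the built-in mtcars data as a stand-in for data_A; the variable names are illustrative. It shows that the overall F reported by summary() equals the value computed from the sequential ANOVA table and the value from comparing the model against an intercept-only model.

```r
# Built-in data stands in for the asker's data_A (illustrative assumption)
full <- lm(mpg ~ wt + hp, data = mtcars)
null <- lm(mpg ~ 1, data = mtcars)   # intercept-only model, nested in `full`

# 1) Overall F as reported by summary()
f_summary <- unname(summary(full)$fstatistic["value"])

# 2) Same value from the sequential ANOVA table:
#    (sum of predictor Sum Sq / sum of predictor Df) / residual Mean Sq
tab <- anova(full)
k <- nrow(tab) - 1                    # predictor rows; the last row is Residuals
df_model <- sum(tab[["Df"]][1:k])
f_manual <- (sum(tab[["Sum Sq"]][1:k]) / df_model) / tab[["Mean Sq"]][k + 1]

# 3) Same value again from comparing the two nested models
f_nested <- anova(null, full)[["F"]][2]

# p-value of the overall F-test
p_value <- pf(f_summary, df1 = df_model, df2 = tab[["Df"]][k + 1], lower.tail = FALSE)

c(summary = f_summary, manual = f_manual, nested = f_nested)
```

Applied to the anova(lm_1) table shown above, the same arithmetic would be ((0.1212 + 0.0170) / 2) / 0.2991, which is roughly 0.23. A summary computed with robust standard errors may report a different F.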

R: Calculate and interpret odds ratio in logistic regression

I am having trouble interpreting the results of a logistic regression. My outcome variable is Decision and is binary (0 or 1, not take or take a product, respectively).
My predictor variable is Thoughts and is continuous, can be positive or negative, and is rounded up to the 2nd decimal point.
I want to know how the probability of taking the product changes as Thoughts changes.
The logistic regression equation is:
glm(Decision ~ Thoughts, family = binomial, data = data)
According to this model, Thoughts has a significant impact on probability of Decision (b = .72, p = .02). To determine the odds ratio of Decision as a function of Thoughts:
exp(coef(results))
Odds ratio = 2.07.
Questions:
How do I interpret the odds ratio?
Does an odds ratio of 2.07 imply that a .01 increase (or decrease) in Thoughts affect the odds of taking (or not taking) the product by 0.07 OR
Does it imply that as Thoughts increases (decreases) by .01, the odds of taking (not taking) the product increase (decrease) by approximately 2 units?
How do I convert odds ratio of Thoughts to an estimated probability of Decision?
Or can I only estimate the probability of Decision at a certain Thoughts score (i.e. calculate the estimated probability of taking the product when Thoughts == 1)?
The coefficient returned by a logistic regression in R is a logit, or the log of the odds. To convert logits to odds ratios, you can exponentiate them, as you've done above. To convert logits to probabilities, you can use the function exp(logit)/(1+exp(logit)). However, there are some things to note about this procedure.
First, I'll use some reproducible data to illustrate
library('MASS')
data("menarche")
m<-glm(cbind(Menarche, Total-Menarche) ~ Age, family=binomial, data=menarche)
summary(m)
This returns:
Call:
glm(formula = cbind(Menarche, Total - Menarche) ~ Age, family = binomial,
data = menarche)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.0363 -0.9953 -0.4900 0.7780 1.3675
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -21.22639 0.77068 -27.54 <2e-16 ***
Age 1.63197 0.05895 27.68 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 3693.884 on 24 degrees of freedom
Residual deviance: 26.703 on 23 degrees of freedom
AIC: 114.76
Number of Fisher Scoring iterations: 4
The coefficients displayed are for logits, just as in your example. If we plot these data and this model, we see the sigmoidal function that is characteristic of a logistic model fit to binomial data
#predict gives the predicted value in terms of logits
plot.dat <- data.frame(prob = menarche$Menarche/menarche$Total,
age = menarche$Age,
fit = predict(m, menarche))
#convert those logit values to probabilities
plot.dat$fit_prob <- exp(plot.dat$fit)/(1+exp(plot.dat$fit))
library(ggplot2)
ggplot(plot.dat, aes(x=age, y=prob)) +
geom_point() +
geom_line(aes(x=age, y=fit_prob))
Note that the change in probabilities is not constant - the curve rises slowly at first, then more quickly in the middle, then levels out at the end. The difference in probabilities between 10 and 12 is far less than the difference in probabilities between 12 and 14. This means that it's impossible to summarise the relationship of age and probabilities with one number without transforming probabilities.
To answer your specific questions:
How do you interpret odds ratios?
The exponentiated intercept is the odds of a "success" (in your data, this is the odds of taking the product) when x = 0 (i.e. zero thoughts). The odds ratio for your coefficient is the factor by which these odds are multiplied when you add one whole x value (i.e. x=1; one thought). Using the menarche data:
exp(coef(m))
(Intercept) Age
6.046358e-10 5.113931e+00
We could interpret this as the odds of menarche occurring at age = 0 being .0000000006. Or, basically impossible. Exponentiating the age coefficient tells us the expected increase in the odds of menarche for each unit of age. In this case, it's just over a quintupling. An odds ratio of 1 indicates no change, whereas an odds ratio of 2 indicates a doubling, etc.
Your odds ratio of 2.07 implies that a 1 unit increase in 'Thoughts' increases the odds of taking the product by a factor of 2.07.
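To address the 0.01-increment part of the question, here is a minimal sketch. It assumes the coefficient is recovered from the reported odds ratio of 2.07; the helper or_for_step is a hypothetical name introduced for illustration.

```r
or_per_unit <- 2.07            # odds ratio reported in the question
b <- log(or_per_unit)          # implied coefficient on the logit scale

# Odds ratio for an increase of `delta` in the predictor is exp(b * delta)
or_for_step <- function(b, delta) exp(b * delta)

or_for_step(b, 1)      # one-unit increase: recovers 2.07
or_for_step(b, 0.01)   # 0.01 increase: only slightly above 1
```

Because odds ratios are multiplicative, the effect of a 0.01 step is the 100th root of the one-unit odds ratio, not 2.07 divided by 100.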
How do you convert odds ratios of thoughts to an estimated probability of decision?
You need to do this for selected values of thoughts, because, as you can see in the plot above, the change is not constant across the range of x values. If you want the probability of some value for thoughts, get the answer as follows:
exp(intercept + coef*THOUGHT_Value)/(1 + exp(intercept + coef*THOUGHT_Value))
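A minimal sketch of this calculation, using the menarche model from above rather than the asker's data (the age value of 13 is an arbitrary illustration). It compares the manual inverse-logit with the built-in plogis() and with predict(..., type = "response").

```r
library(MASS)
data("menarche")
m <- glm(cbind(Menarche, Total - Menarche) ~ Age, family = binomial, data = menarche)

age_value <- 13   # illustrative predictor value
b <- coef(m)

# Manual inverse-logit, following the formula above
logit <- b[["(Intercept)"]] + b[["Age"]] * age_value
p_manual <- exp(logit) / (1 + exp(logit))

# Equivalent built-ins
p_plogis <- plogis(logit)
p_predict <- predict(m, newdata = data.frame(Age = age_value), type = "response")

c(manual = p_manual, plogis = p_plogis, predict = unname(p_predict))
```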
Odds and probability are two different measures, both addressing the same aim of measuring the likeliness of an event to occur. They should not be compared to each other, only among themselves!
While odds of two predictor values (while holding others constant) are compared using "odds ratio" (odds1 / odds2), the same procedure for probability is called "risk ratio" (probability1 / probability2).
In general, odds are preferred over probabilities when it comes to ratios, since probability is limited to the range 0 to 1 while odds range from 0 to +inf (and log odds from -inf to +inf).
To easily calculate odds ratios including their confidence intervals, see the oddsratio package:
library(oddsratio)
fit_glm <- glm(admit ~ gre + gpa + rank, data = data_glm, family = "binomial")
# Calculate OR for specific increment step of continuous variable
or_glm(data = data_glm, model = fit_glm,
incr = list(gre = 380, gpa = 5))
predictor oddsratio CI.low (2.5 %) CI.high (97.5 %) increment
1 gre 2.364 1.054 5.396 380
2 gpa 55.712 2.229 1511.282 5
3 rank2 0.509 0.272 0.945 Indicator variable
4 rank3 0.262 0.132 0.512 Indicator variable
5 rank4 0.212 0.091 0.471 Indicator variable
Here you can simply specify the increment of your continuous variables and see the resulting odds ratios. In this example, the odds of the response admit are about 55 times higher when the predictor gpa is increased by 5.
If you want to predict probabilities with your model, simply use type = "response" when predicting. This will automatically convert log odds to probabilities. You can then calculate risk ratios from the predicted probabilities. See ?predict.glm for more details.
I found the epiDisplay package, which works fine! It might be useful for others, but note that your confidence intervals or exact results will vary according to the package used, so it is good to read the package details and choose the one that works well for your data.
Here is a sample code:
library(epiDisplay)
data(Wells, package="carData")
glm1 <- glm(switch~arsenic+distance+education+association,
family=binomial, data=Wells)
logistic.display(glm1)
The formula above for converting logits to probabilities, exp(logit)/(1+exp(logit)), may not have any meaning here. This formula is normally used to convert odds to probabilities. However, in logistic regression an odds ratio is a ratio between two odds values (which are themselves already ratios). How would probability be defined using the above formula? Instead, it may be more correct to subtract 1 from the odds ratio to obtain a percentage, and then interpret it as the odds of the outcome increasing/decreasing by x percent per unit of the predictor.

Understanding output of 'predict' in R

I'm trying to understand the output from predict(), as well as understand whether this approach is appropriate for the problem I'm trying to solve. The prediction intervals don't make sense to me, but when I plot this on a scatterplot it looks like a good model:
I created a simple linear regression model of deal size ($) with a company's sales volume as a predictor variable. The data is faked, with deal size being a multiple of sales volume plus or minus some noise:
Call:
lm(formula = deal_size ~ sales_volume, data = accounts)
Residuals:
Min 1Q Median 3Q Max
-19123502 -3794671 -3426616 4838578 17328948
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.709e+06 1.727e+05 21.48 <2e-16 ***
sales_volume 1.898e-01 2.210e-03 85.88 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 6452000 on 1586 degrees of freedom
Multiple R-squared: 0.823, Adjusted R-squared: 0.8229
F-statistic: 7376 on 1 and 1586 DF, p-value: < 2.2e-16
The predictions were generated thusly:
d = data.frame(accounts, predict(fit, interval="prediction"))
When I plot sales_volume vs. deal_size on a scatterplot, and overlay the regression line with the prediction interval, it looks good, except for a few intervals that span negative values where sales is at or near zero.
I understand fit is the predicted value, but what are lwr and upr? Do they define the intervals in absolute terms (y coordinates)? The intervals seem to be extremely wide, wider than would make sense if my model was a good fit:
sales_volume deal_size fit lwr upr
0 0 3709276.494 -8950776.04 16369329.03
0 8586337.22 3709276.494 -8950776.04 16369329.03
110000 549458.6512 3730150.811 -8929897.298 16390198.92
When you use predict with an lm model, you can specify an interval. You have three choices: none (no intervals are returned), confidence, and prediction. The last two return different values. The first column, as you said, holds the predicted values (column fit). You then have two other columns, lwr and upr, which are the lower and upper bounds of the requested interval.
What is the difference between confidence and prediction ?
confidence is a confidence interval (95% by default; use level if you wish to change that) for the mean of the predicted value. It is the green interval on your plot. prediction, by contrast, is an interval (also 95%) for individual values, meaning that if you were to repeat your experiment/survey/... a huge number of times, you could expect about 95% of your values to fall in the yellow interval. This makes it a lot wider than the green one, as the green one only concerns the mean.
And as you can see on your plot, almost all values are in the yellow interval. R doesn't know that your values can only be positive, which explains why the yellow interval "begins" below 0.
Also, when you say "The intervals seem to be extremely wide, wider than would make sense if my model was a good fit", you can see in your plot that the interval is not that big, considering that you can expect 95% of the values to be in it, and you can clearly see a trend in your data. And your model is clearly a good fit, as the adjusted R squared and the global p-value tell you.
Just a slight rephrasing of @etienne above, which is very good and accurate.
The confidence interval is the (1-alpha; e.g. 95%) interval for the mean prediction (or group response). I.e. if you have 10 new companies with a sales volume of 2e+08, predict(..., interval = "confidence") will give you the interval for the long-run average of your group mean.
With Var(\hat y | X = x^*) = \sigma^2 (1/n + (x^* - \bar x)^2 / S_{XX})
The prediction interval is the (1-alpha; e.g. 95%) interval for an individual response -- predict(..., interval = "prediction"). I.e. for a single new company with a sales volume of 2e+08...
With Var(y_{new} - \hat y | X = x^*) = \sigma^2 (1 + 1/n + (x^* - \bar x)^2 / S_{XX})
(Sorry that LaTeX isn't supported)
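The two variance formulas above can be checked numerically. The following is a minimal sketch on simulated data (not the asker's accounts data; all variable names are illustrative) that reproduces both intervals at one point x* by hand and compares them with predict().

```r
set.seed(42)
# Simulated data (illustrative; not the asker's accounts data)
n <- 200
x <- runif(n, 0, 100)
y <- 5 + 0.2 * x + rnorm(n, sd = 3)
fit <- lm(y ~ x)

new <- data.frame(x = c(0, 50, 100))
conf_int <- predict(fit, newdata = new, interval = "confidence", level = 0.95)
pred_int <- predict(fit, newdata = new, interval = "prediction", level = 0.95)

# Reproduce both intervals at x* = 50 from the variance formulas
x_star <- 50
s2 <- summary(fit)$sigma^2                  # estimate of sigma^2
sxx <- sum((x - mean(x))^2)                 # S_XX
t_crit <- qt(0.975, df = n - 2)
y_hat <- unname(coef(fit)[1] + coef(fit)[2] * x_star)
se_mean <- sqrt(s2 * (1 / n + (x_star - mean(x))^2 / sxx))
se_pred <- sqrt(s2 * (1 + 1 / n + (x_star - mean(x))^2 / sxx))
manual_ci <- c(y_hat - t_crit * se_mean, y_hat + t_crit * se_mean)
manual_pi <- c(y_hat - t_crit * se_pred, y_hat + t_crit * se_pred)

rbind(conf_int[2, ], pred_int[2, ])
```

The extra "1 +" term is the variance of a single new observation around its mean, which is why the prediction interval stays wide even when the regression line itself is estimated precisely.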

Resources