normality test for all levels of variable in R - r

I want to test the normality for each group or level of a continuous variable before doing an ANOVA. Specifically, it's age vs different religious preferences. The dummies page (http://www.dummies.com/how-to/content/how-to-test-data-normality-in-a-formal-way-in-r.html) mentions a way to do it with shapiro and tapply, but I think there's an error in the code. Anybody know if this should work?

you are mentionning a tutorial page where a shapiro.test is done across a grouping variable. THe code does have a little error:
with(beaver, tapply(temp, activ, shapiro.test)
it should be:
with(beaver2, tapply(temp, activ, shapiro.test))
If your data exceeds 5000 observations, you can take a random sample of it, but I couldn't say if it is statistically valid:
with(beaver2, tapply(temp, activ, function(x) shapiro.test(sample(x, min(NROW(x), 5000)))))
I suggest checking the Cross-Validated site to get an answer on that subject, for instance here
However,
Concerning the remark of Nicola mentionning the erroneous interpretation of a normality p-value, i completely agree. Let's take an example, let's create a distribution that only returns these three values: 1, 2, and 3 with 1/3 probability each. It's nothing like a normal distribution.
However, a shapiro.test indicates a p-value of 1 for this data:
shapiro.test(c(2,3,1))
#### Shapiro-Wilk normality test
#### data: c(1, 2, 3)
#### W = 1, p-value = 1

Related

ANOVA (AOV function) in R: Misleading p_value reported on equal values

I would greatly appreciate any guidance on the following: I am running ANOVA (aov) to retrieve p_value s for a number of subsets of a larger data set. So I kind of bumped into a subset where my numeric variables/values are equally 36. Because it is a part of a loop ANOVA is still executed along with reporting an seemingly infinitely small p_value 1.2855e-134--> Correct me if I am wrong but the smaller the p_value the higher the probability that the difference between the factors is significantly different?
For simplicity this is the subset:
sUBSET_FOR_ANOVA
Here is how I calculate ANOVA and retrieve p_value, where TEMP_DF2 is just the subset you see attached:
#
anova_sweep <- aov(TEMP_DF2$GOOD_PTS~TEMP_DF2$MACH,data = TEMP_DF2)
p_value <- summary(anova_sweep)[[1]][["Pr(>F)"]]
p_value <- p_value[1]
#
Many thanks for any guidance,
I can't replicate your findings. Let's produce an example dataset with all values being 36:
df <- data.frame(gr = rep(letters[1:2], 100),
y = 36)
summary(aov(y~gr, data = df))
Gives:
Df Sum Sq Mean Sq F value Pr(>F)
gr 1 1.260e-27 1.262e-27 1 0.319
Residuals 198 2.499e-25 1.262e-27
Basically, depending on the sample size, we obtain a p-value around 0.3 or so. The F statistic is (by definition) always 1, since the between and within group variances are equal.
Are there results misleading? To some extent, yes. The estimated SS and MS should be 0, aov calculates them as very very small. Some other statistical tests in R and in some packages check for zero variance and would produce an error, but aov apparently does not.
However, more importantly, I would say your data is violating the assumptions of the ANOVA and therefore any result cannot be trusted to base conclusion on. The expectation in R when it comes to statistical tests is usually that it is upon the user to employ the tests in the correct circumstances.

R strucchange bootstrap test statistic due to nonspherical disturbances

I am trying to find a structural break in the mean of a time-series that is skewed, fat-tailed, and heteroskedastic. I apply the Andrews(1993) supF-test via the strucchange package. My understanding is that this is valid even with my nonspherical disturbances. But I would like to confirm this via bootstrapping. I would like to estimate the max t-stat from a difference in mean test at each possible breakpoint (just like the Andrews F-stat) and then bootstrap the critical value. In other words, I want to find my max t-stat in the time-ordered data. Then scramble the data and find the max t-stat in the scrambled data, 10,000 times. Then compare the max t-stat from the time-ordered data to a critical value given by the rank 9,500 max t-stat from the unordered data. Below I generate example data and apply the Andrews supF-test. Is there any way to "correct" the Andrews test for nonspherical disturbances? Is there any way to do the bootstrap I am trying to do?
library(strucchange)
Thames <- ts(matrix(c(rlnorm(120, 0, 1), rlnorm(120, 2, 2), rlnorm(120, 4, 1)), ncol = 1), frequency = 12, start = c(1985, 1))
fs.thames <- Fstats(Thames ~ 1)
sctest(fs.thames)
I'm adding a second answer to analyze the simulated Thames data provided.
Regarding the points from my first general methodological answer: (1) In this case, a log() transformation is clearly appropriate to deal with the extreme skewness of the observations. (2) As the data are heteroscedastic, the inference should be based on HC or HAC covariances. Below I employ the Newey-West HAC estimator, although the data are just heteroscedastic but not autocorrelated. The HAC-corrected inference affects the supF test and the confidence intervals for the breakpoint estimates. The breakpoints themselves and the corresponding segment-specific intercepts are estimated by OLS, i.e., treating the heteroscedasticity as a nuisance term. (3) I did not add any bootstrap or permutation inference as the asymptotic inference appears to be convincing enough in this case.
First, we simulate the data using a particular seed. (Note that other seeds may not lead to such clear-cut breakpoint estimates when analyzing the series in levels.)
library("strucchange")
set.seed(12)
Thames <- ts(c(rlnorm(120, 0, 1), rlnorm(120, 2, 2), rlnorm(120, 4, 1)),
frequency = 12, start = c(1985, 1))
Then we compute the sequence of HAC-corrected Wald/F statistics and estimate the optimal breakpoints (for m = 1, 2, 3, ... breaks) via OLS. To illustrate how much better this works for the series in logs rather than in levels, both versions are shown.
fs_lev <- Fstats(Thames ~ 1, vcov = NeweyWest)
fs_log <- Fstats(log(Thames) ~ 1, vcov = NeweyWest)
bp_lev <- breakpoints(Thames ~ 1)
bp_log <- breakpoints(log(Thames) ~ 1)
The visualization below shows the time series with the fitted intercepts in the first row, the sequence of Wald/F statistics with the 5% critical value of the supF test in the second row, and the residual sum of squares and BIC for the selection of the number of breakpoints in the last row. The code to replicate the graphic is at the end of this answer.
Both supF tests are clearly significant but in levels (sctest(fs_lev)) the test statistic is "only" 82.79 while in logs (sctest(fs_log)) it is 282.46. Also, the two peaks pertaining to the two breakpoints can be seen much better when analyzing the data in logs.
Similarly, the breakpoint estimates are somewhat better and the confidence intervals much narrower for the log-transformed data. In levels, we get:
confint(bp_lev, breaks = 2, vcov = NeweyWest)
##
## Confidence intervals for breakpoints
## of optimal 3-segment partition:
##
## Call:
## confint.breakpointsfull(object = bp_lev, breaks = 2, vcov. = NeweyWest)
##
## Breakpoints at observation number:
## 2.5 % breakpoints 97.5 %
## 1 NA 125 NA
## 2 202 242 263
plus an error message and warnings which all reflect that the asymptotic inference is not a useful approximation here. In contrast, the confidence intervals are quite reasonable for the analysis in logs. Due to the increased variance in the middle segment, its start and end are somewhat more uncertain than for the first and last segment:
confint(bp_log, breaks = 2, vcov = NeweyWest)
##
## Confidence intervals for breakpoints
## of optimal 3-segment partition:
##
## Call:
## confint.breakpointsfull(object = bp_log, breaks = 2, vcov. = NeweyWest)
##
## Breakpoints at observation number:
## 2.5 % breakpoints 97.5 %
## 1 107 119 121
## 2 238 240 250
##
## Corresponding to breakdates:
## 2.5 % breakpoints 97.5 %
## 1 1993(11) 1994(11) 1995(1)
## 2 2004(10) 2004(12) 2005(10)
Finally, the replication code for the figure above is included here. The confidence intervals for the breakpoints in levels are cannot added in the graphic due to the error mentioned above. Hence, only the log-transformed series also has the confidence intervals.
par(mfrow = c(3, 2))
plot(Thames, main = "Thames")
lines(fitted(bp_lev, breaks = 2), col = 4, lwd = 2)
plot(log(Thames), main = "log(Thames)")
lines(fitted(bp_log, breaks = 2), col = 4, lwd = 2)
lines(confint(bp_log, breaks = 2, vcov = NeweyWest))
plot(fs_lev, main = "supF test")
plot(fs_log, main = "supF test")
plot(bp_lev)
plot(bp_log)
(1) Skewness and heavy tails. As usual in linear regression models, the asymptotic justification for the inference does not depend on normality and also holds for any other error distribution given zero expectation, homoscedasticity, and lack of correlation (the usual Gauss-Markov assumptions). However, if you have a well-fitting skewed distribution for your data of interest, then you might be able to increase efficiency by basing your inference on the corresponding model. For example, the glogis package provides some functions for structural change testing and dating based on a generalized logistic distribution that allows for heavy tails and skewness. Windberger & Zeileis (2014, Eastern European Economics, 52, 66–88, doi:10.2753/EEE0012-8775520304) used this to track changes in skewness of inflation dynamics over time. (See ?breakpoints.glogisfit for a worked example.) Furthermore, if the skewness itself is not really of interest then a log or sqrt transformation might also be good enough to make the data more "normal".
(2) Heteroscedasticity and autocorrelation. As usual in linear regression models, the standard errors (or more broadly the covariance matrix) is not consistent in the presence of heteroscedasticity and/or autocorrelation. One can either try to include this explicitly in the model (e.g., an AR model) or treat it as a nuisance term and employ heteroscedasticity and autocorrelation consistent (HAC) covariance matrices (e.g., Newey-West or Andrews' quadratic spectral kernal HAC). The function Fstats() in strucchange allows to plug in such estimators, e.g., from the sandwich package. See ?durab for an example using vcovHC().
(3) Bootstrap and permutation p-values. The "scrambling" of the time series you describe above sounds more like applying permutations (i.e., sampling without replacement) rather than bootstrap (i.e., sampling with replacement). The former is feasible if the errors are uncorrelated or exchangeable. If you are regressing just on a constant, then you can employ the function maxstat_test() from the coin package to carry out the supF test. The test statistic is computed in a somewhat different way, however, this can be shown to be equivalent to the supF test in the constant-only case (see Zeileis & Hothorn, 2013, Statistical Papers, 54, 931–954, doi:10.1007/s00362-013-0503-4). If you want to perform the permutation test in a more general model, then you would have to do the permutations "by hand" and simply store the test statistic from each permutation. Alternatively, the bootstrap can be applied, e.g., via the boot package (where you would still need to write your own small function that computes the test statistic from a given bootstrap sample). There are also some R packages (e.g., tseries) that implement bootstrap schemes for dependent series.

Perform a Shapiro-Wilk Normality Test

I want to perform a Shapiro-Wilk Normality Test test. My data is csv format. It looks like this:
heisenberg
HWWIchg
1 -15.60
2 -21.60
3 -19.50
4 -19.10
5 -20.90
6 -20.70
7 -19.30
8 -18.30
9 -15.10
However, when I perform the test, I get:
shapiro.test(heisenberg)
Error in [.data.frame(x, complete.cases(x)) :
undefined columns selected
Why isnt`t R selecting the right column and how do I do that?
What does shapiro.test do?
shapiro.test tests the Null hypothesis that "the samples come from a Normal distribution" against the alternative hypothesis "the samples do not come from a Normal distribution".
How to perform shapiro.test in R?
The R help page for ?shapiro.test gives,
x - a numeric vector of data values. Missing values are allowed,
but the number of non-missing values must be between 3 and 5000.
That is, shapiro.test expects a numeric vector as input, that corresponds to the sample you would like to test and it is the only input required. Since you've a data.frame, you'll have to pass the desired column as input to the function as follows:
> shapiro.test(heisenberg$HWWIchg)
# Shapiro-Wilk normality test
# data: heisenberg$HWWIchg
# W = 0.9001, p-value = 0.2528
Interpreting results from shapiro.test:
First, I strongly suggest you read this excellent answer from Ian Fellows on testing for normality.
As shown above, the shapiro.test tests the NULL hypothesis that the samples came from a Normal distribution. This means that if your p-value <= 0.05, then you would reject the NULL hypothesis that the samples came from a Normal distribution. As Ian Fellows nicely put it, you are testing against the assumption of Normality". In other words (correct me if I am wrong), it would be much better if one tests the NULL hypothesis that the samples do not come from a Normal distribution. Why? Because, rejecting a NULL hypothesis is not the same as accepting the alternative hypothesis.
In case of the null hypothesis of shapiro.test, a p-value <= 0.05 would reject the null hypothesis that the samples come from normal distribution. To put it loosely, there is a rare chance that the samples came from a normal distribution. The side-effect of this hypothesis testing is that this rare chance happens very rarely. To illustrate, take for example:
set.seed(450)
x <- runif(50, min=2, max=4)
shapiro.test(x)
# Shapiro-Wilk normality test
# data: runif(50, min = 2, max = 4)
# W = 0.9601, p-value = 0.08995
So, this (particular) sample runif(50, min=2, max=4) comes from a normal distribution according to this test. What I am trying to say is that, there are many many cases under which the "extreme" requirements (p < 0.05) are not satisfied which leads to acceptance of "NULL hypothesis" most of the times, which might be misleading.
Another issue I'd like to quote here from #PaulHiemstra from under comments about the effects on large sample size:
An additional issue with the Shapiro-Wilk's test is that when you feed it more data, the chances of the null hypothesis being rejected becomes larger. So what happens is that for large amounts of data even very small deviations from normality can be detected, leading to rejection of the null hypothesis event though for practical purposes the data is more than normal enough.
Although he also points out that R's data size limit protects this a bit:
Luckily shapiro.test protects the user from the above described effect by limiting the data size to 5000.
If the NULL hypothesis were the opposite, meaning, the samples do not come from a normal distribution, and you get a p-value < 0.05, then you conclude that it is very rare that these samples do not come from a normal distribution (reject the NULL hypothesis). That loosely translates to: It is highly likely that the samples are normally distributed (although some statisticians may not like this way of interpreting). I believe this is what Ian Fellows also tried to explain in his post. Please correct me if I've gotten something wrong!
#PaulHiemstra also comments about practical situations (example regression) when one comes across this problem of testing for normality:
In practice, if an analysis assumes normality, e.g. lm, I would not do this Shapiro-Wilk's test, but do the analysis and look at diagnostic plots of the outcome of the analysis to judge whether any assumptions of the analysis where violated too much. For linear regression using lm this is done by looking at some of the diagnostic plots you get using plot(lm()). Statistics is not a series of steps that cough up a few numbers (hey p < 0.05!) but requires a lot of experience and skill in judging how to analysis your data correctly.
Here, I find the reply from Ian Fellows to Ben Bolker's comment under the same question already linked above equally (if not more) informative:
For linear regression,
Don't worry much about normality. The CLT takes over quickly and if you have all but the smallest sample sizes and an even remotely reasonable looking histogram you are fine.
Worry about unequal variances (heteroskedasticity). I worry about this to the point of (almost) using HCCM tests by default. A scale location plot will give some idea of whether this is broken, but not always. Also, there is no a priori reason to assume equal variances in most cases.
Outliers. A cooks distance of > 1 is reasonable cause for concern.
Those are my thoughts (FWIW).
Hope this clears things up a bit.
You are applying shapiro.test() to a data.frame instead of the column. Try the following:
shapiro.test(heisenberg$HWWIchg)
You failed to specify the exact columns (data) to test for normality.
Use this instead
shapiro.test(heisenberg$HWWIchg)
Set the data as a vector and then place in the function.

How to get bootstrapped p-values and bootstrapped t-values and how does the function boot() work?

I would like to get the bootstrapped t-value and the bootstrapped p-value of a lm.
I have the following code (basically copied from a paper) which works.
# First of all you need the following packages
install.packages("car")
install.packages("MASS")
install.packages("boot")
library("car")
library("MASS")
library("boot")
boot.function <- function(data, indices){
data <- data[indices,]
mod <- lm(prestige ~ income + education, data=data) # the liear model
# the first element of the following vector contains the t-value
# and the second element is the p-value
c(summary(mod)[["coefficients"]][2,3], summary(mod)[["coefficients"]][2,4])
}
Now, I compute the bootstrapping model, which gives me the following:
duncan.boot <- boot(Duncan, boot.function, 1999)
duncan.boot
ORDINARY NONPARAMETRIC BOOTSTRAP
Call:
boot(data = Duncan, statistic = boot.function, R = 1999)
Bootstrap Statistics :
original bias std. error
t1* 5.003310e+00 0.288746545 1.71684664
t2* 1.053184e-05 0.002701685 0.01642399
I have two questions:
My understanding is that the bootsrapped value is the original plus the bias, which means that both bootstrapped values (the bootstrapped t-value as well as the bootstrapped p-value) are greater than the original values. This in turn is not possible, because if the t-value rises (which means more significance) the p-values MUST be lower, right? Therefore I think that I have not yet really understood the output of the boot function (here: duncan.boot). How do I compute the bootstrapped values?
I do not understand how the boot() works. If you look at duncan.boot <- boot(Duncan, boot.function, 1999) you see that I have not passed any arguments for the function "boot.function". I suppose that R sets data <- Duncan. But since I have not passed anything for the argument "indices", I do not understand how the following line in the function "boot.function" works data <- data[indices,]
I hope the questions make sense!??
The boot function is "expecting" to get a function that has two arguments: the first being a data.frame and the second being an "indices" vector (possibly with duplicate entries and probably not using all the indices) to use in selecting rows and probably having some duplicate or triplicates.) It then samples with replacement determined by the pattern of duplicates and triplicates from the original dataframe (multiple times determined by "R" with different "choice sets"), passes those to the indices argument in the boot.function, and then collects the results of the R number of function applications.
Regarding what is reported by the print method for boot objects, take a look at this (done after examining the returned object with str()
> duncan.boot$t0
[1] 5.003310e+00 1.053184e-05
> apply(duncan.boot$t, 2, mean)
[1] 5.342895220 0.002607943
> apply(duncan.boot$t, 2, mean) - duncan.boot$t0
[1] 0.339585441 0.002597411
It becomes more obvious that the T0 value is from the original data while the bias is the difference between the mean of the boot()-ed values and the T0 values. I don't think it makes a lot of sense to be asking why p-values based on parametric considerations are increasing in association with an increase in estimated t-statistics. You are really in two disparate regions of statistical thought when you do that. I would have interpreted the increase in p-values as an effect of the sampling process, which does not take into account the Normal distribution assumptions. It is simply saying something about the sampling distribution of the p-value (which is really just another sample statistic).
(Comment: The sourcebook used at the time of R development was Davison and Hinkley's "Bootstrap Methods and their Applications". I'm no claiming any support for my answer above, but I thought to put it in as a reference after Hagen Brenner asked about sampling with two indices in the comments below. There are many unexpected aspects of bootstrapping that arise after one goes beyond the simple parametric estimation and I would first turn to that reference if I were tackling more complex sampling situations.)

Significance testing in R, determining if the proportion in one column is significantly different from the other column within the single variable

I'm sure this is an easy command in R, but for some reason, I'm having trouble finding a solution.
I'm trying to run a bunch of crosstabs (using the table() command) in R, and each tab has two columns (treatment and no treatment). I would like to know if the difference between the columns are significantly different for each other for all rows (the rows are a handful of answer choices from a survey). I'm not interested in overall significance, only within the crosstab comparing treatment vs. no treatment.
This type of analysis is very easy in SPSS (link below to illustrate what I'm talking about), but I can't seem to get it working in R. Do you know I can do this?
http://help.vovici.net/robohelp/robohelp/server/general/projects_fhpro/survey_workbench_MX/Significance_testing.htm
EDITED:
Here is an example of in R about what I mean:
treatmentVar <-c(0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1) # treatment is 1 or 0
question1 <-c(1,2,2,3,1,1,2,2,3,1,1,2,2,3,1,3) #choices available are 1, 2, or 3
Questiontab <- table(question1, treatmentVar)
Questiontab
I have tables like this ^ (percentaged by column on the treatmentVar), and I would like to see if there is a significant difference between each question choice (rows) going from treatment 0 to treatment 1. So in the example above, I would want to know if there is a significant difference between 4 and 2 (row 1), 3 and 3 (row 2), and 1 and 3 (row 3). So in this example, the choices for question1 might be significantly difference for choices 1 and 3 (because the difference is 2) but the difference for choice 2 isn't because the difference is zero. Ultimately, I'm trying to determine this type of significance. I hope that helps.
Thanks!
I think the function you're looking for is pairwise.prop.test(). See ?pairwise.prop.test for an example.
Using your example, either the chisq.test or prop.test (equivalent in this case):
> chisq.test(Questiontab)
Pearson's Chi-squared test
data: Questiontab
X-squared = 1.6667, df = 2, p-value = 0.4346
Warning message:
In chisq.test(Questiontab) : Chi-squared approximation may be incorrect
> prop.test(Questiontab)
3-sample test for equality of proportions without continuity
correction
data: Questiontab
X-squared = 1.6667, df = 2, p-value = 0.4346
alternative hypothesis: two.sided
sample estimates:
prop 1 prop 2 prop 3
0.6666667 0.5000000 0.2500000
Warning message:
In prop.test(Questiontab) : Chi-squared approximation may be incorrect
Note the warning; these tests aren't necessarily appropriate for such small numbers.

Resources