How to apply Shapiro test in R?

I'm pretty new to statistics and I need your help. I just installed the R software and I have no idea how to work with it. I have a small sample that looks as follows:
Group A : 10, 12, 14, 19, 20, 23, 34, 41, 12, 13
Group B : 8, 12, 14, 15, 15, 16, 21, 36, 14, 19
I want to apply a t-test, but before that I would like to apply the Shapiro test to know whether my sample comes from a population which has a normal distribution. I know there is a function shapiro.test(), but how can I give my numbers as input to this function?
Can I simply enter shapiro.test(10,12,14,19,20,23,34,41,12,13, 8,12, 14,15,15,16,21,36,14,19)?

OK, because I'm feeling nice, let's work through this. I am assuming you know how to run commands, etc. First up, put your data into vectors:
A = c(10, 12, 14, 19, 20, 23, 34, 41, 12, 13)
B = c(8, 12, 14, 15, 15, 16, 21, 36, 14, 19)
Let's check the help for shapiro.test().
help(shapiro.test)
In there you'll see the following:
Usage
shapiro.test(x)
Arguments
x a numeric vector of data values. Missing values are allowed, but
the number of non-missing values must be between 3 and 5000.
So, the input needs to be a vector of values. Now we know that we can run the shapiro.test() function directly with our vectors, A and B. R uses named arguments for most of its functions, and so we tell the function what we are passing in:
shapiro.test(x = A)
and the result is put to the screen:
Shapiro-Wilk normality test
data: A
W = 0.8429, p-value = 0.0478
then we can do the same for B:
shapiro.test(x = B)
which gives us
Shapiro-Wilk normality test
data: B
W = 0.8051, p-value = 0.0167
If we want, we can test A and B together, although it's hard to know whether this is a valid test. By 'valid', I mean: imagine you are pulling numbers out of a bag to get A and B. If the numbers in A get thrown back in the bag before we draw B, we've just double counted. If the numbers in A didn't get thrown back in, testing x = c(A, B) is reasonable, because all we've done is increase the size of our sample.
shapiro.test(x = c(A,B))
Do these mean that the data are normally distributed? Well, in the help we see this:
Value
...
p.value an approximate p-value for the test. This is said in Royston (1995) to be adequate for p.value < 0.1
So maybe that's good enough. But it depends on your requirements!
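Since the end goal is a t-test, here is a minimal sketch of how that next step could look; treat the choice of a two-sample Welch test (R's default) and the non-parametric fallback as assumptions to adapt to your actual hypothesis:
# Two-sample comparison of A and B (Welch's t-test, R's default; assumes
# you want a two-sided test of equal means)
t.test(x = A, y = B)
# If the Shapiro-Wilk p-values make you doubt normality, a common
# non-parametric alternative is the Wilcoxon rank-sum test
wilcox.test(x = A, y = B)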

Related

Calculating Running Percentile in R

I'm trying to calculate the percentiles from 1:i in a column. For example, for the nth data point, calculate the percentile only using the first n values.
I have tried using quantile, but can't seem to figure out how to generalize it.
mydata <- c(1, 25, 43, 2, 5, 17, 40, 15, 12, 8)
perc.fn <- function(vec, n){
  (rank(vec[1:n], na.last = TRUE) - 1) / (length(vec[1:n]) - 1)
}
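One way to generalize this, reusing perc.fn() above, is to apply it to each prefix of the data and keep only the last value; a rough sketch under that reading of the question:
# Sketch: for each n, the percentile of the nth value among the first n values.
# The first element comes out NaN because (rank - 1)/(length - 1) divides by
# zero when n = 1.
mydata <- c(1, 25, 43, 2, 5, 17, 40, 15, 12, 8)
running_pct <- sapply(seq_along(mydata), function(n) perc.fn(mydata, n)[n])
running_pct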

In R, sample from a neighborhood according to scores

I have a vector of numbers, and I would like to sample a number which is between a given position in the vector and its neighbors such that the two closest neighbors have the largest impact, and this impact is decreasing according to the distance from the reference point.
For example, lets say I have the following vector:
vec = c(15, 16, 18, 21, 24, 30, 31)
and my reference is the number 16 in position #2. I would like to sample a number that will, with high probability, be between 15 and 16 or (with the same high probability) between 16 and 18. The sampled numbers can be floats. Then, with decreasing probability, a number between 16 and 21, with a yet lower probability a number between 16 and 24, and so on.
The position of the reference is not known in advance, it can be anywhere in the vector.
I tried playing with runif and quantiles, but I'm not sure how to design the scores of the neighbors.
Specifically, I wrote the following function but I suspect there might be a better/more efficient way of doing this:
GenerateNumbers <- function(Ind, N){
  dist <- 1/abs(Ind - 1:length(N))
  dist <- dist[!is.infinite(dist)]
  dist <- dist/sum(dist)
  sum(dist) # sanity check --> 1
  V <- numeric(length(N) - 1)
  for (i in 1:(length(N) - 1)) {
    V[i] <- runif(1, N[i], N[i+1])
  }
  sample(V, 1, prob = dist)
}
where Ind is the position of the reference number (16 in this case) and N is the vector. dist is a way of weighting the probabilities so that closer neighbors have a higher impact.
Improvements upon this code would be highly appreciated!
I would go with a truncated Gaussian random sample generator, such as in the truncnorm package. On your example:
# To install it: install.packages("truncnorm")
library(truncnorm)
vec <- c(15, 16, 18, 21, 24, 30, 31)
x <- rtruncnorm(n=100, a=vec[1], b=vec[7], mean=vec[2], sd=1)
The histogram of the generated sample fulfills the given prerequisites.
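For example, a quick base-R look at that claim (the number of breaks is just a suggestion):
# Most of the sampled values should sit near mean = vec[2] = 16, with the
# spread controlled by the sd argument
hist(x, breaks = 30)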

Generate samples from frequency table with fixed values

I have a population 2x2 frequency table with specific values: 20, 37, 37, 20. I need to generate N samples from this population (for simulation purposes).
How can I do it in R?
Try this. In the example, the integers represent cell 1, 2, 3, and 4 of the 2x2 table. As you can see the relative frequencies closely resemble those in your 20, 37, 37, 20 table.
probs<-c(20, 37, 37, 20)
N<-1000 #sample size
mysample<-sample(x=c(1,2,3,4), size=N, replace = TRUE, prob = probs/sum(probs))
table(mysample)/N
#Run Again for 100,000 samples
N<-100000
mysample<-sample(x=c(1,2,3,4), size=N, replace = TRUE, prob = probs/sum(probs))
#The relative probabilities should be similar to those in the original table
table(mysample)/N
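If each simulated sample should instead be its own 2x2 table of counts rather than a stream of cell labels (an assumption about the simulation setup), rmultinom() can draw the four cell counts directly; a sketch:
# Draw n_tables simulated tables, each with the same total count as the
# original table (20 + 37 + 37 + 20 = 114); the per-table total is an assumption
probs <- c(20, 37, 37, 20)
n_tables <- 1000
sims <- rmultinom(n_tables, size = sum(probs), prob = probs/sum(probs))
# Each column of 'sims' holds the four cell counts of one simulated table;
# reshape the first one back into 2x2 form
matrix(sims[, 1], nrow = 2)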

Calculating cumulative hypergeometric distribution

Suppose I have 100 marbles, and 8 of them are red. I draw 30 marbles, and I want to know what's the probability that at least five of the marbles are red. I am currently using http://stattrek.com/online-calculator/hypergeometric.aspx and I entered 100, 8, 30, and 5 for population size, number of success, sample size, and number of success in sample, respectively. So the probability I'm interested in is Cumulative Probability: $P(X \geq 5)$ which = 0.050 in this case. My question is, how do I calculate this in R?
I tried
> 1-phyper(5, 8, 92, 30, lower.tail = TRUE)
[1] 0.008503108
But this is very different from the previous answer.
phyper(5, 8, 92, 30) gives the probability of drawing five or fewer red marbles.
1 - phyper(5, 8, 92, 30) thus returns the probability of getting six or more red marbles.
Since you want the probability of getting five or more (i.e. more than 4) red marbles, you should use one of the following:
1 - phyper(4, 8, 92, 30)
[1] 0.05042297
phyper(4, 8, 92, 30, lower.tail=FALSE)
[1] 0.05042297
Why use 1 - phyper(..., lower.tail = TRUE) when it is easier to use phyper(..., lower.tail = FALSE)? Even though they are mathematically equivalent, there are numerical reasons for preferring the latter: when the upper-tail probability is very small, computing it as 1 minus a number very close to 1 loses floating-point precision.
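If you want to double-check that tail probability, summing the point probabilities with dhyper() gives the same value:
# Cross-check: P(X >= 5) as the sum of the point probabilities for 5 to 8 red marbles
sum(dhyper(5:8, m = 8, n = 92, k = 30))
# [1] 0.05042297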
Does that fix your problem? I believe you are putting the correct inputs into the phyper function. Is it possible that you're looking at the wrong output in that web site you linked?

Forecast Mean and Standard Deviation

Apologies if this is a bit of a simple question, but I haven't been able to find any answer to this over the past week and it's driving me crazy.
Background Info: I have a dataset that tracks the weight of 5 individuals over 5 years. Each year, I have a distribution for the weight of individuals in the group, from which I calculate the mean and standard deviation. Data is as follows:
Year = [2002,2003,2004,2005,2006]
Weights_2002 = [12, 14, 16, 18, 20]
Weights_2003 = [14, 16, 18, 20,20]
Weights_2004 = [16, 18, 20, 22, 18]
Weights_2005 = [18, 21, 22, 22, 20]
Weights_2006 = [2, 21, 19, 20, 20]
The Question: How do I project annual distributions of weight for the group the next 10 years? Ideally, I would like the uncertainty about the mean to increase as time goes on. Likewise, I would like the uncertainty about the standard deviation to increase too. Phrased another way, I would like to project the distributions of weight going forward, accounting for both:
Natural Variance in the Data
Increasing uncertainty.
Any help would be greatly, greatly appreciated. If anyone can suggest how to do this in R, that would be even better.
Thanks guys!
Absent specific suggestions on how to use the forecasting tools in R, viz. the comments to your question, here is an alternative approach that uses Monte Carlo simulation.
First, some housekeeping: the value 2 in Weights_2006 is either a typo or an outlier. Since I can't tell which, I will assume it's an outlier and exclude it from the analysis.
Second, you say you want to project the distributions based on increasing uncertainty. But your data doesn't support that.
Year <- c(2002,2003,2004,2005,2006)
W2 <- c(12, 14, 16, 18, 20)
W3 <- c(14, 16, 18, 20,20)
W4 <- c(16, 18, 20, 22, 18)
W5 <- c(18, 21, 22, 22, 20)
W6 <- c(NA, 21, 19, 20, 20)
df <- rbind(W2,W3,W4,W5,W6)
df <- data.frame(Year,df)
library(reshape2) # for melt(...)
library(ggplot2)
data <- melt(df,id="Year", variable.name="Individual",value.name="Weight")
ggplot(data) +
  geom_histogram(aes(x=Weight), binwidth=1, fill="lightgreen", colour="grey50") +
  facet_grid(Year~.)
The mean weight goes up over time, but the variance decreases. A look at the individual time series shows why.
ggplot(data, aes(x=Year, y=Weight, color=Individual))+geom_line()
In general, an individual's weight increases linearly with time (about 2 units per year), until it reaches 20, when it stops increasing but fluctuates. Since your initial distribution was uniform, the individuals with lower weight saw an increase over time, driving the mean up. But the weight of heavier individuals stopped growing. So the distribution gets "bunched up" around 20, resulting in a decreasing variance. We can see this in the numbers: increasing mean, decreasing standard deviation.
smry <- function(x)c(mean=mean(x),sd=sd(x))
aggregate(Weight~Year,data,smry)
# Year Weight.mean Weight.sd
# 1 2002 16.0000000 3.1622777
# 2 2003 17.6000000 2.6076810
# 3 2004 18.8000000 2.2803509
# 4 2005 20.6000000 1.6733201
# 5 2006 20.0000000 0.8164966
We can model this behavior using a Monte Carlo simulation.
set.seed(1)
start <- runif(1000, 12, 20)
X <- start
result <- X
for (i in 2003:2008){
  X <- X + 2
  X <- ifelse(X < 20, X, 20) + rnorm(length(X))
  result <- rbind(result, X)
}
result <- data.frame(Year=2002:2008, result)
In this model, we start with 1000 individuals whose weight forms a uniform distribution between 12 and 20, as in your data. At each time step we increase the weights by 2 units. If the result is >20 we clip it to 20. Then we add random noise distributed as N[0,1]. Now we can plot the distributions.
model <- melt(result,id="Year",variable.name="Individual",value.name="Weight")
ggplot(model, aes(x=Weight)) +
  geom_histogram(aes(y=..density..), fill="lightgreen", colour="grey50", bins=20) +
  stat_density(geom="line", colour="blue") +
  geom_vline(data=aggregate(Weight~Year, model, mean), aes(xintercept=Weight),
             colour="red", size=2, linetype=2) +
  facet_grid(Year~., scales="free")
The red dashed vertical lines show the mean weight in each year.
If you believe that the natural variation in the weight of an individual increases over time, then use N[0,sigma] as the error term in the model, with sigma increasing with Year. The problem is that there is nothing in your data to support that.
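If you do want to build that assumption in anyway, a minimal tweak to the simulation above is to let the noise standard deviation grow with the year; the growth rate of 0.25 per year below is an arbitrary assumption:
# Sketch: same Monte Carlo model, but with the noise sd increasing each year
set.seed(1)
X <- runif(1000, 12, 20)
result <- X
sigma <- 1
for (i in 2003:2008){
  sigma <- sigma + 0.25  # assumed linear growth in uncertainty
  X <- ifelse(X + 2 < 20, X + 2, 20) + rnorm(length(X), sd = sigma)
  result <- rbind(result, X)
}
result <- data.frame(Year = 2002:2008, result)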
