R: Fitting Chi-squared distribution with large x range - r

It is quite easy to get a good fit of a chi-squared distribution for a limited range:
library(MASS)
nnn <- 1000
set.seed(101)
chii <- rchisq(nnn,4, ncp = 0) ## Generating a chi-sq distribution
chi_df <- fitdistr(chii,"chi-squared",start=list(df=3),method="BFGS") ## Fitting
chi_k <- chi_df[[1]][1] ## Degrees of freedom
chi_hist <- hist(chii,breaks=50,freq=FALSE) ## PLotting the histogram
curve(dchisq(x,df=chi_k),add=TRUE,col="green",lwd=3) ## Plotting the line
However, assume I have a data set where the distribution is spread out over the X-axis, and its new values are instead given by something like:
chii <- 5*rchisq(nnn,4, ncp = 0)
Without knowing this multiplicative factor 5 for a real data set, how do I normalize the rchisq()/ complex data to get a good fit with fitdistr()?
Thanks in advance for your help!

You will have to loop across degrees of freedom to find the best fit for your data. First you probably know that the mean of the chi-squared distribution is the degree of freedom, let's use that to adjust your data and solve your problem.
In summary you loop across possible degrees of freedom to find the one that best fits your adjusted data.
library(MASS)
nnn <- 1000
set.seed(101)
x <- round(runif(1,1,100)) # generate a random multiplier
chii <- x*rchisq(nnn,4, ncp = 0) ## Generating a shifted chi-sq distribution
max_df <- 100 # max degree of freedom to test (here from 1 to 100)
chi_df_disp <- rep(NA,max_df)
# loop across degree of freedom
for (i in 1:max_df) {
chii_adjusted <- (chii/mean(chii))*i # Adjust the chi-sq distribution so that the mean matches the tested degree of freedom
chi_fit <- fitdistr(chii_adjusted,"chi-squared",start=list(df=i),method="BFGS") ## Fitting
chi_df_disp[i] <- chi_fit$estimate/i # This is going to give you the dispersion between the fitted df and the tested df
}
# Find the value with the smallest dispersion (i.e. the best match between the estimated df and the tested df)
real_df <- which.min(abs(chi_df_disp-1))
print(real_df) # print the real degree of freedom after correction
Now you can use the "real" degree of freedom to adjust you chi-squared distribution and plot the theoretical distribution line.
chii_adjusted <- (chii/mean(chii))*real_df
chi_hist <- hist(chii_adjusted,breaks=50,freq=FALSE) ## PLotting the histogram
curve(dchisq(x,df=real_df),add=TRUE,col="green",lwd=3) ## Plotting the line

Related

Test for Poisson residuals in the analysis of variance model

I try to find any way for test Poisson residuals like normals in aov(). In my hypothetical example:
# For normal distribution
x <- rep(seq(from=10, to=50, by=0.5),6)
y1 <- rnorm(length(x), mean=10, sd=1.5)
#Normality test in aov residuals
y1.av<-aov(y1 ~ x)
shapiro.test(y1.av$res)
# Shapiro-Wilk normality test
#
#data: y1.av$res
#W = 0.99782, p-value = 0.7885
Sounds silly, OK!!
Now, I'll like to make a same approche but for Poisson distribution:
# For Poisson distribution
x <- rep(seq(from=10, to=50, by=0.5),6)
y2 <- rpois(x, lambda=10)
#Normality test in aov residuals
y2.av<-aov(y2 ~ x)
poisson.test(y2.av$res)
Error in poisson.test(y2.av$res) :
'x' must be finite, nonnegative, and integer
There is any stat approach for make this?
Thanks!
You could analyse your data below a counting context. Discrete data, such as variables of Poisson nature, can be analysed based on observed frequencies. You can formulate hypothesis testing for this task. Being your data y you can contrast the null hypothesis that y follows a Poisson distribution with some parameter lambda against the alternative hypothesis that y does not come from the Poisson distribution. Let's sketch the test with you data:
#Data
set.seed(123)
# For Poisson distribution
x <- rep(seq(from=10, to=50, by=0.5),6)
y2 <- rpois(x, lambda=10)
Now we obtain the counts, which are elemental for the test:
#Values
df <- as.data.frame(table(y2),stringsAsFactors = F)
df$y2 <- as.integer(df$y2)
After that we must separate the observed values O and its groups or categories classes. Both elements constitute the y variable:
#Observed values
O <- df$Freq
#Groups
classes <- df$y2
As we are testing a Poisson distribution, we must compute the lambda parameter. This can be obtained with Maximum Likelihood Estimation (MLE). The MLE for Poisson is the mean (considering we have counts and groups in order to determine this value), so we compute it with next code:
#MLE
meanval <- sum(O*classes)/sum(O)
Now, we have to get the probabilities of each class:
#Probs
prob <- dpois(classes,meanval)
Poisson distribution can go to infinite values, so we must compute the probability for the values that can be greater than our last group in order to have probabilities that sum to one:
prhs <- 1-sum(prob)
This probability can be easily added to the last value of our group in order to transform to account for values greater or equal to it (For example, instead of only having the probability that y equals to 20 we can have the probability that y is greater or equal to 20):
#Add probability
prob[length(prob)]<-prob[length(prob)]+prhs
With this we can conduct a goodness of fit test using chisq.test() function in R. It requires the observed values O and the probabilities prob that we have computed. Just a reminder that this test uses to set wrong degrees of freedom, so we can correct it by the formulation of the test that uses k-q-1 degrees. Where k is the number of groups and q is the number of parameters computed (we have computed one parameter with MLE). Next the test:
chisq.test(O,p=prob)
The output:
Chi-squared test for given probabilities
data: O
X-squared = 7.6692, df = 17, p-value = 0.9731
The key value from the test is the X-squared value which is the test statistic. We can reuse the value to obtain the real p-value (In our example, we have k=18 and minus 2, the degrees of freedom are 16).
The p.value can be obtained with next code:
p.value <- 1-pchisq(7.6692, 16)
The output:
[1] 0.9581098
As this value is not greater that known significance levels we do not reject the null hypothesis and we can affirm that y comes from a Poisson distribution.

Parameters and AUC and IC50 of a dose response curve

I have these dose response data:
df <- data.frame(viability=c(14,81,58,78,71,83,64,16,32,100,100,81,86,83,100,90,15,100,38,100,91,84,92,100),
dose=c(10,0.62,2.5,0.16,0.039,0.0024,0.0098,0.00061,10,0.62,2.5,0.16,0.039,0.0024,0.0098,0.00061,10,0.62,2.5,0.16,0.039,0.0024,0.0098,0.00061),
stringsAsFactors=F)
I then use the drc package's drm function to fit a log-logistic curve to these data:
library(drc)
fit <- drm(viability~dose,data=df,fct=LL.4(names=c("slope","low","high","ED50")),type="continuous")
> summary(fit)
Model fitted: Log-logistic (ED50 as parameter) (4 parms)
Parameter estimates:
Estimate Std. Error t-value p-value
slope:(Intercept) 5.15328 18.07742 0.28507 0.7785
low:(Intercept) 20.19430 12.61122 1.60130 0.1250
high:(Intercept) 83.33181 4.96736 16.77586 0.0000
ED50:(Intercept) 2.98733 1.99685 1.49602 0.1503
Residual standard error:
21.0743 (20 degrees of freedom)
I then generate predictions so I'll be able to plot the curve:
pred.df <- expand.grid(dose=exp(seq(log(max(df$dose)),log(min(df$dose)),length=100)))
pred <- predict(fit,newdata=pred.df,interval="confidence")
pred.df$viability <- pmax(pred[,1],0)
pred.df$viability <- pmin(pred.df$viability,100)
pred.df$viability.low <- pmax(pred[,2],0)
pred.df$viability.low <- pmin(pred.df$viability.low,100)
pred.df$viability.high <- pmax(pred[,3],0)
pred.df$viability.high <- pmin(pred.df$viability.high,100)
I also use the PharmacoGx Bioconductor package to compute AUC and IC50 for both the curve and its high and low bounds:
library(PharmacoGx)
auc.mid <- computeAUC(rev(pred.df$dose),rev(pred.df$viability))/((max(pred.df$viability)-min(pred.df$viability))*(max(pred.df$dose)-min(pred.df$dose)))
auc.low <- computeAUC(rev(pred.df$dose),rev(pred.df$viability.low))/((max(pred.df$viability.low)-min(pred.df$viability.low))*(max(pred.df)-min(pred.df$dose)))
auc.high <- computeAUC(rev(pred.df$dose),rev(pred.df$viability.high))/((max(pred.df$viability.high)-min(pred.df$viability.high))*(max(pred.df$dose)-min(pred.df$dose)))
ic50.mid <- computeIC50(rev(pred.df$dose),rev(pred.df$viability))
ic50.low <- computeIC50(rev(pred.df$dose),rev(pred.df$viability.low))
ic50.high <- computeIC50(rev(pred.df$dose),rev(pred.df$viability.high))
Ceating a table with all the parameters so I can plot everything together:
ann.df <- data.frame(param=c("slope","low","high","ED50","auc.mid","auc.high","auc.low","ic50.mid","ic50.high","ic50.low"),value=signif(c(summary(fit)$coefficient[,1],auc.mid,auc.high,auc.low,ic50.mid,ic50.high,ic50.low),2),stringsAsFactors=F)
And finally plotting it all:
library(ggplot2)
library(grid)
library(gridExtra)
pl <- ggplot(df,aes(x=dose,y=viability))+geom_point()+geom_ribbon(data=pred.df,aes(x=dose,y=viability,ymin=viability.low,ymax=viability.high),alpha=0.2)+labs(y="viability")+
geom_line(data=pred.df,aes(x=dose,y=viability))+coord_trans(x="log")+theme_bw()+scale_x_continuous(name="dose",breaks=sort(unique(df$dose)),labels=format(signif(sort(unique(df$dose)),3),scientific=T))
ggdraw(pl)+draw_grob(tableGrob(ann.df,rows=NULL),x=0.1,y=0.175,width=0.3,height=0.4)
Which gives:
My questions are:
I thought that slope should be negative. How come it's 5.2?
the auc.mid, auc.high, and auc.lowcumputed as:
auc.mid <- computeAUC(rev(pred.df$dose),rev(pred.df$viability))
auc.low <- computeAUC(rev(pred.df$dose),rev(pred.df$viability.low))
auc.high <- computeAUC(rev(pred.df$dose),rev(pred.df$viability.high))
give 21.47818, 37.52389, and 2.678228, respectively.
Since these are not in the [0,1] range I thought that divinding them by the area under the highest corresponding viability will give what I'm looking for, i.e., relative AUC, but these values seem too low relative to what the figure shows. What are these AUCs then?
Also, how come auc.mid > auc.low > auc.high? I would think that it should be auc.high > auc.mid > auc.low
The IC50 values also seem a little low. Do they make sense?
Bonus question: how do I avoid the trailing zeros in slope, low, high, ED50, ic50.mid, and ic50.high in the figure?
The parameter you are pulling out is the hill slope parameter, or the coefficient in front of the concentration variable in the exponential, not the actual slope of the curve.
The AUC provided is in the [0-100] range, for the area above the curve. I ran the code and got the order as auc.low>auc.mid>auc.high. Traditionally the area under the response curve was reported, or 1-viability.
It is important to note that the PharmacoGx package uses a 3 parameter hill slope model, similar to LL.3 in drc. Therefore, the plot will not correspond to the function fit by PharmacoGx to calculate the IC50 or AUC.
Source: PharmacoGx dev.

How can I calculate Residual Standard Error in R for Test Data set?

I have split the Boston dataset into training and test sets as below:
library(MASS)
smp_size <- floor(.7 * nrow(Boston))
set.seed(133)
train_boston <- sample(seq_len(nrow(Boston)), size = smp_size)
train_ind <- sample(seq_len(nrow(Boston)), size = smp_size)
train_boston <- Boston[train_ind, ]
test_boston <- Boston[-train_ind,]
nrow(train_boston)
# [1] 354
nrow(test_boston)
# [1] 152
Now I get the RSE using lm function as below:
train_boston.lm <- lm(lstat~medv, train_boston)
summary(train_boston.lm)
summary(train_boston.lm)$sigma
How can I calculate Residual Standard error for the test data set? I can't use lm function on the test data set. Is there any method to calculate RSE on test data set?
Here your residual standard error is the same as
summary(train_boston.lm)$sigma
# [1] 4.73988
sqrt(sum((fitted(train_boston.lm)-train_boston$lstat)^2)/
(nrow(train_boston)-2))
# [1] 4.73988
you loose are estimating two parameters so your degrees of freedom is n-2
With your test data, you're not really doing the same estimation, but if you wanted to calculate the same type of calculation substituting the predicted value from the model for your new data for the fitted values from the original model, you can do
sqrt(sum((predict(train_boston.lm, test_boston)-test_boston$lstat)^2)/
(nrow(test_boston)-2))
Although it may make more sense just to calculate the standard deviation of the predicted residuals
sd(predict(train_boston.lm, test_boston)-test_boston$lstat)

Chi square goodness of fit for random numbers generated

I have used Inverse CDF method to generate 1000 samples from an exponential and a Cauchy random variable.
Now to verify whether these belong to their relevant distributions, I have to perform Chi-Squared Test for Goodness of fit.
I have tried two approaches (as below) -
Chisq.test(y) #which has 1000 samples from supposed exponential distribution
chisq.test(z) #cauchy
I am getting the following error:
data: y
X-squared = 234.0518, df = 999, p-value = 1
Warning message:
In chisq.test(y) : Chi-squared approximation may be incorrect
chisq.test(z)
Error in chisq.test(z) :
all entries of 'x' must be nonnegative and finite
I downloaded the vcd library to use goodfit()
and typed:
t1 <- goodfit(y,type= "exponential",method= "MinChiSq")
summary(t1)
In this case, the error message:
Error: could not find function "goodfit"
can somebody please guide on how to implement the Chi-Squared GOF test properly?
Note: The samples are not from normal distribution (exponential and cauchy respectively)
I am trying to understand if it is possible to get the observed and expected data instead with no luck so far.
edit - I did type in library(vcd) before writing the rest of the code. Apologies to have assumed it was obvious.
The chisq.test(...) function is designed primarily for use with counts, so it expects its arguments to be either countable (using table(...) for example), or to be counts already. It basically creates a contingency table for x and y (the first two arguments) and then uses the chisq test to determine if they are from the same distribution.
You are probably better off using the Kolmogorov–Smirnov test, which is designed for problems like yours. The K-S test compares the ecdf of the sample to the cdf of the test distribution and tests the null hypothesis that they are the same.
set.seed(1)
df <- data.frame(y = rexp(1000),
z = rcauchy(1000, 100, 100))
ks.test(df$y,"pexp")
# One-sample Kolmogorov-Smirnov test
#
# data: df$y
# D = 0.0387, p-value = 0.1001
# alternative hypothesis: two-sided
ks.test(df$z,"pcauchy",100,100)
# One-sample Kolmogorov-Smirnov test
#
# data: df$z
# D = 0.0296, p-value = 0.3455
# alternative hypothesis: two-sided
Note that in this case, the K-S test predicts a 90% chance that your sample df$y did not come from an exponential distribution, even though it clearly did.
You can use chisq.test(...) by artificially binning your data and then comparing the counts in each bin to what would be expected from your test distribution (using p=...), but this is convoluted and the answer you get depends on the number of bins.
breaks <- c(seq(0,10,by=1))
O <- table(cut(df$y,breaks=breaks))
p <- diff(pexp(breaks))
chisq.test(O,p=p, rescale.p=T)
# Chi-squared test for given probabilities
#
# data: O
# X-squared = 7.9911, df = 9, p-value = 0.535
In this case the chisq test predicts a 47% chance that your sample did not come from an exponential distribution.
Finally, even though they are qualitative, I find Q-Q plots to be very useful. These plot quantiles of your sample against quantiles of the test distribution. If the sample is drawn from the test distribution, the Q-Q plot should fall close to the line y=x.
par(mfrow=c(1,2))
plot(qexp(seq(0,1,0.01)),quantile(df$y,seq(0,1,0.01)),
main="Q-Q Plot",ylab="df$Y", xlab="Exponential",
xlim=c(0,5),ylim=c(0,5))
plot(qcauchy(seq(0,.99,0.01),100,100),quantile(df$z,seq(0,.99,0.01)),
main="Q-Q Plot",ylab="df$Z",xlab="Cauchy",
xlim=c(-1000,1000),ylim=c(-1000,1000))
Looking at the Q-Q plots gives me much more confidence in asserting that df$y and df$z are drawn, respectively, from the Exponential and Cauchy distributions than either the K-S or ChiSq tests, even though I can't put a number on it.
# Simulation
set.seed(123)
df <- data.frame(y = rexp(1000),
z = rcauchy(1000, 100, 100)
)
#This seems to be different, probably because of how you are simulating the data
chisq.test(df$y)
# Chi-squared test for given probabilities
#
# data: df$y
# X-squared = 978.485, df = 999, p-value = 0.6726
#
# Warning message:
# In chisq.test(df$y) : Chi-squared approximation may be incorrect
3 details:
1) you need to load the package. library(vcd)
2) There is no "exponential" type of distribution in the goodfit function
3) the method is MinChisq, Not MinChiSq
.
library(vcd)
t1 <- goodfit(df$y, type= "binomial", method= "MinChisq")
summary(t1)
# Goodness-of-fit test for binomial distribution
#
# X^2 df P(> X^2)
# Pearson 31.00952 6 2.524337e-05
# Warning message:
# In summary.goodfit(t1) : Chi-squared approximation may be incorrect

Simulating the significance level of a t-test in R

The following code generates 10000 t-tests on a custom distribution I have created.
> x <- replicate(10000,{
t.test(rcn(20,.25,25), mu=0, alternative="greater")
})
I am constructing an empirical level of significance and so I am interested in the number of test-statistics that are greather than the critical value of the respective t-distribution, which is 1.729 (for a t-distribution with 19 degrees of freedom).
How can I select (and count) these test-statistics here? The ratio of their number over 10000, the total number of simulations, will give me my empirical level of significance.
You can directly access the test statistic of each t.test in replicate. For example:
x <- replicate(10000, {
t.test(rcn(20,.25,25), mu=0, alternative="greater")$statistic
})
This returns a vector of t-values.
You can compare it with your critical value and count the TRUEs:
crit <- 1.729
sum(x > crit)

Resources