r qqp function - why is the 'perfect fit' a flat line on 0? - r

This may be more of a statistical question than a programming one. I just wanted to make sure I was getting the programming right first.
I have a large count dataset (108 sites with 31 species = 3348 observations) but a lot of these are 0 counts because only not species were not present at every site. I have had log transformation suggested to me but others have also said that you shouldn't log transform count data. Here is my data for the first 8 species (also contains the very abundant species with the highest counts):
example.abund <- c(0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,
0,0,1,0,8,0,1,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3,0,1,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,0,1,0,0,0,0,2,0,3,1,0,0,0,0,0,0,0,0,0,
2,0,1,1,0,0,0,0,1,1,0,0,1,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,1,
0,1,0,0,0,28,1,0,1,0,0,1,0,2,0,0,2,0,0,0,1,0,0,0,1,0,0,0,2,0,0,1,0,0,
0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,1,2,0,1,0,0,8,7,7,1,1,13,0,8,0,3,0,1,1,
1,4,4,0,1,0,1,0,0,0,0,6,5,2,0,2,58,4,2,47,4,0,0,0,2,59,2,0,0,6,1,36,28,2,
1,1,0,6,0,0,2,5,0,0,0,0,87,7,0,1,1,1,0,0,1,1,0,6,11,0,0,0,3,0,4,0,7,2,
0,5,0,4,1,0,1,12,0,2,0,9,0,1,0,0,0,24,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3,0,0,0,0,0,3,1,0,1,0,1,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,1,0,0,4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3,15,0,2,
81,0,1,32,26,13,2,61,0,66,2,2,0,17,43,43,0,25,19,2,25,26,91,61,0,13,0,62,186,1,4,22,1,50,3,67,86,11,56,26,74,0,6,8,7,0152,8,14,1,97,1,0,12,11,3,1,1,112,2,35,36,5,61,26,211,15,8,173,17,97,22,18,88,11,1,66,15,3,3,3,2,0,1,0,41,9,14,1,0,38,0,0,51,27,11,38,31,1,0,221,68,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,2,0,0,2,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,29,0,0,0,0,
0,82,12,0,0,3,0,9,0,0,164,0,0,0,0,1,0,15,0,0,0,6,56,0,0,0,6,0,0,1,0,5,5,8,
0,4,0,0,6,0,0,2,0,0,3,0,0,0,0,683,0,0,0,0,3,149,252,11,13,195,19,0,59,0,0,1,28,0,
0,0,0,0,0,0,0,0,0,0,31,55,85,0,142,0,44,52,0,0192,0,45,0,0,0,0,0,0,11,2,0,0,6,
0,0,0,0,0,0,0,0,0,0,0,0,0,19,3,0,0,3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,3,0,0,0,0,0,0,0,0,0,0)
I am need to make a mixed model to fit the data, but first I am trying to figure out the most appropriate distribution to use. I was following the steps in this blog. But all of the red lines (meant to represent 'perfect fit' for that distribution) are coming up as being 0 along the entire plot.
My question is: have a coded this correctly and there are so many 0s in my data that the perfect fit is 0? Or is there something wrong with the way I have coded?
Code example:
#so that the families without 0s can recognise data
example.abund.1 <- example.abund + 1
plot(hist(example.abund))
qqp(example.abund, "norm")
qqp(example.abund.1, "lnorm") #lognorm
#have to generate estimates of parameters:
nbinom <- fitdistr(example.abund.1, "Negative Binomial")
qqp(example.abund.1, "nbinom", size = nbinom$estimate[[1]], mu = nbinom$estimate[[2]])
poisson <- fitdistr(example.abund.1, "Poisson")
qqp(example.abund.1, "pois", poisson$estimate)
gamma <- fitdistr(example.abund.1, "gamma")
qqp(example.abund.1, "gamma", shape = gamma$estimate[[1]], rate = gamma$estimate[[2]])

Related

PCA - All variables with same signal on PC1 coordinates

So, I am analyzing a dataset that consists of 160 observations and 20 variables and am performing a PCA. It is about patients affected by a disease and the variables are antibodies levels measured in the same experiment and the values are on the same units (u/mL). These variables are all positive values so I can't understand how I would have samples on the positive PC1 side of the plot without any variable contributing to that side (given that there are no negative values involved on these variables).
For confounding factors, what I have is: patients' age, gender and the duration of infection, but these 3 were not added in the PC analysis.
I am having some trouble to understand the following: when using the rpackage factoextra's function fviz_pca_biplot() to see both the sample distribution as well as each variable contribution to PCs 1 and 2, I realized that my 20 variables have high negative value for PC1.
For the following images, I generated them using a small sample of my original data and, eventhough the variables contribution are not the same, they are still highly negative for PC1. This is understandable if I do not center my data in the prcomp() function (image 1) as it is possible to see that all of my samples are on the negative side of the PC1 component and it explains most of the data inertia.
library(factoextra)
PCAf <- read.table("PCA_small_sample.csv", sep = ";", header = T, row.names = 1)
res.pca <- prcomp(PCAf, scale = TRUE, center = F)
fviz_pca_biplot(res.pca)
However, I have been taught that it is necessary to center the data when performing PCA and the image becomes like this:
res.pca <- prcomp(PCAf, scale = TRUE)
fviz_pca_biplot(res.pca)
This diminishes PC1 explained variance and increases PC2 but, eventhough it changes the variables coordinates, there is no positive coord to PC1.
res.var <- get_pca_var(res.pca)
res.var$coord
These are the values for the non centered PCA:
And for the centered PCA:
Am I doing something wrong, should I really present my analysis with the second image eventhough the vectors do not match what we are seeing?
My main question is: When presenting the PCA, it is better to do so with the centralized data, right? Then, should I perform some sort of correction to the variables' coordinates/contribution to the PCs? Because this second image does not seem too reliable to me, but this may be due to lack of experience... I mean, since all variables are going toward the left side of the plot, what would be pulling some of the samples (e.g. 7,10,8,4,20) towards the right side of the plot (positive PC1)? It seems counterintuitive that there isn't even a single vector on the right side.
This also brings me the question: Should I add confounding factors when performing a PCA? I performed linear regression to account for them but did not include them in the PC analysis.
Anyway, thank you all so much in advance.
PS: I uploaded a file containing a sample of my data, code and images on github
PS2: When plotting this with a generic dataset, I do not see the same issue. At first it happens but when centering the data, there are vectors on the four quadrants, for which I am able to extract some rationale.
data.matrix <- matrix(nrow=100, ncol=10)
colnames(data.matrix) <- c(
paste("wt", 1:5, sep=""),
paste("ko", 1:5, sep=""))
rownames(data.matrix) <- paste("gene", 1:100, sep="")
for (i in 1:100) {
wt.values <- rpois(5, lambda=sample(x=10:1000, size=1))
ko.values <- rpois(5, lambda=sample(x=10:1000, size=1))
data.matrix[i,] <- c(wt.values, ko.values)
}
PCAf <- t(data.matrix)
res.pca_NC <- prcomp(PCAf, scale = TRUE, center = F)
res.pca_C <- prcomp(PCAf, scale = TRUE, center = T)
fviz_pca_biplot(res.pca_NC)
fviz_pca_biplot(res.pca_C)
Not centered - generic PCA:
Centered - generic PCA:

R: How to read Nomograms to predict the desired variable

I am using Rstudio. I have created nomograms using function nomogram from package rms using following code (copied from the example code of the documentation):
library(rms)
n <- 1000 # define sample size
set.seed(17) # so can reproduce the results
age <- rnorm(n, 50, 10)
blood.pressure <- rnorm(n, 120, 15)
cholesterol <- rnorm(n, 200, 25)
sex <- factor(sample(c('female','male'), n,TRUE))
# Specify population model for log odds that Y=1
L <- .4*(sex=='male') + .045*(age-50) +
(log(cholesterol - 10)-5.2)*(-2*(sex=='female') + 2*(sex=='male'))
# Simulate binary y to have Prob(y=1) = 1/[1+exp(-L)]
y <- ifelse(runif(n) < plogis(L), 1, 0)
ddist <- datadist(age, blood.pressure, cholesterol, sex)
options(datadist='ddist')
f <- lrm(y ~ lsp(age,50)+sex*rcs(cholesterol,4)+blood.pressure)
nom <- nomogram(f, fun=function(x)1/(1+exp(-x)), # or fun=plogis
fun.at=c(.001,.01,.05,seq(.1,.9,by=.1),.95,.99,.999),
funlabel="Risk of Death")
#Instead of fun.at, could have specified fun.lp.at=logit of
#sequence above - faster and slightly more accurate
plot(nom, xfrac=.45)
Result:
This code produces a nomogram but there is no line connecting each scale (called isopleth) to help predict the desired variable ("Risk of Death") from the plot. Usually, nomograms have the isopleth for prediction (example from wikipedia). But here, how do I predict the variable value?
EDIT:
From the documentation:
The nomogram does not have lines representing sums, but it has a
reference line for reading scoring points (default range 0--100). Once
the reader manually totals the points, the predicted values can be
read at the bottom.
I don't understand this. It seems that predicting is supposed to be done without the isopleth, from the scale of points. but how? Can someone please elaborate with this example on how I can read the nomograms to predict the desired variable? Thanks a lot!
EDIT 2 (FYI):
In the description of the bounty, I am talking about the isopleth. When starting the bounty, I did not know that nomogram function does not provide isopleth and has points scale instead.
From the documentation, the nomogram is used to manualy obtain prediction:
In the top of the plot (over Total points)
you draw a vertical line for each of the variables of your patient (for example age=40, cholesterol=220 ( and sex=male ), blood.pressure=172)
then you sum up the three values you read on the Points scale (40+60+3=103) to obtain Total Points.
Finally you draw a vertical line on the Total Points scale (103) to read the Risk of death (0.55).
These are regression nomograms, and work in a different way to classic nomograms. A classic nomogram will perform a full calculation. For these nomograms you drop a line from each predictor to the scale at the bottom and add your results.
The only way to have a classic 'isopleth' nomogram working on a regression model would be 1 have just two predictors or 2 have a complex multi- step nomogram.

Chi squared goodness of fit for a geometric distribution

As an assignment I had to develop and algorithm and generate a samples for a given geometric distribution with PMF
Using the inverse transform method, I came up with the following expression for generating the values:
Where U represents a value, or n values depending on the size of the sample, drawn from a Unif(0,1) distribution and p is 0.3 as stated in the PMF above.
I have the algorithm, the implementation in R and I already generated QQ Plots to visually assess the adjustment of the empirical values to the theoretical ones (generated with R), i.e., if the generated sample follows indeed the geometric distribution.
Now I wanted to submit the generated sample to a goodness of fit test, namely the Chi-square, yet I'm having trouble doing this in R.
[I think this was moved a little hastily, in spite of your response to whuber's question, since I think before solving the 'how do I write this algorithm in R' problem, it's probably more important to deal with the 'what you're doing is not the best approach to your problem' issue (which certainly belongs where you posted it). Since it's here, I will deal with the 'doing it in R' aspect, but I would urge to you go back an ask about the second question (as a new post).]
Firstly the chi-square test is a little different depending on whether you test
H0: the data come from a geometric distribution with parameter p
or
H0: the data come from a geometric distribution with parameter 0.3
If you want the second, it's quite straightforward. First, with the geometric, if you want to use the chi-square approximation to the distribution of the test statistic, you will need to group adjacent cells in the tail. The 'usual' rule - much too conservative - suggests that you need an expected count in every bin of at least 5.
I'll assume you have a nice large sample size. In that case, you'll have many bins with substantial expected counts and you don't need to worry so much about keeping it so high, but you will still need to choose how you will bin the tail (whether you just choose a single cut-off above which all values are grouped, for example).
I'll proceed as if n were say 1000 (though if you're testing your geometric random number generation, that's pretty low).
First, compute your expected counts:
dgeom(0:20,.3)*1000
[1] 300.0000000 210.0000000 147.0000000 102.9000000 72.0300000 50.4210000
[7] 35.2947000 24.7062900 17.2944030 12.1060821 8.4742575 5.9319802
[13] 4.1523862 2.9066703 2.0346692 1.4242685 0.9969879 0.6978915
[19] 0.4885241 0.3419669 0.2393768
Warning, dgeom and friends goes from x=0, not x=1; while you can shift the inputs and outputs to the R functions, it's much easier if you subtract 1 from all your geometric values and test that. I will proceed as if your sample has had 1 subtracted so that it goes from 0.
I'll cut that off at the 15th term (x=14), and group 15+ into its own group (a single group in this case). If you wanted to follow the 'greater than five' rule of thumb, you'd cut it off after the 12th term (x=11). In some cases (such as smaller p), you might want to split the tail across several bins rather than one.
> expec <- dgeom(0:14,.3)*1000
> expec <- c(expec, 1000-sum(expec))
> expec
[1] 300.000000 210.000000 147.000000 102.900000 72.030000 50.421000
[7] 35.294700 24.706290 17.294403 12.106082 8.474257 5.931980
[13] 4.152386 2.906670 2.034669 4.747562
The last cell is the "15+" category. We also need the probabilities.
Now we don't yet have a sample; I'll just generate one:
y <- rgeom(1000,0.3)
but now we want a table of observed counts:
(x <- table(factor(y,levels=0:14),exclude=NULL))
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 <NA>
292 203 150 96 79 59 47 25 16 10 6 7 0 2 5 3
Now you could compute the chi-square directly and then calculate the p-value:
> (chisqstat <- sum((x-expec)^2/expec))
[1] 17.76835
(pval <- pchisq(chisqstat,15,lower.tail=FALSE))
[1] 0.2750401
but you can also get R to do it:
> chisq.test(x,p=expec/1000)
Chi-squared test for given probabilities
data: x
X-squared = 17.7683, df = 15, p-value = 0.275
Warning message:
In chisq.test(x, p = expec/1000) :
Chi-squared approximation may be incorrect
Now the case for unspecified p is similar, but (to my knowledge) you can no longer get chisq.test to do it directly, you have to do it the first way, but you have to estimate the parameter from the data (by maximum likelihood or minimum chi-square), and then test as above but you have one fewer degree of freedom for estimating the parameter.
See the example of doing a chi-square for a Poisson with estimated parameter here; the geometric follows the much same approach as above, with the adjustments as at the link (dealing with the unknown parameter, including the loss of 1 degree of freedom).
Let us assume you've got your randomly-generated variates in a vector x. You can do the following:
x <- rgeom(1000,0.2)
x_tbl <- table(x)
x_val <- as.numeric(names(x_tbl))
x_df <- data.frame(count=as.numeric(x_tbl), value=x_val)
# Expand to fill in "gaps" in the values caused by 0 counts
all_x_val <- data.frame(value = 0:max(x_val))
x_df <- merge(all_x_val, x_df, by="value", all.x=TRUE)
x_df$count[is.na(x_df$count)] <- 0
# Get theoretical probabilities
x_df$eprob <- dgeom(x_df$val, 0.2)
# Chi-square test: once with asymptotic dist'n,
# once with bootstrap evaluation of chi-sq test statistic
chisq.test(x=x_df$count, p=x_df$eprob, rescale.p=TRUE)
chisq.test(x=x_df$count, p=x_df$eprob, rescale.p=TRUE,
simulate.p.value=TRUE, B=10000)
There's a "goodfit" function described as "Goodness-of-fit Tests for Discrete Data" in package "vcd".
G.fit <- goodfit(x, type = "nbinomial", par = list(size = 1))
I was going to use the code you had posted in an earlier question, but it now appears that you have deleted that code. I find that offensive. Are you using this forum to gather homework answers and then defacing it to remove the evidence? (Deleted questions can still be seen by those of us with sufficient rep, and the interface prevents deletion of question with upvoted answers so you should not be able to delete this one.)
Generate a QQ Plot for testing a geometrically distributed sample
--- question---
I have a sample of n elements generated in R with
sim.geometric <- function(nvals)
{
p <- 0.3
u <- runif(nvals)
ceiling(log(u)/log(1-p))
}
for which i want to test its distribution, specifically if it indeed follows a geometric distribution. I want to generate a QQ PLot but have no idea how to.
--------reposted answer----------
A QQ-plot should be a straight line when compared to a "true" sample drawn from a geometric distribution with the same probability parameter. One gives two vectors to the functions which essentially compares their inverse ECDF's at each quantile. (Your attempt is not particularly successful:)
sim.res <- sim.geometric(100)
sim.rgeom <- rgeom(100, 0.3)
qqplot(sim.res, sim.rgeom)
Here I follow the lead of the authors of qqplot's help page (which results in flipping that upper curve around the line of identity):
png("QQ.png")
qqplot(qgeom(ppoints(100),prob=0.3), sim.res,
main = expression("Q-Q plot for" ~~ {G}[n == 100]))
dev.off()
---image not included---
You can add a "line of good fit" by plotting a line through through the 25th and 75th percentile points for each distribution. (I added a jittering feature to this to get a better idea where the "probability mass" was located:)
sim.res <- sim.geometric(500)
qqplot(jitter(qgeom(ppoints(500),prob=0.3)), jitter(sim.res),
main = expression("Q-Q plot for" ~~ {G}[n == 100]), ylim=c(0,max( qgeom(ppoints(500),prob=0.3),sim.res )),
xlim=c(0,max( qgeom(ppoints(500),prob=0.3),sim.res )))
qqline(sim.res, distribution = function(p) qgeom(p, 0.3),
prob = c(0.25, 0.75), col = "red")

plot multiple fit and predictions for logistic regression

I am running multiple times a logistic regression over more than 1000 samples taken from a dataset. My question is what is the best way to show my results ? how can I plot my outputs for both the fit and the prediction curve?
This is an example of what I am doing, using the baseball dataset from R. For example I want to fit and predict the model 5 times. Each time I take one sample out (for the prediction) and use another for the fit.
library(corrgram)
data(baseball)
#Exclude rows with NA values
dataset=baseball[complete.cases(baseball),]
#Create vector replacing the Leage (A our N) by 1 or 0.
PA=rep(0,dim(dataset)[1])
PA[which(dataset[,2]=="A")]=1
#Model the player be league A in function of the Hits,Runs,Errors and Salary
fit_glm_list=list()
prd_glm_list=list()
for (k in 1:5){
sp=sample(seq(1:length(PA)),30,replace=FALSE)
fit_glm<-glm(PA[sp[1:15]]~baseball$Hits[sp[1:15]]+baseball$Runs[sp[1:15]]+baseball$Errors[sp[1:15]]+baseball$Salary[sp[1:15]])
prd_glm<-predict(fit_glm,baseball[sp[16:30],c(6,8,20,21)])
fit_glm_list[[k]]=fit_glm;prd_glm_list[[k]]=fit_glm
}
There are a number of issues here.
PA is a subset of baseball$League but the model is constructed on columns from the whole baseball data frame, i.e. they do not match.
PA is treated as a continuous response when using the default family (gaussian), it should be changed to a factor and binomial family.
prd_glm_list[[k]]=fit_glm should probably be prd_glm_list[[k]]=prd_glm
You must save the true class labels for the predictions otherwise you have nothing to compare to.
My take on your code looks like this.
library(corrgram)
data(baseball)
dataset <- baseball[complete.cases(baseball),]
fits <- preds <- truths <- vector("list", 5)
for (k in 1:5){
sp <- sample(nrow(dataset), 30, replace=FALSE)
fits[[k]] <- glm(League ~ Hits + Runs + Errors + Salary,
family="binomial", data=dataset[sp[1:15],])
preds[[k]] <- predict(fits[[k]], dataset[sp[16:30],], type="response")
truths[[k]] <- dataset$League[sp[1:15]]
}
plot(unlist(truths), unlist(preds))
The model performs poorly but at least the code runs without problems. The y-axis in the plot shows the estimated probabilities that the examples belong to league N, i.e. ideally the left box should be close to 0 and the right close to 1.

Bootstrapping to compare two groups

In the following code I use bootstrapping to calculate the C.I. and the p-value under the null hypothesis that two different fertilizers applied to tomato plants have no effect in plants yields (and the alternative being that the "improved" fertilizer is better). The first random sample (x) comes from plants where a standard fertilizer has been used, while an "improved" one has been used in the plants where the second sample (y) comes from.
x <- c(11.4,25.3,29.9,16.5,21.1)
y <- c(23.7,26.6,28.5,14.2,17.9,24.3)
total <- c(x,y)
library(boot)
diff <- function(x,i) mean(x[i[6:11]]) - mean(x[i[1:5]])
b <- boot(total, diff, R = 10000)
ci <- boot.ci(b)
p.value <- sum(b$t>=b$t0)/b$R
What I don't like about the code above is that resampling is done as if there was only one sample of 11 values (separating the first 5 as belonging to sample x leaving the rest to sample y).
Could you show me how this code should be modified in order to draw resamples of size 5 with replacement from the first sample and separate resamples of size 6 from the second sample, so that bootstrap resampling would mimic the “separate samples” design that produced the original data?
EDIT2 :
Hack deleted as it was a wrong solution. Instead one has to use the argument strata of the boot function :
total <- c(x,y)
id <- as.factor(c(rep("x",length(x)),rep("y",length(y))))
b <- boot(total, diff, strata=id, R = 10000)
...
Be aware you're not going to get even close to a correct estimate of your p.value :
x <- c(1.4,2.3,2.9,1.5,1.1)
y <- c(23.7,26.6,28.5,14.2,17.9,24.3)
total <- c(x,y)
b <- boot(total, diff, strata=id, R = 10000)
ci <- boot.ci(b)
p.value <- sum(b$t>=b$t0)/b$R
> p.value
[1] 0.5162
How would you explain a p-value of 0.51 for two samples where all values of the second are higher than the highest value of the first?
The above code is fine to get a -biased- estimate of the confidence interval, but the significance testing about the difference should be done by permutation over the complete dataset.
Following John, I think the appropriate way to use bootstrap to test if the sums of these two different populations are significantly different is as follows:
x <- c(1.4,2.3,2.9,1.5,1.1)
y <- c(23.7,26.6,28.5,14.2,17.9,24.3)
b_x <- boot(x, sum, R = 10000)
b_y <- boot(y, sum, R = 10000)
z<-(b_x$t0-b_y$t0)/sqrt(var(b_x$t[,1])+var(b_y$t[,1]))
pnorm(z)
So we can clearly reject the null that they are the same population. I may have missed a degree of freedom adjustment, I am not sure how bootstrapping works in that regard, but such an adjustment will not change your results drastically.
While the actual soil beds could be considered a stratified variable in some instances this is not one of them. You only have the one manipulation, between the groups of plants. Therefore, your null hypothesis is that they really do come from the exact same population. Treating the items as if they're from a single set of 11 samples is the correct way to bootstrap in this case.
If you have two plots, and in each plot tried the different fertilizers over different seasons in a counterbalanced fashion then the plots would be statified samples and you'd want to treat them as such. But that isn't the case here.

Resources