Wald-test for single statistic in R - r

I have a series of hazard-rates at two points (low and high point) in the curve with corresponding standard errors. I calculate the hazard-ratio by dividing the high point hazard-rate by the low point hazard-rate. This is the hratio column. Now in the next column I would like to show the probability (p-value) that the ratio is significantly different from 1 using the Wald-test.
I have tried doing this using the wald.test() from the aods3 package, but I keep getting an error messages. It seems that the code only allows for the comparison of two related regression models.
How would you go about doing this?
> wald
fit.low se.low fit.high se.high hratio
1 0.09387638 0.002597817 0.09530283 0.002800329 0.9850324
2 0.10941588 0.002870383 0.10831292 0.003061924 1.0101831
3 0.02549611 0.001054303 0.02857411 0.001368525 0.8922802
4 0.02818208 0.000917136 0.02871669 0.000936373 0.9813833
5 0.04857652 0.000554676 0.04897211 0.000568229 0.9919222
6 0.05121328 0.000565592 0.05142951 0.000554893 0.9957956
> library(aods3)
> wald$pv <- wald.test(b=wald$hratio)
Error in wald.test(b = wald$hratio) :
One of the arguments Terms or L must be used.

define L=NULL, Terms=NULL, Sigma = vcov(b)

Related

R: Find cutoffpoint for continous variable to assign observations to two groups

I have the following data
Species <- c(rep('A', 47), rep('B', 23))
Value<- c(3.8711, 3.6961, 3.9984, 3.8641, 4.0863, 4.0531, 3.9164, 3.8420, 3.7023, 3.9764, 4.0504, 4.2305,
4.1365, 4.1230, 3.9840, 3.9297, 3.9945, 4.0057, 4.2313, 3.7135, 4.3070, 3.6123, 4.0383, 3.9151,
4.0561, 4.0430, 3.9178, 4.0980, 3.8557, 4.0766, 4.3301, 3.9102, 4.2516, 4.3453, 4.3008, 4.0020,
3.9336, 3.5693, 4.0475, 3.8697, 4.1418, 4.0914, 4.2086, 4.1344, 4.2734, 3.6387, 2.4088, 3.8016,
3.7439, 3.8328, 4.0293, 3.9398, 3.9104, 3.9008, 3.7805, 3.8668, 3.9254, 3.7980, 3.7766, 3.7275,
3.8680, 3.6597, 3.7348, 3.7357, 3.9617, 3.8238, 3.8211, 3.4176, 3.7910, 4.0617)
D<-data.frame(Species,Value)
I have the two species A and B and want to find out which is the best cutoffpoint for value to determine the species.
I found the following question:
R: Determine the threshold that maximally separates two groups based on a continuous variable?
and followed the accepted answer to find the best value with the dose.p function from the MASS package. I have several similar values and it worked for them, but not for the one given above (which is also the reason why i needed to include all 70 observations here).
D$Species_b<-ifelse(D$Species=="A",0,1)
my.glm<-glm(Species_b~Value, data = D, family = binomial)
dose.p(my.glm,p=0.5)
gives me 3.633957 as threshold:
Dose SE
p = 0.5: 3.633957 0.1755291
this results in 45 correct assignments. however, if I look at the data, it is obvious that this is not the best value. By trial and error I found that 3.8 gives me 50 correct assignments, which is obviously better.
Why does the function work for other values, but not for this one? Am I missing an obvious mistake? Or is there maybe a different/ better approach to solving my problem? I have several values I need to do this for, so I really do not want to just randomly test values until I find the best one.
Any help would be greatly appreciated.
I would typically use a receiver operating characteristic curve (ROC) for this type of analysis. This allows a visual and numerical assessment of how the sensitivity and specificity of your cutoff changes as you adjust your threshold. This allows you to select the optimum threshold based on when the overall accuracy is optimum. For example, using pROC:
library(pROC)
species_roc <- roc(D$Species, D$Value)
We can get a measure of how good a discriminator Value is for predicting Species by examining the area under the curve:
auc(species_roc)
#> Area under the curve: 0.778
plot(species_roc)
and we can find out the optimum cut-off threshold like this:
coords(species_roc, x = "best")
#> threshold specificity sensitivity
#> 1 3.96905 0.6170213 0.9130435
We see that this threshold correctly identifies 50 cases:
table(Actual = D$Species, Predicted = c("A", "B")[1 + (D$Value < 3.96905)])
#> Predicted
#> Actual A B
#> A 29 18
#> B 2 21

Error when using the Benjamini-Hochberg false discovery rate in R after Wilcoxon Rank

I have carried out a Wilcoxon rank sum test to see if there is any significant difference between the expression of 598019 genes between three disease samples vs three control samples. I am in R.
When I see how many genes have a p value < 0.05, I get 41913 altogether. I set the parameters of the Wilcoxon as follows;
wilcox.test(currRow[4:6], currRow[1:3], paired=F, alternative="two.sided", exact=F, correct=F)$p.value
(This is within an apply function, and I can provide my total code if necessary, I was a little unsure as to whether alternative="two.sided" was correct).
However, as I assume correcting for multiple comparisons using the Benjamini Hochberg False Discovery rate would lower this number, I then adjusted the p values via the following code
pvaluesadjust1 <- p.adjust(pvalues_genes, method="BH")
Re-assessing which p values are less than 0.05 via the below code, I get 0!
p_thresh1 <- 0.05
names(pvaluesadjust1) <- rownames(gene_analysis1)
output <- names(pvaluesadjust1)[pvaluesadjust1 < p_thresh1]
length(output)
I would be grateful if anybody could please explain, or direct me to somewhere which can help me understand what is going on!
Thank-you
(As an extra question, would a t-test be fine due to the size of the data, the Anderson-Darling test showed that the underlying data is not normal. I had far less genes which were less than 0.05 using this statistical test rather than Wilcoxon (around 2000).
Wilcoxon is a parametric test based on ranks. If you have only 6 samples, the best result you can get is rank 2,2,2 in disease versus 5,5,5 in control, or vice-versa.
For example, try the parameters you used in your test, on these random values below, and you that you get the same p.value 0.02534732.
wilcox.test(c(100,100,100),c(1,1,1),exact=F, correct=F)$p.value
wilcox.test(c(5,5,5),c(15,15,15),exact=F, correct=F)$p.value
So yes, with 598019 you can get 41913 < 0.05, these p-values are not low enough and with FDR adjustment, none will ever pass.
You are using the wrong test. To answer your second question, a t.test does not work so well because you don't have enough samples to estimate the standard deviation correctly. Below I show you an example using DESeq2 to find differential genes
library(zebrafishRNASeq)
data(zfGenes)
# remove spikeins
zfGenes = zfGenes[-grep("^ERCC", rownames(zfGenes)),]
head(zfGenes)
Ctl1 Ctl3 Ctl5 Trt9 Trt11 Trt13
ENSDARG00000000001 304 129 339 102 16 617
ENSDARG00000000002 605 637 406 82 230 1245
First three are controls, last three are treatment, like your dataset. To validate what I have said before, you can see that if you do a wilcoxon.test, the minimum value is 0.02534732
all_pvalues = apply(zfGenes,1,function(i){
wilcox.test(i[1:3],i[4:6],exact=F, correct=F)$p.value
})
min(all_pvalues,na.rm=T)
# returns 0.02534732
So we proceed with DESeq2
library(DESeq2)
#create a data.frame to annotate your samples
DF = data.frame(id=colnames(zfGenes),type=rep(c("ctrl","treat"),each=3))
# run DESeq2
dds = DESeqDataSetFromMatrix(zfGenes,DF,~type)
dds = DESeq(dds)
summary(results(dds),alpha=0.05)
out of 25839 with nonzero total read count
adjusted p-value < 0.05
LFC > 0 (up) : 69, 0.27%
LFC < 0 (down) : 47, 0.18%
outliers [1] : 1270, 4.9%
low counts [2] : 5930, 23%
(mean count < 7)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results
So you do get hits which pass the FDR cutoff. Lastly we can pull out list of significant genes
res = results(dds)
res[which(res$padj < 0.05),]

Multinomial logit models and nested logit models

I am using the mlogit package in program R. I have converted my data from its original wide format to long format. Here is a sample of the converted data.frame which I refer to as 'long_perp'. All of the independent variables are individual specific. I have 4258 unique observations in the data-set.
date_id act2 grp.bin pdist ship sea avgknots shore day location chid alt
4.dive 40707_004 TRUE 2 2.250 second light 14.06809 2.30805 12 Lower 4 dive
4.fly 40707_004 FALSE 2 2.250 second light 14.06809 2.30805 12 Lower 4 fly
4.none 40707_004 FALSE 2 2.250 second light 14.06809 2.30805 12 Lower 4 none
5.dive 40707_006 FALSE 2 0.000 second light 15.12650 2.53312 12 Lower 5 dive
5.fly 40707_006 TRUE 2 0.000 second light 15.12650 2.53312 12 Lower 5 fly
5.none 40707_006 FALSE 2 0.000 second light 15.12650 2.53312 12 Lower 5 none
6.dive 40707_007 FALSE 1 1.995 second light 14.02101 2.01680 12 Lower 6 dive
6.fly 40707_007 TRUE 1 1.995 second light 14.02101 2.01680 12 Lower 6 fly
6.none 40707_007 FALSE 1 1.995 second light 14.02101 2.01680 12 Lower 6 none
'act2' is the dependent variable and consists of choices a bird floating on the water could make when approached by a ship; fly, dive, or none. I am interested in how these probabilities relate to the remaining independent variables in the data.frame, i.e. perpendicular distance to the ship path (pdist) sea conditions (sea), speed (avgknots), distance to shore (shore) etc. The independent variables are made of dichotomous, factor and continuous variables.
I ran two multinomial logit models, one including all the choice options and another including only a subset. I then compared these models with the hmftest() function to test for the IIA assumption. The results were confusing the say the least. I will include the codes for the two models and the test output (in case I am miss-specifying the models in the code).
# model including all choice options (fly, dive, none)
mod.1 <- mlogit(act2 ~ 1 | pdist + as.factor(grp.bin) +
as.factor(sea) + avgknots + shore + as.factor(location),long_perp ,
reflevel = 'none')
# model including only a subset of choice options (fly, dive)
mod.alt <- mlogit(act2 ~ 1 | pdist + as.factor(grp.bin) +
as.factor(sea) + avgknots + shore + as.factor(location),long_perp ,
reflevel = 'none', alt.subset = c("fly","dive"))
# IIA test
hmftest(mod.1, mod.alt)
# output
Hausman-McFadden test
data: long_perp
chisq = -968.7303, df = 7, p-value = 1
alternative hypothesis: IIA is rejected
As you can see the chisquare statistic is negative! I assume I am either 1. doing something wrong, or 2. IIA is violated. This result holds true for choice subset (fly, dive), but the IIA assumption is upheld with choice subset (none, dive)? This confuses me.
Next I tried to formulate a nested model as a way to relax the IIA assumption. I nested the choices as nest1 = none, nest2 = fly, dive. This makes sense to me as this seems like a logical break, the bird decides to react or not then decides which reaction to make.
I am unclear on how to run the nested logit models (even after reading the two vignettes for mlogit, Croissant vignette and Train vignette).
When I run my analysis following the example in the Croissant vignette I get the following error.
nested.1 <- mlogit(act2 ~ 0 | pdist + as.factor(grp.bin) + as.factor(ship) +
as.factor(sea) + avgknots + shore + as.factor(location),
long_perp , reflevel="none",nests = list(noact = "none",
react = c("dive","fly")), unscaled = TRUE)
# Error in solve.default(crossprod(attr(x, "gradi")[, !fixed])) :
Lapack routine dgesv: system is exactly singular: U[19,19] = 0
I have read a bit about this error message and it may occur because of complete separation. I have looked at some tables of the data and do not believe this is happening as I have 4,000+ observations and only one factor variable with more than 2 levels (it has 3).
Help on these specific problems is greatly appreciated but I am also open to alternate analyses that I can use to answer my question. I am mainly interested in the probability of flying as a function of perpendicular distance to the ships path.
Thanks, Tim
To get a positive chi-sq, change the code as follows:
alt.subset = c("none", "fly")
that is, the ref level will be in the subset too. It may help, though the P-value may not change much.

Chi squared goodness of fit for a geometric distribution

As an assignment I had to develop and algorithm and generate a samples for a given geometric distribution with PMF
Using the inverse transform method, I came up with the following expression for generating the values:
Where U represents a value, or n values depending on the size of the sample, drawn from a Unif(0,1) distribution and p is 0.3 as stated in the PMF above.
I have the algorithm, the implementation in R and I already generated QQ Plots to visually assess the adjustment of the empirical values to the theoretical ones (generated with R), i.e., if the generated sample follows indeed the geometric distribution.
Now I wanted to submit the generated sample to a goodness of fit test, namely the Chi-square, yet I'm having trouble doing this in R.
[I think this was moved a little hastily, in spite of your response to whuber's question, since I think before solving the 'how do I write this algorithm in R' problem, it's probably more important to deal with the 'what you're doing is not the best approach to your problem' issue (which certainly belongs where you posted it). Since it's here, I will deal with the 'doing it in R' aspect, but I would urge to you go back an ask about the second question (as a new post).]
Firstly the chi-square test is a little different depending on whether you test
H0: the data come from a geometric distribution with parameter p
or
H0: the data come from a geometric distribution with parameter 0.3
If you want the second, it's quite straightforward. First, with the geometric, if you want to use the chi-square approximation to the distribution of the test statistic, you will need to group adjacent cells in the tail. The 'usual' rule - much too conservative - suggests that you need an expected count in every bin of at least 5.
I'll assume you have a nice large sample size. In that case, you'll have many bins with substantial expected counts and you don't need to worry so much about keeping it so high, but you will still need to choose how you will bin the tail (whether you just choose a single cut-off above which all values are grouped, for example).
I'll proceed as if n were say 1000 (though if you're testing your geometric random number generation, that's pretty low).
First, compute your expected counts:
dgeom(0:20,.3)*1000
[1] 300.0000000 210.0000000 147.0000000 102.9000000 72.0300000 50.4210000
[7] 35.2947000 24.7062900 17.2944030 12.1060821 8.4742575 5.9319802
[13] 4.1523862 2.9066703 2.0346692 1.4242685 0.9969879 0.6978915
[19] 0.4885241 0.3419669 0.2393768
Warning, dgeom and friends goes from x=0, not x=1; while you can shift the inputs and outputs to the R functions, it's much easier if you subtract 1 from all your geometric values and test that. I will proceed as if your sample has had 1 subtracted so that it goes from 0.
I'll cut that off at the 15th term (x=14), and group 15+ into its own group (a single group in this case). If you wanted to follow the 'greater than five' rule of thumb, you'd cut it off after the 12th term (x=11). In some cases (such as smaller p), you might want to split the tail across several bins rather than one.
> expec <- dgeom(0:14,.3)*1000
> expec <- c(expec, 1000-sum(expec))
> expec
[1] 300.000000 210.000000 147.000000 102.900000 72.030000 50.421000
[7] 35.294700 24.706290 17.294403 12.106082 8.474257 5.931980
[13] 4.152386 2.906670 2.034669 4.747562
The last cell is the "15+" category. We also need the probabilities.
Now we don't yet have a sample; I'll just generate one:
y <- rgeom(1000,0.3)
but now we want a table of observed counts:
(x <- table(factor(y,levels=0:14),exclude=NULL))
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 <NA>
292 203 150 96 79 59 47 25 16 10 6 7 0 2 5 3
Now you could compute the chi-square directly and then calculate the p-value:
> (chisqstat <- sum((x-expec)^2/expec))
[1] 17.76835
(pval <- pchisq(chisqstat,15,lower.tail=FALSE))
[1] 0.2750401
but you can also get R to do it:
> chisq.test(x,p=expec/1000)
Chi-squared test for given probabilities
data: x
X-squared = 17.7683, df = 15, p-value = 0.275
Warning message:
In chisq.test(x, p = expec/1000) :
Chi-squared approximation may be incorrect
Now the case for unspecified p is similar, but (to my knowledge) you can no longer get chisq.test to do it directly, you have to do it the first way, but you have to estimate the parameter from the data (by maximum likelihood or minimum chi-square), and then test as above but you have one fewer degree of freedom for estimating the parameter.
See the example of doing a chi-square for a Poisson with estimated parameter here; the geometric follows the much same approach as above, with the adjustments as at the link (dealing with the unknown parameter, including the loss of 1 degree of freedom).
Let us assume you've got your randomly-generated variates in a vector x. You can do the following:
x <- rgeom(1000,0.2)
x_tbl <- table(x)
x_val <- as.numeric(names(x_tbl))
x_df <- data.frame(count=as.numeric(x_tbl), value=x_val)
# Expand to fill in "gaps" in the values caused by 0 counts
all_x_val <- data.frame(value = 0:max(x_val))
x_df <- merge(all_x_val, x_df, by="value", all.x=TRUE)
x_df$count[is.na(x_df$count)] <- 0
# Get theoretical probabilities
x_df$eprob <- dgeom(x_df$val, 0.2)
# Chi-square test: once with asymptotic dist'n,
# once with bootstrap evaluation of chi-sq test statistic
chisq.test(x=x_df$count, p=x_df$eprob, rescale.p=TRUE)
chisq.test(x=x_df$count, p=x_df$eprob, rescale.p=TRUE,
simulate.p.value=TRUE, B=10000)
There's a "goodfit" function described as "Goodness-of-fit Tests for Discrete Data" in package "vcd".
G.fit <- goodfit(x, type = "nbinomial", par = list(size = 1))
I was going to use the code you had posted in an earlier question, but it now appears that you have deleted that code. I find that offensive. Are you using this forum to gather homework answers and then defacing it to remove the evidence? (Deleted questions can still be seen by those of us with sufficient rep, and the interface prevents deletion of question with upvoted answers so you should not be able to delete this one.)
Generate a QQ Plot for testing a geometrically distributed sample
--- question---
I have a sample of n elements generated in R with
sim.geometric <- function(nvals)
{
p <- 0.3
u <- runif(nvals)
ceiling(log(u)/log(1-p))
}
for which i want to test its distribution, specifically if it indeed follows a geometric distribution. I want to generate a QQ PLot but have no idea how to.
--------reposted answer----------
A QQ-plot should be a straight line when compared to a "true" sample drawn from a geometric distribution with the same probability parameter. One gives two vectors to the functions which essentially compares their inverse ECDF's at each quantile. (Your attempt is not particularly successful:)
sim.res <- sim.geometric(100)
sim.rgeom <- rgeom(100, 0.3)
qqplot(sim.res, sim.rgeom)
Here I follow the lead of the authors of qqplot's help page (which results in flipping that upper curve around the line of identity):
png("QQ.png")
qqplot(qgeom(ppoints(100),prob=0.3), sim.res,
main = expression("Q-Q plot for" ~~ {G}[n == 100]))
dev.off()
---image not included---
You can add a "line of good fit" by plotting a line through through the 25th and 75th percentile points for each distribution. (I added a jittering feature to this to get a better idea where the "probability mass" was located:)
sim.res <- sim.geometric(500)
qqplot(jitter(qgeom(ppoints(500),prob=0.3)), jitter(sim.res),
main = expression("Q-Q plot for" ~~ {G}[n == 100]), ylim=c(0,max( qgeom(ppoints(500),prob=0.3),sim.res )),
xlim=c(0,max( qgeom(ppoints(500),prob=0.3),sim.res )))
qqline(sim.res, distribution = function(p) qgeom(p, 0.3),
prob = c(0.25, 0.75), col = "red")

Error with sem function at R : differences in factors

I wanted to use the function sem (with the package lavaan) on my data in R :
Model1<- 'Transfer~Amotivation+Gender+Age
Amotivation~Gender+Age
transfer are 4 questions with a 5 point likert scale
Amotivation: 4 questions with a 5 pint likert scale
Gender: 0 (=male) and 1 (=female)
Age: just the different ages
And i got next error:
in getDataFull (data= data, group = group, grow.label = group.label,:
lavaan WARNING: some observed variances are (at least) a factor 100 times larger than others; please rescale
Is anybody familiar with this error? Does it influence my results? Do I have to change anything? I really don't know what this error means.
Your scales are not equivalent. Your gender variables are constrained to be either 0 or 1. Amotivation is constrained to be between 1 and 5, but age is even less constrained. I created some sample data for gender, age, and amotivation. You can see that the variance for the age variable is over 4,000 times higher than the variance for gender, and about 500 times higher than sample amotivation data.
gender <- c(0,1,1,1,0,0,1,1,0,1,1,0,0,1,1,1)
age <- c(18,42,87,12,24,26,98,84,23,12,95,44,54,23,10,16)
set.seed(42)
amotivation <- rnorm(16, 3, 1.5)
var(gender) # 0.25 variance
var(age) # 1017.27 variance
var(amotivation) # 2.21 variance
I'm not sure how the unequal variances influence your results, or if you need to do anything at all. To make your age variable more closely match the amotivation scale, you could transform the data so that it's also on a 5 point scale.
newage <- age/max(age)*5
var(newage) # 2.65 variance
You could try running the analysis both ways (using your original data and the transformed data) and see if there are differences.

Resources