How do you set the alternative hypothesis text string in an `htest` object in R?

I am creating a hypothesis test in R as an htest object. I have managed to create the object I want, with the required estimate, test statistic and p-value. My only remaining problem is that the statement I want to give for my alternative hypothesis does not conform to the textual structure used in the printing method for an htest object. The setup for these objects seems to assume you have an alternative hypothesis that is a one-sided or two-sided test operating on an unknown parameter. It does not appear to accommodate more general statements of alternative hypotheses, such as for goodness-of-fit tests. To be a bit more specific about my problem, here is the textual structure of the output print message for an htest object:
alternative hypothesis: true [name(null.value)] is [less than/equal to/greater than] [null.value]
I would like a more general print output like this:
alternative hypothesis: [character string]
When you create an htest object you can set name(null.value) and null.value to any character string you want, so it is possible to alter the start and end parts of the print message to anything you want. You can also set alternative to NA and this removes the middle part. However, the intermediate words "true" and "is" seem to be fixed, so you appear to be stuck with a message of the form true [character string] is [character string].
My question: When creating a htest object, is there any way to get a print message for the alternative hypothesis that is an arbitrary character string? If so, what is the simplest way to do this?

As long as you set x$null.value <- NULL, it will print any string you construct for x$alternative:
x <- t.test(1:10)
x$null.value <- NULL
x$alternative <- sprintf('%.2f on %s degrees of freedom, p %s',
x$statistic, x$parameter, format.pval(x$p.value, eps = 0.001))
x
# One Sample t-test
#
# data: 1:10
# t = 5.7446, df = 9, p-value = 0.0002782
# alternative hypothesis: 5.74 on 9 degrees of freedom, p < 0.001
# 95 percent confidence interval:
# 3.334149 7.665851
# sample estimates:
# mean of x
# 5.5
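The same trick works if you are building the htest object from scratch rather than editing the result of an existing test: leave out the null.value component and put the whole sentence you want in alternative. A minimal hand-rolled sketch (the statistic, df and p-value here are made up purely for illustration):
# hand-built htest object; all values are illustrative only
gof <- structure(
  list(
    statistic   = c("X-squared" = 12.34),
    parameter   = c(df = 5),
    p.value     = 0.03,
    method      = "Custom goodness-of-fit test",
    data.name   = "mydata",
    alternative = "the data do not follow the hypothesised distribution"
    # no null.value component, so the string is printed verbatim
  ),
  class = "htest"
)
gof
# ...
# alternative hypothesis: the data do not follow the hypothesised distribution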

Related

BradleyTerry2 package in R - Using null hypothesis as reference player

I am using the BradleyTerry2 package in R to analyse my data. When using the BTm function to calculate ability scores, the first item in the dataset is removed as a reference, given a score of 0 and then other ability scores are calculated relative to this reference.
Is there a way to use a null hypothesis as a reference, rather than using the first item in the dataset?
This is the code I am using. The "ID" field is player id. This code calculates an ability score for each "Matchup," relative to the first matchup in the dataset.
BTv1 <- BTm(player1=winner,player2=loser,id="ID",formula=~Matchup+(1|ID),data=btmdata)
I am trying to test against the null hypothesis that matchup has no effect on match outcomes, but currently I don't know what ability score corresponds to the null hypothesis. I would like to use this null hypothesis as a reference, rather than using the first matchup in the dataset.
For those wanting to reproduce my results, you can find my files on my university onedrive.
You can test the significance of terms in the model for ability using the anova function, i.e.
anova(BTv1, test = "Chisq")
Using the example data and script that you shared, we get the following result:
Sequential Wald Tests
Model: binomial, link: logit
Response: NULL
Predictor: ~Characters + (1 | ID)
Terms added sequentially (first to last)
Statistic Df P(>|Chi|)
NULL
Characters 46.116 26 0.008853 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Edit: For the model BTv2 with log-ability modelled by ~ Matchup+(1|ID)
Before investigating individual matchups, we should check the significance of the term overall. Unfortunately the anova() method for BTm objects doesn't currently work for terms with inestimable parameters, as in this case. So we'll compute this directly:
# keep only the estimable (non-NA) coefficients and their covariance matrix
cf <- coef(BTv2)[!is.na(coef(BTv2))]
V <- vcov(BTv2)
# Wald statistic for the Matchup parameters jointly: t(b) %*% solve(V) %*% b
ind <- grep("Matchup", names(cf))
chisq <- c(t(cf[ind]) %*% chol2inv(chol(V[ind, ind])) %*% cf[ind])
df <- length(ind)
c(chisq = chisq, df = df)
# chisq df
# 107.5667 167.0000
The Chi-squared statistic is less than the degrees of freedom, so the Matchup term is not significant - the model is over-fitting and it's not a good idea to investigate matchup-specific effects.
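For completeness, the corresponding p-value can be computed from the chi-squared distribution with base R (this is not part of the package output):
pchisq(107.5667, df = 167, lower.tail = FALSE)
# very close to 1, i.e. no evidence that the Matchup term is needed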
All the same, let's look at the model when fitted to the matches involving just 3 of the characters, for illustration.
summary(BTv2)$fixef
# Estimate Std. Error z value Pr(>|z|)
# MatchupCaptainFalcon;Falco -0.1327177 0.3161729 -0.4197632 0.6746585
# MatchupCaptainFalcon;Peach 0.1464518 0.3861823 0.3792297 0.7045173
# MatchupFalco;Peach -0.4103029 0.3365761 -1.2190496 0.2228254
In this case only 3 parameters are estimable, the rest are fixed to zero. Under model BTv2 for players i and j playing characters c and d respectively, we have
logit(p(i playing c beats j playing d))
= log_ability_i - log_ability_j + U_i - U_j
= Matchup_{c;d} - Matchup_{d;c} + U_i - U_j
where U_i and U_j are random player effects. So for players of the same baseline ability we have for example,
logit(p(CaptainFalcon beats Falco)) = -0.1327177 - 0 = -0.1327177
logit(p(Falco beats CaptainFalcon)) = 0 - (-0.1327177) = 0.1327177
So this tells you whether one character is favoured over another in a particular pairwise matchup.
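To put these log-odds on the probability scale, apply the inverse logit; plogis() in base R does this (nothing specific to BradleyTerry2):
plogis(-0.1327177)  # ~0.467: CaptainFalcon slightly disfavoured against Falco
plogis( 0.1327177)  # ~0.533: the complementary probability for Falco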
Let's return to the BTv1 model based on all the data. In this model, for players of the same baseline ability we have
logit(p(i playing c beats j playing d)) = log_ability_i - log_ability_j
= Characters_c - Characters_d
The effect for "CharactersBowser" is set to zero, the rest are estimable. So e.g.
summary(BTv1)$fixef[c("CharactersFalco", "CharactersPeach"),]
# Estimate Std. Error z value Pr(>|z|)
# CharactersFalco 2.038925 0.9576332 2.129130 0.03324354
# CharactersPeach 2.119304 0.9508804 2.228781 0.02582845
means that
logit(p(Bowser beats Peach)) = 0 - 2.119304 = -2.119304
logit(p(Falco beats Peach)) = 2.038925 - 2.119304 = -0.080379
So we can still compare characters in a particular matchup. We can use quasi-variances to compare the character effects
# add in character with fixed effect set to zero (Bowser)
V <- cbind(XCharactersBowser = 0, rbind(XCharactersBowser = 0,
vcov(BTv1)))
cf <- c(CharactersBowser = 0, coef(BTv1))
# compute quasi-variances (requires the qvcalc package)
library(qvcalc)
qv <- qvcalc(V, "XCharacters", estimates = cf,
             labels = sub("Characters", "", names(cf)))
# plot and compare
# (need to set ylim because some estimates are essentially infinite)
par(mar = c(7, 4, 3, 1))
plot(qv, ylim = c(-5, 5))
See e.g. https://doi.org/10.1093/biomet/91.1.65 for more on quasi-variances.

Creating a function with a variable number of variables in the body

I'm trying to create a function that has set arguments, and in the body of the function there is a set formula with a variable number of terms, determined only when the data are received. How do I write the function body so that it can adjust itself to an unknown number of variables?
Here's the backstory: I'm using the nls.lm function (from the minpack.lm package) to optimize a function for a set of parameters in the function. nls.lm requires a function that returns a vector of residuals. This function is pretty simple: observed minus predicted values. However, I also need to create a function to actually get the predicted values. This is where it gets tricky, since the predicted-value formula contains the parameters that need to be regressed and optimized.
This is my general formula I am trying to perform non-linear regression on:
Y = A + (B - A)/(1 + 10^(X - C - N))
Where A and B are global parameters shared across the entire data set and N is some constant. C can be anywhere from 1 to 8 parameters that need to be determined individually depending on the data set associated with each parameter.
Right now my working function contains the formula and 8 parameters to be estimated.
getPredictors <- function(params, xvalues) {
  (params$B) + ((params$A - params$B)/(1 + (10^(xvalues -
    params$`1`*Indicator[1,] - params$`2`*Indicator[2,] -
    params$`3`*Indicator[3,] - params$`4`*Indicator[4,] -
    params$`5`*Indicator[5,] - params$`6`*Indicator[6,] -
    params$`7`*Indicator[7,] - params$`8`*Indicator[8,] -
    constant))))
}
params is a list of parameters, each with an initial value. Indicator is a table where each row consists of 1's and 0's that act as an indicator variable to correctly pair each individual parameter with its associated data points. In its simplest form, if it had only one data point per parameter, it would look like a square identity matrix.
When I pair this function with nls.lm() I am successful in my regression:
residFun<- function(p, observed, xx) {
observed - getPredictors(p,xx)
}
nls.out<- nls.lm(parameterslist, fn = residFun, observed = Yavg, xx = Xavg)
> summary(nls.out)
Parameters:
Estimate Std. Error t value Pr(>|t|)
1 -6.1279 0.1857 -32.997 <2e-16 ***
2 -6.5514 0.1863 -35.174 <2e-16 ***
3 -6.2077 0.1860 -33.380 <2e-16 ***
4 -6.4275 0.1863 -34.495 <2e-16 ***
5 -6.4805 0.1863 -34.783 <2e-16 ***
6 -6.1777 0.1859 -33.235 <2e-16 ***
7 -6.3098 0.1862 -33.882 <2e-16 ***
8 -7.7044 0.1865 -41.303 <2e-16 ***
A 549.7203 11.5413 47.631 <2e-16 ***
B 5.9515 25.4343 0.234 0.816
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 82.5 on 86 degrees of freedom
Number of iterations to termination: 7
Reason for termination: Relative error in the sum of squares is at most `ftol'.
Now, the problem comes in when the data I receive do not contain 8 parameters. I can't just substitute 0 for these values, since the degrees of freedom change with fewer parameters. Therefore I need some way to create a getPredictors function on the fly, depending on the data I receive.
I've tried a couple of things. First, combining all the parameters into a vector of strings like so (it's still 8 parameters here, for comparison, but it could be anywhere from 1 to 7):
paramsandindics <- c()
for (i in 1:length(data$subjects)){
  paramsandindics[i] <- paste0("params$`", i, "`*", "Indicator[", i, ",]")
}
combined<-paste0(paramsandindics, collapse="-")
> combined
[1] "params$1*Indicator[1,]-params$2*Indicator[2,]-
params$3*Indicator[3,]-params$4*Indicator[4,]-
params$5*Indicator[5,]-params$6*Indicator[6,]-
params$7*Indicator[7,]-params$8*Indicator[8,]"
Which appears to get me what I need. So I try dropping it into a new equation
getPredictors2<- function(params, xvalues) {
(params$B) + ((params$A-params$B)/
(1+(10^(xvalues-parse(text=combined)-constant))))
}
But I get an error "non-numeric argument to binary operator". Makes sense: parse() returns an unevaluated expression, and you can't subtract that directly. So I switch to:
getPredictors2<- function(params, xvalues) {
(params$B) + ((params$A-params$B)/
(1+(10^(xvalues-eval(parse(text=combined))-constant))))
}
Which immediately evaluates the whole thing, producing only 1 parameter, which breaks my regression.
Ultimately I'd like a function that is written to accept a variable or dynamic number of variables to be filled into the body of the function. These variables need to be written as-is and not evaluated immediately, because the Levenberg-Marquardt algorithm employed by nls.lm (part of the minpack.lm package) requires an equation in addition to initial parameter guesses and residuals to minimize.
A simple example should suffice. I'm sorry if none of my stuff is reproducible; the data set is quite specific and too large to properly upload.
Sorry if this is long-winded. This is my first time trying any of this (coding, nonlinear regression, Stack Overflow) so I am a little lost. I'm not sure I am even asking the right question. Thank you for your time and consideration.
EDIT
I've included a smaller sample involving 2 parameters as an example. I hope it can help.
Subjects<-c("K1","K2")
#Xvalues
Xvals<-c(-11, -10, -9, -11, -10, -9)
#YValues, Observed
Yobs<-c(467,330,220,567,345,210)
#Indicator matrix for individual parameters
Indicators<-matrix(nrow = 2, ncol = 6)
Indicators[1,]<-c(1,1,1,0,0,0)
Indicators[2,]<-c(0,0,0,1,1,1)
#Setting up the parameters and functions needed for nls.lm
library(minpack.lm)
parameternames<-c("K1","K2","A","B")
#Starting values that nls.lm will iterate on
startingestimates<-c(-7,-7,0,500)
C<-.45
parameterlist<-as.list(setNames(startingestimates, parameternames))
getPredictors<- function(params, xx){
(params$A) + ((params$B-params$A)/
(1+(10^(xx-params$K1*Indicators[1,]-params$K2*Indicators[2,]-C))))}
residFunc<- function(p, observed, xx) {
observed - getPredictors(p,xx)
}
nls.output<- nls.lm(parameterlist, fn = residFunc, observed = Yobs, xx = Xvals)
#Latest attempt at creating a dynamic getPredictor function
combinationtext<-c()
combination<-c()
for (i in 1:length(Subjects)){
combinationtext[i]<-c(paste0("params$K",i,"*","Indicators[",i,",]"))
}
combination<-paste0(combinationtext, collapse="-")
getPredictorsDynamic<-function(params, xx){
(params$A) + ((params$B-params$A)/
(1+(10^(xx-(parse(text=combination))-C))))}
residFunc2<- function(p, observed, xx) {
observed - getPredictorsDynamic(p,xx)
}
nls.output2<-nls.lm(parameterlist, fn = residFunc2, observed = Yobs, xx = Xvals)
#Does not work
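One possible way around the string-building, offered as a sketch rather than a verified solution: since each row of Indicators just selects which subject an observation belongs to, the sum params$K1*Indicators[1,] + params$K2*Indicators[2,] + ... is a matrix product, so a single function can handle any number of subjects. (The names getPredictorsDynamic and residFunc2 are reused from the example above; the "^K" naming convention for subject parameters is an assumption.)
library(minpack.lm)
# Sketch only: replaces the string-building approach with a matrix product.
# Assumes params holds one "K" entry per row of Indicators, plus A and B.
getPredictorsDynamic <- function(params, xx) {
  k <- unlist(params[grep("^K", names(params))])  # subject parameters, in row order
  offset <- as.vector(k %*% Indicators)           # one C-type offset per observation
  params$A + (params$B - params$A) / (1 + 10^(xx - offset - C))
}
residFunc2 <- function(p, observed, xx) observed - getPredictorsDynamic(p, xx)
nls.output2 <- nls.lm(parameterlist, fn = residFunc2, observed = Yobs, xx = Xvals)
summary(nls.output2)
With this layout, dropping or adding a subject only means changing parameterlist and the rows of Indicators; the function body itself never changes.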

Is it possible to change the Type I error threshold using the t.test() function?

I am asked to compute a test statistic using the t.test() function, but I need to reduce the type I error. My prof showed us how to change a confidence level for this function, but not the acceptable type I error for null hypothesis testing. The goal is for the argument to automatically compute a p-value based on a .01 error rate rather than the normal .05.
The r code below involves a data set that I have downloaded.
t.test(mid$log_radius_area, mu=8.456)
I feel like I've answered this somewhere, but can't seem to find it on SO or CrossValidated.
Similarly to this question, the answer is that t.test() doesn't specify any threshold for rejecting/failing to reject the null hypothesis; it reports a p-value, and you get to decide whether to reject or not. (The conf.level argument is for adjusting which confidence interval the output reports.)
From ?t.test:
t.test(1:10, y = c(7:20))
Welch Two Sample t-test
data: 1:10 and c(7:20)
t = -5.4349, df = 21.982, p-value = 1.855e-05
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-11.052802 -4.947198
sample estimates:
mean of x mean of y
5.5 13.5
Here the p-value is reported as 1.855e-05, so the null hypothesis would be rejected for any (pre-specified) alpha level >1.855e-05. Note that the output doesn't say anywhere "the null hypothesis is rejected at alpha=0.05" or anything like that. You could write your own function to do that, using the $p.value element that is saved as part of the test results:
report_test <- function(tt, alpha=0.05) {
cat("the null hypothesis is ")
if (tt$p.value > alpha) {
cat("**NOT** ")
}
cat("rejected at alpha=",alpha,"\n")
}
tt <- t.test(1:10, y = c(7:20))
report_test(tt)
## the null hypothesis is rejected at alpha= 0.05
Most R package/function writers don't bother to do this, because they figure that it should be simple enough for users to do for themselves.
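So for this particular assignment the only change is the number you compare the p-value to (and, if you also want a 99% confidence interval reported, the conf.level argument). A sketch using the call from the question (mid$log_radius_area is the asker's data, not available here):
tt <- t.test(mid$log_radius_area, mu = 8.456, conf.level = 0.99)  # 99% CI in the printout
tt$p.value < 0.01   # TRUE => reject the null at the 1% (type I error) level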

Bootstrapping sample means in R using boot Package, Creating the Statistic Function for boot() Function

I have a data set with 15 density calculations, each from a different transect. I would like to resample these with replacement, taking 15 randomly selected samples of the 15 transects and then getting the mean of these resamples. Each transect should have its own probability of being sampled during this process. This should be done 5000 times. I have code which does this without using the boot function, but if I want to calculate the BCa 95% CI using the boot package, the bootstrapping has to be done through the boot function first.
I have been trying to create a function but I can't get any that seem to work. I want the bootstrap to select from a certain column (data$xs), and the probabilities to be used are in the column data$prob.
The function I thought might work is:
library(boot)
meanfun <- function (data, i){
d<-data [i,]
return (mean (d)) }
bo<-boot (data$xs, statistic=meanfun, R=5000)
#boot.ci (bo, conf=0.95, type="bca") #obviously `bo` was not made
But this told me 'incorrect number of dimensions'
I understand how to make a function in the normal sense, but it seems strange how the function works in boot. Since the function is only given to boot by name, with no specification of the arguments to pass into it, I seem limited to what boot itself will pass in as arguments (for example, I am unable to pass data$xs in as the argument for data, and I am unable to pass in data$prob as an argument for the probabilities, and so on). It seems to really limit what can be done. Perhaps I am missing something though?
Thanks for any and all help
The reason for this error is that data$xs returns a vector, which you then try to subset with data[i, ].
One way to solve this is to change it to data[i], or to use data[, "xs", drop = FALSE] instead. The drop = FALSE avoids type coercion, i.e. it keeps the object as a data.frame.
We try
data <- data.frame(xs = rnorm(15, 2))
library(boot)
meanfun <- function(data, i){
d <- data[i, ]
return(mean(d))
}
bo <- boot(data[, "xs", drop = FALSE], statistic=meanfun, R=5000)
boot.ci(bo, conf=0.95, type="bca")
and obtain:
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 5000 bootstrap replicates
CALL :
boot.ci(boot.out = bo, conf = 0.95, type = "bca")
Intervals :
Level BCa
95% ( 1.555, 2.534 )
Calculations and Intervals on Original Scale
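As an aside, the bootstrap replicates of the statistic are stored directly in the boot object (t0 and t are standard components of "boot" objects, not specific to this example), which is often more convenient than reconstructing them from the index array:
bo$t0       # the statistic on the original data (here, the sample mean)
head(bo$t)  # first few of the 5000 bootstrap means
mean(bo$t)  # average of the bootstrap means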
One can use boot.array to extract all or a subset of the resampled sets. In this case:
bo.ci<-boot.ci(boot.out = bo, conf = 0.95, type = "bca")
resampled.data<-boot.array(bo,1)
To extract the first and second sets of resampled data:
resample.1<-resampled.data[1,]
resample.2<-resampled.data[2,]
Then proceed to extract the individual statistic you'd want from any subset. For instance, if you assume normality you could run a Student's t-test on the first subset:
t.test(resample.1)
Which for this example and particular seed value(s) gives:
data: resample.1
t = 6.5216, df = 14, p-value = 1.353e-05
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
5.234781 10.365219
sample estimates:
mean of x
7.8

Displaying only the p-value of multiple t.tests

I have
replicate(1000, t.test(rnorm(10)))
What it does is draw a sample of size ten from a normal distribution, perform a t-test on it, and repeat this 1000 times.
But for my assignment I'm only interested in the p-value (the question is: how many times is the null hypothesis rejected).
How do I get only the p-values, or can I add something that already tells me how many times the null hypothesis is rejected (how many times the p-value is smaller than 0.05)?
t.test returns an object of class htest, which is a list containing a number of components including p.value (which is what you want).
You have a couple of options.
You can save the t.test results in a list and then extract the p.value component
# simplify = FALSE to avoid coercion to array
ttestlist <- replicate(1000, t.test(rnorm(10)), simplify = FALSE)
ttest.pval <- sapply(ttestlist, '[[', 'p.value')
Or you could simply save only that component of each t.test object:
pvals <- replicate(1000, t.test(rnorm(10))$p.value)
Here are the steps I'd use to solve your problem. Pay attention to how I broke it down into the smallest component parts and built it up step by step:
#Let's look at the structure of one t.test to see where the p-value is stored
str(t.test(rnorm(10)))
#It is named "p.value", so let's see if we can extract it
t.test(rnorm(10))[["p.value"]]
#Now let's test if its less than your 0.05 value
ifelse(t.test(rnorm(10))[["p.value"]]< 0.05,1,0)
#That worked. Now let's replace the code above in your replicate function:
replicate(1000, ifelse(t.test(rnorm(10))[["p.value"]]< 0.05,1,0))
#That worked too, now we can just take the sum of that:
#Make it reproducible this time
set.seed(42)
sum(replicate(1000, ifelse(t.test(rnorm(10))[["p.value"]]< 0.05,1,0)))
Should yield this:
[1] 54
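A slightly more compact variant (equivalent, since the comparison already yields a logical vector that sum() and mean() handle directly, so the ifelse() is not needed; with the same seed it should reproduce the same count):
set.seed(42)
pvals <- replicate(1000, t.test(rnorm(10))$p.value)
sum(pvals < 0.05)    # number of rejections at the 5% level
mean(pvals < 0.05)   # the corresponding proportion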
