F-testing formula in R

I am new to R and I am trying to test my linear model. The output from the lm() function is as follows:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1615.2716 83.2051 19.41 <2e-16 ***
rts$angle 11.8387 0.8895 13.31 <2e-16 ***
I wanted to test the null hypothesis, and fitting the null model gave me the following output:
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2503.17 70.04 35.74 <2e-16 ***
Now, I am using the F-testing formula:
(rss0 <- deviance(nullmod))
# [1] 158056425
(rss <- deviance(rtsld))
# [1] 79219962
(df0 <- df.residual(nullmod))
# [1] 179
(df <- df.residual(rtsld))
# [1] 178
(fstat <- ((rss0-rss)/(df0-df))/(rss/df))
# [1] 177.1383
1-pf(fstat, df0-df, df)
# [1] 0
I do not understand why I am getting 0 for the p-value of my F-statistic. Could someone please help me understand this output?

The pf() function defaults to lower.tail=TRUE, i.e. it returns the lower-tail probability P(F <= fstat). For an F-test we always want the upper tail (for a great explanation, see here). While it makes intuitive sense to compute the upper tail as 1-pf(), R needs a bit of prodding to make this work: when you have a large effect, the lower-tail probability is so close to 1 that it gets rounded to exactly 1 in double-precision floating point, and 1 minus it is therefore exactly 0.
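To see the floating-point issue concretely (a toy illustration with made-up numbers, not the question's models): a tail probability around 1e-28 is far below double precision's resolution near 1 (about 2.2e-16), so the lower-tail value rounds to exactly 1 and the subtraction returns 0:
(1 - 1e-28) == 1   # TRUE: the tiny upper tail is rounded away near 1
1 - (1 - 1e-28)    # 0, which is why 1 - pf(...) prints 0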
@Roland already proposed wrapping the 1-pf() call in format.pval(), which at least makes it clear that the p-value is below some small threshold rather than literally 0. However, I would argue for using the command:
pf(fstat, df0-df, df, lower.tail=FALSE)
This computes the upper-tailed test more accurately. If you wrap it in format.pval() with the default settings, you get the same display as format.pval(1 - pf()), because both values fall below the default eps threshold and are reported as "< eps". But once you lower eps to show more decimals, only the lower.tail=FALSE version can recover the actual tiny p-value; the 1 - pf(lower.tail=TRUE) version has already lost it:
> pf(fstat, df0-df, df, lower.tail=FALSE)
[1] 0.0000000000000000000000000001685664
> format.pval(pf(fstat, df0-df, df, lower.tail=FALSE), eps=0.0000000000000000000000000001)
[1] "0.00000000000000000000000000016857"
> format.pval(1-pf(fstat, df0-df, df, lower.tail=TRUE), eps=0.0000000000000000000000000001)
[1] "< 0.0000000000000000000000000001"
Note that even now the format.pval() wrap on the upper-tailed value is rounded slightly (1.685664e-28 is displayed as ...16857). Of course, when your p-value is this small (and the whole issue only arises when it is very small), there is hardly any practical difference between the two methods. But why settle for the less accurate one?
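As a cross-check (a sketch, assuming nullmod and rtsld are the fitted null and full models from the question), anova() performs the same nested-model comparison and reports the upper-tail p-value directly:
anova(nullmod, rtsld)
## should report F = 177.14 on 1 and 178 residual df, matching the hand calculation above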

Related

The linear model in R entirely ignores one value in the interaction estimation

I am running a linear model on the impact of a particular team composition on the performance of teams. I have included the region as a separate variable since I plan to compare the coefficients afterwards. I run the model with interactions between the strategy and every single region. This is what the code for the model looks like:
lm(Kills~Solocarry + Region:Solocarry, unidt)
This is the result I get:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 26.8390 0.6221 43.139 < 2e-16 ***
SolocarryTRUE -3.0468 1.4180 -2.149 0.03178 *
SolocarryFALSE:RegionEurope 0.9498 0.8943 1.062 0.28833
SolocarryTRUE:RegionEurope -3.3712 1.7146 -1.966 0.04942 *
SolocarryFALSE:RegionKorea -3.2339 0.8861 -3.649 0.00027 ***
SolocarryTRUE:RegionKorea -7.6628 1.7591 -4.356 1.39e-05 ***
SolocarryFALSE:RegionNAmerica -1.0089 0.8876 -1.137 0.25581
SolocarryTRUE:RegionNAmerica -7.1216 1.7591 -4.048 5.36e-05 ***
SolocarryFALSE:RegionOceania 0.6659 0.8927 0.746 0.45581
SolocarryTRUE:RegionOceania -1.5712 1.7146 -0.916 0.35959
And here you can see that one region (Brazil) is missing. At first I thought that Brazil was included in the intercept, but all the other regions have estimates for both TRUE and FALSE levels of Solocarry, so shouldn't I have at least something like SolocarryTRUE:RegionBrazil or SolocarryFALSE:RegionBrazil? When I check it with the unique() function, I receive this:
> unique(unidt$Region)
[1] "Europe" "NAmerica" "Korea" "Brazil" "Oceania"
There are 400 observations for each of the regions (except North America, which has 399). When I open the data frame, I can see that Brazil is still there. I have already tried turning it off and on again. Has anybody seen something like this? If so, or if you have any idea how to solve it, I would appreciate your help.
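A minimal diagnostic sketch (assuming unidt and the model call from the question): with R's default treatment contrasts, the first level of Region (alphabetically "Brazil") is used as the reference, so its Solocarry combinations are folded into the intercept and the SolocarryTRUE coefficient rather than getting interaction rows of their own. Inspecting the factor levels and the model matrix makes this visible:
fit <- lm(Kills ~ Solocarry + Region:Solocarry, data = unidt)
levels(factor(unidt$Region))   # "Brazil" sorts first, so it is the reference level
colnames(model.matrix(fit))    # no RegionBrazil columns: they form the baseline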

Creating a function with a variable number of variables in the body

I'm trying to create a function that has set arguments and, in its body, a set formula with a variable number of terms, determined only when the data is received. How do I write the function body so that it can adjust itself to an unknown number of variables?
Here's the backstory: I'm using the nls.lm() function (from the minpack.lm package) to optimize a set of parameters in a function. nls.lm requires a function that returns a vector of residuals. This function is pretty simple: observed minus predicted values. However, I also need to create a function to actually get the predicted values. This is where it gets tricky, since the predicted-value formula contains the parameters that need to be regressed and optimized.
This is my general formula I am trying to perform non-linear regression on:
Y = A + (B - A) / (1 + 10^(X - C - N))
Where A and B are global parameters shared across the entire data set and N is some constant. C can be anywhere from 1 to 8 parameters that need to be determined individually depending on the data set associated with each parameter.
Right now my working function contains the formula and 8 parameters to be estimated.
getPredictors <- function(params, xvalues) {
  ## the list elements are named "1" to "8", so they need backticks after $
  (params$B) + ((params$A - params$B) / (1 + (10^(xvalues -
    params$`1` * Indicator[1, ] - params$`2` * Indicator[2, ] -
    params$`3` * Indicator[3, ] - params$`4` * Indicator[4, ] -
    params$`5` * Indicator[5, ] - params$`6` * Indicator[6, ] -
    params$`7` * Indicator[7, ] - params$`8` * Indicator[8, ] -
    constant))))
}
params is a list of parameters, each with an initial value. Indicator is a table where each row consists of 1s and 0s that act as an indicator variable to correctly pair each individual parameter with its associated data points. In its simplest form, if it had only one data point per parameter, it would look like a square identity matrix.
When I pair this function with nls.lm() I am successful in my regression:
residFun<- function(p, observed, xx) {
observed - getPredictors(p,xx)
}
nls.out<- nls.lm(parameterslist, fn = residFun, observed = Yavg, xx = Xavg)
> summary(nls.out)
Parameters:
Estimate Std. Error t value Pr(>|t|)
1 -6.1279 0.1857 -32.997 <2e-16 ***
2 -6.5514 0.1863 -35.174 <2e-16 ***
3 -6.2077 0.1860 -33.380 <2e-16 ***
4 -6.4275 0.1863 -34.495 <2e-16 ***
5 -6.4805 0.1863 -34.783 <2e-16 ***
6 -6.1777 0.1859 -33.235 <2e-16 ***
7 -6.3098 0.1862 -33.882 <2e-16 ***
8 -7.7044 0.1865 -41.303 <2e-16 ***
A 549.7203 11.5413 47.631 <2e-16 ***
B 5.9515 25.4343 0.234 0.816
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 82.5 on 86 degrees of freedom
Number of iterations to termination: 7
Reason for termination: Relative error in the sum of squares is at most `ftol'.
Now, the problem comes in when the data I receive does not contain 8 parameters. I can't just substitute 0 for the missing ones, since I am certain the degrees of freedom change with fewer parameters. Therefore I will need some way to create a getPredictors function on the fly, depending on the data I receive.
I've tried a couple of things. I've tried combining all the parameters into a list of strings like so (it's still 8 parameters here, for comparison, but it can be anywhere from 1 to 7 parameters):
paramsandindics <- c()
for (i in 1:length(data$subjects)) {
  paramsandindics[i] <- paste0("params$`", i, "`*Indicator[", i, ",]")
}
combined <- paste0(paramsandindics, collapse = "-")
> combined
[1] "params$`1`*Indicator[1,]-params$`2`*Indicator[2,]-
params$`3`*Indicator[3,]-params$`4`*Indicator[4,]-
params$`5`*Indicator[5,]-params$`6`*Indicator[6,]-
params$`7`*Indicator[7,]-params$`8`*Indicator[8,]"
Which appears to get me what I need. So I try dropping it into a new function:
getPredictors2<- function(params, xvalues) {
(params$B) + ((params$A-params$B)/
(1+(10^(xvalues-parse(text=combined)-constant))))
}
But I get the error "non-numeric argument to binary operator". Makes sense: it's trying to subtract an unevaluated expression object, which won't work. So I switch to:
getPredictors2<- function(params, xvalues) {
(params$B) + ((params$A-params$B)/
(1+(10^(xvalues-eval(parse(text=combined))-constant))))
}
Which immediately evaluates the whole thing, producing only 1 parameter, which breaks my regression.
Ultimately I'd like a function written so that it can accept a variable, dynamic number of terms to be filled in in its body. These terms need to be written in as-is and not evaluated immediately, because the Levenberg-Marquardt algorithm employed by nls.lm (part of the minpack.lm package) requires an equation, in addition to initial parameter guesses and residuals, to minimize.
A simple example should suffice. I'm sorry if none of my stuff is reproducible; the data set is quite specific and too large to properly upload.
Sorry if this is long-winded. This is my first time trying any of this (coding, nonlinear regression, Stack Overflow), so I am a little lost. I'm not sure I am even asking the right question. Thank you for your time and consideration.
EDIT
I've included a smaller sample involving 2 parameters as an example. I hope it can help.
library(minpack.lm)  # provides nls.lm
Subjects<-c("K1","K2")
#Xvalues
Xvals<-c(-11, -10, -9, -11, -10, -9)
#YValues, Observed
Yobs<-c(467,330,220,567,345,210)
#Indicator matrix for individual parameters
Indicators<-matrix(nrow = 2, ncol = 6)
Indicators[1,]<-c(1,1,1,0,0,0)
Indicators[2,]<-c(0,0,0,1,1,1)
#Setting up the parameters and functions needed for nls.lm
parameternames<-c("K1","K2","A","B")
#Starting values that nls.lm will iterate on
startingestimates<-c(-7,-7,0,500)
C<-.45
parameterlist<-as.list(setNames(startingestimates, parameternames))
getPredictors<- function(params, xx){
(params$A) + ((params$B-params$A)/
(1+(10^(xx-params$K1*Indicators[1,]-params$K2*Indicators[2,]-C))))}
residFunc<- function(p, observed, xx) {
observed - getPredictors(p,xx)
}
nls.output<- nls.lm(parameterlist, fn = residFunc, observed = Yobs, xx = Xvals)
#Latest attempt at creating a dynamic getPredictor function
combinationtext<-c()
combination<-c()
for (i in 1:length(Subjects)){
combinationtext[i]<-c(paste0("params$K",i,"*","Indicators[",i,",]"))
}
combination<-paste0(combinationtext, collapse="-")
getPredictorsDynamic<-function(params, xx){
(params$A) + ((params$B-params$A)/
(1+(10^(xx-(parse(text=combination))-C))))}
residFunc2<- function(p, observed, xx) {
observed - getPredictorsDynamic(p,xx)
}
nls.output2<-nls.lm(parameterlist, fn = residFunc2, observed = Yobs, xx = Xvals)
#Does not work
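One way to sidestep the string building entirely (a sketch under the objects defined in the edit above: Indicators, C, parameterlist, Yobs, Xvals; the helper names are made up): the per-subject terms are just a weighted sum of the rows of Indicators, so a matrix product handles any number of subjects without constructing code as text:
getPredictorsMat <- function(params, xx) {
  ## per-subject parameters: everything in params except the shared A and B;
  ## they must be listed in the same order as the rows of Indicators
  k <- unlist(params[setdiff(names(params), c("A", "B"))])
  ## k %*% Indicators computes sum_i k_i * Indicators[i, ] for any number of subjects
  shift <- as.vector(k %*% Indicators)
  params$A + (params$B - params$A) / (1 + 10^(xx - shift - C))
}
residFuncMat <- function(p, observed, xx) observed - getPredictorsMat(p, xx)
nls.outputMat <- nls.lm(parameterlist, fn = residFuncMat, observed = Yobs, xx = Xvals)
With this version, adding or removing subjects only means changing parameterlist and Indicators; the predictor function itself never has to be rebuilt.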

Is it possible to change Type I error threshold using t.test() function?

I am asked to compute a test statistic using the t.test() function, but I need to reduce the Type I error. My prof showed us how to change the confidence level for this function, but not the acceptable Type I error for null hypothesis testing. The goal is for the function to automatically compute a p-value based on a .01 error rate rather than the usual .05.
The r code below involves a data set that I have downloaded.
t.test(mid$log_radius_area, mu=8.456)
I feel like I've answered this somewhere, but can't seem to find it on SO or CrossValidated.
Similarly to this question, the answer is that t.test() doesn't specify any threshold for rejecting/failing to reject the null hypothesis; it reports a p-value, and you get to decide whether to reject or not. (The conf.level argument is for adjusting which confidence interval the output reports.)
From ?t.test:
t.test(1:10, y = c(7:20))
Welch Two Sample t-test
data: 1:10 and c(7:20)
t = -5.4349, df = 21.982, p-value = 1.855e-05
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-11.052802 -4.947198
sample estimates:
mean of x mean of y
5.5 13.5
Here the p-value is reported as 1.855e-05, so the null hypothesis would be rejected for any (pre-specified) alpha level >1.855e-05. Note that the output doesn't say anywhere "the null hypothesis is rejected at alpha=0.05" or anything like that. You could write your own function to do that, using the $p.value element that is saved as part of the test results:
report_test <- function(tt, alpha=0.05) {
cat("the null hypothesis is ")
if (tt$p.value > alpha) {
cat("**NOT** ")
}
cat("rejected at alpha=",alpha,"\n")
}
tt <- t.test(1:10, y = c(7:20))
report_test(tt)
## the null hypothesis is rejected at alpha= 0.05
Most R package/function writers don't bother to do this, because they figure that it should be simple enough for users to do for themselves.
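Applied to the call in the question (a sketch; mid$log_radius_area is the questioner's data), you would pass conf.level = 0.99 so the reported interval matches the stricter level, and compare the p-value against 0.01 yourself:
tt <- t.test(mid$log_radius_area, mu = 8.456, conf.level = 0.99)
tt$p.value < 0.01               # TRUE means reject the null hypothesis at the 1% level
report_test(tt, alpha = 0.01)   # or reuse the helper defined above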

How to find the minimum floating-point value accepted by betareg package?

I'm doing a beta regression in R, which requires values between 0 and 1, endpoints excluded, i.e. (0,1) instead of [0,1].
I have some 0 and 1 values in my dataset, so I'd like to convert them to the smallest possible neighbor, such as 0.0000...0001 and 0.9999...9999. I've used .Machine$double.xmin (which gives me 2.225074e-308), but betareg() still gives an error:
invalid dependent variable, all observations must be in (0, 1)
If I use 0.000001 and 0.999999, I get a different set of errors:
1: In betareg.fit(X, Y, Z, weights, offset, link, link.phi, type, control) :
failed to invert the information matrix: iteration stopped prematurely
2: In sqrt(wpp) :
Error in chol.default(K) :
the leading minor of order 4 is not positive definite
Only if I use 0.0001 and 0.9999 can I run without errors. Is there any way I can improve on these minimum values with betareg? Or should I just be happy with that?
Try it with eps (the displacement from 0 and 1) first equal to 1e-4 (as you have here) and then 1e-3. If the results of the models don't differ in any way you care about, that's great. If they do, you need to be very careful, because it suggests your answers will be very sensitive to assumptions.
In the example below the dispersion parameter phi changes a lot, but the intercept and slope parameters don't change very much.
If you do find that the parameters change by a worrying amount for your particular data, then you need to think harder about the process by which the zeros and ones arise, and model that process appropriately, e.g.
a censored-data model: zeros/ones arise through a minimum/maximum detection threshold, and the zero/one values are modelled as actually lying somewhere in the tails; or
a hurdle/zero-one-inflated model: zeros and ones arise through a separate process from the rest of the data, so use a binomial or multinomial model to characterize zero vs. (0,1) vs. one, then use a Beta regression on the (0,1) component (a rough sketch of this second option appears at the end of this answer).
Questions about these steps are probably more appropriate for CrossValidated than for SO.
sample data
set.seed(101)
library(betareg)
dd <- data.frame(x=rnorm(500))
rbeta2 <- function(n, prob=0.5, d=1) {
rbeta(n, shape1=prob*d, shape2=(1-prob)*d)
}
dd$y <- rbeta2(500,plogis(1+5*dd$x),d=1)
dd$y[dd$y<1e-8] <- 0
trial fitting function
ss <- function(eps) {
dd <- transform(dd,
y=pmin(1-eps,pmax(eps,y)))
m <- try(betareg(y~x,data=dd))
if (inherits(m,"try-error")) return(rep(NA,3))
return(coef(m))
}
ss(0) ## fails
ss(1e-8) ## fails
ss(1e-4)
## (Intercept) x (phi)
## 0.3140810 1.5724049 0.7604656
ss(1e-3) ## also fails
ss(1e-2)
## (Intercept) x (phi)
## 0.2847142 1.4383922 1.3970437
ss(5e-3)
## (Intercept) x (phi)
## 0.2870852 1.4546247 1.2029984
try it for a range of values
evec <- seq(-4,-1,length=51)
res <- t(sapply(evec, function(e) ss(10^e)) )
library(ggplot2)
ggplot(data.frame(e=10^evec,reshape2::melt(res)),
aes(e,value,colour=Var2))+
geom_line()+scale_x_log10()
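As a rough illustration of the zero-one-inflation idea mentioned above (a sketch only, reusing the simulated dd, which happens to contain zeros but no ones): model the probability of a boundary value with a binomial GLM and fit the Beta regression to the interior observations only.
ddb <- transform(dd, boundary = as.numeric(y <= 0 | y >= 1))
m_boundary <- glm(boundary ~ x, family = binomial, data = ddb)   # boundary vs. interior
m_interior <- betareg(y ~ x, data = subset(ddb, boundary == 0))  # Beta part on (0,1) only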

Error in linear model when values are 0

I have a data set that has names, value 1, and value 2. I need to run a regression and obtain the t-statistic for each of the names. I got help on Stack Overflow in constructing the linear model. I noticed that sometimes I get data that is all 0s. That's OK, and I want the model to keep running and not bomb. However, when the 0s are in there, the linear model bombs.
v1<-rnorm(1:50)
v2<-rnorm(1:50)
data<-data.frame(v1,v2)
data[1:50,"nm"]<-"A"
data[50:100,"nm"]<-"B"
data[50:100,"v1"]<-0
data[50:100,"v2"]<-0
data<-data[c("nm","v1","v2")]
library(plyr)
## run regression and generate universe
plyrFunc <- function(x){
mod <- lm(v1~v2, data = x)
return(summary(mod)$coefficients[2,3])
}
lm <- ddply(data, .(nm), plyrFunc)
As you can see, for name B, since everything is 0, the model bombs. I cannot just remove all 0s, because oftentimes the values are indeed 0.
I don't know how to edit the above code so that it keeps going.
Can anyone let me know? Thank you!
The model actually works fine; it is the subsetting of summary(mod)$coefficients that throws the error, because the coefficient table contains only one row in the all-zeroes case:
> summary(lm(v1~v2,data[data$nm=="A",]))$coefficients
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.1462766 0.1591779 -0.9189503 0.3628138
v2 -0.1315238 0.1465024 -0.8977590 0.3738900
> summary(lm(v1~v2,data[data$nm=="B",]))$coefficients
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0 0 NaN NaN
Thus, you need to modify your function to take this case into account:
plyrFunc <- function(x){
mod <- lm(v1~v2, data = x)
res <- summary(mod)$coefficients
if (nrow(res)>1) res[2,3] else NA
}
library(plyr)
result <- ddply(data, .(nm), plyrFunc)
Output for your sample data set:
nm V1
1 A -0.1825896
2 B NA
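A slightly more defensive variant (a sketch, same plyr workflow, hypothetical names): wrap the extraction in tryCatch() so that any group whose coefficient table is degenerate returns NA instead of stopping ddply():
plyrFuncSafe <- function(x){
  tryCatch(summary(lm(v1 ~ v2, data = x))$coefficients[2, 3],
           error = function(e) NA_real_)
}
resultSafe <- ddply(data, .(nm), plyrFuncSafe)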
