Extract p-value from gam.check in R - r

When I run gam.check(my_spline_gam), I get the following output.
Method: GCV Optimizer: magic
Smoothing parameter selection converged after 9 iterations.
The RMS GCV score gradiant at convergence was 4.785628e-06 .
The Hessian was positive definite.
The estimated model rank was 25 (maximum possible: 25)
Model rank = 25 / 25
Basis dimension (k) checking results. Low p-value (k-index<1) may
indicate that k is too low, especially if edf is close to k'.
k' edf k-index p-value
s(x) 24.000 22.098 0.849 0.06
My question is whether I can extract this p-value separately to a table.

Looks like you cannot store the result in an object the normal way. You could use capture.output to store the console output in an object, and then subsequently use str_split to get the correct value. So for the example in the help file this would be:
library(mgcv)
set.seed(0)
dat <- gamSim(1,n=200)
b <- gam(y~s(x0)+s(x1)+s(x2)+s(x3),data=dat)
r <- capture.output(gam.check(b))
p <- strsplit(r[12], " ")[[1]][11]
But because the p-value is just a string you wouldn't get the exact p-value this way.
Edit: user20650's answer will give you the proper output:
r <- k.check(b)
r[,'p-value']

Use capture.output coupled with a little string manipulation -
gam_obj <- capture.output(gam.check(b,pch=19,cex=.3))
gam_tbl <- gam_obj[12:length(gam_obj)]
str_spl = function(x){
p_value <- strsplit(x, " ")[[1]]
output_p <- as.numeric(p_value[length(p_value)])
}
p_values <- data.frame(sapply(gam_tbl, str_spl))
Output

Related

Creating a function in r to run a chi squared test

The goal is to create a chi squared test function with arguments dat and res.type="pearson" that returns an R list containing the test statistic, p-value, expected counts, residual type, and the residuals of that type, where the expected counts and the residuals are stored in two r × c matrices.
I think I have figured out how to get the expected counts and test statistic but can't figure out the res.type argument. The pearson residual equation is (o_ij-e_ij)/sqrt(e_ij) and the standard residual equation is (o_ij-e_ij)/sqrt(e_ij(1-n_i./n_..)(1-n_.j/n..)
Here is what I have so far
chisquared <- function(dat, res.type) {
expdata <- matrix(c((dat[4,1]*dat[1,4])/dat[4,4],
(dat[4,2]*dat[1,4])/dat[4,4], (dat[4,3]*dat[1,4])/dat[4,4],
(dat[4,1]*dat[2,4])/dat[4,4], (dat[4,2]*dat[2,4])/dat[4,4],
(dat[4,3]*dat[2,4])/dat[4,4], (dat[4,1]*dat[3,4])/dat[4,4],
(dat[4,2]*dat[3,4])/dat[4,4], (dat[4,3]*dat[3,4])/dat[4,4]),
nrow=3, ncol=3, byrow="T")
sqdist <- matrix(c((dat[1,1]-expdata[1,1])^2/expdata[1,1],
(dat[1,2]-expdata[1,2])^2/expdata[1,2], (dat[1,3]-
expdata[1,3])^2/expdata[1,3], (dat[2,1]-
expdata[2,1])^2/expdata[2,1], (dat[2,2]-
expdata[2,2])^2/expdata[2,2], (dat[2,3]-
expdata[2,3])^2/expdata[2,3], (dat[3,1]-
expdata[3,1])^2/expdata[3,1], (dat[3,2]-
expdata[3,2])^2/expdata[3,2], (dat[3,3]-
expdata[3,3])^2/expdata[3,3]), nrow=3, ncol=3, byrow="T")
ts <- sum(sqdist[1,1], sqdist[1,2], sqdist[1,3], sqdist[2,1],
sqdist[2,2], sqdist[2,3], sqdist[3,1], sqdist[3,2], sqdist[3,3])
}
I'm sure there is an easier way to do this, other than the chisq.test function as I am not allowed to use it, so if anyone could provide some advice for that as well it would be appreciated

Row-wise correlation between two matrices

I have the following matrices :
And:
I want to calculate their row-wise pearson correlations and I've tried these pieces of code:
RowCor<- sapply(1:21, function(i) cor(EurodistCL.scl[i,], EurodistM.scl[i,], method = "pearson"))
And:
cA <- EurodistCL.scl - rowMeans(EurodistCL.scl)
cB <- EurodistM.scl- rowMeans(EurodistM.scl)
sA <- sqrt(rowMeans(cA^2))
sB <- sqrt(rowMeans(cB^2))
rowMeans(cA * cB) / (sA * sB)
Both give the same output, a correlation vector of 21 ones.
Although the matrices are clearly highly correlated, they are not perfectly correlated so I would expect some correlation coefficient to be 0.99 or 0.98
Why am I getting only ones? Is something wrong in the code or in the theory?
It is because you have only two values in a row. Even random values would give (+ or -) 1. Try this
a <- runif(2)
b <- runif(2)
cor(a, b)
So, it is the theory that is incorrect. Although one can get a coefficient of correlation with two samples, it is of little use.
To estimate correlation coefficient, you need more than two corresponding samples.

Auto.VR and Lo.Mac, decision rule?

I am trying to decide if a certain series follows a random walk by using Automatic Variance Ratio test (Auto.VR) and Lo-MacKinlay Variance Ratio Tests (Lo.Mac).
If we consider below mentioned examples, what would the decision be?
(Null Hypothesis: r is serially uncorrelated.)
data(exrates)
y <- exrates$ca
nob <- length(y)
r <- log(y[2:nob])-log(y[1:(nob-1)])
Auto.VR(r)
$stat [1] 2.202953
$sum [1] 1.153463
data(exrates)
y <- exrates$ca
nob <- length(y)
r <- log(y[2:nob])-log(y[1:(nob-1)])
kvec <- c(2,5,10)
Lo.Mac(r,kvec)
$Stats
M1 M2
k=2 2.5701993 2.0412787
k=5 1.5517705 1.2834571
k=10 0.3759064 0.3195872
I was working to figure out my question. Then, I import the same data (above mentioned) to the python. I did this because I've found an explanation about the decision rule for at this link "http://arch.readthedocs.io/en/latest/unitroot/tests.html#variance-ratios" which says "Rejection of the null with a positive test statistic indicates the presence of positive serial correlation in the time series."
here the r_NaN is the same as above mentioned series of "r" .
from arch.unitroot import VarianceRatio
vr_r = VarianceRatio(r_NaN, 2)
print(vr_r.summary().as_text())
output:
Variance-Ratio Test Results
=====================================
Test Statistic -20.527
P-value 0.000
Lags 2
-------------------------------------
And, the decision is: because the Test Statistic is negative we are not able to reject the null hyphotesis. Hence, we conclude that the series of r follows a random walk.

Function to calculate R2 (R-squared) in R

I have a dataframe with observed and modelled data, and I would like to calculate the R2 value. I expected there to be a function I could call for this, but can't locate one. I know I can write my own and apply it, but am I missing something obvious? I want something like
obs <- 1:5
mod <- c(0.8,2.4,2,3,4.8)
df <- data.frame(obs, mod)
R2 <- rsq(df)
# 0.85
You need a little statistical knowledge to see this. R squared between two vectors is just the square of their correlation. So you can define you function as:
rsq <- function (x, y) cor(x, y) ^ 2
Sandipan's answer will return you exactly the same result (see the following proof), but as it stands it appears more readable (due to the evident $r.squared).
Let's do the statistics
Basically we fit a linear regression of y over x, and compute the ratio of regression sum of squares to total sum of squares.
lemma 1: a regression y ~ x is equivalent to y - mean(y) ~ x - mean(x)
lemma 2: beta = cov(x, y) / var(x)
lemma 3: R.square = cor(x, y) ^ 2
Warning
R squared between two arbitrary vectors x and y (of the same length) is just a goodness measure of their linear relationship. Think twice!! R squared between x + a and y + b are identical for any constant shift a and b. So it is a weak or even useless measure on "goodness of prediction". Use MSE or RMSE instead:
How to obtain RMSE out of lm result?
R - Calculate Test MSE given a trained model from a training set and a test set
I agree with 42-'s comment:
The R squared is reported by summary functions associated with regression functions. But only when such an estimate is statistically justified.
R squared can be a (but not the best) measure of "goodness of fit". But there is no justification that it can measure the goodness of out-of-sample prediction. If you split your data into training and testing parts and fit a regression model on the training one, you can get a valid R squared value on training part, but you can't legitimately compute an R squared on the test part. Some people did this, but I don't agree with it.
Here is very extreme example:
preds <- 1:4/4
actual <- 1:4
The R squared between those two vectors is 1. Yes of course, one is just a linear rescaling of the other so they have a perfect linear relationship. But, do you really think that the preds is a good prediction on actual??
In reply to wordsforthewise
Thanks for your comments 1, 2 and your answer of details.
You probably misunderstood the procedure. Given two vectors x and y, we first fit a regression line y ~ x then compute regression sum of squares and total sum of squares. It looks like you skip this regression step and go straight to the sum of square computation. That is false, since the partition of sum of squares does not hold and you can't compute R squared in a consistent way.
As you demonstrated, this is just one way for computing R squared:
preds <- c(1, 2, 3)
actual <- c(2, 2, 4)
rss <- sum((preds - actual) ^ 2) ## residual sum of squares
tss <- sum((actual - mean(actual)) ^ 2) ## total sum of squares
rsq <- 1 - rss/tss
#[1] 0.25
But there is another:
regss <- sum((preds - mean(preds)) ^ 2) ## regression sum of squares
regss / tss
#[1] 0.75
Also, your formula can give a negative value (the proper value should be 1 as mentioned above in the Warning section).
preds <- 1:4 / 4
actual <- 1:4
rss <- sum((preds - actual) ^ 2) ## residual sum of squares
tss <- sum((actual - mean(actual)) ^ 2) ## total sum of squares
rsq <- 1 - rss/tss
#[1] -2.375
Final remark
I had never expected that this answer could eventually be so long when I posted my initial answer 2 years ago. However, given the high views of this thread, I feel obliged to add more statistical details and discussions. I don't want to mislead people that just because they can compute an R squared so easily, they can use R squared everywhere.
Why not this:
rsq <- function(x, y) summary(lm(y~x))$r.squared
rsq(obs, mod)
#[1] 0.8560185
It is not something obvious, but the caret package has a function postResample() that will calculate "A vector of performance estimates" according to the documentation. The "performance estimates" are
RMSE
Rsquared
mean absolute error (MAE)
and have to be accessed from the vector like this
library(caret)
vect1 <- c(1, 2, 3)
vect2 <- c(3, 2, 2)
res <- caret::postResample(vect1, vect2)
rsq <- res[2]
However, this is using the correlation squared approximation for r-squared as mentioned in another answer. I'm not sure why Max Kuhn didn't just use the conventional 1-SSE/SST.
caret also has an R2() method, although it's hard to find in the documentation.
The way to implement the normal coefficient of determination equation is:
preds <- c(1, 2, 3)
actual <- c(2, 2, 4)
rss <- sum((preds - actual) ^ 2)
tss <- sum((actual - mean(actual)) ^ 2)
rsq <- 1 - rss/tss
Not too bad to code by hand of course, but why isn't there a function for it in a language primarily made for statistics? I'm thinking I must be missing the implementation of R^2 somewhere, or no one cares enough about it to implement it. Most of the implementations, like this one, seem to be for generalized linear models.
You can also use the summary for linear models:
summary(lm(obs ~ mod, data=df))$r.squared
Here is the simplest solution based on [https://en.wikipedia.org/wiki/Coefficient_of_determination]
# 1. 'Actual' and 'Predicted' data
df <- data.frame(
y_actual = c(1:5),
y_predicted = c(0.8, 2.4, 2, 3, 4.8))
# 2. R2 Score components
# 2.1. Average of actual data
avr_y_actual <- mean(df$y_actual)
# 2.2. Total sum of squares
ss_total <- sum((df$y_actual - avr_y_actual)^2)
# 2.3. Regression sum of squares
ss_regression <- sum((df$y_predicted - avr_y_actual)^2)
# 2.4. Residual sum of squares
ss_residuals <- sum((df$y_actual - df$y_predicted)^2)
# 3. R2 Score
r2 <- 1 - ss_residuals / ss_total
Not sure why this isn't implemented directly in R, but this answer is essentially the same as Andrii's and Wordsforthewise, I just turned into a function for the sake of convenience if somebody uses it a lot like me.
r2_general <-function(preds,actual){
return(1- sum((preds - actual) ^ 2)/sum((actual - mean(actual))^2))
}
I am use the function MLmetrics::R2_Score from the packages MLmetrics, to compute R2 it uses the vanilla 1-(RSS/TSS) formula.

Errors running Maximum Likelihood Estimation on a three parameter Weibull cdf

I am working with the cumulative emergence of flies over time (taken at irregular intervals) over many summers (though first I am just trying to make one year work). The cumulative emergence follows a sigmoid pattern and I want to create a maximum likelihood estimation of a 3-parameter Weibull cumulative distribution function. The three-parameter models I've been trying to use in the fitdistrplus package keep giving me an error. I think this must have something to do with how my data is structured, but I cannot figure it out. Obviously I want it to read each point as an x (degree days) and a y (emergence) value, but it seems to be unable to read two columns. The main error I'm getting says "Non-numeric argument to mathematical function" or (with slightly different code) "data must be a numeric vector of length greater than 1". Below is my code including added columns in the df_dd_em dataframe for cumulative emergence and percent emergence in case that is useful.
degree_days <- c(998.08,1039.66,1111.29,1165.89,1236.53,1293.71,
1347.66,1387.76,1445.47,1493.44,1553.23,1601.97,
1670.28,1737.29,1791.94,1849.20,1920.91,1967.25,
2036.64,2091.85,2152.89,2199.13,2199.13,2263.09,
2297.94,2352.39,2384.03,2442.44,2541.28,2663.90,
2707.36,2773.82,2816.39,2863.94)
emergence <- c(0,0,0,1,1,0,2,3,17,10,0,0,0,2,0,3,0,0,1,5,0,0,0,0,
0,0,0,0,1,0,0,0,0,0)
cum_em <- cumsum(emergence)
df_dd_em <- data.frame (degree_days, emergence, cum_em)
df_dd_em$percent <- ave(df_dd_em$emergence, FUN = function(df_dd_em) 100*(df_dd_em)/46)
df_dd_em$cum_per <- ave(df_dd_em$cum_em, FUN = function(df_dd_em) 100*(df_dd_em)/46)
x <- pweibull(df_dd_em[c(1,3)],shape=5)
dframe2.mle <- fitdist(x, "weibull",method='mle')
Here's my best guess at what you're after:
Set up data:
dd <- data.frame(degree_days=c(998.08,1039.66,1111.29,1165.89,1236.53,1293.71,
1347.66,1387.76,1445.47,1493.44,1553.23,1601.97,
1670.28,1737.29,1791.94,1849.20,1920.91,1967.25,
2036.64,2091.85,2152.89,2199.13,2199.13,2263.09,
2297.94,2352.39,2384.03,2442.44,2541.28,2663.90,
2707.36,2773.82,2816.39,2863.94),
emergence=c(0,0,0,1,1,0,2,3,17,10,0,0,0,2,0,3,0,0,1,5,0,0,0,0,
0,0,0,0,1,0,0,0,0,0))
dd <- transform(dd,cum_em=cumsum(emergence))
We're actually going to fit to an "interval-censored" distribution (i.e. probability of emergence between successive degree day observations: this version assumes that the first observation refers to observations before the first degree-day observation, you could change it to refer to observations after the last observation).
library(bbmle)
## y*log(p) allowing for 0/0 occurrences:
y_log_p <- function(y,p) ifelse(y==0 & p==0,0,y*log(p))
NLLfun <- function(scale,shape,x=dd$degree_days,y=dd$emergence) {
prob <- pmax(diff(pweibull(c(-Inf,x), ## or (c(x,Inf))
shape=shape,scale=scale)),1e-6)
## multinomial probability
-sum(y_log_p(y,prob))
}
library(bbmle)
I should probably have used something more systematic like the method of moments (i.e. matching the mean and variance of a Weibull distribution with the mean and variance of the data), but I just hacked around a bit to find plausible starting values:
## preliminary look (method of moments would be better)
scvec <- 10^(seq(0,4,length=101))
plot(scvec,sapply(scvec,NLLfun,shape=1))
It's important to use parscale to let R know that the parameters are on very different scales:
startvals <- list(scale=1000,shape=1)
m1 <- mle2(NLLfun,start=startvals,
control=list(parscale=unlist(startvals)))
Now try with a three-parameter Weibull (as originally requested) -- requires only a slight modification of what we already have:
library(FAdist)
NLLfun2 <- function(scale,shape,thres,
x=dd$degree_days,y=dd$emergence) {
prob <- pmax(diff(pweibull3(c(-Inf,x),shape=shape,scale=scale,thres)),
1e-6)
## multinomial probability
-sum(y_log_p(y,prob))
}
startvals2 <- list(scale=1000,shape=1,thres=100)
m2 <- mle2(NLLfun2,start=startvals2,
control=list(parscale=unlist(startvals2)))
Looks like the three-parameter fit is much better:
library(emdbook)
AICtab(m1,m2)
## dAIC df
## m2 0.0 3
## m1 21.7 2
And here's the graphical summary:
with(dd,plot(cum_em~degree_days,cex=3))
with(as.list(coef(m1)),curve(sum(dd$emergence)*
pweibull(x,shape=shape,scale=scale),col=2,
add=TRUE))
with(as.list(coef(m2)),curve(sum(dd$emergence)*
pweibull3(x,shape=shape,
scale=scale,thres=thres),col=4,
add=TRUE))
(could also do this more elegantly with ggplot2 ...)
These don't seem like spectacularly good fits, but they're sane. (You could in principle do a chi-squared goodness-of-fit test based on the expected number of emergences per interval, and accounting for the fact that you've fitted a three-parameter model, although the values might be a bit low ...)
Confidence intervals on the fit are a bit of a nuisance; your choices are (1) bootstrapping; (2) parametric bootstrapping (resample parameters assuming a multivariate normal distribution of the data); (3) delta method.
Using bbmle::mle2 makes it easy to do things like get profile confidence intervals:
confint(m1)
## 2.5 % 97.5 %
## scale 1576.685652 1777.437283
## shape 4.223867 6.318481
dd <- data.frame(degree_days=c(998.08,1039.66,1111.29,1165.89,1236.53,1293.71,
1347.66,1387.76,1445.47,1493.44,1553.23,1601.97,
1670.28,1737.29,1791.94,1849.20,1920.91,1967.25,
2036.64,2091.85,2152.89,2199.13,2199.13,2263.09,
2297.94,2352.39,2384.03,2442.44,2541.28,2663.90,
2707.36,2773.82,2816.39,2863.94),
emergence=c(0,0,0,1,1,0,2,3,17,10,0,0,0,2,0,3,0,0,1,5,0,0,0,0,
0,0,0,0,1,0,0,0,0,0))
dd$cum_em <- cumsum(dd$emergence)
dd$percent <- ave(dd$emergence, FUN = function(dd) 100*(dd)/46)
dd$cum_per <- ave(dd$cum_em, FUN = function(dd) 100*(dd)/46)
dd <- transform(dd)
#start 3 parameter model
library(FAdist)
## y*log(p) allowing for 0/0 occurrences:
y_log_p <- function(y,p) ifelse(y==0 & p==0,0,y*log(p))
NLLfun2 <- function(scale,shape,thres,
x=dd$degree_days,y=dd$percent) {
prob <- pmax(diff(pweibull3(c(-Inf,x),shape=shape,scale=scale,thres)),
1e-6)
## multinomial probability
-sum(y_log_p(y,prob))
}
startvals2 <- list(scale=1000,shape=1,thres=100)
m2 <- mle2(NLLfun2,start=startvals2,
control=list(parscale=unlist(startvals2)))
summary(m2)
#graphical summary
windows(5,5)
with(dd,plot(cum_per~degree_days,cex=3))
with(as.list(coef(m2)),curve(sum(dd$percent)*
pweibull3(x,shape=shape,
scale=scale,thres=thres),col=4,
add=TRUE))

Resources