Calculation of DFFITS as diagnostic for Leverage and Influence in regression - r

I am trying to calculate DFFITS by hand. The value obtained should equal the first value returned by the dffits function, but there must be something wrong with my own calculation.
attach(cars)
x1 <- lm(speed ~ dist, data = cars) # all observations
x2 <- lm(speed ~ dist, data = cars[-1,]) # without first obs
x <- model.matrix(speed ~ dist) # x matrix
h <- diag(x%*%solve(crossprod(x))%*%t(x)) # hat values
num_dffits <- x1$fitted.values[1] - x2$fitted.values[1] #Numerator
denom_dffits <- sqrt(anova(x2)$`Mean Sq`[2]*h[1]) #Denominator
df_fits <- num_dffits/denom_dffits #DFFITS
dffits(x1)[1] # DFFITS function

Your numerator is wrong. Since you removed the first observation when fitting the second model, the corresponding predicted value is not in fitted(x2). You need to use predict(x2, cars[1, ]) in place of fitted(x2)[1].
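That is, a minimal sketch of the corrected numerator:
num_dffits <- x1$fitted.values[1] - predict(x2, cars[1, ])  # prediction for obs 1 from the leave-one-out fit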
Hat values can be efficiently computed by
h <- rowSums(qr.Q(x1$qr) ^ 2)
or using its R wrapper function
h <- hat(x1$qr, FALSE)
R also has a general-purpose function for extracting hat values:
h <- lm.influence(x1, FALSE)$hat
or its wrapper function
h <- hatvalues(x1)
You also don't have to call anova to get MSE:
c(crossprod(x2$residuals)) / x2$df.residual
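Putting these pieces together, a minimal sketch of the full hand calculation, which should reproduce dffits(x1)[1]:
h <- hatvalues(x1)                                       # leverages from the full fit
mse_loo <- c(crossprod(x2$residuals)) / x2$df.residual   # MSE of the fit without obs 1
num_dffits <- x1$fitted.values[1] - predict(x2, cars[1, ])
num_dffits / sqrt(mse_loo * h[1])                        # compare with dffits(x1)[1]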

Related

Estimation method(s) of exponent available in DeMAND package, in R

I'm interested in estimating the exponent of Pareto-distributed data using the DeMAND package available in R. Following https://rdrr.io/bioc/DeMAND/src/R/pareto.R, I installed the package and ran the code below:
#### Functions for continuous power law or Pareto distributions
# Revision history at end of file
### Standard R-type functions for distributions:
# dpareto Probability density
# ppareto Probability distribution (CDF)
# qpareto Quantile function
# rpareto Random variable generation
### Functions for fitting:
# pareto.fit Fit Pareto to data
# .pareto.fit.threshold Determine scaling threshold and then fit
# --- not for direct use, call pareto.fit instead
# .pareto.fit.ml Fit Pareto to data by maximum likelihood
# --- not for direct use, call pareto.fit instead
# pareto.loglike Calculate log-likelihood under Pareto
# .pareto.fit.regression.cdf Fit Pareto data by linear regression on
# log-log CDF (disrecommended)
# --- not for direct use, call pareto.fit instead
# loglogslope Fit Pareto via regression, extract scaling
# exponent
# loglogrsq Fit Pareto via regression, extract R^2
### Functions for testing:
#
### Functions for visualization:
# plot.eucdf.loglog Log-log plot of the empirical upper cumulative
# distribution function, AKA survival function
# plot.survival.loglog Alias for plot.eucdf.loglog
### Back-stage functions, not intended for users:
# .ks.dist.for.pareto Find Kolmogorov-Smirnov distance between fitted
# and empirical distribution; called by
# .pareto.fit.threshold
# .ks.dist.fixed.pareto Find K-S distance between given Pareto and
# empirical distribution
# Probability density of Pareto distributions
# Gives NA on values below the threshold
# Input: Data vector, lower threshold, scaling exponent, "log" flag
# Output: Vector of (log) probability densities
dpareto <- function(x, threshold = 1, exponent, log=FALSE) {
# Avoid doing limited-precision arithmetic followed by logs if we want
# the log!
if (!log) {
prefactor <- (exponent-1)/threshold
f <- function(x) {prefactor*(x/threshold)^(-exponent)}
} else {
prefactor.log <- log(exponent-1) - log(threshold)
f <- function(x) {prefactor.log -exponent*(log(x) - log(threshold))}
}
d <- ifelse(x<threshold,NA,f(x))
return(d)
}
# Cumulative distribution function of the Pareto distributions
# Gives NA on values < threshold
# Input: Data vector, lower threshold, scaling exponent, usual flags
# Output: Vector of (log) probabilities
ppareto <- function(x, threshold=1, exponent, lower.tail=TRUE, log.p=FALSE) {
if ((!lower.tail) && (!log.p)) {
f <- function(x) {(x/threshold)^(1-exponent)}
}
if ((lower.tail) && (!log.p)) {
f <- function(x) { 1 - (x/threshold)^(1-exponent)}
}
if ((!lower.tail) && (log.p)) {
f <- function(x) {(1-exponent)*(log(x) - log(threshold))}
}
if ((lower.tail) && (log.p)) {
f <- function(x) {log(1 - (x/threshold)^(1-exponent))}
}
p <- ifelse(x < threshold, NA, f(x))
return(p)
}
# Quantiles of Pareto distributions
# Input: vector of probabilities, lower threshold, scaling exponent, usual flags
# Output: Vector of quantile values
qpareto <- function(p, threshold=1, exponent, lower.tail=TRUE, log.p=FALSE) {
# Quantile function for Pareto distribution
# P(x) = 1 - (x/xmin)^(1-a)
# 1-p = (x(p)/xmin)^(1-a)
# (1-p)^(1/(1-a)) = x(p)/xmin
# xmin*((1-p)^(1/(1-a))) = x(p)
# Upper quantile:
# U(x) = (x/xmin)^(1-a)
# u^(1/(1-a)) = x/xmin
# xmin * u^(1/(1-a)) = x
# log(xmin) + (1/(1-a)) log(u) = log(x)
if (log.p) {
p <- exp(p)
}
if (lower.tail) {
p <- 1-p
}
# This works, via the recycling rule
# q<-(p^(1/(1-exponent)))*threshold
q.log <- log(threshold) + (1/(1-exponent))*log(p)
q <- exp(q.log)
return(q)
}
# Generate Pareto-distributed random variates
# Input: Integer size, lower threshold, scaling exponent
# Output: Vector of real-valued random variates
rpareto <- function(n, threshold=1, exponent) {
# Using the transformation method, because we know the quantile function
# analytically
# Consider replacing with a non-R implementation of transformation method
ru <- runif(n)
r<-qpareto(ru,threshold,exponent)
return(r)
}
# Estimate parameters of Pareto distribution
# A wrapper for functions implementing actual methods
# Input: data vector, lower threshold (or "find", indicating it should be found
# from the data), method (likelihood or regression, defaulting to former)
# Output: List indicating type of distribution ("pareto"), parameters,
# information about fit (depending on method), OR a warning and NA
# if method is not recognized
pareto.fit <- function(data, threshold, method="ml") {
if (threshold == "find") {
return(.pareto.fit.threshold(data,method=method))
}
switch(method,
ml = { return(.pareto.fit.ml(data,threshold)) },
regression.cdf = { return(.pareto.fit.regression.cdf(data,threshold)) },
{ cat("Unknown method\n"); return(NA)}
)
}
# Estimate lower threshold of Pareto distribution
# Use the method in Clauset, Shalizi and Newman (2007): consider each distinct
# data value as a possible threshold, fit using that threshold, and then find
# the Kolmogorov-Smirnov distance between estimated and empirical distributions.
# Pick the threshold which minimizes this distance. The function then returns
# the output of one of the fixed-threshold estimators.
# Input: data vector, method (defaulting to ML)
# Output: List indicating type of distribution ("pareto"), parameters,
# information about fit (depending on method)
.pareto.fit.threshold <- function(data, method="ml") {
possibles <- unique(data)
ks.distances <- sapply(possibles,.ks.dist.for.pareto,data=data,method=method)
min.index = which.min(ks.distances)
min = possibles[min.index]
return(pareto.fit(data,threshold=min,method=method))
}
# Calculate the KS discrepancy between a data set and its fit Pareto
# distribution, assuming a given threshold. Not intended for users but rather
# for the .pareto.fit.threshold function.
# N.B., this KS statistic CANNOT be plugged in to the usual tables to find valid
# p-values, as the exponent has been estimated from the data.
# Input: real threshold, data vector, method flag
# Output: real-valued KS statistic
.ks.dist.for.pareto <- function(threshold,data,method="ml") {
model <- pareto.fit(data,threshold=threshold,method=method)
return(model$ks.dist)
}
# Calculate KS distance between a data set and a given Pareto distribution
# Not intended for users
# Input: real threshold, real exponent, data vector
# Output: real-valued KS statistic
.ks.dist.fixed.pareto <- function(data,threshold,exponent) {
data <- data[data>=threshold]
d <- suppressWarnings(ks.test(data,ppareto,threshold=threshold,exponent=exponent))
# ks.test complains about p-values when there are ties, we don't care
return(as.vector(d$statistic))
}
# Estimate scaling exponent of Pareto distribution by maximum likelihood
# Input: Data vector, lower threshold
# Output: List giving distribution type ("pareto"), parameters, log-likelihood
.pareto.fit.ml <- function (data, threshold) {
data <- data[data>=threshold]
n <- length(data)
x <- data/threshold
alpha <- 1 + n/sum(log(x))
loglike = pareto.loglike(data,threshold,alpha)
ks.dist <- .ks.dist.fixed.pareto(data,threshold=threshold,exponent=alpha)
fit <- list(type="pareto", exponent=alpha, xmin=threshold, loglike = loglike,
ks.dist = ks.dist, samples.over.threshold=n)
return(fit)
}
# Calculate log-likelihood under a Pareto distribution
# Input: Data vector, lower threshold, scaling exponent
# Output: Real-valued log-likelihood
pareto.loglike <- function(x, threshold, exponent) {
L <- sum(dpareto(x, threshold = threshold, exponent = exponent, log = TRUE))
return(L)
}
# Log-log plot of the survival function (empirical upper CDF) of a data set
# Input: Data vector, lower limit, upper limit, graphics parameters
# Output: None (returns NULL invisibly)
plot.survival.loglog <- function(x,from=min(x),to=max(x),...) {
plot.eucdf.loglog(x,from,to,...)
}
plot.eucdf.loglog <- function(x,from=min(x),to=max(x),type="l",...) {
# Use the "eucdf" function (below)
x <- sort(x)
x.eucdf <- eucdf(x)
# This is nice if the number of points is small...
plot(x,x.eucdf(x),xlim=c(from,to),log="xy",type=type,...)
# Should check how many points and switch over to a curve-type plot when
# it gets too big
invisible(NULL)
}
# Calculate the upper empirical cumulative distribution function of a
# one-dimensional data vector
# Uses the standard function ecdf
# Should, but does not yet, also produce a function of class "stepfun"
# (like ecdf)
# Input: data vector
# Output: a function
eucdf <- function(x) {
# Exploit built-in R function to get ordinary (lower) ECDF, Pr(X<=x)
x.ecdf <- ecdf(x)
# Now we want Pr(X>=x) = (1-Pr(X<=x)) + Pr(X==x)
# If x is one of the "knots" of the step function, i.e., a point with
# positive probability mass, should add that in to get Pr(X>=x)
# rather than Pr(X>x)
away.from.knot <- function(y) { 1 - x.ecdf(y) }
at.knot.prob.jump <- function(y) {
x.knots = knots(x.ecdf)
# Either get the knot number, or give zero if this was called
# away from a knot
k <- match(y,x.knots,nomatch=0)
if ((k==0) || (k==1)) { # Handle special cases
if (k==0) {
prob.jump = 0 # Not really a knot
} else {
prob.jump = x.ecdf(y) # Special handling of first knot
}
} else {
prob.jump = x.ecdf(y) - x.ecdf(x.knots[(k-1)]) # General case
}
return(prob.jump)
}
# Use one function or the other
x.eucdf <- function(y) {
baseline = away.from.knot(y)
jumps = sapply(y,at.knot.prob.jump)
ifelse (y %in% knots(x.ecdf), baseline+jumps, baseline)
}
return(x.eucdf)
}
# Calculate valid p-value for the goodness of fit of a power-law
# tail to a data set, via simulation
# Input: data vector (x), number of replications (m)
# Output: p-value
pareto.tail.ks.test <- function(x,m) {
x.pt <- pareto.fit(x,threshold="find")
x0 <- x.pt$xmin # extract parameters of fitted dist.
alpha <- x.pt$exponent
ntail <- sum(x>=x0) # How many samples in the tail?
n <- length(x)
ptail <- ntail/n # Total prob. of the tail
# Carve out the non-tail data points
body <- x[x < x0]
# Observed value of KS distance:
d.ks <- x.pt$ks.dist
# KS statistics of resamples:
r.ks <- replicate(m,.ks.resimulate.pareto.tail(n,ptail,x0,alpha,body))
p.value <- sum(r.ks >= d.ks)/m
return(p.value)
}
# Resimulate from a data set with a Pareto tail, estimate on
# the simulation and report the KS distance
# Inputs: Size of sample (n), probability of being in the tail (tail.p),
# threshold for tail (threshold), power law exponent (exponent),
# vector giving values in body (data.body)
# Output: KS distance
.ks.resimulate.pareto.tail <- function(n,tail.p,threshold,exponent,data.body) {
# Samples come from the tail with probability ptail, or else from the body
# decide randomly how many samples come from the tail
tail.samples <- rbinom(1,n,tail.p)
# Draw the samples from the tail
rtail <- rpareto(tail.samples,threshold,exponent)
# Draw the samples from the body (with replacement!)
rbody <- sample(data.body,n-tail.samples,replace=TRUE)
b <- c(rtail,rbody)
b.ks <- pareto.fit(b,threshold="find")$ks.dist
return(b.ks)
}
### The crappy linear regression way to fit a power law
# The common procedure is to fit to the binned density function, which is even
# crappier than to fit to the complementary distribution function; this
# currently only implements the latter
# First, produce the empirical complementary distribution function, as
# a pair of lists, {x}, {C(x)}
# Then regress log(C) ~ log(x)
# and report the slope and the R^2
# Input: Data vector, threshold
# Output: List with distributional parameters and information about the
# fit
.pareto.fit.regression.cdf <- function(x,threshold=1) {
# Discard data under threshold
x <- x[x>=threshold]
n <- length(x)
# We need the different observed values of x, in order
distinct_x <- sort(unique(x))
x.eucdf <- eucdf(x)
upper_probs <- x.eucdf(distinct_x)
loglogfit <- lm(log(upper_probs) ~ log(distinct_x))
intercept <- as.vector(coef(loglogfit)[1]) # primarily useful for plotting
slope <- as.vector(-coef(loglogfit)[2]) # Remember sign of parameterization
# But that's the exponent of the CDF, that of the pdf is one larger
# and is what we're parameterizing by
slope <- slope+1
r2 <- summary(loglogfit)$r.squared
loglike <- pareto.loglike(x, threshold, slope)
ks.dist <- .ks.dist.fixed.pareto(x,threshold=threshold,exponent=slope)
result <- list(type="pareto", exponent = slope, rsquare = r2,
log_x = log(distinct_x), log_p = log(upper_probs),
intercept = intercept, loglike = loglike, xmin=threshold,
ks.dist = ks.dist, samples.over.threshold=n)
return(result)
}
# Wrapper function to just get the exponent estimate
loglogslope <- function(x,threshold=1) {
llf <- .pareto.fit.regression.cdf(x,threshold)
exponent <- llf$exponent
return(exponent)
}
# Wrapper function to just get the R^2 values
loglogrsq <- function(x,threshold=1) {
llf <- .pareto.fit.regression.cdf(x,threshold)
r2 <- llf$rsquare
return(r2)
}
# Revision history:
# no release 2003 First draft
# v 0.0 2007-06-04 First release
# v 0.0.1 2007-06-29 Fixed "not" for "knot" typo, thanks to
# Nicholas A. Povak for bug report
# v 0.0.2 2007-07-22 Fixed bugs in plot.survival.loglog, thanks to
# Stefan Wehrli for report
# v 0.0.3 2008-03-02 Realized R has a "unique" function; added
# estimating xmin via method in minimal KS dist.
# v 0.0.4 2008-04-24 Made names of non-end-user functions start
# with period, hiding them in workspace
# v 0.0.5 2011-02-03 Suppressed the warning ks.test produces about
# not being able to calculate p-values in the
# presence of ties
pareto.fit(rpareto(100,2,3), "find", method='ml')
(Just copy-pasting the above code will reproduce the result.)
This did give me a result, but I need to know which estimation method has actually been used here to estimate the exponent. Is it maximum likelihood or something else? What does method='ml' mean (m for maximum, l for likelihood?)? R's help doesn't work for this particular code. Are there any other estimation methods available here, such as least squares or the method of moments?
Knowing the answers to these questions would be very helpful, as I'm comparing the accuracy of several packages under distinct estimation methods, so knowing the estimation method exactly is essential. Thanks in advance.
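For what it's worth, the pasted source itself dispatches on only two methods: method = "ml" (maximum likelihood, implemented in .pareto.fit.ml, as the header comments say) and method = "regression.cdf" (least-squares regression on the log-log CDF, which the comments themselves disrecommend). A small sketch comparing the two, assuming the code above has been sourced:
set.seed(1)
x <- rpareto(1000, threshold = 2, exponent = 3)
fit_ml  <- pareto.fit(x, threshold = 2, method = "ml")              # Hill-type maximum-likelihood estimate
fit_reg <- pareto.fit(x, threshold = 2, method = "regression.cdf")  # slope of the log-log CDF, plus 1
c(ml = fit_ml$exponent, regression = fit_reg$exponent)              # both should be near 3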

Manually implementing Regression Likelihood Ratio Test

I'm trying to implement my own linear regression likelihood ratio test.
The test compares the residual sum of squares of a reduced model with that of the full model and refers the resulting statistic to an F distribution.
However, I am having some trouble implementing the function, especially when dealing with dummy variables.
This is the dataset I am working with and testing the function on.
Here is the code so far:
The function inputs are the design matrix mat, the response matrix (which has just one column), the indices of the variables being tested, and the alpha level of the test.
linear_regression_likelihood <- function(mat, response, indices, alpha) {
  mat <- as.matrix(mat)
  reduced <- mat[, c(1, indices)]
  q <- 1  # set q = 1 just to test on data
  p <- dim(mat)[2]
  n <- dim(mat)[1]
  f_stat <- qf(1 - alpha, df1 = p - q, df2 = n - (p + 1))
  beta_hat_full <- qr.solve(t(mat) %*% mat) %*% t(mat) %*% response
  y_hat_full <- mat %*% beta_hat_full
  SSRes_full <- t(response - y_hat_full) %*% (response - y_hat_full)
  beta_hat_red <- qr.solve(t(reduced) %*% reduced) %*% t(reduced) %*% response
  y_hat_red <- reduced %*% beta_hat_red
  SSRes_red <- t(response - y_hat_red) %*% (response - y_hat_red)
  s_2 <- (t(response - mat %*% beta_hat_full) %*% (response - mat %*% beta_hat_full)) / (n - p + 1)
  critical_value <- ((SSRes_red - SSRes_full) / (p - q)) / s_2
  print(critical_value)
  if (critical_value > f_stat) {
    return("Reject H0")
  } else {
    return("Fail to Reject H0")
  }
}
Here is the setup code, where I set up the matrix in the correct format. data is the CSV file that was read in.
data <- data[, 2:5]
mat <- data[, 2:4]
response <- data[, 1]
library(ade4)
df <-data.frame(mat$x3)
dummy <- acm.disjonctif(df)
dummy
mat <- cbind(1, mat[1:2], dummy)
linear_regression_likelihood(mat, response, 2:3, 0.05)
This is the error I keep getting.
Error in solve.default(as.matrix(c)) : system is computationally singular: reciprocal condition number = 1.63035e-18
I know it has to do with inverting the matrix after the multiplication, but the function is unable to do so. I thought it might be because the dummy variables take such small values, but I am not sure of any other way to include them.
The test I am doing is to check whether the factor variable x3 has any effect on the response y. The actual answer, which I verified using the built-in functions, is that we fail to reject the null hypothesis.
The error originates from line
beta_hat_full <- qr.solve(t(mat)%*%mat)%*%t(mat)%*%response
If you go through your function step by step you will see the error
Error in qr.solve(t(mat) %*% mat) : singular matrix 'a' in solve
The problem here is that your model matrix does not have full column rank, which translates to your regression coefficients not being unique. This is a result of the way you "dummyfied" x3. In order to ensure full rank, you need to remove one dummy column (or manually remove the intercept).
In the following example I remove the A column from dummy which means that resulting x3 coefficients measure the effect of a unit-change in B, C, and D against A.
# Read data
data <- read.csv("data_hw5.csv")
data <- data[, 2:5]
# Extract predictor and response data
mat <- data[, 2:4]
response <- data[, 1]
# Dummify categorical predictor x3
library(ade4)
df <-data.frame(mat$x3)
dummy <- acm.disjonctif(df)
dummy <- dummy[, -1] # Remove A to have A as baseline
mat <- cbind(1, mat[1:2], dummy)
# Apply linear_regression_likelihood
linear_regression_likelihood(mat, response, 2:3, 0.05);
# [,1]
#[1,] 8.291975
#[1] "Reject H0"
A note
The error could have been avoided if you had used base R's function model.matrix which ensures full rank when "dummyfying" categorical variables (model.matrix is also implicitly called in lm and glm to deal with categorical, i.e. factor variables).
Take a look at
mm <- model.matrix(y ~ x1 + x2 + x3, data = data)
which by default omits the first level of factor variable x3. mm is identical to mat after (correct) "dummification".
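As a cross-check (a sketch, assuming the same data frame data with columns y, x1, x2 and x3 as used in the model.matrix call above), the same partial F test for x3 can also be obtained from the built-in machinery by comparing nested lm fits, which handles the factor coding automatically:
fit_red  <- lm(y ~ x1 + x2, data = data)       # reduced model: x3 dropped
fit_full <- lm(y ~ x1 + x2 + x3, data = data)  # full model
anova(fit_red, fit_full)                       # partial F test for the x3 factor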

Partial Cross-correlation in R

I think the title is fairly self-explanatory. I want to compute the cross-correlation between two time series controlled for the values at other lags. I can't find any existing R code to do this, and I'm not at all confident enough in my knowledge of statistics (or R) to try to write something myself. It would be analogous to the partial autocorrelation function, just for the cross-correlation instead of the autocorrelation.
If it helps at all, my larger objective is to look for lagged correlations between different measurements of a physical system (to start with, flux and photon index from gamma ray measurements of blazars), with the goal of building a general linear model to try to predict flaring events.
Look at my answer to my own question (same as the one you posted).
You can make use of the pacf function in R, extending it to a matrix with 2 or more time series. I have checked results between the multivariate acf and ccf functions and they yield the same results, so the same can be concluded about the multivariate pacf and the (non-existing) pccf.
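For instance, a minimal sketch (x and y being any two aligned numeric series):
pacf(cbind(x = x, y = y), lag.max = 10)  # partial autocorrelations of the bivariate series; the off-diagonal panels give the cross terms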
I believe this works:
library(dplyr)  # tibble(), arrange(), %>% (and mutate()/bind_rows() below) come from dplyr

pccf <- function(x, y, nlags = 7, partial = TRUE){
  # x (numeric): variable that leads y
  # y (numeric): variable of interest
  # nlags (integer): number of lags (including zero)
  # partial (boolean): partial or absolute correlation

  # trim y
  y <- y[-(1:(nlags - 1))]
  # lagged matrix of x
  x_lagged <- embed(x, nlags)
  # process for each lag
  rho <- lag <- NULL
  for(i in 1:(nlags)){
    if(partial){
      # residuals of x at lag of interest regressed on all other lags of x
      ex <- lm(x_lagged[, i] ~ x_lagged[, -i])$residuals
      # residuals of y regressed on all lags of x but the one of interest
      ey <- lm(y ~ x_lagged[, -i])$residuals
    } else {
      ex <- x_lagged[, i]
      ey <- y
    }
    # calculate correlation
    rho[i] = cor(ex, ey, use = "pairwise.complete.obs")
    lag[i] = i - 1
  }
  return(
    tibble(lag = lag, rho = rho) %>%
      arrange(lag)
  )
}
# test
library(ggplot2)
n <- 200     # count
nlag <- 6    # number of lags
x <- as.numeric(arima.sim(n = n, list(ar = c(phi = 0.9)), sd = 1))  # simulate time series x
y <- lag(x, nlag) + rnorm(n, 0, 0.5)  # simulate y to lag x (dplyr's lag)
y <- y[(nlag + 1):n]  # remove NAs from lag
x <- x[(nlag + 1):n]  # align with y
pccf(x, y, nlags = 10, partial = FALSE) %>%
  mutate(type = 'Cross correlation') %>%
  bind_rows(
    pccf(x, y, nlags = 10, partial = TRUE) %>%
      mutate(type = 'Partial cross correlation')
  ) %>%
  ggplot() +
  geom_col(aes(-lag, rho), width = 0.1) +
  facet_wrap(~type, scales = 'free_y', ncol = 1) +
  scale_x_continuous(breaks = -10:0) +
  theme_bw(base_size = 20)

Using anova() on gamma distributions gives seemingly random p-values

I am trying to determine whether there is a significant difference between two Gamma distributions. One distribution has (shape, scale) = (shapeRef, scaleRef) while the other has (shape, scale) = (shapeTarget, scaleTarget). I tried to do an analysis of variance with the following code:
n=10000
x=rgamma(n, shape=shapeRef, scale=scaleRef)
y=rgamma(n, shape=shapeTarget, scale=scaleTarget)
glmm1 <- gam(y~x,family=Gamma(link=log))
anova(glmm1)
The resulting p values keep changing and can be anywhere from <0.1 to >0.9.
Am I going about this the wrong way?
Edit: I now use the following code instead:
f <- gl(2, n)
x=rgamma(n, shape=shapeRef, scale=scaleRef)
y=rgamma(n, shape=shapeTarget, scale=scaleTarget)
xy <- c(x, y)
anova(glm(xy ~ f, family = Gamma(link = log)),test="F")
But, every time I run it I get a different p-value.
You will indeed get a different p-value every time you run this, if you pick different realizations every time. Just like your data values are random variables, which you'd expect to vary each time you ran an experiment, so is the p-value. If the null hypothesis is true (which was the case in your initial attempts), then the p-values will be uniformly distributed between 0 and 1.
Function to generate simulated data:
simfun <- function(n = 100, shapeRef = 2, shapeTarget = 2,
                   scaleRef = 1, scaleTarget = 2) {
  f <- gl(2, n)
  x <- rgamma(n, shape = shapeRef, scale = scaleRef)
  y <- rgamma(n, shape = shapeTarget, scale = scaleTarget)
  xy <- c(x, y)
  data.frame(xy, f)
}
Function to run anova() and extract the p-value:
sumfun <- function(d) {
  aa <- anova(glm(xy ~ f, family = Gamma(link = log), data = d), test = "F")
  aa["f", "Pr(>F)"]
}
Try it out, 500 times:
set.seed(101)
r <- replicate(500,sumfun(simfun()))
The p-values are always very small (the difference in scale parameters is easily distinguishable), but they do vary:
par(las=1,bty="l") ## cosmetic
hist(log10(r),col="gray",breaks=50)
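To see the uniform behaviour described above, a sketch that re-runs the same simulation with identical parameters in both groups (so the null hypothesis is true):
set.seed(102)
r0 <- replicate(500, sumfun(simfun(scaleTarget = 1)))  # both groups now Gamma(shape = 2, scale = 1)
hist(r0, col = "gray", breaks = 20)  # roughly flat: p-values approximately Uniform(0, 1)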

nls() in R using entire matrix

I have data which I want to fit to the following equation using R:
Z(u,w)=z0*F(w)*[1-exp((-b*u)/F(w))]
where z0 and b are constants and F(w), w=0,...,9 is a decreasing step function that depends on w with F(0)=1 and u=1,...,50.
Z(u,w) is an observed set of data in the form of a 50x10 matrix (u=50,...,1 down the rows and w=0,...,9 along the columns). For example, since I haven't explained that very well, Z(42,3) is the element in the 9th row down and the 4th column along.
Using F(0)=1 I was able to get estimates of b and z0 using just the first column (ie w=0) with the code:
n0=nls(zuw~z0*(1-exp(-b*u)),start=list(z0=283,b=0.03),options(digits=10))
I then found F(w) for w=1,...,9 by going through each column and using the values of b and z0 I had found.
However, I wanted to find a way to estimate all 12 parameters at once (b, z0 and the 10 values of F(w)), since b and z0 should be fitted to all the data, not just the first column.
Does anyone know of any way of doing this? All help would be greatly appreciated!
Thanks
James
This may be a case where the formula interface of the nls(...) function works against you. As an alternative, you can use nls.lm(...) in the minpack.lm package to perform non-linear regression with a programmatically defined function. To demonstrate this, first we create an artificial dataset which follows your functional form by design, with random error added (error ~ N[0,1]).
u <- 1:50
w <- 0:9
z0 <- 100
b <- 0.02
F <- 10/(10+w^2)
# matrix containing data, in OP's format: rows are u, cols are w
m <- do.call(cbind,lapply(w,function(w)
z0*F[w+1]*(1-exp(-b*u/F[w+1]))+rnorm(length(u),0,1)))
So now we have a matrix m, which is equivalent to your dataset. This matrix is in the so-called "wide" format - the response for different values of w is in different columns. We need it in "long" format: all responses in a single column, with a separate columns identifying u and w. We do this using melt(...) in the reshape2 package.
# prepend values of u
df.wide <- data.frame(u=u, m)
library(reshape2)
# reshape to long format: col1 = u, col2=w, col3=z
df <- melt(df.wide,id="u",variable.name="w", value.name="z")
df$w <- as.numeric(substr(df$w,2,4))-1
Now we have a data frame df with columns u, w, and z. The nls.lm(...) function takes (at least) 4 arguments: par is a vector of initial estimates of the parameters of the fit, fn is a function that calculates the residuals at each step, observed is the dependent variable (z), and xx is a vector or matrix containing the independent variables (u, w).
Next we define a function, f(par, xx), where par is an 11 element vector. The first two elements contain estimates of z0 and b. The next 9 contain estimates of F(w), w=1:9. This is because you state that F(0) is known to be 1. xx is a matrix with two columns: the values for u and w respectively. f(par,xx) then calculates estimate of the response z for all values of u and w, for the given parameter estimates.
library(minpack.lm)
# model function
f <- function(pars, xx) {
  z0 <- pars[1]
  b <- pars[2]
  F <- c(1, pars[3:11])
  u <- xx[, 1]
  w <- xx[, 2]
  z <- z0*F[w+1]*(1 - exp(-b*u/F[w+1]))
  return(z)
}
# residual function
resids <- function(p, observed, xx) {observed - f(p,xx)}
Next we perform the regression using nls.lm(...), which uses a highly robust fitting algorithm (Levenberg-Marquardt). Consequently, we can set the par argument (containing the initial estimates of z0, b, and F) to all 1's, which is fairly distant from the values used in creating the dataset (the "actual" values). nls.lm(...) returns a list with several components (see the documentation). The par component contains the final estimates of the fit parameters.
# initial parameter estimates; all 1's
par.start <- c(z0=1, b=1, rep(1,9))
# fit using Levenberg-Marquardt algorithm
nls.out <- nls.lm(par = par.start,
                  fn = resids, observed = df$z, xx = df[, c("u","w")],
                  control = nls.lm.control(maxiter = 10000, ftol = 1e-6, maxfev = 1e6))
par.final <- nls.out$par
results <- rbind(predicted=c(par.final[1:2],1,par.final[3:11]),actual=c(z0,b,F))
print(results,digits=5)
# z0 b
# predicted 102.71 0.019337 1 0.90456 0.70788 0.51893 0.37804 0.27789 0.21204 0.16199 0.13131 0.10657
# actual 100.00 0.020000 1 0.90909 0.71429 0.52632 0.38462 0.28571 0.21739 0.16949 0.13514 0.10989
So the regression has done an excellent job of recovering the "actual" parameter values. Finally, we plot the results using ggplot just to make sure this is all correct. I can't overemphasize how important it is to plot the final results.
df$pred <- f(par.final,df[,c("u","w")])
library(ggplot2)
ggplot(df, aes(x = u, color = factor(w))) +
  geom_point(aes(y = z)) + geom_line(aes(y = pred))
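If you also want uncertainty estimates, minpack.lm provides a summary method for nls.lm fits; a brief sketch, continuing from nls.out above:
summary(nls.out)  # coefficient table with standard errors for z0, b and F(1)..F(9)
nls.out$deviance  # residual sum of squares at the solution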
