Boxcoxfit function in R without intercept

I have a problem with the boxcoxfit function from the geoR package.
I simulated some data and now I want to estimate the regression parameters together with the Box-Cox transformation parameter.
I have a matrix X with 2 columns and a vector Y of non-negative values (obtained by the inverse Box-Cox transformation).
When I call boxcoxfit with Y and my two-column X, the answer has 4 parameters (one extra for the intercept). When I add an intercept column to my matrix X and run boxcoxfit again, for lambda = 2 I get a nonsense estimate for the intercept.
Here is my full code:
library(geoR)
# optional: fix the seed for reproducibility
set.seed(80974140)
XX <- matrix(rnorm(2000, 100, 12), ncol = 2, nrow = 1000)
epsilon <- rnorm(1000, 0, 1)
beta <- c(0, 2, 3)
a <- 2
# inverse Box-Cox transformation
inverz <- function(y, a) {
  if (a == 0) return(exp(y))
  (y*a + 1)^(1/a)
}
jedna <- rep(1, 1000)        # column of ones for the intercept
X <- cbind(jedna, XX)        # design matrix with an explicit intercept column
TY <- X %*% beta + epsilon   # regression model on the transformed scale
head(cbind(TY, X))
Y <- inverz(TY, a)           # observed data
summary(Y)
head(cbind(Y, X, epsilon))
boxcoxfit(object = Y, xmat = X)
And the output:
Fitted parameters:
    lambda      beta0      beta1      beta2    sigmasq
 1.9903028  2.0598958  1.9415787  2.8965162  0.9945854
Can I somehow remove the intercept from the boxcoxfit?
Can I get estimated standard errors for the coefficients?
Thanks for your answers.
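One workaround, since boxcoxfit appears to add an intercept internally: maximize the Box-Cox profile likelihood yourself, so that the design matrix is exactly the X you pass in. This is a minimal sketch of my own (not geoR functionality), reusing the Y and X defined above; bc is the forward transform matching the inverz function:
bc <- function(y, a) if (a == 0) log(y) else (y^a - 1)/a   # forward Box-Cox transform
negprofll <- function(lambda, y, X) {
  z <- bc(y, lambda)
  fit <- lm.fit(X, z)
  n <- length(y)
  sigma2 <- sum(fit$residuals^2)/n
  # negative profile log-likelihood, including the Jacobian term of the transform
  -(-n/2 * log(sigma2) + (lambda - 1) * sum(log(y)))
}
opt <- optimize(negprofll, interval = c(0.5, 4), y = Y, X = X)
lambda_hat <- opt$minimum
fit <- lm(drop(bc(Y, lambda_hat)) ~ X - 1)   # "- 1": X already contains the intercept column
summary(fit)$coefficients                    # estimates with standard errors
Note that these standard errors are conditional on lambda_hat: they ignore the uncertainty in the estimated lambda.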

Related

How to manually calculate coefficients for Gamma GLM

The input I'm giving to the GLM function is:
glm(family = fam, data = regFrame1, start = starter1,
    formula = as.formula(paste(yvar, "~ . + 0")),
    na.action = na.exclude, y = TRUE)
Where the family is Gamma and the link function is identity.
I'm trying to manually reproduce the coefficients from my model, where one of them is, for example:
              Estimate Std. Error t value Pr(>|t|)
coefficient A 480.6062   195.2952   2.461 0.013902 *
I know the equation I need for coefficient A is:
β̂ = (X^T X)^(-1) X^T Y
Where y is my dependent variable and x is my independent variable.
In R I write this to produce βA:
# x transposed multiplied by x when both are matrices
xtx <- t(x) %*% x
# x transposed multiplied by y when both are matrices
xty <- t(x) %*% y
# we need to inverse xtx
xtxinv <- solve(xtx, tol=0)
# finally we multiply the inverse of xtx by xty to get betaHat
betaHat <- xtxinv %*% xty
This calculation gives betaHat = 148.
When I complete this calculation manually, I get the coefficient that is produced when running a GLM with the default Gaussian family (no family specified), which looks like this:
glm(data=regFrame1,formula=as.formula(paste(yvar,"~.+0")),na.action=na.exclude,y=T)
So the question is: how do I tailor my manual calculation to the Gamma family with identity link, instead of the Gaussian identity default built into R's glm.fit?
The only two differences between my two runs of the glm function are:
providing the family (Gamma identity)
giving the model starting values (100 for each column in the dataframe)
I tried to recreate the glm.fit computation manually to get the coefficient (beta) out. When I didn't provide a family or starting values I got the correct answer, but when I specified Gamma as the family with identity link and gave starting values, I got a much different coefficient.
For linear regression, which is fit by least squares, β̂ is indeed (X^T X)^(-1) X^T Y. For a generalized linear model, however, β̂ is found by iteratively reweighted least squares (IRLS), an iterative algorithm, so there is no direct formula to compute it. What we can compute is the analogue of the hat matrix H from linear regression. In linear regression the hat matrix is H = X (X^T X)^(-1) X^T; in a generalized linear model the analogue is H = W X (X^T W X)^(-1) X^T, where W = diag(μ'(Xβ)). In both cases Hy gives the fitted values ŷ. Here is code to demonstrate.
#' Test that the two parameterizations of the Gamma density are the same
curve(dgamma(x, 3, scale = 3), xlim = c(0, 10))
grid <- seq(0, 10, length = 1000)
d <- 1/grid/gamma(3) * (grid/(1/3)/9)^3 * exp(-grid/3)
plot(grid, d, type = 'l')
#' Generate random variates according to a GLM with
#'   Y_i ~ Gamma(mean = mu,
#'               squared coefficient of variation (variance over squared mean) = phi)
#'   Y_i ~ Gamma(shape = alpha, scale = beta)
#'   mu  = alpha * beta
#'   phi = 1 / alpha
#' Let Beta = (3, 4)
set.seed(123)
X <- data.frame(x1 = runif(1000, 0, 10))
mu <- (3 + 4*X$x1)^(-1)   # canonical (inverse) link: mu = 1/eta
y <- numeric(1000)
for (i in 1:1000) {
  alpha <- 1/3            # phi = 3
  beta  <- mu[i] * 3      # so that alpha * beta = mu[i]
  y[i]  <- rgamma(1, alpha, scale = beta)
}
#' Fit the model and compute the hat matrix, then the fitted values manually
mod <- glm(y ~ ., family = Gamma(), data = X)
x <- as.matrix(cbind(1, X))
W <- diag(c(-(x %*% c(3, 4))^(-2)))   # W = diag(mu'(X beta)), with mu'(eta) = -1/eta^2
H <- W %*% x %*% solve(t(x) %*% W %*% x) %*% t(x)
# Manual fitted values
head(H %*% y)
# Fitted values from the model
head(mod$fitted.values)
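The hat matrix above reproduces the fitted values; if you also want the coefficients themselves, here is a minimal IRLS sketch (my own sketch of the standard algorithm, not glm.fit's internals) for the Gamma family with identity link, which is the case in the question. With an identity link mu'(eta) = 1, and for Gamma V(mu) = mu^2, so the working weights are W = diag(1/mu^2) and the working response reduces to y itself:
# Hedged IRLS sketch for a Gamma GLM with identity link (assumes mu stays positive)
irls_gamma_identity <- function(X, y, beta0, tol = 1e-10, maxit = 50) {
  beta <- beta0
  for (it in 1:maxit) {
    mu <- drop(X %*% beta)   # identity link: mu = eta
    w  <- 1 / mu^2           # working weights, 1/V(mu)
    beta_new <- drop(solve(t(X) %*% (w * X), t(X) %*% (w * y)))
    if (max(abs(beta_new - beta)) < tol) { beta <- beta_new; break }
    beta <- beta_new
  }
  beta
}
# quick check against glm on fresh simulated data (positive means by construction)
set.seed(1)
Xm <- cbind(1, runif(500, 1, 5))
yy <- rgamma(500, shape = 5, scale = (2 + 3 * Xm[, 2]) / 5)   # mean = 2 + 3*x
irls_gamma_identity(Xm, yy, beta0 = c(1, 1))
coef(glm(yy ~ Xm[, 2], family = Gamma(link = "identity"), start = c(1, 1)))
This is why providing a non-Gaussian family changes the coefficients: the weights depend on the current fitted means, so the solution is no longer the single unweighted least-squares step (X^T X)^(-1) X^T Y.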

Manually get the responses from a GLM with Gamma distribution and a GLM with inverse Gaussian distribution

I've been trying to manually reproduce the response values given by the predict.glm function from the stats package in R, but I'm unable to do so. I only know how to do it for a binomial distribution. I would really appreciate some help. I created two small models (one with the Gamma family and one with the inverse Gaussian family).
library(stats)
library(dplyr)
data("USArrests")
#Gamma distribution model
model_gam <- glm(Rape~Murder + Assault + UrbanPop, data=USArrests, family=Gamma)
print(summary(model_gam))
responses_gam <- model_gam %>% predict(USArrests[1,], type="response")
print(responses_gam)
# Trying to manually get responses for the Gamma model
paste(coef(model_gam), names(coef(model_gam)), sep="*", collapse="+")
# "0.108221470842499*(Intercept)+-0.00122165587689519*Murder+-9.47425665022909e-05*Assault+-0.000467789606041651*UrbanPop"
print(USArrests[1,])
#Murder: 13.2, Assault: 236, UrbanPop: 58
x = 0.108221470842499 - 0.00122165587689519 * 13.2 - 9.47425665022909e-05 * 236 - 0.000467789606041651 * 58
# This is wrong. Do I have to include the dispersion? (which is 0.10609)
print(exp(x)/(1+exp(x)))
# result should be (from predict function): 26.02872
# exp(x)/(1+exp(x)) gives: 0.510649
# Inverse Gaussian distribution model
model_gaus <- glm(Rape~Murder + Assault + UrbanPop, data=USArrests, family=inverse.gaussian(link="log"))
responses_gaus <- model_gaus %>% predict(USArrests[1,], type="response")
print(summary(model_gaus))
print(responses_gaus)
# Trying to manually get responses for the inverse Gaussian model
paste(coef(model_gaus), names(coef(model_gaus)), sep="*", collapse="+")
# "0.108221470842499*(Intercept)+-0.00122165587689519*Murder+-9.47425665022909e-05*Assault+-0.000467789606041651*UrbanPop"
x = 1.70049202188329-0.0326196928618521* 13.2 -0.00234379099421488*236-0.00991369000675323*58
# Dispersion in this case is 0.004390825
print(exp(x)/(1+exp(x)))
# result should be (from predict function): 26.02872
# exp(x)/(1+exp(x)) gives: 0.5353866
built-in predict()
predict(model_gaus)["Alabama"] ## 3.259201
by hand
cat(paste(round(coef(model_gaus),5), names(coef(model_gaus)), sep="*", collapse="+"),"\n")
## 1.70049*(Intercept)+0.03262*Murder+0.00234*Assault+0.00991*UrbanPop
USArrests["Albama",]
## Murder Assault UrbanPop Rape
## Alabama 13.2 236 58 21.2
The covariate value for the intercept is always 1, so we have
1.70049*1+0.03262*13.2+0.00234*236+0.00991*58
## [1] 3.258094
(close enough, since I rounded some things)
You don't need to do anything with the dispersion or the inverse-link function here, because the default for predict() is type = "link": both numbers above are the plain linear predictor.
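A quick sanity check of that claim (a sketch using the model fitted above):
all.equal(predict(model_gaus), predict(model_gaus, type = "link"))  # TRUE: default is the link scale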
using the model matrix
Mathematically, the regression equation is defined as X %*% beta, where beta is the vector of coefficients and X is the model matrix (for your example, a column of ones for the intercept plus your predictors; for models with categorical predictors or more complex terms like splines, it's a little more complicated). You can extract the model matrix from the fitted model with model.matrix():
Xg <- model.matrix(model_gaus)
drop(Xg["Alabama",] %*% coef(model_gaus))
For the Gamma model, you would use exactly the same procedure, but at the end you would transform the computed linear predictor by 1/x (the inverse link function for the Gamma). (Note that you need predict(..., type = "response") to get the inverse-transformed prediction; otherwise, with the default type = "link", R will give you just the plain linear predictor.) If you used a log link instead, you would exponentiate. More generally,
invlinkfun <- family(fitted_model)$linkinv
X <- model.matrix(fitted_model)
beta <- coef(fitted_model)
invlinkfun(X %*% beta)
The inverse Gaussian model uses a 1/mu^2 link by default; inverse.gaussian()$linkinv is function(eta) { 1/sqrt(eta) }
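For a concrete check with the Gamma model from the question (default inverse link), a short sketch; the ~26.03 value is the one reported by predict() in the question:
eta <- drop(model.matrix(model_gam)["Alabama", ] %*% coef(model_gam))   # linear predictor
1 / eta                                                 # Gamma inverse link: mu = 1/eta
predict(model_gam, newdata = USArrests["Alabama", ], type = "response") # should agree (~26.03)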

retrieve formula used by predict function in exponential equation in R

I can't figure out how to reconstruct the results or the formula from the predict function of a linear model. I get the same results when using this data in ggplot with geom_smooth(method = 'lm', formula = y ~ exp(x)).
Here's some sample data
x=c(1,10,100,1000,10000,100000,1000000,3000000)
y=c(1,1,10,15,20,30,40,60)
I would like to use an exponential function, so (ignore for the moment that I take the log of the x value, because exp() fails for very large values):
model = lm( y ~ exp(log10(x)))
mypred = predict(model)
plot(log(x),mypred)
I have tried
lm_coef <- coef(model)
plot(log10(x),lm_coef[1]*exp(-lm_coef[2]*x))
However, this gives me a decreasing exponential instead of an increasing one. My goal is to extract the equation of the fitted function so I can reuse the coefficients in another context. What equation is predict() using, and is there a way to see it?
I did something along the lines of:
Df<-data.frame(x=c(1,10,100,1000,10000,100000,1000000,3000000),
y=c(1,1,10,15,20,30,40,60))
model<-lm(data = Df, formula = y~log(x))
predict(model)
plot(log(Df$x),predict(model))
summary(model)
The relevant part of the output is:
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -6.0700     4.7262  -1.284 0.246386
log(x)        3.5651     0.5035   7.081 0.000398 ***
---
Your fitted equation is therefore y = 3.5651*log(x) - 6.0700.
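To reuse the coefficients outside predict(), a quick sketch with the model above:
b <- unname(coef(model))               # c(-6.0700, 3.5651)
f <- function(x) b[1] + b[2] * log(x)  # the fitted equation
all.equal(f(Df$x), unname(predict(model)))  # TRUE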

bootstrap standard errors of a linear regression in R

I have an lm object and I would like to bootstrap only its standard errors. In practice, I want to use only part of the sample (with replacement) at each replication and get a distribution of standard errors. Then, if possible, I would like to display the summary of the original linear regression but with the bootstrapped standard errors and the corresponding p-values (in other words, the same beta coefficients but different standard errors).
Edited: In summary, I want to "modify" my lm object so that it keeps the beta coefficients from the original fit on the full data but uses the bootstrapped standard errors (and associated t-stats and p-values) obtained by running the regression several times on different subsamples (with replacement).
So my lm object looks like
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  3.812793   0.095282  40.016  < 2e-16 ***
x           -0.904729   0.284243  -3.183  0.00147 **
z            0.599258   0.009593  62.466  < 2e-16 ***
x:z          0.091511   0.029704   3.081  0.00208 **
but the associated standard errors are wrong, and I would like to estimate them by replicating this linear regression 1000 times on different subsamples (with replacement).
Is there a way to do this? can anyone help me?
Thank you for your time.
Marco
What you ask can be done along the lines of the code below.
Since you have not posted an example dataset or the model to fit, I will use the built-in dataset mtcars and a simple formula with two continuous predictors.
library(boot)
boot_function <- function(data, indices, formula) {
  d <- data[indices, ]                  # resample rows with replacement
  obj <- lm(formula, d)
  coefs <- summary(obj)$coefficients
  coefs[, "Std. Error"]                 # return the standard errors only
}
set.seed(8527)
fmla <- as.formula("mpg ~ hp * cyl")
seboot <- boot(mtcars, boot_function, R = 1000, formula = fmla)
colMeans(seboot$t)
## [1] 6.511530646 0.068694001 1.000101450 0.008804784
I believe the code above can be adapted to most situations with a numeric response and numeric predictors.
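If instead you want the more conventional bootstrap standard error, i.e. the spread of the resampled coefficient estimates rather than the average of the per-sample standard errors, a sketch along the same lines:
coef_function <- function(data, indices, formula) {
  coef(lm(formula, data[indices, ]))    # refit on the resampled rows
}
cboot <- boot(mtcars, coef_function, R = 1000, formula = fmla)
boot_se <- apply(cboot$t, 2, sd)        # bootstrap SE per coefficient
fit0 <- lm(fmla, mtcars)                # original fit: betas stay unchanged
orig <- coef(fit0)
tval <- orig / boot_se
pval <- 2 * pt(-abs(tval), df = df.residual(fit0))
cbind(Estimate = orig, `Boot SE` = boot_se, `t value` = tval, `Pr(>|t|)` = pval)
This gives exactly what the question asks for: the original coefficients paired with bootstrapped standard errors and the corresponding t-stats and p-values.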

Exporting Linear Regression Results Including Confidence Intervals

Hey out there, how can I export a table of the results used to make the chart I generated for the linear regression model below?
d <- data.frame(x=c(200110,86933,104429,240752,255332,75998,
204302,97321,342812,220522,110990,259706,65733),
y=c(200000,110000,165363,225362,313284,113972,
137449,113106,409020,261733,171300,344437,89000))
lm1 <- lm(y ~ x, data = d)
p_conf1 <- predict(lm1, interval = "confidence")
nd <- data.frame(x = seq(0, 80000, length = 510000))
p_conf2 <- predict(lm1, interval = "confidence", newdata = nd)
plot(y ~ x, data = d, ylim = c(-21750, 600000), xlim = c(0, 600000)) ## data
abline(lm1) ## fit
matlines(d$x, p_conf1[, c("lwr", "upr")], col = 2, lty = 1, type = "b", pch = "+")
matlines(nd$x, p_conf2[, c("lwr", "upr")], col = 4, lty = 1, type = "b", pch = "+")
Still not entirely sure what you want, but this would seem to be reasonable:
dat1 <- data.frame(d,p_conf1)
dat2 <- data.frame(nd,y=NA,p_conf2)
write.csv(rbind(dat1,dat2),file="linpredout.csv")
It includes x, y (equal to the observed value, or NA for non-observed points), the predicted value fit, and the lwr/upr confidence bounds.
This will return a matrix that has some of the information needed to construct the confidence intervals:
> coef(summary(lm1))
                Estimate   Std. Error   t value     Pr(>|t|)
(Intercept) 21749.037058 2.665203e+04 0.8160369 4.317954e-01
x               1.046954 1.374353e-01 7.6177997 1.037175e-05
Any text on linear regression should have the formula for the confidence interval. You may need to calculate some ancillary quantities, depending on which formula you're using. The code for predict is visible; just type at the console:
predict.lm
And don't forget that confidence intervals are different from prediction intervals.
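For the coefficient confidence intervals specifically, the textbook formula estimate ± t(0.975, df) * SE can be applied directly to that matrix (a sketch):
cs <- coef(summary(lm1))
tcrit <- qt(0.975, df = df.residual(lm1))
cbind(lwr = cs[, "Estimate"] - tcrit * cs[, "Std. Error"],
      upr = cs[, "Estimate"] + tcrit * cs[, "Std. Error"])
confint(lm1)   # built-in equivalent for the coefficients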
