I have a lot of data, like the image below, which have a relationship with each other. I want to make an equation to describe this relationship, something like Power = a * WindSpeed ^ b . How can I use Bayesian Inference to find a and b? I want to use R for this.
Welcome to SO and good luck. Do not forget to pay attention to the comments (they are really relevant, and following them increases your chances of getting an answer).
Please see the example below of Bayesian univariate regression using the Bolstad package.
library(Bolstad)
# Simulation of Power vs Wind
# Power = 5 * Windspeed ^ 2
set.seed(123)
n <- 100
# y = Power
# x = WindSpeed
# e = error term
x <- (1:(25 * n))/ n
e <- rnorm(length(x)) / 10
# y = a * x ^ b
# log(y) = log(a) + b * log(x) + e
# or
# in exponential form
y <- exp(log(5) + e) * x ^ 2
# bayes univariate linear regression model
z <- bayes.lin.reg(log(y), log(x))
# Standard deviation of residuals: 0.0943
# Posterior Mean Posterior Std. Deviation
# -------------- ------------------------
# Intercept: 6.076 0.0059657
# Slope: 1.996 0.0062209
# ------------------------------------------------
# note: the intercept reported by the Bayesian regression
# is shifted by the mean of the predictor (the model is centred at mean(log(x)));
# this is accounted for as below
intercept_shifted <- z$intercept$mean - z$slope$mean * mean(log(x))
intercept_shifted
# [1] 1.617218
# validate with a standard linear model:
lm(log(y) ~ log(x))
# Coefficients:
# (Intercept) log(x)
# 1.617 1.996
a = exp(intercept_shifted)
a
# [1] 5.039051
b = z$slope$mean
b
# [1] 1.996134
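The intercept shift used above can be checked with base R alone. This is a minimal sketch, assuming (as the shift formula above implies) that the reported intercept refers to the predictor centred at its mean. Here lm() stands in for bayes.lin.reg(), and the names lx, fit_centred, fit_plain, log_a, a_hat and b_hat are illustrative, not part of the original answer.

```r
# Same simulated data as above: Power = 5 * WindSpeed^2 with multiplicative noise
set.seed(123)
x <- (1:2500) / 100
y <- exp(log(5) + rnorm(length(x)) / 10) * x^2
lx <- log(x)

# Regression on the centred predictor: the intercept is the value at mean(lx)
fit_centred <- lm(log(y) ~ I(lx - mean(lx)))
# Ordinary regression: the intercept is the value at lx = 0, i.e. log(a)
fit_plain <- lm(log(y) ~ lx)

# Shift the centred intercept back to lx = 0
log_a <- coef(fit_centred)[[1]] - coef(fit_centred)[[2]] * mean(lx)
a_hat <- exp(log_a)               # estimate of a
b_hat <- coef(fit_centred)[[2]]   # estimate of b

# The shifted intercept should agree with the plain fit
all.equal(log_a, coef(fit_plain)[[1]])
```

The slope is unaffected by centring, so only the intercept needs the correction before exponentiating.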
If I calculate the y value for a specific x value using predict() function I obtain a value different from the one I can calculate using the explicit fitting equation.
I fitted the data below using nls(MyEquation) and obtained the m1, m2,... parameters.
Then, I want to back-calculate the y value for a specific x value using either the predict(m) function or the explicit equation I used for fitting (plugging in the desired x value).
I obtain different y values for the same x value. Which one is the correct one?
> df
pH activity
1 3.0 0.88
2 4.0 1.90
3 5.0 19.30
4 6.0 70.32
5 7.0 100.40
6 7.5 100.00
7 8.0 79.80
8 9.0 7.75
9 10.0 1.21
x <- df$pH
y <- df$activity
m<-nls(y~(m1*(10^(-x))+m2*10^(-m3))/(10^(-m3)+10^(-x)) - (m5*(10^(-x))+1*10^(-i))/(10^(-i)+10^(-x)), start = list(m1=1,m2=100,m3=7,m5=1))
> m
Nonlinear regression model
model: y ~ (m1 * (10^(-x)) + m2 * 10^(-m3))/(10^(-m3) + 10^(-x)) - (m5 * (10^(-x)) + 1 * 10^(-i))/(10^(-i) + 10^(-x))
data: parent.frame()
m1 m2 m3 m5
-176.032 13.042 6.282 -180.704
residual sum-of-squares: 1522
Number of iterations to convergence: 14
Achieved convergence tolerance: 5.805e-06
list2env(as.list(coef(m)), .GlobalEnv)
#calculate y based on fitting parameters
# choose the 7th x value (i.e. x[7]) that corresponds to pH = 8
# (using predict)
> x_pH8 <- x[7]
> predict(m)[7]
[1] 52.14299
# (using the explicit fitting equation with the fitted parameters)
> x1 <- x_pH8
> (m1*(10^(-x1))+m2*10^(-m3))/(10^(-m3)+10^(-x1)) - (m5*(10^(-x1))+1*10^(-8.3))/(10^(-8.3)+10^(-x1))
[1] 129.5284
As you can see:
predict(m)[7] gives y = 52.14299 (for x = 8)
while
(m1*(10^(-x1))+m2*10^(-m3))/(10^(-m3)+10^(-x1)) - (m5*(10^(-x1))+1*10^(-8.3))/(10^(-8.3)+10^(-x1)) gives y = 129.5284 (for x = 8)
The value of i you use in the manual calculation is probably not the same as the one you use in the model fitting. I don't get any discrepancy:
x <- df$pH
y <- df$activity
i <- 8.3
m <- nls(y~(m1*(10^(-x))+m2*10^(-m3))/(10^(-m3)+10^(-x)) - (m5*(10^(-x))+1*10^(-i))/(10^(-i)+10^(-x)), start = list(m1=1,m2=100,m3=7,m5=1))
x <- 8
with(as.list(coef(m)),
(m1*(10^(-x))+m2*10^(-m3))/(10^(-m3)+10^(-x)) - (m5*(10^(-x))+1*10^(-i))/(10^(-i)+10^(-x)))
# [1] 75.46504
predict(m)[7]
# [1] 75.46504
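The effect described here can be reproduced on a toy model. This is a minimal sketch with simulated data, not the pH data above: the constant c0 plays the role of i (a value used in the formula but not estimated), and all names are illustrative.

```r
set.seed(1)
x <- 1:20
c0 <- 2   # constant that is NOT estimated, like `i` in the model above
y <- 5 * exp(-0.3 * x) + c0 + rnorm(length(x), sd = 0.01)

# c0 is looked up in the calling environment at fitting time
m <- nls(y ~ a * exp(-k * x) + c0, start = list(a = 4, k = 0.2))

# predict() without newdata returns the fitted values stored in the model
fitted7 <- predict(m)[7]

# Manual calculation with the same c0 agrees with predict()
manual_same <- with(as.list(coef(m)), a * exp(-k * x[7]) + c0)

# Manual calculation after c0 has changed no longer agrees
c0 <- 5
manual_changed <- with(as.list(coef(m)), a * exp(-k * x[7]) + c0)

c(fitted7, manual_same, manual_changed)
```

Passing such constants in a data list, or writing them literally in the formula, avoids this kind of silent mismatch.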
I am trying to calculate manually the r-squared given by lm() in R
Considering:
fit <- lm(obs_values ~ preds_values, df)
with sd(df$obs_values) == sd(df$preds_values) and mean(df$obs_values) == mean(df$preds_values)
To do so I can extract the residuals by doing
res_a = residuals(fit) and then inject them in the formula as :
y = sum( (df$obs_values - mean(df$obs_values))^2 )
r-squared = 1 - sum(res_a^2)/y
Here I get the expected r-squared
Now, I would like to get the residual manually.
It should be as trivial as :
res_b = df$obs_values - df$preds_values, but for some reason, res_b is different from res_a...
You can't just do y - x in a regression y ~ x to get residuals. Where have regression coefficients gone?
fit <- lm(y ~ x)
b <- coef(fit)
resi <- y - (b[1] + b[2] * x)
You have many options:
## Residuals manually
# option 1
beta_hat <- coef(fit)
obs_values_hat <- beta_hat["(Intercept)"] + beta_hat["preds_values"] * preds_values
u_hat <- obs_values - obs_values_hat # residuals
# option 2
obs_values_hat <- fitted(fit)
u_hat <- obs_values - obs_values_hat # residuals
# (option 3 - not manually) or just u_hat <- resid(fit)
## R-squared manually
# option 1
var(obs_values_hat) / var(obs_values)
# option 2
1 - var(u_hat) / var(obs_values)
# option 3
cor(obs_values, obs_values_hat)^2
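To see the options above agree on real numbers, here is a small sketch using R's built-in cars dataset instead of the data from the question (dist and speed stand in for obs_values and preds_values; the names res_manual, res_naive and r2_manual are illustrative).

```r
fit <- lm(dist ~ speed, data = cars)
b <- coef(fit)

# Residuals need the fitted coefficients: y - (intercept + slope * x)
res_manual <- cars$dist - (b[["(Intercept)"]] + b[["speed"]] * cars$speed)

# Plain y - x ignores the coefficients and is NOT the residual vector
res_naive <- cars$dist - cars$speed

# R-squared from the manual residuals
r2_manual <- 1 - sum(res_manual^2) / sum((cars$dist - mean(cars$dist))^2)

all.equal(res_manual, unname(resid(fit)))    # manual residuals match resid()
all.equal(r2_manual, summary(fit)$r.squared) # manual R-squared matches summary()
```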
Take for instance the following one-knot, degree two, spline:
library(splines)
library(ISLR)
fit.spline <- lm(wage~bs(age, knots=c(42), degree=2), data=Wage)
summary(fit.spline)
I see estimates that I don't expect.
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 57.349 3.950 14.518 < 2e-16 ***
bs(age, knots = c(42), degree = 2)1 59.511 5.786 10.285 < 2e-16 ***
bs(age, knots = c(42), degree = 2)2 65.722 4.076 16.122 < 2e-16 ***
bs(age, knots = c(42), degree = 2)3 37.170 9.722 3.823 0.000134 ***
Is there a way to extract the quadratic model (and its coefficients) for before and after the knot? That is, how can I extract the two quadratic models before and after the cut point of age = 42?
Using summary(fit.spline) yields coefficients, but (to my understanding) they are not meaningful for interpretation.
I was constantly asked to wrap up the idea in my original answer into a user-friendly function, able to reparametrize a fitted linear or generalized linear model with a bs or ns term. Eventually I rolled out a small R package SplinesUtils at https://github.com/ZheyuanLi/SplinesUtils (with a PDF version package manual). You can install it via
## make sure you have the `devtools` package available
devtools::install_github("ZheyuanLi/SplinesUtils")
The function to be used here is RegBsplineAsPiecePoly.
library(SplinesUtils)
library(splines)
library(ISLR)
fit.spline <- lm(wage ~ bs(age, knots=c(42), degree=2), data = Wage)
ans1 <- RegBsplineAsPiecePoly(fit.spline, "bs(age, knots = c(42), degree = 2)")
ans1
#2 piecewise polynomials of degree 2 are constructed!
#Use 'summary' to export all of them.
#The first 2 are printed below.
#8.2e-15 + 4.96 * (x - 18) - 0.0991 * (x - 18) ^ 2
#61.9 + 0.2 * (x - 42) - 0.0224 * (x - 42) ^ 2
## coefficients as a matrix
ans1$PiecePoly$coef
# [,1] [,2]
#[1,] 8.204641e-15 61.91542748
#[2,] 4.959286e+00 0.20033307
#[3,] -9.914485e-02 -0.02240887
## knots
ans1$knots
#[1] 18 42 80
By default the function parametrizes the piecewise polynomials in shifted form (see ?PiecePoly). You can set shift = FALSE for a non-shifted version.
ans2 <- RegBsplineAsPiecePoly(fit.spline, "bs(age, knots = c(42), degree = 2)",
shift = FALSE)
ans2
#2 piecewise polynomials of degree 2 are constructed!
#Use 'summary' to export all of them.
#The first 2 are printed below.
#-121 + 8.53 * x - 0.0991 * x ^ 2
#14 + 2.08 * x - 0.0224 * x ^ 2
## coefficients as a matrix
ans2$PiecePoly$coef
# [,1] [,2]
#[1,] -121.39007747 13.97219046
#[2,] 8.52850050 2.08267822
#[3,] -0.09914485 -0.02240887
You can predict the splines with predict.
xg <- 18:80
yg1 <- predict(ans1, xg) ## use shifted form
yg2 <- predict(ans2, xg) ## use non-shifted form
all.equal(yg1, yg2)
#[1] TRUE
But since there is an intercept in the model, the predicted values would differ from model prediction by the intercept.
yh <- predict(fit.spline, data.frame(age = xg))
intercept <- coef(fit.spline)[[1]]
all.equal(yh, yg1 + intercept, check.attributes = FALSE)
#[1] TRUE
The package has summary, print, plot, predict and solve methods for a "PiecePoly" class. Explore the package for more.
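If installing a package is not an option, the per-piece quadratics can also be recovered with base R and the splines package that ships with R. This is a different technique from RegBsplineAsPiecePoly: because the fitted spline is exactly quadratic on each side of the knot, regressing its predictions on x and x^2 within one piece returns that piece's coefficients. The sketch below uses simulated data rather than ISLR's Wage, the helper piece_coef is hypothetical, and the recovered coefficients include the model intercept.

```r
library(splines)

set.seed(42)
age <- runif(300, 18, 80)
wage <- 60 + 2 * age - 0.02 * age^2 + rnorm(300, sd = 5)
fit <- lm(wage ~ bs(age, knots = 42, degree = 2))

# Recover c0 + c1 * x + c2 * x^2 for the piece on [from, to]
piece_coef <- function(from, to) {
  xg <- seq(from, to, length.out = 50)
  yh <- predict(fit, data.frame(age = xg))   # spline values on this piece
  quad <- lm(yh ~ xg + I(xg^2))              # exact fit: the piece is quadratic
  setNames(coef(quad), c("c0", "c1", "c2"))
}

left  <- piece_coef(min(age), 42)   # quadratic before the knot
right <- piece_coef(42, max(age))   # quadratic after the knot

# Both pieces should give the same value at the knot (continuity)
c(sum(left * c(1, 42, 42^2)), sum(right * c(1, 42, 42^2)))
```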
Given below is the code for analysis of a resolvable alpha design (alpha lattice design) using the R package asreml.
# load the data
library(agridat)
data(john.alpha)
dat <- john.alpha
# load asreml
library(asreml)
# model1 - random `gen`
#----------------------
# fitting the model
model1 <- asreml(yield ~ 1 + rep, data=dat, random=~ gen + rep:block)
# variance due to `gen`
sg2 <- summary(model1 )$varcomp[1,'component']
# mean variance of a difference of two BLUPs
vblup <- predict(model1 , classify="gen")$avsed ^ 2
# model2 - fixed `gen`
#----------------------
model2 <- asreml(yield ~ 1 + gen + rep, data=dat, random = ~ rep:block)
# mean variance of a difference of two adjusted treatment means (BLUE)
vblue <- predict(model2 , classify="gen")$avsed ^ 2
# H^2 = .803
sg2 / (sg2 + vblue/2)
# H^2c = .809
1-(vblup / 2 / sg2)
I am trying to replicate the above using the R package lme4.
# model1 - random `gen`
#----------------------
# fitting the model
model1 <- lmer(yield ~ 1 + (1|gen) + rep + (1|rep:block), dat)
# variance due to `gen`
varcomp <- VarCorr(model1)
varcomp <- data.frame(print(varcomp, comp = "Variance"))
sg2 <- varcomp[varcomp$grp == "gen",]$vcov
# model2 - fixed `gen`
#----------------------
model2 <- lmer(yield ~ 1 + gen + rep + (1|rep:block), dat)
How to compute the vblup and vblue (mean variance of difference) in lme4 equivalent to predict()$avsed ^ 2 of asreml ?
I'm not that familiar with this variance partitioning stuff, but I'll take a shot.
library(lme4)
model1 <- lmer(yield ~ 1 + rep + (1|gen) + (1|rep:block), john.alpha)
model2 <- update(model1, . ~ . + gen - (1|gen))
## variance due to `gen`
sg2 <- c(VarCorr(model1)[["gen"]]) ## 0.142902
Get conditional variances of BLUPs:
rr1 <- ranef(model1,condVar=TRUE)
vv1 <- attr(rr1$gen,"postVar")
str(vv1)
## num [1, 1, 1:24] 0.0289 0.0289 0.0289 0.0289 0.0289 ...
This is a 1x1x24 array (effectively just a vector of variances; we could collapse using c() if we needed to). They're not all the same, but they're pretty close ... I don't know whether they should all be identical (and this is a roundoff issue)
(uv <- unique(vv1))
## [1] 0.02887451 0.02885887 0.02885887
The relative variation is approximately 5.4e-4 ...
If these were all the same then the mean variance of a difference of any two would be just twice the variance (Var(x-y) = Var(x)+Var(y); by construction the BLUPs are all independent). I'm going to go ahead and use this.
vblup <- 2*mean(vv1)
For the model with gen fitted as a fixed effect, let's extract the variances of the parameters relating to genotypes (which are differences in the expected value from the first level):
vv2 <- diag(vcov(model2))[-(1:3)]
summary(vv2)
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.06631 0.06678 0.07189 0.07013 0.07246 0.07286
I'm going to take the means of these values (not double the values, since these are already the variances of differences)
vblue <- mean(vv2)
sg2/(sg2+vblue/2) ## 0.8029779
1-(vblup/2/sg2) ## 0.7979965
The H^2 estimate looks right on, but the H^2c estimate is a little different (0.798 vs. 0.809, about a 1.4% relative difference); I don't know if that is big enough to be of concern or not.
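For reference, the two quantities computed in the code can be written out as formulas. The notation is my own shorthand: \sigma_g^2 is sg2, \bar v_{BLUE} is vblue and \bar v_{BLUP} is vblup. The plugged-in numbers are the rounded values printed above (mean of vv2 from summary(), and roughly 0.0289 for the conditional variances), so the arithmetic is only approximate.

```latex
H^2 = \frac{\sigma_g^2}{\sigma_g^2 + \bar v_{\mathrm{BLUE}}/2}
    \approx \frac{0.1429}{0.1429 + 0.0701/2} \approx 0.803,
\qquad
H^2_C = 1 - \frac{\bar v_{\mathrm{BLUP}}}{2\,\sigma_g^2}
      \approx 1 - \frac{2 \times 0.0289}{2 \times 0.1429} \approx 0.798
```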
I have a list of formulas I want to fit to data. Rather than running a loop, I'd like to do this all at once, for performance's sake. The estimations should still be separate; I'm not trying to estimate a SUR or anything.
The following code does what I want
x <- matrix(rnorm(300),ncol=3)
y <- x %*% c(1,2,3)+rnorm(100)
formulae <-list(y~x[,1],
y~x[,2],
y~x[,1] + x[,2])
lapply(formulae,lm)
Unfortunately this gets somewhat slow as the length of formulae increases. Is there a way to truly vectorize this?
If it is any help, the only results of lm I care about are coefficients, and some standard errors.
As I said in my comment, what you really need is a fitting routine that is more efficient than lm() yet still stable. Here I provide a well-tested one that I wrote myself, called lm.chol(). It takes a formula and data, and returns:
a coefficient summary table, as you normally see in summary(lm(...))$coef;
Pearson estimate of residual standard error, as you get from summary(lm(...))$sigma;
adjusted-R.squared, as you get from summary(lm(...))$adj.r.squared.
## linear model estimation based on pivoted Cholesky factorization with Jacobi preconditioner
lm.chol <- function(formula, data) {
## stage0: get response vector and model matrix
## we did not follow the normal route: match.call, model.frame, model.response, model matrix, etc
y <- data[[as.character(formula[[2]])]]
X <- model.matrix(formula, data)
n <- nrow(X); p <- ncol(X)
## stage 1: XtX and Jacobi diagonal preconditioner
XtX <- crossprod(X)
D <- 1 / sqrt(diag(XtX))
## stage 2: pivoted Cholesky factorization
R <- suppressWarnings(chol(t(D * t(D * XtX)), pivot = TRUE))
piv <- attr(R, "pivot")
r <- attr(R, "rank")
if (r < p) {
warning("Model is rank-deficient!")
piv <- piv[1:r]
R <- R[1:r, 1:r]
}
## stage 3: solve linear system for coefficients
D <- D[piv]
b <- D * crossprod(X, y)[piv]
z <- forwardsolve(t(R), b)
RSS <- sum(y * y) - sum(z * z)
sigma <- sqrt(RSS / (n - r))
para <- D * backsolve(R, z)
beta.hat <- rep(NA, p)
beta.hat[piv] <- para
## stage 4: get standard error
Rinv <- backsolve(R, diag(r))
se <- rep(NA, p)
se[piv] <- D * sqrt(rowSums(Rinv * Rinv)) * sigma
## stage 5: t-statistic and p-value
t.statistic <- beta.hat / se
p.value <- 2 * pt(-abs(t.statistic), df = n - r)
## stage 6: construct coefficient summary matrix
coefficients <- matrix(c(beta.hat, se, t.statistic, p.value), ncol = 4L)
colnames(coefficients) <- c("Estimate", "Std. Error", "t value", "Pr(>|t|)")
rownames(coefficients) <- colnames(X)
## stage 7: compute adjusted R.squared
adj.R2 <- 1 - sigma * sigma / var(y)
## return model fitting results
attr(coefficients, "sigma") <- sigma
attr(coefficients, "adj.R2") <- adj.R2
coefficients
}
Here I would offer three examples.
Example 1: full rank linear model
We take R's built-in dataset trees as an example.
# using `lm()`
summary(lm(Height ~ Girth + Volume, trees))
#Coefficients:
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) 83.2958 9.0866 9.167 6.33e-10 ***
#Girth -1.8615 1.1567 -1.609 0.1188
#Volume 0.5756 0.2208 2.607 0.0145 *
#---
#Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#Residual standard error: 5.056 on 28 degrees of freedom
#Multiple R-squared: 0.4123, Adjusted R-squared: 0.3703
#F-statistic: 9.82 on 2 and 28 DF, p-value: 0.0005868
## using `lm.chol()`
lm.chol(Height ~ Girth + Volume, trees)
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) 83.2957705 9.0865753 9.166905 6.333488e-10
#Girth -1.8615109 1.1566879 -1.609346 1.187591e-01
#Volume 0.5755946 0.2208225 2.606594 1.449097e-02
#attr(,"sigma")
#[1] 5.056318
#attr(,"adj.R2")
#[1] 0.3702869
The results are exactly the same!
Example 2: rank-deficient linear model
## toy data
set.seed(0)
dat <- data.frame(y = rnorm(100), x1 = runif(100), x2 = rbeta(100,3,5))
dat$x3 <- with(dat, (x1 + x2) / 2)
## using `lm()`
summary(lm(y ~ x1 + x2 + x3, dat))
#Coefficients: (1 not defined because of singularities)
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) 0.2164 0.2530 0.856 0.394
#x1 -0.1526 0.3252 -0.469 0.640
#x2 -0.3534 0.5707 -0.619 0.537
#x3 NA NA NA NA
#Residual standard error: 0.8886 on 97 degrees of freedom
#Multiple R-squared: 0.0069, Adjusted R-squared: -0.01358
#F-statistic: 0.337 on 2 and 97 DF, p-value: 0.7147
## using `lm.chol()`
lm.chol(y ~ x1 + x2 + x3, dat)
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) 0.2164455 0.2529576 0.8556595 0.3942949
#x1 NA NA NA NA
#x2 -0.2007894 0.6866871 -0.2924030 0.7706030
#x3 -0.3051760 0.6504256 -0.4691944 0.6399836
#attr(,"sigma")
#[1] 0.8886214
#attr(,"adj.R2")
#[1] -0.01357594
#Warning message:
#In lm.chol(y ~ x1 + x2 + x3, dat) : Model is rank-deficient!
Here, lm.chol() (Cholesky factorization with complete pivoting) and lm() (QR factorization with partial pivoting) have set different coefficients to NA. But the two estimations are equivalent, with the same fitted values and residuals.
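The equivalence can be checked without lm.chol() by fitting the two reduced models directly with lm(). This sketch reuses the toy data from this example; since x3 = (x1 + x2) / 2, dropping x3 or dropping x1 spans the same column space. The names fit_a, fit_b and same_fit are illustrative.

```r
set.seed(0)
dat <- data.frame(y = rnorm(100), x1 = runif(100), x2 = rbeta(100, 3, 5))
dat$x3 <- with(dat, (x1 + x2) / 2)

fit_a <- lm(y ~ x1 + x2, dat)   # the parametrization lm() kept (x3 dropped)
fit_b <- lm(y ~ x2 + x3, dat)   # the parametrization lm.chol() kept (x1 dropped)

# Same fitted values, hence same residuals
same_fit <- all.equal(fitted(fit_a), fitted(fit_b))
same_fit

# The coefficients map onto each other via x1 = 2 * x3 - x2
c(x3 = 2 * coef(fit_a)[["x1"]],
  x2 = coef(fit_a)[["x2"]] - coef(fit_a)[["x1"]])
coef(fit_b)[c("x3", "x2")]
```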
Example 3: performance for large linear models
n <- 10000; p <- 300
set.seed(0)
dat <- as.data.frame(setNames(replicate(p, rnorm(n), simplify = FALSE), paste0("x",1:p)))
dat$y <- rnorm(n)
## using `lm()`
system.time(lm(y ~ ., dat))
# user system elapsed
# 3.212 0.096 3.315
## using `lm.chol()`
system.time(lm.chol(y ~ ., dat))
# user system elapsed
# 1.024 0.028 1.056
lm.chol() is about 3 times faster than lm() in this timing. If you want to know the reason, read this answer of mine.
Remark
I have focused on improving the performance of the computational kernel. You can go one step further by using Ben Bolker's parallelism suggestion. If my approach gives a 3-fold speedup, and parallel computing gives another 3-fold speedup on 4 cores, you end up with a 9-fold speedup!
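If only coefficients and standard errors are needed, another lightweight option in base R is lm.fit(), which works on a model matrix directly and skips the formula and model-frame handling. The sketch below is not lm.chol(); it assumes full-rank design matrices, and the helper fit_one and the list col_sets (which columns of x enter each model) are hypothetical names.

```r
set.seed(1)
x <- matrix(rnorm(300), ncol = 3)
y <- drop(x %*% c(1, 2, 3) + rnorm(100))

# Each element lists the columns of x used in one model
col_sets <- list(1, 2, c(1, 2))

fit_one <- function(cols) {
  X <- cbind(1, x[, cols, drop = FALSE])          # intercept + chosen columns
  f <- lm.fit(X, y)
  sigma2 <- sum(f$residuals^2) / f$df.residual    # residual variance
  # (X'X)^{-1} from the R factor of the QR decomposition (full rank assumed)
  se <- sqrt(diag(chol2inv(qr.R(f$qr))) * sigma2)
  cbind(estimate = unname(f$coefficients), se = se)
}

results <- lapply(col_sets, fit_one)
results[[3]]
```

The loop over models is still there; the saving comes from doing less work per model.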
There's not really an easy way to vectorize this, but the pdredge function from the MuMIn package gives you a pretty easy way to parallelize it (this assumes you have multiple cores on your machine or that you can set up a local cluster in one of the ways supported by the parallel package).
library(parallel)
clust <- makeCluster(2,"PSOCK")
library(MuMIn)
Construct data:
set.seed(101)
x <- matrix(rnorm(300),ncol=3)
y <- x %*% c(1,2,3)+rnorm(100)
It will be easier to do this with a named data frame rather than an anonymous matrix:
df <- setNames(data.frame(y,x),c("y",paste0("x",1:3)))
The cluster nodes all need access to the data set:
clusterExport(clust,"df")
Fit the full model (you could use y~. to fit all variables)
full <- lm(y~x1+x2,data=df,na.action=na.fail)
Now fit all submodels (see ?MuMIn::dredge for many more options to control which submodels are fitted)
p <- pdredge(full,cluster=clust)
coef(p)
## (Intercept) x1 x2
## 3 -0.003805107 0.7488708 2.590204
## 2 -0.028502039 NA 2.665305
## 1 -0.101434662 1.0490816 NA
## 0 -0.140451160 NA NA