I am dealing with the relationship:
y = h * R + x * v * h
where:
x = (N - M) * exp(-Q * u) + M
which gives the principal equation:
y = h * R + v * h * (N - M) * exp(-Q * u) + v * h * M
All uppercase letters are constants, and all lowercase letters are variables.
I have real data for all the variables, but I either do not know the values of the constants (R and Q), or want to check the ability of the data to fit the values of the constants (N and M). I want to use nls() to fit the equation using the data for the variables, to estimate these constant parameters.
How do I write code using the nls() function to depict the principal equation, to allow estimation of the parameters R, N, Q, and M from the mock measurement data? (Mock measurement data = lower cases letters with _j suffix, see below.)
To create mock data:
library(dplyr)
library(ggplot2)
### Generate mock data
# Equations:
# y = h*R + x*v*h
# x = (N-M)*exp(-Q*u) + M
# y = h*R + ((N-M)*exp(-Q*u) + M)*v*h
# y = h*R + v*h*(N-M)*exp(-Q*u) + v*h*M
### Variables have varying periodicity,
# and so can be approximated via different functions,
# with unique noise added to each to simulate variability:
# Variability for each variable
n <- 1000 # number of data points
t <- seq(0,4*pi,length.out = 1000)
a <- 3
b <- 2
y.norm <- rnorm(n)
u.norm <- rnorm(n)
u.unif <- runif(n)
v.norm <- rnorm(n)
v.unif <- runif(n)
amp <- 1
# Create reasonable values of mock variable data for all variables except h;
# I will calculate from known fixed values for R, N, Q, and M.
y <- 1.5*a*sin(b*t)+y.norm*amp-10 # Gaussian/normal error
u <- ((1*a*sin(11*b*t)+u.norm*amp)+(0.5*a*sin(13*b*t)+u.unif*amp)+7)/2
v <- 1/((2*a*sin(11*b*t)+v.norm*amp)+(1*a*sin(13*b*t)+v.unif*amp)+20)*800-25
# Put vectors in dataframe
dat <- data.frame("t" = t, "y" = y, "u" = u, "v" = v)
### Create reasonable values for constants:
R=0.5
N=1.12
Q=0.8
M=1
### Define final variable based on these constants and the previous
# mock variable data:
dat$h = y/(R + v*(N-M)*exp(-Q*dat$u))
### Gather data to plot relationships:
dat_gathered <- dat %>%
gather(-t, value = "value", key = "key")
### Plot data to check all mock variables:
ggplot(dat_gathered, aes(x = t, y = value, color = key)) + geom_line()
# Add small error (to simulate measurement error):
dat <- dat %>%
mutate(h_j = h + rnorm(h, sd=0.05)/(1/h)) %>%
mutate(u_j = u + rnorm(u, sd=0.05)/(1/u)) %>%
mutate(v_j = v + rnorm(v, sd=0.05)/(1/v)) %>%
mutate(y_j = y + rnorm(y, sd=0.05)/(1/y))
nls appears to work OK, but it looks like the solution (in terms of parameters) is non-unique ... or I made a mistake somewhere.
## parameter values chosen haphazardly
n1 <- nls(y ~ h_j*(R + v_j*((N-M)*exp(-Q*u_j)+M)),
start=list(R=1,N=2,M=1,Q=1),
data=dat)
## starting from known true values
true_vals <- c(R=0.5,N=1.12,Q=0.8,M=1)
n2 <- update(n1, start=as.list(true_vals))
round(cbind(coef(n1),coef(n2),true_vals),3)
true_vals
R 0.495 0.495 0.50
N 0.120 0.120 1.12
M 0.001 0.818 0.80
Q 0.818 0.001 1.00
Using AIC() on the two fits shows they have essentially equivalent goodness of fits (and the predictions are almost identical), which suggests that there's some symmetry in your model that allows M and Q to be interchanged. I haven't thought about/looked at the equation hard enough to know why this would be the case.
Related
I have a fairly simple equation, in which I have direct measurements of the variables through time, and two different unknown parameters I need to solve for, but which I know can be considered constants over the time periods I'm studying.
Both of these "constants" have fairly narrow ranges of variability in nature. In principle, it seems like some kind of optimization procedure/function should be able to do this easily, by finding the pair of values that minimizes the standard deviation of each of the constant values across the time series.
However, I am new to optimization and parameter fitting. Any help figuring out how to use r code to find the pair (or pairs) of values in this situation would be greatly appreciated.
Below is a simplified form of the equation I'm dealing with:
A * x + B * z - B * d = c + e
A and B are the constants I need to solve for.
Possible real-world values of A are 0.4-0.8
Possible real-world values of B are 0.85-0.99
To create a reasonable mock data set, assuming perfect measurements of all variables, and known values of A and B:
### Generate mock data
### Variables all have a daily cycle and are strongly autocorrelated,
# and so can be approximated via sin function,
# with unique noise added to each to simulate variability:
# Variability for each variable
n <- 1000 # number of data points
t <- seq(0,4*pi,length.out = 1000)
a <- 3
b <- 2
x.unif <- runif(n)
z.norm <- rnorm(n)
c.unif <- runif(n)
d.norm <- rnorm(n)
d.unif <- runif(n)
e.norm <- rnorm(n)
amp <- 1
# Create reasonable values of mock variable data for all variables except e;
# I will calculate from known fixed values for A and B.
x <- a*sin(b*t)+x.unif*amp + 10 # uniform error
z <- a*sin(b*t)+z.norm*amp + 10 # Gaussian/normal error
c <- ((a*sin(b*t)+c.unif*amp) + 10)/4
d <- ((a*sin(b*t)+d.norm*amp)+(a*sin(b*t)+d.unif*amp)+10)/2
# Put vectors in dataframe
dat <- data.frame("t" = t, "x" = x, "z" = z, "c" = c, "d" = d)
# Equation: A*x + B*z - B*d = c + e
# Solve for e:
# e = A*x + B*z - B*d - c
# Specify "true" values for A and B:
A = 0.6
B = 0.9
# Solve for e:
dat <- dat %>%
mutate(e = A*x + B*z - B*d - c)
# Gather data for easy visualizing of results for e:
dat_gathered <- dat %>%
gather(-t, value = "value", key = "key")
# Plot all variables
ggplot(dat_gathered, aes(x = t, y = value, color = key)) + geom_line()
# Add small error (to simulate measurement error) to all variables except A and B:
dat <- dat %>%
mutate(x_j = x + rnorm(x, sd=0.02)/(1/x)) %>%
mutate(z_j = z + rnorm(z, sd=0.02)/(1/z)) %>%
mutate(c_j = c + rnorm(c, sd=0.02)/(1/c)) %>%
mutate(d_j = d + rnorm(d, sd=0.02)/(1/d)) %>%
mutate(e_j = e + rnorm(e, sd=0.02)/(1/e))
The variables in dat with the _j suffix represent real world data (since they have measurement error added). Knowing the constraint that:
A is within 0.4-0.8
B is within 0.85-0.99
Is it possible to use the noisy "_j" data to optimize for the pair of constant values that minimize deviation of A and B across the entire time series?
A little bit of algebra and setting this up as a linear regression problem with no intercept seems to work fine:
m1 <- lm(e_j+c_j ~ 0 + x_j + I(z_j-d_j), data=dat)
coef(m1) ## A =0.6032, B = 0.8916
It doesn't do anything to constrain the solution, though.
I have existing code for fitting a sigmoid curve to data in R. How can I used selfstart (or another method) to automatically find start values for the regression?
sigmoid = function(params, x) {
params[1] / (1 + exp(-params[2] * (x - params[3])))
}
dataset = data.frame("x" = 1:53, "y" =c(0,0,0,0,0,0,0,0,0,0,0,0,0,0.1,0.18,0.18,0.18,0.33,0.33,0.33,0.33,0.41,0.41,0.41,0.41,0.41,0.41,0.5,0.5,0.5,0.5,0.68,0.58,0.58,0.68,0.83,0.83,0.83,0.74,0.74,0.74,0.83,0.83,0.9,0.9,0.9,1,1,1,1,1,1,1) )
x = dataset$x
y = dataset$y
# fitting code
fitmodel <- nls(y~a/(1 + exp(-b * (x-c))), start=list(a=1,b=.5,c=25))
# visualization code
# get the coefficients using the coef function
params=coef(fitmodel)
y2 <- sigmoid(params,x)
plot(y2,type="l")
points(y)
This is a common (and interesting) problem in non-linear curve fitting.
Background
We can find sensible starting values if we take a closer look at the function sigmoid
We first note that
So for large values of x, the function approaches a. In other words, as a starting value for a we may choose the value of y for the largest value of x.
In R language, this translates to y[which.max(x)].
Now that we have a starting value for a, we need to decide on starting values for b and c. To do that, we can make use of the geometric series
and expand f(x) = y by keeping only the first two terms
We now set a = 1 (our starting value for a), re-arrange the equation and take the logarithm on both sides
We can now fit a linear model of the form log(1 - y) ~ x to obtain estimates for the slope and offset, which in turn provide the starting values for b and c.
R implementation
Let's define a function that takes as an argument the values x and y and returns a list of parameter starting values
start_val_sigmoid <- function(x, y) {
fit <- lm(log(y[which.max(x)] - y + 1e-6) ~ x)
list(
a = y[which.max(x)],
b = unname(-coef(fit)[2]),
c = unname(-coef(fit)[1] / coef(fit)[2]))
}
Based on the data for x and y you give, we obtain the following starting values
start_val_sigmoid(x, y)
#$a
#[1] 1
#
#$b
#[1] 0.2027444
#
#$c
#[1] 15.01613
Since start_val_sigmoid returns a list we can use its output directly as the start argument in nls
nls(y ~ a / ( 1 + exp(-b * (x - c))), start = start_val_sigmoid(x, y))
#Nonlinear regression model
# model: y ~ a/(1 + exp(-b * (x - c)))
# data: parent.frame()
# a b c
# 1.0395 0.1254 29.1725
# residual sum-of-squares: 0.2119
#
#Number of iterations to convergence: 9
#Achieved convergence tolerance: 9.373e-06
Sample data
dataset = data.frame("x" = 1:53, "y" =c(0,0,0,0,0,0,0,0,0,0,0,0,0,0.1,0.18,0.18,0.18,0.33,0.33,0.33,0.33,0.41,0.41,0.41,0.41,0.41,0.41,0.5,0.5,0.5,0.5,0.68,0.58,0.58,0.68,0.83,0.83,0.83,0.74,0.74,0.74,0.83,0.83,0.9,0.9,0.9,1,1,1,1,1,1,1) )
x = dataset$x
y = dataset$y
suppose I have a data like y and I fit a smooth function to this data with Fourier basis
y<- c(1,2,5,8,9,2,5)
x <- seq_along(y)
Fo <- create.fourier.basis(c(0, 7), 4)
precfd = smooth.basis(x,y,Fo)
plotfit.fd(y, x, precfd$fd)
precfd <- smooth.basis(x, y, Fo);coef(precfd)
the out put of last line gives me this:
const 411.1060285
sin1 -30.5584033
cos1 6.5740933
sin2 26.2855849
cos2 -26.0153965
I know what is the coefficient but what in const? in original formula there is no constant part as this link say:
http://lampx.tugraz.at/~hadley/num/ch3/3.3a.php
The first basis function in create.fourier.basis is a constant function to allow for a non-zero mean (intercept) in the data. From the documentation of the create.fourier.basis function:
The first basis function is the unit function with the value one everywhere. The next two are the sine/cosine pair with period defined in the argument period. The fourth and fifth are the sin/cosine series with period one half of period. And so forth. The number of basis functions is usually odd.
You can drop the first (unit) basis function in create.fourier.basis with the argument dropind = 1. Below some example code that illustrates which basis functions are used in create.fourier.basis. Note: the scaling of the basis functions depends on the period argument in create.fourier.basis.
Example 1: non-zero mean
library(fda)
## time sequence
tt <- seq(from = 0, to = 1, length = 100)
## basis functions
phi_0 <- 1
phi_1 <- function(t) sin(2 * pi * t) / sqrt(1 / 2)
phi_2 <- function(t) cos(2 * pi * t) / sqrt(1 / 2)
## signal
f1 <- 10 * phi_0 + 5 * phi_1(tt) - 5 * phi_2(tt)
## noise
eps <- rnorm(100)
## data
X1 <- f1 + eps
## create Fourier basis with intercept
four.basis1 <- create.fourier.basis(rangeval = range(tt), nbasis = 3)
## evaluate values basis functions
## eval.basis(tt, four.basis1)
## fit Fourier basis to data
four.fit1 <- smooth.basis(tt, X1, four.basis1)
coef(four.fit1)
Example 2: zero mean
## signal
f2 <- 5 * phi_1(tt) - 5 * phi_2(tt)
## data
X2 <- f2 + eps
## create Fourier basis without intercept
four.basis2 <- create.fourier.basis(rangeval = range(tt), nbasis = 3, dropind = 1)
## evaluate values basis functions
## eval.basis(tt, four.basis2)
## fit Fourier basis to data
four.fit2 <- smooth.basis(tt, X2, four.basis2)
coef(four.fit2)
I am interested to build a R function that I can use to test the limits of the Taylor series approximation. I am aware that there is limits to what I am doing, but it's exactly those limits I wish to investigate.
I have two normally distributed random variables x and y. x has a mean of 7 and a standard deviation (sd) of 1. y has a mean of 5 and a sd of 4.
me.x <- 4; sd.x <- 1
me.y <- 5; sd.y <- 4
I know how to estimate the mean ratio of y/x, like this
# E(y/x) = E(y)/E(x) - Cov(y,x)/E(x)^2 + Var(x)*E(y)/E(x)^3
me.y/me.x - 0/me.x^2 + sd.x*me.y/me.x^3
[1] 1.328125
I am however stuck on how to estimate the Standard Deviation of the ratio? I realize I have to use a Taylor expansion, but not how to use it.
Doing a simple simulation I get
x <- rnorm(10^4, mean = 4, sd = 1); y <- rnorm(10^4, mean = 5, sd = 4)
sd(y/x)
[1] 2.027593
mean(y/x)[1]
1.362142
There is an analytical expression for the PDF of the ratio of two gaussians, done
by David Hinkley (e.g. see Wikipedia). So we could compute all momentums, means etc. I typed it and apparently it clearly doesn't have finite second momentum, thus it doesn't have finite standard deviation. Note, I've denoted your Y gaussian as my X, and your X as my Y (formulas assume X/Y). I've got mean value of ratio pretty close to the what you've got from simulation, but last integral is infinite, sorry. You could sample more and more values, but from sampling std.dev is growing as well, as noted by #G.Grothendieck
library(ggplot2)
m.x <- 5; s.x <- 4
m.y <- 4; s.y <- 1
a <- function(x) {
sqrt( (x/s.x)^2 + (1.0/s.y)^2 )
}
b <- function(x) {
(m.x*x)/s.x^2 + m.y/s.y^2
}
c <- (m.x/s.x)^2 + (m.y/s.y)^2
d <- function(x) {
u <- b(x)^2 - c*a(x)^2
l <- 2.0*a(x)^2
exp( u / l )
}
# PDF for the ratio of the two different gaussians
PDF <- function(x) {
r <- b(x)/a(x)
q <- pnorm(r) - pnorm(-r)
(r*d(x)/a(x)^2) * (1.0/(sqrt(2.0*pi)*s.x*s.y)) * q + exp(-0.5*c)/(pi*s.x*s.y*a(x)^2)
}
# normalization
nn <- integrate(PDF, -Inf, Inf)
nn <- nn[["value"]]
# plot PDF
p <- ggplot(data = data.frame(x = 0), mapping = aes(x = x))
p <- p + stat_function(fun = function(x) PDF(x)/nn) + xlim(-2.0, 6.0)
print(p)
# first momentum
m1 <- integrate(function(x) x*PDF(x), -Inf, Inf)
m1 <- m1[["value"]]
# mean
print(m1/nn)
# some sampling
set.seed(32345)
n <- 10^7L
x <- rnorm(n, mean = m.x, sd = s.x); y <- rnorm(n, mean = m.y, sd = s.y)
print(mean(x/y))
print(sd(x/y))
# second momentum - Infinite!
m2 <- integrate(function(x) x*x*PDF(x), -Inf, Inf)
Thus, it is impossible to test any Taylor expansion for std.dev.
With the cautions suggested by #G.Grothendieck in mind: a useful mnemonic for products and quotients of independent X and Y variables is
CV^2(X/Y) = CV^2(X*Y) = CV^2(X) + CV^2(Y)
where CV is the coefficient of variation (sd(X)/mean(X)), so CV^2 is Var/mean^2. In other words
Var(Y/X)/(m(Y/X))^2 = Var(X)/m(X)^2 + Var(Y)/m(Y)^2
or rearranging
sd(Y/X) = sqrt[ Var(X)*m(Y/X)^2/m(X)^2 + Var(Y)*m(Y/X)^2/m(Y)^2 ]
For random variables with the mean well away from zero, this is a reasonable approximation.
set.seed(101)
y <- rnorm(1000,mean=5)
x <- rnorm(1000,mean=10)
myx <- mean(y/x)
sqrt(var(x)*myx^2/mean(x)^2 + var(y)*myx^2/mean(y)^2) ## 0.110412
sd(y/x) ## 0.1122373
Using your example is considerably worse because the CV of Y is close to 1 -- I initially thought it looked OK, but now I see that it's biased as well as not capturing the variability very well (I'm also plugging in the expected values of the mean and SD rather than their simulated values, but for such a large sample that should be a minor part of the error.)
me.x <- 4; sd.x <- 1
me.y <- 5; sd.y <- 4
myx <- me.y/me.x - 0/me.x^2 + sd.x*me.y/me.x^3
x <- rnorm(1e4,me.x,sd.x); y <- rnorm(1e4,me.y,sd.y)
c(myx,mean(y/x))
sdyx <- sqrt(sd.x^2*myx^2/me.x^2 + sd.y^2*myx^2/me.y^2)
c(sdyx,sd(y/x))
## 1.113172 1.197855
rvals <- replicate(1000,
sd(rnorm(1e4,me.y,sd.y)/rnorm(1e4,me.x,sd.x)))
hist(log(rvals),col="gray",breaks=100)
abline(v=log(sdyx),col="red",lwd=2)
min(rvals) ## 1.182698
All the canned delta-method approaches to computing the variance of Y/X use the point estimate for Y/X (i.e. m(Y/X) = mY/mX), rather than the second-order approximation you used above. Constructing higher-order forms for both the mean and the variance should be straightforward if possibly tedious (a computer algebra system might help ...)
mvec <- c(x = me.x, y = me.y)
V <- diag(c(sd.x, sd.y)^2)
car::deltaMethod(mvec, "y/x", V)
## Estimate SE
## y/x 1.25 1.047691
library(emdbook)
sqrt(deltavar(y/x,meanval=mvec,Sigma=V)) ## 1.047691
sqrt(sd.x^2*(me.y/me.x)^2/me.x^2 + sd.y^2*(me.y/me.x)^2/me.y^2) ## 1.047691
For what it's worth, I took the code in #SeverinPappadeux's answer and made it into a function gratio(mx,my,sx,sy). For the Cauchy case (gratio(0,0,1,1)) it gets confused and reports a mean of 0 (which should be NA/divergent) but correctly reports the variance/std dev as divergent. For the parameters specified by the OP (gratio(5,4,4,1)) it gives mean=1.352176, sd=NA as above. For the first parameters I tried above (gratio(10,5,1,1)) it gives mean=0.5051581, sd=0.1141726.
These numerical experiments strongly suggest to me that the ratio of Gaussians sometimes has a well-defined variance, but I don't know when (time for another question on Math StackOverflow or CrossValidated?)
Such approximations are unlikely to be useful since the distribution may not have a finite standard deviation. Look at how unstable it is:
set.seed(123)
n <- 10^6
X <- rnorm(n, me.x, sd.x)
Y <- rnorm(n, me.y, sd.y)
sd(head(Y/X, 10^3))
## [1] 1.151261
sd(head(Y/X, 10^4))
## [1] 1.298028
sd(head(Y/X, 10^5))
## [1] 1.527188
sd(Y/X)
## [1] 1.863168
Contrast that with what happens when we try the same thing with a normal random variable:
sd(head(Y, 10^3))
## [1] 3.928038
sd(head(Y, 10^4))
## [1] 3.986802
sd(head(Y, 10^5))
## [1] 3.984113
sd(Y)
## [1] 3.999024
Note: If you were in a different situation, e.g. the denominator has compact support, then you could do this:
library(car)
m <- c(x = me.x, y = me.y)
v <- diag(c(sd.x, sd.y)^2)
deltaMethod(m, "y/x", v)
I have a time-series which I need to fit onto an AR (auto-regression) model.
The AR model has the form:
x(t) = a0 + a1*x(t-1) + a2*x(t-2) + ... + aq*x(t-q) + noise.
I have two contraints:
Find the best AR fit when lag.max = 50.
Sum of all coefficients a0 + a1 + ... + aq = 1
I wrote the below code:
require(FitAR)
data(lynx) # my real data comes from the stock market.
z <- -log(lynx)
#find best model
step <- SelectModel(z, ARModel = "AR" ,lag.max = 50, Criterion = "AIC",Best=10)
summary(step) # display results
# fit the model and get coefficients
arfit <- ar(z,p=1, order.max=ceil(mean(step[,1])), aic=FALSE)
#check if sum of coefficients are 1
sum(arfit$ar)
[1] 0.5784978
My question is, how to add the constraint: sum of all coefficients = 1?
I looked at this question, but I do not realize how to use it.
**UPDATE**
I think I manage to solve my question as follow.
library(quadprog)
coeff <- arfit$ar
y <- 0
for (i in 1:length(coeff)) {
y <- y + coeff[i]*c(z[(i+1):length(z)],rep(0,i))
ifelse (i==1, X <- c(z[2:length(z)],0), X <- cbind(X,c(z[(i+1):length(z)],rep(0,i))))
}
Dmat <- t(X) %*% X
s <- solve.QP(Dmat , t(y) %*% X, matrix(1, nr=15, nc=1), 1, meq=1 )
s$solution
# The coefficients should sum up to 1
sum(s$solution)