I'm interested in extracting noise parameters from (x, y) data, where x are known input values, y are the corresponding signals, and I typically know quite well the functional form of the noise-generating processes.
###
library(MASS)
n <- 100000
x <- runif(n, min = 50, max = 1000)
idx <- seq(1, n, by = 100)  # plot every 100th datapoint only, for speed
y1 <- rnorm(n, 0, 5) + rnegbin(n, mu = x / 4, theta = 100)
plot(x[idx], y1[idx])
y2 <- x / 7.5 + rnorm(n, 50, 2) + rnorm(n, 0, 0.1 * x / 7.5)
plot(x[idx], y2[idx])
y3 <- x + rnorm(n, 0, 5) + rnorm(n, 0, 0.1 * x)
plot(x[idx], y3[idx])
y4 <- rnorm(n, 0, 5) + rnegbin(n, mu = x + rnorm(n, 0, 5), theta = 100) / 2
plot(x[idx], y4[idx])
###
In the (x, y1) example, the y data are (noisy) counts generated from the input values x according to a negative binomial model (theta = 100), plus some constant (homoskedastic) gaussian noise (avg = 0, stdev = 5). Also, somewhere in the process before the counts are generated, the signal is scaled down by a factor of 4.
In the (x, y2) example, the signals are generated by a different kind of process: the response factor is 1 / 7.5, so the y values are first scaled down by that factor; then constant gaussian background noise (avg = 50, stdev = 2) is added, plus more gaussian noise whose standard deviation is proportional to the (noise-free) signal level y (= x / 7.5).
I suppose my main question is: Is there a function or functions that take (x, y) data, as well as user-specified noise model as inputs and spits out the best estimates for, in the case of example y1, the four parameters: theta, constant noise avg, constant noise stdev, and scaling factor?
As a simplification, as in example (x, y3), I may be able to scale the signals manually before the fitting so that they are (on average) directly proportional to the input. The algorithm would then not need to consider any scaling factor but could assume that, on average, y is directly proportional to x (slope = 1), eliminating that parameter from consideration and simplifying the fitting.
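To make concrete what I mean by "extracting the parameters": for the simplified y3 case, I imagine something along the lines of writing down the likelihood by hand and maximizing it (this is just my rough guess at an approach, with made-up parameter names; I don't know whether it is the proper or robust way to do it):
###
# y3 is Normal with mean x and stdev sqrt(const_sd^2 + (prop_sd * x)^2),
# so a negative log-likelihood can be written down and minimized directly
nll3 <- function(par, x, y) {
  const_sd <- par[1]  # constant noise stdev (true value 5)
  prop_sd  <- par[2]  # proportional noise factor (true value 0.1)
  -sum(dnorm(y, mean = x, sd = sqrt(const_sd^2 + (prop_sd * x)^2), log = TRUE))
}
optim(par = c(1, 0.5), fn = nll3, x = x, y = y3)$par
# hopefully recovers values close to 5 and 0.1
###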
There is an even more complicated example, (x, y4). Here the signals are produced by a multi-step process: the signal is initially proportional to x (no scaling), constant gaussian noise is added (avg = 0, stdev = 5), noisy counts are then generated from that already-noisy input, the counts are divided by 2, and finally more constant gaussian noise (avg = 0, stdev = 5) is added on top.
All of this seems doable in a data-processing-oriented language like R, but I haven't been able to find a suitable solution. Then again, I am not a very experienced programmer in general (or in R specifically), and I have no experience developing algorithms, so my perceptions may be totally off; in that case, my apologies.
The noise levels and exact models in these toy examples may not be realistic, but I'm just trying to illustrate that I'd like to model somewhat complex, user-specified, multi-parameter noise scenarios. In a lot of cases I know to high accuracy what kind of processes generated the real-world data of interest, so writing down the function is not the problem. The bottleneck is knowing which R function(s)/packages are appropriate (and whether something like this is available in R at all) and how to use them correctly so I can extract the parameters (my main interest) that characterize these processes.
In case there are no general solutions that cover a lot of ground, I'm of course also interested in algorithms and functions that specialize in just one type of data (e.g., just linear regression). Finally, if fitting complex noise models is not an option, I'd be interested in starting with something simpler, like fitting a negative binomial alone; in some cases the other noise sources may be negligible, so that alone may do. Like this one:
y5 <- rnegbin(n, mu = x, theta = 100)
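For that simplest case, I gather even a hand-rolled maximum-likelihood fit might already do what I want (again just my own sketch with made-up names, and I don't know if it is the recommended route):
###
nll5 <- function(par, x, y) {
  scale <- par[1]  # scaling factor (true value 1 here)
  theta <- par[2]  # negative binomial dispersion (true value 100)
  -sum(dnbinom(y, size = theta, mu = scale * x, log = TRUE))
}
optim(par = c(0.5, 10), fn = nll5, x = x, y = y5,
      method = "L-BFGS-B", lower = c(1e-6, 1e-6))$par
###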
Thank you!
Related
I'm trying to implement a "change point" analysis, or a multiphase regression using nls() in R.
Here's some fake data I've made. The formula I want to use to fit the data is:
$y = \beta_0 + \beta_1x + \beta_2\max(0,x-\delta)$
What this is supposed to do is fit the data up to a certain point with a certain intercept and slope ($\beta_0$ and $\beta_1$), then, after a certain x value ($\delta$), augment the slope by $\beta_2$. That's what the whole max thing is about. Before the $\delta$ point, it'll equal 0, and $\beta_2$ will be zeroed out.
So, here's my function to do this:
changePoint <- function(x, b0, slope1, slope2, delta){
  b0 + (x*slope1) + (max(0, x-delta) * slope2)
}
And I try to fit the model this way
nls(y ~ changePoint(x, b0, slope1, slope2, delta),
    data = data,
    start = c(b0 = 50, slope1 = 0, slope2 = 2, delta = 48))
I chose those starting parameters, because I know those are the starting parameters, because I made the data up.
However, I get this error:
Error in nlsModel(formula, mf, start, wts) :
singular gradient matrix at initial parameter estimates
Have I just made unfortunate data? I tried fitting this on real data first, and was getting the same error, and I just figured that my initial starting parameters weren't good enough.
At first I thought it could be a problem resulting from the fact that max is not vectorized, but that's not true. It does make changePoint a pain to work with, though, hence the following modification:
changePoint <- function(x, b0, slope1, slope2, delta) {
  # (pmax(0, x - delta) would be an equivalent vectorized alternative)
  b0 + (x * slope1) + (sapply(x - delta, function(t) max(0, t)) * slope2)
}
This R-help mailing list post describes one way in which this error may result: the rhs of the formula is overparameterized, such that changing two parameters in tandem gives the same fit to the data. I can't see how that is true of your model, but maybe it is.
In any case, you can write your own objective function and minimize it. The following function gives the squared error for data points (x,y) and a certain value of the parameters (the weird argument structure of the function is to account for how optim works):
sqerror <- function(par, x, y) {
  sum((y - changePoint(x, par[1], par[2], par[3], par[4]))^2)
}
Then we say:
optim(par = c(50, 0, 2, 48), fn = sqerror, x = x, y = data)
And see:
$par
[1] 54.53436800 -0.09283594 2.07356459 48.00000006
Note that for my fake data (x <- 40:60; data <- changePoint(x, 50, 0, 2, 48) + rnorm(21, 0, 0.5)) there are lots of local minima depending on the initial parameter values you give. I suppose if you wanted to take this seriously you'd call the optimizer many times with random initial parameters and examine the distribution of results, as sketched below.
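For concreteness, here is a rough sketch of that multi-start idea (my own addition, reusing the fake x and data from above; the ranges for the random starts are arbitrary):
set.seed(1)
starts <- replicate(50, c(b0 = runif(1, 0, 100), slope1 = runif(1, -5, 5),
                          slope2 = runif(1, -5, 5), delta = runif(1, 40, 60)))
fits <- lapply(seq_len(ncol(starts)),
               function(i) optim(par = starts[, i], fn = sqerror, x = x, y = data))
best <- fits[[which.min(sapply(fits, function(f) f$value))]]
best$par  # parameters from the best of the 50 runs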
Just wanted to add that you can do this with many other packages. If you want to get an estimate of uncertainty around the change point (something nls cannot do), try the mcp package.
# Simulate the data
df = data.frame(x = 1:100)
df$y = c(rnorm(20, 50, 5), rnorm(80, 50 + 1.5*(df$x[21:100] - 20), 5))
# Fit the model
model = list(
  y ~ 1,    # Intercept
  ~ 0 + x   # Joined slope
)
library(mcp)
fit = mcp(model, df)
Let's plot it with a prediction interval (green line). The blue density is the posterior distribution for the change point location:
# Plot it
plot(fit, q_predict = T)
You can inspect individual parameters in more detail using plot_pars(fit) and summary(fit).
tldr: I am numerically estimating a PDF from simulated data and I need the density to monotonically decrease outside of the 'main' density region (as x-> infinity). What I have yields a close to zero density, but which does not monotonically decrease.
Detailed Problem
I am estimating a simulated maximum likelihood model, which requires me to numerically evaluate the probability distribution function of some random variable (the probability of which cannot be analytically derived) at some (observed) value x. The goal is to maximize the log-likelihood of these densities, which requires them to not have spurious local maxima.
Since I do not have an analytic likelihood function, I numerically simulate the random variable by drawing the random component from some known distribution function and applying some non-linear transformation to it. I save the results of this simulation in a vector named Simulated_stats_.
I then use density() to approximate the PDF and approxfun() to evaluate the PDF at x:
#some example simulation
Simulated_stats_ <- runif(n = 500, 10, 15) + rnorm(n = 500, mean = 15, sd = 3)
#approximation for x
approxfun(density(Simulated_stats_))(x)
This works well within the range of the simulated Simulated_stats_, see image:
Example PDF. The problem is I need to be able to evaluate the PDF far from the range of the simulated data.
So in the image above, I would need to evaluate the PDF at, say, x=50:
approxfun(density(Simulated_stats_))(50)
> [1] NA
So instead I use the from and to arguments in the density function, which correctly approximate the near-zero tails, like so:
approxfun(
  density(Simulated_stats_, from = 0, to = max(Simulated_stats_) * 10)
)(50)
[1] 1.924343e-18
Which is great, under one condition - I need the density to go to zero the further out x is from the simulated range. That is, if I evaluate at x=51 the result must be strictly smaller than at x=50. (Otherwise, my estimator may find local maxima far from the 'true' region, since the likelihood function is not monotonic very far from the 'main' density mass, i.e. in the extrapolated region.)
To test this I evaluated the approximated PDF at fixed intervals, took logs, and plotted. The result is discouraging: far from the main density mass the probability 'jumps' up and down. Always very close to zero, but NOT monotonically decreasing.
a <- sapply(X = seq(from = 0, to = 100, by = 0.5), FUN = function(x) {
  approxfun(
    density(Simulated_stats_, from = 0, to = max(Simulated_stats_) * 10)
  )(x)
})
aa <- cbind(seq(from = 0, to = 100, by = 0.5), a)
plot(aa[,1],log(aa[,2]))
Result:
Non-monotonic log density far from density mass
My question
Does this happen because of the kernel estimation in density() or is it inaccuracies in approxfun()? (or something else?)
What alternative methods can I use that will deliver a monotonically declining PDF far from the simulated density mass?
Or - how can I manually change the approximated PDF to monotonically decline the further I am from the density mass? I would happily stick some linear trend that goes to zero...
Thanks!
One possibility is to estimate the CDF using a beta regression model; a numerical estimate of the derivative of this model can then be used to estimate the PDF at any point. Here's an example of what I was thinking. I'm not sure if it helps you at all.
Import libraries
library(mgcv)
library(data.table)
library(ggplot2)
Generate your data
set.seed(123)
Simulated_stats_ <- runif(n = 5000, 10, 15) + rnorm(n = 5000, mean = 15, sd = 3)
Function to estimate CDF using gam beta regression model
get_mod <- function(ss, p = seq(0.02, 0.98, 0.02)) {
  qp = quantile(ss, probs = p)
  betamod = mgcv::gam(p ~ s(qp, bs = "cs"), family = mgcv::betar())
  return(betamod)
}
betamod <- get_mod(Simulated_stats_)
Very basic estimate of PDF at val given model that estimates CDF
est_pdf <- function(val, betamod, tol = 0.001) {
  xvals = c(val, val + tol)
  yvals = predict(betamod, newdata = data.frame(qp = xvals), type = "response")
  as.numeric((yvals[1] - yvals[2]) / (xvals[1] - xvals[2]))
}
Let's check if the estimate is monotonically increasing below the min of Simulated_stats_
test_x = seq(0,min(Simulated_stats_), length.out=1000)
pdf = sapply(test_x, est_pdf, betamod=betamod)
all(pdf == cummax(pdf))
[1] TRUE
Let's check if it is monotonically decreasing above the max of Simulated_stats_
test_x = seq(max(Simulated_stats_), 60, length.out=1000)
pdf = sapply(test_x, est_pdf, betamod=betamod)
all(pdf == cummin(pdf))
[1] TRUE
Additional thoughts 3/5/22
As discussed in comments, using the betamod to predict might slow down the estimator. While this could be resolved to a great extent by writing your own predict function directly, there is another possible shortcut.
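For what it's worth, a minimal sketch of that "write your own predict" route, using mgcv's lpmatrix output (my own addition; it assumes the default logit link of betar()):
# build the linear-predictor matrix once, then work with coef(betamod) directly
Xp  <- predict(betamod, newdata = data.frame(qp = c(38, 38.001)), type = "lpmatrix")
cdf <- plogis(Xp %*% coef(betamod))  # inverse logit gives the estimated CDF
(cdf[2] - cdf[1]) / 0.001            # finite-difference pdf estimate at x = 38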
Generate estimates from the betamod over the range of X, including the extremes
k <- sapply(seq(0,max(Simulated_stats_)*10, length.out=5000), est_pdf, betamod=betamod)
Use the approach above that you were initially using, i.e. a linear interpolation, but rather than interpolating the density() outcome, interpolate over k (i.e. over the above estimates from the beta model):
lin_int = approxfun(x = seq(0, max(Simulated_stats_)*10, length.out = 5000), y = k)
You can use the lin_int() function for prediction in the estimator, and it will be lightning fast. Note that it produces virtually the same value for a given x
c(est_pdf(38,betamod), lin_int(38))
[1] 0.001245894 0.001245968
and it is very fast
microbenchmark::microbenchmark(
  list = alist("betamod" = est_pdf(38, betamod), "lin_int" = lin_int(38)), times = 100
)
Unit: microseconds
expr min lq mean median uq max neval
betamod 1157.0 1170.20 1223.304 1188.25 1211.05 2799.8 100
lin_int 1.7 2.25 3.503 4.35 4.50 10.5 100
Finally, let's check the same plot you did before, but using lin_int() instead of approxfun(density(....))
a <- sapply(X = seq(from = 0, to = 100, by = 0.5), lin_int)
aa <- cbind( seq(from = 0, to = 100, by = 0.5), a)
plot(aa[,1],log(aa[,2]))
It is the case that the probability density for a standardized and unstandardized random variable will differ. E.g., in R
dnorm(x = 0, mean = 1, sd = 2)
dnorm(x = (0 - 1)/2)
However,
pnorm(q = 0, mean = 1, sd = 2)
pnorm(q = (0 - 1)/2)
yields the same value.
Are there any situations in which the Normal cumulative distribution function will yield a different probability for the same random variable when it is standardized versus unstandardized? If yes, is there a particular example in which this difference arises? If not, is there a general proof of this property?
Thanks so much for any help and/or insight!
This isn't really a coding question, but I'll answer it anyway.
Short answer: the cumulative probabilities are the same; it is the density heights that differ.
Long answer:
A normal distribution is usually thought of as y=f(x), that is, a curve over the domain of x. When you standardize, you are converting from units of x to units of z. For example, if x~N(15,5^2), then a value of 10 is 5 x-units less than the mean. Notice that this is also 1 standard deviation less than the mean. When you standardize, you convert x to z~N(0,1^2). Now, that example value of 10, when standardized into z-units, becomes a value of -1 (i.e., it's still one standard deviation less than the mean).
As a result, the area under the curve to the left of x=10 is the same as the area under the curve to the left of z=-1. In words, the cumulative probability up to those cut-offs is the same.
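A quick numeric check of that statement (my own addition):
pnorm(10, mean = 15, sd = 5)  # P(X <= 10) for X ~ N(15, 5^2): 0.1586553
pnorm(-1)                     # P(Z <= -1) for Z ~ N(0, 1):    0.1586553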
However, the heights of the curves are different. Let the normal distribution curves be f(x) and g(z). Then f(10) != g(-1). In code:
dnorm(10, 15, 5) != dnorm(-1, 0, 1)
The reason is that the act of standardizing either "spreads" or "squishes" the f(x) curve to make it "fit" over the new z domain as g(z).
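The spreading/squishing factor is exactly 1/sd, i.e. f(x) = g((x - mean)/sd) / sd. A quick check (my own addition):
dnorm(10, mean = 15, sd = 5)  # f(10): 0.04839414
dnorm(-1) / 5                 # g(-1) divided by sd = 5: also 0.04839414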
Here are two links that let you visualize the spreading/squishing:
https://academo.org/demos/gaussian-distribution/
https://www.intmath.com/counting-probability/normal-distribution-graph-interactive.php
Hope this helps!
I am trying to find the coefficients of a polynomial in R, but I am not sure of which order the polynomial is.
I have data:
x=seq(6, 174, by=8)
y=rep(c(-1,1),11)
Now I want to find the (obviously) non-linear function that passes through all these points. Function values should still be in the interval [-1, 1], and each of these points should be understood as the vertex of a parabola.
EDIT
Actually this is not example data, I just need exactly this function for exactly these points.
I tried to describe it with polynomials up to degree 25 and then gave up; with polynomials it seems it is only possible to approximate the curve, not to hit it exactly.
Comments suggested using a sine curve. Does someone know how to get the exact trigonometric function?
Your data have a strong characteristic: they look like samples from a sinusoidal signal. With the restriction that y is constrained to [-1, 1], we know for sure the amplitude is 1, so let's assume we want a sine function:
y = sin((2 * pi / T) * x + phi)
where T is the period and phi is the phase. The period of your data is evident: 2 * 8 = 16. To get phi, just use the fact that when x = 6, y = -1. That is
sin(12 * pi / T + phi) = -1
which gives one solution: phi = -pi/2 - 12 * pi / T.
Here we go:
T <- 16
phi <- -pi/2 - 12 * pi / T
f <- function(x) sin(x * pi / 8 + phi)
plot(x, y)
x0 <- seq(6, 174, by = 0.2)
y0 <- f(x0)
lines(x0, y0, col = 2)
Your original intention to have a polynomial is not impossible, but it can't be an ordinary polynomial. An ordinary polynomial is unbounded: it tends to Inf or -Inf as x tends to Inf or -Inf.
A local polynomial is possible. Since you say all these points should be understood as vertices of parabolas, you seem to expect a smooth function, and a cubic spline is then ideal. Specifically, we don't want a natural cubic spline but a periodic cubic spline. The spline function from the stats package can help us:
int <- spline(x[-1], y[-1], method = "periodic", xout = x0)
Note, I have dropped the first datum, as with the "periodic" method, spline wants y to have the same value on both ends. Once we drop the first datum, the y values are 1 at both ends.
plot(x, y)
lines(int, col = 2)
I did not compare the spline interpolation with the sinusoid function. They can't be exactly the same, but in statistical modelling we can use either one to model the underlying cyclic signal / effect.
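If you do want to eyeball the difference, you can overlay both on the data (my own quick addition, reusing x0, f and the spline call from above):
plot(x, y)
lines(x0, f(x0), col = 2)                                            # sinusoid
lines(spline(x[-1], y[-1], method = "periodic", xout = x0), col = 4) # periodic spline
legend("topright", legend = c("sine", "periodic spline"), col = c(2, 4), lty = 1)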
I want to minimize a simple linear function Y = x1 + x2 + x3 + x4 + x5 using ordinary least squares, with the constraint that the coefficients sum to 5. How can I accomplish this in R? All of the packages I've seen seem to allow constraints on individual coefficients, but I can't figure out how to impose a single constraint involving several coefficients jointly. I'm not tied to OLS; if this requires an iterative approach, that's fine as well.
The basic math is as follows: we start with
mu = a0 + a1*x1 + a2*x2 + a3*x3 + a4*x4
and we want to find a0-a4 to minimize the SSQ between mu and our response variable y.
If we replace the last parameter (say a4) with C - a1 - a2 - a3 to honour the constraint, we end up with a new set of linear equations:
mu = a0 + a1*x1 + a2*x2 + a3*x3 + (C-a1-a2-a3)*x4
   = a0 + a1*(x1-x4) + a2*(x2-x4) + a3*(x3-x4) + C*x4
(note that a4 has disappeared ...)
Something like this (untested!) implements it in R.
Original data frame:
d <- data.frame(y = runif(20),
                x1 = runif(20),
                x2 = runif(20),
                x3 = runif(20),
                x4 = runif(20))
Create a transformed version where all but the last column have the last column "swept out", e.g. x1 -> x1-x4; x2 -> x2-x4; ...
dtrans <- data.frame(y = d$y,
                     sweep(d[, 2:4], 1, d[, 5], "-"),
                     x4 = d$x4)
Rename to tx1, tx2, ... to minimize confusion:
names(dtrans)[2:4] <- paste("t",names(dtrans[2:4]),sep="")
Sum-of-coefficients constraint:
constr <- 5
Now fit the model with an offset:
lm(y~tx1+tx2+tx3,offset=constr*x4,data=dtrans)
It wouldn't be too hard to make this more general.
This requires a little more thought and manipulation than simply specifying a constraint to a canned optimization program. On the other hand, (1) it could easily be wrapped in a convenience function; (2) it's much more efficient than calling a general-purpose optimizer, since the problem is still linear (and in fact one dimension smaller than the one you started with). It could even be done with big data (e.g. biglm). (Actually, it occurs to me that if this is a linear model, you don't even need the offset, although using the offset means you don't have to compute a0=intercept-C*x4 after you finish.)
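To illustrate the "wrap it in a convenience function" point, here is a rough, untested sketch of what that might look like (lm_sumconstr and its arguments are names I have made up):
lm_sumconstr <- function(data, yname, xnames, total) {
  last <- xnames[length(xnames)]                       # predictor to be "swept out"
  rest <- xnames[-length(xnames)]
  dt   <- data
  dt[rest] <- sweep(data[rest], 1, data[[last]], "-")  # x_i -> x_i - x_last
  fit  <- lm(reformulate(rest, response = yname), data = dt,
             offset = total * data[[last]])
  coefs <- coef(fit)[rest]
  c(coef(fit)["(Intercept)"], coefs,
    setNames(total - sum(coefs), last))                # recover the constrained coefficient
}
lm_sumconstr(d, "y", c("x1", "x2", "x3", "x4"), total = 5)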
Since you said you are open to other approaches, this can also be solved in terms of a quadratic programming (QP):
Minimize a quadratic objective: the sum of the squared errors,
subject to a linear constraint: your weights must sum to 5.
Assuming X is your n-by-5 matrix and Y is a vector of length n, this would solve for your optimal weights:
library(limSolve)
lsei(A = X,
     B = Y,
     E = matrix(1, nrow = 1, ncol = 5),
     F = 5)
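A small self-contained check of the lsei() call, on simulated data where the true coefficients sum to 5 (my own example; no intercept, matching the call above):
library(limSolve)
set.seed(1)
n <- 100
X <- matrix(rnorm(n * 5), n, 5)                           # n-by-5 predictor matrix
Y <- drop(X %*% c(2, 1, 0.5, 1, 0.5)) + rnorm(n, 0, 0.2)  # true coefficients sum to 5
fit <- lsei(A = X, B = Y, E = matrix(1, nrow = 1, ncol = 5), F = 5)
fit$X       # constrained estimates, close to c(2, 1, 0.5, 1, 0.5)
sum(fit$X)  # equals 5 up to numerical precision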