Automatically find the scaling factor of the x-axis using LsqFit (or other method)? - julia

I have the following data: a vector B and a vector R. The vector B is the "independent" variable. For this pair, I have two data sets: One is an experimental measurement of Bex, Rex and the other is a simulation produced by me Bsim, Rsim. The simulation does not have any "scale" for the x-axis (the B vector). Therefore when I am trying to fit my curve to the experiment, I have to find out a scaling parameter B0 "by eye", and with this number B0 I multiply the entire Bsim vector and simply plot(Bsim, Rsim, Bex, Rex).
I wanted to use the package LsqFit to make the procedure automatic and more accurate. However I am having trouble in understanding how I could use it to find the scaling on the independent variable.
My first thought was to just "invert" the roles of B and R. However, there are two issues that I think make matters worse: 1) the R curve/data is not monotonous, 2) the experimental data are much more "dense" (they have more data-points: my simulation has 120 points in total, the experiments have some thousands).
Below I give an example of what I am trying to accomplish (of course, the answer need not use LsqFit). I also attach two figures that demonstrate everything very clearly.
#= stuff happened before this point =#
Bsim, Rsim = load(simulation)
Bex, Rex = load(experiment)
#this is what I want to do:
some_model(x, p) = ???
fit = curve_fit(some_model, Bex, Rex, [3.5])
B0 = fit.param[1]
#this is what I currently do by trial and error:
B0 = 3.85
plot(B0*Bsim, Rsim, Bex, Rex)
P.S.: The R curves (dependent variables) are both normalized by their maximum value because their scaling is not important.

A simple approach iff you can always expect both your experiment and simulation to feature one high peak, and you're sure that there's only a scaling factor rather than also an offset, is to simply multiply your Bsim vector by mode_rex / mode_rsim, where each mode is the B value at which the corresponding curve peaks (e.g. in your example, mode_rsim = 1 and mode_rex = 4, so multiply Bsim by 4). But I'm sure you've thought of this already.
For a more general approach, one way is as follows:
add and load Interpolations package
Create a grid to interpolate over, e.g. Grid = 0:0.01:Bex[end]
interpolate Rex over that grid, e.g.
RexInterp = interpolate( (Bex,), Rex, Gridded(Linear()));
RexGridVec = RexInterp[Grid];
interpolate Rsim over the same grid, but introduce your multiplier on the Bsim "knots", e.g.
Multiplier = 0.1;
RsimInterp = interpolate( (Multiplier * Bsim,), Rsim, Gridded(Linear()));
RsimGridVec = RsimInterp[Grid]
Now you can calculate a square error value between RsimGridVec and RexGridVec, e.g.
SqErr = sum((RsimGridVec - RexGridVec).^2)
If you follow this technique, then if you create a loop for a multiplier range (say 0:0.01:10), and get the square error associated with each multiplier, you can find out the multiplier for which the square error is the minimum.
In theory, if you also wanted to find the optimal offset, you can make that the outer loop over a range of offsets. Mind you, this is a brute force approach, but it should be reasonably efficient judging by the vectors in your graph.
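The brute-force search described above can be sketched as follows. This is a minimal illustration in Python with NumPy rather than Julia; the function name `best_scale`, the synthetic Gaussian-shaped curves, and the candidate range are all assumptions made for the example, not part of the original data. It uses the mean (rather than the sum) of the squared error over the overlapping region so that candidates with different overlap remain comparable.

```python
import numpy as np

def best_scale(b_sim, r_sim, b_ex, r_ex, candidates):
    """Return the candidate scale with the smallest mean squared error."""
    errors = []
    for scale in candidates:
        # Resample the scaled simulation at the experimental B values.
        # Points outside the scaled simulation range become NaN and are skipped.
        r_interp = np.interp(b_ex, scale * b_sim, r_sim, left=np.nan, right=np.nan)
        mask = ~np.isnan(r_interp)
        if mask.sum() < 2:
            errors.append(np.inf)
            continue
        errors.append(np.mean((r_interp[mask] - r_ex[mask]) ** 2))
    errors = np.asarray(errors)
    return candidates[np.argmin(errors)], errors

# Synthetic stand-ins for the real data (hypothetical shapes and sizes).
b_sim = np.linspace(0.0, 1.0, 120)                   # unscaled simulation axis
r_sim = np.exp(-((b_sim - 0.5) / 0.1) ** 2)          # normalized to max 1
b_ex = np.linspace(0.0, 3.85, 2000)                  # denser "experiment"
r_ex = np.exp(-((b_ex / 3.85 - 0.5) / 0.1) ** 2)

candidates = np.arange(1.0, 8.0, 0.01)               # scales must be positive
b0, errors = best_scale(b_sim, r_sim, b_ex, r_ex, candidates)
print("estimated B0:", b0)
```

Plotting `errors` against `candidates` is a quick way to check that the minimum is well defined before trusting the estimate.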


Preferentially Sampling Based upon Value Size

So, this is something I think I'm complicating far too much but it also has some of my other colleagues stumped as well.
I've got a set of areas represented by polygons and I've got a column in the dataframe holding their areas. The distribution of areas is heavily right skewed. Essentially I want to randomly sample them based upon a distribution of sampling probabilities that is inversely proportional to their area. Rescaling the values to between zero and one (using the {x-min(x)}/{max(x)-min(x)} method) and subtracting them from 1 would seem to be the intuitive approach, but this would simply mean that the smallest areas are almost always the ones sampled.
I'd like a flatter (but not uniform!) right-skewed distribution of sampling probabilities across the values, but I am unsure on how to do this while taking the area values into account. I don't think stratifying them is what I am looking for either as that would introduce arbitrary bounds on the probability allocations.
Reproducible code below with the item of interest (the vector of probabilities) given by prob_vector. That is, how to generate prob_vector given the above scenario and desired outcomes?
# Data
n= 500
df <- data.frame("ID" = 1:n,"AREA" = replicate(n,sum(rexp(n=8,rate=0.1))))
# Generate the sampling probability somehow based upon the AREA values with smaller areas having higher sample probability::
prob_vector <- ??????
# Sampling:
s <- sample(df$ID, size=1, prob=prob_vector)
There is no one best solution for this question as a wide range of probability vectors is possible. You can add any kind of curvature and slope.
In this small script, I simulated an extremely right skewed distribution of areas (0-100 units) and you can define and directly visualize any probability vector you want.
area.dist = rgamma(1000,1,3)*40
hist(area.dist,main="Probability functions")
area = seq(0,100,0.1)
prob_vector1 = 1-(area-min(area))/(max(area)-min(area)) ## linear
prob_vector2 = .8-(.6*(area-min(area))/(max(area)-min(area))) ## low slope
prob_vector3 = 1/(1+((area-min(area))/(max(area)-min(area))))**4 ## strong curve
prob_vector4 = .4/(.4+((area-min(area))/(max(area)-min(area)))) ## low curve
legend("topright",c("linear","low slope","strong curve","low curve"), col = c("red","green","blue","orange"),lwd=1)
The output is:
The red line is your solution, the other ones are adjustments to make it weaker. Just change numbers in the probability function until you get one that fits your expectations.
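As one concrete way to "change numbers until it fits", a single exponent can control how strongly small areas are favoured. The sketch below is in Python with NumPy rather than R; the function name `inverse_area_probs` and the `power` parameter are hypothetical, and the areas are simulated to mirror the question's sum of eight exponentials. With `power=1` the probabilities are inversely proportional to area, and values between 0 and 1 give a flatter but still non-uniform vector.

```python
import numpy as np

def inverse_area_probs(area, power=1.0):
    """Sampling probabilities proportional to area**(-power).

    power = 0 gives uniform sampling, power = 1 is inversely proportional
    to area, and values in between flatten the distribution.
    """
    area = np.asarray(area, dtype=float)
    weights = area ** (-power)
    return weights / weights.sum()

rng = np.random.default_rng(0)
n = 500
# Mirrors replicate(n, sum(rexp(n=8, rate=0.1))): rate 0.1 means scale 10.
area = rng.exponential(scale=10.0, size=(n, 8)).sum(axis=1)
ids = np.arange(1, n + 1)

probs_flat = inverse_area_probs(area, power=0.5)   # flatter choice
probs_inv = inverse_area_probs(area, power=1.0)    # strictly inverse
sampled_id = rng.choice(ids, size=1, p=probs_flat)
```

Comparing `probs_flat.max() / probs_flat.min()` for a few exponents shows directly how much the smallest polygon is favoured over the largest.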

Calculate Normals from Heightmap

I am trying to convert a heightmap into a matrix of normals using central differencing, which will later correspond to the steepness of a given point.
I found several links with correct results but without explaining the math behind them.
From this link I realised I can just do:
Vec3 normal = Vec3(2*(R-L), 2*(B-T), -4).Normalize();
The thing is that I don't know where the 2* and -4 come from.
In this explanation of central differencing I see that we should divide that value by 2, but I still don't know how to connect all of this.
What I really want to know is the linear algebra definition behind this.
I have a heightmap, I want to measure the central differences and I want to obtain the normal vector to use later to measure the steepness.
PS: the Z-axis is the height.
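From vector calculus, the normal of a surface defined implicitly by f(x, y, z) = 0 is given by the gradient operator:
n = ∇f = (∂f/∂x, ∂f/∂y, ∂f/∂z)
A height map h(x, y) is a special form of the function f:
f(x, y, z) = h(x, y) - z, so that ∇f = (∂h/∂x, ∂h/∂y, -1)
For a discretized height map, assuming that the grid size is 1, the first-order approximations to the two derivative terms above are given by:
∂h/∂x ≈ (R - L) / 2, ∂h/∂y ≈ (B - T) / 2
since the x step from L to R is 2, and the same holds for y. The above is exactly the formula you had, divided through by 4: (2(R - L), 2(B - T), -4) / 4 = ((R - L)/2, (B - T)/2, -1). When this vector is normalized, the factor of 4 cancels.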
(No linear algebra was harmed in the writing of this answer)
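For a whole heightmap at once, the same central differences can be computed with array operations. This is a minimal sketch in Python with NumPy, assuming rows run top to bottom (y) and columns run left to right (x), with the hypothetical function name `heightmap_normals`. `np.gradient` uses central differences in the interior and one-sided differences on the border, so border normals are slightly less accurate.

```python
import numpy as np

def heightmap_normals(h, spacing=1.0):
    """Unit normals (dh/dx, dh/dy, -1) / length for every cell of a heightmap."""
    h = np.asarray(h, dtype=float)
    # np.gradient returns derivatives along axis 0 (rows, y) then axis 1 (columns, x).
    dh_dy, dh_dx = np.gradient(h, spacing)
    n = np.stack([dh_dx, dh_dy, -np.ones_like(h)], axis=-1)
    return n / np.linalg.norm(n, axis=-1, keepdims=True)

# A tilted plane h = 2x: every normal should be (2, 0, -1) / sqrt(5).
xs = np.arange(5, dtype=float)
h = np.tile(2.0 * xs, (4, 1))
normals = heightmap_normals(h)

# Steepness as the angle (in radians) between the normal and the vertical axis.
steepness = np.arccos(np.abs(normals[..., 2]))
```

Flipping the sign of the whole vector gives the upward-pointing normal if that convention is preferred; the steepness angle is unaffected.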

Calculating error in PCA

I have a question about a result which I did not expect when doing PCA.
I have successfully calculated the principal components using reference data, and then, as a check to ensure that what's going on is what I think is going on, I've projected the reference data onto the entire basis of its eigenfunctions (kept all components) and then transformed back. (This is in Python, so it's ref_data_transform = pca.transform(ref_data) followed by pca.inverse_transform(ref_data_transform).) I get the exact same data. This is not a surprise.
What is also not a surprise is that as I choose fewer and fewer principal components, the point to point difference between the original data and that which has been projected onto a smaller basis and then projected back increases. That is, if you plot the original data and "filtered" data, it looks different, with the difference increasing as you reduce the size of the subspace onto which you're projecting. I can capture the difference between each data point in a vector called, say, difference_vec.
What IS a surprise (to me at least) is that when I sum over any column of difference_vec it always equals zero. That is, while the actual differences between any original data point and the corresponding one filtered by some number of principal components grow larger as I project onto a smaller and smaller subspace, the TOTAL error is always zero.
I very much appreciate any insight that one may have into whether I'm making some mistake here and, if not, why this erstwhile "projection induced error" metric doesn't work.
This happens because ref_data and what I’ll call inv_data = pca.inverse_transform(pca.transform(ref_data)) both have the same mean (taken over samples, i.e., along the first dimension, as np.mean(..., 0) does in the sample code below).
To see this, take a look at the code for transform:
transform = lambda X: dot(X - mu, V.T)
whereas inverse_transform can be defined as:
inverse_transform = lambda X: dot(X, V) + mu
where mu is the mean of ref_data and V are the first N eigenvectors of covariance(ref_data).
So if you follow the chain of data and its mean:
ref_data with mean mu;
transform(ref_data) has mean 0 (see the equivalent definition above: X-mu has zero mean, and projecting the result linearly onto some coordinate reference only rotates/shears/flips those zero-mean points without altering their mean);
Finally, inv_data = inverse_transform(transform(ref_data)) adds mu back so it has mu-mean;
you see that ref_data and inv_data both have mean mu.
Finally, sum(ref_data - inv_data) can be seen as sum(mean(ref_data - inv_data) * num_samples), which by linearity simplifies to sum(mu - mu), which is 0.
That’s a lot of words, sorry, but the idea, now that I see it, is really simple. As I mentioned in my comment, in cases like this you want to use a matrix norm, like the Frobenius norm, to measure a distance between two matrices, not just sum(A - B) 😅!
Sample code:
import numpy as np
from sklearn.decomposition import PCA
ref_data = np.random.randn(20, 3)
pca = PCA(n_components=1).fit(ref_data) # fit first, otherwise transform raises NotFittedError
trans_data = pca.transform(ref_data)
inv_data = pca.inverse_transform(trans_data)
np.mean(inv_data, 0) # array([ 0.03664149, 0.51348007, 0.0360179 ])
np.mean(ref_data, 0) # array([ 0.03664149, 0.51348007, 0.0360179 ])
np.mean(trans_data, 0) # array([ -2.49800181e-17]) meanwhile ...
np.sum(inv_data - ref_data) # -1.3877787807814457e-15 !
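To make the Frobenius-norm suggestion concrete, here is a minimal sketch that rebuilds the transform and inverse_transform definitions above with a plain NumPy SVD instead of sklearn (an assumption made so the example has no extra dependencies; the helper name `reconstruct` is hypothetical). It shows that the column sums of the difference stay at zero while the Frobenius norm shrinks as components are added.

```python
import numpy as np

rng = np.random.default_rng(0)
ref_data = rng.standard_normal((20, 3))       # 20 samples, 3 features
mu = ref_data.mean(axis=0)

# Rows of vt are the principal directions of the centred data.
_, _, vt = np.linalg.svd(ref_data - mu, full_matrices=False)

def reconstruct(x, n_components):
    """Project onto the first n_components directions and map back."""
    v = vt[:n_components]
    return (x - mu) @ v.T @ v + mu

for k in (1, 2, 3):
    diff = ref_data - reconstruct(ref_data, k)
    # Column sums are ~0 for every k; the Frobenius norm is the useful metric.
    print(k, diff.sum(axis=0), np.linalg.norm(diff))
```

`np.linalg.norm` on a 2-D array defaults to the Frobenius norm, i.e. the square root of the sum of squared entries of the difference.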

Replacing negative values in a model (system of ODEs) with zero

I'm currently working on solving a system of ordinary differential equations using deSolve, and was wondering if there's any way of preventing differential variable values from going below zero. I've seen a few other posts about setting negative values to zero in a vector, data frame, etc., but since this is a biological model (and it doesn't make sense for a T cell count to go negative), I need to stop it from happening to begin with so these values don't skew the results, not just replace the negatives in the final output.
My standard approach is to transform the state variables to an unconstrained scale. The most obvious/standard way to do this for positive variables is to write down equations for the dynamics of log(x) rather than of x.
For example, with the Susceptible-Infected-Recovered (SIR) model for infectious disease epidemics, where the equations are dS/dt = -beta*S*I; dI/dt = beta*S*I-gamma*I; dR/dt = gamma*I we would naively write the gradient function as
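gfun <- function(time, y, params) {
g <- with(as.list(c(y,params)), c(-beta*S*I, beta*S*I-gamma*I, gamma*I))
list(g)  # deSolve expects the derivatives wrapped in a list
}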
If we make log(I) rather than I be the state variable (in principle we could do this with S as well, but in practice S is much less likely to approach the boundary), then we have d(log(I))/dt = (dI/dt)/I = beta*S-gamma; the rest of the equations need to use exp(logI) to refer to I. So:
gfun_log <- function(time, y, params) {
g <- with(as.list(c(y,params)), c(-beta*S*exp(logI), beta*S-gamma, gamma*exp(logI)))
list(g)  # state vector is (S, logI, R)
}
(it would be slightly more efficient to compute exp(logI) once and store/re-use it rather than computing it twice ...)
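The same log-transform idea can be checked end to end with a small self-contained integrator. This is a sketch in Python with NumPy and a hand-written fixed-step RK4 loop rather than deSolve; the parameter values, step size, and function names are illustrative assumptions. Because I is recovered as exp(logI), it cannot become negative whatever the solver does.

```python
import numpy as np

def sir_log_rhs(state, beta, gamma):
    """SIR gradient with log(I) as the second state variable."""
    s, log_i, r = state
    i = np.exp(log_i)                      # computed once and reused
    return np.array([-beta * s * i, beta * s - gamma, gamma * i])

def rk4(rhs, state, dt, n_steps, *args):
    """Classical fixed-step fourth-order Runge-Kutta integration."""
    out = [state]
    for _ in range(n_steps):
        k1 = rhs(state, *args)
        k2 = rhs(state + 0.5 * dt * k1, *args)
        k3 = rhs(state + 0.5 * dt * k2, *args)
        k4 = rhs(state + dt * k3, *args)
        state = state + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)
        out.append(state)
    return np.array(out)

# Illustrative values: beta = 0.5, gamma = 0.1, 1% initially infected.
y0 = np.array([0.99, np.log(0.01), 0.0])
traj = rk4(sir_log_rhs, y0, 0.1, 2000, 0.5, 0.1)
infected = np.exp(traj[:, 1])              # back-transform to I
```

One caveat of the transform is that a state that should reach exactly zero can only approach it, since log(0) is not representable.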
If a value doesn’t become negative in reality but becomes negative in your model, you should change your model or, equivalently, modify your differential equations such that this is not possible. In other words: Do not try to constrain your dynamical variables but their derivatives. Everything else will only lead to problems with your solver, while it should not care about a change in the differential equation.
For a simple example, suppose that:
you have a one-dimensional differential equation ẏ = f(y),
y shall not become negative,
your initial y is positive.
In this case, y can only become negative if f(0) < 0. Thus, all you have to do is to modify f such that f(0) ≥ 0 (and it is still smooth).
For a proof of principle, you can multiply f with an appropriately modified sigmoid function (which allows you to compose every logical operation with smooth functions). This way, nothing would change for most values of y, and you only change your differential equation if y is close to 0, i.e., when you were going to manipulate things anyway.
However, I would not really recommend using sigmoids without thinking about your model. If your model is totally wrong near y = 0, it will very likely already be useless for nearby values. If your simulations venture in this terrain and you want the results to be meaningful, you should fix this.
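As a proof of principle only, the sketch below gates a deliberately bad right-hand side with tanh(y / eps), which is one possible "modified sigmoid": it equals 0 at y = 0 and is essentially 1 once y is a few multiples of eps. The model `f`, the width `eps`, and the explicit Euler loop are all hypothetical choices made for illustration in Python with NumPy, and the caveat above about fixing the model itself still applies.

```python
import numpy as np

def f(y):
    """Hypothetical model with f(0) < 0: constant decay that ignores y."""
    return -1.0

def f_gated(y, eps=0.05):
    """Same model multiplied by a smooth gate that vanishes at y = 0."""
    return f(y) * np.tanh(y / eps)

def euler(rhs, y0, dt, n_steps):
    """Explicit Euler integration, kept simple for illustration."""
    ys = [y0]
    for _ in range(n_steps):
        ys.append(ys[-1] + dt * rhs(ys[-1]))
    return np.array(ys)

plain = euler(f, 1.0, 0.01, 300)        # crosses zero and keeps going
gated = euler(f_gated, 1.0, 0.01, 300)  # levels off just above zero
print(plain.min(), gated.min())
```

Far from zero the two trajectories coincide, so the gate only changes the dynamics in the region where the original model was already unphysical. The step size is kept below `eps` here so that the explicit Euler update cannot overshoot zero.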

approximation methods

I attached an image:
So in this image there is a diagram of the function, which is defined on the given points.
For example on points x=1..N.
Another diagram, which was drawn as a semitransparent curve,
That is what I want to get from the original diagram,
i.e. I want to approximate the original function so that it becomes smooth.
Are there any methods for doing that?
I heard about least squares method, which can be used to approximate a function by straight line or by parabolic function. But I do not need to approximate by parabolic function.
I probably need to approximate it by trigonometric function.
So are there any methods for doing that?
And one idea, is it possible to use the Least squares method for this problem, if we can deduce it for trigonometric functions?
One more question!
If I use the discrete Fourier transform and think of the function as a sum of waves, then maybe the noise has special features by which we can identify it; we could then set the corresponding frequencies to zero and perform the inverse Fourier transform.
So if you think that it is possible, then what can you suggest in order to identify the frequency of noise?
Unfortunately, many of the solutions presented here don't solve the problem and/or are plain wrong.
There are many approaches, and each is built for specific conditions and requirements that you must be aware of!
a) Approximation theory: If you have a very sharply defined function without errors (given by either definition or data) and you want to trace it as exactly as possible, you use
polynomial or rational approximation by Chebyshev or Legendre polynomials, meaning that you
approximate the function by a polynomial or, if it is periodic, by a Fourier series.
b) Interpolation: If you have a function where some points (but not the whole curve!) are given and you need a function that passes through these points, you can use several methods:
Newton-Gregory, Newton with divided differences, Lagrange, Hermite, Spline
c) Curve fitting: You have a function with given points and you want to draw a curve with a given (!) function which approximates the curve as closely as possible. There are linear
and nonlinear algorithms for this case.
Your drawing implies:
It is not remotely like a mathematical function.
It is not sharply defined by data or function.
You need to fit the curve, not some points.
What you want and need is
d) Smoothing: Given a curve or datapoints with noise or rapidly changing elements, you only want to see the slow changes over time.
You can do that with LOESS as Jacob suggested (but I find that overkill, especially because
choosing a reasonable span needs some experience). For your problem, I simply recommend
the running average as suggested by Jim C.
Sorry, cdonner and Orendorff, your proposals are well-intentioned, but completely wrong because you are using the right tools for the wrong problem.
These guys used a sixth-degree polynomial to fit climate data and embarrassed themselves completely.
Use loess in R (free).
E.g. here the loess function approximates a noisy sine curve.
As you can see you can tweak the smoothness of your curve with span
Here's some sample R code from here:
Step-by-Step Procedure
Let's take a sine curve, add some "noise" to it, and then see how the loess "span" parameter affects the look of the smoothed curve.
Create a sine curve and add some noise:
period <- 120
x <- 1:120
y <- sin(2*pi*x/period) +
Plot the points on this noisy sine curve:
plot(x,y, main="Sine Curve + 'Uniform' Noise")
mtext("showing loess smoothing (local regression
Apply loess smoothing using the default span value of 0.75:
y.loess <- loess(y ~ x, span=0.75, data.frame(x=x, y=y))
Compute loess smoothed values for all points along the curve:
y.predict <- predict(y.loess,
Plot the loess smoothed curve along with the points that were already
You could use a digital filter like a FIR filter. The simplest FIR filter is just a running average. For a more sophisticated treatment, look at something like an FFT.
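Both suggestions fit in a few lines. The sketch below, in Python with NumPy, applies a running average and a crude FFT low-pass to a noisy sine; the window length, the number of retained frequency bins (`keep`), and the uniform noise amplitude are assumptions picked by hand for this synthetic example, not values derived from the question's data.

```python
import numpy as np

def running_average(y, window):
    """Simplest FIR filter: convolve with a flat kernel of the given length."""
    kernel = np.ones(window) / window
    # mode="same" keeps the length; values near the ends are zero-padded.
    return np.convolve(y, kernel, mode="same")

def fft_lowpass(y, keep):
    """Zero every frequency bin from index `keep` upward, then invert."""
    spectrum = np.fft.rfft(y)
    spectrum[keep:] = 0.0
    return np.fft.irfft(spectrum, n=len(y))

rng = np.random.default_rng(0)
x = np.arange(1, 121)
clean = np.sin(2 * np.pi * x / 120)
noisy = clean + rng.uniform(-1.0, 1.0, size=x.size)

smooth_avg = running_average(noisy, window=15)
smooth_fft = fft_lowpass(noisy, keep=4)

def rms(v):
    return np.sqrt(np.mean(v ** 2))

print(rms(noisy - clean), rms(smooth_avg - clean), rms(smooth_fft - clean))
```

Choosing the cutoff is the hard part in practice: plotting `np.abs(np.fft.rfft(noisy))` and looking for where the spectrum flattens into a noise floor is a common starting point.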
This is called curve fitting. The best way to do this is to find a numeric library that can do it for you. Here is a page showing how to do this using scipy. The picture on that page shows what the code does:
Now it's only 4 lines of code, but the author doesn't explain it at all. I'll try to explain briefly here.
First you have to decide what form you want the answer to be. In this example the author wants a curve of the form
f(x) = p0 cos (2π/p1 x + p2) + p3 x
You might instead want the sum of several curves. That's OK; the formula is an input to the solver.
The goal of the example, then, is to find the constants p0 through p3 to complete the formula. scipy can find this array of four constants. All you need is an error function that scipy can use to see how close its guesses are to the actual sampled data points.
fitfunc = lambda p, x: p[0]*cos(2*pi/p[1]*x+p[2]) + p[3]*x # Target function
errfunc = lambda p: fitfunc(p, Tx) - tX # Distance to the target function
errfunc takes just one parameter: an array of length 4. It plugs those constants into the formula and calculates an array of values on the candidate curve, then subtracts the array of sampled data points tX. The result is an array of error values; presumably scipy will take the sum of the squares of these values.
Then just put some initial guesses in and scipy.optimize.leastsq crunches the numbers, trying to find a set of parameters p where the error is minimized.
p0 = [-15., 0.8, 0., -1.] # Initial guess for the parameters
p1, success = optimize.leastsq(errfunc, p0[:])
The result p1 is an array containing the four constants. success is 1, 2, 3, or 4 if the solver actually found a solution. (If the errfunc is sufficiently crazy, the solver can fail.)
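A related shortcut, not the leastsq approach above but a swapped-in technique: if the period is already known, the model a*cos(w*x) + b*sin(w*x) + c*x is linear in its coefficients, so ordinary linear least squares solves it without an initial guess. The sketch below uses Python with NumPy; the function name `fit_trig_linear`, the synthetic data, and the assumption of a known period are all illustrative. The amplitude and phase of the p0*cos(w*x + p2) form are then recovered from a and b.

```python
import numpy as np

def fit_trig_linear(x, y, omega):
    """Fit y ~ a*cos(omega*x) + b*sin(omega*x) + c*x by linear least squares."""
    design = np.column_stack([np.cos(omega * x), np.sin(omega * x), x])
    coeffs, *_ = np.linalg.lstsq(design, y, rcond=None)
    a, b, c = coeffs
    # p0*cos(omega*x + p2) = p0*cos(p2)*cos(omega*x) - p0*sin(p2)*sin(omega*x)
    amplitude = np.hypot(a, b)
    phase = np.arctan2(-b, a)
    return amplitude, phase, c

# Synthetic data with known parameters (hypothetical values).
rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 200)
omega = 2 * np.pi / 2.5
y = 3.0 * np.cos(omega * x + 0.7) - 0.5 * x + rng.normal(0.0, 0.1, x.size)

amplitude, phase, slope = fit_trig_linear(x, y, omega)
print(amplitude, phase, slope)
```

When the period itself is unknown, the problem becomes nonlinear and an iterative solver such as the one described above is the appropriate tool.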
This looks like a polynomial approximation. You can play with polynomials in Excel ("Add Trendline" to a chart, select Polynomial, then increase the order to the level of approximation that you need). It shouldn't be too hard to find an algorithm/code for that.
Excel can show the equation that it came up with for the approximation, too.
