Poorly fitting curve in natural log regression - r

I'm fitting a logarithmic curve to 20+ data sets using the equation
y = intercept + coefficient * ln(x)
Generated in R via
# lad(): least absolute deviations fit (e.g. from the L1pack package)
output$curvePlot <- renderPlot({
  x <- medianX
  y <- medianY
  logEstimate <- lad(y ~ log(x), method = "EM")

  # set up the axes, then draw the fitted curve and the median points
  plot(x, predict(logEstimate), type = "l", col = "white")
  lines(x, predict(logEstimate), col = "red")
  points(x, y)

  # annotate the plot with the fitted equation
  cf <- round(coef(logEstimate), 1)
  eq <- paste0("y = ", cf[1],
               ifelse(sign(cf[2]) == 1, " + ", " - "), abs(cf[2]),
               " * ln(x) from 0 to ", xmax)
  mtext(eq, 3, line = -2, col = "red")

  output$summary   <- renderPrint(summary(logEstimate))
  output$calcCurve <- renderPrint(round(cf[2] * log(input$calcFeet) + cf[1]))
})
The fitted curve consistently "crosses" the data twice: it sits too low at the low and high ends of the x axis and too high in the middle.
I don't really understand where to go from here. Am I missing a factor or using the wrong curve?
The dataset is about 60,000 rows long, but I condensed it into medians. Medians were selected due to unavoidable outliers in the data, particularly a thick left tail, caused by our instrumentation.
x,y
2,6.42
4,5.57
6,4.46
8,3.55
10,2.72
12,2.24
14,1.84
16,1.56
18,1.33
20,1.11
22,0.92
24,0.79
26,0.65
28,0.58
30,0.34
32,0.43
34,0.48
36,0.38
38,0.37
40,0.35
42,0.32
44,0.21
46,0.25
48,0.24
50,0.25
52,0.23
Full methodology for context:
Samples of dependent variable, velocity (ft/min), were collected at
various distances from fan nozzle with a NIST-calibrated hot wire
anemometer. We controlled for instrumentation accuracy by subjecting
the anemometer to a weekly test against a known environment, a
pressure tube with a known aperture diameter, ensuring that
calibration was maintained within +/- 1%, the anemometer’s published
accuracy rating.
We controlled for fan alignment with the anemometer down the entire
length of the track using a laser from the center of the fan, which
aimed no more than one inch from the center of the anemometer at any
distance.
While we did not explicitly control for environmental factors, such as
outdoor air temperature and barometric pressure, we believe that these
factors will have minimal influence on the test results. To ensure
that data was collected evenly in a number of environmental
conditions, we built a robot that drove the anemometer down the track
to a different distance every five minutes. This meant that data would
be collected at every independent variable position repeatedly, over
the course of hours, rather than at one position over the course of
hours. As a result, a 24-hour test would measure the air velocity at each distance over 200 times, so that changes in temperature as the room warmed or cooled throughout the day were spread across every position, using randomization to address potential confounding from environmental factors.
The data was collected via Serial port on the hot wire anemometer,
saving a timestamped CSV that included fields: Date, Time, Distance
from Fan, Measured Temperature, and Measured Velocity. Analysis on the
data was performed in R.
Testing: To gather an initial set of hypotheses, we took the median of
air velocity at each distance. The median was selected, rather than
the mean, as outliers are common in data sets measuring physical
quantities. As air moves around the room, it can cause the airflow to
temporarily curve away from the anemometer. This results in outliers
on the low end that do not reflect the actual variable we were trying
to measure. It’s also the case that, sometimes, the air velocity at a
measured distance appears to “puff,” or surge and fall. This is
perceptible by simply standing in front of the fan, and it happens on
all fans at all distances, to some degree. We believe this puffing is most likely caused by eddy currents and entrainment of the surrounding air, temporarily increasing airflow. The median result
absolves us from worrying about how strong or weak a “puff” may feel,
and it helps limit the effects on air speed of the air curving away
from the anemometer, which does not affect actual air velocity, but
only measured air velocity. With our initial dataset of medians, we
used logarithmic regression to calculate a curve to match the data and
generated our initial velocity profiles at set distances. To validate that the initial data was accurate, we ran 10 Monte Carlo folding simulations, each on 25% of the data set, and checked that the generated medians were within a reasonable tolerance of one another.
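(For concreteness, a rough sketch of what that 25% subsampling check might look like in R; raw_data and its distance and velocity columns are hypothetical names, as the original analysis code is not shown.)
set.seed(1)
subsample_medians <- replicate(10, {
  idx <- sample(nrow(raw_data), size = floor(0.25 * nrow(raw_data)))
  tapply(raw_data$velocity[idx], raw_data$distance[idx], median)
})
apply(subsample_medians, 1, range)   # spread of the per-distance medians across subsamples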
Validation: Fans were re-run every three months and the Monte Carlo folding simulations were repeated. If the error rate was within 5% of our previous test, we considered the previous test validated.

There is no problem with the code itself; you found the best possible fit using a logarithmic curve. I double-checked in Mathematica and obtained the same results.
The problem seems to reside in your model. From the data you provided and the description of how it was collected, the logarithmic function might not be the best model for your measurements. The description implies that the velocity must be finite at x = 0 and slowly tend towards 0 as x goes to infinity. A logarithmic function with a negative coefficient, however, blows up as x approaches 0 and eventually becomes negative.
I am not a physicist, but my intuition would tend towards using the inverse-square law or an exponential function. I tested both, and the exponential function gives much better results.
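For reference, a minimal R sketch of that comparison on the posted medians; it uses ordinary least squares (lm/nls) rather than the lad fit from the question, purely to show the difference in shape:
dat <- data.frame(
  x = seq(2, 52, by = 2),
  y = c(6.42, 5.57, 4.46, 3.55, 2.72, 2.24, 1.84, 1.56, 1.33, 1.11, 0.92, 0.79, 0.65,
        0.58, 0.34, 0.43, 0.48, 0.38, 0.37, 0.35, 0.32, 0.21, 0.25, 0.24, 0.25, 0.23)
)

fit_log <- lm(y ~ log(x), data = dat)              # the current logarithmic model
start   <- lm(log(y) ~ x, data = dat)              # crude starting values for the exponential
fit_exp <- nls(y ~ a * exp(b * x), data = dat,
               start = list(a = exp(coef(start)[[1]]), b = coef(start)[[2]]))

plot(dat$x, dat$y, xlab = "distance (ft)", ylab = "velocity (ft/min)")
lines(dat$x, predict(fit_log), col = "red")        # logarithmic: crosses the data twice
lines(dat$x, predict(fit_exp), col = "blue")       # exponential decay: much closer fit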

Related

Why do we discard the first 10000 simulation data?

The following code comes from the book Statistics and Data Analysis for Financial Engineering, which describes how to generate simulated data from an ARCH(1) model.
library(TSA)
library(tseries)

n <- 10200
set.seed("7484")
e <- rnorm(n)         # i.i.d. standard normal shocks
a <- e                # innovations (ARCH errors)
y <- e                # the simulated series
sig2 <- e^2           # conditional variances

omega <- 1
alpha <- 0.55
phi <- 0.8
mu <- 0.1

# stationary variance of the ARCH(1) errors, and its square root
omega / (1 - alpha); sqrt(omega / (1 - alpha))

for (t in 2:n) {
  a[t] <- sqrt(sig2[t]) * e[t]                 # ARCH(1) error
  y[t] <- mu + phi * (y[t - 1] - mu) + a[t]    # AR(1) mean equation
  sig2[t + 1] <- omega + alpha * a[t]^2        # conditional variance recursion
}

plot(e[10001:n], type = "l", xlab = "t", ylab = expression(epsilon),
     main = "(a) white noise")
My question is: why do we need to discard the first 10,000 simulated values?
Bottom Line Up Front
Truncation is needed to deal with sampling bias introduced by the simulation model's initialization when the simulation output is a time series.
Details
Not all simulations require truncation of initial data. If a simulation produces independent observations, then no truncation is needed. The problem arises when the simulation output is a time series. Time series differ from independent data because their observations are serially correlated (also known as autocorrelated). For positive correlations, the result is similar to having inertia: observations which are near neighbors tend to be similar to each other.
This characteristic interacts with the reality that computer simulations are programs, and all state variables need to be initialized to something. The initialization is usually to a convenient state, such as "empty and idle" for a queueing service model where nobody is in line and the server is available to immediately help the first customer. As a result, that first customer experiences zero wait time with probability 1, which is certainly not the case for the wait time of some customer k where k > 1.
Here's where serial correlation kicks us in the pants. If the first customer always has a zero wait time, that affects the experiences of some unknown number of subsequent customers. On average they tend to be below the long-term average wait time, but gravitate towards that long-term average as k, the customer number, increases. How long this "initialization bias" lingers depends both on how atypical the initialization is relative to the long-term behavior, and on the magnitude and duration of the serial correlation structure of the time series.
The average of a set of values yields an unbiased estimate of the population mean only if they belong to the same population, i.e., if E[Xi] = μ, a constant, for all i. In the previous paragraph, we argued that this is not the case for time series with serial correlation that are generated starting from a convenient but atypical state. The solution is to remove some (unknown) quantity of observations from the beginning of the data so that the remaining data all have the same expected value.
This issue was first identified by Richard Conway in a RAND Corporation memo in 1961, and published in refereed journals in 1963 [R.W. Conway, "Some tactical problems on digital simulation", Manag. Sci. 10 (1963) 47–61]. How to determine an optimal truncation amount has been and remains an active area of research in the field of simulation. My personal preference is for a technique called MSER, developed by Prof. Pres White (University of Virginia). It treats the end of the data set as the most reliable in terms of unbiasedness and works its way towards the front, using a fairly simple measure to detect when adding observations closer to the front produces a significant deviation. You can find more details in this 2011 Winter Simulation Conference paper if you're interested. Note that the 10,000 you used may be overkill, or it may be insufficient, depending on the magnitude and duration of serial correlation effects for your particular model.
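For what it's worth, here is a bare-bones sketch of an MSER-style truncation search in R. It is my own illustration of the rule described above (minimise the marginal-standard-error statistic over candidate truncation points), not code from the WSC paper:
# MSER(d) = sum((x[(d+1):n] - mean(x[(d+1):n]))^2) / (n - d)^2
mser_truncation <- function(x, max_frac = 0.5) {
  n     <- length(x)
  d_max <- floor(max_frac * n)                # search at most the first half of the series
  stat  <- sapply(0:d_max, function(d) {
    kept <- x[(d + 1):n]
    sum((kept - mean(kept))^2) / (n - d)^2
  })
  which.min(stat) - 1                         # best truncation point (0 = keep everything)
}
# e.g. d_star <- mser_truncation(y); y_trimmed <- y[(d_star + 1):length(y)]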
It turns out that serial correlation causes other problems in addition to the issue of initialization bias. It also has a significant effect on the standard error of estimates, as pointed out at the bottom of page 489 of the WSC2011 paper, so people who calculate the i.i.d. estimator s^2/n can be off by orders of magnitude on the estimated width of confidence intervals for their simulation output.

Using a Point Process model for Prediction

I am analysing ambulance incident data. The dataset covers three years and has roughly 250000 incidents.
Preliminary analysis indicates that the incident distribution is related to population distribution.
Fitting a point process model using spatstat agrees with this, with broad agreement in a partial residual plot.
However, it is believed that the pattern diverges from this population-related trend during "social hours", that is, Friday and Saturday nights and public holidays.
I want to take subsets of the data and see how they differ from the gross picture. How do I account for the difference in intensity due to the smaller number of points inherent in a subset of the data?
Or is there a way to directly use my fitted model for the gross picture?
It is difficult to provide data as there are privacy issues, and with the size of the dataset, it's hard to simulate the situation. I am not by any means a statistician, hence I am floundering a bit here. I have a copy of "Spatial Point Patterns: Methodology and Applications with R", which is very useful.
I will try to explain my methodology so far with pseudocode:
pts_250k   <- ppp(x = ambulance_x, y = ambulance_y, window = the_window)  # ~250k incidents
census_pts <- ppp(x = census_x, y = census_y, window = the_window)        # ~1.3m census points
Best bandwidth for the density surface by visual inspection seemed to be bw.scott. This was used to fit a density surface for the points.
inc_density <- density(pts_250k, sigma = bw.scott)
pop_density <- density(census_pts, sigma = bw.scott)
# the point pattern (not its density image) goes on the left-hand side of ppm()
fit0    <- ppm(pts_250k ~ 1)
fit_pop <- ppm(pts_250k ~ pop_density)
partials <- parres(fit_pop, "pop_density")
Plotting the partial residuals shows that the agreement with the linear fit is broadly acceptable, with some areas of 'wobble'.
What I am thinking of doing next:
the_ambulance_data %>%
  group_by(day_of_week, hour_of_day) %>%
  select(x_coord, y_coord) %>%
  nest() -> nested_day_hour_pts
Taking one of these list items and creating a ppp, say fri_2300hr_ppp:
fri23.den <- density(fri_2300hr_ppp, sigma = bw.scott)
fit_fri23 <- ppm(fri_2300hr_ppp ~ pop_density)
How do I then compare this ppp or density with the broader model? I can do characteristic tests such as dispersion and clustering, but can I compare the partial residuals of fit_pop and fit_fri23?
How do I control for the effect of the number of points on the density - i.e. I have 250k points versus maybe 8000 points in the subset. I'm thinking maybe quantiles of the density surface?
Attach marks to the ambulance data representing the subsets/categories of interest (e.g. 'busy' vs 'non-busy'). For an informal or nonparametric analysis, use tools like relrisk, or use density.splitppp after separating the different types of points using split.ppp. For a formal analysis (taking into account the sample sizes, etc.), you should fit several candidate models to the same data, one model having a busy/non-busy effect and another model having no such effect, then use anova.ppm to test formally whether there is a busy/non-busy effect. See Chapter 14 of the book mentioned.
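A hedged sketch of that workflow in spatstat, assuming the objects from the question above (pop_density) and placeholder names for everything else (ambulance_ppp, is_social_hours):
library(spatstat)

# mark each incident as 'busy' (social hours) or 'quiet'
marks(ambulance_ppp) <- factor(ifelse(is_social_hours, "busy", "quiet"))

# informal / nonparametric comparison
rr   <- relrisk(ambulance_ppp)                          # relative risk surface
dens <- density(split(ambulance_ppp), sigma = bw.scott) # separate density surfaces

# formal comparison: nested models with and without a busy/quiet effect
fit_null <- ppm(ambulance_ppp ~ pop_density)
fit_busy <- ppm(ambulance_ppp ~ marks * pop_density)
anova(fit_null, fit_busy, test = "Chi")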

Is it feasible to denoise time-irrelevant sensor readings with a Kalman Filter, and how to code it?

After doing some research, I understand how to implement it for time-relevant functions. However, I'm not sure whether I can apply it to time-irrelevant scenarios.
Suppose we have a simple function y = a*x^2, where both y and x are measured at a constant interval (say 1 min/sample) and a is a constant. However, both the y and x measurements contain white noise.
More specifically, x and y are two independently measured variables. For example, x is the air flow rate in a duct and y is the pressure drop across the duct. Because the air flow varies with the fan speed, the pressure drop across the duct also varies. The relation between the pressure drop y and the flow rate x is y = a*x^2, but both measurements have embedded white noise. Is it possible to use a Kalman Filter to estimate a more accurate y? Both x and y are recorded at a constant time interval.
Here are my questions:
Is it feasible to implement a Kalman Filter for reducing noise in the y reading? Or, in other words, to get a better estimate of y?
If it is feasible, how would I code it in R or C?
P.S.
I tried applying a Kalman Filter to a single variable and it works well. The result is as below. I'll give Ben's suggestion a try and see whether I can make it work.
I think you can apply some Kalman Filter like ideas here.
Make your state a, with variance P_a. Your update is just F=[1], and your measurement is just H=[1] with observation y/x^2. In other words, you measure x and y and estimate a by solving for a in your original equation. Update your scalar KF as usual. Approximating R will be important. If x and y both have zero mean Gaussian noise, then y/x^2 certainly doesn't, but you can come up with an approximation.
Now that you have a running estimate of a (which is a random constant, so Q=0 ideally, but maybe Q=[tiny] to avoid numerical issues) you can use it to get a better y.
You have y_meas and y_est=a*x_meas^2. Combine those using your variances as (R_y * a * x^2 + (P_a + R_x2) * y_meas) / (R_y + P_a + R_x2). Over time as P_a goes to zero (you become certain of your estimate of a) you can see you end up combining information from your x and y measurements proportional to your trust in them individually. Early on, when P_a is high you are mostly trusting the direct measurement of y_meas because you don't know the relationship.
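A rough R sketch of that scheme; the flow profile, the noise variances, and the first-order variance approximations below are my own illustrative assumptions, not part of the answer:
set.seed(1)
n      <- 500
a_true <- 0.8
x_true <- 5 + 2 * sin(seq_len(n) / 20)    # slowly varying "true" flow rate
R_x    <- 0.05^2                          # variance of the x measurement noise
R_y    <- 0.5^2                           # variance of the y measurement noise
x_meas <- x_true + rnorm(n, sd = sqrt(R_x))
y_meas <- a_true * x_true^2 + rnorm(n, sd = sqrt(R_y))

a_hat   <- 1                              # initial guess for the constant a
P_a     <- 10                             # variance of that guess
Q       <- 1e-8                           # tiny process noise for numerical stability
y_fused <- numeric(n)

for (k in seq_len(n)) {
  P_a <- P_a + Q                                              # predict (a is constant)
  z   <- y_meas[k] / x_meas[k]^2                              # pseudo-measurement of a
  Rz  <- R_y / x_meas[k]^4 + (2 * a_hat / x_meas[k])^2 * R_x  # first-order noise approximation
  K   <- P_a / (P_a + Rz)                                     # Kalman gain
  a_hat <- a_hat + K * (z - a_hat)
  P_a   <- (1 - K) * P_a
  # fuse the direct y measurement with the model prediction a_hat * x^2
  y_model    <- a_hat * x_meas[k]^2
  V_model    <- x_meas[k]^4 * P_a + (2 * a_hat * x_meas[k])^2 * R_x
  y_fused[k] <- (V_model * y_meas[k] + R_y * y_model) / (V_model + R_y)
}
# y_fused should track a_true * x_true^2 more closely than y_meas once P_a shrinks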

Using R's fft function

I'm currently trying to use the fft function in R to transform measured soil temperature at a certain depth so as to model soil temperatures and heat fluxes at different depths.
I wanted to clarify some points regarding the fft function in R, as I'm currently experiencing problems implementing this procedure.
So I have a df containing the date and time and soil temperatures at 5cm (T5) depth for a period of several months. According to the literature, it is possible to simulate temperatures and heat fluxes at different depths based on a fast Fourier transform of the measured data.
So my first step was naturally DF$FFT <- fft(DF$T5).
From this I receive a series of complex numbers (Cn), i.e. their respective real (an) and imaginary (bn) parts.
According to the literature, I can then recreate the T5 data with a formula based on outputs from the aforementioned fft.
T(0,t) = meanT + sum from n = 1 to M of [ An * sin(n*omega*t + phi_n) ]

where M is the highest harmonic, T(0,t) is the temperature at a given time point, meanT is the mean temperature over the period, t is the time, and:

An = (2/sqrt(N)) * |Cn|
|Cn| = modulus of the complex number of the nth harmonic, i.e. Mod(DF$FFT)
phi_n = arctan(an/bn), i.e. atan(Re(DF$FFT)/Im(DF$FFT))
omega = 2*pi/N
Unfortunately, based on the output of fft in R, I cannot recreate the temperature values using the above formula. I realise I can recreate the data using
fft(fft(DF$T5), inverse = TRUE) / length(DF$T5)
However, I need to be able to do it with the above equation so as to use its terms to model temperatures at other depths. Could anyone lend a hand as to where I may be going wrong with the procedure described above? For example, the above procedure was implemented in a paper where the fft function from Mathcad was used. I am not looking for a quick-fix solution to my problem, and I understand that more data and info would be handy if that were the case. What I am looking for is a bit of guidance on e.g. any peculiarities of the R fft that I should be aware of.
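(For reference, here is a minimal sketch of how the amplitude/phase terms map onto R's unnormalised fft() output. It uses a cosine-plus-phase form, equivalent to the sine form above up to a 90-degree phase shift, and a synthetic series in place of DF$T5; the (2/sqrt(N)) scaling quoted from the paper may reflect a different normalisation convention in the software used there.)
N  <- 240                                             # synthetic stand-in for DF$T5
T5 <- 10 + 5 * sin(2 * pi * (0:(N - 1)) / 24) + rnorm(N, sd = 0.3)

Cn <- fft(T5)
M  <- floor((N - 1) / 2)                              # highest full harmonic below Nyquist
tt <- 0:(N - 1)

recon <- rep(Re(Cn[1]) / N, N)                        # mean (DC) term
for (n in 1:M) {
  An    <- 2 * Mod(Cn[n + 1]) / N                     # amplitude of the nth harmonic
  phin  <- Arg(Cn[n + 1])                             # phase of the nth harmonic
  recon <- recon + An * cos(2 * pi * n * tt / N + phin)
}
if (N %% 2 == 0)                                      # Nyquist term needed when N is even
  recon <- recon + Re(Cn[N / 2 + 1]) * cos(pi * tt) / N

max(abs(recon - T5))                                  # ~1e-13: exact up to rounding error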
If anyone could help in any way possible it would be most appreciated. Also if anyone needs more info regarding my problem please do ask
thanks a lot
Brad

Units of a Fourier Transform (FFT) when doing Spectral Analysis of a Signal

My question has to do with the physical meaning of the results of doing a spectral analysis of a signal, or of throwing the signal into an FFT and interpreting what comes out using a suitable numerical package.
Specifically:
take a signal, say a time-varying voltage v(t)
throw it into an FFT (you get back a sequence of complex numbers)
now take the modulus (abs) and square the result, i.e. |fft(v)|^2.
So you now have real numbers on the y axis -- shall I call these spectral coefficients?
using the sampling resolution, you follow a cookbook recipe and associate the spectral coefficients to frequencies.
AT THIS POINT, you have a frequency spectrum g(w) with frequency on the x axis, but WHAT PHYSICAL UNITS on the y axis?
My understanding is that this frequency spectrum shows how much of the various frequencies are present in the voltage signal -- they are spectral coefficients in the sense that they are the coefficients of the sines and cosines of the various frequencies required to reconstitute the original signal.
So the first question is, what are the UNITS of these spectral coefficients?
The reason this matters is that spectral coefficients can be tiny and enormous, so I want to use a dB scale to represent them.
But to do that, I have to make a choice:
Either I use the 20log10 dB conversion, corresponding to a field measurement, like voltage.
Or I use the 10log10 dB conversion, corresponding to an energy measurement, like power.
Which scaling I use depends on what the units are.
Any light shed on this would be greatly appreciated!
take a signal, a time-varying voltage v(t)
units are V, values are real.
throw it into an FFT -- ok, you get back a sequence of complex numbers
units are still V, values are complex (not V/Hz: the FFT of a DC signal becomes a point at the DC level, not a Dirac delta function zooming off to infinity)
now take the modulus (abs)
units are still V, values are real - magnitude of signal components
and square the result, i.e. |fft(v)|^2
units are now V^2, values are real - square of magnitudes of signal components
shall I call these spectral coefficients?
It's closer to a power density than to the usual use of "spectral coefficient". If your sink is a perfect resistor, it will be power, but if your sink is frequency dependent, it's "the square of the magnitude of the FFT of the input voltage".
AT THIS POINT, you have a frequency spectrum g(w): frequency on the x axis, and... WHAT PHYSICAL UNITS on the y axis?
Units are V^2
The other reason the units matter is that the spectral coefficients can be tiny and enormous, so I want to use a dB scale to represent them. But to do that, I have to make a choice: do I use the 20log10 dB conversion (corresponding to a field measurement, like voltage)? Or do I use the 10log10 dB conversion (corresponding to an energy measurement, like power)?
You've already squared the voltage values, giving equivalent power into a perfect 1 Ohm resistor, so use 10log10.
log(x^2) is 2 log(x), so 20log10 |fft(v)| = 10log10(|fft(v)|^2); alternatively, if you did not square the values, you could use 20log10.
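A quick numeric check of that identity in R:
v <- Mod(fft(rnorm(16)))
all.equal(20 * log10(v), 10 * log10(v^2))   # TRUE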
The y axis is complex (as opposed to real). The magnitude is the amplitude of the original signal in whatever units your original samples were in. The angle is the phase of that frequency component.
Here's what I've been able to come up with so far:
The y-axis seems likely to be in units of [Energy / Hz] !?
Here's how I'm deriving this (feedback welcomed!):
the signal v(t) is in volts
so after taking the Fourier integral: integral e^iwt v(t) dt , we should have units of [volts*seconds], or [volts/Hz] (e^iwt is unitless)
taking the magnitude squared should then give units of [volts^2 * s^2], or [v^2 * s/Hz]
we know Power is proportional to volts ^2, so this gets us to [power * s / Hz]
but Power is the time-rate of change in energy, i.e. power = energy/s, so we can also write Energy = power * s
this leaves us with the candidate conclusion [Energy/Hz]. (Joules/Hz ?!)
... which suggests the meaning "energy content per Hz", and suggests a use: integrating over frequency bands to see their energy content... which would be very nice if it were true...
Continuing... assuming the above is correct, then we are dealing with an Energy measurement, so this would suggest using 10log10 conversion to get into dB scale, instead of 20log10...
...
The power into a resistor is v^2/R watts. The power of a signal x(t) is an abstraction of the power into a 1 Ohm resistor. Therefore, the power of a signal x(t) is x^2 (also called instantaneous power), regardless of the physical units of x(t).
For example, if x(t) is temperature, and the units of x(t) are degrees C, then the units for the power x^2 of x(t) are C^2, certainly not watts.
If you take the Fourier transform of x(t) to get X(jw), then the units of X(jw) are C*sec, or C/Hz (according to the Fourier transform integral). If you use abs(X(jw))^2, then the units are C^2*sec^2 = C^2*sec/Hz. Since power units are C^2, and energy units are C^2*sec, abs(X(jw))^2 gives the energy spectral density, say E/Hz. This is consistent with Parseval's theorem, where the energy of x(t) is given by (1/(2*pi)) times the integral of abs(X(jw))^2 with respect to w: in units, (1/(2*pi)) * int(abs(X(jw))^2 dw) → (1/(2*pi)) * (C^2*sec^2) * (2*pi*Hz) → (1/(2*pi)) * (C^2*sec/Hz) * (2*pi*Hz) → E.
Conversion to a dB (log scale) scale does not change the units.
If you take the FFT of samples of x(t), written as x(n), to get X(k), then the result X(k) is an estimate of the Fourier series coefficients of a periodic function, where one period over T0 seconds is the segment of x(t) that was sampled. If the units of x(t) are degrees C, then the units of X(k) are also degrees C. The units of abs(X(k))^2 are C^2, which are the units of power. Thus, a plot of abs(X(k))^2 versus frequency shows the power spectrum (not the power spectral density) of x(n), which is an estimate of the power of a set of frequency components of x(t) at the frequencies k/T0 Hz.
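A quick numeric check of the discrete-case statement, using synthetic samples: by the discrete form of Parseval's theorem, the mean power of x(n) equals sum(abs(X(k))^2) / N^2.
set.seed(1)
x <- rnorm(1024)
X <- fft(x)
mean(x^2)                      # average power of the samples
sum(Mod(X)^2) / length(x)^2    # the same value recovered from the FFT bins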
Well, late answer I know. But I just had cause to do something like this, in a different context. My raw data was latency values for transactions against a storage unit - I resampled it to a 1ms time interval. So original data y was "latency, in microseconds." I had 2^18 = 262144 original data points, on 1ms time steps.
After I did the FFT, I got a 0th component (DC) such that the following held:
FFT[0] = 262144*(average of all input data).
So it looks to me like FFT[0] is N*(average of input data). That sort of makes sense - every single data point possesses that DC average as part of what it is, so you add 'em all up.
If you look at the definition of the FFT that makes sense too. All of the other components would involve sine and cosine terms too, but really the FFT is just a summation. The average is just the only one that happens to be present in all points equally, because you have cos(0) = 1.
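A quick check of that observation in R, with synthetic data standing in for the latency trace:
x <- rnorm(262144, mean = 3)   # illustrative data, not the latency measurements
X <- fft(x)
Re(X[1])                       # the "FFT[0]" (DC) component
length(x) * mean(x)            # N * mean(x): identical up to floating-point error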

Resources