Using R's fft function

I'm currently trying to use the fft function in R to transform measured soil temperature at a certain depth so as to model soil temperatures and heat fluxes at different depths.
I wanted to clarify some points regarding the fft function in R, as I'm currently experiencing problems implementing this procedure.
So I have a df containing the date and time and soil temperatures at 5cm (T5) depth for a period of several months. According to the literature, it is possible to simulate temperatures and heat fluxes at different depths based on a fast Fourier transform of the measured data.
So my first step was naturally DF$FFT = fft (DF$T5)
From this I receive a series of complex numbers (Cn), i.e. their respective real (an) and imaginary (bn) parts.
According to the literature, I can then recreate the T5 data with a formula based on outputs from the aforementioned fft.
$$T_{0,t} = \bar{T} + \sum_{n=1}^{M} A_n \sin(n\omega t + \varphi_n)$$
where the sum runs from n = 1 to M, the highest harmonic, $T_{0,t}$ is the temperature at a given time point, $\bar{T}$ is the mean temperature over the period, t is the time, and
$A_n = (2/\sqrt{N})\,|C_n|$
$|C_n|$ = the modulus of the complex number of the nth harmonic, i.e. Mod(DF$FFT)
$\varphi_n = \arctan(a_n/b_n)$, i.e. arctan(Re(DF$FFT)/Im(DF$FFT))
$\omega = 2\pi/N$
Unfortunately, based on the output of the fft in R, I cannot recreate the temperature values using the above formula. I realise I can recreate the data using
fft(fft(DF$T5), inverse = TRUE)/length(DF$T5)
However, I need to be able to do it with the above equation, so as to use its terms to model temperatures at other depths. Could anyone lend a hand as to where I may be going wrong with the procedure I have described above? For example, the above procedure was implemented in a paper where the fft function from Mathcad was used. I am not looking for a quick-fix solution to my problem, and I understand that more data and info would be handy if that were the case. What I am looking for is a bit of guidance on, e.g., any peculiarities of the R fft that I should be aware of.
If anyone could help in any way it would be most appreciated. Also, if anyone needs more info regarding my problem, please do ask.
Thanks a lot,
Brad
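One peculiarity worth checking is normalisation: R's fft() is unnormalised and returns a two-sided spectrum, whereas some implementations (Mathcad's fft among them, if I remember correctly) divide by sqrt(N), which is probably where the 2/sqrt(N) factor in the quoted formula comes from. Below is a minimal sketch of a harmonic reconstruction written directly against R's fft() output; only DF$T5 is taken from the question, the rest is illustrative. It uses the cosine form with Arg() (i.e. atan2), which avoids the quadrant ambiguity of arctan(Re/Im); the sine form in the quoted formula differs from it only by a phase shift of pi/2.
# Reconstruction sketch: with R's unnormalised, two-sided fft() output,
# the mean is Re(C[1])/N and each one-sided amplitude is 2*Mod(C[n+1])/N.
y <- DF$T5
N <- length(y)
C <- fft(y)
M <- floor((N - 1)/2)               # highest usable harmonic
A <- 2*Mod(C[2:(M + 1)])/N          # amplitudes A_n
phi <- Arg(C[2:(M + 1)])            # phases via Arg() (i.e. atan2), not atan(Re/Im)
omega <- 2*pi/N
recon <- Re(C[1])/N +               # the mean term
  sapply(seq_len(N), function(t) sum(A*cos((1:M)*omega*(t - 1) + phi)))
max(abs(recon - y))                 # ~0, apart from the Nyquist term when N is even
The same A, phi and omega can then be reused to propagate the harmonics to other depths.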

Related

How to calculate the discrete number of steps per feature of a dataset

I have been looking for a way to calculate the minimum number of samples required, Ne(min), to train a classification model when the dataset is not normally distributed. A research paper suggests the following:
if the data are not normally distributed, an exponential relationship between d and N will be
assumed and the number of samples that are required may be as plentiful as:
Ne(min) = Dsteps^d
where Dsteps is the discrete number of steps per feature and d is the dimension of the dataset.
....
It is useful to think of a histogram approach to understand this relationship. If we want to construct a histogram from data with at least one sample in each bin and with Dsteps discrete steps per feature, we will require at least Dsteps^d samples.
The number of samples required to model the data accurately is in this case an exponential function of d.
I will be very grateful if someone can help me to get/calculate this measure: the discrete number of steps per feature.
An explanation with R or Matlab code would be very helpful. Thank you :D
Edit:
Paper reference: Christiaan Maarten Van Der Walt: Data Measure that Characterises Classification Problems, 2008.
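One way to read the histogram relationship quoted above in code, as a sketch only: the paper does not pin down how Dsteps should be chosen, so the value Dsteps = 10 and the iris example data below are purely illustrative assumptions.
# Discretise each feature into Dsteps bins and compare the bound Dsteps^d
# with the actual sample size and the number of occupied bins.
X <- as.matrix(iris[, 1:4])   # example data with d = 4 features
d <- ncol(X)
Dsteps <- 10                  # assumed discrete steps (bins) per feature
Ne_min <- Dsteps^d            # minimum samples suggested by Ne(min) = Dsteps^d
bins <- apply(X, 2, function(col) cut(col, breaks = Dsteps, labels = FALSE))
occupied <- nrow(unique(as.data.frame(bins)))
c(Ne_min = Ne_min, n = nrow(X), occupied_bins = occupied)
In this sketch Dsteps is simply a modelling choice (how finely each feature needs to be resolved); the quoted passage does not specify a rule for deriving it from the data.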

Poorly fitting curve in natural log regression

I'm fitting a logarithmic curve to 20+ data sets using the equation
y = intercept + coefficient * ln(x)
Generated in R via
output$curvePlot <- renderPlot({
  # Shiny server fragment; lad() is least-absolute-deviations regression
  # (e.g. from the L1pack package)
  x = medianX
  y = medianY
  Estimate    = lad(formula = y ~ log(x), method = "EM")
  logEstimate = lad(formula = y ~ log(x), method = "EM")
  plot(x, predict(Estimate), type = 'l', col = 'white')
  lines(x, predict(logEstimate), col = 'red')
  points(x, y)
  cf <- round(coef(logEstimate), 1)
  eq <- paste0("y = ", cf[1],
               ifelse(sign(cf[2]) == 1, " + ", " - "), abs(cf[2]),
               " * ln(x) from 0 to ", xmax)
  mtext(eq, 3, line = -2, col = "red")
  output$summary <- renderPrint(summary(logEstimate))
  output$calcCurve <-
    renderPrint(round(cf[2]*log(input$calcFeet) + cf[1]))
})
The curve consistently "crosses twice" on the data; fitting too low at low/high points on the X axis, fitting too high at the middle of the X axis.
I don't really understand where to go from here. Am I missing a factor or using the wrong curve?
The dataset is about 60,000 rows long, but I condensed it into medians. Medians were selected due to unavoidable outliers in the data, particularly a thick left tail, caused by our instrumentation.
x,y
2,6.42
4,5.57
6,4.46
8,3.55
10,2.72
12,2.24
14,1.84
16,1.56
18,1.33
20,1.11
22,0.92
24,0.79
26,0.65
28,0.58
30,0.34
32,0.43
34,0.48
36,0.38
38,0.37
40,0.35
42,0.32
44,0.21
46,0.25
48,0.24
50,0.25
52,0.23
Full methodology for context:
Samples of dependent variable, velocity (ft/min), were collected at
various distances from fan nozzle with a NIST-calibrated hot wire
anemometer. We controlled for instrumentation accuracy by subjecting
the anemometer to a weekly test against a known environment, a
pressure tube with a known aperture diameter, ensuring that
calibration was maintained within +/- 1%, the anemometer’s published
accuracy rating.
We controlled for fan alignment with the anemometer down the entire
length of the track using a laser from the center of the fan, which
aimed no more than one inch from the center of the anemometer at any
distance.
While we did not explicitly control for environmental factors, such as
outdoor air temperature or barometric pressure, we believe that these
factors will have minimal influence on the test results. To ensure
that data was collected evenly in a number of environmental
conditions, we built a robot that drove the anemometer down the track
to a different distance every five minutes. This meant that data would
be collected at every independent variable position repeatedly, over
the course of hours, rather than at one position over the course of
hours. As a result, a 24 hour test would measure the air velocity at
each distance over 200 times, allowing changes in temperature as the
room warmed or cooled throughout the day to address any confounding
environmental factors by introducing randomization.
The data was collected via Serial port on the hot wire anemometer,
saving a timestamped CSV that included fields: Date, Time, Distance
from Fan, Measured Temperature, and Measured Velocity. Analysis on the
data was performed in R.
Testing: To gather an initial set of hypotheses, we took the median of
air velocity at each distance. The median was selected, rather than
the mean, as outliers are common in data sets measuring physical
quantities. As air moves around the room, it can cause the airflow to
temporarily curve away from the anemometer. This results in outliers
on the low end that do not reflect the actual variable we were trying
to measure. It’s also the case that, sometimes, the air velocity at a
measured distance appears to “puff,” or surge and fall. This is
perceptible by simply standing in front of the fan, and it happens on
all fans at all distances, to some degree. We believe the most likely
cause of this puffing is due to eddy currents and entrainment of the
surrounding air, temporarily increasing airflow. The median result
absolves us from worrying about how strong or weak a “puff” may feel,
and it helps limit the effects on air speed of the air curving away
from the anemometer, which does not affect actual air velocity, but
only measured air velocity. With our initial dataset of medians, we
used logarithmic regression to calculate a curve to match the data and
generated our initial velocity profiles at set distances. To validate
that the initial data was accurate, we ran 10 Monte Carlo folding
simulations at 25% of the data set and ensured that the generated
medians were within a reasonable value of each other.
Validation: Fans were run every three months and the Monte Carlo
folding simulations were observed. If the error rate was <5% from our
previous test, we validated the previous test.
There is no problem with the code itself; you found the best possible fit using a logarithmic curve. I double-checked using Mathematica and obtained the same results.
The problem seems to reside in your model. From the data you provided and the description of their origin, the logarithmic function might not be the best model for your measurements. The description indicates that the velocity must be finite at x = 0 and slowly tend towards 0 as x goes to infinity. However, the fitted logarithmic function diverges at x = 0 and eventually becomes negative.
I am not a physicist, but my intuition would tend towards the inverse-square law or an exponential function. I tested both, and the exponential function gives far better results.
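A rough R sketch of that comparison, fitted to the medians posted in the question (the nls() starting values are my own guesses, not taken from the original analysis):
dat <- data.frame(
  x = seq(2, 52, by = 2),
  y = c(6.42, 5.57, 4.46, 3.55, 2.72, 2.24, 1.84, 1.56, 1.33, 1.11, 0.92, 0.79,
        0.65, 0.58, 0.34, 0.43, 0.48, 0.38, 0.37, 0.35, 0.32, 0.21, 0.25, 0.24,
        0.25, 0.23))
logFit <- lm(y ~ log(x), data = dat)                 # the logarithmic model
expFit <- nls(y ~ a*exp(b*x), data = dat,
              start = list(a = 8, b = -0.1))         # exponential decay
plot(dat$x, dat$y, xlab = "distance", ylab = "velocity (ft/min)")
lines(dat$x, predict(logFit), col = "red")
lines(dat$x, predict(expFit), col = "blue")
legend("topright", legend = c("logarithmic", "exponential"),
       col = c("red", "blue"), lty = 1)
The exponential curve stays positive and levels off towards zero at large distances, which matches the physical description better than the logarithmic fit.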

Trying to do a simulation in R

I'm pretty new to R, so I hope you can help me!
I'm trying to do a simulation for my Bachelor's thesis, where I want to simulate how a stock evolves.
I've done the simulation in Excel, but the problem is that I can't make that large of a simulation, as the program crashes! Therefore I'm trying in R.
The stock evolves as follows (everything except $\epsilon$ consists of constants which are known):
$$W_{t+\Delta t} = W_t\, e^{r \Delta t}\left(1+\pi\left(\exp\left((\sigma \lambda -0.5\sigma^2) \Delta t+\sigma \epsilon_{t+\Delta t} \sqrt{\Delta t}-1\right)\right)\right)$$
The only thing here which is stochastic is $\epsilon$, which is represented by a Brownian motion with N(0,1).
What I've done in Excel:
Made 100 samples with a size of 40. All these samples are standard normal distributed: N(0,1).
Then these outcomes are used to calculate how the stock is affected from these (the normal distribution represent the shocks from the economy).
My problem in R:
I've sampled the shocks like this:
x <- rnorm(1000, mean = 0, sd = 1)
So I have 1000 samples, which are standard normally distributed. Now I don't know how to put these results into the formula I have for the evolution of my stock. Can anyone help?
Using R for (discrete) simulation
There are two aspects to your question: conceptual and coding.
Let's deal with the conceptual first, starting with the meaning of your equation:
1. Conceptual issues
The first thing to note is that your evolution equation is continuous in time, so running your simulation as described above means accepting a discretisation of the problem. Whether or not that is appropriate depends on your model and how you have obtained the evolution equation.
If you do run a discrete simulation, then the key decision you have to make is what stepsize $\Delta t$ you will use. You can explore different step-sizes to observe the effect of step-size, or you can proceed analytically and attempt to derive an appropriate step-size.
Once you have your step-size, your simulation consists of pulling new shocks (samples of your standard normal distribution), and evolving the equation iteratively until the desired time has elapsed. The final state $W_t$ is then available for you to analyse however you wish. (If you retain all of the $W_t$, you have a distribution of the trajectory of the system as well, which you can analyse.)
So:
your $x$ are a sampled distribution of your shocks, i.e. they are $\epsilon_{t=0}$.
To simulate the evolution of the $W_t$, you will need some initial condition $W_0$. What this is depends on what you're modelling. If you're modelling the likely values of a single stock starting at an initial price $W_0$, then your initial state is a 1000 element vector with constant value.
Now evaluate your equation, plugging in all your constants, $W_0$, and your initial shocks $\epsilon_0 = x$ to get the distribution of prices $W_1$.
Repeat: sample $x$ again -- this is now $\epsilon_1$. Plugging this in, gives you $W_2$ etc.
2. Coding the simulation (simple example)
One of the useful features of R is that most operators work element-wise over vectors.
So you can pretty much type in your equation more or less as it is.
I've made a few assumptions about the parameters in your equation, and I've ignored the $\pi$ function -- you can add that in later.
So you end up with code that looks something like this:
dt <- 0.5 # step-size
r <- 1 # parameters
lambda <- 1
sigma <- 1 # std deviation
w0 <- rep(1,1000) # presumed initial condition -- prices start at 1
# Show an example iteration -- incorporate into one line for production code...
x <- rnorm(1000,mean=0,sd=1) # random shock
w1 <- w0*exp(r*dt)*(1+exp((sigma*lambda-0.5*sigma^2)*dt +
sigma*x*sqrt(dt) -1)) # evolution
When you're ready to let the simulation run, then merge the last two lines, i.e. include the sampling statement in the evolution statement. You then get one line of code which you can run manually or embed into a loop, along with any other analysis you want to run.
# General simulation step
w <- w*exp(r*dt)*(1+exp((sigma*lambda-0.5*sigma^2)*dt +
sigma*rnorm(1000,mean=0,sd=1)*sqrt(dt) -1))
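For example, a minimal way to embed that step in a loop and keep the whole trajectory (nsteps = 40 below mirrors the sample size mentioned in the question and is otherwise an arbitrary choice):
nsteps <- 40
W <- matrix(NA, nrow = nsteps + 1, ncol = 1000)   # one row per time step
W[1, ] <- w0
for (i in 1:nsteps) {
  W[i + 1, ] <- W[i, ]*exp(r*dt)*(1+exp((sigma*lambda-0.5*sigma^2)*dt +
                  sigma*rnorm(1000,mean=0,sd=1)*sqrt(dt) -1))
}
matplot(W[, 1:20], type = "l", lty = 1)           # a few of the simulated paths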
You can also easily visualise the changes and obtain summary statistics (5-number summary):
hist(w)
summary(w)
Of course, you'll still need to work through the details of what you actually want to model and how you want to go about analysing it --- and you've got the $\pi$ function to deal with --- but this should get you started toward using R for discrete simulation.

Wavelet reconstruction of time series

I'm trying to reconstruct the original time series from a Morlet's wavelet transform. I'm working in R, package Rwave, function cwt. The result of this function is a matrix of n*m (n=period, m=time) containing complex values.
To reconstruct the signal I used formula (11) in Torrence & Compo's classic text, but the result has nothing to do with the original signal. I'm especially concerned with the division of the real part of the wavelet transform by the scale; this step completely distorts the result. On the other hand, if I just sum the real parts over all the scales, the result is quite similar to the original time series, but with slightly wider values (the original series ranges ~[-0.2, 0.5], the reconstructed series ~[-0.4, 0.7]).
I'm wondering if someone could tell me of some practical procedure, formula or algorithm to reconstruct the original time series. I've already read the papers of Torrence and Compo (1998) and Farge (1992), and other books, all with different formulas, but none of them has really helped me.
I have been working on this topic recently, using the same paper. Below I show code, using an example dataset, detailing how I implemented the procedure of wavelet decomposition and reconstruction.
# Let's first write a function for the wavelet decomposition as in formula (1):
mo <- function(t, trans = 0, omega = 6, j = 0){
  dial <- 2*2^(j*.125)     # scale (dilation) of the j-th voice
  sqrt(1/dial)*pi^(-1/4)*exp(1i*omega*((t - trans)/dial))*exp(-((t - trans)/dial)^2/2)
}
# An example time series:
y <- as.numeric(LakeHuron)
From my experience, for correct reconstruction you should do two things: first subtract the mean to get a zero-mean dataset, and then increase the maximal scale. I mostly use J = 110 (although the formula in Torrence and Compo suggests 71).
# subtract the mean from the data:
y.m <- mean(y)
y.madj <- y - y.m
# increase the scale:
J <- 110
wt <- matrix(rep(NA, length(y.madj)*(J + 1)), ncol = J + 1)
# Wavelet decomposition:
for(j in 0:J){
  for(k in 1:length(y.madj)){
    wt[k, j + 1] <- mo(t = 1:length(y.madj), j = j, trans = k) %*% y.madj
  }
}
# Extract the real part for the reconstruction:
wt.r <- Re(wt)
# Reconstruct as in formula (11); the constant 0.2144548 corresponds to
# dj/(C_delta*pi^(-1/4)) for the Morlet wavelet with dj = 0.125 and C_delta = 0.776:
dial <- 2*2^(0:J*.125)
rec <- rep(NA, length(y.madj))
for(l in 1:length(y.madj)){
  rec[l] <- 0.2144548*sum(wt.r[l, ]/sqrt(dial))
}
rec <- rec + y.m
plot(y,type="l")
lines(rec,col=2)
As you can see in the plot, it looks like a perfect reconstruction.

How to cluster curves with k-means?

I want to cluster some curves which contain daily click rates.
The dataset is click-rate data in time series form.
y1 = [time1:0.10,time2:0.22,time3:0.344,...]
y2 = [time1:0.10,time2:0.22,time3:0.344,...]
I don't know how to measure the similarity of two curves using k-means.
Is there any paper on this, or some library?
For similarity, you could use any kind of time series distance. Many of these will perform alignment, also of sequences of different length.
However, k-means will not get you anywhere.
K-means is not meant to be used with arbitrary distances. It actually does not use distance for assignment, but the least sum of squares (which happens to be the squared Euclidean distance), a.k.a. variance.
The mean must be consistent with this objective. It's not hard to see that the mean also minimizes the sum of squares. This guarantees convergence of k-means: in each single step (both assignment and mean update), the objective is reduced, thus it must converge after a finite number of steps (as there are only a finite number of discrete assignments).
But what is the mean of multiple time series of different length?
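If you still want a k-means-like clustering on top of a time-series distance, one option (not mentioned in the question) is k-medoids (PAM), which accepts an arbitrary precomputed dissimilarity matrix. A minimal sketch using DTW distances from the dtw package and pam() from the cluster package, on made-up click-rate curves:
library(dtw)       # dynamic time warping distances
library(cluster)   # pam() = k-medoids
set.seed(1)
# toy data: 20 click-rate curves of length 30, one curve per row
curves <- t(replicate(20, cumsum(rnorm(30, sd = 0.05)) + runif(1)))
D <- dtwDist(curves)                        # pairwise DTW dissimilarities
fit <- pam(as.dist(D), k = 3, diss = TRUE)  # k-medoids on the distance matrix
fit$clustering                              # cluster label for each curve
Unlike the k-means centroid, the medoid is always one of the observed curves, so curves of different lengths pose no problem as long as the distance can handle them.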
