Finding a cyclic function from points with more than 2 independent variables - formula

I am doing my master's thesis in electrical engineering on the impact of humidity and
temperature on power consumption.
I have a problem that relates to statistics, numerical methods and mathematics.
I have real data for one year (the year 2000).
Every day has 24 hourly records of temperature, humidity and power consumption,
so the total number of points for one parameter, for example temperature, is 24*366 = 8784 points.
I classified the power pattern into three patterns:
daily, seasonal, and one covering the whole year.
The aim is to find a mathematical model of the following form:
P = f ( T , H , t , date )
where
P = power consumption,
T = temperature,
H = humidity,
t = time in hours from 1 to 24,
date = the day number in the year from 1 to 366 (or the day number in a month from 1 to 31).
I started by plotting a sample day in Matlab, 1st August, showing the effect of time,
humidity and temperature on power consumption:
http://www7.0zz0.com/2010/12/11/23/264638558.jpg
Then I widened the analysis to see what changes when this day is plotted together with the next day:
http://www7.0zz0.com/2010/12/11/23/549837601.jpg
After that I widened it further to include the 1st week of August:
http://www7.0zz0.com/2010/12/11/23/447153078.jpg
Then the whole month of August:
http://www7.0zz0.com/2010/12/12/00/120820248.jpg
Then, starting from January, I plotted power and temperature for the first six months, without
humidity (omitted only for scaling):
http://www7.0zz0.com/2010/12/12/00/908911392.jpg
with humidity :
http://www7.0zz0.com/2010/12/12/00/102651717.jpg
Then the whole-year plot without humidity:
(P, T and H keep their actual values, but I separate out H only for scaling: the H values are much higher than those of P and T, which shrinks the plot and makes the P and T curves too small.)
http://www7.0zz0.com/2010/12/11/23/290259320.jpg
and finally with humidity:
http://www7.0zz0.com/2010/12/11/23/842530863.jpg
The reason I plotted these figures is to follow the behavior of all the parameters: how P changes with respect to temperature, humidity, the hour of the day, and the day number.
It is clear that these figures show cyclic behavior, but the behavior is not constant:
it increases and then decreases over the course of the year.
For example, the behavior on 1st January is almost the same as on any other day of the year;
the difference is a shift up or down, left or right.
Also, Temperature and Humidity are almost sinusoidal. However, Power consumption behavior is not purely sinusoidal as seen in the following figure:
http://www7.0zz0.com/2010/12/12/00/153503144.jpg
I am not an expert in statistics or numerical methods, and this part of the problem is no longer really an electrical engineering question.
The result I am aiming for:
the user specifies the day number in the year (1 to 366),
then the hour of that day,
and the temperature and humidity.
All of these parameters are specified by the user.
The result:
the mathematical model should be able to compute the power consumption for that specific hour of that day.
The power obtained from the model will then be compared with the measured power from the
data; if the values are very close to each other, the model is accurate and
accepted.
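Not part of the question, but one concrete way to cast P = f(T, H, t, date) is harmonic regression: represent the daily and yearly cycles with sine/cosine terms and fit their coefficients, together with linear temperature and humidity terms, by least squares. The sketch below uses entirely synthetic stand-in data (all numbers are invented) purely to make the idea runnable; the real 8784 hourly records would replace it.

```python
import numpy as np

def design_matrix(T, H, hour, day):
    """Fourier terms for the daily and yearly cycles plus linear T and H terms."""
    wd = 2 * np.pi * hour / 24.0    # daily cycle
    wy = 2 * np.pi * day / 366.0    # yearly cycle (2000 is a leap year)
    return np.column_stack([
        np.ones_like(T), T, H,
        np.sin(wd), np.cos(wd),     # shape within the day
        np.sin(wy), np.cos(wy),     # drift over the year
        np.sin(wd) * np.sin(wy),    # daily shape changing with the season
        np.cos(wd) * np.sin(wy),
    ])

# Synthetic stand-in for the hourly measurements (hypothetical numbers).
rng = np.random.default_rng(0)
hour = np.tile(np.arange(1.0, 25.0), 366)
day = np.repeat(np.arange(1.0, 367.0), 24)
T = 20 + 10 * np.sin(2 * np.pi * day / 366) + rng.normal(0, 1, day.size)
H = 60 - 15 * np.sin(2 * np.pi * day / 366) + rng.normal(0, 2, day.size)
P = 100 + 2 * T - 0.5 * H + 10 * np.sin(2 * np.pi * hour / 24) + rng.normal(0, 1, day.size)

X = design_matrix(T, H, hour, day)
coef, *_ = np.linalg.lstsq(X, P, rcond=None)
P_hat = X @ coef
rmse = float(np.sqrt(np.mean((P - P_hat) ** 2)))   # close to the noise level here
```

The interaction terms let the daily shape shift and scale over the year, which is one way to capture the observation above that days differ mainly by shifts up/down and left/right.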
I am sorry for this long question. I have actually read many papers and much help material, but I could not
find the correct approach for building one unified model that follows the behavior of the
curves from the start of the year to the end; having more than one independent
variable has also confused me a lot.
I hope this problem is not difficult for statistics and mathematics experts.
Any help will be highly appreciated,
Thanks in advance
Regards

About this:
"Also, Temperature and Humidity are almost sinusoidal. However, Power consumption behavior is not purely sinusoidal"
It seems that on a local scale (on the order of several days or weeks), temperature and humidity can be expressed as periodic trains of Gaussians.
Under that assumption we can model power consumption as a superposition of the temperature and humidity Gaussian trains. Consider this OpenCalc spreadsheet chart,
in which f1 and f2 are trains of Gaussians (here only 4 peaks, but you may compute as many as you need for the data fitting) and f3 is the superposition of the two trains:
f3 = (f1^2 + f2^2)^(1/2)
However, I don't know to what degree power consumption actually follows such a sum of Gaussian trains. It may be worth investing some time to explore this possibility.
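The idea can be sketched numerically. In this Python sketch the peak positions, widths and amplitudes of f1 and f2 are invented for illustration; only the structure (four peaks per train and f3 = (f1^2 + f2^2)^(1/2)) comes from the answer above.

```python
import numpy as np

def gaussian_train(t, centers, width, amplitude):
    """Sum of Gaussian bumps, one per center: a periodic-looking train of peaks."""
    return amplitude * sum(np.exp(-((t - c) ** 2) / (2 * width ** 2)) for c in centers)

t = np.linspace(0, 96, 961)                  # four days on an hourly-ish grid
centers = np.arange(12, 96, 24)              # one peak per day (4 peaks)
f1 = gaussian_train(t, centers, width=3.0, amplitude=1.0)       # temperature-like train
f2 = gaussian_train(t, centers + 2, width=4.0, amplitude=0.8)   # humidity-like, shifted
f3 = np.sqrt(f1 ** 2 + f2 ** 2)              # the superposition (f1^2 + f2^2)^(1/2)
```

Fitting the centers, widths and amplitudes to the measured curves would then be a nonlinear least-squares problem.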
Good luck!

Related

Time series forecasting of outcome variable based on current performance of outcome variable in R

I have a very large dataset (~55,000 datapoints) for chicken crops. Chickens are grown over a ~35 day period. The dataset covers 10 sheds of ~20,000 chickens each. In the sheds are weighing platforms, and as chickens step on them they send the recorded weight to a server. They send continuously from day 0 to the final day.
The variables I have are: House (as a number, House 1 up to House 10), Weight (measured in grams, to 5 decimal places) and Day (measured as a number between two integers; e.g. 12 noon on day 0 might be 0.5, whereas 23.3 means a third of the way through day 23, i.e. 8 AM. As the data is sent continuously, these numbers can be very precise).
I want to construct either a Time Series Regression model or an ML model so that if I take a new crop, as data is sent by the sensors, the model can make a prediction for what the end weight will be. Then as that crop cycle finishes it can be added to the training data and repeat.
Currently I'm using this very simple Weight VS Time model, but eventually would include things like temperature, water and food consumption, humidity etc.
I've run regression analyses on the data sets to determine the relationship between time and weight (it's likely quadratic, see image attached) and tried using randomForest in R to create a model. The test model seemed to work well in that its MAPE value was similar to the training value, but that was obtained by holding out one house and using it as the test set.
Potentially what I've tried so far is completely the wrong methodology but this is a new area so I'm really not sure of the best approach.
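As a runnable sketch of the simple Weight vs Time baseline (the quadratic relationship mentioned above), with entirely made-up numbers standing in for the shed data:

```python
import numpy as np

# Hypothetical stand-in for the platform data: weight (g) growing roughly
# quadratically over a ~35-day crop cycle, with sensor noise.
rng = np.random.default_rng(1)
day = rng.uniform(0, 35, 5000)
weight = 40 + 25 * day + 1.5 * day ** 2 + rng.normal(0, 50, day.size)

# Fit weight = b0 + b1*day + b2*day^2 and extrapolate to the final day.
b2, b1, b0 = np.polyfit(day, weight, deg=2)
pred_final = b0 + b1 * 35 + b2 * 35 ** 2
```

For the real data one would refit (per house or pooled) as each crop cycle completes, and the same design extends to extra regressors such as temperature or feed consumption.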

Poorly fitting curve in natural log regression

I'm fitting a logarithmic curve to 20+ data sets using the equation
y = intercept + coefficient * ln(x)
Generated in R via
output$curvePlot <- renderPlot({
  x <- medianX
  y <- medianY
  Estimate    <- lad(formula = y ~ log(x), method = "EM")
  logEstimate <- lad(formula = y ~ log(x), method = "EM")
  plot(x, predict(Estimate), type = 'l', col = 'white')
  lines(x, predict(logEstimate), col = 'red')
  points(x, y)
  cf <- round(coef(logEstimate), 1)
  eq <- paste0("y = ", cf[1],
               ifelse(sign(cf[2]) == 1, " + ", " - "), abs(cf[2]),
               " * ln(x) from 0 to ", xmax)
  mtext(eq, 3, line = -2, col = "red")
  output$summary <- renderPrint(summary(logEstimate))
  output$calcCurve <-
    renderPrint(round(cf[2] * log(input$calcFeet) + cf[1]))
})
The curve consistently "crosses twice" on the data: it fits too low at the low and high ends of the X axis and too high in the middle of the X axis.
I don't really understand where to go from here. Am I missing a factor or using the wrong curve?
The dataset is about 60,000 rows long, but I condensed it into medians. Medians were selected due to unavoidable outliers in the data, particularly a thick left tail, caused by our instrumentation.
x,y
2,6.42
4,5.57
6,4.46
8,3.55
10,2.72
12,2.24
14,1.84
16,1.56
18,1.33
20,1.11
22,0.92
24,0.79
26,0.65
28,0.58
30,0.34
32,0.43
34,0.48
36,0.38
38,0.37
40,0.35
42,0.32
44,0.21
46,0.25
48,0.24
50,0.25
52,0.23
Full methodology for context:
Samples of dependent variable, velocity (ft/min), were collected at
various distances from fan nozzle with a NIST-calibrated hot wire
anemometer. We controlled for instrumentation accuracy by subjecting
the anemometer to a weekly test against a known environment, a
pressure tube with a known aperture diameter, ensuring that
calibration was maintained within +/- 1%, the anemometer’s published
accuracy rating.
We controlled for fan alignment with the anemometer down the entire
length of the track using a laser from the center of the fan, which
aimed no more than one inch from the center of the anemometer at any
distance.
While we did not explicitly control for environmental factors, such as
outdoor air temperature and barometric pressure, we believe that these
factors will have minimal influence on the test results. To ensure
that data was collected evenly in a number of environmental
conditions, we built a robot that drove the anemometer down the track
to a different distance every five minutes. This meant that data would
be collected at every independent variable position repeatedly, over
the course of hours, rather than at one position over the course of
hours. As a result, a 24 hour test would measure the air velocity at
each distance over 200 times, allowing changes in temperature as the
room warmed or cooled throughout the day to address any confounding
environmental factors by introducing randomization.
The data was collected via Serial port on the hot wire anemometer,
saving a timestamped CSV that included fields: Date, Time, Distance
from Fan, Measured Temperature, and Measured Velocity. Analysis on the
data was performed in R.
Testing: To gather an initial set of hypotheses, we took the median of
air velocity at each distance. The median was selected, rather than
the mean, as outliers are common in data sets measuring physical
quantities. As air moves around the room, it can cause the airflow to
temporarily curve away from the anemometer. This results in outliers
on the low end that do not reflect the actual variable we were trying
to measure. It’s also the case that, sometimes, the air velocity at a
measured distance appears to “puff,” or surge and fall. This is
perceptible by simply standing in front of the fan, and it happens on
all fans at all distances, to some degree. We believe the most likely
cause of this puffing is due to eddy currents and entrainment of the
surrounding air, temporarily increasing airflow. The median result
absolves us from worrying about how strong or weak a “puff” may feel,
and it helps limit the effects on air speed of the air curving away
from the anemometer, which does not affect actual air velocity, but
only measured air velocity. With our initial dataset of medians, we
used logarithmic regression to calculate a curve to match the data and
generated our initial velocity profiles at set distances. To validate
that the initial data was accurate, we ran 10 Monte Carlo folding
simulations on 25% of the data set and ensured that the generated
medians were within a reasonable range of each other.
Validation: Fans were run every three months and the Monte Carlo
folding simulations were observed. If the error rate was <5% from our
previous test, we validated the previous test.
There is no problem with the code itself; you found the best possible fit using a logarithmic curve. I double-checked using Mathematica, and I obtain the same results.
The problem seems to reside in your model. From the data you provided and the description of its origin, the logarithmic function might not be the best model for your measurements. The description indicates that the velocity must be finite at x = 0 and slowly tend towards 0 as x goes to infinity. However, a logarithmic function with a negative coefficient is infinite at x = 0 and becomes negative after a while.
I am not a physicist, but my intuition would tend towards the inverse-square law or an exponential function. I tested both, and the exponential function gives far better results:
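As a rough check of that suggestion (a sketch, not the Mathematica fit from the answer), one can fit y = a*exp(b*x) to the medians from the question and compare its residuals with the logarithmic fit. Note the question uses least absolute deviations (lad); ordinary least squares is used here for simplicity. For a fixed decay rate b the optimal amplitude a has a closed form, so a simple scan over b is enough:

```python
import math

# Medians from the question: x = 2, 4, ..., 52.
x = list(range(2, 54, 2))
y = [6.42, 5.57, 4.46, 3.55, 2.72, 2.24, 1.84, 1.56, 1.33, 1.11,
     0.92, 0.79, 0.65, 0.58, 0.34, 0.43, 0.48, 0.38, 0.37, 0.35,
     0.32, 0.21, 0.25, 0.24, 0.25, 0.23]
n = len(x)

def exp_fit(b):
    """For fixed decay rate b, least-squares amplitude a and SSE of y = a*exp(b*x)."""
    e = [math.exp(b * xi) for xi in x]
    a = sum(yi * ei for yi, ei in zip(y, e)) / sum(ei * ei for ei in e)
    sse = sum((yi - a * ei) ** 2 for yi, ei in zip(y, e))
    return a, sse

# Scan decay rates; the grid bounds are arbitrary but bracket the optimum.
best = None
for k in range(1, 400):
    b = -0.001 * k
    a, sse = exp_fit(b)
    if best is None or sse < best[2]:
        best = (b, a, sse)
b_best, a_best, sse_exp = best

# Ordinary least-squares logarithmic fit y = c0 + c1*ln(x), as in the question.
lx = [math.log(xi) for xi in x]
mx, my = sum(lx) / n, sum(y) / n
c1 = sum((l - mx) * (yi - my) for l, yi in zip(lx, y)) / sum((l - mx) ** 2 for l in lx)
c0 = my - c1 * mx
sse_log = sum((yi - (c0 + c1 * li)) ** 2 for li, yi in zip(lx, y))
```

On these medians the exponential model's sum of squared residuals comes out several times smaller than the logarithmic one, which supports the answer's observation.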

Understanding TSA::periodogram()

I have some data sampled at regular intervals that looks sinusoidal, and I would like to determine the frequency of the wave. To that end I obtained R and loaded the TSA package, which contains a function named 'periodogram'.
In an attempt to understand how it works I created some data as follows:
x<-.0001*1:260
This could be interpreted to be 260 samples with an interval of .0001 seconds
Frequency=80
The frequency could be interpreted to be 80Hz so there should be about 125 points per wave period
y<-sin(2*pi*Frequency*x)
I then do:
foo=TSA::periodogram(y)
In the resulting periodogram I would expect to see a sharp spike at the frequency that corresponds to my data - I do see a sharp spike but the maximum 'spec' value has a frequency of 0.007407407, how does this relate to my frequency of 80Hz?
I note that there is variable foo$bandwidth with a value of 0.001069167 which I also have difficulty interpreting.
If there are better ways of determining the frequency of my data I would be interested - my experience with R is limited to one day.
The periodogram is computed from the time series without knowledge of your actual sampling interval. This results in frequencies that are limited to the normalized [0, 0.5] range. To obtain a frequency in Hertz that takes the sampling interval into account, you simply multiply by the sampling rate. In your case, the spike you get at a normalized frequency of 0.007407407, with a sampling rate of 10,000 Hz, corresponds to a frequency of ~74 Hz.
Now, that's not quite 80Hz (the original tone frequency), but you have to keep in mind that a periodogram is a frequency spectrum estimate, and its frequency resolution is limited by the number of input samples. In your case you are using 260 samples, so the frequency resolution is on the order of 10,000Hz/260 or ~38Hz. Since 74Hz is well within 80 +/- 38Hz, it is a reasonable result. To get a better frequency estimate you would have to increase the number of samples.
Note that the periodogram of a sinusoidal tone will typically spike near the tone frequency and decay on either side (a phenomenon caused by the limited number of samples used for the estimation, often called spectral leakage) until the value can be considered comparatively negligible. The foo$bandwidth variable is also a normalized quantity; multiplying by the sampling rate gives 0.001069167 * 10,000 Hz ≈ 10.7 Hz, and it describes the frequency resolution of the spectral estimate rather than a property of the tone itself.
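The scaling argument above can be checked with a plain FFT periodogram. This numpy sketch is not TSA's implementation: TSA pads the series (here apparently to 270 samples, since 2/270 = 0.0074074), so its peak bin differs slightly from the unpadded one.

```python
import numpy as np

fs = 10000.0                      # 1 / 0.0001 s sampling interval
n = 260
t = np.arange(1, n + 1) / fs      # same samples as x <- .0001 * 1:260
y = np.sin(2 * np.pi * 80.0 * t)

spec = np.abs(np.fft.rfft(y)) ** 2       # raw periodogram (no padding)
freqs = np.fft.rfftfreq(n, d=1.0 / fs)   # bin frequencies already in Hz
peak_hz = float(freqs[np.argmax(spec)])  # within one ~38 Hz bin of the true 80 Hz
```

With more samples the bins get narrower and the peak moves closer to 80 Hz, exactly as the answer describes.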

cluster many curves representing gas consumption

I have 700 hourly time series of gas consumption from 2010 to 2014. Each time series represents the consumption of one company.
Some have constant consumption, others consume only 4 months of the year, and some have highly volatile consumption. I would therefore like to cluster them according to the shape of the consumption curve.
I tried the R package "kml", but I did not get good results. I also tried the "kmlShape" package, but it seems I have too much data, and R quits every time.
I wondered whether applying a fast Fourier transform and then clustering could be a good idea. My goal is really to distinguish the group whose consumption is constant from those whose consumption is variable.
Then I would like to cluster the variable consumers according to their peaks and when they consume.
I also tried computing the mean and variance of each client and then clustering with k-means, but it is not very good: I get 2 clusters, one with 650 clients and another with 50...
Thanks
First example
Second example
Here are examples of what I have; I have 700 curves like these, some highly variable, some fairly constant.
I would like to cluster them according to their shape, so as to have one group where consumption is fairly constant and another where it is highly variable, and then cluster the latter according to when the peaks appear.
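One way to sidestep the kmlShape memory problem is to cluster on a few cheap shape features instead of the raw curves. In this Python sketch the synthetic series stand in for the 700 real curves, and the feature choice (coefficient of variation, top-decile share) is just one possibility among many:

```python
import numpy as np

rng = np.random.default_rng(2)
n_hours = 24 * 365

def make_series(kind):
    """Synthetic stand-ins for the three consumption shapes described above."""
    t = np.arange(n_hours)
    base = 10 + rng.normal(0, 0.3, n_hours)
    if kind == "constant":
        return base
    if kind == "seasonal":   # consumes mainly ~4 months of the year
        return base * (1 + 4 * (np.cos(2 * np.pi * t / n_hours) > 0.5))
    return base * np.exp(rng.normal(0, 0.5, n_hours))   # volatile

series = [make_series(k) for k in ["constant"] * 10 + ["seasonal"] * 10 + ["volatile"] * 10]

def features(s):
    """Level-independent shape features: relative spread and peak concentration."""
    cv = np.std(s) / np.mean(s)
    top_share = np.sort(s)[-n_hours // 10:].sum() / s.sum()
    return np.array([cv, top_share])

X = np.array([features(s) for s in series])
X = (X - X.mean(axis=0)) / X.std(axis=0)

# Tiny k-means (k = 3), deterministically seeded with one point per expected group.
centers = X[[0, 10, 20]].copy()
for _ in range(50):
    labels = np.argmin(((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1), axis=1)
    centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
                        for j in range(3)])
```

Since 700 series reduce to 700 short feature vectors, any standard k-means implementation will handle this comfortably; adding a feature such as the month of peak consumption would support the second clustering step by peak timing.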

sine wave model of temperature given daily high and low

I am trying to create a model of daily temperatures that uses the sine wave, along with the daily high and low temperature in R (where the daily high and low change each day). It would look like the sine wave found here: http://www.ipm.ucdavis.edu/WEATHER/ddconcepts.html, under degree-days.
The idea will be to calculate the area under the curve that lies between a lower and an upper temperature threshold.
I have been working on this for longer than I care to admit but don't even have anything worth showing at this point. I am at square one and will take any advice you have.
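A minimal Python sketch of that single-sine degree-day idea, rather than R: the thresholds and the placement of the daily minimum at 6 h / maximum at 18 h are assumptions for illustration, and the upper cutoff here is the simple horizontal variant (the UC Davis page linked above describes several cutoff methods).

```python
import numpy as np

def daily_temp(hours, t_low, t_high):
    """Sine interpolation of temperature: minimum at 6 h, maximum at 18 h."""
    mean = (t_high + t_low) / 2.0
    amp = (t_high - t_low) / 2.0
    return mean + amp * np.sin(2 * np.pi * (hours - 12.0) / 24.0)

def degree_days(t_low, t_high, lower=10.0, upper=30.0, steps=24 * 60):
    """Average exceedance of the lower threshold over one day, with the curve
    capped at the upper threshold (horizontal-cutoff variant)."""
    h = np.arange(steps) * (24.0 / steps)          # uniform grid over the day
    temp = np.minimum(daily_temp(h, t_low, t_high), upper)
    above = np.maximum(temp - lower, 0.0)
    return float(above.mean())                     # degree-days for this one day
```

For example, a day with low 10 and high 30 against a lower threshold of 10 accumulates 10 degree-days (the curve averages 20, so the mean exceedance is 10); summing degree_days over the days of the season, with each day's own high and low, gives the accumulated total.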
