I am working on a problem in R where the demand for a product is normally distributed with mean a and standard deviation b, and is estimated to increase by c% each year. The market share in year 1 is anywhere between d% and e%, with every value in between equally likely (i.e. uniform), and is expected to grow at f% each year.
I know I need to use the rnorm function for the year 1 demand, rnorm(1000, a, b), and the runif function for the year 1 market share, runif(1000, 1.d, 1.e) (adding the 1 in front for easier calculation).
The problem asks what the market share and demand will be over 3 years. I know year 1, but I am not sure how to set up the calculation in R for years 2 and 3 given the growth rates c and f. I currently have something like MarketSizeGrowth <- cumprod(c(runif(1000, 1.d, 1.e), rep(1.f, 2))) for the market size, but this is definitely wrong.
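For concreteness, the structure I think I am after is something like the sketch below (the numbers are placeholders I made up, not the real a, b, c, d, e, f), but I am not sure this is the right way to apply the growth:

# Placeholder numbers just to make the structure concrete; not the real a, b, c, d, e, f
demand_mean <- 2e6; demand_sd <- 4e5      # a and b
demand_growth <- 0.05                     # c, as a proportion
share_low <- 0.10; share_high <- 0.20     # d and e, as proportions
share_growth <- 0.02                      # f, as a proportion
n <- 1000

demand_y1 <- rnorm(n, demand_mean, demand_sd)
share_y1  <- runif(n, share_low, share_high)

# Years 2 and 3 are the year 1 draws scaled by the compounded growth rates
demand <- sapply(0:2, function(k) demand_y1 * (1 + demand_growth)^k)
share  <- sapply(0:2, function(k) share_y1  * (1 + share_growth)^k)
colnames(demand) <- colnames(share) <- paste0("Year", 1:3)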
I have a very large dataset (~55,000 datapoints) for chicken crops. Chickens are grown over a ~35 day period. The dataset covers 10 sheds of ~20,000 chickens each. In the sheds are weighing platforms, and as chickens step on them they send the recorded weight to a server. They send continuously from day 0 to the final day.
The variables I have are: House (as a number, House 1 up to House 10), Weight (measured in grams, to 5 decimal places) and Day (measured as a number between two integers, e.g. 12 noon on day 0 might be 0.5, whereas day 23.3 is roughly a third of the way through day 23, around 8 AM; since the data is sent continuously the numbers can be very precise).
I want to construct either a Time Series Regression model or an ML model so that if I take a new crop, as data is sent by the sensors, the model can make a prediction for what the end weight will be. Then as that crop cycle finishes it can be added to the training data and repeat.
Currently I'm using this very simple Weight VS Time model, but eventually would include things like temperature, water and food consumption, humidity etc.
I've run regression analyses on the data sets to determine the relationship between time and weight (it's likely quadratic, see the attached image) and tried using randomForest in R to create a model. The test model seemed to work well, in that the test MAPE was similar to the training MAPE, but that was from taking out one house and using it as the test set.
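For reference, a stripped-down sketch of what I did (the data frame name chickens and the choice of house 10 as the hold-out are placeholders):

library(randomForest)

# Leave-one-house-out split: train on nine houses, test on the held-out one
train <- subset(chickens, House != 10)
test  <- subset(chickens, House == 10)

rf   <- randomForest(Weight ~ Day, data = train, ntree = 500)
pred <- predict(rf, newdata = test)

# Mean absolute percentage error on the held-out house
mape <- mean(abs(test$Weight - pred) / test$Weight) * 100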
Potentially what I've tried so far is completely the wrong methodology, but this is a new area for me so I'm really not sure of the best approach.
I have two variables, x and y, measured at one minute intervals for over two years. The average daily values of x and y are almost 90% correlated. However, when I analyze x and y in one minute intervals they are only 50% correlated. How can I detect the time interval at which this correlation becomes 90%? Ideally I'd like to do this in R.
I'm new to statistics/econometrics, so my apologies if this question is very basic!
I'm not quite sure what you are asking here. What do you mean by x and y being 90 "percent" correlated? Do you mean you get a correlation coefficient of .9?
Beyond this clarification: you can absolutely have a situation where the averages of two variables are more correlated than the underlying raw measurements. In other words, order matters, so the correlation of the averages is not the average of the correlations. For example, the R code below shows that if we took 3 measurements each hour for 2 hours (6 measurements total), the overall correlation is about 0.54, while the correlation of the average hourly measures is a perfect 1. Essentially, when you take the correlation of averages you remove the effect of how the measurement values are ordered within the interval you are averaging over, and that ordering turns out to matter a lot for the correlation. Let me know if I missed something about your question though.
X <- c(1, 2, 3, 4, 5, 6)
Y <- c(3, 2, 1, 6, 5, 4)
cor(X, Y)                    # raw, per-measurement correlation (~0.54)

# Average each hour's 3 measurements, then correlate the hourly averages
HourAvgX <- c(mean(X[1:3]), mean(X[4:6]))
HourAvgY <- c(mean(Y[1:3]), mean(Y[4:6]))
cor(HourAvgX, HourAvgY)      # exactly 1
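If you do want to find the averaging window at which the correlation reaches 0.9, one rough approach (just a sketch, assuming a data frame dat with a POSIXct column time and numeric columns x and y; the window widths are arbitrary) is to average over progressively wider windows and check the correlation at each width:

widths_min <- c(1, 5, 15, 30, 60, 180, 360, 720, 1440)   # window widths in minutes
cors <- sapply(widths_min, function(w) {
  bucket <- cut(dat$time, breaks = paste(w, "min"))      # assign each observation to a window
  agg <- aggregate(list(x = dat$x, y = dat$y), by = list(bucket = bucket), FUN = mean)
  cor(agg$x, agg$y)                                      # correlation of the window averages
})
data.frame(window_minutes = widths_min, correlation = cors)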
Here is a short description of the problem I am trying to solve: I have test data for multiple variables (weight, thickness, absorption, etc.) taken at varying intervals over time, with no set schedule; sometimes there is a test a day, sometimes days go by between tests. I want to detect trends in each of these and alert stakeholders when any parameter is trending up or down by more than a certain amount.

I first fitted a linear model between each variable's raw data and test time (converted to days or weeks since a fixed date) and created a table with the slope for each variable, so the stakeholders can view one table for all variables and quickly see if any of them raises concern. The issue is that the data for most variables are very noisy. Someone suggested using time series functions, separating noise and seasonality from the trend, and studying the trend component for a cleaner analysis. I started to look into this and already see a couple of concerns/questions:
Time series analysis seems to require specifying a frequency - how do you handle this if your test data are not taken at regular intervals?
If one gets past the issue in #1 above, decomposes the data, and separates out the trend (i.e. removes the random variation/noise in particular), how would you then get a slope metric from it? Namely, if I wanted to fit a linear model to the trend component of the raw data (after decomposing), what would be the x (independent) variable? Is there a way to connect the trend component returned by the decompose function with the original data's x-axis values (in this case the actual test dates/times, say converted to weeks or days from a fixed date)?
Finally, is there a better way of accomplishing what I explained above? I am only looking for general trends over time - say over 3 months of data, not day-to-day trends.
Time series models are generally used to see whether previous observations of a variable have influence on future observations; you model under the assumption that previous observations can predict future ones. That is the reason most (not all) time series models require evenly spaced training data. If your data are not only very noisy but also not collected on a regular basis, you should seriously consider whether a time series model is the appropriate choice.
Time series analysis seems to require specifying a frequency - how do you handle this if your test data are not taken at regular intervals?
What you can do is create an aggregate by widening the time bucket (shifting from daily data to a weekly average, for instance) so that every unit of time has an instance of training data. Following your final comment, you could work with averages over the last 3 months of data instead of the individual observations.
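For example, a minimal sketch of weekly aggregation for irregularly timed tests (the data frame tests and its columns test_date and value are assumed names):

# Bucket the irregular test dates into calendar weeks, then average within each week
tests$week <- cut(tests$test_date, breaks = "week")
weekly <- aggregate(value ~ week, data = tests, FUN = mean)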
If one gets past the issue in #1 above, decomposes the data, and separates out the trend (i.e. removes the random variation/noise in particular), how would you then get a slope metric from it? Namely, if I wanted to fit a linear model to the trend component of the raw data (after decomposing), what would be the x (independent) variable?
In the simplest case of a linear model, the independent variable is the unit of time corresponding to the prediction you are trying to make. However, this is not always regarded as a time series model.
In the case of an autoregressive model, it would be the previous observation of the series you are trying to predict: you model something like y(t) = phi * y(t-1) + error, where phi acts as a smoothing factor. I encourage you to read Forecasting: Principles and Practice, which is an excellent book on the matter.
Is there a way to connect the trend component of the ts-decompose function with the original data's x-axis data (in this case the actual test date/times, say converted to weeks or days from a fixed date)?
The decompose() function returns an object (of class "decomposed.ts") that includes trend: a time series of the estimated trend values, each aligned with its respective time point in the input.
Let's create an example time series with a linear trend:
df <- data.frame(
date = seq(from = as.Date("2021-01-01"), to = as.Date("2021-01-10"), by=1)
)
df$value <- jitter(seq(from = 1, to = nrow(df), by=1))
time_series <- ts(df$value, frequency = 5)   # decompose() needs a ts with a frequency
df$trend <- decompose(time_series)$trend     # moving-average trend, aligned row-by-row with df
> df
date value trend
1 2021-01-01 0.9170296 NA
2 2021-01-02 1.8899565 NA
3 2021-01-03 3.0816892 2.992256
4 2021-01-04 4.0075589 4.042486
5 2021-01-05 5.0650478 5.046874
6 2021-01-06 6.1681775 6.051641
7 2021-01-07 6.9118942 7.074260
8 2021-01-08 8.1055282 8.041628
9 2021-01-09 9.1206522 NA
10 2021-01-10 9.9018900 NA
As you can see, the trend component is already an estimate of the dependent variable at the corresponding time. In decompose() the trend estimate is based on a moving average, hence the NAs at the ends.
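To get the slope metric asked about in #2, one possibility (just a sketch continuing the example above) is to regress the non-NA trend values on the original dates:

# Linear fit of the decomposed trend against the original dates; lm() drops the NA rows
fit <- lm(trend ~ as.numeric(date), data = df)
coef(fit)[["as.numeric(date)"]]   # slope of the trend, in value units per day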
I am doing my master's thesis in electrical engineering on the impact of humidity and temperature on power consumption.
My problem is related to statistics, numerical methods and mathematics.
I have real data for one year (the year 2000): every day has 24 hourly records of temperature, humidity and power consumption, so the total number of points for one parameter, for example temperature, is 24 * 366 = 8784.
I classified the power pattern into three scales: daily, seasonal, and the whole year.
The aim is to find a mathematical model of the following form:
P = f(T, H, t, date)
Where,
P = power consumption,
T = temperature,
H = humidity,
t = time in hours from 1 to 24,
date = the day number in the year from 1 to 366 (or the day number in a month from 1 to 31)
I started by plotting in Matlab a sample day, 1st August, showing the effect of time, humidity and temperature on power consumption:
http://www7.0zz0.com/2010/12/11/23/264638558.jpg
Then I widened the analysis to see what changes when this day is plotted together with the next day:
http://www7.0zz0.com/2010/12/11/23/549837601.jpg
After that I widened it further to include the 1st week of August:
http://www7.0zz0.com/2010/12/11/23/447153078.jpg
Then the whole month of August:
http://www7.0zz0.com/2010/12/12/00/120820248.jpg
Then, starting from January, I plotted power and temperature for the first six months without humidity (omitted only for scaling):
http://www7.0zz0.com/2010/12/12/00/908911392.jpg
with humidity:
http://www7.0zz0.com/2010/12/12/00/102651717.jpg
Then the whole-year plot without humidity (P, T and H keep their actual values, but I plot H separately only for scaling, since the H values are much larger than P and T and would shrink the plot, making the P and T curves very small):
http://www7.0zz0.com/2010/12/11/23/290259320.jpg
and finally with humidity:
http://www7.0zz0.com/2010/12/11/23/842530863.jpg
The reason I plotted these figures is to follow the behavior of all parameters: how P changes with respect to temperature, humidity, the hour of the day, and the day number.
It is clear that these figures show cyclic behavior, but the cycle is not constant: it increases and then decreases over the course of the year.
For example, the behavior on 1st January is almost the same as on any other day in the year; the difference is only a shift up or down, left or right.
Also, Temperature and Humidity are almost sinusoidal. However, Power consumption behavior is not purely sinusoidal as seen in the following figure:
http://www7.0zz0.com/2010/12/12/00/153503144.jpg
I am not an expert in statistics or numerical methods, and this part of the problem no longer has much to do with electrical engineering concepts.
The result I am aiming for:
The user specifies the day number in the year (from 1 to 366), the hour of that day, and the temperature and humidity for that hour.
The mathematical model should then be able to find the power consumption for that specific hour of that day.
The power found from the model will be compared to the measured power from the data; if the values are very close to each other, the model is considered accurate and accepted.
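Just to make the intended usage concrete (not a proposal for the final model), here is a sketch in R; the data frame hourly, its column names, and the harmonic-regression form of f are all assumptions for illustration:

# Fit some f(T, H, hour, day) on the hourly records, then query it for one user-specified hour
# (temp = T, humid = H; renamed to avoid clashing with R's built-in T)
fit <- lm(P ~ temp + humid
            + sin(2 * pi * hour / 24) + cos(2 * pi * hour / 24)
            + sin(2 * pi * day / 366) + cos(2 * pi * day / 366),
          data = hourly)

newpoint <- data.frame(temp = 32, humid = 60, hour = 15, day = 214)  # illustrative inputs
predict(fit, newdata = newpoint)  # modelled power, to be compared with the measured value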
I am sorry for this long question. I have actually read many papers and guides, but I could not find the correct approach to a single unified model that follows the behavior of the curves from the start to the end of the year, and having more than one independent variable has confused me a lot.
I hope this problem is not difficult for statistics and mathematics experts.
Any help will be highly appreciated,
Thanks in advance
Regards
About this:
"Also, Temperature and Humidity are almost sinusoidal. However, Power consumption behavior is not purely sinusoidal"
It seems that on a local scale (on the order of several days or weeks), temperature and humidity can be expressed as periodic trains of Gaussians.
Under that assumption, we can model power consumption as a superposition of the temperature and humidity Gaussian trains. Consider this OpenCalc spreadsheet chart, in which f1 and f2 are trains of Gaussians (here only 4 peaks, but you may calculate as many as you need for data fitting) and f3 is the superposition of these two trains, just (f1^2 + f2^2)^(1/2).
However, I don't know to what degree power consumption actually follows the combination of trains of Gaussians. You may want to invest some time in exploring this possibility.
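If it helps, a rough sketch of the idea in R (all parameters here are made up for illustration, not fitted to your data):

# A periodic train of Gaussian peaks, and the suggested combination sqrt(f1^2 + f2^2)
gauss_train <- function(t, period, width, amplitude) {
  centres <- seq(0, max(t) + period, by = period)        # peak centres at multiples of period
  rowSums(sapply(centres, function(c0) amplitude * exp(-(t - c0)^2 / (2 * width^2))))
}

t  <- seq(0, 24 * 7, by = 0.25)                                  # one week, in hours
f1 <- gauss_train(t, period = 24, width = 3, amplitude = 1.0)    # "temperature" train
f2 <- gauss_train(t, period = 24, width = 5, amplitude = 0.6)    # "humidity" train
f3 <- sqrt(f1^2 + f2^2)                                          # superposition of the two trains

plot(t, f3, type = "l", xlab = "hour", ylab = "modelled power (arbitrary units)")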
Good luck!
I want to rank a set of sellers. Each seller is described by parameters var1, var2, var3, ..., var20, and I want to score each of them.
Currently I calculate the score by assigning weights to these parameters (say 10% to var1, 20% to var2 and so on), and these weights are determined by my gut feeling.
My score equation looks like
score = w1*var1 + w2*var2 + ... + w20*var20
score = 0.1*var1 + 0.5*var2 + 0.05*var3 + ... + 0.0001*var20
My score equation could also look like
score = w1^2*var1 + w2*var2 + ... + w20^5*var20
where var1, var2, ..., var20 are normalized.
Which equation should I use?
What are the methods to scientifically determine what weights to assign?
I want to optimize these weights and revamp the scoring mechanism using a data-oriented approach to achieve a more relevant score.
Example:
I have the following features for sellers:
1] Order fulfillment rate [numeric]
2] Order cancel rate [numeric]
3] User rating [1-5] {1-2: worst, 3: average, 5: good} [categorical]
4] Time taken to confirm the order (the shorter the time, the better the seller) [numeric]
5] Price competitiveness
Are there better algorithms/approaches to solve this problem, i.e. to calculate the score? So far I have linearly added the various features; I want to know a better approach to building the ranking system.
How do I come up with the values for the weights?
Apart from the features above, a few more I can think of are the ratio of positive to negative reviews, the rate of damaged goods, etc. How would these fit into my score equation?
Unfortunately Stack Overflow doesn't have LaTeX, so a plain-text formula will have to do.
Also, as a disclaimer: I don't think this is a concise answer, but your question is quite broad. This has not been tested, but it is the approach I would most likely take given a similar problem.
As a possible direction to go, below is the multivariate Gaussian density. The idea would be that each parameter gets its own dimension and can therefore be weighted by importance:
f(x) = (2*pi)^(-n/2) * det(Sigma)^(-1/2) * exp(-(1/2) * (x - mu)' * Sigma^(-1) * (x - mu))
Example:
Sigma = [1,0,0; 0,2,0; 0,0,3]: for a vector [x1, x2, x3], x1 would have the greatest importance, since its variance is the smallest and deviations in that dimension are penalized the most.
The covariance matrix Sigma takes care of the scaling in each dimension. To achieve this, simply put the weights on the diagonal elements of an n x n diagonal matrix; you are not really concerned with the cross terms.
mu is a vector: the average of each category across all sellers' records in your data.
x is the vector of per-category means for a particular seller, x = (x1, x2, x3, ..., xn). It is continuously updated as more data are collected.
The parameters of the function, being based on the total dataset, should evolve as well. That way, biased voting, especially in the "feelings"-based categories, can be weeded out.
After that setup, the evaluation of the function f(x) can be played with to give the desired results. It is a probability density function, but its utility is not restricted to statistics.
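A rough R sketch of the density evaluation described above (three features instead of twenty, and all numbers are made up for illustration):

# Diagonal Sigma encodes per-feature importance; mu is the across-seller average
mu    <- c(0.95, 0.05, 4.2)      # e.g. mean fulfillment rate, cancel rate, rating
Sigma <- diag(c(1, 2, 3))        # smaller variance = greater importance for that feature

# Multivariate Gaussian density evaluated at a seller's feature vector x
score_seller <- function(x, mu, Sigma) {
  k <- length(mu)
  d <- x - mu
  as.numeric(exp(-0.5 * t(d) %*% solve(Sigma) %*% d) / sqrt((2 * pi)^k * det(Sigma)))
}

score_seller(c(0.97, 0.03, 4.5), mu, Sigma)   # higher density = closer to the "typical" seller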