I am quite new to R and have studied several posts and websites about time series and moving averages, but I simply cannot find a useful hint for averaging over a specific period of time.
My data is a table read via read.csv with a date-time in one column and several other columns of values. The time steps in the data are not constant: sometimes 5 minutes, sometimes 2 hours, e.g.
2014-01-25 14:50:00, 4, 8
2014-01-25 14:55:00, 3, 7
2014-01-25 15:00:00, 1, 4
2014-01-25 15:20:24, 12, 34
2014-01-25 17:19:00, 150, 225
2014-01-25 19:00:00, 300, 400
2014-01-25 21:00:00, NA, NA
2014-01-25 23:19:00, 312, 405
So I am looking for an averaging plot that:
1. calculates the data average over arbitrary intervals like 30 minutes, 1 hour, 1 day, etc. Lower (finer) steps should be aggregated and higher (coarser) steps disaggregated.
2. (removed, since it is trivial to get the value per hour from a time series D averaged over X hours as D/X.)
3. does not take data flagged as NA into account. The function should not interpolate/smooth through NA gaps, and a line plot should not connect the points across an NA gap.
I already tried
aggregate(list(value1 = data$value1, value2 = data$value2),
          list(time = cut(data$time, "1 hour")), sum)
but this does not fulfill needs 1 and 3, and it cannot disaggregate 2-hourly data steps.
Answering point 3: plot automatically skips NA values and breaks the line.
Try this example:
plot(c(1:5,NA,NA,6:10),t='l')
Now, if you want to 'smooth' or average over time intervals purely for graphical purposes, it's probably easiest to start by separating your data at each line with an NA and then doing a spline or other smoothing operation on each subsection separately.
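A minimal sketch of both ideas, reusing the aggregate/cut attempt from the question but with mean instead of sum (this assumes data$time is POSIXct and the value1/value2 column names from the question):

# Average (not sum) into 1-hour bins; a bin that contains an NA stays NA,
# because mean() defaults to na.rm = FALSE, so plot() breaks the line there.
hourly <- aggregate(
  list(value1 = data$value1, value2 = data$value2),
  by  = list(time = cut(data$time, "1 hour")),
  FUN = mean
)
hourly$time <- as.POSIXct(hourly$time)

# Reindex against a full hourly grid so hours with no observations at all
# (the coarser 2-hour steps) also show up as NA gaps instead of being skipped.
grid   <- data.frame(time = seq(min(hourly$time), max(hourly$time), by = "1 hour"))
hourly <- merge(grid, hourly, all.x = TRUE)

plot(hourly$time, hourly$value1, type = "l")

# Per-subsection smoothing as described above: split into runs of non-NA
# values and spline each run separately.
ok   <- !is.na(hourly$value1)
runs <- split(seq_along(ok)[ok], cumsum(!ok)[ok])
for (idx in runs) {
  if (length(idx) >= 4) {          # spline needs a few points per run
    s <- spline(as.numeric(hourly$time[idx]), hourly$value1[idx])
    lines(s$x, s$y, col = "red")
  }
}

Note that this leaves the coarser steps as gaps rather than truly disaggregating them; filling those bins would need interpolation (e.g. stats::approx per non-NA run), which then has to respect requirement 3.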
I can't figure out how to plot frequency against sequential durations in R.
I have several processes (X, Y, Z, ...) that consist of multiple steps (a, b, c, d, ...). Each process has a different sequence of steps, so X may consist of aabaacabcd, Y may consist of acdad, etc.
There are time stamps for the beginning and end of every step, which I used to calculate each step's duration.
The data frame looks something like this:
ProcessID Step Start End Seconds
X a 30.09.2022 14:08 30.09.2022 14:11 165
X d 30.09.2022 14:11 30.09.2022 14:24 756
Y a 29.09.2022 11:55 29.09.2022 13:16 4876
Y c 29.09.2022 13:16 29.09.2022 14:26 4199
Y d 29.09.2022 14:26 30.09.2022 17:17 96654
There are around 1000 processes in the data frame. Each process begins with step 'a', which may also appear again later in the sequence; 'd' occurs exactly once per sequence and closes the process.
I plotted the frequency of 'Seconds' for each step in a ridgeline plot, which looks like this:
[ridgeline plot]
This, however, treats 'Seconds' as a single data point rather than a duration, and it also doesn't take into account that steps b, c, d, etc. usually (but not always) follow step a.
I would like to plot frequency against a timeline that considers the frequency of steps a/b/c/d at a certain time point. For example, at the 20-second mark on the timeline, 32 processes are in step a, 49 processes are in step c, etc.
I struggle with defining the durations: 'Seconds' only gives me one time point, not the sequence of steps, while 'Start' and 'End' do define sequence and duration but are not normalized to all start at 0, so I don't know how to compare them.
Can anyone help?
Thanks in advance
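One way to approach the norming problem (a hedged sketch with dplyr, assuming your data frame is called df and that Start/End have been parsed as POSIXct, e.g. with as.POSIXct(..., format = "%d.%m.%Y %H:%M")):

library(dplyr)

# Rebase every process so that its first Start is second 0 of its own timeline
df <- df %>%
  group_by(ProcessID) %>%
  mutate(
    t0        = min(Start),
    start_sec = as.numeric(difftime(Start, t0, units = "secs")),
    end_sec   = as.numeric(difftime(End,   t0, units = "secs"))
  ) %>%
  ungroup()

# Count how many processes are in each step at a given point on that timeline,
# e.g. the 20-second mark from the question
t <- 20
df %>%
  filter(start_sec <= t, end_sec > t) %>%
  count(Step)

Evaluating that count over a vector of time points gives the frequency-versus-timeline data you describe.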
I have a year's worth of data with quarterly spikes, like below:
Sample code in R to create the dataframe:
x <- data.frame("Month" = c(1:12), "Count" = c(110,220,2500,150,180,1800,300,550,5000,205,313,4218))
Here is how the data looks:
Month Count
1 110
2 220
3 2500
4 150
5 180
6 1800
7 300
8 550
9 5000
10 205
11 313
12 4218
We can see that the last month of every quarter has a spike. My target is to forecast the next year based on this data. I tried linear regression with some feature engineering (like how far a month is from the end of its quarter), and the results were, unsurprisingly, not satisfactory, since there doesn't appear to be a linear dependency.
I tried other techniques like seasonal naive and STLF (in R), and I am currently going through a few interpolation techniques (like Lagrange or Newton interpolation); there appears to be a lot of material to study. Can anyone suggest a good possible solution that I can explore further?
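One avenue to explore (a hedged sketch with the forecast package, reusing the data frame x from above): treat the series as having a seasonal period of 3, so the recurring third-month spike is modelled as within-quarter seasonality rather than as a linear feature.

library(forecast)

y   <- ts(x$Count, frequency = 3)  # period 3 = position of the month within its quarter
fit <- auto.arima(y)               # alternatives to compare: ets(y), stlf(y)
plot(forecast(fit, h = 12))        # forecast the next 12 months

With only 12 observations any fit is rough, but the frequency = 3 framing at least lets seasonal models see the quarterly pattern directly.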
I am working on a research paper on graph manipulation and I have the following data:
returns 1+returns cum_return price period_ret(step=25)
1 7.804919e-03 1.0078049 0.007804919 100.78355 NA
2 3.560800e-03 1.0035608 0.011393511 101.14306 NA
3 -1.490719e-03 0.9985093 0.009885807 100.99239 NA
. -2.943304e-03 0.9970567 0.006913406 100.69558 NA
. 1.153007e-03 1.0011530 0.008074385 100.81175 NA
. -2.823012e-03 0.9971770 0.005228578 100.52756 NA
25 -7.110762e-03 0.9928892 -0.001919363 99.81526 -0.02364
. -1.807268e-02 0.9819273 -0.019957356 98.02754 NA
. -3.300315e-03 0.9966997 -0.023191805 97.70455 NA
250 5.846750e-03 1.0058467 -0.017480652 98.27748 0.12125
These are 250 daily stock returns, the cumulative return, the price, and the 25-day period returns (the returns between days 0-25, 25-50, ..., 225-250).
What I want to do is the following:
I want to rearrange the returns, but the period returns should be identical, although their order can change. So there are 10! possible arrangements of the subsets.
What I did so far: I wrote code using the sample, repeat and identical functions; here is a shortened version:
library(tibble)
repeat {
  temp <- tibble(
    returns = sample(x$returns, 250, replace = FALSE)  # permute; replace = TRUE resamples
  )
  # (shortened: temp$period_ret is recomputed from temp$returns here)
  if (identical(sort(round(x$period_ret[!is.na(x$period_ret)], 2)),
                sort(round(temp$period_ret[!is.na(temp$period_ret)], 2)))) break
}
This took me quite some time and unfortunately it isn't of any real use. Only later did I start thinking about the math: there are 250! possible samples, so I would spend days waiting for any result.
What do I need this for?
I would like to create graphs with different orderings of the returns, so that all the graphs have the same summary statistics but look different. It's important that they have the same period returns (regardless of their order) to satisfy a utility formula.
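Rather than rejection-sampling whole permutations, the block structure can be exploited directly (a hedged sketch, assuming x$returns holds the 250 daily returns in day order): shuffling returns within a 25-day block leaves that block's product, and hence its period return, unchanged, and shuffling the order of the 10 blocks merely permutes the period returns, which is allowed.

blocks  <- split(x$returns, rep(1:10, each = 25))          # ten 25-day blocks
blocks  <- lapply(blocks, sample)                          # permute within each block
new_ret <- unlist(blocks[sample(10)], use.names = FALSE)   # shuffle the block order

# Rebuild the derived columns from the rearranged series
cum_ret <- cumprod(1 + new_ret) - 1
price   <- 100 * (1 + cum_ret)   # the table starts from a base price of 100

Every draw satisfies the period-return constraint by construction, so no repeat/identical loop is needed.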
I have a dataset that looks somewhat like this (the actual dataset is ~150000 lines with additional columns of fluff information such as company name, etc.):
Date return1 return2 rank
01/31/2008 0.05434 0.23413 3
01/31/2008 0.03423 0.43423 4
01/31/2008 0.65277 0.23423 1
01/31/2008 0.02342 0.47234 4
02/31/2008 0.01463 0.01231 4
02/31/2008 0.13456 0.52552 2
02/31/2008 0.34534 0.36663 1
02/31/2008 0.00324 0.56463 3
...
12/31/2015 0.21234 0.02333 2
12/31/2015 0.07245 0.87234 1
12/31/2015 0.47282 0.12998 1
12/31/2015 0.99022 0.03445 2
Basically I need to calculate the date-specific correlation between return1 and rank (so the correlation on 01/31/2008, 02/31/2008, and so on). I know I can split the data using the split() function, but I am unsure how to get the date-specific correlation from there. The real data has about 260 entries per date and around 68 dates, so manually subsetting the original table and performing the calculations is time consuming and, more importantly, more susceptible to error.
My ultimate goal is to create a time series of the correlations on different dates.
Thank you in advance!
I had this same problem earlier, except I wasn't calculating correlation. What I would do is
library(dplyr)
a %>% group_by(Date) %>% summarise(Correlation = cor(return1, rank))
And this will provide, for each date, a correlation value between return1 and rank. Don't forget that you can specify what kind of correlation you would like (e.g. Spearman).
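For example, the Spearman version of the same call (assuming the same data frame a) is:
a %>% group_by(Date) %>% summarise(Correlation = cor(return1, rank, method = "spearman"))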
I am trying to generate appointment times for yearly scheduled visits. The available days are days = 1:365, and the first appointment should be chosen at random: first = sample(days, 1, replace = F).
Now, given the first appointment, I want to generate 3 more appointments within 1:365, so that there are exactly 4 appointments in the 1:365 space, as equally spaced as possible.
I have tried
point <- sort(c(first - 1:5 * 364/4, first + 1:5 * 364/4))
point <- point[point > 0 & point < 365]
but it does not always give me 4 appointments. I eventually ran this many times and kept only the samples with 4 appointments, but I wanted to ask if there is a more elegant way to get exactly 4 points, as equally spaced as possible.
I was thinking of equal spacing (around 91 days between appointments) in a year starting at the first appointment... Essentially one appointment per quarter of the year.
# Find how many days are in a quarter of the year
quarter <- floor(365/4)   # 91
days    <- 1:365          # as defined in the question
first   <- sample(days, 1)
appts   <- c(first, first + (1:3) * quarter)    # one appointment per quarter
appts[appts > 365] <- appts[appts > 365] - 365  # wrap past year end back into 1:365
sort(appts)
Is this what you're looking for?
set.seed(1) # for a reproducible example ONLY - take this out
first  <- sample(1:365, 1)
points <- first + (0:3) * (365 - first) / 4  # 4 points from `first` toward year end
points
# [1] 97 164 231 298
Another way uses
points <- first + (0:3) * (365 - first) / 3
This creates 4 points equally spaced on [first, 365], but the last point will always be 365.
The reason your code gives unexpected results is that you use first - 1:5 * 364/4. This creates points prior to first, some of which can be < 0; those are then excluded by point[point > 0 ...], which is why you sometimes end up with fewer than 4 points.
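If you need exactly four dates no matter where first falls, one option (a sketch treating the year as circular) is to step around the year with modular arithmetic, which can never drop a point:

first <- sample(1:365, 1)
appts <- sort((first - 1 + round((0:3) * 365/4)) %% 365 + 1)  # wraps past day 365
appts

All four appointments stay in 1:365 and remain roughly 91 days apart on the circular year.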