I am trying to identify the sequential recordings in a time series and aggregate the data for these sequences.
Example Data
Here is an example of the data, recorded at a maximum frequency of one observation per second:
timestamp Value
06:07:23 0.439
06:07:24 0.556
06:07:25 0.430
06:07:26 0.418
06:07:27 0.407
06:07:47 0.439
06:07:48 0.420
06:07:49 0.405
09:55:21 0.507
09:55:22 0.439
10:03:24 0.439
10:03:25 0.439
10:03:36 1.708
10:03:37 0.608
10:03:38 0.439
10:03:46 0.484
10:03:47 0.380
10:03:48 0.607
10:03:49 0.439
10:03:50 0.439
10:03:51 0.439
10:03:52 0.430
10:03:53 0.439
10:03:54 4.924
10:03:55 1.012
10:03:56 0.887
10:03:57 0.439
10:03:58 0.439
10:04:18 0.447
10:04:19 0.447
As can be seen, there are periods during which a value is recorded every second. I am trying to find a way to aggregate each run of observations with no gaps between them, to end up with something like the following:
timestamp max duration
06:07:23 0.556 5
06:07:47 0.439 3
09:55:21 0.507 2
10:03:24 0.439 2
10:03:36 1.708 3
10:03:46     4.924  13
10:04:18 0.447 2
I am struggling to find a way of grouping the data by these sequential runs. The closest answer I have been able to find is this one; however, those answers were provided over three and a half years ago, and I was struggling to get the data.table method working.
Any ideas much appreciated!
Here is an attempt in data.table:
dat[,
    .(timestamp = timestamp[1], max = max(Value), duration = .N),
    # start a new run whenever the gap to the previous reading exceeds 1 second
    by = cumsum(c(FALSE, diff(as.numeric(as.POSIXct(timestamp, format = "%H:%M:%S", tz = "UTC"))) > 1))
]
# cumsum timestamp max duration
#1: 0 06:07:23 0.556 5
#2: 1 06:07:47 0.439 3
#3: 2 09:55:21 0.507 2
#4: 3 10:03:24 0.439 2
#5: 4 10:03:36 1.708 3
#6: 5 10:03:46 4.924 13
#7: 6 10:04:18 0.447 2
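For comparison, here is a dplyr sketch of the same grouping. The few rows of sample data below are taken from the question so the example is self-contained; the column names `timestamp` and `Value` are as given above:

```r
library(dplyr)

# A few rows of the question's data, for a self-contained example
dat <- data.frame(
  timestamp = c("06:07:23", "06:07:24", "06:07:25", "06:07:26", "06:07:27",
                "06:07:47", "06:07:48", "06:07:49"),
  Value = c(0.439, 0.556, 0.430, 0.418, 0.407, 0.439, 0.420, 0.405)
)

runs <- dat %>%
  mutate(
    secs = as.numeric(as.POSIXct(timestamp, format = "%H:%M:%S", tz = "UTC")),
    # a new run starts whenever the gap to the previous reading exceeds 1 second
    run = cumsum(c(FALSE, diff(secs) > 1))
  ) %>%
  group_by(run) %>%
  summarise(timestamp = first(timestamp), max = max(Value), duration = n())
```

This gives one row per run, with the run's first timestamp, its maximum `Value`, and its length in seconds.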
Here I have a snippet of my dataset. The rows indicate different days of the year.
The Substations represent individuals; there are over 500 individuals.
The 10 minute time periods run all the way through 24 hours.
I need to find an average value for each 10-minute interval for each individual in this dataset. This should result in a single row for each individual substation, with the respective average value for each time interval.
I have tried:
meanbygroup <- stationgroup %>%
  group_by(Substation) %>%
  summarise(means = colMeans(tenminintervals[sapply(tenminintervals, is.numeric)]))
But this averages the entire column and I am left with the same average values for each individual substation.
So for each individual substation, I need an average for each individual time interval.
Please help!
Try using summarize(across()), like this:
df %>%
  group_by(Substation) %>%
  summarize(across(everything(), ~ mean(.x, na.rm = TRUE)))
Output:
Substation `00:00` `00:10` `00:20`
<chr> <dbl> <dbl> <dbl>
1 A -0.233 0.110 -0.106
2 B 0.203 -0.0997 -0.128
3 C -0.0733 0.196 -0.0205
4 D 0.0905 -0.0449 -0.0529
5 E 0.401 0.152 -0.0957
6 F 0.0368 0.120 -0.0787
7 G 0.0323 -0.0792 -0.278
8 H 0.132 -0.0766 0.157
9 I -0.0693 0.0578 0.0732
10 J 0.0776 -0.176 -0.0192
# … with 16 more rows
Input:
set.seed(123)
df <- bind_cols(
  tibble(Substation = sample(LETTERS, size = 1000, replace = TRUE)),
  as_tibble(setNames(lapply(1:3, function(x) rnorm(1000)), c("00:00", "00:10", "00:20")))
) %>% arrange(Substation)
# A tibble: 1,000 × 4
Substation `00:00` `00:10` `00:20`
<chr> <dbl> <dbl> <dbl>
1 A 0.121 -1.94 0.137
2 A -0.322 1.05 0.416
3 A -0.158 -1.40 0.192
4 A -1.85 1.69 -0.0922
5 A -1.16 -0.455 0.754
6 A 1.95 1.06 0.732
7 A -0.132 0.655 -1.84
8 A 1.08 -0.329 -0.130
9 A -1.21 2.82 -0.0571
10 A -1.04 0.237 -0.328
# … with 990 more rows
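If the real dataset contains additional non-numeric columns besides `Substation`, a variant of the answer restricts the mean to numeric columns with the `where(is.numeric)` selector. A small sketch with hypothetical data (the `Note` column is an invented stand-in for any non-numeric column):

```r
library(dplyr)

# Hypothetical data: one non-numeric column alongside the interval columns
df <- tibble::tibble(
  Substation = c("A", "A", "B"),
  Note = c("x", "y", "z"),   # non-numeric: excluded from the means
  `00:00` = c(1, 3, 5),
  `00:10` = c(2, 4, 6)
)

res <- df %>%
  group_by(Substation) %>%
  summarize(across(where(is.numeric), ~ mean(.x, na.rm = TRUE)))
```

Only the grouping column and the numeric summaries survive in the result.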
I output a series of acf results and want to extract just the lag-1 autocorrelation coefficient. Can anyone give a quick pointer? Thank you.
#A snippet of a series of acf() results
$`25`
Autocorrelations of series ‘x’, by lag
0 1 2 3 4 5 6 7 8 9 10 11 12
1.000 0.366 -0.347 -0.399 -0.074 0.230 0.050 -0.250 -0.213 -0.106 0.059 0.154 0.031
$`26`
Autocorrelations of series ‘x’, by lag
0 1 2 3 4 5 6 7 8 9 10 11
1.000 0.060 0.026 -0.163 -0.233 -0.191 -0.377 0.214 0.037 0.178 -0.016 0.049
$`27`
Autocorrelations of series ‘x’, by lag
0 1 2 3 4 5 6 7 8 9 10 11 12
1.000 -0.025 -0.136 0.569 -0.227 -0.264 0.218 -0.262 -0.411 0.123 -0.039 -0.192 0.130
# For this example, the extracted values would be 0.366, 0.060, and -0.025; they can be
# in either a list or a matrix
EDIT
#`acf` in base R was used
p <- acf.each()
# sapply was tried, but it resulted in this:
sapply(acf.each(), `[`, "1")
1 2 3
acf 0.7398 0.1746 0.4278
type "correlation" "correlation" "correlation"
n.used 24 17 14
lag 1 1 1
series "x" "x" "x"
snames NULL NULL NULL
The structure seems to be a list, so we can use sapply to do the extraction. Note that x$acf[1] is the lag-0 autocorrelation (always 1), so the lag-1 value sits at index 2:
sapply(lst1, function(x) x$acf[2])
data
lst1 <- list(acf(ldeaths), acf(ldeaths))
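A fully reproducible version of the same idea, standing in for the list produced by `acf.each()` (`plot = FALSE` keeps `acf` from drawing; the list names `25` and `26` mirror the question's output):

```r
# Two acf objects as a stand-in for the acf.each() result
lst1 <- list(`25` = acf(ldeaths, plot = FALSE),
             `26` = acf(ldeaths, plot = FALSE))

# $acf[1] is the lag-0 autocorrelation (always 1); lag 1 is at index 2
lag1 <- sapply(lst1, function(x) x$acf[2])
```

`lag1` is a named numeric vector, one lag-1 coefficient per series.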
I am performing correlations with the following data:
datacor
A tibble: 213 x 3
Prop_coord Prop_assoc PPT
<dbl> <dbl> <dbl>
1 0.474 0.211 92
2 0.343 0.343 85
3 0.385 0.308 83
4 0.714 0 92
5 0.432 0.273 73
6 0.481 0.148 92
7 0.455 0.273 96
8 0.605 0.184 88
9 0.412 0.235 98
10 0.5 0.318 94
# … with 203 more rows
cor.test works well, but when I try to compare correlations, it shows this error:
> cocor(~ Prop_coord+PPT | Prop_assoc+PPT, datacor)
Error in cocor(~Prop_coord + PPT | Prop_assoc + PPT, datacor) :
The variable 'PPT' must be numeric
What should I do?
Just to keep a record here: someone elsewhere helped me with this. The problem was that cocor seems not to work with tibbles, so when I read my data in as a data.frame, it worked perfectly.
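A minimal sketch of that fix, assuming the `cocor` package and the `datacor` tibble from the question: coerce the tibble to a base data.frame before passing it in.

```r
library(cocor)

# cocor reportedly rejects tibble columns as non-numeric,
# so coerce to a plain base data.frame first
datacor_df <- as.data.frame(datacor)

cocor(~ Prop_coord + PPT | Prop_assoc + PPT, datacor_df)
```

The conversion is lossless for numeric columns; only the class of the container changes.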
I am trying to modify some code that I have, which works, so that it instead uses a different function for estimating the model. The original code below works with the arima() function:
S <- round(0.75 * length(ts_HHFCE_log))
h <- 1
error1.h <- c()
for (i in S:(length(ts_HHFCE_log) - h)) {
  mymodel.sub <- arima(ts_HHFCE_log[1:i], order = c(0, 1, 3), seasonal = c(0, 0, 0))
  predict.h <- predict(mymodel.sub, n.ahead = h)$pred[h]
  error1.h <- c(error1.h, ts_HHFCE_log[i + h] - predict.h)
}
The intuition is the following: your time series has length T. You start somewhere near the beginning of the sample, far enough in to have enough observations to regress on and obtain parameter estimates for your alpha and betas; call this point t for simplicity. Based on the data up to t, you produce a one-step-ahead forecast for period (t+1). Your forecast error is then the difference between the actual value at (t+1) and the forecast built from data available up to t.

You then iterate: regress on the data up to (t+1), forecast (t+2), and record the forecast error for (t+2). You keep repeating this process until you reach (T-1) and produce a forecast for T. This yields what is known as a dynamic out-of-sample forecast error series. You compute it for different models and then use a statistical test to ascertain which model is more appropriate. It is a way to produce out-of-sample forecasts using only the data you already have.
I have modified the code to be the following:
S <- round(0.75 * length(ts.GDP))
h <- 1
error1.h <- c()
for (i in S:(length(ts.GDP) - h)) {
  mymodel.sub <- lm(ts.GDP[4:i] ~ ts.GDP[3:(i-1)] + ts.GDP[2:(i-2)] + ts.GDP[1:(i-3)])
  predict.h <- predict(mymodel.sub, n.ahead = h)$pred[h]
  error1.h <- c(error1.h, ts.GDP[i + h] - predict.h)
}
I'm trying to do an AR(3) model. The reason I am not using the ARIMA function is because I also then want to compare these forecast errors with an ARDL model, and to my knowledge there is no simple function for the ARDL model (I'd have to use lm(), hence why I want to do the AR(3) model using the lm() function).
The model I wish to compare the AR(3) model against is the following:
model_ts.GDP_1 <- lm(ts.GDP[4:123] ~ ts.GDP[3:122] + ts.GDP[2:121] + ts.GDP[1:120] + ts.CCI_AGG[3:122] + ts.CCI_AGG[2:121] + ts.CCI_AGG[1:120])
I am unsure how to modify the code further to get what I am after. Hopefully the intuition I explained above makes clear what I am trying to do.
The GDP data is the quarterly growth rate, which is stationary. The other variable in the second model is an index I've constructed using a dynamic PCA, first-differenced so it too is stationary. In any case, in the second model the forecast at t is based only on lagged values of GDP and of the index I constructed. Equally, since I am simulating out-of-sample forecasts using data I already have, there is no issue with actually producing a proper forecast. (In time series, this technique is seen as a more robust way to compare models than simply using measures such as RMSE.)
Thanks!
The data I am using:
Date GDP_qoq CCI_A_qoq
31/03/1988 2.956 0.540
30/06/1988 2.126 -0.743
30/09/1988 3.442 0.977
31/12/1988 3.375 -0.677
31/03/1989 2.101 0.535
30/06/1989 1.787 -0.667
30/09/1989 2.791 0.343
31/12/1989 2.233 -0.334
31/03/1990 1.961 0.520
30/06/1990 2.758 -0.763
30/09/1990 1.879 0.438
31/12/1990 0.287 -0.708
31/03/1991 1.796 -0.078
30/06/1991 1.193 -0.735
30/09/1991 0.908 0.896
31/12/1991 1.446 0.163
31/03/1992 0.870 0.361
30/06/1992 0.215 -0.587
30/09/1992 0.262 0.238
31/12/1992 1.646 -1.436
31/03/1993 2.375 0.646
30/06/1993 0.249 -0.218
30/09/1993 1.806 0.676
31/12/1993 1.218 -0.393
31/03/1994 1.501 0.346
30/06/1994 0.879 -0.501
30/09/1994 1.123 0.731
31/12/1994 2.089 0.062
31/03/1995 0.386 0.475
30/06/1995 1.238 -0.243
30/09/1995 1.836 0.263
31/12/1995 1.236 -0.125
31/03/1996 1.926 -0.228
30/06/1996 2.109 -0.013
30/09/1996 1.312 0.196
31/12/1996 0.972 -0.015
31/03/1997 1.028 -0.001
30/06/1997 1.086 -0.016
30/09/1997 2.822 0.156
31/12/1997 -0.818 -0.062
31/03/1998 1.418 0.408
30/06/1998 0.970 -0.548
30/09/1998 0.968 0.466
31/12/1998 2.826 -0.460
31/03/1999 0.599 0.228
30/06/1999 -0.651 -0.361
30/09/1999 1.289 0.579
31/12/1999 1.600 0.196
31/03/2000 2.324 0.535
30/06/2000 1.368 -0.499
30/09/2000 0.825 0.440
31/12/2000 0.378 -0.414
31/03/2001 0.868 0.478
30/06/2001 1.801 -0.521
30/09/2001 0.319 0.068
31/12/2001 0.877 0.045
31/03/2002 1.253 0.061
30/06/2002 1.247 -0.013
30/09/2002 1.513 0.625
31/12/2002 1.756 0.125
31/03/2003 1.443 -0.088
30/06/2003 0.874 -0.138
30/09/2003 1.524 0.122
31/12/2003 1.831 -0.075
31/03/2004 0.780 0.395
30/06/2004 1.665 -0.263
30/09/2004 0.390 0.543
31/12/2004 0.886 -0.348
31/03/2005 1.372 0.500
30/06/2005 2.574 -0.066
30/09/2005 0.961 0.058
31/12/2005 2.378 -0.061
31/03/2006 1.015 0.212
30/06/2006 1.008 -0.218
30/09/2006 1.105 0.593
31/12/2006 0.943 -0.144
31/03/2007 1.566 0.111
30/06/2007 1.003 -0.125
30/09/2007 1.810 0.268
31/12/2007 1.275 -0.592
31/03/2008 1.413 0.017
30/06/2008 -0.491 -0.891
30/09/2008 -0.617 -0.836
31/12/2008 -1.410 -1.092
31/03/2009 -1.593 0.182
30/06/2009 -0.106 -0.922
30/09/2009 0.788 0.351
31/12/2009 0.247 0.414
31/03/2010 1.221 -0.329
30/06/2010 1.561 -0.322
30/09/2010 0.163 0.376
31/12/2010 0.825 -0.104
31/03/2011 2.484 0.063
30/06/2011 -0.574 -0.107
30/09/2011 0.361 -0.006
31/12/2011 0.997 -0.304
31/03/2012 0.760 0.243
30/06/2012 0.143 -0.381
30/09/2012 2.547 0.315
31/12/2012 0.308 -0.046
31/03/2013 0.679 0.221
30/06/2013 0.766 -0.170
30/09/2013 1.843 0.352
31/12/2013 0.756 0.080
31/03/2014 1.380 -0.080
30/06/2014 1.501 0.162
30/09/2014 0.876 0.017
31/12/2014 0.055 -0.251
31/03/2015 0.497 0.442
30/06/2015 1.698 -0.278
30/09/2015 0.066 0.397
31/12/2015 0.470 0.076
31/03/2016 1.581 0.247
30/06/2016 0.859 -0.342
30/09/2016 0.865 -0.011
31/12/2016 1.467 0.049
31/03/2017 1.006 0.087
30/06/2017 0.437 -0.215
30/09/2017 0.527 0.098
31/12/2017 0.900 0.218
The only thing you need to understand is how to get predictions using lm; it's not necessary to add the other details (without reproducible data you're only making it more difficult).
Create dummy data:
set.seed(123)
df <- data.frame(a = runif(10), b = runif(10), c = runif(10))
> print(df)
a b c
1 0.2875775 0.95683335 0.8895393
2 0.7883051 0.45333416 0.6928034
3 0.4089769 0.67757064 0.6405068
4 0.8830174 0.57263340 0.9942698
5 0.9404673 0.10292468 0.6557058
6 0.0455565 0.89982497 0.7085305
7 0.5281055 0.24608773 0.5440660
8 0.8924190 0.04205953 0.5941420
9 0.5514350 0.32792072 0.2891597
10 0.4566147 0.95450365 0.1471136
Fit your model:
model <- lm(c ~ a + b, data = df)
Create new data:
new_df <- data.frame(a = runif(1), b = runif(1))
> print(new_df)
a b
1 0.9630242 0.902299
Get predictions from your new data:
prediction <- predict(model, new_df)
> print(prediction)
1
0.8270997
In your case, the new data new_df will be your lagged data, but you have to make the appropriate changes, or provide reproducible data as above if you want us to go through the details of your problem.
Hope this helps.
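Putting the two pieces together, here is a minimal sketch of the rolling one-step-ahead AR(3) exercise with lm(): build a lag matrix with embed(), refit on an expanding window, and predict the next value via newdata. The simulated series `y` and the column names `l1`..`l3` are illustrative assumptions, not the question's data:

```r
set.seed(1)
# Illustrative stationary series standing in for ts.GDP
y <- as.numeric(arima.sim(list(ar = c(0.5, -0.2, 0.1)), n = 120))

# embed() lines up y_t with its three lags: columns are y_t, y_{t-1}, y_{t-2}, y_{t-3}
lagmat <- as.data.frame(embed(y, 4))
names(lagmat) <- c("y", "l1", "l2", "l3")

S <- round(0.75 * nrow(lagmat))
error1.h <- c()
for (i in S:(nrow(lagmat) - 1)) {
  fit <- lm(y ~ l1 + l2 + l3, data = lagmat[1:i, ])                   # expanding window
  pred <- predict(fit, newdata = lagmat[i + 1, c("l1", "l2", "l3")])  # one-step forecast
  error1.h <- c(error1.h, lagmat$y[i + 1] - pred)
}
```

The ARDL comparison would follow the same pattern: add the lagged index columns to the lag matrix and to the lm() formula, and reuse the loop unchanged.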
I currently have a data frame in R where one column contains values from 1 to 20. I need to replace values 0-4 with A, 5-9 with B, 10-14 with C, and so on. What would be the best way to accomplish this?
This is what the first part of the data frame currently looks like:
sex length diameter height wholeWeight shuckedWeight visceraWeight shellWeight rings
1 2 0.455 0.365 0.095 0.5140 0.2245 0.1010 0.1500 15
2 2 0.350 0.265 0.090 0.2255 0.0995 0.0485 0.0700 7
3 1 0.530 0.420 0.135 0.6770 0.2565 0.1415 0.2100 9
4 2 0.440 0.365 0.125 0.5160 0.2155 0.1140 0.1550 10
5 0 0.330 0.255 0.080 0.2050 0.0895 0.0395 0.0550 7
6 0 0.425 0.300 0.095 0.3515 0.1410 0.0775 0.1200 8
7 1 0.530 0.415 0.150 0.7775 0.2370 0.1415 0.3300 20
8 1 0.545 0.425 0.125 0.7680 0.2940 0.1495 0.2600 16
9 2 0.475 0.370 0.125 0.5095 0.2165 0.1125 0.1650 9
10 1 0.550 0.440 0.150 0.8945 0.3145 0.1510 0.3200 19
11 1 0.525 0.380 0.140 0.6065 0.1940 0.1475 0.2100 14
12 2 0.430 0.350 0.110 0.4060 0.1675 0.0810 0.1350 10
I'm trying to replace the values in rings.
Maybe you can create a named vector to serve as a translation table:
tran_table <- rep(c("A", "B", "C", "D"), each = 5)
names(tran_table) <- 1:20

test_df <- data.frame(
  ring = sample(1:20, 20)
)
test_df$ring <- tran_table[test_df$ring]
And the result is:
> test_df
ring
1 A
2 D
3 B
4 A
5 A
6 C
7 B
8 D
9 B
10 A
11 B
12 D
13 C
14 A
15 C
16 C
17 B
18 D
19 C
20 D
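An alternative sketch using base R's cut(), which maps the stated intervals (0-4 → A, 5-9 → B, 10-14 → C, 15-20 → D) directly without a lookup vector; the sample `rings` values are illustrative:

```r
rings <- c(15, 7, 9, 20, 10)  # illustrative values from the rings column

# breaks give the integer bins [0,4], (4,9], (9,14], (14,20]
binned <- cut(rings,
              breaks = c(0, 4, 9, 14, 20),
              labels = c("A", "B", "C", "D"),
              include.lowest = TRUE)
```

In the question's data frame this would be `df$rings <- cut(df$rings, ...)`; note the breaks assume integer ring counts.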