Add rows for missing timestamp data in df - r

I have a df like this:
Line Sensor Day Time     Measurement
1    A      1   10:00:00 56
2    A      1   11:00:00 42
3    A      1   12:00:00 87
4    A      1   12:20:00 12
5    A      1   12:50:00 44
I would like to create some rows. Since measurements should be taken every 10 minutes, I would like to add a non-constant number of rows (i.e. six rows between lines 1 and 2, two rows between lines 3 and 4, and three rows between lines 4 and 5)
in order to get something similar to this:
Line Sensor Day Time     Measurement
1    A      1   10:00:00 56
2    A      1   10:10:00 54
3    A      1   10:20:00 35
4    A      1   10:30:00 11
5    A      1   10:40:00 45
6    A      1   10:50:00 56
7    A      1   11:00:00 90
...  ...    ... ...      ...
13   A      1   12:00:00 87
14   A      1   12:10:00 97
15   A      1   12:20:00 42
16   A      1   12:30:00 67
17   A      1   12:40:00 76
18   A      1   12:50:00 11
Any suggestion?
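One possible approach (a sketch, not an answer from the original thread): treat Time as seconds since midnight, then use tidyr::complete() with full_seq() to insert a row every 10 minutes within each Sensor/Day group. The new rows get NA for Measurement, which you could then fill or interpolate (e.g. with zoo::na.approx), depending on what the example values above are meant to represent. Column names are taken from the table above; the use of lubridate to parse the HH:MM:SS strings is an assumption.
library(dplyr)
library(tidyr)
library(lubridate)

# Example data, rebuilt from the table above
df <- data.frame(
  Line        = 1:5,
  Sensor      = "A",
  Day         = 1,
  Time        = c("10:00:00", "11:00:00", "12:00:00", "12:20:00", "12:50:00"),
  Measurement = c(56, 42, 87, 12, 44)
)

# Assumes the observed times already fall on the 10-minute grid
filled <- df %>%
  mutate(secs = period_to_seconds(hms(Time))) %>%        # "10:00:00" -> 36000 seconds
  group_by(Sensor, Day) %>%
  complete(secs = full_seq(secs, period = 600)) %>%      # one row every 600 s = 10 min
  ungroup() %>%
  mutate(Time = format(as_datetime(secs), "%H:%M:%S"),   # back to "HH:MM:SS"
         Line = row_number()) %>%                        # renumber the Line column
  select(Line, Sensor, Day, Time, Measurement)           # new rows have NA Measurement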

Related

Splitting a dateTime vector if time is greater than x between vector components

I have the following data:
df <- data.frame(index = 1:85,
                 times = c(seq(as.POSIXct("2020-10-03 21:31:00 UTC"),
                               as.POSIXct("2020-10-03 22:25:00 UTC"),
                               "min"),
                           seq(as.POSIXct("2020-11-03 10:10:00 UTC"),
                               as.POSIXct("2020-11-03 10:39:00 UTC"),
                               "min")
                           ))
If we look at rows 55 and 56, there is a clear divide in times:
> df[55:56, ]
index times
55 55 2020-10-03 22:25:00
56 56 2020-11-03 10:10:00
I would like to add a third, categorical column split based on these splits,
e.g. df$split[55] = A and df$split[56] = B.
The logic: if the time gap between rows is greater than 5 minutes, start a new category for the subsequent rows, until the next instance where the time gap is > 5 minutes.
Thanks
You could use
library(dplyr)
df %>%
  mutate(cat = 1 + cumsum(c(0, diff(times)) > 5))
which returns
index times cat
1 1 2020-10-03 21:31:00 1
2 2 2020-10-03 21:32:00 1
3 3 2020-10-03 21:33:00 1
4 4 2020-10-03 21:34:00 1
5 5 2020-10-03 21:35:00 1
6 6 2020-10-03 21:36:00 1
7 7 2020-10-03 21:37:00 1
8 8 2020-10-03 21:38:00 1
...
53 53 2020-10-03 22:23:00 1
54 54 2020-10-03 22:24:00 1
55 55 2020-10-03 22:25:00 1
56 56 2020-11-03 10:10:00 2
57 57 2020-11-03 10:11:00 2
58 58 2020-11-03 10:12:00 2
59 59 2020-11-03 10:13:00 2
If you need letters or something else, you could for example use
df %>%
  mutate(cat = LETTERS[1 + cumsum(c(0, diff(times)) > 5)])
to convert the categories 1 and 2 into A and B.
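To see why the cumulative sum produces the group ids, here is the trick in isolation (with made-up gap flags, not the real data): every gap above the threshold flips the logical to TRUE and bumps the running total by one.
gaps_over_5 <- c(FALSE, FALSE, TRUE, FALSE, TRUE)  # hypothetical "gap > 5" flags
1 + cumsum(gaps_over_5)
#> [1] 1 1 2 2 3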

Why the output of difftime does not start from zero for each group

I have a dataset with user_ids, datetimes, and an index (which shows the activity number for each user_id). I need to find the time difference between consecutive rows for each activity. The new column (walk_mins) should therefore start with NA for each unique activity and contain the time-difference values in all the remaining rows for that index (activity). However, my code does not seem to respect the group_by(index). Here are my code and its output.
P.S.: Based on the replies to my last Stack Overflow post I used dput() in R and copied and pasted my data here. Please let me know if I should provide my data in another way.
I want to compute the difference between consecutive time values, but I have to group them.
sample_DF$walk_mins <- as.numeric("")
sample_DF <- sample_DF %>%
  group_by(index.y) %>%
  mutate(walk_mins = as.numeric(difftime(DATETIME2, lag(DATETIME2), units = "mins")))
user_id DATETIME2 index.y walk_mins
1 41 2019-06-02 20:44:00 1 NA
2 41 2019-06-03 16:46:00 2 1202
3 41 2019-06-03 16:50:00 2 4
4 41 2019-06-03 20:43:00 3 233
5 41 2019-06-03 20:44:00 3 1
6 41 2019-06-03 21:00:00 4 16
7 41 2019-06-04 13:28:00 5 988
8 41 2019-06-04 13:29:00 5 1
9 41 2019-06-04 13:30:00 5 1
10 41 2019-06-04 13:31:00 5 1
11 41 2019-06-04 13:32:00 5 1
12 41 2019-06-04 13:34:00 5 2
13 41 2019-06-04 13:35:00 5 1
14 41 2019-06-04 13:36:00 5 1
15 41 2019-06-04 17:31:00 6 235
16 41 2019-06-04 18:46:00 7 75
17 41 2019-06-04 19:13:00 8 27
18 41 2019-06-04 19:37:00 9 24
19 41 2019-06-04 19:55:00 10 18
20 41 2019-06-04 20:13:00 11 18
If we need the difftime to start from 0, change the default in lag to first(DATETIME2). By default, it is NA. Also, based on the output shown, it seems that plyr::mutate has masked dplyr::mutate.
library(dplyr)
sample_DF <- sample_DF %>%
  group_by(index.y) %>%
  dplyr::mutate(walk_mins = as.numeric(difftime(DATETIME2,
                                                lag(DATETIME2, default = first(DATETIME2)),
                                                units = "mins")))
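As a small illustration of why default = first() gives a zero (not code from the original answer): for the first row of each group, lag() would normally return NA; supplying the group's first timestamp instead makes the first difference x1 - x1 = 0.
library(dplyr)
t <- as.POSIXct(c("2019-06-03 16:46:00", "2019-06-03 16:50:00"))
difftime(t, lag(t, default = first(t)), units = "mins")
#> Time differences in mins
#> [1] 0 4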

Aggregate Data based on Two Different Assessment Methods in R

I'm looking to aggregate some pedometer data, gathered in steps per minute, so I get a summed number of steps up until each EMA assessment. The EMA assessments happened four times per day. An example of the two data sets is:
Pedometer Data
ID Steps Time
1 15 2/4/2020 8:32
1 23 2/4/2020 8:33
1 76 2/4/2020 8:34
1 32 2/4/2020 8:35
1 45 2/4/2020 8:36
...
2 16 2/4/2020 8:32
2 17 2/4/2020 8:33
2 0 2/4/2020 8:34
2 5 2/4/2020 8:35
2 8 2/4/2020 8:36
EMA Data
ID Time X Y
1 2/4/2020 8:36 3 4
1 2/4/2020 12:01 3 5
1 2/4/2020 3:30 4 5
1 2/4/2020 6:45 7 8
...
2 2/4/2020 8:35 4 6
2 2/4/2020 12:05 5 7
2 2/4/2020 3:39 1 3
2 2/4/2020 6:55 8 3
I'm looking to add the pedometer data to the EMA data as a new variable, where the number of steps taken is summed until the next EMA assessment. Ideally it would look something like:
Combined Data
ID Time X Y Steps
1 2/4/2020 8:36 3 4 191
1 2/4/2020 12:01 3 5 [Sum of steps taken from 8:37 until 12:01 on 2/4/2020]
1 2/4/2020 3:30 4 5 [Sum of steps taken from 12:02 until 3:30 on 2/4/2020]
1 2/4/2020 6:45 7 8 [Sum of steps taken from 3:31 until 6:45 on 2/4/2020]
...
2 2/4/2020 8:35 4 6 38
2 2/4/2020 12:05 5 7 [Sum of steps taken from 8:36 until 12:05 on 2/4/2020]
2 2/4/2020 3:39 1 3 [Sum of steps taken from 12:06 until 3:39 on 2/4/2020]
2 2/4/2020 6:55 8 3 [Sum of steps taken from 3:40 until 6:55 on 2/4/2020]
I then need the process to continue over the entire 21 day EMA period, so the same process for the 4 EMA assessment time points on 2/5/2020, 2/6/2020, etc.
This has pushed me to the limit of my R skills, so any pointers would be extremely helpful! I'm most familiar with the tidyverse but am comfortable using base R as well. Thanks in advance for any advice.
Here's a solution using rolling joins from data.table. The basic idea is to roll each time from the pedometer data up to the next time in the EMA data (while still matching on ID). Once the next EMA time is found, all that's left is to isolate the X and Y values and sum up Steps.
Data creation and prep:
library(data.table)
pedometer <- data.table(ID = sort(rep(1:2, 500)),
                        Time = rep(seq.POSIXt(as.POSIXct("2020-02-04 09:35:00 EST"),
                                              as.POSIXct("2020-02-08 17:00:00 EST"),
                                              length.out = 500), 2),
                        Steps = rpois(1000, 25))
EMA <- data.table(ID = sort(rep(1:2, 4*5)),
                  Time = rep(seq.POSIXt(as.POSIXct("2020-02-04 05:00:00 EST"),
                                        as.POSIXct("2020-02-08 23:59:59 EST"),
                                        by = '6 hours'), 2),
                  X = sample(1:8, 2*4*5, rep = T),
                  Y = sample(1:8, 2*4*5, rep = T))
setkey(pedometer, Time)
setkey(EMA, Time)
EMA[, next_ema_time := Time]
And now the actual join and summation:
joined <- EMA[pedometer,
              on = .(ID, Time),
              roll = -Inf,
              j = .(ID, Time, Steps, next_ema_time, X, Y)]
result <- joined[, .('X' = min(X),
                     'Y' = min(Y),
                     'Steps' = sum(Steps)),
                 .(ID, next_ema_time)]
result
result
#> ID next_ema_time X Y Steps
#> 1: 1 2020-02-04 11:00:00 1 2 167
#> 2: 2 2020-02-04 11:00:00 8 5 169
#> 3: 1 2020-02-04 17:00:00 3 6 740
#> 4: 2 2020-02-04 17:00:00 4 6 747
#> 5: 1 2020-02-04 23:00:00 2 2 679
#> 6: 2 2020-02-04 23:00:00 3 2 732
#> 7: 1 2020-02-05 05:00:00 7 5 720
#> 8: 2 2020-02-05 05:00:00 6 8 692
#> 9: 1 2020-02-05 11:00:00 2 4 731
#> 10: 2 2020-02-05 11:00:00 4 5 773
#> 11: 1 2020-02-05 17:00:00 1 5 757
#> 12: 2 2020-02-05 17:00:00 3 5 743
#> 13: 1 2020-02-05 23:00:00 3 8 693
#> 14: 2 2020-02-05 23:00:00 1 8 740
#> 15: 1 2020-02-06 05:00:00 8 8 710
#> 16: 2 2020-02-06 05:00:00 3 2 760
#> 17: 1 2020-02-06 11:00:00 8 4 716
#> 18: 2 2020-02-06 11:00:00 1 2 688
#> 19: 1 2020-02-06 17:00:00 5 2 738
#> 20: 2 2020-02-06 17:00:00 4 6 724
#> 21: 1 2020-02-06 23:00:00 7 8 737
#> 22: 2 2020-02-06 23:00:00 6 3 672
#> 23: 1 2020-02-07 05:00:00 2 6 726
#> 24: 2 2020-02-07 05:00:00 7 7 759
#> 25: 1 2020-02-07 11:00:00 1 4 737
#> 26: 2 2020-02-07 11:00:00 5 2 737
#> 27: 1 2020-02-07 17:00:00 3 5 766
#> 28: 2 2020-02-07 17:00:00 4 4 745
#> 29: 1 2020-02-07 23:00:00 3 3 714
#> 30: 2 2020-02-07 23:00:00 2 1 741
#> 31: 1 2020-02-08 05:00:00 4 6 751
#> 32: 2 2020-02-08 05:00:00 8 2 723
#> 33: 1 2020-02-08 11:00:00 3 3 716
#> 34: 2 2020-02-08 11:00:00 3 6 735
#> 35: 1 2020-02-08 17:00:00 1 5 696
#> 36: 2 2020-02-08 17:00:00 7 7 741
#> ID next_ema_time X Y Steps
Created on 2020-02-04 by the reprex package (v0.3.0)
I would left_join ema_df onto pedometer_df by ID and Time. This way you get
all rows of pedometer_df, with missing values for x and y (which I assume identify each assessment) whenever a row is not an EMA assessment time.
I fill those values upwards using the next available ones (i.e. the next EMA assessment's x and y),
and finally group_by ID, x and y and summarise to keep the datetime of the assessment (the max) and the sum of steps.
library(dplyr)
library(tidyr)
pedometer_df %>%
  left_join(ema_df, by = c("ID", "Time")) %>%
  fill(x, y, .direction = "up") %>%
  group_by(ID, x, y) %>%
  summarise(
    Time = max(Time),
    Steps = sum(Steps)
  )
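Not from the original answers: if you are on dplyr >= 1.1.0, the same rolling idea can be expressed with join_by(closest()), which matches each pedometer row to the nearest EMA time at or after it. The sketch below reuses the simulated pedometer and EMA tables from the data.table answer above; the suffix argument and grouping columns are assumptions based on that data.
library(dplyr)   # >= 1.1.0 for join_by()/closest()

pedometer %>%
  left_join(EMA,
            by = join_by(ID, closest(Time <= Time)),   # next EMA time per step row
            suffix = c("_step", "_ema")) %>%
  group_by(ID, Time_ema, X, Y) %>%
  summarise(Steps = sum(Steps), .groups = "drop") %>%
  arrange(Time_ema, ID)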

Create a vector in R by summing rows based on multiple criteria

I have financial data which is currently in 15 minute intervals, but I want to convert the intervals from 15 minutes to 30 minutes before I conduct the rest of my analysis. As such, I would like to sum the traded volumes for two adjacent 15 minute intervals and take the closing price of the second 15 minute sub-interval (i.e. the end of the 30 minute period).
I have shown below an example of the data (df) and the desired output (df.30min) using an sapply function. This works fine for the example below, but given that I am analysing 10 years of daily data with 50 companies and 27 intervals per day the processing time is excessive, even for one year of data. I have similar issues if I try a for loop.
I am new to R so I am hoping that there is a fairly easy solution using one of the built in functions.
In my actual dataset there are 27 x 15 minute intervals (10:00-16:45). I would like my final "30 minute" dataset to have one 15 minute interval from 13:30-13:45. Also, there may be other anomalies where the stock exchange opened late / closed early or where a stock was put on a trading halt partway through a day. (I have managed to map the times in my data to the correct Interval using a lookup table with a match function.) Given the imperfect structure of my data I am after a solution that is not reliant on a complete set and perfectly even number of 15 minute intervals. In Excel I would use a sumifs function.
set.seed(1)
df <- data.frame(
  Company = rep(c("Co A", "Co B", "Co C"), each = 8),
  Date = as.Date(rep(c("2005-01-01", "2005-01-02"), times = 3, each = 4)),
  Time = as.factor(c("10:00:00", "10:15:00", "10:30:00", "10:45:00")),
  Interval = as.factor(c(1, 1, 2, 2)),
  Interval.End = as.factor(c(0, 1)),
  Close = abs(round(rnorm(24), 1)) * 10 + 100,
  Volume = abs(round(rnorm(24), 1)) * 10)
> df
Company Date Time Interval Interval.End Close Volume
1 Co A 2005-01-01 10:00:00 1 0 106 6
2 Co A 2005-01-01 10:15:00 1 1 102 1
3 Co A 2005-01-01 10:30:00 2 0 108 2
4 Co A 2005-01-01 10:45:00 2 1 116 15
5 Co A 2005-01-02 10:00:00 1 0 103 5
6 Co A 2005-01-02 10:15:00 1 1 108 4
7 Co A 2005-01-02 10:30:00 2 0 105 14
8 Co A 2005-01-02 10:45:00 2 1 107 1
9 Co B 2005-01-01 10:00:00 1 0 106 4
10 Co B 2005-01-01 10:15:00 1 1 103 1
11 Co B 2005-01-01 10:30:00 2 0 115 14
12 Co B 2005-01-01 10:45:00 2 1 104 4
13 Co B 2005-01-02 10:00:00 1 0 106 4
14 Co B 2005-01-02 10:15:00 1 1 122 1
15 Co B 2005-01-02 10:30:00 2 0 111 11
16 Co B 2005-01-02 10:45:00 2 1 100 8
17 Co C 2005-01-01 10:00:00 1 0 100 2
18 Co C 2005-01-01 10:15:00 1 1 109 3
19 Co C 2005-01-01 10:30:00 2 0 108 7
20 Co C 2005-01-01 10:45:00 2 1 106 6
21 Co C 2005-01-02 10:00:00 1 0 109 7
22 Co C 2005-01-02 10:15:00 1 1 108 7
23 Co C 2005-01-02 10:30:00 2 0 101 4
24 Co C 2005-01-02 10:45:00 2 1 120 8
df.30min <- df[-which(df$Interval.End == 0), ]
df.30min$Volume <- sapply(seq_len(nrow(df.30min)),
                          function(i) sum(df$Volume[df$Company == df.30min$Company[i] &
                                                    df$Date == df.30min$Date[i] &
                                                    df$Interval == df.30min$Interval[i]]))
> df.30min
Company Date Time Interval Interval.End Close Volume
2 Co A 2005-01-01 10:15:00 1 1 102 7
4 Co A 2005-01-01 10:45:00 2 1 116 17
6 Co A 2005-01-02 10:15:00 1 1 108 9
8 Co A 2005-01-02 10:45:00 2 1 107 15
10 Co B 2005-01-01 10:15:00 1 1 103 5
12 Co B 2005-01-01 10:45:00 2 1 104 18
14 Co B 2005-01-02 10:15:00 1 1 122 5
16 Co B 2005-01-02 10:45:00 2 1 100 19
18 Co C 2005-01-01 10:15:00 1 1 109 5
20 Co C 2005-01-01 10:45:00 2 1 106 13
22 Co C 2005-01-02 10:15:00 1 1 108 14
24 Co C 2005-01-02 10:45:00 2 1 120 12
Using library dplyr, you can try something like this:
library(dplyr)
df %>%
  arrange(Company, Date, Time, Interval, Interval.End) %>%
  group_by(Company, Date, Interval) %>%
  summarise(Time = Time[2],
            Interval.End = Interval.End[2],
            Close = Close[2],
            Volume = sum(Volume))
Source: local data frame [12 x 7]
Groups: Company, Date [?]
Company Date Interval Time Interval.End Close Volume
(fctr) (date) (fctr) (fctr) (fctr) (dbl) (dbl)
1 Co A 2005-01-01 1 10:15:00 1 102 7
2 Co A 2005-01-01 2 10:45:00 1 116 17
3 Co A 2005-01-02 1 10:15:00 1 108 9
4 Co A 2005-01-02 2 10:45:00 1 107 15
5 Co B 2005-01-01 1 10:15:00 1 103 5
6 Co B 2005-01-01 2 10:45:00 1 104 18
7 Co B 2005-01-02 1 10:15:00 1 122 5
8 Co B 2005-01-02 2 10:45:00 1 100 19
9 Co C 2005-01-01 1 10:15:00 1 109 5
10 Co C 2005-01-01 2 10:45:00 1 106 13
11 Co C 2005-01-02 1 10:15:00 1 108 14
12 Co C 2005-01-02 2 10:45:00 1 120 12
If your data frame is already arranged properly, you can get rid of that arrange part above.
Note: I am assuming there are always two sub-intervals (0, 1) per interval and am therefore using the hardcoded value 2. If this is not the case, you can use the proper subsetting, for example along the lines of the sketch below.
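A hedged variant (not part of the original answer) that does not rely on exactly two rows per interval (trading halts, a lone 13:30-13:45 slot, etc.) is to take the last row of each interval for the closing values instead of hardcoding position 2:
library(dplyr)

df %>%
  arrange(Company, Date, Time) %>%
  group_by(Company, Date, Interval) %>%
  summarise(Time = last(Time),                 # closing time of the interval
            Interval.End = last(Interval.End),
            Close = last(Close),               # closing price = last observation
            Volume = sum(Volume),              # total traded volume
            .groups = "drop")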
We can do this using data.table
library(data.table)
setDT(df)[order(Company, Date, Time, Interval),
list(Time=Time[2L], Interval.End = Interval.End[2L],
Close = Close[2L], Volume = sum(Volume)),
by = .(Company, Date, Interval)]
# Company Date Interval Time Interval.End Close Volume
# 1: Co A 2005-01-01 1 10:15:00 1 102 7
# 2: Co A 2005-01-01 2 10:45:00 1 116 17
# 3: Co A 2005-01-02 1 10:15:00 1 108 9
# 4: Co A 2005-01-02 2 10:45:00 1 107 15
# 5: Co B 2005-01-01 1 10:15:00 1 103 5
# 6: Co B 2005-01-01 2 10:45:00 1 104 18
# 7: Co B 2005-01-02 1 10:15:00 1 122 5
# 8: Co B 2005-01-02 2 10:45:00 1 100 19
# 9: Co C 2005-01-01 1 10:15:00 1 109 5
#10: Co C 2005-01-01 2 10:45:00 1 106 13
#11: Co C 2005-01-02 1 10:15:00 1 108 14
#12: Co C 2005-01-02 2 10:45:00 1 120 12

insert new rows to the time series data, with date added automatically

I have a time-series data frame looks like:
TS.1
2015-09-01 361656.7
2015-09-02 370086.4
2015-09-03 346571.2
2015-09-04 316616.9
2015-09-05 342271.8
2015-09-06 361548.2
2015-09-07 342609.2
2015-09-08 281868.8
2015-09-09 297011.1
2015-09-10 295160.5
2015-09-11 287926.9
2015-09-12 323365.8
Now, what I want to do is add some new data points (rows) to the existing data frame, say,
320123.5
323521.7
How can I add the corresponding date to each row? The dates should simply continue sequentially from the last row.
Is there any package that can do this automatically, so that the only thing I have to do is insert the new data points?
Here's some play data:
df <- data.frame(date = seq(as.Date("2015-01-01"), as.Date("2015-01-31"), "days"), x = seq(31))
new.x <- c(32, 33)
This adds the extra observations along with the proper sequence of dates:
new.df <- data.frame(date=seq(max(df$date) + 1, max(df$date) + length(new.x), "days"), x=new.x)
Then just rbind them to get your expanded data frame:
rbind(df, new.df)
date x
1 2015-01-01 1
2 2015-01-02 2
3 2015-01-03 3
4 2015-01-04 4
5 2015-01-05 5
6 2015-01-06 6
7 2015-01-07 7
8 2015-01-08 8
9 2015-01-09 9
10 2015-01-10 10
11 2015-01-11 11
12 2015-01-12 12
13 2015-01-13 13
14 2015-01-14 14
15 2015-01-15 15
16 2015-01-16 16
17 2015-01-17 17
18 2015-01-18 18
19 2015-01-19 19
20 2015-01-20 20
21 2015-01-21 21
22 2015-01-22 22
23 2015-01-23 23
24 2015-01-24 24
25 2015-01-25 25
26 2015-01-26 26
27 2015-01-27 27
28 2015-01-28 28
29 2015-01-29 29
30 2015-01-30 30
31 2015-01-31 31
32 2015-02-01 32
33 2015-02-02 33
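Not part of the original answer: the same idea can be wrapped in a small helper so that appending new observations becomes a one-liner. It assumes a data frame with columns date (class Date) and x (numeric), as in the play data above; the function name append_obs is made up for illustration.
append_obs <- function(df, new.x) {
  # continue the daily date sequence from the last existing row
  new.dates <- seq(max(df$date) + 1, by = "days", length.out = length(new.x))
  rbind(df, data.frame(date = new.dates, x = new.x))
}

append_obs(df, c(32, 33))   # adds rows dated 2015-02-01 and 2015-02-02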
