I'm looking to aggregate some pedometer data, gathered in steps per minute, so that I get a summed number of steps up to each EMA assessment. The EMA assessments happened four times per day. Examples of the two data sets:
Pedometer Data
ID Steps Time
1 15 2/4/2020 8:32
1 23 2/4/2020 8:33
1 76 2/4/2020 8:34
1 32 2/4/2020 8:35
1 45 2/4/2020 8:36
...
2 16 2/4/2020 8:32
2 17 2/4/2020 8:33
2 0 2/4/2020 8:34
2 5 2/4/2020 8:35
2 8 2/4/2020 8:36
EMA Data
ID Time X Y
1 2/4/2020 8:36 3 4
1 2/4/2020 12:01 3 5
1 2/4/2020 3:30 4 5
1 2/4/2020 6:45 7 8
...
2 2/4/2020 8:35 4 6
2 2/4/2020 12:05 5 7
2 2/4/2020 3:39 1 3
2 2/4/2020 6:55 8 3
I'm looking to add the pedometer data to the EMA data as a new variable, where the number of steps taken is summed until the next EMA assessment. Ideally it would look something like this:
Combined Data
ID Time X Y Steps
1 2/4/2020 8:36 3 4 191
1 2/4/2020 12:01 3 5 [Sum of steps taken from 8:37 until 12:01 on 2/4/2020]
1 2/4/2020 3:30 4 5 [Sum of steps taken from 12:02 until 3:30 on 2/4/2020]
1 2/4/2020 6:45 7 8 [Sum of steps taken from 3:31 until 6:45 on 2/4/2020]
...
2 2/4/2020 8:35 4 6 38
2 2/4/2020 12:05 5 7 [Sum of steps taken from 8:36 until 12:05 on 2/4/2020]
2 2/4/2020 3:39 1 3 [Sum of steps taken from 12:06 until 3:39 on 2/4/2020]
2 2/4/2020 6:55 8 3 [Sum of steps taken from 3:40 until 6:55 on 2/4/2020]
I then need the process to continue over the entire 21-day EMA period, so the same process for the 4 EMA assessment time points on 2/5/2020, 2/6/2020, etc.
This has pushed me to the limit of my R skills, so any pointers would be extremely helpful! I'm most familiar with the tidyverse but am comfortable using base R as well. Thanks in advance for all advice.
Here's a solution using rolling joins from data.table. The basic idea is to roll each time from the pedometer data up to the next time in the EMA data (while still matching on ID). Once the next EMA time is found, all that's left is to isolate the X and Y values and sum up Steps.
Data creation and prep:
library(data.table)

# Simulated pedometer data: two IDs, 500 minute-level readings each
pedometer <- data.table(ID = sort(rep(1:2, 500)),
                        Time = rep(seq.POSIXt(as.POSIXct("2020-02-04 09:35:00 EST"),
                                              as.POSIXct("2020-02-08 17:00:00 EST"),
                                              length.out = 500), 2),
                        Steps = rpois(1000, 25))

# Simulated EMA data: an assessment every 6 hours over five days, per ID
EMA <- data.table(ID = sort(rep(1:2, 4*5)),
                  Time = rep(seq.POSIXt(as.POSIXct("2020-02-04 05:00:00 EST"),
                                        as.POSIXct("2020-02-08 23:59:59 EST"),
                                        by = '6 hours'), 2),
                  X = sample(1:8, 2*4*5, replace = TRUE),
                  Y = sample(1:8, 2*4*5, replace = TRUE))

setkey(pedometer, Time)
setkey(EMA, Time)

# Copy the EMA time so it survives the rolling join as its own column
EMA[, next_ema_time := Time]
And now the actual join and summation:
# roll = -Inf rolls each pedometer Time forward to the next EMA Time
joined <- EMA[pedometer,
              on = .(ID, Time),
              roll = -Inf,
              j = .(ID, Time, Steps, next_ema_time, X, Y)]

# X and Y are constant within each (ID, next_ema_time) group,
# so min() simply recovers them alongside the summed Steps
result <- joined[, .('X' = min(X),
                     'Y' = min(Y),
                     'Steps' = sum(Steps)),
                 .(ID, next_ema_time)]
result
#> ID next_ema_time X Y Steps
#> 1: 1 2020-02-04 11:00:00 1 2 167
#> 2: 2 2020-02-04 11:00:00 8 5 169
#> 3: 1 2020-02-04 17:00:00 3 6 740
#> 4: 2 2020-02-04 17:00:00 4 6 747
#> 5: 1 2020-02-04 23:00:00 2 2 679
#> 6: 2 2020-02-04 23:00:00 3 2 732
#> 7: 1 2020-02-05 05:00:00 7 5 720
#> 8: 2 2020-02-05 05:00:00 6 8 692
#> 9: 1 2020-02-05 11:00:00 2 4 731
#> 10: 2 2020-02-05 11:00:00 4 5 773
#> 11: 1 2020-02-05 17:00:00 1 5 757
#> 12: 2 2020-02-05 17:00:00 3 5 743
#> 13: 1 2020-02-05 23:00:00 3 8 693
#> 14: 2 2020-02-05 23:00:00 1 8 740
#> 15: 1 2020-02-06 05:00:00 8 8 710
#> 16: 2 2020-02-06 05:00:00 3 2 760
#> 17: 1 2020-02-06 11:00:00 8 4 716
#> 18: 2 2020-02-06 11:00:00 1 2 688
#> 19: 1 2020-02-06 17:00:00 5 2 738
#> 20: 2 2020-02-06 17:00:00 4 6 724
#> 21: 1 2020-02-06 23:00:00 7 8 737
#> 22: 2 2020-02-06 23:00:00 6 3 672
#> 23: 1 2020-02-07 05:00:00 2 6 726
#> 24: 2 2020-02-07 05:00:00 7 7 759
#> 25: 1 2020-02-07 11:00:00 1 4 737
#> 26: 2 2020-02-07 11:00:00 5 2 737
#> 27: 1 2020-02-07 17:00:00 3 5 766
#> 28: 2 2020-02-07 17:00:00 4 4 745
#> 29: 1 2020-02-07 23:00:00 3 3 714
#> 30: 2 2020-02-07 23:00:00 2 1 741
#> 31: 1 2020-02-08 05:00:00 4 6 751
#> 32: 2 2020-02-08 05:00:00 8 2 723
#> 33: 1 2020-02-08 11:00:00 3 3 716
#> 34: 2 2020-02-08 11:00:00 3 6 735
#> 35: 1 2020-02-08 17:00:00 1 5 696
#> 36: 2 2020-02-08 17:00:00 7 7 741
#> ID next_ema_time X Y Steps
Created on 2020-02-04 by the reprex package (v0.3.0)
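One practical note for the posted data: the Time columns arrive as character, so they would need parsing to POSIXct before the join. A minimal sketch, assuming the times are already in 24-hour form (the posted afternoon times like 3:30 would need to be 15:30 first):
pedometer[, Time := as.POSIXct(Time, format = "%m/%d/%Y %H:%M")]
EMA[, Time := as.POSIXct(Time, format = "%m/%d/%Y %H:%M")]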
I would left_join ema_df onto pedometer_df by ID and Time. This way you get all the rows of pedometer_df, with missing values for x and y (which I assume identify each assessment) whenever the row is not an EMA assessment time.
I then fill those values upward using the next available ones (i.e., the next EMA assessment's x and y),
and finally group by ID, x, and y and summarise to keep the datetime of the assessment (the max) and the sum of steps.
library(dplyr)
library(tidyr)
pedometer_df %>%
  left_join(ema_df, by = c("ID", "Time")) %>%
  fill(x, y, .direction = "up") %>%
  group_by(ID, x, y) %>%
  summarise(
    Time = max(Time),
    Steps = sum(Steps)
  )
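One caveat: grouping on ID, x, and y merges two assessments that happen to share the same x and y values for an ID. A safer variant (a sketch under the same assumed pedometer_df/ema_df inputs, with a hypothetical ema_time helper column) fills the assessment time itself and groups on that:
pedometer_df %>%
  left_join(ema_df %>% mutate(ema_time = Time), by = c("ID", "Time")) %>%
  fill(ema_time, x, y, .direction = "up") %>%
  group_by(ID, ema_time) %>%
  summarise(
    x = first(x),    # constant within each assessment after fill()
    y = first(y),
    Steps = sum(Steps),
    .groups = "drop"
  )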
Say I have a data table with a list of ID numbers and date-times. Each ID may appear more than once with a different date-time (example data in Table 1).
I want to add a column to Table 1 with a date from a second table, matched on ID number, which finds the closest date in Table 2 occurring after the date in Table 1.
Again, there may be multiple dates for the same ID number in Table 2, so I just want to add the nearest following date.
I figure I need to write a for loop, but I can't work out how to run a match for each ID number and then select just one result for the date column. The other condition I need to add is that if there is no date in Table 2 after the date in Table 1 for the same ID, it should just return NA.
What would be the best way to proceed? Thanks in advance all
Table 1
id_code inspection_date
1 600 2019-10-10 18:24:32
2 600 2019-10-10 23:55:13
3 600 2019-08-07 13:42:45
4 601 2019-08-16 15:45:54
5 601 2019-08-17 17:25:34
6 602 2019-08-19 12:34:31
7 603 2019-11-03 16:30:31
8 603 2019-11-03 19:01:01
Table 2
id_code2 confirm_date
1 598 2019-09-09 13:24:45
2 600 2019-10-10 19:35:37
3 600 2019-10-11 01:23:58
4 600 2019-08-07 16:30:01
5 601 2019-08-17 02:30:35
6 601 2019-08-17 22:45:46
7 601 2019-08-19 19:12:18
8 602 2019-12-01 12:12:12
9 602 2019-12-14 23:25:35
10 602 2019-12-29 03:30:31
11 603 2019-12-30 06:35:35
12 603 2019-12-31 01:02:34
13 605 2019-12-31 17:24:46
Here's a solution that would work, but I'm not sure it's the fastest possible:
table1$confirm_date <- as.POSIXct(
  mapply(function(x, y) sort(table2$confirm_date[table2$id_code2 == x &
                                                 table2$confirm_date > y])[1],
         table1$id_code, table1$inspection_date),
  origin = "1970-01-01")
This goes over your first table row by row and finds the appropriate date from Table 2 that shares the same ID and has the closest (i.e., earliest) date later than it.
Output:
id_code inspection_date confirm_date
1 600 2019-10-10 18:24:32 2019-10-10 19:35:37
2 600 2019-10-10 23:55:13 2019-10-11 01:23:58
3 600 2019-08-07 13:42:45 2019-08-07 16:30:01
4 601 2019-08-16 15:45:54 2019-08-17 02:30:35
5 601 2019-08-17 17:25:34 2019-08-17 22:45:46
6 602 2019-08-19 12:34:31 2019-12-01 12:12:12
7 603 2019-11-03 16:30:31 2019-12-30 06:35:35
8 603 2019-11-03 19:01:01 2019-12-30 06:35:35
(Your sample data didn't have a case where there was no confirm_date later than the inspection_date, but I checked, and it indeed returns an NA as required.)
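For larger tables, the same closest-after lookup can be done with a data.table rolling join instead of a row-by-row mapply. A sketch, assuming the tables are as posted with POSIXct date columns (note that, unlike the strict > above, an exact timestamp match joins to itself):
library(data.table)
setDT(table1)
setDT(table2)
# Keep a copy of confirm_date, since the join overwrites the join column
table2[, confirm_date2 := confirm_date]
# roll = -Inf rolls each inspection_date forward to the next confirm_date;
# rows with no later confirm_date get NA
table1[, confirm_date := table2[table1,
                                on = .(id_code2 = id_code,
                                       confirm_date = inspection_date),
                                roll = -Inf,
                                x.confirm_date2]]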
Suppose I have two datasets:
ds1
NO ID DOB ID2 count
1 4083 2007-10-01 3625 5
2 4408 2008-07-01 3603 2
3 4514 2007-07-01 3077 3
4 4396 2008-05-01 3413 5
5 4222 2003-12-01 3341 1
ds2
loc share
12 445
23 4
10 56
1 1
23 34
I want "share" column of ds2 to be added to ds1 so that it would look like
dsmerged
NO ID DOB ID2 count share
1 4083 2007-10-01 3625 5 445
2 4408 2008-07-01 3603 2 4
3 4514 2007-07-01 3077 3 56
4 4396 2008-05-01 3413 5 1
5 4222 2003-12-01 3341 1 34
I tried merge as:
dsmerged <- merge(ds1[,c(1:5)],ds2[,c(2)])
But what it does is duplicate the dataset (5*5 = 25 rows, a cross join) while it does add the "share" column. I obviously don't want those duplicated values. Thank you
If you know that the rows represent the same ID, then you can just cbind:
ds3 <- cbind(ds1, share = ds2$share)
but it would be better if you had an id to join on.
Using dplyr
library(dplyr)
bind_cols(ds1, ds2['share'])
Or with data.table
library(data.table)
setDT(ds1)[, share := ds2[["share"]]]
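If the row order ever isn't guaranteed, a slightly safer pattern is to add an explicit row index to both data sets and merge on it. A minimal sketch (the row_id column is hypothetical, added just for the join):
ds1$row_id <- seq_len(nrow(ds1))
ds2$row_id <- seq_len(nrow(ds2))
dsmerged <- merge(ds1, ds2[, c("row_id", "share")], by = "row_id")
dsmerged$row_id <- NULL  # drop the helper index afterwards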
I have a time-series data frame that looks like:
TS.1
2015-09-01 361656.7
2015-09-02 370086.4
2015-09-03 346571.2
2015-09-04 316616.9
2015-09-05 342271.8
2015-09-06 361548.2
2015-09-07 342609.2
2015-09-08 281868.8
2015-09-09 297011.1
2015-09-10 295160.5
2015-09-11 287926.9
2015-09-12 323365.8
Now, what I want to do is add some new data points (rows) to the existing data frame, say,
320123.5
323521.7
How can I add a corresponding date to each new row? The dates just continue sequentially from the last row.
Is there any package that can do this automatically, so that the only thing I need to do is insert the new data points?
Here's some play data:
df <- data.frame(date = seq(as.Date("2015-01-01"), as.Date("2015-01-31"), "days"), x = seq(31))
new.x <- c(32, 33)
This adds the extra observations along with the proper sequence of dates:
new.df <- data.frame(date=seq(max(df$date) + 1, max(df$date) + length(new.x), "days"), x=new.x)
Then just rbind them to get your expanded data frame:
rbind(df, new.df)
date x
1 2015-01-01 1
2 2015-01-02 2
3 2015-01-03 3
4 2015-01-04 4
5 2015-01-05 5
6 2015-01-06 6
7 2015-01-07 7
8 2015-01-08 8
9 2015-01-09 9
10 2015-01-10 10
11 2015-01-11 11
12 2015-01-12 12
13 2015-01-13 13
14 2015-01-14 14
15 2015-01-15 15
16 2015-01-16 16
17 2015-01-17 17
18 2015-01-18 18
19 2015-01-19 19
20 2015-01-20 20
21 2015-01-21 21
22 2015-01-22 22
23 2015-01-23 23
24 2015-01-24 24
25 2015-01-25 25
26 2015-01-26 26
27 2015-01-27 27
28 2015-01-28 28
29 2015-01-29 29
30 2015-01-30 30
31 2015-01-31 31
32 2015-02-01 32
33 2015-02-02 33
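If this comes up repeatedly, the two steps can be wrapped in a small helper. A sketch with a hypothetical name, assuming daily data in the same date/x layout:
# Hypothetical helper: append new values, extending the date sequence
# one day at a time from the last existing date
append_with_dates <- function(df, new.x) {
  new.df <- data.frame(
    date = seq(max(df$date) + 1, by = "days", length.out = length(new.x)),
    x = new.x
  )
  rbind(df, new.df)
}
append_with_dates(df, c(32, 33))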
I am looking for a way to make regular discrete time intervals in R from data that is irregular and includes location information (for example, 10-second intervals, keeping only the first location per interval).
The input data looks like this:
ID Time Location Duration
1 Mark 2015-04-15 23:55:41 1 145448
2 Mark 2015-04-15 23:58:07 9 1559
3 Mark 2015-04-15 23:58:08 9 2279
4 Mark 2015-04-15 23:58:11 9 557
5 Mark 2015-04-15 23:58:11 3 10540
6 Mark 2015-04-15 23:58:22 9 1783
7 Mark 2015-04-15 23:58:24 9 8706
8 Mark 2015-04-15 23:58:32 9 555
9 Mark 2015-04-15 23:58:33 2 124137
10 Mark 2015-04-16 00:00:37 2 7411
11 Mark 2015-04-16 00:00:37 20 7411
and the desired output would be:
ID Time Location
1 Mark 2015-04-15 23:55:40 1
2 Mark 2015-04-15 23:55:50 1
3 Mark 2015-04-15 23:56:00 1
...
16 Mark 2015-04-15 23:58:00 9
17 Mark 2015-04-15 23:58:10 9
Any ideas?
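One possible approach (a sketch, not from the original thread): floor each timestamp to its 10-second bin, keep the first location per bin, then expand to a regular grid and roll the last known location forward. This assumes the posted data frame is called df, with a POSIXct Time column:
library(data.table)
dt <- as.data.table(df)
# Floor each timestamp to the start of its 10-second bin
dt[, bin := as.POSIXct(floor(as.numeric(Time) / 10) * 10, origin = "1970-01-01")]
# First location observed within each bin
first_per_bin <- dt[, .(Location = Location[1]), by = .(ID, bin)]
# Regular 10-second grid per ID, with the last location rolled forward
grid <- first_per_bin[, .(bin = seq(min(bin), max(bin), by = 10)), by = ID]
out <- first_per_bin[grid, on = .(ID, bin), roll = TRUE]
setnames(out, "bin", "Time")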