The given dataset contains a timestamp in the third column, consisting of a date in mm/dd/yyyy format and a time in 24-hour format, all within the month of January. I wish to find the difference in minutes in R by comparing every row with its previous row, but only if both rows share the same patient value, e.g. "1", "2", or "3". This also means that the first row for each patient should get the value 0 minutes, as there is nothing to compare it with. Thanks and please help.
patient handling time
1 Registration 1/2/2017 11:41
1 Triage and Assessment 1/2/2017 12:40
1 Registration 1/2/2017 12:40
1 Triage and Assessment 1/2/2017 22:32
1 Blood test 1/5/2017 8:59
1 Blood test 1/5/2017 14:34
1 MRI SCAN 1/5/2017 21:37
2 X-Ray 1/7/2017 4:31
2 X-Ray 1/7/2017 7:57
2 Discuss Results 1/7/2017 14:45
2 Discuss Results 1/7/2017 17:55
2 Check-out 1/9/2017 17:09
2 Check-out 1/9/2017 19:14
3 Registration 1/4/2017 1:34
3 Registration 1/4/2017 6:36
3 Triage and Assessment 1/4/2017 17:49
3 Triage and Assessment 1/5/2017 8:59
3 Blood test 1/5/2017 21:37
3 Blood test 1/6/2017 3:53
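For reproducibility, the first few rows could be rebuilt and parsed like this (a hypothetical set-up, not part of the original post):
# Re-create a subset of the sample data; patient is kept as character.
DF <- data.frame(
  patient  = c("1", "1", "1", "1"),
  handling = c("Registration", "Triage and Assessment",
               "Registration", "Triage and Assessment"),
  time     = c("1/2/2017 11:41", "1/2/2017 12:40",
               "1/2/2017 12:40", "1/2/2017 22:32"),
  stringsAsFactors = FALSE
)
# Parse the mm/dd/yyyy 24-hour stamps into POSIXct so difftime() works on them.
DF$time <- as.POSIXct(DF$time, format = "%m/%d/%Y %H:%M")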
If time is already of class POSIXct and the data frame is already sorted by patient and time, the time difference in minutes can be appended using a streamlined version of SBista's answer:
library(dplyr)
DF %>%
group_by(patient) %>%
mutate(delta = difftime(time, lag(time, default = first(time)), units = "mins"))
# A tibble: 19 x 4
# Groups: patient [3]
patient handling time delta
<chr> <chr> <dttm> <time>
1 1 Registration 2017-01-02 11:41:00 0 mins
2 1 Triage and Assessment 2017-01-02 12:40:00 59 mins
3 1 Registration 2017-01-02 12:40:00 0 mins
4 1 Triage and Assessment 2017-01-02 22:32:00 592 mins
5 1 Blood test 2017-01-05 08:59:00 3507 mins
6 1 Blood test 2017-01-05 14:34:00 335 mins
7 1 MRI SCAN 2017-01-05 21:37:00 423 mins
8 2 X-Ray 2017-01-07 04:31:00 0 mins
9 2 X-Ray 2017-01-07 07:57:00 206 mins
10 2 Discuss Results 2017-01-07 14:45:00 408 mins
11 2 Discuss Results 2017-01-07 17:55:00 190 mins
12 2 Check-out 2017-01-09 17:09:00 2834 mins
13 2 Check-out 2017-01-09 19:14:00 125 mins
14 3 Registration 2017-01-04 01:34:00 0 mins
15 3 Registration 2017-01-04 06:36:00 302 mins
16 3 Triage and Assessment 2017-01-04 17:49:00 673 mins
17 3 Triage and Assessment 2017-01-05 08:59:00 910 mins
18 3 Blood test 2017-01-05 21:37:00 758 mins
19 3 Blood test 2017-01-06 03:53:00 376 mins
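If a plain numeric column is preferred over the difftime class, the same expression can be wrapped in as.numeric (a small optional tweak):
DF %>%
  group_by(patient) %>%
  # as.numeric() drops the difftime class but keeps the value in minutes
  mutate(delta = as.numeric(difftime(time, lag(time, default = first(time)),
                                     units = "mins")))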
Another approach would be to compute the delta for all rows, ignoring the grouping by patient, and then to replace the first value for each patient with zero, as requested by the OP. Ignoring the groups in the first place might bring a performance gain (not verified).
Unfortunately, I am not proficient enough to implement this using dplyr syntax, so I use data.table with its update by reference:
library(data.table)
setDT(DF)[, delta := difftime(time, shift(time), units = "mins")][]
DF[DF[, first(.I), by = patient]$V1, delta := 0][]
patient handling time delta
1: 1 Registration 2017-01-02 11:41:00 0 mins
2: 1 Triage and Assessment 2017-01-02 12:40:00 59 mins
3: 1 Registration 2017-01-02 12:40:00 0 mins
4: 1 Triage and Assessment 2017-01-02 22:32:00 592 mins
5: 1 Blood test 2017-01-05 08:59:00 3507 mins
6: 1 Blood test 2017-01-05 14:34:00 335 mins
7: 1 MRI SCAN 2017-01-05 21:37:00 423 mins
8: 2 X-Ray 2017-01-07 04:31:00 0 mins
9: 2 X-Ray 2017-01-07 07:57:00 206 mins
10: 2 Discuss Results 2017-01-07 14:45:00 408 mins
11: 2 Discuss Results 2017-01-07 17:55:00 190 mins
12: 2 Check-out 2017-01-09 17:09:00 2834 mins
13: 2 Check-out 2017-01-09 19:14:00 125 mins
14: 3 Registration 2017-01-04 01:34:00 0 mins
15: 3 Registration 2017-01-04 06:36:00 302 mins
16: 3 Triage and Assessment 2017-01-04 17:49:00 673 mins
17: 3 Triage and Assessment 2017-01-05 08:59:00 910 mins
18: 3 Blood test 2017-01-05 21:37:00 758 mins
19: 3 Blood test 2017-01-06 03:53:00 376 mins
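For what it's worth, here is an untested dplyr sketch of the same two-step idea: compute the ungrouped lag once, then zero out the first row of each patient.
library(dplyr)
DF %>%
  # ungrouped lag: deltas are computed across patient boundaries
  mutate(delta = difftime(time, lag(time), units = "mins")) %>%
  group_by(patient) %>%
  # overwrite the first delta of each patient with zero
  mutate(delta = replace(delta, 1L, 0)) %>%
  ungroup()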
You can do the following:
data %>%
group_by(patient) %>%
mutate(diff_in_sec = as.POSIXct(time, format = "%m/%d/%Y %H:%M") -
         lag(as.POSIXct(time, format = "%m/%d/%Y %H:%M"),
             default = first(as.POSIXct(time, format = "%m/%d/%Y %H:%M")))) %>%
mutate(diff_in_min = as.numeric(diff_in_sec / 60))
You get the output as:
# A tibble: 19 x 5
# Groups: patient [3]
patient handling time diff_in_sec diff_in_min
<int> <chr> <chr> <time> <dbl>
1 1 Registration 1/2/2017 11:41 0 secs 0
2 1 Triage and Assessment 1/2/2017 12:40 3540 secs 59
3 1 Registration 1/2/2017 12:40 0 secs 0
4 1 Triage and Assessment 1/2/2017 22:32 35520 secs 592
5 1 Blood test 1/5/2017 8:59 210420 secs 3507
6 1 Blood test 1/5/2017 14:34 20100 secs 335
7 1 MRI SCAN 1/5/2017 21:37 25380 secs 423
8 2 X-Ray 1/7/2017 4:31 0 secs 0
9 2 X-Ray 1/7/2017 7:57 12360 secs 206
10 2 Discuss Results 1/7/2017 14:45 24480 secs 408
11 2 Discuss Results 1/7/2017 17:55 11400 secs 190
12 2 Check-out 1/9/2017 17:09 170040 secs 2834
13 2 Check-out 1/9/2017 19:14 7500 secs 125
14 3 Registration 1/4/2017 1:34 0 secs 0
15 3 Registration 1/4/2017 6:36 18120 secs 302
16 3 Triage and Assessment 1/4/2017 17:49 40380 secs 673
17 3 Triage and Assessment 1/5/2017 8:59 54600 secs 910
18 3 Blood test 1/5/2017 21:37 45480 secs 758
19 3 Blood test 1/6/2017 3:53 22560 secs 376
I'm borrowing the reproducible example given here:
Aggregate daily level data to weekly level in R
since it's pretty close to what I want to do.
Interval value
1 2012-06-10 552
2 2012-06-11 4850
3 2012-06-12 4642
4 2012-06-13 4132
5 2012-06-14 4190
6 2012-06-15 4186
7 2012-06-16 1139
8 2012-06-17 490
9 2012-06-18 5156
10 2012-06-19 4430
11 2012-06-20 4447
12 2012-06-21 4256
13 2012-06-22 3856
14 2012-06-23 1163
15 2012-06-24 564
16 2012-06-25 4866
17 2012-06-26 4421
18 2012-06-27 4206
19 2012-06-28 4272
20 2012-06-29 3993
21 2012-06-30 1211
22 2012-07-01 698
23 2012-07-02 5770
24 2012-07-03 5103
25 2012-07-04 775
26 2012-07-05 5140
27 2012-07-06 4868
28 2012-07-07 1225
29 2012-07-08 671
30 2012-07-09 5726
31 2012-07-10 5176
In that question, the OP asks to aggregate over weekly intervals; what I'd like to do instead is aggregate on a day-of-the-week basis.
So I'd like to have a table similar to that one, summing the values across all occurrences of the same day of the week:
Day of the week value
1 "Sunday" 60000
2 "Monday" 50000
3 "Tuesday" 60000
4 "Wednesday" 50000
5 "Thursday" 60000
6 "Friday" 50000
7 "Saturday" 60000
You can try:
aggregate(d$value, list(weekdays(as.Date(d$Interval))), sum)
We can group by day of the week using weekdays:
library(dplyr)
df %>%
group_by(Day_Of_The_Week = weekdays(as.Date(Interval))) %>%
summarise(value = sum(value))
# Day_Of_The_Week value
# <chr> <int>
#1 Friday 16903
#2 Monday 26368
#3 Saturday 4738
#4 Sunday 2975
#5 Thursday 17858
#6 Tuesday 23772
#7 Wednesday 13560
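Note that the result comes out in alphabetical order. If calendar order is wanted instead, one optional tweak (assuming an English locale) is to turn the label into a factor with explicit levels:
df %>%
  group_by(Day_Of_The_Week = factor(weekdays(as.Date(Interval)),
                                    levels = c("Sunday", "Monday", "Tuesday",
                                               "Wednesday", "Thursday",
                                               "Friday", "Saturday"))) %>%
  summarise(value = sum(value))   # rows now follow the factor levels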
We can do this with data.table
library(data.table)
setDT(df1)[, .(value = sum(value)), .(Dayofweek = weekdays(as.Date(Interval)))]
# Dayofweek value
#1: Sunday 2975
#2: Monday 26368
#3: Tuesday 23772
#4: Wednesday 13560
#5: Thursday 17858
#6: Friday 16903
#7: Saturday 4738
Using lubridate (https://cran.r-project.org/web/packages/lubridate/vignettes/lubridate.html):
library(lubridate)
df1$Weekday <- wday(as.Date(df1$Interval), label = TRUE)
library(data.table)
df1 <- data.table(df1)
df1[, .(value = sum(value)), by = Weekday]
Working with the Rblpapi package, I receive a list of multiple data frames when requesting securities (one data frame per security requested).
My problem is the following. Let's say:
I request daily data for A and B from 01.10.2016 - 31.10.2016.
Some data for A is missing during that time while B has it,
and some data for B is missing while A has it.
So basically:
list$A
date PX_LAST
1 2016-10-03 216.704
2 2016-10-04 217.245
3 2016-10-05 216.887
4 2016-10-06 217.164
5 2016-10-10 217.504
6 2016-10-11 217.022
7 2016-10-12 217.326
8 2016-10-13 216.219
9 2016-10-14 217.275
10 2016-10-17 216.751
11 2016-10-18 218.812
12 2016-10-19 219.682
13 2016-10-20 220.189
14 2016-10-21 220.930
15 2016-10-25 221.179
16 2016-10-26 219.840
17 2016-10-27 219.158
18 2016-10-31 217.820
list$B
date PX_LAST
1 2016-10-03 1722.82
2 2016-10-04 1717.82
3 2016-10-05 1721.14
4 2016-10-06 1718.40
5 2016-10-07 1712.40
6 2016-10-11 1700.33
7 2016-10-12 1695.54
8 2016-10-13 1689.62
9 2016-10-14 1693.71
10 2016-10-17 1687.84
11 2016-10-18 1701.10
12 2016-10-19 1706.74
13 2016-10-21 1701.16
14 2016-10-24 1706.24
15 2016-10-25 1701.20
16 2016-10-26 1699.92
17 2016-10-27 1694.66
18 2016-10-28 1690.96
19 2016-10-31 1690.92
As you can see, they have a different number of observations and the dates don't line up either. For example, the 5th observation for A is on 2016-10-10, while for B it is on 2016-10-07.
So what I need is a means to combine both data frames. My idea was to build a full date range (every day) and fill in the PX_LAST values of A and B for the corresponding dates. After that I could delete empty rows.
Sorry for bad formatting, this is my first post here.
Thanks in advance.
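A minimal sketch of that full-join idea, assuming the Rblpapi result is stored in a list called prices with elements A and B (the names are placeholders):
# Full outer join on date: keep every date that appears in either series.
merged <- merge(prices$A, prices$B, by = "date", all = TRUE,
                suffixes = c("_A", "_B"))
# Optionally keep only the dates where both securities have a price:
both <- merged[complete.cases(merged), ]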
I am looking for a way to make regular discrete time intervals in R from data that is irregular and includes location information (for example, 10-second intervals, keeping only the first location per interval).
The input data looks like this:
ID Time Location Duration
1 Mark 2015-04-15 23:55:41 1 145448
2 Mark 2015-04-15 23:58:07 9 1559
3 Mark 2015-04-15 23:58:08 9 2279
4 Mark 2015-04-15 23:58:11 9 557
5 Mark 2015-04-15 23:58:11 3 10540
6 Mark 2015-04-15 23:58:22 9 1783
7 Mark 2015-04-15 23:58:24 9 8706
8 Mark 2015-04-15 23:58:32 9 555
9 Mark 2015-04-15 23:58:33 2 124137
10 Mark 2015-04-16 00:00:37 2 7411
11 Mark 2015-04-16 00:00:37 20 7411
and the desired output would be:
ID Time Location
1 Mark 2015-04-15 23:55:40 1
2 Mark 2015-04-15 23:55:50 1
3 Mark 2015-04-15 23:56:00 1
...
16 Mark 2015-04-15 23:58:00 9
17 Mark 2015-04-15 23:58:10 9
Any ideas?
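One possible sketch, assuming the data frame is called df and Time is already POSIXct: floor each timestamp to its 10-second bin, keep the first Location per bin, then fill the empty bins with the last known location.
library(dplyr)
library(tidyr)

binned <- df %>%
  # floor to 10-second bins via epoch seconds (display tz may differ)
  mutate(Time = as.POSIXct(floor(as.numeric(Time) / 10) * 10,
                           origin = "1970-01-01")) %>%
  group_by(ID, Time) %>%
  slice(1) %>%             # first observed row per bin
  ungroup()

# Regular 10-second grid spanning the observed range.
grid <- data.frame(Time = seq(min(binned$Time), max(binned$Time),
                              by = "10 secs"))
result <- grid %>%
  left_join(binned, by = "Time") %>%
  fill(ID, Location) %>%   # carry the last known location forward
  select(ID, Time, Location)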
I have a huge dataset similar to the following reproducible sample data.
Interval value
1 2012-06-10 552
2 2012-06-11 4850
3 2012-06-12 4642
4 2012-06-13 4132
5 2012-06-14 4190
6 2012-06-15 4186
7 2012-06-16 1139
8 2012-06-17 490
9 2012-06-18 5156
10 2012-06-19 4430
11 2012-06-20 4447
12 2012-06-21 4256
13 2012-06-22 3856
14 2012-06-23 1163
15 2012-06-24 564
16 2012-06-25 4866
17 2012-06-26 4421
18 2012-06-27 4206
19 2012-06-28 4272
20 2012-06-29 3993
21 2012-06-30 1211
22 2012-07-01 698
23 2012-07-02 5770
24 2012-07-03 5103
25 2012-07-04 775
26 2012-07-05 5140
27 2012-07-06 4868
28 2012-07-07 1225
29 2012-07-08 671
30 2012-07-09 5726
31 2012-07-10 5176
I want to aggregate this data to weekly level to get the output similar to the following:
Interval value
1 Week 2, June 2012 *aggregate value for day 10 to day 14 of June 2012*
2 Week 3, June 2012 *aggregate value for day 15 to day 21 of June 2012*
3 Week 4, June 2012 *aggregate value for day 22 to day 28 of June 2012*
4 Week 5, June 2012 *aggregate value for day 29 to day 30 of June 2012*
5 Week 1, July 2012 *aggregate value for day 1 to day 7 of July 2012*
6 Week 2, July 2012 *aggregate value for day 8 to day 10 of July 2012*
How do I achieve this easily without writing long code?
If you mean the sum of 'value' by week, I think the easiest way to do it is to convert the data into an xts object, as GSee suggested:
library(xts)
data <- as.xts(data$value, order.by = as.Date(data$Interval))
weekly <- apply.weekly(data, sum)
[,1]
2012-06-10 552
2012-06-17 23629
2012-06-24 23872
2012-07-01 23667
2012-07-08 23552
2012-07-10 10902
I leave the formatting of the output as an exercise for you :-)
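For instance, one possible formatting step (assuming the weekly object from above) is:
# Convert the xts result back to a plain data frame of week-ending dates.
weekly_df <- data.frame(week_ending = index(weekly),
                        value = as.numeric(coredata(weekly)))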
If you were to use week from lubridate, you would only get five week numbers to pass to by(). Assuming dat is your data:
> library(lubridate)
> do.call(rbind, by(dat$value, week(dat$Interval), summary))
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# 24 552 4146 4188 3759 4529 4850
# 25 490 2498 4256 3396 4438 5156
# 26 564 2578 4206 3355 4346 4866
# 27 698 993 4868 3366 5122 5770
# 28 671 1086 3200 3200 5314 5726
This shows a summary for the 24th through 28th week of the year. Similarly, we can get the means with aggregate with
> aggregate(value~week(Interval), data = dat, mean)
# week(Interval) value
# 1 24 3758.667
# 2 25 3396.286
# 3 26 3355.000
# 4 27 3366.429
# 5 28 3199.500
I just came across this old question because it was used as a dupe target.
Unfortunately, all the upvoted answers (except the one by konvas and a now deleted one) present solutions for aggregating the data by week of the year while the OP has requested to aggregate by week of the month.
The definition of week of the year and week of the month is ambiguous as discussed here, here, and here.
However, the OP has indicated that he wants to count days 1 to 7 of each month as week 1 of the month, days 8 to 14 as week 2 of the month, etc. Note that week 5 is then a stub of only 2 or 3 days for most months (February in non-leap years, with exactly 28 days, has no week 5 at all).
Having prepared the ground, here is a data.table solution for this kind of aggregation:
library(data.table)
DT[, .(value = sum(value)),
by = .(Interval = sprintf("Week %i, %s",
(mday(Interval) - 1L) %/% 7L + 1L,
format(Interval, "%b %Y")))]
Interval value
1: Week 2, Jun 2012 18366
2: Week 3, Jun 2012 24104
3: Week 4, Jun 2012 23348
4: Week 5, Jun 2012 5204
5: Week 1, Jul 2012 23579
6: Week 2, Jul 2012 11573
We can verify that we have picked the correct intervals by
DT[, .(value = sum(value),
date_range = toString(range(Interval))),
by = .(Week = sprintf("Week %i, %s",
(mday(Interval) - 1L) %/% 7L + 1L,
format(Interval, "%b %Y")))]
Week value date_range
1: Week 2, Jun 2012 18366 2012-06-10, 2012-06-14
2: Week 3, Jun 2012 24104 2012-06-15, 2012-06-21
3: Week 4, Jun 2012 23348 2012-06-22, 2012-06-28
4: Week 5, Jun 2012 5204 2012-06-29, 2012-06-30
5: Week 1, Jul 2012 23579 2012-07-01, 2012-07-07
6: Week 2, Jul 2012 11573 2012-07-08, 2012-07-10
which is in line with OP's specification.
Data
library(data.table)
DT <- fread(
"rn Interval value
1 2012-06-10 552
2 2012-06-11 4850
3 2012-06-12 4642
4 2012-06-13 4132
5 2012-06-14 4190
6 2012-06-15 4186
7 2012-06-16 1139
8 2012-06-17 490
9 2012-06-18 5156
10 2012-06-19 4430
11 2012-06-20 4447
12 2012-06-21 4256
13 2012-06-22 3856
14 2012-06-23 1163
15 2012-06-24 564
16 2012-06-25 4866
17 2012-06-26 4421
18 2012-06-27 4206
19 2012-06-28 4272
20 2012-06-29 3993
21 2012-06-30 1211
22 2012-07-01 698
23 2012-07-02 5770
24 2012-07-03 5103
25 2012-07-04 775
26 2012-07-05 5140
27 2012-07-06 4868
28 2012-07-07 1225
29 2012-07-08 671
30 2012-07-09 5726
31 2012-07-10 5176", drop = 1L)
DT[, Interval := as.Date(Interval)]
If you are using a data frame, you can easily do this with the tidyquant package. Use the tq_transmute function, which applies a mutation and returns a new data frame. Select the "value" column and apply the xts function apply.weekly. The additional argument FUN = sum will get the aggregate by week.
library(tidyquant)
df
#> # A tibble: 31 x 2
#> Interval value
#> <date> <int>
#> 1 2012-06-10 552
#> 2 2012-06-11 4850
#> 3 2012-06-12 4642
#> 4 2012-06-13 4132
#> 5 2012-06-14 4190
#> 6 2012-06-15 4186
#> 7 2012-06-16 1139
#> 8 2012-06-17 490
#> 9 2012-06-18 5156
#> 10 2012-06-19 4430
#> # ... with 21 more rows
df %>%
tq_transmute(select = value,
mutate_fun = apply.weekly,
FUN = sum)
#> # A tibble: 6 x 2
#> Interval value
#> <date> <int>
#> 1 2012-06-10 552
#> 2 2012-06-17 23629
#> 3 2012-06-24 23872
#> 4 2012-07-01 23667
#> 5 2012-07-08 23552
#> 6 2012-07-10 10902
When you say "aggregate" the values, you mean take their sum? Let's say your data frame is d and assuming d$Interval is of class Date, you can try
# if d$Interval is not of class Date d$Interval <- as.Date(d$Interval)
formatdate <- function(date)
paste0("Week ", (as.numeric(format(date, "%d")) - 1) + 1,
", ", format(date, "%b %Y"))
# change "sum" to your required function
aggregate(d$value, by = list(formatdate(d$Interval)), sum)
#            Group.1     x
# 1 Week 1, Jul 2012 23579
# 2 Week 2, Jul 2012 11573
# 3 Week 2, Jun 2012 18366
# 4 Week 3, Jun 2012 24104
# 5 Week 4, Jun 2012 23348
# 6 Week 5, Jun 2012  5204
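Since aggregate returns the groups in alphabetical order (July before June), one optional way to restore chronological order is to sort by the first date of each group:
res <- aggregate(d$value, by = list(formatdate(d$Interval)), sum)
# both calls group identically, so the rows of first_day align with res
first_day <- aggregate(d$Interval, by = list(formatdate(d$Interval)), min)
res[order(first_day$x), ]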