index grouped columns in data frame - r

I have a data frame as follows:
time site val
2014-09-01 00:00:00 2001 1
2014-09-01 00:15:00 2001 0
2014-09-01 00:30:00 2001 2
2014-09-01 00:45:00 2001 0
2014-09-01 00:00:00 2002 1
2014-09-01 00:15:00 2002 0
2014-09-01 00:30:00 2002 2
2014-09-02 00:45:00 2001 0
2014-09-02 00:00:00 2001 1
2014-09-02 00:15:00 2001 0
2014-09-02 00:30:00 2001 2
2014-09-02 00:45:00 2001 0
2014-09-02 00:00:00 2002 1
2014-09-02 00:15:00 2002 0
2014-09-02 00:30:00 2002 2
2014-09-02 00:45:00 2001 0
I'd like to group it by time and site, then add a new variable holding the occurrence index within each group:
time site val h
2014-09-01 00:00:00 2001 1 1
2014-09-01 00:15:00 2001 0 2
2014-09-01 00:30:00 2001 2 3
2014-09-01 00:45:00 2001 0 4
2014-09-01 00:00:00 2002 1 1
2014-09-01 00:15:00 2002 0 2
2014-09-01 00:30:00 2002 2 3
2014-09-02 00:45:00 2002 0 4
2014-09-02 00:00:00 2001 1 1
2014-09-02 00:15:00 2001 0 2
2014-09-02 00:30:00 2001 2 3
2014-09-02 00:45:00 2001 0 4
2014-09-02 00:00:00 2002 1 1
2014-09-02 00:15:00 2002 0 2
2014-09-02 00:30:00 2002 2 3
2014-09-02 00:45:00 2001 0 4
df <- structure(list(time = structure(c(1409522400, 1409523300, 1409524200,
1409525100, 1409522400, 1409523300, 1409524200, 1409611500, 1409608800,
1409609700, 1409610600, 1409611500, 1409608800, 1409609700, 1409610600,
1409611500), class = c("POSIXct", "POSIXt"), tzone = ""), site = structure(c(1L,
1L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 1L), .Label = c("2001",
"2002"), class = "factor"), val = c(1L, 0L, 2L, 0L, 1L, 0L, 2L,
0L, 1L, 0L, 2L, 0L, 1L, 0L, 2L, 0L)), .Names = c("time", "site",
"val"), row.names = c(NA, -16L), class = "data.frame")
What are my options in R to achieve this?
Thanks.

Using dplyr. First we create a column id by extracting the day of the month from the date (column time). Then we group by site and id, and add a new variable counter that numbers the rows within each group.
df$id <- as.factor(format(df$time,'%d'))
library(dplyr)
df %>% group_by(site, id) %>% mutate(counter = row_number())
Output:
time site val id counter
(time) (fctr) (int) (fctr) (int)
1 2014-09-01 00:00:00 2001 1 01 1
2 2014-09-01 00:15:00 2001 0 01 2
3 2014-09-01 00:30:00 2001 2 01 3
4 2014-09-01 00:45:00 2001 0 01 4
5 2014-09-01 00:00:00 2002 1 01 1
6 2014-09-01 00:15:00 2002 0 01 2
7 2014-09-01 00:30:00 2002 2 01 3
8 2014-09-02 00:45:00 2001 0 02 1
9 2014-09-02 00:00:00 2001 1 02 2
10 2014-09-02 00:15:00 2001 0 02 3
11 2014-09-02 00:30:00 2001 2 02 4
12 2014-09-02 00:45:00 2001 0 02 5
13 2014-09-02 00:00:00 2002 1 02 1
14 2014-09-02 00:15:00 2002 0 02 2
15 2014-09-02 00:30:00 2002 2 02 3
16 2014-09-02 00:45:00 2001 0 02 6

We can use ave
df$h <- with(df, ave(val, cumsum(c(TRUE,diff(time)< 0)), FUN= seq_along))
df
# time site val h
#1 2014-09-01 03:30:00 2001 1 1
#2 2014-09-01 03:45:00 2001 0 2
#3 2014-09-01 04:00:00 2001 2 3
#4 2014-09-01 04:15:00 2001 0 4
#5 2014-09-01 03:30:00 2002 1 1
#6 2014-09-01 03:45:00 2002 0 2
#7 2014-09-01 04:00:00 2002 2 3
#8 2014-09-02 04:15:00 2001 0 4
#9 2014-09-02 03:30:00 2001 1 1
#10 2014-09-02 03:45:00 2001 0 2
#11 2014-09-02 04:00:00 2001 2 3
#12 2014-09-02 04:15:00 2001 0 4
#13 2014-09-02 03:30:00 2002 1 1
#14 2014-09-02 03:45:00 2002 0 2
#15 2014-09-02 04:00:00 2002 2 3
#16 2014-09-02 04:15:00 2001 0 4
NOTE: This is based on the expected output shown in the OP's post. I understand that 'site' is also described as a grouping variable, but then the expected output should be something else.
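If the intent is instead to number rows within each site and calendar day (one possible reading of the question), ave can take both grouping keys directly. The following is a minimal base R sketch on a small made-up sample, not the OP's data; the column name h follows the question, while the helper variable day is an assumption introduced here for illustration.

```r
# Small illustrative sample (made up; not the OP's data)
df <- data.frame(
  time = as.POSIXct(c("2014-09-01 00:00:00", "2014-09-01 00:15:00",
                      "2014-09-01 00:00:00", "2014-09-02 00:00:00",
                      "2014-09-02 00:15:00"), tz = "UTC"),
  site = c("2001", "2001", "2002", "2001", "2001"),
  val  = c(1L, 0L, 1L, 1L, 0L)
)

# Full calendar date, so that the same day number in different months does not collide
day <- format(df$time, "%Y-%m-%d")

# Number the rows within each (site, day) combination
df$h <- ave(seq_along(df$val), df$site, day, FUN = seq_along)
df
```

Using the full date rather than only the day of the month keeps groups apart when the data spans more than one month.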

Related

Remove duplicates if an observation appears consecutively, order matters

I have a dataframe grouped by bikeid and sorted by time. If type repeats consecutively, I want to keep only the row with the earliest time. In the case below, I want to remove rows 17, 19, 33, 39 and 41.
The approach from "subtract value from previous row by group" will get me what I need once the duplicates are removed.
bikeid type time
1 1004 repair_time 2019-04-04 14:07:00
3 1004 red_time 2019-04-19 00:54:56
8 1004 repair_time 2019-04-19 12:47:00
10 1004 red_time 2019-04-19 16:45:18
15 1004 repair_time 2019-04-20 04:42:00
17 1004 repair_time 2019-04-20 05:29:00
19 1004 repair_time 2019-04-28 07:33:00
27 1010 repair_time 2019-04-20 10:05:00
29 1010 red_time 2019-04-22 20:51:21
33 1010 red_time 2019-04-23 11:02:34
37 1010 repair_time 2019-04-24 17:20:00
39 1010 repair_time 2019-04-24 18:30:00
41 1010 repair_time 2019-04-24 18:42:00
The final result should look like this:
bikeid type time
1 1004 repair_time 2019-04-04 14:07:00
3 1004 red_time 2019-04-19 00:54:56
8 1004 repair_time 2019-04-19 12:47:00
10 1004 red_time 2019-04-19 16:45:18
15 1004 repair_time 2019-04-20 04:42:00
27 1010 repair_time 2019-04-20 10:05:00
29 1010 red_time 2019-04-22 20:51:21
37 1010 repair_time 2019-04-24 17:20:00
An option is to use rleid (from data.table) to create a run-length grouping variable on the type column, group by it together with the bike id column, and slice the first observation of each group. Here, the time column is already sorted, so we don't have to do any ordering.
library(dplyr)
library(data.table)
df1 %>%
group_by(V2, grp = rleid(V3)) %>%
slice(1) %>%
ungroup %>%
select(-grp)
# A tibble: 8 x 4
# V1 V2 V3 V4
# <int> <int> <chr> <chr>
#1 1 1004 repair_time 2019-04-04 14:07:00
#2 3 1004 red_time 2019-04-19 00:54:56
#3 8 1004 repair_time 2019-04-19 12:47:00
#4 10 1004 red_time 2019-04-19 16:45:18
#5 15 1004 repair_time 2019-04-20 04:42:00
#6 27 1010 repair_time 2019-04-20 10:05:00
#7 29 1010 red_time 2019-04-22 20:51:21
#8 37 1010 repair_time 2019-04-24 17:20:00
Or use the data.table method: convert the 'data.frame' to a 'data.table' (setDT(df1)), group by 'V2' and the rleid of 'V3', get the row index (.I) of the first observation in each group, extract it ($V1), and use it to subset the rows of the dataset.
library(data.table)
setDT(df1)[df1[, .I[1], .(V2, rleid(V3))]$V1]
data
df1 <- structure(list(V1 = c(1L, 3L, 8L, 10L, 15L, 17L, 19L, 27L, 29L,
33L, 37L, 39L, 41L), V2 = c(1004L, 1004L, 1004L, 1004L, 1004L,
1004L, 1004L, 1010L, 1010L, 1010L, 1010L, 1010L, 1010L), V3 = c("repair_time",
"red_time", "repair_time", "red_time", "repair_time", "repair_time",
"repair_time", "repair_time", "red_time", "red_time", "repair_time",
"repair_time", "repair_time"), V4 = c("2019-04-04 14:07:00",
"2019-04-19 00:54:56", "2019-04-19 12:47:00", "2019-04-19 16:45:18",
"2019-04-20 04:42:00", "2019-04-20 05:29:00", "2019-04-28 07:33:00",
"2019-04-20 10:05:00", "2019-04-22 20:51:21", "2019-04-23 11:02:34",
"2019-04-24 17:20:00", "2019-04-24 18:30:00", "2019-04-24 18:42:00"
)), class = "data.frame", row.names = c(NA, -13L))
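To make the role of rleid above more concrete, here is a small base R sketch of the same idea: give every run of consecutive equal values its own id, then keep the first element of each run. The vector x and the names run_id and first_of_run are made up for illustration; this uses rle from base R rather than data.table.

```r
# Made-up sequence of event types, already in time order
x <- c("repair_time", "red_time", "red_time", "repair_time", "repair_time")

# rle() finds runs of consecutive equal values; expand run numbers back to full length
r <- rle(x)
run_id <- rep(seq_along(r$lengths), r$lengths)
run_id

# Keep only the first element of every run
first_of_run <- !duplicated(run_id)
x[first_of_run]
```

A value that reappears after a different one starts a new run, which is why this differs from simply calling unique(x).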
Another option using lag to check if the status is the same as the previous row. As akrun notes, this works because the data is already sorted by time:
library(dplyr)
df %>%
group_by(bikeid) %>%
mutate(repeated = status == lag(status)) %>%
# Need the is.na() check as first element of each group is NA
# due to the lag
filter(! repeated | is.na(repeated))
Data setup code:
txt = "1 1004 repair_time 2019-04-04 14:07:00
3 1004 red_time 2019-04-19 00:54:56
8 1004 repair_time 2019-04-19 12:47:00
10 1004 red_time 2019-04-19 16:45:18
15 1004 repair_time 2019-04-20 04:42:00
17 1004 repair_time 2019-04-20 05:29:00
19 1004 repair_time 2019-04-28 07:33:00
27 1010 repair_time 2019-04-20 10:05:00
29 1010 red_time 2019-04-22 20:51:21
33 1010 red_time 2019-04-23 11:02:34
37 1010 repair_time 2019-04-24 17:20:00
39 1010 repair_time 2019-04-24 18:30:00
41 1010 repair_time 2019-04-24 18:42:00"
df = read.table(text = txt, header = FALSE)
colnames(df) = c("row", "bikeid", "status", "date", "time")
df$date = as.POSIXct(paste(df$date, df$time))
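The lag-based check above can also be sketched without any packages by comparing each row with the one before it. This is only an illustrative base R version on a small made-up sample (the column names bikeid and status follow the answer above; same_as_prev and result are names introduced here), and it assumes the rows are already sorted by bike and time.

```r
# Made-up sample, already sorted by bikeid and time
df <- data.frame(
  bikeid = c(1004, 1004, 1004, 1004, 1010, 1010, 1010),
  status = c("repair_time", "red_time", "repair_time", "repair_time",
             "repair_time", "red_time", "red_time"),
  stringsAsFactors = FALSE
)

n <- nrow(df)
# A row repeats the previous one if both the bike and the status are unchanged;
# the first row has no predecessor, so it is never a repeat
same_as_prev <- c(FALSE,
                  df$bikeid[-1] == df$bikeid[-n] &
                  df$status[-1] == df$status[-n])

result <- df[!same_as_prev, ]
result
```

Comparing bikeid as well as status plays the role of group_by: the first row of a new bike is kept even if its status equals the last status of the previous bike.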

How can I automatically populate a data frame with hourly values and fill the blank values with zeroes?

I have some data that I extract from an Elasticsearch system that shows employees' availability over the course of a date range, broken down into hourly slots.
Associates are never available for 24 hours per day, but I want to display the data across 24 hourly slots, with 0s populating the cells where no data are present.
My thinking is that I need to create a blank data frame, insert the results into it and then populate the rest with 0s, but I would really like to know if there is a better way.
Note that the size of the initial data frame is not always the same size because different days return different hourly values (seven hour-long slots, three hour-long slots, 12 hour-long slots and so on).
Also note that where there aren't any hits/results from the query results, the hourly slots in between do not appear as there are no associated data (see between 18:00 and 21:00).
At present, the whole data frame looks like this:
hour associate_count minutes_covered
<dttm> <int> <dbl>
1 2018-08-06 10:00:00 2 37
2 2018-08-06 11:00:00 2 60
3 2018-08-06 12:00:00 2 42
4 2018-08-06 13:00:00 1 56
5 2018-08-06 14:00:00 2 60
6 2018-08-06 15:00:00 2 60
7 2018-08-06 16:00:00 2 60
8 2018-08-06 17:00:00 1 52
9 2018-08-06 18:00:00 1 0 # NOTE THAT THERE IS A 3-HOUR GAP HERE UNTIL THE NEXT HIT
10 2018-08-06 21:00:00 1 10
The data behind the data frame:
df <- structure(list(hour = structure(c(1533546000, 1533549600, 1533553200,
1533556800, 1533560400, 1533564000, 1533567600, 1533571200, 1533574800
), class = c("POSIXct", "POSIXt"), tzone = "Europe/London"),
associate_count = c(2L, 2L, 2L, 1L, 2L, 2L, 2L, 1L, 1L),
minutes_covered = c(37, 60, 42, 56, 60, 60, 60, 52, 0)), row.names = c(NA,
-9L), class = c("tbl_df", "tbl", "data.frame"))
How can I get the data to look like this?
hour associate_count minutes_covered
1 2018-08-06 00:00 0 0
2 2018-08-06 01:00 0 0
3 2018-08-06 02:00 0 0
4 2018-08-06 03:00 0 0
5 2018-08-06 04:00 0 0
6 2018-08-06 05:00 0 0
7 2018-08-06 06:00 0 0
8 2018-08-06 07:00 0 0
9 2018-08-06 08:00 0 0
10 2018-08-06 09:00 0 0
11 2018-08-06 10:00 2 37
12 2018-08-06 11:00 2 60
13 2018-08-06 12:00 2 42
14 2018-08-06 13:00 1 56
15 2018-08-06 14:00 2 60
16 2018-08-06 15:00 2 60
17 2018-08-06 16:00 2 60
18 2018-08-06 17:00 1 52
19 2018-08-06 18:00 1 0
20 2018-08-06 19:00 0 0
21 2018-08-06 20:00 0 0
22 2018-08-06 21:00 1 10
23 2018-08-06 22:00 0 0
24 2018-08-06 23:00 0 0
You can use tidyr::complete for this. It lets you additionally fill out other variables to expand the data frame with, if needed.
library(tidyverse)
library(lubridate)
df <- structure(list(hour = structure(c(1533546000, 1533549600, 1533553200, 1533556800, 1533560400, 1533564000, 1533567600, 1533571200, 1533574800), class = c("POSIXct", "POSIXt"), tzone = "Europe/London"), associate_count = c(2L, 2L, 2L, 1L, 2L, 2L, 2L, 1L, 1L), minutes_covered = c(37, 60, 42, 56, 60, 60, 60, 52, 0)), row.names = c(NA, -9L), class = c("tbl_df", "tbl", "data.frame"))
my_complete <- function(df, start_date, end_date){
start_hour <- str_c(start_date, " 00:00:00") %>% ymd_hms
end_hour <- str_c(end_date, " 00:00:00") %>% ymd_hms
df %>%
complete(
hour = seq(from = start_hour, to = end_hour, by = "hour"),
fill = list(associate_count = 0L, minutes_covered = 0)
)
}
my_complete(df, "2018-08-06", "2018-08-07")
#> # A tibble: 25 x 3
#> hour associate_count minutes_covered
#> <dttm> <int> <dbl>
#> 1 2018-08-06 00:00:00 0 0
#> 2 2018-08-06 01:00:00 0 0
#> 3 2018-08-06 02:00:00 0 0
#> 4 2018-08-06 03:00:00 0 0
#> 5 2018-08-06 04:00:00 0 0
#> 6 2018-08-06 05:00:00 0 0
#> 7 2018-08-06 06:00:00 0 0
#> 8 2018-08-06 07:00:00 0 0
#> 9 2018-08-06 08:00:00 0 0
#> 10 2018-08-06 09:00:00 2 37
#> # ... with 15 more rows
Created on 2018-08-13 by the reprex package (v0.2.0).
# create a sequence of hours for your day, in the same time zone as the data
allhours <- data.frame(hour=seq(from= as.POSIXct("2018-08-06 00:00", tz = "Europe/London"),
to = as.POSIXct("2018-08-06 23:00", tz = "Europe/London"),
by = "hours"))
# merge that sequence with your data (all=TRUE is important here)
res <- merge(df, allhours, by="hour", all=TRUE)
# convert NAs to zeros
res[is.na(res$associate_count), "associate_count"] <- 0
res[is.na(res$minutes_covered), "minutes_covered"] <- 0
I'm not 100% sure what the expected output should be. But we may go somewhere starting around this:
new_df <- data.frame(hour=seq(ymd_hms('2018-08-06 00:00:00'),
ymd_hms('2018-08-06 23:00:00'), by = '1 hour'))
Now we can join with the old data frame
new_df %>% left_join(df)
Joining, by = "hour"
hour associate_count minutes_covered
1 2018-08-06 00:00:00 NA NA
2 2018-08-06 01:00:00 NA NA
3 2018-08-06 02:00:00 NA NA
4 2018-08-06 03:00:00 NA NA
5 2018-08-06 04:00:00 NA NA
6 2018-08-06 05:00:00 NA NA
7 2018-08-06 06:00:00 NA NA
8 2018-08-06 07:00:00 NA NA
9 2018-08-06 08:00:00 NA NA
10 2018-08-06 09:00:00 2 37
11 2018-08-06 10:00:00 2 60
12 2018-08-06 11:00:00 2 42
13 2018-08-06 12:00:00 1 56
14 2018-08-06 13:00:00 2 60
15 2018-08-06 14:00:00 2 60
16 2018-08-06 15:00:00 2 60
17 2018-08-06 16:00:00 1 52
18 2018-08-06 17:00:00 1 0
19 2018-08-06 18:00:00 NA NA
20 2018-08-06 19:00:00 NA NA
21 2018-08-06 20:00:00 NA NA
22 2018-08-06 21:00:00 NA NA
23 2018-08-06 22:00:00 NA NA
24 2018-08-06 23:00:00 NA NA
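If you absolutely must get rid of the NAs and need them to be zero, you can add another pipe step like this: %>% mutate_at(c(2:3), funs(replace(., is.na(.), 0))). Note that funs() is deprecated in recent dplyr releases, where a lambda such as ~ replace(., is.na(.), 0) can be passed instead.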
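In the join output above the values sit one hour earlier than in the question (the 10:00 row appears at 09:00). One plausible cause is that ymd_hms builds the grid in UTC by default, while the question's data carries the tzone attribute "Europe/London". The sketch below is a base R illustration of building the hourly grid in the data's own time zone and then zero-filling; the two-row obs data frame is made up, and tz, grid and value_cols are names introduced here.

```r
tz <- "Europe/London"  # time zone taken from the tzone attribute in the question's dput

# Made-up observations for two hourly slots
obs <- data.frame(
  hour = as.POSIXct(c("2018-08-06 10:00:00", "2018-08-06 21:00:00"), tz = tz),
  associate_count = c(2L, 1L),
  minutes_covered = c(37, 10)
)

# Full 24-hour grid for the day, built in the same time zone as the observations
grid <- data.frame(hour = seq(as.POSIXct("2018-08-06 00:00:00", tz = tz),
                              as.POSIXct("2018-08-06 23:00:00", tz = tz),
                              by = "hour"))

# Keep every grid hour; hours without observations get NA
res <- merge(grid, obs, by = "hour", all.x = TRUE)

# Replace NA with 0 in the value columns only, leaving the hour column untouched
value_cols <- c("associate_count", "minutes_covered")
res[value_cols] <- lapply(res[value_cols], function(col) replace(col, is.na(col), 0))
res
```

Restricting the replacement to named value columns avoids touching the date-time column by accident.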

Flag first instance of an event occurring contingent on other variable's value

New to R and to solving such a problem as the one below, so not sure about how certain functionality is achieved in particular instances.
I have a dataframe as such:
df <- data.frame(DATETIME = seq(from = as.POSIXct('2014-01-01 00:00', tz = "GMT"), to = as.POSIXct('2014-01-01 06:00', tz = "GMT"), by='15 mins'),
Price = c(23,22,23,24,27,31,33,34,31,26,24,23,19,18,19,19,23,25,26,26,27,30,26,25,24),
TroughPriceFlag = c(0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0))
df <- data.table(df)
df
DATETIME Price TroughPriceFlag
1: 2014-01-01 00:00:00 23 0
2: 2014-01-01 00:15:00 22 1
3: 2014-01-01 00:30:00 23 0
4: 2014-01-01 00:45:00 24 0
5: 2014-01-01 01:00:00 27 0
6: 2014-01-01 01:15:00 31 0
7: 2014-01-01 01:30:00 33 0
8: 2014-01-01 01:45:00 34 0
9: 2014-01-01 02:00:00 31 0
10: 2014-01-01 02:15:00 26 0
11: 2014-01-01 02:30:00 24 0
12: 2014-01-01 02:45:00 23 0
13: 2014-01-01 03:00:00 19 0
14: 2014-01-01 03:15:00 18 1
15: 2014-01-01 03:30:00 19 0
16: 2014-01-01 03:45:00 19 0
17: 2014-01-01 04:00:00 23 0
18: 2014-01-01 04:15:00 25 0
19: 2014-01-01 04:30:00 26 0
20: 2014-01-01 04:45:00 26 0
21: 2014-01-01 05:00:00 27 0
22: 2014-01-01 05:15:00 30 0
23: 2014-01-01 05:30:00 26 0
24: 2014-01-01 05:45:00 25 0
25: 2014-01-01 06:00:00 24 0
What I wish to do is two things:
(1) From where we observe a TroughPrice, flag the first instance where the price has risen by 10 or more dollars. That is, find the first instance where deltaPrice >= 10 since the trough price.
As an example: from the trough price of 22 (row 2), in the next interval price is increased to 23 which is a change of 1 dollar, so no flag. From the trough price of 22 (again row 2, since always with reference to the trough price in question), two intervals later the price is 24 dollars, so the price has increased by 2 dollars since the trough, so again no flag. However, from the trough price of 22, 5 intervals later the price has increased to 33 dollars, which is an increase of 11 dollars and is the first time the price has increased above 10 dollars. Thus the flag is 1.
(2) Determine the number of 15 minute periods which have passed between the trough price and the first instance the price has risen by 10 or more dollars.
The resulting dataframe should look like this:
DATETIME Price TroughPriceFlag FirstOver10CentsFlag CountPeriods
1 2014-01-01 00:00:00 23 0 0 NA
2 2014-01-01 00:15:00 22 1 0 5
3 2014-01-01 00:30:00 23 0 0 NA
4 2014-01-01 00:45:00 24 0 0 NA
5 2014-01-01 01:00:00 27 0 0 NA
6 2014-01-01 01:15:00 31 0 0 NA
7 2014-01-01 01:30:00 33 0 1 NA
8 2014-01-01 01:45:00 34 0 0 NA
9 2014-01-01 02:00:00 31 0 0 NA
10 2014-01-01 02:15:00 26 0 0 NA
11 2014-01-01 02:30:00 24 0 0 NA
12 2014-01-01 02:45:00 23 0 0 NA
13 2014-01-01 03:00:00 19 0 0 NA
14 2014-01-01 03:15:00 18 1 0 8
15 2014-01-01 03:30:00 19 0 0 NA
16 2014-01-01 03:45:00 19 0 0 NA
17 2014-01-01 04:00:00 23 0 0 NA
18 2014-01-01 04:15:00 25 0 0 NA
19 2014-01-01 04:30:00 26 0 0 NA
20 2014-01-01 04:45:00 26 0 0 NA
21 2014-01-01 05:00:00 27 0 0 NA
22 2014-01-01 05:15:00 30 0 1 NA
23 2014-01-01 05:30:00 26 0 0 NA
24 2014-01-01 05:45:00 25 0 0 NA
25 2014-01-01 06:00:00 24 0 0 NA
I'm not really sure where to start, since the time gaps can be quite large and I've only used indexing in the context of a few steps forward/backward. Please help!
Thanks in advance
You can chain operations with the data.table package. The idea is to group by the cumulative sum of TroughPriceFlag, so that each trough starts a new group:
library(data.table)
df[, col1:=pmatch(Price-Price[1]>=10,TRUE, nomatch=0), cumsum(TroughPriceFlag)][
, count:=which(col1==1)-1,cumsum(TroughPriceFlag)][
TroughPriceFlag==0, count:=NA]
#> df
# DATETIME Price TroughPriceFlag col1 count
# 1: 2014-01-01 00:00:00 23 0 0 NA
# 2: 2014-01-01 00:15:00 22 1 0 5
# 3: 2014-01-01 00:30:00 23 0 0 NA
# 4: 2014-01-01 00:45:00 24 0 0 NA
# 5: 2014-01-01 01:00:00 27 0 0 NA
# 6: 2014-01-01 01:15:00 31 0 0 NA
# 7: 2014-01-01 01:30:00 33 0 1 NA
# 8: 2014-01-01 01:45:00 34 0 0 NA
# 9: 2014-01-01 02:00:00 31 0 0 NA
#10: 2014-01-01 02:15:00 26 0 0 NA
#11: 2014-01-01 02:30:00 24 0 0 NA
#12: 2014-01-01 02:45:00 23 0 0 NA
#13: 2014-01-01 03:00:00 19 0 0 NA
#14: 2014-01-01 03:15:00 18 1 0 8
#15: 2014-01-01 03:30:00 19 0 0 NA
#16: 2014-01-01 03:45:00 19 0 0 NA
#17: 2014-01-01 04:00:00 23 0 0 NA
#18: 2014-01-01 04:15:00 25 0 0 NA
#19: 2014-01-01 04:30:00 26 0 0 NA
#20: 2014-01-01 04:45:00 26 0 0 NA
#21: 2014-01-01 05:00:00 27 0 0 NA
#22: 2014-01-01 05:15:00 30 0 1 NA
#23: 2014-01-01 05:30:00 26 0 0 NA
#24: 2014-01-01 05:45:00 25 0 0 NA
#25: 2014-01-01 06:00:00 24 0 0 NA
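For readers who find the chained data.table expression dense, the same logic can be sketched as an explicit loop in base R. This is an illustrative version only, using the Price and TroughPriceFlag values from the question; the names segment, first_over and count_periods are introduced here and are not part of the original answer.

```r
price  <- c(23,22,23,24,27,31,33,34,31,26,24,23,19,18,19,19,23,25,26,26,27,30,26,25,24)
trough <- c(0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0)

# Each trough starts a new segment; rows before the first trough are segment 0
segment <- cumsum(trough)

first_over    <- integer(length(price))            # flag column, 0 by default
count_periods <- rep(NA_integer_, length(price))   # filled only on trough rows

for (s in setdiff(unique(segment), 0)) {
  idx <- which(segment == s)                        # rows of this segment, trough first
  hit <- which(price[idx] - price[idx[1]] >= 10)    # positions with a rise of 10 or more
  if (length(hit) > 0) {
    first_over[idx[hit[1]]] <- 1L                   # flag the first such row
    count_periods[idx[1]]   <- hit[1] - 1L          # intervals elapsed since the trough
  }
}

data.frame(price, trough, first_over, count_periods)
```

The length check means a trough that is never followed by a rise of 10 or more simply keeps flag 0 and count NA.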

Conditional (inequality) join in data.table

I'm just trying to figure out how to do a conditional join on two data.tables.
I've written a sqldf conditional join to give me the circuits whose start or finish times are within the other's start/finish times.
sqldf("select dt2.start, dt2.finish, dt2.counts, dt1.id, dt1.circuit
from dt2
left join dt1 on (
(dt2.start >= dt1.start and dt2.start < dt1.finish) or
(dt2.finish >= dt1.start and dt2.finish < dt1.finish)
)")
This gives me the correct result, but it's too slow for my large-ish data set.
What's the data.table way to do this without a vector scan?
Here's my data:
dt1 <- data.table(structure(list(circuit = structure(c(2L, 1L, 2L, 1L, 2L, 3L,
1L, 1L, 2L), .Label = c("a", "b", "c"), class = "factor"), start = structure(c(1393621200,
1393627920, 1393628400, 1393631520, 1393650300, 1393646400, 1393656000,
1393668000, 1393666200), class = c("POSIXct", "POSIXt"), tzone = ""),
end = structure(c(1393626600, 1393631519, 1393639200, 1393632000,
1393660500, 1393673400, 1393667999, 1393671600, 1393677000
), class = c("POSIXct", "POSIXt"), tzone = ""), id = structure(1:9, .Label = c("1001",
"1002", "1003", "1004", "1005", "1006", "1007", "1008", "1009"
), class = "factor")), .Names = c("circuit", "start", "end",
"id"), class = "data.frame", row.names = c(NA, -9L)))
dt2 <- data.table(structure(list(start = structure(c(1393621200, 1393624800, 1393626600,
1393627919, 1393628399, 1393632000, 1393639200, 1393646399, 1393650299,
1393655999, 1393660500, 1393666199, 1393671600, 1393673400), class = c("POSIXct",
"POSIXt"), tzone = ""), end = structure(c(1393624799, 1393626600,
1393627919, 1393628399, 1393632000, 1393639200, 1393646399, 1393650299,
1393655999, 1393660500, 1393666199, 1393671600, 1393673400, 1393677000
), class = c("POSIXct", "POSIXt"), tzone = ""), seconds = c(3599L,
1800L, 1319L, 480L, 3601L, 7200L, 7199L, 3900L, 5700L, 4501L,
5699L, 5401L, 1800L, 3600L), counts = c(1L, 1L, 0L, 1L, 2L, 1L,
0L, 1L, 2L, 3L, 2L, 3L, 2L, 1L)), .Names = c("start", "end",
"seconds", "counts"), row.names = c(1L, 3L, 4L, 5L, 6L, 7L, 8L,
9L, 10L, 11L, 12L, 13L, 14L, 15L), class = "data.frame"))
Using non-equi joins:
ans = dt1[dt2, on=.(start <= end, end > start),
.(i.start, i.end, counts, id, circuit, cndn = i.start < x.start & i.end >= x.end),
allow.cartesian=TRUE
][!cndn %in% TRUE]
The condition start <= end, end >= start (note the >= in both cases) checks whether two intervals overlap at all. The open interval on one side is obtained with the end > start part (> instead of >=). But it still also picks up intervals of the following types:
dt1: start=================end
dt2: start--------------------------------end ## start < start, end > end
and
dt1: start=================end
dt2: start----------end ## end == end
The cndn column checks for and removes these cases. Hopefully there aren't many of them, so that we don't materialise unwanted rows unnecessarily.
PS: the solution in this case is still not as straightforward as I'd like, because it requires an OR operation. It is possible to do two conditional joins and then bind the results together, though.
Perhaps at some point we'll have to think about the feasibility of extending joins to these kinds of operations in a more straightforward manner.
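When checking a join like the one above, it can help to have a slow but obviously correct reference to compare against on a small sample. The sketch below spells out the OR condition from the question's SQL (start or end of one interval falls inside the half-open interval of the other) in base R using outer. It is quadratic in the number of rows, so it is meant only for validation, not as a replacement for the join; the data frames a and b and all values in them are made up, and unmatched rows are not kept as they would be in a left join.

```r
# Made-up intervals: a plays the role of dt1, b the role of dt2
a <- data.frame(id = c("A", "B"), start = c(0, 10), end = c(5, 20))
b <- data.frame(start = c(4, 6, 12), end = c(6, 8, 30))

# Rows index b, columns index a
start_in <- outer(b$start, a$start, ">=") & outer(b$start, a$end, "<")
end_in   <- outer(b$end,   a$start, ">=") & outer(b$end,   a$end, "<")

# Pairs (row of b, row of a) where either endpoint of b falls inside a's interval
hits <- which(start_in | end_in, arr.ind = TRUE)

matches <- data.frame(b[hits[, "row"], ], id = a$id[hits[, "col"]])
matches
```

Running the real join on the same small sample and comparing the matched pairs is a quick way to catch a wrong inequality direction.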
No idea if this performs faster, but here's a shot at a data table method. I reshape dt1 and use findInterval to identify where the times in dt2 line up with times in dt1.
# > dt1
# circuit start end id
# 1: b 2014-02-28 16:00:00 2014-02-28 17:30:00 1001
# 2: a 2014-02-28 17:52:00 2014-02-28 18:51:59 1002
# 3: b 2014-02-28 18:00:00 2014-02-28 21:00:00 1003
# 4: a 2014-02-28 18:52:00 2014-02-28 19:00:00 1004
# 5: b 2014-03-01 00:05:00 2014-03-01 02:55:00 1005
# 6: c 2014-02-28 23:00:00 2014-03-01 06:30:00 1006
# 7: a 2014-03-01 01:40:00 2014-03-01 04:59:59 1007
# 8: a 2014-03-01 05:00:00 2014-03-01 06:00:00 1008
# 9: b 2014-03-01 04:30:00 2014-03-01 07:30:00 1009
# > dt2
# start end seconds counts
# 1: 2014-02-28 16:00:00 2014-02-28 16:59:59 3599 1
# 2: 2014-02-28 17:00:00 2014-02-28 17:30:00 1800 1
# 3: 2014-02-28 17:30:00 2014-02-28 17:51:59 1319 0
# 4: 2014-02-28 17:51:59 2014-02-28 17:59:59 480 1
# 5: 2014-02-28 17:59:59 2014-02-28 19:00:00 3601 2
# 6: 2014-02-28 19:00:00 2014-02-28 21:00:00 7200 1
# 7: 2014-02-28 21:00:00 2014-02-28 22:59:59 7199 0
# 8: 2014-02-28 22:59:59 2014-03-01 00:04:59 3900 1
# 9: 2014-03-01 00:04:59 2014-03-01 01:39:59 5700 2
# 10: 2014-03-01 01:39:59 2014-03-01 02:55:00 4501 3
# 11: 2014-03-01 02:55:00 2014-03-01 04:29:59 5699 2
# 12: 2014-03-01 04:29:59 2014-03-01 06:00:00 5401 3
# 13: 2014-03-01 06:00:00 2014-03-01 06:30:00 1800 2
# 14: 2014-03-01 06:30:00 2014-03-01 07:30:00 3600 1
## reshapes dt1 from wide to long
## puts start and end times into one column and sorts by time
## this is so that you can use findInterval later
dt3 <- dt1[,list(time = c(start,end)), by = "circuit,id"][order(time)]
dt3[,ntvl := seq_len(nrow(dt3))]
# circuit id time ntvl
# 1: b 1001 2014-02-28 16:00:00 1
# 2: b 1001 2014-02-28 17:30:00 2
# 3: a 1002 2014-02-28 17:52:00 3
# 4: b 1003 2014-02-28 18:00:00 4
# 5: a 1002 2014-02-28 18:51:59 5
# 6: a 1004 2014-02-28 18:52:00 6
# 7: a 1004 2014-02-28 19:00:00 7
# 8: b 1003 2014-02-28 21:00:00 8
# 9: c 1006 2014-02-28 23:00:00 9
# 10: b 1005 2014-03-01 00:05:00 10
# 11: a 1007 2014-03-01 01:40:00 11
# 12: b 1005 2014-03-01 02:55:00 12
# 13: b 1009 2014-03-01 04:30:00 13
# 14: a 1007 2014-03-01 04:59:59 14
# 15: a 1008 2014-03-01 05:00:00 15
# 16: a 1008 2014-03-01 06:00:00 16
# 17: c 1006 2014-03-01 06:30:00 17
# 18: b 1009 2014-03-01 07:30:00 18
## map interval to id
dt4 <- dt3[,list(ntvl = seq(from = min(ntvl), to = max(ntvl)-1, by = 1)),by = "circuit,id"]
setkey(dt4, ntvl)
# circuit id ntvl
# 1: b 1001 1
# 2: a 1002 3
# 3: a 1002 4
# 4: b 1003 4
# 5: b 1003 5
# 6: b 1003 6
# 7: a 1004 6
# 8: b 1003 7
# 9: c 1006 9
# 10: c 1006 10
# 11: b 1005 10
# 12: c 1006 11
# 13: b 1005 11
# 14: a 1007 11
# 15: c 1006 12
# 16: a 1007 12
# 17: c 1006 13
# 18: a 1007 13
# 19: b 1009 13
# 20: c 1006 14
# 21: b 1009 14
# 22: c 1006 15
# 23: b 1009 15
# 24: a 1008 15
# 25: c 1006 16
# 26: b 1009 16
# 27: b 1009 17
# circuit id ntvl
## finds intervals in dt2
dt2[,`:=`(ntvl_start = findInterval(start, dt3[["time"]], rightmost.closed = FALSE),
ntvl_end = findInterval(end, dt3[["time"]], rightmost.closed = FALSE))]
# start end seconds counts ntvl_start ntvl_end
# 1: 2014-02-28 16:00:00 2014-02-28 16:59:59 3599 1 1 1
# 2: 2014-02-28 17:00:00 2014-02-28 17:30:00 1800 1 1 2
# 3: 2014-02-28 17:30:00 2014-02-28 17:51:59 1319 0 2 2
# 4: 2014-02-28 17:51:59 2014-02-28 17:59:59 480 1 2 3
# 5: 2014-02-28 17:59:59 2014-02-28 19:00:00 3601 2 3 7
# 6: 2014-02-28 19:00:00 2014-02-28 21:00:00 7200 1 7 8
# 7: 2014-02-28 21:00:00 2014-02-28 22:59:59 7199 0 8 8
# 8: 2014-02-28 22:59:59 2014-03-01 00:04:59 3900 1 8 9
# 9: 2014-03-01 00:04:59 2014-03-01 01:39:59 5700 2 9 10
# 10: 2014-03-01 01:39:59 2014-03-01 02:55:00 4501 3 10 12
# 11: 2014-03-01 02:55:00 2014-03-01 04:29:59 5699 2 12 12
# 12: 2014-03-01 04:29:59 2014-03-01 06:00:00 5401 3 12 16
# 13: 2014-03-01 06:00:00 2014-03-01 06:30:00 1800 2 16 17
# 14: 2014-03-01 06:30:00 2014-03-01 07:30:00 3600 1 17 18
## joins, by start time, then by end time
## the commented out lines may be a better alternative
## if there are many NA values
setkey(dt2, ntvl_start)
dt_ans_start <- dt4[dt2, list(start,end,counts,id,circuit),nomatch = NA]
# dt_ans_start <- dt4[dt2, list(start,end,counts,id,circuit),nomatch = 0]
# dt_ans_start_na <- dt2[!dt4]
setkey(dt2, ntvl_end)
dt_ans_end <- dt4[dt2, list(start,end,counts,id,circuit),nomatch = NA]
# dt_ans_end <- dt4[dt2, list(start,end,counts,id,circuit),nomatch = 0]
# dt_ans_end_na <- dt2[!dt4]
## bring them all together and remove duplicates
dt_ans <- unique(rbind(dt_ans_start, dt_ans_end), by = c("start", "id"))
dt_ans <- dt_ans[!(is.na(id) & counts > 0)]
dt_ans[,ntvl := NULL]
setkey(dt_ans,start)
# start end counts id circuit
# 1: 2014-02-28 16:00:00 2014-02-28 16:59:59 1 1001 b
# 2: 2014-02-28 17:00:00 2014-02-28 17:30:00 1 1001 b
# 3: 2014-02-28 17:30:00 2014-02-28 17:51:59 0 NA NA
# 4: 2014-02-28 17:51:59 2014-02-28 17:59:59 1 1002 a
# 5: 2014-02-28 17:59:59 2014-02-28 19:00:00 2 1002 a
# 6: 2014-02-28 17:59:59 2014-02-28 19:00:00 2 1003 b
# 7: 2014-02-28 19:00:00 2014-02-28 21:00:00 1 1003 b
# 8: 2014-02-28 21:00:00 2014-02-28 22:59:59 0 NA NA
# 9: 2014-02-28 22:59:59 2014-03-01 00:04:59 1 1006 c
# 10: 2014-03-01 00:04:59 2014-03-01 01:39:59 2 1006 c
# 11: 2014-03-01 00:04:59 2014-03-01 01:39:59 2 1005 b
# 12: 2014-03-01 01:39:59 2014-03-01 02:55:00 3 1006 c
# 13: 2014-03-01 01:39:59 2014-03-01 02:55:00 3 1005 b
# 14: 2014-03-01 01:39:59 2014-03-01 02:55:00 3 1007 a
# 15: 2014-03-01 02:55:00 2014-03-01 04:29:59 2 1006 c
# 16: 2014-03-01 02:55:00 2014-03-01 04:29:59 2 1007 a
# 17: 2014-03-01 04:29:59 2014-03-01 06:00:00 3 1006 c
# 18: 2014-03-01 04:29:59 2014-03-01 06:00:00 3 1007 a
# 19: 2014-03-01 04:29:59 2014-03-01 06:00:00 3 1009 b
# 20: 2014-03-01 06:00:00 2014-03-01 06:30:00 2 1006 c
# 21: 2014-03-01 06:00:00 2014-03-01 06:30:00 2 1009 b
# 22: 2014-03-01 06:30:00 2014-03-01 07:30:00 1 1009 b
# start end counts id circuit
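The approach above rests on findInterval, which for each value returns the index of the last breakpoint that is less than or equal to it (0 if the value lies before the first breakpoint). A tiny base R illustration with made-up numbers, where breaks and x are names introduced here:

```r
breaks <- c(10, 20, 30)        # sorted breakpoints, like the sorted times in dt3
x      <- c(5, 10, 19, 20, 35) # values to locate, like the start/end times in dt2

pos <- findInterval(x, breaks)
pos
```

A value exactly equal to a breakpoint is assigned to the interval that starts at that breakpoint, which matters for the boundary cases in the tables above.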

calculating differences in times, data grouped by rows

I have a data set in the following format
ID DATETIME VALUE
1 4/2/2012 10:00 300
1 5/2/2012 23:00 150
1 6/3/2012 10:00 650
2 1/2/2012 10:00 450
2 2/2/2012 13:00 240
3 6/5/2012 09:00 340
3 7/5/2012 23:00 240
I would like to first calculate the time difference from the first instance per ID to each subsequent time:
ID DATETIME VALUE DIFTIME(days)
1 4/2/2012 10:00 300 0
1 5/2/2012 23:00 150 1.3
1 6/3/2012 10:00 650 33
2 1/2/2012 10:00 450 0
2 2/2/2012 13:00 240 1
3 6/5/2012 09:00 340 0
3 7/5/2012 23:00 240 1
And then I'd like to make this a wide format
ID 0 1 1.3 33
1 300 na 150 na 650
2 450 240 na na
3 340 240 na na
Here is a solution using the data.table and reshape2 packages:
library(data.table)
DT <- as.data.table(dat)
DT[, `:=`(DIFTIME, c(0, diff(as.Date(DATETIME)))), by = "ID"]
## ID VALUE DATETIME DIFTIME
## 1: 1 300 2012-02-04 10:00:00 0
## 2: 1 150 2012-02-05 23:00:00 1
## 3: 1 650 2012-03-06 10:00:00 30
## 4: 2 450 2012-02-01 10:00:00 0
## 5: 2 240 2012-02-02 13:00:00 1
## 6: 3 340 2012-05-06 09:00:00 0
## 7: 3 240 2012-05-07 23:00:00 1
library(reshape2)
dcast(formula = ID ~ DIFTIME, data = DT[, list(ID, DIFTIME, VALUE)])
## ID 0 1 30
## 1 1 300 150 650
## 2 2 450 240 NA
## 3 3 340 240 NA
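Note that the code above takes differences between consecutive rows and works on whole dates. If fractional days measured from the first observation of each ID are wanted, as the question describes, one possible base R sketch is shown below. The sample is a made-up subset in the same layout, the time zone "UTC" is an assumption, and secs is a helper name introduced here.

```r
# Made-up subset in the question's layout
dat <- data.frame(
  ID = c(1L, 1L, 2L, 2L),
  DATETIME = as.POSIXct(c("2012-02-04 10:00", "2012-02-05 23:00",
                          "2012-02-01 10:00", "2012-02-02 13:00"), tz = "UTC"),
  VALUE = c(300L, 150L, 450L, 240L)
)

# Seconds since the epoch, so that the arithmetic is plain numeric
secs <- as.numeric(dat$DATETIME)

# Within each ID, subtract the first timestamp and convert seconds to days
dat$DIFTIME <- ave(secs, dat$ID, FUN = function(s) (s - s[1]) / 86400)
dat
```

Rounding DIFTIME before reshaping to wide format (for example to one decimal) keeps the number of resulting columns manageable.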
Data in a handy format. Here is my dat:
dat <- structure(list(ID = c(1L, 1L, 1L, 2L, 2L, 3L, 3L), DATETIME = structure(c(1328346000,
1328479200, 1331024400, 1328086800, 1328184000, 1336287600, 1336424400
), class = c("POSIXct", "POSIXt"), tzone = ""), VALUE = c(300L,
150L, 650L, 450L, 240L, 340L, 240L)), .Names = c("ID", "DATETIME",
"VALUE"), class = "data.frame", row.names = c(NA, 7L))

Resources