Remove duplicates if an observation appears consecutively, order matters

I have a dataframe grouped by bikeid and sorted by time. If type repeats consecutively, I want to keep only the row with the earliest time. In the case below, I want to remove rows 17, 19, 33, 39 and 41.
The answer to "subtract value from previous row by group" will get me what I need once I have removed the duplicates.
bikeid type time
1 1004 repair_time 2019-04-04 14:07:00
3 1004 red_time 2019-04-19 00:54:56
8 1004 repair_time 2019-04-19 12:47:00
10 1004 red_time 2019-04-19 16:45:18
15 1004 repair_time 2019-04-20 04:42:00
17 1004 repair_time 2019-04-20 05:29:00
19 1004 repair_time 2019-04-28 07:33:00
27 1010 repair_time 2019-04-20 10:05:00
29 1010 red_time 2019-04-22 20:51:21
33 1010 red_time 2019-04-23 11:02:34
37 1010 repair_time 2019-04-24 17:20:00
39 1010 repair_time 2019-04-24 18:30:00
41 1010 repair_time 2019-04-24 18:42:00
The final result should look like this:
bikeid type time
1 1004 repair_time 2019-04-04 14:07:00
3 1004 red_time 2019-04-19 00:54:56
8 1004 repair_time 2019-04-19 12:47:00
10 1004 red_time 2019-04-19 16:45:18
15 1004 repair_time 2019-04-20 04:42:00
27 1010 repair_time 2019-04-20 10:05:00
29 1010 red_time 2019-04-22 20:51:21
37 1010 repair_time 2019-04-24 17:20:00

An option is to use rleid (from data.table) to create a grouping variable along with the second column and slice the first observation. Here, the time column is already arranged, so we don't have to do any ordering
library(dplyr)
library(data.table)
df1 %>%
group_by(V2, grp = rleid(V3)) %>%
slice(1) %>%
ungroup %>%
select(-grp)
# A tibble: 8 x 4
# V1 V2 V3 V4
# <int> <int> <chr> <chr>
#1 1 1004 repair_time 2019-04-04 14:07:00
#2 3 1004 red_time 2019-04-19 00:54:56
#3 8 1004 repair_time 2019-04-19 12:47:00
#4 10 1004 red_time 2019-04-19 16:45:18
#5 15 1004 repair_time 2019-04-20 04:42:00
#6 27 1010 repair_time 2019-04-20 10:05:00
#7 29 1010 red_time 2019-04-22 20:51:21
#8 37 1010 repair_time 2019-04-24 17:20:00
Or use the data.table method: convert the 'data.frame' to a 'data.table' (setDT(df1)), group by 'V2' and the rleid of 'V3', get the row index (.I) of the first observation in each group, extract it ($V1), and use it to subset the rows of the dataset
library(data.table)
setDT(df1)[df1[, .I[1], .(V2, rleid(V3))]$V1]
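To see what the run-length grouping does without loading data.table, here is a minimal base R sketch. The helper run_id and the toy data frame are illustrative names, not part of the answer above; run_id is assumed to mimic rleid by numbering each run of identical consecutive values.

```r
# Toy data mirroring the question's columns (values are illustrative)
toy <- data.frame(
  bikeid = c(1004, 1004, 1004, 1010, 1010),
  type   = c("repair_time", "repair_time", "red_time", "red_time", "red_time")
)

# Hypothetical helper: base R stand-in for data.table::rleid(),
# giving one id per run of equal consecutive values
run_id <- function(x) {
  with(rle(as.character(x)), rep(seq_along(lengths), lengths))
}

# Combine bikeid and type so that a change in either one starts a new run
toy$grp <- run_id(paste(toy$bikeid, toy$type))

# Keep the first row of every run (the data is assumed sorted by time)
result <- toy[!duplicated(toy$grp), c("bikeid", "type")]
result
```

Because the rows are already in time order within each bikeid, the first row of every run is also the earliest one.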
data
df1 <- structure(list(V1 = c(1L, 3L, 8L, 10L, 15L, 17L, 19L, 27L, 29L,
33L, 37L, 39L, 41L), V2 = c(1004L, 1004L, 1004L, 1004L, 1004L,
1004L, 1004L, 1010L, 1010L, 1010L, 1010L, 1010L, 1010L), V3 = c("repair_time",
"red_time", "repair_time", "red_time", "repair_time", "repair_time",
"repair_time", "repair_time", "red_time", "red_time", "repair_time",
"repair_time", "repair_time"), V4 = c("2019-04-04 14:07:00",
"2019-04-19 00:54:56", "2019-04-19 12:47:00", "2019-04-19 16:45:18",
"2019-04-20 04:42:00", "2019-04-20 05:29:00", "2019-04-28 07:33:00",
"2019-04-20 10:05:00", "2019-04-22 20:51:21", "2019-04-23 11:02:34",
"2019-04-24 17:20:00", "2019-04-24 18:30:00", "2019-04-24 18:42:00"
)), class = "data.frame", row.names = c(NA, -13L))

Another option is to use lag to check whether the status is the same as in the previous row. As akrun notes, this works because the data is already sorted by time:
library(dplyr)
df %>%
group_by(bikeid) %>%
mutate(repeated = status == lag(status)) %>%
# Need the is.na() check as first element of each group is NA
# due to the lag
filter(! repeated | is.na(repeated))
Data setup code:
txt = "1 1004 repair_time 2019-04-04 14:07:00
3 1004 red_time 2019-04-19 00:54:56
8 1004 repair_time 2019-04-19 12:47:00
10 1004 red_time 2019-04-19 16:45:18
15 1004 repair_time 2019-04-20 04:42:00
17 1004 repair_time 2019-04-20 05:29:00
19 1004 repair_time 2019-04-28 07:33:00
27 1010 repair_time 2019-04-20 10:05:00
29 1010 red_time 2019-04-22 20:51:21
33 1010 red_time 2019-04-23 11:02:34
37 1010 repair_time 2019-04-24 17:20:00
39 1010 repair_time 2019-04-24 18:30:00
41 1010 repair_time 2019-04-24 18:42:00"
df = read.table(text = txt, header = FALSE)
colnames(df) = c("row", "bikeid", "status", "date", "time")
df$date = as.POSIXct(paste(df$date, df$time))

Related

move values that have 2 values for the same date into new column [duplicate]

I asked a similar question previously but can't seem to adapt the code to get the desired outcome.
My data frame is df:
date tss
2020-05-29 71
2020-05-29 60
2020-05-30 42
2020-05-31 NA
2020-06-01 95
2020-06-01 82
2020-06-02 69
2020-06-03 103
2020-06-04 49
2020-06-05 74
2020-06-05 49
2020-06-06 NA
2020-06-07 NA
2020-06-08 NA
2020-06-09 50
2020-06-10 191
2020-06-11 125
2020-06-11 126
2020-06-12 104
2020-06-12 77
I would like to move the tss scores that occur twice on the same day into a new column, so that there is only one row for each date (date is stored as a Date).
For example:
date tss tss2
2020-05-29 71 60
2020-05-30 42 0
2020-05-31 NA
2020-06-01 95 82
There will only ever be 2 tss entries for the same date. I tried using group_by and pivot_wider, but without success.
Thank you.

Try this:
library(tidyverse)
#Data
df <- structure(list(date = c("2020-05-29", "2020-05-29", "2020-05-30",
"2020-05-31", "2020-06-01", "2020-06-01", "2020-06-02", "2020-06-03",
"2020-06-04", "2020-06-05", "2020-06-05", "2020-06-06", "2020-06-07",
"2020-06-08", "2020-06-09", "2020-06-10", "2020-06-11", "2020-06-11",
"2020-06-12", "2020-06-12"), tss = c(71L, 60L, 42L, NA, 95L,
82L, 69L, 103L, 49L, 74L, 49L, NA, NA, NA, 50L, 191L, 125L, 126L,
104L, 77L)), class = "data.frame", row.names = c(NA, -20L))
#Code
df %>% group_by(date) %>% mutate(i=row_number(date)) %>%
pivot_wider(names_from = i,values_from = tss)
# A tibble: 15 x 3
# Groups: date [15]
date `1` `2`
<chr> <int> <int>
1 2020-05-29 71 60
2 2020-05-30 42 NA
3 2020-05-31 NA NA
4 2020-06-01 95 82
5 2020-06-02 69 NA
6 2020-06-03 103 NA
7 2020-06-04 49 NA
8 2020-06-05 74 49
9 2020-06-06 NA NA
10 2020-06-07 NA NA
11 2020-06-08 NA NA
12 2020-06-09 50 NA
13 2020-06-10 191 NA
14 2020-06-11 125 126
15 2020-06-12 104 77
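If the column names tss and tss2 from the question are wanted, a base R sketch along the same lines could number the rows within each date with ave and spread them with reshape. This assumes at most two entries per date, as the question states; the data frame toy is a four-row excerpt of the question's data used only for illustration.

```r
toy <- data.frame(
  date = c("2020-05-29", "2020-05-29", "2020-05-30", "2020-05-31"),
  tss  = c(71L, 60L, 42L, NA)
)

# Number the entries within each date: 1 for the first, 2 for the second
toy$i <- ave(seq_along(toy$date), toy$date, FUN = seq_along)

# Spread tss into one column per entry number
# (reshape names the new columns tss.1 and tss.2)
wide <- reshape(toy, idvar = "date", timevar = "i", direction = "wide")

# Rename by position to match the names asked for in the question
names(wide) <- c("date", "tss", "tss2")
wide
```

Dates with a single entry get NA in tss2 rather than the 0 shown in the question's example.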

Date difference between end date to start date

I have data that looks like below.
id from_date to_date
1 2015-03-09 2015-03-14
2 2015-02-22 2015-02-24
2 2015-05-06 2015-05-17
3 2015-02-12 2015-02-16
4 2015-03-10 2015-03-16
4 2015-03-22 2015-04-07
4 2015-06-07 2015-07-07
4 2015-07-06 2015-07-07
4 2015-08-02 2015-08-07
I want to create a separate variable which is the difference between the to date and the next from date, grouped by id.
So the first row of each id will be NA. I tried the method below, based on another answer on Stack Overflow, but could not achieve that.
library(data.table)
chf1 = data.table(id = chf$id,from_date = chf$f.date,to_date = chf$t.date)
setkey(chf1,id)
chf1[,diff:=c(NA,difftime(from_date, to_date, units = "days")),by=id]
The desired output looks like
id from_date to_date difference
1 2015-03-09 2015-03-14 NA
2 2015-02-22 2015-02-24 NA
2 2015-05-06 2015-05-17 71
3 2015-02-12 2015-02-16 NA
4 2015-03-10 2015-03-16 NA
4 2015-03-22 2015-04-07 6
4 2015-06-07 2015-06-10 64
4 2015-07-06 2015-07-07 26
4 2015-08-02 2015-08-07 26

There are three issues in the code:
1) chf1$from_date, chf1$to_date gets the whole column, so there is no effect of grouping by 'id'
2) difftime gives output with the same length as the initial column length, so prepending NA with c() makes the result one element longer than the group.
3) As difftime takes the difference between each element of 'from_date' and the corresponding element of 'to_date', there is no need for by = id
Therefore, the code can be
chf1[, diff1:=difftime(from_date, to_date, units = "days")]
chf1
# id from_date to_date diff1
#1: 1 2015-03-09 2015-03-14 -5 days
#2: 2 2015-02-22 2015-02-24 -2 days
#3: 2 2015-05-06 2015-05-17 -11 days
#4: 3 2015-02-12 2015-02-16 -4 days
#5: 4 2015-03-10 2015-03-16 -6 days
#6: 4 2015-03-22 2015-04-07 -16 days
#7: 4 2015-06-07 2015-07-07 -30 days
#8: 4 2015-07-06 2015-07-07 -1 days
#9: 4 2015-08-02 2015-08-07 -5 days
Based on the description in the OP's post, if we need the difference between the next value of 'from_date' and the current 'to_date' after grouping by 'id', use difftime on the shifted 'from_date' and 'to_date', and assign (:=) it to 'diff1'.
chf1[, diff1 := difftime(shift(from_date, type = "lead"), to_date,
units = "days") , by = id]
chf1
# id from_date to_date diff1
#1: 1 2015-03-09 2015-03-14 NA days
#2: 2 2015-02-22 2015-02-24 71 days
#3: 2 2015-05-06 2015-05-17 NA days
#4: 3 2015-02-12 2015-02-16 NA days
#5: 4 2015-03-10 2015-03-16 6 days
#6: 4 2015-03-22 2015-04-07 61 days
#7: 4 2015-06-07 2015-07-07 -1 days
#8: 4 2015-07-06 2015-07-07 26 days
#9: 4 2015-08-02 2015-08-07 NA days
Or it could be
chf1[, diff1 := difftime(from_date, shift(to_date), units = "days"), by = id]
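The same lagged difference can be checked in base R on a small example. This is only a sketch: chf_toy and gap_days are illustrative names, and the dates are the question's rows for ids 2 and 4 (first two rows each).

```r
chf_toy <- data.frame(
  id        = c(2L, 2L, 4L, 4L),
  from_date = as.Date(c("2015-02-22", "2015-05-06", "2015-03-10", "2015-03-22")),
  to_date   = as.Date(c("2015-02-24", "2015-05-17", "2015-03-16", "2015-04-07"))
)

# Previous to_date within each id, as days since the epoch
# (the first row of every id has no previous row, so it gets NA)
prev_to <- ave(as.numeric(chf_toy$to_date), chf_toy$id,
               FUN = function(x) c(NA, head(x, -1)))

# Gap in days between this from_date and the previous to_date
chf_toy$gap_days <- as.numeric(chf_toy$from_date) - prev_to
chf_toy
```

This corresponds to the second variant above (from_date minus the shifted to_date), which puts the NA on the first row of each id as the question requests.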
data
chf <- structure(list(id = c(1L, 2L, 2L, 3L, 4L, 4L, 4L, 4L, 4L),
f.date = structure(c(16503,
16488, 16561, 16478, 16504, 16516, 16593, 16622, 16649), class = "Date"),
t.date = structure(c(16508, 16490, 16572, 16482, 16510, 16532,
16623, 16623, 16654), class = "Date")), .Names = c("id",
"f.date", "t.date"), row.names = c(NA, -9L), class = "data.frame")
chf1 = data.table(id = chf$id,from_date = chf$f.date,to_date = chf$t.date)

index grouped columns in data frame

I have a data frame as follows
time site val
2014-09-01 00:00:00 2001 1
2014-09-01 00:15:00 2001 0
2014-09-01 00:30:00 2001 2
2014-09-01 00:45:00 2001 0
2014-09-01 00:00:00 2002 1
2014-09-01 00:15:00 2002 0
2014-09-01 00:30:00 2002 2
2014-09-02 00:45:00 2001 0
2014-09-02 00:00:00 2001 1
2014-09-02 00:15:00 2001 0
2014-09-02 00:30:00 2001 2
2014-09-02 00:45:00 2001 0
2014-09-02 00:00:00 2002 1
2014-09-02 00:15:00 2002 0
2014-09-02 00:30:00 2002 2
2014-09-02 00:45:00 2001 0
I'd like to be able to group it by time and site, then add a new variable that holds the occurrence index within the group
time site val h
2014-09-01 00:00:00 2001 1 1
2014-09-01 00:15:00 2001 0 2
2014-09-01 00:30:00 2001 2 3
2014-09-01 00:45:00 2001 0 4
2014-09-01 00:00:00 2002 1 1
2014-09-01 00:15:00 2002 0 2
2014-09-01 00:30:00 2002 2 3
2014-09-02 00:45:00 2002 0 4
2014-09-02 00:00:00 2001 1 1
2014-09-02 00:15:00 2001 0 2
2014-09-02 00:30:00 2001 2 3
2014-09-02 00:45:00 2001 0 4
2014-09-02 00:00:00 2002 1 1
2014-09-02 00:15:00 2002 0 2
2014-09-02 00:30:00 2002 2 3
2014-09-02 00:45:00 2001 0 4
df <- structure(list(time = structure(c(1409522400, 1409523300, 1409524200,
1409525100, 1409522400, 1409523300, 1409524200, 1409611500, 1409608800,
1409609700, 1409610600, 1409611500, 1409608800, 1409609700, 1409610600,
1409611500), class = c("POSIXct", "POSIXt"), tzone = ""), site = structure(c(1L,
1L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 1L), .Label = c("2001",
"2002"), class = "factor"), val = c(1L, 0L, 2L, 0L, 1L, 0L, 2L,
0L, 1L, 0L, 2L, 0L, 1L, 0L, 2L, 0L)), .Names = c("time", "site",
"val"), row.names = c(NA, -16L), class = "data.frame")
What are my options in R to achieve this?
Thanks

Using dplyr. First we create a column id extracting the day from the date (column time). Then we group by site and id, and add a new variable counter counting the number of occurrences by those two groups.
df$id <- as.factor(format(df$time,'%d'))
library(dplyr)
df %>% group_by(site, id) %>% mutate(counter = row_number())
Output:
time site val id counter
(time) (fctr) (int) (fctr) (int)
1 2014-09-01 00:00:00 2001 1 01 1
2 2014-09-01 00:15:00 2001 0 01 2
3 2014-09-01 00:30:00 2001 2 01 3
4 2014-09-01 00:45:00 2001 0 01 4
5 2014-09-01 00:00:00 2002 1 01 1
6 2014-09-01 00:15:00 2002 0 01 2
7 2014-09-01 00:30:00 2002 2 01 3
8 2014-09-02 00:45:00 2001 0 02 1
9 2014-09-02 00:00:00 2001 1 02 2
10 2014-09-02 00:15:00 2001 0 02 3
11 2014-09-02 00:30:00 2001 2 02 4
12 2014-09-02 00:45:00 2001 0 02 5
13 2014-09-02 00:00:00 2002 1 02 1
14 2014-09-02 00:15:00 2002 0 02 2
15 2014-09-02 00:30:00 2002 2 02 3
16 2014-09-02 00:45:00 2001 0 02 6

We can use ave
df$h <- with(df, ave(val, cumsum(c(TRUE,diff(time)< 0)), FUN= seq_along))
df
# time site val h
#1 2014-09-01 03:30:00 2001 1 1
#2 2014-09-01 03:45:00 2001 0 2
#3 2014-09-01 04:00:00 2001 2 3
#4 2014-09-01 04:15:00 2001 0 4
#5 2014-09-01 03:30:00 2002 1 1
#6 2014-09-01 03:45:00 2002 0 2
#7 2014-09-01 04:00:00 2002 2 3
#8 2014-09-02 04:15:00 2001 0 4
#9 2014-09-02 03:30:00 2001 1 1
#10 2014-09-02 03:45:00 2001 0 2
#11 2014-09-02 04:00:00 2001 2 3
#12 2014-09-02 04:15:00 2001 0 4
#13 2014-09-02 03:30:00 2002 1 1
#14 2014-09-02 03:45:00 2002 0 2
#15 2014-09-02 04:00:00 2002 2 3
#16 2014-09-02 04:15:00 2001 0 4
NOTE: This is based on the expected output shown in the OP's post. I understand that 'site' is also described as a grouping variable, but then the expected output should be something else.
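The grouping trick inside the ave call can be seen in isolation on a plain numeric vector. In this sketch time_toy stands in for the time column (minutes are used instead of date-times, purely for illustration): a new block starts whenever the value drops compared with the previous element.

```r
# Illustrative stand-in for the time column: two blocks that each restart at 0
time_toy <- c(0, 15, 30, 45, 0, 15, 30)

# diff() < 0 flags a drop; the leading TRUE starts the first block,
# and cumsum() turns the flags into a running block number
block <- cumsum(c(TRUE, diff(time_toy) < 0))

# Count positions within each block
h <- ave(time_toy, block, FUN = seq_along)
data.frame(time_toy, block, h)
```

This only relies on the order of the rows, which is why the answer does not need 'site' to reproduce the expected output.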

Conditional (inequality) join in data.table

I'm just trying to figure out how to do a conditional join on two data.tables.
I've written a sqldf conditional join to give me the circuits whose start or finish times are within the other's start/finish times.
sqldf("select dt2.start, dt2.finish, dt2.counts, dt1.id, dt1.circuit
from dt2
left join dt1 on (
(dt2.start >= dt1.start and dt2.start < dt1.finish) or
(dt2.finish >= dt1.start and dt2.finish < dt1.finish)
)")
This gives me the correct result, but it's too slow for my large-ish data set.
What's the data.table way to do this without a vector scan?
Here's my data:
dt1 <- data.table(structure(list(circuit = structure(c(2L, 1L, 2L, 1L, 2L, 3L,
1L, 1L, 2L), .Label = c("a", "b", "c"), class = "factor"), start = structure(c(1393621200,
1393627920, 1393628400, 1393631520, 1393650300, 1393646400, 1393656000,
1393668000, 1393666200), class = c("POSIXct", "POSIXt"), tzone = ""),
end = structure(c(1393626600, 1393631519, 1393639200, 1393632000,
1393660500, 1393673400, 1393667999, 1393671600, 1393677000
), class = c("POSIXct", "POSIXt"), tzone = ""), id = structure(1:9, .Label = c("1001",
"1002", "1003", "1004", "1005", "1006", "1007", "1008", "1009"
), class = "factor")), .Names = c("circuit", "start", "end",
"id"), class = "data.frame", row.names = c(NA, -9L)))
dt2 <- data.table(structure(list(start = structure(c(1393621200, 1393624800, 1393626600,
1393627919, 1393628399, 1393632000, 1393639200, 1393646399, 1393650299,
1393655999, 1393660500, 1393666199, 1393671600, 1393673400), class = c("POSIXct",
"POSIXt"), tzone = ""), end = structure(c(1393624799, 1393626600,
1393627919, 1393628399, 1393632000, 1393639200, 1393646399, 1393650299,
1393655999, 1393660500, 1393666199, 1393671600, 1393673400, 1393677000
), class = c("POSIXct", "POSIXt"), tzone = ""), seconds = c(3599L,
1800L, 1319L, 480L, 3601L, 7200L, 7199L, 3900L, 5700L, 4501L,
5699L, 5401L, 1800L, 3600L), counts = c(1L, 1L, 0L, 1L, 2L, 1L,
0L, 1L, 2L, 3L, 2L, 3L, 2L, 1L)), .Names = c("start", "end",
"seconds", "counts"), row.names = c(1L, 3L, 4L, 5L, 6L, 7L, 8L,
9L, 10L, 11L, 12L, 13L, 14L, 15L), class = "data.frame"))

Using non-equi joins:
ans = dt1[dt2, on=.(start <= end, end > start),
.(i.start, i.end, counts, id, circuit, cndn = i.start < x.start & i.end >= x.end),
allow.cartesian=TRUE
][!cndn %in% TRUE]
The condition start <= end, end >= start (note the >= on both cases) would check if two intervals overlap by any means. The open interval on one side is accomplished by end > start part (> instead of >=). But still it also picks up the intervals of type:
dt1: start=================end
dt2: start--------------------------------end ## start < start, end > end
and
dt1: start=================end
dt2: start----------end ## end == end
The cndn column is to check and remove these cases. Hopefully, those cases aren't a lot so that we don't materialise unwanted rows unnecessarily.
PS: the solution in this case is still not as straightforward as I'd like, because it requires an OR operation. It is possible to do two conditional joins and then bind them together, though.
Perhaps at some point, we'll have to think about the feasibility of extending joins to these kinds of operations in a more straightforward manner.
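For checking the result of a join like this on a small sample, a brute-force base R comparison of every pair can serve as a reference. This is a sketch only, and it is exactly the kind of full scan the question wants to avoid on large data; the data frames a and b and their numeric intervals are made up for illustration, with a playing the role of dt1 and b the role of dt2.

```r
a <- data.frame(id = c("x", "y"), start = c(0, 10), end = c(5, 20))
b <- data.frame(start = c(3, 5, 8), end = c(4, 9, 25))

# The question's condition, evaluated for every (b row, a row) pair:
# b.start in [a.start, a.end)  OR  b.end in [a.start, a.end)
hit <- (outer(b$start, a$start, ">=") & outer(b$start, a$end, "<")) |
       (outer(b$end,   a$start, ">=") & outer(b$end,   a$end, "<"))

# Rows of b (first column) matched to rows of a (second column)
pairs <- which(hit, arr.ind = TRUE)
pairs
```

Note that the third row of b, which spans all of the second interval in a, is not a match under the question's condition. That is the case the cndn column filters out in the non-equi join.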

No idea if this performs faster, but here's a shot at a data table method. I reshape dt1 and use findInterval to identify where the times in dt2 line up with times in dt1.
dt1 <- data.table(structure(list(circuit = structure(c(2L, 1L, 2L, 1L, 2L, 3L,
1L, 1L, 2L), .Label = c("a", "b", "c"), class = "factor"), start = structure(c(1393621200,
1393627920, 1393628400, 1393631520, 1393650300, 1393646400, 1393656000,
1393668000, 1393666200), class = c("POSIXct", "POSIXt"), tzone = ""),
end = structure(c(1393626600, 1393631519, 1393639200, 1393632000,
1393660500, 1393673400, 1393667999, 1393671600, 1393677000
), class = c("POSIXct", "POSIXt"), tzone = ""), id = structure(1:9, .Label = c("1001",
"1002", "1003", "1004", "1005", "1006", "1007", "1008", "1009"
), class = "factor")), .Names = c("circuit", "start", "end",
"id"), class = "data.frame", row.names = c(NA, -9L)))
dt2 <- data.table(structure(list(start = structure(c(1393621200, 1393624800, 1393626600,
1393627919, 1393628399, 1393632000, 1393639200, 1393646399, 1393650299,
1393655999, 1393660500, 1393666199, 1393671600, 1393673400), class = c("POSIXct",
"POSIXt"), tzone = ""), end = structure(c(1393624799, 1393626600,
1393627919, 1393628399, 1393632000, 1393639200, 1393646399, 1393650299,
1393655999, 1393660500, 1393666199, 1393671600, 1393673400, 1393677000
), class = c("POSIXct", "POSIXt"), tzone = ""), seconds = c(3599L,
1800L, 1319L, 480L, 3601L, 7200L, 7199L, 3900L, 5700L, 4501L,
5699L, 5401L, 1800L, 3600L), counts = c(1L, 1L, 0L, 1L, 2L, 1L,
0L, 1L, 2L, 3L, 2L, 3L, 2L, 1L)), .Names = c("start", "end",
"seconds", "counts"), row.names = c(1L, 3L, 4L, 5L, 6L, 7L, 8L,
9L, 10L, 11L, 12L, 13L, 14L, 15L), class = "data.frame"))
# > dt1
# circuit start end id
# 1: b 2014-02-28 16:00:00 2014-02-28 17:30:00 1001
# 2: a 2014-02-28 17:52:00 2014-02-28 18:51:59 1002
# 3: b 2014-02-28 18:00:00 2014-02-28 21:00:00 1003
# 4: a 2014-02-28 18:52:00 2014-02-28 19:00:00 1004
# 5: b 2014-03-01 00:05:00 2014-03-01 02:55:00 1005
# 6: c 2014-02-28 23:00:00 2014-03-01 06:30:00 1006
# 7: a 2014-03-01 01:40:00 2014-03-01 04:59:59 1007
# 8: a 2014-03-01 05:00:00 2014-03-01 06:00:00 1008
# 9: b 2014-03-01 04:30:00 2014-03-01 07:30:00 1009
# > dt2
# start end seconds counts
# 1: 2014-02-28 16:00:00 2014-02-28 16:59:59 3599 1
# 2: 2014-02-28 17:00:00 2014-02-28 17:30:00 1800 1
# 3: 2014-02-28 17:30:00 2014-02-28 17:51:59 1319 0
# 4: 2014-02-28 17:51:59 2014-02-28 17:59:59 480 1
# 5: 2014-02-28 17:59:59 2014-02-28 19:00:00 3601 2
# 6: 2014-02-28 19:00:00 2014-02-28 21:00:00 7200 1
# 7: 2014-02-28 21:00:00 2014-02-28 22:59:59 7199 0
# 8: 2014-02-28 22:59:59 2014-03-01 00:04:59 3900 1
# 9: 2014-03-01 00:04:59 2014-03-01 01:39:59 5700 2
# 10: 2014-03-01 01:39:59 2014-03-01 02:55:00 4501 3
# 11: 2014-03-01 02:55:00 2014-03-01 04:29:59 5699 2
# 12: 2014-03-01 04:29:59 2014-03-01 06:00:00 5401 3
# 13: 2014-03-01 06:00:00 2014-03-01 06:30:00 1800 2
# 14: 2014-03-01 06:30:00 2014-03-01 07:30:00 3600 1
## reshapes dt1 from wide to long
## puts start and end times into one column and sorts by time
## this is so that you can use findInterval later
dt3 <- dt1[,list(time = c(start,end)), by = "circuit,id"][order(time)]
dt3[,ntvl := seq_len(nrow(dt3))]
# circuit id time ntvl
# 1: b 1001 2014-02-28 16:00:00 1
# 2: b 1001 2014-02-28 17:30:00 2
# 3: a 1002 2014-02-28 17:52:00 3
# 4: b 1003 2014-02-28 18:00:00 4
# 5: a 1002 2014-02-28 18:51:59 5
# 6: a 1004 2014-02-28 18:52:00 6
# 7: a 1004 2014-02-28 19:00:00 7
# 8: b 1003 2014-02-28 21:00:00 8
# 9: c 1006 2014-02-28 23:00:00 9
# 10: b 1005 2014-03-01 00:05:00 10
# 11: a 1007 2014-03-01 01:40:00 11
# 12: b 1005 2014-03-01 02:55:00 12
# 13: b 1009 2014-03-01 04:30:00 13
# 14: a 1007 2014-03-01 04:59:59 14
# 15: a 1008 2014-03-01 05:00:00 15
# 16: a 1008 2014-03-01 06:00:00 16
# 17: c 1006 2014-03-01 06:30:00 17
# 18: b 1009 2014-03-01 07:30:00 18
## map interval to id
dt4 <- dt3[,list(ntvl = seq(from = min(ntvl), to = max(ntvl)-1, by = 1)),by = "circuit,id"]
setkey(dt4, ntvl)
# circuit id ntvl
# 1: b 1001 1
# 2: a 1002 3
# 3: a 1002 4
# 4: b 1003 4
# 5: b 1003 5
# 6: b 1003 6
# 7: a 1004 6
# 8: b 1003 7
# 9: c 1006 9
# 10: c 1006 10
# 11: b 1005 10
# 12: c 1006 11
# 13: b 1005 11
# 14: a 1007 11
# 15: c 1006 12
# 16: a 1007 12
# 17: c 1006 13
# 18: a 1007 13
# 19: b 1009 13
# 20: c 1006 14
# 21: b 1009 14
# 22: c 1006 15
# 23: b 1009 15
# 24: a 1008 15
# 25: c 1006 16
# 26: b 1009 16
# 27: b 1009 17
# circuit id ntvl
## finds intervals in dt2
dt2[,`:=`(ntvl_start = findInterval(start, dt3[["time"]], rightmost.closed = FALSE),
ntvl_end = findInterval(end, dt3[["time"]], rightmost.closed = FALSE))]
# start end seconds counts ntvl_start ntvl_end
# 1: 2014-02-28 16:00:00 2014-02-28 16:59:59 3599 1 1 1
# 2: 2014-02-28 17:00:00 2014-02-28 17:30:00 1800 1 1 2
# 3: 2014-02-28 17:30:00 2014-02-28 17:51:59 1319 0 2 2
# 4: 2014-02-28 17:51:59 2014-02-28 17:59:59 480 1 2 3
# 5: 2014-02-28 17:59:59 2014-02-28 19:00:00 3601 2 3 7
# 6: 2014-02-28 19:00:00 2014-02-28 21:00:00 7200 1 7 8
# 7: 2014-02-28 21:00:00 2014-02-28 22:59:59 7199 0 8 8
# 8: 2014-02-28 22:59:59 2014-03-01 00:04:59 3900 1 8 9
# 9: 2014-03-01 00:04:59 2014-03-01 01:39:59 5700 2 9 10
# 10: 2014-03-01 01:39:59 2014-03-01 02:55:00 4501 3 10 12
# 11: 2014-03-01 02:55:00 2014-03-01 04:29:59 5699 2 12 12
# 12: 2014-03-01 04:29:59 2014-03-01 06:00:00 5401 3 12 16
# 13: 2014-03-01 06:00:00 2014-03-01 06:30:00 1800 2 16 17
# 14: 2014-03-01 06:30:00 2014-03-01 07:30:00 3600 1 17 18
## joins, by start time, then by end time
## the commented out lines may be a better alternative
## if there are many NA values
setkey(dt2, ntvl_start)
dt_ans_start <- dt4[dt2, list(start,end,counts,id,circuit),nomatch = NA]
# dt_ans_start <- dt4[dt2, list(start,end,counts,id,circuit),nomatch = 0]
# dt_ans_start_na <- dt2[!dt4]
setkey(dt2, ntvl_end)
dt_ans_end <- dt4[dt2, list(start,end,counts,id,circuit),nomatch = NA]
# dt_ans_end <- dt4[dt2, list(start,end,counts,id,circuit),nomatch = 0]
# dt_ans_end_na <- dt2[!dt4]
## bring them all together and remove duplicates
dt_ans <- unique(rbind(dt_ans_start, dt_ans_end), by = c("start", "id"))
dt_ans <- dt_ans[!(is.na(id) & counts > 0)]
dt_ans[,ntvl := NULL]
setkey(dt_ans,start)
# start end counts id circuit
# 1: 2014-02-28 16:00:00 2014-02-28 16:59:59 1 1001 b
# 2: 2014-02-28 17:00:00 2014-02-28 17:30:00 1 1001 b
# 3: 2014-02-28 17:30:00 2014-02-28 17:51:59 0 NA NA
# 4: 2014-02-28 17:51:59 2014-02-28 17:59:59 1 1002 a
# 5: 2014-02-28 17:59:59 2014-02-28 19:00:00 2 1002 a
# 6: 2014-02-28 17:59:59 2014-02-28 19:00:00 2 1003 b
# 7: 2014-02-28 19:00:00 2014-02-28 21:00:00 1 1003 b
# 8: 2014-02-28 21:00:00 2014-02-28 22:59:59 0 NA NA
# 9: 2014-02-28 22:59:59 2014-03-01 00:04:59 1 1006 c
# 10: 2014-03-01 00:04:59 2014-03-01 01:39:59 2 1006 c
# 11: 2014-03-01 00:04:59 2014-03-01 01:39:59 2 1005 b
# 12: 2014-03-01 01:39:59 2014-03-01 02:55:00 3 1006 c
# 13: 2014-03-01 01:39:59 2014-03-01 02:55:00 3 1005 b
# 14: 2014-03-01 01:39:59 2014-03-01 02:55:00 3 1007 a
# 15: 2014-03-01 02:55:00 2014-03-01 04:29:59 2 1006 c
# 16: 2014-03-01 02:55:00 2014-03-01 04:29:59 2 1007 a
# 17: 2014-03-01 04:29:59 2014-03-01 06:00:00 3 1006 c
# 18: 2014-03-01 04:29:59 2014-03-01 06:00:00 3 1007 a
# 19: 2014-03-01 04:29:59 2014-03-01 06:00:00 3 1009 b
# 20: 2014-03-01 06:00:00 2014-03-01 06:30:00 2 1006 c
# 21: 2014-03-01 06:00:00 2014-03-01 06:30:00 2 1009 b
# 22: 2014-03-01 06:30:00 2014-03-01 07:30:00 1 1009 b
# start end counts id circuit
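The findInterval step that this answer relies on can be tried on its own with a small vector. In this sketch breaks plays the role of the sorted dt3 times and x the role of the dt2 start or end times; both are made-up numbers.

```r
breaks <- c(0, 10, 20, 30)          # must be sorted in increasing order
x      <- c(-5, 0, 9, 10, 25, 30, 40)

# For each x, the index i such that breaks[i] <= x < breaks[i + 1];
# 0 means x lies before the first break
idx <- findInterval(x, breaks)
data.frame(x, idx)
```

A value equal to a break falls into the interval that starts at that break, which matches the closed-start, open-end condition in the question.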

calculating differences in times, data grouped by rows

I have a data set in the following format
ID DATETIME VALUE
1 4/2/2012 10:00 300
1 5/2/2012 23:00 150
1 6/3/2012 10:00 650
2 1/2/2012 10:00 450
2 2/2/2012 13:00 240
3 6/5/2012 09:00 340
3 7/5/2012 23:00 240
I would like to first calculate the time difference from first instance per ID to each subsequent time.
ID DATETIME VALUE DIFTIME(days)
1 4/2/2012 10:00 300 0
1 5/2/2012 23:00 150 1.3
1 6/3/2012 10:00 650 33
2 1/2/2012 10:00 450 0
2 2/2/2012 13:00 240 1
3 6/5/2012 09:00 340 0
3 7/5/2012 23:00 240 1
And then I'd like to make this a wide format
ID 0 1 1.3 33
1 300 na 150 650
2 450 240 na na
3 340 240 na na

Here is a solution using the data.table and reshape2 packages:
library(data.table)
DT <- as.data.table(dat)
DT[, `:=`(DIFTIME, c(0, diff(as.Date(DATETIME)))), by = "ID"]
## ID VALUE DATETIME DIFTIME
## 1: 1 300 2012-02-04 10:00:00 0
## 2: 1 150 2012-02-05 23:00:00 1
## 3: 1 650 2012-03-06 10:00:00 30
## 4: 2 450 2012-02-01 10:00:00 0
## 5: 2 240 2012-02-02 13:00:00 1
## 6: 3 340 2012-05-06 09:00:00 0
## 7: 3 240 2012-05-07 23:00:00 1
library(reshape2)
dcast(formula = ID ~ DIFTIME, data = DT[, list(ID, DIFTIME, VALUE)])
## ID 0 1 30
## 1 1 300 150 650
## 2 2 450 240 NA
## 3 3 340 240 NA
Here is my dat in a handy format:
dat <- structure(list(ID = c(1L, 1L, 1L, 2L, 2L, 3L, 3L), DATETIME = structure(c(1328346000,
1328479200, 1331024400, 1328086800, 1328184000, 1336287600, 1336424400
), class = c("POSIXct", "POSIXt"), tzone = ""), VALUE = c(300L,
150L, 650L, 450L, 240L, 340L, 240L)), .Names = c("ID", "DATETIME",
"VALUE"), class = "data.frame", row.names = c(NA, 7L))
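The code above takes differences between consecutive rows in whole days. If the elapsed time since the first observation of each ID is wanted instead, as the question describes, one possible base R sketch is shown below. The data frame dat_toy and its timestamps are made up for illustration, and UTC is assumed to keep the arithmetic free of daylight-saving effects.

```r
dat_toy <- data.frame(
  ID = c(1L, 1L, 2L, 2L),
  DATETIME = as.POSIXct(
    c("2012-02-04 10:00", "2012-02-05 22:00",
      "2012-02-01 10:00", "2012-02-02 10:00"),
    tz = "UTC"
  )
)

# Seconds since the epoch for each timestamp
secs <- as.numeric(dat_toy$DATETIME)

# Subtract the earliest timestamp of each ID and convert seconds to days
dat_toy$DIFTIME <- (secs - ave(secs, dat_toy$ID, FUN = min)) / 86400
dat_toy
```

Fractional days such as 1.5 make poor column names, so rounding DIFTIME before casting to wide format may be worth considering.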
