I have the following data:
df <- data.frame(index = 1:85,
                 times = c(seq(as.POSIXct("2020-10-03 21:31:00 UTC"),
                               as.POSIXct("2020-10-03 22:25:00 UTC"),
                               "min"),
                           seq(as.POSIXct("2020-11-03 10:10:00 UTC"),
                               as.POSIXct("2020-11-03 10:39:00 UTC"),
                               "min")))
If we look at rows 55 and 56, there is a clear divide in the times:
> df[55:56, ]
index times
55 55 2020-10-03 22:25:00
56 56 2020-11-03 10:10:00
I would like to add a third, categorical column split based on these divides,
e.g. df$split[55] = "A" and df$split[56] = "B",
with logic like:
If the time gap between rows is greater than 5 minutes, start a new category for the subsequent rows, until the next instance where the time gap exceeds 5 minutes.
thanks
You could use
library(dplyr)
df %>%
  mutate(cat = 1 + cumsum(c(0, diff(times)) > 5))
which returns
index times cat
1 1 2020-10-03 21:31:00 1
2 2 2020-10-03 21:32:00 1
3 3 2020-10-03 21:33:00 1
4 4 2020-10-03 21:34:00 1
5 5 2020-10-03 21:35:00 1
6 6 2020-10-03 21:36:00 1
7 7 2020-10-03 21:37:00 1
8 8 2020-10-03 21:38:00 1
...
53 53 2020-10-03 22:23:00 1
54 54 2020-10-03 22:24:00 1
55 55 2020-10-03 22:25:00 1
56 56 2020-11-03 10:10:00 2
57 57 2020-11-03 10:11:00 2
58 58 2020-11-03 10:12:00 2
59 59 2020-11-03 10:13:00 2
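Note that diff() on a POSIXct vector returns a difftime whose units are chosen automatically, so > 5 happens to mean "5 minutes" here because of how the gaps in this data fall. A slightly more defensive sketch of the same idea, pinning the comparison to minutes explicitly:
library(dplyr)
df %>%
  mutate(cat = 1 + cumsum(c(0, as.numeric(diff(times), units = "mins")) > 5))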
If you need letters or something else, you could for example use
df %>%
  mutate(cat = LETTERS[1 + cumsum(c(0, diff(times)) > 5)])
to convert the categories 1 and 2 into A and B.
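One thing to watch: LETTERS has only 26 elements, so more than 26 gaps would produce NA labels. A sketch that scales past 26 groups (the group_ prefix is just an illustrative choice):
df %>%
  mutate(cat = sprintf("group_%02d", 1 + cumsum(c(0, diff(times)) > 5)))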
df <- data.frame("Date"=seq(as.Date("2020/1/1"),by="day", length.out = 20),events=sample(0:100,20))
I am trying to sum the events over the days closest to each row:
df <- df %>% mutate(seven_sum=sum(events[Date <= Date & Date > Date-7]) )
I want to sum everything that has happened during the last 7 days. I can see why this is not working, but not really how to solve it.
So basically, for each row, I would like to sum that date's events and all others within the previous 7 days. It is easy to fix if I use a fixed date range, but I would like it to change for each row...
Any advice on how to continue would be very helpful.
Using purrr::map_int():
library(dplyr)
library(purrr)
df %>% mutate(seven_sum = map_int(Date, ~ sum(events[Date <= .x & Date > (.x - 7)])))
# Date events seven_sum
#1 2020-01-01 66 66
#2 2020-01-02 94 160
#3 2020-01-03 49 209
#4 2020-01-04 39 248
#5 2020-01-05 84 332
#6 2020-01-06 29 361
#7 2020-01-07 36 397
#8 2020-01-08 20 351
#9 2020-01-09 40 297
#10 2020-01-10 25 273
#11 2020-01-11 3 237
#12 2020-01-12 97 250
#13 2020-01-13 22 243
#14 2020-01-14 63 270
#15 2020-01-15 58 308
#16 2020-01-16 91 359
#17 2020-01-17 26 360
#18 2020-01-18 47 404
#19 2020-01-19 35 342
#20 2020-01-20 38 358
and the same logic in base R:
df$seven_sum <- sapply(df$Date, function(x) sum(df$events[df$Date <= x & df$Date > (x - 7)]))
We can use data.table methods to do a non-equi join, which would be more efficient:
library(data.table)
# note: df[, Date1 := Date - 7] adds a helper column Date1 to df by reference;
# drop it afterwards with df[, Date1 := NULL] if you don't want to keep it
v1 <- setDT(df)[df[, Date1 := Date - 7], sum(events),
                on = .(Date <= Date, Date > Date1), allow.cartesian = TRUE, by = .EACHI]$V1
df[, seven_sum := v1][]
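If you prefer to stay in the tidyverse, the slider package offers a purpose-built alternative; a sketch, assuming slider is installed (slide_index_dbl() windows by the Date index, and .before = 6 spans the current day plus the six days before it):
library(dplyr)
library(slider)
df %>% mutate(seven_sum = slide_index_dbl(events, Date, sum, .before = 6))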
I’m a newbie in R.
I have two datasets, A and B.
A <- data.table::fread(
"
V1 DATE ID
1 7/16/11 a
2 2/18/09 b
3 3/25/08 c
")
B <- data.table::fread(
"
V1 DATE ID Value
1 2013-06-13 a 109
2 2017-08-22 a 86
3 2017-09-15 a 88
4 2008-11-05 a 78
5 2009-02-17 a 74
6 2009-03-09 a 84
7 2009-03-17 a 81
8 2009-04-14 a 57
9 2009-04-21 a 65
10 2009-05-12 a 54
11 2009-06-08 a 54
12 2009-08-27 a 68
13 2011-08-26 b 199
14 2011-12-07 b 174
15 2012-01-31 b 66
16 2012-02-15 b 58
17 2012-04-17 b 59
18 2012-12-21 b 78
19 2013-01-14 b 91
20 2014-03-12 b 74
21 2014-08-28 b 98
22 2014-10-18 b 112
23 2010-12-15 b 36
24 2011-08-26 b 199
25 2011-12-07 b 174
26 2012-01-31 b 66
27 2012-02-15 b 58
28 2012-04-17 b 59
29 2015-05-08 c 105
30 2006-03-27 c 69
31 2007-03-12 c 104
32 2007-11-09 c 63
33 2008-03-25 c 239
34 2008-04-04 c 446
35 2008-04-09 c 354
36 2008-04-10 c 365
37 2008-04-11 c 366
38 2008-04-18 c 273
39 2008-04-28 c 271
40 2008-05-06 c 262
41 2008-05-19 c 72
42 2008-05-24 c 86
43 2008-06-20 c 47
44 2008-07-10 c 46
45 2008-08-06 c 55
46 2008-09-01 c 58
47 2008-09-29 c 56
48 2008-10-30 c 53
49 2008-12-09 c 71
50 2008-12-18 c 63
51 2009-01-14 c 60
52 2009-02-21 c 58
53 2009-03-28 c 54
54 2009-04-29 c 56
55 2009-04-30 c 59
56 2009-06-23 c 64
57 2009-07-24 c 69
58 2009-08-17 c 73
59 2009-10-04 c 127
60 2009-11-26 c 289
61 2009-12-02 c 277
62 2009-12-08 c 230
")
I have tried for weeks to use R to:
find the value from B where ID == A$ID and B$DATE is the closest date before, or the same date as, A$DATE;
The expected result is : ID=c, DATE=2008-03-25, Value=239
find the value from B where ID == A$ID and B$DATE is 14 days after A$DATE. If there is no date exactly 14 days after, find the closest later date's value (like 15, 16, or 17 days after A$DATE).
The expected result is : ID=c, DATE=2008-04-09, Value=354
Both questions can be answered using a rolling join from data.table.
However, there are two important steps in preparing the data.
The date strings need to be converted to class IDate (or Date) to allow for date arithmetic. (IDate uses an integer representation to save memory).
The dataframes need to be coerced to data.table to enable the enhanced syntax. setDT() coerces a dataframe or tibble to data.table by reference, i.e., without copying.
BTW: The sample datasets provided by the OP were already data.tables as the OP had used the data.table::fread() function.
Data preparation:
library(data.table)
setDT(A)[, DATE := as.IDate(DATE, "%m/%d/%y")]
setDT(B)[, DATE := as.IDate(DATE)]
Now, we can apply the rolling join:
B[A, on = .(ID, DATE), roll = +Inf, .(ID, DATE, Value)]
ID DATE Value
1: a 2011-07-16 68
2: b 2009-02-18 NA
3: c 2008-03-25 239
The result can be verified by printing B in proper order with B[order(ID, DATE)]. The earliest date for ID == "b" in B is 2010-12-15, so there is no date in B on or before 2009-02-18.
Please note that the value in the DATE column is the reference date A$DATE, not the matching B$DATE.
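If you do want the matching B$DATE in this first result as well, the same x. prefix used in the second query below works here too (x.DATE refers to B's own DATE column, because B is the x table of the join); a sketch:
B[A, on = .(ID, DATE), roll = +Inf, .(ID, DATE = x.DATE, Value)]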
Edit after clarification of the expected result by the OP:
Also the second question can be solved by a rolling join, but the code requires three modifications:
The reference dates A$DATE need to be shifted 14 days later.
We need a backward rolling join because the OP wants to find the closest date in B on or after the shifted reference date.
According to the OP's expected result, the output should contain the matching B$DATE.
With these additional requirements we get
B[A[, .(ID, DATE = DATE + 14)], on = .(ID, DATE), roll = -Inf, .(ID, DATE = x.DATE, Value)]
ID DATE Value
1: a 2013-06-13 109
2: b 2010-12-15 36
3: c 2008-04-09 354
A solution using dplyr:
q1 and q2 correspond to your two questions.
library(dplyr)
A$DATE <- as.Date(A$DATE, format = "%m/%d/%y")
B$DATE <- as.Date(B$DATE)
BA <- left_join(B, A, by = c("ID" = "ID"))
q1 <- BA %>%
  filter(ID %in% A$ID) %>%
  filter(DATE.x <= DATE.y) %>%   # "before or the same date as" needs <=, not <
  group_by(ID) %>%
  arrange(desc(DATE.x)) %>%
  slice(1)
q2 <- BA %>%
  filter(ID %in% A$ID) %>%
  group_by(ID) %>%
  filter(as.numeric(DATE.x) - as.numeric(DATE.y) >= 14) %>%
  arrange(DATE.x) %>%   # keep only the closest qualifying date per ID
  slice(1)
q1
#> # A tibble: 2 x 6
#> # Groups: ID [2]
#> V1.x DATE.x ID Value V1.y DATE.y
#> <int> <date> <chr> <int> <int> <date>
#> 1 12 2009-08-27 a 68 1 2011-07-16
#> 2 33 2008-03-25 c 239 3 2008-03-25
q2
#> # A tibble: 3 x 6
#> # Groups: ID [3]
#> V1.x DATE.x ID Value V1.y DATE.y
#> <int> <date> <chr> <int> <int> <date>
#> 1 1 2013-06-13 a 109 1 2011-07-16
#> 2 23 2010-12-15 b 36 2 2009-02-18
#> 3 35 2008-04-09 c 354 3 2008-03-25
I have a sample xts object with some data:
library(xts)
dates <- seq.Date(from = as.Date("2010-01-01", format = "%Y-%m-%d"),
                  to = as.Date("2013-12-01", format = "%Y-%m-%d"), by = "month")
sample_data <- cbind(1:length(dates), length(dates):1)
xts_object <- xts(x = sample_data, order.by = dates)
I then use apply.yearly on it with the function cumsum:
apply.yearly(x = xts_object, FUN = cumsum)
The output is a transposed matrix, which is not what I originally intended it to return.
I would expect the snippet above to return the same output as:
rbind(apply(X = xts_object[1:12], MARGIN = 2, FUN = cumsum),
      apply(X = xts_object[13:24], MARGIN = 2, FUN = cumsum),
      apply(X = xts_object[25:36], MARGIN = 2, FUN = cumsum),
      apply(X = xts_object[37:48], MARGIN = 2, FUN = cumsum))
The problem with using apply is that it returns a matrix and not an xts object. While I could solve this by using as.xts, I would like to know if there is something I am missing, or if I am using apply.yearly incorrectly. Using pure apply seems more prone to difficult-to-catch errors and bugs.
This might not be the most elegant solution, but it works. (The likely reason for the odd apply.yearly() result is that it is built on period.apply(), which expects FUN to return one row per period; cumsum() returns one row per observation, so the pieces get mangled together.)
# Split xts_object by year
xts_list = split(xts_object, "years")
# cumsum for each year
cumsum_list = lapply(xts_list, FUN = cumsum)
# rbind them together
do.call(rbind, cumsum_list)
# [,1] [,2]
# 2010-01-01 1 48
# 2010-02-01 3 95
# 2010-03-01 6 141
# 2010-04-01 10 186
# 2010-05-01 15 230
# 2010-06-01 21 273
# 2010-07-01 28 315
# 2010-08-01 36 356
# 2010-09-01 45 396
# 2010-10-01 55 435
# 2010-11-01 66 473
# 2010-12-01 78 510
# 2011-01-01 13 36
# 2011-02-01 27 71
# 2011-03-01 42 105
# 2011-04-01 58 138
# 2011-05-01 75 170
# 2011-06-01 93 201
# 2011-07-01 112 231
# 2011-08-01 132 260
# 2011-09-01 153 288
# 2011-10-01 175 315
# 2011-11-01 198 341
# 2011-12-01 222 366
# 2012-01-01 25 24
# 2012-02-01 51 47
# 2012-03-01 78 69
# 2012-04-01 106 90
# 2012-05-01 135 110
# 2012-06-01 165 129
# 2012-07-01 196 147
# 2012-08-01 228 164
# 2012-09-01 261 180
# 2012-10-01 295 195
# 2012-11-01 330 209
# 2012-12-01 366 222
# 2013-01-01 37 12
# 2013-02-01 75 23
# 2013-03-01 114 33
# 2013-04-01 154 42
# 2013-05-01 195 50
# 2013-06-01 237 57
# 2013-07-01 280 63
# 2013-08-01 324 68
# 2013-09-01 369 72
# 2013-10-01 415 75
# 2013-11-01 462 77
# 2013-12-01 510 78
class(do.call(rbind, cumsum_list))
# [1] "xts" "zoo"
The resulting object is still of class "xts".
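The same split/apply/combine logic also fits on a single line. This works because zoo supplies a column-wise cumsum() method for xts/zoo objects, so each yearly chunk keeps its index:
do.call(rbind, lapply(split(xts_object, "years"), cumsum))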
I have a dataframe that looks like this:
id date1 value1 date2 value2 date3 value3
1 1113 2012-01-14 29 2012-09-29 22 2013-10-28 21
2 1622 2012-12-05 93 2012-12-05 82 2013-01-22 26
3 1609 2014-08-30 30 2013-04-07 53 2013-03-20 100
4 1624 2014-01-20 84 2013-03-17 92 2014-01-10 81
5 1861 2014-10-08 29 2012-08-19 84 2012-09-21 56
6 1640 2014-03-05 27 2012-02-28 5 2015-01-11 65
I want to create a new column that contains whichever of the three values "value1", "value2", and "value3" has the most recent associated date. I don't need to know which date it was associated with.
id date1 value1 date2 value2 date3 value3 value_recent
1 1113 2012-01-14 29 2012-09-29 22 2013-10-28 21 21
2 1622 2012-12-05 93 2012-12-05 82 2013-01-22 26 26
3 1609 2014-08-30 30 2013-04-07 53 2013-03-20 100 30
4 1624 2014-01-20 84 2013-03-17 92 2014-01-10 81 84
5 1861 2014-10-08 29 2012-08-19 84 2012-09-21 56 29
6 1640 2014-03-05 27 2012-02-28 5 2015-01-11 65 65
Code to create working example:
set.seed(1234)
id <- sample(1000:2000, 6, replace=TRUE)
date1 <- sample(seq(as.Date('2012-01-01'), as.Date('2016-01-01'), by="day"), 6)
value1 <- sample(1:100, 6, replace=TRUE)
date2 <- sample(seq(as.Date('2012-01-01'), as.Date('2016-01-01'), by="day"), 6)
value2 <- sample(1:100, 6, replace=TRUE)
date3 <- sample(seq(as.Date('2012-01-01'), as.Date('2016-01-01'), by="day"), 6)
value3 <- sample(1:100, 6, replace=TRUE)
df <- data.frame(id, date1, value1, date2, value2, date3, value3)
Edit: Per @Pierre Lafortune's answer, you can actually collapse this into one statement.
Edit 2: Added data with NAs and changed the code to handle them.
This should do the trick rather nicely. It does require a loop, and I would be interested to see if someone could come up with a concise vectorized solution.
# columns holding the dates
date_cols <- colnames(df)[grep("date", colnames(df))]
# for each row, find which date column is the latest (which.max ignores NAs),
# then pick the value column sitting right after it via matrix indexing
df$value_recent <- df[cbind(1:nrow(df),
                            grep("date", colnames(df))[apply(sapply(df[, date_cols], as.numeric), 1, which.max)] + 1)]
df
id date1 value1 date2 value2 date3 value3 value_recent
1 1113 <NA> 29 2012-09-29 22 2013-10-28 21 21
2 1622 2012-12-05 93 2012-12-05 82 2013-01-22 26 26
3 1609 <NA> 30 2013-04-07 53 2013-03-20 100 53
4 1624 2014-01-20 84 2013-03-17 92 2014-01-10 81 84
5 1861 2014-10-08 29 2012-08-19 84 2012-09-21 56 29
6 1640 2014-03-05 27 2012-02-28 5 2015-01-11 65 65
Data:
df<-structure(list(id = c(1113L, 1622L, 1609L, 1624L, 1861L, 1640L
), date1 = structure(c(NA, 15679, NA, 16090, 16351, 16134), class = "Date"),
value1 = c(29L, 93L, 30L, 84L, 29L, 27L), date2 = structure(c(15612,
15679, 15802, 15781, 15571, 15398), class = "Date"), value2 = c(22L,
82L, 53L, 92L, 84L, 5L), date3 = structure(c(16006, 15727,
15784, 16080, 15604, 16446), class = "Date"), value3 = c(21L,
26L, 100L, 81L, 56L, 65L)), .Names = c("id", "date1", "value1",
"date2", "value2", "date3", "value3"), row.names = c(NA, -6L), class = "data.frame")
I'm using apply to go over the rows looking for the most recent date, then using that index to find the value that corresponds. We use a matrix subsetting method to keep it concise. (One subtlety: apply() runs as.matrix() first, so the dates arrive in the function as character strings; max() still picks the latest because ISO-8601 date strings sort lexicographically.)
indx <- apply(df[grep("date", names(df))], 1, function(x) which(x == max(x))[1])
df$value_recent <- df[grep("val", names(df))][cbind(1:nrow(df), indx)]
# id date1 value1 date2 value2 date3 value3 value_recent
# 1 1113 2012-01-14 29 2012-09-29 22 2013-10-28 21 21
# 2 1622 2012-12-05 93 2012-12-05 82 2013-01-22 26 26
# 3 1609 2014-08-30 30 2013-04-07 53 2013-03-20 100 30
# 4 1624 2014-01-20 84 2013-03-17 92 2014-01-10 81 84
# 5 1861 2014-10-08 29 2012-08-19 84 2012-09-21 56 29
# 6 1640 2014-03-05 27 2012-02-28 5 2015-01-11 65 65
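If the character coercion described above makes you nervous, an equivalent index can be computed on numeric dates instead, reusing the sapply()/as.numeric() idiom from the previous answer; a sketch:
indx <- apply(sapply(df[grep("date", names(df))], as.numeric), 1, which.max)
df$value_recent <- df[grep("val", names(df))][cbind(1:nrow(df), indx)]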
(Note: arranging your data this way will create more trouble than it is worth.)
There are probably less verbose ways to do this, but here's one option: first move the data to a "long" format, then split it by id, sort, extract the most recent record, and merge that back into the original data frame.
ld <- reshape(df,
              idvar = "id",
              varying = list(paste0("date", 1:3),
                             paste0("value", 1:3)),
              v.names = c("date", "value"),
              direction = "long")
recent <- split(ld, ld$id)
recent <- lapply(recent, function(x) {
  d <- x[order(x$date), ]    # sort each id's records by date
  d <- d[nrow(d), c(1, 4)]   # keep the last (most recent) row: id and value
  names(d)[2] <- "value_recent"
  d
})
recent <- do.call(rbind, recent)
merge(df, recent, by = "id")
# id date1 value1 date2 value2 date3 value3 value_recent
# 1 1204 2014-10-25 73 2012-12-22 39 2015-07-18 62 62
# 2 1667 2012-01-16 97 2014-02-28 30 2014-12-31 83 83
# 3 1673 2015-01-16 96 2014-12-16 50 2014-08-05 31 96
# 4 1722 2015-02-07 10 2013-12-25 4 2012-08-18 93 10
# 5 1882 2012-10-20 91 2014-12-28 71 2015-09-03 18 18
# 6 1883 2012-03-30 73 2015-04-26 4 2014-12-23 74 4
Here's a similar solution that also starts with reshape() (the base stats function, so no extra package is needed for that step) but then does the rest in a series of pipes:
library(dplyr)
df2 <- reshape(df,
               varying = list(names(df)[grep("date", names(df))],
                              names(df)[grep("value", names(df))]),
               v.names = c("date", "value"),
               direction = "long") %>%
  # order the data for the steps to come
  arrange(id, date) %>%
  # the next two steps cut down to the last (ordered) observation for each id
  group_by(id) %>%
  slice(n()) %>%
  # keep only the columns we need and rename the value column for merging
  select(id, most.recent = value) %>%
  # merge the values back into the original data frame, matching on id
  left_join(df, .)
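As written, left_join(df, .) lets dplyr detect the shared id key on its own (it will message which columns it joined on); if you prefer to be explicit, the last step can be spelled:
left_join(df, ., by = "id")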