I would like to match the values of two tables based on the following spatial and time conditions:
A two-hour time interval
The nearest point (min(dist)), with a maximum distance of 20 km
I can solve the problem using for loops, but since the tables' dimensions are 3,000,000 x 10 and 100,000 x 13, it takes too long to complete.
Do you have any suggestions? I post below a practical example and the desired output. Thank you.
Example
DT1 <- data.table(
Date = as.POSIXct(c("2005-01-05 10:40:00", "2005-01-06 10:40:00", "2005-01-07 10:40:00", "2005-01-08 10:40:00", "2005-01-09 10:40:00", "2005-01-10 10:40:00"), format = "%Y-%m-%d %T", tz = "GMT"),
Lat = c(rep(50, 3), 35.44, 25.44, 15.44),
Lon = c(rep(-50, 3), -10.44, -20.44, -30.44),
Other.col = sample(LETTERS, 6))
DT2 <- data.table(
Date = as.POSIXct(c("2011-01-01 10:40:00", "2005-01-05 11:40:00", "2005-01-09 08:59:00", "2005-01-09 09:18:00", "2005-01-10 08:59:00"), format = "%Y-%m-%d %T", tz = "GMT"),
Lat = c(35.44, 1, 25.54, 25.43, 15.46),
Lon = c(-10.44, 1, -20.66, -20.42, -30.13),
Quality = c("h", "f", "n", "z", "l"))
DT1
Date Lat Lon Other.col
1: 2005-01-05 10:40:00 50.00 -50.00 E
2: 2005-01-06 10:40:00 50.00 -50.00 C
3: 2005-01-07 10:40:00 50.00 -50.00 O
4: 2005-01-08 10:40:00 35.44 -10.44 Z
5: 2005-01-09 10:40:00 25.44 -20.44 T
6: 2005-01-10 10:40:00 15.44 -30.44 S
DT2
Date Lat Lon Quality
1: 2011-01-01 10:40:00 35.44 -10.44 h
2: 2005-01-05 11:40:00 1.00 1.00 f
3: 2005-01-09 08:59:00 25.54 -20.66 n
4: 2005-01-09 09:18:00 25.43 -20.42 z
5: 2005-01-10 08:59:00 15.46 -30.13 l
Output
Date Lat Lon Other.col V2
1: 2005-01-05 10:40:00 50.00 -50.00 E NA
2: 2005-01-06 10:40:00 50.00 -50.00 C NA
3: 2005-01-07 10:40:00 50.00 -50.00 O NA
4: 2005-01-08 10:40:00 35.44 -10.44 Z NA
5: 2005-01-09 10:40:00 25.44 -20.44 T z
6: 2005-01-10 10:40:00 15.44 -30.44 S l
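A possible approach (a minimal sketch, assuming geosphere::distHaversine is acceptable as the great-circle distance; it returns metres): do a data.table non-equi join on a +/- 2 hour window, then keep the nearest candidate within 20 km. The helper columns id, lo and hi are only for illustration.
library(data.table)
library(geosphere)  # assumption: used only for the great-circle distance

# Add a row id and a +/- 2 hour window to DT1, then non-equi join DT2 onto it
DT1[, `:=`(id = .I, lo = Date - 2 * 3600, hi = Date + 2 * 3600)]
cand <- DT2[DT1,
            on = .(Date >= lo, Date <= hi),
            .(id = i.id,
              Quality = x.Quality,
              dist = distHaversine(cbind(x.Lon, x.Lat), cbind(i.Lon, i.Lat))),
            nomatch = NULL, allow.cartesian = TRUE]

# For each DT1 row keep the nearest candidate within 20 km (distances are in metres)
best <- cand[dist <= 20000, .SD[which.min(dist)], by = id]

# Attach the matched Quality back to DT1; rows without a match stay NA
DT1[, V2 := best$Quality[match(id, best$id)]]
DT1[, c("id", "lo", "hi") := NULL]
Filtering on the cheap time condition first keeps the candidate set small before any distances are computed, which is the main saving over the double loop.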
Related
I have a dataframe as:
T1 T2 T3 timestamp
45.37 44.48 13 2015-11-05 10:23:00
44.94 44.55 13.37 2015-11-05 10:24:00
45.32 44.44 13.09 2015-11-05 10:27:00
45.46 44.51 13.29 2015-11-05 10:28:00
45.46 44.65 13.18 2015-11-05 10:29:16
45.96 44.85 13.23 2015-11-05 10:32:00
45.52 44.56 13.53 2015-11-05 10:36:00
45.36 44.62 13.25 2015-11-05 10:37:00
I want to create a new dataframe that contains the values of T1, T2 and T3 aggregated over 5-minute intervals based on the timestamp column. I did come across aggregate, and it seems to use one of the columns to group/aggregate the corresponding values in the other columns.
If a 5-minute interval has no rows, it should be represented with NAs. I would also like another column that indicates the number of items used to compute the average over each 5-minute interval.
I'm looking for the most efficient way of doing this in R. Thanks.
First make sure the timestamp column is a date-time column. You can skip this line if it is already in that format.
df1$timestamp <- as.POSIXct(df1$timestamp)
xts has some nice functions for working with time series, especially rolling functions and time-aggregating functions. In this case period.apply can help out.
library(xts)
# create xts object. Be sure to exclude the timestamp column otherwise you end up with a character matrix.
df1_xts <- as.xts(df1[, -4], order.by = df1$timestamp)
# sum per 5 minute intervals
df1_xts_summed <- period.apply(df1_xts, endpoints(df1_xts, on = "minutes", k = 5), colSums)
# count rows per 5 minute interval and add to data
df1_xts_summed$nrows <- period.apply(df1_xts$T1, endpoints(df1_xts, on = "minutes", k = 5), nrow)
df1_xts_summed
T1 T2 T3 nrows
2015-11-05 10:24:00 90.31 89.03 26.37 2
2015-11-05 10:29:16 136.24 133.60 39.56 3
2015-11-05 10:32:00 45.96 44.85 13.23 1
2015-11-05 10:37:00 90.88 89.18 26.78 2
If you want it all back into a data.frame:
df_final <- data.frame(timestamp = index(df1_xts_summed), coredata(df1_xts_summed))
df_final
timestamp T1 T2 T3 nrows
1 2015-11-05 10:24:00 90.31 89.03 26.37 2
2 2015-11-05 10:29:16 136.24 133.60 39.56 3
3 2015-11-05 10:32:00 45.96 44.85 13.23 1
4 2015-11-05 10:37:00 90.88 89.18 26.78 2
Edit: if you want everything rounded to 5 minutes, with those as the timestamps, you need to do the following:
The first step is to replace the timestamps with the 5-minute intervals, taking into account the starting minute of the timestamps. For this I use ceiling_date from the lubridate package and add to it the difference between the ceiling of the first timestamp and the first timestamp itself. This returns the last value of each interval. (If you want to use the start of the interval, use floor_date instead.)
df1$timestamp <- lubridate::ceiling_date(df1$timestamp, "5 mins") + difftime(lubridate::ceiling_date(first(df1$timestamp), "5 mins"), first(df1$timestamp), unit = "secs")
Next comes the same xts code as before, which returns the same data, but with the timestamp now being the last value of each 5-minute interval.
df1_xts <- as.xts(df1[, -4], order.by = df1$timestamp)
df1_xts_summed <- period.apply(df1_xts, endpoints(df1_xts, on = "minutes", k = 5), colSums)
df1_xts_summed$nrows <- period.apply(df1_xts$T1, endpoints(df1_xts, on = "minutes", k = 5), nrow)
df_final <- data.frame(timestamp = index(df1_xts_summed), coredata(df1_xts_summed))
df_final
timestamp T1 T2 T3 nrows
1 2015-11-05 10:27:00 90.31 89.03 26.37 2
2 2015-11-05 10:32:00 136.24 133.60 39.56 3
3 2015-11-05 10:37:00 45.96 44.85 13.23 1
4 2015-11-05 10:42:00 90.88 89.18 26.78 2
data:
df1 <- structure(list(T1 = c(45.37, 44.94, 45.32, 45.46, 45.46, 45.96,
45.52, 45.36), T2 = c(44.48, 44.55, 44.44, 44.51, 44.65, 44.85,
44.56, 44.62), T3 = c(13, 13.37, 13.09, 13.29, 13.18, 13.23,
13.53, 13.25), timestamp = c("2015-11-05 10:23:00", "2015-11-05 10:24:00",
"2015-11-05 10:27:00", "2015-11-05 10:28:00", "2015-11-05 10:29:16",
"2015-11-05 10:32:00", "2015-11-05 10:36:00", "2015-11-05 10:37:00"
)), class = "data.frame", row.names = c(NA, -8L))
I have a sample dataset of the trajectory of one bike. My objective is to figure out, on average, the amount of time that elapses between visits to station B.
So far, I have been able to simply order the dataset with:
test[order(test$starttime, decreasing = FALSE),]
and find the row index of where start_station and end_station equal B.
which(test$start_station == 'B')
which(test$end_station == 'B')
The next part is where I run into trouble. To calculate the time the bike spends away from Station B, we must take the difftime() between the record where start_station = "B" (the bike leaves) and the next occurring record where end_station = "B" (the bike returns), even if that happens to be the same row (see row 6).
Using the dataset below, we know that the bike spent 510 minutes between 7:30:00 and 16:00:00 outside of Station B, 30 minutes between 18:00:00 and 18:30:00 outside of Station B, and 210 minutes between 19:00:00 and 22:30:00 outside of Station B, which averages to 250 minutes.
How would one reproduce this output in R using difftime()?
> test
bikeid start_station starttime end_station endtime
1 1 A 2017-09-25 01:00:00 B 2017-09-25 01:30:00
2 1 B 2017-09-25 07:30:00 C 2017-09-25 08:00:00
3 1 C 2017-09-25 10:00:00 A 2017-09-25 10:30:00
4 1 A 2017-09-25 13:00:00 C 2017-09-25 13:30:00
5 1 C 2017-09-25 15:30:00 B 2017-09-25 16:00:00
6 1 B 2017-09-25 18:00:00 B 2017-09-25 18:30:00
7 1 B 2017-09-25 19:00:00 A 2017-09-25 19:30:00
8 1 A 2017-09-25 20:00:00 C 2017-09-25 20:30:00
9 1 C 2017-09-25 22:00:00 B 2017-09-25 22:30:00
10 1 B 2017-09-25 23:00:00 C 2017-09-25 23:30:00
Here is the sample data:
> dput(test)
structure(list(bikeid = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1), start_station = c("A",
"B", "C", "A", "C", "B", "B", "А", "C", "B"), starttime = structure(c(1506315600,
1506339000, 1506348000, 1506358800, 1506367800, 1506376800, 1506380400,
1506384000, 1506391200, 1506394800), class = c("POSIXct", "POSIXt"
), tzone = ""), end_station = c("B", "C", "A", "C", "B", "B",
"A", "C", "B", "C"), endtime = structure(c(1506317400, 1506340800,
1506349800, 1506360600, 1506369600, 1506378600, 1506382200, 1506385800,
1506393000, 1506396600), class = c("POSIXct", "POSIXt"), tzone = "")), .Names = c("bikeid",
"start_station", "starttime", "end_station", "endtime"), row.names = c(NA,
-10L), class = "data.frame")
This will calculate the differences as asked, in the order they occur, but does not append them to the data.frame:
# For each departure from B, find the first later arrival back at B
# and take the time difference in minutes
lapply(df1$starttime[df1$start_station == "B"],
       function(x, et) difftime(et[x < et][1], x, units = "mins"),
       et = df1$endtime[df1$end_station == "B"])
[[1]]
Time difference of 510 mins
[[2]]
Time difference of 30 mins
[[3]]
Time difference of 210 mins
[[4]]
Time difference of NA mins
To calculate the average time:
v1 <- sapply(df1$starttime[df1$start_station == "B"], function(x, et) difftime(et[x < et][1], x, units = "mins"), et = df1$endtime[df1$end_station == "B"])
mean(v1, na.rm = TRUE)
[1] 250
Another possibility:
library(data.table)
d <- setDT(test)[ , {
start = starttime[start_station == "B"]
end = endtime[end_station == "B"]
.(start = start, end = end, duration = difftime(end, start, units = "min"))
}
, by = .(trip = cumsum(start_station == "B"))]
d
# trip start end duration
# 1: 0 <NA> 2017-09-25 01:30:00 NA mins
# 2: 1 2017-09-25 07:30:00 2017-09-25 16:00:00 510 mins
# 3: 2 2017-09-25 18:00:00 2017-09-25 18:30:00 30 mins
# 4: 3 2017-09-25 19:00:00 2017-09-25 22:30:00 210 mins
# 5: 4 2017-09-25 23:00:00 <NA> NA mins
d[ , mean(duration, na.rm = TRUE)]
# Time difference of 250 mins
# or
d[ , mean(as.integer(duration), na.rm = TRUE)]
# [1] 250
The data is grouped by a counter which increases by 1 each time a bike starts from "B" (by = cumsum(start_station == "B")).
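For illustration, here is the counter produced for the sample data; group 0 collects everything before the first departure from B:
cumsum(test$start_station == "B")
# [1] 0 1 1 1 1 2 3 3 3 4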
I have a dataframe with the data below (the average of the values at timestamps 7:50 and 7:40 should be my value of A for timestamp 7:45).
Date_Time | A
7/28/2017 8:00| 443.75
7/28/2017 7:50| 440.75
7/28/2017 7:45| NA
7/28/2017 7:40| 447.5
7/28/2017 7:30| 448.75
7/28/2017 7:20| 444.5
7/28/2017 7:15| NA
7/28/2017 7:10| 440.25
7/28/2017 7:00| 447.5
I want to transform it into 15-minute intervals, something like below, using the mean:
Date / Time | Object Value
7/28/2017 8:00| 465
7/28/2017 7:45| 464.875
7/28/2017 7:30| 464.75
7/28/2017 7:15| 464.875
7/28/2017 7:00| 465
Update
The OP changed his or her desired output. Since I have no time to update my answer, I will leave it as it is. See my comment on the original post for how to use na.interpolation to fill in the missing values.
Original Post
This solution assumes you calculated the averages based on the values at 8:00, 7:30, and 7:00.
library(dplyr)
library(tidyr)
library(lubridate)
library(imputeTS)
dt2 <- dt %>%
mutate(Date.Time = mdy_hm(Date.Time)) %>%
filter(Date.Time %in% seq(min(Date.Time), max(Date.Time), by = "15 min")) %>%
complete(Date.Time = seq(min(Date.Time), max(Date.Time), by = "15 min")) %>%
mutate(Object.Value = na.interpolation(Object.Value)) %>%
fill(Object.Name) %>%
arrange(desc(Date.Time))
dt2
# A tibble: 5 x 3
Date.Time Object.Name Object.Value
<dttm> <chr> <dbl>
1 2017-07-28 08:00:00 a 465.000
2 2017-07-28 07:45:00 a 464.875
3 2017-07-28 07:30:00 a 464.750
4 2017-07-28 07:15:00 a 464.875
5 2017-07-28 07:00:00 a 465.000
Data
dt <- read.table(text = "'Date Time' 'Object Name' 'Object Value'
'7/28/2017 8:00' a 465
'7/28/2017 7:50' a 465
'7/28/2017 7:40' a 464.75
'7/28/2017 7:30' a 464.75
'7/28/2017 7:20' a 464.75
'7/28/2017 7:10' a 465
'7/28/2017 7:00' a 465",
header = TRUE, stringsAsFactors = FALSE)
If the values measured on the 10-minute intervals are time-integrated averages over that period, it's reasonable to average them to a different period. If these are instantaneous measurements, then it's more reasonable to smooth them as others have suggested.
To take time-integrated averages measured on the 10-minute schedule and average those to the 15-minute schedule, you can use the intervalaverage package:
library(data.table)
library(intervalaverage)
x <- structure(list(time = c("7/28/2017 8:00", "7/28/2017 7:50", "7/28/2017 7:45",
"7/28/2017 7:40", "7/28/2017 7:30", "7/28/2017 7:20", "7/28/2017 7:15",
"7/28/2017 7:10", "7/28/2017 7:00"), A = c(443.75, 440.75, NA,
447.5, 448.75, 444.5, NA, 440.25, 447.5)), row.names = c(NA,
-9L), class = "data.frame")
y <- structure(list(time = c("7/28/2017 8:00", "7/28/2017 7:45", "7/28/2017 7:30",
"7/28/2017 7:15", "7/28/2017 7:00")), row.names = c(NA, -5L), class = "data.frame")
setDT(x)
setDT(y)
x
#> time A
#> 1: 7/28/2017 8:00 443.75
#> 2: 7/28/2017 7:50 440.75
#> 3: 7/28/2017 7:45 NA
#> 4: 7/28/2017 7:40 447.50
#> 5: 7/28/2017 7:30 448.75
#> 6: 7/28/2017 7:20 444.50
#> 7: 7/28/2017 7:15 NA
#> 8: 7/28/2017 7:10 440.25
#> 9: 7/28/2017 7:00 447.50
y
#> time
#> 1: 7/28/2017 8:00
#> 2: 7/28/2017 7:45
#> 3: 7/28/2017 7:30
#> 4: 7/28/2017 7:15
#> 5: 7/28/2017 7:00
x[, time:=as.POSIXct(time,format='%m/%d/%Y %H:%M',tz = "UTC")]
setnames(x, "time","start_time")
x[, start_time_integer:=as.integer(start_time)]
y[, time:=as.POSIXct(time,format='%m/%d/%Y %H:%M',tz = "UTC")]
setnames(y, "time","start_time")
y[, start_time_integer:=as.integer(start_time)]
setkey(y, start_time)
setkey(x, start_time)
## drop the observations at 7:45 and 7:15
x <- x[!start_time %in% as.POSIXct(c("2017-07-28 07:45:00","2017-07-28 07:15:00"),tz="UTC")]
x[, end_time_integer:=as.integer(start_time)+60L*10L-1L]
x[, end_time:=as.POSIXct(end_time_integer,origin="1969-12-31 24:00:00",tz = "UTC")]
y[, end_time_integer:=as.integer(start_time)+60L*15L-1L]
y[, end_time:=as.POSIXct(end_time_integer,origin="1969-12-31 24:00:00",tz = "UTC")]
x
#> start_time A start_time_integer end_time_integer
#> 1: 2017-07-28 07:00:00 447.50 1501225200 1501225799
#> 2: 2017-07-28 07:10:00 440.25 1501225800 1501226399
#> 3: 2017-07-28 07:20:00 444.50 1501226400 1501226999
#> 4: 2017-07-28 07:30:00 448.75 1501227000 1501227599
#> 5: 2017-07-28 07:40:00 447.50 1501227600 1501228199
#> 6: 2017-07-28 07:50:00 440.75 1501228200 1501228799
#> 7: 2017-07-28 08:00:00 443.75 1501228800 1501229399
#> end_time
#> 1: 2017-07-28 07:09:59
#> 2: 2017-07-28 07:19:59
#> 3: 2017-07-28 07:29:59
#> 4: 2017-07-28 07:39:59
#> 5: 2017-07-28 07:49:59
#> 6: 2017-07-28 07:59:59
#> 7: 2017-07-28 08:09:59
y
#> start_time start_time_integer end_time_integer end_time
#> 1: 2017-07-28 07:00:00 1501225200 1501226099 2017-07-28 07:14:59
#> 2: 2017-07-28 07:15:00 1501226100 1501226999 2017-07-28 07:29:59
#> 3: 2017-07-28 07:30:00 1501227000 1501227899 2017-07-28 07:44:59
#> 4: 2017-07-28 07:45:00 1501227900 1501228799 2017-07-28 07:59:59
#> 5: 2017-07-28 08:00:00 1501228800 1501229699 2017-07-28 08:14:59
out <- intervalaverage(x,y,interval_vars=c("start_time_integer","end_time_integer"),value_vars="A")
out[, start_time:=as.POSIXct(start_time_integer,origin="1969-12-31 24:00:00",tz="UTC")]
out[, end_time:=as.POSIXct(end_time_integer,origin="1969-12-31 24:00:00",tz="UTC")]
out[, list(start_time,end_time, A)]
#> start_time end_time A
#> 1: 2017-07-28 07:00:00 2017-07-28 07:14:59 445.0833
#> 2: 2017-07-28 07:15:00 2017-07-28 07:29:59 443.0833
#> 3: 2017-07-28 07:30:00 2017-07-28 07:44:59 448.3333
#> 4: 2017-07-28 07:45:00 2017-07-28 07:59:59 443.0000
#> 5: 2017-07-28 08:00:00 2017-07-28 08:14:59 NA
#Note that this just equivalent to taking weighted.mean:
weighted.mean(c(447.5,440.25),w=c(10,5))
#> [1] 445.0833
weighted.mean(c(440.25,444.5),w=c(5,10))
#> [1] 443.0833
#etc
Note that the intervalaverage package requires integer columns defining closed intervals, hence the conversion to integer. The integers are converted back to datetimes (POSIXct) at the end for readability.
I've downloaded a list of every Bitcoin transaction on a large exchange since 2013. What I have now looks like this:
Time Price Volume
1 2013-03-31 22:07:49 93.3 80.628518
2 2013-03-31 22:08:13 100.0 20.000000
3 2013-03-31 22:08:14 100.0 1.000000
4 2013-03-31 22:08:16 100.0 5.900000
5 2013-03-31 22:08:19 100.0 29.833879
6 2013-03-31 22:08:21 100.0 20.000000
7 2013-03-31 22:08:25 100.0 10.000000
8 2013-03-31 22:08:29 100.0 1.000000
9 2013-03-31 22:08:31 100.0 5.566121
10 2013-03-31 22:09:27 93.3 33.676862
I'm trying to work with the data in R, but my computer isn't powerful enough to handle processing it when I run getSymbols(BTC_XTS). I'm trying to convert it to a format like the following (price action over a day):
Date Open High Low Close Volume Adj.Close
1 2014-04-11 32.64 33.48 32.15 32.87 28040700 32.87
2 2014-04-10 34.88 34.98 33.09 33.40 33970700 33.40
3 2014-04-09 34.19 35.00 33.95 34.87 21597500 34.87
4 2014-04-08 33.10 34.43 33.02 33.83 35440300 33.83
5 2014-04-07 34.11 34.37 32.53 33.07 47770200 33.07
6 2014-04-04 36.01 36.05 33.83 34.26 41049900 34.26
7 2014-04-03 36.66 36.79 35.51 35.76 16792000 35.76
8 2014-04-02 36.68 36.86 36.56 36.64 14522800 36.64
9 2014-04-01 36.16 36.86 36.15 36.49 15734000 36.49
10 2014-03-31 36.46 36.58 35.73 35.90 15153200 35.90
I'm new to R, and any response would be greatly appreciated!
I don't know what you could mean when you say your "computer isn't powerful enough to handle processing it when [you] run getSymbols(BTC_XTS)". getSymbols retrieves data... why do you need to retrieve data you already have?
Also, you have no adjusted close data, so it's not possible to have an Adj.Close column in the output.
You can get what you want by coercing your input data to xts and calling to.daily on it. For example:
require(xts)
Data <- structure(list(Time = c("2013-03-31 22:07:49", "2013-03-31 22:08:13",
"2013-03-31 22:08:14", "2013-03-31 22:08:16", "2013-03-31 22:08:19",
"2013-03-31 22:08:21", "2013-03-31 22:08:25", "2013-03-31 22:08:29",
"2013-03-31 22:08:31", "2013-03-31 22:09:27"), Price = c(93.3,
100, 100, 100, 100, 100, 100, 100, 100, 93.3), Volume = c(80.628518,
20, 1, 5.9, 29.833879, 20, 10, 1, 5.566121, 33.676862)), .Names = c("Time",
"Price", "Volume"), class = "data.frame", row.names = c(NA, -10L))
x <- xts(Data[,-1], as.POSIXct(Data[,1]))
d <- to.daily(x, name="BTC")
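If you then want a plain data.frame with a Date column, a small sketch of the conversion back (the OHLC column names carry the "BTC" prefix supplied via the name argument):
# convert the daily xts back to a data.frame; index(d) holds the dates
df_daily <- data.frame(Date = index(d), coredata(d), row.names = NULL)
df_daily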
BACKGROUND
dplyr has window functions. When you want to control the order in which a window function is applied, you can use order_by.
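For reference, a minimal illustration of order_by() adapted from the dplyr documentation; the second argument is evaluated as if the data were sorted by the first:
library(dplyr)
# cumulative sum computed as if the vector were ordered by 10:1,
# then returned in the original order
order_by(10:1, cumsum(1:10))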
DATA
mydf <- data.frame(id = c("ana", "bob", "caroline",
"bob", "ana", "caroline"),
order = as.POSIXct(c("2015-01-01 18:00:00", "2015-01-01 18:05:00",
"2015-01-01 19:20:00", "2015-01-01 09:07:00",
"2015-01-01 08:30:00", "2015-01-01 11:11:00"),
format = "%Y-%m-%d %H:%M:%S"),
value = runif(6, 10, 20),
stringsAsFactors = FALSE)
# id order value
#1 ana 2015-01-01 18:00:00 19.00659
#2 bob 2015-01-01 18:05:00 13.64010
#3 caroline 2015-01-01 19:20:00 12.08506
#4 bob 2015-01-01 09:07:00 14.40996
#5 ana 2015-01-01 08:30:00 17.45165
#6 caroline 2015-01-01 11:11:00 14.50865
Suppose you want to use lag(), you can do the following.
arrange(mydf, id, order) %>%
group_by(id) %>%
mutate(check = lag(value))
# id order value check
#1 ana 2015-01-01 08:30:00 17.45165 NA
#2 ana 2015-01-01 18:00:00 19.00659 17.45165
#3 bob 2015-01-01 09:07:00 14.40996 NA
#4 bob 2015-01-01 18:05:00 13.64010 14.40996
#5 caroline 2015-01-01 11:11:00 14.50865 NA
#6 caroline 2015-01-01 19:20:00 12.08506 14.50865
However, you can avoid using arrange() with order_by().
group_by(mydf, id) %>%
mutate(check = lag(value, order_by = order))
# id order value check
#1 ana 2015-01-01 18:00:00 19.00659 17.45165
#2 bob 2015-01-01 18:05:00 13.64010 14.40996
#3 caroline 2015-01-01 19:20:00 12.08506 14.50865
#4 bob 2015-01-01 09:07:00 14.40996 NA
#5 ana 2015-01-01 08:30:00 17.45165 NA
#6 caroline 2015-01-01 11:11:00 14.50865 NA
EXPERIMENT
I wanted to apply the same procedure to the case in which I want to assign row numbers in a new column. Using the sample data, you can do the following.
group_by(mydf, id) %>%
arrange(order) %>%
mutate(num = row_number())
# id order value num
#1 ana 2015-01-01 08:30:00 17.45165 1
#2 ana 2015-01-01 18:00:00 19.00659 2
#3 bob 2015-01-01 09:07:00 14.40996 1
#4 bob 2015-01-01 18:05:00 13.64010 2
#5 caroline 2015-01-01 11:11:00 14.50865 1
#6 caroline 2015-01-01 19:20:00 12.08506 2
Can we omit the arrange line? Looking at the CRAN manual, I tried the following.
Neither attempt was successful.
### Not working
group_by(mydf, id) %>%
mutate(num = row_number(order_by = order))
### Not working
group_by(mydf, id) %>%
mutate(num = order_by(order, row_number()))
How can we achieve this?
I did not mean to answer this question myself, but I decided to share what I found, given that I have not seen many posts using order_by and particularly with_order. My answer is to use with_order() instead of order_by().
group_by(mydf, id) %>%
mutate(num = with_order(order_by = order, fun = row_number, x = order))
# id order value num
#1 ana 2015-01-01 18:00:00 19.00659 2
#2 bob 2015-01-01 18:05:00 13.64010 2
#3 caroline 2015-01-01 19:20:00 12.08506 2
#4 bob 2015-01-01 09:07:00 14.40996 1
#5 ana 2015-01-01 08:30:00 17.45165 1
#6 caroline 2015-01-01 11:11:00 14.50865 1
I wanted to see if there would be any difference between the two
approaches in terms of speed. It seems that they are pretty similar in this case.
library(microbenchmark)
mydf2 <- data.frame(id = rep(c("ana", "bob", "caroline",
"bob", "ana", "caroline"), times = 200000),
order = seq(as.POSIXct("2015-03-01 18:00:00", format = "%Y-%m-%d %H:%M:%S"),
as.POSIXct("2015-01-01 18:00:00", format = "%Y-%m-%d %H:%M:%S"),
length.out = 1200000),
value = runif(1200000, 10, 20),
stringsAsFactors = FALSE)
jazz1 <- function() {group_by(mydf2, id) %>%
arrange(order) %>%
mutate(num = row_number())}
jazz2 <- function() {group_by(mydf2, id) %>%
mutate(num = with_order(order_by = order, fun = row_number, x = order))}
res <- microbenchmark(jazz1, jazz2, times = 1000000L)
res
#Unit: nanoseconds
# expr min lq mean median uq max neval cld
# jazz1 32 36 47.17647 38 47 12308 1e+06 a
# jazz2 32 36 47.02902 38 47 12402 1e+06 a