Cut by interval and aggregate over one month in R

I have the given data - all bike trips that started from a particular station over the month of October 2013. I'd like to count the number of trips that occurred within each ten-minute interval of the day. There should be a total of 144 rows, each with the sum of all trips that occurred within that interval over the entire month. How would one cut the data.frame and then aggregate by interval (so that trips occurring between 00:00:01 and 00:10:00 are counted in the second row, between 00:10:01 and 00:20:00 in the third row, and so on)?
head(one.station)
tripduration starttime stoptime start.station.id start.station.name
59 803 2013-10-01 00:11:49 2013-10-01 00:25:12 521 8 Ave & W 31 St
208 445 2013-10-01 00:40:05 2013-10-01 00:47:30 521 8 Ave & W 31 St
359 643 2013-10-01 01:25:57 2013-10-01 01:36:40 521 8 Ave & W 31 St
635 388 2013-10-01 05:30:30 2013-10-01 05:36:58 521 8 Ave & W 31 St
661 314 2013-10-01 05:38:00 2013-10-01 05:43:14 521 8 Ave & W 31 St
768 477 2013-10-01 05:54:49 2013-10-01 06:02:46 521 8 Ave & W 31 St
start.station.latitude start.station.longitude end.station.id end.station.name
59 40.75045 -73.99481 2003 1 Ave & E 18 St
208 40.75045 -73.99481 505 6 Ave & W 33 St
359 40.75045 -73.99481 508 W 46 St & 11 Ave
635 40.75045 -73.99481 459 W 20 St & 11 Ave
661 40.75045 -73.99481 462 W 22 St & 10 Ave
768 40.75045 -73.99481 457 Broadway & W 58 St
end.station.latitude end.station.longitude bikeid usertype birth.year gender
59 40.73416 -73.98024 15139 Subscriber 1985 1
208 40.74901 -73.98848 20538 Subscriber 1990 2
359 40.76341 -73.99667 19935 Customer \\N 0
635 40.74674 -74.00776 14781 Subscriber 1955 1
661 40.74692 -74.00452 17976 Subscriber 1982 1
768 40.76695 -73.98169 19022 Subscriber 1973 1
So that the output looks like this
output
interval total_trips
1 00:00:00 0
2 00:10:00 1
3 00:20:00 2
4 00:30:00 3
5 00:40:00 4

Here it is using only start time:
library(lubridate)
library(dplyr)
tripduration <- floor(runif(6) * 1000)
start_times <- as.POSIXlt(
c("2013-10-01 00:11:49"
,"2013-10-01 00:40:05"
,"2013-10-01 01:25:57"
,"2013-10-01 05:30:30"
,"2013-10-01 05:38:00"
,"2013-10-01 05:54:49")
)
time_bucket <- start_times - minutes(minute(start_times) %% 10) - seconds(second(start_times))
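# note: lubridate::floor_date(start_times, "10 minutes") should give the same
# bucket in one step (assumption: a lubridate version that supports multi-unit rounding)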
df <- data.frame(tripduration, start_times, time_bucket)
summarized <- df %>%
group_by(time_bucket) %>%
summarize(trip_count = n())
summarized <- as.data.frame(summarized)
out_buckets <- data.frame(out_buckets = seq(as.POSIXct("2013-10-01 00:00:00"), as.POSIXct("2013-10-01 06:00:00"), by = 600))
out <- left_join(out_buckets, summarized, by = c("out_buckets" = "time_bucket"))
out$trip_count[is.na(out$trip_count)] <- 0
out
out_buckets trip_count
1 2013-10-01 00:00:00 0
2 2013-10-01 00:10:00 1
3 2013-10-01 00:20:00 0
4 2013-10-01 00:30:00 0
5 2013-10-01 00:40:00 1
6 2013-10-01 00:50:00 0
7 2013-10-01 01:00:00 0
8 2013-10-01 01:10:00 0
9 2013-10-01 01:20:00 1
10 2013-10-01 01:30:00 0
11 2013-10-01 01:40:00 0
12 2013-10-01 01:50:00 0
13 2013-10-01 02:00:00 0
14 2013-10-01 02:10:00 0
15 2013-10-01 02:20:00 0
16 2013-10-01 02:30:00 0
17 2013-10-01 02:40:00 0
18 2013-10-01 02:50:00 0
19 2013-10-01 03:00:00 0
20 2013-10-01 03:10:00 0
21 2013-10-01 03:20:00 0
22 2013-10-01 03:30:00 0
23 2013-10-01 03:40:00 0
24 2013-10-01 03:50:00 0
25 2013-10-01 04:00:00 0
26 2013-10-01 04:10:00 0
27 2013-10-01 04:20:00 0
28 2013-10-01 04:30:00 0
29 2013-10-01 04:40:00 0
30 2013-10-01 04:50:00 0
31 2013-10-01 05:00:00 0
32 2013-10-01 05:10:00 0
33 2013-10-01 05:20:00 0
34 2013-10-01 05:30:00 2
35 2013-10-01 05:40:00 0
36 2013-10-01 05:50:00 1
37 2013-10-01 06:00:00 0
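Note that the buckets above are full timestamps for a single day. To get the 144 time-of-day rows the question asks for, aggregated over the whole month, one could additionally collapse each bucket to its clock time. A minimal sketch building on the out data frame above:
# collapse the datetime buckets to time-of-day labels and sum over the month
out$interval <- format(out$out_buckets, "%H:%M:%S")
monthly <- aggregate(trip_count ~ interval, data = out, FUN = sum)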

The lubridate library can provide one solution, since it has a nice function for interval overlap logic. The code below uses lapply to loop through the intervals in the data and bucket them accordingly.
library(lubridate)
start_times <- as.POSIXlt(
c("2013-10-01 00:11:49"
,"2013-10-01 00:40:05"
,"2013-10-01 01:25:57"
,"2013-10-01 05:30:30"
,"2013-10-01 05:38:00"
,"2013-10-01 05:54:49")
)
stop_times <- as.POSIXlt(
c("2013-10-01 00:25:12"
,"2013-10-01 00:47:30"
,"2013-10-01 01:36:40"
,"2013-10-01 05:36:58"
,"2013-10-01 05:43:14"
,"2013-10-01 06:02:46")
)
start_bucket <- seq(as.POSIXct("2013-10-01 00:00:00"), as.POSIXct("2013-10-01 06:00:00"), by = 600)
end_bucket <- start_bucket + 600
bucket_interval <- interval(start_bucket, end_bucket)
data_interval <- interval(start_times, stop_times)
int_list <- lapply(data_interval, function(x) ifelse(int_overlaps(x, bucket_interval),1,0))
rides_per_bucket <- rowSums(do.call(cbind, int_list))
out_df <- data.frame(bucket_interval, rides_per_bucket)
out_df
bucket_interval rides_per_bucket
1 2013-10-01 00:00:00 PDT--2013-10-01 00:10:00 PDT 0
2 2013-10-01 00:10:00 PDT--2013-10-01 00:20:00 PDT 1
3 2013-10-01 00:20:00 PDT--2013-10-01 00:30:00 PDT 1
4 2013-10-01 00:30:00 PDT--2013-10-01 00:40:00 PDT 0
5 2013-10-01 00:40:00 PDT--2013-10-01 00:50:00 PDT 1
6 2013-10-01 00:50:00 PDT--2013-10-01 01:00:00 PDT 0
7 2013-10-01 01:00:00 PDT--2013-10-01 01:10:00 PDT 0
8 2013-10-01 01:10:00 PDT--2013-10-01 01:20:00 PDT 0
9 2013-10-01 01:20:00 PDT--2013-10-01 01:30:00 PDT 1
10 2013-10-01 01:30:00 PDT--2013-10-01 01:40:00 PDT 1
11 2013-10-01 01:40:00 PDT--2013-10-01 01:50:00 PDT 0
12 2013-10-01 01:50:00 PDT--2013-10-01 02:00:00 PDT 0
13 2013-10-01 02:00:00 PDT--2013-10-01 02:10:00 PDT 0
14 2013-10-01 02:10:00 PDT--2013-10-01 02:20:00 PDT 0
15 2013-10-01 02:20:00 PDT--2013-10-01 02:30:00 PDT 0
16 2013-10-01 02:30:00 PDT--2013-10-01 02:40:00 PDT 0
17 2013-10-01 02:40:00 PDT--2013-10-01 02:50:00 PDT 0
18 2013-10-01 02:50:00 PDT--2013-10-01 03:00:00 PDT 0
19 2013-10-01 03:00:00 PDT--2013-10-01 03:10:00 PDT 0
20 2013-10-01 03:10:00 PDT--2013-10-01 03:20:00 PDT 0
21 2013-10-01 03:20:00 PDT--2013-10-01 03:30:00 PDT 0
22 2013-10-01 03:30:00 PDT--2013-10-01 03:40:00 PDT 0
23 2013-10-01 03:40:00 PDT--2013-10-01 03:50:00 PDT 0
24 2013-10-01 03:50:00 PDT--2013-10-01 04:00:00 PDT 0
25 2013-10-01 04:00:00 PDT--2013-10-01 04:10:00 PDT 0
26 2013-10-01 04:10:00 PDT--2013-10-01 04:20:00 PDT 0
27 2013-10-01 04:20:00 PDT--2013-10-01 04:30:00 PDT 0
28 2013-10-01 04:30:00 PDT--2013-10-01 04:40:00 PDT 0
29 2013-10-01 04:40:00 PDT--2013-10-01 04:50:00 PDT 0
30 2013-10-01 04:50:00 PDT--2013-10-01 05:00:00 PDT 0
31 2013-10-01 05:00:00 PDT--2013-10-01 05:10:00 PDT 0
32 2013-10-01 05:10:00 PDT--2013-10-01 05:20:00 PDT 0
33 2013-10-01 05:20:00 PDT--2013-10-01 05:30:00 PDT 0
34 2013-10-01 05:30:00 PDT--2013-10-01 05:40:00 PDT 2
35 2013-10-01 05:40:00 PDT--2013-10-01 05:50:00 PDT 1
36 2013-10-01 05:50:00 PDT--2013-10-01 06:00:00 PDT 1
37 2013-10-01 06:00:00 PDT--2013-10-01 06:10:00 PDT 1
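To present this in the shape the question asks for (an interval label plus a count), one could label each row with the start of its bucket. A small sketch using lubridate's int_start():
# label each bucket by its start time instead of the full interval
out_df$interval <- format(int_start(bucket_interval), "%H:%M:%S")
out_df[, c("interval", "rides_per_bucket")]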

Related

Create a function to filter two columns in R

I have to replicate this code for 4 different places and 4 different years.
df1 <- df %>% filter(Place == "Al" & year==2016)
rollingMean(df1, pollutant = "O", hours=8, new.name = "mean", data.thresh=75)
Sample of data:
Place O date_time year
Al 23 2016-01-01 01:00:00 2016
Al 15 2016-01-01 02:00:00 2016
Al 18 2016-01-01 03:00:00 2016
Al 18 2016-01-01 04:00:00 2016
Al 20 2016-01-01 05:00:00 2016
Al 21 2016-01-01 06:00:00 2016
Ar 23 2016-01-01 01:00:00 2016
Ar 15 2016-01-01 02:00:00 2016
Ar 18 2016-01-01 03:00:00 2016
Ar 18 2016-01-01 04:00:00 2016
Ar 20 2016-01-01 05:00:00 2016
Ar 21 2016-01-01 06:00:00 2016
Ma 23 2016-01-01 01:00:00 2016
Ma 15 2016-01-01 02:00:00 2016
Ma 18 2016-01-01 03:00:00 2016
Ma 18 2016-01-01 04:00:00 2016
Ma 20 2016-01-01 05:00:00 2016
Ma 21 2016-01-01 06:00:00 2016
Ss 23 2016-01-01 01:00:00 2016
Ss 15 2016-01-01 02:00:00 2016
Ss 18 2016-01-01 03:00:00 2016
Ss 18 2016-01-01 04:00:00 2016
Ss 20 2016-01-01 05:00:00 2016
Ss 21 2016-01-01 06:00:00 2016
How can I optimize my code? I think I need a loop or map, but this is my first time doing this.
You can split the dataset by every unique combination of Place and year, use map to run the rollingMean function on each group, and combine the results into one dataframe.
library(dplyr)
library(purrr)
result <- df %>%
group_split(Place, year) %>%
map_df(~rollingMean(.x, pollutant = "O", hours=8,
new.name = "mean", data.thresh=75))
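If you prefer base R, the same idea can be sketched with split() and lapply(), reusing the rollingMean call exactly as written in the question:
# split into one data frame per Place/year combination, apply, then recombine
groups <- split(df, list(df$Place, df$year), drop = TRUE)
result <- do.call(rbind, lapply(groups, function(d)
  rollingMean(d, pollutant = "O", hours = 8, new.name = "mean", data.thresh = 75)))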

How to transform a datetime column from a non-UTC timezone to UTC in R without losing data on the days when there is a time change

I have a data frame df1 with a datetime column in UTC. I need to merge this dataframe with the data frame df2 by the column datetime. My problem is that df2 is in the Europe/Paris timezone, and when I transform df2$datetime from Europe/Paris to UTC, I lose or duplicate data at the moments when the clocks change between summer and winter time. As an example:
df1<- data.frame(datetime=c("2016-10-29 22:00:00","2016-10-29 23:00:00","2016-10-30 00:00:00","2016-10-30 01:00:00","2016-10-30 02:00:00","2016-10-30 03:00:00","2016-10-30 04:00:00","2016-10-30 05:00:00","2017-03-25 22:00:00","2017-03-25 23:00:00","2017-03-26 00:00:00","2017-03-26 01:00:00","2017-03-26 02:00:00","2017-03-26 03:00:00","2017-03-26 04:00:00"), Var1= c(4, 56, 76, 54, 34, 3, 4, 6, 78, 23, 12, 3, 5, 6, 7))
df1$datetime<- as.POSIXct(df1$datetime, format = "%Y-%m-%d %H", tz= "UTC")
df2<- data.frame(datetime=c("2016-10-29 22:00:00","2016-10-29 23:00:00","2016-10-30 00:00:00","2016-10-30 01:00:00","2016-10-30 02:00:00","2016-10-30 03:00:00","2016-10-30 04:00:00","2016-10-30 05:00:00","2017-03-25 22:00:00","2017-03-25 23:00:00","2017-03-26 00:00:00","2017-03-26 01:00:00","2017-03-26 02:00:00","2017-03-26 03:00:00","2017-03-26 04:00:00"), Var2=c(56, 43, 23, 14, 51, 27, 89, 76, 56, 4, 35, 23, 4, 62, 84))
df2$datetime<- as.POSIXct(df2$datetime, format = "%Y-%m-%d %H", tz= "Europe/Paris")
df1
datetime Var1
1 2016-10-29 22:00:00 4
2 2016-10-29 23:00:00 56
3 2016-10-30 00:00:00 76
4 2016-10-30 01:00:00 54
5 2016-10-30 02:00:00 34
6 2016-10-30 03:00:00 3
7 2016-10-30 04:00:00 4
8 2016-10-30 05:00:00 6
9 2017-03-25 22:00:00 78
10 2017-03-25 23:00:00 23
11 2017-03-26 00:00:00 12
12 2017-03-26 01:00:00 3
13 2017-03-26 02:00:00 5
14 2017-03-26 03:00:00 6
15 2017-03-26 04:00:00 7
df2
datetime Var2
1 2016-10-29 22:00:00 56
2 2016-10-29 23:00:00 43
3 2016-10-30 00:00:00 23
4 2016-10-30 01:00:00 14
5 2016-10-30 02:00:00 51
6 2016-10-30 03:00:00 27
7 2016-10-30 04:00:00 89
8 2016-10-30 05:00:00 76
9 2017-03-25 22:00:00 56
10 2017-03-25 23:00:00 4
11 2017-03-26 00:00:00 35
12 2017-03-26 01:00:00 23
13 2017-03-26 02:00:00 4
14 2017-03-26 03:00:00 62
15 2017-03-26 04:00:00 84
When I convert df2$datetime from Europe/Paris to UTC, this happens:
library(lubridate)
df2$datetime<-with_tz(df2$datetime,"UTC")
df2
datetime Var2
1 2016-10-29 20:00:00 56
2 2016-10-29 21:00:00 43
3 2016-10-29 22:00:00 23
4 2016-10-29 23:00:00 14
5 2016-10-30 00:00:00 51
6 2016-10-30 02:00:00 27 # Data at 01:00:00 is missing
7 2016-10-30 03:00:00 89
8 2016-10-30 04:00:00 76
9 2017-03-25 21:00:00 56
10 2017-03-25 22:00:00 4
11 2017-03-25 23:00:00 35
12 2017-03-26 00:00:00 23
13 2017-03-26 00:00:00 4 # There is a duplicate at 00:00:00
14 2017-03-26 01:00:00 62
15 2017-03-26 02:00:00 84
16 2017-03-26 03:00:00 56
Is there another way to transform df2$datetime from Europe/Paris to UTC that allows me to merge the two data frames without this problem of lost or duplicated data? I don't understand why I have to lose or duplicate info in df2.
Is the transformation I did on df2$datetime right for merging this data frame with df1? What I've done so far to solve this is to add a new row in df2 on 2016-10-30 at 01:00:00 that is the mean between 2016-10-30 00:00:00 and 2016-10-30 02:00:00, and to remove one row on 2017-03-26 at 00:00:00.
Thanks for your help.
I found out that my original df2 should be like this:
df2
datetime Var1
1 2016-10-29 22:00:00 4 # This is time in format "GMT+2". It corresponds to 20:00 UTC
2 2016-10-29 23:00:00 56 # This is time in format "GMT+2". It corresponds to 21:00 UTC
3 2016-10-30 00:00:00 76 # This is time in format "GMT+2". It corresponds to 22:00 UTC
4 2016-10-30 01:00:00 54 # This is time in format "GMT+2". It corresponds to 23:00 UTC
5 2016-10-30 02:00:00 34 # This is time in format "GMT+2". It corresponds to 00:00 UTC
6 2016-10-30 02:00:00 3 # This is time in format "GMT+1". It corresponds to 01:00 UTC
7 2016-10-30 03:00:00 4 # This is time in format "GMT+1". It corresponds to 02:00 UTC
8 2016-10-30 04:00:00 6 # This is time in format "GMT+1". It corresponds to 03:00 UTC
9 2016-10-30 05:00:00 78 # This is time in format "GMT+1". It corresponds to 04:00 UTC
10 2017-03-25 22:00:00 23 # This is time in format "GMT+1". It corresponds to 21:00 UTC
11 2017-03-25 23:00:00 12 # This is time in format "GMT+1". It corresponds to 22:00 UTC
12 2017-03-26 00:00:00 3 # This is time in format "GMT+1". It corresponds to 23:00 UTC
13 2017-03-26 01:00:00 5 # This is time in format "GMT+1". It corresponds to 00:00 UTC
14 2017-03-26 03:00:00 6 # This is time in format "GMT+2". It corresponds to 01:00 UTC
15 2017-03-26 04:00:00 7 # This is time in format "GMT+2". It corresponds to 02:00 UTC
16 2017-03-26 05:00:00 76 # This is time in format "GMT+2". It corresponds to 03:00 UTC
However, my original df2 doesn't have duplicated or lost time data. It is like this:
df2
datetime Var1
1 2016-10-29 22:00:00 4
2 2016-10-29 23:00:00 56
3 2016-10-30 00:00:00 76
4 2016-10-30 01:00:00 54
5 2016-10-30 02:00:00 34
6 2016-10-30 03:00:00 3
7 2016-10-30 04:00:00 4
8 2016-10-30 05:00:00 6
9 2017-03-25 22:00:00 78
10 2017-03-25 23:00:00 23
11 2017-03-26 00:00:00 12
12 2017-03-26 01:00:00 3
13 2017-03-26 02:00:00 5
14 2017-03-26 03:00:00 6
15 2017-03-26 04:00:00 7
16 2017-03-26 05:00:00 76
When I applied the R code df2$datetime<-with_tz(df2$datetime,"UTC"), this happens:
df2
datetime Var1
1 2016-10-29 20:00:00 4
2 2016-10-29 21:00:00 56
3 2016-10-29 22:00:00 76
4 2016-10-29 23:00:00 54
5 2016-10-30 00:00:00 34
6 2016-10-30 02:00:00 3 # I have to manually add a new row between the times "00:00" and "02:00"
7 2016-10-30 03:00:00 4
8 2016-10-30 04:00:00 6
9 2017-03-25 21:00:00 78
10 2017-03-25 22:00:00 23
11 2017-03-25 23:00:00 12
12 2017-03-26 00:00:00 3
13 2017-03-26 01:00:00 5 # I have to manually remove one of the rows referring to the time "01:00".
14 2017-03-26 01:00:00 6
15 2017-03-26 02:00:00 7
16 2017-03-26 03:00:00 76
If my original df2 had one duplicate at "02:00:00" on 30th October and a gap on 26th March between "01:00" and "03:00", I would get the following with the R code df2$datetime<-with_tz(df2$datetime,"UTC"):
df2
datetime Var1
1 2016-10-29 20:00:00 4
2 2016-10-29 21:00:00 56
3 2016-10-29 22:00:00 76
4 2016-10-29 23:00:00 54
5 2016-10-30 00:00:00 34
6 2016-10-30 00:00:00 3 # I just have to change this "00:00:00" to "01:00:00"
7 2016-10-30 02:00:00 4
8 2016-10-30 03:00:00 6
9 2016-10-30 04:00:00 78
10 2017-03-25 21:00:00 23
11 2017-03-25 22:00:00 12
12 2017-03-25 23:00:00 3
13 2017-03-26 00:00:00 5
14 2017-03-26 01:00:00 6
15 2017-03-26 02:00:00 7
16 2017-03-26 03:00:00 76
#As there are several versions of df2, I use the one shown in the question
df2 <- read.table(text = "
datetime Var2
1 '2016-10-29 22:00:00' 56
2 '2016-10-29 23:00:00' 43
3 '2016-10-30 00:00:00' 23
4 '2016-10-30 01:00:00' 14
5 '2016-10-30 02:00:00' 51
6 '2016-10-30 03:00:00' 27
7 '2016-10-30 04:00:00' 89
8 '2016-10-30 05:00:00' 76
9 '2017-03-25 22:00:00' 56
10 '2017-03-25 23:00:00' 4
11 '2017-03-26 00:00:00' 35
12 '2017-03-26 01:00:00' 23
13 '2017-03-26 02:00:00' 4
14 '2017-03-26 03:00:00' 62
15 '2017-03-26 04:00:00' 84
", header = TRUE)
library(lubridate)
#When you define the timezone here, the content of df2 is already changed
df2$datetimeEP <- as.POSIXct(df2$datetime, format = "%Y-%m-%d %H", tz= "Europe/Paris")
#df2[13,]
# datetime Var2 datetimeEP
#13 2017-03-26 02:00:00 4 2017-03-26 01:00:00
#It looks like your recorded times don't observe "daylight savings time",
#so you have to use e.g. "Etc/GMT-1" instead of "Europe/Paris"
df2$datetimeG1 <- as.POSIXct(df2$datetime, format = "%Y-%m-%d %H", tz= "Etc/GMT-1")
data.frame(datetime=df2$datetime, utc=with_tz(df2$datetimeG1,"UTC"))
# datetime utc
#1 2016-10-29 22:00:00 2016-10-29 21:00:00
#2 2016-10-29 23:00:00 2016-10-29 22:00:00
#3 2016-10-30 00:00:00 2016-10-29 23:00:00
#4 2016-10-30 01:00:00 2016-10-30 00:00:00
#5 2016-10-30 02:00:00 2016-10-30 01:00:00
#6 2016-10-30 03:00:00 2016-10-30 02:00:00
#7 2016-10-30 04:00:00 2016-10-30 03:00:00
#8 2016-10-30 05:00:00 2016-10-30 04:00:00
#9 2017-03-25 22:00:00 2017-03-25 21:00:00
#10 2017-03-25 23:00:00 2017-03-25 22:00:00
#11 2017-03-26 00:00:00 2017-03-25 23:00:00
#12 2017-03-26 01:00:00 2017-03-26 00:00:00
#13 2017-03-26 02:00:00 2017-03-26 01:00:00
#14 2017-03-26 03:00:00 2017-03-26 02:00:00
#15 2017-03-26 04:00:00 2017-03-26 03:00:00
#You can use "dst" to see whether a datetime in a given time zone observes "daylight savings time"
dst(df2$datetimeEP)
dst(df2$datetimeG1)
dst(with_tz(df2$datetimeEP,"UTC"))
dst(with_tz(df2$datetimeG1,"UTC"))
#If your recorded times do observe "daylight savings time", then you really HAVE a gap and an overlap.
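#Once df2 holds true UTC instants, the original goal of merging with df1 is
#straightforward; a minimal sketch assuming df1 is built as in the question:
df2$datetime_utc <- with_tz(df2$datetimeG1, "UTC")
merged <- merge(df1, df2[, c("datetime_utc", "Var2")],
                by.x = "datetime", by.y = "datetime_utc")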

Flag first instance of an event occurring contingent on other variable's value

I'm new to R and to solving this kind of problem, so I'm not sure how certain functionality is achieved in particular instances.
I have a dataframe as such:
df <- data.frame(DATETIME = seq(from = as.POSIXct('2014-01-01 00:00', tz = "GMT"), to = as.POSIXct('2014-01-01 06:00', tz = "GMT"), by='15 mins'),
Price = c(23,22,23,24,27,31,33,34,31,26,24,23,19,18,19,19,23,25,26,26,27,30,26,25,24),
TroughPriceFlag = c(0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0))
library(data.table)
df <- data.table(df)
df
DATETIME Price TroughPriceFlag
1: 2014-01-01 00:00:00 23 0
2: 2014-01-01 00:15:00 22 1
3: 2014-01-01 00:30:00 23 0
4: 2014-01-01 00:45:00 24 0
5: 2014-01-01 01:00:00 27 0
6: 2014-01-01 01:15:00 31 0
7: 2014-01-01 01:30:00 33 0
8: 2014-01-01 01:45:00 34 0
9: 2014-01-01 02:00:00 31 0
10: 2014-01-01 02:15:00 26 0
11: 2014-01-01 02:30:00 24 0
12: 2014-01-01 02:45:00 23 0
13: 2014-01-01 03:00:00 19 0
14: 2014-01-01 03:15:00 18 1
15: 2014-01-01 03:30:00 19 0
16: 2014-01-01 03:45:00 19 0
17: 2014-01-01 04:00:00 23 0
18: 2014-01-01 04:15:00 25 0
19: 2014-01-01 04:30:00 26 0
20: 2014-01-01 04:45:00 26 0
21: 2014-01-01 05:00:00 27 0
22: 2014-01-01 05:15:00 30 0
23: 2014-01-01 05:30:00 26 0
24: 2014-01-01 05:45:00 25 0
25: 2014-01-01 06:00:00 24 0
What I wish to do is two things:
(1) Starting from each observed trough price, flag the first instance where the price has risen by 10 or more dollars. That is, find the first instance since the trough where deltaPrice >= 10.
As an example: from the trough price of 22 (row 2), in the next interval the price increases to 23, a change of 1 dollar, so no flag. From the trough price of 22 (again row 2, since we are always referring to the trough price in question), two intervals later the price is 24 dollars, an increase of 2 dollars since the trough, so again no flag. However, from the trough price of 22, 5 intervals later the price has increased to 33 dollars, an increase of 11 dollars and the first time the price has risen by 10 dollars or more. Thus the flag is 1.
(2) Determine the number of 15 minute periods which have passed between the trough price and the first instance the price has risen by 10 or more dollars.
The resulting dataframe should look like this:
DATETIME Price TroughPriceFlag FirstOver10CentsFlag CountPeriods
1 2014-01-01 00:00:00 23 0 0 NA
2 2014-01-01 00:15:00 22 1 0 5
3 2014-01-01 00:30:00 23 0 0 NA
4 2014-01-01 00:45:00 24 0 0 NA
5 2014-01-01 01:00:00 27 0 0 NA
6 2014-01-01 01:15:00 31 0 0 NA
7 2014-01-01 01:30:00 33 0 1 NA
8 2014-01-01 01:45:00 34 0 0 NA
9 2014-01-01 02:00:00 31 0 0 NA
10 2014-01-01 02:15:00 26 0 0 NA
11 2014-01-01 02:30:00 24 0 0 NA
12 2014-01-01 02:45:00 23 0 0 NA
13 2014-01-01 03:00:00 19 0 0 NA
14 2014-01-01 03:15:00 18 1 0 8
15 2014-01-01 03:30:00 19 0 0 NA
16 2014-01-01 03:45:00 19 0 0 NA
17 2014-01-01 04:00:00 23 0 0 NA
18 2014-01-01 04:15:00 25 0 0 NA
19 2014-01-01 04:30:00 26 0 0 NA
20 2014-01-01 04:45:00 26 0 0 NA
21 2014-01-01 05:00:00 27 0 0 NA
22 2014-01-01 05:15:00 30 0 1 NA
23 2014-01-01 05:30:00 26 0 0 NA
24 2014-01-01 05:45:00 25 0 0 NA
25 2014-01-01 06:00:00 24 0 0 NA
I'm not really sure where to start, since the time gaps can be quite large and I've only used indexing in the context of a few steps forward/backward. Please help!
Thanks in advance
You can chain operations with the data.table package; the idea is to group by the cumulative sum of TroughPriceFlag:
library(data.table)
df[, col1:=pmatch(Price-Price[1]>=10,T, nomatch=0), cumsum(TroughPriceFlag)][
, count:=which(col1==1)-1,cumsum(TroughPriceFlag)][
TroughPriceFlag==0, count:=NA]
#> df
# DATETIME Price TroughPriceFlag col1 count
# 1: 2014-01-01 00:00:00 23 0 0 NA
# 2: 2014-01-01 00:15:00 22 1 0 5
# 3: 2014-01-01 00:30:00 23 0 0 NA
# 4: 2014-01-01 00:45:00 24 0 0 NA
# 5: 2014-01-01 01:00:00 27 0 0 NA
# 6: 2014-01-01 01:15:00 31 0 0 NA
# 7: 2014-01-01 01:30:00 33 0 1 NA
# 8: 2014-01-01 01:45:00 34 0 0 NA
# 9: 2014-01-01 02:00:00 31 0 0 NA
#10: 2014-01-01 02:15:00 26 0 0 NA
#11: 2014-01-01 02:30:00 24 0 0 NA
#12: 2014-01-01 02:45:00 23 0 0 NA
#13: 2014-01-01 03:00:00 19 0 0 NA
#14: 2014-01-01 03:15:00 18 1 0 8
#15: 2014-01-01 03:30:00 19 0 0 NA
#16: 2014-01-01 03:45:00 19 0 0 NA
#17: 2014-01-01 04:00:00 23 0 0 NA
#18: 2014-01-01 04:15:00 25 0 0 NA
#19: 2014-01-01 04:30:00 26 0 0 NA
#20: 2014-01-01 04:45:00 26 0 0 NA
#21: 2014-01-01 05:00:00 27 0 0 NA
#22: 2014-01-01 05:15:00 30 0 1 NA
#23: 2014-01-01 05:30:00 26 0 0 NA
#24: 2014-01-01 05:45:00 25 0 0 NA
#25: 2014-01-01 06:00:00 24 0 0 NA
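For readers less used to data.table chaining, here is an equivalent dplyr sketch of the same grouping idea (an alternative formulation, not the code from the answer above):
library(dplyr)
out <- df %>%
  mutate(grp = cumsum(TroughPriceFlag)) %>%   # one group per trough
  group_by(grp) %>%
  mutate(first_hit = match(TRUE, Price - first(Price) >= 10),
         FirstOver10Flag = coalesce(as.integer(row_number() == first_hit), 0L),
         CountPeriods = ifelse(TroughPriceFlag == 1, first_hit - 1L, NA_integer_)) %>%
  ungroup() %>%
  select(-grp, -first_hit)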

How to sort a dataframe by column and get the index?

I want to sort a dataframe in R by a column and add the ranking to a new column.
Specifically, I want to rank the price column in the data.frame below (ascending) for every day. Then, I want to add a column indicating the rank of every hour of the day.
library(dplyr)
prices <- data.frame(time = c("2014-07-01 00:00:00 CEST","2014-07-01 01:00:00 CEST","2014-07-01 02:00:00 CEST","2014-07-01 03:00:00 CEST",
"2014-07-01 04:00:00 CEST","2014-07-01 05:00:00 CEST","2014-07-01 06:00:00 CEST","2014-07-01 07:00:00 CEST",
"2014-07-01 08:00:00 CEST","2014-07-01 09:00:00 CEST","2014-07-01 10:00:00 CEST","2014-07-01 11:00:00 CEST",
"2014-07-01 12:00:00 CEST","2014-07-01 13:00:00 CEST","2014-07-01 14:00:00 CEST","2014-07-01 15:00:00 CEST",
"2014-07-01 16:00:00 CEST","2014-07-01 17:00:00 CEST","2014-07-01 18:00:00 CEST","2014-07-01 19:00:00 CEST",
"2014-07-01 20:00:00 CEST","2014-07-01 21:00:00 CEST","2014-07-01 22:00:00 CEST","2014-07-01 23:00:00 CEST",
"2014-07-02 00:00:00 CEST","2014-07-02 01:00:00 CEST","2014-07-02 02:00:00 CEST","2014-07-02 03:00:00 CEST",
"2014-07-02 04:00:00 CEST","2014-07-02 05:00:00 CEST","2014-07-02 06:00:00 CEST","2014-07-02 07:00:00 CEST",
"2014-07-02 08:00:00 CEST","2014-07-02 09:00:00 CEST","2014-07-02 10:00:00 CEST","2014-07-02 11:00:00 CEST",
"2014-07-02 12:00:00 CEST","2014-07-02 13:00:00 CEST","2014-07-02 14:00:00 CEST","2014-07-02 15:00:00 CEST",
"2014-07-02 16:00:00 CEST","2014-07-02 17:00:00 CEST","2014-07-02 18:00:00 CEST","2014-07-02 19:00:00 CEST",
"2014-07-02 20:00:00 CEST","2014-07-02 21:00:00 CEST","2014-07-02 22:00:00 CEST","2014-07-02 23:00:00 CEST"),
price = c(31.75,30.54,30.10,29.32,25.97,26.90,33.59,41.06,40.99,42.44,40.00,39.94,35.69,36.00,36.00,35.17,34.94,35.18,39.00,
41.92,40.09,38.87,39.38,36.00,30.26,29.29,29.37,25.15,25.81,27.97,31.63,39.91,39.99,39.61,39.13,40.43,38.41,36.96,
36.00,34.95,33.82,36.08,38.59,39.91,39.02,36.90,38.88,32.59))
I am using arrange from dplyr for the sorting as follows.
prices_sorted <- arrange(prices, format(prices$time, format="%Y-%m-%d"), price)
Is there a 'clean' way to arrive at the following?
prices_ranked
time price ranking
1 2014-07-01 00:00:00 CEST 31.75 5
2 2014-07-01 01:00:00 CEST 30.54 6
3 2014-07-01 02:00:00 CEST 30.10 4
4 2014-07-01 03:00:00 CEST 29.32 3
5 2014-07-01 04:00:00 CEST 25.97 2
6 2014-07-01 05:00:00 CEST 26.90 1
7 2014-07-01 06:00:00 CEST 33.59 7
8 2014-07-01 07:00:00 CEST 41.06 17
9 2014-07-01 08:00:00 CEST 40.99 16
10 2014-07-01 09:00:00 CEST 42.44 18
11 2014-07-01 10:00:00 CEST 40.00 13
12 2014-07-01 11:00:00 CEST 39.94 14
13 2014-07-01 12:00:00 CEST 35.69 15
14 2014-07-01 13:00:00 CEST 36.00 24
15 2014-07-01 14:00:00 CEST 36.00 22
16 2014-07-01 15:00:00 CEST 35.17 19
17 2014-07-01 16:00:00 CEST 34.94 23
18 2014-07-01 17:00:00 CEST 35.18 12
19 2014-07-01 18:00:00 CEST 39.00 11
20 2014-07-01 19:00:00 CEST 41.92 21
21 2014-07-01 20:00:00 CEST 40.09 9
22 2014-07-01 21:00:00 CEST 38.87 8
23 2014-07-01 22:00:00 CEST 39.38 20
24 2014-07-01 23:00:00 CEST 36.00 10
25 2014-07-02 00:00:00 CEST 30.26 4
26 2014-07-02 01:00:00 CEST 29.29 5
27 2014-07-02 02:00:00 CEST 29.37 6
28 2014-07-02 03:00:00 CEST 25.15 2
29 2014-07-02 04:00:00 CEST 25.81 3
30 2014-07-02 05:00:00 CEST 27.97 1
31 2014-07-02 06:00:00 CEST 31.63 7
32 2014-07-02 07:00:00 CEST 39.91 24
33 2014-07-02 08:00:00 CEST 39.99 17
34 2014-07-02 09:00:00 CEST 39.61 16
35 2014-07-02 10:00:00 CEST 39.13 15
36 2014-07-02 11:00:00 CEST 40.43 18
37 2014-07-02 12:00:00 CEST 38.41 22
38 2014-07-02 13:00:00 CEST 36.96 14
39 2014-07-02 14:00:00 CEST 36.00 13
40 2014-07-02 15:00:00 CEST 34.95 19
41 2014-07-02 16:00:00 CEST 33.82 23
42 2014-07-02 17:00:00 CEST 36.08 21
43 2014-07-02 18:00:00 CEST 38.59 11
44 2014-07-02 19:00:00 CEST 39.91 10
45 2014-07-02 20:00:00 CEST 39.02 8
46 2014-07-02 21:00:00 CEST 36.90 20
47 2014-07-02 22:00:00 CEST 38.88 9
48 2014-07-02 23:00:00 CEST 32.59 12
I was a little unclear on what order you wanted things in, but is this what you were looking for? Updated to rank by date (I added some additional data so you can see that).
library(data.table)
prices <- data.table(time = c("2014-07-01 00:00:00 CEST", "2014-07-01 01:00:00 CEST", "2014-07-01 02:00:00 CEST","2014-07-01 03:00:00 CEST", "2014-07-01 04:00:00 CEST",
"2015-07-01 00:00:00 CEST", "2015-07-01 01:00:00 CEST", "2015-07-01 02:00:00 CEST","2015-07-01 03:00:00 CEST", "2015-07-01 04:00:00 CEST"),
price = c(31.75, 30.54, 30.10, 29.32, 25.97,31.75, 30.12, 31.10, 39.32, 25.97))
prices <- prices[,"date" := as.Date(time)]
prices.sorted <- prices[order(time),ranking := rank(price,ties.method='first'), by=date]
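Since := adds the column by reference, prices itself now carries the ranking; a quick check of the result:
prices[order(date, ranking), .(time, price, ranking)]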
Here is my solution; it uses base R's sort:
prices %>% mutate(ranking = row_number(sort(price, decreasing = T)))
time price ranking
1 2014-07-01 00:00:00 CEST 31.75 5
2 2014-07-01 01:00:00 CEST 30.54 4
3 2014-07-01 02:00:00 CEST 30.10 3
4 2014-07-01 03:00:00 CEST 29.32 2
5 2014-07-01 04:00:00 CEST 25.97 1
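Note that this reproduces the expected numbers only because the sample rows happen to already be in descending price order; rank() expresses the intent directly and does not depend on row order (a sketch, not the answer above):
prices %>% mutate(ranking = rank(price, ties.method = "first"))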
Maybe this:
prices %>% arrange(price) %>% mutate(ranking=min_rank(price)) %>% arrange(time)
# time price ranking
#1 2014-07-01 00:00:00 CEST 31.75 5
#2 2014-07-01 01:00:00 CEST 30.54 4
#3 2014-07-01 02:00:00 CEST 30.10 3
#4 2014-07-01 03:00:00 CEST 29.32 2
#5 2014-07-01 04:00:00 CEST 25.97 1
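Since the question asks for a ranking within each day, either dplyr approach can be grouped by date first; a sketch assuming the time column parses with as.Date():
prices %>%
  group_by(date = as.Date(time)) %>%
  mutate(ranking = min_rank(price)) %>%
  ungroup()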

Count occurrences from a table

This is my table ... I need to count the instances in the last column per date.
So basically I need:
date Count
2015-02-02 8
2015-02-03 10
2015-02-02 01:30:00 PM 1
2015-02-02 02:30:00 PM 1
2015-02-02 03:30:00 PM 1
2015-02-02 05:30:00 PM 1
2015-02-02 06:30:00 PM 1
2015-02-02 08:30:00 AM 1
2015-02-02 09:30:00 AM 1
2015-02-02 11:30:00 AM 1
2015-02-03 01:30:00 PM 2
2015-02-03 02:30:00 PM 2
2015-02-03 03:30:00 PM 2
2015-02-03 04:30:00 PM 2
2015-02-03 05:30:00 PM 2
2015-02-03 06:30:00 PM 2
2015-02-03 08:30:00 AM 2
2015-02-03 09:30:00 AM 2
2015-02-03 10:30:00 AM 2
2015-02-03 11:30:00 AM 2
2015-02-04 01:30:00 PM 3
2015-02-04 02:30:00 PM 3
2015-02-04 03:30:00 PM 3
2015-02-04 05:30:00 PM 3
2015-02-04 06:30:00 PM 3
2015-02-04 08:30:00 AM 3
2015-02-04 09:30:00 AM 3
2015-02-04 10:30:00 AM 3
2015-02-04 11:30:00 AM 3
2015-02-05 01:30:00 PM 4
2015-02-05 02:30:00 PM 4
2015-02-05 03:30:00 PM 4
2015-02-05 04:30:00 PM 4
2015-02-05 05:30:00 PM 4
2015-02-05 06:30:00 PM 4
2015-02-05 08:30:00 AM 4
2015-02-05 09:30:00 AM 4
2015-02-05 10:30:00 AM 4
2015-02-05 11:30:00 AM 4
2015-02-06 01:30:00 PM 5
2015-02-06 02:30:00 PM 5
2015-02-06 08:30:00 AM 5
2015-02-06 09:30:00 AM 5
2015-02-06 10:30:00 AM 5
2015-02-06 11:30:00 AM 5
SELECT DATE(datecolumn) AS thedate, COUNT(lastcol) FROM sometable GROUP BY thedate
similar question: https://stackoverflow.com/a/366610/636077
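If the data lives in an R data frame rather than a database, the same per-date count can be sketched in base R (here df and its timestamp column datecol are hypothetical names):
# truncate each timestamp to its date, then tabulate
counts <- as.data.frame(table(date = as.Date(df$datecol)))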
