How to use repeated labels when grouping data using cut? - r

I need to create a new column to categorize my experiment. The time series data is divided into 10-minute groups using cut. I want to add a column that cycles through 3 labels, say A, B, C, and then goes back to A.
test <- tibble(date_time = seq.POSIXt(ymd_hms('2020-02-02 00:00:00'),
                                      ymd_hms('2020-02-02 01:00:00'),
                                      by = '30 sec'),
               cat = cut(date_time, breaks = '10 min'))
I want to get something like this
date_time cat
<dttm> <fct>
1 2020-02-02 00:00:00 A
2 2020-02-02 00:05:30 A
3 2020-02-02 00:10:00 B
4 2020-02-02 00:20:30 C
5 2020-02-02 00:30:00 A
6 2020-02-02 00:31:30 A
I have used the labels option in cut before with a known number of levels, but not like this. Any help is welcome.

You could use cut(labels = F) to create numeric labels for your intervals, then use these as an index into the built-in LETTERS vector (or a vector of custom labels). The modulo operator and a little arithmetic will make it cycle through A, B, and C:
library(tidyverse)
library(lubridate)
test <- tibble(date_time = seq.POSIXt(ymd_hms('2020-02-02 00:00:00'),
                                      ymd_hms('2020-02-02 01:00:00'),
                                      by = '30 sec'),
               cat_num = cut(date_time, breaks = '10 min', labels = F),
               cat = LETTERS[((cat_num - 1) %% 3) + 1])
date_time cat_num cat
<dttm> <int> <chr>
1 2020-02-02 00:00:00 1 A
2 2020-02-02 00:00:30 1 A
3 2020-02-02 00:01:00 1 A
4 2020-02-02 00:01:30 1 A
5 2020-02-02 00:02:00 1 A
6 2020-02-02 00:02:30 1 A
7 2020-02-02 00:03:00 1 A
8 2020-02-02 00:03:30 1 A
9 2020-02-02 00:04:00 1 A
10 2020-02-02 00:04:30 1 A
...48 more rows...
59 2020-02-02 00:29:00 3 C
60 2020-02-02 00:29:30 3 C
61 2020-02-02 00:30:00 4 A
62 2020-02-02 00:30:30 4 A
...59 more rows...
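If you want cat as a factor (as the <fct> column in the desired output suggests) rather than a character vector, one extra line does it:
# Optional: convert cat to a factor with levels A, B, C
test$cat <- factor(test$cat, levels = c("A", "B", "C"))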

Using my santoku package:
library(santoku)
library(lubridate)
library(tibble)
test <- tibble(
  date_time = seq.POSIXt(ymd_hms('2020-02-02 00:00:00'),
                         ymd_hms('2020-02-02 01:00:00'),
                         by = '30 sec'),
  cut_time = chop_width(date_time,
                        minutes(10),
                        labels = lbl_seq("A"))
)
table(test$cut_time)
A B C D E F G
20 20 20 20 20 20 1
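Note that lbl_seq("A") keeps counting upward (A through G here). If you specifically need the labels to cycle back to A every third interval, one option (a sketch, reusing the modulo trick from the answer above) is to relabel afterwards:
# recode the 7 interval labels so they repeat A, B, C, A, ...
test$cat <- LETTERS[((as.integer(test$cut_time) - 1) %% 3) + 1]
table(test$cat)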

Related

Create variable for day of the experiment

I have a large data set that spans a month, with dates stamped in a column called txn_date like the one below (this is a toy reproduction of it):
dat1 <- read.table(text = "var1 txn_date
5 2020-10-25
1 2020-10-25
3 2020-10-26
4 2020-10-27
1 2020-10-27
3 2020-10-31
3 2020-11-01
8 2020-11-02 ", header = TRUE)
Ideally I would like to get a column in my data frame for each date in the data, which I think could be done by first getting a single column that is 1 for the first date that appears, 2 for the next, and so on.
So something like this:
dat1 <- read.table(text = "var1 txn_date day
5 2020-10-25 1
1 2020-10-25 1
3 2020-10-26 2
4 2020-10-27 3
1 2020-10-27 3
3 2020-10-31 7
3 2020-11-01 8
8 2020-11-12 9 ", header = TRUE
I'm not quite sure how to get this. The txn_date column is as.Date in my actual data frame. I think if I could get the single day column like the one listed above (and then convert it to a factor), I could always one-hot encode the actual levels of that column if I need to. Ultimately I need to use the day of the experiment as a regressor in a regression I'm going to run.
Something along the lines of y ~ x + day_1 + day_2 +...+ error
Would this be suitable?
library(tidyverse)
dat1 <- read.table(text = "var1 txn_date
5 2020-10-25
1 2020-10-25
3 2020-10-26
4 2020-10-27
1 2020-10-27
3 2020-10-31
3 2020-11-01
8 2020-11-02 ", header = TRUE)
dat1$txn_date <- as.Date(dat1$txn_date)
dat1 %>%
  mutate(days = txn_date - txn_date[1] + 1)
# var1 txn_date days
#1 5 2020-10-25 1 days
#2 1 2020-10-25 1 days
#3 3 2020-10-26 2 days
#4 4 2020-10-27 3 days
#5 1 2020-10-27 3 days
#6 3 2020-10-31 7 days
#7 3 2020-11-01 8 days
#8 8 2020-11-02 9 days
We create a sequence of dates based on the min and max of 'txn_date' and then match:
dates <- seq(min(as.Date(dat1$txn_date)),
             max(as.Date(dat1$txn_date)), by = '1 day')
dat1$day <- with(dat1, match(as.Date(txn_date), dates))
dat1$day
#[1] 1 1 2 3 3 7 8 9
Or use the factor route:
with(dat1, as.integer(factor(txn_date, levels = as.character(dates))))
#[1] 1 1 2 3 3 7 8 9
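Since the end goal is to use the day of the experiment as a regressor, note that converting the day column to a factor lets lm() expand it into one dummy variable per day automatically, which matches the y ~ x + day_1 + day_2 + ... idea without manual one-hot encoding. A rough sketch on the toy data (using var1 as a stand-in response, since the real y and x are not in it):
dat1$day <- factor(dat1$day)
# lm() expands the factor into per-day dummies; swap in your real
# response and predictors for var1 here
fit <- lm(var1 ~ day, data = dat1)
summary(fit)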

Group by with summarise in date difference in R

I am trying to use group_by and then summarise with a date-difference calculation. I am not sure if it is a runtime error or something wrong in what I am doing. Sometimes when I run the code I get the output in days and other times in seconds, and I am not sure what causes the change; I am not changing the dataset or the code. The dataset I am using is huge (2,304,433 rows and 40 columns). Both times the output values (digits) are the same, only the unit changes (days to secs). I would like to see the output in days.
This is the code that I am using:
data %>%
  group_by(PRODUCT, PERSON_ID) %>%
  summarise(Freq = n(),
            Revenue = max(TOTAL_AMT + 0.000001/QUANTITY),
            No_Days = (max(ORDER_DT) - min(ORDER_DT) + 1)/n())
This is the output.
Can anyone please help me on this?
Use difftime(). You might need to specify the units.
set.seed(314)
data <- data.frame(PRODUCT = sample(1:10, size = 10000, replace = TRUE),
                   PERSON_ID = sample(1:10, size = 10000, replace = TRUE),
                   ORDER_DT = as.POSIXct(as.Date('2019/01/01') + sample(-300:+300, size = 10000, replace = TRUE)))
require(dplyr)
data %>%
  group_by(PRODUCT, PERSON_ID) %>%
  summarise(Freq = n(),
            start = min(ORDER_DT),
            end = max(ORDER_DT)) %>%
  mutate(No_Days = (as.double(difftime(end, start, units = "days"), units = "days") + 1)/Freq)
gives:
PRODUCT PERSON_ID Freq start end No_Days
<int> <int> <int> <dttm> <dttm> <dbl>
1 1 1 109 2018-03-21 01:00:00 2019-10-27 02:00:00 5.38
2 1 2 117 2018-03-23 01:00:00 2019-10-26 02:00:00 4.98
3 1 3 106 2018-03-19 01:00:00 2019-10-28 01:00:00 5.56
4 1 4 109 2018-03-07 01:00:00 2019-10-26 02:00:00 5.50
5 1 5 95 2018-03-07 01:00:00 2019-10-16 02:00:00 6.2
6 1 6 79 2018-03-09 01:00:00 2019-10-04 02:00:00 7.28
7 1 7 83 2018-03-09 01:00:00 2019-10-28 01:00:00 7.22
8 1 8 114 2018-03-09 01:00:00 2019-10-16 02:00:00 5.15
9 1 9 100 2018-03-09 01:00:00 2019-10-13 02:00:00 5.84
10 1 10 91 2018-03-11 01:00:00 2019-10-26 02:00:00 6.54
# ... with 90 more rows
Why is the value divided by n()?
A simple as.integer(max(ORDER_DT) - min(ORDER_DT)) should work; if it doesn't, please be more specific and update the question with more information.
Also, while working with date-time values it's good to know the lubridate library.
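For example, a minimal sketch of the lubridate route on the simulated data above: time_length() over an interval() always reports the span in the unit you ask for, so the days-versus-seconds ambiguity disappears.
library(lubridate)
data %>%
  group_by(PRODUCT, PERSON_ID) %>%
  summarise(Freq = n(),
            # interval() spans the first to last order date; time_length()
            # returns it in days regardless of how the dates are stored
            No_Days = (time_length(interval(min(ORDER_DT), max(ORDER_DT)), "day") + 1) / n())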

Corresponding last value in every minute

I want to extract the corresponding last value in every minute, say in a table "Table":
Value Time
1 5/1/2018 15:50:57
5 5/1/2018 15:50:58
21 5/1/2018 15:51:48
22 5/1/2018 15:51:49
5 5/1/2018 15:52:58
8 5/1/2018 15:52:59
71 5/1/2018 15:53:45
33 5/1/2018 15:53:50
I need the corresponding last "Value" at the end of each minute in "Time". That is:
I want the output values to be: 5, 22, 8, 33
I tried using "as.POSIXct" to find Table$Time value but I am not able to proceed.
1) aggregate: Using DF, shown reproducibly in the Note at the end, truncate each time to the minute and then aggregate based on that:
aggregate(Value ~ Minute, transform(DF, Minute = trunc(Time, "min")), tail, 1)
giving:
Minute Value
1 2018-05-01 15:59:00 5
2 2018-05-01 16:59:00 22
3 2018-05-01 17:59:00 8
4 2018-05-01 18:59:00 33
2) subset: An alternative, depending on what output you want, is to truncate the times to minutes and then remove the rows whose truncated times are duplicated, proceeding backwards from the end.
subset(DF, !duplicated(trunc(Time, "min"), fromLast = TRUE))
giving:
Value Time
2 5 2018-05-01 15:59:58
4 22 2018-05-01 16:59:49
6 8 2018-05-01 17:59:59
8 33 2018-05-01 18:59:50
Note
We assume the following input shown reproducibly. Note that we have converted the Time column to POSIXct class.
Lines <- "
Value Time
1 5/1/2018 15:59:57
5 5/1/2018 15:59:58
21 5/1/2018 16:59:48
22 5/1/2018 16:59:49
5 5/1/2018 17:59:58
8 5/1/2018 17:59:59
71 5/1/2018 18:59:45
33 5/1/2018 18:59:50"
Lines2 <- sub(" ", ",", trimws(readLines(textConnection(Lines))))
DF <- read.csv(text = Lines2)
DF$Time <- as.POSIXct(DF$Time, format = "%m/%d/%Y %H:%M:%S")
Very similar to @G. Grothendieck's answer, but using format instead, i.e.
aggregate(Value ~ format(Time, '%Y-%m-%d %H:%M:00'), df, tail, 1)
# format(Time, "%Y-%m-%d %H:%M:00") Value
#1 2018-05-01 15:50:00 5
#2 2018-05-01 15:51:00 22
#3 2018-05-01 15:52:00 8
#4 2018-05-01 15:53:00 33
Building on @G. Grothendieck's great answer, I provide a tidyverse solution.
library(dplyr)
Lines <- "
Value Time
1 5/1/2018 15:50:57
5 5/1/2018 15:50:58
21 5/1/2018 16:51:48
22 5/1/2018 16:51:49
5 5/1/2018 17:52:58
8 5/1/2018 17:52:59
71 5/1/2018 18:53:45
33 5/1/2018 18:53:50"
Lines2 <- sub(" ", ",", readLines(textConnection(Lines)))
DF <- read.csv(text = Lines2) %>% tibble::as_tibble()
# after creating reproducible data set. Set Time to date-time format
# then floor the time to nearest minute
DF %>%
  dplyr::mutate(Time = lubridate::mdy_hms(Time),
                minute = lubridate::floor_date(Time, "minute")) %>%
  # Group by minute
  dplyr::group_by(minute) %>%
  # arrange by time
  dplyr::arrange(Time) %>%
  # extract the last row in each group
  dplyr::filter(dplyr::row_number() == dplyr::n())
Output
# A tibble: 4 x 3
# Groups: minute [4]
Value Time minute
<int> <dttm> <dttm>
1 5 2018-05-01 15:50:58 2018-05-01 15:50:00
2 22 2018-05-01 16:51:49 2018-05-01 16:51:00
3 8 2018-05-01 17:52:59 2018-05-01 17:52:00
4 33 2018-05-01 18:53:50 2018-05-01 18:53:00

R Sum rows by hourly rate

I'm getting started with R, so please bear with me
For example, I have this data.table (or data.frame) object :
Time Station count_starts count_ends
01/01/2015 00:30 A 2 3
01/01/2015 00:40 A 2 1
01/01/2015 00:55 B 1 1
01/01/2015 01:17 A 3 1
01/01/2015 01:37 A 1 1
My end goal is to group the "Time" column into hourly bins and sum count_starts and count_ends by the hourly time and station:
Time Station sum(count_starts) sum(count_ends)
01/01/2015 01:00 A 4 4
01/01/2015 01:00 B 1 1
01/01/2015 02:00 A 4 2
I did some research and found out that I should use the xts library.
Thanks for helping me out
UPDATE :
I converted the type of transactions$Time to POSIXct, so the xts package should be able to use the timeseries directly.
Using base R, we can still do the above. The only difference is that the hour will be one less for all of them, since each bin is labelled by its start:
dat=read.table(text = "Time Station count_starts count_ends
'01/01/2015 00:30' A 2 3
'01/01/2015 00:40' A 2 1
'01/01/2015 00:55' B 1 1
'01/01/2015 01:17' A 3 1
'01/01/2015 01:37' A 1 1",
header = TRUE, stringsAsFactors = FALSE)
dat$Time=cut(strptime(dat$Time,"%m/%d/%Y %H:%M"),"hour")
aggregate(.~Time+Station,dat,sum)
Time Station count_starts count_ends
1 2015-01-01 00:00:00 A 4 4
2 2015-01-01 01:00:00 A 4 2
3 2015-01-01 00:00:00 B 1 1
You can use the order function to rearrange the table or even the sort.POSIXlt function:
m=aggregate(.~Time+Station,dat,sum)
m[order(m[,1]),]
Time Station count_starts count_ends
1 2015-01-01 00:00:00 A 4 4
3 2015-01-01 00:00:00 B 1 1
2 2015-01-01 01:00:00 A 4 2
A solution using dplyr and lubridate. The key is to use ceiling_date to round the date-time column up to the hour, and then group and summarise the data.
library(dplyr)
library(lubridate)
dt2 <- dt %>%
  mutate(Time = mdy_hm(Time)) %>%
  mutate(Time = ceiling_date(Time, unit = "hour")) %>%
  group_by(Time, Station) %>%
  summarise(`sum(count_starts)` = sum(count_starts),
            `sum(count_ends)` = sum(count_ends)) %>%
  ungroup()
dt2
# # A tibble: 3 x 4
# Time Station `sum(count_starts)` `sum(count_ends)`
# <dttm> <chr> <int> <int>
# 1 2015-01-01 01:00:00 A 4 4
# 2 2015-01-01 01:00:00 B 1 1
# 3 2015-01-01 02:00:00 A 4 2
DATA
dt <- read.table(text = "Time Station count_starts count_ends
'01/01/2015 00:30' A 2 3
'01/01/2015 00:40' A 2 1
'01/01/2015 00:55' B 1 1
'01/01/2015 01:17' A 3 1
'01/01/2015 01:37' A 1 1",
header = TRUE, stringsAsFactors = FALSE)
Explanation
mdy_hm converts the string to a date-time class; it means "month-day-year hour-minute", chosen to match the structure of the string. ceiling_date rounds a date-time object up to the unit specified. group_by groups the data by the given variables, and summarise computes the summary for each group.
There are basically two things required:
1) Round the Time down to the nearest 1-hour window:
library(data.table)
library(lubridate)
data = data.table(Time = c('01/01/2015 00:30', '01/01/2015 00:40', '01/01/2015 00:55',
                           '01/01/2015 01:17', '01/01/2015 01:37'),
                  Station = c('A', 'A', 'B', 'A', 'A'),
                  count_starts = c(2, 2, 1, 3, 1),
                  count_ends = c(3, 1, 1, 1, 1))
data[, Time_conv := as.POSIXct(strptime(Time, '%m/%d/%Y %H:%M'))]
data[, Time_round := floor_date(Time_conv, unit = "1 hour")]
2) Aggregate the data table obtained above by the rounded time and station to get the desired result:
New_data = data[, list(count_starts_sum = sum(count_starts),
                       count_ends_sum = sum(count_ends)),
                by = c('Time_round', 'Station')]
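Since the question mentions xts, here is a minimal sketch of that route, assuming the dt data frame from the dplyr answer above. Like the base R answer, it labels each bin by the hour the rows fall in (not rounded up), and it handles stations by splitting first.
library(xts)
tm <- as.POSIXct(dt$Time, format = "%m/%d/%Y %H:%M")
# one xts object per station; period.apply() sums each count column
# over hourly endpoints
lapply(split(seq_len(nrow(dt)), dt$Station), function(i) {
  x <- xts(dt[i, c("count_starts", "count_ends")], order.by = tm[i])
  period.apply(x, endpoints(x, "hours"), colSums)
})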

Using a rolling time interval to count rows in R and dplyr

Let's say I have a dataframe of timestamps with the corresponding number of tickets sold at that time.
Timestamp ticket_count
(time) (int)
1 2016-01-01 05:30:00 1
2 2016-01-01 05:32:00 1
3 2016-01-01 05:38:00 1
4 2016-01-01 05:46:00 1
5 2016-01-01 05:47:00 1
6 2016-01-01 06:07:00 1
7 2016-01-01 06:13:00 2
8 2016-01-01 06:21:00 1
9 2016-01-01 06:22:00 1
10 2016-01-01 06:25:00 1
I want to know how to calculate, for each ticket, the number of tickets sold within a certain time frame of it. For example, I want to calculate the number of tickets sold up to 15 minutes after each ticket. In this case, the first row would have three tickets, the second row would have four tickets, etc.
Ideally, I'm looking for a dplyr solution, as I want to do this for multiple stores with a group_by() function. However, I'm having a little trouble figuring out how to hold each Timestamp fixed for a given row while simultaneously searching through all Timestamps via dplyr syntax.
In the current development version of data.table, v1.9.7, non-equi joins are implemented. Assuming your data.frame is called df and the Timestamp column is POSIXct type:
require(data.table) # v1.9.7+
window = 15L # minutes
(counts = setDT(df)[.(t = Timestamp + window*60L), on = .(Timestamp < t),
                    .(counts = sum(ticket_count)), by = .EACHI]$counts)
# [1] 3 4 5 5 5 9 11 11 11 11
# add that as a column to original data.table by reference
df[, counts := counts]
For each row in t, all rows where df$Timestamp < that row's value are fetched, and by=.EACHI instructs the expression sum(ticket_count) to run for each row of t. That gives your desired result.
Hope this helps.
This is a simpler version of the ugly one I wrote earlier..
# install.packages('dplyr')
library(dplyr)
your_data %>%
  mutate(timestamp = as.POSIXct(timestamp, format = '%m/%d/%Y %H:%M'),
         ticket_count = as.numeric(ticket_count)) %>%
  mutate(window = cut(timestamp, '15 min')) %>%
  group_by(window) %>%
  dplyr::summarise(tickets = sum(ticket_count))
window tickets
(fctr) (dbl)
1 2016-01-01 05:30:00 3
2 2016-01-01 05:45:00 2
3 2016-01-01 06:00:00 3
4 2016-01-01 06:15:00 3
Here is a solution using data.table, also incorporating different stores.
Example data:
library(data.table)
dt <- data.table(Timestamp = as.POSIXct("2016-01-01 05:30:00") + seq(60, 120000, by = 60),
                 ticket_count = sample(1:9, 2000, T),
                 store = c(rep(c("A", "B", "C", "D"), 500)))
Now apply the following:
ts <- dt$Timestamp
for (x in ts) {
  end <- x + 900
  dt[Timestamp <= end & Timestamp >= x, CS := sum(ticket_count), by = store]
}
This gives you
Timestamp ticket_count store CS
1: 2016-01-01 05:31:00 3 A 13
2: 2016-01-01 05:32:00 5 B 20
3: 2016-01-01 05:33:00 3 C 19
4: 2016-01-01 05:34:00 7 D 12
5: 2016-01-01 05:35:00 1 A 15
---
1996: 2016-01-02 14:46:00 4 D 10
1997: 2016-01-02 14:47:00 9 A 9
1998: 2016-01-02 14:48:00 2 B 2
1999: 2016-01-02 14:49:00 2 C 2
2000: 2016-01-02 14:50:00 6 D 6
