I am working on a transaction data set that reports the time of each transaction in hhmmss format, e.g. 204629, 215450, etc.
I would like to derive from this column a factor variable whose levels indicate certain hours of the day, e.g. 12-3 pm, 3-6 pm, etc.
I can think of using the str_sub function to extract the hour values from the given variable and converting them to a factor, but is there a more efficient method to achieve this?
You can use dplyr::mutate and stringr::str_sub to create the hour column, and then use cut to divide the hour column into your periods.
library(dplyr)
library(stringr)
library(lubridate)
tibble(string = c("215450", "220102", "020129")) %>%
  mutate(hour = str_sub(string, 1, 2) %>% as.numeric,
         minute = str_sub(string, 3, 4) %>% as.numeric,
         second = str_sub(string, 5, 6) %>% as.numeric,
         time = str_c(hour, minute, second, sep = ":") %>% hms()) %>%
  mutate(period = cut(hour, breaks = 2, labels = c("period one", "period two")))
# A tibble: 3 x 6
string hour minute second time period
<chr> <dbl> <dbl> <dbl> <Period> <fct>
1 215450 21 54 50 21H 54M 50S period two
2 220102 22 1 2 22H 1M 2S period two
3 020129 2 1 29 2H 1M 29S period one
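If you want the actual three-hour periods mentioned in the question, the same idea works with explicit breaks; here is a minimal sketch (the bin boundaries and label strings are assumptions, adjust as needed):
library(dplyr)
library(stringr)

tibble(string = c("215450", "220102", "020129")) %>%
  mutate(hour = as.numeric(str_sub(string, 1, 2)),
         # bin the hour into 3-hour periods over the 24-hour clock
         period = cut(hour,
                      breaks = seq(0, 24, by = 3),
                      right = FALSE,
                      labels = c("12-3 am", "3-6 am", "6-9 am", "9-12 am",
                                 "12-3 pm", "3-6 pm", "6-9 pm", "9-12 pm")))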
I have a data frame that consists of customers scheduled subscription payments as follows:
library(lubridate)
CusID <- c(1,2,3)
FromDate <- c(ymd("2019-01-01"), ymd("2019-01-04"), ymd("2019-02-02"))
ToDate <-c(ymd("2019-01-16"), ymd("2019-01-15"), ymd("2019-04-03"))
Amount <- c(5,10,12)
Frequency <- c("Weekly", "Fortnightly", "Monthly")
Input <- data.frame(CusID, Amount, Frequency, FromDate, ToDate)
For each row (customer), I wish to loop from the FromDate to the ToDate and output one row of data for each scheduled payment that falls between those dates, resulting in the following data frame:
CusID <- c(1,1,1,2,3,3,3)
PaymentDate <- c(ymd("2019-01-01"), ymd("2019-01-08"), ymd("2019-01-15"),
ymd("2019-01-04"),ymd("2019-02-02"),ymd("2019-03-02"),ymd("2019-04-02"))
Amount <- c(5,5,5,10,12,12,12)
Output <- data.frame(CusID, PaymentDate, Amount)
What is an efficient way to achieve this using R (and preferably using dplyr / tidyverse functions)?
In SAS my approach would be to use a DO / WHILE LOOP and OUTPUT statement to write a new line for each scheduled payment. e.g.
data Output;
set Input;
PaymentDate = FromDate;
do while (PaymentDate < ToDate);
Payment = Amount;
PaymentDate = PaymentDate + (7 / 14 / 30 ~ logic based on Frequency);
output;
end;
run;
(The key here in SAS is the output statement - it explicitly writes a new record each time it is invoked, thus can be used in a loop to write multiple output lines per input line).
Is there an equivalent method available in R, or is a different approach recommended?
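One direct analogue is to build the payment rows for each input line and then bind them together. Here is a rough base R sketch (it uses the same fixed 7/14/30-day steps as the SAS pseudocode above, so months are approximated as 30 days):
step_days <- c(Weekly = 7, Fortnightly = 14, Monthly = 30)  # assumed mapping, months approximated
rows <- lapply(seq_len(nrow(Input)), function(i) {
  dates <- seq(Input$FromDate[i], Input$ToDate[i],
               by = step_days[[as.character(Input$Frequency[i])]])
  data.frame(CusID = Input$CusID[i], PaymentDate = dates, Amount = Input$Amount[i])
})
do.call(rbind, rows)
The answers below avoid the explicit loop by generating the payment dates per group.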
Another option using tidyverse
Input %>%
  mutate(Frequency = case_when(Frequency == "Weekly" ~ 7L,
                               Frequency == "Fortnightly" ~ 14L,
                               Frequency == "Monthly" ~ 30L,
                               TRUE ~ 0L)) %>%
  group_by(CusID) %>%
  group_modify(~ {
    PaymentDate <- seq.Date(from = .x$FromDate, to = .x$ToDate, by = .x$Frequency)
    crossing(.x[, 1], PaymentDate)
  })
# A tibble: 7 x 3
# Groups: CusID [3]
CusID PaymentDate Amount
<dbl> <date> <dbl>
1 1 2019-01-01 5
2 1 2019-01-08 5
3 1 2019-01-15 5
4 2 2019-01-04 10
5 3 2019-02-02 12
6 3 2019-03-04 12
7 3 2019-04-03 12
Payment dates differ slightly from your expected output because seq.Date adds exactly 30 days each time rather than one calendar month, so months with fewer than 30 days shift the dates.
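To see the difference, compare adding exactly 30 days with adding one calendar month (a small standalone illustration):
seq.Date(as.Date("2019-02-02"), as.Date("2019-04-03"), by = 30)
# [1] "2019-02-02" "2019-03-04" "2019-04-03"
seq.Date(as.Date("2019-02-02"), as.Date("2019-04-03"), by = "1 month")
# [1] "2019-02-02" "2019-03-02" "2019-04-02"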
UPDATE:
Here is a solution that reproduces your expected output exactly:
Input %>%
  mutate(PaymentDate = FromDate,
         RFrequency = case_when(Frequency == "Weekly" ~ '1 week',
                                Frequency == "Fortnightly" ~ '2 weeks',
                                Frequency == "Monthly" ~ '1 month')) %>%
  group_by(CusID, Amount) %>%
  expand(PaymentDate = seq.Date(FromDate, ToDate, by = RFrequency))
# A tibble: 7 x 3
# Groups: CusID, Amount [3]
CusID Amount PaymentDate
<dbl> <dbl> <date>
1 1 5 2019-01-01
2 1 5 2019-01-08
3 1 5 2019-01-15
4 2 10 2019-01-04
5 3 12 2019-02-02
6 3 12 2019-03-02
7 3 12 2019-04-02
I tweaked your Input data.frame so that the Frequency values are strings, not factors.
You could create a helper table freq_mapping to convert from your Frequency to the frequency format R likes. This would avoid the 30 day issue that one of the other answers pointed out.
freq_mapping <- data.frame(Frequency=c('Weekly', 'Fortnightly', 'Monthly'),
RFrequency = c('1 week', '2 weeks', '1 month'),
stringsAsFactors = FALSE)
Then merge Input with this:
Input <- Input %>%
inner_join(freq_mapping, by = 'Frequency')
Now you can create the payment dates:
Input$PaymentDate <- Input$FromDate
Input %>%
  group_by(CusID) %>%
  complete(PaymentDate = seq.Date(FromDate, ToDate, by = RFrequency)) %>%
  fill(PaymentDate, Amount) %>%
  select(CusID, PaymentDate, Amount)
This was not an easy problem for me. The solution is not beautiful, but it should do the job. You'll see there is an issue with the monthly payments, since a month is not always 30 days, but otherwise it should work. Nicer solutions surely exist.
library(data.table)
Input <- data.frame(CusID, Amount, Frequency, FromDate, ToDate)
Input=data.table(Input)
Input[Frequency=="Weekly",freq:=7][Frequency=="Fortnightly",freq:=14][Frequency=="Monthly",freq:=30]
Input[,Ratio:=(ToDate-FromDate)/freq]
# What is the maximum number of rows for a customer?
NREP=as.integer(max(ceiling(Input$Ratio)))
Input[,Rep:=1][,PaymentDate:=FromDate]
for(i in 1:NREP){
  Inputtemp = copy(Input)
  Inputtemp[, FromDate := FromDate + freq]
  Input = rbind(Input, Inputtemp)
}
#Remove invalid rows
Input=unique(Input)
Input=Input[!(FromDate>ToDate),]
setorder(Input,CusID)
Input=Input[,c("CusID","FromDate","Amount")]
setnames(Input,"FromDate","PaymentDate")
# compare against the expected Output (the monthly rows differ, as noted above)
Input == data.table(Output)
A mashup of Humpelstielzchen's and user2474226's answers, bringing all the logic into a single dplyr step.
Output <- Input %>%
mutate(PaymentDate = FromDate,
RFrequency = case_when(Frequency == "Weekly" ~ '1 week',
Frequency == "Fortnightly" ~ '2 weeks',
Frequency == "Monthly" ~ '1 month')) %>%
group_by(CusID) %>%
complete(PaymentDate = seq.Date(FromDate,ToDate, by = RFrequency)) %>%
fill(PaymentDate,Amount) %>%
select(CusID, PaymentDate, Amount)
A subject was measured at several time points over several days. I have a column "resptime_s" (the time at which the subject answered the beep on their smartphone). Now I want to know the mean time between those beeps (i.e. between the rows of this column), with the night time taken out (night time is always from 22:30 till 7:30). Take as example:
The R script:
setwd("C:/Users/Hanne/Desktop/")
dat <- read.csv(file="datnew2.csv", sep=";",header=TRUE)
rows <- c(1:388) #time points
columns <- c(2,60) # date and time
nVariables = 2
newdata<-dat[rows,columns]
head(newdata)
fun2 <- function(x){
  bt <- as.integer(sub("(^\\d{1,2}):.*", "\\1", x))
  f <- cumsum(c(FALSE, diff(bt) < 0))
  d <- rep(as.Date("2018-01-01"), length.out = length(bt))
  bt <- as.POSIXct(paste(d, x))
  res <- sapply(split(bt, f), function(b) c(0, difftime(b[-1], b[1])))
  unname(unlist(res))
}
fun2(newdata$resptime_s)
But the result isn't correct.
And with:
dput(head(newdata, 30))
I obtained this output:
Using lubridate's functions for working with time intervals gives an elegant and easy-to-understand solution.
library(tidyverse)
library(lubridate)
data <- tribble(
~time_point, ~beeptime,
1, "08:30",
2, "11:13",
3, "12:08",
4, "17:20",
5, "22:47",
6, "7:36",
7, "9:40"
) %>%
mutate(beeptime = as_datetime(hm(beeptime)))
1. Define the daytime interval
day <- interval(
as_datetime(hm("07:30")),
as_datetime(hm("22:30"))
)
2. Keep daytime beeps and estimate the time (interval) between them
# %--% is basically the same as interval() above.
data_interval <-
data %>%
filter(beeptime %within% day) %>%
mutate(beep_interval = lag(beeptime) %--% beeptime)
3. Take the average
# You can use as.numeric() to extract (e.g.) minutes, which you can
# just pass to mean().
data_interval$beep_interval %>%
as.numeric("minutes") %>%
abs() %>%
mean(na.rm = TRUE)
#> [1] 247.6
Try the following. It pastes a date that increments every time the next hour is less than the previous one. Then difftime works as expected.
fun <- function(x){
  bt <- as.integer(sub("(^\\d{1,2}):.*", "\\1", x))
  inx <- as.logical(cumsum(c(FALSE, diff(bt) < 0)))
  d <- rep(as.Date("2018-01-01"), length.out = length(bt))
  d[inx] <- d[inx] + 1
  beeptime <- as.POSIXct(paste(d, x))
  difftime(beeptime[-1], beeptime[1])
}
fun(newdata$beeptime)
#Time differences in hours
#[1] 2.716667 3.633333 8.833333 14.283333 23.100000 25.166667
Data.
newdata <-
structure(list(time_point = 1:7, beeptime = structure(1:7, .Label = c("08:30",
"11:13", "12:08", "17:20", "22:47", "7:36", "9:40"), class = "factor")), class = "data.frame", row.names = c(NA,
-7L))
Edit.
I believe that I have misunderstood the question. The OP does not want differences between the first hour and all others. What is needed are the differences restarting from zero every night.
If this is the case, the following function will do it.
fun2 <- function(x){
  bt <- as.integer(sub("(^\\d{1,2}):.*", "\\1", x))
  f <- cumsum(c(FALSE, diff(bt) < 0))
  d <- rep(as.Date("2018-01-01"), length.out = length(bt))
  bt <- as.POSIXct(paste(d, x))
  res <- sapply(split(bt, f), function(b) c(0, difftime(b[-1], b[1])))
  unname(unlist(res))
}
fun2(newdata$beeptime)
#[1] 0.000000 2.716667 3.633333 8.833333 14.283333 0.000000 2.066667
Another approach could be to convert beeptime to an offset (in seconds) from midnight using the lubridate package.
We can then write functions to calculate the time difference while excluding night time (22:30 - 7:30).
Before starting the solution, let's look at the offsets in seconds from midnight for 7:30 and 22:30.
library(lubridate)
as.numeric(seconds(hm("7:30")))
# [1] 27000
as.numeric(seconds(hm("22:30")))
# [1] 81000
I have written two functions to calculate the difference between two times:
# Clamp an individual time to the daytime boundaries (7:30 / 22:30),
# so that time overnight is excluded.
checkNightBoundry <- function(val){
  if(val < 27000){
    val = 27000
  } else if(val > 81000) {
    val = 81000
  }
  val
}
# Arguments are offsets from midnight in seconds.
# Calculate the difference between two times, excluding night time.
calcDifftime <- function(currVal, prevVal){
  diffTime <- 0
  currVal = checkNightBoundry(currVal)
  prevVal = checkNightBoundry(prevVal)
  if(currVal > prevVal){
    diffTime = currVal - prevVal
  } else if(currVal < prevVal){
    diffTime = (81000 - prevVal) + (currVal - 27000)
  }
  diffTime
}
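As a quick sanity check of the night-time clamping: the gap from 22:47 one evening to 7:36 the next morning (offsets 82020 and 27360 from above) reduces to just the six minutes after 7:30.
# 82020 (22:47) is clamped to 81000 (22:30), so only 7:30-7:36 is counted
calcDifftime(27360, 82020) / 3600
# [1] 0.1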
Now, use the above functions:
library(dplyr)
library(lubridate)
df %>%
  mutate(beeptimeOffset = as.numeric(seconds(hm(beeptime)))) %>%
  mutate(diffTime = mapply(calcDifftime, beeptimeOffset,
                           lag(beeptimeOffset, default = first(beeptimeOffset))) / 3600)
# timepoint beeptime beeptimeOffset(sec) diffTime(hrs)
# 1 1 08:30 30600 0.0000000
# 2 2 11:13 40380 2.7166667
# 3 3 12:08 43680 0.9166667
# 4 4 17:20 62400 5.2000000
# 5 5 22:47 82020 5.1666667
# 6 6 7:36 27360 0.1000000
# 7 7 9:40 34800 2.0666667
Data:
df <- read.table(text =
"timepoint beeptime
1 08:30
2 11:13
3 12:08
4 17:20
5 22:47
6 7:36
7 9:40",
header = TRUE, stringsAsFactors = FALSE)
I have a large file of time-series data, which looks as follows. The dataset covers years, in increments of 15 minutes. A small subset looks like:
uniqueid time
a 2014-04-30 23:30:00
a 2014-04-30 23:45:00
a 2014-05-01 00:00:00
a 2014-05-01 00:15:00
a 2014-05-12 13:45:00
a 2014-05-12 14:00:00
b 2014-05-12 13:45:00
b 2014-05-12 14:00:00
b 2014-05-12 14:30:00
To reproduce above:
time<-c("2014-04-30 23:30:00","2014-04-30 23:45:00","2014-05-01 00:00:00","2014-05-01 00:15:00",
"2014-05-12 13:45:00","2014-05-12 14:00:00","2014-05-12 13:45:00","2014-05-12 14:00:00",
"2014-05-12 14:30:00")
uniqueid<-c("a","a","a","a","a","a","b","b","b")
mydf<-data.frame(uniqueid,time)
My goal is to count the number of rows per unique id, per consecutive timeflow. A consecutive timespan is when a unique id is stamped every 15 minutes in a row (such as id a, which is stamped from 2014-04-30 23:30 until 2014-05-01 00:15, hence 4 rows). When this flow of 15-minute increments is disrupted (after 2014-05-01 00:15 there is no stamp at 00:30, so it is disrupted), the next timestamp should count as the start of a new consecutive timeflow, and the number of rows is again counted until that flow is disrupted. Time is POSIXct.
As you can see in above example; a consecutive timeflow may cover different days, different months, or different years. I have many unique ids (and as said, a very large file), so I'm looking for a way that my computer can handle (loops probably wouldn't work).
I am looking for output something like:
uniqueid flow number_rows
a 1 4
a 2 2
b 3 2
b 4 1
I have looked into some time packages (such as lubridate), but given my limited R knowledge, I don't even know where to begin.
I hope all is clear - if not, I'd be happy to try to clarify it further. Thank you very much in advance!
Another way to do this with data.table, also using a time difference, would be to make use of data.table's internal values for the group number (.GRP) and the number of rows in each group (.N):
library(data.table)
res<-setDT(mydf)[, list(number_rows=.N,flow=.GRP),
by=.(uniqueid,cumsum(as.numeric(difftime(time,shift(time,1L,type="lag",fill=0))) - 15))][,cumsum:=NULL]
print(res)
uniqueid number_rows flow
1: a 4 1
2: a 2 2
3: b 2 3
4: b 1 4
Also, since the sample data you posted didn't match the subset shown above, I have included the data I used below:
Data
time<-as.POSIXct(c("2014-04-30 23:30:00","2014-04-30 23:45:00","2014-05-01 00:00:00","2014-05-01 00:15:00",
"2014-05-12 13:45:00","2014-05-12 14:00:00","2014-05-12 13:45:00","2014-05-12 14:00:00",
"2014-05-12 14:30:00"))
uniqueid<-c("a","a","a","a","a","a","b","b","b")
mydf<-data.frame(uniqueid,time)
You can group by uniqueid together with the cumulative sum of an indicator of whether the time difference between rows is not equal to 15 minutes; that gives the flow id, and a count of rows per group then gives you what you need.
To justify the logic: whenever the time difference within a uniqueid is not 15 minutes, a new flow should start, so we mark that row TRUE; combined with cumsum, this becomes a new flow id shared by the following consecutive rows:
library(dplyr)
mydf$time <- as.POSIXct(mydf$time, format = "%Y-%m-%d %H:%M:%S")
# convert the time column to POSIXct class so that we can apply the diff function correctly
mydf %>% group_by(uniqueid, flow = 1 + cumsum(c(F, diff(time) != 15))) %>%
summarize(num_rows = n())
# Source: local data frame [4 x 3]
# Groups: uniqueid [?]
#
# uniqueid flow num_rows
# <fctr> <dbl> <int>
# 1 a 1 4
# 2 a 2 2
# 3 b 3 2
# 4 b 4 1
Base R is pretty fast. Using crude benchmarking, I found it finished in half the time of DT, and I got tired of waiting for dplyr.
# estimated size of data, years x days x hours x 15mins x uniqueids
5*365*24*4*1000 # = approx 180M
# make data with posixct and characters of 180M rows, mydf is approx 2.5GB in memory
time<-rep(as.POSIXct(c("2014-04-30 23:30:00","2014-04-30 23:45:00","2014-05-01 00:00:00","2014-05-01 00:15:00",
"2014-05-12 13:45:00","2014-05-12 14:00:00","2014-05-12 13:45:00","2014-05-12 14:00:00",
"2014-05-12 14:30:00")),times = 20000000)
uniqueid<-rep(as.character(c("a","a","a","a","a","a","b","b","b")),times = 20000000)
mydf<-data.frame(uniqueid,time = time)
rm(time,uniqueid);gc()
Base R:
# assumes that uniqueid's are in groups and in order, and there won't be a followed by b that have the 15 minute "flow"
starttime <- Sys.time()
# find failed flows
mydf$diff <- c(0,diff(mydf$time))
mydf$flowstop <- mydf$diff != 15
# give each flow an id
mydf$flowid <- cumsum(mydf$flowstop)
# clean up vars
mydf$time <- mydf$diff <- mydf$flowstop <- NULL
# find flow length
mydfrle <- rle(mydf$flowid)
# get uniqueid/flowid pairs (unique() is too slow)
mydf <- mydf[!duplicated(mydf$flowid), ]
# append rle and remove separate var
mydf$number_rows <- mydfrle$lengths
rm(mydfrle)
print(Sys.time()-starttime)
# Time difference of 30.39437 secs
data.table:
library(data.table)
starttime <- Sys.time()
res<-setDT(mydf)[, list(number_rows=.N,flow=.GRP),
by=.(uniqueid,cumsum(as.numeric(difftime(time,shift(time,1L,type="lag",fill=0))) - 15))][,cumsum:=NULL]
print(Sys.time()-starttime)
# Time difference of 57.08156 secs
dplyr:
library(dplyr)
# convert the time column to POSIXct class so that we can apply the diff function correctly
starttime <- Sys.time()
mydf %>% group_by(uniqueid, flow = 1 + cumsum(c(F, diff(time) != 15))) %>%
summarize(num_rows = n())
print(Sys.time()-starttime)
# too long, did not finish after a few minutes
I think the assumption that uniqueids and times are already in order is a big one, and the other solutions might be able to take advantage of that better; order() is easy enough to apply first.
I'm not sure about the impact on memory, or how this behaves on data sets that aren't so simple. If memory is an issue, it should be easy enough to break the data into chunks and process them, as sketched below. It definitely takes more code in base R.
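A rough sketch of both points; process_chunk() here is a hypothetical wrapper around the base R steps above, not an existing function:
# make sure the data are ordered before applying the flow logic
mydf <- mydf[order(mydf$uniqueid, mydf$time), ]

# if memory is tight, process the ids in chunks and bind the results
ids <- unique(mydf$uniqueid)
chunks <- split(ids, ceiling(seq_along(ids) / 100))
res <- do.call(rbind, lapply(chunks, function(x) {
  process_chunk(mydf[mydf$uniqueid %in% x, ])  # hypothetical helper
}))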
Having both the "id" and "time" columns ordered, we could build a single grouping to operate on by creating a logical vector that flags wherever either "id" changes or the gap in "time" is greater than 15 minutes.
With:
id = as.character(mydf$uniqueid)
tm = mydf$time
find where "id":
id_gr = c(TRUE, id[-1] != id[-length(id)])
and "time":
tm_gr = c(0, difftime(tm[-1], tm[-length(tm)], unit = "mins")) > 15
change and combine them in:
gr = id_gr | tm_gr
which shows wherever either "id" changed or "time" > 15.
And to get the result:
tab = tabulate(cumsum(gr)) ## basically, the only operation per group -- 'n by group'
data.frame(id = id[gr], flow = seq_along(tab), n = tab)
# id flow n
#1 a 1 4
#2 a 2 2
#3 b 3 2
#4 b 4 1
On a larger scale:
set.seed(1821); nid = 1e4
dat = replicate(nid, as.POSIXct("2016-07-07 12:00:00 EEST") +
cumsum(sample(c(1, 5, 10, 15, 20, 30, 45, 60, 90, 120, 150, 200, 250, 300), sample(5e2:1e3, 1), TRUE)*60),
simplify = FALSE)
names(dat) = make.unique(rep_len(letters, nid))
dat = data.frame(id = rep(names(dat), lengths(dat)), time = do.call(c, dat))
system.time({
id = as.character(dat$id); tm = dat$time
id_gr = c(TRUE, id[-1] != id[-length(id)])
tm_gr = c(0, difftime(tm[-1], tm[-length(tm)], unit = "mins")) > 15
gr = id_gr | tm_gr
tab = tabulate(cumsum(gr))
ans1 = data.frame(id = id[gr], flow = seq_along(tab), n = tab)
})
# user system elapsed
# 1.44 0.19 1.66
For comparison, included MikeyMike's answer:
library(data.table)
dat2 = copy(dat)
system.time({
ans2 = setDT(dat2)[, list(flow = .GRP, n = .N),
by = .(id, cumsum(as.numeric(difftime(time,
shift(time, 1L, type = "lag", fill = 0),
unit = "mins")) > 15))][, cumsum := NULL]
})
# user system elapsed
# 3.95 0.22 4.26
identical(as.data.table(ans1), ans2)
#[1] TRUE
I have a data frame with a long list of dates in one column and values in another column, which looks like this:
set.seed(1234)
df <- data.frame(date= as.Date(c('2010-09-05', '2011-09-06', '2010-09-13',
'2011-09-14', '2010-09-23', '2011-09-24',
'2010-10-05', '2011-10-06', '2010-10-13',
'2011-10-14', '2010-10-23', '2011-10-24')),
value= rnorm(12))
I need to calculate the mean value in each 10-day period of each month, irrespective of year, like this:
dfNeeded <- data.frame(datePeriod=c('period.Sept0.10', 'period.Sept11.20', 'period.Sept21.30',
'period.Oct0.10', 'period.Oct11.20', 'period.Oct21.31'),
meanValue=c(mean(df$value[c(1,2)]),
mean(df$value[c(3,4)]),
mean(df$value[c(5,6)]),
mean(df$value[c(7,8)]),
mean(df$value[c(9,10)]),
mean(df$value[c(11,12)])))
Is there a fast way of doing this?
Here is a way to do it which uses the lubridate package for month and day extraction, but you could also do it with base R date functions:
library(lubridate)
df$period <- paste(month(df$date),cut(day(df$date),breaks=c(0,10,20,31)),sep="-")
aggregate(df$value, list(period=df$period), mean)
Which gives :
period x
1 10-(0,10] -0.5606859
2 10-(10,20] -0.7272449
3 10-(20,31] -0.7377896
4 9-(0,10] -0.4648183
5 9-(10,20] -0.6306283
6 9-(20,31] 0.4675903
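If you prefer labels closer to the ones in your expected output, a variation of the same idea could be used (the exact label format here is an assumption):
library(lubridate)
# month abbreviation plus day bin, e.g. "period.Oct0.10"
df$period <- paste0("period.",
                    month(df$date, label = TRUE, abbr = TRUE),
                    cut(day(df$date), breaks = c(0, 10, 20, 31),
                        labels = c("0.10", "11.20", "21.31")))
aggregate(df$value, list(period = df$period), mean)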
This approach with format.Date and integer division should be reasonably fast:
tapply(df$value, list( format(df$date, "%b"), as.POSIXlt(df$date)$mday %/% 10), mean)
0 1 2
Oct -0.560686 -0.727245 -0.73779
Sep -0.464818 -0.630628 0.46759
I'm not sure how it would compare to the aggregate approach:
aggregate(df$value, list( format(df$date, "%b"), as.POSIXlt(df$date)$mday %/% 10), mean)
Group.1 Group.2 x
1 Oct 0 -0.560686
2 Sep 0 -0.464818
3 Oct 1 -0.727245
4 Sep 1 -0.630628
5 Oct 2 -0.737790
6 Sep 2 0.467590