Aggregating time on hourly basis and counting it - r

I have the following data frame in R.
Date Car_NO
2016-12-24 19:35:00 ABC
2016-12-24 19:55:00 DEF
2016-12-24 20:15:00 RTY
2016-12-24 20:35:00 WER
2016-12-24 21:34:00 DER
2016-12-24 00:23:00 ABC
2016-12-24 00:22:00 ERT
2016-12-24 11:45:00 RTY
2016-12-24 13:09:00 RTY
The Date column has class "POSIXct" "POSIXt".
I want to count car movements per hour, in slots such as 00-01, 01-02, 02-03, 03-04 and so on.
My current approach is the following:
df$time <- ymd_hms(df$Date)
df$hours <- hour(df$time)
df$minutes <- minute(df$time)
df$time <- as.numeric(paste(df$hours,df$minutes,sep="."))
After this I would apply a chain of ifelse calls to divide the values into hourly time slots, but I think that would be a long and tedious way to do it. Is there an easier approach in R?
My desired dataframe would be
Time_Slots Car_Traffic_count
00-01 2
01-02 0
02-03 0
.
.
.
19-20 2
20-21 2
21-22 1
.
.
.

Simplest would be to just use the starting hour to indicate a time interval:
# sample data
df = data.frame(time = Sys.time()+seq(1,10)*10000, runif(10) )
# summarize
library(dplyr)
library(tidyr)  # complete() comes from tidyr, not dplyr
df$hour = factor(as.numeric(format(df$time,"%H")), levels = seq(0,23))
df = df %>%
group_by(hour) %>%
summarize(count=n()) %>%
complete(hour, fill = list(count = 0))
Output:
# A tibble: 24 x 2
hour count
<fctr> <dbl>
1 0 0
2 1 1
3 2 0
4 3 0
5 4 1
6 5 0
7 6 1
8 7 0
9 8 0
10 9 1
# ... with 14 more rows
You can optionally add:
df$formatted = paste0(as.character(df$hour),"-",as.numeric(as.character(df$hour))+1)
at the end to get your desired format. Hope this helps!
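For comparison, here is a minimal base R sketch applied to the sample data from the question. It assumes the Date column is POSIXct in UTC (the time zone is my assumption, the question does not state one), and the names hour_of_day and result are only illustrative.

```r
# Sample data from the question, re-typed; the UTC time zone is an assumption
df <- data.frame(
  Date = as.POSIXct(c("2016-12-24 19:35:00", "2016-12-24 19:55:00",
                      "2016-12-24 20:15:00", "2016-12-24 20:35:00",
                      "2016-12-24 21:34:00", "2016-12-24 00:23:00",
                      "2016-12-24 00:22:00", "2016-12-24 11:45:00",
                      "2016-12-24 13:09:00"), tz = "UTC"),
  Car_NO = c("ABC", "DEF", "RTY", "WER", "DER", "ABC", "ERT", "RTY", "RTY")
)

hours <- 0:23
# Hour of day as a factor with all 24 levels, so empty hours are counted as 0
hour_of_day <- factor(as.integer(format(df$Date, "%H")), levels = hours)

result <- data.frame(
  Time_Slots = sprintf("%02d-%02d", hours, hours + 1L),  # "00-01", "01-02", ...
  Car_Traffic_count = as.vector(table(hour_of_day))
)
head(result, 3)
```

Because the factor carries every hour as a level, table() returns a zero for hours without traffic, so no separate fill step is needed.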

Related

Elegant way to get no of days to prev and next year using R?

I have an R data frame like as shown below
test_df <- data.frame("subbject_id" = c(1,2,3,4,5),
"date_1" = c("01/01/2003","12/31/2007","12/30/2008","01/02/2007","01/01/2007"))
I would like to get the number of days to the previous year and to the next year.
I was trying something like the below
library(lubridate)
test_df$current_yr = year(mdy(test_df$date_1))
prev_yr = test_df$current_yr - 1 #(subtract 1 to get the prev year)
next_yr = test_df$current_yr + 1 #(add 1 to get the next year)
days_to_prev_yr = days_in_year(current_yr) # this doesn't work
In Python, I know there is something called day of the year, offsets.YearEnd(0) etc., which I learned about from this post. Can someone help me with how to do this in R?
I expect my output to be like as shown below
You can use ceiling_date and floor_date from lubridate to get the first and last days of the year, and then take the difference from date_1 to get days_to_previous_year and days_to_next_year.
library(dplyr)
library(lubridate)
test_df %>%
mutate(date_1 = mdy(date_1),
previous_year = floor_date(date_1, 'year'),
next_year = ceiling_date(date_1, 'year') - 1,
days_to_previous_year = as.integer(date_1 - previous_year),
days_to_next_year = as.integer(next_year - date_1)) %>%
select(-previous_year, -next_year)
# subbject_id date_1 days_to_previous_year days_to_next_year
#1 1 2003-01-01 0 364
#2 2 2007-12-31 364 0
#3 3 2008-12-30 364 1
#4 4 2007-01-02 1 363
#5 5 2007-01-01 0 364
One dplyr and lubridate option could be:
test_df %>%
mutate(date_1 = mdy(date_1),
days_to_prev_year = date_1 - mdy(paste0("01-01-", year(date_1))),
days_to_next_year = mdy(paste0("12-31-", year(date_1))) - date_1)
subbject_id date_1 days_to_prev_year days_to_next_year
1 1 2003-01-01 0 days 364 days
2 2 2007-12-31 364 days 0 days
3 3 2008-12-30 364 days 1 days
4 4 2007-01-02 1 days 363 days
5 5 2007-01-01 0 days 364 days
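The same arithmetic can also be sketched in base R without lubridate. This is only an illustration using the dates from the question; the variable names are mine.

```r
# Dates from the question, parsed as month/day/year
dates <- as.Date(c("01/01/2003", "12/31/2007", "12/30/2008",
                   "01/02/2007", "01/01/2007"), format = "%m/%d/%Y")

yr <- format(dates, "%Y")
year_start <- as.Date(paste0(yr, "-01-01"))  # first day of the same year
year_end   <- as.Date(paste0(yr, "-12-31"))  # last day of the same year

# Subtracting Dates gives a difftime in days; as.integer drops the unit
days_to_prev_year <- as.integer(dates - year_start)
days_to_next_year <- as.integer(year_end - dates)

data.frame(dates, days_to_prev_year, days_to_next_year)
```

Building the boundaries from the year string handles leap years automatically, since the day counts come from Date subtraction rather than a fixed 365.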

Break up rows representing long time intervals into multiple rows

I have a dataframe (tibble) with multiple rows, each row contains an IDNR, a start date, an end date and an exposure status. The IDNR is a character variable, the start and end date are date variables and the exposure status is a numerical variable. This is what the top 3 rows look like:
# A tibble: 48,266 x 4
IDNR start end exposure
<chr> <date> <date> <dbl>
1 1 2018-02-15 2018-07-01 0
2 2 2017-10-30 2018-07-01 0
3 3 2016-02-11 2016-12-03 1
# ... with 48,256 more rows
In order to do a time-varying cox regression, I want to split up the rows into 90 day parts, while maintaining the start and end date. Here is an example of what I would like to achieve. What happens, is that the new end date is start + 90 days, and a new row is created. This row has the start date which is the same as the end date from the previous row. If the time between start and end is now less than 90 days, this is fine (as for IDNR 1 and 3), however, for IDNR 2 the time is still exceeding 90 days. Therefore a third row needs to be added.
# A tibble: 48,266 x 4
# Groups: IDNR [33,240]
IDNR start end exposure
<chr> <date> <date> <dbl>
1 1 2018-02-15 2018-05-16 0
2 1 2018-05-16 2018-07-01 0
3 2 2017-10-30 2018-01-28 0
4 2 2018-01-28 2018-04-28 0
5 2 2018-04-28 2018-07-01 0
6 3 2016-02-11 2016-08-09 1
7 3 2016-08-09 2016-12-03 1
I'm relatively new to coding in R, but I've found dplyr to be very useful so far. So, if someone knows a solution using dplyr I would really appreciate that.
Thanks in advance!
Here you go:
Using df as your data frame:
df = data.frame(IDNR = 1:3,
start = c("2018-02-15","2017-10-30","2016-02-11"),
end = c("2018-07-01","2018-07-01","2016-12-03"),
exposure = c(0,0,1))
Do:
library(lubridate)
newDF = apply(df, 1, function(x){
newStart = seq(from = ymd(x["start"]), to = ymd(x["end"]), by = 90)
newEnd = c(seq(from = ymd(x["start"]), to = ymd(x["end"]), by = 90)[-1], ymd(x["end"]))
d = data.frame(IDNR = rep(x["IDNR"], length(newStart)),
start = newStart,
end = newEnd,
exposure = rep(x["exposure"], length(newStart)))
})
newDF = do.call(rbind, newDF)
newDF = newDF[newDF$start != newDF$end,]
Result:
> newDF
IDNR start end exposure
1 1 2018-02-15 2018-05-16 0
2 1 2018-05-16 2018-07-01 0
3 2 2017-10-30 2018-01-28 0
4 2 2018-01-28 2018-04-28 0
5 2 2018-04-28 2018-07-01 0
6 3 2016-02-11 2016-05-11 1
7 3 2016-05-11 2016-08-09 1
8 3 2016-08-09 2016-11-07 1
9 3 2016-11-07 2016-12-03 1
This creates a sequence of dates from start to end in steps of 90 days and builds a smaller data frame from them, along with the IDNR and exposure. The apply call returns a list of data frames, which do.call(rbind, ...) joins together. The last line removes rows that have the same start and end date.
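A variant of the same idea, sketched under the assumption that start and end are already Date columns (as in the question's tibble). It loops over row indices instead of using apply, so the columns are not coerced to character; pieces and new_df are illustrative names.

```r
df <- data.frame(IDNR = c("1", "2", "3"),
                 start = as.Date(c("2018-02-15", "2017-10-30", "2016-02-11")),
                 end = as.Date(c("2018-07-01", "2018-07-01", "2016-12-03")),
                 exposure = c(0, 0, 1))

pieces <- lapply(seq_len(nrow(df)), function(i) {
  # Cut points: start, start + 90, start + 180, ... up to (at most) end
  cuts <- seq(df$start[i], df$end[i], by = 90)
  # Close the last piece at the real end date if it is not already a cut point
  if (cuts[length(cuts)] != df$end[i]) cuts <- c(cuts, df$end[i])
  n <- length(cuts) - 1
  data.frame(IDNR = rep(df$IDNR[i], n),
             start = cuts[-length(cuts)],
             end = cuts[-1],
             exposure = rep(df$exposure[i], n))
})
new_df <- do.call(rbind, pieces)
new_df
```

Each piece is at most 90 days long, and a row whose start equals its end simply contributes zero rows, so no filtering step is needed afterwards.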

Count number of rows for each row that meet a logical condition

So I have some data with a time stamp, and for each row, I want to count the number of rows that fall within a certain time window. For example, if I have the data below with a time stamp in h:mm (column ts), I want to count the number of rows that occur from that time stamp to five minutes in the past (column count). The first n rows that are less than five minutes from the first data point should be NAs.
ts data count
1:01 123 NA
1:02 123 NA
1:03 123 NA
1:04 123 NA
1:06 123 5
1:07 123 5
1:10 123 3
1:11 123 4
1:12 123 4
This is straightforward to do with a for loop, but I've been trying to implement with the apply() family and have not yet found any success. Any suggestions?
EDIT: modified to account for the potential for multiple readings per minute, raised in comment.
Data with new mid-minute reading:
library(dplyr)
df %>%
# Take the text above and convert to datetime
mutate(ts = lubridate::ymd_hms(paste(Sys.Date(), ts))) %>%
# Count how many observations per minute
group_by(ts_min = lubridate::floor_date(ts, "1 minute")) %>%
summarize(obs_per_min = sum(!is.na(data))) %>%
# Add rows for any missing minutes, count as zero observations
padr::pad(interval = "1 min") %>%
tidyr::replace_na(list(obs_per_min = 0)) %>%
# Count cumulative observations, and calc how many in window that
# begins 5 minutes ago and ends at end of current minute
mutate(cuml_count = cumsum(obs_per_min),
prior_cuml = lag(cuml_count) %>% tidyr::replace_na(0),
in_window = cuml_count - lag(prior_cuml, 5)) %>%
# Exclude unneeded columns and rows
select(-cuml_count, -prior_cuml) %>%
filter(obs_per_min > 0)
Output (now reflects add'l reading at 1:06:30)
# A tibble: 12 x 3
ts_min obs_per_min in_window
<dttm> <dbl> <dbl>
1 2018-09-26 01:01:00 1 NA
2 2018-09-26 01:02:00 1 NA
3 2018-09-26 01:03:00 1 NA
4 2018-09-26 01:04:00 1 NA
5 2018-09-26 01:06:00 2 6
6 2018-09-26 01:07:00 1 6
7 2018-09-26 01:10:00 1 4
8 2018-09-26 01:11:00 1 5
9 2018-09-26 01:12:00 1 4
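For the original data in the question (one reading per time stamp), the windowed count can also be sketched in base R with vapply. This assumes the window includes both ends, which is what the expected count column implies; mins and window are illustrative names.

```r
ts <- c("1:01", "1:02", "1:03", "1:04", "1:06", "1:07", "1:10", "1:11", "1:12")

# Convert h:mm to minutes since midnight
parts <- strsplit(ts, ":", fixed = TRUE)
mins <- vapply(parts, function(p) as.numeric(p[1]) * 60 + as.numeric(p[2]),
               numeric(1))

window <- 5  # minutes to look back
count <- vapply(mins, function(m) {
  # Rows closer than `window` minutes to the first reading get NA
  if (m - min(mins) < window) NA_integer_
  else sum(mins >= m - window & mins <= m)
}, integer(1))

data.frame(ts, count)
```

This compares every row against every other row, so it is quadratic in the number of rows; for large data the cumulative-sum approach above should scale better.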

Count the number of active episodes per month from data with start and end dates

I am trying to get a count of active clients per month, using data that has a start and end date for each client's episode. With the code I am using, I can't work out how to count per month rather than per every n days.
Here is some sample data:
Start.Date <- as.Date(c("2014-01-01", "2014-01-02","2014-01-03","2014-01-03"))
End.Date<- as.Date(c("2014-01-04", "2014-01-03","2014-01-03","2014-01-04"))
Make sure the dates are dates:
Start.Date <- as.Date(Start.Date, "%d/%m/%Y")
End.Date <- as.Date(End.Date, "%d/%m/%Y")
Here is the code I am using, which currently counts the number per day:
library(plyr)
count(Reduce(c, Map(seq, Start.Date, End.Date, by = 1)))
which returns:
x freq
1 2014-01-01 1
2 2014-01-02 2
3 2014-01-03 4
4 2014-01-04 2
The "by" argument can be changed to be however many days I want, but problems arise because months have different lengths.
Would anyone be able to suggest how I can count per month?
Thanks a lot.
note: I now realize that for my example data I have only used dates in the same month, but my real data has dates spanning 3 years.
Here's a solution that seems to work. First, I set the seed so that the example is reproducible.
# Set seed for reproducible example
set.seed(33550336)
Next, I create a dummy data frame.
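# Test data (dplyr provides %>% and mutate; lubridate provides day<- used below)
library(dplyr)
library(lubridate)
df <- data.frame(Start_date = as.Date(sample(seq(as.Date('2014/01/01'), as.Date('2015/01/01'), by="day"), 12))) %>%
mutate(End_date = as.Date(Start_date + sample(1:365, 12, replace = TRUE)))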
which looks like,
# Start_date End_date
# 1 2014-11-13 2015-09-26
# 2 2014-05-09 2014-06-16
# 3 2014-07-11 2014-08-16
# 4 2014-01-25 2014-04-23
# 5 2014-05-16 2014-12-19
# 6 2014-11-29 2015-07-11
# 7 2014-09-21 2015-03-30
# 8 2014-09-15 2015-01-03
# 9 2014-09-17 2014-09-26
# 10 2014-12-03 2015-05-08
# 11 2014-08-03 2015-01-12
# 12 2014-01-16 2014-12-12
The function below takes a start date and end date and creates a sequence of months between these dates.
# Sequence of months
mon_seq <- function(start, end){
# Change each day to the first to aid month counting
day(start) <- 1
day(end) <- 1
# Create a sequence of months
seq(start, end, by = "month")
}
Right, this is the tricky bit. I apply my function mon_seq to all rows in the data frame using mapply. This gives the months between each start and end date. Then, I combine all these months together into a vector. I format this vector so that dates just contain months and years. Finally, I pipe (using dplyr's %>%) this into table which counts each occurrence of year-month and I cast as a data frame.
data.frame(format(do.call("c", mapply(mon_seq, df$Start_date, df$End_date, SIMPLIFY = FALSE)), "%Y-%m") %>% table)
This gives,
# . Freq
# 1 2014-01 2
# 2 2014-02 2
# 3 2014-03 2
# 4 2014-04 2
# 5 2014-05 3
# 6 2014-06 3
# 7 2014-07 3
# 8 2014-08 4
# 9 2014-09 6
# 10 2014-10 5
# 11 2014-11 7
# 12 2014-12 8
# 13 2015-01 6
# 14 2015-02 4
# 15 2015-03 4
# 16 2015-04 3
# 17 2015-05 3
# 18 2015-06 2
# 19 2015-07 2
# 20 2015-08 1
# 21 2015-09 1
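The same month-counting idea can be sketched in base R without lubridate or dplyr. The episode dates below are hypothetical (the question's own sample only covers January), and first_of_month is a helper name I made up for illustration.

```r
# Hypothetical episodes spanning several months
start <- as.Date(c("2014-01-01", "2014-01-02", "2014-01-03", "2014-01-03"))
end   <- as.Date(c("2014-01-04", "2014-02-03", "2014-01-03", "2014-03-04"))

# Snap a date to the first day of its month
first_of_month <- function(d) as.Date(format(d, "%Y-%m-01"))

# For each episode, list every year-month it touches
months <- unlist(lapply(seq_along(start), function(i) {
  format(seq(first_of_month(start[i]), first_of_month(end[i]), by = "month"),
         "%Y-%m")
}))

# Number of active episodes per month
counts <- table(months)
counts
```

Snapping both dates to the first of the month before calling seq avoids the problem of months having different lengths, because stepping by "month" from day 1 always lands on a valid date.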

R Sum rows by hourly rate

I'm getting started with R, so please bear with me
For example, I have this data.table (or data.frame) object :
Time Station count_starts count_ends
01/01/2015 00:30 A 2 3
01/01/2015 00:40 A 2 1
01/01/2015 00:55 B 1 1
01/01/2015 01:17 A 3 1
01/01/2015 01:37 A 1 1
My end goal is to group the "Time" column to hourly and sum the count_starts and count_ends based on the hourly time and station :
Time Station sum(count_starts) sum(count_ends)
01/01/2015 01:00 A 4 4
01/01/2015 01:00 B 1 1
01/01/2015 02:00 A 4 2
I did some research and found out that I should use the xts library.
Thanks for helping me out
UPDATE :
I converted the type of transactions$Time to POSIXct, so the xts package should be able to use the timeseries directly.
Using base R, we can do the same thing, except that each hour is labelled by its start, so the hour is one less than in the desired output:
dat=read.table(text = "Time Station count_starts count_ends
'01/01/2015 00:30' A 2 3
'01/01/2015 00:40' A 2 1
'01/01/2015 00:55' B 1 1
'01/01/2015 01:17' A 3 1
'01/01/2015 01:37' A 1 1",
header = TRUE, stringsAsFactors = FALSE)
dat$Time=cut(strptime(dat$Time,"%m/%d/%Y %H:%M"),"hour")
aggregate(.~Time+Station,dat,sum)
Time Station count_starts count_ends
1 2015-01-01 00:00:00 A 4 4
2 2015-01-01 01:00:00 A 4 2
3 2015-01-01 00:00:00 B 1 1
You can use the order function to rearrange the table or even the sort.POSIXlt function:
m=aggregate(.~Time+Station,dat,sum)
m[order(m[,1]),]
Time Station count_starts count_ends
1 2015-01-01 00:00:00 A 4 4
3 2015-01-01 00:00:00 B 1 1
2 2015-01-01 01:00:00 A 4 2
A solution using dplyr and lubridate. The key is to use ceiling_date to convert the date time column to hourly time-step, and then group and summarize the data.
library(dplyr)
library(lubridate)
dt2 <- dt %>%
mutate(Time = mdy_hm(Time)) %>%
mutate(Time = ceiling_date(Time, unit = "hour")) %>%
group_by(Time, Station) %>%
summarise(`sum(count_starts)` = sum(count_starts),
`sum(count_ends)` = sum(count_ends)) %>%
ungroup()
dt2
# # A tibble: 3 x 4
# Time Station `sum(count_starts)` `sum(count_ends)`
# <dttm> <chr> <int> <int>
# 1 2015-01-01 01:00:00 A 4 4
# 2 2015-01-01 01:00:00 B 1 1
# 3 2015-01-01 02:00:00 A 4 2
DATA
dt <- read.table(text = "Time Station count_starts count_ends
'01/01/2015 00:30' A 2 3
'01/01/2015 00:40' A 2 1
'01/01/2015 00:55' B 1 1
'01/01/2015 01:17' A 3 1
'01/01/2015 01:37' A 1 1",
header = TRUE, stringsAsFactors = FALSE)
Explanation
mdy_hm is the function to convert the string to date-time class. It means "month-day-year hour-minute", which depends on the structure of the string. ceiling_date rounds a date-time object up based on the unit specified. group_by is to group the variable. summarise is to conduct summary operation.
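To make the difference between labelling an hour by its start and by its end concrete, here is a small base R sketch on the same sample data. It assumes the times are month/day/year in UTC (the time zone is my assumption), and hour_start, hour_end and res are illustrative names.

```r
dat <- data.frame(
  Time = c("01/01/2015 00:30", "01/01/2015 00:40", "01/01/2015 00:55",
           "01/01/2015 01:17", "01/01/2015 01:37"),
  Station = c("A", "A", "B", "A", "A"),
  count_starts = c(2, 2, 1, 3, 1),
  count_ends = c(3, 1, 1, 1, 1)
)

# Parse to POSIXct and work in seconds since the epoch
secs <- as.numeric(as.POSIXct(dat$Time, format = "%m/%d/%Y %H:%M", tz = "UTC"))
to_label <- function(s) {
  format(as.POSIXct(s, origin = "1970-01-01", tz = "UTC"), "%Y-%m-%d %H:%M")
}

dat$hour_start <- to_label(floor(secs / 3600) * 3600)    # like floor_date / cut
dat$hour_end   <- to_label(ceiling(secs / 3600) * 3600)  # like ceiling_date

# Sum both count columns per hour (labelled by its end) and station
res <- aggregate(cbind(count_starts, count_ends) ~ hour_end + Station,
                 data = dat, FUN = sum)
res
```

Grouping by hour_start instead gives the same sums with labels one hour earlier, which is the only difference between the floor-based and ceiling-based answers here.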
There are basically two things required:
1) Round the Time down to the start of its 1-hour window:
library(data.table)
library(lubridate)
data=data.table(Time=c('01/01/2015 00:30','01/01/2015 00:40','01/01/2015 00:55','01/01/2015 01:17','01/01/2015 01:37'),Station=c('A','A','B','A','A'),count_starts=c(2,2,1,3,1),count_ends=c(3,1,1,1,1))
data[,Time_conv:=as.POSIXct(strptime(Time,'%d/%m/%Y %H:%M'))]
data[,Time_round:=floor_date(Time_conv,unit="1 hour")]
2) Aggregate the data table obtained above, by hour and station, to get the desired result:
New_data=data[,list(count_starts_sum=sum(count_starts),count_ends_sum=sum(count_ends)),by=c('Time_round','Station')]
