Counting the number of people in the system in R

I have the arrival and departure date-times of different customers of a system. I want to count the number of people in the system at every 30-minute mark. How can I do this in R?
Here are my data

If I understand your question, here's an example with fake data:
library(tidyverse)
library(lubridate)
# Fake data
set.seed(2)
dat = data.frame(id = 1:1000, type = rep(c("A", "B"), 500),
                 arrival = as.POSIXct("2013-08-21 05:00:00") + sample(-10000:10000, 1000, replace = TRUE))
dat$departure = dat$arrival + sample(100:5000, 1000, replace=TRUE)
# Times when we want to check how many people are still present
times = seq(round_date(min(dat$arrival), "hour"), ceiling_date(max(dat$departure), "hour"), "30 min")
# Count number of people present at each time
map_df(times, function(x) {
  dat %>%
    group_by(type) %>%
    summarise(Time = x,
              Count = sum(arrival < x & departure > x)) %>%
    spread(type, Count) %>%
    mutate(Total = A + B)
})
Time A B Total
<dttm> <int> <int> <int>
1 2013-08-21 02:00:00 0 0 0
2 2013-08-21 02:30:00 26 31 57
3 2013-08-21 03:00:00 54 53 107
4 2013-08-21 03:30:00 75 81 156
5 2013-08-21 04:00:00 58 63 121
6 2013-08-21 04:30:00 66 58 124
7 2013-08-21 05:00:00 55 60 115
8 2013-08-21 05:30:00 52 63 115
9 2013-08-21 06:00:00 57 62 119
10 2013-08-21 06:30:00 62 51 113
11 2013-08-21 07:00:00 60 67 127
12 2013-08-21 07:30:00 72 54 126
13 2013-08-21 08:00:00 66 46 112
14 2013-08-21 08:30:00 19 12 31
15 2013-08-21 09:00:00 1 2 3
16 2013-08-21 09:30:00 0 0 0
17 2013-08-21 10:00:00 0 0 0

I'm not sure what you mean by counting the number of people "in the system", but I'm assuming you mean "the number of people who have arrived but not yet departed". To do this, you can apply a simple logical condition on the relevant columns of your dataframe, e.g.
logicVec <- df$arrival_time <= dateTimeObj & dateTimeObj < df$departure_time
logicVec will be a logical vector of TRUEs and FALSEs. Because TRUE == 1 and FALSE == 0, you can then simply use sum(logicVec) to get the total number of people/customers/rows who fulfill the condition written above.
You can then simply repeat this line of code for every dateTimeObj (of class e.g. POSIXct) you want. In your case, that would be a sequence of dateTimeObj values spaced 30 minutes apart, as in the sketch below.
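For instance, a minimal sketch, assuming a data frame df with POSIXct columns arrival_time and departure_time (the names come from the line above; adjust to your real data):
# build a half-hourly grid spanning the data, then count at each time point
times <- seq(min(df$arrival_time), max(df$departure_time), by = "30 min")
counts <- sapply(seq_along(times), function(i) {
  dateTimeObj <- times[i]  # indexing keeps the POSIXct class
  sum(df$arrival_time <= dateTimeObj & dateTimeObj < df$departure_time)
})
data.frame(Time = times, Count = counts)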
I hope this helps.

Related

Analyzing data in order of column and then row in R

I have a dataset of data logged at 5-minute intervals that also includes data at 1-minute intervals, denoted by _1 to _5 in the header.
Each row represents a 5 minute interval.
datetime temp speed_1 speed_2 speed_3 speed_4 speed_5
20190710 09:00:00 21 13 14 26 29 32
20190710 09:05:00 21 28 28 29 38 12
20190710 09:10:00 20 8 15 29 30 19
20190711 11:12:00 18 6 9 18 51 49
20190711 11:17:00 17 49 48 48 30 10
The actual dataset has an additional 25 columns of data logged at 5-minute intervals and consists of approximately 25,000 rows.
I'm looking for an efficient way of analyzing the speed for each day.
For example, if I wanted to plot the speed for each day it would take speed_1 to speed_5 from the earliest entry on a particular day, say 09:00:00, then speed_1 to speed_5 from the next time, 09:05:00, and so on for the whole day.
Currently I have created an additional dataframe for the speed that fills in the times to give:
datetime speed
20190710 09:00:00 13
20190710 09:01:00 14
20190710 09:02:00 26
20190710 09:03:00 29
20190710 09:04:00 32
This results in a second df of 125,000 entries. I was wondering if there is a more memory-efficient way of analyzing the original dataset, as the datasets may grow considerably in the future.
Edit: Reproducible code added
structure(list(time = structure(1:3, .Label = c("20190710 09-00-00", "20190710 09-05-00", "20190710 09-10-00"), class = "factor"), temp = c(21, 21, 20), speed_1 = c(13, 28, 8), speed_2 = c(14, 28, 15), speed_3 = c(26, 29, 29), speed_4 = c(29, 38, 30), speed_5 = c(32, 12, 19)), .Names = c("time", "temp", "speed_1", "speed_2", "speed_3", "speed_4", "speed_5"), row.names = c(NA, -3L), class = "data.frame")
Here is a dplyr version:
library(tidyverse)
library(lubridate)
df <- read.table(text='datetime temp speed_1 speed_2 speed_3 speed_4 speed_5
"20190710 09:00:00" 21 13 14 26 29 32
"20190710 09:05:00" 21 28 28 29 38 12
"20190710 09:10:00" 20 8 15 29 30 19
"20190711 11:12:00" 18 6 9 18 51 49
"20190711 11:17:00" 17 49 48 48 30 10',header=T)
# we take our dataframe
df %>%
  # ...then we put all the speed columns in one column
  pivot_longer(starts_with("speed_"),
               names_to = "minute",
               values_to = "speed") %>%
  # ...then we...
  mutate(datetime = ymd_hms(datetime), # ...turn the "datetime" column into an actual datetime format
         minute = gsub("speed_", "", minute) %>% as.numeric(), # ...strip "speed_" from the former column names (now in column "minute")
         datetime = datetime + minutes(minute - 1)) # ...and add the minute offset to our datetime...
...to get this:
# A tibble: 25 x 4
datetime temp minute speed
<dttm> <int> <dbl> <int>
1 2019-07-10 09:00:00 21 1 13
2 2019-07-10 09:01:00 21 2 14
3 2019-07-10 09:02:00 21 3 26
4 2019-07-10 09:03:00 21 4 29
5 2019-07-10 09:04:00 21 5 32
6 2019-07-10 09:05:00 21 1 28
7 2019-07-10 09:06:00 21 2 28
8 2019-07-10 09:07:00 21 3 29
9 2019-07-10 09:08:00 21 4 38
10 2019-07-10 09:09:00 21 5 12
# ... with 15 more rows
Some example data and expected output would help a lot. I gave it a shot anyway. You can do this if you simply want a list of all the speeds for every date.
dataset <- read.table(text='datetime temp speed_1 speed_2 speed_3 speed_4 speed_5
"20190710 09:00:00" 21 13 14 26 29 32
"20190710 09:05:00" 21 28 28 29 38 12
"20190710 09:10:00" 20 8 15 29 30 19
"20190711 11:12:00" 18 6 9 18 51 49
"20190711 11:17:00" 17 49 48 48 30 10',header=T)
dataset$datetime <- as.POSIXlt(dataset$datetime,format="%Y%m%d %H:%M:%OS")
lapply(split(dataset,as.Date(dataset$datetime)), function(x) c(t(x[,3:ncol(x)])) )
output:
$`2019-07-10`
[1] 13 14 26 29 32 28 28 29 38 12 8 15 29 30 19
$`2019-07-11`
[1] 6 9 18 51 49 49 48 48 30 10
Edit: Updated answer so that the speeds are in the correct order.
Here is something raw using data.table:
library(data.table)
setDT(df)
df[, time := as.POSIXct(time, format="%Y%m%d %H-%M-%OS")]
out <-
df[, !"temp"
][, melt(.SD, id.vars = "time")
][, time := time + (rleid(variable)-1)*60, time
][order(time), !"variable"]
out
# time value
# 1: 2019-07-10 09:00:00 13
# 2: 2019-07-10 09:01:00 14
# 3: 2019-07-10 09:02:00 26
# 4: 2019-07-10 09:03:00 29
# 5: 2019-07-10 09:04:00 32
# 6: 2019-07-10 09:05:00 28
# 7: 2019-07-10 09:06:00 28
# 8: 2019-07-10 09:07:00 29
# 9: 2019-07-10 09:08:00 38
# 10: 2019-07-10 09:09:00 12
# 11: 2019-07-10 09:10:00 8
# 12: 2019-07-10 09:11:00 15
# 13: 2019-07-10 09:12:00 29
# 14: 2019-07-10 09:13:00 30
# 15: 2019-07-10 09:14:00 19
Data:
df <- data.frame(
time = factor(c("20190710 09-00-00", "20190710 09-05-00", "20190710 09-10-00")),
temp = c(21, 21, 20),
speed_1 = c(13, 28, 8),
speed_2 = c(14, 28, 15),
speed_3 = c(26, 29, 29),
speed_4 = c(29, 38, 30),
speed_5 = c(32, 12, 19)
)

Binning differences in dates as time-unit-aware numeric vector across years

I need to calculate "how many x units apart" each element in a vector of POSIX dates is from a given reference date, where
- x is a "typical" time unit like month, week, quarter etc.
- the date vector can span multiple years
- the result needs to be a numeric vector
I have something, but it doesn't feel like a consistent approach that could be generalized (two different approaches for month and week).
Possibly worth noting: I'm generally looking for solutions that comply with ISO 8601.
EDIT
"Consistent" in the sense that I would ideally, say, a solution that is always leverages as.numeric(dates) with some clever "time unit binning" afterwards. But for months I wouldn't see how this could be achieved as each month contains a different number of days (works for weeks as we can always safely say "a week contains 7 days").
In other words: for months I'd like to use something like (as.numeric(.x) / (<something>)) just as I use (as.numeric(.x) / (60 * 60 * 24 * 7)) for weeks. It's that <something> that I'm looking for to have a generic way of binning differences in dates.
Solution draft
Function defs:
library(magrittr)
library(purrr)
normalize_time_distance_month <- function(dates) {
  dates %>%
    as.POSIXct() %>%
    purrr::map_dbl(function(.x)
      as.numeric(format(.x, "%y")) * 12 + as.numeric(format(.x, "%m")))
}
normalize_time_distance_week <- function(dates) {
  dates %>%
    as.POSIXct() %>%
    purrr::map_dbl(function(.x)
      (as.numeric(.x) / (60 * 60 * 24 * 7)) %>%
        round())
}
Months:
# Months ------------------------------------------------------------------
dates <- seq(as.POSIXct("2018-03-01"), length.out = 24, by = "month")
origin <- as.POSIXct("2018-05-01")
dates_norm <- normalize_time_distance_month(dates)
origin_norm <- normalize_time_distance_month(origin)
(time_diffs <- dates_norm - origin_norm)
#> [1] -2 -1 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
#> [24] 21
Weeks:
# Weeks -------------------------------------------------------------------
dates <- seq(as.POSIXct("2018-05-07"), length.out = 104, by = "week")
origin <- as.POSIXct("2018-05-21")
dates_norm <- normalize_time_distance_week(dates)
origin_norm <- normalize_time_distance_week(origin)
(time_diffs <- dates_norm - origin_norm)
#> [1] -2 -1 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
#> [18] 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
#> [35] 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48
#> [52] 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65
#> [69] 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82
#> [86] 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99
#> [103] 100 101
Created on 2018-05-25 by the reprex package (v0.2.0).
One option would be to pass an expression as an argument and then parse it:
library(tidyverse)
library(rlang)
normalize_time_distance <- function(dates, expr) {
  dates %>%
    as_tibble %>%
    mutate(value = as.POSIXct(value)) %>%
    mutate(value = !! parse_expr(expr)) %>%
    pull(value)
}
expr1 <- 'as.numeric(format(value, "%y")) * 12 + as.numeric(format(value, "%m"))'
normalize_time_distance(dates, expr1)
#[1] 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237
#[20] 238 239 240 241 242
expr2 <- 'round((as.numeric(value) / (60 * 60 * 24 * 7)))'
normalize_time_distance(dates, expr2)
#[1] 2513 2517 2522 2526 2530 2535 2539 2544 2548 2552 2557 2561 2565 2570 2574
#[16] 2578 2583 2587 2591 2596 2600 2604 2609 2613
If you are interested in intervals that are multiples of a day, there is no point in using the POSIXt classes. They only introduce the possibility of time zone errors, which you can prevent entirely by using the Date class, so from here on we will assume Date class. as.Date can be used to convert a POSIXct object to a Date object.
There are two distinct cases in your question. Intervals that are multiples of a day (day, week) and intervals that are multiples of a month (month, quarter, year). These will have to be dealt with separately because there is not a fixed number of days in a month.
Case 1 - interval is a multiple of days
If the interval length is d days and x and y are Date class objects, then the number of intervals is:
# x and y are Date class
(as.numeric(y) - as.numeric(x)) / d
where d is 1 for days and 7 for weeks.
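As a quick worked check (my own sketch, not part of the original answer), this reproduces the week differences from the question once the dates are converted to Date class:
# d = 7 for week-sized intervals
dates <- seq(as.Date("2018-05-07"), length.out = 104, by = "week")
origin <- as.Date("2018-05-21")
d <- 7
(as.numeric(dates) - as.numeric(origin)) / d
#> [1] -2 -1 0 1 2 ... 101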
Case 2 - interval is a multiple of months
If the interval length is m months and x and y are Date class objects:
library(zoo)
date2ym <- function(x) {
  ym <- as.yearmon(x)
  b <- as.numeric(as.Date(ym))
  e <- as.numeric(as.Date(ym, frac = 1))
  12 * as.numeric(ym) + (as.numeric(x) - b) / (e - b + 1)
}
# x and y are Date class
(date2ym(y) - date2ym(x)) / m
where m is 1 for months, 3 for quarters and 12 for years.
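A corresponding sketch for the month case (again my own example, not part of the original answer), converting the question's vectors to Date class first; it reproduces the -2 to 21 sequence from the question:
library(zoo)
dates <- seq(as.Date("2018-03-01"), length.out = 24, by = "month")
origin <- as.Date("2018-05-01")
m <- 1  # 1 for months; use 3 for quarters, 12 for years
(date2ym(dates) - date2ym(origin)) / m
#> [1] -2 -1 0 1 2 ... 21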
EDIT
Fix (2).

How to calculate average time interval based on unique value?

I'm having trouble calculating the average time interval (how many days) between appearances of the same value in another column.
My data looks like this:
dt subject_id
2016-09-13 77
2016-11-07 1791
2016-09-18 1332
2016-08-31 84
2016-08-23 89
2016-08-23 41
2016-09-15 41
2016-10-12 93
2016-10-05 93
2016-11-09 94
2016-10-25 94
2016-11-03 94
2016-10-09 375
2016-10-14 11
2016-09-27 11
2016-09-13 11
2016-08-23 11
2016-08-27 11
And I want to get something like this:
subject_id mean_day
41 23
93 7
94 7.5
11 13
I tried to use:
aggregate(dt~subject_id, data, mean)
But it can't calculate mean from Date values. Any ideas?
My first approach would be something like this:
df$dt <- as.Date(df$dt)
library(dplyr)
df %>%
  group_by(subject_id) %>%
  summarise((max(dt) - min(dt)) / (n() - 1))
# <int> <time>
#1 11 13.0 days
#2 41 23.0 days
#3 77 NaN days
#4 84 NaN days
#5 89 NaN days
#6 93 7.0 days
#7 94 7.5 days
#8 375 NaN days
#9 1332 NaN days
#10 1791 NaN days
I think it's a starting point for you ... you can modify as you want.
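If you prefer to stay closer to your aggregate attempt, here is a base-R sketch of the same idea (my own variant; it converts dt to Date first and returns the gap in days as a plain number, giving NaN for subjects seen only once):
df$dt <- as.Date(df$dt)
aggregate(dt ~ subject_id, data = df,
          FUN = function(d) as.numeric(max(d) - min(d)) / (length(d) - 1))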

Compare two time series

I have two time series with hourly resolution. Now I want to compare the load time series with the capacity time series and count the number of hours when the load is bigger than the capacity, so that for each hour I know whether there is enough capacity to meet the load, and can calculate the exact difference in the cases when there is not enough capacity.
library(xts)
load<-c(81,81,82,98,81,67,90,92,75,78,83,83,83,43,97,92,72,85,62)
capacity<-c(78,97,78,65,45,98,67,109,78,109,52,42,97,87,83,90,99,89,125)
time1<-seq(from=as.POSIXct("2013-01-01 00:00"),to=as.POSIXct("2013-01-01 18:00"),by="hour")
dat0<-data.frame(load,capacity)
df1<-xts(dat0,order.by=time1)
df1
load capacity
2013-01-01 00:00:00 81 78
2013-01-01 01:00:00 81 97
2013-01-01 02:00:00 82 78
2013-01-01 03:00:00 98 65
2013-01-01 04:00:00 81 45
2013-01-01 05:00:00 67 98
2013-01-01 06:00:00 90 67
2013-01-01 07:00:00 92 109
2013-01-01 08:00:00 75 78
2013-01-01 09:00:00 78 109
2013-01-01 10:00:00 83 52
2013-01-01 11:00:00 83 42
2013-01-01 12:00:00 83 97
2013-01-01 13:00:00 43 87
2013-01-01 14:00:00 97 83
2013-01-01 15:00:00 92 90
2013-01-01 16:00:00 72 99
2013-01-01 17:00:00 85 89
2013-01-01 18:00:00 62 125
I just want to know the fastest way to calculate this, since I need to compare 10 years of data.
I would suggest using dplyr, which performs well on large datasets. Check out the following piece of code, and also make sure to have a look at the official Introduction to dplyr.
library(dplyr)
## difference between capacity and load
dat0 %>%
  mutate(diff = capacity - load) -> dat1
## count hours with sufficient capacity
dat1 %>%
  count(sufficient = diff >= 0) %>%
  data.frame()
And here's the console output of the second operation.
sufficient n
1 FALSE 9
2 TRUE 10
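To also get the exact difference for the hours where there is not enough capacity (the second part of the question), one possible follow-up reusing dat1 from above (my own sketch):
dat1 %>%
  filter(diff < 0) %>%       # hours where load exceeds capacity
  mutate(shortfall = -diff)  # exact missing capacity per hour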

Function: calculating seconds between data points

I have the following column in my data frame:
DateTime
1 2011-10-03 08:00:04
2 2011-10-03 08:00:05
3 2011-10-03 08:00:06
4 2011-10-03 08:00:09
5 2011-10-03 08:00:15
6 2011-10-03 08:00:24
7 2011-10-03 08:00:30
8 2011-10-03 08:00:42
9 2011-10-03 08:01:01
10 2011-10-03 08:01:24
11 2011-10-03 08:01:58
12 2011-10-03 08:02:34
13 2011-10-03 08:03:25
14 2011-10-03 08:04:26
15 2011-10-03 08:06:00
With dput:
> dput(smallDF)
structure(list(DateTime = structure(c(1317621604, 1317621605,
1317621606, 1317621609, 1317621615, 1317621624, 1317621630, 1317621642,
1317621661, 1317621684, 1317621718, 1317621754, 1317621805, 1317621866,
1317621960, 1317622103, 1317622197, 1317622356, 1317622387, 1317622463,
1317622681, 1317622851, 1317623061, 1317623285, 1317623404, 1317623498,
1317623612, 1317623849, 1317623916, 1317623994, 1317624174, 1317624414,
1317624484, 1317624607, 1317624848, 1317625023, 1317625103, 1317625179,
1317625200, 1317625209, 1317625229, 1317625238, 1317625249, 1317625264,
1317625282, 1317625300, 1317625315, 1317625339, 1317625353, 1317625365,
1317625371, 1317625381, 1317625395, 1317625415, 1317625423, 1317625438,
1317625458, 1317625469, 1317625487, 1317625500, 1317625513, 1317625533,
1317625548, 1317625565, 1317625581, 1317625598, 1317625613, 1317625640,
1317625661, 1317625674, 1317625702, 1317625715, 1317625737, 1317625758,
1317625784, 1317625811, 1317625826, 1317625841, 1317625862, 1317625895,
1317625909, 1317625935, 1317625956, 1317625973, 1317626001, 1317626043,
1317626062, 1317626100, 1317626113, 1317626132, 1317626153, 1317626179,
1317626212, 1317626239, 1317626271, 1317626296, 1317626323, 1317626361,
1317626384, 1317626407), class = c("POSIXct", "POSIXt"), tzone = "")), .Names = "DateTime", row.names = c(NA,
-100L), class = "data.frame")
My goal: I want to calculate the time difference, in seconds, between each measurement.
Edit:
I'm looking to get the following result, where the time difference (in seconds) between each data point is calculated, except for the first value of the day (line 3), whose time difference is calculated relative to 8 am:
DateTime Seconds
1 2011-09-30 21:59:02 6
2 2011-09-30 21:59:04 2
3 2011-10-03 08:00:04 4
4 2011-10-03 08:00:05 1
5 2011-10-03 08:00:06 1
6 2011-10-03 08:00:09 3
7 2011-10-03 08:00:15 5
8 2011-10-03 08:00:24 9
9 2011-10-03 08:00:30 6
10 2011-10-03 08:00:42 12
11 2011-10-03 08:01:01 19
12 2011-10-03 08:01:24 23
13 2011-10-03 08:01:58 34
14 2011-10-03 08:02:34 36
15 2011-10-03 08:03:25 51
16 2011-10-03 08:04:26 61
17 2011-10-03 08:06:00 94
However, the measurements start at 8:00 am, so if the value is the first of the day, the number of seconds relative to 8:00 am need to be calculated. In the example above, the first measurement ends at 8:00:04 so using the $sec attribute of POSIX could work here, but on other days the first value may happen a few minutes after 8:00 o'clock.
I've tried to achieve that goal with the following function:
SecondsInBar <- function(x, startTime) {
  # First data point or first of day
  if (x == 1 || x > 1 && x$wkday != x[-1]$wkday) {
    seconds <- as.numeric(difftime(x,
                                   as.POSIXlt(startTime, format = "%H:%M:%S"),
                                   units = "secs"))
  # else calculate time difference
  } else {
    seconds <- as.numeric(difftime(x, x[-1], units = "secs"))
  }
  return(seconds)
}
Which then could be called with SecondsInBar(smallDF$DateTime, "08:00:00").
There are at least two problems with this function, but I don't know how to solve them:
- The code segment x$wkday != x[-1]$wkday returns a "$ operator is invalid for atomic vectors" error.
- as.POSIXlt(startTime, format = "%H:%M:%S") uses the current date, which makes the difftime calculation erroneous.
My question:
Where am I going wrong with this function?
And: is this approach a viable way or should I approach it from a different angle?
How about something along these lines:
smallDF$DateTime - as.POSIXct(paste(strftime(smallDF$DateTime,"%Y-%m-%d"),"07:00:00"))
Time differences in secs
[1] 4 5 6 9 15 24 30 42 61 84 118 154 205 266 360
[16] 503 597 756 787 863 1081 1251 1461 1685 1804 1898 2012 2249 2316 2394
[31] 2574 2814 2884 3007 3248 3423 3503 3579 3600 3609 3629 3638 3649 3664 3682
[46] 3700 3715 3739 3753 3765 3771 3781 3795 3815 3823 3838 3858 3869 3887 3900
[61] 3913 3933 3948 3965 3981 3998 4013 4040 4061 4074 4102 4115 4137 4158 4184
[76] 4211 4226 4241 4262 4295 4309 4335 4356 4373 4401 4443 4462 4500 4513 4532
[91] 4553 4579 4612 4639 4671 4696 4723 4761 4784 4807
attr(,"tzone")
[1] ""
Note that I used 7am because when I copied your data, my R session decided to interpret it as BST.
As for your errors, you can't use $ to get elements of a date with POSIXct (which is how smallDF$DateTime is defined), only with POSIXlt. And for the second error, if you don't supply a date, it has to assume the current date, as there is no other information to draw upon.
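A quick illustration of that first point (my own aside, not part of the original answer; note that the POSIXlt component is called wday, not wkday):
x_lt <- as.POSIXlt("2011-10-03 08:00:04")
x_lt$wday    # 1 (day of the week, 0 = Sunday); POSIXlt is list-based, so $ works
x_ct <- as.POSIXct("2011-10-03 08:00:04")
# x_ct$wday  # would fail: $ operator is invalid for atomic vectors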
Edit
Now it's been clarified, I would propose a different approach: split your data frame by day, then combine the times with the reference time and take diff on that, using lapply to loop over days:
# modify dataframe to add extra day to second half
smallDF[51:100, 1] <- smallDF[51:100, 1] + 86400
smallDF2 <- split(smallDF, strftime(smallDF$DateTime, "%Y-%m-%d"))
lapply(smallDF2, function(x)
  diff(c(as.POSIXct(paste(strftime(x$DateTime[1], "%Y-%m-%d"), "07:00:00")),
         x$DateTime)))
$`2011-10-03`
Time differences in secs
[1] 4 1 1 3 6 9 6 12 19 23 34 36 51 61 94 143 94 159 31
[20] 76 218 170 210 224 119 94 114 237 67 78 180 240 70 123 241 175 80 76
[39] 21 9 20 9 11 15 18 18 15 24 14 12
$`2011-10-04`
Time differences in secs
[1] 3771 10 14 20 8 15 20 11 18 13 13 20 15 17 16
[16] 17 15 27 21 13 28 13 22 21 26 27 15 15 21 33
[31] 14 26 21 17 28 42 19 38 13 19 21 26 33 27 32
[46] 25 27 38 23 23
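If, as in your expected output, you want those per-day differences back as a Seconds column, here is my own follow-up sketch (it assumes the rows are already sorted by DateTime, as they are in smallDF):
res <- lapply(smallDF2, function(x) {
  ref <- as.POSIXct(paste(strftime(x$DateTime[1], "%Y-%m-%d"), "07:00:00"))
  as.numeric(diff(c(ref, x$DateTime)), units = "secs")
})
smallDF$Seconds <- unlist(res)  # split() keeps within-day row order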
