Calculating absence duration for employee's working time - r
I ran into a problem in R trying to calculate employee's actual working time where I subtract the absence duration from the working time.
For one scheduled day there can be several durations of absence.
Example data frame of one scheduled day:
row_num StartDate EndDate Absence_StartDate Absence_EndDate
1 2019-11-13 14:30:00 2019-11-13 18:30:00 2019-11-13 15:45:00 2019-11-13 16:15:00
2 2019-11-13 14:30:00 2019-11-13 18:30:00 2019-11-13 08:15:00 2019-11-13 14:00:00
3 2019-11-13 14:30:00 2019-11-13 18:30:00 2019-11-13 15:30:00 2019-11-13 16:30:00
4 2019-11-13 14:30:00 2019-11-13 18:30:00 2019-11-13 08:00:00 2019-11-13 15:00:00
You can reproduce the data frame from this line:
df <- data.frame(StartDate = rep("2019-11-13 14:30:00", 4),
EndDate = rep("2019-11-13 18:30:00", 4),
Absence_StartDate = c("2019-11-13 15:45:00", "2019-11-13 08:15:00", "2019-11-13 15:30:00", "2019-11-13 08:00:00"),
Absence_EndDate = c("2019-11-13 16:15:00", "2019-11-13 14:00:00", "2019-11-13 16:30:00", "2019-11-13 15:00:00"))
The main problem I face is that some of those absences overlap each other, for example row_num 1 and 3. The 1st row is of no use to me because the 3rd row covers it entirely (the 3rd row takes 60 minutes off the working time). The 2nd row doesn't affect the working time, as it falls outside the employee's schedule. The 4th row covers the 2nd row and does affect the working time (it runs from the 14:30 start until 15:00, so -30 minutes).
For this example, the working time is 240 minutes and the absence is 90 minutes, so the actual working time is 150 minutes.
How could one write code to calculate the actual working time, given that the StartDate and EndDate can vary (but are always on the same day) and there can be several absences (some of which may not affect the working time at all)?
Tried using intervals from lubridate library and using for loop but couldn't achieve the results.
Thank you!
====== UPDATE =======
The code below by @AnilGoyal works fine most of the time. However, there is a problem that I tried to solve but couldn't.
Here's an example that doesn't work:
dput(df2)
structure(list(empid = c(1, 1, 1, 1, 1, 1, 1, 1), Date = structure(c(18213,
18213, 18213, 18213, 18213, 18213, 18213, 18213), class = "Date"),
presence_start = structure(c(1573648200, 1573648200, 1573648200,
1573648200, 1573624800, 1573624800, 1573624800, 1573624800
), tzone = "", class = c("POSIXct", "POSIXt")), presence_end = structure(c(1573655400,
1573655400, 1573655400, 1573655400, 1573646400, 1573646400,
1573646400, 1573646400), tzone = "", class = c("POSIXct",
"POSIXt")), emprsn = c(1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L), absence_start = structure(list(
sec = c(0, 0, 0, 0, 0, 0, 0, 0), min = c(15L, 15L, 30L,
0L, 15L, 15L, 30L, 0L), hour = c(15L, 8L, 14L, 8L, 15L,
8L, 14L, 8L), mday = c(13L, 13L, 13L, 13L, 13L, 13L,
13L, 13L), mon = c(10L, 10L, 10L, 10L, 10L, 10L, 10L,
10L), year = c(119L, 119L, 119L, 119L, 119L, 119L, 119L,
119L), wday = c(3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L), yday = c(316L,
316L, 316L, 316L, 316L, 316L, 316L, 316L), isdst = c(0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L), zone = c("EET", "EET", "EET",
"EET", "EET", "EET", "EET", "EET"), gmtoff = c(7200L,
7200L, 7200L, 7200L, 7200L, 7200L, 7200L, 7200L)), tzone = c("",
"EET", "EEST"), class = c("POSIXlt", "POSIXt")), absence_end = structure(list(
sec = c(0, 0, 0, 0, 0, 0, 0, 0), min = c(15L, 0L, 30L,
0L, 15L, 0L, 30L, 0L), hour = c(16L, 14L, 16L, 14L, 16L,
14L, 16L, 14L), mday = c(13L, 13L, 13L, 13L, 13L, 13L,
13L, 13L), mon = c(10L, 10L, 10L, 10L, 10L, 10L, 10L,
10L), year = c(119L, 119L, 119L, 119L, 119L, 119L, 119L,
119L), wday = c(3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L), yday = c(316L,
316L, 316L, 316L, 316L, 316L, 316L, 316L), isdst = c(0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L), zone = c("EET", "EET", "EET",
"EET", "EET", "EET", "EET", "EET"), gmtoff = c(7200L,
7200L, 7200L, 7200L, 7200L, 7200L, 7200L, 7200L)), tzone = c("",
"EET", "EEST"), class = c("POSIXlt", "POSIXt"))), row.names = c(NA,
-8L), class = "data.frame")
> df2
empid Date presence_start presence_end emprsn absence_start absence_end
1 1 2019-11-13 2019-11-13 18:00:00 2019-11-13 20:00:00 1 2019-11-13 15:15:00 2019-11-13 16:15:00
2 1 2019-11-13 2019-11-13 18:00:00 2019-11-13 20:00:00 2 2019-11-13 08:15:00 2019-11-13 14:00:00
3 1 2019-11-13 2019-11-13 18:00:00 2019-11-13 20:00:00 3 2019-11-13 14:30:00 2019-11-13 16:30:00
4 1 2019-11-13 2019-11-13 18:00:00 2019-11-13 20:00:00 4 2019-11-13 08:00:00 2019-11-13 14:00:00
5 1 2019-11-13 2019-11-13 11:30:00 2019-11-13 17:30:00 1 2019-11-13 15:15:00 2019-11-13 16:15:00
6 1 2019-11-13 2019-11-13 11:30:00 2019-11-13 17:30:00 2 2019-11-13 08:15:00 2019-11-13 14:00:00
7 1 2019-11-13 2019-11-13 11:30:00 2019-11-13 17:30:00 3 2019-11-13 14:30:00 2019-11-13 16:30:00
8 1 2019-11-13 2019-11-13 11:30:00 2019-11-13 17:30:00 4 2019-11-13 08:00:00 2019-11-13 14:00:00
I used the code below with some modifications to account for several employees and, as I noticed later, several working times for one employee in one day (for example:
from 2019-11-13 8:00 to 2019-11-13 14:00 and
from 2019-11-13 14:30 to 2019-11-13 16:30).
Because of this, each working time has duplicated absences, but as far as I know this shouldn't be a problem.
With this example I tried adding presence_start to group_by, but total_absence comes out as 0 for both working times. I've noticed that each working time has only one instance of d2 = 1; maybe that's the problem? The total_absence for this example should be 6 hours and 2 hours.
Since the scenario has been revised, I am proposing an alternative tidyverse strategy. I hope this works.
The sample data has been modified a bit so that the previous scenario is also included.
> dput(df)
structure(list(empid = c("1", "1", "1", "1", "1", "1", "1", "1",
"2", "2", "2", "2", "2"), presence_start = structure(list(sec = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), min = c(0L, 0L, 0L, 0L,
30L, 30L, 30L, 30L, 30L, 30L, 30L, 30L, 30L), hour = c(18L, 18L,
18L, 18L, 11L, 11L, 11L, 11L, 14L, 14L, 14L, 14L, 14L), mday = c(13L,
13L, 13L, 13L, 13L, 13L, 13L, 13L, 13L, 13L, 13L, 13L, 13L),
mon = c(10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L,
10L, 10L, 10L), year = c(119L, 119L, 119L, 119L, 119L, 119L,
119L, 119L, 119L, 119L, 119L, 119L, 119L), wday = c(3L, 3L,
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L), yday = c(316L,
316L, 316L, 316L, 316L, 316L, 316L, 316L, 316L, 316L, 316L,
316L, 316L), isdst = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L), zone = c("IST", "IST", "IST", "IST", "IST",
"IST", "IST", "IST", "IST", "IST", "IST", "IST", "IST"),
gmtoff = c(NA_integer_, NA_integer_, NA_integer_, NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_)), class = c("POSIXlt",
"POSIXt")), presence_end = structure(list(sec = c(0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0), min = c(0L, 0L, 0L, 0L, 30L, 30L,
30L, 30L, 30L, 30L, 30L, 30L, 30L), hour = c(20L, 20L, 20L, 20L,
17L, 17L, 17L, 17L, 18L, 18L, 18L, 18L, 18L), mday = c(13L, 13L,
13L, 13L, 13L, 13L, 13L, 13L, 13L, 13L, 13L, 13L, 13L), mon = c(10L,
10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L),
year = c(119L, 119L, 119L, 119L, 119L, 119L, 119L, 119L,
119L, 119L, 119L, 119L, 119L), wday = c(3L, 3L, 3L, 3L, 3L,
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L), yday = c(316L, 316L, 316L,
316L, 316L, 316L, 316L, 316L, 316L, 316L, 316L, 316L, 316L
), isdst = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L), zone = c("IST", "IST", "IST", "IST", "IST", "IST",
"IST", "IST", "IST", "IST", "IST", "IST", "IST"), gmtoff = c(NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_,
NA_integer_, NA_integer_)), class = c("POSIXlt", "POSIXt"
)), absrsn = c("1", "2", "3", "4", "1", "2", "3", "4", "5", "6",
"7", "8", "9"), absence_start = structure(list(sec = c(0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), min = c(15L, 15L, 30L, 0L,
15L, 15L, 30L, 0L, 45L, 15L, 30L, 0L, 15L), hour = c(15L, 8L,
14L, 8L, 15L, 8L, 14L, 8L, 15L, 8L, 15L, 8L, 18L), mday = c(13L,
13L, 13L, 13L, 13L, 13L, 13L, 13L, 13L, 13L, 13L, 13L, 13L),
mon = c(10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L,
10L, 10L, 10L), year = c(119L, 119L, 119L, 119L, 119L, 119L,
119L, 119L, 119L, 119L, 119L, 119L, 119L), wday = c(3L, 3L,
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L), yday = c(316L,
316L, 316L, 316L, 316L, 316L, 316L, 316L, 316L, 316L, 316L,
316L, 316L), isdst = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L), zone = c("IST", "IST", "IST", "IST", "IST",
"IST", "IST", "IST", "IST", "IST", "IST", "IST", "IST"),
gmtoff = c(NA_integer_, NA_integer_, NA_integer_, NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_)), class = c("POSIXlt",
"POSIXt")), absence_end = structure(list(sec = c(0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0), min = c(15L, 0L, 30L, 0L, 15L, 0L,
30L, 0L, 15L, 0L, 30L, 0L, 0L), hour = c(16L, 14L, 16L, 14L,
16L, 14L, 16L, 14L, 16L, 14L, 16L, 15L, 19L), mday = c(13L, 13L,
13L, 13L, 13L, 13L, 13L, 13L, 13L, 13L, 13L, 13L, 13L), mon = c(10L,
10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L),
year = c(119L, 119L, 119L, 119L, 119L, 119L, 119L, 119L,
119L, 119L, 119L, 119L, 119L), wday = c(3L, 3L, 3L, 3L, 3L,
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L), yday = c(316L, 316L, 316L,
316L, 316L, 316L, 316L, 316L, 316L, 316L, 316L, 316L, 316L
), isdst = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L), zone = c("IST", "IST", "IST", "IST", "IST", "IST",
"IST", "IST", "IST", "IST", "IST", "IST", "IST"), gmtoff = c(NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_,
NA_integer_, NA_integer_)), class = c("POSIXlt", "POSIXt"
)), date = structure(c(18213, 18213, 18213, 18213, 18213, 18213,
18213, 18213, 18213, 18213, 18213, 18213, 18213), class = "Date")), row.names = c(NA,
-13L), spec = structure(list(cols = list(empid = structure(list(), class = c("collector_double",
"collector")), presence_start = structure(list(), class = c("collector_character",
"collector")), presence_end = structure(list(), class = c("collector_character",
"collector")), absrsn = structure(list(), class = c("collector_double",
"collector")), absence_start = structure(list(), class = c("collector_character",
"collector")), absence_end = structure(list(), class = c("collector_character",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), skip = 1), class = "col_spec"), class = c("spec_tbl_df",
"tbl_df", "tbl", "data.frame"))
# A tibble: 13 x 7
empid presence_start presence_end absrsn absence_start absence_end date
<chr> <dttm> <dttm> <chr> <dttm> <dttm> <date>
1 1 2019-11-13 18:00:00 2019-11-13 20:00:00 1 2019-11-13 15:15:00 2019-11-13 16:15:00 2019-11-13
2 1 2019-11-13 18:00:00 2019-11-13 20:00:00 2 2019-11-13 08:15:00 2019-11-13 14:00:00 2019-11-13
3 1 2019-11-13 18:00:00 2019-11-13 20:00:00 3 2019-11-13 14:30:00 2019-11-13 16:30:00 2019-11-13
4 1 2019-11-13 18:00:00 2019-11-13 20:00:00 4 2019-11-13 08:00:00 2019-11-13 14:00:00 2019-11-13
5 1 2019-11-13 11:30:00 2019-11-13 17:30:00 1 2019-11-13 15:15:00 2019-11-13 16:15:00 2019-11-13
6 1 2019-11-13 11:30:00 2019-11-13 17:30:00 2 2019-11-13 08:15:00 2019-11-13 14:00:00 2019-11-13
7 1 2019-11-13 11:30:00 2019-11-13 17:30:00 3 2019-11-13 14:30:00 2019-11-13 16:30:00 2019-11-13
8 1 2019-11-13 11:30:00 2019-11-13 17:30:00 4 2019-11-13 08:00:00 2019-11-13 14:00:00 2019-11-13
9 2 2019-11-13 14:30:00 2019-11-13 18:30:00 5 2019-11-13 15:45:00 2019-11-13 16:15:00 2019-11-13
10 2 2019-11-13 14:30:00 2019-11-13 18:30:00 6 2019-11-13 08:15:00 2019-11-13 14:00:00 2019-11-13
11 2 2019-11-13 14:30:00 2019-11-13 18:30:00 7 2019-11-13 15:30:00 2019-11-13 16:30:00 2019-11-13
12 2 2019-11-13 14:30:00 2019-11-13 18:30:00 8 2019-11-13 08:00:00 2019-11-13 15:00:00 2019-11-13
13 2 2019-11-13 14:30:00 2019-11-13 18:30:00 9 2019-11-13 18:15:00 2019-11-13 19:00:00 2019-11-13
Now follow this approach (after loading both of the required libraries):
library(tidyverse)
library(lubridate)
# Calculate both presence and absence times for each employee for each day
df %>%
  group_by(empid, date, presence_start) %>%
  # Clip each absence to its presence window; an absence lying entirely
  # outside the window collapses to zero length
  mutate(absence_start = if_else(absence_start < presence_start, presence_start, absence_start),
         absence_end = if_else(absence_end > presence_end, presence_end, absence_end),
         absence_end = if_else(absence_end < absence_start, absence_start, absence_end)) %>%
  arrange(empid, date, presence_start, absence_start) %>%
  mutate(rowid = row_number()) %>%
  # Where an absence starts before the previous one ends, move its start
  # to the previous end so the shared part is counted only once
  mutate(absence_start = if_else(rowid > 1 & absence_start < lag(absence_end), lag(absence_end), absence_start),
         absence_end = if_else(absence_end < absence_start, absence_start, absence_end)) %>%
  # Keep the presence window on the first row of each group only, so that
  # the repeated rows add nothing to the presence total
  mutate(presence_end = if_else(rowid > 1 & presence_end == lag(presence_end), presence_start, presence_end)) %>%
  ungroup() %>%
  mutate(presence = difftime(presence_end, presence_start, units = "mins"),
         absence = difftime(absence_end, absence_start, units = "mins")) %>%
  group_by(empid, date) %>%
  summarise(sum(presence), sum(absence))
Check the result
# A tibble: 2 x 4
# Groups: empid [2]
empid date `sum(presence)` `sum(absence)`
<chr> <date> <drtn> <drtn>
1 1 2019-11-13 480 mins 270 mins
2 2 2019-11-13 240 mins 105 mins
Manual check: employee 1 has two presence times (rows 1-4 and rows 5-8), 120 + 360 = 480 minutes. Employee 2 has one (rows 9-13), 240 minutes.
Absence times for employee 1: rows 1-4 give 0 minutes, because none of those absences falls inside the 18:00 to 20:00 presence time; rows 6 and 8 overlap (net absence 11:30 to 14:00, 150 minutes); rows 5 and 7 overlap as well (net absence 14:30 to 16:30, 120 minutes). The total for employee 1 is 270 minutes. For employee 2: rows 10 and 12 overlap (net absence 14:30 to 15:00, 30 minutes); rows 9 and 11 overlap (net 15:30 to 16:30, 60 minutes); row 13 runs past the end of presence (net 18:15 to 18:30, 15 minutes). The total is 105 minutes.
Both totals tally with the code result.
Good Luck
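If absences can be nested more deeply (a short absence inside a much longer earlier one, followed by a third that also falls inside the long one), comparing each row only with the row directly before it may not be enough. Here is a minimal base R sketch of a different technique, not the code from this answer: clip every absence to the presence window, sort by start, and open a new block only when a start lies beyond the running maximum of all earlier ends. The helper names `to_sec` and `absence_minutes` are made up for this illustration, and it assumes a single presence window per call.

```r
# Hypothetical helper: convert date-time strings (or POSIXct) to seconds.
to_sec <- function(x) as.numeric(as.POSIXct(x, tz = "UTC"))

# Hypothetical helper: minutes of absence inside ONE presence window,
# counting overlapping absences only once.
absence_minutes <- function(pres_start, pres_end, abs_start, abs_end) {
  # Clip every absence to the presence window.
  s <- pmax(to_sec(abs_start), to_sec(pres_start))
  e <- pmin(to_sec(abs_end), to_sec(pres_end))
  keep <- s < e                     # drop absences outside the window
  s <- s[keep]
  e <- e[keep]
  if (length(s) == 0) return(0)
  # Sort by start time.
  o <- order(s)
  s <- s[o]
  e <- e[o]
  # A new block starts where a start lies beyond every earlier end.
  prev_max_end <- c(-Inf, head(cummax(e), -1))
  block <- cumsum(s > prev_max_end)
  # Total length of the merged blocks, in minutes.
  sum(tapply(e, block, max) - tapply(s, block, min)) / 60
}

# Data from the original question (one scheduled day).
total_absence <- absence_minutes(
  "2019-11-13 14:30:00", "2019-11-13 18:30:00",
  c("2019-11-13 15:45:00", "2019-11-13 08:15:00",
    "2019-11-13 15:30:00", "2019-11-13 08:00:00"),
  c("2019-11-13 16:15:00", "2019-11-13 14:00:00",
    "2019-11-13 16:30:00", "2019-11-13 15:00:00")
)
print(total_absence)
```

To use it per employee and presence window, one option would be to split the data frame on those columns and call the helper on each piece.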
For this problem, I propose the following solution for calculating the total absence time.
The real data frame will presumably contain several employees, so I have duplicated the given df for two employees using a newly created empno field.
Sample data used (please note that the column names have also been tweaked a bit):
> dput(df)
structure(list(empno = c(1, 1, 1, 1, 2, 2, 2, 2), rownum = c(1L,
2L, 3L, 4L, 1L, 2L, 3L, 4L), presence_start = c("2019-11-13 14:30:00",
"2019-11-13 14:30:00", "2019-11-13 14:30:00", "2019-11-13 14:30:00",
"2019-11-13 14:30:00", "2019-11-13 14:30:00", "2019-11-13 14:30:00",
"2019-11-13 14:30:00"), presence_end = c("2019-11-13 18:30:00",
"2019-11-13 18:30:00", "2019-11-13 18:30:00", "2019-11-13 18:30:00",
"2019-11-13 18:30:00", "2019-11-13 18:30:00", "2019-11-13 18:30:00",
"2019-11-13 18:30:00"), absence_start = c("2019-11-13 15:45:00",
"2019-11-13 08:15:00", "2019-11-13 15:30:00", "2019-11-13 08:00:00",
"2019-11-13 15:45:00", "2019-11-13 08:15:00", "2019-11-13 15:30:00",
"2019-11-13 08:00:00"), absence_end = c("2019-11-13 16:15:00",
"2019-11-13 14:00:00", "2019-11-13 16:30:00", "2019-11-13 15:00:00",
"2019-11-13 16:15:00", "2019-11-13 14:00:00", "2019-11-13 16:30:00",
"2019-11-13 15:00:00")), row.names = c(NA, -8L), class = "data.frame")
> df
empno rownum presence_start presence_end absence_start absence_end
1 1 1 2019-11-13 14:30:00 2019-11-13 18:30:00 2019-11-13 15:45:00 2019-11-13 16:15:00
2 1 2 2019-11-13 14:30:00 2019-11-13 18:30:00 2019-11-13 08:15:00 2019-11-13 14:00:00
3 1 3 2019-11-13 14:30:00 2019-11-13 18:30:00 2019-11-13 15:30:00 2019-11-13 16:30:00
4 1 4 2019-11-13 14:30:00 2019-11-13 18:30:00 2019-11-13 08:00:00 2019-11-13 15:00:00
5 2 1 2019-11-13 14:30:00 2019-11-13 18:30:00 2019-11-13 15:45:00 2019-11-13 16:15:00
6 2 2 2019-11-13 14:30:00 2019-11-13 18:30:00 2019-11-13 08:15:00 2019-11-13 14:00:00
7 2 3 2019-11-13 14:30:00 2019-11-13 18:30:00 2019-11-13 15:30:00 2019-11-13 16:30:00
8 2 4 2019-11-13 14:30:00 2019-11-13 18:30:00 2019-11-13 08:00:00 2019-11-13 15:00:00
Calculating the absence time requires creating a few dummy fields, as explained in the comments below.
library(tidyverse)
library(lubridate)
# One row per (employee, row, event), where event is "start" or "end"
df2 <- df %>%
  pivot_longer(cols = -c(rownum, empno), names_to = c(".value", "event"), names_sep = "_") %>%
  mutate(presence = ymd_hms(presence),
         absence = ymd_hms(absence),
         event = factor(event, levels = c("start", "end"), ordered = TRUE),
         # absence_m: the absence time clipped to the presence window
         absence_m = case_when(absence < presence & event == 'start' ~ presence,
                               absence > presence & event == 'end' ~ presence,
                               TRUE ~ absence),
         # +1 opens an absence, -1 closes it
         dummy = ifelse(event == 'start', 1, -1)) %>%
  group_by(empno, rownum) %>%
  # An absence entirely outside the window becomes zero length
  mutate(absence_m = case_when(event == 'end' & lag(absence_m) > absence_m ~ lag(absence_m),
                               TRUE ~ absence_m)) %>%
  group_by(empno) %>%
  arrange(empno, absence_m) %>%
  # Running count of open absences; d2 marks where a merged block of absence
  # starts (count reaches 1) or ends (count returns to 0), and skips rows
  # that share a timestamp with their neighbour
  mutate(dummy = cumsum(dummy),
         d2 = case_when(event == 'end' & lag(absence_m) == absence_m ~ 0,
                        event == 'start' & lead(absence_m) == absence_m ~ 0,
                        event == "start" & dummy == 1 ~ 1,
                        event == "end" & dummy == 0 ~ 1,
                        TRUE ~ 0)) %>%
  filter(d2 == 1) %>%
  # Each remaining "end" row closes the block opened by the row before it
  mutate(absence_time = case_when(event == 'end' ~ (absence_m - lag(absence_m)),
                                  TRUE ~ 0)) %>%
  mutate(absence_time = replace_na(absence_time, 0)) %>%
  summarise(total_absence = sum(absence_time))
> df2
# A tibble: 2 x 2
empno total_absence
<dbl> <drtn>
1 1 90 mins
2 2 90 mins
I am leaving calculation of total presence to you, which should not be difficult. Still if you have any doubts feel free to ask.
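For the presence part, one possible base R sketch is shown here. It assumes each employee's presence windows are repeated across rows (as in the sample data) and do not overlap one another, so they are de-duplicated before summing. The `absence` data frame simply re-enters the 90-minute totals printed above, and the names `windows`, `presence_mins`, and `working_mins` are made up for this illustration.

```r
# Presence columns only, in the same shape as the sample data above.
df <- data.frame(
  empno = c(1, 1, 2, 2),
  presence_start = "2019-11-13 14:30:00",
  presence_end = "2019-11-13 18:30:00",
  stringsAsFactors = FALSE
)

# Each presence window is repeated once per absence row, so keep one copy.
windows <- unique(df[c("empno", "presence_start", "presence_end")])
windows$presence_mins <- as.numeric(difftime(
  as.POSIXct(windows$presence_end, tz = "UTC"),
  as.POSIXct(windows$presence_start, tz = "UTC"),
  units = "mins"
))

# Total presence per employee.
presence <- aggregate(presence_mins ~ empno, data = windows, FUN = sum)

# Absence totals copied from the result above (assumed input here).
absence <- data.frame(empno = c(1, 2), total_absence = c(90, 90))

result <- merge(presence, absence, by = "empno")
result$working_mins <- result$presence_mins - result$total_absence
print(result)
```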
Related
Aggregate function not working properly in R
I am trying to use the aggregate function to get 100 Hz data into 1 minute averages. However, when I use this function the 1-min averages are incorrect. A sample of the data is below. I am using the following code to calculate the 1-min values. The code does not break but the calculations are incorrect.
aggregate(list(X = df$`Gyroscope X`, Y = df$`Gyroscope Y`, Z = df$`Gyroscope Z`), list(minofday = cut(df$Timestamp, "1 min")), mean)
Timestamp                    Gyroscope X  Gyroscope Y  Gyroscope Z
2018-07-10T10:25:00.0000000  41.381838    -21.667482   -118.896492
2018-07-10T10:25:00.0100000  48.046268    -12.399903   -110.917976
2018-07-10T10:25:00.0200000  49.102786    -7.36084     -106.485602
2018-07-10T10:25:00.0300000  44.338382    -9.215699    -102.296759
2018-07-10T10:25:00.0400000  34.724123    -11.308594   -96.108404
2018-07-10T10:25:00.0500000  19.622804    -15.225221   -88.122564
2018-07-10T10:25:00.0600000  13.240968    -26.539308   -85.274663
2018-07-10T10:25:00.0700000  13.397218    -31.933596   -80.127568
2018-07-10T10:25:00.0800000  16.333009    -29.663088   -73.027348
2018-07-10T10:25:00.0900000  17.384645    -29.745485   -67.694096
2018-07-10T10:25:00.1000000  16.546632    -30.08423    -67.565922
Assuming OP's data varies by the min (note the modified data), here is how to do it with base R and dplyr:
df$Timestamp <- as.POSIXct(df$Timestamp, format = "%Y-%m-%dT%H:%M:%S")
aggregate(list(X = df$Gyroscope_X, Y = df$Gyroscope_Y, Z = df$Gyroscope_Z), list(minofday = cut(df$Timestamp, "1 min")), mean)
or a more concise way:
aggregate(. ~ minofday, mean, data = cbind(setNames(df[,-1], c("X", "Y", "Z")), minofday = cut(df$Timestamp, "1 min")))
Result:
             minofday        X          Y          Z
1 2018-07-10 10:24:00 48.57453  -9.880371 -108.70179
2 2018-07-10 10:25:00 27.78422 -19.314983  -95.13774
3 2018-07-10 10:26:00 16.85883 -29.704286  -70.36072
4 2018-07-10 10:27:00 16.54663 -30.084230  -67.56592
With lubridate and summarize_all from dplyr:
library(dplyr)
library(lubridate)
df %>%
  mutate(Timestamp = ymd_hms(Timestamp)) %>%
  group_by(minofday = cut(Timestamp, "1 min")) %>%
  summarize_all(mean) %>%
  select(-Timestamp)
Result:
# A tibble: 4 x 4
  minofday            Gyroscope_X Gyroscope_Y Gyroscope_Z
  <fct>                     <dbl>       <dbl>       <dbl>
1 2018-07-10 10:24:00        48.6       -9.88      -109.
2 2018-07-10 10:25:00        27.8      -19.3        -95.1
3 2018-07-10 10:26:00        16.9      -29.7        -70.4
4 2018-07-10 10:27:00        16.5      -30.1        -67.6
Data:
df <- read.table(text = "
Timestamp Gyroscope_X Gyroscope_Y Gyroscope_Z
2018-07-10T10:25:00.0000000 41.381838 -21.667482 -118.896492
2018-07-10T10:24:00.0100000 48.046268 -12.399903 -110.917976
2018-07-10T10:24:00.0200000 49.102786 -7.36084 -106.485602
2018-07-10T10:25:00.0300000 44.338382 -9.215699 -102.296759
2018-07-10T10:25:00.0400000 34.724123 -11.308594 -96.108404
2018-07-10T10:25:00.0500000 19.622804 -15.225221 -88.122564
2018-07-10T10:25:00.0600000 13.240968 -26.539308 -85.274663
2018-07-10T10:25:00.0700000 13.397218 -31.933596 -80.127568
2018-07-10T10:26:00.0800000 16.333009 -29.663088 -73.027348
2018-07-10T10:26:00.0900000 17.384645 -29.745485 -67.694096
2018-07-10T10:27:00.1000000 16.546632 -30.08423 -67.565922", header = TRUE)
Since you are dealing with timestamps, the xts package has a lot of functions that can help you. For rolling up timestamps, period.apply can help you out. The endpoints part can roll up the data from microseconds all the way up to years.
library(xts)
# don't load the timestamp column, that one goes to the order.by part
df1_xts <- xts(df1[, -1], order.by = df1$Timestamp)
# roll up to minutes
period.apply(df1_xts, endpoints(df1_xts, on = "mins"), colMeans)
                    Gyroscope_X Gyroscope_Y Gyroscope_Z
2018-07-10 10:25:00    28.55624   -20.46759   -90.59249
If your timestamp column is not yet a date-time object, you can use this:
df1$Timestamp <- strptime(df1$Timestamp, format = "%Y-%m-%dT%H:%M:%OS")
data:
df1 <- structure(list(Timestamp = structure(list(sec = c(0, 0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.1), min = c(25L, 25L, 25L, 25L, 25L, 25L, 25L, 25L, 25L, 25L, 25L), hour = c(10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L), mday = c(10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L), mon = c(6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L), year = c(118L, 118L, 118L, 118L, 118L, 118L, 118L, 118L, 118L, 118L, 118L), wday = c(2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), yday = c(190L, 190L, 190L, 190L, 190L, 190L, 190L, 190L, 190L, 190L, 190L), isdst = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), zone = c("CEST", "CEST", "CEST", "CEST", "CEST", "CEST", "CEST", "CEST", "CEST", "CEST", "CEST"), gmtoff = c(NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_)), class = c("POSIXlt", "POSIXt")), Gyroscope_X = c(41.381838, 48.046268, 49.102786, 44.338382, 34.724123, 19.622804, 13.240968, 13.397218, 16.333009, 17.384645, 16.546632), Gyroscope_Y = c(-21.667482, -12.399903, -7.36084, -9.215699, -11.308594, -15.225221, -26.539308, -31.933596, -29.663088, -29.745485, -30.08423), Gyroscope_Z = c(-118.896492, -110.917976, -106.485602, -102.296759, -96.108404, -88.122564, -85.274663, -80.127568,
-73.027348, -67.694096, -67.565922 )), row.names = c(NA, -11L), class = "data.frame")
Find all values for a given row given a date/timestamp
I would like to find all values for a given row given a DateTime value (in dataframe DATA).
DATA <- structure(list(DateTime = structure(list(sec = c(0, 0, 0, 0,0, 0, 0, 0, 0), min = c(30L, 15L, 0L, 45L, 30L, 15L, 0L, 45L,30L), hour = c(15L, 15L, 15L, 14L, 14L, 14L, 14L, 13L, 13L),mday = c(27L, 27L, 27L, 27L, 27L, 27L, 27L, 27L, 27L), mon = c(0L,0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), year = c(116L, 116L, 116L,116L, 116L, 116L, 116L, 116L, 116L), wday = c(3L, 3L, 3L,3L, 3L, 3L, 3L, 3L, 3L), yday = c(26L, 26L, 26L, 26L, 26L,26L, 26L, 26L, 26L), isdst = c(0L, 0L, 0L, 0L, 0L, 0L, 0L,0L, 0L), zone = c("EST", "EST", "EST", "EST", "EST", "EST","EST", "EST", "EST"), gmtoff = c(NA_integer_, NA_integer_,NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_,NA_integer_, NA_integer_)), .Names = c("sec", "min", "hour","mday", "mon", "year", "wday", "yday", "isdst", "zone", "gmtoff"), class = c("POSIXlt", "POSIXt")), Close = c(1127.2, 1127.5,1126.9, 1128.3, 1125.4, 1122.7, 1122.8, 1117.3, 1116), FOMCBinary = c(0,0, 0, 0, 0, 0, 1, 0, 0)), .Names = c("DateTime", "Close", "FOMCBinary"), row.names = 2131:2139, class = "data.frame")
#Output for DATA:
                DateTime  Close FOMCBinary
2131 2016-01-27 15:30:00 1127.2          0
2132 2016-01-27 15:15:00 1127.5          0
2133 2016-01-27 15:00:00 1126.9          0
2134 2016-01-27 14:45:00 1128.3          0
2135 2016-01-27 14:30:00 1125.4          0
2136 2016-01-27 14:15:00 1122.7          0
2137 2016-01-27 14:00:00 1122.8          1
2138 2016-01-27 13:45:00 1117.3          0
2139 2016-01-27 13:30:00 1116.0          0
For example, entering "2016-01-27 14:00:00" should output: Index 2137, DateTime 2016-01-27 14:00:00, Close 1122.8, FOMCBinary 1.
How could I do that? I tried the following without success:
DATA$DateTime["2016-01-27 14:00:00"]
DATA$DateTime[,DateTime = "2016-01-27 14:00:00"]
DATA$DateTime[DateTime = "2016-01-27 14:00:00"]
Thanks @Sotos for the answer:
DATA[DATA$DateTime == "2016-01-27 14:00:00",]
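The comparison above relies on R converting the string to a date-time for you. A small sketch that makes the conversion explicit follows. It rebuilds three rows of DATA as POSIXct in an assumed "UTC" time zone (the original column is POSIXlt with zone "EST"), and `target` and `hit` are names made up for this illustration.

```r
# Three rows of DATA, rebuilt with an explicit (assumed) time zone.
DATA <- data.frame(
  DateTime = as.POSIXct(c("2016-01-27 14:15:00",
                          "2016-01-27 14:00:00",
                          "2016-01-27 13:45:00"), tz = "UTC"),
  Close = c(1122.7, 1122.8, 1117.3),
  FOMCBinary = c(0, 1, 0),
  row.names = 2136:2138
)

# Convert the lookup value in the same time zone, then filter rows.
target <- as.POSIXct("2016-01-27 14:00:00", tz = "UTC")
hit <- DATA[DATA$DateTime == target, ]
print(hit)
```

Note the trailing comma inside the brackets: it selects rows of the data frame, which is what the attempts in the question were missing.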
How to convert variable to time format when values have different formatting
I am trying to convert a variable that contains the length of time of a phone call from a factor into a chronological format using the chron function. However, the values for calls that were less than 1 hour are reported as m:s, while calls that were longer than an hour are reported as h:m:s. As such, when I use the chron function, values that are not in the h:m:s format are converted to missing. How can I convert the variable to the correct formatting?
Using the chron function:
dat$Duration <- chron(times = as.character(dat$Duration), format = c(times = "h:m:s"))
> dat
    Duration              Type                date
346     <NA> Incoming FaceTime 2014-11-22 16:55:45
349     <NA> Outgoing FaceTime 2014-11-22 23:02:24
350     <NA> Incoming FaceTime 2014-11-23 05:25:11
351     <NA>          Canceled 2014-11-23 06:06:19
352     <NA>          Canceled 2014-11-24 10:11:38
353     <NA>          Canceled 2014-11-24 10:12:10
2       <NA>          Canceled 2014-12-26 14:41:02
3       <NA> Outgoing FaceTime 2015-01-03 14:14:21
4   01:33:35 Outgoing FaceTime 2015-01-04 15:16:34
5       <NA> Outgoing FaceTime 2015-01-05 23:57:34
6       <NA> Outgoing FaceTime 2015-01-06 13:57:42
The data:
structure(list(Duration = structure(c(290L, 301L, 263L, 1L, 1L, 1L, 1L, 134L, 13L, 156L, 64L), .Label = c("0:00", "0:59", "1:00:38", "1:00:55", "1:01:06", "1:05", "1:10:07", "1:10:12", "1:12:59", "1:15:06", "1:15:55", "1:22:59", "1:33:35", "1:33:52", "1:35:35", "1:39:46", "1:40", "1:50:32", "1:57:40", "1:58:29", "10:03", "10:17", "10:19", "10:21", "10:34", "10:38", "10:48", "11:09", "11:22", "11:24", "11:28", "11:32", "11:48", "11:51", "12:09", "12:19", "12:22", "12:45", "12:49", "12:51", "13:05", "13:15", "13:25", "13:35", "13:49", "14:16", "14:23", "14:34", "14:43", "14:50", "14:55", "15:20", "15:32", "15:33", "15:43", "15:44", "16:08", "16:09", "16:20", "16:40", "16:41", "16:50", "16:59", "17:02", "17:13", "17:17", "17:25", "17:46", "18:29", "18:30", "18:37", "18:48", "19:35", "19:38", "19:58", "2:02", "2:07", "2:16:59", "20:40", "21:42", "22:07", "22:42", "22:48", "23:10", "24:17", "25:26", "25:45", "26:36", "27:00", "27:16", "27:48", "28:01", "28:32", "28:42", "29:47", "29:51", "3:05", "3:10:17", "3:53", "3:58", "30:22", "30:27", "30:28", "30:38", "30:41", "31:22", "32:18", "33:35", "34:29", "37:44", "38:10", "4:15", "4:22", "4:37", "4:52", "40:14", "41:06", "41:51", "43:21", "44:13", "44:27", "44:58", "46:14", "46:56", "49:27", "5:03", "5:13", "5:43", "5:51", "50:15", "50:27", "51:06", "55:11", "58:10", "58:27", "6:27", "6:29", "6:35", "6:37", "6:49", "6:57", "7:08", "7:09", "7:29", "7:41", "7:47", "8:20", "8:25", "8:30", "8:41", "8:47", "8:55", "8:56", "8:57", "8:58", "9:35", "9:40", "9:55", "0:02", "0:03", "0:04", "0:06", "0:08", "0:10", "0:11", "0:12", "0:13", "0:14", "0:19", "0:22", "0:23", "0:25", "0:28", "0:29", "0:32", "0:36", "0:37", "0:38", "0:40", "0:41", "0:42", "0:43", "0:44", "0:47", "0:51", "0:53", "0:54", "0:56", "0:58", "1:00", "1:02", "1:02:22", "1:03", "1:05:59", "1:08", "1:13", "1:13:34", "1:24:20", "1:25", "1:31", "1:41", "1:42", "1:43", "1:48", "1:50", "1:55", "10:07", "10:12", "10:31", "10:35", "11:06", "11:07", "11:12", "11:27", 
"11:36", "12:01", "12:59", "13:31", "13:34", "13:36", "14:22", "15:04", "15:27", "15:48", "15:54", "15:57", "16:10", "16:12", "16:23", "17:12", "17:29", "17:44", "17:54", "18:12", "18:46", "2:01", "2:04", "2:08", "2:38", "2:57", "20:36", "20:39", "20:41", "21:03", "21:58", "22:04", "23:35", "24:02", "26:06", "26:25", "26:33", "26:43", "26:59", "27:09", "27:13", "27:19", "28:09", "28:30", "28:34", "3:00", "3:30", "3:37", "3:38", "3:41", "30:23", "30:31", "31:03", "31:09", "31:39", "31:47", "31:59", "32:12", "32:15", "33:17", "33:31", "34:51", "35:24", "36:01", "36:26", "36:47", "37:40", "37:50", "39:16", "4:02", "4:08", "4:20", "4:24", "4:31", "40:39", "43:49", "45:48", "47:28", "48:39", "5:05", "5:44", "5:55", "5:57", "54:04", "6:06", "6:43", "6:46", "6:50", "7:07", "7:22", "7:58", "8:00", "8:21", "8:26", "9:09", "9:22"), class = "factor"), Type = structure(c(3L, 5L, 3L, 1L, 1L, 1L, 1L, 5L, 5L, 5L, 5L), .Label = c("Canceled", "Incoming", "Incoming FaceTime", "Missed", "Outgoing FaceTime", "Outgoing" ), class = "factor"), date = structure(list(sec = c(45, 24, 11, 19, 38, 10, 2, 21, 34, 34, 42), min = c(55L, 2L, 25L, 6L, 11L, 12L, 41L, 14L, 16L, 57L, 57L), hour = c(16L, 23L, 5L, 6L, 10L, 10L, 14L, 14L, 15L, 23L, 13L), mday = c(22L, 22L, 23L, 23L, 24L, 24L, 26L, 3L, 4L, 5L, 6L), mon = c(10L, 10L, 10L, 10L, 10L, 10L, 11L, 0L, 0L, 0L, 0L), year = c(114L, 114L, 114L, 114L, 114L, 114L, 114L, 115L, 115L, 115L, 115L), wday = c(6L, 6L, 0L, 0L, 1L, 1L, 5L, 6L, 0L, 1L, 2L), yday = c(325L, 325L, 326L, 326L, 327L, 327L, 359L, 2L, 3L, 4L, 5L), isdst = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), zone = c("AEDT", "AEDT", "AEDT", "AEDT", "AEDT", "AEDT", "AEDT", "AEDT", "AEDT", "AEDT", "AEDT" ), gmtoff = c(NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_)), .Names = c("sec", "min", "hour", "mday", "mon", "year", "wday", "yday", "isdst", "zone", "gmtoff" ), class = c("POSIXlt", "POSIXt"))), 
.Names = c("Duration", "Type", "date"), row.names = c(346L, 349L, 350L, 351L, 352L, 353L, 2L, 3L, 4L, 5L, 6L), class = "data.frame")
The obvious approach would be to manipulate your data in a first pass, then call the chron function. You could pass every date through something like this:
date = re.sub(r'^(\d+:\d+)$', r'00:\1', date)
or
if date.count(':') == 1:
    date = '00:' + date
EDIT: How did I get here? I don't even know R, sorry. The above is sample code in Python. Thanks to @thelatemail for translating this to R:
dat$Duration <- chron(times=gsub("^(\\d+:\\d+)$","00:\\1",dat$Duration), format=c(times="h:m:s"))
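If a plain number of seconds is enough, the same padding idea can be sketched in base R without chron. This is only an illustration, not part of the answer above; `to_seconds` is a made-up helper name, and it assumes every value is either m:s or h:m:s.

```r
# Hypothetical helper: convert "m:s" or "h:m:s" strings (or a factor of
# them) to a number of seconds.
to_seconds <- function(x) {
  x <- as.character(x)
  # Pad m:s values with a leading hour field so every value is h:m:s.
  x <- sub("^([0-9]+:[0-9]+)$", "0:\\1", x)
  parts <- strsplit(x, ":", fixed = TRUE)
  # Weight hours, minutes, seconds and add them up.
  vapply(parts, function(p) sum(as.numeric(p) * c(3600, 60, 1)), numeric(1))
}

durations <- to_seconds(factor(c("0:59", "1:33:35", "10:03")))
print(durations)
```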
Group dates by bimester
Here is a sample of the data I'm currently working on:
x <- structure(list(sec = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0), min = c(59L, 32L, 23L, 26L, 20L, 14L, 1L, 5L, 32L, 2L), hour = c(10L, 15L, 12L, 12L, 16L, 18L, 18L, 9L, 14L, 12L), mday = c(9L, 15L, 2L, 15L, 20L, 26L, 11L, 22L, 9L, 16L), mon = c(4L, 11L, 10L, 7L, 9L, 8L, 10L, 8L, 8L, 4L), year = c(111L, 111L, 111L, 111L, 111L, 111L, 111L, 111L, 111L, 111L), wday = c(1L, 4L, 3L, 1L, 4L, 1L, 5L, 4L, 5L, 1L), yday = c(128L, 348L, 305L, 226L, 292L, 268L, 314L, 264L, 251L, 135L), isdst = c(0L, 1L, 1L, 0L, 1L, 0L, 1L, 0L, 0L, 0L)), .Names = c("sec", "min", "hour", "mday", "mon", "year", "wday", "yday", "isdst"), class = c("POSIXlt", "POSIXt"))
So that
> x
 [1] "2011-05-09 10:59:00" "2011-12-15 15:32:00" "2011-11-02 12:23:00"
 [4] "2011-08-15 12:26:00" "2011-10-20 16:20:00" "2011-09-26 18:14:00"
 [7] "2011-11-11 18:01:00" "2011-09-22 09:05:00" "2011-09-09 14:32:00"
[10] "2011-05-16 12:02:00"
Say I want to tabulate the distribution of x by month. This is how I accomplish it:
> table(strftime(x, '%m'))
05 08 09 10 11 12
 2  1  3  1  2  1
Now I want to do a similar tabulation, but this time I want to group the data by bimester (and possibly by trimester or semester, later on). I've taken a look at the help page for strptime, but couldn't find an appropriate separator. This is the best I have come up with so far:
> table(cut(x = as.numeric(strftime(x, '%m')), breaks = c(1, 3, 5, 7, 9, 11, 13), labels = c('1-2', '3-4', '5-6', '7-8', '9-10', '11-12'), right = FALSE))
  1-2   3-4   5-6   7-8  9-10 11-12
    0     0     2     1     4     3
It is a convoluted way of reaching this, but it's OK for a simple example and a single case. However, this approach will give me headaches down the road, since I'll want those data to remain POSIX (not to mention it makes my code scarier than it should be). Is there an elegant solution for this?
If you're sticking with table and vectors (as opposed to having rectangular data/output, in which case I'd use data.table), you could do:
table(2*(x$mon %/% 2) + 1)
#
#  5  7  9 11
#  2  1  4  3
You could do away with any kind of formatting of the date values themselves and just create a lookup vector for your groupings. This would also allow total flexibility in specifying what months fit into what categories. E.g.:
src <- factor(rep(c('01-02','03-04','05-06','07-08','09-10','11-12'),each=2))
src[x$mon+1]
#[1] 05-06 11-12 11-12 07-08 09-10 09-10 11-12 09-10 09-10 05-06
#Levels: 01-02 03-04 05-06 07-08 09-10 11-12
table(src[x$mon+1])
#01-02 03-04 05-06 07-08 09-10 11-12
#    0     0     2     1     4     3
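The question also mentions trimesters and semesters. The integer-division idea from the earlier answer could be generalised to any period length, roughly as sketched here. `period_start_month` is a made-up helper name, and the four dates are taken from the question's data and re-entered as POSIXct in an assumed "UTC" time zone.

```r
# Hypothetical helper: first month (1-based) of the k-month period that
# each date falls in. k = 2 gives bimesters, 3 trimesters, 6 semesters.
period_start_month <- function(x, k) {
  mon0 <- as.POSIXlt(x)$mon   # 0-based month, 0..11
  k * (mon0 %/% k) + 1
}

x <- as.POSIXct(c("2011-05-09 10:59:00", "2011-12-15 15:32:00",
                  "2011-11-02 12:23:00", "2011-08-15 12:26:00"),
                tz = "UTC")

bimester  <- period_start_month(x, 2)
trimester <- period_start_month(x, 3)
semester  <- period_start_month(x, 6)
print(table(bimester))
```

Since only a derived grouping vector is created, x itself stays a date-time object.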
plotting x-axes with custom label in R
I have to plot these data:
day                  temperature
02/01/2012 13:30:00  10
10/01/2012 20:30:00  8
15/01/2012 13:30:00  12
25/01/2012 20:30:00  6
02/02/2012 13:30:00  5
10/02/2012 20:30:00  3
15/02/2012 13:30:00  6
25/02/2012 20:30:00  -1
02/03/2012 13:30:00  4
10/03/2012 20:30:00  -2
15/03/2012 13:30:00  7
25/03/2012 20:30:00  1
On the x-axis I want to label only the month and the day (e.g. Jan 02). How can I do this using the commands plot() and axis()?
First, you will need to put your date text into a date class (e.g. as.POSIXct):
df <- structure(list(day = structure(list(sec = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), min = c(30L, 30L, 30L, 30L, 30L, 30L, 30L, 30L, 30L, 30L, 30L, 30L), hour = c(13L, 20L, 13L, 20L, 13L, 20L, 13L, 20L, 13L, 20L, 13L, 20L), mday = c(2L, 10L, 15L, 25L, 2L, 10L, 15L, 25L, 2L, 10L, 15L, 25L), mon = c(0L, 0L, 0L, 0L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L), year = c(112L, 112L, 112L, 112L, 112L, 112L, 112L, 112L, 112L, 112L, 112L, 112L), wday = c(1L, 2L, 0L, 3L, 4L, 5L, 3L, 6L, 5L, 6L, 4L, 0L), yday = c(1L, 9L, 14L, 24L, 32L, 40L, 45L, 55L, 61L, 69L, 74L, 84L), isdst = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L)), .Names = c("sec", "min", "hour", "mday", "mon", "year", "wday", "yday", "isdst" ), class = c("POSIXlt", "POSIXt")), temperature = c(10L, 8L, 12L, 6L, 5L, 3L, 6L, -1L, 4L, -2L, 7L, 1L)), .Names = c("day", "temperature"), row.names = c(NA, -12L), class = "data.frame")
df
df$day <- as.POSIXct(df$day, format="%d/%m/%Y %H:%M:%S")
Your dates should now plot correctly. Suppress the default x-axis with the argument xaxt="n". Afterwards, you can create a sequence of dates where you would like your axis labeled, and apply this with axis.POSIXct:
plot(df$day, df$temperature, t="l", ylab="Temperature", xlab="Date", xaxt="n")
SEQ <- seq(min(df$day), max(df$day), by="months")
axis.POSIXct(SEQ, at=SEQ, side=1, format="%b %Y")
Similarly, to get a daily axis, simply modify the SEQ and axis.POSIXct code accordingly. For example, you may try:
plot(df$day, df$temperature, t="l", ylab="Temperature", xlab="Date", xaxt="n")
SEQ <- seq(min(df$day), max(df$day), by="days")
axis.POSIXct(SEQ, at=SEQ, side=1, format="%b %d")