I am working on a transaction dataset that reports the time of transaction in hhmmss format, e.g. 204629, 215450, etc.
I would like to derive from the given column a factor variable with levels that indicate certain hours of the day, e.g. 12-3 pm, 3-6 pm, etc.
I can think of using the str_sub function to select the hour values from the given variable and converting them to a factor. But is there a more efficient method to achieve this?
You can use dplyr::mutate and stringr::str_sub to create the hour column, and then use cut to divide the hour column into your periods.
library(dplyr)
library(stringr)
library(lubridate)
tibble(string = c("215450", "220102", "020129")) %>%
  mutate(hour = str_sub(string, 1, 2) %>% as.numeric(),
         minute = str_sub(string, 3, 4) %>% as.numeric(),
         second = str_sub(string, 5, 6) %>% as.numeric(),
         time = str_c(hour, minute, second, sep = ":") %>% hms()) %>%
  mutate(period = cut(hour, breaks = 2, labels = c("period one", "period two")))
# A tibble: 3 x 6
#   string  hour minute second time        period
#   <chr>  <dbl>  <dbl>  <dbl> <Period>    <fct>
# 1 215450    21     54     50 21H 54M 50S period two
# 2 220102    22      1      2 22H 1M 2S   period two
# 3 020129     2      1     29 2H 1M 29S   period one
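If you want the actual three-hour bins from the question rather than breaks = 2, you can pass explicit break points and labels to cut(); a minimal sketch (the label strings here are just one naming choice):

library(dplyr)
library(stringr)

tibble(string = c("215450", "220102", "020129")) %>%
  mutate(hour = as.numeric(str_sub(string, 1, 2)),
         period = cut(hour,
                      breaks = seq(0, 24, by = 3),   # [0,3), [3,6), ..., [21,24)
                      labels = c("12-3 am", "3-6 am", "6-9 am", "9-12 am",
                                 "12-3 pm", "3-6 pm", "6-9 pm", "9-12 pm"),
                      right = FALSE, include.lowest = TRUE))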
I have a tibble in which each row represents an image of an eye and contains the following relevant variables: patientId, laterality (left or right), date, imageId.
I would like to manipulate this to create another tibble showing the number of followUpYears for each eye (patientId, laterality). followUpYears is defined in a somewhat unusual way:
In order to meet the requirements for follow-up in a particular year, there must be two different imaging dates during that year i.e. between days 0-365 for year 1, days 366-730 for year 2 etc. The first image date is always the baseline and followUpYears is always an integer.
Only one image per date is considered.
Follow-up ceases as soon as the requirement for 2 imaging dates in a year is not met i.e. if there is only 1 imaging date in the first year, followUpYears is 0 regardless of how many images are taken subsequently.
There is no requirement for there to be at least n years between the first and last image date for an eye to have n followUpYears.
The following dummy data demonstrates these points:
data <- tibble(
  patientId = c('A','A','A','A','A','A','B','B','B','B','B','B','B'),
  laterality = c('L','L','L','L','L','L','R','R','R','R','L','L','L'),
  date = as.Date(c('2000-05-05','2000-05-05','2001-05-06','2001-05-07','2002-05-06','2002-05-07',
                   '2000-09-08','2001-09-07','2001-09-09','2001-09-10','2000-09-08','2001-09-07','2001-09-10')),
  imageId = 1:13
)

expected_output <- tibble(
  patientId = c('A','B','B'),
  laterality = c('L','R','L'),
  followUpYears = c(0, 2, 1)
)
Patient A's left eye has 0 followUpYears because of points 2 and 3. Patient B's right eye has 2 followUpYears because of point 4 (despite the fact that there is only slightly more than 1 year between the first and last image date). Patient B's left eye only has 1 year of follow up since it doesn't meet the requirement for 2 image dates in year 2.
I am familiar with the basic dplyr verbs but I can't think of how to frame this type of variable. Note that patients might have one or both eyes included and some might have 10+ years of follow up. Finally, a solution that considers 1 year to be 365 days regardless of leap years is fine.
Thank you!
Here's a way with ifelse. diff_year is a helper function that computes the difference between two dates in years, rounded up.
library(dplyr)
diff_year <- function(date1, date2) ceiling(as.numeric(difftime(date1, date2)) / 365)
data %>%
  group_by(patientId) %>%
  summarise(followUpYears = ifelse(diff_year(date[date != first(date)][1], first(date)) <= 1,
                                   diff_year(max(date), min(date)), 0))
# A tibble: 2 × 2
#  patientId followUpYears
#  <chr>             <dbl>
#1 A                     0
#2 B                     2
Update based on the OP's comment. This should work with all conditions:
diff_year <- function(date1, date2) as.numeric((date1 - date2) / 365)
data %>%
  distinct(patientId, laterality, date, .keep_all = TRUE) %>%
  group_by(patientId, laterality) %>%
  mutate(diffYear = floor(diff_year(date, min(date)))) %>%
  add_count(count = diffYear) %>%
  filter(!cumany(lag(n == 1, default = 0)) | row_number() == 1) %>%
  summarise(followUpYears = ifelse(any(n > 1), ceiling(diff_year(max(date[n != 1]), min(date))), 0))
#  patientId laterality followUpYears
#1 A         L                      0
#2 B         L                      1
#3 B         R                      2
Below is my approach, which should cover all four conditions. I'm not sure, however, how you get:
#> # A tibble: 1 x 3
#> patientId laterality followUpYears
#> <chr> <chr> <dbl>
#> 1 B L 1
since, according to your logic, it should fall into the two-year band: from 2000-09-08 to 2001-09-10 is 367 days, which equals two years.
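You can check that span directly:

as.numeric(as.Date('2001-09-10') - as.Date('2000-09-08'))
#> [1] 367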
The idea is that we first calculate a followup_flag which checks whether each date is within 365 days of the previous one, and then take the cummin() so that the series breaks as soon as there is no direct follow-up year.
Then we can filter to all rows which meet followup_flag == 1.
On this filtered data we check how many years lie between the first and the last date, and since we want to count 367 days as 2 years, we have to take the ceiling().
library(dplyr)
library(lubridate)
data %>%
  group_by(patientId, laterality) %>%
  mutate(followup_flag = cummin(date - dplyr::lag(date, default = first(date)) <= 365)) %>%
  filter(followup_flag == 1) %>%
  summarise(followUpYears = ceiling(
    as.numeric(difftime(last(date), first(date), units = "days")) / 365
  ))
#> `summarise()` has grouped output by 'patientId'. You can override using the
#> `.groups` argument.
#> # A tibble: 3 x 3
#> # Groups: patientId [2]
#> patientId laterality followUpYears
#> <chr> <chr> <dbl>
#> 1 A L 0
#> 2 B L 2
#> 3 B R 2
Data used:
data <- tibble(
  patientId = c('A','A','A','A','A','A','B','B','B','B','B','B','B'),
  laterality = c('L','L','L','L','L','L','R','R','R','R','L','L','L'),
  date = as.Date(c('2000-05-05','2000-05-05','2001-05-06','2001-05-07','2002-05-06','2002-05-07',
                   '2000-09-08','2001-09-07','2001-09-09','2001-09-10','2000-09-08','2001-09-07','2001-09-10')),
  imageId = 1:13
)
Created on 2023-02-08 by the reprex package (v2.0.1)
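Whichever variant you use, it is worth comparing the result against expected_output; with dplyr, an anti_join that returns zero rows confirms the summaries match. A sketch, where `result` is a hypothetical name for the summarised tibble produced above:

library(dplyr)

# `result` stands for the summarised tibble from either approach above
result %>%
  ungroup() %>%
  anti_join(expected_output, by = c("patientId", "laterality", "followUpYears"))
# zero rows means every (patientId, laterality) pair got the expected followUpYears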
I am trying to create a function that takes the last 30 readings before each break in time. I then want to average those readings to get a single value for that set of 30 seconds.
This is an example of the data I have
time concentration
10:29:19 1814.04
10:29:20 1815.80
10:29:21 1816.09
10:29:22 1817.52
10:29:23 1819.10
10:29:24 1818.25
10:29:25 1818.35
10:29:26 1819.10
10:29:27 1820.31
10:29:28 1819.63
10:29:29 1818.94
10:29:30 1818.91
10:29:31 1819.58
10:29:32 1818.73
10:29:33 1820.21
10:29:34 1821.64
10:29:35 1819.39
10:29:36 1818.52
10:29:37 1819.58
10:29:38 1820.27
10:29:39 1818.99
10:29:40 1819.77
10:29:41 1820.08
10:29:42 1820.13
10:29:43 1819.26
10:29:44 1820.50
10:29:45 1820.12
10:29:46 1818.45
10:29:47 1819.54   <- Here is the break in time: it jumps from 10:29:47 to 10:31:00
10:31:00 1129.30
10:31:01 2673.05
10:31:02 2492.65
10:31:03 2232.31
10:31:04 2190.14
This similar format is repeated for an hour's worth of readings.
I'll suggest a tidyverse flow, since I infer that grouping is important here (and it is generally easy and intuitive within a dplyr pipe). Also, the strings you have for time-of-day are not easily comparable, so I'm going to convert them to a true timestamp (POSIXt in R-speak)[1].
library(dplyr)
dat %>%
  mutate(
    time_psx = as.POSIXct(paste(Sys.Date(), time)),
    grp = cumsum(c(TRUE, diff(time_psx) > 1))
  ) %>%
  group_by(grp) %>%
  filter(time_psx >= max(time_psx) - 30) %>%
  summarize(
    time = last(time),
    time_psx = last(time_psx),
    n = n(),
    mean = mean(concentration),
    .groups = "drop"
  )
# # A tibble: 2 x 5
#     grp time     time_psx                n  mean
#   <int> <chr>    <dttm>              <int> <dbl>
# 1     1 10:29:47 2021-01-18 10:29:47    29 1819.
# 2     2 10:31:04 2021-01-18 10:31:04     5 2143.
This includes the last group, even if it is not technically followed by a gap. If you need to remove that, then we can add a single filter
dat %>%
  mutate(
    time_psx = as.POSIXct(paste(Sys.Date(), time)),
    grp = cumsum(c(TRUE, diff(time_psx) > 1))
  ) %>%
  filter(grp < max(grp)) %>%
  group_by(grp) %>%
  filter(time_psx >= max(time_psx) - 30) %>%
  summarize(
    time = last(time),
    time_psx = last(time_psx),
    n = n(),
    mean = mean(concentration),
    .groups = "drop"
  )
# # A tibble: 1 x 5
#     grp time     time_psx                n  mean
#   <int> <chr>    <dttm>              <int> <dbl>
# 1     1 10:29:47 2021-01-18 10:29:47    29 1819.
Note:
The conversion to POSIXt works, but it will cause a logical problem if you have data that wraps over midnight. If you have a real "date" in your data, I urge you to include it in the sample data and use it in your conversion with as.POSIXct (in place of Sys.Date()).
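For example, if your data had a date column (a hypothetical name here), the conversion might look like:

library(dplyr)

# `date` is a hypothetical "YYYY-MM-DD" column alongside `time` such as "10:29:19"
dat %>%
  mutate(time_psx = as.POSIXct(paste(date, time), format = "%Y-%m-%d %H:%M:%S"))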
Working with the time class in R can sometimes be a little challenging.
My dataset is a single column of day-times (H/M/S). Nevertheless, when I import the dataset, R classifies it as chr.
What I want to do is create an additional column that has a value equal to 1 when the time is after 10:30:00, else 0.
Via lubridate I managed to convert the data into the Period data type:
db %>% dplyr::select(Time) %>% mutate(Time = lubridate::hms(Time))
Nevertheless, I do not know exactly how to apply an if statement to the resulting column Time of class Period as this:
%>% ifelse(Time > 10H 30M 0S, 1, 0)
does not work.
Any hint on how to get this simple task done? Many thanks!
You can do as follows.
library(dplyr)
library(lubridate)
data <- tibble(time = hms(c('10:29:30', '14:12:55')))
data %>%
  mutate(after = if_else(time > hms('10:30:00'), 1, 0))
# time after
# <Period> <dbl>
# 1 10H 29M 30S 0
# 2 14H 12M 55S 1
An option is to coerce to integer with as.integer
library(lubridate)
library(dplyr)
data %>%
  mutate(after = as.integer(time > hms('10:30:00')))
# A tibble: 2 x 2
# time after
# <Period> <int>
#1 10H 29M 30S 0
#2 14H 12M 55S 1
data
data <- tibble(time = hms(c('10:29:30', '14:12:55')))
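Since zero-padded "HH:MM:SS" strings sort in chronological order, a third option is to compare the raw character column directly, with no Period conversion at all. A minimal sketch, assuming every field in time is zero-padded:

library(dplyr)

data <- tibble(time = c('10:29:30', '14:12:55'))  # raw character times
data %>%
  mutate(after = as.integer(time > '10:30:00'))  # lexicographic comparison matches time order here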
I have a lubridate period column in my table as the following shows.
workerID worked_hours
02 08H30M00S
02 08H00M00S
03 08H00M00S
03 05H40M00S
What I want to achieve is to sum the number of hours worked by workerID. I also want the result to be in HH:MM:SS format; even if the hours exceed 24, I don't want it to have a day component, and instead want the hours to accumulate to more than 24.
I have tried working with
df %>%
  group_by(workerID) %>%
  summarise(sum(worked_hours))
but this returns a 0.
You can use the package lubridate, which makes dealing with times a bit easier. In your case, we need to parse worked_hours with hms() and convert it to a duration (a count of seconds that can be summed) first, then group by worker ID and take the sum. However, in order to get it in the format HH:MM:SS, we need to convert back to a period, i.e.
library(tidyverse)
library(lubridate)
df %>%
  mutate(new = as.duration(hms(worked_hours))) %>%
  group_by(workerID) %>%
  summarise(sum_times = sum(new)) %>%
  mutate(sum_times = seconds_to_period(sum_times))
which gives,
# A tibble: 2 x 2
workerID sum_times
<int> <S4: Period>
1 2 16H 30M 0S
2 3 13H 40M 0S
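If you need the literal HH:MM:SS text (with hours allowed to run past 24, as asked), one option is to format the summed seconds yourself instead of converting to a period; a sketch along the lines of the pipe above:

library(dplyr)
library(lubridate)

df %>%
  mutate(secs = as.numeric(as.duration(hms(worked_hours)))) %>%
  group_by(workerID) %>%
  summarise(secs = sum(secs)) %>%
  mutate(worked_hours = sprintf("%02d:%02d:%02d",
                                secs %/% 3600,        # whole hours, free to exceed 24
                                secs %% 3600 %/% 60,  # leftover minutes
                                secs %% 60))          # leftover seconds
# worked_hours would read "16:30:00" and "13:40:00" for the data above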
There's also a base R solution. I've added a row so that the summed minutes overflow and one worker's total exceeds 24 hours.
workerID worked_hours
1 2 08H30M00S
2 2 08H00M00S
3 3 08H00M00S
4 3 05H40M00S
5 2 09H45M00S
We could split worked_hours at the non-digit characters, then aggregate it by worker ID. After that, we need to carry the full hours out of the minutes. Finally we collapse the time components with ":".
# split "08H30M00S" at the non-digit characters into numeric H / M / S columns
p <- cbind(p[1], do.call(rbind, lapply(strsplit(p$worked_hours, "\\D"), as.numeric)))
p <- aggregate(. ~ workerID, p, sum)  # sum each component per worker
p$`1` <- p$`1` + floor(p$`2` / 60)    # carry whole hours out of the minutes
p$`2` <- p$`2` %% 60
p[-1] <- lapply(p[-1], function(x) sprintf("%02d", x)) # to always have two digits
cbind(p[1], worked_hours = apply(p[-1], 1, function(x) paste(x, collapse = ":")))
# workerID worked_hours
# 1 2 26:15:00
# 2 3 13:40:00
Data
p <- structure(list(workerID = c("2", "2", "3", "3", "2"), worked_hours = c("08H30M00S",
"08H00M00S", "08H00M00S", "05H40M00S", "09H45M00S")), row.names = c(NA,
-5L), class = "data.frame")
I have data in the form of start and stop times (in the format minutes:seconds). A simplistic example might be the timestamp of a light turning on, and the subsequent timestamp of the light turning off.
For example:
Start Stop
00:03.1 00:40.9
00:55.0 01:38.2
01:40.0 02:01.1
I would like to rearrange the data so that I can eventually look at it in terms of whole-minute interval bins in R.
Option 1: Turn the data into a binary listing for each tenth of a second, then aggregate the data later by timestamp.
Time.in.sec Yes.or.No
0.0 N
0.1 N
... ...
3.0 N
3.1 Y
3.2 Y
... ...
40.8 Y
40.9 N
... ...
Option 2: Split the time intervals at the minute marks and aggregate total time per minute (starting at time = 0:00.0) with some sort of logical rule.
Start Stop
00:03.10 00:40.90
00:55.00 00:59.99
01:00.00 01:38.20
01:40.00 01:59.99
02:00.00 02:01.10
I have tried looking into lubridate functions (i.e., making each range into an interval class) and cut(), but I can’t seem to figure out how to make either of these ideas work. I also am unclear whether packages such as zoo would be appropriate for this; honestly, I have very little experience with date/time formats and time series.
Other questions on Stackoverflow seem to be addressing making bins from raw timestamps (e.g., What is an efficient method for partitioning and aggregating intervals from timestamped rows in a data frame? and Aggregate data by equally spaced time intervals in R), but I essentially want to do the opposite.
EDIT 1: Here is a CSV-format of the example data, up through minute 6.
Start, Stop
00:03.1, 00:40.9
00:55.0, 01:38.2
01:40.0, 02:01.1
03:03.1, 04:30.3
04:50.0, 05:01.5
05:08.7, 05:22.0
05:40.1, 05:47.9
EDIT 2: My ultimate goal for this is to have the data in a format that I can use to chunk the observations into standardized time bins (Minute 1, Minute 2, etc.) to get a by-minute percentage of when the data is "Yes". Basically I want to get a summary of the distribution of states by minute, and since the data is binary, I can do this by looking at the "yes" state.
For the first 3 minutes (from 00:00.0 up until 03:00.0), the output would be something like this:
Minute time.yes.sec perc.time.yes
1 42.8 71.33
2 58.2 96.98
3 1.1 1.83
# *NOTE: Here, Minute 1 = [0, 60), Minute 2 = [60, 120), etc.; I'm not opposed
# to the reverse definitions though (Minute 1 = (0, 60], etc.).
I could alternatively look at the data as a cumulative distribution plot, with each successive time point updating the value of "total time yes". However, If I could get the data in the format of option 1, I would have the flexibility to look at the data either way.
An option, lightly edited from my version in the comments:
library(tidyverse)
library(lubridate)
df %>%
  mutate_all(funs(period_to_seconds(ms(.)))) %>% # convert each time to seconds
  rowwise() %>% # evaluate the following row-by-row
  # make a sequence from Start to Stop by 0.1, wrapped in a list
  mutate(instant = list(seq(Start, Stop, by = 0.1))) %>%
  unnest() %>% # expand list column
  # make a factor, cutting instants into 60-second bins
  mutate(minute = cut(instant, breaks = (0:6) * 60, labels = 1:6)) %>%
  group_by(minute) %>% # evaluate the following grouped by new factor column
  # for each group, count the rows, subtracting 1 for starting instants, and
  # dividing by 10 to convert from tenths of seconds to seconds
  summarise(elapsed = (n() - n_distinct(Start)) / 10,
            pct_elapsed = elapsed / 60 * 100) # convert to percent
## # A tibble: 6 × 3
## minute elapsed pct_elapsed
## <fctr> <dbl> <dbl>
## 1 1 42.8 71.333333
## 2 2 58.1 96.833333
## 3 3 1.0 1.666667
## 4 4 56.9 94.833333
## 5 5 40.2 67.000000
## 6 6 22.5 37.500000
Note the correction for counting starting instants is imperfect, as it will subtract for every starting instant, even if it is a continuation of a sequence from the previous minute. It could be calculated more thoroughly if precision matters.
A more precise but somewhat difficult route is to add stops and starts at the turn of each minute:
df %>%
  mutate_all(funs(period_to_seconds(ms(.)))) %>% # convert to seconds
  gather(var, val) %>% # gather to long form
  # construct and rbind data.frame of breaks at minute changes
  bind_rows(expand.grid(var = c('Start', 'Stop'),
                        val = seq(60, by = 60, length.out = floor(max(.$val) / 60)))) %>%
  arrange(val, desc(var)) %>% # sort
  mutate(index = rep(seq(n() / 2), each = 2)) %>% # make indices for spreading
  spread(var, val) %>% # spread back to wide form
  mutate(elapsed = Stop - Start) %>% # calculate elapsed time for each row
  # add and group by factor of which minute each falls in
  group_by(minute = cut(Stop, seq(0, by = 60, length.out = ceiling(max(Stop) / 60 + 1)),
                        labels = 1:6)) %>%
  summarise(elapsed = sum(elapsed), # calculate summaries
            pct_elapsed = elapsed / 60 * 100)
## # A tibble: 6 × 3
## minute elapsed pct_elapsed
## <fctr> <dbl> <dbl>
## 1 1 42.8 71.333333
## 2 2 58.2 97.000000
## 3 3 1.1 1.833333
## 4 4 56.9 94.833333
## 5 5 40.3 67.166667
## 6 6 22.6 37.666667
I did the following using your original data prior to the edit:
Start Stop
00:03.1 00:40.9
00:55.0 01:38.2
01:40.0 02:01.1
agg <- read.table(con<-file("clipboard"), header=T)
The ms function below takes the raw character input I read in from the clipboard and converts it into minutes and seconds with an appropriate class, so that it can be used for comparisons. The same is true for the seconds function; the only difference is that there I'm dealing with data measured just in seconds, not minutes and seconds.
agg$Start <- lubridate::ms(agg$Start)
agg$Stop <- lubridate::ms(agg$Stop)
option1 <- data.frame(time = lubridate::seconds(seq(.1, 122, .1)),
                      flag = as.character("N"), stringsAsFactors = F)

for (i in 1:nrow(agg)) {
  option1$flag[option1$time > agg$Start[i] & option1$time < agg$Stop[i]] <- "Y"
}
To verify that it worked, let's look at table():
table(option1$flag)
N Y
201 1019
option1$minute <- ifelse(option1$time < lubridate::seconds(60), 0, 1)
option1$minute[option1$time > lubridate::seconds(120)] <- 2
table(option1$flag, option1$minute)
0 1 2
N 172 19 10
Y 427 582 10
prop.table(table(option1$flag, option1$minute),2)
0 1 2
N 0.28714524 0.03161398 0.50000000
Y 0.71285476 0.96838602 0.50000000
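From here, getting the per-minute summary requested in the question (seconds of "Y" per minute and the percentage) is one more aggregation; a sketch, assuming option1 as built above with one row per tenth of a second:

library(dplyr)

option1 %>%
  group_by(Minute = minute + 1) %>%                # relabel so the first minute is 1
  summarise(time.yes.sec = sum(flag == "Y") / 10,  # each row represents a tenth of a second
            perc.time.yes = 100 * time.yes.sec / 60)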