I have a dataset of start times and end times over an entire year for a certain activity. I want to break up the day into 24 intervals, each 1 hour in length, and then calculate and plot the fraction of time the person spent doing the activity per hour. I already converted the times using lubridate's mdy_hm().
Suppose these sample data in dataframe df:
start_time end_time duration
8/14/15 23:36 8/15/15 5:38 359
8/15/15 14:50 8/15/15 15:25 35
8/15/15 22:43 8/16/15 2:41 236
8/16/15 3:12 8/16/15 6:16 181
8/16/15 16:52 8/16/15 17:58 66
8/16/15 23:21 8/16/15 23:47 26
8/17/15 0:04 8/17/15 2:02 118
8/17/15 8:31 8/17/15 9:45 74
8/17/15 11:06 8/17/15 13:46 159
How can I find the fraction of the activity per hour over the whole year? I will then plot the result. I have tried extracting the hour with hour(), using group_by() on the time variables, and using the mean function within summarize() on duration, but I'm unsure of the logic.
Thank you for any help.
The group_by(...) %>% summarise(...) pattern works best when your data is in 'tidy' format, where each row is one observation of the thing you want to aggregate over. In your case, an observation is a minute worked within a given hour and date. We can do this by generating those minute-by-minute observations as a list column, using tidyr::unnest() to expand them into a long data frame, and then doing the counting over that data frame:
library(dplyr)
library(lubridate)
library(tidyr)
library(ggplot2)
df <-
tibble(
start_time = c("8/14/15 23:36","8/15/15 14:50","8/15/15 22:43",
"8/16/15 3:12","8/16/15 16:52","8/16/15 23:21",
"8/17/15 0:04","8/17/15 8:31","8/17/15 11:06"),
end_time = c("8/15/15 5:38","8/15/15 15:25","8/16/15 2:41",
"8/16/15 6:16","8/16/15 17:58","8/16/15 23:47",
"8/17/15 2:02","8/17/15 9:45","8/17/15 13:46")
) %>%
mutate(across(c(start_time, end_time), mdy_hm))
worked_hours <- df %>%
# First, make a long df with a minute per row
group_by(start_time, end_time) %>%
mutate(mins = list(tibble(
min = seq(from = start_time, to = end_time - minutes(1), by = "1 min")
))) %>%
unnest(mins) %>%
ungroup() %>%
# Aggregate over the long df (count number of rows, i.e. minutes per date, hour)
select(min) %>%
mutate(date = date(min), hour = factor(hour(min), levels = 0:23)) %>%
group_by(date, hour) %>%
tally() %>%
# Percentage of each hour worked (n minutes out of 60)
mutate(prop = n / 60 * 100)
worked_hours %>%
# Use tidyr::complete to fill in unobserved values
complete(date, hour, fill = list(n = 0, prop = 0)) %>%
ggplot(aes(x = hour, y = prop)) +
geom_bar(stat = "identity") +
facet_wrap(~ date, ncol = 1)
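The faceted plot above gives one bar chart per date. If you instead want a single fraction per hour aggregated over the whole period (as the original question asks), the same per-minute expansion can be counted by hour only. A minimal sketch using two of the sample intervals; the names mins, n_days, and hourly are illustrative, and the map2() expansion is an equivalent variant of the list-column trick above:

```r
library(dplyr)
library(tidyr)
library(purrr)
library(lubridate)
library(tibble)

df <- tibble(
  start_time = mdy_hm(c("8/14/15 23:36", "8/15/15 14:50")),
  end_time   = mdy_hm(c("8/15/15 5:38",  "8/15/15 15:25"))
)

# Expand each interval into one row per minute
mins <- df %>%
  mutate(mins = map2(start_time, end_time,
                     ~ tibble(min = seq(.x, .y - minutes(1), by = "1 min")))) %>%
  unnest(mins)

# Number of days spanned, so the fraction is averaged over the whole period
n_days <- as.numeric(max(date(mins$min)) - min(date(mins$min))) + 1

# Fraction of each hour spent on the activity, across all days
hourly <- mins %>%
  count(hour = hour(min)) %>%
  mutate(frac = n / (60 * n_days))
```

Multiply frac by 100 if you prefer percentages, then plot with geom_col().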
I have a growth rate, calculated from individual measurements 4 times a year, that I am trying to assign to a different time frame called Year2 (August 1st of year 1 to July 31st of year 2, see attached photo).
My Dataframe:
ID  Date        Year  Year2  Lag         Lapse    Growth       Daily_growth
1   2009-07-30  2009  2009   NA          NA       35.004       NA
1   2009-10-29  2009  2010   2009-07-30  91 days  31.585       0.347
1   2010-01-27  2010  2010   2009-10-29  90 days  63.769       0.709
1   2010-04-27  2010  2010   2010-01-27  90 days  28.329       0.315
1   2010-07-29  2010  2010   2010-04-27  93 days  32.068       0.345
1   2010-11-02  2010  2011   2010-07-29  96 days  128.1617320  1.335
I took the growth rate as follows:
Growth_df <- Growth_df %>%
group_by(ID) %>% # Individuals we measured
mutate(Lag = lag(Date), #Last date measured
Lapse = round(difftime(Date, Lag, units = "days")), #days between Dates monitored
Daily_growth = as.numeric(Growth) / as.numeric(Lapse))
What I am trying to do is assign the daily growth rate between each measurement, matching to the Year2 timeframe:
Growth_df <- Growth_df %>%
mutate(Year = as.numeric(Year),
Year2_growth = ifelse(Year == Year2, Daily_growth*Lapse, 0)) %>%
group_by(Year2) %>%
mutate(Year2_growth = sum(Year2_growth, na.rm = TRUE))
My problem is that I do not know how to handle measurements whose dates straddle two Year2 periods (something in place of the 0 in the ifelse statement). I need some way to calculate how many days are left from the new start date (August 1st) to the most recent measurement, multiply that by the growth rate, and likewise cut the end of the period early (July 31st).
I tried making a second dataframe with nothing but years and days, then assigning the growth rate by comparing the two dataframes, but I have been stuck on the same issue: partitioning the time frame.
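One possible building block for that partitioning (not from the original post; clip_growth and the window dates are assumed names/values) is to intersect each measurement interval with the Year2 window using pmax()/pmin(), then multiply the overlapping days by the daily rate:

```r
library(lubridate)

# Hypothetical helper: clip the interval [lag_date, date] to the
# Year2 window [win_start, win_end] and return the growth accrued
# inside the window (zero if there is no overlap)
clip_growth <- function(lag_date, date, daily_growth, win_start, win_end) {
  overlap_start <- pmax(lag_date, win_start)
  overlap_end   <- pmin(date, win_end)
  days <- pmax(as.numeric(overlap_end - overlap_start), 0)
  days * daily_growth
}

# Last row of the sample data: 2010-07-29 .. 2010-11-02 at 1.335/day,
# clipped to the Year2 = 2011 window that starts 2010-08-01
res <- clip_growth(ymd("2010-07-29"), ymd("2010-11-02"), 1.335,
                   ymd("2010-08-01"), ymd("2011-07-31"))
# 93 days fall inside the window
```

Applied row-wise with the window matched to each row's Year2, something like this could stand in for the 0 branch of the ifelse, crediting the out-of-window days to the neighbouring Year2 instead.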
I am sure there's a much, much, muuuuch more efficient way to deal with this, but this is how I sorted it out:
1. Make my timeframes
2. Create a function for the ranges I wanted
3. Make a dataframe for both the start and the end ranges
4. Join them together
5. Marvel at my lack of R skills.
Start_dates <- seq(ymd('2008-08-01'), ymd('2021-08-01'), by = '12 months')
End_dates <- seq(ymd('2009-07-31'),ymd('2022-07-31'), by = '12 months')
Year2_dates <- data.frame(Start_dates, End_dates)
Year2_dates <- Year2_dates %>%
mutate(Year2 = year(Start_dates) + 1)
Vegetation <- Vegetation %>%
left_join(Year2_dates)
Range_finder <- function(x,y){
as.numeric(difftime(x, y, unit = "days"))
}
Range_start <- Vegetation %>%
group_by(Year2, ID) %>%
filter(row_number()==1) %>%
filter(Year != Year2) #had to get rid of first year samples as they were the top row but didn't have a change in year
Range_start <- Range_start %>%
mutate(Number_days_start = Range_finder(Date, Start_dates),
Border_range = Number_days_start * Daily_veg) %>%
ungroup() %>%
select(ID, Year2, Date, Border_range)
Range_end <- Vegetation %>%
group_by(Year2, ID) %>%
filter(row_number()==n(),
Year2 != 2022)
Range_end <- Range_end %>%
mutate(Number_days_end = Range_finder(End_dates, Date),
Border_range = Number_days_end * Daily_veg) %>%
ungroup() %>%
select(ID, Year2, Date, Border_range)
Ranges <- full_join(Range_start, Range_end)
Vegetation <- Vegetation %>%
left_join(Ranges)
I am really new at R and this is probably a really basic question: Let's say I have a dataset with a column that includes date values of the format ("y-m-d H:M:S") as a Factor value.
How do I split the one column into 5?
Given example:
x <- as.factor(c("2018-01-03 12:34:32.92382", "2018-01-03 12:50:40.00040"))
x <- as_datetime(x) # convert to a date-time (POSIXct), not a Date
x <- x %>%
dplyr::mutate(year = lubridate::year(x),
month = lubridate::month(x),
day = lubridate::day(x),
hour = lubridate::hour(x),
minute = lubridate::minute(x),
second = lubridate::second(x))
I get an error saying objects with class c('POSIXct', 'POSIXt') cannot be used.
Change it into a data frame first; then the mutate part will work:
library(dplyr)
x %>%
as.data.frame() %>%
rename(x = '.') %>%
dplyr::mutate(year = lubridate::year(x),
month = lubridate::month(x),
day = lubridate::day(x),
hour = lubridate::hour(x),
minute = lubridate::minute(x),
second = lubridate::second(x))
x year month day hour minute second
1 2018-01-03 12:34:32 2018 1 3 12 34 32.92382
2 2018-01-03 12:50:40 2018 1 3 12 50 40.00040
You could also make your mutate a little cleaner by utilizing the power of across(). Note that funs() is deprecated and does not work inside across(); pass a named list of functions instead:
library(dplyr)
library(lubridate)
x %>%
data.frame(date = .) %>%
mutate(across(date,
              list(year = year, month = month, day = day,
                   hour = hour, minute = minute, second = second),
              .names = "{.fn}"))
I'm trying to find duplicate records in a dataframe of behavioural data in R. I need to find rows that have the same values based on more than one column and that have been recorded within the same hour.
For example: Rows 1-2 and 3-4 below have the same values in the columns Date, Observer and FocalID and have been recorded within the same hour.
N Date Time Observer FocalID
1 20180520 07:05:00 VR JK
2 20180520 07:50:00 VR JK
3 20180521 07:50:00 JD CJD
4 20180521 08:25:00 JD CJD
I have tried the following code, but it doesn't work. One reason is that find_duplicates() (from the hablar package) does not accept an interval, only data frame columns.
Time <- as.POSIXct (df$Time, format="%H:%M:%S")
span60 <- (Time - minutes(60)) %--% (Time + minutes(60))
df %>% find_duplicates (Date, Observer, FocalID, Time %within% span60)
Any kind of help would be very welcome! Thank you!
Not sure what your expected output is. Here is one way, which gives a unique ID to each set of "duplicates".
library(dplyr)
df %>%
tidyr::unite(DateTime, Date, Time, sep = " ") %>%
mutate(DateTime = lubridate::ymd_hms(DateTime)) %>%
group_by(Observer, FocalID) %>%
mutate(grp = floor(difftime(DateTime, first(DateTime), units = 'hour'))) %>%
group_by(grp, .add = TRUE) %>%
mutate(ID = cur_group_id()) %>%
ungroup() %>%
select(-grp)
# A tibble: 4 x 5
# N DateTime Observer FocalID ID
# <int> <dttm> <chr> <chr> <int>
#1 1 2018-05-20 07:05:00 VR JK 2
#2 2 2018-05-20 07:50:00 VR JK 2
#3 3 2018-05-21 07:50:00 JD CJD 1
#4 4 2018-05-21 08:25:00 JD CJD 1
All rows with the same ID can be considered part of one group.
data
df <- structure(list(N = 1:4, Date = c(20180520L, 20180520L, 20180521L,
20180521L), Time = c("07:05:00", "07:50:00", "07:50:00", "08:25:00"
), Observer = c("VR", "VR", "JD", "JD"), FocalID = c("JK", "JK",
"CJD", "CJD")), class = "data.frame", row.names = c(NA, -4L))
I understand your problem as trying to find any rows that have duplicate values in Observer and FocalID, while being within 60 minutes of each other.
The following solution works across date boundaries:
library(dplyr)
library(purrr)
library(lubridate)
df2 <- df %>%
group_by(Observer, FocalID) %>%
mutate(
dt = as_datetime(paste(Date, Time), tz = "UTC"),
dt_frame = interval(dt - minutes(60), dt + minutes(60), "UTC"),
which_duplicates = map(dt_frame, ~ N[which(dt %within% .x)]),
has_duplicates = map_lgl(which_duplicates, ~length(.x) > 1)
)
If you don't want this to group observations across midnight and keep days separate, just add Date to the group_by statement.
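A sketch of that day-separate variant, rebuilding the example data inline; the substantive change is adding Date to group_by(), and for brevity the within-60-minutes check is done here with a plain difftime() comparison rather than the interval construction:

```r
library(dplyr)
library(purrr)
library(lubridate)

df <- data.frame(
  N = 1:4,
  Date = c(20180520L, 20180520L, 20180521L, 20180521L),
  Time = c("07:05:00", "07:50:00", "07:50:00", "08:25:00"),
  Observer = c("VR", "VR", "JD", "JD"),
  FocalID = c("JK", "JK", "CJD", "CJD")
)

df2 <- df %>%
  group_by(Date, Observer, FocalID) %>%  # Date added: days stay separate
  mutate(
    dt = ymd_hms(paste(Date, Time), tz = "UTC"),
    # for each row, which N's in the group fall within +/- 60 minutes of it
    which_duplicates = map(seq_along(dt),
                           ~ N[abs(difftime(dt, dt[.x], units = "mins")) <= 60]),
    has_duplicates = map_lgl(which_duplicates, ~ length(.x) > 1)
  ) %>%
  ungroup()
```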
I need to assign a time interval to each row based on the "cnt_rows" column, grouped by "Name". I.e. if the count is around 96, the time interval is 15 minutes, so if the count is 94 the timestamps should stop at 11:15 PM (based on the number of rows), and if there are exactly 96 they should end at 11:45 PM each day. Same for the 5-minute interval. The interval should not exceed the day.
cnt_rows = c("94","94",".",".","94","286","286",".",".",".","286","96","96",".",".","96")
Name = c("Alan","Alan",".",".","Alan","Steve","Steve",".",".",".","Steve","Mike","Mike",".",".","Mike")
Values = c("10","10",".",".","45","91","35",".",".",".","46","34","5",".",".","34")
Input Table
df = data.frame(cnt_rows,Name,Values)
Output Table
dt = c("2019-12-01 00:00:00","2019-12-01 00:15:00",".",".","2019-12-01 23:15:00","2019-12-01 00:00:00","2019-12-01 00:05:00",".",".",".","2019-12-01 23:45:00","2019-12-01 00:00:00","2019-12-01 00:15:00",".",".","2019-12-01 23:45:00")
df_out = data.frame(cnt_rows,Name,Values,dt)
Thanks in advance.
Maybe you can try:
library(dplyr)
date <- as.POSIXct('2019-12-01')
df %>%
mutate(breaks = ifelse(cnt_rows %in% c(94, 96), '15 min', '5 min')) %>%
group_by(Name) %>%
mutate(dt = seq(date, by = first(breaks), length.out = n()))
I have a dataset named transaction having 350241 observations.
Sample of the data:
transaction_id timestamp product_code
19241 2001-01-11 15:48:00 1
29247 2001-04-08 11:25:00 9
34567 2001-03-10 16:24:00 17
48790 2001-09-23 13:33:00 45
56789 2001-11-01 11:47:00 52
QUESTION
How many transactions were carried out during the 18:00 hour?
How can I find this using R?
I tried with table(), but the dataset is big, so it doesn't show all the frequency counts.
One approach would be to create an hour variable using lubridate::hour(timestamp):
library(tidyverse)
library(lubridate)
df %>%
count(hour = hour(timestamp))
You can then filter for just hour 18:
df %>%
count(hour = hour(timestamp)) %>%
filter(hour == 18)
A more verbose way of accomplishing the same thing:
df %>%
mutate(hour = hour(timestamp)) %>%
group_by(hour) %>%
tally() %>%
filter(hour == 18)
In base R, convert 'timestamp' to POSIXlt, extract the hour, compare it to 18 with == to get a logical vector, and sum the TRUE elements:
sum(as.POSIXlt(df1$timestamp)$hour == 18)