I have a data frame like this:
Date
20130101
20130102
20130103
20130104
How can I split the Date column into separate columns?
I have already tried the following, but it is not working:
library(data.table)
setDT(DF)[, tstrsplit(DATE, "/|\\s", type.convert = TRUE)]
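Two things go wrong here: the column is named Date, not DATE, and the pattern "/|\\s" splits on slashes or whitespace, neither of which occurs in values like 20130101, so there is nothing to split on. A data.table sketch that splits by position instead (assuming the DF from the Note below; the same arithmetic as solution 3):
library(data.table)
# split the yyyymmdd number by position rather than by delimiter
setDT(DF)[, c("year", "month", "day") := .(Date %/% 10000L,
                                           Date %% 10000L %/% 100L,
                                           Date %% 100L)][]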
Here are some solutions that do not require any packages. They all produce a data.frame with a "Date" class column followed by numeric columns for year, month and day. (The input used in reproducible form is given in the Note at the end.)
1) POSIXlt First convert the Date column to "Date" class giving date and then to an unclassed "POSIXlt" object giving lt. Now pick off the elements of lt appropriately:
date <- as.Date(as.character(DF$Date), format = "%Y%m%d")
lt <- unclass(as.POSIXlt(date))
with(lt, data.frame(Date = date, year = year + 1900, month = mon + 1, day = mday))
giving:
Date year month day
1 2013-01-01 2013 1 1
2 2013-01-02 2013 1 2
3 2013-01-03 2013 1 3
4 2013-01-04 2013 1 4
2) format
date <- as.Date(as.character(DF$Date), format = "%Y%m%d")
data.frame(date,
           year = as.numeric(format(date, "%Y")),
           month = as.numeric(format(date, "%m")),
           day = as.numeric(format(date, "%d")))
giving:
date year month day
1 2013-01-01 2013 1 1
2 2013-01-02 2013 1 2
3 2013-01-03 2013 1 3
4 2013-01-04 2013 1 4
3) math
with(DF, data.frame(date = as.Date(as.character(Date), format = "%Y%m%d"),
                    year = Date %/% 10000,
                    month = Date %% 10000 %/% 100,
                    day = Date %% 100))
giving:
date year month day
1 2013-01-01 2013 1 1
2 2013-01-02 2013 1 2
3 2013-01-03 2013 1 3
4 2013-01-04 2013 1 4
4) read.fwf
data.frame(date = as.Date(as.character(DF$Date), format = "%Y%m%d"),
           read.fwf(textConnection(as.character(DF$Date)), c(4, 2, 2),
                    col.names = c("year", "month", "day")))
giving:
date year month day
1 2013-01-01 2013 1 1
2 2013-01-02 2013 1 2
3 2013-01-03 2013 1 3
4 2013-01-04 2013 1 4
5) sub/read.table
date.ch <- sub("(....)(..)(..)", "\\1-\\2-\\3", DF$Date)
data.frame(date = as.Date(date.ch),
           read.table(text = date.ch, col.names = c("year", "month", "day"), sep = "-"))
giving:
date year month day
1 2013-01-01 2013 1 1
2 2013-01-02 2013 1 2
3 2013-01-03 2013 1 3
4 2013-01-04 2013 1 4
Note: The input used, "DF", in reproducible form is:
DF <- data.frame(Date = 20130101:20130104)
If you are not set on using data.table, you could use the following approach based on substr:
x <- data.frame(Date = c("20130101", "20130102", "20130103", "20130104"))
y <- data.frame(Year = substr(x$Date, 1, 4),
                Month = substr(x$Date, 5, 6),
                Day = substr(x$Date, 7, 8))
This assumes the data is in the same format throughout the vector.
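To guard that assumption, a quick sanity check (a sketch; it stops if any value is not exactly eight digits):
stopifnot(all(grepl("^[0-9]{8}$", x$Date)))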
You can also do this with lubridate:
library(dplyr)
library(lubridate)
data <-
  tibble(Date = c(20130101, 20130102, 20130103, 20130104)) %>%
  mutate(date = Date %>% as.character() %>% ymd(),
         year = year(date),
         month = month(date),
         day = day(date))
I want to replace .00 with -Q1, .25 with -Q2, .50 with -Q3, and .75 with -Q4 as given below. However, my code is not working as expected. Any hints?
library(tidyverse)
dt1 <-
tibble(Date = c(2015.00, 2015.25, 2015.50, 2015.75))
dt1
# A tibble: 4 x 1
Date
<dbl>
1 2015
2 2015.
3 2016.
4 2016.
dt1 %>%
  pull(Date)
[1] 2015.00 2015.25 2015.50 2015.75
dt1 %>%
  mutate(Date1 = str_replace_all(string = Date, pattern = c(".00" = "-Q1", ".25" = "-Q2", ".50" = "-Q3", ".75" = "-Q4")))
# A tidytable: 4 × 2
Date Date1
<dbl> <chr>
1 2015 2015
2 2015. 2015-Q2
3 2016. 2015.5
4 2016. 2015-Q4
There is also a zoo function for that:
library(tidyverse)
library(zoo)
dt1 <-
tibble(Date = c(2015.00, 2015.25, 2015.50, 2015.75))
dt1 %>%
  mutate(Date1 = format.yearqtr(Date, format = "%Y.Q%q"))
# Date Date1
# <dbl> <chr>
# 1 2015 2015.Q1
# 2 2015. 2015.Q2
# 3 2016. 2015.Q3
# 4 2016. 2015.Q4
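To get exactly the requested -Q separator, a small variation (a sketch) is to convert to a real yearqtr object and pick the format accordingly:
library(zoo)
as.yearqtr(2015.25)                     # "2015 Q2"
format(as.yearqtr(dt1$Date), "%Y-Q%q")  # "2015-Q1" "2015-Q2" "2015-Q3" "2015-Q4"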
You may also use integer division (%/%) and the modulo operator (%%) together:
paste0(dt1$Date %/% 1, '-Q', (dt1$Date %% 1) * 4 + 1)
[1] "2015-Q1" "2015-Q2" "2015-Q3" "2015-Q4"
Thus, using it in piped syntax:
dt1 %>%
  mutate(date1 = paste0(Date %/% 1, '-Q', (Date %% 1) * 4 + 1))
# A tibble: 4 x 2
Date date1
<dbl> <chr>
1 2015 2015-Q1
2 2015. 2015-Q2
3 2016. 2015-Q3
4 2016. 2015-Q4
Here is a quick fix:
dt1 %>%
  mutate(Date1 = str_replace_all(format(Date, nsmall = 2),
                                 pattern = c(".00" = "-Q1", ".25" = "-Q2", ".50" = "-Q3", ".75" = "-Q4")))
The problem is that 2015.00 is first coerced to character, at which point it becomes "2015", so the string replacement fails.
You can see this by trying as.character(2015.00).
However, this is easily fixed by using format to format the number first. Note also that str_replace_all treats the patterns as regular expressions, so strictly the dots should be escaped ("\\.00", "\\.25", ...); the unescaped version happens to work here but can misfire on other inputs.
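A minimal demonstration of the coercion problem and the fix:
as.character(2015.00)        # "2015"    - the trailing ".00" is dropped
format(2015.00, nsmall = 2)  # "2015.00" - keeps two decimal places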
vec <- c("00" = "-Q1", "25" = "-Q2", "50" = "-Q3", "75" = "-Q4")
dt1 %>%
  mutate(new = paste0(Date %/% 1, vec[sprintf("%02d", Date %% 1 * 100)]))
Date new
<dbl> <chr>
1 2015 2015-Q1
2 2015. 2015-Q2
3 2016. 2015-Q3
4 2016. 2015-Q4
library(tidyverse)
dt1 <-
as.character(c(2015.00, 2015.25, 2015.50, 2015.75))
dt1 <- if_else(str_detect(dt1, '\\.', negate = TRUE),
               paste0(dt1, '.00'),  # if condition TRUE
               dt1)                 # if condition FALSE
value_before <- c("\\.00", "\\.25", "\\.5", "\\.75")
value_after <- c("-Q1", "-Q2", "-Q3", "-Q4")
tibble(Date = str_replace(dt1, value_before, value_after))
#> # A tibble: 4 x 1
#> Date
#> <chr>
#> 1 2015-Q1
#> 2 2015-Q2
#> 3 2015-Q3
#> 4 2015-Q4
Created on 2021-06-01 by the reprex package (v2.0.0)
A solution with dplyr and tidyr:
Prepare the decimals for further processing with format
separate into year and quarter, then mutate the quarter to -Q1 ... -Q4
unite the two parts
library(tidyr)
library(dplyr)
dt1 %>%
  mutate(Date = format(round(Date, digits = 2), nsmall = 2)) %>%
  separate(Date, into = c("Year", "Quarter"), remove = FALSE) %>%
  mutate(Quarter = recode(Quarter, "00" = "-Q1", "25" = "-Q2", "50" = "-Q3", "75" = "-Q4")) %>%
  unite("new", Year:Quarter, sep = "")
Output:
Date new
<chr> <chr>
1 2015.00 2015-Q1
2 2015.25 2015-Q2
3 2015.50 2015-Q3
4 2015.75 2015-Q4
I have a dataframe containing daily prices of a stock exchange with corresponding dates for several years. These dates are trading dates and thus exclude weekends and holidays. Ex:
df$date <- as.Date(c("2017-03-30", "2017-03-31", "2017-04-03", "2017-04-04"))
I have used lubridate to extract a column containing the month of each date. What I struggle with is creating a column that, for each month of every year, gives the number of the trading day within that month. I.e., from the example, a counter that starts at 1 for 2017-04-03 (the first observation of the month, even though it is the third calendar day) and ends at the last observation of the month. The column would look like this:
df$DayofMonth <- c(22, 23, 1, 2)
and not
df$DayofMonth <- c(30, 31, 3, 4)
Is there anybody who can help me?
Maybe this helps:
library(data.table)
library(stringr)
df <- setDT(df)
df[, YearMonth := str_sub(date, 1, 7)]
df[, DayofMonth := seq(.N), by = YearMonth]
This gives you a column called YearMonth with values like '2020-01'.
Then, for each group (month), each date gets an index, which in your case corresponds to the trading day.
As you can see, this yields 1 for the date '2017-04-03', since it is the first trading day of that month. This works if your df is sorted from earliest to latest date.
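If the ordering is not guaranteed, a small safeguard (assuming the column is named date as in the question) is to sort first:
setorder(df, date)  # sort ascending before computing the counter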
Here is a way using lubridate to extract the date components, combined with dplyr.
library(dplyr)
library(lubridate)
df <- data.frame(date = as.Date(c("2017-03-30", "2017-03-31", "2017-04-03", "2017-04-04")))
df %>%
  mutate(month = month(date),
         year = year(date),
         day = day(date)) %>%
  group_by(year, month) %>%
  mutate(DayofMonth = row_number())
# A tibble: 4 x 5
# Groups: year, month [2]
date month year day DayofMonth
<date> <dbl> <dbl> <int> <int>
1 2017-03-30 3 2017 30 1
2 2017-03-31 3 2017 31 2
3 2017-04-03 4 2017 3 1
4 2017-04-04 4 2017 4 2
You can try the following:
For each date find out the first day of that month.
Count how many working days are present between first_day_of_month and current date.
library(dplyr)
library(lubridate)
df %>%
  mutate(first_day_of_month = floor_date(date, 'month'),
         day_of_month = purrr::map2_dbl(first_day_of_month, date,
                                        ~ sum(!weekdays(seq(.x, .y, by = 'day')) %in% c('Saturday', 'Sunday'))))
# date first_day_of_month day_of_month
#1 2017-03-30 2017-03-01 22
#2 2017-03-31 2017-03-01 23
#3 2017-04-03 2017-04-01 1
#4 2017-04-04 2017-04-01 2
You can drop the first_day_of_month column if not needed.
data
df <- data.frame(date = as.Date(c("2017-03-30", "2017-03-31",
                                  "2017-04-03", "2017-04-04")))
Problem:
I have records with a start and end date for an intervention and I want to merge the rows according to the following rule:
For each ID, any intervention that begins within one year of the last intervention ending, merge the rows so that the start_date is the earliest start date of the two rows, and the end_date is the latest end_date of the two rows.
I also want to keep track of intervention IDs if they are merged.
There can be five scenarios:
Two rows have the same start date, but different end dates.
Start date....End date
Start date.........End date
The period between row 2's start and end date lies within the period of row 1's start and end date.
Start date...................End date
.......Start date...End date
Row 2's intervention starts within Row 1's intervention period but ends later.
Start date.....End date
.....Start date.............End date
Row 2 starts within one year of the end of Row 1.
Start date....End date
......................|....<= 1 year....|Start date...End date
Row 2 starts over one year after the end of Row 1.
Start date...End date
.....................|........ > 1 year..........|Start date...End date
I want to merge rows in cases 1 to 4 but not 5.
Data:
library(data.table)
sample_data <- data.table(id = c(rep(11, 3), rep(21, 4)),
                          start_date = as.Date(c("2013-01-01", "2013-01-01", "2013-02-01", "2013-01-01", "2013-02-01", "2013-12-01", "2015-06-01")),
                          end_date = as.Date(c("2013-06-01", "2013-07-01", "2013-05-01", "2013-07-01", "2013-09-01", "2014-01-01", "2015-12-01")),
                          intervention_id = as.character(1:7),
                          all_ids = as.character(1:7))
> sample_data
id start_date end_date intervention_id all_ids
1: 11 2013-01-01 2013-06-01 1 1
2: 11 2013-01-01 2013-07-01 2 2
3: 11 2013-02-01 2013-05-01 3 3
4: 21 2013-01-01 2013-07-01 4 4
5: 21 2013-02-01 2013-09-01 5 5
6: 21 2013-12-01 2014-01-01 6 6
7: 21 2015-06-01 2015-12-01 7 7
The final result should look like:
> merged_data
id start_date end_date intervention_id all_ids
1: 11 2013-01-01 2013-07-01 1 1, 2, 3
2: 21 2013-01-01 2014-01-01 4 4, 5, 6
3: 21 2015-06-01 2015-12-01 7 7
I'm not sure if the all_ids column is the best way to keep track of the intervention_id's so open to ideas for that. (The intervention_id's don't need to be in order in the all_ids column.)
It doesn't matter what the value of the intervention_id column is where rows have been merged.
What I tried:
I started off by writing a function to deal with only those cases where the start date is the same. It's a very non-R, non-data.table way of doing it and therefore very inefficient.
mergestart <- function(unmerged) {
  n <- nrow(unmerged)
  mini_merged <- data.table(id = double(n),
                            start_date = as.Date(NA),
                            end_date = as.Date(NA),
                            intervention_id = character(n),
                            all_ids = character(n))
  merge_a <- function(unmerged, un_i, merged, m_i, no_of_records) {
    merged[m_i] <- unmerged[un_i]
    un_i <- un_i + 1
    while (un_i <= no_of_records) {
      if (merged[m_i]$start_date == unmerged[un_i]$start_date) {
        merged[m_i]$end_date <- max(merged[m_i]$end_date, unmerged[un_i]$end_date)
        merged[m_i]$all_ids <- paste0(merged[m_i]$all_ids, ",", unmerged[un_i]$intervention_id)
        un_i <- un_i + 1
      } else {
        m_i <- m_i + 1
        merged[m_i] <- unmerged[un_i]
        un_i <- un_i + 1
        merge_a(unmerged, un_i, merged, m_i, (no_of_records - un_i))
      }
    }
    return(merged)
  }
  mini_merged <- merge_a(unmerged, 1, mini_merged, 1, n)
  return(copy(mini_merged[id != 0]))
}
Using this function on just one id gives:
> mergestart(sample_data[id == 11])
id start_date end_date intervention_id all_ids
1: 11 2013-01-01 2013-07-01 1 1,2
2: 11 2013-02-01 2013-05-01 3 3
To use the function on the whole dataset:
n <- nrow(sample_data)
all_merged <- data.table(id = double(n),
                         start_date = as.Date(NA),
                         end_date = as.Date(NA),
                         intervention_id = character(n),
                         all_ids = character(n))
start_i <- 1
for (i in unique(sample_data$id)) {
  id_merged <- mergestart(sample_data[id == i])
  end_i <- start_i + nrow(id_merged) - 1
  all_merged[start_i:end_i] <- copy(id_merged)
  start_i <- end_i
}
all_merged <- all_merged[id != 0]
> all_merged
id start_date end_date intervention_id all_ids
1: 11 2013-01-01 2013-07-01 1 1,2
2: 21 2013-01-01 2013-07-01 4 4
3: 21 2013-02-01 2013-09-01 5 5
4: 21 2013-12-01 2014-01-01 6 6
5: 21 2015-06-01 2015-12-01 7 7
I also had a look at rolling joins but still can't see how to use them in this situation.
This answer https://stackoverflow.com/a/48747399/6170115 looked promising but I don't know how to integrate all the other conditions and track the intervention IDs with this method.
Can anyone point me in the right direction?
There are related questions How to flatten / merge overlapping time periods and Consolidate rows based on date ranges but none of them has the additional requirements posed by the OP.
library(data.table)
# ensure rows are ordered
setorder(sample_data, id, start_date, end_date)
# find periods
sample_data[, period := {
  tmp <- as.integer(start_date)
  cumsum(tmp > shift(cummax(tmp + 365L), type = "lag", fill = 0L))
}, by = id][]
id start_date end_date intervention_id all_ids period
1: 11 2013-01-01 2013-06-01 1 1 1
2: 11 2013-01-01 2013-07-01 2 2 1
3: 11 2013-02-01 2013-05-01 3 3 1
4: 21 2013-01-01 2013-07-01 4 4 1
5: 21 2013-02-01 2013-09-01 5 5 1
6: 21 2013-12-01 2014-01-01 6 6 1
7: 21 2015-06-01 2015-12-01 7 7 2
For the sake of simplicity, it is assumed that one year has 365 days which ignores leap years with 366 days. If leap years are to be considered, a more sophisticated date arithmetic is required.
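For example, lubridate's %m+% operator does calendar-aware arithmetic (this is also what version v_3 in the benchmark below uses):
library(lubridate)
as.Date("2015-03-01") + 365          # "2016-02-29" - crosses a leap day
as.Date("2015-03-01") %m+% years(1)  # "2016-03-01"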
Unfortunately, cummax() has no method for arguments of class Date or IDate (data.table's integer version). Therefore, the coercion from Date to integer is required.
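The coercion round trip looks like this (a small illustration):
d <- as.Date(c("2013-01-01", "2013-07-01", "2013-05-01"))
as.Date(cummax(as.integer(d)), origin = "1970-01-01")
# "2013-01-01" "2013-07-01" "2013-07-01"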
# aggregate
sample_data[, .(start_date = start_date[1L],
                end_date = max(end_date),
                intervention_id = intervention_id[1L],
                all_ids = toString(intervention_id)),
            by = .(id, period)]
id period start_date end_date intervention_id all_ids
1: 11 1 2013-01-01 2013-07-01 1 1, 2, 3
2: 21 1 2013-01-01 2014-01-01 4 4, 5, 6
3: 21 2 2015-06-01 2015-12-01 7 7
Edit: Correction
I just noted that I had misinterpreted OP's requirements. The OP has requested (emphasis mine):
For each ID, any intervention that begins within one year of the last
intervention ending, merge the rows so that the start_date is the
earliest start date of the two rows, and the end_date is the latest
end_date of the two rows.
The solution above looks for gaps of one year in the sequence of start_date but not between each start_date and the preceding end_date as requested. The corrected version is:
library(data.table)
# ensure rows are ordered
setorder(sample_data, id, start_date, end_date)
# find periods
sample_data[, period := cumsum(
  as.integer(start_date) > shift(
    cummax(as.integer(end_date) + 365L), type = "lag", fill = 0L))
  , by = id][]
# aggregate
sample_data[, .(start_date = start_date[1L],
                end_date = max(end_date),
                intervention_id = intervention_id[1L],
                all_ids = toString(intervention_id)),
            by = .(id, period)]
id period start_date end_date intervention_id all_ids
1: 11 1 2013-01-01 2013-07-01 1 1, 2, 3
2: 21 1 2013-01-01 2014-01-01 4 4, 5, 6
3: 21 2 2015-06-01 2015-12-01 7 7
The result for the given sample dataset is identical for both versions, which is why the error slipped through unrecognized.
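A constructed sketch of a case where the two versions disagree: a long first intervention followed by one that starts within a year of its end but more than a year after its start:
counter_example <- data.table(
  id = 1L,
  start_date = as.Date(c("2013-01-01", "2014-08-01")),
  end_date = as.Date(c("2014-06-01", "2014-12-01")),
  intervention_id = c("1", "2"),
  all_ids = c("1", "2"))
# start-based version: 2014-08-01 > 2013-01-01 + 365 days, so it wrongly opens period 2
# end-based version:   2014-08-01 <= 2014-06-01 + 365 days, so both rows stay in period 1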
Benchmark
The OP has mentioned in a comment that using lubridate's date arithmetic dramatically increased run times.
According to my benchmark below, the penalty of using end_date %m+% years(1) is not that large. I have benchmarked three versions of the code:
v_1 is the corrected version from above.
v_2 pulls the type conversion and the date arithmetic out of the grouping part and creates two helper columns in advance.
v_3 is like v_2 but uses end_date %m+% years(1).
The benchmark is repeated for different problem sizes, i.e., total number of rows. Also, the number of different ids is varied as grouping may have an effect on performance. According to the OP, his full dataset of 500 k rows has 250 k unique ids which corresponds to an id_share of 0.5 (50%). In the benchmark id_shares of 0.5, 0.2, and 0.01 (50%, 20%, 1%) are simulated.
As sample_data is modified, each run starts with a fresh copy.
library(bench)
library(lubridate)   # for %m+% and years() in v_3
library(magrittr)
bm <- press(
  id_share = c(0.5, 0.2, 0.01),
  n_row = c(1000L, 10000L, 1e5L),
  {
    n_id <- max(1L, as.integer(n_row * id_share))
    print(sprintf("Number of ids: %i", n_id))
    set.seed(123L)
    sample_data_0 <- lapply(seq(n_id), function(.id) data.table(
      start_date = as.IDate("2000-01-01") + cumsum(sample(0:730, n_row / n_id, TRUE))
    )) %>%
      rbindlist(idcol = "id") %>%
      .[, end_date := start_date + sample(30:360, n_row, TRUE)] %>%
      .[, intervention_id := as.character(.I)]
    mark(
      v_1 = {
        sample_data <- copy(sample_data_0)
        setorder(sample_data, id, start_date, end_date)
        sample_data[, period := cumsum(
          as.integer(start_date) > shift(
            cummax(as.integer(end_date) + 365L), type = "lag", fill = 0L))
          , by = id]
        sample_data[, .(start_date = start_date[1L],
                        end_date = max(end_date),
                        intervention_id = intervention_id[1L],
                        all_ids = toString(intervention_id)),
                    by = .(id, period)]
      },
      v_2 = {
        sample_data <- copy(sample_data_0)
        setorder(sample_data, id, start_date, end_date)
        sample_data[, `:=`(start = as.integer(start_date),
                           end = as.integer(end_date) + 365)]
        sample_data[, period := cumsum(start > shift(cummax(end), type = "lag", fill = 0L))
                    , by = id]
        sample_data[, .(start_date = start_date[1L],
                        end_date = max(end_date),
                        intervention_id = intervention_id[1L],
                        all_ids = toString(intervention_id)),
                    by = .(id, period)]
      },
      v_3 = {
        sample_data <- copy(sample_data_0)
        setorder(sample_data, id, start_date, end_date)
        sample_data[, `:=`(start = as.integer(start_date),
                           end = as.integer(end_date %m+% years(1)))]
        sample_data[, period := cumsum(start > shift(cummax(end), type = "lag", fill = 0L))
                    , by = id]
        sample_data[, .(start_date = start_date[1L],
                        end_date = max(end_date),
                        intervention_id = intervention_id[1L],
                        all_ids = toString(intervention_id)),
                    by = .(id, period)]
      },
      check = FALSE,
      min_iterations = 3
    )
  }
)
ggplot2::autoplot(bm)
The result shows that the number of groups, i.e., number of unique id, does have a stronger effect on the run time than the different code versions. In case of many groups, the creation of helper columns before grouping (v_2) gains performance.
I have a data frame of event records with Start Times and End Times.
I want to expand each record into multiple records of consistent time intervals, say hour-long intervals.
For example, let's say the data frame contains two records:
EventId Day StartTime EndTime
1 Mon 1 3
2 Tues 2 5
My desired new data frame should look like this:
EventId Day Time
1 Mon 1
1 Mon 2
2 Tues 2
2 Tues 3
2 Tues 4
My function uses data.table::setDT to expand the records like this:
makeIncrementalRecords <- function(df) {
  new <- data.table::setDT(df)[,
                               .(Time = seq(StartTime,
                                            EndTime,
                                            by = 1)),
                               by = .(EventId, Day)]
  new
}
Executing this line by line, I have no issues, and I get the result I want. When I execute the function as part of a loaded package, I get the following error...
Error in x[j]: invalid subscript type 'list'
I am completely baffled as to why this code would suddenly stop working when executed as a function. I guess it has something to do with the local function environment. It works when I create the data frame as 'df' in the global environment and just execute the setDT function in the console.
Any suggestions?
Thanks
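For reference, a likely cause: when data.table syntax runs inside another package, that package must declare itself data.table-aware; otherwise [.data.table falls back to data.frame subsetting and fails on the .() list with exactly this error. As described in data.table's "Importing data.table" vignette, either add data.table to Imports in DESCRIPTION and import it in NAMESPACE, or define this variable somewhere in the package's R code:
# documented data.table opt-in for packages that do not import data.table:
.datatable.aware <- TRUE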
EDIT
I think this result is what you are looking for.
test_df <- structure(list(EventId = 1:2, Day = c("Mon", "Tues"),
                          StartTime = 1:2, EndTime = c(3L, 5L)),
                     .Names = c("EventId", "Day", "StartTime", "EndTime"),
                     row.names = c(NA, -2L), class = "data.frame")
library(dplyr)
library(tidyr)
generate_val <- function(startT, endT){
  seq(from = startT, to = endT - 1, by = 1)
}
test_df %>%
  rowwise() %>%
  do(new_vars = generate_val(.$StartTime, .$EndTime)) %>%
  bind_cols(test_df %>% select(-c(StartTime:EndTime))) %>%
  unnest()
# A tibble: 5 x 3
EventId Day new_vars
<int> <chr> <dbl>
1 1 Mon 1
2 1 Mon 2
3 2 Tues 2
4 2 Tues 3
5 2 Tues 4
To package this in a function you would have to follow the NSE procedure described here - Programming with dplyr
If you don't mind using tidyr, this should work. It might be a little slower than data.table for large datasets (rows > 1 million).
library(tidyr)
test_df %>%
gather(., key = Time_type, value = Time, -EventId, -Day)
EventId Day Time_type Time
1 1 Mon StartTime 1
2 2 Tues StartTime 2
3 1 Mon EndTime 3
4 2 Tues EndTime 5
Here is a solution similar to that proposed in the comments.
library(tidyverse)
makeIncrementalRecords <- function(data){
  data %>%
    mutate(Time = map2(StartTime, EndTime, ~ seq(.x, .y - 1))) %>%
    unnest() %>%
    select(EventId, Day, Time)
}
makeIncrementalRecords(df)
# EventId Day Time
# 1 1 Mon 1
# 2 1 Mon 2
# 3 2 Tues 2
# 4 2 Tues 3
# 5 2 Tues 4
Or if you want to keep it with data.table
makeIncrementalRecords2 <- function(data){
  data.table::setDT(data)[, .(Time = seq(StartTime, EndTime - 1, by = 1)), by = .(EventId, Day)]
}
makeIncrementalRecords2(df)
# EventId Day Time
# 1: 1 Mon 1
# 2: 1 Mon 2
# 3: 2 Tues 2
# 4: 2 Tues 3
# 5: 2 Tues 4
I have a dataframe df1 like this:
timestamp
01-12-2015 00:04
01-12-2015 02:20
01-12-2015 02:43
01-12-2015 04:31
01-12-2015 08:51
01-12-2015 11:28
01-12-2015 20:53
01-12-2015 21:28
02-12-2015 00:30
02-12-2015 20:22
This contains timestamps. I want to get counts by binning the hours into 12-hour intervals, i.e. 01-12-2015 [0-9], 01-12-2015 [9-21], and so on.
output Sample:
DayOfMonth Group count
1 1 5
1 2 2
2 1 2
2 2 1
The day of month can also be replaced by a serial number, starting with 1. Any help is highly appreciated.
A possible solution in base R:
# convert the 'timestamp' column to a datetime format
df1$timestamp <- as.POSIXct(strptime(df1$timestamp, format = '%d-%m-%Y %H:%M'))
# create day.of.month variable
df1$day.of.month <- format(df1$timestamp, '%d')
# extract the 12 hour interval as am/pm values
df1$group <- gsub('[0-9: ]+', '', format(df1$timestamp, '%r'))
# aggregate
aggregate(. ~ group + day.of.month, df1, length)
which gives:
group day.of.month timestamp
1 am 01 6
2 pm 01 2
3 am 02 1
4 pm 02 1
Another solution using data.table and the pm() function of lubridate:
library(lubridate)
library(data.table)
setDT(df1)[, timestamp := dmy_hm(timestamp)
           ][, group := pm(timestamp) + 1
           ][, .N, .(day.of.month = day(timestamp), group)]
which gives:
day.of.month group N
1: 1 1 6
2: 1 2 2
3: 2 1 1
4: 2 2 1
Used data:
df1 <- structure(list(timestamp = c("01-12-2015 00:04", "01-12-2015 02:20", "01-12-2015 02:43", "01-12-2015 04:31", "01-12-2015 08:51",
"01-12-2015 11:28", "01-12-2015 20:53", "01-12-2015 21:28", "02-12-2015 00:30", "02-12-2015 20:22")),
.Names = "timestamp", class = "data.frame", row.names = c(NA,-10L))
We can use lubridate functions to convert to a datetime class easily, together with dplyr to get the output more efficiently than with base R methods.
library(lubridate)
library(dplyr)
df1 %>%
  mutate(timestamp = dmy_hm(timestamp)) %>%
  group_by(DayOfMonth = day(timestamp)) %>%
  group_by(Group = as.numeric(cut(timestamp, breaks = "12 hour")),
           add = TRUE) %>%
  summarise(GroupCount = n())
# DayOfMonth Group GroupCount
# <int> <dbl> <int>
#1 1 1 6
#2 1 2 2
#3 2 1 1
#4 2 2 1
Or we can use a compact option with data.table
library(data.table)
setDT(df1)[, {t1 <- dmy_hm(timestamp); .(DayOfMonth = day(t1),
                                         Group = (hour(t1) > 12) + 1L)}
           ][, .(GroupCount = .N), .(DayOfMonth, Group)]
# DayOfMonth Group GroupCount
#1: 1 1 6
#2: 1 2 2
#3: 2 1 1
#4: 2 2 1
NOTE: The data.table solution is done with just two steps...
data
df1 <- structure(list(timestamp = c("01-12-2015 00:04", "01-12-2015 02:20",
"01-12-2015 02:43", "01-12-2015 04:31", "01-12-2015 08:51", "01-12-2015 11:28",
"01-12-2015 20:53", "01-12-2015 21:28", "02-12-2015 00:30", "02-12-2015 20:22"
)), .Names = "timestamp", class = "data.frame", row.names = c(NA,-10L))
Another possible solution in base R :
timeStamp <- c("01-12-2015 00:04","01-12-2015 02:20","01-12-2015 02:43","01-12-2015 04:31",
"01-12-2015 08:51","01-12-2015 11:28","01-12-2015 20:53","01-12-2015 21:28",
"02-12-2015 00:30","02-12-2015 20:22")
times <- as.POSIXlt(timeStamp,format="%d-%m-%Y %H:%M",tz='GMT')
DF <- data.frame(Times=times)
DF$Group <- as.logical(times$hour > 12) + 1
DF$DayOfMonth <- times$mday
res <- aggregate(Times ~ DayOfMonth + Group,data=DF, FUN = length)
# res :
# DayOfMonth Group Times
# 1 1 1 6
# 2 2 1 1
# 3 1 2 2
# 4 2 2 1
Or, if you want to count timestamps in the hour range [21-24) of the previous day as part of the next day:
timeStamp <- c("01-12-2015 00:04","01-12-2015 02:20","01-12-2015 02:43","01-12-2015 04:31",
"01-12-2015 08:51","01-12-2015 11:28","01-12-2015 20:53","01-12-2015 21:28",
"02-12-2015 00:30","02-12-2015 20:22")
times <- as.POSIXlt(timeStamp,format="%d-%m-%Y %H:%M",tz='GMT')
h <- times$hour + times$min/60 + times$sec/3600
# add 3 hours to the timestamps in the hour range [21-24); this pushes them into the next day
times[h >= 21] <- times[h >= 21] + 3*3600
DF <- data.frame(Times=times)
DF$Group <- ifelse(h < 9,1,ifelse(h <= 21,2,NA))
DF$DayOfMonth <- times$mday
res <- aggregate(Times ~ DayOfMonth + Group,data=na.omit(DF), FUN = length)
# res :
# DayOfMonth Group Times
# 1 1 1 5
# 2 2 1 2
# 3 1 2 2
# 4 2 2 1
Adding to the several already presented options, the stringi package has some date parsing functions as well:
library(stringi)
df1$timestamp <- stri_datetime_parse(df1$timestamp, format = 'dd-MM-yyyy HH:mm')
df1$DayOfMonth <- stri_datetime_format(df1$timestamp, format = 'd')
df1$Group <- stri_datetime_format(df1$timestamp, format = 'a')
After that, you can get a count with, for example, the following options:
# option 1:
aggregate(. ~ Group + DayOfMonth, df1, length) # copied from #ProcrastinatusMaximus
# option 2a:
library(dplyr)
df1 %>%
  group_by(DayOfMonth, Group) %>%
  tally()
# option 2b:
count(df1, DayOfMonth, Group)
The output of the latter:
DayOfMonth Group n
(chr) (chr) (int)
1 1 a.m. 6
2 1 p.m. 2
3 2 a.m. 1
4 2 p.m. 1