Is there a way to clean date and time data in R?

I am trying to bin times of day: 4 am to 12 pm as morning, 12 pm to 9 pm as evening, and 9 pm to 4 am as night. I want this for a logistic regression model that predicts whether an arrest happens, given the type of crime and the time of the crime.
I have tried lubridate, but the column is stored as strings, so I could not apply its functions directly. as.Date does not help either, because some rows look like 03/26/2015 06:56:30 PM while others look like 04-12-15 20:24. The two formats are completely different, so a single as.Date call cannot handle both.
One idea is to convert every 04-12-15 20:24 value to the 03/26/2015 06:56:30 PM format, for example by replacing each - with / in the date part.
I don't know how to achieve this.

You can use case_when() from the dplyr library to determine the format of the date and then proceed with the conversion based on the format type. From there we check the 24H time component to determine the time of day based on the bins in the OP.
library(dplyr)
chicago15 <- data.frame(Date = c("03/26/2015 06:56:30 PM","04-12-15 20:24",
"03/26/2015 06:56:30 AM","04-12-15 21:24",
"12/31/2017 03:28:43 AM"))
chicago15 %>%
dplyr::mutate(Date2 = dplyr::case_when(
grepl('-',Date) ~ as.POSIXct(Date,format = '%m-%d-%y %H:%M'),
TRUE ~ as.POSIXct(Date,format = '%m/%d/%Y %I:%M:%S %p')
)) %>%
dplyr::mutate(Time_of_Day = dplyr::case_when(
as.numeric(format(Date2,'%H')) >= 21 ~ 'night',
as.numeric(format(Date2,'%H')) >= 12 ~ 'evening',
as.numeric(format(Date2,'%H')) >= 4 ~ 'morning',
TRUE ~ 'night'
))
Date Date2 Time_of_Day
1 03/26/2015 06:56:30 PM 2015-03-26 18:56:30 evening
2 04-12-15 20:24 2015-04-12 20:24:00 evening
3 03/26/2015 06:56:30 AM 2015-03-26 06:56:30 morning
4 04-12-15 21:24 2015-04-12 21:24:00 night
5 12/31/2017 03:28:43 AM 2017-12-31 03:28:43 night
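For comparison, here is a minimal base R sketch of the same two steps, assuming only the two formats from the question occur. The helper names parse_mixed and time_of_day are made up for illustration, and tz = "UTC" is an assumption (use whatever time zone the data was recorded in). Parsing %p also assumes a locale that understands AM/PM.

```r
# Try the 12-hour format first, then fill the gaps with the 24-hour format.
# tz = "UTC" is an assumption, not something stated in the question.
parse_mixed <- function(x) {
  out <- as.POSIXct(x, format = "%m/%d/%Y %I:%M:%S %p", tz = "UTC")
  miss <- is.na(out)
  out[miss] <- as.POSIXct(x[miss], format = "%m-%d-%y %H:%M", tz = "UTC")
  out
}

# Map an hour (0-23) to the bins from the question:
# [4, 12) morning, [12, 21) evening, everything else night.
time_of_day <- function(hour) {
  bins <- c("night", "morning", "evening", "night")
  bins[findInterval(hour, c(4, 12, 21)) + 1]
}

dates <- c("03/26/2015 06:56:30 PM", "04-12-15 20:24", "12/31/2017 03:28:43 AM")
parsed <- parse_mixed(dates)
hours <- as.integer(format(parsed, "%H"))
data.frame(Date = dates, Date2 = parsed, Time_of_Day = time_of_day(hours))
```

Rows that match neither format stay NA, which makes unexpected formats easy to spot with is.na().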

Related

attempt to use fct_collapse() with class Date, must be factor or character vector, not an s3 object

I have a dataset I want to plot, and the date column, which will be my x-axis, needs simplifying. Right now I have every single day from March 2020 to November 2022, but I want to use manually defined 6-month groups, with the leftover days as a separate group. (This is my first question here, so let me know if more context is needed.)
Anyways, my instinct was to use fct_collapse, but I get this error:
.f must be a factor or character vector, not an S3 object with class Date
I understand this is because my column by_date_total$date is a Date.
I don't see a forcats operation that would work. Is my only option to convert the Date class to something else and then convert it back? If I convert it, how will the groups I set be read? I saw another answer that used as_data_frame to coerce the date into a character class, but once it is character I can no longer use ('%y-%m-%d - %y-%m-%d'), though I guess that never worked in the first place.
my dataframe by_date_total:
date total_deaths total_cases
<date> <dbl> <dbl>
2020-03-15 68 3595
2020-03-16 91 4502
2020-03-17 117 5901
2020-03-18 162 8345
2020-03-19 212 12387
2020-03-20 277 17998
2020-03-21 359 24507
This is what I tried that produced the error:
plot_by_date <- by_date_total%>%
mutate(
date2 =
fct_collapse(date,
'6 months' = c("2020-03-15" - "2020-09-14"),
'12 months' = c("2021-09-15" - "2021-03-14"),
'18 months' = c("2022-03-15" - "2022-09-14"),
'18 months+' = c("2022-09-15" - "2022-11-14"))
)
plot_by_date
I did not include the rest of the ggplot(aes()) code because I want to verify that this step works first.
I also tried converting the column to character and then running the code above again:
plot_by_date <- as_data_frame(by_date_total) %>%
rename(Date = date) %>%
mutate(Date = str_replace_all(Date, "\\D", "-"),
Date = as.character(Date))
plot_by_date
That produced this error:
non-numeric argument to binary operator
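case_when() from dplyr is a good alternative. The conditions are evaluated in order, so the latest cutoff date comes first:
library(dplyr)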
data.frame(date=as.Date(c("2020-03-16", "2020-03-14", "2021-09-16", "2022-03-16", "2022-09-16", "2022-11-15"))) %>%
mutate(date2 = case_when(date >= "2022-09-15" ~ "18+ months",
date >= "2022-03-15" ~ "18 months",
date >= "2021-09-15" ~ "12 months",
date >= "2020-03-15" ~ "6 months",
TRUE ~ "other"))
# date date2
#1 2020-03-16 6 months
#2 2020-03-14 other
#3 2021-09-16 12 months
#4 2022-03-16 18 months
#5 2022-09-16 18+ months
#6 2022-11-15 18+ months
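Since the goal is a grouped x-axis, it may help to get an ordered factor rather than plain strings. The following is a sketch in base R using findInterval(), reusing the cutoff dates and labels from the answer above. The variable names are illustrative only.

```r
dates <- as.Date(c("2020-03-16", "2020-03-14", "2021-09-16",
                   "2022-03-16", "2022-09-16", "2022-11-15"))

# Start date of each group, in increasing order (same cutoffs as above).
starts <- as.Date(c("2020-03-15", "2021-09-15", "2022-03-15", "2022-09-15"))

# One label per interval; the first covers everything before the first start.
group_labels <- c("other", "6 months", "12 months", "18 months", "18+ months")

# findInterval() returns 0 for dates before the first start, 1 for the first
# interval, and so on, so adding 1 turns it into an index into group_labels.
idx <- findInterval(as.numeric(dates), as.numeric(starts)) + 1
date2 <- factor(group_labels[idx], levels = group_labels)

data.frame(date = dates, date2 = date2)
```

Because the levels are set explicitly, ggplot2 would draw the groups in chronological rather than alphabetical order.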

Filtering out time data from R data frame

So I have a dataset in R:
IncidentID Time Vehicle
19002 4:48 Car
19003 12:30 Motorcycle
19004 14:00 Car
19005 9:30 Bicycle
I'm trying to filter this data, since it is quite a large dataset; the rows above are just a few examples.
I want to filter by time: say I want the rows where Time is between 12 pm and 6 pm (18:00 in 24-hour format), so I would get:
IncidentID Time Vehicle
19003 12:30 Motorcycle
19004 14:00 Car
I did:
incident <- read.csv("incident.csv")
afternoon_incident <- incident[which(incident$Time >= 12 && incident$Time <= 18),]
But I'm getting the error saying:
1: In Ops.factor(web$Time, 6:0) : ‘>=’ not meaningful for factors
2: In Ops.factor(web$Time, 12:0) : ‘<=’ not meaningful for factors
You can use lubridate to convert the Time field into a period object and then extract the hour for filtering:
library(lubridate)
incident$Time <- hm(as.character(incident$Time))
incident[which(hour(incident$Time) >= 12 & hour(incident$Time) <= 18), ]
You first need to convert Time into an actual date-time object using as.POSIXct and then compare.
Since you want to subset based on the hour, we can extract only the hour part using format and keep the rows between hours 12 and 18. Using base R, we can do
df$hour <- as.numeric(format(as.POSIXct(df$Time, format = "%H:%M"), "%H"))
subset(df, hour >= 12 & hour <= 18)
# IncidentID Time Vehicle hour
#2 19003 12:30 Motorcycle 12
#3 19004 14:00 Car 14
You can remove the hour column later if not needed.
For a general solution, we can create a date-time column and then compare
df$datetime <- as.POSIXct(df$Time, format = "%H:%M")
subset(df, datetime >= as.POSIXct("12:30:00", format = "%T") &
datetime <= as.POSIXct("18:30:00", format = "%T"))
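Two details are worth spelling out. The question's attempt used &&, which is not vectorized, so the element-wise & is needed for row filtering. Also, a test like hour <= 18 keeps times from 18:01 to 18:59. If the upper bound should be exactly 18:00, one option is to compare minutes since midnight. The sketch below assumes Time is always in H:MM or HH:MM form; to_minutes is a hypothetical helper, not part of any package.

```r
# Convert "H:MM" / "HH:MM" strings (or factors) to minutes since midnight.
to_minutes <- function(x) {
  parts <- strsplit(as.character(x), ":", fixed = TRUE)
  vapply(parts, function(p) as.numeric(p[1]) * 60 + as.numeric(p[2]), numeric(1))
}

incident <- data.frame(
  IncidentID = 19002:19005,
  Time = c("4:48", "12:30", "14:00", "9:30"),
  Vehicle = c("Car", "Motorcycle", "Car", "Bicycle")
)

mins <- to_minutes(incident$Time)

# Element-wise & keeps rows from 12:00 up to and including 18:00.
afternoon_incident <- incident[mins >= 12 * 60 & mins <= 18 * 60, ]
afternoon_incident
```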

How to manipulate the time part of a date column?

How do I write this code (hour is from lubridate package)?
Objective: if the hour part of PICK_DATE is later than 16:00, ADJ_PICK_DATE should be 03:00 the next day. If the hour part of PICK_DATE is earlier than 03:00, ADJ_PICK_DATE should be 03:00 the same day. The problem is that when no change is needed, i.e. when the hour part of PICK_DATE is between 03:00 and 16:00, the code still adds 3 hours to PICK_DATE.
x$PICK_TIME <- cut(hour(x$PICK_DATE), c(-1, 2, 15, 24), c("EARLY", "OKAY", "LATE"))
x$ADJ_PICK_DATE <- ifelse(x$PICK_TIME=="EARLY",
as.POSIXct(paste(format(x$PICK_DATE, "%d-%b-%Y"), "03:00"),
format="%d-%b-%Y %H:%M"), x$PICK_DATE)
x$ADJ_PICK_DATE <- ifelse(x$PICK_TIME=="LATE",
as.POSIXct(paste(format(x$PICK_DATE+86400, "%d-%b-%Y"),
"03:00"), format="%d-%b-%Y %H:%M"),
x$ADJ_PICK_DATE)
x$ADJ_PICK_DATE <- as.POSIXct(x$ADJ_PICK_DATE, origin = "1970-01-01")
Help please.
Sample data:
PICK_DATE SHIP_DATE
01-APR-2017 00:51 02-APR-2017 06:55 AM
01-APR-2017 00:51 02-APR-2017 12:11 PM
01-APR-2017 00:51 02-APR-2017 12:11 PM
01-APR-2017 00:51 02-APR-2017 09:39 AM
Here is a simple, reproducible example. I had to make up some sample data, based on an earlier question you asked. I suggest reading into dplyr and lubridate as they will help you with your work on manipulating dates.
EDIT: Updated to work with end-of-month dates.
library(lubridate)
library(dplyr)
df <- data.frame(pick_date = c("01-APR-2017 00:51", "02-APR-2017 08:53", "15-APR-2017 16:12", "23-APR-2017 02:04", "30-APR-2017 20:08"), ship_date = c("05-APR-2017 06:55", "09-APR-2017 12:11", "30-APR-2017 13:11", "02-MAY-2017 15:16", "05-MAY-2017 09:57"))
df %>%
mutate(pick_date = dmy_hm(pick_date)) %>%
mutate(ship_date = dmy_hm(ship_date)) %>%
mutate(pick_time = case_when(
hour(pick_date) < 3 ~ "early",
hour(pick_date) >= 16 ~ "late",
TRUE ~ "okay")
) %>%
mutate(new_pick_time = case_when(
pick_time == "early" ~ hms(hours(3)),
pick_time == "late" ~ hms(hours(3)),
TRUE ~ hms(paste0(hour(pick_date), "H ", minute(pick_date), "M ", second(pick_date), "S")))
) %>%
mutate(temp_pick_date = case_when(
pick_time == "early" ~ pick_date,
pick_time == "late" ~ pick_date + days(1),
TRUE ~ pick_date)
) %>%
mutate(new_pick_date = make_datetime(year(temp_pick_date), month(temp_pick_date), day(temp_pick_date), hour(new_pick_time), minute(new_pick_time), second(new_pick_time))) %>%
select(-new_pick_time, -temp_pick_date)
This returns
pick_date ship_date pick_time new_pick_date
1 2017-04-01 00:51:00 2017-04-05 06:55:00 early 2017-04-01 03:00:00
2 2017-04-02 08:53:00 2017-04-09 12:11:00 okay 2017-04-02 08:53:00
3 2017-04-15 16:12:00 2017-04-30 13:11:00 late 2017-04-16 03:00:00
4 2017-04-23 02:04:00 2017-05-02 15:16:00 early 2017-04-23 03:00:00
5 2017-04-30 20:08:00 2017-05-05 09:57:00 late 2017-05-01 03:00:00
So it sounds like you just need to do two different arithmetic operations, conditional on the hour of a date time?
The simplest way I can think of to access the hour component is to store the time as a POSIXlt. I believe the "l" stands for "list"; it lets you treat a timestamp like a list whose components (hour, minute, and so on) are accessible by name.
Like this:
> time <- as.POSIXlt('2017-07-29 15:12:01')
> time
[1] "2017-07-29 15:12:01 EDT"
> time$hour
[1] 15
So you could write a function that does the operation you desire, and feed it your date column. Hard for me to take it further because I don't quite understand the question, but here's a skeleton:
ComputeDifference <- function(time) {
if (time$hour < 3) {
# code to count orders between 0 and 3 "from same day 3:00"
}
if (time$hour >= 16) {
# code to consider late orders
}
}
If you throw in sample data and refine the question, maybe I can take a more thorough crack at this.
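As a compact base R alternative, the sketch below applies the rule from the question with logical indexing. ifelse() drops the POSIXct class (which is why the question needed a final as.POSIXct(..., origin = "1970-01-01")), whereas assigning into a copy keeps it. The function name adjust_pick_date is hypothetical, and the code assumes the timestamps are stored in UTC so that no daylight-saving shifts are involved.

```r
# x: POSIXct vector, assumed to be stored with tz = "UTC".
adjust_pick_date <- function(x) {
  h <- as.integer(format(x, "%H"))
  # Midnight of the same calendar day.
  midnight <- as.POSIXct(format(x, "%Y-%m-%d"), format = "%Y-%m-%d", tz = "UTC")
  # 03:00 on the same day, or on the next day for late picks (hour 16+).
  target <- midnight + 3 * 3600 + ifelse(h >= 16, 86400, 0)
  change <- h < 3 | h >= 16
  out <- x
  out[change] <- target[change]  # rows with hours 3-15 are left untouched
  out
}

pick <- as.POSIXct(c("2017-04-01 00:51", "2017-04-02 08:53", "2017-04-15 16:12",
                     "2017-04-23 02:04", "2017-04-30 20:08"), tz = "UTC")
format(adjust_pick_date(pick), "%Y-%m-%d %H:%M")
```

The end-of-month row rolls over to 1 May because the day is added as seconds to a date-time rather than by editing the day number.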

Convert hour to date-time

I have a data frame with a time-of-day stamp (hours and minutes) and the corresponding measured temperature. The measurements are taken continuously at random intervals. I would like to convert the time stamps to full date-times alongside the measured temperature. My data frame looks like this (the measurements started on 20/05/2016):
Time, Temp
09.25,28
10.35,28.2
18.25,29
23.50,30
01.10,31
12.00,36
02.00,25
I would like to create a data.frame with respective date-time and Temp like below:
Time, Temp
2016-05-20 09:25,28
2016-05-20 10:35,28.2
2016-05-20 18:25,29
2016-05-20 23:50,30
2016-05-21 01:10,31
2016-05-21 12:00,36
2016-05-22 02:00,25
I would be thankful for any comments and tips on packages or functions in R that I could look at to do this. Thanks for your time.
A possible solution in base R:
df$Time <- as.POSIXct(strptime(paste('2016-05-20', sprintf('%05.2f',df$Time)), format = '%Y-%m-%d %H.%M', tz = 'GMT'))
df$Time <- df$Time + cumsum(c(0,diff(df$Time)) < 0) * 86400 # 86400 = 60 * 60 * 24
which gives:
> df
Time Temp
1 2016-05-20 09:25:00 28.0
2 2016-05-20 10:35:00 28.2
3 2016-05-20 18:25:00 29.0
4 2016-05-20 23:50:00 30.0
5 2016-05-21 01:10:00 31.0
6 2016-05-21 12:00:00 36.0
7 2016-05-22 02:00:00 25.0
An alternative with data.table, using shift to compare each time with the previous one and cumsum to count the day rollovers:
library(data.table)
setDT(df)[, Time := as.POSIXct(strptime(paste('2016-05-20', sprintf('%05.2f',Time)), format = '%Y-%m-%d %H.%M', tz = 'GMT')) +
cumsum(Time < shift(Time, fill = Time[1])) * 86400]
Or with dplyr:
library(dplyr)
df %>%
mutate(Time = as.POSIXct(strptime(paste('2016-05-20',
sprintf('%05.2f',Time)),
format = '%Y-%m-%d %H.%M', tz = 'GMT')) +
cumsum(c(0,diff(Time)) < 0)*86400)
which will both give the same result.
Used data:
df <- read.table(text='Time, Temp
09.25,28
10.35,28.2
18.25,29
23.50,30
01.10,31
12.00,36
02.00,25', header=TRUE, sep=',')
You can use a custom date format combined with some code that detects when a new day begins (assuming the first measurement takes place earlier in the day than the last measurement of the previous day).
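# starting day
start_date = "2016-05-20"
# read Time as character so the leading zeros are kept
values=read.csv('values.txt', colClasses=c("character",NA))
# previous reading for each row; "00.00" stands in for the first row
last=c("00.00",head(values$Time,-1))
# a new day starts whenever a time is earlier than the previous reading
day=cumsum(values$Time<last)
Time = strptime(paste(start_date,values$Time), "%Y-%m-%d %H.%M")
Time = Time + day*86400
values$Time = Time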
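The answers above all rely on the same idea: whenever a reading is earlier in the day than the one before it, a new day must have started. A small sketch that makes the intermediate day offset visible, assuming the readings are in chronological order and never more than 24 hours apart (tz = "UTC" is an assumption as well):

```r
time_raw <- c(9.25, 10.35, 18.25, 23.50, 1.10, 12.00, 2.00)

# TRUE wherever the clock time went backwards; cumsum() counts the rollovers.
day_offset <- cumsum(c(FALSE, diff(time_raw) < 0))
day_offset

# Split HH.MM into hours and minutes; round() guards against values such as
# 34.99999 that arise from floating-point subtraction.
hours <- floor(time_raw)
minutes <- round((time_raw - hours) * 100)

start <- as.POSIXct("2016-05-20", tz = "UTC")
stamp <- start + day_offset * 86400 + hours * 3600 + minutes * 60
format(stamp, "%Y-%m-%d %H:%M")
```

If two consecutive readings were more than a day apart, the clock times alone could not reveal it, so this approach would undercount the days.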

R aggregate a dataframe by hours from a date with time field

I'm relatively new to R, but I am very familiar with Excel and T-SQL.
I have a simple dataset that has a date with time and a numeric value associated with it. What I'd like to do is summarize the numeric values by hour of the day. I've found a couple of resources for working with time types in R, but I was hoping to find a solution similar to what Excel offers (where I can call a function, pass in my date/time data, and have it return the hour of the day).
Any suggestions would be appreciated - thanks!
library(readr)
library(dplyr)
library(lubridate)
df <- read_delim('DateTime|Value
3/14/2015 12:00:00|23
3/14/2015 13:00:00|24
3/15/2015 12:00:00|22
3/15/2015 13:00:00|40',"|")
df %>%
mutate(hour_of_day = hour(as.POSIXct(strptime(DateTime, "%m/%d/%Y %H:%M:%S")))) %>%
group_by(hour_of_day) %>%
summarise(meanValue = mean(Value))
breakdown:
Convert column of DateTime (character) into formatted time then use hour() from lubridate to pull out just that hour value and put it into new column named hour_of_day.
> df %>%
mutate(hour_of_day = hour(as.POSIXct(strptime(DateTime, "%m/%d/%Y %H:%M:%S"))))
Source: local data frame [4 x 3]
DateTime Value hour_of_day
1 3/14/2015 12:00:00 23 12
2 3/14/2015 13:00:00 24 13
3 3/15/2015 12:00:00 22 12
4 3/15/2015 13:00:00 40 13
group_by(hour_of_day) sets the groups over which mean(Value) is computed in the summarise(...) call.
This gives the result:
hour_of_day meanValue
1 12 22.5
2 13 32.0
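For readers who prefer to stay in base R, a rough equivalent is sketched below. The hour component of a POSIXlt plays the role of Excel's HOUR() function, and aggregate() does the grouping. The data frame is rebuilt inline from the sample above, and tz = "UTC" is an assumption.

```r
df <- data.frame(
  DateTime = c("3/14/2015 12:00:00", "3/14/2015 13:00:00",
               "3/15/2015 12:00:00", "3/15/2015 13:00:00"),
  Value = c(23, 24, 22, 40)
)

# POSIXlt stores each timestamp as named components, so $hour returns 0-23.
df$hour_of_day <- as.POSIXlt(df$DateTime, format = "%m/%d/%Y %H:%M:%S", tz = "UTC")$hour

# Mean of Value within each hour of the day.
res <- aggregate(Value ~ hour_of_day, data = df, FUN = mean)
res
```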
