R: converting dates in different formats to a single format

I'm working with a dataset that looks like this -
data = data.frame(ID=c(1,2,3,4,5,6,7,8,9,10),
Date=c('Jan 11, 2019 12:00:00 am','Feb 15, 2019 12:00:00 am','Mar 8, 2019 12:00:00 am',
'Apr 5, 2019 12:00:00 am','Apr 12, 2019 12:00:00 am','26/01/2015 00:00','2015-02-16 00:00:00',
'2015-02-12 00:00:00','2015-11-10 00:00:00','Dec 7, 2018 12:00:00 am'))
#Converting the Date column to character
data$Date=as.character(data$Date)
The column Date contains dates in different formats. I'd like to clean this column up so that all dates are in the same format.
The desired format - YYYY-MM-DD
HERE'S MY ATTEMPT -
I used the AsDate function from the flipTime package to convert my dates.
require(devtools)
install_github("Displayr/flipTime")
library(flipTime)
data$Date_New=AsDate(data$Date)
Which gives me the following error
Error in handleParseFailure(deparse(substitute(x)), length(x),
on.parse.failure) : Could not parse data$Date into a valid date in
any format.
However when I try the same function with any single date from my dataset, it works fine.
AsDate("Feb 15, 2018 12:00:00 AM")
[1] "2018-02-15"
AsDate("19/07/2017 00:00")
[1] "2017-07-19"
Any suggestions or alternative solutions would be highly appreciated
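One possible approach in base R, offered as a sketch rather than a fix for AsDate itself: try a short list of candidate formats in turn and keep the first one that parses each element. The helper name parse_mixed_dates and the list of formats are assumptions made for this example. The formats cover only the three layouts visible in the sample data, "%d/%m/%Y" assumes the slash-separated dates are day-first, and "%b" assumes English month abbreviations.

```r
# Assumption: month abbreviations are English, so parse in the C locale.
invisible(Sys.setlocale("LC_TIME", "C"))

data <- data.frame(
  ID = 1:10,
  Date = c("Jan 11, 2019 12:00:00 am", "Feb 15, 2019 12:00:00 am",
           "Mar 8, 2019 12:00:00 am", "Apr 5, 2019 12:00:00 am",
           "Apr 12, 2019 12:00:00 am", "26/01/2015 00:00",
           "2015-02-16 00:00:00", "2015-02-12 00:00:00",
           "2015-11-10 00:00:00", "Dec 7, 2018 12:00:00 am"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: for each element, keep the result of the first
# format that parses; elements matching no format stay NA.
parse_mixed_dates <- function(x, formats) {
  out <- structure(rep(NA_real_, length(x)), class = "Date")
  for (fmt in formats) {
    todo <- is.na(out)                       # elements not parsed yet
    out[todo] <- as.Date(x[todo], format = fmt)
  }
  out
}

# Only the date part is matched; characters after it (the time) are ignored.
formats <- c("%b %d, %Y", "%d/%m/%Y", "%Y-%m-%d")
data$Date_New <- parse_mixed_dates(data$Date, formats)
format(data$Date_New)  # character vector in YYYY-MM-DD form
```

The order of the formats matters when a string could match more than one of them, so the most specific layouts should come first.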

Related

How to combine 12-hour time sheet and AM/PM column from spreadsheet in r

I have a spreadsheet that has the date and 12-hour time in one column and then another column that specifies AM/PM. How do I combine these columns so I can use them as a POSIXct/POSIXlt/POSIXt object?
The spreadsheet has the time column as
DAY/MONTH/YEAR HOUR:MINUTE
The hour is in 12-hour format (the data comes from a roster of check-in times). The other column just says AM or PM. I am trying to combine these columns, convert them to 24-hour time, and use the result as a POSIXt object.
Example of what I see:
Timesheet        AM-PM
8/10/2022 9:00   AM
8/10/2022 9:01   AM
And this continues until 5:00 PM (same day)
What I have tried so far:
Timesheet %>%
unite("timestamp_24", c("timestamp_12","am_pm"),na.rm=FALSE)%>%
mutate(timestamp=(as.POSIXct(timestamp, format = "%d-%m-%Y %H:%M"))
This does not work as when they are combined it gives:
Timestamp_24
DAY/MONTH/YEAR HOUR:MINUTE_AM
and I think this is the crux of the issue because then as.POSIXct can't read it.
Here's my solution. The approach is simply to extract the hour, add 12 if it is PM, then parse the result with as.POSIXct (you need to use / rather than - in the format argument if your dataframe is as it appears in your example).
I've done that with stringr::str_replace() which allows you to set a function for the replace argument.
Timesheet %>%
mutate(
time_24hr = stringr::str_replace(
time,
"\\d+(?=:..$)",
function(x) {
hr <- as.numeric(x) %% 12
ifelse(am_pm == "PM", hr + 12, hr)
}
),
time_24hr = as.POSIXct(time_24hr, format = "%d/%m/%Y %H:%M")
)
This is the result:
time am_pm time_24hr
1 8/10/2022 9:00 AM 2022-10-08 09:00:00
2 8/10/2022 9:01 PM 2022-10-08 21:01:00
3 8/10/2022 12:01 PM 2022-10-08 12:01:00
4 8/10/2022 12:01 AM 2022-10-08 00:01:00
EDIT: I realized that this didn't work for 11 and 12, as the regex was only extracting the first character before the colon. It also wasn't working for 12:xx times. I fixed both and added test cases to show that these work now.
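An alternative sketch in base R, not part of the answer above: paste the two columns together with a space and let the format string handle the 12-hour clock through %I and %p. The column names time and am_pm follow the answer's example; tz = "UTC" and the C locale (for the AM/PM marker) are assumptions for this example.

```r
# Assumption: AM/PM markers are the English ones, so parse in the C locale.
invisible(Sys.setlocale("LC_TIME", "C"))

Timesheet <- data.frame(
  time  = c("8/10/2022 9:00", "8/10/2022 9:01",
            "8/10/2022 12:01", "8/10/2022 12:01"),
  am_pm = c("AM", "PM", "PM", "AM"),
  stringsAsFactors = FALSE
)

# %I is the hour on a 12-hour clock and %p the AM/PM marker, so
# 12:01 AM becomes 00:01 and 12:01 PM stays 12:01.
Timesheet$time_24hr <- as.POSIXct(
  paste(Timesheet$time, Timesheet$am_pm),
  format = "%d/%m/%Y %I:%M %p",
  tz = "UTC"
)

format(Timesheet$time_24hr, "%Y-%m-%d %H:%M")
```

This avoids the regular expression entirely, at the cost of depending on the locale's AM/PM strings.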

Changing date formats and calculating duration using lubridate

I am trying to get the duration between issue_d and last_pymnt_d in my dataset using the lubridate package. issue_d is a chr in the format "2015-05-01T00:00:00Z" and last_pymnt_d is a chr in the format "Feb-2017". I need them in the same format (I just need "my", or "myd" is fine if "my" is not an option). Then I need to calculate the duration between issue_d and last_pymnt_d.
lcDataSet2$issue_d<-parse_date_time(lcDataSet2$issue_d, "myd")
turns my issue_d into NA. I also get the below error when even just trying to view last_pymnt_d in date format
as.Date(lcRawData$last_pymnt_d)
Error in charToDate(x) :
character string is not in a standard unambiguous format
How can I get these into the same date format and then calculate the duration?
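The order and letter case of the format string are important for parsing dates. issue_d is in year-month-day order followed by a time, so "myd" (month, year, day) cannot match it.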
library(lubridate)
parse_date_time('2015-05-01T00:00:00Z', 'Y-m-d H:M:S')
[1] "2015-05-01 UTC"
parse_date_time('Feb-2017', 'b-Y')
[1] "2017-02-01 UTC"
If you want just the month and year, the zoo package has the as.yearmon function:
library(zoo)
date1 <- as.yearmon('2015-05-01T00:00:00Z')
[1] "May 2015"
date2 <- as.yearmon('Feb-2017', '%b-%Y')
[1] "Feb 2017"
difftime(date2, date1)
Time difference of 642 days
The zoo package gives you a function as.yearmon to convert dates into yearmon objects containing only the month and year. Since your last_pymnt_d is only month and year, the best date difference you will get is the number of months:
library(zoo)
issue_d <- "2015-05-01T00:00:00Z"
last_pymnt_d <- "Feb-2017"
diff <- as.yearmon(last_pymnt_d, format = "%b-%Y") - as.yearmon(as.Date(issue_d))
diff
1.75
Under the hood, the yearmon object is a number of years, with the decimal component representing the months. A difference in yearmon of 1.75 is 1 year and 9 months.
diff_months <- paste(round(diff * 12, 0), "months")
"21 months"
diff_yearmon <- paste(floor(diff), "years and", round((diff %% 1) * 12, 0), "months")
diff_yearmon
"1 years and 9 months"

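For comparison, here is a minimal base R sketch of the same calculation without zoo or lubridate. It assumes, as the answers above do, that "Feb-2017" stands for the first day of that month and that month abbreviations are English; the variable names days_between and months_between are made up for this example.

```r
# Assumption: month abbreviations are English, so parse in the C locale.
invisible(Sys.setlocale("LC_TIME", "C"))

# Only the leading YYYY-MM-DD is matched; the trailing time part is ignored.
issue_d <- as.Date("2015-05-01T00:00:00Z", format = "%Y-%m-%d")

# "Feb-2017" has no day, so prepend day 01 before parsing.
last_pymnt_d <- as.Date(paste0("01-", "Feb-2017"), format = "%d-%b-%Y")

# Difference in days between the two dates.
days_between <- as.numeric(difftime(last_pymnt_d, issue_d, units = "days"))

# Difference in whole calendar months: 12 * (year difference) + (month difference).
a <- as.POSIXlt(issue_d)
b <- as.POSIXlt(last_pymnt_d)
months_between <- 12 * (b$year - a$year) + (b$mon - a$mon)

days_between    # 642
months_between  # 21
```

The 21 months agree with the yearmon difference of 1.75 years, since 1.75 * 12 = 21.
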
How to split Monday, July 1, 2019 12:00:00:000 AM

I have read, studied, and tested, but I'm just not getting it. Here is my data frame:
MyDate TEMP1 TEMP2
Monday, July 1, 2019 12:00:00:000 AM 90.0 1586
Monday, July 1, 2019 12:01:00:000 AM 88.6 1581
Monday, July 1, 2019 12:02:00:000 AM 89.4 1591
Monday, July 1, 2019 12:03:00:000 AM 90.5 1586
I need to compare it to a second data frame:
Date Time A.B.Flow A.B.Batch.Volume
7/1/2019 14:47:46 1.0 2.0
7/9/2019 14:47:48 3.0 5.0
7/11/2019 14:47:52 0.0 2.0
7/17/2019 14:48:52 3.8 4.0
7/24/2019 14:49:52 0.0 3.1
I just have to combine the two data frames when the dates, hours, and minutes match. The seconds do not have to match.
So far I have gleaned that I need to convert the first Column MyDate into separate Dates and Times. I've been unable to come up with a strsplit command that actually does this.
This just gives each element in quotes:
Tried, newdate <- strsplit(testdate$MyDate, "\\s+ ")[[3]]
This is better but "2019" is gone:
Tried, newdate <- strsplit(testdate$MyDate, "2019")
It looks like this:
[[1]]
[1] "Monday, July 1, " "12:00:00:000 AM"
[[2]]
[1] "Monday, July 1, " "12:01:00:000 AM"
[[3]]
[1] "Monday, July 1, " "12:02:00:000 AM"
[[4]]
[1] "Monday, July 1, " "12:03:00:000 AM"
Please tell me what I am doing wrong. I would love some input as to whether I am barking up the wrong tree.
I've tried a few other things using anytime and lubridate, but I keep coming back to this combined date and time with the day written out as my nemesis.
You could get rid of the day (Monday, ...) in your MyDate field by splitting on ',', removing the first element, then combining the rest and converting to POSIXct.
Assuming your first dataframe is called df:
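dt <- strsplit(df$MyDate, ',')
df$MyDate2 <- sapply(dt, function(x) trimws(paste0(x[-1], collapse = ',')))
# Drop the ":000" milliseconds so that %p can follow %S directly
df$MyDate2 <- sub(':[0-9]{3} ', ' ', df$MyDate2)
# Use %I with %p for the 12-hour clock; %H would read 12:00:00 AM as noon
df$MyDate2 <- as.POSIXct(df$MyDate2, format = '%B %d, %Y %I:%M:%S %p')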
And since you are not interested in the seconds part of the timestamps, you can do:
df$MyDate2 <- format(df$MyDate2, '%Y-%m-%d %H:%M')
You should similarly convert the Date/Time fields of your second dataframe df2, creating a MyDate2 field there with the seconds part removed as above.
Now you can merge the two dataframes on the MyDate2 column.
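The whole procedure can be sketched end to end as follows. The two small data frames are cut-down stand-ins for the ones in the question, and the 2:47 PM row in df is made up so that the merge has something to match. tz = "UTC" and the C locale (English month names and AM/PM) are also assumptions.

```r
# Assumption: month names and AM/PM markers are English (C locale).
invisible(Sys.setlocale("LC_TIME", "C"))

df <- data.frame(
  MyDate = c("Monday, July 1, 2019 12:00:00:000 AM",
             "Monday, July 1, 2019 2:47:00:000 PM"),   # made-up second row
  TEMP1 = c(90.0, 88.6),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  Date = c("7/1/2019", "7/9/2019"),
  Time = c("14:47:46", "14:47:48"),
  A.B.Flow = c(1.0, 3.0),
  stringsAsFactors = FALSE
)

# First data frame: strip the weekday and the ":000" milliseconds, then parse.
clean <- sub("^[^,]+,[[:space:]]*", "", df$MyDate)
clean <- sub(":[0-9]{3} ", " ", clean)
t1 <- as.POSIXct(clean, format = "%B %d, %Y %I:%M:%S %p", tz = "UTC")

# Second data frame: combine the Date and Time columns, then parse.
t2 <- as.POSIXct(paste(df2$Date, df2$Time),
                 format = "%m/%d/%Y %H:%M:%S", tz = "UTC")

# Build a join key truncated to the minute, so seconds do not have to match.
df$MyDate2  <- format(t1, "%Y-%m-%d %H:%M")
df2$MyDate2 <- format(t2, "%Y-%m-%d %H:%M")

merged <- merge(df, df2, by = "MyDate2")
merged
```

merge() keeps only rows whose key appears in both data frames by default; pass all.x = TRUE or all = TRUE to keep unmatched rows as well.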
This might give you a hint:
Since you have time, you shouldn't use as.Date but rather as.POSIXct, imho.
library(stringr) # provides str_remove_all() and word()
x=c("Monday, July 1, 2019 12:00:00:000 AM 90.0 1586")
Months=c("January","February","March","April","May","June","July","August","September","October","November","December")
GetDate=function(x){
x=str_remove_all(x,",") # get rid of the commas
mo=which(Months==word(x,2)) # month number from the month name
day=word(x,3)
year=word(x,4)
time=word(x,5)
# note: the AM/PM marker (word 6) is not used here
as.POSIXct(paste(paste(year,mo,day,sep="-"),time))
}
GetDate(x)

Dates out by 2 days when I convert to Date Format in R

When I am converting dates from characters to "dates" it seems to be off by 2 days from excel?
My example
mydata <- c(38808,40422,40493,40606)
as.Date(mydata, origin="1900-01-01")
# [1] "2006-04-03" "2010-09-03" "2010-11-13" "2011-03-06"
yet in excel the dates are as follows
Date     in Excel     in R         Delta
38808    2006-04-01   2006-04-03   2
40422    2010-09-01   2010-09-03   2
40493    2010-11-11   2010-11-13   2
40606    2011-03-04   2011-03-06   2
I get around it by changing origin date to 1899-12-30 but I am sure I am doing something wrong.
Thanks
It is a known problem that Excel thinks 1900 was a leap year, but it was not. So Excel counts an extra day (for nonexistent Feb 29, 1900). In addition, Excel considers "1900-01-01" as day 1, not day 0.
Maybe the link will help:
http://www.cpearson.com/excel/datetime.htm
For excel dates you need this one:
mydata <- c(38808,40422,40493,40606)
as.Date(mydata, origin = "1899-12-30")
[1] "2006-04-01" "2010-09-01" "2010-11-11" "2011-03-04"
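The two-day shift can be checked directly. The sketch below reuses mydata from the question; the names naive, fixed, offset_days and first_safe are made up for this example.

```r
mydata <- c(38808, 40422, 40493, 40606)

naive <- as.Date(mydata, origin = "1900-01-01")  # treats serial 0 as 1900-01-01
fixed <- as.Date(mydata, origin = "1899-12-30")  # origin used in the answer above

# The two origins are exactly two days apart: one day because Excel counts
# 1900-01-01 as day 1, and one day for the nonexistent 1900-02-29.
offset_days <- as.numeric(as.Date("1900-01-01") - as.Date("1899-12-30"))

# Serial 61 is the first day after Excel's phantom leap day.
first_safe <- as.Date(61, origin = "1899-12-30")

format(fixed)
offset_days        # 2
format(first_safe) # "1900-03-01"
```

Because of the phantom leap day, the 1899-12-30 origin matches Excel only for serial numbers from 61 (1 March 1900) onward; earlier serials come out one day early.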

Error in converting date time to 24 hour format

I have the following dataframe and am trying to calculate the difference in minutes between dates in vectors and store it into a new one.
Reportnumber OpenedDate
00001 22/1/2016 5:52:12 PM
00002 20/1/2016 4:15:06 PM
00003 18/1/2016 1:09:46 PM
00004 15/1/2016 10:47:40 AM
00005 15/1/2016 10:32:37 AM
00006 14/1/2016 2:13:48 PM
00007 14/1/2016 11:12:29 AM
00008 14/1/2016 10:17:30 AM
00009 12/1/2016 2:25:03 PM
Before using difftime to get the difference, I'm trying to convert the time to a 24 hour format and strip AM/PM, I'm doing the following:
dataset$convertedDate <- as.POSIXct('dataset$OpenedDate', format="%d/%b/%Y %H:%M:%s")
I don't get an error in the console but the dataset$convertedDate vector isn't updated.
Is this the right way to approach the problem?
Update:
Get ready for a facepalm.
Look closely at the call you are making:
dataset$convertedDate <- as.POSIXct('dataset$OpenedDate', format="%d/%b/%Y %H:%M:%s")
You are passing in 'dataset$OpenedDate' instead of dataset$OpenedDate. In other words, you are actually passing the literal text string to as.POSIXct(), not the column! I verified that passing that string to as.POSIXct() with your format indeed returns NA, which is what you are seeing.
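Your format string had other problems as well: it was missing the AM/PM specifier (%p), it used %b (month name) where the data has a month number (%m), it used %H (24-hour) where %I (12-hour) is needed alongside %p, and it used lowercase %s instead of %S for seconds. Try the following, which assumes that the timezone is UTC (which you can change to fit your needs):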
as.POSIXct(df$OpenedDate, format="%d/%m/%Y %I:%M:%S %p", tz="UTC")
Output:
[1] "2016-01-22 17:52:12 UTC" "2016-01-20 16:15:06 UTC"
Data:
df <- data.frame(Reportnumber=c('00001', '00002'),
OpenedDate=c('22/1/2016 5:52:12 PM', '20/1/2016 4:15:06 PM'),
ClosedDate=c('25/1/2016 1:35:05 PM', '20/1/2016 4:30:06 PM'))
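With both columns parsed, the difference in minutes that the question asks for can be sketched as below. It reuses the sample data above; the column name minutes_open is made up for this example, and the C locale is assumed so that %p recognises AM and PM.

```r
# Assumption: AM/PM markers are the English ones, so parse in the C locale.
invisible(Sys.setlocale("LC_TIME", "C"))

df <- data.frame(
  Reportnumber = c("00001", "00002"),
  OpenedDate = c("22/1/2016 5:52:12 PM", "20/1/2016 4:15:06 PM"),
  ClosedDate = c("25/1/2016 1:35:05 PM", "20/1/2016 4:30:06 PM"),
  stringsAsFactors = FALSE
)

fmt <- "%d/%m/%Y %I:%M:%S %p"
opened <- as.POSIXct(df$OpenedDate, format = fmt, tz = "UTC")
closed <- as.POSIXct(df$ClosedDate, format = fmt, tz = "UTC")

# units = "mins" fixes the unit; otherwise difftime picks one automatically.
df$minutes_open <- as.numeric(difftime(closed, opened, units = "mins"))
df$minutes_open
```

For the second report this gives exactly 15 minutes (4:15:06 PM to 4:30:06 PM).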
