Parsing date and time from xlsx import - r

I have a column of dates in Excel in the following format: MM/DD/YY AM or MM/DD/YY PM, and I was able to parse it after importing with readxl::read_excel:
parse_date_time(x, '%m/%d/%y %p', tz = "UTC")
Now, if I want to bring in MM/DD/YY HH:MM PM instead, the value comes in as a number after import. For example, "3/16/20 3:00 PM" becomes 43906.625.
One solution would be to import the date columns as strings; however, I have 50 columns in the file and don't want to hard-code each column type. Is there a way to get the date and time from this numeric value instead (i.e. 43906.625)?

Excel stores date-times in a "day-integer" format: whole days since an origin, with the time of day as the fractional part. R counts seconds for date-times (POSIXct) and days for Date, so depending on which class you are converting to, you need to account for the 86,400 seconds in a day. It is also worth knowing that Excel's origin is in 1899.
as.POSIXct(43906.625 * 86400, origin = "1899-12-30", tz = "UTC")
# [1] "2020-03-16 15:00:00 UTC"
As a bit of history: the reason that it's "1899-12-30" and not, say "1899-12-31" (end of the day?) or something else is mentioned in a blog post from 2013:
For Excel on Windows, the origin date is December 30, 1899 for dates after 1900. (Excel’s designer thought 1900 was a leap year, but it was not.) For Excel on Mac, the origin date is January 1, 1904.
https://www.r-bloggers.com/date-formats-in-r/
I don't know the canonical reference for this, and the website from which R-Bloggers borrowed/scraped that article is no longer responding. I would much prefer a still-active and more canonical reference for this assertion (that the engineers mis-identified the leap year).

Related

How to format time zone offset in lubridate

I want to format the date in ISO 8601 format using lubridate. At the moment the code I have parses the date almost the way I want. The only thing I want to change is to have a colon in the time zone offset. My code at the moment:
dateTime <- str_match(fileName, dateTimeRegex)[2] %>% ymd_hms() %>% strftime(format = "%y-%m-%dT%H:%M:%S%z", tz = "UTC")
Sample output:
"19-09-26T10:45:00+0000"
Expected output:
"19-09-26T10:45:00+00:00"
Is there a simple way to do this without parsing it manually? %z produces 0000, but I need a colon there.
From Wikipedia (emphasis mine):
The UTC offset is the difference in hours and minutes from Coordinated Universal Time (UTC, or GMT) for a particular place and date. It is generally shown in the format ±[hh]:[mm], ±[hh][mm], or ±[hh]. So if the time being described is one hour ahead of UTC (such as the time in Berlin during the winter), the UTC offset would be "+01:00", "+0100", or simply "+01".
±HH:MM is just one way to format time offsets, the others being ±HHMM and ±HH, so your output's offset conforms to ISO 8601. (Note that ISO 8601 proper also expects a four-digit year, i.e. %Y rather than %y.)
We can use a regex to achieve your desired output, using sub:
x <- "19-09-26T10:45:00+0000"
sub("(.*\\+)(\\d{2})(\\d{2})", "\\1\\2:\\3", x)
#[1] "19-09-26T10:45:00+00:00"

Import separate date and time (hh:mm) excel columns, to use for time elapsed calculation

Newbie here, first post (please be gentle). I have been trying to resolve this for several hours, so finally decided it was time to ask for advice.
I have a large spreadsheet which I am importing with readxl. It contains one column with the date (format dd/mm/yyyy) and several time columns in hh:mm format, as can be seen in the Excel screenshot.
Essentially I want to be able to import both time and date columns and combine them, so that I can then do some other calculations, like time elapsed.
If I import letting R guess the column types, it converts the times to POSIXct, but these then have a date in 1899 attached to them (see the R screenshot).
If I force readxl to assign the time column to numeric, I get a decimal (e.g. 0.315972222 for 07:35), which I then tried converting using syntax similar to
format(as.POSIXct(Sys.Date() + 0.315972222), "%Y-%m-%d %H:%M:%S", tz="UTC")
i.e.
df$datetime <- format(as.POSIXct(df$date + df$time), "%Y-%m-%d %H:%M", tz="UTC")
which results in the correct date, but with a time of 00:00, not the time that was passed in.
I have tried searching here and found posts that are not quite the same question (e.g. Combining date and time columns into dd/mm/yyyy hh:mm), and have read widely, including about lubridate, but as I'm only 6 months into R, I find some explanations a bit cryptic.
Suggestions or signposting appreciated (if there are solutions I haven't found).
If you subtract the number of days between 1899-12-30 and 1970-01-01 from the Excel numeric value and then multiply that (shifted) value by 86400 (seconds per day), you should come close to the number of seconds since the start of 1970. You could then convert to POSIXct with as.POSIXct(x, origin = "1970-01-01"). That does seem to be "the hard way", however.
It would be far easier, and probably more accurate, to convert the date-times to YYYY-MM-DD H:M:S format in Excel and then export as CSV to be imported into R as text. There is a "POSIXct" colClasses argument to read.csv, although it doesn't handle separate columns of date and time. For that you would be advised to import as character values and then paste the dates and times together. Then watch your format strings for as.POSIXct: the dd/mm/yyyy "format" would be specified by "%d/%m/%Y".
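For the force-to-numeric route, the missing piece in the question's code is the same 86,400 factor: the day fraction has to be scaled to seconds before adding it to the POSIXct date. A minimal sketch, with placeholder column names:
# assumed: df$date imported as POSIXct at midnight, and the time columns forced to numeric day fractions
df$datetime <- df$date + round(df$time * 86400)          # 0.315972222 * 86400 = 27300 s, i.e. 07:35:00
format(df$datetime, "%Y-%m-%d %H:%M", tz = "UTC")

# elapsed time between two such combined columns
difftime(df$datetime_end, df$datetime, units = "mins")   # datetime_end: another assumed combined column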

Dealing with twitter timestamps in R

I've got a dataset with tweets and the information Twitter provides about them. I need to transform the dates from the given format into one I can understand properly (preferably using a function where I can choose the format, since I might need to select tweets by day of the week, time of day, or anything like that) using R. I'm just starting to learn the language.
The format I've got the dates in is:
1420121295000
1420121298000
I've researched a bit before asking and tried functions like as.POSIXct, as.POSIXlt and others; they all gave me this error:
Error in as.POSIXct.default(date, format = "%a %b %d %H:%M:%S %z %Y", :
do not know how to convert 'date' to class "POSIXct"
The values above are epoch timestamps. Assuming they are in milliseconds since the epoch (you would have to double-check with the Twitter API), you can convert from epoch to UTC time using the anytime function from the anytime package as shown below, which returns "2015-01-01 14:08:15 UTC".
library(anytime)
anytime(1420121295000 * 0.001)  # times 0.001 to convert milliseconds to seconds
format(anytime(1420121295000 * 0.001), tz = "America/New_York", usetz = TRUE)  # convert from UTC to the US Eastern time zone
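The same conversion works in base R without any extra package, and once you have a POSIXct you can format out the day of the week or any other piece. A short sketch:
dt <- as.POSIXct(1420121295000 / 1000, origin = "1970-01-01", tz = "UTC")
dt
# [1] "2015-01-01 14:08:15 UTC"
format(dt, "%A")   # day of the week, e.g. "Thursday"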

Interconverting POSIXct and numeric in R

I'm importing data from Excel and then trying to manipulate dates and times in R and it's giving me SUCH A HEADACHE. In the Excel file, one column contained a date and time, and another column contained a different time for that same day. The data in R looks basically like this example:
mydata <- data.frame(DateTime1 = as.POSIXct(c("2014-12-13 04:56:00",
"2014-12-13 09:30:00",
"2014-12-13 11:30:00",
"2014-12-13 13:30:00"),
origin = "1970-01-01", tz = "GMT"),
Time2 = c(0.209, 0.209, 0.715, 0.715))
I'd like to have a new column in POSIXct format with the date and the 2nd time, and I can't get that to work. I've tried:
mydata$DateTime2 <- as.POSIXct(as.numeric(as.Date(mydata$DateTime1)
+ mydata$Time2), origin = "1970-01-01",
tz = "GMT")
but that gives me dates and times close to 1/1/1970.
This is more convoluted, but one thing that has worked in other similar situations that I've also tried is:
library(lubridate)
mydata$DateTime2 <- ymd_hms(format(as.POSIXct(as.Date(mydata$DateTime1) +
mydata$Time2,
origin = "1899-12-30", tz = "GMT")))
but that gives me dates and times that are off by 8 hours. That time difference makes me think that the problem is the time zone since I'm on Pacific Standard Time, but I set it to GMT both in the input data and when trying to convert! What gives? I'm hesitant to just add 8 hours to everything because of daylight savings time complications.
Really, both of the attempts I'm listing here seem to have problems with interconversion, i.e., if you start with a POSIXct object, convert it to numeric, and then convert it back to POSIXct, you should end up back where you started, and you don't. Similarly, if you start with the time zone set to GMT and then add something that is also set to GMT, things shouldn't mysteriously get converted to the system time zone.
Advice?
I found an answer based on Chris Holbrook's answer here: How do you convert dates/times from one time zone to another in R?
This worked:
mydata$DateTime2 <- as.POSIXct(as.Date(mydata$DateTime1) +
mydata$Time2)
attributes(mydata$DateTime2)$tzone <- "GMT"
@MichaelChirico and I were correct that the time zone was the problem. I'm still not sure why, but the time zone for DateTime2 was apparently PST. It didn't list "PST" when I checked str(mydata$DateTime2), but based on the time difference, it must in fact have been PST until I set the attribute. Crazy. It did that even though DateTime1 was GMT.
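An alternative that avoids the round trip through Date (and with it the time-zone surprise) is to truncate the POSIXct to midnight and add the seconds directly. A sketch, assuming Time2 is a fraction of a day as in the example data:
# truncate to midnight in the GMT zone carried by DateTime1, then add the day fraction as seconds
mydata$DateTime2 <- trunc(mydata$DateTime1, units = "days") + round(mydata$Time2 * 86400)
# 0.209 * 86400 is about 18058 s (~05:01); 0.715 * 86400 is 61776 s (17:09:36)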

R: Posix (Unix) Time Crazy Conversion

Unix time is 1435617000.
as.Date(1435617000,origin="01-01-1970")
[1] "3930586-11-23"
Which is wrong. I'm trying to get the correct date, which, per an epoch converter, is Mon, 29 Jun 2015 22:30:00 GMT.
How do I get R to tell me the month, day, year, hour, minute & second? Thank you.
The reason this happens is that as.Date converts its argument to a Date object, which only represents days. In this case you don't want a Date but a POSIXct object, because your input x carries time-of-day information that as.Date cannot handle. Another pitfall, even with the right function, is not specifying the correct time zone with the tz argument (except in the case where your time zone is the same as the original one).
The following code does the job.
x <- 1435617000
as.POSIXct(x, origin = "1970-01-01", tz ="GMT")
[1] "2015-06-29 22:30:00 GMT"
Use as.Date
In case you want only the date but have a full Unix time like x, just divide by 86400 (the number of seconds in a day) to get the right date.
as.Date(x/86400L, origin = "1970-01-01")
[1] "2015-06-29"
Another important detail
The origin argument has to be supplied as YYYY-MM-DD (i.e. "1970-01-01"), not DD-MM-YYYY as in your code; as.Date only reliably parses the ISO-style form.
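To get the month, day, year, hour, minute and second asked for, you can format the POSIXct or pull the components from a POSIXlt. A small sketch:
dt <- as.POSIXct(1435617000, origin = "1970-01-01", tz = "GMT")
format(dt, "%m %d %Y %H %M %S")
# [1] "06 29 2015 22 30 00"
lt <- as.POSIXlt(dt)
c(month = lt$mon + 1, day = lt$mday, year = lt$year + 1900,
  hour = lt$hour, min = lt$min, sec = lt$sec)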
