Having trouble converting df index to datetime object - datetime

So this is my dataframe
Ticker Owner \
SEC Form 4
Nov 09 02:19 PM HSY HERSHEY TRUST
Nov 09 02:05 PM HSY HERSHEY TRUST CO
Nov 09 02:03 PM WDFC PITTARD DANIEL E
Nov 09 01:34 PM IMGN Enyedy Mark J
Nov 09 01:25 PM ORI ZUCARO ALDO C
I'm trying to convert the index (SEC Form 4) into a datetime object so that I can use that object's methods. However, I like the current display format of the date (Nov 09 02:19 PM) and don't want to replace it with something like 2016-11-09 14:19.
pd.to_datetime(df.index, format = '%b %d')
pd.to_datetime(df.index, format = '%b %d %I:%M %p' )
I experimented with these format parameters, but they change the displayed date into something like 2016-11-09 14:19:00, which is not the format I want.
I even tried to see if there was a datetime dtype that I could simply convert to (so I wouldn't have to change how the dates are displayed), but I had no luck finding one.
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.dtypes.html
Thank you.

Perhaps the problem is that the year is missing, since it is not specified in the data. Here is a possible solution.
import io
import pandas as pd

zz = """"SEC Form 4" Ticker Owner
"Nov 09 02:19 PM" "HSY" "HERSHEY TRUST"
"Nov 09 02:05 PM" HSY "HERSHEY TRUST CO"
"Nov 09 02:03 PM" WDFC "PITTARD DANIEL E"
"Nov 09 01:34 PM" IMGN "Enyedy Mark J"
"Nov 09 01:25 PM" ORI "ZUCARO ALDO C"
"""
# sep=r'\s+' splits on whitespace (delim_whitespace is deprecated in recent pandas)
df = pd.read_table(io.StringIO(zz), sep=r'\s+')
df.set_index('SEC Form 4', inplace=True)
# Adding the missing year
df.index = '2016 ' + df.index
# There is no need to detail the expected format
df.index = pd.to_datetime(df.index)
print(df.index.dtype)
print(df)
# datetime64[ns]
# Ticker Owner
# 2016-11-09 14:19:00 HSY HERSHEY TRUST
# 2016-11-09 14:05:00 HSY HERSHEY TRUST CO
# 2016-11-09 14:03:00 WDFC PITTARD DANIEL E
# 2016-11-09 13:34:00 IMGN Enyedy Mark J
# 2016-11-09 13:25:00 ORI ZUCARO ALDO C
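The question also asked to keep the original display style. A datetime index always prints in ISO form, but as a minimal sketch (assuming the year is 2016, as above, and an English locale for month names), strftime can produce strings in the original style whenever they are needed for display, while the parsed index keeps its datetime methods. The variable names raw, parsed and display are only illustrative.

```python
import pandas as pd

# Hypothetical sample mirroring two of the index strings from the question
raw = pd.Index(["Nov 09 02:19 PM", "Nov 09 02:05 PM"])

# Prepend the assumed year (2016) and parse with an explicit format
parsed = pd.to_datetime("2016 " + raw, format="%Y %b %d %I:%M %p")

# The parsed index supports datetime accessors such as .hour
print(parsed.hour.tolist())

# strftime returns plain strings in the original display style
display = parsed.strftime("%b %d %I:%M %p")
print(display.tolist())
```

Note that the result of strftime is an index of strings again, so it is best kept as a separate display column or label rather than as a replacement for the datetime index.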

Related

Extract time stamps from string and convert to R POSIXct object

Currently, my dataset has a time variable (factor) in the following format:
weekday month day hour min seconds +0000 year
I don't know what the "+0000" field is but all observations have this. For example:
"Tues Feb 02 11:05:21 +0000 2018"
"Mon Jun 12 06:21:50 +0000 2017"
"Wed Aug 01 11:24:08 +0000 2018"
I want to convert these values to POSIXlt or POSIXct objects (year-month-day hour:min:sec) and make them numeric. Currently, using as.numeric(as.character(time-variable)) outputs incorrect values.
Thank you for the great responses! I really appreciate it.
I'm not sure how to reproduce the transition from factor to character, but starting from a character string, this code should work:
t <- unlist(strsplit(as.character("Tues Feb 02 11:05:21 +0000 2018")," "))
strptime(paste(t[6],t[2],t[3], t[4]),format='%Y %b %d %H:%M:%S')
PS: More on date formats and conversion: https://www.stat.berkeley.edu/~s133/dates.html
For this problem you can get by without using lubridate. First, to extract individual dates we can use regmatches and gregexpr:
date_char <- 'Tue Feb 02 11:05:21 +0000 2018 Mon Jun 12 06:21:50 +0000 2017'
ptrn <- '([[:alpha:]]{3} [[:alpha:]]{3} [[:digit:]]{2} [[:digit:]]{2}\\:[[:digit:]]{2}\\:[[:digit:]]{2} \\+[[:digit:]]{4} [[:digit:]]{4})'
date_vec <- unlist( regmatches(date_char, gregexpr(ptrn, date_char)))
> date_vec
[1] "Tue Feb 02 11:05:21 +0000 2018" "Mon Jun 12 06:21:50 +0000 2017"
You can learn more about regular expressions here.
In the above example, the +0000 field is the UTC offset in ±HHMM form; e.g., it would be -0500 for the EST time zone. To convert to an R date-time object:
> as.POSIXct(date_vec, format = '%a %b %d %H:%M:%S %z %Y', tz = 'UTC')
[1] "2018-02-02 11:05:21 UTC" "2017-06-12 06:21:50 UTC"
which is the desired output. The formats can be found here or you can use lubridate::guess_formats(). If you don't specify the tz, you'll get the output in your system's time zone (e.g. for me that would be EST). Since the offset is specified in the format, R correctly carries out the conversion.
To get numeric values, the following works:
> as.numeric(as.POSIXct(date_vec, format = '%a %b %d %H:%M:%S %z %Y', tz = 'UTC'))
[1] 1517569521 1497248510
Note: this is based on uniform string structure. In the OP there was Tues instead of Tue which wouldn't work. The above example is based on the three-letter abbreviation which is the standard reporting format.
If however, your data is a mix of different formats, you'd have to extract individual time strings (customized regexes, of course), then use lubridate::guess_formats() to get the formats and then use those to carry out the conversion.
Hope this is helpful!!

Inconsistency in abbreviated month names in non-english (danish & norwegian)

I wish to convert some dates in Norwegian to actual dates in R. I'm using readr, and it kind of works, but I stumbled upon an issue that really annoys me, and I don't know how to get around it.
Here is an illustration of my problem:
> parse_date(c("29. mai 2017", "29. sep 2017"), format = "%d. %b %Y", locale = locale("nn"))
Warning: 1 parsing failure.
row # A tibble: 1 x 4 col row col expected actual expected <int> <int> <chr> <chr> actual 1 2 NA date like %d. %b %Y 29. sep 2017
[1] "2017-05-29" NA
So it catches the date in May but not the one in September. It turns out that this is because the abbreviation for September in Norwegian needs a "." (sep. instead of sep), whereas the May abbreviation does not (probably because it's actually not an abbreviation ;-)):
locale("nb")
<locale>
Numbers: 123,456.78
Formats: %AD / %AT
Timezone: UTC
Encoding: UTF-8
<date_names>
Days: søndag (søn.), mandag (man.), tirsdag (tir.), onsdag (ons.), torsdag (tor.), fredag (fre.), lørdag (lør.)
Months: januar (jan.), februar (feb.), mars (mar.), april (apr.), mai (mai), juni (jun.), juli (jul.), august (aug.), september (sep.), oktober (okt.), november (nov.), desember (des.)
AM/PM: a.m./p.m.
However, it seems inconsistent that it does not require the same number of characters for all months. I also noticed that these annoying "." are not part of the abbreviations in English:
> locale("en")
<locale>
Numbers: 123,456.78
Formats: %AD / %AT
Timezone: UTC
Encoding: UTF-8
<date_names>
Days: Sunday (Sun), Monday (Mon), Tuesday (Tue), Wednesday (Wed), Thursday (Thu), Friday (Fri), Saturday (Sat)
Months: January (Jan), February (Feb), March (Mar), April (Apr), May (May), June (Jun), July (Jul), August (Aug),
September (Sep), October (Oct), November (Nov), December (Dec)
AM/PM: AM/PM
It really is terribly inconvenient, also because I believe it is somewhat rare to include the "." at all when recording dates with abbreviations (but that is just based on personal preference and experience). Any input is much appreciated.
You can edit the locale manually like this...
loc <- locale("nb")
loc$date_names$mon_ab <- substr(loc$date_names$mon_ab, 1, 3) #just take first 3 characters
parse_date(c("29. mai 2017", "29. sep 2017"), format = "%d. %b %Y", locale = loc)
[1] "2017-05-29" "2017-09-29"
A solution similar to, and inspired by, @Andrew Gustar's answer is to create your own date_names object:
loc <- locale("nb")
myNo <- date_names(mon = loc$date_names$mon,
mon_ab = substr(loc$date_names$mon_ab, 1, 3),
day = loc$date_names$day,
day_ab = substr(loc$date_names$day, 1, 3))
parse_date(c("29. mai 2017", "29. sep 2017"), format = "%d. %b %Y", locale = locale(date_names = myNo))
[1] "2017-05-29" "2017-09-29"

Difficult Date Time Conversion in R

I'm trying to separate this date/time string in R but have not been successful.
Here is an example of the strings:
"Thu Sep 28 02:11:51 +0000 2017"
"Mon Oct 02 19:22:35 +0000 2017"
What is the best way to make this tidy? I've realized this is far beyond my skills.
Try something like this:
as.POSIXct(gsub("\\+0000", '', "Thu Sep 28 02:11:51 +0000 2017"), format = "%a %b %d %H:%M:%S %Y")
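which gives "2017-09-28 02:11:51 EDT" (the zone shown is the system's local time zone, because the offset was stripped and no tz was supplied)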

How can I convert string similar to this one "May 31 2015 11:45PM" to date format in R

I am trying to convert a string similar to "May 31 2015 11:45PM" to a time format in R. How can I do that?
I tried this but it didn't do the job:
mdy_hm("May 31 2015 11:45PM")
library(lubridate)
s <- 'May 31 2015 11:45PM'
ts <- mdy_hm(s)
str(ts)
yields for me
POSIXct[1:1], format: "2015-05-31 23:45:00"

R convert following format to date class

I have a data frame consisting of two columns, Date and Text. The format of the date is somewhat unusual: Jan 09 05:44:30 +0000 2015. I want to convert it to a format such as 01/09/2015 05:44:30 or Jan/09/2015 05:44:30. My attempt worked on a single date but failed on the whole date column. Please help.
This is what I tried:
p <- "Jan 09 05:44:30 +0000 2015"
p <- sub("Jan","01",p)
p1 <- strsplit(p," ")
p2 <- unlist(p1)
append(p2,p2[5], after=2)
My data frame looks like this:
Text Date
"...some text ....." Jan 09 05:44:30 +0000 2015
"...some text ....." Jan 09 05:44:30 +0000 2015
"...some text ....." Jan 09 05:44:30 +0000 2015
"...some text ....." Jan 09 05:44:30 +0000 2015
and I want it as:
Text Date
"...some text ....." 01/09/2015 05:44:30
"...some text ....." 01/09/2015 05:44:30
"...some text ....." 01/09/2015 05:44:30
"...some text ....." 01/09/2015 05:44:30
Study help("strptime") to learn how to create the format string.
p <- "Jan 09 05:44:30 +0000 2015"
as.POSIXct(p, format="%b %d %H:%M:%S %z %Y", tz="GMT")
#[1] "2015-01-09 05:44:30 GMT"
This gives you a datetime object (and is of course vectorized). Use the format function as necessary for creating output strings with other formats if you must.
