I have several rows in my table like below:
row1: abc changed on 12 November, 2008 11:30 AM and its abc..region1
row2: defg updated 14 January, 2012 08:20 PM ......region2
row3: ghijkl corrected by 18 august, 2013 9:30 AM ..something..region3
My requirement is as follows:
All the above dates are in EST time zone and date format is exactly as above and does not change.
I want to update the dates in these rows from EST to different time zones as per the region in that row, and the format should be changed to something like 12 dec 2016 7:30 AM.
So the query I framed is (taking row1 as example) as below:
select regexp_replace(
'abc changed on 12 November, 2008 11:30 AM and its abc..region1',
'([0-9]{2})([[:blank:]])(January|February|March|April|May|June|July|August|September|October|November|December)(,[[:blank:]])([0-9]{4})([[:blank:]])([0-9]{2}:[0-9]{2})([[:blank:]])(AM|PM)','\1-\3-\5 \7 \9',1,0,'i')
from dual;
output:
abc changed on 12-November-2008 11:30 AM and its abc..region1
So far I am happy with this query, because I get a string containing the formatted date. Even though this is not the final date format, I can pass this date to a function that converts it according to the region, does some processing, and finally returns a date type. As a first step, I wrapped the replacement string in substr:
select regexp_replace(
'abc changed on 12 November, 2008 11:30 AM and its abc..region1',
'([0-9]{2})([[:blank:]])(January|February|March|April|May|June|July|August|September|October|November|December)(,[[:blank:]])([0-9]{4})([[:blank:]])([0-9]{2}:[0-9]{2})([[:blank:]])(AM|PM)',
substr('\1-\3-\5 \7 \9',1),
1,0,'i')
from dual;
output:
abc changed on 12-November-2008 11:30 AM and its abc..region1 --> works fine up to here
Now I am adding to_date to convert the date string type to real date
type to do some processing on it:
select regexp_replace(
'abc changed on 12 November, 2008 11:30 AM and its abc..region1',
'([0-9]{2})([[:blank:]])(January|February|March|April|May|June|July|August|September|October|November|December)(,[[:blank:]])([0-9]{4})([[:blank:]])([0-9]{2}:[0-9]{2})([[:blank:]])(AM|PM)',
to_date(substr('\1-\3-\5 \7 \9',1),'dd-mon-yyyy HH:MI AM'),
1,0,'i')
from dual;
This query is giving me an error:
ORA-01858: a non-numeric character found where a numeric was expected
I checked whether wrong parameters were being passed to
to_date(), and fired the query below, but it worked fine.
Select to_date('12-November-2008 11:30 AM','dd-mon-yyyy HH:MI AM')
from dual;
output:
12-Nov-2008
(I am not worried that the time portion is not displayed, because internally it is still part of the date.)
To avoid confusion I have numbered the substrings of the regular expression above:
([0-9]{2})-->1 ([[:blank:]])-->2
(January|February|March|April|May|June|July|August|September|October|November|December)-->3
(,[[:blank:]])-->4 ([0-9]{4})-->5 ([[:blank:]])-->6
([0-9]{2}:[0-9]{2})-->7 ([[:blank:]])-->8 (AM|PM)-->9
select regexp_replace(
'abc changed on 12 November, 2008 11:30 AM and its abc..region1',
'([0-9]{2})([[:blank:]])(January|February|March|April|May|June|July|August|September|October|November|December)(,[[:blank:]])([0-9]{4})([[:blank:]])([0-9]{2}:[0-9]{2})([[:blank:]])(AM|PM)','\1-\3-\5 \7 \9',1,0,'i')
from dual;
Assuming your string always has the date in it in that particular format (and that there are no invalid dates etc etc) then the following should work for you:
WITH sample_data AS (SELECT ' the date is 12 November, 2008 11:30 AM' str FROM dual UNION ALL
SELECT 'Here''s a date of 1 March, 2015 1:43 pm' str FROM dual UNION ALL
SELECT '1 February,2016 9:43 AM' str FROM dual UNION ALL
SELECT 'And again it''s 21 May, 2016 9:43 AM and a little bit extra' str FROM dual)
SELECT str,
to_date(regexp_replace(str, '^.*?([[:digit:]]{1,2} [[:alpha:]]{3,9}, ?[[:digit:]]{4} [[:digit:]]{1,2}\:[[:digit:]]{2} (A|P)M).*$', '\1', 1, 1, 'i'), 'dd Month yyyy, hh:mi am') dt
FROM sample_data;
STR                                                        DT
---------------------------------------------------------- -------------------
the date is 12 November, 2008 11:30 AM                     12/11/2008 11:30:00
Here's a date of 1 March, 2015 1:43 pm                     01/03/2015 13:43:00
1 February,2016 9:43 AM                                    01/02/2016 09:43:00
And again it's 21 May, 2016 9:43 AM and a little bit extra 21/05/2016 09:43:00
The regular expression can be broken down as follows:
^.*? - match any character (except new line) from the start of the line as few times as possible, which may be 0 or more.
([[:digit:]]{1,2} [[:alpha:]]{3,9}, ?[[:digit:]]{4} [[:digit:]]{1,2}\:[[:digit:]]{2} (A|P)M) - this is the pattern we're looking for, and which we'll use to replace the whole string with (this is aliased as \1, which we can then pass into the replace string parameter).
.*$ - match any character up to the end of the string
The second part of the pattern can be further broken down as:
[[:digit:]]{1,2} - one or two digits
- a single space character
[[:alpha:]]{3,9} - three to nine letters (upper or lower case)
, ? - a comma followed by 0 or 1 spaces
[[:digit:]]{4} - four digits
- a single space character
[[:digit:]]{1,2} - one or two digits
\: - a single colon character
[[:digit:]]{2} - two digits
- a single space character
(A|P)M - either the letter A or P followed by an M
This should do the trick for you:
WITH sample_data AS (SELECT 'abc changed on 12 November, 2008 11:30 AM and its abc..region1' str FROM dual UNION ALL
SELECT 'defg updated 14 January, 2012 08:20 PM ......region2' str FROM dual UNION ALL
SELECT 'ghijkl corrected by 18 august, 2013 9:30 AM ..something..region3' str FROM dual)
SELECT str,
regexp_replace(str,
'(^.*?)(([[:digit:]]{1,2}) (January|February|March|April|May|June|July|August|September|October|November|December), ?([[:digit:]]{4} [[:digit:]]{1,2}\:[[:digit:]]{2} (A|P)M))(.*$)',
'\1\3-\4-\5\7', 1, 1, 'i') dt
FROM sample_data;
STR                                                                 DT
------------------------------------------------------------------- --------------------------------------------------------------------------------
abc changed on 12 November, 2008 11:30 AM and its abc..region1      abc changed on 12-November-2008 11:30 AM and its abc..region1
defg updated 14 January, 2012 08:20 PM ......region2                defg updated 14-January-2012 08:20 PM ......region2
ghijkl corrected by 18 august, 2013 9:30 AM ..something..region3    ghijkl corrected by 18-august-2013 9:30 AM ..something..region3
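For comparison, here is a minimal Python sketch of the whole requirement (find the date, treat it as EST, convert it for the row's region, and write it back in the shorter format). It also shows why the to_date attempt in the question fails: Oracle evaluates to_date on the literal text '\1-\3-\5 \7 \9' before regexp_replace ever runs, so the conversion has to happen on the matched text, not on the backreference string. In Python that is done with a callback passed to re.sub. The REGION_TZ mapping and its offsets are purely hypothetical, since the question never says which zone each region uses, and EST is modelled as a fixed UTC-5 offset that ignores daylight saving.

```python
import re
from datetime import datetime, timedelta, timezone

# Assumption: EST as a fixed UTC-5 offset (no daylight saving handling).
EST = timezone(timedelta(hours=-5))

# Hypothetical region-to-zone mapping; replace with the real zones.
REGION_TZ = {
    "region1": timezone.utc,
    "region2": timezone(timedelta(hours=1)),
    "region3": timezone(timedelta(hours=5, minutes=30)),
}

# Same shape as the dates in the question: "12 November, 2008 11:30 AM".
DATE_RE = re.compile(r"\d{1,2} [A-Za-z]+, \d{4} \d{1,2}:\d{2} [AP]M", re.IGNORECASE)


def convert_row(row):
    """Rewrite the EST date inside `row` in the zone of the row's region."""
    region = re.search(r"region\d+", row).group(0)
    target = REGION_TZ[region]

    def reformat(match):
        # The callback receives the matched date text, so parsing happens
        # per match instead of on an unexpanded backreference string.
        est_time = datetime.strptime(match.group(0), "%d %B, %Y %I:%M %p")
        local = est_time.replace(tzinfo=EST).astimezone(target)
        clock = local.strftime("%I:%M %p").lstrip("0")
        return f"{local.day} {local.strftime('%b').lower()} {local.year} {clock}"

    return DATE_RE.sub(reformat, row)


print(convert_row("abc changed on 12 November, 2008 11:30 AM and its abc..region1"))
```

In Oracle the same idea would mean extracting the date with one expression, converting it, and concatenating the pieces back together, rather than nesting to_date inside the replacement argument.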
As we all know, date parsing in Go has its quirks*.
However, I have now come up against needing to parse a datetime string in CCYY-MM-DDThh:mm:ss[.sss...] to a valid date in Go.
This CCYY format seems to be ubiquitous in astronomy. Essentially, CC is the current century, so although we're in 2022, the century is the 21st, meaning the year in CCYY format would be 2122.
How do I parse a date string in this format, when we can't specify a coded layout?
Should I just parse in that format, and subtract one "century" e.g., 2106 becomes 2006 in the parsed datetime...?
Has anyone come up against this niche problem before?
*(I for one would never have been able to remember January 2nd, 3:04:05 PM of 2006, UTC-0700 if it wasn't the exact time of my birth! I got lucky)
The time package does not support parsing centuries. You have to handle it yourself.
Also note that a simple subtraction is not enough, as e.g. the 21st century takes place between January 1, 2001 and December 31, 2100 (the year may start with 20 or 21). If the year ends with 00, you do not have to subtract 100 years.
I would write a helper function to parse such dates:
func parse(s string) (t time.Time, err error) {
t, err = time.Parse("2006-01-02T15:04:05[.000]", s)
if err == nil && t.Year()%100 != 0 {
t = t.AddDate(-100, 0, 0)
}
return
}
Testing it:
fmt.Println(parse("2101-12-31T12:13:14[.123]"))
fmt.Println(parse("2122-10-29T12:13:14[.123]"))
fmt.Println(parse("2100-12-31T12:13:14[.123]"))
fmt.Println(parse("2201-12-31T12:13:14[.123]"))
Which outputs (try it on the Go Playground):
2001-12-31 12:13:14.123 +0000 UTC <nil>
2022-10-29 12:13:14.123 +0000 UTC <nil>
2100-12-31 12:13:14.123 +0000 UTC <nil>
2101-12-31 12:13:14.123 +0000 UTC <nil>
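The same rule (subtract one century unless the year ends in 00) can be sketched outside Go as well. The Python version below is only an illustration of that rule under the question's reading of CCYY; the name parse_ccyy is hypothetical, and it takes the fractional seconds without the square brackets, treating them as optional.

```python
from datetime import datetime


def parse_ccyy(value):
    """Parse CCYY-MM-DDThh:mm:ss[.sss] and map the century number to a year."""
    # Fractional seconds are optional, so pick the layout that matches.
    layout = "%Y-%m-%dT%H:%M:%S.%f" if "." in value else "%Y-%m-%dT%H:%M:%S"
    parsed = datetime.strptime(value, layout)
    # Years ending in 00 close their century, so they are left unchanged.
    if parsed.year % 100 != 0:
        parsed = parsed.replace(year=parsed.year - 100)
    return parsed


for text in ("2101-12-31T12:13:14.123", "2122-10-29T12:13:14.123",
             "2100-12-31T12:13:14", "2201-12-31T12:13:14"):
    print(text, "->", parse_ccyy(text))
```

Replacing the year is safe for 29 February here, because a year that does not end in 00 is a leap year exactly when the year 100 earlier is.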
As for remembering the layout's time:
January 2, 15:04:05, 2006 (zone: -0700) is a common order in the US, and in this representation parts are in increasing numerical order: January is month 1, 15 hour is 3PM, year 2006 is 6. So the ordinals are 1, 2, 3, 4, 5, 6, 7.
I for one would never have been able to remember January 2nd, 3:04:05 PM of 2006, UTC-0700 if it wasn't the exact time of my birth! I got lucky.
The reason for the Go time package layout is that it is derived from the Unix (and Unix-like) date command format. For example, on Linux,
$ date
Fri Apr 15 08:20:43 AM EDT 2022
$
Now, count from left to right,
Month = 1
Day = 2
Hour = 3 (or 15 = 12 + 3)
Minute = 4
Second = 5
Year = 6
Note: Rob Pike is an author of The Unix Programming Environment
I have a dataset that looks like this:
datetime count
18:28:20.602 UTC DEC 08 2016 1
20:42:32.017 UTC DEC 08 2016 5
15:33:40.691 UTC DEC 08 2016 1
17:11:54.008 UTC DEC 08 2016 3
20:28:57.861 UTC DEC 08 2016 0
.
.
.
.
The datetime column is in the string format. I'm having difficulty in converting it to a timestamp.
How do I write an Impala/Hive query so that I get the data between '18:28:00.000 UTC DEC 08 2016' and '18:33:00.000 UTC DEC 08 2016'?
With Hive:
cast(from_unixtime(unix_timestamp(SHITTY_FORMAT, 'HH:mm:ss.SSS zzz MMM dd yyyy'), 'yyyy-MM-dd HH:mm:ss.SSS') as Timestamp)
...will translate your shitty String format into a UNIX timestamp, then into String standard format (in local timezone because that's the Hive convention), then into a Timestamp.
There is no easier way, unfortunately. And you may have some edge cases because of the 1h overlap in summer/winter times.
Source: the Hive documentation, of course...
With Impala (which does not support the zzz format modifier):
cast(from_unixtime(unix_timestamp(regexp_replace(SHITTY_FORMAT, ' UTC ', ' '), 'HH:mm:ss.SSS MMM dd yyyy'), 'yyyy-MM-dd HH:mm:ss.SSS') as Timestamp)
...will translate your shitty String format into a UNIX timestamp, assuming that all your inputs are in UTC, then into String standard format (in UTC timezone because that's the Impala convention), then into a Timestamp.
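To check the expected result of such a query outside the cluster, the same parse-then-filter logic can be sketched in Python. This is only a cross-check under the assumption that every value is in UTC, as the sample data suggests; the rows list simply repeats the sample from the question, and parse_utc is a hypothetical helper name.

```python
from datetime import datetime, timezone

# "UTC" is matched as literal text; %b accepts "DEC" case-insensitively.
LAYOUT = "%H:%M:%S.%f UTC %b %d %Y"


def parse_utc(text):
    """Turn '18:28:20.602 UTC DEC 08 2016' into an aware UTC datetime."""
    return datetime.strptime(text, LAYOUT).replace(tzinfo=timezone.utc)


rows = [
    ("18:28:20.602 UTC DEC 08 2016", 1),
    ("20:42:32.017 UTC DEC 08 2016", 5),
    ("15:33:40.691 UTC DEC 08 2016", 1),
    ("17:11:54.008 UTC DEC 08 2016", 3),
    ("20:28:57.861 UTC DEC 08 2016", 0),
]

start = parse_utc("18:28:00.000 UTC DEC 08 2016")
end = parse_utc("18:33:00.000 UTC DEC 08 2016")

# Keep only the rows whose timestamp falls inside the requested window.
selected = [(text, count) for text, count in rows if start <= parse_utc(text) <= end]
print(selected)
```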
I have a huge dataframe that looks something like this:
Insider Trading Relationship Date \
SEC Form 4
Nov 16 04:06 PM Silverman Gene Director Nov 14
Oct 27 07:00 AM RAKOLTA JOHN JR Director Oct 26
Nov 16 04:09 PM LEIGHTON F THOMSON Chief Executive Officer Nov 15
Nov 02 04:20 PM Blumofe Robert EVP Platform Nov 01
Oct 28 04:03 PM MCCONNELL RICK M President Prods & Development Oct 28
I'm trying to change the index dtype into a datetime dtype via this code
pd.to_datetime(df2.index, format = '%b %d %I:%M %p')
but it's yielding the error:
Traceback (most recent call last):
File "<pyshell#126>", line 1, in <module>
pd.to_datetime(df2.index, format = '%b %d %I:%M %p')
File "C:\Python27\lib\site-packages\pandas\util\decorators.py", line 91, in wrapper
return func(*args, **kwargs)
File "C:\Python27\lib\site-packages\pandas\tseries\tools.py", line 420, in to_datetime
return _convert_listlike(arg, box, format, name=arg.name)
File "C:\Python27\lib\site-packages\pandas\tseries\tools.py", line 407, in _convert_listlike
raise e
Is there a way I can find the index of where the error is occurring?
It seems I can set errors to coerce, which would just return NaT as the date, but I would like to avoid that.
Thanks!
You are right; just finish the logic. Set errors to coerce, then filter the original index on the positions where the converted result isnull() to find every entry that failed to parse.
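A minimal sketch of that approach, assuming a reasonably recent pandas. The labels below are made up for illustration (including the deliberately bad 'not a date' entry) and stand in for df2.index from the question.

```python
import pandas as pd

# Hypothetical stand-in for df2.index, with one entry that cannot be parsed.
labels = pd.Index(["Nov 16 04:06 PM", "Oct 27 07:00 AM", "not a date", "Nov 02 04:20 PM"])

# errors="coerce" turns unparseable entries into NaT instead of raising.
parsed = pd.to_datetime(labels, format="%b %d %I:%M %p", errors="coerce")

# The NaT positions point at the entries that broke the original call.
bad_mask = parsed.isna()
bad_positions = [pos for pos, is_bad in enumerate(bad_mask) if is_bad]
bad_labels = list(labels[bad_mask])

print(bad_positions)
print(bad_labels)
```

The coerced result is only used for diagnosis here; once the offending labels are fixed or dropped, the original strict call can be run again.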
As a beginner, creating timestamps or formatted dates ended up being a little more of a challenge than I would have expected. What are some basic examples for reference?
Ultimately you want to review the datetime documentation and become familiar with the formatting variables, but here are some examples to get you started:
import datetime
print('Timestamp: {:%Y-%m-%d %H:%M:%S}'.format(datetime.datetime.now()))
print('Timestamp: {:%Y-%b-%d %H:%M:%S}'.format(datetime.datetime.now()))
print('Date now: %s' % datetime.datetime.now())
print('Date today: %s' % datetime.date.today())
today = datetime.date.today()
print("Today's date is {:%b, %d %Y}".format(today))
schedule = '{:%b, %d %Y}'.format(today) + ' - 6 PM to 10 PM Pacific'
schedule2 = '{:%B, %d %Y}'.format(today) + ' - 1 PM to 6 PM Central'
print('Maintenance: %s' % schedule)
print('Maintenance: %s' % schedule2)
The output:
Timestamp: 2014-10-18 21:31:12
Timestamp: 2014-Oct-18 21:31:12
Date now: 2014-10-18 21:31:12.318340
Date today: 2014-10-18
Today's date is Oct, 18 2014
Maintenance: Oct, 18 2014 - 6 PM to 10 PM Pacific
Maintenance: October, 18 2014 - 1 PM to 6 PM Central
Reference link: https://docs.python.org/3.4/library/datetime.html#strftime-strptime-behavior
>>> import time
>>> print(time.strftime('%a %H:%M:%S'))
Mon 06:23:14
from datetime import datetime
dt = datetime.now() # for date and time
ts = datetime.timestamp(dt) # for timestamp
print("Date and time is:", dt)
print("Timestamp is:", ts)
You might want to check string to datetime operations for formatting.
from datetime import datetime
datetime_str = '09/19/18 13:55:26'
datetime_object = datetime.strptime(datetime_str, '%m/%d/%y %H:%M:%S')
print(type(datetime_object))
print(datetime_object) # printed in default format
Output:
<class 'datetime.datetime'>
2018-09-19 13:55:26
My file contains two patterns, QUARTERDATE and FILENAME, each followed by a value, as in the example below.
I need to rename the file to FILENAME_QUARTERDATE.
My file (myfile.txt) looks like this:
QUARTERDATE: 03/31/14 - 06/29/14
FILENAME : LEAD
field1 field2
34567
20.0 5,678
20.0 5,678
20.0 5,678
20.0 5,678
20.0 5,678
I want the file name to be LEAD_201402.txt.
The date range in the file is for Quarter 2, so I have written it as 201402.
Thanks in advance for the replies.
newname=$(awk '/QUARTERDATE/ { split($4, d, "/");
quarter=sprintf("%04d%02d", 2000+d[3], int((d[1]-1)/3)+1); }
/FILENAME/ { fn = $3; print fn "_" quarter; exit; }' "$file")
mv "$file" "${newname}.txt"
How is a quarter defined?
As noted in comments to the main question, the problem is as yet ill-defined.
What data would appear in the previous quarter's QUARTERDATE line? Could Q1 ever start with a date in December of the previous year? Could the end date of Q2 ever be in July (or Q1 in April, or Q3 in October, or Q4 in January)? Since the first date of Q2 is in March, these alternatives need to be understood. Could a quarter ever start early and end late simultaneously (a 14 week quarter)?
To which the response was:
QUARTERDATE of Q2 will start as 1st Monday of April and end as last Sunday of June.
Which triggered a counter-response:
2014-03-31 is a Monday, but hardly a Monday in April. What this mainly means is that your definition of a quarter is, as yet, not clear. For example, next year, 2015-03-30 is a Monday, but 'the first Monday in April' is 2015-04-06. The last Sunday in March 2015 is 2015-03-29. So which quarter does the week (Mon) 2015-03-30 to (Sun) 2015-04-05 belong to, and why? If you don't know (both how and why), we can't help you reliably.
Plausible working hypothesis
The lessons of Y2K have been forgotten already (why else are two digits used for the year, dammit!).
Quarters run for an integral number of weeks.
Quarters start on a Monday and end on a Sunday.
Quarters remain aligned with the calendar quarters, rather than drifting around the year. (There are 13 weeks in 91 days, and 4 such quarters in a year, but there's a single extra day in an ordinary year and two extra in a leap year, which mean that occasionally you will get a 14-week quarter, to ensure things stay aligned.)
The date for the first date in a quarter will be near 1st January, 1st April, 1st July or 1st October, but the month might be December, March (as in the question), June or September.
The date for the last date in a quarter will be near 31st March, 30th June, 30th September, 31st December, but the month might be April, July, October or January.
By adding 1 modulo 12 (values in the range 1..12, not 0..11) to the start month, you should end up with a month firmly in the calendar quarter.
By subtracting 1 modulo 12 (values in the range 1..12 again) from the end month, you should end up with a month firmly in the calendar quarter.
If the data is valid, the 'start + 1' and 'end - 1' months should be in the same quarter.
The early year might be off-by-one if the start date is in December (but that indicates Q1 of the next year).
The end year might be off-by-one if the end date is in January (but that indicates Q4 of the prior year).
More resilient code
Despite the description above, it is possible to write code that detects the quarter despite any or all of the idiosyncrasies of the quarter start and end dates. This code borrows a little from Barmar's answer, but the algorithm is more resilient to the vagaries of the calendar and the quarter start and end dates.
#!/bin/sh
awk '/QUARTERDATE/ {
split($2, b, "/")
split($4, e, "/")
if (b[1] == 12) { q = 1; y = e[3] }
else if (e[1] == 1) { q = 4; y = b[3] }
else
{
if (b[3] != e[3]) {
print "Year mismatch (" $2 " vs " $4 ") in file " FILENAME
exit 1
}
m = int((b[1] + e[1]) / 2)
q = int((m - 1) / 3) + 1
y = e[3]
}
quarter = sprintf("%.4d%.2d", y + 2000, q)
}
/FILENAME/ {
print $3 "_" quarter
# exit
}' "$@"
The calculation for m adds the start month plus one to the end month minus one and then does integer division by two. With the extreme cases already taken care of, this always yields a month number that is in the correct quarter.
The comment in front of the exit associated with FILENAME allows testing more easily. When processing each file separately, as in Barmar's example, that exit is an important optimization. Note that the error message gives an empty file name if the input comes from standard input. (Offhand, I'm not sure how to print the error message to standard error rather than standard output, other than by a platform-specific technique such as print "message" > "/dev/stderr" or print "message" > "/dev/fd/2".)
Given this sample input data (semi-plausible start and end dates for 6 quarters from 2014Q1 through 2015Q2):
QUARTERDATE: 12/30/13 - 03/30/14
FILENAME : LEAD
QUARTERDATE: 03/31/14 - 06/29/14
FILENAME : LEAD
QUARTERDATE: 06/30/14 - 09/28/14
FILENAME : LEAD
QUARTERDATE: 09/29/14 - 12/28/14
FILENAME : LEAD
QUARTERDATE: 12/29/14 - 03/29/15
FILENAME : LEAD
QUARTERDATE: 03/30/15 - 06/29/15
FILENAME : LEAD
The output from this script is:
LEAD_201401
LEAD_201402
LEAD_201403
LEAD_201404
LEAD_201501
LEAD_201502
You can juggle the start and end dates of the quarters within reason and you should still get the required output. But always be wary of calendrical calculations; they are almost invariably harder than you expect.
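For readers who want to test the quarter rule in isolation, here is a small Python sketch of the same decision logic as the awk script above. The function name quarter_label is hypothetical; it takes the two MM/DD/YY strings from a QUARTERDATE line and, like the script, assumes every year is 20YY.

```python
def quarter_label(start, end):
    """Return YYYYQQ for a quarter given its MM/DD/YY start and end dates."""
    start_month, _, start_year = (int(part) for part in start.split("/"))
    end_month, _, end_year = (int(part) for part in end.split("/"))

    if start_month == 12:
        # Starts in late December: this is Q1 of the end date's year.
        quarter, year = 1, end_year
    elif end_month == 1:
        # Ends in early January: this is Q4 of the start date's year.
        quarter, year = 4, start_year
    else:
        if start_year != end_year:
            raise ValueError(f"Year mismatch ({start} vs {end})")
        # The midpoint month always lies inside the calendar quarter.
        middle = (start_month + end_month) // 2
        quarter = (middle - 1) // 3 + 1
        year = end_year

    return f"{year + 2000:04d}{quarter:02d}"


samples = [
    ("12/30/13", "03/30/14"),
    ("03/31/14", "06/29/14"),
    ("06/30/14", "09/28/14"),
    ("09/29/14", "12/28/14"),
    ("12/29/14", "03/29/15"),
    ("03/30/15", "06/29/15"),
]
for start, end in samples:
    print("LEAD_" + quarter_label(start, end))
```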