Python v3.8.5
conda v4.9.2
Here is the code I run:
from datetime import datetime as dt
date_str = '1957 Oct 4 1928:34'
date_object = dt.strptime(date_str, "%Y %b %d %H%M:%S")
This works perfectly from the interactive interpreter:
~/code/ » python
Python 3.8.5 (default, Sep 4 2020, 07:30:14)
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from datetime import datetime as dt
>>> str = '1957 Oct 4 1928:34'
>>> date_object = dt.strptime(str, "%Y %b %d %H%M:%S")
>>> date_object
datetime.datetime(1957, 10, 4, 19, 28, 34)
>>>
But it fails when I run it from a script:
(calmenv) -----------------------------------------------------------------------------------------------
~/code/ » which python
/home/bipbip/anaconda3/envs/calmenv/bin/python
(calmenv) -----------------------------------------------------------------------------------------------
~/code » python --version
Python 3.8.5
(calmenv) -----------------------------------------------------------------------------------------------
~/code » python bin/datamanager.py --launch_tb
date_str in try: '1957 Oct 4 1928:34'
time data '1957 Oct 4 1928:34' does not match format '%Y %b %d %H%M:%S'
The script is launched in a conda environment. The version of python is the same in both cases. The conda environment is always active.
Digging around in the datetime.py file, I found out that the regex used by strptime to convert the string is not the same in both cases.
From case 1 (command line, success case):
data_string in strptime fn: 1957 Oct 4 1928:34
format_regex in strptime: re.compile('(?P<Y>\\d\\d\\d\\d)\\s+(?P<b>jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\\s+(?P<d>3[0-1]|[1-2]\\d|0[1-9]|[1-9]| [1-9])\\s+(?P<H>2[0-3]|[0-1]\\d|\\d)(?P<M>[0-5]\\d|\\d):(?P<S>6[0-1]|[0-5]\\d|\\d)', re.IGNORECASE)
From case 2 (script, fail case):
data_string in strptime fn: 1957 Oct 4 1928:34
format_regex in strptime: re.compile('(?P<Y>\\d\\d\\d\\d)\\s+(?P<b>janv\\.|févr\\.|avril|juil\\.|sept\\.|mars|juin|août|oct\\.|nov\\.|déc\\.|mai)\\s+(?P<d>3[0-1]|[1-2]\\d|0[1-9]|[1-9]| [1-9])\\s+(?P<H>2[0-3]|[0-1]\\d|\\d)(?P<M>[0-5]\\d|\, re.IGNORECASE)
Why does strptime use two different regexes to validate the string format, depending on where it is called from?
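One likely explanation, judging from the French month abbreviations (janv., févr., ...) in the second regex: strptime builds its pattern for %b from the current LC_TIME locale, and something in the script (or a library it imports) has probably called locale.setlocale, while the bare interpreter stays in the default C locale. Below is a minimal sketch under that assumption. The helper name parse_in_c_locale is hypothetical, and note that locale.setlocale affects the whole process, so this is not safe to call from several threads at once.

```python
import locale
from datetime import datetime


def parse_in_c_locale(date_str, fmt):
    """Parse date_str with English month/day names, whatever the active locale is."""
    saved = locale.setlocale(locale.LC_TIME)  # query the current LC_TIME setting
    try:
        locale.setlocale(locale.LC_TIME, "C")  # "C" uses English names such as "Oct"
        return datetime.strptime(date_str, fmt)
    finally:
        locale.setlocale(locale.LC_TIME, saved)  # restore the previous setting


# Printing this in both the interpreter and the script should show the difference.
print("LC_TIME is currently:", locale.setlocale(locale.LC_TIME))
print(parse_in_c_locale('1957 Oct 4 1928:34', "%Y %b %d %H%M:%S"))
```

If the printed LC_TIME value differs between the two cases, the next step would be to look for a setlocale call in the script or its imports.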
In Python, I'm trying to convert a future date into another format and subtract the current date from it, but it throws an error.
Python version = Python 3.6.8
from datetime import datetime
enddate = 'Thu Jun 02 08:00:00 EDT 2022'
todays = datetime.today()
print ('Today =',todays)
Modified_date1 = datetime.strptime(enddate, ' %a %b %d %H:%M:%S %Z %Y')
subtract_days= Modified_date1 - todays
print (subtract_days.days)
Output
Today = 2022-02-02 08:06:53.687342
Traceback (most recent call last):
File "1.py", line 106, in trusstore_output
Modified_date1 = datetime.strptime(enddate1, ' %a %b %d %H:%M:%S %Z %Y')
File "/usr/lib64/python3.6/_strptime.py", line 565, in _strptime_datetime
tt, fraction = _strptime(data_string, format)
File "/usr/lib64/python3.6/_strptime.py", line 362, in _strptime
(data_string, format))
ValueError: time data ' Thu Jun 02 08:00:00 EDT 2022' does not match format ' %a %b %d %H:%M:%S %Z %Y'
During handling of the above exception, another exception occurred:
Linux server date
$ date
Wed Feb 2 08:08:36 CST 2022
Point 6 in the documentation says that strptime cannot parse every time zone name:
%Z [...]
So someone living in Japan may have JST, UTC, and GMT as valid values, but probably not EST. It will raise ValueError for invalid values.
If possible, you could get the server date with the -u flag and parse the UTC timestamp.
date -u
Mi 2. Feb 14:39:11 UTC 2022
PS:
Also watch out for the leading whitespace in your strings.
If EDT is available on your system, the ValueError could be the result of a mix-up between enddate and enddate1:
' Thu Jun 02 08:00:00 EDT 2022' vs. enddate = 'Thu Jun 02 08:00:00 EDT 2022'
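To see which zone names %Z accepts on a given machine, a small probe like the following may help. It is only a sketch: the helper try_parse is a hypothetical name, and the result for EDT depends on the local time zone of the machine running it (time.tzname shows the local names).

```python
import time
from datetime import datetime

FMT = "%a %b %d %H:%M:%S %Z %Y"


def try_parse(tz_name):
    """Return the parsed datetime, or None if strptime rejects the zone name."""
    try:
        return datetime.strptime("Thu Jun 02 08:00:00 %s 2022" % tz_name, FMT)
    except ValueError:
        return None


print("local zone names:", time.tzname)
for name in ("UTC", "GMT", "EDT"):
    # UTC and GMT are always accepted; EDT only if it is one of the local names
    print(name, "->", try_parse(name))
```

Note that even when %Z matches, the result is a naive datetime (tzinfo is None), so the zone name is validated but not applied.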
Unfortunately, only a subset of timezones is supported by strptime.
If you can ensure that the input does not contain any other timezones than EDT or EST, you could replace these by the corresponding UTC offsets and use %z instead of %Z:
from datetime import datetime
date_str = "Thu Jun 02 08:00:00 EDT 2022"
date_str = date_str.replace("EDT", "-0400")
date_str = date_str.replace("EST", "-0500")
date_parsed = datetime.strptime(date_str, "%a %b %d %H:%M:%S %z %Y")
# 2022-06-02 08:00:00-04:00
print(date_parsed)
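Because %z produces a timezone-aware datetime, the subtraction from the question also needs an aware "now"; mixing an aware value with the naive datetime.today() raises a TypeError. A minimal sketch follows. The fixed now value is a stand-in chosen to match the server date shown in the question (08:08:36 CST), and in real code it would be datetime.now(timezone.utc).

```python
from datetime import datetime, timezone

# EDT already replaced by its UTC offset, as in the answer above
end = datetime.strptime("Thu Jun 02 08:00:00 -0400 2022", "%a %b %d %H:%M:%S %z %Y")

# Stand-in for datetime.now(timezone.utc), fixed so the result is reproducible
now = datetime(2022, 2, 2, 14, 8, 36, tzinfo=timezone.utc)

remaining = end - now  # both operands are aware, so this is allowed
print(remaining.days)  # 119
```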
What's the difference between the US/Mountain and AZ time zones? Why is it adding an extra 28 min?
>>> strtime = datetime.datetime.strptime('10:00pm', '%I:%M%p')
>>> tz = timezone('US/Mountain').localize(strtime)
>>> print tz
1900-01-01 22:00:00-07:00
>>> tz = timezone(us.states.lookup('AZ').capital_tz).localize(strtime)
>>> print tz
1900-01-01 22:00:00-07:28 <<-----
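This is most likely because your year is 1900 (see also this question): for dates that far back, pytz can fall back to the zone's local mean time offset, which is not a whole number of hours. It works fine if you add a current year: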
import datetime
from pytz import timezone
import us
strtime = datetime.datetime.strptime('2020 10:00pm', '%Y %I:%M%p')
tz = timezone('US/Mountain').localize(strtime)
print(tz)
# 2020-01-01 22:00:00-07:00
tz = timezone(us.states.lookup('AZ').capital_tz).localize(strtime)
print(tz)
# 2020-01-01 22:00:00-07:00
(I'm using Python 3, but that shouldn't make a difference; I get the same 28 min offset for year 1900.)
I have a text file (61 GB) in which each line is a string representing a date, e.g. Thu Dec 16 18:53:32 +0000 2010
Iterating over the file on a single core would take too long, so I would like to use PySpark and MapReduce to quickly find the number of lines per day in a certain year.
What I think is a good start:
import dateutil.parser
text_file = sc.textFile('dates.txt')
date_freqs = text_file.map(lambda line: dateutil.parser.parse(line)) \
.map(lambda date: date + 1) \
.reduceByKey(lambda a, b: a + b)
Unfortunately I can't understand how to filter on a certain year and reduce by key. The key is the day.
Example output:
Thu Dec 16 26543
Thu Dec 17 345
etc.
As alluded to in another answer, dateutil.parser.parse returns a datetime object which has year, month, and day attributes:
>>> dt = dateutil.parser.parse('Thu Dec 16 18:53:32 +0000 2010')
>>> dt.year
2010
>>> dt.month
12
>>> dt.day
16
Starting with this RDD:
>>> rdd = sc.parallelize([
... 'Thu Oct 21 5:12:38 +0000 2010',
... 'Thu Oct 21 4:12:38 +0000 2010',
... 'Wed Sep 22 15:46:40 +0000 2010',
... 'Sun Sep 4 22:28:48 +0000 2011',
... 'Sun Sep 4 21:28:48 +0000 2011'])
Here's how you can get the counts for all year-month-day combinations:
>>> from operator import attrgetter
>>> counts = rdd.map(dateutil.parser.parse).map(
... attrgetter('year', 'month', 'day')).countByValue()
>>> counts
defaultdict(<type 'int'>, {(2010, 9, 22): 1, (2010, 10, 21): 2, (2011, 9, 4): 2})
To get the output you want:
>>> import datetime
>>> for k, v in counts.items():
...     print('%s %d' % (datetime.datetime(*k).strftime('%a %b %d'), v))
...
Wed Sep 22 1
Thu Oct 21 2
Sun Sep 04 2
If you want counts for only a certain year, you can filter the RDD before doing the count:
>>> counts = rdd.map(dateutil.parser.parse).map(
...     attrgetter('year', 'month', 'day')).filter(
...     lambda ymd: ymd[0] == 2010).countByValue()
>>> counts
defaultdict(<type 'int'>, {(2010, 9, 22): 1, (2010, 10, 21): 2})
Something along the lines of this might be a good start:
import dateutil.parser
text_file = sc.textFile('dates.txt')
# Key each parsed datetime by its (year, month, day) tuple, then count per key
date_freqs = (text_file.map(dateutil.parser.parse)
              .keyBy(lambda d: (d.year, d.month, d.day))
              .countByKey())
I should add that dateutil is not part of the Python standard library. If you do not have sudo rights on your cluster, this could pose a problem. As a solution I would like to propose using datetime:
import datetime
from operator import attrgetter
def parse_line(d):
    # Drop the UTC offset field (e.g. +0000) so the rest matches the format below
    f = "%a %b %d %X %Y"
    date_list = d.split()
    date = date_list[:4]
    date.append(date_list[5])
    date = ' '.join(date)
    return datetime.datetime.strptime(date, f)
counts = rdd.map(parse_line)\
    .map(attrgetter('year', 'month', 'day'))\
    .filter(lambda ymd: ymd[0] == 2015)\
    .countByValue()
I am interested in better solutions using: Parquet, Row/Columns etc.
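The per-line logic can be checked locally, without a cluster, before handing it to Spark. The sketch below is plain Python and assumes Python 3, where %z parses the +0000 field directly so no splitting is needed. The function name day_key is hypothetical, and the sample lines are the ones from the RDD example above.

```python
from collections import Counter
from datetime import datetime

FMT = "%a %b %d %H:%M:%S %z %Y"


def day_key(line):
    """Map one input line to a (year, month, day) tuple."""
    d = datetime.strptime(line.strip(), FMT)
    return (d.year, d.month, d.day)


sample = [
    'Thu Oct 21 5:12:38 +0000 2010',
    'Thu Oct 21 4:12:38 +0000 2010',
    'Wed Sep 22 15:46:40 +0000 2010',
    'Sun Sep 4 22:28:48 +0000 2011',
    'Sun Sep 4 21:28:48 +0000 2011',
]

# Count lines per day, then keep only the days of one year
counts = Counter(day_key(line) for line in sample)
counts_2010 = {k: v for k, v in counts.items() if k[0] == 2010}
print(counts_2010)
```

On the cluster, the same function should slot into the pattern already shown, along the lines of rdd.map(day_key).countByValue().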
As a beginner, creating timestamps or formatted dates ended up being a little more of a challenge than I would have expected. What are some basic examples for reference?
Ultimately you want to review the datetime documentation and become familiar with the formatting variables, but here are some examples to get you started:
import datetime
print('Timestamp: {:%Y-%m-%d %H:%M:%S}'.format(datetime.datetime.now()))
print('Timestamp: {:%Y-%b-%d %H:%M:%S}'.format(datetime.datetime.now()))
print('Date now: %s' % datetime.datetime.now())
print('Date today: %s' % datetime.date.today())
today = datetime.date.today()
print("Today's date is {:%b, %d %Y}".format(today))
schedule = '{:%b, %d %Y}'.format(today) + ' - 6 PM to 10 PM Pacific'
schedule2 = '{:%B, %d %Y}'.format(today) + ' - 1 PM to 6 PM Central'
print('Maintenance: %s' % schedule)
print('Maintenance: %s' % schedule2)
The output:
Timestamp: 2014-10-18 21:31:12
Timestamp: 2014-Oct-18 21:31:12
Date now: 2014-10-18 21:31:12.318340
Date today: 2014-10-18
Today's date is Oct, 18 2014
Maintenance: Oct, 18 2014 - 6 PM to 10 PM Pacific
Maintenance: October, 18 2014 - 1 PM to 6 PM Central
Reference link: https://docs.python.org/3.4/library/datetime.html#strftime-strptime-behavior
>>> import time
>>> print(time.strftime('%a %H:%M:%S'))
Mon 06:23:14
from datetime import datetime
dt = datetime.now() # for date and time
ts = datetime.timestamp(dt) # for timestamp
print("Date and time is:", dt)
print("Timestamp is:", ts)
You might want to check string to datetime operations for formatting.
from datetime import datetime
datetime_str = '09/19/18 13:55:26'
datetime_object = datetime.strptime(datetime_str, '%m/%d/%y %H:%M:%S')
print(type(datetime_object))
print(datetime_object) # printed in default format
Output:
<class 'datetime.datetime'>
2018-09-19 13:55:26
The month format specifier doesn't seem to work.
from datetime import datetime
endDate = datetime.strptime('10 3 2011', '%j %m %Y')
print endDate
2011-01-10 00:00:00
endDate = datetime.strptime('21 5 1987', '%j %m %Y')
print endDate
1987-01-21 00:00:00
Now, according to the manual:
%m = Month as a decimal number [01,12].
So, what am I missing, other than the hair I've pulled out trying to understand why my django __filter queries return nothing (the dates going in aren't valid!)? I've tried 03 and 05 to no avail.
Versions of things, platform, architecture et al:
$ python --version
Python 2.7
$ python3 --version
Python 3.1.2
$ uname -r
2.6.35.11-83.fc14.x86_64 (that's Linux/Fedora 14/64-bit).
You can't mix %j with other format codes like %m. If you look in the table that you linked, %j is the day of the year as a decimal number [001,366], so 10 corresponds to the 10th day of the year, which is January 10.
So you just have to write:
>>> datetime.strptime('10 2011', '%j %Y')
datetime.datetime(2011, 1, 10, 0, 0)
Otherwise, if you wanted to use 10 as the day of the month, you should do:
>>> datetime.strptime('10 3 2011', '%d %m %Y')
datetime.datetime(2011, 3, 10, 0, 0)
Isn't %j the "day of year" parser, which may be forcing strptime to choose January 21, overriding the %m rule?
%j specifies a day of the year. It's impossible for the 10th day of the year, January 10, to occur in March, so your month specification is being ignored. Garbage In, Garbage Out.
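To make the day-of-year behaviour concrete, here is a short sketch. March 10 is day 69 of a non-leap year (31 + 28 + 10), so %j and %d/%m describe the same date only when they agree; when both %j and %m are present, the month is overridden, as in the question.

```python
from datetime import datetime

# Day 69 of 2011 is March 10
by_day_of_year = datetime.strptime('69 2011', '%j %Y')
print(by_day_of_year)  # 2011-03-10 00:00:00

# Going the other way, strftime gives the zero-padded day of the year
print(datetime(2011, 3, 10).strftime('%j'))  # 069

# When %j is present it wins over %m: the month 3 is ignored here
overridden = datetime.strptime('10 3 2011', '%j %m %Y')
print(overridden)  # 2011-01-10 00:00:00
```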