I'm trying to convert a date string to a date number using dtstr2dtnummx (three times faster than datenum), but for this input
dtstr2dtnummx({'2010-12-12 12:21:13.101'},'yyyy-mm-dd HH:MM:SS.FFF')
and this input
dtstr2dtnummx({'2010-12-12 12:21:13.121'},'yyyy-mm-dd HH:MM:SS.FFF')
I get the same output. I used the following tutorial to build up the date format.
Ahh sorry, UPDATED
The corresponding format of 'FFF' in datenum is 'SSS' in dtstr2dtnummx, as can be seen in cnv2icudf.m line #126. The end result is:
>> d1 = dtstr2dtnummx({'2010-12-12 12:21:13.101'},'yyyy-MM-dd HH:mm:ss.SSS')
d1 =
734484.514734965
>> d2 = dtstr2dtnummx({'2010-12-12 12:21:13.121'},'yyyy-MM-dd HH:mm:ss.SSS')
d2 =
734484.514735197
>> % double check the results - difference should equal 0.02 secs:
>> secsPerDay = 24*60*60;
>> timeDiff = secsPerDay * (d2-d1)
timeDiff =
0.019996
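The small gap between 0.019996 and the expected 0.02 is consistent with floating-point resolution: a serial date number near 734484 stored as a double can only resolve steps of roughly ten microseconds. A minimal sketch of that arithmetic, written in Python rather than MATLAB and assuming Python 3.9+ for math.ulp:

```python
import math

SECS_PER_DAY = 24 * 60 * 60

# MATLAB serial date number taken from the example above.
d1 = 734484.514734965

# Spacing between adjacent doubles at this magnitude, in days.
ulp_days = math.ulp(d1)

# The same spacing in seconds: the finest time step a double can hold here.
resolution_secs = ulp_days * SECS_PER_DAY
print(resolution_secs)
```

Each of d1 and d2 can be off by up to half that step, so a difference a few microseconds away from 0.02 is expected.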
I have now posted an article about this on http://undocumentedmatlab.com/blog/datenum-performance/
I have a list of dates that gets imported as strings. I've tried a bunch of different things to convert it to a list of dates. No matter how I do it, I get an error.
#import valid dates from map file as list
valdts = dfmap.loc[row, 'valdata'].split(', ')
print(valdts)
>>
['1/1/1990', '6/6/1990', '7/4/1776']
#convert strings to dates
attempt1:
valdts = [d.strftime('%Y-%m-%d') for d in valdts]
>>
AttributeError: 'str' object has no attribute 'strftime'
attempt2:
a = [date_obj.strftime('%Y%m%d') for date_obj in valdts]
>>
AttributeError: 'str' object has no attribute 'strftime'
attempt3:
a = datetime.strptime(valdts, '%d %b %Y')
>>
TypeError: strptime() argument 1 must be str, not list
attempt4:
a = valdts.sort(key = lambda dt: datetime.strptime(dt, '%d %m %Y'))
>>
ValueError: time data '1/1/1990' does not match format '%d %m %Y'
attempt5:
for dt in valdts:
    dt = dt.replace('/',',')
    print(dt)
    c = datetime.strptime('.'.join(str(i) for i in dt).zfill(8), "%Y.%m.%d")
>>
'1,1,1990'
ValueError: time data '1.,.1.,.1.9.9.0' does not match format '%Y.%m.%d'
attempt6:
for dt in valdts:
    dt = dt.replace('/',',')
    datetime.strptime(dt, '%d %m %Y')
>>
ValueError: time data '1,1,1990' does not match format '%d %m %Y'
I'm getting quite frustrated. The different approaches above are based on responses to similar but not quite the same questions posted by others. The question most similar to mine has been downvoted. Am I trying to do something stupid? Would really appreciate some help here. Thanks.
Note: datetime.datetime gives me an error (AttributeError: type object 'datetime.datetime' has no attribute 'datetime'), but just datetime works for other parts of my code.
This is the workaround I finally came up with, but I would welcome a better method that doesn't require splitting each date.
valdts = dfmap.loc[row, 'valdata'].split(', ')
print(valdts)
>>
['1/1/1990', '6/6/1990', '7/4/1776']
ldates = []
for dt in valdts:
    ldt = dt.split('/')
    # build the date from integer parts: datetime(year, month, day)
    valdt = datetime(int(ldt[2]), int(ldt[1]), int(ldt[0]))
    ldates.append(valdt)
    print(ldt)
>>
note1: datetime.datetime didn't work for me because of the way I'd imported datetime. See excellent explanations here.
note2: converting the individual numbers to int was crucial. nothing else worked for me. Credit to the solution provided by #waitingkuo in the same link above.
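A possible alternative that avoids splitting each date by hand: strptime has to be called once per string, and the format must mirror the separators in the data ('/' here, not spaces or commas), which is why the earlier attempts failed. A minimal sketch, assuming the strings are day/month/year as in the workaround above:

```python
from datetime import datetime

valdts = ['1/1/1990', '6/6/1990', '7/4/1776']

# Assumption: day/month/year order, matching datetime(int(ldt[2]), int(ldt[1]), int(ldt[0])).
# For month/day/year input, use '%m/%d/%Y' instead.
ldates = [datetime.strptime(d, '%d/%m/%Y') for d in valdts]
print(ldates)
```

This relies on "from datetime import datetime", the same import style under which plain datetime(...) works.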
I am trying to process website login session data by each user. I am reading an S3 session log file into an RDD. The data looks something like this.
----------------------------------------
User | Site | Session start | Session end
---------------------------------------
Joe |Waterloo| 9/21/19 3:04 AM |9/21/19 3:18 AM
Stacy|Kirkwood| 8/4/19 3:06 PM |8/4/19 3:54 PM
John |Waterloo| 9/21/19 8:48 AM |9/21/19 9:05 AM
Stacy|Kirkwood| 8/4/19 4:16 PM |8/4/19 5:41 PM
...
...
I want to find out how many users were logged in each second of the hour on a given day.
Example: I might be processing this data for 9/21/19 only. So, I would need to remove all other records and then SUM user sessions for each second of the hour for all 24 hours of 9/21/19. The output should be possibly 24 rows for all the hours of 9/21/19 and then counts for each second of the day(yikes, second by second data!).
Is this something possible to do in pyspark using either rdds or DF?
(Apologize for the tardiness in building the grid).
Thanks
my dataset
from pyspark.sql import functions as F
from pyspark.sql.functions import udf, when
from pyspark.sql.types import StructType, StructField, StringType, ArrayType, IntegerType

data=[['Joe','Waterloo','9/21/19 3:04 AM','9/21/19 3:18 AM'],
['Stacy','Kirkwood','8/4/19 3:06 PM','8/4/19 3:54 PM'],
['John','Waterloo','9/21/19 8:48 AM','9/21/19 9:05 AM'],
['Stacy','Kirkwood','9/21/19 4:06 PM', '9/21/19 4:54 PM'],
['Mo','Hashmi','9/21/19 1:06 PM', '9/21/19 5:54 PM'],
['Murti','Hash','9/21/19 1:00 PM', '9/21/19 3:00 PM'],
['Floo','Shmi','9/21/19 9:10 PM', '9/21/19 11:54 PM']]
cSchema = StructType([StructField("User", StringType()),
                      StructField("Site", StringType()),
                      StructField("Sesh-Start", StringType()),
                      StructField("Sesh-End", StringType())])
df= spark.createDataFrame(data,schema=cSchema)
display(df)
parse timestamp
df1=df.withColumn("Start", F.from_unixtime(F.unix_timestamp("Sesh-Start",'MM/dd/yyyy hh:mm aa'),'20yy-MM-dd HH:mm:ss').cast("timestamp")).withColumn("End", F.from_unixtime(F.unix_timestamp("Sesh-End",'MM/dd/yyyy hh:mm aa'),'20yy-MM-dd HH:mm:ss').cast("timestamp")).drop("Sesh-Start","Sesh-End")
build and register udf, for multiple hours per person
def yo(a,b):
    from datetime import datetime
    d1 = datetime.strptime(str(a), '%Y-%m-%d %H:%M:%S')
    d2 = datetime.strptime(str(b), '%Y-%m-%d %H:%M:%S')
    y=[]
    if d1.hour == d2.hour:
        y.append(d1.hour)
    else:
        for i in range(d1.hour,d2.hour+1):
            y.append(i)
    return y
rng= udf(yo, ArrayType(IntegerType()))
explode list of hours into column
df2=df1.withColumn("new", rng(F.col("Start"),F.col("End"))).withColumn("new1",F.explode("new")).drop("new")
get seconds for each hour
df3=df2.withColumn("Seconds", when(F.hour("Start")==F.hour("End"), F.col("End").cast('long') - F.col("Start").cast('long'))
.when(F.hour("Start")==F.col("new1"), 3600-F.minute("Start")*60)
.when(F.hour("End")==F.col("new1"), F.minute("End")*60)
.otherwise(3600))
create temp view and query it
df3.createOrReplaceTempView("final")
display(spark.sql("Select new1, sum(Seconds) from final group by new1 order by new1"))
The above answer by Lennart could be more performant because it uses a join to get all the different hours, whereas I use a UDF, which could be slower. My code will work for a user who is online for any number of hours. My data contained only the required day, so you could use the day filter given above to limit your query to the day in question.
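The per-hour split that this approach computes can be sanity-checked in plain Python without Spark. The sketch below is illustrative only: seconds_per_hour is a hypothetical helper, not part of the answer's code, and it assumes a session starts and ends on the same calendar day.

```python
from datetime import datetime, timedelta

def seconds_per_hour(start, end):
    """Split one session into {hour_of_day: seconds online during that hour}."""
    result = {}
    cursor = start
    while cursor < end:
        # Start of the next clock hour after the cursor.
        next_hour = cursor.replace(minute=0, second=0, microsecond=0) + timedelta(hours=1)
        segment_end = min(next_hour, end)
        seconds = int((segment_end - cursor).total_seconds())
        result[cursor.hour] = result.get(cursor.hour, 0) + seconds
        cursor = segment_end
    return result

fmt = "%m/%d/%y %I:%M %p"
long_session = seconds_per_hour(datetime.strptime("9/21/19 1:06 PM", fmt),
                                datetime.strptime("9/21/19 5:54 PM", fmt))
short_session = seconds_per_hour(datetime.strptime("9/21/19 3:04 AM", fmt),
                                 datetime.strptime("9/21/19 3:18 AM", fmt))
print(long_session)
print(short_session)
```

Summing these dictionaries over all sessions of the day gives the same per-hour totals as the group-by query.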
Try to check this:
Initialize filter.
from pyspark.sql.functions import col, lit, to_date, to_timestamp, when, hour, minute, second, count, sum
filter = to_date(lit("2019-09-21"))
startFilter = to_timestamp(lit("2019-09-21 00:00:00.000"))
endFilter = to_timestamp(lit("2019-09-21 23:59:59.999"))
Generate range (0 .. 23).
hours = spark.range(24)
Get actual user sessions that match the filter.
df = sessions.alias("s") \
    .where((filter >= to_date(col("s.start"))) & (filter <= to_date(col("s.end")))) \
    .select(col("s.user"), \
        when(col("s.start") < startFilter, startFilter).otherwise(col("s.start")).alias("start"), \
        when(col("s.end") > endFilter, endFilter).otherwise(col("s.end")).alias("end"))
Combine match user sessions with range of hours.
df2 = df.join(hours, hours.id.between(hour(df.start), hour(df.end)), 'inner') \
    .select(df.user, hours.id.alias("hour"), \
        (when(hour(df.end) > hours.id, 3600).otherwise(minute(df.end) * 60 + second(df.end)) - \
         when(hour(df.start) < hours.id, 0).otherwise(minute(df.start) * 60 + second(df.start))).alias("seconds"))
Generate summary: calculate users count and sum of seconds for each hour of sessions.
df2.groupBy(df2.hour)\
    .agg(count(df2.user).alias("user counts"), \
         sum(df2.seconds).alias("seconds")) \
    .show()
Hope this helps.
Edit: Apologies, the sample data frame is a little off. Below is the corrected sample dataframe I'm trying to convert:
Timestamp (CST)
12/8/2018 05:23 PM
11/29/2018 10:20 PM
I tried the following code based on recommendation below but got null values returned.
df = df.withColumn('Timestamp (CST)_2', from_unixtime(unix_timestamp(col(('Timestamp (CST)')), "yyyy/MM/dd hh:mm:ss aa"), "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'"))
df = df.withColumn("Timestamp (CST)_3", F.to_timestamp(F.col("Timestamp (CST)_2")))
--------------------------------------------------------------------------------
I have a field called "Timestamp (CST)" that is a string. It is in Central Standard Time.
Timestamp (CST)
2018-11-21T5:28:56 PM
2018-11-21T5:29:16 PM
How do I create a new column that takes "Timestamp (CST)" and change it to UTC and convert it to a datetime with the time stamp on the 24 hour clock?
Below is my desired table and I would like the datatype to be timestamp:
Timestamp (CST)_2
2018-11-21T17:28:56.000Z
2018-11-21T17:29:16.000Z
I tried the following code but all the results came back null:
df = df.withColumn("Timestamp (CST)_2", to_timestamp("Timestamp (CST)", "yyyy/MM/dd h:mm p"))
Firstly, import from_unixtime, unix_timestamp and col using
from pyspark.sql.functions import from_unixtime, unix_timestamp, col
Then, reconstructing your scenario in a DataFrame df_time
>>> cols = ['Timestamp (CST)']
>>> vals = [
... ('2018-11-21T5:28:56 PM',),
... ('2018-11-21T5:29:16 PM',)]
>>> df_time = spark.createDataFrame(vals, cols)
>>> df_time.show(2, False)
+---------------------+
|Timestamp (CST) |
+---------------------+
|2018-11-21T5:28:56 PM|
|2018-11-21T5:29:16 PM|
+---------------------+
Then, my approach would be
>>> df_time_twenfour = df_time.withColumn('Timestamp (CST)', \
... from_unixtime(unix_timestamp(col(('Timestamp (CST)')), "yyyy-MM-dd'T'hh:mm:ss aa"), "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'"))
>>> df_time_twenfour.show(2, False)
+------------------------+
|Timestamp (CST) |
+------------------------+
|2018-11-21T17:28:56.000Z|
|2018-11-21T17:29:16.000Z|
+------------------------+
Notes
If you want the time in 24-hour format, use HH instead of hh.
Since you have a PM marker, use aa in yyyy-MM-dd'T'hh:mm:ss aa to match it.
Your input string has a T in it, so you have to quote it as in the format above.
The pattern aa mentioned in #pyy4917's answer might give legacy-parser errors. To fix it, replace aa with a.
The full code as below:
df_time_twenfour = df_time.withColumn('Timestamp (CST)',
    from_unixtime(unix_timestamp(col('Timestamp (CST)'),
        "yyyy-MM-dd'T'hh:mm:ss a"), "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'"))
I am using FTP and I have retrieved a list of Files. In the files command line there is a datetime field.
It reads
11-13-13 11:31AM
Can anyone tell me how I can parse this. I thought this might work.
DateTime.ParseExact(date,"MM-DD-YY HH:MMtt", System.Globalization.CultureInfo.InvariantCulture.DateTimeFormat);
But I still get an exception.
Day should be dd, not DD.
Year should be yy, not YY.
Since the hour is in 12-hour format it should be hh, and minutes should be mm (MM means month).
DateTime.ParseExact(date,"MM-dd-yy hh:mmtt",...)
eg:-
DateTime date = DateTime.ParseExact("11-13-13 11:31AM", "MM-dd-yy hh:mmtt", System.Globalization.CultureInfo.InvariantCulture);
Response.Write(date);
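For comparison only, the same case-sensitivity rule applies to Python's strptime directives. This is an analogous sketch in Python, not the C# API used above:

```python
from datetime import datetime

# Directive case matters here too: %m is month, %M is minute,
# %I is the 12-hour clock hour and %p is the AM/PM marker.
parsed = datetime.strptime("11-13-13 11:31AM", "%m-%d-%y %I:%M%p")
print(parsed)
```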
Noob here,
I'm stuck at trying to present user input in military time into standard time. The code works so far, but I need to subtract 12 hours from the end time to display in standard time. How do I do this using datetime.time? Also, do I need to convert the original user input to an integer to perform datetime.timedelta calculations? Previous questions don't seem to answer my coding questions.
My code is:
def timeconvert():
    print "Hello and welcome to Python Payroll 1.0."
    print ""
    # User input for start time. Variable stored.
    start = raw_input("Enter your check-in time in military format (0900): ")
    # User input for end time. Variable stored.
    end = raw_input("Enter your check-out time in military format (1700): ")
    print ""
    # ---------------------------------------------------------------------------
    # Present user input in standard time format hhmm = hh:mm
    # ---------------------------------------------------------------------------
    import datetime, time
    convert_start = datetime.time(hour=int(start[0:2]), minute=int(start[2:4]))
    # need to find a way to subtract 12 from the hour to present end time in standard time
    convert_end = datetime.time(hour=int(end[0:2]), minute=int(end[2:4]))
    print 'You started at', convert_start.strftime("%H:%M"),'am', 'and ended at', convert_end.strftime("%H:%M"), 'pm'
    # ---------------------------------------------------------------------------
    # Use timedelta to calculate time worked.
    # ---------------------------------------------------------------------------
    # print datetime.timedelta
timeconvert()
raw_input("Press ENTER to exit program") # Closes program.
Thanks.
You can use strftime("%I:%M %p") to get standard 12 hour formatting with "AM" or "PM" at the end. See the Python documentation for more details on datetime string formatting.
Also, while subtracting datetime.time objects is not natively supported, you can simply pass the hour and minute differences of the two datetime.time instances to the timedelta constructor.
The below code should suffice, though proper error checking should definitely be used. ;)
--ap
import datetime

start = raw_input("Enter your check-in time in military format (0900): ")
end = raw_input("Enter your check-out time in military format (1700): ")
# convert user input to datetime.time instances
start_t = datetime.time(hour=int(start[0:2]), minute=int(start[2:4]))
end_t = datetime.time(hour=int(end[0:2]), minute=int(end[2:4]))
delta_t = datetime.timedelta(
    hours = (end_t.hour - start_t.hour),
    minutes = (end_t.minute - start_t.minute)
)
# datetime format
fmt = "%I:%M %p"
print 'You started at %s and ended at %s' % (start_t.strftime(fmt), end_t.strftime(fmt))
print 'You worked for %s' % (delta_t)
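The same idea can be written for Python 3 by parsing the input with strptime, so that subtraction yields the timedelta directly. A minimal sketch, assuming four-digit input and a shift that does not cross midnight; parse_military is a hypothetical helper name, and fixed sample values stand in for the user input:

```python
from datetime import datetime

def parse_military(text):
    """Parse a four-digit 24-hour time such as '0900' (the date part defaults to 1900-01-01)."""
    return datetime.strptime(text, "%H%M")

start_t = parse_military("0900")
end_t = parse_military("1700")

# Subtracting two datetime objects gives a timedelta.
worked = end_t - start_t

fmt = "%I:%M %p"
print("You started at %s and ended at %s" % (start_t.strftime(fmt), end_t.strftime(fmt)))
print("You worked for %s" % worked)
```

strptime also raises ValueError on malformed input such as '2500', which covers part of the error checking mentioned above.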
def time12hr(string):
    hours = string[:2]
    minutes = string[2:]
    x = " "
    if int(hours) == 12:
        x = "p.m."
        hours = "12"
    elif int(hours) == 0:
        x = "a.m."
        hours = "12"
    elif int(hours) > 12:
        x = "p.m."
        hours = str(int(hours) - 12)
    else:
        x = "a.m."
    return "%s:%s %s"%(hours ,minutes,x)
print time12hr('1202')
print time12hr('1200')
print time12hr('0059')
print time12hr('1301')
print time12hr('0000')