PySpark: Convert string datetime in 12-hour clock to datetime with 24-hour clock (time zone change)

Edit: Apologies, the sample DataFrame was a little off. Below is the corrected sample I'm trying to convert:
Timestamp (CST)
12/8/2018 05:23 PM
11/29/2018 10:20 PM
I tried the following code based on the recommendation below, but it returned null values:
df = df.withColumn('Timestamp (CST)_2', from_unixtime(unix_timestamp(col(('Timestamp (CST)')), "yyyy/MM/dd hh:mm:ss aa"), "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'"))
df = df.withColumn("Timestamp (CST)_3", F.to_timestamp(F.col("Timestamp (CST)_2")))
--------------------------------------------------------------------------------
I have a field called "Timestamp (CST)" that is a string. It is in Central Standard Time.
Timestamp (CST)
2018-11-21T5:28:56 PM
2018-11-21T5:29:16 PM
How do I create a new column that takes "Timestamp (CST)", converts it to UTC, and stores it as a datetime on the 24-hour clock?
Below is my desired table and I would like the datatype to be timestamp:
Timestamp (CST)_2
2018-11-21T17:28:56.000Z
2018-11-21T17:29:16.000Z
I tried the following code but all the results came back null:
df = df.withColumn("Timestamp (CST)_2", to_timestamp("Timestamp (CST)", "yyyy/MM/dd h:mm p"))

Firstly, import from_unixtime, unix_timestamp and col using
from pyspark.sql.functions import from_unixtime, unix_timestamp, col
Then, reconstructing your scenario in a DataFrame df_time
>>> cols = ['Timestamp (CST)']
>>> vals = [
... ('2018-11-21T5:28:56 PM',),
... ('2018-11-21T5:29:16 PM',)]
>>> df_time = spark.createDataFrame(vals, cols)
>>> df_time.show(2, False)
+---------------------+
|Timestamp (CST) |
+---------------------+
|2018-11-21T5:28:56 PM|
|2018-11-21T5:29:16 PM|
+---------------------+
Then, my approach would be
>>> df_time_twenfour = df_time.withColumn('Timestamp (CST)', \
...     from_unixtime(unix_timestamp(col('Timestamp (CST)'), "yyyy-MM-dd'T'hh:mm:ss aa"), "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'"))
>>> df_time_twenfour.show(2, False)
+------------------------+
|Timestamp (CST) |
+------------------------+
|2018-11-21T17:28:56.000Z|
|2018-11-21T17:29:16.000Z|
+------------------------+
Notes
If you want the time in 24-hour format, use HH instead of hh.
Since you have a PM, use aa in yyyy-MM-dd'T'hh:mm:ss aa to parse it.
Your input string has a T in it, so you have to quote it in the format, as shown above.

The option aa, as mentioned in #pyy4917's answer, might give legacy errors. To fix it, replace aa with a.
The full code is below:
df_time_twenfour = df_time.withColumn('Timestamp (CST)',
    from_unixtime(unix_timestamp(col('Timestamp (CST)'), "yyyy-MM-dd'T'hh:mm:ss a"), "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'"))

Related

pyspark daterange calculations in spark

I am trying to process website login session data by each user. I am reading an S3 session log file into an RDD. The data looks something like this.
-------------------------------------------------
User | Site     | Session start   | Session end
-------------------------------------------------
Joe  | Waterloo | 9/21/19 3:04 AM | 9/21/19 3:18 AM
Stacy| Kirkwood | 8/4/19 3:06 PM  | 8/4/19 3:54 PM
John | Waterloo | 9/21/19 8:48 AM | 9/21/19 9:05 AM
Stacy| Kirkwood | 8/4/19 4:16 PM  | 8/4/19 5:41 PM
...
...
I want to find out how many users were logged in each second of the hour on a given day.
Example: I might be processing this data for 9/21/19 only, so I would need to remove all other records and then sum user sessions for each second of each hour across all 24 hours of 9/21/19. The output would be up to 24 rows, one per hour of 9/21/19, with counts for each second of the day (yikes, second-by-second data!).
Is this possible to do in pyspark using either RDDs or DataFrames?
(Apologies for the tardiness in building the grid.)
Thanks
My dataset (with the imports the snippets below need):
from pyspark.sql import functions as F
from pyspark.sql.functions import udf, when
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, ArrayType

data = [['Joe', 'Waterloo', '9/21/19 3:04 AM', '9/21/19 3:18 AM'],
        ['Stacy', 'Kirkwood', '8/4/19 3:06 PM', '8/4/19 3:54 PM'],
        ['John', 'Waterloo', '9/21/19 8:48 AM', '9/21/19 9:05 AM'],
        ['Stacy', 'Kirkwood', '9/21/19 4:06 PM', '9/21/19 4:54 PM'],
        ['Mo', 'Hashmi', '9/21/19 1:06 PM', '9/21/19 5:54 PM'],
        ['Murti', 'Hash', '9/21/19 1:00 PM', '9/21/19 3:00 PM'],
        ['Floo', 'Shmi', '9/21/19 9:10 PM', '9/21/19 11:54 PM']]
cSchema = StructType([StructField("User", StringType()),
                      StructField("Site", StringType()),
                      StructField("Sesh-Start", StringType()),
                      StructField("Sesh-End", StringType())])
df = spark.createDataFrame(data, schema=cSchema)
display(df)
Parse the timestamps:
# The input years are two-digit ('9/21/19'), so the output pattern '20yy'
# prefixes a literal '20' to rebuild a four-digit year.
df1 = df.withColumn("Start", F.from_unixtime(F.unix_timestamp("Sesh-Start", 'MM/dd/yyyy hh:mm aa'), '20yy-MM-dd HH:mm:ss').cast("timestamp")) \
        .withColumn("End", F.from_unixtime(F.unix_timestamp("Sesh-End", 'MM/dd/yyyy hh:mm aa'), '20yy-MM-dd HH:mm:ss').cast("timestamp")) \
        .drop("Sesh-Start", "Sesh-End")
Build and register a UDF that lists the hours each session spans:
def yo(a, b):
    from datetime import datetime
    d1 = datetime.strptime(str(a), '%Y-%m-%d %H:%M:%S')
    d2 = datetime.strptime(str(b), '%Y-%m-%d %H:%M:%S')
    y = []
    if d1.hour == d2.hour:
        y.append(d1.hour)
    else:
        for i in range(d1.hour, d2.hour + 1):
            y.append(i)
    return y

rng = udf(yo, ArrayType(IntegerType()))
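A quick sanity check of the helper on plain strings:
# Hours spanned by a 1:06 PM - 5:54 PM session
print(yo('2019-09-21 13:06:00', '2019-09-21 17:54:00'))  # [13, 14, 15, 16, 17]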
Explode the list of hours into a column:
df2 = df1.withColumn("new", rng(F.col("Start"), F.col("End"))).withColumn("new1", F.explode("new")).drop("new")
Get the seconds covered in each hour:
df3 = df2.withColumn("Seconds", when(F.hour("Start") == F.hour("End"), F.col("End").cast('long') - F.col("Start").cast('long'))
                     .when(F.hour("Start") == F.col("new1"), 3600 - F.minute("Start") * 60)
                     .when(F.hour("End") == F.col("new1"), F.minute("End") * 60)
                     .otherwise(3600))
Create a temp view and query it:
df3.createOrReplaceTempView("final")
display(spark.sql("Select new1, sum(Seconds) from final group by new1 order by new1"))
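For reference, the same aggregation without the temp view might look like:
df3.groupBy("new1").agg(F.sum("Seconds")).orderBy("new1").show()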
Lennart's join-based answer below could be more performant, because it uses a join to get all the distinct hours, whereas I use a UDF, which could be slower. My code will work for any user who can be online for any number of hours. My data used only the day required, so you could apply a day filter, as sketched below, to limit the query to the day in question.
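A minimal sketch of such a day filter (the date literal is illustrative):
# Keep only sessions that start on the day of interest
df1_day = df1.filter(F.to_date(F.col("Start")) == "2019-09-21")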
Try to check this:
Initialize the filters:
from pyspark.sql import functions as F

day_filter = F.to_date(F.lit("2019-09-21"))
start_filter = F.to_timestamp(F.lit("2019-09-21 00:00:00.000"))
end_filter = F.to_timestamp(F.lit("2019-09-21 23:59:59.999"))
Generate the range of hours (0..23):
hours = spark.range(24)  # a DataFrame with a single `id` column; don't collect() it, since we join against it below
Get the actual user sessions that match the filter, clamping each session to the day's bounds:
df = sessions.where((day_filter >= F.to_date(sessions.start)) & (day_filter <= F.to_date(sessions.end))) \
    .select(sessions.user,
            F.when(sessions.start < start_filter, start_filter).otherwise(sessions.start).alias("start"),
            F.when(sessions.end > end_filter, end_filter).otherwise(sessions.end).alias("end"))
Combine the matching user sessions with the range of hours:
df2 = df.join(hours, hours.id.between(F.hour(df.start), F.hour(df.end)), 'inner') \
    .select(df.user, hours.id.alias("hour"),
            (F.when(F.hour(df.end) > hours.id, 3600).otherwise(F.minute(df.end) * 60 + F.second(df.end)) -
             F.when(F.hour(df.start) < hours.id, 0).otherwise(F.minute(df.start) * 60 + F.second(df.start))).alias("seconds"))
Generate the summary: calculate the user count and the sum of seconds for each hour of sessions:
df2.groupBy(df2.hour) \
    .agg(F.count(df2.user).alias("user counts"),
         F.sum(df2.seconds).alias("seconds")) \
    .show()
Hope this helps.

How do I find the day difference between two date columns in Azure App Insights?

We have a log file where we store the searches happening on our platform. There is a departure date, and I want to find the searches where the departure date is more than 330 days from today.
I am trying to run a query to find the difference between the departure date column and logTime (the time the event entered the log), but I am getting the error below:
Query could not be parsed at 'datetime("departureDate")' on line [5,54]
Token: datetime("departureDate")
Line: 5
Position: 54
The date format of the departure date is mm/dd/yyyy, and logTime is in the typical App Insights datetime format.
Query that I am running is below:
customEvents
| where name == "SearchLog"
| extend departureDate = tostring(customDimensions.departureDate)
| extend logTime = tostring(customDimensions.logTime)
| where datetime_diff('day',datetime("departureDate"),datetime("logTime")) > 200
As suggested, I ran the query below, but now I am getting 0 results even though there is data that satisfies the given criteria.
customEvents
| where name == "SearchLog"
| extend departureDate = tostring(customDimensions.departureDate)
| extend logTime = tostring(customDimensions.logTime)
| where datetime_diff('day',todatetime(departureDate),todatetime(logTime)) > 200
Example:
departureDate
04/09/2020
logTime
8/13/2019 8:45:39 AM -04:00
I also tried the query below to check whether the date format is supported, and it gave the correct response.
customEvents
| project datetime_diff('day', datetime('04/30/2020'),datetime('8/13/2019 8:25:51 AM -04:00'))
Please use the query below; the todatetime() function converts a string to a datetime:
customEvents
| where name == "SearchLog"
| extend departureDate = tostring(customDimensions.departureDate)
| extend logTime = tostring(customDimensions.logTime)
| where datetime_diff('day',todatetime(departureDate),todatetime(logTime)) > 200
The double quotes inside the datetime() operator in the where clause should be removed; datetime("departureDate") tries to parse the literal text "departureDate" as a date instead of reading the column.
Your code should look like:
where datetime_diff('day',datetime(departureDate),datetime(logTime)) > 200

How to get the month format with a leading 0 after Evaluate using Robot Framework

I am trying to get the month and year in a format like "06/19" after evaluating the month, but I got just "6/19".
Month and year
    ${currentYear}=    Get Current Date    result_format=%y
    ${currentDate}=    Get Current Date
    ${datetime} =    Convert Date    ${currentDate}    datetime
    ${getMonth}=    Evaluate    ${datetime.month} - 1
    Log To Console    ${getMonth}/${currentYear}
I already tried another way, by creating a variable @{MONTHSNO}    ${EMPTY}    01    02    03    04    05    06    07    08    09    10    11    12 and returning ${MONTHSNO}[${getMonth}]/${currentYear}. I got 06/19, but I'm not sure whether Robot has another way to convert the month to "06" without making a variable like this.
You can achieve this by using a custom keyword that returns the date in month/year format. You can then use relativedelta() to subtract a month from your date.
To install dateutil:
pip install python-dateutil
test.py
from datetime import datetime
from dateutil.relativedelta import relativedelta
def return_current_date_minus_one_month():
    strDate = datetime.today()
    Subtracted_date = strDate + relativedelta(months=-1)
    Date = Subtracted_date.strftime('%m/%y')
    return Date
test.robot
*** Settings ***
Library    test.py

*** Test Cases ***
Month and year
    ${current_date} =    Test.Return Current Date Minus One Month
    Log    ${current_date}
Result: ${current_date} = 06/19
When you run the Evaluate keyword, you are running Python code. So let's take a look at the datetime docs:
https://docs.python.org/2/library/datetime.html#strftime-and-strptime-behavior
They have a note at the end saying that the time library can be useful, so I suggest using this command to return the month in zero-padded form:
${getMonth}=    Evaluate    time.strftime("%m")
This returns 07 for me (because it's currently July).
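If you also need the previous month, as in the question, rather than the current one, a plain-Python sketch that Evaluate could mirror:
import time
from datetime import date, timedelta

print(time.strftime("%m/%y"))  # current month/year, zero-padded, e.g. '07/19'

prev = date.today().replace(day=1) - timedelta(days=1)  # last day of the previous month
print(prev.strftime("%m/%y"))  # previous month/year, e.g. '06/19'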
You can subtract one month from the current date like this with the DateTime library (note that subtracting 30 days is only an approximation of a month):
${date1}=    Get Current Date    result_format=%d.%m.%Y
${date2}=    Subtract Time From Date    ${date1}    30 days    date_format=%d.%m.%Y    result_format=%m.%Y
Maybe this helps.

Extract date from a string column containing timestamp in Pyspark

I have a dataframe which has a date in the following format:
+----------------------+
|date |
+----------------------+
|May 6, 2016 5:59:34 AM|
+----------------------+
I intend to extract the date from this in the format yyyy-MM-dd, so for the date above the result should be 2016-05-06.
But when I extract it using the following:
df.withColumn('part_date', from_unixtime(unix_timestamp(df.date, "MMM dd, YYYY hh:mm:ss aa"), "yyyy-MM-dd"))
I get the following date
2015-12-27
Can anyone please advise on this? I do not intend to convert my df to an RDD to use Python's datetime functions, and want to do this in the DataFrame itself.
There are some errors in your pattern. YYYY is the week-year, not the calendar year; with no week number in the input, parsing snaps to the start of week-year 2016, which is 2015-12-27. Use lowercase yyyy instead, and single-letter d and h so unpadded days and hours parse cleanly. Here's a suggestion:
from_pattern = 'MMM d, yyyy h:mm:ss aa'
to_pattern = 'yyyy-MM-dd'
df.withColumn('part_date', from_unixtime(unix_timestamp(df['date'], from_pattern), to_pattern)).show()
+----------------------+----------+
|date |part_date |
+----------------------+----------+
|May 6, 2016 5:59:34 AM|2016-05-06|
+----------------------+----------+

Convert 12-hour date/time to 24-hour date/time

I have a tab delimited file where each record has a timestamp field in 12-hour format:
mm/dd/yyyy hh:mm:ss [AM|PM].
I need to quickly convert these fields to 24-hour time:
mm/dd/yyyy HH:mm:ss.
What would be the best way to do this? I'm running on a Windows platform, but I have access to sed, awk, perl, python, and tcl in addition to the usual Windows tools.
Using Perl and hand-crafted regexes instead of facilities like strptime:
#!/bin/perl -w

while (<>)
{
    # for date times that don't use leading zeroes, use this regex instead:
    # (?:\d{1,2}/\d{1,2}/\d{4} )(\d{1,2})(?::\d\d:\d\d) (AM|PM)
    while (m%(?:\d\d/\d\d/\d{4} )(\d\d)(?::\d\d:\d\d) (AM|PM)%)
    {
        my $hh = $1;
        $hh -= 12 if ($2 eq 'AM' && $hh == 12);
        $hh += 12 if ($2 eq 'PM' && $hh != 12);
        $hh = sprintf "%02d", $hh;
        # for date times that don't use leading zeroes, use this regex instead:
        # (\d{1,2}/\d{1,2}/\d{4} )(\d{1,2})(:\d\d:\d\d) (?:AM|PM)
        s%(\d\d/\d\d/\d{4} )(\d\d)(:\d\d:\d\d) (?:AM|PM)%$1$hh$3%;
    }
    print;
}
That's very fussy, but it also converts multiple timestamps per line when present.
Note that the transformation from AM/PM to 24-hour time is not trivial:
12:01 AM --> 00:01
12:01 PM --> 12:01
01:30 AM --> 01:30
01:30 PM --> 13:30
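A minimal Python sketch of that rule (the same modulo trick shows up in the answers below):
def to_24h(hour12, meridiem):
    # 12 AM -> 0, 12 PM -> 12, 1-11 PM -> 13-23, 1-11 AM unchanged
    return (hour12 % 12) + (12 if meridiem == 'PM' else 0)

assert to_24h(12, 'AM') == 0    # 12:01 AM -> 00:01
assert to_24h(12, 'PM') == 12   # 12:01 PM -> 12:01
assert to_24h(1, 'AM') == 1     # 01:30 AM -> 01:30
assert to_24h(1, 'PM') == 13    # 01:30 PM -> 13:30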
Now tested:
perl ampm-24hr.pl <<!
12/24/2005 12:01:00 AM
09/22/1999 12:00:00 PM
12/12/2005 01:15:00 PM
01/01/2009 01:56:45 AM
12/30/2009 10:00:00 PM
12/30/2009 10:00:00 AM
!
12/24/2005 00:01:00
09/22/1999 12:00:00
12/12/2005 13:15:00
01/01/2009 01:56:45
12/30/2009 22:00:00
12/30/2009 10:00:00
Added:
In What is a Simple Way to Convert Between an AM/PM Time and 24 hour Time in JavaScript, an alternative algorithm is provided for the conversion:
$hh = ($1 % 12) + (($2 eq 'AM') ? 0 : 12);
Just one test...probably neater.
It is a one-liner in Python:
time.strftime('%H:%M:%S', time.strptime(x, '%I:%M %p'))
Example:
>>> time.strftime('%H:%M:%S', time.strptime('08:01 AM', '%I:%M %p'))
'08:01:00'
>>> time.strftime('%H:%M:%S', time.strptime('12:01 AM', '%I:%M %p'))
'00:01:00'
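Applied to the question's full mm/dd/yyyy timestamps, the same strptime/strftime pair might look like:
from datetime import datetime

s = '12/24/2005 12:01:00 AM'
print(datetime.strptime(s, '%m/%d/%Y %I:%M:%S %p').strftime('%m/%d/%Y %H:%M:%S'))
# -> 12/24/2005 00:01:00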
Use Python's datetime module, something like this:
import datetime

infile = open('input.txt')
outfile = open('output.txt', 'w')
for line in infile.readlines():
    d = datetime.datetime.strptime(line.strip(), "input format string")
    outfile.write(d.strftime("output format string") + '\n')
Untested code with no error checking. Also, it reads the entire input file into memory before starting.
(I know there is plenty of room for improvement, like a with statement... I'll make this a community wiki entry if anyone would like to add something.)
To just convert the hour field, in Python:
def to12(hour24):
    return (hour24 % 12) if (hour24 % 12) > 0 else 12

def IsPM(hour24):
    return hour24 > 11

def to24(hour12, isPm):
    return (hour12 % 12) + (12 if isPm else 0)

def IsPmString(pm):
    return "PM" if pm else "AM"

def TestTo12():
    for x in range(24):
        print(x, to12(x), IsPmString(IsPM(x)))

def TestTo24():
    for pm in [False, True]:
        print(12, IsPmString(pm), to24(12, pm))
        for x in range(1, 12):
            print(x, IsPmString(pm), to24(x, pm))
This might be too simplistic, but why not import it into Excel, select the entire column, change the date format, and then re-export as a tab-delimited file? (I didn't test this, but it somehow sounds logical to me.)
Here I have converted a 24-hour timestamp to the 12-hour system. Try this method for your problem.
DateFormat fmt = new SimpleDateFormat("yyyyMMddHHssmm");
try {
    Date date = fmt.parse("20090310232344");
    System.out.println(date.toString());
    fmt = new SimpleDateFormat("dd-MMMM-yyyy hh:mm:ss a");
    String dateInString = fmt.format(date);
    System.out.println(dateInString);
} catch (Exception e) {
    System.out.println(e.getMessage());
}
RESULT:
Tue Mar 10 23:44:23 IST 2009
10-March-2009 11:44:23 PM
In Python: converting 12-hour time to 24-hour time:
import re

# expects input like "07:05:45PM"
time1 = input().strip().split(':')
m = re.search('(..)(..)', time1[2])  # split "45PM" into seconds and AM/PM
sec = m.group(1)
tz = m.group(2)
hour = int(time1[0])
if tz == 'PM':
    hour += 12
    if hour == 24:  # 12 PM stays 12
        hour -= 12
else:
    if hour == 12:  # 12 AM becomes 00
        hour = 0
print('%02d:%s:%s' % (hour, time1[1], sec))
Since you have multiple languages available, I'll suggest the following algorithm.
1. Check the timestamp for the existence of the "PM" string.
2a. If PM does not exist, simply convert the timestamp to a datetime object and proceed.
2b. If PM does exist, convert the timestamp to a datetime object, add 12 hours, and proceed (see the sketch below).
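A sketch of that algorithm in Python; note the 12 o'clock edge cases called out earlier still need handling:
from datetime import datetime, timedelta

def convert(ts):
    is_pm = ts.endswith('PM')
    d = datetime.strptime(ts[:-3], '%m/%d/%Y %I:%M:%S')  # drop the " AM"/" PM" suffix
    if is_pm and d.hour != 12:       # "add 12 hours" -- except for 12 PM itself
        d += timedelta(hours=12)
    if not is_pm and d.hour == 12:   # and 12 AM must wrap to 00
        d -= timedelta(hours=12)
    return d.strftime('%m/%d/%Y %H:%M:%S')

print(convert('12/24/2005 12:01:00 AM'))  # -> 12/24/2005 00:01:00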
