pyspark daterange calculations in spark - rdd

I am trying to process website login session data by each user. I am reading an S3 session log file into an RDD. The data looks something like this.
-----------------------------------------------------
User  | Site     | Session start   | Session end
-----------------------------------------------------
Joe   | Waterloo | 9/21/19 3:04 AM | 9/21/19 3:18 AM
Stacy | Kirkwood | 8/4/19 3:06 PM  | 8/4/19 3:54 PM
John  | Waterloo | 9/21/19 8:48 AM | 9/21/19 9:05 AM
Stacy | Kirkwood | 8/4/19 4:16 PM  | 8/4/19 5:41 PM
...
I want to find out how many users were logged in during each second of each hour on a given day.
Example: I might be processing this data for 9/21/19 only. I would need to remove all other records and then sum the user sessions for each second of each hour across all 24 hours of 9/21/19. The output would probably be 24 rows, one per hour of 9/21/19, plus counts for each second of the day (yikes, second-by-second data!).
Is this possible in PySpark using either RDDs or DataFrames?
(Apologies for the rough formatting of the grid.)
Thanks

my dataset
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType

data = [['Joe', 'Waterloo', '9/21/19 3:04 AM', '9/21/19 3:18 AM'],
        ['Stacy', 'Kirkwood', '8/4/19 3:06 PM', '8/4/19 3:54 PM'],
        ['John', 'Waterloo', '9/21/19 8:48 AM', '9/21/19 9:05 AM'],
        ['Stacy', 'Kirkwood', '9/21/19 4:06 PM', '9/21/19 4:54 PM'],
        ['Mo', 'Hashmi', '9/21/19 1:06 PM', '9/21/19 5:54 PM'],
        ['Murti', 'Hash', '9/21/19 1:00 PM', '9/21/19 3:00 PM'],
        ['Floo', 'Shmi', '9/21/19 9:10 PM', '9/21/19 11:54 PM']]
cSchema = StructType([StructField("User", StringType()),
                      StructField("Site", StringType()),
                      StructField("Sesh-Start", StringType()),
                      StructField("Sesh-End", StringType())])
df = spark.createDataFrame(data, schema=cSchema)
display(df)  # display() is available in Databricks notebooks; use df.show() elsewhere
parse timestamp
# "M/d/yy h:mm a" matches values such as 9/21/19 3:04 AM (two-digit year, 12-hour clock)
df1 = (df
    .withColumn("Start", F.to_timestamp("Sesh-Start", "M/d/yy h:mm a"))
    .withColumn("End", F.to_timestamp("Sesh-End", "M/d/yy h:mm a"))
    .drop("Sesh-Start", "Sesh-End"))
build and register udf, for multiple hours per person
from pyspark.sql.types import ArrayType, IntegerType

def yo(a, b):
    # Return every clock hour touched by a session that starts at a and ends at b.
    from datetime import datetime
    d1 = datetime.strptime(str(a), '%Y-%m-%d %H:%M:%S')
    d2 = datetime.strptime(str(b), '%Y-%m-%d %H:%M:%S')
    y = []
    if d1.hour == d2.hour:
        y.append(d1.hour)
    else:
        for i in range(d1.hour, d2.hour + 1):
            y.append(i)
    return y

rng = F.udf(yo, ArrayType(IntegerType()))
explode list of hours into column
df2=df1.withColumn("new", rng(F.col("Start"),F.col("End"))).withColumn("new1",F.explode("new")).drop("new")
get seconds for each hour
df3 = df2.withColumn("Seconds",
    F.when(F.hour("Start") == F.hour("End"), F.col("End").cast('long') - F.col("Start").cast('long'))
     .when(F.hour("Start") == F.col("new1"), 3600 - F.minute("Start") * 60)
     .when(F.hour("End") == F.col("new1"), F.minute("End") * 60)
     .otherwise(3600))
create temp view and query it
df3.createOrReplaceTempView("final")
display(spark.sql("Select new1, sum(Seconds) from final group by new1 order by new1"))
The other answer by Lennart could be more performant because it uses a join to get all the different hours, whereas I use a UDF, which could be slower. My code works for a user who is online for any number of hours within the day. My data contained only the required day, so you could use the day filter from that answer to limit your query to the day in question.
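To check the per-hour numbers without a cluster, the same splitting logic can be sketched in plain Python. This is only an illustration: the helper name seconds_per_hour is hypothetical, and it assumes a session starts and ends on the same day.

```python
from datetime import datetime, timedelta

def seconds_per_hour(start, end):
    """Split one session into {hour_of_day: seconds online}.

    Assumes start and end fall on the same calendar day.
    """
    out = {}
    cur = start
    while cur < end:
        # End of the current clock hour, or the session end if that comes first.
        next_hour = cur.replace(minute=0, second=0, microsecond=0) + timedelta(hours=1)
        stop = min(next_hour, end)
        out[cur.hour] = out.get(cur.hour, 0) + int((stop - cur).total_seconds())
        cur = stop
    return out

fmt = "%m/%d/%y %I:%M %p"  # matches values such as 9/21/19 3:04 AM
joe = seconds_per_hour(datetime.strptime("9/21/19 3:04 AM", fmt),
                       datetime.strptime("9/21/19 3:18 AM", fmt))
mo = seconds_per_hour(datetime.strptime("9/21/19 1:06 PM", fmt),
                      datetime.strptime("9/21/19 5:54 PM", fmt))
print(joe)  # {3: 840}
print(mo)   # {13: 3240, 14: 3600, 15: 3600, 16: 3600, 17: 3240}
```

Summing these dictionaries over all sessions gives the same totals as the group by new1 query, which makes it a handy cross-check on a small sample.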

Try this (sessions is assumed to be a DataFrame with user, start and end columns, where start and end are timestamps):
Initialize the filter.
from pyspark.sql import functions as F
dayFilter = F.to_date(F.lit("2019-09-21"))
startFilter = F.to_timestamp(F.lit("2019-09-21 00:00:00.000"))
endFilter = F.to_timestamp(F.lit("2019-09-21 23:59:59.999"))
Generate the range of hours (0 .. 23).
hours = spark.range(24)
Get the user sessions that match the filter, clipped to the filter day.
s = sessions.alias("s")
df = s.where((dayFilter >= F.to_date(s.start)) & (dayFilter <= F.to_date(s.end))) \
    .select(s.user, \
        F.when(s.start < startFilter, startFilter).otherwise(s.start).alias("start"), \
        F.when(s.end > endFilter, endFilter).otherwise(s.end).alias("end"))
Combine the matching user sessions with the range of hours.
df2 = df.join(hours, hours.id.between(F.hour(df.start), F.hour(df.end)), 'inner') \
    .select(df.user, hours.id.alias("hour"), \
        (F.when(F.hour(df.end) > hours.id, 3600).otherwise(F.minute(df.end) * 60 + F.second(df.end)) - \
         F.when(F.hour(df.start) < hours.id, 0).otherwise(F.minute(df.start) * 60 + F.second(df.start))).alias("seconds"))
Generate the summary: count the users and sum the session seconds for each hour.
df2.groupBy(df2.hour) \
    .agg(F.count(df2.user).alias("user counts"), \
         F.sum(df2.seconds).alias("seconds")) \
    .show()
Hope this helps.
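Both answers aggregate per hour, while the question also mentions a count for every second of the day. A minimal plain-Python sketch of that idea follows, using a difference array over the 86400 seconds of the day. The function name users_per_second is hypothetical, sessions are treated as half-open intervals [start, end), and the list is small enough to fit in memory; a large log would need the same idea expressed as a distributed job.

```python
from datetime import date, datetime, timedelta
from itertools import accumulate

def users_per_second(sessions, day):
    """Return 86400 counts: users logged in during each second of `day`."""
    day_start = datetime(day.year, day.month, day.day)
    day_end = day_start + timedelta(days=1)
    diff = [0] * (86400 + 1)
    for start, end in sessions:
        # Clip the session to the day and skip sessions that do not overlap it.
        s, e = max(start, day_start), min(end, day_end)
        if s >= e:
            continue
        diff[int((s - day_start).total_seconds())] += 1  # user appears
        diff[int((e - day_start).total_seconds())] -= 1  # user leaves
    # Running sum of the +1/-1 markers gives the concurrent count per second.
    return list(accumulate(diff))[:86400]

fmt = "%m/%d/%y %I:%M %p"
rows = [("9/21/19 3:04 AM", "9/21/19 3:18 AM"),
        ("8/4/19 3:06 PM", "8/4/19 3:54 PM"),
        ("9/21/19 8:48 AM", "9/21/19 9:05 AM"),
        ("9/21/19 4:06 PM", "9/21/19 4:54 PM"),
        ("9/21/19 1:06 PM", "9/21/19 5:54 PM"),
        ("9/21/19 1:00 PM", "9/21/19 3:00 PM"),
        ("9/21/19 9:10 PM", "9/21/19 11:54 PM")]
sessions = [(datetime.strptime(a, fmt), datetime.strptime(b, fmt)) for a, b in rows]
counts = users_per_second(sessions, date(2019, 9, 21))

print(counts[14 * 3600])                   # users online at 14:00:00
print(sum(counts[13 * 3600:14 * 3600]))    # user-seconds during hour 13
```

Summing a 3600-element slice of the result gives the per-hour seconds total, so the per-second and per-hour views can be checked against each other.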

Related

How do I show time in ASP.NET?

I have a label on my ASP.NET web site that shows the time. I want the output to look like this: 08:26 in the morning and, after 12 noon, 15:28.
My code does not work. It only handles the first part.
DateTime tim = DateTime.Now;
int hh = p.GetHour(tim);
int mm = p.GetMinute(tim);
Label7.Text = DateTime.Now.ToString("hh:mm");
According to the Custom date and time format strings docs page - you can see:
"hh" The hour, using a 12-hour clock from 01 to 12.
"HH" The hour, using a 24-hour clock from 00 to 23.
So in your case - just use the capitalized HH for your formatting:
Label7.Text = DateTime.Now.ToString("HH:mm");
and you should get what you're looking for.

I want to find the day difference between 2 date column in azure app insight?

We have a log file where we store the searches happening on our platform. Each search has a departure date, and I want to find the searches where the departure date is more than 330 days from today.
I am trying to run a query to find the difference between the departure date column and logTime (the time the event was entered into the log), but I get the error below:
Query could not be parsed at 'datetime("departureDate")' on line [5,54]
Token: datetime("departureDate")
Line: 5
Position: 54
Date format of departure date is mm/dd/yyyy and logtime format is typical datetime format of app insight.
Query that I am running is below:
customEvents
| where name == "SearchLog"
| extend departureDate = tostring(customDimensions.departureDate)
| extend logTime = tostring(customDimensions.logTime)
| where datetime_diff('day',datetime("departureDate"),datetime("logTime")) > 200
As suggested, I ran the query below, but now I get 0 results even though there is data that satisfies the given criteria.
customEvents
| where name == "SearchLog"
| extend departureDate = tostring(customDimensions.departureDate)
| extend logTime = tostring(customDimensions.logTime)
| where datetime_diff('day',todatetime(departureDate),todatetime(logTime)) > 200
Example:
departureDate
04/09/2020
logTime
8/13/2019 8:45:39 AM -04:00
I also tried the below query to check whether data format is supported or not and it gave correct response.
customEvents
| project datetime_diff('day', datetime('04/30/2020'),datetime('8/13/2019 8:25:51 AM -04:00'))
Please use the query below. It uses todatetime to convert the strings to datetime values.
customEvents
| where name == "SearchLog"
| extend departureDate = tostring(customDimensions.departureDate)
| extend logTime = tostring(customDimensions.logTime)
| where datetime_diff('day',todatetime(departureDate),todatetime(logTime)) > 200
The double quotes inside the datetime operator in the where clause should be removed.
Your code should look like this:
where datetime_diff('day',datetime(departureDate),datetime(logTime)) > 200

Pyspark: Convert String Datetime in 12 hour Clock to Date time with 24 hour clock (Time Zone Change)

Edit: Apologies, the sample data frame is a little off. Below is the corrected sample dataframe I'm trying to convert:
Timestamp (CST)
12/8/2018 05:23 PM
11/29/2018 10:20 PM
I tried the following code based on recommendation below but got null values returned.
df = df.withColumn('Timestamp (CST)_2', from_unixtime(unix_timestamp(col(('Timestamp (CST)')), "yyyy/MM/dd hh:mm:ss aa"), "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'"))
df = df.withColumn("Timestamp (CST)_3", F.to_timestamp(F.col("Timestamp (CST)_2")))
--------------------------------------------------------------------------------
I have a field called "Timestamp (CST)" that is a string. It is in Central Standard Time.
Timestamp (CST)
2018-11-21T5:28:56 PM
2018-11-21T5:29:16 PM
How do I create a new column that takes "Timestamp (CST)", changes it to UTC, and converts it to a timestamp on the 24-hour clock?
Below is my desired table and I would like the datatype to be timestamp:
Timestamp (CST)_2
2018-11-21T17:28:56.000Z
2018-11-21T17:29:16.000Z
I tried the following code but all the results came back null:
df = df.withColumn("Timestamp (CST)_2", to_timestamp("Timestamp (CST)", "yyyy/MM/dd h:mm p"))
Firstly, import from_unixtime, unix_timestamp and col using
from pyspark.sql.functions import from_unixtime, unix_timestamp, col
Then, reconstructing your scenario in a DataFrame df_time
>>> cols = ['Timestamp (CST)']
>>> vals = [
... ('2018-11-21T5:28:56 PM',),
... ('2018-11-21T5:29:16 PM',)]
>>> df_time = spark.createDataFrame(vals, cols)
>>> df_time.show(2, False)
+---------------------+
|Timestamp (CST) |
+---------------------+
|2018-11-21T5:28:56 PM|
|2018-11-21T5:29:16 PM|
+---------------------+
Then, my approach would be
>>> df_time_twenfour = df_time.withColumn('Timestamp (CST)', \
... from_unixtime(unix_timestamp(col(('Timestamp (CST)')), "yyyy-MM-dd'T'hh:mm:ss aa"), "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'"))
>>> df_time_twenfour.show(2, False)
+------------------------+
|Timestamp (CST) |
+------------------------+
|2018-11-21T17:28:56.000Z|
|2018-11-21T17:29:16.000Z|
+------------------------+
Notes
If you want the time in 24-hour format, use HH instead of hh.
Since the input has a PM marker, use aa in yyyy-MM-dd'T'hh:mm:ss aa to parse it.
Your input string has a T in it, so the format must include it (quoted) as shown above.
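The approach above changes the clock format but does not shift the value from Central time to UTC, which the question also asks for. A minimal plain-Python sketch of the shift is below; it assumes a fixed UTC-6 offset for CST with no daylight saving, and the helper name cst_to_utc_iso is hypothetical. In Spark, pyspark.sql.functions.to_utc_timestamp is the function to look at for the same step.

```python
from datetime import datetime, timedelta, timezone

# Assumption: "CST" means a fixed UTC-6 offset (no daylight saving adjustment).
CST = timezone(timedelta(hours=-6), "CST")

def cst_to_utc_iso(text):
    """Parse a 12-hour CST string and return a 24-hour UTC string."""
    local = datetime.strptime(text, "%Y-%m-%dT%I:%M:%S %p").replace(tzinfo=CST)
    return local.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S.000Z")

print(cst_to_utc_iso("2018-11-21T5:28:56 PM"))  # 2018-11-21T23:28:56.000Z
```

Note that 5:28:56 PM in CST is 17:28:56 local time and 23:28:56 in UTC, so a trailing Z on 17:28:56 would label a local value as UTC.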
The aa option mentioned in @pyy4917's answer might give legacy errors (on Spark 3.0 and later the new datetime parser may reject it). To fix it, replace aa with a.
The full code is below:
df_time_twenfour = df_time.withColumn('Timestamp (CST)', \
    from_unixtime(unix_timestamp(col('Timestamp (CST)'), \
        "yyyy-MM-dd'T'hh:mm:ss a"), "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'"))

correct sum of hours in access

I have two columns in an access 2010 database with some calculated field:
time_from  time_until  calculated_field (time_until - time_from)
10:45      15:00       4:15
13:15      16:00       2:45
11:10      16:00       4:50
08:00      15:00       7:00
08:00      23:00       15:00
So far, so good: the calculated field does its job and tells me the total hours and minutes.
Now I need a sum of the calculated field.
In the expression builder I put: =Sum([time_until]-[time_from])
I expected the total to be 33:50, but it gives me 9:50. Why is this happening? Is there a way to fix this?
Update:
When I write it like this:
=Format(Sum([vrijeme_do]-[vrijeme_od])*24)
I get a decimal number, which I suppose is correct.
For example, 25 hours and 30 minutes is shown as 25,5.
But how do I format this 25,5 to look like 25:30?
As @Arvo mentioned in his comment, this is a formatting problem. Your expected result for the sum of calculated_field is 33:50. However, that sum is a Date/Time value, and since the number of hours is greater than 24, the day portion of the Date/Time is advanced by 1 and the remainder, 9:50, is displayed as the time. Apparently your total is formatted to display only the time portion; the day portion is not displayed.
But the actual Date/Time value for the sum of calculated_field is #12/31/1899 09:50#. You can use a custom function to display that value in your desired format:
? duration_hhnn(#12/31/1899 09:50#)
33:50
This is the function:
Public Function duration_hhnn(ByVal pInput As Date) As String
    Dim lngDays As Long
    Dim lngMinutes As Long
    Dim lngHours As Long
    Dim strReturn As String
    lngDays = Int(pInput)
    lngHours = Hour(pInput)
    lngMinutes = Minute(pInput)
    lngHours = lngHours + (lngDays * 24)
    strReturn = lngHours & ":" & Format(lngMinutes, "00")
    duration_hhnn = strReturn
End Function
Note the function returns a string value so you can't do further date arithmetic on it directly.
Similar to the answer from @HansUp, it can be done without VBA code like so:
Format(24 * Int(SUM(elapsed_time)) + Hour(SUM(elapsed_time)), "0") & ":" & Format(SUM(elapsed_time), "Nn")
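The arithmetic behind both expressions (whole days times 24, plus the hour part, then the minutes) can be checked outside Access. The following Python sketch uses the five durations from the question; the helper name duration_hhmm is hypothetical and is only meant to illustrate the wrap-around, not to replace the Access expressions.

```python
from datetime import timedelta

# The five calculated_field values from the question.
durations = [timedelta(hours=4, minutes=15),
             timedelta(hours=2, minutes=45),
             timedelta(hours=4, minutes=50),
             timedelta(hours=7),
             timedelta(hours=15)]
total = sum(durations, timedelta())

def duration_hhmm(td):
    """Format a duration as total hours and minutes, without wrapping at 24 hours."""
    total_minutes = int(td.total_seconds()) // 60
    return "%d:%02d" % divmod(total_minutes, 60)

# The sum is 1 day plus 9:50, which is why a time-only format shows 9:50.
print(total.days, total.seconds // 3600, (total.seconds % 3600) // 60)  # 1 9 50
print(duration_hhmm(total))                   # 33:50
print(duration_hhmm(timedelta(hours=25.5)))   # 25:30
```

The last line covers the decimal case from the update: 25,5 hours is 25 hours and 30 minutes.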
I guess you are trying to show the total in a text box? The correct expression would be =SUM([calculated_field_name]).

Convert 12-hour date/time to 24-hour date/time

I have a tab delimited file where each record has a timestamp field in 12-hour format:
mm/dd/yyyy hh:mm:ss [AM|PM].
I need to quickly convert these fields to 24-hour time:
mm/dd/yyyy HH:mm:ss.
What would be the best way to do this? I'm running on a Windows platform, but I have access to sed, awk, perl, python, and tcl in addition to the usual Windows tools.
Using Perl and hand-crafted regexes instead of facilities like strptime:
#!/bin/perl -w
while (<>)
{
    # for date times that don't use leading zeroes, use this regex instead:
    # (?:\d{1,2}/\d{1,2}/\d{4} )(\d{1,2})(?::\d\d:\d\d) (AM|PM)
    while (m%(?:\d\d/\d\d/\d{4} )(\d\d)(?::\d\d:\d\d) (AM|PM)%)
    {
        my $hh = $1;
        $hh -= 12 if ($2 eq 'AM' && $hh == 12);
        $hh += 12 if ($2 eq 'PM' && $hh != 12);
        $hh = sprintf "%02d", $hh;
        # for date times that don't use leading zeroes, use this regex instead:
        # (\d{1,2}/\d{1,2}/\d{4} )(\d{1,2})(:\d\d:\d\d) (?:AM|PM)
        s%(\d\d/\d\d/\d{4} )(\d\d)(:\d\d:\d\d) (?:AM|PM)%$1$hh$3%;
    }
    print;
}
That's very fussy - but also converts possibly multiple timestamps per line.
Note that the transformation for AM/PM to 24-hour is not trivial.
12:01 AM --> 00:01
12:01 PM --> 12:01
01:30 AM --> 01:30
01:30 PM --> 13:30
Now tested:
perl ampm-24hr.pl <<!
12/24/2005 12:01:00 AM
09/22/1999 12:00:00 PM
12/12/2005 01:15:00 PM
01/01/2009 01:56:45 AM
12/30/2009 10:00:00 PM
12/30/2009 10:00:00 AM
!
12/24/2005 00:01:00
09/22/1999 12:00:00
12/12/2005 13:15:00
01/01/2009 01:56:45
12/30/2009 22:00:00
12/30/2009 10:00:00
Added:
In What is a Simple Way to Convert Between an AM/PM Time and 24 hour Time in JavaScript, an alternative algorithm is provided for the conversion:
$hh = ($1 % 12) + (($2 eq 'AM') ? 0 : 12);
Just one test...probably neater.
It is a 1-line thing in python:
time.strftime('%H:%M:%S', time.strptime(x, '%I:%M %p'))
Example:
>>> time.strftime('%H:%M:%S', time.strptime('08:01 AM', '%I:%M %p'))
'08:01:00'
>>> time.strftime('%H:%M:%S', time.strptime('12:01 AM', '%I:%M %p'))
'00:01:00'
Use Python's datetime module, something like this:
import datetime
infile = open('input.txt')
outfile = open('output.txt', 'w')
for line in infile.readlines():
    # strip the trailing newline so it does not break strptime
    d = datetime.datetime.strptime(line.strip(), "input format string")
    outfile.write(d.strftime("output format string") + "\n")
Untested code with no error checking. Also it reads the entire input file in memory before starting.
(I know there is plenty of room for improvements like with statement...I make this a community wiki entry if anyone likes to add something)
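For the tab-delimited file in the question, where the timestamp is one field among others, a regex substitution can hand each match to strptime. This is a sketch under the stated format (mm/dd/yyyy hh:mm:ss AM|PM); the names STAMP and to_24h are hypothetical, and AM/PM parsing with %p assumes an English or C locale.

```python
import re
from datetime import datetime

# Matches timestamps such as 12/24/2005 12:01:00 AM anywhere in a line.
STAMP = re.compile(r"\d{1,2}/\d{1,2}/\d{4} \d{1,2}:\d{2}:\d{2} (?:AM|PM)")

def to_24h(line):
    """Rewrite every 12-hour timestamp in the line as a 24-hour timestamp."""
    def convert(match):
        parsed = datetime.strptime(match.group(0), "%m/%d/%Y %I:%M:%S %p")
        return parsed.strftime("%m/%d/%Y %H:%M:%S")
    return STAMP.sub(convert, line)

print(to_24h("id42\t12/24/2005 12:01:00 AM\tok"))
print(to_24h("12/12/2005 01:15:00 PM"))  # 12/12/2005 13:15:00
```

Because strftime zero-pads, an input such as 1/5/2009 comes out as 01/05/2009; fields that do not match the pattern pass through unchanged.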
To just convert the hour field, in python:
def to12(hour24):
    return (hour24 % 12) if (hour24 % 12) > 0 else 12

def IsPM(hour24):
    return hour24 > 11

def to24(hour12, isPm):
    return (hour12 % 12) + (12 if isPm else 0)

def IsPmString(pm):
    return "PM" if pm else "AM"

def TestTo12():
    for x in range(24):
        print(x, to12(x), IsPmString(IsPM(x)))

def TestTo24():
    for pm in [False, True]:
        print(12, IsPmString(pm), to24(12, pm))
        for x in range(1, 12):
            print(x, IsPmString(pm), to24(x, pm))
This might be too simple thinking, but why not import it into excel, select the entire column and change the date format, then re-export as a tab delimited file? (I didn't test this, but it somehow sounds logical to me :)
Here I have converted from the 24-hour system to the 12-hour system.
Try adapting this method to your problem.
DateFormat fmt = new SimpleDateFormat("yyyyMMddHHssmm"); // note: this pattern reads seconds (ss) before minutes (mm)
try {
    Date date = fmt.parse("20090310232344");
    System.out.println(date.toString());
    fmt = new SimpleDateFormat("dd-MMMM-yyyy hh:mm:ss a ");
    String dateInString = fmt.format(date);
    System.out.println(dateInString);
} catch (Exception e) {
    System.out.println(e.getMessage());
}
RESULT:
Tue Mar 10 23:44:23 IST 2009
10-March-2009 11:44:23 PM
In Python: Converting 12hr time to 24hr time
import re
# Expects input such as 07:05:45PM (no space before AM/PM).
time1 = input().strip().split(':')
m = re.search('(..)(..)', time1[2])
sec = m.group(1)
tz = m.group(2)
hour = int(time1[0])
if tz == 'PM':
    # 12 PM stays 12; every other PM hour moves to the afternoon.
    if hour != 12:
        hour += 12
else:
    # 12 AM is hour 0 on the 24-hour clock.
    if hour == 12:
        hour = 0
print('%02d:%s:%s' % (hour, time1[1], sec))
Since you have multiple languages available, I'll suggest the following algorithm.
1 Check the timestamp for the existence of the "PM" string.
2a If PM does not exist, convert the timestamp to a datetime object (treating an hour of 12 as 0) and proceed.
2b If PM does exist, convert the timestamp to a datetime object, add 12 hours unless the hour is already 12, and proceed.

Resources