I am trying to process website login session data by each user. I am reading an S3 session log file into an RDD. The data looks something like this.
User  | Site     | Session start   | Session end
------+----------+-----------------+----------------
Joe   | Waterloo | 9/21/19 3:04 AM | 9/21/19 3:18 AM
Stacy | Kirkwood | 8/4/19 3:06 PM  | 8/4/19 3:54 PM
John  | Waterloo | 9/21/19 8:48 AM | 9/21/19 9:05 AM
Stacy | Kirkwood | 8/4/19 4:16 PM  | 8/4/19 5:41 PM
...
...
I want to find out how many users were logged in during each second of each hour on a given day.
Example: I might be processing this data for 9/21/19 only. So I would need to remove all other records and then sum the user sessions for each second of the hour, for all 24 hours of 9/21/19. The output would possibly be 24 rows, one per hour of 9/21/19, with counts for each second of the day (yikes, second-by-second data!).
Is this possible in PySpark using either RDDs or DataFrames?
(Apologies for the rough formatting of the grid.)
Thanks
my dataset
data=[['Joe','Waterloo','9/21/19 3:04 AM','9/21/19 3:18 AM'],['Stacy','Kirkwood','8/4/19 3:06 PM','8/4/19 3:54 PM'],['John','Waterloo','9/21/19 8:48 AM','9/21/19 9:05 AM'],
['Stacy','Kirkwood','9/21/19 4:06 PM', '9/21/19 4:54 PM'],
['Mo','Hashmi','9/21/19 1:06 PM', '9/21/19 5:54 PM'],
['Murti','Hash','9/21/19 1:00 PM', '9/21/19 3:00 PM'],
['Floo','Shmi','9/21/19 9:10 PM', '9/21/19 11:54 PM']]
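from pyspark.sql import functions as F
from pyspark.sql.functions import udf, when
from pyspark.sql.types import StructType, StructField, StringType, ArrayType, IntegerType
cSchema = StructType([StructField("User", StringType()),
                      StructField("Site", StringType()),
                      StructField("Sesh-Start", StringType()),
                      StructField("Sesh-End", StringType())])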
df= spark.createDataFrame(data,schema=cSchema)
display(df)
parse timestamp
df1 = df.withColumn("Start", F.from_unixtime(F.unix_timestamp("Sesh-Start", 'MM/dd/yyyy hh:mm aa'), '20yy-MM-dd HH:mm:ss').cast("timestamp")) \
    .withColumn("End", F.from_unixtime(F.unix_timestamp("Sesh-End", 'MM/dd/yyyy hh:mm aa'), '20yy-MM-dd HH:mm:ss').cast("timestamp")) \
    .drop("Sesh-Start", "Sesh-End")
build and register udf, for multiple hours per person
def yo(a, b):
    from datetime import datetime
    d1 = datetime.strptime(str(a), '%Y-%m-%d %H:%M:%S')
    d2 = datetime.strptime(str(b), '%Y-%m-%d %H:%M:%S')
    y = []
    if d1.hour == d2.hour:
        y.append(d1.hour)
    else:
        for i in range(d1.hour, d2.hour + 1):
            y.append(i)
    return y
rng = udf(yo, ArrayType(IntegerType()))
explode list of hours into column
df2=df1.withColumn("new", rng(F.col("Start"),F.col("End"))).withColumn("new1",F.explode("new")).drop("new")
get seconds for each hour
df3 = df2.withColumn("Seconds",
    when(F.hour("Start") == F.hour("End"), F.col("End").cast('long') - F.col("Start").cast('long'))
    .when(F.hour("Start") == F.col("new1"), 3600 - F.minute("Start") * 60)
    .when(F.hour("End") == F.col("new1"), F.minute("End") * 60)
    .otherwise(3600))
create temp view and query it
df3.createOrReplaceTempView("final")
display(spark.sql("Select new1, sum(Seconds) from final group by new1 order by new1"))
The other answer, by Lennart, could be more performant because it uses a join to get all the different hours, whereas I use a UDF, which could be slower. My code works for a user who is online for any number of hours. My data contained only the required day, so you could use the day filter from that answer to limit your query to the day in question.
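The per-hour split that the UDF and the `when` chain compute can also be checked in plain Python, without Spark. This is only a sketch: `seconds_per_hour` is a made-up helper name, and it assumes a session starts and ends on the same calendar day.

```python
from datetime import datetime, timedelta

def seconds_per_hour(start, end):
    """Split one session into {hour_of_day: seconds online}.

    Assumption: start and end fall on the same calendar day.
    """
    out = {}
    cur = start
    while cur < end:
        # Top of the next hour after the current position.
        next_hour = cur.replace(minute=0, second=0, microsecond=0) + timedelta(hours=1)
        stop = min(next_hour, end)
        out[cur.hour] = out.get(cur.hour, 0) + int((stop - cur).total_seconds())
        cur = stop
    return out

# Mo's session from the sample data: 1:06 PM to 5:54 PM on 9/21/19.
mo = seconds_per_hour(datetime(2019, 9, 21, 13, 6), datetime(2019, 9, 21, 17, 54))
print(mo)  # {13: 3240, 14: 3600, 15: 3600, 16: 3600, 17: 3240}
```

Summing these dictionaries over all sessions, hour by hour, should give the same totals as the final `group by new1` query for sessions that stay within one day.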
Try to check this:
Initialize the filter.
from pyspark.sql.functions import (count, hour, lit, minute, second,
                                   sum as sum_, to_date, to_timestamp, when)
filter = to_date(lit("2019-09-21"))
startFilter = to_timestamp(lit("2019-09-21 00:00:00.000"))
endFilter = to_timestamp(lit("2019-09-21 23:59:59.999"))
Generate the range of hours (0 .. 23) as a DataFrame.
hours = spark.range(24)
Get the user sessions that overlap the filter day, clipped to that day.
s = sessions.alias("s")
df = s \
    .where((filter >= to_date(s.start)) & (filter <= to_date(s.end))) \
    .select(s.user, \
        when(s.start < startFilter, startFilter).otherwise(s.start).alias("start"), \
        when(s.end > endFilter, endFilter).otherwise(s.end).alias("end"))
Combine the matched user sessions with the range of hours.
df2 = df.join(hours, hours.id.between(hour(df.start), hour(df.end)), 'inner') \
    .select(df.user, hours.id.alias("hour"), \
        (when(hour(df.end) > hours.id, 3600).otherwise(minute(df.end) * 60 + second(df.end)) - \
         when(hour(df.start) < hours.id, 0).otherwise(minute(df.start) * 60 + second(df.start))).alias("seconds"))
Generate the summary: the user count and the sum of seconds for each hour of sessions.
df2.groupBy(df2.hour) \
    .agg(count(df2.user).alias("user counts"), \
         sum_(df2.seconds).alias("seconds")) \
    .show()
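To illustrate the clipping step on its own, here is a small plain-Python sketch of trimming a session to the day of interest. `clip_to_day` is a hypothetical helper, not part of any library, and unlike the 23:59:59.999 end filter above it treats the day as the half-open interval from midnight to the next midnight.

```python
from datetime import date, datetime, time, timedelta

def clip_to_day(start, end, day):
    """Return the part of [start, end) that falls on `day`, or None if there is none."""
    day_start = datetime.combine(day, time.min)
    day_end = day_start + timedelta(days=1)
    if end <= day_start or start >= day_end:
        return None  # the session does not touch this day
    return max(start, day_start), min(end, day_end)

target = date(2019, 9, 21)
# A session that began late on the previous evening is trimmed to start at midnight.
print(clip_to_day(datetime(2019, 9, 20, 23, 30), datetime(2019, 9, 21, 0, 15), target))
# A session on a different day is dropped.
print(clip_to_day(datetime(2019, 8, 4, 15, 6), datetime(2019, 8, 4, 15, 54), target))
```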
Hope this helps.
These lines give the date and time in UTC:
t:timedate(absolute_real_time() - (10*3600));
t0:substring(t,1,20);
t1:concat(substring(t,12,17), " ", substring(t,9,11), "/", substring(t,6,8), "/", substring(t,1,5));
t2:concat(substring(t,1,5), substring(t,6,8), substring(t,9,11), substring(t,12,14), substring(t,15,17), substring(t,18,20));
I know that '?\*autoconf\-version\*;' can give the Maxima version number, so maybe there is some undocumented way to get the local time.
Otherwise, are there any ready-made functions that can convert UTC time to local time given conditions for the start/end of daylight saving time, e.g. UTC time to UK time (which is GMT/BST depending on the time of year)?
It's not clear to me exactly what you need, but perhaps the following helps. By the way, do you really need to extract the parts (year, month, day, etc)? If so, it might be more convenient to work directly in Lisp. See DECODE-UNIVERSAL-TIME at the Common Lisp Hyperspec (a web search will find it).
The timedate function now (in the just-released Maxima 5.39) accepts an optional argument, which is the time zone offset in hours (plus or minus). The time zone offset may be noninteger (e.g. 2.5). Offset 0 indicates UTC. If the offset is omitted, the time is formatted in the local time zone.
(%i5) t:absolute_real_time();
(%o5) 3691202499
(%i6) timedate (t, 0);
(%o6) 2016-12-20 06:01:39+00:00
(%i7) timedate (t);
(%o7) 2016-12-19 22:01:39-08:00
Note that the daylight saving time flag is applied at the "time of the time". Here is a time from next summer, when daylight saving time is in effect.
(%i8) timedate (t + 6*30.25*24*3600);
(%o8) 2017-06-19 11:01:39-07:00
The parse_timedate function has also been updated (in Maxima 5.39) to recognize time zone offsets.
(%i9) parse_timedate ("2016-12-19 22:01:39-08:00");
(%o9) 3691202499
As with timedate, if the offset is omitted, the time is assumed to be in the local time zone.
(%i10) parse_timedate ("2016-12-19 22:01:39");
(%o10) 3691202499
Note also that Maxima does not recognize any symbolic time zone indicators such as "UTC", "GMT", "EDT", "America/New_York", etc., only numerical time zone offsets.
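For comparison, the same conversion can be sketched outside Maxima. The values returned by absolute_real_time() are Lisp universal times, i.e. seconds since 1900-01-01 00:00:00 UTC, so a few lines of Python reproduce the outputs above. `format_universal` is a name I made up for this sketch; it only mimics the numeric-offset behaviour and knows nothing about a local time zone or daylight saving.

```python
from datetime import datetime, timedelta, timezone

# Lisp universal time counts seconds from 1900-01-01 00:00:00 UTC.
LISP_EPOCH = datetime(1900, 1, 1, tzinfo=timezone.utc)

def format_universal(secs, offset_hours=0.0):
    """Format a universal time with a numeric time zone offset given in hours."""
    tz = timezone(timedelta(hours=offset_hours))
    moment = LISP_EPOCH + timedelta(seconds=secs)
    return moment.astimezone(tz).isoformat(sep=" ")

print(format_universal(3691202499, 0))   # 2016-12-20 06:01:39+00:00
print(format_universal(3691202499, -8))  # 2016-12-19 22:01:39-08:00
```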
To clarify the problem before revealing the solution, these are the steps that I take in Maxima v5.30 to get the time in UTC, in a readable format.
Note: When I use Maxima v5.30 (in the UK), the time is, for some unknown reason, always UTC adjusted by 10 hours, and it does not adjust for DST.
/* 1st Jan 2017 12 noon: */
timedate(3692260800); /* "2017-01-01 22:00:00+10:00" */
timedate(3692260800-10*3600); /* "2017-01-01 12:00:00+10:00" */
substring(timedate(3692260800-10*3600),1,20); /* "2017-01-01 12:00:00" */
Note: timedate works better/differently in later versions of Maxima,
but some institutions recommend installing a specific version of Maxima.
Sometimes I want the date in the form: 'yyyyMMddHHmmss'.
A function for this is:
SecUTCToDate(vSec,vHour):=
block([d1,d2],
d1:timedate(vSec+vHour*3600),
d2:concat(substring(d1,1,5), substring(d1,6,8), substring(d1,9,11), substring(d1,12,14), substring(d1,15,17), substring(d1,18,20)),
parse_string(d2)
);
Note: [d1,d2] keeps those variables local to within the block, and not global.
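As a cross-check of the 'yyyyMMddHHmmss' form, here is a rough Python equivalent. It assumes the input is a Lisp universal time (seconds since 1900-01-01 UTC) and, like SecUTCToDate, simply adds a whole number of hours before formatting. `universal_to_stamp` is a hypothetical name used only for this sketch.

```python
from datetime import datetime, timedelta

# Lisp universal time counts seconds from 1900-01-01 00:00:00 UTC.
LISP_EPOCH = datetime(1900, 1, 1)

def universal_to_stamp(secs, hours=0):
    """Return yyyymmddhhmmss as an integer for secs shifted by `hours`."""
    moment = LISP_EPOCH + timedelta(seconds=secs + hours * 3600)
    return int(moment.strftime("%Y%m%d%H%M%S"))

# 3692260800 is 12 noon UTC on 1st Jan 2017.
print(universal_to_stamp(3692260800))     # 20170101120000
print(universal_to_stamp(3692260800, 1))  # 20170101130000
```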
To get the local time I have to add on hours based on my time zone (0 in the UK), and DST.
To calculate whether a time is within the DST period requires an individual function per time zone: in the UK, and many European countries, one such function is:
/* correct for the years 1900-2200 inclusive */
SecUTCIsDSTUK(vSec):=
block([vYear,vLeap,vDaysMar25,vDaysOct25,vWDayMar25,vWDayOct25,vRange1,vRange2],
vYear : parse_string(substring(timedate(vSec),1,5)),
vLeap : floor((vYear-1900)/4), if (vYear>=2100) then vLeap : vLeap-1,
vDaysMar25 : (vYear-1900)*365 + vLeap + 83,
vDaysOct25 : vDaysMar25 + 214,
vWDayMar25 : mod(vDaysMar25+1,7),
vWDayOct25 : mod(vDaysOct25+1,7),
vRange1 : (vDaysMar25+mod(-vWDayMar25,7))*86400 + 3600,
vRange2 : (vDaysOct25+mod(-vWDayOct25,7))*86400 + 3600,
if ((vSec >= vRange1) and (vSec < vRange2)) then 1 else 0);
You can create a mac file with such a function, and call up the function when needed, e.g.:
load("C:\\MyFolder\\MyFile.mac");
SecUTCIsDSTUK(absolute_real_time());
SecUTCIsDSTUK(absolute_real_time()+86400*180);
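The last-Sunday rule can be cross-checked with a short Python sketch. It assumes the current UK rule (clocks change at 01:00 UTC on the last Sunday of March and of October) and applies it to every year, which is not historically accurate for older dates. The names `last_sunday_1am` and `is_uk_dst` are made up for this example.

```python
from datetime import datetime, timedelta, timezone

# Lisp universal time counts seconds from 1900-01-01 00:00:00 UTC.
LISP_EPOCH = datetime(1900, 1, 1, tzinfo=timezone.utc)

def last_sunday_1am(year, month):
    """01:00 UTC on the last Sunday of a 31-day month (March or October)."""
    d = datetime(year, month, 31, 1, 0, tzinfo=timezone.utc)
    # weekday(): Monday=0 .. Sunday=6, so step back to the most recent Sunday.
    return d - timedelta(days=(d.weekday() + 1) % 7)

def is_uk_dst(universal_secs):
    """True if the given universal time falls inside the assumed UK DST period."""
    t = LISP_EPOCH + timedelta(seconds=universal_secs)
    return last_sunday_1am(t.year, 3) <= t < last_sunday_1am(t.year, 10)

print(is_uk_dst(3691202499))  # 2016-12-20 06:01:39 UTC -> False
print(is_uk_dst(3706754499))  # 2017-06-18 06:01:39 UTC -> True
```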
thank you for your helpful response,
results (v. 5.39.0) (works fine, param 2 omitted gives local time, param 2 as 0 gives UTC):
t:3691202499;
timedate (t);
timedate (t + 6*30.25*24*3600);
timedate (t + 6*30*24*3600);
timedate (t, 0);
timedate (t + 6*30.25*24*3600, 0);
timedate (t + 6*30*24*3600, 0);
:lisp (decode-universal-time 3691202499)
:lisp (decode-universal-time 3691202499 0)
:lisp (decode-universal-time 3706754499)
:lisp (decode-universal-time 3706754499 0)
3691202499
"2016-12-20 06:01:39+00:00"
"2017-06-19 19:01:39+01:00"
"2017-06-18 07:01:39+01:00"
"2016-12-20 06:01:39+00:00"
"2017-06-19 18:01:39+00:00"
"2017-06-18 06:01:39+00:00"
39 1 6 20 12 2016 1 NIL 0
39 1 6 20 12 2016 1 NIL 0
39 1 7 18 6 2017 6 T 0
39 1 6 18 6 2017 6 NIL 0
results (v. 5.30.0) (it seems param 2 omitted gives UTC+10, with no daylight saving time):
(if this is true, I would have to find another way to get local time, possibly by Common LISP commands)
t:3691202499;
timedate (t);
timedate (t + 6*30.25*24*3600);
timedate (t + 6*30*24*3600);
:lisp (decode-universal-time 3691202499)
:lisp (decode-universal-time 3691202499 0)
:lisp (decode-universal-time 3706754499)
:lisp (decode-universal-time 3706754499 0)
3691202499
"2016-12-20 16:01:39+10:00"
"2017-06-20 04:01:39.0+10:00"
"2017-06-18 16:01:39+10:00"
39 1 16 20 12 2016 1 NIL -10
39 1 6 20 12 2016 1 NIL 0
39 1 16 18 6 2017 6 NIL -10
39 1 6 18 6 2017 6 NIL 0
(I can see that the timedate and decode-universal-time functions
have key differences between Maxima versions)
thank you for the website mention,
CLHS: Section The Environment Dictionary
http://clhs.lisp.se/Body/c_enviro.htm
Is there a list of Lisp commands that work in Maxima?
The main reason for my interest in datestamps is to produce datestamps for filenames such as 'z title yyyymmddhhmmss.txt', or friendly dates inside those files such as 'hh:mm dd/mm/yyyy'. The string manipulation method was the simplest one that I could successfully code (I don't explicitly need to extract the individual d, m, y, etc.).
Please help me calculate a moving/rolling weekly sum of Amount ($4), per Distributor ($3) and per rolling date.
I want to set variables like
RollingStartDate ==01/05/2015 and RollingInterval==7 and RollingEndDate ==08/05/2015
For Example :
1st May 2015 Rolling 7 Days data set would be from 01/05/2015 to 25/04/2015
2nd May 2015 Rolling 7 Days data set would be from 02/05/2015 to 26/04/2015
....................................................................
7th May 2015 Rolling 7 Days data set would be from 07/05/2015 to 01/05/2015
8th May 2015 Rolling 7 Days data set would be from 08/05/2015 to 02/05/2015
Input.csv
Des,Date,Distributor,Amount,Loc
aaa,25/04/2015,abc123,25,bbb
aaa,25/04/2015,xyz456,75,bbb
aaa,26/04/2015,xyz456,50,bbb
aaa,27/04/2015,abc123,250,bbb
aaa,27/04/2015,abc123,100,bbb
aaa,29/04/2015,xyz456,50,bbb
aaa,30/04/2015,abc123,25,bbb
aaa,01/05/2015,xyz456,75,bbb
aaa,01/05/2015,abc123,50,bbb
aaa,02/05/2015,abc123,25,bbb
aaa,02/05/2015,xyz456,75,bbb
aaa,04/05/2015,abc123,30,bbb
aaa,04/05/2015,xyz456,35,bbb
aaa,05/05/2015,xyz456,12,bbb
aaa,06/05/2015,abc123,32,bbb
aaa,06/05/2015,xyz456,43,bbb
aaa,07/05/2015,xyz456,87,bbb
aaa,08/05/2015,abc123,58,bbb
aaa,08/05/2015,xyz456,98,bbb
Example: 8th May 2015 Rolling 7 Days data set would be from 08/05/2015 to 02/05/2015
aaa,02/05/2015,abc123,25,bbb
aaa,02/05/2015,xyz456,75,bbb
aaa,04/05/2015,abc123,30,bbb
aaa,04/05/2015,xyz456,35,bbb
aaa,05/05/2015,xyz456,12,bbb
aaa,06/05/2015,abc123,32,bbb
aaa,06/05/2015,xyz456,43,bbb
aaa,07/05/2015,xyz456,87,bbb
aaa,08/05/2015,abc123,58,bbb
aaa,08/05/2015,xyz456,98,bbb
Output for 8th May 2015 Rolling 7 Days data set
RollingDate,Distributor,Amount
08/05/2015,abc123,145
08/05/2015,xyz456,350
I am able to obtain the above output from this command :
awk -F, '{key=$3; b[key]=b[key]+$4} END {for(i in b) print i","b[i]}'
Kindly suggest how to derive weekly split-up data sets then Sum.
Desired Output:
RollingDate,Distributor,Amount
01/05/2015,abc123,450
01/05/2015,xyz456,250
02/05/2015,abc123,450
02/05/2015,xyz456,250
03/05/2015,abc123,450
03/05/2015,xyz456,200
04/05/2015,abc123,130
04/05/2015,xyz456,235
05/05/2015,abc123,130
05/05/2015,xyz456,247
06/05/2015,abc123,162
06/05/2015,xyz456,240
07/05/2015,abc123,137
07/05/2015,xyz456,327
08/05/2015,abc123,145
08/05/2015,xyz456,350
Edit#1
1. The logic is to find the sum of the amounts billed to each distributor over a 7-day range. For example, to calculate the sum for 1st May I need to consider the line items from 1st May, 30th Apr, 29th Apr, 28th Apr, 27th Apr, 26th Apr and 25th Apr, which is 1st May minus 6 days back. Likewise, the rolling range for 2nd May runs from 2nd May back to 26th Apr (2nd May minus 6 days).
2. The date format is DD/MM/YYYY, so 02/05/2015 is 2nd May.
Since the file contains 2 to 3 months of details, I don't want to start the 6-days-back analysis from the first date in the file (25/04/2015). Hence "RollingStartDate" specifies the date from which the data should be considered, and "RollingInterval" selects the look-back window: 7 days, 14 days, or 30 days (monthly).
"RollingEndDate" excludes any future-dated line items present in the file; in this case, line items dated 09th or 15th May would be excluded.
Here's a solution that just excludes dates that don't have 7 days before them instead of requiring a specific start/stop range:
$ cat tst.awk
BEGIN { FS=OFS=","; window=(window?window:7); secsPerDay=24*60*60 }
NR==1 { print "RollingDate", $3, $4; next }
{
    endSecs = mktime(gensub(/(..)\/(..)\/(....)/,"\\3 \\2 \\1 0 0 0","",$2))
    if (begSecs=="") {
        begSecs = endSecs + ((window-1) * secsPerDay)
    }
    amount[endSecs][$3] += $4
    dists[$3]
}
END {
    for (currSecs=begSecs; currSecs<=endSecs; currSecs+=secsPerDay) {
        for (dayNr=1; dayNr<=window; dayNr++) {
            rollSecs = currSecs - ((dayNr-1) * secsPerDay)
            for (dist in dists) {
                sum[dist] += (rollSecs in amount ? amount[rollSecs][dist] : 0)
            }
        }
        for (dist in dists) {
            print strftime("%d/%m/%Y",currSecs), dist, sum[dist]
            delete sum[dist]
        }
    }
}
$ awk -f tst.awk file
RollingDate,Distributor,Amount
01/05/2015,xyz456,250
01/05/2015,abc123,450
02/05/2015,xyz456,250
02/05/2015,abc123,450
03/05/2015,xyz456,200
03/05/2015,abc123,450
04/05/2015,xyz456,235
04/05/2015,abc123,130
05/05/2015,xyz456,247
05/05/2015,abc123,130
06/05/2015,xyz456,240
06/05/2015,abc123,162
07/05/2015,xyz456,327
07/05/2015,abc123,137
08/05/2015,xyz456,350
08/05/2015,abc123,145
To use some different window size than 7 days, just set it on the command line:
$ awk -v window=5 -f tst.awk file
RollingDate,Distributor,Amount
29/04/2015,xyz456,175
29/04/2015,abc123,375
30/04/2015,xyz456,100
30/04/2015,abc123,375
01/05/2015,xyz456,125
01/05/2015,abc123,425
02/05/2015,xyz456,200
02/05/2015,abc123,100
03/05/2015,xyz456,200
03/05/2015,abc123,100
04/05/2015,xyz456,185
04/05/2015,abc123,130
05/05/2015,xyz456,197
05/05/2015,abc123,105
06/05/2015,xyz456,165
06/05/2015,abc123,87
07/05/2015,xyz456,177
07/05/2015,abc123,62
08/05/2015,xyz456,275
08/05/2015,abc123,120
The above uses GNU awk for true 2D arrays and time functions. Hopefully it's clear enough that you can make any modifications you need to include/exclude specific date ranges.
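If GNU awk is not available, the same rolling-window idea can be sketched in Python. This is a rough equivalent rather than a translation of the script above: `rolling_sums` is a made-up name, amounts are assumed to be whole numbers, and, like the awk version, it starts at the first date that has a full window before it.

```python
import csv
import io
from collections import defaultdict
from datetime import datetime, timedelta

INPUT_CSV = """Des,Date,Distributor,Amount,Loc
aaa,25/04/2015,abc123,25,bbb
aaa,25/04/2015,xyz456,75,bbb
aaa,26/04/2015,xyz456,50,bbb
aaa,27/04/2015,abc123,250,bbb
aaa,27/04/2015,abc123,100,bbb
aaa,29/04/2015,xyz456,50,bbb
aaa,30/04/2015,abc123,25,bbb
aaa,01/05/2015,xyz456,75,bbb
aaa,01/05/2015,abc123,50,bbb
aaa,02/05/2015,abc123,25,bbb
aaa,02/05/2015,xyz456,75,bbb
aaa,04/05/2015,abc123,30,bbb
aaa,04/05/2015,xyz456,35,bbb
aaa,05/05/2015,xyz456,12,bbb
aaa,06/05/2015,abc123,32,bbb
aaa,06/05/2015,xyz456,43,bbb
aaa,07/05/2015,xyz456,87,bbb
aaa,08/05/2015,abc123,58,bbb
aaa,08/05/2015,xyz456,98,bbb
"""

def rolling_sums(rows, window=7):
    """Return (date, distributor, sum over the trailing `window` days) tuples."""
    daily = defaultdict(int)  # (date, distributor) -> amount billed that day
    dists = set()
    for r in rows:
        d = datetime.strptime(r["Date"], "%d/%m/%Y").date()
        daily[(d, r["Distributor"])] += int(r["Amount"])
        dists.add(r["Distributor"])
    days = [d for d, _ in daily]
    first, last = min(days), max(days)
    out = []
    cur = first + timedelta(days=window - 1)  # first date with a full window
    while cur <= last:
        for dist in sorted(dists):
            total = sum(daily.get((cur - timedelta(days=k), dist), 0)
                        for k in range(window))
            out.append((cur.strftime("%d/%m/%Y"), dist, total))
        cur += timedelta(days=1)
    return out

result = rolling_sums(csv.DictReader(io.StringIO(INPUT_CSV)))
for row in result:
    print(",".join(map(str, row)))
```

Passing `window=14` or `window=30` gives the longer look-back periods, and a start or end date could be added as a simple filter on `cur`.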