I have a Hive table with more than a million records.
The input is of the following type:
Input:
rowid | starttime           | endtime             | line | status
1     | 2007-07-19 00:05:00 | 2007-07-19 00:23:00 | l1   | s1
2     | 2007-07-20 00:00:10 | 2007-07-20 00:22:00 | l1   | s2
3     | 2007-07-19 00:00:00 | 2007-07-19 00:11:00 | l2   | s2
What I want to do is: first, order the table by starttime within each line (group by line). Then find the difference between each row's starttime and the previous row's endtime. If the difference is more than 5 minutes, add a new row in between, in a new table, with status misstime.
In the input, rows 1 and 2 leave a gap between 2007-07-19 00:23:00 and 2007-07-20 00:00:10, so I first create a misstime row that completes the 19th and then add one more misstime row for the 20th, as below.
output:
rowid | starttime           | endtime             | line | status
1     | 2007-07-19 00:05:00 | 2007-07-19 00:23:00 | l1   | s1
2     | 2007-07-19 00:23:01 | 2007-07-19 00:00:00 | l1   | misstime
3     | 2007-07-20 00:00:01 | 2007-07-20 00:00:09 | l1   | misstime
4     | 2007-07-20 00:00:10 | 2007-07-20 00:22:00 | l1   | s2
3     | 2007-07-19 00:00:00 | 2007-07-19 00:11:00 | l2   | s2
Can anyone help me achieve this directly in Hue/Hive? A Unix script will also do.
Thanks in advance.
The solution template is:
1. Use the LAG() function to get the previous row's endtime per line.
2. For each row, calculate the difference between the current starttime and the previous endtime.
3. Filter rows where the difference is more than 5 minutes.
4. Transform the dataset into the required output.
Example:
insert into yourtable
select
    s.rowid,
    s.starttime,
    s.endtime
    -- plus your status calculation here, etc.
from
(
    select rowid, starttime, endtime,
           lag(endtime) over (partition by line order by starttime) as prev_endtime
    from yourtable
) s
where (unix_timestamp(starttime) - unix_timestamp(prev_endtime)) / 60 > 5 -- gap > 5 min
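For reference, the same LAG-based gap detection can be sketched outside Hive, e.g. with Python's built-in sqlite3 module (window functions need SQLite 3.25+). The table and column names here just mirror the example above (rowid is renamed rid because SQLite reserves that name):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE yourtable (rid INTEGER, starttime TEXT, endtime TEXT, line TEXT, status TEXT);
INSERT INTO yourtable VALUES
  (1, '2007-07-19 00:05:00', '2007-07-19 00:23:00', 'l1', 's1'),
  (2, '2007-07-20 00:00:10', '2007-07-20 00:22:00', 'l1', 's2'),
  (3, '2007-07-19 00:00:00', '2007-07-19 00:11:00', 'l2', 's2');
""")

# rows whose starttime is more than 5 minutes after the previous endtime on the same line
gaps = conn.execute("""
SELECT line, prev_endtime, starttime
FROM (
  SELECT line, starttime,
         LAG(endtime) OVER (PARTITION BY line ORDER BY starttime) AS prev_endtime
  FROM yourtable
)
WHERE prev_endtime IS NOT NULL
  AND (strftime('%s', starttime) - strftime('%s', prev_endtime)) / 60.0 > 5
""").fetchall()

print(gaps)  # [('l1', '2007-07-19 00:23:00', '2007-07-20 00:00:10')]
```

Each returned pair (prev_endtime, starttime) is one gap that would get a misstime row (or two, if it spans midnight).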
If the dataframe is like below:
year | month     | day | weekday  | hour
2017 | January   | 1   | Sunday   | 0
2018 | September | 22  | Saturday | 11
Then I need to add another column with values of type timestamp like the following:
2017-01-01 00:00:00
2018-09-22 11:00:00
I'm trying unix_timestamp after concatenating the fields into a string, but it's not working.
You can concat the elements into a string and use to_timestamp() (or from_unixtime(unix_timestamp())) with the appropriate datetime pattern.
Here's an example:
from pyspark.sql import functions as func

data_sdf. \
    withColumn('ts',
               func.to_timestamp(func.concat_ws(' ', 'year', 'month', 'day', 'hour'),
                                 'yyyy MMMM d H')
               ). \
    show(truncate=False)
# +----+---------+---+--------+----+-------------------+
# |year|month |day|weekday |hour|ts |
# +----+---------+---+--------+----+-------------------+
# |2017|January |1 |Sunday |0 |2017-01-01 00:00:00|
# |2018|September|22 |Saturday|11 |2018-09-22 11:00:00|
# +----+---------+---+--------+----+-------------------+
I have this table; date is a TEXT field and the only field.
date
2020-01-01
2010-03-01
2010-06-01
2011-01-01
2012-01-01
2013-01-01
2014-01-01
2015-01-01
I want the table to join itself on the date that is one year smaller than the target record. I tried the following; however, it doesn't seem to work when I simply add a number to the result of strftime.
SELECT d0.*, d1.* from the_table d0 left join the_table d1 on strftime(d0."date",'%Y') = strftime(d1."date",'%Y') + 1;
What I want is the following result
date date
2020-01-01 None
2010-03-01 2011-01-01
2010-06-01 2011-01-01
2011-01-01 2012-01-01
...
But it returned something else instead.
I have several questions regarding this issue:
Besides this example, which joins the table on a specific difference in years, how do I do the same for months, days, etc.?
Does strftime() use the index if there's an index created on that field? The date field is the primary key in this example. How do I know if I'm using indices, and if not, how do I make it use the index?
The syntax for the function strftime() requires the format to be the first argument and the date to be the second.
Also, strftime() returns a string, so you must convert it to a number (implicitly by adding 0) if you want to compare it to a number:
SELECT d0.*, d1.*
FROM the_table d0 LEFT JOIN the_table d1
ON strftime('%Y', d0."date") + 1 = strftime('%Y', d1."date") + 0;
See the demo.
Results:
date       | date
2020-01-01 | null
2010-03-01 | 2011-01-01
2010-06-01 | 2011-01-01
2011-01-01 | 2012-01-01
2012-01-01 | 2013-01-01
2013-01-01 | 2014-01-01
2014-01-01 | 2015-01-01
2015-01-01 | null
You can apply the same code by changing the format to '%m' or '%d' to compare month or day respectively, if the year is not relevant.
But, if you want to join on the next day of each date you can do it with the function date():
SELECT d0.*, d1.*
FROM the_table d0 LEFT JOIN the_table d1
ON date(d0."date", '+1 day') = d1."date";
Also, strftime() and date() are functions, and normally SQLite will not use an index on a column that is wrapped in a function call.
SQLite supports indexes on expressions (also: SQLite Expression-based Index), but I don't think that this would help in your case.
Not an exact answer or correction to your current approach, but you could use the DATE() function here with an offset of 1 year:
SELECT d0.*, d1.*
FROM the_table d0
LEFT JOIN the_table d1
ON d0."date" = DATE(d1."date", '+1 year');
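As a quick sanity check, the corrected strftime() join runs as-is under SQLite, for example from Python's sqlite3 module with the sample dates above:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE the_table ("date" TEXT PRIMARY KEY)')
conn.executemany('INSERT INTO the_table VALUES (?)',
                 [(d,) for d in ['2020-01-01', '2010-03-01', '2010-06-01',
                                 '2011-01-01', '2012-01-01', '2013-01-01',
                                 '2014-01-01', '2015-01-01']])

# join each date to the dates exactly one calendar year later
rows = conn.execute("""
SELECT d0."date", d1."date"
FROM the_table d0 LEFT JOIN the_table d1
  ON strftime('%Y', d0."date") + 1 = strftime('%Y', d1."date") + 0
ORDER BY d0."date"
""").fetchall()

print(rows)
```

Dates with no match one year out (2015-01-01 and 2020-01-01) come back paired with None, matching the null rows in the results table.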
I am trying to join two data.tables using a rolling join. I have looked at various answers, including here, but unfortunately have been unable to locate one that helps in this case. I am borrowing the example from the link posted.
My first dataset is web-session data for two users, 1 and 2:
user web_date_time
1 29-Oct-2016 6:10:03 PM
1 29-Oct-2016 7:34:17 PM
1 30-Oct-2016 2:08:03 PM
1 30-Oct-2016 3:55:12 PM
2 31-Oct-2016 11:32:12 AM
2 31-Oct-2016 2:59:56 PM
2 01-Nov-2016 12:49:44 PM
My second dataset is purchase timestamps:
user purchase_date_time
1 29-Oct-2016 6:10:00 PM
1 29-Oct-2016 6:11:00 PM
2 31-Oct-2016 11:35:12 AM
2 31-Oct-2016 2:50:00 PM
My desired output is which web session led to each purchase, but with a constraint: the web session must come after the previous purchase. The desired output is as follows (for all purchases, an additional column websession_led_purchase is created):
user purchase_date_time websession_led_purchase
1 29-Oct-2016 6:10:00 PM NA
1 29-Oct-2016 6:11:00 PM 29-Oct-2016 6:10:03 PM
2 31-Oct-2016 11:35:12 AM 31-Oct-2016 11:32:12 AM
2 31-Oct-2016 2:50:00 PM NA
The first NA is because there is no web session before that purchase; the second NA is because there is no web session after the previous purchase (and before this one) that could have led to the second purchase for user 2.
I tried the rolling-join method dt2[dt1, roll=Inf]; however, I get "31-Oct-2016 11:32:12 AM" for the fourth row of the desired output, which is incorrect.
Let me know your advice.
The rolling join is behaving as expected.
The documentation says:
+Inf (or TRUE) rolls the prevailing value in x forward. It is also known as last observation carried forward (LOCF).
That means the last observation can be carried forward and joined to many records. Exactly that happens with the 4th row, where 2016-10-31 11:32:12 is copied and mapped to the next record (2016-10-31 14:50:00) as well.
A simple way to fix this is to compare the lagged value of websession_led_purchase with the current row; if the two are the same, set the current row's value to NA. This ensures an observation is carried forward only once.
library(lubridate)
library(data.table)
setDT(DT1)
setDT(DT2)
DT1[,':='(date_time = dmy_hms(web_date_time), web_date_time = dmy_hms(web_date_time))]
DT2[, ':='(date_time = dmy_hms(purchase_date_time),
purchase_date_time = dmy_hms(purchase_date_time)) ]
setkey(DT1, user, date_time)
setkey(DT2, user, date_time)
DT1[DT2, roll= Inf][,.(user, purchase_date_time,
websession_led_purchase = as.POSIXct(ifelse(!is.na(shift(web_date_time)) &
web_date_time == shift(web_date_time), NA, web_date_time),
origin = "1970-01-01"))]
# user purchase_date_time websession_led_purchase
# 1: 1 2016-10-29 18:10:00 <NA>
# 2: 1 2016-10-29 18:11:00 2016-10-29 19:10:03
# 3: 2 2016-10-31 11:35:12 2016-10-31 11:32:12
# 4: 2 2016-10-31 14:50:00 <NA>
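For comparison, here is a rough pandas sketch of the same fix (the frame names and the final masking step are my own; merge_asof with direction='backward' plays the role of roll = Inf):

```python
import pandas as pd

web = pd.DataFrame({
    'user': [1, 1, 1, 1, 2, 2, 2],
    'web_date_time': pd.to_datetime([
        '2016-10-29 18:10:03', '2016-10-29 19:34:17',
        '2016-10-30 14:08:03', '2016-10-30 15:55:12',
        '2016-10-31 11:32:12', '2016-10-31 14:59:56',
        '2016-11-01 12:49:44'])})
purchases = pd.DataFrame({
    'user': [1, 1, 2, 2],
    'purchase_date_time': pd.to_datetime([
        '2016-10-29 18:10:00', '2016-10-29 18:11:00',
        '2016-10-31 11:35:12', '2016-10-31 14:50:00'])})

# LOCF join: latest web session at or before each purchase, per user
out = pd.merge_asof(purchases.sort_values('purchase_date_time'),
                    web.sort_values('web_date_time'),
                    left_on='purchase_date_time', right_on='web_date_time',
                    by='user', direction='backward')

# a session may explain only one purchase: NA out values carried forward twice
prev = out.groupby('user')['web_date_time'].shift()
out['websession_led_purchase'] = out['web_date_time'].mask(out['web_date_time'] == prev)

print(out[['user', 'purchase_date_time', 'websession_led_purchase']])
```

The first and fourth purchases come out NaT, matching the two NA rows in the desired output.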
I need to get the full date with timestamp from a column that has the format 'yyyymm'. For example, I need to get 2007-01-01 00:00:00.000 from 200701.
My Column 'A' consists of:
200701
200702
200703
...
...
...
I need to calculate another column 'B' showing:
2007-01-01 00:00:00.000
2007-02-01 00:00:00.000
2007-03-01 00:00:00.000
2007-04-01 00:00:00.000
Column B has to be a calculation based on column A or Sys_Calendar. The platform is Teradata 14.
Thanks in advance.
If the datatype is a string:
cast(col as timestamp(3) format 'yyyymm')
If it's numeric:
cast(cast(col * 100 - 19000000 + 1 as date) as timestamp(3))
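The conversion itself is easy to sanity-check outside Teradata. In Python, for instance, parsing 'yyyymm' defaults the day and time to the first of the month at midnight (the function name here is my own):

```python
from datetime import datetime

def yyyymm_to_timestamp(col):
    # works for an int (200701) or a string ('200701');
    # day and time default to 01 00:00:00
    return datetime.strptime(str(col), "%Y%m")

print(yyyymm_to_timestamp(200701))  # 2007-01-01 00:00:00
```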
Let me begin by saying this question pertains to R (the statistical programming language), but I'm open to straightforward suggestions for other environments.
The goal is to merge outcomes from dataframe (df) A onto sub-elements in df B. This is a one-to-many relationship but, here's the twist: once the records are matched by keys, they also have to match over a specific frame of time given by a start time and duration.
For example, a few records in df A:
OBS ID StartTime Duration Outcome
1 01 10:12:06 00:00:10 Normal
2 02 10:12:30 00:00:30 Weird
3 01 10:15:12 00:01:15 Normal
4 02 10:45:00 00:00:02 Normal
And from df B:
OBS ID Time
1 01 10:12:10
2 01 10:12:17
3 02 10:12:45
4 01 10:13:00
The desired outcome from the merge would be:
OBS ID Time Outcome
1 01 10:12:10 Normal
3 02 10:12:45 Weird
Desired result: dataframe B with outcomes merged in from A. Notice that observations 2 and 4 were dropped: although they matched IDs on records in A, they did not fall within any of the time intervals given.
Question
Is it possible to perform this sort of operation in R and how would you get started? If not, can you suggest an alternative tool?
Set up data
First, set up the input data frames. We create two versions: A and B use character columns for the times, while At and Bt use the chron package's "times" class (which has the advantage over "character" that times can be added and subtracted):
LinesA <- "OBS ID StartTime Duration Outcome
1 01 10:12:06 00:00:10 Normal
2 02 10:12:30 00:00:30 Weird
3 01 10:15:12 00:01:15 Normal
4 02 10:45:00 00:00:02 Normal"
LinesB <- "OBS ID Time
1 01 10:12:10
2 01 10:12:17
3 02 10:12:45
4 01 10:13:00"
A <- At <- read.table(textConnection(LinesA), header = TRUE,
colClasses = c("numeric", rep("character", 4)))
B <- Bt <- read.table(textConnection(LinesB), header = TRUE,
colClasses = c("numeric", rep("character", 2)))
# in At and Bt convert times columns to "times" class
library(chron)
At$StartTime <- times(At$StartTime)
At$Duration <- times(At$Duration)
Bt$Time <- times(Bt$Time)
sqldf with times class
Now we can perform the calculation using the sqldf package. We use method = "raw" (which does not assign classes to the output), so we must assign the "times" class to the output Time column ourselves:
library(sqldf)
out <- sqldf("select Bt.OBS, ID, Time, Outcome from At join Bt using(ID)
where Time between StartTime and StartTime + Duration",
method = "raw")
out$Time <- times(as.numeric(out$Time))
The result is:
> out
OBS ID Time Outcome
1 1 01 10:12:10 Normal
2 3 02 10:12:45 Weird
With the development version of sqldf this can be done without using method="raw" and the "Time" column will automatically be set to "times" class by the sqldf class assignment heuristic:
library(sqldf)
source("http://sqldf.googlecode.com/svn/trunk/R/sqldf.R") # grab devel ver
sqldf("select Bt.OBS, ID, Time, Outcome from At join Bt using(ID)
where Time between StartTime and StartTime + Duration")
sqldf with character class
It's actually possible to avoid the "times" class by performing all time calculations in SQLite on character strings, using SQLite's strftime function. The SQL statement is unfortunately a bit more involved:
sqldf("select B.OBS, ID, Time, Outcome from A join B using(ID)
where strftime('%s', Time) - strftime('%s', StartTime)
between 0 and strftime('%s', Duration) - strftime('%s', '00:00:00')")
Here is an example:
# first, merge by ID
z <- merge(A[, -1], B, by = "ID")
# convert string to POSIX time
z <- transform(z,
s_t = as.numeric(strptime(as.character(z$StartTime), "%H:%M:%S")),
dur = as.numeric(strptime(as.character(z$Duration), "%H:%M:%S")) -
as.numeric(strptime("00:00:00", "%H:%M:%S")),
tim = as.numeric(strptime(as.character(z$Time), "%H:%M:%S")))
# subset by time range
subset(z, s_t < tim & tim < s_t + dur)
the output:
ID StartTime Duration Outcome OBS Time s_t dur tim
1 1 10:12:06 00:00:10 Normal 1 10:12:10 1321665126 10 1321665130
2 1 10:12:06 00:00:10 Normal 2 10:12:15 1321665126 10 1321665135
7 2 10:12:30 00:00:30 Weird 3 10:12:45 1321665150 30 1321665165
OBS #2 looks to be in the range. Does it make sense?
Merge the two data.frames together with merge(). Then subset() the resulting data.frame with the condition time >= startTime & time <= startTime + Duration or whatever rules make sense to you.
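The same merge-then-subset recipe also translates directly to pandas, if that environment is an option (to_timedelta stands in for the chron "times" class):

```python
import pandas as pd

A = pd.DataFrame({'ID': ['01', '02', '01', '02'],
                  'StartTime': ['10:12:06', '10:12:30', '10:15:12', '10:45:00'],
                  'Duration':  ['00:00:10', '00:00:30', '00:01:15', '00:00:02'],
                  'Outcome':   ['Normal', 'Weird', 'Normal', 'Normal']})
B = pd.DataFrame({'OBS': [1, 2, 3, 4],
                  'ID': ['01', '01', '02', '01'],
                  'Time': ['10:12:10', '10:12:17', '10:12:45', '10:13:00']})

# convert the clock strings to timedeltas so they can be compared and added
for col in ('StartTime', 'Duration'):
    A[col] = pd.to_timedelta(A[col])
B['Time'] = pd.to_timedelta(B['Time'])

# one-to-many merge on ID, then keep rows whose Time falls inside the interval
z = B.merge(A, on='ID')
out = z[(z['Time'] >= z['StartTime']) & (z['Time'] <= z['StartTime'] + z['Duration'])]

print(out[['OBS', 'ID', 'Time', 'Outcome']])
```

Only OBS 1 (Normal) and OBS 3 (Weird) survive the interval filter, matching the desired output.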