Surround Search with KQL - azure-data-explorer

Surround search with KQL: how can I retrieve the five records that were logged (based on a given datetime column) immediately before and after one or more specific records?
For reference, with Linux logs we can search for "failed login" and get the 5 events logged before and after each failed login:
$ grep -B 5 -A 5 'failed login' /var/log/auth.log
Source: https://www.manageengine.com/products/eventlog/logging-guide/syslog/analyzing-syslogs-with-tools-techniques.html > search "Surround Search".
I tried the next() operator, but it doesn't retrieve the value of the entire record, only the value in a specific column.
Example:
cluster("https://help.kusto.windows.net").database("Samples").
StormEvents
| serialize
| extend NextEpisode = next(EpisodeId,5)
| extend PrevEpisode = prev(EpisodeId,5)
| extend formatted_text = strcat("Current episode: ", EpisodeId, ". Next episode: ", NextEpisode, ". Prev episode: ", PrevEpisode)
| where StartTime == datetime(2007-12-13T09:02:00Z)
| where EndTime == datetime(2007-12-13T10:30:00Z)
| project-reorder formatted_text, *

You can use the rows_near plugin:
cluster("https://help.kusto.windows.net").database("Samples").StormEvents
| order by StartTime asc
| evaluate rows_near(EventType == "Dense Smoke", 5)
| project StartTime, EventType
StartTime | EventType
2007-09-04T18:15:00Z | Thunderstorm Wind
2007-09-04T18:51:00Z | Thunderstorm Wind
2007-09-04T19:15:00Z | Flash Flood
2007-09-04T22:00:00Z | Dense Fog
2007-09-04T22:00:00Z | Dense Fog
2007-09-04T22:00:00Z | Dense Smoke
2007-09-04T22:00:00Z | Dense Fog
2007-09-04T22:00:00Z | Dense Fog
2007-09-05T02:00:00Z | Flash Flood
2007-09-05T04:45:00Z | Flash Flood
2007-09-05T06:00:00Z | Flash Flood
2007-10-17T15:51:00Z | Thunderstorm Wind
2007-10-17T15:55:00Z | Hail
2007-10-17T15:56:00Z | Thunderstorm Wind
2007-10-17T15:58:00Z | Hail
2007-10-17T16:00:00Z | Thunderstorm Wind
2007-10-17T16:00:00Z | Dense Smoke
2007-10-17T16:00:00Z | Thunderstorm Wind
2007-10-17T16:00:00Z | Thunderstorm Wind
2007-10-17T16:03:00Z | Funnel Cloud
2007-10-17T16:05:00Z | Thunderstorm Wind
2007-10-17T16:08:00Z | Hail
2007-11-05T06:00:00Z | Lake-Effect Snow
2007-11-05T06:00:00Z | Winter Storm
2007-11-05T07:00:00Z | Winter Storm
2007-11-05T07:00:00Z | Winter Storm
2007-11-05T07:00:00Z | Winter Storm
2007-11-05T07:00:00Z | Dense Smoke
2007-11-05T07:00:00Z | Winter Storm
2007-11-05T08:44:00Z | Hail
2007-11-05T09:57:00Z | Blizzard
2007-11-05T11:00:00Z | Strong Wind
2007-11-05T11:00:00Z | Strong Wind
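Two follow-up notes, offered as a sketch rather than a definitive answer. First, the rows_near plugin also takes an optional third argument for the number of rows after the match, so rows_near(Condition, 2, 3) maps naturally onto grep -B 2 -A 3. Second, if you want to stay with the next()/prev() approach from the question but carry the whole record instead of a single column, one option to try (untested against your schema) is to pack each row into a dynamic value first, so the window functions return every field at once:

cluster("https://help.kusto.windows.net").database("Samples").StormEvents
| serialize
| extend Rec = pack_all()  // the whole row as a dynamic property bag
| extend NextRec = next(Rec, 5), PrevRec = prev(Rec, 5)  // the full records 5 rows away
| where StartTime == datetime(2007-12-13T09:02:00Z)
| project StartTime, EndTime, PrevRec, NextRec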

Related

How to retrieve specific date data from the table in kusto query

I need to retrieve the data for a specific date from the table.
Can anyone help me out with this?
Sample data:
Timestamp | message
2010-09-12 | king
2010-09-12 | queen
2010-09-13 | raju
2010-09-13 | Rani
2010-09-14 | Ramu
2010-09-12 | somu
Expected results:
Timestamp | message
2010-09-12 | king
2010-09-12 | queen
2010-09-12 | somu
Only results for the date 2010-09-12 are required.
Thanks in advance.
If all datetime values are at the start of the day then you can use a simple equality.
datatable(Timestamp:datetime, message:string)
[
datetime("2010-09-12") ,"king"
,datetime("2010-09-12") ,"queen"
,datetime("2010-09-13") ,"raju"
,datetime("2010-09-13") ,"Rani"
,datetime("2010-09-14") ,"Ramu"
,datetime("2010-09-12") ,"somu"
]
| where Timestamp == datetime("2010-09-12")
Timestamp | message
2010-09-12T00:00:00Z | king
2010-09-12T00:00:00Z | queen
2010-09-12T00:00:00Z | somu
If the datetime values have time parts, you'll need to filter on a range of dates.
datatable(Timestamp:datetime, message:string)
[
datetime("2010-09-12 00:00:00") ,"king"
,datetime("2010-09-12 12:34:56") ,"queen"
,datetime("2010-09-13 00:00:00") ,"raju"
,datetime("2010-09-13 15:23:02") ,"Rani"
,datetime("2010-09-14 11:11:11") ,"Ramu"
,datetime("2010-09-12 02:03:04") ,"somu"
]
| where Timestamp >= datetime("2010-09-12") and Timestamp < datetime("2010-09-13")
Timestamp | message
2010-09-12T00:00:00Z | king
2010-09-12T02:03:04Z | somu
2010-09-12T12:34:56Z | queen
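Another way to express the same day filter (just a sketch, using a cut-down version of the datatable above): startofday() normalizes each timestamp to midnight, so the comparison reads closer to the intent, although the explicit range filter above tends to be friendlier to Kusto's query optimizer.

datatable(Timestamp:datetime, message:string)
[
datetime("2010-09-12 00:00:00") ,"king"
,datetime("2010-09-12 12:34:56") ,"queen"
,datetime("2010-09-13 15:23:02") ,"Rani"
]
| where startofday(Timestamp) == datetime("2010-09-12")  // keeps king and queen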

JOINing databases with SQLite

I have four tables in an SQLite database relating to the America's Cup.
SELECT * FROM teams
>
Code | Country                  | TeamName
ITA  | Italy                    | Luna Rossa Prada Pirelli Team
NZ   | New Zealand              | Emirates Team New Zealand
UK   | United Kingdom           | INEOS Team UK
USA  | United States of America | NYYC American Magic
4 rows
SELECT * FROM races
>
Race Tournament Date Racedate
RR1R1 RR 15-Jan 18642
RR1R2 RR 15-Jan 18642
RR1R3 RR 16-Jan 18643
RR2R1 RR 16-Jan 18643
RR2R2 RR 17-Jan 18644
RR2R3 RR 17-Jan 18644
RR3R1 RR 23-Jan 18650
RR3R2 RR 23-Jan 18650
RR3R3 RR 23-Jan 18650
SFR1 SF 29-Jan 18656
1-10 of 31 rows
SELECT * FROM tournaments
>
Tournament Event TournamentName
RR Prada Cup Round Robin
SF Prada Cup Semi-Final
F Prada Cup Final
AC America's Cup Americas Cup
4 rows
SELECT *
FROM results
>
Race Code Result
FR1 ITA Win
FR1 UK Loss
FR2 UK Loss
FR2 ITA Win
FR3 UK Loss
FR3 ITA Win
FR4 ITA Win
FR4 UK Loss
FR5 ITA Win
FR5 UK Loss
1-10 of 62 rows
and I'm trying to write an SQL query that outputs the number of races each team won in each tournament. The output table should include the full name of the Event, the Tournament, and the full name of each team. My query at the moment looks like this:
SELECT TeamName, Result, Event, tournaments.Tournament
FROM teams LEFT JOIN results
ON teams.Code = results.Code
LEFT JOIN races
ON results.Race = races.Race
LEFT JOIN tournaments
ON races.Tournament = tournaments.Tournament
WHERE Result = 'Win'
ORDER BY tournaments.Tournament
which outputs:
TeamName Result Event Tournament
Emirates Team New Zealand Win America's Cup AC
Emirates Team New Zealand Win America's Cup AC
Luna Rossa Prada Pirelli Team Win America's Cup AC
Luna Rossa Prada Pirelli Team Win America's Cup AC
Emirates Team New Zealand Win America's Cup AC
Luna Rossa Prada Pirelli Team Win America's Cup AC
Emirates Team New Zealand Win America's Cup AC
Emirates Team New Zealand Win America's Cup AC
Emirates Team New Zealand Win America's Cup AC
Emirates Team New Zealand Win America's Cup AC
When I try to COUNT(Result) AS NumberOfWins, I get:
TeamName Result NumberOfWins Event Tournament
Luna Rossa Prada Pirelli Team Win 31 Prada Cup F
1 row
Why does adding the count count only Luna Rossa's wins? How can I change the query to fix it?
Why does adding the count count only Luna Rossa's wins?
Count() is an aggregate function and produces one result per GROUP.
As you have no GROUP BY clause the entire result set is a single group and hence the single result.
The reason why you got Tournament F is explained in the SQLite documentation for SELECT:
If the SELECT statement is an aggregate query without a GROUP BY clause, then each aggregate expression in the result-set is evaluated once across the entire dataset. Each non-aggregate expression in the result-set is evaluated once for an arbitrarily selected row of the dataset. The same arbitrarily selected row is used for each non-aggregate expression. Or, if the dataset contains zero rows, then each non-aggregate expression is evaluated against a row consisting entirely of NULL values.
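A minimal, self-contained illustration of that rule (a hypothetical table, not your data): without a GROUP BY the aggregate runs over every row, and the bare column comes from one arbitrarily chosen row.

CREATE TABLE demo (team TEXT);
INSERT INTO demo VALUES ('Luna Rossa'), ('Luna Rossa'), ('ETNZ');
SELECT team, COUNT(*) FROM demo;  -- one row: some team name paired with 3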
How can I change the query to fix it?
So you need a GROUP BY clause to create the groups that the count() function will work on.
You probably want GROUP BY Tournament,TeamName
e.g.
SELECT TeamName, Result, Event, tournaments.Tournament, count(*) AS NumberOfWins
FROM teams LEFT JOIN results
ON teams.Code = results.Code
LEFT JOIN races
ON results.Race = races.Race
LEFT JOIN tournaments
ON races.Tournament = tournaments.Tournament
WHERE Result = 'Win'
GROUP BY tournaments.Tournament, TeamName
ORDER BY tournaments.Tournament

Multiple orderBy in firestore

I have a question about how multiple orderBy works.
Supposing these documents:
collection/
doc1/
date: yesterday at 11:00pm
number: 1
doc2/
date: today at 01:00am
number: 6
doc3/
date: today at 1:00pm
number: 0
If I order by two fields like this:
.orderBy("date", "desc")
.orderBy("number", "desc")
.get()
How are those documents sorted? And, what about doing the opposite?
.orderBy("number", "desc")
.orderBy("date", "desc")
.get()
Will this result in the same order?
I'm a bit confused since I don't know if it will always end up ordering by the last orderBy.
In the documentation for orderBy() in Firebase it says this:
You can also order by multiple fields. For example, if you wanted to order by state, and within each state order by population in descending order:
Query query = cities.orderBy("state").orderBy("population", Direction.DESCENDING);
So it is basically the same logic as ORDER BY in SQL. Say you have a table of customers from all over the world: ORDER BY Country sorts them by country. If you add a second column, say CustomerName, the rows are first sorted by Country, and then within each country they are sorted by CustomerName. Example:
1. Adam | USA |
2. Jake | Germany |
3. Anna | USA |
4. Semir | Croatia |
5. Hans | Germany |
When you call orderBy("country") you will get this:
1. Semir | Croatia |
2. Jake | Germany |
3. Hans | Germany |
4. Adam | USA |
5. Anna | USA |
Then, when you add a second orderBy("customer name"), you get this:
1. Semir | Croatia |
2. Hans | Germany |
3. Jake | Germany |
4. Adam | USA |
5. Anna | USA |
You can see that Hans and Jake switched places, because H comes before J, but the list is still primarily ordered by country name. In your case, when you use this:
.orderBy("date", "desc")
.orderBy("number", "desc")
.get()
It will first order by the date and then by the numbers. But since your date values are all different, the second orderBy has no visible effect. The same goes for the second query: the secondary orderBy("date") has no visible effect because the number values are all different, but the overall order does differ, because the primary sort field differs. Now let's say two of your documents had the same date, so your data looks like this:
collection/
doc1/
date: yesterday at 11:00pm
number: 1
doc2/
date: today at 01:00am
number: 6
doc3/
date: today at 01:00am
number: 0
Now doc2 and doc3 are both dated today at 01:00am. When you order by date alone they end up next to each other, probably with doc2 shown first. But when you add orderBy("number"), it decides the order inside each group of equal dates. So, with ascending order (no "desc") you would get:
orderBy("date");
// output: 1. doc1, 2. doc2, 3. doc3
orderBy("date").orderBy("number");
// output: 1. doc1, 2. doc3, 3. doc2
Because number 0 is before 6. Just reverse it for desc.
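For reference, the same chained ordering with the modular Firestore web SDK looks roughly like this (a sketch: the collection name "collection" and the field names are taken from your example, and app initialization is assumed to have happened elsewhere):

import { getFirestore, collection, query, orderBy, getDocs } from "firebase/firestore";

const db = getFirestore();  // assumes initializeApp() has already been called
// primary sort: date descending; tie-breaker: number descending
const q = query(
  collection(db, "collection"),
  orderBy("date", "desc"),
  orderBy("number", "desc")
);
const snapshot = await getDocs(q);
snapshot.forEach((doc) => console.log(doc.id, doc.data()));

Note that Firestore usually requires a composite index for a query that orders by two different fields; if it is missing, the error message includes a link to create it.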

Efficient loan repayment calculation

I have a table of loan issuance and repayments by customers that I have preprocessed like this
customerID | balanceChange | trxDate | TYPE
242105 | 500 | 20170605 | loan
242105 | 1500 | 20170605 | loan
242105 | -1000 | 20170607 | payment
242111 | 500 | 20170605 | loan
242111 | -500 | 20170606 | payment
242111 | 500 | 20170607 | loan
242111 | -500 | 20170609 | payment
242151 | 500 | 20170605 | loan
What I would like to do is (1) count, for the loans issued on each day, how many of them have been paid back in full, and (2) compute how many days it took the customers to pay them back.
The rule of the repayment is of course FIFO (First In First Out), so the oldest loan gets paid back first.
In the example above, the solution would be
trxDate | nRepayments | timeGap(days)
20170605 | 2 | 1.5
20170606 | 0 | 0
20170607 | 1 | 2
The explanation of the solution: on 20170605, four loans were issued (two to customerID 242105, and the other two to 242111 and 242151), but only two of those loans were paid back (the 500 given to 242105 and the 500 given to 242111). The timeGap is the average number of days it took those customers to pay them back (242105 paid back on 20170607 - 2 days, and 242111 paid back on 20170606 - 1 day), so (2+1)/2 = 1.5.
I have tried to calculate nRepayments (I figured that once I had this, the timeGap would be a piece of cake) with the following R script.
#Recoveries
data_loans_rec <- data_loans %>% arrange(customerID, trxDate) %>% as.data.table()
data_loans_rec[is.na(data_loans_rec)] <- 0
data_loans_rec <- data_loans_rec[, index := seq_len(.N), by = customerID][!(index == 1 & TYPE == "payment")][, index := seq_len(.N), by = customerID]
n_loans_given <- data_loans[TYPE == "loan", ][, .(nloans = .N), .(payment)][order(payment)]
n_loans_rec <- copy(n_loans_given)
n_loans_rec[, nloans:=0]
unique_cust <- unique(data_loans_rec$customerID)
#Check repayment for every customer================
for (i in 1:length(unique_cust)) {
  cur_cust <- unique_cust[i]
  list_loan <- as.vector(data_loans_rec[customerID == cur_cust & TYPE == "loan", .(balanceChange)])
  list_loan_time <- as.vector(data_loans_rec[customerID == cur_cust & TYPE == "loan", .(trxDate)])
  list_pay <- as.vector(data_loans_rec[customerID == cur_cust & TYPE == "payment", .(balanceChange)])
  if (dim(list_pay)[1] == 0) { # if there are no payments
    list_pay <- c(0)
  }
  sum_paid <- sum(abs(list_pay))
  i_paid_until <- 0
  for (i_loantime in 1:(dim(list_loan_time)[1])) {
    # if there is only one loan
    if (i_loantime == 0) {
      i_loantime <- 1
    }
    loan_curr <- list_loan[i_loantime]
    loan_left <- loan_curr - sum_paid
    if (loan_left <= 0) {
      n_loans_rec[trxDate == list_loan_time[i_loantime], nloans := nloans + 1]
      sum_paid <- sum_paid - loan_curr
      print(paste(i_loantime, list_loan_time[i_loantime], n_loans_rec[trxDate == list_loan_time[i_loantime], .(nloans)]))
      # break
    } else {
      break
    }
  }
  print(i)
}
The idea is that, for every customer, I build a list of loans, loan dates, and payments. The best case is when the customer's total loan amount is equal to or less than (due to dirty data) the total amount of payments, i.e. a full repayment; then the number of repayments equals the number of loans issued to that customer. The typical case is a partial payment, in which case I sum the total amount of payments and iterate through each of the customer's loans, accumulating the loan amounts as I go. Once the accumulated loan amount exceeds the amount of payments, I count how many loans have actually been covered by the customer's payments.
The problem is that I have millions of customers, and each of them has made loans and payments at least 5 times. Since I am using a nested loop, it takes hours to complete.
So I am asking here if anyone has come across this problem and/or has a better, more efficient solution.
Thanks in advance!
Your logic is quite complicated and with this answer I don't attempt to replicate it fully; my intention is just to give you some ideas on how to optimise.
Also, as mentioned in comments, you could try to parallelise, or maybe use another programming language.
Anyway, since your setup already uses data.table, try to operate on the whole table at once as much as you can; that will usually be much faster than your big loop. For example, something like this.
I first calculate, per customer id, the balance and the sum of payments done:
data_loans_rec <- data_loans_rec[, balance := sum(balanceChange), by = customerID]
data_loans_rec <- data_loans_rec[, sumPayments := sum(balanceChange[TYPE == "payment"]), by = customerID]
With this, you already know that every customer with balance 0 has repaid everything:
data_loans_rec <- data_loans_rec[TYPE == "loan" & balance == 0, repaid := TRUE, by = list(customerID, index)]
These operations of course read a lot of data if you have millions of customers, but I'd say that data.table should handle them pretty quickly.
For the rest of the customers, and only for the records that are loans whose repayment status is still unknown, you can apply a function by group:
setRepaid <- function(balanceChange, sumPayments) {
  # note that here you get a vector with all the loans of a customer
  sumPay <- (-1) * sumPayments[1]
  if (sumPay == 0)
    return(rep(FALSE, length(balanceChange)))
  number_of_loans_paid <- 0
  for (i in 1:length(balanceChange)) {
    if (sum(balanceChange[1:i]) > sumPay)
      break
    number_of_loans_paid <- number_of_loans_paid + 1
  }
  return(c(rep(TRUE, number_of_loans_paid), rep(FALSE, length(balanceChange) - number_of_loans_paid)))
}
data_loans_rec <- data_loans_rec[TYPE == "loan" & is.na(repaid), repaid := setRepaid(balanceChange, sumPayments), by = list(customerID) ]
With that you get the desired result, at least for your example.
customerID balanceChange trxDate TYPE index balance sumPayments repaid
1: 242105 500 20170605 loan 1 1000 -1000 TRUE
2: 242105 1500 20170605 loan 2 1000 -1000 FALSE
3: 242105 -1000 20170607 payment 3 1000 -1000 NA
4: 242111 500 20170605 loan 1 0 -1000 TRUE
5: 242111 -500 20170606 payment 2 0 -1000 NA
6: 242111 500 20170607 loan 3 0 -1000 TRUE
7: 242111 -500 20170609 payment 4 0 -1000 NA
8: 242151 500 20170605 loan 1 500 0 FALSE
The advantages: the final grouped computation runs over far fewer customers, some values are precalculated, and data.table does the work your explicit loop was doing. Hopefully this approach gives you an improvement; I think it is worth a try.
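Once the repaid flag is in place, the first number you asked for (nRepayments per issue date) is a short grouped aggregation; this is just a sketch built on the columns computed above, and the timeGap average would additionally need the matching payment dates:

# count, per issue date, how many loans issued that day were repaid in full
data_loans_rec[TYPE == "loan",
               .(nRepayments = sum(repaid, na.rm = TRUE)),
               by = trxDate][order(trxDate)]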

Time/Date range grammars

I need to parse strings containing time spans such as:
Thursday 6:30-7:30 AM
December 30, 2009 - January 1, 2010
1/15/09, 7:30 to 8:30 PM
Thursday, from 6:30 to 7:30 AM
and others...
Added:
6:30 to 7:30
and date/times covering most of the formats that Word's Insert → Date can generate.
As I'd be extremely surprised if anything out there covers all the cases I need to cover, I'm looking for grammars to start from.
OK, the following grammar parses everything in your examples:
DTExp = Day, ['-', Day]
Day = DayExp, [[','], ['from'], TimeRange]
DayExp = WeekDay
       | [WeekDay], Month, DayNumber, [[','], YearNumber]
       | [WeekDay], MonthNumber, '/', DayNumber, ['/', YearNumber]
TimeRange = Time, [('-' | 'to'), Time]
Time = HourNumber, ':', MinuteNumber, ['AM' | 'PM']
WeekDay = 'monday' | 'tuesday' | ...
Month = MonthNumber | MonthName
MonthName = 'january' | 'february' | ...
DayNumber = Number
MonthNumber = Number
YearNumber = Number, ['AD'|'BC']
HourNumber = Number
MinuteNumber = Number
There is a slight ambiguity in the grammar: after a DayExp, a Time and a '-', either another DayExp or another Time may follow. This is resolved with a lookahead, because if a Time follows, the next number is followed by a ':'.
Let's try to construct a parse tree for "Thursday 6:30-7:30 AM":
DTExp
  Day
    DayExp
      WeekDay: 'Thursday'
    TimeRange
      Time: HourNumber '6' ':' MinuteNumber '30'
      '-'
      Time: HourNumber '7' ':' MinuteNumber '30' 'AM'
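To make the lookahead concrete, here is a minimal sketch in TypeScript (purely illustrative, not a full parser for the grammar above): after a '-', peek at the next two tokens; a Number followed by ':' means another Time is coming, otherwise parse another DayExp.

type Token = { kind: "number" | "colon" | "dash" | "word"; text: string };

// Per the grammar above, a Time always begins with Number ':'.
// Decide, without consuming tokens, whether the input at position i starts a Time.
function startsTime(tokens: Token[], i: number): boolean {
  return tokens[i]?.kind === "number" && tokens[i + 1]?.kind === "colon";
}

// After reading DayExp Time '-', one peek resolves the ambiguity:
// Time '-' Time (e.g. "6:30-7:30")  vs.  Day '-' Day (e.g. "December 30, 2009 - January 1, 2010")
function afterDash(tokens: Token[], i: number): "time" | "day" {
  return startsTime(tokens, i) ? "time" : "day";
}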
