How can I aggregate fields based on the value of another field? - azure-data-explorer

I have an Azure-hosted application in which some of our users have been complaining of difficulty logging in, so I added some logs, which show up in Application Insights. A sample of the data is shown below:
I need to create a report that shows:
1. The number of unique users (the Identifier field) that successfully logged in, and the number of unique users that failed to log in.
2. The number of failed login attempts that preceded a successful attempt (if any) - is this even possible in KQL?
One of my attempts was:
customEvents
| order by timestamp asc
| summarize TotalUserCount=dcount(tostring(customDimensions["Identifier"])),
SuccessCount=countif(name startswith "Success"),
FailureCount=countif(name !startswith "Success")
But this is wrong: I need the countif(name ...) aggregations to also be distinct by Identifier.
I'm new to KQL and so would appreciate some help.
Thanks.

I would start by analyzing the data at the session level.
It is easy to take it from there and summarize to the user level, etc.; a sketch of such a rollup follows the output table below.
// Data sample generation. Not part of the solution
// Setup
let p_event_num = 30;
let p_identifiers_num = 3;
let p_max_distance_between_events = 2h;
let p_names = dynamic(["Unsuccessful login. Invalid cred", "Unsuccessful login. Account wa", "Successful login"]);
// Internal
let p_identifiers = toscalar(range i from 1 to p_identifiers_num step 1 | summarize make_list(new_guid()));
let p_names_num = array_length(p_names);
let customEvents = materialize
(
range i from 1 to p_event_num step 1
| extend ['timestamp [UTC]'] = ago(24h*rand())
| extend Identifier = tostring(p_identifiers[toint(rand(p_identifiers_num))])
| extend name = p_names[toint(rand(p_names_num))]
);
// Solution starts here
customEvents
| project-rename ts = ['timestamp [UTC]']
| partition hint.strategy=native by Identifier
(
order by ts asc
| extend session_id = row_cumsum(iff(ts - prev(ts) >= p_max_distance_between_events, 1, 0))
| summarize session_start = min(ts)
,session_end = max(ts)
,session_duration = 0s // placeholder to fix the column order; the actual value is computed below
,session_events = count()
,session_successes = countif(name startswith "Successful")
,session_failures = countif(name !startswith "Successful")
,arg_max(ts, name)
by Identifier, session_id
)
| project-away ts
| project-rename session_last_name = name
| extend session_duration = session_end - session_start
| order by Identifier asc, session_id asc
| as user_sessions
Identifier | session_id | session_start | session_end | session_duration | session_events | session_successes | session_failures | session_last_name
3b169e06-52e5-45d8-b951-62d5e8ab385b | 0 | 2022-06-26T20:22:22.4006737Z | 2022-06-26T20:22:22.4006737Z | 00:00:00 | 1 | 0 | 1 | Unsuccessful login. Account wa
3b169e06-52e5-45d8-b951-62d5e8ab385b | 1 | 2022-06-26T22:47:01.8487347Z | 2022-06-26T22:47:01.8487347Z | 00:00:00 | 1 | 1 | 0 | Successful login
3b169e06-52e5-45d8-b951-62d5e8ab385b | 2 | 2022-06-27T04:57:15.6405722Z | 2022-06-27T07:32:10.4409854Z | 02:34:54.8004132 | 4 | 1 | 3 | Unsuccessful login. Account wa
3b169e06-52e5-45d8-b951-62d5e8ab385b | 3 | 2022-06-27T10:44:19.8739205Z | 2022-06-27T12:46:14.2586725Z | 02:01:54.3847520 | 3 | 0 | 3 | Unsuccessful login. Account wa
3b169e06-52e5-45d8-b951-62d5e8ab385b | 4 | 2022-06-27T14:50:35.3882433Z | 2022-06-27T14:50:35.3882433Z | 00:00:00 | 1 | 0 | 1 | Unsuccessful login. Account wa
3b169e06-52e5-45d8-b951-62d5e8ab385b | 5 | 2022-06-27T18:33:51.4464796Z | 2022-06-27T18:47:06.0628481Z | 00:13:14.6163685 | 2 | 0 | 2 | Unsuccessful login. Invalid cred
63ce6481-818e-4f3b-913e-88a1b76ac423 | 0 | 2022-06-26T19:27:05.1220534Z | 2022-06-26T20:24:53.5616443Z | 00:57:48.4395909 | 2 | 0 | 2 | Unsuccessful login. Account wa
63ce6481-818e-4f3b-913e-88a1b76ac423 | 1 | 2022-06-27T02:17:03.4123257Z | 2022-06-27T02:36:50.1918116Z | 00:19:46.7794859 | 3 | 1 | 2 | Successful login
63ce6481-818e-4f3b-913e-88a1b76ac423 | 2 | 2022-06-27T13:27:27.2550722Z | 2022-06-27T14:32:39.6361479Z | 01:05:12.3810757 | 3 | 2 | 1 | Successful login
63ce6481-818e-4f3b-913e-88a1b76ac423 | 3 | 2022-06-27T17:20:34.3725797Z | 2022-06-27T17:20:34.3725797Z | 00:00:00 | 1 | 0 | 1 | Unsuccessful login. Account wa
6ed81ab3-447e-481d-8bb3-a5f4087234bb | 0 | 2022-06-26T22:38:39.3105749Z | 2022-06-26T22:38:39.3105749Z | 00:00:00 | 1 | 0 | 1 | Unsuccessful login. Account wa
6ed81ab3-447e-481d-8bb3-a5f4087234bb | 1 | 2022-06-27T03:06:04.340965Z | 2022-06-27T04:49:37.3314224Z | 01:43:32.9904574 | 3 | 3 | 0 | Successful login
6ed81ab3-447e-481d-8bb3-a5f4087234bb | 2 | 2022-06-27T07:11:47.260913Z | 2022-06-27T07:11:47.260913Z | 00:00:00 | 1 | 0 | 1 | Unsuccessful login. Account wa
6ed81ab3-447e-481d-8bb3-a5f4087234bb | 3 | 2022-06-27T11:39:02.356791Z | 2022-06-27T16:49:23.5818891Z | 05:10:21.2250981 | 4 | 2 | 2 | Unsuccessful login. Invalid cred
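For example, to roll the sessions up to the user level, you could keep piping from the query above (a sketch; the "as user_sessions" line just names the intermediate result):
| summarize sessions = count()
,total_successes = sum(session_successes)
,total_failures = sum(session_failures)
,avg_session_duration = avg(session_duration)
by Identifier
| order by Identifier asc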

I need countif(name...) to also be distinct by Identifier.
If I understood your intention correctly, you could use dcountif().
For example:
customEvents
| where timestamp > ago(1d)
| extend Identifier = tostring(customDimensions["Identifier"])
| summarize TotalUserCount = dcount(Identifier),
SuccessCount = dcountif(Identifier, name startswith "Success"),
FailureCount = dcountif(Identifier, name !startswith "Success")
The number of failed login attempts that preceded a successful attempt (if any) - is this even possible in KQL?
You could try using the scan operator for this: https://learn.microsoft.com/en-us/azure/data-explorer/kusto/query/scan-operator
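For example, here is a minimal, untested sketch. It assumes success events are the ones whose name starts with "Success", and it reuses the partition-per-user pattern from the other answer: a running failure counter is kept per user and reported at each successful login.
customEvents
| extend Identifier = tostring(customDimensions["Identifier"])
| partition hint.strategy=native by Identifier
(
order by timestamp asc
| scan declare (RunningFailures: long = 0, FailuresBeforeSuccess: long) with
(
// FailuresBeforeSuccess snapshots the counter as it stood before this event;
// RunningFailures then resets on a success, or increments on a failure
step s: true =>
FailuresBeforeSuccess = iff(name startswith "Success", s.RunningFailures, long(null)),
RunningFailures = iff(name startswith "Success", 0, s.RunningFailures + 1);
)
)
| where name startswith "Success"
| summarize FailedAttemptsBeforeSuccess = sum(FailuresBeforeSuccess) by Identifier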

Related

Parsing NSG Flowlogs in Azure Log Analytics Workspace to separate Public IP addresses

I have been updating a KQL query for use in reviewing NSG Flow Logs, to separate the columns for Public/External IP addresses. However, the data within each cell of the column contains additional information that needs to be parsed out, so that my Excel add-in can run NSLOOKUP against each cell to look for additional insights. Later I would like to use the parse operator (https://learn.microsoft.com/en-us/azure/data-explorer/kusto/query/parseoperator) to separate this information, to determine what each external IP address belongs to through nslookup, resolve-dnsname, whois, or other means.
Currently, however, I am attempting to parse out the column, but it is not comma delimited; instead it uses a single space and multiple pipes. Below is my query. I would like to add a parse to it, to get either a comma-delimited string in a single cell (for PublicIP (a combination of Source and Destination), PublicSourceIP, and PublicDestIP) or to break the information out into multiple rows. How would parse best be used to separate this information, or is there a better operator to carry this out?
For example, the content could look like this:
"20.xx.xx.xx|1|0|0|0|0|0 78.xxx.xxx.xxx|1|0|0|0|0|0"
AzureNetworkAnalytics_CL
| where SubType_s == 'FlowLog' and (FASchemaVersion_s == '1' or FASchemaVersion_s == '2')
| extend NSG = NSGList_s, Rule = NSGRule_s,Protocol=L4Protocol_s, Hits = (AllowedInFlows_d + AllowedOutFlows_d + DeniedInFlows_d + DeniedOutFlows_d)
| project-away NSGList_s, NSGRule_s
| project TimeGenerated, NSG, Rule, SourceIP = SrcIP_s, DestinationIP = DestIP_s, DestinationPort = DestPort_d, FlowStatus = FlowStatus_s, FlowDirection = FlowDirection_s, Protocol=L4Protocol_s, PublicIP=PublicIPs_s,PublicSourceIP = SrcPublicIPs_s,PublicDestIP=DestPublicIPs_s
// ## IP Address Filtering ##
| where isnotempty(PublicIP)
| parse kind = regex PublicIP with * "|1|0|0|0|0|0" ipnfo ' ' *
| project ipnfo
// ## port filtering
| where DestinationPort == '443'
Here is an approach based on extract_all(), followed by either strcat_array() (Option 1) or mv-expand (Option 2):
let AzureNetworkAnalytics_CL = datatable (RecordId:int, PublicIPs_s:string)
[
1 ,"51.105.236.244|2|0|0|0|0|0 51.124.32.246|12|0|0|0|0|0 51.124.57.242|1|0|0|0|0|0"
,2 ,"20.44.17.10|6|0|0|0|0|0 20.150.38.228|1|0|0|0|0|0 20.150.70.36|2|0|0|0|0|0 20.190.151.9|2|0|0|0|0|0 20.190.151.134|1|0|0|0|0|0 20.190.154.137|1|0|0|0|0|0 65.55.44.109|2|0|0|0|0|0"
,3 ,"20.150.70.36|1|0|0|0|0|0 52.183.220.149|1|0|0|0|0|0 52.239.152.234|2|0|0|0|0|0 52.239.169.68|1|0|0|0|0|0"
];
// Option 1: one comma-delimited string of IPs per record.
// The regex (?:^| )([^|]+) captures each run of non-pipe characters that
// follows the start of the string or a space, i.e. each IP address.
AzureNetworkAnalytics_CL
| project RecordId, PublicIPs = strcat_array(extract_all("(?:^| )([^|]+)", PublicIPs_s),',');
// Option 2: one row per IP, with its position in the original string
AzureNetworkAnalytics_CL
| mv-expand with_itemindex=i PublicIP = extract_all("(?:^| )([^|]+)", PublicIPs_s) to typeof(string)
| project RecordId, i = i+1, PublicIP
Option 1
RecordId | PublicIPs
1 | 51.105.236.244,51.124.32.246,51.124.57.242
2 | 20.44.17.10,20.150.38.228,20.150.70.36,20.190.151.9,20.190.151.134,20.190.154.137,65.55.44.109
3 | 20.150.70.36,52.183.220.149,52.239.152.234,52.239.169.68
Option 2
RecordId | i | PublicIP
1 | 1 | 51.105.236.244
1 | 2 | 51.124.32.246
1 | 3 | 51.124.57.242
2 | 1 | 20.44.17.10
2 | 2 | 20.150.38.228
2 | 3 | 20.150.70.36
2 | 4 | 20.190.151.9
2 | 5 | 20.190.151.134
2 | 6 | 20.190.154.137
2 | 7 | 65.55.44.109
3 | 1 | 20.150.70.36
3 | 2 | 52.183.220.149
3 | 3 | 52.239.152.234
3 | 4 | 52.239.169.68
David's answer covers your question. I would just like to add that I worked on the raw NSG Flow Logs and parsed them using KQL in this way:
The raw JSON:
{"records":[{"time":"2022-05-02T04:00:48.7788837Z","systemId":"x","macAddress":"x","category":"NetworkSecurityGroupFlowEvent","resourceId":"/SUBSCRIPTIONS/x/RESOURCEGROUPS/x/PROVIDERS/MICROSOFT.NETWORK/NETWORKSECURITYGROUPS/x","operationName":"NetworkSecurityGroupFlowEvents","properties":{"Version":2,"flows":[{"rule":"DefaultRule_DenyAllInBound","flows":[{"mac":"x","flowTuples":["1651463988,0.0.0.0,192.168.1.6,49944,8008,T,I,D,B,,,,"]}]}]}}]}
KQL parsing (the pipeline below starts from the table into which the raw JSON records above are ingested):
| mv-expand records
| evaluate bag_unpack(records)
| extend flows = properties.flows
| mv-expand flows
| evaluate bag_unpack(flows)
| mv-expand flows
| extend flowz = flows.flowTuples
| mv-expand flowz
| extend result=split(tostring(flowz), ",") // result[0] is the flow tuple's unix timestamp
| extend source_ip=tostring(result[1])
| extend destination_ip=tostring(result[2])
| extend source_port=tostring(result[3])
| extend destination_port=tostring(result[4])
| extend protocol=tostring(result[5])
| extend traffic_flow=tostring(result[6])
| extend traffic_decision=tostring(result[7])
| extend flow_state=tostring(result[8])
| extend packets_src_to_dst=tostring(result[9])
| extend bytes_src_to_dst=tostring(result[10])
| extend packets_dst_to_src=tostring(result[11])
| extend bytes_dst_to_src=tostring(result[12])

Filter for the appearance of 2 values that must each exist at least 1 time

Title may be bad; I couldn't think of a better one.
My comment data; each comment is assigned to an account by usernameChannelId:
usernameChannelId | hasTopic | sentiment_sum | commentId
a | 1 | 4 | xyxe24
a | 0 | 2 | h5hssd
a | 1 | 3 | k785hg
a | 0 | 2 | j7kgbf
b | 1 | -2 | 76hjf2
c | 0 | -1 | 3gqash
c | 1 | 2 | ptkfja
c | 0 | -2 | gbe5gs
c | 1 | 1 | hghggd
My code:
SELECT u.usernameChannelId, avg(sentiment_sum) sentiment_sum, u.hasTopic
FROM total_comments u
WHERE u.hasTopic is True
GROUP BY u.usernameChannelId
HAVING count(u.usernameChannelId) > 0
UNION
SELECT u.usernameChannelId, avg(sentiment_sum) sentiment_sum, u.hasTopic
FROM total_comments u
WHERE u.hasTopic is False
GROUP BY u.usernameChannelId
I want to get all usernameChannelIds that have at least 1 comment with hasTopic == 0 and at least 1 comment with hasTopic == 1 (to compare both groups statistically and remove users that commented only on topic or only on offtopic videos).
How can I filter like that?
Here's a little trick that may help. First, you need to get familiar with the CASE expression; here's an excerpt from the docs.
The CASE expression
A CASE expression serves a role similar to IF-THEN-ELSE in other
programming languages.
The optional expression that occurs in between the CASE keyword and
the first WHEN keyword is called the "base" expression. There are two
basic forms of the CASE expression: those with a base expression and
those without.
An expression like CASE when hasTopic is False then 1 else 0 END will evaluate to 1 if hasTopic is 0. An expression for hasTopic is True would be similar.
Now, those CASEs can be summed, which tells you whether a user has any rows with hasTopic True and any with hasTopic False.
Something like this in the HAVING clause might do the trick (one such condition for each value, of course):
HAVING SUM(CASE when hasTopic is False then 1 else 0 END) > 0
(It would be necessary to remove the WHERE clause, and the UNION query would be unnecessary; a combined sketch follows.)
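Putting it together, a minimal sketch of the combined query, assuming SQLite-style booleans with hasTopic stored as 0/1 and the table and column names from the question (the conditional AVGs replace the UNION by putting each group's average in its own column):
SELECT u.usernameChannelId,
       AVG(CASE when u.hasTopic is True then u.sentiment_sum END) AS sentiment_on_topic,
       AVG(CASE when u.hasTopic is False then u.sentiment_sum END) AS sentiment_off_topic
FROM total_comments u
GROUP BY u.usernameChannelId
HAVING SUM(CASE when u.hasTopic is True then 1 else 0 END) > 0
   AND SUM(CASE when u.hasTopic is False then 1 else 0 END) > 0;
The CASE inside AVG has no ELSE branch, so rows from the other group evaluate to NULL and are ignored by the aggregate.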

How to obtain distinct values based on another column in the same table?

I'm not sure how to word the title properly so sorry if it wasn't clear at first.
What I want to do is to find users that have logged into a specific page, but not the other.
The table I have looks like this:
Users_Logins
------------------------------------------------------
| IDLogin | Username | Page | Date | Hour |
|---------|----------|-------|------------|----------|
| 1 | User_1 | Url_1 | 2019-05-11 | 11:02:51 |
| 2 | User_1 | Url_2 | 2019-05-11 | 14:16:21 |
| 3 | User_2 | Url_1 | 2019-05-12 | 08:59:48 |
| 4 | User_2 | Url_1 | 2019-05-12 | 16:36:27 |
| ... | ... | ... | ... | ... |
------------------------------------------------------
So as you can see, User 1 logged into Url 1 and 2, but User 2 logged into Url 1 only.
How should I go about finding users that logged into Url 1, but never logged into Url 2 during a certain period of time?
Thanks in advance!
I will try to improve the title of your question later, but for the time being, this is how I accomplished what you are asking for:
Query:
select distinct username from User_Logins
where page = 'Url_1'
and username not in
(select username from User_Logins
where Page = 'Url_2')
and date BETWEEN '2019-05-12' AND '2019-05-12'
and hour BETWEEN '00:00:00' AND '12:00:00';
Returns:
User_2
Comments:
I basically used a subquery to filter out the usernames you don't care about. :)
The time range is getting only 1 result, which you can test by removing the "distinct" in the first line of the query. If you then remove the time range from the query, you'll get 2 results.
You can do it by grouping by username and applying the conditions in a HAVING clause:
select username
from User_Logins
where
date between '..........' and '..........'
and
hour between '..........' and '..........'
group by username
having
sum(page = 'Url_1') > 0
and
sum(page = 'Url_2') = 0;
Replace the dots with the date/time intervals you want.
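Note that sum(page = 'Url_1') relies on MySQL-style boolean arithmetic, where a true comparison sums as 1. If your database does not support that, here is a sketch of the same idea in portable SQL (the placeholder dots are kept as in the answer above):
select username
from User_Logins
where
date between '..........' and '..........'
and
hour between '..........' and '..........'
group by username
having
sum(case when page = 'Url_1' then 1 else 0 end) > 0
and
sum(case when page = 'Url_2' then 1 else 0 end) = 0;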

Activiti and candidate groups

In APS 1.8.1, I have defined a process where each task has a candidate group.
When I log in with a user that belongs to a candidate group, I cannot see the process instance.
I have found out that when I try to access the process instances, APS executes the following query in the database:
select distinct RES.* , DEF.KEY_ as PROC_DEF_KEY_, DEF.NAME_ as PROC_DEF_NAME_, DEF.VERSION_ as PROC_DEF_VERSION_, DEF.DEPLOYMENT_ID_ as DEPLOYMENT_ID_
from ACT_HI_PROCINST RES
left outer join ACT_RE_PROCDEF DEF on RES.PROC_DEF_ID_ = DEF.ID_
left join ACT_HI_IDENTITYLINK I_OR0 on I_OR0.PROC_INST_ID_ = RES.ID_
WHERE RES.TENANT_ID_ = 'tenant_1'
and
( (
exists(select LINK.USER_ID_ from ACT_HI_IDENTITYLINK LINK where USER_ID_ = '1003' and LINK.PROC_INST_ID_ = RES.ID_)
)
or (
I_OR0.TYPE_ = 'participant'
and
I_OR0.GROUP_ID_ IN ('1','2','2023','2013','2024','2009','2025','2026','2027','2028','2029','2007','2018','2020','2017','2015','2012','2003','2021','2019','2004','2002','2005','2030','2031','2032','2011','2006','2008','2014','2010','2016','2022','2033','2034','2035','2036','2037','1003')
) )
order by RES.START_TIME_ desc
LIMIT 50 OFFSET 0
This query does not return any record for two reasons:
In my ACT_HI_IDENTITYLINK no tasks have both the group_id_ and the proc_inst_id_ set.
The type of the record is "candidate" but the query is looking for "participant"
select * from ACT_HI_IDENTITYLINK;
-[ RECORD 1 ]-+----------
id_ | 260228
group_id_ |
type_ | starter
user_id_ | 1002
task_id_ |
proc_inst_id_ | 260226
-[ RECORD 2 ]-+----------
id_ | 260294
group_id_ | 2006
type_ | candidate
user_id_ |
task_id_ | 260293
proc_inst_id_ |
-[ RECORD 3 ]-+----------
id_ | 260300
group_id_ | 2009
type_ | candidate
user_id_ |
task_id_ | 260299
proc_inst_id_ |
-[ RECORD 4 ]-+----------
id_ | 262503
group_id_ |
type_ | starter
user_id_ | 1002
task_id_ |
proc_inst_id_ | 262501
-[ RECORD 5 ]-+----------
id_ | 262569
group_id_ | 2016
type_ | candidate
user_id_ |
task_id_ | 262568
proc_inst_id_ |
-[ RECORD 6 ]-+----------
id_ | 262575
group_id_ | 2027
type_ | candidate
user_id_ |
task_id_ | 262574
proc_inst_id_ |
Why is the query looking only for "participant", and why do the records that have type_ = 'candidate' not have any proc_inst_id_ set?
UPDATE:
The problem with the constraint "participant" has a simple workaround: it would be enough to add the same candidate group as a participant.
See also Feature allowing "Participant" configuration in BPM Modeler
Unfortunately this is not enough to solve the second problem. The record is still not returned because the column proc_inst_id_ is not set.
I tried updating the column manually on the "participant" record, and I have verified that after doing so the page is accessible and works well.
Does anyone know why the column is not set?
A possible solution (or workaround until ACTIVITI-696 is fixed) is to add each group that is a candidate for a task as a participant of the process instance.
There is a REST API that does it:
POST /enterprise/process-instances/{processInstanceId}/identitylinks
What this API does should be done by a task listener that automatically adds the candidate groups of the created task as participants of the process instance.
To add the new identity link, use the following lines in the listener:
ActivitiEntityEvent aee = (ActivitiEntityEvent) activitiEvent;
TaskEntity taskEntity = (TaskEntity) aee.getEntity();
List<IdentityLinkEntity> identities = taskEntity.getIdentityLinks();
if (identities != null) {
    for (IdentityLinkEntity identityLinkEntity : identities) {
        String groupId = identityLinkEntity.getGroupId();
        if (groupId != null) { // user-based identity links carry no group id
            runtimeService.addGroupIdentityLink(activitiEvent.getProcessInstanceId(), groupId, "participant");
        }
    }
}
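For context, here is a hedged sketch of a complete listener wrapping those lines (the class name and the wiring are assumptions, and which event to react to can vary; APS 1.8 embeds Activiti 5.x, where registering for TASK_CREATED events is a reasonable starting point):
import java.util.List;
import org.activiti.engine.RuntimeService;
import org.activiti.engine.delegate.event.ActivitiEntityEvent;
import org.activiti.engine.delegate.event.ActivitiEvent;
import org.activiti.engine.delegate.event.ActivitiEventListener;
import org.activiti.engine.delegate.event.ActivitiEventType;
import org.activiti.engine.impl.persistence.entity.IdentityLinkEntity;
import org.activiti.engine.impl.persistence.entity.TaskEntity;

public class CandidateGroupToParticipantListener implements ActivitiEventListener {

    private final RuntimeService runtimeService;

    public CandidateGroupToParticipantListener(RuntimeService runtimeService) {
        this.runtimeService = runtimeService;
    }

    @Override
    public void onEvent(ActivitiEvent activitiEvent) {
        // only react when a task is created
        if (activitiEvent.getType() != ActivitiEventType.TASK_CREATED) {
            return;
        }
        TaskEntity taskEntity = (TaskEntity) ((ActivitiEntityEvent) activitiEvent).getEntity();
        List<IdentityLinkEntity> identities = taskEntity.getIdentityLinks();
        if (identities == null) {
            return;
        }
        for (IdentityLinkEntity link : identities) {
            String groupId = link.getGroupId();
            if (groupId != null) { // user-based identity links carry no group id
                runtimeService.addGroupIdentityLink(
                        activitiEvent.getProcessInstanceId(), groupId, "participant");
            }
        }
    }

    @Override
    public boolean isFailOnException() {
        // a failure here should not abort the engine operation
        return false;
    }
}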
First, check that your workflow has really started, by going to "Workflow I have started"; you should see your task under "Active tasks". If not, there is an error in your definitions. If everything is OK, check your group name, and don't forget to add the "GROUP_" prefix, e.g. GROUP_myGRPName.
If you want to see the workflow instances, it is simpler with a web script and the services.

How can I access the data in a Cassandra Table using RCassandra

I need to get the data in a column of a table in a Cassandra database. I am using RCassandra for this. After getting the data I need to do some text mining on it. Please suggest how to connect to Cassandra and get the data into my R script using RCassandra.
My RScript :
library(RCassandra)
connect.handle <- RC.connect(host="127.0.0.1", port=9160)
RC.cluster.name(connect.handle)
RC.use(connect.handle, 'mykeyspace')
sourcetable <- RC.read.table(connect.handle, "sourcetable")
print(ncol(sourcetable))
print(nrow(sourcetable))
print(sourcetable)
This will print the output as:
> print(ncol(sourcetable))
[1] 1
> print(nrow(sourcetable))
[1] 18
> print(sourcetable)
144 BBC News
158 IBN Live
123 Reuters
131 IBN Live
But my Cassandra table contains four columns, and here it is showing only 1 column. I need to get each column's values separately. How do I get the individual column values (e.g. each feedurl)? What changes should I make in my R script?
My Cassandra table, named sourcetable:
I have used Cassandra and R with the appropriate JAR files, but RCassandra is easier. RCassandra is a direct interface to Cassandra without the use of Java. To connect to Cassandra, you use RC.connect to obtain a connection handle, like this:
RC.connect(host = <xxx>, port = <xxx>)
RC.login(conn, username = "bar", password = "foo")
You can then use the RC.get command to retrieve data, or the RC.read.table command to read table data.
BUT, first you should read THIS.
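For what it's worth, a hedged sketch of pulling specific columns for one row with RC.get (the row key and column names below are placeholders, not values from your table). Note also that RCassandra speaks Cassandra's Thrift protocol (port 9160), so CQL3 tables may not expose all of their columns to it; that could be why RC.read.table shows a single column in your case.
library(RCassandra)
conn <- RC.connect(host = "127.0.0.1", port = 9160)
RC.use(conn, "mykeyspace")
# fetch named columns for one row key; replace the key and the
# column names with ones that exist in your table
row <- RC.get(conn, "sourcetable", "144", c.names = c("sourcename", "feedurl"))
print(row)
RC.close(conn)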
I am confused as well. Table demo.emp has 4 rows and 4 columns (empid, deptid, first_name and last_name). Neither RC.get nor RC.read.table gets all the data.
cqlsh:demo> select * from emp;
empid | deptid | first_name | last_name
-------+--------+------------+-----------
1 | 1 | John | Doe
1 | 2 | Mia | Lewis
2 | 1 | Jean | Doe
2 | 2 | Manny | Lewis
> RC.get.range.slices(c, "emp", limit=10)
[[1]]
key value ts
1 1.474796e+15
2 John 1.474796e+15
3 Doe 1.474796e+15
4 1.474796e+15
5 Mia 1.474796e+15
[[2]]
key value ts
1 1.474796e+15
2 Jean 1.474796e+15
3 Doe 1.474796e+15
4 1.474796e+15
5 Manny 1.474796e+15
