Kusto query, comparing array of CIDR ranges to an IP - azure-data-explorer

I am trying to do something in Kusto similar to this post:
Filter IPs if they are in list of ranges
but using the IP ranges from a publicly available list to compare to some logs.
Here's what I have tried; I believe the issue relates to my not knowing how to reference the "network" property of the external data.
I get a "Query could not be parsed" error. Apologies for the formatting; I'm not sure how to make it respect line breaks.
let IP_Data = external_data(network:string,geoname_id:long,continent_code:string,continent_name:string ,country_iso_code:string,country_name:string,is_anonymous_proxy:bool,is_satellite_provider:bool)
['https://raw.githubusercontent.com/datasets/geoip2-ipv4/master/data/geoip2-ipv4.csv'];
let testIP = datatable (myip: string) ['4.28.114.50','4.59.176.50']; //Random IPs in Canada
testIP
| mv-apply tmpIP = IP_Data.network to typeof(string) on (
where ipv4_is_in_range(myip, tmpIP
)
| project-away tmpIP

This answers the OP's question directly; however, there is a great solution for this scenario, based on the ipv4_lookup plugin. See the new answer below.
For both options: since the CSV has a header row, I added with (ignoreFirstRecord = true) to the external_data call.
Option 1
testIP is defined as an array (and not a single-column table).
The base table is IP_Data, but the mv-apply is done on the testIP array. This enables you to access values from both IP_Data and testIP.
let IP_Data = external_data(network:string,geoname_id:long,continent_code:string,continent_name:string ,country_iso_code:string,country_name:string,is_anonymous_proxy:bool,is_satellite_provider:bool)
['https://raw.githubusercontent.com/datasets/geoip2-ipv4/master/data/geoip2-ipv4.csv'] with (ignoreFirstRecord = true);
let testIP = dynamic(['4.28.114.50','4.59.176.50']); //Random IPs in Canada
IP_Data
| mv-apply testIP = testIP to typeof(string) on (where ipv4_is_in_range(testIP, network))
network | geoname_id | continent_code | continent_name | country_iso_code | country_name | is_anonymous_proxy | is_satellite_provider | testIP
4.28.114.0/24 | 6251999 | NA | North America | CA | Canada | false | false | 4.28.114.50
4.59.176.0/24 | 6251999 | NA | North America | CA | Canada | false | false | 4.59.176.50
Fiddle
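As a quick sanity check outside of Kusto, the membership test that ipv4_is_in_range performs can be reproduced with Python's standard ipaddress module. This is an illustrative sketch using the two sample IPs and the networks they matched above:
import ipaddress

# Networks and test IPs taken from the sample results above
networks = [ipaddress.ip_network('4.28.114.0/24'), ipaddress.ip_network('4.59.176.0/24')]
test_ips = ['4.28.114.50', '4.59.176.50']

for ip in test_ips:
    for net in networks:
        if ipaddress.ip_address(ip) in net:
            print(ip, 'is in', net)  # e.g. 4.28.114.50 is in 4.28.114.0/24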
Option 2
Cross join both tables (using a dummy column) and then filter the results:
let IP_Data = external_data(network:string,geoname_id:long,continent_code:string,continent_name:string ,country_iso_code:string,country_name:string,is_anonymous_proxy:bool,is_satellite_provider:bool)
['https://raw.githubusercontent.com/datasets/geoip2-ipv4/master/data/geoip2-ipv4.csv'] with (ignoreFirstRecord = true);
let testIP = datatable (myip: string) ['4.28.114.50','4.59.176.50']; //Random IPs in Canada
testIP | extend dummy = 1
| join kind=inner (IP_Data | extend dummy = 1) on dummy
| where ipv4_is_in_range(myip, network)
| project-away dummy*
myip | network | geoname_id | continent_code | continent_name | country_iso_code | country_name | is_anonymous_proxy | is_satellite_provider
4.28.114.50 | 4.28.114.0/24 | 6251999 | NA | North America | CA | Canada | false | false
4.59.176.50 | 4.59.176.0/24 | 6251999 | NA | North America | CA | Canada | false | false
Fiddle
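The same cross-join-then-filter idea can be sketched in pandas for anyone prototyping outside Kusto. The two tiny frames below are stand-ins for IP_Data and testIP, not the real CSV:
import ipaddress
import pandas as pd

test_ip = pd.DataFrame({'myip': ['4.28.114.50', '4.59.176.50']})
ip_data = pd.DataFrame({'network': ['4.28.114.0/24', '4.59.176.0/24'],
                        'country_name': ['Canada', 'Canada']})

# Cross join via a dummy column, then filter, mirroring the KQL above
joined = (test_ip.assign(dummy=1)
          .merge(ip_data.assign(dummy=1), on='dummy')
          .drop(columns='dummy'))
in_range = joined.apply(
    lambda r: ipaddress.ip_address(r['myip']) in ipaddress.ip_network(r['network']),
    axis=1)
print(joined[in_range])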

New answer
Demo for 1M IPs, based on the ipv4_lookup plugin
let geoip2_ipv4 = external_data(network:string,geoname_id:long,continent_code:string,continent_name:string ,country_iso_code:string,country_name:string,is_anonymous_proxy:bool,is_satellite_provider:bool)
['https://raw.githubusercontent.com/datasets/geoip2-ipv4/master/data/geoip2-ipv4.csv'] with (ignoreFirstRecord = true)
| extend continent_name = coalesce(continent_name, '--- Missing ---');
let ips = materialize(range i from 1 to 1000000 step 1 | extend ip = format_ipv4(tolong(rand() * pow(2,32))));
ips
| evaluate ipv4_lookup(geoip2_ipv4, ip, network, return_unmatched = true)
| summarize count() by continent_name
continent_name | count_
North America | 399059
Asia | 201902
Europe | 173566
South America | 33795
Oceania | 13384
Africa | 17569
--- Missing --- | 226
 | 160499
(The last row has an empty continent_name: those are the IPs that did not match any network, returned because of return_unmatched = true. The counts sum to exactly 1,000,000.)
Fiddle
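If you want to reproduce the demo's random input outside Kusto, the format_ipv4(tolong(rand() * pow(2,32))) trick has a direct Python analogue (illustrative only, using only the standard library):
import random
import socket
import struct

def random_ipv4():
    # Pack a random 32-bit integer into 4 bytes and render it in dotted-quad form
    return socket.inet_ntoa(struct.pack('>I', random.getrandbits(32)))

print([random_ipv4() for _ in range(3)])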

Related

Parsing NSG Flowlogs in Azure Log Analytics Workspace to separate Public IP addresses

I have been updating a KQL query for use in reviewing NSG Flow Logs, to separate out the columns for Public/External IP addresses. However, the data within each cell of the column contains additional information that needs to be parsed out, so that my Excel add-in can run NSLOOKUP against each cell and look for additional insights. Later I would like to use the parse operator (https://learn.microsoft.com/en-us/azure/data-explorer/kusto/query/parseoperator) to separate this information and determine what each external IP address belongs to, through nslookup, resolve-dnsname, whois, or other means.
Currently, however, I am attempting to parse out the column, but it is not comma delimited and instead uses a single space and multiple pipes. Below is my query; I would like to add a parse to it, to either get a comma-delimited string in a single cell [for PublicIP (a combination of Source and Destination), PublicSourceIP, and PublicDestIP] or break it out into multiple rows. How would parse best be used to separate this information, or is there a better operator for this?
For Example the content could look like this
"20.xx.xx.xx|1|0|0|0|0|0 78.xxx.xxx.xxx|1|0|0|0|0|0"
AzureNetworkAnalytics_CL
| where SubType_s == 'FlowLog' and (FASchemaVersion_s == '1'or FASchemaVersion_s == '2')
| extend NSG = NSGList_s, Rule = NSGRule_s,Protocol=L4Protocol_s, Hits = (AllowedInFlows_d + AllowedOutFlows_d + DeniedInFlows_d + DeniedOutFlows_d)
| project-away NSGList_s, NSGRule_s
| project TimeGenerated, NSG, Rule, SourceIP = SrcIP_s, DestinationIP = DestIP_s, DestinationPort = DestPort_d, FlowStatus = FlowStatus_s, FlowDirection = FlowDirection_s, Protocol=L4Protocol_s, PublicIP=PublicIPs_s,PublicSourceIP = SrcPublicIPs_s,PublicDestIP=DestPublicIPs_s
// ## IP Address Filtering ##
| where isnotempty(PublicIP)
| parse kind = regex PublicIP with * "|1|0|0|0|0|0" ipnfo ' ' *
| project ipnfo
// ## port filtering
| where DestinationPort == '443'
This is based on extract_all(), followed by strcat_array() or mv-expand:
let AzureNetworkAnalytics_CL = datatable (RecordId:int, PublicIPs_s:string)
[
1 ,"51.105.236.244|2|0|0|0|0|0 51.124.32.246|12|0|0|0|0|0 51.124.57.242|1|0|0|0|0|0"
,2 ,"20.44.17.10|6|0|0|0|0|0 20.150.38.228|1|0|0|0|0|0 20.150.70.36|2|0|0|0|0|0 20.190.151.9|2|0|0|0|0|0 20.190.151.134|1|0|0|0|0|0 20.190.154.137|1|0|0|0|0|0 65.55.44.109|2|0|0|0|0|0"
,3 ,"20.150.70.36|1|0|0|0|0|0 52.183.220.149|1|0|0|0|0|0 52.239.152.234|2|0|0|0|0|0 52.239.169.68|1|0|0|0|0|0"
];
// Option 1
AzureNetworkAnalytics_CL
| project RecordId, PublicIPs = strcat_array(extract_all("(?:^| )([^|]+)", PublicIPs_s),',');
// Option 2
AzureNetworkAnalytics_CL
| mv-expand with_itemindex=i PublicIP = extract_all("(?:^| )([^|]+)", PublicIPs_s) to typeof(string)
| project RecordId, i = i+1, PublicIP
Fiddle
Option 1
RecordId | PublicIPs
1 | 51.105.236.244,51.124.32.246,51.124.57.242
2 | 20.44.17.10,20.150.38.228,20.150.70.36,20.190.151.9,20.190.151.134,20.190.154.137,65.55.44.109
3 | 20.150.70.36,52.183.220.149,52.239.152.234,52.239.169.68
Option 2
RecordId | i | PublicIP
1 | 1 | 51.105.236.244
1 | 2 | 51.124.32.246
1 | 3 | 51.124.57.242
2 | 1 | 20.44.17.10
2 | 2 | 20.150.38.228
2 | 3 | 20.150.70.36
2 | 4 | 20.190.151.9
2 | 5 | 20.190.151.134
2 | 6 | 20.190.154.137
2 | 7 | 65.55.44.109
3 | 1 | 20.150.70.36
3 | 2 | 52.183.220.149
3 | 3 | 52.239.152.234
3 | 4 | 52.239.169.68
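For anyone who wants to verify the regex outside Kusto, the same pattern works unchanged with Python's re module. The sample string below is the first record from the datatable above:
import re

s = "51.105.236.244|2|0|0|0|0|0 51.124.32.246|12|0|0|0|0|0 51.124.57.242|1|0|0|0|0|0"
# Same pattern as the extract_all() call above: capture everything between a
# start-of-string-or-space and the next pipe
ips = re.findall(r'(?:^| )([^|]+)', s)
print(','.join(ips))  # 51.105.236.244,51.124.32.246,51.124.57.242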
David answers your question. I would just like to add that I worked on the raw NSG Flow Logs and parsed them using KQL in this way:
The raw JSON:
{"records":[{"time":"2022-05-02T04:00:48.7788837Z","systemId":"x","macAddress":"x","category":"NetworkSecurityGroupFlowEvent","resourceId":"/SUBSCRIPTIONS/x/RESOURCEGROUPS/x/PROVIDERS/MICROSOFT.NETWORK/NETWORKSECURITYGROUPS/x","operationName":"NetworkSecurityGroupFlowEvents","properties":{"Version":2,"flows":[{"rule":"DefaultRule_DenyAllInBound","flows":[{"mac":"x","flowTuples":["1651463988,0.0.0.0,192.168.1.6,49944,8008,T,I,D,B,,,,"]}]}]}}]}
kql parsing:
| mv-expand records
| evaluate bag_unpack(records)
| extend flows = properties.flows
| mv-expand flows
| evaluate bag_unpack(flows)
| mv-expand flows
| extend flowz = flows.flowTuples
| mv-expand flowz
| extend result=split(tostring(flowz), ",")
| extend source_ip=tostring(result[1])
| extend destination_ip=tostring(result[2])
| extend source_port=tostring(result[3])
| extend destination_port=tostring(result[4])
| extend protocol=tostring(result[5])
| extend traffic_flow=tostring(result[6])
| extend traffic_decision=tostring(result[7])
| extend flow_state=tostring(result[8])
| extend packets_src_to_dst=tostring(result[9])
| extend bytes_src_to_dst=tostring(result[10])
| extend packets_dst_to_src=tostring(result[11])
| extend bytes_dst_to_src=tostring(result[12])
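For comparison, here is a hedged Python sketch that walks the same raw JSON shape and splits each flow tuple on commas. The field names mirror the KQL extends above (plus the leading epoch timestamp, which the KQL skips), and the one-record sample is trimmed from the JSON shown earlier:
import json

raw = '{"records":[{"properties":{"Version":2,"flows":[{"rule":"DefaultRule_DenyAllInBound","flows":[{"mac":"x","flowTuples":["1651463988,0.0.0.0,192.168.1.6,49944,8008,T,I,D,B,,,,"]}]}]}}]}'
FIELDS = ['timestamp', 'source_ip', 'destination_ip', 'source_port',
          'destination_port', 'protocol', 'traffic_flow', 'traffic_decision',
          'flow_state', 'packets_src_to_dst', 'bytes_src_to_dst',
          'packets_dst_to_src', 'bytes_dst_to_src']

for record in json.loads(raw)['records']:
    for rule_block in record['properties']['flows']:   # outer flows: one per NSG rule
        for flow in rule_block['flows']:               # inner flows: one per MAC
            for tup in flow['flowTuples']:
                print(dict(zip(FIELDS, tup.split(','))))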

Query all primary roads from PostGis Database filled with OSM Data within a boundary

I imported OpenStreetMap data through osm2pgsql into PostgreSQL (PostGIS),
and I would like to get an sf object from the data containing
all primary roads (geometry) within an area (bbox) into R.
I got lost because I would also like to have relations and nodes,
and I'm not sure whether a query on planet_osm_roads alone will be sufficient, or how the imported data structure differs from the OSM XML data I normally work with.
I understand it is probably a rather broad question, but
I would appreciate a starting point to understand the query language better.
This is my approach, but sadly I get an error:
conn <- RPostgreSQL::dbConnect("PostgreSQL", host = MYHOST,
dbname = "osm_data", user = "postgres", password = MYPASSWORD)
pgPostGIS(conn)
a<-pgGetGeom(conn, c("public", "planet_osm_roads"), geom = "way", gid = "osm_id",
other.cols = FALSE, clauses = "WHERE highway = 'primary' && ST_MakeEnvelope(11.2364353533134, 47.8050651144447, 11.8882527806375, 48.2423300001326)")
a<-st_as_sf(a)
This is the error I get:
Error in postgresqlExecStatement(conn, statement, ...) :
RS-DBI driver: (could not Retrieve the result : ERROR: syntax error at or near "ST_MakeEnvelope"
LINE 2: ...lic"."planet_osm_roads" WHERE "way" IS NOT NULL ST_MakeEnv...
^
)
Error in pgGetGeom(conn, c("public", "planet_osm_roads"), geom = "way", :
No geometries found.
In addition: Warning message:
In postgresqlQuickSQL(conn, statement, ...) :
Could not create execute: SELECT DISTINCT a.geo AS type
FROM (SELECT ST_GeometryType("way") as geo FROM "public"."planet_osm_roads" WHERE "way" IS NOT NULL ST_MakeEnvelope(11.2364353533134, 47.8050651144447, 11.8882527806375, 48.2423300001326)) a;
This is the db:
osm_data=# \d
List of relations
Schema | Name | Type | Owner
----------+--------------------+----------+----------
public | geography_columns | view | postgres
public | geometry_columns | view | postgres
public | planet_osm_line | table | postgres
public | planet_osm_nodes | table | postgres
public | planet_osm_point | table | postgres
public | planet_osm_polygon | table | postgres
public | planet_osm_rels | table | postgres
public | planet_osm_roads | table | postgres
public | planet_osm_ways | table | postgres
public | spatial_ref_sys | table | postgres
topology | layer | table | postgres
topology | topology | table | postgres
topology | topology_id_seq | sequence | postgres
schema_name table_name geom_column geometry_type type
1 public planet_osm_line way LINESTRING GEOMETRY
2 public planet_osm_point way POINT GEOMETRY
3 public planet_osm_polygon way GEOMETRY GEOMETRY
4 public planet_osm_roads way LINESTRING GEOMETRY
Table "public.planet_osm_roads"
Column | Type | Collation | Nullable | Default
--------------------+---------------------------+-----------+----------+---------
osm_id | bigint | | |
access | text | | |
addr:housename | text | | |
addr:housenumber | text | | |
addr:interpolation | text | | |
admin_level | text | | |
aerialway | text | | |
aeroway | text | | |
amenity | text | | |
area | text | | |
barrier | text | | |
bicycle | text | | |
brand | text | | |
bridge | text | | |
boundary | text | | |
building | text | | |
construction | text | | |
Your query looks just fine. Check the following example:
WITH planet_osm_roads (highway,way) AS (
VALUES
('primary','SRID=3857;POINT(1283861.57 6113504.88)'::geometry), --inside your bbox
('secondary','SRID=3857;POINT(1286919.06 6067184.04)'::geometry) --somewhere else ..
)
SELECT highway,ST_AsText(way)
FROM planet_osm_roads
WHERE
highway IN ('primary','secondary','tertiary') AND
planet_osm_roads.way && ST_Transform(
ST_MakeEnvelope(11.2364353533134,47.8050651144447,11.8882527806375,48.2423300001326, 4326),3857
);
highway | st_astext
---------+------------------------------
primary | POINT(1283861.57 6113504.88)
This image illustrates the BBOX and the points used in the example above
Check the documentation for more information on the bbox intersection operator &&.
However, there are a few things to consider.
In case you're creating the BBOX yourself in order to have an area for ST_Contains, consider simply using ST_DWithin. It will basically do the same, but you only have to provide a reference point and the distance.
Depending on the distribution of highway types in the table planet_osm_roads, and considering that your queries will always look for primary, secondary, or tertiary highways, creating a partial index could significantly improve performance. As the documentation says:
A partial index is an index built over a subset of a table; the subset
is defined by a conditional expression (called the predicate of the
partial index). The index contains entries only for those table rows
that satisfy the predicate. Partial indexes are a specialized feature,
but there are several situations in which they are useful.
So try something like this:
CREATE INDEX idx_planet_osm_roads_way ON planet_osm_roads USING gist(way)
WHERE highway IN ('primary','secondary','tertiary');
The highway column also needs to be indexed, so try this:
CREATE INDEX idx_planet_osm_roads_highway ON planet_osm_roads (highway);
Or even another partial index, in case you can't delete the other data but don't need it for anything:
CREATE INDEX idx_planet_osm_roads_highway ON planet_osm_roads (highway)
WHERE highway IN ('primary','secondary','tertiary');
You can always identify bottlenecks and check if the query planer is using your index with EXPLAIN.
Further reading
Getting all Buildings in range of 5 miles from specified coordinates
Buffers (Circle) in PostGIS
How can I set data from on table to another according spatial relation geometries in these tables
I figured it out.
This is how you get an sf object out of a PostGIS database filled with OSM data, within the 11.2364353533134,47.8050651144447,11.8882527806375,48.2423300001326 bbox:
library(sf)
library(RPostgreSQL)
library(tictoc)
pw <- MYPASSWORD
drv <- dbDriver("PostgreSQL")
con <- dbConnect(drv, dbname = "osm_data",
host = "localhost", port = 5432,
user = "postgres", password = pw)
tic()
sf_data = st_read(con, query = "SELECT osm_id,name,highway,way
FROM planet_osm_roads
WHERE (highway = 'primary' OR highway = 'secondary' OR highway = 'tertiary') AND ST_Contains(
ST_Transform(
ST_MakeEnvelope(11.2364353533134,47.8050651144447,11.8882527806375,48.2423300001326,4326)
,3857)
,planet_osm_roads.way);")
toc()
RPostgreSQL::dbDisconnect(con)
I still have to verify that the BBOX values are actually being taken into account; I'm not sure. Note the parentheses around the highway conditions: AND binds more tightly than OR, so without them the ST_Contains filter would only apply to the tertiary roads.
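For comparison, here is a minimal Python sketch of the same query (psycopg2 and the local connection details are assumptions here, just like the R connection above); it uses the && bbox operator suggested in the earlier answer:
import psycopg2

conn = psycopg2.connect(dbname='osm_data', host='localhost', port=5432,
                        user='postgres', password='MYPASSWORD')
sql = """
    SELECT osm_id, name, highway, ST_AsText(way)
    FROM planet_osm_roads
    WHERE highway IN ('primary', 'secondary', 'tertiary')
      AND way && ST_Transform(
            ST_MakeEnvelope(11.2364353533134, 47.8050651144447,
                            11.8882527806375, 48.2423300001326, 4326), 3857)
"""
with conn, conn.cursor() as cur:
    cur.execute(sql)
    for osm_id, name, highway, wkt in cur.fetchmany(5):
        print(osm_id, name, highway, wkt[:40])
conn.close()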

Multiple orderBy in Firestore

I have a question about how multiple orderBy works.
Supposing these documents:
collection/
doc1/
date: yesterday at 11:00pm
number: 1
doc2/
date: today at 01:00am
number: 6
doc3/
date: today at 13:00pm
number: 0
If I order by two fields like this:
.orderBy("date", "desc")
.orderBy("number", "desc")
.get()
How are those documents sorted? And, what about doing the opposite?
.orderBy("number", "desc")
.orderBy("date", "desc")
.get()
Will this result in the same order?
I'm a bit confused since I don't know if it will always end up ordering by the last orderBy.
In the documentation for orderBy() in Firebase it says this:
You can also order by multiple fields. For example, if you wanted to order by state, and within each state order by population in descending order:
Query query = cities.orderBy("state").orderBy("population", Direction.DESCENDING);
So, it is basically that, with the same logic as SQL's ORDER BY. Let's say you have a database of customers from all over the world. You can use ORDER BY Country and order them by their country in any order you want. But if you add a second argument, say Customer Name, then it will first order by country, and within that ordered list it will order by customer name. Example:
1. Adam | USA |
2. Jake | Germany |
3. Anna | USA |
4. Semir | Croatia |
5. Hans | Germany |
When you call orderBy("country") you will get this:
1. Semir | Croatia |
2. Jake | Germany |
3. Hans | Germany |
4. Adam | USA |
5. Anna | USA |
Then, when you add a second orderBy("customer name"), you get this:
1. Semir | Croatia |
2. Hans | Germany |
3. Jake | Germany |
4. Adam | USA |
5. Anna | USA |
You can see that Hans and Jake switched places, because H is before J, but they are still ordered by country name. In your case, when you use this:
.orderBy("date", "desc")
.orderBy("number", "desc")
.get()
It will first order by the date and then by the number. But since no two of your documents share the same date value, you won't notice any difference; this also goes for the second query. But let's say two of your documents had the same date, so your data looked like this:
collection/
doc1/
date: yesterday at 11:00pm
number: 1
doc2/
date: today at 01:00am
number: 6
doc3/
date: today at 01:00am
number: 0
Now, doc2 and doc3 are both dated today at 01:00am. When you order by date alone they will be one below the other, with doc2 probably shown first. But when you add orderBy("number"), it compares the numbers within the same date. So, with orderBy("number") ascending (without "desc"), you would get this:
orderBy("date");
// output: 1. doc1, 2. doc2, 3. doc3
orderBy("number");
// output: 1. doc1, 2. doc3, 3. doc2
Because number 0 is before 6. Just reverse it for desc.
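The way chained orderBy calls compose can be mimicked in plain Python with a tuple sort key, where the first element dominates and later elements only break ties. A small illustrative sketch using the documents from the question (the ISO date strings are stand-ins for real timestamps):
docs = [
    {'id': 'doc1', 'date': '2023-01-01 23:00', 'number': 1},  # yesterday at 11:00pm
    {'id': 'doc2', 'date': '2023-01-02 01:00', 'number': 6},  # today at 01:00am
    {'id': 'doc3', 'date': '2023-01-02 01:00', 'number': 0},  # today at 01:00am (tie)
]

# orderBy("date").orderBy("number"), both ascending: the tuple key sorts by
# date first and only consults number when two dates are equal.
ordered = sorted(docs, key=lambda d: (d['date'], d['number']))
print([d['id'] for d in ordered])  # ['doc1', 'doc3', 'doc2']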

Data Scraping with list in excel

I have a list in Excel: one code in Column A and another in Column B.
There is a website where I need to input both details into two different boxes, which then takes me to another page.
That page contains certain details which I need to scrape into Excel.
Any help with this?
Ok. Give this a shot:
import pandas as pd
import requests

df = pd.read_excel('C:/test/data.xlsx')
url = 'http://rla.dgft.gov.in:8100/dgft/IecPrint'
results = pd.DataFrame()
for row in df.itertuples():
    payload = {
        'iec': '%010d' % row[1],
        'name': row[2]}
    response = requests.post(url, params=payload)
    print('IEC: %010d\tName: %s' % (row[1], row[2]))
    try:
        dfs = pd.read_html(response.text)
    except:
        print('The name Given By you does not match with the data OR you have entered less than three letters')
        temp_df = pd.DataFrame([['%010d' % row[1], row[2], 'ERROR']],
                               columns=['IEC', 'Party Name and Address', 'ERROR'])
        results = results.append(temp_df, sort=False).reset_index(drop=True)
        continue
    generalData = dfs[0]
    generalData = generalData.iloc[:, [0, -1]].set_index(generalData.columns[0]).T.reset_index(drop=True)
    directorData = dfs[1]
    directorData = directorData.iloc[:, [-1]].T.reset_index(drop=True)
    directorData.columns = ['director_%02d' % (each + 1) for each in directorData.columns]
    try:
        branchData = dfs[2]
        branchData = branchData.iloc[:, [-1]].T.reset_index(drop=True)
        branchData.columns = ['branch_%02d' % (each + 1) for each in branchData.columns]
    except:
        branchData = pd.DataFrame()
        print('No Branch Data.')
    temp_df = pd.concat([generalData, directorData, branchData], axis=1)
    results = results.append(temp_df, sort=False).reset_index(drop=True)
results.to_excel('path.new_file.xlsx', index=False)
Output:
print (results.to_string())
IEC IEC Allotment Date File Number File Date Party Name and Address Phone No e_mail Exporter Type IEC Status Date of Establishment BIN (PAN+Extension) PAN ISSUE DATE PAN ISSUED BY Nature Of Concern Banker Detail director_01 director_02 director_03 branch_01 branch_02 branch_03 branch_04 branch_05 branch_06 branch_07 branch_08 branch_09
0 0305008111 03.05.2005 04/04/131/51473/AM20/ 20.08.2019 NISSAN MOTOR INDIA PVT. LTD. PLOT-1A,SIPCOT IN... 918939917907 shailesh.kumar#rnaipl.com 5 Merchant/Manufacturer Valid IEC 2005-02-07 AACCN0695D FT001 NaN NaN 3 Private Limited STANDARD CHARTERED BANK A/C Type:1 CA A/C No :... HARDEEP SINGH BRAR GURMEL SINGH BRAR HOUSE NO ... JEROME YVES MARIE SAIGOT THIERRY SAIGOT A9/2, ... KOJI KAWAKITA KIHACHI KAWAKITA 3-21-3, NAGATAK... Branch Code:165TH FLOOR ORCHID BUSINESS PARK,S... Branch Code:14NRPDC , WAREHOUSE NO.B -2A,PATAU... Branch Code:12EQUINOX BUSINESS PARK TOWER 3 4T... Branch Code:8GRAND PALLADIUM,5TH FLR.,B WING,,... Branch Code:6TVS LOGISTICS SERVICES LTD.SING,C... Branch Code:2PLOT 1A SIPCOT INDUL PARK,ORAGADA... Branch Code:5BLDG.NO.3 PART,124A,VALLAM A,SRIP... Branch Code:15SURVEY NO. 678 679 680 681 682 6... Branch Code:10INDOSPACE SKCL INDL.PARK,BULD.NO...

Efficiently calculating connected components in pyspark

I'm trying to find the connected components for friends in a city. My data is a list of edges with an attribute of city.
City | SRC | DEST
Houston Kyle -> Benny
Houston Benny -> Charles
Houston Charles -> Denny
Omaha Carol -> Brian
etc.
I know the connectedComponents function of pyspark's GraphX library will iterate over all the edges of a graph to find the connected components and I'd like to avoid that. How would I do so?
edit:
I thought I could do something like
select connected_components(*) from dataframe
groupby city
where connected_components generates a list of items.
Suppose your data is like this
import org.apache.spark._
import org.graphframes._
val l = List(("Houston","Kyle","Benny"),("Houston","Benny","charles"),
("Houston","Charles","Denny"),("Omaha","carol","Brian"),
("Omaha","Brian","Daniel"),("Omaha","Sara","Marry"))
var df = spark.createDataFrame(l).toDF("city","src","dst")
Create a list of cities for which you want to run connected components
val cities = List("Houston","Omaha")
Now, for every city in the cities list, run a filter on the city column, then create edge and vertex dataframes from the resulting dataframe. Create a GraphFrame from these edge and vertex dataframes and run the connected components algorithm:
val cities = List("Houston","Omaha")
for(city <- cities){
val edges = df.filter(df("city") === city).drop("city")
val vert = edges.select("src").union(edges.select("dst")).
distinct.select(col("src").alias("id"))
val g = GraphFrame(vert,edges)
val res = g.connectedComponents.run()
res.select("id", "component").orderBy("component").show()
}
Output
+-------+------------+
|     id|   component|
+-------+------------+
|   Kyle|249108103168|
|charles|249108103168|
|  Benny|249108103168|
|Charles|721554505728|
|  Denny|721554505728|
+-------+------------+

+------+------------+
|    id|   component|
+------+------------+
| Marry|858993459200|
|  Sara|858993459200|
| Brian|944892805120|
| carol|944892805120|
|Daniel|944892805120|
+------+------------+
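If you just want the per-city grouping idea without Spark, a pure-Python union-find sketch over the same sample edges looks like this (an illustration of the "filter by city, then find components" approach, not a distributed implementation):
from collections import defaultdict

edges = [('Houston', 'Kyle', 'Benny'), ('Houston', 'Benny', 'Charles'),
         ('Houston', 'Charles', 'Denny'), ('Omaha', 'Carol', 'Brian'),
         ('Omaha', 'Brian', 'Daniel'), ('Omaha', 'Sara', 'Marry')]

def components(pairs):
    # Union-find with path halving
    parent = {}
    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for a, b in pairs:
        parent[find(a)] = find(b)
    groups = defaultdict(set)
    for node in parent:
        groups[find(node)].add(node)
    return list(groups.values())

by_city = defaultdict(list)
for city, src, dst in edges:
    by_city[city].append((src, dst))
for city, pairs in by_city.items():
    print(city, components(pairs))
# e.g. Houston [{'Kyle', 'Benny', 'Charles', 'Denny'}]
#      Omaha   [{'Carol', 'Brian', 'Daniel'}, {'Sara', 'Marry'}]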
