efficiently calculating connected components in pyspark - graph

I'm trying to find the connected components for friends in a city. My data is a list of edges with an attribute of city.
City    | SRC     | DEST
--------+---------+---------
Houston | Kyle    | Benny
Houston | Benny   | Charles
Houston | Charles | Denny
Omaha   | Carol   | Brian
etc.
I know the connectedComponents algorithm in Spark's graph libraries (GraphX, or GraphFrames from PySpark) will iterate over all the edges of the whole graph to find the connected components, and I'd like to avoid that by computing the components per city. How would I do so?
edit:
I thought I could do something like
select connected_components(*) from dataframe
group by city
where connected_components generates a list of items.

Suppose your data looks like this:

import org.apache.spark._
import org.apache.spark.sql.functions.col
import org.graphframes._

val l = List(("Houston","Kyle","Benny"),("Houston","Benny","charles"),
  ("Houston","Charles","Denny"),("Omaha","carol","Brian"),
  ("Omaha","Brian","Daniel"),("Omaha","Sara","Marry"))
val df = spark.createDataFrame(l).toDF("city","src","dst")
Create a list of the cities for which you want to run connected components. Then, for every city in the list, filter on the city column, build edge and vertex DataFrames from the result, create a GraphFrame from them, and run the connected-components algorithm:
val cities = List("Houston","Omaha")
for (city <- cities) {
  val edges = df.filter(df("city") === city).drop("city")
  val vert = edges.select("src").union(edges.select("dst"))
    .distinct.select(col("src").alias("id"))
  // Note: GraphFrames' connected-components implementation requires a Spark
  // checkpoint directory to be set beforehand (sc.setCheckpointDir(...)).
  val g = GraphFrame(vert, edges)
  val res = g.connectedComponents.run()
  res.select("id", "component").orderBy("component").show()
}
Output. (Note that vertex ids are case-sensitive, so "charles" and "Charles" are treated as different people; that is why Houston splits into two components below.)

+-------+------------+
|     id|   component|
+-------+------------+
|   Kyle|249108103168|
|charles|249108103168|
|  Benny|249108103168|
|Charles|721554505728|
|  Denny|721554505728|
+-------+------------+

+------+------------+
|    id|   component|
+------+------------+
| Marry|858993459200|
|  Sara|858993459200|
| Brian|944892805120|
| carol|944892805120|
|Daniel|944892805120|
+------+------------+
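Since the question asks about PySpark, here is a minimal sketch of the same per-city approach using the GraphFrames Python API (the package installation and the checkpoint path are assumptions, not part of the original answer):

from pyspark.sql import SparkSession
from graphframes import GraphFrame

spark = SparkSession.builder.getOrCreate()
# GraphFrames' connected components requires a checkpoint directory.
spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints")  # hypothetical path

data = [("Houston", "Kyle", "Benny"), ("Houston", "Benny", "Charles"),
        ("Houston", "Charles", "Denny"), ("Omaha", "Carol", "Brian"),
        ("Omaha", "Brian", "Daniel"), ("Omaha", "Sara", "Marry")]
df = spark.createDataFrame(data, ["city", "src", "dst"])

for city in ["Houston", "Omaha"]:
    edges = df.filter(df["city"] == city).drop("city")
    vertices = (edges.select(edges["src"].alias("id"))
                .union(edges.select(edges["dst"].alias("id")))
                .distinct())
    g = GraphFrame(vertices, edges)
    g.connectedComponents().select("id", "component").orderBy("component").show()

Each loop iteration launches an independent GraphFrames job over that city's edges only, so the edges of one city never feed the computation for another.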

Related

Kusto: Apply function on multiple column values during bag_unpack

Given a dynamic field, say milestones, with a value like {"ta": 1655859586546, "tb": 1655859586646}, how do I print a table with columns like ta, tb, etc., whose single row contains unixtime_milliseconds_todatetime(tolong(taValue)), unixtime_milliseconds_todatetime(tolong(tbValue)), and so on?
I figured I'll need to write a function that I can call, so I created this:

let f = view(a:string) {
    unixtime_milliseconds_todatetime(tolong(a))
};

I can use this function with a normal column, as in project f(columnName). However, in this case it's a dynamic field, and the number of items in the list is large, so I don't want to enter the fields manually. This is what I have so far:
log_table
| take 1
| evaluate bag_unpack(milestones, "m_") // This gives me fields as columns
// | project-keep m_* // This would work, if I just wanted the value, however, I want `view(columnValue)
| project-keep f(m_*) // This of course doesn't work, but explains the idea.
This solution is based on the mv-apply operator:
// Generate data sample. Not part of the solution.
let log_table = materialize(
    range record_id from 1 to 10 step 1
    | mv-apply range(1, 1 + rand(5), 1) on (
        summarize milestones = make_bag(
            pack_dictionary(
                strcat("t", make_string(to_utf8("a")[0] + toint(rand(26)))),
                1600000000000 + rand(60000000000)))));
// Solution Starts here.
log_table
| mv-apply kv = milestones on
(
extend k = tostring(bag_keys(kv)[0])
| extend v = unixtime_milliseconds_todatetime(tolong(kv[k]))
| summarize milestones = make_bag(pack_dictionary(k, v))
)
| evaluate bag_unpack(milestones)
record_id | ta | tb | tc | ... | tz
(wide sample output elided: the keys are generated at random, so each of the 10 records populates a sparse subset of the t* columns; every populated cell is an epoch-milliseconds value converted to a datetime such as 2021-07-06T20:24:47.767Z)
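For intuition, the reshaping that the mv-apply performs, converting each epoch-milliseconds value inside the bag while keeping its key, looks like this in plain Python (the sample record is hypothetical, mirroring the question's milestones field):

from datetime import datetime, timezone

rows = [{"record_id": 1, "milestones": {"ta": 1655859586546, "tb": 1655859586646}}]

for row in rows:
    # Convert every epoch-milliseconds value to a UTC datetime, keeping its key.
    converted = {k: datetime.fromtimestamp(v / 1000, tz=timezone.utc)
                 for k, v in row["milestones"].items()}
    print(row["record_id"], converted)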

Kusto query, comparing array of CIDR ranges to an IP

I am trying to do something in Kusto similar to this post: Filter IPs if they are in list of ranges, but using the IP ranges from a publicly available list to compare against some logs.
Here's what I have tried. I believe the issue is that I don't know how to reference the "network" property of the external data.
I get a "Query could not be parsed" error. Apologies for the formatting; I'm not sure how to make it respect line breaks.
let IP_Data = external_data(network:string,geoname_id:long,continent_code:string,continent_name:string ,country_iso_code:string,country_name:string,is_anonymous_proxy:bool,is_satellite_provider:bool)
['https://raw.githubusercontent.com/datasets/geoip2-ipv4/master/data/geoip2-ipv4.csv'];
let testIP = datatable (myip: string) ['4.28.114.50','4.59.176.50']; //Random IPs in Canada
testIP
| mv-apply tmpIP = IP_Data.network to typeof(string) on (
where ipv4_is_in_range(myip, tmpIP
)
| project-away tmpIP
This answers the OP's question directly; however, there is a better solution for this scenario, based on the ipv4_lookup plugin. See the new answer below.
For both options: since the CSV has a header, I added with (ignoreFirstRecord = true) to the external_data call.
Option 1
testIP is defined as an array (not a single-column table). The base table is IP_Data, but the mv-apply runs on the testIP array, which lets you access values from both IP_Data and testIP.
let IP_Data = external_data(network:string,geoname_id:long,continent_code:string,continent_name:string ,country_iso_code:string,country_name:string,is_anonymous_proxy:bool,is_satellite_provider:bool)
['https://raw.githubusercontent.com/datasets/geoip2-ipv4/master/data/geoip2-ipv4.csv'] with (ignoreFirstRecord = true);
let testIP = dynamic(['4.28.114.50','4.59.176.50']); //Random IPs in Canada
IP_Data
| mv-apply testIP = testIP to typeof(string) on (where ipv4_is_in_range(testIP, network))
network       | geoname_id | continent_code | continent_name | country_iso_code | country_name | is_anonymous_proxy | is_satellite_provider | testIP
--------------+------------+----------------+----------------+------------------+--------------+--------------------+-----------------------+------------
4.28.114.0/24 | 6251999    | NA             | North America  | CA               | Canada       | false              | false                 | 4.28.114.50
4.59.176.0/24 | 6251999    | NA             | North America  | CA               | Canada       | false              | false                 | 4.59.176.50
Option 2
Cross-join both tables (using a dummy column) and then filter the results:
let IP_Data = external_data(network:string,geoname_id:long,continent_code:string,continent_name:string ,country_iso_code:string,country_name:string,is_anonymous_proxy:bool,is_satellite_provider:bool)
['https://raw.githubusercontent.com/datasets/geoip2-ipv4/master/data/geoip2-ipv4.csv'] with (ignoreFirstRecord = true);
let testIP = datatable (myip: string) ['4.28.114.50','4.59.176.50']; //Random IPs in Canada
testIP | extend dummy = 1
| join kind=inner (IP_Data | extend dummy = 1) on dummy
| where ipv4_is_in_range(myip, network)
| project-away dummy*
myip        | network       | geoname_id | continent_code | continent_name | country_iso_code | country_name | is_anonymous_proxy | is_satellite_provider
------------+---------------+------------+----------------+----------------+------------------+--------------+--------------------+----------------------
4.28.114.50 | 4.28.114.0/24 | 6251999    | NA             | North America  | CA               | Canada       | false              | false
4.59.176.50 | 4.59.176.0/24 | 6251999    | NA             | North America  | CA               | Canada       | false              | false
New answer
Demo for 1M IPs, based on the ipv4_lookup plugin
let geoip2_ipv4 = external_data(network:string,geoname_id:long,continent_code:string,continent_name:string ,country_iso_code:string,country_name:string,is_anonymous_proxy:bool,is_satellite_provider:bool)
['https://raw.githubusercontent.com/datasets/geoip2-ipv4/master/data/geoip2-ipv4.csv'] with (ignoreFirstRecord = true)
| extend continent_name = coalesce(continent_name, '--- Missing ---');
let ips = materialize(range i from 1 to 1000000 step 1 | extend ip = format_ipv4(tolong(rand() * pow(2,32))));
ips
| evaluate ipv4_lookup(geoip2_ipv4, ip, network, return_unmatched = true)
| summarize count() by continent_name
continent_name  | count_
----------------+--------
North America   | 399059
Asia            | 201902
Europe          | 173566
South America   | 33795
Oceania         | 13384
Africa          | 17569
--- Missing --- | 226
                | 160499
(The last, unnamed row counts the randomly generated IPs that matched no network; they appear because return_unmatched = true.)
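The membership test that ipv4_is_in_range and ipv4_lookup perform corresponds to this plain-Python check using the standard library's ipaddress module (a sketch for intuition, not how the plugin is implemented):

import ipaddress

networks = [ipaddress.ip_network("4.28.114.0/24"),
            ipaddress.ip_network("4.59.176.0/24")]

for ip in ["4.28.114.50", "4.59.176.50", "8.8.8.8"]:
    addr = ipaddress.ip_address(ip)
    # An IP matches a row when it falls inside the row's CIDR network.
    matches = [str(net) for net in networks if addr in net]
    print(ip, "->", matches if matches else "unmatched")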

Query all primary roads from PostGis Database filled with OSM Data within a boundary

I imported OpenStreetMap data into PostgreSQL (PostGIS) through osm2pgsql, and I would like to get an sf object into R containing all primary roads (geometry) within an area (bbox).
I'm a bit lost, since I would also like to have relations and nodes, and I'm not sure whether a query on planet_osm_roads alone will be sufficient, or how the imported data structure differs from the OSM XML data I normally work with.
I understand this is probably a rather broad question, but I would appreciate a starting point to understand the query language better.
This is my approach, but sadly I get an error:
conn <- RPostgreSQL::dbConnect("PostgreSQL", host = MYHOST,
dbname = "osm_data", user = "postgres", password = MYPASSWORD)
pgPostGIS(conn)
a<-pgGetGeom(conn, c("public", "planet_osm_roads"), geom = "way", gid = "osm_id",
other.cols = FALSE, clauses = "WHERE highway = 'primary' && ST_MakeEnvelope(11.2364353533134, 47.8050651144447, 11.8882527806375, 48.2423300001326)")
a<-st_as_sf(a)
This is the error I get:
Error in postgresqlExecStatement(conn, statement, ...) :
RS-DBI driver: (could not Retrieve the result : ERROR: syntax error at or near "ST_MakeEnvelope"
LINE 2: ...lic"."planet_osm_roads" WHERE "way" IS NOT NULL ST_MakeEnv...
^
)
Error in pgGetGeom(conn, c("public", "planet_osm_roads"), geom = "way", :
No geometries found.
In addition: Warning message:
In postgresqlQuickSQL(conn, statement, ...) :
Could not create execute: SELECT DISTINCT a.geo AS type
FROM (SELECT ST_GeometryType("way") as geo FROM "public"."planet_osm_roads" WHERE "way" IS NOT NULL ST_MakeEnvelope(11.2364353533134, 47.8050651144447, 11.8882527806375, 48.2423300001326)) a;
This is the db:
osm_data=# \d
List of relations
Schema | Name | Type | Owner
----------+--------------------+----------+----------
public | geography_columns | view | postgres
public | geometry_columns | view | postgres
public | planet_osm_line | table | postgres
public | planet_osm_nodes | table | postgres
public | planet_osm_point | table | postgres
public | planet_osm_polygon | table | postgres
public | planet_osm_rels | table | postgres
public | planet_osm_roads | table | postgres
public | planet_osm_ways | table | postgres
public | spatial_ref_sys | table | postgres
topology | layer | table | postgres
topology | topology | table | postgres
topology | topology_id_seq | sequence | postgres
schema_name table_name geom_column geometry_type type
1 public planet_osm_line way LINESTRING GEOMETRY
2 public planet_osm_point way POINT GEOMETRY
3 public planet_osm_polygon way GEOMETRY GEOMETRY
4 public planet_osm_roads way LINESTRING GEOMETRY
Table "public.planet_osm_roads"
Column | Type | Collation | Nullable | Default
--------------------+---------------------------+-----------+----------+---------
osm_id | bigint | | |
access | text | | |
addr:housename | text | | |
addr:housenumber | text | | |
addr:interpolation | text | | |
admin_level | text | | |
aerialway | text | | |
aeroway | text | | |
amenity | text | | |
area | text | | |
barrier | text | | |
bicycle | text | | |
brand | text | | |
bridge | text | | |
boundary | text | | |
building | text | | |
construction | text | | |
Your query looks just fine. Check the following example:
WITH planet_osm_roads (highway,way) AS (
VALUES
('primary','SRID=3857;POINT(1283861.57 6113504.88)'::geometry), --inside your bbox
('secondary','SRID=3857;POINT(1286919.06 6067184.04)'::geometry) --somewhere else ..
)
SELECT highway,ST_AsText(way)
FROM planet_osm_roads
WHERE
highway IN ('primary','secondary','tertiary') AND
planet_osm_roads.way && ST_Transform(
ST_MakeEnvelope(11.2364353533134,47.8050651144447,11.8882527806375,48.2423300001326, 4326),3857
);
highway | st_astext
---------+------------------------------
primary | POINT(1283861.57 6113504.88)
[The original answer includes an image illustrating the BBOX and the points used in the example above.]
Check the documentation for more information on the bbox intersection operator &&.
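If it helps to see the bbox test outside the database, here is a rough Python illustration using shapely (the shapely package is an assumption here; PostGIS evaluates && natively on geometry bounding boxes):

from shapely.geometry import Point, box

# Same envelope as in the query, in lon/lat (EPSG:4326).
bbox = box(11.2364353533134, 47.8050651144447, 11.8882527806375, 48.2423300001326)

print(bbox.intersects(Point(11.5, 48.0)))  # True: inside the bbox
print(bbox.intersects(Point(11.5, 46.0)))  # False: south of the bbox

Keep in mind that && compares bounding boxes only; for points that is equivalent to a true intersection test, but for lines and polygons it is a cheap pre-filter.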
However, there are a few things to consider.
In case you're creating the BBOX yourself in order to have an area for ST_Contains, consider simply using ST_DWithin. It will basically do the same, but you only have to provide a reference point and the distance.
Depending on the distribution of highway types in the table planet_osm_roads and considering that your queries will always look for either primary,secondary or tertiary highways, creating a partial index could significantly improve performance. As the documentation says:
A partial index is an index built over a subset of a table; the subset
is defined by a conditional expression (called the predicate of the
partial index). The index contains entries only for those table rows
that satisfy the predicate. Partial indexes are a specialized feature,
but there are several situations in which they are useful.
So try something like this:
CREATE INDEX idx_planet_osm_roads_way ON planet_osm_roads USING gist(way)
WHERE highway IN ('primary','secondary','tertiary');
The highway column also needs to be indexed, so try this:

CREATE INDEX idx_planet_osm_roads_highway ON planet_osm_roads (highway);

Or use another partial index, in case you can't delete the other data but don't need it for anything:
CREATE INDEX idx_planet_osm_roads_highway ON planet_osm_roads (highway)
WHERE highway IN ('primary','secondary','tertiary');
You can always identify bottlenecks and check if the query planer is using your index with EXPLAIN.
Further reading
Getting all Buildings in range of 5 miles from specified coordinates
Buffers (Circle) in PostGIS
How can I set data from on table to another according spatial relation geometries in these tables
I figured it out.
This is how you get an sf object out of a PostGIS database filled with OSM data within the bbox 11.2364353533134, 47.8050651144447, 11.8882527806375, 48.2423300001326:
library(sf)
library(RPostgreSQL)
library(tictoc)
pw <- MYPASSWORD
drv <- dbDriver("PostgreSQL")
con <- dbConnect(drv, dbname = "osm_data",
host = "localhost", port = 5432,
user = "postgres", password = pw)
tic()
sf_data = st_read(con, query = "SELECT osm_id, name, highway, way
    FROM planet_osm_roads
    WHERE highway IN ('primary', 'secondary', 'tertiary')
      AND ST_Contains(
            ST_Transform(
              ST_MakeEnvelope(11.2364353533134, 47.8050651144447,
                              11.8882527806375, 48.2423300001326, 4326),
              3857),
            planet_osm_roads.way);")
toc()
RPostgreSQL::dbDisconnect(con)
Note: my original query chained OR and AND without parentheses (WHERE highway = 'primary' OR ... OR highway = 'tertiary' AND ST_Contains(...)), so the bbox was only applied to tertiary roads, which is why I wasn't sure it was being considered. Rewriting the filter as highway IN (...) AND ST_Contains(...) makes the bbox apply to all three highway types.
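For comparison, a hedged Python equivalent of the same query using psycopg2 and geopandas (both packages, and the connection details, are assumptions, not part of the original answer):

import geopandas as gpd
import psycopg2

con = psycopg2.connect(dbname="osm_data", host="localhost",
                       port=5432, user="postgres", password="MYPASSWORD")

sql = """
SELECT osm_id, name, highway, way
FROM planet_osm_roads
WHERE highway IN ('primary', 'secondary', 'tertiary')
  AND ST_Contains(
        ST_Transform(
          ST_MakeEnvelope(11.2364353533134, 47.8050651144447,
                          11.8882527806375, 48.2423300001326, 4326), 3857),
        planet_osm_roads.way)
"""
roads = gpd.read_postgis(sql, con, geom_col="way")  # returns a GeoDataFrame
con.close()
print(roads.head())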

Conditional mutating of the R data frame based on the strings

I am using R and trying to create a new column based on the string information from the existing columns.
My data is like:
risk_code | area
-----------------------------------
DEEP DIGGING ALL | --
CONSTRUCTION PRO | Construction
CLAIMS ONSHORE | --
OFFSHORE CLAIMS | --
And the result I need is:
risk_code | area | area_new
-------------------------------------------------
DEEP DIGGING ALL | -- | Digging
CONSTRUCTION PRO | Construction | Construction
CLAIMS ONSHORE | -- | Onshore
OFFSHORE CLAIMS | -- | Offshore
I understand that I'm making several mistakes in the code, but after a whole week of staring at it and searching the internet, I cannot get the result I need.
I'd appreciate your help. Thanks in advance.
Occupancy <- read_excel("Occupancy.xlsx")
OccupancyMutated <- mutate(Occupancy, area_new = area)
OccupancyMutated <- as.data.frame(OccupancyMutated)
OccupancyMutated$area_new[Occupancy$area == "--"] <-
{
if (OccupancyMutated$risk_code == %Digging%) {"Digging"}
else if (OccupancyMutated$risk_code == %ONSHORE%) {"Onshore"}
else if (OccupancyMutated$risk_code == %OFFSHORE%) {"Offshore"}
else {"empty"}
}
View(OccupancyMutated)
We can use stringr for this operation. The function word extracts the first word of each string in risk_code, and str_to_title converts it to your required format. Both functions are vectorized, so simply:
library(stringr)
str_to_title(word(df$risk_code, 1, 1))
#[1] "Digging" "Construction" "Onshore" "Offshore"
If it is not always the first word and you need to match specific words only, you can do:
str_to_title(str_extract(tolower(df$risk_code), 'digging|offshore|onshore'))
#[1] "Digging" NA "Onshore" "Offshore"
So, this is the answer (thanks to Sotos):

library(readxl)
library(dplyr)
library(stringr)

Occupancy <- read_excel("Occupancy.xlsx")
OccupancyMutated <- mutate(Occupancy, area_new = area)
OccupancyMutated <- as.data.frame(OccupancyMutated)
OccupancyMutated$area_new[Occupancy$area == "--"] <-
  str_to_title(str_extract(tolower(Occupancy$risk_code[Occupancy$area == "--"]),
                           'digging|offshore|onshore'))
View(OccupancyMutated)

(Two details matter here: the pattern must be lowercase, since the input is passed through tolower(), and the right-hand side must be subset to the same "--" rows being replaced so the lengths match.)
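For intuition, here is the same conditional mapping sketched in Python (data inlined from the question; re.search plays the role of str_extract):

import re

risk_codes = ["DEEP DIGGING ALL", "CONSTRUCTION PRO", "CLAIMS ONSHORE", "OFFSHORE CLAIMS"]
areas = ["--", "Construction", "--", "--"]

def new_area(code, area):
    # Keep the existing area unless it is the placeholder "--".
    if area != "--":
        return area
    m = re.search(r"digging|offshore|onshore", code.lower())
    return m.group(0).title() if m else "empty"

print([new_area(c, a) for c, a in zip(risk_codes, areas)])
# ['Digging', 'Construction', 'Onshore', 'Offshore']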

AdvancedDatagrid Iterating Through Each Row of the Open Leaf/Tree

I need to get the data for each row in an AdvancedDataGrid where the nodes are open.
For example, my ADG looks like this:
+ Science
- Math
- Passed
John Doe | A+ | Section C
Amy Rourke | B- | Section B
- Failed
Jane Doe | F | Section D
Mike Cones | F | Section D
- English
+ Passed
+ Failed
- History
+ Passed
- Failed
Lori Pea | F | Section C
I tried using the following code to get the open nodes:
var o:Object = new Object();
o = IHierarchicalCollectionView(myADG.dataProvider).openNodes;
But inspecting the object with the following code:
Alert.show(ObjectUtil.toString(o), 'object inspection');
Gives me:
(Object)#0
Math (2)
children = (mx.collections::ArrayCollection)#2
filterFunction = (null)
length = 2
list = (mx.collections::ArrayList)#3
length = 2
source = (Array)#4
[0] (Object)#5
children = (mx.collections::ArrayCollection)#6
filterFunction = (null)
length = 2
list = (mx.collections::ArrayList)#7
length = 2
source = (Array)#8
[0] <Table>
<Name>John Doe</Name>
<Grade>A+</Grade>
<Section>Section C</Section>
</Table>
[1] <Table>
<Name>Amy Rourke</Name>
<Grade>B-</Grade>
<Section>Section B</Section>
....
...
..
Basically, I just need to create an object, array, or XMLList that would give me:
Math | Passed | John Doe | A+ | Section C
Math | Passed | Amy Rourke | B- | Section B
Math | Failed | Jane Doe | F | Section D
Math | Failed | Mike Cones | F | Section D
History | Failed | Lori Pea | F | Section C
Any suggestion would be highly appreciated. Thanks
You should be able to iterate across the openNodes object's properties; for each one, grab the collection, concat its values onto a new array, and use that array as the source of another collection type if necessary. Something like this:
var newArray:Array = [];
for (var property:String in o)
{
    newArray = newArray.concat(o[property][0].source); // Passed; property is the subject, e.g. Math
    newArray = newArray.concat(o[property][1].source); // Failed; property is the subject, e.g. Math
}
The only real problem is that you also want to keep the subject (Math) and the Passed/Failed label in the output; otherwise the above should work. To get that part working, break each of the statements above into its own loop that iterates across the source arrays and puts the right values into a new value object of your own design that carries the subject and the pass/fail status alongside the row data. Note that I'm assuming the data is always organized this way: each subject holds two arrays, the first for Passed rows and the second for Failed rows.
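To make the flattening concrete, here is a language-neutral sketch in Python of the walk described above (the nested structure mimics the openNodes dump; field names are taken from the question):

open_nodes = {
    "Math": [
        [{"Name": "John Doe", "Grade": "A+", "Section": "Section C"},
         {"Name": "Amy Rourke", "Grade": "B-", "Section": "Section B"}],
        [{"Name": "Jane Doe", "Grade": "F", "Section": "Section D"},
         {"Name": "Mike Cones", "Grade": "F", "Section": "Section D"}],
    ],
    "History": [
        [],
        [{"Name": "Lori Pea", "Grade": "F", "Section": "Section C"}],
    ],
}

# Walk subject -> [passed, failed] -> rows, emitting one flat record per row.
for subject, (passed, failed) in open_nodes.items():
    for status, rows in (("Passed", passed), ("Failed", failed)):
        for row in rows:
            print(" | ".join([subject, status, row["Name"], row["Grade"], row["Section"]]))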
