neo4j mean of property for all friends - graph

Having a graph like:
CREATE (Alice:Person {id:'a', fraud:1})
CREATE (Bob:Person {id:'b', fraud:0})
CREATE (Charlie:Person {id:'c', fraud:0})
CREATE (David:Person {id:'d', fraud:0})
CREATE (Esther:Person {id:'e', fraud:0})
CREATE (Fanny:Person {id:'f', fraud:0})
CREATE (Gabby:Person {id:'g', fraud:0})
CREATE (Fraudster:Person {id:'h', fraud:1})
CREATE
(Alice)-[:CALL]->(Bob),
(Bob)-[:SMS]->(Charlie),
(Charlie)-[:SMS]->(Bob),
(Fanny)-[:SMS]->(Charlie),
(Esther)-[:SMS]->(Fanny),
(Esther)-[:CALL]->(David),
(David)-[:CALL]->(Alice),
(David)-[:SMS]->(Esther),
(Alice)-[:CALL]->(Esther),
(Alice)-[:CALL]->(Fanny),
(Fanny)-[:CALL]->(Fraudster)
When trying to query like:
MATCH (a)-->(b)
WHERE b.fraud = 1
RETURN (count() / ( MATCH (a) -->(b) RETURN count() ) * 100)
I want to compute the fraudulence of a user, which (as fraud is either 0 or 1) is defined as the mean of the fraud levels of all connected nodes:
MATCH ()--(f)
RETURN f.id, f.fraud, avg(f.fraud), COUNT(*), COLLECT(f) AS fs
returns the correct number of friends, but cannot access them, i.e. the COLLECT statement only accesses the node itself:
╒══════╤═════════╤══════════════╤══════════╤══════════════════════════════════════════════════════════════════════╕
│"f.id"│"f.fraud"│"avg(f.fraud)"│"COUNT(*)"│"fs" │
╞══════╪═════════╪══════════════╪══════════╪══════════════════════════════════════════════════════════════════════╡
│"h" │1 │1 │1 │[{"fraud":1,"id":"h"}] │
├──────┼─────────┼──────────────┼──────────┼──────────────────────────────────────────────────────────────────────┤
│"f" │0 │0 │4 │[{"fraud":0,"id":"f"},{"fraud":0,"id":"f"},{"fraud":0,"id":"f"},{"frau│
│ │ │ │ │d":0,"id":"f"}] │
....
I.e. naively calculating the average
MATCH ()--(f)
RETURN f.id, avg(f.fraud)
will only consider the single node itself and not its network. How can I instead consider the social network of a node (up to a defined depth, here 1), to improve on the original answer to neo4j percentage of attribute for social network?
edit
MATCH p = ()--()
UNWIND nodes(p) AS f
RETURN f.id, f.fraud, COUNT(*), COLLECT({id: f.id, fraud: f.fraud}) AS fs
will return only duplicates of the original node in the list and not the connected nodes:
│"f.id"│"f.fraud"│"COUNT(*)"│"fs" │
╞══════╪═════════╪══════════╪══════════════════════════════════════════════════════════════════════╡
│"h" │1 │2 │[{"id":"h","fraud":1},{"id":"h","fraud":1}] │
├──────┼─────────┼──────────┼──────────────────────────────────────────────────────────────────────┤
│"f" │0 │8 │[{"id":"f","fraud":0},{"id":"f","fraud":0},{"id":"f","fraud":0},{"id":│
│ │ │ │"f","fraud":0},{"id":"f","fraud":0},{"id":"f","fraud":0},{"id":"f","fr│
│ │ │ │aud":0},{"id":"f","fraud":0}] │
edit 2
MATCH p = (source)--(destination)
RETURN source.id, source.fraud, COUNT(*), COLLECT({id: destination.id, fraud: destination.fraud}) AS neighbors
is already pretty close, but lacks the avg function.

MATCH p = (source)-[*..3]-(destination)
RETURN source.id, source.fraud, COUNT(*), avg(destination.fraud), COLLECT({id: destination.id, fraud: destination.fraud}) AS neighbors
includes the fraudulence, defined as the average over the neighborhood (here up to depth 3).

Related

How to change the country name and then draw the global map?

I got the dataset like following:
I want to make a world map and see which countries have a higher mean salary, perhaps represented through density or something else (higher density meaning a higher mean salary). I tried to do that with VegaLite, but I always got an error.
Then I realized this data has country names like this:
RU means Russia, NZ means New Zealand… Is there any way I can convert these into the complete country names? And where did I go wrong in this map code?
Can someone help me with that, please? Thanks for any help :)
I just want to say thank you to all the people who offered me suggestions!
I have successfully changed the country names, but I don't know how to make a map for each country and show which countries have a higher mean value. Can someone give me some advice, please?
These abbreviations resemble ISO 3166 alpha-2 codes. The Julia package Countries.jl is great for converting such ISO codes to country names:
julia> using Countries, DataFrames
julia> ds = DataFrame(company_location=["RU", "US", "NZ"], Mean=[1580000, 142000, 122000])
3×2 DataFrame
 Row │ company_location  Mean
     │ String            Int64
─────┼───────────────────────────
   1 │ RU                1580000
   2 │ US                 142000
   3 │ NZ                 122000
julia> ds.country_names = [x.name for x in get_country.(ds.company_location)];
julia> ds
3×3 DataFrame
 Row │ company_location  Mean     country_names
     │ String            Int64    String
─────┼───────────────────────────────────────────────
   1 │ RU                1580000  Russian Federation
   2 │ US                 142000  United States
   3 │ NZ                 122000  New Zealand
Alternatively, you could make your own dictionary that maps codes to full names. This could be useful if the abbreviations are non-standard, or if you're working with something else in general:
julia> using DataFrames
julia> ds = DataFrame(company_location=["RU", "US", "NZ"], Mean=[1580000, 142000, 122000])
3×2 DataFrame
 Row │ company_location  Mean
     │ String            Int64
─────┼───────────────────────────
   1 │ RU                1580000
   2 │ US                 142000
   3 │ NZ                 122000
julia> country_codes = Dict("RU" => "Russia", "US" => "United States", "NZ" => "New Zealand");
julia> ds.country_names = getindex.(Ref(country_codes), ds.company_location);
julia> ds
3×3 DataFrame
 Row │ company_location  Mean     country_names
     │ String            Int64    String
─────┼──────────────────────────────────────────
   1 │ RU                1580000  Russia
   2 │ US                 142000  United States
   3 │ NZ                 122000  New Zealand
(I couldn't find an obvious way to pass multiple indices to a dictionary, but this post on the Julia Discourse shows a working method, which I've used in my example)
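The dictionary idea carries over to any language; here is a minimal Python sketch (the mapping is illustrative, not complete), with a fallback that leaves unknown codes unchanged instead of raising an error:

```python
# Illustrative code-to-name mapping; extend with the codes in your data.
country_codes = {"RU": "Russia", "US": "United States", "NZ": "New Zealand"}

def to_country_name(code):
    """Return the full country name, or the code itself if it is unmapped."""
    return country_codes.get(code, code)

names = [to_country_name(c) for c in ["RU", "US", "NZ", "XX"]]
```

The `dict.get` fallback is handy when a dataset contains non-standard codes you have not mapped yet.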

Change identifier by looping through folders -- R

I have a loop related question. I have the following folder structure (excerpt):
├───Y2017
│ UDB_cSK17D.csv
│ UDB_cSK17H.csv
│ UDB_cSK17P.csv
│ UDB_cSK17R.csv
│ UDB_cUK17D.csv
│ UDB_cUK17H.csv
│ UDB_cUK17P.csv
│ UDB_cUK17R.csv
└───Y2018
│ UDB_cSK18D.csv
│ UDB_cSK18H.csv
│ UDB_cSK18P.csv
│ UDB_cSK18R.csv
│ UDB_cUK18D.csv
│ UDB_cUK18H.csv
│ UDB_cUK18P.csv
│ UDB_cUK18R.csv
All the files have the same structure. I would like to loop through them and extract data from a select number of columns. The file names also all have the same structure. All files have:
unique country identifier (e.g. UK, SK in the examples above)
unique database type (D, H, P... - last character in file name)
I would like to construct a loop that iterates through the file names. For one country this would work like this:
library(data.table)
library(dplyr)  # for bind_rows()
ldf <- list()
country_id <- "UK(.*)"
db_id <- "P.csv$"
listcsv <- dir(pattern = paste0(country_id, db_id), recursive = TRUE, full.names = TRUE)
for (k in 1:length(listcsv)) {
  ldf[[k]] <- fread(listcsv[k], select = c("PB010", "PB020"))
}
uk_data <- bind_rows(as.data.frame(do.call(rbind, ldf[])))
This code extracts all the columns I need based on the country identifier I give it (UK in this example). As I have numerous countries in my data set, I would like code that iterates through and updates the country identifier. I have tried the following:
ldf_new <- list()
countries <- c("SK", "UK")
for (i in 1:length(countries)) {
  currcty1 <- countries[i]
  listcsv <- dir(pattern = paste0(currcty1, "(.*)", db_id), recursive = TRUE, full.names = TRUE)
  # print(listcsv)
  ldf_new <- fread(listcsv[i], select = c("PB010", "PB020"))
}
What happens here is that I only get the results of the last iteration in the variable ldf_new (i.e. UK in this case). Is there any way I could get the results for both SK and UK?
Many thanks in advance!
Changing the loop so that every matched file is appended to the list as a new element should do the trick:
ldf_new <- list()
countries <- c("SK", "UK")
for (i in 1:length(countries)) {
  currcty1 <- countries[i]
  listcsv <- dir(pattern = paste0(currcty1, "(.*)", db_id), recursive = TRUE, full.names = TRUE)
  for (k in 1:length(listcsv)) {
    ldf_new[[length(ldf_new) + 1]] <- fread(listcsv[k], select = c("PB010", "PB020"))
  }
}
Note that fread(listcsv[i], ...) in the original attempt only reads the i-th matching file; the inner loop above reads all of them, and appending with [[ keeps each table as a single list element (whereas c() would splice a data.table's columns into the list).
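For comparison, the same gather-every-matching-file-per-country pattern can be sketched in Python; the glob pattern and the PB010/PB020 column names mirror the file layout described in the question:

```python
import csv
import glob
import os

def gather_country_rows(root, countries, db_id="P"):
    """Collect the PB010/PB020 columns from every matching CSV under root."""
    rows = []
    for cty in countries:
        # Matches e.g. Y2017/UDB_cUK17P.csv anywhere below root.
        pattern = os.path.join(root, "**", "*%s*%s.csv" % (cty, db_id))
        for path in sorted(glob.glob(pattern, recursive=True)):
            with open(path, newline="") as fh:
                for rec in csv.DictReader(fh):
                    rows.append({"PB010": rec["PB010"], "PB020": rec["PB020"]})
    return rows
```

Appending to one accumulator across both loops is the same fix as in the R answer: the per-country results are collected rather than overwritten.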

Kusto query could not be parsed

I have SecurityLog with fields like DstIP_s and want to display records matching my trojanDst table
let trojanDst = datatable (DstIP_s:string)
[ "1.1.1.1","2.2.2.2","3.3.3.3"
];
SecurityLog |
| join trojanDst on DstIP_s
I am getting a "query could not be parsed" error.
The query you posted has a redundant pipe (|) before the join.
From an efficiency standpoint, make sure the left side of the join is the smaller one, as suggested here: https://learn.microsoft.com/en-us/azure/kusto/query/best-practices#join-operator
As #Yoni L pointed out, the problem is the doubled pipe operator.
For anyone with a SQL background, join may be a bit counterintuitive (by default it is kind=innerunique):
JOIN operator:

kind unspecified, kind=innerunique:
Only one row from the left side is matched for each value of the on key. The output contains a row for each match of this row with rows from the right.

kind=inner:
There's a row in the output for every combination of matching rows from left and right.
let t1 = datatable(key:long, value:string)
[
1, "a",
1, "b"
];
let t2 = datatable(key:long, value:string)
[
1, "c",
1, "d"
];
t1| join t2 on key;
Output:
┌─────┬───────┬──────┬────────┐
│ key │ value │ key1 │ value1 │
├─────┼───────┼──────┼────────┤
│ 1 │ a │ 1 │ c │
│ 1 │ a │ 1 │ d │
└─────┴───────┴──────┴────────┘
Demo
SQL style JOIN version:
let t1 = datatable(key:long, value:string)
[
1, "a",
1, "b"
];
let t2 = datatable(key:long, value:string)
[
1, "c",
1, "d"
];
t1| join kind=inner t2 on key;
Output:
┌─────┬───────┬──────┬────────┐
│ key │ value │ key1 │ value1 │
├─────┼───────┼──────┼────────┤
│ 1 │ b │ 1 │ c │
│ 1 │ a │ 1 │ c │
│ 1 │ b │ 1 │ d │
│ 1 │ a │ 1 │ d │
└─────┴───────┴──────┴────────┘
Demo
There are many join types in KQL, such as innerunique, inner, leftouter, rightouter, fullouter, anti, and more; here you can find the full list.
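For intuition, the difference between the two kinds can be simulated in a few lines of Python (a rough sketch of the matching behaviour only, not of full KQL semantics):

```python
def kusto_join(left, right, key, kind="innerunique"):
    """Sketch of a KQL-style join on lists of dicts.

    innerunique (the default) keeps only the first left row per key value;
    inner keeps every combination of matching left/right rows."""
    if kind == "innerunique":
        seen, deduped = set(), []
        for row in left:
            if row[key] not in seen:
                seen.add(row[key])
                deduped.append(row)
        left = deduped
    # Pair every remaining left row with every matching right row.
    return [(l, r) for l in left for r in right if l[key] == r[key]]

t1 = [{"key": 1, "value": "a"}, {"key": 1, "value": "b"}]
t2 = [{"key": 1, "value": "c"}, {"key": 1, "value": "d"}]
```

`kusto_join(t1, t2, "key")` yields two rows (only "a" from the left side), while `kind="inner"` yields all four combinations, matching the two output tables above.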

Neo4j: How time-consuming is EVERY branch between node A and F?

The following graph is given:
   -> B --> E -
  /            \
A -             -> F
  \            /
   -> C --> D -
All nodes are of type task. As properties, they have a start time and an end time (both are of the data type DateTime).
All relationships are CONNECT_TO and are directed to the right. The relationships have no properties.
Can somebody help me with how the following query should look in Cypher?
How time-consuming is EVERY branch between node A and F?
A list as result would be fine:
Path         Duration [minutes]
--------------------------------
A->B->E->F   100
A->C->D->F    50
Thanks for your help.
Creating your graph
The first statement creates the nodes, the second the relationships between them.
CREATE
(TaskA:Task {name: 'TaskA', time:10}),
(TaskB:Task {name: 'TaskB', time:20}),
(TaskC:Task {name: 'TaskC', time:30}),
(TaskD:Task {name: 'TaskD', time:10}),
(TaskE:Task {name: 'TaskE', time:40}),
(TaskF:Task {name: 'TaskF', time:10})
CREATE
(TaskA)-[:CONNECT_TO]->(TaskB),
(TaskB)-[:CONNECT_TO]->(TaskE),
(TaskE)-[:CONNECT_TO]->(TaskF),
(TaskA)-[:CONNECT_TO]->(TaskC),
(TaskC)-[:CONNECT_TO]->(TaskD),
(TaskD)-[:CONNECT_TO]->(TaskF);
Your desired solution:
1. Define your start node (Task A)
2. Find paths of variable length
3. Define your end node (Task F)
4. Retrieve all task nodes for each path
5. Sum the duration of all tasks in each path
6. Bonus: the number of tasks per path
Neo4j Statement:
// |----------- 1 -----------| |----- 2 ----| |----------- 3 -----------|
MATCH path = (taskA:Task {name: 'TaskA'})-[:CONNECT_TO*]->(taskF:Task {name: 'TaskF'})
UNWIND
// |-- 4 -|
nodes(path) AS task
// |---- 5 -----| |--- 6 ----|
RETURN path, sum(task.time) AS timeConsumed, length(path)+1 AS taskAmount;
Result
╒══════════════════════════════════════════════════════════════════════╤════════════════╤════════════╕
│"path" │ "timeConsumed" │"taskAmount"│
╞══════════════════════════════════════════════════════════════════════╪════════════════╪════════════╡
│[{"name":"TaskA","time":10},{},{"name":"TaskB","time":20},{"name":"Tas│80 │4 │
│kB","time":20},{},{"name":"TaskE","time":40},{"name":"TaskE","time":40│ │ │
│},{},{"name":"TaskF","time":10}] │ │ │
├──────────────────────────────────────────────────────────────────────┼────────────────┼────────────┤
│[{"name":"TaskA","time":10},{},{"name":"TaskC","time":30},{"name":"Tas│60 │4 │
│kC","time":30},{},{"name":"TaskD","time":10},{"name":"TaskD","time":10│ │ │
│},{},{"name":"TaskF","time":10}] │ │ │
└──────────────────────────────────────────────────────────────────────┴────────────────┴────────────┘
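The per-path sums can be double-checked outside Neo4j with a small depth-first search; the task times and directed edges are copied from the CREATE statements above:

```python
# Task times and CONNECT_TO edges copied from the CREATE statements above.
time_of = {"A": 10, "B": 20, "C": 30, "D": 10, "E": 40, "F": 10}
edges = {"A": ["B", "C"], "B": ["E"], "C": ["D"], "D": ["F"], "E": ["F"], "F": []}

def all_paths(start, end, path=None):
    """Enumerate every directed path from start to end."""
    path = (path or []) + [start]
    if start == end:
        return [path]
    found = []
    for nxt in edges[start]:
        found.extend(all_paths(nxt, end, path))
    return found

# Sum the task times along each path, keyed by the printed path.
sums = {"->".join(p): sum(time_of[n] for n in p) for p in all_paths("A", "F")}
```

This reproduces timeConsumed = 80 for A->B->E->F and 60 for A->C->D->F, as in the result table.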

Julia DataFrame columns starting with number?

This may be a stupid question, but for the life of me I can't figure out how to get Julia to read a csv file with column names that start with numbers and use them in DataFrames. How does one do this?
For example, say I have the file "test.csv" which contains the following:
,1Y,2Y,3Y
1Y,11,12,13
2Y,21,22,23
If I just use readtable(), I get this:
julia> using DataFrames
julia> df = readtable("test.csv")
2x4 DataFrames.DataFrame
| Row | x    | x1Y | x2Y | x3Y |
|-----|------|-----|-----|-----|
| 1   | "1Y" | 11  | 12  | 13  |
| 2   | "2Y" | 21  | 22  | 23  |
What gives? How can I get the column names to be what they're supposed to be, "1Y, "2Y, etc.?
The problem is that in DataFrames, column names are symbols, which aren't meant to start with a number (see the comment below).
You can see this by doing e.g. typeof(:2), which will return Int64 rather than (as you might expect) Symbol. Thus, to get your column names into a usable format, DataFrames has to prefix them with a letter: typeof(:x2) will return Symbol, and so x2Y is a valid column name.
Unfortunately, you can't use numbers for starting names in DataFrames.
The code that does the parsing of names makes sure that this restriction stays like this.
I believe this is because of how parsing takes place in julia: :aa names a symbol, while :2aa is a value (makes more sense considering 1:2aa is a range)
You could just use rename!() after the import:
df = csv"""
,1Y,2Y,3Y
1Y,11,12,13
2Y,21,22,23
"""
rename!(df, Dict(:x1Y =>Symbol("1Y"), :x2Y=>Symbol("2Y"), :x3Y=>Symbol("3Y") ))
2×4 DataFrames.DataFrame
│ Row │ x    │ 1Y │ 2Y │ 3Y │
├─────┼──────┼────┼────┼────┤
│ 1   │ "1Y" │ 11 │ 12 │ 13 │
│ 2   │ "2Y" │ 21 │ 22 │ 23 │
Still, you may experience problems later in your code; it's better to avoid column names starting with numbers...