Can I use this CSV to load a neo4j graph with cypher?

I am a medical doctor trying to model a drugs-to-enzymes database, and am starting with a CSV file that I use to load my data into the Gephi graph layout program. I understand the power of a graph DB but am illiterate with Cypher.
The current CSV has the following format:
source;target;arc_type; <- this is a header needed for Gephi import
artemisinin;2B6;induces;
...
amiodarone;1A2;represses;
...
3A457;carbamazepine;metabolizes;
These sample records show the three types of relationships. Drugs can repress or augment a cytochrome, and cytochromes metabolize drugs.
Is there a way to use this CSV as is to load into neo4j and create the graph?
Thank you very much.

In neo4j terminology, a relationship must have exactly one type, and a node can have any number of labels. It looks like your use case could benefit from labelling your nodes with either Drug or Cytochrome.
Here is a possible neo4j data model for your use case:
(:Drug)-[:MODULATES {induces: false}]->(:Cytochrome)
(:Cytochrome)-[:METABOLIZES]->(:Drug)
The induces property has a boolean value indicating whether a drug induces (true) or represses (false) the related cytochrome.
The following is a (somewhat complex) query that generates the above data model from your CSV file:
USING PERIODIC COMMIT 500
LOAD CSV WITH HEADERS FROM 'file:///Drugs.csv' AS line FIELDTERMINATOR ';'
WITH line,
     CASE line.arc_type
       WHEN 'metabolizes' THEN {a: [1]}
       WHEN 'induces' THEN {b: [true]}
       ELSE {b: [false]}
     END AS todo
FOREACH (ignored IN todo.a |
  MERGE (c:Cytochrome {id: line.source})
  MERGE (d:Drug {id: line.target})
  MERGE (c)-[:METABOLIZES]->(d)
)
FOREACH (induces IN todo.b |
  MERGE (d:Drug {id: line.source})
  MERGE (c:Cytochrome {id: line.target})
  MERGE (d)-[:MODULATES {induces: induces}]->(c)
)
The FOREACH clause does nothing if the value after the IN is null.
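Once the load finishes, a quick sanity check against the model above (a hedged example; it reuses the id property from the queries and the 2B6 cytochrome from the sample data) could be:
MATCH (d:Drug)-[m:MODULATES]->(:Cytochrome {id: '2B6'})
WHERE m.induces
RETURN d.id AS inducing_drug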

Yes, it's possible, but you will need to install APOC: a library of useful stored procedures for Neo4j. You can find it here: https://neo4j-contrib.github.io/neo4j-apoc-procedures/
Then you should put your CSV file into the import folder of Neo4j and run these queries:
The first one creates a unique constraint on :Node(name):
CREATE CONSTRAINT ON (n:Node) ASSERT n.name IS UNIQUE;
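The ON ... ASSERT form is the legacy (Neo4j 3.x/4.x) syntax; if you are on Neo4j 5.x (an assumption about your server version), the equivalent constraint would be written roughly as:
CREATE CONSTRAINT node_name_unique IF NOT EXISTS
FOR (n:Node) REQUIRE n.name IS UNIQUE;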
And then run this query to import your data:
USING PERIODIC COMMIT 500
LOAD CSV WITH HEADERS FROM 'file:///my-csv-file.csv' AS line FIELDTERMINATOR ';'
MERGE (n:Node {name:line.source})
MERGE (m:Node {name:line.target})
CALL apoc.create.relationship(n, line.arc_type, {}, m) YIELD rel
RETURN count(rel)
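Assuming the load completes, a quick check of the result (a hedged example) could be:
MATCH (n:Node)-[r]->(m:Node)
RETURN n.name, type(r), m.name
LIMIT 10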

Related

Fail to access files in ADLS Gen 2 with ADX External table "Virtual columns"

I have a simple folder tree in Azure Data Lake Gen 2 that is partitioned by date with the following standard folder structure: {yyyy}/{MM}/{dd}. e.g. /Container/folder1/sub_folder/2020/11/01
In each leaf folder, I have some CSV files with few columns but without a timestamp (as the date is already embedded in the folder name).
I am trying to create an ADX external table that will include a virtual column of the date, and then query the data in ADX by date (this is a well-known pattern in Hive and Big data in general).
.create-or-alter external table TableName (col1:double, col2:double, col3:double, col4:double)
kind=adl
partition by (Date:datetime)
pathformat = ("/date=" datetime_pattern("year={yyyy}/month={MM}/day={dd}", Date))
dataformat=csv
(
h@'abfss://container@datalake_name.dfs.core.windows.net/folder1/subfolder/;{key}'
)
with (includeHeaders = 'All')
Unfortunately, querying the table fails, and show artifacts returns an empty list.
external_table("Table Name")
| take 10
.show external table Walmart_2141_OEE artifacts
with the following exception:
Query execution has resulted in error (0x80070057): Partial query failure: The parameter is incorrect. (message: 'path2
Parameter name: Argument 'path2' failed to satisfy condition 'Can't append a full path': at Concat in C:\source\Src\Common\Kusto.Cloud.Platform\Utils\UriPath.cs: line 25:
I tried to follow many types of pathformats and datetime_pattern as described in the documentation but nothing worked.
Any ideas?
According to your description, the following definition should work:
.create-or-alter external table TableName (col1:double, col2:double, col3:double, col4:double)
kind=adl
partition by (Date:datetime)
pathformat = (datetime_pattern("yyyy/MM/dd", Date))
dataformat=csv
(
h@'abfss://container@datalake_name.dfs.core.windows.net/folder1/subfolder;{key}'
)
with (includeHeaders = 'All')
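If the corrected definition is accepted, the partition column can then be used directly in queries (a hedged example, reusing the table name from the definition above):
external_table("TableName")
| where Date >= datetime(2020-11-01) and Date < datetime(2020-11-02)
| take 10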

Neo4J and Cypher query

I am new to Neo4j and Cypher queries. My model is: each shop has 2 chillers, each chiller has 2 PLCs, and each PLC has 2 sensors.
The create statements are as below:
Create(:SHOP{name:"Shop1"})-[:hasChiller]->(:CHILLER{name:"Chiller1"})
Create(:SHOP{name:"Shop1"})-[:hasChiller]->(:CHILLER{name:"Chiller2"})
Create(:SHOP{name:"Shop2"})-[:hasChiller]->(:CHILLER{name:"Chiller3"})
Create(:SHOP{name:"Shop2"})-[:hasChiller]->(:CHILLER{name:"Chiller4"})
Create(:CHILLER{name:"Chiller1"})-[:hasPLC]->(:PLC{name:"Plc1"})
Create(:CHILLER{name:"Chiller1"})-[:hasPLC]->(:PLC{name:"Plc2"})
Create(:CHILLER{name:"Chiller2"})-[:hasPLC]->(:PLC{name:"Plc3"})
Create(:CHILLER{name:"Chiller2"})-[:hasPLC]->(:PLC{name:"Plc4"})
Create(:CHILLER{name:"Chiller3"})-[:hasPLC]->(:PLC{name:"Plc5"})
Create(:CHILLER{name:"Chiller3"})-[:hasPLC]->(:PLC{name:"Plc6"})
Create(:CHILLER{name:"Chiller4"})-[:hasPLC]->(:PLC{name:"Plc7"})
Create(:CHILLER{name:"Chiller4"})-[:hasPLC]->(:PLC{name:"Plc8"})
Create(:PLC{name:"Plc1"})-[:hasSensor]->(:SENSOR{name:"Sensor1"})
Create(:PLC{name:"Plc1"})-[:hasSensor]->(:SENSOR{name:"Sensor2"})
Create(:PLC{name:"Plc2"})-[:hasSensor]->(:SENSOR{name:"Sensor3"})
Create(:PLC{name:"Plc2"})-[:hasSensor]->(:SENSOR{name:"Sensor4"})
Create(:PLC{name:"Plc3"})-[:hasSensor]->(:SENSOR{name:"Sensor5"})
Create(:PLC{name:"Plc3"})-[:hasSensor]->(:SENSOR{name:"Sensor6"})
Create(:PLC{name:"Plc4"})-[:hasSensor]->(:SENSOR{name:"Sensor7"})
Create(:PLC{name:"Plc4"})-[:hasSensor]->(:SENSOR{name:"Sensor8"})
Create(:PLC{name:"Plc5"})-[:hasSensor]->(:SENSOR{name:"Sensor9"})
Create(:PLC{name:"Plc5"})-[:hasSensor]->(:SENSOR{name:"Sensor10"})
Create(:PLC{name:"Plc6"})-[:hasSensor]->(:SENSOR{name:"Sensor11"})
Create(:PLC{name:"Plc6"})-[:hasSensor]->(:SENSOR{name:"Sensor12"})
Create(:PLC{name:"Plc7"})-[:hasSensor]->(:SENSOR{name:"Sensor13"})
Create(:PLC{name:"Plc7"})-[:hasSensor]->(:SENSOR{name:"Sensor14"})
Create(:PLC{name:"Plc8"})-[:hasSensor]->(:SENSOR{name:"Sensor15"})
Create(:PLC{name:"Plc8"})-[:hasSensor]->(:SENSOR{name:"Sensor16"})
However, the MATCH to get the sensors under Shop1
MATCH(s:SHOP{name:"Shop1"})-[:hasChiller]->(cc:CHILLER)-[:hasPLC]->(pp:PLC)-[:hasSensor]->(ss:SENSOR) return ss.name
returns nothing. It says no changes and no data.
I am trying this out in the Neo4j sandbox environment. I did this based on the understanding I had of the MATCH clause in SQL Server Graph 2019, where this works.
Can anyone point out where I am going wrong?
You are improperly creating multiple instances of the "same" node. You should create each node once, and then use its bound variable name later on when you need to create relationships involving that node.
Delete all your data and follow this pattern instead (you have to fill in the "..." parts):
CREATE
(sh1:SHOP{name:"Shop1"}), (sh2:SHOP{name:"Shop2"}),
(c1:CHILLER{name:"Chiller1"}), (c2:CHILLER{name:"Chiller2"}),(c3:CHILLER{name:"Chiller3"}), (c4:CHILLER{name:"Chiller4"}),
(p1:PLC{name:"Plc1"}), ..., (p8:PLC{name:"Plc8"}),
(se1:SENSOR{name:"Sensor1"}), ..., (se16:SENSOR{name:"Sensor16"}),
(sh1)-[:hasChiller]->(c1), (sh1)-[:hasChiller]->(c2),
... // create remaining relationships using bound variable names for nodes
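As a side note (a sketch, not part of the answer above): if you prefer to keep separate statements, MERGE matches an existing node or relationship instead of creating a duplicate, so a pattern like the following also avoids the problem, one block per relationship:
MERGE (sh:SHOP {name:"Shop1"})
MERGE (c:CHILLER {name:"Chiller1"})
MERGE (sh)-[:hasChiller]->(c)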

Assign query using 'match()' to subgraph

I have a JanusGraph database with a graph structure as follows:
(Paper)<-[AuthorOf]-(Author)
I want to use Gremlin's match clause to query the data and assign the results to a subgraph. This is what I have so far:
g.V().match(
__.as('a').has('Paper','paperTitle', 'The name of my paper'),
__.as('a').inE('AuthorOf').outV().as('b')).
select('b').values()
This query returns what I want: the authors of the paper for which I'm searching. However, I want to assign the results to a subgraph so I can export it using:
sg.io(IoCore.graphml()).writeGraph("/home/ubuntu/myresults.graphml")
Previously, I've achieved this with a different query structure like this:
sg = g.V().has('paperTitle', 'The name of my paper').
inE('AuthorOf').subgraph('sg1').
outV().
cap('sg1').
next()
Is there a way to achieve the same results using the match() statement?
After a little trial and error I was able to create a working solution:
sg = g.V().match(
__.as('a').has('Paper','paperTitle', 'ladle pouring guide'),
__.as('a').inE('AuthorOf').subgraph('sg').outV().as('b')).
cap('sg').next()
At first, I was trying to use the select statement to isolate the subgraph. After reviewing the documentation on subgraph and learning more about side-effects in Gremlin, I realized it wasn't necessary.
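With sg bound to the captured subgraph, the GraphML export step from the question should then work unchanged:
sg.io(IoCore.graphml()).writeGraph("/home/ubuntu/myresults.graphml")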

How to find a common variable in a large number of databases using Stata

So I have a large number of databases (82) in Stata, that each contain around 1300 variables and several thousand observations. Some of these databases contain variables that give the mean or standard deviation of certain concepts. For example, a variable in such a dataset could be called "leverage_mean". Now, I want to know which datasets contain variables called concept_mean or concept_sd, without having to go through every dataset by hand.
I was thinking that maybe there is a way to loop through the databases looking for variables containing "mean" or "sd"; unfortunately, I have no idea how to do this. I'm using R and Stata datafiles.
Yes, you can do this with a loop in Stata as well as in R. First, you should check out the Stata command ds and the package findname, which will do many of the things described here and much more. But to show you what is happening "under the hood", I'll show the Stata code that achieves this below:
/*Set your current directory to the location of your databases*/
cd "[your cd here]"
Save the names of the 82 databases to a list called "filelist" using Stata's dir extended macro function. NOTE: you don't specify what kind of file your database files are, so I'm assuming .xls. This command saves all files with extension ".xls" into the list. What type of file you save into the list and how you import your database will depend on what type of files you are reading in.
local filelist : dir . files "*.xls"
Then loop over all files to show which ones contain variables that end with "_sd" or "_mean".
foreach file of local filelist {
/*import the data*/
import excel "`file'", firstrow clear case(lower)
/*produce a list of the variables that end with "_sd" and "_mean"*/
cap quietly describe *_sd *_mean, varlist
if length("`r(varlist)'") > 0 {
/*If the database contains variables of interest, display the database file name and variables on screen*/
display "Database `file' contains variables: `r(varlist)'"
}
}
A final note: this loop will only display the database name and the variables of interest it contains. If you want to perform actions on the data, or do anything else, those actions need to go where the final display command is (which you may or may not ultimately need).
You can use filelist (from SSC) to create a dataset of files. To install filelist, type in Stata's Command window:
ssc install filelist
With a list of datasets in memory, you can then loop over each file and use describe to get a list of variables for each file. You can store this list of variables in a single string variable. For example, the following will collect the names of all Stata datasets shipped with Stata and then store, for each one, the variables it contains:
findfile "auto.dta"
local base_dir = subinstr("`r(fn)'", "/a/auto.dta", "", 1)
dis "`base_dir'"
filelist, dir("`base_dir'") pattern("*.dta")
gen variables = ""
local nmatch = _N
qui forvalues i = 1/`nmatch' {
local f = dirname[`i'] + "/" + filename[`i']
describe using "`f'", varlist
replace variables = " `r(varlist)' " in `i'
}
leftalign // also from SSC, to install: ssc install leftalign
Once you have all this information in the data in memory, you can easily search for specific variables. For example:
. list filename if strpos(variables, " rep78 ")
+-----------+
| filename |
|-----------|
13. | auto.dta |
14. | auto2.dta |
+-----------+
The lookfor_all package (SSC) is there for that purpose:
cd "pathtodirectory"
lookfor_all leverage_mean
Just make sure the file extensions are in lowercase (.dta) and not uppercase.

RNeo4j cypher - retrieving paths

I'm trying to extract a sub-graph from a global network (sub-networks of specific nodes to a specific depth).
The network is composed of nodes labeled as Account with a property of iban and relationships of TRANSFER_TO_AGG.
The Cypher syntax is as follows:
MATCH (a:Account { iban :'FR7618206004274157697300156' }),(b:Account),
p = allShortestPaths((a)-[:TRANSFER_TO_AGG*..3]-(b))
RETURN p limit 250
This works perfectly on the Neo4J web interface. However, when trying to save the results to an R object using the command cypher I get the following error:
"Error in as.data.frame.list(value, row.names = rlabs) :
supplied 92 row names for 1 rows"
I believe this is due to the fact that if returning data, you can only query for tabular results. That is, this method has no current functionality for Cypher results containing array properties, collections, nodes, or relationships.
Can anyone offer a solution?
I've recently added functionality for returning pathways as R objects. First, uninstall / reinstall RNeo4j. Then, see:
?getSinglePath
?getPaths
?shortestPath
?allShortestPaths
?nodes
?rels
?startNode
?endNode
For your query specifically, you would use getPaths():
library(RNeo4j)
graph = startGraph("http://localhost:7474/db/data/")
query = "
MATCH (a:Account { iban :'FR7618206004274157697300156' }),(b:Account),
p = allShortestPaths((a)-[:TRANSFER_TO_AGG*..3]-(b))
RETURN p limit 250
"
p = getPaths(graph, query)
p is a list of path objects. See the docs for examples of using the apply family of functions with a list of path objects.
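For example, to pull a property off every node along each returned path (a hedged sketch; it assumes each node object exposes its properties by name, e.g. n$iban):
# iban of every node along each path (property access by name is an assumption)
lapply(p, function(path) sapply(nodes(path), function(n) n$iban))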
