Titan Graph Queries taking too long to execute - graph

I have a problem with the executing speed of Titan queries.
To be more specific:
I created a property file for my graph using BerkeleyJe which is looking like this:
storage.backend=berkeleyje
storage.directory=/finalGraph_script/graph
Afterwards, i opened the Gremlin.bat to open my Graph.
I set up all the neccessary Index Keys for my nodes:
m = g.getManagementSystem();
username = m.makePropertyKey('username').dataType(String.class).make()
m.buildIndex('byUsername',Vertex.class).addKey(username).unique().buildCompositeIndex()
m.commit()
g.commit()
(all other keys are created the same way...)
I imported a csv file containing about 100 000 lines, each line is producing at least 2 nodes and some edges. All this is done via Batchloading.
That works without a Problem.
Then i execute a groupBy query which is looking like that:
m = g.V.has("imageLink").groupBy{it.imageLink}{it.in("is_on_image").out("is_species")}{it._().species.groupCount().cap.next()}.cap.next()
With this query i want for every node with the property key "imageLink" the number of the different "species". "Species" are also nodes, and can be called by going back the edge "is_on_image" and following the edge "is_species".
Well this is also working like a charm, for my recent nodes. This query is taking about 2 minutes on my local PC.
But now to the problem.
My whole dataset is a csv with 10 million entries. The structure is the same as above, and each line is also creating at least 2 nodes and some edges.
With my local PC i cant even import this set, causing an Memory Exception after 3 days of loading.
So I tried the same on a server with much more RAM and memory. There the Import works, and takes about 1 day. But the groupBy failes after about 3 days.
I actually dont know if the groupBy itself fails, or just the Connection to the Server after that long time.
So my first Question:
In my opinion about 15 million nodes shouldn't be that big deal for a graph database, should it?
Second Question:
Is it normal that it takes so long? Or is there anyway to speed it up using indices? I configured the indices as listet above :(
I don't know which exact information you need for helping me, but please just tell me what you need in addition to that.
Thanks a lot!
Best regards,
Ricardo
EDIT 1: The way im loading the CSV in the Graph:
I'm using this code, i deleted some unneccassry properties, which are also set an property for some nodes, loaded the same way.
bg = new BatchGraph(g, VertexIDType.STRING, 10000)
new File("annotation_nodes_wNothing.csv").eachLine({ final String line ->def (annotationId,species,username,imageLink) = line.split('\t')*.trim();def userVertex = bg.getVertex(username) ?: bg.addVertex(username);def imageVertex = bg.getVertex(imageLink) ?: bg.addVertex(imageLink);def speciesVertex = bg.getVertex(species) ?: bg.addVertex(species);def annotationVertex = bg.getVertex(annotationId) ?: bg.addVertex(annotationId);userVertex.setProperty("username",username);imageVertex.setProperty("imageLink", imageLink);speciesVertex.setProperty("species",species);annotationVertex.setProperty("annotationId", annotationId);def classifies = bg.addEdge(null, userVertex, annotationVertex, "classifies");def is_on_image = bg.addEdge(null, annotationVertex, imageVertex, "is_on_image");def is_species = bg.addEdge(null, annotationVertex, speciesVertex, "is_species");})
bg.commit()
g.commit()

Related

AzureML Dataset.File.from_files creation extremely slow even with 4 files

I have a few thousand of video files in my BlobStorage, which I set it as a datastore.
This blob storage receives new files every night and I need to split the data and register each split as a new version of AzureML Dataset.
This is how I do the data split, simply getting the blob paths and splitting them.
container_client = ContainerClient.from_connection_string(AZ_CONN_STR,'keymoments-clips')
blobs = container_client.list_blobs('soccer')
blobs = map(lambda x: Path(x['name']), blobs)
train_set, test_set = get_train_test(blobs, 0.75, 3, class_subset={'goal', 'hitWoodwork', 'penalty', 'redCard', 'contentiousRefereeDecision'})
valid_set, test_set = split_data(test_set, 0.5, 3)
train_set, test_set, valid_set are just nx2 numpy arrays containing blob storage path and class.
Here is when I try to create a new version of my Dataset:
datastore = Datastore.get(workspace, 'clips_datastore')
dataset_train = Dataset.File.from_files([(datastore, b) for b, _ in train_set[:4]], validate=True, partition_format='**/{class_label}/*.mp4')
dataset_train.register(workspace, 'train_video_clips', create_new_version=True)
How is it possible that the Dataset creation seems to hang for an indefinite time even with only 4 paths?
I saw in the doc that providing a list of Tuple[datastore, path] is perfectly fine. Do you know why?
Thanks
Do you have your Azure Machine Learning Workspace and your Azure Storage Account in different Azure Regions? If that's true, latency may be a contributing factor with validate=True.
Another possibility may be slowness in the way datastore paths are resolved. This is an area where improvements are being worked on.
As an experiment, could you try creating the dataset using a url instead of datastore? Let us know if that makes a difference to performance, and whether it can unblock your current issue in the short term.
Something like this:
dataset_train = Dataset.File.from_files(path="https://bloburl/**/*.mp4?accesstoken", validate=True, partition_format='**/{class_label}/*.mp4')
dataset_train.register(workspace, 'train_video_clips', create_new_version=True)
I'd be interested to see what happens if you run the dataset creation code twice in the same notebook/script. Is it faster the second time? I ask because it might be an issue with the .NET core runtime startup (which would only happen on the first time you run the code)
EDIT 9/16/20
While it doesn't seem to make sense that .NET core invoked when not data is moving, is suspect it is the validate=True part of the param that requires that all the data be inspected (which can computationally expensive). I'd be interested to see what happens if that param is False

Neo4J and Cypher query

I am new to Neo4j and Cypher query.My create query is like each Shop has 2 chillers which has 2 PLCs each which in turn has 2 sensors each.
The create is as below
Create(:SHOP{name:"Shop1"})-[:hasChiller]->(:CHILLER{name:"Chiller1"})
Create(:SHOP{name:"Shop1"})-[:hasChiller]->(:CHILLER{name:"Chiller2"})
Create(:SHOP{name:"Shop2"})-[:hasChiller]->(:CHILLER{name:"Chiller3"})
Create(:SHOP{name:"Shop2"})-[:hasChiller]->(:CHILLER{name:"Chiller4"})
Create(:CHILLER{name:"Chiller1"})-[:hasPLC]->(:PLC{name:"Plc1"})
Create(:CHILLER{name:"Chiller1"})-[:hasPLC]->(:PLC{name:"Plc2"})
Create(:CHILLER{name:"Chiller2"})-[:hasPLC]->(:PLC{name:"Plc3"})
Create(:CHILLER{name:"Chiller2"})-[:hasPLC]->(:PLC{name:"Plc4"})
Create(:CHILLER{name:"Chiller3"})-[:hasPLC]->(:PLC{name:"Plc5"})
Create(:CHILLER{name:"Chiller3"})-[:hasPLC]->(:PLC{name:"Plc6"})
Create(:CHILLER{name:"Chiller4"})-[:hasPLC]->(:PLC{name:"Plc7"})
Create(:CHILLER{name:"Chiller4"})-[:hasPLC]->(:PLC{name:"Plc8"})
Create(:PLC{name:"Plc1"})-[:hasSensor]->(:SENSOR{name:"Sensor1"})
Create(:PLC{name:"Plc1"})-[:hasSensor]->(:SENSOR{name:"Sensor2"})
Create(:PLC{name:"Plc2"})-[:hasSensor]->(:SENSOR{name:"Sensor3"})
Create(:PLC{name:"Plc2"})-[:hasSensor]->(:SENSOR{name:"Sensor4"})
Create(:PLC{name:"Plc3"})-[:hasSensor]->(:SENSOR{name:"Sensor5"})
Create(:PLC{name:"Plc3"})-[:hasSensor]->(:SENSOR{name:"Sensor6"})
Create(:PLC{name:"Plc4"})-[:hasSensor]->(:SENSOR{name:"Sensor7"})
Create(:PLC{name:"Plc4"})-[:hasSensor]->(:SENSOR{name:"Sensor8"})
Create(:PLC{name:"Plc5"})-[:hasSensor]->(:SENSOR{name:"Sensor9"})
Create(:PLC{name:"Plc5"})-[:hasSensor]->(:SENSOR{name:"Sensor10"})
Create(:PLC{name:"Plc6"})-[:hasSensor]->(:SENSOR{name:"Sensor11"})
Create(:PLC{name:"Plc6"})-[:hasSensor]->(:SENSOR{name:"Sensor12"})
Create(:PLC{name:"Plc7"})-[:hasSensor]->(:SENSOR{name:"Sensor13"})
Create(:PLC{name:"Plc7"})-[:hasSensor]->(:SENSOR{name:"Sensor14"})
Create(:PLC{name:"Plc8"})-[:hasSensor]->(:SENSOR{name:"Sensor15"})
Create(:PLC{name:"Plc8"})-[:hasSensor]->(:SENSOR{name:"Sensor16"})
However the Match to get the sensors under SHOP1
MATCH(s:SHOP{name:"Shop1"})-[:hasChiller]->(cc:CHILLER)-[:hasPLC]->(pp:PLC)-[:hasSensor]->(ss:SENSOR) return ss.name
returns nothing.Says no changes and no data.
I am trying this out on Neo4J sandbox environment.I did this based on the understanding i had using match clause in SQL SERVER GRAPH 2019 where this works.
Can anyone point out where i am going wrong?
You are improperly creating multiple instances of the "same" node. You should create each node once, and then use its bound variable name later on when you need to create relationships involving that node.
Delete all your data and follow this pattern instead (you have to fill in the "..." parts):
CREATE
(sh1:SHOP{name:"Shop1"}), (sh2:SHOP{name:"Shop1"}),
(c1:CHILLER{name:"Chiller1"}), (c2:CHILLER{name:"Chiller2"}),(c3:CHILLER{name:"Chiller3"}), (c4:CHILLER{name:"Chiller4"}),
(p1:PLC{name:"Plc1"}), ..., (p8:PLC{name:"Plc8"}),
(se1:SENSOR{name:"Sensor1"}), ..., (se16:SENSOR{name:"Sensor16"}),
(sh1)-[:hasChiller]->(c1), (sh1)-[:hasChiller]->(c2),
... // create remaining relationships using bound variable names for nodes

saveWorkbook Execution time

I'm sorry if this was already answered but I couldn't find.
I'm using the XLConnect package to add new entries to a spreadsheet, but the execution time of saveWorkbook is increasing and delaying all other tasks that depend on the updated spreadsheet.
The work flow is the following:
Query a SQL db for new entries (Load the result using read.table);
Load out-of-date spreadsheet and save each sheets as a entry of a
list;
Add entries to appropriate sheets/list element;
Color lines, using setCellStyel, according to a series of
parameters (example in code bellow);
saveWorkbook
cs_completo=getOrCreateCellStyle(wb, name = "Cs_Completo")
setFillPattern(cs_completo, fill = XLC$FILL.SOLID_FOREGROUND)
setFillForegroundColor(cs_completo, color = XLC$COLOR.LIGHT_GREEN)
for(status in c("Conferido","Impresso","Entregue","Envelopado")){
if(sum(grepl(status,dados$NomeStatusExame))>0){
index=which(grepl(status,dados$NomeStatusExame))+1
lapply(1:length(desired_tabs),function(x) setCellStyle(wb, sheet = sheet, row=index, col=x,cellstyle = cs_completo))}
}
}
Steps 1 through 4 are complete under 3 three minutes (some sheets have as much as 2000 lines).
Step 5 takes at least 30 minutes!
Is there a way to speed up the saveWorkbook writing process?
I don't know why but saving the workbook to a new file take much less time (under a minute) than overwrite the existing one!

BigQuery Timeout Errors in R Loop Using bigrquery

I am running a query in a loop for each store in a dataframe. Typically there are 70 or so stores so the loop repeats that many times for each complete loop.
Maybe 75% of the time this loop works all the way through with no errors.
About 25% of the time I get the following error during any one of the loop iterations:
Error in curl::curl_fetch_memory(url, handle = handle) :
Timeout was reached
Then I have to figure out which iteration bombed, and repeat the loop excluding iterations that completed successfully.
I can't find anything on the web to help me understand what is causing this seemingly random error. Perhaps it is a BQ technical issue? There does not seem to be any relation to the size of the result set it crashes on.
Here is the part of my code that does the loop...again it works all the way through most of the time. The cartesian product across IDs is intentional, as I want every combination of each Test ID with all possible Control IDs within store.
sql<-"SELECT pstore as store, max(pretrips) as pretrips FROM analytics.campaign_ids
group by 1 order by 1"
store_maxtrips<-query_exec(sql,project=project, max_pages = 1)
store_maxtrips
for (i in 1:length(store_maxtrips$store)) {
#pull back all ids shopping in same primary store as each test ID with their pre metrics
sql<-paste("SELECT a.pstore as pstore, a.id as test_id,
b.id as ctl_id,
(abs(a.zpbsales-b.zpbsales)*",wt_pb_sales,")+(abs(a.zcatsales-b.zcatsales)*",wt_cat_sales,")+
(abs(a.zsales-b.zsales)*",wt_retail_sales,")+(abs(a.ztrips-b.ztrips)*",wt_retail_trips,") as zscore
FROM analytics.campaign_ids a inner join analytics.pre_zscores b
on a.pstore=b.pstore
where a.id<>b.id and a.pstore=",store_maxtrips$store[i]," order by a.pstore, a.id, zscore")
print(paste("processing store",store_maxtrips$store[i]))
query_exec(sql,project=project,destination_table = "analytics.campaign_matches",
write_disposition = "WRITE_APPEND", max_pages = 1)
}
Solved!
It turns out I was using query_exec, but I should have been using insert_query_job since I do not want to retrieve any results. The errors were all happening in the course of R trying to retrieve results from BigQuery which I didn't want anyhow.
By using insert_query_job + wait_for(job) in my loop instead of the query_exec command, it eliminated all issues with the loop finishing.
I did also need to add a try() function to help circumvent some rare errors that still popped up with this approach. Thanks to MarkeD for this tip. So my final solution looked like this:
try(job<-insert_query_job(sql,project=project,destination_table = "analytics.campaign_matches",
write_disposition = "WRITE_APPEND"))
wait_for(job)
Thanks to everyone who commented and helped me research the issue.

Issue while ingesting a Titan graph into Faunus

I have installed both Titan and Faunus and each seems to be working properly (titan-0.4.4 & faunus-0.4.4)
However, after ingesting a sizable graph in Titan and trying to import it in Faunus via
FaunusFactory.open( )
I am experiencing issues. To be more precise, I do seem to get a faunus graph from the call FaunusFactory.open( ),
faunusgraph[titanhbaseinputformat->titanhbaseoutputformat]
but then, even asking a simple
g.v(10)
I do get this error:
Task Id : attempt_201407181049_0009_m_000000_0, Status : FAILED
com.thinkaurelius.titan.core.TitanException: Exception in Titan
at com.thinkaurelius.titan.diskstorage.hbase.HBaseStoreManager.getAdminInterface(HBaseStoreManager.java:380)
at com.thinkaurelius.titan.diskstorage.hbase.HBaseStoreManager.ensureColumnFamilyExists(HBaseStoreManager.java:275)
at com.thinkaurelius.titan.diskstorage.hbase.HBaseStoreManager.openDatabase(HBaseStoreManager.java:228)
My property file is taken straight out of the Faunus page with Titan-HBase input, except of course changing the url of the hadoop cluster:
faunus.graph.input.format=com.thinkaurelius.faunus.formats.titan.hbase.TitanHBaseInputFormat
faunus.graph.input.titan.storage.backend=hbase
faunus.graph.input.titan.storage.hostname= my IP
faunus.graph.input.titan.storage.port=2181
faunus.graph.input.titan.storage.tablename=titan
faunus.graph.output.format=com.thinkaurelius.faunus.formats.titan.hbase.TitanHBaseOutputFormat
faunus.graph.output.titan.storage.backend=hbase
faunus.graph.output.titan.storage.hostname= IP of my host
faunus.graph.output.titan.storage.port=2181
faunus.graph.output.titan.storage.tablename=titan
faunus.graph.output.titan.storage.batch-loading=true
faunus.output.location=output1
zookeeper.znode.parent=/hbase-unsecure
titan.graph.output.ids.block-size=100000
Anyone can help?
ADDENDUM:
To address the comment below, here is some context: as I have mentioned, I have a graph in Titan and can perform basic gremlin queries on it.
However, I do need to run a gremlin global query which, due to the size of the graph, needs Faunus and its underlying MR capabilities. Hence the need to import it. The error I get doesn't look to me as if it points to some inconsistency in the graph itself.
I'm not sure that you have your "flow" of Faunus right. If your end result is to do a global query of the graph, then consider this approach:
pull your graph to sequence file
issue your global query over the sequence file
More specifically create hbase-seq.properties:
# input graph parameters
faunus.graph.input.format=com.thinkaurelius.faunus.formats.titan.hbase.TitanHBaseInputFormat
faunus.graph.input.titan.storage.backend=hbase
faunus.graph.input.titan.storage.hostname=localhost
faunus.graph.input.titan.storage.port=2181
faunus.graph.input.titan.storage.tablename=titan
# hbase.mapreduce.scan.cachedrows=1000
# output data (graph or statistic) parameters
faunus.graph.output.format=org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat
faunus.sideeffect.output.format=org.apache.hadoop.mapreduce.lib.output.TextOutputFormat
faunus.output.location=snapshot
faunus.output.location.overwrite=true
In Faunus, copy do:
g = FaunusFactory.open('hbase-seq.properties')
g._()
That will read the graph from hbase and write it to sequence file in HDFS. Next, create: seq-noop.properties with these contents:
# input graph parameters
faunus.graph.input.format=org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat
faunus.input.location=snapshot/job-0
# output data parameters
faunus.graph.output.format=com.thinkaurelius.faunus.formats.noop.NoOpOutputFormat
faunus.sideeffect.output.format=org.apache.hadoop.mapreduce.lib.output.TextOutputFormat
faunus.output.location=analysis
faunus.output.location.overwrite=true
The above configuration will read your sequence file from the previous step and without re-writing the graph (that's what NoOpOutputFormat is for). Now in Faunus do:
g = FaunusFactory.open('seq-noop.properties')
g.V.sideEffect('{it.degree=it.bothE.count()}').degree.groupCount()
This will execute a degree distribution, writing the results in HDFS to the 'analysis' directory. Obviously you can do whatever Faunus-flavored Gremlin you want here - I just wanted to provide an example. I think this is a pretty standard "flow" or pattern for using Faunus from a graph analysis perspective.

Resources