Unable to print node properties in DSE Graph

Apologies, as this might be a very basic question on the topic, but I am new to Gremlin/DSE Graph. I have tried many ways to extract the data I am inserting into my graph, but somehow I am unable to make it work.
Here is what I have:
1. A graph with allow_scans set to true.
2. A schema with property keys and vertex labels defined, and a materialized index on NodeID for all vertex labels.
There are no relationships right now, just vertices with data points.
I wrote a program to insert all my nodes into DSE Graph, which works successfully; I get a response like the one below after the program creates each vertex:
Result({u'id': {u'out_vertex': {u'community_id': 853347840, u'~label': u'vertex', u'member_id': 14}, u'~type': u'Name', u'local_id': u'00000000-0000-8012-0000-000000000000'}, u'value': u'amount', u'label': u'Name'})]
OK, so now that the nodes are inserted, I want to extract them and print their names.
So I did:
g.V().hasLabel('FIELD').has('NodeID','2559b635f077e86c7370ab1c4c798a06').values('Name');
The above "fails successfully" with a blank result: it runs without any error, but there is no output (null in the Gremlin console and 'Success - No Results' in DataStax Studio).
Then I came across documentation saying the graph will not know whether the has step returns only one node or more, so I used next() to iterate, as per the documentation and tutorials:
g.V().hasLabel('FIELD').has('NodeID','2559b635f077e86c7370ab1c4c798a06').next().values('Name');
Even this failed with
org.apache.tinkerpop.gremlin.driver.exception.ResponseException
(DataStax Studio doesn't show more information.) How can I debug this further?
I even came across a lambda approach in which I use map:
g.V().hasLabel('FIELD').has('NodeID','2559b635f077e86c7370ab1c4c798a06').map{it.get().value('Name')};
which responded with 'it' not being defined.
(I even tried valueMap; not sure if it was even required.)
What am I doing wrong when trying to find and print the property values of a node?
Any directions or a query that can help me extract the names and other properties? Even a multi-step query? However, I don't think this should be that complicated.
UPDATE:
As per the answer, I get the following stack trace:
gremlin> :> g.V().hasLabel('FIELD').has('NodeID','2559b635f077e86c7370ab1c4c798a06').valueMap(true).next();
org.apache.tinkerpop.gremlin.groovy.plugin.RemoteException
Type ':help' or ':h' for help.
Display stack trace? [yN]y
org.apache.tinkerpop.gremlin.groovy.plugin.RemoteException
at org.apache.tinkerpop.gremlin.console.groovy.plugin.DriverRemoteAcceptor.submit(DriverRemoteAcceptor.java:170)
at org.apache.tinkerpop.gremlin.console.commands.SubmitCommand.execute(SubmitCommand.groovy:41)
at org.codehaus.groovy.tools.shell.Shell.execute(Shell.groovy:104)
at org.codehaus.groovy.tools.shell.Groovysh.super$2$execute(Groovysh.groovy)
at sun.reflect.GeneratedMethodAccessor15.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.codehaus.groovy.reflection.CachedMethod.invoke(CachedMethod.java:93)
at groovy.lang.MetaMethod.doMethodInvoke(MetaMethod.java:325)
at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1215)
at org.codehaus.groovy.runtime.ScriptBytecodeAdapter.invokeMethodOnSuperN(ScriptBytecodeAdapter.java:132)
at org.codehaus.groovy.tools.shell.Groovysh.executeCommand(Groovysh.groovy:259)
at org.apache.tinkerpop.gremlin.console.GremlinGroovysh.execute(GremlinGroovysh.groovy:84)
at org.codehaus.groovy.tools.shell.Shell.leftShift(Shell.groovy:122)
at org.codehaus.groovy.tools.shell.ShellRunner.work(ShellRunner.groovy:95)
at org.codehaus.groovy.tools.shell.InteractiveShellRunner.super$2$work(InteractiveShellRunner.groovy)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.codehaus.groovy.reflection.CachedMethod.invoke(CachedMethod.java:93)
at groovy.lang.MetaMethod.doMethodInvoke(MetaMethod.java:325)
at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1215)
at org.codehaus.groovy.runtime.ScriptBytecodeAdapter.invokeMethodOnSuperN(ScriptBytecodeAdapter.java:132)
at org.codehaus.groovy.runtime.ScriptBytecodeAdapter.invokeMethodOnSuper0(ScriptBytecodeAdapter.java:152)
at org.codehaus.groovy.tools.shell.InteractiveShellRunner.work(InteractiveShellRunner.groovy:124)
at org.codehaus.groovy.tools.shell.ShellRunner.run(ShellRunner.groovy:59)
at org.codehaus.groovy.tools.shell.InteractiveShellRunner.super$2$run(InteractiveShellRunner.groovy)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.codehaus.groovy.reflection.CachedMethod.invoke(CachedMethod.java:93)
at groovy.lang.MetaMethod.doMethodInvoke(MetaMethod.java:325)
at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1215)
at org.codehaus.groovy.runtime.ScriptBytecodeAdapter.invokeMethodOnSuperN(ScriptBytecodeAdapter.java:132)
at org.codehaus.groovy.runtime.ScriptBytecodeAdapter.invokeMethodOnSuper0(ScriptBytecodeAdapter.java:152)
at org.codehaus.groovy.tools.shell.InteractiveShellRunner.run(InteractiveShellRunner.groovy:83)
at org.codehaus.groovy.vmplugin.v7.IndyInterface.selectMethod(IndyInterface.java:232)
at org.apache.tinkerpop.gremlin.console.Console.<init>(Console.groovy:152)
at org.codehaus.groovy.vmplugin.v7.IndyInterface.selectMethod(IndyInterface.java:232)
at org.apache.tinkerpop.gremlin.console.Console.main(Console.groovy:401)
I am able to do some similar operations in another graph. Is something wrong with the graph?
UPDATE 2
My graph vertices were wrongly defined.
The key to drilling down to this solution is the ~label in the result: it points to vertex when it should be FIELD.
While defining the inserts, the team had put label in quotes when they should have used label without quotes. Hence I was not able to traverse the nodes by their label.
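For anyone hitting the same issue, here is a minimal Gremlin-Groovy sketch of the difference (assuming the vertices were created with graph.addVertex(...); the NodeID value is just the one from the queries above):
// 'label' in quotes is treated as an ordinary property key, so the vertex ends up with the
// default ~label of 'vertex' (as in the Result output above) and hasLabel('FIELD') never finds it
graph.addVertex('label', 'FIELD', 'NodeID', '2559b635f077e86c7370ab1c4c798a06')
// label without quotes is the T.label token, which actually sets the vertex label to 'FIELD'
graph.addVertex(label, 'FIELD', 'NodeID', '2559b635f077e86c7370ab1c4c798a06')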

You need to make sure to iterate the traversal. Most commonly you would use one of:
iterate() - get zero results
next() - get one result
toList() - get many results
I'd guess that NodeID is unique, so try something like this:
g.V().hasLabel('FIELD').has('NodeID','2559b635f077e86c7370ab1c4c798a06').
values('Name').next();
If you're interested in all of the properties on that vertex, try:
g.V().hasLabel('FIELD').has('NodeID','2559b635f077e86c7370ab1c4c798a06').
valueMap(true).next();
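And if NodeID can ever match more than one vertex, the same query with toList() collects every result (a sketch):
g.V().hasLabel('FIELD').has('NodeID','2559b635f077e86c7370ab1c4c798a06').
values('Name').toList();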

Related

How to convert RDD to spark dataframe using sparklyr?

I have a lot of files with text data pushed by Azure IoT to blob storage, spread across many folders, and I want to read them into a Delta Lake table with one row for each line of a file. I used to read them file by file, but it takes too much time, so I want to use Spark to speed up this processing. It needs to integrate into a Databricks workflow written in R.
I've found the spark_read_text function to read text files, but it cannot read a directory recursively; it only works if all the files are in one directory.
Here is an example of a file path (appid/partition/year/month/day/hour/minute/file):
app_id/10/2023/02/06/08/42/gdedir22hccjq
Partition is a seemingly random folder (there are around 30 of them right now) that Azure IoT creates to process data in parallel, so data for the same date can be split across several folders, which does not help reading efficiency.
So the only function I found to do that is spark.textFile, which works with wildcards and handles directories recursively. The only problem is that it returns an RDD, and I can't find a way to transform it into a Spark dataframe, which could ultimately be accessed through a tbl_spark R object.
Here is what I did so far:
You need to set the config to read the folder recursively (here I do this on Databricks in a dedicated Python cell):
%py
sc._jsc.hadoopConfiguration().set("mapreduce.input.fileinputformat.input.dir.recursive", "true")
Then I can create an RDD:
j_rdd <- spark_context(sc) %>%
invoke("textFile", "/mnt/my_cont/app_id/*/2022/11/17/*", 10L)
This works to create the RDD, and as you can see I can match all the partitions (before the year) with a "*", as well as the folders for hours and minutes with the "*" at the end.
I can collect it and create an R dataframe:
lst <- invoke(j_rdd, "collect")
data.frame(row = unlist(lst))
This correctly gets my data: one column of text and one row for each line of each file (I can't show an example for privacy reasons, but it's not important).
The problem is that I don't want to collect; I want to update a Delta table with this data, and I can't find a way to get a sparklyr object that I can use. The j_rdd object I got looks like this:
>j_obj
<jobj[2666]>
org.apache.spark.rdd.MapPartitionsRDD
/mnt/my_cont/app_id/*/2022/11/17/* MapPartitionsRDD[80] at textFile at NativeMethodAccessorImpl.java:0
The closest I got so far: I tried to copy code from here to convert the data to a dataframe using invoke, but I don't seem to be doing it correctly:
contents_field <- invoke_static(sc, "sparklyr.SQLUtils", "createStructField", "contents", "character", TRUE)
schema <- invoke_static(sc, "sparklyr.SQLUtils", "createStructType", list(contents_field))
j_df <- invoke(hive_context(sc), "createDataFrame", j_rdd, schema)
invoke(j_df, "createOrReplaceTempView", "tmp_test")
dfs <- tbl(sc, "tmp_test")
dfs %>% sdf_nrow()
I only have one column with character data in it, so I thought it would work, but I get this error:
Error : org.apache.spark.SparkException: Job aborted due to stage failure: Task 14 in stage 25.0 failed 4 times, most recent failure: Lost task 14.3 in stage 25.0 (TID 15158) (10.221.193.133 executor 2): java.lang.RuntimeException: Error while encoding: java.lang.ClassCastException: java.lang.String cannot be cast to org.apache.spark.sql.Row
if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null else staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true]), 0, contents), StringType, false), true, false, true) AS contents#366
at org.apache.spark.sql.errors.QueryExecutionErrors$.expressionEncodingError(QueryExecutionErrors.scala:1192)
at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$Serializer.apply(ExpressionEncoder.scala:236)
at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$Serializer.apply(ExpressionEncoder.scala:208)
at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.hashAgg_doAggregateWithoutKey_0$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:140)
at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
at org.apache.spark.scheduler.ShuffleMapTask.$anonfun$runTask$3(ShuffleMapTask.scala:81)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.scheduler.ShuffleMapTask.$anonfun$runTask$1(ShuffleMapTask.scala:81)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.doRunTask(Task.scala:156)
at org.apache.spark.scheduler.Task.$anonfun$run$1(Task.scala:125)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.scheduler.Task.run(Task.scala:95)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$13(Executor.scala:832)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1681)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:835)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:690)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
Caused by: java.lang.ClassCastException: java.lang.String cannot be cast to org.apache.spark.sql.Row
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$Serializer.apply(ExpressionEncoder.scala:233)
... 28 more
Does anyone have an idea how to convert this RDD object (using R/sparklyr), which I got back from the invoke function, into something usable without collecting the data?
Finally, I found that spark_read_text can also read multiple files with wildcards, but you have to put a wildcard for each directory level and for the files; it cannot discover folders recursively.
For example:
dfs <- spark_read_text(sc, "/mnt/container/app_id/10/2023/02/06/*")
...doesn't work. But:
dfs <- spark_read_text(sc, "/mnt/container/app_id/10/2023/02/06/*/*/*")
...works. Also:
dfs <- spark_read_text(sc, "/mnt/container/app_id/*/2023/02/06/*/*/*")
...with a wildcard above the date also works.
As the directory depth doesn't change in my case, that's enough for me.

How to load a SparkNLP offline model in Python

I need to use SparkNLP to do lemmatization in Python. I want to use the pretrained pipeline, but I need to do it offline. What is the correct way to do this? I am not able to find any Python example.
I am passing token as the input column for lemmatization and lemma as the output column. Here is my code:
documentAssembler = DocumentAssembler().setInputCol("Transcript").setOutputCol("document")
tokenizer = Tokenizer().setInputCols(['document']).setOutputCol('token')
lemmatizer = LemmatizerModel().load("xx").setInputCols(["token"]).setOutputCol("lemma")
error message:
Py4JJavaError: An error occurred while calling None.com.johnsnowlabs.nlp.annotators.LemmatizerModel.
: java.lang.NoClassDefFoundError: Could not initialize class com.johnsnowlabs.util.ConfigHelper$
at com.johnsnowlabs.nlp.serialization.Feature.<init>(Feature.scala:22)
at com.johnsnowlabs.nlp.serialization.MapFeature.<init>(Feature.scala:145)
at com.johnsnowlabs.nlp.annotators.LemmatizerModel.<init>(LemmatizerModel.scala:49)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:238)
at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
Not sure why you're getting this specific error (how did you initialize Spark?), but that is a separate issue.
To use the lemmatizer offline, you first need to download a pretrained one, for example from here, and unzip it. Then change your code to:
lemmatizer = LemmatizerModel().load("/path/to/unzipped/model").setInputCols...

Selenium findElement(By.cssSelector) for "text" that has no associated tag

Looking at HTML code with this as a web page drop-down list option:
<div class="select5-result-label" id="select5-result-label-31" role="option">
<span class="select2-match"></span>
"Jedi"
</div>
Attempting to use Selenium CSS selectors to select/click the "Jedi" option. Tried:
- driver.findElement(By.id("select5-result-label-31")).click();
- driver.findElement(By.cssSelector("div#select5-result-label-31")).click();
- driver.findElement(By.cssSelector("div[text='Jedi']")).click();
but none of the attempts above has been able to select the drop-down list value. It returns a runtime error message of:
*** Element info: {Using=css selector, value=div[text='Jedi']}
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
I suspect it is because the "Jedi" value has no enclosing tag of its own. Any suggestions for selecting/clicking the option above? Very much appreciated, thanks in advance.

Printing/Fetching Vertex values from a path

Just getting started with Gremlin.
Printing out all the vertex values worked out fine:
gremlin> g.V().values()
==>testing 2
==>Cash Processing
==>Sales
==>Marketing
==>Accounting
I was able to find all the directly connected paths between my vertices:
gremlin> g.V().hasLabel('Process')
.repeat(both().simplePath())
.until(hasLabel('Process'))
.dedup().path()
==>[v[25],v[28]]
==>[v[25],v[26]]
==>[v[26],v[27]]
==>[v[26],v[25]]
Now I am trying to print out the values in the path, like ['Sales', 'Accounting'] instead of [v[25],v[28]].
I have not been able to figure out a way yet.
I have already tried and failed with the following.
unfold(): does not give me a 1-to-1 mapping:
gremlin> g.V().hasLabel('Process').repeat(both().simplePath()).until(hasLabel('Process')).dedup().path().unfold().values()
==>Cash Processing
==>Accounting
==>Cash Processing
==>Sales
==>Sales
==>Marketing
==>Sales
==>Cash Processing
Path seems to be a different data type and does not support the .values() step:
gremlin> g.V().hasLabel('Process')
.repeat(both().simplePath())
.until(hasLabel('Process'))
.dedup().path().values()
org.apache.tinkerpop.gremlin.process.traversal.step.util.ImmutablePath cannot be cast to org.apache.tinkerpop.gremlin.structure.Element
I tried the following Google searches and didn't get the answer:
gremlin print a path
gremlin get values in a path
and a few more word twists.
I found one here that was for Java, but that didn't work for me:
l = []; g.V().....path().fill(l)
(but I can't create the list; I get Cannot set readonly property: list for class: org.apache.tinkerpop.gremlin.structure.VertexProperty$Cardinality)
I am running it in the Gremlin console (via ./gremlin.sh).
You can use the by step to modulate the elements inside the path. For example, by supplying valueMap(true) to by, you get the properties of the vertices together with the vertex labels and their ids:
gremlin> g.V().repeat(both().simplePath()).times(1).dedup().path().by(valueMap(true))
==>[[id:1,name:[marko],label:person,age:[29]],[id:3,name:[lop],lang:[java],label:software]]
==>[[id:1,name:[marko],label:person,age:[29]],[id:2,name:[vadas],label:person,age:[27]]]
==>[[id:1,name:[marko],label:person,age:[29]],[id:4,name:[josh],label:person,age:[32]]]
==>[[id:2,name:[vadas],label:person,age:[27]],[id:1,name:[marko],label:person,age:[29]]]
==>[[id:3,name:[lop],lang:[java],label:software],[id:6,name:[peter],label:person,age:[35]]]
==>[[id:4,name:[josh],label:person,age:[32]],[id:5,name:[ripple],lang:[java],label:software]]
I used the modern graph, which is one of TinkerPop's toy graphs that are often used for such examples. Your output will look a bit different, and you may want to use something other than valueMap(true) for the by modulator. The TinkerPop documentation of the path step itself contains two more advanced examples of path().by() that you might want to check out.
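For example, to keep just one property per vertex in each path, you can modulate by a single key instead. A sketch on the same modern graph (your graph would use whatever property key holds values like 'Sales' or 'Accounting'):
// path().by('name') maps every vertex in the path to its 'name' value, so the first
// result above becomes something like [marko,lop] instead of the full property maps
g.V().repeat(both().simplePath()).times(1).dedup().path().by('name')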

Pattern to parse this string to a DateTime

Hey everyone, I am trying to parse a DateTime out of a string that looks like "20110406080000.000[-4:EDT]", and I am running into problems with the [-4:EDT] part.
DateTimeFormat.forPattern("yyyyMMddHHmmss.SSS[ZZ]").parseDateTime("20110406080000.000[-4:EDT]") results in the following error:
java.lang.IllegalArgumentException: Invalid format: "20110406080000.000[-4:EDT]" is malformed at "-4:EDT]"
at org.joda.time.format.DateTimeFormatter.parseDateTime(DateTimeFormatter.java:673)
at .<init>(<console>:8)
at .<clinit>(<console>)
at RequestResult$.<init>(<console>:9)
at RequestResult$.<clinit>(<console>)
at RequestResult$scala_repl_result(<console>)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at scala.tools.nsc.Interpreter$Request$$anonfun$loadAndRun$1$$anonfun$apply$17.apply(Interpreter.scala:988)
at scala.tools.nsc.Interpreter$...
Any suggestions would be greatly appreciated.
You need to either:
strip the suffix off before parsing, or
write your own DateTimeParser for the end part, using DateTimeFormatterBuilder to combine your parser with a standard parser for the first part.
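A minimal Groovy-style sketch of the first option, assuming the bracketed zone block always sits at the end of the string (note that the zone information is simply discarded here, so the result comes out in the default time zone):
import org.joda.time.format.DateTimeFormat

def raw = "20110406080000.000[-4:EDT]"
// drop the trailing "[-4:EDT]" block, which Joda-Time's pattern letters cannot parse
def stripped = raw.replaceAll(/\[.*\]$/, "")
def dt = DateTimeFormat.forPattern("yyyyMMddHHmmss.SSS").parseDateTime(stripped)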
