Using Spark to add edges with Gremlin

I can't save my edge when I'm using Spark as follows (for reference, the edge can be saved using the Gremlin console):
val graph = DseGraphFrameBuilder.dseGraph("GRAPH_NAME", spark)
graph.V().has("vertex1","field1","value").as("a").V().has("vertex2","field1","value").addE("myEdgeLabel").to("a")
When I try graph.edges.show(), I get an empty table.

The addE() step is not yet implemented in DseGraphFrames; you should use the DGF-specific updateEdges() function instead. The function is designed for bulk updates and takes a Spark DataFrame with the new edges in DGF format:
scala> newEdges.printSchema
root
|-- src: string (nullable = false)
|-- dst: string (nullable = false)
|-- ~label: string (nullable = true)
The src and dst columns are encoded vertex ids. You can either construct them with the g.idColumn() helper function or select them from the vertices.
Usually you know the ids and use the helper function:
scala> val df = Seq((1, 2, "myEdgeLabel")).toDF("v1_id", "v2_id", "~label")
scala> val newEdges=df.select(g.idColumn("vertex2", $"v2_id") as "src", g.idColumn("vertex1", $"v1_id") as "dst", $"~label")
scala> g.updateEdges(newEdges)
For your particular case, you can query the ids first and then insert based on them. Never do this in production: this approach is slow and not a bulk operation. On huge graphs, use method #1:
val dst = g.V.has("vertex1","field1","value").id.first.getString(0)
val src = g.V.has("vertex2","field1","value").id.first.getString(0)
val newEdges = Seq((src, dst, "myEdgeLabel")).toDF("src", "dst", "~label")
g.updateEdges(newEdges)
See documentation: https://docs.datastax.com/en/dse/5.1/dse-dev/datastax_enterprise/graph/graphAnalytics/dseGraphFrameImport.html


Airflow SqlToS3Operator adds an unwanted index at the beginning

A recent airflow-providers-amazon release deprecated MySQLToS3Operator and introduced SqlToS3Operator, which now adds an index column at the beginning of the CSV dump.
For example, if I run the following:
sql_to_s3_task = SqlToS3Operator(
    task_id="sql_to_s3_task",
    sql_conn_id=conn_id_name,
    query="SELECT created_at, score FROM my_table",
    s3_bucket=bucket_name,
    s3_key=key,
    replace=True,
)
The S3 file has something like this:
,created_at,score
1,2023-01-01,5
2,2023-01-02,6
The output seems to be a direct dump from Pandas. How can I remove this unwanted preceding index column?
The operator uses a pandas DataFrame under the hood.
You should use pd_kwargs. It allows you to pass arguments through to the DataFrame's .to_parquet(), .to_json() or .to_csv() call.
Since your output is CSV, the relevant pandas.DataFrame.to_csv parameters are:
header: bool or list of str, default True
Write out the column names. If a list of strings is given it is assumed to be aliases for the column names.
index: bool, default True
Write row names (index).
Thus you can do:
sql_to_s3_task = SqlToS3Operator(
    task_id="sql_to_s3_task",
    sql_conn_id=conn_id_name,
    query="SELECT created_at, score FROM my_table",
    s3_bucket=bucket_name,
    s3_key=key,
    replace=True,
    file_format="csv",
    pd_kwargs={"index": False, "header": False},
)
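For reference, here is a minimal standalone sketch (plain pandas, no Airflow) of what index=False changes in the CSV output; the sample rows are made up for illustration:
import pandas as pd

# Made-up rows standing in for the query result
df = pd.DataFrame({"created_at": ["2023-01-01", "2023-01-02"], "score": [5, 6]})

# Default behaviour: the RangeIndex is written as an unnamed leading column
print(df.to_csv())
# ,created_at,score
# 0,2023-01-01,5
# 1,2023-01-02,6

# With index=False the dump starts directly with the data columns
print(df.to_csv(index=False))
# created_at,score
# 2023-01-01,5
# 2023-01-02,6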

Syntax error: SYN0001 despite it working on the Kusto query editor online

I'm building queries in Python and executing them on my Kusto clusters using the Kusto client's execute_query method.
I've been hit by the following error: azure.kusto.data.exceptions.KustoApiError: Request is invalid and cannot be processed: Syntax error: SYN0001: I could not parse that, sorry. [line:position=0:0].
However, when debugging, I took the query as is and ran it on my clusters through the Kusto query editor on Azure.
The query is similar to the following:
StormEvents
| where ingestion_time() > ago(1h)
| summarize
count_matching_regex_State=countif(State matches regex "[A-Z]*"),
count_not_empty_State=countif(isnotempty(State))
| summarize
Matching_State=sum(count_matching_regex_State),
NotEmpty_State=sum(count_not_empty_State)
| project
ratio_State=todouble(Matching_State) / todouble(Matching_State + NotEmpty_State)
| project
ratio_State=iff(isnan(ratio_State), 0.0, round(ratio_State, 3))
The queries are built in Python using string interpolation, like so:
## modules.py
import logging

def match_regex_query(fields: list, regex_patterns: list, kusto_client):
    def match_regex_statements(field, regex_patterns):
        return " or ".join(list(map(lambda pattern: f"{field} matches regex \"{pattern}\"", regex_patterns)))

    count_regex_statement = list(map(
        lambda field: f"count_matching_regex_{field} = countif({match_regex_statements(field, regex_patterns)}), count_not_empty_{field} = countif(isnotempty({field}))", fields))
    count_regex_statement = ", ".join(count_regex_statement)

    summarize_sum_statement = list(map(lambda field: f"Matching_{field} = sum(count_matching_regex_{field}), NotEmpty_{field} = sum(count_not_empty_{field})", fields))
    summarize_sum_statement = ", ".join(summarize_sum_statement)

    project_ratio_statement = list(map(lambda field: f"ratio_{field} = todouble(Matching_{field})/todouble(Matching_{field}+NotEmpty_{field})", fields))
    project_ratio_statement = ", ".join(project_ratio_statement)

    project_round_statement = list(map(lambda field: f"ratio_{field} = iff(isnan(ratio_{field}),0.0,round(ratio_{field}, 3))", fields))
    project_round_statement = ", ".join(project_round_statement)

    query = f"""
    StormEvents
    | where ingestion_time() > ago(1h)
    | summarize {count_regex_statement}
    | summarize {summarize_sum_statement}
    | project {project_ratio_statement}
    | project {project_round_statement}
    """
    clean_query = query.replace("\n", " ").strip()

    try:
        result = kusto_client.execute_query("Samples", clean_query)
    except Exception as err:
        logging.exception(f"Error while computing regex metric : {err}")
        result = []
    return result
## main.py
from azure.kusto.data import KustoClient, KustoConnectionStringBuilder
from modules import match_regex_query

# provide your Kusto client here
cluster = "https://help.kusto.windows.net"
kcsb = KustoConnectionStringBuilder.with_interactive_login(cluster)
client = KustoClient(kcsb)

fields = ["State"]
regex_patterns = ["[A-Z]*"]
metrics = match_regex_query(fields, regex_patterns, client)
Is there a better way to debug this problem?
TIA!
The query your code generates is invalid, as the regular expressions include characters that aren't properly escaped.
See: the string data type.
This is your invalid query (based on the client request ID you provided in the comments):
LiveStream_CL()
| where ingestion_time() > ago(1h)
| summarize count_matching_regex_deviceHostName_s = countif(deviceHostName_s matches regex "^[a-zA-Z0-9\$]([a-zA-Z0-9\-\_\.\$]{0,61}[a-zA-Z0-9\$])?$"), count_not_empty_deviceHostName_s = countif(isnotempty(deviceHostName_s)), count_matching_regex_sourceHostName_s = countif(sourceHostName_s matches regex "^[a-zA-Z0-9\$]([a-zA-Z0-9\-\_\.\$]{0,61}[a-zA-Z0-9\$])?$"), count_not_empty_sourceHostName_s = countif(isnotempty(sourceHostName_s)), count_matching_regex_destinationHostName_s = countif(destinationHostName_s matches regex "^[a-zA-Z0-9\$]([a-zA-Z0-9\-\_\.\$]{0,61}[a-zA-Z0-9\$])?$"), count_not_empty_destinationHostName_s = countif(isnotempty(destinationHostName_s)), count_matching_regex_agentHostName_s = countif(agentHostName_s matches regex "^[a-zA-Z0-9\$]([a-zA-Z0-9\-\_\.\$]{0,61}[a-zA-Z0-9\$])?$"), count_not_empty_agentHostName_s = countif(isnotempty(agentHostName_s))
...
whereas this is how it should look (note the addition of the @ prefixes, which turn the patterns into verbatim string literals):
LiveStream_CL()
| where ingestion_time() > ago(1h)
| summarize
count_matching_regex_deviceHostName_s = countif(deviceHostName_s matches regex @"^[a-zA-Z0-9\$]([a-zA-Z0-9\-\_\.\$]{0,61}[a-zA-Z0-9\$])?$"),
count_not_empty_deviceHostName_s = countif(isnotempty(deviceHostName_s)),
count_matching_regex_sourceHostName_s = countif(sourceHostName_s matches regex @"^[a-zA-Z0-9\$]([a-zA-Z0-9\-\_\.\$]{0,61}[a-zA-Z0-9\$])?$"),
count_not_empty_sourceHostName_s = countif(isnotempty(sourceHostName_s)),
count_matching_regex_destinationHostName_s = countif(destinationHostName_s matches regex @"^[a-zA-Z0-9\$]([a-zA-Z0-9\-\_\.\$]{0,61}[a-zA-Z0-9\$])?$"),
count_not_empty_destinationHostName_s = countif(isnotempty(destinationHostName_s)),
count_matching_regex_agentHostName_s = countif(agentHostName_s matches regex @"^[a-zA-Z0-9\$]([a-zA-Z0-9\-\_\.\$]{0,61}[a-zA-Z0-9\$])?$"),
count_not_empty_agentHostName_s = countif(isnotempty(agentHostName_s))
...
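One way to apply that fix in the Python builder from the question is to emit each pattern as a verbatim string literal. This is only a sketch of the inner helper (match_regex_statements is the question's own function; the @ prefix is what marks a Kusto verbatim string, so it is the assumption being made here):
def match_regex_statements(field, regex_patterns):
    # Wrap each pattern in a verbatim literal (@"...") so backslashes such as
    # \$ or \. reach Kusto unchanged instead of being parsed as escape sequences.
    return " or ".join(
        f'{field} matches regex @"{pattern}"' for pattern in regex_patterns
    )

# Example:
# match_regex_statements("deviceHostName_s", [r"^[a-zA-Z0-9\$].*$"])
# -> 'deviceHostName_s matches regex @"^[a-zA-Z0-9\$].*$"'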

How to convert a string of mappings to a map in PySpark

I have a CSV file that looks like this (it was saved from PySpark output):
name_value
"[quality1 -> good, quality2 -> OK, quality3 -> bad]"
"[quality1 -> good, quality2 -> excellent]"
How can I use PySpark to read this CSV file and convert the name_value column into a map type?
Something like the below:
data = {}
line = '[quality1 -> good, quality2 -> OK, quality3 -> bad]'
parts = line[1:-1].split(',')
for part in parts:
    k, v = part.split('->')
    data[k.strip()] = v.strip()
print(data)
output
{'quality1': 'good', 'quality2': 'OK', 'quality3': 'bad'}
Using a combination of split and regexp_replace, the string is cut into key-value pairs. In a second step, each key-value pair is transformed first into a struct and then into a map element:
from pyspark.sql import functions as F

df = spark.read.option("header", "true").csv(...)
df1 = df.withColumn("name_value", F.split(F.regexp_replace("name_value", "[\\[\\]]", ""), ",")) \
    .withColumn("name_value", F.map_from_entries(F.expr("""transform(name_value, e -> (regexp_extract(e, '^(.*) ->', 1), regexp_extract(e, '-> (.*)$', 1)))""")))
df1 now has the schema
root
|-- name_value: map (nullable = true)
| |-- key: string
| |-- value: string (valueContainsNull = true)
and contains the same data as the original CSV file.
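For a self-contained check, the same pipeline can be run on inline rows instead of the CSV file. This is a sketch, not part of the original answer; the only tweak is splitting on ", " so the map keys do not keep a leading space:
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Inline copies of the two sample rows from the question
df = spark.createDataFrame(
    [("[quality1 -> good, quality2 -> OK, quality3 -> bad]",),
     ("[quality1 -> good, quality2 -> excellent]",)],
    ["name_value"],
)

df1 = (
    df.withColumn("name_value", F.split(F.regexp_replace("name_value", r"[\[\]]", ""), ", "))
      .withColumn(
          "name_value",
          F.map_from_entries(F.expr(
              "transform(name_value, e -> (regexp_extract(e, '^(.*) ->', 1), regexp_extract(e, '-> (.*)$', 1)))"
          )),
      )
)

df1.printSchema()            # name_value: map<string, string>
df1.show(truncate=False)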

IndexError: list index out of range, scores.append( (fields[0], fields[1]))

I'm trying to read a file and put its contents in a list. I have done this many times before and it has worked, but this time it throws back the error "list index out of range".
The code is:
with open("File.txt") as f:
    scores = []
    for line in f:
        fields = line.split()
        scores.append((fields[0], fields[1]))
    print(scores)
The text file is in the format:
Alpha:[0, 1]
Bravo:[0, 0]
Charlie:[60, 8, 901]
Foxtrot:[0]
I can't see why it is giving me this problem. Is it because I have more than one value for each item, or is it the fact that I have a colon in my text file?
How can I get around this problem?
Thanks
The error occurs because line.split() splits on whitespace, so a line like Foxtrot:[0] yields only one field and fields[1] does not exist. If I understand you well, this code will print the desired result:
import re

with open("File.txt") as f:
    # Make a dictionary for the scores: {name: scores}.
    scores = {}
    # Define regular expressions to parse the team name and the team scores from each line.
    patternScore = r'\[([^\]]+)\]'
    patternName = r'(.*):'
    for line in f:
        # Find the value for the team name and its scores.
        fields = re.search(patternScore, line).groups()[0].split(', ')
        name = re.search(patternName, line).groups()[0]
        # Update the dictionary with the new value.
        scores[name] = fields

# Print the output: first the first score of each entry, then the name.
for key in scores:
    print(scores[key][0] + ':' + key)
You will receive the following output:
60:Charlie
0:Alpha
0:Bravo
0:Foxtrot
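If the lines always follow the name:[numbers] pattern, a shorter alternative to the regex approach is also possible. This is just a sketch that keeps the whole score list per name:
import ast

scores = {}
with open("File.txt") as f:
    for line in f:
        name, _, rest = line.partition(":")
        # ast.literal_eval turns the text "[60, 8, 901]" into an actual list of ints
        scores[name.strip()] = ast.literal_eval(rest.strip())

print(scores)
# {'Alpha': [0, 1], 'Bravo': [0, 0], 'Charlie': [60, 8, 901], 'Foxtrot': [0]}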

Julia dictionary "key not found" only when using a loop

I'm still trying to figure out this problem (I was having trouble building a dictionary, but managed to get that working thanks to rickhg12hs).
Here's my current code:
# open files with codon:amino acid pairs, initiate dictionary:
file = open(readall, "rna_codons.txt")
seq = open(readall, "rosalind_prot.txt")
codons = {"UAA" => "stop", "UGA" => "stop", "UAG" => "stop"}

# generate dictionary entries using pairs from file:
for m in eachmatch(r"([AUGC]{3,3})\s([A-Z])\s", file)
    codon, aa = m.captures
    codons[codon] = aa
end
All of that code seems to work as intended. At this point, I have the dictionary I want, and the right keys point to the right entries. If I just do print(codons["AUG"]), for example, it prints 'M', which is the correct output. Now I want to scan through a string in the second file and, for every 3 letters, pull out the entry referenced in the dictionary and add it to the prot string. So I tried:
for m in eachmatch(r"([AUGC]{3,3})", seq)
    amac = codons[m.captures]
    prot = "$prot$amac"
end
But this kicks out the error key not found: ["AUG"]. I know the key exists, because I can print codons["AUG"] and it returns the proper entry, so why can't it find that key when it's in the loop?
