How to convert a string representation of a mapping into a map type in PySpark

I have a CSV file that looks like this (it was saved from PySpark output):
name_value
"[quality1 -> good, quality2 -> OK, quality3 -> bad]"
"[quality1 -> good, quality2 -> excellent]"
How can I use PySpark to read this CSV file and convert the name_value column into a map type?

Something like the below
data = {}
line = '[quality1 -> good, quality2 -> OK, quality3 -> bad]'
parts = line[1:-1].split(',')
for part in parts:
    k, v = part.split('->')
    data[k.strip()] = v.strip()
print(data)
output
{'quality1': 'good', 'quality2': 'OK', 'quality3': 'bad'}

Using a combination of split and regexp_replace cuts the string into key-value pairs. In a second step, each key-value pair is transformed first into a struct and then into a map entry:
from pyspark.sql import functions as F

df = spark.read.option("header", "true").csv(...)
df1 = df.withColumn("name_value", F.split(F.regexp_replace("name_value", "[\\[\\]]", ""), ",")) \
    .withColumn("name_value", F.map_from_entries(F.expr(
        """transform(name_value, e -> (regexp_extract(e, '^(.*) ->', 1), regexp_extract(e, '-> (.*)$', 1)))""")))
df1 now has the schema
root
|-- name_value: map (nullable = true)
| |-- key: string
| |-- value: string (valueContainsNull = true)
and contains the same data as the original CSV file.
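For reference, a shorter variant of the same idea is sketched below. This is only a sketch, assuming the built-in SQL function str_to_map is available in your Spark version and that its two delimiters are treated as regular expressions:
from pyspark.sql import functions as F

# Sketch only: strip the brackets as above, then let str_to_map split the pairs.
df2 = (
    df.withColumn("name_value", F.regexp_replace("name_value", "[\\[\\]]", ""))
      .withColumn("name_value", F.expr("str_to_map(name_value, ', ', ' -> ')"))
)
# df2.printSchema() should also show name_value as map<string,string>.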

Related

Syntax error: SYN0001 despite it working on the Kusto query editor online

I'm building queries in Python and executing them on my Kusto clusters using the Kusto client's execute_query method.
I've been hit by the following error: azure.kusto.data.exceptions.KustoApiError: Request is invalid and cannot be processed: Syntax error: SYN0001: I could not parse that, sorry. [line:position=0:0].
However, when debugging, I took the query as it is and ran it on my clusters through the Kusto platform on Azure.
The query is similar to the following:
StormEvents
| where ingestion_time() > ago(1h)
| summarize
count_matching_regex_State=countif(State matches regex "[A-Z]*"),
count_not_empty_State=countif(isnotempty(State))
| summarize
Matching_State=sum(count_matching_regex_State),
NotEmpty_State=sum(count_not_empty_State)
| project
ratio_State=todouble(Matching_State) / todouble(Matching_State + NotEmpty_State)
| project
ratio_State=iff(isnan(ratio_State), 0.0, round(ratio_State, 3))
Queries are built in Python using string interpolations and such:
## modules.py
import logging

def match_regex_query(fields: list, regex_patterns: list, kusto_client):
    def match_regex_statements(field, regex_patterns):
        return " or ".join(list(map(lambda pattern: f"{field} matches regex \"{pattern}\"", regex_patterns)))

    count_regex_statement = list(map(
        lambda field: f"count_matching_regex_{field} = countif({match_regex_statements(field, regex_patterns)}), count_not_empty_{field} = countif(isnotempty({field}))", fields))
    count_regex_statement = ", ".join(count_regex_statement)
    summarize_sum_statement = list(map(lambda field: f"Matching_{field} = sum(count_matching_regex_{field}), NotEmpty_{field} = sum(count_not_empty_{field})", fields))
    summarize_sum_statement = ", ".join(summarize_sum_statement)
    project_ratio_statement = list(map(lambda field: f"ratio_{field} = todouble(Matching_{field})/todouble(Matching_{field}+NotEmpty_{field})", fields))
    project_ratio_statement = ", ".join(project_ratio_statement)
    project_round_statement = list(map(lambda field: f"ratio_{field} = iff(isnan(ratio_{field}),0.0,round(ratio_{field}, 3))", fields))
    project_round_statement = ", ".join(project_round_statement)
    query = f"""
StormEvents
| where ingestion_time() > ago(1h)
| summarize {count_regex_statement}
| summarize {summarize_sum_statement}
| project {project_ratio_statement}
| project {project_round_statement}
"""
    clean_query = query.replace("\n", " ").strip()
    try:
        result = kusto_client.execute_query("Samples", clean_query)
    except Exception as err:
        logging.exception(f"Error while computing regex metric : {err}")
        result = []
    return result

## main.py
from azure.kusto.data import KustoClient, KustoConnectionStringBuilder

# provide your kusto client here
cluster = "https://help.kusto.windows.net"
kcsb = KustoConnectionStringBuilder.with_interactive_login(cluster)
client = KustoClient(kcsb)
fields = ["State"]
regex_patterns = ["[A-Z]*"]
metrics = match_regex_query(fields, regex_patterns, client)
## main.py
#provide your kusto client here
cluster = "https://help.kusto.windows.net"
kcsb = KustoConnectionStringBuilder.with_interactive_login(cluster)
client = KustoClient(kcsb)
fields = ["State"]
regex_patterns = ["[A-Z]*"]
metrics = match_regex_query(fields, regex_patterns, client)
Is there a better way to debug this problem?
TIA!
The query your code generates is invalid, as the regular expressions include characters that aren't properly escaped.
See: the string data type.
This is your invalid query (based on the client request ID you provided in the comments):
LiveStream_CL()
| where ingestion_time() > ago(1h)
| summarize count_matching_regex_deviceHostName_s = countif(deviceHostName_s matches regex "^[a-zA-Z0-9\$]([a-zA-Z0-9\-\_\.\$]{0,61}[a-zA-Z0-9\$])?$"), count_not_empty_deviceHostName_s = countif(isnotempty(deviceHostName_s)), count_matching_regex_sourceHostName_s = countif(sourceHostName_s matches regex "^[a-zA-Z0-9\$]([a-zA-Z0-9\-\_\.\$]{0,61}[a-zA-Z0-9\$])?$"), count_not_empty_sourceHostName_s = countif(isnotempty(sourceHostName_s)), count_matching_regex_destinationHostName_s = countif(destinationHostName_s matches regex "^[a-zA-Z0-9\$]([a-zA-Z0-9\-\_\.\$]{0,61}[a-zA-Z0-9\$])?$"), count_not_empty_destinationHostName_s = countif(isnotempty(destinationHostName_s)), count_matching_regex_agentHostName_s = countif(agentHostName_s matches regex "^[a-zA-Z0-9\$]([a-zA-Z0-9\-\_\.\$]{0,61}[a-zA-Z0-9\$])?$"), count_not_empty_agentHostName_s = countif(isnotempty(agentHostName_s))
...
whereas this is how it should look (note the @ prefixes, which turn the regex patterns into verbatim string literals):
LiveStream_CL()
| where ingestion_time() > ago(1h)
| summarize
count_matching_regex_deviceHostName_s = countif(deviceHostName_s matches regex @"^[a-zA-Z0-9\$]([a-zA-Z0-9\-\_\.\$]{0,61}[a-zA-Z0-9\$])?$"),
count_not_empty_deviceHostName_s = countif(isnotempty(deviceHostName_s)),
count_matching_regex_sourceHostName_s = countif(sourceHostName_s matches regex @"^[a-zA-Z0-9\$]([a-zA-Z0-9\-\_\.\$]{0,61}[a-zA-Z0-9\$])?$"),
count_not_empty_sourceHostName_s = countif(isnotempty(sourceHostName_s)),
count_matching_regex_destinationHostName_s = countif(destinationHostName_s matches regex @"^[a-zA-Z0-9\$]([a-zA-Z0-9\-\_\.\$]{0,61}[a-zA-Z0-9\$])?$"),
count_not_empty_destinationHostName_s = countif(isnotempty(destinationHostName_s)),
count_matching_regex_agentHostName_s = countif(agentHostName_s matches regex @"^[a-zA-Z0-9\$]([a-zA-Z0-9\-\_\.\$]{0,61}[a-zA-Z0-9\$])?$"),
count_not_empty_agentHostName_s = countif(isnotempty(agentHostName_s))
...
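If the fix has to happen on the Python side, one option is sketched below. It is a hedged rewrite of the asker's match_regex_statements helper, not part of the original answer, and it assumes the patterns themselves contain no double quotes:
def match_regex_statements(field, regex_patterns):
    # Wrap each pattern in a Kusto verbatim string literal (@"...") so that
    # backslashes such as \$ or \. are not parsed as escape sequences.
    return " or ".join(
        f'{field} matches regex @"{pattern}"' for pattern in regex_patterns
    )
Printing clean_query before calling execute_query also makes it easy to compare the generated text with what you run in the Kusto editor.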

unpivot (wide to long) with dynamic column names

I need a function wide_to_long that turns a wide table into a long table and that accepts an argument id_vars naming the columns whose values have to be repeated (see example).
Sample input
let T_wide = datatable(name: string, timestamp: datetime, A: int, B: int) [
'abc','2022-01-01 12:00:00',1,2,
'def','2022-01-01 13:00:00',3,4
];
Desired output
Calling wide_to_long(T_wide, dynamic(['name', 'timestamp'])) should produce the following table.
let T_long = datatable(name: string, timestamp: datetime, variable: string, value: int) [
'abc','2022-01-01 12:00:00','A',1,
'abc','2022-01-01 12:00:00','B',2,
'def','2022-01-01 13:00:00','A',3,
'def','2022-01-01 13:00:00','B',4
];
Attempt
I've come pretty far with the following code.
let wide_to_long = (T:(*), id_vars: dynamic) {
// get names of keys to remove later
let all_columns = toscalar(T | getschema | summarize make_list(ColumnName));
let remove = set_difference(all_columns, id_vars);
// expand columns not contained in id_vars
T
| extend packed1 = pack_all()
| extend packed1 = bag_remove_keys(packed1, id_vars)
| mv-expand kind=array packed1
| extend variable = packed1[0], value = packed1[1]
// remove unwanted columns
| project packed2 = pack_all()
| project packed2 = bag_remove_keys(packed2, remove)
| evaluate bag_unpack(packed2)
| project-away packed1
};
The problems are that the solution feels clunky (is there a better way?) and the columns in the result are ordered randomly. The second issue is minor, but annoying.

Why is my vector not accumulating the iteration?

I have this type of data that I want to send to a dataframe:
So, I am iterating through it and sending it to a vector. But my vector never keeps the data.
Dv = Vector{Dict}()
for item in reader
    push!(Dv, item)
end
length(Dv)
This is what I get:
And I am sure this is the right way to do it. It works in Python:
EDIT
This is the code that I use to access the data that I want to send to a dataframe:
results = pyimport("splunklib.results")
kwargs_oneshot = (earliest_time = "2019-09-07T12:00:00.000-07:00",
                  latest_time = "2019-09-09T12:00:00.000-07:00",
                  count = 0)
searchquery_oneshot = "search index=iis | lookup geo_BST_ONT longitude as sLongitude, latitude as sLatitude | stats count by featureId | geom geo_BST_ONT allFeatures=True | head 2"
oneshotsearch_results = service.jobs.oneshot(searchquery_oneshot; kwargs_oneshot...)
# Get the results and display them using the ResultsReader
reader = results.ResultsReader(oneshotsearch_results)
for item in reader
    println(item)
end
ResultsReader is a streaming reader. This means you "consume" its elements as you iterate over them. You can convert it to an array with collect. Do not print the items before you collect.
results = pyimport("splunklib.results")
kwargs_oneshot = (earliest_time = "2019-09-07T12:00:00.000-07:00",
                  latest_time = "2019-09-09T12:00:00.000-07:00",
                  count = 0)
searchquery_oneshot = "search index=iis | lookup geo_BST_ONT longitude as sLongitude, latitude as sLatitude | stats count by featureId | geom geo_BST_ONT allFeatures=True | head 2"
oneshotsearch_results = service.jobs.oneshot(searchquery_oneshot; kwargs_oneshot...)
# Get the results
reader = results.ResultsReader(oneshotsearch_results)
# Collect them into an array
Dv = collect(reader)
# Now you can iterate over them without changing the result
for item in Dv
    println(item)
end
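The same pitfall can be reproduced in plain Python, which may help explain why printing first makes the data disappear. The snippet below is only an illustration with a made-up fake_reader, not part of the Splunk API:
def fake_reader():
    # Stands in for a streaming reader: items are produced once, on demand.
    for i in range(3):
        yield {"value": i}

reader = fake_reader()
for item in reader:          # first pass consumes the stream
    print(item)

consumed = list(reader)      # second pass finds nothing left
print(len(consumed))         # prints 0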

Using Spark to add Edges gremlin

I can't save my edge when I'm using Spark as follows.
For information, the edge can be saved using the Gremlin console.
val graph = DseGraphFrameBuilder.dseGraph("GRAPH_NAME", spark)
graph.V().has("vertex1","field1","value").as("a").V().has("vertex2","field1","value").addE("myEdgeLabel").to("a")
When I try: graph.edges.show()
I get an empty table
The addE() step is not yet implemented in DseGraphFrames; you should use the DGF-specific updateEdges() function instead. The function is designed for bulk updates and takes a Spark DataFrame with new edges in DGF format:
scala> newEdges.printSchema
root
|-- src: string (nullable = false)
|-- dst: string (nullable = false)
|-- ~label: string (nullable = true)
The src and dst columns are encoded vertex IDs. You can either construct them with the g.idColumn() helper function or select them from the vertices.
Usually you know the IDs and use the helper function:
scala> val df = Seq((1, 2, "myEdgeLabel")).toDF("v1_id", "v2_id", "~label")
scala> val newEdges=df.select(g.idColumn("vertex2", $"v2_id") as "src", g.idColumn("vertex1", $"v1_id") as "dst", $"~label")
scala> g.updateEdges(newEdges)
For your particular case, you can query the IDs first and then insert based on them. Never do this in production: this approach is slow and not a bulk operation. On huge graphs, use method #1:
val dst = g.V.has("vertex1","field1","value").id.first.getString(0)
val src = g.V.has("vertex2","field1","value").id.first.getString(0)
val newEdges = Seq((src, dst, "myEdgeLabel")).toDF("src", "dst", "~label")
g.updateEdges(newEdges)
See documentation: https://docs.datastax.com/en/dse/5.1/dse-dev/datastax_enterprise/graph/graphAnalytics/dseGraphFrameImport.html

IndexError: list index out of range, scores.append( (fields[0], fields[1]))

I'm trying to read a file and put its contents in a list. I have done this many times before and it has worked, but this time it throws the error "list index out of range".
The code is:
with open("File.txt") as f:
    scores = []
    for line in f:
        fields = line.split()
        scores.append((fields[0], fields[1]))
    print(scores)
The text file is in the format:
Alpha:[0, 1]
Bravo:[0, 0]
Charlie:[60, 8, 901]
Foxtrot:[0]
I can't see why it is giving me this problem. Is it because I have more than one value for each item? Or is it the fact that I have a colon in my text file?
How can I get around this problem?
Thanks
If I understand you correctly, this code will print your desired result. (The IndexError happens because line.split() splits on whitespace, so a line like Foxtrot:[0] contains no space, yields only one field, and makes fields[1] out of range.)
import re

with open("File.txt") as f:
    # Build a dictionary of scores: {name: scores}.
    scores = {}
    # Regular expressions to parse the team name and the team scores from each line.
    patternScore = r'\[([^\]]+)\]'
    patternName = r'(.*):'
    for line in f:
        # Extract the scores and the team name.
        fields = re.search(patternScore, line).groups()[0].split(', ')
        name = re.search(patternName, line).groups()[0]
        # Update the dictionary with the new value.
        scores[name] = fields

# Print the first score of each entry, then the name.
for key in scores:
    print(scores[key][0] + ':' + key)
You will receive the following output:
60:Charlie
0:Alpha
0:Bravo
0:Foxtrot
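For comparison, here is an alternative sketch (an assumption on my part, not taken from the answer above) that splits each line on the first colon and parses the bracketed list with ast.literal_eval, keeping all scores per name:
import ast

scores = {}
with open("File.txt") as f:
    for line in f:
        # Split on the first colon: name on the left, bracketed list on the right.
        name, _, values = line.partition(':')
        scores[name.strip()] = ast.literal_eval(values.strip())

print(scores)
# e.g. {'Alpha': [0, 1], 'Bravo': [0, 0], 'Charlie': [60, 8, 901], 'Foxtrot': [0]}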
