VegaLite - Customize label axis according to values - plot

I'm trying to plot a dendrogram using VegaLite. The plot is almost complete, but there is just one thing missing: the axis labels corresponding to the point id. In the figure I plotted, the axis is ordered from 1 to 40. I have a dictionary with the corresponding labels, e.g. if the value is 1 then the label is "point10", if the value is 30 then the label is "point5", and so on. Hence, what I'd like to do is replace the numbers with the labels.
Is it possible to do this in VegaLite? I haven't found a way by reading the Docs.

You can just provide the field containing your point labels in the x-axis encoding and then sort it by the field containing your values. With this you will see the point labels as axis labels, and the ordering will follow your values. Refer to the example below, which sorts on the Name field, or check the editor.
{
  "$schema": "https://vega.github.io/schema/vega-lite/v5.json",
  "description": "Drag a rectangular brush to show (first 20) selected points in a table.",
  "data": {"url": "data/cars.json"},
  "transform": [
    {"window": [{"op": "row_number", "as": "row_number"}]},
    {"calculate": "'My'+datum.Name", "as": "alias"}
  ],
  "params": [{"name": "brush", "select": "interval"}],
  "mark": "bar",
  "encoding": {
    "x": {
      "field": "Name",
      "type": "nominal",
      "stack": false,
      "sort": {"field": "row_number", "order": "ascending"}
    },
    "y": {"field": "Miles_per_Gallon", "stack": false, "type": "quantitative"},
    "color": {
      "condition": {"param": "brush", "field": "Cylinders", "type": "ordinal"},
      "value": "grey"
    }
  }
}
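Applied to the dendrogram case, a minimal sketch along the same lines (the field names value and pointLabel, and the inline data, are assumptions standing in for your dictionary):
{
  "$schema": "https://vega.github.io/schema/vega-lite/v5.json",
  "data": {
    "values": [
      {"value": 1, "pointLabel": "point10"},
      {"value": 30, "pointLabel": "point5"}
    ]
  },
  "mark": "point",
  "encoding": {
    "x": {
      "field": "pointLabel",
      "type": "nominal",
      "sort": {"field": "value", "order": "ascending"}
    }
  }
}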

Unfortunately, I was not able to solve this by modifying the axis. However, I did manage a workaround. It's not the "correct" way, since what I do is actually erase the axis and add a "text" mark with the labels, placed where the axis was.
If anyone is interested, here is a gist with the implementation: https://gist.github.com/davibarreira/74c4274333ac51bbeed627deeb631195

Related

OpenAI package leaving linebreak in response

I've started using the OpenAI API in R. I downloaded the openai package. I keep getting a double linebreak in the text response. Here's an example of my code:
library(openai)
vector = create_completion(
  model = "text-davinci-003",
  prompt = "Tell me what the weather is like in London, UK, in Celsius in 5 words.",
  max_tokens = 20,
  temperature = 0,
  echo = FALSE
)
vector_2 = vector$choices[1]
vector_2$text
[1] "\n\nRainy, mild, cool, humid."
Is there a way to get rid of this without 'correcting' the response text using other functions?
No, it's not possible.
The OpenAI API returns the completion starting with \n\n by default. There's no parameter for the Completions endpoint to control this.
You need to remove the linebreaks manually.
An example response looks like this:
{
  "id": "cmpl-uqkvlQyYK7bGYrRHQ0eXlWi7",
  "object": "text_completion",
  "created": 1589478378,
  "model": "text-davinci-003",
  "choices": [
    {
      "text": "\n\nThis is indeed a test",
      "index": 0,
      "logprobs": null,
      "finish_reason": "length"
    }
  ],
  "usage": {
    "prompt_tokens": 5,
    "completion_tokens": 7,
    "total_tokens": 12
  }
}
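For the R side, a minimal sketch using base R's trimws() to strip the leading linebreaks (this reuses the vector object from the question; no extra packages needed):
# Take the first choice as in the question, then drop the leading "\n\n"
vector_2 = vector$choices[1]
clean_text = trimws(vector_2$text, which = "left")
clean_text
# [1] "Rainy, mild, cool, humid."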

How can I use jq to sort by datetime field and filter based on attribute?

I am trying to sort the following JSON response based on "startTime" and also want to filter based on "name" and fetch only the "dataCenter" of the matched records. Can you please help with a jq filter for doing this?
I tried something like jq '.[]|= sort_by(.startTime)' but it doesn't return the correct result.
[
  {
    "name": "JPCSKELT",
    "dataCenter": "mvsADM",
    "orderId": "G9HC8",
    "scheduleTable": "FD33515",
    "nodeGroup": null,
    "controlmApp": "P/C-DEVELOPMENT-LRSP",
    "groupName": "SCMTEST",
    "assignmentGroup": "HOST_CONFIG_MGMT",
    "owner": "PC00000",
    "description": null,
    "startTime": "2021-11-11 17:45:48.0",
    "endTime": "2021-11-11 17:45:51.0",
    "successCount": 1,
    "failureCount": 0,
    "dailyRunCount": 0,
    "scriptName": "JPCSKELT"
  },
  {
    "name": "JPCSKELT",
    "dataCenter": "mvsADM",
    "orderId": "FWX98",
    "scheduleTable": "JPCS1005",
    "nodeGroup": null,
    "controlmApp": "P/C-DEVELOPMENT-LRSP",
    "groupName": "SCMTEST",
    "assignmentGroup": "HOST_CONFIG_MGMT",
    "owner": "PC00000",
    "description": null,
    "startTime": "2021-07-13 10:49:47.0",
    "endTime": "2021-07-13 10:49:49.0",
    "successCount": 1,
    "failureCount": 0,
    "dailyRunCount": 0,
    "scriptName": "JPCSKELT"
  },
  {
    "name": "JPCSKELT",
    "dataCenter": "mvsADM",
    "orderId": "FWX98",
    "scheduleTable": "JPCS1005",
    "nodeGroup": null,
    "controlmApp": "P/C-DEVELOPMENT-LRSP",
    "groupName": "SCMTEST",
    "assignmentGroup": "HOST_CONFIG_MGMT",
    "owner": "PC00000",
    "description": null,
    "startTime": "2021-10-13 10:49:47.0",
    "endTime": "2021-10-13 10:49:49.0",
    "successCount": 1,
    "failureCount": 0,
    "dailyRunCount": 0,
    "scriptName": "JPCSKELT"
  }
]
You can use the following expression to sort the input:
sort_by(.startTime | sub("(?<time>.*)\\..*"; "\(.time)") | strptime("%Y-%m-%d %H:%M:%S") | mktime)
The sub("(?<time>.*)\\..*"; "\(.time)") expression removes the trailing decimal fraction.
I assume you can use the result from the above query to perform desired filtering.
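Putting the pieces together, a sketch that filters on name, sorts chronologically, and extracts dataCenter (the file name data.json and the literal name JPCSKELT are assumptions based on the sample data):
jq --arg name JPCSKELT '
  map(select(.name == $name))
  | sort_by(.startTime | sub("(?<time>.*)\\..*"; "\(.time)") | strptime("%Y-%m-%d %H:%M:%S") | mktime)
  | .[].dataCenter
' data.json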
Welcome. From what I gather, you want to supply a value to filter the records on using the name property, sort the results by the startTime property, and then just output the value of the dataCenter property for those records. How about this:
jq --arg name JPCSKELT '
map(select(.name==$name))|sort_by(.startTime)[].dataCenter
' data.json
Based on your sample data, this produces:
"mvsADM"
"mvsADM"
"mvsADM"
So I'm wondering if this is what you're really asking?

Group nested array objects to parent key in JQ

I have JSON coming from an external application, formatted like so:
{
  "ticket_fields": [
    {
      "url": "https://example.com/1122334455.json",
      "id": 1122334455,
      "type": "tagger",
      "custom_field_options": [
        {
          "id": 123456789,
          "name": "I have a problem",
          "raw_name": "I have a problem",
          "value": "help_i_have_problem",
          "default": false
        },
        {
          "id": 456789123,
          "name": "I have feedback",
          "raw_name": "I have feedback",
          "value": "help_i_have_feedback",
          "default": false
        }
      ]
    },
    {
      "url": "https://example.com/6677889900.json",
      "id": 6677889900,
      "type": "tagger",
      "custom_field_options": [
        {
          "id": 321654987,
          "name": "United States",
          "raw_name": "United States",
          "value": "location_123_united_states",
          "default": false
        },
        {
          "id": 987456321,
          "name": "Germany",
          "raw_name": "Germany",
          "value": "location_456_germany",
          "default": false
        }
      ]
    }
  ]
}
The end goal is to get the data into a TSV where each object in the custom_field_options array is grouped by its parent ID (ticket_fields.id) and transposed such that each object is represented on a single line, like so:
Ticket Field ID    Name                Value
1122334455         I have a problem    help_i_have_problem
1122334455         I have feedback     help_i_have_feedback
6677889900         United States       location_123_united_states
6677889900         Germany             location_456_germany
I have been able to export the data to TSV already, but it puts everything for a ticket field on one line, without preserving the name/value pairing, like so:
Using jq -r '.ticket_fields[] | select(.type=="tagger") | [.id, .custom_field_options[].name, .custom_field_options[].value] | @tsv'
Ticket Field ID    Name                Name               Value                         Value
1122334455         I have a problem    I have feedback    help_i_have_problem           help_i_have_feedback
6677889900         United States       Germany            location_123_united_states    location_456_germany
Each of the custom_field_options arrays in production may consist of any number of objects (not limited to 2 each). But I seem to be stuck on how to appropriately group or map these objects to their parent ticket_fields.id and to transpose the data in a clean manner. The select(.type=="tagger") is mentioned in the query as there are multiple values for ticket_fields.type which need to be filtered out.
Based on another answer on here, I did try variants of jq -r '.ticket_fields[] | select(.type=="tagger") | map(.custom_field_options |= from_entries) | group_by(.custom_field_options.ticket_fields) | map(map( .custom_field_options |= to_entries))' without success. Any assistance would be greatly appreciated!
You need two nested iterations, one for each array. Save the value of .id in a variable so you can access it later.
jq -r '
  .ticket_fields[] | select(.type=="tagger") | .id as $id
  | .custom_field_options[] | [$id, .name, .value]
  | @tsv
'
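With the sample input above, this produces one tab-separated row per option, matching the desired grouping:
1122334455	I have a problem	help_i_have_problem
1122334455	I have feedback	help_i_have_feedback
6677889900	United States	location_123_united_states
6677889900	Germany	location_456_germany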

Is my partition transform in Vega written correctly because the graph that is visualized is not accurate

I am creating a hierarchical representation of data in Vega, using the stratify and partition transforms. The issue lies with the x coordinates generated by the partition transform. In the link, navigate to the data viewer and select tree-map. The x0 and x1 for the initial id, the topmost element "completed stories" in the hierarchy, range from 0 to 650. The next two elements, "testable" & "not testable", should have a combined x range of 0 to 650, but instead they range from 0 to 455. The width should be based on their quantities, located in the "amount" field. Any suggestions as to why the generated rectangles are not commensurate with the quantities?
Link to Vega Editor with code shown
For your dataset "rawNumbers", values should only be provided for the leaf nodes when using the stratify transform: the partition layout sums values up the hierarchy, so an "amount" set on a non-leaf node is added on top of its children's total and the widths come out inconsistent.
{
  "name": "rawNumbers",
  "values": [
    {"id": "completed stories", "parent": null},
    {"id": "testable", "parent": "completed stories"},
    {"id": "not testable", "parent": "completed stories", "amount": 1435},
    {"id": "sufficiently tested", "parent": "testable"},
    {"id": "insufficiently tested", "parent": "testable"},
    {"id": "integration tested", "parent": "sufficiently tested", "amount": 1758},
    {"id": "unit tested", "parent": "sufficiently tested", "amount": 36},
    {"id": "partial coverage", "parent": "insufficiently tested", "amount": 298},
    {"id": "no coverage", "parent": "insufficiently tested", "amount": 341}
  ]
},
Open in Vega Editor

Understanding fold() and its impact on gremlin query cost in Azure Cosmos DB

I am trying to understand query costs in Azure Cosmos DB.
I cannot figure out what the difference is between the following examples and why using fold() lowers the cost:
g.V().hasLabel('item').project('itemId', 'id').by('itemId').by('id')
which produces the following output:
[
  {
    "itemId": 14,
    "id": "186de1fb-eaaf-4cc2-b32b-de8d7be289bb"
  },
  {
    "itemId": 5,
    "id": "361753f5-7d18-4a43-bb1d-cea21c489f2e"
  },
  {
    "itemId": 6,
    "id": "1c0840ee-07eb-4a1e-86f3-abba28998cd1"
  },
  ...
  {
    "itemId": 5088,
    "id": "2ed1871d-c0e1-4b38-b5e0-78087a5a75fc"
  }
]
The cost is 15642 RUs x 0.00008 $/RU = 1.25$
g.V().hasLabel('item').project('itemId', 'id').by('itemId').by('id').fold()
which produces the following output:
[
  [
    {
      "itemId": 14,
      "id": "186de1fb-eaaf-4cc2-b32b-de8d7be289bb"
    },
    {
      "itemId": 5,
      "id": "361753f5-7d18-4a43-bb1d-cea21c489f2e"
    },
    {
      "itemId": 6,
      "id": "1c0840ee-07eb-4a1e-86f3-abba28998cd1"
    },
    ...
    {
      "itemId": 5088,
      "id": "2ed1871d-c0e1-4b38-b5e0-78087a5a75fc"
    }
  ]
]
The cost is 787 RUs x 0.00008 $/RU = 0.06$
g.V().hasLabel('item').values('id', 'itemId')
with the following output:
[
  "186de1fb-eaaf-4cc2-b32b-de8d7be289bb",
  14,
  "361753f5-7d18-4a43-bb1d-cea21c489f2e",
  5,
  "1c0840ee-07eb-4a1e-86f3-abba28998cd1",
  6,
  ...
  "2ed1871d-c0e1-4b38-b5e0-78087a5a75fc",
  5088
]
The cost is 10639 RUs x 0.00008 $/RU = 0.85$
g.V().hasLabel('item').values('id', 'itemId').fold()
with the following output:
[
  [
    "186de1fb-eaaf-4cc2-b32b-de8d7be289bb",
    14,
    "361753f5-7d18-4a43-bb1d-cea21c489f2e",
    5,
    "1c0840ee-07eb-4a1e-86f3-abba28998cd1",
    6,
    ...
    "2ed1871d-c0e1-4b38-b5e0-78087a5a75fc",
    5088
  ]
]
The cost is 724.27 RUs x 0.00008 $/RU = 0.057$
As you see, the impact on the cost is tremendous.
This is just for approx. 3200 nodes with few properties.
I would like to understand why adding fold() changes the cost so much.
I was trying to reproduce your example, but unfortunately got the opposite results (500 vertices in Cosmos):
g.V().hasLabel('test').values('id')
or
g.V().hasLabel('test').project('id').by('id')
gave 86.08 and 91.44 RU respectively, while the same queries followed by a fold() step resulted in 585.06 and 590.43 RU.
The result I got seems fine, as according to the TinkerPop documentation:
There are situations when the traversal stream needs a "barrier" to aggregate all the objects and emit a computation that is a function of the aggregate. The fold()-step (map) is one particular instance of this.
Knowing that Cosmos charges RUs both for the number of accessed objects and for the computations performed on those objects (fold() in this particular case), a higher cost for fold() is expected.
You can try running the executionProfile() step for your traversal, which can help you investigate your case. When I tried:
g.V().hasLabel('test').values('id').executionProfile()
I got 2 additional steps for fold() (identical parts of the output are omitted for brevity), and this ProjectAggregation is where the result set was mapped from 500 to 1:
...
{
  "name": "ProjectAggregation",
  "time": 165,
  "annotations": {
    "percentTime": 8.2
  },
  "counts": {
    "resultCount": 1
  }
},
{
  "name": "QueryDerivedTableOperator",
  "time": 1,
  "annotations": {
    "percentTime": 0.05
  },
  "counts": {
    "resultCount": 1
  }
}
...
