How to effectively chain groupby queries from flat API data in Kafka Streams?

I have some random data coming from an API into a Kafka topic that looks like this:
{"vin": "1N6AA0CA7CN040747", "make": "Nissan", "model": "Pathfinder", "year": 1993, "color": "Blue", "salePrice": "$58312.28", "city": "New York City", "state": "New York", "zipCode": "10014"}
{"vin": "1FTEX1C88AF678435", "make": "Audi", "model": "200", "year": 1991, "color": "Aquamarine", "salePrice": "$65651.53", "city": "Newport Beach", "state": "California", "zipCode": "92662"}
{"vin": "JN8AS1MU1BM237985", "make": "Subaru", "model": "Legacy", "year": 1990, "color": "Violet", "salePrice": "$21325.27", "city": "Joliet", "state": "Illinois", "zipCode": "60435"}
{"vin": "SCBGR3ZA1CC504502", "make": "Mercedes-Benz", "model": "E-Class", "year": 1986, "color": "Fuscia", "salePrice": "$81822.04", "city": "Pasadena", "state": "California", "zipCode": "91117"}
I am able to create KStream objects and observe them, like this:
KStream<byte[], UsedCars> usedCarsInputStream =
    builder.stream("used-car-colors", Consumed.with(Serdes.ByteArray(), new UsedCarsSerdes()));

// k, v => year, count of cars in that year
KTable<String, Long> yearCount = usedCarsInputStream
    .filter((k, v) -> v.getYear() > 2010)
    .selectKey((k, v) -> v.getVin())
    .groupBy((key, value) -> Integer.toString(value.getYear()))
    .count();
// print() returns void, so it can't be part of the assignment above
yearCount.toStream().print(Printed.<String, Long>toSysOut().withLabel("blah"));
This of course gives a count of the records grouped by each year greater than 2010. What I would like to do in the next step, but have been unable to accomplish, is to take each of those years, as in a foreach, and count the number of cars of each color per year. I attempted writing a foreach on yearCount.toStream() to further process the data, but got no results.
I am looking for output that might look like this:
{
  "2011": [
    {
      "blue": "99",
      "green": "243",
      "red": "33"
    }
  ],
  "2012": [
    {
      "blue": "74",
      "green": "432",
      "red": "2"
    }
  ]
}

I believe I may have answered my own question; I'd welcome comments on my solution.
What I did not realize is that you can group by a compound object. In this case, I needed the equivalent of the following SQL statement:
SELECT year, color, COUNT(*) FROM used_car_colors
GROUP BY year, color
In Kafka Streams, you can accomplish this by creating a compound key -- in this situation, I created a POJO class called 'YearColor' with members year and color -- and then selecting it as the key:
usedCarsInputStream
    .selectKey((k, v) -> new YearColor(v.getYear(), v.getColor()))
    .groupByKey(Grouped.with(new YearColorSerdes(), new UsedCarsSerdes()))
    .count()
    .toStream()
    .peek((yc, ct) -> System.out.println("year: " + yc.getYear()
        + " color: " + yc.getColor() + " count: " + ct));
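The post doesn't show the YearColor class itself. A minimal sketch might look like the following; the field names match the getters used above, but the pipe-delimited wire format (which a YearColorSerdes could delegate to) is purely an assumption. Note that a compound grouping key needs correct equals() and hashCode(), since Kafka Streams compares keys when repartitioning and aggregating.

```java
import java.nio.charset.StandardCharsets;
import java.util.Objects;

// Compound grouping key: (year, color). equals()/hashCode() are required
// for correct grouping; the byte round-trip below is one possible wire
// format that a custom Serializer/Deserializer pair could use.
class YearColor {
    private final int year;
    private final String color;

    YearColor(int year, String color) {
        this.year = year;
        this.color = color;
    }

    int getYear() { return year; }
    String getColor() { return color; }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof YearColor)) return false;
        YearColor other = (YearColor) o;
        return year == other.year && color.equals(other.color);
    }

    @Override
    public int hashCode() { return Objects.hash(year, color); }

    // Hypothetical "year|color" encoding for the Serde to delegate to.
    byte[] toBytes() {
        return (year + "|" + color).getBytes(StandardCharsets.UTF_8);
    }

    static YearColor fromBytes(byte[] bytes) {
        String[] parts = new String(bytes, StandardCharsets.UTF_8).split("\\|", 2);
        return new YearColor(Integer.parseInt(parts[0]), parts[1]);
    }
}
```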
You of course have to implement the Serializer and Deserializer for this object (which I did in YearColorSerdes). When I run the Kafka Streams application, the output gives me updates on the changing counts, e.g.:
year: 2012 color: Maroon count: 2
year: 2013 color: Khaki count: 1
year: 2012 color: Crimson count: 5
year: 2011 color: Pink count: 4
year: 2011 color: Green count: 2
which is what I was looking for.
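To get from these flat (year, color) counts back to the nested per-year shape shown in the question, the counts would still have to be re-grouped by year. The merge semantics are easiest to see with the Kafka Streams plumbing stripped out; this plain-Java sketch (the class and method names are made up) uses ordinary maps where a state store would sit:

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class ColorCountsByYear {
    // Each record is a {year, color} pair; the result maps
    // year -> (color -> count), mirroring the desired JSON layout.
    static Map<String, Map<String, Long>> tally(List<String[]> records) {
        Map<String, Map<String, Long>> byYear = new TreeMap<>();
        for (String[] rec : records) {
            byYear.computeIfAbsent(rec[0], y -> new TreeMap<>())
                  .merge(rec[1], 1L, Long::sum);
        }
        return byYear;
    }
}
```

In an actual topology this would correspond to grouping the (year, color) counts by year and aggregating them into a per-year map, e.g. with groupBy plus aggregate on the resulting KTable.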

Related

Risk of a single Doc with a dozen arrays containing thousands of small objects

Can the write operations on a single doc break when many users are online?
There will be ~12 such arrays, each with tens of thousands of such objects. Write operations would be:
increment(+1) the count field of an existing object, based on currency and activity (stay, sports...),
add an entire new object when it's a new currency and country,
update an existing object, e.g. increment(-1) or a price change.
Note: all data is displayed at once on a single page.
stayCount [
  {price: "15", count: 2, country: "USA", currency: "USD"},
  {price: "15", count: 3, country: "UAE", currency: "AED"},
  {price: "25", count: 5, country: "USA", currency: "USD"}
]
sportsCount [
  {price: "15", count: 1, country: "Germany", currency: "EUR"},
  {price: "49", count: 6, country: "UAE", currency: "AED"},
  {price: "49", count: 8, country: "France", currency: "EUR"}
]
Asking because of the one-write-per-document-per-second limit.

Group nested array objects to parent key in JQ

I have JSON coming from an external application, formatted like so:
{
  "ticket_fields": [
    {
      "url": "https://example.com/1122334455.json",
      "id": 1122334455,
      "type": "tagger",
      "custom_field_options": [
        {
          "id": 123456789,
          "name": "I have a problem",
          "raw_name": "I have a problem",
          "value": "help_i_have_problem",
          "default": false
        },
        {
          "id": 456789123,
          "name": "I have feedback",
          "raw_name": "I have feedback",
          "value": "help_i_have_feedback",
          "default": false
        }
      ]
    },
    {
      "url": "https://example.com/6677889900.json",
      "id": 6677889900,
      "type": "tagger",
      "custom_field_options": [
        {
          "id": 321654987,
          "name": "United States",
          "raw_name": "United States",
          "value": "location_123_united_states",
          "default": false
        },
        {
          "id": 987456321,
          "name": "Germany",
          "raw_name": "Germany",
          "value": "location_456_germany",
          "default": false
        }
      ]
    }
  ]
}
The end goal is to get the data into a TSV where each object in the custom_field_options array is grouped by the parent ID (ticket_fields.id) and transposed so that each object is represented on a single line, like so:
Ticket Field ID   Name               Value
1122334455        I have a problem   help_i_have_problem
1122334455        I have feedback    help_i_have_feedback
6677889900        United States      location_123_united_states
6677889900        Germany            location_456_germany
I have been able to export the data successfully to TSV already, but it reads per-line, and without preserving order, like so:
Using jq -r '.ticket_fields[] | select(.type=="tagger") | [.id, .custom_field_options[].name, .custom_field_options[].value] | @tsv'
Ticket Field ID   Name               Name              Value                Value
1122334455        I have a problem   I have feedback   help_i_have_problem  help_i_have_feedback
6677889900        United States      Germany           location_123_united_states  location_456_germany
Each of the custom_field_options arrays in production may consist of any number of objects (not limited to 2 each). But I seem to be stuck on how to appropriately group or map these objects to their parent ticket_fields.id and to transpose the data in a clean manner. The select(.type=="tagger") is mentioned in the query as there are multiple values for ticket_fields.type which need to be filtered out.
Based on another answer on here, I did try variants of jq -r '.ticket_fields[] | select(.type=="tagger") | map(.custom_field_options |= from_entries) | group_by(.custom_field_options.ticket_fields) | map(map( .custom_field_options |= to_entries))' without success. Any assistance would be greatly appreciated!
You need two nested iterations, one for each array. Save the value of .id in a variable so it is still accessible in the inner iteration.
jq -r '
  .ticket_fields[]
  | select(.type=="tagger")
  | .id as $id
  | .custom_field_options[]
  | [$id, .name, .value]
  | @tsv
'

Is my partition transform in Vega written correctly because the graph that is visualized is not accurate

I am creating a hierarchical representation of data in Vega, using the stratify and partition transforms. The issue lies with the x coordinates generated by the partition transform. In the link, navigate to the data viewer and select tree-map. The x0 and x1 for the initial id, the topmost element "completed stories" in the hierarchy, range from 0 - 650. The next two elements, "testable" and "not testable", should have a combined x range of 0 - 650, but instead they range from 0 - 455. The width should be based on their quantities, located in the "amount" field. Any suggestions as to why the generated rectangles are not commensurate with the quantities?
Link to Vega Editor with code shown
For your dataset "rawNumbers", values should only be provided for the leaf nodes when using the stratify transform.
{
  "name": "rawNumbers",
  "values": [
    {"id": "completed stories", "parent": null},
    {"id": "testable", "parent": "completed stories"},
    {"id": "not testable", "parent": "completed stories", "amount": 1435},
    {"id": "sufficiently tested", "parent": "testable"},
    {"id": "insufficiently tested", "parent": "testable"},
    {"id": "integration tested", "parent": "sufficiently tested", "amount": 1758},
    {"id": "unit tested", "parent": "sufficiently tested", "amount": 36},
    {"id": "partial coverage", "parent": "insufficiently tested", "amount": 298},
    {"id": "no coverage", "parent": "insufficiently tested", "amount": 341}
  ]
},
Open in Vega Editor

How can I create a GTM data layer variable that pulls the last item in a product array?

I have an eCommerce data layer that looks like this:
products: [
  {
    name: "T10 (Fri - Sun) Adult",
    id: "1123",
    price: "260",
    brand: "f1",
    category: "Austrian Formula 1 Grand Prix 2022",
    quantity: 1
  },
  {
    name: "Red Bull: CDE (Fri - Sun) Adult",
    id: "1123",
    price: "251",
    brand: "f1",
    category: "Austrian Formula 1 Grand Prix 2022",
    quantity: 1
  },
  {
    name: "Steiermark (South-West) (Fri - Sun) Adult",
    id: "1123",
    price: "420",
    brand: "f1",
    category: "Austrian Formula 1 Grand Prix 2022",
    quantity: 1
  }
]
When creating a data layer variable in GTM, I know I can create products.0.name and the result will be 'T10 (Fri - Sun) Adult.' Or products.2.name would result in 'Steiermark (South-West) (Fri - Sun) Adult.' But, if we assume I don't always know how many items will be in the product array, how can I create a variable that always pulls the final product (in the above example Steiermark (South-West) (Fri - Sun) Adult)?
UPDATE
I have tried products.slice(-1).name, but that gives an 'undefined' response.
Just use .slice(-1) on your array. Note that slice returns a one-element array rather than the element itself, which is why products.slice(-1).name comes back undefined -- you need products.slice(-1)[0].name. Since dot-notation data layer variables can't call methods, put this in a Custom JavaScript variable.
You can also use .pop(), but pop mutates the array, therefore, no pops.

Is there an R library or function for formatting international currency strings?

Here's a snippet of the JSON data I'm working with:
{
"item" = "Mexican Thing",
...
"raised": "19",
"currency": "MXN"
},
{
"item" = "Canadian Thing",
...
"raised": "42",
"currency": "CDN"
},
{
"item" = "American Thing",
...
"raised": "1",
"currency": "USD"
}
You get the idea.
I'm hoping there's a function out there that can take in a standard currency abbreviation and a number and spit out the appropriate string. I could theoretically write this myself except I can't pretend like I know all the ins and outs of this stuff and I'm bound to spend days and weeks being surprised by bugs or edge cases I didn't think of. I'm hoping there's a library (or at least a web api) already written that can handle this but my Googling has yielded nothing useful so far.
Here's an example of the result I want (let's pretend "currency" is the function I'm looking for)
currency("USD", "32") --> "$32"
currency("GBP", "45") --> "£45"
currency("EUR", "19") --> "€19"
currency("MXN", "40") --> "MX$40"
Assuming your real JSON is valid, this should be relatively simple. I'll provide a valid JSON string, fixing the three invalid portions here: = should be :; ... is obviously a placeholder; and the objects should be in a list wrapped in [ and ]:
js <- '[{
"item": "Mexican Thing",
"raised": "19",
"currency": "MXN"
},
{
"item": "Canadian Thing",
"raised": "42",
"currency": "CDN"
},
{
"item": "American Thing",
"raised": "1",
"currency": "USD"
}]'
with(jsonlite::parse_json(js, simplifyVector = TRUE),
paste(raised, currency))
# [1] "19 MXN" "42 CDN" "1 USD"
Edit: in order to change to specific currency symbols, don't make this too difficult: just instantiate a lookup vector where "USD" (for example) prepends "$" and appends "" (nothing) to the raised string. (I say both prepend and append because some currencies are conventionally written after the digits.)
pre_currency <- Vectorize(function(curr) switch(curr, USD="$", GBP="£", EUR="€", CDN="$", "?"))
post_currency <- Vectorize(function(curr) switch(curr, USD="", GBP="", EUR="", CDN="", "?"))
with(jsonlite::parse_json(js, simplifyVector = TRUE),
paste0(pre_currency(currency), raised, post_currency(currency)))
# [1] "?19?" "$42" "$1"
I intentionally left "MXN" out of the vector here to demonstrate that you need a default setting, "?" (pre/post) here. You may choose a different default/unknown currency value.
An alternative:
currency <- function(val, currency) {
pre <- sapply(currency, switch, USD="$", GBP="£", EUR="€", CDN="$", "?")
post <- sapply(currency, switch, USD="", GBP="", EUR="", CDN="", "?")
paste0(pre, val, post)
}
with(jsonlite::parse_json(js, simplifyVector = TRUE),
currency(raised, currency))
# [1] "?19?" "$42" "$1"
