I am trying to create an order book data structure where a top level dictionary holds 3 basic order types, each of those types has a bid and ask side and each of the sides has a list of tables, one for each ticker. For example, if I want to retrieve all the ask orders of type1 for Google stock, I'd call book[`orderType1][`ask][`GOOG]. I implemented that using the following:
bookTemplate: ([]orderID:`int$();date:"d"$();time:`time$();sym:`$();side:`$();
orderType:`$();price:`float$();quantity:`int$());
bookDict:(1#`)!enlist`orderID xkey bookTemplate;
book: `orderType1`orderType2`orderType3 ! (3# enlist(`ask`bid!(2# enlist bookDict)));
Data retrieval using book[`orderType1][`ask][`ticker] seems to be working fine. The problem appears when I try to add new order to a specific order book e.g:
testorder:`orderID`date`time`sym`side`orderType`price`quantity!(111111111;.z.D;.z.T;
`GOOG;`ask;`orderType1;100.0f;123);
book[`orderType1][`ask][`GOOG],:testorder;
Executing the last query gives 'assign error. What's the reason? How to solve it?
A couple of issues here. The first is that while you can look up values in nested dictionaries using a series of repeated in-line keys, i.e.
q)book[`orderType1][`ask][`GOOG]
orderID| date time sym side orderType price quantity
-------| -------------------------------------------
you can't assign values like this (you can only assign one level deep). A better approach is to use dot-indexing (and dot-amend to reassign values). However, the value of your book dictionary is being flattened to a table because the list of dictionaries is uniform. So this fails:
q)book . `orderType1`ask`GOOG
'rank
You can see how it got flattened by inspecting book in the terminal:
q)book
| ask
----------| -----------------------------------------------------------------
orderType1| (,`)!,(+(,`orderID)!,`int$())!+`date`time`sym`side`orderType`pric
orderType2| (,`)!,(+(,`orderID)!,`int$())!+`date`time`sym`side`orderType`pric
orderType3| (,`)!,(+(,`orderID)!,`int$())!+`date`time`sym`side`orderType`pric
To prevent this flattening you can force the value to be a mixed list by adding a generic null
q)book: ``orderType1`orderType2`orderType3 !(::),(3# enlist(`ask`bid!(2# enlist bookDict)));
Then it looks like this:
q)book
| ::
orderType1| `ask`bid!+(,`)!,((+(,`orderID)!,`int$())!+`date`time`sym`side`ord
orderType2| `ask`bid!+(,`)!,((+(,`orderID)!,`int$())!+`date`time`sym`side`ord
orderType3| `ask`bid!+(,`)!,((+(,`orderID)!,`int$())!+`date`time`sym`side`ord
Dot-indexing now works:
q)book . `orderType1`ask`GOOG
orderID| date time sym side orderType price quantity
-------| -------------------------------------------
which means that dot-amend will now work too
q).[`book;`orderType1`ask`GOOG;,;testorder]
`book
q)book
| ::
orderType1| `ask`bid!+``GOOG!(((+(,`orderID)!,`int$())!+`date`time`sym`side`o
orderType2| `ask`bid!+(,`)!,((+(,`orderID)!,`int$())!+`date`time`sym`side`ord
orderType3| `ask`bid!+(,`)!,((+(,`orderID)!,`int$())!+`date`time`sym`side`ord
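For readers who do not use q, the idea behind dot-amend (apply a function at a path inside a nested structure, rather than assigning through chained lookups) can be sketched in Python. This is only an analogy under stated assumptions: `amend_at` is a hypothetical helper, plain lists of dicts stand in for the keyed tables, and none of it reflects how kdb+ implements amend.

```python
from copy import deepcopy

def amend_at(tree, path, func, value):
    """Apply func(existing, value) at the nested location given by path."""
    node = tree
    for key in path[:-1]:
        # Walk down to the parent of the target, creating levels if needed.
        node = node.setdefault(key, {})
    last = path[-1]
    # A missing ticker starts from an empty list, much like the empty table template.
    node[last] = func(node.get(last, []), value)
    return tree

# Hypothetical stand-in for the book: order type -> side -> ticker -> list of orders.
sides = {"ask": {}, "bid": {}}
book = {name: deepcopy(sides) for name in ("orderType1", "orderType2", "orderType3")}

test_order = {"orderID": 111111111, "sym": "GOOG", "side": "ask",
              "orderType": "orderType1", "price": 100.0, "quantity": 123}

# Rough analogue of .[`book;`orderType1`ask`GOOG;,;testorder]
amend_at(book, ["orderType1", "ask", "GOOG"], lambda rows, row: rows + [row], test_order)
```

The `deepcopy` matters for the same reason the flattening matters in q: each order type needs its own independent sub-structure, otherwise an update to one would show up in all three.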
Finally, I would recommend reading this FD whitepaper on how to best store book data: http://www.firstderivatives.com/downloads/q_for_Gods_Nov_2012.pdf
I'm trying to write a KQL query that will, among other things, display the contents of a serialized dictionary called Tags which has been added to the Application Insights traces table customDimensions column by application logging.
An example of the serialized Tags dictionary is:
{
"Source": "SAP",
"Destination": "TC",
"SAPDeliveryNo": "0012345678",
"PalletID": "(00)312340123456789012(02)21234987654(05)123456(06)1234567890"
}
I'd like to use evaluate bag_unpack(...) to evaluate the JSON and turn the keys into columns. We're likely to add more keys to the dictionary as the project develops and it would be handy not to have to explicitly list every column name in the query.
However, I'm already using project to reduce the number of other columns I display. How can I use both a project statement, to only display some of the other columns, and evaluate bag_unpack(...) to automatically unpack the Tags dictionary into columns?
Or is that not possible?
This is what I have so far, which doesn't work:
traces
| where datetime_part("dayOfYear", timestamp) == datetime_part("dayOfYear", now())
and message has "SendPalletData"
| extend TagsRaw = parse_json(customDimensions.["Tags"])
| evaluate bag_unpack(TagsRaw)
| project timestamp, message, ActionName = customDimensions.["ActionName"], TagsRaw
| order by timestamp desc
When it runs it displays only the columns listed in the project statement (including TagsRaw, so I know the Tags exist in customDimensions).
evaluate bag_unpack(TagsRaw) doesn't automatically add extra columns to the result set unpacked from the Tags in customDimensions.
EDIT: To clarify what I want to achieve, these are the columns I want to output:
timestamp
message
ActionName
TagsRaw
Source
Destination
SAPDeliveryNo
PalletID
EDIT 2: It turned out a major part of my problem was that double quotes within the Tags data are being escaped. While the Tags as viewed in the Azure portal looked like normal JSON, and copied out as normal JSON, when I copied out the whole of a customDimensions record the Tags looked like "Tags": "{\"Source\":\"SAP\",\"Destination\":\"TC\", ... with the double quotes escaped with backslashes.
The accepted answer from David Markovitz handles this situation in the line:
TagsRaw = todynamic(tostring(customDimensions["Tags"]))
A few comments:
When filtering on timestamp, it is better to use the timestamp column as is and do the manipulations on the other side of the comparison.
When using the has[...] operators, prefer the case-sensitive one (if feasible).
Everything extracted from a dynamic value is also dynamic, and when given a dynamic value, parse_json() (or its equivalent, todynamic()) simply returns it as is.
Therefore, we need to treat customDimensions.["Tags"] in 2 steps:
first, convert it to string; second, convert the result to dynamic.
To reference a field within a dynamic type you can use X.Y, X["Y"], or X['Y'].
No need to combine them as you did with customDimensions.["Tags"].
As the bag_unpack plugin doc states:
"The specified input column (Column) is removed."
In other words, TagsRaw does not exist following the bag_unpack operation.
Please note that you can add a prefix to the columns generated by bag_unpack. This might make it easier to differentiate them from the rest of the columns.
While you can use project, using project-away is sometimes easier.
// Data sample generation. Not part of the solution.
let traces =
print c1 = "some columns"
,c2 = "we"
,c3 = "don't need"
,timestamp = ago(now()%1d * rand())
,message = "abc SendPalletData xyz"
,customDimensions = dynamic
(
{
"Tags":"{\"Source\":\"SAP\",\"Destination\":\"TC\",\"SAPDeliveryNo\":\"0012345678\",\"PalletID\":\"(00)312340123456789012(02)21234987654(05)123456(06)1234567890\"}"
,"ActionName":"Action1"
}
)
;
// Solution starts here
traces
| where timestamp >= startofday(now())
and message has_cs "SendPalletData"
| extend TagsRaw = todynamic(tostring(customDimensions["Tags"]))
,ActionName = customDimensions["ActionName"]
| project-away c*
| evaluate bag_unpack(TagsRaw, "TR_")
| order by timestamp desc
timestamp | message | ActionName | TR_Destination | TR_PalletID | TR_SAPDeliveryNo | TR_Source
2022-08-27T04:15:07.9337681Z | abc SendPalletData xyz | Action1 | TC | (00)312340123456789012(02)21234987654(05)123456(06)1234567890 | 0012345678 | SAP
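The two-step conversion described above (string first, then dynamic) is the usual fix for double-encoded JSON. A minimal Python sketch of the same situation follows. It is an analogy only, not KQL: `unpack_tags` is a hypothetical helper, and the sample record is cut down to two tags.

```python
import json

def unpack_tags(record, column="Tags", prefix="TR_"):
    """Parse a JSON-encoded string field and spread its keys into prefixed columns."""
    dims = record["customDimensions"]
    # Second parse: the field value is itself a JSON document stored as a string.
    tags = json.loads(dims[column])
    # Keep the other columns, drop the raw container (as bag_unpack drops its input column).
    row = {k: v for k, v in record.items() if k != "customDimensions"}
    row["ActionName"] = dims.get("ActionName")
    row.update({prefix + key: value for key, value in tags.items()})
    return row

# The inner quotes are escaped, as in the copied customDimensions record.
raw = '{"message": "abc SendPalletData xyz", "customDimensions": {"Tags": "{\\"Source\\":\\"SAP\\",\\"Destination\\":\\"TC\\"}", "ActionName": "Action1"}}'

record = json.loads(raw)  # first parse: Tags is still a plain string at this point
row = unpack_tags(record)
```

After the first parse alone, `record["customDimensions"]["Tags"]` is a string, which is why a single parse_json() call had nothing to unpack.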
If I understand correctly, you want to use project to limit the number of columns that are displayed, but you also want to include all of the unpacked columns from TagsRaw, without naming all of the tags explicitly.
The easiest way to achieve this is to switch the order of your steps, so that you first do the project (including the TagsRaw column) and then you unpack the tags. If desired, you can then use project-away to specifically remove the TagsRaw column after you've unpacked it.
I'm fairly new to BigQuery (3rd day of using it with no training), I'm just trying to get my head around nested fields etc.
I've looked at the following resources and used the personsdata example on the google bigquery docs link
https://cloud.google.com/bigquery/docs/data
https://chartio.com/resources/tutorials/how-to-flatten-data-using-google-bigquerys-legacy-vs-standard-sql/
I'd like to run the below query:
select *
from [dataset.tableid]
where fullname = 'John Doe'
If I run this, I get the following error:
Error: Cannot output multiple independently repeated fields at the same time. Found children_age and citiesLived_place
From reading the above articles this isn't possible because you need to flatten the results, which from what I can understand just duplicates all the non-repeated variables, i.e.
Fullname | age | gender | Children.name | children.age
John Doe | 22 | Male | John | 5
John Doe | 22 | Male | Jane | 7
One of the above articles suggests that you can still use the where statements by using the flatten function in bigquery:
select fullname,
age,
gender,
citiesLived.place
FROM (FLATTEN([dataset.tableId], children))
WHERE
(citiesLived.yearLived > 1995) AND
(children.age > 3)
GROUP BY fullName, age, gender, citiesLived.place
If I change this to:
select *
FROM (FLATTEN([dataset.tableId], children))
WHERE fullname = 'John Doe'
Then this works fine and gives me what I need however if I change to this:
select *
FROM (FLATTEN([dataset.tableId], citieslived))
WHERE fullname = 'John Doe'
Then I get the following error:
Error: Cannot output multiple independently repeated fields at the same time. Found children_age and citiesLived_yearsLived
Can someone explain why this will work flattening based on "Children" but not "CitiesLived" and how to know what variables to use within flatten with more complex datasets with multiple nested variables?
Thank you in advance
Can someone explain why this will work flattening based on "Children" but not "CitiesLived"
Check schema of this table again
Schema
-----------------------------------
|- kind: STRING
|- fullName: STRING (required)
|- age: INTEGER
|- gender: STRING
+- phoneNumber: RECORD
| |- areaCode: INTEGER
| |- number: INTEGER
+- children: RECORD (repeated)
| |- name: STRING
| |- gender: STRING
| |- age: INTEGER
+- citiesLived: RECORD (repeated)
| |- place: STRING
| +- yearsLived: INTEGER (repeated)
As you can see, when you flatten the repeated record children, the only repeated record left for output is citiesLived. Even though it contains another repeated field, yearsLived, the two are not independent, so BigQuery Legacy SQL can output the result.
Now, when you flatten by citiesLived, the result contains two repeated fields: children and yearsLived. Those two are independent, so BigQuery Legacy SQL cannot output such a result.
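The difference between the two cases can be illustrated with a small Python model. This is a sketch of the idea only, not of how BigQuery evaluates FLATTEN: `flatten` and `repeated_branches` are hypothetical helpers, and the sample values are made up to match the schema above.

```python
def flatten(rows, field):
    """Expand one repeated (list-valued) field into one row per element."""
    out = []
    for row in rows:
        for item in row[field]:
            new_row = dict(row)
            new_row[field] = item
            out.append(new_row)
    return out

def repeated_branches(row, prefix=""):
    """Paths of repeated fields reachable without passing through another repeated field."""
    paths = []
    for name, value in row.items():
        path = prefix + name
        if isinstance(value, list):
            paths.append(path)
        elif isinstance(value, dict):
            paths.extend(repeated_branches(value, path + "."))
    return sorted(paths)

# Made-up sample record following the schema above.
person = {
    "fullName": "John Doe",
    "children": [{"name": "Jane", "age": 6}, {"name": "John", "age": 15}],
    "citiesLived": [
        {"place": "Seattle", "yearsLived": [1995]},
        {"place": "Stockholm", "yearsLived": [2005]},
    ],
}

by_children = flatten([person], "children")
by_cities = flatten([person], "citiesLived")

# One remaining repeated branch: yearsLived is nested inside it, not independent of it.
print(repeated_branches(by_children[0]))
# Two remaining repeated branches that have nothing to do with each other.
print(repeated_branches(by_cities[0]))
```

In this model, a row with a single remaining branch corresponds to the query that works, and a row with two unrelated branches corresponds to the one that raises the error.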
how to know what variables to use within flatten with more complex datasets with multiple nested variables?
To make it work, you should add yet another flattening on (for example) the yearsLived field. Something like below:
FROM (FLATTEN(FLATTEN([dataset.tableId], citieslived), yearsLived))
Adding all those multiple FLATTENs can become cumbersome so using BigQuery Standard SQL is really the way to go!
See Migrating from Legacy SQL to BigQuery Standard SQL
If you run this query:
SELECT
*
FROM
(FLATTEN((FLATTEN(([project_id:dataset_id.table]), citiesLived.yearsLived)), citiesLived))
It will flatten as expected.
When using the Legacy SQL, BQ tries to flatten automatically the results for you.
What I have noticed though is that if you try to flatten repeated fields that have other repeated fields inside then sometimes you might run into these errors (notice that the fields citiesLived and citiesLived.yearsLived are both repeated).
So one way to solve that is by forcing the flatten operation on all repeated fields you want to work with (in the example I showed you I first flattened the yearsLived and then citiesLived) and not relying on the automatic flattening operation that the Legacy SQL offers.
But what I strongly recommend and encourage you to do is to learn the Standard SQL version for BQ as Elliot suggested in his comment. It might have a steeper learning curve at first but it will totally pay off in the long run (and you won't have the risk of eventually having to migrate all your legacy queries to standard as we had to do in our company)
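To make "forcing the flatten on every repeated field" concrete, here is a rough Python model of the resulting row shape: one output row per (place, year) pair. It is a sketch under assumptions: `flatten_path` is a hypothetical helper, the sample values are made up, and the sketch expands the outer field before the inner one purely for simplicity, so it says nothing about the order BigQuery requires.

```python
def flatten_path(rows, path):
    """Expand the repeated field at a dotted path into one row per element."""
    head, _, rest = path.partition(".")
    out = []
    for row in rows:
        if rest:
            # row[head] is a single record here (already flattened), so descend into it.
            for inner in flatten_path([row[head]], rest):
                out.append({**row, head: inner})
        else:
            for item in row[head]:
                out.append({**row, head: item})
    return out

# Made-up sample record with a repeated field nested inside another repeated field.
person = {
    "fullName": "John Doe",
    "citiesLived": [
        {"place": "Seattle", "yearsLived": [1995, 1996]},
        {"place": "Stockholm", "yearsLived": [2005]},
    ],
}

step1 = flatten_path([person], "citiesLived")
step2 = flatten_path(step1, "citiesLived.yearsLived")
pairs = [(r["citiesLived"]["place"], r["citiesLived"]["yearsLived"]) for r in step2]
```

Every row in `step2` carries exactly one place and one year, so no repeated field is left inside citiesLived for an automatic flatten to trip over.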
I am currently working on a small Talend job, which imports CSV data, gets the address field and sends the address to Google Maps API for geocoding. Afterwards, I need to combine both the input and geocoding data.
My problem is that combining the initial data row with the geocoding result seems impossible: after passing through the tRestClient, all reference to the input data seems to be gone.
Here's my non-final data flow:
Subjob 1: CSVInput --> THashMapOutput
|
|
Subjob 2: THashInput --> tRestClient --> tExtractJSONFields --> tMap --> tBufferOutput
| (Lookup)
|
tHashInput
|
|
Subjob 3: tBufferInput --> tFileOutputDelimited
Here, the last tMap has no foreign key (i.e. no reference to the input row). Therefore the join creates the cross product of all combinations of input rows and geocoded rows.
Is there a way to combine both input and geocoding results? Can we configure tRestClient to forward inputs as well?
(a combination of two resulting csv files seems to fail for the same missing identifier)
Ok, answer was quite easy:
Assume you have the first link in subjob 2 called row2.
Then you can open the second tMap component.
Remove the lookup shown above.
Add the references to row2 within tMap, e.g. row2.URL, row2.Name
Et voila: Now you get each row combined of geocoded result and original data.
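Outside Talend, the same principle (keep the original row in scope while processing its own response, instead of joining two streams afterwards) can be sketched in Python. Everything here is an assumption for illustration: `geocode` is a hypothetical stub that makes no network call and returns placeholder coordinates, and the field names are invented.

```python
def geocode(address):
    """Hypothetical stand-in for the geocoding REST call; returns placeholder coordinates."""
    return {"lat": 0.0, "lng": 0.0, "query": address}

def enrich(rows):
    """Yield each input row merged with its own geocoding result."""
    for row in rows:
        result = geocode(row["address"])
        # The input row is still in scope here, so no lookup or join key is needed.
        yield {**row, "lat": result["lat"], "lng": result["lng"]}

rows = [
    {"name": "A", "url": "http://example.com/a", "address": "1 Main St"},
    {"name": "B", "url": "http://example.com/b", "address": "2 High St"},
]
enriched = list(enrich(rows))
```

Because each result is merged with the row that produced it, the output has exactly one row per input row rather than a cross product.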
I've stored numeric tabular data as relationship properties in a Neo4j database. I would like to recover the data in tabular form.
For instance, one node was stored as follows:
MATCH (g:GNE),(p:EXP)
WHERE g.etr='5313' AND p.NExp='Bos_RM'
CREATE UNIQUE (p)-[r:Was_norm
{Method:'NULL', time_t_35: '6.04',time_t9: '6.587',time_t14: '5.708',time_t31: '6.89',time_t224: '4.842'}
]->(g)
I tried a query like this:
MATCH (g:GNE)-[r1:Was_sel]-(e:EXP)-[r2:Was_norm]-(g)
WHERE e.NExp = 'Bos_SM'
RETURN g.etr,r2
but I'd like to recover the data in tabular form, and in the correct order.
Does anyone have any suggestions?
It may not be possible to do what you want with your current data model, given Cypher's current capabilities. Part of the problem is that there is no way to get a property value without hardcoding (in your query) the name of the property. Another part of the problem is that property keys are not necessarily returned in the original order (or in any predictable order).
Instead, you can get around these problems by changing the way you store your tabular data.
For example, suppose you stored a node this way (notice that the collections are stored in the desired order):
MATCH (g:GNE),(p:EXP)
WHERE g.etr='5313' AND p.NExp='Bos_RM'
CREATE UNIQUE
(p)-[r:Was_norm {
Method:'NULL',
times: [ 9, 14, 31, 224],
values:[6.587, 5.708, 6.89, 4.842]
}]->(g)
Given the above data model, you can easily get the tabular data back as 2 separate arrays:
MATCH (g:GNE)<-[r:Was_norm]-(p:EXP)
WHERE g.etr='5313' AND p.NExp='Bos_RM'
RETURN g.etr, r.times, r.values;
Or, if you wanted to get the data back in a single array:
MATCH (g:GNE)<-[r:Was_norm]-(p:EXP)
WHERE g.etr='5313' AND p.NExp='Bos_RM'
RETURN g.etr,
REDUCE(s =[], i IN RANGE(0,LENGTH(r.times)-1) | s + { time: r.times[i], value: r.values[i]}) AS table;
The result of the above query (see this console) would look like this:
+-------------------------------------------------------------------------------------------------------+
| g.etr | table |
+-------------------------------------------------------------------------------------------------------+
| "5313" | [{time=9, value=6.587},{time=14, value=5.708},{time=31, value=6.89},{time=224, value=4.842}] |
+-------------------------------------------------------------------------------------------------------+
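If the two arrays are returned separately and combined on the client side instead, the pairing done by REDUCE above amounts to a zip. A minimal Python sketch, where `to_table` is a hypothetical helper and the numbers are the ones from the example relationship:

```python
def to_table(times, values):
    """Pair two parallel, equally long arrays into a list of {time, value} rows."""
    if len(times) != len(values):
        raise ValueError("times and values must have the same length")
    return [{"time": t, "value": v} for t, v in zip(times, values)]

times = [9, 14, 31, 224]
values = [6.587, 5.708, 6.89, 4.842]
table = to_table(times, values)
```

The row order follows the order of the stored arrays, which is the reason for storing ordered collections rather than one property per time point.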
For example I have the following table named "example":
name | age | address
'abc' | 12 | {'street':'1', 'city':'kl', 'country':'malaysia'}
'cab' | 15 | {'street':'5', 'city':'jakarta', 'country':'indonesia'}
In Spark I can do this:
scala> val test = sc.cassandraTable ("test","example")
and this:
scala> test.first.getString
and this:
scala> test.first.getMap[String, String]("address")
which gives me all the fields of the address in the form of a map
Question 1: But how do I use the "get" to access "city" information?
Question 2: Is there a way to flatten the entire table?
Question 3: how do I go about counting number of rows where "city" = "kl"?
Thanks
Question 3 : How do we count the number of rows where city == something
I'll answer 3 first because this may provide you an easier way to work with the data. Something like
sc.cassandraTable[(String,Map[String,String],Int)]("test","example")
.filter( _._2.getOrElse("city","NoCity") == "kl" )
.count
First, I use the type parameter [(String,Map[String,String],Int)] on my cassandraTable call to transform the rows into tuples. This gives me easy access to the Map without any casting. (The order is just how it appeared when I made the table in my test environment; you may have to change the ordering.)
Second, I filter on _._2, which is shorthand for the second element of the incoming tuple. getOrElse returns the value for the key "city" if the key exists and "NoCity" otherwise. The final equality check tests which city it is.
Finally, I call count to find out the number of entries in the city.
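The same filter-then-count logic, written against a plain in-memory list in Python for illustration. This is not Spark code: the rows are the two sample rows from the question, typed as (name, address map, age) tuples, and `dict.get` with a default plays the role of getOrElse.

```python
# Sample rows from the question as (name, address, age) tuples.
rows = [
    ("abc", {"street": "1", "city": "kl", "country": "malaysia"}, 12),
    ("cab", {"street": "5", "city": "jakarta", "country": "indonesia"}, 15),
]

# A missing "city" key falls back to "NoCity", so such rows are simply not counted.
kl_count = sum(1 for _, address, _ in rows if address.get("city", "NoCity") == "kl")
```

The default value only has to be something that can never equal a real city name.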
1 How do we access the map?
So the answer to 1 is that once you have a Map, you can call get("key") or getOrElse("key", default) or any of the standard Scala operations to get a value out of the map.
2 How to flatten the entire table.
Depending on what you mean by "flatten", this can be a variety of things. For example, if you want to return the entire table as an array to the driver (not recommended, since your RDD is likely to be very big in production), you can call collect.
If you want to flatten the elements of your map into a tuple you can always do something like calling toSeq and you will end up with a list of (key,value) tuples. Feel free to ask another question if I haven't answered what you want with "flattening."
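One possible reading of "flattening" the map, shown in Python for illustration: each (key, value) entry of the address map becomes its own row alongside the row's name. This is a sketch under that assumption, not Spark code; `flatten_address` is a hypothetical helper and the row is taken from the question.

```python
def flatten_address(name, address):
    """Turn one row's address map into a list of (name, key, value) tuples."""
    return [(name, key, value) for key, value in address.items()]

row = ("abc", {"street": "1", "city": "kl", "country": "malaysia"}, 12)
flat = flatten_address(row[0], row[1])
```

Applied to every row, this gives a long, narrow table with one line per address field, which can be easier to filter or group than the nested form.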