I am currently working on a small Talend job which imports CSV data, takes the address field, and sends the address to the Google Maps API for geocoding. Afterwards, I need to combine the input data with the geocoding data.
My problem is that combining the initial data row with the geocoding result does not seem possible: after passing through the tRestClient, all references to the input data seem to be gone.
Here's my non-final data flow:

Subjob 1: CSVInput --> tHashOutput
          |
          |
Subjob 2: tHashInput --> tRestClient --> tExtractJSONFields --> tMap --> tBufferOutput
          |                                                      |
          |                                                      | (lookup)
          |                                                  tHashInput
          |
Subjob 3: tBufferInput --> tFileOutputDelimited
Here, the last tMap has no foreign key, i.e. no reference back to the input row. Therefore the join produces the cross product of all combinations of input and geocoded rows.
Is there a way to combine the input and the geocoding results? Can tRestClient be configured to forward its inputs as well?
(Combining the two resulting CSV files fails for the same reason: the shared identifier is missing.)
OK, the answer was quite easy:
Assume the first link in subjob 2 is called row2.
Then you can open the second tMap component.
Remove the lookup shown above.
Add references to row2 within the tMap, e.g. row2.URL, row2.Name.
Et voilà: each output row now combines the geocoded result with the original data.
I'm trying to write a KQL query that will, among other things, display the contents of a serialized dictionary called Tags, which application logging has added to the customDimensions column of the Application Insights traces table.
An example of the serialized Tags dictionary is:
{
    "Source": "SAP",
    "Destination": "TC",
    "SAPDeliveryNo": "0012345678",
    "PalletID": "(00)312340123456789012(02)21234987654(05)123456(06)1234567890"
}
I'd like to use evaluate bag_unpack(...) to evaluate the JSON and turn the keys into columns. We're likely to add more keys to the dictionary as the project develops and it would be handy not to have to explicitly list every column name in the query.
However, I'm already using project to reduce the number of other columns I display. How can I use both a project statement, to only display some of the other columns, and evaluate bag_unpack(...) to automatically unpack the Tags dictionary into columns?
Or is that not possible?
This is what I have so far, which doesn't work:
traces
| where datetime_part("dayOfYear", timestamp) == datetime_part("dayOfYear", now())
and message has "SendPalletData"
| extend TagsRaw = parse_json(customDimensions.["Tags"])
| evaluate bag_unpack(TagsRaw)
| project timestamp, message, ActionName = customDimensions.["ActionName"], TagsRaw
| order by timestamp desc
When it runs, it displays only the columns listed in the project statement (including TagsRaw, so I know the Tags exist in customDimensions).
evaluate bag_unpack(TagsRaw) doesn't add any extra columns, unpacked from the Tags in customDimensions, to the result set.
EDIT: To clarify what I want to achieve, these are the columns I want to output:
timestamp
message
ActionName
TagsRaw
Source
Destination
SAPDeliveryNo
PalletID
EDIT 2: It turned out that a major part of my problem was that the double quotes within the Tags data are escaped. While the Tags as viewed in the Azure portal looked like normal JSON, and copied out as normal JSON, when I copied out the whole of a customDimensions record, the Tags looked like "Tags": "{\"Source\":\"SAP\",\"Destination\":\"TC\", ... with the double quotes escaped with backslashes.
The accepted answer from David Markovitz handles this situation in the line:
TagsRaw = todynamic(tostring(customDimensions["Tags"]))
A few comments:
When filtering on timestamp, it is better to use the timestamp column as is, and do the manipulations on the other side of the equation.
When using the has* operators, prefer the case-sensitive ones (if feasible).
Everything extracted from a dynamic value is also dynamic, and when given a dynamic value, parse_json() (or its equivalent, todynamic()) simply returns it, as is.
Therefore, we need to treat customDimensions.["Tags"] in two steps: first convert it to a string, then convert the result to dynamic.
To reference a field within a dynamic type you can use X.Y, X["Y"], or X['Y'].
There is no need to combine them as you did with customDimensions.["Tags"].
As the bag_unpack plugin doc states:
"The specified input column (Column) is removed."
In other words, TagsRaw does not exist following the bag_unpack operation.
Please note that you can add a prefix to the columns generated by bag_unpack. This might make it easier to differentiate them from the rest of the columns.
While you can use project, using project-away is sometimes easier.
// Data sample generation. Not part of the solution.
let traces =
    print c1 = "some columns"
        ,c2 = "we"
        ,c3 = "don't need"
        ,timestamp = ago(now()%1d * rand())
        ,message = "abc SendPalletData xyz"
        ,customDimensions = dynamic
        (
            {
                "Tags":"{\"Source\":\"SAP\",\"Destination\":\"TC\",\"SAPDeliveryNo\":\"0012345678\",\"PalletID\":\"(00)312340123456789012(02)21234987654(05)123456(06)1234567890\"}"
                ,"ActionName":"Action1"
            }
        )
;
// Solution starts here
traces
| where timestamp >= startofday(now())
    and message has_cs "SendPalletData"
| extend TagsRaw = todynamic(tostring(customDimensions["Tags"]))
    ,ActionName = customDimensions["ActionName"]
| project-away c*
| evaluate bag_unpack(TagsRaw, "TR_")
| order by timestamp desc
Output (one result row, transposed here for readability):

timestamp         2022-08-27T04:15:07.9337681Z
message           abc SendPalletData xyz
ActionName        Action1
TR_Destination    TC
TR_PalletID       (00)312340123456789012(02)21234987654(05)123456(06)1234567890
TR_SAPDeliveryNo  0012345678
TR_Source         SAP
If I understand correctly, you want to use project to limit the number of columns that are displayed, but you also want to include all of the unpacked columns from TagsRaw, without naming all of the tags explicitly.
The easiest way to achieve this is to switch the order of your steps, so that you first do the project (including the TagsRaw column) and then you unpack the tags. If desired, you can then use project-away to specifically remove the TagsRaw column after you've unpacked it.
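A sketch of that reordering, reusing the query from the question (untested; it also borrows the todynamic(tostring(...)) fix from the other answer):

traces
| where timestamp >= startofday(now())
    and message has_cs "SendPalletData"
| extend TagsRaw = todynamic(tostring(customDimensions["Tags"]))
| project timestamp, message, ActionName = customDimensions["ActionName"], TagsRaw
| evaluate bag_unpack(TagsRaw)
| order by timestamp desc

Note that, per the doc quote above, bag_unpack removes its input column, so TagsRaw itself will not appear in the final output.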
Is it possible in Telegraf, using a processor, to drop a tag from a measurement?
I am using the cisco_telemetry plugin, which takes in series, and within one of the measurements (not the whole plugin) I want to keep only one tag.
I tried using the tag_limit processor, but it didn't work. The measurement "Cisco-IOS-XR-procfind-oper:proc-distribution/nodes/node/process/pid/filter-type" has two tags, "pid" and "proc_name", each containing around 10000 values. I only want to keep "proc_name" and drop "pid" from this measurement. Should the tag_limit processor work for this? Telegraf version 1.23.
[[processors.tag_limit]]
namepass = ["Cisco-IOS-XR-procfind-oper:proc-distribution/nodes/node/process/pid/filter-type"]
## Maximum number of tags to preserve
limit = 1
## List of tags to preferentially preserve
keep = ["proc_name"]
"within one of the measurements"
I would probably use a Starlark processor, then. Use namepass as you have done, and then remove the specific tag:
[[processors.starlark]]
namepass = ["Cisco-IOS-XR-procfind-oper:proc-distribution/nodes/node/process/pid/filter-type"]
source = '''
def apply(metric):
    # Drop the unwanted "pid" tag; all other tags and fields pass through unchanged
    metric.tags.pop("pid")
    return metric
'''
For users looking to do this for an entire measurement: you can drop tags from a measurement with metric modifiers. Specifically, you are looking for tagexclude, which removes tags matching the given patterns from a measurement. This way, you do not even need to use a processor and can add it directly to the end of your input:
[[inputs.cisco_telemetry]]
<connection details>
tagexclude = ["pid"]
Goal: merge two Excel files that have significant overlap, but overwrite ONLY the phone numbers and record IDs of one data set.
What I have been doing: brute-force de-duplication in Excel, where I copy over the sheet with phone numbers, sort the ID column, identify/highlight duplicates, and drag "up" the phone numbers to fill in the empty space for the matching ID. The process isn't hard, but with more records it gets absurdly tedious. In plain text, the merged but not yet de-duplicated data look like this:
555555 | Joe | Copy | DOB | AGE | 555 Data Road | DataVille | LA | ZIP | County | (**PHONE GOES HERE**) | Male | White | Doc Name | More info
555555 | Joe | Copy | DOB | AGE | 555 Data Road | DataVille | LA | ZIP | County | 555555555 (Phone)
The phone number should be added in the space between County and Gender for every pair of records where the two IDs (the first number in each record) match.
Attempts in R:
df_final <- merge(df_noPhone, df_Phone, by = c("Record_ID"), all.x = T)
But this just duplicates the overlapping columns ("PatientAddress.x" etc.), and I need those synced up for the records to be complete.
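I suspect something like pulling only the ID and phone columns out of the phone file before merging is closer to what I need, so the other overlapping columns are never duplicated. A sketch of that idea ("Phone" stands in for whatever the real phone column is called):

df_final <- merge(df_noPhone,
                  df_Phone[, c("Record_ID", "Phone")],  # keep only the join key and the phone column
                  by = "Record_ID",
                  all.x = TRUE)  # records with no phone match are retained, with Phone = NA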
The really tricky part, though, is that the data aren't consistent this way throughout. Sometimes we simply don't have phone numbers for certain records, and we still want to retain those records in the data.
Suggestions? I've tried merging with almost every package I can imagine, but sometimes it ends up creating more work in the direct, raw data file afterwards than it's worth.
Thanks!
You'd mentioned:
.. identify/highlight duplicates, and drag "up" the phone numbers to
fill in the empty space for the matching ID.
I suggest replacing the "drag up" with a formula, then swapping the column in.
Assuming your data is filled in A2:S3, put:
=IF(M2="",1,0) in U2
=IF(U2=1,INDEX(M:M,MATCH(1,INDEX((0=U:U)*(A2=A:A),0,1),0)),"No data") in V2
and drag both downwards.
Ref link: https://exceljet.net/formula/index-and-match-with-multiple-criteria
You'll notice that I use "No data" to 'fill' the rows that already have numbers. You can use Data > Filter to remove/unselect those lines manually. (That's what I'd do, but it's up to you.)
Hope it helps.
I am trying to create an order book data structure in which a top-level dictionary holds 3 basic order types, each of those types has a bid and an ask side, and each side holds a list of tables, one per ticker. For example, if I want to retrieve all the ask orders of type1 for Google stock, I'd call book[`orderType1][`ask][`GOOG]. I implemented that using the following:
bookTemplate: ([]orderID:`int$();date:"d"$();time:`time$();sym:`$();side:`$();
orderType:`$();price:`float$();quantity:`int$());
bookDict:(1#`)!enlist`orderID xkey bookTemplate;
book: `orderType1`orderType2`orderType3 ! (3# enlist(`ask`bid!(2# enlist bookDict)));
Data retrieval using book[`orderType1][`ask][`ticker] seems to work fine. The problem appears when I try to add a new order to a specific order book, e.g.:
testorder:`orderID`date`time`sym`side`orderType`price`quantity!(111111111;.z.D;.z.T;
`GOOG;`ask;`orderType1;100.0f;123);
book[`orderType1][`ask][`GOOG],:testorder;
Executing the last statement gives an 'assign error. What's the reason, and how can I solve it?
A couple of issues here. The first is that while you can look up into dictionaries using a series of in-line repeated keys, i.e.
q)book[`orderType1][`ask][`GOOG]
orderID| date time sym side orderType price quantity
-------| -------------------------------------------
you can't assign values like this (you can only assign one level deep). The better approach is to use dot-indexing (and dot-amend to reassign values). However, the problem is that the value of your book dictionary is getting flattened to a table because the list of dictionaries is uniform. So this fails:
q)book . `orderType1`ask`GOOG
'rank
You can see how it got flattened by inspecting it in the terminal:
q)book
| ask
----------| -----------------------------------------------------------------
orderType1| (,`)!,(+(,`orderID)!,`int$())!+`date`time`sym`side`orderType`pric
orderType2| (,`)!,(+(,`orderID)!,`int$())!+`date`time`sym`side`orderType`pric
orderType3| (,`)!,(+(,`orderID)!,`int$())!+`date`time`sym`side`orderType`pric
To prevent this flattening, you can force the value to be a mixed list by adding a generic null:
q)book: ``orderType1`orderType2`orderType3 !(::),(3# enlist(`ask`bid!(2# enlist bookDict)));
Then it looks like this:
q)book
| ::
orderType1| `ask`bid!+(,`)!,((+(,`orderID)!,`int$())!+`date`time`sym`side`ord
orderType2| `ask`bid!+(,`)!,((+(,`orderID)!,`int$())!+`date`time`sym`side`ord
orderType3| `ask`bid!+(,`)!,((+(,`orderID)!,`int$())!+`date`time`sym`side`ord
Dot-indexing now works:
q)book . `orderType1`ask`GOOG
orderID| date time sym side orderType price quantity
-------| -------------------------------------------
which means that dot-amend will now work too
q).[`book;`orderType1`ask`GOOG;,;testorder]
`book
q)book
| ::
orderType1| `ask`bid!+``GOOG!(((+(,`orderID)!,`int$())!+`date`time`sym`side`o
orderType2| `ask`bid!+(,`)!,((+(,`orderID)!,`int$())!+`date`time`sym`side`ord
orderType3| `ask`bid!+(,`)!,((+(,`orderID)!,`int$())!+`date`time`sym`side`ord
Finally, I would recommend reading this FD whitepaper on how to best store book data: http://www.firstderivatives.com/downloads/q_for_Gods_Nov_2012.pdf
I have data in the following form:
<Status>Active Leave Terminated</Status>
<date>05/06/2014 09/10/2014 01/10/2015</date>
I want to get the data in the following form:
<status>Active</status>
<date>05/06/2014</date>
<status>Leave</status>
<date>09/10/2014</date>
<status>Terminated</status>
<date>01/10/2015</date>
Please help me with the query to retrieve the data as specified above.
Well, you have a string and want to split it at the whitespace. That's what tokenize() is for, and \s matches a whitespace character. To get the corresponding date, you can get the current position in the for loop using at. Together it looks something like this (note that I assume the input data is the current context item):
let $dates := tokenize(date, "\s+")
for $status at $pos in tokenize(Status, "\s+")
return (
<status>{$status}</status>,
<date>{$dates[$pos]}</date>
)
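Wrapped into a self-contained query against the sample data (binding the parent element to a hypothetical $rec variable instead of relying on the context item), that would look something like:

let $rec :=
    <record>
        <Status>Active Leave Terminated</Status>
        <date>05/06/2014 09/10/2014 01/10/2015</date>
    </record>
let $dates := tokenize($rec/date, "\s+")
for $status at $pos in tokenize($rec/Status, "\s+")
return (
    <status>{$status}</status>,
    <date>{$dates[$pos]}</date>
)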
You did not indicate whether your data is on the file system or already loaded into MarkLogic. It's also not clear whether this is something you need to do once on a small set of data, or on an ongoing basis with a lot of data.
If it's on the file system, you can transform it as it is being loaded. For instance, MarkLogic Content Pump can apply a transformation during load.
If you have already loaded the content and you want to transform it in place, you can use Corb2.
If you have a small amount of data, then you can just loop across it using Query Console.
Regardless of how you apply the transformation code, dirkk's answer shows how you need to change it. If you are updating content already in your database, you'll xdmp:node-delete() the original Status and date elements and xdmp:node-insert-child() the new ones.
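For the in-place case, a rough, untested sketch of that update (the document URI and the <record> wrapper element are made up; adjust the paths to your real structure, and note that combining several node updates on the same document in one statement can require care around conflicting updates):

xquery version "1.0-ml";
let $rec := fn:doc("/records/example.xml")/record
let $dates := fn:tokenize($rec/date, "\s+")
let $new :=
    for $status at $pos in fn:tokenize($rec/Status, "\s+")
    return (
        <status>{$status}</status>,
        <date>{$dates[$pos]}</date>
    )
return (
    (: append the new status/date pairs, then remove the originals :)
    for $n in $new return xdmp:node-insert-child($rec, $n),
    xdmp:node-delete($rec/Status),
    xdmp:node-delete($rec/date)
)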