Add column of totals per field value - azure-data-explorer

I start with a list of failures that take place in locations
failureName, failureLocation
failure a, location 1
failure b, location 1
failure a, location 2
failure a, location 1
<etc>
I can transform that into this table by using summarize count() by failureName, failureLocation:
failureName, failureLocation, count
failure a, location 1, 100
failure a, location 2, 50
failure b, location 1, 10
<etc>
I'd like to transform the counts into percentages on a per-failure basis, so I need to add a sum per failure name. My goal is to end up with this table:
failureName, failureLocation, count, sumPerFailureName
failure a, location 1, 100, 150
failure a, location 2, 50, 150
failure b, location 1, 10, 10
<etc>
Suggestions?

Try this to take you from your 2nd table to the 3rd (it also adds a calculated percentage column):
let T =
datatable(failureName:string, failureLocation:string, ['count']:long)
[
'failure a', 'location 1', 100,
'failure a', 'location 2', 50,
'failure b', 'location 1', 10,
]
;
T
| summarize sumPerFailureName = sum(['count']) by failureName
| join
(
T
) on failureName
| project failureName, failureLocation, ['count'], sumPerFailureName, percentage = round(100.0 * ['count'] / sumPerFailureName, 2)
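With the sample rows above, the output should look roughly like this (row order may differ):
failureName, failureLocation, count, sumPerFailureName, percentage
failure a, location 1, 100, 150, 66.67
failure a, location 2, 50, 150, 33.33
failure b, location 1, 10, 10, 100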

Related

Limiting Azure Data Explorer update policy input

We have a use case where we are saving telemetry and statistics data from the machines, but the update policy, which is supposed to process the raw data, is giving us trouble and running out of memory.
Aggregation over string column exceeded the memory budget of 8GB during evaluation
We have two tables: the 'ingest-table', where the data is initially ingested, and the 'main-table', where it should end up.
We are in the process of migrating from another solution to ADX and have to ingest a high volume of data.
The raw data is in a matrix format, which means that one message from a machine ends up as multiple rows/records in the ADX database. We use mv-expand for the breakdown, and the query is pretty much just doing that, along with some data formatting.
So, our update policy looks like the following:
['ingest-table']
| mv-expand Counter = Data.Counters
| mv-expand with_itemindex = r Row = Rows
| mv-expand Column = Rows[r].Data
| project ...
I don't see any way to improve the processing query itself, so I'm looking for a way to somehow limit the number of records the update policy function receives.
I've tried playing around with the ingestion batching policy (MaximumNumberOfItems = 1000) and also the sharding policy (MaxRowCount = 1000) for the 'ingest-table', but neither has any effect on the number of records the update policy pulls in at once.
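For reference, a rough sketch of the batching-policy change this refers to (the 1000-item value is the one we tried; the other two properties are just illustrative defaults):
.show table ['ingest-table'] policy ingestionbatching
.alter table ['ingest-table'] policy ingestionbatching '{"MaximumBatchingTimeSpan": "00:05:00", "MaximumNumberOfItems": 1000, "MaximumRawDataSizeMB": 1024}'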
My idea is to let only 1000 items at a time be processed by the update policy function, because I've tested manually and it works fine up to about 5k records but fails shortly above that.
Any suggestions on what we could do in this case and how I can achieve that?
EDIT:
Here is an example raw message which has to be processed by the update policy.
The number of rows the policy has to generate is COUNTERS * ROWS * COLUMNS. In this case (2 counters × 50 rows × 12 columns) that means we end up with ~1200 rows after this single message is processed.
I do not see any other way than doing an mv-expand here.
{
"Name": "StatisicName",
"TimeInterval": {
"StartUtc": 1654221156.285,
"EndUtc": 1654221216.286
},
"Legend": {
"RowLabels": [
"0",
"0.04",
"0.08",
"0.12",
"0.16",
"0.2",
"0.24",
"0.28",
"0.32",
"0.36",
"0.4",
"0.44",
"0.48",
"0.52",
"0.56",
"0.6",
"0.64",
"0.68",
"0.72",
"0.76",
"0.8",
"0.84",
"0.88",
"0.92",
"0.96",
"1",
"1.04",
"1.08",
"1.12",
"1.16",
"1.2",
"1.24",
"1.28",
"1.32",
"1.36",
"1.4",
"1.44",
"1.48",
"1.52",
"1.56",
"1.6",
"1.64",
"1.68",
"1.72",
"1.76",
"1.8",
"1.84",
"1.88",
"1.92",
"1.96"
],
"ColumnLabels": [
"Material1",
"Material2",
"Material3",
"Material4",
"Material5",
"Material6",
"Material7",
"Material8",
"Material9",
"Material10",
"Material11",
"Material12"
]
},
"Counters": [
{
"Type": "Cumulative",
"Matrix": {
"Rows": [
{
"Data": [
6.69771873292923,
0,
0,
0,
0.01994649920463562,
0.017650499296188355,
0.007246749711036683,
0.003443999862670899,
0.1422802443265915,
0,
0,
0.0008609999656677247
]
}
//,{...} ... for each row of the matrix
]
}
},
{
"Type": "Count",
"Matrix": {
"Rows": [
{
"Data": [
0.0001434999942779541,
0,
0,
0,
0.0001434999942779541,
0.0001434999942779541,
0.0001317590856552124,
0.0001434999942779541,
0.00014285165093031273,
0,
0,
0.0001434999942779541
]
}
//,{...} ... for each row of the matrix
]
}
}
]
}
The main issue I see in your code is this:
| mv-expand with_itemindex = r Row = Rows
| mv-expand Column = Rows[r].Data
You explode Rows and get the exploded values in a new column called Row, but then, instead of working with Row.Data, you keep using the original unexploded Rows, traversing its elements using r.
This leads to unnecessary duplication of Rows, and it is probably what creates the memory pressure.
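For illustration only, a minimal sketch of that change, keeping the question's column names (with Rows derived however it is in the original policy); the complete rewrite follows below:
['ingest-table']
| mv-expand Counter = Data.Counters
| mv-expand Row = Rows          // keep working with the exploded element; with_itemindex/r is no longer needed
| mv-expand Column = Row.Data   // read Data from the exploded Row instead of indexing back into Rows[r]
| project ...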
Check out the following code.
You can use the whole query and get the data formatted as a table with columns Material1, Material2, etc., or exclude the last 2 lines and simply get the exploded values, each in a separate row.
// Data sample generation. Not part of the solution
let p_matrixes = 3;
let p_columns = 12;
let p_rows = 50;
let ['ingest-table'] =
range i from 1 to p_matrixes step 1
| extend StartUtc = floor((ago(28d + rand()*7d) - datetime(1970))/1ms/1000,0.001)
| extend EndUtc = floor((ago(rand()*7d) - datetime(1970))/1ms/1000,0.001)
| extend RowLabels = toscalar(range x from todecimal(0) to todecimal(0.04 * (p_rows - 1)) step todecimal(0.04) | summarize make_list(tostring(x)))
| extend ColumnLabels = toscalar(range x from 1 to p_columns step 1 | summarize make_list(strcat("Material",tostring(x))))
| extend Counters_Cumulative = toscalar(range x from 1 to p_rows step 1 | mv-apply range(1, p_columns) on (summarize Data = pack_dictionary("Data", make_list(rand()))) | summarize make_list(Data))
| extend Counters_Count = toscalar(range x from 1 to p_rows step 1 | mv-apply range(1, p_columns) on (summarize Data = pack_dictionary("Data", make_list(rand()))) | summarize make_list(Data))
| project i, Data = pack_dictionary("Name", "StatisicName", "TimeInterval", pack_dictionary("StartUtc", StartUtc, "EndUtc",EndUtc), "Legend", pack_dictionary("RowLabels", RowLabels, "ColumnLabels", ColumnLabels), "Counters", pack_array(pack_dictionary("Type", "Cumulative", "Matrix", pack_dictionary("Rows", Counters_Cumulative)), pack_dictionary("Type", "Count", "Matrix", pack_dictionary("Rows", Counters_Count))))
;
// Solution starts here
// Explode values
['ingest-table']
| project Name = tostring(Data.Name), StartUtc = todecimal(Data.TimeInterval.StartUtc), EndUtc = todecimal(Data.TimeInterval.EndUtc), RowLabels = Data.Legend.RowLabels, ColumnLabels = Data.Legend.ColumnLabels, Counters = Data.Counters
| mv-apply Counters on (project Type = tostring(Counters.Type), Rows = Counters.Matrix.Rows)
| mv-apply RowLabels to typeof(decimal), Rows on (project RowLabels, Data = Rows.Data)
| mv-expand ColumnLabels to typeof(string), Data to typeof(real)
// Format as table
| evaluate pivot(ColumnLabels, take_any(Data))
| project-reorder Name, StartUtc, EndUtc, RowLabels, Type, * granny-asc
"Explode values" sample
Name | StartUtc | EndUtc | ColumnLabels | RowLabels | Type | Data
StatisicName | 1658601891.654 | 1660953273.898 | Material4 | 0.88 | Count | 0.33479977032253788
StatisicName | 1658601891.654 | 1660953273.898 | Material7 | 0.6 | Cumulative | 0.58620965468565811
StatisicName | 1658801257.201 | 1660941025.56 | Material1 | 0.72 | Count | 0.23164306814350025
StatisicName | 1658601891.654 | 1660953273.898 | Material4 | 1.68 | Cumulative | 0.47149864409592157
StatisicName | 1658601891.654 | 1660953273.898 | Material12 | 1.08 | Cumulative | 0.777589612330022
"Format as table" Sample
Name | StartUtc | EndUtc | RowLabels | Type | Material1 | Material2 | Material3 | Material4 | Material5 | Material6 | Material7 | Material8 | Material9 | Material10 | Material11 | Material12
StatisicName | 1658581605.446 | 1660891617.665 | 0.52 | Cumulative | 0.80568785763966921 | 0.69112398516227513 | 0.45844947991605256 | 0.87975011678339887 | 0.19607303271777138 | 0.76728212781319993 | 0.27520162657976527 | 0.48612400400362971 | 0.23810927904958085 | 0.53986865017468966 | 0.31225384042818344 | 0.99380179164514848
StatisicName | 1658581605.446 | 1660891617.665 | 0.72 | Count | 0.77601864161716061 | 0.351768361021601 | 0.59345888695494731 | 0.92329751241805491 | 0.80811999338933449 | 0.49117503870065837 | 0.97871902062153937 | 0.94241064167069055 | 0.52950523227349289 | 0.39281849330041424 | 0.080759530370922858 | 0.8995622227351241
StatisicName | 1658345203.482 | 1660893443.968 | 1.92 | Count | 0.78327575542772387 | 0.16795871437570925 | 0.01201541525964204 | 0.96029371013283549 | 0.60248327254185241 | 0.019315208353334352 | 0.4828009899119266 | 0.75923221663483853 | 0.29630236707606555 | 0.23977292819044668 | 0.94531978804572625 | 0.54626985282267437
StatisicName | 1658345203.482 | 1660893443.968 | 1 | Count | 0.65268575186841382 | 0.61471913013853441 | 0.80536656853846211 | 0.380104887115314 | 0.84979344481966745 | 0.68790819414895632 | 0.80862491082567767 | 0.083687871352600765 | 0.16707928827946666 | 0.4071460045501768 | 0.94115460659910444 | 0.25011225557898314
StatisicName | 1658581605.446 | 1660891617.665 | 1.6 | Count | 0.75532393959433786 | 0.71081551001527776 | 0.9757484452705758 | 0.55510969429009 | 0.055800808878012885 | 0.74924458240427783 | 0.78706505608871058 | 0.18745675452118818 | 0.70192553697345517 | 0.39429935579653647 | 0.4048784200404818 | 0.14888395753558561
Fiddle

JQ: How to split array by values and find out length of each piece?

I need to find out the lengths of user sessions given the timestamps of individual visits.
A new session starts every time the delay between adjacent timestamps is longer than the limit.
For example, for this set of timestamps (think of them as seconds since the epoch):
[
101,
102,
105,
116,
128,
129,
140,
145,
146,
152
]
...and for a value of limit=10, I need the following output:
[
3,
1,
2,
4
]
Assuming the values will be in ascending order, loop through the values accumulating the groups based on your condition. reduce works well in this case.
10 as $limit # remove this so you can feed in your value as an argument
| reduce .[] as $i (
{prev:.[0], group:[], result:[]};
if ($i - .prev > $limit)
then {prev:$i, group:[$i], result:(.result + [.group])}
else {prev:$i, group:(.group + [$i]), result}
end
)
| [(.result[], .group) | length]
If the difference from the previous value exceeds the limit, take the current group of values and move it to the result. Otherwise, the current value belongs to the current group so add it. At the end, you could count the sizes of the groups to get your result.
Here's a slightly modified version that just counts the values up.
10 as $limit
| reduce .[] as $i (
{prev:.[0], count:0, result:[]};
if ($i - .prev > $limit)
then {prev:$i, count:1, result:(.result + [.count])}
else {prev:$i, count:(.count + 1), result}
end
)
| [.result[], .count]
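In case it helps, here is one way (a sketch, with a hypothetical timestamps.json input file) to feed the limit in from the command line instead of hard-coding it, using --argjson so $limit is a number:
jq --argjson limit 10 '
  reduce .[] as $i (
    {prev: .[0], count: 0, result: []};
    if ($i - .prev > $limit)
    then {prev: $i, count: 1, result: (.result + [.count])}
    else {prev: $i, count: (.count + 1), result}
    end
  )
  | [.result[], .count]
' timestamps.json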
Here's another approach using indices to calculate the breakpoint positions:
Producing the lengths of the segments:
10 as $limit
| [
[0, indices(while(. != []; .[1:]) | select(.[0] + $limit <= .[1]))[] + 1, length]
| .[range(length-1):] | .[1] - .[0]
]
[
3,
1,
2,
4
]
Demo
Producing the segments themselves:
10 as $limit
| [
(
[indices(while(. != []; .[1:]) | select(.[0] + $limit <= .[1]))[] + 1]
| [null, .[0]], .[range(length):]
)
as [$a,$b] | .[$a:$b]
]
[
[
101,
102,
105
],
[
116
],
[
128,
129
],
[
140,
145,
146,
152
]
]
Demo

Cassandra collection tombstones

I have created a table with a collection, inserted a record, and taken an sstabledump of it, and I see there is a range tombstone for it in the sstable. Does this tombstone ever get removed? Also, when I run sstablemetadata on the only sstable, it shows "Estimated droppable tombstones" as 0.5, and similarly it shows one record, with an epoch time equal to the insert time, under "Estimated tombstone drop times: 1548384720: 1". Does this mean that when I run sstablemetadata on a table having collections, the estimated droppable tombstone ratio and drop time values are not true and dependable values, due to the collection/list range tombstones?
CREATE TABLE ks.nmtest (
reservation_id text,
order_id text,
c1 int,
order_details map<text, text>,
PRIMARY KEY (reservation_id, order_id)
) WITH CLUSTERING ORDER BY (order_id ASC)
user@cqlsh:ks> insert into nmtest (reservation_id , order_id , c1, order_details ) values('3','3',3,{'key':'value'});
user@cqlsh:ks> select * from nmtest ;
reservation_id | order_id | c1 | order_details
----------------+----------+----+------------------
3 | 3 | 3 | {'key': 'value'}
(1 rows)
[root@localhost nmtest-e1302500201d11e983bb693c02c04c62]# sstabledump mc-5-big-Data.db
WARN 02:52:19,596 memtable_cleanup_threshold has been deprecated and should be removed from cassandra.yaml
[
{
"partition" : {
"key" : [ "3" ],
"position" : 0
},
"rows" : [
{
"type" : "row",
"position" : 41,
"clustering" : [ "3" ],
"liveness_info" : { "tstamp" : "2019-01-25T02:51:13.574409Z" },
"cells" : [
{ "name" : "c1", "value" : 3 },
{ "name" : "order_details", "deletion_info" : { "marked_deleted" : "2019-01-25T02:51:13.574408Z", "local_delete_time" : "2019-01-25T02:51:13Z" } },
{ "name" : "order_details", "path" : [ "key" ], "value" : "value" }
]
}
]
}
]
SSTable: /data/data/ks/nmtest-e1302500201d11e983bb693c02c04c62/mc-5-big
Partitioner: org.apache.cassandra.dht.Murmur3Partitioner
Bloom Filter FP chance: 0.010000
Minimum timestamp: 1548384673574408
Maximum timestamp: 1548384673574409
SSTable min local deletion time: 1548384673
SSTable max local deletion time: 2147483647
Compressor: org.apache.cassandra.io.compress.LZ4Compressor
Compression ratio: 1.0714285714285714
TTL min: 0
TTL max: 0
First token: -155496620801056360 (key=3)
Last token: -155496620801056360 (key=3)
minClustringValues: [3]
maxClustringValues: [3]
Estimated droppable tombstones: 0.5
SSTable Level: 0
Repaired at: 0
Replay positions covered: {CommitLogPosition(segmentId=1548382769966, position=6243201)=CommitLogPosition(segmentId=1548382769966, position=6433666)}
totalColumnsSet: 2
totalRows: 1
Estimated tombstone drop times:
1548384720: 1
Another question was on the nodetool tablestats output - what does "slice" refer to in Cassandra?
Average live cells per slice (last five minutes): 1.0
Maximum live cells per slice (last five minutes): 1
Average tombstones per slice (last five minutes): 1.0
Maximum tombstones per slice (last five minutes): 1
Dropped Mutations: 0
sstablemetadata does not have any information about your table that is not held within the sstable itself, as it is not guaranteed to be run on a system that has Cassandra running, and even if it were, it would be very complex to pull the schema information out of it.
Since gc_grace_seconds is a table parameter and not in the metadata, it defaults to assuming a gc grace of 0, so by default the droppable times listed in that histogram are really a histogram of the tombstone creation times. If you know your gc grace, you can add it as a -g parameter to your sstablemetadata call, like:
sstablemetadata -g 864000 mc-5-big-Data.db
See http://cassandra.apache.org/doc/latest/tools/sstable/sstablemetadata.html for information on the tool's output.
With collections it's just a normal range tombstone, with all that it entails. They are used to avoid the need for a read-before-write when overwriting the value of a multi-cell collection.
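For illustration (a sketch against the nmtest table above, not part of the original post): overwriting the whole non-frozen map, as the INSERT in the question does, first deletes any previous contents with a range tombstone, while appending to the map avoids it:
-- full overwrite: Cassandra writes a range tombstone for order_details before the new entries
INSERT INTO ks.nmtest (reservation_id, order_id, c1, order_details)
VALUES ('3', '3', 3, {'key': 'value'});
-- append-style update: no range tombstone is written for the collection
UPDATE ks.nmtest SET order_details = order_details + {'key2': 'value2'}
WHERE reservation_id = '3' AND order_id = '3';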

Update element on map depending on other update

I want to update a map by doing the following:
def updateInfo() do
person = %{ person | name : "new name", age : person.age +1, observation : changeObs(person.age)}
end
def changeObs(age), when age >= 18, do: "Adult"
def changeObs(age), do: "kid"
If I call updateInfo() and the age is 17, I would expect the person's observation to change to "Adult". But it is not working. I thought the updates of the map were done sequentially, but apparently not, so I cannot rely on the age already being 18. I can do this if I split the update like so:
person = %{ person | name : "new name", age : person.age +1}
person = %{ person | observation : changeObs(person.age)}
Is there a way to keep the whole update in one line, relying on previous updates of the attributes in the map?
Well, as far as I can tell, you can just cache the new value in a variable:
new_age = person.age + 1
person = %{ person | name: "new name", age: new_age, observation: changeObs(new_age)}
Or you could pipe the changes like this:
person
|> Map.put(:name, "new_name")
|> Map.put(:age, new_age)
|> Map.put(:observation, changeObs(new_age))
First of all, the syntax of updating maps uses key: value pairs (with no space before the colon); your code as written raises a SyntaxError exception.
There are plenty of ways to accomplish the task as a one-liner:
person = with age <- person.age + 1,
do: %{person | age: age,
observation: (if age >= 18, do: "Adult", else: "Kid")}
Or:
person = (age = person.age + 1;
%{person | age: age,
observation: (if age >= 18, do: "Adult", else: "Kid")})
Or a pipe chain:
{_, person} = person
|> Map.put(:age, person.age + 1)
|> Map.get_and_update(:age, fn age ->
{age, age + 1}
end)
The idiomatic one would be the last; my favorite would be the first.

PIG - Scalar has more than one row in the output

I have a data set in the following format:
100000853384|RETAIL|OTHER|4.625|280000|360|02/2012|04/2012|31|31|1|23|801|NO|CASH-OUT REFINANCE|SF|1|INVESTOR|CA|945||FRM
100003735682|RETAIL|SUNTRUST MORTGAGE INC.|3.99|466000|360|01/2012|03/2012|80|80|2|30|788|NO|PURCHASE|SF|1|PRINCIPAL|MD|208||FRM
100006367485|CORRESPONDENT|PHH MORTGAGE CORPORATION|4|229000|360|02/2012|04/2012|67|67|2|36|794|NO|NO CASH-OUT REFINANCE|SF|1|PRINCIPAL|CA|959||FRM
The 4th field is the ORIGINAL_INTEREST_RATE.
Now my question is:
What is the interest rate at which the largest number of people have taken a loan?
I wrote the following code.
LOAD DATA SET
loanAqiData = LOAD 'hdfs://masterNode:8020/home/hadoop/hadoop_data/LOAN_Acquisition_DATA/Acquisition_2012Q1.txt'
USING PigStorage('|')
AS
(
LOAN_IDENTIFIER:chararray
, CHANNEL:chararray
, SELLER_NAME:chararray
, ORIGINAL_INTEREST_RATE:float
, ORIGINAL_UNPAID_PRINCIPAL_BALANCE :float
, ORIGINAL_LOAN_TERM :float
, ORIGINATION_DATE:chararray
, FIRST_PAYMENT_DATE:chararray
, ORIGINAL_LOAN_TO_VALUE:float
, ORIGINAL_COMBINED_LOAN_TO_VALUE :float
, NUMBER_OF_BORROWERS:float
, DEBT_TO_INCOME_RATIO:float
, CREDIT_SCORE:float
, FIRST_TIME_HOME_BUYER_INDICATOR:chararray
, LOAN_PURPOSE:chararray
, PROPERTY_TYPE:chararray
, NUMBER_OF_UNITS:chararray
, OCCUPANCY_STATUS:chararray
, PROPERTY_STATE:chararray
, ZIP:chararray
, MORTGAGE_INSURANCE_PERCENTAGE:float
, PRODUCT_TYPE:chararray
);
Group By Interest Rate
grouped_by_interest_rate = group loanAqiData by ORIGINAL_INTEREST_RATE;
Number of records for each interest rate
count_for_specific_interest = FOREACH grouped_by_interest_rate GENERATE group as INTEREST_RATE, COUNT(loanAqiData) as NO_OF_PEOPLE;
Dump
dump count_for_specific_interest;
Output
(3.625,1)
(3.75,2)
(3.875,26)
(3.99,8)
(4.0,21)
(4.1,1)
(4.125,15)
(4.25,16)
(4.375,15)
(4.376,26)
(4.5,10)
(4.625,3)
But I want to get both
(3.875,26) and (4.376,26)
How can I get that?
Also, how can I get the loan interest rate for which the minimum number of people have taken a loan?
I'd suggest you use the MAX() function (http://pig.apache.org/docs/r0.11.0/func.html#max) to determine the highest number of people and then filter by this number.
Here is an example of code that should work (not tested):
FOREACH count_for_specific_interest {
max_value= MAX($1.NO_OF_PEOPLE);
GENERATE INTEREST_RATE, NO_OF_PEOPLE, max_value;
}
RESULT = FILTER count_for_specific_interest BY NO_OF_PEOPLE==max_value;
For the min, you would be able to use exactly the same script, replacing MAX() with MIN().
Finally this is resolved.
Let me write down the steps:
1) Load
2) Group by Interest
grp = group loanAqiData by ORIGINAL_INTEREST_RATE;
3) Count No of people against each Interest
cntForEachGrp = FOREACH grp GENERATE group as
INTEREST_RATE, COUNT(loanAqiData) as NO_OF_PEOPLE;
Output
(3.625,1) (3.75,2) (3.875,26) (3.99,8) (4.0,21) (4.1,1) (4.125,15) (4.25,16) (4.375,15) (4.376,26) (4.5,10) (4.625,3)
4) Group them all to put in the same BAG
grpALL = GROUP cntForEachGrp ALL;
(all,{(3.625,1),(3.75,2),(3.875,26),(3.99,8),(4.0,21),(4.1,1),(4.125,15),(4.25,16),(4.375,15),(4.376,1),(4.5,10),(4.625,3),(4.75,5),(4.875,4),(5.0,2),(5.25,1)})
5) Calculate Max No of people from the BAG
maxVal = FOREACH grpALL {
max_value= MAX(cntForEachGrp.NO_OF_PEOPLE);
GENERATE cntForEachGrp.INTEREST_RATE, cntForEachGrp.NO_OF_PEOPLE, max_value as
max_no;
}
grunt> describe maxVal;
maxVal: {{(INTEREST_RATE: float)},{(NO_OF_PEOPLE: long)},max_no: long}
dump maxVal;
({(3.625),(3.75),(3.875),(3.99),(4.0),(4.1),(4.125),(4.25),(4.375),(4.376),(4.5),(4.625),(4.75),(4.875),(5.0),(5.25)},{(1),(2),(26),(8),(21),(1),(15),(16),(15),(1),(10),(3),(5),(4),(2),(1)},26)
6) Filter out the loan interest having the max no of people
RESULT=FILTER cntForEachGrp BY NO_OF_PEOPLE == maxVal.max_no ;
After the dump we get that interest rate 3.875 has the max number of people, 26.
Why do we have to do
grpALL = GROUP cntForEachGrp ALL;
and what is the inner meaning of the nested FOREACH in step (5)?
