KQL - Remove Duplicates From A List And Sum Values - azure-data-explorer

I have a simple string in KQL that I want to reformat.
Given input in a format like this: "ABC-123 (8), ABC-123 (12), ABC-123 (5), DEF (3), DEF (1), GHI (3)",
I want to transform it to: "ABC-123 (25), DEF (4), GHI (3)"
The value inside the parentheses is always an integer, and the text preceding it can be any string.
In other words, I want to sum the numbers in parentheses for each unique comma-separated string that precedes them.
I have tried split() and array_index_of() to find the positions of the unique values, but I cannot get it to work.
Could anyone point me in the right direction here?

Producing the concatenated string is possible:
datatable(id:int, col:string)
[
    1, "ABC-123 (8), ABC-123 (12), ABC-123 (5), DEF (3), DEF (1), GHI (3)"
   ,2, "DEF (3), DEF (4), DEF (5), GHI (4), GHI (5), JKL (7)"
]
// Extract every [key, number] pair and expand the pairs into rows
| mv-apply kv = extract_all(@"(\S+)\s*\((\d+)\)", col) on
(
    // Sum the numbers per key
    summarize v = tostring(sum(toint(kv[1]))) by k = tostring(kv[0])
    // Rebuild each "key (sum)" item and join the items into one string
    | summarize col = array_strcat(make_list(strcat(k, " (", v, ")")), ", ")
)
| id | col |
|----|-----|
| 1 | ABC-123 (25), DEF (4), GHI (3) |
| 2 | DEF (12), GHI (9), JKL (7) |
Storing the result as a dictionary seems more useful than concatenating it all back into a string:
datatable(id:int, col:string)
[
    1, "ABC-123 (8), ABC-123 (12), ABC-123 (5), DEF (3), DEF (1), GHI (3)"
   ,2, "DEF (3), DEF (4), DEF (5), GHI (4), GHI (5), JKL (7)"
]
| mv-apply kv = extract_all(@"(\S+)\s*\((\d+)\)", col) on
(
    summarize v = sum(toint(kv[1])) by k = tostring(kv[0])
    // Merge the per-key sums into a single property bag per input row
    | summarize col = make_bag(pack_dictionary(k, v))
)
| id | col |
|----|-----|
| 1 | {"ABC-123":25,"DEF":4,"GHI":3} |
| 2 | {"DEF":12,"GHI":9,"JKL":7} |
We can also take an additional step and transform the dictionaries into columns:
datatable(id:int, col:string)
[
    1, "ABC-123 (8), ABC-123 (12), ABC-123 (5), DEF (3), DEF (1), GHI (3)"
   ,2, "DEF (3), DEF (4), DEF (5), GHI (4), GHI (5), JKL (7)"
]
| mv-apply kv = extract_all(@"(\S+)\s*\((\d+)\)", col) on
(
    summarize v = sum(toint(kv[1])) by k = tostring(kv[0])
    | summarize col = make_bag(pack_dictionary(k, v))
)
// Turn each key of the property bag into its own column
| evaluate bag_unpack(col)
| id | ABC-123 | DEF | GHI | JKL |
|----|---------|-----|-----|-----|
| 1 | 25 | 4 | 3 | |
| 2 | | 12 | 9 | 7 |
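For readers who want to check the logic outside Kusto, the same parse-and-sum steps can be sketched in Python. This is only an illustration, not part of the answer: it reuses the answer's regular expression, and the names PAIR, sum_by_key and to_string are made up for the example.

```python
import re
from collections import defaultdict

# Same pattern as the KQL answer: a run of non-space characters,
# optional whitespace, then an integer in parentheses.
PAIR = re.compile(r"(\S+)\s*\((\d+)\)")

def sum_by_key(text):
    """Return {key: total}, with keys in order of first appearance."""
    totals = defaultdict(int)
    for key, number in PAIR.findall(text):
        totals[key] += int(number)
    return dict(totals)

def to_string(totals):
    """Rebuild the 'key (sum)' items and join them with ', '."""
    return ", ".join(f"{k} ({v})" for k, v in totals.items())

row = "ABC-123 (8), ABC-123 (12), ABC-123 (5), DEF (3), DEF (1), GHI (3)"
print(sum_by_key(row))             # dictionary form
print(to_string(sum_by_key(row)))  # concatenated string form
```

As in the KQL version, keys that contain spaces would not be captured whole, because the pattern only takes the last run of non-space characters before the parentheses.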

Related

DB2 Case when need to count the digits of a value

I need to write a CASE WHEN statement in Db2. I am new to this and do not have much experience, sorry for that.
I have a column with different call numbers; each call number should contain 7 digits (e.g. AR78HJ8).
When the value is blank or "_______" (7 times _), I need the result to be 0,
and when I have a seven-digit call number (but not 7 times _), I need the result to be 1.
Also, there could be cases where the call number has 8, 6, or any number of digits other than 7. In this case I want to show the call number itself.
What I have written so far is
case when ab.call_number = '' then '0'
when ab.call_number = '_______' then '0'
else '1'
end as "Call number",
but in this case I assume that all other call numbers are always 7 digits.
What should I do?
Thanks a lot for your help!
Try this as is:
WITH TAB (CALL) AS
(
  VALUES
    '1234567'
  , ''
  , ' '
  , 'AR78HJ8'
  , '12345678'
)
SELECT
  CALL
, CASE
    WHEN CALL = '' THEN '0'
    WHEN LENGTH(TRANSLATE(CALL, '', '0123456789', '')) = 0 AND LENGTH(CALL) = 7 THEN '1'
    ELSE CALL
  END AS "Call number"
FROM TAB;
The result is:
|CALL |Call number|
|--------|-----------|
|1234567 |1 |
| |0 |
| |0 |
|AR78HJ8 |AR78HJ8 |
|12345678|12345678 |

Can Powerbi count multiple occurrences of a specific text within a cell?

Here is what I am trying to find out. Take the following table as an example:
| Col 1 | Col 2 |
|-------|---------|
| ab | 1 |
| ab ab | 2 |
| ac | 1 |
| ae | 1 |
| ae ae | 2 |
| af | 1 |
So basically, if there are two occurrences of the same item in the cell, I want to display 2 in the next column; if there are 3, then 3, and so on. Most of the time I am looking for specific strings, each a mix of text and numbers.
Is this doable in Power BI?
Assuming you want to count how many times the first space-delimited string in the cell occurs, you can do the following:
Col 2 =
VAR Trimmed = TRIM(Table2[Col 1])
VAR FirstSpace = SEARCH(" ", Trimmed, 1, LEN(Trimmed) + 1)
VAR FirstString = LEFT(Trimmed, FirstSpace - 1)
RETURN
    DIVIDE(
        LEN(Trimmed) - LEN(SUBSTITUTE(Trimmed, FirstString, "")),
        FirstSpace - 1
    )
Let's go through an example to see how this works. Suppose we have a string " abc abc abc ".
The TRIM function removes any extraneous spaces at the beginning and end, so Trimmed = "abc abc abc".
FirstSpace is the position of the first space in Trimmed. In this case, FirstSpace = 4. (If there is no space, we define FirstSpace to be the length of Trimmed + 1 so the next part works correctly.)
FirstString uses FirstSpace to take the first chunk. In this case, FirstString = "abc".
Finally, we use SUBSTITUTE to replace each FirstString with an empty string (leaving only the two middle spaces) and look at how that changes the length of Trimmed. We know LEN(Trimmed) = 11 and the length of the two remaining spaces is 2, so the difference is the 9 characters we removed by substitution. Those 9 characters are n copies of FirstString, "abc", and the length of FirstString is FirstSpace - 1 = 3.
Thus we can solve 3n = 9 for n to get n = 9/3 = 3, the count of the "abc" substrings.
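The length arithmetic above can be checked with a small Python sketch that mirrors the DAX step by step. It is an approximation for illustration only: count_first_token is a made-up name, Python's split() collapses all whitespace rather than just spaces, and returning 0 for an empty cell is an assumption of this sketch.

```python
def count_first_token(text):
    # Approximates DAX TRIM: strip the ends and collapse inner runs to one space.
    trimmed = " ".join(text.split())
    # 1-based position of the first space, or len + 1 when there is none,
    # like the fourth argument of SEARCH in the DAX formula.
    idx = trimmed.find(" ")
    first_space = idx + 1 if idx != -1 else len(trimmed) + 1
    first_string = trimmed[: first_space - 1]
    if not first_string:
        return 0  # assumption: treat an empty cell as zero occurrences
    # Characters removed by the substitution = n copies of first_string.
    removed = len(trimmed) - len(trimmed.replace(first_string, ""))
    return removed // len(first_string)

print(count_first_token(" abc abc abc "))
print(count_first_token("ae ae"))
```

Because the substitution works on substrings, a cell such as "ab abc" counts "ab" twice in this sketch; whole-word matching would need a different approach.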

Regular Expression in teradata

I need to search for a few patterns in a column using regular expressions in Teradata.
One example is shown below:
SELECT
REGEXP_SUBSTR(
REGEXP_SUBSTR('1-2-3','([0-9] *- *[0-9] *- *[0-9])',1, 1, 'i'),
'([0-9] *- *[0-9] *- *[0-9])',
1, 1, 'i'
) AS Tmp,
REGEXP_SUBSTR(
tmp,
'(^[0-9])',1,1,'i') || '-' || REGEXP_SUBSTR(tmp,'([0-9]$)',
1, 1, 'i'
) AS final_exp
;
In the above expression, I am extracting "1-3" out of a pattern like "1-2-3". Now the patterns can be anything like: 1-2-3-4-5 or 1-2,3 or 1&2-3 or 1-2,3 &4.
Is there a way to generalize the search pattern in the regular expression? For example, I think [-,&]* only looks for these characters in that order, but in the data they can appear in any order.
A few examples are listed below. I need to fetch all the desired results using a single search pattern in the expression.
Column name ==> Result
abc 1-2+3- 4 ==> 1-4
def 10,12 & 13 ==> 10-13
ijk 1,2,3, and 4 lmn ==> 1-4
abc1-2 & 3 def ==> 1-3
ikl 11 &12 -13 ==> 11-13
oAy$ 7-8 and 9 ==> 7-9
RegExp_Substr(col, '(\d+)',1, 1, 'c') || '-' ||
RegExp_Substr(col, '(\d+)(?!.*\d)',1, 1, 'c')
(\d+) = first number
(\d+)(?!.*\d) = last number (a number not followed by another number)
There's also no need for those optional parameters, because it's using the defaults anyway:
RegExp_Substr(col, '(\d+)') || '-' ||
RegExp_Substr(col, '(\d+)(?!.*\d)')
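The two patterns can be tried against the sample values with a short Python sketch. This only illustrates the regular expressions, not Teradata itself; first_last is a made-up helper, and returning None when the text holds no number is a choice made for this sketch.

```python
import re

FIRST = re.compile(r"(\d+)")           # first number
LAST = re.compile(r"(\d+)(?!.*\d)")    # a number not followed by any other digit

def first_last(text):
    first = FIRST.search(text)
    last = LAST.search(text)
    if first is None or last is None:
        return None  # no number in the text
    return f"{first.group(1)}-{last.group(1)}"

samples = [
    "abc 1-2+3- 4",
    "def 10,12 & 13",
    "ijk 1,2,3, and 4 lmn",
    "abc1-2 & 3 def",
    "ikl 11 &12 -13",
    "oAy$ 7-8 and 9",
]
for s in samples:
    print(s, "==>", first_last(s))
```

The negative lookahead is what keeps a partial match such as "1" from "13" out of the result: after "1" there is still a digit ahead, so the engine moves on until it reaches the final number.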

Unix awk - count of occurrences for each unique value

In Unix, I am printing the unique value for the first character in a field. I am also printing a count of the unique field lengths. Now I would like to do both together. Easy to do in SQL, but I'm not sure how to do this in Unix with awk (or grep, sed, ...).
PRINT FIRST UNIQ LEADING CHAR
awk -F'|' '{print substr($61,1,1)}' file_name.sqf | sort | uniq
PRINT COUNT OF FIELDS WITH LENGTHS 8, 10, 15
awk -F'|' 'NR>1 {count[length($61)]++} END {print count[8] ", " count[10] ", " count[15]}' file_name.sqf | sort | uniq
DESIRED OUTPUT
first char, length 8, length 10, length 15
a, 10, , 150
b, 50, 43, 31
A, 20, , 44
B, 60, 83, 22
The fields that start with an upper or lower 'a' are never length 10.
The input file is a | delimited .sqf with no header. The field is varChar 15.
sample input
56789 | someValue | aValue | otherValue | 712345
46789 | someValue | bValue | otherValue | 812345
36789 | someValue | AValue | otherValue | 912345
26789 | someValue | BValue | otherValue | 012345
56722 | someValue | aValue | otherValue | 712345
46722 | someValue | bValue | otherValue | 812345
desired output
a: , , 2
b: 1, , 1
A: , , 1
B: , 1,
'a' has two instances that are length 15
'b' has one instance each of length 8 and 15
'A' has one instance that is length 15
'B' has one instance that is length 10
Thank you.
I think you need a better sample input file, but I guess this is what you're looking for:
$ awk -F' \\| ' -v OFS=, '{k=substr($3,1,1); ks[k]; c[k,length($3)]++}
END {for(k in ks) print k": "c[k,6],c[k,10],c[k,15]}' file
A: 1,,
B: 1,,
a: 2,,
b: 2,,
Note that since all the values in the sample have length 6, I printed that count instead of 8. With the right data you should be able to get the output you expect. Note, however, that the order is not preserved, because awk's for (k in ks) loop visits keys in an unspecified order.
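The same two-key counting idea (first character plus field length) can be sketched in Python for comparison. This is a hedged illustration, not a replacement for the awk one-liner: the function names, the " | " separator and the field index 2 (the third field, as in the sample) are assumptions, and the keys are sorted here to get a stable order.

```python
from collections import Counter

def count_by_first_char_and_length(lines, field_index=2, sep=" | "):
    """Count values keyed by (first character, length) of one field."""
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\n").split(sep)
        value = fields[field_index] if len(fields) > field_index else ""
        counts[(value[:1], len(value))] += 1
    return counts

def format_rows(counts, lengths):
    """One 'char: n1,n2,n3' row per first character, blank when a count is missing."""
    keys = sorted({first for first, _ in counts})
    return [
        f"{k}: " + ",".join(str(counts.get((k, n), "")) for n in lengths)
        for k in keys
    ]

sample = [
    "56789 | someValue | aValue | otherValue | 712345",
    "46789 | someValue | bValue | otherValue | 812345",
    "36789 | someValue | AValue | otherValue | 912345",
    "26789 | someValue | BValue | otherValue | 012345",
    "56722 | someValue | aValue | otherValue | 712345",
    "46722 | someValue | bValue | otherValue | 812345",
]
for row in format_rows(count_by_first_char_and_length(sample), (6, 10, 15)):
    print(row)
```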

Coordinates to Grid Box Number

Let's say I have a grid that looks like this:
 _ _ _ _ _ _ _ _ _
|     |     |     |
|  0  |  1  |  2  |
|_ _ _|_ _ _|_ _ _|
|     |     |     |
|  3  |  4  |  5  |
|_ _ _|_ _ _|_ _ _|
|     |     |     |
|  6  |  7  |  8  |
|_ _ _|_ _ _|_ _ _|
How do I find which cell I am in if I only know the coordinates? For example, how do I get 0 from (0,0), or how do I get 7 from (1,2)?
Also, I found this question, which does what I want to do in reverse, but I can't reverse it for my needs because as far as I know there is not a mathematical inverse to modulus.
In this case, given cell index A in the range [0, 9), the row is given by R = floor(A/3) and the column is given by C = A mod 3.
In the general case, where MN cells are arranged into a grid with M rows and N columns (an M x N grid), given a whole number B in [0, MN), the row is found by R = floor(B/N) and the column is found by C = B mod N.
Going the other way, if you are given a grid element (R, C) where R is in [0, M) and C is in [0, N), finding the element in the scheme you show is given by A = RN + C.
cell = x + y*width
Programmers use this often to treat a 1D-array like a 2D-array.
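Both directions of the mapping described above can be sketched in a few lines of Python. The function names are made up for the example, and the sketch assumes row-major numbering starting at 0, as in the grid from the question.

```python
def to_index(row, col, n_cols):
    """Cell number of (row, col) in a grid with n_cols columns: A = R*N + C."""
    return row * n_cols + col

def to_row_col(index, n_cols):
    """Inverse mapping: R = floor(A / N), C = A mod N."""
    return divmod(index, n_cols)

# The question's examples, written as (x, y) = (column, row) on a 3-wide grid.
print(to_index(row=0, col=0, n_cols=3))  # (0,0)
print(to_index(row=2, col=1, n_cols=3))  # (1,2)
print(to_row_col(7, 3))
```

Note that the question writes coordinates as (x, y), so x is the column and y is the row; that is why (1,2) maps to row 2, column 1.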
For future programmers
May this be useful:
let wide = 4;
let tall = 3;
let reso = wide * tall;
for (let indx = 0; indx < reso; indx++) {
    let y = Math.floor(indx / wide);
    let x = indx % wide;
    console.log(indx, `x:${x}`, `y:${y}`);
}
