Can Power BI count multiple occurrences of a specific text within a cell?

Here is what I am trying to find out. For example, take the following table:
| Col 1 | Col 2 |
|-------|---------|
| ab | 1 |
| ab ab | 2 |
| ac | 1 |
| ae | 1 |
| ae ae | 2 |
| af | 1 |
So basically, if there are two occurrences of the same item in the cell, I want to display 2 in the next column; if there are 3, then 3, and so on. The thing is that I am looking for specific strings most of the time. It's a text-and-number string.
Is this doable in Power BI?

Assuming you want to count the occurrences of the first chunk of non-space characters (the text before the first separating space), you can do the following:
Col 2 =
VAR Trimmed = TRIM(Table2[Col 1])
VAR FirstSpace = SEARCH(" ", Trimmed, 1, LEN(Trimmed) + 1)
VAR FirstString = LEFT(Trimmed, FirstSpace - 1)
RETURN
    DIVIDE(
        LEN(Trimmed) - LEN(SUBSTITUTE(Trimmed, FirstString, "")),
        FirstSpace - 1
    )
Let's go through an example to see how this works. Suppose we have a string " abc abc abc ".
The TRIM function removes any extraneous spaces at the beginning and end, so Trimmed = "abc abc abc".
FirstSpace finds the first space in Trimmed; in this case, FirstSpace = 4. (If there is no space, SEARCH returns its fourth argument, so FirstSpace becomes the length of Trimmed + 1 and the next part still works correctly.)
FirstString uses FirstSpace to take the first chunk; in this case, FirstString = "abc".
Finally, we use SUBSTITUTE to replace each occurrence of FirstString with an empty string (leaving only the two middle spaces) and look at how that changes the length of Trimmed. We know LEN(Trimmed) = 11 and the leftover two-space string has length 2, so the difference is the 9 characters removed by the substitution. Those 9 characters are n copies of FirstString ("abc"), and the length of FirstString is FirstSpace - 1 = 3.
Thus we can solve 3n = 9 for n to get n = 9/3 = 3, the count of the "abc" substrings.
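If you want to sanity-check the arithmetic outside DAX, here is a minimal Python sketch of the same substitution-and-length trick (the function name and sample values are made up for illustration):

def count_first_token(cell: str) -> int:
    # Count how many times the first space-separated token occurs in the cell,
    # using the same length-difference trick as the DAX column above.
    trimmed = cell.strip()                      # TRIM
    first_space = trimmed.find(" ")             # SEARCH (-1 if absent)
    token_len = first_space if first_space != -1 else len(trimmed)
    if token_len == 0:
        return 0                                # empty cell
    first_token = trimmed[:token_len]           # LEFT
    removed = len(trimmed) - len(trimmed.replace(first_token, ""))  # SUBSTITUTE
    return removed // token_len                 # DIVIDE

print(count_first_token(" abc abc abc "))       # 3
print(count_first_token("ab"))                  # 1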

Related

Kusto: Apply function on multiple column values during bag_unpack

Given a dynamic field, say milestones, with a value like {"ta": 1655859586546, "tb": 1655859586646},
how do I print a table with columns such as "ta", "tb", etc., where the single row contains unixtime_milliseconds_todatetime(tolong(taValue)), unixtime_milliseconds_todatetime(tolong(tbValue)), and so on?
I figured that I'll need to write a function that I can call, so I created this:
let f = view(a:string ){
unixtime_milliseconds_todatetime(tolong(a))
};
I can use this function with a normal column as project f(columnName).
However, in this case, it's a dynamic field, and the number of items in the list is large, so I do not want to enter the fields manually. This is what I have so far.
log_table
| take 1
| evaluate bag_unpack(milestones, "m_") // This gives me fields as columns
// | project-keep m_* // This would work if I just wanted the values; however, I want f(columnValue)
| project-keep f(m_*) // This of course doesn't work, but explains the idea.
Here is a solution based on the mv-apply operator:
// Generate data sample. Not part of the solution.
let log_table = materialize(
    range record_id from 1 to 10 step 1
    | mv-apply range(1, 1 + rand(5), 1) on (
        summarize milestones = make_bag(
            pack_dictionary(
                strcat("t", make_string(to_utf8("a")[0] + toint(rand(26)))),
                1600000000000 + rand(60000000000)
            )
        )
    )
);
// Solution starts here.
log_table
| mv-apply kv = milestones on
(
    extend k = tostring(bag_keys(kv)[0])
    | extend v = unixtime_milliseconds_todatetime(tolong(kv[k]))
    | summarize milestones = make_bag(pack_dictionary(k, v))
)
| evaluate bag_unpack(milestones)
Sample output: one row per record_id, with a datetime column for each milestone key (ta, tb, …, tz) present in the generated data.
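If it helps to see what the mv-apply sub-pipeline does to each milestones bag outside Kusto, here is a small Python sketch of the equivalent per-key conversion (the sample values are taken from the question; variable names are illustrative):

from datetime import datetime, timezone

# One "milestones" property bag, as in the question.
milestones = {"ta": 1655859586546, "tb": 1655859586646}

# unixtime_milliseconds_todatetime(tolong(v)) for every key in the bag;
# bag_unpack then spreads the keys ("ta", "tb", ...) into columns.
converted = {
    k: datetime.fromtimestamp(v / 1000, tz=timezone.utc)
    for k, v in milestones.items()
}

print(converted)   # both keys map to timezone-aware datetimes (around 2022-06-22)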

Extract two consecutive lines that have non-consecutive strings

I have a very large text file with 2 columns and more than 10 million lines.
In most lines, the number in column 2 is the column-2 value of the previous line plus 1. However, a few thousand lines behave differently (see the example below).
Input file:
A 1
A 2
A 3
A 10
A 11
A 12
A 40
A 41
I would like to extract each pair of consecutive lines that does not respect the +1 increment in column 2.
Desired output file:
A 3
A 10
A 12
A 40
Is there (preferentially) an awk command that allows to do that?
I tried several approaches comparing column 2 of two consecutive lines, but so far without success (see the code below).
awk 'FNR==1 {print; next} $2==p2+1 {print p $0; p=""; next} {p=$0 ORS; p2=$2}' input.txt > output.txt
Thanks for your help.
Would you please try the following:
awk 'NR>1 {if ($2!=p2+1) print p ORS $0} {p=$0; p2=$2}' input.txt > output.txt
Output:
A 3
A 10
A 12
A 40
The variable names are similar to yours: p holds the previous line and
p2 holds the second column of the previous line.
The condition NR>1 suppresses printing on the 1st line.
if ($2!=p2+1) print p ORS $0 prints the pairs of lines that
meet the condition.
The block {p=$0; p2=$2} preserves the values of the current line for the next iteration.
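If awk is not a hard requirement, the same previous-line bookkeeping can be sketched in Python (the file names are just placeholders):

# Print each pair of consecutive lines whose 2nd column does not increase by 1,
# mirroring the awk answer above: remember the previous line and its 2nd field.
prev_line, prev_num = None, None

with open("input.txt") as src, open("output.txt", "w") as dst:
    for line in src:
        num = int(line.split()[1])
        if prev_line is not None and num != prev_num + 1:
            dst.write(prev_line)
            dst.write(line)
        prev_line, prev_num = line, num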
I like perl for text processing that needs arithmetic.
$ perl -ane 'print and next if $.<3; print $p and print if $F[3]!=$fp+1; $fp=$F[3]; $p=$_' input.txt
| COLUMN 1 | COLUMN 2 |
| -------- | -------- |
| A | 3 |
| A | 10 |
| A | 12 |
| A | 40 |
This is using -a to autosplit into @F. Note that this snippet treats the input as a Markdown-style table (as in the output above), so the number is the 4th whitespace-separated field, $F[3].
Prints the first 2 lines (the table header): print and next if $.<3
On subsequent lines, prints the previous line and the current line if the 4th field isn't exactly one more than the prior 4th field: print $p and print if $F[3]!=$fp+1
Saves the 4th field as $fp and the entire line as $p: $fp=$F[3]; $p=$_
Assumptions:
columns are tab-delimited
the 1st column may contain white space (this isn't demonstrated in the sample provided by OP but it also hasn't been ruled out)
lines of interest must have the same value in the 1st column (ie, if the values in the 1st column differ then we don't bother with comparing the values in the 2nd column and instead proceed to the next input line)
if 3 consecutive lines meet the criteria, the 2nd/middle line is only printed once
Setup:
$ cat input.txt
A 1
A 2
A 3 # match
A 10 # match
A 11
A 12 # match
A 23 # match
A 40 # match
A 41
X to Z 101
X to Z 102 # match
X to Z 104 # match
X to Z 105
NOTE: comments only added here to highlight the lines that match the search criteria
One awk idea:
awk -F'\t' '
FNR==1 { prevline=$0 }
FNR>1  { if ($1 == prev1 && $2+0 != prev2+1) {
             if (prevline) print prevline
             print
             prevline=""   # make sure this line is not printed again if the next line also meets the criteria
         }
         else
             prevline=$0
       }
       { prev1=$1; prev2=$2 }
' input.txt
This generates:
A 3
A 10
A 12
A 23
A 40
X to Z 102
X to Z 104
This might work for you (GNU sed):
sed -nE 'N;h
s/.*\s+(.*)\n.*(\s.*)/echo "$((\1+1))\2"/e;/^(.*)\s\1$/!{x;p;x};x;D' file
Open a two-line window throughout the length of the file.
Make a copy of the window and increment the 2nd column of the first line by one. If this amended value is not equal to the 2nd column of the second line, then print both unadulterated lines.
Delete the first line and repeat.
N.B. This may print the second of these lines twice if the following line meets the same criteria.

Unix awk - count of occurrences for each unique value

In Unix, I am printing the unique values of the first character of a field. I am also printing a count of the unique field lengths. Now I would like to do both together. Easy to do in SQL, but I'm not sure how to do it in Unix with awk (or grep, sed, ...).
PRINT FIRST UNIQ LEADING CHAR
awk -F'|' '{print substr($61,1,1)}' file_name.sqf | sort | uniq
PRINT COUNT OF FIELDS WITH LENGTHS 8, 10, 15
awk -F'|' 'NR>1 {count[length($61)]++} END {print count[8] ", " count[10] ", " count[15]}' file_name.sqf | sort | uniq
DESIRED OUTPUT
first char, length 8, length 10, length 15
a, 10, , 150
b, 50, 43, 31
A, 20, , 44
B, 60, 83, 22
The fields that start with an upper or lower 'a' are never length 10.
The input file is a |-delimited .sqf with no header. The field is varchar 15.
sample input
56789 | someValue | aValue | otherValue | 712345
46789 | someValue | bValue | otherValue | 812345
36789 | someValue | AValue | otherValue | 912345
26789 | someValue | BValue | otherValue | 012345
56722 | someValue | aValue | otherValue | 712345
46722 | someValue | bValue | otherValue | 812345
desired output
a: , , 2
b: 1, , 1
A: , , 1
B: , 1,
'a' has two instances that are length 15
'b' has one instance each of length 8 and 15
'A' has one instance that is length 15
'B' has one instance that is length 10
Thank you.
I think you need a better sample input file, but I guess this is what you're looking for:
$ awk -F' \\| ' -v OFS=, '{k=substr($3,1,1); ks[k]; c[k,length($3)]++}
END {for(k in ks) print k": "c[k,6],c[k,10],c[k,15]}' file
A: 1,,
B: 1,,
a: 2,,
b: 2,,
Note that since all field lengths in the sample are 6, I printed that count instead of the length-8 count. With the right data you should be able to get the output you expect. Note, however, that the order is not preserved.
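For comparison, here is a rough Python sketch of the same grouping, keyed on the first character of the 3rd |-delimited field and counting per field length; it prints the length-8/10/15 columns the question asks for (the file name and field position follow the sample above and are assumptions):

from collections import defaultdict

counts = defaultdict(lambda: defaultdict(int))   # counts[first_char][field_length]

with open("file_name.sqf") as fh:
    for line in fh:
        field = line.split("|")[2].strip()       # 3rd |-delimited field, as in the sample
        if field:
            counts[field[0]][len(field)] += 1

for ch in sorted(counts):
    by_len = counts[ch]
    print(f"{ch}: {by_len.get(8, '')}, {by_len.get(10, '')}, {by_len.get(15, '')}")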

SQLite find table row where a subset of columns satisfies a specified constraint

I have the following SQLite table
CREATE TABLE visits(urid INTEGER PRIMARY KEY AUTOINCREMENT,
hash TEXT,dX INTEGER,dY INTEGER,dZ INTEGER);
Typical content would be
# select * from visits;
urid | hash | dX | dY | dZ
------+-----------+-------+--------+------
1 | 'abcd' | 10 | 10 | 10
2 | 'abcd' | 11 | 11 | 11
3 | 'bcde' | 7 | 7 | 7
4 | 'abcd' | 13 | 13 | 13
5 | 'defg' | 20 | 21 | 17
What I need to do here is identify the urid for the table row which satisfies the constraint
hash = 'abcd' AND nearby >= (abs(dX - tX) + abs(dY - tY) + abs(dZ - tZ))
with the smallest deviation - in the sense of smallest sum of absolute distances
In the present instance with
nearby = 7
tX = tY = tZ = 12
there are three rows that meet the above constraint but with different deviations
urid | hash | dX | dY | dZ | deviation
------+-----------+-------+--------+--------+---------------
1 | 'abcd' | 10 | 10 | 10 | 6
2 | 'abcd' | 11 | 11 | 11 | 3
4 | 'abcd' | 13 | 13 | 13 | 3
in which case I would like to have urid = 2 or urid = 4 reported - I don't actually care which one gets reported.
Left to my own devices, I would fetch the full set of matching rows and then drill down to the one that matches my secondary constraint - smallest deviation - in my own Java code. However, I suspect that is not necessary and that it can be done in SQL alone. My knowledge of SQL is sadly too limited here. I hope that someone here can put me on the right path.
I now have managed to do the following
CREATE TEMP TABLE h1(urid INTEGER, devi INTEGER);
INSERT INTO h1
SELECT urid, (abs(dX - 12) + abs(dY - 12) + abs(dZ - 12)) AS devi
FROM visits WHERE hash = 'abcd';
which gives
--SELECT * FROM h1
urid | devi |
-------+-----------+
1 | 6 |
2 | 3 |
4 | 3 |
following which I issue
select urid from h1 order by devi asc limit 1;
which yields urid = 2, the result I am after. Whilst this works, I would like to know if there is a better/simpler way of doing this.
You're so close! You have all of the components you need, you just have to put them together into a single query.
Consider:
SELECT urid
, (abs(dx - :tx) + abs(dy - :ty) + abs(dz - :tz)) AS devi
FROM visits
WHERE hash=:hashval AND devi <= :nearby
ORDER BY devi
LIMIT 1
Line by line: first you list the columns and computed values you want (:tx, :ty, :tz, etc. are placeholders; in your code you prepare the statement and then bind values to the placeholders before executing it) from the visits table.
Then in the WHERE clause you restrict what rows get returned to those matching the particular hash (that column should have an index for best results, e.g. CREATE INDEX visits_idx_hash ON visits(hash)) and that have a devi no greater than the value of the :nearby placeholder. (I think devi <= :nearby is clearer than :nearby >= devi.)
Then you say that you want those results sorted in increasing order according to devi, and LIMIT the returned results to a single row because you don't care about any others (If there are no rows that meet the WHERE constraints, nothing is returned).
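To make the placeholder binding concrete, here is a minimal Python sqlite3 sketch (the question mentions Java, but the pattern is the same in any client; the database file name and parameter values are made up, and the deviation expression is repeated in the WHERE clause, which is the most portable way to filter on it):

import sqlite3

conn = sqlite3.connect("visits.db")   # hypothetical database file
row = conn.execute(
    """
    SELECT urid,
           (abs(dX - :tx) + abs(dY - :ty) + abs(dZ - :tz)) AS devi
    FROM visits
    WHERE hash = :hashval
      AND (abs(dX - :tx) + abs(dY - :ty) + abs(dZ - :tz)) <= :nearby
    ORDER BY devi
    LIMIT 1
    """,
    {"tx": 12, "ty": 12, "tz": 12, "hashval": "abcd", "nearby": 7},
).fetchone()

print(row)   # e.g. (2, 3) for the sample data in the question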

R count character column

I want to add a column containing the number of letters a-z in another column of the same row.
dataset$count <-length((gregexpr('[a-z]', as.character(dataset$text))[[1]]))
does not work.
The result I would like to achieve:
| text  | count |
|-------|-------|
| a     | 1     |
| ao    | 2     |
| ao2   | 2     |
| as2e  | 3     |
| as2eA | 3     |
Tricky one:
nchar(gsub("[^a-z]","",x))
This should do the trick:
numchars<-function(txt){
#basically your code, but to be applied to 1 item
tmpres<-gregexpr('[a-z]', as.character(txt))[[1]]
ifelse(tmpres[1]==-1, 0, length(tmpres))
}
#now apply it to all items:
dataset$count <-sapply(dataset$text, numchars)
Another option is more of a two-step approach (note the [[1]] must be dropped so you keep one match object per row):
charmatches <- gregexpr('[a-z]', as.character(dataset$text))
dataset$count <- sapply(charmatches, length)
Like the original attempt, this counts a row with no match as 1, because gregexpr returns -1 for it; the numchars function above handles that case.
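For what it's worth, the same per-row count is easy to express in Python as well (a tiny sketch with made-up sample values):

import re

texts = ["a", "ao", "ao2", "as2e", "as2eA"]
counts = [len(re.findall("[a-z]", t)) for t in texts]
print(counts)   # [1, 2, 2, 3, 3]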
