Unix awk - count of occurrences for each unique value - unix

In Unix, I am printing the unique value for the first character in a field. I am also printing a count of the unique field lengths. Now I would like to do both together. Easy to do in SQL, but I'm not sure how to do this in Unix with awk (or grep, sed, ...).
PRINT FIRST UNIQ LEADING CHAR
awk -F'|' '{print substr($61,1,1)}' file_name.sqf | sort | uniq
PRINT COUNT OF FIELDS WITH LENGTHS 8, 10, 15
awk -F'|' 'NR>1 {count[length($61)]++} END {print count[8] ", " count[10] ", " count[15]}' file_name.sqf | sort | uniq
DESIRED OUTPUT
first char, length 8, length 10, length 15
a, 10, , 150
b, 50, 43, 31
A, 20, , 44
B, 60, 83, 22
The fields that start with an upper or lower 'a' are never length 10.
The input file is a | delimited .sqf with no header. The field is varChar 15.
sample input
56789 | someValue | aValue | otherValue | 712345
46789 | someValue | bValue | otherValue | 812345
36789 | someValue | AValue | otherValue | 912345
26789 | someValue | BValue | otherValue | 012345
56722 | someValue | aValue | otherValue | 712345
46722 | someValue | bValue | otherValue | 812345
desired output
a: , , 2
b: 1, , 1
A: , , 1
B: , 1,
'a' has two instances that are length 15
'b' has one instance each of length 8 and 15
'A' has one instance that is length 15
'B' has one instance that is length 10
Thank you.

I think you need a better sample input file, but I guess that's what you're looking for
$ awk -F' \\| ' -v OFS=, '{k=substr($3,1,1); ks[k]; c[k,length($3)]++}
END {for(k in ks) print k": "c[k,6],c[k,10],c[k,15]}' file
A: 1,,
B: 1,,
a: 2,,
b: 2,,
note that since all lengths are 6, I printed that count instead of 8. With the right data you should be able to get the output you expect. Note however that the order is not preserved.

Related

List all strings appearing more than once in a file

I have a very large file (around 70GB), and I want to list all strings that appear more than once in the whole file.
I can list all the matches when I specify which string to search in a file, but I want to list all strings that have more than one occurrence.
For example, assuming my file looks like this:
+------+------------------------------------------------------------------+----------------------------------+--+
| HHID | VAL_CD64 | VAL_CD32 | |
+------+------------------------------------------------------------------+----------------------------------+--+
| 203 | 8c5bfd9b6755ffcdb85dc52a701120e0876640b69b2df0a314dc9e7c2f8f58a5 | 373aeda34c0b4ab91a02ecf55af58e15 | |
| 7AB | f6c581becbac4ec1291dc4b9ce566334b1cb2c85e234e489e7fd5e1393bd8751 | 2c4f97a04f02db5a36a85f48dab39b5b | |
| 7AB | abad845107a699f5f99575f8ed43e0440d87a8fc7229c1a1db67793561f0f1c3 | 2111293e946703652070968b224875c9 | |
| 348 | 25c7cf022e6651394fa5876814a05b8e593d8c7f29846117b8718c3dd951e496 | 5c80a555fcda02d028fc60afa29c4a40 | |
| 348 | 67d9c0a4bb98900809bcfab1f50bef72b30886a7b48ff0e9eccf951ef06542f9 | 6c10cd11b805fa57d2ca36df91654576 | |
| 348 | 05f1e412e7765c4b54a9acfd70741af545564f6fdfe48b073bfd3114640f5e37 | 6040b29107adf1a41c4f5964e0ff6dcb | |
| 4D3 | 3e8da3d63c51434bcd368d6829c7cee490170afc32b5137be8e93e7d02315636 | 71a91c4768bd314f3c9dc74e9c7937e8 | |
+------+------------------------------------------------------------------+----------------------------------+--+
And I want to list only records which have HHID more than once, i.e, 7AB and 348.
Any idea how can I implement this?
awk to the rescue:
awk -F'[ |]+' '
$2 ~ /^[[:alnum:]]+$/ { count[$2]++ }
END {
for (hhid in count) {
if (count[hhid] >= 2) {
print hhid
}
}
}
' file
-F'[ |]+' sets the field separator.
$2 ~ /^[[:alnum:]]+$/ filters out the header and horizontal lines.
count[$2]++ increases the value at $2, the string we’re counting. On the first occurrence this initialises the value to 1. On the second occurrence it increases it to 2, and so on.
END is run after all lines have been processed.
for (hhid in count) iterates over the strings in count.
if (count[hhid] >= 2) skips any <2 counts.
print hhid prints the string.

How to duplicate input into outputs with jq?

I'm trying to adapt the following snippet:
echo '{"a":{"value":"b"}, "c":{"value":"d"}}' \
| jq -r '. as $in | keys[] | [$in[.].value | tostring + " 1"] | #tsv'
b 1
d 1
to output:
b 1
b 2
d 1
d 2
The following adaptation produces the desired output:
echo '{"a":{"value":"b"}, "c":{"value":"d"}}' |
jq -r '
def addindex(start;lessthan):
range(start;lessthan) as $i | "\(.) \($i)";
. as $in
| keys[]
| $in[.].value
| addindex(1;3)'
Note that keys emits the key names after they have been sorted, whereas keys_unsorted retains the ordering.

Split a single row into multiple rows keeping the delimiter intact

I am trying to split a single row in my data set into multiple rows by keeping the delimiter intact.
This is a sample of my input data set
|---------------------|----------------------------------------------- |
| Group | Rules |
|---------------------|----------------------------------------------- |
| 1 | 1. Teams must be split into two |
| | 2. Teams must have ten players in each team |
| | 3. Each player must bring their own gear |
|---------------------|----------------------------------------------- |
When I use Strsplit function, I get the following output:
df = data.frame(rules =unlist(strsplit(as.character(df$Rules),"?=[[digits]]", perl = T)))
|---------------------|----------------------------------------------- |
| Group | Rules |
|---------------------|----------------------------------------------- |
| 1 | 1 |
|--------------------------------------------------------------------- |
1 | .Teams must be split into two |
|--------------------------------------------------------------------- |
| 1 | 2 |
|--------------------------------------------------------------------- |
1 | .Teams must have ten players in each team |
|--------------------------------------------------------------------- |
My desired Output
|---------------------|----------------------------------------------- |
| Group | Rules |
|---------------------|----------------------------------------------- |
| 1 | 1.Teams must be split into two |
|--------------------------------------------------------------------- |
| 1 | 2.Teams must have ten players in each team |
|--------------------------------------------------------------------- |
Here is a way to collapse each number with the following character string in column Rules. It throws warnings, not errors.
grp <- cumsum(!is.na(as.numeric(df$Rules)))
res <- lapply(split(df, grp), function(X){
data.frame(Group = X[[1]][1],
Rules = paste(X[[2]], collapse = ""))
})
res <- do.call(rbind, res)
res
# Group Rules
#1 1 1.Teams must be split into two
#2 1 2.Teams must have ten players in each team
Data.
df <- data.frame(Group = rep(1, 4),
Rules = c(1, ".Teams must be split into two",
2, ".Teams must have ten players in each team"),
stringsAsFactors = FALSE)

Can Powerbi count multiple occurrences of a specific text within a cell?

What i am trying to find out is, for example let's take as an example the following table:
| Col 1 | Col 2 |
|-------|---------|
| ab | 1 |
| ab ab | 2 |
| ac | 1 |
| ae | 1 |
| ae ae | 2 |
| af | 1 |
So basically if there are two occurrences of the same item in the cell, I want to display 2 in the next column. If there are 3, then 3 and so on. The thing is that I am looking for specific strings most of the time. Its a text and number string.
Is this doable in Power BI?
Assuming you want to count the number of occurrences of the first non-space characters that occur before the first separating space, you can do the following:
Col 2 =
VAR Trimmed = TRIM(Table2[Col 1])
VAR FirstSpace = SEARCH(" ", Trimmed, 1, LEN(Trimmed) + 1)
VAR FirstString = LEFT(Trimmed, FirstSpace - 1)
RETURN DIVIDE(
LEN(Trimmed) - LEN(SUBSTITUTE(Trimmed, FirstString, "")),
FirstSpace - 1
)
Let's go through an example to see how this works. Suppose we have a string " abc abc abc ".
The TRIM function removes any extraneous spaces at the beginning an end, so Trimmed = "abc abc abc".
The FirstSpace searches for the first space in Trimmed. In this case, FirstSpace = 4. (If there is no first space, then we define FirstSpace to be the length of Trimmed + 1 so the next part works correctly.)
The FirstString uses FirstSpace to find the first chunk. In this case, FirstString = "abc".
Finally, we use SUBSTITUTE to replace each FirstString with an empty string (leaving only the middle spaces) and look at how that changes the length of Trimmed. We know LEN(Trimmed) = 11 and LEN(" ") = 2, so the difference is the 9 characters we removed by substitution. We know that the 9 characters are n copies of FirstString, "abc" and we know the length of FirstString is FirstSpace - 1 = 3.
Thus we can solve 3n = 9 for n to get n = 9/3 = 3, the count of the "abc" substrings.

Z80 DAA instruction

Apologies for this seemingly minor question, but I can't seem to find the answer anywhere - I'm just coming up to implementing the DAA instruction in my Z80 emulator, and I noticed in the Zilog manual that it is for the purposes of adjusting the accumulator for binary coded decimal arithmetic. It says the instruction is intended to be run right after an addition or subtraction instruction.
My questions are:
what happens if it is run after another instruction?
how does it know what instruction preceeded it?
I realise there is the N flag - but this surely wouldnt definitively indicate that the previous instruction was an addition or subtraction instruction?
Does it just modify the accumulator anyway, based on the conditions set out in the DAA table, regardless of the previous instruction?
Does it just modify the accumulator anyway, based on the conditions set out in the DAA table, regardless of the previous instruction?
Yes. The documentation is only telling you what DAA is intended to be used for. Perhaps you are referring to the table at this link:
--------------------------------------------------------------------------------
| | C Flag | HEX value in | H Flag | HEX value in | Number | C flag|
| Operation | Before | upper digit | Before | lower digit | added | After |
| | DAA | (bit 7-4) | DAA | (bit 3-0) | to byte | DAA |
|------------------------------------------------------------------------------|
| | 0 | 0-9 | 0 | 0-9 | 00 | 0 |
| ADD | 0 | 0-8 | 0 | A-F | 06 | 0 |
| | 0 | 0-9 | 1 | 0-3 | 06 | 0 |
| ADC | 0 | A-F | 0 | 0-9 | 60 | 1 |
| | 0 | 9-F | 0 | A-F | 66 | 1 |
| INC | 0 | A-F | 1 | 0-3 | 66 | 1 |
| | 1 | 0-2 | 0 | 0-9 | 60 | 1 |
| | 1 | 0-2 | 0 | A-F | 66 | 1 |
| | 1 | 0-3 | 1 | 0-3 | 66 | 1 |
|------------------------------------------------------------------------------|
| SUB | 0 | 0-9 | 0 | 0-9 | 00 | 0 |
| SBC | 0 | 0-8 | 1 | 6-F | FA | 0 |
| DEC | 1 | 7-F | 0 | 0-9 | A0 | 1 |
| NEG | 1 | 6-F | 1 | 6-F | 9A | 1 |
|------------------------------------------------------------------------------|
I must say, I've never seen a dafter instruction spec. If you examine the table carefully, you will see that the effect of the instruction depends only on the C and H flags and the value in the accumulator -- it doesn't depend on the previous instruction at all. Also, it doesn't divulge what happens if, for example, C=0, H=1, and the lower digit in the accumulator is 4 or 5. So you will have to execute a NOP in such cases, or generate an error message, or something.
Just wanted to add that the N flag is what they mean when they talk about the previous operation. Additions set N = 0, subtractions set N = 1. Thus the contents of the A register and the C, H and N flags determine the result.
The instruction is intended to support BCD arithmetic but has other uses. Consider this code:
and 15
add a,90h
daa
adc a,40h
daa
It ends converting the lower 4 bits of A register into the ASCII values '0', '1', ... '9', 'A', 'B', ..., 'F'. In other words, a binary to hexadecimal converter.
I found this instruction rather confusing as well, but I found this description of its behavior from z80-heaven to be most helpful.
When this instruction is executed, the A register is BCD corrected using the contents of the flags. The exact process is the following: if the least significant four bits of A contain a non-BCD digit (i. e. it is greater than 9) or the H flag is set, then $06 is added to the register. Then the four most significant bits are checked. If this more significant digit also happens to be greater than 9 or the C flag is set, then $60 is added.
This provides a simple pattern for the instruction:
if the lower 4 bits form a number greater than 9 or H is set, add $06 to the accumulator
if the upper 4 bits form a number greater than 9 or C is set, add $60 to the accumulator
Also, while DAA is intended to be run after an addition or subtraction, it can be run at any time.
This is code in production, implementing DAA correctly and passes the zexall/zexdoc/z80test Z80 opcode test suits.
Based on The Undocumented Z80 Documented, pag 17-18.
void daa()
{
int t;
t=0;
// 4 T states
T(4);
if(flags.H || ((A & 0xF) > 9) )
t++;
if(flags.C || (A > 0x99) )
{
t += 2;
flags.C = 1;
}
// builds final H flag
if (flags.N && !flags.H)
flags.H=0;
else
{
if (flags.N && flags.H)
flags.H = (((A & 0x0F)) < 6);
else
flags.H = ((A & 0x0F) >= 0x0A);
}
switch(t)
{
case 1:
A += (flags.N)?0xFA:0x06; // -6:6
break;
case 2:
A += (flags.N)?0xA0:0x60; // -0x60:0x60
break;
case 3:
A += (flags.N)?0x9A:0x66; // -0x66:0x66
break;
}
flags.S = (A & BIT_7);
flags.Z = !A;
flags.P = parity(A);
flags.X = A & BIT_5;
flags.Y = A & BIT_3;
}
For visualising the DAA interactions, for debugging purposes, I have written a small Z80 assembly program, that can be run in an actual ZX Spectrum or in an emulation that emulates accurately DAA: https://github.com/ruyrybeyro/daatable
As how it behaves, got a table of flags N,C,H and register A before and after DAA produced with the aforementioned assembly program: https://github.com/ruyrybeyro/daatable/blob/master/daaoutput.txt

Resources