Unix: count lines starting with the same number

I have a text corpus and have already sorted it by frequency:
tr ' ' '\n' < corpus.txt | sort | uniq -c | sort -nr
Now I want to count how many lines start with the same number.
For example:
100 the
50 in
50 and
10 cat
10 dog
should return:
100 1
50 2
10 2
Is there a way to do it?
Thanks!

Easy with awk:
$ awk '{count[$1]++} END {for (i in count) print i, count[i]}' file
100 1
10 2
50 2
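Note that awk's for (i in count) loop visits indices in an unspecified order, which is why the output above is not sorted. To order it by frequency as in the question, pipe the same command through sort:
$ awk '{count[$1]++} END {for (i in count) print i, count[i]}' file | sort -rn
100 1
50 2
10 2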

Just tweak the command you already have. Assuming the frequency list from the pipeline above has been saved to a file (say freqs.txt; adjust the name), extract the count field and count the duplicates. awk is used to grab the first field because uniq -c pads its counts with leading spaces, which would trip up cut -d' ':
awk '{print $1}' freqs.txt | sort -rn | uniq -c
The output is (occurrences first, then the starting number):
1 100
2 50
2 10
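If you would rather have the starting number first, as in the expected output in the question, one more awk at the end of the same pipeline swaps the two fields:
awk '{print $1}' freqs.txt | sort -rn | uniq -c | awk '{print $2, $1}'
100 1
50 2
10 2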

Related

Compare strings from a file and get a group-by of results using shell/bash

I have a file like below:
h1 a 1
h2 a 1
h1 b 2
h2 b 2
h1 c 3
h2 c 3
h1 c1 3
h2 c1 3
h1 c2 3
h2 c2 3
I need output like:
2 a 1
2 b 2
6 c 3
I have tried with bash, but somehow it's not giving me the expected results:
cat sample.log | awk '{print $2 , $3}' | sort | uniq -c
2
2 a 1
2 b 2
2 c 3
2 c1 3
2 c2 3
With the command below I am able to get the c* results, but a and b are missing:
cat sample.log | awk '$2="c" {print $2 , $3}' | sort -n | uniq -c | sort -n | tail -1
6 c 3
You may use this gnu-awk (PROCINFO["sorted_in"], used here to get sorted output, is a GNU awk extension):
awk '{ ch=substr($2, 1, 1); ++freq[ch OFS $3] } END {
  PROCINFO["sorted_in"] = "@ind_str_asc"
  for (i in freq) print freq[i], i
}' file
2 a 1
2 b 2
6 c 3
1st solution: you could try the following.
awk '{sub(/[0-9]+/,"",$2); a[$2 OFS $3]++} END{for(i in a){print a[i],i}}' Input_file
Explanation: a detailed explanation of the above.
awk '                     ##Starting awk program from here.
{
  sub(/[0-9]+/,"",$2)     ##Substitute digits in the 2nd field with the empty string.
  a[$2 OFS $3]++          ##Create array a indexed by the 2nd and 3rd fields and increase its occurrence count.
}
END{
  for(i in a){            ##Starting a for loop here.
    print a[i],i          ##Print array a element with index i, then index i itself.
  }
}
' Input_file              ##Mentioning Input_file name here.
2nd solution: in case the output is needed in the same sequence as Input_file, try the following:
awk '
{
  sub(/[0-9]+/,"",$2)
}
!a[$2 OFS $3]++{
  count++
}
{
  b[count]=$2 OFS $3
  ++c[$2 OFS $3]
}
END{
  for(i=1;i<=count;i++){
    print c[b[i]],b[i]
  }
}
' Input_file
Without awk. The sed keeps only the first character of the second field plus the last field ([^ ]+ eats the first field, (.) captures the group letter, and .* drops the rest of the second field), and uniq -c then counts each group:
$ sed -E 's/[^ ]+ (.).* /\1 /' file | sort | uniq -c
2 a 1
2 b 2
6 c 3
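If GNU awk is not available, the same grouping works with POSIX tools: print the first character of the second field plus the third field, then sort and count. A sketch under that assumption:
$ awk '{print substr($2, 1, 1), $3}' file | sort | uniq -c
2 a 1
2 b 2
6 c 3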

Count the number of non-null rows column-wise in a txt file in UNIX

I am trying to count the number of non-null rows for every column in a txt file. I am able to count the non-null rows in each column individually, but I am trying to loop over all of them together:
awk - F "|" '$1!=""{N++} print N'
Here is a look at my data
A | B | C | D | E
1 | 2 | 0 | 8 |
5 | 3 | 6 | | 4
| | 8 | |
| 7 | 8 | |
8 | 9 | 2 | | 4
I want the result to be like :
Column A: 3
Column B: 4
Column C: 5
Column D: 1
Column E: 2
Your attempt is not working. Remove the space between - and F, and print N at the end inside an END block:
awk -F "|" '$1!=""{N++} END {print N}' input.txt
This command will also count lines that have some text but no | at all. (Also note that with padded data like the sample above, an "empty" field still contains a space, so you may need $1 ~ /[^ ]/ rather than $1 != "".)
An alternative would be
grep -cE "[^|]+\|" input.txt
If you want to check all columns of all lines, instead of a particular column:
awk -F'|' '{ for (i = 1; i <= NF; i++) if ($i != "") n++ } END { print n }' input.txt
For each line, loop over every |-delimited field in that line, incrementing a counter if it's not empty. Finally print the count at the end.
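The question actually asks for a separate count per column. Here is a sketch of that, reading the column names from the header line; since the sample's fields are padded with spaces, it tests for a non-space character instead of comparing against the empty string (the input.txt name and the exact layout above are assumptions):
awk -F'|' '
NR == 1 { cols = NF; for (i = 1; i <= NF; i++) { h = $i; gsub(/ /, "", h); name[i] = h }; next }
{ for (i = 1; i <= NF; i++) if ($i ~ /[^ ]/) n[i]++ }
END { for (i = 1; i <= cols; i++) printf "Column %s: %d\n", name[i], n[i] }
' input.txt
Column A: 3
Column B: 4
Column C: 5
Column D: 1
Column E: 2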

Rotating a trailing minus to the front of the data of an Excel export in UNIX [duplicate]

This question already has an answer here:
Issue with negative sign after executing the tr command in UNIX
(1 answer)
Closed 6 years ago.
How do I bring a trailing minus sign to the beginning of the data in a file in UNIX?
Input
ABC 12-
XYZ 10
Expected output
ABC -12
XYZ 10
You can try something like this:
awk '{for (i=1; i<=NF; i++)
        if (substr($i, length($i), 1) == "-")
          $i = "-" substr($i, 1, length($i)-1);
      print}' yourFile
Example:
user@host:/tmp$ cat t1 | column -t
ABC 12- XYZ -10 XYZ 22-
XYZ -10 XYZ -1-0 XYZ -1-0 XYZ -1-0
XYZ -1-0 10- ABC -ABC ABC-
user@host:/tmp$ awk '{for (i=1; i<=NF; i++) if (substr($i,length($i),length($i)) == "-") $i = "-"substr($i,1,length($i)-1); print }' t1 | column -t
ABC -12 XYZ -10 XYZ -22
XYZ -10 XYZ -1-0 XYZ -1-0 XYZ -1-0
XYZ -1-0 -10 ABC -ABC -ABC
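For a sed alternative (GNU sed assumed), the trailing minus of any space-separated token can be rotated to the front in one substitution; a sketch, using the t1 file from the example above:
sed -E 's/([^ ]+)-( |$)/-\1\2/g' t1
On the sample data this produces the same tokens as the awk loop.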

AWK command to sum 2 files

I am new to awk and I need an awk command to sum values across 2 files when they share the same key column.
file 1
a | 16:00 | 24
b | 16:00 | 12
c | 16:00 | 32
file 2
b | 16:00 | 10
c | 16:00 | 5
d | 16:00 | 14
and the output should be
a | 16:00 | 24
b | 16:00 | 22
c | 16:00 | 37
d | 16:00 | 14
I have read some of the questions here and still haven't found the correct way to do it. I already tried this command:
awk 'BEGIN { FS = "," } ; FNR=NR{a[$1]=$2 FS $3;next}{print $0,a[$1]}'
Please help me, thank you!
This script also uses sort, but it works:
awk -F'|' ' { f[$1] += $3 ; g[$1] = $2 } END { for (a in f) { print a , "|", g[a] , "|", f[a] } } ' a.txt b.txt | sort
The results are
a | 16:00 | 24
b | 16:00 | 22
c | 16:00 | 37
d | 16:00 | 14
Without | sort, using GNU awk's built-in asorti to sort the indices:
awk -F'|' '{O[$1 FS $2]+=$3} END{n=asorti(O,T,"@ind_str_asc"); for(t=1;t<=n;t++) print T[t] FS O[T[t]]}' file[1,2]
Just store all the data in two arrays a[] and b[] and then print them back:
awk 'BEGIN{FS=OFS="|"}
{a[$1]+=$3; b[$1]=$2}
END{for (i in a) print i,b[i],a[i]}' f1 f2
Test
$ awk 'BEGIN{FS=OFS="|"} {a[$1]+=$3; b[$1]=$2} END{for (i in a) print i,b[i],a[i]}' f1 f2
b | 16:00 |22
c | 16:00 |37
d | 16:00 |14
a | 16:00 |24
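As in the other answers, for (i in a) visits keys in no particular order, which is why a comes last above. If the order matters, pipe the same command through sort:
$ awk 'BEGIN{FS=OFS="|"} {a[$1]+=$3; b[$1]=$2} END{for (i in a) print i,b[i],a[i]}' f1 f2 | sort
a | 16:00 |24
b | 16:00 |22
c | 16:00 |37
d | 16:00 |14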

How to read stdin from pipes in an R script?

I tried to plot a beeswarm plot (http://www.cbs.dtu.dk/~eklund/beeswarm/) and got it to work, and now I want to write a small R script to automate things. The input to this R script comes from STDIN, and I'm having trouble getting the data read from STDIN.
Here is my R script:
args <- commandArgs(TRUE)
f1 <- args[1]
plotbeeswarm <- function(output){
  library(beeswarm)
  f <- read.table(stdin(), header=TRUE)
  png(output, width=800, height=800)
  beeswarm(Data ~ Category, data=f, pch=16, pwcol=1+as.numeric(sample),
           xlab="")
}
plotbeeswarm(f1)
I think the problem is just in how the input is read and processed into f. Can anyone help me fix my code? Thanks very much!
Here is an example I use on the littler webpage, which you should be able to adapt for Rscript too.
The code of the script is just:
#!/usr/bin/r -i
fsizes <- as.integer(readLines(file("stdin")))
print(summary(fsizes))
stem(fsizes)
and I feed it the result of ls -l, filtered by awk to get just the one column of file sizes:
edd@max:~/svn/littler/examples$ ls -l /bin/ | awk '{print $5}' | ./fsizes.r
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
3 6240 30820 61810 60170 2000000 1
The decimal point is 5 digit(s) to the right of the |
0 | 00000000000000000000000000000000000111111111111111111111111122222222+57
1 | 111112222345679
2 | 7
3 | 1
4 | 1
5 |
6 |
7 |
8 |
9 | 6
10 |
11 |
12 |
13 |
14 |
15 |
16 |
17 |
18 |
19 |
20 | 0
edd@max:~/svn/littler/examples$
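Adapting the same idea back to the beeswarm script from the question: the key change is read.table(file("stdin"), ...), since file("stdin") is the process's standard input, while stdin() refers to the console. This is only a sketch: the original pwcol=1+as.numeric(sample) references the base-R sample function, so coloring by the Category column below is an assumption, as is the added dev.off().
#!/usr/bin/env Rscript
library(beeswarm)

args <- commandArgs(TRUE)
output <- args[1]

# file("stdin") reads whatever is piped into the script
f <- read.table(file("stdin"), header=TRUE)

png(output, width=800, height=800)
beeswarm(Data ~ Category, data=f, pch=16,
         pwcol=1+as.numeric(as.factor(f$Category)),  # assumption: color points per category
         xlab="")
dev.off()  # close the png device so the file is flushed
Usage would then be along the lines of: cat data.txt | ./plotbeeswarm.r out.png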
