I'm trying to figure out a way to take in a file where there is one word per line and output a log of the most frequently used words in the file and how often they occurred.
Namely, if I were given a file like this (far shorter than what I'm actually looking at, but for clarity's sake):
dog
dog
cat
bird
cat
horse
dog
I would get an output like:
dog - 3
cat - 2
bird - 1
horse - 1
How about this:
[cnicutar@fresh ~]$ sort < file | uniq -c | sort -rn
3 dog
2 cat
1 horse
1 bird
You can then tweak the formatting to get dog - 3 and so on.
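For instance, a sketch of one such tweak, reformatting with a final awk stage:
$ sort < file | uniq -c | sort -rn | awk '{print $2" - "$1}'
dog - 3
cat - 2
horse - 1
bird - 1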
Using awk and sort:
$ awk '{arr[$1]++}END{for(a in arr){print a" - "arr[a]}}' file.txt | sort -nrk3
A full awk version (asorti requires GNU awk):
awk '{
  arr[$1]++
}
END{
  # index by count (right-aligned to a fixed width so string sorting
  # matches numeric order), then by the word itself
  for (i in arr) tmpidx[sprintf("%12s", arr[i]),i] = i
  num = asorti(tmpidx)
  # walk the sorted index backwards to get descending counts
  j = 0
  for (i=num; i>=1; i--) {
    split(tmpidx[i], tmp, SUBSEP)
    indices[++j] = tmp[2]
  }
  for (i=1; i<=num; i++) print indices[i]" - "arr[indices[i]]
}' file.txt
Output:
dog - 3
cat - 2
horse - 1
bird - 1
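With GNU awk 4.0 or later, the index gymnastics can be skipped by requesting value-ordered traversal through PROCINFO["sorted_in"]; a sketch:
awk '{arr[$1]++}
END{
  PROCINFO["sorted_in"] = "@val_num_desc"   # iterate by value, descending
  for (a in arr) print a" - "arr[a]
}' file.txt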
Another way, using perl (with exactly the output format you asked for):
perl -lne '
END{
    print "$_ - $h{$_}" for reverse sort {$h{$a} <=> $h{$b}} keys %h
}
$h{$_}++
' file.txt
Output:
dog - 3
cat - 2
bird - 1
horse - 1
I have a file that is sorted like the following:
2 Good
2 Hello
3 Goodbye
3 Begin
3 Yes
3 No
I want to find the highest value in the file and display every line that contains it:
3 Goodbye
3 Begin
3 Yes
3 No
How would I do this?
awk to the rescue!
$ awk 'FNR==NR{if(max<$1) max=$1; next} $1==max' file{,}
3 Goodbye
3 Begin
3 Yes
3 No
A double pass: the first pass finds the maximum, the second filters out everything else (file{,} is brace expansion for file file).
sort -r file.txt | awk '{if ($1>=prev) {print $0; prev=$1}}'
3 Yes
3 No
3 Goodbye
3 Begin
Assuming file.txt contains
2 Good
2 Hello
3 Goodbye
3 Begin
3 Yes
3 No
First, get the highest value in the file into a variable. Since the file is already sorted, pick up the last line, then parse out the number using awk.
highest=$(tail -1 file.list | awk '{print $1}')
Then grep the file using that value.
grep "^${highest} " file.list
This should do the job. I am only using awk as required in the question:
awk 'BEGIN {v=0} {l = l "\n" $0} {if ($1>v) {l = $0; v = $1}} END {print l}' file.txt
The variable v is initialized to 0 before the file is parsed. Each line read is appended to l; but whenever the first field ($1) is greater than v, l is reset to just the current line and v is updated. At the end, l (which holds all the lines carrying the maximum value) is printed.
It's easier than you think, provided you already know the highest value:
awk '/^3/' file
3 Goodbye
3 Begin
3 Yes
3 No
I'm working on a problem in KSH and need to take two comma-separated lists, compare them, and output the differences.
Example input 1:
apple, banana
Example input 2:
apple, banana, kiwi
Output:
kiwi
I assume I will need to put the lists into arrays and compare each string in list 1 to list 2 in a loop. My attempt so far:
for fruit in $fruits
do
  if [[ ${fruit[1]} == ${fruit1[1]} ]]
  then
    echo "fruit is the same"
  else
    echo "fruit is not in the list. difference found."
    echo $fruit
  fi
done
Does anyone know how I could do this?
Thanks
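For reference, here is a sketch of the nested-loop approach described in the question (assuming ksh93, whose read -A splits input into an array; list1 and list2 are made-up names):
IFS=', ' read -rA list1 <<< "apple, banana"
IFS=', ' read -rA list2 <<< "apple, banana, kiwi"
for fruit in "${list2[@]}"; do      # each item from the second list
  found=0
  for f in "${list1[@]}"; do        # look for it in the first list
    [[ $fruit == "$f" ]] && { found=1; break; }
  done
  (( found )) || echo "$fruit"      # print only the items not found
done
This prints kiwi for the example input, though the sort/uniq and comm answers below scale better.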
Looking for the symmetric difference of the two lists:
$ a="1,2,3,4,5"
$ b="2,3,4,5,6"
$ echo $a,$b | tr , "\n" | sort | uniq -u
1
6
or the same, but passing the lists separately (e.g. if you need different preprocessing):
$ sort <(echo $a | tr , "\n") <(echo $b | tr , "\n") | uniq -u
1
6
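Note that uniq -u prints items unique to either list. If you only want the items in $b that are missing from $a (like kiwi in the question), comm -13 on the sorted lists is one way; a sketch:
$ comm -13 <(echo $a | tr , "\n" | sort) <(echo $b | tr , "\n" | sort)
6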
One possible solution could be:
localhost > cat file1
apple,banana,kiwi,apple
localhost > cat file2
apple,banana,jewel,potato
The easiest solution is:
tr , '\n' < file1 | sort > file3
tr , '\n' < file2 | sort > file4
comm -3 file3 file4
Output:
apple
        jewel
kiwi
        potato
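If your shell supports process substitution, the temporary files can be skipped and the same comparison written as a one-liner:
comm -3 <(tr , '\n' < file1 | sort) <(tr , '\n' < file2 | sort)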
Joining all the rows into a single line is easy:
cat input.txt | tr "\n" " "
But I have a long file with 84046468 lines. I wish to convert it into a file with 1910147 rows and 44 tab-delimited columns (44 × 1910147 = 84046468). The first column is a text string such as chrXX_12345_+ and the other 43 columns are numerical strings. Is there a way to perform this transformation?
There are NAs present, so I guess sed, substituting "\n" with "\t" whenever the preceding string is a number, won't work.
sample input.txt
chr10_1000103_+
0.932203
0.956522
1
0.972973
1
0.941176
1
0.923077
1
1
0.909091
0.9
1
0.916667
0.8
1
1
0.941176
0.904762
1
1
1
0.979592
0.93617
0.934783
1
0.941176
1
1
0.928571
NA
1
1
1
0.941176
1
0.875
0.972973
1
1
NA
0.823529
0.51366
chr10_1000104_-
0.952381
1
1
0.973684
sample output.txt
chr10_1000103_+ 0.932203 (numbers all tab-delimited)
chr10_1000104_- etc
(sorry, a lot of numbers to type manually)
sed '
# add a marker
s/^/M/
:Next
# bump a counter
s/^/i/
# test the counter
/^\(i\)\{44\}/ !{
  $ !{
    # not yet 44 lines and not at end of file: append the next line
    N
    # loop
    b Next
  }
}
# remove counter and marker
s/^i*M//
# replace each newline by a tab (\t requires GNU sed; use a literal tab otherwise)
s/\n/\t/g' YourFile
Some sed implementations limit the \{...\} repetition count to 255, so 44 is fine.
Here's the right approach using 4 columns instead of 44:
$ cat file
chr10_1000103_+
0.932203
0.956522
1
chr10_1000104_-
0.952381
1
1
$ awk '{printf "%s%s", $0, (NR%4?"\t":"\n")}' file
chr10_1000103_+ 0.932203 0.956522 1
chr10_1000104_- 0.952381 1 1
Just change 4 to 44 for your real input.
If you are seeing control-Ms in your output, it's because they are present in your input, so use dos2unix or similar to remove them before running the tool, or with GNU awk you could just set -v RS='\r\n'.
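For example, a sketch that strips the carriage returns on the fly before the awk step:
tr -d '\r' < input.txt | awk '{printf "%s%s", $0, (NR%44?"\t":"\n")}'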
When posting questions it's important to make it as clear, simple, and brief as possible so that as many people as possible will be interested in helping you.
BTW, cat input.txt | tr "\n" " " is a UUOC (Useless Use Of Cat) and should just be tr "\n" " " < input.txt
Not the best solution, but it should work:
line="nonempty"
while [ -n "$line" ]; do
    for i in $(seq 44); do
        read line
        echo -n "$line "
    done
    echo
done < input.txt
If there is an empty line in the file, it will terminate early. For a more permanent solution I'd try perl.
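For instance, a perl sketch (assuming exactly 44 lines per record): -p prints each line after replacing its trailing newline with a tab, except on every 44th line ($. is the line number):
perl -pe 's/\n/\t/ unless $. % 44 == 0' input.txt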
edit:
If you are concerned with efficiency, just use awk.
awk '{ printf "%s\t", $1 } NR%44==0{ print "" }' < input.txt
You may want to strip the trailing tab character with | sed 's/\t$//' or make the awk script more complicated.
This might work for you (GNU sed):
sed '/^chr/!{H;$!d};x;s/\n/\t/gp;d' file
If a line does not begin with chr, append it to the hold space and then delete it, unless it is the last line. If the line does start with chr, or it is the last line, then swap to the hold space, replace all newlines with tabs, and print the result.
N.B. the start of the next line will be left untouched in the pattern space which becomes the new hold space.
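For readability, here is the same script spread out with comments (GNU sed; a sketch):
sed '
  /^chr/!{
    # not the start of a record: stash the line in the hold space
    H
    # and delete it, unless it is the last line of the file
    $!d
  }
  # a chr line (or the last line): fetch the stashed record,
  # turn its embedded newlines into tabs, and print it
  x
  s/\n/\t/gp
  d' file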
Similar question to many previous ones (including mine) but I can't find the solution. This is purely a syntax error and I cannot figure out how to make it work.
I have two files in Unix. In file1 I have 5 columns and about 6000 rows. I am trying to match rows in file2 to rows in file1 IF column 1 matches exactly AND the value in column 5 of file1 is less than 0.00000005 for that row.
file1:
SNPs Context Intergenic Risk Allele Frequency p-Value
rs9747992 Intergenic 1 0.086 2.00E-07
rs2059865 Intron 0 0.235 3.00E-07
rs117020818 Intergenic 1 0.046 7.00E-07
rs1074145 Intergenic 1 0.162 4.00E-09
file2:
snpid hg18chr bp a1 a2 zscore pval CEUmaf
rs3131972 1 742584 A G 0.289 0.7726 .
rs3131969 1 744045 A G 0.393 0.6946 .
rs3131967 1 744197 T C 0.443 0.658 .
rs1048488 1 750775 T C -0.289 0.7726 .
I can do the first part, BUT it keeps outputting a file that is larger than the first two. I am unsure if this is a real result file or one full of duplicates. I also cannot get the 'less than' comparison to work. I have tried putting it into the command as a second pattern and also piping it, as below:
awk 'FNR==NR{a[$1]=$0;next}{if ($1 in a) {print $0}}' file1 file2 > output | awk '{if (a[$5] < 0.00000005)}'
and
awk 'FNR==NR{a[$1]=$0;next}{if ($1 in a && $5 < 0.00000005)} {print $0}}' file1 file2 > output
Both times it gives me the same size file, which is much larger than either file1 or file2. If you want examples of the tables, please just say.
Tentative solution:
A tentative solution I am using is to just make a new file containing only the lines from file1 that have a value below 0.00000005. This works, though I would still like the answer to my original question for posterity.
awk '$5<=0.00000005' file1 > file11
Per my comments above, if you're using file2 as a filter list, you need to load it into the a[] array.
I've made up a small sample of how that works; the test for $5 < 0.00000005 should be easy to add, as you have it in your code.
With file data1
1 2 3 4 5 6 7
2 3 4 5 6 7 8
4 5 8 7 8 9 10
and file searchList
3
Then
awk 'FNR==NR{a[$0]=$0;next}
  FNR!=NR{ if ($2 in a) print $0}
  #dbg END{for (x in a) print "x="x " a[x]=" a[x]}
' searchList data1
gives output
2 3 4 5 6 7 8
edit: Per our conversation in the comments, my best guess without seeing your required output would be the following. I've added an extra record to file1 so there can be a match:
rs3131972 Intergenic 1 0.086 2.00E-07
awk '( FNR==NR && (sprintf("%.07f",$5) < .000000005) ) {
a[$1]=$0
#dbg print "a["$1"]="a[$1]
next
}
FNR!=NR{
#dbg print "$1="$1
if ($1 in a)print "Matched:" $0
}' file1 file2
The output is now
Matched:rs3131972 1 742584 A G 0.289 0.7726 .
IHTH
Shellter's answer is good. Mine is more about what you did wrong. Your first attempt
> awk 'FNR==NR{a[$1]=$0;next}{if ($1 in a) {print $0}}
' file1 file2 > output | awk '{if (a[$5] < 0.00000005)}'
fails because the pipeline is wrong. You need to pipe awk | awk > output, not awk > output | awk: the latter redirects everything to the file, so the last step of the pipeline receives no input and produces no output. Also, the second awk instance has no knowledge of the variables you used in the first.
Furthermore, you seem to have a recurring problem with spurious braces in awk. The general syntax is awk 'condition1 { action1 } condition2 { action2 } ...', where you can omit a condition to perform its action unconditionally, or omit an action (along with its braces) to apply the default action { print $0 }. But here, you have only an action, which is really just a condition wrapped in an if, with no side effects such as printing anything. You want to remove the braces and the if wrapper.
So you need
awk 'FNR==NR{a[$1]=$0;next}{if ($1 in a) {print $0}}' file1 file2 |
awk '$5 < 0.00000005' >output
which (in accordance with the rules for omitting a condition or an action, and with some refactoring) can be much simplified to
awk 'FNR==NR{a[$1]=$0;next}
$1 in a' file1 file2 |
awk '$5 < 0.00000005' >output
Your second attempt is closer:
> awk 'FNR==NR{a[$1]=$0;next}
{if ($1 in a && $5 < 0.00000005)} {print $0}}' file1 file2 > output
but again, you have too many braces. The closing brace after the if ruins it all! So you effectively have "if (condition)" followed by nothing (maybe this should be a syntax error!), then a new block with an unconditional print. But overall, this is much better.
awk 'FNR==NR{a[$1]=$0;next}
{if ($1 in a && $5 < 0.00000005) print $0}' file1 file2 > output
which of course can be simplified to
awk 'FNR==NR{a[$1]=$0;next}
($1 in a) && $5 < 0.00000005' file1 file2 > output
The answer that worked, based on Shellter's assistance:
awk -F $'\t' 'NR==FNR{if ($5 < 0.00000005){a[$1]=$0}} NR!=FNR{if ($1 in a) print $0}' file1 file2 > output
Thanks
I have this text file:
name, age
joe,42
jim,20
bob,15
mike,24
mike,15
mike,54
bob,21
Trying to get this (count):
joe 1
jim 1
bob 2
mike 3
Thanks,
$ awk -F, 'NR>1{arr[$1]++}END{for (a in arr) print a, arr[a]}' file.txt
joe 1
jim 1
mike 3
bob 2
Explanation:
-F, splits fields on ,
NR>1 processes only the lines after line 1 (skipping the header)
arr[$1]++ increments the counter keyed by the first column
the END{} block is executed after the whole file has been processed
for (a in arr) iterates over the keys of arr
print a, arr[a] prints each key followed by its count
Strip the header row, drop the age field, group identical names together (sort), count each run (uniq -c), and print in the desired format:
tail -n +2 txt.txt | cut -d',' -f 1 | sort | uniq -c | awk '{ print $2, $1 }'
output
bob 2
jim 1
joe 1
mike 3
It looks like you want sorted output. You could simply pipe or print into sort -nk 2:
awk -F, 'NR>1 { a[$1]++ } END { for (i in a) print i, a[i] | "sort -nk 2" }' file
Results:
jim 1
joe 1
bob 2
mike 3
However, if you have GNU awk installed, you can perform the sorting without coreutils. Here's a single-process solution that sorts the array by its values; it should still be quite quick. Run it like:
awk -f script.awk file
Contents of script.awk:
BEGIN {
  FS=","
}
NR>1 {
  a[$1]++
}
END {
  # key by (count SUBSEP name); asorti then orders these keys by count
  for (i in a) {
    b[a[i],i] = i
  }
  n = asorti(b)
  # recover the names, now in ascending count order
  for (i=1;i<=n;i++) {
    split (b[i], c, SUBSEP)
    d[++x] = c[2]
  }
  for (j=1;j<=n;j++) {
    print d[j], a[d[j]]
  }
}
Results:
jim 1
joe 1
bob 2
mike 3
Alternatively, here's the one-liner:
awk -F, 'NR>1 { a[$1]++ } END { for (i in a) b[a[i],i] = i; n = asorti(b); for (i=1;i<=n;i++) { split (b[i], c, SUBSEP); d[++x] = c[2] } for (j=1;j<=n;j++) print d[j], a[d[j]] }' file
A strictly awk solution...
BEGIN { FS = "," }
{ ++x[$1] }
END { for(i in x) print i, x[i] }
If name, age is really in the file, you could adjust the awk program to ignore it...
BEGIN { FS = "," }
/[0-9]/ { ++x[$1] }
END { for(i in x) print i, x[i] }
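Save either program to a file, say count.awk (a name chosen just for this example), and run:
awk -f count.awk file.txt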
I came up with two functions based on the answers here (the output columns are process name, %CPU, %MEM):
topcpu() {
top -b -n1 \
| tail -n +8 \
| awk '{ print $12, $9, $10 }' \
| awk '{ CPU[$1] += $2; MEM[$1] += $3 } END { for (k in CPU) print k, CPU[k], MEM[k] }' \
| sort -k2 -n \
| tail -n 10 \
| column -t \
| tac
}
topmem() {
top -b -n1 \
| tail -n +8 \
| awk '{ print $12, $9, $10 }' \
| awk '{ CPU[$1] += $2; MEM[$1] += $3 } END { for (k in CPU) print k, CPU[k], MEM[k] }' \
| sort -k3 -n \
| tail -n 10 \
| column -t \
| tac
}
$ topcpu
top 12.5 0
Xorg 6.2 1.6
ibus-x11 6.2 0.7
gnome-shell 6.2 7
chrome 6.2 74.6
adb 6.2 0.1
zsh 0 2.2
xdg-permission- 0 0.2
xdg-document-po 0 0.1
xdg-desktop-por 0 0.4
$ topmem
chrome 0 75.6
gnome-shell 6.2 7
mysqld 0 4.2
zsh 0 2.2
deluge-gtk 0 2.1
Xorg 0 1.6
scrcpy 0 1.6
gnome-session-b 0 0.8
systemd-journal 0 0.7
ibus-x11 6.2 0.7
enjoy!
tail -n +2 file.txt | cut -d',' -f 1 |
sort | uniq -c
2 bob
1 jim
1 joe
3 mike