I'm given a data set which contains several URLs like:
http://dbpedia.org/resource/Duncan_Trussell
http://dbpedia.org/resource/Ruth_Mary_Reynolds
http://dbpedia.org/resource/Abbas_Noyan
http://dbpedia.org/resource/John_E._Dowdell
http://dbpedia.org/resource/John_F._Du
Basically, I need to make a script which shows the names of the people; the result should look like this:
First name
John 2
Abbas 1
Duncan 1
Ruth 1
Last Name
Trussell 1
Reynolds 1
Noyan 1
Dowdell 1
Du 1
grep -Po 'resource/\K[^_]*(?=_)' urls.txt | uniq -c | awk '{print $2" "$1}'
Output:
Duncan 1
Ruth 1
Abbas 1
John 2
grep -Po '_\K[^_]*$' urls.txt | uniq -c | awk '{print $2" "$1}'
Output:
Trussell 1
Reynolds 1
Noyan 1
Dowdell 1
Du 1
If you want, you can replace | uniq -c | awk '{print $2" "$1}' with | awk '{count[$1]++} END{for(j in count) print j, count[j]}', which also counts correctly when identical names are not adjacent (uniq -c only collapses consecutive duplicates).
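For example, the first-name count then becomes a single pipeline (the order of the output lines is arbitrary, since awk arrays are unordered):
grep -Po 'resource/\K[^_]*(?=_)' urls.txt | awk '{count[$1]++} END{for(j in count) print j, count[j]}'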
All in awk
Split the last field on _.
Count the first and last elements of the split in associative arrays.
Then at the end print the names and their respective counts.
awk -F/ '{t=split($NF,a,"_");b[a[1]]++;c[a[t]]++}
END{print "Firstnames";for(i in b)print i,b[i]
print "\nLastnames";for(j in c) print j,c[j]
}' test
Output:
Firstnames
Duncan 1
Ruth 1
John 2
Abbas 1
Lastnames
Noyan 1
Du 1
Trussell 1
Reynolds 1
Dowdell 1
Here is another awk, splitting on both / and _ so that $5 is the first name and $NF the last:
awk -F"/|_" '{a[$5]++;b[$NF]++} END {for (i in a) print i,a[i];print "";for (i in b) print i,b[i]}' file
Duncan 1
Ruth 1
John 2
Abbas 1
Noyan 1
Du 1
Trussell 1
Reynolds 1
Dowdell 1
This should be much faster than the multi-command grep | uniq | awk pipeline.
With GNU awk for true 2-D arrays:
$ gawk -F'/' '
{
nr = split($NF,names,/_/)
cnt["First"][names[1]]++
cnt["Last"][names[nr]]++
}
END {
for (pos in cnt) {
print pos, "name"
for (name in cnt[pos]) {
print name, cnt[pos][name]
}
}
}
' file
Last name
Noyan 1
Du 1
Trussell 1
Reynolds 1
Dowdell 1
First name
Duncan 1
Ruth 1
John 2
Abbas 1
Related
I have a huge .txt file A with 6 columns. The first 5 columns are separated by a tab, but columns 5 and 6 are separated by a white space. Some columns have missing elements. Using Unix, how can I generate another file B.txt that has columns 5 & 6 separated by a tab?
A.txt
1 886620 rs201016776 T A 1:886620-TG
1 879576 rs115979567 C T
1 886817 rs111748052;rs10465241 C CATTTT JHU_1.886816
1 889255 rs150003341 C T Variant37585
1 891659 rs116557519 G A JHU_1.891658
B.txt
1 886620 rs201016776 T A 1:886620-TG
1 879576 rs115979567 C T
1 886817 rs111748052;rs10465241 C CATTTT JHU_1.886816
1 889255 rs150003341 C T Variant37585
1 891659 rs116557519 G A JHU_1.891658
Use column -t inputFile:
$ column -t A.txt > B.txt
Note that column -t pads with spaces by default; if your column is from util-linux, its -o/--output-separator option can emit a tab instead.
If the tabular form using column doesn't work, one can be explicit about the tab using awk:
$ awk '{print $1"\t"$2"\t"$3"\t"$4"\t"$5"\t"$6}' A.txt > B.txt
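If the number of columns varies from line to line, you can also let awk rebuild the whole record with a tab as the output separator instead of hard-coding field numbers; a minimal sketch (assuming no empty fields in the middle of a line, since default field splitting collapses runs of whitespace):
$ awk 'BEGIN{OFS="\t"} {$1=$1; print}' A.txt > B.txt
The assignment $1=$1 forces awk to rejoin the fields using OFS.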
You can try with GNU sed:
sed 's/[[:blank:]]/\t/g' A.txt > B.txt
I want to match columns 1 and 2 from the file for_matching against the files in different directories on the path, named */*.file, and print the entire row where those columns match.
*/*.file (example)
carrot 124555 1 2 6
hair 9922 2 2 2
tree 2223 2 1 2
for_matching
carrot 124555
output
carrot 124555 1 2 6
Right now I can just match column 1 between the two.
for i in */*.file; do awk -F, 'FNR==NR {a[$1]=$0; next}; $1 in a {print a[$1]}' $i for_matching > $i.matched; done
Using awk
awk 'FNR==NR{arr[$1,$2];next}(($1,$2) in arr)' for_matching file
Test Results:
$ cat file
carrot 124555 1 2 6
hair 9922 2 2 2
tree 2223 2 1 2
$ cat for_matching
carrot 124555
$ awk 'FNR==NR{arr[$1,$2];next}(($1,$2) in arr)' for_matching file
carrot 124555 1 2 6
The same works with multiple files; there is no need for ls */*.file:
#!/usr/bin/env bash
for i in */*.file; do
awk 'FNR==NR{arr[$1,$2];next}(($1,$2) in arr)' for_matching "$i" >"$i.matched"
done
That is so simple that you could just:
$ grep -F -w -f for_matching file
carrot 124555 1 2 6
For a limitation, see karakfa's comment below.
That could of course be circumvented with the following:
$ cat file
carrot 124555 1 2 6
1 carrot 124555 1 2 6
$ grep -w -f <(sed 's/^/^/g' for_matching) file
carrot 124555 1 2 6
I have two files which both have over a million lines and I want to print out the match from the first column of file 1 with the second of file 2.
file_1
SNP_A-2131660 1 0.0021
SNP_A-1967418 2 0.0005
SNP_A-1969580 2 0.0011
SNP_A-4263484 1 0.0024
SNP_A-1978185 0 0.0014
SNP_A-4264431 0 0.0038
SNP_A-1980898 2 0.0050
SNP_A-1983139 0 0.0011
file_2
"AFFX-SNP_10000979" "rs4147951"
"AFFX-SNP_10009702" "rs2022235"
"SNP_A-2131660" "rs6425720"
"AFFX-SNP_10021569" "rs12997193"
"AFFX-SNP_10026879" "rs9933410"
"AFFX-SNP_10029725" "rs7142489"
"AFFX-SNP_10034687" "rs1350088"
matches.txt
"SNP_A-2131660" "rs6425720"
Right now I am doing this, but it is so slow. Is there a faster way?
awk '{print $1}' file_1 | while read -r a; do grep -H $a file_2; done >> matches.txt
Could you please try the following awk:
awk 'FNR==NR{a[$1]=$0;next} {val=$1;gsub(/\"/,"",val)} (val in a)' file_1 file_2
If you want to redirect the above command's output into a file, append > output.txt (or whatever name you like for the output file) to the end of the command.
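For example:
awk 'FNR==NR{a[$1]=$0;next} {val=$1;gsub(/\"/,"",val)} (val in a)' file_1 file_2 > output.txt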
Using awk, setting the field separator FS='"' after reading the first file:
awk 'FNR==NR{a[$1];next}$2 in a' file1 FS='"' file2
# OR
awk -v q='"' 'FNR==NR{a[q $1 q];next}$1 in a' file1 file2
It should be faster than the other solutions posted so far, since it does not sort the files, strip the double quotes, or join them.
Test Results:
$ cat f1
SNP_A-2131660 1 0.0021
SNP_A-1967418 2 0.0005
SNP_A-1969580 2 0.0011
SNP_A-4263484 1 0.0024
SNP_A-1978185 0 0.0014
SNP_A-4264431 0 0.0038
SNP_A-1980898 2 0.0050
SNP_A-1983139 0 0.0011
$ cat f2
"AFFX-SNP_10000979" "rs4147951"
"AFFX-SNP_10009702" "rs2022235"
"SNP_A-2131660" "rs6425720"
"AFFX-SNP_10021569" "rs12997193"
"AFFX-SNP_10026879" "rs9933410"
"AFFX-SNP_10029725" "rs7142489"
"AFFX-SNP_10034687" "rs1350088"
$ awk 'FNR==NR{a[$1];next}$2 in a' f1 FS='"' f2
"SNP_A-2131660" "rs6425720"
# OR
$ awk -v q='"' 'FNR==NR{a[q $1 q];next}$1 in a' f1 f2
"SNP_A-2131660" "rs6425720"
One possible tool is join, but you need some preprocessing and some post-processing:
sort < file1 > file1_sorted
sort < file2 | sed 's/^"\([^"]*\)"/\1/' > file2_sorted
join file1_sorted file2_sorted -1 1 -2 1
Gives:
SNP_A-2131660 1 0.0021 "rs6425720"
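If you also want the post-processing step of stripping the remaining quotes from the joined output, one way is to pipe through tr, e.g.:
join file1_sorted file2_sorted -1 1 -2 1 | tr -d '"'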
Let's say I have some tab-separated data:
Peter 5
Joe 8
Peter 7
Peter 8
Joe 4
Laura 3
And I want to sort it by the number of times a name occurs in the first column (max to min)
So we'd have Peter (3 occurrences), Joe (2 occurrences), and Laura (1 occurrence):
Peter 5
Peter 7
Peter 8
Joe 8
Joe 4
Laura 3
It only needs to be sorted by the first column, not the second. I've been reading sort's documentation, and I don't think it has this functionality. Does anyone have an easy method?
Not pretty, but it works for your example:
awk 'NR==FNR{a[$1]++;next}{ print a[$1],$0}' file file|sort -nr|sed -r 's/[0-9]* //'
test with your data:
kent$ cat n.txt
Peter 5
Joe 8
Peter 7
Peter 8
Joe 4
Laura 3
kent$ awk 'NR==FNR{a[$1]++;next}{ print a[$1],$0}' n.txt n.txt|sort -nr|sed -r 's/[0-9]* //'
Peter 8
Peter 7
Peter 5
Joe 8
Joe 4
Laura 3
This works:
for person in $(awk '{print $1}' file.txt | sort | uniq -c | sort -dnr | awk '{print $2}');
do grep -e "^$person[[:space:]]" file.txt;
done
Here's one way using GNU awk. Run like:
awk -f script.awk file
Contents of script.awk:
BEGIN {
FS="\t"
}
{
c[$1]++                               # count occurrences of each name
r[$1] = (r[$1] ? r[$1] ORS : "") $0   # collect the full lines for each name
}
END {
for (i in c) {
a[c[i],i] = i                         # key: count SUBSEP name
}
n = asorti(a)                         # sort those keys into a[1..n]
for (i=1;i<=n;i++) {
split(a[i], b, SUBSEP)
x[++j] = b[2]                         # names in ascending order of count
}
for (i=n;i>=1;i--) {
print r[x[i]]                         # print each name's lines, highest count first
}
}
Results:
Peter 5
Peter 7
Peter 8
Joe 8
Joe 4
Laura 3
That's a surprisingly hard sort criterion. This code works, but it is pretty ugly:
data=${1:-data}
awk '{ print $1 }' $data |
sort |
uniq -c |
sort -k2 |
join -1 2 -2 2 -o 1.1,2.1,2.2,2.3 - <(awk '{ print NR, $0 }' $data | sort -k2) |
sort -k1,1nr -k3,3 -k2n |
awk 'BEGIN{OFS="\t"} { print $3, $4 }'
It relies on process substitution in bash but doesn't use any sorting built into awk (that's a GNU extension compared with POSIX awk). With an explicit temporary file, it can be made to work in shells without process substitution.
data=${1:-data} # File named on command line, or uses name 'data'
awk '{ print $1 }' $data | # List of names
sort | # Sorted list of names
uniq -c | # Count occurrences of each name
sort -k2 | # Sort in name order
join -1 2 -2 2 -o 1.1,2.1,2.2,2.3 - <(awk '{ print NR, $0 }' $data | sort -k2) |
# The process substitution numbers each record in sequence and sorts in name order
# The join matches the names (column 2) and outputs the frequency, record number, name, value
sort -k1,1nr -k3,3 -k2n | # Sort on frequency reversed, name, original line number
awk 'BEGIN{OFS="\t"} { print $3, $4 }' # Print name and value
Using GNU awk with a built-in sort, or Perl or Python, is probably better than this; a gawk sketch follows after the sample outputs below.
For the original data, the output is:
Peter 5
Peter 7
Peter 8
Joe 8
Joe 4
Laura 3
Given this extended version of the data:
Peter 5
Joe 8
Peter 7
Peter 8
Joe 4
Laura 3
Peter 50
Joe 80
Peter 70
Peter 80
Joe 40
Laura 30
Peter 700
Peter 800
Peter 7002
Peter 8002
Peter 7000
Peter 8000
Peter 7001
Peter 8001
Pater 50
Jae 80
Pater 70
Pater 80
Jae 40
Laura 30
The output is:
Peter 5
Peter 7
Peter 8
Peter 50
Peter 70
Peter 80
Peter 700
Peter 800
Peter 7002
Peter 8002
Peter 7000
Peter 8000
Peter 7001
Peter 8001
Joe 8
Joe 4
Joe 80
Joe 40
Laura 3
Laura 30
Laura 30
Pater 50
Pater 70
Pater 80
Jae 80
Jae 40
The -k3,3 sort term is necessary for this data set; it sorts Laura's entries before Pater's entries (when omitted, you get those two lists interleaved).
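As for the GNU awk built-in sort mentioned above, here is a minimal sketch (assuming gawk 4+ for PROCINFO["sorted_in"]) that groups the lines per name and prints the groups by descending count:
gawk '
{ cnt[$1]++; rows[$1] = (rows[$1] ? rows[$1] ORS : "") $0 }  # count names, collect their lines
END {
PROCINFO["sorted_in"] = "@val_num_desc"                      # iterate cnt by value, largest first
for (name in cnt) print rows[name]
}
' data
Ties between names with the same count come out in an unspecified order.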
Here's another one using awk:
awk '{ a[ $1, ++b[$1] ]=$0 ; if(b[$1]>max) max=b[$1] }   # store lines keyed by (name, per-name index); track the largest count
END{ for(x=max;x>=1;x--)                                 # counts from largest to smallest
for( k in b )                                            # for each name
if( a[k,x] )                                             # a name whose count is x (groups already printed were deleted)
for(y=1;y<=x;y++) {
print a[k,y]
delete a[k,y]                                            # delete so the group is not printed again at a lower x
}
}' filename
It works fine with gawk and POSIX awk. The presence of three loops in the END statement might affect performance with big files.
Suppose we feed the filter's standard input with these lines:
line 1
line 2
line 3
line 4
line 5
line 6
line 7
line 8
line 9
line 10
Could someone tell me how to write a script that prints only one line out of every 4? For the example input above, that would be:
line 1
line 5
line 9
$ yes | cat -n | head -10 | awk 'NR % 4 == 1'
1 y
5 y
9 y
That is, your answer is "awk 'NR % 4 == 1'".
awk '{ if ((NR-1) % 4 == 0) print }'
awk 'NR%4 == 1 {print}'</etc/hosts
Replace 4 with whatever value you want, of course.
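If you prefer to pass the step as a parameter, awk -v works too; a small sketch:
awk -v step=4 'NR % step == 1' file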
sed -ne '1~4p'
(GNU sed. Not tested on OSX, etc.)