How to retrieve unique IDs from the txt file? - unix

I have a text file containing sequence IDs. The file contains some duplicate IDs, and a few IDs are present more than 2 times. I want to write the unique IDs to one file and the repeated IDs to another file. Furthermore, I am also interested in how many times each repeated ID is present in the file.
I found the duplicated sequences using the following command:
$ cat id.txt | grep '^>' | sort | uniq -d > dupid.txt
This gives me the duplicated sequences in the "dupid.txt" file. But how do I get the ones that are present more than 2 times, and how many times they are present? Secondly, how do I find the unique sequences?

A bit of searching might have found this answer, with many suggestions on traditional uses of uniq.
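For reference, the traditional uniq pipeline already covers all three cases; a minimal sketch (the output file names here are just placeholders):
$ grep '^>' id.txt | sort | uniq -u > unique_ids.txt             # IDs that occur exactly once
$ grep '^>' id.txt | sort | uniq -d > duplicate_ids.txt          # IDs that occur more than once
$ grep '^>' id.txt | sort | uniq -c | sort -rn > id_counts.txt   # every ID with its count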
Also, note that:
$ cat id.txt | grep '^>'
...is basically the same as:
$ grep '^>' id.txt
The so-called "Useless Use Of Cat"
But to your question - finding unique IDs, dupes, and dupes with counts - here's a try using awk that processes its stdin and writes to three output files the user must name, trying to avoid clobbering output files that already exist. It makes one pass, but holds all input in memory before writing any output.
#!/bin/bash
[ $# -eq 3 ] || { echo "Usage: $(basename $0) <uniqs> <dupes> <dupes_counts>" 1>&2; exit 1; }
chk() {
[ -e "$1" ] && { echo "$1: already exists" 1>&2; return 1; }
return $2
}
chk "$1" 0; chk "$2" $?; chk "$3" $? || exit 1
awk -v u="$1" -v d="$2" -v dc="$3" '
{
idc[$0]++
}
END {
for (id in idc) {
if (idc[id] == 1) {
print id >> u
} else {
print id >> d
printf "%d:%s\n", idc[id], id >> dc
}
}
}
'
Save as (for example) "doit.sh", make it executable (chmod +x doit.sh), and then invoke via:
$ grep '^>' id.txt | ./doit.sh uniques.txt dupes.txt dupes_counts.txt
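If you specifically want the IDs that occur more than 2 times, you can post-process the counts file, since each of its lines has the form count:id (a small sketch using the file names from the example invocation):
$ awk -F: '$1 > 2' dupes_counts.txt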

Related

Break the nested while loops in unix scripting

I have two files:
file1 contains the keywords - INFO ERROR
file2 contains the list of log file paths - path1 path2
I need to exit the script if any condition in any of the loops fails.
Here is the Code:
#!/bin/bash
RC=0
while read line
do
echo "grepping from the file $line
if [ -f $line ]; then
while read key
do
echo "searching $key from the file $line
if [ condition ]; then
RC=0;
else
RC=1;
break;
fi
done < /apps/file1
else
RC=1;
break;
fi
done < /apps/file2
exit $RC
Thank you!
The answer to your question is to use break 2:
while true; do
sleep 1
echo "outer loop"
while true; do
echo "inner loop"
break 2
done
done
I never use this myself; it is terrible when you later want to understand or modify the code.
A slightly better option is using a boolean flag:
found_master=
while [ -z "${found_master}" ]; do
sleep 1
echo "outer loop"
while true; do
echo "inner loop"
found_master=true
break
done
done
When you do not otherwise need the variable, found_master is an ugly additional variable.
You can use a function instead:
inner_loop() {
local i=0;
while ((i++ < 5)); do
((random=$RANDOM%5))
echo "Inner $i: ${random}"
if [ ${random} -eq 0 ]; then
echo "Returning 0"
return 0
fi
done;
return 1;
}
j=0
while ((j++ < 5 )); do
echo "Out loop $j"
inner_loop
if [ $? -eq 0 ]; then
echo "inner look broken"
break
fi
done
But your original problem can be handled without two while loops.
You can combine the keywords and use grep -E "INFO|ERROR" file2, or, when the keywords are on different lines in file1, use grep -f file1 file2.
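For instance, a rough single-loop sketch of that idea (assuming, as in the question, that /apps/file1 holds the keywords and /apps/file2 the list of log paths; the exit logic is one reading of the requirement):
#!/bin/bash
# exit with failure as soon as a log file is missing
# or contains none of the keywords listed in /apps/file1
while read -r logfile; do
    [ -f "$logfile" ] || exit 1
    grep -qf /apps/file1 "$logfile" || exit 1
done < /apps/file2
exit 0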
Replace condition with a test on $(grep -c ${key} ${line}), like this:
echo "searching $key from the file $line
if [ $(grep -c ${key} ${line}) -eq 0 ]; then
It counts each keyword in your log file. If the count is 0 (the pattern was not found), the then branch runs; if at least one match was found, the else branch runs, so RC=1 and the loop exits.
Also be sure that your keywords cannot occur as substrings of longer words, or you will get misleading counts.
Example:
[sahaquiel@sahaquiel-PC Stackoverflow]$ cat file
correctstringERROR and more useless text
ERROR thats really error string
[sahaquiel@sahaquiel-PC Stackoverflow]$ grep -c ERROR file
2
If you wish to avoid the count of 2 (counting the first string is obviously the wrong result here), you should also add two options to grep:
[sahaquiel@sahaquiel-PC Stackoverflow]$ grep -cow ERROR file
1
Now you have counted only whole words equal to your key, not substrings inside longer strings.

Loop over environment variables in POSIX sh

I need to loop over environment variables and get their names and values in POSIX sh (not bash). This is what I have so far.
#!/usr/bin/env sh
# Loop over each line from the env command
while read -r line; do
# Get the string before = (the var name)
name="${line%=*}"
eval value="\$$name"
echo "name: ${name}, value: ${value}"
done <<EOF
$(env)
EOF
It works most of the time, except when an environment variable contains a newline. I need it to work in that case.
I am aware of the -0 flag for env that separates variables with NUL instead of newlines, but if I use that flag, how do I loop over each variable? Edit: @chepner pointed out that POSIX env doesn't support -0, so that's out.
Any solution that uses portable linux utilities is good as long as it works in POSIX sh.
There is no way to parse the output of env with complete confidence; consider this output:
bar=3
baz=9
I can produce that with two different environments:
$ env -i "bar=3" "baz=9"
bar=3
baz=9
$ env -i "bar=3
> baz=9"
bar=3
baz=9
Is that two environment variables, bar and baz, with simple numeric values, or is it one variable bar with the value $'3\nbaz=9' (to use bash's ANSI quoting style)?
You can safely access the environment with POSIX awk, however, using the ENVIRON array. For example:
awk 'END { for (name in ENVIRON) {
print "Name is "name;
print "Value is "ENVIRON[name];
}
}' < /dev/null
With this command, you can distinguish between the two environments mentioned above.
$ env -i "bar=3" "baz=9" awk 'END { for (name in ENVIRON) { print "Name is "name; print "Value is "ENVIRON[name]; }}' < /dev/null
Name is baz
Value is 9
Name is bar
Value is 3
$ env -i "bar=3
> baz=9" awk 'END { for (name in ENVIRON) { print "Name is "name; print "Value is "ENVIRON[name]; }}' < /dev/null
Name is bar
Value is 3
baz=9
Maybe this would work?
#!/usr/bin/env sh
env | while IFS= read -r line
do
name="${line%%=*}"
indirect_presence="$(eval echo "\${$name+x}")"
[ -z "$name" ] || [ -z "$indirect_presence" ] || echo "name:$name, value:$(eval echo "\$$name")"
done
It is not bullet-proof: if the value of a variable contains a newline and one of those continuation lines happens to look like an assignment, the loop can be confused.
The expansion uses %% to remove the longest matching suffix, so if a line contains several = signs, everything from the first = onward is removed, leaving only the variable name at the beginning of the line.
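A quick illustration of the difference between the two forms, using a made-up value:
$ line='FOO=bar=baz'
$ echo "${line%=*}"     # shortest suffix removed
FOO=bar
$ echo "${line%%=*}"    # longest suffix removed
FOO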
Here an example based on the awk approach:
#!/bin/sh
for NAME in $(awk "END { for (name in ENVIRON) { print name; }}" < /dev/null)
do
VAL="$(awk "END { printf ENVIRON[\"$NAME\"]; }" < /dev/null)"
echo "$NAME=$VAL"
done

How to use awk for multiple file search in two directories, print records only from files with matching string in second directory

I've remade a previous question so that it is clearer. I'm trying to search files in two directories and print matching character strings (plus the line immediately following) into a new file from the second directory, but only if they match a record in the first directory. I have found similar examples but nothing quite the same. I don't know how to use awk across multiple files from different directories and I've tortured myself trying to figure it out.
Directory 1, 28,000 files, formatted viz.:
>ABC
KLSDFIOUWERMSDFLKSJDFKLSJDSFKGHGJSNDKMVMFHKSDJFS
>GHI
OOILKJSDFKJSDFLMOPIWERIOUEWIRWIOEHKJTSDGHLKSJDHGUIYIUSDVNSDG
Directory 2, 15 files, formatted viz.:
>ABC
12341234123412341234123412341234123412341234123412341234123412341234
>DEF
12341234123412341234123412341234
>GHI
12341234123412341234123412341234123412341234123412341234123412341234123412341234
Desired output:
>ABC
12341234123412341234123412341234123412341234123412341234123412341234
>GHI
12341234123412341234123412341234123412341234123412341234123412341234123412341234
Directories 1 and 2 are located in my home directory: (./Test1 & ./Test2)
If anyone could advise a command to specify the different directories, I'd be immensely grateful! Currently, when I include a file path (e.g., /Test1/*.fa), I get the following error:
awk: can't open file /Test1/*.fa
You'll want something like this (untested):
awk '
FNR==1 {
dirname = FILENAME
sub("/.*","",dirname)
if (NR==1) {
dirname1 = dirname
}
}
dirname == dirname1 {
if (FNR % 2) {
key = $0
}
else {
map[key] = $0
}
next
}
(FNR % 2) && ($0 in map) && !seen[$0,map[$0]]++ {
print $0 ORS map[$0]
}
' Test1/* Test2/*
Given that you're getting the error message /usr/bin/awk: Argument list too long, which means you're exceeding your shell's maximum argument length for a command, and that 28,000 of your files are in the Test1 directory, try this:
find Test1 -type f -exec cat {} \; |
awk '
NR == FNR {
if (FNR % 2) {
key = $0
}
else {
map[key] = $0
}
next
}
(FNR % 2) && ($0 in map) && !seen[$0,map[$0]]++ {
print $0 ORS map[$0]
}
' - Test2/*
Solution in TXR:
Data:
$ ls dir*
dir1:
file1 file2
dir2:
file1 file2
$ cat dir1/file1
>ABC
KLSDFIOUWERMSDFLKSJDFKLSJDSFKGHGJSNDKMVMFHKSDJFS
>GHI
OOILKJSDFKJSDFLMOPIWERIOUEWIRWIOEHKJTSDGHLKSJDHGUIYIUSDVNSDG
$ cat dir1/file2
>XYZ
SDOIWEUROIUOIWUEROIWUEROIWUEROIWUEROUIEIDIDIIDFIFI
>MNO
OOIWEPOIUWERHJSDHSDFJSHDF
$ cat dir2/file1
>ABC
12341234123412341234123412341234123412341234123412341234123412341234
>DEF
12341234123412341234123412341234
>GHI
12341234123412341234123412341234123412341234123412341234123412341234123412341234
$ cat dir2/file2
>STP
12341234123412341234123412341234123412341234123412341234123412341234123412341234
>MNO
123412341234123412341234123412341234123412341234123412341234123412341234
$
Run:
$ txr filter.txr dir1/* dir2/*
>ABC
12341234123412341234123412341234123412341234123412341234123412341234
>GHI
12341234123412341234123412341234123412341234123412341234123412341234123412341234
>MNO
123412341234123412341234123412341234123412341234123412341234123412341234
Code in filter.txr:
#(bind want #(hash :equal-based))
#(next :args)
#(all)
#dir/#(skip)
#(and)
# (repeat :gap 0)
#dir/#file
# (next `#dir/#file`)
# (repeat)
>#key
# (do (set [want key] t))
# (end)
# (end)
#(end)
#(repeat)
#path
# (next path)
# (repeat)
>#key
#datum
# (require [want key])
# (output)
>#key
#datum
# (end)
# (end)
#(end)
To separate the dir1 paths from the rest, we use an #(all) match (try multiple pattern branches, which must all match) with two branches. The first branch matches one #dir/#(skip) pattern, binding the variable dir to the text before the slash and ignoring the rest. The second branch matches a whole consecutive sequence of #dir/#file patterns via #(repeat :gap 0). Because the same dir variable appears with a binding already established by the first branch of the #(all), this constrains the matches to the same directory name.

Inside this repeat we recurse into each file via next and gather the >-delimited keys into the want hash. After that, we process the remaining arguments as path names of files to process; they don't all have to be in the same directory. We scan through each one for the >#key pattern followed by a line of #datum. The #(require ...) directive will fail the match if key is not in the want hash; otherwise we fall through to the #(output).

How to exclude parent Unix processes from grepped output from ps

I have got a file of pids and am using ps -f to get information about the pids.
Here is an example..
ps -eaf | grep -f myfilename
myuser 14216 14215 0 10:00 ? 00:00:00 /usr/bin/ksh /home/myScript.ksh
myuser 14286 14216 0 10:00 ? 00:00:00 /usr/bin/ksh /home/myScript.ksh
where myfilename contains only 14216.
I've got a tiny problem where the output is giving me parent process IDs as well as the child. I want to exclude the line for the parent process ID.
Does anyone know how I could modify my command to exclude the parent process, keeping in mind that I could have many process IDs in my input file?
Hard to do with just grep but easy to do with awk.
Invoke the awk script below from the following command:
ps -eaf | awk -f script.awk myfilename -
Here's the script:
# process the first file on the command line (aka myfilename)
# this is the list of pids
ARGIND == 1 {
pids[$0] = 1
}
# second and subsequent files ("-"/stdin in the example)
ARGIND > 1 {
# is column 2 of the ps -eaf output [i.e.] the pid in the list of desired
# pids? -- if so, print the entire line
if ($2 in pids)
printf("%s\n",$0)
}
UPDATE:
When using GNU awk (gawk), the following may be ignored. For other [obsolete] versions, insert the following code at the top:
# work around old, obsolete versions
ARGIND == 0 {
defective_awk_flag = 1
}
defective_awk_flag != 0 {
if (FILENAME != defective_awk_file) {
defective_awk_file = FILENAME
ARGIND += 1
}
}
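If you would rather not depend on ARGIND at all, a common portable idiom (a sketch, not the script above) is to test NR == FNR, which only holds while awk is reading the first file on its command line:
ps -eaf | awk 'NR == FNR { pids[$0] = 1; next } ($2 in pids)' myfilename -
Here the first file (myfilename) fills the pids array, and every later line is printed only if its second column (the PID) is in that array.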
UPDATE #2:
The above is all fine. Just for fun, here's an alternate way to do the same thing with perl. One of the advantages is that everything can be contained in the script and no pipeline is necessary.
Invoke the script via:
./script.pl myfilename
And, here's script.pl. Note: I don't write idiomatic perl. My style is more akin to what one would expect to see in other languages like C, javascript, etc.:
#!/usr/bin/perl
master(@ARGV);
exit(0);
# master -- master control
sub master
{
my(@argv) = @_;
my($xfsrc);
my($pidfile);
my($buf);
# NOTE: "chomp" is a perl function that strips newlines
# get filename with list of pids (e.g. myfilename)
$pidfile = shift(@argv);
open($xfsrc,"<$pidfile") ||
die("master: unable to open '$pidfile' -- $!\n");
# create an associative array (a 'hash' in perl parlance) of the desired
# pid numbers
while ($pid = <$xfsrc>) {
chomp($pid);
$pid_desired{$pid} = 1;
}
close($xfsrc);
# run the 'ps' command and capture its output into an array
@pslist = (`ps -eaf`);
# process the command output, line-by-line
foreach $buf (@pslist) {
chomp($buf);
# the pid number we want is in the second column
(undef,$pid) = split(" ",$buf);
# print the line if the pid is one of the ones we want
print($buf,"\n")
if ($pid_desired{$pid});
}
}
Use this command:
ps -eaf | grep -f myfilename | grep -v grep | grep -f myfilename

grep for a string in a line if the previous line doesn't contain a specific string

I have the following lines in a file:
abcdef ghi jkl
uvw xyz
I want to grep for the string "xyz" if the previous line does not contain the string "jkl".
I know how to grep for a string if the line doesn't contain a specific string using the -v option. But I don't know how to do this across different lines.
grep is really a line-oriented tool. It might be possible to achieve what you want with it, but it's easier to use Awk:
awk '
/xyz/ && !skip { print }
{ skip = /jkl/ }
' file
Read as: for every line, do
if the current line matches xyz and we haven't just seen jkl, print it;
set the variable skip to indicate whether we've just seen jkl.
sed '/jkl/{N;d}; /xyz/!d'
If jkl is found, delete that line and the next one.
Print only the remaining lines that contain xyz.
I think you're better off using an actual programming language, even a simple one like Bash or AWK or sed. For example, using Bash:
(
previous_line_matched=
while IFS= read -r line ; do
if [[ ! "$previous_line_matched" && "$line" == *xyz* ]] ; then
echo "$line"
fi
if [[ "$line" == *jkl* ]] ; then
previous_line_matched=1
else
previous_line_matched=
fi
done < input_file
)
Or, more tersely, using Perl:
perl -ne 'print if m/xyz/ && ! $skip; $skip = m/jkl/' < input_file
