Parameterize UNIX statements

I have a bunch of UNIX statements that I want to run in a loop, using parameterized values.
more /var/xacct_data/xxxx/log_flattener/xxxx/logfile_current | grep " F " //E,I,D
**xxxx = mpay,mmg,tvr**
/var/xacct_data/faff/faff1/log_flattener/faffsnp1/logfile_current | grep " F " //E, I, D also
/var/xacct_data/faff/faff1/log_flattener/faffdbt1/logfile_current | grep " F " //E, I, D also
/var/xacct_data/faff/faff1/log_flattener/fafftxn1/logfile_current | grep " F " //E, I, D also
/var/xacct_data/faff/faff2/log_flattener/faffdbt2/logfile_current | grep " F " //E, I, D also
I want to store these paths in a file, read them from the file in a UNIX shell script, and run the commands on them while substituting some values in each path.
For example, in the topmost path in the code block above, I want to replace the xxxx with the three values given: mpay, mmg and tvr. How do I go about it?
For every grep " F " I also want to use E, I and D as parameters for the current path. How do I do that?

The left part of the pipe seems truncated, but for the grep side, I think you are looking for
... | grep " [FEID] "

This should get you started. I won't write the entire script for you.
In bash, zsh, etc...
for directory in mpay mmg tvr; do
  for char in F E I D; do
    echo "Looking for lines containing ${char} in ${directory} directory..."
    # match the flag with surrounding spaces, as in the original grep " F "
    grep " ${char} " /var/xacct_data/${directory}/log_flattener/${directory}/logfile_current
  done
done
No need for more here. grep takes a filename as input.
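To read the paths from a file as described in the question, here is a minimal sketch. It assumes a hypothetical paths.txt holding one path per line, with xxxx left as a placeholder; the bash substitution ${path//xxxx/$dir} fills it in:
while IFS= read -r path; do
  for dir in mpay mmg tvr; do
    for char in F E I D; do
      # replace every xxxx in the stored path with the current value
      grep " ${char} " "${path//xxxx/$dir}"
    done
  done
done < paths.txt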

Related

Grep to count occurrences of file A in file B

I have two files; each line of file A may occur in file B, and I would like to count, for each line in file A, how many times it occurs in file B. For example:
File A:
GAGGACAGACTACTAAAGCC
CTTGCCGCAGATTATCAGAG
CCAGCTTGATGTGTCCTGTG
TGATAGGCAGTGGAACACTG
File B:
NTCTTGAGGAAAGGACGAATCTGCGGAGGACAGACTACTAAAGCCGTTTGAGAGCTAGAACGAGCAAGTTAAGAGA
TCTTGAGGAAAGGACGAAACTCCGGAGGACAGACTACTAAAGCCGTTTTAGAGCTAGAAAGCGCAAGTTAAACGAC
NTCTTGAGGAAAGGACGAATCTGCGCTTGCCGCAGATTATCAGAGGTATGAGAGCTAGAACGAGCAAGTTAAGAGC
TCTTGAGGAAAGGACGAAAGTGCGCTTGCCGCAGATTATCAGAGGTTTTAGAGCTAGAAAGAGCAAGTTAAAATAA
GATCTAGTGGAAAGGACGATTCTCCGCTTGCCGCAGATTATCAGAGGTTGTAGAGCTAGAACTAGCAAGTGACAAG
ATCTTGAGGAAAGGACGAATCTGCGCTTGCCGCAGATTATCAGAGGTTTGAGAGCTAGAACTAGCAAGTTAATAGA
CGATCAAGTGGAAGGACGATTCTCCGTGATAGGCAGTGGAACACTGGATGTAGAGCTAGAAATAGCAAGTGAGCAG
ATCTAGAGGAAAGGACGAATCTCCGTGATAGGCAGTGGAACACTGGTATGAGAGCTAGAACTAGCAAGTTAATAGA
TCTTGAGGAAAGGACGAAACTCCGTGATAGGCAGTGGAACACTGGTTTTAGAGCTAGAAAGCGCAAGTTAAAAGAC
And the output should be File C:
2 GAGGACAGACTACTAAAGCC
4 CTTGCCGCAGATTATCAGAG
0 CCAGCTTGATGTGTCCTGTG
3 TGATAGGCAGTGGAACACTG
I would like to do this using grep, and I've tried a few variations of -c, -o and -f, but I can't seem to get the right output.
How can I achieve this?
Try this:
for i in $(cat a); do echo "$i $(grep -c "$i" b)"; done
In this case, if a line from file A occurs several times within one line of file B, it is counted as a single occurrence. If you want to count every (non-overlapping) occurrence instead, use this:
for i in $(cat a); do printf '%s ' "$i"; grep -o "$i" b | wc -l; done
And maybe this variant would be quicker:
grep -o "$(cat a)" b | sort | uniq -c
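A close variant reads the patterns straight from file A using grep's -f option; note that, like the command above, it omits patterns that never match:
grep -o -f a b | sort | uniq -c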
#!/usr/bin/perl
open A, "A";                    # open file "A" to handle A
open B, "B";                    # open file "B" to handle B
chomp(@keys = <A>);             # read keys to array, strip line-feeds
@counts{@keys} = (0) x @keys;   # initialize hash counts for keys
while (<B>) {                   # iterate file handle B line by line
    foreach $k (@keys) {        # iterate keys array
        if (/$k/) {             # if key matches line
            $counts{$k}++;      # increase count for key by one
        }
    }
}
print "$counts{$_} $_\n" for (keys %counts);
Linux command to compare files:
comm FileA FileB
comm produces three-column output. Column one contains lines unique to FileA, column two contains lines unique to FileB, and column three contains lines common to both files. Note that comm expects both input files to be sorted.
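Since the sample files here are not sorted, a small sketch (assuming a shell with process substitution, such as bash) that sorts them on the fly:
comm <(sort FileA) <(sort FileB)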

Use awk to replace word in file

I have a file with some lines:
a
b
c
d
I would like to cat this file into an awk command to produce something like this:
letter is a
letter is b
letter is c
letter is d
using something like this:
cat file.txt | awk 'letter is $1'
But it's not printing out as expected:
$ cat raw.txt | awk 'this is $1'
a
b
c
d
At the moment, you have no { action } block, so your condition evaluates the two empty variables this and is, concatenating them with the first field $1, and checks whether the result is true (a non-empty string). It is, so the default action prints each line.
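To see the default action in isolation: any true condition with no { action } block prints the record, which is why this one-character program copies its input unchanged:
awk '1' raw.txt
This is equivalent to awk '{ print }' raw.txt.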
It sounds like you want to do this instead:
awk '{ print "letter is", $1 }' raw.txt
Although in this case, you might as well just use sed:
sed 's/^/letter is /' raw.txt
This command matches the start of each line and adds the string.
Note that I'm passing the file as an argument, rather than using cat with a pipe.
Not sure if you wanted sed or awk, but this is in awk:
$ awk '{print "letter is " $1}' file
letter is a
letter is b
letter is c
letter is d

unix split FASTA using a loop, awk and split

I have a long list of data organised as below (INPUT).
I want to split the data up so that I get an output as below (desired OUTPUT).
The code below first identifies all the lines containing ">gi" and saves their line numbers in a variable called B.
Then, in a new file, it should replace those lines from B with a shortened version of the text following the ">gi".
I figured the easiest way would be to split at "|"; however, this does not work (no separation happens with my code if I replace " " with "|").
My code below does split nicely after the " " if I replace the "|" with " " in the INPUT, but I get into trouble when I want the text between the [ ] brackets, which is not always there and is not always only two words...
B=$( grep -n ">gi" 1VAO_1DII_5fxe_all_hits_combined.txt | cut -d : -f 1)
awk <1VAO_1DII_5fxe_all_hits_combined.txt >seqIDs_1VAO_1DII_5fxe_all_hits_combined.txt -v lines="$B" '
BEGIN {split(lines, a, " "); for (i in a) change[a[i]]=1}
NR in change {$0 = ">" $4}
1
'
Let me know if more explanation is needed!
INPUT:
>gi|9955361|pdb|1E0Y|A:1-560 Chain A, Structure Of The D170sT457E DOUBLE MUTANT OF VANILLYL- Alcohol Oxidase
MSKTQEFRPLTLPPKLSLSDFNEFIQDIIRIVGSENVEVISSKDQIVDGSYMKPTHTHDPHHVMDQDYFLASAIVA
>gi|557721169|dbj|GAD99964.1|:1-560 hypothetical protein NECHADRAFT_63237 [Byssochlamys spectabilis No. 5]
MSETMEFRPMVLPPNLLLSEFNGFIRETIRLVGCENVEVISSKDQIHDGSYMDPRHTHDPHHIMEQDYFLASAIVAPRNV
desired OUTPUT:
>1E0Y
MSKTQEFRPLTLPPKLSLSDFNEFIQDIIRIVGSENVEVISSKDQIVDGSYMKPTHTHDPHHVMDQDYFLASAIVAPRNV
>GAD99964.1 Byssochlamys spectabilis No. 5
MSETMEFRPMVLPPNLLLSEFNGFIRETIRLVGCENVEVISSKDQIHDGSYMDPRHTHDPHHIMEQDYFLASAIVA
This can be done in one step with awk (gnu awk):
awk -F'|' '/^>gi/{a=1;match($NF,/\[([^]]*)]/, b);print ">"$4" "b[1];next}a{print}!$0{a=0}' input > output
In a more readable way:
/^>gi/ {                   # when the line starts with ">gi"
    a=1;                   # set flag "a" to 1
    # extract the optional part between brackets in the last field
    match($NF, "\\[([^]]*)]", b);
    print ">" $4 " " b[1]; # print the shortened header line
    next                   # jump to the next record
}
a { print }                # while flag "a" is set, print the line
!$0 { a=0 }                # on an empty line, clear "a" to stop printing
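To run the readable version, one can save it to a file (a hypothetical split_fasta.awk) and pass it to gnu awk with -f:
awk -F'|' -f split_fasta.awk input > output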

How to separate unique characters from several words in an "indic" text file?

I've a plain text file.
> Input: इंजेक्शन इंटरनॅशनल इंटिग्रेटेड इंटिरिअर इंडस्ट्री
All words are separated by one or more spaces. I want to collect all unique chars from the text file. I'm looking for a unix command; the order of the result chars is not important.
> Expected result: इं जे क्श न ट र नॅ श ल इ्रे टे ड टि रिअ र ड स्ट्री
With the command Klaus has provided
cat <file>|sed -e 's/\(.\)/\1\n/g'|sort -u|tr -d '\n'
Result comes as:
ं अ इ क ग ज ट ड न र ल श सिीॅे्
I don't want to separate horizontal or vertical conjuncts or dependent vowels from their base characters.
I just want to separate complete characters in a word from each other.
Can we achieve this with UNIX commands?
"base character" + "dependent vowel" = "complete character"
- क ा का
- क ि कि
Klaus's command works for English text only, but it doesn't work with Indic languages such as Hindi.
Input: hi1 hello-2 how!3 "are4 ?you5
result: h i e l o w a r y u 1 2 3 4 5 - ! "
Note: you have to have Indic support installed in your OS.
Also, download Mangal font from http://hindi-fonts.com/fonts/Mangal
Try this:
cat <file>|sed -e 's/\(.\)/\1\n/g'|sort -u|tr -d '\n'
or simplified (taken from fedorqui's comment, thanks! I had never seen & in the replacement part before. Good to learn something new!):
sed 's/./&\n/g' <file> | sort -u | tr -d '\n'
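For the Indic case, a possible sketch, assuming GNU grep built with PCRE support: the \X escape matches an extended grapheme cluster, which keeps dependent vowels and combining marks attached to their base character (exactly how conjuncts are grouped can still depend on the Unicode rules of the installed PCRE version):
grep -oP '\X' <file> | sort -u | tr '\n' ' '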

SED command to change the header

Well, I have about 114 files that I want to join side-by-side based on the first column that each file shares, which is the ID number. Each file consists of 2 columns and over 400000 lines. I used write.table to join those tables into one table, and I got X's in my header. For example, my header should be like:
ID 1_sample1 2_sample2 3_sample3
But I get it like this:
ID X1_sample1 X2_sample2 X3_sample3
I read about this problem and found out that check.names gets rid of it, but in my case when I use check.names I get the following error:
"unused argument (check.name = F)"
Thus, I decided to use sed to fix the problem. It actually works great, BUT it joins the 2nd line and the 1st line. For instance, my first and second lines should be something like this:
ID 1_sample1 2_sample2 3_sample
cg123 .0235 2.156 -5.546
But I get the following instead:
ID 1_sample1 2_sample2 3_sample cg123 .0235 2.156 -5.546
Can anyone check this code for me, please? I might've done something wrong that keeps the lines from being separated.
head -n 1 inFILE | tr "\t" "\n" | sed -e 's/^X//g' | sed -e 's/\./-/' | sed -e 's/\./(/' |sed -e 's/\./)/' | tr "\n" "\t" > outFILE
tail -n +2 beta.norm.txt >> outFILE
If your data is tab delimited, the simple fix would be
sed '1,1s/\tX/\t/g' < inputfile > outputfile
1,1 only operate on the range "line 1 to line 1"
\tX find tab followed by X
\t replace with tab
g all occurrences
It does seem as though your original attempt does more than just strip the X: it also changes the first three dots to -, ( and ), but your example doesn't show why you need that. The reason your code joins the first two lines is that you replace every \n with \t in your last tr command, which leaves no \n at the end of the header line.
You need to attach a \n at the end of your first line before concatenating lines 2 and beyond with your second command. Experiment with
head -n 1 inFILE | tr "\t" "\n" | sed -e 's/^X//g' | sed -e 's/\./-/' | sed -e 's/\./(/' |sed -e 's/\./)/' | tr "\n" "\t" > outFILE
echo "\n" >> outFile
tail -n +2 beta.norm.txt >> outFILE
whether that works depends on your OS. There are other ways to add a newline...
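A more portable way to append the newline is printf, whose behavior does not vary between shells the way echo "\n" does:
printf '\n' >> outFILE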
Edit: using awk is probably much cleaner. For example:
awk '(NR==1){gsub(" X"," ", $0);}{print;}' inputFile > outputFile
Explanation:
(NR==1) for the first line only (record number == 1) do:
{gsub(" X","", $0);} do a global substitution of "space followed by X", with "space"
for all lines (including the one that was just modified) do:
{print;}' print the whole line