modifying line names to avoid redundancies when files are merged in terminal - unix

I have two files containing biological DNA sequence data. Each of these files is the output of a python script which assigns each DNA sequence to a sample ID based on a DNA barcode at the beginning of the sequence. The output of one of these .txt files looks like this:
>S066_1 IGJRWKL02G0QZG orig_bc=ACACGTGTCGC new_bc=ACACGTGTCGC bc_diffs=0
TTAAGTTCAGCGGGTATCCCTACCTGATCCGAGGTCAACCGTGAGAAGTTGAGGTTATGGCAAGCATCCATAAGAACCCTATAGCGAGAATAATTACTACGCTTAGAGCCAGATGGCACCGCCACTGATTTTAGGGGCCGCTGAATAGCGAGCTCCAAGACCCCTTGCGGGATTGGTCAAAATAGACGCTCGAACAGGCATGCCCCTCGGAATACCAAGGGGCGCAATGTGCGTCCAAAGATTCGATGATTCACTGAATTCTGCAATTCACATTACTTATCGCATTTCGCAGCGTTCTTCATCGATGACGAGTCTAG
>S045_2 IGJRWKL02H5XHD orig_bc=ATCTGACGTCA new_bc=ATCTGACGTCA bc_diffs=0
CTAAGTTCAGCGGGTAGTCTTGTCTGATATCAGGTCCAATTGAGATACCACCGACAATCATTCGATCATCAACGATACAGAATTTCCCAAATAAATCTCTCTACGCAACTAAATGCAGCGTCTCCGTACATCGCGAAATACCCTACTAAACAACGATCCACAGCTCAAACCGACAACCTCCAGTACACCTCAAGGCACACAGGGGATAGG
The first line is the sequence ID, and the second line is the DNA sequence. S066 in the first part of the ID indicates that the sequence is from sample 066, and the _1 indicates that it's the first sequence in the file (not the first sequence from S066 per se). Because of the nuances of the DNA sequencing technology being used, I need to generate two files like this from the raw sequencing files, and the result is an output where I have two of these files, which I then merge together with cat. So far so good.
The next downstream step in my workflow does not allow identical sample names. Right now it gets halfway through, errors out, and closes because it encounters some identical sequence IDs. So it must be that, say, the 400th sequence in both files belongs to the same sample, generating identical sample IDs (i.e. both files might have S066_400).
What I would like to do is use some code to insert a number (1000, 4971, whatever) immediately after the _ on every other line in the second file, starting with the first line. This way the IDs would no longer be confounded and I could proceed: prefixing a 2, say, would shift S066_1 through S066_4971 to S066_21 through S066_24971. Part of the trouble is that the ID may be variable in length, so it could begin as S066_ or as 49BBT1_.

Try:
# on every header line (starting with ">"), append "_13" to the ID field
awk '/^>/ {$1=$1 "_13"} {print $0}' filename > tmp.tmp
mv tmp.tmp filename
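If you would rather renumber the counters themselves, here is a sketch (assuming every ID ends in a single _NUMBER suffix, as in your examples; merged.txt and renumbered.txt are placeholder names) that renumbers the headers sequentially across the whole merged file:
# replace the trailing _NUMBER of each header ID with a fresh running count
awk '/^>/ {sub(/_[0-9]+$/, "_" ++n, $1)} {print}' merged.txt > renumbered.txt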

Related

looping through paired-end fastq reads

How can you loop through a paired-end fastq file? For single-end reads you can do the following:
library(ShortRead)
strm <- FastqStreamer("./my.fastq.gz")
repeat {
    fq <- yield(strm)
    if (length(fq) == 0)
        break
    # do things
    writeFasta(fq, 'output.fq', mode="a")
}
However, if I edit one paired-end file, I somehow need to keep track of the second file so that the two files continue to correspond well with each other.
Paired-end fastq files are typically ordered, so you could keep track of the lines that are removed and remove them from the paired file. But this isn't a great method, and if your data is line-wrapped you will be in pain.
A better way would be to use the header information.
The headers for the paired reads in the two files are identical, except for the field that specifies whether the read is forward or reverse (1 or 2)...
first read from file 1:
@M02621:7:000000000-ARATH:1:1101:15643:1043 1:N:0:12
first read from file 2:
@M02621:7:000000000-ARATH:1:1101:15643:1043 2:N:0:12
The numbers 1101:15643:1043 refer to the tile and the x, y coordinates on that tile, respectively. These numbers uniquely identify each read pair for the given run. Using this information, you can remove reads from the second file if they are not in the first file.
Alternatively, if you are doing quality trimming... Trimmomatic can perform quality/length filtering on paired-end data, and it's fast...
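A minimal sketch of that header-matching idea, assuming plain (uncompressed) 4-line fastq records; trimmed_R1.fastq (the already-filtered first file), R2.fastq (its untouched mate) and the output names are placeholders:
# pass 1: collect the IDs (first header field, leading @ stripped) that survived in file 1
awk 'NR % 4 == 1 {print substr($1, 2)}' trimmed_R1.fastq > keep_ids.txt
# pass 2: keep only the 4-line records of file 2 whose ID is on that list
awk 'NR == FNR {ids[$1]; next}
     FNR % 4 == 1 {keep = (substr($1, 2) in ids)}
     keep' keep_ids.txt R2.fastq > matched_R2.fastq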

R - Extract multiple tables from text file

I have a .txt file containing text (which I don't want) and 65 tables, as shown below (just the top of the .txt file).
Does anyone know how I can extract only the tables from this text file, such that I can open the resulting .txt file as a data.frame with my 65 tables in R? Above each table is a fixed number of lines (starting with "The result of abcpred on seq..." and ending with "Predicted B cell epitopes"), and below each of them is a variable number of lines, depending on how many rows each table has. Then comes the next table, and it goes on like that until I reach the 65th table.
Given that the tables are the only elements that start with numbers, grepping for a digit at the beginning of the line is indeed the best solution. Using the shell (and not R), the command:
grep '^[0-9]' input > output
did exactly what I wanted.
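If you also need each of the 65 tables in its own file, a hedged awk variation on the same idea (assuming each table is a maximal run of digit-led lines; the table_N.txt names are placeholders):
# start a new output file each time a digit-led run begins
awk '/^[0-9]/ {if (!in_tbl) {n++; in_tbl = 1} print > ("table_" n ".txt"); next}
     {in_tbl = 0}' input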

.ksh paste user input value into dataset

Good morning.
First things first: I know next to nothing about shell scripting in Unix, so please pardon my naivety.
Here's what I'd like to do, and I think it's relatively simple: I would like to create a .ksh file to do two things: 1) take a user-provided numerical value (argument) and paste it into a new column at the end of a dataset (a separate .txt file), and 2) execute a different .ksh script.
I envision calling this script at the Unix prompt, with the input value added thereafter. Something like "paste_and_run.ksh 58", where 58 would populate a new, final (un-headered) column in an existing dataset (specifically, it'd populate the 77th column).
To be perfectly honest, I'm not even sure where to start with this, so any input would be very appreciated. Apologies for the lack of code within the question. Please let me know if I can offer any more detail, and thank you for taking a look.
I have found the answer: the "nawk" command.
TheNumber=$3
PE_Infile=$1
Where the above variables correspond to the third and first arguments from the command line, respectively. "PE_Infile" represents the file (with full path) to be manipulated, and "TheNumber" represents the number to populate the final column. Then:
nawk -F"|" -v TheNewNumber=$TheNumber '{print $0 "|" TheNewNumber/10000}' $PE_Infile > $BinFolder/Temp_Input.txt
Here, the -F"|" dictates the delimiter, and the -v dictates what is to be added. For reasons unknown to myself, the declaration of a new varible (TheNewNumber) was necessary to perform the arithmetic manipulation within the print statement. print $0 means that the whole line would be printed, while tacking the "|" symbol and the value of the command line input divided by 10000 to the end. Finally, we have the input file and an output file (Temp_PE_Input.txt, within a path represented by the $Binfolder variable).
Running the desired script afterward was as simple as typing out the script name (with path), and adding corresponding arguments ($2 $3) afterward as needed, each separated by a space.
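Putting the pieces together, a minimal sketch of the wrapper described above (the working directory /path/to/work and the follow-up script name next_step.ksh are hypothetical):
#!/bin/ksh
# usage: paste_and_run.ksh <dataset.txt> <other_arg> <number>
PE_Infile=$1              # pipe-delimited dataset to extend
TheNumber=$3              # user-provided value for the new final column
BinFolder=/path/to/work   # placeholder for the output directory
# append the value (divided by 10000) as a new "|"-separated final column
nawk -F"|" -v TheNewNumber="$TheNumber" '{print $0 "|" TheNewNumber/10000}' "$PE_Infile" > "$BinFolder/Temp_Input.txt"
# then launch the follow-up script with the remaining arguments
"$BinFolder/next_step.ksh" "$2" "$3"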

Find lines matching a pattern, provided their value in a specified column occurs exactly twice in the input file

Say the input is (.csv file):
a,b_b,3,c
d,k_k,3,f
g,h_h,3,i
j,k_k,4,l
m,n_n,4,o
p,k_k,5,q
r,s_s,5,t
I want this output:
All lines containing the pattern "k_k" whose number in the third column is found in exactly two lines (e.g. the numbers 4 and 5):
j,k_k,4,l
p,k_k,5,q
It might be a simple one but I can't find a way to achieve this. Could anyone help me using Unix command lines (awk)?
awk '/k_k/' && ?? file.csv
I think you want something like this:
awk -F, 'FNR==NR {a[$3]++; next} /k_k/ {if (a[$3]==2) print $0}' file.csv file.csv
I am assuming you mean that the number in column 3 appears exactly twice in the file, not that it is the number 4 or 5. This solution makes two passes over your file: the first to count the number of times each number occurs in column 3, and the second to print the matching lines. Therefore the input file is specified twice on the command line.
As a note of explanation, it counts the number of times 1 occurs in column 3 in a[1], the number of times 2 occurs in column 3 in a[2], and so on.
If your title means "two lines maximum" rather than exactly twice, so that a number occurring in one single line is also OK, change the "==" in my code to "<=". I cannot tell which you mean.
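In case the two-pass idiom is awkward for your pipeline, here is a single-pass sketch: it buffers the matching lines and prints them at the end, so output order is preserved, but all matches are held in memory until the end of the file:
awk -F, '{count[$3]++}
     /k_k/ {line[NR] = $0; key[NR] = $3}
     END {for (i = 1; i <= NR; i++) if (i in line && count[key[i]] == 2) print line[i]}' file.csv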

Sequence number inside a txt file in UNIX

I want to generate a unique sequence number for each row in the file in Unix. I cannot make an identity column in the database, as it has some other sources which also insert data into it. I tried using NR in awk, but since I have filters in my script it may skip rows in the file, so I may not get sequential numbers.
My requirements are: this sequence number needs to be persistent, since every day I receive this file and the numbering should start from where I left off; also, the number needs to be preceded by "EMP_" on each line in the file.
Please suggest.
Thanks in advance.
To obtain a unique ID in UNIX you may use a file to store and read the value; however, this method is tedious and requires a mechanism for file I/O locking. The easiest way is to use the date and time to obtain a unique ID, for example:
#!/bin/sh
# note: no spaces around "=" in a shell assignment
uniqueVal=`date '+%Y%m%d%H%M%S'`
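The timestamp gives one ID per run, though, not the persistent per-row numbering the question describes. A sketch of the counter-file approach (assuming a single writer, so no locking; counter.txt, input.txt and output.txt are placeholder names):
#!/bin/sh
# read the last number used, defaulting to 0 on the first run
last=$(cat counter.txt 2>/dev/null || echo 0)
# prefix every row with EMP_<n>, continuing from where we left off
awk -v start="$last" '{printf "EMP_%d %s\n", start + NR, $0}' input.txt > output.txt
# remember the new high-water mark for tomorrow's file
echo $(( last + $(wc -l < input.txt) )) > counter.txt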
