Creating a dictionary from a '.fasta' file containing several genes from an organism

I have a '.txt' file that lists genes and their sequences. I need to create a dictionary in which the keys are the gene names and the values are the sequences.
I want the dictionary to look like this:
dict = {'sequence1': 'AATTGGCC', 'sequence2': 'AAGGCCTT', ...}
This is what I tried, but I ran into some problems:
dictionary = {}
accesion_number = ""
sequentie = ""
with open("6EP.fasta", "r") as proteoom:
    for line in proteoom:
        if line.startswith(">"):
            line.strip()
            dictionary[accesion_number] = sequentie
            sequentie = ""
        else:
            sequentie = sequentie + line.rstrip().strip("\n").strip("\r")
    dictionary[accesion_number] = sequentie
Does anyone know what went wrong here, and how I can fix it?
Thanks in advance!

I can think of two ways to do this:
High memory usage
If the file is not too large, you can use readlines() and then use the indexes like so (this assumes each sequence sits on the single line right after its ID):
IDs = []
sequences = []
with open('Proteome.fasta', 'r') as f:
    raw_data = f.readlines()
for i, l in enumerate(raw_data):
    if l[0] == '>':
        IDs.append(l)
        sequences.append(raw_data[i + 1])
Low memory usage
If you don't want to load the contents of the file into memory, you can read the file twice, saving the index of every ID line plus one on the first pass:
Get the '>' lines and store their indexes plus one, which are the indexes of the sequence lines
On the second pass, check whether the line number is the next stored index and, if so, append the line to your sequences
Here I'm taking advantage of the fact that the indexes list is sorted by construction.
IDs = []
indexes = []
sequences = []
with open('Proteome.fasta', 'r') as f:
    for i, l in enumerate(f):
        if l[0] == '>':
            IDs.append(l)  # Get your IDs
            indexes.append(i + 1)  # Get the index of the ID + 1
with open('Proteome.fasta', 'r') as f:
    for i, l in enumerate(f):
        if indexes and i == indexes[0]:  # Check whether the line matches the next index
            sequences.append(l)  # Get your sequence
            indexes.pop(0)  # Remove the first element of the indexes
I hope this helps! ;)

Code
ids = []
seq = []
char = ['_', ':', '*', '#']  # invalid in sequence
seqs = ''
with open('fasta.txt', 'r') as f:  # open sample fasta
    for line in f:
        if line.startswith('>'):
            ids.append(line.strip('\n'))
            if seqs != '':  # if there's a previous seq
                seq.append(seqs)  # append the seq
                seqs = ''  # then start a new seq
        elif line not in char:
            seqs = seqs + line.strip('\n')  # build seq with each line until '>'
    seq.append(seqs)  # append any remaining seq
print(ids)
print(seq)
Result
['>SeqABCD [organism=Mus musculus]', '>SeqABCDE [organism=Plasmodium falciparum]']
['ACGTCAGTCACGTACGTCAGTTCAGTC...', 'GGTACTGCAAAGTTCTTCCGCCTGATTA...']
Sample File
>SeqABCD [organism=Mus musculus]
ACGTCAGTCACGTACGTCAGTTCAGTCARYSTYSATCASMBMBDH
ATCGTTTTTATGTAATTGCTTATTGTTGTGTGTAGATTTTTTAA
AAATATCATTTGAGGTCAATACAAATCCTATTTCTATCGTTTTT
CCCTAAACCCTAAACCCTAAACCCTAAACCTCTGAATCCTTAAT
>SeqABCDE [organism=Plasmodium falciparum]
GGTACTGCAAAGTTCTTCCGCCTGATTAATTATCCATTTTACCTT
TTGTTTTGCTTCTTTGAAGTAGTTTCTCTTTGCAAAATTCCTCTT
GGTACTGCAAAGTTCTTCCGCCTGATTAATTATCCGGTACTGCAA
AGTCAATTTTATATAATTTAATCAAATAAATAAGTTTATGGTTAA
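The answers above collect IDs and sequences in separate lists, while the question asked for a dictionary. As a minimal sketch of that last step, assuming the key should be the first word of each header line without the leading '>' (parse_fasta and read_fasta are hypothetical helper names, not part of any library):

```python
def parse_fasta(lines):
    """Build {name: sequence} from an iterable of FASTA lines.

    Assumption: the key is the first whitespace-separated word of the
    header, with the leading '>' removed.
    """
    records = {}
    name = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if name is not None:
                records[name] = "".join(chunks)  # store the finished record
            name = line[1:].split()[0]
            chunks = []
        elif name is not None:
            chunks.append(line)  # sequence may span several lines
    if name is not None:
        records[name] = "".join(chunks)  # store the last record
    return records


def read_fasta(path):
    """Read a FASTA file line by line, without loading it whole."""
    with open(path, "r") as handle:
        return parse_fasta(handle)


sample = [">sequence1 [organism=X]\n", "AATT\n", "GGCC\n",
          ">sequence2\n", "AAGGCCTT\n"]
print(parse_fasta(sample))
```

Storing the record only when the next header (or the end of the file) is reached is what lets a sequence spread over several lines end up under a single key.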

Related

How would I delete items in a dictionary within specific parameters?

In my code, I want every number under 70 to be deleted from a dictionary. I'm unsure how to specify this, and I need the name associated with each number to be deleted as well; alternatively, I could display only the numbers that are 70 or above.
Below is my code in its entirety:
name = []
number =[]
name_grade = {}
counter = 0
counter_bool= True
num_loop = True
while counter_bool:
    stu = int(input("please enter the number of students: "))
    if stu < 2:
        print("value is too low, try again")
        continue
    else:
        break
while counter != stu:
    name_inp = str(input("Enter your name: "))
    while num_loop:
        number_inp = int(input("Enter your number: "))
        if number_inp < 0 or number_inp > 100:
            print("The value is too high or too low, please enter a number between 0 and 100.")
            continue
        else:
            break
    name_grade[name_inp] = number_inp
    name.append(name_inp)
    number.append(number_inp)
    counter += 1
print(name_grade)
sorted_numbers = sorted(name_grade.items(), key= lambda x:x[1])
print(sorted_numbers)
if number > 70:
    resorted_numbers = number < 70
    print(resorted numbers)
How would I go about this?
Also, if it's not too much trouble, could someone explain in detail how dictionary keys and the lambda function I've used work? I got help, but I would prefer to understand the small details of how it's applied and formatted. Don't worry if it's a pain to explain.
You can iterate over the dictionary and keep only the values of 70 or above, which drops every lower number together with its name:
resorted_numbers = {k:v for k,v in name_grade.items() if v>=70}
The dict.items method returns the dictionary's key-value pairs as tuples, so the lambda function tells sorted to sort by the second element of each tuple, i.e. the value.
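As a small worked example of both steps, here is a sketch with made-up names and grades (the sample data is an assumption, not taken from the question):

```python
# Hypothetical sample data standing in for the values typed by the user.
name_grade = {"Ann": 85, "Bob": 62, "Cy": 70, "Dee": 41}

# Keep only the entries whose value is 70 or above; the name (key) and
# the number (value) are dropped together.
passed = {name: grade for name, grade in name_grade.items() if grade >= 70}

# items() yields (name, grade) tuples; the lambda picks element 1 (the
# grade) as the sort key, so the pairs come out ordered by grade.
ranked = sorted(passed.items(), key=lambda pair: pair[1])

print(passed)  # {'Ann': 85, 'Cy': 70}
print(ranked)  # [('Cy', 70), ('Ann', 85)]
```

Building a new dictionary avoids deleting keys from name_grade while looping over it, which Python does not allow.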

How to read pair by pair from a file in SML?

I want to read N pairs from a file and store them as tuples in a list. For example, if I have these 3 pairs: 1-2, 7-3, 2-9, I want my list to look like this: [(1,2),(7,3),(2,9)]
I tried something like this:
fun ex filename =
  let
    fun readInt input = Option.valOf (TextIO.scanStream (Int.scan StringCvt.DEC) input)
    val instream = TextIO.openIn filename
    val T = readInt instream (*number of pairs*)
    val _ = TextIO.inputLine instream
    fun read_ints2 (x,acc) =
      if x = 0 then acc
      else read_ints2(x-1,(readInt instream,readInt instream)::acc)
  in
    ...
  end
When I run it I get an exception error. What's wrong?
I came up with this solution. It reads a single line from the given file. In processing the text, it strips away anything that is not a digit, creating a single flat list of chars. Then it splits the flat list of chars into a list of pairs and, in the process, converts the chars to ints. Because every digit is converted on its own, it only handles single-digit numbers. I'm sure it could be improved.
fun readIntPairs file =
  let val is = TextIO.openIn file
  in
    case (TextIO.inputLine is)
      of NONE => ""
       | SOME line => line
  end
fun parseIntPairs data =
  let val cs = (List.filter Char.isDigit) (explode data)
      fun toInt c =
        case Int.fromString (str c)
          of NONE => 0
           | SOME i => i
      fun part [] = []
        | part [x] = []
        | part (x::y::zs) = (toInt x,toInt y)::part(zs)
  in
    part cs
  end
parseIntPairs (readIntPairs "pairs.txt");

Delete all rows in a file that fit between certain headers?

I would like to delete all of the rows that sit between certain headers in this example text file.
fileConn <- file("sample.txt")
one <- "*Keyword"
two <- "*Node"
three <- "$ Node,X,Y,Z"
four <- "1,639982.78040607,4733827.5104821,0"
five <- "2,639757.59709573,4733830.43494066,0"
six <- "3,639738.81268144,4733834.3619618,0"
seven <- "*End"
writeLines (c(one, two, three, four, five, six, seven), fileConn)
close(fileConn)
sample <- readLines("sample.txt")
What I am looking to do is delete all of the rows/lines between "*Node" and "*End". Since I am dealing with files with different lengths of rows between these headers, the deletion method needs to be based on headers only. I have no idea how to do this since I've only deleted rows in dataframes referenced by row numbers previously. Any clues?
Expected output is:
*Keyword
*Node
*End
readLines returns a vector, not a data frame, so we can create the sample input more simply:
sample = c("*Keyword",
           "*Node",
           "$ Node,X,Y,Z",
           "1,639982.78040607,4733827.5104821,0",
           "2,639757.59709573,4733830.43494066,0",
           "3,639738.81268144,4733834.3619618,0",
           "*End")
Find the starting and ending headers, and remove the elements in between with negative indexing:
node = which(sample == "*Node")
end = which(sample == "*End")
result = sample[-seq(from = node + 1, to = end - 1)]
result
# [1] "*Keyword" "*Node" "*End"
This assumes there is a single *Node and a single *End line. It also assumes that there is at least one line to delete. You may want to create a more robust solution with some handling for those special cases, e.g.,
delete_between = function(input, start, end) {
  # Search the function argument, not the global `sample` vector
  start_index = which(input == start)
  end_index = which(input == end)
  if (length(start_index) == 0 || length(end_index) == 0) {
    warning("No start or end found, returning input as-is")
    return(input)
  }
  if (length(start_index) > 1 || length(end_index) > 1) {
    stop("Multiple starts or ends found.")
  }
  if (start_index == end_index - 1) {
    return(input)
  }
  return(input[-seq(from = start_index + 1, to = end_index - 1)])
}

How to find a word vertically in a crossword

I'm trying to write a function that accepts a 2-dimensional (2D) list of characters (like a crossword puzzle) and a string as input arguments. The function must then search the columns of the 2D list for the word. If a match is found, the function should return a list containing the row index and column index of the start of the match; otherwise it should return None.
For example if the function is called as shown below:
crosswords = [['s','d','o','g'],['c','u','c','m'],['a','c','a','t'],['t','e','t','k']]
word = 'cat'
find_word_vertical(crosswords,word)
then the function should return:
[1,0]
def find_word_vertical(crosswords,word):
    columns = []
    finished = []
    for col in range(len(crosswords[0])):
        columns.append( [crosswords[row][col] for row in
                         range(len(crosswords))])
    for a in range(0, len(crosswords)):
        column = [crosswords[x][a] for x in range(len(crosswords))]
        finished.append(column)
    for row in finished:
        r=finished.index(row)
        whole_row = ''.join(row)
        found_at = whole_row.find(word)
        if found_at >=0:
            return([found_at, r])
This one is for finding horizontal... could switching this around help?
def find_word_horizontal(crosswords, word):
    list1=[]
    row_index = -1
    column_index = -1
    refind=''
    for row in crosswords:
        index=''
        for column in row:
            index= index+column
        list1.append(index)
    for find_word in list1:
        if word in find_word:
            row_index = list1.index(find_word)
            refind = find_word
            column_index = find_word.index(word)
    ret = [row_index,column_index]
    if row_index!= -1 and column_index != -1:
        return ret
The simple version is:
def find_word_vertical(crosswords,word):
    columns = [list(i) for i in zip(*crosswords)]  # transpose: one list per column
    for col_index, column in enumerate(columns):
        row_index = ''.join(column).find(word)  # position of the word within this column
        if row_index >= 0:
            return [row_index, col_index]
This gives the correct output, [1,0].
To find a word vertically:
def find_word_vertical(crosswords,word):
    if not crosswords or not word:
        return None
    for col_index in range(len(crosswords[0])):
        column_str = ''
        for row_index in range(len(crosswords)):
            column_str = column_str + crosswords[row_index][col_index]
        if column_str.find(word) >= 0:
            return [column_str.find(word),col_index]
To find a word horizontally:
def find_word_horizontal(crosswords, word):
    if not crosswords or not word:
        return None
    for index, row in enumerate(crosswords):
        row_str = ''.join(row)
        if row_str.find(word) >= 0:
            return [index,row_str.find(word)]
# Find a vertical word in a 2D list
def find_it(li,wo):
    out_list=[]
    for col in range(len(li[0])):  # one pass per column
        chek_word=""
        for row in range(len(li)):
            chek_word=chek_word + li[row][col]  # build the column as a string
        if wo in chek_word:
            out_list=[ chek_word.find(wo) , col]
            print(out_list)
            break
    return out_list  # empty list if the word was not found
This is mine, and yes, it works.
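For reference, the column search can be condensed into a few lines and checked against the example from the question. This is only a sketch: it assumes a rectangular grid (zip stops at the shortest row) and returns the first match found when scanning columns from left to right.

```python
def find_word_vertical(crosswords, word):
    """Return [row_index, col_index] of the first vertical match, else None."""
    if not crosswords or not word:
        return None
    # zip(*crosswords) transposes the grid, yielding one tuple per column.
    for col_index, column in enumerate(zip(*crosswords)):
        row_index = "".join(column).find(word)
        if row_index >= 0:
            return [row_index, col_index]
    return None


crosswords = [['s', 'd', 'o', 'g'],
              ['c', 'u', 'c', 'm'],
              ['a', 'c', 'a', 't'],
              ['t', 'e', 't', 'k']]
print(find_word_vertical(crosswords, 'cat'))  # [1, 0]
print(find_word_vertical(crosswords, 'dog'))  # None
```

Using enumerate instead of list.index avoids returning the wrong column when two columns hold identical letters.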

alignment of sequences

I want to do pairwise alignment with uniprot and pdb sequences. I have an input file containing uniprot and pdb IDs like this.
pdb id uniprot id
1dbh Q07889
1e43 P00692
1f1s Q53591
1) Read each line of the input file.
2) Retrieve the pdb and uniprot sequences from the pdb.fasta and uniprot.fasta files.
3) Do the alignment and calculate the sequence identity.
Usually, I use the following program for pairwise alignment and sequence identity calculation.
library("seqinr")
seq1 <- "MDEKRRAQHNEVERRRRDKINNWIVQLSKIIPDSSMESTKSGQSKGGILSKASDYIQELRQSNHR"
seq2<- "MKGQQKTAETEEGTVQIQEGAVATGEDPTSVAIASIQSAATFPDPNVKYVFRTENGGQVM"
library(Biostrings)
globalAlign<- pairwiseAlignment(seq1, seq2)
pid(globalAlign, type = "PID3")
I need to print the output like this
pdbid uniprotid seq.identity
1dbh Q07889 99
1e43 P00692 80
1f1s Q53591 56
How can I change the above code? Your help would be appreciated!
This code is hopefully what you're looking for:
class test():
    def get_seq(self, pdb, fasta_file):  # Get sequences
        from Bio.PDB.PDBParser import PDBParser
        from Bio import SeqIO
        # Three-letter to one-letter amino acid codes
        aa = {'ARG':'R','HIS':'H','LYS':'K','ASP':'D','GLU':'E','SER':'S','THR':'T','ASN':'N','GLN':'Q','CYS':'C','SEC':'U','GLY':'G','PRO':'P','ALA':'A','ILE':'I','LEU':'L','MET':'M','PHE':'F','TRP':'W','TYR':'Y','VAL':'V'}
        p = PDBParser(PERMISSIVE=1)
        structure_id = "%s" % pdb[:-4]
        structure = p.get_structure(structure_id, pdb)
        residues = structure.get_residues()
        seq_pdb = ''
        for res in residues:
            res = res.get_resname()
            if res in aa:
                seq_pdb = seq_pdb + aa[res]
        handle = open(fasta_file, "r")
        for record in SeqIO.parse(handle, "fasta"):
            seq_fasta = record.seq
        handle.close()
        self.seq_aln(seq_pdb, seq_fasta)

    def seq_aln(self, seq1, seq2):  # Align the sequences
        from Bio import pairwise2
        # Bio.SubsMat is only available in older Biopython releases
        from Bio.SubsMat import MatrixInfo as matlist
        matrix = matlist.blosum62
        gap_open = -10
        gap_extend = -0.5
        alns = pairwise2.align.globalds(seq1, seq2, matrix, gap_open, gap_extend)
        top_aln = alns[0]
        aln_seq1, aln_seq2, score, begin, end = top_aln
        with open('aln.fasta', 'w') as outfile:
            outfile.write('> PDB_seq\n'+str(aln_seq1)+'\n> Uniprot_seq\n'+str(aln_seq2))
        print(aln_seq1+'\n'+aln_seq2)
        self.seq_id('aln.fasta')

    def seq_id(self, aln_fasta):  # Get sequence ID
        from Bio import AlignIO
        input_handle = open(aln_fasta, "r")
        alignment = AlignIO.read(input_handle, "fasta")
        input_handle.close()
        j = 0  # counts positions in first sequence
        i = 0  # counts identity hits
        for record in alignment:
            for amino_acid in record.seq:
                if amino_acid == '-':
                    pass
                else:
                    if amino_acid == alignment[0].seq[j]:
                        i += 1
                j += 1
            j = 0
            seq = str(record.seq)
            gap_strip = seq.replace('-', '')
            percent = 100*i//len(gap_strip)  # whole-number percentage
            print(record.id+' '+str(percent))
            i = 0

a = test()
a.get_seq('1DBH.pdb','Q07889.fasta')
This outputs:
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------EQTYYDLVKAF-AEIRQYIRELNLIIKVFREPFVSNSKLFSANDVENIFSRIVDIHELSVKLLGHIEDTVE-TDEGSPHPLVGSCFEDLAEELAFDPYESYARDILRPGFHDRFLSQLSKPGAALYLQSIGEGFKEAVQYVLPRLLLAPVYHCLHYFELLKQLEEKSEDQEDKECLKQAITALLNVQSG-EKICSKSLAKRRLSESA-------------AIKK-NEIQKNIDGWEGKDIGQCCNEFI-EGTLTRVGAKHERHIFLFDGL-ICCKSNHGQPRLPGASNAEYRLKEKFF-RKVQINDKDDTNEYKHAFEIILKDENSVIFSAKSAEEKNNW-AALISLQYRSTL---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
MQAQQLPYEFFSEENAPKWRGLLVPALKKVQGQVHPTLESNDDALQYVEELILQLLNMLCQAQPRSASDVEERVQKSFPHPIDKWAIADAQSAIEKRKRRNPLSLPVEKIHPLLKEVLGYKIDHQVSVYIVAVLEYISADILKLVGNYVRNIRHYEITKQDIKVAMCADKVLMDMFHQDVEDINILSLTDEEPSTSGEQTYYDLVKAFMAEIRQYIRELNLIIKVFREPFVSNSKLFSANDVENIFSRIVDIHELSVKLLGHIEDTVEMTDEGSPHPLVGSCFEDLAEELAFDPYESYARDILRPGFHDRFLSQLSKPGAALYLQSIGEGFKEAVQYVLPRLLLAPVYHCLHYFELLKQLEEKSEDQEDKECLKQAITALLNVQSGMEKICSKSLAKRRLSESACRFYSQQMKGKQLAIKKMNEIQKNIDGWEGKDIGQCCNEFIMEGTLTRVGAKHERHIFLFDGLMICCKSNHGQPRLPGASNAEYRLKEKFFMRKVQINDKDDTNEYKHAFEIILKDENSVIFSAKSAEEKNNWMAALISLQYRSTLERMLDVTMLQEEKEEQMRLPSADVYRFAEPDSEENIIFEENMQPKAGIPIIKAGTVIKLIERLTYHMYADPNFVRTFLTTYRSFCKPQELLSLIIERFEIPEPEPTEADRIAIENGDQPLSAELKRFRKEYIQPVQLRVLNVCRHWVEHHFYDFERDAYLLQRMEEFIGTVRGKAMKKWVESITKIIQRKKIARDNGPGHNITFQSSPPTVEWHISRPGHIETFDLLTLHPIEIARQLTLLESDLYRAVQPSELVGSVWTKEDKEINSPNLLKMIRHTTNLTLWFEKCIVETENLEERVAVVSRIIEILQVFQELNNFNGVLEVVSAMNSSPVYRLDHTFEQIPSRQKKILEEAHELSEDHYKKYLAKLRSINPPCVPFFGIYLTNILKTEEGNPEVLKRHGKELINFSKRRKVAEITGEIQQYQNQPYCLRVESDIKRFFENLNPMGNSMEKEFTDYLFNKSLEIEPRNPKPLPRFPKKYSYPLKSPGVRPSNPRPGTMRHPTPLQQEPRKISYSRIPESETESTASAPNSPRTPLTPPPASGASSTTDVCSVFDSDHSSPFHSSNDTVFIQVTLPHGPRSASVSSISLTKGTDEVPVPPPVPPRRRPESAPAESSPSKIMSKHLDSPPAIPPRQPTSKAYSPRYSISDRTSISDPPESPPLLPPREPVRTPDVFSSSPLHLQPPPLGKKSDHGNAFFPNSPSPFTPPPPQTPSPHGTRRHLPSPPLTQEVDLHSIAGPPVPPRQSTSQHIPKLPPKTYKREHTHPSMHRDGPPLLENAHSS
PDB_seq 100 # pdb to itself would obviously have 100% identity
Uniprot_seq 24 # pdb sequence has 24% identity to the uniprot sequence
For this to work on your input file, you need to call a.get_seq() in a for loop with the inputs from your text file.
EDIT:
Replace the seq_id function with this one:
def seq_id(self, aln_fasta):
    from Bio import SeqIO
    record_iterator = SeqIO.parse(aln_fasta, "fasta")
    first_record = next(record_iterator)
    print('%s has a length of %d' % (first_record.id, len(str(first_record.seq).replace('-',''))))
    second_record = next(record_iterator)
    print('%s has a length of %d' % (second_record.id, len(str(second_record.seq).replace('-',''))))
    lengths = [len(str(first_record.seq).replace('-','')), len(str(second_record.seq).replace('-',''))]
    if lengths.index(min(lengths)) == 0:  # If both sequences have the same length, the PDB sequence is taken as the shortest
        print('PDB sequence has the shortest length')
    else:
        print('Uniprot sequence has the shortest length')
    identities = 0
    for i, v in enumerate(first_record.seq):
        if v == '-':
            continue  # skip gap positions
        if v == second_record.seq[i]:
            identities += 1
    # Multiply before dividing so the ratio is not truncated to an integer
    print('Sequence identity = %.2f percent' % (100.0 * identities / min(lengths)))
To pass the arguments to the class, use:
with open('input_file.txt', 'r') as infile:
    next(infile)
    next(infile)  # Going by your input file
    for line in infile:
        segs = line.split()  # [pdb id, uniprot id]
        a.get_seq(segs[0]+'.pdb',segs[1]+'.fasta')
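Stripped of the file handling, the identity calculation in the edited seq_id comes down to counting matching non-gap positions and dividing by the length of the shorter ungapped sequence. A minimal sketch on plain strings, assuming both inputs are already aligned to the same length with '-' as the gap character (percent_identity is a hypothetical helper, not a Biopython function):

```python
def percent_identity(aligned_a, aligned_b):
    """Percent identity relative to the shorter ungapped sequence."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    # Count columns where both sequences carry the same residue (no gaps).
    matches = sum(1 for x, y in zip(aligned_a, aligned_b)
                  if x != "-" and x == y)
    shortest = min(len(aligned_a.replace("-", "")),
                   len(aligned_b.replace("-", "")))
    if shortest == 0:
        return 0.0  # nothing to compare
    return 100.0 * matches / shortest


print(percent_identity("AC-GT", "ACCGT"))  # 100.0
print(percent_identity("ACGT", "AGGT"))    # 75.0
```

Other definitions divide by the alignment length or by the mean sequence length, so the denominator is worth stating whenever identities are reported.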
It might be something like this; a repeatable example (e.g., with short files posted on-line) would help...
library(Biostrings)
pdb = readAAStringSet("pdb.fasta")
uniprot = readAAStringSet("uniprot.fasta")
This reads all sequences into two objects. pairwiseAlignment accepts a vector as its first (query) argument, so if you want to align all pdb sequences against all uniprot sequences, pre-allocate a result matrix
pids = matrix(numeric(), length(uniprot), length(pdb),
              dimnames=list(names(uniprot), names(pdb)))
and then do the calculations
for (i in seq_along(uniprot)) {
  globalAlignment = pairwiseAlignment(pdb, uniprot[i])
  pids[i,] = pid(globalAlignment)
}
