Encode a list of ids in an elegant id - math

I am trying to find the best way to encode a list of 10 integers (each roughly between 1 and 10000) into a single id. I need a one-to-one function from a list of integers to a single integer or string.
I tried base64 (which is good because it is reversible), but the result is much longer than the input and that's quite bad.
For example, if I want to encode 12-54-235-1223-21-765-43-763-9522-908,
base64 gives me MTItNTQtMjM1LTEyMjMtMjEtNzY1LTQzLTc2My05NTIyLTkwOA==
Hashing functions are bad because I can't recover the input easily.
Maybe I could use the fact that the input consists only of numbers and apply some number theory. Does anyone have an idea?

If the integers are guaranteed to be smaller than 10^9, you could encode them as:
[number of digits in 1st number][1st number][number of digits in 2nd number][2nd number][...]
So 12,54,235,1223,21,765,43,763,9522,908 yields 21225432354122322137652433763495223908.
Sample Python implementation:
def numDigits(x):
    # Count decimal digits recursively; // keeps the division integral
    if x < 10:
        return 1
    return 1 + numDigits(x // 10)
def encode(nums):
    ret = ""
    for number in nums:
        ret = ret + str(numDigits(number)) + str(number)
    return ret
def decode(id):
    nums = []
    while id != "":
        length = int(id[0]) #the leading digit is the length of the next number
        id = id[1:] #remove first char from id
        number = int(id[:length])
        nums.append(number)
        id = id[length:] #remove first number from id
    return nums
nums = [12,54,235,1223,21,765,43,763,9522,908]
id = encode(nums)
decodedNums = decode(id)
print(id)
print(decodedNums)
Result:
21225432354122322137652433763495223908
[12, 54, 235, 1223, 21, 765, 43, 763, 9522, 908]
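If a shorter id matters more than readability, one alternative (not part of the answer above) is to treat the list as the digits of a single number in base 10001 and write that number with a 62-character alphabet. The sketch below is only an illustration under stated assumptions: every value lies between 0 and 10000, the list length is known when decoding, and the names pack_ids, unpack_ids, BASE and ALPHABET are made up for this example.

```python
import string

BASE = 10001  # assumption: every value is in the range 0..10000
ALPHABET = string.digits + string.ascii_letters  # 62 symbols for the output id

def pack_ids(nums):
    """Fold the list into one integer, then write that integer in base 62."""
    n = 0
    for x in nums:
        if not 0 <= x < BASE:
            raise ValueError("value out of range: %r" % x)
        n = n * BASE + x
    if n == 0:
        return ALPHABET[0]
    out = []
    while n:
        n, r = divmod(n, len(ALPHABET))
        out.append(ALPHABET[r])
    return "".join(reversed(out))

def unpack_ids(text, count=10):
    """Invert pack_ids; count must equal the original list length."""
    n = 0
    for ch in text:
        n = n * len(ALPHABET) + ALPHABET.index(ch)
    nums = []
    for _ in range(count):
        n, x = divmod(n, BASE)
        nums.append(x)
    return nums[::-1]

nums = [12, 54, 235, 1223, 21, 765, 43, 763, 9522, 908]
packed = pack_ids(nums)
print(packed, len(packed))
print(unpack_ids(packed))
```

With ten values below 10001 the packed integer is below 10001^10, so the base-62 id is at most 23 characters long, compared with 38 digits for the length-prefixed string above.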

Related

Creating dictionary from a '.fasta' file containing several genes from an organism

I have a '.txt' file in which a list of genes are given and their sequence. I need to create a dictionary in which the keys are the names of the genes and the values are the sequences.
I want the output of the dictionary to be this:
dict = {'sequence1' : 'AATTGGCC', 'sequence2' : 'AAGGCCTT', ...}
So this is what I tried, but I ran into some problems:
dictionary = {}
accesion_number = ""
sequentie = ""
with open("6EP.fasta", "r") as proteoom:
    for line in proteoom:
        if line.startswith(">"):
            line.strip()
            dictionary[accesion_number] = sequentie
            sequentie = ""
        else:
            sequentie = sequentie + line.rstrip().strip("\n").strip("\r")
    dictionary[accesion_number] = sequentie
Does anyone know what went wrong here, and how I can fix it?
Thanks in advance!
I can think of two ways to do this:
High memory usage
If the file is not too large, you can use readlines() and then use the indexes like so:
IDs = []
sequences = []
with open('Proteome.fasta', 'r') as f:
    raw_data = f.readlines()
for i, l in enumerate(raw_data):
    if l[0] == '>':
        IDs.append(l)
        sequences.append(raw_data[i + 1]) # assumes each sequence fits on one line
Low memory usage
Now, if you don't want to load the contents of the file into memory, then I think you can read the file twice by saving the indexes of every ID line plus one, like so:
Get the '>' lines and their indexes, which will be the ID index plus one
Compare if the line number is in the indexes list and, if so, then append the content to your variable
In here, I'm taking advantage of the fact that the lists are, by definition, sorted.
IDs = []
indexes = []
sequences = []
with open('Proteome.fasta', 'r') as f:
    for i, l in enumerate(f):
        if l[0] == '>': # Only header lines carry an ID
            IDs.append(l) # Get your IDs
            indexes.append(i + 1) # Get the index of the ID + 1
with open('Proteome.fasta', 'r') as f:
    for i, l in enumerate(f):
        if indexes and i == indexes[0]: # Check whether line matches with the index
            sequences.append(l) # Get your sequence
            indexes.pop(0) # Remove the first element of the indexes
I hope this helps! ;)
Code
ids = []
seq = []
char = ['_', ':', '*', '#'] #invalid in sequence
seqs = ''
with open('fasta.txt', 'r') as f: #open sample fasta
    for line in f:
        if line.startswith('>'):
            ids.append(line.strip('\n'))
            if seqs != '': #if there's previous seq
                seq.append(seqs) #append the seq
                seqs = '' #then start a new seq
        elif line.strip('\n') not in char: #skip lines holding only an invalid character
            seqs = seqs + line.strip('\n') #build seq with each line until '>'
seq.append(seqs) #append any remaining seq
print(ids)
print(seq)
Result
['>SeqABCD [organism=Mus musculus]', '>SeqABCDE [organism=Plasmodium falciparum]']
['ACGTCAGTCACGTACGTCAGTTCAGTC...', 'GGTACTGCAAAGTTCTTCCGCCTGATTA...']
Sample File
>SeqABCD [organism=Mus musculus]
ACGTCAGTCACGTACGTCAGTTCAGTCARYSTYSATCASMBMBDH
ATCGTTTTTATGTAATTGCTTATTGTTGTGTGTAGATTTTTTAA
AAATATCATTTGAGGTCAATACAAATCCTATTTCTATCGTTTTT
CCCTAAACCCTAAACCCTAAACCCTAAACCTCTGAATCCTTAAT
>SeqABCDE [organism=Plasmodium falciparum]
GGTACTGCAAAGTTCTTCCGCCTGATTAATTATCCATTTTACCTT
TTGTTTTGCTTCTTTGAAGTAGTTTCTCTTTGCAAAATTCCTCTT
GGTACTGCAAAGTTCTTCCGCCTGATTAATTATCCGGTACTGCAA
AGTCAATTTTATATAATTTAATCAAATAAATAAGTTTATGGTTAA
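The question asks for a dictionary rather than two lists. In the question's code the key variable accesion_number is never assigned from the header line, so every sequence ends up stored under the empty string. A minimal sketch of a dictionary-building parser follows; the function name parse_fasta and the choice to use the first word after '>' as the key are assumptions made for this example, not something the answers above prescribe.

```python
def parse_fasta(lines):
    """Build {name: sequence} from an iterable of FASTA lines."""
    records = {}
    name = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if name is not None:
                records[name] = "".join(chunks)  # store the finished record
            words = line[1:].split()
            name = words[0] if words else ""  # assumption: key is the first word
            chunks = []
        elif name is not None:
            chunks.append(line)  # sequences may span several lines
    if name is not None:
        records[name] = "".join(chunks)  # store the last record
    return records

sample = [
    ">SeqABCD [organism=Mus musculus]\n",
    "ACGTCAGTCACG\n",
    "TACGTCAGTTCA\n",
    ">SeqABCDE [organism=Plasmodium falciparum]\n",
    "GGTACTGCAAAG\n",
]
genes = parse_fasta(sample)
print(genes)
```

To read a real file, pass the open file object instead of the sample list, for example parse_fasta(open('fasta.txt')), since a file iterates line by line.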

Need help understanding how gsub and tonumber are used to encode lua source code?

I'm new to Lua but figured out that gsub is a global substitution function and tonumber is a conversion function. What I don't understand is how the two functions are used together to produce an encoded string.
I've already tried reading parts of PIL (Programming in Lua) and the reference manual, but I am still a bit confused.
local L0_0, L1_1
function L0_0(A0_2)
return (A0_2:gsub("..", function(A0_3)
return string.char((tonumber(A0_3, 16) + 256 - 13 + 255999744) % 256)
end))
end
encodes = L0_0
L0_0 = gg
L0_0 = L0_0.toast
L1_1 = "__loading__\226\128\166"
L0_0(L1_1)
L0_0 = encodes
L1_1 = --"The Encoded String"
L0_0 = L0_0(L1_1)
L1_1 = load
L1_1 = L1_1(L0_0)
pcall(L1_1)
I removed the encoded string where I put the comment because of how long it was. If needed I can upload the encoded string as well.
gsub is used to take two-character sections of A0_2. Each string A0_3 is therefore a 2-digit hexadecimal number, but it is still a string, so we cannot perform math on the value yet. That A0_3 is a hex number can be inferred from how tonumber is used.
tonumber from Lua 5.1 Reference Manual:
Tries to convert its argument to a number. If the argument is already a number or a string convertible to a number, then tonumber returns this number; otherwise, it returns nil.
An optional argument specifies the base to interpret the numeral. The base may be any integer between 2 and 36, inclusive. In bases above 10, the letter 'A' (in either upper or lower case) represents 10, 'B' represents 11, and so forth, with 'Z' representing 35. In base 10 (the default), the number can have a decimal part, as well as an optional exponent part (see §2.1). In other bases, only unsigned integers are accepted.
So tonumber(A0_3, 16) means we expect A0_3 to be a base 16 (hexadecimal) number.
Once we have the number value of A0_3 we do some math and finally convert it to a character.
function L0_0(A0_2)
return (A0_2:gsub("..", function(A0_3)
return string.char((tonumber(A0_3, 16) + 256 - 13 + 255999744) % 256)
end))
end
This block of code takes a string of hex digits and converts them into chars. tonumber is being used to allow for the manipulation of the values.
Here is an example of how this works with Hello World:
local str = "Hello World"
local hex_str = ''
for i = 1, #str do
  -- %02x keeps every byte at two hex digits so gsub("..") stays aligned
  hex_str = hex_str .. string.format("%02x", str:byte(i,i))
end
function L0_0(A0_2)
  return (A0_2:gsub("..", function(A0_3)
    return string.char((tonumber(A0_3, 16) + 256 - 13 + 255999744) % 256)
  end))
end
local encoded = L0_0(hex_str)
print(encoded)
Output
;X__bJbe_W
And taking it back to the original string:
function decode(A0_2)
  return (A0_2:gsub("..", function(A0_3)
    return string.char((tonumber(A0_3, 16) + 13) % 256)
  end))
end
hex_string = ''
for i = 1, #encoded do
  hex_string = hex_string .. string.format("%02x", encoded:byte(i,i))
end
print(decode(hex_string))
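The large constants in the obfuscated function only disguise a simple shift: 255999744 is 256 * 999999, so it vanishes modulo 256, and what remains is "subtract 13 modulo 256". The sketch below restates that arithmetic in Python purely as an illustration; shift_hex_pairs and OFFSET are names invented for this example and are not part of the Lua script.

```python
OFFSET = 256 - 13 + 255999744  # the constants from the Lua function

def shift_hex_pairs(hex_text, offset):
    """Read hex_text two digits at a time and shift each byte by offset modulo 256."""
    out = bytearray()
    # A trailing unpaired digit is ignored here (gsub("..") would leave it untouched).
    for i in range(0, len(hex_text) - 1, 2):
        out.append((int(hex_text[i:i + 2], 16) + offset) % 256)
    return bytes(out)

# The big constant is a multiple of 256, so only the -13 survives the modulo.
assert 255999744 % 256 == 0
assert OFFSET % 256 == (-13) % 256

plain = b"Hello World"
encoded = shift_hex_pairs(plain.hex(), OFFSET)   # same effect as L0_0
decoded = shift_hex_pairs(encoded.hex(), 13)     # same effect as decode
print(encoded)
print(decoded)
```

The space in "Hello World" becomes byte 0x13, a control character, which is why the printed Lua output looks one character shorter than the input.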

How can I change ascii string to hex and vice versa in python 3.7?

I looked at some solutions on this site, but they do not work in Python 3.7, so I am asking a new question.
The hex string of "the" is "746865".
I want a solution that converts "the" to "746865" and "746865" back to "the".
Assuming your string contains only ASCII characters (each code point in the range 0-0x7f), you can use the following snippet:
In [28]: s = '746865'
In [29]: import math
In [30]: int(s, base=16).to_bytes(math.ceil(len(s) / 2), byteorder='big').decode('ascii')
Out[30]: 'the'
First convert the hex string to an integer with base 16, then convert that integer to bytes (two hex characters per byte), and finally turn the bytes back into a string with decode.
#!/usr/bin/python3
"""
Program name: txt_to_ASC.py
The program transfers
a string of letters -> the corresponding
string of hexadecimal ASCII-codes,
eg. the -> 746865
Only letters in [abc...xyzABC...XYZ] should be input.
"""
print("Transfer letters to hex ASCII-codes")
print("Input range is [abc...xyzABC...XYZ].")
print()
string = input("Input set of letters, eg. the: ")
print("hex ASCII-code: " + " "*15, end = "")
def str_to_hasc(x):
    """Return the hex ASCII codes of x as one string."""
    byt = bytes(x, 'utf-8')
    return byt.hex()
print(str_to_hasc(string))
If you have a byte string, then:
>>> import binascii
>>> binascii.hexlify(b'the')
b'746865'
If you have a Unicode string, you can encode it:
>>> s = 'the'
>>> binascii.hexlify(s)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: a bytes-like object is required, not 'str'
>>> binascii.hexlify(s.encode())
b'746865'
The result is a byte string, you can decode it to get a Unicode string:
>>> binascii.hexlify(s.encode()).decode()
'746865'
The reverse, of course, is:
>>> binascii.unhexlify(b'746865')
b'the'
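In Python 3.5 and later the same round trip can also be written with the built-in bytes.hex() and bytes.fromhex() methods, without importing binascii. This is a small sketch assuming ASCII-only input; the helper names text_to_hex and hex_to_text are made up for the example.

```python
def text_to_hex(text):
    """Encode ASCII text as a string of hex digits."""
    return text.encode("ascii").hex()

def hex_to_text(hex_text):
    """Decode a string of hex digits back to ASCII text."""
    return bytes.fromhex(hex_text).decode("ascii")

print(text_to_hex("the"))     # 746865
print(hex_to_text("746865"))  # the
```

Both helpers raise an exception on bad input (non-ASCII text, or a hex string with an odd number of digits) rather than returning a partial result.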
#!/usr/bin/python3
"""
Program name: ASC_to_txt.py
The program's input is a string of hexadecimal digits.
The string is a bytes object, and each byte is supposed to be
the hex ASCII-code of a (capital or small) letter.
The program's output is the string of the corresponding letters.
Example
Input: 746865
First subresult: ['7','4','6','8','6','5']
Second subresult: ['0x74', '0x68', '0x65']
Third subresult: [116, 104, 101]
Final result: the
References
Contribution by alhelal to stackoverflow.com (20180901)
Contribution by QintenG to stackoverflow.com (20170104)
Mark Pilgrim, Dive into Python 3, section 4.6
"""
import string
print("The program converts a string of hex ASCII-codes")
print("into the corresponding string of letters.")
print("Input range is [41, 42, ..., 5a] U [61, 62, ..., 7a]. \n")
x = input("Input the hex ASCII-codes, eg. 746865: ")
result_1 = []
for i in range(0,len(x)//2):
    for j in range(0,2):
        result_1.extend(x[2*i+j])
# First subresult
lenres_1 = len(result_1)
result_2 = []
for i in range(0,len(result_1) - 1,2):
    temp = ""
    temp = temp + "0x" + result_1[i] #0, 2, 4
    temp = temp + result_1[i + 1] #1, 3, 5
    result_2.append(temp)
# Second subresult
result_3 = []
for i in range(0,len(result_2)):
    result_3.append(int(result_2[i],16))
# Third subresult
by = bytes(result_3)
result_4 = by.decode('utf-8')
# Final result
print("Corresponding string of letters:" + " "*6, result_4, end = "\n")

Longest substring in alphabetical order [closed]

Write a program that prints the longest substring of s in which the letters occur in alphabetical order. For example, if s = 'azcbobobegghakl', then your program should print
Longest substring in alphabetical order is: beggh
In the case of ties, print the first substring. For example, if s = 'abcbcd', then your program should print
Longest substring in alphabetical order is: abc
Here you go, edX student. I had some help finishing this code:
from itertools import count
def long_sub(input_string):
    maxsubstr = input_string[0:0] # empty slice (to accept subclasses of str)
    for start in range(len(input_string)): # O(n)
        for end in count(start + len(maxsubstr) + 1): # O(m)
            substr = input_string[start:end] # O(m)
            if len(substr) != (end - start): # ran past the end of the string
                break
            if sorted(substr) == list(substr):
                maxsubstr = substr
    return maxsubstr
sub = (long_sub(s))
print("Longest substring in alphabetical order is: %s" % sub)
These are all assuming you have a string (s) and are needing to find the longest substring in alphabetical order.
Option A
test = s[0] # seed with first letter in string s
best = s[0] # longest sequence seen so far
for n in range(1, len(s)): # have s[0] so compare to s[1]
    if s[n] >= s[n-1]:
        test = test + s[n] # add s[1] to s[0] if greater or equal
    else: # if not, start a new sequence
        test = s[n]
    if len(test) > len(best): # check after updating so the final run is counted
        best = test
print("Longest substring in alphabetical order is: " + best)
Option B
maxSub, currentSub, previousChar = '', '', ''
for char in s:
    if char >= previousChar:
        currentSub = currentSub + char
        if len(currentSub) > len(maxSub):
            maxSub = currentSub
    else: currentSub = char
    previousChar = char
print(maxSub)
Option C
matches = []
current = [s[0]]
for index, character in enumerate(s[1:]):
    if character >= s[index]: current.append(character)
    else:
        matches.append(current)
        current = [character]
matches.append(current) # include the final run, which the loop never appends
print("".join(max(matches, key=len)))
Option D
def longest_ascending(s):
    matches = []
    current = [s[0]]
    for index, character in enumerate(s[1:]):
        if character >= s[index]:
            current.append(character)
        else:
            matches.append(current)
            current = [character]
    matches.append(current)
    return "".join(max(matches, key=len))
print(longest_ascending(s))
The following code solves the problem using the reduce function:
from functools import reduce # reduce is not a builtin in Python 3
solution = ''
def check(substr, char):
    global solution
    last_char = substr[-1]
    substr = (substr + char) if char >= last_char else char
    if len(substr) > len(solution):
        solution = substr
    return substr
def get_largest(s):
    global solution
    solution = s[:1] # the first character is the initial candidate
    if s:
        reduce(check, list(s))
    return solution

alignment of sequences

I want to do pairwise alignment with uniprot and pdb sequences. I have an input file containing uniprot and pdb IDs like this.
pdb id uniprot id
1dbh Q07889
1e43 P00692
1f1s Q53591
I need to:
1) Read each line of the input file
2) Retrieve the pdb and uniprot sequences from the pdb.fasta and uniprot.fasta files
3) Do the alignment and calculate the sequence identity.
Usually, I use the following program for pairwise alignment and seq.identity calculation.
library("seqinr")
seq1 <- "MDEKRRAQHNEVERRRRDKINNWIVQLSKIIPDSSMESTKSGQSKGGILSKASDYIQELRQSNHR"
seq2<- "MKGQQKTAETEEGTVQIQEGAVATGEDPTSVAIASIQSAATFPDPNVKYVFRTENGGQVM"
library(Biostrings)
globalAlign<- pairwiseAlignment(seq1, seq2)
pid(globalAlign, type = "PID3")
I need to print the output like this
pdbid uniprotid seq.identity
1dbh Q07889 99
1e43 P00692 80
1f1s Q53591 56
How can I change the above code? Your help would be appreciated!
This code is hopefully what you're looking for:
class test():
    def get_seq(self, pdb,fasta_file): # Get sequences
        from Bio.PDB.PDBParser import PDBParser
        from Bio import SeqIO
        aa = {'ARG':'R','HIS':'H','LYS':'K','ASP':'D','GLU':'E','SER':'S','THR':'T','ASN':'N','GLN':'Q','CYS':'C','SEC':'U','GLY':'G','PRO':'P','ALA':'A','ILE':'I','LEU':'L','MET':'M','PHE':'F','TRP':'W','TYR':'Y','VAL':'V'}
        p=PDBParser(PERMISSIVE=1)
        structure_id="%s" % pdb[:-4]
        structure=p.get_structure(structure_id, pdb)
        residues = structure.get_residues()
        seq_pdb = ''
        for res in residues:
            res = res.get_resname()
            if res in aa:
                seq_pdb = seq_pdb+aa[res]
        handle = open(fasta_file, "r")
        for record in SeqIO.parse(handle, "fasta") :
            seq_fasta = record.seq
        handle.close()
        self.seq_aln(seq_pdb,seq_fasta)
    def seq_aln(self,seq1,seq2): # Align the sequences
        from Bio import pairwise2
        from Bio.SubsMat import MatrixInfo as matlist # only in older Biopython releases
        matrix = matlist.blosum62
        gap_open = -10
        gap_extend = -0.5
        alns = pairwise2.align.globalds(seq1, seq2, matrix, gap_open, gap_extend)
        top_aln = alns[0]
        aln_seq1, aln_seq2, score, begin, end = top_aln
        with open('aln.fasta', 'w') as outfile:
            outfile.write('> PDB_seq\n'+str(aln_seq1)+'\n> Uniprot_seq\n'+str(aln_seq2))
        print(str(aln_seq1)+'\n'+str(aln_seq2))
        self.seq_id('aln.fasta')
    def seq_id(self,aln_fasta): # Get sequence ID
        import string
        from Bio import AlignIO
        input_handle = open(aln_fasta, "r")
        alignment = AlignIO.read(input_handle, "fasta")
        j=0 # counts positions in first sequence
        i=0 # counts identity hits
        for record in alignment:
            #print record
            for amino_acid in record.seq:
                if amino_acid == '-':
                    pass
                else:
                    if amino_acid == alignment[0].seq[j]:
                        i += 1
                j += 1
            j = 0
            seq = str(record.seq)
            gap_strip = seq.replace('-', '')
            percent = 100*i//len(gap_strip) # integer percentage
            print(record.id+' '+str(percent))
            i=0
a = test()
a.get_seq('1DBH.pdb','Q07889.fasta')
This outputs:
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------EQTYYDLVKAF-AEIRQYIRELNLIIKVFREPFVSNSKLFSANDVENIFSRIVDIHELSVKLLGHIEDTVE-TDEGSPHPLVGSCFEDLAEELAFDPYESYARDILRPGFHDRFLSQLSKPGAALYLQSIGEGFKEAVQYVLPRLLLAPVYHCLHYFELLKQLEEKSEDQEDKECLKQAITALLNVQSG-EKICSKSLAKRRLSESA-------------AIKK-NEIQKNIDGWEGKDIGQCCNEFI-EGTLTRVGAKHERHIFLFDGL-ICCKSNHGQPRLPGASNAEYRLKEKFF-RKVQINDKDDTNEYKHAFEIILKDENSVIFSAKSAEEKNNW-AALISLQYRSTL---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
MQAQQLPYEFFSEENAPKWRGLLVPALKKVQGQVHPTLESNDDALQYVEELILQLLNMLCQAQPRSASDVEERVQKSFPHPIDKWAIADAQSAIEKRKRRNPLSLPVEKIHPLLKEVLGYKIDHQVSVYIVAVLEYISADILKLVGNYVRNIRHYEITKQDIKVAMCADKVLMDMFHQDVEDINILSLTDEEPSTSGEQTYYDLVKAFMAEIRQYIRELNLIIKVFREPFVSNSKLFSANDVENIFSRIVDIHELSVKLLGHIEDTVEMTDEGSPHPLVGSCFEDLAEELAFDPYESYARDILRPGFHDRFLSQLSKPGAALYLQSIGEGFKEAVQYVLPRLLLAPVYHCLHYFELLKQLEEKSEDQEDKECLKQAITALLNVQSGMEKICSKSLAKRRLSESACRFYSQQMKGKQLAIKKMNEIQKNIDGWEGKDIGQCCNEFIMEGTLTRVGAKHERHIFLFDGLMICCKSNHGQPRLPGASNAEYRLKEKFFMRKVQINDKDDTNEYKHAFEIILKDENSVIFSAKSAEEKNNWMAALISLQYRSTLERMLDVTMLQEEKEEQMRLPSADVYRFAEPDSEENIIFEENMQPKAGIPIIKAGTVIKLIERLTYHMYADPNFVRTFLTTYRSFCKPQELLSLIIERFEIPEPEPTEADRIAIENGDQPLSAELKRFRKEYIQPVQLRVLNVCRHWVEHHFYDFERDAYLLQRMEEFIGTVRGKAMKKWVESITKIIQRKKIARDNGPGHNITFQSSPPTVEWHISRPGHIETFDLLTLHPIEIARQLTLLESDLYRAVQPSELVGSVWTKEDKEINSPNLLKMIRHTTNLTLWFEKCIVETENLEERVAVVSRIIEILQVFQELNNFNGVLEVVSAMNSSPVYRLDHTFEQIPSRQKKILEEAHELSEDHYKKYLAKLRSINPPCVPFFGIYLTNILKTEEGNPEVLKRHGKELINFSKRRKVAEITGEIQQYQNQPYCLRVESDIKRFFENLNPMGNSMEKEFTDYLFNKSLEIEPRNPKPLPRFPKKYSYPLKSPGVRPSNPRPGTMRHPTPLQQEPRKISYSRIPESETESTASAPNSPRTPLTPPPASGASSTTDVCSVFDSDHSSPFHSSNDTVFIQVTLPHGPRSASVSSISLTKGTDEVPVPPPVPPRRRPESAPAESSPSKIMSKHLDSPPAIPPRQPTSKAYSPRYSISDRTSISDPPESPPLLPPREPVRTPDVFSSSPLHLQPPPLGKKSDHGNAFFPNSPSPFTPPPPQTPSPHGTRRHLPSPPLTQEVDLHSIAGPPVPPRQSTSQHIPKLPPKTYKREHTHPSMHRDGPPLLENAHSS
PDB_seq 100 # pdb to itself would obviously have 100% identity
Uniprot_seq 24 # pdb sequence has 24% identity to the uniprot sequence
For this to work on your input file, you need to put the a.get_seq() call in a for loop that takes its inputs from your text file.
EDIT:
Replace the seq_id function with this one:
def seq_id(self,aln_fasta):
    import string
    from Bio import AlignIO
    from Bio import SeqIO
    record_iterator = SeqIO.parse(aln_fasta, "fasta")
    first_record = next(record_iterator)
    print('%s has a length of %d' % (first_record.id, len(str(first_record.seq).replace('-',''))))
    second_record = next(record_iterator)
    print('%s has a length of %d' % (second_record.id, len(str(second_record.seq).replace('-',''))))
    lengths = [len(str(first_record.seq).replace('-','')), len(str(second_record.seq).replace('-',''))]
    if lengths.index(min(lengths)) == 0: # If both sequences have the same length the PDB sequence will be taken as the shortest
        print('PDB sequence has the shortest length')
    else:
        print('Uniprot sequence has the shortest length')
    identities = 0
    for i,v in enumerate(first_record.seq):
        if v == '-':
            continue # gaps never count as identities
        #print i,v, second_record.seq[i]
        if v == second_record.seq[i]:
            identities +=1
            #print i,v, second_record.seq[i], identities
    # multiply before dividing so the result is not truncated to an integer
    print('Sequence Identity = %.2f percent' % (100.0*identities/min(lengths)))
To pass the arguments to the class use:
with open('input_file.txt', 'r') as infile:
    next(infile)
    next(infile) # Going by your input file
    for line in infile:
        segs = line.split()
        a.get_seq(segs[0]+'.pdb',segs[1]+'.fasta')
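The identity calculation itself does not depend on Biopython. As a minimal sketch, assuming the two sequences are already aligned to equal length with '-' marking gaps, the percentage can be computed as below, normalised by the shorter ungapped sequence as in the edited seq_id. The function name percent_identity and the toy sequences are made up for this example.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent identity of two aligned sequences, relative to the shorter ungapped one."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    # Count columns where both sequences carry the same residue (gaps never match).
    matches = sum(1 for a, b in zip(aligned_a, aligned_b) if a == b and a != "-")
    shortest = min(len(aligned_a.replace("-", "")), len(aligned_b.replace("-", "")))
    if shortest == 0:
        raise ValueError("at least one sequence contains only gaps")
    return 100.0 * matches / shortest

print(percent_identity("--EQTYY-DL", "MQEQTYYADL"))  # 100.0
print(percent_identity("ACGT", "ACGA"))              # 75.0
```

Other definitions of identity divide by the alignment length or by the number of aligned columns instead, so the denominator should be chosen to match whatever the downstream comparison expects.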
It might be something like this; a repeatable example (e.g., with short files posted on-line) would help...
library(Biostrings)
pdb = readAAStringSet("pdb.fasta")
uniprot = readAAStringSet("uniprot.fasta")
to read all sequences into two objects. pairwiseAlignment accepts a vector as first (query) argument, so if you want to align all pdb sequences against all uniprot sequences, pre-allocate a result matrix
pids = matrix(numeric(), length(uniprot), length(pdb),
dimnames=list(names(uniprot), names(pdb)))
and then do the calculations
for (i in seq_along(uniprot)) {
globalAlignment = pairwiseAlignment(pdb, uniprot[i])
pids[i,] = pid(globalAlignment)
}
