I'm currently working on a script that analyzes skew differences. Unfortunately, my problem is that when the length of the string increases, the runtime becomes too long and I can't seem to calculate my answer.
def SkewGC(file):
    countG = 0
    countC = 0
    diffGtoC = ""
    # first, we need to find number of G's.
    # the idea is, if G appears, we add it to the count.
    # We'll just do the same to each one.
    for pos in range(0,len(file)):
        if file[pos] == "G":
            countG = countG+1
        if file[pos] == "C":
            countC = countC+1
        diffGtoC = diffGtoC + str(countG-countC) + ","
    return diffGtoC.split(",")
SkewGCArray = SkewGC(data)
# This because I included extra "," at the end...
SkewGCArray = [int(i) for i in SkewGCArray[:len(SkewGCArray)-1]]
def min_locator(file):
    min_indices = ""
    for pos in range(0,len(file)):
        if file[pos] == min(file):
            min_indices = min_indices + str(pos) + " "
    return min_indices
print min_locator(SkewGCArray)
Essentially, this script calculates the number of G and C (corresponds to nucleotides in DNA), obtains differences at each position, and then I'm trying to find the indices of minimum. It works fine for low length of file (that's the input string) but when the length becomes large - even like 90000+, then my script runs but cannot resolve to an answer in reasonable time (~4-5 min).
Can anyone point to me what I could do to make it quicker? I've thought about whether it's better to say, obtain the difference (diffGtoC), set that as the minimum, and then re-calculate each difference until it sees something different during which I also replace the minimum value too.
But the concern I had that with this approach is on finding and retaining the indices of minimum. If I say, had an array with values:
[-4,-2,-5,-6,-5,-6]
I can see how updating the minimum value as I go (-4 to -5 and then to -6) would make the algorithm faster, but how would I keep track of both positions of -6? I'm not sure if this makes complete sense.
Several suggestions to improve the performance of your code:
diffGtoC = diffGtoC + str(countG-countC) + ","
return diffGtoC.split(",")
is actually equivalent to:
diffGtoC = list()
diffGtoC.append(countG - countC)
Strings are immutable in Python, so you are generating a new string for every position which is not very efficient. Using a list will also save you the str and int conversions you are performing and the truncation of your list. You could also use pop() to remove the last item of your list instead of generating a new one.
A really simple alternative would be to search for the minimum and only store the minimum value and its position. Then start iterating from the minimum position and see if you can find the minimum again and if yes append it to the first minimum position. Less data manipulation which saves time and memory.
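The suggestions above can be sketched as follows. This is only a sketch, not the asker's exact script: the names skew_gc and min_indices are hypothetical, and the input is assumed to be a plain string of nucleotides. The original slowdown comes mostly from calling min(file) inside the loop, which rescans the whole list at every position; here the minimum is tracked in a single pass, and the list of indices is reset whenever a smaller value appears, which also answers the question about keeping both positions of -6.

```python
def skew_gc(sequence):
    """Return the running (G count - C count) after each position."""
    skew = []
    count = 0
    for base in sequence:
        if base == "G":
            count += 1
        elif base == "C":
            count -= 1
        skew.append(count)  # a list avoids rebuilding a string each time
    return skew


def min_indices(values):
    """Return every index that holds the minimum, using one pass."""
    best = None
    indices = []
    for pos, value in enumerate(values):
        if best is None or value < best:
            best = value      # new minimum: forget the old positions
            indices = [pos]
        elif value == best:
            indices.append(pos)  # same minimum seen again: keep both
    return indices


print(min_indices([-4, -2, -5, -6, -5, -6]))  # [3, 5]
```

Both functions touch each element once, so the total work grows linearly with the length of the input.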
I am currently having an issue. Basically, I have 2 similar functions in terms of concept but the results do not align. These are the codes I learned from Bioinformatics I on Coursera.
The first code is simply creating a dictionary of occurrences of each k-mer pattern from a text (which is a long stretch of nucleotides). In this case, k is 5.
def FrequencyMap(text,k):
    freq ={}
    for i in range (0, len(text)-k+1):
        freq[text[i:i+k]]=0
        for j in range (0, len(text)-k+1):
            if text[j:j+k] == text[i:i+k]:
                freq[text[i:i+k]] +=1
    return freq, max(freq)
The text and the result dictionary are kinda long, but the main point is when I call max(freq), it returns the key 'TTTTC', which has a value of 1.
Meanwhile, I wrote another code that is simply based on the previous code to generate the 5-mer patterns that have the max values (number of occurrences in the text).
def FrequentWords(text, k):
    a = FrequencyMap(text, k)
    m = max(a.values())
    words = []
    for i in a:
        if a[i]==m:
            words.append(i)
    return words,m
And this code returns 'ACCTA', which has the value of 99, meaning it appears 99 times in the text. This makes total sense.
I used the same text and k (k=5) for both codes. I ran the codes on Jupyter Notebook. Why does the first one not return 'ACCTA'?
Thank you so much,
Here is the text, if anyone wants to try:
"ACCATCCCTAGGGCATACCTAAGTCTACCTAAAAGGCTACCTAATACCATACCTAATTACCTAACTACCTAAAATAAGTCTACCTAATACCTAATACCTAAAGTTACCTAACGTACCTAATACCTAATACCTAACCACTACCTAATCCGATTTACCTAACAACCGATCGAGTACCTAATCGATACCTAAATAACGGACAATATACCTAATTACCTAATACCTAATACCTAAGTGTACCTAAGACGTCTACCTAATTGTACCTAACTACCTAATTACCTAAGATTAATACCTAATACCTAATTTACCTAATACCTAACGTGGACTACCTAATACCTAACTTTTCCCCTACCTAATACCTAACTGTACCTAAATACCTAATACCTAAGCTACCTAAAGAACAACATTGTACGTGCGCCGTACCTAAATACCTAACAACTACCTAACTGATACCTAATAGTGATTACCTAACGCTTCTACCTAACTACCTAAGTACCTAACGCTACCTAACTACCTAATGTCCACAAAATACCTAATACCTAATAGCTACCTAATTGTGTACCTAAGTACCTAACCTACCTAATAATACCTAAAAATACCTAAGTACCTAACGTACCTAAATTTTACCTAATCTACCTAACGTACCTAATACCTAATTATACCTAATTACCTAATGGTTACCTAAGTTACCTAATATGCCACTACCTAACCTTACCTAAGACCTACCTAATAGGTACCTAACTGGGTACCTAAGGCAGTTTACCTAATTCAGGGCTACCTAATGTACCTAATACCTAAGTACCTAATACCTAATCCCATACCTAATATTTACCTAAGGGCACCGGTACCTAATACCTAATACCTAATACCTAAACCTTCGTACCTAAATACCTAATCTACCTAATGTACCTAAGGTACCTAATACCTAAGTCACTACCTAATACCTAATACCTAATGGGAGGAGCTTACCTAAGGTTACCTAATTACCTAAATACCTAATCGTTACCTAA"
Why does the first one not return 'ACCTA'?
Because max(freq) returns the maximum key of the dictionary, not the key with the maximum value. In this case the keys are strings (the k-mers), and strings are compared alphabetically. Hence the maximum one is the last string when they are sorted alphabetically.
If you want the first function to return the k-mer that occurs most often, you should change max(freq) to max(freq.items(), key=lambda key_value_pair: key_value_pair[1])[0]. Here, max compares the (kmer, count) pairs (that's the key_value_pair parameter of the lambda expression) by their count, and the trailing [0] selects the k-mer from the winning pair.
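A small illustration of the difference, using a toy dictionary that contains only the two k-mers and counts mentioned in the question (the real dictionary has many more keys). It also shows max(freq, key=freq.get), a shorter spelling that should give the same k-mer as the items()-based expression.

```python
# Toy frequency map: only the two entries quoted in the question.
freq = {"ACCTA": 99, "TTTTC": 1}

largest_key = max(freq)                  # compares the keys as strings
most_common = max(freq, key=freq.get)    # compares the counts instead
via_items = max(freq.items(), key=lambda key_value_pair: key_value_pair[1])[0]

print(largest_key)   # TTTTC
print(most_common)   # ACCTA
print(via_items)     # ACCTA
```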
How can I convert a z3.String to a sequence of ASCII values?
For example, here is some code that I thought would check whether the ASCII values of all the characters in the string add up to 100:
import z3
def add_ascii_values(password):
    return sum(ord(character) for character in password)
password = z3.String("password")
solver = z3.Solver()
ascii_sum = add_ascii_values(password)
solver.add(ascii_sum == 100)
print(solver.check())
print(solver.model())
Unfortunately, I get this error:
TypeError: ord() expected string of length 1, but SeqRef found
It's apparent that ord doesn't work with z3.String. Is there something in Z3 that does?
The accepted answer dates back to 2018, and things have changed in the meantime, so the proposed solution no longer works with z3. In particular:
Strings are now formalized by SMTLib. (See https://smtlib.cs.uiowa.edu/theories-UnicodeStrings.shtml)
Unlike the previous version (where strings were simply sequences of bit vectors), strings are now sequences of Unicode characters. So, the coding used in the previous answer no longer applies.
Based on this, the following would be how this problem would be coded, assuming a password of length 3:
from z3 import *
s = Solver()
# Ord of character at position i
def OrdAt(inp, i):
    return StrToCode(SubString(inp, i, 1))
# Adding ascii values for a string of a given length
def add_ascii_values(password, len):
    return Sum([OrdAt(password, i) for i in range(len)])
# We'll have to force a constant length
length = 3
password = String("password")
s.add(Length(password) == length)
ascii_sum = add_ascii_values(password, length)
s.add(ascii_sum == 100)
# Also require characters to be printable so we can view them:
for i in range(length):
    v = OrdAt(password, i)
    s.add(v >= 0x20)
    s.add(v <= 0x7E)
print(s.check())
print(s.model()[password])
Note: Due to https://github.com/Z3Prover/z3/issues/5773, to be able to run the above, you need a version of z3 that you downloaded on Jan 12, 2022 or afterwards! As of this date, none of the released versions of z3 contain the functions used in this answer.
When run, the above prints:
sat
" #!"
You can check that it satisfies the given constraint, i.e., the ord of characters add up to 100:
>>> sum(ord(c) for c in " #!")
100
Note that we no longer have to worry about modular arithmetic, since OrdAt returns an actual integer, not a bit-vector.
2022 Update
The answer below, written back in 2018, no longer applies: strings in SMTLib received a major update, so the code given is outdated. I'm keeping it here for archival purposes, and in case you happen to have a really old z3 that you cannot upgrade for some reason. See the other answer for a variant that works with the new Unicode strings in SMTLib: https://stackoverflow.com/a/70689580/936310
Old Answer from 2018
You're conflating Python strings and Z3 Strings; and unfortunately the two are quite different types.
In Z3py, a String is simply a sequence of 8-bit values. And what you can do with a Z3 String is actually quite limited; for instance you cannot iterate over the characters like you did in your add_ascii_values function. See this page for what the allowed functions are: https://rise4fun.com/z3/tutorialcontent/sequences (This page lists the functions in SMTLib parlance; but the equivalent ones are available from the z3py interface.)
There are a few important restrictions/things that you need to keep in mind when working with Z3 sequences and strings:
You have to be very explicit about the lengths. In particular, you cannot sum over strings of arbitrary symbolic length. There are a few things you can do without specifying the length explicitly (like regex matches, substring extraction etc.), but these are limited.
You cannot extract a character out of a string. This is an oversight in my opinion, but SMTLib just has no way of doing so for the time being. Instead, you get a list of length 1. This causes a lot of headaches in programming, but there are workarounds. See below.
Anytime you loop over a string/sequence, you have to go up to a fixed bound. There are ways to program so you can cover "all strings up to length N" for some constant "N", but they do get hairy.
Keeping all this in mind, I'd go about coding your example like the following; restricting password to be precisely 10 characters long:
from z3 import *
s = Solver()
# Work around the fact that z3 has no way of giving us an element at an index. Sigh.
ordHelperCounter = 0
def OrdAt(inp, i):
    global ordHelperCounter
    v = BitVec("OrdAtHelper_%d_%d" % (i, ordHelperCounter), 8)
    ordHelperCounter += 1
    s.add(Unit(v) == SubString(inp, i, 1))
    return v
# Your original function, but note the addition of len parameter and use of Sum
def add_ascii_values(password, len):
    return Sum([OrdAt(password, i) for i in range(len)])
# We'll have to force a constant length
length = 10
password = String("password")
s.add(Length(password) == 10)
ascii_sum = add_ascii_values(password, length)
s.add(ascii_sum == 100)
# Also require characters to be printable so we can view them:
for i in range(length):
    v = OrdAt(password, i)
    s.add(v >= 0x20)
    s.add(v <= 0x7E)
print(s.check())
print(s.model()[password])
The OrdAt function works around the problem of not being able to extract characters. Also note how we use Sum instead of sum, and how all "loops" are of fixed iteration count. I also added constraints to make all the ascii codes printable for convenience.
When you run this, you get:
sat
":X|#`y}###"
Let's check it's indeed good:
>>> len(":X|#`y}###")
10
>>> sum(ord(character) for character in ":X|#`y}###")
868
So, we did get a length 10 string; but how come the ord's don't sum up to 100? Now, you have to remember sequences are composed of 8-bit values, and thus the arithmetic is done modulo 256. So, the sum actually is:
>>> sum(ord(character) for character in ":X|#`y}###") % 256
100
To avoid the overflows, you can either use larger bit-vectors, or more simply use Z3's unbounded Integer type Int. To do so, use the BV2Int function, by simply changing add_ascii_values to:
def add_ascii_values(password, len):
    return Sum([BV2Int(OrdAt(password, i)) for i in range(len)])
Now we'd get:
unsat
That's because each of our characters has at least value 0x20 and we wanted 10 characters; so there's no way to make them all sum up to 100. And z3 is precisely telling us that. If you increase your sum goal to something more reasonable, you'd start getting proper values.
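The arithmetic behind both results can be checked in plain Python, without z3. This is just a sanity check under the constraints stated above (10 characters, each between 0x20 and 0x7E); the variable names are made up for the illustration.

```python
length = 10
lowest, highest = 0x20, 0x7E   # printable range required by the constraints
target = 100

min_sum = length * lowest      # smallest sum 10 printable characters can have
max_sum = length * highest     # largest sum they can have
print(min_sum, max_sum)        # 320 1260

# With unbounded integers the target is out of reach, hence "unsat".
reachable_exactly = min_sum <= target <= max_sum
print(reachable_exactly)       # False

# With 8-bit arithmetic the sum only has to match the target modulo 256.
wrapped_sums = [t for t in range(min_sum, max_sum + 1) if t % 256 == target]
print(wrapped_sums)            # [356, 612, 868, 1124]
```

The 868 obtained from the 8-bit version is one of the sums that wrap around to 100, which is why that version was satisfiable.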
Programming with z3py is different than regular programming with Python, and z3 String objects are quite different than those of Python itself. Note that the sequence/string logic isn't even standardized yet by the SMTLib folks, so things can change. (In particular, I'm hoping they'll add functionality for extracting elements at an index!).
Having said all this, going over the https://rise4fun.com/z3/tutorialcontent/sequences would be a good start to get familiar with them, and feel free to ask further questions.
I'm learning Julia, but have relatively little programming experience outside of R. I'm taking this problem directly from rosalind.info and you can find it here if you'd like a bit more detail.
I've been given two strings: a motif and a sequence, where the motif is a substring of the sequence, and I'm tasked with finding the index of the beginning position of the substring each time it is found in the sequence.
For example:
Sequence: "GATATATGCATATACTT"
Motif: "ATAT"
ATAT is found three times, once beginning at index 2, once at index 4, and once at index 10. This is assuming 1-based indexing. So the final output would be: 2 4 10
Here's what I have so far:
f = open("motifs.txt")
stream = readlines(f)
sequence = chomp(stream[1])
motif = chomp(stream[2])
println("Sequence: $sequence")
println("Motif: $motif")
result = searchindex(sequence, motif)
println("$result")
close(f)
My main problem seems to be that there isn't a searchindexall option. The current script gives me the index of the first time the motif is encountered (index 2). I've tried a variety of for loops that haven't ended in much success, so I'm hoping that someone can give me some insight on this.
Here is one solution with while loops:
sequence = "GATATATGCATATACTT"
motif = "ATAT"
function find_indices(sequence, motif)
    # initialise empty array of integers
    found_indices = Array{Int, 1}()
    # set initial values for search helpers
    start_at = 1
    while true
        # search string for occurrence of motif, starting at start_at
        result = searchindex(sequence, motif, start_at)
        # if motif not found, terminate while loop
        result == 0 && break
        # searchindex returns an index into the whole string, so store it as-is
        push!(found_indices, result)
        # resume one position past this match so overlapping matches are found
        start_at = result + 1
    end
    return found_indices
end
This gives what you want:
>find_indices(sequence, motif)
2
4
10
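For comparison, here is the same loop sketched in Python rather than Julia, so it can be run without the older searchindex function. It assumes str.find as the search primitive (which returns a 0-based index, or -1 when nothing is found) and converts to the 1-based positions the problem asks for.

```python
def find_indices(sequence, motif):
    """Return the 1-based start of every (possibly overlapping) match."""
    found = []
    start = 0
    while True:
        pos = sequence.find(motif, start)  # 0-based index, or -1 if absent
        if pos == -1:
            break
        found.append(pos + 1)  # report 1-based positions
        start = pos + 1        # step one character on to allow overlaps
    return found


print(find_indices("GATATATGCATATACTT", "ATAT"))  # [2, 4, 10]
```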
If performance is not so important, a regular expression can be a good choice.
julia> map(x->x.offset, eachmatch(r"ATAT", "GATATATGCATATACTT", true))
3-element Array{Any,1}:
2
4
10
PS. The third argument of eachmatch means "overlap"; don't forget to set it to true.
If a better performance is required, maybe you should spend some time implementing an algorithm like KMP.
I created the following simple matlab functions to convert a number from an arbitrary base to decimal and back
this is the first one
function decNum = base2decimal(vec, base)
    decNum = vec(1);
    for d = 1:1:length(vec)-1
        decNum = decNum*base + vec(d+1);
    end
and here is the other one
function baseNum = decimal2base(num, base, Vlen)
    ii = 1;
    if num == 0
        baseNum = 0;
    end
    while num ~= 0
        baseNum(ii) = mod(num, base);
        num = floor(num./base);
        ii = ii+1;
    end
    baseNum = fliplr(baseNum);
    if Vlen>(length(baseNum))
        baseNum = [zeros(1,(Vlen)-(length(baseNum))) baseNum ];
    end
Because there are limits to how big a number can be, these functions can't successfully convert very big vectors, but while testing them I noticed the following bug.
Let's use the following testing function
num = 201;
pCount = 7
x=base2decimal(repmat(num-1, 1, pCount), num)
repmat(num-1, 1, pCount)
y=decimal2base(x, num, 1)
isequal(repmat(num-1, 1, pCount),y)
A supposed vector with seven (7) digits in base201 works fine, but the same vector with base200 does not return the expected result even though it is smaller and theoretically should be converted successfully.
(One preliminary comment: calling base2decimal won't result in a decimal number but rather in a number :-D)
This is due to the limited precision of floating-point numbers (in our case, double). To test it, just type at the MATLAB Command Window:
>> 200^7 - 1 == 200^7
ans =
1
>> mod(200^7 - 1, 200)
ans =
0
which means that the value of your number in base 200 (which is precisely 200^7 - 1) cannot be stored in a double: it is stored as exactly 200^7, so the value actually represented is 200^7.
On the other hand:
>> 201^7 - 1 == 201^7
ans =
1
so still the two numbers are represented the same, but
>> mod(201^7 - 1, 201)
ans =
200
which means that the two values share the "true" representation of 201^7 - 1, which, by accident, is the value that you expected.
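The same experiment can be reproduced outside MATLAB. The sketch below uses Python, whose float type is an IEEE double and whose integers are exact, so the stored double can be compared against the true value; the variable names are only for illustration.

```python
exact_200 = 200**7 - 1   # exact integer value of the base-200 vector
exact_201 = 201**7 - 1   # exact integer value of the base-201 vector

# 200^7 - 1 is odd and above 2^53, so it has no exact double.
stored_200 = int(float(exact_200))
print(stored_200 == 200**7)      # True: it was rounded up to 200^7
print(stored_200 % 200)          # 0

# 201^7 - 1 is even and happens to be exactly representable.
stored_201 = int(float(exact_201))
print(stored_201 == exact_201)   # True: stored without error
print(stored_201 % 201)          # 200

# 201^7 itself is odd, so it is the one that gets rounded here.
print(float(201**7) == float(exact_201))  # True
```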
TL;DR
When stored in a double, 200^7 - 1 is inaccurately represented as 200^7, while 201^7 - 1 is accurately represented.
"Bigger numbers are less accurately represented than smaller numbers" is a misconception: if it was true, there would be no big numbers that could be exactly represented.
Judging from your own observations:
The code works fine in most cases
The code can give small errors for large numbers
The suspect is apparent:
Rounding issues seem to give you headaches here. This is also illustrated by #RTL in the comments.
The first question should now be:
1. Do you need perfect accuracy for such large numbers? Or is it ok if it is off by a relatively small amount sometimes?
If you do need perfect accuracy, I would recommend trying a different storage format.
The simple solution would be to use big integers:
uint64
The alternative would be to make your own storage format. This is required if you need even bigger numbers. I think you can cover a huge range with a cell array and some tricks, but of course it is going to be hard to combine those numbers afterwards without losing the accuracy that you worked so hard for.
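To illustrate what an exact storage format buys you, here is a sketch of the two conversions in Python rather than MATLAB, relying on Python's built-in arbitrary-precision integers. The function names base_to_int and int_to_base are hypothetical stand-ins for base2decimal and decimal2base, and min_len plays the role of Vlen.

```python
def base_to_int(digits, base):
    """Combine digits (most significant first) into one exact integer."""
    value = 0
    for digit in digits:
        value = value * base + digit
    return value


def int_to_base(num, base, min_len=1):
    """Split a non-negative integer into digits, zero-padded to min_len."""
    digits = []
    while num:
        num, remainder = divmod(num, base)  # exact integer division
        digits.append(remainder)
    digits.reverse()
    return [0] * (min_len - len(digits)) + digits


digits = [199] * 7  # the base-200 vector that failed with doubles
round_trip = int_to_base(base_to_int(digits, 200), 200)
print(round_trip == digits)  # True
```

Because no floating-point value is involved, the round trip stays exact no matter how many digits the vector has.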
So, I am working on a program in Scilab which solves a binary puzzle. I have come across a problem, however. Can anyone explain to me the logic behind solving a binary sequence with gaps (like [1 0 -1 0 -1 1 -1], where -1 means an empty cell)? I want all possible solutions of a given sequence. So far I have:
function P = mogelijkeCombos(V)
    for i=1:size(V,1)
        if(V(i) == -1)
            aantalleeg = aantalleeg +1
        end
    end
    for i=1:2^aantalleeg
        //creating combos here
    end
endfunction
sorry that some words are in dutch
aantalleeg means amountempty by which I mean the amount of empty cells
I hope I gave you guys enough info. I don't need any code written, I'd just like ideas of how I can make every possible rendition as I am completely stuck atm.
BTW this is a school assignment, but the assignment is way bigger than this and it's just a tiny part I need some ideas on
ty in advance
Short answer
You could create the combos by extending your code and create all possible binary words of the length "amountempty" and replacing them bit-for-bit in the empty cells of V.
Step-by-step description
Find all the empty cell positions
Count the number of positions you've found (which equals the number of empty cells)
Create all possible binary numbers with the length of your count
For each binary number you generate, place the bits in the empty cells
print out / store the possible sequence with the filled in bits
Example
Find all the empty cell positions
You could for example check from left-to-right starting at 1 and if a cell is empty add the position to your position list.
V = [ 1  0 -1  0 -1  1 -1]
            ^     ^     ^
            |     |     |
      1  2  3  4  5  6  7
// result
positions = [3 5 7]
Count the number of positions you've found
//result
amountempty = 3;
Create all possible binary numbers with the length amountempty
You could create all possible numbers or words with the dec2bin function in SciLab. The number of possible words is easy to determine because you know how many separate values can be represented by a word that is amountempty bits long.
// Create the binary word of amountEmpty bits long
binaryWord = dec2bin( i, amountEmpty );
The binaryWord generated will be a string, so you will have to split it into separate bits and convert them to numbers.
For each binaryWord you generate
Now create a possible solution by starting with the original V and fill in every empty cell at the position from your position list with a bit from binaryWordPerBit
possibleSequence = V;
for j=1:amountEmpty
    possibleSequence( positions(j) ) = binaryWordPerBit(j);
end
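The steps above can be sketched end to end as follows. This is in Python rather than Scilab, and it swaps the dec2bin step for itertools.product, which generates the same set of bit combinations directly as numbers; the function name fill_gaps is hypothetical.

```python
from itertools import product


def fill_gaps(sequence, empty=-1):
    """Return every sequence obtained by filling the empty cells with 0 or 1."""
    # Steps 1 and 2: find the empty positions (their count is len(positions)).
    positions = [i for i, value in enumerate(sequence) if value == empty]
    solutions = []
    # Step 3: every binary word with one bit per empty cell.
    for bits in product((0, 1), repeat=len(positions)):
        candidate = list(sequence)  # start from a copy of the original
        # Step 4: place each bit in its empty cell.
        for pos, bit in zip(positions, bits):
            candidate[pos] = bit
        # Step 5: store the filled-in sequence.
        solutions.append(candidate)
    return solutions


combos = fill_gaps([1, 0, -1, 0, -1, 1, -1])
print(len(combos))   # 8
print(combos[0])     # [1, 0, 0, 0, 0, 1, 0]
print(combos[-1])    # [1, 0, 1, 0, 1, 1, 1]
```

With three empty cells there are 2^3 = 8 candidates; a real puzzle solver would then discard the ones that break the puzzle's rules.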
I wish you "veel succes met je opdracht"