Counting frequency of amino acids at each position in multiple-sequence alignments - r

I'm wondering if anyone knows any tools which allow me to count the frequency of amino acids at any specific position in a multiple-sequence alignment.
For example if I had three sequences:
Species 1 - MMRSA
Species 2 - MMLSA
Species 3 - MMRTA
I'd like a way to query by position and get the following output:
Position 1 - M = 3;
Position 2 - M = 3;
Position 3 - R = 2, L = 1;
Position 4 - S = 2, T = 1;
Position 5 - A = 3.
Thanks! I'm familiar with R and Linux, but if there's any other software that can do this I'm sure I can learn.

Using R:
x <- read.table(text = "Species 1 - MMRSA
Species 2 - MMLSA
Species 3 - MMRTA")
ixCol = 1
table(sapply(strsplit(x$V4, ""), "[", ixCol))
# M
# 3
ixCol = 4
table(sapply(strsplit(x$V4, ""), "[", ixCol))
# S T
# 2 1
Depending on the input file format, there are likely purpose-built Bioconductor packages/functions for this.

That is really easy to parse, you can use any language of choice.
Here is an example in Python using a dict and Counter to assemble the data in a simple object.
from collections import defaultdict, Counter
msa = '''
Species 1 - MMRSA
Species 2 - MMLSA
Species 3 - MMRTA
'''
r = defaultdict(list)  # maps each column index to the list of residues found at that index
for line in msa.split('\n'):
    line = line.strip()
    if line:
        sequence = line.split(' ')[-1]
        for i, aa in enumerate(sequence):
            r[i].append(aa)
count = {k: Counter(v) for k, v in r.items()}
print(count)
#{0: Counter({'M': 3}), 1: Counter({'M': 3}), 2: Counter({'R': 2, 'L': 1}), 3: Counter({'S': 2, 'T': 1}), 4: Counter({'A': 3})}
To print the output as you specified:
for k, v in count.items():
    print(f'Position {k+1} :', end=' ')  # add 1 to start counting from 1 instead of 0
    for aa, c in v.items():
        print(f'{aa} = {c};', end=' ')
    print()
It prints:
Position 1 : M = 3;
Position 2 : M = 3;
Position 3 : R = 2; L = 1;
Position 4 : S = 2; T = 1;
Position 5 : A = 3;
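A more compact variant of the same idea, as a minimal sketch: it assumes the sequences have already been extracted from the alignment file and all have the same length. The helper name column_counts is made up for this illustration. zip(*sequences) transposes the rows into columns, so each column can be handed straight to Counter.

```python
from collections import Counter


def column_counts(sequences):
    """Return one Counter per alignment column (index 0 is position 1)."""
    # zip(*sequences) yields the residues column by column; it assumes
    # every sequence has the same length, as in a finished alignment.
    return [Counter(column) for column in zip(*sequences)]


seqs = ["MMRSA", "MMLSA", "MMRTA"]
counts = column_counts(seqs)

# Query a single position (1-based, as in the question).
position = 3
print(dict(counts[position - 1]))  # {'R': 2, 'L': 1}
```

Looking up a position is then a plain list index, which matches the "search by position" use case in the question.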


Dijkstra's algorithm with adjacency matrix

I'm trying to implement the following code from here but it won't work correctly.
What I want is the shortest path distances from a source to all nodes and also the predecessors. Also, I want the input of the graph to be an adjacency matrix which contains all of the edge weights.
I'm trying to make it work in just one function so I have to rewrite it. If I'm right the original code calls other functions (from graph.jl for example).
I don't quite understand how to rewrite the for loop which calls the adj() function.
Also, I'm not sure if the input is correct in the way the code is for now.
function dijkstra(graph, source)
    node_size = size(graph, 1)
    dist = ones(Float64, node_size) * Inf
    dist[source] = 0.0
    Q = Set{Int64}() # visited nodes
    T = Set{Int64}(1:node_size) # unvisited nodes
    pred = ones(Int64, node_size) * -1
    while condition(T)
        # node selection
        untraversed_nodes = [(d, k) for (k, d) in enumerate(dist) if k in T]
        if minimum(untraversed_nodes)[1] == Inf
            break # Break if remaining nodes are disconnected
        end
        node_ind = untraversed_nodes[argmin(untraversed_nodes)][2]
        push!(Q, node_ind)
        delete!(T, node_ind)
        # distance update
        curr_node = graph.nodes[node_ind]
        for (neigh, edge) in adj(graph, curr_node)
            t_ind = neigh.index
            weight = edge.cost
            if dist[t_ind] > dist[node_ind] + weight
                dist[t_ind] = dist[node_ind] + weight
                pred[t_ind] = node_ind
            end
        end
    end
    return dist, pred
end
So if I'm trying it with the following matrix
A = [0 2 1 4 5 1; 1 0 4 2 3 4; 2 1 0 1 2 4; 3 5 2 0 3 3; 2 4 3 4 0 1; 3 4 7 3 1 0]
and source 2, I would like to get the distances in a vector dist and the predecessors in another vector pred.
Right now I'm getting
ERROR: type Array has no field nodes
Stacktrace: [1] getproperty(::Any, ::Symbol) at .\sysimg.jl:18
I guess I have to rewrite it a bit more.
I'm thankful for any help.
Assuming that graph[i,j] is the length of the edge from i to j (judging by your data, the graph is directed), and that graph is a Matrix with non-negative entries where 0 indicates no edge from i to j, a minimal rewrite of your code should be something like:
function dijkstra(graph, source)
    @assert size(graph, 1) == size(graph, 2) # adjacency matrix must be square
    node_size = size(graph, 1)
    dist = fill(Inf, node_size)
    dist[source] = 0.0
    T = Set{Int}(1:node_size) # unvisited nodes
    pred = fill(-1, node_size)
    while !isempty(T)
        # pick the unvisited node with the smallest tentative distance
        min_val, min_idx = minimum((dist[v], v) for v in T)
        if isinf(min_val)
            break # Break if remaining nodes are disconnected
        end
        delete!(T, min_idx)
        # distance update
        for nei in 1:node_size
            if graph[min_idx, nei] > 0 && nei in T
                possible_dist = dist[min_idx] + graph[min_idx, nei]
                if possible_dist < dist[nei]
                    dist[nei] = possible_dist
                    pred[nei] = min_idx
                end
            end
        end
    end
    return dist, pred
end
(I have not tested it extensively, so please report if you find any bugs)
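For comparison, here is a minimal Python sketch of the same algorithm under the same assumptions (square matrix, non-negative weights, 0 meaning "no edge"). It is not a translation of the original library code. The function name dijkstra_matrix is made up for this illustration, indices are 0-based (so source 2 from the question becomes index 1), and a binary heap replaces the linear scan for the closest unvisited node.

```python
import heapq
import math


def dijkstra_matrix(graph, source):
    """Shortest distances and predecessors from source (0-based indices).

    graph is a square list of lists; graph[u][v] > 0 is the weight of the
    edge u -> v and 0 means there is no edge (an assumption of this sketch).
    """
    n = len(graph)
    dist = [math.inf] * n
    pred = [-1] * n          # -1 marks "no predecessor"
    done = [False] * n       # True once a node's distance is final
    dist[source] = 0.0
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if done[u]:
            continue         # stale heap entry, node already finalized
        done[u] = True
        for v in range(n):
            w = graph[u][v]
            if w > 0 and not done[v] and d + w < dist[v]:
                dist[v] = d + w
                pred[v] = u
                heapq.heappush(heap, (dist[v], v))
    return dist, pred


A = [
    [0, 2, 1, 4, 5, 1],
    [1, 0, 4, 2, 3, 4],
    [2, 1, 0, 1, 2, 4],
    [3, 5, 2, 0, 3, 3],
    [2, 4, 3, 4, 0, 1],
    [3, 4, 7, 3, 1, 0],
]
dist, pred = dijkstra_matrix(A, 1)  # node 2 of the question, 0-based
print(dist)
print(pred)
```

Unreachable nodes keep a distance of inf and a predecessor of -1, mirroring the early break in the Julia version.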

Expressing Natural Number by sum of Triangular numbers

Triangular numbers count the objects that can be arranged in the shape of a triangle.
For example, 1, 3, 6, 10, 15, ... are triangular numbers.
o
o o
o o o
o o o o
is the shape of the n=4 triangular number (10).
Given a natural number N, I have to print every way of expressing N as a sum of triangular numbers, where the order of the terms matters.
if N = 4
output should be
1 1 1 1
1 3
3 1
else if N = 6
output should be
1 1 1 1 1 1
1 1 1 3
1 1 3 1
1 3 1 1
3 1 1 1
3 3
6
I have searched for a few hours and couldn't find an answer...
Please help.
(I am not sure whether this helps, but I found the following:
if T(k) is the number of such sums for N = k, with T(0) = 1, then
T(k) = T(k-1) + T(k-3) + T(k-6) + .... + T(k-p) while (k-p) >= 0,
where p runs over the triangular numbers.)
Here's Code for k=-1(Read comments below)
#include <iostream>
#include <vector>
using namespace std;
long TriangleNumber(int index);
void PrintTriangles(int index);
vector<long> triangleNumList(450); //(450 power raised by 2 is about 200,000)
vector<long> storage(100001);
int main() {
    int n, p;
    for (int i = 0; i < 450; i++) {
        triangleNumList[i] = i * (i + 1) / 2;
    }
    cin >> n >> p;
    cout << TriangleNumber(n);
    if (p == 1) {
        //PrintTriangles();
    }
    return 0;
}
long TriangleNumber(int index) {
    int iter = 1, out = 0;
    if (index == 1 || index == 0) {
        return 1;
    }
    else {
        if (storage[index] != 0) {
            return storage[index];
        }
        else {
            while (triangleNumList[iter] <= index) {
                storage[index] = ( storage[index] + TriangleNumber(index - triangleNumList[iter]) ) % 1000000;
                iter++;
            }
        }
    }
    return storage[index];
}
void PrintTriangles(int index) {
    // What Algorithm?
}
Here is some recursive Python 3.6 code that prints the sums of triangular numbers that total the inputted target. I prioritized simplicity of code in this version. You may want to add error-checking on the input value, counting the sums, storing the lists rather than just printing them, and wrapping the entire routine into a function. Setting up the list of triangular numbers could also be done in fewer lines of code.
Your code saved time but worsened memory usage by "memoizing" the triangular numbers (storing and reusing them rather than always calculating them when needed). You could do the same to the sum lists, if you like. It is also possible to make this more in the dynamic programming style: find the sum lists for n=1 then for n=2 etc. I'll leave all that to you.
""" Given a positive integer n, print all the ways n can be expressed as
the sum of triangular numbers.
"""
def print_sums_of_triangular_numbers(prefix, target):
    """Print sums totalling to target, each after printing the prefix."""
    if target == 0:
        print(*prefix)
        return
    for tri in triangle_num_list:
        if tri > target:
            return
        print_sums_of_triangular_numbers(prefix + [tri], target - tri)

n = int(input('Value of n ? '))
# Set up list of triangular numbers not greater than n
triangle_num_list = []
index = 1
tri_sum = 1
while tri_sum <= n:
    triangle_num_list.append(tri_sum)
    index += 1
    tri_sum += index
# Print the sums totalling to n
print_sums_of_triangular_numbers([], n)
Here are the printouts of two runs of this code:
Value of n ? 4
1 1 1 1
1 3
3 1
Value of n ? 6
1 1 1 1 1 1
1 1 1 3
1 1 3 1
1 3 1 1
3 1 1 1
3 3
6
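If only the number of sums is needed, the dynamic programming style mentioned above could look like the following minimal sketch. It builds the count for every total from 1 up to n using the recurrence from the question (ways(k) is the sum of ways(k - p) over the triangular numbers p not greater than k, with ways(0) = 1). The function name count_triangular_sums and the optional modulus parameter are made up for this illustration; the modulus mirrors the % 1000000 in the C++ code.

```python
def count_triangular_sums(n, modulus=None):
    """Count the ordered sums of triangular numbers that total n."""
    # Triangular numbers not greater than n: 1, 3, 6, 10, ...
    triangles = []
    k = 1
    while k * (k + 1) // 2 <= n:
        triangles.append(k * (k + 1) // 2)
        k += 1

    ways = [0] * (n + 1)
    ways[0] = 1  # the empty sum is the only way to reach 0
    for total in range(1, n + 1):
        for tri in triangles:
            if tri > total:
                break
            # Every sum for total - tri extends to a sum for total by appending tri.
            ways[total] += ways[total - tri]
        if modulus is not None:
            ways[total] %= modulus
    return ways[n]


print(count_triangular_sums(4))  # 3
print(count_triangular_sums(6))  # 7
```

The two printed counts agree with the number of lines in the printouts above. Because the table is filled bottom-up, this avoids deep recursion for large n.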

Scilab code giving submatrix incorrectly defined error

I am trying to plot a 3D graph between 2 scalars and one matrix for each of its entries. When I run the script, I get a "Submatrix incorrectly defined" error on line 11. The code:
i_max= 3;
u = zeros(4,5);
a1 = 1;
a2 = 1;
a3 = 1;
b1 = 1;
hx = linspace(1D-6,1D6,13);
ht = linspace(1D-6,1D6,13);
for i = 1:i_max
    for j = 2:4
        u(i+1,j)=u(i,j)+(ht*(a1*u(i,j))+b1+(((a2*u(i,j+1))-(2*a2*u(i,j))+(a2*u(i,j-1)))*(hx^-2))+(((a3*u(i,j+1))-(a3*u(i,j-1)))*(0.5*hx^-1)));
        plot(ht,hx,u(i+1,j));
    end
end
Full error message:
-->exec('C:\Users\deba123\Documents\assignments and lecture notes\Seventh Semester\UGP\Scilab\Simulation1_Plot.sce', -1)
+(((a3*u(i,j+1))-(a3*u(i,j-1)))*(0.5*hx^-1)))
!--error 15
Submatrix incorrectly defined.
at line 11 of exec file called by :
emester\UGP\Scilab\Simulation1_Plot.sce', -1
Please help.
For a 3-dimensional figure, you need two argument vectors and a matrix of function values, so I expanded u to a 3-dimensional array.
For every operation in your code, I added a comment with the current dimensions of the term, which makes the calculation easier to follow. For plotting, you have to use the plot3d (single values) or surf (surface) command.
In a 3D plot, you want to map two vectors (hx, ht) of lengths n and m to a scalar z, so you need an (n x m) matrix of results. Is this what you want to do? Currently, you have 13 values for each u(i,j,:) entry, but you need 13 x 13 values for every figure. Maybe the eval3d function can help you.
i_max= 3;
u = zeros(4,5,13);
a1 = 1;
a2 = 1;
a3 = 1;
b1 = 1;
hx = linspace(1D-6,1D6,13); // 1 x 13
ht = linspace(1D-6,1D6,13); // 1 x 13
for i = 1:i_max
    for j = 2:4
        u(i+1,j,:)= u(i,j)...
            + ht*(a1*u(i,j))*b1... // 1 x 13
            +(((a2*u(i,j+1)) -(2*a2*u(i,j)) +(a2*u(i,j-1)))*(hx.^-2))... // 1 x 13
            +(((a3*u(i,j+1))-(a3*u(i,j-1)))*(0.5*hx.^-1)) ... // 1 x 13
            + hx*ones(13,1)*ht; // added to get non-zero values
        z = squeeze( u(i+1,j, : ))'; // 1x13
        // for a 3d-plot: (1x13, 1x13, 13x13)
        figure()
        plot3d(ht,hx, z'* z ,'*' ); //
    end
end

Non Decreasing Number Combinations (Interval)

So my problem is the following:
Given a maximum size X and an interval from A (first number) to B (last number), I have to find the number of distinct non-decreasing combinations (each element is greater than or equal to the previous one) of at most X numbers from that interval.
Example:
Input: "2 9 11"
X = 2 | A = 9 | B = 11
Output: 9
Possible Combinations ->
[9],[9,9],[9,10],[9,11],[10,10],[10,11],[11,11],[10],[11].
Now, if it was the same input but with a different X, like X = 4, this would change a lot...
[9],[9,9],[9,9,9],[9,9,9,9],[9,9,9,10],[9,9,9,11],[9,9,10,10]...
Your problem can be reformulated to simplify to just two parameters
X and N = B - A + 1 to give you sequences starting with 0 instead of A.
If you wanted exactly X numbers in each item, it is simple combination with repetition and the equation for that would be
x_of_n = (N + X - 1)! / ((N - 1)! * X!)
so for your first example it would be
X = 2
N = 11 - 9 + 1 = 3
x_of_n = 4! / (2! * 2!) = (4*3*2) / (2*2) = 6
to this you need to add the same with X = 1 to get x_of_n = 3, so you get the required total 9.
I am not aware of a simple equation for the required output, but when you expand all the equations into one sum, there is a nice recursive sequence where you compute the next term (N,X) from (N,X-1) and sum all the elements:
S[0] = N
S[1] = S[0] * (N + 1) / 2
S[2] = S[1] * (N + 2) / 3
...
S[X-1] = S[X-2] * (N + X - 1) / X
so for the second example you give we have
X = 4, N = 3
S[0] = 3
S[1] = 3 * 4 / 2 = 6
S[2] = 6 * 5 / 3 = 10
S[3] = 10 * 6 / 4 = 15
output = sum(S) = 3 + 6 + 10 + 15 = 34
so you can try the code here:
function count(x, a, b) {
    var i,
        n = b - a + 1,
        s = 1,
        total = 0;
    for (i = 0; i < x; i += 1) {
        s *= (n + i) / (i + 1); // beware rounding!
        total += s;
    }
    return total;
}
console.log(count(2, 9, 11)); // 9
console.log(count(4, 9, 11)); // 34
Update: If you use a language with int types (JS has only double),
you need to use s = s * (n + i) / (i + 1) instead of *= operator to avoid temporary fractional number and subsequent rounding problems.
Update 2: For a more functional version, you can use a recursive definition
function count(x, n) {
    // arguments are kept in (x, n) order in the recursive calls
    return n < 1 || x < 1 ? 0 : 1 + count(x, n - 1) + count(x - 1, n);
}
where n = b - a + 1
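The terms S[0], ..., S[X-1] above are the binomial coefficients C(N + k, k + 1), and by the hockey-stick identity their sum appears to collapse to C(N + X, X) - 1. The following Python sketch checks that closed form against a brute-force enumeration. The function names are made up for this illustration, and math.comb needs Python 3.8 or later.

```python
from itertools import combinations_with_replacement
from math import comb


def count_closed_form(x, a, b):
    """Non-decreasing sequences of length 1..x over a..b via C(n + x, x) - 1."""
    n = b - a + 1
    return comb(n + x, x) - 1


def count_brute_force(x, a, b):
    """Enumerate every non-decreasing sequence explicitly (slow, for checking)."""
    values = range(a, b + 1)
    return sum(
        1
        for length in range(1, x + 1)
        for _ in combinations_with_replacement(values, length)
    )


print(count_closed_form(2, 9, 11))  # 9
print(count_closed_form(4, 9, 11))  # 34
```

Since math.comb works on exact integers, this variant also sidesteps the rounding concern mentioned for the floating-point loop.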

R convert vector of numbers to skipping indexes

I have a vector of widths,
ws = c(1,1,2,1,3,1)
From this vector I'd like to have another vector of this form:
indexes = c(1,2,3,5,6,7,9,11,12)
In order to create such vector I did the following for loop in R:
ws = c(1,1,2,1,3,1)
indexes = rep(0, sum(ws))
counter = 1
counter2 = 1
last = 0
for(i in 1:length(ws))
{
  if (ws[i] == 1)
  {
    indexes[counter] = counter2
    counter = counter + 1
  } else {
    for(j in 1:ws[i])
    {
      indexes[counter] = counter2
      counter = counter + 1
      counter2 = counter2+2
    }
    counter2 = counter2 - 2
  }
  counter2 = counter2+1
}
The logic is as follows: each element of ws specifies the number of corresponding elements in indexes. If an element of ws is 1, it contributes one element to indexes; if it is greater than 1, say 3, it contributes 3 elements that skip every other value, for example 3, 5, 7.
However, I'd like to avoid for loops since they tend to be very slow in R. Do you have any suggestions on how to achieve such results only with vector operations? or some more crantastic solution?
Thanks!
Here's a vectorized one-liner for you:
ws <- c(1,1,2,1,3,1)
cumsum((unlist(sapply(ws, seq_len)) > 1) + 1)
# [1] 1 2 3 5 6 7 9 11 12
You can pick it apart piece by piece, working from the inside out, to see how it works.
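To make the pieces explicit, here is the same idea spelled out step by step as a Python sketch (Python rather than R only so that each intermediate value is easy to name; skipping_indexes is a made-up helper name). Within each run of length w, the first element advances the index by 1 and every later element advances it by 2; a cumulative sum of those steps gives the result.

```python
from itertools import accumulate


def skipping_indexes(widths):
    """Build the index vector described in the question from a list of widths."""
    steps = []
    for w in widths:
        # Position 0 within a run steps by 1, later positions step by 2.
        # This plays the role of (seq_len(w) > 1) + 1 in the R one-liner.
        steps.extend(1 if k == 0 else 2 for k in range(w))
    # The running total of the steps is the cumsum(...) part.
    return list(accumulate(steps))


ws = [1, 1, 2, 1, 3, 1]
print(skipping_indexes(ws))  # [1, 2, 3, 5, 6, 7, 9, 11, 12]
```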
