Standard Deviation for SQLite - sqlite

I've searched the SQLite docs and couldn't find anything, but I've also searched on Google and a few results appeared.
Does SQLite have any built-in Standard Deviation function?

You can calculate the variance in SQL:
create table t (row int);
insert into t values (1),(2),(3);
SELECT AVG((t.row - sub.a) * (t.row - sub.a)) as var from t,
(SELECT AVG(row) AS a FROM t) AS sub;
However, you still have to calculate the square root to get the standard deviation.
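As a minimal sketch of that last step, assuming Python's built-in sqlite3 module as the host language: the variance is computed in SQL and the square root is taken afterwards. The column name x is a substitution made here for illustration; the query is otherwise the one above.

```python
import math
import sqlite3

con = sqlite3.connect(":memory:")
# Column renamed to "x" for this sketch (an assumption, not from the answer).
con.execute("create table t (x int)")
con.executemany("insert into t values (?)", [(1,), (2,), (3,)])

# Population variance computed entirely in SQL.
(variance,) = con.execute(
    "SELECT AVG((t.x - sub.a) * (t.x - sub.a)) AS var "
    "FROM t, (SELECT AVG(x) AS a FROM t) AS sub"
).fetchone()

# The square root is taken outside the database.
stdev = math.sqrt(variance)
print(variance, stdev)
con.close()
```

Because AVG divides by the number of rows, this yields the population (not the sample) standard deviation.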

The aggregate functions supported by SQLite are here:
STDEV is not in the list.
However, the module extension-functions.c in this page contains a STDEV function.

There is still no built-in stdev function in sqlite. However, you can define (as Alix has done) a user-defined aggregator function. Here is a complete example in Python:
import sqlite3
import math

class StdevFunc:
    def __init__(self):
        self.M = 0.0
        self.S = 0.0
        self.k = 1

    def step(self, value):
        if value is None:
            return  # skip NULLs, like the built-in aggregates do
        tM = self.M
        self.M += (value - tM) / self.k
        self.S += (value - tM) * (value - self.M)
        self.k += 1

    def finalize(self):
        if self.k < 3:
            return None  # fewer than two values: sample stdev is undefined
        return math.sqrt(self.S / (self.k-2))

with sqlite3.connect(':memory:') as con:
    con.create_aggregate("stdev", 1, StdevFunc)
    cur = con.cursor()
    cur.execute("create table test(i)")
    cur.executemany("insert into test(i) values (?)", [(1,), (2,), (3,), (4,), (5,)])
    cur.execute("insert into test(i) values (null)")
    cur.execute("select avg(i) from test")
    print("avg: %f" % cur.fetchone()[0])
    cur.execute("select stdev(i) from test")
    print("stdev: %f" % cur.fetchone()[0])
This will print:
avg: 3.000000
stdev: 1.581139
Compare with MySQL:!2/ad42f3/3/0

Use the variance formula V(X) = E(X^2) - E(X)^2, which gives the population variance. In SQLite:
SELECT AVG(col*col) - AVG(col)*AVG(col) FROM table
To get the standard deviation, take the square root: V(X)^(1/2).
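One way to take that square root inside the query is to register a scalar function from the host language. The following is a minimal sketch assuming Python's sqlite3 module; the table name samples and the function name sqrt are chosen here for illustration. Note that this formula can lose precision when the mean is large compared with the spread of the values.

```python
import math
import sqlite3

con = sqlite3.connect(":memory:")
# Register a scalar sqrt() for use in SQL (name chosen for this sketch).
con.create_function("sqrt", 1, math.sqrt)

con.execute("create table samples (col real)")
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
con.executemany("insert into samples values (?)", [(v,) for v in data])

# Population standard deviation: sqrt(E(X^2) - E(X)^2).
(pop_stdev,) = con.execute(
    "SELECT sqrt(AVG(col*col) - AVG(col)*AVG(col)) FROM samples"
).fetchone()
print(pop_stdev)
con.close()
```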

I implemented Welford's method (the same as extension-functions.c) as a SQLite UDF:
$db->sqliteCreateAggregate(
    'stdev',
    function (&$context, $row, $data) { // step callback
        if (isset($context) !== true) { // $context is null at first
            $context = array(
                'k' => 0,
                'm' => 0,
                's' => 0,
            );
        }
        if (isset($data) === true) { // the standard is non-NULL values only
            $context['s'] += ($data - $context['m']) * ($data - ($context['m'] += ($data - $context['m']) / ++$context['k']));
        }
        return $context;
    },
    function (&$context, $row) { // fini callback
        if ($context['k'] > 0) { // return NULL if no non-NULL values exist
            return sqrt($context['s'] / $context['k']);
        }
        return null;
    },
    1
);
That's in PHP ($db is the PDO object), but it should be trivial to port to another language.
SQLite is soooo cool. <3

A little trick for the sample variance:
select (count(*)*sum(value * value) - sum(value)*sum(value)) / ((count(*)-1)*(count(*)))
from the_table ;
Then the only thing left is to calculate the square root outside the query.

No, I searched this same issue, and ended having to do the calculations with my application (PHP)

Added some error detection to the Python functions:
class StdevFunc:
    """
    For use as an aggregate function in SQLite
    """
    def __init__(self):
        self.M = 0.0
        self.S = 0.0
        self.k = 0

    def step(self, value):
        try:
            # automatically convert text to float, like the rest of SQLite
            val = float(value) # if fails, skips this iteration, which also ignores nulls
            tM = self.M
            self.k += 1
            self.M += ((val - tM) / self.k)
            self.S += ((val - tM) * (val - self.M))
        except (TypeError, ValueError):
            pass

    def finalize(self):
        if self.k <= 1: # avoid division by zero
            return None
        return math.sqrt(self.S / (self.k-1))

You don't state which version of standard deviation you wish to calculate but variances (standard deviation squared) for either version can be calculated using a combination of the sum() and count() aggregate functions.
select
(count(val)*sum(val*val) - (sum(val)*sum(val)))/((count(val)-1)*(count(val))) as sample_variance,
(count(val)*sum(val*val) - (sum(val)*sum(val)))/((count(val))*(count(val))) as population_variance
from ... ;
It will still be necessary to take the square root of these to obtain the standard deviation.
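To check the two expressions, here is a small sketch that runs them through Python's sqlite3 module and compares the results with the standard library's statistics module. The table name data is an assumption made for the example, and the values are stored as real so that SQLite does not perform integer division.

```python
import sqlite3
import statistics

values = [1.0, 2.0, 3.0, 4.0, 5.0]

con = sqlite3.connect(":memory:")
con.execute("create table data (val real)")  # hypothetical table for the sketch
con.executemany("insert into data values (?)", [(v,) for v in values])

sample_var, population_var = con.execute(
    "select "
    "(count(val)*sum(val*val) - sum(val)*sum(val)) / ((count(val)-1)*count(val)), "
    "(count(val)*sum(val*val) - sum(val)*sum(val)) / (count(val)*count(val)) "
    "from data"
).fetchone()
con.close()

# Reference values from the standard library.
print(sample_var, statistics.variance(values))
print(population_var, statistics.pvariance(values))
```

The sample version divides by n(n-1) and the population version by n squared; that is the only difference between the two.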

# -*- coding: utf-8 -*-
#Values produced by this script can be verified by following the steps
#found at to Verify
#by choosing a non memory based database.
import sqlite3
import math
import random
import os
import sys
import traceback

class StdevFunc:
    def __init__(self):
        self.M = 0.0 #Mean
        self.V = 0.0 #Used to Calculate Variance
        self.S = 0.0 #Standard Deviation
        self.k = 1 #Population or Small

    def step(self, value):
        try:
            if value is None:
                return None
            tM = self.M
            self.M += (value - tM) / self.k
            self.V += (value - tM) * (value - self.M)
            self.k += 1
        except Exception as EXStep:
            return None

    def finalize(self):
        try:
            if ((self.k - 1) < 3):
                return None
            #Now with our range Calculated, and Multiplied finish the Variance Calculation
            self.V = (self.V / (self.k-2))
            #Standard Deviation is the Square Root of Variance
            self.S = math.sqrt(self.V)
            return self.S
        except Exception as EXFinal:
            return None
def Histogram(Population):
    try:
        BinCount = 6
        More = 0
        #a = 1 #For testing Trapping
        #b = 0 #and Trace Back
        #c = (a / b) #with Detailed Info
        #If you want to store the Database
        #uncDatabase = os.path.join(os.getcwd(),"BellCurve.db3")
        #con = sqlite3.connect(uncDatabase)
        #If you want the database in Memory
        con = sqlite3.connect(':memory:')
        #row_factory allows accessing fields by Row and Col Name
        con.row_factory = sqlite3.Row
        #Add our Non Persistent, Runtime Standard Deviation Function to the Database
        con.create_aggregate("Stdev", 1, StdevFunc)
        #Lets Grab a Cursor
        cur = con.cursor()
        #Lets Initialize some tables, so each run will be clear of previous run
        #(the create table statements are missing from the source; the column types below are assumed)
        cur.executescript('drop table if exists MyData;') #executescript requires ; at the end of the string
        cur.execute("create table if not exists MyData (Val float)")
        cur.executescript('drop table if exists Bins;') #executescript requires ; at the end of the string
        cur.execute("create table if not exists Bins (Bin integer, Val float, Frequency integer)")
        #Lets generate some random data, and insert in to the Database
        for n in range(0,(Population)):
            sql = "insert into MyData(Val) values ({0})".format(random.uniform(-1,1))
            #If Whole Number Integer greater that value of 2, Range Greater that 1.5
            #sql = "insert into MyData(Val) values ({0})".format(random.randint(-1,1))
            cur.execute(sql)
        #Now let's calculate some built in Aggregates, that SQLite comes with
        cur.execute("select Avg(Val) from MyData")
        Average = cur.fetchone()[0]
        cur.execute("select Max(Val) from MyData")
        Max = cur.fetchone()[0]
        cur.execute("select Min(Val) from MyData")
        Min = cur.fetchone()[0]
        cur.execute("select Count(Val) from MyData")
        Records = cur.fetchone()[0]
        #Now let's get Standard Deviation using our function that we added
        cur.execute("select Stdev(Val) from MyData")
        Stdev = cur.fetchone()[0]
        #And Calculate Range
        Range = float(abs(float(Max)-float(Min)))
        if (Stdev == None):
            print("================================ Data Error ===============================")
            print(" Insufficient Population Size, Or Bad Data.")
        elif (abs(Max-Min) == 0):
            print("================================ Data Error ===============================")
            print(" The entire Population Contains Identical values, Distribution Incalculable.")
        else:
            Bin = [] #Holds the Bin Values
            Frequency = [] #Holds the Bin Frequency for each Bin
            #Establish the 1st Bin, which is based on (Standard Deviation * 3) being subtracted from the Mean
            Bin.append(float((Average - ((3 * Stdev)))))
            Frequency.append(0)
            #Establish the remaining Bins, which is basically adding 1 Standard Deviation
            #for each iteration, -3, -2, -1, 1, 2, 3
            for b in range(0,(BinCount) + 1):
                Bin.append((float(Bin[(b)]) + Stdev))
                Frequency.append(0)
            for b in range(0,(BinCount) + 1):
                #Lets exploit the Database and have it do the hard work calculating distribution
                #of all the Bins, with SQL's between operator, but making it left inclusive, right exclusive.
                sqlBinFreq = "select count(*) as Frequency from MyData where val between {0} and {1} and Val < {2}". \
                    format(float((Bin[b])), float(Bin[(b + 1)]), float(Bin[(b + 1)]))
                #If the Database Reports Values that fall between the Current Bin, Store the Frequency to a Bins Table.
                for rowBinFreq in cur.execute(sqlBinFreq):
                    Frequency[(b + 1)] = rowBinFreq['Frequency']
                    sqlBinFreqInsert = "insert into Bins (Bin, Val, Frequency) values ({0}, {1}, {2})". \
                        format(b, float(Bin[b]), Frequency[(b)])
                    cur.execute(sqlBinFreqInsert)
                #Although this Demo is not likely to produce values that
                #fall outside of Standard Distribution
                #if this demo was to Calculate with real data, we want to know
                #how many non-Standard data points we have.
                More = (More + Frequency[b])
            More = abs((Records - More))
            #Add the More value
            sqlBinFreqInsert = "insert into Bins (Bin, Val, Frequency) values ({0}, {1}, {2})". \
                format((BinCount + 1), float(0), More)
            cur.execute(sqlBinFreqInsert)
            #Now Report the Analysis
            print("================================ The Population ==============================")
            print(" {0} {1} {2} {3} {4} {5}". \
                format("Size".rjust(10, ' '), \
                    "Max".rjust(10, ' '), \
                    "Min".rjust(10, ' '), \
                    "Mean".rjust(10, ' '), \
                    "Range".rjust(10, ' '), \
                    "Stdev".rjust(10, ' ')))
            print("Aggregates: {0:10d} {1:10.4f} {2:10.4f} {3:10.4f} {4:10.4f} {5:10.4f}". \
                format(Population, Max, Min, Average, Range, Stdev))
            print("================================= The Bell Curve =============================")
            LabelString = "{0} {1} {2} {3}". \
                format("Bin".ljust(8, ' '), \
                    "Ranges".rjust(8, ' '), \
                    "Frequency".rjust(8, ' '), \
                    "Histogram".rjust(6, ' '))
            print(LabelString)
            #Let's Paint a Histogram
            sqlChart = "select * from Bins order by Bin asc"
            for rowChart in cur.execute(sqlChart):
                if (rowChart['Bin'] == 7):
                    #Bin 7 is not really a bin, but where we place the values that did not fit into the
                    #Normal Distribution. This script was tested against Excel's Bell Curve Example
                    #and produces the same results. Feel free to test it.
                    BinName = "More"
                    ChartString = "{0:<6} {1:<10} {2:10.0f}". \
                        format(BinName, \
                            "", \
                            More)
                else:
                    #These are the actual bins where values fall within the distribution.
                    BinName = (rowChart['Bin'] + 1)
                    #Scale the Chart
                    fPercent = ((float(rowChart['Frequency']) / float(Records) * 100))
                    iPrecent = int(math.ceil(fPercent))
                    ChartString = "{0:<6} {1:10.4f} {2:10.0f} {3}". \
                        format(BinName, \
                            rowChart['Val'], \
                            rowChart['Frequency'], \
                            "".rjust(iPrecent, '#'))
                print(ChartString)
        #Commit to Database
        con.commit()
        #Clean Up
        cur.close()
        con.close()
    except Exception as EXBellCurve:
        TraceInfo = traceback.format_exc()
        raise Exception(TraceInfo)


TypeError: Caught TypeError in DataLoader worker process 0. TypeError: 'KeyError' object is not iterable

from torchvision_starter.engine import train_one_epoch, evaluate
from torchvision_starter import utils
import multiprocessing
import time
n_cpu = multiprocessing.cpu_count()
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
_ =
params = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.Adam(model.parameters(), lr=0.00001)
lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer,
# Let's train for 10 epochs
num_epochs = 1
start = time.time()
for epoch in range(10, 10 + num_epochs):
    # train for one epoch, printing every 10 iterations
    train_one_epoch(model, optimizer, data_loaders['train'], device, epoch, print_freq=10)
    # update the learning rate
    # evaluate on the validation dataset
    evaluate(model, data_loaders['valid'], device=device)
stop = time.time()
print(f"\n\n{num_epochs} epochs in {stop - start} s ({(stop-start) / 3600:.2f} hrs)")
Before I move on to this part, everything is OK. But after I run the part, the error is like below:
I have tried to add drop_last to the's function like:
data_loaders["train"] =
But it doesn't work. By the way, the torch and torchvision versions are compatible and CUDA is available.
How can I fix it?
The get_data_loaders function:
def get_data_loaders(
    folder, batch_size: int = 2, valid_size: float = 0.2, num_workers: int = -1, limit: int = -1, thinning: int = None
):
    """
    Create and returns the train_one_epoch, validation and test data loaders.
    :param folder: folder containing the dataset
    :param batch_size: size of the mini-batches
    :param valid_size: fraction of the dataset to use for validation. For example 0.2
                       means that 20% of the dataset will be used for validation
    :param num_workers: number of workers to use in the data loaders. Use -1 to mean
                        "use all my cores"
    :param limit: maximum number of data points to consider
    :param thinning: take every n-th frame, instead of all frames
    :return a dictionary with 3 keys: 'train_one_epoch', 'valid' and 'test' containing respectively the
            train_one_epoch, validation and test data loaders
    """
    if num_workers == -1:
        # Use all cores
        num_workers = multiprocessing.cpu_count()
    # We will fill this up later
    data_loaders = {"train": None, "valid": None, "test": None}
    # create 3 sets of data transforms: one for the training dataset,
    # containing data augmentation, one for the validation dataset
    # (without data augmentation) and one for the test set (again
    # without augmentation)
    data_transforms = {
        "train": get_transform(UdacitySelfDrivingDataset.mean, UdacitySelfDrivingDataset.std, train=True),
        "valid": get_transform(UdacitySelfDrivingDataset.mean, UdacitySelfDrivingDataset.std, train=False),
        "test": get_transform(UdacitySelfDrivingDataset.mean, UdacitySelfDrivingDataset.std, train=False),
    }
    # Create train and validation datasets
    train_data = UdacitySelfDrivingDataset(
    # The validation dataset is a split from the train_one_epoch dataset, so we read
    # from the same folder, but we apply the transforms for validation
    valid_data = UdacitySelfDrivingDataset(
    # obtain training indices that will be used for validation
    n_tot = len(train_data)
    indices = torch.randperm(n_tot)
    # If requested, limit the number of data points to consider
    if limit > 0:
        indices = indices[:limit]
        n_tot = limit
    split = int(math.ceil(valid_size * n_tot))
    train_idx, valid_idx = indices[split:], indices[:split]
    # define samplers for obtaining training and validation batches
    train_sampler =
    valid_sampler = # =
    # prepare data loaders
    data_loaders["train"] =
    data_loaders["valid"] =
        valid_data, # -
        batch_size=batch_size, # -
        sampler=valid_sampler, # -
        num_workers=num_workers, # -
    # Now create the test data loader
    test_data = UdacitySelfDrivingDataset(
    if limit > 0:
        indices = torch.arange(limit)
        test_sampler =
    else:
        test_sampler = None
    data_loaders["test"] =
    # -
    return data_loaders
class UdacitySelfDrivingDataset(
    # Mean and std of the dataset to be used in nn.Normalize
    mean = torch.tensor([0.3680, 0.3788, 0.3892])
    std = torch.tensor([0.2902, 0.3069, 0.3242])

    def __init__(self, root, transform, train=True, thinning=None):
        self.root = os.path.abspath(os.path.expandvars(os.path.expanduser(root)))
        self.transform = transform
        # load datasets
        if train:
            self.df = pd.read_csv(os.path.join(self.root, "labels_train.csv"))
        else:
            self.df = pd.read_csv(os.path.join(self.root, "labels_test.csv"))
        # Index by file id (i.e., a sequence of the same length as the number of images)
        codes, uniques = pd.factorize(self.df['frame'])
        if thinning:
            # Take every n-th rows. This makes sense because the images are
            # frames of videos from the car, so we are essentially reducing
            # the frame rate
            thinned = uniques[::thinning]
            idx = self.df['frame'].isin(thinned)
            print(f"Keeping {thinned.shape[0]} of {uniques.shape[0]} images")
            print(f"Keeping {idx.sum()} objects out of {self.df.shape[0]}")
            self.df = self.df[idx].reset_index(drop=True)
            # Recompute codes
            codes, uniques = pd.factorize(self.df['frame'])
        self.n_images = len(uniques)
        self.df['image_id'] = codes
        self.df.set_index("image_id", inplace=True)
        self.classes = ['car', 'truck', 'pedestrian', 'bicyclist', 'light']
        self.colors = ['cyan', 'blue', 'red', 'purple', 'orange']

    def n_classes(self):
        return len(self.classes)

    def __getitem__(self, idx):
        if idx in self.df.index:
            row = self.df.loc[[idx]]
        else:
            return KeyError(f"Element {idx} not in dataframe")
        # load images from file
        img_path = os.path.join(self.root, "images", row['frame'].iloc[0])
        img ="RGB")
        # Exclude bogus boxes with 0 height or width
        h = row['ymax'] - row['ymin']
        w = row['xmax'] - row['xmin']
        filter_idx = (h > 0) & (w > 0)
        row = row[filter_idx]
        # get bounding box coordinates for each mask
        boxes = row[['xmin', 'ymin', 'xmax', 'ymax']].values
        # convert everything into a torch.Tensor
        boxes = torch.as_tensor(boxes, dtype=torch.float32)
        # get the labels
        labels = torch.as_tensor(row['class_id'].values, dtype=int)
        image_id = torch.tensor([idx])
        area = (boxes[:, 3] - boxes[:, 1]) * (boxes[:, 2] - boxes[:, 0])
        # assume no crowd for everything
        iscrowd = torch.zeros((row.shape[0],), dtype=torch.int64)
        target = {}
        target["boxes"] = boxes
        target["labels"] = labels
        target["image_id"] = image_id
        target["area"] = area
        target["iscrowd"] = iscrowd
        if self.transform is not None:
            img, target = self.transform(img, target)
        return img, target

    def __len__(self):
        return self.n_images

    def plot(self, idx, renormalize=True, predictions=None, threshold=0.5, ax=None):
        image, label_js = self[idx]
        if renormalize:
            # Invert the T.Normalize transform
            unnormalize = T.Compose(
                [
                    T.Normalize(mean = [ 0., 0., 0. ], std = 1 / type(self).std),
                    T.Normalize(mean = -type(self).mean, std = [ 1., 1., 1. ])
                ]
            )
            image, label_js = unnormalize(image, label_js)
        if ax is None:
            fig, ax = plt.subplots(figsize=(8, 8))
        _ = ax.imshow(torch.permute(image, [1, 2, 0]))
        for i, box in enumerate(label_js['boxes']):
            xy = (box[0], box[1])
            h, w = (box[2] - box[0]), (box[3] - box[1])
            r = patches.Rectangle(xy, h, w, fill=False, color=self.colors[label_js['labels'][i]-1], lw=2, alpha=0.5)
        if predictions is not None:
            # Make sure the predictions are on the CPU
            for k in predictions:
                predictions[k] = predictions[k].detach().cpu().numpy()
            for i, box in enumerate(predictions['boxes']):
                if predictions['scores'][i] > threshold:
                    xy = (box[0], box[1])
                    h, w = (box[2] - box[0]), (box[3] - box[1])
                    r = patches.Rectangle(xy, h, w, fill=False, color=self.colors[predictions['labels'][i]-1], lw=2, linestyle=':')
        _ = ax.axis("off")
        return ax
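One thing worth checking in the dataset class: __getitem__ returns the KeyError instead of raising it, so the exception object is handed to the data loader as if it were a sample. The sketch below reproduces that effect without torch. ToyDataset and collate are hypothetical stand-ins, and it is an assumption that the real collate function transposes the batch with zip, as the torchvision detection reference utilities do.

```python
class ToyDataset:
    """Hypothetical stand-in for a map-style dataset; not the class from the question."""

    def __init__(self, items):
        self.items = items

    def __len__(self):
        return len(self.items)

    def __getitem__(self, idx):
        if idx in self.items:
            return self.items[idx]
        # Returning (instead of raising) hands the exception object back as data.
        return KeyError(f"Element {idx} not in dataframe")


def collate(batch):
    # Transpose a list of (img, target) pairs into (imgs, targets).
    return tuple(zip(*batch))


ds = ToyDataset({0: ("img0", {"boxes": []})})
sample = ds[5]  # no exception here, just a KeyError instance

try:
    collate([sample])
    message = None
except TypeError as exc:
    # The failure only shows up later, when the "sample" is iterated.
    message = str(exc)

print(type(sample).__name__, message)
```

If this is what happens in the real code, changing return to raise would at least surface the actual problem: which index is missing from self.df.index, and why the sampler produced it.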

"RecursionError: maximum recursion depth exceeded" while adding a new constraints to a convex problem in cvxpy

I'm new to cvxpy and am working on an energy optimization model to which I'm trying to add new constraints. When I try to append the new constraints, I get the error "RecursionError: maximum recursion depth exceeded". If I exclude the last lines (the rules.append(...) calls), everything works. I have tried increasing the system recursion limit, but that doesn't help. Any suggestions?
class ProductionRamping(Constraint):
    # it runs on both operation and planning, single- and multi-region
    def __get_energy_prod_differences(self):
        """Creates variables to constrain the increase and decrease
        in energy production when passing from one timestep to the next"""
        # Create a dataframe containing the differences between the energy production of each tech
        # in each timestep and the production in the previous timestep
        energy_prod_differences = {}
        # for each region
        for reg in self.model_data.settings.regions:
            energy_prod_differences_regional = {}
            # for each tech, except demand, transmission, and storage
            for tech_type in self.model_data.settings.technologies[reg].keys():
                if tech_type == "Demand" or tech_type == "Transmission" or tech_type == "Storage":
                    continue
                # select the first row
                first_row = cp.reshape(
                    (1, self.variables.technology_prod[reg][tech_type][0,:].shape[0])
                # in the case the timesteps are just one, the shifted dataframe containing the energy production
                # will only be the first row; if timesteps are > 1 instead we have a shifted dataframe
                if self.variables.technology_prod[reg][tech_type].shape[0] > 1:
                    shifted_prod = cp.vstack([first_row, (self.variables.technology_prod[reg][tech_type][:-1,:])])
                else:
                    shifted_prod = first_row
                # compute the difference between the energy production in each timestep and the production in the previous timestep
                energy_prod_differences_regional[tech_type] = self.variables.technology_prod[reg][tech_type] - shifted_prod
            energy_prod_differences[reg] = energy_prod_differences_regional
        return energy_prod_differences ## 8760*nyears x ntechs

def transform_total_capacity(capacity, regions, timeslice_fraction, years):
    totcapacity ={}
    for reg in regions:
        totcapacity_regional = {}
        for tech_type in capacity[reg].keys():
            if tech_type == "Demand" or tech_type == "Transmission" or tech_type == "Storage":
                continue
            totcapacity_hourly = cp.reshape(
                (1, capacity[reg][tech_type][0,:].shape[0]))
            totcapacity_yearly = totcapacity_hourly
            for hours in np.arange(len(timeslice_fraction)-1):
                totcapacity_yearly = cp.vstack([totcapacity_yearly,totcapacity_hourly])
            totcapacity_multiyear_regional = totcapacity_yearly
            for indx in np.arange(len(years)-1):
                totcapacity_hourly1 = cp.reshape(
                    (1, capacity[reg][tech_type][indx,:].shape[0]))
                totcapacity_yearly1 = totcapacity_hourly1
                for hours in np.arange(len(timeslice_fraction)-1):
                    totcapacity_yearly1 = cp.vstack([totcapacity_yearly1,totcapacity_hourly1])
                totcapacity_multiyear_regional = cp.vstack ([totcapacity_multiyear_regional,totcapacity_yearly1])
            totcapacity_regional[tech_type] = totcapacity_multiyear_regional
        totcapacity[reg] = totcapacity_regional
    return totcapacity

    # methods of ProductionRamping (continued)
    def _check(self):
        assert hasattr(self.variables, 'totalcapacity'), "totalcapacity must be defined"

    def rules(self):
        rules = []
        totcapacity = transform_total_capacity(
        for reg in self.model_data.settings.regions:
            for tech_type, value in totcapacity[reg].items():
                max_percentage_ramps = self.model_data.regional_parameters[reg]["prod_max_ramp"].loc[:, tech_type]
                max_percentage_ramps = max_percentage_ramps.reindex(max_percentage_ramps.index.repeat(
                min_percentage_ramps = self.model_data.regional_parameters[reg]["prod_min_ramp"].loc[:, tech_type]
                min_percentage_ramps = min_percentage_ramps.reindex(min_percentage_ramps.index.repeat(
                energy_prod_differences = self.__get_energy_prod_differences()
                for indx, year in enumerate(self.model_data.settings.years):
                    max_ramp_in_timestep = cp.multiply ( value [indx * len(self.model_data.settings.timeslice_fraction) : (indx + 1)
                        * len(self.model_data.settings.time_steps), :] , max_percentage_ramps)
                    min_ramp_in_timestep = cp.multiply ( value [indx * len(self.model_data.settings.timeslice_fraction) : (indx + 1)
                        * len(self.model_data.settings.time_steps), :] , min_percentage_ramps)
                    energy_prod = energy_prod_differences[reg][tech_type][indx * len(self.model_data.settings.timeslice_fraction) : (indx + 1)
                        * len(self.model_data.settings.timeslice_fraction), :]
                    diff = energy_prod - max_ramp_in_timestep
                    diff1 = energy_prod * -1 - min_ramp_in_timestep
                    rules.append(diff <= 0)
                    rules.append(diff1 <= 0)
        return rules
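One possible explanation, offered as an assumption rather than a diagnosis: transform_total_capacity calls cp.vstack inside a loop, wrapping the previous result each time, so with 8760 timeslices the expression ends up nested thousands of levels deep. Any code that walks such an expression recursively can then exceed Python's recursion limit. The toy model below illustrates the difference between nested and flat stacking. Node is a hypothetical stand-in written for this sketch, not a cvxpy class.

```python
class Node:
    """Toy expression node; a stand-in for illustration, not a cvxpy class."""

    def __init__(self, children=()):
        self.children = list(children)

    def depth(self):
        # Recursive traversal, as tree-walking code typically does.
        return 1 + max((c.depth() for c in self.children), default=0)


def stack_nested(row, n):
    # Pattern from the question: stacked = vstack([stacked, row]) in a loop.
    stacked = row
    for _ in range(n - 1):
        stacked = Node([stacked, row])
    return stacked


def stack_flat(row, n):
    # Alternative: collect the rows first, then stack once.
    return Node([row] * n)


row = Node()
hours = 8760

flat_depth = stack_flat(row, hours).depth()

try:
    stack_nested(row, hours).depth()
    nested_failed = False
except RecursionError:
    nested_failed = True

print(flat_depth, nested_failed)
```

In cvxpy terms the flat variant would correspond to building a Python list of the rows and calling cp.vstack on that list once, which keeps the expression shallow regardless of the number of timeslices.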

Right Shift and Left shift operator in Classic ASP

How do I use a right shift operator in classic ASP? As suggested in "In ASP, Bit Operator Left shift and Right shift", I used "\" as the right shift operator, but it gives me the wrong result.
For example:
In JavaScript, 33555758 >> 24 gives the result 2.
But in classic ASP, 33555758 \ 24 gives the result of the division.
Please help me with this.
A bitwise right shift (>>) by n is not integer division (\) by n. It is integer division by 2 (binary 10), repeated n times: each shift drops the lowest binary digit and moves the remaining digits one place to the right.
Example: 5 >> 1 = 2
5 00000000000000000000000000000101
5 >> 1 00000000000000000000000000000010 (2)
which is the same as 5 \ 2.
So in your case it is not 33555758 \ 24, but 33555758 divided by 2 twenty-four times. As there is no direct method in VBScript, it can be done as follows:
Function SignedRightShift(pValue, pShift)
    Dim NewValue, PrevValue, i
    PrevValue = pValue
    For i = 1 to pShift
        Select Case VarType(pValue)
            Case vbLong
                NewValue = Int((PrevValue And "&H7FFFFFFF") / 2)
                If PrevValue And "&H80000000" Then NewValue = NewValue Or "&HC0000000"
                NewValue = CLng(NewValue)
            Case vbInteger
                NewValue = Int((PrevValue And "&H7FFF") / 2)
                If PrevValue And "&H8000" Then NewValue = NewValue Or "&HC000"
                NewValue = CInt("&H"+ Hex(NewValue))
            Case vbByte
                NewValue = Int(PrevValue / 2)
                If PrevValue And "&H80" Then NewValue = NewValue Or "&HC0"
                NewValue = CByte(NewValue)
            Case Else: Err.Raise 13 ' Not a supported type
        End Select
        PrevValue = NewValue
    Next
    SignedRightShift = PrevValue
End Function
and used as
x = SignedRightShift(33555758, 24)
For more, see
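The arithmetic behind this can be checked quickly in any language with a native shift operator. A small sketch in Python (used here only for illustration; right_shift is a name made up for the example) shows that shifting right by n matches floor division by 2 to the power n, not division by n:

```python
def right_shift(value, shift):
    """Arithmetic right shift expressed as floor division by 2**shift."""
    return value // (2 ** shift)


# Shifting by 24 is dividing by 2**24 = 16777216, not by 24.
print(right_shift(33555758, 24))  # 2
print(33555758 >> 24)             # 2
print(33555758 // 24)             # 1398156

# The same identity holds for negative values (sign is preserved).
print(right_shift(-5, 1), -5 >> 1)
```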

Checksum Python

I tried to make a checksum function but my tests returned None instead of the expected output. Can you point out where I went wrong and how to correct it? I am trying to get a numerical sum out of a bunch of strings in the test code.
def string_checksum(data):
partialchecksum = 0
for i in data:
if i is int:
partialchecksum += i
def tobits(i):
result = []
strsum = 0
for c in i:
bits = bin(ord(c))[2:]
bits = '00000000'[len(bits):] + bits
result.extend([int(b) for b in bits])
strsum = sum(result)
checksum = partialchecksum + strsum
return checksum
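Two things in the code above are worth checking. First, `i is int` compares the value with the type object itself, which is never true for an integer value; isinstance(i, int) is the usual test. Second, a function that ends without reaching a return statement returns None, which is what happens if the return sits inside the nested helper. The sketch below is one reading of the intent, under the assumption that integers are added directly and each string contributes the number of 1 bits in its characters:

```python
def string_checksum(data):
    """Sum integer items directly; for other items, add the count of 1 bits
    in each character's code point (an assumed reading of the original)."""
    checksum = 0
    for item in data:
        if isinstance(item, int):
            checksum += item
        else:
            for ch in str(item):
                # bin(97) == '0b1100001' -> three 1 bits for 'a'
                checksum += bin(ord(ch)).count("1")
    # Returned at function level, so the caller never gets None.
    return checksum


print(string_checksum([1, 2, "a"]))
```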

alignment of sequences

I want to do pairwise alignment with uniprot and pdb sequences. I have an input file containing uniprot and pdb IDs like this.
pdb id uniprot id
1dbh Q07889
1e43 P00692
1f1s Q53591
1) Read each line in the input file
2) Retrieve the pdb and uniprot sequences from the pdb.fasta and uniprot.fasta files
3) Do the alignment and calculate sequence identity.
Usually, I use the following program for pairwise alignment and seq.identity calculation.
globalAlign<- pairwiseAlignment(seq1, seq2)
pid(globalAlign, type = "PID3")
I need to print the output like this
pdbid uniprotid seq.identity
1dbh Q07889 99
1e43 P00692 80
1f1s Q53591 56
How can I change the above code? Your help would be appreciated!
This code is hopefully what you're looking for:
class test():
    def get_seq(self, pdb,fasta_file): # Get sequences
        from Bio.PDB.PDBParser import PDBParser
        from Bio import SeqIO
        aa = {'ARG':'R','HIS':'H','LYS':'K','ASP':'D','GLU':'E','SER':'S','THR':'T','ASN':'N','GLN':'Q','CYS':'C','SEC':'U','GLY':'G','PRO':'P','ALA':'A','ILE':'I','LEU':'L','MET':'M','PHE':'F','TRP':'W','TYR':'Y','VAL':'V'}
        p = PDBParser()
        structure_id="%s" % pdb[:-4]
        structure=p.get_structure(structure_id, pdb)
        residues = structure.get_residues()
        seq_pdb = ''
        for res in residues:
            res = res.get_resname()
            if res in aa:
                seq_pdb = seq_pdb+aa[res]
        handle = open(fasta_file, "rU")
        for record in SeqIO.parse(handle, "fasta") :
            seq_fasta = record.seq

    def seq_aln(self,seq1,seq2): # Align the sequences
        from Bio import pairwise2
        from Bio.SubsMat import MatrixInfo as matlist
        matrix = matlist.blosum62
        gap_open = -10
        gap_extend = -0.5
        alns = pairwise2.align.globalds(seq1, seq2, matrix, gap_open, gap_extend)
        top_aln = alns[0]
        aln_seq1, aln_seq2, score, begin, end = top_aln
        with open('aln.fasta', 'w') as outfile:
            outfile.write('> PDB_seq\n'+str(aln_seq1)+'\n> Uniprot_seq\n'+str(aln_seq2))
        print aln_seq1+'\n'+aln_seq2

    def seq_id(self,aln_fasta): # Get sequence ID
        import string
        from Bio import AlignIO
        input_handle = open("aln.fasta", "rU")
        alignment = AlignIO.read(input_handle, "fasta")
        j=0 # counts positions in first sequence
        i=0 # counts identity hits
        for record in alignment:
            #print record
            for amino_acid in record.seq:
                if amino_acid == '-':
                    pass
                else:
                    if amino_acid == alignment[0].seq[j]:
                        i += 1
                j += 1
            j = 0
            seq = str(record.seq)
            gap_strip = seq.replace('-', '')
            percent = 100*i/len(gap_strip)
            print record.id+' '+str(percent)
            i = 0 # reset the hit counter for the next record

a = test()
This outputs:
PDB_seq 100 # pdb to itself would obviously have 100% identity
Uniprot_seq 24 # pdb sequence has 24% identity to the uniprot sequence
For this to work on your input file, you need to put my a.get_seq() call in a for loop over the inputs from your text file.
Replace the seq_id function with this one:
def seq_id(self,aln_fasta):
    import string
    from Bio import AlignIO
    from Bio import SeqIO
    record_iterator = SeqIO.parse(aln_fasta, "fasta")
    first_record = next(record_iterator)
    print '%s has a length of %d' % (first_record.id, len(str(first_record.seq).replace('-','')))
    second_record = next(record_iterator)
    print '%s has a length of %d' % (second_record.id, len(str(second_record.seq).replace('-','')))
    lengths = [len(str(first_record.seq).replace('-','')), len(str(second_record.seq).replace('-',''))]
    if lengths.index(min(lengths)) == 0: # If both sequences have the same length the PDB sequence will be taken as the shortest
        print 'PDB sequence has the shortest length'
    else:
        print 'Uniprot sequence has the shortest length'
    identities = 0
    for i,v in enumerate(first_record.seq):
        if v == '-':
            pass
            #print i,v, second_record.seq[i]
        else:
            if v == second_record.seq[i]:
                identities +=1
                #print i,v, second_record.seq[i], identities
    # multiply by 100.0 first so the division is not truncated to an integer
    print 'Sequence Identity = %.2f percent' % (100.0*identities/min(lengths))
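The identity calculation itself does not depend on Biopython. A minimal sketch in plain Python, assuming the two sequences are already aligned strings of equal length with '-' for gaps (percent_identity is a name made up for the example), looks like this:

```python
def percent_identity(aln_seq1, aln_seq2):
    """Percent identity of two aligned sequences, relative to the shorter
    ungapped sequence (the same denominator as min(lengths) above)."""
    if len(aln_seq1) != len(aln_seq2):
        raise ValueError("aligned sequences must have the same length")
    # Count columns where both sequences carry the same residue (gaps excluded).
    matches = sum(1 for a, b in zip(aln_seq1, aln_seq2) if a != '-' and a == b)
    shortest = min(len(aln_seq1.replace('-', '')), len(aln_seq2.replace('-', '')))
    return 100.0 * matches / shortest


print(percent_identity("ACD-F", "ACDEF"))  # 100.0
print(percent_identity("ACDE", "ACKE"))    # 75.0
```

Other definitions of percent identity use a different denominator (for example the alignment length), so the numbers are only comparable when the same definition is used throughout.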
To pass the arguments to the class, use:
with open('input_file.txt', 'r') as infile:
    next(infile) # Going by your input file
    for line in infile:
        line = line.split()
It might be something like this; a repeatable example (e.g., with short files posted on-line) would help...
pdb = readAAStringSet("pdb.fasta")
uniprot = readAAStringSet("uniprot.fasta")
to read all sequences into two objects. pairwiseAlignment accepts a vector as its first (query) argument, so if you want to align all pdb sequences against all uniprot sequences, pre-allocate a result matrix
pids = matrix(numeric(), length(uniprot), length(pdb),
dimnames=list(names(uniprot), names(pdb)))
and then do the calculations
for (i in seq_along(uniprot)) {
    globalAlignment = pairwiseAlignment(pdb, uniprot[i])
    pids[i,] = pid(globalAlignment)
}
