Standard Deviation for SQLite
I've searched the SQLite docs and couldn't find anything, but I've also searched on Google and a few results appeared.
Does SQLite have any built-in Standard Deviation function?
You can calculate the variance in SQL:
create table t (row int);
insert into t values (1),(2),(3);
SELECT AVG((t.row - sub.a) * (t.row - sub.a)) as var from t,
(SELECT AVG(row) AS a FROM t) AS sub;
0.666666666666667
However, you still have to calculate the square root to get the standard deviation.
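Unless your SQLite build happens to expose a sqrt() SQL function, one way to finish the job is to run the variance query and take the root in the calling application. A minimal sketch in Python, using an invented table t with a val column in place of the row column above:

import sqlite3
import math

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (val REAL)")
con.executemany("INSERT INTO t VALUES (?)", [(1,), (2,), (3,)])

# Same shape as the variance query above, just against the val column
variance = con.execute(
    "SELECT AVG((t.val - sub.a) * (t.val - sub.a)) "
    "FROM t, (SELECT AVG(val) AS a FROM t) AS sub"
).fetchone()[0]

print(math.sqrt(variance))  # population standard deviation, about 0.816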
The aggregate functions supported by SQLite are here:
http://www.sqlite.org/lang_aggfunc.html
STDEV is not in the list.
However, the contributed extension-functions.c module (available from the SQLite contrib page) does provide a STDEV function.
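If you go that route, extension-functions.c has to be compiled into a loadable library and then loaded into the connection. A rough sketch from Python, assuming the extension was built as libsqlitefunctions.so (the build command and file name vary by platform, and the Python interpreter must have been built with loadable-extension support):

# Shell, roughly: gcc -shared -fPIC -o libsqlitefunctions.so extension-functions.c -lm
import sqlite3

con = sqlite3.connect(":memory:")
con.enable_load_extension(True)   # unavailable if extension loading was disabled at build time
con.load_extension("./libsqlitefunctions.so")
con.enable_load_extension(False)

con.execute("CREATE TABLE nums (x REAL)")
con.executemany("INSERT INTO nums VALUES (?)", [(1,), (2,), (3,)])
print(con.execute("SELECT stdev(x) FROM nums").fetchone()[0])  # 1.0 here (the extension's stdev is the sample standard deviation)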
There is still no built-in stdev function in SQLite. However, you can define a user-defined aggregate function (as Alix has done). Here is a complete example in Python:
import sqlite3
import math

class StdevFunc:
    def __init__(self):
        self.M = 0.0
        self.S = 0.0
        self.k = 1

    def step(self, value):
        if value is None:
            return
        tM = self.M
        self.M += (value - tM) / self.k
        self.S += (value - tM) * (value - self.M)
        self.k += 1

    def finalize(self):
        if self.k < 3:
            return None
        return math.sqrt(self.S / (self.k - 2))

with sqlite3.connect(':memory:') as con:
    con.create_aggregate("stdev", 1, StdevFunc)
    cur = con.cursor()
    cur.execute("create table test(i)")
    cur.executemany("insert into test(i) values (?)", [(1,), (2,), (3,), (4,), (5,)])
    cur.execute("insert into test(i) values (null)")
    cur.execute("select avg(i) from test")
    print("avg: %f" % cur.fetchone()[0])
    cur.execute("select stdev(i) from test")
    print("stdev: %f" % cur.fetchone()[0])
This will print:
avg: 3.000000
stdev: 1.581139
Compare with MySQL: http://sqlfiddle.com/#!2/ad42f3/3/0
Use the variance formula V(X) = E(X^2) - E(X)^2. In SQLite:
SELECT AVG(col*col) - AVG(col)*AVG(col) FROM table
To get the standard deviation you still need to take the square root, V(X)^(1/2).
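Note that this expression is the population variance, and because it subtracts two nearly equal averages it can lose precision when the mean is large compared to the spread. A quick sanity check in Python against the standard library, using a throwaway demo table (the table name is invented here):

import sqlite3
import statistics

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE demo (col REAL)")
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
con.executemany("INSERT INTO demo VALUES (?)", [(x,) for x in data])

sql_var = con.execute(
    "SELECT AVG(col*col) - AVG(col)*AVG(col) FROM demo"
).fetchone()[0]

print(sql_var, statistics.pvariance(data))  # both print 4.0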
I implemented Welford's method (the same approach as extension-functions.c) as a SQLite UDF:
$db->sqliteCreateAggregate('stdev',
    function (&$context, $row, $data) // step callback
    {
        if (isset($context) !== true) // $context is null at first
        {
            $context = array
            (
                'k' => 0,
                'm' => 0,
                's' => 0,
            );
        }
        if (isset($data) === true) // the standard is non-NULL values only
        {
            $context['s'] += ($data - $context['m']) * ($data - ($context['m'] += ($data - $context['m']) / ++$context['k']));
        }
        return $context;
    },
    function (&$context, $row) // fini callback
    {
        if ($context['k'] > 0) // return NULL if no non-NULL values exist
        {
            return sqrt($context['s'] / $context['k']);
        }
        return null;
    },
    1);
That's in PHP ($db is the PDO object) but it should be trivial to port to another language.
SQLite is soooo cool. <3
A little trick:
select (count(*)*sum(value*value) - sum(value)*sum(value)) / ((count(*)-1)*count(*))
from the_table;
This gives the sample variance; then the only thing left is to calculate the square root outside.
No. I searched for this same issue and ended up having to do the calculation in my application (PHP).
Added some error detection to the Python function:
import math

class StdevFunc:
    """
    For use as an aggregate function in SQLite
    """
    def __init__(self):
        self.M = 0.0
        self.S = 0.0
        self.k = 0

    def step(self, value):
        try:
            # automatically convert text to float, like the rest of SQLite
            val = float(value)  # if this fails, the row is skipped, which also ignores NULLs
            tM = self.M
            self.k += 1
            self.M += ((val - tM) / self.k)
            self.S += ((val - tM) * (val - self.M))
        except (TypeError, ValueError):
            pass

    def finalize(self):
        if self.k <= 1:  # avoid division by zero
            return None
        else:
            return math.sqrt(self.S / (self.k - 1))
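A hedged usage sketch for this variant (the readings table is made up): register it with create_aggregate exactly as in the earlier answer; NULLs and non-numeric text are silently skipped, while numeric text is converted.

import sqlite3

con = sqlite3.connect(":memory:")
con.create_aggregate("stdev", 1, StdevFunc)

con.execute("CREATE TABLE readings (v)")
# Mixed storage classes: integers, a numeric string, and a NULL
con.executemany("INSERT INTO readings VALUES (?)", [(1,), (2,), ("3",), (None,)])

print(con.execute("SELECT stdev(v) FROM readings").fetchone()[0])  # 1.0, the sample stdev of 1, 2, 3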
You don't state which version of standard deviation you wish to calculate, but the variance (standard deviation squared) for either version can be calculated using a combination of the sum() and count() aggregate functions.
select
(count(val)*sum(val*val) - (sum(val)*sum(val)))/((count(val)-1)*(count(val))) as sample_variance,
(count(val)*sum(val*val) - (sum(val)*sum(val)))/((count(val))*(count(val))) as population_variance
from ... ;
It will still be necessary to take the square root of these to obtain the standard deviation.
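A short end-to-end sketch in Python (the samples table and val column are invented for illustration), running both expressions in one query and taking the square roots in the application:

import sqlite3
import math

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE samples (val REAL)")
con.executemany("INSERT INTO samples VALUES (?)", [(1,), (2,), (3,), (4,), (5,)])

sample_var, population_var = con.execute(
    "SELECT (count(val)*sum(val*val) - sum(val)*sum(val)) * 1.0 / ((count(val)-1)*count(val)), "
    "       (count(val)*sum(val*val) - sum(val)*sum(val)) * 1.0 / (count(val)*count(val)) "
    "FROM samples"
).fetchone()

print(math.sqrt(sample_var), math.sqrt(population_var))  # ~1.5811 and ~1.4142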
#!/usr/bin/python
# -*- coding: utf-8 -*-
# Values produced by this script can be verified by following the steps
# found at https://support.microsoft.com/en-us/kb/213930
# (choose a non-memory-based database if you want to verify the output).

import sqlite3
import math
import random
import os
import sys
import traceback
class StdevFunc:
    def __init__(self):
        self.M = 0.0  # running mean
        self.V = 0.0  # used to calculate the variance
        self.S = 0.0  # standard deviation
        self.k = 1    # running count of values + 1

    def step(self, value):
        try:
            if value is None:
                return None
            tM = self.M
            self.M += (value - tM) / self.k
            self.V += (value - tM) * (value - self.M)
            self.k += 1
        except Exception:
            pass
        return None

    def finalize(self):
        try:
            if (self.k - 1) < 3:
                return None
            # Finish the variance calculation
            self.V = (self.V / (self.k - 2))
            # Standard deviation is the square root of the variance
            self.S = math.sqrt(self.V)
            return self.S
        except Exception:
            pass
        return None
def Histogram(Population):
    try:
        BinCount = 6
        More = 0
        #a = 1         # For testing exception trapping
        #b = 0         # and traceback
        #c = (a / b)   # with detailed info

        # If you want to store the database on disk:
        #uncDatabase = os.path.join(os.getcwd(), "BellCurve.db3")
        #con = sqlite3.connect(uncDatabase)
        # If you want the database in memory:
        con = sqlite3.connect(':memory:')
        # row_factory allows accessing fields by row and column name
        con.row_factory = sqlite3.Row
        # Add our non-persistent, runtime standard deviation function to the database
        con.create_aggregate("Stdev", 1, StdevFunc)
        # Grab a cursor
        cur = con.cursor()

        # Initialize the tables, so each run will be clear of the previous run
        cur.executescript('drop table if exists MyData;')  # executescript requires ; at the end of the string
        cur.execute("create table IF NOT EXISTS MyData('ID' INTEGER PRIMARY KEY AUTOINCREMENT, 'Val' FLOAT)")
        cur.executescript('drop table if exists Bins;')    # executescript requires ; at the end of the string
        cur.execute("create table IF NOT EXISTS Bins('ID' INTEGER PRIMARY KEY AUTOINCREMENT, 'Bin' UNSIGNED INTEGER, 'Val' FLOAT, 'Frequency' UNSIGNED BIG INT)")

        # Generate some random data and insert it into the database
        for n in range(0, Population):
            sql = "insert into MyData(Val) values ({0})".format(random.uniform(-1, 1))
            # For whole-number integers use randint instead:
            #sql = "insert into MyData(Val) values ({0})".format(random.randint(-1, 1))
            cur.execute(sql)

        # Now calculate some built-in aggregates that SQLite comes with
        cur.execute("select Avg(Val) from MyData")
        Average = cur.fetchone()[0]
        cur.execute("select Max(Val) from MyData")
        Max = cur.fetchone()[0]
        cur.execute("select Min(Val) from MyData")
        Min = cur.fetchone()[0]
        cur.execute("select Count(Val) from MyData")
        Records = cur.fetchone()[0]

        # Now get the standard deviation using the function we added
        cur.execute("select Stdev(Val) from MyData")
        Stdev = cur.fetchone()[0]

        # And calculate the range
        Range = float(abs(float(Max) - float(Min)))

        if Stdev is None:
            print("================================ Data Error ===============================")
            print(" Insufficient Population Size, Or Bad Data.")
            print("*****************************************************************************")
        elif abs(Max - Min) == 0:
            print("================================ Data Error ===============================")
            print(" The entire Population Contains Identical values, Distribution Incalculable.")
            print("******************************************************************************")
        else:
            Bin = []        # Holds the bin values
            Frequency = []  # Holds the bin frequency for each bin

            # Establish the 1st bin, which is (Standard Deviation * 3) subtracted from the Mean
            Bin.append(float(Average - (3 * Stdev)))
            Frequency.append(0)

            # Establish the remaining bins, adding 1 standard deviation
            # for each iteration: -3, -2, -1, 1, 2, 3
            for b in range(0, BinCount + 1):
                Bin.append(float(Bin[b]) + Stdev)
                Frequency.append(0)

            for b in range(0, BinCount + 1):
                # Let the database do the hard work of calculating the distribution
                # of all the bins with SQL's between operator, but making it
                # left inclusive, right exclusive.
                sqlBinFreq = "select count(*) as Frequency from MyData where val between {0} and {1} and Val < {2}". \
                    format(float(Bin[b]), float(Bin[b + 1]), float(Bin[b + 1]))

                # If the database reports values that fall within the current bin,
                # store the frequency in the Bins table.
                for rowBinFreq in cur.execute(sqlBinFreq):
                    Frequency[b + 1] = rowBinFreq['Frequency']

                sqlBinFreqInsert = "insert into Bins (Bin, Val, Frequency) values ({0}, {1}, {2})". \
                    format(b, float(Bin[b]), Frequency[b])
                cur.execute(sqlBinFreqInsert)

                # Although this demo is not likely to produce values that fall outside
                # of the standard distribution, with real data we would want to know
                # how many non-standard data points we have.
                More = More + Frequency[b]

            More = abs(Records - More)

            # Add the "More" value
            sqlBinFreqInsert = "insert into Bins (Bin, Val, Frequency) values ({0}, {1}, {2})". \
                format(BinCount + 1, float(0), More)
            cur.execute(sqlBinFreqInsert)

            # Now report the analysis
            print("================================ The Population ==============================")
            print(" {0} {1} {2} {3} {4} {5}".format(
                "Size".rjust(10, ' '),
                "Max".rjust(10, ' '),
                "Min".rjust(10, ' '),
                "Mean".rjust(10, ' '),
                "Range".rjust(10, ' '),
                "Stdev".rjust(10, ' ')))
            print("Aggregates: {0:10d} {1:10.4f} {2:10.4f} {3:10.4f} {4:10.4f} {5:10.4f}".format(
                Population, Max, Min, Average, Range, Stdev))
            print("================================= The Bell Curve =============================")

            LabelString = "{0} {1} {2} {3}".format(
                "Bin".ljust(8, ' '),
                "Ranges".rjust(8, ' '),
                "Frequency".rjust(8, ' '),
                "Histogram".rjust(6, ' '))
            print(LabelString)
            print("------------------------------------------------------------------------------")

            # Paint a histogram
            sqlChart = "select * from Bins order by Bin asc"
            for rowChart in cur.execute(sqlChart):
                if rowChart['Bin'] == 7:
                    # Bin 7 is not really a bin, but where we place the values that did not fit
                    # into the normal distribution. This script was tested against Excel's Bell
                    # Curve example, https://support.microsoft.com/en-us/kb/213930,
                    # and produces the same results. Feel free to test it.
                    BinName = "More"
                    ChartString = "{0:<6} {1:<10} {2:10.0f}".format(BinName, "", More)
                else:
                    # These are the actual bins where values fall within the distribution.
                    BinName = rowChart['Bin'] + 1
                    # Scale the chart
                    fPercent = (float(rowChart['Frequency']) / float(Records) * 100)
                    iPercent = int(math.ceil(fPercent))
                    ChartString = "{0:<6} {1:10.4f} {2:10.0f} {3}".format(
                        BinName, rowChart['Val'], rowChart['Frequency'], "".rjust(iPercent, '#'))
                print(ChartString)

            print("******************************************************************************")

        # Commit to the database
        con.commit()
        # Clean up
        cur.close()
        con.close()

    except Exception:
        TraceInfo = traceback.format_exc()
        raise Exception(TraceInfo)
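The answer doesn't show the call that actually produces the report; presumably it ends with something along these lines, where the population size is just an example value:

if __name__ == '__main__':
    Histogram(10000)  # population size chosen arbitrarily for the demo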