Find sequencing reads with insertions longer than a given length

I'm trying to isolate, from a BAM file, the sequencing reads that have insertions longer than a given length (let's say 50 bp). I guess I can do that using the CIGAR string, but I don't know an easy way to parse it and keep only the reads that I want. This is what I need:
Read1 -> 2M1I89M53I2M
Read2 -> 2M1I144M
I should keep only Read1.
Thanks!

Most likely I'm late, but ...
Probably you want the MC tag, not the CIGAR. I use BWA, and information on insertions is stored in the MC tag, but I may be mistaken.
Use the pysam module to parse the BAM file and regular expressions to parse the MC tags.
Example code:
import pysam
import re

input_file = pysam.AlignmentFile('input.bam', 'rb')
output_file = pysam.AlignmentFile('found.bam', 'wb', template=input_file)
for Read in input_file:
    try:
        TagMC = Read.get_tag('MC')
    except KeyError:
        continue
    InsertionsTags = re.findall(r'\d+I', TagMC)
    if not InsertionsTags:
        continue
    InsertionLengths = [int(Item[:-1]) for Item in InsertionsTags]
    # Keep the read if at least one insertion is longer than 50 bp
    MaxLength = max(InsertionLengths)
    if MaxLength > 50:
        output_file.write(Read)
input_file.close()
output_file.close()
Hope that helps.
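Note that the MC tag holds the mate's CIGAR string. If you want to filter on each read's own CIGAR instead, a minimal sketch using pysam's cigartuples could look like the following (the output filename is just an example):
import pysam

# Minimal sketch: keep reads whose own CIGAR contains an insertion longer than 50 bp.
# In pysam's cigartuples, operation code 1 corresponds to 'I' (insertion).
with pysam.AlignmentFile('input.bam', 'rb') as bam_in, \
     pysam.AlignmentFile('found_cigar.bam', 'wb', template=bam_in) as bam_out:
    for read in bam_in:
        if read.cigartuples is None:  # e.g. unmapped reads carry no CIGAR
            continue
        if any(op == 1 and length > 50 for op, length in read.cigartuples):
            bam_out.write(read)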

Related

How to convert div tags to a table?

I want to extract the table from this website https://www.rankingthebrands.com/The-Brand-Rankings.aspx?rankingID=37&year=214
Checking the source of that website, I noticed that the table tag is somehow missing. I assume that the table is built from multiple div classes. Is there any easy approach to convert this table to Excel/CSV? I barely have any coding skills/experience...
Appreciate any help
There are a few ways to do that. One of them (in Python) is shown below (pretty self-explanatory, I believe):
import lxml.html as lh
import csv
import requests

url = 'https://www.rankingthebrands.com/The-Brand-Rankings.aspx?rankingID=37&year=214'
req = requests.get(url)
doc = lh.fromstring(req.text)
headers = ['Position', 'Name', 'Brand Value', 'Last']
with open('brands.csv', 'a', newline='') as fp:
    # note the 'a' in there - for 'append'
    file = csv.writer(fp)
    file.writerow(headers)
    # with the headers out of the way, the heavier xpath lifting begins:
    for row in doc.xpath('//div[@class="top100row"]'):
        pos = row.xpath('./div[@class="pos"]//text()')[0]
        name = row.xpath('.//div[@class="name"]//text()')[0]
        brand_val = row.xpath('.//div[@class="weighted"]//text()')[0]
        last = row.xpath('.//div[@class="lastyear"]//text()')[0]
        file.writerow([pos, name, brand_val, last])
The resulting file should be at least close to what you're looking for.
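One thing to watch out for: the CSV is opened in append mode ('a'), so running the script more than once will add duplicate header and data rows to brands.csv; switch the mode to 'w' if you want the file rewritten on every run.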

Migrating to Qt6/PyQt6: what are all the deprecated short-form names in Qt5?

I'm trying to migrate a codebase from PyQt5 to PyQt6. I read in this article (see https://www.pythonguis.com/faq/pyqt5-vs-pyqt6/) that all enum members must be named using their fully qualified names. The article gives this example:
# PyQt5
widget = QCheckBox("This is a checkbox")
widget.setCheckState(Qt.Checked)
# PyQt6
widget = QCheckBox("This is a checkbox")
widget.setCheckState(Qt.CheckState.Checked)
Then the article continues:
"There are too many updated values to mention them all here. But if you're converting a codebase you can usually just search online for the short-form and the longer form will be in the results."
I get the point. This quote basically says something along the lines of:
"If the Python interpreter runs into an error, and the error turns out to be a short-form enum, you'll likely find the solution online."
I get that. But this is not how I want to migrate the codebase. I want a full list of all the short-form enums so that I can perform a global search-and-replace for each.
Where can I find such a list?
I wrote a script that extracts all the short-form names and their corresponding fully qualified enum names from the PyQt6 installation, and then performs the conversions automatically:
# -*- coding: utf-8 -*-
# ================================================================================================ #
#                                       ENUM CONVERTER TOOL                                        #
# ================================================================================================ #
from typing import *
import os, argparse, inspect, re
q = "'"
help_text = '''
Copyright (c) 2022 Kristof Mulier
MIT licensed, see bottom

ENUM CONVERTER TOOL
===================
The script starts from the toplevel directory (assuming that you put this file in that directory)
and crawls through all the files and folders. In each file, it searches for old-style enums to
convert them into fully qualified names.

HOW TO USE
==========
Fill in the path to your PyQt6 installation folder. See line 57:

    pyqt6_folderpath = 'C:/Python39/Lib/site-packages/PyQt6'

Place this script in the toplevel directory of your project. Open a terminal, navigate to the
directory and invoke this script:

    $ python enum_converter_tool.py

WARNING
=======
This script modifies the files in your project! Make sure to back up your project before you put
this file inside. Also, you might first want to do a dry run:

    $ python enum_converter_tool.py --dry_run

FEATURES
========
You can invoke this script in the following ways:

    $ python enum_converter_tool.py            No parameters. The script simply goes through
                                               all the files and makes the replacements.

    $ python enum_converter_tool.py --dry_run  Dry run mode. The script won't do any replace-
                                               ments, but prints out what it could replace.

    $ python enum_converter_tool.py --show     Print the dictionary this script creates to
                                               convert the old-style enums into new-style.

    $ python enum_converter_tool.py --help     Show this help info
'''
# IMPORTANT: Point at the folder where the PyQt6 stub files are located. This folder will be
# examined to fill the 'enum_dict'.
pyqt6_folderpath = 'C:/Python39/Lib/site-packages/PyQt6'

# Figure out where the toplevel directory is located. We assume that this converter tool is located
# in that directory. An os.walk() operation starts from this toplevel directory to find and process
# all files.
toplevel_directory = os.path.realpath(
    os.path.dirname(
        os.path.realpath(
            inspect.getfile(
                inspect.currentframe()
            )
        )
    )
).replace('\\', '/')

# Figure out the name of this script. It will be used later on to exclude oneself from the
# replacements.
script_name = os.path.realpath(
    inspect.getfile(inspect.currentframe())
).replace('\\', '/').split('/')[-1]

# Create the dictionary that will be filled with enums
enum_dict: Dict[str, str] = {}

def fill_enum_dict(filepath: str) -> None:
    '''
    Parse the given stub file to extract the enums and flags. Each one is inside a class, possibly
    a nested one. For example:
        ---------------------------------------------------------------------
        | class Qt(PyQt6.sip.simplewrapper):                                 |
        |     class HighDpiScaleFactorRoundingPolicy(enum.Enum):             |
        |         Round = ...  # type: Qt.HighDpiScaleFactorRoundingPolicy   |
        ---------------------------------------------------------------------
    The enum 'Round' is from class 'HighDpiScaleFactorRoundingPolicy' which is in turn from class
    'Qt'. The old reference style would then be:
        > Qt.Round
    The new style (fully qualified name) would be:
        > Qt.HighDpiScaleFactorRoundingPolicy.Round
    The aim of this function is to fill the 'enum_dict' with an entry like:
        enum_dict = {
            'Qt.Round': 'Qt.HighDpiScaleFactorRoundingPolicy.Round',
        }
    '''
    content: str = ''
    with open(filepath, 'r', encoding='utf-8', newline='\n', errors='replace') as f:
        content = f.read()
    p = re.compile(r'(\w+)\s+=\s+\.\.\.\s+#\s*type:\s*([\w.]+)')
    for m in p.finditer(content):
        # Observe the enum's name, eg. 'Round'
        enum_name = m.group(1)
        # Figure out in which classes it is
        class_list = m.group(2).split('.')
        # If it belongs to just one class (no nesting), there is no point in continuing
        if len(class_list) == 1:
            continue
        # Extract the old and new enum's name
        old_enum = f'{class_list[0]}.{enum_name}'
        new_enum = ''
        for class_name in class_list:
            new_enum += f'{class_name}.'
            continue
        new_enum += enum_name
        # Add them to the 'enum_dict'
        enum_dict[old_enum] = new_enum
        continue
    return

def show_help() -> None:
    '''
    Print help info and quit.
    '''
    print(help_text)
    return

def convert_enums_in_file(filepath: str, dry_run: bool) -> None:
    '''
    Convert the enums in the given file.
    '''
    filename: str = filepath.split('/')[-1]
    # Ignore the file in some cases
    if any(filename == fname for fname in (script_name, )):
        return
    # Read the content
    content: str = ''
    with open(filepath, 'r', encoding='utf-8', newline='\n', errors='replace') as f:
        content = f.read()
    # Loop over all the keys in the 'enum_dict'. Perform a replacement in the 'content' for each
    # of them.
    for k, v in enum_dict.items():
        if k not in content:
            continue
        # Compile a regex pattern that only looks for the old enum (represented by the key of the
        # 'enum_dict') if it is surrounded by word boundaries. What we want to avoid is a situation
        # like this:
        #     k = 'Qt.Window'
        #     k found in 'qt.Qt.WindowType.Window'
        # In the situation above, k is found in 'qt.Qt.WindowType.Window' such that a replacement
        # will take place there, messing up the code! By surrounding k with word boundaries in the
        # regex pattern, this won't happen.
        p = re.compile(fr'\b{k}\b')
        # Substitute all occurrences of k (key) in 'content' with v (value). The 'subn()' method
        # returns a tuple (new_string, number_of_subs_made).
        new_content, n = p.subn(v, content)
        if n == 0:
            assert new_content == content
            continue
        assert new_content != content
        print(f'{q}{filename}{q}: Replace {q}{k}{q} => {q}{v}{q} ({n})')
        content = new_content
        continue
    if dry_run:
        return
    with open(filepath, 'w', encoding='utf-8', newline='\n', errors='replace') as f:
        f.write(content)
    return

def convert_all(dry_run: bool) -> None:
    '''
    Search and replace all enums.
    '''
    for root, dirs, files in os.walk(toplevel_directory):
        for f in files:
            if not f.endswith('.py'):
                continue
            filepath = os.path.join(root, f).replace('\\', '/')
            convert_enums_in_file(filepath, dry_run)
            continue
        continue
    return

if __name__ == '__main__':
    parser = argparse.ArgumentParser(
        description = 'Convert enums to fully-qualified names',
        add_help    = False,
    )
    parser.add_argument('-h', '--help'   , action='store_true')
    parser.add_argument('-d', '--dry_run', action='store_true')
    parser.add_argument('-s', '--show'   , action='store_true')
    args = parser.parse_args()
    if args.help:
        show_help()
    else:
        #& Check if 'pyqt6_folderpath' exists
        if not os.path.exists(pyqt6_folderpath):
            print(
                f'\nERROR:\n'
                f'Folder {q}{pyqt6_folderpath}{q} could not be found. Make sure that variable '
                f'{q}pyqt6_folderpath{q} from line 57 points to the PyQt6 installation folder.\n'
            )
        else:
            #& Fill the 'enum_dict'
            type_hint_files = [
                os.path.join(pyqt6_folderpath, _filename)
                for _filename in os.listdir(pyqt6_folderpath)
                if _filename.endswith('.pyi')
            ]
            for _filepath in type_hint_files:
                fill_enum_dict(_filepath)
                continue
            #& Perform requested action
            if args.show:
                import pprint
                pprint.pprint(enum_dict)
            elif args.dry_run:
                print('\nDRY RUN\n')
                convert_all(dry_run=True)
            else:
                convert_all(dry_run=False)
    print('\nQuit enum converter tool\n')

# MIT LICENSE
# ===========
# Copyright (c) 2022 Kristof Mulier
#
# Permission is hereby granted, free of charge, to any person obtaining a copy of this software and
# associated documentation files (the "Software"), to deal in the Software without restriction, in-
# cluding without limitation the rights to use, copy, modify, merge, publish, distribute, sublicen-
# se, and/or sell copies of the Software, and to permit persons to whom the Software is furnished
# to do so, subject to the following conditions:
#
# The above copyright notice and this permission notice shall be included in all copies or substan-
# tial portions of the Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT
# NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRIN-
# GEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR
# OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
# CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
Make sure you back up your Python project. Then place this file in the toplevel directory of the project. Modify line 57 (!) so that it points to your PyQt6 installation folder.
First run the script with the --dry_run flag to make sure you agree with the replacements. Then run it without any flags.
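To illustrate what the tool does, assuming your project contains the PyQt5-style line from the article quoted in the question, a dry run would report a replacement like 'Qt.Checked' => 'Qt.CheckState.Checked', and a normal run would rewrite the code accordingly:
# A minimal, self-contained example of the kind of line the tool rewrites.
from PyQt6.QtCore import Qt
from PyQt6.QtWidgets import QApplication, QCheckBox

app = QApplication([])
widget = QCheckBox("This is a checkbox")
# Before the tool runs (PyQt5 short form, no longer valid in PyQt6):
#     widget.setCheckState(Qt.Checked)
# After the tool has applied its 'Qt.Checked' -> 'Qt.CheckState.Checked' mapping,
# extracted from the PyQt6 stub files:
widget.setCheckState(Qt.CheckState.Checked)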

R/exams d2l multiple choice question doesn't select correct answer

I use the following to create a D2L exam from the "capitals.Rmd" example (I converted the question to schoice):
exams2blackboard("capitals.Rmd", n = 3, name = "testquiz")
After I upload the testquiz.zip file, I notice that the correct answer must be manually chosen on the D2L platform.
I was wondering if there is a workaround.
Many thanks,
Umut
If you want the correct solution to be selected, do not use the Import option from the Question Library or from the Quiz itself. Use 'Import/Export/Copy Components' under the Course Admin tab.
If you import the questions through the following steps, Brightspace correctly picks the right solution. It's a bit longer, but it seems to choose the solution correctly.
Under the Course Admin tab of your course, go to:
'Import/Export/Copy Components' -> 'Import Components' -> Start -> (drag and drop the ZIP file)
Click 'Advanced Options…'
This step will take a few minutes for large files. If you do not click Advanced Options, the import will automatically put the questions into the Question Library and generate a Quiz with the imported questions; you do not want this.
-> Continue -> Continue -> at this point choose 'Question Library' from the section 'Select Components to Import'
I would not choose 'Quizzes' because it automatically creates a quiz and makes it available to students. It has the unfortunate side effect of making ALL the questions available, which means all the versions of the various dynamic questions; this is not something we want.
-> Continue -> Continue. This stage takes a few minutes for large imports.
Now the questions are available in the Question Library and can be used to generate new quizzes. Each question already has the correct answer selected. This works for 'schoice' and 'mchoice' versions of questions. Plots are currently not imported, though; I am still trying to figure out why.
This problem is new to me. In earlier versions of Brightspace/D2L the import of single-choice and multiple-choice exercises via exams2blackboard() worked well. Possibly, D2L changed in the meantime, given that neither the current release version from CRAN nor the development version from R-Forge works for you.
D2L also supports other import formats and we did play around with some of these. See the following discussions in the R/exams forum on R-Forge:
https://R-Forge.R-project.org/forum/forum.php?thread_id=33404&forum_id=4377&group_id=1337
https://R-Forge.R-project.org/forum/forum.php?thread_id=33657&forum_id=4377&group_id=1337
Notably, we tried to use the XML-based QTI 2.1 format that seems to be employed by D2L internally. However, D2L apparently uses a particular custom flavor of QTI 2.1. It should be possible to reverse engineer that and improve exams2qti21() correspondingly, but so far (to the best of my knowledge) no one has put in the time and effort that this would need.
For simple single-choice/multiple-choice questions a CSV-based exchange format can also be used. I have put together a very basic exams2d2l() function that was posted in the threads above and that I'm also including below. It can set up the CSV file for a single exercise like the capitals.Rmd exercise that you use above. For plain text exercises like that it seems to work well, but not for more complex elements (graphics, code, math, etc.).
exams2d2l <- function(file, dir = ".", ## n = 1L, nsamp = NULL disabled for now
  name = NULL, quiet = TRUE, edir = NULL, tdir = NULL, sdir = NULL, verbose = FALSE,
  resolution = 100, width = 4, height = 4, svg = FALSE,
  encoding = "", converter = NULL, ...)
{
  ## for Rnw exercises use "ttm" converter otherwise "pandoc" converter
  if(any(tolower(tools::file_ext(unlist(file))) == "rmd")) {
    if(is.null(converter)) converter <- "pandoc"
  } else {
    if(is.null(converter)) converter <- "ttm"
  }

  ## output directory or display on the fly
  ## output name processing
  if(is.null(name)) name <- tools::file_path_sans_ext(basename(file))

  ## set up .html transformer and writer function
  htmltransform <- make_exercise_transform_html(converter = converter, ...)

  ## create exam with HTML text
  rval <- xexams(file,
    driver = list(sweave = list(quiet = quiet, pdf = FALSE, png = !svg, svg = svg,
      resolution = resolution, width = width, height = height, encoding = encoding),
      read = NULL, transform = htmltransform, write = NULL),
    dir = dir, edir = edir, tdir = tdir, sdir = sdir, verbose = verbose)

  ## currently: only a single exercise
  rval <- rval[[1L]][[1L]]

  ## put together CSV
  cleanup <- function(x) gsub('"', '""', paste(x, collapse = "\n"), fixed = TRUE)
  rval <- c(
    'NewQuestion,MC,,,',
    sprintf('ID,"%s",,,', cleanup(rval$metainfo$file)),
    sprintf('Title,"%s",,,', cleanup(rval$metainfo$name)),
    sprintf('QuestionText,"%s",,,', cleanup(rval$question)),
    sprintf('Points,%s,,,', if(is.null(rval$metainfo$points)) 1 else rval$metainfo$points),
    'Difficulty,1,,,',
    'Image,,,,',
    paste0('Option,', ifelse(rval$metainfo$solution, 100, 0), ',"', cleanup(rval$questionlist), '",,"', cleanup(rval$solutionlist), '"'),
    'Hint,,,,',
    sprintf('Feedback,"%s",,,', cleanup(rval$solution))
  )
  writeLines(rval, file.path(dir, paste0(name, ".csv")))
  invisible(rval)
}
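For example, calling exams2d2l("capitals.Rmd") should write a capitals.csv file into the current working directory, which can then be imported into the D2L Question Library through the Course Admin steps described in the other answer.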

BertModel transformers outputs string instead of tensor

I'm following this tutorial that codes a sentiment analysis classifier using BERT with the huggingface library, and I'm seeing very odd behavior. When trying the BERT model with a sample text, I get a string instead of the hidden state. This is the code I'm using:
import transformers
from transformers import BertModel, BertTokenizer
print(transformers.__version__)
PRE_TRAINED_MODEL_NAME = 'bert-base-cased'
PATH_OF_CACHE = "/home/mwon/data-mwon/paperChega/src_classificador/data/hugingface"
tokenizer = BertTokenizer.from_pretrained(PRE_TRAINED_MODEL_NAME,cache_dir = PATH_OF_CACHE)
sample_txt = 'When was I last outside? I am stuck at home for 2 weeks.'
encoding_sample = tokenizer.encode_plus(
    sample_txt,
    max_length=32,
    add_special_tokens=True,  # Add '[CLS]' and '[SEP]'
    return_token_type_ids=False,
    padding=True,
    truncation=True,
    return_attention_mask=True,
    return_tensors='pt',  # Return PyTorch tensors
)
bert_model = BertModel.from_pretrained(PRE_TRAINED_MODEL_NAME, cache_dir=PATH_OF_CACHE)
last_hidden_state, pooled_output = bert_model(
    encoding_sample['input_ids'],
    encoding_sample['attention_mask']
)
print([last_hidden_state,pooled_output])
that outputs:
4.0.0
['last_hidden_state', 'pooler_output']
While the answer from Aakash provides a solution to the problem, it does not explain the issue. Since one of the 3.x releases of the transformers library, the models no longer return tuples but specific output objects:
o = bert_model(
    encoding_sample['input_ids'],
    encoding_sample['attention_mask']
)
print(type(o))
print(o.keys())
Output:
<class 'transformers.modeling_outputs.BaseModelOutputWithPoolingAndCrossAttentions'>
odict_keys(['last_hidden_state', 'pooler_output'])
You can return to the previous behavior by adding return_dict=False to get a tuple:
o = bert_model(
    encoding_sample['input_ids'],
    encoding_sample['attention_mask'],
    return_dict=False
)
print(type(o))
print(type(o))
Output:
<class 'tuple'>
I do not recommend doing that, because with the output object it is unambiguous which part of the output you are selecting, without having to turn to the documentation, as shown in the example below:
o = bert_model(encoding_sample['input_ids'], encoding_sample['attention_mask'], return_dict=False, output_attentions=True, output_hidden_states=True)
print('I am a tuple with {} elements. You do not know what each element represents without checking the documentation'.format(len(o)))

o = bert_model(encoding_sample['input_ids'], encoding_sample['attention_mask'], output_attentions=True, output_hidden_states=True)
print('I am a cool object and you can access my elements with o.last_hidden_state, o["last_hidden_state"] or even o[0]. My keys are: {}'.format(o.keys()))
Output:
I am a tuple with 4 elements. You do not know what each element represents without checking the documentation
I am a cool object and you can access my elements with o.last_hidden_state, o["last_hidden_state"] or even o[0]. My keys are: odict_keys(['last_hidden_state', 'pooler_output', 'hidden_states', 'attentions'])
I faced the same issue while learning how to implement BERT. I noticed that using
last_hidden_state, pooled_output = bert_model(encoding_sample['input_ids'], encoding_sample['attention_mask'])
is the issue. Use:
outputs = bert_model(encoding_sample['input_ids'], encoding_sample['attention_mask'])
and extract the last hidden state using
outputs[0]
You can refer to the documentation here, which tells you what is returned by BertModel.
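To tie both answers together, here is a minimal sketch (under the same setup as the question) showing how the tutorial's two variables can be recovered from the output object:
# Minimal sketch: recover the tutorial's two variables from the output object.
outputs = bert_model(
    encoding_sample['input_ids'],
    encoding_sample['attention_mask']
)
last_hidden_state = outputs.last_hidden_state  # same as outputs[0]
pooled_output = outputs.pooler_output          # same as outputs[1]
print(last_hidden_state.shape, pooled_output.shape)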

Django Haystack with elasticsearch returning empty queryset while data exists

I am doing a project in Python with Django REST framework. I am using haystack's SearchQuerySet. My code is here:
from haystack import indexes
from Medications.models import Salt

class Salt_Index(indexes.SearchIndex, indexes.Indexable):
    text = indexes.CharField(document=True, use_template=True)
    name = indexes.CharField(model_attr='name', null=True)
    slug = indexes.CharField(model_attr='slug', null=True)
    if_i_forget = indexes.CharField(model_attr='if_i_forget', null=True)
    other_information = indexes.CharField(model_attr='other_information', null=True)
    precautions = indexes.CharField(model_attr='precautions', null=True)
    special_dietary = indexes.CharField(model_attr='special_dietary', null=True)
    brand = indexes.CharField(model_attr='brand', null=True)
    why = indexes.CharField(model_attr='why', null=True)
    storage_conditions = indexes.CharField(model_attr='storage_conditions', null=True)
    side_effects = indexes.CharField(model_attr='side_effects', null=True)

    def get_model(self):
        return Salt

    def index_queryset(self, using=None):
        return self.get_model().objects.all()
and my views.py file is -
from django.http import HttpResponse
from django.views.generic import View
from haystack.query import SearchQuerySet
from django.core import serializers

class Medication_Search_View(View):
    def get(self, request, format=None):
        try:
            get_data = SearchQuerySet().all()
            print get_data
            serialized = serializers.serialize("json", [data.object for data in get_data])
            return HttpResponse(serialized)
        except Exception, e:
            print e
My python manage.py rebuild_index works fine (it shows 'Indexing 2959 salts'), but in my views.py file SearchQuerySet() is returning an empty queryset...
I am very worried about this. Please help me if you know why I am getting an empty queryset while there is data in my Salt model.
You should check the app name; it is case sensitive. Try writing the app name in lowercase letters.
My problem is solved now. The problem was that I had written the app name with capital letters while the database tables were created in lowercase (myapp_Student), so it was causing problems on the database lookup.
