scrapy InitSpider: set Rules in __init__?

I am building a recursive web spider with an optional login. I want to make most settings dynamic via a JSON config file.
In my __init__ function, I read this file and try to populate all variables; however, this does not work with Rules.
class CrawlpySpider(InitSpider):

    ...

    #----------------------------------------------------------------------
    def __init__(self, *args, **kwargs):
        """Constructor: overwrite parent __init__ function"""

        # Call parent init
        super(CrawlpySpider, self).__init__(*args, **kwargs)

        # Get command line arg provided configuration param
        config_file = kwargs.get('config')

        # Validate configuration file parameter
        if not config_file:
            logging.error('Missing argument "-a config"')
            logging.error('Usage: scrapy crawl crawlpy -a config=/path/to/config.json')
            self.abort = True

        # Check if it is actually a file
        elif not os.path.isfile(config_file):
            logging.error('Specified config file does not exist')
            logging.error('Not found in: "' + config_file + '"')
            self.abort = True

        # All good, read config
        else:
            # Load json config
            fpointer = open(config_file)
            data = fpointer.read()
            fpointer.close()

            # convert JSON to dict
            config = json.loads(data)

            # config['rules'] is simply a string array which looks like this:
            # config['rules'] = [
            #     'password',
            #     'reset',
            #     'delete',
            #     'disable',
            #     'drop',
            #     'logout',
            # ]

            CrawlpySpider.rules = (
                Rule(
                    LinkExtractor(
                        allow_domains=(self.allowed_domains),
                        unique=True,
                        deny=tuple(config['rules'])
                    ),
                    callback='parse',
                    follow=False
                ),
            )
Scrapy still crawls the pages that are present in config['rules'] and therefore also hits the logout page. So the specified pages are not being denied. What am I missing here?
Update:
I have already tried setting CrawlpySpider.rules = ... as well as self.rules = ... inside __init__. Neither variant works.
Spider: InitSpider
Rules: LinkExtractor
Before crawl: doing a login prior to crawling
I even tried to deny those URLs in my parse function:
# Dive deeper?
# The nesting depth is now handled via a custom middleware (middlewares.py)
#if curr_depth < self.max_depth or self.max_depth == 0:
links = LinkExtractor().extract_links(response)
for link in links:
    for ignore in self.ignores:
        if (ignore not in link.url) and (ignore.lower() not in link.url.lower()) and link.url.find(ignore) == -1:
            yield Request(link.url, meta={'depth': curr_depth+1, 'referer': response.url})

You are setting a class attribute where you want to set an instance attribute:
# this:
CrawlpySpider.rules = (
# should be this:
self.rules = (
<...>
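As a hedged sketch of that fix (note that rule-based crawling comes from CrawlSpider, which compiles self.rules into self._rules during its own __init__, so the instance attribute must already be set when the parent constructor runs):

def __init__(self, *args, **kwargs):
    config_file = kwargs.get('config')
    with open(config_file) as fpointer:
        config = json.load(fpointer)

    # Instance attribute, set *before* calling the parent constructor,
    # so that rule compilation picks it up.
    self.rules = (
        Rule(
            LinkExtractor(
                allow_domains=self.allowed_domains,
                unique=True,
                deny=tuple(config['rules'])
            ),
            callback='parse',
            follow=False
        ),
    )

    super(CrawlpySpider, self).__init__(*args, **kwargs)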

Related

Sending x photos after I click a button in a bot (Telethon, Telegram)

@bot.on(events.CallbackQuery)
async def handler(event):
    global fotomandate
    global i
    if event.data == b"1":
        await event.respond("how many photos do i send?")
        numerofoto = int(input("how many photos do i send?"))  ## ignore this line, i'll fix it later
        print(numerofoto)
        while i < numerofoto:
            path = r"C:\Users\x\Desktop\Nuova cartella (2)"
            fotorandom = random.choice([
                x for x in os.listdir(r"C:\Users\x\Desktop\Nuova cartella (2)")
                if os.path.isfile(os.path.join(path, x))
            ])
            i += 1
            await event.reply(file=fotorandom)
I need to send n (given as input on Telegram) random photos from a directory, but it says:
ValueError: Failed to convert bonni media-jpg to media. Not an existing file, an HTTP URL or a valid bot-API-like file ID
As the error states:
ValueError: Failed to convert bonni media-jpg to media. Not an existing file, an HTTP URL or a valid bot-API-like file ID
Inside the list comprehension's filter you do os.path.join(path, x), but the value you keep is just x, which does not contain the full path. The file is then searched for in the working directory, where it is not found. You need to build the full path in the result too:
fotorandom = random.choice([
    os.path.join(path, x)         # <- new
    for x in os.listdir(path)     # <- better to avoid repeating the dir
    if os.path.isfile(os.path.join(path, x))
])
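Put together, a minimal sketch of the corrected handler (bot, events, and the path come from the question; the console input() the author marked for later fixing is left as-is, and the global counter is replaced by a plain for loop):

import os
import random

@bot.on(events.CallbackQuery)
async def handler(event):
    if event.data == b"1":
        await event.respond("how many photos do i send?")
        numerofoto = int(input("how many photos do i send?"))  # to be replaced later, per the author
        path = r"C:\Users\x\Desktop\Nuova cartella (2)"
        for _ in range(numerofoto):
            # Pick a random file, keeping the full path this time
            fotorandom = random.choice([
                os.path.join(path, x)
                for x in os.listdir(path)
                if os.path.isfile(os.path.join(path, x))
            ])
            await event.reply(file=fotorandom)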

H2O Driverless AI: not able to load custom recipe

I am using H2O DAI 1.9.0.6. I am trying to load a custom recipe (a BERT pretrained model, via a custom recipe) in the Expert settings. I am uploading from a local file. However, the upload is not happening: no error, no progress, nothing. Afterwards I am not able to see this model under the RECIPE tab.
I took the sample recipe from the URL below and modified it for my needs. Thanks to the person who created this recipe.
https://github.com/h2oai/driverlessai-recipes/blob/master/models/nlp/portuguese_bert.py
Custom Recipe
import os
import shutil
from urllib.parse import urlparse

import requests

from h2oaicore.models import TextBERTModel, CustomModel
from h2oaicore.systemutils import make_experiment_logger, temporary_files_path, atomic_move, loggerinfo


def is_url(url):
    try:
        result = urlparse(url)
        return all([result.scheme, result.netloc, result.path])
    except:
        return False


def maybe_download_language_model(logger,
                                  save_directory,
                                  model_link,
                                  config_link,
                                  vocab_link):
    model_name = "pytorch_model.bin"
    if isinstance(model_link, str):
        model_name = model_link.split('/')[-1]
        if '.bin' not in model_name:
            model_name = "pytorch_model.bin"

    maybe_download(url=config_link,
                   dest=os.path.join(save_directory, "config.json"),
                   logger=logger)
    maybe_download(url=vocab_link,
                   dest=os.path.join(save_directory, "vocab.txt"),
                   logger=logger)
    maybe_download(url=model_link,
                   dest=os.path.join(save_directory, model_name),
                   logger=logger)


def maybe_download(url, dest, logger=None):
    if not is_url(url):
        loggerinfo(logger, f"{url} is not a valid URL.")
        return

    dest_tmp = dest + ".tmp"
    if os.path.exists(dest):
        loggerinfo(logger, f"already downloaded {url} -> {dest}")
        return
    if os.path.exists(dest_tmp):
        loggerinfo(logger, f"Download has already started {url} -> {dest_tmp}. "
                           f"Delete {dest_tmp} to download the file once more.")
        return

    loggerinfo(logger, f"Downloading {url} -> {dest}")
    url_data = requests.get(url, stream=True)
    if url_data.status_code != requests.codes.ok:
        msg = "Cannot get url %s, code: %s, reason: %s" % (
            str(url), str(url_data.status_code), str(url_data.reason))
        raise requests.exceptions.RequestException(msg)
    url_data.raw.decode_content = True
    if not os.path.isdir(os.path.dirname(dest)):
        os.makedirs(os.path.dirname(dest), exist_ok=True)
    with open(dest_tmp, 'wb') as f:
        shutil.copyfileobj(url_data.raw, f)
    atomic_move(dest_tmp, dest)


def check_correct_name(custom_name):
    allowed_pretrained_models = ['bert', 'openai-gpt', 'gpt2', 'transfo-xl', 'xlnet', 'xlm-roberta',
                                 'xlm', 'roberta', 'distilbert', 'camembert', 'ctrl', 'albert']
    assert len([model_name for model_name in allowed_pretrained_models
                if model_name in custom_name]), f"{custom_name} needs to contain the name" \
                                                " of the pretrained model architecture (e.g. bert or xlnet) " \
                                                "to be able to process the model correctly."


class CustomBertModel(TextBERTModel, CustomModel):
    """
    Custom model class for using pretrained transformer models.
    The class inherits:
    - CustomModel, which really is just a tag. It's there to make sure DAI knows it's a custom model.
    - TextBERTModel, so that the custom model inherits all the properties and methods.
    Supported model architectures:
    'bert', 'openai-gpt', 'gpt2', 'transfo-xl', 'xlnet', 'xlm-roberta',
    'xlm', 'roberta', 'distilbert', 'camembert', 'ctrl', 'albert'

    How to use:
    - You have already downloaded the weights, the vocab and the config file:
        - Set _model_path to the folder where the weights, the vocab and the config file are stored.
        - Set _model_name according to the pretrained architecture (e.g. bert-base-uncased).
    - You want to download the weights, the vocab and the config file:
        - Set _model_link, _config_link and _vocab_link accordingly.
        - _model_path is the folder where the weights, the vocab and the config file will be saved.
        - Set _model_name according to the pretrained architecture (e.g. bert-base-uncased).
    - Important:
        _model_path needs to contain the name of the pretrained model architecture (e.g. bert or xlnet)
        to be able to load the model correctly.
    - Disable the genetic algorithm in the expert settings.
    """

    # _model_path is the full path to the directory where the weights, vocab and config will be saved.
    _model_name = NotImplemented  # Will be used to create the MOJO
    _model_path = NotImplemented

    _model_link = NotImplemented
    _config_link = NotImplemented
    _vocab_link = NotImplemented

    _booster_str = "pytorch-custom"

    # Requirements for MOJO creation:
    # _model_name needs to be one of
    # bert-base-uncased, bert-base-multilingual-cased, xlnet-base-cased, roberta-base, distilbert-base-uncased
    # vocab.txt needs to be the same as vocab.txt used in _model_name (no custom vocabulary yet).
    _mojo = False

    @staticmethod
    def is_enabled():
        return False  # Abstract base model should not show up in models.

    def _set_model_name(self, language_detected):
        self.model_path = self.__class__._model_path
        self.model_name = self.__class__._model_name
        check_correct_name(self.model_path)
        check_correct_name(self.model_name)

    def fit(self, X, y, sample_weight=None, eval_set=None, sample_weight_eval_set=None, **kwargs):
        logger = None
        if self.context and self.context.experiment_id:
            logger = make_experiment_logger(experiment_id=self.context.experiment_id, tmp_dir=self.context.tmp_dir,
                                            experiment_tmp_dir=self.context.experiment_tmp_dir)

        maybe_download_language_model(logger,
                                      save_directory=self.__class__._model_path,
                                      model_link=self.__class__._model_link,
                                      config_link=self.__class__._config_link,
                                      vocab_link=self.__class__._vocab_link)
        super().fit(X, y, sample_weight, eval_set, sample_weight_eval_set, **kwargs)


class GermanBertModel(CustomBertModel):
    _model_name = "bert-base-german-dbmdz-uncased"
    _model_path = os.path.join(temporary_files_path, "german_bert_language_model/")

    _model_link = "https://huggingface.co/bert-base-german-dbmdz-uncased/resolve/main/pytorch_model.bin"
    _config_link = "https://huggingface.co/bert-base-german-dbmdz-uncased/resolve/main/config.json"
    _vocab_link = "https://huggingface.co/bert-base-german-dbmdz-uncased/resolve/main/vocab.txt"

    _mojo = True

    @staticmethod
    def is_enabled():
        return True
Check that your custom recipe has is_enabled() returning True.
def is_enabled():
    return True

How to download JIRA attachment files with Python

I want to download the attachment files of an issue in JIRA using Python.
Use the jira Python library; you can install it with pip install jira.
# -*- coding: UTF-8 -*-
from jira import JIRA

url = 'https://jira.1234.com'
jira = JIRA(server=url, basic_auth=('admin', 'password'))

attachment = jira.attachment(12345)  # 12345 is the attachment key
image = attachment.get()
with open("Image.png", 'wb') as f:
    f.write(image)
JIRA exposes its REST services, and through those plus some Python you can download any attachment.
It worked for me like this (you'll need to adjust the variables):
#!/usr/bin/python
# miguel ortiz
# Requests module: http://docs.python-requests.org/en/latest/
# Documentation: <url>

#----------------------------------------------------------------Modules
import sys
import csv, json
import requests

#----------------------------------------------------------------Variables
myTicket = sys.argv[1]  # Your ticket: ABC-123
user = 'miguel'         # JIRA user
pasw = 'password'       # JIRA password
jiraURL = 'https://yourinstance.jira.com/rest/api/latest/issue/'
fileName = 'my_attached_file'  # In this case we'll be looking for a specific file in the attachments
attachment_final_url = ""      # To validate whether there are attachments


def main():
    print '\n\n [ You are checking ticket: ' + myTicket + ' ]\n'

    # Request JSON from the JIRA API
    r = requests.get(jiraURL + myTicket, auth=(user, pasw), timeout=5)

    # Status of the request
    rstatus = r.status_code

    # If the status isn't 200 we leave
    if not rstatus == 200:
        print 'Error accessing JIRA:' + str(rstatus)
        exit()
    else:
        data = r.json()
        if not data['fields']['attachment']:
            status_attachment = 'ERROR: Nothing attached, attach a file named: ' + fileName
            attachment_final_url = ""
        else:
            # Use a single status variable so the final print below
            # always has a value, whichever branch was taken.
            for i in data['fields']['attachment']:
                if i['filename'] == fileName:
                    attachment_final_url = i['content']
                    status_attachment = 'OK: The desired attachment exists: ' + fileName
                    break
                else:
                    status_attachment = 'ERROR: None of the files has the desired name'
                    attachment_final_url = ""
                    continue

    if attachment_final_url != "":
        r = requests.get(attachment_final_url, auth=(user, pasw), stream=True)
        # The attachment may be binary, so write the raw bytes as-is
        # (re-encoding via iso-8859-1/utf-8, as the original did, corrupts binary files).
        with open(fileName, "wb") as f:
            f.write(r.content)
    else:
        print status_attachment


if __name__ == "__main__":
    main()
If you do not understand the code, I've explained it in more detail on my blog.
EDIT: Be careful, in JIRA you can add many files with the same name.
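For completeness, a minimal sketch that fetches every attachment of an issue with the jira library from the first answer (the server URL, credentials and issue key are placeholders):

from jira import JIRA

jira = JIRA(server='https://jira.example.com', basic_auth=('user', 'password'))
issue = jira.issue('ABC-123')

# issue.fields.attachment lists the attachment resources;
# attachment.get() returns the raw bytes of each file.
for attachment in issue.fields.attachment:
    with open(attachment.filename, 'wb') as f:
        f.write(attachment.get())

Note the same caveat as above: JIRA allows several attachments with the same name, so identical filenames would overwrite each other here.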

OpenStack SDK - How to create image with Kernel id and Ramdisk parameters?

I've been trying to create an OpenStack image, specifying the kernel id and ramdisk id, using the OpenStack Unified SDK (https://github.com/openstack/python-openstacksdk), but without success. I know this is possible, because the OpenStack CLI has these parameters, as shown on this page (http://docs.openstack.org/cli-reference/glance.html#glance-image-create), where the CLI has the "--kernel-id" and "--ramdisk-id" parameters. I've used these parameters in the terminal and confirmed they work, but I need to use them in Python.
I'm trying to use the upload_image method, as described here: http://developer.openstack.org/sdks/python/openstacksdk/users/proxies/image.html#image-api-v2, but I can't get the attrs parameter right. The documentation only says it is supposed to be a dictionary. Here is the code I'm using:
...
atrib = {
    'properties': {
        'kernel_id': 'd84e1f2b-8d8c-4a4a-8858-77a8d5a93cb1',
        'ramdisk_id': 'cfef18e0-006e-477a-a098-593d43435a1e'
    }
}

with open(file) as fimage:
    image = image_service.upload_image(
        name=name,
        data=fimage,
        disk_format='qcow2',
        container_format='bare',
        **atrib)
...
And here is the error I'm getting:
File "builder.py", line 121, in main
**atrib
File "/usr/lib/python2.7/site-packages/openstack/image/v2/_proxy.py", line 51, in upload_image
**attrs)
File "/usr/lib/python2.7/site-packages/openstack/proxy2.py", line 193, in _create
return res.create(self.session)
File "/usr/lib/python2.7/site-packages/openstack/resource2.py", line 570, in create
json=request.body, headers=request.headers)
File "/usr/lib/python2.7/site-packages/keystoneauth1/session.py", line 675, in post
return self.request(url, 'POST', **kwargs)
File "/usr/lib/python2.7/site-packages/openstack/session.py", line 52, in map_exceptions_wrapper
http_status=e.http_status, cause=e)
openstack.exceptions.HttpException: HttpException: Bad Request, 400 Bad Request
Provided object does not match schema 'image': {u'kernel_id': u'd84e1f2b-8d8c-4a4a-8858-77a8d5a93cb1', u'ramdisk_id': u'cfef18e0-006e-477a-a098-593d43435a1e'} is not of type 'string' Failed validating 'type' in schema['additionalProperties']: {'type': 'string'} On instance[u'properties']: {u'kernel_id': u'd84e1f2b-8d8c-4a4a-8858-77a8d5a93cb1', u'ramdisk_id': u'cfef18e0-006e-477a-a098-593d43435a1e'}
I already tried the update_image method, but without success. Passing kernel id and ramdisk id as strings creates the instance, but it does not boot.
Does anyone know how to solve this?
Which version of the Glance API do you use?
I have read the code in openstackclient/image/v1/images.py and openstackclient/v1/shell.py:
## in shell.py
def do_image_create(gc, args):
    ...
    fields = dict(filter(lambda x: x[1] is not None, vars(args).items()))
    raw_properties = fields.pop('property')
    fields['properties'] = {}
    for datum in raw_properties:
        key, value = datum.split('=', 1)
        fields['properties'][key] = value
    ...
    image = gc.images.create(**fields)

## in images.py
def create(self, **kwargs):
    ...
    for field in kwargs:
        if field in CREATE_PARAMS:
            fields[field] = kwargs[field]
        elif field == 'return_req_id':
            continue
        else:
            msg = 'create() got an unexpected keyword argument \'%s\''
            raise TypeError(msg % field)
    hdrs = self._image_meta_to_headers(fields)
    ...
    resp, body = self.client.post('/v1/images',
                                  headers=hdrs,
                                  data=image_data)
    ...
and in openstackclient/v2/shell.py and openstackclient/image/v2/images.py (and I have debugged this too):
## in shell.py
def do_image_create(gc, args):
    ...
    raw_properties = fields.pop('property', [])
    for datum in raw_properties:
        key, value = datum.split('=', 1)
        fields[key] = value
    ...
    image = gc.images.create(**fields)

## in images.py
def create(self, **kwargs):
    """Create an image."""
    url = '/v2/images'
    image = self.model()
    for (key, value) in kwargs.items():
        try:
            setattr(image, key, value)
        except warlock.InvalidOperation as e:
            raise TypeError(utils.exception_to_str(e))
    resp, body = self.http_client.post(url, data=image)
    ...
It seems that you can create an image your way in version 1.0, but in version 2.0 you should pass kernel_id and ramdisk_id as top-level fields, like below:
atrib = {
    'kernel_id': 'd84e1f2b-8d8c-4a4a-8858-77a8d5a93cb1',
    'ramdisk_id': 'cfef18e0-006e-477a-a098-593d43435a1e'
}
However, the OpenStack SDK apparently cannot pass those two arguments through to the request (there is no body definition for them in openstack/image/v2/image.py), so you would have to modify the OpenStack SDK to support this.
BTW, the OpenStack code differs a little between versions, but many things stay the same.
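For illustration, a sketch of the same create call through python-glanceclient's v2 API, which does accept these as top-level fields (the endpoint, token, name and file are placeholders, and session setup is elided):

from glanceclient import Client

glance = Client('2', endpoint=GLANCE_ENDPOINT, token=AUTH_TOKEN)  # placeholders
image = glance.images.create(
    name='my-image',  # hypothetical name
    disk_format='qcow2',
    container_format='bare',
    kernel_id='d84e1f2b-8d8c-4a4a-8858-77a8d5a93cb1',
    ramdisk_id='cfef18e0-006e-477a-a098-593d43435a1e')
with open('my-image.qcow2', 'rb') as fimage:  # hypothetical file
    glance.images.upload(image.id, fimage)    # in v2, the image data is uploaded separately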

Create a portal_user_catalog and have it used (Plone)

I'm creating a fork of my Plone site (which has not been forked for a long time). This site has a special catalog object for user profiles (a special Archetypes-based object type) which is called portal_user_catalog:
$ bin/instance debug
>>> portal = app.Plone
>>> print [d for d in portal.objectMap() if d['meta_type'] == 'Plone Catalog Tool']
[{'meta_type': 'Plone Catalog Tool', 'id': 'portal_catalog'},
{'meta_type': 'Plone Catalog Tool', 'id': 'portal_user_catalog'}]
This looks reasonable because the user profiles don't have most of the indexes of the "normal" objects, but have a small set of their own indexes.
Since I found no way to create this object from scratch, I exported it from the old site (as portal_user_catalog.zexp) and imported it into the new site. This seemed to work, but I can't add objects to the imported catalog, not even by explicitly calling the catalog_object method. Instead, the user profiles are added to the standard portal_catalog.
Now I found a module in my product which seems to serve the purpose (Products/myproduct/exportimport/catalog.py):
"""Catalog tool setup handlers.
$Id: catalog.py 77004 2007-06-24 08:57:54Z yuppie $
"""
from Products.GenericSetup.utils import exportObjects
from Products.GenericSetup.utils import importObjects
from Products.CMFCore.utils import getToolByName
from zope.component import queryMultiAdapter
from Products.GenericSetup.interfaces import IBody
def importCatalogTool(context):
"""Import catalog tool.
"""
site = context.getSite()
obj = getToolByName(site, 'portal_user_catalog')
parent_path=''
if obj and not obj():
importer = queryMultiAdapter((obj, context), IBody)
path = '%s%s' % (parent_path, obj.getId().replace(' ', '_'))
__traceback_info__ = path
print [importer]
if importer:
print importer.name
if importer.name:
path = '%s%s' % (parent_path, 'usercatalog')
print path
filename = '%s%s' % (path, importer.suffix)
print filename
body = context.readDataFile(filename)
if body is not None:
importer.filename = filename # for error reporting
importer.body = body
if getattr(obj, 'objectValues', False):
for sub in obj.objectValues():
importObjects(sub, path+'/', context)
def exportCatalogTool(context):
"""Export catalog tool.
"""
site = context.getSite()
obj = getToolByName(site, 'portal_user_catalog', None)
if tool is None:
logger = context.getLogger('catalog')
logger.info('Nothing to export.')
return
parent_path=''
exporter = queryMultiAdapter((obj, context), IBody)
path = '%s%s' % (parent_path, obj.getId().replace(' ', '_'))
if exporter:
if exporter.name:
path = '%s%s' % (parent_path, 'usercatalog')
filename = '%s%s' % (path, exporter.suffix)
body = exporter.body
if body is not None:
context.writeDataFile(filename, body, exporter.mime_type)
if getattr(obj, 'objectValues', False):
for sub in obj.objectValues():
exportObjects(sub, path+'/', context)
I tried to use it, but I have no idea how it is supposed to be done;
I can't call it TTW (should I try to publish the methods?!).
I tried it in a debug session:
$ bin/instance debug
>>> portal = app.Plone
>>> from Products.myproduct.exportimport.catalog import exportCatalogTool
>>> exportCatalogTool(portal)
Traceback (most recent call last):
  File "<console>", line 1, in <module>
  File ".../Products/myproduct/exportimport/catalog.py", line 58, in exportCatalogTool
    site = context.getSite()
AttributeError: getSite
So, if this is the way to go, it looks like I need a "real" context.
Update: To get this context, I tried an External Method:
# -*- coding: utf-8 -*-
from Products.myproduct.exportimport.catalog import exportCatalogTool
from pdb import set_trace


def p(dt, dd):
    print '%-16s%s' % (dt + ':', dd)


def main(self):
    """
    Export the portal_user_catalog
    """
    g = globals()
    print '#' * 79
    for a in ('__package__', '__module__'):
        if a in g:
            p(a, g[a])
    p('self', self)
    set_trace()
    exportCatalogTool(self)
However, when I called it, I got the same <PloneSite at /Plone> object as the argument to the main function, which doesn't have the getSite attribute. Perhaps my site doesn't call such External Methods correctly?
Or would I need to mention this module somehow in my configure.zcml, and if so, how? I searched my directory tree (especially below Products/myproduct/profiles) for exportimport, the module name, and several other strings, but I couldn't find anything; perhaps there was an integration once, but it is broken now ...
So how do I make this portal_user_catalog work?
Thank you!
Update: Another debug session suggests the source of the problem to be some transaction matter:
>>> portal = app.Plone
>>> puc = portal.portal_user_catalog
>>> puc._catalog()
[]
>>> profiles_folder = portal.some_folder_with_profiles
>>> for o in profiles_folder.objectValues():
...     puc.catalog_object(o)
...
>>> puc._catalog()
[<Products.ZCatalog.Catalog.mybrains object at 0x69ff8d8>, ...]
This population of the portal_user_catalog doesn't persist; after termination of the debug session and starting fg, the brains are gone.
It looks like the problem was indeed related to transactions.
I had
import transaction
...

class Browser(BrowserView):
    ...

    def processNewUser(self):
        ...
        transaction.commit()
before, but apparently this was not good enough (and/or perhaps not done correctly).
Now I start the transaction explicitly with transaction.begin(), save intermediate results with transaction.savepoint(), abort the transaction explicitly with transaction.abort() in case of errors (try / except), and have exactly one transaction.commit() at the end, in the case of success. Everything seems to work.
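A minimal sketch of that pattern (the helper that creates the profile object is hypothetical):

import transaction

def processNewUser(self):
    transaction.begin()
    try:
        profile = self._createProfile()  # hypothetical helper
        transaction.savepoint()          # save intermediate results
        self.context.portal_user_catalog.catalog_object(profile)
    except Exception:
        transaction.abort()              # roll back on any error
        raise
    transaction.commit()                 # exactly one commit, on success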
Of course, Plone still doesn't take this non-standard catalog into account; when I "clear and rebuild" it, it is empty afterwards. But for my application it works well enough.
