BS4: AttributeError: 'NoneType' object stops the parser from working - wordpress

I'm currently working on a parser to make a small preview of a page from a URL given by the user in PHP.
I'd like to retrieve only the title of the page and a small chunk of information (a bit of text).
The project: build a list of metadata for popular WordPress plugins, starting with the first 50 URLs, that is, 50 plugins of interest. The challenge: I want to fetch the metadata of all existing plugins. What I then want to filter out after the fetch is the plugins with the newest timestamp, i.e. those that were updated most recently. It is all about being up to date. For example:
https://wordpress.org/plugins/wp-job-manager
https://wordpress.org/plugins/ninja-forms
import requests
from bs4 import BeautifulSoup
from concurrent.futures.thread import ThreadPoolExecutor

url = "https://wordpress.org/plugins/browse/popular/{}"


def main(url, num):
    with requests.Session() as req:
        print(f"Collecting Page# {num}")
        r = req.get(url.format(num))
        soup = BeautifulSoup(r.content, 'html.parser')
        link = [item.get("href")
                for item in soup.findAll("a", rel="bookmark")]
        return set(link)


with ThreadPoolExecutor(max_workers=20) as executor:
    futures = [executor.submit(main, url, num)
               for num in [""]+[f"page/{x}/" for x in range(2, 50)]]

allin = []
for future in futures:
    allin.extend(future.result())


def parser(url):
    with requests.Session() as req:
        print(f"Extracting {url}")
        r = req.get(url)
        soup = BeautifulSoup(r.content, 'html.parser')
        target = [item.get_text(strip=True, separator=" ") for item in soup.find(
            "h3", class_="screen-reader-text").find_next("ul").findAll("li")[:8]]
        head = [soup.find("h1", class_="plugin-title").text]
        new = [x for x in target if x.startswith(
            ("V", "Las", "Ac", "W", "T", "P"))]
        return head + new


with ThreadPoolExecutor(max_workers=50) as executor1:
    futures1 = [executor1.submit(parser, url) for url in allin]

for future in futures1:
    print(future.result())
see the results:
Extracting https://wordpress.org/plugins/tuxedo-big-file-uploads/Extracting https://wordpress.org/plugins/cherry-sidebars/
Extracting https://wordpress.org/plugins/meks-smart-author-widget/
Extracting https://wordpress.org/plugins/wp-limit-login-attempts/
Extracting https://wordpress.org/plugins/automatic-translator-addon-for-loco-translate/
Extracting https://wordpress.org/plugins/event-organiser/
Traceback (most recent call last):
File "/home/martin/unbenannt0.py", line 45, in <module>
print(future.result())
File "/home/martin/anaconda3/lib/python3.7/concurrent/futures/_base.py", line 428, in result
return self.__get_result()
File "/home/martin/anaconda3/lib/python3.7/concurrent/futures/_base.py", line 384, in __get_result
raise self._exception
File "/home/martin/anaconda3/lib/python3.7/concurrent/futures/thread.py", line 57, in run
result = self.fn(*self.args, **self.kwargs)
File "/home/martin/unbenannt0.py", line 34, in parser
"h3", class_="screen-reader-text").find_next("ul").findAll("li")[:8]]
AttributeError: 'NoneType' object has no attribute 'find_next'
So I have a serious error:
AttributeError: 'NoneType' object has no attribute 'find_next'
It looks like soup.find("h3", class_="screen-reader-text") did not find anything on that page and returned None.
We could either break this line up and call find_next only if there was a result, or use a try/except that catches the AttributeError.
At the moment I do not know how to fix the whole thing, only that the offending code can be wrapped like this:
try:
    ...  # code that causes the error
except AttributeError:
    print(f"Attribute error on {url}")  # plus whatever else would be of value
    ...  # whatever action makes sense here
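The first option, checking each lookup before chaining the next call, could look like the sketch below. extract_plugin_meta is a hypothetical replacement for the body of parser; it takes an already parsed soup object, and returning None for pages with a different layout is an assumption about how such pages should be handled:

```python
def extract_plugin_meta(soup, url="<unknown>"):
    """Return [title, meta...] for one plugin page, or None if the layout differs."""
    # find() returns None when nothing matches, so check before chaining.
    heading = soup.find("h3", class_="screen-reader-text")
    title = soup.find("h1", class_="plugin-title")
    if heading is None or title is None:
        print(f"Skipping {url}: expected heading not found")
        return None

    meta_list = heading.find_next("ul")
    if meta_list is None:
        print(f"Skipping {url}: no <ul> after the heading")
        return None

    items = [li.get_text(strip=True, separator=" ")
             for li in meta_list.find_all("li")[:8]]
    # Same prefix filter as in the original parser().
    wanted = ("V", "Las", "Ac", "W", "T", "P")
    return [title.text] + [x for x in items if x.startswith(wanted)]
```

Inside parser it would be called as extract_plugin_meta(BeautifulSoup(r.content, 'html.parser'), url), and the final loop would skip results that are None.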
By the way, besides this error, I would like to add an option that reports the complete and unaltered error traceback for each failure, since it contains valuable call stack information. This is the full output of another run:
Extracting https://wordpress.org/plugins/automatic-translator-addon-for-loco-translate/
Extracting https://wordpress.org/plugins/wpforo/Extracting https://wordpress.org/plugins/accesspress-social-share/
Extracting https://wordpress.org/plugins/mailoptin/
Extracting https://wordpress.org/plugins/tuxedo-big-file-uploads/
Extracting https://wordpress.org/plugins/post-snippets/
Extracting https://wordpress.org/plugins/woocommerce-payfast-gateway/Extracting https://wordpress.org/plugins/woocommerce-grid-list-toggle/
Extracting https://wordpress.org/plugins/goodbye-captcha/
Extracting https://wordpress.org/plugins/gravity-forms-google-analytics-event-tracking/
Traceback (most recent call last):
File "/home/martin/dev/wordpress_plugin.py", line 44, in <module>
print(future.result())
File "/home/martin/anaconda3/lib/python3.7/concurrent/futures/_base.py", line 428, in result
return self.__get_result()
File "/home/martin/anaconda3/lib/python3.7/concurrent/futures/_base.py", line 384, in __get_result
raise self._exception
File "/home/martin/anaconda3/lib/python3.7/concurrent/futures/thread.py", line 57, in run
result = self.fn(*self.args, **self.kwargs)
File "/home/martin/dev/wordpress_plugin.py", line 33, in parser
"h3", class_="screen-reader-text").find_next("ul").findAll("li")[:8]]
AttributeError: 'NoneType' object has no attribute 'find_next'
hope that this was not too long and complex - thank you for the help!
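One way to keep the complete traceback for every failing URL without stopping the whole run is to catch the exception at the point where future.result() is called. A minimal sketch using only the standard library; run_all is a hypothetical helper name:

```python
import traceback
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_all(func, items, max_workers=8):
    """Run func over items in threads; return (results, failures)."""
    results, failures = [], {}
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        future_to_item = {executor.submit(func, item): item for item in items}
        for future in as_completed(future_to_item):
            item = future_to_item[future]
            try:
                results.append(future.result())
            except Exception:
                # format_exc() returns the full traceback text of the
                # exception currently being handled, worker frames included.
                failures[item] = traceback.format_exc()
    return results, failures
```

With the code above this would be results, failures = run_all(parser, allin, max_workers=50), after which failures maps each failing URL to its traceback text.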

Related

How to write this HTTP request in python?

I need to write this HTTP request.
I was able to get the token (from another HTTP request) and I saved it in a variable named "Token".
Can you please help me write this HTTP request in python?
I am getting 404 and I am pretty sure the syntax is wrong.
Attaching a screenshot from Microsoft documentation.
the new error message:
Traceback (most recent call last):
File "/Users/talcohen/opt/anaconda3/lib/python3.9/site-packages/requests/models.py", line 972, in json
return complexjson.loads(self.text, **kwargs)
File "/Users/talcohen/opt/anaconda3/lib/python3.9/json/__init__.py", line 346, in loads
return _default_decoder.decode(s)
File "/Users/talcohen/opt/anaconda3/lib/python3.9/json/decoder.py", line 337, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/Users/talcohen/opt/anaconda3/lib/python3.9/json/decoder.py", line 355, in raw_decode
raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Users/talcohen/Documents/ddd/PowerBI_REST_API/CreateProfile.py", line 15, in <module>
data = requests.post(api_url, json=payload, headers=headers).json()
File "/Users/talcohen/opt/anaconda3/lib/python3.9/site-packages/requests/models.py", line 976, in json
raise RequestsJSONDecodeError(e.msg, e.doc, e.pos)
requests.exceptions.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
Looking at the API you should probably do:
import requests
token = "<YOUR TOKEN>"
api_url = "https://api.powerbi.com/v1.0/myorg/profiles"
headers = {"Authorization": f"Bearer {token}"}
payload = {"displayName": "My First Profile"}
data = requests.post(api_url, json=payload, headers=headers).json()
print(data)
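The JSONDecodeError in the traceback means the response body was empty or not JSON, so it can help to look at the status code before decoding. A minimal sketch; parse_api_response is a hypothetical helper, and it relies only on the status_code, text and json() members that a requests response provides:

```python
def parse_api_response(response):
    """Return the decoded JSON body, or raise RuntimeError with details."""
    preview = response.text[:200]
    if response.status_code >= 400:
        # Error responses (such as the 404 mentioned above) often carry no JSON.
        raise RuntimeError(f"HTTP {response.status_code}: {preview!r}")
    if not response.text.strip():
        return None  # success with an empty body: nothing to decode
    try:
        return response.json()
    except ValueError as exc:
        # JSON decode errors are ValueError subclasses.
        raise RuntimeError(
            f"HTTP {response.status_code} but body is not JSON: {preview!r}"
        ) from exc
```

It would be used as data = parse_api_response(requests.post(api_url, json=payload, headers=headers)).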

execution_date jinja resolving as a string

I have an airflow dag that uses the following jinja template: "{{ execution_date.astimezone('Etc/GMT+6').subtract(days=1).strftime('%Y-%m-%dT00:00:00') }}"
This template works in other dags, and it works when the schedule_interval for the dag is set to timedelta(hours=1). However, when we set the schedule interval to 0 8 * * *, it throws the following traceback at runtime:
Traceback (most recent call last):
File "/usr/lib/python2.7/site-packages/airflow/models/__init__.py", line 1426, in _run_raw_task
self.render_templates()
File "/usr/lib/python2.7/site-packages/airflow/models/__init__.py", line 1790, in render_templates
rendered_content = rt(attr, content, jinja_context)
File "/usr/lib/python2.7/site-packages/airflow/models/__init__.py", line 2538, in render_template
return self.render_template_from_field(attr, content, context, jinja_env)
File "/usr/lib/python2.7/site-packages/airflow/models/__init__.py", line 2520, in render_template_from_field
for k, v in list(content.items())}
File "/usr/lib/python2.7/site-packages/airflow/models/__init__.py", line 2520, in <dictcomp>
for k, v in list(content.items())}
File "/usr/lib/python2.7/site-packages/airflow/models/__init__.py", line 2538, in render_template
return self.render_template_from_field(attr, content, context, jinja_env)
File "/usr/lib/python2.7/site-packages/airflow/models/__init__.py", line 2514, in render_template_from_field
result = jinja_env.from_string(content).render(**context)
File "/usr/lib64/python2.7/site-packages/jinja2/environment.py", line 1008, in render
return self.environment.handle_exception(exc_info, True)
File "/usr/lib64/python2.7/site-packages/jinja2/environment.py", line 780, in handle_exception
reraise(exc_type, exc_value, tb)
File "<template>", line 1, in top-level template code
TypeError: astimezone() argument 1 must be datetime.tzinfo, not str
It appears the execution date being passed in is a string, not a datetime object; but I am only able to hit this error on this specific dag, and no others. I've tried deleting the dag entirely and recreating it with no luck.
The astimezone(..) call is what fails: it expects a datetime.tzinfo, while you are passing it a str argument ('Etc/GMT+6').
TypeError: astimezone() argument 1 must be datetime.tzinfo, not str
While I couldn't make the exact expression work, I believe the following achieves pretty much the same effect as what you are trying to do:
{{ execution_date.in_timezone("US/Eastern") - timedelta(days=1) }}
Recall that
execution_date macro is a Pendulum object
in_timezone(..) converts it into a datetime.datetime(..)
then we just subtract a datetime.timedelta(days=1) from it
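The type requirement can be reproduced outside Airflow with the standard library alone. A minimal sketch, assuming a fixed UTC-6 offset as a stand-in for 'Etc/GMT+6'; the helper name previous_day_midnight and the sample date are made up for illustration:

```python
from datetime import datetime, timedelta, timezone


def previous_day_midnight(execution_date, tz):
    """Shift to tz, go back one day, and format with a zeroed time part."""
    local = execution_date.astimezone(tz)  # tz must be a tzinfo, not a str
    return (local - timedelta(days=1)).strftime("%Y-%m-%dT00:00:00")


# Fixed offset of six hours behind UTC (assumed equivalent of 'Etc/GMT+6').
utc_minus_6 = timezone(timedelta(hours=-6))
run = datetime(2019, 5, 2, 8, 0, tzinfo=timezone.utc)

print(previous_day_midnight(run, utc_minus_6))  # 2019-05-01T00:00:00

try:
    run.astimezone("Etc/GMT+6")  # a plain string is rejected
except TypeError as exc:
    print(f"TypeError: {exc}")
```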

html5lib: TypeError: __init__() got an unexpected keyword argument 'encoding'

I'm trying to install html5lib. At first I tried to install the latest version (8 or 9 nines), but it conflicted with my BeautifulSoup, so I decided to try an older version (0.9999999, seven nines). I installed it, but when I try to use it:
>>> with urlopen("http://example.com/") as f:
        document = html5lib.parse(f, encoding=f.info().get_content_charset())
I get an error:
Traceback (most recent call last):
File "<pyshell#11>", line 2, in <module>
document = html5lib.parse(f, encoding=f.info().get_content_charset())
File "C:\Python\Python35-32\lib\site-packages\html5lib\html5parser.py", line 35, in parse
return p.parse(doc, **kwargs)
File "C:\Python\Python35-32\lib\site-packages\html5lib\html5parser.py", line 235, in parse
self._parse(stream, False, None, *args, **kwargs)
File "C:\Python\Python35-32\lib\site-packages\html5lib\html5parser.py", line 85, in _parse
self.tokenizer = _tokenizer.HTMLTokenizer(stream, parser=self, **kwargs)
File "C:\Python\Python35-32\lib\site-packages\html5lib\_tokenizer.py", line 36, in __init__
self.stream = HTMLInputStream(stream, **kwargs)
File "C:\Python\Python35-32\lib\site-packages\html5lib\_inputstream.py", line 151, in HTMLInputStream
return HTMLBinaryInputStream(source, **kwargs)
TypeError: __init__() got an unexpected keyword argument 'encoding'
What is wrong and what should I do?
I see something was broken in the latest versions of html5lib with regard to bs4: html5lib.treebuilders._base is no longer there. Using bs4 4.4.1, the latest compatible version seems to be the one with 7 nines; once you install it as below, it works fine:
pip3 install -U html5lib=="0.9999999"
Tested using bs4 4.4.1:
In [1]: import bs4
In [2]: bs4.__version__
Out[2]: '4.4.1'
In [3]: import html5lib
In [4]: html5lib.__version__
Out[4]: '0.9999999'
In [5]: from urllib.request import urlopen
In [6]: with urlopen("http://example.com/") as f:
   ...:     document = html5lib.parse(f, encoding=f.info().get_content_charset())
   ...:
In [7]:
You can see where the name was changed in the commit "Rename treebuilders._base to .base to reflect public status".
The error you see is because you are still using the newest version; in html5lib/_inputstream.py, HTMLBinaryInputStream has no encoding argument:
class HTMLBinaryInputStream(HTMLUnicodeInputStream):
    """Provides a unicode stream of characters to the HTMLTokenizer.
    This class takes care of character encoding and removing or replacing
    incorrect byte-sequences and also provides column and line tracking.
    """
    def __init__(self, source, override_encoding=None, transport_encoding=None,
                 same_origin_parent_encoding=None, likely_encoding=None,
                 default_encoding="windows-1252", useChardet=True):
Setting override_encoding=f.info().get_content_charset() should do the trick.
Also upgrading to the latest version of bs4 works fine with the latest version of html5lib:
In [16]: bs4.__version__
Out[16]: '4.5.1'
In [17]: html5lib.__version__
Out[17]: '0.999999999'
In [18]: with urlopen("http://example.com/") as f:
   ....:     document = html5lib.parse(f, override_encoding=f.info().get_content_charset())
   ....:
In [19]:
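If the same script has to run against both html5lib generations, one option is to try the newer keyword first and fall back to the older one. A minimal sketch; parse_with_charset is a hypothetical helper, and it assumes the unsupported keyword is rejected with a TypeError before any input is read, as in the traceback in the question:

```python
def parse_with_charset(parse, stream, charset):
    """Call parse(stream, ...) with whichever encoding keyword it accepts."""
    try:
        # Keyword accepted by the newer input stream shown above.
        return parse(stream, override_encoding=charset)
    except TypeError:
        # Older releases (seven nines) use `encoding` instead.
        return parse(stream, encoding=charset)
```

With html5lib installed this would be called as parse_with_charset(html5lib.parse, f, f.info().get_content_charset()). Note that a TypeError raised for any other reason would also trigger the fallback, so this is only suitable as a compatibility shim.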

pexpect python throw error

Although this is my first attempt at using pexpect, the python3 script using pexpect is pretty simple; yet it fails.
#!/usr/bin/env python3
import sys
import pexpect
SSH_NEWKEY = r'Are you sure you want to continue connecting \(yes/no\)\?'
child = pexpect.spawn("ssh -i /user/aws/key.pem ec2-user@xxx.xxx.xxx.xxx date")
i = child.expect( [ pexpect.TIMEOUT, SSH_NEWKEY ] )
if i == 1:
    child.sendline('yes')
print(child.before)
The SSH_NEWKEY is the only response I'm expecting, but the example showed a list containing pexpect.TIMEOUT in it so I used it.
$ ./test.py
Traceback (most recent call last):
File "/usr/local/lib/python3.4/site-packages/pexpect/spawnbase.py", line 144, in read_nonblocking
s = os.read(self.child_fd, size)
OSError: [Errno 5] Input/output error
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.4/site-packages/pexpect/expect.py", line 97, in expect_loop
incoming = spawn.read_nonblocking(spawn.maxread, timeout)
File "/usr/local/lib/python3.4/site-packages/pexpect/pty_spawn.py", line 455, in read_nonblocking
return super(spawn, self).read_nonblocking(size)
File "/usr/local/lib/python3.4/site-packages/pexpect/spawnbase.py", line 149, in read_nonblocking
raise EOF('End Of File (EOF). Exception style platform.')
pexpect.exceptions.EOF: End Of File (EOF). Exception style platform.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "./min.py", line 15, in <module>
i = child.expect( [ pexpect.TIMEOUT, SSH_NEWKEY ] )
File "/usr/local/lib/python3.4/site-packages/pexpect/spawnbase.py", line 315, in expect
timeout, searchwindowsize, async)
File "/usr/local/lib/python3.4/site-packages/pexpect/spawnbase.py", line 339, in expect_list
return exp.expect_loop(timeout)
File "/usr/local/lib/python3.4/site-packages/pexpect/expect.py", line 102, in expect_loop
return self.eof(e)
File "/usr/local/lib/python3.4/site-packages/pexpect/expect.py", line 49, in eof
raise EOF(msg)
pexpect.exceptions.EOF: End Of File (EOF). Exception style platform.
<pexpect.pty_spawn.spawn object at 0x7f70ea4fbcf8>
command: /usr/bin/ssh
args: ['/usr/bin/ssh', '-i', '/user/aws/key.pem', 'ec2-user@xxx.xxx.xxx.xxx', 'date']
searcher: None
buffer (last 100 chars): b''
before (last 100 chars): b'Fri May 6 13:50:18 EDT 2016\r\n'
after: <class 'pexpect.exceptions.EOF'>
match: None
match_index: None
exitstatus: 0
flag_eof: True
pid: 31293
child_fd: 5
closed: False
timeout: 30
delimiter: <class 'pexpect.exceptions.EOF'>
logfile: None
logfile_read: None
logfile_send: None
maxread: 2000
ignorecase: False
searchwindowsize: None
delaybeforesend: 0.05
delayafterclose: 0.1
delayafterterminate: 0.1
What am I missing?
CentOS 6.4
python 3.4.3
An EOF error is being raised during your expect call. This means that the response received does not match SSH_NEWKEY, and end of file is reached within the timeout period. To catch this exception, you should change your expect line to read:
i = child.expect( [ pexpect.TIMEOUT, SSH_NEWKEY, pexpect.EOF ] )
You can then make your if more robust:
if i == 1:
    child.sendline('yes')
elif i == 0:
    print("Timeout")
elif i == 2:
    print("EOF")
print(child.before)
This doesn't address why you are not receiving a response with the expected string. It's hard to know without looking at more code, but it's likely because the expected response is slightly wrong. If you manually type in the SSH command, you should be able to see the response you can expect, and enter that response into your code.
You can also print child.before after your expect call, or print child.read() instead of your expect call to see what is being sent back as a response.
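In the dump in the question, before already contains the output of date and exitstatus is 0, which suggests that ssh never asked about the host key and simply finished. A sketch that treats EOF as a normal outcome; run_ssh_command is a hypothetical helper, and the TIMEOUT and EOF markers are passed in as arguments so that it can be given pexpect.TIMEOUT and pexpect.EOF:

```python
def run_ssh_command(child, prompt_pattern, timeout_marker, eof_marker):
    """Drive an already spawned ssh child; return (outcome, output_bytes)."""
    index = child.expect([timeout_marker, prompt_pattern, eof_marker])
    if index == 1:
        # Unknown host: accept the key, then wait for the command to finish.
        child.sendline("yes")
        index = child.expect([timeout_marker, eof_marker])
        outcome = "timeout" if index == 0 else "finished"
    elif index == 0:
        outcome = "timeout"
    else:
        # EOF without a prompt: the host was already known and ssh exited.
        outcome = "finished"
    # child.before holds everything read before the last match.
    return outcome, child.before
```

It would be called as outcome, output = run_ssh_command(child, SSH_NEWKEY, pexpect.TIMEOUT, pexpect.EOF).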

Diazo, parameters and restrictedTraverse

If, in my Diazo control panel under 'Parameter expressions', I put
have_left_portlets = python:context and context.restrictedTraverse('@@plone').have_portlets('plone.leftcolumn',context)
I obtain an error only when I'm on the portal homepage:
2012-06-26 16:51:42 ERROR plone.transformchain Unexpected error whilst trying to apply transform chain
Traceback (most recent call last):
File "/Users/vito/.buildout/eggs/plone.transformchain-1.0.2-py2.6.egg/plone/transformchain/transformer.py", line 48, in __call__
newResult = handler.transformIterable(result, encoding)
File "/Users/vito/.buildout/eggs/plone.app.theming-1.0-py2.6.egg/plone/app/theming/transform.py", line 257, in transformIterable
params[name] = quote_param(expression(expressionContext))
File "/Users/vito/.buildout/eggs/Zope2-2.13.13-py2.6.egg/Products/PageTemplates/ZRPythonExpr.py", line 48, in __call__
return eval(self._code, vars, {})
File "PythonExpr", line 1, in <expression>
File "/Users/vito/.buildout/eggs/AccessControl-2.13.7-py2.6-macosx-10.6-x86_64.egg/AccessControl/ImplPython.py", line 675, in guarded_getattr
v = getattr(inst, name)
AttributeError: 'FilesystemResourceDirectory' object has no attribute 'restrictedTraverse'
How can I solve this?
I suspect this is a bug in plone.app.theming: the context isn't set correctly. Strange, though.
Just confirming that the issue exists:
I get about the same traceback. The site itself looks fine, but for every click inside the site I get the following traceback in my instance fg:
2012-08-10 15:05:05 ERROR plone.transformchain Unexpected error whilst trying to apply transform chain
Traceback (most recent call last):
File "/opt/etc/buildout/eggs/plone.transformchain-1.0.2-py2.6.egg/plone/transformchain/transformer.py", line 48, in __call__
newResult = handler.transformIterable(result, encoding)
File "/opt/etc/buildout/eggs/plone.app.theming-1.0-py2.6.egg/plone/app/theming/transform.py", line 257, in transformIterable
params[name] = quote_param(expression(expressionContext))
File "/opt/etc/buildout/eggs/Zope2-2.13.10-py2.6.egg/Products/PageTemplates/ZRPythonExpr.py", line 48, in __call__
return eval(self._code, vars, {})
File "PythonExpr", line 1, in <expression>
AttributeError: 'FilesystemResourceDirectory' object has no attribute 'Language'
This is because I have the following line in my manifest.cfg (which is about the same as the parameter line in the Plone control panel):
lang = python: context.Language()
In a way, in my case this is sort of logical, since not all content objects have an index called Language().
But the 'context' in this case is apparently referring to the 'FilesystemResourceDirectory' and not to the piece of content you are on?
I'll try with pdb if I can find some more info...

Resources