NoneType' object has no attribute 'text' while web scrapping - web-scraping

I try to web-scrap the news data using beautiful soup getting a Nonetype attribute error.
Error:
AttributeError Traceback (most recent call last)
<ipython-input-5-d288f07b930a> in <module>
29
30 for link in data_array_tcs:
---> 31 title_all = link.find("strong").text
32 # news_all = link.find(class_="FL").text
33 date_all = link.find(class_="PT3 a_10dgry").text
AttributeError: 'NoneType' object has no attribute 'text'

The expression link.find("strong") returned None,
as the HTML contained no matches.
You need to test for that, perhaps with an if statement,
so you only ask for .text in situations where in fact it exists.

Related

I get this error while tying to join a file using the os path

filepath=os.path.join('SalesData')
filepath
disFile=os.listdir(filepath)
disFile
joinFile = os.path.join(disFile, "Sales_Data_2020")
I keep getting this error:
TypeError Traceback (most recent call last)
Input In [15], in <cell line: 1>()
----> 1 joinFile=os.path.join(disFile,"Sales_Data_2020")
File ~\anaconda3\lib\ntpath.py:78, in join(path, *paths)
77 def join(path, *paths):
---> 78 path = os.fspath(path)
79 if isinstance(path, bytes):
80 sep = b'\'
TypeError: expected str, bytes or os.PathLike object, not list
Depending on what disFile represents in your case, this should do the trick:
joinFile = os.path.join("disFile", "Sales_Data_2020")
According to the error I guess that disFile on your end represents a list which won't work. Thus, you would have to select a string of that list to join it.
See also the official docu

BS4: AttributeError: 'NoneType' object stops the parser from working

I'm currently working on a parser to make a small preview of a page from a URL given by the user in PHP.
I'd like to retrieve only the title of the page and a little chunk of information (a bit of text)
The project: for a list of meta-data of popular wordpress-plugins and gathering the first 50 URLs - that are 50 plugins which are of interest! The challenge is: i want to fetch meta-data of all the existing plugins. What i subsequently want to filter out after the fetch is - those plugins that have the newest timestamp - that are updated (most) recently. It is all aobut acutality...
https://wordpress.org/plugins/wp-job-manager
https://wordpress.org/plugins/ninja-forms
import requests
from bs4 import BeautifulSoup
from concurrent.futures.thread import ThreadPoolExecutor
url = "https://wordpress.org/plugins/browse/popular/{}"
def main(url, num):
with requests.Session() as req:
print(f"Collecting Page# {num}")
r = req.get(url.format(num))
soup = BeautifulSoup(r.content, 'html.parser')
link = [item.get("href")
for item in soup.findAll("a", rel="bookmark")]
return set(link)
with ThreadPoolExecutor(max_workers=20) as executor:
futures = [executor.submit(main, url, num)
for num in [""]+[f"page/{x}/" for x in range(2, 50)]]
allin = []
for future in futures:
allin.extend(future.result())
def parser(url):
with requests.Session() as req:
print(f"Extracting {url}")
r = req.get(url)
soup = BeautifulSoup(r.content, 'html.parser')
target = [item.get_text(strip=True, separator=" ") for item in soup.find(
"h3", class_="screen-reader-text").find_next("ul").findAll("li")[:8]]
head = [soup.find("h1", class_="plugin-title").text]
new = [x for x in target if x.startswith(
("V", "Las", "Ac", "W", "T", "P"))]
return head + new
with ThreadPoolExecutor(max_workers=50) as executor1:
futures1 = [executor1.submit(parser, url) for url in allin]
for future in futures1:
print(future.result())
see the results:
Extracting https://wordpress.org/plugins/tuxedo-big-file-uploads/Extracting https://wordpress.org/plugins/cherry-sidebars/
Extracting https://wordpress.org/plugins/meks-smart-author-widget/
Extracting https://wordpress.org/plugins/wp-limit-login-attempts/
Extracting https://wordpress.org/plugins/automatic-translator-addon-for-loco-translate/
Extracting https://wordpress.org/plugins/event-organiser/
Traceback (most recent call last):
File "/home/martin/unbenannt0.py", line 45, in <module>
print(future.result())
File "/home/martin/anaconda3/lib/python3.7/concurrent/futures/_base.py", line 428, in result
return self.__get_result()
File "/home/martin/anaconda3/lib/python3.7/concurrent/futures/_base.py", line 384, in __get_result
raise self._exception
File "/home/martin/anaconda3/lib/python3.7/concurrent/futures/thread.py", line 57, in run
result = self.fn(*self.args, **self.kwargs)
File "/home/martin/unbenannt0.py", line 34, in parser
"h3", class_="screen-reader-text").find_next("ul").findAll("li")[:8]]
AttributeError: 'NoneType' object has no attribute 'find_next'
well i have a severe error - the
AttributeError: 'NoneType' object has no attribute 'find_next'
It looks like soup.find("h3", class_="screen-reader-text") has not found anything.
Well we could either break this line up and only call find_next if there was a result or use a try/except that captures the AttributeError.
at the moment i do not know how to fix this whole thing - only that we can surround the offending code with:
try:
code that causes error
except AttributeError:
print(f"Attribution error on {some data here}, {whatever else would be of value}, {...}")
... whatever action is thinkable to take here.
btw.- besides this error i want to add a option that gives the results back: see complete and unaltered error traceback. It contains valuable process call stack information.
Extracting https://wordpress.org/plugins/automatic-translator-addon-for-loco-translate/
Extracting https://wordpress.org/plugins/wpforo/Extracting https://wordpress.org/plugins/accesspress-social-share/
Extracting https://wordpress.org/plugins/mailoptin/
Extracting https://wordpress.org/plugins/tuxedo-big-file-uploads/
Extracting https://wordpress.org/plugins/post-snippets/
Extracting https://wordpress.org/plugins/woocommerce-payfast-gateway/Extracting https://wordpress.org/plugins/woocommerce-grid-list-toggle/
Extracting https://wordpress.org/plugins/goodbye-captcha/
Extracting https://wordpress.org/plugins/gravity-forms-google-analytics-event-tracking/
Traceback (most recent call last):
File "/home/martin/dev/wordpress_plugin.py", line 44, in <module>
print(future.result())
File "/home/martin/anaconda3/lib/python3.7/concurrent/futures/_base.py", line 428, in result
return self.__get_result()
File "/home/martin/anaconda3/lib/python3.7/concurrent/futures/_base.py", line 384, in __get_result
raise self._exception
File "/home/martin/anaconda3/lib/python3.7/concurrent/futures/thread.py", line 57, in run
result = self.fn(*self.args, **self.kwargs)
File "/home/martin/dev/wordpress_plugin.py", line 33, in parser
"h3", class_="screen-reader-text").find_next("ul").findAll("li")[:8]]
AttributeError: 'NoneType' object has no attribute 'find_next'
hope that this was not too long and complex - thank you for the help!

Why tf2 can't save a tf_function model as .pb file?

I tried to saved a model like the official code of transformer on official website, but when i want to save the train_step graph or trace on it with tf.summary.trace_on ,it errors.the error is as
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/func_graph.py in convert(x)
936 try:
--> 937 x = ops.convert_to_tensor_or_composite(x)
938 except (ValueError, TypeError):
15 frames
TypeError: Can't convert Operation 'PartitionedFunctionCall' to Tensor (target dtype=None, name=None, as_ref=False)
During handling of the above exception, another exception occurred:
TypeError Traceback (most recent call last)
/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/func_graph.py in convert(x)
941 "must return zero or more Tensors; in compilation of %s, found "
942 "return value of type %s, which is not a Tensor." %
--> 943 (str(python_func), type(x)))
944 if add_control_dependencies:
945 x = deps_ctx.mark_as_return(x)
TypeError: To be compatible with tf.contrib.eager.defun, Python functions must return zero or more Tensors; in compilation of <function canonicalize_signatures.<locals>.signature_wrapper at 0x7fcf794b47b8>, found return value of type <class 'tensorflow.python.framework.ops.Operation'>, which is not a Tensor.
I supposed it was some error on tensor operation and write an other demo to confirm my idea:
import tensorflow as tf
sig=[tf.TensorSpec(shape=(None, None), dtype=tf.int64),tf.TensorSpec(shape=(None, None), dtype=tf.int64)]
#tf.function(input_signature=sig)
def cal(a,d):
b=a[1:]
root=tf.Module()
root.func = cal
# 获取具体函数。
concrete_func = root.func.get_concrete_function(
tf.TensorSpec(shape=(None, None), dtype=tf.int64),tf.TensorSpec(shape=(None, None), dtype=tf.int64)
)
tf.saved_model.save(root, '/correct', concrete_func)
the error occurs as supposed. But how can i fixed it? the positional_encoding requires and i have no idea about how to replace this operation.
I figure it because i didn't save the transformer model in the pb file

Unit test for exception raised in custom GNU radio python block

I have created a custom python sync block for use in a gnuradio flowgraph. The block tests for invalid input and, if found, raises a ValueError exception. I would like to create a unit test to verify that the exception is raised when the block indeed receives invalid input data.
As part of the python-based qa test for this block, I created a flowgraph such that the block receives invalid data. When I run the test, the block does appear to raise the exception but then hangs.
What is the appropriate way to test for this? Here is a minimal working example:
#!/usr/bin/env python
import numpy as np
from gnuradio import gr, gr_unittest, blocks
class validate_input(gr.sync_block):
def __init__(self):
gr.sync_block.__init__(self,
name="validate_input",
in_sig=[np.float32],
out_sig=[np.float32])
self.max_input = 100
def work(self, input_items, output_items):
in0 = input_items[0]
if (np.max(in0) > self.max_input):
raise ValueError('input exceeds max.')
validated_in = output_items[0]
validated_in[:] = in0
return len(output_items[0])
class qa_validate_input (gr_unittest.TestCase):
def setUp (self):
self.tb = gr.top_block ()
def tearDown (self):
self.tb = None
def test_check_valid_data(self):
src_data = (0, 201, 92)
src = blocks.vector_source_f(src_data)
validate = validate_input()
snk = blocks.vector_sink_f()
self.tb.connect (src, validate)
self.tb.connect (validate, snk)
self.assertRaises(ValueError, self.tb.run)
if __name__ == '__main__':
gr_unittest.run(qa_validate_input, "qa_validate_input.xml")
which produces:
DEPRECATED: Using filename with gr_unittest does no longer have any effect.
handler caught exception: input exceeds max.
Traceback (most recent call last):
File "/home/xxx/devel/gnuradio3_8/lib/python3.6/dist-packages/gnuradio/gr/gateway.py", line 60, in eval
try: self._callback()
File "/home/xxx/devel/gnuradio3_8/lib/python3.6/dist-packages/gnuradio/gr/gateway.py", line 230, in __gr_block_handle
) for i in range(noutputs)],
File "qa_validate_input.py", line 21, in work
raise ValueError('input exceeds max.')
ValueError: input exceeds max.
thread[thread-per-block[1]: <block validate_input(2)>]: SWIG director method error. Error detected when calling 'feval_ll.eval'
^CF
======================================================================
FAIL: test_check_valid_data (__main__.qa_validate_input)
----------------------------------------------------------------------
Traceback (most recent call last):
File "qa_validate_input.py", line 47, in test_check_valid_data
self.assertRaises(ValueError, self.tb.run)
AssertionError: ValueError not raised by run
----------------------------------------------------------------------
Ran 1 test in 1.634s
FAILED (failures=1)
The top_block's run() function does not call the block's work() function directly but starts the internal task scheduler and its threads and waits them to finish.
One way to unit test the error handling in your block is to call the work() function directly
def test_check_valid_data(self):
src_data = [[0, 201, 92]]
output_items = [[]]
validate = validate_input()
self.assertRaises(ValueError, lambda: validate.work(src_data, output_items))

pexpect python throw error

Although this is my first attempt at using pexpect, the python3 script using pexpect is pretty simple; yet it fails.
#!/usr/bin/env python3
import sys
import pexpect
SSH_NEWKEY = r'Are you sure you want to continue connecting \(yes/no\)\?'
child = pexpect.spawn("ssh -i /user/aws/key.pem ec2-user#xxx.xxx.xxx.xxx date")
i = child.expect( [ pexpect.TIMEOUT, SSH_NEWKEY )
if i == 1:
child.sendline('yes')
print(child.before)
The SSH_NEWKEY is the only response I'm expecting, but the example showed a list containing pexpect.TIMEOUT in it so I used it.
$ ./test.py
Traceback (most recent call last):
File "/usr/local/lib/python3.4/site-packages/pexpect/spawnbase.py", line 144, in read_nonblocking
s = os.read(self.child_fd, size)
OSError: [Errno 5] Input/output error
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.4/site-packages/pexpect/expect.py", line 97, in expect_loop
incoming = spawn.read_nonblocking(spawn.maxread, timeout)
File "/usr/local/lib/python3.4/site-packages/pexpect/pty_spawn.py", line 455, in read_nonblocking
return super(spawn, self).read_nonblocking(size)
File "/usr/local/lib/python3.4/site-packages/pexpect/spawnbase.py", line 149, in read_nonblocking
raise EOF('End Of File (EOF). Exception style platform.')
pexpect.exceptions.EOF: End Of File (EOF). Exception style platform.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "./min.py", line 15, in <module>
i = child.expect( [ pexpect.TIMEOUT, SSH_NEWKEY ] )
File "/usr/local/lib/python3.4/site-packages/pexpect/spawnbase.py", line 315, in expect
timeout, searchwindowsize, async)
File "/usr/local/lib/python3.4/site-packages/pexpect/spawnbase.py", line 339, in expect_list
return exp.expect_loop(timeout)
File "/usr/local/lib/python3.4/site-packages/pexpect/expect.py", line 102, in expect_loop
return self.eof(e)
File "/usr/local/lib/python3.4/site-packages/pexpect/expect.py", line 49, in eof
raise EOF(msg)
pexpect.exceptions.EOF: End Of File (EOF). Exception style platform.
<pexpect.pty_spawn.spawn object at 0x7f70ea4fbcf8>
command: /usr/bin/ssh
args: ['/usr/bin/ssh', '-i', '/user/aws/key.pem', 'ec2-user#xxx.xxx.xxx.xxx', 'date']
searcher: None
buffer (last 100 chars): b''
before (last 100 chars): b'Fri May 6 13:50:18 EDT 2016\r\n'
after: <class 'pexpect.exceptions.EOF'>
match: None
match_index: None
exitstatus: 0
flag_eof: True
pid: 31293
child_fd: 5
closed: False
timeout: 30
delimiter: <class 'pexpect.exceptions.EOF'>
logfile: None
logfile_read: None
logfile_send: None
maxread: 2000
ignorecase: False
searchwindowsize: None
delaybeforesend: 0.05
delayafterclose: 0.1
delayafterterminate: 0.1
What am I missing?
CentOS 6.4
python 3.4.3
An EOF error is being raised during your expect call. This means that the response received does not match SSH_NEWKEY, and reaches end of file within the timeout period. To catch this exception, you should change your except line to read:
i = child.expect( [ pexpect.TIMEOUT, SSH_NEWKEY, pexpect.EOF)
You can then make your if more robust:
if i == 1:
child.sendline('yes')
elif i == 0:
print "Timeout"
elif i == 2:
print "EOF"
print(child.before)
This doesn't solve the reason behind why you are on receiving a response with the expected string - it's hard to know without looking at more code but it's likely because you have the response slightly wrong. If you manually type in the SSH string, you should be able to see the response you can expect, and enter this response into your code.
You can also print child.before after your expect call, or print child.read() instead of your expect call to see what is being sent back as a response.

Resources