CrawlSpider not following links


As the title says, I'm trying to get a CrawlSpider working for some products on Amazon, to no avail.
Here is the original URL page I want to get products from.
The HTML around the next-page link looks like this:
<a title="Next Page" id="pagnNextLink" class="pagnNext" href="/s/ref=sr_pg_2?me=A1COIXT69Y8KR&rh=i%3Amerchant-items&page=2&ie=UTF8&qid=1444414650">
    <span id="pagnNextString">Next Page</span>
    <span class="srSprite pagnNextArrow"></span>
</a>
This is the regular expression I'm currently using:
s/ref=sr_pg_[0-9]\?[^">]+
Tested with a service like Pythex.org, it seems fine; it matches this portion of the URL:
s/ref=sr_pg_2?me=A1COIXT69Y8KR&rh=i%3Amerchant-items&page=2&ie=UTF8&qid=1444414650
Here is the code of my crawler:
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from amazon.items import AmazonProduct
class AmazonCrawlerSpider(CrawlSpider):
    name = 'amazon_crawler'
    allowed_domains = ['amazon.com']
    #allowed_domains = ['stackoverflow.com']
    start_urls = ['http://www.amazon.com/s?ie=UTF8&me=A19COJAJDNQSRP&page=1']
    #start_urls = ['http://stackoverflow.com/questions?pagesize=50&sort=newest']
    rules = [
        Rule(LinkExtractor(allow=r's/ref=sr_pg_[0-9]\?[^">]+'),
             callback='parse_item', follow=True)
    ]
    '''rules = [
        Rule(LinkExtractor(allow=r'questions\?page=[0-9]&sort=newest'),
             callback='parse_item', follow=True)
    ]'''
    def parse_item(self, response):
        products = response.xpath('//div[@class="summary"]/h3')
        for product in products:
            item = AmazonProduct()
            print('found it!')
            yield item
For some unknown reason, the crawler is not following the links. This code is based on the blog tutorial from the guys at RealPython, where they crawl StackOverflow for questions. In fact, if you swap in the commented-out code, you can see that the crawler works.
Any idea what I'm missing here? Thanks!
UPDATE:
Based on the answer from @Rejected, I switched to the Scrapy shell and could see that, as he pointed out, the HTML Scrapy receives is different from what I see in the browser.
The interesting part of the HTML that Scrapy gets is:
<a title="Next Page" id="pagnNextLink" class="pagnNext" href="/s?ie=UTF8&me=A19COJAJDNQSRP&page=2">
    <span id="pagnNextString">Next Page</span>
    <span class="srSprite pagnNextArrow"></span>
</a>
I've changed my regular expression so it looks like this:
s[^">&]+&me=A19COJAJDNQSRP&page=[0-9]$
Now I'm getting the links in the shell:
[Link(url='http://www.amazon.com/s?ie=UTF8&me=A19COJAJDNQSRP&page=1', text='\n \n \n \n \n \n \n \n ', fragment='', nofollow=False),
 Link(url='http://www.amazon.com/s?ie=UTF8&me=A19COJAJDNQSRP&page=2', text='2', fragment='', nofollow=False),
 Link(url='http://www.amazon.com/s?ie=UTF8&me=A19COJAJDNQSRP&page=3', text='3', fragment='', nofollow=False)]
And also the crawler is getting them correctly!

Scrapy is being served different HTML than what you see in your browser (even when requesting "view-source:url").
I wasn't able to determine why with 100% certainty. The desired three(?) links will match r's/ref=sr_pg_[0-9]' as your allow pattern.
Since Amazon is doing something to detect the browser, you should also check what your own Scrapy instance actually receives. Open the page in the Scrapy shell and experiment with the LinkExtractor yourself:
LinkExtractor(allow=r's/ref=sr_pg_[0-9]').extract_links(response)
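To make the mismatch concrete, here is a minimal sketch using only Python's re module. It assumes that allow patterns are searched against the absolute URL, which is how LinkExtractor applies them; the two URLs are the hrefs quoted in the question, joined to http://www.amazon.com. The helper name matches is made up for illustration.

```python
import re

# Pattern from the original question and pattern from the question's update.
OLD_PATTERN = r's/ref=sr_pg_[0-9]\?[^">]+'
NEW_PATTERN = r's[^">&]+&me=A19COJAJDNQSRP&page=[0-9]$'

# Next-page link as seen in the browser vs. as seen by Scrapy (made absolute).
BROWSER_URL = ('http://www.amazon.com/s/ref=sr_pg_2?me=A1COIXT69Y8KR'
               '&rh=i%3Amerchant-items&page=2&ie=UTF8&qid=1444414650')
SCRAPY_URL = 'http://www.amazon.com/s?ie=UTF8&me=A19COJAJDNQSRP&page=2'


def matches(pattern, url):
    """Return True if the pattern is found anywhere in the URL."""
    return re.search(pattern, url) is not None


for label, pattern in (('old', OLD_PATTERN), ('new', NEW_PATTERN)):
    for name, url in (('browser', BROWSER_URL), ('scrapy', SCRAPY_URL)):
        print(label, 'pattern vs', name, 'URL:', matches(pattern, url))
```

The old pattern only matches the browser's URL, which is why the rule never fired on the HTML that Scrapy actually received.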

Related

Why is requests.get() giving me the information in Spanish?

I'm trying to request the weather from Google for a specific place at a specific time. When I get the response, the text is in Spanish instead of English, i.e. instead of "Mostly cloudy" I get "parcialmente nublado". I'm using the requests library and BeautifulSoup.
from bs4 import BeautifulSoup
import requests
url = "https://www.google.com/search?q=weather+Nissan+Stadium+Nashville+TN+Thursday+December+29+2022+8:15+PM"
page = requests.get(url)
soup = BeautifulSoup(page.content,"html.parser")
clima = soup.find("div",class_="tAd8D")
print(clima.text)
Output
jueves
Mayormente nublado
Máxima: 16°C Mínima: 8°C
Desired output:
Thursday
Mostly cloudy
Maximum: x (Fahrenheit) Minimum: x (Fahrenheit)

The most likely explanation is that Google associates your IP address with a primarily Spanish-speaking region and defaults to giving you results in Spanish.
Try specifying English in your search string by adding hl=en:
https://www.google.com/search?hl=en&q=my+search+string
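As a rough sketch of that suggestion, the query string can be built with the standard library so that hl=en is always included and the search text is escaped properly. The function name build_search_url is hypothetical, and whether Google honours hl for every result type is an assumption, not something verified here.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def build_search_url(query, language="en"):
    """Build a Google search URL with an explicit interface language (hl)."""
    params = {"hl": language, "q": query}
    return "https://www.google.com/search?" + urlencode(params)


QUERY = "weather Nissan Stadium Nashville TN Thursday December 29 2022 8:15 PM"
url = build_search_url(QUERY)
print(url)

# Round-trip check: the parameters decode back to what was passed in.
decoded = parse_qs(urlsplit(url).query)
print(decoded)
```

With requests, the same dictionary can be passed as the params argument of requests.get() instead of building the URL by hand.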

Scraping: No attribute find_all for <p>

Good morning :)
I am trying to scrape the content of this website: https://public.era.nih.gov/pubroster/preRosIndex.era?AGENDA=438653&CID=102313
The text I am trying to get seems to be located inside some <p> and separated by <br>.
For some reason, whenever I try to access a <p>, I get the following error: "ResultSet object has no attribute 'find_all'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?", and this happens even if I use find() instead of find_all().
My code is below (it is a very simple thing with no loop yet, I just would like to identify where the mistake comes from):
from selenium import webdriver
import time
from bs4 import BeautifulSoup
options = webdriver.ChromeOptions()
options.add_argument("headless")
options.add_experimental_option('excludeSwitches', ['enable-logging'])
driver = webdriver.Chrome(executable_path='MYPATH/chromedriver',options=options)
url= "https://public.era.nih.gov/pubroster/preRosIndex.era?AGENDA=446096&CID=102313"
driver.maximize_window()
driver.implicitly_wait(5) # wait up to 5 seconds before calls to find elements time out
driver.get(url)
content = driver.page_source.encode('utf-8').strip()
soup = BeautifulSoup(content,"html.parser")
column = soup.find_all("div", class_="col-sm-12")
people_in_column = column.find_all("p").find_all("br")
Is there anything obvious I am not understanding here?
Thanks a lot in advance for your help!

You are calling find_all() a second time on a ResultSet (a list of elements) without iterating over it, which is incorrect. The correct way is to loop over the results, as follows.
columns = soup.find_all("div", class_="col-sm-12")
for column in columns:
    people_in_column = column.find("p").get_text(strip=True)
    print(people_in_column)
Full working code as an example:
from selenium import webdriver
import time
from bs4 import BeautifulSoup
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
options = Options()
options.add_argument("headless")
options.add_experimental_option('excludeSwitches', ['enable-logging'])
webdriver_service = Service("./chromedriver")  # Your chromedriver path
# Pass the options to the driver; otherwise the headless flag is never applied
driver = webdriver.Chrome(service=webdriver_service, options=options)
url = "https://public.era.nih.gov/pubroster/preRosIndex.era?AGENDA=446096&CID=102313"
driver.maximize_window()
driver.implicitly_wait(5)  # wait up to 5 seconds before calls to find elements time out
driver.get(url)
content = driver.page_source
soup = BeautifulSoup(content, "html.parser")
columns = soup.find_all("div", class_="col-sm-12")
for column in columns:
    people_in_column = column.find("p").get_text(strip=True)
    print(people_in_column)
Output:
Notice of NIH Policy to All Applicants:Meeting rosters are provided for information purposes only. Applicant investigators and institutional officials must not communicate directly with study section members about an application before or after the review. Failure to observe this policy will create a serious breach of integrity in the peer review process, and may lead to actions outlined inNOT-OD-22-044, including removal of the application from immediate
review.
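Since the question was about text separated by <br> inside each <p>, the following sketch may also help. It runs on a small made-up HTML string (the names and markup are hypothetical; the real page may be structured differently) and shows that nested find_all() calls are fine as long as each one is made on a single Tag rather than on a ResultSet.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for the real roster page.
SAMPLE_HTML = """
<div class="col-sm-12"><p>Alice Example<br>Professor<br>Example University</p></div>
<div class="col-sm-12"><p>Bob Sample<br>Director<br>Sample Institute</p></div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

people = []
# find_all() returns a ResultSet (a list of Tag objects), so iterate over it.
for column in soup.find_all("div", class_="col-sm-12"):
    # Here `column` is a single Tag, so calling find_all() on it is valid.
    for paragraph in column.find_all("p"):
        # stripped_strings yields each text fragment between the <br> tags.
        people.append(list(paragraph.stripped_strings))

for person in people:
    print(person)
```

Collecting the fragments with stripped_strings keeps the lines separate, whereas get_text(strip=True) joins them without any separator.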

Multiple classes, unable to return desired page(s)

First, I want to say that I am a first-time poster, so I apologize in advance if any part of my question or the way it is asked or presented "sucks." With that said, I've been trying to scrape a table that spans multiple pages on barchart.com using Jupyter and BeautifulSoup. While I have been successful in returning the entire page as a whole, I haven't had much luck returning the specific parts I need. I included some images, the first three of which show the elements I am currently choosing between:
the 'div' element that highlights the entire table
another 'div' element within the first 'div' that also has the entire table I need
The 'table' element that I would use but it doesn't include the left most column that includes the tickers/stock symbols
Regardless of what I put in my code, I always get "[]" back, and I haven't been able to figure out how to write the multiple parts of each 'div' or 'table', if that makes sense.
from bs4 import BeautifulSoup as soup
from urllib.request import urlopen, Request
stonks_url = Request('https://www.barchart.com/options/unusual-activity/stocks', headers={'User-Agent': 'Mozilla/5.0'})
stonks_data = urlopen(stonks_url)
stonks_html = stonks_data.read()
stonks_data.close()
page_soup = soup(stonks_html, 'html.parser')
uoa_table = page_soup.findAll('tbody', {'data-ng-repeat': 'rows in content'})
print(uoa_table)
Thanks in advance to any advice or guidance!

The page builds this table with JavaScript, so a plain request does not return it. You need to use Selenium, take the page source from the browser, and process the table from that.
from bs4 import BeautifulSoup
from selenium import webdriver
driver = webdriver.Chrome()
driver.get('https://www.barchart.com/options/unusual-activity/stocks')
soup = BeautifulSoup(driver.page_source, 'html.parser')
driver.quit()  # close the browser once the page source has been read
# get text
text = soup.get_text()
print(text)

Can anyone please write in English what exactly this code means: soup.find_all("p", class_="strikeout")

I want to understand, in plain English, exactly what this code means.
I have been learning BeautifulSoup and have a rough idea, but I am not confident.
soup.find_all("p", class_="strikeout")
I think the code says to find all the tags which are ... and something.

I'll translate
soup.find_all("p", class_="strikeout")
as:
find all <p> tags whose class equals strikeout ( <p class="strikeout"> )
You should check the documentation (https://www.crummy.com/software/BeautifulSoup/bs4/doc/#calling-a-tag-is-like-calling-find-all) to find out whether the class match is strict, that is, whether or not it will also match something like
<p class="strikeout foo">
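The question of strictness can also be checked directly. The sketch below uses a made-up HTML string and shows that class_="strikeout" matches any <p> that has strikeout among its classes, including <p class="strikeout foo">. The second query is one possible way to require exactly that single class; it relies on html.parser's default behaviour of storing class as a list.

```python
from bs4 import BeautifulSoup

# Made-up sample markup for illustration.
SAMPLE_HTML = """
<p class="strikeout">one class</p>
<p class="strikeout foo">two classes</p>
<p class="foo">other class</p>
<div class="strikeout">not a p tag</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Matches every <p> that has "strikeout" among its CSS classes.
matched = [p.get_text() for p in soup.find_all("p", class_="strikeout")]
print(matched)

# Stricter variant: only <p> tags whose class list is exactly ["strikeout"].
exact = [
    tag.get_text()
    for tag in soup.find_all(
        lambda tag: tag.name == "p" and tag.get("class") == ["strikeout"]
    )
]
print(exact)
```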

ReST strikethrough

Is it possible to strike text through in Restructured Text?
Something that for example renders as a <strike> tag when converted to HTML, like:
ReSTructuredText

I checked the docs more carefully, as suggested by Ville Säävuori, and I decided to add the strikethrough like this:
.. role:: strike
    :class: strike
In the document, this can be applied as follows:
:strike:`This text is crossed out`
Then in my css file I have an entry:
.strike {
    text-decoration: line-through;
}

There are at least three ways of doing it:
.. role:: strike

An example of :strike:`strike through text`.

.. container:: strike

   Here the full block of test is striked through.

An undecorated paragraph.

.. class:: strike

This paragraph too is is striked through.

.. admonition:: cancelled
   :class: strike

   I strike through cancelled text.
After applying rst2html you get:
<p>An example of <span class="strike">strike through text</span>.</p>
<div class="strike container">
Here the full block of test is striked through.</div>
<p>An undecorated paragraph.</p>
<p class="strike">This paragraph too is is striked through.</p>
<div class="strike admonition">
<p class="first admonition-title">cancelled</p>
<p class="last">I strike through cancelled text.</p>
</div>
You use them with a style:
.strike {
    text-decoration: line-through;
}
Here I have taken the admonition directive as an example, but any directive that allows the :class: option would do.
As it generates a span, the role directive is the only one that lets you apply your style to part of a paragraph.
It is redundant to add a class strike to a directive also named strike, as Gozzilli suggests, because the directive name is the default class for the HTML output.
I have checked this syntax with both rst2html and Sphinx. While everything works as expected with rst2html, the class directive fails with Sphinx. You have to replace it with
.. rst-class:: strike

This paragraph too is is striked through.

This is only stated in a small footnote of the Sphinx reST Primer.
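For anyone who wants to verify the role-based variant without running rst2html by hand, here is a small sketch that renders a two-line reST source through Docutils from Python. It assumes Docutils is installed and that the "html" writer alias is available; only the resulting span is checked, since the surrounding markup can differ between writers and versions.

```python
from docutils.core import publish_parts

# Minimal reST source: declare the role, then use it inline.
RST_SOURCE = """\
.. role:: strike

An example of :strike:`strike through text`.
"""

# "body" holds the rendered document body without the <html>/<head> wrapper.
parts = publish_parts(RST_SOURCE, writer_name="html")
html_body = parts["body"]
print(html_body)
```

The printed fragment should contain a span carrying the strike class, which is what the .strike CSS rule then styles.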

According to the official spec there is no directive for strikethrough markup in ReST.
However, if the environment allows for :raw: role or you are able to write your own roles, then you can write a custom plugin for it.

I found the other answers very helpful.
I am not very familiar with Sphinx but I am using it for a project. I too wanted the strike-through ability and have got it working based on the previous answers.
To be clear, I added my strikethrough role as gozzilli mentioned, but I saved it inside my conf.py using the rst_prolog variable, as discussed in the Stack Overflow thread here. This makes the role available to all of your reST files.
I then extended the base HTML template as described above by creating layout.html within _templates inside my source directory. The contents of this file are:
{% extends "!layout.html" %}
{% set css_files = css_files + ["_static/myStyle.css"] %}
This basically adds a custom CSS file to all of your built default HTML docs.
Finally, in my _static directory within my source directory I included the file myStyle.css which contains:
.strike {
    text-decoration: line-through;
}
Which the other answers have already provided.
I am merely writing this answer as it wasn't obvious to me with my limited Sphinx experience which files to edit.
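For readers unsure how these pieces fit into conf.py, a minimal sketch might look like the following. It is an assumption-laden fragment, not the exact file from this answer: it uses html_css_files (the setting shown in another answer here) in place of the layout.html override, and it assumes myStyle.css sits directly in the _static directory.

```python
# conf.py (fragment)

# Prepended to every reST source file, so the role is defined everywhere.
rst_prolog = """
.. role:: strike
"""

# Directory holding static files, relative to conf.py.
html_static_path = ["_static"]

# Stylesheets to include, relative to html_static_path.
# myStyle.css is expected to contain the .strike rule shown above.
html_css_files = ["myStyle.css"]
```

Either approach to loading the stylesheet should work; the template override is the option described in this answer.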

Here's a Python definition of a del role, which works better than the accepted answer if you want to use the role in multiple pages of a Pelican blog or a Sphinx documentation project:
from docutils import nodes
from docutils.parsers.rst import roles
def deleted_role(_role, rawtext, text, _lineno, _inliner, options={}, _content=[]):
    options = dict(options)  # copy, so the shared default dict is never mutated
    roles.set_classes(options)
    options.setdefault('classes', []).append("del")
    return [nodes.inline(rawtext, text, **options)], []
roles.register_canonical_role('del', deleted_role)
Even better would be to extend the HTML writer to produce a proper <del> tag, like this:
from docutils import nodes
from docutils.parsers.rst import roles
from docutils.writers._html_base import HTMLTranslator
class delnode(nodes.inline):
    pass
def visit_delnode(self, node):
    self.body.append(self.starttag(node, 'del', ''))
def depart_delnode(self, node):
    self.body.append('</del>')
HTMLTranslator.visit_delnode = visit_delnode
HTMLTranslator.depart_delnode = depart_delnode
def deleted_role(_role, rawtext, text, _lineno, _inliner, options={}, _content=[]):
    roles.set_classes(options)
    return [delnode(rawtext, text, **options)], []
roles.register_canonical_role('del', deleted_role)
You can trivially adjust it to produce an <s>, of course.

Users come from different backgrounds, so no single solution suits everyone.
1. Only one file
This applies if you only need it in one file. For example, you published a simple project to PyPI and probably have just one README.rst file. The following may be what you want.
.. |ss| raw:: html

   <strike>

.. |se| raw:: html

   </strike>

single line
=============

|ss| abc\ |se|\defg

multiple line
=============

|ss|

line 1

line 2

|se|

789
You can copy and paste it into this website to see how it renders: https://livesphinx.herokuapp.com/
It's simple, and some IDEs, for example PyCharm, can show the preview directly.
The sections below are written for Sphinx users.
2. Beginner with Sphinx
This applies if you are a beginner with Sphinx (that is, you want to use Sphinx to create documentation but are not familiar with Python). Try the following:
# conf.py
from pathlib import Path
html_static_path = ['_static', ]
html_css_files = ['css/user.define.css']  # If you want to control which HTML pages include it, you can add it in the HTML template instead, much like the answer by @Gregory Kuhn.
with open(Path(__file__).parent / Path('_static/css/user.define.rst'), 'r') as f:
    user_define_role = f.read()
rst_prolog = '\n'.join([ user_define_role + '\n',])  # will be included at the beginning of every source file that is read.
# rst_epilog = '\n'.join([ user_define_role + '\n',])  # it's ok if you put it on the end.
user.define.rst
.. role:: strike
user.define.css
.strike {text-decoration: line-through;}
With rst_prolog, the role is added automatically to every rst file, but if you change the content of that file (the one containing the formats you define), you must rebuild for the output to render correctly.
3. Create roles
You can create an extension to achieve this.
# conf.py
extensions = ['_ext.rst_roles', ]
html_static_path = ['_static', ]
html_css_files = ['css/user.define.css']
# rst_roles.py
from sphinx.application import Sphinx
from docutils.parsers.rst import roles
from docutils import nodes
from docutils.parsers.rst.states import Inliner
def strike_role(role, rawtext, text, lineno, inliner: Inliner, options={}, content=[]):
    your_css_strike_name = 'strike'
    # A role function must return (list of nodes, list of system messages)
    return [nodes.inline(rawtext, text, classes=[your_css_strike_name])], []
def setup(app: Sphinx):
    roles.register_canonical_role('my-strike', strike_role)  # usage: :my-strike:`content ...`
The full directory layout:
conf.py
_ext/
    rst_roles.py
_static/
    css/
        user.define.css
For more about roles, you can refer to this link: rst-roles
I also highly recommend reading docutils.parsers.rst.roles.py.

I wrote an extension for this.
Just pip install sphinxnotes-strike and use:
:strike:`text`
or
:del:`text`
to show strike text.
For more info: https://sphinx-notes.github.io/strike/

Since Docutils 0.17, the HTML5-writer uses <del> if a matching class value is found in inline, literal, or container elements:
.. role:: del

:del:`This text has been deleted`, here is the rest of the paragraph.

.. container:: del

   This paragraph has been deleted.
