Running multiple spiders in the same process, one spider at a time

I have a situation where I have a CrawlSpider that searches for results using postal codes and categories (POST data). I need to get all the results for all the categories in all postal codes. My spider takes a postal code and a category as arguments for the POST data. I want to programmatically start a spider for each postal code/category combo via a script.
The documentation explains how you can run multiple spiders per process with the code example here: http://doc.scrapy.org/en/latest/topics/practices.html#running-multiple-spiders-in-the-same-process This is close to what I want to do, but I essentially want to queue up spiders so that each one runs only after the preceding spider finishes.
Any ideas on how to accomplish this? There seem to be some answers that apply to older versions of Scrapy (~0.13), but the architecture has changed and they no longer work with the latest stable release (0.24.4).

You can rely on the spider_closed signal to start crawling the next postal code/category. Here is sample code (not tested) based on this answer, adapted for your use case:
from scrapy.crawler import Crawler
from scrapy import log, signals
from scrapy.settings import Settings
from twisted.internet import reactor

# for the sake of an example, sample postal codes
postal_codes = ['10801', '10802', '10803']

def configure_crawler(postal_code):
    spider = MySpider(postal_code)

    # configure signals
    crawler.signals.connect(callback, signal=signals.spider_closed)

    # detach spider
    crawler._spider = None

    # configure and start the crawler
    crawler.configure()
    crawler.crawl(spider)

# callback fired when the spider is closed
def callback(spider, reason):
    try:
        postal_code = postal_codes.pop()
        configure_crawler(postal_code)
    except IndexError:
        # stop the reactor if no postal codes left
        reactor.stop()

settings = Settings()
crawler = Crawler(settings)
configure_crawler(postal_codes.pop())
crawler.start()

# start logging
log.start()

# start the reactor (blocks execution)
reactor.run()
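For what it's worth, later Scrapy releases (1.x and newer) ship a CrawlerRunner that makes this sequential pattern much shorter. A minimal sketch, assuming you can upgrade and that MySpider takes the postal code as a constructor argument:

from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

configure_logging()
runner = CrawlerRunner()

@defer.inlineCallbacks
def crawl_sequentially():
    # each yield waits for the previous spider to finish before starting the next one
    for postal_code in ['10801', '10802', '10803']:
        yield runner.crawl(MySpider, postal_code)
    reactor.stop()

crawl_sequentially()
reactor.run()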

Related

Scrape BSCScan Token Holdings Page

I'm trying to get data from this page
https://bscscan.com/tokenholdings?a=0xFAe2dac0686f0e543704345aEBBe0AEcab4EDA3d
But the website owner doesn't provide API endpoints for this purpose. So I tried to achieve it in different ways:
- using dryscrape, but the library seems to be abandoned;
- using requests, but the data is provided dynamically by JavaScript;
- using requests-html, but even in this case the data doesn't seem to be loaded.
I would like to avoid Selenium because it's slow, but I don't know how to solve this issue. Does anyone have a solution that could work? The data I need is the table containing the tokens of the wallet. Thank you in advance and have a nice day.
You can do it with requests-html; for example, let's grab the symbol of the first row:
from requests_html import HTMLSession
session = HTMLSession()
url='https://bscscan.com/tokenholdings'
token={'a': '0xFAe2dac0686f0e543704345aEBBe0AEcab4EDA3d'}
r = session.get(url, params=token)
r.html.render(sleep=2)
binance_row = r.html.find('tbody tr', first=True)
symbol = binance_row.find('td')[2].text
print(symbol)
Output:
BNB
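If you need the whole table rather than just one cell, here is a sketch that reuses the same rendered response and loops over every row (the exact column meanings are an assumption based on the page's current layout):

for row in r.html.find('tbody tr'):
    cells = [td.text for td in row.find('td')]
    # each entry in cells is one column of the row (asset, symbol, amount, ...)
    print(cells)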

How to figure out where the raw data in a table is?

https://www.nyse.com/quote/XNYS:A
After I access the above URL, I open Developer Tools in Firefox, change the date under HISTORIC PRICES, and click 'GO'. The table is updated, but I don't see any relevant HTTP requests sent in devtools.
So this means that the data has already been downloaded in the first request, but I cannot figure out how to extract the raw data of the table. Could anybody take a look at how to extract the raw data from the table? (Note that I don't want to use methods like Selenium; I want to stay with raw HTTP requests to get the raw data.)
EDIT: A websocket is mentioned in the comments, but I can't see it in Developer Tools. I'm adding the websocket tag anyway in case somebody who knows more about websockets can chime in.
I am afraid you cannot extract JavaScript-rendered content without Selenium. You can always make use of a headless browser (you don't see any browser window on your screen; the only pitfall is that you have to wait until the page fully loads) and it won't bother you anymore.
In other words, all the other scraping libs are based on URLs and forms. Scrapy can post forms but cannot run JavaScript.
Selenium will save the day; all you lose is a couple of seconds for each attempt (it will be milliseconds if run in the frontend). You can grab the page source with driver.page_source and parse it directly (as HTML text) with BeautifulSoup or whatever.
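A minimal sketch of that Selenium approach, assuming Chrome plus a matching chromedriver are installed and borrowing the '.flex_tr' row selector from the requests-html answer below:

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

options = Options()
options.add_argument('--headless')  # no visible browser window
driver = webdriver.Chrome(options=options)

driver.get('https://www.nyse.com/quote/XNYS:A')
# wait until the JavaScript has rendered at least one table row
WebDriverWait(driver, 15).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, '.flex_tr')))

soup = BeautifulSoup(driver.page_source, 'html.parser')
for row in soup.select('.flex_tr'):
    print(row.get_text(' ', strip=True))

driver.quit()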
You can do it with requests-html; for example, let's grab the first row of the table:
from requests_html import HTMLSession
session = HTMLSession()
url = 'https://www.nyse.com/quote/XNYS:A'
r = session.get(url)
r.html.render(sleep=7)
first_row = r.html.find('.flex_tr', first=True)
print(first_row.text)
Output:
06/18/2021
146.31
146.83
144.94
145.01
3,220,680
As @Nikita said, you will have to wait for the page to load (here 7 seconds, but maybe less). If you want to do multiple requests, you can do them asynchronously; see the sketch below.
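A sketch of the asynchronous variant using requests-html's AsyncHTMLSession; the second ticker URL is only an illustrative assumption:

from functools import partial
from requests_html import AsyncHTMLSession

asession = AsyncHTMLSession()

async def first_row(url):
    r = await asession.get(url)
    await r.html.arender(sleep=7)  # render the JavaScript, as in the answer above
    row = r.html.find('.flex_tr', first=True)
    return url, row.text if row else None

urls = ['https://www.nyse.com/quote/XNYS:A', 'https://www.nyse.com/quote/XNYS:IBM']
# run both fetches on the same event loop; results come back as (url, text) pairs
results = asession.run(*(partial(first_row, u) for u in urls))
print(results)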

How to create a stock availability checker with python requests if JavaScript is used?

I wrote some code which should check whether a product is back in stock and, when it is, send me an email to notify me. This works when the things I'm looking for are in the HTML.
However, sometimes certain objects are loaded through JavaScript. How could I edit my code so that the web scraping also works with JavaScript?
This is my code thus far:
import time
import requests

while True:
    # Get the url of the IKEA page
    url = 'https://www.ikea.com/nl/nl/p/flintan-bureaustoel-vissle-zwart-20336841/'
    # Get the text from that page and put everything in lower case
    productpage = requests.get(url).text.lower()
    # Set the strings that should be on the page if the product is not available
    outofstockstrings = ['niet beschikbaar voor levering', 'alleen beschikbaar in de winkel']
    # Check whether the strings are in the text of the webpage
    if any(x in productpage for x in outofstockstrings):
        time.sleep(1800)
        continue
    else:
        # send me an email and break the loop
        break
Instead of scraping and analyzing the HTML, you could use the unofficial stock API that the IKEA website itself uses. That API returns JSON data, which is much easier to analyze, and you'll also get estimates of when the product will be back in stock.
There is even a project written in JavaScript/Node which provides this kind of information straight from the command line: https://github.com/Ephigenia/ikea-availability-checker
You can easily check the stock amount of the chair in all stores in the Netherlands:
npx ikea-availability-checker stock --country nl 20336841
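If you want to stay in Python, the same idea looks roughly like the sketch below. The endpoint URL, the store id and the JSON field names are placeholders (the unofficial API is not documented here), so treat this as the shape of the approach rather than working code; replace the URL with the one you see in your browser's network tab.

import time
import requests

ARTICLE = '20336841'   # FLINTAN desk chair, from the question
STORE_ID = '391'       # hypothetical store id; look up the id of your local store
# Hypothetical endpoint: substitute the real availability URL used by the IKEA site
API_URL = f'https://example-ikea-stock-api/availability/nl/{STORE_ID}/{ARTICLE}'

while True:
    data = requests.get(API_URL).json()
    # JSON is easy to inspect: assume a field like data['stock'] holds the available amount
    if data.get('stock', 0) > 0:
        print('Back in stock!')
        break
    time.sleep(1800)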

Airflow Custom Metrics and/or Result Object with custom fields

While running PySpark SQL pipelines via Airflow, I am interested in getting out some business stats like:
source read count
target write count
sizes of DFs during processing
error records count
One idea is to push them directly to the metrics, so they get automatically consumed by monitoring tools like Prometheus. Another idea is to obtain these values via some DAG result object, but I wasn't able to find anything about that in the docs.
Please post at least some pseudocode if you have a solution.
I would look to reuse Airflow's statistics and monitoring support in the airflow.stats.Stats class. Maybe something like this:
import logging

from airflow.stats import Stats

PYSPARK_LOG_PREFIX = "airflow_pyspark"

def your_python_operator(**context):
    [...]

    try:
        Stats.incr(f"{PYSPARK_LOG_PREFIX}_read_count", src_read_count)
        Stats.incr(f"{PYSPARK_LOG_PREFIX}_write_count", tgt_write_count)
        # So on and so forth
    except:
        logging.exception("Caught exception during statistics logging")

    [...]
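For the "DAG result object" part of the question, the closest built-in mechanism I'm aware of is XCom: the task pushes the counts, and a downstream task (or the UI's XCom view) can read them. A rough sketch, where the task id and keys are assumptions:

def your_python_operator(**context):
    # src_read_count / tgt_write_count would come out of your PySpark job
    context['ti'].xcom_push(key='src_read_count', value=src_read_count)
    context['ti'].xcom_push(key='tgt_write_count', value=tgt_write_count)

def report_stats(**context):
    ti = context['ti']
    read_count = ti.xcom_pull(task_ids='your_python_operator', key='src_read_count')
    write_count = ti.xcom_pull(task_ids='your_python_operator', key='tgt_write_count')
    print(f"source reads: {read_count}, target writes: {write_count}")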

How do you identify the request that a QNetworkReply finished signal is emitted in response to when you are doing multiple requests in QtNetwork?

I have a project that will load an HTTP page, parse it, and then open other pages based on the data it received from the first page.
Since Qt's QNetworkAccessManager works asynchronously, it seems I should be able to load more than one page at a time by continuing to make HTTP requests; the responses would then be handled by the event loop in whatever order the replies come back.
I'm having a few problems figuring out how to do this, though:
First, I read somewhere on Stack Overflow that you should use only one QNetworkAccessManager. I do not know if that is true.
The problem is that I'm connecting to the finished signal on the single QNetworkAccessManager. If I make more than one request at a time, I don't know which request a finished signal is in response to. Is there a way to inspect the QNetworkReply object that is passed by the signal to know which request it belongs to? Or should I actually be using a different QNetworkAccessManager for each request?
Here is an example of how I'm chaining stuff together right now, but I know this won't work when I'm doing more than one request at a time:
from PyQt4 import QtCore, QtGui, QtNetwork

class Example(QtCore.QObject):
    def __init__(self):
        super().__init__()
        self.QNetworkAccessManager_1 = QtNetwork.QNetworkAccessManager()
        self.QNetworkCookieJar_1 = QtNetwork.QNetworkCookieJar()
        self.QNetworkAccessManager_1.setCookieJar(self.QNetworkCookieJar_1)
        self.app = QtGui.QApplication([])

    def start_request(self):
        QUrl_1 = QtCore.QUrl('https://erikbandersen.com/')
        QNetworkRequest_1 = QtNetwork.QNetworkRequest(QUrl_1)
        #
        self.QNetworkAccessManager_1.finished.connect(self.someurl_finshed)
        self.QNetworkAccessManager_1.get(QNetworkRequest_1)

    def someurl_finshed(self, NetworkReply):
        # I do this so that this function won't get called for a different request
        # But it will only work if I'm doing one request at a time
        self.QNetworkAccessManager_1.finished.disconnect(self.someurl_finshed)
        page = bytes(NetworkReply.readAll())
        # Do something with it
        print(page)

        QUrl_1 = QtCore.QUrl('https://erikbandersen.com/ipv6/')
        QNetworkRequest_1 = QtNetwork.QNetworkRequest(QUrl_1)
        #
        self.QNetworkAccessManager_1.finished.connect(self.someurl2_finshed)
        self.QNetworkAccessManager_1.get(QNetworkRequest_1)

    def someurl2_finshed(self, NetworkReply):
        page = bytes(NetworkReply.readAll())
        # Do something with it
        print(page)

kls = Example()
kls.start_request()
kls.app.exec_()  # run the Qt event loop so the network replies can actually arrive
I am not familiar with PyQt, but from a general Qt programming point of view:
Using only one QNetworkAccessManager is the right design choice.
The finished signal provides the QNetworkReply*; with that, you can identify the corresponding request using QNetworkReply::request().
I hope this will solve your problem with one manager and multiple requests.
This is a C++ example doing the same.
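A minimal PyQt sketch of that idea, with a single manager whose finished handler dispatches on the reply's originating request URL (the URLs are taken from the question's example code):

from PyQt4 import QtCore, QtGui, QtNetwork

app = QtGui.QApplication([])
manager = QtNetwork.QNetworkAccessManager()

urls = ['https://erikbandersen.com/', 'https://erikbandersen.com/ipv6/']
pending = len(urls)

def on_finished(reply):
    global pending
    # reply.request() returns the QNetworkRequest this reply answers,
    # so its URL identifies which of the outstanding requests just finished
    print(reply.request().url().toString(), len(bytes(reply.readAll())))
    reply.deleteLater()
    pending -= 1
    if pending == 0:
        app.quit()  # stop the event loop once every reply has arrived

manager.finished.connect(on_finished)

# fire all requests at once through the single manager
for url in urls:
    manager.get(QtNetwork.QNetworkRequest(QtCore.QUrl(url)))

app.exec_()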
