Execute Another Spider with Selenium via Scrapy Item Pipeline - web-scraping

Hope you guys can help me with this, as I'm kinda new to scrapy/web scraping in general and not really sure how to proceed with this problem.
Essentially, I have 2 spiders:
audio_file_downloader_spider, which is going to:
check whether a particular webpage passed to it contains an audio file (.mp3/.wav, etc.)
if an audio file URL is found on the webpage, download the audio file to the local machine
if NO audio file URL is found on the webpage, tell audio_recorder_spider to scrape the webpage
audio_recorder_spider, which is going to use Selenium with ChromeDriver and will:
Press the audio player play button on the webpage
Record the playback stream to an mp3 file (this is definitely doable)
Now... the problem I'm currently facing is: how do we do something like this with scrapy?
Currently, I have both spiders ready and I can run the audio_file_downloader_spider with this command from the terminal (I'm using macOS):
scrapy crawl audio_file_downloader_spider \
-a url="<some_webpage_url>"
Now, I need to somehow tell scrapy to execute audio_recorder_spider on the same webpage URL IF and ONLY IF there is no audio file URL detected on the webpage, so that audio_recorder_spider can record the audio playback there.
Now, I'm not too familiar with scrapy just yet, but I did read their item pipeline documentation. One of the examples in that documentation shows code which automatically takes a screenshot of a URL using Splash and saves that screenshot as a PNG file with a custom name. The code can be seen below:
import hashlib
from urllib.parse import quote

import scrapy
from itemadapter import ItemAdapter
from scrapy.utils.defer import maybe_deferred_to_future


class ScreenshotPipeline:
    """Pipeline that uses Splash to render screenshot of
    every Scrapy item."""

    SPLASH_URL = "http://localhost:8050/render.png?url={}"

    async def process_item(self, item, spider):
        adapter = ItemAdapter(item)
        encoded_item_url = quote(adapter["url"])
        screenshot_url = self.SPLASH_URL.format(encoded_item_url)
        request = scrapy.Request(screenshot_url)
        response = await maybe_deferred_to_future(
            spider.crawler.engine.download(request, spider)
        )

        if response.status != 200:
            # Error happened, return item.
            return item

        # Save screenshot to file, filename will be hash of url.
        url = adapter["url"]
        url_hash = hashlib.md5(url.encode("utf8")).hexdigest()
        filename = f"{url_hash}.png"
        with open(filename, "wb") as f:
            f.write(response.body)

        # Store filename in item.
        adapter["screenshot_filename"] = filename
        return item
So this got me thinking: would it be possible to do the same thing, but instead of using Splash to take a screenshot of the webpage, use Selenium to record the audio playback from the URL?
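As a very rough sketch of what I imagine (not working code: the item fields audio_file_url / audio_filename and the record_playback helper are just placeholders I made up), the pipeline would fall back to Selenium when the spider found no direct audio link:

import hashlib

from itemadapter import ItemAdapter
from selenium import webdriver


class AudioRecorderPipeline:
    """Falls back to Selenium playback recording when no audio file URL
    was found by audio_file_downloader_spider."""

    def open_spider(self, spider):
        self.driver = webdriver.Chrome()

    def close_spider(self, spider):
        self.driver.quit()

    def process_item(self, item, spider):
        adapter = ItemAdapter(item)
        if adapter.get("audio_file_url"):
            # A direct .mp3/.wav link was found, nothing to record here.
            return item
        url = adapter["url"]
        self.driver.get(url)
        filename = hashlib.md5(url.encode("utf8")).hexdigest() + ".mp3"
        # record_playback(self.driver, filename)  # placeholder for my recording logic
        adapter["audio_filename"] = filename
        return item

That way there would only be one spider plus a pipeline instead of two spiders, but I'm not sure this is the idiomatic way to do it.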
Any help would be greatly appreciated, thanks in advance!

Related

scraping a tab in JS website without having to click on the tab

I am trying to scrape this website: https://www.casablanca-bourse.com/bourseweb/Societe-Cote.aspx?codeValeur=12200 The problem is I only want to extract data from the tab "indicateurs clès" and I can't find a way to access it in the page source without clicking on it.
Indeed, I can't figure out the URL of this specific tab... I checked the page source and found that there is generated code that changes whenever I click on that tab.
Any suggestions?
Thanks in advance
The problem is that this website uses AJAX to get the table in the "Indicateurs Clès", so it is requested from the server only when you click on the tab. To scrape the data, you should send the same request to the server. In other words, try to mimic the browser's behavior.
You can do it this way (for Chromium; for other browsers with DevTools it's pretty much similar):
Press F12 to open the DevTools.
Switch to the "Network" tab.
Select Fetch/XHR filter.
Click on the "Indicateurs Clès" tab on the page.
Inspect the new request(s) you see in the DevTools.
Once you find the request that returns the information you need ("Preview" and "Response"), right-click the request and select "Copy as cURL".
Go to https://curl.trillworks.com/
Select the programming language you're using for scraping
Paste the cURL to the left (into the "curl command" textarea).
Copy the code that appeared on the right and work with it. In some cases, you might need to inspect the request further and modify it.
In this particular case, the request data contains `__VIEWSTATE` and other info, which is used by the server to send only the data necessary to update the already existing table.
At the same time, you can omit everything but the __EVENTTARGET (the tab ID) and codeValeur. In such a case the server will return page XHTML, which includes the whole table. After that, you can parse that table and get all you need.
I don't know what tech stack you were initially going to use for scraping the website, but here is how you can get the tables with Python requests and BeautifulSoup4:
import requests
from bs4 import BeautifulSoup

params = (
    ('codeValeur', '12200'),
)

data = {
    '__EVENTTARGET': 'SocieteCotee1$LBFicheTech',
}

response = requests.post('https://www.casablanca-bourse.com/bourseweb/Societe-Cote.aspx', params=params, data=data)
soup = BeautifulSoup(response.content, 'html.parser')
# Parse the XHTML to get exactly the data you need
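For example, one way to pull the table cells out of the returned fragment (the exact markup of the response is an assumption here, so the selectors may need adjusting):

# Walk every table in the returned XHTML and print the text of each row.
for table in soup.find_all('table'):
    for row in table.find_all('tr'):
        cells = [cell.get_text(strip=True) for cell in row.find_all(['td', 'th'])]
        print(cells)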

Creating a macro of mouse clicks

Is it possible to create an automated Python script/macro for a series of mouse clicks? The goal is to open a webpage, click a button to open an upload-data window, and finally hit the save button to create a process. I am thinking of something equivalent to automated VBA macros, which are recorded as operations are performed on sheets.
In the past I have used the pyautogui package for this, but it requires hard-coding coordinates for each mouse click and is hence tedious to code.
Maybe try to use selenium with python...
Check the docs and examples.
An easy example would be:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
driver = webdriver.Firefox()
driver.get("http://www.python.org")
assert "Python" in driver.title
elem = driver.find_element_by_name("q")
elem.clear()
elem.send_keys("pycon")
elem.send_keys(Keys.RETURN)
assert "No results found." not in driver.page_source
driver.close()
To download a file with Firefox, try:
from selenium import webdriver
# To prevent download dialog
profile = webdriver.FirefoxProfile()
profile.set_preference('browser.download.folderList', 2) # custom location
profile.set_preference('browser.download.manager.showWhenStarting', False)
profile.set_preference('browser.download.dir', '/tmp')
profile.set_preference('browser.helperApps.neverAsk.saveToDisk', 'text/csv')
browser = webdriver.Firefox(profile)
browser.get("http://www.drugcite.com/?q=ACTIMMUNE")
browser.find_element_by_id('exportpt').click()
browser.find_element_by_id('exporthlgt').click()
Another option would be to use Python's webbrowser module.
Automate the Boring Stuff gives some good examples.
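For instance, a minimal sketch of the webbrowser route (it can only open pages in the default browser, not click anything on them):

import webbrowser

# Opens the page in a new tab of the system's default browser.
webbrowser.open_new_tab('http://www.python.org')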

How to extract data that loads differently with scrapy

I’m trying to extract product reviews on URLs like this one
https://www.namastevaporizers.com/products/mighty-vaporizer
The spider I have extracts everything on the page but nothing from the comments. I think it is because the comments load differently, but unfortunately this is where my knowledge of scrapy ends. Can anyone help me with this?
Here is my spider:
import scrapy
from scrapy.loader import ItemLoader
from scrapy.loader.processors import TakeFirst, MapCompose
from scrapy.spiders import Spider
from Namaste.items import NPPItem


class NPP(Spider):
    name = 'Product_Pages'
    start_urls = ['https://www.namastevaporizers.com/products/mighty-vaporizer']

    def parse(self, response):
        item_loader = ItemLoader(item=NPPItem(), response=response)
        item_loader.add_css("Z_reviews", "div.yotpo-user-name")  # gets nothing
        item_loader.add_css("Z_reviews", "div.content-title")  # gets nothing
        item_loader.add_css("Z_reviews", "div.content-review")  # gets nothing
        item_loader.add_css("Z_reviews", "div.yotpo")  # gets some data but missing most stuff; this is the entire yotpo content wrapper
        item_loader.add_value("AAE_source_url", response.url)  # works fine
        return item_loader.load_item()
The reviews on this site are loaded by JS, so you need to forge the request the same way your Chrome does.
Follow these steps and you will get the result:
Open your Chrome DevTools, switch to the Network tab, and search (note: it's search, not filter) for some review content; you will see the request (I got the request URL https://staticw2.yotpo.com/batch).
Copy the cURL command in Chrome.
Execute the cURL in a shell; if it succeeds, the next step is to parse the cURL and forge it in Python code (the cURL actually works on this site, I tried).
You can parse the cURL at https://curl.trillworks.com/#python; a rough sketch of the result is shown below.
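A very rough sketch of what the forged request might look like (the headers and payload here are placeholders; the real values have to come from the cURL you copied in DevTools):

import requests

# Placeholder request only: copy the real headers and JSON body from the
# cURL command captured in DevTools, otherwise Yotpo will not return reviews.
headers = {'Content-Type': 'application/json'}
payload = {}  # fill in with the body copied from the DevTools request
response = requests.post('https://staticw2.yotpo.com/batch',
                         headers=headers, json=payload)
print(response.status_code)
print(response.text[:500])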

Scrapy does not find text in Xpath or Css

I've been at this one for a few days, and no matter how I try, I cannot get scrapy to extract the text that is in one element.
To spare you all the code, here are the important pieces. The setup does grab everything else off the page, just not this text.
from scrapy.selector import Selector
start_url = "https://www.tripadvisor.com/VacationRentalReview-g34416-d12428323-On_the_Beach_Wide_flat_beach_Sunsets_Gulf_view_Sharks_teeth_Shells_Fish-Manasota_Key_F.html"
#BASIC ITEM AND SPIDER YADA, SPARE YOU THE DETAILS
hxs = Selector(response)
response_css = response.css("body")
desc_data = hxs.xpath('//*[@id="DETAILS_TRUNC_TEXT"]//text()').extract()
desc_data2 = response_css.css('#DETAILS_TRUNC_TEXT::text').extract()
both return empty lists. Yes, I found the xpath and css selector via chrome, but the rest of them work just fine as I'm able to find other data on the site. Please help me find out why this isn't working.
To get the data you need to use a browser simulator like Selenium so that it can catch the response containing the dynamically generated content. You also need to add some delay to let the webpage load its content fully. This is how you can go:
from selenium import webdriver
from scrapy import Selector
import time
driver = webdriver.Chrome()
URL = "https://www.tripadvisor.com/VacationRentalReview-g34416-d12428323-On_the_Beach_Wide_flat_beach_Sunsets_Gulf_view_Sharks_teeth_Shells_Fish-Manasota_Key_F.html"
driver.get(URL)
time.sleep(5) #If you take out this line you won't get anything because the content of that page take some time to get loaded.
sel = Selector(text=driver.page_source)
item = sel.css('#DETAILS_TRUNC_TEXT::text').extract() #It is working
item_ano = sel.xpath('//*[@id="DETAILS_TRUNC_TEXT"]//text()').extract() #It is also working
print(item, item_ano)
driver.quit()
I tried your xpath and css in scrapy shell, and got nothing also.
Then I used the view(response) command and found out the site is dynamic: the details under Overview don't show up in the downloaded response, and that's why no matter how you try, you still get nothing.
Solutions: Try Selenium (check the solution that SIM provided in the last answer) or Splash.
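If you go the Splash route, a minimal sketch with scrapy-splash could look like this (the spider name and wait time are just examples; it assumes a Splash instance is running and scrapy-splash is configured in settings.py):

import scrapy
from scrapy_splash import SplashRequest


class TruncTextSpider(scrapy.Spider):
    name = 'trunc_text'

    def start_requests(self):
        url = 'https://www.tripadvisor.com/VacationRentalReview-g34416-d12428323-On_the_Beach_Wide_flat_beach_Sunsets_Gulf_view_Sharks_teeth_Shells_Fish-Manasota_Key_F.html'
        # Let Splash render the JS before the response reaches parse().
        yield SplashRequest(url, self.parse, args={'wait': 5})

    def parse(self, response):
        yield {'description': response.css('#DETAILS_TRUNC_TEXT::text').getall()}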
Good Luck. :)

scrapy script - Load youtube csv or xml file into wordpress player or custom plugin

I have scrapy spider code which will scrape a webpage and pull the YouTube video links into a file. I am trying to get scrapy to output the URLs as plain strings rather than an array.
That way my output is one URL without quotes, and then I wish to add text after the URL: ",&source=Open YouTube Playlist".
That way I can load the FULL URL into a WordPress web player, natively or via a plugin, and it will auto-create a YouTube playlist out of my output.
Maybe I am not thinking clearly? Is there a better way to accomplish the same goal?
import scrapy


class PdgaSpider(scrapy.Spider):
    name = "pdgavideos"
    start_urls = ["http://www.pdga.com/videos/"]

    def parse(self, response):
        for link in response.xpath('//td[2]/a/@href').extract():
            yield scrapy.Request(response.urljoin(link),
                                 callback=self.parse_page)

        # If page contains link to next page extract link and parse
        next_page = response.xpath(
            '//a[contains(., "Go\s+to\s+page\s+2")]/@href').extract_first()
        if next_page:
            yield scrapy.Request(response.urljoin(next_page),
                                 callback=self.parse)

    # Youtube link 1st pass
    def parse_page(self, response):
        link = response.xpath('//iframe/@src').extract_first()
        linkprune = link.split('/embed/')[1]
        output = linkprune.split('?')[0]
        yield {
            'https://www.youtube.com/watch_videos?video_ids=': output + ','
        }
Current Output
https://www.youtube.com/watch_videos?video_ids=
"mueStjvHneI,"
"X7HfQL4fYgQ,"
"UtnR4gPMs_Q,"
"Kd9pbiKQqr4,"
"AokjaT-CnBk,"
"VdvhAsX6buo,"
"pF-XykcAqz8,"
"Fl0DDmx-jZw,"
"dpzLDiuQq9o,"
"J2_bl0zI504,"
...
Aiming to achieve
https://www.youtube.com/watch_videos?video_ids=mueStjvHneI,X7HfQL4fYgQ,UtnR4gPMs_Q,Kd9pbiKQqr4,VdvhAsX6buo,pF-XykcAqz8,dpzLDiuQq9o,&source=Open YouTube Playlist
If you load this URL, it will create a beautiful Youtube list.
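One idea I'm considering (not sure if it's the idiomatic scrapy way) is to collect the IDs on the spider itself and join them only when the crawl finishes, roughly like this sketch (pagination left out, and the spider name is just a placeholder):

import scrapy


class PdgaPlaylistSpider(scrapy.Spider):
    name = 'pdgavideos_playlist'
    start_urls = ['http://www.pdga.com/videos/']

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.video_ids = []

    def parse(self, response):
        for link in response.xpath('//td[2]/a/@href').extract():
            yield scrapy.Request(response.urljoin(link),
                                 callback=self.parse_page)

    def parse_page(self, response):
        # Keep only the video id from the embed URL.
        link = response.xpath('//iframe/@src').extract_first()
        self.video_ids.append(link.split('/embed/')[1].split('?')[0])

    def closed(self, reason):
        # Build the single playlist URL once every page has been scraped.
        playlist = ('https://www.youtube.com/watch_videos?video_ids='
                    + ','.join(self.video_ids)
                    + '&source=Open YouTube Playlist')
        self.logger.info(playlist)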
