How to extract data that loads differently with scrapy - web-scraping

I’m trying to extract product reviews from URLs like this one:
https://www.namastevaporizers.com/products/mighty-vaporizer
The spider I have extracts everything else on the page but nothing from the comments. I think it is because the comments load differently, but unfortunately this is where my knowledge of Scrapy ends. Can anyone help me with this?
Here is my spider:
import scrapy
from scrapy.loader import ItemLoader
from scrapy.loader.processors import TakeFirst, MapCompose
from scrapy.spiders import Spider
from Namaste.items import NPPItem


class NPP(Spider):
    name = 'Product_Pages'
    start_urls = ['https://www.namastevaporizers.com/products/mighty-vaporizer']

    def parse(self, response):
        item_loader = ItemLoader(item=NPPItem(), response=response)
        item_loader.add_css("Z_reviews", "div.yotpo-user-name")  # gets nothing
        item_loader.add_css("Z_reviews", "div.content-title")    # gets nothing
        item_loader.add_css("Z_reviews", "div.content-review")   # gets nothing
        item_loader.add_css("Z_reviews", "div.yotpo")  # gets some data but misses most of it; this is the entire yotpo content wrapper
        item_loader.add_value("AAE_source_url", response.url)  # works fine
        return item_loader.load_item()

The reviews on this site are loaded by JavaScript, so you need to forge the request the same way your Chrome does.
Follow these steps and you will get the result:
Open your Chrome dev tools, switch to the Network tab, and search (note: search, not filter) for part of a review's text; you will see the request that carries the reviews (I got the request URL https://staticw2.yotpo.com/batch).
Copy the request as a cURL command in Chrome.
Execute the cURL command in a shell; if it succeeds, the next step is to translate the cURL command into Python and forge the same request in code (the cURL command actually works on this site, I tried it).
You can convert the cURL command to Python at https://curl.trillworks.com/#python.
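For orientation, the converted code usually ends up as a plain requests.post call along these lines. This is only a sketch: the headers and payload shown here are placeholders, and the real values (Yotpo app key, product id, paging) must be copied from the request you captured in DevTools.

import requests

url = 'https://staticw2.yotpo.com/batch'

# Placeholder headers -- replace them with the ones from your captured cURL command.
headers = {
    'Content-Type': 'application/json',
    'User-Agent': 'Mozilla/5.0',
}

# Placeholder body -- the real request carries the Yotpo app key, product id
# and paging information, all visible in the captured request.
payload = '{"methods": [...], "app_key": "..."}'

response = requests.post(url, headers=headers, data=payload)
print(response.status_code)
print(response.text[:500])  # the review content comes back in this response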

Related

Splash is unable to extract elements

I am trying to scrape https://www.lithia.com/new-inventory/index.htm,
but it seems that Splash is unable to extract even simple elements on the page.
I tried to extract an element from the page with the appropriate XPath using both a Scrapy project (Python) and the Splash site (http://0.0.0.0:8050/), but Splash is unable to extract the element.
Code (I have simplified it so it is easier to convey and debug):
import scrapy
from scrapy_splash import SplashRequest
from time import sleep


class CarSpider(scrapy.Spider):
    name = 'car1'
    allowed_domains = ['lithia.com']
    start_urls = ['https://www.lithia.com/baierl-auto-group/new-inventory.htm']

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url=url,
                                callback=self.parse,
                                endpoint='render.html')

    def parse(self, response):
        sleep(5)
        year = response.xpath('//span[contains(@class, "facet-list-group-label") and contains(text(), "Year")]')
        sleep(5)
        yield {
            'year': year,
        }
It returns:
{'year': []}
Meaning it is not extracted.
I checked the Splash site (http://0.0.0.0:8050/) as well, and lots of elements are not displayed in the HTML output. It seems like there is some rendering issue.
Following that, I came across this page (https://splash.readthedocs.io/en/stable/faq.html#website-is-not-rendered-correctly), which lists possible ways to debug rendering issues in Splash:
I have tried:
Turning off private mode
Tuning splash:wait()
Setting splash:set_viewport_full()
Adding splash:set_user_agent()
Enabling plugins via splash.plugins_enabled
Setting the splash.html5_media_enabled property to enable HTML5 media
But so far, I am still unable to extract the element. In fact, lots of other elements cannot be extracted either; the one above is just an example.
Please Help.
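For reference, all of the settings listed above can be combined in a single Lua script sent through Splash's execute endpoint. The sketch below shows one way to wire them together; the user agent string and the 5-second wait are placeholders, and there is no guarantee this is enough for this particular site.

import scrapy
from scrapy_splash import SplashRequest

LUA_SCRIPT = """
function main(splash, args)
    -- settings from the FAQ, applied before loading the page
    splash.private_mode_enabled = false
    splash.plugins_enabled = true
    splash.html5_media_enabled = true
    splash:set_user_agent("Mozilla/5.0 (Windows NT 10.0; Win64; x64)")  -- placeholder UA
    assert(splash:go(args.url))
    assert(splash:wait(5))
    splash:set_viewport_full()
    return splash:html()
end
"""


class CarSpider(scrapy.Spider):
    name = 'car1'
    start_urls = ['https://www.lithia.com/baierl-auto-group/new-inventory.htm']

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url, callback=self.parse,
                                endpoint='execute',
                                args={'lua_source': LUA_SCRIPT})

    def parse(self, response):
        # the response body is whatever the Lua script returned (splash:html())
        year = response.xpath('//span[contains(@class, "facet-list-group-label") and contains(text(), "Year")]').getall()
        yield {'year': year}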

Execute Another Spider with Selenium via Scrapy Item Pipeline

Hope you guys can help me with this, as I'm kind of new to Scrapy and web scraping in general and not really sure how to proceed with this problem.
Essentially, I have 2 spiders:
audio_file_downloader_spider, This spider is going to:
check whether a particular webpage passed to it contains an audio file (.mp3/.wav, etc.)
If an audio file URL is found on the webpage, download the audio file to the local machine.
If there is NO audio file URL found on the webpage, tell audio_recorder_spider to scrape the webpage.
audio_recorder_spider, which is going to use Selenium with chromedriver and will:
Press the audio player play button on the webpage
Record the playback stream to an mp3 file (this is definitely doable)
Now... the problem I'm currently facing is: how do we do something like this with Scrapy?
Currently, I have both spiders ready and I can run the audio_file_downloader_spider with this command from the terminal (I'm using macOS):
scrapy crawl audio_file_downloader_spider \
-a url="<some_webpage_url>"
Now, I need to somehow tell Scrapy to execute audio_recorder_spider on the same webpage URL IF and ONLY IF no audio file URL is detected on that webpage, so that audio_recorder_spider can record the audio playback there.
I'm not too familiar with Scrapy just yet, but I did read their item pipeline documentation. One of the examples in the documentation shows code which automatically takes a screenshot of a URL using Splash and saves that screenshot as a PNG file with a custom name. The code can be seen below:
import hashlib
from urllib.parse import quote

import scrapy
from itemadapter import ItemAdapter
from scrapy.utils.defer import maybe_deferred_to_future


class ScreenshotPipeline:
    """Pipeline that uses Splash to render screenshot of
    every Scrapy item."""

    SPLASH_URL = "http://localhost:8050/render.png?url={}"

    async def process_item(self, item, spider):
        adapter = ItemAdapter(item)
        encoded_item_url = quote(adapter["url"])
        screenshot_url = self.SPLASH_URL.format(encoded_item_url)
        request = scrapy.Request(screenshot_url)
        response = await maybe_deferred_to_future(
            spider.crawler.engine.download(request, spider))

        if response.status != 200:
            # Error happened, return item.
            return item

        # Save screenshot to file, filename will be hash of url.
        url = adapter["url"]
        url_hash = hashlib.md5(url.encode("utf8")).hexdigest()
        filename = f"{url_hash}.png"
        with open(filename, "wb") as f:
            f.write(response.body)

        # Store filename in item.
        adapter["screenshot_filename"] = filename
        return item
So this got me thinking: would it be possible to do the same thing, but instead of using Splash and taking a screenshot of the webpage, use Selenium and record the audio playback from the URL?
Any help would be greatly appreciated, thanks in advance!
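In principle the same pipeline pattern can hold a Selenium driver instead of calling Splash. The sketch below is only an outline of that idea, not a working recorder: the item field names, the play-button selector and the record_playback() helper are all hypothetical placeholders standing in for whatever audio_recorder_spider currently does.

from itemadapter import ItemAdapter
from selenium import webdriver
from selenium.webdriver.common.by import By


class AudioFallbackPipeline:
    """Falls back to Selenium when an item carries no direct audio file URL."""

    def open_spider(self, spider):
        self.driver = webdriver.Chrome()  # assumes chromedriver is on PATH

    def close_spider(self, spider):
        self.driver.quit()

    def process_item(self, item, spider):
        adapter = ItemAdapter(item)
        if adapter.get("audio_file_url"):   # hypothetical field name
            return item                     # the downloader spider already handled it

        # No audio file URL found: drive the page with Selenium instead.
        self.driver.get(adapter["url"])
        play_button = self.driver.find_element(
            By.CSS_SELECTOR, ".audio-player .play")  # placeholder selector
        play_button.click()
        # record_playback() stands in for the recording logic that currently
        # lives in audio_recorder_spider:
        # adapter["recording_filename"] = record_playback(self.driver)
        return item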

Request does not return the actual value

I have written the following code and it works fine. I really enjoyed it, because I am quite new to Python requests and even Python 3, but the following day I noticed that the price variable was not updated. And it has not updated any time I have run the code for a week (it stays at 709.49, if that matters). I think it is not a secret, so I pasted the whole code below with the link to the website.
So I want to ask whether I wrote something the wrong way, or whether the web page is just not that simple to request. Could you tell me what happened?
Here is the original code:
import requests
import re
from bs4 import BeautifulSoup
pattern = r'\d+\.?\d*'
site_doc = requests.get('https://bitbay.net/pl/kurs-walut/kurs-ethereum-pln').text
soup = BeautifulSoup(site_doc, 'html.parser')
price = str(soup.select('title'))
price = re.findall(pattern, price)
print(price)
Thanks in advance!
The reason this doesn't work is that the content you are trying to get is rendered by JavaScript. For this, I'd recommend using Selenium in order to get the JavaScript-rendered content.
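A minimal sketch of that approach, assuming chromedriver is installed and on PATH; the fixed sleep is a crude stand-in for a proper wait:

import re
import time

from selenium import webdriver

pattern = r'\d+\.?\d*'

driver = webdriver.Chrome()
driver.get('https://bitbay.net/pl/kurs-walut/kurs-ethereum-pln')
time.sleep(5)  # give the JavaScript time to update the price
price = re.findall(pattern, driver.title)  # same title-based extraction as before
print(price)
driver.quit()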

scrapy script - Load youtube csv or xml file into wordpress player or custom plugin

I have Scrapy spider code which will scrape a webpage and pull the YouTube video links into a file. I am trying to get Scrapy to output the URLs as one string rather than an array.
This way my output is one URL without quotes, and then I wish to add text after the URL: ",&source=Open YouTube Playlist"
This way I can load the FULL URL into a WordPress web player, either native or via a custom plugin, and it will auto-create a YouTube playlist out of my output.
Maybe I am not thinking clearly? Is there a better way to accomplish the same goal?
import scrapy


class PdgaSpider(scrapy.Spider):
    name = "pdgavideos"
    start_urls = ["http://www.pdga.com/videos/"]

    def parse(self, response):
        for link in response.xpath('//td[2]/a/@href').extract():
            yield scrapy.Request(response.urljoin(link),
                                 callback=self.parse_page)

        # If page contains link to next page extract link and parse
        next_page = response.xpath(
            '//a[contains(., "Go\s+to\s+page\s+2")]/@href').extract_first()
        if next_page:
            yield scrapy.Request(response.urljoin(next_page),
                                 callback=self.parse)

    # Youtube link 1st pass
    def parse_page(self, response):
        link = response.xpath('//iframe/@src').extract_first()
        linkprune = link.split('/embed/')[1]
        output = linkprune.split('?')[0]
        yield {
            'https://www.youtube.com/watch_videos?video_ids=': output + ','
        }
Current Output
https://www.youtube.com/watch_videos?video_ids=
"mueStjvHneI,"
"X7HfQL4fYgQ,"
"UtnR4gPMs_Q,"
"Kd9pbiKQqr4,"
"AokjaT-CnBk,"
"VdvhAsX6buo,"
"pF-XykcAqz8,"
"Fl0DDmx-jZw,"
"dpzLDiuQq9o,"
"J2_bl0zI504,"
...
Aiming to achieve
https://www.youtube.com/watch_videos?video_ids=mueStjvHneI,X7HfQL4fYgQ,UtnR4gPMs_Q,Kd9pbiKQqr4,VdvhAsX6buo,pF-XykcAqz8,dpzLDiuQq9o,&source=Open YouTube Playlist
If you load this URL, it will create a beautiful Youtube list.
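One way to get a single comma-joined URL is to accumulate the video ids on the spider instance and emit the finished playlist URL once the crawl ends, in the spider's closed() hook. A sketch along those lines (untested against the live site; the next-page XPath is simplified to a literal match, since contains() does not interpret \s+ as a regex):

import scrapy


class PdgaVideosSpider(scrapy.Spider):
    name = 'pdgavideos'
    start_urls = ['http://www.pdga.com/videos/']

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.video_ids = []

    def parse(self, response):
        for link in response.xpath('//td[2]/a/@href').extract():
            yield scrapy.Request(response.urljoin(link), callback=self.parse_page)

        next_page = response.xpath('//a[contains(., "Go to page 2")]/@href').extract_first()
        if next_page:
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)

    def parse_page(self, response):
        link = response.xpath('//iframe/@src').extract_first()
        if link and '/embed/' in link:
            # collect the bare video id instead of yielding it right away
            self.video_ids.append(link.split('/embed/')[1].split('?')[0])

    def closed(self, reason):
        # build the playlist URL once, after all pages have been parsed
        playlist_url = ('https://www.youtube.com/watch_videos?video_ids='
                        + ','.join(self.video_ids)
                        + '&source=Open YouTube Playlist')
        self.logger.info(playlist_url)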

Scrapy spider will not crawl on start urls

I am brand new to Scrapy and have worked my way through the tutorial, and I am trying to figure out how to apply what I have learned so far to complete a seemingly basic task. I know very little Python so far and am using this as a learning experience, so if I ask a simple question, I apologize.
My goal for this program is to follow this link http://ucmwww.dnr.state.la.us/ucmsearch/FindDocuments.aspx?idx=xwellserialnumber&val=971683 and to extract the well serial number to a CSV file. Eventually I want to run this spider on several thousand different well files and retrieve specific data. However, I am starting with the basics first.
Right now the spider does not crawl any web page that I enter. There are no errors when I run it; it just states that 0 pages were crawled. I can't quite figure out what I am doing wrong. I am positive the start URL is OK, as I have checked it. Do I need a specific type of spider to accomplish what I am trying to do?
import scrapy
from scrapy import Spider
from scrapy.selector import Selector
# the items import was missing from the original snippet; the module path below is a guess
from myproject.items import SonrisdataaccessItem


class Sonrisdataaccess(Spider):
    name = "serial"
    allowed_domains = ["sonris.com"]
    start_urls = [
        "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=972498"]

    def parse(self, response):
        questions = Selector(response).xpath('/html/body/table[1]/tbody/tr[2]/td[1]')
        for question in questions:
            item = SonrisdataaccessItem()
            item['serial'] = question.xpath('/html/body/table[1]/tbody/tr[2]/td[1]').extract()[0]
            yield item
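For completeness, SonrisdataaccessItem is not shown in the question; it would be defined in the project's items.py roughly like this (a sketch based on the single field the spider uses):

# items.py -- minimal item definition matching the spider above
import scrapy


class SonrisdataaccessItem(scrapy.Item):
    serial = scrapy.Field()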
Thank you for any help, I greatly appreciate it!
First of all, I do not understand what you are doing in your for loop: if you already have a selector, you do not select from the whole HTML again inside it...
Nevertheless, the interesting part is that the browser represents the table differently from how it is downloaded by Scrapy. If you look at the response in your parse method, you will see that there is no tbody element in the first table. This is why your selection does not return anything.
So to get the first serial number (as it is in your XPath), change your parse function to this:
def parse(self, response):
    item = SonrisdataaccessItem()
    item['serial'] = response.xpath('/html/body/table[1]/tr[2]/td[1]/text()').extract()[0]
    yield item
For later changes you may have to alter the XPath expression to get more data.
