Why does extracting `span` return `None` using BeautifulSoup4? - web-scraping

I want to scrape this site and extract products (title, price), but when I use the <span> tag to extract titles, it doesn't work.
I think I am using the wrong tag.
import requests
from bs4 import BeautifulSoup

url = "https://www.banimode.com/1505/%D9%BE%D8%B1%D9%81%D8%B1%D9%88%D8%B4-%D8%AA%D8%B1%DB%8C%D9%86-%D9%85%D8%AD%D8%B5%D9%88%D9%84%D8%A7%D8%AA?page=2"
page = requests.get(url)
soup = BeautifulSoup(page.content, "html.parser")
print(soup.span.string)
But it returns None.

First of all, always take a look at your soup to see if all the expected ingredients are there.
Here the content is rendered dynamically and loaded via XHR, which can be inspected via your browser's dev tools.
Example:

import requests

data = requests.get('https://mobapi.banimode.com/api/v1/products?page_size=24&page=1').json()['data']['data']

for i in data:
    print(i['product_name'], i['product_price'])
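Since the original URL targets page 2, the same endpoint can presumably be paged the same way; a minimal sketch, assuming the API keeps honoring the page and page_size parameters shown above:

import requests

for page in range(1, 4):  # first three pages, as an illustration
    url = f'https://mobapi.banimode.com/api/v1/products?page_size=24&page={page}'
    data = requests.get(url).json()['data']['data']
    for product in data:
        print(product['product_name'], product['product_price'])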

Related

Why I couldn't get the search keywords in the AlphaFold Protein Structure Database using Beautiful Soup and requests.get

I've been trying to scrape the search results of the AlphaFold Protein Structure Database and couldn't find the desired information in the scraping result.
My idea is that, e.g., if I put the search keyword "Alpha-elapitoxin-Oh2b" in the search bar and click the search button, it generates a new page with the URL:
https://alphafold.ebi.ac.uk/search/text/Alpha-elapitoxin-Oh2b
In Google Chrome, I used "inspect" to check the code for this page and found my desired search result, i.e. the ID for this protein: P82662.
However, when I used requests and bs4 to scrape this page, I couldn't find the desired "P82662" in the returned information, nor even the search words "Alpha-elapitoxin-Oh2b".
import requests
from bs4 import BeautifulSoup
response = requests.get('https://alphafold.ebi.ac.uk/search/text/Alpha-elapitoxin-Oh2b')
html = response.text
soup = BeautifulSoup(html, "html.parser")
print(soup.prettify())
I searched StackOverflow for a solution to not being able to find the result with BS4 and requests, and someone said it is because the search result page is rendered with JavaScript. So is it true? How can I solve this problem?
Thanks!
The desired search data is loaded dynamically from an external source via an API, as JSON, using a GET request, so bs4 gets an empty ResultSet.
import requests

res = requests.get('https://alphafold.ebi.ac.uk/api/search?q=%28text%3A%2aAlpha%5C-elapitoxin%5C-Oh2b%20OR%20text%3AAlpha%5C-elapitoxin%5C-Oh2b%2a%29&type=main&start=0&rows=20')

for item in res.json()['docs']:
    id_num = item['uniprotAccession']
    print(id_num)
Output:
P82662
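For other search terms, the same endpoint can presumably be queried by rebuilding the q parameter; a minimal sketch, assuming the API accepts the same Solr-style query for any keyword (the backslash-escaped hyphens mirror the URL above):

import requests

def alphafold_ids(keyword):
    # Mirror the site's own query: hyphens are backslash-escaped (Solr syntax)
    escaped = keyword.replace('-', r'\-')
    params = {
        'q': f'(text:*{escaped} OR text:{escaped}*)',
        'type': 'main',
        'start': 0,
        'rows': 20,
    }
    res = requests.get('https://alphafold.ebi.ac.uk/api/search', params=params)
    return [item['uniprotAccession'] for item in res.json()['docs']]

print(alphafold_ids('Alpha-elapitoxin-Oh2b'))  # expected: ['P82662']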

Splash is unable to extract elements

I am trying to scrape https://www.lithia.com/new-inventory/index.htm, but it seems that Splash is unable to extract simple elements on the page.
I tried to extract an element from the page with the appropriate XPath, using either a Scrapy project (Python) or the Splash site (http://0.0.0.0:8050/), but Splash is unable to extract the element.
Code (I have simplified it so it is easier to convey and debug):
import scrapy
from scrapy_splash import SplashRequest
from time import sleep

class CarSpider(scrapy.Spider):
    name = 'car1'
    allowed_domains = ['lithia.com/']
    start_urls = ['https://www.lithia.com/baierl-auto-group/new-inventory.htm']

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url=url,
                                callback=self.parse,
                                endpoint='render.html')

    def parse(self, response):
        sleep(5)
        year = response.xpath('//span[contains(@class, "facet-list-group-label") and contains(text(), "Year")]')
        sleep(5)
        yield {
            'year': year,
        }
It returns:
{'year': []}
Meaning it is not extracted.
I checked the Splash site (http://0.0.0.0:8050/) as well, and many elements are not displayed in the HTML output. It seems like there is some rendering issue.
Following that, I came across this page (https://splash.readthedocs.io/en/stable/faq.html#website-is-not-rendered-correctly), which lists possible ways to debug rendering issues in Splash.
I have tried (combined in a Lua script like the sketch after this list):
Turning off private mode
Tuning splash:wait()
Setting splash:set_viewport_full()
Adding splash:set_user_agent()
Enabling plugins via splash.plugins_enabled
Setting the splash.html5_media_enabled property to enable HTML5 media
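A minimal sketch of how those settings can be combined into a single lua_source script for SplashRequest; the user agent string and wait time are placeholder assumptions, not a known fix for this site:

import scrapy
from scrapy_splash import SplashRequest

# Lua script bundling the settings suggested by the Splash FAQ.
LUA_SCRIPT = """
function main(splash, args)
    splash.private_mode_enabled = false   -- turn off private mode
    splash.plugins_enabled = true         -- enable plugins
    splash.html5_media_enabled = true     -- enable HTML5 media
    splash:set_user_agent("Mozilla/5.0 (Windows NT 10.0; Win64; x64)")
    assert(splash:go(args.url))
    splash:set_viewport_full()
    splash:wait(5)                        -- give the page time to render
    return splash:html()
end
"""

class CarDebugSpider(scrapy.Spider):
    name = 'car1_debug'
    start_urls = ['https://www.lithia.com/baierl-auto-group/new-inventory.htm']

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url, callback=self.parse,
                                endpoint='execute',
                                args={'lua_source': LUA_SCRIPT})

    def parse(self, response):
        yield {'year': response.xpath(
            '//span[contains(@class, "facet-list-group-label")'
            ' and contains(text(), "Year")]/text()').getall()}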
But so far, I am still unable to extract the element. In fact, lots of other elements cannot be extracted either; the element above is just one example.
Please help.

Request does not return the current value

I have written the following code and it works fine. I really enjoyed it, because I am quite new to Python requests and even Python 3, but the following day I noticed that the price variable had not updated, and it has not updated any time I have run the code over the past week (709.49, if it matters). I don't think it is a secret, so I have pasted the whole code below with a link to the website.
So I want to ask whether I wrote something in the wrong way, or whether the web page is not that simple to request. Could you tell me what happened?
Here is the original code:
import requests
import re
from bs4 import BeautifulSoup

pattern = r'\d+\.?\d*'
site_doc = requests.get('https://bitbay.net/pl/kurs-walut/kurs-ethereum-pln').text
soup = BeautifulSoup(site_doc, 'html.parser')
price = str(soup.select('title'))
price = re.findall(pattern, price)
print(price)
Thanks in advance!
The reason this doesn't work is that the content you are trying to get is rendered by JavaScript. For this, I'd recommend using Selenium to fetch the JavaScript-rendered content.
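A minimal Selenium sketch along those lines, reusing the question's regex on the rendered page title and assuming a local Chrome driver:

import re
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait

driver = webdriver.Chrome()
try:
    driver.get('https://bitbay.net/pl/kurs-walut/kurs-ethereum-pln')
    # Wait until JavaScript has rendered a number into the title.
    WebDriverWait(driver, 10).until(lambda d: re.search(r'\d+\.?\d*', d.title))
    print(re.findall(r'\d+\.?\d*', driver.title))
finally:
    driver.quit()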

How to extract data that loads differently with scrapy

I'm trying to extract product reviews from URLs like this one:
https://www.namastevaporizers.com/products/mighty-vaporizer
The spider I have extracts everything on the page except the comments. I think it is because the comments load differently, but unfortunately this is where my knowledge of Scrapy ends. Can anyone help me with this?
Here is my spider:
import scrapy
from scrapy.loader import ItemLoader
from scrapy.loader.processors import TakeFirst, MapCompose
from scrapy.spiders import Spider
from Namaste.items import NPPItem

class NPP(Spider):
    name = 'Product_Pages'
    start_urls = ['https://www.namastevaporizers.com/products/mighty-vaporizer']

    def parse(self, response):
        item_loader = ItemLoader(item=NPPItem(), response=response)
        item_loader.add_css("Z_reviews", "div.yotpo-user-name")  # gets nothing
        item_loader.add_css("Z_reviews", "div.content-title")  # gets nothing
        item_loader.add_css("Z_reviews", "div.content-review")  # gets nothing
        item_loader.add_css("Z_reviews", "div.yotpo")  # gets some data but is missing most stuff; this is the entire yotpo content wrapper
        item_loader.add_value("AAE_source_url", response.url)  # works fine
        return item_loader.load_item()
The reviews on this site are loaded by JS, so you need to forge the request the way your Chrome does.
Follow these steps and you will get the result:
Open your Chrome dev tools, switch to the Network tab, and search (note: search, not filter) for a review's content; you will find the request (I got the request URL https://staticw2.yotpo.com/batch).
Copy the curl command in Chrome.
Execute the curl in a shell; if it succeeds, the next step is to parse the curl command and forge it in Python code (the curl actually works on this site, I tried).
You can parse the curl at https://curl.trillworks.com/#python
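The skeleton that conversion produces looks roughly like this; every header and payload field below is a hypothetical placeholder to be replaced with the values from your own captured curl command:

import requests

# Hypothetical placeholders: copy the real headers and body from the
# curl command Chrome gives you for the staticw2.yotpo.com/batch request.
headers = {'User-Agent': 'Mozilla/5.0'}
payload = {'methods': '...', 'app_key': '...'}

response = requests.post('https://staticw2.yotpo.com/batch',
                         headers=headers, data=payload)
print(response.status_code)
print(response.text[:500])  # review markup/JSON to parse from here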

scrapy script - Load youtube csv or xml file into wordpress player or custom plugin

I have Scrapy spider code which scrapes a webpage and pulls the YouTube video links into a file. I am trying to get the spider to output the URLs as plain strings rather than an array.
That way my output is one URL without quotes, and then I wish to append text after the URL: ",&source=Open YouTube Playlist".
This way I can load the FULL URL into a WordPress web player, natively or via a plugin, and it will auto-create a YouTube playlist out of my output.
Maybe I am not thinking clearly? Is there a better way to accomplish the same goal?
import scrapy

class PdgaSpider(scrapy.Spider):
    name = "pdgavideos"
    start_urls = ["http://www.pdga.com/videos/"]

    def parse(self, response):
        for link in response.xpath('//td[2]/a/@href').extract():
            yield scrapy.Request(response.urljoin(link),
                                 callback=self.parse_page)

        # If page contains link to next page, extract link and parse
        next_page = response.xpath(
            '//a[contains(., "Go to page 2")]/@href').extract_first()
        if next_page:
            yield scrapy.Request(response.urljoin(next_page),
                                 callback=self.parse)

    # Youtube link 1st pass
    def parse_page(self, response):
        link = response.xpath('//iframe/@src').extract_first()
        linkprune = link.split('/embed/')[1]
        output = linkprune.split('?')[0]
        yield {
            'https://www.youtube.com/watch_videos?video_ids=': output + ','
        }
Current Output
https://www.youtube.com/watch_videos?video_ids=
"mueStjvHneI,"
"X7HfQL4fYgQ,"
"UtnR4gPMs_Q,"
"Kd9pbiKQqr4,"
"AokjaT-CnBk,"
"VdvhAsX6buo,"
"pF-XykcAqz8,"
"Fl0DDmx-jZw,"
"dpzLDiuQq9o,"
"J2_bl0zI504,"
...
Aiming to achieve
https://www.youtube.com/watch_videos?video_ids=mueStjvHneI,X7HfQL4fYgQ,UtnR4gPMs_Q,Kd9pbiKQqr4,VdvhAsX6buo,pF-XykcAqz8,dpzLDiuQq9o,&source=Open YouTube Playlist
If you load this URL, it will create a beautiful YouTube playlist.
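One possible way to get there (a sketch, not the original spider): accumulate the bare IDs on the spider instance and join them once the crawl finishes, using Scrapy's closed() hook:

import scrapy

class PdgaPlaylistSpider(scrapy.Spider):
    name = "pdgavideos_playlist"
    start_urls = ["http://www.pdga.com/videos/"]

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.video_ids = []  # collected across parse_page calls

    def parse(self, response):
        for link in response.xpath('//td[2]/a/@href').extract():
            yield scrapy.Request(response.urljoin(link), callback=self.parse_page)

    def parse_page(self, response):
        link = response.xpath('//iframe/@src').extract_first()
        if link and '/embed/' in link:
            # Keep only the bare video id: no quotes, no trailing comma
            self.video_ids.append(link.split('/embed/')[1].split('?')[0])

    def closed(self, reason):
        # Emit one joined URL after the whole crawl is done.
        playlist = ('https://www.youtube.com/watch_videos?video_ids='
                    + ','.join(self.video_ids)
                    + '&source=Open YouTube Playlist')
        self.logger.info(playlist)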
