I am new to scraping and I would like to scrape products and prices from daraz.pk. I learned from a tutorial and was able to scrape data from Amazon, but I am not able to do the same on Daraz.
Can anyone please tell me how to get the laptop product names from this link: https://www.daraz.pk/gaming-laptops/?spm=a2a0e.home.cate_1_4.1.35e349375wfPov
I tried using response.css("c16H9d::text").extract() but was not able to retrieve any data.
Regards
I have written this code for the grooming category of Daraz.pk. If you want to scrape other products, just put the link of that page in url and adjust the XPaths below accordingly.
from selenium import webdriver

name = []
price = []
url = 'https://www.daraz.pk/dog-grooming-supplies/'
driver = webdriver.Chrome('chromedriver')
driver.get(url)

# the listing grid renders one div per product tile; walk the first 39 tiles
for i in range(1, 40):
    target_name = driver.find_element_by_xpath('//*[@id="root"]/div/div[3]/div[1]/div/div[1]/div[2]/div[' + str(i) + ']/div/div/div[2]/div[2]/a')
    target_price = driver.find_element_by_xpath('//*[@id="root"]/div/div[3]/div[1]/div/div[1]/div[2]/div[' + str(i) + ']/div/div/div[2]/div[3]/span')
    name.append(target_name.text)
    price.append(target_price.text)
driver.quit()

print(name)
print(price)
Adjust the indentation if you run into any problems, and do let me know if something does not work.
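Since Daraz renders its listing grid with JavaScript (which is also why the plain response.css() call in the question returns nothing), it can help to wait for the grid to appear before reading elements. A minimal sketch to drop in right after driver.get(url); the ten-second timeout is an assumption:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# block for up to 10 seconds until the JavaScript-rendered root element exists
WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.ID, 'root')))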
I've been trying to scrape a search result from the AlphaFold Protein Structure Database and couldn't find the desired information in what the scrape returned.
So my idea is that, e.g., if I put the search key word "Alpha-elapitoxin-Oh2b" in the search bar and click the search button, it will generate a new page with the URL:
https://alphafold.ebi.ac.uk/search/text/Alpha-elapitoxin-Oh2b
In Google Chrome, I used "Inspect" to check the code for this page and found my desired search result, i.e. the ID for this protein: P82662.
However, when I used requests and bs4 to scrape this page, I couldn't find the desired "P82662" in the returned information, nor even the search words "Alpha-elapitoxin-Oh2b":
import requests
from bs4 import BeautifulSoup
response = requests.get('https://alphafold.ebi.ac.uk/search/text/Alpha-elapitoxin-Oh2b')
html = response.text
soup = BeautifulSoup(html, "html.parser")
print(soup.prettify())
I searched Stack Overflow for a solution to not being able to find the result with bs4 and requests, and found someone saying that it is because the search result page is rendered with JavaScript. Is that true? How can I solve this problem?
Thanks!
The desired search data is loaded dynamically from an external source via an API, as JSON, with a GET request, so bs4 gets an empty ResultSet.
import requests

res = requests.get('https://alphafold.ebi.ac.uk/api/search?q=%28text%3A%2aAlpha%5C-elapitoxin%5C-Oh2b%20OR%20text%3AAlpha%5C-elapitoxin%5C-Oh2b%2a%29&type=main&start=0&rows=20')

# each hit in the "docs" array carries the UniProt accession
for item in res.json()['docs']:
    id_num = item['uniprotAccession']
    print(id_num)
Output:
P82662
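Endpoints like this can be found by opening the browser's developer tools, switching to the Network tab (filtered to XHR/Fetch), and running the search again; the JSON request the page fires is the one to replay with requests.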
We are currently busy with a property web scrape and are trying to scrape multiple pages without manually specifying the page range (there are 5 pages):
for num in range(1, 6):
    url = "https://www.property24.com/for-sale/woodland-hills-wildlife-estate/bloemfontein/free-state/10467/p" + str(num)
How do you output the URLs of all pages without manually typing the page range?
Output
https://www.property24.com/for-sale/woodland-hills-wildlife-estate/bloemfontein/free-state/10467/p1
https://www.property24.com/for-sale/woodland-hills-wildlife-estate/bloemfontein/free-state/10467/p2
https://www.property24.com/for-sale/woodland-hills-wildlife-estate/bloemfontein/free-state/10467/p3
https://www.property24.com/for-sale/woodland-hills-wildlife-estate/bloemfontein/free-state/10467/p4
https://www.property24.com/for-sale/woodland-hills-wildlife-estate/bloemfontein/free-state/10467/p5
Maybe we could use the ul class="pagination" element to count the number of pages?
You can use the pagination class to fetch the last a tag, read its data-pagenumber attribute, and then use that to build all the links. Follow the code below to get it done.
Code:
import requests
from bs4 import BeautifulSoup

# url = "https://www.property24.com/for-sale/woodland-hills-wildlife-estate/bloemfontein/free-state/10467"
url = "https://www.property24.com/for-sale/woodstock/cape-town/western-cape/10164"
data = requests.get(url)
soup = BeautifulSoup(data.content, "html.parser")

# the last <a> inside the pagination list carries the highest page number
noofpages = soup.find("ul", {"class": "pagination"}).find_all("a")[-1]["data-pagenumber"]
for i in range(1, int(noofpages) + 1):
    print(f"{url}/p{i}")
Output:
Let me know if you have any questions :)
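From there, the per-page URLs can be fed straight back into requests to scrape each listing; a sketch reusing the variables above:
for i in range(1, int(noofpages) + 1):
    page_soup = BeautifulSoup(requests.get(f"{url}/p{i}").content, "html.parser")
    # parse each results page here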
I have written the following code and it works. I really enjoyed it, because I am quite new to Python requests and even Python 3, but the next day I noticed that the price variable is not updated, and it has not updated any time I have run the code over the past week (it stays at 709.49, if that matters). I think it is not a secret, so I pasted the whole code below with a link to the website.
So I want to ask whether I wrote something the wrong way, or whether the page is not that simple to request. Could you tell me what happened?
Here is the original code:
import requests
import re
from bs4 import BeautifulSoup

pattern = r'\d+\.?\d*'  # an integer or decimal number
site_doc = requests.get('https://bitbay.net/pl/kurs-walut/kurs-ethereum-pln').text
soup = BeautifulSoup(site_doc, 'html.parser')

# grab the <title> tag and pull any numbers out of it
price = str(soup.select('title'))
price = re.findall(pattern, price)
print(price)
Thanks in advance!
The reason this doesn't work is that the content you are trying to get is rendered by JavaScript: requests only receives the initial HTML, in which the title still carries the stale server-side price. I'd recommend using Selenium, which executes the page's JavaScript before you read the content.
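A minimal sketch of that approach, assuming chromedriver is installed and that the live price appears in the page title once the JavaScript has run (that last point is an assumption I have not verified against the site):
import re
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument('--headless')  # run Chrome without opening a window
driver = webdriver.Chrome(options=options)
driver.get('https://bitbay.net/pl/kurs-walut/kurs-ethereum-pln')

# driver.title reflects the DOM after JavaScript has executed
price = re.findall(r'\d+\.?\d*', driver.title)
print(price)
driver.quit()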
I am brand new to Scrapy, have worked my way through the tutorial, and am trying to figure out how to apply what I have learned so far to a seemingly basic task. I know very little Python so far and am using this as a learning experience, so if I ask a simple question, I apologize.
My goal for this program is to follow this link http://ucmwww.dnr.state.la.us/ucmsearch/FindDocuments.aspx?idx=xwellserialnumber&val=971683 and to extract the well serial number to a csv file. Eventually I want to run this spider on several thousand different well files and retrieve specific data. However, I am starting with the basics first.
Right now the spider doesn't crawl any web page that I enter. There are no errors when I run it; it just states that 0 pages were crawled. I can't quite figure out what I am doing wrong. I am positive the start URL is OK, as I have checked it. Do I need a specific type of spider to accomplish what I am trying to do?
import scrapy
from scrapy import Spider
from scrapy.selector import Selector

class Sonrisdataaccess(Spider):
    name = "serial"
    allowed_domains = ["sonris.com"]
    start_urls = [
        "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=972498"]

    def parse(self, response):
        questions = Selector(response).xpath('/html/body/table[1]/tbody/tr[2]/td[1]')
        for question in questions:
            item = SonrisdataaccessItem()
            item['serial'] = question.xpath('/html/body/table[1]/tbody/tr[2]/td[1]').extract()[0]
            yield item
Thank you for any help, I greatly appreciate it!
First of all, I do not understand what you are doing in your for loop: if you already have a selector, you should not select from the whole HTML again inside it...
Nevertheless, the interesting part is that the browser represents the table quite differently from what Scrapy downloads: browsers insert a tbody element when rendering a table that lacks one. If you look at the response in your parse method, you will see that there is no tbody element in the first table. This is why your selection does not return anything.
So to get the first serial number (as it is in your XPath) change your parse function to this:
def parse(self, response):
    item = SonrisdataaccessItem()
    item['serial'] = response.xpath('/html/body/table[1]/tr[2]/td[1]/text()').extract()[0]
    yield item
For later changes you may have to alter the XPath expression to get more data.
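If you want an expression that works both in the browser console and against the raw download, one option (a sketch, not tested against this page) is to use // so the presence or absence of tbody no longer matters:
# matches the cell whether or not a tbody wrapper is present
item['serial'] = response.xpath('//table[1]//tr[2]/td[1]/text()').extract_first()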
I want to use Drupal Feeds Spider to import a list of URLs (in my case, a list of IMDb movies).
To do that, I installed Drupal Feeds and Feeds Spider Fetcher.
When I try to get the list of URLs, I use XPath to get the links, but there is one problem.
For example, here is my list to get the URLs from: http://www.imdb.com/search/title?title_type=feature
The XPath I used for the URLs is .//*[@id='main']/table/tbody/tr[4]/td[3]/a/@href
but the final link comes out like href="/title/tt0993846/, which Feeds can't import.
I want the links to be like href="http://imdb.com/title/tt0993846/
I tried this XPath: concat('http://imdb.com/', .//*[@id='main']/table/tbody/tr[4]/td[3]/a/@href)
But it didn't work; it shows the error Download of failed with code -1002.
XPath 2.0 solution:
//td[@class='title']/a/resolve-uri(@href)
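Note that resolve-uri() requires an XPath 2.0 processor. If Feeds Spider Fetcher only supports XPath 1.0, a concat() over the same selection may work instead; since the scraped href already starts with a slash, the prefix must not end with one (untested, so treat the exact expression as an assumption):
concat('http://www.imdb.com', //td[@class='title']/a/@href)
In XPath 1.0, concat() takes the string value of only the first matching node, so this needs to be evaluated relative to each row rather than once for the whole page.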