I want to use Drupal Feeds Spider to import a list of URLs (in my case, a list of IMDb movies).
To do that, I installed Drupal Feeds and Feeds Spider Fetcher.
I use XPath to extract the links from the list page, but there is one problem.
For example, my list is at http://www.imdb.com/search/title?title_type=feature
The XPath I use for the URLs is .//*[@id='main']/table/tbody/tr[4]/td[3]/a/@href
but the resulting link is relative, e.g. href="/title/tt0993846/", which Feeds can't import.
I want the links to be absolute, e.g. href="http://imdb.com/title/tt0993846/"
I tried this XPath: concat('http://imdb.com/', .//*[@id='main']/table/tbody/tr[4]/td[3]/a/@href)
But it didn't work; it shows the error "Download of failed with code -1002".
XPath 2.0 solution:
//td[@class='title']/a/resolve-uri(@href)
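If the extracted links are post-processed in a script instead of in XPath, the same relative-to-absolute resolution can be sketched in Python with the standard library's urljoin. The base URL and the relative href are the ones from the question; the function name absolute_link is made up for illustration.

```python
from urllib.parse import urljoin

# Page the links were scraped from (taken from the question).
BASE_URL = "http://www.imdb.com/search/title?title_type=feature"


def absolute_link(href, base=BASE_URL):
    """Resolve a possibly relative href against the page it was found on."""
    return urljoin(base, href)


print(absolute_link("/title/tt0993846/"))
# prints http://www.imdb.com/title/tt0993846/
```

Links that are already absolute pass through unchanged, so the function can be applied to every href without checking first.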
I've been trying to scrape the search results of the AlphaFold Protein Structure Database, but I can't find the desired information in the scraped output.
If I put the search keyword "Alpha-elapitoxin-Oh2b" in the search bar and click the search button, it generates a new page with the URL:
https://alphafold.ebi.ac.uk/search/text/Alpha-elapitoxin-Oh2b
In Google Chrome, I used "Inspect" to check the code of this page and found my desired search result, i.e. the ID for this protein: P82662.
However, when I used requests and bs4 to scrape this page, I couldn't find "P82662" in the returned HTML, nor even the search term "Alpha-elapitoxin-Oh2b".
import requests
from bs4 import BeautifulSoup
response = requests.get('https://alphafold.ebi.ac.uk/search/text/Alpha-elapitoxin-Oh2b')
html = response.text
soup = BeautifulSoup(html, "html.parser")
print(soup.prettify())
I searched Stack Overflow for why requests and bs4 might not find the result, and someone said it is because the search results page is rendered with JavaScript. Is that true? How can I solve this problem?
Thanks!
The search results are loaded dynamically from an external API, which returns JSON in response to a GET request. They are therefore not in the HTML that requests downloads, so bs4 finds nothing (an empty ResultSet). You can call the API directly:
import requests
res = requests.get('https://alphafold.ebi.ac.uk/api/search?q=%28text%3A%2aAlpha%5C-elapitoxin%5C-Oh2b%20OR%20text%3AAlpha%5C-elapitoxin%5C-Oh2b%2a%29&type=main&start=0&rows=20')
# each entry in 'docs' is one search hit
for item in res.json()['docs']:
    id_num = item['uniprotAccession']
    print(id_num)
Output:
P82662
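To reuse this for other search terms, the query string can be built rather than pasted. The sketch below only reverse-engineers the single URL shown above: the backslash-escaping of hyphens and the wildcard pattern are inferred from that one example and are assumptions, not documented API behaviour. The function names are made up, and it uses the standard library instead of requests so that it has no dependencies.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

API_URL = "https://alphafold.ebi.ac.uk/api/search"


def build_query(term):
    """Rebuild the query pattern seen in the answer's URL (assumed rule)."""
    # The example URL escapes every hyphen with a backslash.
    escaped = term.replace("-", "\\-")
    return f"(text:*{escaped} OR text:{escaped}*)"


def build_search_url(term, start=0, rows=20):
    """Percent-encode the query and append the paging parameters."""
    q = quote(build_query(term), safe="")
    return f"{API_URL}?q={q}&type=main&start={start}&rows={rows}"


def fetch_accessions(term):
    """Call the API and return the uniprotAccession of every hit."""
    with urlopen(build_search_url(term)) as response:
        docs = json.load(response)["docs"]
    return [doc["uniprotAccession"] for doc in docs]


if __name__ == "__main__":
    print(build_search_url("Alpha-elapitoxin-Oh2b"))
```

Calling fetch_accessions("Alpha-elapitoxin-Oh2b") performs the actual request; it is kept out of the main block so the URL construction can be checked offline.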
We are currently busy with a property web scrape and are trying to scrape multiple pages without manually entering the page range (there are 5 pages).
for num in range(0,5):
    url = "https://www.property24.com/for-sale/woodland-hills-wildlife-estate/bloemfontein/free-state/10467/p" + str(num)
How do you output the URLs of all pages without manually typing the page range?
Output
https://www.property24.com/for-sale/woodland-hills-wildlife-estate/bloemfontein/free-state/10467/p1
https://www.property24.com/for-sale/woodland-hills-wildlife-estate/bloemfontein/free-state/10467/p2
https://www.property24.com/for-sale/woodland-hills-wildlife-estate/bloemfontein/free-state/10467/p3
https://www.property24.com/for-sale/woodland-hills-wildlife-estate/bloemfontein/free-state/10467/p4
https://www.property24.com/for-sale/woodland-hills-wildlife-estate/bloemfontein/free-state/10467/p5
Maybe using the ul class="pagination" in order to count the page number?
You can use the pagination class to fetch the last a tag, read its data-pagenumber attribute, and then use that number to build all the links. The code below does this.
Code:
import requests
from bs4 import BeautifulSoup
#url="https://www.property24.com/for-sale/woodland-hills-wildlife-estate/bloemfontein/free-state/10467"
url="https://www.property24.com/for-sale/woodstock/cape-town/western-cape/10164"
data=requests.get(url)
soup=BeautifulSoup(data.content,"html.parser")
noofpages=soup.find("ul",{"class":"pagination"}).find_all("a")[-1]["data-pagenumber"]
for i in range(1,int(noofpages)+1):
    print(f"{url}/p{i}")
Let me know if you have any questions :)
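As a variation, here is a dependency-free sketch that takes the largest data-pagenumber instead of the last a tag, in case the last link in the list is a "next" arrow rather than the final page. The sample markup is hypothetical and only mimics the attribute used above; the real page structure may differ, and this version looks at every a tag rather than only those inside the pagination list.

```python
from html.parser import HTMLParser


class PageNumberCollector(HTMLParser):
    """Collect every numeric data-pagenumber attribute found on <a> tags."""

    def __init__(self):
        super().__init__()
        self.page_numbers = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        value = dict(attrs).get("data-pagenumber")
        if value and value.isdigit():
            self.page_numbers.append(int(value))


def page_urls(base_url, html):
    """Return one URL per page, from p1 up to the highest page number seen."""
    collector = PageNumberCollector()
    collector.feed(html)
    # Fall back to a single page when no pagination links are present.
    last_page = max(collector.page_numbers, default=1)
    return [f"{base_url}/p{n}" for n in range(1, last_page + 1)]


# Hypothetical pagination markup, for illustration only.
SAMPLE_HTML = """
<ul class="pagination">
  <li><a data-pagenumber="1" href="#">1</a></li>
  <li><a data-pagenumber="2" href="#">2</a></li>
  <li><a data-pagenumber="3" href="#">3</a></li>
</ul>
"""
BASE = "https://www.property24.com/for-sale/woodstock/cape-town/western-cape/10164"

for url in page_urls(BASE, SAMPLE_HTML):
    print(url)
```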
I am new to scraping and would like to scrape products and prices from daraz.pk. I learned from a tutorial and was able to scrape data from Amazon, but I am not able to do the same on Daraz.
Can anyone tell me how to get the laptop product names from this link: https://www.daraz.pk/gaming-laptops/?spm=a2a0e.home.cate_1_4.1.35e349375wfPov
I tried using response.css("c16H9d::text").extract() but was not able to retrieve any data.
Regards
I have written this code for the grooming category of Daraz.pk. If you want to scrape other products, just put the link of that page in url and adjust the XPath expressions below.
from selenium import webdriver
from selenium.webdriver.common.by import By

name = []
price = []
url = 'https://www.daraz.pk/dog-grooming-supplies/'
# assumes Selenium 4.6 or newer, which finds chromedriver on its own;
# older versions need the driver path passed in explicitly
driver = webdriver.Chrome()
driver.get(url)
# XPath of the i-th product card; name and price sit in different child divs
card = '//*[@id="root"]/div/div[3]/div[1]/div/div[1]/div[2]/div[{}]/div/div/div[2]'
for i in range(1, 40):
    # find_element_by_xpath was removed in Selenium 4; use find_element(By.XPATH, ...)
    target_name = driver.find_element(By.XPATH, card.format(i) + '/div[2]/a')
    target_price = driver.find_element(By.XPATH, card.format(i) + '/div[3]/span')
    name.append(target_name.text)
    price.append(target_price.text)
driver.quit()
print(name)
print(price)
Adjust the indentation if needed, and let me know if you run into any problems.
Is it possible to get RSS feed for BlogSpot for specific keywords?
I have tried with the below URLs but they do not seem to be working.
Atom 1.0: https://blogname.blogspot.com/feeds/posts/default/-/[label]
RSS 2.0: https://blogname.blogspot.com/feeds/posts/default/-/[label]?alt=rss
For keyword-specific feeds, use one of the following endpoints:
https://www.yourblogname.blogspot.com/feeds/posts/default?q=KEYWORD
https://www.blogger.com/feeds/BLOGID/posts/default?q=KEYWORD
The keyword is passed as the value of the q query parameter.
Be sure to enable the blog feed:
Go to Settings > Others > Site Feed > Allow Blog Feed, then select Full.
Blogger labels are case sensitive: Food is treated differently from food.
An example: https://fordemos.blogspot.com/feeds/posts/default/-/Food?alt=rss
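A small sketch of building both kinds of feed URL in Python, so that labels and keywords containing spaces or other special characters are encoded correctly. The function names are made up for illustration, and the URL patterns are simply the ones listed above.

```python
from urllib.parse import quote, urlencode


def label_feed_url(blog_host, label, rss=False):
    """Feed of posts carrying a given label (labels are case sensitive)."""
    url = f"https://{blog_host}/feeds/posts/default/-/{quote(label, safe='')}"
    if rss:
        url += "?alt=rss"  # RSS 2.0 instead of the default Atom 1.0
    return url


def keyword_feed_url(blog_host, keyword):
    """Feed of posts matching a free-text keyword via the q parameter."""
    return f"https://{blog_host}/feeds/posts/default?{urlencode({'q': keyword})}"


print(label_feed_url("fordemos.blogspot.com", "Food", rss=True))
# prints https://fordemos.blogspot.com/feeds/posts/default/-/Food?alt=rss
print(keyword_feed_url("fordemos.blogspot.com", "street food"))
# prints https://fordemos.blogspot.com/feeds/posts/default?q=street+food
```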
I am trying to write a program that reads the articles (posts) of any website, from Blogspot or WordPress blogs to any other site. To keep the code compatible with almost all websites, whether they are written in HTML5, XHTML, etc., I thought of using RSS/Atom feeds as the basis for extracting content.
However, RSS/Atom feeds often do not contain the entire article, so I plan to gather all the post links from the feed using feedparser and then extract the article content from each URL.
I can already get the URLs of all articles on a website (along with the summary, i.e. the article content shown in the feed), but I want the entire article, for which I have to use the respective URL.
I came across various HTML/XML parsers such as BeautifulSoup and lxml, but I really don't know how to get the "exact" content of the article (by "exact" I mean the data with all hyperlinks, iframes, slide shows, etc. intact; I don't want the CSS part).
So, can anyone help me with this?
Fetching the HTML code of all linked pages is quite easy.
The hard part is to extract exactly the content you are looking for. If you simply need all code inside of the <body> tag, this shouldn't be a big problem either; extracting all text is equally simple. But if you want a more specific subset, you have more work to do.
I suggest that you install the requests and BeautifulSoup modules (pip install requests beautifulsoup4). The requests module makes fetching pages really easy.
The following example fetches an RSS feed and fills three lists:
linksoups is a list of the BeautifulSoup instances of each page linked from the feed
linktexts is a list of the visible text of each page linked from the feed
linkimageurls is a list of lists with the src URLs of all the images embedded in each page linked from the feed
e.g. [['/pageone/img1.jpg', '/pageone/img2.png'], ['/pagetwo/img1.gif', 'logo.bmp']]
import requests, bs4

# request the content of the feed and create a BeautifulSoup object from it;
# the 'xml' parser (requires lxml) keeps the text inside <link>…</link>, which
# an HTML parser typically drops because <link> is an empty element in HTML
response = requests.get('http://rss.slashdot.org/Slashdot/slashdot')
responsesoup = bs4.BeautifulSoup(response.text, 'xml')

linksoups = []
linktexts = []
linkimageurls = []

# iterate over all <link>…</link> tags and fill three lists: one with the soups of the
# linked pages, one with all their visible text and one with the urls of all embedded
# images
for link in responsesoup.find_all('link'):
    url = link.text.strip()
    if not url:
        continue  # skip empty <link/> elements
    linkresponse = requests.get(url)  # add support for relative urls with urllib.parse.urljoin
    soup = bs4.BeautifulSoup(linkresponse.text, 'html.parser')
    linksoups.append(soup)

    # append all text between tags inside of the body tag to the second list
    body = soup.find('body')
    linktexts.append(body.text if body else '')

    # get the src attribute of each <img> tag and append it to imageurls
    imageurls = []
    for image in soup.find_all('img'):
        if image.get('src'):
            imageurls.append(image['src'])
    linkimageurls.append(imageurls)

# now somehow merge the retrieved information.
That might be a rough starting point for your project.
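If you only need the post links from the feed and would rather avoid extra dependencies for that step, here is a minimal sketch using the standard library's ElementTree. It assumes a plain RSS 2.0 feed without XML namespaces; RDF and Atom feeds put their elements in namespaces and would need the element names adjusted. The sample feed and the function name item_links are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical RSS 2.0 feed, for illustration only.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example blog</title>
    <link>https://example.com/</link>
    <item><title>First post</title><link>https://example.com/first</link></item>
    <item><title>Second post</title><link>https://example.com/second</link></item>
  </channel>
</rss>"""


def item_links(feed_xml):
    """Return the <link> of every <item>, skipping the channel-level link."""
    root = ET.fromstring(feed_xml)
    links = []
    for item in root.iter("item"):
        link = item.findtext("link")
        if link:
            links.append(link.strip())
    return links


print(item_links(SAMPLE_FEED))
# prints ['https://example.com/first', 'https://example.com/second']
```

Restricting the search to item elements avoids fetching the site's home page, which the channel-level link points to.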