I am trying to write a program which reads articles (posts) of any website that could range from Blogspot or Wordpress blogs / any other website. As to write code which is compatible with almost all websites which might have been written in HTML5/XHTML etc.. I thought of using RSS/ Atom feeds as ground from extracting content.
However, as RSS/ Atom feeds usually might not contain entire articles of websites, I thought to gather all "posts" links from the feed using feedparser and then want to extract the article content from the respective URL.
I could get URL's of all articles in website (including summary. i.e., article content shown in feed) but I want to access the entire article data for which I have to use the respective URL.
I came across various libraries like BeautifulSoup, lxml etc.. (various HTML/XML Parsers) but I really don't know how to get the "exact" content of the article (I assume "exact" means the data with all hyperlinks, iframes, slides shows etc still exist; I don't want CSS part).
So, can anyone help me on it?
Fetching the HTML code of all linked pages is quite easy.
The hard part is to extract exactly the content you are looking for. If you simply need all code inside of the <body> tag, this shouldn't be a big problem either; extracting all text is equally simple. But if you want a more specific subset, you have more work to do.
I suggest that you download the requests and BeautifulSoup module (both avaible via easy_install requests/bs4 or better pip install requests/bs4). The requests module makes fetching your page really easy.
The following example fetches a rss feed and returns three lists:
linksoups is a list of the BeautifulSoup instances of each page linked from the feed
linktexts is a list of the visible text of each page linked from the feed
linkimageurls is a list of lists with the src-urls of all the images embedded in each page linked from the feed
e.g. [['/pageone/img1.jpg', '/pageone/img2.png'], ['/pagetwo/img1.gif', 'logo.bmp']]
import requests, bs4
# request the content of the feed an create a BeautifulSoup object from its content
response = requests.get('http://rss.slashdot.org/Slashdot/slashdot')
responsesoup = bs4.BeautifulSoup(response.text)
linksoups = []
linktexts = []
linkimageurls = []
# iterate over all <link>…</link> tags and fill three lists: one with the soups of the
# linked pages, one with all their visible text and one with the urls of all embedded
# images
for link in responsesoup.find_all('link'):
url = link.text
linkresponse = requests.get(url) # add support for relative urls with urlparse
soup = bs4.BeautifulSoup(linkresponse.text)
linksoups.append(soup)
linktexts.append(soup.find('body').text)
# Append all text between tags inside of the body tag to the second list
images = soup.find_all('img')
imageurls = []
# get the src attribute of each <img> tag and append it to imageurls
for image in images:
imageurls.append(image['src'])
linkimageurls.append(imageurls)
# now somehow merge the retrieved information.
That might be a rough starting point for your project.
Related
We currently busy with a property web scrape and trying to scrape multiple pages without manually getting the page range (There are 5 pages)
for num in range(0,5):
url = "https://www.property24.com/for-sale/woodland-hills-wildlife-estate/bloemfontein/free-state/10467/p" + str(num)
How do you output a URL of all pages without manually typing the page range?
Output
https://www.property24.com/for-sale/woodland-hills-wildlife-estate/bloemfontein/free-state/10467/p1
https://www.property24.com/for-sale/woodland-hills-wildlife-estate/bloemfontein/free-state/10467/p2
https://www.property24.com/for-sale/woodland-hills-wildlife-estate/bloemfontein/free-state/10467/p3
https://www.property24.com/for-sale/woodland-hills-wildlife-estate/bloemfontein/free-state/10467/p4
https://www.property24.com/for-sale/woodland-hills-wildlife-estate/bloemfontein/free-state/10467/p4
Maybe using the ul class="pagination" in order to count the page number?
you can use pagination class to fetch the last a tag and from that you can fetch data-pagenumber and then use it get all the links. Follow the below code to get it done.
Code:
import requests
from bs4 import BeautifulSoup
#url="https://www.property24.com/for-sale/woodland-hills-wildlife-estate/bloemfontein/free-state/10467"
url="https://www.property24.com/for-sale/woodstock/cape-town/western-cape/10164"
data=requests.get(url)
soup=BeautifulSoup(data.content,"html.parser")
noofpages=soup.find("ul",{"class":"pagination"}).find_all("a")[-1]["data-pagenumber"]
for i in range(1,int(noofpages)+1):
print(f"{url}/p{i}")
Output:
Let me know if you have any questions :)
Looking for advice please on methods to scrape the gender of clothing items on a website that doesn't specify the gender on the product page.
The website I'm crawling is www.very.co.uk and an example of a product page would be this - https://www.very.co.uk/berghaus-combust-reflect-long-jacket-red/1600352465.prd
Looking at that page, there looks to be no easy way to create a script that could identify this item as womenswear. Other websites might have breadcrumbs to use, or the gender might be in the title / URL but this has nothing.
As I'm using scrapy, with the crawl template and Rules to build a hierarchy of links to scrape, I was wondering if it's possible to pass a variable in one of the rules or the starting_URL to identify all items scraped following this rule / starting URL would have a variable as womenswear? I can then feed this variable into a method / loader statement to tag the item as womenswear before putting it into a database.
If not, would anyone have any other ideas on how to categorise this item as womenswear. I saw an example where you could use an excel spreadsheet to create the start_urls and in that excel spreadsheet tag each row as womenswear, mens etc. However, I feel this method might cause issues further down the line and would prefer to avoid it if possible. I'll spare the details of why I think this would be problematic unless anyone asks.
Thanks in advance
There does seem to be a breadcrumb in your example, however for an alternative you can usually check the page source by simply searching your term - maybe there's some embedded javascript/json that can be extract?
Here you can see some javascript for subcategory that indicates that it's a "womens_everyday_sports_jacket".
You can parse it quite easily with some regex:
re.findall('subcategory: "(.+?)"', response.body_as_unicode())
# womens_everyday_sports_jacket
My goal is to get links to all challenges of Kaggle with their title. I am using the library rvest for it but I do not seem to come far. The nodes are empty when I am a few divs in.
I am trying to do it for the first challenge at first and should be able to transfer that to every entry afterwards.
The xpath of the first entry is:
/html/body/div[1]/div[2]/div/div/div[2]/div/div/div[2]/div[2]/div/div/div[2]/div/div/div[1]/a
My idea was to get the link via html_attr( , "href") once I am in the right tag.
My idea is:
library(rvest)
url = "https://www.kaggle.com/competitions"
kaggle_html = read_html(url)
kaggle_text = html_text(kaggle_html)
kaggle_node <- html_nodes(kaggle_html, xpath = "/html/body/div[1]/div[2]/div/div/div[2]/div/div/div[2]/div[2]/div/div/div[2]/div/div/div[1]/a")
html_attr(kaggle_node, "href")
I cant go past a certain div. The following snippet shows the last node I can access
node <- html_nodes(kaggle_html, xpath="/html/body/div[1]/div[2]/div")
html_attrs(node)
Once I go one step further with html_nodes(kaggle_html,xpath="/html/body/div[1]/div[2]/div/div"), the node will be empty.
I think the issue is that kaggle uses a smart list that expands the further I scroll down.
(I am aware that I can use %>%. I am saving every step so that I am able to access and view them more easily to be able to learn how it properly works.)
I solved the issue. I think that I can not access the full html code of the site from R because the table is loaded by a script which expands the table (thus the HTML) with a user scrolling through.
I resolved it, by expanding the table manually, downloading the whole HTML webpage and loading the local file.
I am using MediaWiki (Sometimes I think that it could be better to use Drupal) to create a wiki.
I have tried to find out a correct api or something similar to import a table (csv, xml, or another format) with text fields.
The idea is to bring a document with "name-pages" and "tags" to create automatically empty pages.
Finally, the users will see that there are new empty pages to fill!.
And every day pass a scheduler (something like Feed Import in Drupal) to bring new pages. I mean, if the text exists, don't do anything; however, it the text is new, create a new wikimedia page!
I don't find the correct api to do this. Somebody knows any way to do this?
Thank you
Regards!
The main Data Type used by Yahoo Pipes is the [Item], which is RSS feed content. I want to take an RSS's content or sub-element, make it into [Text] (or a number might work), and then use it as an INPUT into a [Module] to build a RSS-URL with specific parameters. I will then use the new RSS-URL to pull more content.
Could possibly use the [URL Builder Module] or some work-around.
The key here is using "dynamic" data from an RSS feed (not user input, or a static data), and getting that data into a Data Type that is compatible (and/or accessible) as an INPUT into a module.
It seems like a vital functionality, but I cannot figure it out. I have tried many, many work-around attempts, with no success.
The Specific API and Methods (if you are interested)
Using the LastFM API.
1st Method: user.getWeeklyChartList. Then pick the "from" (start) and "to" (end) Unix timestamps from 1 year-ago-today.
2nd Method: user.getWeeklyAlbumChart using those specific (and "dynamic") timestamps to pull my top albums for that week.
tl;dr. Build an RSS-URL using specific parameters from another RSS feed's content.
I think I may have figured it out. I doubt it is the best way, but it works. The problem was the module I needed to use didn't have and input node. But the Loop module has an input node, so if I embed the URL builder into the Loop module I can then access sub-element content from the 1st feed to use as parameters to build the URL for the 2nd feed! Then I can just scrap all the extra stuff generated by the Loop, by using Truncate.