BeautifulSoup parsing for CSS selector value

I have inspected down to the following element on a webpage I am trying to scrape:
<div data-testid="home-description-text-description-text" class="Text__TextBase-sc-1cait9d-0-div Text__TextContainerBase-sc-1cait9d-1 bjqKkI DescriptionTextBody__StyledTextContainer-sc-19zdz5l-1 fObgGE">
"Spectacular views of the Columbia river and Oregon hillsides. Bring your favorite builder. Secluded and very private. Mobile homes okay. Call your favorite Realtor today."
I have been unable to use page.select("data-testid"); in fact, any method I have tried to find by "div" followed by "data-testid" has been unsuccessful. I think finding by class_ would also fail, because I believe the class names are generated by JavaScript and will be different for each page, but I am unclear on how that works.
My goal is to eventually get the text "Spectacular views of the Columbia river and Oregon hillsides. Bring your favorite builder. Secluded and very private. Mobile homes okay. Call your favorite Realtor today."
Is there a way to search based on the expected value of "home-description-text-description-text"?

maybe this?
import re
from bs4 import BeautifulSoup

html = '<div data-testid="home-description-text-description-text" class="Text__TextBase-sc-1cait9d-0-div Text__TextContainerBase-sc-1cait9d-1 bjqKkI DescriptionTextBody__StyledTextContainer-sc-19zdz5l-1 fObgGE">'
soup = BeautifulSoup(html, "html.parser")
soup.find_all(attrs={'data-testid': re.compile('home-description-text-description-text')})

The selector page.select("data-testid") is wrong; to select by a tag attribute you need to wrap it in square brackets: page.select("[data-testid]").
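For the stated goal of extracting the description text, a CSS attribute selector can also match the exact data-testid value. A minimal sketch, assuming the page HTML has already been fetched (the truncated class attribute here is just a placeholder):

from bs4 import BeautifulSoup

html = '''<div data-testid="home-description-text-description-text" class="...">
Spectacular views of the Columbia river and Oregon hillsides.</div>'''
soup = BeautifulSoup(html, "html.parser")

# [attr="value"] matches the exact attribute value, not just its presence
div = soup.select_one('div[data-testid="home-description-text-description-text"]')
if div is not None:
    print(div.get_text(strip=True))

Note that if the description is injected client-side by JavaScript, the raw HTML fetched with requests may not contain this div at all; in that case a browser-driven tool such as Selenium would be needed.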

Related

How to extract URL links using UiPath Studio

I am using UiPath Studio (2022.4.3) for data scraping. I don't find the "Data Scraper" tool; instead there is a tool called "Table Extraction". How do I extract URL links found in web pages/applications?
Maybe you can use the "Find Children" activity with an upper-level selector of the window that contains these links, and filter for elements with the tag "a".
As a note, if you use this activity you can set it to find "Descendants", so you can find children of children with the same activity.

Scraping the gender of clothing items

Looking for advice please on methods to scrape the gender of clothing items on a website that doesn't specify the gender on the product page.
The website I'm crawling is www.very.co.uk and an example of a product page would be this - https://www.very.co.uk/berghaus-combust-reflect-long-jacket-red/1600352465.prd
Looking at that page, there looks to be no easy way to create a script that could identify this item as womenswear. Other websites might have breadcrumbs to use, or the gender might be in the title / URL but this has nothing.
As I'm using Scrapy, with the crawl template and Rules to build a hierarchy of links to scrape, I was wondering if it's possible to pass a variable in one of the Rules or the start_urls to mark all items scraped via that rule / start URL as womenswear? I could then feed this variable into a method / loader statement to tag the item as womenswear before putting it into a database.
If not, would anyone have any other ideas on how to categorise this item as womenswear. I saw an example where you could use an excel spreadsheet to create the start_urls and in that excel spreadsheet tag each row as womenswear, mens etc. However, I feel this method might cause issues further down the line and would prefer to avoid it if possible. I'll spare the details of why I think this would be problematic unless anyone asks.
Thanks in advance
There does seem to be a breadcrumb in your example; however, as an alternative you can usually check the page source by simply searching for your term - maybe there's some embedded JavaScript/JSON that can be extracted?
Here you can see some JavaScript with a subcategory field indicating that it's a "womens_everyday_sports_jacket".
You can parse it quite easily with some regex:
import re
re.findall('subcategory: "(.+?)"', response.body_as_unicode())
# ['womens_everyday_sports_jacket']
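
Putting that together, here is a minimal sketch of a Scrapy spider that tags each item with a gender derived from the embedded subcategory. The spider name and the "womens" prefix check are illustrative assumptions, and response.text is the modern replacement for the deprecated body_as_unicode():

import re
import scrapy

class VerySpider(scrapy.Spider):
    # hypothetical spider for illustration
    name = "very"
    start_urls = [
        "https://www.very.co.uk/berghaus-combust-reflect-long-jacket-red/1600352465.prd",
    ]

    def parse(self, response):
        # pull the subcategory out of the JavaScript embedded in the page source
        match = re.search(r'subcategory: "(.+?)"', response.text)
        subcategory = match.group(1) if match else None
        # assumption: womenswear subcategories start with "womens"
        gender = "womenswear" if subcategory and subcategory.startswith("womens") else "unknown"
        yield {"url": response.url, "subcategory": subcategory, "gender": gender}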

HTML <p> nodes InnerText including anchor text in CsQuery

I'm parsing some WordPress blog articles using CsQuery to do some text clustering analysis on them. I'd like to strip out the text from the pertinent <p> node.
var content = dom["div.entry-content>p"];
if (content.Length == 1)
{
    System.Diagnostics.Debug.WriteLine(content[0].InnerHTML);
    System.Diagnostics.Debug.WriteLine(content[0].InnerText);
}
In one of the posts the InnerHTML looks like this:
An MIT Europe project that attempts to <a title="Wired News: Gizmo Puts Cards
on the Table" href="http://www.wired.com/news/technology/0,1282,61265,00.html?
tw=rss.TEK">connect two loved ones seperated by distance</a> through the use
of two tables, a bunch of RFID tags and a couple of projectors.
and the corresponding InnerText like this
An MIT Europe project that attempts to through the use of two tables,
a bunch of RFID tags and a couple of projectors.
i.e. the inner text is missing the anchor text. I could parse the HTML myself but I am hoping there is a way to have CsQuery give me
An MIT Europe project that attempts to connect two loved ones
seperated by distance through the use of two tables, a bunch of RFID
tags and a couple of projectors.
(my italics.) How should I get this?
string result = dom["div.entry-content>p"].Text();
The Text() function will include everything below the p element, including the text of nested tags such as the anchor.
Try HtmlAgilityPack:
using HAP = HtmlAgilityPack;
...
var doc = new HAP.HtmlDocument();
doc.LoadHtml("Your html");
var node = doc.DocumentNode.SelectSingleNode(@"node xPath");
Console.WriteLine(node.InnerText);
xPath is the path to the node on the page.
For example: in Google Chrome, press F12, select your node, right-click and choose "Copy XPath".
This topic's header has the XPath: //*[@id="question-header"]/h1/a

How to read website content in Python

I am trying to write a program which reads articles (posts) from any website, ranging from Blogspot or WordPress blogs to any other site. To write code compatible with almost all websites, which might be written in HTML5/XHTML etc., I thought of using RSS/Atom feeds as the basis for extracting content.
However, since RSS/Atom feeds usually don't contain entire articles, I thought to gather all "post" links from the feed using feedparser and then extract the article content from the respective URL.
I can get the URLs of all articles on a website (including the summary, i.e. the article content shown in the feed), but I want to access the entire article data, for which I have to use the respective URL.
I came across various libraries like BeautifulSoup, lxml etc. (various HTML/XML parsers) but I really don't know how to get the "exact" content of the article (by "exact" I mean the data with all hyperlinks, iframes, slide shows etc. still intact; I don't want the CSS part).
So, can anyone help me on it?
Fetching the HTML code of all linked pages is quite easy.
The hard part is to extract exactly the content you are looking for. If you simply need all code inside of the <body> tag, this shouldn't be a big problem either; extracting all text is equally simple. But if you want a more specific subset, you have more work to do.
I suggest that you download the requests and BeautifulSoup modules (both available via easy_install requests/bs4, or better, pip install requests/bs4). The requests module makes fetching your page really easy.
The following example fetches an RSS feed and builds three lists:
linksoups is a list of the BeautifulSoup instances of each page linked from the feed
linktexts is a list of the visible text of each page linked from the feed
linkimageurls is a list of lists with the src urls of all the images embedded in each page linked from the feed,
e.g. [['/pageone/img1.jpg', '/pageone/img2.png'], ['/pagetwo/img1.gif', 'logo.bmp']]
import requests, bs4

# request the content of the feed and create a BeautifulSoup object from it;
# the feed is XML, and HTML parsers treat <link> as a void element, so use the
# "xml" parser (requires lxml) to keep the urls inside the <link> tags
response = requests.get('http://rss.slashdot.org/Slashdot/slashdot')
responsesoup = bs4.BeautifulSoup(response.text, 'xml')

linksoups = []
linktexts = []
linkimageurls = []

# iterate over all <link>…</link> tags and fill three lists: one with the soups of the
# linked pages, one with all their visible text and one with the urls of all embedded
# images
for link in responsesoup.find_all('link'):
    url = link.text
    linkresponse = requests.get(url)  # add support for relative urls with urlparse
    soup = bs4.BeautifulSoup(linkresponse.text, 'html.parser')
    linksoups.append(soup)
    # append all text between tags inside of the body tag to the second list
    linktexts.append(soup.find('body').text)
    # get the src attribute of each <img> tag and append it to imageurls
    imageurls = []
    for image in soup.find_all('img'):
        if image.get('src'):
            imageurls.append(image['src'])
    linkimageurls.append(imageurls)

# now somehow merge the retrieved information.
That might be a rough starting point for your project.
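
If you want the more specific subset mentioned above rather than the whole <body>, you typically target the site's article container. A minimal sketch, assuming a WordPress-style theme where the post body lives in a div with class entry-content (that class name varies per theme and is an assumption here):

import bs4

html = ('<html><body><div class="entry-content">'
        '<p>Post text with a <a href="https://example.com">link</a>.</p>'
        '</div></body></html>')
soup = bs4.BeautifulSoup(html, 'html.parser')

# keep the markup so hyperlinks, iframes and embeds survive
article = soup.find('div', class_='entry-content')
if article is not None:
    print(str(article))        # inner markup preserved
    print(article.get_text())  # plain visible text only

Serializing the tag with str() keeps hyperlinks and embedded elements intact, which matches the "exact content" requirement in the question.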

Cucumber/Webrat: follow link by CSS class?

Is it possible to follow a link by its class name instead of the id, text or title? Given I have (haha, Cucumber insider, eh?) the following HTML code:
<div id="some_information_container">
Translation here
</div>
I do not want to match by text because I'd have to care about the translation values in my tests.
I want my buttons to all have the same style, so I will use the CSS class.
I don't want to assign an id to every single link, because some of them are perfectly identified through the container and the link class.
Is there anything I missed in Cucumber/Webrat? Or do you have any advice on how to solve this in a better way?
Thanks for your help and best regards,
Joe
edit: I found an interesting discussion going on about this topic right here - it seems to remain an open issue for now. Do you have any other solutions for this?
Here's how I did it with Cucumber; hope it helps. The # in the step definition helps the CSS understand what's going on.
This only works with IDs, not class names.
Step Definition
Then /^(?:|I )should see ([^\"]*) within a div with id "([^\"]*)"$/ do |text, selector|
  # checks for text within a specified div id
  within "##{selector}" do |content|
    if defined?(Spec::Rails::Matchers)
      content.should contain(text)
    else
      hc = Webrat::Matchers::HasContent.new(text)
      assert hc.matches?(content), hc.failure_message
    end
  end
end
Feature
Scenario Outline: Create Project
  When I fill in name with <title>
  And I select <data_type> from data_type
  And I press "Create"
  Then I should see <title> within a div with id "specifications"

  Scenarios: Search Terms and Results
    | data_type | title        |
    | Books     | A Book Title |
Here is how to assert text within an element with the class name "edit_button":
Then I should see "Translation here" within "[@class='edit_button']"
How about find('a.some-class').click?
I'm not very familiar with the Webrat API, but what about using a DOM lookup to get the reference ID of the element with the class you are looking for, then passing that to the click_link function?
Here's a link to some JavaScript to retrieve an element by class:
http://mykenta.blogspot.com/2007/10/getelementbyclass-revisited.html
Now that I think about it, what about using JavaScript to simply change it to some random ID and then clicking that?
Either way, that should work until the debate over a standard get-element-by-class function is resolved.
Does have_tag work for you?
have_tag('a.edit_button')
