search through xpath only within current scrapy Selector - css

Using Scrapy, I have extracted a selector like below (I have omitted other tags here for readability)
>>> row.get()
'<tr>\n <td>\n <span class="severity-list__item-text">H</span>\n </li>\n </ul>\n </tr>'
The selector here only contains one instance of span tag with class value "severity-list__item-text". However, the whole page contains 30 like this.
When I write this:
>>> l = row.xpath('//span[@class="severity-list__item-text"]')
>>> len(l)
30
Here, I was expecting to get only the one value within the given selector. However, it returns all the instances present in the page. What is the issue here? How can I limit my search to the given selector?

I was missing a leading '.', which makes the XPath relative to the current selector instead of the document root.
>>> l = second_row.xpath('.//span[@class="severity-list__item-text"]')
>>> len(l)
1
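The scoping rule can be tried without Scrapy. The following is a minimal sketch using the standard library's ElementTree as a stand-in (it is not Scrapy's Selector API, and the three-row table is made-up sample markup): the same './/' query returns every match when run from the document root, but only the matches inside the row when run from that row.

```python
import xml.etree.ElementTree as ET

# Made-up sample page with three rows (the real page had 30 spans).
PAGE = """
<table>
  <tr><td><span class="severity-list__item-text">H</span></td></tr>
  <tr><td><span class="severity-list__item-text">M</span></td></tr>
  <tr><td><span class="severity-list__item-text">L</span></td></tr>
</table>
"""

root = ET.fromstring(PAGE)
query = './/span[@class="severity-list__item-text"]'

# './/' means "descendants of the element the query is run on".
in_page = root.findall(query)   # run from the document root
row = root.findall("tr")[1]     # pick the second row
in_row = row.findall(query)     # run from that row only

print(len(in_page), len(in_row))
```

In Scrapy the difference matters more, because an expression starting with '//' is evaluated against the whole document even when called on a nested selector.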

Related

Unable to scrape the data using bs4

I am trying to scrape the star rating for the "value" data from the Trip Advisor hotels but I am not able to get the data using class name:
Below is the code which I have tried to use:
review_pages=requests.get("https://www.tripadvisor.com/Hotel_Review-g60745-d94367-Reviews-Harborside_Inn-Boston_Massachusetts.html")
soup3=BeautifulSoup(review_pages.text,'html.parser')
value=soup3.find_all(class_='hotels-review-list-parts-AdditionalRatings__bubbleRating--2WcwT')
Value_1=soup3.find_all(class_="hotels-review-list-parts-AdditionalRatings__ratings--3MtoD")
When I try to capture the values, it returns an empty list. Any direction would be really helpful. I have tried multiple class names from that page and I get various fields such as data, reviews, etc., but I am not able to get the bubble ratings for service only.
You can use an attribute = value selector with the ^= (starts with) operator, passing the leading part of the class value as the substring. This allows for the different star values which form the rest of the attribute value.
Or, more simply use the span type selector to select for the child spans.
.hotels-hotel-review-about-with-photos-Reviews__subratings--3DGjN span
In this line:
values=soup3.select('.hotels-hotel-review-about-with-photos-Reviews__subratings--3DGjN [class^="ui_bubble_rating bubble_"]')
The first part of the selector, when reading from left to right, is selecting for the parent class of those ratings. The following space is a descendant combinator combining the following attribute = value selector which gathers a list of the qualifying children. As mentioned, you can replace that with just using span.
Code:
import requests
from bs4 import BeautifulSoup
import re
review_pages=requests.get("https://www.tripadvisor.com/Hotel_Review-g60745-d94367-Reviews-Harborside_Inn-Boston_Massachusetts.html")
soup3=BeautifulSoup(review_pages.content,'lxml')
values=soup3.select('.hotels-hotel-review-about-with-photos-Reviews__subratings--3DGjN [class^="ui_bubble_rating bubble_"]') #.hotels-hotel-review-about-with-photos-Reviews__subratings--3DGjN span
Value_1 = values[-1]
print(Value_1['class'][1])
stars = re.search(r'\d', Value_1['class'][1]).group(0)
print(stars)
Although I use re, I think it is overkill and you could simply use replace.
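To make that last remark concrete, here is a small sketch comparing the two approaches. The sample token 'bubble_45' is an assumption about what Value_1['class'][1] looks like; it was not captured from the live page.

```python
import re

# Assumed example of the second class token, e.g. Value_1['class'][1].
class_token = "bubble_45"

# Regex as in the code above: r'\d' matches only the first digit.
first_digit = re.search(r"\d", class_token).group(0)

# Plain string replace keeps every digit after the prefix.
all_digits = class_token.replace("bubble_", "")

print(first_digit)
print(all_digits)
```

Note that the two are not interchangeable when the token carries more than one digit: the regex returns the first digit only, while replace returns everything after the prefix.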

Unable to find xpath list trying to use wild card contains text or style

I am trying to find an XPATH for this site the XPath under “Main Lists”. I have so far:
//div[starts-with(@class, ('sm-CouponLink_Label'))]
However this finds 32 matches…
//div[starts-with(@class, ('sm-CouponLink_Label'))][contains(text(),'*')or[contains(Style(),'*')]
Unfortunately in this case I am wanting to use XPaths and not CSS.
It is for this site, my code is here and here's an image of XPATH I am after.
I have also tried:
CSS: div:nth-child(1) > .sm-MarketContainer_NumColumns3 > div > div
Xpath equiv...: //div[1]//div[starts-with(@class, ('sm-MarketContainer_NumColumns3'))]//div//div
Though it does not appear to work.
UPDATED
WORKING CSS: div.sm-Market:has(div >div:contains('Main Lists')) * > .sm-CouponLink_Label
Xpath: //div[Contains(@class, ('sm-Market'))]//preceding::('Main Lists')//div[Contains(@class, ('sm-CouponLink_Label'))]
Not working as of yet..
Though I am unsure whether Selenium has an equivalent for :has
Alternatively...
Something like:
//div[contains(text(),"Main Lists")]//following::div[contains(@class,"sm-Market")]//div[contains(@class,"sm-CouponLink_Label")]//preceding::div[contains(@class,"sm-Market_HeaderOpen ")]
(wrong area)
You can get all required elements with below piece of code:
league_names = [league for league in driver.find_elements_by_xpath('//div[normalize-space(@class)="sm-Market" and .//div="Main Lists"]//div[normalize-space(@class)="sm-CouponLink_Label"]') if league.text]
This should return a list of only the non-empty nodes.
If I understand this correctly, you want to narrow down further the result of your first XPath to return only div that has inner text or has attribute style. In this case you can use the following XPath :
//div[starts-with(@class, ('sm-CouponLink_Label'))][@style or text()]
UPDATE
As you clarified further, you want to get div with class 'sm-CouponLink_Label' that resides in the 'Main Lists' section. For this purpose, you should try to incorporate the 'Main Lists' in the XPath somehow. This is one possible way (formatted for readability) :
//div[
div/div/text()='Main Lists'
]//div[
starts-with(@class, 'sm-CouponLink_Label')
and
normalize-space()
]
Notice how normalize-space() is used to filter out empty div elements from the result. This returned 5 elements, as expected, when I tested it in Chrome.
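For anyone who wants to see what that predicate does step by step, here is a rough sketch of the same filtering written in plain Python over the standard library's ElementTree (which does not support starts-with() or normalize-space() in its own path syntax). The markup and the helper name main_list_labels are made up for illustration; only the class names come from the question.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup, loosely modelled on the class names in the question.
HTML = """
<root>
  <div class="sm-Market">
    <div><div>Main Lists</div></div>
    <div class="sm-CouponLink_Label ">Premier League</div>
    <div class="sm-CouponLink_Label ">   </div>
    <div class="sm-CouponLink_Label ">La Liga</div>
  </div>
  <div class="sm-Market">
    <div><div>Other</div></div>
    <div class="sm-CouponLink_Label ">Elsewhere</div>
  </div>
</root>
"""

def main_list_labels(root):
    """Collect non-empty label texts from the section headed 'Main Lists'."""
    labels = []
    for market in root.iter("div"):
        # Same idea as the predicate [div/div/text()='Main Lists'].
        if not any(d.text == "Main Lists" for d in market.findall("./div/div")):
            continue
        # iter() also visits the section div itself; it fails the class test.
        for div in market.iter("div"):
            cls = div.get("class", "")
            # Collapse whitespace, similar in spirit to normalize-space().
            text = " ".join("".join(div.itertext()).split())
            if cls.startswith("sm-CouponLink_Label") and text:
                labels.append(text)
    return labels

labels = main_list_labels(ET.fromstring(HTML))
print(labels)
```

The whitespace-only div is dropped for the same reason it is dropped in the XPath: its normalized text is an empty string, which counts as false.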

Using xpath to match a string that contains any integer (between 0 and 9)

I am new to XPath and need some help.
The system auto generates the id which looks something like this:
<input type="file" class="form-file" size="22"
name="files[entry-23245_field_entry_attachment_und_0]"
id="edit-entry-23245-field-entry-attachment-und-0-upload"
style="background-color: transparent;">
I am able to locate the id using XPath or CSS; however, the numbers within the id string are randomly generated, so the next time my test runs it fails because it can't locate the string.
I would like to know whether it is possible to write an XPath expression that matches the start of the string, edit-entry-, then any run of digits (0-9) in place of -23245-, and then the end part, field-entry-attachment-und-0-upload. That way my test can locate the element every time, even if the numbers within the string change. I've tried adding \d+ to my XPath, but it doesn't seem to pick it up.
This is the xpath:
//*[@id="edit-entry-23245-field-entry-attachment-und-0-upload"]
That is because your Xpath isn't extracting the right attribute. Use an Xpath like this to get the id of the element:
//input[@type="file" and @class="form-file"]/@id
This Regex should extract the value you are looking for:
/edit-entry-\d+-field-entry-attachment-und-0-upload/
If \d+ doesn't work for you this is another possibility:
/edit-entry-[0-9]+-field-entry-attachment-und-0-upload/
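Since the regex is applied in the test code after the id has been read (not inside the XPath itself), it could look something like this. Python is an assumption here, since the question does not say which language the tests use, and is_upload_id is just an illustrative helper name.

```python
import re

# Same pattern as above, without the surrounding slashes.
ID_PATTERN = re.compile(r"edit-entry-\d+-field-entry-attachment-und-0-upload")

def is_upload_id(element_id):
    """Return True when the whole id matches the generated-id pattern."""
    return ID_PATTERN.fullmatch(element_id) is not None

print(is_upload_id("edit-entry-23245-field-entry-attachment-und-0-upload"))
print(is_upload_id("edit-entry-abc-field-entry-attachment-und-0-upload"))
```

Using fullmatch rather than search means ids with extra text before or after the pattern are rejected.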
Just an idea for workaround, since we can't use regex in pure XPath.
In case you just need to match <input> element with id equals "edit-entry-[arbitrary-characters-here]-field-entry-attachment-und-0-upload", we can use starts-with() and ends-with() functions like this :
//*
[starts-with(@id, "edit-entry-")
and
ends-with(@id, "-field-entry-attachment-und-0-upload")]
and in case you're using XPath 1.0 where ends-with() function is not available :
//*
[starts-with(@id, "edit-entry-")
and
(
"-field-entry-attachment-und-0-upload"
=
substring(@id, string-length(@id) - string-length("-field-entry-attachment-und-0-upload") +1)
)
)
]
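The index arithmetic in the XPath 1.0 version is easy to get wrong by one, so here is a small Python sketch that mirrors it. The helper names are made up, and xpath_substring_from only imitates the two-argument substring() for integer start positions (XPath counts characters from 1).

```python
def xpath_substring_from(s, start):
    """Imitate XPath 1.0 substring(s, start) for an integer, 1-based start."""
    # A start of 1 or less yields the whole string.
    return s[max(start - 1, 0):]

def ends_with_xpath1(value, suffix):
    """Mirror: suffix = substring(value, string-length(value) - string-length(suffix) + 1)."""
    start = len(value) - len(suffix) + 1
    return suffix == xpath_substring_from(value, start)

SUFFIX = "-field-entry-attachment-und-0-upload"
print(ends_with_xpath1("edit-entry-23245-field-entry-attachment-und-0-upload", SUFFIX))
print(ends_with_xpath1("edit-entry-23245-field-entry-attachment-und-0-uploads", SUFFIX))
```

The "+1" is what turns a 0-based offset (length difference) into XPath's 1-based character position.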

How to convert complex xpath to css

I have a complex html structure. New to CSS. Want to change my xpath to css as there could be some performance impact in IE
Xpath by firebug: .//*[@id='T_I:3']/span/a
I finetuned to : //div[@id='Overview']/descendant::*[@id='T_I:3']/span/a
Now I need corresponding CSS for the same. Is it possible or not?
First of all, I don't think your "finetuning" did the best possible job. An element id should be unique in the document and is therefore usually cached by modern browsers (which means that id lookup is instant). You can help the XPath engine by using the id() function.
Therefore, the XPath expression would be: id('T_I:3')/span/a (yes, that's a valid XPath 1.0 expression).
Anyway, to convert this to CSS, you'd use: #T_I:3 > span > a
Your "finetuned" expression converted would be: div#Overview #T_I:3 > span > a, but seriously, you only need one id selection.
The hashtag # is an id selector.
The space ( ) is a descendant combinator.
The > sign is a child combinator.
EDIT based on a good comment by Fréderic Hamidi:
I don't think #T_I:3 is valid (the colon would be confused with the
start of a pseudo-class). You would have to find a way to escape it.
It turns out you also need to escape the underscore. For this, use the techniques mentioned in this SO question: Handling a colon in an element ID in a CSS selector.
The final CSS selector would be:
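#T\5FI\3A 3 > span > a
(The space after \3A ends the hex escape; without it, \3A3 would be read as a single escape for a different character.)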
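If the ids are generated and the selectors are built in code, the escaping can be automated. Below is a rough Python sketch; css_id_selector is a made-up helper, not part of any library, and it is deliberately over-cautious: it hex-escapes everything except ASCII letters, digits and '-', and it does not cover every edge case of CSS identifier syntax.

```python
def css_id_selector(element_id):
    """Build a '#id' selector, hex-escaping all but ASCII letters, digits and '-'."""
    parts = []
    for index, ch in enumerate(element_id):
        plain = ch.isascii() and (ch.isalnum() or ch == "-")
        if index == 0 and ch.isdigit():
            plain = False  # an identifier cannot start with an unescaped digit
        # The space after the hex code ends the escape, so a following
        # hex digit (like the 3 in T_I:3) is not swallowed into it.
        parts.append(ch if plain else "\\{:x} ".format(ord(ch)))
    return "#" + "".join(parts)

selector = css_id_selector("T_I:3") + " > span > a"
print(selector)
```

Always emitting the trailing space keeps the helper simple; the space belongs to the escape and is not treated as a descendant combinator.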

What is caret symbol ^ used for in css when selecting elements?

I encountered a css selector in a file like this:
#contactDetails ul li a, a[href^=tel] {....}
The circumflex character “^” as such has no defined meaning in CSS. The two-character operator “^=” can be used in attribute selectors. Generally, [attr^=val] refers to those elements that have the attribute attr with a value that starts with val.
Thus, a[href^=tel] refers to such a elements that have the attribute href with a value that starts with tel. It is probably meant to distinguish telephone number links from other links; it’s not quite adequate for that, since the selector also matches e.g. ... but it is probably meant to match only links with tel: as the protocol part. So a[href^="tel:"] would be safer.
a[href^="tel"]
^= selects elements that have the specified attribute with a value beginning exactly with the given string.
Here it selects all anchor elements whose href attribute value starts exactly with the string 'tel'.
The caret "^" used like that will match a tags where the href starts with "tel" ( http://csscreator.com/content/attribute-selector-starts )
It means a tags whose href attribute begins with "tel"
Example:
This is a link
will match.
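The prefix test behind ^= is the same as a plain string "starts with" check, which makes the difference between tel and tel: easy to see. A small Python sketch follows; the href values are invented for illustration and matches_href_prefix is a made-up helper.

```python
def matches_href_prefix(href, prefix):
    """Emulate a[href^=prefix]: true when href starts with prefix."""
    return href.startswith(prefix)

# Invented sample hrefs.
hrefs = ["tel:+15550100", "telephone.html", "mailto:someone@example.com"]

loose = [h for h in hrefs if matches_href_prefix(h, "tel")]
strict = [h for h in hrefs if matches_href_prefix(h, "tel:")]

print(loose)
print(strict)
```

With the shorter prefix the ordinary page link is picked up as well; including the colon limits the match to links that use tel: as the protocol.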

Resources