Unable to scrape the data using bs4

I am trying to scrape the star rating for "value" from a TripAdvisor hotel page, but I cannot get the data by class name.
Below is the code I tried:
review_pages=requests.get("https://www.tripadvisor.com/Hotel_Review-g60745-d94367-Reviews-Harborside_Inn-Boston_Massachusetts.html")
soup3=BeautifulSoup(review_pages.text,'html.parser')
value=soup3.find_all(class_='hotels-review-list-parts-AdditionalRatings__bubbleRating--2WcwT')
Value_1=soup3.find_all(class_="hotels-review-list-parts-AdditionalRatings__ratings--3MtoD")
When I try to capture the values, both calls return an empty list. Any direction would be really helpful. I have tried multiple class names from that page and can get various fields such as data, reviews, etc., but I cannot get the bubble ratings for service only.

You can use an attribute = value selector and pass in the class value as a substring with the ^ (starts with) operator. That allows for the different star values that form part of the attribute value.
Or, more simply, use the span type selector to select the child spans:
.hotels-hotel-review-about-with-photos-Reviews__subratings--3DGjN span
In this line:
values=soup3.select('.hotels-hotel-review-about-with-photos-Reviews__subratings--3DGjN [class^="ui_bubble_rating bubble_"]')
Reading from left to right, the first part of the selector matches the parent class of those ratings. The space that follows is a descendant combinator; it joins the parent to the attribute = value selector, which gathers a list of the qualifying descendants. As mentioned, you can replace that last part with just span.
Code:
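import re
import requests
from bs4 import BeautifulSoup
review_pages = requests.get("https://www.tripadvisor.com/Hotel_Review-g60745-d94367-Reviews-Harborside_Inn-Boston_Massachusetts.html")
soup3 = BeautifulSoup(review_pages.content, 'lxml')
# Parent class, then any descendant whose class attribute starts with "ui_bubble_rating bubble_".
# Alternative selector: '.hotels-hotel-review-about-with-photos-Reviews__subratings--3DGjN span'
values = soup3.select('.hotels-hotel-review-about-with-photos-Reviews__subratings--3DGjN [class^="ui_bubble_rating bubble_"]')
Value_1 = values[-1]  # last matching element
print(Value_1['class'][1])  # second class token, which carries the rating
# \d matches a single digit, so only the first digit of the token is captured
stars = re.search(r'\d', Value_1['class'][1]).group(0)
print(stars)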
Although I use re here, I think it is overkill; a simple string replace would do.
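To make the selector and the replace idea concrete, here is a minimal sketch that runs against inline HTML rather than the live site. The markup, the short class name subratings, and the reading of bubble_45 as 4.5 stars are assumptions for illustration; the real page may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical markup modelled on the selector above; the real page may differ.
html = """
<div class="subratings">
  <div><span class="ui_bubble_rating bubble_45"></span> Location</div>
  <div><span class="ui_bubble_rating bubble_40"></span> Service</div>
  <div><span class="ui_bubble_rating bubble_35"></span> Value</div>
</div>
<span class="ui_bubble_rating bubble_50"></span>
"""

soup = BeautifulSoup(html, "html.parser")

# Parent class, descendant combinator (the space), then the "starts with" attribute selector.
bubbles = soup.select('.subratings [class^="ui_bubble_rating bubble_"]')

# The second class token carries the score, e.g. "bubble_45".
tokens = [tag["class"][1] for tag in bubbles]

# Assumption: the number is the rating times ten, so bubble_45 means 4.5.
ratings = [int(token.replace("bubble_", "")) / 10 for token in tokens]

print(tokens)
print(ratings)
```

The span outside the parent div is not selected, which is the point of scoping the selector with the parent class.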

Related

search through xpath only within current scrapy Selector

Using Scrapy, I have extracted a selector like below (I have omitted other tags here for readability)
>>> row.get()
'<tr>\n <td>\n <span class="severity-list__item-text">H</span>\n </li>\n </ul>\n </tr>'
The selector here only contains one instance of span tag with class value "severity-list__item-text". However, the whole page contains 30 like this.
When I write this:
>>> l = row.xpath('//span[@class="severity-list__item-text"]')
>>> len(l)
30
Here, I was expecting to get only the one value within the given selector. However, it returns all the instances present in the page. What is the issue here? How can I limit my search to the given selector?

I think I was missing a leading '.', which makes the expression relative to the current selector instead of the whole document.
>>> l = second_row.xpath('.//span[@class="severity-list__item-text"]')
>>> len(l)
1
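The difference between // and .// can be reproduced outside Scrapy. The sketch below uses lxml directly on a made-up three-row table; it is an illustration under those assumptions, not the original page, though Scrapy selectors should behave the same way for these two expressions.

```python
import lxml.html

# Hypothetical page with three rows, each holding one matching span.
page = lxml.html.fromstring(
    "<html><body><table>"
    '<tr><td><span class="severity-list__item-text">H</span></td></tr>'
    '<tr><td><span class="severity-list__item-text">M</span></td></tr>'
    '<tr><td><span class="severity-list__item-text">L</span></td></tr>'
    "</table></body></html>"
)

row = page.xpath("//tr")[1]  # the second row

# A leading "//" is absolute: it searches the whole document, even when called on row.
absolute = row.xpath('//span[@class="severity-list__item-text"]')

# A leading ".//" is relative: it searches only the descendants of row.
relative = row.xpath('.//span[@class="severity-list__item-text"]')

print(len(absolute), len(relative))
print(relative[0].text)
```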

Finding element location in Shiny

I have a Shiny app with varying table sizes depending on inputs, and I am trying to test the app using RSelenium. I would like to find the element location using XPath syntax. Finding a single element with an exact node path works fine; however, finding several returns no results at all. My Shiny app can't be shared, but the same thing happens on a Shiny app hosted by RStudio.
library(RSelenium)
rd <- rsDriver()
r <- rd$client
r$navigate('https://shiny.rstudio.com/gallery/datatables-demo.html')
r$switchToFrame(r$findElements("css selector", "iframe")[[1]])
e <- r$findElements('xpath', "//*[@id='DataTables_Table_0']/tbody/tr[1]/td[3]")
e[[1]]$getElementText()
e[[1]]$getElementLocation()[c('x', 'y')]
# Works as expected
# Find all elements - does not find any elements
e_all <- r$findElements('xpath', "//*[@id='DataTables_Table_0']/tbody/tr[*]/td[*]")

In the first XPath you are selecting the third td element that is a child of the first tr element under tbody.
In the second XPath, each [*] is a predicate that means "has at least one child element". So you are selecting td elements only if they have a child element, under tr elements that have child elements (which they must, since you want to select their td children).
It is difficult to tell without some sample data, but I'm guessing that none of the td elements have any child elements, and so nothing is selected.
Adjust the XPath to remove both of the predicate filters:
//*[@id='DataTables_Table_0']/tbody/tr/td
That should select all of the columns from all of the rows in that table.
If that selects too many columns and you need to restrict it, provide some example content and describe what you want to select or exclude, and we can help you add an appropriate predicate filter.
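The effect of the [*] predicate is easy to check on a small table. This is a sketch with lxml and invented markup (the id demo and the cell contents are assumptions), showing that td[*] keeps only cells that contain a child element.

```python
import lxml.html

# Hypothetical table: only one cell contains a child element (<b>).
doc = lxml.html.fromstring(
    '<html><body><table id="demo"><tbody>'
    "<tr><td>a</td><td>b</td></tr>"
    "<tr><td>c</td><td><b>d</b></td></tr>"
    "</tbody></table></body></html>"
)

# [*] is a predicate: keep the node only if it has at least one child element.
with_predicate = doc.xpath("//*[@id='demo']/tbody/tr[*]/td[*]")

# Without predicates, every td of every tr is selected.
without_predicate = doc.xpath("//*[@id='demo']/tbody/tr/td")

print(len(with_predicate), len(without_predicate))
```

If every cell held plain text only, the first expression would match nothing, which is consistent with the empty result described in the question.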

Unable to find xpath list trying to use wild card contains text or style

I am trying to write an XPath for this site that selects the items under “Main Lists”. So far I have:
//div[starts-with(@class, ('sm-CouponLink_Label'))]
However this finds 32 matches…
//div[starts-with(@class, ('sm-CouponLink_Label'))][contains(text(),'*')or[contains(Style(),'*')]
Unfortunately in this case I am wanting to use XPaths and not CSS.
It is for this site, my code is here and here's an image of XPATH I am after.
I have also tried:
CSS: div:nth-child(1) > .sm-MarketContainer_NumColumns3 > div > div
Xpath equiv...: //div[1]//div[starts-with(@class, ('sm-MarketContainer_NumColumns3'))]//div//div
Though it does not appear to work.
UPDATED
WORKING CSS: div.sm-Market:has(div >div:contains('Main Lists')) * > .sm-CouponLink_Label
Xpath: //div[Contains(@class, ('sm-Market'))]//preceding::('Main Lists')//div[Contains(@class, ('sm-CouponLink_Label'))]
Not working as of yet..
Though I am unsure whether Selenium has an equivalent for :has
Alternatively...
Something like:
//div[contains(text(),"Main Lists")]//following::div[contains(@class,"sm-Market")]//div[contains(@class,"sm-CouponLink_Label")]//preceding::div[contains(@class,"sm-Market_HeaderOpen ")]
(wrong area)

You can get all the required elements with the code below:
league_names = [league for league in driver.find_elements_by_xpath('//div[normalize-space(#class)="sm-Market" and .//div="Main Lists"]//div[normalize-space(#class)="sm-CouponLink_Label"]') if league.text]
This should return a list of only the non-empty nodes.

If I understand this correctly, you want to narrow down the result of your first XPath to return only the div elements that have inner text or a style attribute. In this case you can use the following XPath:
//div[starts-with(@class, ('sm-CouponLink_Label'))][@style or text()]
UPDATE
As you clarified further, you want to get div with class 'sm-CouponLink_Label' that resides in the 'Main Lists' section. For this purpose, you should try to incorporate the 'Main Lists' in the XPath somehow. This is one possible way (formatted for readability) :
//div[
    div/div/text()='Main Lists'
]//div[
    starts-with(@class, 'sm-CouponLink_Label')
    and
    normalize-space()
]
Notice how normalize-space() is used to filter out empty div elements from the result. This should return 5 elements, as expected.
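For anyone who wants to try the expression without the live site, here is a small sketch using lxml. The markup is invented to mimic the structure the XPath assumes (a section whose header sits at div/div, followed by label divs), so the names and counts are illustrative only.

```python
import lxml.html

# Hypothetical markup: two sections, one of them headed "Main Lists".
doc = lxml.html.fromstring(
    "<html><body>"
    '<div class="sm-Market">'
    "<div><div>Main Lists</div></div>"
    '<div class="sm-CouponLink_Label ">England</div>'
    '<div class="sm-CouponLink_Label "> </div>'
    '<div class="sm-CouponLink_Label ">Spain</div>'
    "</div>"
    '<div class="sm-Market">'
    "<div><div>Other</div></div>"
    '<div class="sm-CouponLink_Label ">Elsewhere</div>'
    "</div>"
    "</body></html>"
)

xpath = (
    "//div[div/div/text()='Main Lists']"
    "//div[starts-with(@class, 'sm-CouponLink_Label') and normalize-space()]"
)

# normalize-space() yields an empty string for the whitespace-only div,
# which counts as false inside the predicate, so that div is dropped.
labels = [div.text_content().strip() for div in doc.xpath(xpath)]
print(labels)
```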

Get data attributes with Nokogiri

I'm scraping a site that has a number of divs with the same ".pane" class and same "data-pane" data attributes.
input = doc.css('.pane[data-pane]')
How do I filter or select from the above to get the div which has a "data-pane" attribute equal to a specific value?

You can just treat it as you would any other attribute with the usual CSS syntax:
input = doc.css('.pane[data-pane="the value"]')

Acquiring all nodes that have ids beginning with "ABC"

I'm attempting to scrape a page that has about 10 columns using Ruby and Nokogiri. Most of the columns are pretty straightforward because they have unique class names. However, some of them have ids with long number strings appended to what would otherwise be a standard name.
For example, gametimes are all picked up with .eventLine-time, team names with .team-name, but this particular one has, for example:
<div class="eventLine-book-value" id="eventLineOpener-118079-19-1522-1">-3 -120</div>
.eventLine-book-value is not specific to this column, so it's not useful. The 13 digits are different for every game, and trying something like:
def nodes_by_selector(filename,selector)
file = open(filename)
doc = Nokogiri::HTML(file)
doc.css(^selector)
end
Has left me with errors. I've seen ^ and ~ be used in other languages, but I'm new to this and I have tried searching for ways to pick up all data under id=eventLineOpener-XXXX to no avail.

To pick up all data under id=eventLineOpener-XXXX, you need to pass 'div[id*=eventLineOpener]' as the selector:
def nodes_by_selector(filename,selector)
file = open(filename)
doc = Nokogiri::HTML(file)
doc.css(selector) #doc.css('div[id*=eventLineOpener]')
end
The method above returns a Nokogiri::XML::NodeSet containing the Nokogiri::XML::Element objects whose id contains eventLineOpener.
To extract the content of these elements, iterate over the set and call the text method on each one. For example, for the first element:
doc.css('div[id*=eventLineOpener]')[0].text
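Note that *= matches the substring anywhere in the id, while ^= matches only ids that begin with it, which is closer to what the title asks for. The sketch below shows the difference in Python with Beautiful Soup, since both operators come from the CSS Selectors standard and the same selector strings should carry over to doc.css. All ids except the first are made up for illustration.

```python
from bs4 import BeautifulSoup

# The first div is from the question; the other three are hypothetical.
html = """
<div class="eventLine-book-value" id="eventLineOpener-118079-19-1522-1">-3 -120</div>
<div class="eventLine-book-value" id="eventLineOpener-118080-19-1522-1">+2 -110</div>
<div class="eventLine-book-value" id="eventLineBook-118079-238-1522-1">-3.5 -105</div>
<div class="eventLine-book-value" id="next-eventLineOpener">n/a</div>
"""
soup = BeautifulSoup(html, "html.parser")

# *= : the id contains the substring anywhere.
contains = [div.get_text() for div in soup.select('div[id*="eventLineOpener"]')]

# ^= : the id begins with the substring.
starts = [div.get_text() for div in soup.select('div[id^="eventLineOpener-"]')]

print(contains)
print(starts)
```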
