Why do I get different answers from css and xpath selectors? - css

I am trying to scrape the following link using scrapy. cpuc website document
There is a table on that page whose values I am trying to scrape. When I scrape using XPath, it gives the correct answer, e.g. response.xpath("//td[@class='ResultTitleTD']/text()").getall()
gives
['Comments filed by Southern California Gas Company on 06/24/2021 Conf# 167430', 'Proceeding: A2011004', 'Comments filed by Southern California Gas Company on 06/24/2021 Conf# 167430 (Certificate Of Service)', 'Proceeding: A2011004']
as expected, but when I run response.css("td.ResultTitleTD::text").getall(), I get an empty list.
Why are css and xpath selectors giving different answers for the same query?

You are trying to use an invalid CSS locator.
You can use td.ResultTitleTD as a valid CSS selector, get those elements, and then extract their text, but you can't use td.ResultTitleTD::text as a CSS selector to access those elements' text directly.
UPD:
Here is the dev tools screenshot where I see those 2 elements located and highlighted using the above CSS selector
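The two-step approach described above (select the cells first, then read their text) can be sketched outside Scrapy. This is only an illustration using Python's standard-library ElementTree as a stand-in for the Scrapy response; the SAMPLE markup is made up and merely borrows the class name and one value from the question.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the table on the page.
SAMPLE = """
<table>
  <tr><td class="ResultTitleTD">Comments filed by Southern California Gas Company</td></tr>
  <tr><td class="ResultTitleTD">Proceeding: A2011004</td></tr>
  <tr><td class="Other">ignored</td></tr>
</table>
"""

def cell_texts(markup, css_class):
    """Step 1: select the <td> elements; step 2: read each element's text."""
    root = ET.fromstring(markup.strip())
    cells = root.findall(f".//td[@class='{css_class}']")
    return [(td.text or "").strip() for td in cells]

print(cell_texts(SAMPLE, "ResultTitleTD"))
```

Note that the [@class='...'] predicate is an exact match on the whole attribute value, so a cell with several classes would need a different test.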

Related

Error during web scraping in R using Selector Gadget

I hope you are all doing well.
I am facing an error during web scraping in R using the Selector Gadget tool. When I select the data with the tool on the Coursera website, the number of values it shows is correct (10). But when I copy that CSS selector into R and run it, it returns 18 names in the list. Can anyone please help me with this? Here is a screenshot of the Selector Gadget output:
And here is what gets returned in R when I scrape that css selector:
The rendered content seen in a browser is not exactly the same as the HTML returned by a plain HTTP request (which is what rvest makes). This is because a browser can run JavaScript to update the content.
Inspect the page source by pressing Ctrl+U in the browser on that webpage.
You can rewrite your CSS selector list to match the HTML that is actually returned. One example is as follows; it also removes the reliance on dynamic classes, which change more frequently and would break your program sooner.
library(rvest)
read_html("https://in.coursera.org/degrees/bachelors") |>
html_elements('[data-e2e="degree-list"] div[class] > p:first-child') |>
html_text2()
Learn about CSS selectors and operators here: https://developer.mozilla.org/en-US/docs/Learn/CSS/Building_blocks/Selectors

Google Search Preview CSS Selector Cannot Be Registered by Rvest?

I am trying to scrape the preview text from a Google search result. My process right now is to put something into the google search box, hit "search", and then my goal is to get the search result titles (which has been achieved), followed by the text below each search title (which I refer to as "preview text").
Interestingly, I cannot seem to get any CSS selector to register in rvest::html_elements when trying to scrape the text from under a search result. Take the following example where I am trying to scrape text from under a search result for Elon Musk:
library(tidyverse)
library(rvest)
## reading the html google search
## next attempting to grab the text under the wikipedia page result
read_html("https://www.google.com/search?q=elon+musk&sxsrf=ALiCzsZR3iIs5wIwO8PsH8c6D3ghkPmCsA%3A1652081047528&ei=l8F4Yr76H-HFkPIP9uevoA4&ved=0ahUKEwj-oem_8dH3AhXhIkQIHfbzC-QQ4dUDCA4&uact=5&oq=elon+musk&gs_lcp=Cgdnd3Mtd2l6EAMyBAgAEEMyCgguELEDEIMBEEMyCggAELEDEIMBEEMyCwgAEIAEELEDEIMBMgoIABCxAxCDARBDMggIABCxAxCDATILCAAQgAQQsQMQgwEyCwgAEIAEELEDEIMBMgsIABCABBCxAxCDATIICAAQsQMQgwE6BwgAEEcQsAM6CggAEOQCELADGAE6DAguEMgDELADEEMYAjoECCMQJzoHCC4QsQMQQzoRCC4QgAQQsQMQgwEQxwEQ0QM6DgguEIAEELEDEMcBENEDOgsILhCxAxCDARDUAjoICC4QgAQQsQM6BAguEEM6DgguEIAEELEDEIMBENQCOgsILhCxAxCDARCRAjoFCAAQgAQ6CAgAEIAEELEDOgcIABCxAxBDSgQIQRgASgQIRhgBUMILWOkTYPUUaAJwAXgAgAGaAYgBsQiSAQMwLjmYAQCgAQHIAQ_AAQHaAQYIARABGAnaAQYIAhABGAg&sclient=gws-wiz") %>%
html_elements(".VwiC3b.yXK7lf.MUxGbd.yDYNvb.lyLwlc")
When I inspect the source, I see that the class is class="VwiC3b yXK7lf MUxGbd yDYNvb lyLwlc". After doing a little research, I figured the CSS selector for this would be .VwiC3b.yXK7lf.MUxGbd.yDYNvb.lyLwlc, since a class selector cannot contain whitespace and this attribute is actually many classes put together.
However, this code does not produce any results and I keep getting empty nodes. I am not sure what the issue is here.
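As a side note on the selector reasoning above: a class attribute is a whitespace-separated list of class names, and a compound selector such as .a.b matches when every named class is present, in any order. Here is a minimal sketch of that rule; matches_classes is a hypothetical helper, not part of rvest or any library, and it only checks the selector logic, not whether the HTML that rvest receives actually carries these classes.

```python
def matches_classes(class_attr: str, selector: str) -> bool:
    """Return True if every class in a compound selector like '.a.b' appears in class_attr."""
    wanted = [name for name in selector.split(".") if name]
    present = set(class_attr.split())
    return all(name in present for name in wanted)

attr = "VwiC3b yXK7lf MUxGbd yDYNvb lyLwlc"
print(matches_classes(attr, ".VwiC3b.yXK7lf.MUxGbd.yDYNvb.lyLwlc"))
print(matches_classes(attr, ".lyLwlc.VwiC3b"))   # order does not matter
print(matches_classes(attr, ".VwiC3b.missing"))  # one class is absent
```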

How can I select an XPath with multiple conditions

From this website, I want to get all genres in the Genres & Subgenres menu using the Selenium framework. For that, I need a general XPath or CSS selector that applies to all of them. I have noticed that all genres have "genreid:D+++" as part of their id and are located in an <input> tag. How can I use this information to get all genres? If you know a better way to solve my problem, please write it.
https://www.allmovie.com/advanced-search
Xpath for all Genres & Subgenres
//input[contains(@id,'genreid')]
I'm more familiar with CSS than XPath so I will answer that part of your question.
To make a CSS selection for an element ID that starts with genreid, you'd use this selector:
[id^=genreid]
That will give you the checkboxes. If you wanted the <li> elements that contain them, you could just select by the classname: .genre
For more info about CSS selection, see MDN's great documentation here.
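To make the "starts with" semantics of [id^=genreid] concrete, here is a small standard-library sketch that collects every element whose id begins with a given prefix. The sample markup and the ids in it are made up for illustration; only the genreid prefix and the .genre class come from the question and answer.

```python
from html.parser import HTMLParser

class IdPrefixCollector(HTMLParser):
    """Collect the id of every element whose id starts with a given prefix."""

    def __init__(self, prefix):
        super().__init__()
        self.prefix = prefix
        self.ids = []

    def handle_starttag(self, tag, attrs):
        element_id = dict(attrs).get("id")
        if element_id is not None and element_id.startswith(self.prefix):
            self.ids.append(element_id)

def ids_starting_with(html, prefix):
    collector = IdPrefixCollector(prefix)
    collector.feed(html)
    collector.close()
    return collector.ids

# Hypothetical markup; the real page's ids will differ.
SAMPLE = """
<ul>
  <li class="genre"><input type="checkbox" id="genreid:D001"> Action</li>
  <li class="genre"><input type="checkbox" id="genreid:D002"> Comedy</li>
  <li><input type="text" id="search-genreid"></li>
</ul>
"""

print(ids_starting_with(SAMPLE, "genreid"))
```

The third input is skipped because its id only contains genreid rather than starting with it; the XPath contains() version above would have matched it as well.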

When using apoc.load.html, Is it possible to return the full HTML rather than only text?

Let's say I want to scrape the Neo4j RefCard found at: https://neo4j.com/docs/cypher-refcard/current/
And I would like to fetch a 'code' example along with its styling. Here's my target. Notice that it has CSS treatment (font, color...):
...so in Neo4j I call the apoc.load.html procedure as shown here, and you can see it's no problem finding the content:
It returns a map with three keys: tagName, attributes, and text.
The text is the issue for me. It's stripped of all styling. I would like for it to let me know more about the styling of the different parts of this text.
The actual HTML in the webpage looks like the following image, with all of these span class tags: cm-string, cm-node, cm-atom, etc. Note that this was not generated by Neo4j's apoc.load.html procedure. It came straight from my Chrome browser's inspect console.
I don't need the actual fonts and colors, just the tag names.
I can see in the documentation that there is an optional config map you can supply, but there's no explanation of what can be configured there. It would be lovely if I could configure it to return, say, HTML rather than text.
The library that Neo4j uses for CSS selection here is jsoup.
So I am hoping to not strip the <span> tags, or otherwise, extract their class names for each segment of text.
Could you not generate the HTML yourself from the properties in your object? It looks like they are all span tags with 3 different classes, depending on whether you're using the property name, property value, or property delimiter.
That is probably how they are generating the HTML themselves.
Okay, two years later I revisited this question I posted, and did find a solution. I'll keep it short.
The APOC procedure CALL apoc.load.html is using the scraping library Jsoup, which is not a full-fledged browser. When it visits a page it reads the html sent by the server but ignores any javascript. As a result, if a page uses javascript for inserting content or even just formatting the content, then Jsoup will miss the html that the javascript would have generated had it run.
So I have just tried out the service at prerender.com. It's simple to use. You pass it a URL, and it fetches that page itself, executing the page's JavaScript as it does. It returns the final result as static HTML.
So if I just call prerender.com with apoc.load.html then the Jsoup library will simply ask for the html and this time it will get the fully rendered html. :)
You can try the following two queries and see the difference pre-rendering makes. The span tags in this page are rendered only by javascript. So if we call it asking for its span tags without pre-rendering we get nothing returned.
CALL apoc.load.html("https://neo4j.com/docs/cypher-refcard/current/", {target:".listingblock pre:contains(age: 38) span"}) YIELD value
UNWIND value.target AS spantags
RETURN spantags
...but if we call it via the prerender.com website, we get a bunch of span tags and their content.
CALL apoc.load.html("https://service.prerender.cloud/https://neo4j.com/docs/cypher-refcard/current/", {target:".listingblock pre:contains(age: 38) span"}) YIELD value
UNWIND value.target AS spantags
RETURN spantags
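The difference between served and rendered HTML can also be seen without Neo4j. Below is a minimal Python sketch, assuming two made-up snippets: SERVED stands in for what the server sends (the spans would only be created by a script) and RENDERED for what a pre-rendering service would hand back. A static parser, like Jsoup in this respect, never runs the script, so it finds no span tags in the first one.

```python
from html.parser import HTMLParser

class TagCounter(HTMLParser):
    """Count start tags as a static parser sees them (no JavaScript is executed)."""

    def __init__(self):
        super().__init__()
        self.counts = {}

    def handle_starttag(self, tag, attrs):
        self.counts[tag] = self.counts.get(tag, 0) + 1

def count_tag(html, tag):
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    return parser.counts.get(tag, 0)

# Hypothetical served page: the span is created only when the script runs.
SERVED = """<html><body><pre id="code"></pre>
<script>
var s = document.createElement("span");
s.className = "cm-string";
s.textContent = "Alice";
document.getElementById("code").appendChild(s);
</script></body></html>"""

# Hypothetical result after a browser (or pre-renderer) has run the script.
RENDERED = """<html><body><pre id="code"><span class="cm-string">Alice</span></pre></body></html>"""

print(count_tag(SERVED, "span"))
print(count_tag(RENDERED, "span"))
```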

How to click on a link based on text in a table using selenium

Hi All,
I have the following table with links that I need to select. In this specific example I need to select the DIY Payroll link, but sometimes it changes its position within the table. The current XPath is:
.//*[@id='catalog-category-div-1']/table/tbody/tr/td[1]/ul/li[4]/a
So I do a:
driver.findElement(By.xpath(".//*[@id='catalog-category-div-1']/table/tbody/tr/td[1]/ul/li[4]/a")).click();
But the problem here is that the link can change position: it can be in td[2] or td[3], and at any li[n] position.
Can I have Selenium go through the table and click on the link based on its text? Will By.linkText() work here?
You can use the following code, which will handle the dynamic changes.
You can use linkText() method as follows:
driver.findElement(By.linkText("DIY Payroll")).click();
If you want to use XPath, then you can use the following code.
driver.findElement(By.xpath(".//a[contains(text(),'DIY Payroll')]")).click();
If you need any more clarification you are welcome :)
I would suggest that you try By.linkText() or By.partialLinkText(). It will locate an A tag that contains the desired text.
driver.findElement(By.linkText("DIY Payroll")).click();
A couple issues you might run into:
The link text may exist more than once on the page. In this case, find an element that's easy to find (e.g. by id) that is a parent of only the link you want and then search from that element.
driver.findElement(By.id("someId")).findElement(By.linkText("DIY Payroll")).click();
The A tag may contain extra spaces, other characters, be capitalized, etc. In these cases, you'll just have to try using .partialLinkText() or trial and error.
In some cases I've seen a link that isn't an A tag or contains additional tags inside. In this case, you're going to have to find another method to locate the text like XPath.
You should use a CSS selector for this case:
Can you try:
By.cssSelector("a.browse-catalog-categories-link")
You can use XPath to do this. //a will select all 'a' tags. The part inside of the square brackets will select everything with text "DIY Payroll". Combined together you get the desired solution.
//a[contains(text(),'DIY Payroll')]
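The idea behind these answers, finding the link by its text rather than by its td/li position, can be sketched without a browser. The snippet below is only an illustration using Python's standard-library ElementTree on made-up markup (the other link names and all href values are hypothetical). It checks each link's full text, which is closer to contains(., 'DIY Payroll') than to contains(text(), ...).

```python
import xml.etree.ElementTree as ET

# Hypothetical table; only the div id and the "DIY Payroll" text come from the question.
SAMPLE = """
<div id="catalog-category-div-1">
  <table><tbody><tr>
    <td><ul><li><a href="/accounting">Accounting</a></li></ul></td>
    <td><ul>
      <li><a href="/taxes">Taxes</a></li>
      <li><a href="/diy-payroll">DIY Payroll</a></li>
    </ul></td>
  </tr></tbody></table>
</div>
"""

def find_link_by_text(markup, text):
    """Return the first <a> element whose text contains `text`, wherever it sits."""
    root = ET.fromstring(markup.strip())
    for link in root.iter("a"):
        if text in "".join(link.itertext()):
            return link
    return None

link = find_link_by_text(SAMPLE, "DIY Payroll")
print(link.get("href") if link is not None else "not found")
```

Because the search walks every a element, it keeps working if the link moves to another td or li.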
