Error during web scraping in R using Selector Gadget

I hope you are all doing well.
I am facing an error while web scraping in R with the SelectorGadget tool. When I select the data on the Coursera website, the tool shows the correct number of values (10), but when I copy that CSS selector into R and run it, I get 18 names in the list. Can anyone help me with this? Here is a screenshot of the SelectorGadget output:
And here is what gets returned in R when I scrape with that CSS selector:

The content rendered in a browser is not necessarily the same as the content returned by a plain HTTP request, which is what rvest makes. This is because a browser can run JavaScript to update the content.
Inspect the actual page source by pressing Ctrl+U in your browser on that webpage.
You can rewrite your CSS selector to match the HTML that is actually returned. One example follows; it also removes the reliance on dynamically generated class names, which change frequently and would break your program sooner.
library(rvest)

# parse the raw HTML the server returns; no JavaScript is executed
read_html("https://in.coursera.org/degrees/bachelors") |>
  html_elements('[data-e2e="degree-list"] div[class] > p:first-child') |>
  html_text2()
Learn about CSS selectors and operators here: https://developer.mozilla.org/en-US/docs/Learn/CSS/Building_blocks/Selectors

Related

Google Search Preview CSS Selector Cannot Be Registered by Rvest?

I am trying to scrape the preview text from a Google search result. My process right now is to put something into the Google search box, hit "Search", and then grab the search result titles (which I have achieved), followed by the text below each title (which I refer to as "preview text").
Interestingly, I cannot get any CSS selector to register in rvest::html_elements when trying to scrape the text under a search result. Take the following example, where I am trying to scrape the text under a search result for Elon Musk:
library(tidyverse)
library(rvest)
## reading the html google search
## next attempting to grab the text under the wikipedia page result
read_html("https://www.google.com/search?q=elon+musk&sxsrf=ALiCzsZR3iIs5wIwO8PsH8c6D3ghkPmCsA%3A1652081047528&ei=l8F4Yr76H-HFkPIP9uevoA4&ved=0ahUKEwj-oem_8dH3AhXhIkQIHfbzC-QQ4dUDCA4&uact=5&oq=elon+musk&gs_lcp=Cgdnd3Mtd2l6EAMyBAgAEEMyCgguELEDEIMBEEMyCggAELEDEIMBEEMyCwgAEIAEELEDEIMBMgoIABCxAxCDARBDMggIABCxAxCDATILCAAQgAQQsQMQgwEyCwgAEIAEELEDEIMBMgsIABCABBCxAxCDATIICAAQsQMQgwE6BwgAEEcQsAM6CggAEOQCELADGAE6DAguEMgDELADEEMYAjoECCMQJzoHCC4QsQMQQzoRCC4QgAQQsQMQgwEQxwEQ0QM6DgguEIAEELEDEMcBENEDOgsILhCxAxCDARDUAjoICC4QgAQQsQM6BAguEEM6DgguEIAEELEDEIMBENQCOgsILhCxAxCDARCRAjoFCAAQgAQ6CAgAEIAEELEDOgcIABCxAxBDSgQIQRgASgQIRhgBUMILWOkTYPUUaAJwAXgAgAGaAYgBsQiSAQMwLjmYAQCgAQHIAQ_AAQHaAQYIARABGAnaAQYIAhABGAg&sclient=gws-wiz") %>%
html_elements(".VwiC3b.yXK7lf.MUxGbd.yDYNvb.lyLwlc")
When I inspect the source, I see that the class is class="VwiC3b yXK7lf MUxGbd yDYNvb lyLwlc". After a little research, I figured the CSS selector for this would be .VwiC3b.yXK7lf.MUxGbd.yDYNvb.lyLwlc, since CSS selectors cannot contain whitespace and this is actually several classes put together.
However, this code does not produce any results; I keep getting empty nodes. I am not sure what the issue is here.
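The chained-class syntax is not the problem: p.a.b matches a <p> element whose class attribute contains both a and b. Here is a quick sanity check against a toy document built with rvest's minimal_html() (the class names are reused purely for illustration):
library(rvest)

# a tiny stand-in document containing one multi-class element
doc <- minimal_html('<p class="VwiC3b yXK7lf">preview text</p><p class="VwiC3b">other</p>')

# chaining classes with dots matches elements that carry ALL of the listed classes
doc |> html_elements("p.VwiC3b.yXK7lf") |> html_text2()
#> [1] "preview text"
Since the selector syntax is fine, the empty result most likely means the HTML Google serves to rvest differs from the DOM you inspect in the browser, as explained in the answer at the top of this page.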

Scraping Data on TradingView with BeautifulSoup

I just started learning web scraping and decided to scrape the daily value from this site:
https://www.tradingview.com/symbols/INDEX-MMTW/
I am using BeautifulSoup, inspecting the element in the browser, and then using Copy -> CSS Selector.
However, the returned results always have zero length. I tried both the select() method (from Automate the Boring Stuff) and the find() method.
Not sure what I am doing wrong. Here is the code...
import requests, bs4
res = requests.get('https://www.tradingview.com/symbols/INDEX-MMTW/')
res.raise_for_status()
nmmtw_data = bs4.BeautifulSoup(res.text, 'lxml')
# (Instead of writing the selector yourself, you can also right-click on the element in your
# browser and select Inspect Element. When the browser's developer console opens, right-click
# on the element's HTML and select Copy ▸ CSS Selector to copy the selector string to the
# clipboard and paste it into your source code.)
elems = nmmtw_data.select("div.js-symbol-last > span:nth-child(1)")
new_try = nmmtw_data.find(class_="tv-symbol-price-quote__value js-symbol-last")
print(type(new_try))
print(len(new_try))
print(elems)
print(type(elems))
print(len(elems))
Thanks in advance!
Since the price is generated with JavaScript, unfortunately we cannot scrape it with BeautifulSoup alone. Instead, you should use a web-browser automation framework.
I'm sure you've found the solution by now, but if not, I believe the answer to your problem is the selenium module. You will also need to install the webdriver specific to the browser you're using. BeautifulSoup alone is quite limited these days because so many sites are rendered with JavaScript.
All the info that you need for selenium you can find here:
https://www.selenium.dev/documentation/webdriver/
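For readers following this page in R, the same browser-automation approach can be sketched with RSelenium. This is a minimal sketch, assuming a working Firefox/geckodriver setup and that the js-symbol-last class from the question still marks the price element:
library(RSelenium)
library(rvest)

# start a real browser session (RSelenium manages the webdriver binary)
driver <- rsDriver(browser = "firefox", verbose = FALSE)
remDr <- driver$client

remDr$navigate("https://www.tradingview.com/symbols/INDEX-MMTW/")
Sys.sleep(5)  # crude wait so the page's JavaScript can render the quote

# parse the fully rendered DOM rather than the raw server response
page <- read_html(remDr$getPageSource()[[1]])
page |> html_element(".js-symbol-last") |> html_text2()

remDr$close()
driver$server$stop()
A production version would replace the Sys.sleep() call with an explicit wait that polls for the element.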

Scraping webpage (with R) where all elements are placed inside an <app-root> tag

I am trying to solve the RPA challenge (www.rpachallenge.com) using R and RSelenium.
I always use XPath to select elements/tags of an HTML page, but I just cannot extract anything from this one, not even with the classic rvest::html_nodes. I believe the problem is that the HTML here is generated via JavaScript, because all the elements in the body sit inside <app-root>...</app-root> tags (which Google tells me is how Angular apps are written).
Inspecting the nodesets shows the different structure of the scraped page: the app-root tag is there, but there is nothing in it. Any idea how to access the tags in this page?
# You can try it yourself by running this chunk
library(rvest)
library(magrittr)
url <- "http://www.rpachallenge.com"
rpa <- url %>%
  read_html()
# <app-root> is present in the served HTML but empty: its content is injected by JavaScript
rpa %>% html_elements("app-root") %>% html_children()
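One way forward, since the question already mentions RSelenium, is to drive a real browser so Angular's JavaScript actually runs, then query the rendered DOM with XPath directly. Unlike the getPageSource() sketch in the TradingView thread above, this returns live elements you could also type into. A minimal sketch; the //app-root//input XPath is an assumption about the rendered form:
library(RSelenium)

driver <- rsDriver(browser = "firefox", verbose = FALSE)
remDr <- driver$client

remDr$navigate("http://www.rpachallenge.com")
Sys.sleep(3)  # crude wait for Angular to render the content inside <app-root>

# XPath now runs against the rendered DOM, not the empty server HTML
inputs <- remDr$findElements(using = "xpath", "//app-root//input")
length(inputs)  # > 0 once the form has rendered

remDr$close()
driver$server$stop()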

When using apoc.load.html, is it possible to return the full HTML rather than only text?

Let's say I want to scrape the Neo4j RefCard found at: https://neo4j.com/docs/cypher-refcard/current/
I would like to fetch a 'code' example along with its styling. Here's my target; notice that it has CSS treatment (font, color, ...):
...so in Neo4j I call the apoc.load.html procedure as shown here, and you can see it has no problem finding the content:
It returns a map with three keys: tagName, attributes, and text.
The text is the issue for me: it is stripped of all styling. I would like it to tell me more about the styling of the different parts of this text.
The actual HTML in the webpage looks like the following image, with all of these span classes: cm-string, cm-node, cm-atom, etc. Note that this was not generated by Neo4j's apoc.load.html procedure; it came straight from my Chrome browser's inspect console.
I don't need the actual fonts and colors, just the tag names.
I can see in the documentation that there is an optional config map you can supply, but there is no explanation of what can be configured there. It would be lovely if I could configure it to return, say, HTML rather than text.
The library that Neo4j uses for CSS selection here is jsoup.
So I am hoping either to keep the <span> tags or, failing that, to extract their class names for each segment of text.
Could you not generate the HTML yourself from the properties in your object? It looks like they are all span tags with three different classes, depending on whether you're looking at the property name, property value, or property delimiter.
That is probably how they are generating the HTML themselves.
Okay, two years later I revisited this question I posted, and did find a solution. I'll keep it short.
The APOC procedure apoc.load.html uses the scraping library Jsoup, which is not a full-fledged browser. When it visits a page, it reads the HTML sent by the server but ignores any JavaScript. As a result, if a page uses JavaScript to insert content, or even just to format it, Jsoup will miss the HTML that the JavaScript would have generated had it run.
So I tried out the pre-rendering service at prerender.cloud. It's simple to use: you give it a URL as an argument, it fetches that page itself, executes the page's JavaScript as it does, and returns the final result as static HTML.
So if I point apoc.load.html at the prerender URL, the Jsoup library simply asks for the HTML as before, but this time it gets the fully rendered HTML. :)
You can try the following two queries and see the difference pre-rendering makes. The span tags in this page are rendered only by JavaScript, so if we ask for them without pre-rendering, we get nothing back.
CALL apoc.load.html("https://neo4j.com/docs/cypher-refcard/current/", {target:".listingblock pre:contains(age: 38) span"}) YIELD value
UNWIND value.target AS spantags
RETURN spantags
...but if we call it via the prerender.cloud service, we get back a set of span tags and their content.
CALL apoc.load.html("https://service.prerender.cloud/https://neo4j.com/docs/cypher-refcard/current/", {target:".listingblock pre:contains(age: 38) span"}) YIELD value
UNWIND value.target AS spantags
RETURN spantags

Having problems filtering by XPath

I'm trying to build a Hacker News scraper using Symfony 2's DomCrawler [1].
When I try out the XPath with a Chrome plugin [2], it works. But when I try it in my scraper, I keep getting "The current node list is empty."
Here's my scraper code:
$crawler1 = $client1->request('GET', 'https://news.ycombinator.com/item?id=8296437');
$hnpost->selftext = $crawler1
    ->filterXPath('/html/body/center/table/tbody/tr[3]/td/table[1]/tbody/tr[4]/td[2]')
    ->text();
[1] http://api.symfony.com/2.0/Symfony/Component/DomCrawler/Crawler.html#method_filter
[2] https://chrome.google.com/webstore/detail/xpath-helper/hgimnogjllphhhkhlmebbmlgjoejdpjl?hl=en-US
If the problem is what I think it is, I've been battered by this one a couple of times. Chrome implicitly adds any missing <tbody> tags to the DOM, so if you then copy the XPath or CSS path, you may also have copied tags that don't actually exist in the source document. View the page's source and check whether the DOM reported by your browser's console corresponds to the original source HTML. If the <tbody> tags are absent there, exclude them from your filterXPath() call, e.g. /html/body/center/table/tr[3]/td/table[1]/tr[4]/td[2].
