Scraping text from a webpage in R

I've been trying to scrape text from this website, but I can't seem to do it correctly.
I've searched around and tried different approaches, but I just can't manage to scrape the reviews section at the bottom of the page as text. Could someone tell me what's wrong with my code?
Here is my code:
newurl <- "https://www.sephora.com/product/virgin-marula-tm-luxury-facial-oil-P392245?icid2=products%20grid:p392245"
newurl <- read_html(newurl)
text <- newurl %>% html_nodes(".css-7rv8g1")
text <- html_text(text)
What I did was use a CSS selector (.css-7rv8g1) to get the nodes for the review section, and then extract the text from those nodes with the code above, but it returns an empty result.
Can someone tell me what I did wrong here?
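One quick diagnostic is to check whether that class appears in the static HTML at all; if it doesn't, the reviews are being injected by JavaScript after the page loads, and read_html() alone will never see them. A minimal sketch of that check (base R only):

# Fetch the raw page source and look for the selector's class name
raw <- paste(readLines("https://www.sephora.com/product/virgin-marula-tm-luxury-facial-oil-P392245?icid2=products%20grid:p392245", warn = FALSE), collapse = "\n")
grepl("css-7rv8g1", raw, fixed = TRUE)  # FALSE means the class is not in the static source

If this returns FALSE, a rendering step (for example via RSelenium, as sketched under the next question) is needed before rvest can find the nodes.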

Related

rvest html_nodes function returning list of 0

Okay, to start: I'm very new to web scraping. I'm trying to learn, and I thought I'd start with something simple: scraping a paragraph of text from a webpage. The page I'm trying to scrape is https://www.cato.org/blog
I'm just trying to scrape the first paragraph, which begins with "Border patrol arrests..."
I added the SelectorGadget extension to chrome to get the CSS selector.
The code I have written is as follows:
url <- "https://www.cato.org/blog"
webpage <- read_html(url)
text <- html_nodes(webpage, "p")
text <- html_text2(text)
However, after running text <- html_nodes(webpage, "p"), I just get an empty list. No errors or anything, just... nothing. Am I doing something wrong? When I look up similar issues, I find answers recommending the RSelenium package, but when I look up that package and how to use it for my task, a lot of it goes over my head.
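For reference, a minimal RSelenium workflow follows the same read-then-select pattern, except a real browser runs the page's JavaScript first. This is a sketch, not a tested solution; the port number and five-second wait are arbitrary assumptions:

library(RSelenium)
library(rvest)

# Start a Selenium-driven Firefox (downloads drivers on first run)
rD <- rsDriver(browser = "firefox", port = 4555L, verbose = FALSE)
remDr <- rD$client

remDr$navigate("https://www.cato.org/blog")
Sys.sleep(5)  # crude wait for dynamic content to finish loading

# getPageSource() returns the rendered HTML, which rvest parses as usual
webpage <- read_html(remDr$getPageSource()[[1]])
text <- html_text2(html_nodes(webpage, "p"))

remDr$close()
rD$server$stop()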

How do I scrape only one section of text from a webpage in R?

I am trying to scrape specific portions of HTML-based journal articles. For example, if I only wanted to scrape the "Statistical analyses" section of an article in a Frontiers publication, how could I do that? Since the number of paragraphs and the location of the section change for each article, SelectorGadget isn't helping.
https://www.frontiersin.org/articles/10.3389/fnagi.2010.00032/full
I've tried using rvest with html_nodes and XPath, but I'm not having any luck. The best I can do is begin scraping at the section I want, but I can't get it to stop afterwards. Any suggestions?
library(rvest)

example_page <- "https://www.frontiersin.org/articles/10.3389/fnagi.2010.00032/full"
example_stats_section <- read_html(example_page) %>%
  html_nodes(xpath = "//h3[contains(., 'Statistical Analyses')]/following-sibling::p") %>%
  html_text()
Since there is a "Results" section after each "Statistical analyses" try
//h3[.='Statistical Analyses']/following-sibling::p[following::h2[.="Results"]]
to get required section
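Dropped into the original pipeline, that suggestion looks like the sketch below (assuming the headings match exactly). The predicate keeps only the following-sibling paragraphs that still have the "Results" heading ahead of them in document order, which is what stops the selection at the end of the section:

library(rvest)

example_page <- "https://www.frontiersin.org/articles/10.3389/fnagi.2010.00032/full"
stats_only <- read_html(example_page) %>%
  html_nodes(xpath = "//h3[.='Statistical Analyses']/following-sibling::p[following::h2[.='Results']]") %>%
  html_text()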

Using rvest, xml2 and SelectorGadget for web scraping results in xml_missing <NA>

I'm trying to scrape information from the following URL:
https://www.google.com/search?q=812-800%20H%20St%20NW
I want to retrieve the highlighted "812 H St NW" (screenshot: https://i.stack.imgur.com/mzY75.png).
SelectorGadget (the Chrome extension) suggests using the node ".desktop-title-content".
However, I get an NA as a result, and I don't understand how to fix this problem.
Here is my code:
link <- "https://www.google.com/search?q=812-800%20H%20St%20NW"
xml2::read_html(link) %>%
rvest::html_node(".desktop-title-content") %>% rvest::html_text()
[1] NA
Thank you
I think you want to check the page source when SelectorGadget does not help you. In this case, you just need to find the text between <title> and </title>. I had some extra text (i.e., "- Google Search") in the result, so I removed it at the end. You may not have that.
read_html("https://www.google.com/search?q=812-800%20H%20St%20NW") %>%
html_nodes("title") %>%
html_text() %>%
sub(pattern = " -.*$", replacement = "")
#[1] "812-800 H St NW "
It looks like the content that I want to get is generated by JavaScript. Therefore, I need to create a .js file and access it using PhantomJS, as per this tutorial: https://www.datacamp.com/community/tutorials/scraping-javascript-generated-data-with-r
Then I will be able to use rvest to scrape the correct content.
Unfortunately, I need to do this for around 2000 different links, so I will be looking for a way to create the 2000 ".js" files automatically.
Thanks for your answers.
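One way to automate that from R is to generate each PhantomJS script from a template and shell out to phantomjs, then parse the saved HTML with rvest. A sketch along the lines of the tutorial's pattern; the script template and file names here are assumptions, and phantomjs must be on the PATH:

library(rvest)

render_with_phantom <- function(link, out_html = "rendered.html") {
  # Write a small PhantomJS script that loads the page and saves the
  # fully rendered HTML to disk
  js <- sprintf("var page = require('webpage').create();
page.open('%s', function(status) {
  require('fs').write('%s', page.content, 'w');
  phantom.exit();
});", link, out_html)
  writeLines(js, "scrape.js")
  system("phantomjs scrape.js")  # run the script, producing out_html
  read_html(out_html)            # parse the rendered page with rvest
}

# For the ~2000 links, something like: pages <- lapply(links, render_with_phantom)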

Scraping YouTube video titles with R

I am trying to extract the titles of YouTube videos from a specific channel. My current scraper is the following:
library(rvest)

url <- 'https://www.youtube.com/channel/UCmgE6sLiR_cC_0T4fRHpZ0A/videos'
webpage <- read_html(url)                        # download and parse the page
titles_html <- html_nodes(webpage, '#contents')  # select the video grid
titles <- html_text(titles_html)                 # extract the text
I'm pretty sure the node is not the correct one, but I can't seem to find anything through Google Chrome's SelectorGadget.
Does anyone know how to get the data in this case?
Thank you very much!
#contents is correct, but it needs more processing; use h3 a to get the title of every video.
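Applied to the code above, that suggestion would look like the following sketch, with the caveat that YouTube builds this grid with JavaScript, so a static read_html() may still come back empty and a rendering step may be needed first:

library(rvest)

titles <- read_html('https://www.youtube.com/channel/UCmgE6sLiR_cC_0T4fRHpZ0A/videos') %>%
  html_nodes('#contents h3 a') %>%  # the anchor inside each video's heading
  html_text()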

html_nodes returns {xml_nodeset (0)}

I am trying to scrape summoner division for each season from lolking.net, using the rvest package in R:
http://www.lolking.net/summoner/na/20130821/Wiggily#/profile
I am trying to use the following code to get the season number:
library(rvest)

url.level <- "http://www.lolking.net/summoner/na/20130821/Wiggily#/profile"
web.page.level <- read_html(url.level)                               # parse the page
node <- html_nodes(web.page.level, css = '.unskew-text.ng-binding')  # select by class
season <- html_text(node)                                            # extract the text
But I always get {xml_nodeset (0)}, and I've had no luck trying XPath either.
Could someone tell me what is wrong with my code? How can I get the content within the HTML class .unskew-text.ng-binding?
As dmi3kno suggested, I am trying to use RSelenium to scrape the page, but there is still a problem.
The HTML of the page is, for example:
<div class="unskew-text ng-binding">S4</div>
I would like to get the text "S4". I tried to use both XPath and CSS:
elem <- remDr$findElement('xpath', "//div[@class='unskew-text ng-binding']")
elem <- remDr$findElement('css', "[class='unskew-text ng-binding']")
But I always get a "no such element" error. Could anyone tell me what I did wrong, or is there any other way I can try?
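For what it's worth, the ng-binding class indicates the div is rendered by Angular after the initial page load, so findElement() can run before the element exists. A sketch of waiting first (the five-second pause is an arbitrary assumption):

# Navigate, give the Angular app time to render, then look up the element
remDr$navigate("http://www.lolking.net/summoner/na/20130821/Wiggily#/profile")
Sys.sleep(5)

elem <- remDr$findElement('xpath', "//div[@class='unskew-text ng-binding']")
elem$getElementText()[[1]]  # e.g. "S4"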
