I am trying to scrape summoner division for each season from lolking.net, using the rvest package in R.
http://www.lolking.net/summoner/na/20130821/Wiggily#/profile
I am trying to use the following code to get the season number.
library(rvest)

url.level <- "http://www.lolking.net/summoner/na/20130821/Wiggily#/profile"
web.page.level <- read_html(url.level)
node <- html_nodes(web.page.level, css = '.unskew-text.ng-binding')
season <- html_text(node)
But I always get {xml_nodeset (0)}, and I have had no luck with XPath either.
Could someone tell me what is wrong with my code? How can I get the content within the element with class 'unskew-text ng-binding'?
As dmi3kno suggested, I am trying to use RSelenium to scrape the page, but there is still a problem.
The HTML of the page contains, for example:
<div class="unskew-text ng-binding">S4</div>
I would like to get the text 'S4'. I have tried both XPath and CSS:
elem <- remDr$findElement('xpath', "//div[@class='unskew-text ng-binding']")
elem <- remDr$findElement('css', "div.unskew-text.ng-binding")
But I always get a "no such element" error. Could anyone tell me what I did wrong, or is there another way I can try?
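For what it's worth, here is a minimal RSelenium sketch along those lines; it assumes a Selenium server already running locally (the port and browser name are placeholders) and waits for the Angular app to render before searching:

library(RSelenium)

remDr <- remoteDriver(remoteServerAddr = "localhost", port = 4444L,
                      browserName = "chrome")
remDr$open()
remDr$navigate("http://www.lolking.net/summoner/na/20130821/Wiggily#/profile")
Sys.sleep(5)  # crude wait for the Angular (ng-binding) content to render

elem <- remDr$findElement(using = "xpath",
                          value = "//div[@class='unskew-text ng-binding']")
elem$getElementText()  # the badge text, e.g. "S4"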
Okay, to start: I'm very new to web scraping. I'm trying to learn, and I thought I'd start with something simple - scraping a paragraph of text from a webpage. The webpage I'm trying to scrape is https://www.cato.org/blog
I'm just trying to scrape the first paragraph that begins with "Border patrol arrests..."
I added the SelectorGadget extension to Chrome to get the CSS selector.
The code I have written is as follows:
url <- "https://www.cato.org/blog"
webpage <- read_html(url)
text <- html_nodes(webpage, "p")
text <- html_text2(text)
However, after running text <- html_nodes(webpage, "p"), I just get an empty list. No errors or anything, just... nothing. Am I doing something wrong? When I look up similar issues, I find answers recommending the RSelenium package, but when I look up how to use it for my task, a lot of it goes over my head.
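One quick sanity check (a sketch, not a fix) is to test whether the text you want exists in the static HTML at all; if it does not, the page fills it in with JavaScript and rvest alone will never see it:

library(rvest)

raw <- read_html("https://www.cato.org/blog")
# FALSE here would mean the paragraph is injected client-side by JavaScript
grepl("Border patrol arrests", as.character(raw), fixed = TRUE)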
I've been trying to scrape text from this website, but I can't seem to do it correctly.
I've tried searching and trying different approaches, but I just can't scrape the reviews section at the bottom of the page as text. Could someone tell me what's wrong with my code?
Here is my code:
newurl <- "https://www.sephora.com/product/virgin-marula-tm-luxury-facial-oil-P392245?icid2=products%20grid:p392245"
newurl <- read_html(newurl)
text <- newurl %>% html_nodes(".css-7rv8g1")
text <- html_text(text)
What I did was use a CSS selector to find the nodes for the review section, which was .css-7rv8g1, and then select those nodes to get the text with the code above, but it returns an empty result.
Can someone tell me what I did wrong here?
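If the reviews are injected by JavaScript (a common reason html_nodes() comes back empty against the static HTML), one approach is to render the page first. Below is a rough RSelenium sketch; note that auto-generated class names like .css-7rv8g1 change between site builds, and the scroll-and-wait amounts are guesses:

library(RSelenium)
library(rvest)

rD <- rsDriver(browser = "firefox", port = 4545L)
remDr <- rD$client
remDr$navigate("https://www.sephora.com/product/virgin-marula-tm-luxury-facial-oil-P392245")

# Reviews tend to load lazily; scroll to the bottom and give the page time.
remDr$executeScript("window.scrollTo(0, document.body.scrollHeight);")
Sys.sleep(5)

reviews <- remDr$getPageSource()[[1]] %>%
  read_html() %>%
  html_nodes(".css-7rv8g1") %>%  # re-check this selector in DevTools first
  html_text()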
GOAL: I'm trying to scrape win-loss records for NBA teams from basketball-reference.com.
More broadly, I'm trying to better understand how to correctly use SelectorGadget to scrape specified elements from a website, but I would appreciate a solution for this particular problem.
The URL I'm using (https://www.basketball-reference.com/leagues/NBA_2018_standings.html) has multiple tables on it, so I'm trying to use SelectorGadget to specify the element I want, which is the "Expanded Standings" table - about 1/3 of the way down the page.
I have read various tutorials about web scraping that involve the rvest and dplyr packages, as well as the SelectorGadget browser add-in (which I have installed in Chrome, my browser of choice), so that's the approach I'm going for.
Here is my code so far:
url <- "https://www.basketball-reference.com/leagues/NBA_2018_standings.html"
css <- "#expanded_standings"
url %>%
read_html() %>%
html_nodes(css) %>%
html_table()
The result of this code is an error:
Error: html_name(x) == "table" is not TRUE
When I delete the last line of code, I get:
url %>%
  read_html() %>%
  html_nodes(css)

{xml_nodeset (0)}
It seems like there's an issue with how I'm defining the CSS object, or with how I'm using the SelectorGadget tool. What I've been doing is clicking at the very right edge of the desired table, so that the table has a rectangle around it.
I've also tried clicking a specific cell in the table (i.e., "65-17", the value in the "Overall" column for the Houston Rockets row), but that seems to highlight some, but not all, of the table, plus random parts of other tables on the page.
Can anyone provide a solution? Bonus points if you can help me understand where/why what I'm doing is incorrect.
Thanks in advance!
library(rvest)
library(dplyr)
library(stringr)
library(magrittr)

url <- "https://www.basketball-reference.com/leagues/NBA_2018_standings.html"
# The table itself (#expanded_standings) is hidden inside an HTML comment,
# so target the wrapper div (#all_expanded_standings) instead.
css <- "#all_expanded_standings"

webpage <- read_html(url)
print(webpage)
mynode <- html_nodes(webpage, css)

# Strip the comment markers so the table becomes live HTML again.
mystr <- toString(mynode)
mystr <- gsub("<!--", "", mystr)
mystr <- gsub("-->", "", mystr)

newdiv <- read_html(mystr)
newtable <- html_nodes(newdiv, "#expanded_standings")
newframe <- html_table(newtable)
print(newframe)
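As a side note, the gsub() step can be avoided by selecting the comment node directly with an XPath comment() test; this is a sketch that assumes the commented-out table sits somewhere inside the #all_expanded_standings wrapper:

library(rvest)
library(xml2)

url <- "https://www.basketball-reference.com/leagues/NBA_2018_standings.html"
com <- read_html(url) %>%
  xml_find_first("//div[@id='all_expanded_standings']//comment()")
standings <- read_html(xml_text(com)) %>%  # re-parse the comment's contents
  html_node("#expanded_standings") %>%
  html_table()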
library(rvest)
library(dplyr)
library(stringr)
library(magrittr)

url <- "https://www.basketball-reference.com/leagues/NBA_2018_standings.html"
css <- "#all_expanded_standings"
webpage <- read_html(url)
print(webpage)
mynode <- html_nodes(webpage, css)

# Print the node to the console; cat() interprets the escaped characters,
# which makes the commented-out table visible.
cat(toString(mynode))
I first tried downloading the bare URL's HTML (before any JavaScript rendering). Strangely, the table data sits inside a comment block; that div is where the "Expanded Standings" table lives.
I had originally used Python and BeautifulSoup to extract the element, remove the comment markers, re-parse the resulting string, and then pull out the td cells. Oddly, the rank sits in a th element rather than a td.
I've seen other posts on how to do this in Java, but sadly I only know R.
I want to get, verbatim, all the content (tags, attributes, values) contained in a tag, including the content of child tags. I thought I could do something like
a = xpathSApply(html, "//span[@class = 'class name']/node()", ????)
But then I realized I don't know of any function that returns the entire content matched by the path, rather than just the attributes or just the text. How would I do this?
Not sure whether it is applicable in your use case, but have you tried working with the xml2 library?

library(xml2)

content <- read_xml(html)
nodes <- xml_find_all(content, xpath)  # or xml_find_first if you only want the first match

From there you can do all kinds of things using xml_text(), xml_attrs(), xml_name(), xml_children(), ...
To really just retrieve the complete content, a <- paste(nodes[[1]]) would do, I guess.
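A small self-contained illustration of that idea (the HTML snippet here is made up for demonstration):

library(xml2)

html <- '<div><span class="class name"><b>bold</b> text <i>italic</i></span></div>'
content <- read_xml(html)
node <- xml_find_first(content, "//span[@class = 'class name']")

xml_text(node)      # "bold text italic" - text only, markup stripped
as.character(node)  # the full verbatim markup, child tags included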
I would like to extract a table (table 4) from the URL "http://www.moneycontrol.com/financials/oilnaturalgascorporation/profit-loss/IP02". The catch is that I have to use RSelenium.
Now here is the code I am using:
library(RSelenium)
library(XML)
URL <- "http://www.moneycontrol.com/financials/oilnaturalgascorporation/profit-loss/IP02"
remDr$navigate(URL)  # remDr is an already-open RSelenium remote driver
doc <- htmlParse(remDr$getPageSource()[[1]])
x <- readHTMLTable(doc)
The above code is not able to extract table 4. However, when I do not use RSelenium, like below, I am able to extract the table easily:
download.file(URL, 'quote.html')
doc <- htmlParse('quote.html')
x <- readHTMLTable(doc, which = 5)
Please let me know the solution, as I have been stuck on this part for a month now. I appreciate your suggestions.
I think it works fine. The table you were able to get using download.file can also be obtained with the following RSelenium code:
readHTMLTable(htmlParse(remDr$getPageSource()[[1]], asText = TRUE), header = TRUE, which = 6)
Hope that helps!
I found the solution. In my case, I had to first navigate to the inner frame (boxBg1) before I could extract the outer HTML and then use the readHTMLTable function. It works fine now. I will post again if I run into a similar issue in the future.
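For anyone hitting the same wall, here is a rough sketch of that frame-switching step; the CSS locator for the boxBg1 frame is hypothetical, so inspect the page to find the real one:

library(RSelenium)
library(XML)

frame <- remDr$findElement(using = "css", value = ".boxBg1 iframe")  # hypothetical locator
remDr$switchToFrame(frame)                   # move focus into the inner frame
doc <- htmlParse(remDr$getPageSource()[[1]])
x <- readHTMLTable(doc)
remDr$switchToFrame(NULL)                    # return focus to the top-level page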
I'm struggling with more or less the same issue. I'm trying to come up with a solution that doesn't use htmlParse; for example (after navigating to the page):
table <- remDr$findElements(using = "tag name", value = "table")
You might have to use CSS or XPath on yours; I'm still working on the next step.
I finally got a table downloaded into a nice little data frame. It seems easy once you get it figured out. Using the help page from the XML package:
library(RSelenium)
library(XML)

u <- 'http://www.w3schools.com/html/html_tables.asp'
doc <- htmlParse(u)
tableNodes <- getNodeSet(doc, "//table")
tb <- readHTMLTable(tableNodes[[1]])
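For comparison, the same static page can be handled entirely with rvest, which skips the XML plumbing (a sketch, assuming the first table on the page is the one you want):

library(rvest)

tb2 <- read_html('http://www.w3schools.com/html/html_tables.asp') %>%
  html_node("table") %>%   # first table on the page
  html_table()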