Web-Scraping with rvest doesn't work correctly - r

I want to scrape the reviews of a room from an Airbnb page. For example, from this page: https://www.airbnb.com/rooms/8400275
And this is my code for this task. I used the rvest package and SelectorGadget:
library(rvest)
x <- read_html('https://www.airbnb.com/rooms/8400275')
x_1 <- x %>% html_node('#reviews p') %>% html_text() %>% as.character()
Can you help me fix this? Is it possible with the rvest package? (I am not familiar with xpathSApply.)

I assume that you want to extract the comment text itself. Looking at the HTML file, it seems that this is not an easy task, since you have to extract it from within a script node. So, what I tried was this:
1. Read the HTML. Here I use a connection and readLines to read it in as character vectors.
2. Select the line that contains the review information.
3. Use str_extract_all to extract the comments.
For the first two steps, we could also use the rvest or XML package to select the appropriate node.
url <- "https://www.airbnb.com/rooms/8400275"

# 1. read the raw HTML, one element per line
con <- file(url)
raw <- readLines(con)
close(con)

# 2. keep the line(s) containing the embedded review data
comment.regex <- "\"comments\":\".*?\""
comment.line <- raw[grepl(comment.regex, raw)]

# 3. extract every "comments":"..." match
library(stringr)
comment <- str_extract_all(comment.line, comment.regex)
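The matches still carry the "comments":" prefix and the closing quote; a small follow-up sketch (assuming that format) strips them off:
# assumption: each match looks like "comments":"...review text..."
comments <- gsub('^"comments":"|"$', '', comment[[1]])
head(comments)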

Related

Basic XML R package question - how to return other attributes for matching entries?

I've downloaded an XML database (Cellosaurus - https://web.expasy.org/cellosaurus/) and I'm trying to use the XML package in R to find all misspellings of a cell line name and return the misspelling and accession.
I've never used XML or XPath expressions before and I'm having real difficulties, so I also hope I've used the correct terminology in my question...
I've loaded the database like so:
doc <- XML::xmlInternalTreeParse(file)
and I can see an example entry which looks like this:
<cell-line category="Cancer cell line">
  <accession-list>
    <accession type="primary">CVCL_6774</accession>
  </accession-list>
  <name-list>
    <name type="identifier">DOV13</name>
  </name-list>
  <comment-list>
    <comment category="Misspelling"> DOR 13; In ArrayExpress E-MTAB-2706, PubMed=25485619 and PubMed=25877200 </comment>
  </comment-list>
</cell-line>
I think I've managed to pull out all of the misspellings (which is slightly useful already):
mispelt <- XML::getNodeSet(doc, "//comment[@category=\"Misspelling\"]")
but now I have no idea how to get the accession associated with each misspelling. Perhaps there's a different function I should be using?
Can anyone help me out or point me towards a simple XML R package tutorial please?
It's difficult to help with an incomplete example, but the basic idea is to navigate up the tree structure to get to the data you want. I've used the more current xml2 package, but the same idea should hold for XML. For example:
library(xml2)
xx <- read_xml("cell.xml")
nodes <- xml_find_all(xx, "//comment[@category=\"Misspelling\"]")
xml_find_first(nodes, "../../accession-list/accession") |> xml_text()
# [1] "CVCL_6774"
It's not clear if you have multiple comments or how your data is structured. You may need to lapply or purrr::map the second node selector over the results of the first if you have multiple nodes.
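For instance, a minimal sketch (my assumption: several misspelling comments spread over several cell-line entries) that pairs each misspelling with its accession:
library(xml2)
nodes <- xml_find_all(xx, "//comment[@category=\"Misspelling\"]")
rows <- lapply(nodes, function(n) {
  data.frame(
    misspelling = xml_text(n),
    accession   = xml_text(xml_find_first(n, "../../accession-list/accession")),
    stringsAsFactors = FALSE
  )
})
do.call(rbind, rows)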

Using R to mimic “clicking” a download file button on a webpage

There are two parts to my question, as I explored two methods in this exercise, but I succeeded with neither. It would be greatly appreciated if someone could help me out.
[PART 1:]
I am attempting to scrape data from a page on the Singapore Stock Exchange, https://www2.sgx.com/derivatives/negotiated-large-trade, containing data stored in a table. I have some basic knowledge of scraping data using rvest. However, using the Inspector in Chrome, the HTML hierarchy is much more complex than I expected. I can see that the data I want is hidden under <div class="table-container">, and here's what I've tried:
library(rvest)
library(httr)
library(XML)
SGXurl <- "https://www2.sgx.com/derivatives/negotiated-large-trade"
SGXdata <- read_html(SGXurl)
html_nodes(SGXdata, ".table-container")
However, nothing is picked up by the code, and I doubt I'm using it correctly.
[PART 2:]
Then I realized there's a small "download" button on the page which downloads exactly the data file I want in .csv format. So I was thinking of writing some code to mimic the download button, and I found this question: Using R to "click" a download file button on a webpage. But I'm unable to get it to work with some modifications to that code.
There are a few filters on the webpage; mostly I'm interested in downloading data for a particular business day while leaving the other filters blank, so I've tried writing the following function:
library(httr)
library(rvest)
library(purrr)
library(dplyr)
crawlSGXdata <- function(date){
  resfile <- POST("https://www2.sgx.com/derivatives/negotiated-large-trade",
                  body = NULL,
                  encode = "form",
                  write_disk("SGXdata.csv", overwrite = TRUE))
  res <- read.csv("SGXdata.csv")
  return(res)
}
I intended to put the function input "date" into the body argument, but I was unable to figure out how to do that, so I started with body = NULL on the assumption that it does no filtering. However, the result is still unsatisfactory. The downloaded file is basically empty apart from the following error:
Request Rejected
The requested URL was rejected. Please consult with your administrator.
Your support ID is: 16783946804070790400
The content is loaded dynamically from an API call returning JSON. You can find it in the network tab of your browser's dev tools.
The following returns that content. I find the total number of pages of results, then loop, combining the data frame returned from each call into one final data frame containing all the results.
library(jsonlite)

# first request: page 0 also reports the total number of pages
url <- 'https://api.sgx.com/negotiatedlargetrades/v1.0?order=asc&orderby=contractcode&category=futures&businessdatestart=20190708&businessdateend=20190708&pagestart=0&pageSize=250'
r <- jsonlite::fromJSON(url)
num_pages <- r$meta$totalPages
df <- r$data

# template for the remaining pages; "placeholder" is swapped for the page number
url2 <- 'https://api.sgx.com/negotiatedlargetrades/v1.0?order=asc&orderby=contractcode&category=futures&businessdatestart=20190708&businessdateend=20190708&pagestart=placeholder&pageSize=250'
if(num_pages > 1){
  for(i in seq(1, num_pages - 1)){  # page 0 already fetched above
    newUrl <- gsub("placeholder", i, url2)
    newdf <- jsonlite::fromJSON(newUrl)$data
    df <- rbind(df, newdf)
  }
}
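To get the date-parameterised function asked for in the question, the business date can be substituted into the query string. A sketch under that assumption (the API appears to take dates as YYYYMMDD strings, as in the URL above):
crawlSGXdata <- function(date){  # date as a "YYYYMMDD" string, e.g. "20190708"
  base <- paste0("https://api.sgx.com/negotiatedlargetrades/v1.0?order=asc&orderby=contractcode",
                 "&category=futures&businessdatestart=", date,
                 "&businessdateend=", date, "&pagestart=%d&pageSize=250")
  r <- jsonlite::fromJSON(sprintf(base, 0))
  df <- r$data
  if(r$meta$totalPages > 1){
    for(i in seq(1, r$meta$totalPages - 1)){
      df <- rbind(df, jsonlite::fromJSON(sprintf(base, i))$data)
    }
  }
  df
}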

what's wrong with my R code?

I am struggling to parse content from HTML using htmlTreeParse and XPath.
Below is the link from which I need to extract the list of "most valuable brands" and create a data frame out of it.
http://www.forbes.com/powerful-brands/list/#tab:rank
As a first step towards building the table, I am trying to extract the list of brands (Apple, Google, Microsoft, etc.) with the code below:
library(XML)
library(RCurl)
htmlContent <- getURL("http://www.forbes.com/powerful-brands/list/#tab:rank", ssl.verifypeer = FALSE)
htmlParsed <- htmlTreeParse(htmlContent, useInternal = TRUE)
output <- xpathSApply(htmlParsed, "/html/body/div/div/div/table[@id='the_list']/tbody/tr/td[@class='name']", xmlValue)
But it returns NULL, and I am not able to find my mistake. "/html/body/div/div/div/table[@id='the_list']/thead/tr/th" works correctly, returning ("", "Rank", "brand", etc.).
This means the path up to the table is correct, but I am not able to understand what goes wrong after that.
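One likely culprit (an assumption on my part, since no answer is recorded here): browsers insert a <tbody> element that is often absent from the raw HTML source, so XPath expressions copied from dev tools can fail on the parsed document. A sketch that skips tbody:
# assumption: the rows sit directly under the table element in the raw source
output <- xpathSApply(htmlParsed, "//table[@id='the_list']//td[@class='name']", xmlValue)
If that still returns NULL, the rows are probably injected by JavaScript and won't appear in the static source at all.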

Scrape data from flash page using rvest

I am trying to scrape data from this page:
http://www.atpworldtour.com/en/tournaments/brisbane-international-presented-by-suncorp/339/2016/match-stats/r975/f324/match-stats?
If I try to scrape the name of the players using the css selector and the usual rvest syntax:
library(rvest)
names <- read_html("http://www.atpworldtour.com/en/tournaments/brisbane-international-presented-by-suncorp/339/2016/match-stats/r975/f324/match-stats?") %>%
  html_nodes(".scoring-player-name") %>%
  sapply(html_text)
everything goes well.
Unfortunately, if I try to scrape the statistics below (first serve pts won, ...) using the selector .stat-breakdown span, I am not able to retrieve any data.
I know rvest is generally not recommended for scraping dynamically created pages, but I don't understand why some data are scraped and some are not.
I don't use rvest. If you follow the code below you should get to the format shown in the screenshot from the original answer: basically one string, which you could transform to a data frame based on the : and , separators.
This tag also contains more information than is displayed in the UI of the webpage.
I can also try RSelenium, but I need to get my other PC first, so I'll let you know if RSelenium works for me.
library(XML)
library(RCurl)
library(stringr)
url <- "http://www.atpworldtour.com/en/tournaments/brisbane-international-presented-by-suncorp/339/2016/match-stats/r975/f324/match-stats?"
url2 <- getURL(url)
parsed <- htmlParse(url2)
# extract the match stats data from the script tag
step1 <- xpathSApply(parsed, "//script[@id='matchStatsData']", xmlValue)
# remove some unwanted characters
step2 <- str_replace_all(step1, "\r\n", "")
step3 <- str_replace_all(step2, "\t", "")
step4 <- str_replace_all(step3, '[][{}"]', "")
The output is then one long string of the match statistics (shown as a screenshot in the original answer).
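Since the script tag seems to hold JSON (hence the braces and quotes being stripped above), an alternative sketch is to parse it directly rather than cleaning it with regexes. This is an assumption; I have not verified that the tag contains valid JSON:
library(jsonlite)
# assumption: the matchStatsData script tag is valid JSON
stats <- fromJSON(step1)
str(stats)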

readHTMLTable function not able to extract the html table

I would like to extract a table (table 4) from the URL http://www.moneycontrol.com/financials/oilnaturalgascorporation/profit-loss/IP02. The catch is that I have to use RSelenium.
Now here is the code I am using:
library(RSelenium)
library(XML)
# assumes an active RSelenium session in remDr, with URL defined as above
remDr$navigate(URL)
doc <- htmlParse(remDr$getPageSource()[[1]])
x <- readHTMLTable(doc)
The above code is not able to extract table 4. However, when I do not use RSelenium, like below, I am able to extract the table easily:
download.file(URL, 'quote.html')
doc <- htmlParse('quote.html')
x <- readHTMLTable(doc, which = 5)
Please let me know the solution, as I have been stuck on this part for a month now. I appreciate your suggestions.
I think it works fine. The table you were able to get using download.file can also be obtained with the following RSelenium code:
readHTMLTable(htmlParse(remDr$getPageSource()[[1]], asText = TRUE), header = TRUE, which = 6)
Hope that helps!
I found the solution. In my case, I had to first navigate to the inner frame (boxBg1) before I could extract the outer HTML and then use the readHTMLTable function. It works fine now. I will post again in case I run into a similar issue in the future.
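For reference, a minimal sketch of that frame step (my assumptions: boxBg1 identifies the frame element and remDr is an active RSelenium session):
# hypothetical sketch: switch into the inner frame, then parse its source
frame <- remDr$findElement(using = "class name", value = "boxBg1")
remDr$switchToFrame(frame)
doc <- htmlParse(remDr$getPageSource()[[1]])
x <- readHTMLTable(doc, which = 5)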
I'm struggling with more or less the same issue. I'm trying to come up with a solution that doesn't use htmlParse; for example (after navigating to the page):
table <- remDr$findElements(using = "tag name", value = "table")
You might have to use css or xpath on yours; the next step is something I'm still working on.
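One possible next step (a sketch, assuming an active session; the rendered text still needs parsing into columns):
# hypothetical: pull the rendered text of the first table element
txt <- table[[1]]$getElementText()[[1]]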
I finally got a table downloaded into a nice little data frame. It seems easy once you get it figured out. Using the help page from the XML package:
library(RSelenium)
library(XML)
u <- 'http://www.w3schools.com/html/html_tables.asp'
doc <- htmlParse(u)
tableNodes <- getNodeSet(doc, "//table")
tb <- readHTMLTable(tableNodes[[1]])
