I would like to extract the table (table 4) from the URL "http://www.moneycontrol.com/financials/oilnaturalgascorporation/profit-loss/IP02". The catch is that I will have to use RSelenium
Now here is the code I am using:
library(RSelenium)
library(XML)

remDr$navigate(URL)
doc <- htmlParse(remDr$getPageSource()[[1]])
x <- readHTMLTable(doc)
The above code is not able to extract table 4. However, when I do not use RSelenium, as below, I am able to extract the table easily:
download.file(URL, 'quote.html')
doc <- htmlParse('quote.html')
x <- readHTMLTable(doc, which = 5)
Please let me know the solution, as I have been stuck on this part for a month now. I appreciate your suggestions.
I think it works fine. The table you were able to get using download.file can also be obtained with the following RSelenium code:
readHTMLTable(htmlParse(remDr$getPageSource()[[1]], asText = TRUE), header = TRUE, which = 6)
Hope that helps!
I found the solution. In my case, I had to first navigate to the inner frame (boxBg1) before I could extract the outer HTML and then use the readHTMLTable function. It works fine now. I will post again if I run into a similar issue in the future.
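A minimal sketch of what that looks like in code, assuming a running RSelenium session in remDr and that the inner frame really is addressable by the name/id "boxBg1" (both assumptions on my part):

library(RSelenium)
library(XML)

remDr$navigate(URL)
remDr$switchToFrame("boxBg1")                 # enter the inner frame first
doc <- htmlParse(remDr$getPageSource()[[1]])  # page source is now the frame's document
x <- readHTMLTable(doc)                       # the target table should now appear in the list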
I'm struggling with more or less the same issue. I'm trying to come up with a solution that doesn't use htmlParse; for example (after navigating to the page):
table <- remDr$findElements(using = "tag name", value = "table")
You might have to use CSS or XPath on yours; I'm still working on the next step.
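A rough sketch of that idea, assuming a running session in remDr; getElementText() gives the rendered text of each table, which would still need to be parsed into columns:

tables <- remDr$findElements(using = "css selector", value = "table")
raw <- sapply(tables, function(el) el$getElementText()[[1]])   # one text blob per table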
I finally got a table downloaded into a nice little data frame. It seems easy once you get it figured out. Using the help page from the XML package:
library(RSelenium)
library(XML)
u <- 'http://www.w3schools.com/html/html_tables.asp'
doc <- htmlParse(u)
tableNodes <- getNodeSet(doc, "//table")
tb <- readHTMLTable(tableNodes[[1]])
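If the page needs a live browser session, the same pattern should also work on the source returned by RSelenium instead of the raw URL. A sketch, assuming remDr has already navigated to the page:

doc <- htmlParse(remDr$getPageSource()[[1]], asText = TRUE)
tableNodes <- getNodeSet(doc, "//table")
tb <- readHTMLTable(tableNodes[[1]])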
There are two parts to my question, as I explored two methods in this exercise; however, I succeeded in neither. I would greatly appreciate it if someone could help me out.
[PART 1:]
I am attempting to scrape data from a webpage on the Singapore Stock Exchange, https://www2.sgx.com/derivatives/negotiated-large-trade, containing data stored in a table. I have some basic knowledge of scraping data using rvest. However, using the Inspector in Chrome, the HTML hierarchy is much more complex than I expected. I can see that the data I want is hidden under <div class="table-container">, and here's what I've tried:
library(rvest)
library(httr)
library(XML)
SGXurl <- "https://www2.sgx.com/derivatives/negotiated-large-trade"
SGXdata <- read_html(SGXurl)
html_nodes(SGXdata,".table-container")
However, nothing is picked up by the code, and I doubt that I'm using it correctly.
[PART 2:]
I then realized there's a small "download" button on the page which downloads exactly the data file I want in .csv format. So I was thinking of writing some code to mimic the download button, and I found this question, Using R to "click" a download file button on a webpage, but I'm unable to get it to work with some modifications to that code.
There are a few filters on the webpage; mostly I will be interested in downloading data for a particular business day while leaving the other filters blank, so I've tried writing the following function:
library(httr)
library(rvest)
library(purrr)
library(dplyr)
crawlSGXdata = function(date){
  POST("https://www2.sgx.com/derivatives/negotiated-large-trade",
       body = NULL,
       encode = "form",
       write_disk("SGXdata.csv")) -> resfile
  res = read.csv("SGXdata.csv")   # read the file written by write_disk()
  return(res)
}
I intended to put the function input "date" into the "body" argument; however, I was unable to figure out how to do that, so I started off with "body = NULL", assuming it doesn't do any filtering. However, the result is still unsatisfactory. The downloaded file is basically empty, containing only the following error:
Request Rejected
The requested URL was rejected. Please consult with your administrator.
Your support ID is: 16783946804070790400
The content is loaded dynamically from an API call returning JSON. You can find this in the network tab of your browser's dev tools.
The following returns that content. I find the total number of result pages and loop over them, combining the data frame returned from each call into one final data frame containing all results.
library(jsonlite)
url <- 'https://api.sgx.com/negotiatedlargetrades/v1.0?order=asc&orderby=contractcode&category=futures&businessdatestart=20190708&businessdateend=20190708&pagestart=0&pageSize=250'
r <- jsonlite::fromJSON(url)
num_pages <- r$meta$totalPages
df <- r$data
url2 <- 'https://api.sgx.com/negotiatedlargetrades/v1.0?order=asc&orderby=contractcode&category=futures&businessdatestart=20190708&businessdateend=20190708&pagestart=placeholder&pageSize=250'
if(num_pages > 1){
  for(i in seq(1, num_pages - 1)){   # page 0 was already fetched above
    newUrl <- gsub("placeholder", i, url2)
    newdf <- jsonlite::fromJSON(newUrl)$data
    df <- rbind(df, newdf)
  }
}
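To fold this into the date-parameterised function the question asked for, here is a rough sketch (my own assumption, not part of the answer above; it presumes the API accepts a "yyyymmdd" string in the businessdatestart/businessdateend parameters, as in the URLs above):

library(jsonlite)

crawlSGXdata <- function(date) {
  base <- paste0(
    "https://api.sgx.com/negotiatedlargetrades/v1.0?order=asc&orderby=contractcode",
    "&category=futures&businessdatestart=", date,
    "&businessdateend=", date,
    "&pagestart=%d&pageSize=250"
  )
  r <- fromJSON(sprintf(base, 0))        # first page (pagestart = 0)
  df <- r$data
  num_pages <- r$meta$totalPages
  if (num_pages > 1) {
    for (i in seq(1, num_pages - 1)) {   # remaining pages
      df <- rbind(df, fromJSON(sprintf(base, i))$data)
    }
  }
  df
}

# Example usage: crawlSGXdata("20190708")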
I know there are a great many resources and questions that deal with this subject, but I have been trying for days and can't seem to figure it out. I have web-scraped websites before, but this one is causing me problems.
The website: njaqinow.net
What I want scraped: I would like to scrape the table under the "Current Status"->"Pollutants" tab. I would like to have this scraped every time the table is updated so I can use this information inside a shiny app I am creating.
What I have tried: I have tried numerous different approaches but for simplicity I will show my most recent approach:
library("rvest")
url<-"http://www.njaqinow.net"
webpage <- read_html(url)
test <- webpage %>%
  html_node("table") %>%
  html_table()
My guess is that this is way more complicated than I originally thought, because it seems to me that the table is inside a frame. I am not a JavaScript/HTML pro, so I am not entirely sure. Any help/guidance would be greatly appreciated!
I can contribute a solution with RSelenium. I will show you how to navigate to that table and
get its content. For formatting the table content I provide a link to another question, but that won't be
in the scope of this answer.
I think you have two challenges: switching into a frame and switching between frames.
Switching into a frame is done with remDr$switchToFrame().
Switching between frames is discussed here: https://github.com/ropensci/RSelenium/issues/155.
In your case:
remDr$switchToFrame("contents")
...
remDr$switchToFrame(NA)
remDr$switchToFrame("contentsi")
Full code would read:
remDr$navigate("http://www.njaqinow.net")
frame1 <- remDr$findElement("xpath", "//frame[@id = 'contents']")
remDr$switchToFrame(frame1)
remDr$findElement("xpath", "//*[text() = 'Current Status']")$clickElement()
remDr$findElement("xpath", "//*[text() = 'POLLUTANTS']")$clickElement()
remDr$switchToFrame(NA)
remDr$switchToFrame("contentsi")
table <- remDr$findElement("xpath", "//table[@id = 'C1WebGrid1']")
table$getElementText()
For formatting a table you could look here:
scraping table with R using RSelenium
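As a rough alternative to the linked answer, the readHTMLTable pattern used earlier in this thread could be applied to the page source while still switched into the "contentsi" frame (a sketch of my own, not tested against this site):

library(XML)

doc <- htmlParse(remDr$getPageSource()[[1]], asText = TRUE)
tableNode <- getNodeSet(doc, "//table[@id = 'C1WebGrid1']")[[1]]
df <- readHTMLTable(tableNode)   # parse the grid straight into a data frame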
In the last couple of months I used some code to get historical data from Wunderground. The code worked, but as of today they have somehow changed the website. Unfortunately I am not very familiar with HTML. This is the code that worked:
webpage <- read_html("https://www.wunderground.com/history/monthly/LOWW/date/2018-06?dayend=18&monthend=6&yearend=2018&req_city=&req_state=&req_statename=&reqdb.zip=&reqdb.magic=&reqdb.wmo=")
tbls <- html_nodes(webpage, "table")
weather.data <- html_table(tbls)[[2]]
This code selected the second table on the website. Does anyone have an idea why it does not work anymore? Did they somehow change the description of the nodes?
Cheers
I want to scrape the reviews of a room from an Airbnb web page, for example this one: https://www.airbnb.com/rooms/8400275
And this is my code for this task. I used the rvest package and SelectorGadget:
x <- read_html('https://www.airbnb.com/rooms/8400275')
x_1 <- x %>% html_node('#reviews p') %>% html_text() %>% as.character()
Can you help me fix that? Is it possible to do this with the rvest package? (I am not familiar with xpathSApply.)
I assume that you want to extract the comments themselves. Looking at the HTML file, it seems that this is not an easy task, since you have to extract them from within a script node. So, what I tried was this:
1. Reading the HTML. Here I use a connection and readLines to read it as character vectors.
2. Selecting the line that contains the review information.
3. Using str_extract to extract the comments.
For the first two steps, we can also use the rvest or XML package to select the appropriate node (a sketch of that follows after the code below).
url <- "https://www.airbnb.com/rooms/8400275"
con <- file(url)
raw <- readLines(con)
close(con)
comment.regex <- "\"comments\":\".*?\""
comment.line <- raw[grepl(comment.regex, raw)]
require(stringr)
comment <- str_extract_all(comment.line, comment.regex)
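As mentioned above, the first two steps could also be done with rvest instead of readLines. A minimal sketch, assuming the comments still sit inside a script node on the page:

library(rvest)
library(stringr)

url <- "https://www.airbnb.com/rooms/8400275"
comment.regex <- "\"comments\":\".*?\""                               # same pattern as above

scripts <- read_html(url) %>% html_nodes("script") %>% html_text()    # all script nodes as text
comment.line <- scripts[grepl(comment.regex, scripts)]
comment <- str_extract_all(comment.line, comment.regex)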
I'm a beginner in web scraping and I'm not yet familiar with the nomenclature for the problems I'm trying to solve. Nevertheless, I've searched exhaustively for this specific problem and was unsuccessful in finding a solution. If it is already answered somewhere else, I apologize in advance and thank you for your suggestions.
Getting to it. I'm trying to build a script with R that will:
1. Search for specific keywords in a newspaper website;
2. Give me the headlines, dates and contents for the number of results/pages that I desire.
I already know how to post the form for the search and scrape the results from the first page, but I've had no success so far in getting the content from the next pages. To be honest, I don't even know where to start (I've read about RCurl and so on, but it still hasn't made much sense to me).
Below is a partial sample of the code I've written so far (scraping only the headlines of the first page to keep it simple):
library(RCurl)
library(XML)

curl <- getCurlHandle()
curlSetOpt(cookiefile='cookies.txt', curl=curl, followlocation = TRUE)
options(RCurlOptions = list(cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl")))
search=getForm("http://www.washingtonpost.com/newssearch/search.html",
.params=list(st="Dilma Rousseff"),
.opts=curlOptions(followLocation = TRUE),
curl=curl)
results=htmlParse(search)
results=xmlRoot(results)
results=getNodeSet(results,"//div[@class='pb-feed-headline']/h3")
results=unlist(lapply(results, xmlValue))
I understand that I could perform the search directly on the website, inspect the URL for references to the page number or to the number of the news article displayed on each page, and then use a loop to scrape each page.
But please bear in mind that after I learn how to go from page 1 to page 2, 3, and so on, I will try to develop my script to perform more searches with different keywords in different websites, all at the same time, so the solution in the previous paragraph doesn't seem the best to me so far.
If you have any other solution to suggest, I will gladly embrace it. I hope I've managed to state my issue clearly so I can get a share of your ideas and maybe help others who are facing similar issues. I thank you all in advance.
Best regards
First, I'd recommend you use httr instead of RCurl - for most problems it's much easier to use.
r <- GET("http://www.washingtonpost.com/newssearch/search.html",
query = list(
st = "Dilma Rousseff"
)
)
stop_for_status(r)
content(r)
Second, if you look at the URL in your browser, you'll notice that clicking the page number modifies the startat query parameter:
r <- GET("http://www.washingtonpost.com/newssearch/search.html",
query = list(
st = "Dilma Rousseff",
startat = 10
)
)
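Building on that, a rough sketch of walking through several result pages (assuming the site keeps paging in steps of 10 via startat):

library(httr)

# fetch the first five result pages (startat = 0, 10, 20, 30, 40)
pages <- lapply(seq(0, 40, by = 10), function(start) {
  GET("http://www.washingtonpost.com/newssearch/search.html",
      query = list(st = "Dilma Rousseff", startat = start))
})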
Third, you might want to try out my experimental rvest package. It makes it easier to extract information from a web page:
# devtools::install_github("hadley/rvest")
library(rvest)
page <- html(r)
links <- page[sel(".pb-feed-headline a")]
links["href"]
html_text(links)
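For reference, with the rvest API as it later stabilised, a roughly equivalent sketch would be (my assumption, not part of the original answer):

library(rvest)

page <- read_html("http://www.washingtonpost.com/newssearch/search.html?st=Dilma+Rousseff")
links <- html_nodes(page, ".pb-feed-headline a")
html_attr(links, "href")   # article URLs
html_text(links)           # headline text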
I highly recommend reading the SelectorGadget tutorial and using that to figure out what CSS selectors you need.