Scraping data from table - r

I want to extract data from the table present on web page http://www.moneycontrol.com/financials/afenterprises/profit-lossVI/AFE01#AFE01
I don't need the entire table at once, just specific elements.
The XPath for the first element is:
/html/body/center[2]/div/div[1]/div[8]/div[3]/div[2]/div[2]/div[2]/div[1]/table[2]/tbody/tr[6]/td[2]
I wrote the following code:
library(rvest)
library(XML)
FJ <- htmlParse("http://www.moneycontrol.com/financials/afenterprises/profit-lossVI/AFE01#AFE01")
data <- xpathSApply(FJ, "/html/body/center[2]/div/div[1]/div[8]/div[3]/div[2]/div[2]/div[2]/div[1]/table[2]/tbody/tr[6]/td[2]")
print(data)
The output comes out to be NULL.

It looks like your XPath takes a wrong "turn": the tbody element shown by the browser's inspector is inserted into the DOM by the browser and is not present in the raw HTML that htmlParse() sees, so drop it from the path:
xpathSApply(FJ,"/html/body/center[2]/div/div[1]/div[8]/div[3]/div[2]/div[2]/div[2]/div[1]/table[2]/tr[6]/td[2]")
xmlValue(xpathSApply(FJ,"/html/body/center[2]/div/div[1]/div[8]/div[3]/div[2]/div[2]/div[2]/div[1]/table[2]/tr[6]/td[2]")[[1]])
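If you prefer to stay within rvest (which the question already loads), here is a minimal sketch of the same lookup, again with tbody dropped from the XPath:
library(rvest)
page <- read_html("http://www.moneycontrol.com/financials/afenterprises/profit-lossVI/AFE01#AFE01")
# same XPath as above, without the browser-inserted tbody
cell <- html_node(page, xpath = "/html/body/center[2]/div/div[1]/div[8]/div[3]/div[2]/div[2]/div[2]/div[1]/table[2]/tr[6]/td[2]")
html_text(cell)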

Related

Add hyperlinks to cells (and/or rows) of DT table in flexdashboard

I have a large table which I would like to present in interactive (filterable and sortable) form within a flexdashboard. I have achieved this using the DT package. One of the columns contains URLs, which currently are not 'clickable', so users will have to copy and paste into their browser. Is there a way to make these URLs into hyperlinks?
I have tried adding html tags in this format:
<a href="...">Link text</a>
But the tags themselves display in the table, with no hyperlink.
Another related question - as these URLs are taking up a large proportion of the width of the table, is it possible to make an entire row 'clickable', directing users to the URL stored in 'column B' of the table, without displaying 'column B'?
The other answers I have found apply to using DT in JavaScript or shiny, and seem not to match my code.
Thank you :)
You can use the escape = FALSE argument in the datatable() function.
Here is an example:
library(DT)
# the href below is a placeholder URL
df <- data.frame(textLink = '<a href="https://example.com">Link text</a>')
datatable(df, escape = FALSE)
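For a whole column of URLs, as in the question, one option is to build the anchor tags with sprintf() before calling datatable(); the data and column names below are made-up placeholders:
library(DT)
# hypothetical data: a column of raw URLs
mydata <- data.frame(name = c("Site A", "Site B"),
                     url  = c("https://example.com/a", "https://example.com/b"),
                     stringsAsFactors = FALSE)
# wrap each URL in an anchor tag so DT renders it as a clickable link
mydata$link <- sprintf('<a href="%s" target="_blank">%s</a>', mydata$url, mydata$url)
datatable(mydata[, c("name", "link")], escape = FALSE)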

How to format data scraped from the web as a table and read all the data from a paginated table

I am trying to scrape data from the web, specifically from a table that has different filters and pages and I have the following code:
library(rvest)
url.colombia.compra <- "https://colombiacompra.gov.co/tienda-virtual-del-estado-colombiano/ordenes-compra?&number_order=&state=&entity=&tool=IAD%20Software%20I%20-%20Microsoft&date_to=&date_from="
tmp <- read_html(url.colombia.compra)
tmp_2 <- html_nodes(tmp, ".active")
The problem is that this code gives me a list, but I need it formatted as a table, and it only returns data from the first page of the table. How can I extend the code so that it reads the data from all the pages and formats the result as a table?
(A screenshot of the table was attached to the original question.)
I would split this problem into two parts. The first is how to programmatically access each of the 11 pages of this online table.
Since this is a plain HTML table, the "Next" button (Siguiente) takes us to a new page, and if we look at the URL of that page we can see the page number in the query parameters:
...tienda-virtual-del-estado-colombiano/ordenes-compra?page=1&number_order=&state...
We know the pages are numbered starting at 0 (because "Next" takes us to page=1), and the navigation bar shows that there are 11 pages.
We can therefore build one rvest::read_html() call per page by using lapply() and paste0() to substitute the page number into the page= parameter. This lets us access every page of the table.
The second part is rvest::html_table(), which parses a tibble out of each read_html() result:
pages <- lapply(0:11, function(x) {
  data.frame(
    html_table(
      read_html(paste0("https://colombiacompra.gov.co/tienda-virtual-del-estado-colombiano/ordenes-compra?page=",
                       x,
                       "&number_order=&state=&entity=&tool=IAD%20Software%20I%20-%20Microsoft&date_to_=%20&date_from_="))
    )
  )
})
The result is a list of data frames, which we can combine with do.call():
do.call(rbind, pages)
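If the column layout ever differs slightly between pages, dplyr::bind_rows() is a more forgiving alternative to rbind(), and the combined table can be written straight to CSV (the file name is just an example):
library(dplyr)
all_orders <- bind_rows(pages)   # fills missing columns with NA instead of erroring
write.csv(all_orders, "ordenes_compra.csv", row.names = FALSE)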

How do you download data from an API and export it into a nice CSV file to query?

I am trying to figure out how to download data into a nice CSV file that I can analyse.
I am currently looking at WHO data (the API URL is in the code below). Following the documentation, I get output like this:
test_data <- jsonlite::parse_json(url("http://apps.who.int/gho/athena/api/GHO/WHS6_102.json?profile=simple"))
head(test_data)
This gives me a rather messy list of lists of lists, which is not easy to analyse.
How could I clean it up so that, from what parse_json returns, I keep only the information from dim (such as REGION, YEAR and COUNTRY) together with the values from the Value column? I would like to turn this into a nice data frame/CSV file so I can more easily understand what is happening.
Can anyone give any advice?
jsonlite::fromJSON gives you the data in a better format, and the third element of the list is where the main data is:
url <- 'https://apps.who.int/gho/athena/api/GHO/WHS6_102.json?profile=simple'
tmp <- jsonlite::fromJSON(url)
data <- tmp[[3]]
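From there, a hedged sketch of the tidy-up and CSV export, assuming the third element comes back as a data frame (which fromJSON usually gives); the column names in the commented line (dim.REGION, dim.YEAR, dim.COUNTRY, Value) are assumptions about the "simple" profile's layout, so inspect the str() output before selecting them:
library(jsonlite)
url  <- 'https://apps.who.int/gho/athena/api/GHO/WHS6_102.json?profile=simple'
data <- fromJSON(url)[[3]]
flat <- flatten(data)   # un-nests data-frame columns such as dim.*
str(flat)               # check which columns actually exist
# keep <- flat[, c("dim.REGION", "dim.YEAR", "dim.COUNTRY", "Value")]  # if present
write.csv(flat, "who_whs6_102.csv", row.names = FALSE)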

Scrape data from flash page using rvest

I am trying to scrape data from this page:
http://www.atpworldtour.com/en/tournaments/brisbane-international-presented-by-suncorp/339/2016/match-stats/r975/f324/match-stats?
If I try to scrape the name of the players using the css selector and the usual rvest syntax:
names <- read_html("http://www.atpworldtour.com/en/tournaments/brisbane-international-presented-by-suncorp/339/2016/match-stats/r975/f324/match-stats?") %>%
html_nodes(".scoring-player-name") %>% sapply(html_text)
everything goes well.
Unfortunately, if I try to scrape the statistics below (first serve pts won, ..) using the selector .stat-breakdown span, I am not able to retrieve any data.
I know rvest is generally not recommended for scraping dynamically generated pages, but I don't understand why some of the data is scraped and some is not.
I don't use rvest. If you follow the code below you should get a single string which you can turn into a data frame by splitting on the separators : and ,.
This script tag also contains more information than is displayed in the page's UI.
I could also try RSelenium, but I need to get to my other PC, so I will let you know whether RSelenium works for me.
library(XML)
library(RCurl)
library(stringr)
url <- "http://www.atpworldtour.com/en/tournaments/brisbane-international-presented-by-suncorp/339/2016/match-stats/r975/f324/match-stats?"
url2 <- getURL(url)
parsed <- htmlParse(url2)
# pull the match-stats data embedded in the page's script tag
step1 <- xpathSApply(parsed, "//script[@id='matchStatsData']", xmlValue)
# strip some unwanted characters
step2 <- str_replace_all(step1, "\r\n", "")
step3 <- str_replace_all(step2, "\t", "")
step4 <- str_replace_all(step3, "[\\[\\]{}\"]", "")
The output is then one long string (shown as a screenshot in the original answer).
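A hedged sketch of the final step the answer alludes to, assuming each field in the cleaned string is a key:value pair separated by commas and that the values themselves contain no extra ':' (worth checking against the real output):
# split the cleaned string into key:value pairs, then into two columns
pairs <- strsplit(step4, ",")[[1]]
kv    <- strsplit(pairs, ":")
stats <- data.frame(field = sapply(kv, `[`, 1),
                    value = sapply(kv, `[`, 2),
                    stringsAsFactors = FALSE)
head(stats)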

readHTMLTable function not able to extract the html table

I would like to extract the table (table 4) from the URL "http://www.moneycontrol.com/financials/oilnaturalgascorporation/profit-loss/IP02". The catch is that I will have to use RSelenium
Now here is the code I am using:
remDr$navigate(URL)
doc <- htmlParse(remDr$getPageSource()[[1]])
x <- readHTMLTable(doc)
The above code is not able to extract table 4. However, when I do not use RSelenium, as below, I am able to extract the table easily:
download.file(URL, 'quote.html')
doc <- htmlParse('quote.html')
x <- readHTMLTable(doc, which = 5)
Please let me know the solution, as I have been stuck on this part for a month now. I appreciate your suggestions.
I think it works fine. The table you were able to get using download.file can also be retrieved with RSelenium using the following code:
readHTMLTable(htmlParse(remDr$getPageSource()[[1]], asText = TRUE), header = TRUE, which = 6)
Hope that helps!
I found the solution. In my case, I had to navigate to the inner frame (boxBg1) first before I could extract the outer HTML and then use the readHTMLTable function. It works fine now. I will post again if I run into a similar issue in the future.
I'm struggling with more or less the same issue. I'm trying to come up with a solution that doesn't use htmlParse, for example (after navigating to the page):
table <- remDr$findElements(using = "tag name", value = "table")
You might have to use a CSS selector or XPath for yours; I'm still working on the next step.
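One way that approach could continue, as a hedged sketch: grab just the matched table's own HTML via getElementAttribute("outerHTML") and parse only that fragment, so readHTMLTable still does the final work but on a single table rather than the whole page. Which list index holds the table you want is an assumption to verify.
library(XML)
# 'table' is the list returned by findElements() above; index 4 is only a guess
tbl_html <- table[[4]]$getElementAttribute("outerHTML")[[1]]
x <- readHTMLTable(htmlParse(tbl_html, asText = TRUE), header = TRUE)[[1]]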
I finally got a table downloaded into a nice little data frame. It seems easy once you have it figured out. Using the help page from the XML package:
library(RSelenium)
library(XML)
u <- 'http://www.w3schools.com/html/html_tables.asp'
doc <- htmlParse(u)
tableNodes <- getNodeSet(doc, "//table")
tb <- readHTMLTable(tableNodes[[1]])
