Google Sheets web scraping: IMPORTXML fails

I have this URL: https://www.bloomberg.com/quote/TAGLTEC:MK
I want to scrape the 0.8225 element into my Google Sheet.
E58 is the cell containing the URL https://www.bloomberg.com/quote/TAGLTEC:MK.
The scraping formula I used was =REPLACE(E58,(E58,"//span[@class='priceText__06f600fa3e']"),1, 3, "") but it fails.

Try this ...
=importxml("https://www.bloomberg.com/quote/TAGLTEC:MK","//section[@class='box main']")
and
=index(flatten(importxml("https://www.bloomberg.com/quote/TAGLTEC:MK","//section[@class='info']")),2,1)
and you will see the exact reason why there is no way to retrieve the data using importxml.

Related

Rvest web scrape returns empty character

I am looking to scrape some data from a chemical database using R, mainly name, CAS Number, and molecular weight for now. However, I am having trouble getting rvest to extract the information I'm looking for. This is the code I have so far:
library(rvest)
library(magrittr)
# Read HTML code from website
# I am using this format because I ultimately hope to pull specific items from several different websites
webpage <- read_html(paste0("https://pubchem.ncbi.nlm.nih.gov/compound/", 1))
# Use CSS selectors to scrape the chemical name
chem_name_html <- webpage %>%
html_nodes(".short .breakword") %>%
html_text()
# Convert the data to text
chem_name_data <- html_text(chem_name_html)
However, when I try to create chem_name_html, R only returns character(0). I am using SelectorGadget to get the CSS selector, but I noticed that SelectorGadget gives me a different node than the Inspector does in Google Chrome. I have tried both ".short .breakword" and ".summary-title short .breakword" in that line of code, but neither gives me what I am looking for.
I have recently run into the same issue using rvest to scrape PubChem. The problem is that the information on the page is rendered using JavaScript as you scroll down the page, so rvest is only getting minimal information from the page.
There are a few workarounds though. The simplest way to get the information that you need into R is using an R package called webchem.
If you are looking up name, CAS number, and molecular weight then you can do something like:
library(webchem)
chem_properties <- pc_prop(1, properties = c('IUPACName', 'MolecularWeight'))
The full list of compound properties that can be extracted through this API can be found in the PubChem PUG REST documentation. Unfortunately there isn't a property through this API to get the CAS number, but webchem gives us another way to query it, using the Chemical Translation Service:
chem_cas <- cts_convert(query = '1', from = 'CID', to = 'CAS')
The second way to get information from the page, which is a bit more robust but not quite as easy to work with, is to grab it from the JSON API.
library(jsonlite)
chem_json <-
read_json(paste0("https://pubchem.ncbi.nlm.nih.gov/rest/pug_view/data/compound/", "1", "/JSON/?response_type=save&response_basename=CID_", "1"))
That command returns a list of lists, for which I had to write a convoluted script to parse out the information I needed. If you are familiar with JSON, you can parse far more information from the page, but not quite everything. For example, the information in sections like Literature, Patents, and Biomolecular Interactions and Pathways will not fully show up in the JSON output.
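To give an idea of that parsing step, here is a minimal sketch of drilling into such a nested list with base R indexing. The structure below is a made-up, heavily simplified stand-in for the real PUG View response, which is much deeper and wider:

```r
# Toy stand-in for the nested list-of-lists shape read_json() returns
# (assumption: real PUG View responses have many more levels).
chem_json <- list(Record = list(
  RecordTitle = "Acetamide",
  Section = list(
    list(TOCHeading = "Names and Identifiers"),
    list(TOCHeading = "Chemical and Physical Properties")
  )
))

# Drill down with $ / [[ ]] indexing
title <- chem_json$Record$RecordTitle

# Collect one field across a list of sub-lists
headings <- vapply(chem_json$Record$Section,
                   function(s) s$TOCHeading, character(1))
```

A real parser does the same thing, just with more levels and with checks for sections that may be absent for a given compound.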
The final and most comprehensive way to get all information from the page is to use something like Scrapy or PhantomJS to render the full html output of the PubChem page, then use rvest to scrape it like you originally intended. This is something that I'm still working on as it is my first time using web scrapers as well.
I'm still a beginner in this realm, but hopefully this helps you a bit.

How can I scrape data from a website within a frame using R?

The following link contains the results of the marathon of Paris: http://www.schneiderelectricparismarathon.com/us/the-race/results/results-marathon.
I want to scrape these results, but the information lies within a frame. I know the basics of scraping with rvest and RSelenium, but I am clueless about how to retrieve the data within such a frame. To get an idea, one of the things I tried was:
url = "http://www.schneiderelectricparismarathon.com/us/the-race/results/results-marathon"
site = read_html(url)
ParisResults = site %>% html_node("iframe") %>% html_table()
ParisResults = as.data.frame(ParisResults)
Any help in solving this problem would be very welcome!
The results are loaded by AJAX from the following URL:
url="http://www.aso.fr/massevents/resultats/ajax.php?v=1460995792&course=mar16&langue=us&version=3&action=search"
table <- url %>%
read_html(encoding="UTF-8") %>%
html_nodes(xpath='//table[@class="footable"]') %>%
html_table()
PS: I don't know what AJAX is exactly, and I only know the basics of rvest.
EDIT: to answer the question in the comment: I don't have a lot of experience in web scraping. If you only use very basic techniques with rvest or XML, you have to understand the web site a little more, and every site has its own structure. For this one, here is how I did it:
As you see, you don't see any results in the source code because they are in an iframe, but when inspecting the code you can see, after "RESULTS OF 2016 EDITION":
class="iframe-xdm iframe-resultats" data-href="http://www.aso.fr/massevents/resultats/index.php?langue=us&course=mar16&version=3"
Now you can use this URL directly: http://www.aso.fr/massevents/resultats/index.php?langue=us&course=mar16&version=2
But you still can't get the results. You can then use Chrome developer tools > Network > XHR. When refreshing the page, you can see that the data is loaded from this URL (when you filter by sex, here F): http://www.aso.fr/massevents/resultats/ajax.php?course=mar16&langue=us&version=2&action=search&fields%5Bsex%5D=F&limiter=&order=
Now you can get the results!
And if you want the second page, etc., you can click on the number of the page, then use the developer tools to see what happens!
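As a side note, query strings like the one above can be assembled programmatically, which helps when looping over pages or categories. This is just a base R sketch; the parameter names are taken from the AJAX URL above, and utils::URLencode handles characters like the brackets in fields[sex]:

```r
# Build the AJAX query string from a named vector of parameters.
base <- "http://www.aso.fr/massevents/resultats/ajax.php"
params <- c(course = "mar16", langue = "us", version = "2",
            action = "search", "fields[sex]" = "F")

# Percent-encode reserved characters (e.g. "[" -> "%5B")
enc <- function(x) utils::URLencode(x, reserved = TRUE)

query <- paste(paste0(vapply(names(params), enc, character(1)), "=",
                      vapply(params, enc, character(1))),
               collapse = "&")
url <- paste0(base, "?", query)
# url now ends in ...fields%5Bsex%5D=F
```

Changing params (or pasting on a page offset) then gives you each page of results without clicking through the site.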

Scraping does not return the desired data

I am trying to get data from the site https://bill.torrentpower.com/. I want to enter the city "Ahmedabad" and service number "3031629" and extract the table which gives the bill details.
My code is simple:
library(RCurl)
a <- postForm("https://bill.torrentpower.com/billdetails.aspx",
"ctl00$cph1$drpCity" = 1,
"ctl00$cph1$txtServiceNo" = "3031629",
.opts = list(ssl.verifypeer = FALSE)
)
write(a, file = "a.html")
When I open the file a.html, I do not see the table containing the bill details. All other details are visible on a.html. My aim is to capture the tablular output as an R object.
The issue here is that the table is generated by JavaScript code after the page has loaded, and hence you will not get the content of the table.
This is a common problem when scraping pages with lots of dynamic content.
A workaround is to simulate a web browser using RSelenium:
http://cran.r-project.org/web/packages/RSelenium/RSelenium.pdf
This will simulate a web browser in your R session, and you can navigate the web pages using various methods (see the user manual for info).
Personally, I find the RSelenium with PhantomJS combination the most useful, since I scrape a lot of JavaScript-heavy pages. Alternatively, if you find the R syntax a bit troublesome, you may use PhantomJS on its own as well: http://phantomjs.org/

Screen scraping actual page not source html with R

I am trying to screen scrape tennis results data (point by point data, not just final result) from this page using R.
http://www.scoreboard.com/au/match/wang-j-karlovic-i-2014/M1mWYtEF/#point-by-point;1
Using the regular R screen scraping functions like readLines(), htmlTreeParse(), etc., I am able to scrape the source HTML for the page, but that does not contain the results data.
Is it possible to scrape all the text from the page, as if I were on the page in my browser and selected all and then copied?
That data is loaded using AJAX from http://d.scoreboard.com/au/x/feed/d_mh_M1mWYtEF_en-au_1, so R will not be able to just load it for you. However, because both use the code M1mWYtEF, you can go directly to the page that has the data you want. Using Chrome's devtools, I was able to see that the page sends a header of X-Fsign: SW9D1eZo that will let you access that page (you get a 401 Unauthorized error otherwise).
Here is R code for getting the html that holds the data you want from your example page:
library(httr)
page_code <- "M1mWYtEF"
linked_page <- paste0("http://d.scoreboard.com/au/x/feed/d_mh_",
page_code, "_en-au_1")
GET(linked_page, add_headers("X-Fsign" = "SW9D1eZo"))

Why is ImportXML not working for a specific field while trying to scrape kickstarter.com?

I am trying to screen scrape funding status of a specific Kickstarter project.
I am using the following formula in my Google spreadsheet; what I am trying to get here is the $ amount of the project's funding status:
=ImportXML("http://www.kickstarter.com/projects/1904431672/trsst-a-distributed-secure-blog-platform-for-the-o","//data[@class='Project942741362']")
It returns #N/A in the cell, with comment:
error: The xPath query did not return any data.
When I try using ImportXML on other parts of the same webpage it seems to work perfectly well. Could someone please point out what I am doing wrong here?
It seems that the tag "data" is not correctly parsed.
A possible workaround:
=REGEXEXTRACT(IMPORTXML("http://...", "//div[@id='pledged']"), "^\S*")
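For reference, the "^\S*" pattern simply grabs the leading run of non-space characters, i.e. the dollar amount at the start of the pledged text. The same extraction in R looks like this; the pledged string below is an assumed example, not the live value from the page:

```r
# Assumed shape of the text inside the #pledged div (hypothetical values)
pledged_text <- "$25,000 pledged of $48,000 goal"

# "^\\S*" matches everything up to the first space
amount <- regmatches(pledged_text, regexpr("^\\S*", pledged_text))
# amount is "$25,000"
```

If the element's text ever starts with something other than the amount, the regex (in Sheets or in R) would need adjusting accordingly.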
