I'm trying to download a csv file of data corresponding to the chart on the below website:
http://vixcentral.com/
If I click on the menu button on the top right of the chart there's an option to download the chart data into a csv.
The issue is that the button seems to generate a download link that only works temporarily, so I'm unable to use a regular downloader such as read_csv() or rio::import() to pull the file into R.
It seems both the chart and the download link are generated by the Highcharts javascript.
Is there any straightforward way to download this data into R by figuring out the link?
Or does it have to be a scraping exercise?
If you right-click on the page and choose 'Inspect Element', then go to the 'Network' tab, you can see the XHR requests the page makes to obtain its data (for example while clicking around the different charts).
You noted that you're interested in the result of http://vixcentral.com/ajax_update/?_=1590762673737.
The number at the end of this URL is the current time as a Unix epoch timestamp in milliseconds. That's why it changes on every request.
There is a little bit of protection against scraping, in the sense that they try to block requests that do not come from their own site. Setting the header X-Requested-With to "XMLHttpRequest" makes it work. You can view the headers used for this request by clicking on it in the 'Inspect Element' Network view of your browser. A number of headers are set, and by removing each one and testing, I found that this is the only one needed for your purpose.
Below reads the data and parses it into an R object using jsonlite.
library(httr)

res <- GET("http://vixcentral.com/ajax_update/?_=1590762673737",
           add_headers("X-Requested-With" = "XMLHttpRequest"))
res_text <- content(res, "text")
jsonlite::fromJSON(res_text)
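Since the trailing number is just a cache-busting timestamp, you don't have to hard-code it. Continuing from the snippet above, a minimal sketch that builds the URL from the current time (the object name vix_data is just a placeholder):

# Current Unix epoch in milliseconds, formatted without scientific notation
ts <- format(round(as.numeric(Sys.time()) * 1000), scientific = FALSE)
url <- paste0("http://vixcentral.com/ajax_update/?_=", ts)

res <- GET(url, add_headers("X-Requested-With" = "XMLHttpRequest"))
vix_data <- jsonlite::fromJSON(content(res, "text"))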
I am trying to scrape this website: https://www.casablanca-bourse.com/bourseweb/Societe-Cote.aspx?codeValeur=12200. The problem is that I only want to extract data from the "indicateurs clès" tab, and I can't find a way to access it in the source code without clicking on it.
Indeed, I can't figure out the URL of this specific tab. I checked the source code and found that there is generated code that changes whenever I click on that tab.
Any suggestions?
Thanks in advance
The problem is that this website uses AJAX to get the table in the "Indicateurs Clès" tab, so the table is requested from the server only when you click on that tab. To scrape the data, you should send the same request to the server yourself. In other words, try to mimic the browser's behavior.
You can do it this way (for Chromium; for other browsers with DevTools it's pretty much similar):
Press F12 to open the DevTools.
Switch to the "Network" tab.
Select the Fetch/XHR filter.
Click on the "Indicateurs Clès" tab on the page.
Inspect the new request(s) you see in the DevTools.
Once you find the request that returns the information you need ("Preview" and "Response"), right-click the request and select "Copy as cURL".
Go to https://curl.trillworks.com/
Select the programming language you're using for scraping.
Paste the cURL command on the left (into the "curl command" textarea).
Copy the code that appears on the right and work with it. In some cases, you might need to inspect the request further and modify it.
In this particular case, the request data contains `__VIEWSTATE` and other info, which is used by the server to send only the data necessary to update the already existing table.
However, you can omit everything but `__EVENTTARGET` (the tab ID) and codeValeur. In that case the server returns the full page XHTML, which includes the whole table. After that, you can parse that table and get everything you need.
I don't know what tech stack you were initially going to use for scraping the website, but here is how you can get the tables with Python requests and BeautifulSoup4:
import requests
from bs4 import BeautifulSoup

params = (
    ('codeValeur', '12200'),
)

data = {
    '__EVENTTARGET': 'SocieteCotee1$LBFicheTech',
}

response = requests.post('https://www.casablanca-bourse.com/bourseweb/Societe-Cote.aspx', params=params, data=data)

# Parse the returned XHTML and extract exactly the data you need
soup = BeautifulSoup(response.content, 'html.parser')
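If you would rather do this from R, the same request can be reproduced with httr and the result parsed with rvest. This is only a sketch under the same assumptions as the Python snippet above (the `__EVENTTARGET` value and codeValeur come from that example, and the table extraction may need tweaking once you see the actual markup):

library(httr)
library(rvest)

# Reproduce the POST the page sends when the tab is clicked
res <- POST(
  "https://www.casablanca-bourse.com/bourseweb/Societe-Cote.aspx",
  query  = list(codeValeur = "12200"),
  body   = list(`__EVENTTARGET` = "SocieteCotee1$LBFicheTech"),
  encode = "form"
)

# Parse the returned XHTML and pull out any tables
page   <- read_html(content(res, "text"))
tables <- html_table(page, fill = TRUE)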
My goal is to get the links to all Kaggle challenges together with their titles. I am using the rvest library, but I don't get very far: the nodes come back empty once I am a few divs in.
I am trying to do it for the first challenge first, and should then be able to transfer that to every entry.
The xpath of the first entry is:
/html/body/div[1]/div[2]/div/div/div[2]/div/div/div[2]/div[2]/div/div/div[2]/div/div/div[1]/a
My idea was to get the link via html_attr( , "href") once I am in the right tag.
My idea is:
library(rvest)
url = "https://www.kaggle.com/competitions"
kaggle_html = read_html(url)
kaggle_text = html_text(kaggle_html)
kaggle_node <- html_nodes(kaggle_html, xpath = "/html/body/div[1]/div[2]/div/div/div[2]/div/div/div[2]/div[2]/div/div/div[2]/div/div/div[1]/a")
html_attr(kaggle_node, "href")
I can't go past a certain div. The following snippet shows the last node I can access:
node <- html_nodes(kaggle_html, xpath="/html/body/div[1]/div[2]/div")
html_attrs(node)
Once I go one step further with html_nodes(kaggle_html,xpath="/html/body/div[1]/div[2]/div/div"), the node will be empty.
I think the issue is that Kaggle uses a dynamic list that expands the further I scroll down.
(I am aware that I can use %>%. I am saving every step so that I am able to access and view them more easily to be able to learn how it properly works.)
I solved the issue. I cannot access the full HTML of the site from R because the list is loaded by a script that expands the table (and thus the HTML) as the user scrolls.
I worked around it by expanding the list manually in the browser, downloading the whole HTML page, and loading the local file.
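For reference, a minimal sketch of that workaround, assuming the manually expanded page was saved as competitions.html; the file name and the href filter are placeholders and will need to match Kaggle's current markup:

library(rvest)

# Read the locally saved, fully expanded page (file name is a placeholder)
kaggle_html <- read_html("competitions.html")

# Assumption: competition links can be filtered by their href pattern;
# adjust the XPath (or use a CSS selector) to match the actual markup
links <- html_nodes(kaggle_html, xpath = "//a[contains(@href, '/c/')]")

data.frame(
  title = html_text(links, trim = TRUE),
  href  = html_attr(links, "href"),
  stringsAsFactors = FALSE
)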
I am trying to download pdf files as follows:
(since this is a commercial site, I had to replace the url, username, and password below)
## login to the site first
library(RSelenium)
RSelenium::checkForServer()
RSelenium::startServer(log = TRUE, invisible = FALSE)
remDr <- remoteDriver(browserName = "chrome")
remDr$open()
remDr$setImplicitWaitTimeout(3000)
remDr$navigate(url) # the url of the login page
remDr$findElement("id", "LoginForm_username")$sendKeysToElement(list("user"))
remDr$findElement("id", "LoginForm_password")$sendKeysToElement(list("pass"))
remDr$findElement("name", "start")$clickElement() ## this is the login button
This is a website that contains data on interactions of firms. Knowing the API, I have figured out the page name for each report I am interested in. On each page there is a "download pdf" button. When I click this button, the site dynamically generates a report in PDF format and returns it with a random name, like "97da08491e3e41447f591c2b668c0602.pdf" (I think it uses wkhtmltopdf for this). I click the button using the following code:
# pp is the name of a link for a given report
remDr$navigate(pp)
Sys.sleep(7) # wait for the page to load
remDr$findElement("id", "download-pdf")$clickElement()
When the "download pdf" button is clicked, the document is generated by the site, and then saved by Chrome. (the random name is different each time and there is no way I can use something like download.file() to get it) This works fine, except that the document is saved with this random name. Rather, I want to capture the pdf that is returned by the site and then save it using a more informative name (I have to do this hundreds of times, so I don't want to have to go through all the pdf's manually in order to find the report on specific firms).
So, my question is: how can I capture a pdf that is dynamically generated-and-returned by a site and then save it under a name of my own choosing?
(I apologize for not being able to provide the links to the site, but this is a proprietary site that I am not allowed to share publicly. However, I expect that this issue might be of use to more people and more sites).
You can manipulate the files in your download folder with R. I would simply list the files:
L <- dir(".", pattern = "\\.pdf$")  # "." should point at your download folder
If needed, you can pick out the most recently downloaded PDF using the info from:
file.info(L)
and then change the file name using:
file.rename(identifiedName, meaningfulName)
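Putting that together, a minimal sketch; the download folder and the target file name are placeholders, and it assumes the PDF you just downloaded is the most recently modified .pdf in that folder:

download_dir <- "~/Downloads"        # placeholder: Chrome's download folder
target_name  <- "firm_A_report.pdf"  # placeholder: the informative name you want

# List all PDFs with their metadata and pick the most recently modified one
pdfs   <- file.info(list.files(download_dir, pattern = "\\.pdf$", full.names = TRUE))
latest <- rownames(pdfs)[which.max(pdfs$mtime)]

# Rename it into something meaningful
file.rename(latest, file.path(download_dir, target_name))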
I am trying to screen scrape tennis results data (point by point data, not just final result) from this page using R.
http://www.scoreboard.com/au/match/wang-j-karlovic-i-2014/M1mWYtEF/#point-by-point;1
Using the regular R scraping functions like readLines(), htmlTreeParse(), etc., I am able to get the page's source HTML, but it does not contain the results data.
Is it possible to scrape all the text from the page, as if I were on the page in my browser and selected all and then copied?
That data is loaded using AJAX from http://d.scoreboard.com/au/x/feed/d_mh_M1mWYtEF_en-au_1, so R will not be able to just load it for you. However, because both use the code M1mWYtEF, you can go directly to the page that has the data you want. Using Chrome's devtools, I was able to see that the page sends a header of X-Fsign: SW9D1eZo that will let you access that page (you get a 401 Unauthorized error otherwise).
Here is R code for getting the html that holds the data you want from your example page:
library(httr)
page_code <- "M1mWYtEF"
linked_page <- paste0("http://d.scoreboard.com/au/x/feed/d_mh_",
page_code, "_en-au_1")
# The X-Fsign header is what authorizes the request (401 without it)
res <- GET(linked_page, add_headers("X-Fsign" = "SW9D1eZo"))
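To actually work with the body of that response, extract it as text and parse it. A rough sketch that continues from res above; it assumes, as stated, that the feed body contains the HTML holding the point-by-point data, and the table extraction will need adjusting once you inspect it:

library(rvest)

# Pull the body out as text and parse the HTML it contains
raw_body <- content(res, "text")
doc <- read_html(raw_body)

# Placeholder: extract any tables; adjust selectors after inspecting the markup
html_table(doc, fill = TRUE)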
There are a number of fairly detailed answers on SO which cover authenticated login to an .aspx site and downloading from it. As a complete n00b, I haven't been able to find a simple explanation of how to get data from a web form.
The following MWE is intended as an example only. And this question is more intended to teach me how to do it for a wider collection of webpages.
Website:
http://data.un.org/Data.aspx?d=SNA&f=group_code%3a101
What I tried (and which obviously failed):
test=read.csv('http://data.un.org/Handlers/DownloadHandler.ashx?DataFilter=group_code:101;country_code:826&DataMartId=SNA&Format=csv&c=2,3,4,6,7,8,9,10,11,12,13&s=_cr_engNameOrderBy:asc,fiscal_year:desc,_grIt_code:asc')
This gives me gobbledygook when I View(test).
Anything that steps me through this or points me in the right direction would be very gratefully received.
The URL you are accessing with read.csv is returning a zipped file. You could download it using httr, say, and write the contents to a temp file:
library(httr)

urlUN <- "http://data.un.org/Handlers/DownloadHandler.ashx?DataFilter=group_code:101;country_code:826&DataMartId=SNA&Format=csv&c=2,3,4,6,7,8,9,10,11,12,13&s=_cr_engNameOrderBy:asc,fiscal_year:desc,_grIt_code:asc"

# Download the zip and write the raw content to disk
response <- GET(urlUN)
dir.create("temp", showWarnings = FALSE)
writeBin(content(response, as = "raw"), "temp/temp.zip")

# Unzip and read the CSV inside
fName <- unzip("temp/temp.zip", list = TRUE)$Name
unzip("temp/temp.zip", exdir = "temp")
read.csv(paste0("temp/", fName))
Alternatively Hmisc has a useful getZip function:
library(Hmisc)
urlUN <- "http://data.un.org/Handlers/DownloadHandler.ashx?DataFilter=group_code:101;country_code:826&DataMartId=SNA&Format=csv&c=2,3,4,6,7,8,9,10,11,12,13&s=_cr_engNameOrderBy:asc,fiscal_year:desc,_grIt_code:asc"
unData <- read.csv(getZip(urlUN))
The links are being generated dynamically. The other problem is the content isn't actually at that link. You're making a request to a (very odd and poorly documented) API which will eventually return with the zip file. If you look in the Chrome dev tools as you click on that link you'll see the message and response headers.
There are a few ways you can solve this. If you know some JavaScript, you can script a headless WebKit instance like Phantom to load these pages, simulate click events, wait for a content response, and then pipe that to something.
Alternatively, you may be able to finagle httr into treating this like a proper RESTful API. I have no idea if that's even remotely possible. :)
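If you want to see from R (rather than the dev tools) what that handler actually sends back, a quick hedged sketch with httr that simply inspects the response status and headers of the same URL used above:

library(httr)

urlUN <- "http://data.un.org/Handlers/DownloadHandler.ashx?DataFilter=group_code:101;country_code:826&DataMartId=SNA&Format=csv&c=2,3,4,6,7,8,9,10,11,12,13&s=_cr_engNameOrderBy:asc,fiscal_year:desc,_grIt_code:asc"

# Inspect the status and headers to see what the handler returns
resp <- GET(urlUN)
status_code(resp)
headers(resp)[["content-type"]]  # shows whether the payload is CSV, a zip, etc.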