I'd like to apologise in advance for the lack of a reproducible example. The data my script runs on are not live right now and are confidential as well.
I want to make a script that finds all links on a certain page. The script works as follows:
* find homepage html to start with
* find all urls on this homepage
* open these urls with Selenium
* save the html of each page in a list
* repeat this (find urls, open urls, save html)
The workhorse of this script is the following function:
function(listofhtmls) {
  urls <- lapply(listofhtmls, scrape)   # extract hrefs from each saved page
  urls <- lapply(urls, clean)           # keep only the urls I need
  urls <- unlist(urls)
  urls <- urls[!duplicated(urls)]       # safe even when there are no duplicates
  urls <- paste0("base_url", urls)      # "base_url" is a placeholder for the site root
  html <- lapply(urls, savesource)      # open each url with Selenium and save its html
  result <- list(html, urls)
  return(result)
}
URLs are scraped, cleaned (I don't need all of them) and duplicates are removed.
All of this works fine for most pages but sometimes I get a strange error while using this function:
Error: '' does not exist in current working directory.
Called from: check_path(path)
I don't see any link between the working directory and the parsing that's going on. I'd like to resolve this error as it's blocking the rest of my script at the moment. Thanks in advance, and once again apologies for not providing a reproducible example.
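For context, that error message comes from xml2: when read_html() is given a string that is neither a url nor literal html, it treats it as a local file path and check_path() fails, which is the only connection to the working directory. Below is a minimal sketch that reproduces the same message, assuming one of the cleaned urls ends up empty before being passed on; the guard at the end is only an illustration, not code from the script above:

library(xml2)

## Reproducing the message in isolation: an empty string is treated as a file path.
## read_html("")
## Error: '' does not exist in current working directory ...

## Possible guard (an assumption): drop empty or missing hrefs before pasting the base url.
urls <- c("/news/article-1", "", NA, "/news/article-2")
urls <- urls[!is.na(urls) & nzchar(urls)]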
This seems like a simple problem, but I've been struggling with it for a few days. What follows is a minimal working example rather than the actual problem.
This question seemed similar, but I was unable to use its answer to solve my problem.
In a browser, I go to this url, and click on [Search] (no need to make any choices from the lists), and then on [Download Results] (choosing, for example, the Xlsx option). The file then downloads.
To automate this in R I have tried:
library(rvest)

url1 <- "https://secure.gamblingcommission.gov.uk/PublicRegister/Search"
sesh1 <- html_session(url1)
form1 <- html_form(sesh1)[[1]]
subform <- submit_form(sesh1, form1)
Using Chrome Developer tools I find the url being used to initiate the download, so I try:
url2 <- "https:/secure.gamblingcommission.gov.uk/PublicRegister/Search/Download"
res <- GET(url = url2, query = list(format = "xlsx"))
However this does not download the file:
> res$content
raw(0)
I also tried
download.file(url = paste0(url2, "?format=xlsx") , destfile = "down.xlsx", mode = "wb")
But this downloads nothing:
> Content type '' length 0 bytes
> downloaded 0 bytes
Note that, in the browser, pasting url2 and adding the format query does initiate the download (after doing the search from url1).
I thought that I should somehow be using the session info from the initial code block to do the download, but so far I can't see how.
Thanks in advance for any help!
You are almost there, and your intuition about using the session info is correct.
You just need rvest::jump_to to navigate to the second url within the same session (so the cookies from the submitted search are kept, which is presumably why the standalone GET returned nothing) and then write the response content to disk:
library(rvest)

url1 <- "https://secure.gamblingcommission.gov.uk/PublicRegister/Search"
sesh1 <- html_session(url1)
form1 <- html_form(sesh1)[[1]]
subform <- submit_form(sesh1, form1)
url2 <- "https://secure.gamblingcommission.gov.uk/PublicRegister/Search/Download"

#### The above is your original code - below is the additional code you need:

download <- jump_to(subform, paste0(url2, "?format=xlsx"))
writeBin(download$response$content, "down.xlsx")
There are two parts to my question, as I explored two methods in this exercise, but I succeeded with neither. I'd greatly appreciate it if someone could help me out.
[PART 1:]
I am attempting to scrape data from a webpage on the Singapore Stock Exchange, https://www2.sgx.com/derivatives/negotiated-large-trade, which stores its data in a table. I have some basic knowledge of scraping data with rvest. However, using the Inspector in Chrome, the html hierarchy is much more complex than I expected. I can see that the data I want is hidden under <div class="table-container">, and here's what I've tried:
library(rvest)
library(httr)
library(XML)

SGXurl <- "https://www2.sgx.com/derivatives/negotiated-large-trade"
SGXdata <- read_html(SGXurl)
html_nodes(SGXdata, ".table-container")
However, nothing is picked up by the code and I doubt I'm using it correctly.
[PART 2:]
Then I realized that there's a small "download" button on the page which downloads exactly the data file I want in .csv format. So I was thinking of writing some code to mimic the download button, and I found this question, Using R to "click" a download file button on a webpage, but I'm unable to get it to work even with some modifications to that code.
There are a few filters on the webpage; mostly I will be interested in downloading data for a particular business day while leaving the other filters blank, so I've tried writing the following function:
library(httr)
library(rvest)
library(purrr)
library(dplyr)
crawlSGXdata <- function(date) {
  POST("https://www2.sgx.com/derivatives/negotiated-large-trade",
       body = NULL,
       encode = "form",
       write_disk("SGXdata.csv")) -> resfile
  res <- read.csv("SGXdata.csv")
  return(res)
}
I intended to put the function input "date" into the "body" argument, but I was unable to figure out how to do that, so I started with "body = NULL", assuming it would not apply any filtering. However, the result is still unsatisfactory: the downloaded file is basically empty apart from the following error:
Request Rejected
The requested URL was rejected. Please consult with your administrator.
Your support ID is: 16783946804070790400
The content is loaded dynamically from an API call returning json. You can find this in the network tab via dev tools.
The following returns that content. I find the total number of pages of results and loop over the remaining pages, combining the data frame returned by each call into one final data frame containing all results.
library(jsonlite)

url <- 'https://api.sgx.com/negotiatedlargetrades/v1.0?order=asc&orderby=contractcode&category=futures&businessdatestart=20190708&businessdateend=20190708&pagestart=0&pageSize=250'
r <- jsonlite::fromJSON(url)
num_pages <- r$meta$totalPages
df <- r$data

url2 <- 'https://api.sgx.com/negotiatedlargetrades/v1.0?order=asc&orderby=contractcode&category=futures&businessdatestart=20190708&businessdateend=20190708&pagestart=placeholder&pageSize=250'

# pagestart is zero-indexed and page 0 is already in df, so fetch pages 1 .. num_pages - 1
if (num_pages > 1) {
  for (i in seq(1, num_pages - 1)) {
    newUrl <- gsub("placeholder", i, url2)
    newdf <- jsonlite::fromJSON(newUrl)$data
    df <- rbind(df, newdf)
  }
}
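To handle the business-date filter from PART 2, the same request can be wrapped in a function that substitutes the date into the query string. The sketch below is only an illustration built on the URL above, and it assumes the date is supplied in yyyymmdd format:

library(jsonlite)

crawlSGXdata <- function(date) {
  base <- paste0("https://api.sgx.com/negotiatedlargetrades/v1.0",
                 "?order=asc&orderby=contractcode&category=futures",
                 "&businessdatestart=", date, "&businessdateend=", date,
                 "&pagestart=%d&pageSize=250")
  r <- jsonlite::fromJSON(sprintf(base, 0))      # first page (pagestart is zero-indexed)
  df <- r$data
  num_pages <- r$meta$totalPages
  if (!is.null(num_pages) && num_pages > 1) {
    for (i in seq(1, num_pages - 1)) {
      df <- rbind(df, jsonlite::fromJSON(sprintf(base, i))$data)
    }
  }
  df
}

# For example: sgx_20190708 <- crawlSGXdata("20190708")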
I'm trying to get a list of all the XML documents in a web server directory (and all its subdirectories).
I've tried these examples:
One:
library(XML)
url <- "https://cmgds.marine.usgs.gov/metadata/pcmsc/"
getHTMLLinks(url)
Returns: character(0) Warning message: XML content does not seem to be XML
Two:
readHTMLTable(url)
Returns the same error.
I've tried other sites as well, like those included in the examples. I saw some SO questions (example) about this error saying to change https to http. When I do that I get Error: failed to load external entity.
Is there a way I can get a list of all the XML files at that URL and all the subdirectories using R?
To get the raw html from the page:
require(rvest)
url <- "https://cmgds.marine.usgs.gov/metadata/pcmsc/"
html <- read_html(url)
Then, we'll get all the links using html_nodes. The names are truncated, so we need to get the href attribute rather than just using html_table().
data <- html %>% html_nodes("a") %>% html_attr('href')
I am learning Python (using 3.5). I realize I will probably take a bit of heat for posting my question. Here goes: I have literally reviewed several hundred posts, help docs, etc., all in an attempt to construct the code I need. No luck thus far. I hope someone can help me. I have a set of URLs, say 18 or more; only two are illustrated here:
[1] "http://www.senate.mo.gov/media/15info/Chappelle-Nadal/releases/111915.html"
[2] "http://www.senate.mo.gov/media/15info/Chappelle-Nadal/releases/092215.htm"
I need to scrape all the data (text) behind each url and write it out to individual text files (one per URL) for future topic-model analysis. Right now, I pull in the urls through R using rvest. I then take each url (one at a time, by code) into Python and do the following:
from urllib.request import urlopen
from bs4 import BeautifulSoup

soup = BeautifulSoup(urlopen('http://www.senate.mo.gov/media/14info/chappelle-nadal/Columns/012314-Condensed.html').read())
txt = soup.find('div', {'class': 'body'})
print(soup.get_text())
# print(soup.prettify()) not much help

# store the info in an object, then write out the object
test = soup.get_text()

# below does write a file from the BeautifulSoup text
open_file = open('23Jan2014cplNadal1.txt', 'w')
open_file.write(test)
open_file.close()
The above gets me partially to my target. It leaves me just a little clean-up regarding the text, but that's okay. The problem is that it is labor intensive.
Is there a way to:
1. Write a clean text file (without invisibles, etc.) from R with all the listed urls?
2. For Python 3.5: take all the urls, once they are in a clean single file (one url per line), and have some iterative process retrieve the text behind each url and write out a text file for each URL's data (text) to a location on my hard drive?
I have to do this process for approximately 1000 state-level senators. Any help or direction is greatly appreciated.
Edit to original: Thank you so much all. To N. Velasquez: I tried the following:
urls <- c("http://www.senate.mo.gov/media/14info/Chappelle-Nadal/releases/120114.html",
          "http://www.senate.mo.gov/media/14info/Chappelle-Nadal/releases/110614.htm")

for (url in urls) {
  download.file(url, destfile = basename(url), method = "curl", mode = "w", extra = "-k")
}
HTML files are then written out to my working directory. However, is there a way to write out text files instead of html files? I've read the download.file documentation and can't figure out a way to produce individual text files. Regarding the suggestion of a for loop: is what I illustrate above what you meant for me to attempt? Thank you!
The answer for 1 is: Sure!
The following code will loop you through the html list and export atomic TXTs, as per your request.
Note that with rvest and html_node() you could get a much more structured dataset, with recurring parts of the html stored separately (header, office info, main body, URL, etc.).
library(rvest)

urls <- c("http://www.senate.mo.gov/media/15info/Chappelle-Nadal/releases/111915.html",
          "http://www.senate.mo.gov/media/15info/Chappelle-Nadal/releases/092215.htm")

ht <- list()
for (i in 1:length(urls)) {
  ht[[i]] <- html_text(html_node(read_html(urls[i]), xpath = '//*[@id="mainContent"]'), trim = TRUE)
  ht[[i]] <- gsub("[\r\n]", "", ht[[i]])   # strip stray carriage returns and newlines
  writeLines(ht[[i]], paste("DOC_", i, ".txt", sep = ""))
}
Look for the DOC_1.txt and DOC_2.txt in your working directory.
I would like to use RCurl as a polite webcrawler to download data from a website.
Obviously I need the data for scientific research. Although I have the rights to access the content of the website via my university, the terms of use of the website forbid the use of webcrawlers.
I tried to ask the administrator of the site directly for the data but they only replied in a very vague fashion. Well anyway it seems like they won’t simply send the underlying databases to me.
What I want to do now is ask them officially to get the one-time permission to download specific text-only content from their site using an R code based on RCurl that includes a delay of three seconds after each request has been executed.
The addresses of the sites that I want to download data from look like this:
http://plants.jstor.org/specimen/ followed by the ID of the site.
I tried to program it with RCurl but I cannot get it to work.
A few things complicate matters:
* One can only access the website if cookies are allowed (I got that working in RCurl with the cookiefile argument).
* The Next button only appears in the source code when one actually accesses the site by clicking through the different links in a normal browser. In the source code the Next button is encoded with an expression including "Next > >".
* When one tries to access a site directly (without having clicked through to it in the same browser before), it won't work; the line with the link is simply not in the source code.
* The IDs of the sites are combinations of letters and digits (like "goe0003746" or "cord00002203"), so I can't simply write a for-loop in R that tries every number from 1 to 1,000,000.
So my program is supposed to mimic a person clicking through all the sites via the Next button, each time saving the textual content.
After saving the content of each site, it should wait three seconds before clicking on the Next button (it must be a polite crawler). I got that working in R as well using the Sys.sleep function.
I also thought of using an automated program, but there seem to be a lot of such programs and I don’t know which one to use.
I’m also not exactly the program-writing person (apart from a little bit of R), so I would really appreciate a solution that doesn’t include programming in Python, C++, PHP or the like.
Any thoughts would be much appreciated! Thank you very much in advance for comments and proposals!
Try a different strategy.
##########################
####
#### Scrape http://plants.jstor.org/specimen/
#### Idea:: Gather links from http://plants.jstor.org/search?t=2076
#### Then follow links:
####
#########################
library(RCurl)
library(XML)
### get search page:
cookie <- 'cookiefile.txt'
curl <- getCurlHandle(cookiefile = cookie,
                      useragent = "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.6",
                      header = FALSE,
                      verbose = TRUE,
                      netrc = TRUE,
                      maxredirs = as.integer(20),
                      followlocation = TRUE)
querry.jstor <- getURL('http://plants.jstor.org/search?t=2076', curl = curl)
## remove white spaces:
querry.jstor2 <- gsub('\r','', gsub('\t','', gsub('\n','', querry.jstor)))
### get links from search page
getLinks <- function() {
  links <- character()
  list(a = function(node, ...) {
         links <<- c(links, xmlGetAttr(node, "href"))
         node
       },
       links = function() links)
}
## create a handler instance and retrieve the links
h1 <- getLinks()
querry.jstor.xml.parsed <- htmlTreeParse(querry.jstor2, useInt = TRUE, handlers = h1)
## clean up the links to keep only the ones we want
all.links <- querry.jstor.xml.parsed$links()
querry.jstor.links <- all.links[!grepl('http', all.links)] ## remove all links starting with http
querry.jstor.links <- querry.jstor.links[!grepl('search', querry.jstor.links)] ## remove all search links
querry.jstor.links <- querry.jstor.links[!grepl('#', querry.jstor.links)] ## remove all # links
querry.jstor.links <- querry.jstor.links[!grepl('javascript', querry.jstor.links)] ## remove all javascript links
querry.jstor.links <- querry.jstor.links[!grepl('action', querry.jstor.links)] ## remove all action links
querry.jstor.links <- querry.jstor.links[!grepl('page', querry.jstor.links)] ## remove all page links
## number of results
jstor.article <- getNodeSet(htmlTreeParse(querry.jstor2, useInt = TRUE), "//article")
NumOfRes <- strsplit(gsub(',', '', gsub(' ', '', xmlValue(jstor.article[[1]][[1]]))), split = '')[[1]]
NumOfRes <- as.numeric(paste(NumOfRes[1:(min(grep('R', NumOfRes)) - 1)], collapse = ''))
for(i in 2:ceiling(NumOfRes/20)){
  querry.jstor <- getURL(paste0('http://plants.jstor.org/search?t=2076&p=', i), curl = curl)
  ## remove white spaces:
  querry.jstor2 <- gsub('\r', '', gsub('\t', '', gsub('\n', '', querry.jstor)))
  querry.jstor.xml.parsed <- htmlTreeParse(querry.jstor2, useInt = TRUE, handlers = h1)
  new.links <- querry.jstor.xml.parsed$links()
  querry.jstor.links <- c(querry.jstor.links, new.links[!grepl('http', new.links)]) ## remove all links starting with http
  querry.jstor.links <- querry.jstor.links[!grepl('search', querry.jstor.links)] ## remove all search links
  querry.jstor.links <- querry.jstor.links[!grepl('#', querry.jstor.links)] ## remove all # links
  querry.jstor.links <- querry.jstor.links[!grepl('javascript', querry.jstor.links)] ## remove all javascript links
  querry.jstor.links <- querry.jstor.links[!grepl('action', querry.jstor.links)] ## remove all action links
  querry.jstor.links <- querry.jstor.links[!grepl('page', querry.jstor.links)] ## remove all page links
  Sys.sleep(abs(rnorm(1, mean = 3.0, sd = 0.5))) ## polite pause of ~3 seconds between requests
}
## make directory for saving data:
dir.create('./jstorQuery/')

## Now we have all the links, so we can retrieve all the info
for(j in 1:length(querry.jstor.links)){
  if(nchar(querry.jstor.links[j]) != 1){
    querry.jstor <- getURL(paste0('http://plants.jstor.org', querry.jstor.links[j]), curl = curl)
    ## remove white spaces:
    querry.jstor2 <- gsub('\r', '', gsub('\t', '', gsub('\n', '', querry.jstor)))
    ## construct the file name from the part of the link after the last '/':
    filename <- basename(querry.jstor.links[j])
    ## save in directory:
    write(querry.jstor2, file = paste('./jstorQuery/', filename, '.html', sep = ''))
    Sys.sleep(abs(rnorm(1, mean = 3.0, sd = 0.5))) ## polite pause of ~3 seconds between requests
  }
}
I may be missing exactly the bit you are hung up on, but it sounds like you are almost there.
It seems you can request page 1 with cookies on, then parse the content for the next site ID, request that page by building the URL with that ID, and then scrape whatever data you want.
It sounds like you have code that does almost all of this. Is the problem parsing page 1 to get the ID for the next step? If so, you should formulate a reproducible example and I suspect you'll get a very fast answer to your syntax problems.
If you're having trouble seeing what the site is doing, I recommend the Tamper Data plug-in for Firefox. It lets you see what request is being made at each mouse click. I find it really useful for this type of thing.
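For illustration, here is a rough sketch of the click-through loop described above, using rvest's html_session/jump_to rather than raw RCurl; the starting ID, the XPath for the Next link and the file naming are assumptions that would need to be checked against the real page source once cookies are enabled:

library(rvest)

sess <- html_session("http://plants.jstor.org/specimen/goe0003746")  # example ID from the question

i <- 1
repeat {
  ## save the textual content of the current page
  writeLines(html_text(html_node(sess, "body")), paste0("specimen_", i, ".txt"))

  ## look for the Next link; this XPath is a guess at the real markup
  next_href <- html_attr(html_node(sess, xpath = "//a[contains(., 'Next')]"), "href")
  if (is.na(next_href)) break        # no Next link, so stop

  Sys.sleep(3)                       # polite three-second delay between requests
  sess <- jump_to(sess, next_href)   # follow the Next link within the same session
  i <- i + 1
}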