I am trying to scrape all bills from two pages on the website of the French lower chamber of parliament. The pages cover 2002-2012 and contain fewer than 1,000 bills each.
For this, I scrape with getURL() inside this loop:
b <- "http://www.assemblee-nationale.fr" # base
l <- c("12","13") # legislature id
lapply(l, FUN = function(x) {
print(data <- paste(b, x, "documents/index-dossier.asp", sep = "/"))
# scrape
data <- getURL(data); data <- readLines(tc <- textConnection(data)); close(tc)
data <- unlist(str_extract_all(data, "dossiers/[[:alnum:]_-]+.asp"))
data <- paste(b, x, data, sep = "/")
data <- getURL(data)
write.table(data,file=n <- paste("raw_an",x,".txt",sep="")); str(n)
})
Is there any way to optimise the getURL() calls here? I cannot get concurrent downloading to work by passing the async = TRUE option; it gives me the same error every time:
Error in function (type, msg, asError = TRUE) :
Failed to connect to 0.0.0.12: No route to host
Any ideas? Thanks!
Try mclapply {multicore} instead of lapply.
"mclapply is a parallelized version of lapply, it returns a list of
the same length as X, each element of which is the result of applying
FUN to the corresponding element of X."
(http://www.rforge.net/doc/packages/multicore/mclapply.html)
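For example, a minimal sketch of how the loop above could run through mclapply - scrape_legislature() is just a hypothetical wrapper around the body of your lapply() call, and mc.cores = 2 is an assumption matching the two legislatures:

library(parallel)  # mclapply() now lives here; the multicore package is its predecessor
library(RCurl)
library(stringr)

scrape_legislature <- function(x) {
  index <- getURL(paste(b, x, "documents/index-dossier.asp", sep = "/"))
  links <- unlist(str_extract_all(index, "dossiers/[[:alnum:]_-]+.asp"))
  getURL(paste(b, x, links, sep = "/"))
}

# Each legislature is scraped in its own forked process (forking is not available on Windows)
pages <- mclapply(l, scrape_legislature, mc.cores = 2)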
If that doesn't work, you may get better performance using the XML package. Functions like xmlTreeParse use asynchronous calling.
"Note that xmlTreeParse does allow a hybrid style of processing that
allows us to apply handlers to nodes in the tree as they are being
converted to R objects. This is a style of event-driven or
asynchronous calling."
(http://www.inside-r.org/packages/cran/XML/docs/xmlEventParse)
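As a rough illustration of that handler-driven style (not your exact pipeline - the tag name "a" and the href attribute are assumptions for pulling links out of one index page):

library(XML)

links <- character(0)
handlers <- list(
  a = function(node) {
    # invoked for each <a> node as the R tree is built
    href <- xmlGetAttr(node, "href")
    if (!is.null(href)) links <<- c(links, href)
    node
  }
)
htmlTreeParse("http://www.assemblee-nationale.fr/12/documents/index-dossier.asp",
              handlers = handlers)
head(links)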
Why use R? For big scraping jobs you are better off using something already developed for the task. I've had good results with Down Them All, a browser add-on. Just tell it where to start, how deep to go, what patterns to follow, and where to dump the HTML.
Then use R to read the data from the HTML files.
Advantages are massive - these add-ons are developed especially for the task, so they will do multiple downloads (controllable by you), send the right headers so your next question won't be 'how do I set the user agent string with RCurl?', and cope with retrying when some of the downloads fail, as they inevitably do.
Of course the disadvantage is that you can't easily start this process automatically, in which case maybe you'd be better off with 'curl' on the command line, or some other command-line mirroring utility.
Honestly, you've got better things to do with your time than write website code in R...
Thanks in advance for any feedback.
As part of my dissertation I'm trying to scrape data from the web (I've been working on this for months). I have a couple of issues:
- Each document I want to scrape has a document number. However, the numbers don't always go up in order: one document number might be 2022, but the next one is not necessarily 2023 - it could be 2038, 2040, etc. I don't want to go through by hand to collect each document number. I have tried wrapping download.file in purrr::safely(), but once it hits a document that does not exist it stops.
- Second, I'm still fairly new to R and am having a hard time setting up destfile for multiple documents. When I index the path where the downloaded data should be stored, the first document ends up in the named place and the next document comes back as NA.
Here's the code I've been working on:
base.url <- "https://www.europarl.europa.eu/doceo/document/"
document.name.1 <- "P-9-2022-00"
document.extension <- "_EN.docx"
#document.number <- 2321
document.numbers <- c(2330:2333)
for (i in 1:length(document.numbers)) {
  temp.doc.name <- paste0(base.url,
                          document.name.1,
                          document.numbers[i],
                          document.extension)
  print(temp.doc.name)

  # download and save data
  safely <- purrr::safely(download.file(temp.doc.name,
                                        destfile = "/Users/...[i]"))
}
Ultimately, I need to scrape about 120,000 documents from the site. Where is the best place to store the data? I'm thinking I might run the code for each of the 15 years I'm interested in separately, in order to (hopefully) keep it manageable.
Note: I've tried several different ways to scrape the data. Unfortunately for me, the RSS feed only has the most recent 25 documents. Because there are multiple dropdown menus to navigate before you reach the .docx file, my workaround is to use document numbers. I am, however, open to more efficient ways to scrape these written questions.
Again, thanks for any feedback!
Kari
After quickly checking out the site, I agree that I can't see any easier way to do this, because the search function doesn't appear to be URL-based. So what you need to do is poll each candidate URL, download when it returns a "good" status (usually 200), and skip it when it returns a "bad" status (like 404). The following code block does that.
Note that purrr::safely doesn't run a function - it creates another, "safe" function, which you then call. The created function returns a list with two slots: result and error.
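A tiny illustration of that behaviour, using log() as a stand-in function:

slog <- purrr::safely(log)
slog(10)   # $result is 2.302585, $error is NULL
slog("a")  # $result is NULL, $error holds the condition object

With that in mind, here is the polling-and-downloading loop: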
base.url <- "https://www.europarl.europa.eu/doceo/document/"
document.name.1 <- "P-9-2022-00"
document.extension <- "_EN.docx"
#document.number <- 2321
document.numbers <- c(2330:2333,2552,2321)
sHEAD = purrr::safely(httr::HEAD)
sdownload = purrr::safely(download.file)
for (i in seq_along(document.numbers)) {
file_name = paste0(document.name.1,document.numbers[i],document.extension)
temp.doc.name <- paste0(base.url,file_name)
print(temp.doc.name)
print(sHEAD(temp.doc.name)$result$status)
if(sHEAD(temp.doc.name)$result$status %in% 200:299){
sdownload(temp.doc.name,destfile=file_name)
}
}
It might not be as simple as all of the valid URLs returning a '200' status; in general, status codes in the 200:299 range are OK (I've edited the answer to reflect this).
I used parts of this answer in my answer.
If the file does not exist, tryCatch simply skips it:
library(tidyverse)

get_data <- function(index) {
  paste0(
    "https://www.europarl.europa.eu/doceo/document/",
    "P-9-2022-00",
    index,
    "_EN.docx"
  ) %>%
    download.file(url = .,
                  destfile = paste0(index, ".docx"),
                  mode = "wb",
                  quiet = TRUE) %>%
    tryCatch(.,
             error = function(e) print(paste(index, "does not exist - SKIPS")))
}

map(2000:5000, get_data)
I am building a dataset by web-scraping data from various websites for a stock signal prediction algorithm. My algorithm is set up with layered for-loops that load thousands of URLs, because each link refers to a stock and its various quantitative statistics. I need help increasing processing speed. Any tips?
I have talked to a few different people about how to solve this and some have recommended vectorization, but that is new to me. I have also tried switching to data.table, but I haven't seen much change. The eval() lines are a trick I learned to manipulate the data the way I want; I suspect they may be a reason why it is slow, but I doubt it. I have also wondered about remote processing, but that probably goes beyond the R world.
For the code below, imagine there are 4 more sections like this for other variables from different websites I want to load, and all of these blocks sit inside an even larger for-loop because I'm generating two datasets (set <- c("training", "testing")).
The tryCatch is there to prevent the code from stopping if it encounters an error loading a URL. The URLs are loaded into a list, one for each stock, so the lists are pretty long. The second for-loop scrapes the data from the URLs and writes it, correctly formatted, into a data frame.
library(quantmod)
library(readr)
library(rvest)
library(data.table)

urlsmacd <- vector("list", length =
  eval(parse(text = as.name(paste0("nrow(", set[y], ")", sep = "")))))

for (h in 1:eval(parse(text = as.name(paste0("nrow(", set[y], ")", sep = ""))))) {
  urlsmacd[h] <- paste0('http://www.stockta.com/cgi-bin/analysis.pl?symb=',
                        eval(parse(text = as.name(paste0(set[y], "[,1][h]", sep = "")))),
                        '&mode=table&table=macd&num1=1', sep = '')
}

for (j in 1:eval(parse(text = as.name(paste0("nrow(", set[y], ")", sep = ""))))) {
  tryCatch({
    html <- read_html(urlsmacd[[j]])

    # get macd html
    MACD26 <- html_nodes(html, '.borderTd~ .borderTd+ .borderTd:nth-child(3) font')
    MACD26 <- toString(MACD26)
    MACD26 <- gsub("<[^>]+>", "", MACD26)
    if (!is.na(MACD26)) {
      MACD26 <- as.double(MACD26)
    }
    eval(parse(text = as.name(paste0(set[y], "$", "MACD26[j] <- MACD26"))))

    MACD12 <- html_nodes(html, '.borderTd+ .borderTd:nth-child(2) font')
    MACD12 <- toString(MACD12)
    MACD12 <- gsub("<[^>]+>", "", MACD12)
    if (!is.na(MACD12)) {
      MACD12 <- as.double(MACD12)
    }
    eval(parse(text = as.name(paste0(set[y], "$", "MACD12[j] <- MACD12"))))
  }, error = function(e) { cat("ERROR :", conditionMessage(e), "\n") })
}
All said and done, this process takes around 6 hours. At this rate, shaving hours off it would make progressing with my project so much easier.
Thank you people of StackOverflow for your support.
Check out the doParallel package. It provides a parallel implementation of the foreach loop and lets you use more of your CPU's cores (if any are available) to run parallel R sessions for a given function. For example:
library(doParallel)

no_cores <- detectCores() - 1
cl <- makeCluster(no_cores, type = "FORK")  # type = "FORK" is not available on Windows
registerDoParallel(cl)

# getPrimeNumbers() is the example function from the post linked below
result <- foreach(i = 10:10000) %dopar%
  getPrimeNumbers(i)

stopCluster(cl)
If the URLs are stored in a list, there is also a parallel version of lapply (parLapply); see the sketch at the end of this answer.
The example is taken from this great post:
https://www.r-bloggers.com/lets-be-faster-and-more-parallel-in-r-with-doparallel-package/amp/
Hope it helps.
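To tie this back to the question: a hedged sketch of that parallel lapply route, where urlsmacd is the list of URLs already built above and scrape_one() is a hypothetical function wrapping the read_html()/html_nodes() block from the second for-loop:

library(parallel)

cl <- makeCluster(detectCores() - 1)
clusterEvalQ(cl, library(rvest))   # make rvest available on every worker
results <- parLapply(cl, urlsmacd, scrape_one)
stopCluster(cl)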
I was trying different things to lemmatize a huge corpus of words using various techniques in R. Finally, I managed to use the koRpus package, which is a wrapper for the TreeTagger application.
content.cc is my corpus, containing nearly 7,000 documents with an average of about 300 words each. I set up the function:
lemmatizeCorpus <- function(x) {
  if (x != "") {
    words.cc <- treetag(x, treetagger = "manual", format = "obj",
                        TT.tknz = FALSE, lang = "en",
                        TT.options = list(path = "c:/TreeTagger", preset = "en"))
    words.lm <- ifelse(words.cc@TT.res$token != words.cc@TT.res$lemma,
                       ifelse(words.cc@TT.res$lemma != "<unknown>",
                              words.cc@TT.res$lemma, words.cc@TT.res$token),
                       words.cc@TT.res$token)
    content.w <- toString(paste(words.lm, collapse = " "))
  }
}
and executed it this way:
content.lw <- sapply(X = content.cc$content, FUN = function(x) lemmatizeCorpus(x), USE.NAMES = F)
This produces the desired effect - it changes words that have a root in the TreeTagger dictionary and, importantly, keeps the same hierarchy as in the corpus (document number, word positions, word counts). The problem is that it runs for about an hour (on my rather slow machine, though the exact CPU is not the point).
I tried merging the whole corpus into one character matrix with stri_extract_all_words(content.cc$content) and applying treetag to the corpus as a whole. It was about 5x faster (with the same function body), but I got lost trying to work out which words belong to which document, because the number of words extracted by stringi and the number returned by treetag differed quite a bit. The looped approach, by contrast, is stable.
Another attempt used the stemmer from the tm package, which is popular and for which help and solutions can also be found on this forum, but it hits the regex memory limit very quickly, and looping over documents ends up behaving just like my current approach.
All I need is some suggestion as to what I can do about this - if anything. Maybe it can't be sped up because TreeTagger simply works that way and can't be faster; I know it's challenging. Using sapply, for example, is about 2x faster than a pure loop, so that's some improvement.
I could not understand the code completely because I'm not familiar with the APIs you mentioned. But if I understood your problem statement correctly, you basically need to lemmatize a corpus, i.e., have the entire set of strings/sentences lemmatized.
Did you try the textstem library?
library(textstem)
lemmatize_strings("He quit lazing around and actively to activities like running, aerobics and swimming", dictionary = lexicon::hash_lemmas)
The output is...
[1] "He quit laze around and actively to activity like run, aerobics and swim"
You could thus try lemmatizing your entire input first, and then generating a corpus from it - something like the sketch below.
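For example, a minimal sketch assuming content.cc$content is the character vector of documents from the question - lemmatize_strings() is vectorised, so no explicit loop or sapply() is needed:

library(textstem)

content.lw <- lemmatize_strings(content.cc$content,
                                dictionary = lexicon::hash_lemmas)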
I'm relatively new to R but experienced in traditional programming languages (e.g., C, Java). I've recently run into the situation where I had so many data files to load that I was spending almost as much time on that one task as I was on the actual analysis. I spent a little time googling this but didn't run across any solutions that I found directly relevant (I might have missed something, I'm impatient that way). Despite that I came up with a simple solution to my problem that I wanted to share with the community in case anyone else found themselves in similar circumstances.
A bit of background info: The data I'm analyzing is real-time performance and diagnostic metrics for an experimental system that is driven by real-time data feeds (i.e., complicated). The upshot is that between trials filenames don't change and the data is written out directly to csv files (I wrote the logging code so I get to be my own best friend like that ;). There are dozens of files generated during a single trial and we have potentially hundreds of trials to look forward to.
I had a few ideas and after playing around with the code a bit I came up with the following solution:
# Create mapping that associates files with a handle that the loader will use to
# generate a named list of data frames (don't even try this on the cmdline)
createDataFileMapping <- function() {
  list(
    c(file = "file1.csv", descr = "descriptor1"),
    c(file = "file2.csv", descr = "descriptor2"),
    ...
  )
}
# Batch load csv files and return as list of data frames
loadTrialData <- function(load.dir, mapping) {
  dfList <- list()
  for (item in mapping) {
    file <- paste(load.dir, item[["file"]], sep = "/")
    df <- read.csv(file)
    dfList[[ item[["descr"]] ]] <- df
  }
  return(dfList)
}
Invoking is as simple as loadTrialData("~/data/directory", createDataFileMapping()).
I'm sure there are other ways to solve this problem, but the above gets the job done in my case. It's probably slightly less memory-efficient than loading the files directly into data frames in the global environment, and the syntax for passing individual data frames to analysis/plotting functions isn't as elegant as it could be, but I'm not choosy. If you have a more flexible or generalizable solution, please don't hesitate to post it!
What you have is sound, I would add only two comments:
Don't worry about extra memory usage, assuming the data frames are of nontrivial size you won't lose much putting them in a big list.
You might add ... as an argument to your function and pass it through to read.csv, so that if another user needs to specify extra arguments because their file isn't in quite the same format (or wants stringsAsFactors = FALSE or something), they have the flexibility to do that.
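A minimal sketch of that second suggestion, reusing the loader from the question (the ... is simply forwarded to read.csv):

loadTrialData <- function(load.dir, mapping, ...) {
  dfList <- list()
  for (item in mapping) {
    file <- paste(load.dir, item[["file"]], sep = "/")
    dfList[[ item[["descr"]] ]] <- read.csv(file, ...)
  }
  return(dfList)
}

# e.g. loadTrialData("~/data/directory", createDataFileMapping(), stringsAsFactors = FALSE)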
I have what I think is a common enough issue concerning workflow optimisation in R. Specifically, how can I avoid the familiar problem of ending up with a folder full of output (plots, RData files, csv, etc.) without, after some time, having a clue where it came from or how it was produced? In part, it surely involves trying to be intelligent about folder structure. I have been looking around, but I'm unsure what the best strategy is. So far, I have tackled it in a rather unsophisticated (overkill) way: I created a function MetaInfo (see below) that writes a text file with metadata under a given file name. The idea is that if a plot is produced, this command is issued to produce a text file with exactly the same file name as the plot (except, of course, the extension), containing information on the system, session, packages loaded, R version, the function and file the metadata function was called from, and so on. The questions are:
(i) How do people approach this general problem? Are there obvious ways to avoid the issue I mentioned?
(ii) If not, does anyone have any tips on improving this function? At the moment it's perhaps clunky and not ideal. In particular, getting the name of the file from which the plot is produced doesn't necessarily work (the solution I use is one provided by @hadley in [1]). Any ideas would be welcome!
The function assumes git, so please ignore the probable warning produced. This is the main function, stored in a file metainfo.R:
MetaInfo <- function(message=NULL, filename)
{
  # message  - character string - Any message to be written into the information
  #            file (e.g., data used).
  # filename - character string - the name of the txt file (including relative
  #            path). Should be the same as the output file it describes (RData,
  #            csv, pdf).
  #
  if (is.null(filename))
  {
    stop('Provide an output filename - parameter filename.')
  }
  filename <- paste(filename, '.txt', sep='')
  # Try to get as close as possible to getting the file name from which the
  # function is called.
  source.file <- lapply(sys.frames(), function(x) x$ofile)
  source.file <- Filter(Negate(is.null), source.file)
  t.sf <- try(source.file <- basename(source.file[[length(source.file)]]),
              silent=TRUE)
  if (class(t.sf) == 'try-error')
  {
    source.file <- NULL
  }
  func <- deparse(sys.call(-1))
  # MetaInfo isn't always called from within another function, so func could
  # return as NULL or as general environment.
  if (any(grepl('eval', func, ignore.case=TRUE)))
  {
    func <- NULL
  }
  time <- strftime(Sys.time(), "%Y/%m/%d %H:%M:%S")
  git.h <- system('git log --pretty=format:"%h" -n 1', intern=TRUE)
  meta <- list(Message=message,
               Source=paste(source.file, ' on ', time, sep=''),
               Functions=func,
               System=Sys.info(),
               Session=sessionInfo(),
               Git.hash=git.h)
  sink(file=filename)
  print(meta)
  sink(file=NULL)
}
which can then be called in another function, stored in another file, e.g.:
source('metainfo.R')

RandomPlot <- function(x, y)
{
  fn <- 'random_plot'
  pdf(file=paste(fn, '.pdf', sep=''))
  plot(x, y)
  MetaInfo(message=NULL, filename=fn)
  dev.off()
}

x <- 1:10
y <- runif(10)
RandomPlot(x, y)
This way, a text file with the same file name as the plot is produced, with information that could hopefully help figure out how and where the plot was produced.
In terms of general R organization: I like to have a single script that recreates all work done for a project. Any project should be reproducible with a single click, including all plots or papers associated with that project.
So, to stay organized: keep a separate directory for each project; each project has its own functions.R script to store non-package functions associated with that project, and each project has a master script that starts like
## myproject
source("functions.R")
source("read-data.R")
source("clean-data.R")
etc... all the way through. This should help keep everything organized, and if you get new data you just go to early scripts to fix up headers or whatever and rerun the entire project with a single click.
There is a package called ProjectTemplate that helps organize and automate the typical workflow with R scripts, data files, charts, etc. There are also a number of helpful documents, like this one: Workflow of statistical data analysis by Oliver Kirchkamp.
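A minimal sketch of getting started with ProjectTemplate (the project name is a placeholder):

library(ProjectTemplate)

create.project("my-project")  # scaffolds data/, munge/, src/, reports/, etc.
setwd("my-project")
load.project()                # loads the data sets and runs the munging scripts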
If you use Emacs and ESS for your analyses, learning Org-Mode is a must. I use it to organize all my work. Here is how it integrates with R: R Source Code Blocks in Org Mode.
There is also this new free tool called Drake which is advertised as "make for data".
I think my question belies a certain level of confusion. Having looked around, as well as explored the suggestions provided so far, I have reached the conclusion that it is probably not important to know where and how a file was produced. You should in fact be able to wipe out any output and reproduce it by rerunning the code. So while I might still use the above function for extra information, it really is a question of being ruthless and cleaning up folders every now and then. These ideas are more eloquently explained here. This of course does not preclude the use of Make/Drake or ProjectTemplate, which I will try to pick up on. Thanks again for the suggestions @noah and @alex!
There is also now an R package called drake (Data Frames in R for Make), independent from Factual's Drake. The R package is also a Make-like build system that links code/dependencies with output.
install.packages("drake") # It is on CRAN.
library(drake)
load_basic_example()
plot_graph(my_plan)
make(my_plan)
Like its predecessor remake, it has the added bonus that you do not have to keep track of a cumbersome pile of files. Objects generated in R are cached during make() and can be reloaded easily.
readd(summ_regression1_small) # Read objects from the cache.
loadd(small, large) # Load objects into your R session.
print(small)
But you can still work with files as single-quoted targets. (See 'report.Rmd' and 'report.md' in my_plan from the basic example.)
There is a package developed by RStudio called pins that might address this problem.
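A minimal sketch of how pins might be used for this, assuming the current pins API (board_local(), pin_write(), pin_read()); the object and the pin name are placeholders:

library(pins)

board <- board_local()                        # a local folder managed by pins
pin_write(board, my_results, name = "random-plot-data")
pin_read(board, "random-plot-data")           # retrieve it later by name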