I have many URLs whose text I import into R.
I use this code:
setNames(lapply(1:1000, function(x) gettxt(get(paste0("url", x)))), paste0("url", 1:1000, "_txt")) %>%
list2env(envir = globalenv())
However, some URLs cannot be imported and produce this error:
Error in file(con, "r") : cannot open the connection In addition:
Warning message: In file(con, "r") : InternetOpenUrl failed: 'A
connection with the server could not be established'
As a result, my code stops and does not import text from any URL.
How can I recognize the bad URLs and skip them in order to import the working ones?
One possible approach, besides the tryCatch mentioned by @tester, is the purrr package:
library(purrr)
# declare function
my_gettxt <- function(x) {
gettxt(get(paste0("url", x)))
}
# make the function error-tolerant by defining the otherwise value (could be an empty df with column definitions, etc.) that is returned if the function fails
my_gettxt <- purrr::possibly(my_gettxt , otherwise = NA)
# use map from purrr instead of apply function
my_data <- purrr::map(1:1000, ~my_gettxt(.x))
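The tryCatch route mentioned above can be sketched in base R as well. This is only a minimal sketch: fake_gettxt is a hypothetical stand-in for gettxt (it fails for any URL containing "bad"), and the example.com URLs are made up for illustration.

```r
# Hypothetical stand-in for gettxt(): fails for URLs containing "bad".
fake_gettxt <- function(url) {
  if (grepl("bad", url, fixed = TRUE)) stop("cannot open the connection")
  paste("text from", url)
}

# Wrap a reader so that an error yields NA instead of aborting the whole run.
safe_read <- function(url, reader) {
  tryCatch(reader(url), error = function(e) NA_character_)
}

urls <- c("http://example.com/a", "http://example.com/bad", "http://example.com/c")
texts <- setNames(lapply(urls, safe_read, reader = fake_gettxt), urls)

# Flag the failures afterwards, then keep only the successful imports.
is_failed <- vapply(texts, function(x) identical(x, NA_character_), logical(1))
failed <- names(texts)[is_failed]
texts_ok <- texts[!is_failed]
```

Here failed lists the URLs that could not be read, so they can be inspected or retried later, while texts_ok holds the rest.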
I'd like to create a function for a dbplyr connection, which I currently call this way:
tbl("connection"), in_schema("schema", "table") %>%
collect()
I created this function:
loadData <-
function(tbl_name) {
result <- tbl("connection"), in_schema("usa", tbl_name)
return(result)
}
I call it this way
test <- loadData(tbl_name = "sales") %>% collect()
But I'm getting this error:
Error in return(tbl("connection"), in_schema("usa", tbl_name)) :
multi-argument returns are not permitted
In addition: Warning message:
In mget(objectNames, envir = ns, inherits = TRUE) :
strings not representable in native encoding will be translated to UTF-8
Is there a way to call all my connections via a similar function?
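Judging from the error message, the closing parenthesis of tbl() comes before the comma, so in_schema(...) ends up as a second argument to return() rather than to tbl(). A possible fix, as a sketch only: it assumes con is an already open DBI connection object (not the string "connection"), and the name load_data and the schema default are just illustrative.

```r
# Sketch only: `con` is assumed to be an open DBI connection object.
load_data <- function(con, tbl_name, schema = "usa") {
  # in_schema() has to be the second argument *inside* the tbl() call.
  dplyr::tbl(con, dbplyr::in_schema(schema, tbl_name))
}

# Hypothetical usage (needs a live connection, so it is not run here):
# test <- dplyr::collect(load_data(con, "sales"))
```

Passing the connection in as an argument means the same function can be reused for every connection and schema.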
I'm relatively new to using R in this way and I'm completely stuck with the following problem.
I'm attempting to save html pages for parliamentary debates to a local folder in order to carry out some scraping in the future. I've written the following function (relying on other snippets of code rather than entirely freestyle!) in order to construct the directory, strip the URL down to a more understandable format (e.g. "2010-12_academieshl.html"), and then, if the file does not exist, write the file to the specified folder. (At this point I'm aware that the use of gsub below is kind of clumsy!)
dlPages <- function(pageurl, folder ,handle) {
dir.create(folder, showWarnings = FALSE)
gsub_URL <- gsub("/stages.html", "", link_list)
gsub_URL <- gsub("http://services.parliament.uk/bills/", "", gsub_URL)
gsub_URL <- gsub("/", "_", gsub_URL)
page_name <- str_c(gsub_URL, ".html")
if (!file.exists(str_c(folder, "/", page_name))) {
content <- try(getURL(pageurl, curl = handle))
write(content, str_c(folder, "/", page_name))
Sys.sleep(1)
} }
I'm then using l_ply to run a list (link_list) of links over the function:
handle <- getCurlHandle()
l_ply(link_list, dlPages,
folder = "lords_bills_all",
handle = handle)
The following error message is being returned:
Error in file(file, ifelse(append, "a", "w")) : invalid 'description' argument
Along with the following warning messages:
In addition: Warning messages:
1: In if (!file.exists(str_c(folder, "/", page_name))) { :
the condition has length > 1 and only the first element will be used
2: In if (file == "") file <- stdout() else if (substring(file, 1L, :
the condition has length > 1 and only the first element will be used
3: In if (substring(file, 1L, 1L) == "|") { :
the condition has length > 1 and only the first element will be used
Can someone help me understand where I'm going wrong? Also, is it best to use an 'if' argument in this scenario or write a 'for' loop instead?
Thanks in advance.
Andy
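One thing that stands out in the function above: the gsub calls work on link_list (the whole vector) rather than on pageurl (the single URL passed in by l_ply), so page_name has one element per link. That would explain both the "condition has length > 1" warnings and the invalid 'description' error from write(). A minimal sketch of deriving the name from one URL at a time, using base R only; the example URL is an assumption built from the pattern and file name given in the question.

```r
# Derive a local file name from a single bill URL.
url_to_filename <- function(pageurl) {
  # fixed = TRUE treats the patterns as literal text, not regular expressions.
  name <- sub("/stages.html", "", pageurl, fixed = TRUE)
  name <- sub("http://services.parliament.uk/bills/", "", name, fixed = TRUE)
  name <- gsub("/", "_", name, fixed = TRUE)
  paste0(name, ".html")
}

# Assumed example URL, following the pattern described in the question.
example_url <- "http://services.parliament.uk/bills/2010-12/academieshl/stages.html"
page_name <- url_to_filename(example_url)
```

Inside dlPages the same idea amounts to replacing link_list with pageurl in the first gsub call, so that file.exists() and write() each receive a single path.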
I would like to create a loop to load these files through read.eset of Bioconductor.
I tried this:
for(k in 1:29){
expr <- paste0("/home/proj/MT_Nellore/R/eBrowser/Adjusted/LRRadjustedextremes0.5kgchr",k,".txt")
pdat <- paste0("/home/proj/MT_Nellore/R/eBrowser/Adjusted/Samplesbinary0.5.txt")
ffdat <- paste0("/home/proj/MT_Nellore/R/LRR/Chr_adjusted/probeslabeladjustedchr",k,".txt")
eset <- read.eset(exprs.file="expr", pdat.file="/home/proj/MT_Nellore/R/eBrowser/Adjusted/Samplesbinary0.5.txt", fdat.file="ffdat")
}
However I get this error:
## Error in file(file, "r") : cannot open the connection
## In addition: Warning message:
## In file(file, "r") : cannot open file 'ffdat': No such file or directory
Any suggestions?
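Ah, just spotted the error: you must remove the quotes around "ffdat" on the final line, and likewise around "expr". With the quotes, read.eset receives the literal strings "ffdat" and "expr" as file names instead of the paths stored in those variables.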
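To illustrate the difference between the quoted and unquoted names, here is a small base R sketch. The directory is taken from the question; the names fdat_files, literal and esets are only illustrative, and the read.eset call is left as a comment because it needs the actual files.

```r
chromosomes <- 1:29

# One path per chromosome, built the same way as in the loop above.
fdat_files <- paste0(
  "/home/proj/MT_Nellore/R/LRR/Chr_adjusted/probeslabeladjustedchr",
  chromosomes, ".txt"
)

# A quoted name is just a string; the unquoted name gives the stored path.
literal <- "fdat_files"
same <- identical(literal, fdat_files[1])

# Checking which files are missing up front gives a clearer message
# than the "cannot open the connection" error.
missing_files <- fdat_files[!file.exists(fdat_files)]

# Collect one result per chromosome instead of overwriting eset each time.
esets <- vector("list", length(chromosomes))
# for (k in chromosomes) esets[[k]] <- read.eset(..., fdat.file = fdat_files[k])
```

Note that the original loop assigns to eset on every pass, so only the result for chromosome 29 would survive; storing into a list avoids that.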
(Windows 7 / R version 3.0.1)
Below the commands and the resulting error:
> library(tm)
> pdf <- readPDF(PdftotextOptions = "-layout")
> dat <- pdf(elem = list(uri = "17214.pdf"), language="de", id="id1")
Error in file(con, "r") : cannot open the connection
In addition: Warning message:
In file(con, "r") :
cannot open file 'C:\Users\Raffael\AppData\Local\Temp\RtmpS8Uql1\pdfinfo167c2bc159f8': No such file or directory
How do I solve this issue?
EDIT I
(As suggested by Ben and described here)
I downloaded Xpdf and copied the 32-bit version to
C:\Program Files (x86)\xpdf32
and the 64bit version to
C:\Program Files\xpdf64
The environment variables pdfinfo and pdftotext point to the respective executables, either the 32-bit ones (tested with 32-bit R) or the 64-bit ones (tested with 64-bit R).
EDIT II
One very confusing observation is that starting from a fresh session (tm not loaded) the last command alone will produce the error:
> dat <- pdf(elem = list(uri = "17214.pdf"), language="de", id="id1")
Error in file(con, "r") : cannot open the connection
In addition: Warning message:
In file(con, "r") :
cannot open file 'C:\Users\Raffael\AppData\Local\Temp\RtmpKi5GnL\pdfinfode8283c422f': No such file or directory
I don't understand this at all, because at that point the variable pdf has not yet been assigned by tm's readPDF. Below is the function that pdf refers to "naturally", followed by what readPDF returns:
> pdf
function (elem, language, id)
{
meta <- tm:::pdfinfo(elem$uri)
content <- system2("pdftotext", c(PdftotextOptions, shQuote(elem$uri),
"-"), stdout = TRUE)
PlainTextDocument(content, meta$Author, meta$CreationDate,
meta$Subject, meta$Title, id, meta$Creator, language)
}
<environment: 0x0674bd8c>
> library(tm)
> pdf <- readPDF(PdftotextOptions = "-layout")
> pdf
function (elem, language, id)
{
meta <- tm:::pdfinfo(elem$uri)
content <- system2("pdftotext", c(PdftotextOptions, shQuote(elem$uri),
"-"), stdout = TRUE)
PlainTextDocument(content, meta$Author, meta$CreationDate,
meta$Subject, meta$Title, id, meta$Creator, language)
}
<environment: 0x0c3d7364>
Apparently there is no difference - then why use readPDF at all?
EDIT III
The pdf file is located here: C:\Users\Raffael\Documents
> getwd()
[1] "C:/Users/Raffael/Documents"
EDIT IV
First instruction in pdf() is a call to tm:::pdfinfo() - and there the error is caused within the first few lines:
> outfile <- tempfile("pdfinfo")
> on.exit(unlink(outfile))
> status <- system2("pdfinfo", shQuote(normalizePath("C:/Users/Raffael/Documents/17214.pdf")),
+ stdout = outfile)
> tags <- c("Title", "Subject", "Keywords", "Author", "Creator",
+ "Producer", "CreationDate", "ModDate", "Tagged", "Form",
+ "Pages", "Encrypted", "Page size", "File size", "Optimized",
+ "PDF version")
> re <- sprintf("^(%s)", paste(sprintf("%-16s", sprintf("%s:",
+ tags)), collapse = "|"))
> lines <- readLines(outfile, warn = FALSE)
Error in file(con, "r") : cannot open the connection
In addition: Warning message:
In file(con, "r") :
cannot open file 'C:\Users\Raffael\AppData\Local\Temp\RtmpquRYX6\pdfinfo8d419174450': No such file or direc
Apparently tempfile() simply doesn't create a file.
> outfile <- tempfile("pdfinfo")
> outfile
[1] "C:\\Users\\Raffael\\AppData\\Local\\Temp\\RtmpquRYX6\\pdfinfo8d437bd65d9"
The folder C:\Users\Raffael\AppData\Local\Temp\RtmpquRYX6 exists and holds some files but none is named pdfinfo8d437bd65d9.
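A note on that last observation: tempfile() only generates a unique file name and never creates the file itself. In the code above, the file is supposed to be written by system2(..., stdout = outfile), so a missing file may simply mean that the pdfinfo command was not found or did not run. A minimal sketch for checking both points; whether pdfinfo is found depends entirely on the local PATH.

```r
# tempfile() returns a path; nothing exists at that path yet.
outfile <- tempfile("pdfinfo")
created_by_tempfile <- file.exists(outfile)

# Sys.which() returns "" when the executable cannot be found on the PATH.
pdfinfo_path <- unname(Sys.which("pdfinfo"))
pdfinfo_found <- nzchar(pdfinfo_path)
```

If pdfinfo_found is FALSE, R cannot locate the executable, which would be consistent with the output file never being written.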
Interesting: on my machine, after a fresh start, pdf is the grDevices function for writing plots to a PDF file:
getAnywhere(pdf)
A single object matching ‘pdf’ was found
It was found in the following places
package:grDevices
namespace:grDevices [etc.]
But back to the problem of reading PDF files in as text: fiddling with the PATH is a bit hit-and-miss (and annoying if you work across several different computers), so I think the simplest and safest method is to call pdftotext directly using system, as Tony Breyal describes here.
In your case it would be (note the two sets of quotes):
system(paste('"C:/Program Files/xpdf64/pdftotext.exe"',
'"C:/Users/Raffael/Documents/17214.pdf"'), wait=FALSE)
This could easily be extended with an *apply function or loop if you have many PDF files.
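The extension to many files could look roughly like this. It is a sketch under assumptions: the executable path is the one from the example above, txt_path and convert_pdfs are hypothetical helper names, and system2 is used in place of system so that the arguments can be quoted individually.

```r
# Assumed install location, taken from the example above.
pdftotext_exe <- "C:/Program Files/xpdf64/pdftotext.exe"

# Output path: same file name with a .txt extension.
txt_path <- function(pdf) sub("\\.pdf$", ".txt", pdf, ignore.case = TRUE)

# Convert each PDF; returns the exit status per file (0 means success),
# or NA when the executable or the PDF does not exist.
convert_pdfs <- function(pdfs, exe = pdftotext_exe) {
  vapply(pdfs, function(pdf) {
    if (!file.exists(exe) || !file.exists(pdf)) return(NA_integer_)
    as.integer(system2(exe, c(shQuote(pdf), shQuote(txt_path(pdf)))))
  }, integer(1))
}

# Hypothetical usage:
# status <- convert_pdfs(list.files("C:/Users/Raffael/Documents",
#                                   pattern = "\\.pdf$", full.names = TRUE))
```

The resulting .txt files can then be read back with readLines.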
I am trying to vectorize this call to source_url, in order to load some functions from GitHub:
library(devtools)
# Find ggnet functions.
fun = c("ggnet.R", "functions.R")
fun = paste0("https://raw.github.com/briatte/ggnet/master/", fun)
# Load ggnet functions.
source_url(fun[1], prompt = FALSE)
source_url(fun[2], prompt = FALSE)
The last two lines should work in an lapply call, but for some reason this fails from knitr: to make the code work when I process an Rmd document to HTML, I have to call source_url twice.
The same error shows up with source_url from devtools and with the one from downloader: somewhere in my code, an object of type closure is not subsettable.
I suspect that this has something to do with SHA; any explanation would be most welcome.
It has nothing to do with knitr or devtools or vectorization. It is just an error in your(?) code, and it is fairly easy to track down using traceback().
> library(devtools)
> # Find ggnet functions.
> fun = c("ggnet.R", "functions.R")
> fun = paste0("https://raw.github.com/briatte/ggnet/master/", fun)
> # Load ggnet functions.
> source_url(fun[1], prompt = FALSE)
SHA-1 hash of file is 2c731cbdf4a670170fb5298f7870c93677e95c7b
> source_url(fun[2], prompt = FALSE)
SHA-1 hash of file is d7d466413f9ddddc1d71982dada34e291454efcb
Error in df$Source : object of type 'closure' is not subsettable
> traceback()
7: which(df$Source == x) at file34af6f0b0be5#14
6: who.is.followed.by(df, "JacquesBompard") at file34af6f0b0be5#19
5: eval(expr, envir, enclos)
4: eval(ei, envir)
3: withVisible(eval(ei, envir))
2: source(temp_file, ...)
1: source_url(fun[2], prompt = FALSE)
You used df in the code, and df is a function in the stats package (density of the F distribution). I know you probably mean a data frame, but you did not declare that in the code.
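The effect is easy to reproduce in a fresh session. A minimal sketch: the data frame at the end is made up for illustration, with only the column name Source taken from the traceback.

```r
# Without a data frame named df in scope, the name resolves to stats::df.
df_is_function <- is.function(stats::df)

# Subsetting a function with $ fails; capture the message instead of stopping.
msg <- tryCatch(stats::df$Source, error = function(e) conditionMessage(e))

# Once df is defined as a data frame, the same expression works.
df <- data.frame(Source = c("a", "b"), Target = c("b", "c"))
idx <- which(df$Source == "a")
```

So the script needs to create df (or better, use a name that does not shadow an existing function) before who.is.followed.by is called.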