Error in ocrFile function in AbbyyR package - r

While using the ABBYY Cloud SDK for OCR, I keep getting the error below when I try to use the ocrFile function from the abbyyR package.
" Error in curl_download(finishedlist$resultUrl[res$id == finishedlist$id], :
Argument 'url' must be string. "
Sending the files to the cloud and processing them works fine, but downloading the results back fails. I thought it might be a network or certificate problem, but I haven't been able to solve it.
Thanks in advance

There is a problem in the source code: the URL needs to be converted with as.character() before it is passed to curl_download(). I updated the ocrFile function as follows:
install.packages("curl")
library(curl)
new_ocrFile<-function (file_path = "", output_dir = "./", exportFormat = c("txt",
"txtUnstructured", "rtf", "docx", "xlsx", "pptx", "pdfSearchable",
"pdfTextAndImages", "pdfa", "xml", "xmlForCorrectedImage",
"alto"), save_to_file = TRUE)
{
exportFormat <- match.arg(exportFormat)
res <- processImage(file_path = file_path, exportFormat = exportFormat)
while (!(any(as.character(res$id) == as.character(listFinishedTasks()$id)))) {
Sys.sleep(1)
}
finishedlist <- listFinishedTasks()
res$id <- as.character(res$id)
finishedlist$id <- as.character(finishedlist$id)
if (identical(save_to_file, FALSE)) {
res <- curl_fetch_memory(as.character(finishedlist$resultUrl[res$id ==
finishedlist$id]))
return(rawToChar(res$content))
}
curl_download(as.character(finishedlist$resultUrl[res$id == finishedlist$id]),
destfile = paste0(output_dir, unlist(strsplit(basename(file_path),
"[.]"))[1], ".", exportFormat))
}
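For example, after registering your application credentials with setapp(), the patched function can be called like this (the app ID, password, and file name are placeholders, not from the original post):
setapp(c("my_app_id", "my_app_password"))
# Hypothetical usage: OCR a local scan and write the plain-text result next to it.
new_ocrFile(file_path = "scan.png", output_dir = "./", exportFormat = "txt")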
I hope it helps.

Related

Error when trying to download species models in R with ebirdst::ebirdst_download()

I've requested an access key and set it. Sys.getenv("EBIRDST_KEY") returns the correct key. ebirdst_download(species = "Sharp-tailed Grouse") returns an error
Error in ebirdst_download(species = "Sharp-tailed Grouse") : Cannot
access Status and Trends data URL. Ensure that you have a working
internet connection and a valid API key for the Status and Trends
data.
My internet is working. Looking at the source code, I believe the error is generated because the function reads a URL, essentially read_json(stringr::str_glue("{api_url}list-obj/{species}?key={key}")), which for some reason returns NULL.
https://rdrr.io/github/CornellLabofOrnithology/ebirdst/src/R/ebirdst-loading.R
But I can get it to work if I use the following lines of the source code. I'm not sure why the ebirdst function is failing.
species<-"Sharp-tailed Grouse"
path = rappdirs::user_data_dir("ebirdst")
species <- get_species(species)
which_run <- which(ebirdst::ebirdst_runs$species_code == species)
run <- ebirdst::ebirdst_runs$run_name[which_run]
key<-Sys.getenv("EBIRDST_KEY")
api_url <- "https://st-download.ebird.org/v1/"
list_obj_url <- stringr::str_glue("{api_url}list-obj/{species}?key={key}")
files <-jsonlite::read_json(list_obj_url, simplifyVector = TRUE)
files <- data.frame(file = files)
files <- files[!stringr::str_detect(files$file, "\\.db$"), , drop = FALSE]
files$src_path <- stringr::str_glue("{api_url}fetch?objKey={files$file}",
"&key={key}")
files$dest_path <- file.path(path, files$file)
files$exists <- file.exists(files$dest_path)
dirs <- unique(dirname(files$dest_path))
for (d in dirs) {
dir.create(d, showWarnings = FALSE, recursive = TRUE)
}
old_timeout <- getOption("timeout")
options(timeout = max(3000, old_timeout))
for (i in seq_len(nrow(files))) {
dl_response <- utils::download.file(files$src_path[i],
files$dest_path[i],
mode = "wb")
if (dl_response != 0) {
stop("Error downloading file: ", files$file[i])
}
}
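One way to narrow down whether the failure is the key or the endpoint itself is to request the listing URL directly and inspect the HTTP status instead of letting read_json() fail. A minimal diagnostic sketch, assuming api_url, species, and key are defined as in the snippet above:
library(httr)

# Hit the list-obj endpoint directly and look at the raw response.
resp <- GET(stringr::str_glue("{api_url}list-obj/{species}?key={key}"))
status_code(resp)             # 200 means the key and URL are accepted
content(resp, as = "text")    # raw JSON body; any error message shows up here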

installr::install.pandoc() appears broken

I recently noticed that the install.pandoc function in the installr package appears to be broken.
I get the following error message:
trying URL 'https://github.com/'
Content type 'text/html; charset=utf-8' length unknown
downloaded 78 KB
github.com is not compatible with the version of Windows you're running. Check your computer's system information and then contact the software publisher.
It looks like the function is not finding the appropriate file from GitHub. I have submitted a pull request to the installr package on GitHub which corrects this error.
Here is the function that was submitted as the pull request and that should install Pandoc correctly, in case you run into this error before the fix is merged.
library(installr)

FixedInstall.Pandoc <- function(URL = "https://github.com/jgm/pandoc/releases",
                                use_regex = TRUE, to_restart, ...) {
  URL <- "https://github.com/jgm/pandoc/releases"
  page_with_download_url <- URL
  if (!use_regex)
    warning("use_regex is no longer supported, you can stop using it from now on...")
  page <- readLines(page_with_download_url, warn = FALSE)
  sysArch <- Sys.getenv("R_ARCH")
  sysArch <- gsub("/ |/x", "", sysArch)
  # Find the Windows .msi installer matching the system architecture
  pat <- paste0("jgm/pandoc/releases/download/[0-9.]+/pandoc-[0-9.-]+-windows",
                ".*", sysArch, ".*", ".msi")
  target_line <- grep("windows", page, value = TRUE)
  m <- regexpr(pat, target_line)
  URL <- regmatches(target_line, m)
  URL <- head(URL, 1)
  URL <- paste("https://github.com/", URL, sep = "")
  installed <- install.URL(URL, ...)
  if (!installed)
    return(invisible(FALSE))
  if (missing(to_restart)) {
    if (is.windows()) {
      you_should_restart <- "You should restart your computer\n in order for pandoc to work properly"
      winDialog(type = "ok", message = you_should_restart)
      choices <- c("Yes", "No")
      question <- "Do you want to restart your computer now?"
      the_answer <- menu(choices, graphics = TRUE, title = question)
      to_restart <- the_answer == 1L
    } else {
      to_restart <- FALSE
    }
  }
  if (to_restart)
    os.restart()
}
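Usage is then just a call with the defaults (on Windows, with installr loaded as above):
# Download and run the latest Pandoc .msi installer from the GitHub releases page.
FixedInstall.Pandoc()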

R Parallel Programming: Error in { : task 1 failed - "could not find function "%>%""

I tried to do parallel programming in R by modifying my script. The script has two parallel sections; the first one runs fine, but the second one fails even though its structure is the same. Below is my code:
library(rvest)
library(RMySQL)
library(curl)
library(gdata)
library(doMC)
library(foreach)
library(doParallel)
library(raster)

trim <- function(x) gsub("^\\s+|\\s+$", "", x)

setwd('/home/chandra/R/IlmuOne/MisterAladin')

no_cores <- detectCores()
cl <- makeCluster(no_cores)
registerDoParallel(cl)

MasterData <- read.xls("Master Hotels - FINAL.xlsx", sheet = 1, header = TRUE)
MasterData$url_agoda <- as.character(MasterData$url_agoda)
today <- as.Date(format(Sys.time(), "%Y-%m-%d")) + 2
ntasks <- nrow(MasterData)

# This section performed well
foreach(i = 1:ntasks) %dopar% {
  url <- MasterData$url_agoda[i]
  if (trim(url) != '-' & trim(url) != '') {
    from <- gregexpr(pattern = '=', url)[[1]][1]
    piece1 <- substr(url, 1, from)
    from <- gregexpr(pattern = '&los=', url)[[1]][1]
    piece2 <- substr(url, from, nchar(url))
    MasterData$url_agoda[i] <- paste0(piece1, today, piece2)
  }
}

con <- dbConnect(RMySQL::MySQL(), username = "root", password = "master",
                 host = "localhost", dbname = "mister_aladin")

# Tried the first 10 rows
# The section below always fails with: Error in { : task 1 failed - "could not find function "%>%""
foreach(a = 1:10, .packages = 'foreach') %dopar% {
  hotel_id <- MasterData$id[a]
  vendor <- 'Agoda'
  url <- MasterData$url_agoda[a]
  if (url != '-') {
    tryCatch({
      hotel <- curl(url) %>%
        read_html() %>%
        html_nodes(xpath = '//*[@id="room-grouping"]') %>%
        html_table(fill = TRUE)
      hotel <- hotel[[1]]
      hotel$hotel_id <- hotel_id
      hotel$vendor <- vendor
      colnames(hotel)[1] <- 'TheSpace'
      colnames(hotel)[4] <- 'PricePerNight'
      room <- '-'
      hotel$NormalPrice <- 0
      hotel$FinalPrice <- 0
      for (i in 1:nrow(hotel)) {
        if (i == 1 | (!grepl('See photos', hotel$TheSpace[i]) & hotel$TheSpace[i] != '')) {
          room <- hotel$TheSpace[i]
        }
        hotel$TheSpace[i] <- room
        # Normal price
        if (gregexpr(pattern = 'IDR', hotel$PricePerNight[i])[[1]][1][1] == 1) {
          split <- strsplit(hotel$PricePerNight[i], '\n')[[1]]
          NormalPrice <- trim(split[2])
          hotel$NormalPrice[i] <- NormalPrice
          NormalPrice <- as.integer(gsub(",", "", NormalPrice))
          hotel$NormalPrice[i] <- NormalPrice
        }
        # Final price
        if (gregexpr(pattern = 'IDR', hotel$PricePerNight[i])[[1]][1][1] == 1) {
          split <- strsplit(hotel$PricePerNight[i], '\n')[[1]]
          FinalPrice <- trim(split[6])
          hotel$FinalPrice[i] <- FinalPrice
          FinalPrice <- as.integer(gsub(",", "", FinalPrice))
          hotel$FinalPrice[i] <- FinalPrice
        }
        hotel$NormalPrice[is.na(hotel$NormalPrice)] <- 0
        hotel$FinalPrice[is.na(hotel$FinalPrice)] <- 0
      }
      hotel <- hotel[which(hotel$FinalPrice != 0), c("TheSpace", "NormalPrice", "FinalPrice")]
      colnames(hotel) <- c('room', 'normal_price', 'final_price')
      hotel$log <- format(Sys.time(), "%Y-%m-%d %H:%M:%S")
      hotel$hotel_id <- hotel_id
      hotel$vendor <- vendor
      Push <- hotel[, c('hotel_id', 'room', 'normal_price', 'final_price', 'vendor', 'log')]
      # print(paste0('Agoda: push one record, hotel id ', hotel_id, '!'))
      # cat(paste(paste0('Agoda: push one record, hotel id ', hotel_id, '!'), '\n'))
      dbWriteTable(conn = con, name = 'prices_', value = as.data.frame(Push),
                   append = TRUE, row.names = F)
    },
    error = function(e) {
      Sys.sleep(2)
      e
    })
  }
}

dbDisconnect(con)
stopImplicitCluster()
Every time I run the script it gives me the error: Error in { : task 1 failed - "could not find function "%>%""
I have already checked every related post on this forum and tried to apply the suggestions, but none of them works.
Please advise a solution.
You have to use .packages = c("magrittr", ...) and include all the packages that are necessary to run the code within the foreach loop; .packages = "foreach" alone does not help.
You can think of the packages you list in .packages as being forwarded to and loaded on each parallel worker.
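A minimal, self-contained sketch of the idea (not the original scraping loop; it only shows that %>% becomes available on the workers once magrittr is listed in .packages):
library(doParallel)
library(foreach)

cl <- makeCluster(2)
registerDoParallel(cl)

# Each worker loads magrittr itself, so %>% can be used inside the loop body.
res <- foreach(a = 1:2, .packages = c("magrittr")) %dopar% {
  a %>% sqrt()
}
stopCluster(cl)
res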
The %>% operator requires the package magrittr. In this case, however, it does not suffice to load it at the beginning of your script; it needs to be loaded on each of the nodes. You can add this line after creating your cluster to accomplish this:
cl <- makeCluster(no_cores)
registerDoParallel(cl)
clusterCall(cl, function() library(magrittr))

starting a function in asynch mode in R, as a separate process

I am looking for the ability to start R processes asynchronously from within R.
Something like the function below:
startFunctionInAsynchMode <- function(workingDir, filesToSource, functionName, ...) {
  # workingDir    - the dir that should be set as the working directory
  # filesToSource - vector of file names to be sourced
  # functionName  - the actual function to be run asynchronously
  # ...           - other parameters to be passed to the function
  # Return value  - should be the system process id started
}
Does anyone have quick ideas? I checked packages like parallel, but they don't seem to fit.
Thanks in advance
Here is an implementation using R CMD BATCH. The basic version is tested, with some open items remaining.
library(stringr)  # str_split(), str_detect()
library(plyr)     # ldply()

startFunctionInAsynchMode <- function(workingDir, filesToSource, functionName, ...) {
  wd <- getwd()
  setwd(workingDir)
  fs <- makeFiles()
  scriptFile <- fs$ScriptFile
  cat(file = scriptFile, paste0("source(\"", filesToSource, "\")", collapse = "\n"))
  cat(file = scriptFile, "\n", append = TRUE)
  functionCall <- getFunctionCall(functionName, as.list(match.call()), startIndex = 5)
  cat(file = scriptFile, functionCall, append = TRUE)
  commandsToRun <- paste0("(R CMD BATCH ", scriptFile, " ", fs$LogFile, " --slave ) &")
  print(commandsToRun)
  system(commandsToRun)
  Sys.sleep(5)
  pids <- getPids(scriptFile, "--restore")
  cat(file = fs$KillScript, paste0("kill -9 ", pids$PID[1]))
  setwd(wd)
  return(as.character(pids$PID[1]))
}

makeFiles <- function() {
  res <- list()
  dir.create("./temp/tempRgen", recursive = TRUE, showWarnings = FALSE)
  tf <- tempfile("rGen-", "./temp/tempRgen", fileext = "")
  res$ScriptFile <- paste0(tf, ".R")
  res$LogFile <- paste0(tf, ".log")
  res$KillScript <- paste0(tf, "-kill.sh")
  file.create(res$KillScript, showWarnings = FALSE)
  file.create(res$ScriptFile, showWarnings = FALSE)
  res
}

# Open items to be handled:
# 1. Named arguments
# 2. Non-string arguments
getFunctionCall <- function(functionName, argList, startIndex) {
  res <- paste0(functionName, "(")
  if (!is.null(argList)) {
    if (length(argList) >= startIndex) {
      first <- TRUE
      for (i in startIndex:length(argList)) {
        if (first) {
          first <- FALSE
        } else {
          res <- paste0(res, ",")
        }
        res <- paste0(res, "\"", argList[[i]], "\"")
      }
    }
  }
  res <- paste0(res, ")")
}

getPids <- function(grepFor, refineWith) {
  numCols <- length(unlist(str_split(system("ps aux", intern = TRUE)[1], "\\s+")))
  psOutput <- system(paste0("ps auxww | grep ", grepFor), intern = TRUE)
  psOutput <- psOutput[str_detect(psOutput, refineWith)]
  pidDf <- ldply(psOutput, parseEachPsLine)
  # Remove the process that actually grep-ed for my search string
  pidDf <- pidDf[!str_detect(pidDf$COMMAND, "grep"), ]
  return(pidDf)
}

parseEachPsLine <- function(line) {
  tabular <- read.table(textConnection(line), header = FALSE, sep = " ")
  tabular <- tabular[!is.na(tabular)]
  psTitles <- c("USER", "PID", "CPU", "MEM", "VSZ", "RSS", "TTY", "STAT", "START", "TIME", "COMMAND")
  psColNames <- setNames(seq(1, length(psTitles)), psTitles)
  COMMAND <- paste0(tabular[(psColNames["COMMAND"]):length(tabular)], collapse = " ")
  return(data.frame("PID" = tabular[psColNames["PID"]],
                    "STARTED" = tabular[psColNames["START"]],
                    "COMMAND" = COMMAND,
                    "STATUS" = tabular[psColNames["STAT"]]))
}
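Hypothetical usage (the directory, file name, function name, and argument below are placeholders for illustration, not from the original answer):
# Source helper.R in ./work, run myLongJob("input.csv") in a background R process,
# and keep the PID so the job can be killed later via the generated kill script.
pid <- startFunctionInAsynchMode("./work", "helper.R", "myLongJob", "input.csv")
pid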

Using the openpaths.cc API with R

I tried to pull my location data from openpaths.cc to use it with R.
The API uses OAuth and is documented here; however, the documentation only provides an example in Python.
After looking around for how to handle OAuth (which I am barely familiar with) in R, I found ROAuth, so I used the usage example it provides as a basis.
According to the API documentation, the endpoint for all requests is https://openpaths.cc/api/1, and I have my access key and access secret, so I naively plugged them in for cKey, cSecret, reqURL, accessURL, authURL, and testURL, but only got "bad request" as a result from the credentials$handshake() line.
library(ROAuth)

reqURL <- "https://openpaths.cc/api/1"
accessURL <- "https://openpaths.cc/api/1"
authURL <- "https://openpaths.cc/api/1"
cKey <- "key"
cSecret <- "secret"
testURL <- "https://openpaths.cc/api/1"

credentials <- OAuthFactory$new(consumerKey = cKey,
                                consumerSecret = cSecret,
                                requestURL = reqURL,
                                accessURL = accessURL,
                                authURL = authURL,
                                needsVerifier = TRUE)
credentials$handshake()
## the GET isn't strictly necessary as that's the default
credentials$OAuthRequest(testURL, "GET")
While I feel like I have no idea what I'm doing, I at least verified that ROAuth is capable of using the HMAC-SHA1 method, which is required by openpaths.
EDIT: I have ROAuth version 0.9.3 installed
EDIT2: After learning about httr, I thought it might be the appropriate library for the task; however, I still could not produce any usable results, since the token creation via oauth1.0_token only led to a Bad request again.
I think my primary problem is the lack of API documentation from openpaths.cc. With all these tools, I still have no idea how to properly use them.
Here is as far as I got. I receive a "400 Not Authorized"; maybe this is because my openpaths account is not connected to Foursquare, or maybe something is wrong with the code. Please try it out!
Required packages:
library(RCurl)
library(digest)
library(base64)
Some functions borrowed/adapted from ROAuth:
## Get a random sequence of characters.
## Nonce - number used only once.
genNonce <- function(len = 15L + sample(1:16, 1L)) {
  els <- c(letters, LETTERS, 0:9, "_")
  paste(sample(els, len, replace = TRUE), collapse = "")
}

## This function is derived from utils::URLencode.
## Characters not in the unreserved character set ([RFC3986] section 2.3) MUST be encoded.
## unreserved = ALPHA, DIGIT, '-', '.', '_', '~'
## cf. http://oauth.net/core/1.0/#encoding_parameters
encodeURI <- function(URI, ...) {
  if (!is.character(URI)) {
    URI
  } else {
    OK <- "[^-A-Za-z0-9_.~]"
    x <- strsplit(URI, "")[[1L]]
    z <- grep(OK, x)
    if (length(z)) {
      y <- sapply(x[z], function(x) paste("%", toupper(as.character(charToRaw(x))),
                                          sep = "", collapse = ""))
      x[z] <- y
    }
    paste(x, collapse = "")
  }
}

## We escape the values of the parameters in a special way that escapes
## the resulting % prefix in the escaped characters, e.g. %20 becomes
## %2520 as %25 is the escape for %.
## cf. http://tools.ietf.org/html/rfc5849#section-3.4.1.3.2
normalizeParams <- function(params, escapeFun) {
  names(params) <- sapply(names(params), escapeFun, post.amp = TRUE)
  params <- sapply(params, escapeFun, post.amp = TRUE)
  ## If two or more parameters share the same name, they are sorted by their value.
  params <- params[order(names(params), params)]
  return(paste(names(params), params, sep = "=", collapse = "&"))
}

## From Ozaki Toru's code at https://gist.github.com/586468
signWithHMAC <- function(key, data) {
  blockSize <- 64
  hashlength <- 20
  innerpad <- rawToBits(as.raw(rep(0x36, blockSize)))
  outerpad <- rawToBits(as.raw(rep(0x5C, blockSize)))
  zero <- rep(0, 64)
  HexdigestToDigest <- function(digest) {
    as.raw(strtoi(substring(digest, (1:hashlength) * 2 - 1,
                            (1:hashlength) * 2), base = 16))
  }
  mac <- function(pad, text) {
    HexdigestToDigest(digest(append(packBits(xor(key, pad)), text),
                             algo = 'sha1', serialize = FALSE))
  }
  if (nchar(key) >= 64) {
    keyDigested <- digest(key, algo = "sha1", serialize = FALSE)
    key <- intToUtf8(strtoi(HexdigestToDigest(keyDigested), base = 16))
  }
  key <- rawToBits(as.raw(append(utf8ToInt(key), zero)[1:blockSize]))
  base64(mac(outerpad, mac(innerpad, charToRaw(data))))[1]
}

## Sign a request made up of the URL, the parameters as a named character
## vector, the consumer key and secret, and the token and token secret.
signRequest <- function(uri, consumerKey, consumerSecret, params = character(),
                        oauthKey = "", oauthSecret = "", httpMethod = "GET",
                        nonce = genNonce(),
                        timestamp = Sys.time()) {
  httpMethod <- toupper(httpMethod)
  params["oauth_nonce"] <- nonce
  params["oauth_timestamp"] <- as.integer(timestamp)
  params["oauth_consumer_key"] <- consumerKey
  params["oauth_signature_method"] <- 'HMAC-SHA1'
  params["oauth_version"] <- '1.0'
  if (oauthKey != "") params["oauth_token"] <- oauthKey
  odat <- paste(
    encodeURI(httpMethod), encodeURI(uri),
    encodeURI(normalizeParams(params, encodeURI), post.amp = TRUE),
    sep = "&"
  )
  okey <- encodeURI(consumerSecret)
  if (oauthSecret != "") okey <- paste(okey, encodeURI(oauthSecret), sep = "&")
  params["oauth_signature"] <- signWithHMAC(okey, odat)
  return(params)
}
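For instance, signing a bare GET against the endpoint produces the set of oauth_* parameters, including the HMAC-SHA1 signature (the key and secret below are placeholders); the wrapper below then turns that set into an Authorization header:
# Placeholder credentials; signRequest() returns a named vector of oauth_* parameters.
params <- signRequest("https://openpaths.cc/api/1",
                      consumerKey = "my_access_key",
                      consumerSecret = "my_secret_key")
params["oauth_signature"]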
Now this function tries to replicate the example at the openpaths website:
openpaths <- function(
    access_key = getOption("openpaths.access_key"),
    secret_key = getOption("openpaths.secret_key"),
    curl = getCurlHandle()) {
  uri <- 'https://openpaths.cc/api/1'
  params <- signRequest(uri, consumerKey = access_key, consumerSecret = secret_key)
  oa_header <- paste(names(params), params, sep = "=", collapse = ",")
  ret <- getURL(
    uri,
    curl = curl,
    .opts = list(
      header = TRUE,
      verbose = TRUE,
      httpheader = c(Authorization = paste("OAuth ", oa_header, sep = "")),
      ssl.verifypeer = TRUE,
      ssl.verifyhost = TRUE,
      cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl")
    )
  )
  return(ret)
}
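Hypothetical call (the option values are placeholders for your own access key and secret):
# Store the credentials in options so openpaths() picks them up by default,
# then issue the signed GET request against the API endpoint.
options(openpaths.access_key = "my_access_key",
        openpaths.secret_key = "my_secret_key")
openpaths()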
I've made some progress on this problem, although it's challenging due to the flakiness of the site and the custom OAuth process that they're using. First you'll need to install the development version of httr, which exports some previously internal functions.
devtools::install_github("hadley/httr")
OpenPaths is unusual in that the app secret and key are the same as the token and token secret. This means we need to write a custom auth header:
library(httr)
app <- oauth_app("OpenPaths", "JSLEKAPZIMFVFROHBDT4KNBVSI")
#> Using secret stored in environment variable OPENPATHS_CONSUMER_SECRET

# Implement custom header for 2-leg authentication, and oauth_body_hash
auth_header <- function(url, method = "GET") {
  oauth_signature(url, method, app, app$key, app$secret,
                  # Use sha1 of empty string since http request body is empty
                  body_hash = "da39a3ee5e6b4b0d3255bfef95601890afd80709")
}
Then you can use this to sign your request. This is currently failing for me because the site seems to be down (again).
url <- "https://openpaths.cc/api/1"
r <- GET(url, oauth_header(auth_header(url)))
stop_for_status(r)
content(r)
