Installing pdftotext on Windows (for use with R, 'tm' package) - r

I am having trouble using R, 'tm' package, to read in .pdf files.
Specifically, I try to run the following code:
library(tm)
filename = "myfile.pdf"
tmp1 <- readPDF(PdftotextOptions="-layout")
doc <- tmp1(elem=list(uri=filename),language="en",id="id1")
doc[1:15]
...which gives me the error:
Error in readPDF(PdftotextOptions = "-layout") :
unused argument (PdftotextOptions = "-layout")
I assume this is due to the fact that the pdftotext program (part of xpdf, http://www.foolabs.com/xpdf/download.html) has not been installed correctly on my machine, so that R cannot access it.
What are the steps to install xpdf/pdftotext correctly such that the above R code can be executed? (I am aware of similar questions already posted, however they don't address the same issue)

PdftotextOptions is not a parameter of readPDF. readPDF has a control parameter, which expects a list. So the correct use would be:
if(all(file.exists(Sys.which(c("pdfinfo", "pdftotext"))))) {
tmp1 <- readPDF(control = list(text = "-layout"))
doc <- tmp1(elem=list(uri=filename),language="en",id="id1")
}
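Before calling readPDF, it can help to confirm that R actually sees the xpdf tools. The following is a minimal sketch using only base R; it checks the same two executables as the guard above and reports which ones are missing from the search path.

```r
# Look up the xpdf command-line tools on the PATH that R sees.
tools <- Sys.which(c("pdfinfo", "pdftotext"))

# Sys.which() returns "" for executables it cannot find.
found <- nzchar(tools)

if (all(found)) {
  message("Both tools found: ", paste(tools, collapse = ", "))
} else {
  message("Not found on PATH: ", paste(names(tools)[!found], collapse = ", "))
}
```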

Set the working directory to the xpdf bin64 folder:
setwd('C:/xpdf/bin64')
It works for me.
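An alternative that avoids changing the working directory is to append the xpdf folder to PATH for the current R session only. This is a sketch, assuming the binaries live in C:/xpdf/bin64 as in the answer above; adjust xpdf_dir to your own install location.

```r
# Assumed install location of the xpdf binaries (taken from the answer above).
xpdf_dir <- "C:/xpdf/bin64"

# Append the folder to PATH for this R session only; nothing is changed
# system-wide, so no administrator rights are needed.
old_path <- Sys.getenv("PATH")
Sys.setenv(PATH = paste(old_path, xpdf_dir, sep = .Platform$path.sep))

# After this, Sys.which("pdftotext") should return a non-empty path
# if the executable really is in xpdf_dir.
Sys.which("pdftotext")
```

Because the change only lasts for the session, it would have to be repeated (for example at the top of the script) each time R is restarted.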

Related

How to export sf object to GDB using RPyGeo in R (Windows)?

I have a bunch of sf objects I'd like to export to GDB from R. I'm running R 4.0.2 on Windows 10. In this case the sf objects are all vector point data. The main reasons to export to GDB are to keep longer field names (the shapefile truncation is very annoying), and because GDBs are more desirable storage locations for our workflows.
Yes, I know about the ArcGisBinding package. I've got it to work in a test script but it's pretty unstable, often crashing and requiring a restart of R. This is a problem, because the sf objects I'd like to export come after an already long Rmd that reads in, formats and cleans the data. So it's not a simple matter of re-running the script until arc.write doesn't break. I could break up the script, but then I'd still have to read in a bunch of shapefiles. One option I haven't yet explored is using reticulate to call a Python script instead of trying to do everything in R, but we're trying to do our analysis all in one place, if possible.
I'm pretty sure I've managed to set up RPyGeo appropriately, first setting my Python path using the reticulate package. I'm doing it this way because IT restrictions mean I can't edit PATH variables on my machine.
#package calls
library(sf)
library(spData)
library(reticulate)
#set python version in reticulate
py_path <- "C:/Program Files/ArcGIS/Pro/bin/Python/envs/arcgispro-py3/python.exe"
reticulate::use_python(python = py_path, required = TRUE)
#call RPyGeo
library(RPyGeo) # for potential point export
#output gdb
out.gdb <- "C:/LOCAL_PROJECTS/Output/Output.gdb"
#RPyGeo Parameters
# Note that, in order to use RPyGeo you need a working ArcMap or ArcGIS Pro installation on your computer.
# python path - note that this will change depending on which version of Arc one is using
# py_path <- "C:/Program Files/ArcGIS/Pro/bin/Python/envs/arcgispro-py3/python.exe"
arcpy <- rpygeo_build_env(workspace = out.gdb,
overwrite = TRUE,
extensions = c("Spatial","DataInteroperability"),
path = py_path)
I've tried a bunch of different tools to export an sf object, here using dummy data also used in the RPyGeo vignette:
data(nz, package = "spData")
arcpy$Copy_management(in_data = nz,out_data = "nz_test")
arcpy$Copy_management(in_data = nz,out_data = file.path(out.gdb,"nz"))
arcpy$FeatureClassToGeodatabase_conversion(Input_Features = nz,Output_Geodatabase = out.gdb)
arcpy$FeatureClassToFeatureClass_conversion(in_features = nz,out_path = out.gdb,out_name = "nz")
arcpy$QuickExport_interop(Input = nz,Output = file.path(out.gdb,"nz"))
arcpy$CopyFeatures_management(in_features = nz,out_feature_class = file.path(out.gdb,"nz"))
arcpy$CopyFeatures_management(in_features = nz,out_feature_class = "nz")
Each time I get an error, for example:
Error in py_call_impl(callable, dots$args, dots$keywords) :
RuntimeError: Object: Error in executing tool
Detailed traceback:
File "C:\Program Files\ArcGIS\Pro\Resources\ArcPy\arcpy\management.py", line 3232, in CopyFeatures
raise e
File "C:\Program Files\ArcGIS\Pro\Resources\ArcPy\arcpy\management.py", line 3229, in CopyFeatures
retval = convertArcObjectToPythonObject(gp.CopyFeatures_management(*gp_fixargs((in_features, out_feature_class, config_keyword, spatial_grid_1, spatial_grid_2, spatial_grid_3), True)))
File "C:\Program Files\ArcGIS\Pro\Resources\ArcPy\arcpy\geoprocessing\_base.py", line 511, in <lambda>
return lambda *args: val(*gp_fixargs(args, True))
I'm not an expert in ArcPy by any means. Nor am I an expert in tracing errors inside packages. Am I making a simple syntax mistake? Is there something else that I'm missing? Any help would be much appreciated!

Error when exporting R data frame using openxlsx ("Error in zipr")

Usually I'm using the openxlsx package and the write.xlsx function when exporting R data frames into .xlsx-files. Since yesterday - probably after I was using the package XLConnect - something got messed up and the write.xlsx function doesn't work anymore. This is the error I get:
Error in zipr(zipfile = tmpFile, include_directories = FALSE, files = list.files(path = tmpDir, :
unused argument (include_directories = FALSE)
Unfortunately, I don't understand what this error means. Thanks for any helpful advice.
Edit: The function works when I use an older openxlsx version (4.1.0).
I was getting the same error.
I think the problem is with dependencies of openxlsx. There is a "zipR" package that might be picked up when you install openxlsx, while the actual dependency is zip package:
https://cran.r-project.org/web/packages/zip/index.html
https://cran.r-project.org/web/packages/zipR/zipR.pdf
I installed "zip" along with openxlsx and I don't get the error anymore.
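One way to check this, sketched under the assumption that the error comes from the zip package that openxlsx calls: see whether zip is installed and whether its zipr function accepts the include_directories argument named in the error message.

```r
# Is the 'zip' package installed at all?
has_zip <- requireNamespace("zip", quietly = TRUE)

if (has_zip) {
  # Report the installed version and whether zipr() knows the argument
  # that the error message complains about.
  print(packageVersion("zip"))
  print("include_directories" %in% names(formals(zip::zipr)))
} else {
  message("Package 'zip' is not installed; try install.packages(\"zip\").")
}
```

If the second value printed is FALSE, updating the zip package is a reasonable next step.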
I do not really understand the error message here. My computer does not allow me to save files to "c:/". So, if I remove the "c:/" part, it works fine and saves the file to the current working directory.
library(openxlsx)
df <- data.frame('x' = c(1,2,3),
'y' = c(3,2,1))
openxlsx::write.xlsx(df, "test.xlsx")
You could also try another package, writexl:
writexl::write_xlsx(df, "text5.xlsx")
This works on my machine.
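If the problem is write permission on the target folder, it can be checked before writing. A minimal base R sketch; target_dir is a placeholder (here the session's temporary directory) and should be replaced by the folder you want to write to. Note that file.access may not be fully reliable on some Windows setups.

```r
target_dir <- tempdir()  # placeholder: replace with your output folder

# file.access() returns 0 when the requested access (mode = 2: write) is allowed.
writable <- unname(file.access(target_dir, mode = 2) == 0)

if (writable) {
  out_file <- file.path(target_dir, "test.xlsx")
  message("OK to write: ", out_file)
} else {
  message("No write permission for: ", target_dir)
}
```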

Can I get the URL of what will be used by install.packages?

When running install.packages("any_package") on Windows I get the message:
trying URL
'somepath.zip'
I would like to get this path without downloading. Is that possible?
In other terms I'd like to get the CRAN link to the windows binary of the latest release (the best would actually be to be able to call a new function with the same parameters as install.packages and get the proper url(s) as an output).
I would need a way that works from the R console (no manual checking of the CRAN page etc).
I am not sure if this is what you are looking for. This builds the URL from the repository information and builds the file name from the list of available packages.
#get repository name
repos<- getOption("repos")
#Get url for the binary package
#contrib.url(repos, "both")
contriburl<-contrib.url(repos, "binary")
#"https://mirrors.nics.utk.edu/cran/bin/windows/contrib/3.5"
#make data.frame of available packages
df<-as.data.frame(available.packages())
#find package of interest
pkg <- "tidyr" #example
#ofinterest<-grep(pkg, df$Package)
ofinterest<-match(pkg, df$Package) #returns a single value
#assemble name, assumes it is always a zip file
name<-paste0(df[ofinterest,]$Package, "_", df[ofinterest,]$Version, ".zip")
#make final URL
finalurl<-paste0(contriburl, "/", name)
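The same assembly step can be wrapped in a small helper. This is a sketch: build_binary_url is a hypothetical name, and the example values are copied from the sample output shown elsewhere on this page rather than fetched from a repository.

```r
# Hypothetical helper: join a contrib URL, package name and version
# into the URL of a Windows binary (assumes a .zip file, as above).
build_binary_url <- function(contriburl, pkg, version, ext = ".zip") {
  paste0(contriburl, "/", pkg, "_", version, ext)
}

example_url <- build_binary_url(
  "https://lib.ugent.be/CRAN/bin/windows/contrib/3.5",
  "data.table",
  "1.11.4"
)
print(example_url)
```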
Here are a couple of functions which, respectively:
get the latest R version from RStudio's website
get the url of the last released windows binary
The first is a variation of code I found in the installr package. It seems there's no clean way of getting the last version, so we have to scrape a webpage.
The second is really just @Dave2e's code optimized and refactored into a function (with a fix for outdated R versions), so please direct upvotes to his answer.
get_package_url <- function(pkg){
version <- try(
available.packages()[pkg,"Version"],
silent = TRUE)
if(inherits(version,"try-error"))
stop("Package '",pkg,"' is not available")
contriburl <- contrib.url(getOption("repos"), "binary")
url <- file.path(
dirname(contriburl),
get_last_R_version(2),
paste0(pkg,"_",version,".zip"))
url
}
get_last_R_version <- function(n=3){
page <- readLines(
"https://cran.rstudio.com/bin/windows/base/",
warn = FALSE)
line <- grep("R-[0-9.]+.+-win\\.exe", page,value=TRUE)
long <- gsub("^.*?R-([0-9.]+.+)-win\\.exe.*$","\\1",line)
paste(strsplit(long,"\\.")[[1]][1:n], collapse=".")
}
get_package_url("data.table")
# on my system with R 3.3.1
# [1] "https://lib.ugent.be/CRAN/bin/windows/contrib/3.5/data.table_1.11.4.zip"
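The version handling in get_package_url boils down to replacing the last component of the contrib URL with a major.minor version string. A small offline sketch of that step; major_minor is a hypothetical helper and the version string is a made-up input.

```r
# Hypothetical helper: keep only the first two components of a version string.
major_minor <- function(v) {
  paste(strsplit(v, ".", fixed = TRUE)[[1]][1:2], collapse = ".")
}

# Swap the version directory at the end of a contrib URL.
contriburl <- "https://lib.ugent.be/CRAN/bin/windows/contrib/3.3"
new_contrib <- file.path(dirname(contriburl), major_minor("3.5.1"))
print(new_contrib)
```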

TreeTagger in R

I have downloaded TreeTaggerv3.2 for Windows and have configured it per the install.txt. I am trying to use it in R with koRpus package. I have set the kRp.env as -
set.kRp.env(TT.cmd="C:\\TreeTagger\\bin\\tag-english.bat", lang="en",
preset="en", treetagger="manual", format="file",
TT.tknz=TRUE, encoding="UTF-8" )
My data to be tagged is in a file, and I am trying to tag it with treetag("myfile.txt"), but it throws the error:
Error in matrix(unlist(strsplit(tagged.text, "\t")), ncol = 3, byrow = TRUE, :
'data' must be of a vector type, was 'NULL'
In addition: Warning message:
running command 'C:\windows\system32\cmd.exe /c C:\TreeTagger\bin\tag-english.bat
C:\Users\vivsingh\Desktop\NLP\tree_tag_ex.txt' had status 255
The standalone TreeTagger is working on my Windows machine. Any idea on how to make it work from R?
I had the exact same error and warning while trying lemmatization on R word vector following Bernhard Learns blog using windows 7 and R 3.4.1 (x64). The issue was also appearing using textstem package but TreeTagger was running properly in cmd window.
I mixed several answers I found on this post, and here are my steps and the code that runs properly:
Go into the R win-library (~\Documents\R\win-library\3.4\rJava\jri\x64\jri.dll) and copy jri.dll (thanks kravi!) to replace the one in the parent folder.
Close and restart R.
library(koRpus)
set.kRp.env(TT.cmd="C:\\TreeTagger\\bin\\tag-english.bat", lang="en", preset="en", treetagger="manual", format="file", TT.tknz=TRUE, encoding="UTF-8")
lemma_tagged <- treetag(lemma_unique$word_clean, treetagger="manual", format="obj", TT.tknz=FALSE , lang="en", TT.options=list(path="c:/TreeTagger", preset="en"))
lemma_tagged_tbl <- tbl_df(lemma_tagged@TT.res)
Hope it helps.
I am posting this answer to keep a record. I also faced the same issue, due to an incorrect location of jri.dll on a 64-bit processor and Windows 8.1. If we call
set.kRp.env(TT.cmd="manual", lang="en", TT.options=list(path="/path/to/tree-tagger-windows-x.x/TreeTagger", preset="en"))
and follow either of the following two steps, we can resolve this error:
If we installed only the 64-bit version of R, specify the proper paths for these variables:
LD_LIBRARY_PATH = /path/to/rJava/jri
JAVA_HOME = /path/to/jdk1.x.x
java.library.path = /path/to/rJava/jri/jri.dll
CLASSPATH = /path/to/rJava/jri
If we already installed both the 32-bit and 64-bit versions of R on the computer, just copy jri.dll from /path/to/rJava/jri/x64/jri.dll and replace the file at /path/to/rJava/jri/jri.dll. Further, we need to set the paths of the four variables mentioned above.
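To see what R currently has for these settings, the environment variables can be inspected from the console. A minimal sketch; note that java.library.path is a Java system property rather than an environment variable, so it is not covered here.

```r
vars <- c("LD_LIBRARY_PATH", "JAVA_HOME", "CLASSPATH")

# unset = NA makes variables that are not defined show up as NA
# instead of an empty string.
vals <- Sys.getenv(vars, unset = NA)
print(vals)

# Names of the variables that still need to be set.
missing_vars <- vars[is.na(vals)]
print(missing_vars)
```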
I ran into this issue (a very similar one, I guess) and posted a query on GitHub.
https://github.com/unDocUMeantIt/koRpus/issues/7
The current working solution for me for this case was easier than I could expect, just downgrading the koRpus package. This can change with time but this version should remain appropriate.
library("devtools")
install_github("unDocUMeantIt/koRpus", ref="0.06-5")
They said this package is not Java-related.
You can face the same error while setting up the koRpus environment and getting the result from TreeTagger. For example, when you use:
tagged.text <- treetag(
"C:/temp/sample_text.txt",
treetagger = "manual",
lang = "en",
TT.options = list(
path = "c:/Treetagger",
preset = "en"
),
doc_id = "sample"
)
you would receive a similar error:
Error: Awww, this should not happen: TreeTagger didn't return any useful data.
This can happen if the local TreeTagger setup is incomplete or different from what presets expected.
You should re-run your command with the option 'debug=TRUE'. That will print all relevant configuration.
Look for a line starting with 'sys.tt.call:' and try to execute the full command following it in a command line terminal. Do not close this R session in the meantime, as 'debug=TRUE' will keep temporary files that might be needed.
If running the command after 'sys.tt.call:' does fail, you'll need to fix the TreeTagger setup.
If it does not fail but produce a table with proper results, please contact the author!
Here you need to change the value of treetagger, from
treetagger = "manual"
to
treetagger = "kRp.env"
However, before that, remember to set the kRp.env as @Xochitl C. suggested in their answer:
set.kRp.env(TT.cmd="C:\\TreeTagger\\bin\\tag-english.bat", lang="en", preset="en", treetagger="manual", format="file", TT.tknz=TRUE, encoding="UTF-8")
Once you do this, you'll get the desired result.
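As the error message suggests, it is worth running the TreeTagger command outside of koRpus to see whether it fails on its own. A sketch using base R's system2; the batch file and input paths are the ones used in the examples above and are assumptions about your setup.

```r
tt_cmd  <- "C:/TreeTagger/bin/tag-english.bat"  # assumed TreeTagger wrapper script
in_file <- "C:/temp/sample_text.txt"            # assumed input file

if (file.exists(tt_cmd) && file.exists(in_file)) {
  # Capture both output streams; a non-zero exit code is stored in the
  # "status" attribute of the result.
  out <- system2(tt_cmd, args = shQuote(in_file), stdout = TRUE, stderr = TRUE)
  status <- attr(out, "status")
  if (is.null(status)) status <- 0L
  message("Exit status: ", status)
  print(head(out))
} else {
  message("Check the paths: TreeTagger script or input file not found.")
}
```

A non-zero exit status here (such as the 255 reported in the question) would point to the TreeTagger setup itself rather than to koRpus.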

issue with get_rollit_source

I tried to use get_rollit_source from the RcppRoll package as follows:
library(RcppRoll)
get_rollit_source(roll_max,edit=TRUE,RStudio=TRUE)
I get an error:
Error in get("outFile", envir = environment(fun)) :
object 'outFile' not found
I tried
outFile="C:/myDir/Test.cpp"
get_rollit_source(roll_max,edit=TRUE,RStudio=FALSE,outFile=outFile)
I get an error:
Error in get_rollit_source(roll_max, edit = TRUE, RStudio = FALSE, outFile = outFile) :
File does not exist!
How can I fix this issue?
I noticed that the RcppRoll folder in the R library doesn't contain any src directory. Should I download it?
get_rollit_source only works for 'custom' functions. For things baked into the package, you could just download + read the source code (you can download the source tarball here, or go to the GitHub repo).
Anyway, something like the following should work:
rolling_sqsum <- rollit(final_trans = "x * x")
get_rollit_source(rolling_sqsum)
(I wrote this package quite a while back when I was still learning R / Rcpp so there are definitely some rough edges...)
