Encoding problem when your package contains functions with non-english characters - r

I am building my own package, and I keep running into encoding issues because the functions in my package have non-English (non-ASCII) characters.
Korean characters are inherently part of many of the functions in my package. A sample function:
library(rvest)
sampleprob <- function(url) {
  # sample url: "http://dart.fss.or.kr/dsaf001/main.do?rcpNo=20200330003851"
  result <- grepl("연결재무제표 주석", html_text(read_html(url)))
  return(result)
}
However, when installing the package I run into encoding problems.
I created a sample package (https://github.com/hyk0127/KorEncod/) with just one function (what is shown above) and uploaded it onto my github page for a reproducible example. I run the following code to install:
library(devtools)
install_github("hyk0127/KorEncod")
Below is the error message that I see:
Error : (converted from warning) unable to re-encode 'hello.R' line 7
ERROR: unable to collate and parse R files for package 'KorEncod'
* removing 'C:/Users/myname/Documents/R/win-library/3.6/KorEncod'
* restoring previous 'C:/Users/myname/Documents/R/win-library/3.6/KorEncod'
Error: Failed to install 'KorEncod' from GitHub:
(converted from warning) installation of package ‘C:/Users/myname/AppData/Local/Temp/RtmpmS5ZOe/file48c02d205c44/KorEncod_0.1.0.tar.gz’ had non-zero exit status
The error message about line 7 refers to the Korean characters in the function.
It is possible to locally install the package with tar.gz file, but then the function does not run as intended, because the Korean characters are recognized in broken encoding.
This cannot be the first time that someone has tried building a package that has non-english (or non-ASCII) characters, and yet I couldn't find a solution to this. Any help will be deeply appreciated.
A few pieces of info that I think are related:
Currently the DESCRIPTION file specifies "Encoding: UTF-8".
I have used Sys.setlocale() to set the locale to Korean and back, to no avail.
I have added @encoding UTF-8 to the function's documentation, to no avail as well.
I am currently using Windows where the administrative language is set to English. I have tried using a different laptop with Windows & administrative language set to Korean, and the same problem appears.

The key trick is replacing the non-ASCII characters with their Unicode escape sequences (the \uxxxx encoding).
These can be generated with the stringi::stri_escape_unicode() function.
Note that to pass R CMD check you will need to get rid of the Korean characters in your code completely. That means a manual round trip for every R script included in the package: copy the string, re-encode it via {stringi} on the command line, and paste the result back.
I am not aware of an available automated solution for this problem.
For the specific example provided, the escaped version would read like this:
sampleprob <- function(url) {
  # stringi::stri_escape_unicode("연결재무제표 주석") to get the \uxxxx codes
  result <- grepl("\uc5f0\uacb0\uc7ac\ubb34\uc81c\ud45c \uc8fc\uc11d",
                  rvest::html_text(xml2::read_html(url)))
  return(result)
}
sampleprob("http://dart.fss.or.kr/dsaf001/main.do?rcpNo=20200330003851")
[1] TRUE
This will be a hassle, but it seems to be the only way to make your code platform neutral (which is a key CRAN requirement, and thus subject to R CMD check).
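The copy, escape, and paste round trip can be partly scripted. Below is a minimal base-R sketch, not part of the original answer: the helper name escape_non_ascii is hypothetical, and it only aims to mimic what stringi::stri_escape_unicode() does for non-ASCII characters in a single string.

```r
# Hypothetical helper (an assumption for illustration, not a stringi function):
# rewrite each non-ASCII character of one string as a \uXXXX (or \UXXXXXXXX)
# escape, leaving ASCII characters untouched.
escape_non_ascii <- function(x) {
  stopifnot(is.character(x), length(x) == 1L)
  codes <- utf8ToInt(enc2utf8(x))        # Unicode code points of the string
  pieces <- vapply(codes, function(cp) {
    if (cp < 128L) {
      intToUtf8(cp)                      # plain ASCII: keep as is
    } else if (cp <= 0xFFFF) {
      sprintf("\\u%04x", cp)             # Basic Multilingual Plane
    } else {
      sprintf("\\U%08x", cp)             # code points beyond the BMP
    }
  }, character(1))
  paste(pieces, collapse = "")
}

# The Korean search string from the question, written with escapes so that
# this snippet itself stays ASCII-only.
pattern <- "\uc5f0\uacb0\uc7ac\ubb34\uc81c\ud45c \uc8fc\uc11d"
escaped <- escape_non_ascii(pattern)
cat(escaped, "\n")
```

The printed text can be pasted between quotes in the package source. Applying the helper to whole files would still need care, since only the contents of string literals should be escaped.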

Adding this for future reference (for those facing similar problems): you can also solve this problem by saving the non-ASCII characters in a data file, then loading the value and using it.
First save the characters as a data file (this uses the standard package folder names and the roxygen2 package):
# In your package, save as a separate file within ./data-raw
kor_chrs <- list(sampleprob = "연결재무제표 주석")
usethis::use_data(kor_chrs)
Then load the data in your functions and use it.
# This is your R file for the function, within the ./R folder
#' @importFrom rvest html_text
#' @importFrom xml2 read_html
#' @export
sampleprob <- function(url) {
  # sample url: "http://dart.fss.or.kr/dsaf001/main.do?rcpNo=20200330003851"
  result <- grepl(kor_chrs$sampleprob[1], html_text(read_html(url)))
  return(result)
}
Yes, this is still a workaround, but it runs on Windows machines without any trouble.
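One plausible reason the data-file approach works is that a saved R object carries its strings together with their encoding mark, so nothing has to be re-encoded when the package sources are parsed. A minimal sketch of that round trip, assuming plain save() and load() on a temporary file stand in for what usethis::use_data() and package loading do:

```r
# The string is written with \u escapes only to keep this snippet ASCII.
kor_chrs <- list(sampleprob = "\uc5f0\uacb0\uc7ac\ubb34\uc81c\ud45c \uc8fc\uc11d")

# Save to a temporary .rda file (a stand-in for the package's data folder).
rda_path <- tempfile(fileext = ".rda")
save(kor_chrs, file = rda_path)

# Load into a fresh environment, as if in a new session.
loaded <- new.env()
load(rda_path, envir = loaded)

# Compare the restored string with the original and inspect its encoding mark.
same_string <- identical(loaded$kor_chrs$sampleprob, kor_chrs$sampleprob)
mark <- Encoding(loaded$kor_chrs$sampleprob)
print(same_string)
print(mark)
```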

Related

Producing a PDF from a single R function

Imagine you define a single R function to share with a pal. In case you later decide to include this function in a package, you document it using roxygen comments and tags (e.g. #' @name my_function). Is it possible to produce a PDF from this single R file? If yes, how?
1) We will use the file lc.R as an example which we first download from github. First use kitten to create the boilerplate for a package. Copy lc.R to it. Then run document from devtools to roxygenize it and finally use Rd2pdf to create the pdf, lc.pdf .
library(devtools)
library(pkgKitten)
library(roxygen2)
# set up lc in lc.R to use as a test example
u <- "https://raw.githubusercontent.com/mailund/lc/master/R/lc.R"
download.file(u, "./lc.R")
# create package containing lc.R - ignore any NAMESPACE warnings
kitten("lc")
file.copy("lc.R", "./lc/R")
# roxygenize it generating an Rd file
document("lc")
file.copy("lc/man/lc.Rd", ".")
# convert Rd file to pdf
R <- file.path(R.home("bin"), "R")
cmd <- paste(R, "CMD Rd2pdf lc.Rd")
system(cmd, wait = FALSE)
2) There used to be a package on CRAN named document (or see gitlab) which does the same thing in one step, but it was removed last year. Note that the document package depends on the fritools (or see gitlab) package, which was also removed. The source of both is archived on CRAN and on gitlab, and it may be possible to build them yourself.
3) This approach does not create a PDF but it does allow one to view formatted help for a script converting it from the roxygen2 markup to HTML showing it in the browser. Note that the box package should not be attached, i.e. do not use a library(box) statement. Assume that lc.R is in the current directory -- see the download.file statement in (1) above. The code below may generate warnings or errors but it still works to bring up the help for the lc function in lc.R showing it in the default browser.
box::use(./lc)
box::help(lc$lc)

Custom .traineddata file usage in the tesseract in R

I have a bunch of .JPGs with some text at the bottom, which consists mostly (but not exclusively) of numbers. I wish to use the tesseract package in R to 'read' the text in those .JPGs. Unfortunately, the base tesseract language proved too inaccurate to be worth using. I then tried using the magick package to adjust the pictures (crop, resize, convert, etc.), hoping to get a better reading from tesseract, but this failed to produce satisfactory results.
I eventually managed to use the description on this link (https://towardsdatascience.com/simple-ocr-with-tesseract-a4341e4564b6) to create a new custom language in Tesseract 4.1.1 (as downloaded from https://github.com/tesseract-ocr/tesseract), which I named font_name.traineddata. The custom-made font_name.traineddata works perfectly on the Tesseract 4.1.1 console and shows significant improvement in results on the base language.
The question I have is: how do I get the font_name.traineddata file to be used by the ocr command in R? I have tried the simple solution of just pasting the font_name.traineddata file into the appropriate tessdata folder of the tesseract package (the same folder that also contains the standard English data file called eng.traineddata) and then trying the following:
font_name <- tesseract("font_name")
ocr("C:/1.jpg", engine = font_name)
This does not work and gives the error:
Error in tesseract_engine_internal(datapath, language, configs, opt_names, :
Unable to find training data for: font_name. Please consult manual for: ?tesseract_download
tesseract_download seems to be of no use, as it is a helper function to download training data from the official tessdata repository. I have also tried renaming the file to a three character name, with the same error.
Does anybody have any suggestions on how to make custom .traineddata files work with ocr in R?

Creating R package using code from script file

I’ve written some R functions and dropped them into a script file using RStudio. These are bits of code that I use over and over, so I’m wondering how I might most easily create an R package out of them (for my own private use).
I’ve read various “how to” guides online but they’re quite complicated. Can anyone suggest an “idiot’s guide” to doing this please?
I've been involved in creating R packages recently, so I can help you with that. Before proceeding to the steps to be followed, there are some pre-requisites, which include:
RStudio
devtools package (for most of the functions involved in creation of a package)
roxygen2 package (for roxygen documentation)
In case you don't have the aforementioned packages, you can install them with these commands respectively:
install.packages("devtools")
install.packages("roxygen2")
Steps:
(1) Load devtools in RStudio by using library(devtools).
(devtools is a core package whose tools make creating R packages easier)
(2) Create your package by using:
create_package("~/directory/package_name") for a custom directory.
or
create_package("package_name") if you want your package to be created in the current working directory.
(3) Soon after you execute this function, a new RStudio session will open. In the old session you will see some auto-generated lines, which basically report that R is creating a new package with the required components in the specified directory.
After this, we are done with the old instance of RStudio. We will continue our work in the new RStudio session window.
At this point the package creation part is already over (yes, that simple). However, a package isn't functional just by being created. Including a function in it requires some additional pieces, such as documentation (where the function's title, parameters, return value, examples, etc. are given using @param, @return and so on; you will be familiar with this if you have seen roxygen documentation in GitHub repositories) and R CMD check runs to get it working.
I'll get to that in the subsequent steps, but just in case you want to verify that your package is created, you can look at:
The top right corner of the new RStudio session, where you can see the package name that you created.
The console, where you will see that R created a new directory/folder in the path that we specified in create_package() function.
The files panel of RStudio session, where you'll notice a bunch of new files and directories within your directory.
(4) As you mentioned, you keep your functions in a script file, so you will first need to create a script in the package, which can be done using:
use_r("function_name")
A new R script will pop up in your working session, ready to be used.
Now go ahead and write your function(s) in it.
(5) After you're done, you need to load the function(s) you have written for your package. This is accomplished with the devtools::load_all() function.
When you execute load_all() in the console, you'll know that the functions have been loaded when you see Loading package_name displayed in the console.
You can try calling your functions after that in the console to verify that they work as a part of the package.
(6) Now that your function has been written and loaded into your package, it is time to move onto checks. It is a good practice to check the whole package as we make changes to our package. The function devtools::check() offers an easy way to do this.
Try executing check() in the console. It will run a number of procedures that check your package for errors and warnings, and print details about each error, warning, or note as messages on the screen. The R CMD check results at the end summarize the errors and warnings you got, along with their counts.
If the functions in your package are written well (with additional package dependencies taken care of), check() will give you two warnings:
The first warning will be about the license that your package uses, which is not specified for a new package.
The second should be the one for documentation, warning us that our code is not documented.
To resolve the first issue, the license, use the use_mit_license("license_holder_name") command with your name in place of license_holder_name (or use any other license that suits your package; for private use, as you mentioned, it doesn't really matter what you specify if only you are going to use it and it is not going to be distributed).
This will add the License field to the DESCRIPTION file (in your files panel) and create additional files containing the license information.
You'll also need to edit the DESCRIPTION file, which has self-explanatory fields to fill in or edit. Here is an example of how it can look:
Package: Your_package_name
Title: Give a brief title
Version: 1.0.0.0
Authors@R:
    person(given = "Your_first_name",
           family = "Your_surname/family_name",
           role = c("aut", "cre"),
           email = "youremailaddress@gmail.com",
           comment = c(ORCID = "YOUR-ORCID-ID"))
Description: Give a brief description considering your package functionality.
License: will be updated with whatever license you provide, the above step will take care of this line.
Encoding: UTF-8
LazyData: true
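A DESCRIPTION file is plain DCF (Debian control file format), so base R can write and read it. The following is a minimal sketch for checking that the fields you filled in parse as intended. The field values (mypackage, the MIT license string) are placeholders assumed for illustration, and the file is written to a temporary directory rather than into a real package.

```r
# Placeholder field values, assumed for illustration only.
desc_fields <- c(
  Package = "mypackage",
  Title = "Give a Brief Title",
  Version = "0.1.0",
  Description = "Give a brief description of what the package does.",
  License = "MIT + file LICENSE",
  Encoding = "UTF-8"
)

# Write the fields as a one-record DCF file in a temporary directory.
desc_path <- file.path(tempdir(), "DESCRIPTION")
write.dcf(rbind(desc_fields), file = desc_path)

# Read the file back; read.dcf() returns a character matrix with one
# column per field.
desc <- read.dcf(desc_path)

# Check that the fields a package needs are present and the version parses.
required <- c("Package", "Title", "Version", "Description", "License")
missing_fields <- setdiff(required, colnames(desc))
pkg_version <- package_version(desc[1, "Version"])
print(missing_fields)
```

In a real package, the same read.dcf() call can be pointed at the DESCRIPTION file in the package root.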
To resolve the documentation warning, you'll need to document your function using roxygen documentation. An example:
#' @param a parameter one
#' @param b parameter two
#' @return sum of a and b
#' @export
#'
#' @examples
#' yourfunction(1, 2)
yourfunction <- function(a, b) {
  sum <- a + b
  return(sum)
}
Follow the roxygen syntax and add tags as you need. Some are optional, such as @title for specifying the title, while others, such as @import, are required if you're importing from packages other than base R.
After you're done documenting your function(s) using the roxygen skeleton, tell the package about the documentation by running devtools::document(). After you execute document(), run check() again to see whether you get any warnings. If you don't, you're good to go (and you won't if you follow the steps).
Lastly, you'll need to install the package for it to be accessible from R. Simply use the install() command from devtools. Unlike the install.packages() you used at the beginning, you don't need to name the package, as in install("package"), since you are currently working in a session where the package is loaded and ready to be installed. After a few lines of installation output you'll see a statement like "Done (package_name)", which indicates that the installation is complete.
Now you can try your function by first loading your package using library("package_name") and then calling your desired function from the package. That's it, congrats, you did it!
I've tried to include the procedure in a lucid way (the way I create my R packages), but if you have any doubts feel free to ask.

Error in R Tesseract

I have the R Tesseract package working with the default eng.traineddata under OSX, but it simply won't find other languages.
trial <- ocr("test.png", engine = tesseract(language = "jpn", datapath="/Users/histmr/Library/R/3.3/library/tesseract/tessdata"))
Generates the error:
Failed loading language 'jpn'
Tesseract couldn't load any languages!
Error in tesseract_engine_internal(datapath, language) :
Unable to find training data for: jpn
I've checked with
tesseract_info()
$datapath
[1] "/Users/histmr/Library/R/3.3/library/tesseract/tessdata/"
$available
[1] "eng" "jpn"
$version
[1] "3.05.00"
Sometimes I get references to a "TESSDATA_PREFIX environment variable" but I don't know where that is. How can I get the correct directory path (I can see the file in the directory) or edit the "TESSDATA_PREFIX environment variable"?
The problem seems to occur with Japanese but NOT French
tesseract_download("fra")
french <- tesseract("fra")
Works fine! But
tesseract_download("jpn")
japanese <- tesseract("jpn")
Generates an error
The error message Error in tesseract_engine_internal(datapath, language) means that the language file, in your case jpn.traineddata, is not available in TESSDATA_PREFIX, which is the default path for storing all the trained language data. If you haven't set the path, you can open a terminal and type the command below.
export TESSDATA_PREFIX=/Users/histmr/Library/R/3.3/library/tesseract/tessdata/
Hope this helps.
One possible problem is multiple installs of Tesseract (I used Homebrew and MacPorts) creating multiple TESSDATA folders. Strangely R was happier with a seemingly identical folder, but in a different place closer to root, ordinarily hidden under OSX. I got things working with
export TESSDATA_PREFIX=/opt/local/share
I hope this helps
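The same environment variable can also be set from inside R with base functions, which avoids depending on how R was launched. A minimal sketch, assuming the path from the question. Whether the tesseract engine picks the variable up may depend on the installed version and on setting it before the package is loaded, so treat it as something to try rather than a guaranteed fix.

```r
# Path taken from the question; adjust to wherever jpn.traineddata lives.
tessdata_dir <- "/Users/histmr/Library/R/3.3/library/tesseract/tessdata/"

# Equivalent of `export TESSDATA_PREFIX=...`, but scoped to this R session.
Sys.setenv(TESSDATA_PREFIX = tessdata_dir)

# Confirm the variable is set and check whether the language file is there.
prefix <- Sys.getenv("TESSDATA_PREFIX")
has_jpn <- file.exists(file.path(prefix, "jpn.traineddata"))
print(prefix)
print(has_jpn)
```

If has_jpn is FALSE, the training data is in a different folder than the one the variable points to, which matches the multiple-install situation described above.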

Where is the .R script file located on the PC?

I want to find the location of the script .R files which are used for computation in R.
I know that by typing the name of a function I will get the code that is running, and that I can then copy, edit, and save it as a new script file and use that.
The reason for asking to find the foo.R file is
Curiosity
Know what is the algorithm used in the numerical computations
More immediately, the function from the stats package that I am using runs for two of its argument values but not the others, and I have to figure out how to make it work.
The error shown by R implies that some modification might be required in the script file.
I am looking for a more general answer, if it's possible.
Edit: As per the comments so far, here is the code to compute spectrum of a time series using autoregressive methods. The data input is a univariate series.
x = ts(data)
spec.ar(x, method = "yule-walker")  # (1)
spec.ar(x, method = "burg")         # (2)
Command (1) runs OK.
Command (2) gives the following error:
Error in ar.burg.default(x, aic = aic, order.max = order.max, na.action = na.action, :
Burg's algorithm only implemented for univariate series
I did try specifying all the arguments correctly, e.g. na.action = na.fail, order.max = NULL, etc., but the message is the same.
Kindly suggest possible solutions.
P.S. (This question is posted after searching the library folder where R is installed and zip files which come with packages, manuals, and opening .rdb, .rdx files)
See FAQ 7.40 How do I access the source code for a function?
In most cases, typing the name of the function will print its source
code. However, code is sometimes hidden in a namespace, or compiled.
For a complete overview on how to access source code, see Uwe Ligges
(2006), “Help Desk: Accessing the sources”, R News, 6/4, 43–45
(http://cran.r-project.org/doc/Rnews/Rnews_2006-4.pdf).
When R installs a package, it evaluates all the ".R" source files and re-saves them into a binary format for faster loading. Therefore you typically cannot easily find the source file.
As has been suggested elsewhere, you can simply type the function name and see the source code, or download the source package and find the source there.
library(plyr)
ddply # prints the source for ddply
# See the content of the R directory for plyr,
# but it's only binary files:
dir(file.path(find.package("plyr"), "R"))
# [1] "plyr" "plyr.rdb" "plyr.rdx"
# Get the source for the package:
download.packages("plyr", "~", type="source")
# ...then unpack and inspect the R directory...
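For a function that is hidden in a namespace, such as the ar.burg.default named in the question's error message, typing the name at the prompt may not find it. A minimal sketch using only the utils and stats packages that ship with R (the variable names are illustrative):

```r
# ar.burg is an S3 generic in stats; fetch its default method directly.
burg_default <- utils::getS3method("ar.burg", "default")

# Deparse the function to get its source as a character vector.
burg_src <- deparse(burg_default)
print(head(burg_src, 5))

# getAnywhere() searches loaded namespaces and reports where the object
# was found.
found <- utils::getAnywhere("ar.burg.default")
print(found$where)
```

Searching burg_src for the text of an error message is a quick way to locate the condition that triggers it.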
.libPaths() should tell you all of your current library locations. It's possible to have more than one installation of a package if there are two libraries but only the one that is in the first library will be used. Unless you offer the code and the exact error message, it's not likely that anyone will be able to offer better advice.
I think you are asking to see what I call the source code for a function in a package. If so, the way I do it is as follows, which has worked successfully for me on the three times I have tried. I keep these instructions handy in a few places and just copied and pasted them here:
To see the source code for a function in R, download the source package containing the function, i.e. the file that ends in "tar.gz". This is a compressed file. Expand it using, for example, WinZip. You then need to open the uncompressed file that ends in ".tar". Download the free software 7-Zip, run "7zFM.exe", and navigate to the directory containing the ".tar" file. You can extract the contents of that ".tar" file into a new folder. The contents include the R files showing the source code for the functions in the package.
EDIT:
Today (July 8, 2012) I was able to open the 'tar.gz' file using the latest version of 'WinZIP' and could copy the contents (the source code) from there without having to use '7-Zip'.
EDIT:
Today (January 19, 2013) I viewed the source code for functions in base R by downloading the file
'R-2.15.2.tar.gz'
To download that file go to the http://cran.at.r-project.org/ webpage and click on that file in this line:
"The latest release (2012-10-26, Trick or Treat): R-2.15.2.tar.gz, read what's new in the latest version."
Unzip the file. WinZip will work, or it did for me. Then search your computer for readtable.r or another base R function.
agstudy noted here https://stackoverflow.com/questions/14417214/source-file-for-r-function that source code for read.csv is located in the file readtable.r, so do not expect every base R function to have its own file.
