Problem with spell checking packages in R

I'm trying to check the spelling of some Russian words using the "hunspell" library in R:
bad_words <- hunspell("Язвенная болзень", dict='ru_RU.dic')
I have installed the Russian dictionary from here: https://code.google.com/archive/p/hunspell-ru/
It is encoded as UTF-8. However, I get the following error:
Failed to convert line 1 to ISO8859-1 encoding. Try spelling with a UTF8 dictionary.
This seems strange, since neither the dictionary nor the R file has ISO8859-1 encoding...
What is the problem?

If you are operating on Windows, my first guess would be that this is related to the lack of native UTF-8 support in R on Windows. This will be resolved when R 4.2 is released; you might wish to try the development release and see whether the problem persists.
Another thing to check is whether your DESCRIPTION file contains the line Encoding: UTF-8, such that your source files are treated as having this encoding.
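In the meantime, one thing worth trying (a minimal sketch, assuming the ru_RU.dic and ru_RU.aff files from that download sit in your working directory) is to build an explicit dictionary object from the UTF-8 files and pass it to hunspell() rather than a bare file name:
library(hunspell)
# Build a dictionary object from the downloaded UTF-8 files;
# dictionary() should locate the matching ru_RU.aff automatically.
ru <- dictionary("ru_RU.dic")
hunspell("Язвенная болзень", dict = ru)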

Related

Unicode version and conformance info in R

Unicode has many versions; the current one is 14.
The utf8 package's documentation explicitly says it supports 10.0.0.
But what about R itself? I cannot find any statement of which Unicode version R supports.
Unicode also defines different levels of support an implementation can provide.
When I print '\u0061\u20de\u0308\u20dd' in RStudio,
the output is not what the Unicode 14 guidelines say it should be for enclosing marks.
So, for a complete specification of Unicode support,
I think R should also state which level it supports, but I could not find this information either. Does anyone know anything about this?
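For reference, a small sketch of the kind of check involved: base R's l10n_info() reports whether the session itself is running in a UTF-8 locale, and the utf8 package (which, as noted, targets Unicode 10.0.0) has its own printing routine.
x <- "\u0061\u20de\u0308\u20dd"  # 'a' followed by combining/enclosing marks
print(x)             # rendering depends on the console and its fonts
l10n_info()          # is this session running in a UTF-8 locale?
utf8::utf8_print(x)  # printing under the utf8 package's Unicode 10.0.0 rules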

How can I use R-scripts containing Umlauts cross-platform?

I'm using Windows on my desktop PC at work and Linux on my laptop. I frequently work on R scripts in RStudio, alternating between the two machines. Whenever I open scripts that contain umlauts on one system after working on the other, the umlauts (e.g. ä, ü, ß, ö) are replaced with question marks. Importantly, I'm not talking about data that I am importing, but about the text of the script itself. For example, writing the following script file on Linux:
# This iß an exämple
text <- c("R kann äußerst nervig sein")
will be displayed differently when opened on Windows:
# This i? an ex?mple
text <- c("R kann ?u?erst nervig sein")
Are there any settings that prevent this from happening? I've already tried to set the standard encoding to utf-8 on both machines but it didn't seem to change anything.
The standard R build on Windows doesn't fully support UTF-8, because Windows itself added that capability only very recently. So you could download the "WinUCRT" build of R (though I forget the location; Google probably knows), and then things should be fine.
Alternatively, for widest portability you could write your scripts in pure ASCII by encoding the accented letters as Unicode escapes. The stringi package can help with this, e.g.
cat(stringi::stri_escape_unicode("R kann äußerst nervig sein"))
#> R kann \u00e4u\u00dferst nervig sein
Created on 2021-11-09 by the reprex package (v2.0.1)
so you'd put this in your code:
text <- "R kann \u00e4u\u00dferst nervig sein"
(There's no need to call c() for one element.) This is inconvenient, but should work on all systems.
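Another option, if you want to keep the umlauts in the file, is to force both machines to read and write the script through an explicit UTF-8 connection, so that neither side falls back to its native locale encoding. A sketch (the file name is illustrative):
lines <- readLines("script.R", encoding = "UTF-8")  # read as UTF-8
con <- file("script.R", open = "w", encoding = "UTF-8")
writeLines(lines, con)                              # write back as UTF-8
close(con)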

How to change the tesseract config to recognize § and apply it with pdftools::pdf_ocr_text in R?

I am using pdftools in R to extract text from both scanned and text-based PDF files. One problem is with the § character, which is not recognized by tesseract.
I looked at the following links:
CRAN tesseract package vignette
SO link of a similar question
and this github page
And I tried the following:
I found the configuration files using tesseract_info() and edited the digits file under configs.
The digits file content was like this:
tessedit_char_whitelist 0123456789.
After editing it looks like this:
tessedit_char_whitelist 0123456789-$§.
This did not change anything at all; I am still not able to extract §. It still appears as 8.
After the first step failed, I tried the following:
filepng <- pdftools::pdf_convert(filePathPDF, dpi = 600)
specs <- tesseract("deu", options = list(tessedit_char_whitelist = "1234567890-.,;:qwertzuiopüasdfghjklöäyxcvbnmQWERTZUIOPÜASDFGHJKLÖÄYXCVBNM#߀!$%&§/()=?+"))
text <- tesseract::ocr(filepng, engine = specs)
This one failed too. I am by no means an expert on OCR, and tesseract has room for improvement when it comes to documentation.
How can I add § to the list of characters to be recognized in the right way, so that it applies?
Update
The following works to recognize §, when I remove language from the argument list:
charlist <- tesseract(options = list(tessedit_char_whitelist = " 1234567890-.,;:qwertzuiopüasdfghjklöäyxcvbnmQWERTZUIOPÜASDFGHJKLÖÄYXCVBNM#߀!$%&§/()=?+"))
text <- tesseract::ocr(filepng, engine = charlist)
But this time I am losing the German umlauts. I cannot find out how to specify the language and the char_whitelist at the same time. According to the documentation, tesseract() accepts a language argument and an options argument, but this does not seem to work. Any ideas?
Update:
I tried using tesseract on the command line (macOS Catalina 10.15.7).
I first converted a scanned PDF file to an image, then used this:
tesseract fileConverted.tiff fileToText
It creates fileToText.txt, and it does recognize §; all of them are correctly recognized. But the German umlauts are not recognized correctly, since I did not specify a language at all. When I use the same command with the language argument
tesseract fileConverted.tiff fileToText -l deu
German umlauts are recognized properly but § is not.
The digits config file I changed is here:
/usr/local/Cellar/tesseract/4.1.1/share/tessdata/configs
My understanding is that this is not a problem specific to R; it occurs with tesseract itself. Setting tessedit_char_whitelist and the language at the same time does not seem to be possible, or I am missing something horribly.
As said above, tesseract 4 does not support setting a whitelist. To get around that problem, you can use command-line switches: set the OCR Engine mode to "Original Tesseract only" with --oem 0, then use -c tessedit_char_whitelist=abc... to pass your whitelist directly via the command line.
Overall, it should look something like this:
tesseract fileConverted.tiff fileToText --oem 0 -l deu -c tessedit_char_whitelist=0123456789-$§
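If you want to stay inside R, the same command line can be invoked with base R's system2() (a sketch; the file names mirror the example above):
# Call the tesseract CLI with the legacy engine and a whitelist.
system2("tesseract",
        args = c("fileConverted.tiff", "fileToText",
                 "--oem", "0", "-l", "deu",
                 "-c", "tessedit_char_whitelist=0123456789-$§"))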

Using accented letters in R package documentation using bibtex and roxygen2

I am developing an R package in RStudio (R version 3.6.1; RStudio version 1.2.1335) using roxygen2 (version 6.1.1) and am using the \insertCite{} command together with a bibtex file in order to cite references in the documentation for individual functions. I am following the instructions Inserting references in Rd and roxygen2 documentation. Everything works fine, except when I try to include a reference with accented characters. So my REFERENCES.bib file contains the following entry:
@ARTICLE{Cabcdef15,
author={John {\c C}abcdef},
title={A title},
journal={Journal of Applied Stuff},
year={2015},
volume={81},
number={1},
pages={100--200},
}
The {\c C} is the LaTeX command for a C-cedilla (Ç). (I also tried \c{C} and pasting Ç directly, and neither resolved the issue.)
I cited this reference in the roxygen2 preamble for my R function myfunction using \insertCite{Cabcdef15}{mypackage}. However, in the documentation output (after running devtools::document() and devtools::build(), installing the package, and running library(mypackage) and ?myfunction), the citation appears in my browser (Google Chrome) with the Ç garbled rather than as (Çabcdef 2015).
Presumably this is an encoding issue. However, from what I read in the aforementioned instructions (under 4.4 Encoding of file REFERENCES.bib) this should be working, provided that I have the line Encoding: UTF-8 in the DESCRIPTION file for my R package, which I do. Hence I am stumped.
I have a strong suspicion you are using a Microsoft operating system.
I have code in a roxygen2 examples block which outputs accented French characters: it works fine with non-French locales on macOS and Linux, but Windows makes a mess of it. I have Encoding: UTF-8 in the package DESCRIPTION. For me, the obvious workaround is not to use Windows for documenting the package. UTF-8 everywhere works well for me, except on Windows. The R documentation links are helpful, and, in a related post, the mighty Yihui Xie writes about this issue.
This WONTFIX R issue also hints at the root cause: Windows.
A more palatable and Windows-compatible workaround is discussed in the platform-specific sections of Writing R Extensions.
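That workaround is the \enc{}{} Rd macro, which pairs UTF-8 text with a plain-ASCII fallback for platforms that cannot render it. A sketch of how it might look in the roxygen comments for this question's reference (the surrounding text is illustrative):
#' @references
#' \enc{Çabcdef}{Cabcdef}, J. (2015). A title. Journal of Applied Stuff, 81(1), 100-200.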

Warning about UTF-8 with roxygen2

I have a problem with UTF-8.
After running roxygen2::roxygenise() for my package,
it showed the warning message 'roxygen2 requires Encoding: UTF-8'.
How can I fix it?
roxygen2::roxygenise()
> Writing NAMESPACE
> Loading ABXTT
> Writing NAMESPACE
> Warning message:
> roxygen2 requires Encoding: UTF-8
Add:
Encoding: UTF-8
to your DESCRIPTION file.
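For concreteness, the field sits alongside the other fields in DESCRIPTION; a minimal sketch (ABXTT is the package name from the log above, and the version number is a placeholder):
Package: ABXTT
Version: 0.0.1
Encoding: UTF-8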
As James said, add
Encoding: UTF-8
to your DESCRIPTION file to eliminate the warning. You can find the documentation for the format of the DESCRIPTION file on CRAN. That standard includes this field to tell the rest of the build process which file encoding is used. CRAN defaults to ASCII, so if you use anything other than ASCII this field is needed.
As it turns out, roxygen2 as of 6.1.0 reads and writes files as UTF-8. As noted above, this is actually in conflict with the CRAN standard's default, so at some point roxygen2 may stop writing files as UTF-8. Until then, your package documentation has been written as UTF-8, and you will probably need to add this line to avoid the warning.
Side note: a UTF-8 file reads as ASCII so long as it contains no characters outside the ASCII range (i.e. no special characters). So, in practice, whether or not you remember to include this field may make no difference.
