I work with RStudio on Kubuntu 16.04. My language settings are:
> Sys.getlocale()
[1] "LC_CTYPE=de_AT.UTF-8;LC_NUMERIC=C;LC_TIME=de_AT.UTF-8;LC_COLLATE=de_AT.UTF-8;LC_MONETARY=de_AT.UTF-8;LC_MESSAGES=de_AT.UTF-8;LC_PAPER=de_AT.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=de_AT.UTF-8;LC_IDENTIFICATION=C"
> Sys.getenv()
...
LANG de_AT.UTF-8
LANGUAGE de_AT:de
...
However, if I export a plot with the "Enhanced Metafile Graphics Device" (the devEMF package), for instance:
emf("file.emf"); hist(somedata, main = "Überschrift"); dev.off()
and then import file.emf into MS Word (on another PC) and make it editable, all the text of the plot is in US English.
Question 1: Is it possible to obtain plots with text languages other than English?
Question 2: How?
(I realize my answer is late, but adding information for posterity)
There are two graphics formats supported by devEMF:
EMF. Locale / language information is not included in the EMF format specification except to distinguish languages with vertical vs. horizontal text.
EMF+. Locale / language information IS included in the EMF+ specification, but this feature is not implemented by devEMF.
That said, Unicode characters are fully supported for both formats and should not disappear or otherwise change when viewed/edited in MS Word.
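A minimal sketch of how one might check this, assuming the devEMF package is installed (the block skips the export otherwise). The random data and the temporary output path are placeholders of mine, not part of the question. The title is written with a \u escape so the script does not depend on the encoding of the source file.

```r
# Sketch: make sure the title is valid UTF-8, then export it with devEMF.
title_text <- enc2utf8("\u00dcberschrift")   # "Überschrift"
stopifnot(validUTF8(title_text))

out_file <- file.path(tempdir(), "file.emf")  # placeholder output path
if (requireNamespace("devEMF", quietly = TRUE)) {
  devEMF::emf(out_file)
  hist(rnorm(100), main = title_text)         # placeholder data
  dev.off()
}
```

If the umlaut survives in the title after importing the file into Word, the characters themselves are intact; the proofing language Word assigns to the text is a separate matter that the file cannot control.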
I am using pdftools in R to extract text from both scanned and text based PDF files. One problem is with the § character. This is not recognized by tesseract.
I looked at the following links:
CRAN tesseract package vignette
SO link of a similar question
and this github page
And I tried the following:
I found the configuration files using tesseract_info() and edited the digits file under configs.
The digits file content was like this:
tessedit_char_whitelist 0123456789.
After editing it looks like this:
tessedit_char_whitelist 0123456789-$§.
This did not change anything at all, I am still not able to extract §. They still appear as 8.
After the 1st step failed, I tried the following:
filepng <- pdftools::pdf_convert(filePathPDF, dpi = 600)
specs <- tesseract("deu", options = list(tessedit_char_whitelist = "1234567890-.,;:qwertzuiopüasdfghjklöäyxcvbnmQWERTZUIOPÜASDFGHJKLÖÄYXCVBNM#߀!$%&§/()=?+"))
text <- tesseract::ocr(filepng, engine = specs)
This one failed too. I am by no means an expert on OCR and tesseract has room for improvements when it comes to documentation.
How can I add § to the list of characters to be recognized in the right way, so that it applies?
Update
The following works to recognize §, when I remove language from the argument list:
charlist <- tesseract(options = list(tessedit_char_whitelist = " 1234567890-.,;:qwertzuiopüasdfghjklöäyxcvbnmQWERTZUIOPÜASDFGHJKLÖÄYXCVBNM#߀!$%&§/()=?+"))
text <- tesseract::ocr(filepng, engine = charlist)
But this time, I am losing German umlauts. I cannot find out how I can specify the language and the char_whitelist at the same time. According to the documentation, tesseract() accepts language argument and options argument. But this does not seem to work. Any ideas?
Update:
I tried using tesseract in command line (MacOS Catalina 10.15.7).
I converted a scanned PDF file first to an image then used this:
tesseract fileConverted.tiff fileToText
It creates fileToText.txt. It does recognize §. All of them are correctly recognized. But German umlauts are not recognized correctly, since I did not specify language at all. When I use the same command with the language argument
tesseract fileConverted.tiff fileToText -l deu
German umlauts are recognized properly but § is not.
The digits config file I changed is here:
/usr/local/Cellar/tesseract/4.1.1/share/tessdata/configs
My understanding is: it is not a problem specific to R, but it occurs with tesseract itself. Setting tessedit_char_whitelist and the language at the same time does not seem to be possible or I am missing something horribly.
As said above, tesseract 4 does not support setting a whitelist. To work around that problem, you can use command-line switches: set the OCR engine mode to "Original Tesseract only" with --oem 0, then pass your whitelist directly with -c tessedit_char_whitelist=abc...
Overall, it should look something like this :
tesseract fileConverted.tiff fileToText --oem 0 -l deu -c tessedit_char_whitelist=0123456789-$§
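Since the question started in R, here is a rough sketch of calling the same command from R with system2(). The helper name build_tesseract_args is made up for illustration, and the call only runs if a tesseract binary is on the PATH and the image exists. It also assumes that the installed deu.traineddata contains the legacy engine model, which --oem 0 needs.

```r
# Hypothetical helper: assemble the command-line arguments shown above.
build_tesseract_args <- function(image, out_base, lang = "deu",
                                 whitelist = "0123456789-$\u00a7") {
  c(image, out_base,
    "--oem", "0",                                  # legacy engine only
    "-l", lang,
    "-c", paste0("tessedit_char_whitelist=", whitelist))
}

args <- build_tesseract_args("fileConverted.tiff", "fileToText")

# Only call the CLI when it is available and the input file exists.
if (nzchar(Sys.which("tesseract")) && file.exists("fileConverted.tiff")) {
  system2("tesseract", args = shQuote(args))       # quote "$" and the section sign for the shell
}
```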
On some computers, the following code used in conjunction with the packages siar and SIBER does not render the delta and/or permil symbol correctly in the axes labels. Instead, either a blank axis label, or text such as "\u2030" is rendered in its place.
plot(0,xlab = expression(paste(delta^13,"C (\u2030)")))
One frequently encountered problem is that your computer's region settings (i.e. those of your operating system, not of the applications R or RStudio) are set to use a non-UTF-8 character set. If you type
Sys.getlocale()
in the R command window, you should see something like
"en_IE.UTF-8/en_IE.UTF-8/en_IE.UTF-8/C/en_IE.UTF-8/en_IE.UTF-8"
which for me means I'm using UTF-8 in English with Irish region settings.
If you don't see UTF-8, then \u2030 and other character codes won't work.
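If you would rather test this in a script than read the locale string by eye, something along these lines should do; it uses only base R, and the messages are just illustrative.

```r
# Sketch: report whether the current R session runs in a UTF-8 locale.
ctype   <- Sys.getlocale("LC_CTYPE")          # e.g. "en_IE.UTF-8"
is_utf8 <- isTRUE(l10n_info()[["UTF-8"]])     # TRUE only in a UTF-8 locale

if (is_utf8) {
  message("UTF-8 locale detected: ", ctype)
} else {
  message("Not a UTF-8 locale (", ctype, "); \\u2030 may not render.")
}
```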
I constructed a dendrogram in R with the code:
data(iris)
aver<-sapply(iris[,-5],function(x) by(x,iris$Species,mean))
matrix<-dist(aver)
clust<-hclust((matrix),"ave")
clust$labels<-row.names(aver)
plot(as.dendrogram(clust))
I wanted to save the dendrogram as an SVG file using the code:
install.packages("Cairo")
library(Cairo)
svg("plot.svg")
plot(as.dendrogram(clust))
dev.off()
Here the problem started:
When I imported "plot.svg" into Inkscape (ver: 0.48.4) and selected any label (e.g. "setosa"), it was not recognized as text, but rather as some "user defined" object. Specifically, when I selected any "letter" in the label and inspected it with the XML Editor (Ctrl+Shift+X) in Inkscape, I obtained this information:
**id**: use117
**x**: 142.527344
**xlink:href**: #glyph0-8
**y**: 442.589844
On the other hand, when I manually wrote "setosa" using the "create and edit text objects" tool and inspected it in the XML Editor, it returned:
**id**: text4274
**sodipodi:linespacing**: 125%
**style**: font-size:18px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;line-height:125%;letter-spacing:0px;word-spacing:0px;fill:#000000;fill-opacity:1;stroke:none;font-family:Palatino Linotype;-inkscape-font-specification:Palatino Linotype
**transform**: scale(0.8,0.8)
**x**: 176.02016
**xml:space**: preserve
**y**: 596.96674
Judging by the "id" attribute shown in the XML Editor, Inkscape likely did not recognize the labels as text. Hence, I cannot change the font or size, or use any other functions related to text objects in Inkscape.
Here is the svg file, that I made with the previous code and imported into Inkscape
I repeated the previous steps with other versions of Inkscape and R, but the result was the same.
Here is the question:
Do you have any suggestion for how I can get the labels as text objects instead of "user defined" objects (or whatever they are) when importing SVG files from R into Inkscape?
UPDATE
@baptiste linked to the SO thread where @Oscar Perpiñán suggested three packages (gridSVG, SVGAnnotation and RSVGTipsDevice) that manipulate SVG. Unfortunately, none of the suggested packages solved the text issue.
So far I have found an SO thread where @Mo Sander suggested the RSvgDevice package, since it can preserve text objects rather than glyphs. While stuck on the RSvgDevice installation, I found that RSvgDevice is only available for 32-bit installations and R < 2.15.0. Otherwise, R returns the warning message:
Warning message:
package ‘RSvgDevice’ is not available (for R version 3.0.1)
Despite requiring an older R version, RSvgDevice is currently the only package that can preserve a text object in SVG.
I'm a bit late to the party, but I've been dealing with this myself. I found a trick to make it work. First, I export the plot as PDF instead of SVG because PDF fonts are recognized by Inkscape.
This, however, brings a new problem: the text often ends up being defined letter by letter, meaning that you can change the font, but the spacing stays fixed, which becomes immensely annoying. I found that this was due to the x coordinate being defined for each letter.
I wrote a Perl script and put it in this gist to remove all the trailing coordinates. After that I was able to manipulate all the fonts as I wished. Note that this will only work for horizontal text.
Hope that helps this problem you had over a year ago :)
This is a failing in Cairo. Major, from my point of view.
The cairo SVG surface (i.e. the back-end in Cairo used to "draw" on SVG) simply does not support the "text" tag. It does not understand about strings at all. Instead, it places each character (glyph) individually. So any SVG generated with Cairo is not useful if you want to post-process contained text with a vector editor. :(
The only mention I found on the cairo list was this one:
http://lists.cairographics.org/archives/cairo/2011-February/021777.html
The svglite package exports text on Linux as desired.
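A small sketch of what that could look like for the dendrogram from the question, assuming svglite is installed (the block skips the export otherwise). The temporary output path and the has_text_tags check are my additions; the check just confirms that the file contains real text elements.

```r
# Rebuild the dendrogram from the question using base R only.
aver  <- sapply(iris[, -5], function(x) tapply(x, iris$Species, mean))
clust <- hclust(dist(aver), method = "average")

out_file      <- file.path(tempdir(), "plot.svg")  # placeholder output path
has_text_tags <- NA
if (requireNamespace("svglite", quietly = TRUE)) {
  svglite::svglite(out_file)
  plot(as.dendrogram(clust))
  dev.off()
  # Labels should be written as <text> elements, not as glyph references.
  has_text_tags <- any(grepl("<text", readLines(out_file), fixed = TRUE))
}
```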
[EDIT] According to this thread, there is also a way to remove the squeezing of the edited text into the fixed box width. Just remove the textLength field from the object in the XML editor.
Cheers
I can't comment directly on mgrewe's answer because of my low reputation, but thank you for the solution.
I implemented the textLength edit in R:
svgitem<-readLines('file.svg')
svgitem<-gsub('textLength=','tL=',svgitem)
writeLines(svgitem,'without_textLength.svg')
With the without_textLength.svg file, the text box no longer seems to be affected when editing in Inkscape, and the file keeps a trace of the old textLength, renamed 'tL'.
Thanks again mgrewe, I lost so many hours reformatting text in Inkscape before seeing your answer.
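A possible variant, in case you would rather drop the attribute than rename it. The function name strip_text_length is made up, and removing lengthAdjust as well is my own assumption about what is safe to discard. The sample line only imitates the kind of text element svglite writes.

```r
# Hypothetical helper: delete textLength (and lengthAdjust) attributes.
strip_text_length <- function(lines) {
  gsub("\\s+(textLength|lengthAdjust)=(\"[^\"]*\"|'[^']*')", "",
       lines, perl = TRUE)
}

sample_line <- paste0("<text x='10' y='20' textLength='35.2px' ",
                      "lengthAdjust='spacingAndGlyphs'>setosa</text>")
cleaned <- strip_text_length(sample_line)
cleaned
```

For a real file, the same function can be applied to the result of readLines('file.svg') before writing it back with writeLines().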
R is clearly not using the standard SVG text objects for producing its labels. I have no idea why. I am not an R user.
Perhaps by default it uses its own custom font that it manually inserts glyph-by-glyph into the output. Are you using the same font in both cases? In Inkscape you are using Palatino. Is that what you are using for the labels in R?
I want to set (x_1, x_2, \dots, x_n) in a bold font within R documentation. I wrote
\deqn(\bold{x}_1, \bold{x}_2, \ldots, \bold{x}_n),
but when RStudio shows the HTML preview of the documentation, x is not bold and \bold{x} appears literally in the HTML help page. Other LaTeX commands for bold math, such as \boldsymbol, \mathbf and \boldmath, were also unsuccessful.
So, what is the right command for setting a character in bold within math mode?
Thank you,
P.S. When I applied \mathbf and \boldsymbol, the character x became bold in the PDF version of the documentation, but how about the HTML help page?
Note: it is \deqn{ .. }, but you've used '(...)'.
The documentation (the "Writing R Extensions" manual that comes with every version of R and is also available on CRAN / r-project.org) explains that both \deqn and \eqn take pure LaTeX and, optionally as a 2nd argument, a text version of the corresponding formula.
Hence, \bold is not appropriate: it's Rd language instead of LaTeX.
Did you try \mathbf{} .. as that is the (pure) LaTeX way?
HTML is conceptually between LaTeX and simple text (= the two arguments of (d)eqn).
Within R (and outside), there are efforts and experiments towards better HTML rendering of such math equations. Do ask on R-help or R-devel (the mailing lists!) if you want to enquire about our plans for this (I have not been involved in it).
BTW: I strongly disagree with HW's opinion that you should not care about the PDF version of the reference manual. Mathematically minded authors and readers (with some experience) very much appreciate nice (formulas in) PDF versions of the help pages / reference manuals.
Yes, it is a matter of taste, to a large extent. I do prefer carefully written and layouted reference material to those help pages that are only written because they are required by "R CMD check" ;-)
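To see what a given R installation does with such a formula, one could render a throwaway Rd file to HTML, roughly as sketched below. The Rd content is a made-up minimal page, not taken from any package. As far as I know, recent R versions typeset the LaTeX argument in HTML via KaTeX or MathJax, while older ones fall back to the text argument, so the output will differ between versions.

```r
# Sketch: a minimal, made-up Rd page using \deqn{LaTeX}{text fallback}.
rd_text <- c(
  "\\name{demo}",
  "\\title{Demo}",
  "\\description{",
  "  \\deqn{\\mathbf{x}_1, \\mathbf{x}_2, \\ldots, \\mathbf{x}_n}{(x_1, x_2, ..., x_n)}",
  "}"
)
rd_file <- tempfile(fileext = ".Rd")
writeLines(rd_text, rd_file)

rd   <- tools::parse_Rd(rd_file)               # parse the Rd source
html <- capture.output(tools::Rd2HTML(rd))     # render to HTML lines
```

Inspecting html (or writing it to a file and opening it in a browser) shows which of the two arguments ended up in the page.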
I'm creating a plot in R, and need to add an en dash to some axis labels, as opposed to your everyday hyphen.
axis(1, at=c(0:2), labels=c("0-10","11-30","31-70"))
I'm running R version 2.8.1 on Linux.
Old question but still a problem...
I'm using R vsn 3.3.2 on OSX 10.12.2, plotting with plot() to a pdf file that I import into Affinity Designer vsn 1.5.4. Axis labels of the form "2-0" show up in Affinity Designer with the dash overlapping the "0". I don't know if the problem lies with Affinity Designer or the pdf file or what. It would be nice to be able to try various Unicode dash characters, but R and pdf files both seem to not yet be fully equipped to deal with Unicode using the default fonts.
Solution: the cairo_pdf() device in R. It is part of grDevices (no extra package to load) and requires an R build with cairo support:
d = 0:11
names(d) = paste(0:11, "-", 11:0, sep="")
names(d) = gsub("-", "\U2012", names(d)) # U+2012 is "figure dash"
d
barplot(d)
cairo_pdf(filename="x.pdf", width=11, height=8)
barplot(d)
dev.off()
The dashes show up in the R console, default R plotting device, and the pdf file viewed with both Preview and Affinity Designer.
In this example, you can use the expression() function. Plotmath then draws each hyphen as a minus sign, which looks very much like an en dash:
axis(1,
     at = 0:2,
     labels = c(expression(0-10),
                expression(11-30),
                expression(31-70)))
You're using Linux, so depending on how well R understands unicode, you could map one of your spare keyboard keys to the Compose Key and then just type it out. To get a —, press Compose and then the normal - key two or three times (depending on your system's mappings). Note that when using the Compose key, you don't hold it down - just press the keys in sequence.
Exactly how you'd enable that varies, but in Ubuntu, System->Preferences->Keyboard, Layout tab, Layout Options button, and select something appropriate for the "Compose key position" item. I usually use the Menu key.
Edit: My mistake, you wanted an en dash, not an em dash. The en dash (–) is Compose dash dash period, rather than Compose dash dash dash.
An MDPI journal requested a change from hyphens to en dashes in the axis labels.
Using base graphics, I solved the problem by simply replacing "-" with "\u2013", without spaces. The complete example call for the axis is
axis(1, 1:2, c("20\u201329", "40\u201349"))
In my case the two labels expressed two age groups. I used it in R 4.1.3 and windows 10.
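Putting the pieces of this thread together, a short sketch for the labels from the original question. The plotted values and the output path are placeholders, and the PDF is only written if the R build has cairo support.

```r
# Replace plain hyphens in the range labels with en dashes (U+2013).
ranges <- c("0-10", "11-30", "31-70")
labels <- gsub("-", "\u2013", ranges, fixed = TRUE)

out_file <- file.path(tempdir(), "endash.pdf")   # placeholder output path
if (capabilities("cairo")) {
  cairo_pdf(out_file, width = 7, height = 5)
  plot(0:2, c(5, 3, 8), xaxt = "n", xlab = "Range", ylab = "Count")  # placeholder data
  axis(1, at = 0:2, labels = labels)
  dev.off()
}
```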