odfWeave error source document containing non-ASCII chars - r

I'm getting this error when calling odfWeave on my document:
Pre-processing the contents
Sweaving content.Rnw
Error: ‘content.Rnw’ is not ASCII and does not declare an encoding
I've seen some ways you can add an encoding switch for LaTeX docs "(Sweave --encoding=utf-8)", but I don't know whether this can be done with odfWeave.
I've worked around it before by converting the source doc back to ASCII, but ideally it would be nice if the conversion would run with whatever is in my doc (and some names, for example, require a non-ASCII charset).

We made changes to odfWeave so that it (rightly) uses a utf-8 encoding. In fact, we coerce this by using the 'encoding="UTF-8"' option to Sweave.
I guess the question is "why isn't the document utf-8?" Honestly, I don't really have a good answer for you, since I don't have the document (or the results of sessionInfo()). You might be creating non-utf8 characters in the course of weaving.
One thread that might help is this:
http://r.789695.n4.nabble.com/Running-odfWeave-on-its-own-examples-odt-td4639889.html
Figuring this out appears pretty complex and I wish I had a clear-cut answer for you.
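One way to narrow this down, sketched here under the assumption that you can get at the intermediate text, is to check which lines are not valid UTF-8 with base R's validUTF8() and re-encode them with iconv(). The temp file and sample bytes below are made up for illustration, and the sketch assumes the stray bytes are Latin-1, which may not hold for your document.

```r
# Hypothetical sample: a file with one Latin-1 byte (0xE9) that is not valid UTF-8.
tmp <- tempfile(fileext = ".Rnw")
writeBin(c(charToRaw("ok line\ncaf"), as.raw(0xE9), charToRaw("\n")), tmp)

lines <- readLines(tmp, warn = FALSE)

# validUTF8() (base R >= 3.3.0) checks the raw bytes of each string.
bad <- which(!validUTF8(lines))
print(bad)

# Assumption: the offending bytes are Latin-1; adjust `from` if they are not.
fixed <- lines
fixed[bad] <- iconv(lines[bad], from = "latin1", to = "UTF-8")
print(validUTF8(fixed))
```

If the reported lines come from text you typed, that points at the source document; if they come from chunk output, that points at something created during weaving.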

Related

Attempts to parse bencode / torrent file in R

I would like to parse torrent files automatically in R. I tried to use the R-bencode package:
library('bencode')
test_torrent <- readLines('/home/user/Downloads/some_file.torrent', encoding = "UTF-8")
decoded_torrent <- bencode::bdecode(test_torrent)
but got this error:
Error in bencode::bdecode(test_torrent) :
input string terminated unexpectedly
In addition if I try to parse just part of this file bdecode('\xe7\xc9\xe0\b\xfbD-\xd8\xd6(\xe2\004>\x9c\xda\005Zar\x8c\xdfV\x88\022t\xe4գi]\xcf'), I get
Error in bdecode("\xe7\xc9\xe0\b\xfbD-\xd8\xd6(\xe2\004>\x9c\xda\005Zar\x8c\xdfV\x88\022t\xe4գi]\xcf") :
Wrong encoding '�'. Allowed values are i, l, d or a digit.
Maybe there are other ways to do this in R? Or perhaps I can embed code from another language in an Rscript?
Thanks in advance!
It might be that the torrent file is somehow corrupted.
A bencode value must begin with the character i (for integers), l (for lists), d (for dictionaries) or a number (for the length of a string).
The example string ('\xe7\xc9...'), doesn't start with any of those characters, and hence it can't be decoded.
See this for more info on the bencode format.
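The first-byte rule described above can be checked directly on raw bytes. This is only a minimal sketch: the helper name starts_like_bencode is made up for illustration and is not part of the bencode package, and it checks only the first byte, not the whole value.

```r
# TRUE if the first byte is one of: i, l, d, or an ASCII digit.
# Works on raw bytes, so no text encoding is assumed.
starts_like_bencode <- function(bytes) {
  allowed <- utf8ToInt("ild0123456789")
  length(bytes) > 0L && as.integer(bytes[1]) %in% allowed
}

# A dictionary start is accepted.
print(starts_like_bencode(charToRaw("d8:announce")))

# The first bytes of the example string from the question are rejected.
print(starts_like_bencode(as.raw(c(0xe7, 0xc9, 0xe0))))
```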
There seem to be several issues here.
Firstly, your code should not treat torrent files as text files in UTF-8 encoding. Each torrent file is split into equally-sized pieces (except for the last piece ; )). Torrents contain a concatenation of SHA1 hashes of each of the pieces. SHA1 hashes are unlikely to be valid UTF-8 strings.
So, you should not read the file into memory using readLines, because that is for text files. Instead, you should use a connection:
test_torrent <- file("/home/user/Downloads/some_file.torrent")
open(test_torrent, "rb")
bencode::bdecode(test_torrent)
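To see what the file really contains without any text decoding, you can also pull the bytes in with readBin(). A small sketch, using a made-up stand-in file rather than a real torrent:

```r
# Stand-in for a real .torrent file: a bencoded dictionary whose value is binary.
# The content is invented for illustration; note the embedded 0x00 byte.
path <- tempfile(fileext = ".torrent")
payload <- c(charToRaw("d6:pieces4:"),
             as.raw(c(0xe7, 0xc9, 0x00, 0xfb)),
             charToRaw("e"))
writeBin(payload, path)

# readBin() returns the bytes untouched; no encoding is assumed.
bytes <- readBin(path, what = "raw", n = file.size(path))

print(length(bytes))
print(rawToChar(bytes[1]))  # first byte, which should be "d" for a dictionary
```

Bytes like 0x00 or 0xe7 in the middle of the file are exactly what a line-oriented text reader is not designed to handle.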
Secondly, it seems that this library suffers from a similar issue: readChar, which it uses, also assumes that it's dealing with text.
This might be due to recent changes in R, though, seeing as the library is over 6 years old. I was able to apply a quick hack and get it working by passing useBytes=TRUE to readChar.
https://github.com/UkuLoskit/R-bencode/commit/b97091638ee6839befc5d188d47c02567499ce96
You can install my version as follows:
install.packages("devtools")
library(devtools)
devtools::install_github("UkuLoskit/R-bencode")
Caveat lector! I'm not an R programmer :).

Track the exact place of a not encoded character in an R script file

This is more of a tip-style question, but one that could save a lot of time in many cases. I have a script.R file that I try to save, and I get this error:
Not all of the characters in ~/folder/script.R could be encoded using ASCII. To save using a different encoding, choose "File | Save with Encoding..." from the main menu.
I have been working on this file for months, and today, while editing my code like crazy, I got this error for the first time, so obviously I inserted a character today that cannot be encoded.
My question is: can I track down this specific character and find exactly where it is in the document?
There are about 1000 lines in my code, and it's almost impossible to search manually.
Use tools::showNonASCIIfile() to spot the non-ASCII characters.
Let me suggest two slight improvements to this.
Process:
Save your file using a different encoding (e.g. UTF-8).
Set a variable 'f' to the name of that file, something like this: f <- "yourpath\\yourfile.R"
Then use tools::showNonASCIIfile(f) to display the faulty characters.
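If you also want line numbers and positions you can work with programmatically, something like the following might do. It is a rough sketch using only base R; the sample file is made up, and positions are byte offsets, so they can differ from the editor's column once a multi-byte character has appeared earlier on the same line.

```r
# Hypothetical sample script with one non-ASCII character on line 2.
f <- tempfile(fileext = ".R")
writeLines(c("x <- 1", "y <- \"caf\u00e9\"", "z <- 3"), f, useBytes = TRUE)

lines <- readLines(f, warn = FALSE)

# For each line, the byte offsets of any bytes outside the ASCII range (> 127).
offsets <- lapply(lines, function(l) which(as.integer(charToRaw(l)) > 127L))
bad_lines <- which(lengths(offsets) > 0L)

for (i in bad_lines) {
  cat(sprintf("line %d, first non-ASCII byte at position %d\n",
              i, offsets[[i]][1]))
}

# The built-in report, for comparison.
tools::showNonASCIIfile(f)
```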
Something to check:
I have a Markdown file which I run to produce a Word document (not important).
Some of the packages I load at initialisation mask previously defined functions. I have found that the resulting warning messages sometimes contain non-ASCII characters, and this seems to have caused this message for me. Some fault had put all that output at the end of the file, and I had to delete it anyway!
Check for characters coming back from warnings!
Cheers
Expanding the accepted answer with this answer to another question, to check for offending characters in the script currently open in RStudio, you can use this:
tools::showNonASCIIfile(rstudioapi::getSourceEditorContext()$path)

R exporting text issue

I have a problem that might be a bit unique, but I think that answering it could answer other questions about encoding too.
To expand my R skills, I tried to write a function to manage the vcf file from Android phones. Everything went fine until I tried to upload the file to the phone. An error appeared saying that the first line starts with something other than what a normal VCF version 3 file starts with. But when I check the file on the PC, it looks fine, without the characters my phone complained about. So I asked about it, and one person here said that it is the byte order mark (BOM) and that I should use a hex editor to see it. And it was there, even though it couldn't be seen in the text editors of Windows and Linux.
Thus, I tried to solve the problem by using the fileEncoding argument in R. The code that I use to write the file is:
write.table(cons2,file=paste(filename,".vcf",sep=""),row.names=F,col.names=F,quote=FALSE,fileEncoding="")
I tried ASCII, UTF-8, etc. as the argument, but no luck. ASCII seems to delete some of the characters, and UTF-8 makes these characters visible in the text file.
I would appreciate it if someone could provide a solution to this.
PS: I know that if I modify the file in a HEX editor it solves the problem, but I want the solution in the R coding.
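For what it's worth, the hex-editor check and fix can be done from R itself by looking at the first three bytes of the file. This is only a sketch: the helper names has_utf8_bom and strip_utf8_bom are made up, it handles only the UTF-8 BOM (bytes EF BB BF), and it does not explain where the BOM came from in the first place.

```r
utf8_bom <- as.raw(c(0xEF, 0xBB, 0xBF))

# TRUE if the file starts with the UTF-8 byte order mark.
has_utf8_bom <- function(path) {
  identical(readBin(path, what = "raw", n = 3L), utf8_bom)
}

# Rewrite the file without its leading BOM, leaving all other bytes as they are.
strip_utf8_bom <- function(path) {
  bytes <- readBin(path, what = "raw", n = file.size(path))
  if (length(bytes) >= 3L && identical(bytes[1:3], utf8_bom)) {
    writeBin(bytes[-(1:3)], path)
  }
  invisible(path)
}

# Demo on a made-up file that starts with a BOM.
vcf <- tempfile(fileext = ".vcf")
writeBin(c(utf8_bom, charToRaw("BEGIN:VCARD\n")), vcf)
print(has_utf8_bom(vcf))
strip_utf8_bom(vcf)
print(has_utf8_bom(vcf))
```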

Converting .pdf files to excel (.xls)

A friend of mine doing an internship asked me 2 hours ago if I could help him avoid manually converting 462 PDF files to .xls using free online software.
I thought of a shell script using unoconv, but I couldn't figure out how to use it properly, and I am not sure unoconv can solve this problem, since it mainly converts files to PDF, not the reverse.
Conversion from PDF to any other structured format is not always possible and not generally recommended.
Having said that, this does look like a one-off job and there's a fair few of them (462).
It's worth pursuing if you can reliably extract text from most of them and it's reasonably structured. It's a matter of trying to get regular text output across a sample of the PDFs that you can reliably parse into a table structure.
There are plenty of tools around that target either direct or OCR-based text extraction; just google around.
One I like is pstotext from the ghostscript suite; the -bboxes option lets me get the coordinates of each word and leaves it up to me to re-assemble the structure. Despite its name, it does work on input PDFs. The downside is that it can be a bit flaky and works on some PDFs but not others.
If you get this far, you'd then most likely need to write a shell script or program to convert that to a CSV. You can either open this directly in a spreadsheet or look for tools to convert it into XLS.
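As a rough idea of what that last conversion step might look like, here is a sketch in R. The sample lines are invented, and the big assumption is that the extracted text has columns separated by runs of two or more spaces, which real extractor output may well not satisfy.

```r
# Made-up sample of extracted text; columns are separated by 2+ spaces.
text_lines <- c("Name      Qty   Price",
                "Widget    2     3.50",
                "Gadget    10    12.00")

# Split each line on runs of two or more spaces.
fields <- strsplit(trimws(text_lines), " {2,}")

# Only proceed if every line produced the same number of columns.
stopifnot(length(unique(lengths(fields))) == 1L)

# First line is the header, the rest are data rows.
tab <- as.data.frame(do.call(rbind, fields[-1]), stringsAsFactors = FALSE)
names(tab) <- fields[[1]]

out <- tempfile(fileext = ".csv")
write.csv(tab, out, row.names = FALSE)
print(readLines(out))
```

Lines that fail the column-count check are the ones worth inspecting by hand before trusting the rest.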
PS: If he hasn't already, get the intern to ask if there's any possible way of getting at the original data that was used to create the PDFs. It will save a lot of time and effort and lead to a far more accurate result.
Update: An alternative to pstotext is the renderpdf.pl command, which is included in the Perl CAM::PDF module. It is more robust, but just reports the text (x,y) position, not bounding boxes.
Other responses on a linked question suggest Tabula, too.
https://github.com/tabulapdf/tabula
I tried and it works very well.

Where to find mapping of Latex \sum commands to actual Sigma symbol in ttf?

If this is OT, please tell me where to repost it.
I need to render some math equations, in real time. Where can I find a mapping of the LaTeX "english" names (like \sum ) to symbol XYZ of ABC.ttf ? I can read + render ttf's fine; I just don't know where to get the ttfs that have the math symbols and how they're indexed.
Thanks!
I did not find such a list. But I compiled a similar list by hand.
Look at my Super User question, where the AHK script provides a very partial map.
I've taken a look at the LaTeX mappings from symbol names to fonts and it's scary. LaTeX's fonts are more complicated than TTF to begin with, and the math fonts are the most complicated of the bunch. For starters, there are separate variants for the glyph depending on context: a big "\sum" in a formula is one character, but if you type "\sum" in inline math you get a different character optimized for inline formulas.
You might be better off using a Unicode database. You can download the Unicode database (as text files) from unicode.org, and there are also some libraries (IBM's LibICU?) which let you look up symbols by name. Most of the symbols you're looking for start at code point U+2200. This stuff doesn't give you nice looking formulas, just the symbols.
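To give an idea of the Unicode route, here is a tiny hand-written lookup from a few LaTeX names to code points. It is a sketch only: the table is a handful of entries typed in by hand, not an official mapping, and the helper name symbol_for is made up.

```r
# A few LaTeX command names and the Unicode code points of the matching symbols.
latex_to_codepoint <- c(
  "\\forall"  = 0x2200,
  "\\partial" = 0x2202,
  "\\exists"  = 0x2203,
  "\\prod"    = 0x220F,
  "\\sum"     = 0x2211,
  "\\infty"   = 0x221E,
  "\\int"     = 0x222B
)

# Return the symbol as a one-character UTF-8 string, or NA if the name is unknown.
symbol_for <- function(name) {
  if (!name %in% names(latex_to_codepoint)) return(NA_character_)
  intToUtf8(latex_to_codepoint[[name]])
}

cp <- as.integer(latex_to_codepoint[["\\sum"]])
cat(sprintf("\\sum -> U+%04X\n", cp))
```

You would still need a font that actually contains a glyph at that code point to render it.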
You might also be looking for a MathML renderer instead. This will give you the nice looking formulas (not as good as LaTeX), and has an XML interface which should be easy enough to work with.
I would look at XeTeX. It allows Unicode input and lets you use TTF fonts, so it must be doing exactly what you want to do. Or maybe you can even replace LaTeX with XeTeX in whatever you're doing: then you wouldn't need to know the hairy details.

Resources