Unescape LaTeX to UTF-8 or ASCII - r

I use the R packages RefManageR and bibtex to read in a BibTeX file I exported from Mendeley (my reference manager). Sometimes authors are listed with accents in their names (López), but in BibTeX these are escaped to "L{\\'{o}}pez". However, in another reference the same name is spelled without the accent (Lopez).
How can I convert "L{\\'{o}}pez" to López or Lopez so I can compare them?
Googling only turns up how to escape characters (while I want to unescape them) or how to make PDFs from R.

I tried this and it worked for me, but I still think there must be a better solution:
# Strip LaTeX accent escapes like {\'{o}} down to the bare letter
deTeX <- function(x) {
  gsub("\\{\\\\.+?\\{([a-z]*)\\}\\}", "\\1", x,
       fixed = FALSE, perl = TRUE, ignore.case = TRUE)
}
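A quick check of the function above on a few escape patterns (the Müller entry is just an extra made-up example):
authors <- c("L{\\'{o}}pez", "Lopez", "M{\\\"{u}}ller")
deTeX(authors)
#> [1] "Lopez"  "Lopez"  "Muller"
If the accented form (López) is wanted instead of the bare ASCII letter, a small, non-exhaustive lookup can map specific escapes to their UTF-8 characters:
gsub("\\{\\\\'\\{o\\}\\}", "ó", "L{\\'{o}}pez", perl = TRUE)
#> [1] "López"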

Related

Remove space from exported filename in R using xlsx package

I'm trying to export a file using write.xlsx() from the "xlsx" package.
The export itself works as expected, but I'm having trouble with the naming convention.
I would like the file to be named as follows:
filename <today's date>.xlsx
At present I can do either of the following:
write.xlsx(exports, paste("filename", Sys.Date(),".xlsx"))
which gives:
filename 2020-04-21 .xlsx
Or I can write
write.xlsx(exports, paste("filename", Sys.Date(),".xlsx", sep = ""))
which gives:
filename2020-04-21.xlsx
How do I remove the space between the date and file extension such that the file name is:
filename 2020-04-21.xlsx
I appreciate this is somewhat a vanity thing and I could use sep = "_" to place underscores throughout, but this is not the naming convention I am trying to achieve.
Wow. I cannot believe I didn't think to just add an additional space in the last example.
Going to blame cabin fever from social distancing/isolating for that little derp.
write.xlsx(exports, paste("filename ", Sys.Date(),".xlsx", sep = ""))

read an Excel file embedded in a website

I would like to read automatically in R the file which is located at
https://clients.rte-france.com/servlets/IndispoProdServlet?annee=2017
This link generates the automatic download of a zipfile. This zipfile contains the Excel file I want to read in R.
Does any of you have any suggestions on this? Thanks.
Panagiotis' comment to use download.file() is generally good advice, but I couldn't make it work here (and would be curious to know why). Instead I used httr.
(Edit: got it, I reversed args of download.file()... Repeat after me: always use named args...)
Another problem with this data: it appears not to be a regular xls file; I couldn't open it with the otherwise excellent readxl package.
It looks like a tab-separated flat file, but I had no success with read.table() either. readr::read_delim() handled it.
library(httr)
library(readr)
r <- GET("https://clients.rte-france.com/servlets/IndispoProdServlet?annee=2017")
# Write the archive on disk
writeBin(r$content, "./data/rte_data")
rte_data <- read_delim(
  unzip("./data/rte_data", exdir = "./data/"),
  delim = "\t",
  locale = locale(encoding = "ISO-8859-1"),
  col_names = TRUE
)
There are still some parsing problems, but I'm not sure they should be dealt with in this question.
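For completeness, here is what the download.file() route from the comment probably should have looked like, with named arguments and mode = "wb" so the zip archive is written as binary. This is an untested sketch; the destination path is just illustrative:
download.file(
  url      = "https://clients.rte-france.com/servlets/IndispoProdServlet?annee=2017",
  destfile = "./data/rte_data.zip",
  mode     = "wb"
)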

Output accented characters for use with latex

I'm trying to use R to create the content of a .tex file. The content contains many accented letters, and I am not able to write them correctly to the file.
Here is a short minimal example of what I would like to perform:
I have a file texinput.tex, which already exists and is encoded as UTF-8 without BOM. When I manually type é in Notepad++ and save the file, it compiles correctly in LaTeX and the output is as expected.
Then I tried to do this in R:
str.to.write <- "é"
cat(str.to.write, file = "tex_list.tex", append=TRUE)
As a result, the encoded character xe9 appears in the tex file. LaTeX throws this error when trying to compile:
! File ended while scanning use of \UTFviii#three#octets.<inserted text>\par \include{texinput}
I then tried all of the following things before the cat command:
Encoding(str.to.write) <- "latin1"
-> same output error as above
str.to.write <- enc2utf8(str.to.write)
-> same output and error as above
Encoding(str.to.write) <- "UTF-8"
-> this appears in the tex file: \xe9. LaTex throws this error: ! Undefined control sequence. \xe
Encoding(str.to.write) <- "bytes"
-> this appears in the tex file: \\xe9. LaTex compiles without error and the output is xe9
I know that I could replace é by \'{e}, but I would like an automatic method, because the real content is very long and contains words from three different Latin languages, so it has lots of different accented characters.
However, I would also be happy with a function that automatically sanitizes the R output for use with LaTeX. I tried xtable with sanitize.text.function, but it appears that it doesn't accept character vectors as input.
After quite a bit of searching and trial-and-error, I found something that worked for me:
# create output function
writeTex <- function(x) {
  write.table(x, "tex_list.tex",
              append = TRUE, row.names = FALSE,
              col.names = FALSE, quote = FALSE,
              fileEncoding = "UTF-8")
}

writeTex("é")
Output is as expected (é), and it compiles perfectly well in LaTeX.
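Another option, as a minimal sketch, is to keep cat()/writeLines() but open the file connection explicitly as UTF-8, so the output does not depend on the session locale:
con <- file("tex_list.tex", open = "a", encoding = "UTF-8")
writeLines("é", con)
close(con)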
Use TIPA for processing International Phonetic Alphabet (IPA) symbols in LaTeX. It has become standard in the linguistics field.

Using chinese characters without changing locale in R

I can use Chinese characters in R: I can put them in strings inside a data.frame, substitute them with gsub, and they display normally on screen. I can save them to a file using write.table, but I can't read them back with read.table! I'm using fileEncoding = "UTF-8" for both write.table and read.table, but the latter gives me:
invalid multibyte string at ...
I've read about changing the locale, but since the Chinese characters work everywhere else, I would rather not mess with the locale (my machine uses a mix of English and Portuguese locales). Is that possible?
I'm using RKWard on Ubuntu 14.10.
EDIT: Chinese characters work perfectly everywhere in the files; they only produce errors when used for quoting...
Sorry, I arrived too late. I am using Ubuntu 20.04 and the following worked for my file:
library(readr)
lists <- read_delim("LISTS.csv", ";",
                    escape_double = FALSE,
                    locale = locale(encoding = "ISO-8859-1"),
                    trim_ws = TRUE)
Good luck
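For UTF-8 data specifically, readr is a common workaround because it re-encodes the input itself instead of relying on the session locale. A minimal sketch (the file name is hypothetical):
library(readr)
df <- read_tsv("my_table.tsv", locale = locale(encoding = "UTF-8"))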

In R, Reading a PDF document [duplicate]

Is it possible to parse text data from PDF files in R? There does not appear to be a relevant package for such extraction, but has anyone attempted or seen this done in R?
In Python there is PDFMiner, but I would like to keep this analysis all in R if possible.
Any suggestions?
Linux systems have pdftotext, which I have had reasonable success with. By default, it creates foo.txt from a given foo.pdf.
That said, the text mining packages may have converters. A quick rseek.org search seems to concur with your crantastic search.
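If pdftotext is installed and on the PATH, it can be called straight from R with system(); the file name here is only illustrative:
system("pdftotext foo.pdf")   # writes foo.txt alongside foo.pdf
txt <- readLines("foo.txt")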
This is a very old thread, but for future reference: the pdftools R package extracts text from PDFs.
A colleague turned me on to this handy open-source tool: http://tabula.nerdpower.org/. Install, upload the PDF, and select the table in the PDF that requires data-ization. Not a direct solution in R, but certainly better than manual labor.
A purely R solution could be:
library(tm)

file <- "namefile.pdf"
Rpdf <- readPDF(control = list(text = "-layout"))
corpus <- VCorpus(URISource(file),
                  readerControl = list(reader = Rpdf))
corpus.array <- content(content(corpus)[[1]])
You'll then have the PDF's lines in a character vector.
install.packages("pdftools")
library(pdftools)
download.file("http://www.nfl.com/liveupdate/gamecenter/56901/DEN_Gamebook.pdf",
"56901.DEN.Gamebook", mode = "wb")
txt <- pdf_text("56901.DEN.Gamebook")
cat(txt[1])
The Tabula PDF table extractor app is built around a command-line Java application, tabula-extractor.
The R tabulizer package provides a wrapper that makes it easy to pass in the path to a PDF file and get the data in its tables extracted.
Tabula will have a good go at guessing where the tables are, but you can also tell it which part of a page to look at by specifying a target area of the page.
Data can be extracted from multiple pages, and a different area can be specified for each page, if required.
For an example use case, see: When Documents Become Databases – Tabulizer R Wrapper for Tabula PDF Table Extractor.
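A minimal tabulizer sketch (the file path is hypothetical); extract_tables() returns one matrix per detected table, and its pages/area arguments narrow the search as described above:
library(tabulizer)
tables <- extract_tables("namefile.pdf")
str(tables[[1]])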
I used an external utility to do the conversion and called it from R. All the files had a leading table with the desired information.
Set the path to pdftotext.exe and convert each PDF to text:
exeFile <- "C:/Projects/xpdfbin-win-3.04/bin64/pdftotext.exe"
for(i in 1:length(pdfFracList)){
fileNumber <- str_sub(pdfFracList[i], start = 1, end = -5)
pdfSource <- paste0(reportDir,"/", fileNumber, ".pdf")
txtDestination <- paste0(reportDir,"/", fileNumber, ".txt")
print(paste0("File number ", i, ", Processing file ", pdfSource))
system(paste(exeFile, "-table" , pdfSource, txtDestination, sep = " "), wait = TRUE)
}
