I just encountered something baffling (at least to me) and hope wiser members can shed some light.
I used RStudio 0.98.490 on Windows XP to save a plot to PNG. The filename was created using strwrap(sprintf()). I habitually use longer (i.e. more informative) filenames, and I guess I was probably over-enthusiastic this time, which is why this issue surfaced.
I noticed that when the width of the RStudio console is shorter than the length of the filename during run-time, the latter gets truncated and the file created does not have the .png extension. I experimented and dragged the width of RStudio console to longer than the filename - the problem disappears.
My question: why does this happen? More importantly, can I resolve this truncation without changes to my filename? I am a newbie to R and I can't see why 2 seemingly unrelated items should interact.
Truncation shown below:
> writeLines( paste0(FName, " generated") ) # Write to Console
aaaaaa aaaaaaaaaaaaaabcdef ghijk lmnopqrstuvaaaaaa aaaaaaaaaaaaa213424534aaaaaa generated
aaaaaaaaaaaaa.png generated
>
Sample code is attached below:
astring <- "aaaaaa aaaaaaaaaaaaa"
FName <- strwrap( sprintf("%sabcdef ghijk lmnopqrstuv%s213424534%s.png",
astring, astring, astring) ) # simulate long filename
png( filename = FName)
a <- rnorm(100)
b <- rnorm(100)*2
plot(b,a)
dev.off()
writeLines( paste0(FName, " generated") ) # Write to Console
The closest resource I found was https://stackoverflow.com/questions/6104448/preserving-long-comments-in-console-output-not-falling-victim-to-truncat but the problem faced by the author appeared slightly different.
I would appreciate very much if someone can enlighten. Thanks!
EDIT: Thanks to #jlhoward, I looked up strwrap() and found the width parameter. By assigning '255' (or any big integer), the problem is resolved.
Why are you using strwrap(...)?
As the documentation explains, strwrap(...) parses your input into words, and then wraps (by inserting "\n") based on a width parameter. The default for this parameter is getOption("width"), which is based on the console width. Try typing
getOption("width")
then shrink or expand your console window and do it again.
If you just use sprintf(...) to generate your filename, you don't have this problem.
Related
Hi I am trying to do text mining in R version 3.4.2
I am trying to import .txt files from local drive using VCorpus command.
But after Run following code
cname <- file.path("C:", "texts")
cname
dir(cname)
library(readr)
library(tm)
docs <- VCorpus(DirSource(cname))
summary(docs)
inspect(docs[1])
writeLines(as.character(docs[1]))
Output:
Well, the election, it came out really well. Next time we**’**ll triple the number and so on
’ its originally aporstophe(')s now how can i convert or get original text in Rstudio?
Please it will appreciate if someone help me
Thanks in Advance
Encoding issues are not easy to solve, since they depend on various factors (file ecnoding, encoding settings during loading, etc.). As a first step you might try the following line, if we are lucky it solves your problem.
Encoding(your_text) <- "UTF-8"
Otherwise, other solutions have to be chekced, e.g., using stri_trans from stringi package or replacing wrong symbols with brute force via gsub(falsecharacter, desiredcharacter, fixed = TRUE) (there are debugging tables, e.g., on i18nqa.com).
I solved this a different way.
I found that apostrophes that looked like this: ' would render properly, while ones that looked slightly different, like this: ’ would not.
So, for any text that I was printing, I converted ’ to ' like this:
mytext <- gsub("’", "'", mytext )
Tada... no more issues with "’".
trying to run this function within a function based loosely off of this, however, since xPDF can convert PDFs to PNGs, I skipped the ImageMagick conversion step, as well as the faulty logic with the function(i) process, since pdftopng requires a root name and that is "ocrbook-000001.png" in this case and throws an error when looking for a PNG of the original PDF's file name.
My issue is now with getting Tesseract to do anything with my PNG files. I get the error:
Tesseract Open Source OCR Engine v3.05.01 with Leptonica
Error in pixCreateNoInit: pix_malloc fail for data
Error in pixCreate: pixd not made
Error in pixReadStreamPng: pix not made
Error in pixReadStream: png: no pix returned
Error in pixRead: pix not read
Error during processing.
Here is my code:
lapply(myfiles, function(i){
shell(shQuote(paste0("pdftopng -f 1 -l 10 -r 600 ", i, " ocrbook")))
mypngs <- list.files(path = dest, pattern = "png", full.names = TRUE)
lapply(mypngs, function(z){
shell(shQuote(paste0("tesseract ", z, " out")))
file.remove(paste0(z))
})
})
The issue was the DPI set too high for Tesseract to handle, apparently. Changing the PDFtoPNG DPI parameter from 600 to 150 appears to have corrected the issue. There seems to be a max DPI for Tesseract to understand and know what to do.
I have also corrected my code from a static naming convention to a more dynamic one that mimics the file's original names.
dest <- "C:\\users\\YOURNAME\\desktop"
files <- tools::file_path_sans_ext(list.files(path = dest, pattern = "pdf", full.names = TRUE))
lapply(files, function(i){
shell(shQuote(paste0("pdftoppm -f 1 -l 10 -r 150 ", i,".pdf", " ",i)))
})
myppms <- tools::file_path_sans_ext(list.files(path = dest, pattern = "ppm", full.names = TRUE))
lapply(myppms, function(y){
shell(shQuote(paste0("magick ", y,".ppm"," ",y,".tif")))
file.remove(paste0(y,".ppm"))
})
mytiffs <- tools::file_path_sans_ext(list.files(path = dest, pattern = "tif", full.names = TRUE))
lapply(mytiffs, function(z){
shell(shQuote(paste0("tesseract ", z,".tif", " ",z)))
file.remove(paste0(z,".tif"))
})
Background
It sounds like you already solved your problem. Yay! I'm writing this answer because I encountered a very similar problem calling tesseract from R and wanted to share some of the workarounds I came up with in case anyone else stumbles across the post and needs further troubleshooting ideas.
In my case I was converting a bunch of faxes (about 3000 individual pdf files, most of them between 1-15 pages) to text. I used an apply function to make the text from each individual fax as a separate entry in a list (length = number of faxes = ~ 3000). Then the faxes were converted to a vector and then that vector was combined with a vector of file names to make a data frame. Finally I wrote the data frame to a csv file. (See below for the code I used).
The problem was I kept getting the same string of errors that you got:
Tesseract Open Source OCR Engine v3.05.01 with Leptonica
Error in pixCreateNoInit: pix_malloc fail for data
Error in pixCreate: pixd not made
Error in pixReadStreamPng: pix not made
Error in pixReadStream: png: no pix returned
Error in pixRead: pix not read
Error during processing.
Followed by this error: error in FUN(X[[i]], ...) : basic_string::_M_construct null not valid
What I think the problem is
What was weird for me was that I re-ran the code multiple times and it was always a different fax where the error occurred. It seemed to also occur more often when I was trying to do something else that used a lot of RAM or CPU (opening microsoft teams etc.). I tried changing the DPI as suggested in the first answer and that didn't seem to work.
It was also noticeable that while this code was running I was regularly using close to 100% of RAM and 50% of CPU (based on windows task manager).
When I ran this process (on a similiar batch of about 3,000 faxes) on linux machine with significantly more RAM and CPU I never encountered this problem.
basic_string::_M_construct null not valid, appears to be a c++ error. I'm not familiar with c++, but it sort of sounds like it's a bit of a catch all error that might indicate something that should have been created wasn't created.
Based on all that, I think the problem is that R runs out of memory and in response somehow the memory available to some of the underlying tesseract processes gets throttled off. This means there's not enough memory to convert a pdf to a png and then extract the text which is what throws these errors. This leads to a text blob not getting created where one is expected and the final C++ error of : basic_string::_M_construct null not valid It's possible that lowering the dpi is what gave your process enough memory to complete, but maybe the fundamental underlying problem was the memory not the DPI.
Possible workarounds
So, I'm not sure about any of what I just said, but running with that assumption, here's some ideas I came up with for people running the tesseract package in R who encounter similar problems:
Switch from Rstudio to Rgui: This alone solved my problem. I was able to complete the whole 3000 fax process without any errors using Rgui. Rgui also used between 100-400 MB instead 1000+ that Rstudio used, and about 25% of CPU instead of 50%. Putting R in the path and running R from the console or running R in the background might reduce memory use even further.
Close any memory intensive processes while the code is running. Microsoft teams, videoconferencing, streaming, docker on windows and the windows linux subsystem are all huge memory hogs.
lower DPI As suggested by the first answer, this would also probably reduce memory use.
break the process up. I think running my processes in batches of about 500 might have also reduced the amount of working memory R has to take up before writing to file.
These are all quick and easy solutions that can be done from R without having to learn C++ or upgrade hardware. A more durable solution probably would require doing more customization of the tesseract parameters, implementing the process in C++, changing memory allocation settings for R and the operating system, or buying more RAM.
Example Code
# Load Libraries
library(tesseract)
dir.create("finished_data")
# Define Functions
ocr2 <- function(pdf_path){
# tell tesseract which language to guess
eng <- tesseract("eng")
#convert to png first
#pngfile <- pdftools::pdf_convert(pdf_path, dpi = 300)
# tell tesseract to convert the pdf at pdf_path
seperated_pages <- tesseract::ocr(pdf_path, engine = eng)
#combine all the pages into one page
combined_pages <- paste(seperated_pages, collapse = "**new page**")
# I delete png files as I go to avoid overfilling the hard drive
# because work computer has no hard drive space :'(
png_file_paths <- list.files(pattern = "png$")
file.remove(png_file_paths)
combined_pages
}
# find pdf_paths
fax_file_paths <- list.files(path="./raw_data",
pattern = "pdf$",
recursive = TRUE)
#this converts all the pdfs to text using the ocr
faxes <- lapply(paste0("./raw_data/",fax_file_paths),
ocr2)
fax_table <- data.frame(file_name= fax_file_paths, file_text= unlist(faxes))
write.csv(fax_table, file = paste0("./finished_data/faxes_",format(Sys.Date(),"%b-%d-%Y"), "_test.csv"),row.names = FALSE)
I want my "main" and my "sub" to contain the Euro-Sign "€".
Rstudio is displaying the Euro-Sign perfectly, meanwhile my saved pdf-Plot is displaying "..." instead of "€".
Can you help me? What am I doing wrong?
Here's a minimal example:
Euro <- "\u20AC"
pdf("test.pdf")
plot(1:10,main=paste("Gained money in",Euro,sep=" "),
xlab="Day",ylab=paste("Money in",Euro,sep=" "))
I've already found out that the problem appears to be the pdf-format, because png is working totally fine. Converting it later on to pdf with "convert" in bash is possible, but the actual plot needs the high pdf-resolution.
Thanks!
I am learning R programming and I have tried to reproduce code from several examples (typing it myslef into RStudio console) and sometimes mine would give an error for no apparent reason.
I found out that if I manually type the entire code in RStudio it gives me "rror: unexpected input in "first_line_of_code
If right after I copy exactly that same code from the console (pressing up arrow, selecting and copying) and then paste it in the same place, it works.
Why is that happening and how can I solve it?
Example:
h<- c()
for (i in 1:10) {
h <- append (h,i^2)
}
h
Error is "rror: unexpected input in "h<- c()
Thanks for the replies.
The code I posted is just an example of the multiple codes that gave me an error when I type it, but works once I copy paste the same code that I typed.
I my main concern is to figure out why that happens and how to solve it since now every time I get an error I have to copy paste it just to be sure that I actually did something wrong.
Any idea about that?
What are you trying to learn here? There are much better ways of creating an object with no content, e.g.
h <- vector('numeric')
And if you want to "build" a vector, always start by creating an empty vector of desired length. There's a huge time penalty if you re-size it every time thru a loop.
What you should be doing, I suspect, is:
h <- (1:10)^2
I can't replicate the error you reported exactly, but I can get close to it...
It think it is because you're copying another symbol when you're copying across the code, such as a quotation mark. If you have a look at your console, when you enter the text the first time (with the erroneous " at the front), it gives you a + symbol where you type, instead of the usual >:
> "h<- c()
+ for (i in 1:10) {
+ h <- append (h,i^2)
+ }
+ h
+
Then pressing the up arrow and running the same thing again actually closes the quote, and lets the whole thing run again, adding in the error.
+ "h<- c()
Error: unexpected symbol in:
"
"h"
> for (i in 1:10) {
+ h <- append (h,i^2)
+ }
> h
If it's not this, try copying each line across one at a time. This will tell you which line R is having an issue with. Hope that helps :)
Is it possible to use a prefix when specifying a filepath string in R to ignore escape characters?
For example if I want to read in the file example.csv when using windows, I need to manually change \ to / or \\. For example,
'E:\DATA\example.csv'
becomes
'E:/DATA/example.csv'
data <- read.csv('E:/DATA/example.csv')
In python I can prefix my string using r to avoid doing this (e.g. r'E:\DATA\example.csv'). Is there a similar command in R, or an approach that I can use to avoid having this problem. (I move between windows, mac and linux - this is just a problem on the windows OS obviously).
You can use file.path to construct the correct file path, independent of operating system.
file.path("E:", "DATA", "example.csv")
[1] "E:/DATA/example.csv"
It is also possible to convert a file path to the canonical form for your operating system, using normalizePath:
zz <- file.path("E:", "DATA", "example.csv")
normalizePath(zz)
[1] "E:\\DATA\\example.csv"
But in direct response to your question: I am not aware of a way to ignore the escape sequence using R. In other words, I do not believe it is possible to copy a file path from Windows and paste it directly into R.
However, if what you are really after is a way of copying and pasting from the Windows Clipboard and get a valid R string, try readClipboard
For example, if I copy a file path from Windows Explorer, then run the following code, I get a valid file path:
zz <- readClipboard()
zz
[1] "C:\\Users\\Andrie\\R\\win-library\\"
It is now possible with R version 4.0.0. See ?Quotes for more.
Example
r"(c:\Program files\R)"
## "c:\\Program files\\R"
If E:\DATA\example.csv is on the clipboard then do this:
example.csv <- scan("clipboard", what = "")
## Read 1 item
example.csv
## [1] "E:\\DATA\\example.csv"
Now you can copy "E:\\DATA\\example.csv" from the above output above onto the clipboard and then paste that into your source code if you need to hard code the path.
Similar remarks apply if E:\DATA\example.csv is in a file.
If the file exists then another thing to try is:
example.csv <- file.choose()
and then navigate to it and continue as in 1) above (except the file.choose line replaces the scan statement there).
Note that its not true that you need to change the backslashes to forward slashes for read.csv on Windows but if for some reason you truly need to do that translation then if the file exists then this will translate backslashes to forward slashes (but if it does not exist then it will give an annoying warning so you might want to use one of the other approaches below):
normalizePath(example.csv, winslash = "/")
and these translate backslashes to forward slashes even if the file does not exist:
gsub("\\", "/", example.csv, fixed = TRUE)
## [1] "E:/DATA/example.csv"
or
chartr("\\", "/", example.csv)
## [1] "E:/DATA/example.csv"
In 4.0+ the following syntax is supported. ?Quotes discusses additional variations.
r"{E:\DATA\example.csv}"
EDIT: Added more info on normalizePath.
EDIT: Added (4).
A slightly different approach I use with a custom made function that takes a windows path and corrects it for R.
pathPrep <- function() {
cat("Please enter the path:\\n\\n")
oldstring <- readline()
chartr("\\\\", "/", oldstring)
}
Let's try it out!
When prompted paste the path into console or use ctrl + r on everything at once
(x <- pathPrep())
C:/Users/Me/Desktop/SomeFolder/example.csv
Now you can feed it to a function
shell.exec(x) #this piece would work only if
# this file really exists in the
# location specified
But as others pointed out what you want is not truly possible.
No, this is not possible with R versions before 4.0.0. Sorry.
I know this question is old, but for people stumbling upon this question in recent times, wanted to share that with the latest version R4.0.0, it is possible to parse in raw strings. The syntax for that is r"()". Note that the string goes in the brackets.
Example:
> r"(C:\Users)"
[1] "C:\\Users"
Source: https://cran.r-project.org/doc/manuals/r-devel/NEWS.html
jump to section: significant user-visible changes.
Here's an incredibly ugly one-line hack to do this in base R, with no packages necessary:
setwd(gsub(", ", "", toString(paste0(read.table("clipboard", sep="\\", stringsAsFactors=F)[1,], sep="/"))))
Usable in its own little wrapper function thus (using suppressWarnings for peace of mind):
> getwd()
[1] "C:/Users/username1/Documents"
> change_wd=function(){
+ suppressWarnings(setwd(gsub(", ", "", toString(paste0(read.table("clipboard", sep="\\", stringsAsFactors=F)[1,], sep="/")))))
+ getwd()
+ }
Now you can run it:
#Copy your new folder path to clipboard
> change_wd()
[1] "C:/Users/username1/Documents/New Folder"
To answer the actual question of "Can I parse raw-string in R without having to double-escape backslashes?" which is a good question, and has a lot of uses besides the specific use-case with the clipboard.
I have found a package that appears to provide this functionality:
https://github.com/trinker/pathr
See "win_fix".
The use-case specified in the docs is exactly the use-case you just stated, however I haven't investigated whether it handles more flexible usage scenarios yet.