Error in R Tesseract - r

I have the R Tesseract package working with the default eng.traineddata under OSX, but it simply won't find other languages.
trial <- ocr("test.png", engine = tesseract(language = "jpn", datapath="/Users/histmr/Library/R/3.3/library/tesseract/tessdata"))
Generates the error:
Failed loading language 'jpn'
Tesseract couldn't load any languages!
Error in tesseract_engine_internal(datapath, language) :
Unable to find training data for: jpn
I've checked with
tesseract_info()
$datapath
[1] "/Users/histmr/Library/R/3.3/library/tesseract/tessdata/"
$available
[1] "eng" "jpn"
$version
[1] "3.05.00"
Sometimes I get references to a "TESSDATA_PREFIX environment variable" but I don't know where that is. How can I get the correct directory path (I can see the file in the directory) or edit the "TESSDATA_PREFIX environment variable"?
The problem seems to occur with Japanese but NOT French
tesseract_download("fra")
french <- tesseract("fra")
Works fine! But
tesseract_download("jpn")
japanese <- tesseract("jpn")
Generates an error

The error message Error in tesseract_engine_internal(datapath, language) means that the language file, in your case jpn.traineddata, was not found under TESSDATA_PREFIX, the environment variable that points to the directory holding the trained language data. If you haven't set it, you can open a terminal and run the command below.
export TESSDATA_PREFIX=/Users/histmr/Library/R/3.3/library/tesseract/tessdata/
Hope this helps.
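A variable exported in a terminal is only visible to programs started from that terminal, so an R session launched another way may not see it. As a minimal sketch, the same variable can be inspected and set from inside R with base functions. The path is the one reported by tesseract_info() in the question; whether a given version of the tesseract package consults TESSDATA_PREFIX is an assumption to verify.

```r
# Path taken from the question's tesseract_info() output.
datapath <- "/Users/histmr/Library/R/3.3/library/tesseract/tessdata/"

# Show the current value (NA if the variable is not set in this session).
Sys.getenv("TESSDATA_PREFIX", unset = NA)

# Set it for the current R session only.
Sys.setenv(TESSDATA_PREFIX = datapath)

# Confirm that the language file is really where R will look.
file.exists(file.path(datapath, "jpn.traineddata"))
```

Setting the variable before calling library(tesseract) is the safer order, since a package may read it only once when it loads.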

One possible problem is multiple installs of Tesseract (I used Homebrew and MacPorts) creating multiple TESSDATA folders. Strangely R was happier with a seemingly identical folder, but in a different place closer to root, ordinarily hidden under OSX. I got things working with
export TESSDATA_PREFIX=/opt/local/share
I hope this helps

Related

Custom .traineddata file usage in the tesseract in R

I have a bunch of .JPGs with some text at the bottom, which consists mostly (but not exclusively) of numbers. I wish to use the tesseract package in R to 'read' the text in those .JPGs. Unfortunately, the base tesseract language proved too inaccurate to be worth using. Subsequently I tried using the magick package to adjust the pictures (crop, resize, convert, etc.), hoping to get a better reading from tesseract, but in my case this failed to give satisfactory results.
I eventually managed to use the description on this link (https://towardsdatascience.com/simple-ocr-with-tesseract-a4341e4564b6) to create a new custom language in Tesseract 4.1.1 (as downloaded from https://github.com/tesseract-ocr/tesseract), which I named font_name.traineddata. The custom-made font_name.traineddata works perfectly on the Tesseract 4.1.1 console and shows significant improvement in results on the base language.
The question I have is: how do I get the font_name.traineddata file to be used by the ocr command in R? I have tried the simple solution of just pasting the font_name.traineddata file into the appropriate tessdata folder in the package tesseract (the same folder that also contains the standard English data file called eng.traineddata) and then trying the following:
font_name <- tesseract ("font_name")
ocr("C:/1.jpg", engine = font_name)
This does not work and gives the error :
Error in tesseract_engine_internal(datapath, language, configs, opt_names, :
Unable to find training data for: font_name. Please consult manual for: ?tesseract_download
tesseract_download seems to be of no use, as it is a helper function to download training data from the official tessdata repository. I have also tried renaming the file to a three character name, with the same error.
Does anybody have any suggestions on how to make custom .traineddata files work with ocr in R?
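One thing worth checking first is whether the file sits in the directory the engine is actually told to search. The sketch below uses a hypothetical helper, list_traineddata(), to list the language names available in a folder, and then shows (commented out) how the datapath argument of tesseract(), as used in the first question above, could point at that folder. The folder path is a placeholder and an assumption.

```r
# Hypothetical helper: language names derived from the *.traineddata files in a folder.
list_traineddata <- function(datapath) {
  files <- list.files(datapath, pattern = "\\.traineddata$")
  sub("\\.traineddata$", "", files)
}

# Hypothetical usage (placeholder path; requires the tesseract package):
# datapath <- "C:/path/to/custom/tessdata"
# if ("font_name" %in% list_traineddata(datapath)) {
#   font_name <- tesseract::tesseract(language = "font_name", datapath = datapath)
#   tesseract::ocr("C:/1.jpg", engine = font_name)
# }
```

If the name does not appear in the listing, the file is not in the folder being searched, which would be consistent with the "Unable to find training data" message.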

Unable to import previously working SAS-formats files using R-package 'haven'

Around a year ago, I used the 'haven'-package to import two .sas7bdat files along with their respective .sas7bcat formats and it worked wonderfully.
For some reason, however, it does not any longer even though all the SAS-files incl. format files have remained unchanged since then.
When I try running the code now, R gives me the following error:
Error in df_parse_sas_file(spec_data, spec_cat, encoding = encoding,
catalog_encoding = catalog_encoding, : Failed to parse P:/SAS
files/formats.sas7bcat: Invalid file, or file has unsupported features.
R and the 'haven'-package have been reinstalled to their newest versions since the first time when it worked, so I imagine that this might be the reason since all the SAS-files and the code remains unchanged.
For this reason, I tried to reinstall the old version of 'haven' but cannot since this apparently requires a manual installation of 'Rtools' which is not allowed on my computer, so I am a bit stuck here.
Any suggestions will be greatly appreciated, thanks.
A potential workaround is the sas7bdat package, which can also read SAS files. I don't know how much extra work this might involve for you, though.
You can read in a dataset with the code
library(sas7bdat)
read.sas7bdat("filename.sas7bdat")

Text mining with tm in R antiword error

So I'm rather new to R, and I'm learning how to mine text from this handy website: https://eight2late.wordpress.com/2015/05/27/a-gentle-introduction-to-text-mining-using-r/
I do have my own text set of .doc, .docx, and .xlsx files and I'm trying to mine them. They're located in a folder in my working directory called 'files', but I have already encountered an error after simply writing a few lines of code.
The code I have so far is:
library(tm)
library(readtext)
data = readtext('files')
At this point, after waiting for 25 seconds or so, I get the error:
Error: System call to 'antiword' failed (1): The Big Block Depot is damaged
and the code stops running there.
I have tried searching online for solutions but it seems like a fairly rare error and so I only found 1 possible solution at https://github.com/ropensci/antiword/issues/1 but that did not work for me.
This solution suggested that one of my files was corrupt, and recommended using the code
fixInNamespace(antiword, pos="package:antiword")
to change the error to a warning to not interrupt the reading of the files. I tried that, and at first it raised the error of
Error in as.environment(pos):
no item called "package:antiword" on the search list
After which, I loaded the antiword library with a library(antiword) and changed the stop( to a warning(. However, when I ran the data = readtext('files') line again, it immediately raised the error
Error in is_windows() : could not find function "is_windows"
I'm at a loss here! Any help would be appreciated. Should I be using another package in this case?
I had the same problem with my code, where I tried to read a .doc file into R. I also used the readtext library. What helped me was converting the Word documents I was trying to read from .doc to .docx. When I ran the same code afterwards, it worked.
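Where converting every file is not practical, a rough way to locate the problem file is to read the files one at a time and turn each failure into a warning instead of an error. The sketch below defines a hypothetical helper, read_each() (not part of readtext or antiword), that takes the reader function as an argument; passing readtext::readtext to it is an assumption about how it would be used here.

```r
# Hypothetical helper: apply `reader` to each path, warn on failure, keep the successes.
read_each <- function(paths, reader) {
  results <- lapply(paths, function(p) {
    tryCatch(
      reader(p),
      error = function(e) {
        warning("Skipping ", p, ": ", conditionMessage(e), call. = FALSE)
        NULL
      }
    )
  })
  names(results) <- paths
  # Drop the entries that failed to read.
  results[!vapply(results, is.null, logical(1))]
}

# Hypothetical usage with the folder from the question (requires readtext):
# docs <- read_each(list.files("files", full.names = TRUE), readtext::readtext)
```

The warnings name the files that could not be read, which narrows down which documents to convert or repair.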

The cause of "bad magic number" error when loading a workspace and how to avoid it?

I tried to load my R workspace and received this error:
Error: bad restore file magic number (file may be corrupted) -- no data loaded
In addition: Warning message:
file ‘WORKSPACE_Wedding_Weekend_September’ has magic number '#gets'
Use of save versions prior to 2 is deprecated
I'm not particularly interested in the technical details, but mostly in how I caused it and how I can prevent it in the future. Here's some notes on the situation:
I'm running R 2.15.1 on a MacBook Pro running Windows XP on a bootcamp partition.
There is something obviously wrong with this workspace file, since it weighs in at only ~80kb while all my others are usually >10,000kb.
Over the weekend I was running an external modeling program in R and storing its output to different objects. I ran several iterations of the model over the course of several days, e.g. output_Saturday <- call_model()
There is nothing special about the model output, it's just a list with slots for betas, VC-matrices, model specification, etc.
I got that error when I accidentally used load() instead of source() or readRDS().
Also worth noting the following from a document by the R Core Team summarizing changes in versions of R after v3.5.0:
R has new serialization format (version 3) which supports custom serialization of
ALTREP framework objects... Serialized data in format 3 cannot be read by versions of R prior to version 3.5.0.
I encountered this issue when I saved a workspace in v3.6.0, and then shared the file with a colleague that was using v3.4.2. I was able to resolve the issue by adding "version=2" to my save function.
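As a minimal sketch of that fix, the version argument of save() selects the older serialization format. The object and file names below are made up for illustration.

```r
# Illustrative object; any R object works the same way.
results <- list(betas = c(0.5, 1.2), note = "example")

# Write the workspace file in serialization format 2,
# which versions of R before 3.5.0 can still read.
path <- tempfile(fileext = ".RData")
save(results, file = path, version = 2)

# Load it back into a separate environment to check the round trip.
restored <- new.env()
load(path, envir = restored)
```

saveRDS() accepts the same version argument for single-object files.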
Assuming your file is named "myfile.ext"
If the file you're trying to load is not an R-script, for which you would use
source("myfile.ext")
you might try the readRDS function and assign the result to a variable name:
my.data <- readRDS("myfile.ext")
The magic number comes from UNIX-type systems where the first few bytes of a file held a marker indicating the file type.
This error indicates you are trying to load a non-valid file type into R. For some reason, R no longer recognizes this file as an R workspace file.
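To see what R sees, the first bytes of a file can be read directly. This is a rough sketch with hypothetical helper names (read_magic, looks_like_workspace); it assumes that files written by save() start with "RD" once decompressed, which matches the check load() performs but is worth confirming for your R version.

```r
# Hypothetical helper: return the first five bytes of a file as a raw vector.
# gzfile() transparently handles both compressed and uncompressed files.
read_magic <- function(path) {
  con <- gzfile(path, "rb")
  on.exit(close(con))
  readBin(con, what = "raw", n = 5L)
}

# Hypothetical helper: TRUE if the file starts like a save() workspace file.
looks_like_workspace <- function(path) {
  magic <- read_magic(path)
  length(magic) >= 2L && identical(magic[1:2], charToRaw("RD"))
}
```

A file for which looks_like_workspace() returns FALSE is likely a script, an .rds file, or something else entirely, and load() will reject it.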
Install the readr package, then use library(readr).
It also occurs when you try to load() an rds object instead of using
object <- readRDS("object.rds")
I got the error when I had saved the file with saveRDS() rather than save(). Saving with save() instead, e.g. save(iris, file="data/iris.RData"), fixed the issue for me.
Also note that with save() / load() the object is loaded in with the same name it was initially saved with (i.e., you can't rename it until it's already loaded into the R environment under the name it had when you initially saved it).
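A small sketch of the two pairings side by side, using temporary files and an illustrative object:

```r
scores <- c(a = 1, b = 2)

rdata_file <- tempfile(fileext = ".RData")
rds_file <- tempfile(fileext = ".rds")

# save() pairs with load(): the object comes back under its original name.
save(scores, file = rdata_file)
env <- new.env()
loaded_names <- load(rdata_file, envir = env)  # returns the names it restored

# saveRDS() pairs with readRDS(): you choose the name when reading.
saveRDS(scores, rds_file)
renamed <- readRDS(rds_file)
```

Mixing the pairs, such as calling load() on a file written by saveRDS(), is what produces the magic number error.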
I had this problem when I saved the Rdata file in an older version of R and then I tried to open in a new one. I solved by updating my R version to the newest.
If you are working with devtools try to save the files with:
devtools::use_data(x, internal = TRUE)
Then, delete all files saved previously.
From doc:
internal If FALSE, saves each object in individual .rda files in the data directory. These are available whenever the package is loaded. If
TRUE, stores all objects in a single R/sysdata.rda file. These objects
are only available within the package.
This error occurred when I updated my R and RStudio versions and loaded files I had created under my prior version. So I reinstalled my prior R version and everything worked as it should.

Where is the .R script file located on the PC?

I want to find the location of the script .R files which are used for computation in R.
I know that by typing the object function, I will get the code which is running and then I can copy and edit and save it as a new script file and use that.
The reason for asking to find the foo.R file is
Curiosity
Know what is the algorithm used in the numerical computations
More immediately, the function from the stats package I am using returns results for two of its arguments but not for the others, and I have to figure out how to make it work.
Error shown by R implies that there might be some modification required in the script file.
I am looking for a more general answer, if it's possible.
Edit: As per the comments so far, here is the code to compute spectrum of a time series using autoregressive methods. The data input is a univariate series.
x = ts(data)
spec.ar(x, method = "yule-walker")  # command 1
spec.ar(x, method = "burg")         # command 2
command 1 is running ok.
command 2 gives the following error.
Error in ar.burg.default(x, aic = aic, order.max = order.max, na.action = na.action, :
Burg's algorithm only implemented for univariate series
I did try specifying all the arguments correctly, like na.action=na.fail, order.max = NULL, etc., but the message is the same.
Kindly suggest possible solutions.
P.S. (This question is posted after searching the library folder where R is installed and zip files which come with packages, manuals, and opening .rdb, .rdx files)
See FAQ 7.40 How do I access the source code for a function?
In most cases, typing the name of the function will print its source
code. However, code is sometimes hidden in a namespace, or compiled.
For a complete overview on how to access source code, see Uwe Ligges
(2006), “Help Desk: Accessing the sources”, R News, 6/4, 43–45
(http://cran.r-project.org/doc/Rnews/Rnews_2006-4.pdf).
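As a short sketch of what that looks like for the functions in the question, the calls below use only base R (stats and utils). ar.burg.default is the method named in the error message above; the variable name burg_default is just for illustration.

```r
# Exported functions print their source when typed without parentheses.
stats::spec.ar

# List the S3 methods that sit behind a generic.
methods("ar.burg")

# Methods hidden in a namespace can be found with getAnywhere()
# or fetched as a function object with getS3method().
getAnywhere("ar.burg.default")
burg_default <- getS3method("ar.burg", "default")
```

Printing burg_default then shows the R-level code that raises the Burg error, although any compiled code it calls is only visible in the source package.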
When R installs a package, it evaluates all the ".R" source files and re-saves them into a binary format for faster loading. Therefore you typically cannot easily find the source file.
As has been suggested elsewhere, you can simply type the function name and see the source code, or download the source package and find the source there.
library(plyr)
ddply # prints the source for ddply
# See the content of the R directory for plyr,
# but it's only binary files:
dir(file.path(find.package("plyr"), "R"))
# [1] "plyr" "plyr.rdb" "plyr.rdx"
# Get the source for the package:
download.packages("plyr", "~", type="source")
# ...then unpack and inspect the R directory...
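The unpacking step can also be done from R itself with untar() from utils. The sketch below wraps it in a hypothetical helper, list_package_sources(), which extracts a source tarball and lists the .R files it contains. The download lines are commented out because they need network access.

```r
# Hypothetical helper: extract a source tarball and list the R source files inside it.
list_package_sources <- function(tarball, exdir = tempfile("src")) {
  dir.create(exdir, showWarnings = FALSE, recursive = TRUE)
  untar(tarball, exdir = exdir)
  list.files(exdir, pattern = "\\.[Rr]$", recursive = TRUE, full.names = TRUE)
}

# Hypothetical usage (requires network access):
# dl <- download.packages("plyr", destdir = tempdir(), type = "source")
# list_package_sources(dl[1, 2])  # second column holds the downloaded file path
```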
.libPaths() should tell you all of your current library locations. It's possible to have more than one installation of a package if there are two libraries but only the one that is in the first library will be used. Unless you offer the code and the exact error message, it's not likely that anyone will be able to offer better advice.
I think you are asking to see what I call the source code for a function in a package. If so, the way I do it is as follows, which has worked successfully for me on the three times I have tried. I keep these instructions handy in a few places and just copied and pasted them here:
To see the source code for a function in Program R download the package containing the function. Specifically, download the file that ends in "tar.gz". This is a compressed file. Expand the compressed file using, for example, "WinZip". Now you need to open the uncompressed file that ends in ".tar". Download the free software "7-Zip". Click on the file "7zFM.exe" and navigate to the directory containing the ".tar" file. You can extract the contents of that ".tar" file into a new folder. The contents consist of R files showing the source code for the functions in the R package.
EDIT:
Today (July 8, 2012) I was able to open the 'tar.gz' file using the latest version of 'WinZIP' and could copy the contents (the source code) from there without having to use '7-Zip'.
EDIT:
Today (January 19, 2013) I viewed the source code for functions in base R by downloading the file
'R-2.15.2.tar.gz'
To download that file go to the http://cran.at.r-project.org/ webpage and click on that file in this line:
"The latest release (2012-10-26, Trick or Treat): R-2.15.2.tar.gz, read what's new in the latest version."
Unzip the file. WinZip will work, or it did for me. Then search your computer for readtable.r or another base R function.
agstudy noted here https://stackoverflow.com/questions/14417214/source-file-for-r-function that source code for read.csv is located in the file readtable.r, so do not expect every base R function to have its own file.
