udpipe_accuracy() always gives the same error " The CoNLL-U line '....' does not contain 10 columns!"

This is regarding the R package udpipe for NLP. I am using it to tokenize, tag, lemmatize and perform dependency parsing on text files.
I am not sure what format the CoNLL-U file needs to have for the function udpipe_accuracy.
I loaded a CSV file with 10 columns, but the error persists.
I could not find any questions on SO about this package, and there is no udpipe tag either.

udpipe_accuracy is used in combination with udpipe_train.
If you trained a custom udpipe model with udpipe_train based on data in conllu format, you can see how good it is by using udpipe_accuracy on hold-out conllu data which was not used to build the model.
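The "10 columns" in the error message refers to the CoNLL-U layout, in which every token line carries ten tab-separated fields (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC), so a comma-separated CSV would likely be rejected even if it has ten columns. Below is a minimal base-R sketch for checking a file before passing it to udpipe_accuracy. The helper name check_conllu_columns and the two-token sample sentence are made up for illustration and are not part of udpipe.

```r
# Hypothetical helper (not part of udpipe): returns the token lines that do
# NOT have exactly 10 tab-separated fields.
check_conllu_columns <- function(path) {
  lines <- readLines(path, encoding = "UTF-8", warn = FALSE)
  # Skip blank lines (sentence separators) and comment lines starting with '#'
  token_lines <- lines[nzchar(lines) & !startsWith(lines, "#")]
  n_fields <- lengths(strsplit(token_lines, "\t", fixed = TRUE))
  token_lines[n_fields != 10]
}

# A tiny, made-up CoNLL-U sample: 10 fields per token, separated by tabs
sample_conllu <- c(
  "# text = Hello world",
  paste(c("1", "Hello", "hello", "INTJ", "_", "_", "0", "root", "_", "_"),
        collapse = "\t"),
  paste(c("2", "world", "world", "NOUN", "_", "_", "1", "vocative", "_", "_"),
        collapse = "\t"),
  ""
)
conllu_file <- tempfile(fileext = ".conllu")
writeLines(sample_conllu, conllu_file)

# The same first token written as a comma-separated line, as in a CSV export
csv_file <- tempfile(fileext = ".csv")
writeLines("1,Hello,hello,INTJ,_,_,0,root,_,_", csv_file)

bad_in_conllu <- check_conllu_columns(conllu_file)  # no offending lines expected
bad_in_csv <- check_conllu_columns(csv_file)        # the CSV line is flagged
print(bad_in_conllu)
print(bad_in_csv)
```

If the check comes back empty for the hold-out file, that file would then be the file_conllu argument of udpipe_accuracy, together with a model loaded through udpipe_load_model.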

Related

Rendering a Quarto blog post trips an error when reading in a brms file object

First, I'll apologize for not having a fuller reproducible example, but I'm not entirely sure how to go about that given the various layers to the question/problem.
I'm moving a blog over from Blogdown to a new Quarto-based website and blog. I have three saved brms object files that I'm trying to read into a code chunk in one of the posts. The code chunks work fine when I run them manually, but when I try to render the blog post I get the following error:
Quitting from lines 75-86 (tables-modelsummary-brms.qmd)
Error in stri_replace_all_charclass(str, "[\\u0020\\r\\n\\t]", " ", merge = TRUE) :
invalid UTF-8 byte sequence detected; try calling stri_enc_toutf8()
Calls: .main ... stri_trim -> stri_trim_both -> stri_replace_all_charclass
Execution halted
I've checked the primary data frame contained in the brms model object and all of the character vectors there are valid UTF-8 vectors. These model objects can be quite large, so it's possible I'm missing something buried deep within the model object, but so far nothing is apparent.
I tried re-running the models again to ensure that the model objects' files weren't corrupted, and also to make sure that the encoding issue wasn't somehow introduced the last time they were run, which would have been on a Windows machine and a different version of brms.
I've also moved the brms files around to different directories to see if it's a file path issue. The same error comes up regardless of whether the files are in the same folder as the blog post qmd file or in a parent directory file I use for storing site data.
I've also migrated several other posts to the new Quarto site successfully, and some of them also contain R code, but it's all rendering without a problem.
Finally, I don't quite understand how to implement the suggested alternate function found in the error message either.
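As a rough illustration of what the error message's suggestion amounts to, here is a base-R sketch that finds strings that are not valid UTF-8 and re-encodes them. It uses validUTF8() and iconv() rather than stringi's stri_enc_toutf8(), and it assumes, purely for the sake of the example, that the offending strings are Latin-1. The sample vector is made up. Where such a string might sit inside a brms object is not something this sketch can tell.

```r
# Made-up example vector: the second element ends in byte 0xE9, which is
# "e with acute accent" in Latin-1 but not a valid UTF-8 sequence on its own.
x <- c("plain ascii", rawToChar(as.raw(c(0x63, 0x61, 0x66, 0xe9))))

# validUTF8() flags which strings are well-formed UTF-8
bad <- !validUTF8(x)
print(which(bad))

# Re-encode only the offending strings (assumption: they are Latin-1)
x[bad] <- iconv(x[bad], from = "latin1", to = "UTF-8")

# After conversion every element should pass the validity check
print(validUTF8(x))
```

The same validity check could be applied to any character vector pulled out of a larger object to narrow down where the invalid bytes come from.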

Custom .traineddata file usage in the tesseract in R

I have a bunch of .JPGs with some text at the bottom, which consists mostly (but not exclusively) of numbers. I wish to use the tesseract package in R to be able to 'read' the text in those .JPGs. Unfortunately, the base tesseract language proved too inaccurate to be worth using. Subsequently I tried using the Magick package to adjust the pictures (crop, resize, convert, etc.), hoping to get a better reading from tesseract, but in my case this failed to give satisfactory results.
I eventually managed to use the description on this link (https://towardsdatascience.com/simple-ocr-with-tesseract-a4341e4564b6) to create a new custom language in Tesseract 4.1.1 (as downloaded from https://github.com/tesseract-ocr/tesseract), which I named font_name.traineddata. The custom-made font_name.traineddata works perfectly on the Tesseract 4.1.1 console and shows a significant improvement in results over the base language.
The question I have is: How do I get the font_name.traineddata file to be part of the ocr command in R? I have tried the simple solution of just pasting the font_name.traineddata file into the appropriate tessdata folder in the package tesseract (the same folder that also contains the standard English data file called eng.traineddata) and then trying the following:
font_name <- tesseract ("font_name")
ocr("C:/1.jpg", engine = font_name)
This does not work and gives the error:
Error in tesseract_engine_internal(datapath, language, configs, opt_names, :
Unable to find training data for: font_name. Please consult manual for: ?tesseract_download
tesseract_download seems to be of no use, as it is a helper function to download training data from the official tessdata repository. I have also tried renaming the file to a three character name, with the same error.
Does anybody have any suggestions on how to make custom .traineddata files work with ocr in R?

How to include raw data in an R package

I'm working on the final assignment of the course Building R Packages.
In this assignment, we need to create an R package based on some example functions provided by the instructors. We need to organize and document the package, then make it available on GitHub. My package is called FARS and is already available in this GitHub repo.
I'm having trouble with making raw data available with the package. After following the instructions provided in the course's readings and also in chapter 14.3 of the book Building R Packages, the files are still not being recognized.
What did I do so far?
Prepared all the package's documentation, including roxygen2 tags, DESCRIPTION, README.Md, and vignette, following these steps in addition to instructions provided in the readings and book mentioned;
Created a subdirectory named inst/extdata in the package's directory;
Copied all three example files (.csv.bz2) with raw data to inst/extdata;
Tested the functions using testthat;
Installed my FARS package.
Now I'm trying to check if one of the files is available after installing the package:
system.file("extdata", "accident_2013.csv.bz2",
package = "FARS",
mustWork = TRUE)
I get an error message:
Error in system.file("extdata", "accident_2013.csv.bz2", package = "FARS", :
no file found
These data files need to be available with the package, so the examples provided in the vignette work properly.
Here's a "real-life" example, using a simple package I wrote recently.
I have a "data" directory in the build directory.
EDIT: To clarify the comments found in R-exts, the directory tree packagename/inst/extdata is intended for data that your functions call directly, by specifying that directory path. Since you want to load data into your workspace, use the data directory.
My "data" directory contains one file named preciseNumbersAsChar.r. That file contains assignments such as
charE <- {long number string}
If you read the help page for the command data, it explains that files ending in .r are sourced when called.
library(FunWithNumbers)
data('preciseNumbersAsChar') #works
Which is to say, the defined objects are now in my environment.
It's worth reading the help page for data in detail as different file types are handled slightly differently.
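For the inst/extdata route described in the question, one way to see what actually got installed is to list the installed extdata directory instead of asking for a single file. A minimal sketch follows. list_extdata is a hypothetical helper written for this example, and "FARS" is the package name taken from the question.

```r
# Hypothetical helper: list the files in a package's installed extdata folder.
list_extdata <- function(pkg) {
  # system.file() returns "" when the package or the directory is not found
  ext_dir <- system.file("extdata", package = pkg)
  if (!nzchar(ext_dir)) {
    return(character(0))
  }
  list.files(ext_dir)
}

# Empty if FARS is not installed or was installed without inst/extdata
installed_files <- list_extdata("FARS")
print(installed_files)
```

An empty result would suggest that the installed copy of the package does not contain the files, for example because it was installed before the files were added. That is a guess to verify, not a diagnosis.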

How to read/consume SegY File in R

There is a package called Rquake, which is used for the estimation and analysis of seismic data.
But there are no relevant examples out there.
I basically wanted to know how to use this package. I have referred to the PDF manual and the documentation page: https://cran.r-project.org/web/packages/Rquake/Rquake.pdf https://www.rdocumentation.org/packages/Rquake/versions/2.4-0/topics/Rquake-package
...but didn't find anything relevant.

Exporting data in Roxygen2 so that they are available without requiring data()

After reading questions such as this SO question on documenting a data set with Roxygen I have managed to document a dataset (which I will refer to as cells) and it now appears in the list generated by data(package="mypackage") and is loaded if I run the command data(cells). After this, cells will appear when ls() is run.
However, in many packages the data is immediately available without requiring a data() call. Also, the data names do not appear when ls() is run. An example is the baseball data set that comes with plyr. I have looked at the source for plyr and I cannot see how this is done.
In the DESCRIPTION file of your package make sure that there is a field called LazyData that is set to TRUE.
From the "Writing R Extensions" guide:
The ‘data’ subdirectory is for data files, either to be made available
via lazy-loading or for loading using data(). (The choice is made by
the ‘LazyData’ field in the ‘DESCRIPTION’ file: the default is not to
do so.)
I think the exact syntax changed with R version 2.14; before that it was LazyLoad not LazyData.
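A quick way to confirm what a DESCRIPTION file actually declares is to read it with read.dcf(). The sketch below writes a throwaway DESCRIPTION file to a temporary location. The package name and version in it are placeholders, and in practice the path would point at the real file in the package source.

```r
# Throwaway DESCRIPTION file; the field values are placeholders
desc_file <- tempfile()
writeLines(c(
  "Package: mypackage",
  "Version: 0.0.1",
  "LazyData: true"
), desc_file)

# read.dcf() returns a character matrix with one column per requested field;
# a field that is missing from the file comes back as NA
lazy_data <- read.dcf(desc_file, fields = "LazyData")[1, "LazyData"]
lazy_enabled <- !is.na(lazy_data) && tolower(lazy_data) %in% c("true", "yes")
print(lazy_enabled)
```

After editing DESCRIPTION, the package has to be rebuilt and reinstalled before the data sets become available without a data() call.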

Resources