How to use TorchText to load FastText pre-trained word and subword embeddings?

When using torch, TorchText provides the ability to load word embeddings from the FastText pre-trained corpus. However, I am wondering whether the same approach can also give subword embeddings in cases where out-of-vocabulary issues arise?

Related

Get 'make html' to process jupyter notebooks with markdown instead of with restructuredtext

The GitHub repo on which I work with many others contains many Python files and about ten Jupyter notebooks (JNs). 'make html' currently assumes that the markdown JN cells are written in reST, which can produce meaningless and ugly results. Is it possible to configure Sphinx (or maybe nbsphinx?) so that readthedocs renders the JN markdown cells as markdown (preferably in the JN flavour)?
There is a website https://gist.github.com/dupuy/1855764 that addresses this problem by discussing constructs that are common to markdown and reST, but the document is at least 10 years old. For example, it lacks the "click here" link construct that works in both markup languages, namely:
[click here](urlname).
There remain in our JNs constructs that do not seem to have a common syntax producing decent rendering in both markdown and reST, or at least I have not succeeded in finding one. An example (and there may be others) is nested lists, especially lists without numbers or bullets.
An alternative would be for the JN text cells to be rendered as reST. There is a website https://nbsphinx.readthedocs.io/en/0.8.8/raw-cells.html that explains how to use reST in a JN. I have two problems with that. Firstly, JNs in our environment do not behave as that website explains, and I do not know how to change our environment (configuration files for JNs?) to make them behave in the way it claims. Secondly, our JNs are designed to be used by naive users (even more naive than I am), so they must work when a naive user runs JN in its out-of-the-box configuration.

How to deal with varying indentation in ReStructuredText codeblocks?

I am converting an older document into ReST. The document has the following construction:
The question is now how to get this with ReST. The following does not work:
[…] are listed below.
::

   dataio - Data format conversion package (RFITS, etc.)
   dbms - Database management package (not yet implemented)
   …
   system - System utilities package
   utilities - Miscellaneous utilities package
A package must be loaded in order […]
I don't know about raw docutils, but Sphinx at least has a code-block directive that, I think, gives better control over the indentation.
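A minimal sketch of that approach (code-block is a standard Sphinx directive; "text" simply suppresses syntax highlighting, and the listing lines are taken from the question):

.. code-block:: text

   dataio - Data format conversion package (RFITS, etc.)
   dbms - Database management package (not yet implemented)
   ...

The directive content is reproduced verbatim, so varying indentation inside the listing is preserved as written; there is also a :dedent: option if the source lines carry unwanted leading spaces.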
On the other hand, I have seen other people use csv tables to achieve a similar result.

Custom .traineddata file usage in the tesseract in R

I have a bunch of .JPGs with some text at the bottom, which consists mostly (but not exclusively) of numbers. I wish to use the tesseract package in R to 'read' the text in those .JPGs. Unfortunately, the base tesseract language proved too inaccurate to be worth using. Subsequently I tried using the magick package to adjust the pictures (crop, resize, convert, etc.) hoping to get a better reading from tesseract, but in my case this failed to produce satisfactory results.
I eventually managed to use the description at this link (https://towardsdatascience.com/simple-ocr-with-tesseract-a4341e4564b6) to create a new custom language in Tesseract 4.1.1 (as downloaded from https://github.com/tesseract-ocr/tesseract), which I named font_name.traineddata. The custom-made font_name.traineddata works perfectly in the Tesseract 4.1.1 console and shows a significant improvement in results over the base language.
The question I have is: how do I get the font_name.traineddata file to be part of the ocr command in R? I have tried the simple solution of just pasting the font_name.traineddata file into the appropriate tessdata folder of the tesseract package (the same folder that also contains the standard English data file eng.traineddata) and then trying the following:
font_name <- tesseract("font_name")
ocr("C:/1.jpg", engine = font_name)
This does not work and gives the error:
Error in tesseract_engine_internal(datapath, language, configs, opt_names, :
Unable to find training data for: font_name. Please consult manual for: ?tesseract_download
tesseract_download seems to be of no use, as it is a helper function for downloading training data from the official tessdata repository. I have also tried renaming the file to a three-character name, with the same error.
Does anybody have any suggestions on how to make custom .traineddata files work with ocr in R?
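One hedged sketch of a possible workaround, assuming the custom file can live in a folder you control (the paths below are hypothetical): the tesseract() engine constructor accepts a datapath argument, so the folder containing font_name.traineddata can be named explicitly instead of relying on the package's bundled tessdata directory:

library(tesseract)

# Hypothetical folder holding the custom font_name.traineddata
tessdata_dir <- "C:/tessdata"

# Point the engine at that folder explicitly via datapath
font_name <- tesseract(language = "font_name", datapath = tessdata_dir)
ocr("C:/1.jpg", engine = font_name)

If the engine still cannot find the file, tesseract_info() shows which datapath and languages the package actually sees.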

udpipe_accuracy() always gives the same error "The CoNLL-U line '....' does not contain 10 columns!"

This is regarding the R package udpipe for NLP. I am using it to tokenize, tag, lemmatize and perform dependency parsing on text files.
I am not sure which format of CoNLL-U file is needed for the function
udpipe_accuracy
I loaded a CSV file with 10 columns but the error persists.
I could not find any questions on SO about this package, and there is also no udpipe tag.
udpipe_accuracy is used in combination with udpipe_train.
If you trained a custom udpipe model with udpipe_train on data in CoNLL-U format, you can see how good it is by running udpipe_accuracy on held-out CoNLL-U data that was not used to build the model. Note that the file passed to udpipe_accuracy must itself be in CoNLL-U format (10 tab-separated columns), not a CSV; see the sketch below.
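A minimal sketch of that workflow (the file names are hypothetical; all functions are from the udpipe package):

library(udpipe)

# Train a model on CoNLL-U data (hypothetical file names)
udpipe_train(file = "toymodel.udpipe",
             files_conllu_training = "train.conllu",
             files_conllu_holdout = "holdout.conllu")

# Reload the trained model and evaluate it on held-out CoNLL-U data
model <- udpipe_load_model("toymodel.udpipe")
metrics <- udpipe_accuracy(model, file_conllu = "holdout.conllu")
metrics$accuracy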

Rreport/LaTeX quality output package

I'm looking for some LaTeX template for creating quality output. On R-bloggers I've bumped into Frank Harrell's Rreport package. Given my quite modest LaTeX abilities, only a user-friendly (and noob-friendly) interface will suffice. Here's a link to the official website. I'm following the instructions, but I cannot manage to install the app. I use Ubuntu 9.10, R version 2.10.1 (updated regularly from UCLA's CRAN mirror), and, of course, cvs is installed on my system.
Now, I'd like to know if there is some user-friendly LaTeX template package (Sweave is still too advanced/spartan for me). I'm aware that my question is quite confounding, but a brief glance at the examples on the Rreport page should give you a hint. I'm aware that LaTeX skills are a must, but just for now I need something that will suit my needs (as a psychological researcher).
Is there any package similar to Rreport?
LyX? http://www.lyx.org/
On Ubuntu:
sudo apt-get install lyx
From the LyX page:
LyX combines the power and flexibility of TeX/LaTeX with the ease of use of a graphical interface. This results in world-class support for creation of mathematical content (via a fully integrated equation editor) and structured documents like academic articles, theses, and books.
If you want to produce LaTeX with a simpler markup, you could use the ascii package, which has a Sweave driver that can be used with reStructuredText; that output can then be converted to LaTeX. I would only use it, though, if you also want to be able to convert the same document to HTML or ODF. In any case, it is a good idea to learn basic LaTeX.
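A minimal, hedged sketch of what that route looks like (assuming the CRAN ascii package; the data set is just a built-in example):

library(ascii)

# Ask ascii to emit reStructuredText instead of the default asciidoc
options(asciiType = "rest")

# Render a summary table as reST, which can then be converted to LaTeX
print(ascii(summary(cars)))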
The online text processor Zoho allows export to LaTeX. Maybe this can be helpful for learning LaTeX, but I do not know how to integrate Sweave/R with it. (I have not worked with Zoho, by the way.)
