Dictionary File Structure of Open Spell-Checkers - dictionary

Are there any explanatory docs or tutorials on the file structure of FreeDict, Aspell, and Hunspell/OpenOffice dictionaries, especially concerning the switches at the end of each row in each .dic file? My guess is that the switches describe the semantic interpretation of the word, i.e. whether it's a
noun
adjective
adverb
adverbial
etc.
or any combination of the above. But I don't know how to match these to the switch characters.
I'm also curious about what the .aff file describes.

This looks like a good starting point, and the downloads at this page may have the format documentation you're looking for.

Just a couple of links that might help you.
This one is on Stack Overflow:
What's the format of the OpenOffice dictionaries?
This second one is a good start:
http://sourceforge.net/apps/mediawiki/freedict/index.php?title=Main_Page
Hope this helps.

In Hunspell the flags you choose are arbitrary; they have no meaning other than the one you assign to them in the affix file. Flags can be single letters, numbers (1-65000, with FLAG num set in the affix file) and more.
The affix file describes many things, but is mainly concerned with how words are inflected.
For example:
$ test.dic
4
apple/a
banana/a
green/b
small/b
$ test.aff
SFX a Y 2 # Allow the following 2 suffixes to words with the "a" flag.
SFX a 0 s . # An "s" at the end for words ending in any letter (signified by the dot). "Apples" and "bananas".
SFX a 0 s' . # "Apples'" and "bananas'".
SFX b Y 2
SFX b 0 er . # "Greener" and "smaller".
SFX b 0 est . # "Greenest" and "smallest".
The manual explains most of the things in detail. There are also test files one can look at.
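To make the relationship between the two files concrete, here is a minimal Python sketch that expands the test.dic entries using the suffix rules from test.aff. It is only an illustration of the idea, not how Hunspell itself is implemented: the function names parse_suffix_rules and expand are made up for this example, the inline comments have been removed from the affix text, and only single-character flags and SFX rules are handled.

```python
import re

# The example files from above, with the inline comments stripped.
AFF = """\
SFX a Y 2
SFX a 0 s .
SFX a 0 s' .
SFX b Y 2
SFX b 0 er .
SFX b 0 est .
"""

DIC = """\
4
apple/a
banana/a
green/b
small/b
"""


def parse_suffix_rules(aff_text):
    """Map each flag to a list of (strip, add, condition) suffix rules."""
    rules = {}
    for line in aff_text.splitlines():
        fields = line.split()
        # Rule lines have five fields; header lines ("SFX a Y 2") have four.
        if len(fields) == 5 and fields[0] == "SFX":
            _, flag, strip, add, condition = fields
            rules.setdefault(flag, []).append((strip, add, condition))
    return rules


def expand(dic_text, rules):
    """Return every word form accepted by the dictionary plus its rules."""
    forms = []
    for entry in dic_text.splitlines()[1:]:  # the first line is the word count
        word, _, flags = entry.partition("/")
        forms.append(word)
        for flag in flags:  # assumes one character per flag
            for strip, add, condition in rules.get(flag, []):
                # The condition is matched against the end of the word.
                if re.search("(?:" + condition + ")$", word):
                    stem = word if strip == "0" else word[: len(word) - len(strip)]
                    forms.append(stem + add)
    return forms


forms = expand(DIC, parse_suffix_rules(AFF))
print(forms)
```

Running it lists each stem followed by its inflected forms, e.g. apple, apples, apples' and green, greener, greenest.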

Related

R: Removing substring by occurrence of character

I have a vector, species_name, in dataframe genexp_2016 which contains the common and scientific names, as well as the location of several different species. For example, species_name strings may be written as
head(genexp_2016)
rank species_name status
1 1396 Addax (Addax nasomaculatus) - Wherever found E
2 1313 Babirusa (Babyrousa babyrussa) - Wherever found E
3 1396 Baboon, gelada (Theropithecus gelada) - Wherever found T
4 229 Bat, Florida bonneted (Eumops floridanus) - Wherever found E
5 109 Bat, gray (Myotis grisescens) - Wherever found E
What I'm attempting to do, however, is find a way to remove the end of each string in 'species_name' such that I am left with only the common name and the scientific name, and remove the location ('Wherever found').
I have thought about trying to tell R to delete everything after the first occurrence of the - character, but this is an imperfect method since some species in the dataframe have a hyphen in their name, such as the black-footed ferret.
The most effective solution I've thought of is this: Telling R to read strings starting from the end instead of the beginning, and upon finding the first occurrence of -, delete everything between that character's position in the string and the end of the string. It seems like this is something I should be able to do in R, but my skills are currently not so advanced to know how to do this. Does anyone have any ideas of how I might code this, or perhaps a more efficient way for me to remove the location description in each string?
Thanks, and I appreciate any help you can offer.
To keep everything until the last - (the keyword here is greedy) you could do:
x <- 'Addax (Addax nasomaculatus) - Wherever found'
sub('(.+)-.+', '\\1', x)
# [1] "Addax (Addax nasomaculatus) "

R list.files: some regexes only return a single file

I'm puzzled by the behaviour of regexes in the list.files command. I have a folder with ~500 files; most names start with "new_" and end with ".txt". There are some other files in the same folder, e.g. README, _cabs.txt.
I'd like to get only the new_*.txt files. I've tried different ways to call list.files with different results. Here they are:
#1 This returns ALL files including README and others
list.files(path="correctpath/")
#2 This returns ALL files including _cabs.txt, which I do not want.
list.files(path="correctpath/",pattern="txt")
#3 This returns ALL files I want, but...
list.files(path="correctpath/",pattern="new_")
#4 This returns just one of the new_*.txt files.
list.files(path="correctpath/",pattern="new*\\.txt")
#5 This returns an empty list.
list.files(path="correctpath/",pattern="new_*\\.txt")
So I have one solution that works, but would like to understand what's going on with the approaches 4 and 5.
thanks in advance
Rafael
list.files(path="correctpath/",pattern="new_.*\\.txt")
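* means "zero or more of the preceding item"; it is a regex quantifier, not a shell-style wildcard. To match any character zero or more times you need to put a period before it (.*), because a period matches any character. In approach 5, _* only matches zero or more underscores, so "new_" would have to be followed directly by ".txt". The pattern "new_.*\\.txt" should work.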
Good R regex reference.

What does the attribute "TRY" mean in Hunspell?

A Hunspell affix file may contain the attribute TRY. What does it do?
The Hunspell documentation says:
Hunspell can suggest right word forms, when they differ from the bad input word by one TRY
character. The parameter of TRY is case sensitive.
But I did not understand what it means.
I have the following affix and dictionary files:
.aff
SET UTF-8
TRY e
.dic
2
created
create
And Hunspell input:
create
*
created
*
sreate
& sreate 1 0: create
sreated
& sreated 1 0: created
crzated
& crzated 2 0: created, create
You can see that the words "sreate", "sreated" and "crzated" differ from the right word forms by the "s" and "z" characters. Why does this happen?
Thank you in advance.
TRY states a set of letters that can be wrong. If a misspelled word differs from a word in the dictionary file by one of these letters, then Hunspell can suggest that dictionary word.
In your example, the letter e was mistyped as z in crzated. Hence Hunspell tries replacing z with e and finds created.
I'm not sure about sreate and sreated TBH.
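As a rough illustration of "differs by one TRY character", the Python sketch below tries inserting or substituting each TRY character at every position and keeps the results that are dictionary words. This is a simplified model written for this answer, not Hunspell's actual suggestion code, and the function name try_suggestions is hypothetical.

```python
def try_suggestions(word, try_chars, dictionary):
    """Suggest dictionary words reachable by one TRY-character edit."""
    found = []
    for i in range(len(word) + 1):
        for ch in try_chars:
            # Insert ch at position i.
            candidates = [word[:i] + ch + word[i:]]
            if i < len(word):
                # Replace the character at position i with ch.
                candidates.append(word[:i] + ch + word[i + 1:])
            for cand in candidates:
                if cand in dictionary and cand != word and cand not in found:
                    found.append(cand)
    return found


words = {"create", "created"}
print(try_suggestions("crzated", "e", words))  # ['created']
print(try_suggestions("sreate", "e", words))   # []
```

Under this model crzated leads to created, while sreate yields nothing, since fixing it needs a c and c is not in TRY. That suggests the sreate and sreated suggestions come from some other suggestion mechanism in Hunspell rather than from TRY.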

.dic line format definition

I am currently investigating the most appropriate dictionary to use in an application I am building.
Inspecting the dictionaries which are bundled with Sublime Text 2, the file format is as you would expect: a list of alphabetically ordered words. However, a lot of those words have additional information appended to them. Take this snippet as an example:
abaft
abbreviation/M
abdicate/DNGSn
Abelard/M
abider/M
Abidjan
ablaze
abloom
aboveground
abrader/M
Abram/M
abreaction/MS
abrogator/MS
abscond/DRSG
absinthe/MS
absoluteness/S
absorbency/SM
abstract/ShTVDPiGY
absurdness/S
A fruitless Google search has not shed any light on what the letters after the slash (/) mean.
Maybe they hint at the sex of the word, but that is only a guess and I'd prefer to read a formal explanation of their meaning.
Has anybody come across these?
The letters following the slash are affix flags. Each one refers to a prefix or suffix rule that may be applied to the root word.
See this blog post for a nice explanation and examples of what these affixes can be used for.
Another place to look is the Aspell manual.
TLDR: each letter in the .dic file following the slash is a name of a rule in the .aff file.
https://superuser.com/a/633869/367530
Each rule is in the .aff file for that language. The rules come in two
flavors: SFX for suffixes, and PFX for prefixes. Each rule starts with a
header line that begins with PFX/SFX and then the rule letter identifier
(the one that follows the word in the dictionary file):
PFX [rule_letter_identifier] [combinable_flag] [number_of_rule_lines_that_follow]
You can normally ignore the combinable flag; it is Y or N depending on
whether the rule can be combined with rules of the other kind (a prefix
with a suffix). Then there are some number of lines (indicated by
[number_of_rule_lines_that_follow]) that list different possibilities
for how this rule applies in different situations. Each looks like this:
PFX [rule_letter_identifier] [letters_to_delete] [what_to_add] [when_to_add_it]
Here [letters_to_delete] is the text stripped from the word before the
affix is added, with 0 meaning that nothing is stripped.
For example:
SFX B Y 3
SFX B 0 able [^aeiou]
SFX B 0 able ee
SFX B e able [^aeiou]e
If B is one of the letters following a word, i.e. someword/B, then this is one of the
rules that can apply. There are three possibilities that can happen
(because there are three lines). Only one will apply:
able is added to the end when the end of the word is not (indicated by ^) one of the letters in the set (indicated by [ ]) of letters a, e, i, o, and u. For example, question → questionable
able is added to the end when the end of the word is ee. For example, agree → agreeable.
able is added to the end when the end of the word is not a vowel ([^aeiou]) followed by an e. The letter e is stripped (the column before able). For example, excite → excitable.
PFX rules are the same, but apply at the beginning of the word instead
for prefixes.
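The three SFX B lines above can be tried out with a short Python sketch. It is a hedged illustration rather than Hunspell's real matching code: the name apply_suffix_rules is invented for this example, and the rule conditions are simply treated as regular expressions anchored at the end of the word.

```python
import re

# (strip, add, condition) triples copied from the three "SFX B" rule lines.
RULES_B = [
    ("0", "able", "[^aeiou]"),
    ("0", "able", "ee"),
    ("e", "able", "[^aeiou]e"),
]


def apply_suffix_rules(word, rules):
    """Return the forms produced by every rule whose condition matches."""
    results = []
    for strip, add, condition in rules:
        # The condition describes how the word must end.
        if re.search("(?:" + condition + ")$", word):
            # "0" means nothing is stripped before the suffix is added.
            stem = word if strip == "0" else word[: -len(strip)]
            results.append(stem + add)
    return results


for w in ["question", "agree", "excite", "idea"]:
    print(w, apply_suffix_rules(w, RULES_B))
```

For these three conditions at most one rule matches a given word, so question, agree and excite each produce exactly one form (questionable, agreeable, excitable), and a word such as idea, which ends in a vowel other than e, produces none.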

Is there a way to check the spelling of words in a character vector?

The text to be checked is in Greek, but I would like to know if it can be done for English words too. My initial idea is described here, and I have already found a way to do it using VBA. But I wonder if there's a way to do it using R. If there isn't a way in R, can you think of something better than Excel VBA?
Alternatively, OpenOffice ships with a dictionary whose entries are stored in a text file. You can read that and remove the word definitions to create your word list.
This was tested on v3.0; the file location may have shifted, and the filename will change depending on which dictionary you want.
library(stringr)
dict <- readLines("C:/Program Files/OpenOffice.org 3/share/uno_packages/cache/uno_packages/174.tmp_/dict-en.oxt/th_en_US_v2.dat")
is_word <- str_detect(dict, "^[^(]")
words <- str_split_fixed(dict[is_word], "\\|", 2)
words <- words[,1]
This list contains some multi-word phrases. You may prefer to split on the first space, and take unique values. You probably also want to write words to file, to save repeating yourself.
Once this is done, checking a word is as easy as
c("persnickety", "sqwrzib") %in% words # TRUE FALSE
There is an open-source GNU spell checker called Aspell with support for various languages. It is a command-line program which I mainly use for scanning batches of text files at once (the output is then just printed to the console).
But there also exists a C API and perhaps more interesting for you a Pipe mode which accepts streams of texts and outputs to the standard output.
Hope this helps.
