A Hunspell affix file may contain the TRY attribute. What does it do?
The Hunspell documentation says:
Hunspell can suggest right word forms, when they differ from the bad input word by one TRY
character. The parameter of TRY is case sensitive.
But I did not understand what that means.
I have the following affix and dictionary files:
.aff
SET UTF-8
TRY e
.dic
2
created
create
And Hunspell input:
create
*
created
*
sreate
& sreate 1 0: create
sreated
& sreated 1 0: created
crzated
& crzated 2 0: created, create
As you can see, the words "sreate", "sreated" and "crzated" differ from the correct word forms by the characters "s" and "z", neither of which is in TRY. Why does this happen?
Thank you in advance.
TRY lists the characters Hunspell tries when generating suggestions. If a misspelled word differs from a word in the dictionary file by one of these characters, Hunspell can suggest that dictionary word.
In your example, the letter e was mistyped as z in crzated, so Hunspell replaces z with e and suggests created.
To be honest, I'm not sure why sreate and sreated get suggestions.
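As a rough illustration of the TRY idea, here is a minimal Python sketch. It is a simplified model and not Hunspell's actual algorithm: the function name try_candidates is made up for this example, and it only tries replacing or inserting a single TRY character. Under that assumption crzated reaches created, while sreate reaches nothing, so the suggestions for sreate and sreated would have to come from some step other than this one.

```python
def try_candidates(word, try_chars, dictionary):
    """Dictionary words reachable by substituting or inserting one TRY character.

    Simplified model for illustration only (hypothetical helper, not Hunspell code).
    """
    found = []
    for ch in try_chars:
        # Replace the character at each position with the TRY character.
        for i in range(len(word)):
            candidate = word[:i] + ch + word[i + 1:]
            if candidate in dictionary and candidate not in found:
                found.append(candidate)
        # Insert the TRY character at each position.
        for i in range(len(word) + 1):
            candidate = word[:i] + ch + word[i:]
            if candidate in dictionary and candidate not in found:
                found.append(candidate)
    return found


dictionary = {"created", "create"}
print(try_candidates("crzated", "e", dictionary))  # ['created']
print(try_candidates("sreate", "e", dictionary))   # []
```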
I have a variable named full.path, and I am checking whether the string it contains has any of certain special characters.
In the code below I use grepl to look for those characters. None of them are in the string, yet the result is TRUE.
Could someone explain what is going on and help? Thanks in advance.
full.path <- "/home/xyz"
#This returns TRUE :(
grepl("[?.,;:'-_+=()!@#$%^&*|~`{}]", full.path)
By plugging this regex into https://regexr.com/ I was able to spot the issue: a - between two characters in a character class creates a range. The range from ' to _ (ASCII 39 to 95) covers the digits, the uppercase letters and several punctuation characters, including /, so the / in your path produces a spurious match.
To avoid this behaviour, put - first in the character class, which signals that you want to match a literal - rather than a range:
> grepl("[-?.,;:'_+=()!@#$%^&*|~`{}]", full.path)
[1] FALSE
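The same effect can be reproduced outside R. The following Python sketch is only an illustration (it assumes Python's re module treats a - inside a character class the same way, which it does for this ASCII case) and shows which character the accidental range actually catches:

```python
import re

path = "/home/xyz"

# Here '-_ is read as a range from ' (code 39) to _ (code 95).
range_class = r"[?.,;:'-_+=()!@#$%^&*|~`{}]"
# With - first, it is a literal hyphen and no range is formed.
literal_class = r"[-?.,;:'_+=()!@#$%^&*|~`{}]"

hit = re.search(range_class, path)
print(hit.group())                      # /
print(re.search(literal_class, path))   # None
print(ord("'"), ord("/"), ord("_"))     # 39 47 95
```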
I was taking an intro class at datacamp.com and ran into a problem.
Goal: find the right emails using grep. "Right emails" are defined as containing an "@" and ending with ".edu".
Emails vector:
emails <- c("john.doe@ivyleague.edu", "education@world.gov", "dalai.lama@peace.org",
"invalid.edu", "quant@bigdatacollege.edu", "cookie.monster@sesame.tv")
I first tried
grep("@*\\.edu$",emails)
and it gave me
[1] 1 4 5
because I thought "*" matches "multiple characters". Later I found that it doesn't work like that.
It turned out the right code is
grep("@.*\\.edu$",emails)
I googled some documentation and only have a vague sense of how to get the correct answer. Can someone explain how exactly R matches the right emails? Thanks a bunch!!
You've already been advised that the asterisk quantifier wasn't giving you the specificity you needed, so use the "+" quantifier, which forces at least one match. I decided to make the problem more complex by adding some entries with duplicated at-signs:
emails <- c("john.doe@@ivyleague.edu", "education@@world.gov", "dalai.lama@peace.org",
"invalid.edu", "quant@bigdatacollege.edu", "cookie.monster@sesame.tv")
grep( "^[^@]+@[^@]+\\.edu$", emails)
#[1] 5
That uses the regex character-class structure, where items inside flanking square brackets are taken as literals unless there is an initial caret ("^"), in which case the class is negated, i.e. in this case any character except "@". This also excludes cases where the at-sign is the first character. Thanks to KonradRudolph, who pointed out that adding "^" as the first character of the pattern (where it signifies the point just before the first character of a potential match) prevents items with an initial "@@" from being matched.
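To see why the three patterns behave differently, it may help to run them side by side. In a regex, "*" means "zero or more of the preceding item", so "@*" also matches when there is no at-sign at all, whereas "@.*" needs a literal at-sign followed by anything. The Python sketch below is an illustration only: it writes the at-sign as @, the helper name matching_positions is made up, and it returns 1-based positions to mirror what grep reports in R.

```python
import re

emails = ["john.doe@ivyleague.edu", "education@world.gov", "dalai.lama@peace.org",
          "invalid.edu", "quant@bigdatacollege.edu", "cookie.monster@sesame.tv"]

# Variant with duplicated at-signs, as in the answer above.
doubled = ["john.doe@@ivyleague.edu", "education@@world.gov", "dalai.lama@peace.org",
           "invalid.edu", "quant@bigdatacollege.edu", "cookie.monster@sesame.tv"]


def matching_positions(pattern, items):
    """Return 1-based positions of the items the pattern matches (hypothetical helper)."""
    return [i for i, item in enumerate(items, start=1) if re.search(pattern, item)]


print(matching_positions(r"@*\.edu$", emails))              # [1, 4, 5]
print(matching_positions(r"@.*\.edu$", emails))             # [1, 5]
print(matching_positions(r"^[^@]+@[^@]+\.edu$", doubled))   # [5]
```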
I am currently investigating the most appropriate dictionary to use in an application I am building.
Inspecting the dictionaries which are bundled with Sublime Text 2, the file format is as you would expect: a list of alphabetically ordered words. However, a lot of those words have additional information appended to them. Take this snippet as an example:
abaft
abbreviation/M
abdicate/DNGSn
Abelard/M
abider/M
Abidjan
ablaze
abloom
aboveground
abrader/M
Abram/M
abreaction/MS
abrogator/MS
abscond/DRSG
absinthe/MS
absoluteness/S
absorbency/SM
abstract/ShTVDPiGY
absurdness/S
A fruitless Google search has not shed any light on what the letters after the slash (/) mean.
Maybe they hint at the grammatical gender of the word, but that is only a guess and I'd prefer to read a formal explanation of their meaning.
Has anybody come across these?
The letters following the slash are affix flags. Each flag refers to a prefix or suffix rule that may be applied to the root word.
See this blog post for a nice explanation and examples of what these affixes can be used for.
Another place to look is the aspell manual.
TL;DR: each letter following the slash in the .dic file is the name of a rule in the .aff file.
https://superuser.com/a/633869/367530
Each rule is in the .aff file for that language. The rules come in two
flavors: SFX for suffixes, and PFX for prefixes. Each rule begins with
PFX/SFX and then the rule letter identifier (the one that follows the
word in the dictionary file):
PFX [rule_letter_identifier] [combinable_flag] [number_of_rule_lines_that_follow]
You can normally ignore the combinable flag; it is Y or N depending on
whether the rule can be combined with other rules. Then there are some
number of lines (indicated by the [number_of_rule_lines_that_follow]) that list different possibilities
for how this rule applies in different situations. Each line looks like this:
PFX [rule_letter_identifier] [letters_to_delete] [what_to_add] [when_to_add_it]
For example:
SFX B Y 3
SFX B 0 able [^aeiou]
SFX B 0 able ee
SFX B e able [^aeiou]e
If B is one of the letters following a word, i.e. someword/B, then this is one of the
rules that can apply. There are three possibilities that can happen
(because there are three lines). Only one will apply:
- able is added to the end when the end of the word is not (indicated by ^) one of the letters in the set (indicated by [ ]) of letters a, e, i, o, and u. For example, question → questionable.
- able is added to the end when the end of the word is ee. For example, agree → agreeable.
- able is added to the end when the end of the word is not a vowel ([^aeiou]) followed by an e. The letter e is stripped (the column before able). For example, excite → excitable.
PFX rules are the same, but apply at the beginning of the word instead
for prefixes.
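The three SFX B lines above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not how Hunspell itself is implemented: the names RULES_B and apply_suffix are made up, the condition column is treated as a regular expression anchored at the end of the word, and the first rule whose condition matches is applied.

```python
import re

# (letters to strip, text to add, condition) taken from the three SFX B lines.
# "0" in the strip column means nothing is removed.
RULES_B = [
    ("0", "able", "[^aeiou]"),
    ("0", "able", "ee"),
    ("e", "able", "[^aeiou]e"),
]


def apply_suffix(word, rules):
    """Apply the first rule whose condition matches the end of the word (hypothetical helper)."""
    for strip, add, condition in rules:
        if re.search(condition + "$", word):
            stem = word if strip == "0" else word[: len(word) - len(strip)]
            return stem + add
    return None  # no rule applies to this word


for word in ["question", "agree", "excite"]:
    print(word, "->", apply_suffix(word, RULES_B))
# question -> questionable
# agree -> agreeable
# excite -> excitable
```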
The text to be checked is in Greek, but I would like to know if it can be done for English words too. My initial idea is described here, and I have already found a way to do it using VBA. But I wonder if there's a way to do it using R. If there isn't a way in R, can you think of something better than Excel VBA?
Alternatively, OpenOffice ships with a dictionary that has its entries stored in a text file. You can read that and remove the word definitions to create your word list.
This was tested on v3.0; the file location may have shifted, and the filename will change depending on which dictionary you want.
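library(stringr)
# Read the dictionary file line by line
dict <- readLines("C:/Program Files/OpenOffice.org 3/share/uno_packages/cache/uno_packages/174.tmp_/dict-en.oxt/th_en_US_v2.dat")
# Keep only the lines that do not start with "(" (the entry lines)
is_word <- str_detect(dict, "^[^(]")
# Split each entry at the first "|" and keep the part before it
words <- str_split_fixed(dict[is_word], "\\|", 2)
words <- words[,1]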
This list contains some multi-word phrases. You may prefer to split on the first space, and take unique values. You probably also want to write words to file, to save repeating yourself.
Once this is done, checking a word is as easy as
c("persnickety", "sqwrzib") %in% words # TRUE FALSE
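For readers who want to try the same idea without the OpenOffice file at hand, here is a small Python sketch of the filtering step. The sample lines are made up for illustration and only imitate the layout the R code above relies on (entry lines of the form word|something, other lines starting with "("); the names sample_lines and is_known are hypothetical.

```python
# Made-up sample in the assumed layout; a real file would be read from disk instead.
sample_lines = [
    "persnickety|1",
    "(adj)|fastidious|fussy",
    "simple|2",
    "(adj)|easy|plain",
    "(noun)|herb",
]

# Keep non-empty lines that do not start with "(" and take the text before the first "|".
words = {line.split("|", 1)[0] for line in sample_lines if line and not line.startswith("(")}


def is_known(word):
    """True if the word is in the extracted word list (hypothetical helper)."""
    return word in words


print(sorted(words))                                     # ['persnickety', 'simple']
print([is_known(w) for w in ["persnickety", "sqwrzib"]])  # [True, False]
```

Using a set keeps each membership test cheap even for a long word list.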
There exists an open source GNU spell checker called Aspell with support for various languages. It is a command-line program which I mainly use for scanning batches of text files at once (the output is then just written to the console).
There is also a C API and, perhaps more interesting for you, a pipe mode which accepts streams of text and writes to standard output.
Hope this helps.
I'm having a hard time trying to create a right regular expression for the RegularExpressionValidator control that allows password to be checked for the following:
- Is greater than seven characters.
- Contains at least one digit.
- Contains at least one special (non-alphanumeric) character.
I can't seem to find any results out there either. Any help would be appreciated! Thanks!
Maybe you will find this article helpful. You may try the following expression
^.*(?=.{8,})(?=.*[\d])(?=.*[\W]).*$
and the breakdown:
(?=.{8,}) - contains at least 8 characters
(?=.*[\d]) - contains at least one digit
(?=.*[\W]) - contains at least one special character
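A quick way to check the breakdown is to run the expression against a few sample passwords. The sketch below uses Python's re module rather than the RegularExpressionValidator control, on the assumption that lookaheads, \d and \W behave the same for these inputs; the helper name is_valid and the sample passwords are made up.

```python
import re

# The expression from the answer, unchanged.
PATTERN = re.compile(r"^.*(?=.{8,})(?=.*[\d])(?=.*[\W]).*$")


def is_valid(password):
    """True if the password satisfies all three lookahead conditions (hypothetical helper)."""
    return PATTERN.match(password) is not None


print(is_valid("passw0rd!"))  # True: 9 characters, a digit and a special character
print(is_valid("password!"))  # False: no digit
print(is_valid("passw0rd"))   # False: no special character
print(is_valid("pw0!"))       # False: fewer than 8 characters
```

Note that \W treats the underscore as a word character, so an underscore alone does not count as a special character here.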
http://msdn.microsoft.com/en-us/library/ms972966.aspx
Search for "Lookaround processing", which these examples rely on. You can also test for a range of lengths by using .{4,8}, as in Microsoft's example:
^(?=.*\d).{4,8}$
Try this
((?=.*\d)(?=.*[a-z])(?=.*[\W]).{6,20})
Description of above Regular Expression:
( # Start of group
(?=.*\d) # must contain at least one digit from 0-9
(?=.*[a-z]) # must contain at least one lowercase character
(?=.*[\W]) # must contain at least one special character
. # match any character, with the previous conditions checked
{6,20} # length at least 6 characters and at most 20
) # End of group
"\W" widens the range of characters that can be used in the password, which can make it more secure.
Use this for a strong password with uppercase, lowercase, numbers and symbols, 8 to 15 characters long.
//Code for validation with a regular expression in ASP.NET Core.
[RegularExpression(@"^(?=.*[a-z])(?=.*[A-Z])(?=.*\d)(?=.*[^\da-zA-Z]).{8,15}$")]
Regular expression password validation:
@"^(?=.*[a-z])(?=.*[A-Z])(?=.*\d)(?=.*[^\da-zA-Z]).{8,15}$"
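To see what this expression accepts and rejects, here is a small Python sketch. It is an illustration only: it uses Python's re module instead of the ASP.NET attribute, assuming the lookaheads and character classes behave the same for these ASCII inputs, and the helper name is_strong and the sample passwords are made up. It also shows that the {8,15} part sets an upper limit as well as a lower one.

```python
import re

# Same expression as above, written as a Python raw string.
STRONG = re.compile(r"^(?=.*[a-z])(?=.*[A-Z])(?=.*\d)(?=.*[^\da-zA-Z]).{8,15}$")


def is_strong(password):
    """True if the password passes every lookahead and the length check (hypothetical helper)."""
    return STRONG.match(password) is not None


print(is_strong("Abcdef1!"))          # True: all four character kinds, 8 characters
print(is_strong("abcdef1!"))          # False: no uppercase letter
print(is_strong("Abcdefg1"))          # False: no symbol
print(is_strong("Ab1!"))              # False: shorter than 8 characters
print(is_strong("Abcdefghijklmn1!"))  # False: 16 characters, over the limit of 15
```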