Difference between stem and normalized_stem in wowool lexicons - dictionary

I am using wowool, but in the lexicons I don't see any difference between stem and normalized_stem. When should I use one or the other?
My sample is from the documentation: "I like kiwis. KIWIS are good."
Both seem to match with
lexicon: (input="stem") : { kiwi } =Fruit;
and
lexicon: (input="normalized_stem") : { kiwi } =Fruit;

This is normal: the root form of KIWIS is kiwi, so both stem and normalized_stem will match.
If you use Kiwi with an initial capital, only the normalized_stem will match; the reason is that the stem of Kiwi is a proper noun, so it will not be stemmed.
I advise you to look at the stems of the words when you are trying to decide whether to use stem or normalized_stem.
// Wowool Source
lexicon: (input="stem") { kiwi } =S_Fruit;
lexicon: (input="normalized_stem") { kiwi } =NS_Fruit;
./wow -l en -i "I like kiwis. I like Kiwis are good. Kiwis" --domains rules
-- EyeOnText WoWoolConsole 2.1.0
1:Process:stream_16840253095957608044 (42b/42b)
Language:english
s(0,13)
{Sentence
t(0,1) "I" (init-cap, init-token)['I':Pron-Pers, +1p, +sg]
t(2,6) "like" ['like':V-Pres, +inf, +positive]
{NS_Fruit
{S_Fruit
t(7,12) "kiwis" ['kiwi':Nn-Pl]
}S_Fruit }NS_Fruit
t(12,13) "." ['.':Punct-Sent]
}Sentence
s(14,36)
{Sentence
t(14,15) "I" (init-cap, init-token)['I':Pron-Pers, +1p, +sg]
t(16,20) "like" ['like':V-Pres, +inf, +positive]
t(21,26) "Kiwis" (init-cap, nf, nf-lex)['Kiwis':Prop-Std]
t(27,30) "are" ['be':V-Pres-Pl-be]
t(31,35) "good" ['good':Adj-Std]
t(35,36) "." ['.':Punct-Sent]
}Sentence
s(37,42)
{Sentence
{NS_Fruit
{S_Fruit
t(37,42) "Kiwis" (init-cap, init-token)['kiwi':Nn-Pl]
}S_Fruit }NS_Fruit }Sentence
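Note in the output above that the capitalized plural in mid-sentence ("Kiwis" at t(21,26)) is tagged Prop-Std and matches neither annotation, while at sentence start it is stemmed to 'kiwi' and matches both. To exercise the case described in the answer, where only normalized_stem fires, here is a sketch reusing the same lexicons with the singular capitalized form (the expected result follows from the explanation above):
./wow -l en -i "I like Kiwi." --domains rules
Per the explanation, the stem of "Kiwi" stays "Kiwi" (proper noun), so only NS_Fruit should match.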

Related

Regex to match only semicolons not in parentheses

I have the following string:
Almonds ; Roasted Peanuts (Peanuts; Canola Oil (Antioxidants (319; 320)); Salt); Cashews
I want to replace the semicolons that are not inside parentheses with commas. There can be any number of brackets and any number of semicolons within the brackets, and the result should look like this:
Almonds , Roasted Peanuts (Peanuts; Canola Oil (Antioxidants (319; 320)); Salt), Cashews
This is my current code:
x <- "Almonds ; Roasted Peanuts (Peanuts; Canola Oil (Antioxidants (319; 320)); Salt); Cashews"
gsub(";(?![^(]*\\))",",",x,perl=TRUE)
[1] "Almonds , Roasted Peanuts (Peanuts, Canola Oil (Antioxidants (319; 320)); Salt), Cashews "
The problem I am facing is that if there is a nested () inside a bigger bracket, my regex will replace that semicolon with a comma too.
Can I please get some help with a regex that will solve the problem? Thank you in advance.
The pattern ;(?![^(]*\)) matches a semicolon and asserts that what is to the right is not a ) without a ( in between.
That assertion still holds when a nested opening parenthesis intervenes, so the ; inside the outer parentheses is matched anyway.
You could use a recursive pattern to match nested parentheses, i.e. match what you don't want to change, and then use a (*SKIP)(*FAIL) approach.
Then you can match the remaining semicolons and replace them with a comma.
[^;]*(\((?>[^()]+|(?1))*\))(*SKIP)(*F)|;
In parts, the pattern matches
[^;]* Match 0+ times any char except ;
( Capture group 1
\( Match the opening (
(?> Atomic group
[^()]+ Match 1+ times any char except ( and )
| Or
(?1) Recurse the whole first sub pattern (group 1)
)* Close the atomic group and optionally repeat
\) Match the closing )
) Close group 1
(*SKIP)(*F) Skip what is matched
| Or
; Match a semicolon
x <- c("Almonds ; Roasted Peanuts (Peanuts; Canola Oil (Antioxidants (319; 320)); Salt); Cashews",
"Peanuts (32.5%); Macadamia Nuts (14%; PPPG(AHA)); Hazelnuts (9%); nuts(98%)")
gsub("[^;]*(\\((?>[^()]+|(?1))*\\))(*SKIP)(*F)|;",",",x,perl=TRUE)
Output
[1] "Almonds , Roasted Peanuts (Peanuts; Canola Oil (Antioxidants (319; 320)); Salt), Cashews"
[2] "Peanuts (32.5%), Macadamia Nuts (14%; PPPG(AHA)), Hazelnuts (9%), nuts(98%)"

How to allow a space into a wildcard?

Let's say I have this sentence:
text <- "I want to find both the greatest cake of the world but also some very great cakes but I want to find this last part : isn't it"
When I write this (kwic is a quanteda function):
kwic(text, phrase("great* cake*"))
I get
[text1, 7:8] want to find both the | greatest cake | of the world but also
[text1, 16:17] world but also some very | great cakes | but I want to find
However, when I do
kwic(text, phrase("great*cake*"))
I get a kwic object with 0 rows, i.e. nothing.
I would like to know what exactly the * replaces and, more importantly, how to "allow" a space to be taken into account in the wildcard?
To answer what the * matches, you need to understand the "glob" valuetype, which you can read about using ?valuetype. In short, * matches any number of any characters, including none. Note that this is very different from its use in a regular expression, where it means "match zero or more of the preceding character".
The pattern argument in kwic() matches one pattern per token, after tokenizing the text. Even wrapped in the phrase() function, it still only considers sequences of matches to tokens. So you cannot match the whitespace (which defines the boundaries between tokens) unless you actually include the space inside the token's value itself.
How could you do that? Like this:
toksbi <- tokens(text, ngrams = 2, concatenator = " ")
# tokens from 1 document.
# text1 :
# [1] "I want" "want to" "to find" "find both" "both the"
# [6] "the greatest" "greatest cake" "cake of" "of the" "the world"
# [11] "world but" "but also" "also some" "some very" "very great"
# [16] "great cakes" "cakes but" "but I" "I want" "want to"
# [21] "to find" "find this" "this last" "last part" "part :"
# [26] ": isn't" "isn't it"
kwic(toksbi, "great*cake*", window = 2)
# [text1, 7] both the the greatest | greatest cake | cake of of the
# [text1, 16] some very very great | great cakes | cakes but but I
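Note that in more recent quanteda releases (version 2 and later) the ngrams argument was removed from tokens(); assuming one of those versions, the equivalent construction uses tokens_ngrams():
library(quanteda)
# build bigram tokens with a space as the concatenator, then search as before
toksbi <- tokens_ngrams(tokens(text), n = 2, concatenator = " ")
kwic(toksbi, "great*cake*", window = 2)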
But your original usage of kwic(text, phrase("great* cake*")) is the recommended approach.

R regex match things other than known characters

For a text field, I would like to expose the entries that contain invalid characters. The list of invalid characters is unknown; I only know the list of accepted ones.
For example, for the French language the accepted list is
A-z, 1-9, [:punct:], space, àéèçè, hyphen, etc.
The list of invalid characters is unknown, yet I want anything unusual to surface. For example, I would want
This is an 2-piece à-la-carte dessert to pass, while
'Ã this Øs an apple' pops up as an anomaly.
The 'not contain' notion in R does not behave as I would like. For example,
grep("[^(abc)]", c("abcdef", "defabc", "apple"))
(those that do not contain 'abc') matches all three, while
grep("(abc)", c("abcdef", "defabc", "apple"))
behaves correctly and matches only the first two. Am I missing something?
How can we do that in R? Also, how can we put the hyphen in the list of accepted characters?
[a-z1-9[:punct:] àâæçéèêëîïôœùûüÿ-]+
The above regex matches any of the following (one or more times). Note that the parameter ignore.case=T used in the code below allows the following to also match uppercase variants of the letters.
a-z Any lowercase ASCII letter
1-9 Any digit in the range from 1 to 9 (excludes 0)
[:punct:] Any punctuation character
The space character
àâæçéèêëîïôœùûüÿ Any valid French character with a diacritic mark
- The hyphen character
x <- c("This is an 2-piece à-la-carte dessert", "Ã this Øs an apple")
gsub("[a-z1-9[:punct:] àâæçéèêëîïôœùûüÿ-]+", "", x, ignore.case=T)
The code above replaces all valid characters with nothing. The result is all invalid characters that exist in the string. The following is the output:
[1] "" "ÃØ"
If by "expose the invalid characters" you mean delete the "accepted" ones, then a regex character class should be helpful. From the ?regex help page we can see that a hyphen is already part of the punctuation character vector;
[:punct:]
Punctuation characters:
! " # $ % & ' ( ) * + , - . / : ; < = > ? # [ \ ] ^ _ ` { | } ~
So the code could be:
x <- 'Ã this Øs an apple'
gsub("[A-z1-9[:punct:] àéèçè]+", "", x)
#[1] "ÃØ"
Note that regex has a predefined, locale-specific "[:alpha:]" named character class that would probably be both safer and more compact than the expression "[A-zàéèçè]", especially since the previous answer suggests that you missed a few accented letters. The ?regex page indicates that "[0-9A-Za-z]" might be both locale- and encoding-specific.
If by "expose" you instead meant "identify the postion within the string" then you could use the negation operator "^" within the character class formalism and apply gregexpr:
gregexpr("[^A-z1-9[:punct:] àéèçè]+", x)
[[1]]
[1] 1 8
attr(,"match.length")
[1] 1 1
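To pull out the offending characters themselves, rather than just their positions, the gregexpr() result can be fed to regmatches(); a minimal sketch:
x <- 'Ã this Øs an apple'
m <- gregexpr("[^A-z1-9[:punct:] àéèçè]+", x)
# extract the substrings at the matched positions
regmatches(x, m)
# [[1]]
# [1] "Ã" "Ø"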

Capital letters in first letter

In Python, I want a program that turns the first letter of each word into a capital letter.
For example:
turn "a red apple is sweeter than a green apple" into "A Red Apple is Sweeter Than A Green Apple"
How can I do this?
I've tried this:
d = input('insert a quote')

def mydic(d):
    dic = {}
    for i in d:
        palavras = dic.keys()
        if i in palavras:
            dic[i] += 1
        else:
            dic[i] = 1
    return dic
You could use the title() method.
For example:
sentence = str(input("Insert a quote: ")).title()
print(sentence)
Input: a red apple is sweeter than a green apple
Output: A Red Apple Is Sweeter Than A Green Apple
What you want to do is this:
split the input string into words, i.e. string.split(' ') splits a given string on spaces and returns a list;
for each word, capitalize the first letter and concatenate it onto a bigger string, i.e. word[:1].upper() + word[1:] uppercases the first letter;
add all the words back into a list and return it; see the sketch after this list.
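A minimal sketch of that word-by-word approach (the function name capitalize_words is hypothetical); unlike str.title(), it leaves the rest of each word untouched:
def capitalize_words(sentence):
    # split on single spaces so the original spacing is preserved when rejoining
    words = sentence.split(' ')
    # uppercase the first letter of each word, keep the rest as-is
    words = [w[:1].upper() + w[1:] for w in words]
    return ' '.join(words)

print(capitalize_words("a red apple is sweeter than a green apple"))
# A Red Apple Is Sweeter Than A Green Apple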

R - How to split text and punctuation with an exception?

I am analysing Facebook comments in R for sentiment analysis. Emojis are encoded in the text between <> symbols.
Example:
"Jesus te ama!!! <U+2764> Ou não...?<U+1F628> (fé em stand by)"
<U+2764> and <U+1F628> are emojis (heavy black heart and fearful face,
respectively).
So I need to split words/numbers from punctuation/symbols, except inside emoji codes.
Using the gsub function, I did this:
a1 <- "([[:alpha:]])([[:punct:]])"
a2 <- "([[:punct:]])([[:alpha:]])"
b <- "\\1 \\2"
gsub(a1, b, gsub(a2, b, "Jesus te ama!!! <U+2764> Ou não...?<U+1F628> (fé em stand by)"))
...but the result, logically, also affects the emoji codes:
[1] "Jesus te ama !!! < U +2764> Ou não ...?< U +1F628> ( fé em stand by )"
The objective is to create an exception for the text between <>: split around it but don't split inside it, i.e.:
[1] "Jesus te ama !!! <U+2764> Ou não ...? <U+1F628> ( fé em stand by )"
Note that:
sometimes the space between the sentence/word/punctuation and an emoji code is non-existent (it needs to be created)
it is required that a punctuation sequence stays joined (e.g. "!!!", "...?")
How can I do it?
You may use the following regex solution:
a1 <- "(?<=<)U\\+\\w+>(*SKIP)(*F)|(?<=\\S)(?=<U\\+\\w+>)|(?<=[[:alpha:]])(?=[[:punct:]])|(?<=[[:punct:]])(?=[[:alpha:]])"
gsub(a1, " ", "Jesus te ama!!! <U+2764> Ou não...?<U+1F628> (fé em stand by)", perl=TRUE)
# => [1] "Jesus te ama !!! <U+2764> Ou não ...? <U+1F628> ( fé em stand by )"
This PCRE regex (see the perl=TRUE argument in the call to gsub) matches:
(?<=<)U\\+\\w+>(*SKIP)(*F) - U+ followed by 1+ word chars and >, when preceded by < - the match value is discarded with the PCRE verbs (*SKIP)(*F), and the search for the next match resumes from the end of this match
| - or
(?<=\\S)(?=<U\\+\\w+>) - a non-whitespace char must be present immediately to the left of the current location, and a <U+, 1+ word chars and > must be present immediately to the right of the current location
| - or
(?<=[[:alpha:]])(?=[[:punct:]]) - a letter must be present immediately to the left of the current location, and a punctuation must be present immediately to the right of the current location
| - or
(?<=[[:punct:]])(?=[[:alpha:]]) - a punctuation must be present immediately to the left of the current location, and a letter must be present immediately to the right of the current location
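Once the gsub() above has inserted the missing spaces, a plain split on whitespace yields the desired tokens with the emoji codes intact; a minimal sketch:
out <- gsub(a1, " ", "Jesus te ama!!! <U+2764> Ou não...?<U+1F628> (fé em stand by)", perl=TRUE)
strsplit(out, "\\s+")
# [[1]]
#  [1] "Jesus"     "te"        "ama"       "!!!"       "<U+2764>"  "Ou"
#  [7] "não"       "...?"      "<U+1F628>" "("         "fé"        "em"
# [13] "stand"     "by"        ")"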
An alternative approach splits on whitespace and before sentence punctuation, although, as the output shows, it breaks up punctuation runs and leaves the parentheses attached to the words:
> str <- "Jesus te ama!!! <U+2764> Ou não...?<U+1F628> (fé em stand by)"
> strsplit(str,"[[:space:]]|(?=[.!?])",perl=TRUE)
[[1]]
[1] "Jesus" "te" "ama" "!" "!" "!"
[7] "" "<U+2764>" "" "Ou" "não" "."
[13] "." "." "?" "<U+1F628>" "(fé" "em"
[19] "stand" "by)"
