For some reason, stop word removal is not working for my corpus, which is entirely in French. I've been trying repeatedly over the past few days, but many words that should have been filtered simply are not. Has anyone else run into a similar issue? I read somewhere that it could be because of the accents. I tried stringi::stri_trans_general(x, "Latin-ASCII"), but I am not certain I did this correctly. Also, I notice that the French stop word list is sometimes referred to as "french" and sometimes as "fr".
This is one example of the code I tried; I would be extremely grateful for any advice.
I also installed quanteda manually because I had difficulties downloading it, so the problem could be linked to that.
text_corp <- quanteda::corpus(data, text_field = "text")
head(stopwords("french"))
summary(text_corp)
my_dfm <- dfm(text_corp)
myStemMat <- dfm(text_corp, remove = stopwords("french"), stem = TRUE, remove_punct = TRUE, remove_numbers = TRUE, remove_separators = TRUE)
myStemMat[, 1:5]
topfeatures(myStemMat, 20)
In this last step, there are still words like "etre" ("to be"), "plus" ("more"), "comme" ("like"), "avant" ("before"), and "avoir" ("to have").
I also tried to filter stop words in a different way, through token creation:
tokens <- tokens(
  text_corp,
  remove_numbers = TRUE,
  remove_punct = TRUE,
  remove_symbols = TRUE,
  remove_url = TRUE,
  split_hyphens = TRUE,
  include_docvars = TRUE
)
mydfm <- dfm(tokens,
tolower = TRUE,
stem = TRUE,
remove = stopwords("french")
)
topfeatures(mydfm, 20)
The stopwords are working just fine; however, the default Snowball list of French stopwords simply does not include the words you wish to remove.
You can see that by inspecting the vector of stopwords returned by stopwords("fr"):
library("quanteda")
## Package version: 2.1.2
c("comme", "avoir", "plus", "avant", "être") %in%
stopwords("fr")
## [1] FALSE FALSE FALSE FALSE FALSE
This is the full list of words:
sort(stopwords("fr"))
## [1] "à" "ai" "aie" "aient" "aies" "ait"
## [7] "as" "au" "aura" "aurai" "auraient" "aurais"
## [13] "aurait" "auras" "aurez" "auriez" "aurions" "aurons"
## [19] "auront" "aux" "avaient" "avais" "avait" "avec"
## [25] "avez" "aviez" "avions" "avons" "ayant" "ayez"
## [31] "ayons" "c" "ce" "ceci" "cela" "celà"
## [37] "ces" "cet" "cette" "d" "dans" "de"
## [43] "des" "du" "elle" "en" "es" "est"
## [49] "et" "étaient" "étais" "était" "étant" "été"
## [55] "étée" "étées" "étés" "êtes" "étiez" "étions"
## [61] "eu" "eue" "eues" "eûmes" "eurent" "eus"
## [67] "eusse" "eussent" "eusses" "eussiez" "eussions" "eut"
## [73] "eût" "eûtes" "eux" "fûmes" "furent" "fus"
## [79] "fusse" "fussent" "fusses" "fussiez" "fussions" "fut"
## [85] "fût" "fûtes" "ici" "il" "ils" "j"
## [91] "je" "l" "la" "le" "les" "leur"
## [97] "leurs" "lui" "m" "ma" "mais" "me"
## [103] "même" "mes" "moi" "mon" "n" "ne"
## [109] "nos" "notre" "nous" "on" "ont" "ou"
## [115] "par" "pas" "pour" "qu" "que" "quel"
## [121] "quelle" "quelles" "quels" "qui" "s" "sa"
## [127] "sans" "se" "sera" "serai" "seraient" "serais"
## [133] "serait" "seras" "serez" "seriez" "serions" "serons"
## [139] "seront" "ses" "soi" "soient" "sois" "soit"
## [145] "sommes" "son" "sont" "soyez" "soyons" "suis"
## [151] "sur" "t" "ta" "te" "tes" "toi"
## [157] "ton" "tu" "un" "une" "vos" "votre"
## [163] "vous" "y"
That's why they are not removed. We can see this with an example I created, using many of your words:
toks <- tokens("Je veux avoir une glace et être heureux, comme un enfant avant le dîner.",
remove_punct = TRUE
)
tokens_remove(toks, stopwords("fr"))
## Tokens consisting of 1 document.
## text1 :
## [1] "veux" "avoir" "glace" "être" "heureux" "comme" "enfant"
## [8] "avant" "dîner"
How to remove them? Either use a more complete list of stopwords, or customize the Snowball list by appending the stopwords you want to the existing ones.
mystopwords <- c(stopwords("fr"), "comme", "avoir", "plus", "avant", "être")
tokens_remove(toks, mystopwords)
## Tokens consisting of 1 document.
## text1 :
## [1] "veux" "glace" "heureux" "enfant" "dîner"
You could also use one of the other stopword sources, such as "stopwords-iso", which does contain all of the words you wish to remove:
c("comme", "avoir", "plus", "avant", "être") %in%
stopwords("fr", source = "stopwords-iso")
## [1] TRUE TRUE TRUE TRUE TRUE
With regard to the language question, see the help for ?stopwords::stopwords, which states:
The language codes for each stopword list use the two-letter ISO code from https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes. For backwards compatibility, the full English names of the stopwords from the quanteda package may also be used, although these are deprecated.
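So "french" and "fr" should select the same Snowball list. A quick way to check this in your own session (the result shown is what the documented behaviour implies; it may depend on your installed stopwords version):
identical(stopwords("french"), stopwords("fr"))
## [1] TRUE  (expected)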
With regard to what you tried with stringi::stri_trans_general(x, "Latin-ASCII"), this would only help you if you wanted to remove "etre" and your stopword list contained only "être". In the example below, the stopword vector containing the accented character is concatenated with a version of itself in which the accents have been removed.
sw <- "être"
tokens("etre être heureux") %>%
tokens_remove(sw)
## Tokens consisting of 1 document.
## text1 :
## [1] "etre" "heureux"
tokens("etre être heureux") %>%
tokens_remove(c(sw, stringi::stri_trans_general(sw, "Latin-ASCII")))
## Tokens consisting of 1 document.
## text1 :
## [1] "heureux"
c(sw, stringi::stri_trans_general(sw, "Latin-ASCII"))
## [1] "être" "etre"
I am using the Quanteda suite of packages to preprocess some text data. I want to incorporate collocations as features and decided to use the textstat_collocations function. According to the documentation, and I quote:
"The tokens object . . . . While identifying collocations for tokens objects is supported, you will get better results with character or corpus objects due to relatively imperfect detection of sentence boundaries from texts already tokenized."
This makes perfect sense, so here goes:
library(dplyr)
library(tibble)
library(quanteda)
library(quanteda.textstats)
# Some sample data and lemmas
df= c("this column has a lot of missing data, 50% almost!",
"I am interested in missing data problems",
"missing data is a headache",
"how do you handle missing data?")
lemmas <- data.frame() %>%
rbind(c("missing", "miss")) %>%
rbind(c("data", "datum")) %>%
`colnames<-`(c("inflected_form", "lemma"))
(1) Generate collocations using the corpus object:
txtCorpus = corpus(df)
docvars(txtCorpus)$text <- as.character(txtCorpus)
myPhrases = textstat_collocations(txtCorpus, tolower = FALSE)
(2) Preprocess the text, identify collocations, and lemmatize for downstream tasks.
# I used a blank space as the concatenator and the phrase() function as explained in the documentation, following the multi-multi substitution example at
# https://quanteda.io/reference/tokens_replace.html
txtTokens = tokens(txtCorpus, remove_numbers = TRUE, remove_punct = TRUE,
remove_symbols = TRUE, remove_separators = TRUE) %>%
tokens_tolower() %>%
tokens_compound(pattern = phrase(myPhrases$collocation), concatenator = " ") %>%
tokens_replace(pattern=phrase(c(lemmas$inflected_form)), replacement=phrase(c(lemmas$lemma)))
(3) Test the results
# Create dtm
dtm = dfm(txtTokens, remove_padding = TRUE)
# pull features
dfm_feat = as.data.frame(featfreq(dtm)) %>%
rownames_to_column(var="feature") %>%
`colnames<-`(c("feature", "count"))
dfm_feat
feature        count
this               1
column             1
has                1
a                  2
lot                1
of                 1
almost             1
i                  2
am                 1
interested         1
in                 1
problems           1
is                 1
headache           1
how                1
do                 1
you                1
handle             1
missing data       4
"missing data" should be "miss datum".
This only works if each document in df is a single word. I can make the process work if I generate my collocations using a tokens object from the get-go, but that's not what I want.
The problem is that you have already compounded the elements of the collocation into a single "token" containing a space, but by supplying the phrase() wrapper in tokens_replace(), you are telling it to look for two sequential tokens rather than for the single token containing a space.
The way to get what you want is by making the lemmatised replacement match the collocation.
phrase_lemmas <- data.frame(
inflected_form = "missing data",
lemma = "miss datum"
)
tokens_replace(txtTokens, phrase_lemmas$inflected_form, phrase_lemmas$lemma)
## Tokens consisting of 4 documents and 1 docvar.
## text1 :
## [1] "this" "column" "has" "a" "lot"
## [6] "of" "miss datum" "almost"
##
## text2 :
## [1] "i" "am" "interested" "in" "miss datum"
## [6] "problems"
##
## text3 :
## [1] "miss datum" "is" "a" "headache"
##
## text4 :
## [1] "how" "do" "you" "handle" "miss datum"
An alternative would be to use tokens_lookup() on uncompounded tokens directly, if you have a fixed listing of sequences you want to match to lemmatised sequences. E.g.,
tokens(txtCorpus) %>%
tokens_lookup(dictionary(list("miss datum" = "missing data")),
exclusive = FALSE, capkeys = FALSE
)
## Tokens consisting of 4 documents and 1 docvar.
## text1 :
## [1] "this" "column" "has" "a" "lot"
## [6] "of" "miss datum" "," "50" "%"
## [11] "almost" "!"
##
## text2 :
## [1] "I" "am" "interested" "in" "miss datum"
## [6] "problems"
##
## text3 :
## [1] "miss datum" "is" "a" "headache"
##
## text4 :
## [1] "how" "do" "you" "handle" "miss datum"
## [6] "?"
I have a word and want to output, in R, all possible deviations (deletion, substitution, insertion) for a fixed distance value into a vector.
For instance, the word "Cat" and a fixed distance value of 1 result in a vector with elements such as "cot", "at", ...
I'm going to assume that you want all actual words, not just permutations of the characters with an edit distance of 1 that would include non-words such as "zat".
We can do this using adist() to compute the edit distance between your target word and all eligible English words, taken from some word list. Here, I used the English syllable dictionary from the quanteda package (you did tag this question as quanteda after all) but this could have been any vector of English dictionary words from any other source as well.
To narrow things down, we first exclude all words whose length differs from that of the target word by more than the distance value.
distfn <- function(word, distance = 1) {
# select eligible words for efficiency
eligible_y_words <- names(quanteda::data_int_syllables)
wordlengths <- nchar(eligible_y_words)
eligible_y_words <- eligible_y_words[wordlengths >= (nchar(word) - distance) &
wordlengths <= (nchar(word) + distance)]
# compute Levenshtein distance
distances <- utils::adist(word, eligible_y_words)[1, ]
# return only those for the requested distance value
eligible_y_words[distances == distance]
}
distfn("cat", 1)
## [1] "at" "bat" "ca" "cab" "cac" "cad" "cai" "cal" "cam" "can"
## [11] "cant" "cao" "cap" "caq" "car" "cart" "cas" "cast" "cate" "cato"
## [21] "cats" "catt" "cau" "caw" "cay" "chat" "coat" "cot" "ct" "cut"
## [31] "dat" "eat" "fat" "gat" "hat" "kat" "lat" "mat" "nat" "oat"
## [41] "pat" "rat" "sat" "scat" "tat" "vat" "wat"
To demonstrate how this works on longer words, and with alternative distance values:
distfn("coffee", 1)
## [1] "caffee" "coffeen" "coffees" "coffel" "coffer" "coffey" "cuffee"
## [8] "toffee"
distfn("coffee", 2)
## [1] "caffey" "calfee" "chafee" "chaffee" "cofer" "coffee's"
## [7] "coffelt" "coffers" "coffin" "cofide" "cohee" "coiffe"
## [13] "coiffed" "colee" "colfer" "combee" "comfed" "confer"
## [19] "conlee" "coppee" "cottee" "coulee" "coutee" "cuffe"
## [25] "cuffed" "diffee" "duffee" "hoffer" "jaffee" "joffe"
## [31] "mcaffee" "moffet" "noffke" "offen" "offer" "roffe"
## [37] "scoffed" "soffel" "soffer" "yoffie"
(Yes, according to the CMU pronunciation dictionary, those are all actual words...)
EDIT: Generate all permutations of letters, not just actual words
This involves generating permutations from the alphabet that have the fixed edit distance from the input word. Here I've done it not particularly efficiently, by forming all permutations of letters within the eligible length range, computing their edit distance from the target word, and then selecting those at the requested distance. So it's a variation of the above, except that instead of a dictionary, it uses permuted strings.
distfn2 <- function(word, distance = 1) {
result <- character()
# start with deletions
for (i in max((nchar(word) - distance), 0):(nchar(word) - 1)) {
result <- c(
result,
combn(unlist(strsplit(word, "", fixed = TRUE)), i,
paste,
collapse = "", simplify = TRUE
)
)
}
# now for changes and insertions
for (i in (nchar(word)):(nchar(word) + distance)) {
# all possible edits
edits <- apply(expand.grid(rep(list(letters), i)),
1, paste0,
collapse = ""
)
# remove original word
edits <- edits[edits != word]
# get all distances, add to result
distances <- utils::adist(word, edits)[1, ]
result <- c(result, edits[distances == distance])
}
result
}
For the OP example:
distfn2("cat", 1)
## [1] "ca" "ct" "at" "caa" "cab" "cac" "cad" "cae" "caf" "cag"
## [11] "cah" "cai" "caj" "cak" "cal" "cam" "can" "cao" "cap" "caq"
## [21] "car" "cas" "aat" "bat" "dat" "eat" "fat" "gat" "hat" "iat"
## [31] "jat" "kat" "lat" "mat" "nat" "oat" "pat" "qat" "rat" "sat"
## [41] "tat" "uat" "vat" "wat" "xat" "yat" "zat" "cbt" "cct" "cdt"
## [51] "cet" "cft" "cgt" "cht" "cit" "cjt" "ckt" "clt" "cmt" "cnt"
## [61] "cot" "cpt" "cqt" "crt" "cst" "ctt" "cut" "cvt" "cwt" "cxt"
## [71] "cyt" "czt" "cau" "cav" "caw" "cax" "cay" "caz" "cata" "catb"
## [81] "catc" "catd" "cate" "catf" "catg" "cath" "cati" "catj" "catk" "catl"
## [91] "catm" "catn" "cato" "catp" "catq" "catr" "cats" "caat" "cbat" "acat"
## [101] "bcat" "ccat" "dcat" "ecat" "fcat" "gcat" "hcat" "icat" "jcat" "kcat"
## [111] "lcat" "mcat" "ncat" "ocat" "pcat" "qcat" "rcat" "scat" "tcat" "ucat"
## [121] "vcat" "wcat" "xcat" "ycat" "zcat" "cdat" "ceat" "cfat" "cgat" "chat"
## [131] "ciat" "cjat" "ckat" "clat" "cmat" "cnat" "coat" "cpat" "cqat" "crat"
## [141] "csat" "ctat" "cuat" "cvat" "cwat" "cxat" "cyat" "czat" "cabt" "cact"
## [151] "cadt" "caet" "caft" "cagt" "caht" "cait" "cajt" "cakt" "calt" "camt"
## [161] "cant" "caot" "capt" "caqt" "cart" "cast" "catt" "caut" "cavt" "cawt"
## [171] "caxt" "cayt" "cazt" "catu" "catv" "catw" "catx" "caty" "catz"
This also works with other edit distances, although it becomes very slow for longer words.
d2 <- distfn2("cat", 2)
set.seed(100)
c(head(d2, 50), sample(d2, 50), tail(d2, 50))
## [1] "c" "a" "t" "ca" "ct" "at" "aaa" "baa"
## [9] "daa" "eaa" "faa" "gaa" "haa" "iaa" "jaa" "kaa"
## [17] "laa" "maa" "naa" "oaa" "paa" "qaa" "raa" "saa"
## [25] "taa" "uaa" "vaa" "waa" "xaa" "yaa" "zaa" "cba"
## [33] "aca" "bca" "cca" "dca" "eca" "fca" "gca" "hca"
## [41] "ica" "jca" "kca" "lca" "mca" "nca" "oca" "pca"
## [49] "qca" "rca" "cnts" "cian" "pcatb" "cqo" "uawt" "hazt"
## [57] "cpxat" "aaet" "ckata" "caod" "ncatl" "qcamt" "cdtp" "qajt"
## [65] "bckat" "qcatr" "cqah" "rcbt" "cvbt" "bbcat" "vcaz" "ylcat"
## [73] "cahz" "jcgat" "mant" "jatd" "czlat" "cbamt" "cajta" "cafp"
## [81] "cizt" "cmaut" "qwat" "jcazt" "hdcat" "ucant" "hate" "cajtl"
## [89] "caaty" "cix" "nmat" "cajit" "cmnat" "caobt" "catoi" "ncau"
## [97] "ucoat" "ncamt" "jath" "oats" "chatz" "ciatz" "cjatz" "ckatz"
## [105] "clatz" "cmatz" "cnatz" "coatz" "cpatz" "cqatz" "cratz" "csatz"
## [113] "ctatz" "cuatz" "cvatz" "cwatz" "cxatz" "cyatz" "czatz" "cabtz"
## [121] "cactz" "cadtz" "caetz" "caftz" "cagtz" "cahtz" "caitz" "cajtz"
## [129] "caktz" "caltz" "camtz" "cantz" "caotz" "captz" "caqtz" "cartz"
## [137] "castz" "cattz" "cautz" "cavtz" "cawtz" "caxtz" "caytz" "caztz"
## [145] "catuz" "catvz" "catwz" "catxz" "catyz" "catzz"
This could be sped up by avoiding the brute-force formation of all permutations followed by adist(): the substitutions, insertions, and deletions at a known edit distance could instead be generated algorithmically from the letters.
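A minimal sketch of that idea for distance 1 (edits1 is a hypothetical helper, not part of quanteda; it builds the deletions, substitutions, and insertions directly rather than filtering every possible string with adist()):
edits1 <- function(word) {
  chars <- unlist(strsplit(word, "", fixed = TRUE))
  n <- length(chars)
  result <- character()
  # deletions: drop each character in turn
  for (i in seq_len(n)) {
    result <- c(result, paste(chars[-i], collapse = ""))
  }
  # substitutions: replace each character with every letter of the alphabet
  for (i in seq_len(n)) {
    for (l in letters) {
      cand <- chars
      cand[i] <- l
      result <- c(result, paste(cand, collapse = ""))
    }
  }
  # insertions: insert every letter at every position, including both ends
  for (i in 0:n) {
    for (l in letters) {
      result <- c(result, paste(c(chars[seq_len(i)], l, chars[seq_len(n - i) + i]),
                                collapse = ""))
    }
  }
  # drop the original word and any duplicates
  setdiff(unique(result), word)
}
# should give the same set as distfn2("cat", 1), up to ordering
setdiff(edits1("cat"), distfn2("cat", 1))
## expected: character(0)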
I'm using the quanteda dictionary lookup. I'd like to formulate dictionary entries where I can look up logical combinations of words.
For example:
Teddybear = (fluffy AND adorable AND soft)
Is this possible? So far I have only found a solution to test for phrases like (Teddybear = (soft fluffy adorable)), but then it has to be an exact phrase match in the text. How can I get matches that ignore the order of the words?
This is not currently something that is directly possible in quanteda (v1.2.0). However, there are workarounds in which you create dictionary sequences that are permutations of your desired sequence. Here is one such solution.
First, I will create some example texts. Note that the sequences are separated by either "," or "and" in some cases. Also, the third text has just two of your words rather than three. (More on that in a moment.)
txt <- c("The toy was fluffy, adorable and soft, he said.",
"The soft, adorable, fluffy toy was on the floor.",
"The fluffy, adorable toy was shaped like a bear.")
Now, let's generate a pair of functions to generate permutation sequences and subsequences from a vector. These will use some functions from the combinat package. The first is an inner function to generate permutations, the second is the main calling function that can generate full-length permutations or any subsample down to subsample_limit. (To use these more generally, of course, I'd add error checking, but I've skipped that for this example.)
genperms <- function(vec) {
combs <- combinat::permn(vec)
sapply(combs, paste, collapse = " ")
}
# vec any vector
# subsample_limit integer from 1 to length(vec), subsamples from
# which to return permutations; default is no subsamples
permutefn <- function(vec, subsample_limit = length(vec)) {
ret <- character()
for (i in length(vec):subsample_limit) {
ret <- c(ret,
unlist(lapply(combinat::combn(vec, i, simplify = FALSE),
genperms)))
}
ret
}
To demonstrate how these work:
fas <- c("fluffy", "adorable", "soft")
permutefn(fas)
# [1] "fluffy adorable soft" "fluffy soft adorable" "soft fluffy adorable"
# [4] "soft adorable fluffy" "adorable soft fluffy" "adorable fluffy soft"
# and with subsampling:
permutefn(fas, 2)
# [1] "fluffy adorable soft" "fluffy soft adorable" "soft fluffy adorable"
# [4] "soft adorable fluffy" "adorable soft fluffy" "adorable fluffy soft"
# [7] "fluffy adorable" "adorable fluffy" "fluffy soft"
# [10] "soft fluffy" "adorable soft" "soft adorable"
Now apply these to the texts using tokens_lookup(). I've avoided the punctuation issue by setting remove_punct = TRUE. To show the original tokens that were not replaced, I have also used exclusive = FALSE.
tokens(txt, remove_punct = TRUE) %>%
tokens_lookup(dictionary = dictionary(list(teddybear = permutefn(fas))),
exclusive = FALSE)
# tokens from 3 documents.
# text1 :
# [1] "The" "toy" "was" "fluffy" "adorable" "and" "soft"
# [8] "he" "said"
#
# text2 :
# [1] "The" "TEDDYBEAR" "toy" "was" "on" "the"
# [8] "floor"
#
# text3 :
# [1] "The" "fluffy" "adorable" "toy" "was" "shaped" "like"
# [8] "a" "bear"
The first case here was not caught, because the second and third elements were separated by "and". We can remove that using tokens_remove(), and then get the match:
tokens(txt, remove_punct = TRUE) %>%
tokens_remove("and") %>%
tokens_lookup(dictionary = dictionary(list(teddybear = permutefn(fas))),
exclusive = FALSE)
# tokens from 3 documents.
# text1 :
# [1] "The" "toy" "was" "TEDDYBEAR" "he" "said"
#
# text2 :
# [1] "The" "TEDDYBEAR" "toy" "was" "on" "the" "floor"
#
# text3 :
# [1] "The" "fluffy" "adorable" "toy" "was" "shaped" "like"
# [8] "a" "bear"
Finally, to match the third text in which just two of the three dictionary elements exist, we can pass 2 as the subsample_limit argument:
tokens(txt, remove_punct = TRUE) %>%
tokens_remove("and") %>%
tokens_lookup(dictionary = dictionary(list(teddybear = permutefn(fas, 2))),
exclusive = FALSE)
# tokens from 3 documents.
# text1 :
# [1] "The" "toy" "was" "TEDDYBEAR" "he" "said"
#
# text2 :
# [1] "The" "TEDDYBEAR" "toy" "was" "on" "the" "floor"
#
# text3 :
# [1] "The" "TEDDYBEAR" "toy" "was" "shaped" "like" "a"
# [8] "bear"
#
If you want to know which documents have all the words, just do:
require(quanteda)
txt <- c("The toy was fluffy, adorable and soft, he said.",
"The soft, adorable, fluffy toy was on the floor.",
"The fluffy, adorable toy was shaped like a bear.")
dict <- dictionary(list(teddybear = list(c1 = "fluffy", c2 = "adorable", c3 = "soft")))
mt <- dfm_lookup(dfm(txt), dictionary = dict["teddybear"], levels = 2)
cbind(mt, "teddybear" = as.numeric(rowSums(mt > 0) == length(dict[["teddybear"]])))
# Document-feature matrix of: 3 documents, 4 features (16.7% sparse).
# 3 x 4 sparse Matrix of class "dfm"
# features
# docs c1 c2 c3 teddybear
# text1 1 1 1 1
# text2 1 1 1 1
# text3 1 1 0 0
I am currently running an stm (structural topic model) on a series of articles from the French newspaper Le Monde. The model is working just great, but I have a problem with the pre-processing of the text.
I'm currently using the quanteda package and the tm package for things like removing words, removing numbers, etc.
There's only one thing, though, that doesn't seem to work.
As some of you might know, in French the masculine definite article -le- contracts to -l'- before vowels. I've tried to remove -l'- (and similar things like -d'-) as words with removeWords:
lmt67 <- removeWords(lmt67, c( "l'","d'","qu'il", "n'", "a", "dans"))
but it only works with words that are separate from the rest of the text, not with articles that are attached to a word, as in -l'arbre- (the tree).
Frustrated, I then tried a simple gsub:
lmt67 <- gsub("l'","",lmt67)
but that doesn't seem to be working either.
Now, what's a better way to do this, possibly through a c(...) vector so that I can give it a series of expressions all at once?
Just as context, lmt67 is a "large character" with 30,000 elements/articles, obtained by using the texts() function on data imported from txt files.
Thanks to anyone that will want to help me.
I'll outline two ways to do this using quanteda and quanteda-related tools. First, let's define a slightly longer text, with more prefix cases for French. Notice the inclusion of the ’ apostrophe as well as the ASCII 39 simple apostrophe.
txt <- c(doc1 = "M. Trump, lors d’une réunion convoquée d’urgence à la Maison Blanche,
n’en a pas dit mot devant la presse. En réalité, il s’agit d’une
mesure essentiellement commerciale de ce pays qui l'importe.",
doc2 = "Réfugié à Bruxelles, l’indépendantiste catalan a désigné comme
successeur Jordi Sanchez, partisan de l’indépendance catalane,
actuellement en prison pour sédition.")
The first method uses pattern matches for the simple ASCII 39 apostrophe plus a number of Unicode variants, matched through the Unicode category "Pf" ("Punctuation: Final Quote"). Note that quanteda does its best to normalize the quotes at the tokenization stage - see, for instance, the "l'indépendance" in the second document. The second method uses a French part-of-speech tagger integrated with quanteda, which allows a similar selection after the prefixes have been recognized and separated, and then removes determinants (among other parts of speech).
1. quanteda tokens
toks <- tokens(txt, remove_punct = TRUE)
# remove stopwords
toks <- tokens_remove(toks, stopwords("french"))
toks
# tokens from 2 documents.
# doc1 :
# [1] "M" "Trump" "lors" "d'une" "réunion"
# [6] "convoquée" "d'urgence" "à" "la" "Maison"
# [11] "Blanche" "n'en" "a" "pas" "dit"
# [16] "mot" "devant" "la" "presse" "En"
# [21] "réalité" "il" "s'agit" "d'une" "mesure"
# [26] "essentiellement" "commerciale" "de" "ce" "pays"
# [31] "qui" "l'importe"
#
# doc2 :
# [1] "Réfugié" "à" "Bruxelles" "l'indépendantiste"
# [5] "catalan" "a" "désigné" "comme"
# [9] "successeur" "Jordi" "Sanchez" "partisan"
# [13] "de" "l'indépendance" "catalane" "actuellement"
# [17] "en" "prison" "pour" "sédition"
Then, we apply the pattern to match l', d', or s', using a regular expression replacement on the types (the unique tokens):
toks <- tokens_replace(
toks,
types(toks),
stringi::stri_replace_all_regex(types(toks), "[lsd]['\\p{Pf}]", "")
)
# tokens from 2 documents.
# doc1 :
# [1] "M" "Trump" "lors" "une" "réunion"
# [6] "convoquée" "urgence" "à" "la" "Maison"
# [11] "Blanche" "n'en" "a" "pas" "dit"
# [16] "mot" "devant" "la" "presse" "En"
# [21] "réalité" "il" "agit" "une" "mesure"
# [26] "essentiellement" "commerciale" "de" "ce" "pays"
# [31] "qui" "importe"
#
# doc2 :
# [1] "Réfugié" "à" "Bruxelles" "indépendantiste" "catalan"
# [6] "a" "désigné" "comme" "successeur" "Jordi"
# [11] "Sanchez" "partisan" "de" "indépendance" "catalane"
# [16] "actuellement" "En" "prison" "pour" "sédition"
From the resulting toks object you can form a dfm and then proceed to fit the STM.
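A minimal sketch of that last step (the K value below is just a placeholder, and this assumes the stm package is installed; quanteda's convert() produces the input format that stm expects):
dfmat <- dfm(toks)
stm_input <- convert(dfmat, to = "stm")
# mod <- stm::stm(stm_input$documents, stm_input$vocab, K = 20)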
2. using spacyr
This will involve more sophisticated part-of-speech tagging and then converting the tagged object into quanteda tokens. This requires first that you install Python, spacy, and the French language model. (See https://spacy.io/usage/models.)
library(spacyr)
spacy_initialize(model = "fr", python_executable = "/anaconda/bin/python")
# successfully initialized (spaCy Version: 2.0.1, language model: fr)
toks <- spacy_parse(txt, lemma = FALSE) %>%
as.tokens(include_pos = "pos")
toks
# tokens from 2 documents.
# doc1 :
# [1] "M./NOUN" "Trump/PROPN" ",/PUNCT"
# [4] "lors/ADV" "d’/PUNCT" "une/DET"
# [7] "réunion/NOUN" "convoquée/VERB" "d’/ADP"
# [10] "urgence/NOUN" "à/ADP" "la/DET"
# [13] "Maison/PROPN" "Blanche/PROPN" ",/PUNCT"
# [16] "\n /SPACE" "n’/VERB" "en/PRON"
# [19] "a/AUX" "pas/ADV" "dit/VERB"
# [22] "mot/ADV" "devant/ADP" "la/DET"
# [25] "presse/NOUN" "./PUNCT" "En/ADP"
# [28] "réalité/NOUN" ",/PUNCT" "il/PRON"
# [31] "s’/AUX" "agit/VERB" "d’/ADP"
# [34] "une/DET" "\n /SPACE" "mesure/NOUN"
# [37] "essentiellement/ADV" "commerciale/ADJ" "de/ADP"
# [40] "ce/DET" "pays/NOUN" "qui/PRON"
# [43] "l'/DET" "importe/NOUN" "./PUNCT"
#
# doc2 :
# [1] "Réfugié/VERB" "à/ADP" "Bruxelles/PROPN"
# [4] ",/PUNCT" "l’/PRON" "indépendantiste/ADJ"
# [7] "catalan/VERB" "a/AUX" "désigné/VERB"
# [10] "comme/ADP" "\n /SPACE" "successeur/NOUN"
# [13] "Jordi/PROPN" "Sanchez/PROPN" ",/PUNCT"
# [16] "partisan/VERB" "de/ADP" "l’/DET"
# [19] "indépendance/ADJ" "catalane/ADJ" ",/PUNCT"
# [22] "\n /SPACE" "actuellement/ADV" "en/ADP"
# [25] "prison/NOUN" "pour/ADP" "sédition/NOUN"
# [28] "./PUNCT"
Then we can use the default glob-matching to remove the parts of speech in which we are probably not interested, including the newline:
toks <- tokens_remove(toks, c("*/DET", "*/PUNCT", "\n*", "*/ADP", "*/AUX", "*/PRON"))
toks
# doc1 :
# [1] "M./NOUN" "Trump/PROPN" "lors/ADV" "réunion/NOUN" "convoquée/VERB"
# [6] "urgence/NOUN" "Maison/PROPN" "Blanche/PROPN" "n’/VERB" "pas/ADV"
# [11] "dit/VERB" "mot/ADV" "presse/NOUN" "réalité/NOUN" "agit/VERB"
# [16] "mesure/NOUN" "essentiellement/ADV" "commerciale/ADJ" "pays/NOUN" "importe/NOUN"
#
# doc2 :
# [1] "Réfugié/VERB" "Bruxelles/PROPN" "indépendantiste/ADJ" "catalan/VERB" "désigné/VERB"
# [6] "successeur/NOUN" "Jordi/PROPN" "Sanchez/PROPN" "partisan/VERB" "indépendance/ADJ"
# [11] "catalane/ADJ" "actuellement/ADV" "prison/NOUN" "sédition/NOUN"
Then we can remove the tags, which you probably don't want in your STM - but you could leave them if you prefer.
## remove the tags
toks <- tokens_replace(toks, types(toks),
stringi::stri_replace_all_regex(types(toks), "/[A-Z]+$", ""))
toks
# tokens from 2 documents.
# doc1 :
# [1] "M." "Trump" "lors" "réunion" "convoquée"
# [6] "urgence" "Maison" "Blanche" "n’" "pas"
# [11] "dit" "mot" "presse" "réalité" "agit"
# [16] "mesure" "essentiellement" "commerciale" "pays" "importe"
#
# doc2 :
# [1] "Réfugié" "Bruxelles" "indépendantiste" "catalan" "désigné"
# [6] "successeur" "Jordi" "Sanchez" "partisan" "indépendance"
# [11] "catalane" "actuellement" "prison" "sédition"
From there, you can use the toks object to form your dfm and fit the model.
Here's a scrape from the current page at Le Monde's website. Notice that the apostrophe they use is not the same character as the single-quote here "'":
text <- "Réfugié à Bruxelles, l’indépendantiste catalan a désigné comme successeur Jordi Sanchez, partisan de l’indépendance catalane, actuellement en prison pour sédition."
It has a slight angle and is not actually "straight down" when I view it. You need to copy that exact character into your gsub/sub call:
sub("l’", "", text)
[#1] "Réfugié à Bruxelles, indépendantiste catalan a désigné comme successeur Jordi Sanchez, partisan de l’indépendance catalane, actuellement en prison pour sédition."
I have two variables with names stored in them. I want to see how many names in the variable ScanName are in the vector B, ignoring case. Also, what are the differences?
I want to ignore case differences in the search (for example, it should consider hsa-mir-1 and hsa-miR-1 to be the same).
My data look like this:
str(B)
Factor w/ 1046 levels "hsa-let-7a-1",..: 1 2 3 4 5 6 7 8 9 10 ...
>B
[1] hsa-let-7a-1 hsa-let-7a-2 hsa-let-7a-3 hsa-let-7b hsa-let-7c hsa-let-7d
[7] hsa-let-7e hsa-let-7f-1 hsa-let-7f-2 hsa-let-7g hsa-let-7i hsa-mir-1-1
[13] hsa-miR-1238 hsa-mir-100 hsa-mir-101-1 hsa-mir-101-2 hsa-mir-103-1 hsa-mir-103-1-as
[19] hsa-mir-103-2 hsa-mir-103-2-as hsa-mir-105-1 hsa-mir-105-2 hsa-mir-106a hsa-mir-106b
and
> str(ScanName)
chr [1:1146] "hsa-miR-103b" "hsa-miR-1178" "hsa-miR-1179" "hsa-miR-1180" "hsa-miR-1181
> ScanName
[1] "hsa-miR-103b" "hsa-miR-1178" "hsa-miR-1179" "hsa-miR-1180" "hsa-miR-1181" "hsa-miR-1182"
[7] "hsa-miR-1183" "hsa-miR-1184" "hsa-miR-1193" "hsa-miR-1197" "hsa-miR-1200" "hsa-miR-1203"
[13] "hsa-miR-1204" "hsa-miR-1205" "hsa-miR-1206" "hsa-miR-1208" "hsa-miR-1224-3p" "hsa-miR-1225-3p"
[19] "hsa-miR-1225-5p" "hsa-miR-1227" "hsa-miR-1228" "hsa-miR-1229" "hsa-miR-1231" "hsa-miR-1233"
[25] "hsa-miR-1234" "hsa-let-7a-2" "hsa-miR-1238" "hsa-miR-1243" "hsa-miR-1244" "hsa-miR-1245"
[31] "hsa-miR-1245b-3p" "hsa-miR-1246" "hsa-miR-1247" "hsa-miR-1248" "hsa-miR-1249" "hsa-miR-1250"
[37] "hsa-miR-1251" "hsa-miR-1252"
You can use %in% and tolower:
ScanName[tolower(ScanName) %in% tolower(B)]
You can also use grep with the ignore.case argument set to TRUE
> unlist(sapply(B, function(x){
grep(x, ScanName, ignore.case = TRUE, value = TRUE)
}, USE.NAMES = FALSE))
## [1] "hsa-let-7a-2" "hsa-miR-1238"
which gives the same result as:
> ScanName[tolower(ScanName) %in% tolower(B)]
## [1] "hsa-let-7a-2" "hsa-miR-1238"