How to perform Lemmatization in R?

This question is a possible duplicate of Lemmatizer in R or python (am, are, is -> be?), but I'm adding it again since the previous one was closed as too broad and its only answer is not efficient (it accesses an external website, which is too slow because I have a very large corpus to find lemmas for). So part of this question will be similar to the one mentioned above.
According to Wikipedia, lemmatization is defined as:
Lemmatisation (or lemmatization) in linguistics is the process of grouping together the different inflected forms of a word so they can be analysed as a single item.
A simple Google search for lemmatization in R only points to the R package wordnet. When I tried this package, expecting that a character vector c("run", "ran", "running") passed to its lemmatization function would return c("run", "run", "run"), I saw that it only provides functionality similar to the grepl function, through various filter names and a dictionary.
Example code from the wordnet package, which returns a maximum of 5 words starting with "car", as the filter name itself explains:
filter <- getTermFilter("StartsWithFilter", "car", TRUE)
terms <- getIndexTerms("NOUN", 5, filter)
sapply(terms, getLemma)
The above is NOT the lemmatization that I'm looking for. What I'm looking for is to find the true roots of words using R: for example, from c("run", "ran", "running") to c("run", "run", "run").

Hello, you can try the package koRpus, which allows you to use TreeTagger:
tagged.results <- treetag(c("run", "ran", "running"), treetagger = "manual", format = "obj",
                          TT.tknz = FALSE, lang = "en",
                          TT.options = list(path = "./TreeTagger", preset = "en"))
tagged.results@TT.res
## token tag lemma lttr wclass desc stop stem
## 1 run NN run 3 noun Noun, singular or mass NA NA
## 2 ran VVD run 3 verb Verb, past tense NA NA
## 3 running VVG run 7 verb Verb, gerund or present participle NA NA
See the lemma column for the result you're asking for.
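If you only need the lemmas as a character vector, you can extract that column from the result. A small sketch, assuming the tagged.results object created above (the slot name follows the output shown; depending on your koRpus version, taggedText(tagged.results)$lemma is an equivalent accessor):
tagged.results@TT.res$lemma
## [1] "run" "run" "run"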

As a previous post mentioned, the function lemmatize_words() from the R package textstem can perform this and give you what I understand to be your desired results:
library(textstem)
vector <- c("run", "ran", "running")
lemmatize_words(vector)
## [1] "run" "run" "run"

@Andy and @Arunkumar are correct when they say the textstem library can be used to perform stemming and/or lemmatization. However, lemmatize_words() only works on a vector of words. In a corpus we do not have a vector of words; we have strings, each string being a document's content. Hence, to perform lemmatization on a corpus, you can pass the function lemmatize_strings() as an argument to tm_map() from the tm package.
> corpus[[1]]
[1] " earnest roughshod document serves workable primer regions recent history make
terrific th-grade learning tool samuel beckett applied iranian voting process bard
black comedy willie loved another trumpet blast may new mexican cinema -bornin "
> corpus <- tm_map(corpus, lemmatize_strings)
> corpus[[1]]
[1] "earnest roughshod document serve workable primer region recent history make
terrific th - grade learn tool samuel beckett apply iranian vote process bard black
comedy willie love another trumpet blast may new mexican cinema - bornin"
Do not forget to run the following line of code after you have done lemmatization:
> corpus <- tm_map(corpus, PlainTextDocument)
This is because creating a document-term matrix requires objects of type 'PlainTextDocument', and that type is lost after you use lemmatize_strings() (more specifically, the corpus object no longer contains the content and metadata of each document; it becomes just a structure holding the documents' content, which is not the kind of object that DocumentTermMatrix() takes as an argument).
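Putting the pieces together, a minimal sketch of the whole pipeline following the steps above (it assumes corpus is a tm corpus you have already built):
library(tm)
library(textstem)
# corpus is assumed to be an existing tm corpus
corpus <- tm_map(corpus, lemmatize_strings)   # lemmatize each document's text
corpus <- tm_map(corpus, PlainTextDocument)   # restore PlainTextDocument objects
dtm <- DocumentTermMatrix(corpus)             # now this works as expected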
Hope this helps!

Maybe stemming is enough for you? Typical natural language processing tasks make do with stemmed texts. You can find several packages in the CRAN Task View on Natural Language Processing: http://cran.r-project.org/web/views/NaturalLanguageProcessing.html
If you really do require something more complex, there are specialized solutions based on mapping sentences to neural nets. As far as I know, these require massive amounts of training data. There is a lot of open software created and made available by the Stanford NLP Group.
If you really want to dig into the topic, you can go through the event archives linked in the same Stanford NLP Group publications section. There are some books on the topic as well.
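For example, a quick stem with the SnowballC package (one of the packages listed in that Task View). Note how stemming leaves "ran" untouched, which is exactly the difference from lemmatization discussed above; the output shown is what I would expect from the Porter stemmer:
library(SnowballC)
wordStem(c("run", "ran", "running"), language = "en")
## [1] "run" "ran" "run"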

I think the answers are a bit outdated here. You should be using the R package udpipe now - available at https://CRAN.R-project.org/package=udpipe - see https://github.com/bnosac/udpipe or the docs at https://bnosac.github.io/udpipe/en
Notice the difference between the word meeting (NOUN) and the word meet (VERB) in the following example when doing lemmatisation and when doing stemming, and the annoying screwing up of the word 'someone' to 'someon' when doing stemming.
library(udpipe)
x <- c(doc_a = "In our last meeting, someone said that we are meeting again tomorrow",
       doc_b = "It's better to be good at being the best")
anno <- udpipe(x, "english")
anno[, c("doc_id", "sentence_id", "token", "lemma", "upos")]
#> doc_id sentence_id token lemma upos
#> 1 doc_a 1 In in ADP
#> 2 doc_a 1 our we PRON
#> 3 doc_a 1 last last ADJ
#> 4 doc_a 1 meeting meeting NOUN
#> 5 doc_a 1 , , PUNCT
#> 6 doc_a 1 someone someone PRON
#> 7 doc_a 1 said say VERB
#> 8 doc_a 1 that that SCONJ
#> 9 doc_a 1 we we PRON
#> 10 doc_a 1 are be AUX
#> 11 doc_a 1 meeting meet VERB
#> 12 doc_a 1 again again ADV
#> 13 doc_a 1 tomorrow tomorrow NOUN
#> 14 doc_b 1 It it PRON
#> 15 doc_b 1 's be AUX
#> 16 doc_b 1 better better ADJ
#> 17 doc_b 1 to to PART
#> 18 doc_b 1 be be AUX
#> 19 doc_b 1 good good ADJ
#> 20 doc_b 1 at at SCONJ
#> 21 doc_b 1 being be AUX
#> 22 doc_b 1 the the DET
#> 23 doc_b 1 best best ADJ
lemmatisation <- paste.data.frame(anno, term = "lemma",
                                  group = c("doc_id", "sentence_id"))
lemmatisation
#> doc_id sentence_id
#> 1 doc_a 1
#> 2 doc_b 1
#> lemma
#> 1 in we last meeting , someone say that we be meet again tomorrow
#> 2 it be better to be good at be the best
library(SnowballC)
tokens <- strsplit(x, split = "[[:space:][:punct:]]+")
stemming <- lapply(tokens, FUN = function(x) wordStem(x, language = "en"))
stemming
#> $doc_a
#> [1] "In" "our" "last" "meet" "someon" "said"
#> [7] "that" "we" "are" "meet" "again" "tomorrow"
#>
#> $doc_b
#> [1] "It" "s" "better" "to" "be" "good" "at" "be"
#> [9] "the" "best"

Lemmatization can be done easily in R with the textstem package.
Steps are:
1) Install textstem
2) Load the package by
library(textstem)
3) stem_word=lemmatize_words(word, dictionary = lexicon::hash_lemmas)
where stem_word is the result of lemmatization and word is the input word.
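For example, a small sketch combining the steps above (the input vector is just illustrative):
library(textstem)
word <- c("run", "ran", "running")
stem_word <- lemmatize_words(word, dictionary = lexicon::hash_lemmas)
stem_word
## [1] "run" "run" "run"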

Related

How can I separate words in a corpus according to their POS?

I’m exploring a textual corpus and I would like to be able to separate words according to their grammatical type, for example to consider only verbs and nouns.
I use spacyr to do lemmatization with the spacy_parse() function, and I have seen in the quanteda reference (https://quanteda.io/reference/as.tokens.html) that there is an as.tokens() function that lets me build a tokens object from the result of spacy_parse().
as.tokens(
  x,
  concatenator = "/",
  include_pos = c("none", "pos", "tag"),
  use_lemma = FALSE,
  ...
)
This way, I can get back something that looks like this (text is in French):
etu1_repres_1 :
[1] "OK/PROPN" ",/PUNCT" "déjà/ADV" ",/PUNCT" "je/PRON" "pense/VERB" "que/SCONJ"
[8] "je/PRON" "être/AUX" "influencer/VERB" "de/ADP" "par/ADP"
Let’s say I would like to separate the tokens and keep only tokens of type PRON and VERB.
Q1: How can I separate them from the other tokens to keep only:
etu1_repres_1 :
[1] "je/PRON" "pense/VERB" "je/PRON" "influencer/VERB"
Q2: How can I remove the "/PRON" or "/VERB" part of each token so that I can build a document-feature matrix with only the lemmas?
Thanks a lot for helping,
Gabriel
library("quanteda")
#> Package version: 3.2.1
#> Unicode version: 14.0
#> ICU version: 70.1
#> Parallel computing: 10 of 10 threads used.
#> See https://quanteda.io for tutorials and examples.
toks <-
  as.tokens(list(etu1_repres_1 = c("OK/PROPN", ",/PUNCT", "déjà/ADV", ",/PUNCT",
                                   "je/PRON", "pense/VERB", "que/SCONJ", "je/PRON",
                                   "être/AUX", "influencer/VERB", "de/ADP", "par/ADP")))
# part 1
toks2 <- tokens_keep(toks, c("*/PRON", "*/VERB"))
toks2
#> Tokens consisting of 1 document.
#> etu1_repres_1 :
#> [1] "je/PRON" "pense/VERB" "je/PRON" "influencer/VERB"
# part 2
toks3 <- tokens_split(toks2, "/") |>
  tokens_remove(c("PRON", "VERB"))
toks3
#> Tokens consisting of 1 document.
#> etu1_repres_1 :
#> [1] "je" "pense" "je" "influencer"
dfm(toks3)
#> Document-feature matrix of: 1 document, 3 features (0.00% sparse) and 0 docvars.
#> features
#> docs je pense influencer
#> etu1_repres_1 2 1 1
Created on 2022-08-19 by the reprex package (v2.0.1)
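For completeness, here is a sketch of how a tokens object like toks could come straight from a spacy_parse() result rather than being typed by hand. This assumes spaCy is installed with a French model (fr_core_news_sm is just an assumed example) and the text is illustrative, so treat it as untested scaffolding:
library(spacyr)
library(quanteda)
spacy_initialize(model = "fr_core_news_sm")              # assumed French model
txt <- c(etu1_repres_1 = "je pense que je suis influencé")  # illustrative French text
parsed <- spacy_parse(txt, pos = TRUE, lemma = TRUE)
toks <- as.tokens(parsed, include_pos = "pos", use_lemma = TRUE)
tokens_keep(toks, c("*/PRON", "*/VERB"))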

Extract a 100-Character Window around Keywords in Text Data with R (Quanteda or Tidytext Packages)

This is my first time asking a question on here, so I hope I don't miss any crucial parts. I want to perform sentiment analysis on windows of speeches around certain keywords. My dataset is a large csv file containing a number of speeches, but I'm only interested in the sentiment of the words immediately surrounding certain key words.
I was told that the quanteda package in R would likely be my best bet for finding such a function, but I've been unsuccessful in locating it so far. If anyone knows how to do such a task, it would be greatly appreciated!
Reprex (I hope?) below:
speech = c("This is the first speech. Many words are in this speech, but only few are relevant for my research question. One relevant word, for example, is the word stackoverflow. However there are so many more words that I am not interested in assessing the sentiment of", "This is a second speech, much shorter than the first one. It still includes the word of interest, but at the very end. stackoverflow.", "this is the third speech, and this speech does not include the word of interest so I'm not interested in assessing this speech.")
data <- data.frame(id = 1:3,
                   speechContent = speech)
I'd suggest using tokens_select() with the window argument set to a range of tokens surrounding your target terms.
To take your example, if "stackoverflow" is the target term, and you want to measure sentiment in the +/- 10 tokens around that, then this would work:
library("quanteda")
## Package version: 3.2.1
## Unicode version: 13.0
## ICU version: 69.1
## Parallel computing: 8 of 8 threads used.
## See https://quanteda.io for tutorials and examples.
## [CODE FROM ABOVE]
corp <- corpus(data, text_field = "speechContent")
toks <- tokens(corp) %>%
  tokens_select("stackoverflow", window = 10)
toks
## Tokens consisting of 3 documents and 1 docvar.
## text1 :
## [1] "One" "relevant" "word" ","
## [5] "for" "example" "," "is"
## [9] "the" "word" "stackoverflow" "."
## [ ... and 9 more ]
##
## text2 :
## [1] "word" "of" "interest" ","
## [5] "but" "at" "the" "very"
## [9] "end" "." "stackoverflow" "."
##
## text3 :
## character(0)
There are many ways to compute sentiment from this point. An easy one is to apply a sentiment dictionary, e.g.
tokens_lookup(toks, data_dictionary_LSD2015) %>%
  dfm()
## Document-feature matrix of: 3 documents, 4 features (91.67% sparse) and 1 docvar.
## features
## docs negative positive neg_positive neg_negative
## text1 0 1 0 0
## text2 0 0 0 0
## text3 0 0 0 0
Using quanteda:
library(quanteda)
corp <- corpus(data, docid_field = "id", text_field = "speechContent")
x <- kwic(tokens(corp, remove_punct = TRUE),
          pattern = "stackoverflow",
          window = 3)
x
Keyword-in-context with 2 matches.
[1, 29] is the word | stackoverflow | However there are
[2, 24] the very end | stackoverflow |
as.data.frame(x)
docname from to pre keyword post pattern
1 1 29 29 is the word stackoverflow However there are stackoverflow
2 2 24 24 the very end stackoverflow stackoverflow
Now read the help for kwic (use ?kwic in the console) to see what kind of patterns you can use. With tokens() you can specify which data cleaning you want to do before using kwic; in my example I removed the punctuation.
The end result is a data frame with the window before and after the keyword(s), in this example a window of length 3. After that you can run some form of sentiment analysis on the pre and post results (or paste them together first).
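To connect this to the sentiment step, one possible follow-up is to paste the windows back together and score them with the same LSD2015 dictionary used in the first answer. A sketch, assuming the x object created above:
library(quanteda)
windows <- paste(x$pre, x$post)   # one text per keyword match
toks_win <- tokens(windows)
dfm(tokens_lookup(toks_win, data_dictionary_LSD2015))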

Clean spaces in an extracted list of authors packages in R

I face a "strange" issue with a code that I have written in R to extract all the authors of the R packages installed on my computer. Indeed, I try to remove undesirable spaces before and after the commas ( , ) but I can't get the expected clean result using R text cleaning common techniques.
Here is the script for reproduction so that you can see the issue in the final result on your own screen:
library("tools")
pdb<-CRAN_package_db()
subset<-pdb[,c(1,17)]
ipck<-as.vector(installed.packages()[,1])
pdbCleaned <- subset[subset$Package %in% ipck, ]
pdbCleaned$Author
Authors <-gsub("[\r\n]", "", pdbCleaned$Author)
Authors <-gsub("\\[.*?\\]", "", Authors)
Authors <-gsub("\\(.*?\\)", "", Authors)
Authors <-gsub("<.*>", "", Authors)
Authors <-gsub("))", "", Authors)
Authors <-gsub("(?<=[\\s])\\s*|^\\s+|\\s+$", " ", Authors)
Authors
Here is an attempt at a solution with stringr. Note that you don't have to match your installed packages against the entire CRAN db; you can just pull the Author field from your installed packages.
I just use two regexes: one to remove anything wrapped in [], (), or <>, which is often things like [aut] or <email@domain>, and one to remove any spaces surrounding , or and. Note that depending on the packages you have installed, this will work varyingly well. You will have to tweak it for the packages you have; for example, I might want to remove double commas ,, because of the ada package. Other packages just have a lot of random text in their Author field, which makes it hard to handle automatically, such as the akima package. But as a first pass this should do the trick.
library(tidyverse)
authors <- installed.packages(fields = "Author") %>%
  as_tibble() %>%
  select(package = Package, author = Author)
authors %>%
  mutate(
    author = str_replace_all(author, "(\\[|\\(|<).*(\\]|\\)|>)", ""),
    author = str_replace_all(author, "[:space:]*(,|and)[:space:]*", ","),
    author = str_trim(author)
  )
#> # A tibble: 620 x 2
#> package author
#> <chr> <chr>
#> 1 abind Tony Plate,Richard Heiberger
#> 2 actuar Vincent Goulet,Sébastien Auclair,Christophe Dutang,Xavier Mi~
#> 3 ada Mark Culp,Kjell Johnson,,George Michailidis
#> 4 AER Christian Kleiber,Achim Zeileis
#> 5 AGD Stef van Buuren
#> 6 agricolae Felipe de Mendiburu
#> 7 akima "Hiroshi Akima,Albrecht Gebhardt,bicubic*\n functions),Th~
#> 8 alr3 Sanford Weisberg
#> 9 alr4 Sanford Weisberg
#> 10 amap Antoine Lucas
#> # ... with 610 more rows
Created on 2018-03-14 by the reprex package (v0.2.0).
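If you then want one row per author rather than a comma-separated string, a possible follow-up is to split the cleaned column with tidyr::separate_rows(). A sketch, reusing the authors tibble and the regexes from above:
library(tidyverse)  # as loaded above
authors_clean <- authors %>%
  mutate(
    author = str_replace_all(author, "(\\[|\\(|<).*(\\]|\\)|>)", ""),
    author = str_replace_all(author, "[:space:]*(,|and)[:space:]*", ","),
    author = str_trim(author)
  ) %>%
  separate_rows(author, sep = ",") %>%   # one row per author name
  filter(author != "")                   # drop empties left by double commas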

trying to create a value of strings using the | or operator

I am trying to scrape a website link. So far I have downloaded the text and set it as a dataframe. I have the following:
keywords <- c(credit | model)
text_df <- as.data.frame.table(text_df)
text_df %>%
filter(str_detect(text, keywords))
where credit and model are two values I want to search the website for, i.e. return rows with the word credit or model in them.
I get the following error
Error in filter_impl(.data, dots) : object 'credit' not found
The code only returns the results with the word "model" in and ignores the word "credit".
How can I go about returning all results containing either the word "credit" or "model"?
My plan is to have keywords <- c(credit | model | more_key_words | something_else | many values)
Thanks in advance.
EDIT:
text_df:
Var 1 text
1 Here is some credit information
2 Some text which does not expalin any keywords but messy <li> text9182edj </i>
3 This line may contain the keyword model
4 another line which contains nothing of use
So I am trying to extract just rows 1 and 3.
I think the issue is you need to pass a string as an argument to str_detect. To check for "credit" or "model" you can paste them into a single string separated by |.
library(tidyverse)
library(stringr)
text_df <- read_table("Var 1 text
1 Here is some credit information
2 Some text which does not expalin any keywords but messy <li> text9182edj </i>
3 This line may contain the keyword model
4 another line which contains nothing of use")
keywords <- c("credit", "model")
any_word <- paste(keywords, collapse = "|")
text_df %>% filter(str_detect(text, any_word))
#> # A tibble: 2 x 3
#> Var `1` text
#> <int> <chr> <chr>
#> 1 1 Here is some credit information
#> 2 3 This line may contain the keyword model
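If you want to match whole words only (so that "model" does not also match a word like "remodel"), you could wrap the same pattern in word boundaries; a small variation on the code above:
any_word <- paste0("\\b(", paste(keywords, collapse = "|"), ")\\b")
text_df %>% filter(str_detect(text, any_word))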
OK, I have checked it and I think it will not work your way, as you must use the or operator | inside filter(), not inside str_detect().
So it would work this way:
keywords <- c("virg", "tos")
library(dplyr)
library(stringr)
iris %>%
filter(str_detect(Species, keywords[1]) | str_detect(Species, keywords[2]))
As keywords[1], etc., you have to specify each "keyword" from this variable.
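With more than a couple of keywords, writing each str_detect() call by hand gets tedious. One way to build the same filter programmatically (a sketch that is equivalent to the line above for these two keywords):
library(dplyr)
library(stringr)
keywords <- c("virg", "tos")
iris %>%
  filter(Reduce(`|`, lapply(keywords, function(k) str_detect(Species, k))))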
I would recommend staying away from regex when you're dealing with words. There are packages tailored for your particular task that you can use. Try, for example, the following
library(corpus)
text <- readLines("http://norvig.com/big.txt") # sherlock holmes
terms <- c("watson", "sherlock holmes", "elementary")
text_locate(text, terms)
## text before instance after
## 1 1 …Book of The Adventures of Sherlock Holmes
## 2 27 Title: The Adventures of Sherlock Holmes
## 3 40 … EBOOK, THE ADVENTURES OF SHERLOCK HOLMES ***
## 4 50 SHERLOCK HOLMES
## 5 77 To Sherlock Holmes she is always the woman. I…
## 6 85 …," he remarked. "I think, Watson , that you have put on seve…
## 7 89 …t a trifle more, I fancy, Watson . And in practice again, I …
## 8 145 …ere's money in this case, Watson , if there is nothing else.…
## 9 163 …friend and colleague, Dr. Watson , who is occasionally good …
## 10 315 … for you. And good-night, Watson ," he added, as the wheels …
## 11 352 …s quite too good to lose, Watson . I was just balancing whet…
## 12 422 …as I had pictured it from Sherlock Holmes ' succinct description, but…
## 13 504 "Good-night, Mister Sherlock Holmes ."
## 14 515 …t it!" he cried, grasping Sherlock Holmes by either shoulder and loo…
## 15 553 "Mr. Sherlock Holmes , I believe?" said she.
## 16 559 "What!" Sherlock Holmes staggered back, white with…
## 17 565 …tter was superscribed to " Sherlock Holmes , Esq. To be left till call…
## 18 567 "MY DEAR MR. SHERLOCK HOLMES ,--You really did it very w…
## 19 569 …est to the celebrated Mr. Sherlock Holmes . Then I, rather imprudentl…
## 20 571 …s; and I remain, dear Mr. Sherlock Holmes ,
## ⋮ (189 rows total)
Note that this matches the term regardless of the case.
For your specific use case, do
ix <- text_detect(text, terms)
or
matches <- text_subset(text, terms)

Creating POS tags for single words/tokens in R

I am looking for a way to create POS tags for single words/tokens from a list I have in R. I know that the accuracy will decrease if I do it for single tokens instead of sentences, but the data I have are "delete edits" from Wikipedia, and people mostly delete single, unconnected words instead of whole sentences. I have seen this question a few times for Python, but I haven't found a solution for it in R yet.
My data will look somewhat like this:
Tokens <- list(c("1976","green","Normandy","coast","[", "[", "template" "]","]","Fish","visting","England","?"))
And ideally, I would like to have something like this returned:
1976 CD
green JJ
Normandy NN
coast NN
[ x
[ x
template NN
] x
] x
Fish NN
visiting VBG
England NN
? x
I found some websites doing this online, but I doubt that they are running anything in R. They also specifically state NOT to use them on single words/tokens.
My question thus: Is it possible to do this in R with reasonable accuracy? What would the code look like so that it does not rely on sentence structure? Would it be easier to just compare the lists against a huge tagged dictionary?
In general, there is no decent POS tagger in native R, and all possible solutions rely on outside libraries. As one such solution, you can try our package spacyr, which uses spaCy as the backend. It's not on CRAN yet, but it will be soon.
https://github.com/kbenoit/spacyr
The sample code is like this:
library(spacyr)
spacy_initialize()
Tokens <- c("1976","green","Normandy","coast","[", "[", "template", "]","]",
"Fish","visting","England","?")
spacy_parse(Tokens, tag = TRUE)
and the output is like this:
doc_id sentence_id token_id token lemma pos tag entity
1 text1 1 1 1976 1976 NUM CD DATE_B
2 text2 1 1 green green ADJ JJ
3 text3 1 1 Normandy normandy PROPN NNP ORG_B
4 text4 1 1 coast coast NOUN NN
5 text5 1 1 [ [ PUNCT -LRB-
6 text6 1 1 [ [ PUNCT -LRB-
7 text7 1 1 template template NOUN NN
8 text8 1 1 ] ] PUNCT -RRB-
9 text9 1 1 ] ] PUNCT -RRB-
10 text10 1 1 Fish fish NOUN NN
11 text11 1 1 visting vist VERB VBG
12 text12 1 1 England england PROPN NNP GPE_B
13 text13 1 1 ? ? PUNCT .
Although the package can do more, you can find what you need in tag field.
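For instance, to keep just the tokens and their Penn Treebank tags, a small sketch reusing the objects already defined above:
parsed <- spacy_parse(Tokens, tag = TRUE)
parsed[, c("token", "tag")]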
NOTE (2017-05-20):
Now the spacyr package is on CRAN, but that version has some issues with non-ASCII characters. We recognized the issue after the CRAN submission and resolved it in the version on GitHub. If you are planning to use it for German texts, please install the latest master from GitHub.
devtools::install_github("kbenoit/spacyr", build_vignettes = FALSE)
This revision will be incorporated into the CRAN package in the next update.
NOTE2:
There are detailed instructions for installing spaCy and spacyr on Windows and Mac.
Windows:
https://github.com/kbenoit/spacyr/blob/master/inst/doc/WINDOWS.md
Mac:
https://github.com/kbenoit/spacyr/blob/master/inst/doc/MAC.md
Here are the steps I took to make amatsuo_net's suggestion work for me:
Installing spaCy and the English language model for Anaconda:
Open Anaconda prompt as Admin
execute:
activate py36
conda config --add channels conda-forge
conda install spacy
python -m spacy link en_core_web_sm en
Using the wrapper from RStudio:
install.packages("fastmatch")
install.packages("RcppParallel")
library(fastmatch)
library(RcppParallel)
devtools::install_github("kbenoit/spacyr", build_vignettes = FALSE)
library(spacyr)
spacy_initialize(condaenv = "py36")
Tokens <- c("1976","green","Normandy","coast","[", "[", "template", "]","]","Fish","visting","England","?");Tokens
spacy_parse(Tokens, tag = TRUE)
