In the following code, my aim is to reduce words with the same stem to a single form. For example, kompis in Swedish refers to a friend in English, and words with a similar root include kompisar and kompiserna.
rm(list=ls())
Sys.setlocale("LC_ALL","sv_SE.UTF-8")
library(tm)
library(SnowballC)
kompis <- c("kompisar", "kompis", "kompiserna")
stem_doc <- stemDocument(kompis, language="swedish")
stem_doc
1] "kompis" "kompis" "kompis"
I created a sample text containing the words kompis, kompisar, and kompiserna.
Then I did some preprocessing on the corpus with the following code:
text <- c("TV och vara med kompisar.",
"Jobba på kompis huset",
"Ta det lugnt, umgås med kompisar.",
"Umgås med kompisar, vänner ",
"kolla anime med kompiserna")
corpus.prep <- Corpus(VectorSource(text), readerControl =list(reader=readPlain, language="swe"))
corpus.prep <- tm_map(corpus.prep, PlainTextDocument)
corpus.prep <- tm_map(corpus.prep, stemDocument,language = "swedish")
head(content(corpus.prep[[1]]))
The results are as follows. However, they still contain the original word forms rather than the common stem kompis:
1] "TV och vara med kompisar."
2] "Jobba på kompi huset"
3] "Ta det lugnt, umgå med kompisar."
4] "Umgås med kompisar, vänner"
5] "kolla anim med kompiserna"
Do you know how to fix it?
You are almost there, but using PlainTextDocument is interfering with your goal.
The following code will return your expected result. I'm using removePunctuation because otherwise the stemming will not work on the words at the end of a sentence. You will also see warning messages after both tm_map calls; you can ignore these.
corpus.prep <- Corpus(VectorSource(text), readerControl =list(reader=readPlain, language="swe"))
corpus.prep <- tm_map(corpus.prep, removePunctuation)
corpus.prep <- tm_map(corpus.prep, stemDocument, language = "swedish")
head(content(corpus.prep))
[1] "TV och var med kompis" "Jobb på kompis huset" "Ta det lugnt umgås med kompis" "Umgås med kompis vänn"
[5] "koll anim med kompis"
For this kind of work I tend to use quanteda. It has better support and works a lot better than tm.
library(quanteda)
# remove_punct not really needed as quanteda treats the "." as a separate token.
my_dfm <- dfm(text, remove_punct = TRUE)
dfm_wordstem(my_dfm, language = "swedish")
Document-feature matrix of: 5 documents, 15 features (69.3% sparse).
5 x 15 sparse Matrix of class "dfm"
features
docs tv och var med kompis jobb på huset ta det lugnt umgås vänn koll anim
text1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
text2 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0
text3 0 0 0 1 1 0 0 0 1 1 1 1 0 0 0
text4 0 0 0 1 1 0 0 0 0 0 0 1 1 0 0
text5 0 0 0 1 1 0 0 0 0 0 0 0 0 1 1
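For newer quanteda releases (3.x and later), passing a character vector straight to dfm() is deprecated; below is a minimal sketch of the same pipeline with an explicit tokens() step, reusing the text vector from above:
library(quanteda)
# tokenize first, then build the dfm and stem its features
toks <- tokens(text, remove_punct = TRUE)
my_dfm <- dfm(toks)
dfm_wordstem(my_dfm, language = "swedish")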
Using tidytext, see issue #17
library(dplyr)
library(tidytext)
library(SnowballC)
txt <- c("TV och vara med kompisar.",
"Jobba på kompis huset",
"Ta det lugnt, umgås med kompisar.",
"Umgås med kompisar, vänner ",
"kolla anime med kompiserna")
data_frame(txt = txt) %>%
  unnest_tokens(word, txt) %>%
  mutate(word = wordStem(word, "swedish"))
The wordStem() function is from the SnowballC package, which supports multiple languages; see getStemLanguages().
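To collapse those stems into counts, here is a short follow-up sketch (tibble() is the current replacement for the now soft-deprecated data_frame()):
library(dplyr)
library(tidytext)
library(SnowballC)
# tokenize, stem, then count each stem across all five sentences
tibble(txt = txt) %>%
  unnest_tokens(word, txt) %>%
  mutate(word = wordStem(word, "swedish")) %>%
  count(word, sort = TRUE)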
I have been using the awesome quanteda library for text analysis lately and it has been quite a joy. Recently I stumbled on a task: using a dictionary that relates words to a numeric sentiment score to compute a per-document measure called NetSentScore, which is calculated in the following manner:
NetSentScore per document = sum(Positive_wordscore) + sum(Negative_wordscore)
I have the following dictionary:
ScoreDict<- tibble::tibble(
score= c(-5,-9,1,8,9,-10),
word = c("bad", "horrible", "open","awesome","gorgeous","trash")
)
My corpus:
text <- c("this is a bad movie very bad",
          "horrible movie, just awful",
          "im open to new dreams",
          "awesome place i loved it",
          "she is gorgeous",
          "that is trash")
By definition quanteda will not allow numeric data in a dictionary, but I can have this:
> text %>%
+ corpus() %>%
+ tokens(remove_punct = TRUE) %>%
+ tokens_remove(stopwords("en")) %>%
+ dfm()
Document-feature matrix of: 6 documents, 14 features (82.14% sparse) and 0 docvars.
features
docs bad movie horrible just awful im open new dreams awesome
text1 2 1 0 0 0 0 0 0 0 0
text2 0 1 1 1 1 0 0 0 0 0
text3 0 0 0 0 0 1 1 1 1 0
text4 0 0 0 0 0 0 0 0 0 1
text5 0 0 0 0 0 0 0 0 0 0
text6 0 0 0 0 0 0 0 0 0 0
[ reached max_nfeat ... 4 more features ]
This gives me the number of times each word was found in each document. I only need to "join" or "merge" this with my dictionary so that I have the score for each word and can then compute the NetSentScore. Is there a way to do this in quanteda?
Please keep in mind that I have a quite massive corpus, so converting my dfm to a data frame will exhaust the RAM: I have over 500k documents and approximately 800 features.
To illustrate, the NetSentScore of text1 will be:
2*-5 + 0 = -10, because the word bad appears two times and, according to the dictionary, it has a score of -5.
As @stomper suggests, you can do this with the quanteda.sentiment package, by setting the numeric values as "valences" for the dictionary. Here's how to do it.
This ought to work on 500k documents but of course this will depend on your machine's capacity.
library("quanteda")
#> Package version: 3.2.4
#> Unicode version: 14.0
#> ICU version: 70.1
#> Parallel computing: 10 of 10 threads used.
#> See https://quanteda.io for tutorials and examples.
library("quanteda.sentiment")
#>
#> Attaching package: 'quanteda.sentiment'
#> The following object is masked from 'package:quanteda':
#>
#> data_dictionary_LSD2015
dict <- dictionary(list(
sentiment = c("bad", "horrible", "open", "awesome", "gorgeous", "trash")
))
valence(dict) <- list(
sentiment = c(bad = -5,
horrible = -9,
open = 1,
awesome = 8, gorgeous = 9,
trash = -10)
)
print(dict)
#> Dictionary object with 1 key entry.
#> Valences set for keys: sentiment
#> - [sentiment]:
#> - bad, horrible, open, awesome, gorgeous, trash
text <- c("this is a bad movie very bad",
"horrible movie, just awful",
"im open to new dreams",
"awesome place i loved it",
"she is gorgeous",
"that is trash")
Now to compute the document scores, you use textstat_valence(), but you set the normalization to "none" in order to sum the valences rather than average them. Normalization is the default because raw sums are affected by documents having different lengths, but as this package is still at a developmental stage, it's easy to imagine that other choices might be preferable to the default.
textstat_valence(tokens(text), dictionary = dict, normalization = "none")
#> doc_id sentiment
#> 1 text1 -10
#> 2 text2 -9
#> 3 text3 1
#> 4 text4 8
#> 5 text5 9
#> 6 text6 -10
Created on 2023-01-11 with reprex v2.0.2
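If you prefer to stay within base quanteda, here is a minimal sketch (reusing text and ScoreDict from the question) that keeps only the dictionary words, applies the scores as per-feature weights via dfm_weight(), and sums each row. rowSums() works directly on the sparse dfm, which matters for a 500k-document corpus:
library(quanteda)
toks <- tokens(text, remove_punct = TRUE)
d <- dfm(toks)
# keep the dictionary words only, then multiply each feature by its score;
# dictionary words absent from the corpus are simply ignored (possibly with a warning)
d_scored <- dfm_weight(dfm_select(d, pattern = ScoreDict$word),
                       weights = setNames(ScoreDict$score, ScoreDict$word))
rowSums(d_scored)
For text1 this gives -10, matching both the hand calculation in the question and the textstat_valence() output above.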
Below is my code for creating bigrams from text data. The output I am getting is fine, except that I need the field names to have an underscore so that I can use them as variables in a model.
text<- c("Since I love to travel, this is what I rely on every time.",
"I got the rewards card for the no international transaction fee",
"I got the rewards card mainly for the flight perks",
"Very good card, easy application process, and no international
transaction fee",
"The customer service is outstanding!",
"My wife got the rewards card for the gift cards and international
transaction fee.She loves it")
df<- data.frame(text)
library(tm)
corpus<- Corpus(DataframeSource(df))
corpus<- tm_map(corpus, content_transformer(tolower))
corpus<- tm_map(corpus, removePunctuation)
corpus<- tm_map(corpus, removeWords, stopwords("english"))
corpus<- tm_map(corpus, stripWhitespace)
BigramTokenizer <- function(x)
  unlist(lapply(ngrams(words(x), 2), paste, collapse = " "), use.names = FALSE)
dtm<- DocumentTermMatrix(corpus, control= list(tokenize= BigramTokenizer))
sparse<- removeSparseTerms(dtm,.80)
dtm2<- as.matrix(sparse)
dtm2
Here is what the output looks like:
     Terms
Docs got rewards international transaction rewards card transaction fee
   1           0                         0            0               0
   2           1                         1            1               1
   3           1                         0            1               0
   4           0                         1            0               1
   5           0                         0            0               0
   6           1                         1            1               0
How do I make the field names like got_rewards instead of got rewards?
This is not really a tm-specific question, I guess. Anyway, you can set collapse = "_" in your tokenizer (see the sketch after the output below) or modify the column names after the fact like so:
colnames(dtm2) <- gsub(" ", "_", colnames(dtm2), fixed = TRUE)
dtm2
     Terms
Docs got_rewards international_transaction rewards_card transaction_fee
   1           0                         0            0               0
   2           1                         1            1               1
   3           1                         0            1               0
   4           0                         1            0               1
   5           0                         0            0               0
   6           1                         1            1               0
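For completeness, the collapse = "_" route mentioned above would look like this; a sketch reusing the corpus and the removeSparseTerms threshold from the question:
# same tokenizer as before, but join the two words with "_" instead of a space
BigramTokenizer2 <- function(x)
  unlist(lapply(ngrams(words(x), 2), paste, collapse = "_"), use.names = FALSE)
dtm_ <- DocumentTermMatrix(corpus, control = list(tokenize = BigramTokenizer2))
dtm2_ <- as.matrix(removeSparseTerms(dtm_, .80))
dtm2_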
I am using the quanteda package by Ken Benoit and Paul Nulty to work with textual data.
My corpus contains texts with full German sentences, and I want to work only with the nouns of each text. One trick in German is to keep only the upper-case words, but this fails at the beginning of a sentence.
Text1 <- c("Halle an der Saale ist die grünste Stadt Deutschlands")
Text2 <- c("In Hamburg regnet es immer, das ist also so wie in London.")
Text3 <- c("James Bond trinkt am liebsten Martini")
myCorpus <- corpus(c(Text1, Text2, Text3))
metadoc(myCorpus, "language") <- "german"
summary(myCorpus, showmeta = T)
myDfm <- dfm(myCorpus, tolower = F, remove_numbers = T,
remove = stopwords("german"), remove_punct = TRUE,
remove_separators = T)
topfeatures(myDfm, 20)
From this minimal example, I would like to retrieve:
"Halle", "Saale", "Stadt", "Deutschland", "Hamburg", "London", "Martini", "James", "Bond".
I assume I need a dictionary that defines verbs/nouns/etc. and the proper names (James Bond, Hamburg, etc.), or is there a built-in function/dictionary?
Bonus question: does the solution work for English texts too?
You need some help from a part-of-speech tagger. Fortunately there is a great one, with a German language model, in the form of spaCy, and a package we wrote as a wrapper around it, spacyr. Installation instructions are at the spacyr page.
This code will do what you want:
txt <- c("Halle an der Saale ist die grünste Stadt Deutschlands",
"In Hamburg regnet es immer, das ist also so wie in London.",
"James Bond trinkt am liebsten Martini")
library("spacyr")
spacy_initialize(model = "de")
txtparsed <- spacy_parse(txt, tag = TRUE, pos = TRUE)
head(txtparsed, 20)
#    doc_id sentence_id token_id        token        lemma   pos   tag entity
# 1   text1           1        1        Halle        halle PROPN    NE  LOC_B
# 2   text1           1        2           an           an   ADP  APPR  LOC_I
# 3   text1           1        3          der          der   DET   ART  LOC_I
# 4   text1           1        4        Saale        saale PROPN    NE  LOC_I
# 5   text1           1        5          ist          ist   AUX VAFIN
# 6   text1           1        6          die          die   DET   ART
# 7   text1           1        7      grünste      grünste   ADJ  ADJA
# 8   text1           1        8        Stadt        stadt  NOUN    NN
# 9   text1           1        9 Deutschlands deutschlands PROPN    NE  LOC_B
# 10  text2           1        1           In           in   ADP  APPR
# 11  text2           1        2      Hamburg      hamburg PROPN    NE  LOC_B
# 12  text2           1        3       regnet       regnet  VERB VVFIN
# 13  text2           1        4           es           es  PRON  PPER
# 14  text2           1        5        immer        immer   ADV   ADV
# 15  text2           1        6            ,            , PUNCT    $,
# 16  text2           1        7          das          das  PRON   PDS
# 17  text2           1        8          ist          ist   AUX VAFIN
# 18  text2           1        9         also         also   ADV   ADV
# 19  text2           1       10           so           so   ADV   ADV
# 20  text2           1       11          wie          wie  CONJ KOKOM
(nouns <- with(txtparsed, subset(token, pos == "NOUN")))
# [1] "Stadt"
(propernouns <- with(txtparsed, subset(token, pos == "PROPN")))
# [1] "Halle" "Saale" "Deutschlands" "Hamburg" "London"
# [6] "James" "Bond" "Martini"
Here you can see that the nouns you wanted are marked in the simpler pos field as proper nouns ("PROPN"). The tag field is a more detailed, German-language tagset that you could also select from.
The lists of selected nouns can then be used in quanteda:
library("quanteda")
myDfm <- dfm(txt, tolower = FALSE, remove_numbers = TRUE,
remove = stopwords("german"), remove_punct = TRUE)
head(myDfm)
# Document-feature matrix of: 3 documents, 14 features (66.7% sparse).
# (showing first 3 documents and first 6 features)
# features
# docs Halle Saale grünste Stadt Deutschlands Hamburg
# text1 1 1 1 1 1 0
# text2 0 0 0 0 0 1
# text3 0 0 0 0 0 0
head(dfm_select(myDfm, pattern = propernouns))
# Document-feature matrix of: 3 documents, 8 features (66.7% sparse).
# (showing first 3 documents and first 6 features)
# features
# docs Halle Saale Deutschlands Hamburg London James
# text1 1 1 1 0 0 0
# text2 0 0 0 1 1 0
# text3 0 0 0 0 0 1
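On the bonus question: the same approach works for English, provided an English spaCy model is installed. Here is a minimal sketch; the model name en_core_web_sm is an assumption about your local spaCy installation, and you would need to call spacy_finalize() first if a German model is already initialized in the session:
library("spacyr")
spacy_initialize(model = "en_core_web_sm")   # assumes this English model is downloaded
parsed <- spacy_parse("James Bond drinks his Martini in London.", pos = TRUE)
# keep only common and proper nouns
with(parsed, token[pos %in% c("NOUN", "PROPN")])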
I have a column in my data frame (df) as follows:
> people = df$people
> people[1:3]
[1] "Christian Slater, Tara Reid, Stephen Dorff, Frank C. Turner"
[2] "Ice Cube, Nia Long, Aleisha Allen, Philip Bolden"
[3] "John Travolta, Uma Thurman, Vince Vaughn, Cedric the Entertainer"
The column has 4k+ unique first/last/nick names, with a comma-separated list of full names in each row as shown above. I would like to create a DocumentTermMatrix for this column in which full-name matches are found and only the most frequently occurring names are used as columns. I have tried the following code:
> people_list = strsplit(people, ", ")
> corp = Corpus(VectorSource(people_list))
> dtm = DocumentTermMatrix(corp, people_dict)
where people_dict is a list of the most commonly occurring people (~150 full names) from people_list, as follows:
> people_dict[1:3]
[[1]]
[1] "Christian Slater"
[[2]]
[1] "Tara Reid"
[[3]]
[1] "Stephen Dorff"
However, the DocumentTermMatrix function does not seem to be using people_dict at all, because I end up with far more columns than there are entries in people_dict. Also, I think the DocumentTermMatrix function is splitting each name string into multiple strings. For example, "Danny Devito" becomes one column for "Danny" and another for "Devito".
> inspect(actors_dtm[1:5,1:10])
<<DocumentTermMatrix (documents: 5, terms: 10)>>
Non-/sparse entries: 0/50
Sparsity : 100%
Maximal term length: 9
Weighting : term frequency (tf)
Terms
Docs 'g. 'jojo' 'ole' 'piolin' 'rampage' 'spank' 'stevvi' a.d. a.j. aaliyah
1 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0 0 0
5 0 0 0 0 0 0 0 0 0 0
I have read through all the tm documentation I can find, and I have spent hours searching Stack Overflow for a solution. Please help!
The default tokenizer splits text into individual words. You need to provide a custom tokenizer function:
commasplit_tokenizer <- function(x)
unlist(strsplit(as.character(x), ", "))
Note that you do not separate the actors before creating the corpus.
people <- character(3)
people[1] <- "Christian Slater, Tara Reid, Stephen Dorff, Frank C. Turner"
people[2] <- "Ice Cube, Nia Long, Aleisha Allen, Philip Bolden"
people[3] <- "John Travolta, Uma Thurman, Vince Vaughn, Cedric the Entertainer"
people_dict <- c("Stephen Dorff", "Nia Long", "Uma Thurman")
The control options didn't work with just Corpus, so I used VCorpus:
corp = VCorpus(VectorSource(people))
dtm = DocumentTermMatrix(corp, control = list(tokenize = commasplit_tokenizer,
                                              dictionary = people_dict,
                                              tolower = FALSE))
All of the options are passed within control: the tokenize function, the dictionary, and tolower = FALSE.
Results:
as.matrix(dtm)
Terms
Docs Nia LOng Stephen Dorff Uma Thurman
1 0 1 0
2 0 0 0
3 0 0 1
I hope this helps
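If you only need the count matrix, a small base-R sketch (reusing the people and people_dict objects above) produces the same document-by-name counts without any tokenizer tricks:
# split each row on ", ", then count only the names listed in people_dict
name_list <- strsplit(people, ", ", fixed = TRUE)
dtm_base <- t(sapply(name_list, function(nm) table(factor(nm, levels = people_dict))))
dtm_base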
I have a data frame with this structure:
#Load lexicon
Lexicon_DF <- read.csv("LexiconFrancais.csv",header=F, sep=";")
The structure of "LexiconFrancais.csv" is like this:
French Translation (Google Translate);Positive;Negative
un dos;0;0
abaque;0;0
abandonner;0;1
abandonné;0;1
abandon;0;1
se calmer;0;0
réduction;0;0
abba;1;0
abbé;0;0
abréger;0;0
abréviation;0;0
> Lexicon_DF
V1 V2 V3
1 French Translation (Google Translate) Positive Negative
2 un dos 0 0
3 abaque 0 0
4 abandonner 0 1
5 abandonné 0 1
6 abandon 0 1
7 se calmer 0 0
8 réduction 0 0
9 abba 1 0
10 abbé 0 0
11 abréger 0 0
12 abréviation 0 0
I am trying to stem the first column of the data frame; to do so, I did:
Lexicon_DF <- SnowballC::wordStem(Lexicon_DF[[1]], language = 'fr')
But after this command, only the first column remains in Lexicon_DF; the other two columns disappear.
> Lexicon_DF <- SnowballC::wordStem(Lexicon_DF[[1]], language = 'fr')
> Lexicon_DF
[1] "French Translation (Google Translate)" "un dos" "abaqu"
[4] "abandon" "abandon" "abandon"
[7] "se calm" "réduct" "abba"
[10] "abbé" "abreg" "abrévi"
How can I do the stemming without losing the other two columns?
Thank you.
You are replacing the whole content of Lexicon_DF with the output of wordStem.
Try this instead:
Lexicon_DF$V1 <-SnowballC::wordStem(Lexicon_DF[[1]], language = 'fr')
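A side note: row 1 of Lexicon_DF is the file's header line ("French Translation (Google Translate)"), because the file was read with header = F. Here is a sketch that reads the file with its header and stems only the first column, keeping the other two intact (assuming the same file layout):
Lexicon_DF <- read.csv("LexiconFrancais.csv", header = TRUE, sep = ";",
                       stringsAsFactors = FALSE)
# overwrite only the first column with its stems; the other columns are untouched
Lexicon_DF[[1]] <- SnowballC::wordStem(Lexicon_DF[[1]], language = 'fr')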