Suppose I am parsing an English corpus with the tm package, and I do the usual cleaning steps.
library(tm)
data("crude")
corpus <- crude  # crude is already a corpus object, so it needs no Corpus() wrapper
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, stemDocument)
corpus <- tm_map(corpus, PlainTextDocument)
# text matrices
tdm <- TermDocumentMatrix(corpus)
dtm <- DocumentTermMatrix(corpus)
How do I identify the words written in a different language than that of the corpus? A similar problem is addressed with Python here, but my research did not produce any interesting results.
This is not a complete solution, but I feel like it might help. I recently had to do something similar where I had to remove words from a corpus with Chinese characters. I ended up using a custom transformation with a regex to remove anything with a non a-z 0-9 character in it.
corpus <- tm_map(corpus, content_transformer(function(s){
gsub(pattern = '[^a-zA-Z0-9\\s]+',
x = s,
replacement = " ",
ignore.case = TRUE,
perl = TRUE)
}))
For example, if there is a Chinese word in there, it gets removed.
gsub(pattern = '[^a-zA-Z0-9\\s]+',
x = 'English 象形字 Chinese',
replacement = "",
ignore.case = TRUE,
perl = TRUE)
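Output: "English  Chinese" (two spaces remain where the word was removed)
It's trickier if you are trying to remove words from a language like Spanish, because only some letters carry an accent while the rest are plain a-z. For example, the pattern below only catches words that have a non a-z 0-9 character in the middle (not at the start or end of the word), so it doesn't work completely, but maybe it's a start.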
gsub(pattern = '[a-zA-Z0-9]+[^a-zA-Z0-9\\s]+[a-zA-Z0-9]+',
x = 'El jalapeño es caliente',
replacement = "",
ignore.case = TRUE,
perl = TRUE)
Output: "El  es caliente" (again with two spaces where the word was removed)
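A token-based variant of the same idea may be easier to reason about than the regex: split on whitespace and drop every token that contains at least one non-ASCII character, wherever it sits in the word. This is only a sketch in base R; drop_non_ascii_words is a hypothetical helper name, not part of tm, and the test strings use \u escapes so they do not depend on the file encoding.

```r
# Hypothetical helper (not part of tm): drop every whitespace-delimited
# token that contains at least one non-ASCII character.
drop_non_ascii_words <- function(x) {
  vapply(x, function(doc) {
    tokens <- strsplit(doc, "[[:space:]]+")[[1]]
    tokens <- tokens[nzchar(tokens)]
    # A token is kept only if all of its code points are below 128 (plain ASCII).
    is_ascii <- vapply(tokens,
                       function(w) isTRUE(all(utf8ToInt(w) < 128L)),
                       logical(1))
    paste(tokens[is_ascii], collapse = " ")
  }, character(1), USE.NAMES = FALSE)
}

spanish <- drop_non_ascii_words("El jalape\u00f1o es caliente")
chinese <- drop_non_ascii_words("English \u8c61\u5f62\u5b57 Chinese")
edges   <- drop_non_ascii_words("\u00e1rbol verde caf\u00e9")
print(c(spanish, chinese, edges))
```

Inside a tm pipeline this could be applied with tm_map(corpus, content_transformer(drop_non_ascii_words)). The same caveat applies as above: foreign words spelled with plain a-z letters are not caught, and English loanwords written with an accent are dropped.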
Hope this helps!
I am running textual and sentiment analysis on multi-lingual text files from the healthcare sector, and I want to remove stopwords from all the languages at once. I don't want to write the name of every language in the code to remove the stopwords. Is there any way I can do it fast?
Here is my code. The total number of files is 596:
files = list.files(path = getwd(), pattern = "txt", all.files = FALSE,
full.names = TRUE, recursive = TRUE)
txt <- list()
for (i in 1:596)
try(
{
txt[[i]] <- readLines(files[i], warn = FALSE)
filename <- txt[[i]]
filename <- trimws(filename)
corpus <- iconv(filename, to = "utf-8")
corpus <- Corpus(VectorSource(corpus))
# Clean Text
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
cleanset <- tm_map(corpus, removeWords, stopwords("english"))
cleanset <- tm_map(cleanset, removeWords, stopwords("spanish"))
cleanset <- tm_map(cleanset, content_transformer(tolower))
cleanset <- tm_map(cleanset, stripWhitespace)
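# Remove spaces and newlines (gsub must be wrapped in content_transformer for tm_map)
cleanset <- tm_map(cleanset, content_transformer(function(x) gsub("\n", " ", x)))
cleanset <- tm_map(cleanset, content_transformer(function(x) gsub("^\\s+", "", x)))
cleanset <- tm_map(cleanset, content_transformer(function(x) gsub("\\s+$", "", x)))
cleanset <- tm_map(cleanset, content_transformer(function(x) gsub("[ |\t]+", " ", x)))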
}, silent = TRUE)
I want to remove stopwords from all the languages at once.
Merge the results of each stopwords(cc) call, and pass that to a single tm_map(corpus, removeWords, allStopwords) call.
I don't want to write the name of every language in the code to remove the stopwords
You could use stopwords_getlanguages() to get a list of all the supported languages, and do it as a loop. See an example at https://www.rdocumentation.org/packages/stopwords/versions/2.3
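The two suggestions above can be sketched roughly as follows. To keep the example small it uses tm's own stopwords() with a hard-coded, assumed subset of three languages; the langs vector could instead be filled from stopwords_getlanguages() in the stopwords package, as suggested. The sample sentence is made up.

```r
library(tm)

# Assumed subset of languages, for illustration only.
langs <- c("english", "spanish", "german")

# Merge the per-language lists into one de-duplicated character vector.
all_stopwords <- unique(unlist(lapply(langs, stopwords)))

# Made-up mixed-language sentence.
docs <- VCorpus(VectorSource("The patient is stable y el dolor ist weg"))

# Lower-case first, because the stopword lists are lower-case.
docs <- tm_map(docs, content_transformer(tolower))
# A single removeWords pass with the merged list.
docs <- tm_map(docs, removeWords, all_stopwords)
docs <- tm_map(docs, stripWhitespace)

cleaned <- content(docs[[1]])
print(cleaned)
```

As far as I know, removeWords builds one regular expression from the whole vector, so a merged list covering every available language may become too large for a single call; splitting the list into chunks would be one way around that.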
For what it's worth, I think this (using stopwords of all languages) is a bad idea. What is a stop word in one language could be a high information word in another language. E.g. just skimming https://github.com/stopwords-iso/stopwords-es/blob/master/stopwords-es.txt I spotted "embargo", "final", "mayor", "salvo", "sea", which are not in the English stopword list, and could carry information.
Of course it depends on what you are doing with the data once all these words have been stripped out.
But if you are doing something like searching for drug names or other keywords, just do that on the original data, without removing stopwords.
Use spaCy, which has more than 15 language models with stopwords. For R, there is the spacyr package.
I have been using the code below to load text as a corpus and using the tm package to clean the text. As a next step I am loading a dictionary and cleaning it as well. Then I am matching the words from the text with the dictionary to calculate a score. However, the matching results in a higher number of matches than actual words in the text (e.g., the competence score is 1500 but the actual number of words in the text is only 1000).
I think it is related to the stemming of the text and the dictionary as the matches are lower when there is no stemming performed.
Do you have any ideas why this is happening?
Thank you very much.
R Code
Step 1 Storing data as corpus
file.path <- file.path(here("Generated Files", "Data Preparation"))
corpus <- Corpus(DirSource(file.path))
Step 2 Cleaning data
#Removing special characters
toSpace <- content_transformer(function (x , pattern ) gsub(pattern, " ", x))
corpus <- tm_map(corpus, toSpace, "/")
corpus <- tm_map(corpus, toSpace, "#")
corpus <- tm_map(corpus, toSpace, "\\|")
#Convert the text to lower case
corpus <- tm_map(corpus, content_transformer(tolower))
#Remove numbers
corpus <- tm_map(corpus, removeNumbers)
#Remove english common stopwords
corpus <- tm_map(corpus, removeWords, stopwords("english"))
#Remove your own stop word
# specify your stopwords as a character vector
corpus <- tm_map(corpus, removeWords, c("view", "pdf"))
#Remove punctuations
corpus <- tm_map(corpus, removePunctuation)
#Eliminate extra white spaces
corpus <- tm_map(corpus, stripWhitespace)
#Text stemming
corpus <- tm_map(corpus, stemDocument)
#Unique words
corpus <- tm_map(corpus, unique)
Step 3 DTM
dtm <- DocumentTermMatrix(corpus)
Step 4 Load Dictionaries
dic.competence <- read_excel(here("Raw Data", "6. Dictionaries", "Brand.xlsx"))
dic.competence <- tolower(dic.competence$COMPETENCE)
dic.competence <- stemDocument(dic.competence)
dic.competence <- unique(dic.competence)
Step 5 Count frequencies
corpus.terms = colnames(dtm)
competence = match(corpus.terms, dic.competence, nomatch=0)
Step 6 Calculate scores
competence.score = sum(competence) / rowSums(as.matrix(dtm))
competence.score.df = data.frame(scores = competence.score)
What does competence return when you run that line? I'm not sure how your dictionary is set up, so I can't say for certain what's happening there. I brought in my own random corpus text as the primary text and brought in a separate corpus as the dictionary and your code worked great. The row names of competence.score.df were the names of the different txt files in my corpus and the scores were all in a 0-1 range.
# this is my 'dictionary' of terms:
tdm <- TermDocumentMatrix(Corpus(DirSource("./corpus/corpus3")),
control = list(removeNumbers = TRUE,
stopwords = TRUE,
stemming = TRUE,
removePunctuation = TRUE))
# then I used your programming and it worked as I think you were expecting
# notice what I used here for the dictionary
(competence = match(colnames(dtm),
Terms(tdm)[1:10], # I only used the first 10 in my test of your code
nomatch = 0))
(competence.score = sum(competence)/rowSums(as.matrix(dtm)))
(competence.score.df = data.frame(scores = competence.score))
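One thing that may be worth checking in Step 5 and Step 6 of the question: match() returns the position of each term in the dictionary, not a 0/1 flag, so sum(competence) adds up dictionary indices and can easily exceed the number of words in the text. The sketch below uses made-up terms and counts (all names and numbers are hypothetical) to contrast that with %in%, and to compute a per-document share from the term counts instead of one global sum.

```r
# Made-up stemmed terms and dictionary, for illustration only.
corpus_terms <- c("brand", "compet", "skill", "market")
dictionary   <- c("expert", "skill", "compet")

# match() gives the POSITION of each term in the dictionary (0 if absent).
positions <- match(corpus_terms, dictionary, nomatch = 0)
sum_of_positions <- sum(positions)   # adds up indices, not matches

# %in% gives a logical flag per term, so its sum is the number of matches.
in_dict <- corpus_terms %in% dictionary
n_matches <- sum(in_dict)

# Toy document-term counts: rows are documents, columns are terms.
counts <- matrix(c(2, 1, 0, 3,
                   0, 4, 1, 1),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("doc1", "doc2"), corpus_terms))

# Share of each document's words that are dictionary terms (always in 0-1).
score <- rowSums(counts[, in_dict, drop = FALSE]) / rowSums(counts)
print(list(positions = positions, n_matches = n_matches, score = score))
```

With the objects from the question, the same idea would read rowSums(as.matrix(dtm)[, in_dict, drop = FALSE]) / rowSums(as.matrix(dtm)), assuming in_dict is computed from colnames(dtm) and dic.competence.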
I am currently looking to perform some text mining on 25000 YouTube comments, which I gathered using the tuber package. I am very new to coding, and with all the different information out there, this can be a bit overwhelming at times. I have already cleaned the corpus that I created:
# Build a corpus, and specify the source to be character vectors
corpus <- Corpus(VectorSource(comments_final$textOriginal))
# Convert to lower case
corpus <- tm_map(corpus, content_transformer(tolower))
# Remove URLs
removeURL <- function(x) gsub("http[^[:space:]]*", "", x)
corpus <- tm_map(corpus, content_transformer(removeURL))
# Remove anything other than English letters or space
removeNumPunct <- function(x) gsub("[^[:alpha:][:space:]]*", "", x)
corpus <- tm_map(corpus, content_transformer(removeNumPunct))
# Add extra stopwords
myStopwords <- c(stopwords('english'),"im", "just", "one","youre",
"hes","shes","its","were","theyre","ive","youve","weve","theyve","id")
# Remove stopwords from corpus
corpus <- tm_map(corpus, removeWords, myStopwords)
# Remove extra whitespace
corpus <- tm_map(corpus, stripWhitespace)
# Remove other languages or more specifically anything with a non "a-z""0-9" character
corpus <- tm_map(corpus, content_transformer(function(s){
gsub(pattern = '[^a-zA-Z0-9\\s]+',
x = s,
replacement = " ",
ignore.case = TRUE,
perl = TRUE)}))
# Replace word elongations using the textclean package by Tyler Rinker.
corpus <- tm_map(corpus, replace_word_elongation)
# Creating data frame from corpus
corpus_asdataframe<-data.frame(text = sapply(corpus, as.character),stringsAsFactors = FALSE)
# Due to pre-processing some rows are empty. Therefore, the empty rows should be removed.
# Remove empty rows from data frame and "NA's"
corpus_asdataframe <-corpus_asdataframe[!apply(is.na(corpus_asdataframe) | corpus_asdataframe == "", 1, all),]
corpus_asdataframe<-as.data.frame(corpus_asdataframe)
# Create corpus of clean data frame
corpus <- Corpus(VectorSource(corpus_asdataframe$corpus_asdataframe))
So now the issue is that there are a lot of Spanish or German comments in my corpus, which I would like to exclude. I thought that maybe it is possible to download an English dictionary and use an inner join to detect English words and remove all other languages. However, I am very new to coding (I am studying Business Administration and never had to do anything with computer science), so my skills are not sufficient for applying my idea to my corpus (or data frame). I really hope to find a little help here. That would be very much appreciated! Thank you and best regards from Germany!
dftest <- data.frame(
id = 1:3,
text = c(
"Holla this is a spanish word",
"English online here",
"Bonjour, comment ça va?"
)
)
library("cld3")
subset(dftest, detect_language(dftest$text) == "en")
##   id                         text
## 1  1 Holla this is a spanish word
## 2  2          English online here
CREDIT: Ken Benoit at: Find in a dfm non-english tokens and remove them
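For comparison, the dictionary idea from the question could look roughly like this in base R: compute, for each comment, the share of its tokens found in an English word list, and keep the comments above a cut-off. This is only a sketch. The word list here is a deliberately tiny, hypothetical stand-in for a real dictionary, and english_share and the 0.5 threshold are assumptions, not something from cld3 or tm.

```r
# Hypothetical, deliberately tiny word list; a real one would be loaded
# from a dictionary file.
english_words <- c("this", "is", "a", "word", "english",
                   "online", "here", "spanish")

# Share of a text's tokens that appear in the word list (NA if no tokens).
english_share <- function(text, vocab) {
  tokens <- tolower(unlist(strsplit(text, "[^[:alpha:]]+")))
  tokens <- tokens[nzchar(tokens)]
  if (length(tokens) == 0) return(NA_real_)
  mean(tokens %in% vocab)
}

texts <- c("Holla this is a spanish word",
           "English online here",
           "Bonjour, comment \u00e7a va?")

shares <- vapply(texts, english_share, numeric(1),
                 vocab = english_words, USE.NAMES = FALSE)

# Assumed cut-off: keep texts where at least half the tokens are known.
keep <- !is.na(shares) & shares >= 0.5
print(texts[keep])
```

A language detector such as cld3 is likely to be more robust than this on short or noisy comments; the word-list version mainly shows what the "inner join" idea amounts to.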
I have a problem with the word stemming completion of my created corpus using the tm package.
Here are the most important lines of my code:
# Build a corpus, and specify the source to be character vectors
corpus <- Corpus(VectorSource(comments_final$textOriginal))
corpus
# Convert to lower case
corpus <- tm_map(corpus, content_transformer(tolower))
# Remove URLs
removeURL <- function(x) gsub("http[^[:space:]]*", "", x)
corpus <- tm_map(corpus, content_transformer(removeURL))
# Remove anything other than English letters or space
removeNumPunct <- function(x) gsub("[^[:alpha:][:space:]]*", "", x)
corpus <- tm_map(corpus, content_transformer(removeNumPunct))
# Remove stopwords
myStopwords <- c(setdiff(stopwords('english'), c("r", "big")),
"use", "see", "used", "via", "amp")
corpus <- tm_map(corpus, removeWords, myStopwords)
# Remove extra whitespace
corpus <- tm_map(corpus, stripWhitespace)
# Remove other languages or more specifically anything with a non "a-z" and "0-9" character
corpus <- tm_map(corpus, content_transformer(function(s){
gsub(pattern = '[^a-zA-Z0-9\\s]+',
x = s,
replacement = " ",
ignore.case = TRUE,
perl = TRUE)
}))
# Keep a copy of the generated corpus for stem completion later as dictionary
corpus_copy <- corpus
# Stemming words of corpus
corpus <- tm_map(corpus, stemDocument, language="english")
Now to complete the word stemming I apply stemCompletion of the tm package.
# Completing the stemming with the generated dictionary
corpus <- tm_map(corpus, content_transformer(stemCompletion), dictionary = corpus_copy, type="prevalent")
However, this is where my corpus gets destroyed and messed up and the stemCompletion does not work properly. Peculiarly, R does not indicate an error, the code runs but the result is terrible.
Does anybody know a solution for this? BTW my "comments_final" data frame consists of YouTube comments, which I downloaded using the tuber package.
Thank you so much for your help in advance, I really need help for my master's thesis thank you.
It does seem to work in a slightly odd way, so I came up with my own stemCompletion function and applied it to the corpus. In your case try this:
stemCompletion2 <- function(x, dictionary) {
# split each word and store it
x <- unlist(strsplit(as.character(x), " "))
# Oddly, stemCompletion completes an empty string to
# a word in dictionary. Remove empty string to avoid issue.
x <- x[x != ""]
x <- stemCompletion(x, dictionary=dictionary)
x <- paste(x, sep="", collapse=" ")
PlainTextDocument(stripWhitespace(x))
}
corpus <- lapply(corpus, stemCompletion2, corpus_copy)
corpus <- as.VCorpus(corpus)
Hope this helps!
I am new to supervised methods. Here is my way to normalize my data:
corpuscleaned1 <- tm_map(AI_corpus, removePunctuation) ## Remove punctuation.
corpuscleaned2 <- tm_map(corpuscleaned1, stripWhitespace) ## Remove whitespace.
corpuscleaned3 <- tm_map(corpuscleaned2, removeNumbers) ## Remove numbers.
corpuscleaned4 <- tm_map(corpuscleaned3, stemDocument, language = "english") ## Stem words.
corpuscleaned5 <- tm_map(corpuscleaned4, removeWords, stopwords("en")) ## Remove stopwords.
head(AI_corpus[[1]]$content) ## Examine original txt.
head(corpuscleaned5[[1]]$content) ## Examine clean txt.
AI_corpus is my corpus of Amnesty International reports, 1993-2013.
When creating wordclouds it is most common to make all the words lowercase. However, I want the wordclouds to display the words in uppercase. After forcing the words to be uppercase, the wordcloud still displays lowercase words. Any ideas why?
Reproducible code:
library(tm)
library(wordcloud)
data <- data.frame(text = c("Creativity is the art of being ‘productive’ by using
the available resources in a skillful manner.
Scientifically speaking, creativity is part of
our consciousness and we can be creative –
if we know – ’what goes on in our mind during
the process of creation’.
Let us now look at 6 examples of creativity which blows the mind."))
text <- paste(data$text, collapse = " ")
# I am using toupper() to force the words to become uppercase.
text <- toupper(text)
source <- VectorSource(text)
corpus <- VCorpus(source, list(language = "en"))
# This is my function for cleaning the text
clean_corpus <- function(corpus){
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, removeWords, c(stopwords("en")))
return(corpus)
}
clean_corp <- clean_corpus(corpus)
data_tdm <- TermDocumentMatrix(clean_corp)
data_m <- as.matrix(data_tdm)
commonality.cloud(data_m, colors = c("#224768", "#ffc000"), max.words = 50)
This produces the following output:
It's because behind the scenes TermDocumentMatrix(clean_corp) is doing TermDocumentMatrix(clean_corp, control = list(tolower = TRUE)). If you set it to TermDocumentMatrix(clean_corp, control = list(tolower = FALSE)), then the words stay uppercase. Alternatively, you can also adjust the row names of your matrix afterwards: rownames(data_m) <- toupper(rownames(data_m)).
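A small check of the explanation above, as a sketch with a made-up one-sentence corpus: build the matrix once with the defaults and once with tolower = FALSE, then compare the terms.

```r
library(tm)

# Made-up upper-case document, for illustration only.
corp <- VCorpus(VectorSource("CREATIVITY IS THE ART OF BEING PRODUCTIVE"))

# Default control settings: terms are lower-cased while the matrix is built.
tdm_default <- TermDocumentMatrix(corp)

# Lower-casing switched off: terms keep the case they have in the corpus.
tdm_upper <- TermDocumentMatrix(corp, control = list(tolower = FALSE))

terms_default <- Terms(tdm_default)
terms_upper   <- Terms(tdm_upper)
print(terms_default)
print(terms_upper)
```

Short words such as "IS" and "OF" are missing from both term lists because of the default minimum word length, not because of the case handling.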