I have already read this and this questions, but I still didn't understand the use of stemDocument in tm_map. Let's follow this example:
q17 <- VCorpus(VectorSource(x = c("poder", "pode")),
readerControl = list(language = "pt",
load = TRUE))
lapply(q17, content)
$`character(0)`
[1] "poder"
$`character(0)`
[1] "pode"
If I use:
> stemDocument("poder", language = "portuguese")
[1] "pod"
> stemDocument("pode", language = "portuguese")
[1] "pod"
it does work! But if I use:
> q17 <- tm_map(q17, FUN = stemDocument, language = "portuguese")
> lapply(q17, content)
$`character(0)`
[1] "poder"
$`character(0)`
[1] "pode"
it doesn't work. Why so?
Unfortunately you stumbled on a bug. stemDocument works if you pass on the language when you do:
stemDocument(x = c("poder", "pode"), language = "pt")
[1] "pod" "pod"
But when using this in tm_map, the function starts of with stemDocument.PlainTextDocument. In this function the language of the corpus is checked against the language you supply in the function. This works correctly. But at the end of this function everything is passed on to the function stemDocument.character, but without the language component. In stemDocument.character the default language is specified as English. So within the tm_map call (or the DocumentTermMatrix) the language you supply with it will revert back to English and the stemming doesn't work correctly.
A workaround could be using the package quanteda:
library(quanteda)
my_dfm <- dfm(x = c("poder", "pode"))
my_dfm <- dfm_wordstem(my_dfm, language = "pt")
my_dfm
Document-feature matrix of: 2 documents, 1 feature (0.0% sparse).
2 x 1 sparse Matrix of class "dfm"
features
docs pod
text1 1
text2 1
Since you are working with Portuguese, I suggest using the packages quanteda, udpipe, or both. Both packages handle non-English languages a lot better than tm.
Related
Working with a document term matrix in R seems to be truncating the words.
I create a document term matrix from a corpus like below:
library(tm)
docs <- c("All that we are is the result of what we have thought.",
"Wisely, and slow. They stumble that run fast.",
"The future belongs to those who prepare for it today.",
"Our life is frittered away by detail... simplify, simplify.",
"Imperfection is beauty, madness is genius and it’s better to be absolutely ridiculous than absolutely boring.")
myCorpus <- Corpus(VectorSource(docs))
ndocs <- length(myCorpus)
minTermFreq <- 0.05 * ndocs
maxTermFreq <- 0.6 * ndocs
myDTM <- DocumentTermMatrix(myCorpus,
control = list(stopwords = TRUE,
wordLengths=c(3, Inf),
removePunctuation = TRUE,
removeNumbers = TRUE,
tolower=TRUE,
stemming = TRUE,
remove_separators = TRUE,
bounds = list(global = c(minTermFreq, maxTermFreq))
)
)
When I look at the terms, longer ones are truncated, but not consistently:
myDTM[["dimnames"]][["Terms"]]
# [1] "absolut" "away" "beauti" "belong" "better"
# [6] "bore" "detail" "fast" "fritter" "futur"
# [11] "genius" "imperfect" "it’" "life" "mad"
# [16] "prepar" "result" "ridicul" "run" "simplifi"
# [21] "slow" "stumbl" "thought" "today" "wise"
"Absolutely" is truncated to 7 characters, while "beauty" is truncated to 6. What's the fix for this? Or am I missing something obvious?
You have stemmed the words by using the option stemming = TRUE.
You can either set this to false to avoid stemming, meaning that words such as stumble, stumbles and stumbled will all be counted separately, or complete the stems using stemCompletion. This will replace the stems with the most common option from the text by default (though you can change the behaviour with the type parameter.
I am having problem with getting the right text after stemming in R.
Eg. 'papper' should show as 'papper' but instead shows up as 'papp', 'projekt' becomes 'projek'.
The frequency cloud generated thus shows these shortened versions which loses the actual meaning or becomes incomprehensible.
What can I do to get rid of this problem? I am using the latest version of snowball(0.6.0).
R Code:
library(tm)
library(SnowballC)
text_example <- c("projekt", "papper", "arbete")
stem_doc <- stemDocument(text_example, language="sv")
stem_doc
Expected:
stem_doc
[1] "projekt" "papper" "arbete"
Actual:
stem_doc
[1] "projek" "papp" "arbet"
What you describe here is actually not stemming but is called lemmatization (see #Newl's link for the difference).
To get the correct lemmas, you can use the R package UDPipe, which is a wrapper around the UDPipe C++ library.
Here is a quick example of how you would do what you want:
# install.packages("udpipe")
library(udpipe)
dl <- udpipe_download_model(language = "swedish-lines")
#> Downloading udpipe model from https://raw.githubusercontent.com/jwijffels/udpipe.models.ud.2.3/master/inst/udpipe-ud-2.3-181115/swedish-lines-ud-2.3-181115.udpipe to C:/Users/Johannes Gruber/AppData/Local/Temp/RtmpMhaF8L/reprex8e40d80ef3/swedish-lines-ud-2.3-181115.udpipe
udmodel_swed <- udpipe_load_model(file = dl$file_model)
text_example <- c("projekt", "papper", "arbete")
x <- udpipe_annotate(udmodel_swed, x = text_example)
x <- as.data.frame(x)
x$lemma
#> [1] "projekt" "papper" "arbete"
Purpose: I want to create a term-document matrix using a dictionary which has compound words, or bigrams, as some of the keywords.
Web Search: Being new to text-mining and the tm package in R, I went to the web to figure out how to do this. Below are some relevant links that I found:
FAQS on the tm-package website
finding 2 & 3 word phrases using r tm package
counter ngram with tm package in r
findassocs for multiple terms in r
Background: Of these, I preferred the solution that uses NGramTokenizer in the RWeka package in R, but I ran into a problem. In the example code below, I create three documents and place them in a corpus. Note that Docs 1 and 2 each contain two words. Doc 3 only contains one word. My dictionary keywords are two bigrams and a unigram.
Problem: The NGramTokenizer solution in the above links does not correctly count the unigram keyword in the Doc 3.
library(tm)
library(RWeka)
my.docs = c('jedi master', 'jedi grandmaster', 'jedi')
my.corpus = Corpus(VectorSource(my.docs))
my.dict = c('jedi master', 'jedi grandmaster', 'jedi')
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 2))
inspect(DocumentTermMatrix(my.corpus, control=list(tokenize=BigramTokenizer,
dictionary=my.dict)))
# <<DocumentTermMatrix (documents: 3, terms: 3)>>
# ...
# Docs jedi jedi grandmaster jedi master
# 1 1 0 1
# 2 1 1 0
# 3 0 0 0
I was expecting the row for Doc 3 to give 1 for jedi and 0 for the other two. Is there something I am misunderstanding?
I ran into the same problem and found that token counting functions from the TM package rely on an option called wordLengths, which is a vector of two numbers -- the minimum and the maximum length of tokens to keep track of. By default, TM uses a minimum word length of 3 characters (wordLengths = c(3, Inf)). You can override this option by adding it to the control list in a call to DocumentTermMatrix like this:
DocumentTermMatrix(my.corpus,
control=list(
tokenize=newBigramTokenizer,
wordLengths = c(1, Inf)))
However, your 'jedi' word is more than 3 characters long. Although, you probably tweaked the option's value earlier while trying to figure out how to count ngrams, so still try this. Also, look at the bounds option, which tells TM to discard words less or more frequent than specified values.
I noticed that NGramTokenizer returns character(0) when a one-word string is submitted as input and NGramTokenizer is asked to return unigrams and bigrams.
NGramTokenizer('jedi', Weka_control(min = 1, max = 2))
# character(0)
I am not sure why this is the output, but I believe this behavior is the reason why the keyword jedi was not counted in Doc 3. However, a simple if-then-else solution appears to work for my situation: both for the sample set and my actual data set.
library(tm)
library(RWeka)
my.docs = c('jedi master', 'jedi grandmaster', 'jedi')
my.corpus = Corpus(VectorSource(my.docs))
my.dict = c('jedi master', 'jedi grandmaster', 'jedi')
newBigramTokenizer = function(x) {
tokenizer1 = NGramTokenizer(x, Weka_control(min = 1, max = 2))
if (length(tokenizer1) != 0L) { return(tokenizer1)
} else return(WordTokenizer(x))
} # WordTokenizer is an another tokenizer in the RWeka package.
inspect(DocumentTermMatrix(my.corpus, control=list(tokenize=newBigramTokenizer,
dictionary=my.dict)))
# <<DocumentTermMatrix (documents: 3, terms: 3)>>
# ...
# Docs jedi jedi grandmaster jedi master
# 1 1 0 1
# 2 1 1 0
# 3 1 0 0
Please let me know if anyone finds a "gotcha" that I am not considering in the code above. I would also appreciate any insight into why NGramTokenizer returns character(0) in my observation above.
I would like to create a wordcloud for non-english text in utf-8 (actually, it's in kazakh language).
The text is displayed absolutely right in inspect function of the tm package.
However, when I search for word frequency everything is displayed incorrectly:
The problem is that the text is displayed with coded characters instead of words. Cyrillic characters are displayed correctly. Consquently the wordcloud becomes a complete mess.
Is it possible to assign encoding to the tm function somehow? I tried this, but the text on its own is fine, the problem is with using tm package.
Let a sample text be:
Ол арман – әлем елдерімен терезесі тең қатынас құрып, әлем картасынан ойып тұрып орын алатын Тәуелсіз Мемлекет атану еді.
Ол арман – тұрмысы бақуатты, түтіні түзу ұшқан, ұрпағы ертеңіне сеніммен қарайтын бақытты Ел болу еді.
Біз армандарды ақиқатқа айналдырдық. Мәңгілік Елдің іргетасын қаладық.
Мен қоғамда «Қазақ елінің ұлттық идеясы қандай болуы керек?» деген сауал жиі талқыға түсетінін көріп жүрмін. Біз үшін болашағымызға бағдар ететін, ұлтты ұйыстырып, ұлы мақсаттарға жетелейтін идея бар. Ол – Мәңгілік Ел идеясы.
Тәуелсіздікпен бірге халқымыз Мәңгілік Мұраттарына қол жеткізді.
My simple code is this:
(Based on onertipaday.blogspot.com tutorials:)
require(tm)
require(wordcloud)
text<-readLines("text.txt", encoding="UTF-8")
ap.corpus <- Corpus(DataframeSource(data.frame(text)))
ap.corpus <- tm_map(ap.corpus, removePunctuation)
ap.corpus <- tm_map(ap.corpus, tolower)
ap.tdm <- TermDocumentMatrix(ap.corpus)
ap.m <- as.matrix(ap.tdm)
ap.v <- sort(rowSums(ap.m),decreasing=TRUE)
ap.d <- data.frame(word = names(ap.v),freq=ap.v)
table(ap.d$freq)
1 2
44 4
findFreqTerms(ap.tdm, lowfreq=2)
[1] "<U+04D9>лем" "арман" "еді"
[4] "м<U+04D9><U+04A3>гілік"
Those words should be: "Әлем", арман", "еді", "мәңгілік". They are displayed correctly in inspect(ap.corpus) output.
Highly appreciate any help! :)
The problem comes from the default tokenizer. tm by default uses scan_tokenizer which it looses encoding(maybe you should contact the maintainer to add an encoding argument).
scan_tokenizer function (x) {
scan(text = x, what = "character", quote = "", quiet = TRUE) }
One solution is to provide your own tokenizer to create the matrix terms. I am using strsplit:
scanner <- function(x) strsplit(x," ")
ap.tdm <- TermDocumentMatrix(ap.corpus,control=list(tokenize=scanner))
Then you get the result well encoded:
findFreqTerms(ap.tdm, lowfreq=2)
[1] "арман" "біз" "еді" "әлем" "идеясы" "мәңгілік"
Actually, I disagree with agstudy's answer. It does not seem to be a tokenizer problem. I'm using version 0.6.0 of the tm package and your code works just fine for me, except that I had to explicitly set the encoding of your text data to UTF-8 using :
Encoding(text) <- "UTF-8"
Below is the complete piece of reproducible code. Just make sure you save it in a file with UTF-8 encoding, and use source() to run it; do not use source.with.encoding(), it'll throw an error.
text <- "Ол арман – әлем елдерімен терезесі тең қатынас құрып, әлем картасынан ойып тұрып орын алатын Тәуелсіз Мемлекет атану еді. Ол арман – тұрмысы бақуатты, түтіні түзу ұшқан, ұрпағы ертеңіне сеніммен қарайтын бақытты Ел болу еді. Біз армандарды ақиқатқа айналдырдық. Мәңгілік Елдің іргетасын қаладық. Мен қоғамда «Қазақ елінің ұлттық идеясы қандай болуы керек?» деген сауал жиі талқыға түсетінін көріп жүрмін. Біз үшін болашағымызға бағдар ететін, ұлтты ұйыстырып, ұлы мақсаттарға жетелейтін идея бар. Ол – Мәңгілік Ел идеясы. Тәуелсіздікпен бірге халқымыз Мәңгілік Мұраттарына қол жеткізді."
Encoding(text)
# [1] "unknown"
Encoding(text) <- "UTF-8"
# [1] "UTF-8"
ap.corpus <- Corpus(DataframeSource(data.frame(text)))
ap.corpus <- tm_map(ap.corpus, removePunctuation)
ap.corpus <- tm_map(ap.corpus, content_transformer(tolower))
content(ap.corpus[[1]])
ap.tdm <- TermDocumentMatrix(ap.corpus)
ap.m <- as.matrix(ap.tdm)
ap.v <- sort(rowSums(ap.m),decreasing=TRUE)
ap.d <- data.frame(word = names(ap.v),freq=ap.v)
print(table(ap.d$freq))
# 1 2 3
# 62 5 1
print(findFreqTerms(ap.tdm, lowfreq=2))
# [1] "арман" "біз" "еді" "әлем" "идеясы" "мәңгілік"
It worked for me, hope it does for you too.
I can get a list of all the available packages with the function:
ap <- available.packages()
But how can I also get a description of these packages from within R, so I can have a data.frame with two columns: package and description?
Edit of an almost ten-year old accepted answer. What you likely want is not to scrape (unless you want to practice scraping) but use an existing interface: tools::CRAN_package_db(). Example:
> db <- tools::CRAN_package_db()[, c("Package", "Description")]
> dim(db)
[1] 18978 2
>
The function brings (currently) 66 columns back of which the of interest here are a part.
I actually think you want "Package" and "Title" as the "Description" can run to several lines. So here is the former, just put "Description" in the final subset if you really want "Description":
R> ## from http://developer.r-project.org/CRAN/Scripts/depends.R and adapted
R>
R> require("tools")
R>
R> getPackagesWithTitle <- function() {
+ contrib.url(getOption("repos")["CRAN"], "source")
+ description <- sprintf("%s/web/packages/packages.rds",
+ getOption("repos")["CRAN"])
+ con <- if(substring(description, 1L, 7L) == "file://") {
+ file(description, "rb")
+ } else {
+ url(description, "rb")
+ }
+ on.exit(close(con))
+ db <- readRDS(gzcon(con))
+ rownames(db) <- NULL
+
+ db[, c("Package", "Title")]
+ }
R>
R>
R> head(getPackagesWithTitle()) # I shortened one Title here...
Package Title
[1,] "abc" "Tools for Approximate Bayesian Computation (ABC)"
[2,] "abcdeFBA" "ABCDE_FBA: A-Biologist-Can-Do-Everything of Flux ..."
[3,] "abd" "The Analysis of Biological Data"
[4,] "abind" "Combine multi-dimensional arrays"
[5,] "abn" "Data Modelling with Additive Bayesian Networks"
[6,] "AcceptanceSampling" "Creation and evaluation of Acceptance Sampling Plans"
R>
Dirk has provided an answer that is terrific and after finishing my solution and then seeing his I debated for some time posting my solution for fear of looking silly. But I decided to post it anyway for two reasons:
it is informative to beginning scrapers like myself
it took me a while to do and so why not :)
I approached this thinking I'd need to do some web scraping and choose crantastic as the site to scrape from. First I'll provide the code and then two scraping resources that have been very helpful to me as I learn:
library(RCurl)
library(XML)
URL <- "http://cran.r-project.org/web/checks/check_summary.html#summary_by_package"
packs <- na.omit(XML::readHTMLTable(doc = URL, which = 2, header = T,
strip.white = T, as.is = FALSE, sep = ",", na.strings = c("999",
"NA", " "))[, 1])
Trim <- function(x) {
gsub("^\\s+|\\s+$", "", x)
}
packs <- unique(Trim(packs))
u1 <- "http://crantastic.org/packages/"
len.samps <- 10 #for demo purpose; use:
#len.samps <- length(packs) # for all of them
URL2 <- paste0(u1, packs[seq_len(len.samps)])
scraper <- function(urls){ #function to grab description
doc <- htmlTreeParse(urls, useInternalNodes=TRUE)
nodes <- getNodeSet(doc, "//p")[[3]]
return(nodes)
}
info <- sapply(seq_along(URL2), function(i) try(scraper(URL2[i]), TRUE))
info2 <- sapply(info, function(x) { #replace errors with NA
if(class(x)[1] != "XMLInternalElementNode"){
NA
} else {
Trim(gsub("\\s+", " ", xmlValue(x)))
}
}
)
pack_n_desc <- data.frame(package=packs[seq_len(len.samps)],
description=info2) #make a dataframe of it all
Resources:
talkstats.com thread on web scraping (great beginner
examples)
w3schools.com site on html stuff (very
helpful)
I wanted to try to do this using a HTML scraper (rvest) as an exercise, since the available.packages() in OP doesn't contain the package Descriptions.
library('rvest')
url <- 'https://cloud.r-project.org/web/packages/available_packages_by_name.html'
webpage <- read_html(url)
data_html <- html_nodes(webpage,'tr td')
length(data_html)
P1 <- html_nodes(webpage,'td:nth-child(1)') %>% html_text(trim=TRUE) # XML: The Package Name
P2 <- html_nodes(webpage,'td:nth-child(2)') %>% html_text(trim=TRUE) # XML: The Description
P1 <- P1[lengths(P1) > 0 & P1 != ""] # Remove NULL and empty ("") items
length(P1); length(P2);
mdf <- data.frame(P1, P2, row.names=NULL)
colnames(mdf) <- c("PackageName", "Description")
# This is the problem! It lists large sets column-by-column,
# instead of row-by-row. Try with the full list to see what happens.
print(mdf, right=FALSE, row.names=FALSE)
# PackageName Description
# A3 Accurate, Adaptable, and Accessible Error Metrics for Predictive\nModels
# abbyyR Access to Abbyy Optical Character Recognition (OCR) API
# abc Tools for Approximate Bayesian Computation (ABC)
# abc.data Data Only: Tools for Approximate Bayesian Computation (ABC)
# ABC.RAP Array Based CpG Region Analysis Pipeline
# ABCanalysis Computed ABC Analysis
# For small sets we can use either:
# mdf[1:6,] #or# head(mdf, 6)
However, although working quite well for small array/dataframe list (subset), I ran into a display problem with the full list, where the data would be shown either column-by-column or unaligned. I would have been great to have this paged and properly formatted in a new window somehow. I tried using page, but I couldn't get it to work very well.
EDIT:
The recommended method is not the above, but rather using Dirk's suggestion (from the comments below):
db <- tools::CRAN_package_db()
colnames(db)
mdf <- data.frame(db[,1], db[,52])
colnames(mdf) <- c("Package", "Description")
print(mdf, right=FALSE, row.names=FALSE)
However, this still suffers from the display problem mentioned...