I would like to create a Term document matrix containing character n-grams. For example, take the following sentence:
"In this paper, we focus on a different but simple text representation."
Character 4-grams would be: |In_t|, |n_th|, |_thi|, |this|, |his_|, |is_p|, |s_pa|, |_pap|, |pape|, |aper|, etc.
I have used the R/Weka package to work with "bag of words" n-grams, but I'm having difficulty adapting tokenizers such as the one below to work with characters:
BigramTokenizer <- function(x) {
  NGramTokenizer(x, Weka_control(min = 2, max = 2))
}
tdm_bigram <- TermDocumentMatrix(corpus,
  control = list(tokenize = BigramTokenizer, wordLengths = c(2, Inf)))
Any thoughts on how to use R/Weka or another package to create character n-grams?
I find quanteda quite useful:
library(tm)
library(quanteda)
txts <- c("In this paper.", "In this lines this.")
tokens <- tokenize(gsub("\\s", "_", txts), "character", ngrams=4L, conc="")
dfm <- dfm(tokens)
tdm <- as.TermDocumentMatrix(t(dfm), weighting=weightTf)
as.matrix(tdm)
# Docs
# Terms text1 text2
# In_t 1 1
# n_th 1 1
# _thi 1 2
# this 1 2
# his_ 1 1
# is_p 1 0
# s_pa 1 0
# _pap 1 0
# pape 1 0
# aper 1 0
# per. 1 0
# is_l 0 1
# s_li 0 1
# _lin 0 1
# line 0 1
# ines 0 1
# nes_ 0 1
# es_t 0 1
# s_th 0 1
# his. 0 1
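Note that this uses the older quanteda tokenize() interface. On newer quanteda releases, where tokenize() has been removed, a roughly equivalent sketch (argument names taken from the current tokens()/tokens_ngrams() API as I understand it) would be:
library(quanteda)
txts <- c("In this paper.", "In this lines this.")
# treat every character as a token, then glue them back into 4-grams
toks <- tokens(gsub("\\s", "_", txts), what = "character")
toks4 <- tokens_ngrams(toks, n = 4L, concatenator = "")
dfm(toks4)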
You need to use the CharacterNGramTokenizer instead.
The regular NGramTokenizer splits the text on delimiter characters such as spaces.
##########
### the following lines are mainly a one-to-one copy from RWeka;
### only the hardcoded CharacterNGramTokenizer is new
library(rJava)
library(RWeka)   # for Weka_control() used below
CharacterNGramTokenizer <- structure(function (x, control = NULL)
{
tokenizer <- .jnew("weka/core/tokenizers/CharacterNGramTokenizer")
x <- Filter(nzchar, as.character(x))
if (!length(x))
return(character())
.jcall("RWekaInterfaces", "[S", "tokenize", .jcast(tokenizer,
"weka/core/tokenizers/Tokenizer"), .jarray(as.character(control)),
.jarray(as.character(x)))
}, class = c("R_Weka_tokenizer_interface", "R_Weka_interface"
), meta = structure(list(name = "weka/core/tokenizers/NGramTokenizer",
kind = "R_Weka_tokenizer_interface", class = "character",
init = NULL), .Names = c("name", "kind", "class", "init")))
### copy till here
###################
BigramTokenizer <- function(x) {
  CharacterNGramTokenizer(x, Weka_control(min = 2, max = 2))
}
Sadly, it is not included in RWeka by default. However, if you want to stick with Weka, this seems to be a reasonably complete approach.
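For completeness, here is a minimal sketch of how the wrapper above could be plugged into a term-document matrix. The CharQuadgramTokenizer name, the toy corpus, and the 4-gram settings are only illustrative; the wrapper definition from the block above is assumed to be in scope:
library(tm)
library(RWeka)
# character 4-gram tokenizer built on the CharacterNGramTokenizer wrapper above
CharQuadgramTokenizer <- function(x) {
  CharacterNGramTokenizer(x, Weka_control(min = 4, max = 4))
}
corpus <- VCorpus(VectorSource(
  "In this paper, we focus on a different but simple text representation."))
tdm_char4 <- TermDocumentMatrix(corpus,
  control = list(tokenize = CharQuadgramTokenizer, wordLengths = c(1, Inf)))
inspect(tdm_char4)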
I'm trying to figure out how to identify unigrams and bigrams in a text in R, and then keep both in the final output based on a threshold. I've done this in Python with gensim's Phraser model, but haven't figured out how to do it in R.
For example:
strings <- data.frame(
  text = c('This is a great movie from yesterday', 'I went to the movies',
           'Great movie time at the theater', 'I went to the theater yesterday'),
  stringsAsFactors = FALSE
)
#Pseudocode below
bigs <- tokenize_uni_bi(strings, n = 1:2, threshold = 2)
print(bigs)
[['this', 'great_movie', 'yesterday'], ['went', 'movies'], ['great_movie', 'theater'], ['went', 'theater', 'yesterday']]
Thank you!
You could use the quanteda framework for this:
library(quanteda)
# tokenize, tolower, remove stopwords and create ngrams
my_toks <- tokens(strings$text)
my_toks <- tokens_tolower(my_toks)
my_toks <- tokens_remove(my_toks, stopwords("english"))
bigs <- tokens_ngrams(my_toks, n = 1:2)
# turn into a document-feature matrix and keep features with a frequency of at least 2
my_dfm <- dfm(bigs)
dfm_trim(my_dfm, min_termfreq = 2)
Document-feature matrix of: 4 documents, 6 features (50.0% sparse).
features
docs great movie yesterday great_movie went theater
text1 1 1 1 1 0 0
text2 0 0 0 0 1 0
text3 1 1 0 1 0 1
text4 0 0 1 0 1 1
# use convert function to turn this into a data.frame
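In code, that conversion step might look like this (a small sketch reusing the objects created above):
# keep features occurring at least twice, then convert to a data.frame
my_dfm_trimmed <- dfm_trim(my_dfm, min_termfreq = 2)
convert(my_dfm_trimmed, to = "data.frame")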
Alternatively, you could use the tidytext package, tm, tokenizers, etc. It all depends a bit on the output you are expecting.
An example using tidytext / dplyr looks like this:
library(tidytext)
library(dplyr)
strings %>%
unnest_ngrams(bigs, text, n = 2, n_min = 1, ngram_delim = "_", stopwords = stopwords::stopwords()) %>%
count(bigs) %>%
filter(n >= 2)
bigs n
1 great 2
2 great_movie 2
3 movie 2
4 theater 2
5 went 2
6 yesterday 2
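If you want the counts per document rather than over the whole corpus (closer to the list-per-sentence output in the question), you can carry a document id through before counting; a sketch reusing the strings data frame from above:
library(dplyr)
library(tidytext)
strings %>%
  mutate(doc = row_number()) %>%
  unnest_ngrams(bigs, text, n = 2, n_min = 1, ngram_delim = "_",
                stopwords = stopwords::stopwords()) %>%
  count(doc, bigs)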
Both quanteda and tidytext have a lot of online help available. See the vignettes of both packages on CRAN.
From the answer posted on: Keep document ID with R corpus by @MrFlick
I am trying to slightly modify what is a great example.
Question: How do I modify the content_transformer function to keep only exact words? You can see in the inspect output that "wonderful" is counted as "wonder" and "rationale" is counted as "ratio". I do not have a strong understanding of gregexpr and regmatches.
Create data frame:
dd <- data.frame(
id = 10:13,
text = c("No wonderful, then, that ever",
"So that in many cases such a ",
"But there were still other and",
"Not even at the rationale")
, stringsAsFactors = F
)
Now, in order to read special attributes from a data.frame, we will use the readTabular function to make our own custom data.frame reader
library(tm)
myReader <- readTabular(mapping = list(content = "text", id = "id"))
Here we specify the column to use for the contents and the id in the data.frame. Now we read it in with DataframeSource, but use our custom reader.
tm <- VCorpus(DataframeSource(dd), readerControl = list(reader = myReader))
Now if we want to only keep a certain set of words, we can create our own content_transformer function. One way to do this is
keepOnlyWords <- content_transformer(function(x, words) {
regmatches(x,
gregexpr(paste0("\\b(", paste(words, collapse = "|"), "\\b)"), x)
, invert = T) <- " "
x
})
This will replace everything that's not in the word list with a space. Note that you probably want to run stripWhitespace after this. Thus our transformations would look like
keep <- c("wonder", "then", "that", "the")
tm <- tm_map(tm, content_transformer(tolower))
tm <- tm_map(tm, keepOnlyWords, keep)
tm <- tm_map(tm, stripWhitespace)
Create the document-term matrix and inspect it:
dtm <- DocumentTermMatrix(tm)
inspect(dtm)
<<DocumentTermMatrix (documents: 4, terms: 4)>>
Non-/sparse entries: 7/9
Sparsity : 56%
Maximal term length: 6
Weighting : term frequency (tf)
Terms
Docs ratio that the wonder
10 0 1 1 1
11 0 1 0 0
12 0 0 1 0
13 1 0 1 0
Switching grammars to tidytext, your current transformation would be
library(tidyverse)
library(tidytext)
library(stringr)
dd %>% unnest_tokens(word, text) %>%
mutate(word = str_replace_all(word, setNames(keep, paste0('.*', keep, '.*')))) %>%
inner_join(data_frame(word = keep))
## id word
## 1 10 wonder
## 2 10 the
## 3 10 that
## 4 11 that
## 5 12 the
## 6 12 the
## 7 13 the
Keeping exact matches is easier, as you can use joins (which use ==) instead of regex:
dd %>% unnest_tokens(word, text) %>%
inner_join(data_frame(word = keep))
## id word
## 1 10 then
## 2 10 that
## 3 11 that
## 4 13 the
To take it back to a document-term matrix,
library(tm)
dd %>% mutate(id = factor(id)) %>% # to keep empty rows of DTM
unnest_tokens(word, text) %>%
inner_join(data_frame(word = keep)) %>%
mutate(i = 1) %>%
cast_dtm(id, word, i) %>%
inspect()
## <<DocumentTermMatrix (documents: 4, terms: 3)>>
## Non-/sparse entries: 4/8
## Sparsity : 67%
## Maximal term length: 4
## Weighting : term frequency (tf)
##
## Terms
## Docs then that the
## 10 1 1 0
## 11 0 1 0
## 12 0 0 0
## 13 0 0 1
Currently, your function is matching words with a boundary before or after. To change it to before and after, change the collapse parameter to include boundaries:
tm <- VCorpus(DataframeSource(dd), readerControl = list(reader = myReader))
keepOnlyWords<-content_transformer(function(x,words) {
regmatches(x,
gregexpr(paste0("(\\b", paste(words, collapse = "\\b|\\b"), "\\b)"), x)
, invert = T) <- " "
x
})
tm <- tm_map(tm, content_transformer(tolower))
tm <- tm_map(tm, keepOnlyWords, keep)
tm <- tm_map(tm, stripWhitespace)
inspect(DocumentTermMatrix(tm))
## <<DocumentTermMatrix (documents: 4, terms: 3)>>
## Non-/sparse entries: 4/8
## Sparsity : 67%
## Maximal term length: 4
## Weighting : term frequency (tf)
##
## Terms
## Docs that the then
## 10 1 0 1
## 11 1 0 0
## 12 0 0 0
## 13 0 1 0
I got the same result as @alistaire with tm, with the following modified line in the keepOnlyWords content transformer first defined by @BEMR:
gregexpr(paste0("\\b(", paste(words, collapse = "|"), ")\\b"), x)
There was a misplaced ")" in the gregexpr call first specified by @BEMR, i.e. it should be ")\\b", not "\\b)".
I think the above gregexpr is equivalent to the one specified by @alistaire:
gregexpr(paste0("(\\b", paste(words, collapse = "\\b|\\b"), "\\b)"), x)
I need to get the unigrams and trigrams without the bigrams.
trigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 3))
How do I edit this code to get that result?
One way is to use the dfm function from the quanteda package as follows:
library(quanteda)
dfm('I only want uni and trigrams', ngrams = c(1,3), verbose = FALSE)
#Document-feature matrix of: 1 document, 10 features.
#1 x 10 sparse Matrix of class "dfmSparse"
# features
#docs i only want uni and trigrams i_only_want only_want_uni want_uni_and uni_and_trigrams
# text1 1 1 1 1 1 1 1 1 1 1
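If your quanteda version no longer accepts an ngrams argument in dfm(), the same idea can be expressed by building the tokens first; a sketch using the tokens()/tokens_ngrams() interface:
library(quanteda)
toks <- tokens("I only want uni and trigrams")
# n = c(1, 3) produces unigrams and trigrams, skipping bigrams
dfm(tokens_ngrams(toks, n = c(1, 3)))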
How can someone find frequent pairs of adjacent words in a character vector? Using the crude data set, for example, some common pairs are "crude oil", "oil market", and "million barrels".
The code for the small example below tries to identify frequent terms and then, using a positive lookahead assertion, count how many times those frequent terms are followed immediately by a frequent term. But the attempt crashed and burned.
Any guidance would be appreciated as to how to create a data frame that shows in the first column ("Pairs") the common pairs and in the second column ("Count") the number of times they appeared in the text.
library(qdap)
library(tm)
library(stringr)   # for str_replace_all() below
# from the crude data set, create a text file from the first three documents, then clean it
text <- c(crude[[1]][1], crude[[2]][2], crude[[3]][3])
text <- tolower(text)
text <- tm::removeNumbers(text)
text <- str_replace_all(text, "  ", " ") # replace double spaces with a single space
text <- str_replace_all(text, pattern = "[[:punct:]]", " ")
text <- removeWords(text, stopwords(kind = "SMART"))
# pick the top 10 individual words by frequency, since they will likely form the most common pairs
freq.terms <- head(freq_terms(text.var = text), 10)
# create a pattern from the top words for the regex expression below
freq.terms.pat <- str_c(freq.terms$WORD, collapse = "|")
# match frequent terms that are followed by a frequent term
library(stringr)
pairs <- str_extract_all(string = text, pattern = "freq.terms.pat(?= freq.terms.pat)")
Here is where the effort falters.
Not knowing Java or Python, the related questions "Java count word pairs" and "Python count word pairs" did not help me, but they may be useful references for others.
Thank you.
First, modify your initial text list from:
text <- c(crude[[1]][1], crude[[2]][2], crude[[3]][3])
to:
text <- c(crude[[1]][1], crude[[2]][1], crude[[3]][1])
Then, you can go on with your text cleaning (note that your method will create ill-formed words like "oilcanadian", but it will suffice for the example at hand):
text <- tolower(text)
text <- tm::removeNumbers(text)
text <- str_replace_all(text, "  ", " ")
text <- str_replace_all(text, pattern = "[[:punct:]]", " ")
text <- removeWords(text, stopwords(kind = "SMART"))
Build a new Corpus:
v <- Corpus(VectorSource(text))
Create a bigram tokenizer function:
BigramTokenizer <- function(x) {
unlist(
lapply(ngrams(words(x), 2), paste, collapse = " "),
use.names = FALSE
)
}
Create your TermDocumentMatrix using the control parameter tokenize:
tdm <- TermDocumentMatrix(v, control = list(tokenize = BigramTokenizer))
Now that you have your new tdm, to get your desired output, you could do:
library(dplyr)
data.frame(inspect(tdm)) %>%
add_rownames() %>%
mutate(total = rowSums(.[,-1])) %>%
arrange(desc(total))
Which gives:
#Source: local data frame [272 x 5]
#
# rowname X1 X2 X3 total
#1 crude oil 2 0 1 3
#2 mln bpd 0 3 0 3
#3 oil prices 0 3 0 3
#4 cut contract 2 0 0 2
#5 demand opec 0 2 0 2
#6 dlrs barrel 2 0 0 2
#7 effective today 1 0 1 2
#8 emergency meeting 0 2 0 2
#9 oil companies 1 1 0 2
#10 oil industry 0 2 0 2
#.. ... .. .. .. ...
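If inspect() only prints the matrix on your tm version (so that data.frame(inspect(tdm)) does not work), a similar summary can be built from as.matrix() instead; a sketch assuming the tdm created above:
library(dplyr)
library(tibble)
as.data.frame(as.matrix(tdm)) %>%
  rownames_to_column("bigram") %>%
  mutate(total = rowSums(across(where(is.numeric)))) %>%
  arrange(desc(total)) %>%
  head(10)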
One idea here is to create a new corpus of bigrams:
A bigram or digram is every sequence of two adjacent elements in a string of tokens.
A recursive function to extract bigrams:
bigram <-
  function(xs) {
    if (length(xs) >= 2)
      c(paste(xs[seq(2)], collapse = '_'), bigram(tail(xs, -1)))
  }
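For instance, on a small token vector it returns the adjacent pairs joined by an underscore (a quick illustration):
bigram(c("crude", "oil", "prices", "rose"))
# [1] "crude_oil"   "oil_prices"  "prices_rose"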
Then apply this to the crude data from the tm package. (I did some text cleaning here, but this step depends on the text.)
res <- unlist(lapply(crude,function(x){
x <- tm::removeNumbers(tolower(x))
x <- gsub('\n|[[:punct:]]',' ',x)
x <- gsub(' +', ' ', x)
## after cleaning, compute the bigram frequencies using table
freqs <- table(bigram(strsplit(x," ")[[1]]))
freqs[freqs>1]
}))
as.data.frame(tail(sort(res),5))
tail(sort(res), 5)
reut-00022.xml.hold_a 3
reut-00022.xml.in_the 3
reut-00011.xml.of_the 4
reut-00022.xml.a_futures 4
reut-00010.xml.abdul_aziz 5
The bigrams "abdul aziz" and "a futures" are the most common. You should reclean the data to remove (of, the,..). But this should be a good start.
Edit after OP comments:
In case you want to get bigram frequencies over the whole corpus, one idea is to compute the bigrams within the loop and then compute the frequencies on the combined result. I take the opportunity to add some better text cleaning.
res <- unlist(lapply(crude,function(x){
x <- removeNumbers(tolower(x))
x <- removeWords(x, words=c("the","of"))
x <- removePunctuation(x)
x <- gsub('\n|[[:punct:]]',' ',x)
x <- gsub(' +', ' ', x)
## after cleaning, extract the bigrams (frequencies are computed after the loop)
words <- strsplit(x," ")[[1]]
bigrams <- bigram(words[nchar(words)>2])
}))
library(data.table)
xx <- as.data.frame(table(res))
setDT(xx)[order(Freq)]
# res Freq
# 1: abdulaziz_bin 1
# 2: ability_hold 1
# 3: ability_keep 1
# 4: ability_sell 1
# 5: able_hedge 1
# ---
# 2177: last_month 6
# 2178: crude_oil 7
# 2179: oil_minister 7
# 2180: world_oil 7
# 2181: oil_prices 14
A questionnaire was passed to teachers to check their curriculum preferences. They had to choose 20 items from about 50 options.
The resulting data is a long list of choices of the following type:
Teacher ID, Question ID
I want to format it as a table with one row for each teacher and a column for each question, with the possible values 0 (not chosen) and 1 (chosen).
In pseudocode it would probably be something like this:
iterate list {
data [teacher_id] [question_id] = 0
}
Here is a sample data and the intended result:
a <- data.frame(
Case_ID = c(1,1,2,2,4,4),
Q_ID = c(3,5,5,8,2,6)
)
The intended result is:
res <- data.frame(
Case_ID = c(1,2,4),
Q_1 = c(0,0,0),
Q_2 = c(0,0,1),
Q_3 = c(1,0,0),
Q_4 = c(0,0,0),
Q_5 = c(1,1,0),
Q_6 = c(0,0,1),
Q_7 = c(0,0,0),
Q_8 = c(0,1,0)
)
Any help would be greatly appreciated.
Tnx
Hed
Returning a matrix and using matrix indexing to do the work:
m <- matrix(0, nrow=3, ncol=8)
rownames(m) <- c(1,2,4)
colnames(m) <- 1:8
idx <- apply(a, 2, as.character)
m[idx] <- 1
m
## 1 2 3 4 5 6 7 8
## 1 0 0 1 0 1 0 0 0
## 2 0 0 0 0 1 0 0 1
## 4 0 1 0 0 0 1 0 0
Note that you can think of a as a list of indices, which themselves reference which cells in a "master array" are TRUE.
Then if you have a master matrix, say res of all 0's, you can then tell R: "all of the elements that are referenced in a should be 1"
This is done below
First we create the "master matrix"
# identify the unique teacher ID's
teacherIDs <- unique(a$Case_ID)
# count how many teachers there are
numbTeachers <- length(teacherIDs)
# create the column names for the questions
colNames <- c(paste0("Q_", 1:50))
# dim names for matrix. Using T_id for the row names
dnames <- list(paste0("T_", teacherIDs),
colNames)
# create the matrix
res2 <- matrix(0, ncol=50, nrow=numbTeachers, dimnames=dnames)
Next we convert a to a set of indices.
Note that the first two lines below are only needed if there are teacher IDs that are not present; e.g. in your example, T_3 is not present.
# create index out of a
indx <- a
indx$Case_ID <- as.numeric(as.factor(indx$Case_ID))
indx <- as.matrix(indx)
# populate those in a with 1
res2[indx] <- 1
res2
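As a side note, a similar 0/1 layout can also be produced directly with table() on the long data; a small sketch using the a data frame above (the columns are the raw question IDs rather than the Q_ names of the intended result):
# cross-tabulate teacher by question over the full range of question IDs
tab <- table(factor(a$Case_ID), factor(a$Q_ID, levels = 1:8))
tab
##      1 2 3 4 5 6 7 8
##   1  0 0 1 0 1 0 0 0
##   2  0 0 0 0 1 0 0 1
##   4  0 1 0 0 0 1 0 0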