Good afternoon,
I am trying to sort a large corpus of normative texts of different lengths and to tag their parts of speech (POS). For that purpose, and given the size of the database, I have been using the tm and udpipe libraries.
The other task I need to perform is to identify entities. I tried the spacyr library, but it does not correctly identify the names of organizations, so I want to train a custom NER model based on a few documents from the corpus which I have personally validated.
How could I use spacy_extract_entity() with custom data? Or maybe with quanteda and spacyr?
Thanks in advance.
I have done the POS task in this way. I generated a couple of functions.
suppressMessages(suppressWarnings(library(pdftools)))
suppressMessages(suppressWarnings(library(tidyverse)))
suppressMessages(suppressWarnings(library(tm)))
# load the corpus
tm_corpus <- VCorpus(
  DirSource("working_path", pattern = ".pdf"),
  readerControl = list(reader = readPDF, language = "es-419"))
# load udpipe
library(udpipe)
dl <- udpipe_download_model(language = "spanish", overwrite = FALSE)
str(dl)
udmodel_spanish <- udpipe_load_model(file = dl$file_model)
# functions to annotate the corpus
f_udpipe_anot <- function(n){
  # flatten the n-th document of the tm corpus into a character vector
  txt <- as.character(tm_corpus[[n]]) %>%
    unlist()
  y <- udpipe_annotate(udmodel_spanish, x = txt, trace = TRUE)
  as.data.frame(y)
}
pinkillazo <- function(desde, hasta){
  # annotate documents `desde` (from) through `hasta` (to) and stack the results
  resultado <- data.frame()
  for (item in desde:hasta){
    print(item)
    resultado <- rbind(resultado, f_udpipe_anot(item))
  }
  return(resultado)
}
leyes_udpipe_POS <- pinkillazo(1,13) # here I got the annotated corpus as a dataframe
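(An equivalent sketch of the same call, assuming the functions above, that avoids growing the data.frame on every loop iteration:)
# same result as pinkillazo(1, 13), but rbind is called only once at the end
leyes_udpipe_POS <- do.call(rbind, lapply(1:13, f_udpipe_anot))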
To identify the named entities, I have tried this:
library(quanteda)
library(spacyr)
spacy_initialize(model = "es_core_news_sm")
quan_corpus <- corpus(tm_corpus)
POS_df_spacyr <- spacy_parse(quan_corpus, lemma = FALSE, entity = TRUE, tag = FALSE, pos = TRUE)
organiz <- spacy_extract_entity(
  quan_corpus,
  output = "data.frame",
  type = "all",
  multithread = TRUE
)
I am getting wrong organization names as well as other misidentifications. I thought multithread would at least make the task easier, but that is not the case.
If you want to train your own named entity recognition model in R, you could use the R packages crfsuite and nametagger, which implement Conditional Random Fields and Maximum Entropy models respectively and can be used alongside the udpipe annotation.
If you want deep learning models, you might have to look into torch alongside tokenisers like sentencepiece and embedding techniques like word2vec to implement your own modelling flow (e.g. BiLSTM).
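For the crfsuite route, here is a minimal sketch (not a drop-in solution) of training a CRF entity tagger on top of a udpipe-style annotation data.frame such as the one built above. The chunk_entity column with hand-validated BIO labels, the attribute choice and the train/test split are all illustrative assumptions:
library(crfsuite)
# `anno` stands for a udpipe annotation data.frame (doc_id, token, lemma, upos) to which
# a manually validated column `chunk_entity` with BIO labels ("B-ORG", "I-ORG", "O", ...)
# has been added for the validated documents.
attribs  <- as.matrix(anno[, c("token", "lemma", "upos")])   # simple per-token attributes
is_train <- anno$doc_id %in% head(unique(anno$doc_id), 10)   # illustrative split
model <- crf(y = anno$chunk_entity[is_train],
             x = attribs[is_train, ],
             group = anno$doc_id[is_train],
             method = "lbfgs")
# tag the held-out documents and attach the predicted entity labels
scores <- predict(model,
                  newdata = attribs[!is_train, ],
                  group = anno$doc_id[!is_train])
anno$entity_pred <- NA
anno$entity_pred[!is_train] <- scores$label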
I'm currently trying to perform a sentiment analysis on a kwic object, but I'm afraid that the kwic() function does not return all rows it should return. I'm not quite sure what exactly the issue is which makes it hard to post a reproducible example, so I hope that a detailed explanation of what I'm trying to do will suffice.
I subsetted the original dataset containing speeches I want to analyze to a new data frame that only includes speeches mentioning certain keywords. I used the following code to create this subset:
ostalgie_cluster <- full_data %>%
filter(grepl('Schwester Agnes|Intershop|Interflug|Trabant|Trabi|Ostalgie',
speechContent,
ignore.case = TRUE))
The resulting data frame consists of 201 observations. When I perform kwic() on the same initial dataset using the following code, however, it returns a data frame with only 82 observations. Does anyone know what might cause this? Again, I'm sorry I can't provide a reproducible example, but when I try to create a reprex from scratch it just.. works...
#create quanteda corpus object
qtd_speeches_corp <- corpus(full_data,
docid_field = "id",
text_field = "speechContent")
#tokenize speeches
qtd_tokens <- tokens(qtd_speeches_corp,
remove_punct = TRUE,
remove_numbers = TRUE,
remove_symbols = TRUE,
padding = FALSE) %>%
tokens_remove(stopwords("de"), padding = FALSE) %>%
tokens_compound(pattern = phrase(c("Schwester Agnes")), concatenator = " ")
ostalgie_words <- c("Schwester Agnes", "Intershop", "Interflug", "Trabant", "Trabi", "Ostalgie")
test_kwic <- kwic(qtd_tokens,
pattern = ostalgie_words,
window = 5)
It's hard to be sure without a reproducible example (namely your input full_data), but here's my best guess. Your kwic() call is using the default "glob" pattern matching, and what you want is a regular expression match instead.
Fix it this way:
kwic(qtd_tokens, pattern = ostalgie_words, valuetype = "regex",
     window = 5)
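The difference matters because grepl() searches for the pattern anywhere inside the raw speech text, whereas kwic() with the default "glob" valuetype only matches whole tokens. A toy illustration (made-up text, not your data):
library(quanteda)
toy <- tokens("Die Kundin kaufte im Intershop-Laden ein")
# glob (default): "Intershop" must match an entire token, so the hyphenated token is missed
kwic(toy, pattern = "Intershop", window = 2)
# regex: the pattern may match inside a token, much closer to grepl() on the raw text
kwic(toy, pattern = "Intershop", valuetype = "regex", window = 2)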
I would like to keep only those 2-3 word phrases (i.e. features) in my dfm that have a PMI value greater than 3x the number of words in the phrase*.
PMI is hereby defined as: pmi(phrase) = log( p(phrase) / Product(p(word)) )
with
p(phrase): the probability of the phrase based on its relative frequency
Product(p(word)): the product of the probabilities of each word in the phrase.
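For instance, a toy calculation of this definition (the numbers are made up and not taken from the data below):
# suppose "i love" occurs 3 times among N = 20 tokens, "i" 4 times and "love" 3 times
p_phrase <- 3 / 20
p_words  <- (4 / 20) * (3 / 20)
log(p_phrase / p_words)   # pmi("i love") = log(5), about 1.61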
Thus far I have used the following code; however, the PMI values do not seem to be correct and I am not able to find the issue:
#creating dummy data
id <- c(1:5)
text <- c("positiveemoticon my name is positiveemoticon positiveemoticon i love you", "hello dont", "i love you", "i love you", "happy birthday")
ids_text_clean_test <- data.frame(id, text)
ids_text_clean_test$id <- as.character(ids_text_clean_test$id)
ids_text_clean_test$text <- as.character(ids_text_clean_test$text)
test_corpus <- corpus(ids_text_clean_test[["text"]], docnames = ids_text_clean_test[["id"]])
tokens_all_test <- tokens(test_corpus, remove_punct = TRUE)
## Create a document-feature matrix(dfm)
doc_phrases_matrix_test <- dfm(tokens_all_test, ngrams = 2:3) #extracting two- and three word phrases
doc_phrases_matrix_test
# calculating the pointwise mutual information for each phrase to identify phrases that occur at rates much higher than chance
tcmrs = Matrix::rowSums(doc_phrases_matrix_test) #number of words per user
tcmcs = Matrix::colSums(doc_phrases_matrix_test) #counts of each phrase
N = sum(tcmrs) #number of total words used
colp = tcmcs/N #proportion of the phrases by total phrases
rowp = tcmrs/N #proportion of each users' words used by total words used
pp = doc_phrases_matrix_test@p + 1
ip = doc_phrases_matrix_test@i + 1
tmpx = rep(0, length(doc_phrases_matrix_test@x)) # new values go here, just a numeric vector
# iterate through the sparse matrix column by column:
for (i in 1:(length(doc_phrases_matrix_test@p) - 1)) {
  ind = pp[i]:(pp[i + 1] - 1)
  not0 = ip[ind]
  icol = doc_phrases_matrix_test@x[ind]
  tmp = log((icol/N) / (rowp[not0] * colp[i])) # PMI
  tmpx[ind] = tmp
}
doc_phrases_matrix_test@x = tmpx
doc_phrases_matrix_test
I believe the PMI should not vary within one phrase across users, but I thought it would be easier to apply the PMI to the dfm directly, so that it is easier to subset it based on the features' PMI.
An alternative approach I tried is to apply the PMI to the features directly:
test_pmi <- textstat_keyness(doc_phrases_matrix_test, measure = "pmi",
sort = TRUE)
test_pmi
However, firstly, I am getting a warning that NaNs were produced, and secondly, I don't understand the PMI values (e.g. why are there negative values)?
Does anyone have a better idea how to extract features based on their PMI values as defined above?
Any hint is highly appreciated :)
*following Park et al. (2015)
You can use the following R code, which uses the udpipe R package, to get what you are asking for. The example runs on a tokenised data.frame that ships with udpipe:
library(udpipe)
data(brussels_reviews_anno, package = "udpipe")
x <- subset(brussels_reviews_anno, language %in% "fr")
## find keywords with PMI > 3
keyw <- keywords_collocation(x, term = "lemma",
group = c("doc_id", "sentence_id"), ngram_max = 3, n_min = 10)
keyw <- subset(keyw, pmi > 3)
## recodes to keywords
x$term <- txt_recode_ngram(x$lemma, compound = keyw$keyword, ngram = keyw$ngram)
## create DTM
dtm <- document_term_frequencies(x = x$term, document = x$doc_id)
dtm <- document_term_matrix(dtm)
If you want to get a dataset in a structure similar to x, just use udpipe(text, "english") or any language of your choice. If you want to use quanteda for tokenisation, you can still get the result into a nicely enriched data.frame; examples of this are given here and here. Look at the help of the udpipe R package; it has many vignettes (?udpipe).
Note that while PMI is useful, it is often even more useful to use the dependency parsing output of the udpipe R package. If you look at the dep_rel field, you will find categories which identify multi-word expressions (e.g. the dep_rel values fixed/flat/compound mark multi-word expressions as defined at http://universaldependencies.org/u/dep/index.html); you could also use these to put multi-word terms into your document/term matrix, as in the sketch below.
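A rough sketch of that idea (the example sentence is made up; udpipe() will download the English model on first use):
library(udpipe)
anno <- udpipe("Mary Poppins works at the United Nations in New York.", "english")
# tokens attached to their head via fixed/flat/compound are parts of multi-word expressions
mwe_parts <- subset(anno, dep_rel %in% c("fixed", "flat", "compound"))
mwe_parts[, c("doc_id", "token", "dep_rel", "head_token_id")]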
I am using the R text2vec package for creating a document-term matrix. Here is my code:
library(lime)
library(text2vec)
# load data
data(train_sentences, package = "lime")
#
tokens <- train_sentences$text %>%
word_tokenizer
it <- itoken(tokens, progressbar = FALSE)
stop_words <- c("in","the","a","at","for","is","am") # stopwords
vocab <- create_vocabulary(it, c(1L, 2L), stopwords = stop_words) %>%
prune_vocabulary(term_count_min = 10, doc_proportion_max = 0.2)
vectorizer <- vocab_vectorizer(vocab )
dtm <- create_dtm(it , vectorizer, type = "dgTMatrix")
Another method is hash_vectorizer() instead of vocab_vectorizer() as:
h_vectorizer <- hash_vectorizer(hash_size = 2 ^ 10, ngram = c(1L, 2L))
dtm <- create_dtm(it,h_vectorizer)
But when I use hash_vectorizer, there is no option for stopword removal or vocabulary pruning. In one of my case studies, hash_vectorizer works better than vocab_vectorizer for me. I know one can remove stopwords after creating the dtm or even when creating the tokens. Are there any other options, similar to the way the vocab_vectorizer pipeline is created? In particular, I am interested in a method that also supports pruning the vocabulary, similar to prune_vocabulary().
I appreciate your responses.
Thanks, Sam
This is not possible. The whole point of using hash_vectorizer and feature hashing is to avoid hashmap lookups (getting the index of a given word). Removing stop-words is essentially that very lookup: checking whether a word is in the set of stop-words.
Usually it is recommended to use hash_vectorizer only if your dataset is very big and building the vocabulary takes a lot of time/memory. Otherwise, in my experience, vocab_vectorizer with prune_vocabulary will perform at least as well.
Also, if you use hash_vectorizer with a small hash_size, it acts as a dimensionality reduction step and hence can reduce variance for your dataset. So if your dataset is not very big, I suggest using vocab_vectorizer and playing with the prune_vocabulary parameters to reduce the vocabulary and document-term matrix size.
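If you nevertheless want to stay with hashing, a workaround (a sketch, not a hash_vectorizer() feature) is to drop the stop-words from the token lists before building the iterator; pruning by frequency, however, really does require a vocabulary:
library(text2vec)
data(train_sentences, package = "lime")
stop_words <- c("in", "the", "a", "at", "for", "is", "am")
tokens <- word_tokenizer(train_sentences$text)
tokens <- lapply(tokens, function(x) x[!x %in% stop_words])  # remove stop-words per document
it <- itoken(tokens, progressbar = FALSE)
h_vectorizer <- hash_vectorizer(hash_size = 2^10, ngram = c(1L, 2L))
dtm <- create_dtm(it, h_vectorizer)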
I am having trouble with the STM package in R. I have built a corpus in quanteda and I want to convert it into the STM format. I have saved the metadata as an independent CSV file and I want code that merges the text documents with the metadata. The readCorpus() and convert() functions do not automatically add the metadata information to the corpus.
This is what it looks like in quanteda:
EUdocvars <- read.csv("EU_metadata.csv", stringsAsFactors = FALSE)
EUdocvars$Period <- as.factor(EUdocvars$Period)
EUdocvars$Country <-as.factor(EUdocvars$Country)
EUdocvars$Region <- as.factor(EUdocvars$Region)
EUCorpus <- corpus(textfile(file='PROJECT/*.txt'), encodingFrom = "UTF-8-BOM")
docvars(EUCorpus) <- EUdocvars
EUDfm <- dfm(EUCorpus)
Is there a way to do the same thing using the STM package?
Support for this was added just recently (v0.99), after addressing https://github.com/kbenoit/quanteda/issues/209.
So this should work:
EUstm <- convert(EUDfm, to = "stm", docvars = docvars(EUCorpus))
And then EUstm has all of the elements including meta that you need for fitting STM models.
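For example, a minimal sketch of fitting a model on the converted object (K and the covariate formula are illustrative choices, not recommendations):
library(stm)
EUfit <- stm(documents = EUstm$documents,
             vocab = EUstm$vocab,
             data = EUstm$meta,
             prevalence = ~ Period + Country,
             K = 20)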
The stm object (a list) has an element called $meta which takes a data frame of dimensions number of documents x number of covariates. So for your problem, after converting, you can also assign it directly:
EUstm$meta <- EUdocvars
I could not find any previous questions posted on this, so perhaps you can help.
What is a good way to aggregate data in a tm corpus based on metadata (e.g. aggregate texts of different writers)?
There are at least two obvious ways it could be done:
A built-in function in tm, that would allow a DocumentTermMatrix to be built on a metadata feature. Unfortunately I haven't been able to uncover this.
A way to join documents within a corpus based on some external metadata in a table. It would just use metadata to replace document-ids.
So you would have a table that contains: DocumentId, AuthorName
And a tm-built corpus that contains a number of documents. I understand it is not difficult to introduce the table as metadata for the corpus object.
A matrix can be built with the following code.
library(tm) # version 0.6, you seem to be using an older version
corpus <-Corpus(DirSource("/directory-with-texts"),
readerControl = list(language="lat"))
metadata <- data.frame(DocID, Author)
#A very crude way to enter metadata into the corpus (assumes the same sequence):
for (i in 1:length(corpus)) {
  attr(corpus[[i]], "Author") <- metadata$Author[i]
}
a_documenttermmatrix_by_DocId <- DocumentTermMatrix(corpus)
How would you build a matrix that shows term frequencies for each author, aggregating that author's documents, rather than one row per document? It would be useful to do this at this stage rather than in post-processing with only a few terms.
a_documenttermmatrix_by_Author <- ?
Many thanks!
A DocumentTermMatrix is really just a matrix with fancy dressing (a Simple Triplet Matrix from the slam package) that contains term frequencies for each term and document. Aggregating data from multiple documents by author is really just summing up the rows that belong to each author. Consider formatting the matrix as a standard R matrix and using standard subsetting / aggregating methods:
# Format the document term matrix as a standard matrix.
# The rownames of m become the document Id's
# The colnames of m become the individual terms
m <- as.matrix(dtm)
# Group rows (documents) by Author and aggregate column sums
# (word frequencies) for each author, resulting in a list of named vectors.
author.list <- by(m, metadata$Author, colSums)
# Format the list as a matrix: one row per author, one column per term
author.dtm <- matrix(unlist(author.list), nrow = length(author.list), byrow = TRUE)
# Add column names (term) and row names (author)
colnames(author.dtm) <- colnames(m)
rownames(author.dtm) <- names(author.list)
# View the resulting matrix
View(author.dtm[1:10, 1:10])
The resulting matrix will be a standard matrix where the rows are the Authors and the columns are the individual terms. You should be able to do whatever analysis you want at that point.
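If you only need the aggregation itself, base R's rowsum() gets there in one call (same m and metadata as above):
# sums the document rows of m within each author; rows = authors, columns = terms
author.dtm <- rowsum(m, group = metadata$Author)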
I have a very crude workaround for this if the corpus text can be found in a table. It does not help much with a large corpus already in 'tm' format, but it may be handy in other cases. Feel free to improve it, as it is very crude!
custom_term_matrix <- function(author_vector, text_vector)
{
  author_vector <- factor(author_vector)
  temp <- data.frame(Author = levels(author_vector))
  # concatenate all texts belonging to the same author into one document per author
  for (i in 1:length(temp$Author)){
    temp$Content[i] <- paste(c(as.character(text_vector[author_vector ==
      levels(author_vector)[i]])), sep = " ", collapse = " ")
  }
  m <- list(id = "Author", content = "Content")
  myReader <- readTabular(mapping = m)
  mycorpus <- Corpus(DataframeSource(temp), readerControl = list(reader = myReader))
  custom_matrix <<- DocumentTermMatrix(mycorpus, control =
    list(removePunctuation = TRUE))
}
There is probably a function internal to tm that I haven't been able to find, so I will be grateful for any help!