SyntaxNet: Can I use my own POS tags?

I have a large corpus and I would like to train my own SyntaxNet model with it. Training the POS tagger yields 0% accuracy, and I believe it's because my own corpus uses different POS tags. The training generates a tagged corpus using tags like "NOUN", "DET", "ADV", and so on, while the original annotation doesn't have tags like these. Instead, it has tags like "A", "N", "T", and so on.
How can I use my own corpus to train the POS tagger?
On what basis does the tagger assign those POS tags to the words in this case?
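One possible workaround (a sketch only, not SyntaxNet-specific advice) is to remap the corpus's own tag set onto the tag set the training pipeline expects before training. In the hypothetical R sketch below, the file name, the tag column position (4), and the mapping itself ("A" to "ADJ", etc.) are all assumptions about the corpus format:
# Hypothetical remapping of a custom tag set onto Universal POS tags
tag_map <- c(A = "ADJ", N = "NOUN", T = "DET")    # extend for the real tag set
lines <- readLines("train.conll")                 # hypothetical CoNLL-style training file
remapped <- vapply(lines, function(l) {
  if (l == "" || startsWith(l, "#")) return(l)    # keep blank and comment lines untouched
  cols <- strsplit(l, "\t", fixed = TRUE)[[1]]
  if (length(cols) >= 4 && cols[4] %in% names(tag_map)) {
    cols[4] <- tag_map[[cols[4]]]                 # swap the custom tag for its mapped tag
  }
  paste(cols, collapse = "\t")
}, character(1), USE.NAMES = FALSE)
writeLines(remapped, "train_remapped.conll")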

Related

Create a Document Frequency Matrix in R

I am attempting to create a document frequency matrix in R.
I currently have a dataframe (df_2), which is made up of 2 columns:
doc_num: which details which document each term is coming from
text_token: which contains each tokenized word relating to each document.
The df's dimensions are 79,447 * 2.
However, there are only 400 actual documents in the 79,447 rows.
I have been trying to create this dfm using the tm package.
I have tried creating a corpus (VectorSource) and then attempting to coerce that into a dfm using the appropriately named dfm() command.
However, this returns the error "dfm() only works on character, corpus, dfm, tokens objects."
I understand my data isn't currently in the correct format for the dfm command to work.
My issue is that I don't know how to get from my current point to a matrix as appears below.
Example of what I would like the matrix to look like when complete:
Where 2 is the number of times cat appears in doc_2.
Any help on this would be greatly appreciated.
Yours sincerely.
It will be useful for you and others if all pertinent details are made available with your code - such as the fact that dfm() comes from the quanteda package.
If the underlying text is set up correctly, dfm() will directly give you what you are looking for - that is precisely what it is designed for.
Here is a simulation:
library(tm)
library(quanteda)
# install.packages("readtext")
library(readtext)
doc1 <- "COVID-19 can be beaten if all ensure social distance, social distance is critical"
doc2 <- "COVID-19 can be defeated through early self isolation, self isolation is your responsibility"
doc3 <- "Corona Virus can be beaten through early detection & slowing of spread, Corona Virus can be beaten, Yes, Corona Virus can be beaten"
doc4 <- "Corona Virus can be defeated through maximization of social distance"
# write the four documents as .txt files; they land in your working directory (check with getwd())
write.table(doc1, "doc1.txt", sep = "\t", row.names = FALSE, col.names = FALSE)
write.table(doc2, "doc2.txt", sep = "\t", row.names = FALSE, col.names = FALSE)
write.table(doc3, "doc3.txt", sep = "\t", row.names = FALSE, col.names = FALSE)
write.table(doc4, "doc4.txt", sep = "\t", row.names = FALSE, col.names = FALSE)
# read the files back in and build the document-feature matrix
txt <- readtext(paste0(getwd(), "/doc*.txt"))
txt
corp <- corpus(txt)
x <- dfm(corp)   # newer quanteda versions prefer dfm(tokens(corp))
View(x)
If the issue is one of formatting/cleaning your data so that you can run dfm(), then you need to post a new question that provides the necessary details on your data.
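If the data is already tokenized into a two-column data frame like df_2, the dfm can also be built without writing files to disk. Here is a minimal sketch (not from the original answer) assuming columns named doc_num and text_token, using quanteda's as.tokens():
library(quanteda)
# Hypothetical stand-in for df_2: one row per token, labelled with its document
df_2 <- data.frame(doc_num    = c("doc_1", "doc_1", "doc_2", "doc_2", "doc_2"),
                   text_token = c("the", "cat", "cat", "sat", "down"),
                   stringsAsFactors = FALSE)
tok_list <- split(df_2$text_token, df_2$doc_num)   # one character vector of tokens per document
toks <- as.tokens(tok_list)                        # quanteda tokens object
x <- dfm(toks)                                     # documents in rows, terms in columns
x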

Text Mining: Getting a Sentence-Term Matrix

I'm currently running into trouble finding anything relevant to creating a sentence-term matrix in R using text mining.
I'm using the tm package and the only thing that I can find is converting to a tdm or dtm.
I'm using only one Excel file, and I'm only interested in text mining one column of it. That column has about 1,200 rows. I want to create a row (sentence)-term matrix that tells me the frequency of words in each row (sentence).
I want to create a matrix of 1's and 0's that I can run a PCA analysis on later.
A dtm in my case is not helpful because, since I'm only using one file, the number of rows is 1 and the columns are the frequencies of words in that whole document.
Instead, I want to treat the sentences as documents, if that makes sense. From there, I want a matrix with the frequency of words in each sentence.
Thank you!
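Since the question already uses tm, one direct route (a sketch, not one of the original answers) is to treat each row of the column as its own document via VectorSource; the resulting DocumentTermMatrix then has one row per sentence, and turning the counts into 1's and 0's is a one-liner. The sentences vector below is a placeholder for the Excel column:
library(tm)
# Placeholder standing in for the ~1,200-row column read from the Excel file
sentences <- c("the coffee is warm", "the coffee is cold", "the coffee is hot")
corp <- VCorpus(VectorSource(sentences))   # one document per row/sentence
dtm <- DocumentTermMatrix(corp)            # rows = sentences, columns = terms
m <- as.matrix(dtm)                        # term frequencies per sentence
binary <- (m > 0) * 1                      # 1/0 indicator matrix for later PCA-style use
binary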
When using text2vec you just need to feed the content of your column as a character vector into the tokenizer function - see the example below.
Concerning your downstream analysis, I would not recommend running PCA on count data / integer values; PCA is not designed for this kind of data. You should either apply normalization, tf-idf weighting, etc. to your dtm to turn it into continuous data before feeding it to PCA, or otherwise apply correspondence analysis instead.
library(text2vec)
library(magrittr)  # for the %>% pipe
docs <- c("the coffee is warm",
          "the coffee is cold",
          "the coffee is hot",
          "the coffee is warm",
          "the coffee is hot",
          "the coffee is perfect")
# Generate a document-term matrix with text2vec, one "document" per sentence
tokens <- docs %>% word_tokenizer()
it <- itoken(tokens,
             ids = paste0("sent_", 1:length(docs)),
             progressbar = FALSE)
vocab <- create_vocabulary(it)
vectorizer <- vocab_vectorizer(vocab)
dtm <- create_dtm(it, vectorizer, type = "dgTMatrix")
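Following up on the weighting point above, a minimal sketch of tf-idf weighting with text2vec's TfIdf transformer, applied to the dtm built above, before any PCA-style analysis:
# tf-idf weighting turns the raw integer counts into continuous values
tfidf <- TfIdf$new()
dtm_tfidf <- fit_transform(dtm, tfidf)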
With the corpus library:
library(corpus)
library(Matrix)
corpus <- federalist # sample data
x <- term_matrix(text_split(corpus, "sentences"))
Although, in your case, it sounds like you already split the text into sentences. If that is true, then there is no need for the text_split call; just do
x <- term_matrix(data$your_column_with_sentences)
(replacing data$your_column_with_sentences with whatever is appropriate for your data).
Can't add comments so here's a suggestion:
# Read data from file using fread (for .csv, from the data.table package)
library(data.table)
library(stringr)
dat <- fread(filename)  # add parameters as needed: col.names, nrows, etc.
counts <- sapply(row_start:row_end,
                 function(z) str_count(dat$selected_col_name[z], "the"))
This will give you all occurrences of "the" in the column of interest for the selected rows. You could also use apply if it's for all rows, or other nested functions for different variations. Bear in mind that you would need to check for lowercase/uppercase letters - you can use tolower to achieve that. Hope this is helpful!
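For example, a case-insensitive count over every row of the (placeholder) column might look like this:
# tolower() makes the match case-insensitive; selected_col_name is still a placeholder
counts_all <- str_count(tolower(dat$selected_col_name), "the")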

Adding metadata to STM in R

I am having trouble with the STM package in R. I have built a corpus in Quanteda and I want to convert it into the STM format. I have saved the metadata as an independent CSV file and I want code that merges the text documents with the metadata. The readCorpus() and convert() functions do not automatically add the metadata information to the corpus.
This is what it looks like in Quanteda:
EUdocvars <- read.csv("EU_metadata.csv", stringsAsFactors = FALSE)
EUdocvars$Period <- as.factor(EUdocvars$Period)
EUdocvars$Country <-as.factor(EUdocvars$Country)
EUdocvars$Region <- as.factor(EUdocvars$Region)
EUCorpus <- corpus(textfile(file='PROJECT/*.txt'), encodingFrom = "UTF-8-BOM")
docvars(EUCorpus) <- EUdocvars
EUDfm <- dfm(EUCorpus)
Is there a way to do the same thing using the STM package?
Support for this was added just recently (v0.99), after addressing https://github.com/kbenoit/quanteda/issues/209.
So this should work:
EUstm <- convert(EUDfm, to = "stm", docvars = docvars(EUCorpus))
And then EUstm has all of the elements, including meta, that you need for fitting STM models.
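From there, fitting a model might look like the sketch below; K = 20 and the prevalence formula are purely illustrative, reusing the Period and Country covariates from the metadata above:
library(stm)
# Illustrative fit: 20 topics, topic prevalence varying with Period and Country
fit <- stm(documents = EUstm$documents,
           vocab = EUstm$vocab,
           data = EUstm$meta,
           prevalence = ~ Period + Country,
           K = 20)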
The stm object (a list) has an element called $meta, which takes a data frame of dimensions (number of documents) × (number of covariates). So for your problem:
EUCorpus$meta <- EUdocvars

Defining synonyms within a corpus of Documents using R

I have a corpus of documents on a very specific topic (e.g. sports/athletics). Within that corpus, I would like to define synonyms myself. The reason I want to define synonyms myself is that sometimes, given two words, the synonyms() function in the WordNet package does not recognise them as synonyms, but within the text they can be interpreted as such (for example, "fit" and "strong").
My idea is to use word associations with bigrams and trigrams and define a synonym when words appear frequently in a phrase and have similar semantic content. For example, using the crude dataset within the tm package I would do something like:
library(tm)
library(RWeka)
data(crude)
options(mc.cores = 1)
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
crudetdm <- TermDocumentMatrix(crude, control = list(stripWhitespace = TRUE,
                                                     removePunctuation = TRUE,
                                                     removeNumbers = TRUE,
                                                     stopwords = TRUE,
                                                     removeSparseTerms = TRUE,
                                                     tokenize = BigramTokenizer))
ListAssoc <- lapply(crudetdm$dimnames$Terms, function(x) findAssocs(crudetdm, x, 0.9))
However, this returns (as expected) bigrams associated with bigrams, while my idea would be to find individual words associated with the bigrams in crudetdm$dimnames$Terms (the same exercise with trigrams would be the next step). For example, using bigrams and the crude dataset, the ideal scenario would be ending up with a data.frame like:
Bigram          Associated Words
oil companies   policies, marketing, prices, measures, market, revenue...
Then I would go through the table myself and manually select those words that I believe can be considered synonyms in my dataset (my dataset is not that big). I can think of some workarounds, such as defining multiple data.frames of bigrams and trigrams and matching common words. However, I am sure there is a more elegant and efficient way of doing this in R.
Overall, my question is: given a series of bigrams and trigrams, how can I find individual words that are associated with them?
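One possible approach (a sketch, not from the original thread) is to build a single TermDocumentMatrix that mixes unigrams and bigrams (NGramTokenizer with min = 1, max = 2); findAssocs() on a bigram term then also returns individual words among its associations:
library(tm)
library(RWeka)
data(crude)
# Mixed tokenizer: unigrams and bigrams end up in the same term-document matrix
MixedTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 2))
mixedtdm <- TermDocumentMatrix(crude, control = list(removePunctuation = TRUE,
                                                     removeNumbers = TRUE,
                                                     stopwords = TRUE,
                                                     tokenize = MixedTokenizer))
# Associations for one bigram of interest; single words now appear among the results
findAssocs(mixedtdm, "oil companies", 0.9)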

tm combine list of corpora

I have a list of URLs for which I have fetched the web content and included it in tm corpora:
library(tm)
library(XML)
link <- c(
"http://www.r-statistics.com/tag/hadley-wickham/",
"http://had.co.nz/",
"http://vita.had.co.nz/articles.html",
"http://blog.revolutionanalytics.com/2010/09/the-r-files-hadley-wickham.html",
"http://www.analyticstory.com/hadley-wickham/"
)
create.corpus <- function(url.name){
  doc <- htmlParse(url.name)
  parag <- xpathSApply(doc, '//p', xmlValue)
  if (length(parag) == 0){
    parag <- "empty"
  }
  cc <- Corpus(VectorSource(parag))
  meta(cc, "link") <- url.name
  return(cc)
}
link=catch$url
cc <- lapply(link, create.corpus)
This gives me a "large list" of corpora, one for each URL.
Combining them one by one works:
x=cc[[1]]
y=cc[[2]]
z=c(x,y,recursive=T) # preserved metadata
x;y;z
# A corpus with 8 text documents
# A corpus with 2 text documents
# A corpus with 10 text documents
But this becomes unfeasible for a list with a few thousand corpora.
So how can a list of corpora be merged into one corpus while maintaining the meta data?
You can use do.call to call c:
do.call(function(...) c(..., recursive = TRUE), cc)
# A corpus with 155 text documents
I don't think that tm offers any built-in function to join/merge many corpora. But after all, a corpus is a list of documents, so the question is how to transform a list of lists into a list. I would create a new corpus from all documents, then assign the metadata manually:
y <- Corpus(VectorSource(unlist(cc)))
meta(y, 'link') <- do.call(rbind, lapply(cc, meta))$link
Your code does not work because catch is not defined, so I don't know exactly what that is supposed to do.
But nowadays tm corpora can simply be combined into one big corpus: https://www.rdocumentation.org/packages/tm/versions/0.7-1/topics/tm_combine
So maybe c(unlist(cc)) would work. I have no way to test whether that works, though, because your code doesn't run.
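For what it's worth, with a recent tm release the whole list can usually be combined directly with do.call and c (a sketch under that assumption; how document-level metadata is carried over may differ between versions):
# cc is the list of corpora built above; c() on corpora concatenates them
combined <- do.call(c, cc)
combined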

Resources