I have a dataframe like this
dtext <- data.frame(id = c(1,2,3,4), text = c("here","This dataset contains movie reviews along with their associated binary sentiment polarity labels. It is intended to serve as a benchmark for sentiment classification. This document outlines how the dataset was gathered, and how to use the files provided.", "The core dataset contains 50,000 reviews split evenly into 25k train and 25k test sets. The overall distribution of labels is balanced (25k pos and 25k neg). We also include an additional 50,000 unlabeled documents for unsupervised learning.", "There are two top-level directories [train/, test/] corresponding to the training and test sets. Each contains [pos/, neg/] directories for the reviews with binary labels positive and negative. Within these directories, reviews are stored in text files named following the convention [[id]_[rating].txt] where [id] is a unique id and [rating] is the star rating for that review on a 1-10 scale. For example, the file [test/pos/200_8.txt] is the text for a positive-labeled test set example with unique id 200 and star rating 8/10 from IMDb. The [train/unsup/] directory has 0 for all ratings because the ratings are omitted for this portion of the dataset."),stringsAsFactors = F)
I perform text cleaning for LDA with this:
library(quanteda)
library(topicmodels)
library(tidyverse)
toks <- tokens(dtext$text)
toks <- tokens_remove(toks, c(
stopwords("en"),
stringi::stri_replace_all_fixed(stopwords("en"), "'", "")
))
toks <- toks %>% tokens_wordstem()
myDfm <- dfm(toks, ngrams = c(2,3)) %>%
dfm_trim(min_termfreq = 0.75, termfreq_type = "quantile")
dtm <- convert(myDfm, to = "topicmodels")
lda <- LDA(dtm, k = 2, control = list(seed = 1234))
However, I noticed that when a row's text column contains nothing, the corresponding document is dropped from dtm.
gammaDF <- as.data.frame(lda@gamma)
toptopics <- as.data.frame(cbind(document = row.names(gammaDF),
topic = apply(gammaDF,1,function(x) names(gammaDF)[which(x==max(x))])))
However, this gives me a problem when I want to get the topic together with the related id from the original data frame. What can I do to get the right results?
id, topic
2 1
3 2
4 1
The problem here is that LDA() removes the rownames from your document-term matrix and replaces them with a simple serial number. This no longer corresponds to your original dtext$id. But you can replace the LDA id with the document name, and then link this back to your input text.
To make this more clear, we are first going to replace your dtext$id with something that can be more clearly distinguished from the serial number that LDA() returns.
# to distinguish your id from those from LDA()
dtext$id <- paste0("doc_", dtext$id)
# this takes the document name from "id"
toks <- corpus(dtext, docid_field = "id") %>%
tokens()
Then run your other steps exactly as above.
We can see that the first document is empty (has zero feature counts). This is the one that is dropped in the conversion of the dfm to the "topicmodels" format.
ntoken(myDfm)
## text1 text2 text3 text4
## 0 49 63 201
as.matrix(dtm[, 1:3])
## Terms
## Docs dataset_contain contain_movi movi_review
## text2 1 1 1
## text3 1 0 0
## text4 0 0 0
These document names are obliterated by LDA(), however.
toptopics
## document topic
## 1 1 V2
## 2 2 V2
## 3 3 V1
But we can (re)assign them from the rownames of dtm, which will correspond 1:1 to the documents returned by LDA().
toptopics$docname <- rownames(dtm)
toptopics
## document topic docname
## 1 1 V2 text2
## 2 2 V2 text3
## 3 3 V1 text4
And now, toptopics$docname can be merged with dtext$id, solving your problem.
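For completeness, here is a minimal sketch of that final merge. It assumes the corpus(dtext, docid_field = "id") step above was used, so that rownames(dtm) (and hence toptopics$docname) carry the doc_ prefixed ids rather than the default text1, text2, … names:

# a minimal sketch: join each surviving document's topic back onto dtext
result <- merge(dtext, toptopics[, c("docname", "topic")],
                by.x = "id", by.y = "docname")
result[, c("id", "topic")]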
You can grab the ids of any texts with 0 words prior to converting to a dtm using apply and which:
library(quanteda)
library(topicmodels)
library(tidyverse)
toks <- tokens(dtext$text)
toks <- tokens_remove(toks, c(
stopwords("en"),
stringi::stri_replace_all_fixed(stopwords("en"), "'", "")
))
toks <- toks %>% tokens_wordstem()
myDfm <- dfm(toks, ngrams = c(2,3)) %>%
dfm_trim(min_termfreq = 0.75, termfreq_type = "quantile")
removed <- which(apply(myDfm, 1, sum) == 0)
Result:
> removed
text1
1
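Since the tokens were built from dtext$text in row order, that index lines up with the rows of dtext, so (as a small sketch) you can pull out the dropped and surviving ids directly:

# a minimal sketch: map the empty-document index back to the original ids
dropped_ids <- dtext$id[removed]    # ids of texts with no remaining features
kept_ids    <- dtext$id[-removed]   # ids that survive, in the same order as the dtm rows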
The question is:
How do I count the words that are matched from the dictionary in each text?
As one can see, in text1 the words 火勢 and 撲滅 are matched from the dictionary.
I can count the characters by hand, yes, 4.
How can I count this in R?
Thanks!
Here's the code:
dict <- dictionary(list(season = c("火勢", "撲滅", "應變", "小組")))
txt <- ll$content03
season_dfm <- dfm(txt, dictionary = dict, verbose = FALSE)
dict_dfm <- dfm(txt, select = dict, verbose = FALSE)
cbind(season_dfm, dict_dfm)
Document-feature matrix of: 2 documents, 5 features (30.0% sparse).
features
docs season 火勢 撲滅 應變 小組
text1 2 1 1 0 0
text2 3 1 0 1 1
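If what you need is simply the total number of dictionary matches per document (the season column above already holds this), here is a small sketch that reuses the dict_dfm object from the code above:

# a minimal sketch: total dictionary hits per document
data.frame(doc = docnames(dict_dfm),
           n_dict_words = rowSums(dict_dfm))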
https://imgur.com/a/Z4rpWxs
I imported 720 sentences from this website (https://www.cs.columbia.edu/~hgs/audio/harvard.html). There are 72 lists (each list contains 10 sentences), and I saved them in an appropriate structure. I did those steps in R; the code is immediately below.
#Q.1a
library(xml2)
library(rvest)
url <- 'https://www.cs.columbia.edu/~hgs/audio/harvard.html'
sentences <- read_html(url) %>%
html_nodes("li") %>%
html_text()
headers <- read_html(url) %>%
html_nodes("h2") %>%
html_text()
#Q.1b
harvardList <- list()
sentenceList <- list()
n <- 1
for(sentence in sentences){
  sentenceList <- c(sentenceList, sentence)
  print(sentence)
  if(length(sentenceList) == 10) {              # if we have 10 sentences
    harvardList[[headers[n]]] <- sentenceList   # append those 10 sentences under the header of the list they came from
    sentenceList <- list()                      # empty the temporary list that the 10 sentences were collected into
    n <- n + 1                                  # move on to the next list name
  }
}
#Q.1c
sentences1 <- split(sentences, ceiling(seq_along(sentences)/10))
getwd()
setwd("/Users/juliayudkovicz/Documents/Homework 4 Datascience")
sentences.df <- do.call("rbind", lapply(sentences1, as.data.frame))
names(sentences.df)[1] <- "Sentences"
write.csv(sentences.df, file = "sentences1.csv", row.names = FALSE)
THEN, in PYTHON, I computed a list of all the words ending in "ing" and what their frequency was, aka, how many times they appeared across all 72 lists.
path="/Users/juliayudkovicz/Documents/Homework 4 Datascience"
os.chdir(path)
cwd1 = os.getcwd()
print(cwd1)
import pandas as pd
df = pd.read_csv(r'/Users/juliayudkovicz/Documents/Homework 4 Datascience/sentences1.csv', sep='\t', engine='python')
print(df)
df['Sentences'] = df['Sentences'].str.replace(".", "", regex=False)  # strip periods (literal match, not regex)
print(df)
sen_List = df['Sentences'].values.tolist()
print(sen_List)
ingWordList = []
for line in sen_List:
    for word in line.split():
        if word.endswith('ing'):
            ingWordList.append(word)
ingWordCountDictionary = {}
for word in ingWordList:
    word = word.replace('"', "")
    word = word.lower()
    if word in ingWordCountDictionary:
        ingWordCountDictionary[word] = ingWordCountDictionary[word] + 1
    else:
        ingWordCountDictionary[word] = 1
print(ingWordCountDictionary)
f = open("ingWordCountDictionary.txt", "w")
for key, value in ingWordCountDictionary.items():
    keyValuePairToWrite = "%s, %s\n" % (key, value)
    f.write(keyValuePairToWrite)
f.close()
Now, I am being asked to create a dataset which shows which list (1 of 72) each "ing" word comes from. THIS IS WHAT I DON'T KNOW HOW TO DO. I obviously know the words are a subset of the full set of 72 lists, but how do I figure out which list each word came from?
The expected output should look something like this:
[List Number] [-ing Word]
List 1 swing, ring, etc.,
List 2 moving
so and so forth
Here is one way for you. As far as I can tell from the expected result, you seem to want to get verbs in progressive forms (V-ing). (I do not understand why you have king in your result; if you have king, you should have spring here as well, for example.) If you need to consider lexical classes, I think you want to use the koRpus package. If not, you can use the textstem package, for example.
First, I scraped the link and created a data frame. Then, I split the sentences into words using unnest_tokens() in the tidytext package and subsetted the words ending with 'ing'. Then, I used treetag() in the koRpus package. You need to install TreeTagger yourself before you use the package. Finally, I counted how many times these verbs in progressive forms appear in the data set. I hope this will help you.
library(tidyverse)
library(rvest)
library(tidytext)
library(koRpus)
read_html("https://www.cs.columbia.edu/~hgs/audio/harvard.html") %>%
html_nodes("h2") %>%
html_text() -> so_list
read_html("https://www.cs.columbia.edu/~hgs/audio/harvard.html") %>%
html_nodes("li") %>%
html_text() -> so_text
# Create a data frame
sodf <- tibble(list_name = rep(so_list, each = 10),
text = so_text)
# Split senteces into words and get words ending with ING.
unnest_tokens(sodf, input = text, output = word) %>%
filter(grepl(x = word, pattern = "ing$")) -> sowords
# Use koRpus package to lemmatize the words in sowords$word.
treetag(sowords$word, treetagger = "manual", format = "obj",
TT.tknz = FALSE , lang = "en", encoding = "UTF-8",
TT.options = list(path = "C:\\tree-tagger-windows-3.2\\TreeTagger",
preset = "en")) -> out
# Access to the data frame and filter the words. It seems that you are looking
# for verbs. So I did that here.
filter(out@TT.res, grepl(x = token, pattern = "ing$") & wclass == "verb") %>%
count(token)
# A tibble: 16 x 2
# token n
# <chr> <int>
# 1 adding 1
# 2 bring 4
# 3 changing 1
# 4 drenching 1
# 5 dying 1
# 6 lodging 1
# 7 making 1
# 8 raging 1
# 9 shipping 1
#10 sing 1
#11 sleeping 2
#12 wading 1
#13 waiting 1
#14 wearing 1
#15 winding 2
#16 working 1
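If all you need is the [List Number] [-ing Word] table from your expected output (without filtering by word class), here is a small sketch that builds only on the sowords data frame created above:

# a minimal sketch: one row per Harvard list, its -ing words collapsed into a string
sowords %>%
  distinct(list_name, word) %>%
  group_by(list_name) %>%
  summarise(ing_words = paste(word, collapse = ", "))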
How did you store the data from the lists (i.e., what does your data.frame look like)? Could you provide an example?
Without seeing this, I suggest you save the data in a list as follows:
COLUMN 1 , COLUMN 2, COLUMN 3
"List number", "Sentence", "-ING words (as vector)"
I hope this makes sense, let me know if you need more help. I wasn't able to comment on this post unfortunately.
I use the quanteda R package to extract ngrams (here 1-grams and 2-grams) from the text Data_clean$Review, but I am looking for a way in R to compute the chi-square between the documents and the extracted ngrams.
Here is the R code that I use to clean up the text (the reviews) and generate the n-grams.
Any idea, please?
Thank you
#delete rows with empty value columns
Data_clean <- Data[Data$Note!="" & Data$Review!="",]
Data_clean$id <- seq.int(nrow(Data_clean))
train.index <- 1:50000
test.index <- 50001:nrow(Data_clean)
#clean up
# remove grammar/punctuation
Data_clean$Review.clean <- tolower(gsub('[[:punct:]0-9]', ' ', Data_clean$Review))
train <- Data_clean[train.index, ]
test <- Data_clean[test.index, ]
temp.tf <- Data_clean$Review.clean %>% tokens(ngrams = 1:2) %>% # generate tokens
dfm # generate dfm
You would not use ngrams for this, but rather a function called textstat_collocations().
It's a bit hard to follow your exact example since none of those objects are explained or supplied, but let's try it with some of quanteda's built-in data. I'll get the texts from the inaugural corpus and apply some filters similar to what you have above.
So to score bigrams for chi^2, you would use:
# create the corpus, subset on some conditions (could be Note != "" for instance)
corp_example <- data_corpus_inaugural
corp_example <- corpus_subset(corp_example, Year > 1960)
# this will remove punctuation and numbers
toks_example <- tokens(corp_example, remove_punct = TRUE, remove_numbers = TRUE)
# find and score chi^2 bigrams
coll2 <- textstat_collocations(toks_example, method = "chi2", max_size = 2)
head(coll2, 10)
# collocation count X2
# 1 reverend clergy 2 28614.00
# 2 Majority Leader 2 28614.00
# 3 Information Age 2 28614.00
# 4 Founding Fathers 3 28614.00
# 5 distinguished guests 3 28614.00
# 6 Social Security 3 28614.00
# 7 Chief Justice 9 23409.82
# 8 middle class 4 22890.40
# 9 Abraham Lincoln 2 19075.33
# 10 society's ills 2 19075.33
Added:
# needs to be a list of the collocations as separate character elements
coll2a <- sapply(coll2$collocation, strsplit, " ", USE.NAMES = FALSE)
# compound the tokens using top 100 collocations
toks_example_comp <- tokens_compound(toks_example, coll2a[1:100])
toks_example_comp[[1]][1:20]
# [1] "Vice_President" "Johnson" "Mr_Speaker" "Mr_Chief" "Chief_Justice"
# [6] "President" "Eisenhower" "Vice_President" "Nixon" "President"
# [11] "Truman" "reverend_clergy" "fellow_citizens" "we" "observe"
# [16] "today" "not" "a" "victory" "of"
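From there you can continue with a dfm as in your original pipeline; a minimal sketch using the compounded tokens built above:

# a minimal sketch: a dfm whose features now include the top-100 scored bigrams
dfm_comp <- dfm(toks_example_comp)
topfeatures(dfm_comp, 10)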
I have a big dataset (>1 million rows) and each row is a multi-sentence text. For example following is a sample of 2 rows:
mydat <- data.frame(text=c('I like apple. Me too','One two. Thank you'),stringsAsFactors = F)
What I am trying to do is extract the bigram terms in each row (the "." should act as a separator, so that no ngram spans across it). If I simply use the dfm function:
mydfm = dfm(mydat$text,toLower = T,removePunct = F,ngrams=2)
dtm = as.DocumentTermMatrix(mydfm)
txt_data = as.data.frame(as.matrix(dtm))
These are the terms I got:
"i_like" "like_apple" "apple_." "._me" "me_too" "one_two" "two_." "._thank" "thank_you"
These are What I expect, basically "." is skipped and used to separate the terms:
"i_like" "like_apple" "me_too" "one_two" "thank_you"
I believe writing slow loops could solve this as well, but given that it is a huge dataset I would prefer an efficient approach similar to dfm() in quanteda. Any suggestions would be appreciated!
@Jota's answer works, but there is a way to control the tokenisation more finely while calling it only once:
(toks <- tokenize(toLower(mydat$text), removePunct = TRUE, ngrams = 2))
## tokenizedText object from 2 documents.
## Component 1 :
## [1] "i_like" "like_apple" "apple_me" "me_too"
##
## Component 2 :
## [1] "one_two" "two_thank" "thank_you"
dfm(toks)
## Document-feature matrix of: 2 documents, 7 features.
## 2 x 7 sparse Matrix of class "dfmSparse"
## features
## docs i_like like_apple apple_me me_too one_two two_thank thank_you
## text1 1 1 1 1 0 0 0
## text2 0 0 0 0 1 1 1
Added:
Then, if the ngrams were instead formed with punctuation kept (call that tokenised object toks2), you can remove any ngram containing "." afterwards with the following, which defaults to valuetype = "glob":
removeFeatures(toks2, "*.*")
## tokenizedText object from 2 documents.
## Component 1 :
## [1] "i_like" "like_apple" "me_too"
##
## Component 2 :
## [1] "one_two" "thank_you"
If your goal is just to extract those bigrams, then you could use tokens twice. Once to tokenize to sentences, then again to make the ngrams for each sentence.
library("quanteda")
mydat$text %>%
tokens(what = "sentence") %>%
as.character() %>%
tokens(ngrams = 2, remove_punct = TRUE) %>%
as.character()
#[1] "I_like" "like_apple" "Me_too" "One_two" "Thank_you"
Insert a tokens_tolower() after the first tokens() call if you like, or use char_tolower() at the end.
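If you also want to keep one row per original text (rather than a flat vector of bigrams), one option is to reshape to sentences before forming the ngrams and then regroup the counts. A minimal sketch, assuming a reasonably recent quanteda release (corpus_reshape(), tokens_ngrams(), docid() and dfm_group() are all quanteda functions, but check the exact behaviour on your installed version):

library(quanteda)

# a minimal sketch: bigrams never cross a sentence boundary, and counts are
# regrouped back to one row per original text
sent_dfm <- corpus(mydat$text) %>%
  corpus_reshape(to = "sentences") %>%
  tokens(remove_punct = TRUE) %>%
  tokens_tolower() %>%
  tokens_ngrams(n = 2) %>%
  dfm()

doc_dfm <- dfm_group(sent_dfm, groups = docid(sent_dfm))
doc_dfm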
Is there a method to concatenate two dfm matrices that differ in both their rows and their columns? It can be done with some additional coding, so I am not interested in ad hoc code but in a general and elegant solution, if one exists.
An example:
dfm1 <- dfm(c(doc1 = "This is one sample text sample."), verbose = FALSE)
dfm2 <- dfm(c(doc2 = "Surprise! This is one sample text sample."), verbose = FALSE)
rbind(dfm1, dfm2)
gives an error.
The 'tm' package can concatenate its document-term matrices out of the box, but it is too slow for my purposes.
Also recall that a 'dfm' from 'quanteda' is an S4 class.
Should work "out of the box", if you are using the latest version:
packageVersion("quanteda")
## [1] ‘0.9.6.9’
dfm1 <- dfm(c(doc1 = "This is one sample text sample."), verbose = FALSE)
dfm2 <- dfm(c(doc2 = "Surprise! This is one sample text sample."), verbose = FALSE)
rbind(dfm1, dfm2)
## Document-feature matrix of: 2 documents, 6 features.
## 2 x 6 sparse Matrix of class "dfmSparse"
## is one sample surprise text this
## doc1 1 1 2 0 1 1
## doc2 1 1 2 1 1 1
See also ?selectFeatures where features is a dfm object (there are examples in the help file).
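For reference, a minimal sketch of that feature-alignment idea. selectFeatures() belongs to the quanteda version shown above; in current releases the equivalent is dfm_match(), so adjust to whichever your installed version provides:

# a minimal sketch: force dfm2 onto dfm1's feature set, dropping or zero-filling as needed
dfm2_matched <- dfm_match(dfm2, features = featnames(dfm1))
dfm2_matched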
Added:
Note that this will correctly align the two texts in a common feature set, unlike the normal rbind methods for matrices, whose columns must match. For the same reasons, rbind() does not actually work in the tm package for DocumentTermMatrix objects with different terms:
require(tm)
dtm1 <- DocumentTermMatrix(Corpus(VectorSource(c(doc1 = "This is one sample text sample."))))
dtm2 <- DocumentTermMatrix(Corpus(VectorSource(c(doc2 = "Surprise! This is one sample text sample."))))
rbind(dtm1, dtm2)
## Error in f(init, x[[i]]) : Numbers of columns of matrices must match.
This almost gets it, but seems to duplicate the repeated feature:
as.matrix(rbind(c(dtm1, dtm2)))
## Terms
## Docs one sample sample. text this surprise!
## 1 1 1 1 1 1 0
## 1 1 1 1 1 1 1