Include ID number in dfm() output - r

I have a dataset with an ID number column and a text column, and I am running a LIWC analysis on the text data using the quanteda package. Here's an example of my data setup:
mydata <- data.frame(
  id = c(19, 101, 43, 12),
  text = c("No wonder, then, that ever gathering volume from the mere transit ",
           "So that in many cases such a panic did he finally strike, that few ",
           "But there were still other and more vital practical influences at work",
           "Not even at the present day has the original prestige of the Sperm Whale"),
  stringsAsFactors = FALSE
)
I have been able to conduct the LIWC analysis using scores <- dfm(as.character(mydata$text), dictionary = liwc)
However, when I view the results (View(scores)), I find that the output does not reference the original ID numbers (19, 101, 43, 12) anywhere. Instead, a row.names column is included, but it contains only non-descriptive identifiers (e.g., "text1", "text2").
How can I get the dfm() function to include the ID numbers in its output? Thank you!

It sounds like you would like the row names of the dfm object to be the ID numbers from your mydata$id. This will happen automatically if you declare this ID to be the docnames for the texts. The easiest way to do this is to create a quanteda corpus object from your data.frame.
The corpus() call below assigns the docnames from your id variable. Note: The "Text" from the summary() call looks like a numeric value but it's actually the document name for the text.
require(quanteda)
myCorpus <- corpus(mydata[["text"]], docnames = mydata[["id"]])
summary(myCorpus)
# Corpus consisting of 4 documents.
#
# Text Types Tokens Sentences
# 19 11 11 1
# 101 13 14 1
# 43 12 12 1
# 12 12 14 1
#
# Source: /Users/kbenoit/Dropbox/GitHub/quanteda/* on x86_64 by kbenoit
# Created: Tue Dec 29 11:54:00 2015
# Notes:
From there, the document name is automatically the row label in your dfm. (You can add the dictionary = argument for your LIWC application.)
myDfm <- dfm(myCorpus, verbose = FALSE)
head(myDfm)
# Document-feature matrix of: 4 documents, 45 features.
# (showing first 4 documents and first 6 features)
# features
# docs no wonder then that ever gathering
# 19 1 1 1 1 1 1
# 101 0 0 0 2 0 0
# 43 0 0 0 0 0 0
# 12 0 0 0 0 0 0
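As a side note (not part of the original answer): on more recent quanteda releases you can build the corpus directly from the data.frame and apply the dictionary at the dfm stage. A rough sketch, assuming quanteda 2.x or later (where corpus() accepts a data.frame via docid_field/text_field and dfm() is built from tokens) and that liwc is already a quanteda dictionary object:

# sketch for newer quanteda versions; `liwc` is assumed to be a dictionary object
library(quanteda)
myCorpus <- corpus(mydata, docid_field = "id", text_field = "text")
scores <- tokens(myCorpus) %>%
  dfm() %>%
  dfm_lookup(dictionary = liwc)   # LIWC categories become the features
docnames(scores)                  # "19" "101" "43" "12"

The ID-based docnames carry through each step, so the row labels of scores are your original IDs.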

Related

How to count collocations in quanteda based on grouping variables?

I have been working on identifying and classifying collocations with the quanteda package in R.
For instance:
I create a tokens object from a list of documents and apply collocation analysis.
toks <- tokens(text$abstracts)
collocations <- textstat_collocations(toks)
However, as far as I can see, there is no clear way to see which collocations are frequent in, or exist in, which document. Even if I apply kwic(toks, pattern = phrase(collocations), selection = 'keep'), the result only includes row IDs such as text1, text2, etc.
I would like to group the collocation analysis results based on docvars. Is this possible with quanteda?
It sounds like you wish to tally collocations by document. The output from textstat_collocations() already provides counts for each collocation, but these are for the entire corpus.
So the solution to group by document (or any other variable) is to:
1. Get the collocations using textstat_collocations(). Below, I've done that after removing stopwords and punctuation.
2. Compound the tokens from which the stopwords were removed, using tokens_compound(). This converts each collocation sequence into a single token.
3. Form a dfm from the compounded tokens, and use textstat_frequency() to count the compounds by document. (This last step is a bit trickier.)
Implementation using the built-in inaugural corpus:
library("quanteda")
## Package version: 3.0
## Unicode version: 13.0
## ICU version: 69.1
## Parallel computing: 12 of 12 threads used.
## See https://quanteda.io for tutorials and examples.
library("quanteda.textstats")
toks <- data_corpus_inaugural %>%
  tail(10) %>%
  tokens(remove_punct = TRUE, padding = TRUE) %>%
  tokens_remove(stopwords("en"), padding = TRUE)
colls <- textstat_collocations(toks)
head(colls)
## collocation count count_nested length lambda z
## 1 let us 34 0 2 6.257000 17.80637
## 2 fellow citizens 14 0 2 6.451738 16.18314
## 3 fellow americans 15 0 2 6.221678 16.16410
## 4 one another 14 0 2 6.592755 14.56082
## 5 god bless 15 0 2 8.628894 13.57027
## 6 united states 12 0 2 9.192044 13.22077
Now we compound them and keep only the collocations, then get the frequencies by document:
dfmat <- tokens_compound(toks, colls, concatenator = " ") %>%
  dfm() %>%
  dfm_keep("* *")
That dfm already contains the counts by document of each collocation, but if you want counts in a data.frame format, with a grouping option, use textstat_frequency(). Here I've only output the top two by document, but if you remove the n = 2 then it will give you the frequencies of all collocations by document.
textstat_frequency(dfmat, groups = docnames(dfmat), n = 2) %>%
  head(10)
## feature frequency rank docfreq group
## 1 nuclear weapons 4 1 1 1985-Reagan
## 2 human freedom 3 2 1 1985-Reagan
## 3 new breeze 4 1 1 1989-Bush
## 4 new engagement 3 2 1 1989-Bush
## 5 let us 7 1 1 1993-Clinton
## 6 fellow americans 4 2 1 1993-Clinton
## 7 let us 6 1 1 1997-Clinton
## 8 new century 6 1 1 1997-Clinton
## 9 nation's promise 2 1 1 2001-Bush
## 10 common good 2 1 1 2001-Bush
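If you would rather have the per-document counts as a plain data frame instead of the textstat_frequency() output, one option (my addition, not from the original answer) is to convert the compound-only dfm directly:

# each row is a document, each column a collocation, cells are counts
coll_counts <- convert(dfmat, to = "data.frame")
head(coll_counts)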

Factor Level issues after filling data frame using match

I am using two large data files, each having >2m records. The sample data frames are
x <- data.frame("ItemID" = c(1,2,1,1,3,4,2,3,4,1), "SessionID" = c(111,112,111,112,113,114,114,115,115,115), "Avg" = c(1.0,0.45,0.5,0.5,0.46,0.34,0.5,0.6,0.10,0.15),"Category" =c(0,0,0,0,0,0,0,0,0,0))
y <- data.frame("ItemID" = c(1,2,3,4,3,4,5,7),"Category" = c("1","0","S","120","S","120","512","621"))
I successfully filled x$Category using the following command
x$Category <- y$Category[match(x$ItemID,y$ItemID)]
but
x$Category
gave me
[1] 1 0 1 1 S 120 0 S 120 1
Levels: 0 1 120 512 621 S
In x there are only four distinct categories, but the levels show six. Similarly, the frequency table shows 512 and 621 with zero frequency. I am using the same data for classification, where it shows six classes instead of four, which negatively affects the F-measure, recall, etc.
table(x$Category)
0 1 120 512 621 S
2 4 2 0 0 2
while I want
table(x$Category)
0 1 120 S
2 4 2 2
I tried merge, following this and this along with a number of other questions, but it gives me an error message. I found here (Practical limits of R data frame) that this is a limitation of R.
I would omit the Category column from your x data.frame, since it seems to only be serving as a placeholder until values from the y data.frame are filled in. Then, you can use left_join from dplyr with ItemID as the key variable, followed by droplevels() as suggested by TingITangIBob.
This gets you close, but my table does not exactly match yours:
dplyr::select(x, -Category) %>%
  dplyr::left_join(y, by = "ItemID") %>%
  droplevels()
0 1 120 S
2 4 4 4
I think this may have to do with the repeated ItemIDs in x?
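As a side note (my own suggestion, not part of the original answer): if you prefer to keep the original match() approach, dropping the unused factor levels right after the fill gives the four-level table the question asks for:

# keep the match() fill, then drop the unused levels ("512", "621")
x$Category <- droplevels(y$Category[match(x$ItemID, y$ItemID)])
table(x$Category)
#   0   1 120   S
#   2   4   2   2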

Feature extraction using Chi2 with Quanteda

I have a data frame df with this structure:
Rank Review
5 good film
8 very good film
..
Then I tried to create a document-feature matrix using the quanteda package:
mydfm <- dfm(df$Review, remove = stopwords("english"), stem = TRUE)
I would like to know how to calculate, for each feature (term), the Chi2 value with respect to the documents, in order to extract the best features in terms of Chi2 value.
Can you help me resolve this problem, please?
EDIT:
head(mydfm[, 5:10])
Document-feature matrix of: 63,023 documents, 6 features (92.3% sparse).
(showing first 6 documents and first 6 features)
features
docs bon accueil conseillèr efficac écout répond
text1 0 0 0 0 0 0
text2 1 1 1 1 1 1
text3 0 0 0 0 0 0
text4 0 0 0 0 0 0
text5 0 0 1 0 0 0
text6 0 0 0 0 1 0
...
text60300 0 0 1 1 1 1
Here I have my dfm matrix; then I create my tf-idf matrix:
tfidf <- tfidf(mydfm)[, 5:10]
I would like to determine the Chi2 value between these features and the documents (here I have 60,300 documents):
textstat_keyness(mydfm, target = 2)
But since I have 60,300 targets, I don't know how to do this automatically.
I see in the quanteda manual that the groups option in the dfm() function may resolve this problem, but I don't see how to do it. :(
EDIT 2:
Rank Review
10 always good
1 nice film
3 fine as usual
Here I try to group the documents with dfm:
mydfm <- dfm(Review, remove = stopwords("english"), stem = TRUE, groups = Rank)
But it fails to group the documents.
Can you please help me resolve this problem?
Thank you
See ?textstat_keyness. The default measure is chi-squared. You can change the target argument to set a particular document's frequencies against all other frequencies, e.g.
textstat_keyness(mydfm, target = 1)
for the first document against the frequencies of all others, or
textstat_keyness(mydfm, target = 2)
for the second against all others, etc.
If you want to compare categories of frequencies that group documents, you would need to use the groups = option in dfm() for a supplied variable or one in the docvars. See the example in ?textstat_keyness.
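A rough sketch of the grouping idea, assuming df holds the Rank and Review columns from the question and a recent quanteda setup (where grouping is done with dfm_group() and textstat_keyness() comes from quanteda.textstats); on older versions the groups = argument of dfm() mentioned above plays the same role:

library(quanteda)
library(quanteda.textstats)

corp <- corpus(df, text_field = "Review")    # Rank is kept as a docvar
dfmat <- tokens(corp, remove_punct = TRUE) %>%
  tokens_remove(stopwords("english")) %>%
  dfm() %>%
  dfm_group(groups = Rank)                   # one row per Rank value

# chi-squared keyness of Rank "5" against all the other ranks
textstat_keyness(dfmat, target = "5")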

Using findAssocs in R to find words that frequently occur with a central term

As I was working with findAssocs in R, I realised that the function doesn't actually pick up the words that occur together with the searched term across documents, but rather the words that occur in the documents where the searched term appears most frequently.
I've tried using a simple test script below:
test <- list("housekeeping bath towel housekeeping room","housekeeping dirty","housekeeping very dirty","housekeeping super dirty")
test <-Corpus(VectorSource(test))
test_dtm<-DocumentTermMatrix(test)
test_dtms<-removeSparseTerms(test_dtm,0.99)
findAssocs(test_dtms,"housekeeping",corlimit = 0.1)
And the returning result from R is:
$housekeeping
bath room towel
1 1 1
Notice that the word "dirty" occurs in 3 of the 4 documents, whereas the returned keywords each occur in only one document.
Does anyone have any idea what went wrong in my script, or whether there is a better way to do this?
The result I want is a model that reflects the words that frequently occur with the search term across all documents, not just within a specific document. I've tried combining the 4 documents into 1, but that doesn't work because findAssocs doesn't work on a single document.
Any advice?
How about an alternative, using the quanteda package? It imposes no mystery restrictions on the correlations returned, and has many other options (see ?similarity).
require(quanteda)
testDfm <- dfm(unlist(test), verbose = FALSE)
## Document-feature matrix of: 4 documents, 7 features.
## 4 x 7 sparse Matrix of class "dfmSparse"
## features
## docs housekeeping bath towel room dirty very super
## text1 2 1 1 1 0 0 0
## text2 1 0 0 0 1 0 0
## text3 1 0 0 0 1 1 0
## text4 1 0 0 0 1 0 1
similarity(testDfm, "housekeeping", margin = "features")
## similarity Matrix:
## $housekeeping
## bath towel room very super dirty
## 1.0000 1.0000 1.0000 -0.3333 -0.3333 -1.0000
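Note that similarity() comes from an older quanteda release; in current versions the closest equivalent is textstat_simil() from quanteda.textstats. A rough sketch of the same computation on a modern install, assuming test is still the original character list (i.e. before it was converted to a tm Corpus):

library(quanteda)
library(quanteda.textstats)

testDfm <- dfm(tokens(unlist(test)))
# correlation of every feature with the "housekeeping" column
textstat_simil(testDfm, testDfm[, "housekeeping"],
               margin = "features", method = "correlation")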

Identify most variable rows within multiple subsets of a data.frame and merge this information into a final data.frame

I have a data.frame named data containing 18472 rows by 2229 columns. The last column of this data.frame (data$bin) contains a bin number from 1:7, although this may be dynamic later down the road. What I'd like to accomplish is to identify the 25 most variable rows for each bin and create a final data.frame with these. Ultimately this would result in a data.frame with 25*7 rows by 2228 columns. I am able to identify variable rows, but I'm not sure how to perform this on all bins within data:
> # identify variable rows
> library(genefilter)
> mostVarRows = head(order(rowVars(data), decreasing=TRUE), 25)
Data looks something like this:
> head(data[(ncol(data)-3):ncol(data)])
D6_NoSort_6000x3b_CCCCCGCCCTGA D6_NoSort_2250b_ATTATACTATTT D6_EcadSort_6000x3b_CACGACCTCCAC bin
0610005C13RIK 0 0 0 2
0610007P14RIK 0 0 0 6
0610009B22RIK 0 0 0 3
0610009L18RIK 0 0 0 2
0610009O20RIK 0 0 0 3
0610010B08RIK 0 0 0 6
I need to extract the most variable rows from each bin into a separate data.frame!
Below I create a mock data set. For future reference, the burden is on you to do this since you know what you want better than I do.
# create mock data
set.seed(1)
data<-replicate(1000,rnorm(500,500,100))
data<-data.frame(data,bins= sample(c(1:7),500,replace=TRUE)) # create bins column
Next I find the variance of each row (assuming this is how you want to define "most variable"). Then I sort by bin and variance (greatest to lowest).
data$var_by_row<-apply(data[,1:1000],1,var) # find variance of each row
data<-data[order(data$bins, -data$var_by_row),] # sort by bin and variance
Since the data is sorted properly, it remains to take the first 25 observations of each bin and stack them together. You were definitely on the right track with your use of order() and head(). The do.call() step afterwards is necessary to stack the head() results and is probably what you're looking for.
data_sub_list<-by(data,INDICES = data$bins, head,n=25) # grab the first 25 observations of each bin
data_sub<-do.call('rbind',data_sub_list) # the above returns a list of 7 data frames...one per bin. this stacks them
> table(data_sub$bins) # each bin appears 25 times.
1 2 3 4 5 6 7
25 25 25 25 25 25 25
> nrow(data_sub) # number of rows is 25*7
[1] 175
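For what it's worth (my own addition, not part of the original answer), the same "top 25 per bin" selection can also be written with dplyr, assuming dplyr >= 1.0 for slice_max():

library(dplyr)

data_sub <- data %>%
  group_by(bins) %>%
  slice_max(var_by_row, n = 25, with_ties = FALSE) %>%  # 25 largest variances per bin
  ungroup()

table(data_sub$bins)   # again 25 rows per bin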
