Document-Term Matrix with Quanteda

I have a dataframe df with this structure:
Rank Review
5 good film
8 very goood film
..
Then I tried to create a DocumentTermMatrix using the quanteda package:
temp.tf <- df$Review %>% tokens(ngrams = 1:1) %>% # generate tokens
  dfm() %>%                                       # generate dfm
  convert(to = "tm")
I get this matrix:
> inspect(temp.tf)
<<DocumentTermMatrix (documents: 63023, terms: 23892)>>
Non-/sparse entries: 520634/1505224882
Sparsity : 100%
Maximal term length: 77
Weighting : term frequency (tf)
Sample :
With this structure:
Terms
Docs good very film my excellent heart David plus always so
text14670 1 0 0 0 1 0 0 0 2 0
text19951 3 0 0 0 0 0 0 1 1 1
text24305 7 0 2 1 0 0 0 2 0 0
text26985 6 0 0 0 0 0 0 4 0 1
text29518 4 0 1 0 1 0 0 3 0 1
text34547 5 2 0 0 0 0 2 3 1 3
text3781 3 0 1 4 0 0 0 3 0 0
text5272 4 0 0 4 0 5 0 3 1 2
text5367 3 0 1 3 0 0 1 4 0 1
text6001 3 0 9 1 0 6 0 1 0 1
So I think it is good, but I believe that text6001, text5367, text5272, ... refer to the documents' names...
My question: are the rows in this matrix ordered, or are they placed in it randomly?
Thank you
EDIT:
I created a document-feature matrix:
mydfm <- dfm(df$Review, remove = stopwords("french"), stem = TRUE)
Then, I created a tf-idf matrix:
tfidf <- tfidf(mydfm)[, 5:10]
Then I would like to merge the tf-idf matrix with the Rank column, to get something like this:
features
Docs good very film my excellent heart David plus always so Rank
text14670 1 0 0 0 1 0 0 0 2 0 3
text19951 3 0 0 0 0 0 0 1 1 1 2
text24305 7 0 2 1 0 0 0 2 0 0 4
text26985 6 0 0 0 0 0 0 4 0 1 5
Can you help me make this merge?
Thank you

The rows (documents) are alphabetically ordered, which is why text14670 comes before text19951. It is possible that the conversion has reordered the documents, but you can test this using
sum(rownames(temp.tf) != sort(rownames(temp.tf)))
If that is not 0, then they are not alphabetically ordered.
The feature ordering, at least in the quanteda dfm, comes from the order in which the features are found in the texts. You can re-sort both using dfm_sort().
In your code, tokens(ngrams = 1:1) is unnecessary, since dfm() tokenizes for you and ngrams = 1 is the default.
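In other words, the pipeline can be shortened to something like:
temp.tf <- df$Review %>%
  dfm() %>%
  convert(to = "tm")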
Also, do you need to convert this to a tm object? Probably most of what you need can be done in quanteda.
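As for the EDIT: a minimal sketch of the merge, assuming df and mydfm are as defined above. Because the dfm keeps the rows in the order of the input texts (text1, text2, ...), row i of the tf-idf matrix corresponds to df[i, ], so you can simply bind the Rank column on:
tfidf <- tfidf(mydfm) # dfm_tfidf(mydfm) in newer quanteda releases
# bind Rank onto the tf-idf scores, relying on the shared row order
merged <- cbind(as.data.frame(as.matrix(tfidf)), Rank = df$Rank)
head(merged)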

Related

mlogit gives error: the two indexes don't define unique observations

My dataframe named longData looks like:
ID Set Choice Apple Microsoft IBM Google Intel HewlettPackard Sony Dell Yahoo Nokia
1 1 1 0 1 0 0 0 0 0 0 0 0 0
2 1 2 0 0 1 0 0 0 0 0 0 0 0
3 1 3 0 0 0 1 0 0 0 0 0 0 0
4 1 4 1 0 0 0 1 0 0 0 0 0 0
5 1 5 0 0 0 0 0 0 0 0 0 0 1
6 1 6 0 -1 0 0 0 0 0 0 0 0 0
I am trying to run mlogit on it by:
logitModel = mlogit(Choice ~ Apple+Microsoft+IBM+Google+Intel+HewlettPackard+Sony+Dell+Yahoo+Nokia | 0, data = longData, shape = "long")
it gives the following error:
Error in dfidx::dfidx(data = data, dfa$idx, drop.index = dfa$drop.index, :
the two indexes don't define unique observations
After looking for some time, I found that this error is raised by dfidx, as seen here:
z <- data[, c(posid1[1], posid2[1])]
if (nrow(z) != nrow(unique(z)))
    stop("the two indexes don't define unique observations")
But when I call the following code, it runs without error and gives the names of the two indexes that uniquely identify a row in the dataframe:
dfidx(longData)$idx
This gives the expected output:
~~~ indexes ~~~~
ID Set
1 1 1
2 1 2
3 1 3
4 1 4
5 1 5
6 1 6
7 1 7
8 1 8
9 1 9
10 1 10
indexes: 1, 2
So what am I doing wrong? I saw some related questions (1, 2) but couldn't find what I am missing.
It looks like your example comes from here: https://docs.displayr.com/wiki/MaxDiff_Analysis_Case_Study_Using_R
The code there seems outdated; I remember it working for me before, but it no longer does.
The error message is valid because every pair (ID, Set) appears several times, once for each alternative.
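You can verify this directly; a quick check, assuming longData as above:
# each (ID, Set) pair repeats once per alternative, so duplicates exist
any(duplicated(longData[, c("ID", "Set")])) # returns TRUE here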
However this works:
# otherwise mlogit complains that Choice can't be coerced to logical
longData$Choice <- as.logical(longData$Choice)
# create the alternative number (nAltsPerSet is 5 in this example)
nAltsPerSet <- 5
longData$Alternative <- 1 + (0:(nrow(longData) - 1) %% nAltsPerSet)
# define the dataset
mdata <- mlogit.data(data = longData, shape = "long", choice = "Choice", alt.var = "Alternative", id.var = "ID")
# model
logitModel <- mlogit(Choice ~ Microsoft + IBM + Google + Intel + HewlettPackard + Sony + Dell + Yahoo + Nokia | 0,
                     data = mdata)
summary(logitModel)

Is there an R function for finding keywords within a certain 'word distance'?

What I need is a function to find words within a certain 'word distance'. For example, the words 'bag' and 'tools' are interesting in the sentence "He had a bag of tools in his car."
With the quanteda kwic function I can find 'bag' and 'tools' individually, but this often gives me an overload of results. I need, e.g., 'bag' and 'tools' within five words of each other.
You can use the fcm() function to count co-occurrences within a fixed window, for instance 5 words. This creates a "feature co-occurrence matrix", and the window can be defined for any span of tokens, or as the context of an entire document.
For your example, or at least an example based on my interpretation of your question, this would look like:
library("quanteda")
## Package version: 1.4.3
## Parallel computing: 2 of 12 threads used.
## See https://quanteda.io for tutorials and examples.
txt <- c(
d1 = "He had a bag of tools in his car",
d2 = "bag other other other other tools other"
)
fcm(txt, context = "window", window = 5)
## Feature co-occurrence matrix of: 10 by 10 features.
## 10 x 10 sparse Matrix of class "fcm"
## features
## features He had a bag of tools in his car other
## He 0 1 1 1 1 1 0 0 0 0
## had 0 0 1 1 1 1 1 0 0 0
## a 0 0 0 1 1 1 1 1 0 0
## bag 0 0 0 0 1 2 1 1 1 4
## of 0 0 0 0 0 1 1 1 1 0
## tools 0 0 0 0 0 0 1 1 1 5
## in 0 0 0 0 0 0 0 1 1 0
## his 0 0 0 0 0 0 0 0 1 0
## car 0 0 0 0 0 0 0 0 0 0
## other 0 0 0 0 0 0 0 0 0 10
Here, the term bag occurs within 5 tokens of tools in both documents: once in the first document, and once in the second, where the two terms are exactly 5 tokens apart, right at the edge of the window. That gives the count of 2 in the bag/tools cell. Had they been more than 5 tokens apart, the pair would not have been counted.
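If you only need the count for one specific pair, you can index the fcm like a matrix; a small sketch building on the object above:
mat <- fcm(txt, context = "window", window = 5)
as.matrix(mat)["bag", "tools"] # co-occurrence count for this pair
## [1] 2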

Quadratic Assignment Procedure (QAP) in R is producing different results

I would like to say thank you in advance to anyone who looks at my question and shares their thoughts and experiences. I am trying to run a quadratic assignment procedure (QAP) on correlations of behaviors among a community of five individuals. I have ten matrices that represent frequencies of behavior between individuals, and I calculated correlations (Pearson's r) between pairs of matrices. For example, I found the correlation between matrix 1 and matrix 2, matrix 2 and matrix 3, matrix 3 and matrix 4, and so on.

I then wanted to assess the significance of these correlations using the qaptest function from the R package sna. As per the R documentation on qaptest, I placed all of my matrices into an array. I then calculated the QAP p-value between pairs of matrices (matrix 1 and matrix 2, matrix 2 and matrix 3, etc.).

However, I noticed that if I changed the number of matrices in the array (for example, if I only placed the first five into the array), the QAP p-values for the first set of matrices changed dramatically. Based on my understanding of arrays and QAP, this should not happen, because the removed matrices have nothing to do with running a QAP test on matrix 1 and matrix 2. Has anyone else run into this problem before? I have included my matrices and my script below.
Here are my matrices in a list format (in the code below, this is the step where I made filelist1. The second half of the code only uses matrices 1-5):
[[1]]
1 2 3 4 5
1 1 0 0 0 0
2 5 0 3 5 0
3 0 0 0 0 0
4 0 0 0 0 0
5 2 0 1 0 0
[[2]]
1 2 3 4 5
1 0 0 1 0 0
2 3 6 10 1 2
3 0 0 0 0 0
4 0 5 0 0 0
5 0 0 5 0 0
[[3]]
1 2 3 4 5
1 0 1 0 0 0
2 2 0 5 7 0
3 0 0 0 0 3
4 1 0 0 0 0
5 1 2 2 3 0
[[4]]
1 2 3 4 5
1 0 6 0 0 2
2 2 0 8 5 0
3 0 5 0 0 0
4 1 0 0 0 0
5 0 0 1 3 2
[[5]]
1 2 3 4 5
1 0 0 0 0 0
2 1 0 2 5 1
3 0 0 0 0 0
4 1 2 3 0 1
5 0 3 3 1 0
[[6]]
1 2 3 4 5
1 0 0 0 0 0
2 2 0 3 0 3
3 0 0 0 0 0
4 1 0 4 0 0
5 1 5 7 0 0
[[7]]
1 2 3 4 5
1 0 0 0 0 0
2 2 0 6 0 3
3 0 0 0 0 0
4 6 0 4 0 0
5 1 0 2 0 0
[[8]]
1 2 3 4 5
1 0 0 0 1 0
2 2 0 1 6 0
3 0 0 0 0 0
4 0 0 0 0 0
5 6 0 2 2 0
[[9]]
1 2 3 4 5
1 0 0 0 0 0
2 0 0 2 3 2
3 0 0 0 0 0
4 0 0 0 0 0
5 1 0 2 0 0
[[10]]
1 2 3 4 5
1 0 0 0 0 0
2 1 0 1 1 0
3 0 0 0 0 0
4 0 0 0 0 0
5 6 0 1 2 0
This is my R script:
# read in all ten of the matrices
a<-read.csv("test1.csv")
b<-read.csv("test2.csv")
c<-read.csv("test3.csv")
d<-read.csv("test4.csv")
e<-read.csv("test5.csv")
f<-read.csv("test6.csv")
g<-read.csv("test7.csv")
h<-read.csv("test8.csv")
i<-read.csv("test9.csv")
j<-read.csv("test10.csv")
filelist<-list(a,b,c,d,e,f,g,h,i,j) #place files in a list
filelist1 <- lapply(filelist, function(x) {
  x <- x[1:5, 2:6]   # choose only columns in the matrix
  colnames(x) <- 1:5 # rename columns according to identity
  x <- as.matrix(x)  # make a matrix
  return(x)
})
ee<-array(dim=c(5,5,10)) #create an empty array
array <- function(files) {
  names(files) <- c("c1", "c2", "c3", "c4", "c5", "c6", "c7", "c8", "c9", "c10") # name the matrices
  invisible(lapply(names(files), function(x) assign(x, files[[x]], envir = .GlobalEnv))) # place the matrices in the global environment
  ee[,,1] <- c(c1) # place each matrix in order into the array
  ee[,,2] <- c(c2)
  ee[,,3] <- c(c3)
  ee[,,4] <- c(c4)
  ee[,,5] <- c(c5)
  ee[,,6] <- c(c6)
  ee[,,7] <- c(c7)
  ee[,,8] <- c(c8)
  ee[,,9] <- c(c9)
  ee[,,10] <- c(c10)
  return(ee) # return the completely filled-in array
}
a.array<-array(filelist1) # apply the function to the list of matrices
q1.2 <- qaptest(a.array, gcor, g1 = 1, g2 = 2) # run the qaptest function
# a.array is the array with the matrices; gcor tells the function that we want a correlation
# g1=1 and g2=2 indicate that the QAP analysis should be run between the first and second matrices in the array
summary.qaptest(q1.2) #provides a summary of the qap results
#in this case, the p-value is roughly: p(f(perm) >= f(d)): 0.176
############ If I take out the last five matrices, the q1.2 p-value changes dramatically
#first clear the memory or R will not create another blank array
rm(list = ls())
a<-read.csv("test1.csv") #read in all five files
b<-read.csv("test2.csv")
c<-read.csv("test3.csv")
d<-read.csv("test4.csv")
e<-read.csv("test5.csv")
filelist<-list(a,b,c,d,e) #create a list of the files
filelist1 <- lapply(filelist, function(x) {
  x <- x[1:5, 2:6]   # include only the matrix
  colnames(x) <- 1:5 # rename the columns
  x <- as.matrix(x)  # make it a matrix
  return(x)
})
ee<-array(dim=c(5,5,5)) #this time the array only has five slots
array <- function(files) {
  names(files) <- c("c1", "c2", "c3", "c4", "c5")
  invisible(lapply(names(files), function(x) assign(x, files[[x]], envir = .GlobalEnv)))
  ee[,,1] <- c(c1)
  ee[,,2] <- c(c2)
  ee[,,3] <- c(c3)
  ee[,,4] <- c(c4)
  ee[,,5] <- c(c5)
  return(ee)
}
a.array<-array(filelist1)
q1.2<-qaptest(a.array,gcor,g1=1,g2=2)
#in this case, the p-value is roughly: p(f(perm) >= f(d)): 0.804
summary.qaptest(q1.2)
I cannot think of a reason why the p-values would be so different when I am analyzing the exact same pair of matrices. The only difference is the number of additional matrices placed in the array. Has anyone else experienced this issue?
Thank you!
qaptest() reads graphs from the first dimension of the array, not the last. So ee[,,1] <- c(c1) (etc.) should read ee[1,,] <- c(c1) (etc.). When you place all the graphs in the first dimension, the qaptests should yield identical results. Personally, I prefer using list() instead of array() with qaptest.
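A minimal sketch of the list-based version, assuming filelist1 holds the 5x5 matrices built above; passing the list directly sidesteps the dimension-ordering pitfall:
library(sna)
q1.2 <- qaptest(filelist1, gcor, g1 = 1, g2 = 2) # a list of graphs instead of an array
summary(q1.2)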

Error running the "netlm" command (sna)

I have four matrices of one multigraph, like this:
> projects
1 2 3 4 5
1 0 0 4 1 0
2 0 0 3 2 5
3 0 0 0 0 0
4 0 0 0 0 1
5 0 0 0 0 0
> infrastructure
1 2 3 4 5
1 0 0 0 5 0
2 0 0 4 0 0
3 0 0 0 2 2
4 0 0 0 0 3
5 0 0 0 0 0
> information
1 2 3 4 5
1 0 1 3 0 0
2 0 0 2 3 4
3 0 0 0 0 0
4 0 0 0 0 0
5 0 0 0 0 0
> problems
1 2 3 4 5
1 0 1 0 1 0
2 0 0 0 0 0
3 0 0 0 1 1
4 0 0 0 0 0
5 0 0 0 0 0
I rearrange them with:
x <- array(NA, c(length(infrastructure[1,]),length(infrastructure[,1]),3))
x[,,1] <- infrastructure
x[,,2] <- information
x[,,3] <- problems
nl <- netlm(projects,x,reps=100)
When I run the netlm command, the following message appears:
"Error in netlm(projects, x, reps = 100) :
Homogeneous graph orders required in netlm."
How can I fix it?
Thanks
The problem here is that netlm expects a list rather than an array, so I think it is not reading the entries as separate networks. The error indicates as much. It is not seeing three 5x5 matrices. Use list() instead.
nets <- rgraph(5, 4) # generate four random 5-node graphs as example data
y <- nets[1,,]
info <- nets[2,,]
infra <- nets[3,,]
prob <- nets[4,,]
Now, you can use list() in the netlm() command itself (saves a step):
nl <- netlm(y,list(info,infra,prob),reps=100)
Or you can create the list as an object and use it that way:
x <- list(info,infra,prob)
nl <- netlm(y,x,reps=100)
Since you have three separate networks already, you can just do:
nl <- netlm(projects,list(problems, information, infrastructure),reps=100)
I made a mistake in defining the array; I should have written: x <- array(NA, c(3, length(infrastructure[1,]), length(infrastructure[,1])))
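For completeness, a sketch of the corrected array version, assuming the four matrices above; the networks sit in the first dimension, so the assignments change accordingly:
x <- array(NA, c(3, nrow(infrastructure), ncol(infrastructure)))
x[1,,] <- as.matrix(infrastructure) # each network fills one slice of the first dimension
x[2,,] <- as.matrix(information)
x[3,,] <- as.matrix(problems)
nl <- netlm(projects, x, reps = 100)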

Get the number of character vector elements in a corpus

My goal is to use R for lexicon-based sentiment analysis!
I have two character vectors: one with positive words and one with negative words.
e.g.
pos <- c("good", "accomplished", "won", "happy")
neg <- c("bad", "loss", "damaged", "sued", "disaster")
I now have a corpus of thousands of news articles, and I want to know, for each article, how many elements of my vectors pos and neg are in the article.
e.g. (not sure how the Corpus function works here, but you get the idea: there are two articles in my corpus)
mycorpus <- Corpus("The CEO is happy that they finally won the case.", "The disaster caused a huge loss.")
I want to get something like this:
article 1: 2 elements of pos and 0 elements of neg
article 2: 0 elements of pos, 2 elements of neg
Another good thing would be if I could get the following for each article:
(number of pos words - number of neg words) / (number of total words in article)
Thank you very much!!
EDIT:
@Victorp: this doesn't seem to work.
The matrix I get looks good:
mytdm[1:6,1:10]
Docs
Terms 1 2 3 4 5 6 7 8 9 10
aaron 0 0 0 0 0 1 0 0 0 0
abandon 1 1 0 0 0 0 0 0 0 0
abandoned 0 0 0 3 0 0 0 0 0 0
abbey 0 0 0 0 0 0 0 0 0 0
abbott 0 0 0 0 0 0 0 0 0 0
abbotts 0 0 1 0 0 0 0 0 0 0
But when I run your command, I get zero for every document!
colSums(mytdm[rownames(mytdm) %in% pos, ])
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Why is that?
Hello, you can use a TermDocumentMatrix for that:
mycorpus <- Corpus(VectorSource(c("The CEO is happy that they finally won the case.", "The disaster caused a huge loss.")))
mytdm <- TermDocumentMatrix(mycorpus, control=list(removePunctuation=TRUE))
mytdm <- as.matrix(mytdm)
# Positive words
colSums(mytdm[rownames(mytdm) %in% pos, ])
1 2
2 0
# Negative words
colSums(mytdm[rownames(mytdm) %in% neg, ])
1 2
0 2
# Total number of words per document
colSums(mytdm)
1 2
9 5
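From here, the score you asked for, (pos - neg) / total words, takes one more line; a small sketch using the same mytdm (drop = FALSE guards against the case where only one lexicon word matches):
pos_counts <- colSums(mytdm[rownames(mytdm) %in% pos, , drop = FALSE])
neg_counts <- colSums(mytdm[rownames(mytdm) %in% neg, , drop = FALSE])
(pos_counts - neg_counts) / colSums(mytdm)
##         1         2
## 0.2222222 -0.4000000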
Here's another approach:
## pos <- c("good", "accomplished", "won", "happy")
## neg <- c("bad", "loss", "damaged", "sued", "disaster")
##
## mycorpus <- Corpus(VectorSource(
## list("The CEO is happy that they finally won the case.",
## "The disaster caused a huge loss.")))
library(qdap)
with(tm_corpus2df(mycorpus), termco(text, docs, list(pos=pos, neg=neg)))
## docs word.count pos neg
## 1 1 10 2(20.00%) 0
## 2 2 6 0 2(33.33%)
