In R I used the [tm package][1] for building a term-document matrix from a corpus of documents.
My goal is to extract word associations for all bigrams in the term-document matrix and return the top three or so for each. I am therefore looking for a variable that holds all the row names of the matrix so that findAssocs() can do its job.
This is my code so far:
txtData <- read.csv("file.csv", header = T, sep = ",")
txtCorpus <- Corpus(VectorSource(txtData$text))
...further preprocessing
#Tokenizer for n-grams and passed on to the term-document matrix constructor
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
txtTdmBi <- TermDocumentMatrix(txtCorpus, control = list(tokenize = BigramTokenizer))
#term argument holds two words since the BigramTokenizer extracted all pairs from txtCorpus
findAssocs(txtTdmBi, "cat shop", 0.5)
cat cabi cat scratch ...
0.96 0.91
I tried to define a variable with all the row.names from txtTdmBi and feed it to the findAssocs() function. However, this produced the following error:
allRows <- c(row.names(txtTdmBi))
findAssocs(txtTdmBi, allRows, 0.5)
Error in which(x[term, ] > corlimit) : subscript out of bounds
In addition: Warning message:
In term == Terms(x) :
longer object length is not a multiple of shorter object length
Because extracting associations for a term spread over multiple term-document matrices is already explained here, I guess it should be possible to find the associations for multiple terms in a single term-document matrix. But how?
I hope someone can explain how to solve this. Thanks in advance for any support.
If I understand correctly, an lapply solution is probably the way to answer your question. This is the same approach as the answer that you link to, but here's a self-contained example that might be closer to your use case:
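Load libraries and reproducible data (please include these in your future questions here)
library(tm)
library(RWeka) # for NGramTokenizer and Weka_control
library(plyr) # for llply, used below
data(crude)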
Your bigram tokenizer...
#Tokenizer for n-grams and passed on to the term-document matrix constructor
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
txtTdmBi <- TermDocumentMatrix(crude, control = list(tokenize = BigramTokenizer))
Check that it worked by inspecting a random sample...
inspect(txtTdmBi[1000:1005, 10:15])
A term-document matrix (6 terms, 6 documents)
Non-/sparse entries: 1/35
Sparsity : 97%
Maximal term length: 18
Weighting : term frequency (tf)
Terms 248 273 349 352 353 368
for their 0 0 0 0 0 0
for west 0 0 0 0 0 0
forced it 0 0 0 0 0 0
forced to 0 0 0 0 0 0
forces trying 1 0 0 0 0 0
foreign investment 0 0 0 0 0 0
Here is the answer to your question:
Now use a lapply function to calculate the associated words for every item in the vector of terms in the term-document matrix. The vector of terms is most simply accessed with txtTdmBi$dimnames$Terms. For example txtTdmBi$dimnames$Terms[[1005]] is "foreign investment".
Here I've used llply from the plyr package so we can have a progress bar (comforting for big jobs), but it's basically the same as the base lapply function.
dat <- llply(txtTdmBi$dimnames$Terms, function(i) findAssocs(txtTdmBi, i, 0.5), .progress = "text" )
The output is a list where each item is a vector of named numbers: the name is the term and the number is the correlation value. For example, to see the terms associated with "foreign investment", we can access the list like so:
dat[[1005]]
and here are the terms associated with that term (I've just pasted in the top few)
168 million 1986 was 1987 early 300 mln 31 pct
1.00 1.00 1.00 1.00 1.00
a bit a crossroads a leading a political a population
1.00 1.00 1.00 1.00 1.00
a reduced a series a slightly about zero activity continues
1.00 1.00 1.00 1.00 1.00
advisers are agricultural sector agriculture the all such also reviews
1.00 1.00 1.00 1.00 1.00
and advisers and attract and imports and liberalised and steel
1.00 1.00 1.00 1.00 1.00
and trade and virtual announced since appears to are equally
1.00 1.00 1.00 1.00 1.00
are recommending areas for areas of as it as steps
1.00 1.00 1.00 1.00 1.00
asia with asian member assesses indonesia attract new balance of
1.00 1.00 1.00 1.00 1.00
Is that what you want to do?
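The same per-term idea can be sketched in base R without tm, which also makes the "top three" part of the question explicit. The toy matrix, the helper name top_assocs and its n argument are illustrative assumptions, not part of tm; findAssocs() itself only takes a correlation limit.

```r
# Toy term-document matrix: rows are terms, columns are documents (made-up counts)
tdm <- matrix(c(1, 1, 0, 1,
                1, 1, 0, 0,
                0, 0, 1, 1,
                1, 0, 0, 1),
              nrow = 4, byrow = TRUE,
              dimnames = list(Terms = c("cat shop", "cat scratch", "dog park", "pet food"),
                              Docs = paste0("d", 1:4)))

# Correlate every term with every other term across documents
term_cor <- cor(t(tdm))

# For one term, keep the n strongest associations at or above corlimit,
# excluding the term itself (hypothetical helper, not a tm function)
top_assocs <- function(term, n = 3, corlimit = 0.5) {
  r <- term_cor[term, colnames(term_cor) != term]
  r <- sort(r[r >= corlimit], decreasing = TRUE)
  round(head(r, n), 2)
}

# Apply the helper to every term, exactly as lapply/llply does above
assocs <- lapply(rownames(tdm), top_assocs)
names(assocs) <- rownames(tdm)
assocs[["cat shop"]]
```

Computing the full correlation matrix once is only sensible for small matrices; for a large term-document matrix the term-by-term loop above keeps memory use lower.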
Incidentally, if your term-document matrix is very large, you may want to try this version of findAssocs:
# u is a term document matrix
# term is your term
# corlimit is a value -1 to 1
findAssocsBig <- function(u, term, corlimit){
  suppressWarnings(x.cor <- gamlr::corr(t(u[ !u$dimnames$Terms == term, ]),
                                        as.matrix(t(u[ u$dimnames$Terms == term, ])) ))
  # keep correlations above the limit, rounded and sorted in decreasing order
  x <- sort(round(x.cor[(x.cor[, term] > corlimit), ], 2), decreasing = TRUE)
  return(x)
}
This can be used like so:
dat1 <- llply(txtTdmBi$dimnames$Terms, function(i) findAssocsBig(txtTdmBi, i, 0.5), .progress = "text" )
The advantage of this is that it converts the TDM to a matrix in a different way than tm::findAssocs does. This method uses memory more efficiently and so prevents this kind of message from occurring: Error: cannot allocate vector of size 1.9 Gb
Quick benchmarking shows that both findAssocs functions are about the same speed, so the main difference is in the use of memory:
library(microbenchmark)
microbenchmark(
dat1 <- llply(txtTdmBi$dimnames$Terms, function(i) findAssocsBig(txtTdmBi, i, 0.5)),
dat <- llply(txtTdmBi$dimnames$Terms, function(i) findAssocs(txtTdmBi, i, 0.5)),
times = 10)
Unit: seconds
expr min lq median uq max neval
dat1 <- llply(txtTdmBi$dimnames$Terms, function(i) findAssocsBig(txtTdmBi, i, 0.5)) 10.82369 11.03968 11.25492 11.39326 11.89754 10
dat <- llply(txtTdmBi$dimnames$Terms, function(i) findAssocs(txtTdmBi, i, 0.5)) 10.70980 10.85640 11.14156 11.18877 11.97978 10
I have a population vector with juveniles and adults, and I would like to record the new population size after each sub-annual transition. The expected output would have the original population vector in the first row and the population at each following time step in the following rows. I've modified the code presented in section 4 here but haven't arrived at what I need. The original algorithm uses an annual projection matrix and projects populations for 8 years.
A <- matrix(c(0, .3, 2, .7), nrow=2) # spring transition matrix
B <- matrix(c(0.5, .3, 3, .7), nrow = 2) # summer transition matrix
C <- matrix(c(0, .3, 4, .7), nrow=2) # fall transition matrix
D <- matrix(c(0.1, .1, 6, .7), nrow = 2) # winter transition matrix
N0 <- c(Juveniles=1,Adults=10) # initial population
steps <- 12 # number of time steps; each run of 4 time steps represents a year
My rough idea is to record population size at the end of each season on every row of the blank matrix N.
# with a column for each stage and a row for each time step
N <- rbind(N0, matrix(0, ncol=2, nrow=steps) )
# use a for-loop to project the population each season and store it.
for(t in 1:steps) {
N[t+1,] <- A%*%N[t,]
N[t+2,] <- B%*%A%*%N[t,]
N[t+3,] <- C%*%B%*%A%*%N[t,]
N[t+4,] <- D%*%C%*%B%*%A%*%N[t,]
N[t+5,] <- A%*%D%*%C%*%B%*%A%*%N[t,]
}
To continue, at N[t+6,] the population should be B%*%A%*%D%*%C%*%B%*%A%*%N[t,], and so on.
At this point I got the error "Error in D %*% C : requires numeric/complex matrix/vector arguments". I don't understand what it means, or why N[t+4,] and N[t+5,] were not calculated despite the supplied formulae.
Here is an incomplete table of N[t+i]
Juveniles Adults
N0 1.00 10.000
20.00 7.300
31.90 11.110
44.44 17.347
0.00 0.000
0.00 0.000
0.00 0.000
0.00 0.000
0.00 0.000
0.00 0.000
0.00 0.000
0.00 0.000
0.00 0.000
How do I change my code so that I don't have to spell out every multiplication chain? Thanks for stopping by my question.
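One way to avoid spelling out each chain is to note that the population at step t+1 is just the current season's matrix applied to the population at step t, so the loop only needs to pick the right matrix each time. The sketch below is one possible approach, not a tested answer to the error above; the list name seasons and the modulo indexing are my own assumptions.

```r
# Seasonal transition matrices and starting population from the question
A <- matrix(c(0, .3, 2, .7), nrow = 2)    # spring
B <- matrix(c(0.5, .3, 3, .7), nrow = 2)  # summer
C <- matrix(c(0, .3, 4, .7), nrow = 2)    # fall
D <- matrix(c(0.1, .1, 6, .7), nrow = 2)  # winter
N0 <- c(Juveniles = 1, Adults = 10)
steps <- 12

# Keep the matrices in seasonal order so they can be indexed by time step
seasons <- list(A, B, C, D)

# One row per time step, starting with the initial population
N <- rbind(N0, matrix(0, ncol = 2, nrow = steps))

for (t in seq_len(steps)) {
  # Cycle through A, B, C, D, A, B, ... as t increases
  M <- seasons[[(t - 1) %% length(seasons) + 1]]
  # Project one season forward from the previous row
  N[t + 1, ] <- M %*% N[t, ]
}
N
```

Because each row is built from the previous one, the product B%*%A%*%D%*%C%*%B%*%A%*%N0 never has to be written out, and the first three projected rows match the values in the table above.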
I am quite new to the use of lists so I apologize if this problem may sound very dumb.
From an original set of 459,046 customers, I have created a function that splits and stores the base in several elements of a list.
sampled_list <- baseSample(dataset = clv_df_cbs, sample.size = 10000, seed = 12345)
Executing this function (baseSample) returns a new list containing mutually exclusive groups of customers (each group is made up of 10,000 customers, apart from the last one, which may be smaller depending on the initial volume).
> sampled_list <- baseSample(dataset = clv_df_cbs, sample.size = 10000, seed = 12345)
[1] "Seed: 12345"
[1] "Total groups created: 46"
[1] "Group size: 10000"
In this case, the output is a list of 46 elements stored in the object called sampled_list.
Now, I want to pass each of these 46 elements to a BTYD model that will forecast the number of transactions in the next 90 days (given the learnings from the input).
The reason I cannot pass the full dataset to the BTYD model is that the model relies heavily on MCMC, so the computation takes so long that the model never produces any output. I have therefore decided to generate forecasts by running the same model several times (on samples that are big enough) until I have passed the whole base through the model.
The operations that need to be performed on each of the elements are the following
# Estimate parameters for element1 of the list
pggg.draws1 <- pggg.mcmc.DrawParameters(element1,
mcmc = 1000, # number of MCMC steps
burnin = 250, # number of initial MCMC steps which are discarded
thin = 10, # only every thin-th MCMC step will be returned
chains = 2, # number of MCMC chains to be run
trace = 50) # print logging step every trace iteration
# generate draws for holdout period
pggg.xstar.draws1 <- mcmc.DrawFutureTransactions(element1, pggg.draws1)
# conditional expectations
element1$xstar.pggg <- apply(pggg.xstar.draws1, 2, mean)
# P(active)
element1$pactive.pggg <- mcmc.PActive(pggg.xstar.draws1)
# P(alive)
element1$palive.pggg <- mcmc.PAlive(pggg.draws1)
# show estimates for first few customers
head(element1[, c("x", "t.x", "",
"xstar.pggg", "pactive.pggg", "palive.pggg")],50)
# report median cohort-level parameter estimates
round(apply(as.matrix(pggg.draws1$level_2), 2, median), 3)
# report mean over median individual-level parameter estimates
median.est1 <- sapply(pggg.draws1$level_1, function(draw) {
  apply(as.matrix(draw), 2, median)
})
round(apply(median.est1, 1, mean), 3)
Ideally, the output should be stored straight into a new data.frame - so I can retrieve the Id and the forecast (amongst other stuff originally included in the dataset).
Here below some mock data to play with from a publicly available dataset.
dataset<-elog2cbs(groceryElog, = "2006-12-01")
# FUNCTION baseSample ####
baseSample <- function(dataset, sample.size, seed=NULL) {
seed.value <- if(is.null(seed)) {
} else {
data <- with(dataset, dataset[order(sample(cust, nrow(dataset))),])
set.sample.size <- sample.size
data$cycles_group <- paste0("sample_", ceiling(1:nrow(data)/set.sample.size))
df_list <- split(data, data$cycles_group)
print(paste0("Seed: ", seed.value))
print(paste0("Total groups created: ", length(unique(data$cycles_group))))
print(paste0("Group size: ", set.sample.size))
# ** OUTPUT: Base split in lists ####
sampled_list <- baseSample(dataset = dataset, sample.size = 100, seed = 12345)
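For readers who only need the splitting step, here is a minimal base R sketch of the same idea: shuffle the rows, label consecutive blocks, and split. The function name split_into_groups and the toy customers table are assumptions for illustration; this is not the baseSample function above.

```r
# Hypothetical helper: split a data frame into random, mutually exclusive groups
split_into_groups <- function(dataset, sample.size, seed = NULL) {
  # Only fix the random seed when one is supplied
  if (!is.null(seed)) set.seed(seed)
  # Shuffle the rows, then cut them into consecutive blocks of sample.size
  shuffled <- dataset[sample(nrow(dataset)), , drop = FALSE]
  group <- paste0("sample_", ceiling(seq_len(nrow(shuffled)) / sample.size))
  split(shuffled, group)
}

# Toy customer table with made-up values
customers <- data.frame(cust = 1:25, x = (1:25) %% 4)
groups <- split_into_groups(customers, sample.size = 10, seed = 12345)

# Group sizes: every group is full except possibly the last one
sapply(groups, nrow)
```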
In base R, you can use lapply to iterate a function over the elements of a list and return a new list with the results of those iterations. After using your example code to generate a list called sampled_list...
# turn the code for the operations you want to perform on each list element into a function,
# with a couple of minor tweaks
thingy <- function(i) {
# Estimate parameters for element1 of the list
pggg.draws1 <- pggg.mcmc.DrawParameters(i,
mcmc = 1000, # number of MCMC steps
burnin = 250, # number of initial MCMC steps which are discarded
thin = 10, # only every thin-th MCMC step will be returned
chains = 2, # number of MCMC chains to be run
trace = 50) # print logging step every trace iteration
# generate draws for holdout period
pggg.xstar.draws1 <- mcmc.DrawFutureTransactions(i, pggg.draws1)
# conditional expectations
i$xstar.pggg <- apply(pggg.xstar.draws1, 2, mean)
# P(active)
i$pactive.pggg <- mcmc.PActive(pggg.xstar.draws1)
# P(alive)
i$palive.pggg <- mcmc.PAlive(pggg.draws1)
# show estimates for first few customers [commenting out for this iterated version]
# head(element1[, c("x", "t.x", "", "xstar.pggg", "pactive.pggg", "palive.pggg")],50)
# report median cohort-level parameter estimates
round(apply(as.matrix(pggg.draws1$level_2), 2, median), 3)
# report mean over median individual-level parameter estimates
median.est1 <- sapply(pggg.draws1$level_1, function(draw) {
  apply(as.matrix(draw), 2, median)
})
# get the bits you want in a named vector
z <- round(apply(median.est1, 1, mean), 3)
# convert that named vector of results into a one-row data frame to make collapsing easier
data.frame(as.list(z))
}
# now use lapply to iterate that function over the elements of your list
results <- lapply(sampled_list, thingy)
# now bind the results into a data frame
boundresults <- do.call(rbind, results)
Results (which took a while to get):
k lambda mu tau z
sample_1 4.200 0.174 0.091 102.835 0.27
sample_10 3.117 0.149 0.214 128.143 0.29
sample_11 4.093 0.154 0.115 130.802 0.30
sample_12 4.191 0.142 0.053 114.108 0.33
sample_13 2.605 0.155 0.071 160.743 0.35
sample_14 9.196 0.210 0.084 111.747 0.36
sample_15 2.005 0.145 0.091 298.872 0.40
sample_16 2.454 0.111 0.019 78731750.121 0.70
sample_2 2.808 0.138 0.059 812.278 0.40
sample_3 4.327 0.166 0.116 559.318 0.42
sample_4 9.266 0.166 0.038 146.283 0.40
sample_5 3.277 0.157 0.073 105.915 0.33
sample_6 9.584 0.184 0.086 118.299 0.31
sample_7 4.244 0.189 0.118 54.945 0.23
sample_8 4.388 0.147 0.085 325.054 0.36
sample_9 7.898 0.181 0.052 83.892 0.33
You can also combine those last two steps into a single line: do.call(rbind, lapply(...)). If you want to make the row names in the results table into a column, you could do boundresults$sample <- row.names(boundresults) after making that table. And if you don't like creating new objects in your environment, you could put that function inside the call to lapply, i.e., lapply(sampled_list, function(i) { [your code] }).
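Because the MCMC run takes a while, it can help to try the iterate-then-bind pattern on toy data first. In this sketch toy_list and summarise_group are made-up stand-ins for sampled_list and the model function; only lapply, do.call and rbind are the real pieces.

```r
# Toy list of data frames standing in for sampled_list (made-up values)
toy_list <- list(
  sample_1 = data.frame(x = c(1, 2, 3)),
  sample_2 = data.frame(x = c(10, 20, 30))
)

# Stand-in for the per-element model run; returns a one-row data frame
summarise_group <- function(d) {
  data.frame(mean_x = mean(d$x), n = nrow(d))
}

# Iterate and bind in one step; the list names become the row names
bound <- do.call(rbind, lapply(toy_list, summarise_group))

# Turn the row names into a regular column
bound$sample <- row.names(bound)
bound
```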
I am looking for an R package that can help me find the best weights, i.e. the weights that allocate as many of my data points as possible to the top bucket.
Basically, I want to tune my weights in such a way that I achieve this goal.
Below is the example.
Final Score= Score1*Weight1+ Score2*Weight2+Score3*Weight3
The sum of my weights is 1. W1+W2+W3=1
I want to tune my weights in such a way that most of my cases fall into the "90-100" bucket. I know there won't be a perfect combination, but I want to capture as many cases as possible. I am currently trying to do this manually in Excel, using a pivot table, but I would like to know if there is a package in R that can help me achieve my goal.
The group allocations "70-80" and "80-90" are something I made in Excel, using an if-else condition.
R Pivot Result:
I would appreciate it if someone could help me with this.
Here's an approach that tries to get all the final scores as close as possible to 0.9 using a nested optimisation approach.
Here's your original data:
# Original data
df <- read.table(text = "Score1, Score2, Score3
0.87,0.73,0.41
0.82,0.73,0.85
0.82,0.37,0.85
0.58,0.95,0.42
1,1,0.9", header = TRUE, sep = ",")
This is the cost function for the first weight.
# Outer cost function
cost_outer <- function(w1){
  # Run nested optimisation
  res <- optimise(cost_nested, lower = 0, upper = 1 - w1, w1 = w1)
  # Spit second weight into a global variable
  res_outer <<- res$minimum
  # Return the cost function value
  res$objective
}
This is the cost function for the second weight.
# Nested cost function
cost_nested <- function(w2, w1){
  # Calculate final weight
  w <- c(w1, w2, 1 - w2 - w1)
  # Distance from desired interval
  res <- 0.9 - rowSums(w*df)
  # Zero if negative distance, square distance otherwise
  sum(ifelse(res < 0, 0, res^2))
}
Next, I run the optimisation.
# Repackage weights
weight <- c(optimise(cost_outer, lower = 0, upper = 1)$minimum, res_outer)
weight <- c(weight, 1 - sum(weight))
Finally, I show the results.
# Final scores
cbind(df, Final = rowSums(weight * df))
# Score1 Score2 Score3 Final
# 1 0.87 0.73 0.41 0.7615286
# 2 0.82 0.73 0.85 0.8229626
# 3 0.82 0.37 0.85 0.8267400
# 4 0.58 0.95 0.42 0.8666164
# 5 1.00 1.00 0.90 0.9225343
Notice, however, that this code gets the final scores as close as possible to the interval, which is different from getting the most scores in that interval. That can be achieved by switching out the nested cost function with something like:
# Nested cost function
cost_nested <- function(w2, w1){
  # Calculate final weight
  w <- c(w1, w2, 1 - w2 - w1)
  # Number of instances below the desired interval (to be minimised)
  sum(rowSums(w*df) < 0.9)
}
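A count-based cost is a step function, so a one-dimensional optimiser may struggle with it. With only three weights, a plain grid search is a simple cross-check. The sketch below is my own illustration, not part of the answer above: the 0.01 grid step is an arbitrary assumption, and it uses a matrix product for the weighted sum because w * df recycles w down the columns rather than applying one weight per column.

```r
# Data from the question, one row per case
scores <- matrix(c(0.87, 0.73, 0.41,
                   0.82, 0.73, 0.85,
                   0.82, 0.37, 0.85,
                   0.58, 0.95, 0.42,
                   1.00, 1.00, 0.90),
                 ncol = 3, byrow = TRUE,
                 dimnames = list(NULL, c("Score1", "Score2", "Score3")))

# All weight triples on a 0.01 grid that sum to 1
grid <- expand.grid(w1 = seq(0, 1, by = 0.01), w2 = seq(0, 1, by = 0.01))
grid <- grid[grid$w1 + grid$w2 <= 1 + 1e-9, ]
grid$w3 <- pmax(0, 1 - grid$w1 - grid$w2)

# One column of final scores per weight triple (cases in rows)
final <- scores %*% t(as.matrix(grid[, c("w1", "w2", "w3")]))

# Count the cases that reach the 0.9 bucket for each triple and keep the best
grid$n_in_bucket <- colSums(final >= 0.9)
best <- grid[which.max(grid$n_in_bucket), ]
best
```

On this data no triple gets more than two cases into the bucket, since the first three rows can never exceed 0.87, 0.85 and 0.85 respectively under any weights that sum to 1.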
This can be formulated as a Mixed Integer Programming (MIP) problem. The mathematical model can look like:
The binary variable δi indicates whether the final score Fi is inside the interval [0.9,1]. M is a "large" value (if all your data is between 0 and 1 we can choose M=1). ai,j is your data.
The objective function and all constraints are linear, so we can use standard MIP solvers to solve this problem. MIP solvers for R are readily available.
PS in the example groups overlap. That does not make much sense to me. I think if we have "90-100" we should not also have "60-100".
PS2. If all data is between 0 and 1, we can simplify the sandwich equation a bit: we can drop the right part.
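One formulation consistent with this description, offered as an assumption rather than the author's exact model, uses weights w_j, final scores F_i and the binary indicators δ_i:

```latex
\begin{align}
\max \quad & \sum_i \delta_i \\
\text{s.t.} \quad & F_i = \sum_j a_{i,j}\, w_j && \forall i \\
& 0.9 - M(1-\delta_i) \le F_i \le 1 + M(1-\delta_i) && \forall i \\
& \sum_j w_j = 1 \\
& w_j \ge 0, \quad \delta_i \in \{0,1\}
\end{align}
```

When δ_i = 1 the sandwich constraint forces F_i into [0.9, 1]; when δ_i = 0 the big-M terms relax both sides. If all data lies between 0 and 1, F_i ≤ 1 holds automatically, which is why the right-hand part can be dropped.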
For the small example data set I get:
---- 56 PARAMETER a
j1 j2 j3
i1 0.870 0.730 0.410
i2 0.820 0.730 0.850
i3 0.820 0.370 0.850
i4 0.580 0.950 0.420
i5 1.000 1.000 0.900
---- 56 VARIABLE w.L weights
j1 0.135, j2 0.865
---- 56 VARIABLE f.L final scores
i1 0.749, i2 0.742, i3 0.431, i4 0.900, i5 1.000
---- 56 VARIABLE delta.L selected
i4 1.000, i5 1.000
---- 56 VARIABLE z.L = 2.000 objective
(zeros are not printed)
I am using findAssocs() of the tm package on a document frequency matrix to identify words which are associated with particular term(s) across various documents in a corpus.
My problem is that I get different output when giving a vector of terms as input to the function compared to giving a single term as input.
Here is my example.
txt <- c("alpha bravo", "alpha charlie", "alpha charlie", "zulu")
corp <- Corpus(VectorSource(txt))
dtm <- DocumentTermMatrix(corp)
Returns the following dtm
> as.matrix(dtm)
Docs alpha bravo charlie zulu
1 1 1 0 0
2 1 0 1 0
3 1 0 1 0
4 0 0 0 1
If I would want to identify all terms associated with "alpha" I get the following output (as intended):
> findAssocs(dtm, "alpha", 0.00)
charlie bravo
0.58 0.33
I could do the same for "bravo" and get the following output (as intended):
> findAssocs(dtm, "bravo", 0.00)
As I would like to find those associations for a number of terms I have passed a vector to findAssocs in order to get the required output. However, if I pass a vector of terms (chr) to the function the output is different from the one I get for single inputs:
> findAssocs(dtm, c("alpha","bravo"), 0.00)
Actually, the association between "alpha" and "bravo" is omitted, which is not the behavior I would have expected here. The function seems to treat the individual terms independently of each other and thus does not analyze the correlation between "alpha" and "bravo" if they are both passed to the function in a vector.
Can anyone explain that behavior and tell me how to avoid it? As a workaround I could apply the function to each single term, but that is not really handy...
What I find odd is that the correlation between "alpha" and "bravo" is not omitted if we plot the associations, e.g. through the following code:
> freqTerm <- findFreqTerms(dtm, 1)
> freqTerm
[1] "alpha" "bravo" "charlie" "zulu"
plot(dtm, term=freqTerm, corThreshold=0.0, weighting=T, attrs=list(node=list(fixedsize=FALSE, shape="ellipse")))
How is plot(dtm, term=freqTerm ... different from "findAssocs()"?
tm::findAssocs() omits direct comparisons for exactly the reasons stated in the comment by #Steven Beauport. Given that you are searching for a small set of terms likely to be highly correlated, this seems more like a bug than a feature. This is illustrated by the example of this function (see ?tm::findAssocs) where the terms oil and opec are the most similar, but this is masked by the omission of each from the other's association vector.
An alternative is to use the equivalent feature from the quanteda package:
txt <- c("alpha bravo", "alpha charlie", "alpha charlie", "zulu")
corp <- corpus(txt)
dtm <- dfm(corp, verbose = FALSE)
# this also works fine if you want to go straight from text:
# dtm <- dfm(txt, verbose = FALSE)
(simlist <- similarity(dtm, c("alpha","bravo"), margin = "features"))
## similarity Matrix:
## $alpha
## charlie bravo zulu
## 0.5774 0.3333 -1.0000
## $bravo
## alpha zulu charlie
## 0.3333 -0.3333 -0.5774
Or if you prefer it as a matrix:
## alpha bravo
## alpha 1.0000000 0.3333333
## charlie 0.5773503 -0.5773503
## bravo 0.3333333 1.0000000
## zulu -1.0000000 -0.3333333
similarity() can do cosine similarities as well as other similarities defined in the proxy package, but the (Pearson's) correlation and cosine methods are currently implemented in fully sparse computation, whereas the others are not (yet). By defining margin = "documents", furthermore, you can compare documents instead of terms, for instance for clustering.
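For a matrix this small, the pairwise correlations can also be checked directly with base R's cor(), which makes it easy to see that "alpha" and "bravo" are correlated with each other. This is only a sanity check on a dense matrix, not a replacement for the sparse computation described above; the object names are my own.

```r
# Document-term matrix from the question, typed in by hand
m <- matrix(c(1, 1, 0, 0,
              1, 0, 1, 0,
              1, 0, 1, 0,
              0, 0, 0, 1),
            nrow = 4, byrow = TRUE,
            dimnames = list(Docs = as.character(1:4),
                            Terms = c("alpha", "bravo", "charlie", "zulu")))

# Pearson correlation between every pair of terms (columns)
term_cor <- cor(m)

# Columns for the two terms of interest, including their mutual correlation
round(term_cor[, c("alpha", "bravo")], 4)
```

The alpha/bravo entry is 0.3333 in both directions, matching the matrix shown above.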
I am building a tree using the partykit R package, and I am wondering if there is a simple, efficient way to determine the depth number at each internal node. For example, the root node would have depth 0, the first two kid nodes have depth 1, the next kid nodes have depth 2, and so forth. This will eventually be used to calculate the minimal depth of a variable. Below is a very basic example (taken from vignette("constparty", package="partykit")):
library(rpart)
library(partykit)
data("Titanic", package = "datasets")
ttnc <- as.data.frame(Titanic)
ttnc <- ttnc[rep(1:nrow(ttnc), ttnc$Freq), 1:4]
names(ttnc)[2] <- "Gender"
rp <- rpart(Survived ~ ., data = ttnc)
ttncTree <- as.party(rp)
#This is one of my many attempts which does NOT work
internalNodes<-nodeids(ttncTree)[-nodeids(ttncTree, terminal = TRUE)]
depth(ttncTree)-unlist(nodeapply(ttncTree, ids=internalNodes, FUN=function(n){depth(n)}))
In this example, I want to output something similar to:
nodeid = 1 2 4 7
depth = 0 1 2 1
I apologize if my question is too specific.
Here's a possible solution which should be efficient enough as usually the trees have no more than several dozens of nodes.
I'm ignoring node #1, as its depth is always 0 and hence there is no point in either calculating or showing it (IMO).
Inters <- nodeids(ttncTree)[-nodeids(ttncTree, terminal = TRUE)][-1]
table(unlist(sapply(Inters, function(x) intersect(Inters, nodeids(ttncTree, from = x)))))
# 2 4 7
# 1 2 1
I had to revisit this problem recently. Below is a function to determine the depth of each node. I count the depth based on the number of times a vertical line | appears on each node's line in the printed tree.
library(stringr) # for str_count()
idDepth <- function(tree) {
  outTree <- capture.output(tree)
  idCount <- 1
  depthValues <- rep(NA, length(tree))
  names(depthValues) <- 1:length(tree)
  for (index in seq_along(outTree)){
    if (grepl("\\[[0-9]+\\]", outTree[index])) {
      depthValues[idCount] <- str_count(outTree[index], "\\|")
      idCount = idCount + 1
    }
  }
  return(depthValues)
}
> idDepth(ttncTree)
1 2 3 4 5 6 7 8 9
0 1 2 2 3 3 1 2 2
There definitely seems to be a simpler, faster solution, but this is faster than using the intersect() function. Below is an example of the computation time for a large tree (around 1,500 nodes)
# Compare computation time for large tree #
library(mlbench)
dat <- data.frame(mlbench.friedman1(5000))
rp <- rpart(as.formula(paste0("y ~ ", paste(paste0("x.", 1:10), collapse=" + "))),
data=dat, control = rpart.control(cp = -1, minsplit=3, maxdepth = 10))
partyTree <- as.party(rp)
> length(partyTree) #Number of nodes
[1] 1503
> # Intersect() computation time
> Inters <- nodeids(partyTree)[-nodeids(partyTree, terminal = TRUE)][-1]
> system.time(table(unlist(sapply(Inters, function(x) intersect(Inters, nodeids(partyTree, from = x))))))
user system elapsed
22.38 0.00 22.44
> # Proposed computation time
> system.time(idDepth(partyTree))
user system elapsed
2.38 0.00 2.38
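Parsing printed output works, but depth is naturally a recursive quantity: each child is one level deeper than its parent. The sketch below shows that recursion on a toy nested list shaped like the Titanic tree above; the node() constructor and node_depths() are hypothetical helpers of mine, not partykit functions.

```r
# Toy tree as nested lists: each node has an id and a list of kid nodes
node <- function(id, kids = list()) list(id = id, kids = kids)

toy_tree <- node(1, list(
  node(2, list(node(3), node(4, list(node(5), node(6))))),
  node(7, list(node(8), node(9)))
))

# Walk the tree once, passing the current depth down to the kids
node_depths <- function(n, depth = 0) {
  out <- setNames(depth, n$id)
  for (k in n$kids) out <- c(out, node_depths(k, depth + 1))
  out
}

depths <- node_depths(toy_tree)
depths
```

With a real party object, the same recursion could presumably be run over node_party(tree), using kids_node() and id_node() in place of the list fields; that variant is untested here.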