I am using findAssocs() from the tm package on a document-term matrix to identify words that are associated with particular term(s) across the documents in a corpus.
My problem is that the output differs when I pass a vector of terms to the function compared to passing a single term.
Here is my example.
library(tm)
txt <- c("alpha bravo", "alpha charlie", "alpha charlie", "zulu")
corp <- Corpus(VectorSource(txt))
dtm <- DocumentTermMatrix(corp)
This returns the following dtm:
> as.matrix(dtm)
    Terms
Docs alpha bravo charlie zulu
   1     1     1       0    0
   2     1     0       1    0
   3     1     0       1    0
   4     0     0       0    1
If I want to identify all terms associated with "alpha", I get the following output (as intended):
> findAssocs(dtm, "alpha", 0.00)
$alpha
charlie bravo
0.58 0.33
I could do the same for "bravo" and get the following output (as intended):
> findAssocs(dtm, "bravo", 0.00)
$bravo
alpha
0.33
As I would like to find these associations for a number of terms, I passed a vector to findAssocs to get the required output. However, if I pass a vector of terms (chr) to the function, the output differs from the one I get for single inputs:
> findAssocs(dtm, c("alpha","bravo"), 0.00)
$alpha
charlie
0.58
$bravo
numeric(0)
Actually, the association between "alpha" and "bravo" is omitted, which is not the behavior I would have expected here. The function seems to treat the individual terms independently of each other and thus does not report the correlation between "alpha" and "bravo" when both are passed to the function in a vector.
Can anyone explain this behavior and tell me how to avoid it? As a workaround I could apply the function to each single term, but that is not really handy...
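(A minimal sketch of that per-term workaround, looping over the terms one at a time:)
terms <- c("alpha", "bravo")
assocs <- setNames(lapply(terms, function(tt) findAssocs(dtm, tt, 0.00)[[tt]]), terms)
assocs$bravo # now includes alpha (0.33), unlike the vector call above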
UPDATE
What I find odd is that the correlation between "alpha" and "bravo" is not omitted if we plot the associations, e.g. with the following code:
> freqTerm <- findFreqTerms(dtm, 1)
> freqTerm
[1] "alpha" "bravo" "charlie" "zulu"
plot(dtm, term=freqTerm, corThreshold=0.0, weighting=T, attrs=list(node=list(fixedsize=FALSE, shape="ellipse")))
How is plot(dtm, term=freqTerm, ...) different from findAssocs()?
tm::findAssocs() omits direct comparisons for exactly the reasons stated in the comment by @Steven Beauport. Given that you are searching for a small set of terms likely to be highly correlated, this seems more like a bug than a feature. This is illustrated by the example for this function (see ?tm::findAssocs), where the terms oil and opec are the most similar, but this is masked by the omission of each from the other's association vector.
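For instance, a sketch using the crude data from that help page (exact correlations depend on your tm version):
library(tm)
data("crude")
tdm <- TermDocumentMatrix(crude)
findAssocs(tdm, "oil", 0.7)                    # queried alone, $oil includes opec
findAssocs(tdm, c("oil", "opec"), c(0.7, 0.7)) # queried together, each is masked from the other's vector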
An alternative is to use the equivalent feature from the quanteda package:
library(quanteda)
txt <- c("alpha bravo", "alpha charlie", "alpha charlie", "zulu")
corp <- corpus(txt)
dtm <- dfm(corp, verbose = FALSE)
# this also works fine if you want to go straight from text:
# dtm <- dfm(txt, verbose = FALSE)
(simlist <- similarity(dtm, c("alpha","bravo"), margin = "features"))
## similarity Matrix:
## $alpha
## charlie bravo zulu
## 0.5774 0.3333 -1.0000
##
## $bravo
## alpha zulu charlie
## 0.3333 -0.3333 -0.5774
Or if you prefer it as a matrix:
as.matrix(simlist)
## alpha bravo
## alpha 1.0000000 0.3333333
## charlie 0.5773503 -0.5773503
## bravo 0.3333333 1.0000000
## zulu -1.0000000 -0.3333333
similarity() can do cosine similarities as well as other similarities defined in the proxy package, but the (Pearson's) correlation and cosine methods are currently implemented as fully sparse computations, whereas the others are not (yet). By setting margin = "documents", furthermore, you can compare documents instead of terms, for instance for clustering.
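For example, a minimal sketch under the same (older) quanteda API as above, assuming its default document names text1, text2, ...:
## compare the first document to the others, rather than comparing features
similarity(dtm, "text1", margin = "documents")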
I'm trying to implement a simple evaluation scheme using exams, but none of the options seems to do what I'd like:
There are 5 answer options. I want to give 0.2 points for each marked correct answer and each unmarked incorrect answer, and zero points for each unmarked correct answer and each marked incorrect answer. A task can therefore yield 0, .2, .4, .6, .8, or 1 points.
I'm aware that this evaluation scheme may have some shortcomings, but I'm counterbalancing those in other ways.
I was able to implement this when I did scanned exams, because I could use a string distance function to tell how many characters differed between the two strings coding the answers and the solutions. But I want to do this in Moodle now, so I cannot control the evaluation.
Here are some examples that I tried:
ee <- exams_eval(partial = TRUE, rule = "all", negative = FALSE)
ee$pointsum("01111", "10000") # should be 0 and returns 0
ee$pointsum("01111", "10001") # should be .2 but returns 0
ee$pointsum("11111", "10001") # should be .4 and returns .4
ee$pointsum("00000", "11001") # should be .4 but returns 0
ee$pointsum("11011", "00011") # should be .6 but returns .5
ee$pointsum("11111", "10101") # should be .6 and returns .6
ee$pointsum("11001", "10001") # Should be .8 but returns .66
ee$pointsum("00000", "00001") # should be .8 but returns 0
ee$pointsum("11001", "11001") # Should be 1 and returns 1
ee$pointsum("00000", "00000") # Should be 1 but returns 0
The previous examples yield the same results when using rule = "false", rule = "false2", or rule = "true". When using rule = "none", this is the only change:
ee$pointsum("01111", "10001") # should be .2 but returns 0.25
Is there a way to implement the above mentioned evaluation scheme in moodle?
The R/exams package currently does not support the desired evaluation scheme because Moodle does not support it. The Moodle docs at https://docs.moodle.org/36/en/Moodle_XML_format#Multiple_choice show that partial credit schemes always work in the following way:
Not marking/clicking an <answer> does not yield any points.
Marking/clicking an answer option yields a certain fraction of the overall points.
Hence, R/exams handles this by assigning the fraction 1/#correct to marking/clicking a correct answer. The rule argument only controls which fraction is subtracted when marking/clicking an incorrect answer. The default is the "false2" rule that essentially subtracts 1/#incorrect. For example, an item with 2 correct and 3 incorrect answers is processed with:
ee2 <- exams_eval(partial = TRUE, rule = "false2", negative = FALSE)
ee2$pointvec("11000")
## pos neg
## 0.5000000 -0.3333333
When you use rule = "all" then 100% of the points are removed if an incorrect answer is marked/clicked:
ee <- exams_eval(partial = TRUE, rule = "all", negative = FALSE)
ee$pointvec("11000")
## pos neg
## 0.5 -1.0
There are learning management systems that support more flexible ways of computing the points (e.g., in QTI this is possible in principle), but I don't think your particular scheme can be implemented in Moodle. (If anyone knows more than the Moodle docs above, let me know!)
(You said you are aware of the drawbacks of your evaluation scheme - which is, of course, fair enough. However, just for the record in case anybody else reads this: I'm personally not very fond of the scheme you are proposing. Even if on average each answer option is correct with 50% probability, students can obtain 50% of the points on average by either always clicking all of the answer options or by always clicking none of them. This can get even higher if the probability for each option deviates from 50%. Hence, this sets strange incentives for some students - at least the business and economics students I'm typically teaching.)
Strangely, there is no combination of exams_eval parameters that would give the desired scheme. One can verify this by checking all 20 combinations. For instance,
combs <- expand.grid(partial = c(TRUE, FALSE), negative = c(TRUE, FALSE),
rule = c("false2", "false", "true", "all", "none"),
stringsAsFactors = FALSE)
sapply(1:nrow(combs), function(n) {
ee <- exams_eval(partial = combs[n, 1], negative = combs[n, 2], rule = combs[n, 3])
ee$pointsum("11001", "10001")
})
# [1] 0.6666667 -1.0000000 0.6666667 0.0000000 0.6666667 -1.0000000 0.6666667 0.0000000
# [9] 0.6666667 -1.0000000 0.6666667 0.0000000 0.6666667 -1.0000000 0.6666667 0.0000000
# [17] 0.6666667 -1.0000000 0.6666667 0.0000000
That's particularly strange because the scheme is quite simple - counting matches, which we may implement as follows:
pointsum <- function(correct, answer) {
correct <- as.numeric(strsplit(correct, "")[[1]])
answer <- as.numeric(strsplit(answer, "")[[1]])
mean(correct == answer)
}
pointsum("01111", "10000") # should be 0
# [1] 0
pointsum("01111", "10001") # should be .2
# [1] 0.2
pointsum("11111", "10001") # should be .4
# [1] 0.4
pointsum("00000", "11001") # should be .4
# [1] 0.4
pointsum("11011", "00011") # should be .6
# [1] 0.6
pointsum("11111", "10101") # should be .6
# [1] 0.6
pointsum("11001", "10001") # should be .8
# [1] 0.8
pointsum("00000", "00001") # should be .8
# [1] 0.8
pointsum("11001", "11001") # should be 1
# [1] 1
pointsum("00000", "00000") # should be 1
# [1] 1
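As a quick numeric illustration of the incentive issue raised earlier, using the same pointsum() helper (with 3 of 5 options correct, marking everything scores .6 and marking nothing scores .4, i.e. .5 on average):
pointsum("11001", "11111") # mark every option: scores the share of correct options
# [1] 0.6
pointsum("11001", "00000") # mark no option: scores the share of incorrect options
# [1] 0.4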
I'm using the mgcv package to fit some polynomial splines to some data via:
x.gam <- gam(cts ~ s(time, bs = "ad"), data = x.dd,
family = poisson(link = "log"))
I'm trying to extract the functional form of the fit. x.gam is a gamObject, and I've been reading the documentation but haven't found enough information in order to manually reconstruct the fitted function.
x.gam$smooth contains information about where the knots have been placed;
x.gam$coefficients gives the spline coefficients, but I don't know what order of polynomial splines is used, and looking at the code has not revealed anything.
Is there a neat way to extract the knots, coefficients and basis used so that one can manually reconstruct the fit?
I don't have your data, so I take the following example from ?adaptive.smooth to show you where you can find the information you want. Note that although this example uses Gaussian data rather than Poisson data, only the link function differs; everything else is standard.
library(mgcv)
x <- 1:1000/1000 # data on [0, 1]
mu <- exp(-400*(x-.6)^2)+5*exp(-500*(x-.75)^2)/3+2*exp(-500*(x-.9)^2)
y <- mu+0.5*rnorm(1000)
b <- gam(y~s(x,bs="ad",k=40,m=5))
All information on smooth construction is stored in b$smooth; we extract it:
smooth <- b$smooth[[1]] ## extract smooth object for first smooth term
knots
smooth$knots gives you location of knots.
> smooth$knots
[1] -0.081161 -0.054107 -0.027053 0.000001 0.027055 0.054109 0.081163
[8] 0.108217 0.135271 0.162325 0.189379 0.216433 0.243487 0.270541
[15] 0.297595 0.324649 0.351703 0.378757 0.405811 0.432865 0.459919
[22] 0.486973 0.514027 0.541081 0.568135 0.595189 0.622243 0.649297
[29] 0.676351 0.703405 0.730459 0.757513 0.784567 0.811621 0.838675
[36] 0.865729 0.892783 0.919837 0.946891 0.973945 1.000999 1.028053
[43] 1.055107 1.082161
Note that three exterior knots are placed beyond each end of [0, 1] in order to construct the spline basis.
basis class
attr(smooth, "class") tells you the type of spline. As you can read in ?adaptive.smooth, for bs = "ad" mgcv uses P-splines, hence you get "pspline.smooth".
mgcv uses a 2nd-order P-spline; you can verify this by checking the difference matrix smooth$D. Below is a snapshot:
> smooth$D[1:6,1:6]
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] 1 -2 1 0 0 0
[2,] 0 1 -2 1 0 0
[3,] 0 0 1 -2 1 0
[4,] 0 0 0 1 -2 1
[5,] 0 0 0 0 1 -2
[6,] 0 0 0 0 0 1
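A quick way to confirm the order (a sketch: the rows of a 2nd-order difference matrix are exactly (1, -2, 1)):
D2 <- diff(diag(ncol(smooth$D)), differences = 2)
all.equal(unname(smooth$D), D2) # should be TRUE for a 2nd-order P-spline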
coefficients
You already know that b$coefficients contains the model coefficients:
beta <- b$coefficients
Note that this is a named vector:
> beta
(Intercept) s(x).1 s(x).2 s(x).3 s(x).4 s(x).5
0.37792619 -0.33500685 -0.30943814 -0.30908847 -0.31141148 -0.31373448
s(x).6 s(x).7 s(x).8 s(x).9 s(x).10 s(x).11
-0.31605749 -0.31838050 -0.32070350 -0.32302651 -0.32534952 -0.32767252
s(x).12 s(x).13 s(x).14 s(x).15 s(x).16 s(x).17
-0.32999553 -0.33231853 -0.33464154 -0.33696455 -0.33928755 -0.34161055
s(x).18 s(x).19 s(x).20 s(x).21 s(x).22 s(x).23
-0.34393354 -0.34625650 -0.34857906 -0.05057041 0.48319491 0.77251118
s(x).24 s(x).25 s(x).26 s(x).27 s(x).28 s(x).29
0.49825345 0.09540020 -0.18950763 0.16117012 1.10141701 1.31089436
s(x).30 s(x).31 s(x).32 s(x).33 s(x).34 s(x).35
0.62742937 -0.23435309 -0.19127140 0.79615752 1.85600016 1.55794576
s(x).36 s(x).37 s(x).38 s(x).39
0.40890236 -0.20731309 -0.47246357 -0.44855437
basis matrix / model matrix / linear predictor matrix (lpmatrix)
You can get the model matrix from:
mat <- predict.gam(b, type = "lpmatrix")
This is an n-by-p matrix, where n is the number of observations and p is the number of coefficients. This matrix has column names:
> head(mat[,1:5])
(Intercept) s(x).1 s(x).2 s(x).3 s(x).4
1 1 0.6465774 0.1490613 -0.03843899 -0.03844738
2 1 0.6437580 0.1715691 -0.03612433 -0.03619157
3 1 0.6384074 0.1949416 -0.03391686 -0.03414389
4 1 0.6306815 0.2190356 -0.03175713 -0.03229541
5 1 0.6207361 0.2437083 -0.02958570 -0.03063719
6 1 0.6087272 0.2688168 -0.02734314 -0.02916029
The first column is all 1s, giving the intercept, while s(x).1 is the first basis function for s(x). If you want to see what an individual basis function looks like, you can plot a column of mat against your variable. For example:
plot(x, mat[, "s(x).20"], type = "l", main = "20th basis")
linear predictor
If you want to manually construct the fit, you can do:
pred.linear <- mat %*% beta
Note that this is exactly what you can get from b$linear.predictors or
predict.gam(b, type = "link")
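A quick consistency check:
all.equal(as.numeric(pred.linear), as.numeric(b$linear.predictors))
# should be TRUE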
response / fitted values
For non-Gaussian data, if you want the response, you can apply the inverse link function to the linear predictor to map it back to the original scale.
Family information is stored in gamObject$family, and gamObject$family$linkinv is the inverse link function. The example above certainly gives you the identity link, but for your fitted object x.gam you can do:
x.gam$family$linkinv(x.gam$linear.predictors)
Note this is the same as x.gam$fitted.values, or
predict.gam(x.gam, type = "response").
Other links
I have just realized that there have been quite a lot of similar questions before.
This answer by Gavin Simpson is great for predict.gam(..., type = 'lpmatrix').
This answer is about predict.gam(..., type = 'terms').
But anyway, the best reference is always ?predict.gam, which includes extensive examples.
I am writing my Master's final project, in which I derive the probability of default using the Black-Scholes-Merton model. I have got stuck in the R code. Mathematically, I want to solve this system of nonlinear equations with the package nleqslv:
library(nleqslv)
T <- 1
D1 <- 20010.75
R <- 0.8516
sigmaS <- .11
SO1 <- 1311.74
fnewton <- function(x){
y <- numeric(2)
d1 <- (log(x[1]/D1)+(R+x[2]^2/2)*T)/x[2]*sqrt(T)
d2 <- d1 - x[2]*sqrt(T)
y[1] <- SO1 - (x[1]*pnorm(d1) - exp(-R*T)*D1*pnorm(d2))
y[2] <- sigmaS*SO1 - pnorm(d1)*x[2]*x[1]
y
}
xstart <- c(1311.74,0.11)
nleqslv(xstart, fnewton, method="Broyden")
# $x
# [1] 1311.74 0.11
# $fvec
# [1] 1311.7400 144.2914
# $termcd
# [1] 6
# $message
# [1] "Jacobian is singular (see allowSingular option)"
# $scalex
# [1] 1 1
# $nfcnt
# [1] 0
# $njcnt
# [1] 1
# $iter
# [1] 1
I have tried this with many values of the 5 inputs (stated above; I have computed them for 2 companies for different years), but I do not get the final values of S0 and sigmaV.
I get the message "Jacobian is singular (see allowSingular option)". If I allow a singular Jacobian using control=list(trace=1,allowSingular=TRUE), still no answer is displayed. I do not know how to obtain the solution for these 2 variables now.
I really don't know what I am doing wrong, as I based my model on Tetereva's slides (her model code is on slide 5), whose presentation is the first result when googling:
https://www.google.de/search?q=moodys+KMV+in+R&rlz=1C1SVED_enDE401DE401&aq=f&oq=moodys+KMV+in+R&aqs=chrome.0.57.13309j0&sourceid=chrome&ie=UTF-8#q=distance+to+default+in+R
Like me, though more successfully, she calculates the Distance to Default risk measure via the Black-Scholes-Merton approach. In this model, the value of equity (usually represented by the market capitalization, SO1) can be written as a European call option.
The other variables are:
x[1]: the variable I want to derive, value of total assets
x[2]: the variable I want to derive, volatility of total assets
D1: the book value of debt (19982009)
R: a riskfree interest rate
T: is set to 1 year (time)
sigmaS: estimated (historical) equity volatility
You should be able to use the initial values of SO1 and sigmaS as starting values for nleqslv.
First of all, the R code given by Tetereva doesn't seem quite correct (her variable Z should be D1 as you have named it; similar changes for her S0 and D).
I have modified Tetereva's code into this:
library(nleqslv)
T <- 1
D1 <- 33404048
R <- 2.32
sigmaS <- .02396919
SO1 <- 4740291 # Ve?
fnewton <- function(x){
y <- numeric(2)
d1 <- (log(x[1]/D1)+(R+x[2]^2/2)*T)/x[2]*sqrt(T)
d2 <- d1 - x[2]*sqrt(T)
y[1] <- SO1 - (x[1]*pnorm(d1) - exp(-R*T)*D1*pnorm(d2))
y[2] <- sigmaS*SO1 - pnorm(d1)*x[2]*x[1]
y
}
xstart <- c(SO1,sigmaS)
nleqslv(xstart, fnewton, method="Broyden",control=list(trace=1))
nleqslv(xstart, fnewton, method="Newton",control=list(trace=1))
which will give the solution given by Tetereva. (I use trace=1 here just to check the iteration steps.)
I believe the value you give for R should be 8.516 and not something else. Using your values for the parameters
T <- 1
D1 <- 20010.75
R <- 8.516 # modified
sigmaS <- .11
SO1 <- 1311.74
like this
xstart <- c(1311.74,0.11)
nleqslv(xstart, fnewton, method="Broyden")
nleqslv(xstart, fnewton, method="Newton")
Then running nleqslv with these values converges very quickly.
If one uses R <- 2.32 (like Tetereva) nleqslv will also converge albeit with more iterations.
I cannot help you with what R should actually be, but from Tetereva's presentation I assume R is in percentages. Since I don't have enough knowledge of the Black-Scholes model, I can't help you find the correct values for the various parameters. That's up to you.
I want to code the travelling salesman problem in R. I am going to begin with 3 cities and will then expand to more cities. The distance matrix below gives the distances between the 3 cities. The objective (in case someone doesn't know) is that a salesman starts from one city, visits the 2 other cities, and travels the minimum total distance.
In the case below he should start either from ny or LA, then travel to chicago, and then to the remaining city. I need help defining A_ (my constraint matrix).
My decision variables will have the same dimensions as the distances matrix. It will be a 0/1 matrix where a 1 represents travel from the city in the row name to the city in the column name. For instance, if the salesman travels from ny to chicago, the 2nd element in row 1 will be 1. My row and column names are ny, chicago and LA.
By looking at the solution of the problem I concluded that my constraints should be:
Row sums have to be less than 1 as he cannot leave from same city twice
Column sums have to be less than 1 as he cannot enter the same city twice
Total sum of matrix elements has to be 2 as the salesman will be visiting 2 cities and leaving from 2 cities.
How should I tie my decision variables into the constraints?
ny=c(999,9,20)
chicago=c(9,999,11)
LA=c(20,11,999)
distances=cbind(ny,chicago,LA)
dv=matrix(c("a11","a12","a13","a21","a22","a23","a31","a32","a33"),nrow=3,ncol=3)
c_=c(distances[1,],distances[2,],distances[3,])
signs = c((rep('<=', 7)))
b=c(1,1,1,1,1,1,2)
res = lpSolve::lp('min', c_, A_, signs, b, all.bin = TRUE)
There are some problems with your solution. The first is that the constraints you have in mind don't guarantee that all the cities will be visited -- for example, the path could just go from NY to LA and then back. This could be solved fairly easily, for example, by requiring that each row and column sum to exactly one rather than at most 1 (although in that case you'd be finding a traveling salesman tour rather than just a path).
The bigger problem is that, even if we fix this problem, your constraints wouldn't guarantee that the selected vertices actually form one cycle through the graph, rather than multiple smaller cycles. And I don't think that your representation of the problem can be made to address this issue.
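That said, to answer the literal question about A_: below is a minimal sketch of the constraint matrix for your 3-city formulation, in the same variable order as your c_ (the helper vectors from.city and to.city are just for illustration). Note that the total-sum constraint must be an equality, otherwise the minimizer simply selects no edges, and running it exposes exactly the two-city cycle problem described above:
library(lpSolve)
## variable order matches c_: (a11, a12, a13, a21, a22, a23, a31, a32, a33)
from.city <- rep(1:3, each = 3)  # row index of each decision variable
to.city <- rep(1:3, times = 3)   # column index of each decision variable
A_ <- rbind(t(sapply(1:3, function(i) as.numeric(from.city == i))), # leave each city at most once
            t(sapply(1:3, function(j) as.numeric(to.city == j))),   # enter each city at most once
            rep(1, 9))                                              # exactly 2 edges in total
signs <- c(rep('<=', 6), '==')
b <- c(rep(1, 6), 2)
res <- lpSolve::lp('min', c_, A_, signs, b, all.bin = TRUE)
matrix(res$solution, 3, 3, byrow = TRUE) # selects ny->chicago and chicago->ny: a mini-cycle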
Here is an implementation of Travelling Salesman using LP. The solution space is of size n^3, where n is the number of rows in the distance matrix. This represents n consecutive copies of the nxn matrix, each of which represents the edge traversed at time t for 1<=t<=n. The constraints guarantee that
At most one edge is traversed each step
Every vertex is visited exactly once
The start point of the i'th edge traversed is the same as the endpoint of the (i-1)'st
This avoids the problem of multiple small cycles. For example, with four vertices, the sequence (12)(21)(34)(43) would not be a valid solution because the endpoint of the second edge (21) does not match the start point of the third (34).
tspsolve<-function(x){
diag(x)<-1e10
## define some basic constants
nx<-nrow(x)
lx<-length(x)
objective<-matrix(x,lx,nx)
rowNum<-rep(row(x),nx)
colNum<-rep(col(x),nx)
stepNum<-rep(1:nx,each=lx)
## these constraints ensure that at most one edge is traversed each step
onePerStep.con<-do.call(cbind,lapply(1:nx,function(i) 1*(stepNum==i)))
onePerRow.rhs<-rep(1,nx)
## these constraints ensure that each vertex is visited exactly once
onceEach.con<-do.call(cbind,lapply(1:nx,function(i) 1*(rowNum==i)))
onceEach.rhs<-rep(1,nx)
## these constraints ensure that the start point of the i'th edge
## is equal to the endpoint of the (i-1)'st edge
edge.con<-c()
for(s in 1:nx){
s1<-(s %% nx)+1
stepMask<-(stepNum == s)*1
nextStepMask<- -(stepNum== s1)
for(i in 1:nx){
edge.con<-cbind(edge.con,stepMask * (colNum==i) + nextStepMask*(rowNum==i))
}
}
edge.rhs<-rep(0,ncol(edge.con))
## now bind all the constraints together, along with right-hand sides, and signs
constraints<-cbind(onePerStep.con,onceEach.con,edge.con)
rhs<-c(onePerRow.rhs,onceEach.rhs,edge.rhs)
signs<-rep("==",length(rhs))
## call the lp solver
res<-lp("min",objective,constraints,signs,rhs,transpose=F,all.bin=T)
## print the output of lp
print(res)
## return the results as a sequence of vertices, and the score = total cycle length
list(cycle=colNum[res$solution==1],score=res$objval)
}
Here is an example:
set.seed(123)
x<-matrix(runif(16),4,4)
x
## [,1] [,2] [,3] [,4]
## [1,] 0.2875775 0.9404673 0.5514350 0.6775706
## [2,] 0.7883051 0.0455565 0.4566147 0.5726334
## [3,] 0.4089769 0.5281055 0.9568333 0.1029247
## [4,] 0.8830174 0.8924190 0.4533342 0.8998250
tspsolve(x)
## Success: the objective function is 2.335084
## $cycle
## [1] 1 3 4 2
##
## $score
## [1] 2.335084
We can check the correctness of this answer by using a primitive brute force search:
tspscore<-function(x,solution){
sum(sapply(1:nrow(x), function(i) x[solution[i],solution[(i%%nrow(x))+1]]))
}
tspbrute<-function(x,trials){
score<-Inf
cycle<-c()
nx<-nrow(x)
for(i in 1:trials){
temp<-sample(nx)
tempscore<-tspscore(x,temp)
if(tempscore<score){
score<-tempscore
cycle<-temp
}
}
list(cycle=cycle,score=score)
}
tspbrute(x,100)
## $cycle
## [1] 3 4 2 1
##
## $score
## [1] 2.335084
Note that, even though these answers are nominally different, they represent the same cycle.
For larger graphs, though, the brute force approach doesn't work:
> set.seed(123)
> x<-matrix(runif(100),10,10)
> tspsolve(x)
Success: the objective function is 1.296656
$cycle
[1] 1 10 3 9 5 4 8 2 7 6
$score
[1] 1.296656
> tspbrute(x,1000)
$cycle
[1] 1 5 4 8 10 9 2 7 6 3
$score
[1] 2.104487
This implementation is pretty efficient for small matrices, but, as expected, it starts to deteriorate severely as they get larger. At about 15x15 it starts slowing down quite a bit:
timetsp<-function(x,seed=123){
set.seed(seed)
m<-matrix(runif(x*x),x,x)
gc()
system.time(tspsolve(m))[3]
}
sapply(6:16,timetsp)
## elapsed elapsed elapsed elapsed elapsed elapsed elapsed elapsed elapsed elapsed
## 0.011 0.010 0.018 0.153 0.058 0.252 0.984 0.404 1.984 20.003
## elapsed
## 5.565
You can use the gaoptim package to solve permutation/real-valued problems - it's pure R, so it's not that fast:
Euro tour problem (see ?optim)
eurodistmat = as.matrix(eurodist)
# Fitness function (we'll perform a maximization, so invert it)
distance = function(sq)
{
sq = c(sq, sq[1])
sq2 <- embed(sq, 2)
1/sum(eurodistmat[cbind(sq2[,2], sq2[,1])])
}
loc = -cmdscale(eurodist, add = TRUE)$points
x = loc[, 1]
y = loc[, 2]
n = nrow(eurodistmat)
set.seed(1)
# solving code
require(gaoptim)
ga2 = GAPerm(distance, n, popSize = 100, mutRate = 0.3)
ga2$evolve(200)
best = ga2$bestIndividual()
# just transform and plot the results
best = c(best, best[1])
best.dist = 1/max(ga2$bestFit())
res = loc[best, ]
i = 1:n
plot(x, y, type = 'n', axes = FALSE, ylab = '', xlab = '')
title ('Euro tour: TSP with 21 cities')
mtext(paste('Best distance found:', best.dist))
arrows(res[i, 1], res[i, 2], res[i + 1, 1], res[i + 1, 2], col = 'red', angle = 10)
text(x, y, labels(eurodist), cex = 0.8, col = 'gray20')
In R, I used the tm package to build a term-document matrix from a corpus of documents.
My goal is to extract word associations from all bigrams in the term-document matrix and return, for each, the top three or so. Therefore I'm looking for a variable that holds all row.names of the matrix, so that findAssocs() can do its job.
This is my code so far:
library(tm)
library(RWeka)
txtData <- read.csv("file.csv", header = T, sep = ",")
txtCorpus <- Corpus(VectorSource(txtData$text))
# ...further preprocessing
#Tokenizer for n-grams and passed on to the term-document matrix constructor
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
txtTdmBi <- TermDocumentMatrix(txtCorpus, control = list(tokenize = BigramTokenizer))
#term argument holds two words since the BigramTokenizer extracted all pairs from txtCorpus
findAssocs(txtTdmBi, "cat shop", 0.5)
cat cabi cat scratch ...
0.96 0.91
I tried to define a variable holding all the row.names of txtTdmBi and feed it to the findAssocs() function. However, I get the following result:
allRows <- c(row.names(txtTdmBi))
findAssocs(txtTdmBi, allRows, 0.5)
Error in which(x[term, ] > corlimit) : subscript out of bounds
In addition: Warning message:
In term == Terms(x) :
longer object length is not a multiple of shorter object length
Because extracting associations for a term spread over multiple term-document matrices is already explained here, I guess it should be possible to find the associations for multiple terms in a single term-document matrix. But how?
I hope someone can show me how to solve this. Thanks in advance for any support.
If I understand correctly, an lapply solution is probably the way to answer your question. This is the same approach as the answer that you link to, but here's a self-contained example that might be closer to your use case:
Load libraries and reproducible data (please include these in your future questions here)
library(tm)
library(RWeka)
data(crude)
Your bigram tokenizer...
#Tokenizer for n-grams and passed on to the term-document matrix constructor
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
txtTdmBi <- TermDocumentMatrix(crude, control = list(tokenize = BigramTokenizer))
Check that it worked by inspecting a random sample...
inspect(txtTdmBi[1000:1005, 10:15])
A term-document matrix (6 terms, 6 documents)
Non-/sparse entries: 1/35
Sparsity : 97%
Maximal term length: 18
Weighting : term frequency (tf)
Docs
Terms 248 273 349 352 353 368
for their 0 0 0 0 0 0
for west 0 0 0 0 0 0
forced it 0 0 0 0 0 0
forced to 0 0 0 0 0 0
forces trying 1 0 0 0 0 0
foreign investment 0 0 0 0 0 0
Here is the answer to your question:
Now use an lapply function to calculate the associated words for every item in the vector of terms in the term-document matrix. The vector of terms is most simply accessed with txtTdmBi$dimnames$Terms. For example, txtTdmBi$dimnames$Terms[[1005]] is "foreign investment".
Here I've used llply from the plyr package so we can have a progress bar (comforting for big jobs), but it's basically the same as the base lapply function.
library(plyr)
dat <- llply(txtTdmBi$dimnames$Terms, function(i) findAssocs(txtTdmBi, i, 0.5), .progress = "text" )
The output is a list where each item is a named numeric vector: the names are the associated terms and the numbers are the correlation values. For example, to see the terms associated with "foreign investment", we can access the list like so:
dat[[1005]]
and here are the terms associated with that term (I've just pasted in the top few):
168 million 1986 was 1987 early 300 mln 31 pct
1.00 1.00 1.00 1.00 1.00
a bit a crossroads a leading a political a population
1.00 1.00 1.00 1.00 1.00
a reduced a series a slightly about zero activity continues
1.00 1.00 1.00 1.00 1.00
advisers are agricultural sector agriculture the all such also reviews
1.00 1.00 1.00 1.00 1.00
and advisers and attract and imports and liberalised and steel
1.00 1.00 1.00 1.00 1.00
and trade and virtual announced since appears to are equally
1.00 1.00 1.00 1.00 1.00
are recommending areas for areas of as it as steps
1.00 1.00 1.00 1.00 1.00
asia with asian member assesses indonesia attract new balance of
1.00 1.00 1.00 1.00 1.00
Is that what you want to do?
Incidentally, if your term-document matrix is very large, you may want to try this version of findAssocs:
# u is a term document matrix
# term is your term
# corlimit is a value -1 to 1
findAssocsBig <- function(u, term, corlimit){
suppressWarnings(x.cor <- gamlr::corr(t(u[ !u$dimnames$Terms == term, ]),
as.matrix(t(u[ u$dimnames$Terms == term, ])) ))
x <- sort(round(x.cor[(x.cor[, term] > corlimit), ], 2), decreasing = TRUE)
return(x)
}
This can be used like so:
dat1 <- llply(txtTdmBi$dimnames$Terms, function(i) findAssocsBig(txtTdmBi, i, 0.5), .progress = "text" )
The advantage of this is that it uses a different method of converting the TDM to a matrix than tm::findAssocs does. This method uses memory more efficiently and so prevents this kind of message from occurring: Error: cannot allocate vector of size 1.9 Gb.
Quick benchmarking shows that both findAssocs functions are about the same speed, so the main difference is in the use of memory:
library(microbenchmark)
microbenchmark(
dat1 <- llply(txtTdmBi$dimnames$Terms, function(i) findAssocsBig(txtTdmBi, i, 0.5)),
dat <- llply(txtTdmBi$dimnames$Terms, function(i) findAssocs(txtTdmBi, i, 0.5)),
times = 10)
Unit: seconds
                                                                                expr      min       lq   median       uq      max neval
dat1 <- llply(txtTdmBi$dimnames$Terms, function(i) findAssocsBig(txtTdmBi, i, 0.5)) 10.82369 11.03968 11.25492 11.39326 11.89754    10
 dat <- llply(txtTdmBi$dimnames$Terms, function(i) findAssocs(txtTdmBi, i, 0.5))    10.70980 10.85640 11.14156 11.18877 11.97978    10