I have a simple problem. I have two text documents and I want to build a graph of each document with igraph or a similar library. What I actually want is one large graph that combines the subgraphs of both documents. I tried the following code, but got stuck:
> Topic1 = c("I love Pakistan")
> Topic2 = c("Pakistan played well")
> src = data.frame(Topic1,Topic2)
> mycorpus = Corpus(VectorSource(src))
> tdm = as.matrix(TermDocumentMatrix(mycorpus))
Now I don't know what to do next.
The graph of Topic1 will have 3 nodes and 3 edges; similarly, the graph of Topic2 will have 3 nodes and 3 edges. I want to merge these two graphs into one. The combined graph will have 5 nodes and 6 edges, and the node Pakistan will have 4 edges.
Can anybody help me?
Finally, I found the solution myself. First, we make a graph of the terms from Topic1, using every term whose frequency is greater than 0. The same is done for Topic2, and the two edge lists are then combined into one graph.
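library(tm)      # Corpus, VectorSource, TermDocumentMatrix
library(igraph)  # graph_from_edgelist, as_edgelist, simplify
tdm = as.matrix(TermDocumentMatrix(mycorpus))
# Graph of Topic1: connect every pair of terms with frequency > 0
x = names(tdm[,1][tdm[,1]>0])
k = t(combn(x,2))
g = graph_from_edgelist(k,directed = FALSE)
plot(g)
# Graph of Topic2, built the same way from the second column
x2 = names(tdm[,2][tdm[,2]>0])
k2 = t(combn(x2,2))
g2 = graph_from_edgelist(k2,directed = FALSE)
plot(g2)
# Merge: stack both edge lists and build one graph from them
E1 = as_edgelist(g)
E2 = as_edgelist(g2)
E3 = rbind(E1,E2)
g3 = graph_from_edgelist(E3,directed = FALSE)
# Drop duplicate edges and self-loops before plotting the merged graph
g3 = simplify(g3,remove.multiple = TRUE, remove.loops = TRUE)
plot(g3)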
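As a cross-check on the merge step, here is a minimal sketch that skips tm entirely and assumes the term vectors are already known. The names terms1, terms2 and clique are illustrative and not part of the code above. Because the vertices are named, igraph's union() should merge the two graphs by vertex name and keep each edge only once:

```r
library(igraph)

# Assumed term vectors for the two documents (illustrative, not taken from tm)
terms1 <- c("i", "love", "pakistan")
terms2 <- c("pakistan", "played", "well")

# Hypothetical helper: fully connect all terms of one document
clique <- function(terms) {
  graph_from_edgelist(t(combn(terms, 2)), directed = FALSE)
}

g1 <- clique(terms1)
g2 <- clique(terms2)

# Union of named graphs: vertices are matched by name
merged <- igraph::union(g1, g2)

vcount(merged)                # 5 nodes
ecount(merged)                # 6 edges
degree(merged)[["pakistan"]]  # 4 edges at the shared node
```

This gives the 5 nodes, 6 edges and degree 4 for the shared term that the question expects, without building the combined edge list by hand.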
I'm trying to generate a four way Venn diagram using draw.quad.venn in the VennDiagram package in R, but it keeps throwing up the error message:
ERROR [2019-05-14 11:28:24] Impossible: a7 <- n234 - a6 produces negative area
Error in draw.quad.venn(length(gene_lists[[1]]), length(gene_lists[[2]]), :
Impossible: a7 <- n234 - a6 produces negative area
I'm using 4 different lists of genes as the input. calculate.overlap works fine, then I get the numbers by using the length(x) function over the overlap values, parsed as a list. I pass all of the overlap values, along with the appropriate total group sizes, to the draw.quad.venn function, but it keeps claiming that one of the groups is impossible because it generates a negative number.
I've checked the numbers manually and they clearly add up to the correct values. I've also tested the script on a random set of 20000 genes, generated using something similar to the script below, and it works fine i.e. generates a four way Venn diagram. There are no differences between the randomly generated gene lists and the ones I've curated from actual results files, apart from their sizes. A minimal working example can be seen below:
# working example that fails
library(VennDiagram)
# get vector of 10000 elements (representative of gene list)
values <- c(1:10000)
# generate 4 subsets by random sampling
list_1 <- sample(values, size = 5000, replace = FALSE)
list_2 <- sample(values, size = 4000, replace = FALSE)
list_3 <- sample(values, size = 3000, replace = FALSE)
list_4 <- sample(values, size = 2000, replace = FALSE)
# compile them in to a list
lists <- list(list_1, list_2, list_3, list_4)
# find overlap between all possible combinations (11 plus 4 unique to each list = 15 total)
overlap <- calculate.overlap(lists)
# get the lengths of each list - these will be the numbers used for the Venn diagram
overlap_values <- lapply(overlap, function(x) length(x))
# rename overlap values (easier to identify which groups are intersecting)
names(overlap_values) <- c("n1234", "n123", "n124", "n134", "n234", "n12", "n13", "n14", "n23", "n24", "n34", "n1", "n2", "n3", "n4")
# generate the venn diagram
draw.quad.venn(length(lists[[1]]), length(lists[[2]]), length(lists[[3]]), length(lists[[4]]), overlap_values$n12,
overlap_values$n13, overlap_values$n14, overlap_values$n23, overlap_values$n24, overlap_values$n34,
overlap_values$n123, overlap_values$n124, overlap_values$n134, overlap_values$n234, overlap_values$n1234)
I expect a four way Venn diagram regardless of whether or not some groups are 0, they should still be there, but labelled as 0. This is what it should look like:
I'm not sure if it's because I have 0 values in the real data i.e. certain groups where there is no overlap? Is there any way to force draw.quad.venn() to take any values? If not, is there another package that I can use to achieve the same results? Any help greatly appreciated!
So nothing I tried could solve the error with the draw.quad.venn in the VennDiagram package. There's something wrong with the way it's written. As long as all of the numbers in each of the 4 ellipses add up to the total number of elements in that particular list, the Venn diagram is valid. For some reason, VennDiagram will only accept data where fewer intersections lead to higher numbers e.g. the intersection of groups 1, 2 and 3 MUST be higher than the intersection of all 4 groups. This doesn't represent real world data. It's entirely possible for groups 1, 2 and 3 to not intersect at all, whilst all 4 groups do intersect. In a Venn diagram, all of the numbers are independent, and represent the total number of elements common at each intersection. They do not have to have any bearing on each other.
I had a look at the eulerr package, but actually found a very simple method of plotting the venn diagram using venn in gplots, as follows:
# simple 4 way Venn diagram using gplots
library(gplots)
# get some mock data
values <- c(1:20000)
list_1 <- sample(values, size = 5000, replace = FALSE)
list_2 <- sample(values, size = 4000, replace = FALSE)
list_3 <- sample(values, size = 3000, replace = FALSE)
list_4 <- sample(values, size = 2000, replace = FALSE)
lists <- list(list_1, list_2, list_3, list_4)
# name the list (required for gplots)
names(lists) <- c("G1", "G2", "G3", "G4")
# get the venn table
v.table <- venn(lists)
# show venn table
print(v.table)
# plot Venn diagram
plot(v.table)
I now consider the matter solved. Thank you zx8754 for your help!
I have had a look at the source code of the package. In case you are still interested in the reason for the error: there are two ways to send data to venn.diagram. One is the nxxxx form (e.g., n134) and the other is the an form (e.g., a5). In these examples, n134 means "the elements that belong at least to groups 1, 3 and 4", whereas a5 means "the elements that belong only to groups 1, 3 and 4". The relationship between the two forms is rather convoluted; for instance, a6 corresponds to n1234, which means that n134 = a5 + a6.
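To make the difference between the two forms concrete, here is a small base R sketch with made-up sets (s1 to s4 are illustrative names and data). It only uses set operations, so it does not depend on the package's internal labels:

```r
# Four small made-up sets
s1 <- 1:6
s2 <- c(4, 5, 9)
s3 <- c(2, 3, 4, 5, 6, 7)
s4 <- c(3, 4, 6, 8)

# Elements shared by sets 1, 3 and 4, whether or not they are also in set 2
in_134 <- Reduce(intersect, list(s1, s3, s4))

inclusive_134 <- length(in_134)               # "at least 1, 3 and 4": 3
exclusive_134 <- length(setdiff(in_134, s2))  # "only 1, 3 and 4": 2
all_four      <- length(intersect(in_134, s2)) # in all four sets: 1

# The inclusive count is the exclusive count plus the four-way overlap
stopifnot(inclusive_134 == exclusive_134 + all_four)
```

An inclusive count can therefore never be smaller than the four-way overlap, which is why exclusive counts passed in the inclusive slots can produce a negative area.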
The problem is that calculate.overlap gives the numbers in the an form, whereas by default draw.quad.venn expects numbers in the nxxxx form. To use the values from calculate.overlap, you can set direct.area to TRUE and provide the result of calculate.overlap in the area.vector parameter. For instance:
library(VennDiagram)
tmp <- calculate.overlap(list(a=c(1, 2, 3, 4, 10), b=c(3, 4, 5, 6), c=c(4, 6, 7, 8, 9), d=c(4, 8, 1, 9)))
overlap_values <- lapply(tmp, function(x) length(x))
draw.quad.venn(area.vector = c(overlap_values$a1, overlap_values$a2, overlap_values$a3, overlap_values$a4,
overlap_values$a5, overlap_values$a6, overlap_values$a7, overlap_values$a8,
overlap_values$a9, overlap_values$a10, overlap_values$a11, overlap_values$a12,
overlap_values$a13, overlap_values$a14, overlap_values$a15), direct.area = TRUE, category = c('a', 'b', 'c', 'd'))
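An alternative, sketched here under the assumption that you would rather keep the default nxxxx interface, is to compute the inclusive counts yourself with base R set operations and pass them by name. The helper n_common and the list counts are illustrative names; the drawing call is guarded so the counting part runs even without VennDiagram installed:

```r
# Same four example sets as in the call above
sets <- list(a = c(1, 2, 3, 4, 10), b = c(3, 4, 5, 6),
             c = c(4, 6, 7, 8, 9), d = c(4, 8, 1, 9))

# Hypothetical helper: number of elements common to the sets at positions idx
n_common <- function(idx) length(Reduce(intersect, sets[idx]))

# Inclusive ("at least these groups") counts, named like the function arguments
counts <- list(
  area1 = n_common(1), area2 = n_common(2),
  area3 = n_common(3), area4 = n_common(4),
  n12 = n_common(c(1, 2)), n13 = n_common(c(1, 3)), n14 = n_common(c(1, 4)),
  n23 = n_common(c(2, 3)), n24 = n_common(c(2, 4)), n34 = n_common(c(3, 4)),
  n123 = n_common(c(1, 2, 3)), n124 = n_common(c(1, 2, 4)),
  n134 = n_common(c(1, 3, 4)), n234 = n_common(c(2, 3, 4)),
  n1234 = n_common(1:4)
)

# Draw only if the package is available
if (requireNamespace("VennDiagram", quietly = TRUE)) {
  grid::grid.newpage()
  do.call(VennDiagram::draw.quad.venn,
          c(counts, list(category = names(sets))))
}
```

Counts built this way are nested by construction (for example n134 is never smaller than n1234), so they should not trigger the negative-area check.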
If you are interested in something simpler and more flexible, I made the nVennR package for this type of problem:
library(nVennR)
g1 <- c('AF029684', 'M28825', 'M32074', 'NM_000139', 'NM_000173', 'NM_000208', 'NM_000316', 'NM_000318', 'NM_000450', 'NM_000539', 'NM_000587', 'NM_000593', 'NM_000638', 'NM_000655', 'NM_000789', 'NM_000873', 'NM_000955', 'NM_000956', 'NM_000958', 'NM_000959', 'NM_001060', 'NM_001078', 'NM_001495', 'NM_001627', 'NM_001710', 'NM_001716')
g2 <- c('NM_001728', 'NM_001835', 'NM_001877', 'NM_001954', 'NM_001992', 'NM_002001', 'NM_002160', 'NM_002162', 'NM_002258', 'NM_002262', 'NM_002303', 'NM_002332', 'NM_002346', 'NM_002347', 'NM_002349', 'NM_002432', 'NM_002644', 'NM_002659', 'NM_002997', 'NM_003032', 'NM_003246', 'NM_003247', 'NM_003248', 'NM_003259', 'NM_003332', 'NM_003383', 'NM_003734', 'NM_003830', 'NM_003890', 'NM_004106', 'AF029684', 'M28825', 'M32074', 'NM_000139', 'NM_000173', 'NM_000208', 'NM_000316', 'NM_000318', 'NM_000450', 'NM_000539')
g3 <- c('NM_000655', 'NM_000789', 'NM_004107', 'NM_004119', 'NM_004332', 'NM_004334', 'NM_004335', 'NM_004441', 'NM_004444', 'NM_004488', 'NM_004828', 'NM_005214', 'NM_005242', 'NM_005475', 'NM_005561', 'NM_005565', 'AF029684', 'M28825', 'M32074', 'NM_005567', 'NM_003734', 'NM_003830', 'NM_003890', 'NM_004106', 'AF029684', 'NM_005582', 'NM_005711', 'NM_005816', 'NM_005849', 'NM_005959', 'NM_006138', 'NM_006288', 'NM_006378', 'NM_006500', 'NM_006770', 'NM_012070', 'NM_012329', 'NM_013269', 'NM_016155', 'NM_018965', 'NM_021950', 'S69200', 'U01351', 'U08839', 'U59302')
g4 <- c('NM_001728', 'NM_001835', 'NM_001877', 'NM_001954', 'NM_005214', 'NM_005242', 'NM_005475', 'NM_005561', 'NM_005565', 'ex1', 'ex2', 'NM_003890', 'NM_004106', 'AF029684', 'M28825', 'M32074', 'NM_000139', 'NM_000173', 'NM_000208', 'NM_000316', 'NM_000318', 'NM_000450', 'NM_000539')
myV <- plotVenn(list(g1=g1, g2=g2, g3=g3, g4=g4))
myV <- plotVenn(nVennObj = myV)
myV <- plotVenn(nVennObj = myV)
The last command is repeated on purpose. The result:
You can then explore the intersections:
> getVennRegion(myV, c('g1', 'g2', 'g4'))
[1] "NM_000139" "NM_000173" "NM_000208" "NM_000316" "NM_000318" "NM_000450" "NM_000539"
There is a vignette with more information.
The high-level question is in the title: what can you do to debug a linear optimisation when using lp in R?
The detailed issue is that I have a working program adapted from http://pena.lt/y/2014/07/24/mathematically-optimising-fantasy-football-teams/
Based on player data, it chooses an optimal 15-man squad, which is handy at the start of the year or whenever you can change all players.
I have changed it to:
1) Read player data from an Excel file (which I can supply - just tell me how)
2) Add 2 constraints for the players I definitely want to include in the team and those I definitely don't.
Player data has the following columns:
web_name
team_name
type_name
now_cost
total_points
InTeam
In
Out
That was a good start, so I went on to model the normal weeks, when you can only transfer 1 player. I think I have the right constraint, but now lp chooses about 200 players for me, not 15. Something is very wrong, but I can't see how it gets there.
I have tried going back from my new code and stripping out the new feature, and it still works.
I have tried removing the In/Out constraints and keeping the new "1 change" constraint. Same result.
I have upgraded the packages and moved to the latest R.
Any pointers?
The code is:
#Straight lift from Web - http://pena.lt/y/2014/07/24/mathematically-optimising-fantasy-football-teams/
# plus extra constraints to exclude and include specific players via Excel In/Out columns
# This variant looks to limit changes (typically 1 or 2) for a normal week
library(gdata)
library(lpSolve)
library(stringr)
library(RCurl)
library(jsonlite)
library(plyr)
excelfile<-"C:/Users/mike/Documents/FF/Start2015R.xlsx"
df=read.xls(excelfile)
# Constants
num_teams = 20
num_constraints = 8
# InTeam,In,Out,Cost + 4 positions
#Create the constraints
num_gk = 2
num_def = 5
num_mid = 5
num_fwd = 3
team_size = num_gk + num_def + num_mid + num_fwd
#max_cost = 1000
max_cost = 998
#max_cost = 2000
max_changes = 2
min_same = team_size - max_changes
# Create vectors to constrain by position
df$Goalkeeper = ifelse(df$type_name == "Goalkeeper", 1, 0)
df$Defender = ifelse(df$type_name == "Defender", 1, 0)
df$Midfielder = ifelse(df$type_name == "Midfielder", 1, 0)
df$Forward = ifelse(df$type_name == "Forward", 1, 0)
# Create vector to constrain by max number of players allowed per team
team_constraint = unlist(lapply(unique(df$team_name), function(x, df){
ifelse(df$team_name==x, 1, 0)
}, df=df))
# next we need the constraint directions. First is for MinSame
const_dir <- c(">=","=","=","=", "=", "=", "=", rep("<=", 21))
# The vector to optimize against
objective = df$total_points
# Put the complete matrix together
# nrow is number of constraints
const_mat = matrix(c(df$Inteam,df$In,df$Out,df$Goalkeeper, df$Defender, df$Midfielder, df$Forward,
df$now_cost, team_constraint),
nrow=( num_constraints + length(unique(df$team_name))),
byrow=TRUE)
const_rhs = c(min_same ,sum(df$In),0,num_gk, num_def, num_mid, num_fwd, max_cost, rep(3, num_teams))
# And solve the linear system
x = lp ("max", objective, const_mat, const_dir, const_rhs, all.bin=TRUE, all.int=TRUE)
print(arrange(df[which(x$solution==1),], desc(Goalkeeper), desc(Defender), desc(Midfielder), desc(Forward), desc(total_points)))
print (df[which(x$solution==1),"web_name",drop=FALSE], row.names = FALSE)
# what changed
df[which(x$solution != df$InTeam),"web_name",drop=FALSE]
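On the general debugging question, one cheap check is to verify the shape of the constraint matrix before calling lp. In the code above, the column is listed as InTeam but the matrix is built from df$Inteam. Since `$` is case-sensitive, that expression returns NULL, and c() drops NULL silently, which would leave the matrix one row short and shift every later constraint. Whether this is the whole cause of the 200-player result is an assumption I cannot confirm without the data. The sketch below uses a three-row toy data frame (illustrative values) to show the effect and a defensive way to build the matrix:

```r
# Toy stand-in for the player data (illustrative values only)
df <- data.frame(InTeam = c(1, 0, 1), In = c(0, 0, 1), Out = c(0, 1, 0))

# `$` is case-sensitive: a misspelled column name yields NULL, not an error
is.null(df$Inteam)  # TRUE

# c() drops NULL silently, so the vector holds two rows' worth of values
vals <- c(df$Inteam, df$In, df$Out)
length(vals)        # 6 rather than 9

# Selecting columns by name with [ , ] fails loudly on a misspelled name
bad <- try(df[, c("Inteam", "In", "Out")], silent = TRUE)
inherits(bad, "try-error")  # TRUE

# Build the matrix from named columns and assert its shape before solving
const_mat <- t(as.matrix(df[, c("InTeam", "In", "Out")]))
stopifnot(nrow(const_mat) == 3, ncol(const_mat) == nrow(df))
```

It is also worth inspecting the status element of the object returned by lp before reading the solution: lpSolve documents 0 as success and 2 as no feasible solution.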
In R, I would like to insert frequencies (as numbers) in a plot.
My code to create the plot:
par(mar=c(4.5,4.5,9.5,4), xpd=TRUE)
plot(factor(ArtMehrspr)~Mehrspr_Vielf, data=datProjektMehr, col=terrain.colors(4),
bty='L', main="Vielfalt nutzen")
legend("topright", inset=c(0,-.225), title="Art der Mehrsprachigkeit", levels(factor(datProjektMehr$ArtMehrspr)),
fill=terrain.colors(4), horiz=TRUE)
par(mar=c(5,4,4,2)+0.1)
In the plot, 2 columns of my dataframe are depicted: ArtMehrspr and Mehrspr_Vielf.
Now what I would like to know is, how many "Kombi" are in category "1", how many "Paral" are in category "1" and so on, and then to print this number in the plot, so that in every box of the plot, I can see the corresponding number of observations. R must know these numbers, otherwise it could not vary the height of the different boxes according to the number of observations. So it cannot be that hard to get these numbers into the plot, can it?
With the command table(), I can get these numbers, but I would have to have 5 table()-commands to get all the numbers. Example for category = 1:
> table(subset(datProjektMehr, Mehrspr_Vielf=="1")$ArtMehrspr)
einspr  Kombi  Paral  Versc  Wechs
     0      1      9      2      1
Apparently, you can achieve what I am looking for by adding the argument labels = TRUE. But it does not work:
par(mar=c(4.5,4.5,9.5,4), xpd=TRUE, labels = TRUE)
plot(factor(ArtMehrspr)~Mehrspr_Vielf, data=datProjektMehr, col=terrain.colors(4),
bty='L', main="Vielfalt nutzen")
legend("topright", inset=c(0,-.225), title="Art der Mehrsprachigkeit", levels(factor(datProjektMehr$ArtMehrspr)),
fill=terrain.colors(4), horiz=TRUE)
par(mar=c(5,4,4,2)+0.1)
R gives me the following warning message:
Warning message:
In par(mar = c(4.5, 4.5, 9.5, 4), xpd = TRUE, labels = TRUE) :
"labels" is not a graphical parameter
Is this not the right command? Does anyone know how to do this?
First of all, the warning tells you that par has no labels argument.
As for plotting the table output, I'm not aware of an easy way of doing this, but I managed to put together some fairly unreliable and perhaps inefficient code. On my machine, though, it works every time I run it.
The idea is to write every value from your table inside the plot with text(). To do so, the coordinates on the x and y axes had to be estimated. I prefer the term "estimated" to "calculated" because I didn't find a way to compute absolute values for the coordinates, since the plot method used here is plot.factor.
So:
#random data. DF = datProjektMehr, artmehr = ArtMehrspr, mehrviel = Mehrspr_Vielf
DF <- data.frame(artmehr = sample(letters[1:4], 20, TRUE), mehrviel = as.factor(sample(1:5, 20, TRUE)))
#your code of plotting
par(mar = c(4.5,4.5,9.5,4), xpd = TRUE)
plot(factor(artmehr) ~ mehrviel, data = DF, col = terrain.colors(4),
bty = 'L', main = "Vielfalt nutzen")
legend("topright", inset=c(0,-.225), title="Art der Mehrsprachigkeit", levels(factor(DF$artmehr)),
fill=terrain.colors(4), horiz=TRUE)
#no need to "table()" many times
tab = table(DF$artmehr, DF$mehrviel)
#maximum value of x axis (at least on my machine)
#I found, through trial and error, that for a factor of n levels, x.max = 1 + (n-1)*0.02
x.max = 1 + (length(levels(DF$mehrviel)) - 1) * 0.02
#coordinates of "mehrviel" (as I named it)
mehrviel.coords = ((cumsum(apply(tab, 2, sum)) / sum(tab)) * x.max) - ((apply(tab, 2, sum) / sum(tab)) / 2)
#coordinates of "artmehr" (as I named it)
artmehr.coords <- apply(tab, 2, function(x) { cumsum(x / sum(x)) })
artmehr.coords <- apply(artmehr.coords, 2, function(x) { x - c(x[1]/2, diff(x)/2) })
#"text" the values in your table
#don't plot "0"s
for(i in 1:ncol(artmehr.coords))
{
text(x = mehrviel.coords[i], y = artmehr.coords[,i], labels = ifelse(tab[,i] != 0, tab[,i], ""), cex = 2)
}
The values of table:
tab
    1 2 3 4 5
  a 1 1 0 1 0
  b 0 0 2 1 2
  c 1 1 2 1 0
  d 2 0 0 3 2
The plot:
EDIT: 1) "Tidied" the answer. 2) Added an extra level to the factor plotted on the x axis to match your data exactly. 3) Placed the frequencies in the middle of each box.
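The trial-and-error constant 0.02 above matches the gap that spineplot documents for spine plots (its off argument defaults to 2 per cent). Assuming plot.factor draws a spine plot with that default, the box centres can be computed rather than estimated. The following is a sketch under that assumption, with a small fixed data set and illustrative names (widths, x_mid, y_mid):

```r
# Small fixed data set (illustrative), same column names as above
DF <- data.frame(
  artmehr  = factor(c("a", "a", "b", "b", "b", "c", "c", "d", "d", "d")),
  mehrviel = factor(c(1, 2, 1, 2, 2, 1, 3, 2, 3, 3))
)

# Rows = categories on the x axis, columns = categories stacked on the y axis
tab <- table(DF$mehrviel, DF$artmehr)

off <- 0.02  # assumed gap between bars (spineplot default of 2 per cent)

# Bar widths are the x-category proportions; each bar starts after the
# previous bar plus one gap
widths <- as.vector(prop.table(margin.table(tab, 1)))
x_left <- cumsum(c(0, head(widths + off, -1)))
x_mid  <- x_left + widths / 2

# Within each bar, boxes stack up in level order; centre = top - half height
prop  <- unclass(prop.table(tab, 1))
y_top <- t(apply(prop, 1, cumsum))
y_mid <- y_top - prop / 2

spineplot(artmehr ~ mehrviel, data = DF, col = terrain.colors(4))
for (i in seq_len(nrow(tab))) {
  keep <- tab[i, ] > 0  # skip empty boxes
  text(x = x_mid[i], y = y_mid[i, keep], labels = tab[i, keep])
}
```

If the labels look shifted on your setup, the assumption about the gap is the first thing to check.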