How can I use centrality metrics in Gremlin on a text file? - gremlin

Could someone please share working Apache TinkerPop (Gremlin) code for the closeness centrality, betweenness centrality, PageRank, and eigenvector centrality metrics that runs without errors on an edge-list dataset such as an "edges.txt" file?
Note: This is a sample of the "edges.txt" dataset. It contains only the source and destination vertices, without any properties (two columns only), as shown below:
Source Destination
1 2
2 3
2 4
3 5
4 5
5 6

A great place to start is the Gremlin Recipes document, which can be found here. Doing these types of calculations entirely in Gremlin will work at modest scale, but for a large graph you may need to consider other approaches.
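For checking results at modest scale, the sample edge list above can also be analysed outside Gremlin. The following is a minimal Python sketch, not Gremlin code and not taken from the Recipes document. It assumes that closeness and betweenness treat the edges as undirected, that PageRank follows the edge direction with a damping factor of 0.85, and that a vertex with no outgoing edges spreads its rank evenly. All function names are illustrative, and eigenvector centrality is left out.

```python
from collections import defaultdict, deque

# Sample edge list from the question (header row omitted).
EDGES_TEXT = """1 2
2 3
2 4
3 5
4 5
5 6"""

def parse_edges(text):
    """Parse whitespace-separated 'source destination' pairs, one per line."""
    return [tuple(line.split()[:2]) for line in text.splitlines() if line.strip()]

def undirected_adjacency(edges):
    """Assumption: direction is ignored for closeness and betweenness."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return dict(adj)

def closeness(adj):
    """Closeness = (number of reachable vertices) / (sum of BFS distances)."""
    result = {}
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total = sum(dist.values())
        result[source] = (len(dist) - 1) / total if total else 0.0
    return result

def betweenness(adj):
    """Unnormalised betweenness via Brandes' algorithm (unweighted, undirected)."""
    bc = dict.fromkeys(adj, 0.0)
    for s in adj:
        stack, preds = [], {v: [] for v in adj}
        sigma = dict.fromkeys(adj, 0)  # number of shortest paths from s
        sigma[s] = 1
        dist = {s: 0}
        queue = deque([s])
        while queue:
            v = queue.popleft()
            stack.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = dict.fromkeys(adj, 0.0)
        while stack:  # accumulate dependencies from the farthest vertex back
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    return {v: b / 2 for v, b in bc.items()}  # each undirected pair was counted twice

def pagerank(edges, damping=0.85, iterations=50):
    """Power iteration on the directed edges; dangling rank is spread evenly."""
    nodes = sorted({n for edge in edges for n in edge})
    out = defaultdict(list)
    for u, v in edges:
        out[u].append(v)
    rank = dict.fromkeys(nodes, 1 / len(nodes))
    for _ in range(iterations):
        dangling = sum(rank[n] for n in nodes if not out.get(n))
        base = (1 - damping) / len(nodes) + damping * dangling / len(nodes)
        new_rank = dict.fromkeys(nodes, base)
        for u, targets in out.items():
            for v in targets:
                new_rank[v] += damping * rank[u] / len(targets)
        rank = new_rank
    return rank

edges = parse_edges(EDGES_TEXT)
adj = undirected_adjacency(edges)
close = closeness(adj)
between = betweenness(adj)
ranks = pagerank(edges)

for v in sorted(adj):
    print(v, round(close[v], 3), between[v], round(ranks[v], 3))
```

To run it on a real file, replace EDGES_TEXT with the contents of edges.txt, skipping the "Source Destination" header line. Every function here visits all vertices for each source vertex, so the sketch is only meant for small graphs.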

Related

What can be said about these graphs with regard to correlation?

I am (desperately) trying to generate a few graphs based on some data that I know has a correlation value of over 0.50, and I get these 2 graphs.
Needless to say, I am no statistician, nor have I worked with this subject before.
Here are the 2 graphs that I get:
What can be said about the 2 graphs individually? I am super confused by the outcome.
It is hard to say without knowing the full scope and context of your data. A few remarks:
There is some upper limit in the first graph, above which all data points would be considered 'out of range' (I can't have been with a company for 40 years if I'm 30 years old).
Pay attention to Simpson's paradox and make sure you have the right segmentation of your data (and check it).
2nd graph: if you only have values of 3 and 4 on the y-axis, there is no use in plotting the grid lines and values for 3.2 etc. (it implies a significance/accuracy that isn't there).
2nd graph: there seems to be some 'rule' that says you need a rating of 4 to get a certain % of salary raise.
But again, more (business) context and information on the data is needed to be able to help you further.
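To illustrate the Simpson's paradox remark, here is a minimal Python sketch using made-up numbers (they are not the asker's data). Two hypothetical segments each show a perfectly negative relationship, yet the pooled data shows a strong positive correlation. The pearson helper is written out by hand for illustration.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Hypothetical segments, invented purely for illustration.
segment_a = ([1, 2, 3], [3, 2, 1])
segment_b = ([6, 7, 8], [8, 7, 6])

r_a = pearson(*segment_a)  # negative within segment A
r_b = pearson(*segment_b)  # negative within segment B
r_pooled = pearson(segment_a[0] + segment_b[0], segment_a[1] + segment_b[1])

print(r_a, r_b, r_pooled)
```

A pooled correlation above 0.50 can therefore coexist with the opposite trend inside every segment, which is why checking the segmentation matters before interpreting the plots.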

igraph R vertex ids get changed

I have a very basic issue with igraph (in R): the renaming of node IDs.
For example, I have the following graph in the form of an edge list.
10,12
10,14
12,14
12,15
14,15
12,17
17,34
17,100
100,34
I want to calculate the local clustering coefficient for each node. First, I read the edge list into an object g using read.csv. Then I used the following command to dump the local CC for each node.
write.csv(transitivity(g,type="local"),file="DumpLocalCC.csv")
Now the problem is that igraph renumbers the node IDs starting from 1, and I get the following output:
"","x"
"1",NA
"2",0.333333333333333
"3",0.333333333333333
"4",0.333333333333333
"5",1
"6",1
"7",1
Now how can I work out which node ID is which? That is, does 7 in the output file refer to 100 or to 34?
Is there any way to force igraph to dump the actual node IDs, such as 10, 34, 100 etc., together with their respective local CC?
I was googling and found that people suggested "V(g)$name <- as.character(V(g))" for preserving the node IDs. I tried it, but I think I am not using it correctly.
Also, since the data is large, I would not like to renumber the node IDs manually to make them sequential from 1.
P.S.: I noticed that a similar question has been asked here. It was suggested to "assign these numbers as vertex names".
How do I do that?
Can someone give an example, please?
There is another similar question, where it was suggested to open an issue. I am not sure whether this has been resolved.
Thanks in advance.
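You just need to combine the stats with the node names when you write the table. For example:
library(igraph)

# Read the edge list; header=FALSE because the data has no header row
DF <- read.csv(text="10,12
10,14
12,14
12,15
14,15
12,17
17,34
17,100
100,34", header=FALSE)
# Build the graph; the values in DF become the vertex names
g <- graph.data.frame(DF)
# Pair each original node ID with its local clustering coefficient
outdata <- data.frame(node=V(g)$name, trans=transitivity(g, type="local"))
write.csv(outdata, file="DumpLocalCC.csv")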

How to use GermaNet (the German counterpart of WordNet) with R

I want to use GermaNet for the lemmatization (corresponding to getLemma() in WordNet) of a list (actually DTM terms, to enhance text classification performance). But I couldn't find any hint of an R package for GermaNet. Is it somehow still possible to use it in R?
I assume you have access to the raw files where the wordnet data is stored (GermaNet seems to offer a free license). You could parse them (simply using some nifty regular expressions) and extract the information you need (I don't know exactly what a DTM is, but I suppose it has something to do with synsets or the links between them). A wordnet (not German) that I worked on was organized in multiple files, some containing the links, some containing information in a form like
0 #1# WORD_MEANING
1 PART_OF_SPEECH "v"
1 VARIANTS
2 LITERAL "someverb"
3 SENSE 7
3 DEFINITION "adefinition"
3 EXAMPLES
4 EXAMPLE "anexample"
3 EXTERNAL_INFO
...
That shouldn't be too hard to parse.
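As a rough illustration of the regular-expression approach, here is a minimal Python sketch that parses the record layout shown above. It assumes that each line is "level [#id#] TAG [value]" and that a WORD_MEANING line starts a new record. This is the layout of the other wordnet described in the answer, which may well differ from GermaNet's actual file format. The names parse_records and literals are illustrative.

```python
import re

# Sample record copied from the answer above (the trailing "..." is omitted).
SAMPLE = '''0 #1# WORD_MEANING
1 PART_OF_SPEECH "v"
1 VARIANTS
2 LITERAL "someverb"
3 SENSE 7
3 DEFINITION "adefinition"
3 EXAMPLES
4 EXAMPLE "anexample"
3 EXTERNAL_INFO'''

# level, optional "#id#", upper-case tag, optional value
LINE_RE = re.compile(r'^\s*(\d+)\s+(?:#(\d+)#\s+)?([A-Z_]+)(?:\s+(.*))?$')

def parse_records(text):
    """Group lines into records; each WORD_MEANING line opens a new record."""
    records, current = [], None
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if not match:
            continue  # skip lines that do not follow the assumed layout
        level, record_id, tag, value = match.groups()
        value = value.strip('"') if value else None
        if tag == "WORD_MEANING":
            current = {"id": record_id, "fields": []}
            records.append(current)
        elif current is not None:
            current["fields"].append((int(level), tag, value))
    return records

def literals(record):
    """Return the word forms (LITERAL entries) stored in one record."""
    return [value for _, tag, value in record["fields"] if tag == "LITERAL"]

records = parse_records(SAMPLE)
print(records[0]["id"], literals(records[0]))
```

The same idea carries over to R with readLines() and regmatches(). A lookup table from each LITERAL to its record would then serve as the basis for lemma lookups.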

Reading undirected graph relationships (A-B) in R and renaming vertices with igraph

In R I'm trying to map all Madrid tube stations using igraph and then calculate the shortest route between two stations (just the number of stations, not the distance). I'm following this syntax: "An undirected graph with two vertices called ‘A’ and ‘B’ and one edge only:
graph.formula(A-B)"
Below I just copy two tube lines for clarity's sake.
library("igraph")
metro<- graph.formula(PinardeChamartin-Bambu-Chamartin-PlazadeCastilla-Valdeacederas-Tetuan-Estrecho-Alvarado-CuatroCaminos-RiosRosas-Iglesia-Bilbao-Tribunal-GranVia-Sol-TirsodeMolina-AntonMartin-Atocha-AtochaRenfe-MenendezPelayo-Pacifico-PuentedeVallecas-NuevaNumancia-Portazgo,LasRosas-AvenidadeGuadalajara-Alsacia-LaAlmudena-LaElipa-Ventas-ManuelBecerra-Goya-PrincipedeVergara-Retiro-BancodeEspana-Sevilla-Sol-Opera-SantoDomingo-Noviciado-SanBernardo-Quevedo-Canal-CuatroCaminos)
sp <- get.shortest.paths(metro,from="Canal",to="Chamartin")
V(metro)[sp[[1]]]
It seems to work, but I have two questions:
1. How can I input the tube stations (nodes) and their relationships A-B for long lists into the graph more efficiently, reading a csv for instance?
2. How can I rename those nodes to include accents (tildes), spaces and "ñ"? I tried putting double quotes before and after each node's name, but I get an error: a + sign. I have checked the long string many times and I cannot see the error; no parenthesis is missing.
Sorry if these are very basic questions. I'm a very novice user.
Thank you very much.
For the first question, see ?graph.data.frame and ?read.csv.
I am not quite sure what you are asking in the second question. What is the error you are getting? Your code works fine for me, with the modification required for igraph 0.7.x:
V(metro)[sp$vpath[[1]]]
# Vertex sequence:
# [1] "Canal" "CuatroCaminos" "Alvarado" "Estrecho"
# [5] "Tetuan" "Valdeacederas" "PlazadeCastilla" "Chamartin"
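The idea behind the first question (store one edge per CSV row, then build the graph from the rows) can be sketched outside R as well. The following minimal Python sketch assumes a hypothetical two-column CSV with "from" and "to" headers. The accented and spaced station names are written here for illustration. Because the names are plain strings read from the file, accents, spaces and "ñ" need no special quoting.

```python
import csv
import io
from collections import defaultdict, deque

# Hypothetical CSV content: one undirected connection per row.
CSV_TEXT = """from,to
Canal,Cuatro Caminos
Cuatro Caminos,Alvarado
Cuatro Caminos,Ríos Rosas
Alvarado,Estrecho
Estrecho,Tetuán
Tetuán,Valdeacederas
Valdeacederas,Plaza de Castilla
Plaza de Castilla,Chamartín
"""

def load_graph(handle):
    """Build an undirected adjacency map from rows with 'from' and 'to' columns."""
    adj = defaultdict(set)
    for row in csv.DictReader(handle):
        a, b = row["from"].strip(), row["to"].strip()
        adj[a].add(b)
        adj[b].add(a)
    return adj

def shortest_path(adj, start, goal):
    """Breadth-first search; returns the station list, or None if unreachable."""
    parent = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for neighbour in adj[node]:
            if neighbour not in parent:
                parent[neighbour] = node
                queue.append(neighbour)
    return None

metro = load_graph(io.StringIO(CSV_TEXT))
path = shortest_path(metro, "Canal", "Chamartín")
print(path, len(path) - 1)
```

With a real file, open(path, newline="", encoding="utf-8") would replace the io.StringIO wrapper. In R, read.csv followed by graph.data.frame(df, directed=FALSE) plays the same role.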

How can I query a genetics database of SNPs (preferably with R)?

Starting with a few human single nucleotide polymorphisms (SNPs), how can I query a database of all known SNPs such that I can generate a list (data.table or csv file) of the 1000 or so closest SNPs, whether or not each SNP is a tagSNP, what the minor allele frequency (MAF) is, and how many bases away it is from the starting SNPs?
I would prefer to do this in R (although it does not have to be). Which database should I use? My only starting point would be listing the starting SNPs (e.g. rs3091244, rs6311, etc.).
I am certain there is a nice simple Bioconductor package that could be my starting point. But which one? Have you ever done it? I imagine it can be done in about 3 to 5 lines of code.
Again, this is off topic, but you can actually do all of the things you mention through this web-based tool from the Broad Institute:
http://www.broadinstitute.org/mpg/snap/ldsearch.php
You just input a snp and it gives you the surrounding window of snps, and you can export to csv as well.
Good luck with your genetics project!
