Going crazy with forceNetwork in R: no edges displayed
I've been trying to plot a network using the networkD3 package in R for a week now. The simpleNetwork function works normally, but it doesn't allow much control over the appearance of the graph. The forceNetwork function exists for this purpose: displaying graphs with richer visual features.
The problem I have is pretty much the same as the one described in this question. I have carefully read the package documentation and tried the solution proposed in that thread, with no luck: all I get is a cloud of nodes with no edges linking them.
Here are my data frames:
edg
Gene1 Gene2 Prob
1 22 3
2 22 6
3 22 6
4 22 9
5 22 3
6 22 4
7 22 8
8 22 4
9 22 6
10 22 8
11 22 6
12 22 10
13 22 6
14 22 3
15 22 6
16 22 6
17 22 0
18 22 4
19 22 6
20 22 4
vert
Symbol Chr Expr
1 21 9
2 17 10
3 17 0
4 20 0
5 6 9
6 5 11
7 12 0
8 1 20
9 17 11
10 17 7
11 17 11
12 10 0
13 17 0
14 7 7
15 17 6
16 17 0
17 2 5
18 5 10
19 17 10
20 17 9
21 12 4
22 3 2
Well, this results in the above-mentioned cloud of nodes with no edges. The same thing happens if I replace the 'Symbol' column with the actual labels I'd put on the nodes (respecting the order of the Links table, as required by the package).
Note that the package illustrates the use of this function with this example, and if you open the datasets used (MisLinks, MisNodes), their content is the same as mine, except for the labels of the nodes. Running that very same example works; running it with my data does not.
Here is the function I use to plot the network:
forceNetwork( Links = edg, Nodes = vert, Source = "Gene1", Target = "Gene2",
Value = "Prob", NodeID = "Symbol", Group = "Chr", opacity = 0.7,
colourScale = "d3.scale.category20b()", Nodesize = "Expr", zoom = T,
legend = T )
Every other property is correctly displayed (node size, legend, colours), but I keep seeing no edges. There must be a mistake somewhere in my datasets, which I cannot find in any way.
I was having the same problem (simpleNetwork working normally, forceNetwork first displaying only nodes and no edges, then subsequently no display at all).
The problem (which you presumably fixed when you "rebuilt dataframes starting numbering from 0") was that your original Links data, edg, started from 1 instead of 0.
The networkD3 documentation, http://christophergandrud.github.io/networkD3/, has this note:
Note: You are probably used to R’s 1-based numbering (i.e. counting in R starts from 1). However, networkD3 plots are created using JavaScript, which is 0-based. So, your data links will need to start from 0.
Regarding incorrect data types, which I also originally thought might be the problem: I tested casting all the different columns (except the factor variable for the NodeID) with as.numeric versus as.integer. Now that I have corrected my data to be 0-based instead of 1-based, my forceNetwork display works normally with either data type.
Hope this helps!
I have just fixed the same problem in my own forceNetwork. It turned out that the data frame of edges that I had created (exported from igraph) had character columns, not integer columns. Casting the edge 'from' and 'to' columns using as.numeric() resolved the problem and the links drew correctly.
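A minimal sketch of that check and fix, using a made-up edge list (the `links` data frame and its column names are hypothetical, not taken from the post). One caveat worth knowing: if the columns are factors rather than character, as.numeric() returns the internal level codes, so going through as.character() first is the safer route.

```r
# Hypothetical edge list whose index columns were read in as character
links <- data.frame(from = c("0", "1", "10"), to = c("1", "2", "0"),
                    stringsAsFactors = FALSE)

# Check the column classes; "character" or "factor" is the warning sign
col_classes <- sapply(links, class)
print(col_classes)

# Going through as.character() first also handles factor columns safely
links$from <- as.numeric(as.character(links$from))
links$to   <- as.numeric(as.character(links$to))
```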
I hope this helps.
With regards,
Will
Technically speaking, the reason your example data will not work, even if you address other possible problems (like edg$Gene1 and edg$Gene2 being non-numeric), is that you refer to a node 22 in your edg data, which in "0-based index" terms points to the 23rd row of your vert data frame, which does not exist.
As has been pointed out, this is probably because it is in 1-based indexing and should be converted, which could easily be done with
edg$Gene1 <- edg$Gene1 - 1
edg$Gene2 <- edg$Gene2 - 1
Alternatively, you might have intended to refer to another node which, for whatever reason, did not make it into the vert data frame. In that case the node would need to be added to the vert data frame, which could easily be done with (for example)...
vert <- rbind(vert, c(23,1,1))
You could test whether or not you refer to a node in your edg data that doesn't exist in your vert data with something like...
all(unique(c(edg$Gene1, edg$Gene2)) %in% (1:nrow(vert) - 1))
# [1] FALSE
which should return TRUE. If not, something's wrong.
You could determine which nodes are referred to in your edg data that do not exist in your vert data with...
unique(c(edg$Gene1, edg$Gene2))[which(!unique(c(edg$Gene1, edg$Gene2)) %in% (1:nrow(vert) - 1))]
# [1] 22
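The two checks above can be wrapped into one small helper. This is only a sketch: `missing_nodes` is a hypothetical function name, the data below is made up, and it assumes the link columns are already numeric and meant to be 0-based.

```r
# Hypothetical helper: returns the link indices that do not point at any row
# of the nodes data frame, assuming 0-based indices
missing_nodes <- function(links, nodes, source, target) {
  used  <- unique(c(links[[source]], links[[target]]))
  valid <- seq_len(nrow(nodes)) - 1
  setdiff(used, valid)
}

# Small made-up data: 3 nodes, so the valid indices are 0, 1 and 2
nodes <- data.frame(name = c("a", "b", "c"))
links <- data.frame(from = c(0, 1, 2), to = c(1, 2, 3))

missing_nodes(links, nodes, "from", "to")
# [1] 3
```

An empty result means every link points at an existing node.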
Fully reproducible example adjusting the indices in edg to be "0-based":
edg <- read.csv(header = TRUE, colClasses = 'character', text = '
Gene1,Gene2,Prob
1,22,3
2,22,6
3,22,6
4,22,9
5,22,3
6,22,4
7,22,8
8,22,4
9,22,6
10,22,8
11,22,6
12,22,10
13,22,6
14,22,3
15,22,6
16,22,6
17,22,0
18,22,4
19,22,6
20,22,4
')
vert <- read.csv(header = TRUE, colClasses = 'character', text = '
Symbol,Chr,Expr
1,21,9
2,17,10
3,17,0
4,20,0
5,6,9
6,5,11
7,12,0
8,1,20
9,17,11
10,17,7
11,17,11
12,10,0
13,17,0
14,7,7
15,17,6
16,17,0
17,2,5
18,5,10
19,17,10
20,17,9
21,12,4
22,3,2
')
# cast to numeric just to be sure
edg$Gene1 <- as.numeric(edg$Gene1)
edg$Gene2 <- as.numeric(edg$Gene2)
# adjust the indices so they're "0-based"
edg$Gene1 <- edg$Gene1 - 1
edg$Gene2 <- edg$Gene2 - 1
# Nodesize is also necessarily numeric
vert$Expr <- as.numeric(vert$Expr)
library(networkD3)
forceNetwork(Links = edg, Nodes = vert, Source = "Gene1", Target = "Gene2",
Value = "Prob", NodeID = "Symbol", Group = "Chr", opacity = 0.7,
Nodesize = "Expr", zoom = TRUE, legend = TRUE)
Fully reproducible example adding a node to vert:
edg <- read.csv(header = TRUE, colClasses = 'character', text = '
Gene1,Gene2,Prob
1,22,3
2,22,6
3,22,6
4,22,9
5,22,3
6,22,4
7,22,8
8,22,4
9,22,6
10,22,8
11,22,6
12,22,10
13,22,6
14,22,3
15,22,6
16,22,6
17,22,0
18,22,4
19,22,6
20,22,4
')
vert <- read.csv(header = TRUE, colClasses = 'character', text = '
Symbol,Chr,Expr
1,21,9
2,17,10
3,17,0
4,20,0
5,6,9
6,5,11
7,12,0
8,1,20
9,17,11
10,17,7
11,17,11
12,10,0
13,17,0
14,7,7
15,17,6
16,17,0
17,2,5
18,5,10
19,17,10
20,17,9
21,12,4
22,3,2
')
# cast to numeric just to be sure
edg$Gene1 <- as.numeric(edg$Gene1)
edg$Gene2 <- as.numeric(edg$Gene2)
vert$Expr <- as.numeric(vert$Expr)
# add another node to the Nodes data frame
vert <- rbind(vert, c(23,1,1))
library(networkD3)
forceNetwork(Links = edg, Nodes = vert, Source = "Gene1", Target = "Gene2",
Value = "Prob", NodeID = "Symbol", Group = "Chr", opacity = 0.7,
Nodesize = "Expr", zoom = TRUE, legend = TRUE)
I ran into the same problem, but fixed it by setting the factor levels of source and target to be consistent with the node names before converting them to numeric:
edg$Gene1<-factor(edg$Gene1,levels=vert$Symbol)
edg$Gene2<-factor(edg$Gene2,levels=vert$Symbol)
edg$source<-as.numeric(edg$Gene1)-1
edg$target<-as.numeric(edg$Gene2)-1
so that the source and target vectors have factor levels consistent with the node names (vert$Symbol); then
forceNetwork( Links = edg, Nodes = vert, Source = "source", Target = "target",
Value = "Prob", NodeID = "Symbol", Group = "Chr", opacity = 0.7,
colourScale = "d3.scale.category20b()", Nodesize = "Expr", zoom = T,
legend = T )
works for me.
Hope this is helpful.
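An alternative to the factor approach, sketched here on made-up data, is base R's match(), which looks up each edge endpoint in the node names directly. The `nodes` and `edges` data frames and their column names are hypothetical.

```r
# Made-up node and edge tables; the edges refer to nodes by name
nodes <- data.frame(name = c("geneA", "geneB", "geneC"),
                    stringsAsFactors = FALSE)
edges <- data.frame(from = c("geneA", "geneB"),
                    to   = c("geneC", "geneC"),
                    stringsAsFactors = FALSE)

# match() gives the 1-based row position of each name in nodes$name;
# subtracting 1 makes it 0-based
edges$source <- match(edges$from, nodes$name) - 1
edges$target <- match(edges$to,   nodes$name) - 1

edges$source  # 0 1
edges$target  # 2 2

# An NA here would mean an edge names a node that is not in the node table
anyNA(c(edges$source, edges$target))
```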
Related
Median and Boxplot (R)
I am writing to your forum because I cannot find a solution to my problem. I am trying to represent graphically the median catching time (MCT) of the mosquitoes that my team and I have collected (I am currently on an internship studying malaria in Ivory Coast). The MCT represents the time by which 50% of the total malaria vectors were caught on humans.
For example, we collected this sample:
Hour of collection / Number of mosquitoes:
20H-21H = 1
21H-22H = 1
22H-23H = 2
23H-00H = 2
00H-01H = 13
01H-02H = 10
02H-03H = 15
03H-04H = 15
04H-05H = 8
05H-06H = 10
06H-07H = 6
Here the cumulative count is 83 mosquitoes. I am assuming that the median of this mosquito series is the (83+1)/2 = 42nd observation (and I cannot even find this number in R), which gives a median catching time of 2 am (02).
I have tried to use the boxplot function with different parameters, but I cannot get what I want to represent. I get boxes for each hour of collection, when what I want is a representation of the cumulative count over the time of collection. The time used in R is "20H-21H" = 20, "21H-22H" = 21, etc.
I have found an article (Nicolas Moiroux, 2012) that presents the median catching time with a boxplot like the one I would like to have. I copy the image of the cited boxplot: Boxplot_Moiroux2012
Thank you in advance for your help, and I hope that my grammar is fine (I speak and write mainly in French, my mother tongue).
Kind regards,
Edouard
PS: Here is the code I have used with this data set (with "Eff" = number of mosquitoes and "Heure" = time of collection):
sum(Eff)
as.factor(Heure)
tapply(Eff,Heure,median)
tapply(Heure,Eff,median)
boxplot(Eff,horizontal=T)
boxplot(Heure~Eff)
boxplot(Eff~Heure)
(My R skills are not very sharp...)
You need to use a trick since you already have counts and not the time data for each catch. First, you convert your time values to a more continuous variable, then you generate a vector with all the time values, and then you draw the boxplot (with a custom axis).
txt <- "20H-21H = 1
21H-22H = 1
22H-23H = 2
23H-00H = 2
00H-01H = 13
01H-02H = 10
02H-03H = 15
03H-04H = 15
04H-05H = 8
05H-06H = 10
06H-07H = 6"
dat <- read.table(text = txt, sep = "=", h = F)
colnames(dat) <- c("collect_time", "nb_mosquito")
# make a continuous numerical proxy for time
dat$collect_time_num <- 1:nrow(dat)
# get values of proxy according to your data
tvals <- rep(dat$collect_time_num, dat$nb_mosquito)
# plot
boxplot(tvals, horizontal = T, xaxt = "n")
axis(1, labels = as.character(dat$collect_time), at = dat$collect_time_num)
This outputs the following plot:
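The median interval itself can also be read off the cumulative counts without drawing anything. A minimal sketch using the counts from the question; the variable names are made up for illustration.

```r
# Counts per collection hour, copied from the question
hours  <- c("20H-21H", "21H-22H", "22H-23H", "23H-00H", "00H-01H", "01H-02H",
            "02H-03H", "03H-04H", "04H-05H", "05H-06H", "06H-07H")
counts <- c(1, 1, 2, 2, 13, 10, 15, 15, 8, 10, 6)

total       <- sum(counts)        # 83 mosquitoes in total
median_rank <- (total + 1) / 2    # the 42nd mosquito is the median one
cum_counts  <- cumsum(counts)     # 1 2 4 6 19 29 44 59 67 77 83

# First interval whose cumulative count reaches the median rank
median_interval <- hours[which(cum_counts >= median_rank)[1]]
median_interval
# [1] "02H-03H"
```

This agrees with the 2 am figure worked out by hand in the question.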
How to create an edge list for each user mentioned in a tweet when there are observations containing several user mentioned
I want to do a network analysis of the tweets of some users of interest and the users mentioned in their tweets. I retrieved the tweets (no retweets) from several user timelines using the rtweet package in R and want to see who they mention in their tweets. There is even a variable with the screen names of the users who are mentioned, which will serve as the target group for my edge list. But sometimes they mention several users, and then the observation looks like this, for example: c('luigidimaio', 'giuseppeconteit'), whereas when only one user is mentioned, the observation is just that one user's name (e.g. agorarai).
I want to split the observations containing several mentioned users into single observations for each user. So one observation containing both mentioned users as a vector would have to be split into two observations, each containing one of the mentioned users.
The code looks like this so far:
# get user timelines of the most active italian parties (excluding retweets)
tmls_nort <- get_timelines(c("Mov5Stelle", "pdnetwork", "LegaSalvini"),
  n = 3200, include_rts = FALSE
)
# create an edge list
tmls_el = as.data.frame(cbind(Source = tolower(tmls_nort$screen_name), Target = tolower(tmls_nort$mentions_screen_name)))
Here is an extract of my dataframe:
Source Target n
<fct> <fct> <int>
1 legasalvini circomassimo 2
2 legasalvini 1giornodapecora 2
3 legasalvini 24mattino 2
4 legasalvini agorarai 28
5 legasalvini ariachetira 2
6 legasalvini "c(\"raiportaaporta\", \"brunovespa\")" 7
We can start from this: first you could clean up your columns, tidy up the data and plot your network. The data I used are:
tmls_el
Source Target n
1 legasalvini circomassimo 2
2 legasalvini 1giornodapecora 2
3 legasalvini 24mattino 2
4 legasalvini agorarai 28
5 legasalvini ariachetira 26
6 legasalvini c("raiportaaporta", "brunovespa") 7
7 movimento5stelle c("test1", "test2", "test3", "test4", "test5", "test6", "test7", "test8") 20
Now what I've done:
# here you replace the useless characters with nothing
tmls_el$Target <- gsub("c\\(\"", "", tmls_el$Target)
tmls_el$Target <- gsub("\\)", "", tmls_el$Target)
tmls_el$Target <- gsub("\"", "", tmls_el$Target)
library(stringr)
temp <- data.frame(str_split_fixed(tmls_el$Target, ", ", 8))
tmls_el_2 <- data.frame(
  Source = c(rep(as.character(tmls_el$Source),8))
  , Target = c(as.character(temp$X1),as.character(temp$X2),as.character(temp$X3),
               as.character(temp$X4),as.character(temp$X5),as.character(temp$X6),
               as.character(temp$X7),as.character(temp$X8))
  # keep n numeric so it can be used as an edge width when plotting
  , n = c(rep(tmls_el$n,8)))
Note: this works with the example you give. If you have more than 8 targets, you have to change the number 8 to the maximum number of targets k, add the new columns to Target, and repeat Source and n k times. Surely there is a more elegant way, but this works.
Here you can create edges and nodes:
library(dplyr)
el <- tmls_el_2 %>% filter(Target !='')
no <- data.frame(name = unique(c(as.character(el$Source),as.character(el$Target))))
Now you can use igraph to plot the results:
library(igraph)
g <- graph_from_data_frame(el, directed=TRUE, vertices=no)
plot(g, edge.width = el$n/2)
With data:
tmls_el <- data.frame(Source = c("legasalvini","legasalvini","legasalvini","legasalvini","legasalvini","legasalvini","movimento5stelle"),
                      Target = c("circomassimo","1giornodapecora","24mattino","agorarai","ariachetira","c(\"raiportaaporta\", \"brunovespa\")","c(\"test1\", \"test2\", \"test3\", \"test4\", \"test5\", \"test6\", \"test7\", \"test8\")"),
                      n = c(2,2,2,28,26,7,20))
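One possible way to avoid hard-coding the number of targets is base R's strsplit() together with rep() and lengths(). This is a sketch on a two-row subset of the example data, not a drop-in replacement for the answer's code, and it assumes Target is stored as text in the c("...") form shown above.

```r
# Two rows in the same shape as tmls_el: one single mention, one pair
tmls_el <- data.frame(
  Source = c("legasalvini", "legasalvini"),
  Target = c("agorarai", "c(\"raiportaaporta\", \"brunovespa\")"),
  n      = c(28, 7),
  stringsAsFactors = FALSE
)

# Strip the c( ... ) wrapper and the quotes, then split on commas
cleaned <- gsub("^c\\(|\\)$|\"", "", tmls_el$Target)
targets <- strsplit(cleaned, ", *")

# Repeat each Source and n once per mentioned user
el <- data.frame(
  Source = rep(tmls_el$Source, lengths(targets)),
  Target = unlist(targets),
  n      = rep(tmls_el$n, lengths(targets)),
  stringsAsFactors = FALSE
)
el
```

Each row of `el` then holds exactly one mentioned user, however many users a tweet mentioned.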
How do you plot the first few values of a PCA
I've run a PCA with a moderately-sized data set, but I only want to visualize a certain number of points from that analysis because they are from repeat observations and I want to see how close the paired observations are to each other on the plot. I've set it up so that the first 18 individuals are the ones I want to plot, but I can't seem to plot just the first 18 points without running the analysis on only the first 18 instead of the whole data set (43 individuals).
# My data file
TrialsMR<-read.csv("NER_Trials_Matrix_Retrials.csv", row.names = 1)
# I ran the PCA of all of my values (without the categorical variable in col 8)
R.pca <- PCA(TrialsMR[,-8], graph = FALSE)
# When I try to plot only the first 18 individuals with this method, I get an error
fviz_pca_ind(R.pca[1:18,],
             labelsize = 4,
             pointsize = 1,
             col.ind = TrialsMR$Bands,
             palette = c("red", "blue", "black", "cyan", "magenta", "yellow", "gray", "green3", "pink" ))
# This is the error
Error in R.pca[1:18, ] : incorrect number of dimensions
The 18 individuals are each paired up, so only using 9 colours shouldn't cause an error (I hope).
Could anyone help me plot just the first 18 points from a PCA of my whole data set?
My data frame looks similar to this in structure:
TrialsMR
Trees Bushes Shrubs Bands
JOHN1 1 4 18 BLUE
JOHN2 2 6 25 BLUE
CARL1 1 3 12 GREEN
CARL2 2 4 15 GREEN
GREG1 1 1 15 RED
GREG2 3 11 26 RED
MIKE1 1 7 19 PINK
MIKE2 1 1 25 PINK
where each band corresponds to a specific individual that has been tested twice.
You are using the wrong argument to specify individuals. Use select.ind to choose the individuals required, e.g.:
data(iris) # test data
You may want to rename your rows according to a grouping criterion so that they are readily identifiable in a plot. For example, let setosa be numbered in the 100-199 range, versicolor in 200-299 and virginica in 300-399. Do this before the PCA.
new_series <- c(101:150, 201:250, 301:350) # there are 50 of each
rownames(iris) <- new_series
R.pca <- prcomp(iris[,1:4],scale. = T) # pca
library(factoextra)
fviz_pca_ind(X= R.pca,
             labelsize = 4,
             pointsize = 1,
             select.ind= list(name = new_series[1:120]), # 120 out of 150 selected
             col.ind = iris$Species ,
             palette = c("blue", "red", "green" ))
Always refer to the R documentation first before using a new function.
R documentation: fviz_pca {factoextra}
X
an object of class PCA [FactoMineR]; prcomp and princomp [stats]; dudi and pca [ade4]; expOutput/epPCA [ExPosition].
select.ind, select.var
a selection of individuals/variables to be drawn. Allowed values are NULL or a list containing the arguments name, cos2 or contrib
For your particular dummy data, this should do:
R.pca <- prcomp(TrialsMR[,1:3], scale. = TRUE)
fviz_pca_ind(X= R.pca,
             select.ind= list(name = row.names(TrialsMR)[1:4]), # 4 out of 8
             pointsize = 1, labelsize = 4,
             col.ind = TrialsMR$Bands,
             palette = c("blue", "green" )) + ylim(-1,1)
Dummy Data:
TrialsMR <- read.table( text = "Trees Bushes Shrubs Bands
JOHN1 1 4 18 BLUE
JOHN2 2 6 25 BLUE
CARL1 1 3 12 GREEN
CARL2 2 4 15 GREEN
GREG1 1 1 15 RED
GREG2 3 11 26 RED
MIKE1 1 7 19 PINK
MIKE2 1 1 25 PINK", header = TRUE)
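If factoextra is not essential, the same idea can be sketched in base R: run the PCA on everything, then subset the scores matrix rather than the PCA object. This is an illustration only; iris stands in for the real data, and it uses prcomp, whose scores are stored in the `x` element.

```r
# PCA on the full data set (iris stands in for the real data)
pca <- prcomp(iris[, 1:4], scale. = TRUE)

# Scores for every individual are in pca$x; keep the first 18 rows
# and the first two principal components
scores <- pca$x[1:18, 1:2]

# Plot only those 18 individuals, labelled by their row position
plot(scores, xlab = "PC1", ylab = "PC2", pch = 19)
text(scores, labels = seq_len(nrow(scores)), pos = 3, cex = 0.7)
```

The axes are still those of the full-data PCA; only the set of points drawn is restricted.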
Capture the output of arules::inspect as data.frame
In "Zero frequent items" when using the eclat to mine frequent itemsets, the OP is interested in the groupings/clusterings based on how frequently they are ordered together. This grouping can be inspected with the arules::inspect function.
library(arules)
dataset <- read.transactions("8GbjnHK2.txt", sep = ";", rm.duplicates = TRUE)
f <- eclat(dataset,
           parameter = list(
             supp = 0.001,
             maxlen = 17,
             tidLists = TRUE))
inspect(head(sort(f, by = "support"), 10))
The data set can be downloaded from https://pastebin.com/8GbjnHK2.
However, the output cannot be easily saved to another object as a data frame.
out <- inspect(f)
So how do we capture the output of inspect(f) for use as a data frame?
We can use the methods labels to extract the associations/groupings and quality to extract the quality measures (support and count). We can then use cbind to store these in a data frame.
out <- cbind(labels = labels(f), quality(f))
head(out)
# labels support count
# 1 {3031093,3059242} 0.001010 16
# 2 {3031096,3059242} 0.001073 17
# 3 {3060614,3060615} 0.001010 16
# 4 {3022540,3072091} 0.001010 16
# 5 {3061698,3061700} 0.001073 17
# 6 {3031087,3059242} 0.002778 44
Coercing the itemsets to a data.frame also creates the required output.
> head(as(f, "data.frame"))
items support count
1 {3031093,3059242} 0.001010101 16
2 {3031096,3059242} 0.001073232 17
3 {3060614,3060615} 0.001010101 16
4 {3022540,3072091} 0.001010101 16
5 {3061698,3061700} 0.001073232 17
6 {3031087,3059242} 0.002777778 44
How to combine multiple variable data to a single variable data?
After making my data frame and selecting the variables I want to look at, I face a dilemma. The Excel sheet that acts as my data source was used by different people recording the same type of data.
Mock Neg Neg1PCR Neg2PCR NegPBS red Red RedWine water Water white White
1 9 1 1 1 2 18 4 4 4 2 26
As you can see, because the data is written differently, the major groups (red wine, white wine and water) have been split into subgroups.
How do I combine the subgroups into a combined group, e.g. red+Red+RedWine -> total red wine? I use the phyloseq package for this kind of dataset.
names <- c("red","white","water")
df2 <- setNames(data.frame(matrix(ncol = length(names), nrow = nrow(df))),names)
for(col in names){
  df2[,col] <- rowSums(df[,grep(col,tolower(names(df)))])
}
Here grep(col,tolower(names(df))) looks for all the column names of df that contain a string such as "red", ignoring case. You then just sum those columns into a new data.frame df2 defined with the right dimensions.
I would just create a new data.frame, easiest to do with dplyr but also doable with base R:
with dplyr
newFrame <- oldFrame %>% mutate(Mock = Mock,
                                Neg = Neg + Neg1PCR + Neg2PCR + NegPBS,
                                Red = red + Red + RedWine,
                                Water = water + Water,
                                White = white + White)
with base R (not complete but you get the point)
newFrame <- data.frame(Red = oldFrame$Red + oldFrame$red + oldFrame$RedWine...)
One can use dplyr::starts_with and dplyr::select to combine columns. ignore.case is TRUE by default in dplyr::starts_with, which helps with the data frame the OP has posted.
library(dplyr)
names <- c("red", "white", "water")
cbind(df[1], t(mapply(function(x)rowSums(select(df, starts_with(x))), names)))
# Mock red white water
# 1 1 24 28 8
Data:
df <- read.table(text =
"Mock Neg Neg1PCR Neg2PCR NegPBS red Red RedWine water Water white White
1 9 1 1 1 2 18 4 4 4 2 26",
header = TRUE, stringsAsFactors = FALSE)