Related
I am trying to generate a heatmap as the following figure. I have already tried pheatmap and the code is as follows:
breaks_2 <- seq(min(0), max(2), by = 0.1)
pheatmap::pheatmap(
mat = data,
cluster_cols = F,
cluster_rows = F,
scale = "column",
border_color = "white",
color = inferno(20),
show_colnames = TRUE,
show_rownames = FALSE,
breaks = breaks_2
)
But this does not seem to work. So far I am understanding I am mistaking with defining break or have to use another package than pheatmap. Any suggestion will be really helpful.
The color scale in pheatmap adjusts to the range of the input data. If you want anything above a certain value to be coloured daffodil, then simply send pheatmap a copy of your data with the highest values rounded to 2.
Suppose you have a data frame like this, with values anywhere between 0 and 3:
set.seed(1)
data <- as.data.frame(matrix(runif(64, 0, 3), nrow = 8))
names(data) <- LETTERS[1:8]
data
#> A B C D E F G H
#> 1 0.7965260 1.8873421 2.1528555 0.801662 1.4806239 2.46283888 2.1969412 0.9488151
#> 2 1.1163717 0.1853588 2.9757183 1.158342 0.5586528 1.94118058 2.0781947 1.5559028
#> 3 1.7185601 0.6179237 1.1401055 0.040171 2.4821200 2.34879829 1.4328589 1.9860152
#> 4 2.7246234 0.5296703 2.3323357 1.147164 2.0054002 1.65910893 2.5836284 1.2204906
#> 5 0.6050458 2.0610685 2.8041157 2.609073 2.3827196 1.58915874 1.3142913 2.7386278
#> 6 2.6951691 1.1523112 0.6364276 1.021047 0.3238309 2.36806870 0.7343918 0.8808101
#> 7 2.8340258 2.3095243 1.9550213 1.446240 2.1711328 0.06999361 0.2120371 1.3771972
#> 8 1.9823934 1.4930977 0.3766653 1.798697 1.2338233 1.43169020 0.2983985 0.9971840
Some of the values are greater than two. We want all of these to appear the same colour on our heatmap, so we create a copy of our data for plotting, and round down all of the values that were greater than 2 to be exactly 2:
data_2 <- data
data_2[] <- lapply(data_2, function(x) { x[x > 2] <- 2; x })
So now if we run pheatmap on data_2, we see that all the values that were greater than 2 in our original data frame are coloured daffodil.
library(viridis)
library(pheatmap)
breaks_2 <- seq(0, 2, by = 0.1)
pheatmap(
mat = data_2,
cluster_cols = F,
cluster_rows = F,
border_color = "white",
scale = 'none',
color = inferno(22),
show_colnames = TRUE,
show_rownames = FALSE,
legend_breaks = breaks_2
)
Given a data frame containing mixed variables (i.e. both categorical and continuous) like,
digits = 0:9
# set seed for reproducibility
set.seed(17)
# function to create random string
createRandString <- function(n = 5000) {
a <- do.call(paste0, replicate(5, sample(LETTERS, n, TRUE), FALSE))
paste0(a, sprintf("%04d", sample(9999, n, TRUE)), sample(LETTERS, n, TRUE))
}
df <- data.frame(ID=c(1:10), name=sample(letters[1:10]),
studLoc=sample(createRandString(10)),
finalmark=sample(c(0:100),10),
subj1mark=sample(c(0:100),10),subj2mark=sample(c(0:100),10)
)
I perform unsupervised feature selection using the package FactoMineR
df.princomp <- FactoMineR::FAMD(df, graph = FALSE)
The variable df.princomp is a list.
Thereafter, to visualize the principal components I use
fviz_screeplot() and fviz_contrib() like,
#library(factoextra)
factoextra::fviz_screeplot(df.princomp, addlabels = TRUE,
barfill = "gray", barcolor = "black",
ylim = c(0, 50), xlab = "Principal Component",
ylab = "Percentage of explained variance",
main = "Principal Component (PC) for mixed variables")
factoextra::fviz_contrib(df.princomp, choice = "var",
axes = 1, top = 10, sort.val = c("desc"))
which gives the following Fig1
and Fig2
Explanation of Fig1: The Fig1 is a scree plot. A Scree Plot is a simple line segment plot that shows the fraction of total variance in the data as explained or represented by each Principal Component (PC). So we can see the first three PCs collectively are responsible for 43.8% of total variance. The question now naturally arises, "What are these variables?". This I have shown in Fig2.
Explanation of Fig2: This figure visualizes the contribution of rows/columns from the results of Principal Component Analysis (PCA). From here I can see the variables, name, studLoc and finalMark are the most important variables that can be used for further analysis.
Further Analysis- where I'm stuck at: To derive the contribution of the aforementioned variables name, studLoc, finalMark. I use the principal component variable df.princomp (see above) like df.princomp$quanti.var$contrib[,4]and df.princomp$quali.var$contrib[,2:3].
I've to manually specify the column indices [,2:3] and [,4].
What I want: I want to know how to do dynamic column index assignment, such that I do not have to manually code the column index [,2:3] in the list df.princomp?
I've already looked at the following similar questions 1, 2, 3 and 4 but cannot find my solution? Any help or suggestions to solve this problem will be helpful.
Not sure if my interpretation of your question is correct, apologies if not. From what I gather you are using PCA as an initial tool to show you what variables are the most important in explaining the dataset. You then want to go back to your original data, select these variables quickly without manual coding each time, and use them for some other analysis.
If this is correct then I have saved the data from the contribution plot, filtered out the variables that have the greatest contribution, and used that result to create a new data frame with these variables alone.
digits = 0:9
# set seed for reproducibility
set.seed(17)
# function to create random string
createRandString <- function(n = 5000) {
a <- do.call(paste0, replicate(5, sample(LETTERS, n, TRUE), FALSE))
paste0(a, sprintf("%04d", sample(9999, n, TRUE)), sample(LETTERS, n, TRUE))
}
df <- data.frame(ID=c(1:10), name=sample(letters[1:10]),
studLoc=sample(createRandString(10)),
finalmark=sample(c(0:100),10),
subj1mark=sample(c(0:100),10),subj2mark=sample(c(0:100),10)
)
df.princomp <- FactoMineR::FAMD(df, graph = FALSE)
factoextra::fviz_screeplot(df.princomp, addlabels = TRUE,
barfill = "gray", barcolor = "black",
ylim = c(0, 50), xlab = "Principal Component",
ylab = "Percentage of explained variance",
main = "Principal Component (PC) for mixed variables")
#find the top contributing variables to the overall variation in the dataset
#here I am choosing the top 10 variables (although we only have 6 in our df).
#note you can specify which axes you want to look at with axes=, you can even do axes=c(1,2)
f<-factoextra::fviz_contrib(df.princomp, choice = "var",
axes = c(1), top = 10, sort.val = c("desc"))
#save data from contribution plot
dat<-f$data
#filter out ID's that are higher than, say, 20
r<-rownames(dat[dat$contrib>20,])
#extract these from your original data frame into a new data frame for further analysis
new<-df[r]
new
#finalmark name studLoc
#1 53 b POTYQ0002N
#2 73 i LWMTW1195I
#3 95 d VTUGO1685F
#4 39 f YCGGS5755N
#5 97 c GOSWE3283C
#6 58 g APBQD6181U
#7 67 a VUJOG1460V
#8 64 h YXOGP1897F
#9 15 j NFUOB6042V
#10 81 e QYTHG0783G
Based on your comment, where you said you wanted to 'Find variables with value greater than 5 in Dim.1 AND Dim.2 and save these variables to a new data frame', I would do this:
#top contributors to both Dim 1 and 2
f<-factoextra::fviz_contrib(df.princomp, choice = "var",
axes = c(1,2), top = 10, sort.val = c("desc"))
#save data from contribution plot
dat<-f$data
#filter out ID's that are higher than 5
r<-rownames(dat[dat$contrib>5,])
#extract these from your original data frame into a new data frame for further analysis
new<-df[r]
new
(This keeps all the original variables in our new data frame since they all contributed more than 5% to the total variance)
There are a lot of ways to extract contributions of individual variables to PCs. For numeric input, one can run a PCA with prcomp and look at $rotation (I spoke to soon and forgot you've got factors here so prcomp won't work directly). Since you are using factoextra::fviz_contrib, it makes sense to check how that function extracts this information under the hood. Key factoextra::fviz_contrib and read the function:
> factoextra::fviz_contrib
function (X, choice = c("row", "col", "var", "ind", "quanti.var",
"quali.var", "group", "partial.axes"), axes = 1, fill = "steelblue",
color = "steelblue", sort.val = c("desc", "asc", "none"),
top = Inf, xtickslab.rt = 45, ggtheme = theme_minimal(),
...)
{
sort.val <- match.arg(sort.val)
choice = match.arg(choice)
title <- .build_title(choice[1], "Contribution", axes)
dd <- facto_summarize(X, element = choice, result = "contrib",
axes = axes)
contrib <- dd$contrib
names(contrib) <- rownames(dd)
theo_contrib <- 100/length(contrib)
if (length(axes) > 1) {
eig <- get_eigenvalue(X)[axes, 1]
theo_contrib <- sum(theo_contrib * eig)/sum(eig)
}
df <- data.frame(name = factor(names(contrib), levels = names(contrib)),
contrib = contrib)
if (choice == "quanti.var") {
df$Groups <- .get_quanti_var_groups(X)
if (missing(fill))
fill <- "Groups"
if (missing(color))
color <- "Groups"
}
p <- ggpubr::ggbarplot(df, x = "name", y = "contrib", fill = fill,
color = color, sort.val = sort.val, top = top, main = title,
xlab = FALSE, ylab = "Contributions (%)", xtickslab.rt = xtickslab.rt,
ggtheme = ggtheme, sort.by.groups = FALSE, ...) + geom_hline(yintercept = theo_contrib,
linetype = 2, color = "red")
p
}
<environment: namespace:factoextra>
So it's really just calling facto_summarize from the same package. By analogy you can do the same thing, simply call:
> dd <- factoextra::facto_summarize(df.princomp, element = "var", result = "contrib", axes = 1)
> dd
name contrib
ID ID 0.9924561
finalmark finalmark 21.4149175
subj1mark subj1mark 7.1874438
subj2mark subj2mark 16.6831560
name name 26.8610132
studLoc studLoc 26.8610132
And that's the table corresponding to your figure 2. For PC2 use axes = 2 and so on.
Regarding "how to programmatically determine the column indices of the PCs", I'm not 100% sure I understand what you want, but if you just want to say for column "finalmark", grab its contribution to PC3 you can do the following:
library(tidyverse)
# make a tidy table of all column names in the original df with their contributions to all PCs
contribution_df <- map_df(set_names(1:5), ~factoextra::facto_summarize(df.princomp, element = "var", result = "contrib", axes = .x), .id = "PC")
# get the contribution of column 'finalmark' by name
contribution_df %>%
filter(name == "finalmark")
# get the contribution of column 'finalmark' to PC3
contribution_df %>%
filter(name == "finalmark" & PC == 3)
# or, just the numeric value of contribution
filter(contribution_df, name == "finalmark" & PC == 3)$contrib
BTW I think ID in your example is treated as numeric instead of factor, but since it's just an example I'm not bothering with it.
I created a Sankey diagram using the plotly package.
Please look at below example. I tried to make five streams, 1_6_7, 2_6_7, and so on. But two of five links between 6 and 7 disappeared. As far as I see, plotly allows to make only three or less links between two nodes.
Can I remove this restrictions ? Any help would be greatly appreciated.
Here is an example code and the outputs:
d <- expand.grid(1:5, 6, 7)
node_label <- 1:max(d)
node_colour <- scales::alpha(RColorBrewer::brewer.pal(7, "Set2"), 0.8)
link_source_nodeind <- c(d[,1], d[,2]) - 1
link_target_nodeind <- c(d[,2], d[,3]) - 1
link_value <- rep(100, nrow(d) * 2)
link_label <- rep(paste(d[,1], d[,2], d[,3], sep = "_"), 2)
link_colour <- rep(scales::alpha(RColorBrewer::brewer.pal(5, "Set2"), 0.2), 2)
p <- plotly::plot_ly(type = "sankey",
domain = c(x = c(0,1), y = c(0,1)),
orientation = "h",
node = list(label = node_label,
color = node_colour),
link = list(source = link_source_nodeind,
target = link_target_nodeind,
value = link_value,
label = link_label,
color = link_colour))
p
I have some authors with their city or country of affiliation. I would like to know if it is possible to plot the coauthors' networks (figure 1), on the map, having the coordinates of the countries. Please consider multiple authors from the same country. [EDIT: Several networks could be generated as in the example and should not show avoidable overlaps]. This is intended for dozens of authors. A zooming option is desirable. Bounty promise +100 for future better answer.
refs5 <- read.table(text="
row bibtype year volume number pages title journal author
Bennett_1995 article 1995 76 <NA> 113--176 angiosperms. \"Annals of Botany\" \"Bennett Md, Leitch Ij\"
Bennett_1997 article 1997 80 2 169--196 estimates. \"Annals of Botany\" \"Bennett MD, Leitch IJ\"
Bennett_1998 article 1998 82 SUPPL.A 121--134 weeds. \"Annals of Botany\" \"Bennett MD, Leitch IJ, Hanson L\"
Bennett_2000 article 2000 82 SUPPL.A 121--134 weeds. \"Annals of Botany\" \"Bennett MD, Someone IJ\"
Leitch_2001 article 2001 83 SUPPL.A 121--134 weeds. \"Annals of Botany\" \"Leitch IJ, Someone IJ\"
New_2002 article 2002 84 SUPPL.A 121--134 weeds. \"Annals of Botany\" \"New IJ, Else IJ\"" , header=TRUE,stringsAsFactors=FALSE)
rownames(refs5) <- refs5[,1]
refs5<-refs5[,2:9]
citations <- as.BibEntry(refs5)
authorsl <- lapply(citations, function(x) as.character(toupper(x$author)))
unique.authorsl<-unique(unlist(authorsl))
coauth.table <- matrix(nrow=length(unique.authorsl),
ncol = length(unique.authorsl),
dimnames = list(unique.authorsl, unique.authorsl), 0)
for(i in 1:length(citations)){
paper.auth <- unlist(authorsl[[i]])
coauth.table[paper.auth,paper.auth] <- coauth.table[paper.auth,paper.auth] + 1
}
coauth.table <- coauth.table[rowSums(coauth.table)>0, colSums(coauth.table)>0]
diag(coauth.table) <- 0
coauthors<-coauth.table
bip = network(coauthors,
matrix.type = "adjacency",
ignore.eval = FALSE,
names.eval = "weights")
authorcountry <- read.table(text="
author country
1 \"LEITCH IJ\" Argentina
2 \"HANSON L\" USA
3 \"BENNETT MD\" Brazil
4 \"SOMEONE IJ\" Brazil
5 \"NEW IJ\" Brazil
6 \"ELSE IJ\" Brazil",header=TRUE,fill=TRUE,stringsAsFactors=FALSE)
matched<- authorcountry$country[match(unique.authorsl, authorcountry$author)]
bip %v% "Country" = matched
colorsmanual<-c("red","darkgray","gainsboro")
names(colorsmanual) <- unique(matched)
gdata<- ggnet2(bip, color = "Country", palette = colorsmanual, legend.position = "right",label = TRUE,
alpha = 0.9, label.size = 3, edge.size="weights",
size="degree", size.legend="Degree Centrality") + theme(legend.box = "horizontal")
gdata
In other words, adding the names of authors, lines and bubbles to the map. Note, several authors maybe from the same city, or country and should not overlap.
Figure 1 Network
EDIT: The current JanLauGe answer overlaps two non-related networks. authors "ELSE" and "NEW" need to be apart from others as in figure 1.
Are you looking for a solution using exactly the packages you used, or would you be happy to use suite of other packages? Below is my approach, in which I extract the graph properties from the network object and plot them on a map using the ggplot2 and map package.
First I recreate the example data you gave.
library(tidyverse)
library(sna)
library(maps)
library(ggrepel)
set.seed(1)
coauthors <- matrix(
c(0,3,1,1,3,0,1,0,1,1,0,0,1,0,0,0),
nrow = 4, ncol = 4,
dimnames = list(c('BENNETT MD', 'LEITCH IJ', 'HANSON L', 'SOMEONE ELSE'),
c('BENNETT MD', 'LEITCH IJ', 'HANSON L', 'SOMEONE ELSE')))
coords <- data_frame(
country = c('Argentina', 'Brazil', 'USA'),
coord_lon = c(-63.61667, -51.92528, -95.71289),
coord_lat = c(-38.41610, -14.23500, 37.09024))
authorcountry <- data_frame(
author = c('LEITCH IJ', 'HANSON L', 'BENNETT MD', 'SOMEONE ELSE'),
country = c('Argentina', 'USA', 'Brazil', 'Brazil'))
Now I generate the graph object using the snp function network
# Generate network
bip <- network(coauthors,
matrix.type = "adjacency",
ignore.eval = FALSE,
names.eval = "weights")
# Graph with ggnet2 for centrality
gdata <- ggnet2(bip, color = "Country", legend.position = "right",label = TRUE,
alpha = 0.9, label.size = 3, edge.size="weights",
size="degree", size.legend="Degree Centrality") + theme(legend.box = "horizontal")
From the network object we can extract the values of each edge, and from the ggnet2 object we can get degree of centrality for nodes as below:
# Combine data
authors <-
# Get author numbers
data_frame(
id = seq(1, nrow(coauthors)),
author = sapply(bip$val, function(x) x$vertex.names)) %>%
left_join(
authorcountry,
by = 'author') %>%
left_join(
coords,
by = 'country') %>%
# Jittering points to avoid overlap between two authors
mutate(
coord_lon = jitter(coord_lon, factor = 1),
coord_lat = jitter(coord_lat, factor = 1))
# Get edges from network
networkdata <- sapply(bip$mel, function(x)
c('id_inl' = x$inl, 'id_outl' = x$outl, 'weight' = x$atl$weights)) %>%
t %>% as_data_frame
dt <- networkdata %>%
left_join(authors, by = c('id_inl' = 'id')) %>%
left_join(authors, by = c('id_outl' = 'id'), suffix = c('.from', '.to')) %>%
left_join(gdata$data %>% select(label, size), by = c('author.from' = 'label')) %>%
mutate(edge_id = seq(1, nrow(.)),
from_author = author.from,
from_coord_lon = coord_lon.from,
from_coord_lat = coord_lat.from,
from_country = country.from,
from_size = size,
to_author = author.to,
to_coord_lon = coord_lon.to,
to_coord_lat = coord_lat.to,
to_country = country.to) %>%
select(edge_id, starts_with('from'), starts_with('to'), weight)
Should look like this now:
dt
# A tibble: 8 × 11
edge_id from_author from_coord_lon from_coord_lat from_country from_size to_author to_coord_lon
<int> <chr> <dbl> <dbl> <chr> <dbl> <chr> <dbl>
1 1 BENNETT MD -51.12756 -16.992729 Brazil 6 LEITCH IJ -65.02949
2 2 BENNETT MD -51.12756 -16.992729 Brazil 6 HANSON L -96.37907
3 3 BENNETT MD -51.12756 -16.992729 Brazil 6 SOMEONE ELSE -52.54160
4 4 LEITCH IJ -65.02949 -35.214117 Argentina 4 BENNETT MD -51.12756
5 5 LEITCH IJ -65.02949 -35.214117 Argentina 4 HANSON L -96.37907
6 6 HANSON L -96.37907 36.252312 USA 4 BENNETT MD -51.12756
7 7 HANSON L -96.37907 36.252312 USA 4 LEITCH IJ -65.02949
8 8 SOMEONE ELSE -52.54160 -9.551913 Brazil 2 BENNETT MD -51.12756
# ... with 3 more variables: to_coord_lat <dbl>, to_country <chr>, weight <dbl>
Now moving on to plotting this data on a map:
world_map <- map_data('world')
myMap <- ggplot() +
# Plot map
geom_map(data = world_map, map = world_map, aes(map_id = region),
color = 'gray85',
fill = 'gray93') +
xlim(c(-120, -20)) + ylim(c(-50, 50)) +
# Plot edges
geom_segment(data = dt,
alpha = 0.5,
color = "dodgerblue1",
aes(x = from_coord_lon, y = from_coord_lat,
xend = to_coord_lon, yend = to_coord_lat,
size = weight)) +
scale_size(range = c(1,3)) +
# Plot nodes
geom_point(data = dt,
aes(x = from_coord_lon,
y = from_coord_lat,
size = from_size,
colour = from_country)) +
# Plot names
geom_text_repel(data = dt %>%
select(from_author,
from_coord_lon,
from_coord_lat) %>%
unique,
colour = 'dodgerblue1',
aes(x = from_coord_lon, y = from_coord_lat, label = from_author)) +
coord_equal() +
theme_bw()
Obviously you can change the colour and design in the usual way with ggplot2 grammar. Notice that you could also use geom_curve and the arrow aesthetic to get a plot similar to the one in the uber post linked in the comments above.
As an effort to avoid the overlapping of the 2 networks, I came to this modification of the x and y coordenates of the ggplot, which by default does not overlap the networks, see figure 1 in the question.
# get centroid positions for countries
# add coordenates to authorcountry table
# download and unzip
# https://worldmap.harvard.edu/data/geonode:country_centroids_az8
setwd("~/country_centroids_az8")
library(rgdal)
cent <- readOGR('.', "country_centroids_az8", stringsAsFactors = F)
countrycentdf<-cent#data[,c("name","Longitude","Latitude")]
countrycentdf$name[which(countrycentdf$name=="United States")]<-"USA"
colnames(countrycentdf)[names(countrycentdf)=="name"]<-"country"
authorcountry$Longitude<-countrycentdf$Longitude[match(authorcountry$country,countrycentdf$country)]
authorcountry$Latitude <-countrycentdf$Latitude [match(authorcountry$country,countrycentdf$country)]
# original coordenates of plot and its transformation
ggnetbuild<-ggplot_build(gdata)
allcoord<-ggnetbuild$data[[3]][,c("x","y","label")]
allcoord$Latitude<-authorcountry$Latitude [match(allcoord$label,authorcountry$author)]
allcoord$Longitude<-authorcountry$Longitude [match(allcoord$label,authorcountry$author)]
allcoord$country<-authorcountry$country [match(allcoord$label,authorcountry$author)]
# increase with factor the distance among dots
factor<-7
allcoord$coord_lat<-allcoord$y*factor+allcoord$Latitude
allcoord$coord_lon<-allcoord$x*factor+allcoord$Longitude
allcoord$author<-allcoord$label
# plot as in answer of JanLauGe, without jitter
library(tidyverse)
library(ggrepel)
authors <-
# Get author numbers
data_frame(
id = seq(1, nrow(coauthors)),
author = sapply(bip$val, function(x) x$vertex.names)) %>%
left_join(
allcoord,
by = 'author')
# Continue as in answer of JanLauGe
networkdata <- ##
dt <- ##
world_map <- map_data('world')
myMap <- ##
myMap
Assuming I am creating a directed graph as follows
a = sample(state.name, 10)
b = sample(state.name, 10)
c = sample(state.name, 10)
d = data.frame(a,b,c)
d1 = as.matrix(d)
edges1 = rbind(d1[,1:2],d1[,2:3])
g = graph.data.frame(edges1, directed=F)
adj <- as.matrix(get.adjacency(g))
g2 <- new("graphAM", adjMat=adj, edgemode="directed")
plot(g2, attrs = list(graph = list(rankdir="LR"),
node = list(fillcolor = "lightblue")))
I get a very ugly graph with very disproportionate node size and node text. (sorry dont have enough credit to post the image of the graph)
I tried node.size and V(g).size parameters and it didnt work. Any advise ?