I am using the ggseqplot package to visualize sequential data, however, I cannot add a count label to each stack. Following is my code:
ggseqdplot(biofam.seq, border = T) +
scale_fill_discrete_sequential("PuBuGn")+
scale_x_discrete() +
labs(x = "Path Number") +
theme(legend.position = "right") +
theme_minimal()
This sample dataset can be used:
data(biofam)
biofam <- biofam[sample(nrow(biofam),300),]
biofam.lab <- c("Parent", "Left", "Married", "Left+Marr",
"Child", "Left+Child", "Left+Marr+Child", "Divorced")
biofam.seq <- seqdef(biofam, 10:25, labels=biofam.lab)
How can I add a count data label to each stack of this plot.
Including count labels in a dplot is unconventional, because in most instances the resulting plots are quite crowded. This is also true for the example data you used but, of course, this might be different in your own application.
As virtually no one uses count labels, this option is not implemented in ggseqdplot, but you can add another plot layer containing the labels using geom_text. For this purpose, we have to generate an additional data set containing the labels (state frequencies) and their positions on the x and y axes.
In the following code, I first extract the state frequencies using table and then reshape the data into the long (tidy) format required by ggplot. If you save the plot generated with ggseqdplot as an object (p) you can inspect the data underlying the plot (p$data). This gives you an idea of the required data structure for your count label data. Once the data are in the right shape, you can add the label plot layer. Note that I only display count labels if states with a frequency higher than 10.
# load required libraries
library(TraMineR)
library(ggseqplot)
library(tidyverse)
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Example data
data(biofam)
set.seed(10)
biofam <- biofam[sample(nrow(biofam),300),]
biofam.lab <- c("Parent", "Left", "Married", "Left+Marr",
"Child", "Left+Child", "Left+Marr+Child", "Divorced")
biofam.seq <- seqdef(biofam, 10:25, labels=biofam.lab)
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Render & save dplot
p <- ggseqdplot(biofam.seq, border = T) +
scale_x_discrete() +
theme_minimal() +
theme(legend.position = "bottom")
#> Scale for 'x' is already present. Adding another scale for 'x', which will
#> replace the existing scale.
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Table with state frequencies
state.freqs <- map_dfc(biofam.seq, table) |>
map_dfc(as.numeric) |>
mutate(state = row_number(), .before = 1)
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Reshape & prepare data for plot
label.data <- state.freqs |>
pivot_longer(cols = -1,
names_to = "x",
values_to = "label") |>
group_by(state) |>
mutate(x = row_number() |> factor()) |>
ungroup() |>
filter(label != 0) |>
mutate(state = factor(state, labels = stlab(biofam.seq))) |>
group_by(x) |>
mutate(y = cumsum(label/sum(label)))
# Have a look at the data
label.data
#> # A tibble: 92 × 4
#> # Groups: x [16]
#> state x label y
#> <fct> <fct> <dbl> <dbl>
#> 1 Parent 1 296 0.987
#> 2 Parent 2 285 0.95
#> 3 Parent 3 281 0.937
#> 4 Parent 4 269 0.897
#> 5 Parent 5 249 0.83
#> 6 Parent 6 222 0.74
#> 7 Parent 7 189 0.63
#> 8 Parent 8 162 0.54
#> 9 Parent 9 135 0.45
#> 10 Parent 10 112 0.373
#> # … with 82 more rows
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Add labels to original plot
# We only show labels for states with more than 10 occurences
p +
geom_text(data = label.data |> filter(label > 10),
aes(x = x, y = y, label = label),
nudge_y = -.02, size = 3)
Created on 2022-10-17 with reprex v2.0.2
Related
With my dataframe that looks like this (I have in total 1322 rows) :
I'd like to make a bar plot with the percentage of rating of the CFS score. It should look similar to this :
With this code, I can make a single bar plot for the column cfs_triage :
ggplot(data = df) +
geom_bar(mapping = aes(x = cfs_triage, y = (..count..)/sum(..count..)))
But I can't find out to make one with the three varaibles next to another.
Thank you in advance to all of you that will help me with making this barplot with the percentage of rating for this three variable !(I'm not sure that my explanations are very clear, but I hope that it's the case :))
Your best bet here is to pivot your data into long format. We don't have your data, but we can reproduce a similar data set like this:
set.seed(1)
df <- data.frame(cfs_triage = sample(10, 1322, TRUE, prob = 1:10),
cfs_silver = sample(10, 1322, TRUE),
cfs_student = sample(10, 1322, TRUE, prob = 10:1))
df[] <- lapply(df, function(x) { x[sample(1322, 300)] <- NA; x})
Now the dummy data set looks a lot like yours:
head(df)
#> cfs_triage cfs_silver cfs_student
#> 1 9 NA 1
#> 2 8 4 2
#> 3 NA 8 NA
#> 4 NA 10 9
#> 5 9 5 NA
#> 6 3 1 NA
If we pivot into long format, then we will end up with two columns: one containing the values, and one containing the column name that the value belonged to in the original data frame:
library(tidyverse)
df_long <- df %>%
pivot_longer(everything())
head(df_long)
#> # A tibble: 6 x 2
#> name value
#> <chr> <int>
#> 1 cfs_triage 9
#> 2 cfs_silver NA
#> 3 cfs_student 1
#> 4 cfs_triage 8
#> 5 cfs_silver 4
#> 6 cfs_student 2
This then allows us to plot with value on the x axis, and we can use name as a grouping / fill variable:
ggplot(df_long, aes(value, fill = name)) +
geom_bar(position = 'dodge') +
scale_fill_grey(name = NULL) +
theme_bw(base_size = 16) +
scale_x_continuous(breaks = 1:10)
#> Warning: Removed 900 rows containing non-finite values (`stat_count()`).
Created on 2022-11-25 with reprex v2.0.2
Maybe you need something like this: The formatting was taken from #Allan Cameron (many Thanks!):
library(tidyverse)
library(scales)
df %>%
mutate(id = row_number()) %>%
pivot_longer(-id) %>%
group_by(id) %>%
mutate(percent = value/sum(value, na.rm = TRUE)) %>%
mutate(percent = ifelse(is.na(percent), 0, percent)) %>%
mutate(my_label = str_trim(paste0(format(100 * percent, digits = 1), "%"))) %>%
ggplot(aes(x = factor(name), y = percent, fill = factor(name), label = my_label))+
geom_col(position = position_dodge())+
geom_text(aes(label = my_label), vjust=-1) +
facet_wrap(. ~ id, nrow=1, strip.position = "bottom")+
scale_fill_grey(name = NULL) +
scale_y_continuous(labels = scales::percent)+
theme_bw(base_size = 16)+
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
I'm using the tidyverse-ggplot2 combination to plot multiple bar plots. In one of my comparisons i would like to have even up to 300 single plots. I was wondering if there is a possibility to make sure that the plots will be visible in the pdf file and not look like the attached example
If possible I would prefer to have all the plots in one single pdf file, but if not, also multiple pages will be ok.
The command to plot the bar charts is
common %>%
as_tibble(rownames="gene") %>%
left_join(x= ., y = up[,1:2], by = c("gene" = "ensembl_gene_id") ) %>%
pivot_longer(starts_with("S"), names_to="sample", values_to="counts") %>%
left_join(groups, by="sample") %>%
group_by(mgi_symbol, group, cond, time) %>%
summarize(mean_count=mean(counts)) %>%
ggplot( aes(x = time, y = mean_count, fill=cond)) +
geom_bar(stat = "identity", position = position_dodge(width=0.9) ) +
scale_fill_manual(values=c("darkblue", "lightblue", "black")) +
facet_wrap(~mgi_symbol, scales = "free", ncol = 5) +
theme_bw()
I forgot to add the group table
groups <- tibble(
sample= colnames(normCounts),
group = rep(seq(1, ncol(normCounts)/3), each=3),
cond = rep(c("WT", "GCN2-KO", "GCN1-KO"), each = 12),
time = rep(rep(c("0h", "1h", "4h", "8h"), each=3), times = 3 )
)
thanks
Adding the command with the group_map was as such
common %>%
as_tibble(rownames="gene") %>%
left_join(x= ., y = up[,1:2], by = c("gene" = "ensembl_gene_id") ) %>%
pivot_longer(starts_with("S"), names_to="sample", values_to="counts") %>%
left_join(groups, by="sample") %>%
group_by(mgi_symbol, group, cond, time) %>%
summarize(mean_count=mean(counts)) %>%
group_map(function(g, ...)
ggplot(g, aes(x = time, y = mean_count, fill=cond)) +
geom_bar(stat = "identity", position = position_dodge(width=0.9) ) +
scale_fill_manual(values=c("darkblue", "lightblue", "black")) +
facet_wrap(~mgi_symbol, scales = "free", ncol = 5) +
theme_bw()
)
EDIT
This is how the data looks like in the input table (after summarizing the means)
df <-
common %>%
as_tibble(rownames="gene") %>%
left_join(x= ., y = up[,1:2], by = c("gene" = "ensembl_gene_id") ) %>%
pivot_longer(starts_with("S"), names_to="sample", values_to="counts") %>%
left_join(groups, by="sample") %>%
group_by(mgi_symbol, group, cond, time) %>%
summarize(mean_count=mean(counts)) %>%
ungroup()
df
#>`summarise()` regrouping output by 'mgi_symbol', 'group', 'cond' (override with `.groups` argument)
#> # A tibble: 1,212 x 5
#> mgi_symbol group cond time mean_count
#> <chr> <int> <chr> <chr> <dbl>
#> 1 0610031O16Rik 1 WT 0h 14.4
#> 2 0610031O16Rik 2 WT 1h 30.9
#> 3 0610031O16Rik 3 WT 4h 45.5
#> 4 0610031O16Rik 4 WT 8h 56.0
#> 5 0610031O16Rik 5 GCN2-KO 0h 18.9
#> 6 0610031O16Rik 6 GCN2-KO 1h 39.4
#> 7 0610031O16Rik 7 GCN2-KO 4h 13.9
#> 8 0610031O16Rik 8 GCN2-KO 8h 13.3
#> 9 0610031O16Rik 9 GCN1-KO 0h 12.3
#> 10 0610031O16Rik 10 GCN1-KO 1h 25.3
#> # … with 1,202 more rows
Start with some dummy data. This is the data after you've finished running left_join, pivot_longer, group_by, summarize.
library(tidyverse)
df <- tibble(
time = 1:5,
mean_count = 1:5,
cond = "x"
) %>%
expand_grid(mgi_symbol = c(letters, LETTERS))
Create a column group which represents what page the mgi_symbol belongs on.
plots_per_page <- 20
df <-
df %>%
mutate(group = (dense_rank(mgi_symbol) - 1) %/% plots_per_page)
Create all the plots with group_map.
plots <-
df %>%
group_by(group) %>%
group_map(function(g, ...) {
ggplot(g, aes(x = time, y = mean_count, fill=cond)) +
geom_bar(stat = "identity", position = position_dodge(width=0.9) ) +
scale_fill_manual(values=c("darkblue", "lightblue", "black")) +
facet_wrap(~mgi_symbol, scales = "free", ncol = 5) +
theme_bw()
})
Save as multiple pages using ggpubr
ggpubr::ggexport(
ggpubr::ggarrange(plotlist = plots, nrow = 1, ncol = 1),
filename = "plots.pdf"
)
I want to create a chart, using ggplot, relating the variables "var_share" (in the y-axis) and "cbo" (in the x-axis), but by three time periods: 1996-2002, 2002-2008 and 2008-2012. Also, I want to calculate the "cbo" variable, by percentile. Here is my dataset:
ano cbo ocupado quant total share var_share
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1996 20 1 32 39675 0.0807 -0.343
2 1997 20 1 52 41481 0.125 0.554
3 1998 20 1 34 40819 0.0833 -0.336
4 1999 20 1 44 41792 0.105 0.264
5 2001 20 1 57 49741 0.115 0.0884
6 1996 21 1 253 39675 0.638 -0.0326
You can download the full dataset here.
The result is almost like this:
I believe this is what you are looking for. After reading your data in, a new variable called ano2 is build and after that a new DF which contains the bins called new you have defined.
The first plot then builds on this DF and uses stat_summary.
You also said something about the quantiles. I am not sure what exactly you have meant, but I grouped over this new variable and used technique from purrr to calculate the desired quantiles.
library(tidyverse)
df <- ocupacoes
df$ano2 <- readr::parse_date(paste0('01-01-', df$ano), '%d-%m-%Y')
ddf <- df %>%
mutate(new = case_when(
lubridate::year(ano2) %in% 1996:2002 ~ '96-02',
lubridate::year(ano2) %in% 2003:2008 ~ '02-08',
lubridate::year(ano2) %in% 2009:2012 ~ '08-12'
))
ggplot(ddf,aes(x = new, y = var_share, color = new,)) +
stat_summary(fun = mean, colour = "red", size = 1) +
scale_x_discrete(limits = c('96-02', '02-08', '08-12'))
# I think you were also looking for quantiles of cbo
ddf %>%
group_by(new) %>%
group_modify(~ {
quantile(.x$cbo, probs = seq(0,1, by = .2)) %>%
tibble::enframe(name = "prob", value = "quantile")
}) %>%
ggplot(aes(x = prob, quantile, color = new, group = new)) +
geom_line() +
scale_x_discrete(limits = c('0%', '20%' ,
'40%', '60%',
'80%' , '100%'))
I've got a dataset similar to this:
x <- 100 - abs(rnorm(1e6, 0, 5))
y <- 50 + rnorm(1e6, 0, 3)
dist <- sqrt((x - 100)^2 + (y - 50)^2)
z <- exp(-(dist / 8)^2)
which can be visualised as follows:
data.frame(x, y, z) %>%
ggplot() + geom_point(aes(x, y, color = z))
What I would like to do is a stacked half-circle plot with averaged value of z in subsequent layers. I think it can be done with the combination of geom_col and coord_polar(), although the farthest I can get is
data.frame(x, y, z, dist) %>%
mutate(dist_fct = cut(dist, seq(0, max(dist), by = 5))) %>%
ggplot() + geom_bar(aes(x = 1, y = 1, fill = dist_fct), stat = 'identity', position = 'fill') +
coord_polar()
which is obviously far from the expectation (layers should be of equal size, plot should be clipped on the right half).
The problem is that I can't really use coord_polar() due to further use of annotate_custom(). So my question are:
can plot like this can be done without coord_polar()?
If not, how can it be done with coord_polar()?
The result should be similar to a graphic below, except from plotting layers constructed from points I would like to plot only layers as a whole with color defined as an average value of z inside a layer.
If you want simple radius bands, perhaps something like this would work as you pictured it in your question:
# your original sample data
x <- 100 - abs(rnorm(1e6, 0, 5))
y <- 50 + rnorm(1e6, 0, 3)
dist <- sqrt((x - 100)^2 + (y - 50)^2)
nbr_bands <- 6 # set nbr of bands to plot
# calculate width of bands
band_width <- max(dist)/(nbr_bands-1)
# dist div band_width yields an integer 0 to nbr bands
# as.factor makes it categorical, which is what you want for the plot
band = as.factor(dist %/% (band_width))
library(dplyr)
library(ggplot2)
data.frame(x, y, band) %>%
ggplot() + geom_point(aes(x, y, color = band)) + coord_fixed() +
theme_dark() # dark theme
Edit to elaborate:
As you first attempted, it would be nice to use the very handy cut() function to calculate the radius color categories.
One way to get categorical (discrete) colors, rather than continuous shading, for your plot color groups is to set your aes color= to a factor column.
To directly get a factor from cut() you may use option ordered_result=TRUE:
band <- cut(dist, nbr_bands, ordered_result=TRUE, labels=1:nbr_bands) # also use `labels=` to specify your own labels
data.frame(x, y, band) %>%
ggplot() + geom_point(aes(x, y, color = band)) + coord_fixed()
Or more simply you may use cut() without options and convert to a factor using as.factor():
band <- as.factor( cut(dist, nbr_bands, labels=FALSE) )
data.frame(x, y, band) %>%
ggplot() + geom_point(aes(x, y, color = band)) + coord_fixed()
Sounds like you may find the circle & arc plotting functions from the ggforce package useful:
# data
set.seed(1234)
df <- data.frame(x = 100 - abs(rnorm(1e6, 0, 5)),
y = 50 + rnorm(1e6, 0, 3)) %>%
mutate(dist = sqrt((x - 100)^2 + (y - 50)^2)) %>%
mutate(z = exp(-(dist / 8)^2))
# define cut-off values
cutoff.values <- seq(0, ceiling(max(df$dist)), by = 5)
df %>%
# calculate the mean z for each distance band
mutate(dist_fct = cut(dist, cutoff.values)) %>%
group_by(dist_fct) %>%
summarise(z = mean(z)) %>%
ungroup() %>%
# add the cutoff values to the dataframe for inner & outer radius
arrange(dist_fct) %>%
mutate(r0 = cutoff.values[-length(cutoff.values)],
r = cutoff.values[-1]) %>%
# add coordinates for circle centre
mutate(x = 100, y = 50) %>%
# plot
ggplot(aes(x0 = x, y0 = y,
r0 = r0, r = r,
fill = z)) +
geom_arc_bar(aes(start = 0, end = 2 * pi),
color = NA) + # hide outline
# force equal aspect ratio in order to get true circle
coord_equal(xlim = c(70, 100), expand = FALSE)
Plot generation took <1s on my machine. Yours may differ.
I'm not sure this satisfies everything, but it should be a start. To cut down on the time for plotting, I'm summarizing the data into a grid, which lets you use geom_raster. I don't entirely understand the breaks and everything you're using, so you might want to tweak some of how I divided the data for making the distinct bands. I tried out a couple ways with cut_interval and cut_width--this would be a good place to plug in different options, such as the number or width of bands.
Since you mentioned getting the average z for each band, I'm grouping by the gridded x and y and the cut dist, then using mean of z for setting bands. I threw in a step to make labels like in the example--you probably want to reverse them or adjust their positioning--but that comes from getting the number of each band's factor level.
library(tidyverse)
set.seed(555)
n <- 1e6
df <- data_frame(
x = 100 - abs(rnorm(n, 0, 5)),
y = 50 + rnorm(n, 0, 3),
dist = sqrt((x - 100)^2 + (y - 50)^2),
z = exp(-(dist / 8)^2)
) %>%
mutate(brk = cut(dist, seq(0, max(dist), by = 5), include.lowest = T))
summarized <- df %>%
filter(!is.na(brk)) %>%
mutate(x_grid = floor(x), y_grid = floor(y)) %>%
group_by(x_grid, y_grid, brk) %>%
summarise(avg_z = mean(z)) %>%
ungroup() %>%
# mutate(z_brk = cut_width(avg_z, width = 0.15)) %>%
mutate(z_brk = cut_interval(avg_z, n = 9)) %>%
mutate(brk_num = as.numeric(z_brk))
head(summarized)
#> # A tibble: 6 x 6
#> x_grid y_grid brk avg_z z_brk brk_num
#> <dbl> <dbl> <fct> <dbl> <fct> <dbl>
#> 1 75 46 (20,25] 0.0000697 [6.97e-05,0.11] 1
#> 2 75 47 (20,25] 0.000101 [6.97e-05,0.11] 1
#> 3 75 49 (20,25] 0.0000926 [6.97e-05,0.11] 1
#> 4 75 50 (20,25] 0.0000858 [6.97e-05,0.11] 1
#> 5 75 52 (20,25] 0.0000800 [6.97e-05,0.11] 1
#> 6 76 51 (20,25] 0.000209 [6.97e-05,0.11] 1
To make the labels, summarize that data to have a single row per band--I did this by taking the minimum of the gridded x, then using the average of y so they'll show up in the middle of the plot.
labels <- summarized %>%
group_by(brk_num) %>%
summarise(min_x = min(x_grid)) %>%
ungroup() %>%
mutate(y_grid = mean(summarized$y_grid))
head(labels)
#> # A tibble: 6 x 3
#> brk_num min_x y_grid
#> <dbl> <dbl> <dbl>
#> 1 1 75 49.7
#> 2 2 88 49.7
#> 3 3 90 49.7
#> 4 4 92 49.7
#> 5 5 93 49.7
#> 6 6 94 49.7
geom_raster is great for these situations where you have data in an evenly spaced grid that just needs uniform tiles at each position. At this point, the summarized data has 595 rows, instead of the original 1 million, so the time to plot shouldn't be an issue.
ggplot(summarized) +
geom_raster(aes(x = x_grid, y = y_grid, fill = z_brk)) +
geom_label(aes(x = min_x, y = y_grid, label = brk_num), data = labels, size = 3, hjust = 0.5) +
theme_void() +
theme(legend.position = "none", panel.background = element_rect(fill = "gray40")) +
coord_fixed() +
scale_fill_brewer(palette = "PuBu")
Created on 2018-11-04 by the reprex package (v0.2.1)
I need help to do a triangle heatmap in R using ggplot2, reshape2, and Hmisc, because I need to show r and P-values on the plot.
I have tried inserting cordata[lower.tri(c),] in numerous places and it hasnt helped. I have also tried using different methods but they didnt show the p value an rho, which i need! I have tried searching "Hmisc+triangle+heatmap" here and on google and have found nothing that works.
Here is the raw data, which is imported from an excel sheet:
df
# A tibble: 8 x 7
Urine Glucose Soil LB Gluconate River Colon
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 3222500 377750000 7847250 410000000 3252500 3900000 29800000
2 3667500 187000000 3937500 612000000 5250000 4057500 11075000
3 8362500 196250000 6207500 491000000 2417500 2185000 9725000
4 75700000 513000000 2909750 1415000000 3990000 3405000 NA
5 4485000 141250000 7241000 658750000 3742500 3470000 6695000
6 1947500 235000000 3277500 528500000 7045000 1897500 25475000
7 4130000 202500000 111475 442750000 6142500 4590000 4590000
8 1957500 446250000 8250000 233250000 5832500 5320000 5320000
code:
library(readxl)
data1 <- read_excel("./pca-mean-data.xlsx", sheet = 1)
df <- data1[c(2,3,4,5,6,7,8,9,10,11)]
library(ggplot2)
library(reshape2)
library(Hmisc)
library(stats)
library(RColorBrewer)
abbreviateSTR <- function(value, prefix){ # format string more concisely
lst = c()
for (item in value) {
if (is.nan(item) || is.na(item)) { # if item is NaN return empty string
lst <- c(lst, '')
next
}
item <- round(item, 2) # round to two digits
if (item == 0) { # if rounding results in 0 clarify
item = '<.01'
}
item <- as.character(item)
item <- sub("(^[0])+", "", item) # remove leading 0: 0.05 -> .05
item <- sub("(^-[0])+", "-", item) # remove leading -0: -0.05 -> -.05
lst <- c(lst, paste(prefix, item, sep = ""))
}
return(lst)
}
d <- df
cormatrix = rcorr(as.matrix(d), type='pearson')
cordata = melt(cormatrix$r)
cordata$labelr = abbreviateSTR(melt(cormatrix$r)$value, 'r')
cordata$labelP = abbreviateSTR(melt(cormatrix$P)$value, 'P')
cordata$label = paste(cordata$labelr, "\n",
cordata$labelP, sep = "")
hm.palette <- colorRampPalette(rev(brewer.pal(11, 'Spectral')), space='Lab')
txtsize <- par('din')[2] / 2
pdf(paste("heatmap-MEANDATA-pearson.pdf",sep=""))
ggplot(cordata, aes(x=Var1, y=Var2, fill=value)) + geom_tile() +
theme(axis.text.x = element_text(angle=90, hjust=TRUE)) +
xlab("") + ylab("") +
geom_text(label=cordata$label, size=txtsize) +
scale_fill_gradient(colours = hm.palette(100))
dev.off()
I have attached an example figure of what I have, I just need to cut in half! Please help if you can, I really appreciate it!
Here's a way that uses some dplyr functions for reshaping and filtering the data. After making the correlation matrix, I'm melting both df_cor$r and df_cor$P and joining them, making it a little more concise (and safer) to bring these data frames together, then make the labels.
Then I give each row a pair ID, which is a sorted version of the combination of Var1 and Var2 pasted together. Because I sort it, the rows for (Urine, Soil) and (Soil, Urine) will have the same ID without regard for which is Var1 and which is Var2. Then, grouping by this ID, I take distinct observations, using the ID as the only criteria for picking duplicates. The head of that long-shaped data is below.
library(tidyverse)
library(Hmisc)
library(reshape2)
# ... function & df definitions removed
df_cor <- rcorr(as.matrix(df), type = "pearson")
df_long <- inner_join(
melt(df_cor$r, value.name = "r"),
melt(df_cor$P, value.name = "p"),
by = c("Var1", "Var2")
) %>%
mutate(r_lab = abbreviateSTR(r, "r"), p_lab = abbreviateSTR(p, "P")) %>%
mutate(label = paste(r_lab, p_lab, sep = "\n")) %>%
rowwise() %>%
mutate(pair = sort(c(Var1, Var2)) %>% paste(collapse = ",")) %>%
group_by(pair) %>%
distinct(pair, .keep_all = T)
head(df_long)
#> # A tibble: 6 x 8
#> # Groups: pair [6]
#> Var1 Var2 r p r_lab p_lab label pair
#> <fct> <fct> <dbl> <dbl> <chr> <chr> <chr> <chr>
#> 1 Urine Urine 1 NA r1 "" "r1\n" 1,1
#> 2 Glucose Urine 0.627 0.0963 r.63 P.1 "r.63\nP.1" 1,2
#> 3 Soil Urine -0.288 0.489 r-.29 P.49 "r-.29\nP.49" 1,3
#> 4 LB Urine 0.936 0.000634 r.94 P<.01 "r.94\nP<.01" 1,4
#> 5 Gluconate Urine -0.239 0.569 r-.24 P.57 "r-.24\nP.57" 1,5
#> 6 River Urine -0.102 0.811 r-.1 P.81 "r-.1\nP.81" 1,6
Plotting is then straightforward. I used the minimal theme so it won't show that the upper half of the matrix is blank, and turned off the grid since it doesn't have much meaning here.
ggplot(df_long, aes(x = Var1, y = Var2, fill = r)) +
geom_raster() +
geom_text(aes(label = label)) +
scale_fill_distiller(palette = "Spectral") +
theme_minimal() +
theme(panel.grid = element_blank())
Created on 2018-08-05 by the reprex package (v0.2.0).
I'm sure there is a much more dynamic way of doing this, but I just hard coded the stuff you didn't want.
cordata %>%
arrange(Var1) %>%
mutate_at(vars(value, label), funs(
ifelse(row_number() > 1 & Var2 == "Urine" |
row_number() > 9 & Var2 == "Glucose"|
row_number() > 17 & Var2 == "Soil" |
row_number() > 25 & Var2 == "LB" |
row_number() > 33 & Var2 == "Gluconate" |
row_number() > 41 & Var2 == "River", NA, .))) %>%
ggplot(aes(x=Var1, y=Var2, fill=value)) +
geom_tile()+
theme(axis.text.x = element_text(angle=90, hjust=TRUE))+
xlab("") +
ylab("") +
geom_text(aes(label=label), size=txtsize)
For some reason I could not get your color scheme to work on my computer. I'll also think about it some more and see if I can make this more dynamic.
EDIT:
I had another idea and this works way better. I'll keep the old one up for reference.
cordata %>%
arrange(Var1) %>%
group_by(Var1) %>%
filter(row_number() >= which(Var1 == Var2)) %>%
ggplot(aes(x=Var1, y=Var2, fill=value)) +
geom_tile() +
theme(axis.text.x = element_text(angle=90, hjust=TRUE)) +
xlab("") +
ylab("") +
geom_text(aes(label=label), size=txtsize)
What I'm doing here is saying that I want to filter out all the data that is below the position where Var1 = Var2 by group. This essentially deletes the lower half of the map, whereas the first approach only changes specific variable rows to NA.