How to create 'clustered dotplots' for categorical data? - r

I wish to create a graphic like this one from the software Fathom.
I have a two-way table of categorical frequency data and want to plot it as something like a fluctuation plot, with the key difference that the individual data points are visible.
I've tried ggfluctuation(...), levelplot(...) and various packages (such as ggplot2), but with no success, and I can find nothing on any forums to help.
I'd be very grateful if someone could point me to, or write, some code that achieves this.

Here is an improved version.
sample_data = structure(list(set = structure(c(1L, 2L, 3L, 4L, 1L, 2L, 3L,
4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L), class = "factor", .Label = c("09t0101 TJ",
"09t0102 MW", "09t0201 EH", "09t0202 NH")), grade = structure(c(1L,
1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L), .Label = c("1",
"2", "3", "4"), class = "factor"), freq = c(7L, 8L, 2L, 3L, 11L,
4L, 11L, 3L, 3L, 8L, 3L, 8L, 3L, 9L, 3L, 2L)), .Names = c("set",
"grade", "freq"), row.names = c(NA, -16L), class = "data.frame")
group = unique(sample_data$set) #Obtain the unique 'set' values for y-axis
max_x = length(unique(sample_data$grade)) #Obtain the maximum number of 'grades' to plot on x-axis
max_y = length(group) #Obtain the maximum number of 'set' to plot on y-axis
pdf(file="plot.pdf",width=8,height=6)
par(mar = c(5, 10, 4, 2)) #c(bottom, left, top, right)
plot(max_x,max_y,xlim=c(0.5,max_x+0.5),ylim=c(0.5,max_y +0.5),pch=NA,xlab="Grades",ylab=NA,xaxt="n",yaxt="n",asp=1) #asp = 1 IMPORTANT
axis(side = 2, at=c(1:length(group)), labels=c(as.vector(group)),las=2)
axis(side = 1, at=c(1:length(unique(sample_data$grade))), labels=c(as.vector(unique(sample_data$grade))))
r = 0.15 #Spacing between neighbouring circle centres; each circle is drawn with radius r/2.25
for (i in 1:length(group)){
a = subset(sample_data,sample_data$set==group[i]) #Subset the rows belonging to the i-th 'set'
for (j in 1:nrow(a)){
matrix_sz = ceiling(sqrt(a$freq[j])) #Determine the size of the square matrix that can accommodate all the points for this frequency
matrix_x = matrix(nrow = matrix_sz, ncol = matrix_sz) #Initiate matrix
matrix_y = matrix(nrow = matrix_sz, ncol = matrix_sz) #Initiate matrix
matrix_x[,1] = -1*((matrix_sz/2) - 0.5) #Find the relative x co-ordinates for the first column
matrix_y[1,] = 1*((matrix_sz/2) - 0.5) #Find the relative y co-ordinates for the first row
# Find out other relative co-ordinates if the size of square matrix is more than 1x1
if (matrix_sz > 1){
for (column in 2:matrix_sz){
matrix_x[,column] = matrix_x[,column - 1] + 1
}
for (row in 2:matrix_sz){
matrix_y[row,] = matrix_y[row-1,] - 1
}
}
#Determine the co-ordinate of the center of the square matrix grid
xx = as.integer(a$grade[j])
yy = i
fq = 1 #To keep track of the corresponding 'freq'
# Plot circles around the center based on relative co-ordinates
for (row in 1:matrix_sz){
for (column in 1:matrix_sz){
if (fq > a$freq[j]){break} #Break if the necessary number of points have been plotted
xx1 = xx + r * matrix_x[row, column]
yy1 = yy + r * matrix_y[row, column]
# points (x = xx1, y = yy1, pch=1)
fq = fq + 1
symbols (x = xx1, y = yy1, circles=c(r/2.25),add =TRUE,inches=FALSE,bg = "gray")
}
}
}
}
dev.off()
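The grid layout used inside the loops can also be computed without the two helper matrices. The following is a minimal sketch, not part of the original answer: grid_offsets() is a hypothetical helper that returns the same row-by-row relative coordinates for n points, assuming n is at least 1.

```r
# Hypothetical helper (not in the original answer): relative grid
# coordinates for n points, filled row by row from the top-left corner
# of the smallest square grid that can hold them.
grid_offsets <- function(n) {
  stopifnot(n >= 1)
  sz <- ceiling(sqrt(n))        # side length of the square grid
  first <- -(sz / 2 - 0.5)      # x offset of the first column
  idx <- seq_len(n) - 1         # 0-based position of each point
  data.frame(
    x = first + idx %% sz,      # column index shifts x to the right
    y = -first - idx %/% sz     # row index shifts y downwards
  )
}

off <- grid_offsets(7)          # 7 points fill a 3 x 3 grid row by row
print(off)
```

Multiplying these offsets by r and adding the cell centre (xx, yy) gives the same circle positions as the nested loops.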

Related

gganimate transition_reveal() with geom_line() breaking on the final frame?

I am trying to animate a line graph with multiple lines. There seems to be a bug in gganimate's transition_reveal() that causes the final frame to revert for all of the lines but one. The problem does not occur when plotting without gganimate. Here is the code:
library(ggplot2)
library(gganimate)
df <- read.csv("test.csv", stringsAsFactors = TRUE)
anim <- ggplot(df, aes(Day, Accidents, group = State, color = State)) +
geom_line() +
transition_reveal(Day) +
ease_aes('cubic-in-out')
jiff <- animate(anim, fps = 24, duration = 5, start_pause = 0, end_pause = 72, height = 4, width = 7, units = "in", res = 150)
jiff
Here is the dput of the dataframe:
structure(list(State = structure(c(1L, 2L, 3L, 4L, 1L, 2L, 3L,
4L, 1L, 2L, 3L, 4L), levels = c("A", "B", "C", "D"), class = "factor"),
Day = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L),
Accidents = c(5L, 2L, 5L, 6L, 1L, 2L, 6L, 8L, 4L, 10L, 2L,
4L)), class = "data.frame", row.names = c(NA, -12L))
Here is the output:
Regardless of the ending pause or how many values I have along the x-axis, the final frame will always look like this with only one line appearing as updated. Does anyone know why this might be happening?
UPDATE: Reverting the gganimate package from 1.0.8 to 1.0.7 did seem to do the trick after all.
The issue is in the arguments start_pause = 0, end_pause = 72. Remove or adapt them:
anim <- ggplot(df, aes(Day, Accidents, group= State, color = State)) +
geom_line() +
transition_reveal(Day) +
ease_aes('cubic-in-out')
animate(anim, fps = 24, duration = 5,
height = 4, width = 7, units = "in", res = 150)

How to add custom label to grouped frequency stat_bin plot without zeros?

I have a frequency histogram with 42 groups, such that each box represents an individual observation/row. I need to label each 'cell' with its raw x value (i.e., estimate). However, ggplot2 seems to add a large number of superfluous labels at the base and top of every cell (see below).
I am assuming ggplot2 misbehaves when ..count.. == 0. Indeed, adding label=ifelse(..count.. == 0, "", ..x..) correctly plots the ..x.. variable, but this ..x.. is not the raw estimate. See:
The code to generate this is here:
library(ggplot2)
mydata = structure(list(estimate = c(cor = 0.325795456913319, cor = 0.562197877060912,
cor = 0.440719760612754, cor = -0.0936850084700603, cor = 0.0360156238340214,
cor = 0.290449045144756, cor = 0.351442182968952, cor = 0.282652330413659,
cor = 0.484382008605981, cor = 0.555190439953125, cor = 0.153963602626727,
cor = 0.389799442186418, cor = 0.102658050525012, cor = 0.539213427685732,
cor = 0.599952880067505, cor = 0.353135730646411, cor = 0.5459587711875,
cor = 0.380085983041004, cor = 0.494013540678857, cor = 0.506029397264374,
cor = 0.796184962852028, cor = 0.152349436981737, cor = 0.474356676277947,
cor = 0.585975728042781, cor = 0.278773851537417, cor = 0.380637414940095,
cor = 0.392275909026939, cor = 0.419554193309306, cor = 0.488358015824324,
cor = 0.199407247922171, cor = 0.260254145583898, cor = 0.349291291301302,
cor = 0.464177992152635, cor = 0.0747318120424813, cor = 0.60432048579698,
cor = 0.295662258461811, cor = 0.0278690641141737, cor = -0.0337558821556421,
cor = 0.211670641689536, cor = 0.285200869849266, cor = 0.51828476555577,
cor = 0.44882613302634), groupid = 1:42,
magnitiude = structure(c(4L, 5L, 4L, 1L, 2L, 3L, 4L, 3L, 4L, 5L, 3L, 4L, 3L, 5L, 5L, 4L, 5L,
4L, 4L, 5L, 5L, 3L, 4L, 5L, 3L, 4L, 4L, 4L, 4L, 3L, 3L, 4L, 4L,
2L, 5L, 3L, 2L, 1L, 3L, 3L, 5L, 4L),
.Label = c("Negative", "Negligible", "Small", "Medium", "Large"), class = "factor")),
row.names = c(NA, -42L), class = c("tbl_df", "tbl", "data.frame"))
ggplot(data = mydata, aes(estimate)) +
stat_bin(aes(fill = magnitiude, group = groupid, label=estimate), color = "#424242", binwidth = 0.05) +
stat_bin(binwidth=0.05, geom="text", aes(label=round(estimate,2), group = groupid), position=position_stack(vjust=0.5))
Can anyone help me generate the raw estimates in each grouped cell?
This is a reasonable use case for the stage() function. It allows you to set up an aesthetic that you can modify later in the plotting process.
library(ggplot2)
ggplot(data = mydata, aes(estimate)) +
stat_bin(aes(fill = magnitiude,
group = groupid),
color = "#424242", binwidth = 0.05) +
stat_bin(
binwidth=0.05, geom="text",
aes(label = stage(mydata$estimate,
after_stat = ifelse(count > 0, round(label, 2), "")),
group = groupid),
position=position_stack(vjust=0.5)
)
#> Warning: Use of `mydata$estimate` is discouraged. Use `estimate` instead.
For reasons I don't understand, it reported that it couldn't find the estimate column unless I prefixed it with mydata$ in the staging, whereas according to the documentation it should be able to find the estimate column.

How to prevent R from alphabetically ranking data in ggplot and specify the order in which data is plotted (Data + Code + Graphs provided)?

I'm trying to fix an issue with my ggballoonplot graph regarding how R orders the axis labels.
By default, R plots the data with the labels ranked in reverse alphabetical order, but to reveal the pattern in the data, they need to be plotted in a specific order. The only way I've been able to trick the software is by manually adding a prefix to each label in my .csv table so that R ranks them properly in the output. This is time-consuming, since I need to manually order the data before adding the prefix and then plotting.
I would like to supply a character vector (or something similar) that specifies the order in which the data should be plotted, revealing the pattern without the need for a prefix in the label names.
I have made some attempts with scale_y_discrete without success. I would also like to do the same for the x axis, since I've had to use the same trick to display the columns in the proper non-alphabetical order, which offsets the position of the labels. Any idea how to get ggplot2 to display my values as in the graph below without having to trick the software?
Data + Code
#Assign data to "Stack_Overflow_DummyData"
Stack_Overflow_DummyData <- structure(list(Species = structure(c(8L, 3L, 1L, 5L, 6L, 2L,
7L, 4L, 8L, 3L, 1L, 5L, 6L, 2L, 7L, 4L, 8L, 3L, 1L, 5L, 6L, 2L,
7L, 4L, 8L, 3L, 1L, 5L, 6L, 2L, 7L, 4L), .Label = c("Ani", "Cal",
"Can", "Cau", "Fis", "Ort", "Sem", "Zan"), class = "factor"),
Species_prefix = structure(c(8L, 7L, 6L, 5L, 4L, 3L, 2L,
1L, 8L, 7L, 6L, 5L, 4L, 3L, 2L, 1L, 8L, 7L, 6L, 5L, 4L, 3L,
2L, 1L, 8L, 7L, 6L, 5L, 4L, 3L, 2L, 1L), .Label = c("ac.Cau",
"ad.Sem", "af.Cal", "ag.Ort", "as.Fis", "at.Ani", "be.Can",
"bf.Zan"), class = "factor"), Dist = structure(c(2L, 3L,
5L, 2L, 1L, 1L, 4L, 5L, 2L, 3L, 5L, 2L, 1L, 1L, 4L, 5L, 2L,
3L, 5L, 2L, 1L, 1L, 4L, 5L, 2L, 3L, 5L, 2L, 1L, 1L, 4L, 5L
), .Label = c("End", "Ind", "Pan", "Per", "Wid"), class = "factor"),
Region = structure(c(3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 4L,
4L, 4L, 4L, 4L, 4L, 4L, 4L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("Cen", "Col",
"Far", "Nor"), class = "factor"), Region_prefix = structure(c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L, 4L,
4L), .Label = c("a.Far", "b.Nor", "c.Cen", "d.Col"), class = "factor"),
Frequency = c(75, 50, 25, 50, 0, 0, 0, 0, 11.1, 22.2, 55.6,
55.6, 11.1, 0, 5.6, 0, 0, 2.7, 36.9, 27.9, 65.8, 54.1, 37.8,
28.8, 0, 0, 0, 3.1, 34.4, 21.9, 78.1, 81.3)), class = "data.frame", row.names = c(NA,
-32L))
# Plot Data With Prefix Trick
library(ggplot2)
library(ggpubr)
# make color base on Dist, size and alpha dependent on Frequency
ggballoonplot(Stack_Overflow_DummyData, x = "Region_prefix", y = "Species_prefix",
size = "Frequency", size.range = c(1, 9), fill = "Dist") +
theme_set(theme_gray() +
theme(legend.key=element_blank())) +
# Sets Grey Theme and removes grey background from legend panel
theme(axis.title = element_blank()) +
# Removes X axis title (Region)
geom_text(aes(label=Frequency), alpha=1.0, size=3, nudge_x = 0.4)
# Add Frequency Values Next to the circles
# Plot Data Without Prefix Trick
library(ggplot2)
library(ggpubr)
# make color base on Dist, size and alpha dependent on Frequency
ggballoonplot(Stack_Overflow_DummyData, x = "Region", y = "Species",
size = "Frequency", size.range = c(1, 9), fill = "Dist") +
theme_set(theme_gray() +
theme(legend.key=element_blank())) +
# Sets Grey Theme and removes grey background from legend panel
theme(axis.title = element_blank()) +
# Removes X axis title (Region)
geom_text(aes(label=Frequency), alpha=1.0, size=3, nudge_x = 0.4)
# Add Frequency Values Next to the circles
Here below are the graphs
Good Graph.
Using the label prefix trick with the visible pattern in the data:
Wrong Graph (R default).
Without the prefix trick when GGplot automatically orders the data/labels and the graph makes no sense:
To sum up, I would like the Good Graph output without having to add a prefix to my labels beforehand.
Many thanks in advance for your help.
For the axis labels, I would first define a function that overrides the labels shown for the breaks:
shlab <- function(lbl_brk){
sub("^[a-z]+\\.","",lbl_brk) # strips a leading lowercase prefix such as "a." or "ab."
}
Then, to change the labels, use scale_x_discrete and scale_y_discrete with labels = shlab (the help page for scale_x_discrete lists, as one of the options for labels, "a function that takes the breaks as input and returns labels as output").
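As a quick check, the label function can be tried on its own, using a few of the prefixed labels from the question's data:

```r
# Same function as above: strip a leading lowercase prefix ending in a dot.
shlab <- function(lbl_brk) {
  sub("^[a-z]+\\.", "", lbl_brk)
}

stripped <- shlab(c("ac.Cau", "bf.Zan", "a.Far", "d.Col"))
print(stripped)
```

Labels without such a prefix are returned unchanged, because the pattern only matches lowercase letters followed by a dot at the start of the string.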
For the colours, it is enough to change the values in scale_fill_manual; for the legend sizes, use guides, like so:
library(ggplot2)
library(ggpubr)
shlab <- function(lbl_brk){
sub("^[a-z]+\\.","",lbl_brk)
}
ggballoonplot(Stack_Overflow_DummyData, x = "Region_prefix", y = "Species_prefix", size = "Frequency", size.range = c(1, 9), fill = "Dist") +
scale_x_discrete(labels = shlab) +
scale_y_discrete(labels = shlab) +
scale_fill_manual(values = c("green", "blue", "red", "black", "white")) +
guides(fill = guide_legend(override.aes = list(size=8))) +
theme_set(theme_gray() + theme(legend.key=element_blank())) + # Sets Grey Theme and removes grey background from legend panel
theme(axis.title = element_blank()) + # Removes X axis title (Region)
geom_text(aes(label=Frequency), alpha=1.0, size=3, nudge_x = 0.4) # Add Frequency Values Next to the circles
UPDATE:
With the new dataset, the order can be given directly as character vectors via limits:
library(ggplot2)
library(ggpubr)
# make color base on Dist, size and alpha dependent on Frequency
ggballoonplot(Stack_Overflow_DummyData, x = "Region", y = "Species",
size = "Frequency", size.range = c(1, 9), fill = "Dist") +
scale_y_discrete(limits = c("Cau", "Sem", "Cal", "Ort", "Fis", "Ani", "Can", "Zan")) +
scale_x_discrete(limits = c("Far", "Nor", "Cen", "Col")) +
theme_set(theme_gray() +
theme(legend.key=element_blank())) +
# Sets Grey Theme and removes grey background from legend panel
theme(axis.title = element_blank()) +
# Removes X axis title (Region)
geom_text(aes(label=Frequency), alpha=1.0, size=3, nudge_x = 0.4)
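An alternative that is sometimes used instead of setting limits on the scales is to fix the order in the data itself by giving the factor explicit levels. The sketch below only shows the base R part; it assumes (not verified here for ggballoonplot) that the plotting function respects factor level order, as ggplot2's discrete scales do by default. The vector species_order is the same order used in the limits above.

```r
# Desired plotting order (same vector as in scale_y_discrete(limits = ...)).
species_order <- c("Cau", "Sem", "Cal", "Ort", "Fis", "Ani", "Can", "Zan")

species <- c("Zan", "Can", "Ani", "Fis", "Ort", "Cal", "Sem", "Cau")

# Default: levels are sorted alphabetically.
default_levels <- levels(factor(species))

# Explicit levels: the order is taken from species_order instead.
ordered_levels <- levels(factor(species, levels = species_order))

print(default_levels)
print(ordered_levels)
```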

Convert two ggplots into one

I am having trouble producing one plot instead of two from separate data frames. The situation is explained below. The data frames look like this:
df1 <- structure(list(value = c(9921L, 21583L, 11822L, 1054L, 13832L,
16238L, 13838L, 20801L, 20204L, 13881L, 19935L, 13829L, 14012L,
20654L, 13862L, 21191L, 3777L, 15552L, 13817L, 20428L, 16850L,
21003L, 11072L, 22477L, 12321L, 12856L, 16295L, 11431L, 13469L,
14680L, 10552L, 15272L, 9132L, 9374L, 15123L, 22754L, 10363L,
12160L, 13729L, 11151L, 11451L, 11272L, 14900L, 14688L, 17133L,
7315L, 7268L, 6262L, 72769L, 7650L, 16389L, 13027L, 7134L, 6465L,
6490L, 15183L, 7201L, 14070L, 11210L, 10146L), limit = structure(c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L,
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L), .Label = c("1Mbit",
"5Mbit", "10Mbit"), class = "factor")), class = "data.frame", row.names = c(NA,
-60L))
df2 <- structure(list(value = c(37262L, 39881L, 30914L, 32976L, 28657L,
39364L, 39915L, 30115L, 29326L, 36199L, 37976L, 36694L, 33718L,
36945L, 33182L, 35866L, 34188L, 33426L, 32804L, 34986L, 29355L,
30470L, 37420L, 26465L, 28975L, 29144L, 27491L, 30507L, 27146L,
26257L, 31231L, 30521L, 30370L, 31683L, 33774L, 35654L, 34172L,
38554L, 38030L, 33439L, 34817L, 31278L, 33579L, 31175L, 31001L,
29908L, 31658L, 33381L, 28709L, 34794L, 34154L, 30157L, 33362L,
30363L, 31097L, 29116L, 27703L, 31229L, 30196L, 30077L), limit = structure(c(3L,
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L,
3L, 3L, 3L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("180ms",
"190ms", "200ms"), class = "factor")), class = "data.frame", row.names = c(NA,
-60L))
From the data frames above, I have these plots:
limit_bw <- factor(df1$limit, levels = c("1Mbit", "5Mbit", "10Mbit"))
limit_lt <- factor(df2$limit, levels = c("200ms", "190ms", "180ms"))
(so that the levels are used in the intended sequence)
bw_line <- ggplot(df1, aes(x = limit_bw, y = value, group=1)) + geom_quantile(method = "loess")
lt_line <- ggplot(df2, aes(x = limit_lt, y = value, group=1)) + geom_quantile(method = "loess")
(I actually have a lot of data, so I used geom_quantile().)
I also have the two plots in a grid using rbind/cbind (which is not what I want now):
grid.draw(rbind(ggplotGrob(ggplot(df1, aes(limit_bw,value,group=1)) + geom_quantile(method = "loess") + labs(title = "value vs bw",x="bandwidth",y="value")),
ggplotGrob(ggplot(df2, aes(limit_lt, value, group = 1)) + geom_quantile(method="loess") + labs(title="value vs latency", x="latency", y="value")), size = "last"))
I am seeking help to merge them into one plot (putting bw_line and lt_line in the same graph), showing two x-axes, either one at the top and one at the bottom or both at the bottom with their titles. Please note that value has a different range in each data set, so I need either two y-axes with separate ranges for each data frame, or one y-axis covering all values (min to max) from both data frames.
I have seen a very close solution here from @RichieCotton, but could not adapt it to my data, since I have factors instead of integer values.
I would really appreciate your help. Thank you.
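I think it's probably easiest to approach this by combining the data into one data frame first. Here I create combined x values and map your data to those. Then we plot as usual, with the addition of a secondary y axis. Because the two value ranges differ, the lt values are multiplied by 0.44 for plotting, and the secondary axis divides by the same factor so that its labels show the original values.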
library(tidyverse); library(forcats)
# Create shared x axis and combine data frames
limit_combo <- data.frame(level_num = 1:3,
level = as_factor(c("1Mbit\n200ms",
"5Mbit\n190ms",
"10Mbit\n180ms")))
df1b <- df1 %>%
mutate(level_num = limit %>% as.numeric) %>%
left_join(limit_combo)
df2b <- df2 %>%
mutate(level_num = 4 - (limit %>% as.numeric)) %>%
left_join(limit_combo)
df3 <- bind_rows(df1b, df2b, .id = "plot") %>%
mutate(plot = if_else(plot == "1", "bw", "lt"))
# plot with adjusted y values and second axis for reference
ggplot(df3, aes(x = level,
y = value * if_else(plot == "lt", 0.44, 1),
group=plot, color = plot)) +
geom_quantile(method = "loess") +
scale_y_continuous("value", sec.axis = sec_axis(~./0.44)) +
theme(axis.text.y.left = element_text(color = "#F8766D"),
axis.text.y.right = element_text(color = "#00BFC4"))
Here is a different approach that creates a single plot from the two datasets without combining them into one or dealing with the factor levels of limit. df1, df2, limit_bw, and limit_lt are used as given by the OP.
The plot is refined in three steps.
1. Common x axis, common y scale
library(ggplot2)
ggplot() + aes(y = value) +
geom_quantile(aes(x = as.integer(limit_bw), colour = "bw"), df1, method = "loess") +
geom_quantile(aes(x = as.integer(limit_lt), colour = "lt"), df2, method = "loess") +
scale_x_continuous("limit",
breaks = 1:nlevels(limit_bw),
labels = paste(levels(limit_bw), levels(limit_lt), sep = "\n")) +
scale_colour_discrete(NULL)
2. Separate x axes, common y scale
library(ggplot2)
ggplot() + aes(y = value) +
geom_quantile(aes(x = as.integer(limit_bw), colour = "bw"), df1, method = "loess") +
geom_quantile(aes(x = as.integer(limit_lt), colour = "lt"), df2, method = "loess") +
scale_x_continuous("limit",
breaks = 1:nlevels(limit_bw),
labels = levels(limit_bw),
sec.axis = dup_axis(labels = levels(limit_lt))) +
scale_colour_manual(NULL, values = c(bw = "blue", lt = "red")) +
theme(axis.text.x.bottom = element_text(color = "blue"),
axis.text.x.top = element_text(color = "red"))
3. Separate x axes, separate y axes
Here, the y-values of the second dataset are rescaled such that the minimum and maximum of the loess-fitted values of the two datasets coincide.
# compute scaling factor and offset
library(magrittr) # used to improve readability
bw_rng <- loess(df1$value ~ as.integer(limit_bw)) %>% fitted() %>% range()
lt_rng <- loess(df2$value ~ as.integer(limit_lt)) %>% fitted() %>% range()
scl <- diff(bw_rng) / diff(lt_rng)
ofs <- bw_rng[1] - scl * lt_rng[1]
library(ggplot2)
ggplot() +
geom_quantile(aes(x = as.integer(limit_bw), y = value, colour = "bw"),
df1, method = "loess") +
geom_quantile(aes(x = as.integer(limit_lt), y = scl * value + ofs, colour = "lt"),
df2, method = "loess") +
scale_x_continuous("limit",
breaks = 1:nlevels(limit_bw),
labels = levels(limit_bw),
sec.axis = dup_axis(labels = levels(limit_lt))) +
scale_y_continuous(sec.axis = sec_axis(~ (. - ofs) / scl)) +
scale_colour_manual(NULL, values = c(bw = "blue", lt = "red")) +
theme(axis.text.x.bottom = element_text(color = "blue"),
axis.text.x.top = element_text(color = "red"),
axis.text.y.left = element_text(color = "blue"),
axis.text.y.right = element_text(color = "red"))
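The scaling in step 3 is a linear map, scl * value + ofs, and the secondary axis applies its inverse. A small numeric sketch with made-up ranges (the numbers below are illustrative, not taken from the data) shows that the endpoints coincide and that the inverse recovers the original values:

```r
# Made-up ranges for illustration only (not the fitted ranges of df1/df2).
bw_rng <- c(10, 20)     # range of the first (reference) series
lt_rng <- c(100, 300)   # range of the second series to be rescaled

scl <- diff(bw_rng) / diff(lt_rng)    # scaling factor
ofs <- bw_rng[1] - scl * lt_rng[1]    # offset

scaled <- scl * lt_rng + ofs          # positions drawn on the primary axis
restored <- (scaled - ofs) / scl      # values the secondary axis labels show

print(scaled)
print(restored)
```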

visualize associations between two groups of data

Each data point has a pairing of A and B, and there are multiple values in A and multiple values in B, i.e. multiple syndromes and multiple diagnoses, although each data point has one single syndrome-diagnosis pair.
Examples, suggestions, or ideas are much appreciated.
Here is what the data looks like. I want to see the connections between values of A and B (how many GGs are linked to TTs, etc.). Both are nominal data types.
ID,A,B
1,GG,TT
2,AA,SS
3,BB,XX
4,DD,SS
5,DD,TT
6,CC,XX
7,HH,ZZ
8,AA,TT
9,CC,RR
10,DD,ZZ
11,AA,XX
12,AA,TT
13,DD,SS
14,DD,XX
15,AA,YY
16,CC,ZZ
17,FF,SS
18,FF,XX
19,BB,VV
20,GG,VV
21,GG,SS
22,AA,RR
23,AA,TT
24,AA,SS
25,CC,VV
26,CC,TT
27,FF,RR
28,GG,UU
29,CC,TT
30,BB,ZZ
31,II,TT
32,FF,RR
33,BB,SS
34,GG,YY
35,FF,RR
36,BB,VV
37,II,RR
38,CC,YY
39,FF,VV
40,AA,XX
41,AA,ZZ
42,GG,VV
43,BB,UU
44,II,UU
45,II,SS
46,DD,SS
47,AA,UU
48,BB,VV
49,GG,TT
50,BB,TT
Since your data is bipartite, I would suggest plotting points in the first factor on one side, points in the other factor on the other, with lines between them, like so:
The code I used to generate this was:
## Make up data.
data <- data.frame(X1=sample(state.region, 10),
X2=sample(state.region, 10))
## Set up plot window.
plot(0, xlim=c(0,1), ylim=c(0,1),
type="n", axes=FALSE, xlab="", ylab="")
factor.to.int <- function(f) {
(as.integer(f) - 1) / (length(levels(f)) - 1)
}
segments(factor.to.int(data$X1), 0, factor.to.int(data$X2), 1,
col=data$X1)
axis(1, at = seq(0, 1, by = 1 / (length(levels(data$X1)) - 1)),
labels = levels(data$X1))
axis(3, at = seq(0, 1, by = 1 / (length(levels(data$X2)) - 1)),
labels = levels(data$X2))
This is what I do. A darker colour indicates a more frequent combination of A and B.
dataset <- data.frame(A = sample(LETTERS[1:5], 200, prob = runif(5), replace = TRUE), B = sample(LETTERS[1:5], 200, prob = runif(5), replace = TRUE))
Counts <- as.data.frame(with(dataset, table(A, B)))
library(ggplot2)
ggplot(Counts, aes(x = A, y = B, fill = Freq)) + geom_tile() + scale_fill_gradient(low = "white", high = "black")
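To see what the Counts data frame looks like on real pairs, here is a small sketch using seven rows copied from the question's data (IDs 1, 2, 4, 5, 8, 12 and 13); the name pairs is only for illustration.

```r
# Seven (A, B) pairs copied from the question's data.
pairs <- data.frame(
  A = c("GG", "AA", "DD", "DD", "AA", "AA", "DD"),
  B = c("TT", "SS", "SS", "TT", "TT", "TT", "SS")
)

# Cross-tabulate, then reshape to one row per (A, B) combination.
Counts <- as.data.frame(with(pairs, table(A, B)))
print(Counts)
```

Combinations that never occur (here GG with SS) are kept with Freq equal to 0, which is why the tile plot can colour every cell.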
Or if you prefer lines
library(ggplot2)
dataset <- data.frame(A = sample(letters[1:5], 200, prob = runif(5), replace = TRUE), B = sample(letters[1:5], 200, prob = runif(5), replace = TRUE))
Counts <- as.data.frame(with(dataset, table(A, B)))
Counts$X <- 0
Counts$Xend <- 1
Counts$Y <- as.numeric(Counts$A)
Counts$Yend <- as.numeric(Counts$B)
ggplot(Counts, aes(x = X, xend = Xend, y = Y, yend = Yend, size = Freq)) +
geom_segment() + scale_x_continuous(breaks = 0:1, labels = c("A", "B")) +
scale_y_continuous(breaks = 1:5, labels = letters[1:5])
This third option adds labels to the data points using geom_text().
library(ggplot2)
dataset <- data.frame(
A = sample(letters[1:5], 200, prob = runif(5), replace = TRUE),
B = sample(LETTERS[20:26], 200, prob = runif(7), replace = TRUE)
)
Counts <- as.data.frame(with(dataset, table(A, B)))
Counts$X <- 0
Counts$Xend <- 1
Counts$Y <- as.numeric(Counts$A)
Counts$Yend <- as.numeric(Counts$B)
ggplot(Counts, aes(x = X, xend = Xend, y = Y, yend = Yend)) +
geom_segment(aes(size = Freq)) +
scale_x_continuous(breaks = 0:1, labels = c("A", "B")) +
scale_y_continuous(breaks = -1) +
geom_text(aes(x = X, y = Y, label = A), colour = "red", size = 7, hjust = 1, vjust = 1) +
geom_text(aes(x = Xend, y = Yend, label = B), colour = "red", size = 7, hjust = 0, vjust = 0)
Maybe mosaicplot:
X <- structure(list(
ID = 1:50,
A = structure(c(6L, 1L, 2L, 4L, 4L, 3L, 7L, 1L, 3L, 4L, 1L, 1L, 4L, 4L, 1L, 3L, 5L, 5L, 2L, 6L, 6L, 1L, 1L, 1L, 3L, 3L, 5L, 6L, 3L, 2L, 8L, 5L, 2L, 6L, 5L, 2L, 8L, 3L, 5L, 1L, 1L, 6L, 2L, 8L, 8L, 4L, 1L, 2L, 6L, 2L), .Label = c("AA","BB", "CC", "DD", "FF", "GG", "HH", "II"), class = "factor"),
B = structure(c(3L, 2L, 6L, 2L, 3L, 6L, 8L, 3L, 1L, 8L, 6L, 3L, 2L, 6L, 7L, 8L, 2L, 6L, 5L, 5L, 2L, 1L, 3L, 2L, 5L, 3L, 1L, 4L, 3L, 8L, 3L, 1L, 2L, 7L, 1L, 5L, 1L, 7L, 5L, 6L, 8L, 5L, 4L, 4L, 2L, 2L, 4L, 5L, 3L, 3L), .Label = c("RR", "SS", "TT", "UU", "VV", "XX", "YY", "ZZ"), class = "factor")
), .Names = c("ID", "A", "B"), class = "data.frame", row.names = c(NA, -50L)
)
mosaicplot(with(X,table(A,B)))
For your example dataset:
Thanks! I think that the connectivity between elements in each class is best visualized by the link graph examples given by both Jonathon and Thierry. Thierry's second example, which shows the magnitude, is definitely where I will start.
Update
Thanks everyone for your ideas and tips!
I came across the bipartite package, which has functions to visualize this kind of data. I think it gives a clean visualization of the relationships I am trying to show.
I did:
library(bipartite)
dataset <- data.frame(
A = sample(letters[1:5], 200, prob = runif(5), replace = TRUE),
B = sample(LETTERS[20:26], 200, prob = runif(7), replace = TRUE)
)
datamat <- as.matrix(table(dataset$A, dataset$B))
visweb(datamat, text = "interaction", textsize = .8)
giving the visweb result (I couldn't include the image as a new user).

Resources