Plotting distribution of multiple measurements in two different groups in R

I have measurements of approximately 1000 variables in 2 groups, with 10 replicates in each. In other words, I have 2 data frames, each with 10 columns and 1000 rows.
I would like to show the distribution of my measurements in the two groups in order to pick out variables that differ significantly between them. My initial idea was a large scatter plot where the x-coordinate is the variable index, the y-coordinate is the measurement, and the points are color-coded by group. It doesn't work as expected, however: I get a scatter plot matrix instead.
I tried to go with a boxplot,
ratios1 <- as.data.frame(matrix(rnorm(10000) * 100, 1000, 10))
boxplot(t(log2(ratios1)), horizontal = T)
which sort of works, but all the lines for the boxes make the plot undecipherable, even for a single group (see figure below). Then I tried to remove the boxes and add the points afterwards, as suggested here:
boxplot(t(log2(ratios1)), horizontal = T, border = "white")
points(t(log2(ratios1)), pch=1)
But that didn't quite work either, as I only got the first variable drawn on the graph.
How can I display this type of information?

First of all, columns correspond to variables and rows to observations, not the other way around.
set.seed(42)
ratios1 <- as.data.frame(matrix(rnorm(10000) * 100, 10, 1000))
You could plot quantiles like this:
library(reshape2)
ratios2 <- melt(ratios1)
library(ggplot2)
ggplot(ratios2, aes(x = as.numeric(variable), y = value)) +
stat_summary(fun.data = function(y) as.data.frame(setNames(as.list(quantile(y, probs = c(0.025, 0.5, 0.975))), c("ymin", "y", "ymax"))),
color = "blue") +
stat_summary(fun.data = function(y) as.data.frame(setNames(as.list(quantile(y, probs = c(0.25, 0.5, 0.75))), c("ymin", "y", "ymax"))),
color = "red") +
xlab("variable")
There are no groups in your data, so I don't know what to do with that. Maybe you could facet by group. However, I don't think this kind of plot would be very useful for your goal of "pick[ing] up variables that differ significantly between the groups". I would do a hypothesis test with the appropriate correction for alpha error inflation.
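A minimal sketch of the testing approach mentioned above, assuming two simulated groups with observations in rows and variables in columns. The second group, the shifted variables, Welch's t-test, and the Benjamini-Hochberg correction are all assumptions chosen for illustration; a different test or correction may suit the real data better.

```r
set.seed(42)
n_rep <- 10    # replicates per group
n_var <- 1000  # number of variables

# Simulated groups (assumption): rows are observations, columns are variables.
group1 <- as.data.frame(matrix(rnorm(n_rep * n_var) * 100, n_rep, n_var))
group2 <- as.data.frame(matrix(rnorm(n_rep * n_var) * 100, n_rep, n_var))

# Assumption for illustration: shift the first 20 variables in group 2.
group2[, 1:20] <- group2[, 1:20] + 300

# One Welch t-test per variable (mapply walks over the columns pairwise).
p_raw <- mapply(function(a, b) t.test(a, b)$p.value, group1, group2)

# Correct for multiple testing; "BH" controls the false discovery rate.
p_adj <- p.adjust(p_raw, method = "BH")

# Indices of variables that remain significant after correction.
hits <- which(p_adj < 0.05)
```

Only the variables in hits then need to be plotted, which keeps the figure readable.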

Related

ggplot2 does not plot multiple groups of a variable, only plots one line

I would like to make a plot with multiple lines corresponding to the different groups of the variable "Prob" (0.1, 0.5 and 0.9) using ggplot. However, when I run the code, it plots only one line instead of three. Thanks for the help :)
Here is my code:
Prob <- c(0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.9,0.9,0.9,0.9,0.9,0.9,0.9,0.9,0.9,0.9,0.9,0.9,0.9,0.9,0.9,0.9,0.9,0.9,0.9,0.9,0.9,0.9,0.9,0.9,0.9,0.9,0.9,0.9,0.9,0.9,0.9,0.9,0.9,0.9,0.9,0.9,0.9)
nit <- c(0.9,0.902777775,0.90555555,0.908333325,0.9111111,0.913888875,0.91666665,0.919444425,0.9222222,0.924999975,0.92777775,0.930555525,0.9333333,0.936111075,0.93888885,0.941666625,0.9444444,0.947222175,0.94999995,0.952777725,0.9555555,0.958333275,0.96111105,0.963888825,0.9666666,0.969444375,0.97222215,0.974999925,0.9777777,0.980555475,0.98333325,0.986111025,0.9888888,0.991666575,0.99444435,0.997222125,0.9999999,0.9,0.902777775,0.90555555,0.908333325,0.9111111,0.913888875,0.91666665,0.919444425,0.9222222,0.924999975,0.92777775,0.930555525,0.9333333,0.936111075,0.93888885,0.941666625,0.9444444,0.947222175,0.94999995,0.952777725,0.9555555,0.958333275,0.96111105,0.963888825,0.9666666,0.969444375,0.97222215,0.974999925,0.9777777,0.980555475,0.98333325,0.986111025,0.9888888,0.991666575,0.99444435,0.997222125,0.9999999,0.9,0.902777775,0.90555555,0.908333325,0.9111111,0.913888875,0.91666665,0.919444425,0.9222222,0.924999975,0.92777775,0.930555525,0.9333333,0.936111075,0.93888885,0.941666625,0.9444444,0.947222175,0.94999995,0.952777725,0.9555555,0.958333275,0.96111105,0.963888825,0.9666666,0.969444375,0.97222215,0.974999925,0.9777777,0.980555475,0.98333325,0.986111025,0.9888888,0.991666575,0.99444435,0.997222125,0.9999999)
greek <- log((1-Prob)/Prob)/-10
italian <- ((0.997-nit)/(0.997-0.97))^3
Temp<-c(rep(25,111))
GT <- ((30-Temp)/(30-3.3))^3
GH <- 1-GT-italian
acid <- (-1*(((sign(GH)*(abs(GH)^(1/3)))*(7-5))-7))
Species<-c(rep("Case",111))
data <- as.data.frame(cbind(Prob,greek,GT,GH,italian, Temp,acid,nit, Species))
ggplot() +
geom_line(data = data, aes_string(x = acid, y = nit, group = Prob, color = factor(Prob)), size = 0.8)

The answer has two parts:
In your data frame data, the columns that should be numeric are not numeric.
The reason why you only see one line.
Fixing the Data Frame and Using aes() in place of aes_string()
I noticed something was odd when you used as.data.frame(cbind(... to make your data frame and aes_string(... within the ggplot call. If you check data via str(data), you'll see that all of its columns are characters, whereas the vectors they were built from are numeric in the environment. For example, acid is numeric, yet data$acid is a character.
The reason is that cbind() builds a matrix, and a matrix can hold only one type. Because Species is a character vector, every column is coerced to character, so you lose the numeric nature of the data. This is also why you have to use aes_string(...) to make it work instead of aes(). To bind vectors together into a data frame, use data.frame(..., not as.data.frame(cbind(....
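The coercion is easy to reproduce on a tiny example. This is only a sketch with made-up vectors (x and label are hypothetical names, not part of the original data):

```r
x <- c(1.5, 2.5)
label <- c("a", "b")

# cbind() creates a character matrix because label is character.
bad <- as.data.frame(cbind(x, label))

# data.frame() keeps each column's own type.
good <- data.frame(x, label)

class(bad$x)   # "character" (a factor in R versions before 4.0)
class(good$x)  # "numeric"
```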
To fix all this, build the data frame as follows and use aes() in the ggplot code:
data <- data.frame(Prob,greek,GT,GH,italian, Temp,acid,nit, Species)
# data <- as.data.frame(cbind(Prob,greek,GT,GH,italian, Temp,acid,nit, Species))
ggplot() +
geom_line(data=data, aes(x = acid, y = nit, group = Prob, color = factor(Prob)), size = 0.8)
Why is there only one line?
The simple answer to why you only see one line is that the lines for all values of data$Prob are identical: neither acid nor nit depends on Prob. What you see is the effect of overplotting. The line for data$Prob == 0.1 is the same as the lines for data$Prob == 0.5 and data$Prob == 0.9.
To demonstrate this, let's separate them. I'm going to do this realizing that Prob could be created by repeating 0.1, 0.5, and 0.9 each 37 times in a row. I'll create a multiplication factor for data$nit that will separate out our lines:
my_factor <- rep(c(1,1.1,1.5), each=37) # our multiplication factor
data$nit <- data$nit * my_factor # new nit column
# same plot code
ggplot() +
geom_line(data=data, aes(x = acid, y = nit, group = Prob, color = factor(Prob)), size = 0.8)
There ya go. We have all lines there, you just could not see them due to overplotting. You can convince yourself of this without the multiplication business and the original data by comparing the plots for each data$Prob:
# use original dataset as above
ggplot() +
geom_line(data=data, aes(x = acid, y = nit, group = Prob, color = factor(Prob)), size = 0.8) +
facet_wrap(~Prob)
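The overplotting can also be checked numerically, without any plotting. The sketch below rebuilds the relevant columns under the assumption that nit is an evenly spaced sequence from 0.9 to 0.9999999 repeated three times, which approximates the values typed out in the question:

```r
# Rebuild the inputs (the seq() call is an assumption approximating the question's nit).
Prob <- rep(c(0.1, 0.5, 0.9), each = 37)
nit <- rep(seq(0.9, 0.9999999, length.out = 37), times = 3)
Temp <- rep(25, 111)

# Same formulas as in the question; note that Prob appears in none of them.
italian <- ((0.997 - nit) / (0.997 - 0.97))^3
GT <- ((30 - Temp) / (30 - 3.3))^3
GH <- 1 - GT - italian
acid <- -1 * ((sign(GH) * abs(GH)^(1/3)) * (7 - 5) - 7)

# Split the plotted coordinates by Prob and compare the groups.
acid_by_prob <- split(acid, Prob)
nit_by_prob <- split(nit, Prob)

same_x <- identical(acid_by_prob[["0.1"]], acid_by_prob[["0.5"]]) &&
  identical(acid_by_prob[["0.5"]], acid_by_prob[["0.9"]])
same_y <- identical(nit_by_prob[["0.1"]], nit_by_prob[["0.5"]]) &&
  identical(nit_by_prob[["0.5"]], nit_by_prob[["0.9"]])
```

If both same_x and same_y are TRUE, the three lines share every point and only the last one drawn is visible.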

R - Control Histogram Y-axis Limits by second-tallest peak

I've written an R script that loops through a data.frame and makes multiple complex plots, each of which includes a histogram. The problem is that the histograms often show a tall, uninformative peak at x=0 or x=1 that obscures the rest of the data, which is more informative. I have figured out that I can hide the tall peak by defining the limits of the x and y axes of each histogram, as seen in the code below, but what I really need to figure out is how to define the y-axis limits such that they are optimized for the second-largest peak in my histogram.
Here's some code that simulates my data and plots histograms with different sorts of axis limits imposed:
require(ggplot2)
set.seed(5)
df = data.frame(matrix(sample(c(1:10), 1000, replace = TRUE, prob = c(0.8,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01)), nrow=100))
cols = names(df)
for (i in c(1:length(cols))) {
my_col = cols[i]
p1 = ggplot(df, aes_string(my_col)) + geom_histogram(bins = 10)
print(p1)
p2 = p1 + ggtitle(paste("Fixed X Limits", my_col)) + scale_x_continuous(limits = c(1,10))
print(p2)
p3 = p1 + ggtitle(paste("Fixed Y Limits", my_col)) + scale_y_continuous(limits = c(0,3))
print(p3)
p4 = p1 + ggtitle(paste("Fixed X & Y Limits", my_col)) + scale_y_continuous(limits = c(0,3)) + scale_x_continuous(limits = c(1,10))
print(p4)
}
The problem is that in this data, I can hard-code y-limits and have a reasonable expectation that they will work well for all the histograms. With my real data the size of the peaks varies wildly between the numerous histograms I am producing. I've tried defining the y-limit with various equations based on descriptive numbers like the mean, median and range but nothing I've come up with works well for all cases.
If I could define the y-limit in relation to the second-tallest peak of the histogram, I would have something that was perfectly suited for each situation.

I am not sure how ggplot builds its histograms, but one method would be to grab the results from hist:
maxDensities <- sapply(df, function(i) max(hist(i, plot = FALSE)$density)) # plot = FALSE avoids drawing each histogram
# take the second highest peak:
myYlim <- rev(sort(maxDensities))[2]

I would process the data to determine the height you need.
Something along the lines of:
sort(table(cut(df$X1, breaks = 10)), decreasing = TRUE)[2]
Working from the inside out:
cut bins the data (not really needed with integer data like yours, but probably needed with real data)
table then creates a table with the count in each of those bins
sort sorts the table from highest to lowest
[2] takes the 2nd highest value
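Putting the two suggestions together, here is a rough sketch that computes the second-tallest bin count separately for each column. The helper name second_peak and the integer-centred breaks are assumptions made for this simulated data; with real data the breaks would need to match whatever binning the plot uses.

```r
set.seed(5)
df <- data.frame(matrix(sample(1:10, 1000, replace = TRUE,
                               prob = c(0.8, rep(0.01, 9))), nrow = 100))

# Hypothetical helper: height of the second-tallest bin of a histogram.
second_peak <- function(x, breaks) {
  counts <- hist(x, breaks = breaks, plot = FALSE)$counts
  sort(counts, decreasing = TRUE)[2]
}

# One bin per integer value, because the simulated data are integers 1 to 10.
int_breaks <- seq(0.5, 10.5, by = 1)

# One y-limit per column, each based on that column's own second-tallest bin.
y_lims <- sapply(df, second_peak, breaks = int_breaks)
```

Inside the plotting loop, the value for the current column could then be passed to something like coord_cartesian(ylim = c(0, y_lims[[my_col]])), which zooms the axis without dropping the tall bar from the data.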

Looking for a better way to plot data

I have data that describe several measurements taken from several individuals (each individual is represented by several measurements taken at several different time points).
I want to present the data as a scatter plot of measurements vs. individuals. Since for each individual I have several measurements, it means that I'll have a stack of points at each x-axis point.
Here's an example random code to generate these data:
set.seed(1)
n.individuals <- 10
n.measurements <- 15
vars <- runif(n.individuals, 0.1, 1)
means <- runif(n.individuals, 1, 5)
negative.idx <- sample(n.individuals, n.individuals/2)
means[negative.idx] <- -1*means[negative.idx]
df <- data.frame(measurement=c(sapply(1:n.individuals, function(x) rnorm(n.measurements, means[x], sqrt(vars[x])))),
individual=c(sapply(1:n.individuals, function(x) rep(x, n.measurements))))
Here's how I'm presenting the data so far:
#add colors
cols <- rgb(runif(n.measurements),runif(n.measurements),runif(n.measurements))
df$col <- rep(cols, n.individuals)
#simple plot
plot(df$individual, df$measurement, col=df$col, lwd=2, xlab = "individual", ylab = "measurement")
abline(h=0,lty=2)
abline(v=seq(min(df$individual)-0.5, max(df$individual)+0.5, 1),lty=2)
I'm wondering if there's a more elegant way to present the data (perhaps a ggplot way?)
Note that the signal I'm looking for in the data (and this is how I generated them) is that the measurements for each individual are correlated with respect to their sign. If they are uncorrelated with respect to their sign they should appear scattered on both sides of the y-axis.

Firstly, I would jitter your individuals so that individual measurements do not overlap. Use this code:
plot(jitter(df$individual), df$measurement, col=df$col,
lwd=2, xlab = "individual", ylab = "measurement")
There are a million ways to plot it in ggplot. Here's a quick violin graph:
p <- ggplot(df, aes(factor(individual), measurement))
p + geom_violin(aes(fill = factor(individual))) +
geom_hline((aes(yintercept = 0))) + geom_jitter( ) + xlab("Individual")
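Since the signal of interest is whether each individual's measurements share a sign, a numeric summary can complement the plot. The following is only a sketch on the simulated data from the question; the fraction of positive measurements is one possible summary, not something the original posts propose.

```r
# Simulated data, generated as in the question.
set.seed(1)
n.individuals <- 10
n.measurements <- 15
vars <- runif(n.individuals, 0.1, 1)
means <- runif(n.individuals, 1, 5)
negative.idx <- sample(n.individuals, n.individuals / 2)
means[negative.idx] <- -1 * means[negative.idx]
df <- data.frame(
  measurement = c(sapply(1:n.individuals,
                         function(x) rnorm(n.measurements, means[x], sqrt(vars[x])))),
  individual = rep(1:n.individuals, each = n.measurements)
)

# Fraction of positive measurements per individual.
# Values near 0 or 1 mean the signs agree; values near 0.5 mean they are mixed.
frac_positive <- tapply(df$measurement > 0, df$individual, mean)
```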

R ggplot - Coordinates transform (square root) with boxplot

I have a couple of box and whisker plots in R. In both, the x-axis corresponds to one categorical variable whilst the grouping colours correspond to the other.
If I draw both plots with an untransformed y-axis, they are both fine. However, if I try to square-root transform the y-axis (using: coord_trans(y = "sqrt")), one of those graphs is still fine whilst the other drops the lines corresponding to the median in most boxes (except those for which there are only two groups and where the boxes are therefore slightly wider, see "Numbers" 1 and 2 on the first plot). Further, for the graph that does not draw properly, if I reduce the number of categories on my x-axis (hence getting the boxes wider again), the median lines appear again.
Is this a bug with coord_trans (if so, how can I get around it) or a problem with my code?
Thank you very much for any suggestion.
library(car)
library(gplots)
library(plyr)
library(ggplot2)
library(gridExtra)
library(gdata)
Category=factor(c(rep(1, times =3240), rep(2, times =2160)),
labels=c("A","B"), levels=c(1,2))
ID=factor(rep(seq(from = 1, to = 45),each = 120))
Months=factor(rep(seq(from = 1, to = 3), each = 40, times = 45),
labels=c("Jan","Feb","Mar"),levels=c(1:3))
Obs=rnorm(5400, mean=25, sd=15)
Data=data.frame(Category,ID,Months,Obs)
Data=subset(Data, (Data$Category=="B") | !(Data$ID%in%c(1,2)) |
(Data$Months%in%c("Jan","Feb")))
for (j in 1:2)
{
sel=which(Data$Category==unique(levels(Data$Category))[j])
Observ=Data$Obs[sel]
Month=Data$Months[sel]
Number=droplevels(Data$ID[sel])
Number=droplevels(Number)
Data_used=data.frame(Number,Month,Observ)
plot1 = ggplot(Data_used, aes(Number, Observ)) +
geom_boxplot(aes(fill=Month, drop=FALSE), na.rm=TRUE) +
scale_y_continuous(breaks = c(0,20,40,60,80,100), limits=c(0,115)) +
coord_trans(y = "sqrt")
plot(plot1)
}

@Dennis is correct in his comment that scale_y_sqrt() will correct this. Because the median and quartiles are order statistics and the square root is a monotonic transformation, it doesn't matter whether the data are transformed before or after calculating them.
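A small base R check illustrates the point, with one caveat. This is a sketch on made-up vectors (x_odd and x_even are hypothetical): with an odd number of observations the median is an actual data value, so the two orders agree exactly; with an even number the median averages the two middle values, so a small difference can appear.

```r
x_odd <- c(1, 4, 9, 16, 25)
x_even <- c(1, 4, 9, 16)

# Odd n: the median is a data point, so transforming before or after is the same.
median(sqrt(x_odd))   # 3
sqrt(median(x_odd))   # 3

# Even n: the median is the mean of the two middle values, so the results differ slightly.
median(sqrt(x_even))  # 2.5
sqrt(median(x_even))  # about 2.55
```

The same kind of interpolation applies to quartiles, so boxes drawn with scale_y_sqrt() and coord_trans(y = "sqrt") may differ slightly in practice.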

Visualize critical values / pairwise comparisons from posthoc Tukey in R

I'm trying to get a fine-grain visualisation of critical values I got from posthoc Tukey. There are some good guidelines out there for visualizing pairwise comparisons, but I need something more refined. The idea is that I would have a plot where each small square would represent a critical value from the matrix below, coded in such manner that:
if the value is higher or equal to 5.45 - it's a black square;
if the value is lower or equal to -5.45 - it's a gray square;
if the value is between -5.45 and 5.45 - it's a white square.
The data matrix is here.
Or maybe you would have better suggestion how to visualize those critical values?
EDIT: Following comments from @Aaron and @DWin, I want to provide a bit more context for the above data and justification for my question. I am looking at the mean ratings of acceptability for seven virtual characters, each of which is animated on 5 different levels. So, I have two factors there: character (7 levels) and motion (5 levels). Because I found an interaction between those two factors, I decided to look at differences between the means for all the characters for all levels of motion, which resulted in this massive matrix as the output of posthoc Tukey. It's probably too much detail now, but please don't throw me out to Cross Validated, they will eat me alive...

This is fairly straightforward with image:
d <- as.matrix(read.table("http://dl.dropbox.com/u/2505196/postH.dat"))
image(x=1:35, y=1:35, as.matrix(d), breaks=c(min(d), -5.45, 5.45, max(d)),
col=c("grey", "white", "black"))
For just half, set half to missing with d[upper.tri(d)] <- NA and add na.rm=TRUE to the min and max functions.

Here is a ggplot2 solution. I'm sure there are simpler ways to accomplish this -- I guess I got carried away!
library(ggplot2)
library(reshape2) # provides melt()
# Load data.
postH = read.table("~/Downloads/postH.dat")
names(postH) = paste("item", 1:35, sep="") # add column names.
postH$item_id_x = paste("item", 1:35, sep="") # add id column.
# Convert data.frame to long form.
data_long = melt(postH, id.vars="item_id_x", variable.name="item_id_y")
# Convert to factor, controlling the order of the factor levels.
data_long$item_id_y = factor(as.character(data_long$item_id_y),
levels=paste("item", 1:35, sep=""))
data_long$item_id_x = factor(as.character(data_long$item_id_x),
levels=paste("item", 1:35, sep=""))
# Create critical value labels in a new column.
data_long$critical_level = ifelse(data_long$value >= 5.45, "high",
ifelse(data_long$value <= -5.45, "low", "middle"))
# Convert to labels to factor, controlling the order of the factor levels.
data_long$critical_level = factor(data_long$critical_level,
levels=c("high", "middle", "low"))
# Named vector for ggplot's scale_fill_manual
critical_level_colors = c(high="black", middle="grey80", low="white")
# Calculate grid line positions manually.
x_grid_lines = seq(0.5, length(levels(data_long$item_id_x)), 1)
y_grid_lines = seq(0.5, length(levels(data_long$item_id_y)), 1)
# Create plot.
plot_1 = ggplot(data_long, aes(xmin=as.integer(item_id_x) - 0.5,
xmax=as.integer(item_id_x) + 0.5,
ymin=as.integer(item_id_y) - 0.5,
ymax=as.integer(item_id_y) + 0.5,
fill=critical_level)) +
theme_bw() +
theme(panel.grid.minor=element_blank(), panel.grid.major=element_blank()) +
coord_cartesian(xlim=c(min(x_grid_lines), max(x_grid_lines)),
ylim=c(min(y_grid_lines), max(y_grid_lines))) +
scale_x_continuous(breaks=seq(1, length(levels(data_long$item_id_x))),
labels=levels(data_long$item_id_x)) +
scale_y_continuous(breaks=seq(1, length(levels(data_long$item_id_x))),
labels=levels(data_long$item_id_y)) +
scale_fill_manual(name="Critical Values", values=critical_level_colors) +
geom_rect() +
geom_hline(yintercept=y_grid_lines, colour="grey40", size=0.15) +
geom_vline(xintercept=x_grid_lines, colour="grey40", size=0.15) +
theme(axis.text.y=element_text(size=9)) +
theme(axis.text.x=element_text(size=9, angle=90)) +
ggtitle("Critical Values Matrix")
# Save to pdf file.
pdf("plot_1.pdf", height=8.5, width=8.5)
print(plot_1)
dev.off()

If you set this up with findInterval as an index into the bg, col, and/or pch arguments (although they are all squares at the moment), you should find the code fairly compact and understandable.
You'll need to get the data in long format first; here's one way:
d <- as.matrix(read.table("http://dl.dropbox.com/u/2505196/postH.dat"))
dat <- within(as.data.frame(as.table(d)),
{ Var1 <- as.numeric(Var1)
Var2 <- as.numeric(Var2) })
Then the code is as follows; pch=22 uses filled squares, bg sets the fill color of the square, col sets the border color, and cex=1.5 just makes them a little bigger than the default.
plot(dat$Var1, dat$Var2,
bg = c("grey", "white", "black")[1+findInterval(dat$Freq, c(-5.45,5.45))],
col="white", cex=1.5, pch = 22)
You need the 1+ in there because the values would be 0,1,2 and your indices need to start with 1.
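To see how the indexing works, here is a small sketch on a few made-up values (the vector values is hypothetical; the cutoffs are the ones from the question):

```r
cutoffs <- c(-5.45, 5.45)
values <- c(-6, -5.45, 0, 5.45, 6)

# findInterval() returns 0 below the first cutoff, 1 between the two, 2 from the second upward.
idx <- 1 + findInterval(values, cutoffs)
idx  # 1 2 2 3 3

# Use the shifted index to pick a colour for each value.
fill <- c("grey", "white", "black")[idx]
fill  # "grey" "white" "white" "black" "black"
```

Note that the intervals are closed on the left, so a value of exactly -5.45 lands in the middle (white) class here, whereas the question asks for gray at that boundary. If the boundary matters, the cutoffs may need a small adjustment.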

To close this out, I used most of the suggestions from @DWin and @Aaron to create the plot below. The lightest level of gray stands for non-significant values. I also used rect to draw lines above the axis names to better differentiate between conditions:
d <- as.matrix(read.table("http://dl.dropbox.com/u/2505196/postH.dat"))
#remove upper half of the values (as they are mirrored values)
d[upper.tri(d)] <- NA
dat <- within(as.data.frame(as.table(d)),{
Var1 <- as.numeric(Var1)
Var2 <- as.numeric(Var2)})
par(mar=c(6,3,3,6))
colPh=c("gray50","gray90","black")
plot(dat$Var1,dat$Var2,bg = colPh[1+findInterval(dat$Freq, c(-5.45,5.45))],
col="white",cex=1.2,pch = 21,axes=F,xlab="",ylab="")
labDis <- rep(c("A","B","C","D","E"),times=7)
labChar <- c(1:7)
axis(1,at=1:35,labels=labDis,cex.axis=0.5,tick=F,line=-1.4)
axis(1,at=seq(3,33,5),labels=labChar, tick=F)
#drawing lines above axis for better identification
rect(1,0,5,0,angle=90);rect(6,0,10,0,angle=90);rect(11,0,15,0,angle=90);
rect(16,0,20,0,angle=90);rect(21,0,25,0,angle=90);rect(26,0,30,0,angle=90);
rect(31,0,35,0,angle=90)
axis(4,at=1:35,labels=labDis,cex.axis=0.5,tick=F,line=-1.4)
axis(4,at=seq(3,33,5),labels=labChar,tick=F)
#drawing lines above axis for better identification
rect(36,1,36,5,angle=90);rect(36,6,36,10,angle=90);rect(36,11,36,15,angle=90);
rect(36,16,36,20,angle=90);rect(36,21,36,25,angle=90);rect(36,26,36,30,angle=90);
rect(36,31,36,35,angle=90)
legend("topleft",legend=c("not significant","p<0.01","p<0.05"),pch=16,
col=c("gray90","gray50","black"),cex=0.7,bty="n")
