Create multiple lattice plots from a data table using lapply - r

I am trying to generate mean values by group for each of numerous variables (species) and then plot each of these separately. I have tried list and data table formats. The base plot function works in a for loop:
library(data.table)
for (i in 3:5) {
# generate a list of mean value for the species in column number i
temp <- v2[, lapply(.SD, mean), by="Season", .SDcols=i]
# plot each col of the list as a scatterplot with main title = header of 2nd col
plot(temp[[2]]~temp[[1]], main = colnames(temp)[[2]])
}
But when I try to create lattice plots only a single plot is generated for the last variable:
library(data.table)
library(lattice)
for (i in 3:5) {
# generate a list of mean value by season for the species in column number i
temp <- v2[, lapply(.SD, mean), by=c("Season", "Location"), .SDcols=i]
# Each group in a separate mini plot
xyplot(temp[[3]]~temp[[1]] | temp[[2]], main = colnames(temp)[3])
}
I have tried saving or printing each lattice plot; is that the right idea? Or am I going about this the wrong way altogether?
Here is a small sample of my data:
structure(list(Location = structure(c(1L, 1L, 1L, 1L, 4L, 4L,
4L, 6L, 6L, 1L), .Label = c("BN", "BS", "GN", "GS", "SB", "SL"
), class = "factor"), Season = c(1981L, 1981L, 1981L, 1981L,
1995L, 1995L, 1995L, 1997L, 1997L, 2000L), Agrostis.magellanica = c(0.3,
0.3, 0.3, 0.01, 0.6, 0.3, 0.3, 0.3, 0.6, 0.05), Festuca.contracta = c(0.6,
0.05, 0.3, 0.01, 0.01, 0, 0, 0, 0.01, 0.05), Poa.annua = c(0.01,
0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.05, 0.01, 0.01)), .Names = c("Location",
"Season", "Agrostis.magellanica", "Festuca.contracta", "Poa.annua"
), class = c("data.table", "data.frame"), row.names = c(NA, -10L
))

This is covered in the R FAQ. Grid-based graphics (lattice or ggplot) need an explicit print() call when used inside a function, and a for loop behaves like a function in this respect:
# Needed
require(data.table) # before defining the object.
require(lattice) # provides xyplot()
pdf() # pdf is a multipage device.
for (i in 3:5) {
# generate a list of mean value by season for the species in column number i
temp <- v2[, lapply(.SD, mean), by=c("Season", "Location"), .SDcols=i]
# Each group in a separate mini plot
print( xyplot(temp[[3]]~temp[[1]] | temp[[2]], main = colnames(temp)[3]) )
}
dev.off() # Failing to properly close a file graphics device is a common error.
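Since the title asks about lapply, the same idea can also be sketched without a for loop: build the lattice objects in a list, then print each one. This is only a minimal sketch, assuming the sample data from the question is rebuilt as v2; the names species and plots and the output file name are illustrative and not from the original post.

```r
library(data.table)
library(lattice)

# Sample data from the question, rebuilt with data.table()
v2 <- data.table(
  Location = factor(c("BN", "BN", "BN", "BN", "GS", "GS", "GS", "SL", "SL", "BN")),
  Season = c(1981L, 1981L, 1981L, 1981L, 1995L, 1995L, 1995L, 1997L, 1997L, 2000L),
  Agrostis.magellanica = c(0.3, 0.3, 0.3, 0.01, 0.6, 0.3, 0.3, 0.3, 0.6, 0.05),
  Festuca.contracta = c(0.6, 0.05, 0.3, 0.01, 0.01, 0, 0, 0, 0.01, 0.05),
  Poa.annua = c(0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.05, 0.01, 0.01)
)

species <- names(v2)[3:5]  # illustrative name: the species columns

# One trellis object per species; nothing is drawn yet
plots <- lapply(species, function(sp) {
  temp <- v2[, lapply(.SD, mean), by = c("Season", "Location"), .SDcols = sp]
  xyplot(as.formula(paste(sp, "~ Season | Location")), data = temp, main = sp)
})

pdf("species_means.pdf")         # hypothetical file name; one page per plot
invisible(lapply(plots, print))  # the explicit print() is what draws each plot
dev.off()
```

Keeping the plots in a list also makes it possible to print just one of them later, e.g. print(plots[[2]]).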

Related

How to plot many probability density functions (pdfs) without sharp edges?

I have an issue with plotting continuous distributions without sharp edges in ggplot2. I need to show two of them on one plot. Also, it does not have to be ggplot2 to achieve this result.
I am aware that the number of data points directly influences the smoothness of the line, but that was not the issue here. Below is some sample data (from dput):
sample.data<-list(beta.data = structure(list(cluster = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), beta.density = c(0, 3.42273368363934e-43, 8.42987148403021e-29,
2.04764468657484e-20, 1.69485562831516e-14, 6.07999638837842e-10, 2.88180370232676e-06, 0.00314064636750876, 0.954118897015866, 0, 0, 3.80101893822358e-36, 6.43342582657081e-22, 6.82956252277493e-14, 1.75291058931833e-08, 0.000131874335695378, 0.0750918340641428, 3.72532418759802, 5.05242078519544, 0), pr = c(0, 0.111111111111111, 0.222222222222222, 0.333333333333333, 0.444444444444444, 0.555555555555556, 0.666666666666667, 0.777777777777778, 0.888888888888889, 1, 0, 0.111111111111111, 0.222222222222222, 0.333333333333333, 0.444444444444444, 0.555555555555556, 0.666666666666667, 0.777777777777778, 0.888888888888889, 1)), row.names = c(NA, -20L), class = "data.frame"), beta.params = structure(list(cluster = 1:2, a = c(49, 50), b = c(2, 10), ni.beta = c(0.961,0.833), sd.beta = c(0.00072, 0.00228)), row.names = c(NA,-2L), class = "data.frame"))
Before I was using geom_col, but it discretizes values. I went with geom_area:
ggplot(sample.data$beta.data, aes(x = pr, y = beta.density)) +
geom_area(stat = "function",
fun = dbeta,
args = list(shape1 = sample.data$beta.params[1,"a"], shape2 = sample.data$beta.params[1,"b"]),
fill = "#F1C40F",
colour = "black",
alpha = 0.7) +
geom_area(stat = "function",
fun=dbeta,
args = list(shape1 = sample.data$beta.params[2,"a"], shape2 = sample.data$beta.params[2,"b"]),
fill = "#3498DB",
colour = "black",
alpha = 0.7)
I have shown the data with 10 points, but 1000 points look almost the same. That is not the case with base plot, where even 100 points look fine:
p = seq(0,1, length=100)
plot(p, dbeta(p, 50, 10), ylab="Density", type ="l", col=4, lwd = 2)
Here I am attaching code to simulate the data. Oh, and these troublesome beta parameters were a = 49 and b = 2.
len <- 100
p <- seq(0,1, length.out = len)
df <- data.frame(rbind(cbind("cl" = rep(1, times = length(p)), "beta" = dbeta(p, 50, 10),"p"= p),
cbind("cl" = rep(2, times = length(p)), "beta" = dbeta(p, 40, 2),"p"= p)))
Do you have any ideas?
EDIT: "pdfs" here stands for probability density functions, which is why I have not used "pdf" as a tag. My apologies for the confusion!
Anyway, when I tried to print the graphic to a PDF file, the result was poor as well (sharp edges). But in the end, it shouldn't matter. I want to see smooth lines whatever I do (within reason).
EDIT 2: It is possible to achieve, because:
library(mosaic)
theme_set(theme_bw())
xpbeta(c(0.7, 0.90), shape1 = 49, shape2 = 2)
It produces nice, smoothed beta dist with parameters (49, 2). But then again, I need to show two dists in one chart.
I have found the answer. It still needs some editing (like transparency/alpha which I couldn't figure out), but in general, this is what I meant. Code:
library(mosaic)
plotDist('beta', params=list(49,2), kind='density', type = "h", col = "#3498DB", xlim = c(0,1))
plotDist('beta', params=list(50, 10), kind='density', type = "h", col = "#F1C40F", add = TRUE)
plotDist('beta', params=list(49,2), kind='density', add = TRUE, col = "black")
plotDist('beta', params=list(50, 10), kind='density', add = TRUE, col = "black")
We can add as many distributions as we want, using "add" parameter.
The parameter type = "h" is used to draw a filled distribution. Without it, only the line is visible. In my answer, I draw two lines and two filled distributions. I would be really happy if someone could show a better answer, though.
EDIT:
I think I found my perfect answer!
Here is the code:
library(ggformula)
theme_set(theme_bw())
gf_dist("beta", shape1 = 49, shape2 = 2, geom = "area", alpha = 0.5, fill = "#F1C40F") %>%
gf_dist("beta", shape1 = 49, shape2 = 2) %>%
gf_dist("beta", shape1 = 50, shape2 = 10, geom = "area", alpha = 0.5, fill = "#3498DB") %>%
gf_dist("beta", shape1 = 50, shape2 = 10)
It is much faster than the previous code, parameter alpha is obvious and it is relatively easy to combine many plots! Because of transparency, you can nicely see the overlap of both distributions.
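For the plain ggplot2 attempt in the question, one thing that may be worth trying is the n argument of stat_function(), which sets how many points the function is evaluated at (101 by default). A Beta(49, 2) density peaks very close to 1, so a coarse grid can leave visible corners there. This is a sketch under that assumption, not a confirmed diagnosis of the original plot; the value n = 1001 is an arbitrary choice.

```r
library(ggplot2)

# Beta parameters taken from the question: (49, 2) and (50, 10)
p <- ggplot() +
  xlim(0, 1) +
  stat_function(fun = dbeta, args = list(shape1 = 49, shape2 = 2),
                geom = "area", n = 1001,  # evaluate on a finer grid
                fill = "#F1C40F", colour = "black", alpha = 0.7) +
  stat_function(fun = dbeta, args = list(shape1 = 50, shape2 = 10),
                geom = "area", n = 1001,
                fill = "#3498DB", colour = "black", alpha = 0.7) +
  labs(x = "pr", y = "beta.density")

print(p)
```

Because the curves come from dbeta directly, no data frame of precomputed densities is needed in this version.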

How to best represent a matrix with one column of data with a heatmap

I am used to making heatmaps in a particular manner. I have included a toy example and a generic code of how I make heatmaps.
>library(reshape2)
>library(gplots)
>library(RColorBrewer)
>dput(head(X))
structure(c(1.1, 0.4, 2.1, 2.3, 1.3, 1.3, 1.7, 2.4, 2.7), .Dim = c(3L,
3L), .Dimnames = list(c("GeneA", "GeneB", "GeneC"), c("sample1",
"sample2", "sample3")))
>hr <- hclust(as.dist(1-cor(t(X), method = "pearson")) ,method= "average")
>mycl <- cutree(hr, h=max(hr$height/1.3))
>clusterCols <- rainbow(length(unique(mycl)))
>myClusterSideBar <- clusterCols[mycl]
>myheatcol <- rev(redblue(75))
>heatmap.2(X, main="RBP enrichment profiles", Rowv=as.dendrogram(hr), Colv = NULL, dendrogram="row",scale="row", col=myheatcol, density.info="none", trace="none", RowSideColors= myClusterSideBar) -> HM
This works well for matrices with multiple rows and columns, but I have a matrix with only one column of data. Is there a way of altering this code to produce a similar plot?
A small sample of the data in question is below.
> dput(head(mydata))
structure(c(21.1545836987, -1.6564670831, -2.8577854708, -2.2236733666,
2.0775473007, 3.0734379897), .Dim = c(6L, 1L), .Dimnames = list(
c("LENG1.1", "Dec-01", "Mar-02", "A1CF", "AADACL2-AS1", "AANAT"
), "Log2FC"))
Can I turn this data into a heatmap? Any help would be most welcome.
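One possible approach, sketched here with base R's image() rather than heatmap.2: sort the single column and draw it as a one-column colour strip. Row clustering by correlation and scale="row" are not meaningful with a single column, so this sketch simply orders the rows by value. The names ord, vals, lim and pal are illustrative, and the blue-white-red palette is an assumption standing in for redblue().

```r
mydata <- structure(c(21.1545836987, -1.6564670831, -2.8577854708, -2.2236733666,
                      2.0775473007, 3.0734379897), .Dim = c(6L, 1L),
                    .Dimnames = list(c("LENG1.1", "Dec-01", "Mar-02", "A1CF",
                                       "AADACL2-AS1", "AANAT"), "Log2FC"))

# Order rows by value so similar values sit next to each other
ord <- order(mydata[, 1])
vals <- mydata[ord, , drop = FALSE]
n <- nrow(vals)

# Symmetric colour limits so that white corresponds to Log2FC = 0
lim <- max(abs(vals))
pal <- colorRampPalette(c("blue", "white", "red"))(75)

op <- par(mar = c(3, 8, 3, 2))  # wide left margin for the row labels
# x and y are given as cell boundaries: one column, n rows
image(x = c(0.5, 1.5), y = seq(0.5, n + 0.5, by = 1), z = t(vals),
      zlim = c(-lim, lim), col = pal, axes = FALSE,
      xlab = "", ylab = "", main = "Log2FC")
axis(2, at = seq_len(n), labels = rownames(vals), las = 1, tick = FALSE)
box()
par(op)
```

With one very large value (21.15 here), most of the colour range is spent on that single row; clipping or transforming the values before plotting may give a more readable strip.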

How to plot asymptote of a curve in R?

I have this data called mydf where I have hybrid sample comparison for efficiency. There are seven different efficiency columns for the intermixing of sampleA and sampleB. I want to see the plot for these seven efficiencies to see at which efficiency level will they significantly drop compared to the first few columns.
mydf<-structure(list(sample_A = structure(c(1L, 2L, 2L, 2L, 3L, 4L), .Label = c("2568",
"2669", "2670", "2671", "2946", "LPH-001-10_AK1", "LPH-001-12_AK2",
"LPH-001-9"), class = "factor"), sample_B = structure(c(1L, 2L,
3L, 4L, 3L, 4L), .Label = c("2568", "2669", "2670", "2671", "2946",
"LPH-001-10_AK1", "LPH-001-12_AK2", "LPH-001-9"), class = "factor"),
efficiency = c(1.02, 0.964, 0.415, 0.422, 0.98, 0.986), efficiency2 = c(1,
0.944, 0.395, 0.402, 0.96, 0.966), efficiency3 = c(0.9, 0.844,
0.295, 0.302, 0.86, 0.866), efficiency4 = c(0.32, 0.264,
-0.285, -0.278, 0.28, 0.286), efficiency5 = c(0.02, -0.0360000000000001,
-0.585, -0.578, -0.0200000000000001, -0.0140000000000001),
efficiency6 = c(0.12, 0.0639999999999999, -0.485, -0.478,
0.08, 0.086), efficiency7 = c(0.02, -0.036, -0.585, -0.578,
-0.02, -0.014)), .Names = c("sample_A", "sample_B", "efficiency",
"efficiency2", "efficiency3", "efficiency4", "efficiency5", "efficiency6",
"efficiency7"), row.names = c(NA, 6L), class = "data.frame")
Here's one way to plot your data:
mydf <- structure(list(sample_A=structure(c(1L,2L,2L,2L,3L,4L),.Label=c('2568','2669','2670','2671','2946','LPH-001-10_AK1','LPH-001-12_AK2','LPH-001-9'),class='factor'),sample_B=structure(c(1L,2L,3L,4L,3L,4L),.Label=c('2568','2669','2670','2671','2946','LPH-001-10_AK1','LPH-001-12_AK2','LPH-001-9'),class='factor'),efficiency=c(1.02,0.964,0.415,0.422,0.98,0.986),efficiency2=c(1,0.944,0.395,0.402,0.96,0.966),efficiency3=c(0.9,0.844,0.295,0.302,0.86,0.866),efficiency4=c(0.32,0.264,-0.285,-0.278,0.28,0.286),efficiency5=c(0.02,-0.0360000000000001,-0.585,-0.578,-0.0200000000000001,-0.0140000000000001),efficiency6=c(0.12,0.0639999999999999,-0.485,-0.478,0.08,0.086),efficiency7=c(0.02,-0.036,-0.585,-0.578,-0.02,-0.014)),.Names=c('sample_A','sample_B','efficiency','efficiency2','efficiency3','efficiency4','efficiency5','efficiency6','efficiency7'),row.names=c(NA,6L),class='data.frame');
effCis <- grep('^efficiency',names(mydf));
xlim <- c(1,length(effCis));
ylim <- range(mydf[,effCis],na.rm=T);
ylim[1L] <- floor(ylim[1L]/0.1)*0.1;
ylim[2L] <- ceiling(ylim[2L]/0.1)*0.1;
xticks <- seq_along(effCis);
yticks <- seq(ylim[1L],ylim[2L],0.1);
plot(NA,xlim=xlim,ylim=ylim,xlab='measurement',ylab='efficiency',xaxs='i',yaxs='i',axes=F);
abline(v=xticks,col='lightgrey');
abline(h=yticks,col='lightgrey');
abline(h=0,lwd=2);
axis(1L,xticks,xticks,font=2L,cex.axis=0.7);
axis(2L,yticks,sprintf('%.1f',yticks),las=1L,font=2L,cex.axis=0.7);
hybrid.col <- data.frame(hybrid=seq_len(nrow(mydf)),col=c('red','green','blue','gold','cyan','magenta'),stringsAsFactors=F);
splineN <- 200L;
for (ri in seq_len(nrow(hybrid.col))) {
hybrid <- hybrid.col$hybrid[ri];
col <- hybrid.col$col[ri];
x <- xticks;
y <- c(as.matrix(mydf[hybrid,effCis]));
points(x,y,pch=16L,col=col,xpd=NA);
with(spline(x,y,splineN),{
lines(x,y,col=col,lwd=2,xpd=NA);
localwin <- which(x>2 & x<3);
tp <- which.min(abs(diff(y[localwin])));
if (length(tp)>0L) points(x[localwin[tp]],y[localwin[tp]],col=col,pch=4L);
localwin <- which(x>2 & x<5);
tp <- which.min(diff(y[localwin]));
if (length(tp)>0L) {
m <- diff(y[localwin[seq(tp,len=2L)]])/diff(x[localwin[seq(tp,len=2L)]]);
if (is.finite(m)) abline(y[localwin[tp]]-m*x[localwin[tp]],m,col=col,lty=2L);
};
});
};
legend(5.5,0.95,paste0(mydf$sample_A,' / ',mydf$sample_B),fill=hybrid.col$col,cex=0.7,title='hybrid');
I wasn't 100% sure what you meant by the asymptote. I initially thought maybe you wanted the local maxima of the curves just prior to where they begin to drop, which is why I marked the local maxima with points (symbol X, i.e. pch=4L). But then I realized maybe you meant the tangent line along the drop, so I added lines tangent to the points of steepest slope.
This is the definition of asymptote:
a straight line approached by a given curve as one of the variables in the equation of the curve approaches infinity.
I don't think that's applicable here; plotting this data does not involve taking anything to infinity. I think you want either the local maxima or tangent lines.
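The tangent-line idea can be shown in isolation for a single hybrid. This is a simplified sketch, not the code above: it uses only the first row of mydf, searches the whole curve rather than a local window, and approximates the slope with finite differences on the interpolated spline.

```r
# Efficiency values for the first hybrid (row 1 of mydf in the question)
y <- c(1.02, 1, 0.9, 0.32, 0.02, 0.12, 0.02)
x <- seq_along(y)

# Interpolate, then approximate the slope between neighbouring spline points
s <- spline(x, y, n = 200)
slopes <- diff(s$y) / diff(s$x)

i <- which.min(slopes)      # index of the steepest descent
m <- slopes[i]              # slope of the tangent line there
b <- s$y[i] - m * s$x[i]    # intercept of the tangent line

plot(x, y, pch = 16, xlab = "measurement", ylab = "efficiency")
lines(s, lwd = 2)
abline(a = b, b = m, lty = 2)
```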

How to scale y-axis to intuitively detect small differences in data

I have a data set from a literature survey, where we looked at effects of pH to certain parameters (Metrics) in a group of animals. Because experiments are done on different time scales, I divided the response ratio by time.
This leads to very small differences around 1 (less than 1, there is a negative effect, greater than 1 a positive effect), which are still interesting and important (because the real values are divided by time). The problem is that some of the values are either very low or very high and the differences close to 1 are not visible.
Since values are close to 1, log transformation of y-axis scale does not help. How can I transform the y-axis scale in ggplot2 so that differences close to 1 are visible and yet intuitive? (that the reader can detect differences without thinking too much; I could standardize the values to minimum value, multiply by 10000 and take a log10 scale, but this would not lead to understandable differences.)
df <- structure(list(Study = c(1, 1, 2, 2, 3), pH_control = c(8.06,
8.06, 8.01, 8.01, 7.99), pH_treatment = c(7.86, 7.75, 7.8, 7.8,
7.45), time = c(120, 120, 60, 150, 140), Metrics = structure(c(3L,
1L, 2L, 3L, 1L), .Label = c("Growth", "Metabolism", "Survival"
), class = "factor"), RR_per_time_unit = c(0.9998, 1.001, 1.002,
0.98, 0.9), CI.max = c(1, 1.003, 1.00003, 0.9999, 0.92), CI.min = c(0.9996,
0.9999, 1.004, 0.9789, 0.89), pH_diff = c(0.2, 0.31, 0.21, 0.21,
0.54)), .Names = c("Study", "pH_control", "pH_treatment", "time",
"Metrics", "RR_per_time_unit", "CI.max", "CI.min", "pH_diff"), row.names = c(NA,
-5L), class = "data.frame")
df$pH_diff <- df$pH_control - df$pH_treatment
library(ggplot2)
ggplot(df, aes(y = RR_per_time_unit, x = pH_diff, ymin = CI.min, ymax = CI.max)) +
geom_pointrange(aes(color = Metrics)) + geom_hline(aes(yintercept = 1)) + coord_trans(y = "log10")
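One option that might be worth trying is a custom axis transformation centred on 1, for example the cube root of the deviation from 1, which stretches the region near 1 and compresses the extremes. This is only a sketch: dev_root, dev_root_inv and the chosen breaks are hypothetical, the data frame is retyped from the question with just the columns needed, and whether readers find the resulting axis intuitive is a judgment call. It uses scales::trans_new() and the trans argument of scale_y_continuous(); recent ggplot2 versions prefer the name transform for that argument.

```r
library(ggplot2)
library(scales)

# Hypothetical helpers: cube root of the deviation from 1, and its inverse
dev_root <- function(y) sign(y - 1) * abs(y - 1)^(1/3)
dev_root_inv <- function(z) 1 + z^3
dev_root_trans <- trans_new("dev_root", transform = dev_root, inverse = dev_root_inv)

# Columns retyped from the question's data
df <- data.frame(
  pH_diff = c(0.2, 0.31, 0.21, 0.21, 0.54),
  Metrics = c("Survival", "Growth", "Metabolism", "Survival", "Growth"),
  RR_per_time_unit = c(0.9998, 1.001, 1.002, 0.98, 0.9),
  CI.max = c(1, 1.003, 1.00003, 0.9999, 0.92),
  CI.min = c(0.9996, 0.9999, 1.004, 0.9789, 0.89)
)

p <- ggplot(df, aes(y = RR_per_time_unit, x = pH_diff, ymin = CI.min, ymax = CI.max)) +
  geom_pointrange(aes(color = Metrics)) +
  geom_hline(yintercept = 1) +
  scale_y_continuous(trans = dev_root_trans,
                     breaks = c(0.9, 0.98, 0.999, 1, 1.001, 1.004))

print(p)
```

The axis labels stay in the original units, so a reader still sees values such as 0.98 or 1.001 rather than transformed numbers.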

barchart with multiple overlaying errorbars

I am trying to create a barplot with multiple errorbars. Something like this: http://flyordie.sin.khk.be/r/histogram%20error%20bars.PNG
I have the following dataset:
http://flyordie.sin.khk.be/r/output.csv
I have tried using ggplot2 and lattice graphics, but haven't found anything that suits my needs.
My current code for showing the barchart is this:
data <- read.csv("c:/output.csv")
data
par(las=3)
barplot(height=data$HC.Maximum,
names.arg=data$PlateId,
col="lightblue")
And to show the highest error bars I use this code:
library(ggplot2)
limits <- aes(ymax = qc$HC.Maximum, ymin = qc$HC.Minimum)
p <- ggplot(qc, aes(colour=HC.Median,x=PlateId))
p + geom_bar(position="dodge")+ geom_errorbar(limits,position="dodge")
But I have no clue on how to put them on the same graphic (like in my example)
The data:
qc <- structure(list(row = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "Row", class = "factor"),
ID = 1:14, PlateId = c(35276L, 35279L, 35280L, 35281L, 35282L,
35290L, 35291L, 35292L, 35293L, 35294L, 35295L, 35296L, 35297L,
35298L), LC.Median = c(439688.495, 509376.055, 475218.99,
497368.215, 481801.9, 468603.43, 494713.175, 459047.385,
482819.47, 495162.31, 449592.51, 460564.95, 478715.915, 452293.465
), LC.Stdev = c(52290.12229, 49648.49436, 55743.10306, 62002.53552,
46908.66149, 52489.615, 48016.94019, 52082.23899, 47934.37133,
58977.84845, 45827.62648, 53514.21095, 49638.98286, 139686.144
), LC.Minimum = c(279610.16, 423651.45, 356422.31, 411639.77,
397362.84, 345178.07, 406073.72, 352834.86, 339035.77, 369554.11,
348688.39, 357341.56, 370463.11, 210367.91), LC.Maximum = c(498195.9,
630648.53, 614625.78, 686737.35, 621372.36, 576491.41, 579708.95,
580633.28, 580125.9, 622108.73, 530234.87, 563616.65, 614936.33,
730272.63), HC.Median = c(507356.465, 553226.525, 447067.77,
452223.76, 453439.37, 422491.755, 447438.8, 435034.635, 446148.105,
438089.69, 466748.63, 440005.81, 454927.74, 483599.71), HC.Stdev = c(65355.46121,
72762.07338, 80118.37641, 43653.99318, 73389.12355, 62590.47601,
46421.36678, 62822.88532, 61175.4241, 64418.56174, 63101.2232,
68166.51814, 61256.74139, 87354.9441), HC.Minimum = c(381552.05,
391124.94, 280614.72, 395454.12, 291433.84, 252579.15, 331661.03,
296223.64, 240262.37, 299431.98, 375224.27, 278780.87, 310275.66,
213170.04), HC.Maximum = c(626483.6, 635111.41, 555357.3,
528822.8, 534172.42, 514927.42, 538385.26, 533024.74, 524973.99,
544335.94, 564954.87, 572206.98, 547489.1, 565338.09), zPrime = c(-3.96,
-23.73, -7.88, -5.81, -5.32, -5.54, -4.48, -7.98, -6.99,
-5.63, -22.54, -33.83, -11.92, -17.44), Sb = c(1.17, 1.03,
0.91, 0.91, 0.89, 0.89, 0.9, 0.92, 0.92, 0.89, 1.04, 0.98,
0.95, 1.09), Sn = c(1.37, 0.3, -0.83, -0.76, -1.22, -1.01,
-1.08, -0.74, -0.86, -0.95, 0.31, -0.2, -0.52, 0.27)), .Names = c("row",
"ID", "PlateId", "LC.Median", "LC.Stdev", "LC.Minimum", "LC.Maximum",
"HC.Median", "HC.Stdev", "HC.Minimum", "HC.Maximum", "zPrime",
"Sb", "Sn"), class = "data.frame", row.names = c(NA, -14L))
When creating plots with multiple layers, I approach it as follows:
Define common aesthetics in the initial call to ggplot
Define additional aesthetics in each additional layer
Note that I have modified your code, since I couldn't get your example to work:
Provide an explicit binwidth=1 to geom_bar to remove the warnings
Remove the position=dodge, since with one bar per PlateId it has no effect
Supply an explicit stat=identity to geom_bar
The code:
ggplot(qc, aes(x=PlateId)) +
geom_bar(aes(y=HC.Median), binwidth=1, stat="identity", fill="cyan") +
geom_errorbar(aes(ymin=HC.Minimum, ymax=LC.Minimum), colour="red") +
geom_errorbar(aes(ymin=LC.Maximum, ymax=HC.Maximum), colour="purple")
Just add another geom_errorbar.
#Rename limits to limits_hi
limits_hi <- aes(ymax = HC.Maximum, ymin = HC.Minimum)
#Define the other error bar
limits_lo <- aes(ymax = LC.Maximum, ymin = LC.Minimum)
#I'm not quite sure what you want in the bars; see if this looks right
p <- ggplot(qc, aes(factor(PlateId), HC.Median))
p +
geom_bar(stat="identity", position="dodge") + # stat="identity" is needed because y is supplied
geom_errorbar(limits_hi, position="dodge", colour = "red") +
geom_errorbar(limits_lo, position="dodge", colour = "blue") +
theme(axis.text.x = element_text(angle = 30)) # opts()/theme_text() are no longer in ggplot2
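With a current ggplot2, the layered approach from both answers can be sketched as below. This is an illustration only: it uses the first three plates of the qc data so that it is self-contained, and it assumes the bars should show HC.Median with the HC and LC ranges drawn as two error bars of different widths so that both stay visible when they overlap. The name qc_sub and the colours and widths are not from the original answers.

```r
library(ggplot2)

# First three plates from the qc data in the question
qc_sub <- data.frame(
  PlateId = c(35276L, 35279L, 35280L),
  HC.Median = c(507356.465, 553226.525, 447067.77),
  HC.Minimum = c(381552.05, 391124.94, 280614.72),
  HC.Maximum = c(626483.6, 635111.41, 555357.3),
  LC.Minimum = c(279610.16, 423651.45, 356422.31),
  LC.Maximum = c(498195.9, 630648.53, 614625.78)
)

# Common aesthetic (x) in ggplot(); layer-specific aesthetics in each layer
p <- ggplot(qc_sub, aes(x = factor(PlateId))) +
  geom_col(aes(y = HC.Median), fill = "lightblue") +
  geom_errorbar(aes(ymin = HC.Minimum, ymax = HC.Maximum),
                colour = "red", width = 0.4) +
  geom_errorbar(aes(ymin = LC.Minimum, ymax = LC.Maximum),
                colour = "blue", width = 0.2) +
  labs(x = "PlateId", y = "HC.Median")

print(p)
```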

Resources