Creating boxplot on log scale in R - r

I am trying to plot a boxplot in R, where the input file has multiple columns and each column has different number of rows. With the help given on help on the following link:
boxplot of vectors with different length
I am trying:
x <- read.csv( 'filename.csv', header = T )
plot(
1, 1,
xlim=c(1,ncol(x)), ylim=range(x[-1,], na.rm=TRUE),
xaxt='n', xlab='', ylab=''
)
axis(1, labels=colnames(x), at=1:ncol(x))
for(i in 1:ncol(x)) {
p <- x[,i]
boxplot(p, add=T, at=i)
}
I am trying to plot the values in log scale. But defining log ="y", I am getting the following error:
Error in xypolygon(xx, yy, lty = "blank", col = boxfill[i]) :
plot.new has not been called yet
Following is the sample of my input csv data:
A B C D
2345.42 932.19 40.8 26.19
138.48 1074.1 4405.62 4077.16
849.35 0.0 1451.66 1637.39
451.38 146.22 4579.6 5133.14
5749.01 7250.08 12.23 0.09
4125.48 129.46 49.51
440.38 6405.02

Your data as a reproducible example
Note I had to remove an extra element
library(data.table)
df <- fread("A,B,C,D
2345.42,932.19,40.8,26.19
138.48,1074.1,4405.62,4077.16
849.35,0.0,1451.66,1637.39
451.38,146.22,4579.6,5133.14
5749.01,7250.08,12.23,0.09
4125.48,129.46,49.51,440.38", sep=",", header=T)
dplyr and tidyr solution
library(dplyr)
library(tidyr)
df1 <- df %>%
replace(.==0,NA) %>% # make 0 into NA
gather(var,values,A:D) %>% # convert from wide (4-col) to long (2-col) format
mutate(values = log10(values)) # log10 transform
If you want log2, simply replace log10 with log2
Output
boxplot(values ~ var, df1)
A little extra
For log10 scale, I like to add 1 to my values to eliminate negative values since log10(0 < x < 1) = -value. This sets the minimum value on your plot as 0 since 0 + 1 = 1 and log10(1) = 0

Related

Calculating mean and interquartile range of 'cut' data to plot

Apologies I am new to R, I have a dataset with height and canopy density of trees for example:
i_h100 i_cd
2.89 0.0198
2.88 0.0198
17.53 0.658
27.23 0.347
I want to regroup 'h_100' into 2m intervals going from 2m min to 30m max, I then want to calculate the mean i_cd value and interquartile range for each of these intervals so that I can then plot these with a least squares regression. There is something wrong with the code I am using to get the mean. This is what I have so far:
mydata=read.csv("irelandish.csv")
height=mydata$i_h100
breaks=seq(2,30,by=2) #2m intervals
height.cut=cut(height, breaks, right=TRUE)
#attempt at calculating means per group
install.packages("dplyr")
mean=summarise(group_by(cut(height, breaks, right=TRUE),
mean(mydata$i_cd)))
install.packages("reshape2")
dcast(mean)
Thanks in advance for any advice.
Using aggregate() to calculate the groupwise means.
# Some example data
set.seed(1)
i_h100 <- round(runif(100, 2, 30), 2)
i_cd <- rexp(100, 1/i_h100)
mydata <- data.frame(i_cd, i_h100)
# Grouping i_h100
mydata$i_h100_2m <- cut(mydata$i_h100, seq(2, 30, by=2))
head(mydata)
# i_cd i_h100 i_h100_2m
# 1 2.918093 9.43 (8,10]
# 2 13.735728 12.42 (12,14]
# 3 13.966347 18.04 (18,20]
# 4 2.459760 27.43 (26,28]
# 5 8.477551 7.65 (6,8]
# 6 6.713224 27.15 (26,28]
# Calculate groupwise means of i_cd
i_cd_2m_mean <- aggregate(i_cd ~ i_h100_2m, mydata, mean)
# And IQR
i_cd_2m_iqr <- aggregate(i_cd ~ i_h100_2m, mydata, IQR)
upper <- i_cd_2m_mean[,2]+(i_cd_2m_iqr[,2]/2)
lower <- i_cd_2m_mean[,2]-(i_cd_2m_iqr[,2]/2)
# Plotting the result
plot.default(i_cd_2m_mean, xaxt="n", ylim=range(c(upper, lower)),
main="Groupwise means \U00B1 0.5 IQR", type="n")
points(upper, pch=2, col="lightblue", lwd=1.5)
points(lower, pch=6, col="pink", lwd=1.5)
points(i_cd_2m_mean, pch=16)
axis(1, i_cd_2m[,1], as.character(i_cd_2m[,1]), cex.axis=0.6, las=2)
Here is a solution,
library(reshape2)
library(dplyr)
mydata <- data_frame(i_h100=c(2.89,2.88,17.53,27.23),i_cd=c(0.0198,0.0198,0.658,0.347))
height <- mydata$i_h100
breaks <- seq(2,30,by=2) #2m intervals
height.cut <- cut(height, breaks, right=TRUE)
mydata$height.cut <- height.cut
mean_i_h100 <- mydata %>% group_by(height.cut) %>% summarise(mean_i_h100 = mean(i_h100))
A few remarks:
it is better to avoid naming variables with function names, so I changed the mean variable to mean_i_h100
I am using the pipe notation, which makes the code more readable, it avoids repeating the first argument of each function, you can find a more detailed explanation here.
Without the pipe notation, the last line of code would be:
mean_i_h100 <- summarise(group_by(mydata,height.cut),mean_i_h100 = mean(i_h100))
you have to load the two packages you installed with library

Overlap plots in R - from zoo package

Using the following code:
library("ggplot2")
require(zoo)
args <- commandArgs(TRUE)
input <- read.csv(args[1], header=F, col.names=c("POS","ATT"))
id <- args[2]
prot_len <- nrow(input)
manual <- prot_len/100 # 4.3
att_name <- "Entropy"
att_zoo <- zoo(input$ATT)
att_avg <- rollapply(att_zoo, width = manual, by = manual, FUN = mean, align = "left")
autoplot(att_avg, col="att1") + labs(x = "Positions", y = att_name, title="")
With data:
> str(input)
'data.frame': 431 obs. of 2 variables:
$ POS: int 1 2 3 4 5 6 7 8 9 10 ...
$ ATT: num 0.652 0.733 0.815 1.079 0.885 ...
I do:
I would like to upload input2 which has different lenght (therefore, different x-axis) and overlap the 2 curves in the same plot (I mean overlap because I want the two curves in the same plot size, so I will "ignore" the overlapped axis labels and tittles), I would like to compare the shape, regardles the lenght of input.
First I've tried by generating toy input2 changing manual value, so that I have att_avg2 in which manual equals e.g. 7. In between original autoplot and new autoplot-2 I add par(new=TRUE), but this is not my expected output. Any hint on how doing this? Maybe it's better to save att_avg from zoo series to data.frame and not use autoplot? Thanks
UPDATE, response to G. Grothendieck:
If I do:
[...]
att_zoo <- zoo(input$ATT)
att_avg <- rollapply(att_zoo, width = manual, by = manual, FUN = mean, align = "left") #manual=4.3
att_avg2 <- rollapply(att_zoo, width = 7, by = 7, FUN = mean, align = "left")
autoplot(cbind(att_avg, att_avg2), facet=NULL) +
labs(x = "Positions", y = att_name, title="")
I get
and a warning message:
Removed 1 rows containing missing values (geom_path).
par is used with classic graphics, not for ggplot2. If you have two zoo series just cbind or merge the series together and autoplot them using facet=NULL:
library(zoo)
library(ggplot2)
z1 <- zoo(1:3) # length 3
z2 <- zoo(5:1) # length 5
autoplot(cbind(z1, z2), facet = NULL)
Note: The question omitted input2 so there could be some additional considerations from aspects not shown.

Print frequencies (as numbers) in plot

In R, I would like to insert frequencies (as numbers) in a plot:
my code to create the plot:
par(mar=c(4.5,4.5,9.5,4), xpd=TRUE)
plot(factor(ArtMehrspr)~Mehrspr_Vielf, data=datProjektMehr, col=terrain.colors(4),
bty='L', main="Vielfalt nutzen")
legend("topright", inset=c(0,-.225), title="Art der Mehrsprachigkeit", levels(factor(datProjektMehr$ArtMehrspr)),
fill=terrain.colors(4), horiz=TRUE)
par(mar=c(5,4,4,2)+0.1)
In the plot, 2 columns of my dataframe are depicted: ArtMehrspr and Mehrspr_Vielf.
Now what I would like to know is, how many "Kombi" are in category "1", how many "Paral" are in category "1" and so on, and then to print this number in the plot, so that in every box of the plot, I can see the corresponding number of observations. R must know these numbers, otherwise it could not vary the height of the different boxes according to the number of observations. So it cannot be that hard to get these numbers into the plot, can it?
With the command table(), I can get these numbers, but I would have to have 5 table()-commands to get all the numbers. Example for category = 1:
> table(subset(datProjektMehr, Mehrspr_Vielf=="1")$ArtMehrspr)
einspr Kombi Paral Versc Wechs
0 1 9 2 1
Apparently, you can achieve what I am looking for by adding the command labels = TRUE. But it does not work:
par(mar=c(4.5,4.5,9.5,4), xpd=TRUE, labels = TRUE)
plot(factor(ArtMehrspr)~Mehrspr_Vielf, data=datProjektMehr, col=terrain.colors(4),
bty='L', main="Vielfalt nutzen")
legend("topright", inset=c(0,-.225), title="Art der Mehrsprachigkeit", levels(factor(datProjektMehr$ArtMehrspr)),
fill=terrain.colors(4), horiz=TRUE)
par(mar=c(5,4,4,2)+0.1)
R gives me the following warning message:
Warning message:
In par(mar = c(4.5, 4.5, 9.5, 4), xpd = TRUE, labels = TRUE) :
"labels" is not a graphical parameter
Is this not the right command? Does anyone know how to do this?
First of all, the warning informs that there is not a labels argument you can use inside par.
Regarding the plotting of the table output, I'm not aware if there is an easy way of doing this, but I managed a pretty UNreliable and, maybe, inefficient code. In my machine, though, it works every time I run it.
The concept I had in mind is to text all values from your table inside the plot. To do so, coordinates in xx' and yy' had to be estimated. I prefer the term "estimated" instead of "calculated" because I didn't find a way to compute absolute values for the coordinates, due to the fact that the plot method was plot.factor.
So:
#random data. DF = datProjektMehr, artmehr = ArtMehrspr, mehrviel = Mehrspr_Vielf
DF <- data.frame(artmehr = sample(letters[1:4], 20, T), mehrviel = as.factor(sample(1:5, 20, T)))
#your code of plotting
par(mar = c(4.5,4.5,9.5,4), xpd = TRUE)
plot(factor(artmehr) ~ mehrviel, data = DF, col = terrain.colors(4),
bty = 'L', main = "Vielfalt nutzen")
legend("topright", inset=c(0,-.225), title="Art der Mehrsprachigkeit", levels(factor(DF$artmehr)),
fill=terrain.colors(4), horiz=TRUE)
#no need to "table()" many times
tab = table(DF$artmehr, DF$mehrviel)
#maximum value of x axis (at least in my machine)
#I found -through trial and error- that for a factor of n levels, x.max = 1 + (n-1)*0.02
x.max = 1 + (length(levels(DF$mehrviel)) - 1) * 0.02
#coordinates of "mehrviel" (as I named it)
mehrviel.coords = ((cumsum(apply(tab, 2, sum)) / sum(tab)) * x.max) - ((apply(tab, 2, sum) / sum(tab)) / 2)
#coordinates of "artmehr" (as I named it)
artmehr.coords <- apply(tab, 2, function(x) { cumsum(x / sum(x)) })
artmehr.coords <- apply(artmehr.coords, 2, function(x) { x - c(x[1]/2, diff(x)/2) })
#"text" the values in your table
#don't plot "0"s
for(i in 1:ncol(artmehr.coords))
{
text(x = mehrviel.coords[i], y = artmehr.coords[,i], labels = ifelse(tab[,i] != 0, tab[,i], ""), cex = 2)
}
The values of table:
tab
1 2 3 4 5
a 1 1 0 1 0
b 0 0 2 1 2
c 1 1 2 1 0
d 2 0 0 3 2
The plot:
EDIT: 1) "Tidied" the answer. 2) Aadded an extra level to the factor ploted in xx' axis to match your data exactly. 3)texted the frequencies in the middle of each box.

Sorting values in for plotting in R

I have a data that looks like this:
> print(dat)
cutoff tp fp
1 0.6 414 45701
2 0.7 172 16820
3 0.8 51 4326
4 0.9 49 3727
5 1.0 0 0
I want to plot them in reverse-order from smallest dat$tp to largest.
However this code plot them in order like above (i.e. largest to smallest) instead.
> fp_max <- max(dat$fp);
> tp_max <- max(dat$tp);
> op <- par(xaxs = "i", yaxs = "i")
> plot(tp ~ fp, data = dat, xlim = c(0,fp_max),ylim = c(0,tp_max), type = "n")
> with(dat, lines(c(0, fp, fp_max), c(0, tp, tp_max), lty=1, type = "l", col = "black"))
> lines( par()$usr[1:2], par()$usr[3:4], col="red" )
How can I modify the code above to address the problem?
Of course, the x-axis & y-axis coordinates should be from smallest to largest value
The following shows the result of my current code.
Notice that the line started at 0,0 and it 'goes back' to 0 again.
we want to avoid it going back to 0.
Ahh, I understand.
It's because lines draws lines between the points in the order they are given.
There are a few ways you could get around this:
do type='l' in your plot command and then with(dat,lines(...)) is not necessary:
# can also do the col='black',lty=1 in here.
plot(tp ~ fp, data = dat, xlim = c(0,fp_max),ylim = c(0,tp_max), type = "l")
Note that by definition of your fp_max and tp_max, you will include the point (fp_max,tp_max) already. And as long as you have a row with (0,0) for tp and fp in dat, you'll also get the (0,0) point.
Sort dat$tp and use that to sort dat$fp too:
plot(tp ~ fp, ..., type='n')
# sort dat$tp
obj <- sort(dat$fp,index.return=T)
# use obj$x as tp and obj$ix to sort dat$fp prior to plotting
with(dat,
lines(c(0, obj$x, fp_max), c(0, tp[obj$ix], tp_max),
lty=1, type = "l", col = "black"))
#Get order of rows
idx <- order(dat$tp)
#Select data in sorted order
sorted <- dat[idx,]

Generate multiple serial graphs/scatterplots from data in two dataframes

I have 2 dataframes, Tg and Pf, each of 127 columns. All columns have at least one row and can have up to thousands of them. All the values are between 0 and 1 and there are some missing values (empty cells). Here is a little subset:
Tg
Tg1 Tg2 Tg3 ... Tg127
0.9 0.5 0.4 0
0.9 0.3 0.6 0
0.4 0.6 0.6 0.3
0.1 0.7 0.6 0.4
0.1 0.8
0.3 0.9
0.9
0.6
0.1
Pf
Pf1 Pf2 Pf3 ...Pf127
0.9 0.5 0.4 1
0.9 0.3 0.6 0.8
0.6 0.6 0.6 0.7
0.4 0.7 0.6 0.5
0.1 0.6 0.5
0.3
0.3
0.3
Note that some cell are empty and the vector lengths for the same subset (i.e. 1 to 127) can be of very different length and are rarely the same exact length.
I want to generate 127 graph as follow for the 127 vectors (i.e. graph is for col 1 from each dataframe, graph 2 is for col 2 for each dataframe etc...):
Hope that makes sense. I'm looking forward to your assistance as I don't want to make those graphs one by one...
Thanks!
Here is an example to get you started (data at https://gist.github.com/1349300). For further tweaking, check out the excellent ggplot2 documentation that is all over the web.
library(ggplot2)
# Load data
Tg = read.table('Tg.txt', header=T, fill=T, sep=' ')
Pf = read.table('Pf.txt', header=T, fill=T, sep=' ')
# Format data
Tg$x = as.numeric(rownames(Tg))
Tg = melt(Tg, id.vars='x')
Tg$source = 'Tg'
Tg$variable = factor(as.numeric(gsub('Tg(.+)', '\\1', Tg$variable)))
Pf$x = as.numeric(rownames(Pf))
Pf = melt(Pf, id.vars='x')
Pf$source = 'Pf'
Pf$variable = factor(as.numeric(gsub('Pf(.+)', '\\1', Pf$variable)))
# Stack data
data = rbind(Tg, Pf)
# Plot
dev.new(width=5, height=4)
p = ggplot(data=data, aes(x=x)) + geom_line(aes(y=value, group=source, color=source)) + facet_wrap(~variable)
p
Highlighting the area between the lines
First, interpolate the data onto a finer grid. This way the ribbon will follow the actual envelope of the lines, rather than just where the original data points were located.
data = ddply(data, c('variable', 'source'), function(x) data.frame(approx(x$x, x$value, xout=seq(min(x$x), max(x$x), length.out=100))))
names(data)[4] = 'value'
Next, calculate the data needed for geom_ribbon - namely ymax and ymin.
ribbon.data = ddply(data, c('variable', 'x'), summarize, ymin=min(value), ymax=max(value))
Now it is time to plot. Notice how we've added a new ribbon layer, for which we've substituted our new ribbon.data frame.
dev.new(width=5, height=4)
p + geom_ribbon(aes(ymin=ymin, ymax=ymax), alpha=0.3, data=ribbon.data)
Dynamic coloring between the lines
The trickiest variation is if you want the coloring to vary based on the data. For that, you currently must create a new grouping variable to identify the different segments. Here, for example, we might use a function that indicates when the "Tg" group is on top:
GetSegs <- function(x) {
segs = x[x$source=='Tg', ]$value > x[x$source=='Pf', ]$value
segs.rle = rle(segs)
on.top = ifelse(segs, 'Tg', 'Pf')
on.top[is.na(on.top)] = 'Tg'
group = rep.int(1:length(segs.rle$lengths), times=segs.rle$lengths)
group[is.na(segs)] = NA
data.frame(x=unique(x$x), group, on.top)
}
Now we apply it and merge the results back with our original ribbon data.
groups = ddply(data, 'variable', GetSegs)
ribbon.data = join(ribbon.data, groups)
For the plot, the key is that we now specify a grouping aesthetic to the ribbon geom.
dev.new(width=5, height=4)
p + geom_ribbon(aes(ymin=ymin, ymax=ymax, group=group, fill=on.top), alpha=0.3, data=ribbon.data)
Code is available together at: https://gist.github.com/1349300
Here is a three-liner to do the same :-). We first reshape from base to convert the data into long form. Then, it is melted to suit ggplot2. Finally, we generate the plot!
mydf <- reshape(cbind(Tg, Pf), varying = 1:8, direction = 'long', sep = "")
mydf_m <- melt(mydf, id.var = c(1, 4), variable = 'source')
qplot(id, value, colour = source, data = mydf_m, geom = 'line') +
facet_wrap(~ time, ncol = 2)
NOTE. The reshape function in base R is extremely powerful, albeit very confusing to use. It is used to transform data between long and wide formats.
Kudos for automating something you used to do in Excel using R! That's exactly how I got started with R and a common path to R enlightenment :)
All you really need is a little looping. Here's an example, most of which is creating example data that represents your data structure:
## create some example data
Tg <- data.frame(Tg1 = rnorm(10))
for (i in 2:10) {
vec <- rep(NA, 8)
vec <- c(rnorm(sample(5:10,1)), vec)
Tg[paste("Tg", i, sep="")] <- vec[1:10]
}
Pf <- data.frame(Pf1 = rnorm(10))
for (i in 2:10) {
vec <- rep(NA, 8)
vec <- c(rnorm(sample(5:10,1)), vec)
Pf[paste("Pf", i, sep="")] <- vec[1:10]
}
## ok, sample data created
## now lets loop through all the columns
## if you didn't know how many columns there are you could
## use ncol(Tg) to figure out
for (i in 1:10) {
plot(1:10, Tg[,i], type = "l", col="blue", lwd=5, ylim=c(-3,3),
xlim=c(1, max(length(na.omit(Tg[,i])), length(na.omit(Pf[,i])))))
lines(1:10, Pf[,i], type = "l", col="red", lwd=5, ylim=c(-3,3))
dev.copy(png, paste('rplot', i, '.png', sep=""))
dev.off()
}
This will result in 10 graphs in your working directory that look like the following:

Resources