How to do row wise operations on .SD columns in data.table

Although I've figured this out before, I still find myself searching (and unable to find) this syntax on stackoverflow, so...
I want to do row-wise operations on a subset of the data.table's columns, using .SD and .SDcols. I can never remember whether the operations need an sapply, an lapply, or whether they belong inside the brackets with .SD.
As an example, say you have data for 10 students over two quarters. In both quarters they have two exams and a final exam. How would you take a straight average of the columns starting with q1?
Since overly trivial examples are annoying, I'd also like to calculate a weighted average for the columns starting with q2 (weights = 25%, 25%, and 50%).
library(data.table)
set.seed(10)
dt <- data.table(id = paste0("student_", sprintf("%02.f" , 1:10)),
q1_exam1 = round(rnorm(10, .78, .05), 2),
q1_exam2 = round(rnorm(10, .68, .02), 2),
q1_final = round(rnorm(10, .88, .08), 2),
q2_exam1 = round(rnorm(10, .78, .05), 2),
q2_exam2 = round(rnorm(10, .68, .10), 2),
q2_final = round(rnorm(10, .88, .04), 2))
dt
# > dt
# id q1_exam1 q1_exam2 q1_final q2_exam1 q2_exam2 q2_final
# 1: student_01 0.78 0.70 0.83 0.69 0.79 0.86
# 2: student_02 0.77 0.70 0.71 0.78 0.60 0.87
# 3: student_03 0.71 0.68 0.83 0.83 0.60 0.93
# 4: student_04 0.75 0.70 0.71 0.79 0.76 0.97
# 5: student_05 0.79 0.69 0.78 0.71 0.58 0.90
# 6: student_06 0.80 0.68 0.85 0.71 0.68 0.91
# 7: student_07 0.72 0.66 0.82 0.80 0.70 0.84
# 8: student_08 0.76 0.68 0.81 0.69 0.65 0.90
# 9: student_09 0.70 0.70 0.87 0.76 0.61 0.85
# 10: student_10 0.77 0.69 0.86 0.75 0.75 0.89

Here are a few thoughts on your options, largely gathered from the comments:
apply along rows
The OP's approach uses apply(., 1, .) for the by-row operation, but this is discouraged because it unnecessarily coerces the data.table into a matrix. lapply/sapply are also not suitable, since they are designed to work on each column separately, not to combine them.
rowMeans and similarly-named functions also coerce to a matrix.
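If the temporary coercion is acceptable for a plain average, rowMeans can still be used directly in j; a sketch (not from the original comments):
# straight average of the q1 columns; .SD is coerced to a matrix internally by rowMeans
dt[, q1_AVG := rowMeans(.SD), .SDcols = grep("^q1", names(dt))]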
Split by rows
As @Jaap said, you can use by=1:nrow(dt) for any rowwise operation, but it may be slow.
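For example, a sketch of the by-row version for the q1 average (same result as the approaches below, just slower on large data):
# each "group" is a single row, so .SD is a one-row data.table of the q1 columns
dt[, q1_AVG := mean(unlist(.SD)), .SDcols = grep("^q1", names(dt)), by = 1:nrow(dt)]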
Efficiently create new columns
This approach taken from eddi is probably the most efficient if you must keep your data in wide format:
jwts = list(
  q1_AVG  = c(q1_exam1 = 1,   q1_exam2 = 1,   q1_final = 1)/3,
  q2_WAVG = c(q2_exam1 = 1/4, q2_exam2 = 1/4, q2_final = 1/2)
)
for (newj in names(jwts)){
  w = jwts[[newj]]
  dt[, (newj) := Reduce("+", lapply(names(w), function(x) dt[[x]] * w[x]))]
}
This avoids coercion to a matrix and allows for different weighting rules (unlike rowMeans).
Go long
As @alexis_laz suggested, you might gain clarity and efficiency with a different structure, like
# reshape
m = melt(dt, id.vars="id", value.name="score")[,
c("quarter","exam") := tstrsplit(variable, "_")][, variable := NULL]
# input your weighting rules
w = unique(m[,c("quarter","exam")])
w[quarter=="q1" , wt := 1/.N]
w[quarter=="q2" & exam=="final", wt := .5]
w[quarter=="q2" & exam!="final", wt := (1-.5)/.N]
# merge and compute
m[w, on=c("quarter","exam")][, sum(score*wt), by=.(id,quarter)]
This is what I would do.
In any case, you should have your weighting rules stored somewhere explicitly rather than entered on the fly if you want to scale up the number of quarters.
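If the final deliverable still needs to be wide, the long summary casts straight back; a sketch (res and the column name avg are just names I introduce here, not from the original answer):
res = m[w, on=c("quarter","exam")][, .(avg = sum(score*wt)), by=.(id, quarter)]
dcast(res, id ~ quarter, value.var = "avg")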

In this case it is possible to use the apply function in base R, but that's not taking advantage of the data.table framework. Also, it doesn't generalize because there are cases which will require more conditional checking.
apply(dt[ , .SD, .SDcols = grep("^q1", colnames(dt))], 1, mean)
# > apply(dt[ , .SD, .SDcols = grep("^q1", colnames(dt))], 1, mean)
# [1] 0.7700000 0.7266667 0.7400000 0.7200000 0.7533333 0.7766667 0.7333333 0.7500000 0.7566667 0.7733333
In this case, again it's possible to put apply into the j argument of the data.table, and use it on the .SD columns:
dt[i = TRUE,
q1_AVG := round(apply(.SD, 1, mean), 2),
.SDcols = grep("^q1", colnames(dt))]
dt
# > dt
# id q1_exam1 q1_exam2 q1_final q2_exam1 q2_exam2 q2_final q1_AVG
# 1: student_01 0.78 0.70 0.83 0.69 0.79 0.86 0.77
# 2: student_02 0.77 0.70 0.71 0.78 0.60 0.87 0.73
# 3: student_03 0.71 0.68 0.83 0.83 0.60 0.93 0.74
# 4: student_04 0.75 0.70 0.71 0.79 0.76 0.97 0.72
# 5: student_05 0.79 0.69 0.78 0.71 0.58 0.90 0.75
# 6: student_06 0.80 0.68 0.85 0.71 0.68 0.91 0.78
# 7: student_07 0.72 0.66 0.82 0.80 0.70 0.84 0.73
# 8: student_08 0.76 0.68 0.81 0.69 0.65 0.90 0.75
# 9: student_09 0.70 0.70 0.87 0.76 0.61 0.85 0.76
# 10: student_10 0.77 0.69 0.86 0.75 0.75 0.89 0.77
The weighted average can be calculated using matrix multiplication:
dt[i = TRUE,
q2_WAVG := round(as.matrix(.SD) %*% c(.25, .25, .50), 2),
.SDcols = grep("^q2", colnames(dt))]
dt
# > dt
# id q1_exam1 q1_exam2 q1_final q2_exam1 q2_exam2 q2_final q1_AVG q2_WAVG
# 1: student_01 0.78 0.70 0.83 0.69 0.79 0.86 0.77 0.80
# 2: student_02 0.77 0.70 0.71 0.78 0.60 0.87 0.73 0.78
# 3: student_03 0.71 0.68 0.83 0.83 0.60 0.93 0.74 0.82
# 4: student_04 0.75 0.70 0.71 0.79 0.76 0.97 0.72 0.87
# 5: student_05 0.79 0.69 0.78 0.71 0.58 0.90 0.75 0.77
# 6: student_06 0.80 0.68 0.85 0.71 0.68 0.91 0.78 0.80
# 7: student_07 0.72 0.66 0.82 0.80 0.70 0.84 0.73 0.80
# 8: student_08 0.76 0.68 0.81 0.69 0.65 0.90 0.75 0.78
# 9: student_09 0.70 0.70 0.87 0.76 0.61 0.85 0.76 0.77
# 10: student_10 0.77 0.69 0.86 0.75 0.75 0.89 0.77 0.82

Related

'x' and 'y' lengths differ in custom entropy function

I am trying to learn R and I am having problems with the way it works. I tried to make an entropy function of p and 1-p from scratch, and I run into problems when I add some ifs to avoid the NaN values at the endpoints (p = 0 and p = 1).
When I plot the custom entropy without the ifs, it works, but it shows NaN when I print the results. When I add the ifs, it says:
Error in xy.coords(x, y, xlabel, ylabel, log) :
'x' and 'y' lengths differ
entropy <- function(p){
cat("p = " , p)
if (p==0 || p==1) {
result = 0
}else{
result = - p*log2(p)-(1-p)*log2((1-p))
}
cat("\nresult=",result)
return(result)
}
p <- seq(0,1,0.01)
plot(p, entropy(p), type='l', main='Funcion entropia con dos valores posibles')
I don't understand it, since I am plotting a vector as x and the function applied to that vector as y, so the lengths should be the same with and without the ifs.
Console without the ifs:
p = 0 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 0.1 0.11 0.12 0.13 0.14 0.15 0.16 0.17 0.18 0.19 0.2 0.21 0.22 0.23 0.24 0.25 0.26 0.27 0.28 0.29 0.3 0.31 0.32 0.33 0.34 0.35 0.36 0.37 0.38 0.39 0.4 0.41 0.42 0.43 0.44 0.45 0.46 0.47 0.48 0.49 0.5 0.51 0.52 0.53 0.54 0.55 0.56 0.57 0.58 0.59 0.6 0.61 0.62 0.63 0.64 0.65 0.66 0.67 0.68 0.69 0.7 0.71 0.72 0.73 0.74 0.75 0.76 0.77 0.78 0.79 0.8 0.81 0.82 0.83 0.84 0.85 0.86 0.87 0.88 0.89 0.9 0.91 0.92 0.93 0.94 0.95 0.96 0.97 0.98 0.99 1
result= NaN 0.08079314 0.1414405 0.1943919 0.2422922 0.286397 0.3274449 0.3659237 0.4021792 0.4364698 0.4689956 0.499916 0.5293609 0.5574382 0.5842388 0.6098403 0.6343096 0.6577048 0.680077 0.7014715 0.7219281 0.7414827 0.7601675 0.7780113 0.7950403 0.8112781 0.8267464 0.8414646 0.8554508 0.8687212 0.8812909 0.8931735 0.9043815 0.9149264 0.9248187 0.9340681 0.9426832 0.9506721 0.958042 0.9647995 0.9709506 0.9765005 0.9814539 0.985815 0.9895875 0.9927745 0.9953784 0.9974016 0.9988455 0.9997114 1 0.9997114 0.9988455 0.9974016 0.9953784 0.9927745 0.9895875 0.985815 0.9814539 0.9765005 0.9709506 0.9647995 0.958042 0.9506721 0.9426832 0.9340681 0.9248187 0.9149264 0.9043815 0.8931735 0.8812909 0.8687212 0.8554508 0.8414646 0.8267464 0.8112781 0.7950403 0.7780113 0.7601675 0.7414827 0.7219281 0.7014715 0.680077 0.6577048 0.6343096 0.6098403 0.5842388 0.5574382 0.5293609 0.499916 0.4689956 0.4364698 0.4021792 0.3659237 0.3274449 0.286397 0.2422922 0.1943919 0.1414405 0.08079314 NaN
Console with the ifs:
p = 0 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 0.1 0.11 0.12 0.13 0.14 0.15 0.16 0.17 0.18 0.19 0.2 0.21 0.22 0.23 0.24 0.25 0.26 0.27 0.28 0.29 0.3 0.31 0.32 0.33 0.34 0.35 0.36 0.37 0.38 0.39 0.4 0.41 0.42 0.43 0.44 0.45 0.46 0.47 0.48 0.49 0.5 0.51 0.52 0.53 0.54 0.55 0.56 0.57 0.58 0.59 0.6 0.61 0.62 0.63 0.64 0.65 0.66 0.67 0.68 0.69 0.7 0.71 0.72 0.73 0.74 0.75 0.76 0.77 0.78 0.79 0.8 0.81 0.82 0.83 0.84 0.85 0.86 0.87 0.88 0.89 0.9 0.91 0.92 0.93 0.94 0.95 0.96 0.97 0.98 0.99 1
result= 0Error in xy.coords(x, y, xlabel, ylabel, log) :
'x' and 'y' lengths differ
You did not create a vector but a scalar, since you did not use vectorized operations in your if/else clause. The result of your function is just one number.
This should work:
entropy <- function(p){
# initialize a vector of the desired length with zeros
result <- numeric(length(p))
# subset the vector for which you want to apply your formula on
x <- p[!(p %in% c(0,1))]
# overwrite only those positions for which you want to calculate values based
# on your formula
result[!(p %in% c(0,1))] <- - x*log2(x)-(1-x)*log2((1-x))
#cat("\nresult=",result)
return(result)
}
p <- seq(0,1,0.01)
plot(p, entropy(p), type='l', main='Funcion entropia con dos valores posibles')
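An alternative that keeps the scalar-style formula but stays vectorized is ifelse(); this is my sketch, not from the original answer. The formula is evaluated for the whole vector, and the NaN positions at 0 and 1 simply take the 0 from the first branch:
entropy <- function(p){
  # the second branch yields NaN at p = 0 and p = 1, but ifelse() returns 0 there instead
  ifelse(p == 0 | p == 1, 0, -p*log2(p) - (1-p)*log2(1-p))
}
plot(p, entropy(p), type='l', main='Funcion entropia con dos valores posibles')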
EDIT:
Even though it was suggested that I vectorize it, I wanted to do it in a way similar to the other languages I know for now, since I am just starting. I was able to fix it, although I ended up using a for loop and plotting two vectors instead of the function itself.
entropy <- function(p){
if (p==0 || p==1) {
result = 0
}else{
result = - p*log2(p)-(1-p)*log2((1-p))
}
return(result)
}
x <- seq(0,1,0.01)
y <- numeric(length(x))
i = 1
for (p in x) {
y[i] = entropy(p)
cat(x[i],"=",y[i],"\n")
i=i+1
}
plot(x, y, type='l', main='Funcion entropia con dos valores posibles')
I just applied your entropy function to each value of the p vector with sapply before plotting it.
entropy <- function(p){
cat("p = " , p)
if (p==0 || p==1) {
result = 0
}else{
result = - p*log2(p)-(1-p)*log2((1-p))
}
cat("\nresult=",result)
return(result)
}
p <- seq(0,1,0.01)
# Apply the function over all the values of 'p'
entropy_p <- sapply(p,FUN = entropy)
plot(p, entropy_p, type='l', main='Funcion entropia con dos valores posibles')
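For completeness, base R's Vectorize() offers a similar route: it wraps a scalar function so that it maps over a whole vector. A sketch (the cat() calls in entropy will still print once per element unless you drop them):
entropy_vec <- Vectorize(entropy)  # uses mapply under the hood
plot(p, entropy_vec(p), type='l', main='Funcion entropia con dos valores posibles')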

How to find and replace min value in dataframe with text in r

I have a data frame with 20 columns and I would like to identify the minimum value in each column and replace it with text such as "min". Appreciate any help.
sample data :
a b c
-0.05 0.31 0.62
0.78 0.25 -0.01
0.68 0.33 -0.04
-0.01 0.30 0.56
0.55 0.28 -0.03
Desired output
a b c
min 0.31 0.62
0.78 min -0.01
0.68 0.33 min
-0.01 0.30 0.56
0.55 0.28 -0.03
You can apply a function to each column that replaces the minimum value with a string. This returns a matrix, which could be converted into a data frame if desired. As IceCreamToucan pointed out, all values will be of type character, since a matrix can only hold a single type:
apply(df, 2, function(x) {
x[x == min(x)] <- 'min'
return(x)
})
a b c
[1,] "min" "0.31" "0.62"
[2,] "0.78" "min" "-0.01"
[3,] "0.68" "0.33" "min"
[4,] "-0.01" "0.3" "0.56"
[5,] "0.55" "0.28" "-0.03"
You can use the method below, but know that this converts all your columns to character, since vectors must have elements which all have the same type.
library(dplyr)
df %>%
mutate_all(~ replace(.x, which.min(.x), 'min'))
# a b c
# 1 min 0.31 0.62
# 2 0.78 min -0.01
# 3 0.68 0.33 min
# 4 -0.01 0.3 0.56
# 5 0.55 0.28 -0.03
apply(df, MARGIN = 2, FUN = function(x) {x[which.min(x)] <- 'min'; return(x)})

reshape unique strings in rows into columns in R

I would like to reshape my data based on the unique strings in the "Bulls" column (the data frame is called all):
EBV Bulls
0.13 NE001362
0.17 NE001361
0.05 NE001378
-0.12 NE001359
-0.14 NE001379
0.13 NE001380
-0.46 NE001379
-0.46 NE001359
-0.68 NE001394
0.28 NE001391
0.84 NE001394
-0.43 NE001393
-0.18 NE001707
My expected output:
NE001362 NE001361 NE001378 NE001359 NE001379 NE001380 NE001394 NE001391 NE001393 NE001707
0.13 0.17 0.05 -0.12 -0.14 0.13 -0.68 0.28 -0.43 -0.18
-0.46 -0.46 0.84
I tried dat2 <- dcast(all, EBV~variable, value.var = "Bulls") but it does not work.
You have two options: indexing the multiple occurrences for each level of Bulls, or using a list to hold the different values of EBV.
Option 1: Indexing multiple occurrences
You can use data.table to generate an index that numbers multiple occurrences of EBV:
require(data.table)
setDT(all) ## convert to data.table
all[, index:=1:.N, by=Bulls] ## generate index
dcast.data.table(all, formula=index ~ Bulls, value.var='EBV')
Option 2: Using a list to store multiple values
You could use a list as a value with data.table (I'm not sure if plain data.frame supports it).
require(data.table)
setDT(all) ## convert to data.table
all[, list(list(EBV)), by=Bulls] ## multiple values stored as list
Just to make sure that base R gets some acknowledgement:
## Add an ID, like ilir did, but with base R functions
mydf$ID <- with(mydf, ave(rep(1, nrow(mydf)), Bulls, FUN = seq_along))
Here's reshape:
reshape(mydf, direction = "wide", idvar="ID", timevar="Bulls")
# ID EBV.NE001362 EBV.NE001361 EBV.NE001378 EBV.NE001359 EBV.NE001379
# 1 1 0.13 0.17 0.05 -0.12 -0.14
# 7 2 NA NA NA -0.46 -0.46
# EBV.NE001380 EBV.NE001394 EBV.NE001391 EBV.NE001393 EBV.NE001707
# 1 0.13 -0.68 0.28 -0.43 -0.18
# 7 NA 0.84 NA NA NA
And xtabs. Note: This is a table-like matrix, so if you want a data.frame, you'll have to use as.data.frame.matrix on the output.
xtabs(EBV ~ ID + Bulls, mydf)
# Bulls
# ID NE001359 NE001361 NE001362 NE001378 NE001379 NE001380 NE001391
# 1 -0.12 0.17 0.13 0.05 -0.14 0.13 0.28
# 2 -0.46 0.00 0.00 0.00 -0.46 0.00 0.00
# Bulls
# ID NE001393 NE001394 NE001707
# 1 -0.43 -0.68 -0.18
# 2 0.00 0.84 0.00
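As mentioned, the xtabs result is a table-like matrix; converting it is one more step (note that combinations absent from the data show up as 0 rather than NA):
as.data.frame.matrix(xtabs(EBV ~ ID + Bulls, mydf))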

Fast(er) way of indexing matrix in R

Foremost, I am looking for a fast(er) way of subsetting/indexing a matrix many, many times over:
for (i in 1:99000) {
subset.data <- data[index[, i], ]
}
Background:
I'm implementing a sequential testing procedure involving the bootstrap in R. Wanting to replicate some simulation results, I came upon
this bottleneck where lots of indexing needs to be done. For implementation of the block-bootstrap I created an index matrix with which I subset
the original data matrix to draw resamples of the data.
# The basic setup
B <- 1000 # no. of bootstrap replications
n <- 250 # no. of observations
m <- 100 # no. of models/data series
# Create index matrix with B columns and n rows.
# Each column represents a resampling of the data.
# (actually block resamples, but doesn't matter here).
boot.index <- matrix(sample(1:n, n * B, replace=T), nrow=n, ncol=B)
# Make matrix with m data series of length n.
sample.data <- matrix(rnorm(n * m), nrow=n, ncol=m)
subsetMatrix <- function(data, index) { # fn definition for timing
subset.data <- data[index, ]
return(subset.data)
}
# check how long it takes.
Rprof("subsetMatrix.out")
for (i in 1:(m - 1)) {
for (b in 1:B) { # B * (m - 1) = 1000 * 99 = 99000
boot.data <- subsetMatrix(sample.data, boot.index[, b])
# do some other stuff
}
# do some more stuff
}
Rprof()
summaryRprof("subsetMatrix.out")
# > summaryRprof("subsetMatrix.out")
# $by.self
# self.time self.pct total.time total.pct
# subsetMatrix 9.96 100 9.96 100
# In the actual application:
#########
# > summaryRprof("seq_testing.out")
# $by.self
# self.time self.pct total.time total.pct
# subsetMatrix 6.78 53.98 6.78 53.98
# colMeans 1.98 15.76 2.20 17.52
# makeIndex 1.08 8.60 2.12 16.88
# makeStats 0.66 5.25 9.66 76.91
# runif 0.60 4.78 0.72 5.73
# apply 0.30 2.39 0.42 3.34
# is.data.frame 0.22 1.75 0.22 1.75
# ceiling 0.18 1.43 0.18 1.43
# aperm.default 0.14 1.11 0.14 1.11
# array 0.12 0.96 0.12 0.96
# estimateMCS 0.10 0.80 12.56 100.00
# as.vector 0.10 0.80 0.10 0.80
# matrix 0.08 0.64 0.08 0.64
# lapply 0.06 0.48 0.06 0.48
# / 0.04 0.32 0.04 0.32
# : 0.04 0.32 0.04 0.32
# rowSums 0.04 0.32 0.04 0.32
# - 0.02 0.16 0.02 0.16
# > 0.02 0.16 0.02 0.16
#
# $by.total
# total.time total.pct self.time self.pct
# estimateMCS 12.56 100.00 0.10 0.80
# makeStats 9.66 76.91 0.66 5.25
# subsetMatrix 6.78 53.98 6.78 53.98
# colMeans 2.20 17.52 1.98 15.76
# makeIndex 2.12 16.88 1.08 8.60
# runif 0.72 5.73 0.60 4.78
# doTest 0.68 5.41 0.00 0.00
# apply 0.42 3.34 0.30 2.39
# aperm 0.26 2.07 0.00 0.00
# is.data.frame 0.22 1.75 0.22 1.75
# sweep 0.20 1.59 0.00 0.00
# ceiling 0.18 1.43 0.18 1.43
# aperm.default 0.14 1.11 0.14 1.11
# array 0.12 0.96 0.12 0.96
# as.vector 0.10 0.80 0.10 0.80
# matrix 0.08 0.64 0.08 0.64
# lapply 0.06 0.48 0.06 0.48
# unlist 0.06 0.48 0.00 0.00
# / 0.04 0.32 0.04 0.32
# : 0.04 0.32 0.04 0.32
# rowSums 0.04 0.32 0.04 0.32
# - 0.02 0.16 0.02 0.16
# > 0.02 0.16 0.02 0.16
# mean 0.02 0.16 0.00 0.00
#
# $sample.interval
# [1] 0.02
#
# $sampling.time
# [1] 12.56'
Doing the sequential testing procedure once takes about 10 seconds. Using this in simulations with 2500 replications and several parameter constellations, it would take something like 40 days. Using parallel processing and more CPU power it's possible to do this faster, but it's still not very pleasing :/
Is there a better way to resample the data / get rid of the loop?
Can apply, Vectorize, replicate etc. come in anywhere?
Would it make sense to implement the subsetting in C (e.g. manipulate some pointers)?
Even though every single step is already done incredibly fast by R, it's just not quite fast enough.
I'd be very glad indeed for any kind of response/help/advice!
related Qs:
- Fast matrix subsetting via '[': by rows, by columns or doesn't matter?
- fast function for generating bootstrap samples in matrix forms in R
- random sampling - matrix
From those questions, suggestions like
mapply(function(row) return(sample.data[row,]), row = boot.index)
replicate(B, apply(sample.data, 2, sample, replace = TRUE))
didn't really do it for me.
I rewrote makeStats and makeIndex as they were two of the biggest bottlenecks:
makeStats <- function(data, index) {
data.mean <- colMeans(data)
m <- nrow(data)
n <- ncol(index)
tabs <- lapply(1L:n, function(j)tabulate(index[, j], nbins = m))
weights <- matrix(unlist(tabs), m, n) * (1 / nrow(index))
boot.data.mean <- t(data) %*% weights - data.mean
return(list(data.mean = data.mean,
boot.data.mean = boot.data.mean))
}
makeIndex <- function(B, blocks){
n <- ncol(blocks)
l <- nrow(blocks)
z <- ceiling(n/l)
start.points <- sample.int(n, z * B, replace = TRUE)
index <- blocks[, start.points]
keep <- c(rep(TRUE, n), rep(FALSE, z*l - n))
boot.index <- matrix(as.vector(index)[keep],
nrow = n, ncol = B)
return(boot.index)
}
This brought down the computation times from 28 to 6 seconds on my machine. I bet there are other parts of the code that can be improved (including my use of lapply/tabulate above.)
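The trick in makeStats is worth spelling out: instead of materializing data[index, ] for every bootstrap replication, it tabulates how often each row appears in a resample and turns that into a weight vector, so the bootstrap column means come out of a single matrix product. A quick sanity check against the naive subsetting (my sketch, using the boot.index and sample.data objects from the question, not part of the original answer):
# weights = how often each row is drawn, scaled to sum to 1
w1 <- tabulate(boot.index[, 1], nbins = nrow(sample.data)) / nrow(boot.index)
# column means of the resampled data, computed without ever building sample.data[boot.index[, 1], ]
all.equal(as.vector(t(sample.data) %*% w1),
          colMeans(sample.data[boot.index[, 1], ]))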

Multiple boxplots with predefined statistics using lattice-like graphs in r

I have a dataset which looks like this
VegType 87MIN 87MAX 87Q25 87Q50 87Q75 96MIN 96MAX 96Q25 96Q50 96Q75 00MIN 00MAX 00Q25 00Q50 00Q75
1 0.02 0.32 0.11 0.12 0.13 0.02 0.26 0.08 0.09 0.10 0.02 0.28 0.10 0.11 0.12
2 0.02 0.45 0.12 0.13 0.13 0.02 0.20 0.09 0.10 0.11 0.02 0.26 0.11 0.12 0.12
3 0.02 0.29 0.13 0.14 0.14 0.02 0.27 0.11 0.11 0.12 0.02 0.26 0.12 0.13 0.13
4 0.02 0.41 0.13 0.13 0.14 0.02 0.58 0.10 0.11 0.12 0.02 0.34 0.12 0.13 0.13
5 0.02 0.42 0.12 0.13 0.14 0.02 0.46 0.10 0.11 0.11 0.02 0.28 0.12 0.12 0.13
6 0.02 0.32 0.13 0.14 0.14 0.02 0.52 0.12 0.12 0.13 0.02 0.29 0.13 0.14 0.14
7 0.02 0.55 0.12 0.13 0.14 0.02 0.24 0.10 0.11 0.11 0.02 0.37 0.12 0.12 0.13
8 0.02 0.55 0.12 0.13 0.14 0.02 0.19 0.10 0.11 0.12 0.02 0.22 0.11 0.12 0.13
In reality I have 26 variables and 5 years (87, 96 and 00 in the column names are years). In an ideal world I would like to have a lattice-like graph with 26 plots, one per variable, with each plot containing 5 boxes, i.e. one per year. I understand that it is not possible to do this in lattice because lattice won't accept predefined statistics. Is there a fairly unpainful way to do this in R with predefined stats? I have used bxp for simple boxplots plotting all the variables for one year in a single plot, e.g.
Yr01 = read.csv('dat.csv',header=T)
dat01=t(Yr01[,c("01Min","01Q25","01Mean","01Q75","01Max")])
bxp(list(stats=dat01, n=rep(26, ncol(dat01))),ylim=c(0.07,0.2))
but I don't know how to go from there to what I need.
Thanks.
This can be done, at least using ggplot2, but you'll have to reshape your data quite a bit. And you really have to have data where the quantiles actually make sense! Your quantile values are all messed up: for example, Var1 has 01Max = 0.26 and 01Q75 = .67!
First, I'll recreate valid data:
n <- c("01Min", "01Max", "01Med", "01Q25", "01Q75", "02Min",
"02Max", "02Med", "02Q25", "02Q75")
v1 <- c(0.03, 0.76, 0.41, 0.13, 0.67, 0.10, 0.43, 0.27, 0.2, 0.33)
v2 <- c(0.03, 0.28, 0.14, 0.08, 0.20, 0.02, 0.77, 0.13, 0.06, 0.44)
df <- data.frame(v1=v1, v2=v2)
df <- as.data.frame(t(df))
names(df) <- n
df <- cbind(var=c("v1","v2"), df)
> df
# var 01Min 01Max 01Med 01Q25 01Q75 02Min 02Max 02Med 02Q25 02Q75
# v1 v1 0.03 0.76 0.41 0.13 0.67 0.10 0.43 0.27 0.20 0.33
# v2 v2 0.03 0.28 0.14 0.08 0.20 0.02 0.77 0.13 0.06 0.44
Next, we'll reshape the data:
require(reshape2)
df.m <- melt(df, id="var")
# look for a bunch of numbers from the start of the string and capture it
# in the first variable: () captures the pattern. And replace it with the
# captured pattern with the variable "\\1"
df.m$year <- gsub("^([0-9]+)(.*$)", "\\1", df.m$variable)
# the same but instead refer to the captured pattern in the second
# paranthesis using "\\2"
df.m$quan <- gsub("^([0-9]+)(.*)$", "\\2", df.m$variable)
df.f <- dcast(df.m, var+year ~ quan, value.var="value")
To get to this format:
> df.f
# var year Max Med Min Q25 Q75
# 1 v1 01 0.76 0.41 0.03 0.13 0.67
# 2 v1 02 0.43 0.27 0.10 0.20 0.33
# 3 v2 01 0.28 0.14 0.03 0.08 0.20
# 4 v2 02 0.77 0.13 0.02 0.06 0.44
Now, we can plot by directly providing the quantile values to corresponding parameters using the corresponding column names as follows:
require(ggplot2)
require(scales)
p <- ggplot(df.f, aes(x=var, ymin=`Min`, lower=`Q25`, middle=`Med`,
upper=`Q75`, ymax=`Max`))
p <- p + geom_boxplot(aes(fill=year), stat="identity")
p
# if you want facetting:
p + facet_wrap( ~ var, scales="free")
You can now accomplish your task of plotting all years for each var in a separate plot using a lapply with this code and subsetting as follows:
lapply(levels(df.f$var), function(x) {
  p <- ggplot(df.f[df.f$var == x, ],
              aes(x=var, ymin=`Min`, lower=`Q25`,
                  middle=`Med`, upper=`Q75`, ymax=`Max`))
  p <- p + geom_boxplot(aes(fill=year), stat="identity")
  ggsave(paste0(x, ".pdf"), plot = p)
})
Edit: Your data is different from the earlier data you provided in some aspects. So, here's the version of the code for your new data:
# change var to VegType everywhere
require(reshape2)
df.m <- melt(df, id="VegType")
df.m$year <- gsub("^X([0-9]+)(.*$)", "\\1", df.m$variable) # pattern has a X
df.m$quan <- gsub("^X([0-9]+)(.*)$", "\\2", df.m$variable) # pattern has a X
df.f <- dcast(df.m, VegType+year ~ quan, value.var="value")
df.f$VegType <- factor(df.f$VegType) # convert integer to factor
require(ggplot2)
require(scales)
p <- ggplot(df.f, aes(x=VegType, ymin=`MIN`, lower=`Q25`, middle=`Q50`,
upper=`Q75`, ymax=`MAX`))
p <- p + geom_boxplot(aes(fill=year), stat="identity")
p
You can facet or write separate plots using the same code as before.
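For completeness, the per-plot export from earlier carries over to the new column names; a sketch assuming df.f as built just above (with MIN/MAX/Q25/Q50/Q75 columns and VegType converted to a factor):
lapply(levels(df.f$VegType), function(x) {
  # one boxplot per year for a single vegetation type, saved to its own file
  p <- ggplot(df.f[df.f$VegType == x, ],
              aes(x=VegType, ymin=`MIN`, lower=`Q25`,
                  middle=`Q50`, upper=`Q75`, ymax=`MAX`))
  p <- p + geom_boxplot(aes(fill=year), stat="identity")
  ggsave(paste0("VegType_", x, ".pdf"), plot = p)
})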
