How to create a multi-dimensional barchart in R

I'm trying to create a bar chart with lattice, which has two groupings. The first grouping is stacked, whereas the second is not. For example:
a <- factor(rep(c(1,2), times = 6))
b <- factor(rep(c(1,2,3), times = 4))
c <- factor(rep(c(1,2,3,4), times = 3))
d <- factor(rep(c("true", "false"), each = 6))
e <- factor(rep(c("yes", "no", "may be"), each = 4))
value <- c(5,8,2,4,1,8,9,3,5,6,3,12)
At the moment I'm doing the following:
barchart(value ~ a | b + c,
         groups = d, stack = FALSE,
         auto.key = TRUE,
         scales = list(x = "free"))
This results in one panel for each combination of the levels of b and c, each containing a group of bars for each level of a. Each group has a bar for "true" and a bar for "false". What I would also like to add is the stacking by e, so that each "true" bar is divided into three sections: the bottom one for "yes", then "no", then "may be", and the same for the "false" bar.
I realise that the graph will be quite complex, but it is the best way to represent the data I have. Adding e to the formula, as in b + c + e, is not an option: I already have a set of plots in this format and they need to stay comparable with each other. On the other hand, having six bars in each group would make the chart much harder to read.
Thanks!

ggplot2 will do the job relatively easily, if using lattice isn't a hard requirement for you. I took the liberty of expanding your data set so that all of the combinations of a, b, c, d, and e would be present.
# Load required packages
require(ggplot2)
require(plyr)
# Make factors with the same levels as in the original post
# but 100x longer, and in random order so all combinations are present
a <- sample(factor(rep(c(1,2), times = 600)))
b <- sample(factor(rep(c(1,2,3), times = 400)))
c <- sample(factor(rep(c(1,2,3,4), times = 300)))
d <- sample(factor(rep(c("true", "false"), each = 600)))
e <- sample(factor(rep(c("yes", "no", "may be"), each = 400)))
value <- runif(1200)
# Put them in a data frame
df <- data.frame(a=a, b=b, c=c, d=d, e=e, value=value)
# Calculate the sum of the value columns for each unique combination of a, b, c, d, and e
# I think this is what you'd like - am not totally sure
ds <- ddply(df, c("a", "b", "c", "d", "e"), summarise, sum.value = sum(value, na.rm = TRUE))
# Make the plot
ggplot(ds, aes(x = d, y = sum.value, fill = e)) +
  geom_bar(stat = "identity") +
  facet_grid(a ~ b + c) +
  theme(axis.text.x = element_text(angle = -90))
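If you would rather avoid the plyr dependency, the per-combination sums can also be computed with base aggregate(); a small sketch that should give the same ds for this data:
ds <- aggregate(value ~ a + b + c + d + e, data = df, FUN = sum)
names(ds)[names(ds) == "value"] <- "sum.value"
The resulting ds can then be passed to the same ggplot() call.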

Related

Avoid writing large number of column names in a model formula with bs() terms

I want to use the bs() function for the numerical variables in my dataset when fitting a logistic regression model.
df <- data.frame(a = c(0,1), b = c(0,1), d = c(0,1), e = c(0,1),
                 f = c("m","f"), output = c(0,1))
library(splines)
model <- glm(output ~ bs(a, df = 2) + bs(b, df = 2) + bs(d, df = 2) + bs(e, df = 2) +
               factor(f),
             data = df,
             family = "binomial")
In my actual dataset, I need to apply bs() to way more columns than this example. Is there a way I can do this without writing all the terms?
We can use some string manipulation with sprintf, together with reformulate:
predictors <- c("a", "b", "d", "e")
bspl.terms <- sprintf("bs(%s, df = 2)", predictors)
other.terms <- "factor(f)"
form <- reformulate(c(bspl.terms, other.terms), response = "output")
#output ~ bs(a, df = 2) + bs(b, df = 2) + bs(d, df = 2) + bs(e,
# df = 2) + factor(f)
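The constructed formula can then be passed to glm() exactly like a hand-written one (shown with the toy df from the question; in practice you would use your real dataset):
model <- glm(form, data = df, family = "binomial")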
If you want to use a different df and degree for each spline, that is also straightforward (note that df cannot be smaller than degree).
predictors <- c("a", "b", "d", "e")
dof <- c(3, 4, 3, 6)
degree <- c(2, 2, 2, 3)
bspl.terms <- sprintf("bs(%s, df = %d, degree = %d)", predictors, dof, degree)
other.terms <- "factor(f)"
form <- reformulate(c(bspl.terms, other.terms), response = "output")
#output ~ bs(a, df = 3, degree = 2) + bs(b, df = 4, degree = 2) +
# bs(d, df = 3, degree = 2) + bs(e, df = 6, degree = 3) + factor(f)
Prof. Ben Bolker: I was going to suggest something a little bit fancier, something like predictors <- setdiff(names(df)[sapply(df, is.numeric)], "output").
Yes, that is good for safety, and of course an automatic way to include all numerical variables other than "output" as predictors.
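A minimal sketch of that automatic selection, assuming every numeric column except "output" should get a spline term:
predictors <- setdiff(names(df)[sapply(df, is.numeric)], "output")
bspl.terms <- sprintf("bs(%s, df = 2)", predictors)
form <- reformulate(c(bspl.terms, "factor(f)"), response = "output")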

How to create multiple boxplots in one plot by only choosing certain rows from a data frame

What I would like to do is create several boxplots (all displayed in a single plot) from only certain rows of my original data frame.
My data frame looks as follows:
[screenshot of the data frame]
So now I want R to visualise Parameter ~ Station (the parameters are all the variables coloured green and Station is the station id).
Is there a way to tell R that I want all my Parameters on the x-axis ONLY for BB0028 for example, which would mean that I only take the first 6 values of mean_area, mean_area_exc, esd, feret, min and max into account in the boxplot?
That would look like this:
[screenshot of the desired plot]
I tried a very complicated way of adding single boxplots one by one, but I am sure there must be a simpler way.
This is what I tried:
bb28 <- df[c(1:6),]
bb28area <- boxplot(bb28$mean_area ~ bb28$BBnr)
bb28area_exc <- boxplot(bb28$mean_area_exc ~ bb28$BBnr)
bb28esd <- boxplot(bb28$mean_esd ~ bb28$BBnr)
bb28feret <- boxplot(bb28$mean_feret ~ bb28$BBnr)
bb28min <- boxplot(bb28$mean_min ~ bb28$BBnr)
bb28max <- boxplot(bb28$mean_max ~ bb28$BBnr)
boxplot(bb28$mean_area ~ bb28$BBnr)
boxplot(bb28$mean_area_exc ~ bb28$BBnr, add=TRUE, at = 1:1+0.45)
It also doesn't look very nice because the x-axis does not adjust to the added boxplot, which is then cut off:
[screenshot of the cut-off plot]
I hope you can help me with a simple, proper way to get my plot.
Thank you!
Cheers, Merle
Maybe the function multi.boxplot below is what you are looking for. It uses base R only.
Data.
First, make up a dataset, since you have not provided us with one in a copy&paste friendly format.
set.seed(1234)
n <- 50
BBnr <- sort(sprintf("BB%04d", sample(28:30, n, TRUE)))
bb28 <- data.frame(col1 = 1:n, col2 = n:1, BBnr = BBnr)
tmp <- matrix(runif(3*n), ncol = 3)
colnames(tmp) <- paste("mean", c("this", "that", "other"), sep = "_")
bb28 <- cbind(bb28, tmp)
rm(BBnr, tmp)
Code.
multi.boxplot <- function(x, by, col = 0, ...){
  x <- as.data.frame(x)
  uniq.by <- unique(by)
  len <- length(uniq.by) - 1
  n <- ncol(x)
  n1 <- n + 1
  col <- rep(col, n)[seq_len(n)]
  boxplot(x[[1]] ~ by, at = 0:len*n1 + 1,
          xlim = c(0, (len + 1)*n1), ylim = range(unlist(x)),
          xaxt = "n", col = col[1], ...)
  for(i in seq_len(n)[-1])
    boxplot(x[[i]] ~ by, at = 0:len*n1 + i, xaxt = "n", add = TRUE, col = col[i], ...)
  axis(1, at = 0:len*n1 + n1/2, labels = uniq.by, tick = TRUE)
}
inx <- grep("mean", names(bb28))
multi.boxplot(bb28[inx], by = bb28$BBnr, col = rainbow(length(inx)))
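If you prefer a single boxplot() call, a long-format alternative (a sketch using the made-up bb28 data above; long is just an illustrative name) is to stack the selected columns and let the formula split on variable and station:
long <- data.frame(value = unlist(bb28[inx]),
                   variable = rep(names(bb28)[inx], each = nrow(bb28)),
                   BBnr = rep(bb28$BBnr, times = length(inx)))
boxplot(value ~ variable + BBnr, data = long, las = 2,
        col = rainbow(length(inx)))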

ffbase: merge on columns X and Y and closest column Z

I would like to accomplish the following using ffdf: merge on columns X and Y and the closest Time, and then merge on the closest column B. However, the procedure I know for smaller samples involves outer merges (as shown below). What is a way around this for a large sample that won't fit in memory (and probably wouldn't work with sqldf), using ffbase? If that isn't possible, what would be the best library for this?
As a reproducible example:
set.seed(1)
library(ffbase)  # loads ff as well, which provides as.ffdf()
df.ff <- as.ffdf(cbind(expand.grid(x = 1:3, y = 1:5), time = round(runif(15) * 30)))
to.merge.ff <- as.ffdf(data.frame(x = c(2, 2, 2, 3, 2), y = c(1, 1, 1, 5, 4),
                                  time = c(17, 12, 11.6, 22.5, 2), val = letters[1:5],
                                  stringsAsFactors = FALSE))
I borrow the following example from @ChinmayPatil (R - merge dataframes on matching A, B and *closest* C?) to highlight the similar procedure I would like to follow:
require(data.table)
set.seed(1)
df <- setDT(cbind(expand.grid(x = 1:3, y = 1:5), time = round(runif(15) * 30)))
to.merge <- setDT(data.frame(x = c(2, 2, 2, 3, 2), y = c(1, 1, 1, 5, 4), time = c(17, 12, 11.6, 22.5, 2), val = letters[1:5], stringsAsFactors = F))
## First do a left outer merge
A <- merge(to.merge, df, by = c('x','y'), all.x = TRUE)
## Then calculate a diff row as such
A$diff <- abs(A$time.x - A$time.y)
##then take the minimum distance
A[ , .I[which.min(diff)] , by = c('x', 'y' ) ]
Given that my question got so few views and no answers, I will describe the approach I came up with, in the hope that someone might find it useful (or as a reminder for my future self):
To me, the most difficult aspect of performing this match on one column and then a nearest match on another column was that I kept thinking an outer join (as described in the post) was necessary. The solution is pretty simple using data.table and ffdfdply. For the purpose of illustration, assume there is one large ffdf object and one regular data.table that fits in memory:
### Large ffdf object
A <- as.ffdf(data.table( dates.A = seq.Date(as.Date('2008-01-01'),as.Date('2008-01-31'), by = '3 days'),
letters.A = LETTERS[1:4] , value.A = runif(4) ))
### Small data.table that fits in memory
B <- data.table( date.B = seq.Date(as.Date('2008-01-01'),as.Date('2008-01-05'), by = 'days'),
letters.B = LETTERS[1:4] , value.B = runif(4) )
Then you can simply define a function that does the merging using data.table and roll = 'nearest':
merge.ff <- function(x){
  setDT(x)
  x[, ':=' (dates.merge = dates.A, letters.merge = letters.A)]
  B[, ':=' (dates.merge = date.B, letters.merge = letters.B)]
  setkeyv(x, c('letters.merge', 'dates.merge'))
  setkeyv(B, c('letters.merge', 'dates.merge'))
  as.data.frame(B[x, roll = 'nearest'])
}
and apply it to A:
result <- ffdfdply( A, split = A$dates.A, FUN = merge.ff)
The key was essentially using data.table's roll argument and passing the merge function to ffdfdply. It seemed to be quite efficient.
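For reference, without the ff layer the core rolling join on the question's small in-memory example (df and to.merge above, which setDT already turned into data.tables) is just:
setkey(df, x, y, time)
setkey(to.merge, x, y, time)
df[to.merge, roll = "nearest"]  # for each row of to.merge, the nearest time within its (x, y) group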

Is dplyr's left_join the correct way to attach a data.frame to a SpatialPolygonsDataFrame in R?

Merging extra data (frames) to spatial objects in R can be tricky (as explained here, or here)
Searching for a solution on how to correctly do the job I found this SO question listing several methods. dplyr's left_join was not listed there. I spotted it being used in Robin's tutorial.
My question is: is this a correct method to use? Are there any cases (different number of rows? different row names? sorting? etc.) in which this solution would fail?
Here is some reproducible code illustrating the methods I found / came across:
# libraries
library("spdep"); library("sp"); library("dplyr")
# sopatial data
c <- readShapePoly(system.file("etc/shapes/columbus.shp", package="spdep")[1])
m <- c#data
c#data <- subset(c#data, select = c("POLYID", "INC"))
c#data$INC2 <- c#data$INC
c#data$INC <- NULL
ex <- subset(c, c$POLYID <= 2) # polygons with messed up data in merged df
c <- subset(c, c$POLYID < 49) # remove one polygon from shape so that df has one poly too many
# messing up merge data
m <- subset(m, POLYID != 1) # exclude polygon
m <- subset(m, select = c("POLYID", "INC")) # only two vars
rownames(m) <- m$POLYID - 2 # change rownames
m$POLYID[m$POLYID == 2] <- 0 # wrong ID
m <- m[order(m$INC),] # different sort
m$POLYID2 <- m$POLYID # duplicated to check dplyr
# left_join solution
s1 <- c
s1@data <- left_join(s1@data, m)
plot(c)
plot(s1, col = "red", density = 40, angle = 0, add = TRUE)
plot(ex, col= NA, border = "green", add = TRUE)
View(s1@data)
# match solution
s2 <- c
s2@data <- data.frame(s2@data, m[match(s2@data[, "POLYID"], m[, "POLYID"]), ])
plot(c)
plot(s2, col = "red", density = 40, angle = 0, add = TRUE)
plot(ex, col= NA, border = "green", add = TRUE)
View(s2@data)
# sp solution
s3 <- c
s3 <- sp::merge(s3, m, by="POLYID")
plot(c)
plot(s3, col = "red", density = 40, angle = 0, add = TRUE)
plot(ex, col= NA, border = "green", add = TRUE)
View(s3@data)
# inner join solution
s4 <- c
s4@data <- inner_join(s4@data, m)
plot(c)
plot(s4, col = "red", density = 40, angle = 0, add = TRUE)
plot(ex, col= NA, border = "green", add = TRUE)
View(s4@data)
# rebuild solution???
s5 <- c
s5.df <- as(s5, "data.frame")
s5.df1 <- merge(s5.df, m, sort=FALSE, by.x="POLYID", by.y="POLYID", all.x=TRUE, all.y=TRUE)
s51 <- SpatialPolygonsDataFrame(as(s5, "SpatialPolygons"), data=s5.df1)
plot(c)
plot(s51, col = "red", density = 40, angle = 0, add = TRUE)
plot(ex, col= NA, border = "green", add = TRUE)
left_join seems to do the job, same as sp::merge and match (I do hope the order does not get messed up so that, for instance, plotted polygons end up associated with different values after the merge?). None of the solutions actually removes the two polygons with missing data, but I presume this is correct behaviour in R?
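One quick self-check (a sketch, not a guarantee): left_join keeps the row order of its left table, but a non-unique join key would duplicate rows and break the link between attributes and polygons, so it is worth verifying that the row count and IDs are unchanged after the join:
stopifnot(nrow(s1@data) == nrow(c@data),
          identical(s1@data$POLYID, c@data$POLYID))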

Simplest way to plot changes in ranking between two ordered lists in R?

I'm wondering if there is an easy way to plot the changes in position of elements between 2 lists in the form of a directed bipartite graph in R. For example, list 1 and 2 are vectors of character strings, not necessarily containing the same elements:
list.1 <- c("a","b","c","d","e","f","g")
list.2 <- c("b","x","e","c","z","d","a")
I would like to generate something similar to:
I've had a slight bash at using the igraph package, but couldn't easily construct what I would like, which I imagine and hope shouldn't be too hard.
Cheers.
Here is a simple function to do what you want. Essentially it uses match() to find the elements of one vector in the other and arrows() to draw the connecting arrows.
plotRanks <- function(a, b, labels.offset = 0.1, arrow.len = 0.1)
{
  old.par <- par(mar = c(1, 1, 1, 1))

  # Find the length of the vectors
  len.1 <- length(a)
  len.2 <- length(b)

  # Plot two columns of equidistant points
  plot(rep(1, len.1), 1:len.1, pch = 20, cex = 0.8,
       xlim = c(0, 3), ylim = c(0, max(len.1, len.2)),
       axes = FALSE, xlab = "", ylab = "")  # Remove axes and labels
  points(rep(2, len.2), 1:len.2, pch = 20, cex = 0.8)

  # Put labels next to each observation
  text(rep(1 - labels.offset, len.1), 1:len.1, a)
  text(rep(2 + labels.offset, len.2), 1:len.2, b)

  # Now we need to map where the elements of a are in b
  # We use the match function for this job
  a.to.b <- match(a, b)

  # Now we can draw arrows from the first column to the second
  arrows(rep(1.02, len.1), 1:len.1, rep(1.98, len.2), a.to.b,
         length = arrow.len, angle = 20)

  par(old.par)
}
A few example plots
par(mfrow=c(2,2))
plotRanks(c("a","b","c","d","e","f","g"),
c("b","x","e","c","z","d","a"))
plotRanks(sample(LETTERS, 20), sample(LETTERS, 5))
plotRanks(c("a","b","c","d","e","f","g"), 1:10) # No matches
plotRanks(c("a", "b", "c", 1:5), c("a", "b", "c", 1:5)) # All matches
par(mfrow=c(1,1))
Here's a solution using igraph functions.
library(igraph)
rankchange <- function(list.1, list.2){
  grp <- c(rep(0, length(list.1)), rep(1, length(list.2)))
  m <- match(list.1, list.2)
  m <- m + length(list.1)
  pairs <- cbind(1:length(list.1), m)
  pairs <- pairs[!is.na(pairs[, 1]), ]
  pairs <- pairs[!is.na(pairs[, 2]), ]
  g <- graph.bipartite(grp, as.vector(t(pairs)), directed = TRUE)
  V(g)$color <- c("red", "green")[grp + 1]
  V(g)$label <- c(list.1, list.2)
  V(g)$x <- grp
  V(g)$y <- c(length(list.1):1, length(list.2):1)
  g
}
This builds and then plots the graph from your vectors:
g = rankchange(list.1, list.2)
plot(g)
Adjust the colour scheme and symbolism to suit using options detailed in the igraph docs.
Note this is not thoroughly tested (only tried on your sample data) but you can see how it builds a bipartite graph from the code.
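For example, a couple of the standard plot.igraph parameters (the values here are arbitrary, just to show the idea):
plot(g, vertex.size = 18, vertex.label.color = "black", edge.arrow.size = 0.4)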
With ggplot2:
v1 <- c("a","b","c","d","e","f","g")
v2 <- c("b","x","e","c","z","d","a")
o <- 0.05
DF <- data.frame(x  = c(rep(1, length(v1)), rep(2, length(v2))),
                 x1 = c(rep(1 + o, length(v1)), rep(2 - o, length(v2))),
                 y  = c(rev(seq_along(v1)), rev(seq_along(v2))),
                 g  = c(v1, v2))
library(ggplot2)
library(grid)
ggplot(DF, aes(x = x, y = y, group = g, label = g)) +
  geom_path(aes(x = x1), arrow = arrow(length = unit(0.02, "npc")),
            size = 1, color = "green") +
  geom_text(size = 10) +
  theme_minimal() +
  theme(axis.title = element_blank(),
        axis.text  = element_blank(),
        axis.ticks = element_blank(),
        panel.grid = element_blank())
This can of course be wrapped in a function easily.
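For completeness, here is a minimal wrapper around the code above (a sketch; plot_rank_change is just an illustrative name, and it assumes ggplot2 and grid are loaded as shown):
plot_rank_change <- function(v1, v2, o = 0.05) {
  DF <- data.frame(x  = c(rep(1, length(v1)), rep(2, length(v2))),
                   x1 = c(rep(1 + o, length(v1)), rep(2 - o, length(v2))),
                   y  = c(rev(seq_along(v1)), rev(seq_along(v2))),
                   g  = c(v1, v2))
  ggplot(DF, aes(x = x, y = y, group = g, label = g)) +
    geom_path(aes(x = x1), arrow = arrow(length = unit(0.02, "npc")),
              size = 1, color = "green") +
    geom_text(size = 10) +
    theme_minimal() +
    theme(axis.title = element_blank(), axis.text = element_blank(),
          axis.ticks = element_blank(), panel.grid = element_blank())
}
plot_rank_change(v1, v2)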
Here's a generalization of nico's result for use with data frames:
plotRanks <- function(df, rank_col, time_col, data_col, color_col = NA,
                      labels_offset = 0.1, arrow_len = 0.1, ...){
  time_vec <- df[, time_col]
  unique_dates <- unique(time_vec)
  unique_dates <- unique_dates[order(unique_dates)]

  rank_ls <- lapply(unique_dates, function(d){
    temp_df <- df[time_vec == d, ]
    temp_df <- temp_df[order(temp_df[, data_col], temp_df[, rank_col]), ]
    temp_d <- temp_df[, data_col]
    temp_rank <- temp_df[, rank_col]
    if(is.na(color_col)){
      temp_color <- rep("blue", length(temp_d))
    }else{
      temp_color <- temp_df[, color_col]
    }
    temp_rank <- temp_df[, rank_col]
    temp_ls <- list(temp_rank, temp_d, temp_color)
    names(temp_ls) <- c("ranking", "data", "color")
    temp_ls
  })

  first_rank <- rank_ls[[1]]$ranking
  first_data <- rank_ls[[1]]$data
  first_length <- length(first_rank)
  y_max <- max(sapply(rank_ls, function(l) length(l$ranking)))

  plot(rep(1, first_length), 1:first_length, pch = 20, cex = 0.8,
       xlim = c(0, length(rank_ls) + 1), ylim = c(1, y_max),
       xaxt = "n", xlab = NA, ylab = "Ranking", ...)

  text_paste <- paste(first_rank, "\n", "(", first_data, ")", sep = "")
  text(rep(1 - labels_offset, first_length), 1:first_length, text_paste)
  axis(1, at = 1:(length(rank_ls)), labels = unique_dates)

  for(i in 2:length(rank_ls)){
    j <- i - 1
    ith_rank <- rank_ls[[i]]$ranking
    ith_data <- rank_ls[[i]]$data
    jth_color <- rank_ls[[j]]$color
    jth_rank <- rank_ls[[j]]$ranking
    ith_length <- length(ith_rank)
    jth_length <- length(jth_rank)

    points(rep(i, ith_length), 1:ith_length, pch = 20, cex = 0.8)
    i_to_j <- match(jth_rank, ith_rank)
    arrows(rep(i - 0.98, jth_length), 1:jth_length, rep(i - 0.02, ith_length), i_to_j,
           length = 0.1, angle = 10, col = jth_color)

    offset_choice <- ifelse(length(rank_ls) == 2, i + labels_offset, i - labels_offset)
    text_paste <- paste(ith_rank, "\n", "(", ith_data, ")", sep = "")
    text(rep(offset_choice, ith_length), 1:ith_length, text_paste)
  }
}
Here's an example using a haphazard reshape of the presidents dataset:
data(presidents)
years <- rep(1945:1974, 4)
n <- length(presidents)
q1 <- presidents[seq(1, n, 4)]
q2 <- presidents[seq(2, n, 4)]
q3 <- presidents[seq(3, n, 4)]
q4 <- presidents[seq(4, n, 4)]
quarters <- c(q1, q2, q3, q4)
q_label <- c(rep("Q1", n / 4), rep("Q2", n / 4), rep("Q3", n / 4), rep("Q4", n / 4))
q_colors <- c(Q1 = "blue", Q2 = "red", Q3 = "green", Q4 = "orange")
q_colors <- q_colors[match(q_label, names(q_colors))]
new_prez <- data.frame(years, quarters, q_label, q_colors)
new_prez <- na.omit(new_prez)
png("C:/users/fasdfsdhkeos/desktop/prez.png", width = 15, height = 10, units = "in", res = 300)
plotRanks(new_prez[new_prez$years %in% 1960:1970, ], "q_label", "years", "quarters", "q_colors")
dev.off()
This produces a time series ranking plot, and it introduces colour if tracking a certain observation is desired.
