R: visualizing differences across a large number of groups - r

I have a dataset having the unique IDs of manufacturing units, the industrial classification of their outputs (CAT) and the number of people each unit employs (EMP). I want to graphically show that EMP varies by CAT, i.e. employment size in general varies by the kind of output a unit produces. I tried boxplots arranged by median EMP:
a = read.csv("/filepath/plot.csv", header=T, stringsAsFactors=F)
bymedian = with(a, reorder(CAT, log(as.numeric(as.character(EMP))), median))
boxplot(log(EMP) ~ bymedian, data=a, horizontal=F, notch=T, pch=1, cex=.25, col="gray95", boxwex=.25, las=2, outline=F)
pch=1, cex=.25, col="gray95", boxwex=.25, las=2, outline=F)
The problem is that because of the large number of categories (400+), the plot becomes very messy. Is there a cleaner way of showing what I am trying to do?

Using ggplot2 you can show what you are trying to do with a scale_x_discrete
library(ggplot2)
a$bymedian = with(a, reorder(CAT, log(EMP), median))
p <- ggplot(a,aes(y=log(EMP),x=bymedian))+
geom_boxplot()
breaks <- levels(a$bymedian)[seq(1,nlevels(a$bymedian),20)]
p %+% scale_x_discrete(breaks = breaks, labels = breaks)

Related

R:par(cex.lab=2) doesn't work in plot(effect(),…)

I made a linear regression with a database including group(1=smoke,2=control) , gender(1=m,2=f) and a dependent variable like weight. I want to see the interactions between group and gender with a plot. I need to change the size of the label of axes but it doesn't work with par(). The code is like this:
lin <- lm(weight ~ group + gender + group:gender, data=data)
par(cex.lab = 2, cex.axis = 2)
library(effects)
plot(effect("group:gender",lin,,list(gender=c(1,2))),multiline=T)
The size doesn't change. And if I want to delete the axis like this:
plot(effect("group:gender",lin,,list(gender=c(1,2))),multiline=T,axes=FALSE)
It gives me this error:
$ operator is invalid for atomic vectors
how to solve this?
I don't know why that is happening, my guess is that the class(effect) is "eff" which perhaps may not be suited for plot to render it properly, to avoid this convert this object to data.frame and then use the par functionality to do your task.
Answer to your question: Here if you change par options with different values, your font size will change like in the graph which I have mentioned down.
You can do this:
library(effects)
lin <- lm(mpg ~ cyl + am + am:cyl, data=mtcars)
par(cex.lab=1.2, cex.axis=1.2, cex.main=1.2, cex.sub=1.2) #Here you can check, the par options, if you change it the font will incrase or decrese
effect1 <- data.frame(effect("cyl:am",lin,,list(cyl=c(4,6,8))))
effects <- effect1[,c("cyl","am", "fit")] ##Keeping only the required columns
You can do a plotting with effects , by using all three objects: cyl, am and fit, However, the lines are getting joined , I am not aware any functionality like ggplot's group in base plot R. So I will split it and then plot it.
xvals <- split(effects$am,effects$cyl) #split x-axis basis cyl
yvals <- split(effects$fit,effects$cyl) #split y-axis basis cyl
plot(1:max(unlist(xvals)),xlim = c(0,max(unlist(xvals))),ylim=(c(0,max(unlist(yvals)))),type="n", main="plot b/w mpg, am * cyl",
xlab="am", ylab="mpg") #adding header, labels and xlim and ylim to the graphs
Map(lines,xvals,yvals,col=c("red","blue","black"),pch=1:2,type="o") #plotting the objects using Map
legend("bottomright", legend=c("8", "6", "4"),
col=c("red", "blue", "black"), lty=1:2, cex=0.8) #adding the legend
Output:
With par options fixed at 1.2
With par options fixed at 1.5:

R - Control Histogram Y-axis Limits by second-tallest peak

I've written an R script that loops through a data.frame making multiple of complex plots that includes a histogram. The problem is that the histograms often show a tall, uninformative peak at x=0 or x=1 and it obscures the rest of the data which is more informative. I have figured out that I can hide the tall peak by defining the limits of the x and y axes of each histogram as seen in the code below - but what I really need to figure out is how to define the y-axis limits such that they are optimized for the second-largest peak in my histogram.
Here's some code that simulates my data and plots histograms with different sorts of axis limits imposed:
require(ggplot2)
set.seed(5)
df = data.frame(matrix(sample(c(1:10), 1000, replace = TRUE, prob = c(0.8,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01)), nrow=100))
cols = names(df)
for (i in c(1:length(cols))) {
my_col = cols[i]
p1 = ggplot(df, aes_string(my_col)) + geom_histogram(bins = 10)
print(p1)
p2 = p1 + ggtitle(paste("Fixed X Limits", my_col)) + scale_x_continuous(limits = c(1,10))
print(p2)
p3 = p1 + ggtitle(paste("Fixed Y Limits", my_col)) + scale_y_continuous(limits = c(0,3))
print(p3)
p4 = p1 + ggtitle(paste("Fixed X & Y Limits", my_col)) + scale_y_continuous(limits = c(0,3)) + scale_x_continuous(limits = c(1,10))
print(p4)
}
The problem is that in this data, I can hard-code y-limits and have a reasonable expectation that they will work well for all the histograms. With my real data the size of the peaks varies wildly between the numerous histograms I am producing. I've tried defining the y-limit with various equations based on descriptive numbers like the mean, median and range but nothing I've come up with works well for all cases.
If I could define the y-limit in relation to the second-tallest peak of the histogram, I would have something that was perfectly suited for each situation.
I am not sure how ggplot builds its histograms, but one method would be to grab the results from hist:
maxDensities <- sapply(df, function(i) max(hist(i)$density))
# take the second highest peak:
myYlim <- rev(sort(maxDensities))[2]
I would process the data to determine the height you need.
Something along the lines of:
sort(table(cut(df$X1,breaks=10)),T)[2]
Working from the inside out
cut will bin the data (not really needed with integer data like you have but probably needed with real data
table then creates a table with the count of each of those bins
sort sorts the table from highest to lowest
[2] takes the 2nd highest value

Looking for a better way to plot data

I have data that describe several measurements taken from several individuals (each individual is represented by several measurements taken at several different time points).
I want to present the data as a scatter plot of measurements vs. individuals. Since for each individual I have several measurements, it means that I'll have a stack of points at each x-axis point.
Here's an example random code to generate these data:
set.seed(1)
n.individuals <- 10
n.measurements <- 15
vars <- runif(n.individuals, 0.1, 1)
means <- runif(n.individuals, 1, 5)
negative.idx <- sample(n.individuals, n.individuals/2)
means[negative.idx] <- -1*means[negative.idx]
df <- data.frame(measurement=c(sapply(1:n.individuals, function(x) rnorm(n.measurements, means[x], sqrt(vars[x])))),
individual=c(sapply(1:n.individuals, function(x) rep(x, n.measurements))))
Here's how I'm presenting the data so far:
#add colors
cols <- rgb(runif(n.measurements),runif(n.measurements),runif(n.measurements))
df$col <- rep(cols, n.individuals)
#simple plot
plot(df$individual, df$measurement, col=df$col, lwd=2, xlab = "individual", ylab = "measurement")
abline(h=0,lty=2)
abline(v=seq(min(df$individual)-0.5, max(df$individual)+0.5, 1),lty=2)
I'm wondering if there's a more elegant way to present the data (perhaps a ggplot way?)
Note that the signal I'm looking for in the data (and this is how I generated them) is that the measurements for each individual are correlated with respect to their sign. If they are uncorrelated with respect to their sign they should appear scattered on both sides of the y-axis.
Firstly, I would jitter your individuals so that individual measurements do not overlap. Use this code:
plot(jitter(df$individual), df$measurement, col=df$col,
lwd=2, xlab = "individual", ylab = "measurement")
There are a million ways to plot it in ggplot. Here's a quick violin graph:
p <- ggplot(df, aes(factor(individual), measurement))
p + geom_violin(aes(fill = factor(individual))) +
geom_hline((aes(yintercept = 0))) + geom_jitter( ) + xlab("Individual")

Combining 3 different matrix plots in R

Can anyone tell me how to create a plot which features 3 different matrices sets of data. In general, I have 3 different matricies of data all 1*1001 dimensions, and i wish to plot all 3 on the same graph.
I have managed to get one matrix to plot at once, and assemble the code to create the other 2 matrices but not to plot it. B[i,] is randomly generated data. What I would like to know is what would be the coding to get all 3 plots together on one graph.
Code for one matrix:
ntime<-1000
average.price.at.each.timestep<-matrix(0,nrow=1,ncol=ntime+1)
for(i in 1:(ntime+1)){
average.price.at.each.timestep[i]<-mean(B[i,])
}
matplot(t, t(average.price.at.each.timestep), type="l", lty=1, main="MC Price of a Zero Coupon Bond", ylab="Price", xlab = "Option Exercise Date")
Code for 3:
average.price.at.each.timestep<-matrix(0,nrow=1,ncol=ntime+1)
s.e.at.each.time <-matrix(0,nrow=1,ncol=ntime+1)
upper.c.l.at <- matrix(0,nrow=1,ncol=ntime+1)
lower.c.l.at <- matrix(0,nrow=1,ncol=ntime+1)
std <- function(x) sd(x)/sqrt(length(x))
for(i in 1:(ntime+1)){
average.price.at.each.timestep[i]<-mean(B[i,])
s.e.at.each.time[i] <- std(B[i,])
upper.c.l.at[i] <- average.price.at.each.timestep[i]+1.96*s.e.at.each.time[i]
lower.c.l.at[i] <- average.price.at.each.timestep[i]-1.96*s.e.at.each.time[i]
}
I'm still struggling with this as I cannot get the solutions given to match with my data sets, I have now included the code below that generates the matrix B as a working example so you can see the data I am dealing with. As you can see it produces a plot of the different prices, I would like a plot with the average price and confidence intervals of the average.
# Define Bond Price Parameters
#
P<-1 #par value
# Define Vasicek Model Parameters
#
rev.rate<-0.3 #speed of reversion
long.term.mean<-0.1 #long term level of the mean
sigma<-0.05 #volatility
r0<-0.03 #spot interest rate
Strike<-0.05
# Define Simulation Parameters
#
T<-50 #time to expiry
ntime<-1000 #number of timesteps
yearstep<-ntime/T #yearstep
npaths<-1000 #number of paths
dt<-T/ntime #timestep
R <- matrix(0,nrow=ntime+1,ncol=npaths) #matrix of Vasicek interest rate values
B <- matrix(0,nrow=ntime+1,ncol=npaths) # matrix of Bond Prices
R[1,]<-r0 #specifies that all paths start at specified spot rate
B[1,]<-P
# do loop which generates values to fill matrix R with multiple paths of Interest Rates as they evolve over time.
# stochastic process based on standard normal distribution
for (j in 1:npaths) {
for (i in 1:ntime) {
dZ <-rnorm(1,mean=0,sd=1)*sqrt(dt)
Rij<-R[i,j]
Bij<-B[i,j]
dr <-rev.rate*(long.term.mean-Rij)*dt+sigma*dZ
R[i+1,j]<-Rij+dr
B[i+1,j]<-Bij*exp(-R[i+1,j]*dt)
}
}
t<-seq(0,T,dt)
par(mfcol = c(3,3))
matplot(t, B[,1:pmin(20,npaths)], type="l", lty=1, main="Price of a Zero Coupon Bond", ylab="Price", xlab = "Time to Expiry")
Your example isn't reproducible, so I created some fake data that I hope is structured similarly to yours. If this isn't what you were looking for, let me know and I'll update as needed.
# Fake data
ntime <- 100
mat1 <- matrix(rnorm(ntime+1, 10, 2), nrow=1, ncol=ntime+1)
mat2 <- matrix(rnorm(ntime+1, 20, 2), nrow=1, ncol=ntime+1)
mat3 <- matrix(rnorm(ntime+1, 30, 2), nrow=1, ncol=ntime+1)
matplot(1:(ntime+1), t(mat1), type="l", lty=1, ylim=c(0, max(c(mat1,mat2,mat3))),
main="MC Price of a Zero Coupon Bond",
ylab="Price", xlab = "Option Exercise Date")
# Add lines for mat2 and mat3
lines(1:101, mat2, col="red")
lines(1:101, mat3, col="blue")
UPDATE: Is this what you're trying to do?
matplot(t, t(average.price.at.each.timestep), type="l", lty=1,
main="MC Price of a Zero Coupon Bond", ylab="Price",
xlab = "Option Exercise Date")
matlines(t, t(upper.c.l.at), lty=2, col="red")
matlines(t, t(lower.c.l.at), lty=2, col="green")
See plot below. If you have multiple columns that you want to plot (as in your updated example where you plot 20 separate paths) and you want to add lower and upper CIs for all of them (though this would make the plot unreadable), just use a matrix of upper and lower CI values that correspond to each path in average.price.at.each.timestep and use matlines to add them to your existing plot of the multiple paths.
This is doable using ggplot2 and reshape2. The structures you have are a little awkward, which you could improve by using a data frame instead of a matrix.
#Dummy data
average.price.at.each.timestep <- rnorm(1000, sd=0.01)
s.e.at.each.time <- rnorm(1000, sd=0.0005, mean=1)
#CIs (note you can vectorise this):
upper.c.l.at <- average.price.at.each.timestep+1.96*s.e.at.each.time
lower.c.l.at <- average.price.at.each.timestep-1.96*s.e.at.each.time
#create a data frame:
prices <- data.frame(time = 1:length(average.price.at.each.timestep), price=average.price.at.each.timestep, upperCI= upper.c.l.at, lowerCI= lower.c.l.at)
library(reshape2)
#turn the data frame into time, variable, value triplets
prices.t <- melt(prices, id.vars=c("time"))
#plot
library(ggplot2)
ggplot(prices.t, aes(time, value, colour=variable)) + geom_line()
This produces the following plot:
This can be improved somewhat by using geom_ribbon instead:
ggplot(prices, aes(time, price)) + geom_ribbon(aes(ymin=lowerCI, ymax=upperCI), alpha=0.1) + geom_line()
Which produces this plot:
Here's another, slightly different ggplot solution that does not require you to calculate the confidence limits first - ggplot does it for you.
# create sample dataset
set.seed(1) # for reproducible example
B <- matrix(rnorm(1000,mean=rep(10+1:10/2,each=10)),nc=10)
library(ggplot2)
library(reshape2) # for melt(...)
gg <- melt(data.frame(date=1:nrow(B),B), id="date")
ggplot(gg, aes(x=date,y=value)) +
stat_summary(fun.y = mean, geom="line")+
stat_summary(fun.y = function(y)mean(y)-1.96*sd(y)/sqrt(length(y)), geom="line",linetype="dotted", color="blue")+
stat_summary(fun.y = function(y)mean(y)+1.96*sd(y)/sqrt(length(y)), geom="line",linetype="dotted", color="blue")+
theme_bw()
stat_summary(...) summarizes the y-values for a given value of x (the date). So in the first call, it calculates the mean, in the second the lowerCL, and in the third the upperCL.
You could also create a CL(...) function, and call that:
CL <- function(x,level=0.95,type=c("lower","upper")) {
fact <- c(lower=-1,upper=1)
mean(x) - fact[type]*qnorm((1-level)/2)*sd(x)/sqrt(length(x))
}
ggplot(gg, aes(x=date,y=value)) +
stat_summary(fun.y = mean, geom="line")+
stat_summary(fun.y = CL, type="lower", geom="line",linetype="dotted", color="blue")+
stat_summary(fun.y = CL, type="upper", geom="line",linetype="dotted", color="blue")+
theme_bw()
This produces a plot identical to the one above.

Plot frequency of a value of 2 factors in the same plot in R

I'd like to plot the frequency of a variable color coded for 2 factor levels for example blue bars should be the hist of level A and green the hist of level B both n the same graph? Is this possible with the hist command? The help of hist does not allow for a factor. Is there another way around?
I managed to do this by barplots manually but i want to ask if there is a more automatic method
Many thanks
EC
PS. I dont need density plots
Just in case the others haven't answered this is a way that satisfies. I had to deal with stacking histograms recently, and here's what I did:
data_sub <- subset(data, data$V1 == "Yes") #only samples that have V1 as "yes" in my dataset #are added to the subset
hist(data$HL)
hist(data_sub$HL, col="red", add=T)
Hopefully, this is what you meant?
It's rather unclear what you have as a data layout. A histogram requires that you have a variable that is ordinal or continuous so that breaks can be created. If you also have a separate grouping factor you can plot histograms conditional on that factor. A nice worked example of such a grouping and overlaying a density curve is offered in the second example on the help page for the histogram function in the lattice package.
A nice resource for learning relative merits of lattice and ggplot2 plotting is the Learning R blog. This is from the first of a multipart series on side-by=side comparison of the two plotting systems:
library(lattice)
library(ggplot2)
data(Chem97, package = "mlmRev")
#The lattice method:
pl <- histogram(~gcsescore | factor(score), data = Chem97)
print(pl)
# The ggplot method:
pg <- ggplot(Chem97, aes(gcsescore)) + geom_histogram(binwidth = 0.5) +
facet_wrap(~score)
print(pg)
I don't think you can do that easily with a bar histogram, as you would have to "interlace" the bars from both factor levels... It would need some kind of "discretization" of the now continuous x axis (i.e. it would have to be split in "categories" and in each category you would have 2 bars, for each factor level...
But it is quite easy and without problems if you are just fine with plotting the density line function:
y <- rnorm(1000, 0, 1)
x <- rnorm(1000, 0.5, 2)
dx <- density(x)
dy <- density(y)
plot(dx, xlim = range(dx$x, dy$x), ylim = range(dx$y, dy$y),
type = "l", col = "red")
lines(dy, col = "blue")
It's very possible.
I didn't have data to work with but here's an example of a histogram with different colored bars. From here you'd need to use my code and figure out how to make it work for factors instead of tails.
BASIC SETUP
histogram <- hist(scale(vector)), breaks= , plot=FALSE)
plot(histogram, col=ifelse(abs(histogram$breaks) < #of SD, Color 1, Color 2))
#EXAMPLE
x<-rnorm(1000)
histogram <- hist(scale(x), breaks=20 , plot=FALSE)
plot(histogram, col=ifelse(abs(histogram$breaks) < 2, "red", "green"))
I agree with the others that a density plot is more useful than merging colored bars of a histogram, particularly if the group's values are intermixed. This would be very difficult and wouldn't really tell you much. You've got some great suggestions from others on density plots, here's my 2 cents for density plots that I sometimes use:
y <- rnorm(1000, 0, 1)
x <- rnorm(1000, 0.5, 2)
DF <- data.frame("Group"=c(rep(c("y","x"), each=1000)), "Value"=c(y,x))
library(sm)
with(DF, sm.density.compare(Value, Group, xlab="Grouping"))
title(main="Comparative Density Graph")
legend(-9, .4, levels(DF$Group), fill=c("red", "darkgreen"))

Resources