How to plot a variable with selected number of rows using ggplot2? - r

A sample of my dataframe (speed) is as below with 45122 observations.
A B C
1 0.06483121 0.08834364 0.05814113
2 0.06904103 0.13169238 0.06082291
3 0.05556961 0.09767185 0.06039383
4 0.06483121 0.13388726 0.05996474
5 0.06651514 0.11632827 0.04891578
6 0.06904103 0.11687699 0.05953565
...
......
45122 0.06212749 0.08307191 0.07422524
I can create a simple plot of any number of observations I choose using the code below
(a temporal cyclic pattern: speed on the y axis, observations 1 to 500 on the x axis):
plot(speed[1:500,3], type="l", ylab="speed", xlab="unit time")
I am trying to do the same with ggplot2, but it's giving me a histogram.
How do I make a similar plot using ggplot2?

We subset the first 500 rows and the third variable ('C') using [. Note that we have to add drop=FALSE as the default is drop=TRUE. According to ?"[", if drop=TRUE, the result is coerced to the lowest possible dimension, i.e. in this case a vector.
speed1 <- speed[1:500,3, drop=FALSE]
We specify the 'x' (1:nrow(speed1)) and 'y' variables in the aes, use geom_line() for a line plot and xlab and ylab to specify the labels for 'x axis' and 'y axis'.
library(ggplot2)
ggplot(speed1, aes(x=1:nrow(speed1), y=C))+
geom_line() +
ylab('speed') +
xlab('unit time')
data
set.seed(24)
speed <- as.data.frame(matrix(abs(rnorm(45122*3)), ncol=3,
dimnames=list(NULL, LETTERS[1:3])))
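A variant of the same idea, as a minimal sketch: store the row index in its own column so that aes() refers only to columns of the data. The column name `time` and the object name `speed_sub` are chosen here for illustration and are not part of the original data; the mock data is generated the same way as above.

```r
library(ggplot2)

# Mock data with the same shape as the question's `speed` data frame
set.seed(24)
speed <- as.data.frame(matrix(abs(rnorm(45122 * 3)), ncol = 3,
                              dimnames = list(NULL, LETTERS[1:3])))

# Keep the first 500 rows of column C and store the index explicitly;
# `time` is an illustrative column name, not part of the original data
n <- 500
speed_sub <- data.frame(time = seq_len(n), C = speed$C[seq_len(n)])

p <- ggplot(speed_sub, aes(x = time, y = C)) +
  geom_line() +
  xlab("unit time") +
  ylab("speed")
print(p)
```

Changing `n` is then the only edit needed to plot a different number of rows.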

Related

R - ggplot2 - Get histogram of difference between two groups

Let's say I have a histogram with two overlapping groups. Here's a possible command from ggplot2 and a pretend output graph.
ggplot(data, aes(x=Variable1, fill=BinaryVariable)) + geom_histogram(position="identity")
So what I have is the frequency or count of each event. What I'd like to do instead is to get the difference between the two events in each bin. Is this possible? How?
For example, if we do RED minus BLUE:
Value at x=2 would be ~ -10
Value at x=4 would be ~ 40 - 200 = -160
Value at x=6 would be ~ 190 - 25 = 155
Value at x=8 would be ~ 10
I'd prefer to do this using ggplot2, but another way would be fine. My dataframe is set up with items like this toy example (dimensions are actually 25000 rows x 30 columns) EDITED: Here is example data to work with GIST Example
ID Variable1 BinaryVariable
1 50 T
2 55 T
3 51 N
.. .. ..
1000 1001 T
1001 1944 T
1002 1042 N
As you can see from my example, I'm interested in a histogram to plot Variable1 (a continuous variable) separately for each BinaryVariable (T or N). But what I really want is the difference between their frequencies.
So, in order to do this we need to make sure that the "bins" we use for the histograms are the same for both levels of your indicator variable. Here's a somewhat naive solution (in base R):
df = data.frame(y = c(rnorm(50), rnorm(50, mean = 1)),
x = rep(c(0,1), each = 50))
#full hist
fullhist = hist(df$y, breaks = 20) #specify more breaks than probably necessary
#create histograms for 0 & 1 using breaks from full histogram
zerohist = with(subset(df, x == 0), hist(y, breaks = fullhist$breaks))
oneshist = with(subset(df, x == 1), hist(y, breaks = fullhist$breaks))
#combine the hists
combhist = fullhist
combhist$counts = zerohist$counts - oneshist$counts
plot(combhist)
So we specify how many breaks should be used (based on values from the histogram on the full data), and then we compute the differences in the counts at each of those breaks.
PS It might be helpful to examine what the non-graphical output of hist() is.
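As a small sketch of what that non-graphical output looks like, hist() can be called with plot = FALSE so that it only computes the bins. The data here are made up for illustration.

```r
set.seed(1)
y <- rnorm(100)

# plot = FALSE computes the bins without drawing anything
h <- hist(y, breaks = 20, plot = FALSE)

# The components used in the answer above: bin edges and per-bin counts
str(h[c("breaks", "counts", "mids")])

# There is one count per bin, so one more break than there are counts
length(h$breaks) - length(h$counts)
```

Because `counts` is a plain numeric vector, subtracting two histograms that share the same `breaks` is ordinary vector arithmetic.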
Here's a solution that uses ggplot as requested.
The key idea is to use ggplot_build to get the rectangles computed by stat_histogram. From that you can compute the differences in each bin and then create a new plot using geom_rect.
setup and create a mock dataset with lognormal data
library(ggplot2)
library(data.table)
theme_set(theme_bw())
n1<-500
n2<-500
k1 <- exp(rnorm(n1,8,0.7))
k2 <- exp(rnorm(n2,10,1))
df <- data.table(k=c(k1,k2),label=c(rep('k1',n1),rep('k2',n2)))
Create the first plot
p <- ggplot(df, aes(x=k,group=label,color=label)) + geom_histogram(bins=40) + scale_x_log10()
Get the rectangles using ggplot_build
p_data <- as.data.table(ggplot_build(p)$data[[1]])[,.(count,xmin,xmax,group)]
p1_data <- p_data[group==1]
p2_data <- p_data[group==2]
Join on the x-coordinates to compute the differences. Note that we use count rather than y: because the bars are stacked, the y-values aren't the counts, but the y-coordinates of the first plot.
newplot_data <- merge(p1_data, p2_data, by=c('xmin','xmax'), suffixes = c('.p1','.p2'))
newplot_data <- newplot_data[,diff:=count.p1 - count.p2]
setnames(newplot_data, old=c('count.p1','count.p2'), new=c('k1','k2'))
df2 <- melt(newplot_data,id.vars =c('xmin','xmax'),measure.vars=c('k1','diff','k2'))
make the final plot
ggplot(df2, aes(xmin=xmin,xmax=xmax,ymax=value,ymin=0,group=variable,color=variable)) + geom_rect()
Of course the scales and legends still need to be fixed, but that's a different topic.
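An alternative sketch that avoids ggplot_build: compute the per-bin counts yourself with cut() and table() on a shared set of breaks, then plot only the differences. This is a different technique from the one above, not a rewrite of it. The data are simulated directly on the log scale (so no scale_x_log10() is needed), and the names `breaks`, `counts` and `diff_df` are illustrative.

```r
library(ggplot2)

set.seed(42)
df <- data.frame(k = c(rnorm(500, 8, 0.7), rnorm(500, 10, 1)),
                 label = rep(c("k1", "k2"), each = 500))

# Shared bin edges covering the full range of both groups (40 bins)
breaks <- seq(min(df$k), max(df$k), length.out = 41)
bin <- cut(df$k, breaks = breaks, include.lowest = TRUE)

# Rows are bins, columns are groups
counts <- table(bin, df$label)

# One row per bin with its edges and the k1 - k2 difference
diff_df <- data.frame(xmin = head(breaks, -1),
                      xmax = tail(breaks, -1),
                      diff = as.vector(counts[, "k1"] - counts[, "k2"]))

ggplot(diff_df, aes(xmin = xmin, xmax = xmax, ymin = 0, ymax = diff)) +
  geom_rect(fill = "grey60", colour = "black") +
  xlab("k") + ylab("count(k1) - count(k2)")
```

Since both groups are binned with the same `breaks`, the rows of `counts` line up and no join is needed.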

Barplot with continuous x axis using base r graphics

I am looking to scale the x axis on my barplot to time, so as to accurately represent when measurements were taken.
I have these data frames:
> Botcv
Date Average SE
1 2014-09-01 4.0 1.711307
2 2014-10-02 5.5 1.500000
> Botc1
Date Average SE
1 2014-10-15 2.125 0.7180703
2 2014-11-12 1.000 0.4629100
3 2014-12-11 0.500 0.2672612
> Botc2
Date Average SE
1 2014-10-15 3.375 1.3354708
2 2014-11-12 1.750 0.4531635
3 2014-12-11 0.625 0.1829813
I use this code to produce a grouped barplot:
covaverage <- c(Botcv$Average,NA,NA,NA)
c1average <- c(NA,NA, Botc1$Average)
c2average <- c(NA,NA, Botc2$Average)
date <- c(Botcv$Date, Botc1$Date)
averagematrix <- matrix(c(covaverage,c1average, c2average), nrow=3, ncol=5, byrow=TRUE)
barplot(averagematrix,date, xlab="Date", ylab="Average", axis.lty=1, space=NULL,width=3,beside=T, ylim=c(0.00,6.00))
R plots the bars equal distances apart by default and I have been trying to find a workaround for this. I have seen several other solutions that utilise ggplot2 but I am producing plots for my masters thesis and would like to keep the appearance of my barplots in line with other graphs that I have created using base R graphics. I also want to add error bars to the plot. If anyone could provide a solution then I would be very grateful!! Thanks!
Perhaps you can use this as a start. It is probably easier to use boxplots, as they can be placed at a given x position with the at argument. This cannot be done for base barplots, but you can use rect() instead to replicate the barplot look. Error bars can be added using arrows() or segments().
bar_w = 1 # width of bars
offset = c(-1,1) # offset to avoid overlapping
cols = grey.colors(2) # colors for different types
# combine into a single data frame
d = data.frame(rbind(Botc1, Botc2), 'type' = c(1,1,1,2,2,2))
# set up empty plot with sensible x and y lims
plot(as.Date(d$Date), d$Average, type='n', ylim=c(0,4))
# draw data of data frame 1 and 2
for (i in unique(d$type)){
dd = d[d$type==i, ]
x = as.Date(dd$Date)
y = dd$Average
# rectangles
rect(xleft=x-bar_w+offset[i], ybottom=0, xright=x+bar_w+offset[i], ytop=y, col=cols[i])
# errors bars
arrows(x0=x+offset[i], y0=y-0.5*dd$SE, x1=x+offset[i], y1=y+0.5*dd$SE, col=1, angle=90, code=3, length = 0.1)
}
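To try the code above without the original files, the two data frames can be recreated from the values printed in the question. This is only a sketch: storing Date as class Date is an assumption, since the question does not show the column types.

```r
# Values copied from the question's printed data frames
Botc1 <- data.frame(Date = as.Date(c("2014-10-15", "2014-11-12", "2014-12-11")),
                    Average = c(2.125, 1.000, 0.500),
                    SE = c(0.7180703, 0.4629100, 0.2672612))
Botc2 <- data.frame(Date = as.Date(c("2014-10-15", "2014-11-12", "2014-12-11")),
                    Average = c(3.375, 1.750, 0.625),
                    SE = c(1.3354708, 0.4531635, 0.1829813))

# Same combined data frame as in the answer
d <- data.frame(rbind(Botc1, Botc2), type = rep(1:2, each = 3))

# The gaps between measurement dates are not equal, which is why a
# continuous x axis places the bars differently from barplot()
diff(Botc1$Date)
```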
If all you want is a theme that matches the look of base graphics, adding + theme_bw() in ggplot2 will achieve this:
data(mtcars)
require(ggplot2)
ggplot(mtcars, aes(factor(cyl), mpg)) +
geom_boxplot() +
theme_bw()
Base graphics alternative:
boxplot(mpg~cyl,data=mtcars)
If, as you said, the only thing you want is a similar look, and you already have a working plot in ggplot2, theme_bw() should produce plots that look close to those from the standard plotting mechanism. If you feel so inclined, you can tweak minor details such as font sizes, the thickness of graph borders, or the display of outliers.

Color Dependent Bar Graph in R

I'm a bit out of my depth with this one here. I have the following code that generates two equally sized matrices:
MAX<-100
m<-5
n<-40
success<-matrix(runif(m*n,0,1),m,n)
samples<-floor(MAX*matrix(runif(m*n),m))+1
The success matrix is the probability of success and the samples matrix is the corresponding number of samples observed in each case. I'd like to make a bar graph that groups each column together, with the height determined by the success matrix. The color of each bar needs to be scaled from 1 to MAX according to the number of observations (e.g., small samples would be more red, whereas large samples would be green).
Any ideas?
Here is an example with ggplot. First, get data into long format with melt:
library(reshape2)
data.long <- cbind(melt(success), melt(samples)[3])
names(data.long) <- c("group", "x", "success", "count")
head(data.long)
# group x success count
# 1 1 1 0.48513473 8
# 2 2 1 0.56583802 58
# 3 3 1 0.34541582 40
# 4 4 1 0.55829073 64
# 5 5 1 0.06455401 37
# 6 1 2 0.88928606 78
Note melt will iterate through the row/column combinations of both matrices the same way, so we can just cbind the resulting molten data frames. The [3] after the second melt is so we don't end up with repeated group and x values (we only need the counts from the second melt). Now let ggplot do its thing:
library(ggplot2)
ggplot(data.long, aes(x=x, y=success, group=group, fill=count)) +
geom_bar(position="stack", stat="identity") +
scale_fill_gradient2(
low="red", mid="yellow", high="green",
midpoint=mean(data.long$count)
)
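The same long format can also be assembled in base R, which makes the alignment between the two matrices explicit. This is a sketch of an alternative to the melt() call, not what the answer itself uses; the seed is added only for reproducibility.

```r
MAX <- 100
m <- 5
n <- 40
set.seed(1)
success <- matrix(runif(m * n, 0, 1), m, n)
samples <- floor(MAX * matrix(runif(m * n), m)) + 1

# as.vector() walks a matrix column by column; row() and col() give the
# matching indices, so all four columns stay aligned
data.long <- data.frame(group   = as.vector(row(success)),
                        x       = as.vector(col(success)),
                        success = as.vector(success),
                        count   = as.vector(samples))
head(data.long)
```

Each row of `data.long` then describes one cell of the original matrices, with `group` as the row index and `x` as the column index.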
Using @BrodieG's data.long, this plot might be a little easier to interpret.
library(ggplot2)
library(RColorBrewer) # for brewer.pal(...)
ggplot(data.long) +
geom_bar(aes(x=x, y=success, fill=count),colour="grey70",stat="identity")+
scale_fill_gradientn(colours=brewer.pal(9,"RdYlGn")) +
facet_grid(group~.)
Note that actual values are probably different because you use random numbers in your sample. In future, consider using set.seed(n) to generate reproducible random samples.
Edit [Response to OP's comment]
You get numbers for x-axis and facet labels because you start with matrices instead of data.frames. So convert success and samples to data.frames, set the column names to whatever your test names are, and prepend a group column with the "list of factors". Converting to long format is a little different now because the first column has the group names.
library(reshape2)
set.seed(1)
success <- data.frame(matrix(runif(m*n,0,1),m,n))
success <- cbind(group=rep(paste("Factor",1:nrow(success),sep=".")),success)
samples <- data.frame(floor(MAX*matrix(runif(m*n),m))+1)
samples <- cbind(group=success$group,samples)
data.long <- cbind(melt(success,id=1), melt(samples, id=1)[3])
names(data.long) <- c("group", "x", "success", "count")
One way to set a threshold color is to add a column to data.long and use that for fill:
threshold <- 25
data.long$fill <- with(data.long,ifelse(count>threshold,max(count),count))
Putting it all together:
library(ggplot2)
library(RColorBrewer)
ggplot(data.long) +
geom_bar(aes(x=x, y=success, fill=fill),colour="grey70",stat="identity")+
scale_fill_gradientn(colours=brewer.pal(9,"RdYlGn")) +
facet_grid(group~.)+
theme(axis.text.x=element_text(angle=-90,hjust=0,vjust=0.4))
Finally, when you have names for the x-axis labels they tend to get jammed together, so I rotated the names -90°.

plotting aggregate data with ggplot

I have data like this
subject<-1:208
ev<-runif(208, min=1, max=2)
seeds<-gl(6,40,labels=c('seed1', 'seed2','seed3','seed4','seed5','seed6'),length=208)
ngambles<-gl(2,1, labels=c('4','32'))
trial<-rep(1:20, each= 2, length=208)
ngambles<-rep(c('4','32'), each=1, length=208)
data<-data.frame(subject,ev,seeds,ngambles,trial)
the data looks like this
subject ev seeds ngambles trial
1 1.996717 seed1 4 1
2 1.280977 seed1 32 1
3 1.571648 seed1 4 2
4 1.153311 seed1 32 2
5 1.502559 seed1 4 3
6 1.644001 seed1 32 3
I plot a graph with trial on the x axis and ev (the expected value) on the y axis for each combination of seeds and ngambles with this command.
qplot(trial,ev,data=data,
facets=ngambles~seeds,xlab="Trial", ylab="Expected Value", geom="line")+
ggtitle("Expected Value for Each Seed")  # opts() has been removed from ggplot2
Now I want to draw a new graph that aggregates ev over trials 1-5, 6-10, 11-15, and 16-20. I also want to draw error bars.
I have no clue how to do this in R.
Maybe somebody can help me.
Thanks in advance.
Assume that your data frame is called df. First, add a new column ag that shows which interval each original trial value belongs to, using the function cut().
df$ag<-cut(df$trial,c(1,6,11,16,21),right=FALSE)
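To see what cut() does with these breaks, here is a small sketch on a handful of made-up trial values. With right=FALSE each interval includes its left edge and excludes its right edge, so trial 5 falls in the first group and trial 6 in the second.

```r
# Two illustrative trial values from each intended group
trial <- c(1, 5, 6, 10, 11, 15, 16, 20)
ag <- cut(trial, breaks = c(1, 6, 11, 16, 21), right = FALSE)

levels(ag)   # the four interval labels
table(ag)    # how many values fall in each interval
```

The default labels are interval notation, which is why the plots below relabel the axis with scale_x_discrete().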
Now there are two possibilities. The first is to aggregate your data using the stat_*() functions of ggplot2. The stat_summary() function is already defined; you should also define a stat_sum_df() function (taken from the stat_summary() help file) to calculate more than one summary value.
stat_sum_df <- function(fun, geom="crossbar", ...) {
stat_summary(fun.data=fun, colour="red", geom=geom, width=0.2, ...)
}
Use stat_sum_df() with the argument "mean_cl_normal" to calculate confidence intervals for geom="errorbar" (mean_cl_normal needs the Hmisc package to be installed), and stat_summary() to calculate the mean for geom="line". Use the new column ag as the x value. With scale_x_discrete() you can set the right labels for the x axis.
ggplot(df, aes(ag,ev,group=seeds))+stat_sum_df("mean_cl_normal",geom="errorbar")+
stat_summary(fun="mean",geom="line",color="red")+  # fun replaces the deprecated fun.y (ggplot2 >= 3.3.0)
facet_grid(ngambles~seeds)+
scale_x_discrete(labels=c("1-5","6-10","11-15","16-20"))
The second approach is to summarize the data before plotting, for example with the function ddply() from the plyr package. In this case you also need the column ag created in the first example. Then use the new data for plotting.
library(plyr)
df.new<-ddply(df,.(ag,seeds,ngambles),summarise,ev.m=mean(ev),
ev.lim=qt(0.975,length(ev)-1)*sd(ev)/sqrt(length(ev)))
ggplot(df.new,aes(ag,group=seeds))+
geom_errorbar(aes(y=ev.m,ymin=ev.m-ev.lim,ymax=ev.m+ev.lim))+
geom_line(aes(y=ev.m))+
facet_grid(ngambles~seeds)+
scale_x_discrete(labels=c("1-5","6-10","11-15","16-20"))
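The ev.lim expression above is the half-width of a 95% confidence interval for a mean based on the t distribution. As a sketch, it can be wrapped in a small helper and checked against t.test(); the function name `ci_half_width` and the sample values are illustrative.

```r
# Half-width of a t-based confidence interval for the mean of v
ci_half_width <- function(v, level = 0.95) {
  n <- length(v)
  qt(1 - (1 - level) / 2, df = n - 1) * sd(v) / sqrt(n)
}

set.seed(1)
ev <- runif(10, min = 1, max = 2)

# Interval built by hand, as in the ddply() call above
by_hand <- mean(ev) + c(-1, 1) * ci_half_width(ev)
by_hand

# The same interval as reported by t.test()
as.vector(t.test(ev)$conf.int)
```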

How do you apply the pch parameter in R to individual points in a scatter plot?

I am interested in changing the symbol used to represent the two most influential points in my scatter plot. In this case, they are rows 19 and 20 in the data frame. The code I have is as follows:
data1<-read.csv("data1.csv")
plot(h~w,data=data1,xlab="Weight",ylab="Height",
main="Scatterplot of H vs W",pch=c(17,19)[data1[c(19,20),]])
Obviously, I cannot get this to work despite several suggestions and hours of trying to figure this out. Any suggestions would be appreciated.
The pch symbol is used for each data point and gets recycled to the number of points you are plotting.
Consider this example
x <- 1:10 + rnorm(10)
y <- 1:10
plot( y ~ x )
The default is pch = 1 and it gets recycled to be used for each point.
Contrast that with:
plot( y ~ x , pch = rep(c(1,2),each=5))
You get the first five points with one symbol and the next five with another, because you have made a vector of values for pch that specifies the plotting symbol for each of the 10 values being plotted:
rep(c(1,2),each=5)
#[1] 1 1 1 1 1 2 2 2 2 2
In your case, all you need to do is
plot(h~w,data=data1,xlab="Weight",ylab="Height",
main="Scatterplot of H vs W",pch=c(rep(1,times=18),17,19) )
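The vector above has exactly 20 entries, so it fits a data frame with 20 rows. A slightly more general sketch, which works for any number of rows, builds a default pch vector and overwrites only the rows of interest. The mock `data1` below is made up for illustration, since data1.csv is not available.

```r
# Mock data standing in for data1.csv (25 rows, columns w and h)
set.seed(1)
data1 <- data.frame(w = runif(25, 50, 100))
data1$h <- 100 + 0.8 * data1$w + rnorm(25, sd = 5)

# Default symbol for every row, then special symbols for rows 19 and 20
pch <- rep(1, nrow(data1))
pch[c(19, 20)] <- c(17, 19)

plot(h ~ w, data = data1, xlab = "Weight", ylab = "Height",
     main = "Scatterplot of H vs W", pch = pch)
```

Indexing by row number keeps the symbols tied to the right observations even if rows are added later.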
