I'd like to create a bar chart using factors and more than two variables! My data looks like this:
Var1 Var2 ... VarN Factor1 Factor2
Obs1 1-5 1-5 ... 1-5
Obs2 1-5 1-5 ... ...
Obs3 ... ... ... ...
Each datapoint is a likert item ranging from 1-5
Plotting total sums using a dichotomized version (every item above 4 is a one, else 0)
I converted the data using this
MyDataFrame = dichotomize(MyDataFrame,>=4)
p <- colSums(MyDataFrame)
p <- data.frame(names(p),p)
names(p) <- c("var","value")
ggplot(p,aes(var,value)) + geom_bar() + coord_flip()
Doing this i loose the information provided by factor1 etc, i'd like to use stacking in order to visualize from which group of people the rating came
Is there a elegant solution to this problem? I read about using reshape to melt the data and then applying ggplot?
I would suggest the following: use one of your factor for stacking, the other one for faceting. You can remove position="fill" to geom_bar() to use counts instead of standardized values.
my.df <- data.frame(replicate(10, sample(1:5, 100, rep=TRUE)),
F1=gl(4, 5, 100, labels=letters[1:4]),
F2=gl(2, 50, labels=c("+","-")))
my.df[,1:10] <- apply(my.df[,1:10], 2, function(x) ifelse(x>4, 1, 0))
library(reshape2)
my.df.melt <- melt(my.df)
library(plyr)
res <- ddply(my.df.melt, c("F1","F2","variable"), summarize, sum=sum(value))
library(ggplot2)
ggplot(res, aes(y=sum, x=variable, fill=F1)) +
geom_bar(stat="identity", position="fill") +
coord_flip() +
facet_grid(. ~ F2) +
ylab("Percent") + xlab("Item")
In the above picture, I displayed observed frequencies of '1' (value above 4 on the Likert scale) for each combination of F1 (four levels) and F2 (two levels), where there are either 10 or 15 observations:
> xtabs(~ F1 + F2, data=my.df)
F2
F1 + -
a 15 10
b 15 10
c 10 15
d 10 15
I then computed conditional item sum scores with ddply,† using a 'melted' version of the original data.frame. I believe the remaining graphical commands are highly configurable, depending on what kind of information you want to display.
† In this simplified case, the ddply instruction is equivalent to with(my.df.melt, aggregate(value, list(F1=F1, F2=F2, variable=variable), sum)).
Related
I've tried to search for an answer, but can't seem to find the right one that does the job for me.
I have a dataset (data) with two variables: people's ages (age) and number of awards (awards)
My objective is to plot the number of awards against age in R. FYI, a person can have multiple awards and people can have the same age.
I tried to plot a histogram and barplot, but the problem with that is that it counts the number of observations instead of summing the number of awards.
A sample dataset:
age <- c(21,22,22,25,30,34,45,26,37,46,49,21)
awards <- c(0,3,2,1,0,0,1,3,1,1,1,1)
data <- data.frame(cbind(age,awards))
What I'm looking for is a histogram (or barplot) that represents this data.
Ideally, I'd want the ages to be split into age groups. For example,
20-30, 31-40, 41-50 and then the total number of awards for each group.
The age group would be on the x-axis and the total number of awards for each age group would be on the y-axis.
Thanks!
We can use the aggregate function and then use the ggplot2 package. I don't make too many barplots in base R these days so I'm not sure of the best way to do it without loading ggplot2:
create sample data
#data
set.seed(123)
dat <- data.frame(age = sample(20:50, 200, replace = TRUE),
awards = rpois(200, 3))
head(dat)
age awards
1 28 2
2 44 6
3 32 3
4 47 3
5 49 2
6 21 5
By age
#aggregate
sum_by_age <- aggregate(awards ~ age, data = dat, FUN = sum)
library(ggplot2)
ggplot(sum_by_age, aes(x = age, y = awards))+
geom_bar(stat = 'identity')
By age group
#create groups
dat$age_group <- ifelse(dat$age <= 30, '20-30',
ifelse(dat$age <= 40, '30-40',
'41 +'))
sum_by_age_group <- aggregate(awards ~ age_group, data = dat, FUN = sum)
ggplot(sum_by_age_group, aes(x = age_group, y = awards))+
geom_bar(stat = 'identity')
Note
We could skip the aggregate step altogether and just use:
ggplot(dat, aes(x = age, y = awards)) + geom_bar(stat = 'identity')
but I don't prefer that way because I think having an intermediate data step may be useful within your analytical pipeline for comparisons other than visualizing.
For completeness, I am adding the base R solution to #bouncyball's great answer. I will use their synthetic data, but I will use cut to create the age groups before aggregation.
# Creates data for plotting
> set.seed(123)
> dat <- data.frame(age = sample(20:50, 200, replace = TRUE),
awards = rpois(200, 3))
# Created a new column containing the age groups
> dat[["ageGroups"]] <- cut(dat[["age"]], c(-Inf, 20, 30, 40, Inf),
right = FALSE)
cut will divide up a set of numeric data based on breaks defined in the second argument. right = FALSE flips the breaks so values the groups would include the lower values rather than the upper ones (ie 20 <= x < 30 rather than the default of 20 < x <= 30). The groups do not have to be equally spaced. If you do not want to include data above or below a certain value, simply remove the Inf from the end or -Inf from the beginning respectively, and the function will return <NA> instead. If you would like to give your groups names, you can do so with the labels argument.
Now we can aggregate based on the groups we created.
> (summedGroups <- aggregate(awards ~ ageGroups, dat, FUN = sum))
ageGroups awards
1 [20,30) 188
2 [30,40) 212
3 [40, Inf) 194
Finally, we can plot these data using the barplot function. The key here is to use names for the age groups.
> barplot(summedGroups[["awards"]], names = summedGroups[["ageGroups"]])
I've tried to search for an answer, but can't seem to find the right one that does the job for me.
I have a dataset (data) with two variables: people's ages (age) and number of awards (awards)
My objective is to plot the number of awards against age in R. FYI, a person can have multiple awards and people can have the same age.
I tried to plot a histogram and barplot, but the problem with that is that it counts the number of observations instead of summing the number of awards.
A sample dataset:
age <- c(21,22,22,25,30,34,45,26,37,46,49,21)
awards <- c(0,3,2,1,0,0,1,3,1,1,1,1)
data <- data.frame(cbind(age,awards))
What I'm looking for is a histogram (or barplot) that represents this data.
Ideally, I'd want the ages to be split into age groups. For example,
20-30, 31-40, 41-50 and then the total number of awards for each group.
The age group would be on the x-axis and the total number of awards for each age group would be on the y-axis.
Thanks!
We can use the aggregate function and then use the ggplot2 package. I don't make too many barplots in base R these days so I'm not sure of the best way to do it without loading ggplot2:
create sample data
#data
set.seed(123)
dat <- data.frame(age = sample(20:50, 200, replace = TRUE),
awards = rpois(200, 3))
head(dat)
age awards
1 28 2
2 44 6
3 32 3
4 47 3
5 49 2
6 21 5
By age
#aggregate
sum_by_age <- aggregate(awards ~ age, data = dat, FUN = sum)
library(ggplot2)
ggplot(sum_by_age, aes(x = age, y = awards))+
geom_bar(stat = 'identity')
By age group
#create groups
dat$age_group <- ifelse(dat$age <= 30, '20-30',
ifelse(dat$age <= 40, '30-40',
'41 +'))
sum_by_age_group <- aggregate(awards ~ age_group, data = dat, FUN = sum)
ggplot(sum_by_age_group, aes(x = age_group, y = awards))+
geom_bar(stat = 'identity')
Note
We could skip the aggregate step altogether and just use:
ggplot(dat, aes(x = age, y = awards)) + geom_bar(stat = 'identity')
but I don't prefer that way because I think having an intermediate data step may be useful within your analytical pipeline for comparisons other than visualizing.
For completeness, I am adding the base R solution to #bouncyball's great answer. I will use their synthetic data, but I will use cut to create the age groups before aggregation.
# Creates data for plotting
> set.seed(123)
> dat <- data.frame(age = sample(20:50, 200, replace = TRUE),
awards = rpois(200, 3))
# Created a new column containing the age groups
> dat[["ageGroups"]] <- cut(dat[["age"]], c(-Inf, 20, 30, 40, Inf),
right = FALSE)
cut will divide up a set of numeric data based on breaks defined in the second argument. right = FALSE flips the breaks so values the groups would include the lower values rather than the upper ones (ie 20 <= x < 30 rather than the default of 20 < x <= 30). The groups do not have to be equally spaced. If you do not want to include data above or below a certain value, simply remove the Inf from the end or -Inf from the beginning respectively, and the function will return <NA> instead. If you would like to give your groups names, you can do so with the labels argument.
Now we can aggregate based on the groups we created.
> (summedGroups <- aggregate(awards ~ ageGroups, dat, FUN = sum))
ageGroups awards
1 [20,30) 188
2 [30,40) 212
3 [40, Inf) 194
Finally, we can plot these data using the barplot function. The key here is to use names for the age groups.
> barplot(summedGroups[["awards"]], names = summedGroups[["ageGroups"]])
Let's say I have a histogram with two overlapping groups. Here's a possible command from ggplot2 and a pretend output graph.
ggplot2(data, aes(x=Variable1, fill=BinaryVariable)) + geom_histogram(position="identity")
So what I have is the frequency or count of each event. What I'd like to do instead is to get the difference between the two events in each bin. Is this possible? How?
For example, if we do RED minus BLUE:
Value at x=2 would be ~ -10
Value at x=4 would be ~ 40 - 200 = -160
Value at x=6 would be ~ 190 - 25 = 155
Value at x=8 would be ~ 10
I'd prefer to do this using ggplot2, but another way would be fine. My dataframe is set up with items like this toy example (dimensions are actually 25000 rows x 30 columns) EDITED: Here is example data to work with GIST Example
ID Variable1 BinaryVariable
1 50 T
2 55 T
3 51 N
.. .. ..
1000 1001 T
1001 1944 T
1002 1042 N
As you can see from my example, I'm interested in a histogram to plot Variable1 (a continuous variable) separately for each BinaryVariable (T or N). But what I really want is the difference between their frequencies.
So, in order to do this we need to make sure that the "bins" we use for the histograms are the same for both levels of your indicator variable. Here's a somewhat naive solution (in base R):
df = data.frame(y = c(rnorm(50), rnorm(50, mean = 1)),
x = rep(c(0,1), each = 50))
#full hist
fullhist = hist(df$y, breaks = 20) #specify more breaks than probably necessary
#create histograms for 0 & 1 using breaks from full histogram
zerohist = with(subset(df, x == 0), hist(y, breaks = fullhist$breaks))
oneshist = with(subset(df, x == 1), hist(y, breaks = fullhist$breaks))
#combine the hists
combhist = fullhist
combhist$counts = zerohist$counts - oneshist$counts
plot(combhist)
So we specify how many breaks should be used (based on values from the histogram on the full data), and then we compute the differences in the counts at each of those breaks.
PS It might be helpful to examine what the non-graphical output of hist() is.
Here's a solution that uses ggplot as requested.
The key idea is to use ggplot_build to get the rectangles computed by stat_histogram. From that you can compute the differences in each bin and then create a new plot using geom_rect.
setup and create a mock dataset with lognormal data
library(ggplot2)
library(data.table)
theme_set(theme_bw())
n1<-500
n2<-500
k1 <- exp(rnorm(n1,8,0.7))
k2 <- exp(rnorm(n2,10,1))
df <- data.table(k=c(k1,k2),label=c(rep('k1',n1),rep('k2',n2)))
Create the first plot
p <- ggplot(df, aes(x=k,group=label,color=label)) + geom_histogram(bins=40) + scale_x_log10()
Get the rectangles using ggplot_build
p_data <- as.data.table(ggplot_build(p)$data[1])[,.(count,xmin,xmax,group)]
p1_data <- p_data[group==1]
p2_data <- p_data[group==2]
Join on the x-coordinates to compute the differences. Note that the y-values aren't the counts, but the y-coordinates of the first plot.
newplot_data <- merge(p1_data, p2_data, by=c('xmin','xmax'), suffixes = c('.p1','.p2'))
newplot_data <- newplot_data[,diff:=count.p1 - count.p2]
setnames(newplot_data, old=c('y.p1','y.p2'), new=c('k1','k2'))
df2 <- melt(newplot_data,id.vars =c('xmin','xmax'),measure.vars=c('k1','diff','k2'))
make the final plot
ggplot(df2, aes(xmin=xmin,xmax=xmax,ymax=value,ymin=0,group=variable,color=variable)) + geom_rect()
Of course the scales and legends still need to be fixed, but that's a different topic.
I have two data frames z (1 million observations) and b (500k observations).
z= Tracer time treatment
15 0 S
20 0 S
25 0 X
04 0 X
55 15 S
16 15 S
15 15 X
20 15 X
b= Tracer time treatment
2 0 S
35 0 S
10 0 X
04 0 X
20 15 S
11 15 S
12 15 X
25 15 X
I'd like to create grouped boxplots using time as a factor and treatment as colour. Essentially I need to bind them together and then differentiate between them but not sure how. One way I tried was using:
zz<-factor(rep("Z", nrow(z))
bb<-factor(rep("B",nrow(b))
dumB<-merge(z,zz) #this won't work because it says it's too big
dumB<-merge(b,zz)
total<-rbind(dumB,dumZ)
But z and zz merge won't work because it says it's 10G in size (which can't be right)
The end plot might be similar to this example: Boxplot with two levels and multiple data.frames
Any thoughts?
Cheers,
EDIT: Added boxplot
I would approach it as follows:
# create a list of your data.frames
l <- list(z,b)
# assign names to the dataframes in the list
names(l) <- c("z","b")
# bind the dataframes together with rbindlist from data.table
# the id parameter will create a variable with the names of the dataframes
# you could also use 'bind_rows(l, .id="id")' from 'dplyr' for this
library(data.table)
zb <- rbindlist(l, id="id")
# create the plot
ggplot(zb, aes(x=factor(time), y=Tracer, color=treatment)) +
geom_boxplot() +
facet_wrap(~id) +
theme_bw()
which gives:
Other alternatives for creating your plot:
# facet by 'time'
ggplot(zb, aes(x=id, y=Tracer, color=treatment)) +
geom_boxplot() +
facet_wrap(~time) +
theme_bw()
# facet by 'time' & color by 'id' instead of 'treatment'
ggplot(zb, aes(x=treatment, y=Tracer, color=id)) +
geom_boxplot() +
facet_wrap(~time) +
theme_bw()
In respons to your last comment: to get everything in one plot, you use interaction to distinguish between the different groupings as follows:
ggplot(zb, aes(x=treatment, y=Tracer, color=interaction(id, time))) +
geom_boxplot(width = 0.7, position = position_dodge(width = 0.7)) +
theme_bw()
which gives:
The key is you do not need to perform a merge, which is computationally expensive on large tables. Instead assign a new variable and value (source c(b,z) in my code below) to each dataframe and then rbind. Then it becomes straight forward, my solution is very similar to #Jaap's just with different faceting.
library(ggplot2)
#Create some mock data
t<-seq(1,55,by=2)
z<-data.frame(tracer=sample(t,size = 10,replace = T), time=c(0,15), treatment=c("S","X"))
b<-data.frame(tracer=sample(t,size = 10,replace = T), time=c(0,15), treatment=c("S","X"))
#Add a variable to each table to id itself
b$source<-"b"
z$source<-"z"
#concatenate the tables together
all<-rbind(b,z)
ggplot(all, aes(source, tracer, group=interaction(treatment,source), fill=treatment)) +
geom_boxplot() + facet_grid(~time)
I would like some help coloring a ggplot2 histogram generated from already-summarized count data.
The data are something like counts of # males and # females living in a number of different areas. It's easy enough to plot the histogram for the total counts (i.e. males + females):
set.seed(1)
N=100;
X=data.frame(C1=rnbinom(N,15,0.1), C2=rnbinom(N,15,0.1),C=rep(0,N));
X$C=X$C1+X$C2;
ggplot(X,aes(x=C)) + geom_histogram()
However, I'd like to color each bar according to the relative contribution from C1 and C2, so that I get the same histogram (i.e. overall bar heights) as in the above example, plus I see the proportion of type "C1" and "C2" individuals as in a stacked bar chart.
Suggestions for a clean way to do this with ggplot2, using data like "X" in the example?
Very quickly, you can do what the OP wants using the stat="identity" option and the plyr package to manually calculate the histogram, like so:
library(plyr)
X$mid <- floor(X$C/20)*20+10
X_plot <- ddply(X, .(mid), summarize, total=length(C), split=sum(C1)/sum(C)*length(C))
ggplot(data=X_plot) + geom_histogram(aes(x=mid, y=total), fill="blue", stat="identity") + geom_histogram(aes(x=mid, y=split), fill="deeppink", stat="identity")
We basically just make a 'mids' column for how to locate the columns and then make two plots: one with the count for the total (C) and one with the columns adjusted to the count of one of the columns (C1). You should be able to customize from here.
Update 1: I realized I made a small error in calculating the mids. Fixed now. Also, I don't know why I used a 'ddply' statement to calculate the mids. That was silly. The new code is clearer and more concise.
Update 2: I returned to view a comment and noticed something slightly horrifying: I was using sums as the histogram frequencies. I have cleaned up the code a little and also added suggestions from the comments concerning the coloring syntax.
Here's a hack using ggplot_build. The idea is to first get your old/original plot:
p <- ggplot(data = X, aes(x=C)) + geom_histogram()
stored in p. Then, use ggplot_build(p)$data[[1]] to extract the data, specifically, the columns xmin and xmax (to get the same breaks/binwidths of histogram) and count column (to normalize the percentage by count. Here's the code:
# get old plot
p <- ggplot(data = X, aes(x=C)) + geom_histogram()
# get data of old plot: cols = count, xmin and xmax
d <- ggplot_build(p)$data[[1]][c("count", "xmin", "xmax")]
# add a id colum for ddply
d$id <- seq(nrow(d))
How to generate data now? What I understand from your post is this. Take for example the first bar in your plot. It has a count of 2 and it extends from xmin = 147 to xmax = 156.8. When we check X for these values:
X[X$C >= 147 & X$C <= 156.8, ] # count = 2 as shown below
# C1 C2 C
# 19 91 63 154
# 75 86 70 156
Here, I compute (91+86)/(154+156)*(count=2) = 1.141935 and (63+70)/(154+156) * (count=2) = 0.8580645 as the two normalised values for each bar we'll generate.
require(plyr)
dd <- ddply(d, .(id), function(x) {
t <- X[X$C >= x$xmin & X$C <= x$xmax, ]
if(nrow(t) == 0) return(c(0,0))
p <- colSums(t)[1:2]/colSums(t)[3] * x$count
})
# then, it just normal plotting
require(reshape2)
dd <- melt(dd, id.var="id")
ggplot(data = dd, aes(x=id, y=value)) +
geom_bar(aes(fill=variable), stat="identity", group=1)
And this is the original plot:
And this is what I get:
Edit: If you also want to get the breaks proper, then, you can get the corresponding x coordinates from the old plot and use it here instead of id:
p <- ggplot(data = X, aes(x=C)) + geom_histogram()
d <- ggplot_build(p)$data[[1]][c("count", "x", "xmin", "xmax")]
d$id <- seq(nrow(d))
require(plyr)
dd <- ddply(d, .(id), function(x) {
t <- X[X$C >= x$xmin & X$C <= x$xmax, ]
if(nrow(t) == 0) return(c(x$x,0,0))
p <- c(x=x$x, colSums(t)[1:2]/colSums(t)[3] * x$count)
})
require(reshape2)
dd.m <- melt(dd, id.var="V1", measure.var=c("V2", "V3"))
ggplot(data = dd.m, aes(x=V1, y=value)) +
geom_bar(aes(fill=variable), stat="identity", group=1)
How about:
library("reshape2")
mm <- melt(X[,1:2])
ggplot(mm,aes(x=value,fill=variable))+geom_histogram(position="stack")