Stacked barplot using ggplot2 - r

I am a newbie with R (we are using it at university for marketing research).
So, here's the thing:
I want to create a stacked barplot using ggplot2 that shows three rectangles, in three different colours, for every variable on the x axis. On the y axis I would like percentages, but frequencies are fine too.
I've been googling for three days before posting, I swear. I'm not a programmer, so even if I found what I needed, I may not have recognized it :/
These are my attempts:
require(ggplot2)
head(dati)
ggplot(dati, aes(B4_1,B4_2,B4_3)) + geom_bar(position="dodge", stat="identity")
(but of course the result isn't right, because it puts the B4_1 var on the x axis and the B4_2 var on the y axis)
so I tried:
ggplot(dati, aes(B4_1)) + geom_bar(position="dodge",binwidth=x)
because I thought it would at least give me the roots! But it says that x wasn't found.
I tried messing around with this code but it keeps giving me different errors, like that x wasn't found or that the bin was wrong. I honestly tried, and I can post my browser history from the last two days if anyone thinks I'm just asking you to do my work for me!
Thank you for the support, any help will be VERY appreciated! happy new year!

I need to make some assumptions because the structure of your data is not entirely clear. What I'm pretty sure of is that you are sampling pairs with two values:
x
something that can be either B4_1, B4_2, B4_3
I'm basing this on your comment about frequencies/counts. If B4_* are actual variables with values, then the answer will be different but will almost certainly involve "melting" your data into long format.
So, with that:
# Make up data
library(ggplot2)
data <- data.frame(
  x = sample(c("a", "b", "c"), 150, replace = TRUE),
  lab = sample(c("B4_1", "B4_2", "B4_3"), 150, replace = TRUE)
)
# Plot: geom_bar() counts the rows for each x and stacks them by lab
ggplot(data, aes(x = x, fill = lab)) + geom_bar()
Notice how I only specify the x values, and then geom_bar automatically counts them for me. You definitely don't want position=dodge.
Either way, you really should at a minimum post the results of dput(head(dati)) so people can provide better answers.
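In case B4_1, B4_2 and B4_3 really are separate columns holding answers, here is a rough sketch of the "melt, then plot" route, with percentages on the y axis. The data frame below is made up (I am assuming each column holds an answer coded 1 to 3), and the names long, item and answer are just for illustration. position="fill" rescales every bar to 100%, and scales::percent only formats the axis labels.

```r
library(ggplot2)
library(reshape2)

# Hypothetical wide data: one row per respondent, one column per question
set.seed(42)
dati <- data.frame(
  B4_1 = sample(1:3, 50, replace = TRUE),
  B4_2 = sample(1:3, 50, replace = TRUE),
  B4_3 = sample(1:3, 50, replace = TRUE)
)

# Long format: one row per (question, answer) pair
long <- melt(dati, measure.vars = c("B4_1", "B4_2", "B4_3"),
             variable.name = "item", value.name = "answer")

# One bar per question, stacked by answer, each bar rescaled to 100%
p <- ggplot(long, aes(x = item, fill = factor(answer))) +
  geom_bar(position = "fill") +
  scale_y_continuous(labels = scales::percent) +
  labs(y = "Share of answers", fill = "Answer")
print(p)
```

Dropping position = "fill" and the scale_y_continuous() line gives the same stacked bars with raw counts instead.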

Related

How hist() handles zeroes

first-time poster, relative R newbie (I've taken a few Coursera classes over the past two years). I've found some behavior with the hist() function I don't really understand; apologies if this has already been answered, but I couldn't find anything.
I read an article saying there were 1,280 police killings in the U.S. last year and only 14 days when they didn't happen. I thought 14 sounded low, so I decided to see what random numbers would look like. (Should have just done ppois, but bear with me.) I don't have enough reputation to post images, but if you run this code it sure looks like 40-some zeros.
set.seed(0)
hist(rpois(365, lambda = 1280/365))
But when I run
sum(rpois(365, lambda = 1280/365) == 0)
I always get somewhere around 10, which is the correct answer (and in line with the article). And if I plot it in ggplot instead, I get an extra bar that goes up to around 10, properly placed between 0 and 1, with the other bars shifted to the right:
set.seed(0)
randomnos <- as.data.frame(rpois(365, lambda = 1280/365))
colnames(randomnos) <- "Numbers"
ggplot(randomnos, aes(Numbers)) + geom_histogram(binwidth = 1)
Apparently the hist() function (A) isn't plotting the zeroes and (B) is putting the 1's between 0 and 1 on the X axis. ggplot behaves differently. What gives?
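A likely explanation, offered as a sketch and not a definitive answer: hist() builds right-closed bins by default (right = TRUE), and with include.lowest = TRUE the first bin also takes in its left edge. If the automatic breaks land on 0, 1, 2, ..., the first bar therefore counts the 0s and the 1s together. Giving hist() half-integer breaks puts every integer in its own bin. The names x, h_default and h_centred are mine.

```r
set.seed(0)
x <- rpois(365, lambda = 1280/365)

# Default breaks: the first bin is closed on both sides, so it can
# hold more than one integer value
h_default <- hist(x)

# Half-integer breaks: each integer 0, 1, 2, ... sits alone in its bin
h_centred <- hist(x, breaks = seq(-0.5, max(x) + 0.5, by = 1))

# Compare the first bar of each histogram with the actual number of zeros
h_default$counts[1]
h_centred$counts[1]
sum(x == 0)
```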

R: how to make multiple plots from one CSV, grouping by a column

I'd like to put multiple plots onto a single visual output in R, based on data that I have in a CSV that looks something like this:
user,size,time
fred,123,0.915022
fred,321,0.938769
fred,1285,1.185608
wilma,5146,2.196687
fred,7506,1.181990
barney,5146,1.860287
wilma,1172,1.158015
barney,5146,1.219313
wilma,13185,1.455904
wilma,8754,1.381372
wilma,878,1.216908
barney,2974,1.223852
I can read this just fine, using, e.g.:
data = read.csv('data.csv')
For the moment, a fairly simple plot is fine, so I'm just trying plot(), without much to it (setting type='o' to get lines and points), and from solving a past problem, I know that I can do, e.g., the following, to get data for just fred:
plot(data$time[which(data$user == 'fred')], data$size[which(data$user == 'fred')], type='o')
What I'd like, though, is to have the data for each user all showing up on one set of axes, with color coding (and a legend to match users to colors) to identify different user data.
And if another user shows up, I'd like another line to show up, with another color (perhaps recycling if I have too many users at once).
However, just this doesn't do it:
plot(data$size, data$time, type='o',col=c("red", "blue", "green"))
Because it doesn't seem to group by the user.
And just this:
plot(data, type='o')
gives me an error:
Error in plot.default(...) :
formal argument "type" matched by multiple actual arguments
This:
plot(data)
does do something, but not what I want.
I've poked around, but I'm new enough to R that I'm not quite sure how best to search for this, nor where to look for examples that would hit a use-case like this.
I even got somewhat closer with this:
plot(data$size[which(data$user == 'wilma')], data$time[which(data$user == 'wilma')], type='o', col=c('red'))
lines(data$size[which(data$user == 'fred')], data$time[which(data$user == 'fred')], type='o', col=c('green'))
lines(data$size[which(data$user == 'barney')], data$time[which(data$user == 'barney')], type='o', col=c('blue'))
This gives me a plot (which I'd post inline, but as a new user, I'm not allowed to yet):
not-quite-right plot
which is kind of close to what I want, except that it:
doesn't have a legend
has ugly axis labels, instead of just time and size
is scaled to the first plot, and thus is missing data from some of the others
isn't sorted by x-axis, which I could do externally, though I'm guessing I could do it fairly easily in R.
So, the question, ultimately, is this:
What's an easy way to plot data like this which:
has multiple lines based on the labels in the first column of the CSV
uses the same set of axes for the data in columns 2 and 3, regardless of the label
has a legend and color-coding for which label is being used for a particular line (or set of points)
will adapt to adding new labels to the data file, hopefully without change to the R code.
Thanks in advance for any help or pointers on this.
P.S. I looked around for similar questions, and found one that's sort of close, but it's not quite the same, and I failed to figure out how to adapt it to what I'm trying to do.
Good question. This is doable in base plot, but it's even easier and more intuitive using ggplot2. Below is an example of how to do this with random data in ggplot2
First download and install the package
install.packages("ggplot2",repos='http://cran.us.r-project.org')
require(ggplot2)
Next generate the data
a <- c(rep('a',3),rep('b',3),rep('c',3))
b <- rnorm(9,50,30)
c <- rep(seq(1,3),3)
dat <- data.frame(a,b,c)
Finally, make the plot
ggplot(data=dat, aes(x=c, y=b , group=a, colour=a)) + geom_line() + geom_point()
Basically, you are telling ggplot that your x axis corresponds to the c column (dat$c), your y axis corresponds to the b column (dat$b) and to group (draw separate lines) by the a column (dat$a). Colour specifies that you want to group colour by the a column as well.
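Applied to the CSV from the question, the same idea might look like the sketch below. To keep it self-contained I read the rows from a string instead of data.csv; csv_text and p are names I made up. geom_line() connects points in order of x within each group, so no manual sorting is needed, and a new user in the file gets a new colour and legend entry without touching the code.

```r
library(ggplot2)

# The rows from the question, inlined so the example runs on its own
csv_text <- "user,size,time
fred,123,0.915022
fred,321,0.938769
fred,1285,1.185608
wilma,5146,2.196687
fred,7506,1.181990
barney,5146,1.860287
wilma,1172,1.158015
barney,5146,1.219313
wilma,13185,1.455904
wilma,8754,1.381372
wilma,878,1.216908
barney,2974,1.223852"
data <- read.csv(text = csv_text)

# One line per user on shared axes, with an automatic legend
p <- ggplot(data, aes(x = size, y = time, colour = user, group = user)) +
  geom_line() +
  geom_point()
print(p)
```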

Transformation of aesthetic inputs in R and ggplot2

Is there a way to transform data in ggplot2 in the aes declaration of a geom?
I have a plot conceptually similar to this one:
test=data.frame("k"=rep(1:3,3),"ce"=rnorm(9),"comp"=as.factor(sort(rep(1:3,3))))
plot=ggplot(test,aes(y=ce,x=k))+geom_line(aes(lty=comp))
Suppose I would like to add a line calculated as the maximum of the values between the three comp for each k point, with only the plot object available. I have tried several options (e.g. using aggregate in the aes declaration, stat_function, etc.) but I could not find a way to make this work.
At the moment I am working around the problem by extracting the data frame with ggplot_build, but I would like to find a direct solution.
Is
require(plyr)
max.line = ddply(test, .(k), summarise, ce = max(ce))
plot = ggplot(test, aes(y=ce,x=k))
plot = plot + geom_line(aes(lty=comp))
plot = plot + geom_line(data=max.line, color='red')
something like what you want?
Thanks to JLLagrange and jlhoward for your help. However both solutions require the access to the underlying data.frame, which I do not have. This is the workaround I am using, based on the previous example:
data=ggplot_build(plot)$data[[1]]
cemax=with(data,aggregate(y,by=list(x),max))
plot+geom_line(data=cemax,aes(x=Group.1,y=x),colour="green",alpha=.3,lwd=2)
This does not require direct access to the dataset, but to me it is a very inefficient and inelegant solution. Obviously if there is no other way to manipulate the data, I do not have much of a choice :)
EDIT (Response to OP's comment):
OK I see what you mean now - I should have read your question more carefully. You can achieve what you want using stat_summary(...), which does not require access to the original data frame. It also solves the problem I describe below (!!).
library(ggplot2)
set.seed(1)
test <- data.frame(k=rep(1:3,3),ce=rnorm(9),comp=factor(rep(1:3,each=3)))
plot <- ggplot(test,aes(y=ce,x=k))+geom_line(aes(lty=comp))
##
plot + stat_summary(fun=max, geom="line", col="red")  # ggplot2 >= 3.3.0; older versions use fun.y=max
Original Response (Requires access to original df)
One unfortunate characteristic of ggplot is that aggregating functions (like max, min, mean, sd, sum, and so on) used in aes(...) operate on the whole dataset, not subgroups. So
plot + geom_line(aes(y=max(ce)))
will not work - you get the maximum of all test$ce vs. k, which is not useful.
One way around this, which is basically the same as @JLLagrange's answer (but doesn't use external libraries), is:
plot+geom_line(data=aggregate(ce~k,test,max),colour="red")
This creates a data frame dynamically aggregating ce by k using the max function.

How to plot one column vs the rest in R

I have a data set where the [,1] is time and then the next 14 are magnitudes. I would like to scatter plot all the magnitudes vs time on one graph, where each different column is gridded (layered on top of one another)
I want to use the raw data to make these graphs and can make them separately, but would like to only have to do this process once.
data set called A, the only independent variable is time (the first column)
df<-data.frame(time=A[,1],V11=A[,2],V08=A[,3],
V21=A[,4],V04=A[,5],V22=A[,6],V23=A[,7],
V24=A[,8],V25=A[,9],V07=A[,10],xxx=A[,11],
V26=A[,12],PV2=A[,13],V27=A[,14],V28=A[,15],
NV1=A[,16])
I tried the code mentioned by @VlooO, but it scrunched the graphs, making them too hard to decipher, and each had its own axes. All my graphs can share the same axes, just separated by their headings.
Looking at ggplot2, I think it would be a perfect tool for what I want.
ggplot(data=df.melt,aes(x=time,y=???))
I am confused about what my y should be, since I want to reference each different column.
Thanks R community
Hope i understand you correctly:
df<-data.frame(time=rnorm(10),A=rnorm(10),B=rnorm(10),C=rnorm(10))
par(mfrow=c(length(df)-1,1))
sapply(2:length(df), function(x){
plot(df[,c(1,x)])
})
The result would be
Here are some hints, since you neither provide a reproducible example nor show what you have tried:
Use list.files to go through all your documents
Use lapply to loop over the result of the previous step and read your data
Put your data in the long format using melt from reshape2 and the variable time as id.
Use ggplot2 to plot using the variable as aes color/group.
library(ggplot2)
library(reshape2)
invisible(lapply(list.files(pattern=...), function(x) {  # pattern left elided, as in the original
  dt = read.table(x)
  dt.l = melt(dt, id.vars='time')
  print(ggplot(dt.l) + geom_line(aes(x=time, y=value, color=variable)))
}))
If you don't need ggplot2, then the matplot function for base graphics can be used to do what you want in one command.
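A minimal matplot sketch, assuming a data frame whose first column is time and whose remaining columns are the magnitudes. The data here is random and the column names A, B, C are placeholders.

```r
# Hypothetical data: time in the first column, magnitudes in the rest
set.seed(1)
df <- data.frame(time = 1:10, A = rnorm(10), B = rnorm(10), C = rnorm(10))

# Scatter every magnitude column against time on one set of axes
matplot(df$time, df[, -1], type = "p", pch = 1, col = 1:3,
        xlab = "time", ylab = "magnitude")
legend("topright", legend = names(df)[-1], col = 1:3, pch = 1)
```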
SOLUTION:
After looking through a bunch more problems and playing around a bit more with ggplot2 I found a code that works pretty great. After I made my data frame (stated above), here is what i did
df.m <- melt(df, "time")
ggplot(df.m, aes(time, value, colour = variable)) + geom_line() +
  facet_wrap(~ variable, ncol = 2)
I would post the image but I don't have enough reputation points yet.
I still don't really understand why "value" goes in the y position in aes(time, value, ...). If anyone could provide an explanation, that would be greatly appreciated. My last question is whether anyone knows how to make the subgraph titles smaller.
Can I use cex.lab=, cex.main= in ggplot2?
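On both points, a small sketch with made-up data (only two magnitude columns, to keep it short). melt() stacks all non-id columns into two new columns, by default named variable (the old column name) and value (the number that was in it), which is why value ends up as y. The cex.* arguments belong to base graphics; in ggplot2, text sizes are set through theme(), and the facet titles are the strip.text element. The size of 8 is an arbitrary choice.

```r
library(ggplot2)
library(reshape2)

# Hypothetical stand-in for the real data frame
set.seed(1)
df <- data.frame(time = 1:10, V11 = rnorm(10), V08 = rnorm(10))

df.m <- melt(df, id.vars = "time")
head(df.m)  # columns: time, variable, value

p <- ggplot(df.m, aes(time, value, colour = variable)) +
  geom_line() +
  facet_wrap(~ variable, ncol = 2) +
  theme(strip.text = element_text(size = 8))  # smaller facet titles
print(p)
```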

trying to plot ranges of dates

I have 19 tags which were deployed and reported at different times throughout the summer and fall. Currently I am trying to create a plot to display the times of deployment and reporting so that I can visualize where there is overlap in data collection. I have tried several different plotting functions including plot(), boxplot(), and ggplot(). I have gotten close to what I want with boxplot() but would like the box to extend from the start to the end date and eliminate the whiskers entirely. Is there a way to do this or should I use a different function or package? Here is my code, it probably isn't the most efficient since I'm somewhat new to R.
note: tnumber are just the tag numbers I used. The dates were all taken from different data sets.
dep.dates=boxplot(t62104[,8],t40636[,8],t84337[,8],t84353[,8],t62103[,8],
t110289[,8],t62102[,8],t62105[,8],t62101[,8],t84360[,8],
t117641[,8],t40643[,8],t110291[,8],t84338[,8],t110290[,8],
t84363[,8],t117639[,8],t117640[,8],t117638[,8],horizontal=T,
main='Tag deployment and pop-up dates',xlab='Month',
ylab='Tag number',names=c('62104','40636','84337','84353',
'62103','110289','62102','62105','62101','84360','117641',
'40643','110291','84338','110290','84363','117639','117640',
'117638'),las=1)
Something like this will work if all you care about is ranges.
require(ggplot2)
require(SpatioTemporal)
data(mesa.data.raw)
require(data.table)
out <- as.data.table(t(apply(mesa.data.raw$obs, 2, function(.v){
names(.v)[range(which(!is.na(.v)))]
})),keep.rownames=TRUE)
setnames(out, "rn", "monitors")
ggplot(out, aes(x=monitors, y=V1, ymin=V1, ymax=V2)) + geom_crossbar() + coord_flip()
ggplot(out, aes(x=monitors, ymin=V1, ymax=V2)) + geom_linerange() + coord_flip()
The first ggplot call creates horizontal bars, but I can't figure out how to get rid of the center line, so I just put it at the start.
The second plot creates horizontal lines, which I think looks better anyway.
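For the tag data itself, the second approach could be sketched like this, assuming you first collect one start and one end date per tag into a single data frame. The tags data frame and its dates below are invented for illustration; only the tag numbers and labels come from the question.

```r
library(ggplot2)

# Hypothetical deployment and pop-up dates, one row per tag
tags <- data.frame(
  tag   = c("62104", "40636", "84337"),
  start = as.Date(c("2012-06-01", "2012-06-15", "2012-07-10")),
  end   = as.Date(c("2012-09-20", "2012-10-05", "2012-11-01"))
)

# One horizontal line per tag, running from deployment to pop-up
p <- ggplot(tags, aes(x = tag, ymin = start, ymax = end)) +
  geom_linerange() +
  coord_flip() +
  labs(x = "Tag number", y = "Month",
       title = "Tag deployment and pop-up dates")
print(p)
```

With real data, start and end would be something like the min() and max() of each tag's date column.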

Resources