I have data like this:
subject <- 1:208
ev <- runif(208, min = 1, max = 2)
seeds <- gl(6, 40, labels = c('seed1', 'seed2', 'seed3', 'seed4', 'seed5', 'seed6'), length = 208)
ngambles <- gl(2, 1, length = 208, labels = c('4', '32'))
trial <- rep(1:20, each = 2, length.out = 208)
data <- data.frame(subject, ev, seeds, ngambles, trial)
The data looks like this:
subject ev seeds ngambles trial
1 1.996717 seed1 4 1
2 1.280977 seed1 32 1
3 1.571648 seed1 4 2
4 1.153311 seed1 32 2
5 1.502559 seed1 4 3
6 1.644001 seed1 32 3
I plot a graph with trial on the x axis and ev (the expected value) on the y axis, for each seed and ngambles, with this command:
qplot(trial, ev, data = data,
      facets = ngambles ~ seeds, xlab = "Trial", ylab = "Expected Value", geom = "line") +
  ggtitle("Expected Value for Each Seed")
Now I want to draw a new graph that aggregates ev over trials 1-5, 6-10, 11-15, and 16-20, and I also want to add error bars.
I have no clue how to do this in R.
Maybe somebody can help me.
Thanks in advance.
Assuming that your data frame is called df. First, add a new column ag that shows which interval each original trial value belongs to, using the cut() function:
df$ag<-cut(df$trial,c(1,6,11,16,21),right=FALSE)
Now there are two possibilities. The first is to aggregate your data using the stat_...() functions of ggplot2. There is already a stat_summary() function; in addition, define a stat_sum_df() function (taken from the stat_summary() help file) to calculate more than one summary value.
stat_sum_df <- function(fun, geom = "crossbar", ...) {
  stat_summary(fun.data = fun, colour = "red", geom = geom, width = 0.2, ...)
}
stat_sum_df() with the argument "mean_cl_normal" calculates the confidence intervals used by geom="errorbar", and stat_summary() calculates the mean value for geom="line". Use the new column ag as the x value. With scale_x_discrete() you can get the right labels on the x axis.
ggplot(df, aes(ag,ev,group=seeds))+stat_sum_df("mean_cl_normal",geom="errorbar")+
stat_summary(fun.y="mean",geom="line",color="red")+
facet_grid(ngambles~seeds)+
scale_x_discrete(labels=c("1-5","6-10","11-15","16-20"))
The second approach is to summarize the data before plotting, for example with the ddply() function from the plyr package. In this case you also need the ag column created in the first example. Then use the new data frame for plotting.
library(plyr)
df.new<-ddply(df,.(ag,seeds,ngambles),summarise,ev.m=mean(ev),
ev.lim=qt(0.975,length(ev)-1)*sd(ev)/sqrt(length(ev)))
ggplot(df.new,aes(ag,group=seeds))+
geom_errorbar(aes(y=ev.m,ymin=ev.m-ev.lim,ymax=ev.m+ev.lim))+
geom_line(aes(y=ev.m))+
facet_grid(ngambles~seeds)+
scale_x_discrete(labels=c("1-5","6-10","11-15","16-20"))
I'm new to R and feeling a bit lost ... I'm working on a dataset which contains answers on a 7-point Likert scale.
My data looks like this for example:
My goal is to create a barplot which displays the Likert scale on the x axis and the frequency on the y axis.
What I understood so far is that I first have to transform my data into a frequency table. For this I used a code that I found in another post on this site:
data <- factor(data, levels = c(1:7))
table(data)
However I always get this output:
data
1 2 3 4 5 6 7
0 0 0 0 0 0 0
Any ideas about what went wrong, or other suggestions for how I could realize my plan?
Thanks a lot!
Lorena
This is a very simple way of handling your question, using only base R:
## your data
my_obs <- c(4,5,3,4,5,5,3,3,3,6)
## use a factor for class data
## you could consider making it ordered (ordinal data)
## which makes sense for Likert data
## type "?factor" in the console to see the documentation
my_factor <- factor(my_obs, levels = 1:7)
## calculate the frequencies
my_table <- table(my_factor)
## print my_table
my_table
# my_factor
# 1 2 3 4 5 6 7
# 0 0 4 2 3 1 0
## plot
barplot(my_table)
yielding the following simple barplot:
Please, let me know whether this is what you want
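As the comment in the code above suggests, you could also store the responses as an ordered factor, which better matches the ordinal nature of Likert data. A minimal sketch (the frequencies and the barplot come out the same):
## same observations, but as an ordered factor
my_ordered <- factor(my_obs, levels = 1:7, ordered = TRUE)
table(my_ordered)
barplot(table(my_ordered))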
Lorena!
First, there's no need to apply either factor() or table() to the dataset you showed. From what I gather, it looks fine.
R comes with some interesting plotting options, hist() is one of them.
Histogram with hist()
In the following example, I'll use the "Valenz" variable, as named in your dataset.
To get the frequencies without needing to beautify anything, you can simply ask:
hist(dataset$Valenz)
The part before the $ (dataset) says where the values are stored; the part after it (Valenz) says which column of dataset you want to use.
If you only want to see the frequencies, without having to present them in some elegant way, that ought to do it (:
Histogram with ggplot()
If you want to make it prettier, you can style your plot with the ggplot2 package, one of the most used packages in R.
First, install and then load the package.
install.packages("ggplot2")
library(ggplot2)
Then, create a histogram where the x axis shows the scores and the bar heights show how many times each score occurred.
ggplot(dataset, aes(x = Valenz)) +
geom_histogram(bins = 7, color = "Black", fill = "White") +
labs(title = NULL, x = "Name of my variable", y = "Count of 'Variable'") +
theme_minimal()
ggplot() takes the value of your dataframe, then aes() specifies you want Valenz to be in the x-axis.
geom_histogram() gives you a histogram with bins = 7 (7 options, since it's a 7-point Likert scale), with the bars outlined in color = "Black" and filled with fill = "White".
labs() specifies the labels for the x axis (x = "Name of my variable") and the y axis (y = "Count of 'Variable'"); title = NULL leaves the plot without a title.
theme_minimal() makes the plot look cooler.
I hope I helped you in some way, Lorena. (:
I am trying to plot populations of predators and of prey over time, with confidence intervals. I can plot these two separately; how do I plot them on the same graph?
#take mean, number, and create se of prey(d)
d.means=tapply(mydata$prey,mydata$week, mean)
d.n=tapply(mydata$prey,mydata$week, length)
d.se=tapply(mydata$prey,mydata$week, sd)/sqrt(d.n)
#plot with se using plotrix
plotCI(as.numeric(row.names(d.means)),d.means,d.se,ylim=c(0,400),pch=19,gap=0,xlab="Week",ylab="d, w population")
#take mean, number, and create se of predator(w)
w.means=tapply(mydata$pred,mydata$week, mean)
w.n=tapply(mydata$pred,mydata$week, length)
w.se=tapply(mydata$pred,mydata$week, sd)/sqrt(w.n)
#plot with se using plotrix
plotCI(as.numeric(row.names(w.means)),w.means,w.se,ylim=c(0,400),pch=19,gap=0,xlab="Week",ylab="d, w population")
After the first plot, run the line below before drawing the next plot:
par(new=T)
Make sure that you set xlim and ylim so that they accommodate both plots, and use the options axes=F and ann=F in the second plot so that its axes and labels are not drawn on top of the first.
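For example, here is a minimal sketch of that overlay, reusing the d.* and w.* objects from the question. This assumes plotCI passes the usual base-graphics options (axes, ann) through to plot(); the pch = 17 for the second series is just an illustrative choice.
## common limits that accommodate both series and their error bars
xlim <- range(as.numeric(row.names(d.means)), as.numeric(row.names(w.means)))
ylim <- range(d.means - d.se, d.means + d.se, w.means - w.se, w.means + w.se)

## first plot: prey (d)
plotCI(as.numeric(row.names(d.means)), d.means, d.se, xlim = xlim, ylim = ylim,
       pch = 19, gap = 0, xlab = "Week", ylab = "d, w population")

## overlay the second plot: predator (w), suppressing its axes and labels
par(new = TRUE)
plotCI(as.numeric(row.names(w.means)), w.means, w.se, xlim = xlim, ylim = ylim,
       pch = 17, gap = 0, axes = FALSE, ann = FALSE)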
These graphical features are discussed in detail in the ebook "R Fundamentals & Graphics". You might want to use it as a desk reference.
#take mean, number, and create se of prey(d)
d.means=tapply(mydata$prey,mydata$week, mean)
d.n=tapply(mydata$prey,mydata$week, length)
d.se=tapply(mydata$prey,mydata$week, sd)/sqrt(d.n)
#take mean, number, and create se of predator(w)
w.means=tapply(mydata$pred,mydata$week, mean)
w.n=tapply(mydata$pred,mydata$week, length)
w.se=tapply(mydata$pred,mydata$week, sd)/sqrt(w.n)
Here you have created all the variables you need, but to plot them using ggplot you need them in a long ("tall") data frame with a variable indicating whether each row is predator or prey. I also added a time variable; I think yours would be week.
x <- data.frame(means = c(w.means, d.means),
                n     = c(w.n, d.n),
                se    = c(w.se, d.se),
                role  = c(rep("pred", length(w.n)), rep("prey", length(d.n))),
                time  = c(1:length(w.n), 1:length(d.n)))
I don't know exactly what your data look like so here is a fake one I cooked up just to illustrate the format.
means n se role time
1 0.9874234 10 0.16200575 pred 1
2 1.4120207 12 0.08895026 pred 2
3 2.7352516 8 0.07991036 pred 3
4 1.1301248 11 0.05481813 prey 1
5 2.4810040 13 0.28682585 prey 2
6 3.1546947 9 0.22126054 prey 3
Once the data are in this format, using ggplot is really pretty easy.
ggplot(x, aes(x=time, y=means, colour=role)) +
geom_errorbar(aes(ymin=means-se, ymax=means+se), width=.1) +
geom_line()
That gives this:
I am trying to create a position="fill" plot in which the y axis represents an allocation (always summing to 100) and another variable is on the x axis. Variables 1-4 are integers and variable 5 is a continuous numeric. All five variables are on the same row.
Y axis: variable 1 + variable 2 + variable 3 + variable 4 = 100
X axis: variable 5
Is there a way to do this without melting my data table?
Sample code, caution: runs a bit slow due to how I set up variables 1-4...
library(combinat)
combinations <- combn(100, 4)
permutations <- combinations[, colSums(combinations) == 100]
rm(combinations)
data <- t(rbind(permutations,
replicate(ncol(permutations), cumprod(1+rnorm(20, 0.05, 0.30))[20])
))
One way to generate a reproducible example would be
set.seed(1)
data_ex <- data.frame(t(rmultinom(1000,prob=rep(0.25,4),size=100)),
v5=runif(1000,0.8,1))
and then
library(ggplot2)
library(reshape2)
ggplot(melt(data_ex,id.var="v5")) +
geom_area(aes(x=v5,y=value,fill=variable))
draws the plot.
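If your four variables did not already sum to 100 in every row, a small variation of the same plot lets ggplot normalise each x position for you, matching the position = "fill" idea in the question (the y axis then shows proportions from 0 to 1):
ggplot(melt(data_ex, id.var = "v5")) +
  geom_area(aes(x = v5, y = value, fill = variable), position = "fill")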
If you really want to do things the hard way you can avoid using melt, but melt is much (much much) easier!
cumvals <- t(apply(data_ex[,1:4],1,cumsum))
data2 <- data.frame(cumvals,v5=data_ex$v5)
ggplot(data2,aes(x=v5)) +
## these must go in reverse order
geom_area(aes(y=X4),fill="green")+
geom_area(aes(y=X3),fill="purple")+
geom_area(aes(y=X2),fill="red")+
geom_area(aes(y=X1),fill="blue")
I want to draw a number of similar plots with a loop.
What I do is:
plot(0, 0, type="l", col="white", xlim=range(1,N), ylim=range(0.5, 2.5)) # provide axes, frame, ...
for(col in colors)
{
X <- generate_X() # vector of random numbers
lines(1:N, X, type="l", col=col)
}
The problem is that the random numbers sometimes go outside the range (0.5, 2.5), and I want to lengthen the ylim range. At the moment I'm going to do it with min and max before plotting, but there must be a much, much cleaner way which I just can't find anywhere.
I think I'm missing something basic about plotting, but I couldn't find the solution.
Thanks
I think there are two quick answers to the OP's question:
calculate the plot range before initializing the plot (implied by OP), or
use a "cleaner" plotting wrapper function.
Setup: First we need to define the variables and functions the OP implies and then generate some data to work with.
# Initialize our N number of X points and
# colors vector.
N <- 20
colors <- c("yellow", "red", "blue", "green")
# Create function 'generate_X' to perform
# as implied by the OP.
generate_X <- function(.N){
rnorm(n=.N, mean=0, sd=1)
}
# Generate the entire data frame
# using the 'matrix' function to shape
# the data quickly.
data <- data.frame(
  id = 1:N,
  matrix(
    generate_X(N * length(colors)),
    ncol = length(colors)
  )
)
The above code simply initializes the variables, function, and data needed for the OP's example.
Method 1: Calculate the plot range and initialize the plot. This is pretty easy using the 'range' function. In the data frame we created, there is an "id" column for our x values, so we use the range of 'data$id' for our x. Then, we find the range of all the data across every column EXCEPT the first column (data[,-1]) to find the overall y range. We initialize with the color white, since our background is also white. Otherwise, we would have a point in the lower-left and upper-right corners. I added x and y labels just for looks.
plot(
range(data$id),
range(data[,-1]),
col="white",
xlab="x",
ylab="y")
Next we just loop through and plot the lines.
for(i in 1:length(colors)){
lines(data$id, data[, i + 1], type="l", col=colors[i])
}
This is essentially the same thing the OP demonstrated, but it's adapted slightly to accept a data frame as input. It's far easier to reference columns using an integer counter (i in this case) rather than the list of colors.
Method 2: There are a lot of plot wrapper packages out there, and one of the most popular is the 'ggplot2' package, and for good reason. You can avoid a lot of the looping hassle with plots by feeding shaped data into a 'ggplot' function. The code here is much "cleaner" from a reading perspective.
# Load packages for shaping data and plotting.
library(reshape2)
library(ggplot2)
First, we need the 'reshape2' package, because we want to use "melted" data in our plot. This just makes the 'ggplot' code WAY cleaner. Then, we load up the 'ggplot2' package for the plotting.
For our plot, we initialize a plot without any instructions, so we can specify them in the geometry layer. If we were creating multiple layers from the same data, we would specify the options in the base plot layer, but for this, we are only creating a single geometry layer with lines. The + allows us to add plot layers.
Next, we choose a geometry layer ('geom_line' in this case) and specify the data as melt(data, id.vars="id"). This shapes our data for the 'ggplot' function to use with minimal code. We use the "id" column as the ID variable, since that contains our x values. The shaped data now looks more like this:
# id variable value
# 1 1 X1 -0.280035386
# 2 2 X1 -0.371020958
# 3 3 X1 -0.239889784
# 4 4 X1 0.450357442
# 5 5 X1 -0.801697283
# 6 6 X1 -0.453057841
# 7 7 X1 -0.451321958
# 8 8 X1 0.948124835
# 9 9 X1 2.724205279
# 10 10 X1 -0.725622824
# 11 11 X1 0.475545293
# 12 12 X1 0.533060822
# 13 13 X1 -1.928335572
# 14 14 X1 -0.466790259
# 15 15 X1 -1.606005895
# 16 16 X1 0.005678344
# 17 17 X1 -1.719827853
# 18 18 X1 0.601011314
# 19 19 X1 -2.056315661
# 20 20 X1 1.006169713
# 21 1 X2 -1.591227194
# ...
# 80 20 X4 -1.045224561
You don't need to get too hung up on the shaping. Just understand that "melted" data works better with the 'ggplot' functions. We specify our melted data as the data for our geometry layer, and then we use the 'aes' function to tell the geometry layer how to deal with our data. Our x values are in the "id" column, and our y values are in the "value" column. The next part is what removes the loops: we specify the color to be differentiated based on the "variable" column. In our melted data, the "variable" column contains the name of the column that the data originally came from, and using it to specify the color will tell 'ggplot' to automatically change the color for each new "variable" value.
ggplot() +
  geom_line(
    data = melt(data, id.vars = "id"),
    aes(
      x = id,
      y = value,
      col = variable
    ),
    lwd = 1,
    alpha = 0.7)
I specified the line width ("lwd") and alpha values just to make the graph a little more readable.
I want to plot error bars for two different sets of y values (Y1, Y2) against x. In other words, I have two series, Y1 and Y2, with corresponding X values. I managed to plot them together after I reshaped the data frame. Now I want to draw error bars on the same graph for the Y1 and Y2 points. I understand geom_errorbar() is what I'm looking for. However, I'm doing it the long way and I'm sure there is a shorter way: I'm calculating "se" for each set, setting aes(ymin = y1 - se, ymax = y1 + se), and repeating the same for Y2. Because I want to apply these error bars to several plots, I'd rather do it a short way.
Here is my data frame after the reshape:
M Req Rec load Un L1
1 30.11 9.000000 3.000000 30.02000 A
2 50.31 10.030000 6.045000 39.44000 A
3 60.01 11.290000 7.366667 54.93000 A
4 66.10 12.630000 8.827500 68.44500 A
5 80.18 13.106000 9.462000 71.07600 A
6 87.10 14.421667 15.961667 82.70500 A
7 90.08 15.880000 20.644286 94.20714 A
1 4.000 1.500000 1.000000 1 B
2 8.240 6.240000 4.760000 3.00000 B
3 10.28 12.230000 9.420000 4.05000 B
4 18.570 25.570000 17.930000 6.00000 B
5 22.250 35.250000 27.850000 7.00000 B
6 35.070 55.010000 36.810000 8.06000 B
7 48.480 0.420000 47.020000 9.06000 B
I have used the following command to graph it:
ggplot(df_reshaped,aes(x = M, y = Req, colour = L1, shape=L1)) +
geom_point(size = 5)+
geom_line() +
scale_x_discrete(name="M") +
scale_y_continuous(name="Y1 Y2")+
ggtitle("A vs B")
In this case I'm graphing Y1=Req1, Y2=Req2, with respect to x=M
Any short way or suggestion to calculate the error bars ?
Is there any quick way to calculate the "se" ?
In general there are two possibilities to prepare your data for ggplot:
You could aggregate the raw data and plot the results. If you go this way, you have to calculate the standard errors yourself, since that information cannot be retrieved from the aggregated data. These standard errors can then be plotted with geom_errorbar().
A second option is to use the raw data and let ggplot do all the calculations for you. This could be done with stat_summary. For example:
stat_summary(fun.data = "mean_cl_normal", fun.args = list(mult = 1), geom = "errorbar")
Obviously, you have chosen the first approach. So, you just need to calculate the standard errors for the points of both variables.
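As a sketch of that first option, here is one way to aggregate and attach the error bars. This assumes a hypothetical raw (unaggregated) data frame df_raw with the same M, Req and L1 columns as your reshaped data and several replicates per M/L1 combination:
library(plyr)     # for ddply()
library(ggplot2)

## standard error = sd / sqrt(n), computed per group
df.sum <- ddply(df_raw, .(M, L1), summarise,
                Req.m  = mean(Req),
                Req.se = sd(Req) / sqrt(length(Req)))

ggplot(df.sum, aes(x = M, y = Req.m, colour = L1, shape = L1)) +
  geom_point(size = 5) +
  geom_line() +
  geom_errorbar(aes(ymin = Req.m - Req.se, ymax = Req.m + Req.se), width = 0.2)
With the second option you would instead feed the raw data straight into ggplot and let stat_summary() do this aggregation for you.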