I would like to create violin plots from aggregated data. My data has a category column, a value column and a count column:
data <- data.frame(category = rep(LETTERS[1:3],3),
value = c(1,1,1,2,2,2,3,3,3),
count = c(3,2,1,1,2,3,2,1,3))
If I create a simple violin plot it looks like this:
plot <- ggplot(data, aes(x = category, y = value)) + geom_violin()
plot
That is not what I wanted. One solution would be to expand the data frame by repeating the row of each category-value combination count times. The problem is that my counts go up into the millions, so plotting takes hours! :-(
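For reference, the row replication described above can be sketched in base R roughly as follows. This is only an illustration of the approach the question wants to avoid; the name expanded is made up here. It makes the cost visible: the expanded frame has sum(count) rows.

```r
data <- data.frame(category = rep(LETTERS[1:3], 3),
                   value = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
                   count = c(3, 2, 1, 1, 2, 3, 2, 1, 3))

# repeat each row index 'count' times, then index the data frame with the result
expanded <- data[rep(seq_len(nrow(data)), times = data$count),
                 c("category", "value")]

nrow(expanded)  # one row per counted observation, i.e. sum(data$count)
```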
Is there a solution with my data?
Thanks in advance!
You can supply a weight aesthetic, which is used when the densities are calculated.
plot2 <- ggplot(data, aes(x = category, y = value, weight = count)) + geom_violin()
plot2
You will get warning messages that the weights do not add to one, but that is ok. See here for similar/related discussion.
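To see why weighting can stand in for the expanded data, here is a small base R sketch using density() directly. It assumes a fixed, arbitrarily chosen bandwidth (bw = 0.5) so both estimates use the same kernel, and it normalizes the weights to sum to one, which is what the warning is about. The variable names are illustrative only.

```r
data <- data.frame(category = rep(LETTERS[1:3], 3),
                   value = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
                   count = c(3, 2, 1, 1, 2, 3, 2, 1, 3))
a <- subset(data, category == "A")

bw <- 0.5  # assumed bandwidth, fixed so the two estimates are comparable

# density from aggregated data, with weights normalized to sum to one
d_weighted <- density(a$value, weights = a$count / sum(a$count),
                      bw = bw, from = 0, to = 4, n = 64)

# density from the fully expanded data
d_expanded <- density(rep(a$value, times = a$count),
                      bw = bw, from = 0, to = 4, n = 64)

weighted_matches_expanded <-
  isTRUE(all.equal(d_weighted$y, d_expanded$y, tolerance = 1e-6))
```

With a fixed bandwidth the two curves coincide, so the aggregated form loses nothing for this purpose.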
Using stat="identity" and specifying a violinwidth aesthetic appears to work, although I had to put in a fudge factor:
ggplot(data, aes(x = category, y = value)) +
geom_violin(stat="identity",aes(violinwidth=0.2*count))
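One possible way to avoid the hand-picked fudge factor is to rescale the counts to the range 0 to 1 before mapping them to violinwidth. The sketch below scales within each category, which is an assumption about what is wanted; dividing by the overall maximum instead would keep the categories comparable in size. The column name vw is made up for this example.

```r
data <- data.frame(category = rep(LETTERS[1:3], 3),
                   value = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
                   count = c(3, 2, 1, 1, 2, 3, 2, 1, 3))

# scale counts so that the widest point of each category's violin equals 1
data$vw <- ave(data$count, data$category, FUN = function(x) x / max(x))

# build the plot only if ggplot2 is available
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(data, aes(x = category, y = value)) +
    geom_violin(stat = "identity", aes(violinwidth = vw))
}
```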
I have a distribution of data that is shown below in image 1. My goal is to show the likelihood that a variable is below a particular value for both X and for Y. For instance, I'd like to have a good way to show that ~95% of values are below 8000 on X-axis and below 6500 on the Y-axis. I am confident that there is a simple answer to this. I apologize if this has been asked many times before.
plot1 <- df %>% ggplot(mapping = aes(x = FLUID_TOT)) + stat_ecdf() + theme_bw()
plot2 <- df %>% ggplot(mapping = aes(x = FLUID_TOT, y = y)) + geom_point() + theme_bw()
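As a rough sketch of how such a threshold could be computed rather than read off the plot: quantile() gives the value below which a given share of observations falls, and ecdf() gives the share below a given value. The asker's df is not shown, so the data below is simulated and purely hypothetical; only the column name FLUID_TOT is taken from the question.

```r
set.seed(1)
# hypothetical stand-in for the asker's data, which is not shown
df <- data.frame(FLUID_TOT = rlnorm(1000, meanlog = 8, sdlog = 0.5))

# value below which about 95% of the observations fall
q95 <- quantile(df$FLUID_TOT, probs = 0.95)

# empirical CDF as a function: share of observations at or below a value
Fn <- ecdf(df$FLUID_TOT)
share_below <- Fn(8000)
```

The same two calls applied to the y column would give the threshold for the other axis, and q95 could then be marked on the stat_ecdf() plot, for example with geom_vline(xintercept = q95).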
I have a sample data frame like this:
Measurement <- c("Length","Breadth","Length","Breadth","Height",
"Height","Breadth","Length","Height","Breadth",
"Length","Height","Height","Breadth","Length")
Value <- c(45,43,45,100,62,62,43,74,74,74,12,17,17,44,12)
data <- data.frame(Measurement, Value)
I am trying to visualize this data to see how the values are distributed for each measurement, and also for all measurements combined. I am using a basic histogram to do this, but it is not visually appealing:
hist(data$Value)
Could someone help me with ggplot2 or another more advanced visualization to view this data better? I would like to group by Measurement, and I would like to see whether density plots are meaningful here. Any help would be appreciated.
Here are a couple interesting options:
library(ggplot2)
ggplot(data, aes(factor(Measurement), Value)) + geom_violin(aes(fill = factor(Measurement)))
ggplot(data, aes(Value, colour = Measurement, group = Measurement)) + geom_density(fill=NA)
They produce the following:
Hope this helps!
Here is another possibility using geom_histogram. To get the best looking, most informative histogram, it is important to set the binwidth manually for every new data set.
library(ggplot2)
p = ggplot(data=data, aes(x=Value, fill=Measurement)) +
geom_histogram(binwidth=1, colour="grey40", drop=TRUE) +
facet_grid(Measurement ~ ., margins=TRUE) +
theme_bw()
ggsave("hist.png", plot=p, width=8, height=4, dpi=150)
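If a starting point for the bin width is wanted, one common heuristic is the Freedman-Diaconis rule (2 * IQR / n^(1/3)). The sketch below applies it to the sample data; it is only one of several rules and not something the answer above prescribes, and the name bw_fd is made up.

```r
Value <- c(45, 43, 45, 100, 62, 62, 43, 74, 74, 74, 12, 17, 17, 44, 12)

# Freedman-Diaconis rule: bin width = 2 * IQR / n^(1/3)
bw_fd <- 2 * IQR(Value) / length(Value)^(1/3)

# base R also offers the corresponding number of bins
n_bins <- nclass.FD(Value)
```

The result could then be passed on as geom_histogram(binwidth = bw_fd) and adjusted by eye from there.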
Not sure if I understood the question. Do you want to separate the values?
For that, you can do something like this:
ValueLength <- data.frame(Value = Value[which(Measurement == "Length")], Measurement = "Length")
ValueBreadth <- data.frame(Value = Value[which(Measurement == "Breadth")], Measurement = "Breadth")
ValueHeight <- data.frame(Value = Value[which(Measurement == "Height")], Measurement = "Height")
Then you can combine them in one data frame again:
Values <- rbind(ValueLength, ValueBreadth, ValueHeight)
And plot with ggplot:
ggplot(Values, aes(Value, fill = Measurement)) + geom_density(alpha = 0.2)
I was wondering if there is a way to normalize the heights of the histograms with multiple groups so that their first heights are all = 1. For instance:
results <- rep(c(1,1,2,2,2,3,1,1,1,3,4,4,2,5,7),3)
category <- rep(c("a","b","c"),15)
data <- data.frame(results,category)
p <- ggplot(data, aes(x=results, fill = category, y = ..count..))
p + geom_histogram(position = "dodge")
gives a regular histogram with 3 groups.
Also
results <- rep(c(1,1,2,2,2,3,1,1,1,3,4,4,2,5,7),3)
category <- rep(c("a","b","c"),15)
data <- data.frame(results,category)
p <- ggplot(data, aes(x=results, fill = category, y = ..ncount..))
p + geom_histogram(position = "dodge")
gives a histogram where each group is normalized to have a maximum height of 1.
I want to get a histogram where each group is normalized so that its first bar has a height of 1 (so I can show growth), but I don't know whether there is an appropriate alternative to ..ncount.. or ..count.. for this. If anyone can help me understand the structure of ..count.., I could maybe figure it out from there.
Thanks!
I bet there is a nice way to do everything within ggplot. However, I tend to prefer preparing the desired data set before I plug it into ggplot. If I understood you correctly, you may try something like this:
# the question's data frame is called 'data'; copy it to 'df' for the steps below
df <- data
# convert 'results' to factor and set levels to get an equi-spaced 'results' x-axis
df$results <- factor(df$results, levels = 1:7)
# for each category, count frequency of 'results'
df <- as.data.frame(with(df, table(results, category)))
# normalize: for each category, divide all 'Freq' (heights) by the first 'Freq'
df$freq2 <- with(df, ave(Freq, category, FUN = function(x) x/x[1]))
ggplot(data = df, aes(x = results, y = freq2, fill = category)) +
geom_bar(stat = "identity", position = "dodge")
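The normalization step can also be checked on its own, without plotting. The following sketch rebuilds the counts from the question's vectors and divides each category's counts by its first count using sweep(); the names counts and normalized are illustrative. Note that it assumes the first count in every category is non-zero, which holds for this data.

```r
results <- rep(c(1, 1, 2, 2, 2, 3, 1, 1, 1, 3, 4, 4, 2, 5, 7), 3)
category <- rep(c("a", "b", "c"), 15)

# rows are the result values 1..7, columns are the categories
counts <- table(factor(results, levels = 1:7), category)

# divide every column by its first entry (the count for results == 1)
normalized <- sweep(counts, 2, counts[1, ], "/")

normalized[, "a"]  # 1.0 1.0 0.5 0.0 0.0 0.0 0.0
```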
It looks like ..density.. does what you want, but I can't for the life of me find documentation on it. On both your examples it does what you are looking for, though!
results <- rep(c(1,1,2,2,2,3,1,1,1,3,4,4,2,5,7),3)
category <- rep(c("a","b","c"),15)
data <- data.frame(results,category)
p <- ggplot(data, aes(x=results, fill = category, y = ..density..))
p + geom_histogram(position = "dodge")
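On what ..density.. represents: as far as the ggplot2 documentation describes it, it is the density of points in each bin, scaled so that the bars integrate to 1. The base R sketch below reproduces that definition with hist() and unit-width bins (an assumption made here for readability). That is a different normalization from dividing by the first bar, so it is worth checking the plotted heights against the intended result.

```r
results <- rep(c(1, 1, 2, 2, 2, 3, 1, 1, 1, 3, 4, 4, 2, 5, 7), 3)

# unit-width bins centred on the integer values 1..7
h <- hist(results, breaks = seq(0.5, 7.5, by = 1), plot = FALSE)

# density = count / (total count * bin width)
manual_density <- h$counts / (sum(h$counts) * diff(h$breaks))

# the bar areas sum to one
total_area <- sum(h$density * diff(h$breaks))
```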
This question's theme is simple but drives me crazy:
1. how to use melt()
2. how to deal with multi-lines in single one image?
Here is my raw data:
a 4.17125 41.33875 29.674375 8.551875 5.5
b 4.101875 29.49875 50.191875 13.780625 4.90375
c 3.1575 29.621875 78.411875 25.174375 7.8012
Q1:
I've learned from the post "Plotting two variables as lines using ggplot2 on the same graph" how to draw multiple lines for multiple variables, just like this:
The following code produces the above plot. However, the x-axis is really a time series.
library(reshape2)  # assumed source of melt()
library(ggplot2)
df <- read.delim("~/Desktop/df.b", header=FALSE)
colnames(df)<-c("sample",0,15,30,60,120)
df2<-melt(df,id="sample")
ggplot(data = df2, aes(x=variable, y= value, group = sample, colour=sample)) + geom_line() + geom_point()
I wish it could treat 0, 15, 30, 60 and 120 as real numbers to show the time series, rather than as category labels. I also tried the following, but it failed.
row.names(df)<-df$sample
df<-df[,-1]
df<-as.matrix(df)
df2 <- data.frame(sample = factor(rep(row.names(df),each=5)), Time = factor(rep(c(0,15,30,60,120),3)),Values = c(df[1,],df[2,],df[3,]))
ggplot(data = df2, aes(x=Time, y= Values, group = sample, colour=sample)) +
geom_line() +
geom_point()
Loooooooooking forward to your help.
Q2:
I've learnt that the following script can add a spline() curve for a single line. How can I apply spline() to all three lines in a single plot?
n <-10
d <- data.frame(x =1:n, y = rnorm(n))
ggplot(d,aes(x,y))+ geom_point()+geom_line(data=data.frame(spline(d, n=n*10)))
Your variable column is a factor (you can verify by calling str(df2)). Just convert it back to numeric:
df2$variable <- as.numeric(as.character(df2$variable))
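The detour through as.character() matters because calling as.numeric() directly on a factor returns the internal level codes, not the labels. A small illustration with the time points from the question (the names f, codes and times are made up here):

```r
f <- factor(c("0", "15", "30", "60", "120"),
            levels = c("0", "15", "30", "60", "120"))

codes <- as.numeric(f)                # level codes: 1 2 3 4 5
times <- as.numeric(as.character(f))  # actual values: 0 15 30 60 120
```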
For your other question, you might want to stick with using geom_smooth or stat_smooth, something like this:
p <- ggplot(data = df2, aes(x=variable, y= value, group = sample, colour=sample)) +
geom_line() +
geom_point()
library(splines)
p + geom_smooth(aes(group = sample),method = "lm",formula = y~bs(x),se = FALSE)
which gives me something like this:
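If the interpolating spline() from the question is preferred over a fitted smooth, one option is to apply it per sample and stack the results. This is a sketch under the assumption that df2 has the columns sample, variable (numeric time) and value; the data frame is rebuilt here from the raw numbers in the question, and smooth_df is a made-up name.

```r
df2 <- data.frame(
  sample   = rep(c("a", "b", "c"), each = 5),
  variable = rep(c(0, 15, 30, 60, 120), times = 3),
  value    = c(4.17125, 41.33875, 29.674375, 8.551875, 5.5,
               4.101875, 29.49875, 50.191875, 13.780625, 4.90375,
               3.1575, 29.621875, 78.411875, 25.174375, 7.8012)
)

# interpolate each sample separately on 50 equally spaced time points
smooth_list <- lapply(split(df2, df2$sample), function(g) {
  s <- spline(g$variable, g$value, n = 50)
  data.frame(sample = g$sample[1], variable = s$x, value = s$y)
})

# stack the per-sample curves into one long data frame
smooth_df <- do.call(rbind, smooth_list)
```

Because smooth_df has the same column names as df2, it could be drawn with geom_line(data = smooth_df) on top of the points, using the same aesthetics as the original plot.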
I'm looking for a way to plot a bar chart containing two different series, hide the bars for one of the series, and instead have a line (smooth if possible) go through the tops of where the bars for the hidden series would have been (similar to how one might overlay a frequency polygon on a histogram). I've tried the example below but appear to be running into two problems.
First, I need to summarize (total) the data by group, and second, I'd like to convert one of the series (df2) to a line.
df <- data.frame(grp=c("A","A","B","B","C","C"),val=c(1,1,2,2,3,3))
df2 <- data.frame(grp=c("A","A","B","B","C","C"),val=c(1,4,3,5,1,2))
ggplot(df, aes(x=grp, y=val)) +
geom_bar(stat="identity", alpha=0.75) +
geom_bar(data=df2, aes(x=grp, y=val), stat="identity", position="dodge")
You can get group totals in many ways. One of them is
with(df, tapply(val, grp, sum))
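For the question's df, that call returns a named vector of totals per group. The sketch below shows it next to aggregate(), which is one of the other ways and returns a data frame instead; totals and totals_df are illustrative names.

```r
df <- data.frame(grp = c("A", "A", "B", "B", "C", "C"),
                 val = c(1, 1, 2, 2, 3, 3))

# named vector of group sums
totals <- with(df, tapply(val, grp, sum))

# the same sums as a data frame with columns grp and val
totals_df <- aggregate(val ~ grp, data = df, FUN = sum)
```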
For simplicity, you can combine bar and line data into a single dataset.
df_all <- data.frame(grp = factor(sort(unique(df$grp))))  # works whether grp is character or factor
df_all$bar_heights <- with(df, tapply(val, grp, sum))
df_all$line_y <- with(df2, tapply(val, grp, sum))
Bar charts use a categorical x-axis. To overlay a line you will need to convert the axis to be numeric.
ggplot(df_all) +
geom_bar(aes(x = grp, weight = bar_heights)) +
geom_line(aes(x = as.numeric(grp), y = line_y))
Perhaps your sample data aren't representative of the real data you are working with, but there are no lines to be drawn for df2: there is only one value for each combination of x and y. Here's a modified version of your df2 with enough data points to construct lines:
df <- data.frame(grp=c("A","A","B","B","C","C"),val=c(1,2,3,1,2,3))
df2 <- data.frame(grp=c("A","A","B","B","C","C"),val=c(1,4,3,5,0,2))
p <- ggplot(df, aes(x=grp, y=val))
p <- p + geom_bar(stat="identity", alpha=0.75)
p + geom_line(data=df2, aes(x=grp, y=val), colour="blue")
Alternatively, if your example data above is correct, you can plot this information as a point with geom_point(data = df2, aes(x = grp, y = val), colour = "red", size = 6). You can obviously change the color and size to your liking.
EDIT: In response to comment
I'm not entirely sure what a frequency polygon over a histogram is supposed to look like. Are the x-values supposed to be connected to one another? Secondly, you keep referring to wanting lines, but your code shows geom_bar(), which I assume isn't what you want? If you want lines, use geom_line(). If the two assumptions above are correct, then here's an approach to do that:
#First let's summarise df2 by group (ddply() comes from the plyr package)
library(plyr)
df3 <- ddply(df2, .(grp), summarise, total = sum(val))
> df3
grp total
1 A 5
2 B 8
3 C 3
#Second, let's plot df3 as a line while treating the grp variable as numeric
p <- ggplot(df, aes(x=grp, y=val))
p <- p + geom_bar(alpha=0.75, stat = "identity")
p + geom_line(data=df3, aes(x=as.numeric(factor(grp)), y=total), colour = "red")