ggplot2 in R: Calculate percentage and make a graph that might be a geom_area plot - r

I'm a beginner in R, so please be patient with me if there are very obvious mistakes in my code and for my question! For a homework problem, I am struggling to make what I think is a geom_area plot look like this:
As background, we are using the diamonds dataframe from ggplot2 library. We were given the plot and asked to reproduce it. My biggest problem is with the y-axis. The graph given indicated that the y-axis represents density, which I think is the percentage/proportion of each clarity grade given the title. Originally, I thought perhaps I needed to create a new dataframe with "Price" and "Clarity Proportion (or, density)", but I wasn't sure how to do that. The professor hinted that we should not need to create a new variable for this problem.
Here's what I have so far. It produces the error message: "In Ops.ordered(left, right): '/' is not meaningful for ordered factors":
set.seed(123)
d <- ggplot(diamonds[sample(nrow(diamonds),5000),]) #these were given in the homework
d + geom_area(aes(x = price, y = lapply(count(diamonds$clarity), FUN = count(diamonds$clarity)/53940), colour = clarity), position = "fill") +
labs(title = "Clarity Proportion by Price")
I know my y-axis is wrong, but I'm just not sure how to transform it. Your explanation and insight are greatly appreciated!

Related

How do I built a boxplot much better with extreme outlier using ggplot

I have a dataset with extreme outlier. I am trying to build boxplot using following code
ggplot(arrange_df, aes(x = "", y = idleS)) + geom_boxplot()
It display the plot like this
How do I present the boxplot much better with extreme outlier. Thanks for any help or suggestion

using ggridges to compare to numerical variables for different countries

I've been working with a tibble I created and I was wanting to create a ridge plot with ggplot2.
I have a list of regions, GDP and Infant Mortality.
I was hoping to compare each region as a coloured ridge showing GDP as x and Infant Mortality as y.
This is as far as I've gotten but it's not working (probably because I don't think I'm quite understanding each part).
library(ggplot2)
install.packages("ggridges")
library(ggridges)
as.tibble(Region_Compare)
colnames(Region_Compare)
ggplot(Region_Compare, aes(x = `GDP`, y = `Infant mortality`, fill = cut)) +
geom_density_ridges() +
theme_ridges() +
theme()
I get the following error:
Error: Aesthetics must be valid data columns. Problematic aesthetic(s): fill = cut. Did you mistype the name of a data column or forget to add after_stat()?
Could someone please help me understand where I'm going wrong and where what I might need to add?
I was hoping to do something like the picture attached. With x showing regions labelled and y showing Infant mortality, with the peaks or density showing GDP. Is that possibly or should I be looking at something else.
Wishful thinking pic:

Density plot for multiple group shows one line, however legend shows 3

I am analyzing US election data volume from Google trend. I type the below command in R studio.
The poliData dataframe contains the SearchVolume for all months for three Politicians.
ggplot(data = poliData, aes(x=Date, group=Politician, colour=Politician)) +
geom_density()
But I only get the density line (blue) for one politician only with the above command.See the attached picture. Can you please help
I guess you got three lines on top each other because Date variable values are the same for all three politicians. My understanding of your analysis could be something like this:
ggplot(data = poliData,
aes(x=Date, colour=Politician,
weight = SearchVolume/sum(SearchVolume))) +
geom_density()
Adding weight should produce distinct lines for different politicians. If this is not what you wanted, please dput your data for others to work out a solution for you. Also, as I do not have the data, I have not tested the above code yet. Please let me know if it does not work.

Filling cross over under a Cumulative Frequency plot using ggplot in R

I am trying to plot two Cumulative Frequency curves in ggplot, and shade the cross over at a certain cut off. I haven't been using ggplot for long, so I was hoping someone might be able to help me with this one.
The plot without filled regions, looks like this...
Which I have created using the following code...
library(ggplot2) # required
north <- rnorm(3060, mean=277,sd=3.01) # to create synthetic data
south <- rnorm(3060, mean=278, sd=3.26) # in place of my real data.
#placing in dataframe
df_temp <- data.frame(temp=c(north,south),
region=c(rep("north",length=3060),rep("south",length=3060)))
#manipulating into cdf, as I've seen in other examples
temp.regions <- ddply(df_temp, .(region), summarize,
temp = unique(temp),
ecdf = ecdf(temp)(unique(temp)))
# feeding into ggplot.
ggplot(temp.regions,aes(x=temp, y=ecdf, color = region)) +
geom_line(aes(x=temp,color=region))+
scale_colour_manual(values = c("blue","red"))
What I would then like, would be to shade both curves for temperatures below 0.2 on the y axis. Ideally I'd like to see the blue one shaded in blue, and the red one shaded in red. Then, where they cross over in purple.
However, the closest I have managed is as follows... ]
Which I have achieved using the following additions to my code.
# creating a dataframe with just the temperatures for below 0.2
# to try and aid control when plotting
temp.below <- temp.regions[which(temp.regions$ecdf<0.2),]
# plotting routine again.
ggplot(temp.regions, aes(x=temp, y=ecdf, color = region)) +
geom_line(aes(x=temp,color=region))+
scale_colour_manual(values = c("blue","red"))+
# with additional line for shading.
geom_ribbon(data=temp.below,
aes(x=temp,ymin=0,ymax=0.2), alpha=0.5)
I've seen a few examples of people shading for a normal distribution density plot, which is where I have adapted my code from. But for some reason my boxes don't seem to want anything to do with the temperature curve.
Please help! I'm sure it's quite simple, I'm just really lost and have tried a few, producing less convincing results than these.
Thank you so much for taking a look.
PROBLEM SOLVED THANKS TO HELP BELOW...
running suggested code from below
geom_ribbon(aes(ymin=0,ymax=ecdf, fill=region), alpha=0.5)
gives...
which is so very almost the solution I'm after, but with one final addition... like so
#geom_ribbon(aes(ymin=0,ymax=ecdf, fill=region), alpha=0.5)
geom_ribbon(data=temp.below, aes(ymin=0,ymax=ecdf, fill=region), alpha=0.5)
I get what I'm after...
The reason I set the data again is so that it only fills the lowest 20% of the two regions.
Thank you so much for the help :-)
Looks like you're thinking about it in the right way.
With geom_ribbon i dont think you need to set data to anything else. Just set aes(ymin = 0, ymax = ecdf, fill = region). I think that should do it.

Calculation of density estimate in density2d?

I have a more general question regarding the principle behind density2d.
I'm using ggplot and the density2d function to visualize animal movements. My idea was calculating heat maps showing where the animal is most of the time and/or to identify areas of particular interest. Yet, the density2d function sometimes generates rather inexplicable plots.
Here's what I mean:
set.seed(4)
x<-runif(50,1,599)
y<-runif(50,1,599)
df<-data.table(x,y)
ggplot(df,aes(x=x,y=y))
+stat_density2d(aes(x=x,y=y,fill=..level..,alpha=..level..),bins=50,geom="polygon")
+coord_equal(xlim=c(0,600),ylim=c(0,600))
+expand_limits(x=c(0,600),y=c(0,600))
+geom_path()
which looks like this:
There are areas with a density estimate but without data (around x:50, y:300).
Now compare with this:
set.seed(13)
x<-runif(50,1,599)
y<-runif(50,1,599)
df<-data.table(x,y)
ggplot(df,aes(x=x,y=y))
+stat_density2d(aes(x=x,y=y,fill=..level..,alpha=..level..),bins=50,geom="polygon")
+coord_equal(xlim=c(0,600),ylim=c(0,600))
+expand_limits(x=c(0,600),y=c(0,600))
+geom_path()
which looks like this:
Here there are regions "wihtout" a density estimate but with actual data (around x:100,y:550).
Someone asked a related question:
Create heatmap with distribution of attribute values in R (not density heatmap)
but there are no satisfactory answers to be found.
So my question would be (i) Why? and (ii) How to avoid/adjust if possible?
This may be helpful. I am not that familiar with stat_density2d. After seeing your code and ggplot documents (http://docs.ggplot2.org/0.9.2.1/stat_density2d.html), I thought ..level.. might not be the one. I, then, tried ..density.. Someone will be able to explain why you need density meanwhile I think this is the graph you wanted.
ggplot(data = df, aes(x = x, y = y)) +
stat_density2d(geom="tile", aes(fill = ..density..), contour = FALSE) +
geom_path() +
coord_equal(xlim=c(0,600),ylim=c(0,600)) +
expand_limits(x=c(0,600),y=c(0,600))

Resources