Add points to geom_density_ridges for groups with a small number of observations - r

I am loving using geom_density_ridges(), with individual points also included for each group. However, some groups have small sample sizes (e.g. n=1 or 2) precluding the generation of the density ridges. For these groups, I'd like to be able to plot the locations of the existing observations - even though no probability density function will be shown.
In this example, I'd like to be able to plot the 2 data points for May on the appropriate line.
library(tidyverse)
library(ggridges)
data("lincoln_weather")
#pull weather from all months that are NOT May
lincoln_weather_nomay<-lincoln_weather[which(lincoln_weather$Month!="May"),]
#pull weather just from May
lincoln_weather_may<-lincoln_weather[which(lincoln_weather$Month=="May"),]
#recombine, keeping only the first two rows for the May dataset
new_weather<-rbind(lincoln_weather_nomay,lincoln_weather_may[c(1:2),])
ggplot(new_weather, aes(x = `Min Temperature [F]`, y = Month, fill = Month)) +
  geom_density_ridges(alpha = 0.5, jittered_points = TRUE, point_alpha = 1, point_shape = 21) +
  labs(x = "Average temperature (F)", y = '') +
  # "none" replaces the deprecated FALSE for hiding a legend
  guides(fill = "none", color = "none")
How can I add the points for the May observations to the appropriate location (i.e. the May slot) and at the appropriate location along the x-axis?

Add a separate geom_point() layer to the plot, and give it a subset of the data that contains only the observations for the categories that were left unplotted. The layer inherits the x and y mappings from ggplot(), so the points land on the May line at the correct x positions. You can apply any of the usual customizations, either to match the points plotted for the other categories or to make these points stand out.
ggplot(new_weather, aes(x = `Min Temperature [F]`, y = Month, fill = Month)) +
  geom_density_ridges(alpha = 0.5, jittered_points = TRUE, point_alpha = 1, point_shape = 21) +
  # draw only the May rows as points; shape 13 makes them stand out
  geom_point(data = subset(new_weather, Month %in% c("May")), shape = 13) +
  labs(x = "Average temperature (F)", y = '') +
  # "none" replaces the deprecated FALSE for hiding a legend
  guides(fill = "none", color = "none")
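If several groups might be too small, the subset can be computed instead of naming "May" by hand. The following is only a sketch: n_min is a hypothetical cutoff (three observations is an assumption about when a ridge stops being drawn, so adjust it to what you see in your plot), and small_groups is a name introduced here for illustration.

```r
library(ggplot2)
library(ggridges)

data("lincoln_weather")

# Rebuild the question's data: every month except May, plus two May rows
may_rows <- lincoln_weather$Month == "May"
new_weather <- rbind(lincoln_weather[!may_rows, ],
                     lincoln_weather[may_rows, ][1:2, ])

# Hypothetical cutoff: groups with fewer observations get plain points
n_min <- 3
group_n <- table(new_weather$Month)
small_months <- names(group_n)[group_n > 0 & group_n < n_min]
small_groups <- new_weather[new_weather$Month %in% small_months, ]

p <- ggplot(new_weather, aes(x = `Min Temperature [F]`, y = Month, fill = Month)) +
  geom_density_ridges(alpha = 0.5, jittered_points = TRUE,
                      point_alpha = 1, point_shape = 21) +
  # shape 21 with the inherited fill matches the jittered ridge points
  geom_point(data = small_groups, shape = 21) +
  labs(x = "Min temperature (F)", y = NULL) +
  guides(fill = "none")
p
```

Using shape 21 here makes the extra points look like the jittered ones; swap in another shape if you want the small groups to stand out.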

Related

visualize relationship between categorical variable and frequency of other variable in one graph?

How, in R, can I draw a histogram with a categorical variable on the x-axis and
the frequency of a continuous variable on the y-axis?
Is this the correct approach?
There are a couple of ways one could interpret "one graph" in the title of the question. That said, using the ggplot2 package, there are at least two ways to render histograms by group on a single page of results.
First, we'll create a data frame that contains a normally distributed random variable with a mean of 100 and a standard deviation of 20. We also include a group variable that has one of four values: A, B, C, or D.
set.seed(950141237) # for reproducibility of results
df <- data.frame(group = rep(c("A","B","C","D"),200),
y_value = rnorm(800,mean=100,sd = 20))
The resulting data frame has 800 rows of randomly generated values from a normal distribution, assigned into 4 groups of 200 observations.
Next, we will render this in ggplot2::ggplot() as a histogram, where the color of the bars is based on the value of group.
ggplot(data = df,aes(x = y_value, fill = group)) + geom_histogram()
...and the resulting chart looks like this:
In this style of histogram the values from each group are stacked atop each other (i.e. within each bin, the count for group A is added on top of the count for B, and so on, before the chart is rendered), which might not be what the original poster intended.
We can verify the "stacking" behavior by removing the fill = group argument from aes().
# verify the stacking behavior
ggplot(data = df,aes(x = y_value)) + geom_histogram()
...and the output, which looks just like the first chart, but drawn in a single color.
Another way to render the data is to use group with facet_wrap(), where each distribution appears in a different facet on one chart.
ggplot(data = df,aes(x = y_value)) + geom_histogram() + facet_wrap(~group)
The resulting chart looks like this:
The facet approach makes it easier to see differences in frequency of y values between the groups.
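A third option, not covered above, is to keep everything in one panel but stop the bars from stacking. This is a minimal sketch that rebuilds the same df; the bins = 30 and alpha = 0.4 values are arbitrary choices made for the illustration.

```r
library(ggplot2)

set.seed(950141237) # same seed as above
df <- data.frame(group = rep(c("A", "B", "C", "D"), 200),
                 y_value = rnorm(800, mean = 100, sd = 20))

# Default behavior: bars of different groups are stacked within each bin
p_stacked <- ggplot(df, aes(x = y_value, fill = group)) +
  geom_histogram(bins = 30)

# position = "identity" draws every group from a baseline of zero,
# so the four histograms overlap; transparency keeps them all visible
p_overlay <- ggplot(df, aes(x = y_value, fill = group)) +
  geom_histogram(bins = 30, position = "identity", alpha = 0.4)
p_overlay
```

The overlay keeps the groups directly comparable on one set of axes, at the cost of some visual clutter where the distributions coincide.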

plotting multiple lines in ggplot R

I have neuroscientific data where we count synapses/cells in the cochlea and quantify these per frequency. We do this for animals of different ages. What I thus ideally want is the frequencies (5,10,20,30,40) in the x-axis and the amount of synapses/cells plotted on the y-axis (usually a numerical value from 10 - 20). The graph then will contain 5 lines of the different ages (6 weeks, 17 weeks, 43 weeks, 69 weeks and 96 weeks).
I try this with ggplot and first just want to plot one age. When I use the following command:
ggplot(mydata, aes(x=Frequency, y=puncta6)) + geom_line()
I get a graph, but no line and the following error: 'geom_path: Each group consists of only one observation. Do you need to adjust the group aesthetic?'
So I found I have to adjust the code to:
ggplot(mydata, aes(x=Frequency, y=puncta6, group = 1)) + geom_line()
This works, except that my first data point (5 kHz) is now plotted after my last data point (40 kHz). (This also happens without the 'group = 1' addition.) How do I solve this, or is there an easier way to plot this kind of data?
I couldn't attach a file, so I added a photo of my code and graph, with the 5 kHz data point oddly located, and a photo of my data in Excel.
example data
example code and graph
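One possible cause, offered as a guess since the data is only visible in a photo: if Frequency was read in as text (or as a factor), its values are ordered as strings, and "5" then sorts after "40". The sketch below uses a made-up mydata with invented puncta6 values purely to illustrate converting the column to numeric.

```r
library(ggplot2)

# Hypothetical stand-in for mydata; the puncta6 values are invented
mydata <- data.frame(Frequency = c("5", "10", "20", "30", "40"),
                     puncta6 = c(12.1, 15.3, 17.8, 16.2, 13.4),
                     stringsAsFactors = FALSE)

# As text, the frequencies sort as strings rather than as numbers
sort(mydata$Frequency)

# Converting to numeric gives a continuous x-axis in the expected order,
# and geom_line() no longer needs group = 1
mydata$Frequency <- as.numeric(mydata$Frequency)

p <- ggplot(mydata, aes(x = Frequency, y = puncta6)) +
  geom_line() +
  geom_point()
p
```

If Frequency is a factor in your data, convert with as.numeric(as.character(...)) instead, because as.numeric() on a factor returns the level codes.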

How to plot two y axis? or combine(merge) two plots? Should handle faceted column as well

I have a combination of two requirements that are difficult for me (I'm new to this).
Consider the Weather data as example. Let's say I've dataset with following information.
"Datetime", "Word", "Frequency", "Temperature"
Visualization: I want to see change in frequency of a word over time and at temperature.
X-axis shows the time series(date)
Y-axis has the frequency scale(0 to max freq).
Requirements:
I need to draw frequencies of several words(Column "word") over the time.
Correlate the frequency with temperature.
I started with ggplot2:
ggplot(TemperatureData, aes(x=timeId, y=termFrequency)) + geom_line() + facet_wrap(~Keyword) +
geom_line(data = TemperatureData, aes(y = temperature)) +
labs(x="Time Series over X days", y = "Term Frequency")
The above approach results in an overlapping y-axis (frequency and temperature) and a separate panel for each word (a facet in ggplot2), i.e. the plot has 3 panels, one for each keyword. Each panel shows temperature over time and the frequency of one word over time.
Problems:
I want separate y-axes for temperature and frequency. I also do not want to normalize these axes, because it then gets hard to tell what the high and low values of each axis are over the days, and the plot loses readability. I learnt that two y-axes are not possible using ggplot2.
A separate panel for each keyword is not required. One horizontal line per keyword is what I'm looking for.
The plot should contain only one temperature line to reflect its change over time.
I tried using par(), but could not succeed.
Example solution using plotrix package
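For what it's worth, more recent ggplot2 releases provide sec_axis(), which draws a second y-axis as a one-to-one transformation of the first. The sketch below is only an illustration under assumptions: the freq and temp data frames and their values are invented, and the scaling factor k is a hypothetical helper. Temperature is rescaled for drawing only, while the right-hand axis is labelled in the original temperature units.

```r
library(ggplot2)

set.seed(1)
days <- 1:30

# Hypothetical data: term frequency per keyword per day, one temperature per day
freq <- data.frame(timeId = rep(days, 3),
                   Keyword = rep(c("rain", "snow", "sun"), each = 30),
                   termFrequency = rpois(90, lambda = 40))
temp <- data.frame(timeId = days,
                   temperature = round(runif(30, 10, 25), 1))

# Hypothetical factor that maps temperature onto the frequency range
k <- max(freq$termFrequency) / max(temp$temperature)

p <- ggplot(freq, aes(x = timeId)) +
  # one line per keyword in a single panel, no facets
  geom_line(aes(y = termFrequency, colour = Keyword)) +
  # a single temperature line, rescaled only for drawing
  geom_line(data = temp, aes(y = temperature * k), linetype = "dashed") +
  # the secondary axis undoes the rescaling, so it shows real temperatures
  scale_y_continuous(name = "Term Frequency",
                     sec.axis = sec_axis(~ . / k, name = "Temperature")) +
  labs(x = "Time Series over X days")
p
```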

Label stacked bar chart with variable other than plotted Y

I'm working on some fish electroshocking data and looking at fish species abundance per transects in a river. Essentially, I have an abundance of different species per transect that I'm plotting in a stacked bar chart. But, what I would like to do is label the top of the bar, or underneath the x-axis tick mark with N = Total Preds for that particular transect. The abundance being plotted is the number of that particular species divided by the total number of fish (preds) that were caught at that transect. I am having trouble figuring out a way to do this since I don't want to label the plot with the actual y-value that is being plotted.
Excuse the crude code. I am newer to R and not super familiar with generating random datasets. The following is what I came up with. Obviously in my real data the abundance % per transect always adds up to 100 %, but the idea is to be able to label the graph with TotalPreds for a transect.
#random data
Transect<-c(1:20)
Habitat<-c("Sand","Gravel")
Species<-c("Smallmouth","Darter","Rock Bass","Chub")
Abund<-runif(20,0.0,100.0)
TotalPreds<-sample(1:139,20,replace=TRUE)
data<-data.frame(Transect,Habitat,Species,Abund,TotalPreds)
#Generate plot
AbundChart<-ggplot(data=data,aes(x=Transect,y=Abund,fill=Species))
AbundChart+labs(title="Shocking Fish Abundance")+theme_bw()+
scale_y_continuous("Relative Abundance (%)",expand=c(0.02,0),
breaks=seq(0,100,by=20),labels=seq(0,100,by=20))+
scale_x_discrete("Transect",expand=c(0.03,0))+
theme(plot.title=element_text(face='bold',vjust=2,size=25))+
theme(legend.title=element_text(vjust=5,size=15))+
geom_bar(stat="identity",colour="black")+
facet_grid(~Habitat,labeller=label_both,scales="free_x")
I get this plot that I would like to label with TotalPreds as described previously.
Again my plot would have bars that reached 100% for abundance, and in my real data transects 1-10 are gravel and 11-20 are sand. Excuse my poor sample dataset.
*Update
My actual data looks like this:
Variable in this case is the fish species and value is the abundance of that species at that particular electroshocking transect. Total_Preds is repeated when the data moves to a new species, because total preds is indicative of the total preds caught at that particular transect (i.e. each transect only has 1 total preds value). Maybe the melt function wasn't the right way to analyze this, but I have like 17 fish species that were caught at different rates across these 20 transects. I guess habitat type is singular to a transect as well, with 1-10 being gravel and 11-20 being sand, and that is repeated in my dataset across fish species as well.
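Edited in response to the update: you should be able to create a new data frame containing the TotalPreds data (not repeated) and use that in geom_text. Note that the column names inside geom_text (Transect_ID, value) are meant to follow your melted data, so adjust them to whatever your columns are actually called. I can't test this without data, but maybe: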
# select non-repeated half of melted data for use in geom_text
textlabels <- data[c(1:19),]
#Generate plot
AbundChart<-ggplot(data=data,aes(x=Transect,y=Abund,fill=Species))
AbundChart+labs(title="Shocking Fish Abundance")+theme_bw()+
scale_y_continuous("Relative Abundance (%)",expand=c(0.02,0),breaks=seq(0,100,by=20),labels=seq(0,100,by=20))+
scale_x_discrete("Transect",expand=c(0.03,0))+
theme(plot.title=element_text(face='bold',vjust=2,size=25))+
theme(legend.title=element_text(vjust=5,size=15))+
geom_bar(stat="identity",colour="black")+
facet_grid(~Habitat,labeller=label_both,scales="free_x") +
geom_text(data = textlabels, aes(x = Transect_ID, y = value, label = TotalPreds), vjust = -0.5)
You might have to play around with different values for vjust to get the labels where you want them.
See the geom_text help page for more info.
Hope that edit works with your data.
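Since the answer above could not be tested, here is a self-contained sketch of the same idea on invented data. Everything in it is an assumption made for illustration: the fish and label_df data frames, the random abundances (normalized to sum to 100% per transect), and the choice to place every label at y = 100, the top of each full bar.

```r
library(ggplot2)

set.seed(42)

# Hypothetical long-format data: 4 species in each of 20 transects
fish <- expand.grid(Transect = factor(1:20),
                    Species = c("Smallmouth", "Darter", "Rock Bass", "Chub"))
fish$Habitat <- ifelse(as.integer(fish$Transect) <= 10, "Gravel", "Sand")

# Random abundances, normalized so each transect sums to 100%
raw <- runif(nrow(fish))
fish$Abund <- 100 * raw / ave(raw, fish$Transect, FUN = sum)

# One TotalPreds value per transect, repeated across species after the merge
totals <- data.frame(Transect = factor(1:20),
                     TotalPreds = sample(1:139, 20, replace = TRUE))
fish <- merge(fish, totals, by = "Transect")

# Label data: exactly one row per transect, positioned at the top of the bar
label_df <- unique(fish[, c("Transect", "Habitat", "TotalPreds")])
label_df$y <- 100

p <- ggplot(fish, aes(x = Transect, y = Abund)) +
  geom_col(aes(fill = Species), colour = "black") +
  geom_text(data = label_df,
            aes(y = y, label = paste0("N = ", TotalPreds)),
            vjust = -0.5, size = 3) +
  # extra headroom so the labels are not clipped
  scale_y_continuous("Relative Abundance (%)",
                     expand = expansion(mult = c(0, 0.1))) +
  facet_grid(~ Habitat, labeller = label_both, scales = "free_x")
p
```

Mapping fill inside geom_col() rather than in ggplot() keeps the text layer from needing a Species column, and including Habitat in the label data lets each label land in the right facet.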

Plotting multiple frequency polygon lines using ggplot2

I have a dataset with records that have two variables: "time" which are id's of decades, and "latitude" which are geographic latitudes. I have 7 time periods (numbered from 26 to 32).
I want to visualize a potential shift in latitude through time. So what I need ggplot2 to do is plot a graph with latitude on the x-axis and the count of records at a certain latitude on the y-axis. I need it to do this for the separate time periods and plot everything in one graph.
I understood that I need the freqpoly geom from ggplot2, and I have this so far:
qplot(latitude, data = lat_data, geom = "freqpoly", binwidth = 0.25)
This gives me the correct graph of the data, ignoring the time. But how can I include the time? I tried subsetting the data, but I can't really figure out whether this is the best way.
So basically I'm trying to get a graph with 7 lines showing the frequency distribution in each decade in order to look for a latitude shift.
Thanks!!
Without sample data it is hard to answer, but try adding color=factor(time) (where time is the name of your column with the time periods). This will draw a line for each time period in a different color.
qplot(latitude, data = lat_data, geom = "freqpoly", binwidth = 0.25,
color=factor(time))
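Because recent ggplot2 releases deprecate qplot(), the same plot can also be written with ggplot() and geom_freqpoly(). This is a sketch on invented data: the lat_data values below are simulated only so that the example runs, with a small drift in latitude across the seven periods.

```r
library(ggplot2)

set.seed(7)

# Hypothetical data: 200 records in each of 7 time periods (26 to 32),
# with the mean latitude drifting slightly from period to period
lat_data <- data.frame(
  time = rep(26:32, each = 200),
  latitude = rnorm(1400,
                   mean = rep(seq(50, 53, length.out = 7), each = 200),
                   sd = 2)
)

# factor(time) makes the periods discrete, giving one line per period
p <- ggplot(lat_data, aes(x = latitude, colour = factor(time))) +
  geom_freqpoly(binwidth = 0.25) +
  labs(colour = "Time period", y = "Number of records")
p
```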
