ggplot multi-factor-level grouping for boxplot with continuous scale - r

I'm trying to create a boxplot of the following data
Temp<-rnorm(90,mean=100,sd=10)
Yr<-sample(c("1999","2000","2005","2009","2010"),size=90,replace=TRUE)
Month<-sample(c("June","July","August"),size=90,replace=TRUE)
Month
df<-data.frame(Temp,Month,Yr)
The visual I want and its corresponding code are below:
ggplot(df,aes(x=interaction(Month,Yr),y=Temp,fill=Month))+
geom_boxplot()+
xlab("Year")+
ylab("Daily Maximum Temperature")
You'll notice, though, that there are a few years missing from the data, and I'm trying to make the plot reflect that with gaps in the x-scale. The other problem is the text and tick marks on the axis. I'd like the ticks to just be the Year of observation rather than Month.Year since the month is already coded in the fill. I've tried scale_x_discrete, but trying to supply discrete values for a continuous axis spits out a blank graph and an error. I've met my swearing at the computer quota for the day, and it would be really awesome to get a little help on this.

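Convert Yr to a factor whose levels span every year from 1999 to 2010, and tell the x scale not to drop unused levels. This creates huge gaps, as every missing year gets its own empty slot, but you can adapt this by passing only specific years as the levels argument to the factor() call.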
df$Yr <- factor(df$Yr, levels=1999:2010)
ggplot(df,aes(x=Yr,y=Temp,fill=Month))+
geom_boxplot(position=position_dodge(1))+
ylab("Daily Maximum Temperature") +
scale_x_discrete("Year", drop=FALSE)
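To make the "only specific years" variant concrete, here is a minimal sketch. The placeholder years 2002 and 2007 are an assumption chosen for illustration (they are not in the question); they stand in for the longer gaps so that each gap takes one empty slot instead of one slot per missing year.

```r
library(ggplot2)

set.seed(1)  # reproducible example data with the same shape as in the question
df <- data.frame(
  Temp  = rnorm(90, mean = 100, sd = 10),
  Month = sample(c("June", "July", "August"), size = 90, replace = TRUE),
  Yr    = sample(c("1999", "2000", "2005", "2009", "2010"), size = 90, replace = TRUE)
)

# Observed years plus two hypothetical placeholder years (2002, 2007)
placeholders <- c("2002", "2007")
yr_levels <- c("1999", "2000", "2002", "2005", "2007", "2009", "2010")
df$Yr <- factor(df$Yr, levels = yr_levels)

p <- ggplot(df, aes(x = Yr, y = Temp, fill = Month)) +
  geom_boxplot(position = position_dodge(1)) +
  ylab("Daily Maximum Temperature") +
  # drop = FALSE keeps the empty levels on the axis;
  # the labels function blanks out the placeholder years
  scale_x_discrete("Year", drop = FALSE,
                   labels = function(x) ifelse(x %in% placeholders, "", x))
print(p)
```

Whether one slot per gap reads better than one slot per missing year is a judgment call; the factor levels are the only thing that needs to change.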

Related

Plotting years with decimals in R ggplot

I am plotting years across two decades using ggplot. Because of how the data were collected, the data points really fall halfway through each year, so to be accurate I labeled the years with a .5 at the end. I also have one single data point that was taken in early 2005, labeled 2005.22, so the years look like: 2005.22, 2005.5, 2006.5, 2007.5, 2008.5, 2009.5, 2010.5, 2011.5, 2012.5. Since I am technically missing data for 2005-2005.21, I want the plot to start at 2005 with no line showing until 2005.22, and then break every 2 years starting at 2005.5, 2007.5 and so on.
I've been using the following to plot geom_line for the years, but I do not know how to get the above result. I was able to get the limits to start at 2005, but with the first data point at 2005.22 the breaks come out as 2005.22, 2007.22, and so on. Below is what I am using to plot and break the years.
scale_x_continuous(
name = "year",
breaks = seq(c(2005, 2012.5), by=2),
expand = c(0,0))+
coord_cartesian(xlim = c(2005, 2012.5))
It's a little hard for me to understand what exactly you want the plot to look like (especially in terms of the labels), but does this do what you're looking for? You can add 2005 to the front of the breaks sequence, which places it in front without disrupting the rest of the sequence.
library(ggplot2)
d <- data.frame(x=c(2005.22, 2005.5,2006.5,2007.5,2008.5,2009.5,2010.5,2011.5,2012.5),
y=runif(9,-1,1))
ggplot(d, aes(x,y)) +
geom_line() +
scale_x_continuous(breaks=c(2005, seq(2005.5, 2012.5,2)))
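If the axis should also start exactly at 2005, the breaks above can be combined with the coord_cartesian() call from the question. This is only a sketch: the y values are random placeholders, and the variable name x_breaks is introduced here for illustration.

```r
library(ggplot2)

set.seed(1)  # placeholder y values, as in the answer above
d <- data.frame(
  x = c(2005.22, 2005.5, 2006.5, 2007.5, 2008.5, 2009.5, 2010.5, 2011.5, 2012.5),
  y = runif(9, -1, 1)
)

# 2005 first, then every second mid-year point
x_breaks <- c(2005, seq(2005.5, 2012.5, by = 2))

p <- ggplot(d, aes(x, y)) +
  geom_line() +
  scale_x_continuous(name = "year", breaks = x_breaks, expand = c(0, 0)) +
  # zoom the view to 2005-2012.5 without dropping any data
  coord_cartesian(xlim = c(2005, 2012.5))
print(p)
```

Because the first observation is at 2005.22, the line simply starts there and the stretch from 2005 to 2005.22 stays empty.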

I am using ggplot2 to make a bar chart and can't get the years correct along the x-axis

I am using ggplot2 to make a bar chart of the number of participants per year by gender. If I have 14 years included, I would like 2 bars for each year corresponding to the number of males and females for that year. I am not getting each year along the x-axis. I think data is being binned. I have tried changing the bin width, using scale_x_date and am still stuck.
Can you help me figure out how to have the data for EACH year in my graph?
As an example, here is my data for years 2004-2017:
year=c(2004,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017)
gender=c("male" , "female")
Participants is by gender, male then female respectively per year:
Participants=c(1307,443,1847,630,2109,765, 1824,691,2250,952,3123,1421,4097,1904,6415,3284,8788,4678,11581,6694,13141,8478,16389,10575,20990,13811,26951,19729)
data=data.frame(year,gender,Participants)
Here is how I am trying to generate my plot:
MyPlot <- ggplot(data, aes(fill=gender, y=Participants, x=year)) +
geom_bar(position="dodge", stat="identity",width = .8)
print(MyPlot + ggtitle("Annual Number of Participants by Gender"))
On the x-axis, the years 2006, 2010, 2014 and 2018 are marked and the bars correspond to data from two years. I want data for each year, both in terms of the bars and in terms of the ticks on the x-axis.
Any help would be appreciated!
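Participants has 28 values, but year has only 14 and gender only 2, so data.frame() recycles the shorter vectors and the rows no longer pair each count with the right year. You don't have a clear data frame design to serve as an input to ggplot.
Start by reading this: https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html
The key points are: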
Each variable forms a column.
Each observation forms a row.
Each type of observational unit forms a table.
Then once you have a tibble/data frame your ggplot2 code should work fine. I'd kill the width= option until you have it working.
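As a sketch of what a tidy version of the question's data could look like, each year is repeated once per gender so that every row is one (year, gender) observation. Treating year as a factor on the x-axis is one way, assumed here, to get a tick for every year.

```r
library(ggplot2)

# Counts exactly as given in the question: male then female for each year
Participants <- c(1307, 443, 1847, 630, 2109, 765, 1824, 691, 2250, 952,
                  3123, 1421, 4097, 1904, 6415, 3284, 8788, 4678, 11581, 6694,
                  13141, 8478, 16389, 10575, 20990, 13811, 26951, 19729)

data <- data.frame(
  year   = rep(2004:2017, each = 2),               # 2004, 2004, 2005, 2005, ...
  gender = rep(c("male", "female"), times = 14),   # male, female, male, female, ...
  Participants = Participants
)

# factor(year) makes the axis discrete, so every year gets its own tick
MyPlot <- ggplot(data, aes(x = factor(year), y = Participants, fill = gender)) +
  geom_bar(position = "dodge", stat = "identity") +
  xlab("year") +
  ggtitle("Annual Number of Participants by Gender")
print(MyPlot)
```

Keeping year numeric and adding scale_x_continuous(breaks = 2004:2017) should work as well if a continuous axis is preferred.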

choosing specific values on the X axis when using ggplot2

I am trying to plot a graph showing the number of events at the Olympics as a function of the year that a specific Olympic took place.
My data frame is called supertable and it consists of 2 columns, the first is the year and the second is the number of events in the games held that year.
My problem is that on the x axis I only get the years 1920 and 1980 and I would like to have 1920,1950,1980,2010
this is my code
ggplot(data = supertable,aes(x=year,y=no.of.events))+geom_point(colour='red')+
scale_x_discrete(breaks=c(1920,1950,1980,2010))
This is the picture I get
I tried doing this
scale_x_discrete(breaks=c(1920,1950,1980,2010),limits=c(1920,1950,1980,2010))
but it didn't help.
I am assuming it is something small that I am missing. I tried searching for the answer but didn't find it.
Your x-axis is a continuous variable, so you need to use scale_x_continuous.
You used breaks correctly to indicate where your ticks on the x axis are, but the limits value should be a c(min, max) of the range of the plot you want to show.
Try this: scale_x_continuous(breaks=c(1920,1950,1980,2010), limits = c(1920, 2019))
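Put together, a runnable sketch might look like the following. The real supertable is not shown in the question, so the years and event counts below are invented placeholders used only to make the example self-contained.

```r
library(ggplot2)

# Placeholder data (invented for illustration, not real Olympic figures)
supertable <- data.frame(
  year         = seq(1920, 2016, by = 8),
  no.of.events = seq(150, 306, length.out = 13)
)

# year must be numeric for a continuous scale to apply
stopifnot(is.numeric(supertable$year))

p <- ggplot(data = supertable, aes(x = year, y = no.of.events)) +
  geom_point(colour = "red") +
  # breaks: where the ticks go; limits: c(min, max) of the visible range
  scale_x_continuous(breaks = c(1920, 1950, 1980, 2010), limits = c(1920, 2019))
print(p)
```

If year was read in as character or factor, converting it with as.numeric(as.character(year)) first would be needed before this scale applies.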

How to plot two y axis? or combine(merge) two plots? Should handle faceted column as well

I've a combination of two difficult(I'm naive) requirements :(
Consider the Weather data as example. Let's say I've dataset with following information.
"Datetime", "Word", "Frequency", "Temperature"
Visualization: I want to see change in frequency of a word over time and at temperature.
X-axis shows the time series(date)
Y-axis has the frequency scale(0 to max freq).
Requirements:
I need to draw frequencies of several words(Column "word") over the time.
Correlate the frequency with temperature.
I started with ggplot2:
ggplot(TemperatureData, aes(x=timeId, y=termFrequency)) + geom_line() + facet_wrap(~Keyword) +
geom_line(data = TemperatureData, aes(y = temperature)) +
labs(x="Time Series over X days", y = "Term Frequency")
The above approach results in overlapping y axis (frequency, temperature). And, a separate bin for each "Word" (facet for ggplot). i.e plot has 3 bin's for each keyword. Each bin shows temperature over time, and frequency of a word over time.
Problems:
I want separate y-axes for temperature and frequency. I also do not want to normalize these axes, because that makes it hard to see the high and low values of each over the days, and the plot loses readability. I learnt that two y-axes are not possible using ggplot2.
A separate bin for each keyword is not required. One horizontal line per keyword is what I'm looking for.
The plot should have only one appearance (a line graph) of temperature to reflect its change over time.
I tried using par(), but could not succeed.
Example solution using plotrix package
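For comparison, here is a rough sketch of the same idea in base graphics rather than plotrix, using par(new = TRUE) to overlay a second plot with its own right-hand axis. The data are invented: the column names follow the question, but the keywords "rain" and "sun" and all values are placeholders.

```r
# Hypothetical data: two keywords over 10 days, plus one temperature series
set.seed(42)
days <- 1:10
freq <- data.frame(
  timeId        = rep(days, times = 2),
  Keyword       = rep(c("rain", "sun"), each = 10),
  termFrequency = sample(0:20, 20, replace = TRUE)
)
temperature <- round(runif(10, min = 15, max = 30), 1)
keywords <- unique(freq$Keyword)

op <- par(mar = c(5, 4, 4, 4) + 0.1)  # widen the right margin for the second axis

# Left axis: empty frame first, then one frequency line per keyword
plot(range(days), c(0, max(freq$termFrequency)), type = "n",
     xlab = "Time Series over X days", ylab = "Term Frequency")
for (i in seq_along(keywords)) {
  one <- freq[freq$Keyword == keywords[i], ]
  lines(one$timeId, one$termFrequency, col = i + 1)
}

# Right axis: overlay a single temperature line on its own, unnormalized scale
par(new = TRUE)
plot(days, temperature, type = "l", lty = 2, xlim = range(days),
     axes = FALSE, xlab = "", ylab = "")
axis(side = 4)
mtext("Temperature", side = 4, line = 3)
legend("topleft", legend = c(keywords, "temperature"),
       col = c(seq_along(keywords) + 1, 1),
       lty = c(rep(1, length(keywords)), 2))

par(op)  # restore the previous margins
```

Both plots use the same xlim, so the two layers line up on the time axis while each keeps its own y scale.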

Using Table to Graph

I have the following data.frame:
sample <- data.frame(day=c(1,2,5,10,12,12,14))
sample.table <- as.data.frame(table(sample$day))
Now what I'd like to do is graph the day against the count of days, so something like:
require(ggplot2)
qplot(Var1, Freq, data=sample.table)
I realized though that Var1 really really really wants to be a factor. This works fine for a small number of days, but is terrible when days becomes much larger because the graph becomes unreadable. If I change it to a numeric or integer, then instead of plotting day on the x-axis, it plots the count of day, e.g. 1,2,3,4,5,6,7.
What can I do so that if I have, say 5000 days, it is still visible well?
This is because when you use table you get a vector with names (which are characters), and when you convert to data.frame these get converted to factors with the default settings.
You could avoid this by using your original data and getting ggplot2 to count the data:
qplot(day, ..count.., data=sample, stat="bin", binwidth=1)
or just use a histogram,
qplot(day, data=sample, geom="histogram", binwidth=1)
Note that you can adjust the binwidth argument to count in larger groups.
Figured out a hack for this: convert the factor column back to integers by going through character first.
as.integer(as.character(sample.table$Var1))
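The difference between the two conversions can be seen in this small sketch, which uses the data from the question. The names codes and day (as a new column of sample.table) are introduced here for illustration.

```r
library(ggplot2)

sample <- data.frame(day = c(1, 2, 5, 10, 12, 12, 14))
sample.table <- as.data.frame(table(sample$day))

# Var1 is a factor: as.integer() alone returns the internal level codes
codes <- as.integer(sample.table$Var1)

# Going through character recovers the actual day values
sample.table$day <- as.integer(as.character(sample.table$Var1))

# With a numeric x, ggplot2 uses a continuous axis that stays readable
# even when there are thousands of distinct days
p <- ggplot(sample.table, aes(x = day, y = Freq)) +
  geom_point()
print(p)
```

Here codes is just a running index over the distinct days, which is why the plot showed 1, 2, 3, ... on the x-axis after a direct numeric conversion.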
