I have about 20 years of daily data in a time series. It has columns Date, rainfall and other data.
I am trying to plot rainfall vs. time. I want 20 line plots, one per year, in different colours on a single graph, with a legend showing the years. I tried the following code, but it does not give me the desired result. Any suggestions to fix my issue would be most welcome.
library(ggplot2)
library(seas)
data(mscdata)
p<-ggplot(data=mscdata,aes(x=date,y=precip,group=year,color=year))
p+geom_line()+scale_x_date(labels=date_format("%m"),breaks=date_breaks("1 months"))
It doesn't look great, but here's a method. We first coerce the data into dates in the same year:
mscdata$dayofyear <- as.Date(format(mscdata$date, "%j"), format = "%j")
Then we plot:
library(ggplot2)
library(scales)
p <- ggplot(data = mscdata, aes(x = dayofyear, y = precip, group = year, color = year))
p + geom_line() +
scale_x_date(labels = date_format("%m"), breaks = date_breaks("1 months"))
While I agree with @Jaap that this may not be the best way to depict these data, try the following:
mscdata$doy <- as.numeric(strftime(mscdata$date, format="%j"))
ggplot(data=mscdata,aes(x=doy,y=precip,group=year)) +
geom_line(aes(color=year))
Although the given answers are good answers to your question as it stands, I don't think they will solve your problem. I think you should be looking at a different way to present the data. @Jaap already suggested using facets. Take for example this approach:
#first add a month column to your dataframe
mscdata$month <- format(mscdata$date, "%m")
#then plot it using boxplot with year on the X-axis and month as facet.
p1 <- ggplot(data = mscdata, aes(x = year, y = precip, group=year))
p1 + geom_boxplot(outlier.shape = 3) + facet_wrap(~month)
This will give you a graph per month, showing the rainfall of each year next to each other. Because I use a boxplot, the peaks in rainfall show up as dots ('normal' rain events are inside the box).
Another possible approach would be to use stat_summary.
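A minimal sketch of what the stat_summary approach could look like. The data frame rain and its values are synthetic stand-ins (not mscdata), and summarising by monthly mean is just one possible choice. The fun argument assumes ggplot2 3.3.0 or later; older versions call it fun.y.

```r
library(ggplot2)

# Synthetic stand-in for the rainfall data (values are made up)
set.seed(42)
rain <- data.frame(
  date = seq(as.Date("2001-01-01"), as.Date("2003-12-31"), by = "day")
)
rain$precip <- rexp(nrow(rain), rate = 0.5)
rain$year   <- format(rain$date, "%Y")
rain$month  <- as.integer(format(rain$date, "%m"))

# One line per year: stat_summary collapses the daily values in each
# month to their mean before the line is drawn
p <- ggplot(rain, aes(x = month, y = precip, colour = year, group = year)) +
  stat_summary(fun = mean, geom = "line") +
  scale_x_continuous(breaks = 1:12)
```

Printing p draws the plot. Swapping mean for sum or max would show monthly totals or peaks instead.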
Background
I have a dataframe, df, of athlete injuries:
df <- data.frame(number_of_injuries = c(1,2,3,4,5,6),
number_of_people = c(73,52,43,12,7,2),
stringsAsFactors=FALSE)
The Problem
I'd like to use ggplot2 to make a bar chart or histogram of this simple data using geom_bar or geom_histogram. Important point: I'm pretty novice with ggplot2.
I'd like something where the x-axis shows bins of the number of injuries (number_of_injuries), and the y-axis shows the counts in number_of_people. Like this (from Excel):
What I've tried
I know this is the most trivial dang ggplot issue, but I keep getting errors or weird results, like so:
ggplot(df, aes(number_of_injuries)) +
geom_bar(stat = "count")
Which yields:
I've been on the tidyverse reference website for an hour and I can't crack the code.
This can cause confusion from time to time. If you already have the counts, do not count the data again with geom_bar(stat = "count"); otherwise you simply get 1 in every category. You want to plot those values as they are with geom_col:
ggplot(df, aes(x = number_of_injuries, y = number_of_people)) + geom_col()
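To see why counting gives a height of 1 everywhere, it may help to look at what counting does to data that is already aggregated. A small base-R sketch, reusing the df from the question (the names row_counts and heights are only for illustration):

```r
df <- data.frame(number_of_injuries = c(1, 2, 3, 4, 5, 6),
                 number_of_people   = c(73, 52, 43, 12, 7, 2))

# Counting rows per category: each value of number_of_injuries occurs
# exactly once, so every count is 1 (this is what stat = "count" computes)
row_counts <- table(df$number_of_injuries)

# The bar heights we actually want are already stored in number_of_people
heights <- setNames(df$number_of_people, df$number_of_injuries)
```

geom_col (equivalently geom_bar(stat = "identity")) uses the stored values directly as bar heights instead of counting rows.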
I have a dataset where one of the columns contains dates, but in character format. I used the following code to convert it to date format and then extract only the month:
library(lubridate)
dates <- dmy(Austria$date)
Month <- month(dates, label = TRUE, abbr = FALSE)
The problem is that I am getting levels back for the months, which I don't want. I searched for how to remove the levels, but everything I found was about removing unused levels (which is not my case).
I also used as.Date, but I still have the same problem:
dates_Austria <- as.Date(Austria$date, "%d/%m/%Y")
My final purpose is to make a plot which will have unemployment on the horizontal axis, income level on the vertical axis and then change the color of the plot according to the month, like that:
ggplot(data = my_data, aes(x = unemployment, y = income, colour = Month)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE)
But with that code I get a separate regression line for each month. I want one line for all the data, with the dots of the scatter plot coloured according to the month.
Any help would be appreciated.
I create a dummy timeseries xts object with missing data on date 2-09-2015 as:
library(xts)
library(ggplot2)
library(scales)
set.seed(123)
seq <- seq(as.POSIXct("2015-09-01"),as.POSIXct("2015-09-02"), by = "1 hour")
ob1 <- xts(rnorm(length(seq),150,5),seq)
seq2 <- seq(as.POSIXct("2015-09-03"),as.POSIXct("2015-09-05"), by = "1 hour")
ob2 <- xts(rnorm(length(seq2),170,5),seq2)
final_ob <- rbind(ob1,ob2)
plot(final_ob)
# with ggplot
df <- data.frame(time = index(final_ob), val = coredata(final_ob) )
ggplot(df, aes(time, val)) + geom_line()+ scale_x_datetime(labels = date_format("%Y-%m-%d"))
After plotting my data looks like this:
The red rectangle marks the date on which data is missing. How should I show in the main plot that data was missing on this day?
I think I should show the missing data in a different colour, but I don't know how to process the data so that the gap is reflected in the main plot.
Thanks for the great reproducible example.
I think you are best off to omit that line in your "missing" portion. If you have a straight line (even in a different colour) it suggests that data was gathered in that interval, that happened to fall on that straight line. If you omit the line in that interval then it is clear that there is no data there.
The problem is that you want the hourly data to be connected by lines, and then no lines in the "missing data section" - so you need some way to detect that missing data section.
You have not given a criterion for this in your question, so based on your example I will say that each line on the plot should consist of data at hourly intervals; if there's a break of more than an hour then there should be a new line. You will have to adjust this criterion to your specific problem. All we're doing is splitting up your dataframe into bits that each get plotted as their own line.
So first create a variable that says which "group" (i.e. line) each data point is in:
df$grp <- factor(c(0, cumsum(diff(df$time) > 1)))
Then you can use the group= aesthetic which geom_line uses to split up lines:
ggplot(df, aes(time, val)) + geom_line(aes(group=grp)) + # <-- only change
scale_x_datetime(labels = date_format("%Y-%m-%d"))
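One caveat: diff() on date-times returns a difftime whose units are chosen automatically, so comparing it with 1 only means "one hour" when the units happen to be hours (as they are for this hourly data). A slightly more defensive sketch, assuming hourly timestamps like those in the question, converts the gaps to hours explicitly. The names time, gap_hours and grp and the UTC time zone are choices made for this illustration.

```r
# Hourly timestamps with a one-day hole, similar to the question's data
time <- c(seq(as.POSIXct("2015-09-01 00:00", tz = "UTC"),
              as.POSIXct("2015-09-02 00:00", tz = "UTC"), by = "1 hour"),
          seq(as.POSIXct("2015-09-03 00:00", tz = "UTC"),
              as.POSIXct("2015-09-05 00:00", tz = "UTC"), by = "1 hour"))

# Express the gaps explicitly in hours rather than relying on the
# units that diff() picks for the difftime result
gap_hours <- as.numeric(diff(time), units = "hours")

# Start a new group whenever the gap exceeds one hour
grp <- factor(c(0, cumsum(gap_hours > 1)))
```

The resulting grp can be used in aes(group = grp) exactly as above. For data sampled at another interval, the threshold would need adjusting.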
I have a little problem with a ggplot barchart.
I wanted to make a bar chart with ggplot2 to compare my Svolumes for my 4 stocks over a period of a few months.
I have two problems:
The first one is that my y axis is wrong. My graph and data seem correct, but the y axis does not "follow": I thought it would use a different scale. I would like the axis to cover the total of my Svolume values, but I think it is just writing out the individual Svolume values. I don't know how to explain it, but I would like a scale that corresponds to all of the data on the graph, such as 10, 20, etc., up to my highest sum of Svolumes.
Here is my code:
Date=c(rep(data$date))
Subject=c(rep(data$subject))
Svolume=c(data$svolume)
Data=data.frame(Date,Subject,Svolume)
Data=ddply(Data, .(Date),transform,pos=cumsum(as.numeric(Svolume))-(0.5*(as.numeric(Svolume))))
ggplot(Data, aes(x=Date, y=Svolume))+
geom_bar(aes(fill=Subject),stat="identity")+
geom_text(aes(label=Svolume,y=pos),size=3)
and here is my plot:
I helped with the question here
Finally, how could I make the same plot for each month? I don't know how to get the values per month in order to have a more readable bar chart, as we can't read anything here.
If you have other ideas, I would be very glad to take any advice. Maybe a line chart would be more readable? Or maybe a separate bar chart for each stock? (I don't know how to get the values per stock either.)
I just found how to do it with lines, but once again my y axis is wrong, and it's not very readable.
Thanks for your help !! :)
Try adding the following line right before your ggplot call. It looks like your y-axis variable is not stored as numeric.
[edit] Incorporating @user20650's comment: apply as.character() first, then convert to numeric.
Data$Svolume <- as.numeric(as.character(Data$Svolume))
To produce the same plot for each month, you can add the month variable first: Data$Month <- month(as.Date(Date)) (month() is available in, e.g., the lubridate package). Then add a facet to your ggplot object:
ggplot(Data, aes(x=Date, y=Svolume)) +
...
+ facet_wrap(~ Month)
For example, your bar chart code will be:
library(lubridate)  # provides month()
Data$Svolume <- as.numeric(as.character(Data$Svolume))
Data$Month <- month(as.Date(Date))
ggplot(Data, aes(x=Date, y=Svolume)) +
geom_bar(aes(fill=Subject),stat="identity") +
geom_text(aes(label=Svolume,y=pos),size=3) +
facet_wrap(~ Month)
and your Line chart code will be:
Data$Svolume <- as.numeric(as.character(Data$Svolume))
Data$Month <- month(as.Date(Date))
ggplot(Data, aes(x=Date, y=Svolume, colour=Subject)) +
geom_line() +
facet_wrap(~ Month)
I have CSV data of a log for 24 hours that looks like this:
svr01,07:17:14,'u1#user.de','8.3.1.35'
svr03,07:17:21,'u2#sr.de','82.15.1.35'
svr02,07:17:30,'u3#fr.de','2.15.1.35'
svr04,07:17:40,'u2#for.de','2.1.1.35'
I read the data with tbl <- read.csv("logs.csv")
How can I plot this data in a histogram to see the number of hits per hour?
Ideally, I would get 4 bars per hour, representing the hits for svr01, svr02, svr03 and svr04.
Thank you for helping me here!
I don't know if I understood you right, so I will split my answer in two parts. The first part is how to convert your time into a vector you can use for plotting.
a) Converting your data into hours:
#df being the dataframe
df$timestamp <- strptime(df$timestamp, format="%H:%M:%S")
df$hours <- as.numeric(format(df$timestamp, format="%H"))
hist(df$hours)
This gives you a histogram of hits over all servers. If you want to split the histograms this is one way but of course there are numerous others:
b) Making a histogram with ggplot2
#install.packages("ggplot2")
require(ggplot2)
ggplot(data=df) + geom_histogram(aes(x=hours), binwidth=1) + facet_wrap(~ server)
# or use a color instead
ggplot(data=df) + geom_histogram(aes(x=hours, fill=server), binwidth=1)
c) You could also use another package:
require(plotrix)
l <- split(df$hours, f=df$server)
multhist(l)
The examples are given below. The third makes comparison easier, but ggplot2 simply looks better, I think.
EDIT
Here is what these solutions look like:
first solution:
second solution:
third solution:
An example dataset:
dat = data.frame(server = paste("svr", round(runif(1000, 1, 10)), sep = ""),
time = Sys.time() + sort(round(runif(1000, 1, 36000))))
The trick I use is to create a new variable which only specifies in which hour the hit was recorded:
dat$hr = strftime(dat$time, "%H")
Now we can use some plyr magic:
library(plyr)
hits_hour = count(dat, vars = c("server","hr"))
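If plyr is not available, a similar table can be built with base R. The sketch below uses a small made-up dat (two hypothetical servers, random hours) purely for illustration. Note that, unlike plyr's count, this approach also keeps server/hour combinations with zero hits.

```r
set.seed(1)
dat <- data.frame(server = sample(c("svr01", "svr02"), 200, replace = TRUE),
                  hr     = sprintf("%02d", sample(0:23, 200, replace = TRUE)))

# Cross-tabulate server by hour and flatten to a long data frame;
# responseName gives the count column the name "freq" used in the plots
hits_hour <- as.data.frame(table(server = dat$server, hr = dat$hr),
                           responseName = "freq")
```

The result has the columns server, hr and freq, so it can be passed to the same ggplot calls.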
And create the plot:
ggplot(data = hits_hour) + geom_bar(aes(x = hr, y = freq, fill = server), stat="identity", position = "dodge")
Which looks like:
I don't really like this plot, I'd be more in favor of:
ggplot(data = hits_hour) + geom_line(aes(x = as.numeric(hr), y = freq)) + facet_wrap(~ server, nrow = 1)
Which looks like:
Putting all the facets in one row allows easy comparison of the number of hits between the servers. This will look even better when using real data instead of my random data.