R + ggplot2: how to hide missing dates from x-axis?

Say we have the following simple data frame of date-value pairs, where some dates are missing from the sequence (i.e. Jan 12 through Jan 14). When I plot the points, these missing dates still appear on the x-axis, but there are no points corresponding to them. I want to keep these missing dates off the x-axis, so that the point sequence has no breaks. Any suggestions on how to do this? Thanks!
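library(ggplot2)
dts <- as.Date(c('2011-01-10', '2011-01-11', '2011-01-15', '2011-01-16'))
df <- data.frame(dt = dts, val = seq_along(dts))
# date_labels/date_breaks replace the older format=/major= arguments of scale_x_date
ggplot(df, aes(dt, val)) + geom_point() +
  scale_x_date(date_labels = '%d%b', date_breaks = '1 day')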

I made a package that does this. It's called bdscale and it's on CRAN and github. Shameless plug.
To replicate your example:
> library(bdscale)
> library(ggplot2)
> library(scales)
> dts <- as.Date( c('2011-01-10', '2011-01-11', '2011-01-15', '2011-01-16'))
> df <- data.frame(dt=dts, val=seq_along(dts))
> ggplot(df, aes(x=dt, y=val)) + geom_point() +
    scale_x_bd(business.dates=dts, labels=date_format('%d%b'))
But what you probably want is to load known valid dates, then plot your data using the valid dates on the x-axis:
> nyse <- bdscale::yahoo('SPY') # get valid dates from SPY prices
> dts <- as.Date('2011-01-10') + 1:10
> df <- data.frame(dt=dts, val=seq_along(dts))
> ggplot(df, aes(x=dt, y=val)) + geom_point() +
scale_x_bd(business.dates=nyse, labels=date_format('%d%b'), max.major.breaks=10)
Warning message:
Removed 3 rows containing missing values (geom_point).
The warning is telling you that it removed three dates:
15th = Saturday
16th = Sunday
17th = MLK Day
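The three dropped dates can be cross-checked in base R. This is a minimal sketch that only identifies weekends through as.POSIXlt()$wday (0 = Sunday, 6 = Saturday). The holiday is not detected this way and still has to come from a calendar such as the NYSE dates above:

```r
# The ten dates plotted above: 2011-01-11 through 2011-01-20
dts <- as.Date('2011-01-10') + 1:10

# wday does not depend on the locale: 0 = Sunday, ..., 6 = Saturday
wday <- as.POSIXlt(dts)$wday
weekend <- dts[wday %in% c(0, 6)]
print(weekend)  # the Saturday and Sunday inside the range

# MLK Day (2011-01-17) is a Monday, so a weekday test alone cannot find it
mlk <- as.Date('2011-01-17')
print(as.POSIXlt(mlk)$wday)
```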

Turn the date data into a factor then. At the moment, ggplot is interpreting the data in the sense you have told it the data are in - a continuous date scale. You don't want that scale, you want a categorical scale:
library(ggplot2)
dts <- as.Date( c('2011-01-10', '2011-01-11', '2011-01-15', '2011-01-16'))
df <- data.frame(dt = dts, val = seq_along(dts))
# continuous date scale: the missing days still take up space on the axis
ggplot(df, aes(dt,val)) + geom_point() +
  scale_x_date(date_labels = '%d%b', date_breaks = '1 day')
versus
df <- data.frame(dt = factor(format(dts, format = '%d%b')),
val = seq_along(dts))
ggplot(df, aes(dt,val)) + geom_point()
which produces:
Is that what you wanted?

First question: why do you want to do that? There is no point in showing a coordinate-based plot if your axes are not coordinates. If you really want to do this, you can convert to a factor. Be careful with the order, though:
dts <- c(as.Date( c('31-10-2011', '01-11-2011', '02-11-2011',
'05-11-2011'),format="%d-%m-%Y"))
dtsf <- format(dts, format= '%d%b')
df <- data.frame(dt=ordered(dtsf,levels=dtsf),val=seq_along(dts))
ggplot(df, aes(dt,val)) + geom_point()
With factors you have to be careful: unless you set the levels yourself (as done above with an ordered factor), they are sorted alphabetically by default, which can get you into trouble with some date formats. So be careful what you do. If you don't take the order into account, the October date ends up after the November dates:
df <- data.frame(dt=factor(dtsf),val=seq_along(dts))
ggplot(df, aes(dt,val)) + geom_point()
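To see the difference without plotting, one can compare the levels directly. A small sketch using the same dates (the names alpha and in_order are illustrative only):

```r
dts <- as.Date(c('31-10-2011', '01-11-2011', '02-11-2011', '05-11-2011'),
               format = '%d-%m-%Y')
dtsf <- format(dts, format = '%d%b')  # month abbreviation depends on the locale

# Default: levels are sorted as text, so the October date moves to the end
alpha <- factor(dtsf)
print(levels(alpha))

# Explicit levels keep the original (chronological) order
in_order <- factor(dtsf, levels = unique(dtsf))
print(levels(in_order))
```

Passing levels = unique(dtsf) assumes the input vector is already in chronological order; if it is not, sort the dates before formatting them.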

Related

How do I order my months (Jan through Dec) in this plot in R? Transforming months to factor gives me thousands of NA

I feel like I've tried about everything...If I transform the months into factors, I get 16 thousand NA's. As my code is I get the plot to come out, but with the months out of order.
I got the original code here: https://www.r-graph-gallery.com/283-the-hourly-heatmap.html
I've edited it to fit my data, but my months come out out of order.
My months are numbers in the csv file (int in r), then changing them to abbreviations makes them characters.
SoilT.data<-read.csv(file="Transect 1 Soil Temp RStudio Number month.csv")
library(ggplot2)
library(dplyr)
library(viridis)
library(ggExtra)
library(lubridate)
df <-SoilT.data %>% select(Lower.Panel,Day,Hourly,Month,Year)
df <- transform(df, MonthAbb = month.abb[Month])
Panel.Area <-unique(df$Lower.Panel)
p <-ggplot(df,aes(Day,Hourly,fill=Lower.Panel))+geom_tile(color= "white",size=0.1)+scale_fill_viridis(name="Hrly Temps",option ="C")
p <-p + facet_grid(Year~MonthAbb)
p <-p + scale_y_continuous(trans = "reverse", breaks = unique(df$Hourly))
p <-p + scale_x_continuous(breaks =c(1,10,20,31))
p <-p + labs(title= paste("Hourly Temperature - Lower Panel",Panel.Area),x="Day", y="Hourly")
p <-p + theme(legend.position = "bottom")+theme(plot.title=element_text(size =14))+theme(axis.text.y=element_text(size=6)) +theme(strip.background =element_rect(colour="white"))+theme(plot.title=element_text(hjust=0))+theme(axis.ticks=element_blank())+theme(axis.text=element_text(size=7))+theme(legend.title=element_text(size=8))+theme(legend.text=element_text(size=6))+removeGrid()
p
You should have constructed MonthAbb as a factor. That way you could have specified the ordering of the levels attribute which most plotting functions will honor when it comes time for plotting.
df <- transform(df, MonthAbb = factor(month.abb[Month], levels = month.abb))
Factor vectors are actually integers which plotting functions use as indices into the levels attribute specified at time of creation (or the default, alphabetical order, which is what your heatmap code was using).
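A short sketch of the idea on a toy vector. The all-NA case at the end is only a guess at what may have produced the thousands of NAs in the question (applying month-name levels to the integer column directly); the original attempt is not shown, so treat it as an assumption:

```r
Month <- c(12L, 1L, 7L, 3L)   # months stored as integers, as in the CSV

# A factor with levels = month.abb keeps calendar order
as_fac <- factor(month.abb[Month], levels = month.abb)
print(levels(as_fac)[1:3])    # "Jan" "Feb" "Mar"
print(as.integer(as_fac))     # 12 1 7 3: the codes index into the levels

# Assumed failure mode: integer values never match the name levels
bad <- factor(Month, levels = month.abb)
print(sum(is.na(bad)))        # 4, i.e. every element is NA
```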

How can I plot a dataframe in R given in quarterly years?

I have a dataset given with:
Country Time Value
1 USA 1999-Q1 292929
2 USA 1999-Q2 392023
3 USA 1999-Q3 9392992
...
and so on. Now I would like to plot this data frame with Time on the x-axis and Value on the y-axis. But the problem I face is that I don't know how to plot the Time, because it is not given as month/day/year. If that were the case I would just use as.Date(format = "%m%d%y"). I am not allowed to change the quarter names, so when I plot the data they should stay that way. How can I do this?
Thank you in advance!
Assuming DF shown in the Note at the end, convert the Time column to yearqtr class which directly represents year and quarter (as opposed to using Date class) and use scale_x_yearqtr. See ?scale_x_yearqtr for more information.
library(ggplot2)
library(zoo)
fmt <- "%Y-Q%q"
DF$Time <- as.yearqtr(DF$Time, format = fmt)
ggplot(DF, aes(Time, Value, col = Country)) +
geom_point() +
geom_line() +
scale_x_yearqtr(format = fmt)
(continued after graphics)
It would also be possible to convert it to a wide form zoo object with one column per country and then use autoplot. Using DF from the Note below:
fmt <- "%Y-Q%q"
z <- read.zoo(DF, split = "Country", index = "Time",
FUN = as.yearqtr, format = fmt)
autoplot(z) + scale_x_yearqtr(format = fmt)
Note
Lines <- "
Country Time Value
1 USA 1999-Q1 292929
2 USA 1999-Q2 392023
3 USA 1999-Q3 9392992"
DF <- read.table(text = Lines)
Using ggplot2:
library(ggplot2)
ggplot(df, aes(Time, Value, fill = Country)) + geom_col()
I know other people have already answered, but I think this more general answer should also be here.
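When you call as.Date(), you can parse just the beginning of the string. I tried it on your data frame (I called it df), and it worked. Note that with format = "%Y" only the year is read and the month and day are filled in from the current date, so all quarters of the same year map to the same date: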
> as.Date(df$Time, format = "%Y")
[1] "1999-11-28" "1999-11-28" "1999-11-28"
Now, I don't know if you want to use plot(), ggplot(), the ggplot2 library... I don't know that, and it doesn't matter. However you want to specify the y axis, you can do it this way.
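If a true Date axis is wanted while the labels stay in the original YYYY-Qn form, one option is to map each quarter to its first day and reuse the original strings as labels. This is a base-R sketch of that alternative, not the approach from the answers above; quarter_start is a hypothetical helper name, and the plot is only built when ggplot2 is installed:

```r
# Hypothetical helper: "1999-Q3" -> as.Date("1999-07-01")
quarter_start <- function(x) {
  yr <- as.integer(substr(x, 1, 4))
  q  <- as.integer(substr(x, 7, 7))
  as.Date(sprintf('%d-%02d-01', yr, 3L * (q - 1L) + 1L))
}

df <- data.frame(Country = 'USA',
                 Time    = c('1999-Q1', '1999-Q2', '1999-Q3'),
                 Value   = c(292929, 392023, 9392992))
df$Date <- quarter_start(df$Time)
print(df$Date)

# Breaks sit at the quarter starts; labels stay "1999-Q1", "1999-Q2", ...
if (requireNamespace('ggplot2', quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(df, aes(Date, Value)) + geom_line() + geom_point() +
    scale_x_date(breaks = df$Date, labels = df$Time)
}
```

Unlike as.Date(df$Time, format = "%Y"), this keeps the three quarters at distinct positions on the axis.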

ggplot: Plotting timeseries data with missing values

I have been trying to plot a graph between two columns from a data frame which I had created. The data values stored in the first column is daily time data named "Time"(format- YYYY-MM-DD) and the second column contains precipitation magnitude, which is a numeric value named "data1".
This data is taken from an excel file "St Lucia3" which has a total 11598 data points and stores daily precipitation data from 1981 to 2018 in two columns:
YearMonthDay (format- "YYYYMMDD", example "19810501")
Rainfall (mm)
The code for importing data into R:
library(readxl)
StLucia <- read_excel("C:/Users/hp/Desktop/St Lucia3.xlsx")
The code for time data "Time" :
Time <- as.Date(as.character(StLucia$YearMonthDay), format= "%Y%m%d")
The code for precipitation data "data1" :
library("imputeTS")
data1 <- na_ma(StLucia$`Rainfall (mm)`, k = 4, weighting = "exponential")
The code for data frame "Precip1" :
Precip1 <- data.frame(Time, data1, check.rows=TRUE)
The code for ggplot is:
ggplot(data = Precip1, mapping= aes(x= Time, y= data1)) + geom_line()
Using ggplot for plotting the graph between "Time" and "data1" results as:
Can someone please explain to me why there is an "unusual kink" like behavior at the right end of the graph, even though there are no such values in the column "data1".
The plot of "data1" data against its index is as shown:
The code for this plot is:
plot(data1, type = "l")
Any help would be highly appreciated. Thanks!
By using pad() from the padr package we can fill in the missing dates and assign them an NA value, so as to avoid plotting a line across the region of missing data.
library(padr)
library(zoo)
YearMonthDay<-c(19810501,19810502,19810504,19810505)
Data<-c(1,2,3,4)
StLucia<-data.frame(YearMonthDay,Data)
StLucia$YearMonthDay <- as.Date(as.character(StLucia$YearMonthDay), format = "%Y%m%d")
> StLucia
YearMonthDay Data
1 1981-05-01 1
2 1981-05-02 2
3 1981-05-04 3
4 1981-05-05 4
Note: you can see we are missing a date (1981-05-03), but there is still no gap between positions 2 and 3, so when plotting against the index you would not see a gap.
So let's add the missing date:
StLucia<-pad(StLucia,interval="day")
> StLucia
YearMonthDay Data
1 1981-05-01 1
2 1981-05-02 2
3 1981-05-03 NA
4 1981-05-04 3
5 1981-05-05 4
plot(StLucia, type = "l")
If you want to fill in those NA values, use na.locf() from the zoo package.
Here is a reproducible example - change the names to match your data.
library(ggplot2)
# create sample data
set.seed(47)
dd = data.frame(t = Sys.Date() + c(0:5, 30:32), y = runif(9))
# demonstrate problem
ggplot(dd, aes(t, y)) +
geom_point() +
geom_line()
The easiest solution, as Tung points out, is to use a more appropriate geom, like geom_col:
ggplot(dd, aes(t, y)) +
geom_col()
If you really want to use lines, you should fill in the missing dates with NA for rainfall.
# calculate all days
all_days = data.frame(t = seq.Date(from = min(dd$t), to = max(dd$t), by = "day"))
# join to original data
library(dplyr)
dd_complete = left_join(all_days, dd, by = "t")
# ggplot won't connect lines across missing values
ggplot(dd_complete, aes(t, y)) +
geom_point() +
geom_line()
Alternately, you could replace the missing values with 0s to have the line just go along the axis, but I think it's nicer to not plot the line, which implies no data/missing data, rather than plot 0s which implies no rainfall.
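The same completion step can also be done in base R with merge(), which may be handy when dplyr is not available. A sketch with a fixed start date (an assumption made only so the result is reproducible):

```r
set.seed(47)
start <- as.Date('2020-01-01')  # fixed date, chosen only for reproducibility
dd <- data.frame(t = start + c(0:5, 30:32), y = runif(9))

# every calendar day between the first and last observation
all_days <- data.frame(t = seq(min(dd$t), max(dd$t), by = 'day'))

# all.x = TRUE keeps the days without data and fills y with NA
dd_complete <- merge(all_days, dd, by = 't', all.x = TRUE)

print(nrow(dd_complete))          # 33 days in total
print(sum(is.na(dd_complete$y)))  # 24 days without a measurement
```

Passing dd_complete to geom_line() then leaves a gap over the 24 missing days instead of drawing one long connecting segment.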

How to plot a variable over time with time as rownames

I am trying to plot a time series in ggplot2. Assume I am using the following data structure (2500 x 20 matrix):
set.seed(21)
n <- 2500
x <- matrix(replicate(20,cumsum(sample(c(-1, 1), n, TRUE))),nrow = 2500,ncol=20)
aa <- x
rnames <- seq(as.Date("2010-01-01"), length=dim(aa)[1], by="1 month") - 1
rownames(aa) <- format(as.POSIXlt(rnames, format = "%Y-%m-%d"), format = "%d.%m.%Y")
colnames(aa) <- paste0("aa", 1:ncol(aa))
library("ggplot2")
library("reshape2")
library("scales")
aa <- melt(aa, id.vars = rownames(aa))
names(aa) <- c("time","id","value")
Now the following command to plot the time series produces a weird looking x axis:
ggplot(aa, aes(x=time,y=value,colour=id,group=id)) +
geom_line()
What I found out is that I can change the format to date:
aa$time <- as.Date(aa$time, "%d.%m.%Y")
ggplot(aa, aes(x=time,y=value,colour=id,group=id)) +
geom_line()
This looks better, but still not a good graph. My question is especially how to control the formatting of the x axis.
Does it have to be in Date format? How can I control the amount of breaks (i.e. years) shown in either case? It seems to be mandatory if Date is not used; otherwise ggplot2 uses some kind of useful default for the breaks I believe.
For example the following command does not work:
aa$time <- as.Date(aa$time, "%d.%m.%Y")
ggplot(aa, aes(x=time,y=value,colour=id,group=id)) +
geom_line() +
scale_x_continuous(breaks=pretty_breaks(n=10))
Also, if you have any hints on how to improve the overall look of the graph, feel free to add them (e.g. the lines look a bit imprecise imho).
You can format dates with scale_x_date as @Gopala mentioned. Here's an example using a shortened version of your data for illustration.
library(dplyr)
# Dates need to be in date format
aa$time <- as.Date(aa$time, "%d.%m.%Y")
# Shorten data to speed rendering
aa = aa %>% group_by(id) %>% slice(1:200)
In the code below, we get date breaks every six months with date_breaks="6 months". That's probably more breaks than you want in this case and is just for illustration. If you want to determine which months get the breaks (e.g., Jan/July, Feb/Aug, etc.) then you also need to use coord_cartesian and set the start date with xlim and expand=FALSE so that ggplot won't pad the start date. But when you set expand=FALSE you also don't get any padding on the y-axis, so you need to add the padding manually with scale_y_continuous (I'd prefer to be able to set expand separately for the x and y axes, but AFAIK it's not possible). Because the breaks are packed tightly, we use a theme statement to rotate the labels by 90 degrees.
ggplot(aa, aes(x=time,y=value,colour=id,group=id)) +
geom_line(show.legend=FALSE) +
scale_y_continuous(limits=c(min(aa$value) - 2, max(aa$value) + 1)) +
scale_x_date(date_breaks="6 months",
labels=function(d) format(d, "%b %Y")) +
coord_cartesian(xlim=c(as.Date("2009-07-01"), max(aa$time) + 182),
expand=FALSE) +
theme_bw() +
theme(axis.text.x=element_text(angle=-90, vjust=0.5))
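When particular break positions matter (e.g. every January and July), another option is to build the breaks yourself and pass them to scale_x_date(breaks = ...). A minimal sketch on a small made-up series; the data and the names d and brks are illustrative only, and the plot is built only when ggplot2 is installed:

```r
# Made-up monthly series standing in for one column of aa
d <- data.frame(time  = seq(as.Date('2010-01-01'), by = '1 month', length.out = 36),
                value = cumsum(rep(c(1, -1, 1), 12)))

# Breaks every six months, anchored on 1 January 2010
brks <- seq(as.Date('2010-01-01'), max(d$time), by = '6 months')
print(format(brks, '%Y-%m'))

if (requireNamespace('ggplot2', quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(d, aes(time, value)) + geom_line() +
    scale_x_date(breaks = brks, date_labels = '%b %Y') +
    theme(axis.text.x = element_text(angle = 90, vjust = 0.5))
}
```

Because the breaks are given explicitly, their positions do not depend on where the axis limits happen to start.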

Cannot convert a time variable to plot it on ggplot

I have two problems handling my time variable in GNU R!
Firstly, I cannot recode the time data (downloadable here) from factor (or character) with as.POSIXlt or with as.Date without an error message like this:
character string is not in a standard unambiguous format
I have then tried to convert my time data with:
dates <- strptime(time, "%Y-%m-%j")
which only gives me:
NA
Secondly, the reason why I wanted (had) to convert my time data is that I want to plot it with ggplot2 and adjust my scale_x_continuous (as described here) so that it only labels every 50th year (i.e. 1250-01-01, 1300-01-01, etc.) on the x-axis, otherwise the x-axis is too busy (see graph below).
This is the code I use:
library(ggplot2)
library(scales)
library(reshape)
df <- read.csv(file="https://dl.dropboxusercontent.com/u/109495328/time.csv")
attach(df)
dates <- as.character(time)
population <- factor(Number_Humans)
ggplot(df, aes(x = dates, y = population)) + geom_line(aes(group=1), colour="#000099") + theme(axis.text.x=element_text(angle=90)) + xlab("Time in Years (A.D.)")
You need to remove the quotation marks in the date column, then you can convert it to date format:
df <- read.csv(file="https://dl.dropboxusercontent.com/u/109495328/time.csv")
df$time <- gsub('\"', "", as.character(df$time), fixed=TRUE)
df$time <- as.Date(df$time, "%Y-%m-%d")  # %d is day of month; %j would mean day of year
ggplot(df, aes(x = time, y = Number_Humans)) +
geom_line(colour="#000099") +
theme(axis.text.x=element_text(angle=90)) +
xlab("Time in Years (A.D.)")
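For the second part of the question (a label only every 50 years), the breaks can be generated with seq() and handed to scale_x_date(). A sketch on made-up values, since the original CSV is not reproduced here; the sample strings and population numbers are assumptions:

```r
# Made-up sample mimicking quoted dates as they might appear in the file
raw <- c('"1250-01-01"', '"1275-01-01"', '"1300-01-01"', '"1400-01-01"')
df <- data.frame(time = as.Date(gsub('"', '', raw, fixed = TRUE), '%Y-%m-%d'),
                 Number_Humans = c(10, 12, 15, 30))

# one break every 50 years across the observed range
brks <- seq(min(df$time), max(df$time), by = '50 years')
print(format(brks, '%Y'))

if (requireNamespace('ggplot2', quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(df, aes(time, Number_Humans)) + geom_line(colour = '#000099') +
    scale_x_date(breaks = brks, date_labels = '%Y') +
    xlab('Time in Years (A.D.)')
}
```

This only works once the column is a real Date and the population is numeric; with character dates and a factor on the y-axis, ggplot2 treats both axes as discrete.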

Resources