Extract day and month from date - r

I'm trying to extract only the day and the month from as.POSIXct entries in a dataframe to overlay multiple years of data from the same months in a ggplot.
I have the data as time-series objects ts.
data.ts<-read.zoo(data, format = "%Y-%m-%d")
ts<-SMA(data.ts[,2], n=10)
df<-data.frame(date=as.POSIXct(time(ts)), value=ts)
ggplot(df, aes(x=date, y=value),
group=factor(year(date)), colour=factor(year(date))) +
geom_line() +
labs(x="Month", colour="Year") +
theme_classic()
Now, obviously if I only use "date" in aes, it'll plot the normal time-series as a consecutive sequence across the years. If I do "day(date)", it'll group by day on the x-axis. How do I pull out day AND month from the date? I only found yearmon(). If I try as.Date(df$date, format="%d %m"), it's not doing anything and if I show the results of the command, it would still include the year.
data:
> data
Date V1
1 2017-02-04 113.26240
2 2017-02-05 113.89059
3 2017-02-06 114.82531
4 2017-02-07 115.63410
5 2017-02-08 113.68569
6 2017-02-09 115.72382
7 2017-02-10 114.48750
8 2017-02-11 114.32556
9 2017-02-12 113.77024
10 2017-02-13 113.17396
11 2017-02-14 111.96292
12 2017-02-15 113.20875
13 2017-02-16 115.79344
14 2017-02-17 114.51451
15 2017-02-18 113.83330
16 2017-02-19 114.13128
17 2017-02-20 113.43267
18 2017-02-21 115.85417
19 2017-02-22 114.13271
20 2017-02-23 113.65309
21 2017-02-24 115.69795
22 2017-02-25 115.37587
23 2017-02-26 114.64885
24 2017-02-27 115.05736
25 2017-02-28 116.25590
If I create a new column with only day and month
df$day<-format(df$date, "%m/%d")
ggplot(df, aes(x=day, y=value),
group=factor(year(date)), colour=factor(year(date))) +
geom_line() +
labs(x="Month", colour="Year") +
theme_classic()
I get such a graph for the two years.
I want it to look like the plot in the question "ggplot: Multiple years on same plot by month", only with daily data instead of monthly.

You are almost there. Because you want to overlay the same days and months from every year, the x-axis needs a continuous variable that does not depend on the year. The day of the year (format code "%j") does the trick.
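library(ggplot2)
# Example data: five consecutive days, plus the matching days about a year earlier
data <- data.frame(Date = c(Sys.Date()-7, Sys.Date()-372, Sys.Date()-6, Sys.Date()-371,
                            Sys.Date()-5, Sys.Date()-370, Sys.Date()-4, Sys.Date()-369,
                            Sys.Date()-3, Sys.Date()-368),
                   V1 = c(113.23, 123.23, 121.44, 111.98, 113.5, 114.57, 113.44, 121.23, 122.23, 110.33))
data$year <- format(as.Date(data$Date), "%Y")               # year, used for grouping
data$Date <- as.numeric(format(as.Date(data$Date), "%j"))   # day of the year (1-366)
ggplot(data = data, mapping = aes(x = Date, y = V1, shape = year, color = year)) +
  geom_point() +
  geom_line() +
  theme_bw()   # the original was missing the "+" before theme_bw(), so the theme was never applied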
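If you would rather have month names on the axis than day-of-year numbers, one alternative (not part of the answer above, just a sketch) is to move every observation into a single dummy year and keep the column as a Date, so that scale_x_date() can label it. The data frame df, its values, and the plot_date column below are made up for illustration; 2016 is used as the dummy year only because it is a leap year, so Feb 29 stays valid.

```r
library(ggplot2)

# Hypothetical two-year daily series, for illustration only
df <- data.frame(date = seq(as.Date("2016-01-01"), as.Date("2017-12-31"), by = "day"))
df$year <- format(df$date, "%Y")
doy <- as.numeric(format(df$date, "%j"))
df$value <- sin(2 * pi * doy / 365) + (as.numeric(df$year) - 2016)

# Same month and day, but always in the dummy year 2016
df$plot_date <- as.Date(paste0("2016-", format(df$date, "%m-%d")))

p <- ggplot(df, aes(x = plot_date, y = value, colour = year, group = year)) +
  geom_line() +
  scale_x_date(date_breaks = "1 month", date_labels = "%b") +
  labs(x = "Month", colour = "Year") +
  theme_classic()
p
```

The year shown in plot_date is meaningless; only the month and day matter, which is why the axis labels use "%b" alone.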

Related

How do I change the origin of the x axis in ggplot to go from 'August to March' instead of 'Jan to March, August to December'?

I want to plot temperature data over time, with the x axis ordered as: "08-01", "09-01", "10-01", "11-01", "12-01", "01-01", "02-01", "03-01"
Rather than: "01-01", "02-01", "03-01", "08-01", "09-01", "10-01", "11-01", "12-01", which is what R is doing.
My data looks like the following. My x axis uses the Month_day column. Unique values in this column are: "08-01", "09-01", "10-01", "11-01", "12-01", "01-01", "02-01", "03-01".
> head(upstream)
Date daily_aveTempC Moving_Average_7day Year Month Day Month_day monthAbb Migration EmbryoDev
1 2007-08-01 13.49556 13.94947 2007 08 01 08-01 Aug Upstream
2 2007-08-02 13.44325 13.74864 2007 08 02 08-02 Aug Upstream
3 2007-08-03 12.93881 13.56086 2007 08 03 08-03 Aug Upstream
4 2007-08-04 12.78937 13.41106 2007 08 04 08-04 Aug Upstream
5 2007-08-05 13.13963 13.29029 2007 08 05 08-05 Aug Upstream
6 2007-08-06 13.11844 13.19651 2007 08 06 08-06 Aug Upstream
I have the following code that plots Month_day (x axis) vs Moving Average 7day (y axis).
png(paste0(read_out_final, "Migration_Upstream_7day_MovingAve_Sal_4.png"), res=300, width = 15, height = 8, units = "in")
ggplot(data=upstream, aes(x=as.factor(Month_day), y=factor(Moving_Average_7day, levels=upstream$Month_day), color=Year, group=Year)) +
geom_line(size=1) +
theme_bw() +
scale_y_continuous(n.breaks = 20,
limits=c(1,20)) +
scale_x_discrete(breaks = upstream$Month_day[grep("0*-01", upstream$Month_day)]) +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
labs(title="Salmon Creek 4: Upstream Migration from August to March",
x="Date",
y="Temperature (7-Day Rolling Average degrees C)")
dev.off()
This plots the data: "01-01", "02-01", "03-01", "08-01", "09-01", "10-01", "11-01", "12-01".
But I want the data plotted: "08-01", "09-01", "10-01", "11-01", "12-01", "01-01", "02-01", "03-01"
I've seen solutions to this issue using the plot() function, but not for ggplot.
Almost every question on ggplot2 that includes "order of ... axis" can be resolved by using factor(., levels=), and explicitly controlling the order of the levels.
dat <- data.frame(dt = seq(as.Date("2020-08-01"), as.Date("2021-04-01"), by="month"), y = 1:9)
dat$MonDay <- format(dat$dt, format = "%m-%d")
dat
# dt y MonDay
# 1 2020-08-01 1 08-01
# 2 2020-09-01 2 09-01
# 3 2020-10-01 3 10-01
# 4 2020-11-01 4 11-01
# 5 2020-12-01 5 12-01
# 6 2021-01-01 6 01-01
# 7 2021-02-01 7 02-01
# 8 2021-03-01 8 03-01
# 9 2021-04-01 9 04-01
library(ggplot2)
ggplot(dat, aes(MonDay, y)) + geom_point()
This is because ggplot2 looks to order its variables; if numeric or integer, it's easy; if character, then it sorts it lexicographically, and it seems clear that "08-01" comes after "04-01" (despite the fact that the strings were formed from an object that had the opposite ordering).
dat$MonDay <- factor(dat$MonDay, levels = unique(dat$MonDay[order(dat$dt)]))
ggplot(dat, aes(MonDay, y)) + geom_point()
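With daily data from several years, the same month-day strings repeat, so ordering by the date itself is no longer enough. One possible way to build the levels (a sketch on made-up dates, not the asker's upstream data) is to rank the months so that August comes first and then sort the month-day strings within that rank. The names season_rank, lev and md_f are hypothetical.

```r
# Hypothetical daily dates covering two August-to-March seasons
dates <- c(seq(as.Date("2007-08-01"), as.Date("2008-03-31"), by = "day"),
           seq(as.Date("2008-08-01"), as.Date("2009-03-31"), by = "day"))
md <- format(dates, "%m-%d")

# Rank months so that Aug = 0, Sep = 1, ..., Dec = 4, Jan = 5, ..., Jul = 11
mon <- as.integer(format(dates, "%m"))
season_rank <- (mon - 8) %% 12

# Levels sorted by season rank first, then by the month-day string within a month
lev <- unique(md[order(season_rank, md)])
md_f <- factor(md, levels = lev)

head(levels(md_f), 3)
tail(levels(md_f), 3)
```

Mapping md_f to the x aesthetic should then give an axis that runs from "08-01" to "03-31".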

Issue to have correct scale with date ggplot

I have two dataframes tur_e and tur_w. Below you can see the data frame:
tur_e:
Time_f turbidity_E
1 2014-12-12 00:00:00 87
2 2014-12-12 00:15:00 87
3 2014-12-12 00:30:00 91
4 2014-12-12 00:45:00 84
5 2014-12-12 01:00:00 92
6 2014-12-12 01:15:00 89
tur_w:
Time_f turbidity_w
47 2015-06-04 11:45:00 8.4
48 2015-06-04 12:00:00 10.5
49 2015-06-04 12:15:00 9.2
50 2015-06-04 12:30:00 9.1
51 2015-06-04 12:45:00 8.7
52 2015-06-04 13:00:00 8.4
I then create a single data frame combining turbidity_E and turbidity_w. I join on the timestamp (Time_f) and use melt to reshape the data:
dplr <- left_join(tur_e, tur_w, by=c("Time_f"))
dt.df <- melt(dplr, measure.vars = c("turbidity_E", "turbidity_w"))
I then plot a series of box plots over time. The code is below:
dt.df%>% mutate(Time_f = ymd_hms(Time_f)) %>%
ggplot(aes(x = cut(Time_f, breaks="month"), y = value)) +
geom_boxplot(outlier.size = 0.3) + facet_wrap(~variable, ncol=1)+labs(x = "time")
I obtain the following graph:
I would like to reduce the number of dates that appear in my x-axis. I add this line of code:
scale_x_date(breaks = date_breaks("6 months"),labels = date_format("%b"))
I got this following error:
Error: Invalid input: date_trans works with objects of class Date only
I tried a lot of different solutions but none of them worked. Any help would be appreciated! Thanks!
Two things. First, you need to use scale_x_datetime (you don't have only dates, but also time!). Secondly, when you cut x, it actually just becomes a factor, losing any sense of time altogether. If you want a boxplot of each month, you can group by that cut instead:
dt.df %>% mutate(Time_f = lubridate::ymd_hms(Time_f)) %>%
ggplot(aes(x = Time_f, y = value, group = cut(Time_f, breaks="month"))) +
geom_boxplot(outlier.size = 0.3) +
facet_wrap(~variable, ncol = 1) +
labs(x = "time") +
scale_x_datetime(date_breaks = '1 month')
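To see why the original attempt failed and how to get fewer labels, here is a small sketch on made-up data (the vector t and data frame d are hypothetical stand-ins for Time_f and dt.df). It checks that cut() turns the timestamps into a factor, then keeps the timestamps on the x-axis and thins the labels with date_breaks and date_labels.

```r
library(ggplot2)

# Hypothetical timestamps every 6 hours between the two dates shown in the question
t <- seq(as.POSIXct("2014-12-12 00:00:00", tz = "UTC"),
         as.POSIXct("2015-06-04 12:00:00", tz = "UTC"), by = "6 hours")
set.seed(1)
d <- data.frame(Time_f = t, value = rnorm(length(t), mean = 50, sd = 10))

class(d$Time_f)                         # date-time class: use scale_x_datetime()
class(cut(d$Time_f, breaks = "month"))  # a factor: no date scale applies to this

p <- ggplot(d, aes(x = Time_f, y = value, group = cut(Time_f, breaks = "month"))) +
  geom_boxplot(outlier.size = 0.3) +
  labs(x = "time") +
  scale_x_datetime(date_breaks = "2 months", date_labels = "%b %Y")
p
```

The break width ("2 months" here) is an arbitrary choice for the sketch; "6 months" as in the question works the same way.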

Time series graph does not show a fluid line

I am quite new to R and am trying to perform an ARIMA time series forecast. The data I am looking into is electricity load per 15 min. My data looks as follows:
day month year PTE periode_van periode_tm gemeten_uitwisseling
1 1 01 2010 1 0 secs 900 secs 2636
2 1 01 2010 2 900 secs 1800 secs 2621
3 1 01 2010 3 1800 secs 2700 secs 2617
4 1 01 2010 4 2700 secs 3600 secs 2600
5 1 01 2010 5 3600 secs 4500 secs 2582
geplande_import geplande_export date weekend
1 719 -284 2010-01-01 00:00:00 0
2 719 -284 2010-01-01 00:15:00 0
3 719 -284 2010-01-01 00:30:00 0
4 719 -284 2010-01-01 00:45:00 0
5 650 -253 2010-01-01 01:00:00 0
weekday Month gu_ma
1 5 01 NA
2 5 01 NA
3 5 01 NA
4 5 01 NA
5 5 01 NA
To create a time series I have used the following code:
library("zoo")
ZOO <- zoo(NLData$gemeten_uitwisseling,
order.by=as.POSIXct(NLData$date, format="%Y-%m-%d %H:%M:%S"))
ZOO <- na.approx(ZOO)
tsNLData <- ts(ZOO)
plot(tsNLData)
I have also tried the following
NLDatats <- ts(NLData$gemeten_uitwisseling, frequency = 96)
However, when I plot the data I get a solid dark band instead of a line.
How can I solve this?
There doesn't seem to be any problem with your graph, but your data come in 15 minute intervals, and you are plotting 4 years worth of data. So naturally it will look like a dark shaded region because there is no way to show the thousands of data points you have in your series in a single plot.
If you are struggling to handle this much data, you can consider sampling from your data frame before plotting, although this will remove seasonality and autocorrelation from the outcome. That can be helpful if you want to know average values of your outcome over time, but not as helpful to see the seasonal and autocorrelative structure in the data.
See the code below that uses dplyr and ggplot2 to plot some simulated time series that illustrates these issues. It's always best to start with simulated data and then work with your own data.
library(ggplot2)
library(dplyr)
# Simulate an AR(1) series with 10,000 observations
sim_data <- arima.sim(model = list(ar = .88, order = c(1, 0, 0)), n = 10000, sd = .3)
# tibble() replaces the deprecated data_frame()
sim_df <- tibble(y = as.numeric(sim_data), x = 1:10000)
# Too many points
sim_df %>% ggplot(aes(y = y, x = x)) + geom_line() +
  theme_minimal() + xlab('Time') + ylab('Y_t')
# Sample from data (random sample)
# However, this will remove autocorrelation/seasonality
sim_df %>% sample_n(500) %>%
  ggplot(aes(y = y, x = x)) + geom_line() + theme_minimal() + xlab('Time') + ylab('Y_t')
# Plot a contiguous subset, which preserves autocorrelation and seasonality
sim_df %>% slice(1:300) %>%
  ggplot(aes(y = y, x = x)) + geom_line() + theme_minimal() + xlab('Time') + ylab('Y_t')
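Another option, which the answer does not mention, is to aggregate the 15-minute readings to a coarser interval before plotting. The sketch below uses base R only and made-up data; stamp, load and daily are hypothetical names, and the daily mean is just one possible summary.

```r
set.seed(1)

# Hypothetical 15-minute load readings for 30 days (96 readings per day)
stamp <- seq(as.POSIXct("2010-01-01 00:00:00", tz = "UTC"),
             by = "15 min", length.out = 96 * 30)
load <- 2600 + 200 * sin(2 * pi * seq_along(stamp) / 96) +
  rnorm(length(stamp), sd = 30)

# Average the readings within each calendar day
daily <- aggregate(load, by = list(day = as.Date(stamp)), FUN = mean)
names(daily)[2] <- "mean_load"

plot(daily$day, daily$mean_load, type = "l",
     xlab = "Day", ylab = "Mean load")
```

Daily averaging hides the within-day cycle, so it suits a long-term overview rather than inspection of intraday seasonality.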

R ggplot how to expand the interval on the horizontal axis

I have a dataset with 100 timestamp points. When I plot the chart, the horizontal axis shows every time point, so the labels overlap each other. How can I show labels at some regular interval on the horizontal axis rather than all of them?
EU
T_DCEP DCEP
1 05/02/2016 1:28 1.14596
2 05/02/2016 1:39 1.14684
3 05/02/2016 2:04 1.14488
4 05/02/2016 3:15 1.14820
5 05/02/2016 3:34 1.14750
6 05/02/2016 4:40 1.14915
7 05/02/2016 4:56 1.14849
8 05/02/2016 5:22 1.14913
9 05/02/2016 5:55 1.14761
10 05/02/2016 6:07 1.14821
. ... ..
My code:
ggplot(EU,aes(T_DCEP,DCEP, group = 1)) + geom_line()+geom_point()
The class of the variables matters when plotting. Convert the timestamps to a valid date-time class to solve the problem:
#Example data
set.seed(451)
dates <- seq(Sys.time()-99, Sys.time(), length.out=100)
df <- data.frame(x=dates, y=rnorm(100))
head(df)
# x y
# 1 2016-11-04 09:49:09 -0.9431540
# 2 2016-11-04 09:49:10 0.7257408
# 3 2016-11-04 09:49:11 2.5257787
# 4 2016-11-04 09:49:12 1.1916054
# 5 2016-11-04 09:49:13 3.1091791
# 6 2016-11-04 09:49:14 0.2848636
class(df$x)
# [1] "POSIXct" "POSIXt"
This example will plot correctly because it is a proper date-time class.
ggplot(df, aes(x, y, group=1)) + geom_point() + geom_line()
But if I did not have a proper date-time class, it would look like your example.
df$x <- as.character(dates)
ggplot(df, aes(x, y, group=1)) + geom_point() + geom_line()
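For the strings in the question, the conversion could look like the sketch below. It assumes the timestamps are day/month/year ("%d/%m/%Y"); if they are really month/day/year, swap "%d" and "%m". The time zone "UTC" and the hourly breaks are also assumptions made for the example.

```r
library(ggplot2)

# First three rows of the question's data, typed in by hand
EU <- data.frame(T_DCEP = c("05/02/2016 1:28", "05/02/2016 1:39", "05/02/2016 2:04"),
                 DCEP = c(1.14596, 1.14684, 1.14488),
                 stringsAsFactors = FALSE)

# Parse the character column into a date-time class (assumed day/month/year order)
EU$T_DCEP <- as.POSIXct(EU$T_DCEP, format = "%d/%m/%Y %H:%M", tz = "UTC")
class(EU$T_DCEP)

# With a date-time x, ggplot2 picks sensible breaks; they can also be set explicitly
p <- ggplot(EU, aes(T_DCEP, DCEP)) +
  geom_line() +
  geom_point() +
  scale_x_datetime(date_breaks = "1 hour", date_labels = "%H:%M")
p
```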

Plotting the frequency of string matches over time in R

I've compiled a corpus of tweets sent over the past few months or so, which looks something like this (the actual corpus has a lot more columns and obviously a lot more rows, but you get the idea)
id when time day month year handle what
UK1.1 Sat Feb 20 2016 12:34:02 20 2 2016 dave Great goal by #lfc
UK1.2 Sat Feb 20 2016 15:12:42 20 2 2016 john Can't wait for the weekend
UK1.3 Sat Mar 01 2016 12:09:21 1 3 2016 smith Generic boring tweet
Now what I'd like to do in R is, using grep for string matching, plot the frequency of certain words/hashtags over time, ideally normalised by the number of tweets from that month/day/hour/whatever. But I have no idea how to do this.
I know how to use grep to create subsets of this dataframe, e.g. for all tweets including the #lfc hashtag, but I don't really know where to go from there.
The other issue is that whatever time scale is on my x-axis (hour/day/month etc.) needs to be numerical, and the 'when' column isn't. I've tried concatenating the 'day' and 'month' columns into something like '2.13' for February 13th, but this leads to the issue of R treating 2.13 as being 'earlier', so to speak, than 2.7 (February 7th) on mathematical grounds.
So basically, I'd like to make plots like these, where frequency of string x is plotted against time
Thanks!
Here's one way to count up tweets by day. I've illustrated with a simplified fake data set:
library(dplyr)
library(lubridate)
# Fake data
set.seed(485)
dat = data.frame(time = seq(as.POSIXct("2016-01-01"),as.POSIXct("2016-12-31"), length.out=10000),
what = sample(LETTERS, 10000, replace=TRUE))
tweet.summary = dat %>% group_by(day = date(time)) %>% # To summarise by month: group_by(month = month(time, label=TRUE))
summarise(total.tweets = n(),
A.tweets = sum(grepl("A", what)),
pct.A = A.tweets/total.tweets,
B.tweets = sum(grepl("B", what)),
pct.B = B.tweets/total.tweets)
tweet.summary
day total.tweets A.tweets pct.A B.tweets pct.B
1 2016-01-01 28 3 0.10714286 0 0.00000000
2 2016-01-02 27 0 0.00000000 1 0.03703704
3 2016-01-03 28 4 0.14285714 1 0.03571429
4 2016-01-04 27 2 0.07407407 2 0.07407407
...
Here's a way to plot the data using ggplot2. I've also summarized the data frame on the fly within ggplot, using the dplyr and reshape2 packages:
library(ggplot2)
library(reshape2)
library(scales)
ggplot(dat %>% group_by(Month = month(time, label=TRUE)) %>%
summarise(A = sum(grepl("A", what))/n(),
B = sum(grepl("B", what))/n()) %>%
melt(id.var="Month"),
aes(Month, value, colour=variable, group=variable)) +
geom_line() +
theme_bw() +
scale_y_continuous(limits=c(0,0.06), labels=percent_format()) +
labs(colour="", y="")
Regarding your date formatting issue, here's how to get numeric dates: You can turn the day month and year columns into a date using as.Date and/or turn the day, month, year, and time columns into a date-time column using as.POSIXct. Both will have underlying numeric values with a date class attached, so that R treats them as dates in plotting functions and other functions. Once you've done this conversion, you can run the code above to count up tweets by day, month, etc.
# Fake time data
dat2 = data.frame(day=sample(1:28, 10), month=sample(1:12,10), year=2016,
time = paste0(sample(c(paste0(0,0:9),10:12),10),":",sample(10:50,10)))
# Create date-time format column from existing day/month/year/time columns
dat2$posix.date = with(dat2, as.POSIXct(paste0(year,"-",
sprintf("%02d",month),"-",
sprintf("%02d", day)," ",
time)))
# Create date format column
dat2$date = with(dat2, as.Date(paste0(year,"-",
sprintf("%02d",month),"-",
sprintf("%02d", day))))
dat2
day month year time posix.date date
1 28 10 2016 01:44 2016-10-28 01:44:00 2016-10-28
2 22 6 2016 12:28 2016-06-22 12:28:00 2016-06-22
3 3 4 2016 11:46 2016-04-03 11:46:00 2016-04-03
4 15 8 2016 10:13 2016-08-15 10:13:00 2016-08-15
5 6 2 2016 06:32 2016-02-06 06:32:00 2016-02-06
6 2 12 2016 02:38 2016-12-02 02:38:00 2016-12-02
7 4 11 2016 00:27 2016-11-04 00:27:00 2016-11-04
8 12 3 2016 07:20 2016-03-12 07:20:00 2016-03-12
9 24 5 2016 08:47 2016-05-24 08:47:00 2016-05-24
10 27 1 2016 04:22 2016-01-27 04:22:00 2016-01-27
You can see that the underlying values of a POSIXct date are numeric (number of seconds elapsed since midnight on Jan 1, 1970), by doing as.numeric(dat2$posix.date). Likewise for a Date object (number of days elapsed since Jan 1, 1970): as.numeric(dat2$date).
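A quick check of those underlying values, using dates close to the origin so the numbers are easy to read (the time zone is set to UTC explicitly so the result does not depend on the local setting):

```r
# One day after the origin: a Date is stored as a number of days
d <- as.Date("1970-01-02")
as.numeric(d)    # 1

# One minute after the origin: a POSIXct is stored as a number of seconds
p <- as.POSIXct("1970-01-01 00:01:00", tz = "UTC")
as.numeric(p)    # 60

# Because the values are numeric, comparison and sorting work as expected
as.Date("2016-02-07") < as.Date("2016-02-13")   # TRUE
```

This is what fixes the "2.13 sorts before 2.7" problem from the question: February 7th is a smaller number than February 13th once both are real dates.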
