Issue getting a correct date scale in ggplot - r

I have two data frames, tur_e and tur_w, shown below:
tur_e:
Time_f turbidity_E
1 2014-12-12 00:00:00 87
2 2014-12-12 00:15:00 87
3 2014-12-12 00:30:00 91
4 2014-12-12 00:45:00 84
5 2014-12-12 01:00:00 92
6 2014-12-12 01:15:00 89
tur_w:
Time_f turbidity_w
47 2015-06-04 11:45:00 8.4
48 2015-06-04 12:00:00 10.5
49 2015-06-04 12:15:00 9.2
50 2015-06-04 12:30:00 9.1
51 2015-06-04 12:45:00 8.7
52 2015-06-04 13:00:00 8.4
I then create a single data frame combining turbidity_E and turbidity_w. I join on the date (Time_f) and use melt to reshape the data:
dplr <- left_join(tur_e, tur_w, by=c("Time_f"))
dt.df <- melt(dplr, measure.vars = c("turbidity_E", "turbidity_w"))
I then plotted a series of box plots over time. The code is below:
dt.df%>% mutate(Time_f = ymd_hms(Time_f)) %>%
ggplot(aes(x = cut(Time_f, breaks="month"), y = value)) +
geom_boxplot(outlier.size = 0.3) + facet_wrap(~variable, ncol=1)+labs(x = "time")
I obtain the following graph:
I would like to reduce the number of dates that appear on my x-axis. I added this line of code:
scale_x_date(breaks = date_breaks("6 months"),labels = date_format("%b"))
I got the following error:
Error: Invalid input: date_trans works with objects of class Date only
I tried a lot of different solutions but none of them worked. Any help would be appreciated! Thanks!

Two things. First, you need to use scale_x_datetime, because Time_f holds date-times (POSIXct), not plain dates. Second, when you cut x, it becomes a factor and loses any sense of time altogether, so no date scale can be applied to it. If you want one boxplot per month, keep Time_f on the x-axis and group by the cut instead:
dt.df %>% mutate(Time_f = lubridate::ymd_hms(Time_f)) %>%
ggplot(aes(x = Time_f, y = value, group = cut(Time_f, breaks="month"))) +
geom_boxplot(outlier.size = 0.3) +
facet_wrap(~variable, ncol = 1) +
labs(x = "time") +
scale_x_datetime(date_breaks = '1 month')
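As a rough illustration of both points, the sketch below uses made-up data (the `toy` data frame and its values are assumptions, not the asker's turbidity readings). It shows that `cut()` returns a factor, and that the six-month breaks with short month labels from the question can be requested through the `date_breaks` and `date_labels` arguments of `scale_x_datetime`:

```r
library(ggplot2)

# Hypothetical stand-in for dt.df: one reading every 6 hours for about 200 days
set.seed(1)
toy <- data.frame(
  Time_f = seq(as.POSIXct("2014-12-12 00:00:00", tz = "UTC"),
               by = "6 hours", length.out = 800),
  value  = rnorm(800, mean = 50, sd = 10)
)

# cut() on a date-time returns a factor, which a date scale cannot handle
monthly <- cut(toy$Time_f, breaks = "month")
class(monthly)  # "factor"

# Keep the date-time on x, use the cut only for grouping,
# and let the datetime scale choose and format the breaks
p <- ggplot(toy, aes(x = Time_f, y = value, group = monthly)) +
  geom_boxplot(outlier.size = 0.3) +
  labs(x = "time") +
  scale_x_datetime(date_breaks = "6 months", date_labels = "%b")
```

Printing `p` draws the plot. The label format `"%b"` drops the year, so `"%b %Y"` may be easier to read when the series spans more than one year.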

Related

How to use another variable values as labels on date x-axis in ggplot?

I have created a ggplot using date x axis but I would like to show their values from another variable instead of dates.
df
library(tidyverse)
library(lubridate)
df <- read_rds("https://github.com/johnsnow09/covid19-df_stack-code/blob/main/vaccine_milestones.rds?raw=true")
df
Updated.On cr_bin days_to_next_10cr_vacc
<date> <fct> <drtn>
1 2021-04-11 10 Cr 85 days
2 2021-05-27 20 Cr 46 days
3 2021-06-24 30 Cr 28 days
4 2021-07-18 40 Cr 24 days
5 2021-08-06 50 Cr 19 days
6 2021-08-25 60 Cr 19 days
7 2021-09-07 70 Cr 13 days
8 2021-09-18 80 Cr 11 days
9 2021-10-02 90 Cr 14 days
df %>%
ggplot(aes(x = Updated.On, y = days_to_next_10cr_vacc)) +
geom_col() +
scale_x_date(aes(labels = cr_bin))
Also tried: scale_x_date(aes(labels = c("10","20","30","40","50","60","70","80","90")))
In the plot, I would like the x-axis to show the values from cr_bin instead of dates: 10 Cr, 20 Cr, 30 Cr, and so on up to 90 Cr.
I have tried the code above, but I am not sure what else to use in place of labels to get the desired result.
You need to set breaks for the labels to attach to. I'm using unique, just in case there are duplicate rows.
Also note the conversion of difftime to integer.
library(tidyverse)
library(lubridate)
df <- read_rds("https://github.com/johnsnow09/covid19-df_stack-code/blob/main/vaccine_milestones.rds?raw=true")
df %>%
ggplot(aes(x = Updated.On, y = as.integer(days_to_next_10cr_vacc))) +
geom_col() +
scale_x_date(breaks = unique(df$Updated.On), labels = unique(df$cr_bin))
Created on 2021-10-21 by the reprex package (v2.0.1)
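For readers who cannot download the .rds file, here is a minimal offline sketch of the same idea. It rebuilds the first three rows shown in the question by hand (the `milestones` name is an assumption), passes the labels as a character vector, and uses `as.numeric(..., units = "days")` as an explicit alternative to `as.integer()` for the difftime column:

```r
library(ggplot2)

# First three rows of the data shown in the question, rebuilt by hand
milestones <- data.frame(
  Updated.On = as.Date(c("2021-04-11", "2021-05-27", "2021-06-24")),
  cr_bin     = factor(c("10 Cr", "20 Cr", "30 Cr"))
)
milestones$days_to_next_10cr_vacc <- as.difftime(c(85, 46, 28), units = "days")

# Convert the difftime to plain numbers, stating the unit explicitly
days <- as.numeric(milestones$days_to_next_10cr_vacc, units = "days")

# One break per date, one label per break (the two lengths must match)
p <- ggplot(milestones, aes(x = Updated.On, y = days)) +
  geom_col() +
  scale_x_date(breaks = milestones$Updated.On,
               labels = as.character(milestones$cr_bin))
```

Because `breaks` and `labels` are matched by position, both vectors should come from the same rows in the same order.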

Linear chart for different variables

I am trying to figure out why all of my data points are grouped up on the y-axis of my graph (see image below). I am trying to do multiple plots with different variables. How can I separate the values on the y-axis? Also is it possible to show the dates every quarter on the x-axis instead of showing it every year?
I am fairly new to R so your help is much appreciated
Code :
library(ggplot2)
library(xts)
beta<-as.data.frame(beta)
beta[,"Date"]<- as.Date(beta[,"Date"])
beta<- xts(beta,order.by=beta[,"Date"])
autoplot(beta,facets=Series~.)+ geom_point() + theme_bw()
Data set:
Size Value
2013-01-01 0.032715590 -0.729988962
2013-02-01 0.029004454 -0.720470432
2013-03-01 -0.005376306 -0.774927763
2013-04-01 -0.065253538 -0.832884912
2013-05-01 -0.132726778 -0.805900000
2013-06-01 -0.094694083 -0.693202747
2013-07-01 -0.067636417 -0.540439590
2013-08-01 -0.080754396 -0.523916099
2013-09-01 -0.046787938 -0.633682670
2013-10-01 -0.039442980 -0.527533014
2013-11-01 0.007652725 -0.602841925
2013-12-01 0.012766257 -0.562559325
2014-01-01 0.005465503 -0.590979360
2014-02-01 0.033734341 -0.500183338
2014-03-01 0.036242236 -0.458877891
2014-04-01 0.085039855 -0.370762659
2014-05-01 0.120012885 -0.361754453
2014-06-01 0.146198534 -0.291407100
2014-07-01 0.147598628 -0.393385963
2014-08-01 0.173900895 -0.384568303
Image:
This type of problem is generally a matter of reshaping the data.frame from wide to long format.
library(tidyverse)
beta %>%
mutate(Date = as.Date(Date)) %>%
pivot_longer(
cols = c(Size, Value),
names_to = "Variable",
values_to = "Values"
) %>%
ggplot(aes(Date, Values, color = Variable)) +
geom_point() +
geom_line() +
scale_x_date(date_breaks = "3 month", date_labels = "%b %Y") +
facet_grid(Variable ~ .) +
theme_bw() +
theme(axis.text.x=element_text(angle=60, hjust=1))
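To make the reshaping step concrete, the sketch below runs `pivot_longer` on the first two rows of the data and shows the resulting layout (the `wide` and `long` names are assumptions for illustration). It also adds `scales = "free_y"` to `facet_grid`, which is one possible way to give each panel its own y-range when the two series sit on different levels. That argument is an addition here, not part of the code above:

```r
library(tidyr)
library(ggplot2)

# First two rows of the data set, in wide format
wide <- data.frame(
  Date  = as.Date(c("2013-01-01", "2013-02-01")),
  Size  = c(0.032715590, 0.029004454),
  Value = c(-0.729988962, -0.720470432)
)

# One row per Date/Variable pair instead of one column per variable
long <- pivot_longer(wide,
                     cols = c(Size, Value),
                     names_to = "Variable",
                     values_to = "Values")
nrow(long)  # 4

# Optional: let each facet pick its own y-axis range
p <- ggplot(long, aes(Date, Values, color = Variable)) +
  geom_point() +
  geom_line() +
  facet_grid(Variable ~ ., scales = "free_y")
```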
Data
beta <- read.table(text = "
Date Size Value
2013-01-01 0.032715590 -0.729988962
2013-02-01 0.029004454 -0.720470432
2013-03-01 -0.005376306 -0.774927763
2013-04-01 -0.065253538 -0.832884912
2013-05-01 -0.132726778 -0.805900000
2013-06-01 -0.094694083 -0.693202747
2013-07-01 -0.067636417 -0.540439590
2013-08-01 -0.080754396 -0.523916099
2013-09-01 -0.046787938 -0.633682670
2013-10-01 -0.039442980 -0.527533014
2013-11-01 0.007652725 -0.602841925
2013-12-01 0.012766257 -0.562559325
2014-01-01 0.005465503 -0.590979360
2014-02-01 0.033734341 -0.500183338
2014-03-01 0.036242236 -0.458877891
2014-04-01 0.085039855 -0.370762659
2014-05-01 0.120012885 -0.361754453
2014-06-01 0.146198534 -0.291407100
2014-07-01 0.147598628 -0.393385963
2014-08-01 0.173900895 -0.384568303
", header = TRUE)

Extract day and month from date

I'm trying to extract only the day and the month from as.POSIXct entries in a dataframe to overlay multiple years of data from the same months in a ggplot.
I have the data as time-series objects ts.
data.ts<-read.zoo(data, format = "%Y-%m-%d")
ts<-SMA(data.ts[,2], n=10)
df<-data.frame(date=as.POSIXct(time(ts)), value=ts)
ggplot(df, aes(x=date, y=value),
group=factor(year(date)), colour=factor(year(date))) +
geom_line() +
labs(x="Month", colour="Year") +
theme_classic()
Now, obviously if I only use "date" in aes, it'll plot the normal time-series as a consecutive sequence across the years. If I do "day(date)", it'll group by day on the x-axis. How do I pull out day AND month from the date? I only found yearmon(). If I try as.Date(df$date, format="%d %m"), it's not doing anything and if I show the results of the command, it would still include the year.
data:
> data
Date V1
1 2017-02-04 113.26240
2 2017-02-05 113.89059
3 2017-02-06 114.82531
4 2017-02-07 115.63410
5 2017-02-08 113.68569
6 2017-02-09 115.72382
7 2017-02-10 114.48750
8 2017-02-11 114.32556
9 2017-02-12 113.77024
10 2017-02-13 113.17396
11 2017-02-14 111.96292
12 2017-02-15 113.20875
13 2017-02-16 115.79344
14 2017-02-17 114.51451
15 2017-02-18 113.83330
16 2017-02-19 114.13128
17 2017-02-20 113.43267
18 2017-02-21 115.85417
19 2017-02-22 114.13271
20 2017-02-23 113.65309
21 2017-02-24 115.69795
22 2017-02-25 115.37587
23 2017-02-26 114.64885
24 2017-02-27 115.05736
25 2017-02-28 116.25590
If I create a new column with only day and month
df$day<-format(df$date, "%m/%d")
ggplot(df, aes(x=day, y=value),
group=factor(year(date)), colour=factor(year(date))) +
geom_line() +
labs(x="Month", colour="Year") +
theme_classic()
I get such a graph for the two years.
I want it to look like this, only with daily data instead of monthly.
ggplot: Multiple years on same plot by month
You are almost there. To overlay the same days and months from different years, the x-axis needs a continuous variable that ignores the year. "Day of the year" does the trick.
data <-data.frame(Date=c(Sys.Date()-7,Sys.Date()-372,Sys.Date()-6,Sys.Date()-371,
Sys.Date()-5,Sys.Date()-370,Sys.Date()-4,Sys.Date()-369,
Sys.Date()-3,Sys.Date()-368),V1=c(113.23,123.23,121.44,111.98,113.5,114.57,113.44, 121.23, 122.23, 110.33))
data$year = format(as.Date(data$Date), "%Y")
data$Date = as.numeric(format(as.Date(data$Date), "%j"))
ggplot(data=data, mapping=aes(x=Date, y=V1, shape = year, color = year)) + geom_point() + geom_line() +
theme_bw()
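A possible variation, if the axis should show day and month labels rather than a day number: move every date into one shared dummy year and keep a real Date on the x-axis. The sketch below uses fixed dates with made-up values (the `obs` data frame is an assumption). The dummy year 2000 is an arbitrary choice; it is a leap year, so 29 February also maps to a valid date:

```r
library(ggplot2)

# Hypothetical observations for the same three days in two different years
obs <- data.frame(
  Date = as.Date(c("2016-02-04", "2016-02-05", "2016-02-06",
                   "2017-02-04", "2017-02-05", "2017-02-06")),
  V1   = c(110.1, 111.4, 112.0, 113.3, 113.9, 114.8)
)

obs$year <- format(obs$Date, "%Y")
# Day of the year, as in the answer above (4 February is day 35)
obs$doy <- as.numeric(format(obs$Date, "%j"))
# Same day and month, but in a shared dummy year
obs$same_year <- as.Date(format(obs$Date, "2000-%m-%d"))

p <- ggplot(obs, aes(x = same_year, y = V1, colour = year, group = year)) +
  geom_point() +
  geom_line() +
  scale_x_date(date_labels = "%d %b") +
  labs(x = "Day and month", colour = "Year") +
  theme_bw()
```

One difference between the two approaches: day-of-year numbers shift by one after February in leap years, while the dummy-year dates line up by calendar day.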

Validate time series index

I am using a dataset that is grouped with the group_by function from the dplyr package.
Each group has its own time index, which is supposed to consist of a sequence of 12 months.
This means that it can start in January and end in December, or in other cases it can start in June of one year and end in May of the next.
Here is the dataset example:
ID DATE
8 2017-01-31
8 2017-02-28
8 2017-03-31
8 2017-04-30
8 2017-05-31
8 2017-06-30
8 2017-07-31
8 2017-08-31
8 2017-09-30
8 2017-10-31
8 2017-11-30
8 2017-12-31
32 2017-01-31
32 2017-02-28
32 2017-03-31
32 2017-04-30
32 2017-05-31
32 2017-06-30
32 2017-07-31
32 2017-08-31
32 2017-09-30
32 2017-10-31
32 2017-11-30
32 2017-12-31
45 2016-09-30
45 2016-10-31
45 2016-11-30
45 2016-12-31
45 2017-01-31
45 2017-02-28
45 2017-03-31
45 2017-04-30
45 2017-05-31
45 2017-06-30
45 2017-07-31
45 2017-08-31
The problem is that, because of the size of the dataset, I can't visually confirm whether there are so-called "jumps", in other words whether the dates are consistent. Is there a simple way to do that in R, perhaps with some modification or combination of functions from the tibbletime package?
Any help will be appreciated.
Thank you in advance.
Here's how I would typically approach this problem using data.table. The cut.Date() and seq.Date() functions from base R are the meat of the logic, so you can use the same approach with dplyr if desired.
library(data.table)
## Convert to data.table
setDT(df)
## Convert DATE to a date in case it wasn't already
df[,DATE := as.Date(DATE)]
## Order by ID and Date
setkey(df,ID,DATE)
## Create a column with the month of each date
df[,Month := as.Date(cut.Date(DATE, breaks = "months"))]
## Generate a sequence of Dates by month for the number of observations
## in each group -- .N
df[,ExpectedMonth := seq.Date(from = min(Month),
by = "months",
length.out = .N), by = .(ID)]
## Create a summary table to test whether an ID had 12 observations where
## the actual month was equal to the expected month
Test <- df[Month == ExpectedMonth, .(Valid = ifelse(.N == 12L,TRUE,FALSE)), by = .(ID)]
print(Test)
# ID Valid
# 1: 8 TRUE
# 2: 32 TRUE
# 3: 45 TRUE
## Do a no-copy join of Test to df based on ID
## and create a column in df based on the 'Valid' column in Test
df[Test, Valid := i.Valid, on = "ID"]
## The final output:
head(df)
# ID DATE Month ExpectedMonth Valid
# 1: 8 2017-01-31 2017-01-01 2017-01-01 TRUE
# 2: 8 2017-02-28 2017-02-01 2017-02-01 TRUE
# 3: 8 2017-03-31 2017-03-01 2017-03-01 TRUE
# 4: 8 2017-04-30 2017-04-01 2017-04-01 TRUE
# 5: 8 2017-05-31 2017-05-01 2017-05-01 TRUE
# 6: 8 2017-06-30 2017-06-01 2017-06-01 TRUE
You could also do things a little more compactly, if you really wanted to, by using a self-join and skipping the creation of Test:
setDT(df)
df[,DATE := as.Date(DATE)]
setkey(df,ID,DATE)
df[,Month := as.Date(cut.Date(DATE, breaks = "months"))]
df[,ExpectedMonth := seq.Date(from = min(Month), by = "months", length.out = .N), keyby = .(ID)]
df[df[Month == ExpectedMonth,.(Valid = ifelse(.N == 12L,TRUE,FALSE)),keyby = .(ID)], Valid := i.Valid]
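The same logic might look roughly like this in dplyr. This is a sketch rather than a tested translation of the data.table code: the small `df` below is made up, with one month deliberately missing for ID 45, and the summary only checks that the months are consecutive (a `n() == 12` condition could be added for the 12-month requirement):

```r
library(dplyr)

# Hypothetical data: ID 8 is consecutive, ID 45 skips November 2016
df <- data.frame(
  ID   = rep(c(8, 45), each = 4),
  DATE = as.Date(c("2017-01-31", "2017-02-28", "2017-03-31", "2017-04-30",
                   "2016-09-30", "2016-10-31", "2016-12-31", "2017-01-31"))
)

check <- df %>%
  arrange(ID, DATE) %>%
  group_by(ID) %>%
  mutate(
    # First day of the month each date falls in
    Month = as.Date(cut(DATE, breaks = "month")),
    # The months we would see if there were no gaps
    ExpectedMonth = seq(min(Month), by = "month", length.out = n())
  ) %>%
  summarise(Valid = all(Month == ExpectedMonth), n_months = n())

check
```

Here ID 45 is flagged as invalid because its third month is December where November was expected.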
You can use the summarise function from dplyr to return a logical value of whether there are any day differences greater than 31 within each ID. You do this by first constructing a temporary date using only the year and month and attaching "-01" as the fake day:
library(dplyr)
library(lubridate)
df %>%
group_by(ID) %>%
mutate(DATE2 = ymd(paste0(sub('\\-\\d+$', '', DATE),'-01')),
DATE_diff = c(0, diff(DATE2))) %>%
summarise(Valid = !any(DATE_diff > 31))
Result:
# A tibble: 3 x 2
ID Valid
<int> <lgl>
1 8 TRUE
2 32 TRUE
3 45 TRUE
You can also visually check if there are any gaps by plotting your dates for each ID:
library(ggplot2)
df %>%
mutate(DATE = ymd(paste0(sub('\\-\\d+$', '', DATE),'-01')),
ID = as.factor(ID)) %>%
ggplot(aes(x = DATE, y = ID, group = ID)) +
geom_point(aes(color = ID)) +
scale_x_date(date_breaks = "1 month",
date_labels = "%b-%Y") +
labs(title = "Time Line by ID")

Plotting histogram for data with start and end date

I have a data set that is something like this:
start_date end_date outcome
1 2014-07-18 2014-08-20 TRUE
2 2014-08-04 2014-09-23 TRUE
3 2014-08-01 2014-09-03 TRUE
4 2014-08-01 2014-09-03 TRUE
5 2014-12-10 2014-12-10 TRUE
6 2014-10-11 2014-11-07 TRUE
7 2015-04-27 2015-05-20 TRUE
8 2014-11-22 2014-12-25 TRUE
9 2015-03-24 2015-04-26 TRUE
10 2015-03-12 2015-04-10 FALSE
11 2014-05-29 2014-06-28 FALSE
12 2015-03-19 2015-04-20 TRUE
13 2015-03-25 2015-04-26 TRUE
14 2015-03-25 2015-04-26 TRUE
15 2014-07-09 2014-08-10 TRUE
16 2015-03-26 2015-04-26 TRUE
17 2014-07-09 2014-08-10 TRUE
18 2015-03-30 2015-04-28 TRUE
19 2014-03-13 2014-04-13 TRUE
20 2015-04-01 2015-04-29 TRUE
I want to plot a histogram where each bar corresponds to a month and it contains the proportion of FALSE / ALL = (FALSE + TRUE) in that month.
What is the easiest way to do this in R preferably using ggplot?
Here is one way. There are probably better ways to do this, but I will leave what I tried. The main job was to create a new data frame for the graphic. Using your data above, I first converted factors to date objects. If you already have date objects in your data, you do not need this step. Then I summarised your data for start_date and end_date using count(). I bound the two data frames together and did a further calculation to get the proportion of FALSE for each month.
library(zoo)
library(dplyr)
library(ggplot2)
library(lubridate)
library(scales)  # provides date_format() and date_breaks() used in the plot
mutate_each(mydf, funs(as.POSIXct(., format = "%Y-%m-%d")), -outcome) %>%
mutate_each(funs(paste(year(.),"-",month(.), sep = "")), vars = -outcome) -> foo1;
count(foo1, start_date, outcome) %>% rename(date = start_date) -> foo2;
count(foo1, end_date, outcome) %>%
rename(date = end_date) %>%
bind_rows(foo2) %>%
group_by(date, outcome) %>%
summarize(total = sum(n)) %>%
summarize(prop = length(which(outcome == FALSE)) / sum(total)) %>%
mutate(date = as.Date(as.yearmon(date))) -> foo3
ggplot(data = foo3, aes(x = date, y = prop)) +
geom_bar(stat = "identity") +
scale_x_date(labels = date_format("%Y-%m"), breaks = date_breaks("month")) +
theme(axis.text.x = element_text(angle = 90, vjust = 1))
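For comparison, here is a shorter sketch of the core calculation using current dplyr verbs. It makes a simplifying assumption that differs from the answer above: each record is assigned to the month of its start_date only, and end_date is ignored. The data are six rows copied from the question (rows 1, 2, 9, 10, 11 and 12):

```r
library(dplyr)
library(ggplot2)

# Six rows from the question's data, start dates only
mydf <- data.frame(
  start_date = as.Date(c("2014-07-18", "2014-08-04", "2015-03-24",
                         "2015-03-12", "2014-05-29", "2015-03-19")),
  outcome    = c(TRUE, TRUE, TRUE, FALSE, FALSE, TRUE)
)

prop_by_month <- mydf %>%
  # Collapse each date to the first day of its month
  mutate(month = as.Date(format(start_date, "%Y-%m-01"))) %>%
  group_by(month) %>%
  # mean() of a logical vector is the share of TRUE, so negate to count FALSE
  summarise(prop_false = mean(!outcome))

p <- ggplot(prop_by_month, aes(x = month, y = prop_false)) +
  geom_col() +
  scale_x_date(date_breaks = "1 month", date_labels = "%Y-%m") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5))
```

With these six rows, March 2015 has one FALSE out of three records, so its bar has height 1/3.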

Resources