Extracting a point from ggplot and plotting it - r

I have the dataset shown below:
ID A B Type Time Date
1 12 13 R 23:20 1-1-01
1 13 12 F 23:40 1-1-01
1 13 11 F 00:00 2-1-01
1 15 10 R 00:20 2-1-01
1 12 06 W 00:40 2-1-01
1 11 09 F 01:00 2-1-01
1 12 10 R 01:20 2-1-01
so on...
I tried to make the ggplot of the above dataset for A and B.
ggplot(data=dataframe, aes(x=A, y=B, colour = Type)) +geom_point()+geom_path()
Problem:
How do I add a subsetting variable that covers the first 24 hours after every 'F' point?
For the time being I have posted a data set that is continuous with respect to time, but my original data set is not. How can I make my data set continuous at 10-minute intervals? I have used the interpolation function xspline() on A and B, but I don't know how to make the data set continuous with respect to time.
The highlighted part shown below is what I am looking for. I want to extract this data set and then draw a new ggplot from it:
From MarkusN's plots, this is what I am looking for:
Taking the first point as the 'F' point and traveling 24 hours from that point (since no full 24-hour data set is available here, it should produce something like this):

I've tried the following; maybe you can get an idea from it. I recommend that you first create a variable holding the ordered time (in either minutes or hours; this example uses hours). Let's see if it helps.
library(ggplot2)
# an example data set is built
N = 100
set.seed(1)
dataframe = data.frame(A = cumsum(rnorm(N)),
B = cumsum(rnorm(N)),
Type = sample(c('R','F','W'), size = N,
prob = c(5/7,1/7,1/7), replace=T),
time.h = seq(0,240,length.out = N))
# here, a list with dataframes is built with the sequences
l_dfs = lapply(which(dataframe$Type == 'F'), function(i, .data){
transform(subset(.data[i:nrow(.data),], (time.h - time.h[1]) <= 24),
t0 = sprintf('t0=%4.2f', time.h[1]))
}, dataframe)
ggplot(data=do.call('rbind', l_dfs), aes(x=A, y=B, colour=Type)) +
geom_point() + geom_path(colour='black') + facet_wrap(~t0)

First I created sample data. Hope it's similar to your problem:
df = data.frame(id=rep(1:9), A=c(12,13,13,14,12,11,12,11,10),
B=c(13,12,10,12,6,9,10,11,12),
Type=c("F","R","F","R","W","F","R","F","R"),
datetime=as.POSIXct(c("2015-01-01 01:00:00","2015-01-01 22:50:00",
"2015-01-02 08:30:00","2015-01-02 23:00:00",
"2015-01-03 14:10:00","2015-01-05 16:30:00",
"2015-01-05 23:00:00","2015-01-06 17:00:00",
"2015-01-07 23:00:00")),
stringsAsFactors = F)
Your first question is to plot the data, highlighting the first 24h after an F-point. I used dplyr and ggplot for this task.
library(dplyr)
library(ggplot2)
df %>%
mutate(nf = cumsum(Type=="F")) %>% # build F-to-F groups
group_by(nf) %>%
mutate(first24h = as.numeric((datetime-min(datetime)) < (24*3600))) %>% # find the first 24h of each F-group
mutate(lbl=paste0(row_number(),"-",Type)) %>%
ggplot(aes(x=A, y=B, label=lbl)) +
geom_path(aes(colour=first24h)) + scale_size(range = c(1, 2)) +
geom_text()
The problem here is that the colour only changes at some points. One thing I'm not happy with is the use of different line colors for path sections: if first24h is a discrete variable,
geom_path draws two separate paths. That's why I defined the variable as numeric. Maybe someone can improve this?
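One possible improvement, sketched here on hypothetical toy data rather than the data above: setting group = 1 inside aes() overrides the default grouping by the discrete colour variable, so geom_path should keep a single connected path while still colouring its segments by first24h. The data frame toy and its values are assumptions made for illustration only.

```r
library(ggplot2)

# Toy data (hypothetical): six points, the first three flagged as "within 24h"
toy <- data.frame(
  A = c(12, 13, 13, 14, 12, 11),
  B = c(13, 12, 10, 12, 6, 9),
  first24h = c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE)
)

# group = 1 puts every row in the same group, so the path is not split
# at the point where the discrete colour variable changes value.
p <- ggplot(toy, aes(x = A, y = B)) +
  geom_path(aes(colour = first24h, group = 1)) +
  geom_point()

# Number of distinct groups used by the path layer (layer 1)
n_groups <- length(unique(layer_data(p, 1)$group))
```

This only works for solid lines; geom_path requires constant colour along a dashed or dotted line.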
Your second question about an interpolation can easily be solved with the zoo package:
library(zoo)
full.time = seq(df$datetime[1], tail(df$datetime, 1), by=600) # new timeline with point at every 10 min
d.zoo = zoo(df[,2:3], df$datetime) # convert to zoo object
d.full = as.data.frame(na.approx(d.zoo, xout=full.time)) # interpolate; result is also a zoo object
d.full$datetime = as.POSIXct(rownames(d.full))
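If you would rather not depend on zoo, the same idea can be sketched in base R with approx(). This is a minimal illustration on hypothetical data (the obs data frame and its values are made up), and it uses plain linear interpolation on a 10-minute grid.

```r
# Hypothetical irregular observations (UTC avoids daylight-saving gaps)
obs <- data.frame(
  datetime = as.POSIXct(c("2015-01-01 01:00:00", "2015-01-01 01:30:00",
                          "2015-01-01 02:00:00"), tz = "UTC"),
  A = c(12, 15, 13)
)

# Regular grid with one point every 600 seconds (10 minutes)
grid <- seq(min(obs$datetime), max(obs$datetime), by = 600)

# approx() works on numbers, so the date-times are converted to
# seconds since the epoch before interpolating linearly
interp <- approx(x = as.numeric(obs$datetime), y = obs$A,
                 xout = as.numeric(grid))

regular <- data.frame(datetime = grid, A = interp$y)
```

Each additional column (B, for instance) would need its own approx() call on the same grid.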
With these two dataframes combined, you get the solution. Every F-F section is drawn in a separate plot, and only the points within 24h after the F-point are shown.
df %>%
select(Type, datetime) %>%
right_join(d.full, by="datetime") %>%
mutate(Type = ifelse(is.na(Type),"",Type)) %>%
mutate(nf = cumsum(Type=="F")) %>%
group_by(nf) %>%
mutate(first24h = (datetime-min(datetime)) < (24*3600)) %>%
filter(first24h) %>% # keep only the first 24h of each F-group
mutate(lbl=paste0(row_number(),"-",Type)) %>%
ggplot(aes(x=A, y=B, label=Type)) +
geom_path() + geom_text() + facet_wrap(~ nf)

Related

Time series as factor with equidistant ticks

Having a dataframe (df) containing a time series for a single variable (X):
X time
1 6.905551 14-01-2021 14:53
2 6.852534 27-01-2021 18:24
3 7.030995 23-01-2021 11:11
4 7.083345 23-01-2021 01:19
5 7.003437 28-01-2021 01:07
6 7.040500 14-01-2021 23:34
7 6.940566 14-01-2021 13:42
8 6.989434 22-01-2021 18:37
9 7.032720 22-01-2021 17:50
10 7.001651 23-01-2021 19:05
I am using the time as a factor to create a plot displaying points in an equidistant manner, for which I require a conversion from the original timestamp e.g. "2021-01-14 12:07:53 CET" to 14-01-2021 12:07.
This is done by factor(format(timestamp, "%d-%m-%Y %H:%M")).
Now for the plotting I use ggplot2:
ggplot(aes(x = time, y = X, group=1), data=df) +
geom_line(linetype="dotted") + geom_point() + theme_linedraw() +
theme(axis.text.x = element_text(angle = -40)) +
scale_x_discrete(breaks=df$time[seq(1,length(df$time),by=4)], name="Date")
As indicated, I want to change the tick frequency of the x axis to avoid overlap. Ideally, ticks would also be placed equidistantly per day, e.g. 14-01-2021, 22-01-2021 and so on. With scale_x_discrete, I am able to place a tick at every nth factor level, but the result looks like this (which is to be expected):
I have also looked into using the dates directly via as.Date(timestamp) and, for the scaling, e.g. scale_x_date(date_breaks = "4 days"). This obviously yields the correct equidistant tick spacing, but the plot itself ends up stacking values for the same date and thus contains gaps.
EDIT
@Jon Springs' answer works well if there are no duplicate times caused by multiple observations. With duplicates, however, the result is the following when facet_grid is used to split by the grouping variable.
In this case the df looks like this (with grouper being the variable used for facet_wrap):
X time grouper
1 6.905551 14-01-2021 14:53 red
2 6.905551 14-01-2021 14:53 green
3 6.852534 27-01-2021 18:24 red
4 6.852534 27-01-2021 18:24 green
5 7.030995 23-01-2021 11:11 red
6 7.030995 23-01-2021 11:11 green
set.seed(0)
library(dplyr)
library(ggplot2)
my_data <- tibble(X = rnorm(10),
time_delay = runif(10, 1, 1000)) %>%
mutate(time = as.POSIXct("2021-01-14") + cumsum(time_delay)*1E5) %>%
# Label every other NEW time
arrange(time) %>%
mutate(label = if_else(
cumsum(time != lag(time, default = as.POSIXct("2000-01-01"))) %% 2 < 1,
format(time, "%d-%m-%Y\n%H:%M"),
"")
)
my_data
ggplot(my_data, aes(x = time %>% as.factor,
y = X, group = 1)) +
geom_line() +
scale_x_discrete(labels = my_data$label)
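A variation on the same idea, sketched on hypothetical data: since the question asks for roughly one tick per day, the labels can be built from the factor levels so that only the first timestamp of each calendar day is labelled. Building the labels from the levels rather than from the rows should also cope with duplicated timestamps. The names obs, time_f and day_labels are made up for this illustration.

```r
library(ggplot2)

# Hypothetical observations, several of them on the same day (UTC for reproducibility)
obs <- data.frame(
  time = as.POSIXct(c("2021-01-14 13:42:00", "2021-01-14 14:53:00",
                      "2021-01-22 17:50:00", "2021-01-23 01:19:00",
                      "2021-01-23 11:11:00"), tz = "UTC"),
  X = c(6.94, 6.91, 7.03, 7.08, 7.03)
)

# One factor level per distinct timestamp, in chronological order
lev <- sort(unique(obs$time))
lev_chr <- format(lev, "%Y-%m-%d %H:%M:%S")
obs$time_f <- factor(format(obs$time, "%Y-%m-%d %H:%M:%S"), levels = lev_chr)

# Label the first level of each calendar day and blank out the others
day <- format(lev, "%d-%m-%Y")
day_labels <- ifelse(duplicated(day), "", day)

p <- ggplot(obs, aes(x = time_f, y = X, group = 1)) +
  geom_line(linetype = "dotted") + geom_point() +
  scale_x_discrete(labels = day_labels, name = "Date")
```

The points stay equidistant because the x axis is still discrete; only the labels change.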

Normalize time data between datasets in R

I have 2 datasets that I want to overlay on a single plot, but they were recorded 30 minutes apart, so the curves do not overlap. I want the 2 graphs to lie on top of each other: the relative evolution through time is important, not the absolute time at which the data was taken, which is what the graphs show right now.
How do I do this? Here is how my dataframes are built.
Time Raw_Touchpad0_Rx1_Tx2
2020-11-03 14:50:00 2702
2020-11-03 14:50:01 2704
Here is my code
X3500um_15_30_45_tx2rx1 <- data.frame(Time = c(2020-11-03 14:50:00, 2020-11-03 14:50:01), Raw_Touchpad0_Rx1_Tx2 = c(2702, 2704))
X15_30_45_rx2tx2 <- data.frame(Time = c(2020-11-03 15:20:00, 2020-11-03 15:20:01), Raw_Touchpad0_Rx1_Tx2 = c(2782, 27804))
ggplot(X3500um_15_30_45_tx2rx1, aes(as.numeric(Time), Raw_Touchpad0_Rx1_Tx2)) +
geom_line(aes(colour = Raw_Touchpad0_Rx1_Tx2)) +
geom_line(data = X15_30_45_rx2tx2, aes(colour = Raw_Touchpad0_Rx1_Tx2))
I want both plots to start at 0 time and evolve to 1 sec, 2 sec, etc instead of 14:50 vs 15:20
Thanks
A possible solution to set both lines to start at 0 secs would be (example on one data.frame but applies to the second the same way):
# one of your test data.frames (note that I included ")
X3500um_15_30_45_tx2rx1 <- data.frame(Time = c("2020-11-03 14:50:00", "2020-11-03 14:50:01"), Raw_Touchpad0_Rx1_Tx2 = c(2702, 2704))
library(dplyr)
library(lubridate)
# The calculation to get a new column of the difference from minimum timestamp
X3500um_15_30_45_tx2rx1 %>%
dplyr::mutate(Time = lubridate::as_datetime(Time)) %>%
dplyr::mutate(DIF = Time - min(Time))
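To show how both runs could then be overlaid, here is a minimal base R sketch that applies the same "difference from the minimum timestamp" idea to two data frames and stacks them. The names run1, run2, Elapsed and Run are made up, and the second run's values are placeholders rather than real measurements.

```r
library(ggplot2)

# Two short runs recorded 30 minutes apart (second run's values are hypothetical)
run1 <- data.frame(
  Time = as.POSIXct(c("2020-11-03 14:50:00", "2020-11-03 14:50:01"), tz = "UTC"),
  Raw_Touchpad0_Rx1_Tx2 = c(2702, 2704)
)
run2 <- data.frame(
  Time = as.POSIXct(c("2020-11-03 15:20:00", "2020-11-03 15:20:01"), tz = "UTC"),
  Raw_Touchpad0_Rx1_Tx2 = c(2782, 2784)
)

# Seconds elapsed since the first sample of each run
run1$Elapsed <- as.numeric(difftime(run1$Time, min(run1$Time), units = "secs"))
run2$Elapsed <- as.numeric(difftime(run2$Time, min(run2$Time), units = "secs"))

# Tag each run and stack them so one ggplot call draws both lines
run1$Run <- "14:50 run"
run2$Run <- "15:20 run"
both <- rbind(run1, run2)

p <- ggplot(both, aes(Elapsed, Raw_Touchpad0_Rx1_Tx2, colour = Run)) +
  geom_line()
```

Stating units = "secs" explicitly keeps the axis in seconds even when a run is long enough for difftime to pick minutes or hours on its own.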

Having trouble correctly producing time series plot

I am trying to plot a time series from an excel file in R Studio. It has a single column named 'Dates'. This column contains datetime data of customer visits in the form 2/15/2014 6:17:22 AM. The datetime was originally in char format and I converted it into a Large POSIXct value using lubridate:
tsData <- mdy_hms(fullUsage$Dates)
Which gives me a value:
POSIXct[1:25,354], format: "2018-04-13 10:18:14" "2018-04-14 13:27:11" .....
I then tried converting it into a time series object using the code below:
require(xts)
visitTimes.ts <- xts(tsData, start = 1, order.by=as.POSIXct(tsData))
plot(visitTimes.ts)
ts_plot(visitTimes.ts)
ts_info(visitTimes.ts)
I'm not 100% sure, but it looks like the plot is drawn from the summed count of visits. I believe my problem may be in correctly indexing my data by date. I apologize in advance if this is a simple issue; I am still learning R. I have included a screenshot of my plot.
Yes, you are right: you need to provide both the date column (x axis) and the value (y axis).
Here's a simple example:
library(lubridate)
library(xts)
v1 <- data.frame(Date = mdy_hms(c("1-1-2020-00-00-00", "1-2-2020-00-00-00", "1-3-2020-00-00-00")), Value = c(1, 3, 6))
v2 <- xts(v1["Value"], order.by = v1[, "Date"])
plot(v2)
The first argument of xts() takes the values, and order.by takes the corresponding date-times, which become the index of the series.
You need to count the number of events in each time period and plot these values on the y axis. You didn't provide enough data for a reproducible example, so I have created a small example. We'll use the tidyverse packages dplyr and lubridate to help us out here:
library(lubridate)
library(dplyr)
library(ggplot2)
set.seed(69)
fullUsage <- data.frame(Dates = as.POSIXct("2020-01-01") +
minutes(round(cumsum(rexp(10000, 1/25))))
)
head(fullUsage)
#> Dates
#> 1 2020-01-01 00:02:00
#> 2 2020-01-01 00:15:00
#> 3 2020-01-01 00:22:00
#> 4 2020-01-01 00:29:00
#> 5 2020-01-01 01:13:00
#> 6 2020-01-01 01:27:00
First of all, we will create columns that show the hour of day and the month that events occurred:
fullUsage$hours <- hour(fullUsage$Dates)
fullUsage$month <- floor_date(fullUsage$Dates, "month")
Now we can effectively just count the number of events per month and plot this number for each month:
fullUsage %>%
group_by(month) %>%
summarise(n = length(hours)) %>%
ggplot(aes(month, n)) +
geom_col()
And we can do the same for the hour of day:
fullUsage %>%
group_by(hours) %>%
summarise(n = length(hours)) %>%
ggplot(aes(hours, n)) +
geom_col() +
scale_x_continuous(breaks = 0:23) +
labs(x = "Hour of day")
Created on 2020-08-05 by the reprex package (v0.3.0)

ggplot2 identical scales (non-continuous) on both sides

Goal
Use ggplot2 (latest version) to produce a graph that duplicates the x- or y-axis on both sides of the plot, where the scale is not continuous.
Minimal Reprex
# Example data
dat1 <- tibble::tibble(x = c(rep("a", 50), rep("b", 50)),
y = runif(100))
# Standard boxplot
p1 <- ggplot2::ggplot(dat1) +
ggplot2::geom_boxplot(ggplot2::aes(x = x, y = y))
When the scale is continuous, this is easy to do with an identity transformation (clearly one-to-one).
# This works
p1 + ggplot2::scale_y_continuous(sec.axis = ggplot2::sec_axis(~ .))
However, when the scale is not continuous, this doesn't work, as other scale_* functions don't have a sec.axis argument (which makes sense).
# This doesn't work
p1 + ggplot2::scale_x_discrete(sec.axis = ggplot2::sec_axis(~ .))
Error in discrete_scale(c("x", "xmin", "xmax", "xend"), "position_d", :
unused argument (sec.axis = <environment>)
I also tried using the position argument in the scale_* functions, but this doesn't work either.
# This doesn't work either
p1 + ggplot2::scale_x_discrete(position = c("top", "bottom"))
Error in match.arg(position, c("left", "right", "top", "bottom")) :
'arg' must be of length 1
Edit
For clarity, I was hoping to duplicate the x- or y-axis where the scale is anything, not just discrete (a factor variable). I just used a discrete variable in the minimal reprex for simplicity.
For example, this issue arises in a context where the non-continuous scale is datetime or time format.
Duplicating (and modifying) discrete axis in ggplot2
You can adapt this answer by just putting the same labels on both sides. As far as "you can convert anything non-continuous to a factor, but that's even more inelegant!" from your comment above, that's what a non-continuous axis is, so I'm not sure why that would be a problem for you.
TL;DR: Use as.numeric(...) for your categorical aesthetic and manually supply the labels from the original data, using scale_*_continuous(..., sec_axis(~., ...)).
Edited to update:
I happened to look back through this thread and see that it was asked for dates and times. This makes the question worded incorrectly: dates and times are continuous not discrete. Discrete scales are factors. Dates and times are ordered continuous scales. Under the hood, they're just either the days or the seconds since "1970-01-01".
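As a quick check of that "under the hood" claim, base R will show the stored numbers directly. This is a small illustration only; the variable names are made up, and the time zone is set to UTC explicitly so the result does not depend on the local setting.

```r
# A Date is stored as the number of days since 1970-01-01
d_num <- as.numeric(as.Date("2017-08-01"))
d_num
#> [1] 17379

# A POSIXct date-time is stored as the number of seconds since 1970-01-01 00:00:00 UTC
t_num <- as.numeric(as.POSIXct("2017-08-02 00:00:00", tz = "UTC"))
t_num
#> [1] 1501632000

# Going back from the number to a Date requires the origin
d_back <- as.Date(d_num, origin = "1970-01-01")
```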
scale_x_date will indeed throw an error if you try to pass a sec.axis argument, even if it's dup_axis. To work around this, you convert your dates/times to a number, and then fool your scales using labels. While this requires a bit of fiddling, it's not too complicated.
library(lubridate)
library(dplyr)
library(ggplot2)
df <- tibble(tm = ymd("2017-08-01") + 0:10,
y = cumsum(rnorm(length(tm)))) %>%
mutate(tm_num = as.numeric(tm))
df
# A tibble: 11 x 3
tm y tm_num
<date> <dbl> <dbl>
1 2017-08-01 -2.0948146 17379
2 2017-08-02 -2.6020691 17380
3 2017-08-03 -3.8940781 17381
4 2017-08-04 -2.7807154 17382
5 2017-08-05 -2.9451685 17383
6 2017-08-06 -3.3355426 17384
7 2017-08-07 -1.9664428 17385
8 2017-08-08 -0.8501699 17386
9 2017-08-09 -1.7481911 17387
10 2017-08-10 -1.3203246 17388
11 2017-08-11 -2.5487692 17389
I just made a simple vector of 11 days (0 to 10) added to "2017-08-01". If you run as.numeric on that, you get the number of days since the beginning of the Unix epoch. (see ?lubridate::as_date).
df %>%
ggplot(aes(tm_num, y)) + geom_line() +
scale_x_continuous(sec.axis = dup_axis(),
breaks = function(limits) {
seq(floor(limits[1]), ceiling(limits[2]),
by = as.numeric(as_date(days(2))))
},
labels = function(breaks) {as_date(breaks)})
When you plot tm_num against y, it's treated just like normal numbers, and you can use scale_x_continuous(sec.axis = dup_axis(), ...). Then you have to figure out how many breaks you want and how to label them.
The breaks = is a function that takes the limits of the data, and calculates nice looking breaks. First you round the limits, to make sure you get integers (dates don't work well with non-integers). Then you generate a sequence of your desired width (the days(2)). You could use weeks(1) or months(3) or whatever, check out ?lubridate::days. Under the hood, days(x) generates a number of seconds (86400 per day, 604800 per week, etc.), as_date converts that into a number of days since the Unix epoch, and as.numeric converts it back to an integer.
The labels = is a function that takes the sequence of integers we just generated and converts those back to displayable dates.
This also works with times instead of dates. While dates are integer days, times are integer seconds (either since the Unix epoch, for datetimes, or since midnight, for times).
Let's say you had some observations that were on the scale of minutes, not days.
The code would be similar, with a few tweaks:
df <- tibble(tm = ymd_hms("2017-08-01 23:58:00") + 60*0:10,
y = cumsum(rnorm(length(tm)))) %>%
mutate(tm_num = as.numeric(tm))
df
# A tibble: 11 x 3
tm y tm_num
<dttm> <dbl> <dbl>
1 2017-08-01 23:58:00 1.375275 1501631880
2 2017-08-01 23:59:00 2.373565 1501631940
3 2017-08-02 00:00:00 3.650167 1501632000
4 2017-08-02 00:01:00 2.578420 1501632060
5 2017-08-02 00:02:00 5.155688 1501632120
6 2017-08-02 00:03:00 4.022228 1501632180
7 2017-08-02 00:04:00 4.776145 1501632240
8 2017-08-02 00:05:00 4.917420 1501632300
9 2017-08-02 00:06:00 4.513710 1501632360
10 2017-08-02 00:07:00 4.134294 1501632420
11 2017-08-02 00:08:00 3.142898 1501632480
df %>%
ggplot(aes(tm_num, y)) + geom_line() +
scale_x_continuous(sec.axis = dup_axis(),
breaks = function(limits) {
seq(floor(limits[1] / 60) * 60, ceiling(limits[2] / 60) * 60,
by = as.numeric(as_datetime(minutes(2))))
},
labels = function(breaks) {
stamp("Jan 1,\n0:00:00", orders = "md hms")(as_datetime(breaks))
})
Here I updated the dummy data to span 11 minutes from just before midnight to just after midnight. In breaks = I modified it to make sure I got an integer number of minutes to create breaks on, changed as_date to as_datetime, and used minutes(2) to make a break every two minutes. In labels = I added a functional stamp(...)(...), which creates a nice format to display.
Finally just times.
df <- tibble(tm = milliseconds(1234567 + 0:10),
y = cumsum(rnorm(length(tm)))) %>%
mutate(tm_num = as.numeric(tm))
df
# A tibble: 11 x 3
tm y tm_num
<S4: Period> <dbl> <dbl>
1 1234.567S 0.2136745 1234.567
2 1234.568S -0.6376908 1234.568
3 1234.569S -1.1080997 1234.569
4 1234.57S -0.4219645 1234.570
5 1234.571S -2.7579118 1234.571
6 1234.572S -1.6626674 1234.572
7 1234.573S -3.2298175 1234.573
8 1234.574S -3.2078864 1234.574
9 1234.575S -3.3982454 1234.575
10 1234.576S -2.1051759 1234.576
11 1234.577S -1.9163266 1234.577
df %>%
ggplot(aes(tm_num, y)) + geom_line() +
scale_x_continuous(sec.axis = dup_axis(),
breaks = function(limits) {
seq(limits[1], limits[2],
by = as.numeric(milliseconds(3)))
},
labels = function(breaks) {format((as_datetime(breaks)),
format = "%H:%M:%OS3")})
Here we've got an observation every millisecond for 11 milliseconds starting at t = 20min34.567sec. So in breaks = we dispense with any rounding, since we don't want integers now. Then we use breaks every milliseconds(3). Then labels = needs to be formatted to accept decimal seconds; the "%OS3" means 3 digits of decimals for the seconds place (can accept up to 6, see ?strptime).
Is all of this worth it? Probably not, unless you really really want a duplicated time axis. I'll probably post this as an issue on the ggplot2 GitHub, because dup_axis should "just work" with datetimes.
Option 1: This is not very elegant but it works using the cowplot::align_plots function:
library(cowplot)
library(ggplot2)
dat1 <- tibble::tibble(x = c(rep("a", 50), rep("b", 50)),
y = runif(100))
p <- ggplot2::ggplot(dat1) +
ggplot2::geom_boxplot(ggplot2::aes(x = x, y = y))
p <- p + ggplot2::scale_y_continuous(sec.axis = ggplot2::sec_axis(~ .))
p1 <- p + scale_x_discrete(position = c( "bottom"))
p2 <- p + scale_x_discrete(position = c( "top"))
plots <- align_plots(p1, p2, align="hv")
ggdraw() + draw_grob(plots[[1]]) + draw_grob(plots[[2]])
Option 2:
library(forcats)
dat1$num <- as.numeric(fct_recode(dat1$x, "1" = "a", "2" = "b"))
ggplot2::ggplot(dat1, aes(x = num, y = y, group = num)) +
geom_boxplot()+
ggplot2::scale_y_continuous(sec.axis = ggplot2::sec_axis(~ .)) +
scale_x_continuous(position = c("top"), breaks = c(1,2), labels = c("a", "b"),
sec.axis = ggplot2::sec_axis(~ .,breaks = c(1,2), labels = c("a", "b")))
Note: an answer to similar problem was posted [here] using the cowplot package (Duplicating Discrete Axis in ggplot2), but it didn't work for me. The cowplot::switch_axis_position() function has been deprecated.

R ggplot by month and values group by Week

With ggplot2, I would like to create a multiplot (facet_grid) where each plot is the weekly count values for the month.
My data are like this :
day_group count
1 2012-04-29 140
2 2012-05-06 12595
3 2012-05-13 12506
4 2012-05-20 14857
For this dataset I have created two other columns, Month and Week, based on day_group:
day_group count Month Week
1 2012-04-29 140 Apr 17
2 2012-05-06 12595 May 18
3 2012-05-13 12506 May 19
4 2012-05-20 14857 May 2
Now I would like for each Month to create a barplot where I have the sum of the count values aggregated by week. So for example for a year I would have 12 plots with 4 bars (one per week).
Below is what I use to generate the plot :
ggplot(data = count_by_day, aes(x=day_group, y=count)) +
stat_summary(fun.y="sum", geom = "bar") +
scale_x_date(date_breaks = "1 month", date_labels = "%B") +
facet_grid(facets = Month ~ ., scales="free", margins = FALSE)
So far, my plot looks like this
https://dl.dropboxusercontent.com/u/96280295/Rplot.png
As you can see, the x axis is not what I'm looking for. Instead of showing only weeks 1, 2, 3 and 4, it displays the whole month.
Do you know what I must change to get what I'm looking for?
Thanks for your help
Okay, now that I see what you want, I wrote a small program to illustrate it. The key to your order of month problem is making month a factor with the levels in the right order:
library(dplyr)
library(ggplot2)
#initialization
set.seed(1234)
sday <- as.Date("2012-01-01")
eday <- as.Date("2012-07-31")
# List of the first day of the months
mfdays <- seq(sday,length.out=12,by="1 month")
# list of months - this is key to keeping the order straight
mlabs <- months(mfdays)
# list of first weeks of the months
mfweek <- trunc((mfdays-sday)/7)
names(mfweek) <- mlabs
# Generate a bunch of event-days, and then months, then week numbs in our range
n <- 1000
edf <-data.frame(date=sample(seq(sday,eday,by=1),n,T))
edf$month <- factor(months(edf$date),levels=mlabs) # use the factor in the right order
edf$week <- 1 + as.integer(((edf$date-sday)/7) - mfweek[edf$month])
# Now summarize with dplyr
ndf <- group_by(edf,month,week) %>% summarize( count = n() )
ggplot(ndf) + geom_bar(aes(x=week,y=count),stat="identity") + facet_wrap(~month,nrow=1)
Yielding:
(As an aside, I am kind of proud I did this without lubridate ...)
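The point about factor levels can be seen in isolation with a tiny sketch. It uses the built-in month.name vector instead of months() so that the result does not depend on the locale; the dates and variable names are made up for illustration.

```r
# Three hypothetical dates, deliberately out of calendar order
d <- as.Date(c("2012-04-29", "2012-01-15", "2012-05-06"))

# English month names looked up by month number (locale-independent)
m <- month.name[as.integer(format(d, "%m"))]

# Default factor levels are sorted alphabetically, so April comes before January
default_levels <- levels(factor(m))

# Explicit levels keep calendar order, which facet_wrap() then follows
ordered_month <- factor(m, levels = month.name)
month_codes <- as.integer(ordered_month)
```

With the default levels the facets would appear as April, January, May; with explicit levels they follow the calendar.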
I think you have to do this but I am not sure I understand your question:
ggplot(data = count_by_day, aes(x=Week, y=count, group=Month, color=Month)) +
geom_line() # a geom layer is needed to draw anything; geom_line() is an assumption
