Normalize time data between datasets in R

I have two datasets that I want to overlay on a single plot, but the data were recorded at different times (the two sets are about 30 minutes apart), so the series do not overlap. I want the two curves on top of each other: the relative evolution through time matters, not the absolute time at which the data were taken, which is what the graphs currently show.
How do I do this? Here is how my dataframes are built.
Time Raw_Touchpad0_Rx1_Tx2
2020-11-03 14:50:00 2702
2020-11-03 14:50:01 2704
Here is my code
X3500um_15_30_45_tx2rx1 <- data.frame(Time = c(2020-11-03 14:50:00, 2020-11-03 14:50:01), Raw_Touchpad0_Rx1_Tx2 = c(2702, 2704))
X15_30_45_rx2tx2 <- data.frame(Time = c(2020-11-03 15:20:00, 2020-11-03 15:20:01), Raw_Touchpad0_Rx1_Tx2 = c(2782, 27804))
ggplot(X3500um_15_30_45_tx2rx1, aes(as.numeric(Time), Raw_Touchpad0_Rx1_Tx2)) +
  geom_line(aes(colour = Raw_Touchpad0_Rx1_Tx2)) +
  geom_line(data = X15_30_45_rx2tx2, aes(colour = Raw_Touchpad0_Rx1_Tx2))
I want both plots to start at time 0 and progress to 1 s, 2 s, etc., instead of starting at 14:50 vs 15:20.
Thanks

A possible solution to make both lines start at 0 seconds would be the following (shown on one data frame, but it applies to the second in the same way):
# one of your test data frames (note that I added the missing quotes around the timestamps)
X3500um_15_30_45_tx2rx1 <- data.frame(Time = c("2020-11-03 14:50:00", "2020-11-03 14:50:01"), Raw_Touchpad0_Rx1_Tx2 = c(2702, 2704))
library(dplyr)
library(lubridate)
# The calculation to get a new column of the difference from minimum timestamp
X3500um_15_30_45_tx2rx1 %>%
  dplyr::mutate(Time = lubridate::as_datetime(Time)) %>%
  dplyr::mutate(DIF = Time - min(Time))
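For completeness, here is a minimal sketch of how the two series could then be overlaid on the elapsed time; the helper add_elapsed() is hypothetical (not part of the solution above), and ggplot2 is assumed to be loaded:
library(ggplot2)
# hypothetical helper: parse Time and add elapsed seconds since each data frame's first sample
add_elapsed <- function(df) {
  df %>%
    dplyr::mutate(Time = lubridate::as_datetime(Time)) %>%
    dplyr::mutate(DIF = as.numeric(Time - min(Time), units = "secs"))
}
ggplot(add_elapsed(X3500um_15_30_45_tx2rx1), aes(DIF, Raw_Touchpad0_Rx1_Tx2)) +
  geom_line(aes(colour = "tx2rx1")) +
  geom_line(data = add_elapsed(X15_30_45_rx2tx2), aes(colour = "rx2tx2")) +
  labs(x = "Elapsed time (s)", colour = "Dataset")
Both lines now start at 0 seconds, so the curves sit on top of each other regardless of when each recording began.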

Related

Having trouble correctly producing time series plot

I am trying to plot a time series from an Excel file in RStudio. It has a single column named 'Dates', which contains datetime data of customer visits in the form 2/15/2014 6:17:22 AM. The datetimes were originally in character format and I converted them to POSIXct using lubridate:
tsData <- mdy_hms(fullUsage$Dates)
Which gives me a value:
POSIXct[1:25,354], format: "2018-04-13 10:18:14" "2018-04-14 13:27:11" .....
I then tried converting it into a time series object using the code below:
require(xts)
visitTimes.ts <- xts(tsData, start = 1, order.by=as.POSIXct(tsData))
plot(visitTimes.ts)
ts_plot(visitTimes.ts)
ts_info(visitTimes.ts)
I'm not 100% sure, but it looks like the plot is coming out as a cumulative count of visits. I believe my problem may be in correctly indexing my data by the dates. I apologize in advance if this is a simple issue; I am still learning R. I have included the screenshot of my plot.
Yes, you are right: you need to provide both the date column (x axis) and the values (y axis).
Here's a simple example:
library(lubridate)
library(xts)
v1 <- data.frame(Date = mdy_hms(c("1-1-2020-00-00-00", "1-2-2020-00-00-00", "1-3-2020-00-00-00")),
                 Value = c(1, 3, 6))
v2 <- xts(v1["Value"], order.by = v1[, "Date"])
plot(v2)
The first argument of xts() takes the data values; order.by takes the time index used to order them.
You need to count the number of events in each time period and plot these values on the y axis. You didn't provide enough data for a reproducible example, so I have created a small example. We'll use the tidyverse packages dplyr and lubridate to help us out here:
library(lubridate)
library(dplyr)
library(ggplot2)
set.seed(69)
fullUsage <- data.frame(
  Dates = as.POSIXct("2020-01-01") +
    minutes(round(cumsum(rexp(10000, 1/25))))
)
head(fullUsage)
#> Dates
#> 1 2020-01-01 00:02:00
#> 2 2020-01-01 00:15:00
#> 3 2020-01-01 00:22:00
#> 4 2020-01-01 00:29:00
#> 5 2020-01-01 01:13:00
#> 6 2020-01-01 01:27:00
First of all, we will create columns that show the hour of day and the month that events occurred:
fullUsage$hours <- hour(fullUsage$Dates)
fullUsage$month <- floor_date(fullUsage$Dates, "month")
Now we can effectively just count the number of events per month and plot this number for each month:
fullUsage %>%
  group_by(month) %>%
  summarise(n = length(hours)) %>%
  ggplot(aes(month, n)) +
  geom_col()
And we can do the same for the hour of day:
fullUsage %>%
  group_by(hours) %>%
  summarise(n = length(hours)) %>%
  ggplot(aes(hours, n)) +
  geom_col() +
  scale_x_continuous(breaks = 0:23) +
  labs(x = "Hour of day")
Created on 2020-08-05 by the reprex package (v0.3.0)

R filtering/selecting data by POSIXct time and a condition

I have measured temperature at a high time resolution of 10 minutes on different urban tree species whose reactions are to be compared, so I am particularly researching periods of heat. The task I fail to do on my dataset is to select complete days based on a maximum value: e.g. days with at least one measurement above 30 °C should be subsetted from my data frame in full.
Below you find a reproducible example that should illustrate my problem:
In my Measurings data frame I have calculated a column indicating whether the individual measurement is above or below 30 °C. I want to use that column to tell other functions whether to pick a day or not when producing a new data frame. Whenever the value is above 30 °C at any time of a day, I want to include that whole day, from 00:00 to 23:59, in the new data frame for further analyses.
start <- as.POSIXct("2018-05-18 00:00", tz = "CET")
tseq <- seq(from = start, length.out = 1000, by = "hours")
Measurings <- data.frame(
  Time = tseq,
  Temp = sample(20:35, 1000, replace = TRUE),
  Variable1 = sample(1:200, 1000, replace = TRUE),
  Variable2 = sample(300:800, 1000, replace = TRUE)
)
Measurings$heat30 <- ifelse(Measurings$Temp > 30,"heat", "normal")
Measurings$otheroption30 <- ifelse(Measurings$Temp > 30,"1", "0")
The example yields a data frame analogous in structure to my data:
head(Measurings)
Time Temp Variable1 Variable2 heat30 otheroption30
1 2018-05-18 00:00:00 28 56 377 normal 0
2 2018-05-18 01:00:00 23 65 408 normal 0
3 2018-05-18 02:00:00 29 78 324 normal 0
4 2018-05-18 03:00:00 24 157 432 normal 0
5 2018-05-18 04:00:00 32 129 794 heat 1
6 2018-05-18 05:00:00 25 27 574 normal 0
So how do I subset to get a new data frame in which all days with at least one entry flagged as "heat" are kept?
I know that, for example, dplyr::filter could filter the individual entries (row 5 in the head of the example), but how can I tell it to take the whole day 2018-05-18?
I am quite new to analyzing data with R, so I would appreciate any suggestions for a working solution. dplyr is what I have been using for quite a few tasks, but I am open to whatever works.
Thanks a lot, Konrad
Create a variable that specifies the day (dropping hours, minutes, etc.). Then iterate over the unique dates and keep only those subsets whose heat30 column contains "heat" at least once:
library(dplyr)
Measurings <- Measurings %>% mutate(Time2 = format(Time, "%Y-%m-%d"))
newdf <- lapply(unique(Measurings$Time2), function(x) {
  rr <- Measurings %>% filter(Time2 == x)   # subset for date x
  ss <- rr %>% pull(heat30)                 # its heat30 vector
  # keep the subset only if heat30 contains "heat" at least once
  if (any(ss == "heat")) rr else NULL
}) %>% bind_rows()
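As a hedged alternative sketch, the same day-level selection can also be written with a grouped filter (assuming dplyr is loaded):
newdf2 <- Measurings %>%
  mutate(Time2 = format(Time, "%Y-%m-%d")) %>%
  group_by(Time2) %>%
  filter(any(heat30 == "heat")) %>%   # keep whole days that contain at least one "heat" row
  ungroup()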
Below is one possible solution using the dataset provided in the question. Please note that this is not a great example as all days will probably include at least one observation marked as over 30 °C (i.e. there will be no days to filter out in this dataset but the code should do the job with the actual one).
# import packages
library(dplyr)
library(stringr)
# break the time stamp into Day and Hour
time_df <- as_data_frame(str_split(Measurings$Time, " ", simplify = T))
# name the columns
names(time_df) <- c("Day", "Hour")
# create a new measurement data frame with separate Day and Hour columns
new_measurings_df <- bind_cols(time_df, Measurings[-1])
# form the new data frame by filtering the days marked as heat
new_df <- new_measurings_df %>%
  filter(Day %in% new_measurings_df$Day[new_measurings_df$heat30 == "heat"])
To be more precise, you are creating a random sample of 1000 observations with temperatures between 20 and 35 across roughly 40 days. As a result, it is very likely that every single day will have at least one observation over 30 °C in your example. Additionally, it is always good practice to set a seed to ensure reproducibility.
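A minimal sketch of what that could look like for the example data in the question (the seed value 123 is arbitrary):
set.seed(123)  # arbitrary fixed seed so the sample() draws are reproducible
Measurings <- data.frame(
  Time = tseq,
  Temp = sample(20:35, 1000, replace = TRUE),
  Variable1 = sample(1:200, 1000, replace = TRUE),
  Variable2 = sample(300:800, 1000, replace = TRUE)
)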

splitting in samples and operating on them

I am just beginning with R and I have a beginner's question.
I have the following data frame (simplified):
Time: 00:01:00 00:02:00 00:03:00 00:04:00 ....
Flow: 2 4 5 1 ....
I would like to know the mean flow every two minutes instead of every minute. I need this for many hours of data.
I want to save those new means in a list. How can I do this using an apply function?
I assume you have continuous data without gaps, with values for Flow for every minute.
In base R we can use aggregate:
df.out <- data.frame(Time = df[seq(0, nrow(df) - 1, 2) + 1, "Time"]);
df.out$mean_2min = aggregate(
  df$Flow,
  by = list(rep(seq(1, nrow(df) / 2), each = 2)),
  FUN = mean)[, 2];
df.out;
# Time mean_2min
#1 00:01:00 3
#2 00:03:00 3
Explanation: Extract only the odd rows from df; aggregate values in column Flow by every 2 rows, and store the mean in column mean_2min.
Sample data
df <- data.frame(
  Time = c("00:01:00", "00:02:00", "00:03:00", "00:04:00"),
  Flow = c(2, 4, 5, 1))
You can create a new variable in your data by rounding your time variable down to the nearest two minutes, then use a data.table operation to calculate the mean for each two-minute bin.
To help you precisely, you will have to point out how your data is set up. If, for instance, your data is set up like this:
library(data.table)
dt = data.table(Time = c(0:3), Flow = c(2, 4, 5, 1))
Then the following would work for you:
dt[, twomin := floor(Time/2)*2]
dt[, mean(Flow), by = twomin]
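Since the original question also asks for the two-minute means in a list via an apply-style function, here is a hedged base-R sketch using the df sample data defined above (names are illustrative):
grp <- rep(seq_len(nrow(df) / 2), each = 2)        # 1,1,2,2,... one group per two-minute window
means_2min <- as.list(tapply(df$Flow, grp, mean))  # named list of two-minute mean flows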

plotting daily rainfall data using geom_step

I have some rainfall data collected continuously from which I have calculated daily totals. Here is some toy data:
Date <- c(seq(as.Date("2016-07-01"), by = "1 day", length.out = 10))
rain_mm <- c(3,6,8,12,0,0,34,23,5,1)
rain_data <- data.frame(Date, rain_mm)
I can plot this data as follows:
ggplot(rain_data, aes(Date, rain_mm)) +
  geom_bar(stat = "identity") +
  scale_x_date(date_labels = "%d")
Which gives the following:
This seems fine. It is clear how much rainfall there was on a certain day. However, it could also be interpreted that between midday of one day and midday of the next, a certain amount of rain fell, which is wrong. This is especially a problem if the graph is combined with other plots of related continuous variables over the same period.
To get round this issue I could use geom_step as follows:
library(ggplot2)
ggplot(rain_data, aes(Date, rain_mm)) +
  geom_step() +
  scale_x_date(date_labels = "%d")
Which gives:
This is a better way to display the data, and scale_x_date now appears as a continuous axis. However, it would be nice to get the area below the steps filled, but I can't seem to find a straightforward way of doing this.
Q1: How can I fill beneath the geom_step? Is it possible?
It may also be useful to convert Date into POSIXct to facilitate identical x-axes in multi-plot figures, as discussed in this SO question here.
I can do this as follows:
library(dplyr)
rain_data_POSIX <- rain_data %>% mutate(Date = as.POSIXct(Date))
Date rain_mm
1 2016-07-01 01:00:00 3
2 2016-07-02 01:00:00 6
3 2016-07-03 01:00:00 8
4 2016-07-04 01:00:00 12
5 2016-07-05 01:00:00 0
6 2016-07-06 01:00:00 0
7 2016-07-07 01:00:00 34
8 2016-07-08 01:00:00 23
9 2016-07-09 01:00:00 5
10 2016-07-10 01:00:00 1
However, this gives a time of 01:00 for each date. I would rather have 00:00. Can I change this in the as.POSIXct function call, or do I have to do it afterwards with a separate function? I think it has something to do with tz = "" but I can't figure it out.
How can I convert from class Date to POSIXct so that the time generated is 00:00?
Thanks
For your first question, you can work off this example. First, create a time-lagged version of your data:
rain_tl <- mutate( rain_data, rain_mm = lag( rain_mm ) )
Then combine this time-lagged version with the original data, and re-sort by date:
rain_all <- bind_rows( old = rain_data, new = rain_tl, .id = "source" ) %>%
  arrange( Date, source )
(Note the newly created source column is used to break ties, correctly interlacing the original data with the time-lagged version):
> head( rain_all )
source Date rain_mm
1 new 2016-07-01 NA
2 old 2016-07-01 3
3 new 2016-07-02 3
4 old 2016-07-02 6
5 new 2016-07-03 6
6 old 2016-07-03 8
You can now use the combined data frame to "fill" your steps:
ggplot(rain_data, aes(Date, rain_mm)) +
  geom_step() +
  geom_ribbon( data = rain_all, aes( ymin = 0, ymax = rain_mm ),
               fill = "tomato", alpha = 0.5 )
This produces the following plot:
For your second question, the problem is that as.POSIXct does not pass additional arguments on to the converter, so specifying the tz argument does nothing.
You basically have two options:
1) Reformat the output to what you want: format( as.POSIXct( Date ), "%F 00:00" ), which returns a vector of type character. If you want to preserve the object type as POSIXct, you can instead...
2) Cast your Date vector to character prior to passing it to as.POSIXct: as.POSIXct( as.character(Date) ), but this will leave off the time entirely, which may be what you want anyway.
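Applied to the example data, a minimal sketch of the two options (dplyr assumed to be loaded):
# Option 1: character output with the time forced to 00:00
rain_data %>% mutate(Date = format(as.POSIXct(Date), "%F 00:00"))
# Option 2: POSIXct output; going through character first makes midnight the implied time
rain_data %>% mutate(Date = as.POSIXct(as.character(Date)))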
If you would like to avoid the hack, you can customize the position in the geom_bar expression.
I found good results with:
ggplot(rain_data, aes(Date, rain_mm)) +
  geom_bar(stat = "identity", position = position_nudge(x = 0.51), width = 0.99) +
  scale_x_date(date_labels = "%d")

Extracting a point from ggplot and plotting it

I initially have the dataset shown below:
ID A B Type Time Date
1 12 13 R 23:20 1-1-01
1 13 12 F 23:40 1-1-01
1 13 11 F 00:00 2-1-01
1 15 10 R 00:20 2-1-01
1 12 06 W 00:40 2-1-01
1 11 09 F 01:00 2-1-01
1 12 10 R 01:20 2-1-01
so on...
I tried to make the ggplot of the above dataset for A and B.
ggplot(data = dataframe, aes(x = A, y = B, colour = Type)) + geom_point() + geom_path()
Problem:
How do I add a subsetting variable that looks at the first 24 hours after every 'F' point?
For the time being I have posted a dataset that is continuous with respect to time, but my original dataset is not continuous. How can I make my dataset continuous at an interval of 10 minutes? I have used the xspline() interpolation function on A and B, but I don't know how to make my dataset continuous with respect to time.
The highlighted part shown below is what I am looking for; I want to extract this dataset and then plot a new ggplot.
From MarkusN's plots, this is what I am looking for:
Taking the first 'F' point and travelling 24 hours from it (since there is no full 24-hour dataset available here, it should produce something like this):
I've tried the following; maybe you can get an idea from it. I recommend first having a variable with the time ordered (either in minutes or hours; in this example I've used hours). Let's see if it helps.
# a data set is built as an example
N = 100
set.seed(1)
dataframe = data.frame(A = cumsum(rnorm(N)),
                       B = cumsum(rnorm(N)),
                       Type = sample(c('R','F','W'), size = N,
                                     prob = c(5/7, 1/7, 1/7), replace = T),
                       time.h = seq(0, 240, length.out = N))
# here, a list of data frames is built with the sequences
l_dfs = lapply(which(dataframe$Type == 'F'), function(i, .data) {
  transform(subset(.data[i:nrow(.data), ], (time.h - time.h[1]) <= 24),
            t0 = sprintf('t0=%4.2f', time.h[1]))
}, dataframe)
ggplot(data = do.call('rbind', l_dfs), aes(x = A, y = B, colour = Type)) +
  geom_point() + geom_path(colour = 'black') + facet_wrap(~t0)
First I created sample data. Hope it's similar to your problem:
df = data.frame(id = rep(1:9), A = c(12,13,13,14,12,11,12,11,10),
                B = c(13,12,10,12,6,9,10,11,12),
                Type = c("F","R","F","R","W","F","R","F","R"),
                datetime = as.POSIXct(c("2015-01-01 01:00:00","2015-01-01 22:50:00",
                                        "2015-01-02 08:30:00","2015-01-02 23:00:00",
                                        "2015-01-03 14:10:00","2015-01-05 16:30:00",
                                        "2015-01-05 23:00:00","2015-01-06 17:00:00",
                                        "2015-01-07 23:00:00")),
                stringsAsFactors = F)
Your first question is to plot the data, highlighting the first 24 h after an F-point. I used dplyr and ggplot2 for this task.
library(dplyr)
library(ggplot2)
df %>%
  mutate(nf = cumsum(Type == "F")) %>%   # build F-to-F groups
  group_by(nf) %>%
  mutate(first24h = as.numeric((datetime - min(datetime)) < (24*3600))) %>%   # find the first 24h of each F-group
  mutate(lbl = paste0(row_number(), "-", Type)) %>%
  ggplot(aes(x = A, y = B, label = lbl)) +
  geom_path(aes(colour = first24h)) + scale_size(range = c(1, 2)) +
  geom_text()
The problem here is that the colour only changes at some points. One thing I'm not happy with is the use of different line colours for path sections: if first24h is a discrete variable, geom_path draws two separate paths. That's why I defined the variable as numeric. Maybe someone can improve this?
Your second question about an interpolation can easily be solved with the zoo package:
library(zoo)
full.time = seq(df$datetime[1], tail(df$datetime, 1), by=600) # new timeline with point at every 10 min
d.zoo = zoo(df[,2:3], df$datetime) # convert to zoo object
d.full = as.data.frame(na.approx(d.zoo, xout=full.time)) # interpolate (na.approx returns a zoo object), then convert to a data frame
d.full$datetime = as.POSIXct(rownames(d.full))
With these two data frames combined, you get the solution. Every F-to-F section is drawn in a separate plot, and only the points no more than 24 h after the F-point are shown.
df %>%
  select(Type, datetime) %>%
  right_join(d.full, by = "datetime") %>%
  mutate(Type = ifelse(is.na(Type), "", Type)) %>%
  mutate(nf = cumsum(Type == "F")) %>%
  group_by(nf) %>%
  mutate(first24h = (datetime - min(datetime)) < (24*3600)) %>%
  filter(first24h == TRUE) %>%
  mutate(lbl = paste0(row_number(), "-", Type)) %>%
  filter(first24h == 1) %>%
  ggplot(aes(x = A, y = B, label = Type)) +
  geom_path() + geom_text() + facet_wrap(~ nf)
