Remove incomplete days / retain complete days in R

I have data from field instruments where values for 7 different parameters are measured and recorded every 15 minutes. The data set extends over many years. Sometimes the instruments fail or are taken off-line for preventive maintenance, giving incomplete days in the record. In post-processing the data, I would like to remove those incomplete days (or, stated alternatively, retain only the complete days).
An abbreviated example of what the data might look like:
Date Temp
2012-02-01 00:01:00 18.5
2012-02-01 00:16:00 18.4
2012-02-01 00:31:00 18.6
.
.
.
2012-02-01 23:31:00 19.0
2012-02-01 23:46:00 18.9
2012-02-02 00:01:00 19.0
2012-02-02 00:16:00 19.0
2012-02-03 00:01:00 17.0
2012-02-03 00:16:00 17.1
2012-02-03 00:31:00 17.0
.
.
.
2012-02-03 23:31:00 18.0
2012-02-03 23:46:00 18.2
So 2012-02-01 and 2012-02-03 are complete days and I'd like to remove 2012-02-02 as it is an incomplete day.

Convert the timestamps to days
Count the number of observations per day
Retain only those days with the maximum number of observations
The code:
library(dplyr)
library(lubridate)
dataset %>%
  mutate(Day = floor_date(Date, unit = "day")) %>%
  group_by(Day) %>%
  mutate(nObservation = n()) %>%
  filter(nObservation == max(nObservation))
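If no day in the file is complete, the max-based filter above will still keep incomplete days, so you may prefer to test against the expected count instead. A sketch assuming 15-minute data, i.e. 96 readings per complete day:
library(dplyr)
library(lubridate)
dataset %>%
  group_by(Day = floor_date(Date, unit = "day")) %>%
  filter(n() == 96) %>%   # 4 readings per hour x 24 hours
  ungroup()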

Another option is rle() on the calendar date, assuming the rows are in time order:
# run-length encode the date part of each timestamp
Date.rle <- rle(as.character(as.Date(df$Date)))
# keep only dates whose run covers all 96 readings (4 per hour x 24 hours)
Date.good <- Date.rle$values[Date.rle$lengths == 96]
df <- df[as.character(as.Date(df$Date)) %in% Date.good, ]

Here is one base R method that should work:
# create a day variable
df$day <- as.Date(df$Date, format="%Y-%m-%d")
# calculate the number of observations per day
df$obsCnt <- ave(df$Temp, df$day, FUN=length)
# subset data: keep only days with all 96 observations (4 per hour x 24 hours)
dfNew <- df[df$obsCnt >= 96,]
I set the threshold at 96 observations per day, but it is easily adjusted.
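As a quick sanity check for any of these approaches, you can list the days that were dropped; a small sketch using the df and dfNew objects from the base R version above:
# days present in the raw data but absent after filtering, i.e. the incomplete days
setdiff(as.character(unique(df$day)), as.character(unique(dfNew$day)))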

Related

Calculating the difference of a column compared to a specific reference row

I have a data frame with data for every minute of every weekday during the year and want to calculate the difference relative to a specific reference row each day (08:30:00 in this example; Data1 is the column I want the difference for). Usually I would use diff and lag, but those only compare against the n previous rows, not one specific reference row.
As the entire data set has about 1 million entries, I think using lag and diff in a recursive function (checking for the starting line and then walking forward) would be too time consuming. Another idea is to build a second data frame holding only the reference line for each day (line 3 in this sample) and join it to the original data frame as a new column containing the starting value; then I could easily calculate the difference between the two columns.
Date Time Data1 Diff
1 2022-01-03 08:28:00 4778.14 0
2 2022-01-03 08:29:00 4784.23 0
3 2022-01-03 08:30:00 4785.15 0
4 2022-01-03 08:31:00 4785.01 -0.14
5 2022-01-03 08:32:00 4787.83 2.68
6 2022-01-03 08:33:00 4788.80 3.65
You can subset Data1 to rows where Time is "08:30:00" as follows. This assumes Time is character.
dat$diff <- dat$Data1 - dat$Data1[[match("08:30:00", dat$Time)]]
dat
Date Time Data1 Diff diff
1 2022-01-03 08:28:00 4778.14 0.00 -7.01
2 2022-01-03 08:29:00 4784.23 0.00 -0.92
3 2022-01-03 08:30:00 4785.15 0.00 0.00
4 2022-01-03 08:31:00 4785.01 -0.14 -0.14
5 2022-01-03 08:32:00 4787.83 2.68 2.68
6 2022-01-03 08:33:00 4788.80 3.65 3.65
For data with multiple dates, you can do the same operation for each day using dplyr::group_by():
library(dplyr)
dat %>%
  group_by(Date) %>%
  mutate(diff = Data1 - Data1[[match("08:30:00", Time)]]) %>%
  ungroup()
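The second-data-frame-plus-join idea from the question also works and scales well on large data; a minimal sketch, assuming Time is character and there is exactly one 08:30:00 row per Date:
library(dplyr)
# one reference value per day
ref <- dat %>%
  filter(Time == "08:30:00") %>%
  select(Date, ref = Data1)
dat %>%
  left_join(ref, by = "Date") %>%
  mutate(diff = Data1 - ref) %>%
  select(-ref)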

How to subset data by specific hours of interest?

I have a dataset of temperature values taken at specific datetimes across five locations. For whatever reason, sometimes the readings are every hour and sometimes every four hours. Another issue is that when the clocks changed for daylight saving, the readings are off by one hour. I am interested in the readings taken every four hours and would like to subset these by day and night to ultimately get daily and nightly mean temperatures.
To summarise, the readings I am interested in are either:
0800, 1200, 1600 =day
2000, 0000, 0400 =night
Recordings between 0800-1600 and 2000-0400 each day should be averaged.
During daylight savings, the equivalent times are:
0900, 1300, 1700 =day
2100, 0100, 0500 =night
Recordings between 0900-1700 and 2100-0500 each day should be averaged.
In the process, I am hoping to subset by site.
There are also some NA values or blank cells which should be ignored.
So far, I tried to subset by one hour of interest just to see if it worked, but haven't got any further than that. Any tips on how to subset by a series of times of interest? Thanks!
temperature <- read.csv("SeaTemperatureData.csv",
stringsAsFactors = FALSE)
temperature <- subset(temperature, select=-c(X)) #remove last column that contains comments, not needed
temperature$Date.Time <- as.POSIXct(temperature$Date.Time,
                                    format="%d/%m/%Y %H:%M",
                                    tz="Pacific/Auckland")
#subset data by time, we only want to include temperatures recorded at certain times
temperature.goat <- subset(temperature, Date.Time==c('01:00:00'), select=c("Goat.Island"))
Date.Time Goat.Island Tawharanui Kawau Tiritiri Noises
1 2019-06-10 16:00:00 16.820 16.892 16.749 16.677 15.819
2 2019-06-10 20:00:00 16.773 16.844 16.582 16.654 15.796
3 2019-06-11 00:00:00 16.749 16.820 16.749 16.606 15.819
4 2019-06-11 04:00:00 16.487 16.796 16.654 16.558 15.796
5 2019-06-11 08:00:00 16.582 16.749 16.487 16.463 15.867
6 2019-06-11 12:00:00 16.630 16.773 16.725 16.654 15.867
One possible solution is to extract hours from your DateTime variable, then filter for particular hours of interest.
Here is a fake example over 4 days:
library(lubridate)
df <- data.frame(DateTime = seq(ymd_hms("2020-02-01 00:00:00"), ymd_hms("2020-02-05 00:00:00"), by = "hour"),
Value = sample(1:100,97, replace = TRUE))
DateTime Value
1 2020-02-01 00:00:00 99
2 2020-02-01 01:00:00 51
3 2020-02-01 02:00:00 44
4 2020-02-01 03:00:00 49
5 2020-02-01 04:00:00 60
6 2020-02-01 05:00:00 56
Now, you can extract hours with the hour() function of lubridate and subset for the desired hour:
library(lubridate)
subset(df, hour(DateTime) == 5)
DateTime Value
6 2020-02-01 05:00:00 56
30 2020-02-02 05:00:00 31
54 2020-02-03 05:00:00 65
78 2020-02-04 05:00:00 80
EDIT: Getting the mean for each site per subset of hours
Per the OP's request in the comments, the goal is to calculate the mean of the values at each site over different periods of time.
Basically, you want two periods per day, one from 8:00 to 17:00 and the other from 18:00 to 7:00.
Here is a more elaborate example based on the previous one:
df <- data.frame(DateTime = seq(ymd_hms("2020-02-01 00:00:00"), ymd_hms("2020-02-05 00:00:00"), by = "hour"),
Site1 = sample(1:100,97, replace = TRUE),
Site2 = sample(1:100,97, replace = TRUE))
DateTime Site1 Site2
1 2020-02-01 00:00:00 100 6
2 2020-02-01 01:00:00 9 49
3 2020-02-01 02:00:00 86 12
4 2020-02-01 03:00:00 34 55
5 2020-02-01 04:00:00 76 29
6 2020-02-01 05:00:00 41 1
....
So, now you can label each time point as day ("Daily") or night, group by this category within each day, and calculate the mean of each individual site using summarise_at:
library(lubridate)
library(dplyr)
df %>%
  mutate(Date = date(DateTime),
         Hour = hour(DateTime),
         Category = ifelse(between(hour(DateTime), 8, 17), "Daily", "Night")) %>%
  group_by(Date, Category) %>%
  summarise_at(vars(c(Site1, Site2)), ~ mean(., na.rm = TRUE))
# A tibble: 9 x 4
# Groups: Date [5]
Date Category Site1 Site2
<date> <chr> <dbl> <dbl>
1 2020-02-01 Daily 56.9 63.1
2 2020-02-01 Night 58.9 46.6
3 2020-02-02 Daily 54.5 47.6
4 2020-02-02 Night 36.9 41.7
5 2020-02-03 Daily 42.3 56.9
6 2020-02-03 Night 44.1 55.9
7 2020-02-04 Daily 54.3 50.4
8 2020-02-04 Night 54.8 34.3
9 2020-02-05 Night 75 16
Does that answer your question?
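If you only want the four-hourly readings the question lists, rather than averaging everything in each window, one possible sketch (not from the original answer) is to filter to those hours first and shift the after-midnight readings back so that each night stays together. It uses the fake Site1/Site2 data above and the standard (non-DST) hours; during daylight saving the hour sets would need to be 9/13/17 and 21/1/5 instead:
library(lubridate)
library(dplyr)
day_hours   <- c(8, 12, 16)
night_hours <- c(20, 0, 4)
df %>%
  filter(hour(DateTime) %in% c(day_hours, night_hours)) %>%
  mutate(Category = ifelse(hour(DateTime) %in% day_hours, "Day", "Night"),
         # pull 00:00 and 04:00 back six hours so they group with the
         # previous evening's 20:00 reading
         Date = if_else(Category == "Night",
                        date(DateTime - hours(6)),
                        date(DateTime))) %>%
  group_by(Date, Category) %>%
  summarise(across(c(Site1, Site2), ~ mean(.x, na.rm = TRUE)), .groups = "drop")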

Rolling Max/Min/Sum for time series over last x Mins interval

I have a financial time series data.frame with microsecond precision:
timestamp price volume
2017-08-29 08:00:00.345678 99.1 10
2017-08-29 08:00:00.674566 98.2 5
....
2017-08-29 16:00:00.111234 97.0 3
2017-08-29 16:00:01.445678 96.5 5
In total: around 100k records per day.
I saw a couple of functions where I can specify the width of the rolling window, e.g. k = 10, but k is expressed as a number of observations, not minutes.
I need to calculate a running/rolling max and min of the price series and a running/rolling sum of the volume series, like this:
start with a timestamp exactly 5 minutes after the beginning of the time series
for every following timestamp: look back over the 5-minute interval and
calculate the rolling statistics.
How can I calculate this efficiently?
Your data
I wasn't able to capture milliseconds (but the solution should still work)
library(lubridate)
df <- data.frame(timestamp = ymd_hms("2017-08-29 08:00:00.345678", "2017-08-29 08:00:00.674566",
                                     "2017-08-29 16:00:00.111234", "2017-08-29 16:00:01.445678"),
                 price = c(99.1, 98.2, 97.0, 96.5),
                 volume = c(10, 5, 3, 5))
purrr and dplyr solution
library(purrr)
library(dplyr)
timeinterval <- 5*60 # 5 minutes, in seconds
Filter df for observations within time interval, save as list
mdf <- map(1:nrow(df), ~df[df$timestamp >= df[.x,]$timestamp & df$timestamp < df[.x,]$timestamp+timeinterval,])
Summarise for each data.frame in list
statdf <- map_df(mdf, ~.x %>%
summarise(timestamp = head(timestamp,1),
max.price = max(price),
max.volume = max(volume),
sum.price = sum(price),
sum.volume = sum(volume),
min.price = min(price),
min.volume = min(volume)))
Output
timestamp max.price max.volume sum.price sum.volume
1 2017-08-29 08:00:00 99.1 10 197.3 15
2 2017-08-29 08:00:00 98.2 5 98.2 5
3 2017-08-29 16:00:00 97.0 5 193.5 8
4 2017-08-29 16:00:01 96.5 5 96.5 5
min.price min.volume
1 98.2 5
2 98.2 5
3 96.5 3
4 96.5 5
As I was looking for a backward calculation (start with a timestamp and look 5 minutes back), I slightly modified the great solution by @CPak as follows:
mdf <- map(1:nrow(df), ~df[df$timestamp <= df[.x,]$timestamp & df$timestamp > df[.x,]$timestamp - timeinterval,])
statdf <- map_df(mdf, ~.x %>%
summarise(timestamp_to = tail(timestamp,1),
timestamp_from = head(timestamp,1),
max.price = max(price),
min.price = min(price),
sum.volume = sum(volume),
records = n()))
In addition, I added records = n() to see how many records have been used in the intervals.
One caveat: the code takes 10 mins on mdf and another 6 mins for statdf on a dataset with 100K+ records.
Any ideas how to optimize it? Thank you!
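If the per-row subsetting is still too slow, one possible speed-up (an untested sketch, not from the original answers) is the slider package, which computes rolling statistics over an index column directly:
library(slider)
library(lubridate)
library(dplyr)
df %>%
  arrange(timestamp) %>%
  mutate(max.price  = slide_index_dbl(price,  timestamp, max, .before = minutes(5)),
         min.price  = slide_index_dbl(price,  timestamp, min, .before = minutes(5)),
         sum.volume = slide_index_dbl(volume, timestamp, sum, .before = minutes(5)))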

period.apply over an hour with deciding start time

So I have an xts time series covering the year, with time zone "UTC". The interval between rows is 15 minutes.
x1 x2
2014-12-31 23:15:00 153.0 0.0
2014-12-31 23:30:00 167.1 5.4
2014-12-31 23:45:00 190.3 4.1
2015-01-01 00:00:00 167.1 9.7
As I want data over one hour to allow for comparison with other data sets, I tried to use period.apply:
dat <- period.apply(dat, endpoints(dat,on="hours",k=1), colSums)
The problem is that the first row in my new data set is 2014-12-31 23:45:00 and not 2015-01-01 00:00:00. I tried changing the endpoint vector but somehow it keeps saying that it is out of bounds. I also thought this was my answer: https://stats.stackexchange.com/questions/5305/how-to-re-sample-an-xts-time-series-in-r/19003#19003 but it was not. I don't want to change the names of my columns, I want to sum over a different interval.
Here is a reproducible example:
library(xts)
seq<-seq(from=ISOdate(2014,12,31,23,15),length.out = 100, by="15 min", tz="UTC")
xts<-xts(rep(1,100),order.by = seq)
period.apply(xts, endpoints(xts,on="hours",k=1), colSums)
And the result looks like this:
2014-12-31 23:45:00 3
2015-01-01 00:45:00 4
2015-01-01 01:45:00 4
2015-01-01 02:45:00 4
and ends up like this:
2015-01-01 21:45:00 4
2015-01-01 22:45:00 4
2015-01-01 23:45:00 4
2015-01-02 00:00:00 1
Whereas I would like it to always sum over the same interval, meaning I would like only 4s.
(I am using RStudio 0.99.903 with R x64 3.3.2)
The problem is that you're using endpoints, but you want to align by the start of the interval, not the end. I thought you might be able to use a custom startpoints() function, but that produced weird results.
The basic idea of the work-around below is to subtract a small amount from all index values, then use endpoints and period.apply to aggregate. Then call align.time on the result. I'm not sure if this is a general solution, but it seems to work for your example.
library(xts)
seq<-seq(from=ISOdate(2014,12,31,23,15),length.out = 100, by="15 min", tz="UTC")
xts<-xts(rep(1,100),order.by = seq)
# create a temporary object
tmp <- xts
# subtract a small amount of time from each index value
.index(tmp) <- .index(tmp)-0.001
# aggregate to hourly
agg <- period.apply(tmp, endpoints(tmp, "hours"), colSums)
# round index up to next hour
agg_aligned <- align.time(agg, 3600)
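An alternative sketch that avoids touching the index (assuming you can work with a plain data frame): label every 15-minute stamp with the full hour that closes its interval via lubridate::ceiling_date(), then sum per label. A stamp exactly on the hour stays in its own hour, so the first group runs from 23:15 to 00:00:
library(lubridate)
library(dplyr)
data.frame(time = index(xts), value = coredata(xts)[, 1]) %>%
  group_by(hour = ceiling_date(time, unit = "hour")) %>%
  summarise(sum = sum(value))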

How to iterate over column values in a dataframe, take the mean, and create a new dataframe?

I have a large dataframe in R and I want to plot the change in temperature over time. I've tried this before but since there is so much data the graph is really noisy and impossible to read.
I experimented with other plot types to try and get around this but they didn't really work. So I decided instead I will plot the mean temperature for each hour.
I've read the data in from a csv file; there are about 56k rows, and an hour is about 720 rows, give or take.
> head(wormData)
Time Date Day.of.Week Humidity.1 Temp.1 Vapor.Density.1 Base.Temp.1
1 0:18:44 1/7/2016 Friday 69.7 26.4 17.43 85.00
2 0:18:49 1/7/2016 Friday 69.7 26.4 17.43 27.44
3 0:18:54 1/7/2016 Friday 69.6 26.4 17.40 27.44
4 0:18:59 1/7/2016 Friday 69.6 26.4 17.40 27.44
5 0:19:05 1/7/2016 Friday 69.5 26.4 17.38 27.44
6 0:19:10 1/7/2016 Friday 69.5 26.4 17.38 27.44
The column I am interested in is Temp.1 so what I want to do is take the mean of every 720 values in the Temp.1 column, then put each of those mean values into a new dataframe so I can plot a cleaner graph.
I thought of just doing it by hand but that would be about 50 data points and I have many more csv files to do, so any help on how I could do this would be appreciated. I've tried subsetting the data or making vectors with the mean values as well as writing some loops, but I'm struggling to tell R that I want the mean of every 720 rows.
Thanks so much :)
A kind of basic solution on top of matrix:
set.seed(123)
x <- sample(1:10, (720*5), replace = TRUE) # generate dummy data
> str(x)
 int [1:3600] 3 8 5 9 10 1 6 9 6 5 ...
# Use wormData$Temp.1 instead of x for your actual data
z <- matrix(x, nrow = length(x)/720, byrow = TRUE) # one row per block of 720 consecutive values
rowMeans(z) # the mean of each row, i.e. of each block of 720 values
Output:
[1] 5.654167 5.375000 5.358333 5.477778 5.618056
If your dataset is not a multiple of 720, you'll get a warning and the last mean will be wrong (the vector is recycled to fill the final row).
Here is a solution with dplyr, assuming your row count is a multiple of 720 (a variant that drops this assumption is sketched after the data below). We create a grouping variable and then compute the mean by group.
library(dplyr)
n <- 2 # replace with n <- 720 with your actual data
mutate(d, group = rep(1:(nrow(d)/n), each = n)) %>%
  group_by(group) %>%
  summarize(mean = mean(Temp.1))
data
d <- read.table(text = " Time Date Day.of.Week Humidity.1 Temp.1 Vapor.Density.1 Base.Temp.1
1 0:18:44 1/7/2016 Friday 69.7 26.4 17.43 85.00
2 0:18:49 1/7/2016 Friday 69.7 26.4 17.43 27.44
3 0:18:54 1/7/2016 Friday 69.6 26.4 17.40 27.44
4 0:18:59 1/7/2016 Friday 69.6 26.4 17.40 27.44
5 0:19:05 1/7/2016 Friday 69.5 26.4 17.38 27.44
6 0:19:10 1/7/2016 Friday 69.5 26.4 17.38 27.44", stringsAsFactors = FALSE, header = TRUE)
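As noted above, rep() produces a length mismatch when the row count is not an exact multiple of n; a variant of the grouping step that tolerates a partial final group (sketch, reusing the d and n objects from this answer):
library(dplyr)
d %>%
  mutate(group = ceiling(row_number() / n)) %>%
  group_by(group) %>%
  summarize(mean = mean(Temp.1))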
Here is a more complete answer using dplyr. This uses the actual dates and times you have so that you aren't approximating 720 values per hour.
library(tidyverse)
worm_data <- tibble(time = c("0:18:44", "0:18:49", "2:18:54",
                             "0:18:59", "0:19:05", "2:19:10"),
                    date = c("2016-07-01", "2016-07-01", "2016-07-01",
                             "2016-07-02", "2016-07-02", "2016-07-02"),
                    temp_1 = c(25, 27, 290, 30, 20, 2))
worm_data_test <- worm_data %>%
  mutate(
    date = paste(date, time),
    date = as.POSIXct(date, tz = "GMT", format = "%Y-%m-%d %H:%M:%S")
  ) %>%
  group_by(
    datetime = as.POSIXct(cut(date, breaks = "hour")) # creates a new variable
  ) %>%
  summarize(
    temp_1 = mean(temp_1, na.rm = TRUE)
  ) %>%
  ungroup()
In this case, you are grouping by the hour, then summarizing over those hours. I chose strange values and modified the dates and times to show that it works.
For more on datetime, I suggest: https://www.stat.berkeley.edu/~s133/dates.html
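Since the original goal was a cleaner plot, once the hourly means are computed a simple line plot does the job; a sketch assuming ggplot2 (loaded with the tidyverse) and the worm_data_test object above:
library(ggplot2)
ggplot(worm_data_test, aes(x = datetime, y = temp_1)) +
  geom_line() +
  labs(x = "Time", y = "Mean hourly temperature")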
