I have a data frame with two columns, time and flow. The time series has a 15-minute interval, and I want to thin it so that the output series has consistent one-hour intervals, keeping the flow value recorded at each hourly time stamp in the original data. How do I extract the hourly data?
Input:
structure(list(t = structure(c(1104555600, 1104556500, 1104557400,
1104558300, 1104559200, 1104560100, 1104561000, 1104561900, 1104562800
), class = c("POSIXct", "POSIXt"), tzone = "EST"), flow = c(18,
18, 18, 18.125, 18.125, 18.125, 18.125, 18.125, 18.125)), .Names = c("t", "flow"), row.names = c(NA, 9L), class = "data.frame")
And for output I would want something like
time flow
2005-01-01 00:00:00 18.000
2005-01-01 01:00:00 18.125
2005-01-01 02:00:00 18.125
You can use cut to get the hour that each t value falls in, and then just take the first element of every cut group. If df is your data frame:
aggregate(df, list(cut(df$t,breaks="hour")), FUN=head, 1)[,-2]
# Group.1 flow
# 2005-01-01 00:00:00 18.000
# 2005-01-01 01:00:00 18.125
# 2005-01-01 02:00:00 18.125
If your data frame is a:
library(dplyr)
filter(a, grepl(":00:00",t))
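A variant of the same idea (just a sketch) keys off the time components rather than the printed string, so it does not depend on how the timestamps happen to format:
library(dplyr)
# Keep rows whose minute and second components are both zero,
# i.e. the on-the-hour stamps
filter(a, format(t, "%M:%S") == "00:00")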
From what I understand, you simply want to keep every fourth row.
In a data set with
time<- c(10,11,12,13,14,15,16,17,18,19)
flow<- c(3,4,5,6,7,8,9,10,11,12)
d <- data.frame(time,flow)
> d
   time flow
1    10    3
2    11    4
3    12    5
4    13    6
5    14    7
6    15    8
7    16    9
8    17   10
9    18   11
10   19   12
with
> d[seq(1, NROW(d), by = 4),]
you only keep every fourth row.
  time flow
1   10    3
5   14    7
9   18   11
I have a dataframe df in R:
month abc1 def2 xyz3
201201 1 2 4
201202 2 5 7
201203 4 11 4
201204 6 23 40
I would like to convert each of the columns (of which there are ~50, each with ~100 monthly observations) to a time series format in order to check for seasonality in the data, using the decompose function.
I assumed a for loop using the ts function would be the best way of doing this. I would like to use something along the lines of the loop below, although I realise using a function on the left side of the <- produces an error. Is there a way to dynamically name variables generated by a loop?
for(i in 2:ncol(df)) {
paste(names(df[, i]), "_ts") <- ts(df[ ,i], start = c(2012, 1), end = c(2021,11), frequency = 12)
}
You could try zoo:
test = data.frame(month=c("201201", "201202", "201203", "201204"), abc1=c(1,2,3,4), def2=c(4,6,7,10), xyz3=c(12,15,16,19))
library(zoo)
ZOO = zoo(test[, c("abc1", "def2", "xyz3")], order.by = as.Date(paste0(test$month, "01"), format = "%Y%m%d"))
ts(ZOO, frequency=12)
Output:
abc1 def2 xyz3
Jan 1 1 4 12
Feb 1 2 6 15
Mar 1 3 7 16
Apr 1 4 10 19
attr(,"index")
[1] 2012-01-01 2012-02-01 2012-03-01 2012-04-01
Update:
Now with correct frequency.
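If you do specifically want one ts object per column, as in your loop, here is a minimal sketch that sidesteps dynamic variable names by collecting the series in a named list (the _ts suffix, and the assumption that month is the first column of df, are mine):
# A sketch: build one monthly ts per column and keep them in a named list
# rather than creating dynamically named variables with assign().
ts_list <- lapply(df[-1], ts, start = c(2012, 1), frequency = 12)
names(ts_list) <- paste0(names(df)[-1], "_ts")
# Access a single series as ts_list$abc1_ts; note that decompose()
# needs at least two full periods (24 monthly observations) to run.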
I have a database containing a list of events. Each event has an associated start date and a date when the event ended or was completed, e.g.:
dataset <- tibble(
eventid = sample(1:100, 25, replace=TRUE),
start_date = sample(seq(as.Date('2011/01/01'), as.Date('2012/01/01'), by="day"), 25),
completed_date = sample(seq(as.Date('2012/01/01'), as.Date('2014/01/01'), by="day"), 25)
)
> dataset
# A tibble: 25 x 3
eventid start_date completed_date
<int> <date> <date>
1 57 2011-01-14 2013-01-07
2 97 2011-01-21 2011-03-03
3 58 2011-01-26 2011-02-05
4 25 2011-03-22 2013-07-20
5 8 2011-04-20 2012-07-16
6 81 2011-04-26 2013-03-04
7 42 2011-05-02 2012-01-16
8 77 2011-05-03 2012-08-14
9 78 2011-05-21 2013-09-26
10 49 2011-05-22 2013-01-04
# ... with 15 more rows
>
I am trying to produce a rolling "snapshot" of how many tasks were pending at different points in time, e.g. month by month. Expected result:
# A tibble: 25 x 2
month count
<date> <int>
1 2011-01-01 0
2 2011-02-01 3
3 2011-03-01 2
4 2011-04-01 2
5 2011-05-01 4
6 2011-06-01 8
I have attempted to group my variables using group_by(period=floor_date(start_date,"month")), but I'm a bit stuck and would appreciate a pointer in the right direction!
I would prefer a solution using dplyr if possible.
Thanks!
You can expand rows for each month included in the range of dates with map2 from purrr. map2 will iterate over multiple inputs simultaneously. In this case, it will iterate through the start and end dates at the same time.
In each iteration, it will create a monthly sequence using seq (or seq.Date) from the start to the end month (determined with floor_date). The result is nested for each row of data (since one row can span multiple months), so unnest is needed afterwards.
The transmute will add a new variable called month_year (and drop the old ones) and use substr to extract the year and month only (no day); these are the first through seventh characters of the date.
Then, you can group_by the month-year and count up the number of pending projects for each month_year.
I included set.seed so that the data below can be reproduced.
library(dplyr)
library(tidyr)
library(purrr)
library(lubridate)
dataset %>%
  mutate(month = map2(floor_date(start_date, "month"),
                      floor_date(completed_date, "month"),
                      seq.Date,
                      by = "month")) %>%
  unnest(month) %>%
  transmute(month_year = substr(month, 1, 7)) %>%
  group_by(month_year) %>%
  summarise(count = n())
Output
month_year count
<chr> <int>
1 2011-01 1
2 2011-02 3
3 2011-03 9
4 2011-04 10
5 2011-05 13
6 2011-06 15
7 2011-07 16
8 2011-08 18
9 2011-09 19
10 2011-10 20
# … with 22 more rows
If you want to exclude the completed month (except when the start month and completed month are the same, if that can occur), you can end each sequence just before the completed month. In this case, pmax is used so that if the start and end months are the same, that month is still counted.
Here is the modified mutate with map2:
mutate(month = map2(floor_date(start_date, "month"),
                    pmax(floor_date(completed_date, "month") - 1,
                         floor_date(start_date, "month")),
                    seq.Date,
                    by = "month"))
Data
set.seed(123)
dataset <- tibble(
eventid = sample(1:100, 25, replace=TRUE),
start_date = sample(seq(as.Date('2011/01/01'), as.Date('2012/01/01'), by="day"), 25),
completed_date = sample(seq(as.Date('2012/01/01'), as.Date('2014/01/01'), by="day"), 25)
)
I am working in R. I have a data frame that consists of Sampling Date and water temperature. I have provided a sample dataframe below:
Date Temperature
2015-06-01 11
2015-08-11 13
2016-01-12 2
2016-07-01 12
2017-01-08 4
2017-08-13 14
2018-03-04 7
2018-09-19 10
2019-08-24 8
Due to the erratic nature of the sampling dates (owing to the samplers' ability to reach the site), I am unable to classify years normally, January 1 to December 31, and instead am using the beginning of the sampling period as the start of each year. In this case a year starts June 1 and ends May 31, so that I can accurately compare the years to one another. Thus I want the four years to have the following labels:
Year_One = "2015-06-01" - "2016-05-31"
Year_Two = "2016-06-01" - "2017-05-31"
Year_Three = "2017-06-01" - "2018-05-31"
Year_Four = "2018-06-01" - "2019-08-24"
My goal is to create an additional column with these labels but have thus far been unable to do so.
I create two columns, year1 and year2, with two different approaches. The year2 approach requires that all the periods start June 1 and end May 31 (in your labels Year_Four ends 2019-08-24), so it may not be exactly what you need:
library(tidyverse)
library(lubridate)
dt$Date <- as.Date(dt$Date)
dt %>%
  mutate(year1 = case_when(between(Date, as.Date("2015-06-01"), as.Date("2016-05-31")) ~ "Year_One",
                           between(Date, as.Date("2016-06-01"), as.Date("2017-05-31")) ~ "Year_Two",
                           between(Date, as.Date("2017-06-01"), as.Date("2018-05-31")) ~ "Year_Three",
                           between(Date, as.Date("2018-06-01"), as.Date("2019-08-24")) ~ "Year_Four",
                           TRUE ~ "0")) %>%
  mutate(year2 = paste0(year(Date - months(5)), "/", year(Date - months(5)) + 1))
The output:
# A tibble: 9 x 4
Date Temperature year1 year2
<date> <dbl> <chr> <chr>
1 2015-06-01 11 Year_One 2015/2016
2 2015-08-11 13 Year_One 2015/2016
3 2016-01-12 2 Year_One 2015/2016
4 2016-07-01 12 Year_Two 2016/2017
5 2017-01-08 4 Year_Two 2016/2017
6 2017-08-13 14 Year_Three 2017/2018
7 2018-03-04 7 Year_Three 2017/2018
8 2018-09-19 10 Year_Four 2018/2019
9 2019-08-24 8 Year_Four 2019/2020
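If you would rather not hardcode each range in case_when, here is a sketch in the same spirit as the year2 column: shift each date back five months so a June-to-May year collapses onto one calendar year, then number the years from the first (like year2, this would place 2019-08-24 in a fifth year rather than Year_Four):
library(dplyr)
library(lubridate)
# A sketch: %m-% subtracts whole months and rolls back safely at month ends.
dt %>%
  mutate(sample_year = year(Date %m-% months(5)),
         year_label  = paste0("Year_", sample_year - min(sample_year) + 1))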
Use strftime to get the years, then make a factor whose levels follow the unique values. I'd recommend numbers instead of words, because they can be generated automatically; otherwise, use labels=c("one", "two", ...).
d <- within(d, {
year <- strftime(Date, "%Y")
year <- paste("Year", factor(year, labels=seq(unique(year))), sep="_")
})
# Date temperature year
# 1 2017-06-01 1 Year_1
# 2 2017-09-01 2 Year_1
# 3 2017-12-01 3 Year_1
# 4 2018-03-01 4 Year_2
# 5 2018-06-01 5 Year_2
# 6 2018-09-01 6 Year_2
# 7 2018-12-01 7 Year_2
# 8 2019-03-01 8 Year_3
# 9 2019-06-01 9 Year_3
# 10 2019-09-01 10 Year_3
# 11 2019-12-01 11 Year_3
# 12 2020-03-01 12 Year_4
# 13 2020-06-01 13 Year_4
Data:
d <- structure(list(Date = structure(c(17318, 17410, 17501, 17591,
17683, 17775, 17866, 17956, 18048, 18140, 18231, 18322, 18414
), class = "Date"), temperature = 1:13), class = "data.frame", row.names = c(NA,
-13L))
I have a large dataset over many years which has several variables, but the ones I am interested in are wind speed and dateTime. I want to find the time of the max wind speed for every day in the data set. I have hourly data in POSIXct format, with WS as a numeric with occasional NAs. Below is a short data set that should hopefully illustrate my point; the dateTime didn't work out to be hourly data, but it provides enough for a sample.
dateTime <- seq(as.POSIXct("2011-01-01 00:00:00", tz = "GMT"),
as.POSIXct("2011-01-29 23:00:00", tz = "GMT"),
by = 60*24)
WS <- sample(0:20,1738,rep=TRUE)
WD <- sample(0:390,1738,rep=TRUE)
Temp <- sample(0:40,1738,rep=TRUE)
df <- data.frame(dateTime,WS,WD,Temp)
df$WS[WS>15] <- NA
I have previously tried creating a new column with just a POSIX date (minus the time) to allow for day isolation; however, all of the things I have tried (aggregate, splitting, xts) only returned a shortened data frame with date and WS. Aggregate was the only one that didn't do this, but it gave me 23:00:00 as a constant time, which isn't correct.
I have looked at How to calculate daily means, medians, from weather variables data collected hourly in R?, https://stats.stackexchange.com/questions/7268/how-to-aggregate-by-minute-data-for-a-week-into-hourly-means and others but none have answered this question, or the solutions have not returned an ideal result.
I need to compare the results of this analysis with another data frame, which is why I need the actual time when the max wind speed occurred for each day in the dataset. I have a feeling there is a simple solution, but this has me frustrated.
A dplyr solution may be:
library(dplyr)
df %>%
  mutate(date = as.Date(dateTime)) %>%
  left_join(
    df %>%
      mutate(date = as.Date(dateTime)) %>%
      group_by(date) %>%
      summarise(max_ws = max(WS, na.rm = TRUE)) %>%
      ungroup(),
    by = "date"
  ) %>%
  select(-date)
# dateTime WS WD Temp max_ws
# 1 2011-01-01 00:00:00 NA 313 2 15
# 2 2011-01-01 00:24:00 7 376 1 15
# 3 2011-01-01 00:48:00 3 28 28 15
# 4 2011-01-01 01:12:00 15 262 24 15
# 5 2011-01-01 01:36:00 1 149 34 15
# 6 2011-01-01 02:00:00 4 319 33 15
# 7 2011-01-01 02:24:00 15 280 22 15
# 8 2011-01-01 02:48:00 NA 110 23 15
# 9 2011-01-01 03:12:00 12 93 15 15
# 10 2011-01-01 03:36:00 3 5 0 15
Dee asked for: "I want to find the time of the max wind speed for every day in the data set." Other answers have calculated the max(WS) for every day, but not at which hour it occurred.
So I propose the following solution with dplyr:
library(dplyr)
library(lubridate)
set.seed(12345)
dateTime <- seq(as.POSIXct("2011-01-01 00:00:00", tz = "GMT"),
as.POSIXct("2011-01-29 23:00:00", tz = "GMT"),
by = 60*24)
WS <- sample(0:20,1738,rep=TRUE)
WD <- sample(0:390,1738,rep=TRUE)
Temp <- sample(0:40,1738,rep=TRUE)
df <- data.frame(dateTime,WS,WD,Temp)
df$WS[WS>15] <- NA
df %>%
  group_by(Date = as.Date(dateTime)) %>%
  mutate(Hour = hour(dateTime),
         Hour_with_max_ws = Hour[which.max(WS)])
I want to point out that if there are several hours with the same maximal wind speed (in the example: 15), only the first hour with max(WS) will be shown as the result, even though a wind speed of 15 was reached on that date at hours 0, 3, 4, 21 and 22! So you might need more specific logic.
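One possible version of that more specific logic (just a sketch) collects every hour that ties for the daily maximum instead of only the first:
# A sketch: list all hours reaching the daily maximum wind speed;
# which() drops the NA comparisons, so NA wind speeds are ignored.
df %>%
  group_by(Date = as.Date(dateTime)) %>%
  mutate(Hours_with_max_ws = paste(
    sort(unique(hour(dateTime)[which(WS == max(WS, na.rm = TRUE))])),
    collapse = ", "))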
For the sake of completeness (and because I like the concise code) here is a "one-liner" using data.table:
library(data.table)
setDT(df)[, max.ws := max(WS, na.rm = TRUE), by = as.IDate(dateTime)][]
dateTime WS WD Temp max.ws
1: 2011-01-01 00:00:00 NA 293 22 15
2: 2011-01-01 00:24:00 15 55 14 15
3: 2011-01-01 00:48:00 NA 186 24 15
4: 2011-01-01 01:12:00 4 300 22 15
5: 2011-01-01 01:36:00 0 120 36 15
---
1734: 2011-01-29 21:12:00 12 249 5 15
1735: 2011-01-29 21:36:00 9 282 21 15
1736: 2011-01-29 22:00:00 12 238 6 15
1737: 2011-01-29 22:24:00 10 127 21 15
1738: 2011-01-29 22:48:00 13 297 0 15
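If you also want the row, and hence the timestamp, at which each daily maximum occurred, here is a sketch in the same data.table style (which.max ignores NAs and keeps the first row in case of ties):
library(data.table)
# Keep, for each calendar day, the row where WS reaches its maximum,
# so the dateTime of the daily maximum is returned alongside it.
setDT(df)[!is.na(WS), .SD[which.max(WS)], by = .(date = as.IDate(dateTime))]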
I am quite new to R, and I am trying to find a way to average continuous data over specific time intervals.
My data is a month-long recording of several parameters with 1 s time steps.
The table, read in via read.csv, has the date and time split across the first two columns and several other columns with values.
TimeStamp UTC Pitch Roll Heave(m)
05-02-13 6:45 0 0 0
05-02-13 6:46 0.75 -0.34 0.01
05-02-13 6:47 0.81 -0.32 0
05-02-13 6:48 0.79 -0.37 0
05-02-13 6:49 0.73 -0.08 -0.02
So I want to average the data over specific intervals, 20 minutes for example, such that the average for 7:00 takes all the points from 6:41 to 7:00, returns the average over that interval, and so on for the entire dataset.
The output would look something like this:
TimeStamp
05-02-13 19:00 462
05-02-13 19:20 332
05-02-13 19:40 15
05-02-13 20:00 10
05-02-13 20:20 42
Here is a reproducible dataset similar to your own.
meteorological <- data.frame(
TimeStamp = rep.int("05-02-13", 1440),
UTC = paste(
rep(formatC(0:23, width = 2, flag = "0"), each = 60),
rep(formatC(0:59, width = 2, flag = "0"), times = 24),
sep = ":"
),
Pitch = runif(1440),
Roll = rnorm(1440),
Heave = rnorm(1440)
)
The first thing that you need to do is to combine the first two columns to create a single (POSIXct) date-time column.
library(lubridate)
meteorological$DateTime <- with(
meteorological,
dmy_hm(paste(TimeStamp, UTC))
)
Then set up a sequence of break points for your different time groupings.
breaks <- seq(ymd_hms("2013-02-05 00:00:00"), ymd_hms("2013-02-06 00:00:00"), by = "20 mins")
Finally, you can calculate the summary statistics for each group. There are many ways to do this. ddply from the plyr package is a good choice.
library(plyr)
ddply(
meteorological,
.(cut(DateTime, breaks)),
summarise,
MeanPitch = mean(Pitch),
MeanRoll = mean(Roll),
MeanHeave = mean(Heave)
)
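For comparison, here is a rough dplyr equivalent of the same grouping (a sketch that reuses the meteorological data frame and breaks defined above):
library(dplyr)
# Group on the same 20-minute breaks and average each variable.
meteorological %>%
  group_by(interval = cut(DateTime, breaks)) %>%
  summarise(MeanPitch = mean(Pitch),
            MeanRoll  = mean(Roll),
            MeanHeave = mean(Heave))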
Please see if something simple like this works for you:
myseq <- data.frame(time=seq(ISOdate(2014,1,1,12,0,0), ISOdate(2014,1,1,13,0,0), "5 min"))
myseq$cltime <- cut(myseq$time, "20 min", labels = F)
> myseq
time cltime
1 2014-01-01 12:00:00 1
2 2014-01-01 12:05:00 1
3 2014-01-01 12:10:00 1
4 2014-01-01 12:15:00 1
5 2014-01-01 12:20:00 2
6 2014-01-01 12:25:00 2
7 2014-01-01 12:30:00 2
8 2014-01-01 12:35:00 2
9 2014-01-01 12:40:00 3
10 2014-01-01 12:45:00 3
11 2014-01-01 12:50:00 3
12 2014-01-01 12:55:00 3
13 2014-01-01 13:00:00 4
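The group index can then drive the averaging itself; a minimal sketch, with a made-up Value column added to myseq just to show the step:
# Hypothetical measurement column, only for illustration
myseq$Value <- rnorm(nrow(myseq))
# Mean of Value within each 20-minute group
aggregate(Value ~ cltime, data = myseq, FUN = mean)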