I am working with a multi-year dataset that has columns for date (%Y-%m-%d) and daily values for several variables.
In R, how do I subset the data by a date range (e.g., June 29 ± 5 days) but capture the data from all years?
DATE A B C
1996-06-10 12:00:00 178.0 24.1 1.7
1996-06-11 12:00:00 184.1 30.2 1.1
1996-06-12 12:00:00 187.2 29.4 1.8
1996-06-13 12:00:00 194.4 35.0 5.3
1996-06-14 12:00:00 200.3 35.9 1.5
1996-06-15 12:00:00 138.9 15.1 0.0
...
1) Base R
Let yrs be all unique years in the data and targets be those years combined with the target month and day. Then create dates, which contains all dates within delta days of any value in targets. Note that sapply strips dates of its "Date" class, but that does not matter since dates is only used subsequently in %in%, which ignores the class. Finally, subset DF down to those rows whose DATE is in dates. No packages are used.
# inputs (also DF defined in Note at end)
target <- "06-19"
delta <- 5
DATE <- as.Date(DF$DATE)
yrs <- unique(format(DATE, "%Y"))
targets <- as.Date(paste(yrs, target, sep = "-"))
dates <- c(sapply(targets, "+", seq(-delta, delta)))
DF[DATE %in% dates, ]
giving:
DATE A B C
5 1996-06-14 12:00:00 200.3 35.9 1.5
6 1996-06-15 12:00:00 138.9 15.1 0.0
2) sqldf
Alternately, this can be done using a single SQL statement. Note that we assume the DATE column is character, since the question referred to it being in a particular format. Using the same inputs, the inner select generates the target date for each year, and the outer select joins DF to those rows within delta days of any target date. We use the H2 database backend here since it has better date support than SQLite.
library(sqldf)
library(RH2)
# inputs (also DF defined in Note at end)
target <- "06-19"
delta <- 5
fn$sqldf("select DF.* from DF
join (select distinct cast(substr(DATE, 1, 4) || '-' || '$target' as DATE) as target
from DF)
on cast(substr(DATE, 1, 10) as DATE) between target - $delta and target + $delta")
giving:
DATE A B C
1 1996-06-14 12:00:00 200.3 35.9 1.5
2 1996-06-15 12:00:00 138.9 15.1 0.0
We could simplify the SQL somewhat if DATE is of R's "Date" class. That is, replace the sqldf statement above with:
DF2 <- transform(DF, DATE = as.Date(DATE))
fn$sqldf("select DF2.* from DF2
join (select distinct cast(year(DATE) || '-' || '$target' as DATE) as target from DF2)
on DATE between target - $delta and target + $delta")
giving:
DATE A B C
1 1996-06-14 200.3 35.9 1.5
2 1996-06-15 138.9 15.1 0.0
Note
The input DF is assumed to be:
DF <- structure(list(DATE = c("1996-06-10 12:00:00", "1996-06-11 12:00:00",
"1996-06-12 12:00:00", "1996-06-13 12:00:00", "1996-06-14 12:00:00",
"1996-06-15 12:00:00"), A = c(178, 184.1, 187.2, 194.4, 200.3,
138.9), B = c(24.1, 30.2, 29.4, 35, 35.9, 15.1), C = c(1.7, 1.1,
1.8, 5.3, 1.5, 0)), .Names = c("DATE", "A", "B", "C"), row.names = c(NA,
-6L), class = "data.frame")
A base R attempt.
Stealing the example data from the other answer by Kevin:
df <- data.frame(
my_date = seq.Date(as.Date("1990-01-01"), as.Date("1999-12-31"), by = 1),
x = rnorm(3652),
y = rnorm(3652),
z = rnorm(3652)
)
Set your variables for the selection:
month_num <- 6
day_num <- 29
bound <- 5
Find the key dates in your range of years:
keydates <- as.Date(sprintf(
"%d-%02d-%02d",
do.call(seq, as.list(as.numeric(range(format(df$my_date, "%Y"))))),
month_num,
day_num
))
Make a selection:
out <- df[df$my_date %in% outer(keydates, -bound:bound, `+`),]
Check that it worked:
table(format(out$my_date, "%m-%d"))
#06-24 06-25 06-26 06-27 06-28 06-29 06-30 07-01 07-02 07-03 07-04
# 10 10 10 10 10 10 10 10 10 10 10
One valid value for each day/month for each year 1990 to 1999, centred on "06-29" with a range of 5 days either side.
You can use lubridate intervals to define valid date ranges and then use a purrr map to filter your data against each interval.
library(dplyr)
library(lubridate)
library(magrittr) # only because I've used the "exposition" (%$%) pipe
library(purrr)
df <- tibble(
my_date = as.POSIXct(
seq.Date(as.Date("1990-01-01"), as.Date("1999-12-31"), by = 1),
tz = "UTC"
),
x = rnorm(3652),
y = rnorm(3652),
z = rnorm(3652)
)
month_num <- 6
day_num <- 29
bound <- 5
date_span <- df %>%
select(my_date) %>%
filter(month(my_date) == month_num & day(my_date) == day_num) %>%
mutate(
start = my_date - days(bound),
end = my_date + days(bound)
) %$%
interval(start, end, tzone = "UTC")
map_dfr(date_span, ~filter(df, my_date %within% .x))
# # A tibble: 110 x 4
# my_date x y z
# <dttm> <dbl> <dbl> <dbl>
# 1 1990-06-24 10:00:00 0.404 1.33 1.58
# 2 1990-06-25 10:00:00 0.351 -1.73 0.665
# 3 1990-06-26 10:00:00 -0.512 1.01 1.72
# 4 1990-06-27 10:00:00 1.55 0.417 -0.126
# 5 1990-06-28 10:00:00 1.86 1.18 0.322
# 6 1990-06-29 10:00:00 -0.0193 -0.105 0.356
# 7 1990-06-30 10:00:00 0.844 -0.712 1.51
# 8 1990-07-01 10:00:00 -0.431 0.451 -2.19
# 9 1990-07-02 10:00:00 1.74 -0.0650 -0.866
# 10 1990-07-03 10:00:00 0.965 -0.506 -0.0690
# # ... with 100 more rows
You could also go via the Julian day, which lets you do basic arithmetic operations (e.g. ± 5 days) without converting back and forth between Date and character objects. Keep in mind that your target date translates to a different Julian day during leap years, so you'll need to extract this piece of information somehow (a lubridate::leap_year variant is sketched after the output below, if you don't like the base R approach):
## make sure DATE is a date-time, then convert to an integer Julian day
dat$DATE <- as.POSIXct(dat$DATE)
dat$JULDAY <- as.integer(format(dat$DATE, "%j"))

## target date (here 19 June) as Julian day; it shifts by one in leap years
yr <- as.integer(format(dat$DATE, "%Y"))
is_leap <- (yr %% 4 == 0 & yr %% 100 != 0) | yr %% 400 == 0
dat$TARGET <- ifelse(is_leap, 171, 170)  # 171 in leap years, 170 otherwise

## create subset
subset(
  dat,
  JULDAY >= (TARGET - 5) & JULDAY <= (TARGET + 5),
  select = c("DATE", "A", "B", "C")
)
# DATE A B C
# 5 1996-06-14 12:00:00 200.3 35.9 1.5
# 6 1996-06-15 12:00:00 138.9 15.1 0.0
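If you prefer lubridate for the leap-year test, the TARGET step above collapses to a one-liner (a sketch, assuming dat$DATE is already a date-time as above and the 170/171 values still refer to the 19 June target):
library(lubridate)
dat$TARGET <- ifelse(leap_year(dat$DATE), 171, 170)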
Related
I have a long dataframe as below (note: DoY = day of the year; Hour = hour of the day, e.g. Hour = 0.5 means 12:30 AM):
Year DoY Hour
2016 126 0.5
2016 126 1
2016 126 1.5
- - -
2016 127 0
2016 127 0.5
- - -
2018 300 23.5
- - -
I am trying to combine these columns into a single DateTime column (e.g. 2016-05-05 12:30 AM, 2016-05-05 1:00 AM, etc.). I tried the following code:
x <- as.numeric(df$Hour)
x.m <- paste(floor(x), round((x-floor(x))*60), sep=":")
df$HourMinute <- x.m
df$DateTime <- strptime(paste0(df$Year, df$DoY, df$HourMinute), format = "%Y%j%H:%M")
The code above produces some strange output: after a year ends, it returns NA values. How can I create the desired output column?
You could use lubridate (the strptime attempt fails once DoY drops below 100: paste0 adds no separators, so %j can swallow the wrong digits):
library(lubridate)
df$date <- make_datetime(year = df$Year, min = round(df$Hour*60)) + days(df$DoY-1)
df
#> Year DoY Hour date
#> 1 2016 126 0.5 2016-05-05 00:30:00
#> 2 2016 126 1.0 2016-05-05 01:00:00
#> 3 2016 126 1.5 2016-05-05 01:30:00
Data:
df <- structure(list(Year = c(2016L, 2016L, 2016L), DoY = c(126L, 126L, 126L),
Hour = c(0.5, 1, 1.5)),
class = "data.frame", row.names = c(NA,-3L))
Here is a base R way.
fun <- function(DF){
d <- with(DF, paste(Year, DoY))
d <- as.Date(d, "%Y %j")
hm <- DF[["Hour"]]*60
d <- paste(d, paste(hm %/% 60, hm %% 60, 0, sep = ":"))
d <- as.POSIXct(d, format = "%Y-%m-%d %H:%M:%S")  # format must be named; the second positional argument is tz
d
}
fun(df)
#[1] "2016-05-05 00:30:00" "2016-05-05 01:00:00"
#[3] "2016-05-05 01:30:00"
This result can be assigned to a new column in the usual way.
df$DateTime <- fun(df)
Data
df <- read.table(text = "
Year DoY Hour
2016 126 0.5
2016 126 1
2016 126 1.5
", header = TRUE)
I have a data.frame that doesn't account for leap years (i.e., all years are 365 days). I would like to repeat the value of the last day of February during leap years. The DF in my code below has a fake data set; I intentionally removed the leap-day value in DF_NoLeapday. I would like to add a leap-day value to DF_NoLeapday by repeating the value of the last day of February in a leap year (in our example, the Feb 28, 2004 value). I would prefer a general solution that can be applied to many years of data.
set.seed(55)
DF <- data.frame(date = seq(as.Date("2003-01-01"), to= as.Date("2005-12-31"), by="day"),
A = runif(1096, 0,10),
Z = runif(1096,5,15))
DF_NoLeapday <- DF[!(format(DF$date,"%m") == "02" & format(DF$date, "%d") == "29"), ,drop = FALSE]
We can use complete on the 'date' column, which is already of class Date, to expand the rows and fill in the missing dates:
library(dplyr)
library(tidyr)
out <- DF_NoLeapday %>%
complete(date = seq(min(date), max(date), by = '1 day'))
dim(out)
#[1] 1096 3
out %>%
filter(date >= '2004-02-28', date <= '2004-03-01')
# A tibble: 3 x 3
# date A Z
# <date> <dbl> <dbl>
#1 2004-02-28 9.06 9.70
#2 2004-02-29 NA NA
#3 2004-03-01 5.30 7.35
By default the other columns' values are filled with NA; if we need a different value, it can be specified within complete via its fill argument.
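For instance (a sketch; filling with 0 is just an illustrative choice):
DF_NoLeapday %>%
  complete(date = seq(min(date), max(date), by = '1 day'),
           fill = list(A = 0, Z = 0))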
If we instead want to carry the previous day's values forward, use fill:
out <- out %>%
fill(A, Z)
out %>%
filter(date >= '2004-02-28', date <= '2004-03-01')
# A tibble: 3 x 3
# date A Z
# <date> <dbl> <dbl>
#1 2004-02-28 9.06 9.70
#2 2004-02-29 9.06 9.70
#3 2004-03-01 5.30 7.35
I have a dataset that contains start and end time stamps, as well as a performance percentage. I'd like to calculate group statistics over hourly blocks, e.g. "the average performance for the midnight hour was x%."
My question is whether there is a more efficient way to do this than a series of ifelse() statements.
# some sample data
pre.starting <- data.frame(starting = format(seq.POSIXt(from =
as.POSIXct(Sys.Date()), to = as.POSIXct(Sys.Date()+1), by = "5 min"),
"%H:%M", tz="GMT"))
pre.ending <- data.frame(ending = pre.starting[seq(1, nrow(pre.starting),
2), ])
ending2 <- pre.ending[-c(1), ]
starting2 <- data.frame(pre.starting = pre.starting[!(pre.starting$starting
%in% pre.ending$ending),])
dataset <- data.frame(starting = starting2
, ending = ending2
, perct = rnorm(nrow(starting2), 0.5, 0.2))
For example, I could create hour blocks with code along the lines of the following:
dataset2 <- dataset %>%
mutate(hour = ifelse(starting >= 00:00 & ending < 01:00, 12
, ifelse(starting >= 01:00 & ending < 02:00, 1
, ifelse(starting >= 02:00 & ending < 03:00, 13)))
) %>%
group_by(hour) %>%
summarise(mean.perct = mean(perct, na.rm=T))
Is there a way to make this code more efficient, or improve beyond ifelse()?
We can use cut to bin the ending timestamps into hourly intervals after converting them to POSIXct, and then take the mean for each hour.
library(dplyr)
dataset %>%
mutate_at(vars(pre.starting, ending), as.POSIXct, format = "%H:%M") %>%
group_by(ending_hour = cut(ending, breaks = "1 hour")) %>%
summarise(mean.perct = mean(perct, na.rm = TRUE))
# ending_hour mean.perct
# <fct> <dbl>
# 1 2019-09-30 00:00:00 0.540
# 2 2019-09-30 01:00:00 0.450
# 3 2019-09-30 02:00:00 0.612
# 4 2019-09-30 03:00:00 0.470
# 5 2019-09-30 04:00:00 0.564
# 6 2019-09-30 05:00:00 0.437
# 7 2019-09-30 06:00:00 0.413
# 8 2019-09-30 07:00:00 0.397
# 9 2019-09-30 08:00:00 0.492
#10 2019-09-30 09:00:00 0.613
# … with 14 more rows
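If the data spanned several days and you wanted hour-of-day averages pooled across days (e.g. "the midnight hour"), one variant (a sketch under the same column assumptions as above) is to group on the formatted hour instead of the calendar hour:
dataset %>%
  mutate_at(vars(pre.starting, ending), as.POSIXct, format = "%H:%M") %>%
  group_by(hour_of_day = format(ending, "%H")) %>%  # "00".."23", pooled across days
  summarise(mean.perct = mean(perct, na.rm = TRUE))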
I am looking for a simple one-liner that will help me find a corresponding value in a dataframe.
Data sample:
weather <-data.frame("date" = seq(as.Date("2000/1/1"), by ="days", length.out = 10), temp = runif(10))
weather
date temp
1 2000-01-01 0.08520875
2 2000-01-02 0.69003449
3 2000-01-03 0.85892903
4 2000-01-04 0.37790250
5 2000-01-05 0.04121786
6 2000-01-06 0.31550816
7 2000-01-07 0.86219597
8 2000-01-08 0.30844555
9 2000-01-09 0.96949855
10 2000-01-10 0.18851018
Let's say I now want to find the day on which the maximum temperature occurred:
max_temp <- max(weather$temp)
max_temp
[1] 0.9694985
Now there are a couple of ways that I can find the date of this temperature (i.e. the corresponding value that I am after):
weather[which(weather$temp == max_temp), which(colnames(weather) == "date")]
[1] "2000-01-09"
But this is kind of laborious. I could also use dplyr:
library(dplyr)
filter(weather, temp == max_temp) %>%
select(date)
date
1 2000-01-09
But again, a two-liner in the console just to get this seems like overkill.
I can't help but feel that there must be something like:
function(df, name_of_known_variable, value_of_known_variable, character_vector_of_variables_of_interest)
So for this example this would look like (assuming the function is "correspond"):
correspond(weather, temp, max_temp, date)
1 2000-01-09
I have looked all over and can't seem to find something simple for this. Please note that I understand that I could use:
weather[which.max(weather$temp), 1]
[1] "2000-01-09"
But let's assume that I am not necessarily looking for the maximum temperature (let's imagine I just have a value of interest and am trying to find the corresponding value). Let's also imagine I have a massive data frame with lots and lots of columns (so many as to make counting them laborious). Further, let's imagine that I want to return corresponding values from multiple columns.
Turning my comment into an answer, using Base R only:
Create data, adding two more columns to provide a broader perspective:
set.seed( 1110 )
weather <-data.frame( "date" = seq( as.Date("2000/1/1"), by = "days", length.out = 10),
temp = round( runif( 10 ), 2 ),
loc = round( runif( 10 ) * 10, 2 ),
speed = round( runif( 10 ) * 50, 1 ) )
> weather
date temp loc speed
1 2000-01-01 0.48 9.79 18.9
2 2000-01-02 0.79 9.20 18.6
3 2000-01-03 0.88 9.65 46.3
4 2000-01-04 0.58 0.59 5.3
5 2000-01-05 0.22 6.12 38.7
6 2000-01-06 0.09 3.05 42.6
7 2000-01-07 0.49 4.09 2.1
8 2000-01-08 0.99 8.60 31.9
9 2000-01-09 0.56 4.27 12.6
10 2000-01-10 0.36 6.02 42.7
Now we can select with one-liners, based on column names rather than numbers, as required:
# The day with the maximum temperature
weather[ weather$temp == max( weather$temp ), "date" ]
[1] "2000-01-08"
But we can do a lot more:
# Speed and Location (order reversed) on the day with a temperature of 0.49
weather[ weather$temp == .49, c( "speed", "loc" ) ]
speed loc
7 2.1 4.09
# Date and speed, based upon two selection criteria (temperature or location)
# here we need to use which() to get the row indices
weather[ c( which( weather$temp == min( weather$temp ) ), which( weather$loc == 6.12 ) ), c( "date", "speed" ) ]
date speed
6 2000-01-06 42.6
5 2000-01-05 38.7
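If you want this as a reusable helper along the lines of the correspond() function the question sketches, a minimal base R version could look like the following (a hypothetical helper written for illustration; column names are passed as strings rather than bare names):
correspond <- function(df, known_var, known_value, want) {
  # rows where the known variable equals the known value,
  # keeping only the columns of interest
  df[df[[known_var]] == known_value, want, drop = FALSE]
}

correspond(weather, "temp", max(weather$temp), c("date", "speed"))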
Use the data.table package; its subsetting syntax is simple:
library(data.table)
a <- as.data.table(weather)    # e.g. the weather data from above
a[variable == value_you_want]  # generic pattern
a[variable == max(variable)]   # e.g. a[temp == max(temp)]
a[variable == 0]
dplyr::slice is also a possibility here:
set.seed(1)
weather <-data.frame("date" = seq(as.Date("2000/1/1"), by ="days", length.out = 10), temp = runif(10))
library(dplyr)
weather %>% arrange(desc(temp)) %>% slice(1)
# A tibble: 1 x 2
date temp
<date> <dbl>
1 2000-01-07 0.9446753
And you can use dplyr::filter if you need to look for a specific value.
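For example (a sketch; the target value is just illustrative, and near() is safer than == when comparing doubles):
filter(weather, near(temp, 0.9446753))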
I have the following data frame:
Date_from <- c("2013-02-01","2013-05-10","2013-08-13","2013-02-01","2013-05-10","2013-08-13","2013-02-01","2013-05-10","2013-08-13")
Date_to <- c("2013-05-07","2013-08-12","2013-11-18","2013-05-07","2013-08-12","2013-11-18","2013-05-07","2013-08-12","2013-11-18")
y <- data.frame(Date_from,Date_to)
y$concentration <- c("1.5","2.5","1.5","3.5","1.5","2.5","1.5","3.5","3")
y$Parameter<-c("A","A","A","B","B","B","C","C","C")
y$Date_from <- as.Date(y$Date_from)
y$Date_to <- as.Date(y$Date_to)
y$concentration <- as.numeric(y$concentration)
I need to check whether, for EACH Parameter, the date range begins on the first day of the year (2013-01-01) and ends on the last day of the year (2013-12-31). If not, I need to add an extra row at the beginning and at the end for each parameter to complete the date range to a full year. The result should look like this:
Date_from Date_to concentration Parameter
2013-01-01 2013-01-31 NA NA
2013-02-01 2013-05-07 1.5 A
2013-05-10 2013-08-12 2.5 A
2013-08-13 2013-11-18 1.5 A
2013-11-19 2013-12-31 NA NA
2013-01-01 2013-01-31 NA NA
2013-02-01 2013-05-07 3.5 B
2013-05-10 2013-08-12 1.5 B
2013-08-13 2013-11-18 2.5 B
2013-11-19 2013-12-31 NA NA
2013-01-01 2013-01-31 NA NA
2013-02-01 2013-05-07 1.5 C
2013-05-10 2013-08-12 3.5 C
2013-08-13 2013-11-18 3.0 C
2013-11-19 2013-12-31 NA NA
Please note: The date ranges are only equal in this example for simplification.
UPDATE: This is my original data snippet and code:
sm<-read.csv("https://www.dropbox.com/s/tft6inwcrjqujgt/Test_data.csv?dl=1",sep=";",header=TRUE)
cleaned_sm<-sm[,c(4,5,11,14)] ##Delete obsolete columns
colnames(cleaned_sm)<-c("Parameter","Concentration","Date_from","Date_to")
cleaned_sm$Date_from<-as.Date(cleaned_sm$Date_from, format ="%d.%m.%Y")
cleaned_sm$Date_to<-as.Date(cleaned_sm$Date_to, format ="%d.%m.%Y")
# detect comma decimal separator and replace with dot decimal separator, as comma is not recognised as a number
cleaned_sm=lapply(cleaned_sm, function(x) gsub(",", ".", x))
cleaned_sm<-data.frame(cleaned_sm)
cleaned_sm$Concentration <- as.numeric(cleaned_sm$Concentration)
cleaned_sm$Date_from <- as.Date(cleaned_sm$Date_from)
cleaned_sm$Date_to <- as.Date(cleaned_sm$Date_to)
Added code based on #jasbner:
cleaned_sm %>%
group_by(Parameter) %>%
do(add_row(.,
Date_from = ymd(max(Date_to))+1 ,
Date_to = ymd(paste(year(max(Date_to)),"1231")),
Parameter = .$Parameter[1])) %>%
do(add_row(.,
Date_to = ymd(min(Date_from))-1,
Date_from = ymd(paste(year(min(Date_from)),"0101")) ,
Parameter = .$Parameter[1],
.before = 0)) %>%
filter(!duplicated(Date_from,fromLast = T),!duplicated(Date_to))
My attempt with dplyr and lubridate. Hacked together, but I think it should work. Note that this does not look for any gaps in the middle of the date ranges (a quick check for those is sketched after the output below). Basically, for each group, you add a row before and after that particular group. Then, if the date range already starts at the beginning of the year or ends at the end of the year, the added rows are filtered out.
library(dplyr)
library(lubridate)
cleaned_sm %>%
group_by(Parameter) %>%
do(add_row(.,
Date_from = ymd(max(.$Date_to))+1 ,
Date_to = ymd(paste(year(max(.$Date_to)),"1231")),
Parameter = .$Parameter[1])) %>%
do(add_row(.,
Date_to = ymd(min(.$Date_from))-1,
Date_from = ymd(paste(year(min(.$Date_from)),"0101")) ,
Parameter = .$Parameter[1],
.before = 0)) %>%
filter(!duplicated(Date_from,fromLast = T),!duplicated(Date_to))
# A tibble: 15 x 4
# Groups: Parameter [3]
# Date_from Date_to concentration Parameter
# <date> <date> <dbl> <chr>
# 1 2013-01-01 2013-01-31 NA A
# 2 2013-02-01 2013-05-07 1.50 A
# 3 2013-05-10 2013-08-12 2.50 A
# 4 2013-08-13 2013-11-18 1.50 A
# 5 2013-11-19 2013-12-31 NA A
# 6 2013-01-01 2013-01-31 NA B
# 7 2013-02-01 2013-05-07 3.50 B
# 8 2013-05-10 2013-08-12 1.50 B
# 9 2013-08-13 2013-11-18 2.50 B
# 10 2013-11-19 2013-12-31 NA B
# 11 2013-01-01 2013-01-31 NA C
# 12 2013-02-01 2013-05-07 1.50 C
# 13 2013-05-10 2013-08-12 3.50 C
# 14 2013-08-13 2013-11-18 3.00 C
# 15 2013-11-19 2013-12-31 NA C
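As noted above, this does not detect gaps in the middle of the ranges. A quick check for such gaps (a sketch, assuming rows are already ordered by Date_from within each Parameter) could be:
y %>%
  group_by(Parameter) %>%
  filter(as.numeric(Date_from - lag(Date_to)) > 1)  # starts that leave a gap after the previous end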
This seems to require a combination of different packages to attack it; I am using tidyr, data.table, and lubridate.
library(tidyr)
library(data.table)
library(dplyr)

date.start <- seq.Date(as.Date("2013-01-01"), as.Date("2013-12-31"), by = "day")
Date.Int <- data.frame(Date_from = date.start, Date_to = date.start)
y_wide <- y %>% spread(Parameter, concentration)
y_wide <- as.data.table(setkey(as.data.table(y_wide), Date_from, Date_to))
Date.Int <- as.data.table(setkey(as.data.table(Date.Int), Date_from, Date_to))
dats <- foverlaps(Date.Int, y_wide, nomatch = NA)
fin.dat <- dats %>%
mutate(A = ifelse(is.na(A), -5, A),
seqs = cumsum(!is.na(A) & A != lag(A, default = -5))) %>%
group_by(seqs) %>%
summarise(Date_from = first(i.Date_from),
Date_to = last(i.Date_to) ,
A = first(A),
B = first(B),
C = first(C)) %>%
mutate(A = ifelse(A == -5, NA, A)) %>%
ungroup()%>%
gather(Parameter, Concentration, A:C) %>%   # key = Parameter (A/B/C), value = Concentration
mutate(Parameter = ifelse(is.na(Concentration), NA, Parameter))
Okay, so I created a vector of dates from a start point to an end point (date.start), then turned it into a data.frame with identical start and end dates for Date.Int. This is because foverlaps needs to compare two sets of intervals (the equal start and end dates in Date.Int now officially form one-day intervals). I then took the data you provided and spread it, turning it from long format to wide format, and converted that into a data.table. Keying a data.table sets up how it should be arranged, and when using foverlaps you have to key the start dates and end dates (in that order). foverlaps determines whether an interval falls within another interval of dates. If you print out dats, you will see a bunch of lines with NA for everything, because those dates did not fall within any interval. So now we have to group these in some manner; I picked grouping by the values of "A" in dats, with the grouping variable called seqs. Then I summarised the data, switched it back from wide format to long format, and replaced the appropriate NA values.