This seems really simple, yet I can't find an easy solution. I'm working with future streamflow projections for every day of a 27-year period (2024-2050). I'm only interested in streamflow during the 61-day period from April 11 to June 10 of each year. I want to extract the rows of the seq and Data columns that fall within this period for every year and collect them in one data frame.
Example data:
library(xts)
seq <- timeBasedSeq('2024-01-01/2050-12-31')
Data <- xts(1:length(seq),seq)
I want to achieve something like this BUT with all the dates between April 11 and June 10th and for all years (2024-2050). This is a shortened sample output:
seq_x <- c("2024-04-11","2024-06-10","2025-04-11","2025-06-10","2026-04-11","2026-06-10",
"2027-04-11", "2027-06-10")
Data_x <- c(102, 162, 467, 527, 832, 892, 1197, 1257)
output <- data.frame(seq_x, Data_x)
This question is similar to:
Calculating average for certain time period in every year
and
select date ranges for multiple years in r
but doesn't provide an efficient answer to my question on how to extract the same period over multiple years.
Here is a base R approach :
dates <- index(Data)
month <- as.integer(format(dates, '%m'))
day <- as.integer(format(dates, '%d'))
result <- Data[month == 4 & day >= 11 | month == 5 | month == 6 & day <= 10]
result
#2024-04-11 102
#2024-04-12 103
#2024-04-13 104
#2024-04-14 105
#2024-04-15 106
#2024-04-16 107
#...
#...
#2024-06-07 159
#2024-06-08 160
#2024-06-09 161
#2024-06-10 162
#2025-04-11 467
#2025-04-12 468
#...
#...
Create an mmdd character string and subset using it:
mmdd <- format(time(Data), "%m%d")
Data1 <- Data[mmdd >= "0411" & mmdd <= "0610"]
These would also work. Shifting the dates back by 10 days maps April 11 to April 1 and June 10 to May 31, so the window coincides exactly with the months of April and May:
Data2 <- Data[format(time(Data)-10, "%m") %in% c("04", "05")]
or
Data3 <- Data[ cycle(as.yearmon(time(Data)-10)) %in% 4:5 ]
The command fortify.zoo(x) can be used to convert an xts object x to a data frame.
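If a plain data frame is all that is needed, the same month-day filter can also be applied without xts. The following is a minimal base R sketch under that assumption; the names `flow` and `window` are made up for illustration, and the values 1, 2, 3, ... stand in for the real streamflow data, as in the example.

```r
# Daily dates for 2024-2050 with placeholder values 1, 2, 3, ...
dates <- seq(as.Date("2024-01-01"), as.Date("2050-12-31"), by = "day")
flow <- data.frame(seq = dates, Data = seq_along(dates))

# Zero-padded "mmdd" strings sort in calendar order,
# so plain string comparison selects April 11 to June 10
mmdd <- format(flow$seq, "%m%d")
in_window <- mmdd >= "0411" & mmdd <= "0610"
window <- flow[in_window, ]

# Cross-check: shifting back 10 days maps the window onto April and May
shifted <- format(flow$seq - 10, "%m") %in% c("04", "05")
stopifnot(identical(in_window, shifted))

nrow(window)      # 27 years * 61 days = 1647
head(window, 3)
```

The first rows carry the values 102, 103, 104, matching the output shown in the answers above.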
Here is an option. Group by the year of 'seq_x', then summarise to create a list column by subsetting 'Data' with a date range built from the first and last elements of 'seq_x' in each group, pull that column, and rbind the pieces together.
library(dplyr)
library(lubridate)
library(stringr)  # for str_c()
output %>%
  group_by(year = year(seq_x)) %>%
  summarise(new = list(Data[str_c(first(seq_x), last(seq_x), sep="::")]),
            .groups = 'drop') %>%
  pull(new) %>%
  do.call(rbind, .)
# [,1]
#2024-04-11 102
#2024-04-12 103
#2024-04-13 104
#2024-04-14 105
#2024-04-15 106
#2024-04-16 107
# ...
Related
I have data with one column which specifies day of the year, the code below provides an example dataset. No errors are appearing with my code but when I look at the number of observations in 'df_2' and 'df_3' something is wrong. I can't work out what it is.
#Example data
height <- c(21,34,64,27,74,90)
weight <- c(1,45,2,46,3,7)
day <- c(23,67,34,1,90,54)
df <- data.frame(height,weight,day)
#get days between 30 &70, and between 80 & 100
df_2 <- subset(df, day>"30" & day<"70")
df_3 <- subset(df, day>"80" & day<"100")
df_4 <- rbind(df_2,df_3)
I have also tried typing it as a range, e.g. subset(df, day[30:70]), but this produces an error.
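Please remove the quotes, as the values are numeric. With quotes, the numbers are converted to strings and compared as text, so for example 90 < "100" is FALSE.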
df_2 <- subset(df, day>= 30 & day <= 70)
df_3 <- subset(df, day>=80 & day<=100)
df_4 <- rbind(df_2,df_3)
> print(df_4)
height weight day
34 45 67
64 2 34
90 7 54
74 3 90
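To see why the quoted version runs without error yet returns the wrong rows: when a number is compared with a string, R converts the number to a string and compares the two as text. A small sketch using the day values from the question (the names `quoted`, `unquoted` and `both` are only illustrative):

```r
day <- c(23, 67, 34, 1, 90, 54)

# Text comparison: "90" is not below "100" because "9" sorts after "1"
quoted <- day > "80" & day < "100"
# Numeric comparison behaves as intended
unquoted <- day > 80 & day < 100

day[quoted]     # no matches
day[unquoted]   # 90

# Both ranges can also be selected in a single call
df <- data.frame(day = day)
both <- subset(df, (day >= 30 & day <= 70) | (day >= 80 & day <= 100))
```

Selecting both ranges in one call keeps the rows in their original order, unlike the rbind() of two subsets.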
I have daily precipitation data in R. My dates are in the format 2008-01-01 and span 10 years. I am trying to aggregate from 2008-10-01 to 2009-09-30 but I am not sure how. Is there a way in aggregate to set a start date for the aggregation and group accordingly?
My current code is
data<- aggregate(data$total_snow_cm, by=list(data$year), FUN = 'sum')
but this gives me the total snowfall for each calendar year (January to December), whereas I want each total to run from October of one year to September of the next (e.g. Oct 2008 to Sept 2009).
Assuming your data are in long format, I'd do something like this:
library(tidyverse)
library(lubridate)  # ymd(), month(), year()
#make sure R knows your dates are dates - you mention they're 'yyyy-mm-dd', so
yourdataframe <- yourdataframe %>%
  mutate(yourcolumnforprecipdate = ymd(yourcolumnforprecipdate))
#in this script or another, define a water year function
#(October to December count toward the following year)
water_year <- function(date) {
  ifelse(month(date) < 10, year(date), year(date) + 1)
}
#new wateryear column for your data, using your new function
yourdataframe <- yourdataframe %>%
  mutate(wateryear = water_year(yourcolumnforprecipdate))
#now group by water year (and location if there's more than one)
#and sum and create new data.frame
wy_sums <- yourdataframe %>% group_by(locationcolumn, wateryear) %>%
  summarize(wy_totalprecip = sum(dailyprecip))
For more info, read up on the tidyverse's great lubridate package, which is where the ymd() function comes from. There are others like ymd_hms(). mutate() is from the tidyverse's dplyr package. Both packages are extremely useful!
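The same water-year idea can also be combined with aggregate() in base R. The following is only a sketch on made-up data: the data frame `precip`, its column names, and the random snowfall values are assumptions for illustration, not the asker's data.

```r
# Hypothetical daily data; names and values are placeholders
set.seed(1)
precip <- data.frame(
  date = seq(as.Date("2008-01-01"), as.Date("2010-12-31"), by = "day")
)
precip$total_snow_cm <- runif(nrow(precip), min = 0, max = 5)

# Water year: October to December count toward the following calendar year
mon <- as.integer(format(precip$date, "%m"))
yr  <- as.integer(format(precip$date, "%Y"))
precip$wateryear <- ifelse(mon >= 10, yr + 1L, yr)

# One total per water year, e.g. 2009 covers 2008-10-01 to 2009-09-30
wy_sums <- aggregate(total_snow_cm ~ wateryear, data = precip, FUN = sum)
wy_sums
```

Because the grouping column is computed up front, no start date has to be passed to aggregate() itself.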
I'd like to give the actual answer to the question, where the aggregate() way was asked.
You may use with() to wrap the data specification around aggregate(). In the with() you can define date intervals as you can with numbers.
df1.agg <- with(df1[as.Date("2008-10-01") <= df1$year & df1$year <= as.Date("2009-09-30"), ],
aggregate(total_snow_cm, by=list(year), FUN=sum))
Another way is to use aggregate()'s formula interface, where data and, hence, also the interval can be specified inside the aggregate() call.
df1.agg <- aggregate(total_snow_cm ~ year,
data=df1[as.Date("2008-10-01") <= df1$year &
df1$year <= as.Date("2009-09-30"), ], FUN=sum)
Result
head(df1.agg)
# year total_snow_cm
# 1 2008-10-01 171
# 2 2008-10-02 226
# 3 2008-10-03 182
# 4 2008-10-04 129
# 5 2008-10-05 135
# 6 2008-10-06 222
Data
set.seed(42)
df1 <- data.frame(total_snow_cm=sample(120:240, 4018, replace=TRUE),
year=seq(as.Date("2000-01-01"),as.Date("2010-12-31"), by="day"))
I have measured temperature at a high time resolution of 10 minutes on different urban tree species whose reactions should be compared, so I am especially interested in periods of heat. The task I fail to do on my dataset is to select complete days based on a maximum value. E.g. days with at least one measurement above 30 °C should be extracted from my data frame in full.
Below you find a reproducible example that should illustrate my problem:
In my Measurings data frame I have calculated a column indicating whether each individual measurement is above 30 °C or not. I wanted to use that column to tell other functions whether they should pick a day or not to produce a new data frame. When the value is above 30 °C at any time of the day, I want to include that whole day, from 00:00 to 23:59, in the new data frame for further analyses.
start <- as.POSIXct("2018-05-18 00:00", tz = "CET")
tseq <- seq(from = start, length.out = 1000, by = "hours")
Measurings <- data.frame(
Time = tseq,
Temp = sample(20:35,1000, replace = TRUE),
Variable1 = sample(1:200,1000, replace = TRUE),
Variable2 = sample(300:800,1000, replace = TRUE)
)
Measurings$heat30 <- ifelse(Measurings$Temp > 30,"heat", "normal")
Measurings$otheroption30 <- ifelse(Measurings$Temp > 30,"1", "0")
The example is yielding a Dataframe analog to the structure of my Data:
head(Measurings)
Time Temp Variable1 Variable2 heat30 otheroption30
1 2018-05-18 00:00:00 28 56 377 normal 0
2 2018-05-18 01:00:00 23 65 408 normal 0
3 2018-05-18 02:00:00 29 78 324 normal 0
4 2018-05-18 03:00:00 24 157 432 normal 0
5 2018-05-18 04:00:00 32 129 794 heat 1
6 2018-05-18 05:00:00 25 27 574 normal 0
So how do I subset to get a new data frame containing all the days on which at least one entry is marked as "heat"?
I know that, for example, dplyr::filter could filter the individual entries (row 5 in the head of the example). But how can I tell it to take the whole day 2018-05-18?
I am quite new to analyzing data with R so I would appreciate any suggestions on a working solution to my problem. dplyr is what I have been using for quite some tasks, but I am open to whatever works.
Thanks a lot, Konrad
Create a variable that holds only the day (dropping hours, minutes etc.). Then iterate over the unique dates and keep only those daily subsets whose heat30 column contains "heat" at least once:
library(dplyr)
Measurings <- Measurings %>% mutate(Time2 = format(Time, "%Y-%m-%d"))
newdf <- lapply(unique(Measurings$Time2), function(x){
  rr <- Measurings %>% filter(Time2 == x) # all rows for date x
  # keep the day only if heat30 contains "heat" at least once;
  # NULL entries are dropped by bind_rows()
  if(any(rr$heat30 == "heat")) rr else NULL
}) %>% bind_rows()
Below is one possible solution using the dataset provided in the question. Please note that this is not a great example as all days will probably include at least one observation marked as over 30 °C (i.e. there will be no days to filter out in this dataset but the code should do the job with the actual one).
# import packages
library(dplyr)
library(stringr)
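# break the time stamp into Day and Hour
# (format() makes the text representation explicit; as_data_frame() is deprecated)
time_df <- as.data.frame(str_split(format(Measurings$Time, "%Y-%m-%d %H:%M:%S"), " ", simplify = TRUE))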
# name the columns
names(time_df) <- c("Day", "Hour")
# create a new measurement data frame with separate Day and Hour columns
new_measurings_df <- bind_cols(time_df, Measurings[-1])
# form the new data frame by filtering the days marked as heat
new_df <- new_measurings_df %>%
filter(Day %in% new_measurings_df$Day[new_measurings_df$heat30 == "heat"])
To be more precise, you are creating a random sample of 1000 hourly observations with temperatures between 20 and 35, spanning about 42 days. As a result, it is very likely that every single day will have at least one observation marked as over 30 °C in your example. Additionally, it is always good practice to set a seed to ensure reproducibility.
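The day-level filter can also be written in base R with a seed set. This is only a sketch: the threshold is raised from 30 to 34 purely so that some days actually drop out of the random example, and the names `threshold`, `hot_days` and `heat_df` are made up for illustration.

```r
set.seed(42)  # fixed seed so the example is reproducible
start <- as.POSIXct("2018-05-18 00:00", tz = "CET")
Measurings <- data.frame(
  Time = seq(from = start, length.out = 1000, by = "hour"),
  Temp = sample(20:35, 1000, replace = TRUE)
)

threshold <- 34  # assumption: raised from 30 so that some days are excluded
Measurings$Day <- format(Measurings$Time, "%Y-%m-%d")

# Days on which at least one reading exceeds the threshold
hot_days <- unique(Measurings$Day[Measurings$Temp > threshold])

# Keep every row of those days, from the first to the last reading
heat_df <- Measurings[Measurings$Day %in% hot_days, ]

# Check: every retained day has a maximum above the threshold
daily_max <- tapply(heat_df$Temp, heat_df$Day, max)
all(daily_max > threshold)
```

With the real data, setting `threshold <- 30` gives the selection described in the question.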
I need to calculate the number of days elapsed between multiple dates in two ways and then output those results to new columns: i) number of days that has elapsed as compared to the first date (e.g., RESULTS$FIRST) and ii) between sequential dates (e.g., RESULTS$BETWEEN). Here is an example with the desired results. Thanks in advance.
library(lubridate)
DATA = data.frame(DATE = mdy(c("7/8/2013", "8/1/2013", "8/30/2013", "10/23/2013",
"12/16/2013", "12/16/2015")))
RESULTS = data.frame(DATE = mdy(c("7/8/2013", "8/1/2013", "8/30/2013", "10/23/2013",
"12/16/2013", "12/16/2015")),
FIRST = c(0, 24, 53, 107, 161, 891), BETWEEN = c(0, 24, 29, 54, 54, 730))
#Using dplyr package
library(dplyr)
DATA %>% # your dataframe
  mutate(BETWEEN0 = as.numeric(difftime(DATE, lag(DATE, 1), units = "days")),
         BETWEEN = ifelse(is.na(BETWEEN0), 0, BETWEEN0), # first row has no predecessor
         FIRST = cumsum(BETWEEN)) %>%
  select(-BETWEEN0)
DATE BETWEEN FIRST
1 2013-07-08 0 0
2 2013-08-01 24 24
3 2013-08-30 29 53
4 2013-10-23 54 107
5 2013-12-16 54 161
6 2015-12-16 730 891
This will get you what you want:
d <- as.Date(DATA$DATE, format="%m/%d/%Y")
first <- c()
for (i in seq_along(d))
first[i] <- d[i] - d[1]
between <- c(0, diff(d))
This uses the as.Date() function in the base package to cast the vector of string dates to date values using the given format. Since you have dates as month/day/year, you specify format="%m/%d/%Y" to make sure it's interpreted correctly.
diff() is the lagged difference. Since it's lagged, it doesn't include the difference between element 1 and itself, so you can concatenate a 0.
Differences between Date objects are given in days by default.
Then constructing the output dataframe is simple:
RESULTS <- data.frame(DATE=DATA$DATE, FIRST=first, BETWEEN=between)
For the first part:
DATA = data.frame((c("7/8/2013", "8/1/2013", "8/30/2013", "10/23/2013","12/16/2013", "12/16/2015")))
names(DATA)[1] = "V1"
date = as.Date(DATA$V1, format="%m/%d/%Y")
print(date-date[1])
Result:
[1] 0 24 53 107 161 891
For second part - simply use a for loop
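A loop is not strictly needed for the second part either. As a minimal base R sketch using the dates from the question, both columns can be computed with vectorised Date arithmetic:

```r
d <- as.Date(c("7/8/2013", "8/1/2013", "8/30/2013", "10/23/2013",
               "12/16/2013", "12/16/2015"), format = "%m/%d/%Y")

# Days elapsed since the first date
FIRST <- as.numeric(d - d[1])

# Days between consecutive dates; the first date has no predecessor, so prepend 0
BETWEEN <- c(0, as.numeric(diff(d)))

data.frame(DATE = d, FIRST = FIRST, BETWEEN = BETWEEN)
```

The running total of BETWEEN equals FIRST, which gives a quick consistency check.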
You can just add each column with simple difftime and lagged diff calculations.
# days since the first date
DATA$FIRST <- with(DATA, as.numeric(difftime(DATE, DATE[1], units = "days")))
# days since the previous date; diff() must run over the whole vector,
# and 0 is prepended for the first row
DATA$BETWEEN <- with(DATA, c(0, as.numeric(diff(DATE))))
identical(DATA, RESULTS)
[1] TRUE
I have data measured over a 7 day period. Part of the data looks as follows:
start wk end wk X1
2/1/2004 2/7/2004 89
2/8/2004 2/14/2004 65
2/15/2004 2/21/2004 64
2/22/2004 2/28/2004 95
2/29/2004 3/6/2004 79
3/7/2004 3/13/2004 79
I want to convert this weekly (7 day) data into monthly data using weighted averages of X1. Notice that some of the 7 day X1 data will overlap from one month to the other (X1=79 for the period 2/29 to 3/6 of 2004).
Specifically I would obtain the February 2004 monthly data (say, Y1) the following way
(7*89 + 7*65 + 7*64 + 7*95 + 1*79)/29 = 78.27
Does R have a function that will properly do this? (to.monthly in the xts library DOES NOT do what I need.) If not, what is the best way to do this in R?
Convert the data to daily data and then aggregate:
Lines <- "start end X1
2/1/2004 2/7/2004 89
2/8/2004 2/14/2004 65
2/15/2004 2/21/2004 64
2/22/2004 2/28/2004 95
2/29/2004 3/6/2004 79
3/7/2004 3/13/2004 79
"
library(zoo)
# read data into data frame DF
DF <- read.table(text = Lines, header = TRUE)
# convert date columns to "Date" class
fmt <- "%m/%d/%Y"
DF <- transform(DF, start = as.Date(start, fmt), end = as.Date(end, fmt))
# convert to daily zoo series
to.day <- function(i) with(DF, zoo(X1[i], seq(start[i], end[i], "day")))
z.day <- do.call(c, lapply(1:nrow(DF), to.day))
# aggregate by month
aggregate(z.day, as.yearmon, mean)
The last line gives:
Feb 2004 Mar 2004
78.27586 79.00000
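The same expand-to-daily idea can be sketched without zoo. This is a base R illustration only; the names `daily` and `monthly` are made up, and the dates are entered in ISO format for brevity.

```r
DF <- data.frame(
  start = as.Date(c("2004-02-01", "2004-02-08", "2004-02-15",
                    "2004-02-22", "2004-02-29", "2004-03-07")),
  end   = as.Date(c("2004-02-07", "2004-02-14", "2004-02-21",
                    "2004-02-28", "2004-03-06", "2004-03-13")),
  X1    = c(89, 65, 64, 95, 79, 79)
)

# Repeat each weekly value once for every day the week covers
daily <- do.call(rbind, lapply(seq_len(nrow(DF)), function(i) {
  data.frame(day = seq(DF$start[i], DF$end[i], by = "day"), X1 = DF$X1[i])
}))

# The mean of the daily values per month is the day-weighted average
monthly <- tapply(daily$X1, format(daily$day, "%Y-%m"), mean)
monthly
```

For February this reproduces (7*89 + 7*65 + 7*64 + 7*95 + 1*79)/29 from the question, because the week starting 2/29 contributes only one day to that month.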
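If you are willing to drop the "end week" column from your DF, apply.monthly is a compact alternative. Note that it assigns each week entirely to the month of its start date and returns monthly sums, so a week that straddles two months (such as 2/29 to 3/6) is not split and the result is not the day-weighted average asked for.
library(xts)
DF.xts <- xts(DF$X1, order.by=DF$start)
DF.xts.monthly <- apply.monthly(DF.xts, sum)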
Then you can always recreate end dates if you absolutely need them by adding 30.