Calculate mean of one column for 14 rows before certain row, as identified by date for each group (year) - r

I would like to calculate the mean of Mean.Temp.c. before a certain date, such as 1963-03-23, as shown in the date2 column in this example. This is when peak snowmelt runoff occurred in 1963 in my area. I want to know the mean temperature over the 10 days before this date (i.e., 1963-03-23). How can I do it? I have 50 years of data, and the peak snowmelt date is different in each year.
example data

You can try:
library(dplyr)
df %>%
  mutate(date2 = as.Date(as.character(date2)),
         ten_day_mean = mean(Mean.Temp.c[between(date2, as.Date("1963-03-14"), as.Date("1963-03-23"))]))
In this case the desired mean would populate the whole column.
Or with data.table:
library(data.table)
setDT(df)[between(as.Date(as.character(date2)), as.Date("1963-03-14"), as.Date("1963-03-23")), ten_day_mean := mean(Mean.Temp.c)]
In the latter case you'd get NA for those days that are not relevant for your date range.

Supposing date2 is a Date field and your data.frame is called x:
start_date <- as.Date("1963-03-23")-10
end_date <- as.Date("1963-03-23")
mean(x$Mean.Temp.c.[x$date2 >= start_date & x$date2 <= end_date])
Now, if you have multiple years of interest, you could wrap this code in a for loop (or sapply/lapply) taking the elements of a vector of dates.
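Since there are 50 years of data and a different peak date in each year, a grouped version of the same idea may help. This is only a sketch: it assumes each row carries its daily date in a column named date, that year's peak snowmelt date (repeated on every row of the year) in date2, and the temperature in Mean.Temp.c; adjust the names to your data:
library(dplyr)

df %>%
  mutate(date  = as.Date(as.character(date)),
         date2 = as.Date(as.character(date2)),
         year  = format(date2, "%Y")) %>%
  group_by(year) %>%
  # mean temperature over the 10 days before the peak date, peak day excluded
  mutate(ten_day_mean = mean(Mean.Temp.c[date >= date2 - 10 & date < date2])) %>%
  ungroup()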

Related

Filter date time POSIXct data

I am trying to filter a large dataset down to records that occur on the hour. The data looks like this:
I want to filter the Date_Time field to keep only the records that are on the hour, i.e. "yyyy-mm-dd XX:00:00", or within 10 minutes of the hour. So, for example, this dataset would reduce down to rows 1 and 5. Does anyone have a suggestion?
You can extract the minute value from the datetime and select the rows that are within 10 minutes of the hour.
result <- subset(df, as.integer(format(UTC_datetime, '%M')) <= 10)
Or with dplyr and lubridate -
library(dplyr)
library(lubridate)
result <- df %>% filter(minute(UTC_datetime) <= 10)
Using data.table
library(data.table)
setDT(df)[minute(UTC_datetime)<=10]
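For instance, with a small made-up data frame (illustrative only), the base R version keeps the rows that fall within the first ten minutes of an hour:
df <- data.frame(
  UTC_datetime = as.POSIXct(c("2021-05-01 10:00:00", "2021-05-01 10:25:00",
                              "2021-05-01 11:05:00", "2021-05-01 11:40:00"),
                            tz = "UTC"),
  value = 1:4
)
subset(df, as.integer(format(UTC_datetime, '%M')) <= 10)
#          UTC_datetime value
# 1 2021-05-01 10:00:00     1
# 3 2021-05-01 11:05:00     3
If "within 10 minutes of the hour" should also cover times just before the hour (e.g. 10:55), the condition can be widened to minute(UTC_datetime) <= 10 | minute(UTC_datetime) >= 50.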

How to subset data according to date in R?

Simple enough question. I have data of US Treasury bill rates, with two columns: 1) Date and 2) Rate. The data ranges back to 1960. I wish to subset the rates from 1990 onward, i.e. according to the date.
Code:
data = read.csv("3mt-bill.csv")
rates= ?
So, I just want a vector of the t-bill rates, but from 1990 onwards.
How should I write the condition?
We need to first convert 'Date' to Date class, extract the year with format, check whether it is greater than or equal to 1990, and subset 'Rate' based on that logical vector:
data$Rate[format(as.Date(data$Date), "%Y") >= 1990]
If the 'Date' column includes only the year part, it is easier:
data$Rate[data$Date >= 1990]
Just in case, if we need the tidyverse:
library(tidyverse)
library(lubridate)
data %>%
  filter(year(ymd(Date)) >= 1990) %>%
  select(Rate)
Or using data.table
library(data.table)
setDT(data)[year(as.IDate(Date)) >= 1990, Rate]
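Putting it together, a minimal sketch, assuming the Date column is in a format that as.Date() parses directly (such as "1990-01-02"); otherwise pass the appropriate format string:
data <- read.csv("3mt-bill.csv", stringsAsFactors = FALSE)
data$Date <- as.Date(data$Date)                          # convert to Date class
rates <- data$Rate[data$Date >= as.Date("1990-01-01")]   # rates from 1990 onward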

How to match dates in 2 data frames in R, then sum specific range of values up to that date?

I have two data frames: rainfall data collected daily and nitrate concentrations of water samples collected irregularly, approximately once a month. I would like to create a vector of values for each nitrate concentration that is the sum of the previous 5 days' rainfall. Basically, I need to match the nitrate date with the rain date, sum the previous 5 days' rainfall, then print the sum with the nitrate data.
I think I need to either make a function, a for loop, or use tapply to do this, but I don't know how. I'm not an expert at any of those, though I've used them in simple cases. I've searched for similar posts, but none get at this exactly. This one deals with summing by factor groups. This one deals with summing each possible pair of rows. This one deals with summing by aggregate.
Here are 2 example data frames:
# rainfall df
mm<- c(0,0,0,0,5, 0,0,2,0,0, 10,0,0,0,0)
date<- c(1:15)
rain <- data.frame(cbind(mm, date))
# b/c sums of rainfall depend on correct chronological order, make sure the data are in order by date.
rain[ do.call(order, list(rain$date)),]
# nitrate df
nconc <- c(15, 12, 14, 20, 8.5) # nitrate concentration
ndate<- c(6,8,11,13,14)
nitrate <- data.frame(cbind(nconc, ndate))
I would like to have a way of finding the matching rainfall date for each nitrate measurement, such as:
match(nitrate$date[i] %in% rain$date)
(Note: Will match work with as.Date dates?) And then sum the preceding 5 days' rainfall (not including the measurement date), such as:
sum(rain$mm[(j-6):(j-1)])
And prints the sum in a new column in nitrate
print(nitrate$mm_sum[i])
To make sure it's clear what result I'm looking for, here's how to do the calculation 'by hand'. The first nitrate concentration was collected on day 6, so the sum of rainfall on days 1-5 is 5mm.
Many thanks in advance.
You were more or less there!
nitrate$prev_five_rainfall = NA
for (i in 1:length(nitrate$ndate)) {
  day = nitrate$ndate[i]
  nitrate$prev_five_rainfall[i] = sum(rain$mm[(day-5):(day-1)])  # previous 5 days, excluding day itself
}
Step by step explanation:
Initialize empty result column:
nitrate$prev_five_rainfall = NA
For each line in the nitrate df: (i = 1,2,3,4,5)
for (i in 1:length(nitrate$ndate)) {
Grab the day we want the final result for:
day = nitrate$ndate[i]
Take the rainfall sum and put it in the results column:
nitrate$prev_five_rainfall[i] = sum(rain$mm[(day-5):(day-1)])
Close the for loop :)
}
Disclaimer: This answer is basic in that:
It will give wrong results or an error if an ndate is less than 6 (not enough preceding days)
It will be incorrect if some dates are missing in the rain dataframe
It will be slow on larger data
As you get more experience with R, you might use data manipulation packages like dplyr or data.table for these types of manipulations.
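For example, a rolling-sum version using the zoo package might look like this. This is only a sketch based on the example data above, where the dates are the consecutive integers 1-15 with no gaps; real Date columns would need to be complete and sorted the same way:
library(zoo)

# rolling 5-day sum ending at each day i, i.e. sum of mm over days (i-4):i
rain$prev5 <- rollsumr(rain$mm, k = 5, fill = NA)
# shift down one row so day i holds the sum over days (i-5):(i-1),
# i.e. the five days before (and excluding) day i
rain$prev5 <- c(NA, head(rain$prev5, -1))
# look up that value for each nitrate sampling day
nitrate$prev_five_rainfall <- rain$prev5[match(nitrate$ndate, rain$date)]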
#nelsonauner's answer does all the heavy lifting. But one thing to note: in my actual data the dates are not numerical like they are in the example above; they are dates listed as MM/DD/YYYY and converted with as.Date(nitrate$date, "%m/%d/%Y").
I found that the for loop above gave me all zeros for nitrate$prev_five_rainfall, and I suspected it was a problem with the dates.
So I converted the dates in both data sets to numbers: the number of days between a common start date and the recorded date, so that the for loop looks for a matching number of days in each data frame rather than a date. First, make a column of the start date using rep_len() and format it:
nitrate$startdate <- rep_len("01/01/1980", nrow(nitrate))
nitrate$startdate <- as.Date(nitrate$startdate, "%m/%d/%Y")
Then, calculate the difference using difftime():
nitrate$diffdays <- as.numeric(difftime(nitrate$date, nitrate$startdate, units="days"))
Do the same for the rain data frame. Finally, the for loop looks like this:
nitrate$prev_five_rainfall = NA
for (i in 1:length(nitrate$diffdays)) {
  day = nitrate$diffdays[i]
  nitrate$prev_five_rainfall[i] = sum(rain$mm[(day-5):(day-1)]) # 5 days
}

Group dates by time doing the mean in the rest of the columns

Hi, and thanks in advance.
I need to group the rows of this data set by date; I've imported it with read.table. An added problem is that all the variables come in as factors:
Date; Time; Global_active_power; Global_reactive_power; Voltage
16/12/2006; 00:00:00; 4.216; 0.418; 234.840
16/12/2006; 00:01:00; 5.360; 0.436; 233.630
16/12/2006; 00:02:00; 5.360; 0.436; 233.630
.....
17/12/2006; 00:00:00; 1.044; 0.152; 242.730
Rather than just grouping by date, I need to calculate the mean of every column, summarizing all the records of a day in just one row, like this:
Date; Time; Global_active_power; Global_reactive_power; Voltage
16/12/2006; - MEAN OF ALL MEASURES OF THE DAY
After doing that I'd delete the Time column, since I just need the mean of the measures for each day over a period of time.
Thanks again!
You can do this using the dplyr package assuming that your data is in a data frame df:
library(dplyr)
result <- df %>% group_by(Date) %>%     ## 1.
  select(-Time) %>%                     ## 2.
  mutate_each(funs(as.numeric)) %>%     ## 3.
  summarise_each(funs(mean))            ## 4.
In fact, the commands reflect what you want to accomplish.
Notes:
First group_by the Date column so that the subsequent mean is computed with respect to values over all times for the date.
Then select all other columns except for the Time column using select(-Time).
As you pointed out, the columns to be averaged need to be numeric instead of factors, so convert each one as necessary. This uses mutate_each to apply the as.numeric function to each selected column.
Finally, summarise_each of these selected columns applying the mean function to each column.
Using the data you provided:
print(result)
## # A tibble: 2 x 4
##         Date Global_active_power Global_reactive_power  Voltage
##        <chr>               <dbl>                 <dbl>    <dbl>
## 1 16/12/2006            4.978667                 0.430 234.0333
## 2 17/12/2006            1.044000                 0.152 242.7300
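As a side note, mutate_each() and summarise_each() have since been superseded in dplyr. A sketch of the same pipeline with across() follows; the as.character() step matters only if the columns really are factors, so that the values rather than the factor level codes get averaged:
library(dplyr)

result <- df %>%
  group_by(Date) %>%                                    # one output row per Date
  summarise(across(-Time, ~ mean(as.numeric(as.character(.x)))))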
Hope this helps.

R: How to lag xts column by one day of the set

Imagine an intra-day data set, e.g. hourly intervals. Thanks to Google and Joshua's valuable answers to other people, I managed to create new columns in the xts object carrying DAILY Open/High/Low/Close values. These are daily values applied to intra-day intervals, so all rows of the same day have the same value in a particular column. Since the HLC values are look-ahead biased, I want to move them to the next day. Let's focus on just one column, called Prev.Day.Close.
Actual status:
My Prev.Day.Close column carries the proper values for the current day. All "2010-01-01 ??:??" rows have the same value - the Close of the 2010-01-01 trading session. So at the moment it is not the PREVIOUS day, as the column name says.
What I need:
Lag the Prev.Day.Close column to the NEXT DAY OF THE SET.
I cannot lag it using lag() because that works on a row (not day) basis. And it must not be a fixed calendar-day shift like:
C <- ave(x$Close, .indexday(x), FUN = last)
index(C) <- index(C) + 86400
x$Prev.Day.Close <- C
Because this solution does not respect the real data in the set. For example, it adds new rows, because the original data set has holes on weekends and holidays. Moreover, two particular days may not have the same number of intervals (rows), so the shifted data will not line up.
Desired result:
All rows of the first day in the set have NA in Prev.Day.Close because there is no previous day to get data from.
All rows of the second day have the same value in Prev.Day.Close - the value I actually have in Prev.Day.Close for the previous day.
And so on for every following day.
If I understand correctly, here's one way to do it:
require(xts)
# sample data
dt <- .POSIXct(seq(1, 86400*4, 3600), tz="UTC")-1
x <- xts(seq_along(dt), dt)
# get the last value for each calendar day
daily.last <- apply.daily(x, last)
# merge the last value of the day with the original data set
y <- merge(x, daily.last)
# now lag the last value of the day and carry the NA forward
# y$daily.last <- na.locf(lag(y$daily.last))
y$daily.last <- lag(y$daily.last)
y$daily.last <- na.locf(y$daily.last)
Basically, you want to get the end-of-day values, merge them with the original data, then lag them by one row. That aligns each day's closing value with the first observation of the next day, and na.locf() carries it forward across the rest of that day.
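Applied to an intraday price series, the same pattern might look like the sketch below, where prices is a hypothetical xts object with a Close column (and xts/zoo loaded as above):
daily.close <- apply.daily(prices$Close, last)   # last Close of each calendar day
colnames(daily.close) <- "Prev.Day.Close"
prices <- merge(prices, daily.close)             # align on the intraday index
prices$Prev.Day.Close <- na.locf(lag(prices$Prev.Day.Close))  # now truly the previous day's close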
