How to choose multiple rows every nth row in R

I am currently working on a daily time series data set which looks like this:
Date        streamflow
1985-10-01  24
1985-10-02  6
1985-10-03  12
1985-10-04  14
...         ...
2010-09-30  21
What I need to do is select the data from Oct 5 to Oct 24 of each year. I know that slice() with seq() can select every nth row, but I don't know how to make it select multiple rows. Any suggestions will be greatly appreciated. Thank you in advance!

Assuming your Date column is a valid Date class, use filter:
library(dplyr)
library(lubridate)
your_data %>%
  filter(
    month(Date) == 10 &
      day(Date) >= 5 &
      day(Date) <= 24
  )
If your data isn't Date class yet, throw in a mutate(Date = ymd(Date)) before the filter() step.
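For comparison, here is a base R sketch of the same selection that needs no extra packages. It assumes Date is already of class Date. The data frame flow and its values are made up for illustration and are not the asker's data.

```r
# Hypothetical daily series covering the same period as the question
flow <- data.frame(
  Date = seq(as.Date("1985-10-01"), as.Date("2010-09-30"), by = "day")
)
flow$streamflow <- seq_len(nrow(flow))  # placeholder values

# "%m%d" gives a zero-padded month-day string, so 5 Oct becomes 1005
md <- as.integer(format(flow$Date, "%m%d"))
oct_window <- flow[md >= 1005 & md <= 1024, ]

# Number of selected rows per year
rows_per_year <- table(format(oct_window$Date, "%Y"))
rows_per_year
```

Each year that contains an October should contribute 20 rows, which is a quick sanity check to run on real data as well.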

Related

Dplyr filter based on less than equal to condition in R

I am trying to subset a data based on <= logic using dplyr in R. Even after running filter function, the data is not being filtered.
How can I fix this?
Code
library(tidyverse)
value = c("a,b,c,d,e,f")
Year = c(2020,2020,2020,2020,2020,2020)
Month = c(01,01,12,12,07,07)
dummy_df = data.frame(value, Year, Month)
dummy_df = dplyr::filter(dummy_df, Month <=07)
This does work on a dummy data frame, but when I use the same function on an actual data set in which I created the Year, Month and Day columns using lubridate, I still see data from months greater than 07.
This may be because 'Month' is a character column in the OP's original dataset, in which case the comparison is done on strings rather than numbers. Convert it to numeric and it should work:
dummy_df = dplyr::filter(dummy_df, as.numeric(Month) <= 7)
Or in base R we could do:
subset(dummy_df, as.numeric(Month) <= 7)
value Year Month
1 a,b,c,d,e,f 2020 1
2 a,b,c,d,e,f 2020 1
5 a,b,c,d,e,f 2020 7
6 a,b,c,d,e,f 2020 7
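A small sketch of why a character Month column misbehaves: when one side of <= is character, R compares both sides as strings, so "12" sorts before "7". The vector month_chr is hypothetical and only mimics a Month column that ended up stored as text.

```r
month_chr <- c("01", "01", "12", "12", "07", "07")

# Compared as strings, i.e. against "7": every element passes
as_text <- month_chr <= 7

# Compared as numbers: only months 1 to 7 pass
as_number <- as.numeric(month_chr) <= 7

as_text
as_number
class(month_chr)  # quick way to check the column type before filtering
```

Checking class() or str() on the real column before filtering shows whether this is actually the cause.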

Filter date time POSIXct data

I am trying to filter a large dataset down to records that occur on the hour. The data looks like this:
I want to filter the Date_Time field to be only the records that are on the hour i.e. "yyyy-mm-dd XX:00:00" or within 10 min of the hour. So, for example, this dataset would reduce down to row 1 and 5. Does anyone have a suggestion?
You can extract the minute value from the datetime and select the rows that fall within 10 minutes after the hour.
result <- subset(df, as.integer(format(UTC_datetime, '%M')) <= 10)
Or with dplyr and lubridate -
library(dplyr)
library(lubridate)
result <- df %>% filter(minute(UTC_datetime) <= 10)
Using data.table
library(data.table)
setDT(df)[minute(UTC_datetime)<=10]
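The snippets above keep records up to 10 minutes after the hour. If "within 10 min of the hour" should also cover the 10 minutes before it, one possible variant is sketched below. The sample timestamps are invented, and UTC_datetime is assumed to be POSIXct as in the answer.

```r
# Invented sample data; the column name follows the answer above
df <- data.frame(
  UTC_datetime = as.POSIXct(
    c("2020-01-01 10:00:00", "2020-01-01 10:25:00", "2020-01-01 10:52:00",
      "2020-01-01 11:07:00", "2020-01-01 11:30:00"),
    tz = "UTC"
  )
)

# Minute of the hour as an integer between 0 and 59
mins <- as.integer(format(df$UTC_datetime, "%M"))

# Keep rows up to 10 minutes after or 10 minutes before a full hour
near_hour <- df[mins <= 10 | mins >= 50, , drop = FALSE]
near_hour
```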

Filter Data by Seasonal Ranges Over Several Years Based on Month and Day Column in R Studio

I am trying to filter a large dataset to contain results between a range of days and months over several years to evaluate seasonal objectives. My season is defined from 15 March through 15 September. I can't figure out how to filter the days so that they are only applied to March and September and not the other months within the range. My dataframe is very large and contains proprietary information, but I think the most important information is that the dates are described by columns: SampleDate (date formatted as %y%m%d), day (numeric), and month (numeric).
I have tried filtering using multiple conditions like so:
S1 <- S1 %>%
filter((S1$month >= 3 & S1$day >=15) , (S1$month<=9 & S1$day<=15 ))
I also attempted to set ranges using between for every year that I have data with no luck:
S1 %>% filter(between(SampleDate, as.Date("2010-03-15"), as.Date("2010-09-15") &
as.Date("2011-03-15"), as.Date("2011-09-15")&
as.Date("2012-03-15"), as.Date("2012-09-15")&
as.Date("2013-03-15"), as.Date("2013-09-15")&
as.Date("2014-03-15"), as.Date("2014-09-15")&
as.Date("2015-03-15"), as.Date("2015-09-15")&
as.Date("2016-03-15"), as.Date("2016-09-15")&
as.Date("2017-03-15"), as.Date("2017-09-15")&
as.Date("2018-03-15"), as.Date("2018-09-15")))
I am pretty new to R and can't find any solution online. I know there must be a somewhat simple way to do this! Any help is greatly appreciated!
Maybe something like this:
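library(data.table)
library(stringr)  # provides str_sub()
df <- setDT(df)
# `date` stands for your date column (SampleDate in the question)
# convert a date like this '2020-01-01' into this '01-01'
df[,`:=`(month_day = str_sub(date, 6, 10))]
# zero-padded 'mm-dd' strings sort in calendar order, so string comparison works
df[month_day >= '03-15' & month_day <= '09-15']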
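If the numeric month and day columns from the question are used directly, the day limits have to be tied to the two boundary months only. A base R sketch with made-up rows follows; S1 here is a toy stand-in for the real data.

```r
# Toy stand-in for the asker's data frame
S1 <- data.frame(
  SampleDate = as.Date(c("2010-03-14", "2010-03-15", "2010-06-30",
                         "2010-09-15", "2010-09-16", "2011-04-20"))
)
S1$month <- as.integer(format(S1$SampleDate, "%m"))
S1$day   <- as.integer(format(S1$SampleDate, "%d"))

in_season <- with(S1,
  (month > 3 & month < 9) |    # April to August: every day counts
  (month == 3 & day >= 15) |   # March: from the 15th onwards
  (month == 9 & day <= 15)     # September: up to the 15th
)
season <- S1[in_season, ]
season
```

Because the condition never mentions the year, it applies to every year in the data without listing them one by one.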

Calculations by Subgroup in a Column [duplicate]

I have a dataset that looks approximately like this:
> dataSet
month detrend
1 Jan 315.71
2 Jan 317.45
3 Jan 317.5
4 Jan 317.1
5 Jan 315.71
6 Feb 317.45
7 Feb 313.5
8 Feb 317.1
9 Feb 314.37
10 Feb 315.41
11 March 316.44
12 March 315.73
13 March 318.73
14 March 315.55
15 March 312.64
.
.
.
How do I compute the average by month? E.g., I want something like
> by_month
month ave_detrend
1 Jan 315.71
2 Feb 317.45
3 March 317.5
What you need to focus on is a means to group your column of interest (the "detrend") by the month. There are ways to do this within "vanilla R", but the most effective way is to use tidyverse's dplyr.
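I will use an example adapted from that page (the summary columns get new names so that sd() still sees the original disp values rather than the freshly computed mean):
library(dplyr)
mtcars %>%
  group_by(cyl) %>%
  summarise(mean_disp = mean(disp), sd_disp = sd(disp))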
In your case, that would be:
by_month <- dataSet %>%
group_by(month) %>%
summarize(avg = mean(detrend))
This new "tidyverse" style looks quite different, and you seem quite new, so I'll explain what's happening (sorry if this is overly obvious):
First, we are grabbing the dataframe, which I'm calling dataSet.
Then we are piping that dataset to our next function, which is group_by. Piping means that we're taking the result of the last command (which in this case is just the dataframe dataSet) and using it as the first argument of our next function. The function group_by takes a dataframe as its first argument.
Then the results of that group_by are piped to the next function, which is summarize (or summarise if you're from down under, as the author is). On its own, summarize calculates using all the data in the column. However, the group_by function creates partitions in that column, so we now have the mean calculated for each partition that we've made, which is month.
This is the key: group_by creates "flags" so that summarize calculates the function (mean, in this case) separately on each group. So, for instance, all of the Jan values are grouped together and then the mean is calculated only on them. Then for all of the Feb values, the mean is calculated, etc.
HTH!!
R has an inbuilt mean function: mean(x, trim = 0, na.rm = FALSE, ...)
I would do something like this:
# the data frame is called dataSet and January is coded as "Jan" in the question
january <- dataSet[dataSet[, "month"] == "Jan", ]
januaryVector <- january[, "detrend"]
januaryAVG <- mean(januaryVector)
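To get every month at once in base R instead of one month at a time, aggregate() is one option. The sketch below uses a shortened copy of the question's data, and the column name ave_detrend is taken from the desired output in the question.

```r
# Shortened copy of the question's data
dataSet <- data.frame(
  month   = c("Jan", "Jan", "Feb", "Feb", "March", "March"),
  detrend = c(315.71, 317.45, 317.45, 313.5, 316.44, 315.73)
)

# One row per month with the mean of detrend
by_month <- aggregate(detrend ~ month, data = dataSet, FUN = mean)
names(by_month)[2] <- "ave_detrend"
by_month
```

The rows come back sorted alphabetically by month. Turning month into a factor with levels in calendar order before aggregating would keep them in calendar order instead.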

Filter dataframe based on a date that may or may not be contained in the dataframe

I have a dataframe (df) like the following:
derv market date
-10.7803563 S&P 500 Index 2008-01-02
-15.6922552 S&P 500 Index 2008-01-03
-15.7648483 S&P 500 Index 2008-01-04
-10.2294744 S&P 500 Index 2008-01-07
-0.5918593 S&P 500 Index 2008-01-08
8.1518987 S&P 500 Index 2008-01-09
.....
84.1518987 S&P 500 Index 2014-12-31
and I want to find the 10 trading days in df before a specific day. For example, 2008-01-12.
I have thought of using dplyr like the following:
df %>% select(derv,Market,date) %>%
filter(date > 2008-01-12 - 10 & Date <2008-01-12)
but the issue I am having is about how to index the 10 trading days before the specific day. The code I have above is not working and I do not know how to deal with it in the case of using dplyr.
Another concern is that the specific day (e.g. 2008-01-12) may or may not be in df. If the specific day is in df, I think I only need to go back 9 days; but if it is not in df, I need to go back 10 rows. I am not sure whether I am correct here, and this is the part that confuses me.
Would greatly appreciate any insight.
Using dplyr and data.table::rleid()
Example data:
set.seed(123)
df=data.frame(derv=rnorm(18),Date=as.Date(c(1,2,3,4,6,7,9,11,12,13,14,15,18,19,20,21,23,24),origin="2008-01-01"))
A column with an index is created in order to select no more than 10 days before the chosen date.
library(dplyr)
library(data.table)
df %>%
filter(Date < "2008-01-19") %>%
mutate(id = rleid(Date)) %>%
filter(id > (max(id)-10)) %>%
ungroup() %>%
select(derv,Date)
derv Date
1 -1.0678237 2008-01-04
2 -0.2179749 2008-01-05
3 -1.0260044 2008-01-07
4 -0.7288912 2008-01-08
5 -0.6250393 2008-01-10
6 -1.6866933 2008-01-12
7 0.8377870 2008-01-13
8 0.1533731 2008-01-14
9 -1.1381369 2008-01-15
10 1.2538149 2008-01-16
EDIT: Procrastinatus Maximus' solution is shorter and only requires dplyr
df %>% filter(Date < "2008-01-19") %>% filter(row_number() > (max(row_number())-10))
This gives the same output.
So the answer to this question really depends on how your dates are stored in R. But let's assume ISO 8601, which is what it looks like based on your code.
So first let's make some data.
mydates <- as.Date("2007-06-22")
mydates<-c(mydates[1]+1:11, mydates[1]+14:19)
StockPrice<-c(1:17)
df<-data.frame(mydates,StockPrice)
Then specify the date of interest like @stats_guy
dateofinterest<-as.Date("2007-07-11")
I'd say use subset, and just subtract 11 from your date since it's already in that format.
foo<-subset(df, mydates<dateofinterest & mydates>(dateofinterest-11))
Then you'll have a nice span of 10 days, but I'm not sure if you want 10 trading days? Or just 10 consecutive days, even if that means your list of prices might be < 10. I intentionally made my dataset with breaks like real market data to illustrate that point. So I came up with 8 values over the 10 day period instead of 10. Interested to hear what you're actually looking for.
Say you were actually looking for 10 trading days. Just to play devil's advocate here, you could assume that there won't be more than 10 days of no trading. So we go 20 days back in time before your date of interest.
foo<-subset(df, mydates<dateofinterest & mydates>(dateofinterest-20))
Then we check the subset to see if there are more than 10 trading days within it using an if statement. If there are more than 10 rows, then you have too many days. We just trim the subset, foo, to the right length, starting from the bottom (the latest date) and counting up 9 entries from there. Now you have ten trading days in a nice tidy dataset.
if (nrow(foo)>10){
foo<-foo[(nrow(foo)-9):(nrow(foo)),]
}
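The trimming step can also be written with tail(), which simply returns all rows when fewer than 10 are available, so no if statement or fixed look-back window is needed. This is a sketch that rebuilds the df and dateofinterest from this answer so that it runs on its own.

```r
# Same example data as in the answer above
mydates <- as.Date("2007-06-22") + c(1:11, 14:19)
df <- data.frame(mydates, StockPrice = 1:17)
dateofinterest <- as.Date("2007-07-11")

# Keep rows strictly before the date, sort by date, take the last 10
before <- df[df$mydates < dateofinterest, ]
last10 <- tail(before[order(before$mydates), ], 10)
last10
```

Since the comparison is strict, it makes no difference whether the date of interest itself appears in the data.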

Resources