R: Keeping only the first observation of each month in a dataset

I have the following kind of data frame, with thousands of columns and rows. The first column contains dates, and the remaining columns contain asset return indexes for that date.
DATE        Asset_1  Asset_2  Asset_3  Asset_4
2000-01-01  1000     300      2900     NA
...
2000-01-31  1100     350      2950     NA
2000-02-02  1200     330      2970     100
...
2000-02-28  1200     360      3000     200
2000-03-01  1200     370      3500     300
I want to make this into a monthly dataset by only keeping the first observation of the month.
I have come up with the following script:
library(dplyr)
library(lubridate)
monthly <- daily %>% filter(day(DATE) == 1)
However, this doesn't work for months where the first day of the month is not a trading date (i.e. it is missing from the daily dataset).
When I run the command, the months whose first day doesn't exist in the data are dropped from my dataset entirely.

If the data is always ordered by date, you could group by year/month, then keep (slice) the first record from each group. For example:
df<-data.frame(mydate=as.Date("2023-01-01")+1:45)
library(tidyverse)
library(lubridate)
df %>%
group_by(ym=paste(year(mydate), month(mydate))) %>%
#group_by(year(mydate), month(mydate)) %>%
slice_head(n=1)

Use slice_min, which picks the earliest date in each month even if the rows are not sorted:
library(dplyr) # version 1.1.0 or later
library(zoo)
daily %>%
mutate(ym = as.yearmon(DATE)) %>%
slice_min(DATE, by = ym)
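For comparison, the same idea can be sketched in base R without any packages. This is only an illustration on toy data modelled on the question's table (the Asset_1 values are copied from it; the real dataset is assumed to have a DATE column of class Date): sort by date, build a year-month key, and keep the first row seen for each key.

```r
# Toy daily data; the first trading day of February is the 2nd, as in the question
daily <- data.frame(
  DATE    = as.Date(c("2000-01-01", "2000-01-31", "2000-02-02",
                      "2000-02-28", "2000-03-01")),
  Asset_1 = c(1000, 1100, 1200, 1200, 1200)
)

# Sort by date so the first row seen for each month is the earliest one
daily <- daily[order(daily$DATE), ]

# Year-month key such as "2000-02"; duplicated() flags every repeat of a key
ym <- format(daily$DATE, "%Y-%m")
monthly <- daily[!duplicated(ym), ]
print(monthly)
```

Because the key is built from whatever dates are present, February is represented by 2000-02-02 rather than being dropped.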

Related

How to calculate aggregate statistics on a dataframe in R by applying conditions on time values?

I am working on climate data analysis. After loading file in R, my interest is to subset data based upon hours in a day.
For time analysis, we can use $hour on the variable in which the time vector is stored if we want to work with hours.
I want to subset my data for each hour of the day across 365 days and then average the data at a particular hour over the whole year. For example, I am interested in taking the values of irradiation, wind speed, etc. at 12:00 PM for a year and then taking the mean of these values.
I know how to subset a data frame based on conditions. For example, if my data is in a matrix called data that contains 2 columns, say time and wind speed, and I want to keep the rows in which irradiation isn't zero, I can do this using the following code:
my_data <- subset(data, data[,1]>0)
But how can I subset on the hour values in the time column, which is a variable stored in data?
My data look like this:
I hope I made sense in this question.
Thanks in advance!
Here is a possible solution. You can create an hourly grouping with format(df$time,'%H'), which extracts only the hour of each timestamp. We can then simply group by this new column and calculate the mean for each group.
df = data.frame(time=seq(Sys.time(),Sys.time()+2*60*60*24,by='hour'),val=sample(seq(5),49,replace=TRUE))
library(dplyr)
df %>% mutate(hour=format(time,'%H')) %>%
  group_by(hour) %>%
  summarize(mean_val = mean(val))
To subset the non-zero values first, you can do either:
df = subset(df,val!=0)
or start the dplyr chain with:
df %>% filter(val!=0)
Hope this helps!
df looks as follows:
time val
1 2018-01-31 12:43:33 4
2 2018-01-31 13:43:33 2
3 2018-01-31 14:43:33 2
4 2018-01-31 15:43:33 3
5 2018-01-31 16:43:33 3
6 2018-01-31 17:43:33 1
7 2018-01-31 18:43:33 2
8 2018-01-31 19:43:33 4
... ... ... ...
And the output:
# A tibble: 24 x 2
hour mean_val
<chr> <dbl>
1 00 3.50
2 01 3.50
3 02 4.00
4 03 2.50
5 04 3.00
6 05 2.00
.... ....
This assumes your time column is already of class POSIXct; otherwise you'd first have to convert it, for example with as.POSIXct(x,format='%Y-%m-%d %H:%M:%S').
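The hourly grouping can also be sketched in base R. The data below is hypothetical and built deterministically (two days of hourly readings, 1 on the first day and 3 on the second) so the result is easy to check by hand; the column names time and val follow the answer above.

```r
# Hypothetical readings: two days of hourly values, built deterministically
time <- seq(as.POSIXct("2018-01-01 00:00:00", tz = "UTC"),
            by = "hour", length.out = 48)
df <- data.frame(time = time, val = rep(c(1, 3), each = 24))

# Extract the hour of day as a two-character string ("00" to "23")
df$hour <- format(df$time, "%H")

# Mean of val for each hour of the day, taken across all days
hourly_mean <- aggregate(val ~ hour, data = df, FUN = mean)

# Mean at 12:00 only, as asked in the question
noon_mean <- mean(df$val[df$hour == "12"])
print(hourly_mean)
print(noon_mean)
```

Each hour appears once per day, so every hourly mean here is the average of 1 and 3.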

Rbind Difference of rows

I want to determine the difference of each row and have that total difference rbinded at the end. Below is a sample dataset:
DATE <- as.Date(c('2016-11-28','2016-11-29'))
TYPE <- c('A', 'B')
Revenue <- c(2000, 1000)
Sales <- c(1000, 4000)
Price <- c(5.123, 10.234)
Material <- c(10000, 7342)
df<-data.frame(DATE, TYPE, Revenue, Sales, Price, Material)
df
DATE TYPE Revenue Sales Price Material
1 2016-11-28 A 2000 1000 5.123 10000
2 2016-11-29 B 1000 4000 10.234 7342
How do I calculate the difference of each of the columns to produce this total row:
DATE TYPE Revenue Sales Price Material
1 2016-11-28 A 2000 1000 5.123 10000
2 2016-11-29 B 1000 4000 10.234 7342
3 DIFFERENCE -1000 3000 5.111 -2658
I can easily do it by column, but I am having trouble doing it by row.
Any help would be great thanks!
As 'DATE' is Date class, we may need to change it to character before proceeding with rbinding with string "DIFFERENCE". Other than that, subset the numeric columns of 'df', loop it with lapply, get the difference, concatenate with the 'DATE' and 'TYPE', and rbind with original dataset.
df$DATE <- as.character(df$DATE)
rbind(df, c(DATE = "DIFFERENCE", TYPE= NA, lapply(df[-(1:2)], diff)))
# DATE TYPE Revenue Sales Price Material
#1 2016-11-28 A 2000 1000 5.123 10000
#2 2016-11-29 B 1000 4000 10.234 7342
#3 DIFFERENCE <NA> -1000 3000 5.111 -2658

How to check if any row has negative values by leaving out selected rows?

Below is the dataframe I get by running a query. Please note that df1 is a dynamic dataframe and it might return either an empty df or partial df with not all quarters as seen below:
df1
FISC_QTR_VAL Revenue
1 2014-Q1 0.00
2 2014-Q2 299111.86
3 2014-Q3 174071.98
4 2014-Q4 257655.30
5 2015-Q1 0.00
6 2015-Q2 317118.63
7 2015-Q3 145461.88
8 2015-Q4 162972.41
9 2016-Q1 96896.04
10 2016-Q2 135058.78
11 2016-Q3 111773.77
12 2016-Q4 138479.28
13 2017-Q1 169276.04
I want to check the values of all the rows in the Revenue column and see if any value is 0 or negative, excluding the 2014-Q1 row.
Also, df1 is dynamic and will contain only 12 quarters of data, i.e. when I reach the next quarter, 2017-Q2, the Revenue associated with 2014-Q2 becomes 0 and it will look like this:
df1
FISC_QTR_VAL Revenue
1 2014-Q1 0.00
2 2014-Q2 0.00
3 2014-Q3 174071.98
4 2014-Q4 257655.30
5 2015-Q1 0.00
6 2015-Q2 317118.63
7 2015-Q3 145461.88
8 2015-Q4 162972.41
9 2016-Q1 96896.04
10 2016-Q2 135058.78
11 2016-Q3 111773.77
12 2016-Q4 138479.28
13 2017-Q1 169276.04
14 2017-Q2 146253.64
In the above case, I would need to check all rows of the Revenue column, excluding 2014-Q1 and 2014-Q2.
This continues as the quarters progress.
I need help writing code that dynamically excludes the relevant row(s) and checks only the rows that matter for a particular quarter.
Currently, I am using the below code:
#Taking the first df1 into consideration which has 2017-Q1 as the last quarter
startQtr <- "2014-Q2" #This value is dynamically achieved and will change as we move ahead. Next quarter, the value changes to 2014-Q3 and so on
if(length(df1[["FISC_QTR_VAL"]][nrow(df1)-11] == startQtr) == 1){
if(nrow(df1[df1$Revenue < 0,]) == 0 & nrow(df1[df1$Revenue == 0,]) == 0){
df1 <- df1 %>% slice((nrow(df1)-11):(nrow(df1)))
}
}
The first if statement checks whether there is data in df1.
If df1 is empty, df1[["FISC_QTR_VAL"]][nrow(df1)-11] == startQtr returns a zero-length vector, so its length is 0 and the condition fails.
If not, it goes on to the second if statement, which checks for negative and zero values in the Revenue column. But it does so for all rows. I want 2014-Q1 excluded in this case, and for future quarters I want the condition to be dynamic, as explained above.
Also, I do not want to slice the dataset before the if condition, because the code would throw an error if the initial data frame df1 has only 1 or 2 rows and we try to slice it further.
Thanks
Here's a solution using a few functions from the dplyr and tidyr packages.
Here's a toy data set to work with:
d <- data.frame(
FISC_QTR_VAL = c("2015-Q1", "2014-Q2", "2014-Q1", "2015-Q2"),
Revenue = c(100, 200, 0, 0)
)
d
#> FISC_QTR_VAL Revenue
#> 1 2015-Q1 100
#> 2 2014-Q2 200
#> 3 2014-Q1 0
#> 4 2015-Q2 0
Notice that FISC_QTR_VAL is intentionally out of order (as a precaution).
Next, set variables for the current year and quarter (you'll see why separate in a moment):
current_year <- 2014
current_quarter <- 2
Then run the following:
library(dplyr)
library(tidyr)
d %>%
  separate(FISC_QTR_VAL, c("year", "quarter"), sep = "-Q") %>%
  arrange(year, quarter) %>%
  slice(which(year == current_year & quarter == current_quarter):n()) %>%
  filter(Revenue <= 0)
#> year quarter Revenue
#> 1 2015 2 0
First, we separate() the FISC_QTR_VAL into separate year and quarter variables for (a) a tidy data set and (b) a way to order the data in case it's out of order (as in the toy used here). We then arrange() the data so that it's ordered by year and quarter. Then, we slice() away any quarters prior to the current one, and then filter() to return all rows where Revenue <= 0.
To alternatively get, for example, a count of the number of rows that are returned, you can pipe on something like nrow().
Is the subset function an option for you?
exclude.qr <- c("2014-Q1", "2014-Q2")
df <- data.frame(
FISC_QTR_VAL = c("2014-Q1", "2014-Q2", "2014-Q3", "2014-Q4"),
Revenue = c(0.00, 299111.86, 174071.98, 257655.30))
# subset()'s third argument is select, so both conditions go in one logical expression
subset(
  df,
  !(FISC_QTR_VAL %in% exclude.qr) & Revenue > 0)
You can easily create exclude.qr dynamically, e.g. via paste and years <- 2010:END.
I hope this is helpful!
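Putting the pieces of this thread together, here is a minimal base-R sketch of the dynamic check. It assumes the quarter labels always have the form "YYYY-Qn", as in the question; qtr_key is a hypothetical helper introduced only for this illustration, and the toy df1 reuses the first five rows of the question's data.

```r
# Toy version of df1 (first five rows of the question's data)
df1 <- data.frame(
  FISC_QTR_VAL = c("2014-Q1", "2014-Q2", "2014-Q3", "2014-Q4", "2015-Q1"),
  Revenue      = c(0.00, 299111.86, 174071.98, 257655.30, 0.00)
)
startQtr <- "2014-Q2"

# Hypothetical helper: turn "YYYY-Qn" into a sortable integer (year * 4 + quarter)
qtr_key <- function(q) {
  as.integer(substr(q, 1, 4)) * 4L + as.integer(substr(q, 7, 7))
}

# Keep only quarters at or after startQtr; a zero-row df1 simply yields zero rows
in_scope <- df1[qtr_key(df1$FISC_QTR_VAL) >= qtr_key(startQtr), ]

# TRUE if any in-scope quarter has zero or negative revenue
has_bad <- nrow(in_scope) > 0 && any(in_scope$Revenue <= 0)

# The offending rows themselves, for inspection
bad_rows <- in_scope[in_scope$Revenue <= 0, ]
print(has_bad)
print(bad_rows)
```

Filtering by a logical condition instead of slicing by position avoids the out-of-range problem the question mentions when df1 has only one or two rows.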

Subset data frame by ID but within 7 days

I have data frame with two variables ID and arrival. Here is head of my data frame:
head(sun_2)
Source: local data frame [6 x 2]
ID arrival
(chr) (dats)
1 027506905 01.01.15
2 042363988 01.01.15
3 026050529 01.01.15
4 028375072 01.01.15
5 055384859 01.01.15
6 026934233 01.01.15
How could I subset the data to the IDs that have arrived within 7 days?
As a lot of the other folks were saying, without more information (for example, what the original observation looks like) we can't get at exactly what your issue is without making some assumptions.
I assumed that you have a column of data that holds the original date, and that these columns are of class Date.
#generate Data
Data <- data.frame(ID = as.character(1394:2394),
arrival = sample(seq(as.Date('2015/01/01'), as.Date('2016/01/01'), by = 'day'), 1001, replace = TRUE))
# Make the "Original Observation" Variable
delta_times <- sample(c(3:10), 1001, replace = TRUE)
Data$First <- Data$arrival - delta_times
This gives me a data set that looks like this:
ID arrival First
1 1394 2015-11-06 2015-10-28
2 1395 2015-08-04 2015-07-26
3 1396 2015-04-19 2015-04-16
4 1397 2015-05-13 2015-05-03
5 1398 2015-07-18 2015-07-11
6 1399 2015-01-08 2015-01-03
If that is the case then the solution is to use difftime, like so:
# Now we need to make a subsetting variable
Data$diff_times <- difftime(Data$arrival, Data$First, units = "days")
Data$diff_times
within_7 <- subset(Data, diff_times <=7)
max(within_7$diff_times)
Time difference of 7 days
It's a bit difficult to be sure given the information you've provided, but I think you could do it like this:
library(dplyr)
dt %>% group_by(ID) %>% filter(arrival < min(arrival) + 7)
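The grouped filter above can be sketched in base R as well. This assumes the arrival column holds dd.mm.yy strings, as the question's printout suggests; the two IDs are taken from the question, but the repeated arrivals are made up for illustration.

```r
# Toy data in the question's layout; dates assumed to be dd.mm.yy strings
sun_2 <- data.frame(
  ID      = c("027506905", "027506905", "027506905", "042363988", "042363988"),
  arrival = c("01.01.15", "05.01.15", "20.01.15", "01.01.15", "08.01.15"),
  stringsAsFactors = FALSE
)

# Parse the strings into Date objects
sun_2$arrival <- as.Date(sun_2$arrival, format = "%d.%m.%y")

# Earliest arrival per ID, repeated on every row of that ID (as days since epoch)
first_arrival <- ave(as.numeric(sun_2$arrival), sun_2$ID, FUN = min)

# Keep arrivals fewer than 7 days after the ID's first arrival
within_7 <- sun_2[as.numeric(sun_2$arrival) - first_arrival < 7, ]
print(within_7)
```

Whether the boundary should be < 7 or <= 7 depends on how "within 7 days" is meant; the sketch follows the strict comparison used in the dplyr one-liner.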

dplyr: mean of a group count

I am trying to find the mean length of a variable over a dataframe using dplyr:
x <- data %>% group_by(Date, `% Bucket`) %>% summarise(count = n())
Date % Bucket count
(date) (fctr) (int)
1 2015-01-05 <=1 1566
2 2015-01-05 (1-25] 421
3 2015-01-05 (25-50] 461
4 2015-01-05 (50-75] 485
5 2015-01-05 (75-100] 662
6 2015-01-05 (100-150] 1693
7 2015-01-05 >150 12359
8 2015-01-13 <=1 1608
9 2015-01-13 (1-25] 441
10 2015-01-13 (25-50] 425
How to aggregate to find average across each % Bucket over the year with dplyr?
In base R:
x <- as.data.frame(x)
aggregate(count ~ `% Bucket`, data = x, FUN=mean)
% Bucket count
1 <=1 2609.5294
2 (1-25] 449.0000
3 (25-50] 528.7059
4 (50-75] 593.2157
5 (75-100] 763.0000
6 (100-150] 1758.6667
7 >150 12457.9216
Aggregate function will take the count found by dplyr across each bucket above and sum them, dividing by the number of rows that contain that % Bucket variable and give the answer above. How can I accomplish this with dplyr though? This is not about completing the problem but understanding how the dplyr package would be used in such a scenario.
Another example of this type of thing would be summarise the n() of each group_by variable and also listing the minimum length "count" of that variable across the 52 weeks.
I am struggling because dplyr seems to be built to find a mean of a value in a column, but here I am counting the number of row occurrences given a variable in a column and trying to find the mean, min, max, etc. of it.
We can use dplyr methods: group by the bucket and summarise the count column.
library(dplyr)
x %>%
group_by(`% Bucket`) %>%
summarise(count= mean(count))
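The same pattern extends to the minimum and maximum the question also asks about. The sketch below uses a hypothetical column name Bucket in place of `% Bucket` to keep the code short; the first three counts come from the question's printout and the rest are made up for illustration.

```r
library(dplyr)

# Weekly counts per bucket; Bucket is a stand-in name for `% Bucket`
x <- data.frame(
  Date   = as.Date(rep(c("2015-01-05", "2015-01-13", "2015-01-20"), each = 2)),
  Bucket = rep(c("<=1", ">150"), times = 3),
  count  = c(1566, 12359, 1608, 12400, 1500, 12500)
)

# One row per bucket, with several summaries of the per-date counts
stats <- x %>%
  group_by(Bucket) %>%
  summarise(
    mean_count = mean(count),
    min_count  = min(count),
    max_count  = max(count)
  )
print(stats)
```

The key point is that once the counts exist as an ordinary column, a second group_by() and summarise() treats them like any other value.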
