Calculate average of month and replace values of other column - r

I have a dataframe as given below:
vdate=c("12-04-2015","13-04-2015","14-04-2015","15-04-2015","12-05-2015","13-05-2015","14-05-2015"
,"15-05-2015","12-06-2015","13-06-2015","14-06-2015","15-06-2015")
month=c(4,4,4,4,5,5,5,5,6,6,6,6)
col1=c(12,12.4,14.3,3,5.3,1.8,7.6,4.5,7.6,10.7,12,15.7)
df=data.frame(vdate,month,col1)
Below is a column that contains values based on some calculation:
pvar=c(8.4,2.4,12,14.4,2.3,3.5,7.8,5,16,5.4,18,18.4)
Now I want to replace each pvar value that is less than the average pvar for its month with that average.
For example, for month 4 the average value of pvar is 9.3 ((8.4+2.4+12+14.4)/4).
The values below that average (8.4 and 2.4) are replaced, so pvar for month 4 becomes 9.3, 9.3, 12, 14.4.
I need to do this for every month in pvar.

A base R solution would be to use ave. Note that we first need to convert the date column to an actual date in order to extract the month (strsplit or a regex can also do it, but I prefer to have it stored as a proper date), i.e.
df$vdate <- as.POSIXct(df$vdate, format = '%d-%m-%Y')
with(df, ave(pvar, format(vdate, '%m'), FUN = function(i) replace(i, i < mean(i), mean(i))))
#[1] 9.30 9.30 12.00 14.40 4.65 4.65 7.80 5.00 16.00 14.45 18.00 18.40
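The regex route mentioned above can be sketched as follows. This is only an illustration: the names mon and res are made up here, and the pattern assumes every date is strictly dd-mm-yyyy.

```r
vdate <- c("12-04-2015", "13-04-2015", "14-04-2015", "15-04-2015",
           "12-05-2015", "13-05-2015", "14-05-2015", "15-05-2015",
           "12-06-2015", "13-06-2015", "14-06-2015", "15-06-2015")
pvar <- c(8.4, 2.4, 12, 14.4, 2.3, 3.5, 7.8, 5, 16, 5.4, 18, 18.4)

# Pull out the middle field (the month) without converting to a date
mon <- sub("^[0-9]{2}-([0-9]{2})-[0-9]{4}$", "\\1", vdate)

# Within each month, raise values below the monthly mean up to that mean
res <- ave(pvar, mon, FUN = function(x) replace(x, x < mean(x), mean(x)))
res
```

Keeping the month as a character string is enough for grouping; a proper date is only needed if you also want date arithmetic later.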
As per your edit, I will use dplyr to tackle it as it might be more readable. There are actually two ways I came up with.
First: create an extra grouping variable that keeps the months whose values you need to alter (4 and 5 here) together, while every other row ends up in a group of its own, so its mean equals its value and nothing is replaced, i.e.
library(dplyr)
cbind(df, pvar) %>%
group_by(grp = cumsum(!month %in% c(4, 5))+1, month) %>%
mutate(pvar = replace(pvar, pvar < mean(pvar), mean(pvar))) %>%
ungroup() %>%
select(-grp)
Second: filter the months you need and do the calculation. Then filter the months you don't need, leaving pvar unchanged, and bind the two sets of rows, i.e.
bind_rows(
cbind(df, pvar) %>%
filter(month %in% c(4, 5)) %>%
group_by(month) %>%
mutate(pvar = replace(pvar, pvar < mean(pvar), mean(pvar))),
cbind(df, pvar) %>%
filter(!month %in% c(4, 5))
)
Both the above give,
vdate month col1 pvar
<fct> <dbl> <dbl> <dbl>
1 12-04-2015 4. 12.0 12.0
2 13-04-2015 4. 12.4 12.4
3 14-04-2015 4. 14.3 14.3
4 15-04-2015 4. 3.00 10.4
5 12-05-2015 5. 5.30 5.30
6 13-05-2015 5. 1.80 4.80
7 14-05-2015 5. 7.60 7.60
8 15-05-2015 5. 4.50 4.80
9 12-06-2015 6. 7.60 7.60
10 13-06-2015 6. 10.7 10.7
11 14-06-2015 6. 12.0 12.0
12 15-06-2015 6. 15.7 15.7
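For comparison, here is a base R sketch of the same idea (changing only selected months), applied to the pvar vector from the question. The names target, grp_mean and res are illustrative and not part of the answer above.

```r
month <- c(4, 4, 4, 4, 5, 5, 5, 5, 6, 6, 6, 6)
pvar  <- c(8.4, 2.4, 12, 14.4, 2.3, 3.5, 7.8, 5, 16, 5.4, 18, 18.4)

# Months whose values may be replaced (assumed to be 4 and 5, as above)
target <- c(4, 5)

# ave() defaults to FUN = mean, so this is the monthly mean repeated per row
grp_mean <- ave(pvar, month)

# Replace only where the month is targeted and the value is below its mean
res <- ifelse(month %in% target & pvar < grp_mean, grp_mean, pvar)
res
```

Because the month test is part of the condition, rows for month 6 pass through untouched without any extra grouping variable.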

A dplyr-based solution could be:
library(dplyr)
# An additional condition checks that month != 6
cbind(df, pvar) %>%
group_by(month) %>%
mutate(pvar = ifelse(pvar < mean(pvar) & month != 6, mean(pvar), pvar)) %>%
as.data.frame()
# vdate month col1 pvar
# 1 12-04-2015 4 12.0 9.30
# 2 13-04-2015 4 12.4 9.30
# 3 14-04-2015 4 14.3 12.00
# 4 15-04-2015 4 3.0 14.40
# 5 12-05-2015 5 5.3 4.65
# 6 13-05-2015 5 1.8 4.65
# 7 14-05-2015 5 7.6 7.80
# 8 15-05-2015 5 4.5 5.00
# 9 12-06-2015 6 7.6 16.00
# 10 13-06-2015 6 10.7 5.40
# 11 14-06-2015 6 12.0 18.00
# 12 15-06-2015 6 15.7 18.40
Data
vdate=c("12-04-2015","13-04-2015","14-04-2015","15-04-2015","12-05-2015",
"13-05-2015","14-05-2015","15-05-2015","12-06-2015","13-06-2015",
"14-06-2015","15-06-2015")
month=c(4,4,4,4,5,5,5,5,6,6,6,6)
col1=c(12,12.4,14.3,3,5.3,1.8,7.6,4.5,7.6,10.7,12,15.7)
df=data.frame(vdate,month,col1)
pvar=c(8.4,2.4,12,14.4,2.3,3.5,7.8,5,16,5.4,18,18.4)

Related

Calculate 7 day average in r

Date        rate  7_Day_rate_avg
1967-07-01  12.5  N/a
1967-07-02  12.5  N/a
1967-07-03  6     N/a
1967-07-04  8     N/a
1967-07-05  4     N/a
1967-07-06  2     N/a
1967-07-07  11.5  avg
1967-07-08  12.1  avg
1967-07-09  10    avg
1967-07-10  12.0  avg
1967-07-11  11.1  avg
1967-07-12  10    avg
I'm trying to calculate a 7-day average of the "rate" column in R using a rolling mean function, but I am getting a lot of errors and am not sure where to start. I want the final output to look like the 7_Day_rate_avg column.
library(zoo)
rollmean(rate, date, 7)
Assuming data frame dat shown reproducibly in the Note at the end, there are several problems:
the library statement must be separated from the next statement by a newline or semicolon
rate is a column of a data frame, not an R variable
the arguments to rollmean are wrong
the alignment was not specified. Since you need right alignment use rollmeanr with an r on the end.
fill=NA is needed to specify that you want to fill the first 6 values with NA values.
1) Putting all this together we have:
library(zoo)
transform(dat, avg7 = rollmeanr(rate, 7, fill = NA))
giving this data frame:
date rate avg7
1 1967-07-01 12.5 NA
2 1967-07-02 12.5 NA
3 1967-07-03 6.0 NA
4 1967-07-04 8.0 NA
5 1967-07-05 4.0 NA
6 1967-07-06 2.0 NA
7 1967-07-07 11.5 8.071429
8 1967-07-08 12.1 8.014286
9 1967-07-09 10.0 7.657143
10 1967-07-10 12.0 8.514286
11 1967-07-11 11.1 8.957143
12 1967-07-12 10.0 9.814286
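To see what a right-aligned 7-day mean computes, here is a sketch that uses only base R's stats::filter instead of zoo. It is a swapped-in technique for illustration, not the answer's method; avg7 is reused merely as a variable name.

```r
rate <- c(12.5, 12.5, 6, 8, 4, 2, 11.5, 12.1, 10, 12, 11.1, 10)

# sides = 1 averages the current value and the 6 before it (right alignment);
# positions without a full 7-value window come back as NA
avg7 <- as.numeric(stats::filter(rate, rep(1 / 7, 7), sides = 1))
avg7
```

The first six entries are NA because a full window is not yet available, which is the same effect fill = NA has in rollmeanr.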
2) Alternately convert dat to a zoo object and then cbind it to the rolling mean. In this case we don't need fill= since zoo objects are automatically aligned.
library(zoo)
rate <- read.zoo(dat)
cbind(rate, avg7 = rollmeanr(rate, 7))
giving this zoo object:
rate avg7
1967-07-01 12.5 NA
1967-07-02 12.5 NA
1967-07-03 6.0 NA
1967-07-04 8.0 NA
1967-07-05 4.0 NA
1967-07-06 2.0 NA
1967-07-07 11.5 8.071429
1967-07-08 12.1 8.014286
1967-07-09 10.0 7.657143
1967-07-10 12.0 8.514286
1967-07-11 11.1 8.957143
1967-07-12 10.0 9.814286
Note
dat in reproducible form is:
dat <- structure(list(date = c("1967-07-01", "1967-07-02", "1967-07-03",
"1967-07-04", "1967-07-05", "1967-07-06", "1967-07-07", "1967-07-08",
"1967-07-09", "1967-07-10", "1967-07-11", "1967-07-12"), rate = c(12.5,
12.5, 6, 8, 4, 2, 11.5, 12.1, 10, 12, 11.1, 10)), row.names = c(NA,
-12L), class = "data.frame")

Group by weekly data and summarize by month in R with dplyr

I have a dataset of weekly mortgage rate data.
The data looks very simple:
library(tibble)
library(lubridate)
df <- tibble(
Date = as_date(c("2/7/2008 ", "2/14/2008", "2/21/2008", "2/28/2008", "3/6/2008"), format = "%m/%d/%Y"),
Rate = c(5.67, 5.72, 6.04, 6.24, 6.03)
)
I am trying to group it and summarize by month.
This blogpost and this answer are not what I want, because they just add the month column.
They give me the output:
month Date summary_variable
2008-02-01 2008-02-07 5.67
2008-02-01 2008-02-14 5.72
2008-02-01 2008-02-21 6.04
2008-02-01 2008-02-28 6.24
My desired output (ideally the last day of the month):
Month Average rate
2/28/2008 6
3/31/2008 6.1
4/30/2008 5.9
In the output above I put random numbers, not real calculations.
We can extract the month as a column and do a grouped mean. Passing 1 as the frac argument of as.Date() on a yearmon value returns the last day of that month.
library(dplyr)
library(lubridate)
library(zoo)
df1 %>%
group_by(Month = as.Date(as.yearmon(mdy(DATE)), 1)) %>%
summarise(Average_rate = mean(MORTGAGE30US))
-output
# A tibble: 151 x 2
# Month Average_rate
# <date> <dbl>
# 1 2008-02-29 5.92
# 2 2008-03-31 5.97
# 3 2008-04-30 5.92
# 4 2008-05-31 6.04
# 5 2008-06-30 6.32
# 6 2008-07-31 6.43
# 7 2008-08-31 6.48
# 8 2008-09-30 6.04
# 9 2008-10-31 6.2
#10 2008-11-30 6.09
# … with 141 more rows
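A base R sketch of the same month-end grouping, using the five sample rows from the question rather than the full dataset. The names month_start, month_end and res are illustrative.

```r
df <- data.frame(
  Date = as.Date(c("2/7/2008", "2/14/2008", "2/21/2008", "2/28/2008", "3/6/2008"),
                 format = "%m/%d/%Y"),
  Rate = c(5.67, 5.72, 6.04, 6.24, 6.03)
)

# First day of each date's month
month_start <- as.Date(format(df$Date, "%Y-%m-01"))

# Adding 31 days always lands in the next month; snap to its first day
# and step back one day to get the last day of the original month
month_end <- as.Date(format(month_start + 31, "%Y-%m-01")) - 1

# Mean rate per month, labelled by the month's last day
res <- aggregate(df["Rate"], by = list(Month = month_end), FUN = mean)
res
```

February 2008 is labelled 2008-02-29 because 2008 is a leap year, which matches the first row of the output above.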

Calculate daily mean of data frame in r

I have a data frame in R that contains readings every five minutes over a couple of months. I want to calculate the daily mean of Var3 (data frame below) and add it to this data frame as Var4.
Here is my df:
>df
timestamp Var1 Var2 Var3
1 2018-07-20 13:50:00 32.0358 28.1 3.6
2 2018-07-20 13:55:00 32.0358 28.0 2.5
3 2018-07-20 14:00:00 32.0358 28.1 2.2
I found this solution by searching the forum, but it raises an error.
Here is the solution I am applying:
aggregate(ts(df$var3[, 2], freq = 288), 1, mean)
This is the error I am getting:
Error in df$var3[, 2] : incorrect number of dimensions
I think this should work for my data frame too but not able to remove this error. Please help.
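Here's an approach with dplyr and lubridate. Note that day() returns only the day of the month, so for data spanning more than one month, group by as_date(ymd_hms(timestamp)) instead to keep each calendar day separate.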
library(dplyr)
library(lubridate)
df %>%
group_by(Day = day(ymd_hms(timestamp))) %>%
mutate(Var4 = mean(Var3))
## A tibble: 1,000 x 6
## Groups: Day [5]
# timestamp Var1 Var2 Var3 Day Var4
# <dttm> <dbl> <dbl> <dbl> <int> <dbl>
# 1 2018-07-20 13:55:30 32.2 22.9 2.35 20 2.99
# 2 2018-07-20 14:00:30 37.7 24.8 2.99 20 2.99
# 3 2018-07-20 14:05:30 38.7 29.6 3.47 20 2.99
# 4 2018-07-20 14:10:30 30.4 24.2 3.02 20 2.99
# 5 2018-07-20 14:15:30 32.0 28.4 2.95 20 2.99
## … with 995 more rows
Sample Data
df <- data.frame(timestamp = ymd_hms("2018-07-20 13:50:30") + 60*5 * 1:1000,
Var1 = runif(100,30,40),
Var2 = runif(100,20,30),
Var3 = runif(100,2,4))
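If you prefer to avoid extra packages, a base R sketch along the same lines groups by the full calendar date, so the same day number in different months is never mixed. The data here is simulated, and the UTC time zone is an assumption made to keep the example reproducible.

```r
set.seed(1)
df <- data.frame(
  timestamp = as.POSIXct("2018-07-20 13:50:00", tz = "UTC") + 300 * 0:999,
  Var3 = runif(1000, 2, 4)
)

# Calendar date of each reading as a "YYYY-MM-DD" string
df$Day <- format(df$timestamp, "%Y-%m-%d", tz = "UTC")

# ave() returns the group mean repeated for every row of that group
df$Var4 <- ave(df$Var3, df$Day)
head(df)
```

Every row keeps its original reading, and Var4 repeats that day's mean, which is the shape the question asks for.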

Trouble using object in dataframe after a pipe (decomposition of a msts object)

I do a time series decomposition and I want to save the resulting objects in a dataframe. It works if I store the results in an object and use it to make the dataframe afterwards:
# needed packages
library(tidyverse)
library(forecast)
# some "time series"
vec <- 1:1000 + rnorm(1000)
# store pipe results
pipe_out <-
# do decomposition
decompose(msts(vec, start= c(2001, 1, 1), seasonal.periods= c(7, 365.25))) %>%
# relevant data
.$seasonal
# make a dataframe with the stored seasonal data
data.frame(ts= pipe_out)
But doing the same as a one-liner fails:
decompose(msts(vec, start= c(2001, 1, 1), seasonal.periods= c(7, 365.25))) %>%
data.frame(ts= .$seasonal)
I get the error
Error in as.data.frame.default(x[[i]], optional = TRUE, stringsAsFactors = stringsAsFactors) :
cannot coerce class ‘"decomposed.ts"’ to a data.frame
I thought that the pipe simply moves forward the things that came up in the last step which saves us storing those things in objects. If so, shouldn't both codes result in the very same output?
EDIT (from comments)
The first code works but it is a bad solution because if one wants to extract all the vectors of the decomposed time series one would need to do it in multiple steps. Something like the following would be better:
decompose(msts(vec, start= c(2001, 1, 1),
seasonal.periods= c(7, 365.25))) %>%
data.frame(seasonal= .$seasonal, x=.$x, trend=.$trend, random=.$random)
It's unclear from your example whether you want to extract $x or $seasonal. Either way, you can extract part of a list either with the `[[`() function in base or the alias extract2() in magrittr, as you prefer. You should then use the . when you create a data.frame in the last step.
Cleaning up the code a bit to be consistent with the piping, the following works:
library(magrittr)
library(tidyverse)
library(forecast)
vec <- 1:1000 + rnorm(1000)
vec %>%
msts(start = c(2001, 1, 1), seasonal.periods= c(7, 365.25)) %>%
decompose %>%
`[[`("seasonal") %>%
# extract2("seasonal") %>% # Another option, uncomment if preferred
data.frame(ts = .) %>%
head # Just for the reprex, remove as required
#> ts
#> 1 -1.17332998
#> 2 0.07393265
#> 3 0.37631946
#> 4 0.30640395
#> 5 1.04279779
#> 6 0.20470768
Created on 2019-11-28 by the reprex package (v0.3.0)
Edit based on comment:
To do what you mention in the comments, you need to use curly brackets (see e.g. here for an explanation why). Hence, the following works:
library(magrittr)
library(tidyverse)
library(forecast)
vec <- 1:1000 + rnorm(1000)
vec %>%
msts(start= c(2001, 1, 1), seasonal.periods = c(7, 365.25)) %>%
decompose %>%
{data.frame(seasonal = .$seasonal,
trend = .$trend)} %>%
head
#> seasonal trend
#> 1 -0.4332034 NA
#> 2 -0.6185832 NA
#> 3 -0.5899566 NA
#> 4 0.7640938 NA
#> 5 -0.4374417 NA
#> 6 -0.8739449 NA
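The reason the braces matter: when the dot appears only inside a nested call such as .$seasonal, magrittr still passes the left-hand side as the first argument. The sketch below spells out both expansions by hand in base R, using a toy object whose class name toy_decomp is made up to stand in for decomposed.ts.

```r
# Toy stand-in for a decomposition result (hypothetical class, no
# as.data.frame method, just like decomposed.ts)
x <- structure(list(seasonal = c(0.1, -0.2, 0.1), trend = c(1, 2, 3)),
               class = "toy_decomp")

# Roughly what `x %>% data.frame(ts = .$seasonal)` evaluates:
# x itself is still supplied as the first argument, so coercion fails
res_bad <- tryCatch(data.frame(x, ts = x$seasonal),
                    error = function(e) conditionMessage(e))

# Roughly what `x %>% {data.frame(ts = .$seasonal)}` evaluates:
# only the extracted component reaches data.frame()
res_ok <- data.frame(ts = x$seasonal)

res_bad
res_ok
```

The first call fails for the same reason as the one-liner in the question, while the second receives only the extracted vector.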
However, for your specific use case, it may be clearer and easier to use magrittr::extract and then simply bind_cols:
vec %>%
msts(start= c(2001, 1, 1), seasonal.periods = c(7, 365.25)) %>%
decompose %>%
magrittr::extract(c("seasonal", "trend")) %>%
bind_cols %>%
head
#> # A tibble: 6 x 2
#> seasonal trend
#> <dbl> <dbl>
#> 1 -0.433 NA
#> 2 -0.619 NA
#> 3 -0.590 NA
#> 4 0.764 NA
#> 5 -0.437 NA
#> 6 -0.874 NA
Created on 2019-11-29 by the reprex package (v0.3.0)
With daily data, decompose() does not work well because it will only handle the annual seasonality and will give relatively poor estimates of it. If the data involve human behaviour, it will probably have both weekly and annual seasonal patterns.
Also, msts objects are not great for daily data either because they don't store the dates explicitly.
I suggest you use tsibble objects with an STL decomposition instead. Here is an example using your data.
library(tidyverse)
library(tsibble)
library(feasts)
mydata <- tsibble(
day = as.Date(seq(as.Date("2001-01-01"), length=1000, by=1)),
vec = 1:1000 + rnorm(1000)
)
#> Using `day` as index variable.
mydata
#> # A tsibble: 1,000 x 2 [1D]
#> day vec
#> <date> <dbl>
#> 1 2001-01-01 0.161
#> 2 2001-01-02 2.61
#> 3 2001-01-03 1.37
#> 4 2001-01-04 3.15
#> 5 2001-01-05 4.43
#> 6 2001-01-06 7.35
#> 7 2001-01-07 7.10
#> 8 2001-01-08 10.0
#> 9 2001-01-09 9.16
#> 10 2001-01-10 10.2
#> # … with 990 more rows
# Compute a decomposition
mydata %>% STL(vec)
#> # A dable: 1,000 x 7 [1D]
#> # STL Decomposition: vec = trend + season_year + season_week + remainder
#> day vec trend season_year season_week remainder season_adjust
#> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 2001-01-01 0.161 14.7 -14.6 0.295 -0.193 14.5
#> 2 2001-01-02 2.61 15.6 -14.2 0.0865 1.04 16.7
#> 3 2001-01-03 1.37 16.6 -15.5 0.0365 0.240 16.9
#> 4 2001-01-04 3.15 17.6 -13.0 -0.0680 -1.34 16.3
#> 5 2001-01-05 4.43 18.6 -13.4 -0.0361 -0.700 17.9
#> 6 2001-01-06 7.35 19.5 -12.4 -0.122 0.358 19.9
#> 7 2001-01-07 7.10 20.5 -13.4 -0.181 0.170 20.7
#> 8 2001-01-08 10.0 21.4 -12.7 0.282 1.10 22.5
#> 9 2001-01-09 9.16 22.2 -13.8 0.0773 0.642 22.9
#> 10 2001-01-10 10.2 22.9 -12.7 0.0323 -0.0492 22.9
#> # … with 990 more rows
Created on 2019-11-30 by the reprex package (v0.3.0)
The output is a dable (decomposition table) which behaves like a dataframe most of the time. So you can extract the trend column, or either of the seasonal component columns in the usual way.

Create several new derived variables from existing variables in data.frame

In R I have a data.frame that has several variables that have been measured monthly over several years. I would like to derive the monthly average (using all years) for each variable. Ideally these new variables would all be together in a new data.frame (carrying over the ID), below I am simply adding the new variable to the data.frame. The only way I know how to do this at the moment (below) seems quite laborious, and I was hoping there might be a smarter way to do this in R, that would not require typing out each month and variable as I did below.
# Example data.frame with only two years, two month, and two variables
# In the real data set there are always 12 months per year
# and there are at least four variables
df<- structure(list(ID = 1:4, ABC.M1Y2001 = c(10, 12.3, 45, 89), ABC.M2Y2001 = c(11.1,
34, 67.7, -15.6), ABC.M1Y2002 = c(-11.1, 9, 34, 56.5), ABC.M2Y2002 = c(12L,
13L, 11L, 21L), DEF.M1Y2001 = c(14L, 14L, 14L, 16L), DEF.M2Y2001 = c(15L,
15L, 15L, 12L), DEF.M1Y2002 = c(5, 12, 23.5, 34), DEF.M2Y2002 = c(6L,
34L, 61L, 56L)), .Names = c("ID", "ABC.M1Y2001", "ABC.M2Y2001","ABC.M1Y2002",
"ABC.M2Y2002", "DEF.M1Y2001", "DEF.M2Y2001", "DEF.M1Y2002",
"DEF.M2Y2002"), class = "data.frame", row.names = c(NA, -4L))
# list variable to average for ABC Month 1 across years
ABC.M1.names <- c("ABC.M1Y2001", "ABC.M1Y2002")
df <- transform(df, ABC.M1 = rowMeans(df[,ABC.M1.names], na.rm = TRUE))
# list variable to average for ABC Month 2 across years
ABC.M2.names <- c("ABC.M2Y2001", "ABC.M2Y2002")
df <- transform(df, ABC.M2 = rowMeans(df[,ABC.M2.names], na.rm = TRUE))
# and so forth for ABC
# ...
# list variables to average for DEF Month 1 across years
DEF.M1.names <- c("DEF.M1Y2001", "DEF.M1Y2002")
df <- transform(df, DEF.M1 = rowMeans(df[,DEF.M1.names], na.rm = TRUE))
# and so forth for DEF
# ...
Here's a solution using data.table development version v1.8.11 (which has melt and cast methods implemented for data.table):
require(data.table)
require(reshape2) # melt/cast builds on S3 generic from reshape2
dt <- data.table(df) # where df is your data.frame
dcast.data.table(melt(dt, id="ID")[, sum(value)/.N, list(ID,
gsub("Y.*$", "", variable))], ID ~ gsub)
ID ABC.M1 ABC.M2 DEF.M1 DEF.M2
1: 1 -0.55 11.55 9.50 10.5
2: 2 10.65 23.50 13.00 24.5
3: 3 39.50 39.35 18.75 38.0
4: 4 72.75 2.70 25.00 34.0
You can just cbind this to your original data.
Note that sum is a primitive whereas mean is an S3 generic. Therefore, using sum(.)/length(.) is better (with many groups, dispatching the right method with mean for every group could be quite time-consuming). .N is a special variable in data.table that directly gives you the length of the group.
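As a small check that the sum/length form gives the same numbers as mean, here is a base R sketch using the ABC.M1 values of IDs 1 and 2 from the question. It uses tapply rather than data.table, and the variable names are illustrative.

```r
# ABC.M1 readings for 2001 and 2002, IDs 1 and 2
value <- c(10, -11.1, 12.3, 9)
id    <- c(1, 1, 2, 2)

# Group mean written as sum / count (the analogue of sum(value)/.N)
by_sum <- tapply(value, id, function(v) sum(v) / length(v))

# Group mean via the mean() generic
by_mean <- tapply(value, id, mean)

by_sum
by_mean
```

Both give the same per-ID averages; the difference the answer points to is only the method-dispatch overhead when there are very many groups.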
Here is a solution using reshape2 that is more automated when you have lots of data and uses regular expressions to extract the variable name and the month. This solution will give you a nice summary table.
# Load required package
require(reshape2)
# Melt your wide data into long format
mdf <- melt(df , id = "ID" )
# Extract relevant variable names from the variable column
mdf$Month <- gsub( "^.*\\.(M[0-9]{1,2}).*$" , "\\1" , mdf$variable )
mdf$Var <- gsub( "^(.*)\\..*" , "\\1" , mdf$variable )
# Aggregate by month and variable
dcast( mdf , Var ~ Month , mean )
# Var M1 M2
#1 ABC 30.5875 19.275
#2 DEF 16.5625 26.750
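To check what the two regular expressions pull out, here is a small sketch on a few column names. The name ABC.M12Y2002 is a made-up example added to show a two-digit month.

```r
variable <- c("ABC.M1Y2001", "ABC.M12Y2002", "DEF.M2Y2001")

# Capture "M" plus one or two digits that follow the dot
month <- gsub("^.*\\.(M[0-9]{1,2}).*$", "\\1", variable)

# Capture everything before the dot
var <- gsub("^(.*)\\..*", "\\1", variable)

data.frame(variable, var, month)
```

Because the year part starts with Y, the month capture stops at the right place for both one-digit and two-digit months.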
Or to be compatible with the other solutions, and return the table by ID as well...
dcast( mdf , ID ~ Var + Month , mean )
# ID ABC_M1 ABC_M2 DEF_M1 DEF_M2
#1 1 -0.55 11.55 9.50 10.5
#2 2 10.65 23.50 13.00 24.5
#3 3 39.50 39.35 18.75 38.0
#4 4 72.75 2.70 25.00 34.0
This is pretty straightforward in base R.
mean.names <- split(names(df)[-1], gsub('Y[0-9]{4}$', '', names(df)[-1]))
means <- lapply(mean.names, function(x) rowMeans(df[, x], na.rm = TRUE))
data.frame(df, means)
This gives you your original data.frame with the following four columns at the end:
ABC.M1 ABC.M2 DEF.M1 DEF.M2
1 -0.55 11.55 9.50 10.5
2 10.65 23.50 13.00 24.5
3 39.50 39.35 18.75 38.0
4 72.75 2.70 25.00 34.0
You can use Reshape from the {splitstackshape} package and then use plyr, data.table or base R to compute the means.
library(splitstackshape) # Reshape
library(plyr) # ddply
kk<-Reshape(df,id.vars="ID",var.stubs=c("ABC.M1","ABC.M2","DEF.M1","DEF.M2"),sep="")
> kk
ID AE DB time ABC.M1 ABC.M2 DEF.M1 DEF.M2
1 1 NA NA 1 10.0 11.1 14.0 15
2 2 NA NA 1 12.3 34.0 14.0 15
3 3 NA NA 1 45.0 67.7 14.0 15
4 4 NA NA 1 89.0 -15.6 16.0 12
5 1 NA NA 2 -11.1 12.0 5.0 6
6 2 NA NA 2 9.0 13.0 12.0 34
7 3 NA NA 2 34.0 11.0 23.5 61
8 4 NA NA 2 56.5 21.0 34.0 56
ddply(kk[,c(1,5:8)],.(ID),colwise(mean))
ID ABC.M1 ABC.M2 DEF.M1 DEF.M2
1 1 -0.55 11.55 9.50 10.5
2 2 10.65 23.50 13.00 24.5
3 3 39.50 39.35 18.75 38.0
4 4 72.75 2.70 25.00 34.0
