I have a dataset of weekly mortgage rate data.
The data looks very simple:
library(tibble)
library(lubridate)
df <- tibble(
  Date = as_date(c("2/7/2008", "2/14/2008", "2/21/2008", "2/28/2008", "3/6/2008"), format = "%m/%d/%Y"),
  Rate = c(5.67, 5.72, 6.04, 6.24, 6.03)
)
I am trying to group it and summarize by month.
This blogpost and this answer are not what I want, because they just add the month column.
They give me the output:
month Date summary_variable
2008-02-01 2008-02-07 5.67
2008-02-01 2008-02-14 5.72
2008-02-01 2008-02-21 6.04
2008-02-01 2008-02-28 6.24
My desired output (ideally the last day of the month):
Month Average rate
2/28/2008 6
3/31/2008 6.1
4/30/2008 5.9
In the output above I put random numbers, not real calculations.
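We can extract the month into its own column and then take the mean by group. as.yearmon() from zoo truncates each date to its month, and as.Date() with frac = 1 maps that month to its last day.
library(dplyr)
library(lubridate)
library(zoo)
# df1 is assumed to be the full dataset, with columns DATE and MORTGAGE30US
df1 %>%
  group_by(Month = as.Date(as.yearmon(mdy(DATE)), frac = 1)) %>%
  summarise(Average_rate = mean(MORTGAGE30US))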
-output
# A tibble: 151 x 2
# Month Average_rate
# <date> <dbl>
# 1 2008-02-29 5.92
# 2 2008-03-31 5.97
# 3 2008-04-30 5.92
# 4 2008-05-31 6.04
# 5 2008-06-30 6.32
# 6 2008-07-31 6.43
# 7 2008-08-31 6.48
# 8 2008-09-30 6.04
# 9 2008-10-31 6.2
#10 2008-11-30 6.09
# … with 141 more rows
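If you would rather stay within lubridate and work directly on the small df from the question (columns Date and Rate), a minimal sketch could look like the following. The month-end is computed as the first day of the month, plus one month, minus one day. The name monthly is only for illustration.

```r
library(dplyr)
library(lubridate)

# Data from the question, parsed with mdy()
df <- tibble(
  Date = mdy(c("2/7/2008", "2/14/2008", "2/21/2008", "2/28/2008", "3/6/2008")),
  Rate = c(5.67, 5.72, 6.04, 6.24, 6.03)
)

monthly <- df %>%
  # first day of the month + 1 month - 1 day = last day of the month
  group_by(Month = floor_date(Date, "month") + months(1) - days(1)) %>%
  summarise(Average_rate = mean(Rate))

print(monthly)
```

With only these five rows, February should be labelled 2008-02-29 (2008 is a leap year) and March 2008-03-31.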
I have a data frame in R that contains readings taken every five minutes over a couple of months. I want to calculate the daily mean of Var3 (data frame below) and add it to this data frame as Var4.
Here is my df:
>df
timestamp Var1 Var2 Var3
1 2018-07-20 13:50:00 32.0358 28.1 3.6
2 2018-07-20 13:55:00 32.0358 28.0 2.5
3 2018-07-20 14:00:00 32.0358 28.1 2.2
I found this solution by searching the forum, but it raises an error.
Here is the solution I am applying:
aggregate(ts(df$var3[, 2], freq = 288), 1, mean)
This is the error I am getting:
Error in df$var3[, 2] : incorrect number of dimensions
I think this should work for my data frame too, but I am not able to get rid of this error. Please help.
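Here's an approach with dplyr and lubridate. Note that day() returns only the day of the month, so if the data spans several months, group by as_date(ymd_hms(timestamp)) instead to keep, say, 20 July and 20 August apart.
library(dplyr)
library(lubridate)
df %>%
  group_by(Day = day(ymd_hms(timestamp))) %>%
  mutate(Var4 = mean(Var3))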
## A tibble: 1,000 x 6
## Groups: Day [5]
# timestamp Var1 Var2 Var3 Day Var4
# <dttm> <dbl> <dbl> <dbl> <int> <dbl>
# 1 2018-07-20 13:55:30 32.2 22.9 2.35 20 2.99
# 2 2018-07-20 14:00:30 37.7 24.8 2.99 20 2.99
# 3 2018-07-20 14:05:30 38.7 29.6 3.47 20 2.99
# 4 2018-07-20 14:10:30 30.4 24.2 3.02 20 2.99
# 5 2018-07-20 14:15:30 32.0 28.4 2.95 20 2.99
## … with 995 more rows
Sample Data
df <- data.frame(timestamp = ymd_hms("2018-07-20 13:50:30") + 60*5 * 1:1000,
                 Var1 = runif(100,30,40),
                 Var2 = runif(100,20,30),
                 Var3 = runif(100,2,4))
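Since the question mentions data covering a couple of months, here is a hedged variant that groups by calendar date rather than by day of the month. The readings data frame is made up for illustration (five-minute readings over roughly ten weeks, with only Var3 simulated), and the names readings and daily are hypothetical.

```r
library(dplyr)
library(lubridate)

set.seed(1)  # made-up data, only for illustration
readings <- tibble(
  timestamp = ymd_hms("2018-07-20 13:50:00") + 300 * (0:19999),  # every 5 minutes
  Var3 = runif(20000, 2, 4)
)

daily <- readings %>%
  # as_date() keeps year, month and day, so the same day number
  # in different months ends up in different groups
  group_by(Day = as_date(timestamp)) %>%
  mutate(Var4 = mean(Var3)) %>%
  ungroup()

head(daily)
```

Using mutate() rather than summarise() keeps every reading and repeats the daily mean on each row, which matches the request to add Var4 to the existing data frame.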
I have some climate data with temperature and humidity as well as a timestamp which is transformed to the time in %H:%M.
When using ggplot2 for visualization, the time gets sorted, which scrambles the order of the measurements: the first measurement was taken at 14:00 (2 pm) and the last one at 10:27 (10:27 am) the following day.
How do I prevent ggplot2 from sorting the x-values? (see plot)
MVE:
library(tidyverse)
df = read_csv('./climate_stats_incl_time.csv')
colnames(df)[1] <- c('sample')
head(df)
tail(df)
ggplot(data=df, mapping=aes(x=time)) +
geom_line(aes(y=temperature, color='red')) +
geom_line(aes(y=humidity, color='blue'))
> head(df)
# A tibble: 6 x 5
sample timestamp temperature humidity time
<dbl> <dbl> <dbl> <dbl> <drtn>
1 0 1581253210. 21.9 47.6 14:00
2 1 1581253275. 21.7 47.8 14:01
3 2 1581253336. 21.7 47.8 14:02
4 3 1581253397. 21.8 47.8 14:03
5 4 1581253457. 21.7 47.8 14:04
6 5 1581253520. 21.8 47.8 14:05
> tail(df)
# A tibble: 6 x 5
sample timestamp temperature humidity time
<dbl> <dbl> <dbl> <dbl> <drtn>
1 1203 1581326567. 19.1 49.8 10:22
2 1204 1581326628. 19.1 49.7 10:23
3 1205 1581326688. 19.1 49.9 10:24
4 1206 1581326749. 19.1 49.9 10:25
5 1207 1581326812. 19.1 49.7 10:26
6 1208 1581326873. 19.1 49.8 10:27
Format your timestamps to a proper date-time (assuming the origin is 1970):
df$date_time <- as.POSIXct(df$timestamp, origin="1970-01-01", tz = "GMT")
Then use this new date_time variable instead of time for plotting.
Edit:
I accidentally submitted a wrong solution at first (I had reformatted the date-time to a date). The solution above should now work for your problem, since it produces a proper date-time.
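To make the date-time approach concrete, here is a minimal sketch. The data frame is a made-up stand-in for the CSV from the question (epoch seconds one minute apart, running past midnight), and the axis label format is my own choice rather than anything the question asked for.

```r
library(ggplot2)

# Hypothetical stand-in for climate_stats_incl_time.csv
df <- data.frame(
  timestamp   = 1581253210 + 60 * (0:1199),
  temperature = seq(21.9, 19.1, length.out = 1200),
  humidity    = seq(47.6, 49.8, length.out = 1200)
)

# Epoch seconds -> POSIXct, which ggplot2 treats as a continuous time axis
df$date_time <- as.POSIXct(df$timestamp, origin = "1970-01-01", tz = "GMT")

p <- ggplot(df, aes(x = date_time)) +
  geom_line(aes(y = temperature, color = "temperature")) +
  geom_line(aes(y = humidity, color = "humidity")) +
  scale_x_datetime(date_labels = "%H:%M") +  # show only hours and minutes
  labs(x = "time", y = NULL, color = NULL)

print(p)
```

Because the x values carry the date as well as the time, measurements taken after midnight are placed after those from the previous evening, while the labels still show only %H:%M.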
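A workaround is to turn time into a factor ordered by the original row sequence. Because the x-axis is then discrete, the lines need group = 1 to connect the points.
df %>%
  mutate(orig_seq = row_number()) %>%
  ggplot(mapping = aes(x = reorder(time, orig_seq), group = 1)) +
  geom_line(aes(y = temperature, color = 'red')) +
  geom_line(aes(y = humidity, color = 'blue'))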
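The same idea can be sketched without reorder(), by building a factor whose levels follow the order of first appearance. The tiny data frame below is made up for illustration, and this only behaves as intended if no time label occurs twice (that is, the recording is shorter than 24 hours).

```r
library(ggplot2)

# Made-up measurements that run past midnight
df <- data.frame(
  time        = c("23:58", "23:59", "00:00", "00:01"),
  temperature = c(21.9, 21.7, 21.7, 21.8)
)

# Levels in order of first appearance, so the axis follows the row order
df$time_ordered <- factor(df$time, levels = unique(df$time))

p <- ggplot(df, aes(x = time_ordered, y = temperature, group = 1)) +
  geom_line()

print(p)
```

With many measurements a discrete axis gets crowded with labels, which is one reason a real date-time axis is usually the more comfortable option.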
With data like below
text = "
date,weekday,hour,a,b
12/2/2019,Mon,8,18.17183824,0.017741935
12/2/2019,Mon,9,18.11228506,0.020967742
12/9/2019,Mon,8,16.77932274,0.020322581
12/9/2019,Mon,9,16.97327971,0.019677419
12/3/2019,Tue,8,18.17183824,0.017741935
12/3/2019,Tue,10,18.11228506,0.020967742
12/10/2019,Tue,8,16.77932274,0.020322581
12/10/2019,Tue,10,16.97327971,0.019677419
"
df = read.table(textConnection(text), sep=",", header = T)
I need to find the change in the variables a and b from one weekday to the same weekday of the following week.
For example, for a, the change would be calculated as follows:
Change for hour 8 on Mondays = (16.77932274 - 18.17183824)/18.17183824
Change for hour 9 on Mondays = (16.97327971 - 18.11228506)/18.11228506
Change for hour 8 on Tuesdays = (16.77932274 - 18.17183824)/18.17183824
Change for hour 10 on Tuesdays = (16.97327971 - 18.11228506)/18.11228506
Average change for variable a in the dataset = average of the four changes above
I would appreciate any help.
For a single variable, I would convert from long to wide format and compute the change for each pair of same weekdays, labelling the values of a by week number. The difficulty is doing this for multiple variables, a and b here; my real data has more than these two.
We can group by weekday and hour, use lead() to get the next value within each group (lag() would give the previous one), and use mutate_at() to apply the calculation to multiple columns.
library(dplyr)
df %>%
  group_by(weekday, hour) %>%
  mutate_at(vars(a:b), list(change = ~(lead(.) - .)/.))
# date weekday hour a b a_change b_change
# <fct> <fct> <int> <dbl> <dbl> <dbl> <dbl>
#1 12/2/2019 Mon 8 18.2 0.0177 -0.0766 0.145
#2 12/2/2019 Mon 9 18.1 0.0210 -0.0629 -0.0615
#3 12/9/2019 Mon 8 16.8 0.0203 NA NA
#4 12/9/2019 Mon 9 17.0 0.0197 NA NA
#5 12/3/2019 Tue 8 18.2 0.0177 -0.0766 0.145
#6 12/3/2019 Tue 10 18.1 0.0210 -0.0629 -0.0615
#7 12/10/2019 Tue 8 16.8 0.0203 NA NA
#8 12/10/2019 Tue 10 17.0 0.0197 NA NA
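In dplyr 1.0.0 and later, mutate_at() is superseded by across(), so the same calculation can be sketched as below, together with the overall average change the question asks for. This assumes the rows are already in date order within each weekday/hour group, as they are in the sample data. The names changes and avg_change are only for illustration.

```r
library(dplyr)

df <- read.csv(text = "date,weekday,hour,a,b
12/2/2019,Mon,8,18.17183824,0.017741935
12/2/2019,Mon,9,18.11228506,0.020967742
12/9/2019,Mon,8,16.77932274,0.020322581
12/9/2019,Mon,9,16.97327971,0.019677419
12/3/2019,Tue,8,18.17183824,0.017741935
12/3/2019,Tue,10,18.11228506,0.020967742
12/10/2019,Tue,8,16.77932274,0.020322581
12/10/2019,Tue,10,16.97327971,0.019677419")

changes <- df %>%
  group_by(weekday, hour) %>%
  # relative change to the next observation in the same weekday/hour group
  mutate(across(a:b, ~ (lead(.x) - .x) / .x, .names = "{.col}_change")) %>%
  ungroup()

# Average change per variable; the last week has no successor, so drop its NAs
avg_change <- changes %>%
  summarise(across(ends_with("_change"), ~ mean(.x, na.rm = TRUE)))

print(avg_change)
```

Because the columns are selected with a:b, extra variables placed between a and b in the real data would be picked up without further changes.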
Here is an option with data.table
library(data.table)
setDT(df)[, c('a_change', 'b_change') :=
(shift(.SD, type = 'lead') - .SD)/.SD , .(weekday, hour), .SDcols = a:b]
I have a dataframe as given below:
vdate=c("12-04-2015","13-04-2015","14-04-2015","15-04-2015","12-05-2015","13-05-2015","14-05-2015"
,"15-05-2015","12-06-2015","13-06-2015","14-06-2015","15-06-2015")
month=c(4,4,4,4,5,5,5,5,6,6,6,6)
col1=c(12,12.4,14.3,3,5.3,1.8,7.6,4.5,7.6,10.7,12,15.7)
df=data.frame(vdate,month,col1)
Below is the column which contains value based on some calculation:
pvar=c(8.4,2.4,12,14.4,2.3,3.5,7.8,5,16,5.4,18,18.4)
Now I want to replace a pvar value if it is less than the average value for that particular month.
For example, for month 4, the average value of pvar is 9.3 ((8.4+2.4+12+14.4)/4).
All values of pvar in month 4 that are below this average (8.4 and 2.4) should then be replaced by it, so pvar for month 4 becomes 9.3, 9.3, 12, 14.4.
I need to do this for all the values in pvar.
A base R solution would be to use ave. Note that we first need to convert the date column to actual date in order to extract the month (strsplit or regex can also do it but I prefer to have it set as a proper date), i.e.
df$vdate <- as.POSIXct(df$vdate, format = '%d-%m-%Y')
with(df, ave(pvar, format(vdate, '%m'), FUN = function(i) replace(i, i < mean(i), mean(i))))
#[1] 9.30 9.30 12.00 14.40 4.65 4.65 7.80 5.00 16.00 14.45 18.00 18.40
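As a side note, replacing every value below the group mean with that mean is the same as taking the element-wise maximum of the value and the mean, so a slightly shorter sketch with pmax() is possible. It groups on the existing month vector instead of parsing vdate, which is an assumption that month is reliable. The name result is only for illustration.

```r
# Data from the question
month <- c(4, 4, 4, 4, 5, 5, 5, 5, 6, 6, 6, 6)
pvar  <- c(8.4, 2.4, 12, 14.4, 2.3, 3.5, 7.8, 5, 16, 5.4, 18, 18.4)

# Within each month, lift every value to at least the monthly mean
result <- ave(pvar, month, FUN = function(x) pmax(x, mean(x)))

print(result)
```

This should give the same vector as the ave() call above.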
As per your edit, I will use dplyr to tackle it as it might be more readable. There are actually two ways I came up with.
First: Create an extra grouping variable that will put all the months you need to alter the values in the same group and replace from there, i.e.
library(dplyr)
cbind(df, pvar) %>%
  group_by(grp = cumsum(!month %in% c(4, 5)) + 1, month) %>%
  mutate(pvar = replace(pvar, pvar < mean(pvar), mean(pvar))) %>%
  ungroup() %>%
  select(-grp)
Second: Filter the months you need, do the calculations. Then filter the months you don't need, create again the pvar but without changing anything (necessary for binding the rows) and bind the rows, i.e.
bind_rows(
  cbind(df, pvar) %>%
    filter(month %in% c(4, 5)) %>%
    group_by(month) %>%
    mutate(pvar = replace(pvar, pvar < mean(pvar), mean(pvar))),
  cbind(df, pvar) %>%
    filter(!month %in% c(4, 5))
)
Both the above give,
vdate month col1 pvar
<fct> <dbl> <dbl> <dbl>
1 12-04-2015 4. 12.0 9.30
2 13-04-2015 4. 12.4 9.30
3 14-04-2015 4. 14.3 12.0
4 15-04-2015 4. 3.00 14.4
5 12-05-2015 5. 5.30 4.65
6 13-05-2015 5. 1.80 4.65
7 14-05-2015 5. 7.60 7.80
8 15-05-2015 5. 4.50 5.00
9 12-06-2015 6. 7.60 16.0
10 13-06-2015 6. 10.7 5.40
11 14-06-2015 6. 12.0 18.0
12 15-06-2015 6. 15.7 18.4
A dplyr-based solution could be:
# An additional condition checks that month != 6
cbind(df, pvar) %>%
  group_by(month) %>%
  mutate(pvar = ifelse(pvar < mean(pvar) & month != 6, mean(pvar), pvar)) %>%
  as.data.frame()
# vdate month col1 pvar
# 1 12-04-2015 4 12.0 9.30
# 2 13-04-2015 4 12.4 9.30
# 3 14-04-2015 4 14.3 12.00
# 4 15-04-2015 4 3.0 14.40
# 5 12-05-2015 5 5.3 4.65
# 6 13-05-2015 5 1.8 4.65
# 7 14-05-2015 5 7.6 7.80
# 8 15-05-2015 5 4.5 5.00
# 9 12-06-2015 6 7.6 16.00
# 10 13-06-2015 6 10.7 5.40
# 11 14-06-2015 6 12.0 18.00
# 12 15-06-2015 6 15.7 18.40
Data
vdate=c("12-04-2015","13-04-2015","14-04-2015","15-04-2015","12-05-2015",
"13-05-2015","14-05-2015","15-05-2015","12-06-2015","13-06-2015",
"14-06-2015","15-06-2015")
month=c(4,4,4,4,5,5,5,5,6,6,6,6)
col1=c(12,12.4,14.3,3,5.3,1.8,7.6,4.5,7.6,10.7,12,15.7)
df=data.frame(vdate,month,col1)
pvar=c(8.4,2.4,12,14.4,2.3,3.5,7.8,5,16,5.4,18,18.4)
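For completeness, here is a hedged sketch that restricts the replacement to months 4 and 5 using if_else() and pmax() in a single mutate(). It assumes pvar is stored as a column of the data frame, and the name res is only for illustration.

```r
library(dplyr)

df <- data.frame(
  month = c(4, 4, 4, 4, 5, 5, 5, 5, 6, 6, 6, 6),
  pvar  = c(8.4, 2.4, 12, 14.4, 2.3, 3.5, 7.8, 5, 16, 5.4, 18, 18.4)
)

res <- df %>%
  group_by(month) %>%
  # only months 4 and 5 are lifted to their monthly mean; month 6 is left as is
  mutate(pvar = if_else(month %in% c(4, 5), pmax(pvar, mean(pvar)), pvar)) %>%
  ungroup()

print(res)
```

if_else() is stricter about types than ifelse(), which is harmless here because both branches are numeric.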