Format/Generate new table in R using dplyr/tidyr - r

I have a table calculated using a df which looks like the following.
Month_considered pct `ATC Count`
<fct> <dbl> <fct>
1 Apr-17 54.9 198,337
2 May-17 56.4 227,681
3 Jun-17 58.0 251,664
4 Jul-17 57.7 251,934
5 Aug-17 55.5 259,617
6 Sep-17 55.7 245,588
7 Oct-17 56.6 247,051
8 Nov-17 57.6 256,375
9 Dec-17 56.9 277,784
10 Jan-18 56.7 272,818
Now I want to find the difference in pct between two consecutive months. The desired output would look like this:
Month_considered pct
<fct> <dbl>
1 Apr-17-May-17 1.5
2 May-17-Jun-17 1.6
3 Jun-17-Jul-17 -0.3
How do I concatenate the first column as shown above? I tried unite from tidyr, but it does not produce the output I want. Thank you.

We need the difference between each value and the next one. lead() with default = last(.) pairs the final month with itself, which is why the last row below is 0.0.
library(dplyr)
library(tidyr) # for unite()
library(zoo)
df1 %>%
arrange(as.yearmon(Month_considered, format = "%b-%y")) %>% # to order
mutate_at(vars(Month_considered, pct),
list(new = ~ lead(., default = last(.)))) %>% # adds Month_considered_new, pct_new
unite(Month_considered, Month_considered, Month_considered_new, sep="-") %>%
transmute(Month_considered, pct = pct_new - pct)
# Month_considered pct
#1 Apr-17-May-17 1.5
#2 May-17-Jun-17 1.6
#3 Jun-17-Jul-17 -0.3
#4 Jul-17-Aug-17 -2.2
#5 Aug-17-Sep-17 0.2
#6 Sep-17-Oct-17 0.9
#7 Oct-17-Nov-17 1.0
#8 Nov-17-Dec-17 -0.7
#9 Dec-17-Jan-18 -0.2
#10 Jan-18-Jan-18 0.0
Or using base R
pct <- df1$pct[-1] - df1$pct[-nrow(df1)]
Month_considered <- paste(df1$Month_considered[-nrow(df1)],
df1$Month_considered[-1], sep="-") # earlier month first
data.frame(Month_considered, pct)
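The dplyr version above ends with a `Jan-18-Jan-18` row because the last month is paired with itself. If that row is not wanted, one option is to let `lead()` return NA for the last month and drop it. A minimal sketch, assuming the months are already in chronological order; `monthly` is an illustrative three-row stand-in for the question's table:

```r
library(dplyr)

# Illustrative stand-in: the first three rows of the question's table
monthly <- data.frame(
  Month_considered = c("Apr-17", "May-17", "Jun-17"),
  pct = c(54.9, 56.4, 58.0),
  stringsAsFactors = FALSE
)

result <- monthly %>%
  mutate(next_month = lead(Month_considered),  # NA for the last row
         change     = lead(pct) - pct) %>%
  filter(!is.na(next_month)) %>%               # drop the unpaired last month
  transmute(Month_considered = paste(Month_considered, next_month, sep = "-"),
            pct = change)

result
```

This gives one row per pair of consecutive months, so n months produce n - 1 rows.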

Another option is a self-join after adding one month to a zoo::yearmon column.
A yearmon value is stored as the year plus a fraction of the year, so adding 1/12 moves it forward by one month.
The solution is:
library(zoo)
library(dplyr)
df %>% mutate(Month_considered = as.yearmon(Month_considered, "%b-%y"),
Next_Month = Month_considered+(1/12)) %>%
#self join
left_join(.,.,by=c("Next_Month"="Month_considered")) %>%
mutate(Month_considered = paste(Month_considered,Next_Month,sep="-"),
pct = pct.y - pct.x) %>%
select(Month_considered, pct)
# Month_considered pct
# 1 Apr 2017-May 2017 1.5
# 2 May 2017-Jun 2017 1.6
# 3 Jun 2017-Jul 2017 -0.3
# 4 Jul 2017-Aug 2017 -2.2
# 5 Aug 2017-Sep 2017 0.2
# 6 Sep 2017-Oct 2017 0.9
# 7 Oct 2017-Nov 2017 1.0
# 8 Nov 2017-Dec 2017 -0.7
# 9 Dec 2017-Jan 2018 -0.2
# 10 Jan 2018-Feb 2018 NA
Data:
df <- read.table(text=
"Month_considered pct 'ATC Count'
Apr-17 54.9 198337
May-17 56.4 227681
Jun-17 58.0 251664
Jul-17 57.7 251934
Aug-17 55.5 259617
Sep-17 55.7 245588
Oct-17 56.6 247051
Nov-17 57.6 256375
Dec-17 56.9 277784
Jan-18 56.7 272818",
header=TRUE, stringsAsFactors = FALSE)

Related

Group by weekly data and summarize by month in R with dplyr

I have a dataset of weekly mortgage rate data.
The data looks very simple:
library(tibble)
library(lubridate)
df <- tibble(
Date = as_date(c("2/7/2008 ", "2/14/2008", "2/21/2008", "2/28/2008", "3/6/2008"), format = "%m/%d/%Y"),
Rate = c(5.67, 5.72, 6.04, 6.24, 6.03)
)
I am trying to group it and summarize by month.
This blogpost and this answer are not what I want, because they just add the month column.
They give me the output:
month Date summary_variable
2008-02-01 2008-02-07 5.67
2008-02-01 2008-02-14 5.72
2008-02-01 2008-02-21 6.04
2008-02-01 2008-02-28 6.24
My desired output (ideally the last day of the month):
Month Average rate
2/28/2008 6
3/31/2008 6.1
4/30/2008 5.9
In the output above I put random numbers, not real calculations.
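We can convert each date to the last day of its month (as.Date on a yearmon with frac = 1), group by that, and take the mean. The code below uses the names df1, DATE and MORTGAGE30US rather than the question's df, Date and Rate.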
library(dplyr)
library(lubridate)
library(zoo)
df1 %>%
group_by(Month = as.Date(as.yearmon(mdy(DATE)), 1)) %>%
summarise(Average_rate = mean(MORTGAGE30US))
-output
# A tibble: 151 x 2
# Month Average_rate
# <date> <dbl>
# 1 2008-02-29 5.92
# 2 2008-03-31 5.97
# 3 2008-04-30 5.92
# 4 2008-05-31 6.04
# 5 2008-06-30 6.32
# 6 2008-07-31 6.43
# 7 2008-08-31 6.48
# 8 2008-09-30 6.04
# 9 2008-10-31 6.2
#10 2008-11-30 6.09
# … with 141 more rows
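The same idea can be sketched on the question's own five rows with lubridate alone. This assumes that taking `ceiling_date()` to the next month start and subtracting one day is an acceptable way to reach the month end; the names `monthly_avg`, `Month` and `Average_rate` are illustrative.

```r
library(dplyr)
library(lubridate)

# The five rows from the question
df <- tibble(
  Date = mdy(c("2/7/2008", "2/14/2008", "2/21/2008", "2/28/2008", "3/6/2008")),
  Rate = c(5.67, 5.72, 6.04, 6.24, 6.03)
)

monthly_avg <- df %>%
  # first day of the following month minus one day = last day of this month
  group_by(Month = ceiling_date(Date, "month") - days(1)) %>%
  summarise(Average_rate = mean(Rate))

monthly_avg
```

Because 2008 is a leap year, the February group is labelled 2008-02-29, which matches the first row of the output above.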

Calculate daily mean of data frame in r

I have a data frame in R that contains readings taken every five minutes over a couple of months. I want to calculate the daily mean of Var3 (see the data frame below) and add it to this data frame as Var4.
Here is my df:
>df
timestamp Var1 Var2 Var3
1 2018-07-20 13:50:00 32.0358 28.1 3.6
2 2018-07-20 13:55:00 32.0358 28.0 2.5
3 2018-07-20 14:00:00 32.0358 28.1 2.2
I found this solution by searching the forum, but it raises an error.
Here is the solution I am applying:
aggregate(ts(df$var3[, 2], freq = 288), 1, mean)
This is the error I am getting:
Error in df$var3[, 2] : incorrect number of dimensions
I think this should work for my data frame too, but I am not able to get rid of this error. Please help.
Here's an approach with dplyr and lubridate.
library(dplyr)
library(lubridate)
df %>%
group_by(Day = day(ymd_hms(timestamp))) %>%
mutate(Var4 = mean(Var3))
## A tibble: 1,000 x 6
## Groups: Day [5]
# timestamp Var1 Var2 Var3 Day Var4
# <dttm> <dbl> <dbl> <dbl> <int> <dbl>
# 1 2018-07-20 13:55:30 32.2 22.9 2.35 20 2.99
# 2 2018-07-20 14:00:30 37.7 24.8 2.99 20 2.99
# 3 2018-07-20 14:05:30 38.7 29.6 3.47 20 2.99
# 4 2018-07-20 14:10:30 30.4 24.2 3.02 20 2.99
# 5 2018-07-20 14:15:30 32.0 28.4 2.95 20 2.99
## … with 995 more rows
Sample Data
df <- data.frame(timestamp = ymd_hms("2018-07-20 13:50:30") + 60*5 * 1:1000,
Var1 = runif(100,30,40),
Var2 = runif(100,20,30),
Var3 = runif(100,2,4))
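Note that `day()` returns the day of the month (1 to 31), so on data spanning several months the 20th of July and the 20th of August would land in the same group. A hedged variant that groups by the full calendar date with `as_date()` instead; the `readings` data frame is made up for illustration:

```r
library(dplyr)
library(lubridate)

# Made-up readings: same day of the month, two different months
readings <- data.frame(
  timestamp = c("2018-07-20 13:50:00", "2018-07-20 13:55:00",
                "2018-08-20 13:50:00", "2018-08-20 13:55:00"),
  Var3 = c(3.6, 2.5, 1.0, 2.0)
)

daily <- readings %>%
  group_by(Day = as_date(ymd_hms(timestamp))) %>%  # full date, not day of month
  mutate(Var4 = mean(Var3)) %>%
  ungroup()

daily
```

With `day()` all four rows would share one mean; with `as_date()` each calendar day gets its own.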

Use time values for x-axis labels

I have some climate data with temperature and humidity as well as a timestamp which is transformed to the time in %H:%M.
When using ggplot2 for visualization, the time gets sorted, which scrambles the order of measurements: the first measurement was taken at 14:00 (2pm) and the last one at 10:27 (10:27am) the following day.
How do I prevent ggplot2 from sorting the x-values? (see plot)
MVE:
library(tidyverse)
df = read_csv('./climate_stats_incl_time.csv')
colnames(df)[1] <- c('sample')
head(df)
tail(df)
ggplot(data=df, mapping=aes(x=time)) +
geom_line(aes(y=temperature, color='red')) +
geom_line(aes(y=humidity, color='blue'))
> head(df)
# A tibble: 6 x 5
sample timestamp temperature humidity time
<dbl> <dbl> <dbl> <dbl> <drtn>
1 0 1581253210. 21.9 47.6 14:00
2 1 1581253275. 21.7 47.8 14:01
3 2 1581253336. 21.7 47.8 14:02
4 3 1581253397. 21.8 47.8 14:03
5 4 1581253457. 21.7 47.8 14:04
6 5 1581253520. 21.8 47.8 14:05
> tail(df)
# A tibble: 6 x 5
sample timestamp temperature humidity time
<dbl> <dbl> <dbl> <dbl> <drtn>
1 1203 1581326567. 19.1 49.8 10:22
2 1204 1581326628. 19.1 49.7 10:23
3 1205 1581326688. 19.1 49.9 10:24
4 1206 1581326749. 19.1 49.9 10:25
5 1207 1581326812. 19.1 49.7 10:26
6 1208 1581326873. 19.1 49.8 10:27
Format your timestamps to a proper date-time (assuming the origin is 1970):
df$date_time <- as.POSIXct(df$timestamp, origin="1970-01-01", tz = "GMT")
Then use this new date_time variable instead of time for plotting.
Edit:
I accidentally submitted a wrong solution at first (I had re-formatted the date-time to a date). The solution above now produces a date-time and should work for your problem.
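A minimal sketch of the date-time approach, using three of the timestamps shown in the question. The `tz` value and the `%H:%M` label format are assumptions; set `tz` to the zone the measurements were actually taken in.

```r
library(ggplot2)

# Three timestamps taken from the question's head()/tail() output
df <- data.frame(
  timestamp   = c(1581253210, 1581253275, 1581326873),
  temperature = c(21.9, 21.7, 19.1),
  humidity    = c(47.6, 47.8, 49.8)
)

# Seconds since 1970-01-01 -> date-time; tz = "GMT" is an assumption
df$date_time <- as.POSIXct(df$timestamp, origin = "1970-01-01", tz = "GMT")

p <- ggplot(df, aes(x = date_time)) +
  geom_line(aes(y = temperature, colour = "temperature")) +
  geom_line(aes(y = humidity, colour = "humidity")) +
  scale_x_datetime(date_labels = "%H:%M")  # show only the time on the axis

p
```

Because the x values are full date-times, 10:27 on the second day sorts after 14:00 on the first, while the axis labels still show only hours and minutes.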
A workaround
df %>%
mutate(orig_seq = seq(1,nrow(df),1)) %>%
# reorder() turns time into a factor, so group = 1 is needed to connect the points
ggplot(mapping=aes(x=reorder(time, orig_seq), group=1)) +
geom_line(aes(y=temperature, color='red')) +
geom_line(aes(y=humidity, color='blue'))

R - find change across same days of week for multiple variables and aggregate

With data like below
text = "
date,weekday,hour,a,b
12/2/2019,Mon,8,18.17183824,0.017741935
12/2/2019,Mon,9,18.11228506,0.020967742
12/9/2019,Mon,8,16.77932274,0.020322581
12/9/2019,Mon,9,16.97327971,0.019677419
12/3/2019,Tue,8,18.17183824,0.017741935
12/3/2019,Tue,10,18.11228506,0.020967742
12/10/2019,Tue,8,16.77932274,0.020322581
12/10/2019,Tue,10,16.97327971,0.019677419
"
df = read.table(textConnection(text), sep=",", header = T)
I need to find the change in the variables a and b from one week to the next, for the same weekday and hour.
Example for a, the change would be calculated as follows
Change for hour 8 on Mondays = (16.77932274 - 18.17183824)/18.17183824
Change for hour 9 on Mondays = (16.97327971 - 18.11228506)/18.11228506
Change for hour 8 on Tuesdays = (16.77932274 - 18.17183824)/18.17183824
Change for hour 10 on Tuesdays = (16.97327971 - 18.11228506)/18.11228506
Average change for variable a in the dataset = Avg of 1,2,3,4
Would appreciate help
For a single variable, I would have converted from long to wide format and computed the change for each pair of same weekdays, labelling the values of a by week number. The difficulty is doing this for multiple variables (a and b here); my real data has more than these two variables.
We can group_by weekday and hour, use lead/lag to get next/previous value and use mutate_at to apply it for multiple columns.
library(dplyr)
df %>%
group_by(weekday, hour) %>%
mutate_at(vars(a:b), list(change = ~(lead(.) - .)/.))
# date weekday hour a b a_change b_change
# <fct> <fct> <int> <dbl> <dbl> <dbl> <dbl>
#1 12/2/2019 Mon 8 18.2 0.0177 -0.0766 0.145
#2 12/2/2019 Mon 9 18.1 0.0210 -0.0629 -0.0615
#3 12/9/2019 Mon 8 16.8 0.0203 NA NA
#4 12/9/2019 Mon 9 17.0 0.0197 NA NA
#5 12/3/2019 Tue 8 18.2 0.0177 -0.0766 0.145
#6 12/3/2019 Tue 10 18.1 0.0210 -0.0629 -0.0615
#7 12/10/2019 Tue 8 16.8 0.0203 NA NA
#8 12/10/2019 Tue 10 17.0 0.0197 NA NA
Here is an option with data.table
library(data.table)
setDT(df)[, c('a_change', 'b_change') :=
(shift(.SD, type = 'lead') - .SD)/.SD , .(weekday, hour), .SDcols = a:b]
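To get the overall average change that the question asks for, the per-row changes can be summarised after ungrouping. A sketch using dplyr's `across()` (this assumes dplyr 1.0 or later) on the question's data; the rows are first ordered by parsed date so that `lead()` picks the following week, and `avg_change` is an illustrative name.

```r
library(dplyr)

text <- "
date,weekday,hour,a,b
12/2/2019,Mon,8,18.17183824,0.017741935
12/2/2019,Mon,9,18.11228506,0.020967742
12/9/2019,Mon,8,16.77932274,0.020322581
12/9/2019,Mon,9,16.97327971,0.019677419
12/3/2019,Tue,8,18.17183824,0.017741935
12/3/2019,Tue,10,18.11228506,0.020967742
12/10/2019,Tue,8,16.77932274,0.020322581
12/10/2019,Tue,10,16.97327971,0.019677419
"
df <- read.csv(text = text)

avg_change <- df %>%
  arrange(as.Date(date, format = "%m/%d/%Y")) %>%  # earlier week first
  group_by(weekday, hour) %>%
  mutate(across(c(a, b), ~ (lead(.x) - .x) / .x,
                .names = "{.col}_change")) %>%
  ungroup() %>%
  # the last week of each group has no successor, hence na.rm = TRUE
  summarise(across(ends_with("_change"), ~ mean(.x, na.rm = TRUE)))

avg_change
```

Adding more variables only requires extending the column selection in the first `across()` call.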

Calculate average of month and replace values of other column

I have a dataframe as given below:
vdate=c("12-04-2015","13-04-2015","14-04-2015","15-04-2015","12-05-2015","13-05-2015","14-05-2015"
,"15-05-2015","12-06-2015","13-06-2015","14-06-2015","15-06-2015")
month=c(4,4,4,4,5,5,5,5,6,6,6,6)
col1=c(12,12.4,14.3,3,5.3,1.8,7.6,4.5,7.6,10.7,12,15.7)
df=data.frame(vdate,month,col1)
Below is the column which contains value based on some calculation:
pvar=c(8.4,2.4,12,14.4,2.3,3.5,7.8,5,16,5.4,18,18.4)
Now I want to replace each pvar value that is less than the average pvar for its month with that average.
For example,
for month 4,
Average value of pvar is 9.3 ((8.4+2.4+12+14.4)/4).
Then replace all the values in pvar that are less than the average for month 4 (that is, 8.4 and 2.4).
The pvar values for month 4 would then be 9.3, 9.3, 12, 14.4.
I need to do this for all the values in pvar.
A base R solution would be to use ave. Note that we first need to convert the date column to actual date in order to extract the month (strsplit or regex can also do it but I prefer to have it set as a proper date), i.e.
df$vdate <- as.POSIXct(df$vdate, format = '%d-%m-%Y')
with(df, ave(pvar, format(vdate, '%m'), FUN = function(i) replace(i, i < mean(i), mean(i))))
#[1] 9.30 9.30 12.00 14.40 4.65 4.65 7.80 5.00 16.00 14.45 18.00 18.40
As per your edit, I will use dplyr to tackle it as it might be more readable. There are actually two ways I came up with.
First: Create an extra grouping variable that will put all the months you need to alter the values in the same group and replace from there, i.e.
library(dplyr)
cbind(df, pvar) %>%
group_by(grp = cumsum(!month %in% c(4, 5))+1, month) %>%
mutate(pvar = replace(pvar, pvar < mean(pvar), mean(pvar))) %>%
ungroup() %>%
select(-grp)
Second: Filter the months you need, do the calculations. Then filter the months you don't need, create again the pvar but without changing anything (necessary for binding the rows) and bind the rows, i.e.
bind_rows(
cbind(df, pvar) %>%
filter(month %in% c(4, 5)) %>%
group_by(month) %>%
mutate(pvar = replace(pvar, pvar < mean(pvar), mean(pvar))),
cbind(df, pvar) %>%
filter(!month %in% c(4, 5))
)
Both of the above give:
vdate month col1 pvar
<fct> <dbl> <dbl> <dbl>
1 12-04-2015 4. 12.0 9.30
2 13-04-2015 4. 12.4 9.30
3 14-04-2015 4. 14.3 12.0
4 15-04-2015 4. 3.00 14.4
5 12-05-2015 5. 5.30 4.65
6 13-05-2015 5. 1.80 4.65
7 14-05-2015 5. 7.60 7.80
8 15-05-2015 5. 4.50 5.00
9 12-06-2015 6. 7.60 16.0
10 13-06-2015 6. 10.7 5.40
11 14-06-2015 6. 12.0 18.0
12 15-06-2015 6. 15.7 18.4
A dplyr-based solution could be:
#Additional condition has been added to check if month != 6
cbind(df, pvar) %>%
group_by(month) %>%
mutate(pvar = ifelse(pvar < mean(pvar) & month != 6, mean(pvar), pvar)) %>%
as.data.frame()
# vdate month col1 pvar
# 1 12-04-2015 4 12.0 9.30
# 2 13-04-2015 4 12.4 9.30
# 3 14-04-2015 4 14.3 12.00
# 4 15-04-2015 4 3.0 14.40
# 5 12-05-2015 5 5.3 4.65
# 6 13-05-2015 5 1.8 4.65
# 7 14-05-2015 5 7.6 7.80
# 8 15-05-2015 5 4.5 5.00
# 9 12-06-2015 6 7.6 16.00
# 10 13-06-2015 6 10.7 5.40
# 11 14-06-2015 6 12.0 18.00
# 12 15-06-2015 6 15.7 18.40
Data
vdate=c("12-04-2015","13-04-2015","14-04-2015","15-04-2015","12-05-2015",
"13-05-2015","14-05-2015","15-05-2015","12-06-2015","13-06-2015",
"14-06-2015","15-06-2015")
month=c(4,4,4,4,5,5,5,5,6,6,6,6)
col1=c(12,12.4,14.3,3,5.3,1.8,7.6,4.5,7.6,10.7,12,15.7)
df=data.frame(vdate,month,col1)
pvar=c(8.4,2.4,12,14.4,2.3,3.5,7.8,5,16,5.4,18,18.4)
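The replacement rule can also be wrapped in a small base R helper. This is only a sketch: `raise_to_group_mean()` and its `only` argument are hypothetical names, not part of any package. It reuses the `ave()` idea from the first answer and leaves groups outside `only` untouched.

```r
# Hypothetical helper: raise values below their group mean up to that mean,
# but only for the groups listed in `only` (default: every group)
raise_to_group_mean <- function(x, group, only = unique(group)) {
  group_mean <- ave(x, group, FUN = mean)      # mean of each value's group
  change <- x < group_mean & group %in% only   # which values to replace
  replace(x, change, group_mean[change])
}

month <- c(4, 4, 4, 4, 5, 5, 5, 5, 6, 6, 6, 6)
pvar  <- c(8.4, 2.4, 12, 14.4, 2.3, 3.5, 7.8, 5, 16, 5.4, 18, 18.4)

all_months <- raise_to_group_mean(pvar, month)                 # every month
not_june   <- raise_to_group_mean(pvar, month, only = c(4, 5)) # skip month 6

all_months
not_june
```

The two calls differ only in the tenth value: 5.4 is raised to the June mean of 14.45 in `all_months` and kept as is in `not_june`.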
