I am trying de-seasonalize my data by dividing my monthly totals by the average seasonality ratio per that month. I have two data frames. avgseasonality that has 12 rows of the average seasonality ratio per month. The problem is since the seasonality ratio is the ratio of each month averaged only has 12 rows and the ordertotal data frame has 147 rows.
deseasonlize <- transform(avgseasonalityratio, deseasonlizedtotal =
df1$OrderTotal / avgseasonality$seasonalityratio)
This runs but it does not pair the months appropriately. It uses the first ratio of april and runs it on the first ordertotal of december.
> avgseasonality
Month seasonalityratio
1 April 1.0132557
2 August 1.0054602
3 December 0.8316988
4 February 0.9813396
5 January 0.8357475
6 July 1.1181648
7 June 1.0439899
8 March 1.1772450
9 May 1.0430667
10 November 0.9841149
11 October 0.9595041
12 September 0.8312318
> df1
# A tibble: 157 x 3
DateEntLabel OrderTotal `d$Month`
<dttm> <dbl> <chr>
1 2005-12-01 00:00:00 512758. December
2 2006-01-01 00:00:00 227449. January
3 2006-02-01 00:00:00 155652. February
4 2006-03-01 00:00:00 172923. March
5 2006-04-01 00:00:00 183854. April
6 2006-05-01 00:00:00 239689. May
7 2006-06-01 00:00:00 237638. June
8 2006-07-01 00:00:00 538688. July
9 2006-08-01 00:00:00 197673. August
10 2006-09-01 00:00:00 144534. September
# ... with 147 more rows
I need the ordertotal and ratio of each month respectively. The calculations would for each month respectively be such as (december) 512758/0.8316988 = 616518.864762 The output for the calculations would be in their new column that corresponds with the month and ordertotal. Please any help is greatly appreciated!
Easiest way would be to merge() your data first, then do the operation. You can use R base merge() function, though I will show here using the tidyverse left_join() function. I see that one of your columns has a strange name d$Month, renameing this to Month will simplify the merge!
Reproducible example:
library(tidyverse)
df_1 <- data.frame(Month = c("Jan", "Feb"), seasonalityratio = c(1,2))
df_2 <- data.frame(Month = rep(c("Jan", "Feb"),each=2), OrderTotal = 1:4)
df_1 %>%
left_join(df_2, by = "Month") %>%
mutate(eseasonlizedtotal = OrderTotal / seasonalityratio)
#> Month seasonalityratio OrderTotal eseasonlizedtotal
#> 1 Jan 1 1 1.0
#> 2 Jan 1 2 2.0
#> 3 Feb 2 3 1.5
#> 4 Feb 2 4 2.0
Created on 2019-01-30 by the reprex package (v0.2.1)
Related
The post has been edit at Aug 17, 2020 to make the example looks more like my actual data.
The days always come first either with 1 or 2 digits. The months always come second either in full or in part and in French. The years always come third either with 2 or 4 digits.
I'm learning to code with tidyverse packages. I'm trying to replace every elements in a variable by an other string if they match specific conditions. The problem is that I can only do it one condition at the time. I would like to know how to achieve it at severals condition a the time.
Here's a reproductible exemple :
library(tidyverse)
library(magrittr)
tib <- tibble(
ID = 1:6,
Date = c("1-JAN-20", "15-JUILL-20", "30 DEC 2020",
"1-JAN-20", "15-JUILL-20", "30 DEC 2020"),
Comm = c("Should be 2020-01-01", "Should be 2020-06-15", "Should be 2020-12-30",
"Should be 2020-01-01", "Should be 2020-06-15", "Should be 2020-12-30"))
head(tib)
# A tibble: 6 x 3
ID Date Comm
<int> <chr> <chr>
1 1 1-JAN-20 Should be 2020-01-01
2 2 15-JUILL-20 Should be 2020-06-15
3 3 30 DEC 2020 Should be 2020-12-30
4 4 1-JAN-20 Should be 2020-01-01
5 5 15-JUILL-20 Should be 2020-06-15
6 6 30 DEC 2020 Should be 2020-12-30
# Returns the unique values of the character variables execept the "Comm" one. So, it
# returns only one in that case, but my original data have severals ones.
tib %>% select(where(is.character), -Comm) %>% map(~ unique(.x))
$Date
[1] "1-JAN-20" "15-JUILL-20" "30 DEC 2020"
Here we are! The following code works, but I wonder if there's a better way to atcheive it instead of copy/pass the same code line every time and changing it.
tib <- tib %>% mutate(Date = case_when(Date == "1-JAN-20" ~ "2020-01-01",
Date == "15-JUILL-20" ~ "2020-06-15",
Date == "30 DEC 2020" ~ "2020-12-01"))
head(tib)
# A tibble: 6 x 3
ID Date Comm
<int> <chr> <chr>
1 1 2020-01-01 Should be 2020-01-01
2 2 2020-06-15 Should be 2020-06-15
3 3 2020-12-01 Should be 2020-12-30
4 4 2020-01-01 Should be 2020-01-01
5 5 2020-06-15 Should be 2020-06-15
6 6 2020-12-01 Should be 2020-12-30
Since I will have to do this manipulation on other variables, how could I build a function that would accomplish this?
Also, I would like to know if you know some good documentations/tutorials to learn Purrr package?
Thank you and have a good day!
While handling dates/times you should use standard date time functions for manipulation. Don't replace dates one by one using str_replace. Imagine you have 1000's of dates with different years, it is not practically possible to list each one of them. In this case, you can use lubridate::dmy to convert them to date object, for more complicated cases there is lubridate::parse_date_time which can convert variables in different format to dates.
tib %>% dplyr::mutate(new_date = lubridate::dmy(Date))
# ID Date Comm new_date
# <int> <chr> <chr> <date>
#1 1 01-JAN-20 Should be 2020-01-01 2020-01-01
#2 2 15-JUN-20 Should be 2020-06-15 2020-06-15
#3 3 30 DEC 2020 Should be 2020-12-30 2020-12-30
#4 4 01-JAN-20 Should be 2020-01-01 2020-01-01
#5 5 15-JUN-20 Should be 2020-06-15 2020-06-15
#6 6 30 DEC 2020 Should be 2020-12-30 2020-12-30
If you want dates in specific format, you can use the format function on new_date.
Maybe you could try dplyr::case_when:
library(magrittr)
library(purrr)
# A tibble that looks like my data.
tib <- tibble(
ID = 1:6,
Date = c("01-JAN-20", "15-JUN-20", "30 DEC 2020",
"01-JAN-20", "15-JUN-20", "30 DEC 2020"),
Comm = c("Should be 2020-01-01", "Should be 2020-06-15", "Should be 2020-12-30",
"Should be 2020-01-01", "Should be 2020-06-15", "Should be 2020-12-30"))
head(tib)
tib %>% select(where(is.character), -Comm) %>% map(~ unique(.x))
tib <- tib %>% mutate(Date = dplyr::case_when(Date == "01-JAN-20" ~ "2020-01-01",
Date == "15-JUN-20" ~ "2020-06-15",
Date == "30 DEC 2020" ~ "2020-12-01"))
> tib
# A tibble: 6 x 3
ID Date Comm
<int> <chr> <chr>
1 1 2020-01-01 Should be 2020-01-01
2 2 2020-06-15 Should be 2020-06-15
3 3 2020-12-01 Should be 2020-12-30
4 4 2020-01-01 Should be 2020-01-01
5 5 2020-06-15 Should be 2020-06-15
6 6 2020-12-01 Should be 2020-12-30
The best thing to try to do here is to transform your Date column into Date class using the "anytime" package. Although you would have to manually fix your Date column so all years have 4 digits. If years are always in the last place of the date, that can be an easy thing to do.
I have a database containing the value of different indices with different frequency (weekly, monthly, daily)of data. I hope to calculate monthly returns by abstracting beginning of month value from the time series.
I have tried to use a loop to partition the time series month by month then use min() to get the earliest date in the month. However, I am wondering whether there is a more efficient way to speed up the calculation.
library(data.table)
df<-fread("statistic_date index_value funds_number
2013-1-1 1000.000 0
2013-1-4 996.096 21
2013-1-11 1011.141 21
2013-1-18 1057.344 21
2013-1-25 1073.376 21
2013-2-1 1150.479 22
2013-2-8 1150.288 19
2013-2-22 1112.993 18
2013-3-1 1148.826 20
2013-3-8 1093.515 18
2013-3-15 1092.352 17
2013-3-22 1138.346 18
2013-3-29 1107.440 17
2013-4-3 1101.897 17
2013-4-12 1093.344 17")
I expect to filter to get the rows of the earliest date of each month, such as:
2013-1-1 1000.000 0
2013-2-1 1150.479 22
2013-3-1 1148.826 20
2013-4-3 1101.897 17
Your help will be much appreciated!
Using the tidyverse and lubridate packages,
library(lubridate)
library(tidyverse)
df %>% mutate(statistic_date = ymd(statistic_date), # convert statistic_date to date format
month = month(statistic_date), #create month and year columns
year= year(statistic_date)) %>%
group_by(month,year) %>% # group by month and year
arrange(statistic_date) %>% # make sure the df is sorted by date
filter(row_number()==1) # select first row within each group
# A tibble: 4 x 5
# Groups: month, year [4]
# statistic_date index_value funds_number month year
# <date> <dbl> <int> <dbl> <dbl>
#1 2013-01-01 1000 0 1 2013
#2 2013-02-01 1150. 22 2 2013
#3 2013-03-01 1149. 20 3 2013
#4 2013-04-03 1102. 17 4 2013
First make statistic_date a Date:
df$statistic_date <- as.Date(df$statistic_date)
The you can use nth_day to find the first day of every month in statistic_date.
library("datetimeutils")
dates <- nth_day(df$statistic_date, period = "month", n = "first")
## [1] "2013-01-01" "2013-02-01" "2013-03-01" "2013-04-03"
df[statistic_date %in% dates]
## statistic_date index_value funds_number
## 1: 2013-01-01 1000.000 0
## 2: 2013-02-01 1150.479 22
## 3: 2013-03-01 1148.826 20
## 4: 2013-04-03 1101.897 17
I have a daily revenue time series df from 01-01-2014 to 15-06-2017 and I want to aggregate the daily revenue data to weekly revenue data and do the weekly predictions. Before I aggregate the revenue, I need to create a continuously week variable, which will NOT start from week 1 again when a new year starts. Since 01-01-2014 was not Monday, so I decided to start my first week from 06-01-2014.
My df now looks like this
date year month total
7 2014-01-06 2014 1 1857679.4
8 2014-01-07 2014 1 1735488.0
9 2014-01-08 2014 1 1477269.9
10 2014-01-09 2014 1 1329882.9
11 2014-01-10 2014 1 1195215.7
...
709 2017-06-14 2017 6 1677476.9
710 2017-06-15 2017 6 1533083.4
I want to create a unique week variable starting from 2014-01-06 until the last row of my dataset (1257 rows in total), which is 2017-06-15.
I wrote a loop:
week = c()
for (i in 1:179) {
week = rep(i,7)
print(week)
}
However, the result of this loop is not saved for each iteration. When I type week, it just shows 179,179,179,179,179,179,179
Where is the problem and how can I add 180, 180, 180, 180 after the repeat loop?
And if I will add more new data after 2017-06-15, how can I create the weekly variable automatically depending on my end of row (date)? (In other words, by doing that, I don't need to calculate how many daily observations I have and divide it by 7 and plus the rest of the dates to become the week index)
Thank you!
Does this work
library(lubridate)
#DATA
x = data.frame(date = seq.Date(from = ymd("2014-01-06"),
to = ymd("2017-06-15"), length.out = 15))
#Add year and week for each date
x$week = year(x$date) + week(x$date)/100
#Convert the addition of year and week to factor and then to numeric
x$week_variable = as.numeric(as.factor(x$week))
#Another alternative
x$week_variable2 = floor(as.numeric(x$date - min(x$date))/7) + 1
x
# date week week_variable week_variable2
#1 2014-01-06 2014.01 1 1
#2 2014-04-05 2014.14 2 13
#3 2014-07-04 2014.27 3 26
#4 2014-10-02 2014.40 4 39
#5 2014-12-30 2014.52 5 52
#6 2015-03-30 2015.13 6 65
#7 2015-06-28 2015.26 7 77
#8 2015-09-26 2015.39 8 90
#9 2015-12-24 2015.52 9 103
#10 2016-03-23 2016.12 10 116
#11 2016-06-21 2016.25 11 129
#12 2016-09-18 2016.38 12 141
#13 2016-12-17 2016.51 13 154
#14 2017-03-17 2017.11 14 167
#15 2017-06-15 2017.24 15 180
Here is the answer:
week = c()
for (i in 1:184) {
for (j in 1:7) {
week[j+(i-1)*7] = i
}
}
week = as.data.frame(week)
I created a week variable, and from week 1 to the week 184 (end of my dataset). For each week number, I repeat 7 times because there are 7 days in a week. Later I assigned the week variable to my data frame.
I have data that includes dates (dd/mm/yyyy) and am wanting to summarise the data by year. I'm sure that there is an easier way to do it but the route that I've taken is to try to create a new categorical variable using the "cut" function.
For example:
# create sample dataframe
dates<-c("01/01/2013", "01/02/2013", "01/01/2014", "01/02/2014", "01/01/2015", "01/02/2015")
cases<-c(3,5,2,6,8,4)
df<-as.data.frame(cbind(dates, cases))
df$dates <- as.Date(df$dates,"%d/%m/%Y")
# categorise by year
df$year <- cut(df$dates, c(2013-01-01, 2013-12-31, 2014-12-31, 2015-12-31))
This gives an error:
invalid specification of 'breaks'
How do I tell R to cut at various "date" intervals? Is my approach to this all wrong? Still new to R (sorry about the basic question).
Greg
How should your output look like?
Your code works when you define your breaks with as.Date:
breaks <- as.Date(c("2013-01-01", "2013-12-31", "2014-12-31", "2015-12-31"))
# categorise by year
df$year <- cut(df$dates, breaks)
dates cases year
1 2013-01-01 3 2013-01-01
2 2013-02-01 5 2013-01-01
3 2014-01-01 2 2013-12-31
4 2014-02-01 6 2013-12-31
5 2015-01-01 8 2014-12-31
6 2015-02-01 4 2014-12-31
I'm guessing you want your variable year to look different, though? You can define labels when using cut:
# categorise by year
df$year <- cut(df$dates, breaks, labels = c(2013, 2014, 2015))
dates cases year
1 2013-01-01 3 2013
2 2013-02-01 5 2013
3 2014-01-01 2 2014
4 2014-02-01 6 2014
5 2015-01-01 8 2015
6 2015-02-01 4 2015
if you are just looking for the year, maybe this helps:
df$year <- format(df$dates, format="%Y")
dates cases year
1 2013-01-01 3 2013
2 2013-02-01 5 2013
3 2014-01-01 2 2014
4 2014-02-01 6 2014
5 2015-01-01 8 2015
6 2015-02-01 4 2015
I think the solutions based on cut are a bit overkill. You can use the year function from the lubridate package to extract the year from the date:
library(dplyr)
library(lubridate)
df %>% mutate(year = year(dates))
# dates cases year
# 1 2013-01-01 3 2013
# 2 2013-02-01 5 2013
# 3 2014-01-01 2 2014
# 4 2014-02-01 6 2014
# 5 2015-01-01 8 2015
# 6 2015-02-01 4 2015
lubridate is such an awesome package when it comes to dealing with time data.
After the year column is constructed you can apply all kinds of summaries. I use the dplyr style here:
# Note that as.numeric(as.character()) is needed as `cbind` forces `cases` to be a factor
df %>% mutate(year = year(dates), cases = as.numeric(as.character(cases))) %>%
group_by(year) %>% summarise(tot_cases = sum(cases))
# # A tibble: 3 × 2
# year tot_cases
# <dbl> <dbl>
# 1 2013 8
# 2 2014 8
# 3 2015 12
Note that group_by ensures that all operations after that are done per unique category mentioned there, in this case per year.
A simple solution would be using the dplyr package. Here is a simple example:
library(dplyr)
df_grouped <- df %>%
mutate(
dates = as_date(dates),
cases = as.numeric(cases)) %>%
group_by(year = year(dates)) %>%
summarise(tot_cases = sum(cases))
In the mutate statement we convert the variables to a more suitable format, in group_by we select which variable is going to do the grouping and in summarise we create any new variables that we want.
df_grouped looks like this:
# A tibble: 3 × 2
year tot_cases
<dbl> <dbl>
1 2013 6
2 2014 6
3 2015 9
I've compiled a corpus of tweets sent over the past few months or so, which looks something like this (the actual corpus has a lot more columns and obviously a lot more rows, but you get the idea)
id when time day month year handle what
UK1.1 Sat Feb 20 2016 12:34:02 20 2 2016 dave Great goal by #lfc
UK1.2 Sat Feb 20 2016 15:12:42 20 2 2016 john Can't wait for the weekend
UK1.3 Sat Mar 01 2016 12:09:21 1 3 2016 smith Generic boring tweet
Now what I'd like to do in R is, using grep for string matching, plot the frequency of certain words/hashtags over time, ideally normalised by the number of tweets from that month/day/hour/whatever. But I have no idea how to do this.
I know how to use grep to create subsets of this dataframe, e.g. for all tweets including the #lfc hashtag, but I don't really know where to go from there.
The other issue is that whatever time scale is on my x-axis (hour/day/month etc.) needs to be numerical, and the 'when' column isn't. I've tried concatenating the 'day' and 'month' columns into something like '2.13' for February 13th, but this leads to the issue of R treating 2.13 as being 'earlier', so to speak, than 2.7 (February 7th) on mathematical grounds.
So basically, I'd like to make plots like these, where frequency of string x is plotted against time
Thanks!
Here's one way to count up tweets by day. I've illustrated with a simplified fake data set:
library(dplyr)
library(lubridate)
# Fake data
set.seed(485)
dat = data.frame(time = seq(as.POSIXct("2016-01-01"),as.POSIXct("2016-12-31"), length.out=10000),
what = sample(LETTERS, 10000, replace=TRUE))
tweet.summary = dat %>% group_by(day = date(time)) %>% # To summarise by month: group_by(month = month(time, label=TRUE))
summarise(total.tweets = n(),
A.tweets = sum(grepl("A", what)),
pct.A = A.tweets/total.tweets,
B.tweets = sum(grepl("B", what)),
pct.B = B.tweets/total.tweets)
tweet.summary
day total.tweets A.tweets pct.A B.tweets pct.B
1 2016-01-01 28 3 0.10714286 0 0.00000000
2 2016-01-02 27 0 0.00000000 1 0.03703704
3 2016-01-03 28 4 0.14285714 1 0.03571429
4 2016-01-04 27 2 0.07407407 2 0.07407407
...
Here's a way to plot the data using ggplot2. I've also summarized the data frame on the fly within ggplot, using the dplyr and reshape2 packages:
library(ggplot2)
library(reshape2)
library(scales)
ggplot(dat %>% group_by(Month = month(time, label=TRUE)) %>%
summarise(A = sum(grepl("A", what))/n(),
B = sum(grepl("B", what))/n()) %>%
melt(id.var="Month"),
aes(Month, value, colour=variable, group=variable)) +
geom_line() +
theme_bw() +
scale_y_continuous(limits=c(0,0.06), labels=percent_format()) +
labs(colour="", y="")
Regarding your date formatting issue, here's how to get numeric dates: You can turn the day month and year columns into a date using as.Date and/or turn the day, month, year, and time columns into a date-time column using as.POSIXct. Both will have underlying numeric values with a date class attached, so that R treats them as dates in plotting functions and other functions. Once you've done this conversion, you can run the code above to count up tweets by day, month, etc.
# Fake time data
dat2 = data.frame(day=sample(1:28, 10), month=sample(1:12,10), year=2016,
time = paste0(sample(c(paste0(0,0:9),10:12),10),":",sample(10:50,10)))
# Create date-time format column from existing day/month/year/time columns
dat2$posix.date = with(dat2, as.POSIXct(paste0(year,"-",
sprintf("%02d",month),"-",
sprintf("%02d", day)," ",
time)))
# Create date format column
dat2$date = with(dat2, as.Date(paste0(year,"-",
sprintf("%02d",month),"-",
sprintf("%02d", day))))
dat2
day month year time posix.date date
1 28 10 2016 01:44 2016-10-28 01:44:00 2016-10-28
2 22 6 2016 12:28 2016-06-22 12:28:00 2016-06-22
3 3 4 2016 11:46 2016-04-03 11:46:00 2016-04-03
4 15 8 2016 10:13 2016-08-15 10:13:00 2016-08-15
5 6 2 2016 06:32 2016-02-06 06:32:00 2016-02-06
6 2 12 2016 02:38 2016-12-02 02:38:00 2016-12-02
7 4 11 2016 00:27 2016-11-04 00:27:00 2016-11-04
8 12 3 2016 07:20 2016-03-12 07:20:00 2016-03-12
9 24 5 2016 08:47 2016-05-24 08:47:00 2016-05-24
10 27 1 2016 04:22 2016-01-27 04:22:00 2016-01-27
You can see that the underlying values of a POSIXct date are numeric (number of seconds elapsed since midnight on Jan 1, 1970), by doing as.numeric(dat2$posix.date). Likewise for a Date object (number of days elapsed since Jan 1, 1970): as.numeric(dat2$date).