Merging quarterly time series in R

How do you merge a and b xts series that are presented like this:
a:
1948-01-01 1
1948-04-01 1
1948-07-01 1
1948-10-01 1
b:
1948-03-01 2
1948-06-01 2
1948-09-01 2
1948-12-01 2
Result should look like this
a b
1948Q1 1 2
1948Q2 1 2
1948Q3 1 2
1948Q4 1 2
The date format doesn't matter, as long as a and b are aligned by quarter. This is much easier to do with monthly data using indexFormat() and %b%Y etc., but there isn't an equivalent index format for quarters.
aggregate(a, as.yearqtr) doesn't work too well because for some reason it takes Q2 as the first quarter of the year. Then, if you want to take a yearly average, it uses Q2, Q3, Q4, Q1 for each year instead of Q1-Q4. So I am looking for another method. Let me know if one exists.

Try this 2-step solution:
# Step 1: set rownames as quarters
library(zoo)
rownames(a) <- format(as.yearqtr(as.Date(a$V1)))
rownames(b) <- format(as.yearqtr(as.Date(b$V1)))
# Step 2: merge by rownames (quarters)
merge(a, b, by = "row.names", all = TRUE, suffixes = c(".a",".b"))[, -c(2, 4)]
# output (you can now store the result in a dataframe and change colnames as you want)
Row.names V2.a V2.b
1 1948 Q1 1 2
2 1948 Q2 1 2
3 1948 Q3 1 2
4 1948 Q4 1 2
Data
a <- structure(list(V1 = c("1948-01-01", "1948-04-01", "1948-07-01",
"1948-10-01"), V2 = c(1L, 1L, 1L, 1L)), .Names = c("V1", "V2"
), class = "data.frame", row.names = c("1948 Q1", "1948 Q2",
"1948 Q3", "1948 Q4"))
b <- structure(list(V1 = c("1948-03-01", "1948-06-01", "1948-09-01",
"1948-12-01"), V2 = c(2L, 2L, 2L, 2L)), .Names = c("V1", "V2"
), class = "data.frame", row.names = c("1948 Q1", "1948 Q2",
"1948 Q3", "1948 Q4"))

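For comparison, the same quarter alignment can be sketched in base R alone. This is only an illustration on plain data frames, not xts objects: quarter_label is a hypothetical helper, and the column names date, a, b and quarter are assumptions made for the example.

```r
# Hypothetical helper: label each date with its calendar quarter, e.g. "1948Q1"
quarter_label <- function(d) paste0(format(d, "%Y"), quarters(d))

# Example data matching the question (column names are assumptions)
a <- data.frame(date = as.Date(c("1948-01-01", "1948-04-01", "1948-07-01", "1948-10-01")),
                a = 1)
b <- data.frame(date = as.Date(c("1948-03-01", "1948-06-01", "1948-09-01", "1948-12-01")),
                b = 2)

a$quarter <- quarter_label(a$date)
b$quarter <- quarter_label(b$date)

# Merge on the quarter label only, dropping the original dates
merged <- merge(a[c("quarter", "a")], b[c("quarter", "b")], by = "quarter", all = TRUE)
merged
```

Because the label starts with the four-digit year, sorting it as text also sorts it chronologically.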

How to count the number of customer per month in R?

So I have a table of customers with the respective date as below:
ID Date
1 2019-04-17
4 2019-05-12
1 2019-04-25
2 2019-05-19
I just want to count how many customers there are for each month-year, like below:
Month-Year Count of Customer
Apr-19 2
May-19 2
EDIT:
Sorry, I think my question should be clearer.
The same customer can appear more than once in a month and should then be counted twice for that month. I would basically like to find the number of transactions per month based on customer ID.
My assumed approach would be to first change the date into a month-year format and then count the customers grouped by month, but I am not sure how to do this in R. Thank you!
You can use count -
library(dplyr)
df %>% count(Month_Year = format(as.Date(Date), '%b-%y'))
# Month_Year n
#1 Apr-19 2
#2 May-19 2
Or table in base R -
table(format(as.Date(df$Date), '%b-%y'))
#Apr-19 May-19
# 2 2
data
df <- structure(list(ID = c(1L, 4L, 1L, 2L), Date = c("2019-04-17",
"2019-05-12", "2019-04-25", "2019-05-19")),
class = "data.frame", row.names = c(NA, -4L))
We can use zoo::as.yearmon
library(dplyr)
df %>%
count(Date = zoo::as.yearmon(Date))
Date n
1 Apr 2019 2
2 May 2019 2
data
df <- structure(list(ID = c(1L, 4L, 1L, 2L), Date = c("2019-04-17",
"2019-05-12", "2019-04-25", "2019-05-19")),
class = "data.frame", row.names = c(NA, -4L))
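One caveat with '%b-%y' labels is that they sort alphabetically (and %b depends on the locale), so with more months the table would not come out in calendar order. A minimal base R sketch that counts on a "%Y-%m" key instead; the names month_key and counts are made up for the example.

```r
df <- data.frame(ID = c(1L, 4L, 1L, 2L),
                 Date = c("2019-04-17", "2019-05-12", "2019-04-25", "2019-05-19"))

# A year-month key such as "2019-04" sorts chronologically as plain text
month_key <- format(as.Date(df$Date), "%Y-%m")

# One row per month with the number of transactions in it
counts <- as.data.frame(table(Month = month_key), responseName = "Count")
counts
```

The key can be turned into a display label such as Apr-19 afterwards, once the rows are already in order.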

Divide data by the preceding row and create new dataframe

I have a data set and I'm trying to calculate the rate of change between the rows.
My input looks like this:
foo = structure(list(date = structure(c(5L, 1L, 2L, 3L, 4L), .Label = c("10/03/2020",
"11/03/2020", "12/03/2020", "13/03/2020", "9/03/2020"), class = "factor"),
A = c(0.60256322, 0.634543306, 0.022976661, 0.009839044,
0.319456765), B = c(45.42320826, 57.32689951, 32.49487759,
29.40804164, 54.33691346), C = c(5.114123914, 3.674167652,
2.330610757, 5.510280192, 5.717950467), D = c(4.187409484,
4.835943165, 4.340614439, 4.607468576, 3.14338155)), row.names = c(NA,
5L), class = "data.frame")
I'm trying to divide each of the following cells with the one before
eg. [5,2] / [4,2]; [4,2] / [3,2]... etc
and I'm trying to create a new output like this:
bar = structure(list(date = structure(c(5L, 1L, 2L, 3L, 4L), .Label = c("10/03/2020",
"11/03/2020", "12/03/2020", "13/03/2020", "9/03/2020"), class = "factor"),
A = c(0, 1.053073412, 0.03620976, 0.428219052, 32.46827283
), B = c(0, 1.262061878, 0.56683473, 0.90500546, 1.847688946
), C = c(0, 0.718435398, 0.634323465, 2.364307371, 1.037687789
), D = c(0, 1.154877063, 0.897573501, 1.061478424, 0.682236134
)), row.names = c(NA, 5L), class = "data.frame")
I'm sure there's a better way than finding the length of the column and looping through. Can anyone point me in the right direction?
You can use mutate_if or mutate_at from the dplyr package. Note that x/lag(x) leaves NA, not 0, in the first row.
library(dplyr)
foo %>%
mutate_if(!grepl("date", names(.)), function(x) x/lag(x))
OR
foo %>%
mutate_at(vars(-date), function(x) x/lag(x))
In base R, we can use head and tail to divide data.
foo[-1] <- lapply(foo[-1], function(x) c(0, tail(x, -1)/head(x, -1)))
foo
# date A B C D
#1 9/03/2020 0.00000000 0.0000000 0.0000000 0.0000000
#2 10/03/2020 1.05307341 1.2620619 0.7184354 1.1548771
#3 11/03/2020 0.03620976 0.5668347 0.6343235 0.8975735
#4 12/03/2020 0.42821905 0.9050055 2.3643074 1.0614784
#5 13/03/2020 32.46827283 1.8476889 1.0376878 0.6822361
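To see why the head/tail trick works, here it is on a small made-up vector: tail(x, -1) drops the first element, head(x, -1) drops the last, so dividing them pairs every value with its predecessor.

```r
x <- c(2, 4, 8, 4)          # made-up values for illustration

current  <- tail(x, -1)     # 4 8 4 (everything except the first element)
previous <- head(x, -1)     # 2 4 8 (everything except the last element)

# Pad with 0 for the first row, which has no predecessor
ratio <- c(0, current / previous)
ratio                       # 0.0 2.0 2.0 0.5
```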
Another tidyverse approach.
library(tidyverse)
bar <- foo %>%
mutate_if(is.double, ~ replace_na(./lag(.), replace = 0))
bar
#> date A B C D
#> 1 9/03/2020 0.00000000 0.0000000 0.0000000 0.0000000
#> 2 10/03/2020 1.05307341 1.2620619 0.7184354 1.1548771
#> 3 11/03/2020 0.03620976 0.5668347 0.6343235 0.8975735
#> 4 12/03/2020 0.42821905 0.9050055 2.3643074 1.0614784
#> 5 13/03/2020 32.46827283 1.8476889 1.0376878 0.6822361

Calculating age over multiple dataframes based on name of dataframe

I was wondering if someone here can help me with a lapply question.
Every month, data are extracted and the data frames are named according to the date extracted (01-08-2019,01-09-2019,01-10-2019 etc). The contents of each data frame are similar to the example below:
01-09-2019
ID DOB
3 01-07-2019
5 01-06-2019
7 01-05-2019
8 01-09-2019
01-10-2019
ID DOB
2 01-10-2019
5 01-06-2019
8 01-09-2019
9 01-02-2019
As the months roll on, there are more data sets being downloaded.
I am wanting to calculate the ages of people in each of the data sets based on the date the data was extracted - so in essence, the age would be the date difference between the data frame name and the DOB variable.
01-09-2019
ID DOB AGE(months)
3 01-07-2019 2
5 01-06-2019 3
7 01-05-2019 4
8 01-09-2019 0
01-10-2019
ID DOB AGE(months)
2 01-10-2019 0
5 01-06-2019 4
8 01-09-2019 1
9 01-02-2019 8
I was thinking of putting all of the data frames together in a list (as there are a lot) and then using lapply to calculate age across all data frames. How do I go about calculating the difference between a data frame name and a column?
If I may suggest a slightly different approach: it might make more sense to combine your list into a single data frame before calculating the ages. Assume your data looks something like this, i.e. a list of data frames where the list element names are the extraction dates:
$`01-09-2019`
# A tibble: 4 x 2
ID DOB
<dbl> <date>
1 3 2019-07-01
2 5 2019-06-01
3 7 2019-05-01
4 8 2019-09-01
$`01-10-2019`
# A tibble: 4 x 2
ID DOB
<dbl> <date>
1 2 2019-10-01
2 5 2019-06-01
3 8 2019-09-01
4 9 2019-02-01
You can call bind_rows first with parameter .id = "date_extracted" to turn your list into a data frame, and then calculate age in months.
library(tidyverse)
library(lubridate)
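tib <- bind_rows(tib_list, .id = "date_extracted") %>%
mutate(date_extracted = dmy(date_extracted),
# DOB is already a Date here; if it were a "dd-mm-yyyy" string, convert it with dmy(DOB) first
# include the year difference so ages spanning a year boundary are correct
age_months = 12 * (year(date_extracted) - year(DOB)) +
month(date_extracted) - month(DOB)
)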
#### OUTPUT ####
# A tibble: 8 x 4
date_extracted ID DOB age_months
<date> <dbl> <date> <dbl>
1 2019-09-01 3 2019-07-01 2
2 2019-09-01 5 2019-06-01 3
3 2019-09-01 7 2019-05-01 4
4 2019-09-01 8 2019-09-01 0
5 2019-10-01 2 2019-10-01 0
6 2019-10-01 5 2019-06-01 4
7 2019-10-01 8 2019-09-01 1
8 2019-10-01 9 2019-02-01 8
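Month differences that span a year boundary need a year term as well as the month numbers. A small base R sketch of that arithmetic; months_between is a hypothetical helper name, and it counts calendar months while ignoring the day of the month.

```r
# Hypothetical helper: whole calendar months from `from` to `to`
months_between <- function(from, to) {
  from <- as.POSIXlt(from)
  to   <- as.POSIXlt(to)
  # $year is years since 1900 and $mon is 0-11, so only differences matter
  12 * (to$year - from$year) + (to$mon - from$mon)
}

same_year  <- months_between(as.Date("2019-06-01"), as.Date("2019-10-01"))
cross_year <- months_between(as.Date("2019-06-01"), as.Date("2020-02-01"))
c(same_year, cross_year)   # 4 8
```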
This can be solved with lapply as well but we can also use Map in this case to iterate over list and their names after adding all the dataframes in a list. In base R,
Map(function(x, y) {
# the dates are day-month-year, so the format must be given explicitly
x$DOB <- as.Date(x$DOB, format = "%d-%m-%Y")
transform(x, age = as.integer(format(as.Date(y, format = "%d-%m-%Y"), "%m")) -
as.integer(format(x$DOB, "%m")))
}, list_df, names(list_df))
#$`01-09-2019`
# ID DOB age
#1 3 2019-07-01 2
#2 5 2019-06-01 3
#3 7 2019-05-01 4
#4 8 2019-09-01 0
#$`01-10-2019`
# ID DOB age
#1 2 2019-10-01 0
#2 5 2019-06-01 4
#3 8 2019-09-01 1
#4 9 2019-02-01 8
We can also do the same in tidyverse
library(dplyr)
library(lubridate)
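# parse both the list name and DOB as day-month-year before taking the month
purrr::imap(list_df, ~.x %>% mutate(age = month(dmy(.y)) - month(dmy(as.character(DOB)))))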
data
list_df <- list(`01-09-2019` = structure(list(ID = c(3L, 5L, 7L, 8L),
DOB = structure(c(3L, 2L, 1L, 4L), .Label = c("01-05-2019", "01-06-2019",
"01-07-2019", "01-09-2019"), class = "factor")), class = "data.frame",
row.names = c(NA, -4L)), `01-10-2019` = structure(list(ID = c(2L, 5L, 8L, 9L),
DOB = structure(c(4L, 2L, 3L, 1L), .Label = c("01-02-2019",
"01-06-2019", "01-09-2019", "01-10-2019"), class = "factor")),
class = "data.frame", row.names = c(NA, -4L)))
It's bad practice to use dates and numbers as data frame names. Consider prefixing the date with an "x", as shown below in this base R solution:
df_list <- list(x01_09_2019 = `01-09-2019`, x01_10_2019 = `01-10-2019`)
df_list <- mapply(cbind, "report_date" = names(df_list), df_list, SIMPLIFY = F)
df_list <- lapply(df_list, function(x){
x$report_date <- as.Date(gsub("_", "-", gsub("x", "", x$report_date)), "%d-%m-%Y")
x$Age <- x$report_date - x$DOB
return(x)
}
)
Data:
`01-09-2019` <- structure(list(ID = c(3, 5, 7, 8),
DOB = structure(c(18078, 18048, 18017, 18140), class = "Date")),
class = "data.frame", row.names = c(NA, -4L))
`01-10-2019` <- structure(list(ID = c(2, 5, 8, 9),
DOB = structure(c(18170, 18048, 18140, 17928), class = "Date")),
class = "data.frame", row.names = c(NA, -4L))

How to divide contents of one column by different values, conditional on contents of a second column?

I've got a data frame that looks like something along these lines:
Day Salesperson Value
==== ============ =====
Monday John 40
Monday Sarah 50
Tuesday John 60
Tuesday Sarah 30
Wednesday John 50
Wednesday Sarah 40
I want to divide the value for each salesperson by the number of times that each day of the week has occurred. So: there have been 3 Mondays, 3 Tuesdays, and 2 Wednesdays. I don't have this information digitally, but I can create a vector along the lines of
c(3, 3, 2)
How can I conditionally divide the Value column based on the number of times each day occurs?
I've found an inelegant solution, which entails copying the Day column to a temp column, replacing each of the names of the week in the new column with the number of times each day occurs using
df$temp <- sub("Monday", 3, df$temp)
but doing this seems kinda clunky. Is there a neat way to do this?
Suppose your auxiliary data is in another data.frame:
Day N_Day
1 Monday 3
2 Tuesday 3
3 Wednesday 2
The simplest way would be to merge:
DF_new <- merge(DF, DF2, by="Day")
DF_new$newcol <- DF_new$Value / DF_new$N_Day
which gives
Day Salesperson Value N_Day newcol
1 Monday John 40 3 13.33333
2 Monday Sarah 50 3 16.66667
3 Tuesday John 60 3 20.00000
4 Tuesday Sarah 30 3 10.00000
5 Wednesday John 50 2 25.00000
6 Wednesday Sarah 40 2 20.00000
The mergeless shortcut is
DF$newcol <- DF$Value / DF2$N_Day[match(DF$Day, DF2$Day)]
Data:
DF <- structure(list(Day = structure(c(1L, 1L, 2L, 2L, 3L, 3L), .Label =
c("Monday",
"Tuesday", "Wednesday"), class = "factor"), Salesperson = structure(c(1L,
2L, 1L, 2L, 1L, 2L), .Label = c("John", "Sarah"), class = "factor"),
Value = c(40L, 50L, 60L, 30L, 50L, 40L)), .Names = c("Day",
"Salesperson", "Value"), class = "data.frame", row.names = c(NA,
-6L))
DF2 <- structure(list(Day = structure(1:3, .Label = c("Monday", "Tuesday",
"Wednesday"), class = "factor"), N_Day = c(3, 3, 2)), .Names = c("Day",
"N_Day"), row.names = c(NA, -3L), class = "data.frame")
You can use the dplyr library to join your data frame with the frequency of each day.
library(dplyr)
df <- data.frame(
Day=c("Monday","Monday","Tuesday","Tuesday","Wednesday","Wednesday"),
Salesperson=c("John","Sarah","John","Sarah","John","Sarah"),
Value=c(40,50,60,30,50,40), stringsAsFactors=F)
aux <- data.frame(
Day=c("Monday","Tuesday","Wednesday"),
n=c(3,3,2)
)
output <- df %>% left_join(aux, by="Day") %>% mutate(Value2=Value/n)
To create this auxiliary table from the number of times each day appears in your original data, instead of typing it by hand, you could use:
aux <- df %>% group_by(Day) %>% summarise(n=n())
With this counted aux (n = 2 for every day), the same join gives:
> output
Day Salesperson Value n Value2
1 Monday John 40 2 20
2 Monday Sarah 50 2 25
3 Tuesday John 60 2 30
4 Tuesday Sarah 30 2 15
5 Wednesday John 50 2 25
6 Wednesday Sarah 40 2 20
If you want to overwrite the actual Value column, use mutate(Value=Value/n), and to remove the additional column you can add select(-n):
output <- df %>% left_join(aux, by="Day") %>% mutate(Value=Value/n) %>% select(-n)
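If the day counts exist only as a vector, as in the question, a named vector can serve as the lookup table without any join. A minimal base R sketch; n_day and newcol are names chosen for the example.

```r
DF <- data.frame(
  Day = c("Monday", "Monday", "Tuesday", "Tuesday", "Wednesday", "Wednesday"),
  Salesperson = rep(c("John", "Sarah"), 3),
  Value = c(40, 50, 60, 30, 50, 40)
)

# Name each count after its day so it can be looked up by name
n_day <- c(Monday = 3, Tuesday = 3, Wednesday = 2)

# Indexing by the Day column repeats the right count for every row;
# as.character() guards against Day being a factor
DF$newcol <- DF$Value / unname(n_day[as.character(DF$Day)])
DF
```

Indexing by name rather than position means the order of the counts in n_day does not have to match the order of the rows.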

R - Adding numbers within a data frame cell together

I have a data frame in which the values are stored as characters. However, many values contain two numbers that need to be added together. Example:
2014 Q1 Sales 2014 Q2 Sales 2014 Q3 Sales 2014 Q4 Sales
Product 1 3+6 2+10 8 13+2
Product 2 6 4+0 <NA> 5
Product 3 <NA> 5+9 3+1 11
Is there a way to go through the whole data frame and replace all cells containing characters like "3+6" with new values equal to their sum? I assume this would involve coercing the characters to numeric or integers, but I don't know how that would be possible for values with the + sign in them. I would like the example data frame to end up looking like this:
2014 Q1 Sales 2014 Q2 Sales 2014 Q3 Sales 2014 Q4 Sales
Product 1 9 12 8 15
Product 2 6 4 <NA> 5
Product 3 <NA> 14 4 11
Here's an easier example:
dat <- data.frame(a=c("3+6", "10"), b=c("12", NA), c=c("3+4", "5+6"))
dat
## a b c
## 1 3+6 12 3+4
## 2 10 <NA> 5+6
apply(dat, 1:2, function(x) eval(parse(text=x)))
## a b c
## [1,] 9 12 7
## [2,] 10 NA 11
Using R itself to do the computation with eval and parse does the trick.
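Since eval(parse()) runs whatever text a cell contains, a version that only splits on "+" and sums the pieces may be safer for untrusted data. A base R sketch under that assumption; sum_cell is a hypothetical helper and it handles only plain numbers joined by "+".

```r
# Hypothetical helper: "3+6" -> 9, "10" -> 10, NA -> NA
sum_cell <- function(x) {
  if (is.na(x)) return(NA_real_)
  sum(as.numeric(strsplit(x, "+", fixed = TRUE)[[1]]))
}

dat <- data.frame(a = c("3+6", "10"), b = c("12", NA), c = c("3+4", "5+6"),
                  stringsAsFactors = FALSE)

# Apply the helper to every cell, column by column; dat[] keeps the data frame shape
dat[] <- lapply(dat, function(col) {
  vapply(as.character(col), sum_cell, numeric(1), USE.NAMES = FALSE)
})
dat
```

The as.character() call means the same code also works when the columns are factors.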
Here is an option with gsubfn that avoids eval(parse()). We convert the data.frame to a matrix (as.matrix(dat)), match a run of digits ([0-9]+) captured as a group with parentheses, followed by +, followed by a second captured run of digits, and replace the match with the sum of the two groups after converting them to numeric. Assigning the result back to dat[] keeps the same structure as in 'dat'.
library(gsubfn)
dat[] <- as.numeric(gsubfn('([0-9]+)\\+([0-9]+)',
~as.numeric(x)+as.numeric(y), as.matrix(dat)))
dat
# 2014 Q1 Sales 2014 Q2 Sales 2014 Q3 Sales 2014 Q4 Sales
#Product 1 9 12 8 15
#Product 2 6 4 NA 5
#Product 3 NA 14 4 11
Or we can loop the columns with lapply and perform the replacement with gsubfn for each of the columns.
dat[] <- lapply(dat, function(x) as.numeric(gsubfn('([0-9]+)\\+([0-9]+)',
~as.numeric(x)+as.numeric(y), as.character(x))))
data
dat <- structure(list(`2014 Q1 Sales` = structure(c(1L, 2L, NA), .Label = c("3+6",
"6"), class = "factor"), `2014 Q2 Sales` = structure(1:3, .Label = c("2+10",
"4+0", "5+9"), class = "factor"), `2014 Q3 Sales` = structure(c(2L,
NA, 1L), .Label = c("3+1", "8"), class = "factor"), `2014 Q4 Sales` = structure(c(2L,
3L, 1L), .Label = c("11", "13+2", "5"), class = "factor")), .Names = c("2014 Q1 Sales",
"2014 Q2 Sales", "2014 Q3 Sales", "2014 Q4 Sales"), class = "data.frame", row.names = c("Product 1",
"Product 2", "Product 3"))
