summarize multiple dynamic columns and store results in new columns - r

I have the following situation.
library(lubridate)  # months(1) below is a lubridate period
df <- rbind(
data.frame(thisDate = seq(as.Date("2018-1-1"), as.Date("2018-1-2"), by="day")),
data.frame(thisDate = seq(as.Date("2018-2-1"), as.Date("2018-2-2"), by="day")))
df <- cbind(df, lastMonth = as.Date(format(df$thisDate - months(1), "%Y-%m-01")))
df <- cbind(df, prod1Quantity = 1:4)
I have quantities for different days of a month for an unknown number of products. For every product, I want one new column holding that product's total quantity over the whole previous month, i.e. prod1Quantity summed by month and matched on lastMonth, as in the desired output below. I just don't get how to group by, mutate and summarise dynamically, if that is indeed the right approach.
I came across "data.table generate multiple columns and summarize them". It appears to do what I need, but I don't understand how it works!
Desired Output
    thisDate  lastMonth prod1Quantity prod1prevMonth
1 2018-01-01 2017-12-01             1             NA
2 2018-01-02 2017-12-01             2             NA
3 2018-02-01 2018-01-01             3              3
4 2018-02-02 2018-01-01             4              3

Another approach could be
library(dplyr)
library(lubridate)
temp_df <- df %>%
mutate(thisDate_forJoin = as.Date(format(thisDate,"%Y-%m-01")))
final_df <- temp_df %>%
mutate(thisDate_forJoin = thisDate_forJoin %m-% months(1)) %>%
left_join(temp_df %>%
group_by(thisDate_forJoin) %>%
summarise_if(is.numeric, sum),
by="thisDate_forJoin") %>%
select(-thisDate_forJoin)
Output is:
    thisDate prod1Quantity.x prod2Quantity.x prod1Quantity.y prod2Quantity.y
1 2018-01-01               1              10              NA              NA
2 2018-01-02               2              11              NA              NA
3 2018-02-01               3              12               3              21
4 2018-02-02               4              13               3              21
Sample data:
df <- structure(list(thisDate = structure(c(17532, 17533, 17563, 17564
), class = "Date"), prod1Quantity = 1:4, prod2Quantity = 10:13), class = "data.frame", row.names = c(NA,
-4L))
# thisDate prod1Quantity prod2Quantity
#1 2018-01-01 1 10
#2 2018-01-02 2 11
#3 2018-02-01 3 12
#4 2018-02-02 4 13

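One solution is to calculate the monthly quantity and then join it back to the data, matching the month of lastMonth against the month of thisDate.
lubridate::month is used to extract the month number from each date. Note that this joins on the month number alone, so it assumes the data span no more than one year.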
library(dplyr)
library(lubridate)
df %>% group_by(month = as.integer(month(thisDate))) %>%
summarise(prodQuantMonth = sum(prod1Quantity)) %>%
right_join(., mutate(df, prevMonth = month(lastMonth)), by=c("month" = "prevMonth")) %>%
select(thisDate, lastMonth, prod1Quantity, prodQuantLastMonth = prodQuantMonth)
# # A tibble: 4 x 4
# thisDate lastMonth prod1Quantity prodQuantLastMonth
# <date> <date> <int> <int>
# 1 2018-01-01 2017-12-01 1 NA
# 2 2018-01-02 2017-12-01 2 NA
# 3 2018-02-01 2018-01-01 3 3
# 4 2018-02-02 2018-01-01 4 3
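For an unknown number of product columns, the same idea can be sketched in base R by keying on the first day of each month instead of the month number, which keeps different years apart. This is only a sketch: the helpers month_start and prev_month_start are hypothetical names introduced here, and it assumes every product column ends in "Quantity".

```r
df <- data.frame(
  thisDate = as.Date(c("2018-01-01", "2018-01-02", "2018-02-01", "2018-02-02")),
  prod1Quantity = 1:4,
  prod2Quantity = 10:13
)

# Hypothetical helper: first day of the month for a Date vector.
month_start <- function(d) as.Date(format(d, "%Y-%m-01"))

# Hypothetical helper: first day of the previous month
# (step back one day from the month start, then snap to the month start again).
prev_month_start <- function(d) month_start(month_start(d) - 1)

# Assumption: product columns are the ones whose names end in "Quantity".
prod_cols <- grep("Quantity$", names(df), value = TRUE)

# Monthly totals for every product column, keyed by month start.
totals <- aggregate(df[prod_cols],
                    by = list(month = month_start(df$thisDate)),
                    FUN = sum)
names(totals)[-1] <- sub("Quantity$", "prevMonth", prod_cols)

# Look up each row's previous month; months without data give NA.
idx <- match(prev_month_start(df$thisDate), totals$month)
result <- cbind(df, totals[idx, -1, drop = FALSE])
rownames(result) <- NULL
result
```

Because the product columns are found by pattern, adding a prod3Quantity column to df would produce a prod3prevMonth column without changing the code.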

Related

Grouping and Counting by Dates (R)

I am working with the R programming language. I have a data frame that looks like this:
startdate <- c('2010-01-01','2010-01-01','2010-01-01', '2010-01-02','2010-01-03','2010-01-03')
event <- c(1,1,1,1,1,1)
my_data <- data.frame(startdate, event)
startdate event
1 2010-01-01 1
2 2010-01-01 1
3 2010-01-01 1
4 2010-01-02 1
5 2010-01-03 1
6 2010-01-03 1
Note: The actual class of "startdate" is "POSIXct" and it is written as "year-month-day".
I am trying to take a cumulative sum of "event" according to the "startdate" column. The result should look like this
startdate <- c('2010-01-01', '2010-01-02' ,'2010-01-03')
event <- c(3,4,6)
my_data_2 <- data.frame(startdate, event)
#desired file
startdate event
1 2010-01-01 3
2 2010-01-02 4
3 2010-01-03 6
I tried to do this with the "dplyr" library:
library(dplyr)
new_file = my_data %>% group_by(startdate) %>% mutate(cumsum_value = cumsum(event))
But this returns something slightly different and unintended:
startdate event cumsum_value
<chr> <dbl> <dbl>
1 2010-01-01 1 1
2 2010-01-01 1 2
3 2010-01-01 1 3
4 2010-01-02 1 1
5 2010-01-03 1 1
6 2010-01-03 1 2
Can someone please show me how to fix this?
Thanks
my_data %>%
mutate(cumsum = cumsum(event)) %>%
group_by(startdate) %>%
summarise(max(cumsum))
# A tibble: 3 × 2
startdate `max(cumsum)`
<chr> <dbl>
1 2010-01-01 3
2 2010-01-02 4
3 2010-01-03 6
Compute the running total of event with mutate, then group_by startdate and use summarise to keep the maximum per date:
```
library(dplyr)
my_data %>%
mutate(event = cumsum(event)) %>%
group_by(startdate) %>%
summarise(event = max(event))
```
```
startdate event
<chr> <dbl>
1 2010-01-01 3
2 2010-01-02 4
3 2010-01-03 6
```
Another option is to use duplicated, which avoids the group_by. Also, if the 'event' column is always 1, we can use the built-in row_number() instead of cumsum to create the sequence:
library(dplyr)
my_data %>%
mutate(event = row_number()) %>%
filter(!duplicated(startdate, fromLast = TRUE))
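The same result can also be sketched in base R without dplyr: sum the events per date, then take the running total of those sums. This assumes startdate sorts chronologically when sorted as text, which holds for the year-month-day strings in the sample data.

```r
my_data <- data.frame(
  startdate = c("2010-01-01", "2010-01-01", "2010-01-01",
                "2010-01-02", "2010-01-03", "2010-01-03"),
  event = c(1, 1, 1, 1, 1, 1)
)

# Total events per date; tapply orders the groups by sorted startdate.
daily <- tapply(my_data$event, my_data$startdate, sum)

# Running total across dates, rebuilt as a plain data frame.
my_data_2 <- data.frame(
  startdate = names(daily),
  event = as.vector(cumsum(daily))
)
my_data_2
```

Unlike the row_number() variant, this does not rely on every event being 1.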

How to convert a single date column into three individual columns (y, m, d)?

I have a large dataset with thousands of dates in the ymd format. I want to split this column into three individual columns for year, month, and day. There are literally thousands of dates, so I am trying to do this with a single piece of code for the entire dataset.
You can use the year(), month(), and day() extractors in lubridate for this. Here's an example:
library('dplyr')
library('tibble')
library('lubridate')
## create some data
df <- tibble(date = seq(ymd(20190101), ymd(20191231), by = '7 days'))
which yields
> df
# A tibble: 53 x 1
date
<date>
1 2019-01-01
2 2019-01-08
3 2019-01-15
4 2019-01-22
5 2019-01-29
6 2019-02-05
7 2019-02-12
8 2019-02-19
9 2019-02-26
10 2019-03-05
# … with 43 more rows
Then mutate df using the relevant extractor function:
df <- mutate(df,
year = year(date),
month = month(date),
day = day(date))
This results in:
> df
# A tibble: 53 x 4
date year month day
<date> <dbl> <dbl> <int>
1 2019-01-01 2019 1 1
2 2019-01-08 2019 1 8
3 2019-01-15 2019 1 15
4 2019-01-22 2019 1 22
5 2019-01-29 2019 1 29
6 2019-02-05 2019 2 5
7 2019-02-12 2019 2 12
8 2019-02-19 2019 2 19
9 2019-02-26 2019 2 26
10 2019-03-05 2019 3 5
# … with 43 more rows
If you only want the new three columns, use transmute() instead of mutate().
Using lubridate but without having to specify a separator:
library(tidyverse)
df <- tibble(d = c('2019/3/18','2018/10/29'))
df %>%
mutate(
date = lubridate::ymd(d),
year = lubridate::year(date),
month = lubridate::month(date),
day = lubridate::day(date)
)
Note that you can change the first entry from ymd to fit other formats.
A slightly different tidyverse solution that requires less code could be:
Code
library(tidyverse)
library(lubridate)  # year(), month(), day()
tibble(date = "2018-05-01") %>%
mutate_at(vars(date), lst(year, month, day))
Result
# A tibble: 1 x 4
date year month day
<chr> <dbl> <dbl> <int>
1 2018-05-01 2018 5 1
#Data
d = data.frame(date = c("2019-01-01", "2019-02-01", "2012/03/04"))
library(lubridate)
cbind(d,
read.table(header = FALSE,
sep = "-",
text = as.character(ymd(d$date))))
# date V1 V2 V3
#1 2019-01-01 2019 1 1
#2 2019-02-01 2019 2 1
#3 2012/03/04 2012 3 4
OR
library(dplyr)
library(tidyr)
library(lubridate)
d %>%
mutate(date2 = as.character(ymd(date))) %>%
separate(date2, c("year", "month", "day"), "-")
# date year month day
#1 2019-01-01 2019 01 01
#2 2019-02-01 2019 02 01
#3 2012/03/04 2012 03 04
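If adding lubridate is not an option, a minimal base R sketch uses format() with the %Y, %m and %d conversion codes. It assumes the column is already of class Date (or has been converted with as.Date first).

```r
d <- data.frame(date = as.Date(c("2019-01-01", "2019-02-01", "2012-03-04")))

# format() returns text such as "2019" or "01"; as.integer() drops the padding.
d$year  <- as.integer(format(d$date, "%Y"))
d$month <- as.integer(format(d$date, "%m"))
d$day   <- as.integer(format(d$date, "%d"))
d
```

Dropping the as.integer() calls keeps the zero-padded strings, like the separate() output above.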

generate rows for each id by date sequence

My data frame looks something like this:
df <- read.table(text="
id start end
1 2 2018-10-01 2018-12-01
2 3 2018-01-01 2018-04-01
", header=TRUE)
What I am trying to achieve is to get the difference between start and end in months for each id, and then generate a new data frame with one row per month for each id. The result should be:
result <- read.table(text="
id date
1 2 2018-10-01
2 2 2018-11-01
3 2 2018-12-01
4 3 2018-01-01
5 3 2018-02-01
6 3 2018-03-01
7 3 2018-04-01
", header=TRUE)
The most straightforward way using base R functions is to create a sequence of monthly dates for each row, put each sequence in a data frame, and rbind them together:
do.call(rbind, with(df,lapply(1:nrow(df), function(i)
data.frame(id = id[i], date = seq(as.Date(start[i]), as.Date(end[i]), by = "month")))))
# id date
#1 2 2018-10-01
#2 2 2018-11-01
#3 2 2018-12-01
#4 3 2018-01-01
#5 3 2018-02-01
#6 3 2018-03-01
#7 3 2018-04-01
We can do this easily with Map. Pass the Date-converted 'start' and 'end' columns of the dataset as arguments to Map to get the monthly sequences as a list, then repeat each 'id' according to the lengths of the list elements and concatenate the list elements to create the expanded data frame:
lst1 <- Map(seq, MoreArgs = list(by = 'month'), as.Date(df$start), as.Date(df$end))
data.frame(id = rep(df$id, lengths(lst1)), date = do.call(c, lst1))
# id date
#1 2 2018-10-01
#2 2 2018-11-01
#3 2 2018-12-01
#4 3 2018-01-01
#5 3 2018-02-01
#6 3 2018-03-01
#7 3 2018-04-01
Or, using tidyverse, we convert the 'start' and 'end' columns to Date, use map2 (from purrr) to get the monthly sequence from 'start' to 'end', and expand the data by unnesting the list column:
library(tidyverse)
df %>%
mutate_at(2:3, as.Date) %>%
transmute(id = id, date = map2(start, end, ~ seq(.x, .y, by = 'month'))) %>%
unnest(cols = date)  # recent tidyr versions expect the column to be named
# id date
#1 2 2018-10-01
#2 2 2018-11-01
#3 2 2018-12-01
#4 3 2018-01-01
#5 3 2018-02-01
#6 3 2018-03-01
#7 3 2018-04-01

Performing in group operations in R

I have a table sf with 2 fields: Customer id and Buy_date. Buy_date is unique within each customer, but a customer can have more than 3 different Buy_date values. I want to calculate the difference between consecutive Buy_dates for each Customer, and its mean value. How can I do this?
Example
Customer Buy_date
1 2018/03/01
1 2018/03/19
1 2018/04/3
1 2018/05/10
2 2018/01/02
2 2018/02/10
2 2018/04/13
I want the results for each customer in the format
Customer mean
Here's a dplyr solution.
Your data:
df <- data.frame(Customer = c(1,1,1,1,2,2,2), Buy_date = c("2018/03/01", "2018/03/19", "2018/04/3", "2018/05/10", "2018/01/02", "2018/02/10", "2018/04/13"))
Grouping, mean Buy_date calculation and summarising:
library(dplyr)
df %>% group_by(Customer) %>% mutate(mean = mean(as.POSIXct(Buy_date))) %>% group_by(Customer, mean) %>% summarise()
Output:
# A tibble: 2 x 2
# Groups: Customer [?]
Customer mean
<dbl> <dttm>
1 1 2018-03-31 06:30:00
2 2 2018-02-17 15:40:00
Or, as @r2evans points out in his comment, for the mean number of days between consecutive Buy_dates:
df %>% group_by(Customer) %>% mutate(mean = mean(diff(as.POSIXct(Buy_date)))) %>% group_by(Customer, mean) %>% summarise()
Output:
# A tibble: 2 x 2
# Groups: Customer [?]
Customer mean
<dbl> <time>
1 1 23.3194444444444
2 2 50.4791666666667
I am not exactly sure of the desired output, but this is what I think you want.
library(dplyr)
library(zoo)
dat <- read.table(text =
"Customer Buy_date
1 2018/03/01
1 2018/03/19
1 2018/04/3
1 2018/05/10
2 2018/01/02
2 2018/02/10
2 2018/04/13", header = T, stringsAsFactors = F)
dat$Buy_date <- as.Date(dat$Buy_date)
dat %>% group_by(Customer) %>% mutate(diff_between = as.vector(diff(zoo(Buy_date), na.pad=TRUE)),
mean_days = mean(diff_between, na.rm = TRUE))
This produces:
Customer Buy_date diff_between mean_days
<int> <date> <dbl> <dbl>
1 1 2018-03-01 NA 23.3
2 1 2018-03-19 18 23.3
3 1 2018-04-03 15 23.3
4 1 2018-05-10 37 23.3
5 2 2018-01-02 NA 50.5
6 2 2018-02-10 39 50.5
7 2 2018-04-13 62 50.5
EDITED BASED ON USER COMMENTS:
Because you said that you have factors and not characters, just convert them by doing the following:
dat$Buy_date <- as.Date(as.character(dat$Buy_date))
dat$Customer <- as.character(dat$Customer)
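For only the two-column Customer/mean result, a minimal base R sketch is to sort the dates within each customer and average the day gaps. Working with Date rather than POSIXct gives whole-day differences; the slightly smaller POSIXct figures above (23.319 and 50.479) are presumably caused by a daylight-saving change inside the period in the answerer's time zone. The object names here are illustrative.

```r
dat <- data.frame(
  Customer = c(1, 1, 1, 1, 2, 2, 2),
  Buy_date = as.Date(c("2018-03-01", "2018-03-19", "2018-04-03", "2018-05-10",
                       "2018-01-02", "2018-02-10", "2018-04-13"))
)

# Sort so that diff() compares consecutive purchases of the same customer.
dat <- dat[order(dat$Customer, dat$Buy_date), ]

# Mean gap in days between consecutive purchases, per customer.
mean_gap <- tapply(as.numeric(dat$Buy_date), dat$Customer,
                   function(d) mean(diff(d)))

result <- data.frame(Customer = names(mean_gap), mean = as.vector(mean_gap))
result
```

A customer with a single purchase has no gaps, so mean(diff(d)) returns NaN for that customer.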

insert rows between dates by group

I want to insert rows between two dates by group. My current approach is complicated: I fill in the missing values with last observation carried forward and then merge. Is there an easier way to achieve this?
# sample data
user<-c("A","A","B","B","B")
dummy<-c(1,1,1,1,1)
date<-as.Date(c("2017/1/3","2017/1/6","2016/5/1","2016/5/3","2016/5/5"))
dt<-data.frame(user,dummy,date)
user dummy date
1 A 1 2017-01-03
2 A 1 2017-01-06
3 B 1 2016-05-01
4 B 1 2016-05-03
5 B 1 2016-05-05
Desired output
Using dplyr and tidyr (a one-line solution):
library(dplyr)
library(tidyr)
dt %>% group_by(user) %>% complete(date=full_seq(date,1),fill=list(dummy=0))
# A tibble: 9 x 3
# Groups: user [2]
user date dummy
<fctr> <date> <dbl>
1 A 2017-01-03 1
2 A 2017-01-04 0
3 A 2017-01-05 0
4 A 2017-01-06 1
5 B 2016-05-01 1
6 B 2016-05-02 0
7 B 2016-05-03 1
8 B 2016-05-04 0
9 B 2016-05-05 1
you can try this
library(data.table)
setDT(dt)
tmp <- dt[, .(date = seq.Date(min(date), max(date), by = '1 day')), by = 'user']
dt <- merge(tmp, dt, by = c('user', 'date'), all.x = TRUE)
dt[, dummy := ifelse(is.na(dummy), 0, dummy)]
We can use the tidyverse to achieve this task.
library(tidyverse)
dt2 <- dt %>%
group_by(user) %>%
do(date = seq(from = min(.$date), to = max(.$date), by = 1)) %>%
unnest() %>%
left_join(dt, by = c("user", "date")) %>%
replace_na(list(dummy = 0)) %>%
select(colnames(dt))
dt2
# A tibble: 9 x 3
user dummy date
<fctr> <dbl> <date>
1 A 1 2017-01-03
2 A 0 2017-01-04
3 A 0 2017-01-05
4 A 1 2017-01-06
5 B 1 2016-05-01
6 B 0 2016-05-02
7 B 1 2016-05-03
8 B 0 2016-05-04
9 B 1 2016-05-05
The simplest way that I have found to do this is with the padr library.
library(padr)
library(dplyr)  # %>%
library(tidyr)  # replace_na()
dt_padded <- pad(dt, group = "user", by = "date") %>%
replace_na(list(dummy=0))
A Base R (not quite as elegant) solution:
# Data
user<-c("A","A","B","B","B")
dummy<-c(1,1,1,1,1)
date<-as.Date(c("2017/1/3","2017/1/6","2016/5/1","2016/5/3","2016/5/5"))
df1 <-data.frame(user,dummy,date)
# Solution
do.call(rbind, lapply(split(df1, df1$user), function(df) {
dff <- data.frame(user=df$user[1], dummy=0, date=seq.Date(min(df$date), max(df$date), 'day'))
dff[dff$date %in% df$date, "dummy"] <- df$dummy[1]
dff
}))
# user dummy date
# A 1 2017-01-03
# A 0 2017-01-04
# A 0 2017-01-05
# A 1 2017-01-06
# B 1 2016-05-01
# B 0 2016-05-02
# B 1 2016-05-03
# B 0 2016-05-04
# B 1 2016-05-05
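Assuming your data is called df1 and you want to add the dates between two days, try this:
library(dplyr)
# left_join needs data frames on both sides, so wrap the date sequence in one
df2 <- data.frame(date = seq.Date(as.Date("2015-01-03"), as.Date("2015-01-06"), by = "day"))
left_join(df2, df1, by = "date")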
If you're simply trying to add a new record, I suggest using rbind.
rbind()
