R - Create new column with cumulative means by group

I have the following data frame, listing the spends for each category for each day
Dataframe: actualSpends
Date Category Spend ($)
2017/01/01 Apple 10
2017/01/02 Apple 12
2017/01/03 Apple 8
2017/01/01 Banana 13
2017/01/02 Banana 15
2017/01/03 Banana 7
I want to create a new data frame that lists the average amount spent for each category, for each day of the month.
(e.g. on the 3rd of the month, the average of the spends over all days that have passed so far in that month, from the 1st up to that day.)
EDIT:
So the output should look something like..
Date Category AvgSpend ($)
2017/01/01 Apple 10
2017/01/02 Apple 11
2017/01/03 Apple 10
2017/01/01 Banana 13
2017/01/02 Banana 14
2017/01/03 Banana 11.7
Where, for each category, the average spend on each day is the average over all days so far: the 1st is the average of the 1st; the 2nd is the average of the 1st and 2nd; the 3rd is the average of the 1st, 2nd and 3rd.
Is there any workaround for this?

We can use the cummean function from the dplyr package to calculate the cumulative average within each category, then melt the per-category results back into a single column:
library(dplyr)
library(reshape2)
unq <- unique(df$Category)
df$AvgSpend <- melt(
  sapply(seq_along(unq),
         function(i) cummean(df$Spending[which(df$Category == unq[i])])))$value
Output:
Date Category Spending AvgSpend
1 2017/01/01 Apple 10 10.00000
2 2017/01/02 Apple 12 11.00000
3 2017/01/03 Apple 8 10.00000
4 2017/01/01 Banana 13 13.00000
5 2017/01/02 Banana 15 14.00000
6 2017/01/03 Banana 7 11.66667
Sample data:
df <- data.frame(Date = c("2017/01/01", "2017/01/02", "2017/01/03",
                          "2017/01/01", "2017/01/02", "2017/01/03"),
                 Category = c("Apple", "Apple", "Apple",
                              "Banana", "Banana", "Banana"),
                 Spending = c(10, 12, 8, 13, 15, 7))
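The same cumulative average can also be computed without any packages; a base-R sketch using ave() (a running sum over a running count is exactly a cumulative mean):

```r
df <- data.frame(Date = c("2017/01/01", "2017/01/02", "2017/01/03",
                          "2017/01/01", "2017/01/02", "2017/01/03"),
                 Category = c("Apple", "Apple", "Apple",
                              "Banana", "Banana", "Banana"),
                 Spending = c(10, 12, 8, 13, 15, 7))

# cumulative mean within each Category: running sum divided by running count
df$AvgSpend <- ave(df$Spending, df$Category,
                   FUN = function(x) cumsum(x) / seq_along(x))

df$AvgSpend  # 10, 11, 10, 13, 14, 11.67 (last value rounded)
```

Unlike the sapply/melt approach, ave() writes each result back into the row it came from, so the category rows do not need to be contiguous.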

Here is a tidyverse option. Note that cummean computes the running mean within each group, which is exactly what the question asks for:
library(tidyverse)
df %>%
  group_by(Category) %>%
  mutate(AvgSpend = cummean(Spending))
# A tibble: 6 x 4
# Groups: Category [2]
#  Date       Category Spending AvgSpend
#  <fctr>     <fctr>      <dbl>    <dbl>
#1 2017/01/01 Apple          10     10
#2 2017/01/02 Apple          12     11
#3 2017/01/03 Apple           8     10
#4 2017/01/01 Banana         13     13
#5 2017/01/02 Banana         15     14
#6 2017/01/03 Banana          7     11.7

You can use the 'sqldf' package (https://cran.r-project.org/web/packages/sqldf/sqldf.pdf) and compute the running average with a SQL window function:
install.packages("sqldf")
library(sqldf)
actualSpends <- data.frame(
  Date = c('2017/01/01', '2017/01/02', '2017/01/03',
           '2017/01/01', '2017/01/02', '2017/01/03'),
  Category = c('Apple', 'Apple', 'Apple', 'Banana', 'Banana', 'Banana'),
  Spend = c(10, 12, 8, 13, 15, 7))
sqldf("SELECT Date, Category,
              AVG(Spend) OVER (PARTITION BY Category ORDER BY Date) AS AvgSpend
       FROM actualSpends")
(Window functions require SQLite >= 3.25, i.e. a reasonably recent RSQLite.)

Related

imputing missing values in R dataframe

I am trying to impute missing values in my dataset by matching against values in another dataset.
This is my data:
df1 %>% head()
<V1> <V2>
1 apple NA
2 cheese NA
3 butter NA
df2 %>% head()
<V1> <V2>
1 apple jacks
2 cheese whiz
3 butter scotch
4 apple turnover
5 cheese sliders
6 butter chicken
7 apple sauce
8 cheese doodles
9 butter milk
This is what I want df1 to look like:
<V1> <V2>
1 apple jacks, turnover, sauce
2 cheese whiz, sliders, doodles
3 butter scotch, chicken, milk
This is my code:
df1$V2[is.na(df1$V2)] <- df2$V2[match(df1$V1,df2$V1)][which(is.na(df1$V2))]
This code works fine, however it only pulls the first missing value and ignores the rest.
Another solution using just base R:
aggregate(df2$V2, list(df2$V1), c, simplify = FALSE)
Group.1 x
1 apple jacks, turnover, sauce
2 butter scotch, chicken, milk
3 cheese whiz, sliders, doodles
I don't think you even need df1 in this case; you can build it entirely from df2:
df1 <- df2 %>% group_by(V1) %>% summarise(V2 = paste0(V2, collapse = ", "))
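Putting the base-R pieces together (a sketch; column names V1/V2 assumed): collapse df2 per key with aggregate, then fill df1 by matching keys, which handles all matches rather than only the first:

```r
df1 <- data.frame(V1 = c("apple", "cheese", "butter"), V2 = NA)
df2 <- data.frame(V1 = rep(c("apple", "cheese", "butter"), 3),
                  V2 = c("jacks", "whiz", "scotch",
                         "turnover", "sliders", "chicken",
                         "sauce", "doodles", "milk"))

# one comma-separated string per key, preserving row order within each key
agg <- aggregate(V2 ~ V1, df2, FUN = paste, collapse = ", ")

# fill df1 by matching keys against the collapsed table
df1$V2 <- agg$V2[match(df1$V1, agg$V1)]
df1$V2[df1$V1 == "apple"]  # "jacks, turnover, sauce"
```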

Group dataframe rows by creating a unique ID column based on the amount of time passed between entries and variable values

I'm trying to group the rows of my dataframe into "courses" when the same variables appear at regular date intervals. When there is a gap in time frequency or when one of variables change I would like to give it a new course ID.
To give an example, my data looks something like this:
Date Name Item
1 2018-06-02 Johan Apple
2 2018-07-05 Johan Apple
3 2018-08-02 Johan Apple
4 2019-04-15 Johan Apple
5 2019-05-15 Johan Apple
6 2019-05-30 Samantha Orange
7 2019-06-12 Samantha Orange
8 2019-06-27 Samantha Orange
9 2018-02-15 Mary Lemon
10 2018-04-10 Mary Lemon
11 2018-06-12 Mary Lemon
12 2018-08-13 Mary Lime
13 2018-08-27 Mary Lime
14 2017-03-09 George Kiwi
Each different combination of Name and Item should generate a new course ID.
However (the tricky part), if there is a significant time gap between two transactions where the other variables are constant, defined as either more than 6 months or more than three times the average interval up to that date for that specific combination of Item and Name, then it should be given a new CourseID.
In my example:
Because Johan had a break after August 2018, transactions after that should have a new CourseID. Ideally the interval to check for future breaks would then be based on the average in this new group.
Samantha is buying oranges on a biweekly basis with no significant gap, so all her transactions will have one CourseID.
Mary is buying lemons at a regular interval but then switches to buying limes at a regular interval, so these have two CourseIDs.
George just bought the one Kiwi, so a single CourseID
Code to reproduce:
data.frame(Date = as.Date(c("2018-06-02", "2018-07-05", "2018-08-02", "2019-04-15",
                            "2019-05-15", "2019-05-30", "2019-06-12", "2019-06-27",
                            "2018-02-15", "2018-04-10", "2018-06-12", "2018-08-13",
                            "2018-08-27", "2017-03-09")),
           Name = c(rep("Johan", 5), rep("Samantha", 3), rep("Mary", 5), "George"),
           Item = c(rep("Apple", 5), rep("Orange", 3), rep("Lemon", 3), rep("Lime", 2), "Kiwi"))
I'd like to create an additional column which has a unique identifier for each course - i.e. using stringi or similar.
Ideally the output would look something like this:
Date Name Item CourseID
1 2018-06-02 Johan Apple q3J
2 2018-07-05 Johan Apple q3J
3 2018-08-02 Johan Apple q3J
4 2019-04-15 Johan Apple f8j
5 2019-05-15 Johan Apple f8j
6 2019-05-30 Samantha Orange p8U
7 2019-06-12 Samantha Orange p8U
8 2019-06-27 Samantha Orange p8U
9 2018-02-15 Mary Lemon wi9
10 2018-04-10 Mary Lemon wi9
11 2018-06-12 Mary Lemon wi9
12 2018-08-13 Mary Lime q8U
13 2018-08-27 Mary Lime q8U
14 2017-03-09 George Kiwi jJ0
I've tried going about this using max/min on the date variable; however, I'm stumped when it comes to identifying the break based on the previous purchasing pattern.
There may be a package I don't know about which has something for this; however, I've been trying with the tidyverse so far.
Here's a dplyr approach that calculates the gap and the running average gap within each Name/Item group, flags large gaps, and assigns a new group for each large gap or change of Name or Item (rows are assumed sorted by date within each group, as in the sample data):
df1 %>%
  group_by(Name, Item) %>%
  mutate(purch_num = row_number(),
         time_since_first = Date - first(Date),
         # the -Inf default makes the first gap in each group infinite,
         # so every Name/Item combination starts a new group
         gap = Date - lag(Date, default = as.Date(-Inf, origin = "1970-01-01")),
         avg_gap = time_since_first / (purch_num - 1),
         new_grp_flag = gap > 180 | gap > 3 * avg_gap) %>%
  ungroup() %>%
  mutate(group = cumsum(new_grp_flag))
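The gap rule itself can be isolated in a small base-R helper, which makes it easy to sanity-check per group (a sketch of the same logic: a new course starts when the gap exceeds 180 days or three times the average gap so far):

```r
# TRUE where a row should start a new course within one Name/Item group;
# dates must be sorted ascending
flag_breaks <- function(dates) {
  n <- length(dates)
  gap <- c(Inf, as.numeric(diff(dates)))                       # days since previous purchase
  avg_gap <- as.numeric(dates - dates[1]) / (seq_len(n) - 1)   # NaN on the first row
  gap > 180 | (!is.nan(avg_gap) & gap > 3 * avg_gap)
}

# Johan's purchases: the long break before 2019-04-15 starts course 2
johan <- as.Date(c("2018-06-02", "2018-07-05", "2018-08-02",
                   "2019-04-15", "2019-05-15"))
cumsum(flag_breaks(johan))  # 1 1 1 2 2
```

Running cumsum over the flags (after applying the helper group-wise) yields the integer course IDs; those can then be mapped to random strings if identifiers like "q3J" are really needed.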

Subsetting observations with grouping some features

I have a dataset like below:
date, time,product,shop_id
20140104 900 Banana 18
20140104 900 Banana 19
20140104 924 Banana 18
20140104 929 Banana 18
20140104 932 Banana 20
20140104 948 Banana 18
and I need to extract the observations with different product and different shop_id,
so I need to group the observations by product + shop_id.
Here is my code:
library(plyr)
p <- d_ply(shop, .(product, shop_id), table)
print(p)
Unfortunately, it prints NULL.
dataset:
date = c(20140104, 20140104, 20140104, 20140104, 20140104)
time = c(924, 900, 854, 700, 1450)
product = c("Banana", "Banana", "Banana", "Banana", "Banana")
shop_id = c(18, 18, 18, 19, 20)
shop <- data.frame(date = date, time = time, product = product, shop_id = shop_id)
the output should be
date, time, product, shop_id
20140104 900 Banana 19
20140104 932 Banana 20
20140104 948 Banana 18
We can do
library(tidyverse)
shop %>%
  group_by(product, shop_id) %>%
  mutate(n = n()) %>%
  group_by(time) %>%
  arrange(n) %>%
  slice(1) %>%
  group_by(product, shop_id) %>%
  arrange(-time) %>%
  slice(1) %>%
  select(-n) %>%
  arrange(time)
# date time product shop_id
# <int> <int> <chr> <int>
#1 20140104 900 Banana 19
#2 20140104 932 Banana 20
#3 20140104 948 Banana 18
In order to take only the first row of each unique combination, just use aggregate from the stats package:
> aggregate(shop, by=list(shop$product, shop$shop_id), FUN=function(x){x[1]})
Group.1 Group.2 date time product shop_id
1 Banana 18 20140104 924 Banana 18
2 Banana 19 20140104 700 Banana 19
3 Banana 20 20140104 1450 Banana 20
Explanation: my FUN=function(x){x[1]} takes only the first element in case of a collision.
To drop "Group.1", "Group.2" or other columns:
> res <- aggregate(shop, by=list(shop$product, shop$shop_id), FUN=function(x){x[1]})
> res[ , !(names(res) %in% c("Group.1", "Group.2"))]
date time product shop_id
1 20140104 924 Banana 18
2 20140104 700 Banana 19
3 20140104 1450 Banana 20
P.S. Your dataset provided is inconsistent with examples you required, so that's why there is a difference in numbers.
P.S.2 If you want to get all data in case of collision:
> aggregate(shop, by=list(shop$product, shop$shop_id), FUN="identity")
Group.1 Group.2 date time product shop_id
1 Banana 18 20140104, 20140104, 20140104 924, 900, 854 1, 1, 1 18, 18, 18
2 Banana 19 20140104 700 1 19
3 Banana 20 20140104 1450 1 20
If you want to mark collisions:
> aggregate(shop, by=list(shop$product, shop$shop_id), FUN=function(x){if (length(x) > 1) NA else x})
Group.1 Group.2 date time product shop_id
1 Banana 18 NA NA NA NA
2 Banana 19 20140104 700 1 19
3 Banana 20 20140104 1450 1 20
If you want to exclude non-unique rows:
> res <- aggregate(shop, by=list(shop$product, shop$shop_id), FUN=function(x){if (length(x) > 1) NULL else x})
> res[res$product != "NULL", !(names(res) %in% c("Group.1", "Group.2"))]
date time product shop_id
2 20140104 700 1 19
3 20140104 1450 1 20
If you want to avoid the coercion of product from factor to integer (the 1s in the output above), use ""/"NULL"/"NA" instead of NULL/NA.
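The same take-first-per-group idea can also be written with duplicated(), which avoids the coercion that aggregate applies; a base-R sketch (rows are sorted first so the earliest time wins):

```r
shop <- data.frame(date = rep(20140104, 5),
                   time = c(924, 900, 854, 700, 1450),
                   product = "Banana",
                   shop_id = c(18, 18, 18, 19, 20))

# sort so the earliest time comes first within each product/shop_id group,
# then keep only the first row of each group
shop_sorted <- shop[order(shop$product, shop$shop_id, shop$time), ]
firsts <- shop_sorted[!duplicated(shop_sorted[c("product", "shop_id")]), ]
firsts$time  # 854 700 1450
```

Because subsetting never touches column types, product stays a character/factor column here.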
It can be done using dplyr as follows:
# create the sample dataset
date=c(20140104,20140104,20140104,20140104,20140104)
time=c(924 ,900,854,700,1450)
product=c("Banana","Banana","Banana","Banana","Banana")
shop_id=c(18,18,18,19,20)
shop<-data.frame(date=date,time=time,product=product,shop_id=shop_id)
# load a dplyr library
library(dplyr)
# take shop data
shop %>%
  # group by product, shop id, date
  group_by(product, shop_id, date) %>%
  # for each such combination, find the earliest time
  summarise(time = min(time)) %>%
  # group by product, shop id
  group_by(product, shop_id) %>%
  # for each combination of product & shop id,
  # return the earliest date and the time recorded on that date
  summarise(date = min(date), time = time[date == min(date)])

Taking average of dataframe elements sharing same date

I am a bit lost on how to take averages over a data frame formatted in the following way:
id date quantity product
1 12-05-2015 10 apple
2 21-03-2015 12 orange
3 12-05-2015 15 orange
4 21-03-2015 16 apple
Expected result:
date quantity
21-03-2015 14
12-05-2015 12.5
I tried converting it to a zoo object, but then I ran into issues because the dates are non-unique.
Try
aggregate(quantity~date, df1, mean)
# date quantity
#1 12-05-2015 12.5
#2 21-03-2015 14.0
Or
library(data.table)
setDT(df1)[, list(quantity=mean(quantity)), date]
As #Alex A. mentioned in the comments, list( can be replaced by .( in the recent data.table versions.
You could also use the dplyr package. Assuming your data frame is called df:
library(dplyr)
df %>%
  group_by(date) %>%
  summarize(quantity = mean(quantity))
# date quantity
# 1 12-05-2015 12.5
# 2 21-03-2015 14.0
This gets the mean quantity grouped by date.
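For completeness, the same grouped mean is a single base-R call with tapply(); a sketch using the data from the question:

```r
df1 <- data.frame(id = 1:4,
                  date = c("12-05-2015", "21-03-2015", "12-05-2015", "21-03-2015"),
                  quantity = c(10, 12, 15, 16),
                  product = c("apple", "orange", "orange", "apple"))

# named vector of mean quantity per date
avg <- tapply(df1$quantity, df1$date, mean)
avg  # 12-05-2015: 12.5, 21-03-2015: 14
```

tapply returns a named vector rather than a data frame; wrap it with as.data.frame(as.table(avg)) if a two-column result is needed.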

In R: add rows based on a date and another condition

I have a data frame df:
df <- data.frame(names = c("john", "mary", "tom"),
                 dates = c(as.Date("2010-06-01"), as.Date("2010-07-09"), as.Date("2010-06-01")),
                 tours_missed = c(2, 12, 6))
names dates tours_missed
john 2010-06-01 2
mary 2010-07-09 12
tom 2010-06-01 6
I want to be able to add a row with the dates the person missed. There are 2 tours every day the person works. Each person works every 4 days.
The result should be (though the order doesn't matter):
names dates tours_missed
john 2010-06-01 2
mary 2010-07-09 12
mary 2010-07-13 12
mary 2010-07-17 12
mary 2010-07-21 12
mary 2010-07-25 12
mary 2010-07-29 12
tom 2010-06-01 6
tom 2010-06-05 6
tom 2010-06-09 6
I have already tried looking at these topics but was unable to produce the above result: Add rows to a data frame based on date in previous row; In R: Add rows with data of previous row to data frame; add new row to dataframe. Thanks for your help!
library(data.table)
dt = as.data.table(df) # or convert in-place using setDT
# all of the relevant dates
dates.all = dt[, seq(dates, length = tours_missed/2, by = "4 days"), by = names]
# set the key and merge filling in the blanks with previous observation
setkey(dt, names, dates)
dt[dates.all, roll = T]
# names dates tours_missed
# 1: john 2010-06-01 2
# 2: mary 2010-07-09 12
# 3: mary 2010-07-13 12
# 4: mary 2010-07-17 12
# 5: mary 2010-07-21 12
# 6: mary 2010-07-25 12
# 7: mary 2010-07-29 12
# 8: tom 2010-06-01 6
# 9: tom 2010-06-05 6
#10: tom 2010-06-09 6
Or if merging is unnecessary (not quite clear from OP), just construct the answer:
dt[, list(dates = seq(dates, length = tours_missed/2, by = "4 days"), tours_missed),
   by = names]
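The same expansion also works in base R without data.table; a sketch where each row becomes tours_missed/2 dates, 4 days apart:

```r
df <- data.frame(names = c("john", "mary", "tom"),
                 dates = as.Date(c("2010-06-01", "2010-07-09", "2010-06-01")),
                 tours_missed = c(2, 12, 6))

# expand each row into its sequence of work dates (2 tours per day worked)
expanded <- do.call(rbind, lapply(seq_len(nrow(df)), function(i) {
  data.frame(names = df$names[i],
             dates = seq(df$dates[i], by = "4 days",
                         length.out = df$tours_missed[i] / 2),
             tours_missed = df$tours_missed[i])
}))

nrow(expanded)  # 10 (1 john + 6 mary + 3 tom)
```

Indexing with df$dates[i] (single bracket) keeps the Date class, so seq() dispatches to seq.Date and the "4 days" step works as expected.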
