How to create a new data frame with grouped transactions in R?

I am trying to create a new data frame in R using an existing data frame of items bought in transactions as shown below:
dput output for the data:
structure(list(Transaction = c(1L, 2L, 2L, 3L, 3L, 3L), Item = c("Bread",
"Scandinavian", "Scandinavian", "Hot chocolate", "Jam", "Cookies"
), date_time = c("30/10/2016 09:58", "30/10/2016 10:05", "30/10/2016 10:05",
"30/10/2016 10:07", "30/10/2016 10:07", "30/10/2016 10:07"),
period_day = c("morning", "morning", "morning", "morning",
"morning", "morning"), weekday_weekend = c("weekend", "weekend",
"weekend", "weekend", "weekend", "weekend"), Year = c("2016",
"2016", "2016", "2016", "2016", "2016"), Month = c("October",
"October", "October", "October", "October", "October")), row.names = c(NA,
6L), class = "data.frame")
As you can see in the example, each row corresponds to an individual product bought, not to a transaction (hence why Transaction 2 spans rows 2 and 3).
I would like to make a new table where the rows are the different transactions (1, 2, 3, etc.) and the columns are 0/1 indicators for each item (e.g. Bread = 0 or 1), so I can perform apriori analysis.
Any idea how I can group the different transactions together and then create these new columns?

Assuming your dataframe is called df, you can use tidyr's pivot_wider:
df1 <- tidyr::pivot_wider(df, names_from = Item, values_from = Item,
                          values_fn = dplyr::n_distinct, values_fill = 0)
df1
# Transaction date_time period_day weekday_weekend Year Month Bread Scandinavian `Hot chocolate` Jam Cookies
# <int> <chr> <chr> <chr> <chr> <chr> <int> <int> <int> <int> <int>
#1 1 30/10/2016 09… morning weekend 2016 Octob… 1 0 0 0 0
#2 2 30/10/2016 10… morning weekend 2016 Octob… 0 1 0 0 0
#3 3 30/10/2016 10… morning weekend 2016 Octob… 0 0 1 1 1
Or with data.table's dcast:
library(data.table)
dcast(setDT(df), Transaction + date_time + period_day + weekday_weekend +
        Year + Month ~ Item, value.var = 'Item', fun.aggregate = uniqueN)

Try dummy_cols from the fastDummies package. This will turn the Item column into 0's and 1's. The second line then sums them per transaction.
library(fastDummies)

# one-hot encode Item, keeping only the Transaction and Item columns
d <- dummy_cols(df[1:2], remove_selected_columns = TRUE)
d <- aggregate(d[-1], by = list(Transaction = d$Transaction), FUN = sum)
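Once you have the 0/1 item columns per transaction, a minimal sketch of feeding them into the apriori analysis mentioned in the question (assuming the arules package, and that the wide table from the first answer is called df1) could look like this:
library(arules)

# keep only the item indicator columns, coerce to a logical matrix,
# then to the "transactions" class that apriori() expects
item_cols <- setdiff(names(df1), c("Transaction", "date_time", "period_day",
                                   "weekday_weekend", "Year", "Month"))
trans <- as(as.matrix(df1[item_cols]) > 0, "transactions")

rules <- apriori(trans, parameter = list(supp = 0.1, conf = 0.5))
inspect(rules)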

Related

In R, make a conditional indicator variable based on (a) the first instance of a record type and (b) a date difference

Background
Here's a df with some data in it from a Costco-like members-only big-box store:
library(dplyr)

d <- data.frame(ID = c("a", "a", "b", "c", "c", "d"),
                purchase_type = c("grocery", "grocery", NA, "auto", "grocery", NA),
                date_joined = as.Date(c("2014-01-01", "2014-01-01", "2013-04-30",
                                        "2009-03-08", "2009-03-08", "2015-03-04")),
                date_purchase = as.Date(c("2014-04-30", "2016-07-08", "2013-06-29",
                                          "2015-04-07", "2017-09-10", "2017-03-10")),
                stringsAsFactors = TRUE)
d <- d %>%
  mutate(date_diff = date_purchase - date_joined)
This yields the following table:
As you can see, it's got a member ID, purchase types based on the broad category of what people bought, and two dates: the date the member originally became a member, and the date of a given purchase. I've also made a variable date_diff to tally the time between a given purchase and the beginning of membership.
The Problem
I'd like to make a new variable early_shopper that's marked 1 on all of a member's purchases if:
1. That member's first purchase was made within a year of joining (so date_diff <= 365 days).
2. This first purchase doesn't have an NA in purchase_type.
If these criteria aren't met, give a 0.
What I'm looking for is a table that looks like this:
Note that Member a is the only "true" early_shopper: their first purchase is non-NA in purchase_type, and only 119 days passed between their joining the store and making a purchase there. Member b looks like they could be based on my date_diff criterion, but since they don't have a non-NA value in purchase_type, they don't count as an early_shopper.
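As a minimal sketch (assuming dplyr, purely to make the two criteria concrete), the per-member check would look something like:
library(dplyr)

# for each member, look only at the earliest purchase and test both
# conditions there: non-NA purchase_type and date_diff <= 365 days
d %>%
  mutate(date_diff = as.numeric(date_purchase - date_joined)) %>%
  group_by(ID) %>%
  summarise(first_purchase_ok = !is.na(purchase_type[which.min(date_diff)]) &
                                min(date_diff) <= 365)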
What I've Tried
So far, I've tried using mutate and first functions like this:
d <- d %>%
  mutate(early_shopper = if_else(!is.na(first(purchase_type, order_by = date_joined)) &
                                   date_diff < 365, 1, 0))
Which gives me this:
Something's kinda working here, but not fully. As you can see, I get the correct early_shopper = 1 in Member a's first purchase, but not their second. I also get a false positive with member b, who's marked as an early_shopper when I don't want them to be (because their purchase_type is NA).
Any ideas? I can further clarify if need be. Thanks!
You could use
library(dplyr)
d %>%
  mutate(date_diff = date_purchase - date_joined) %>%
  group_by(ID, purchase_type) %>%
  arrange(ID, date_joined) %>%
  mutate(
    early_shopper = +(!is.na(first(purchase_type)) & date_diff <= 365)
  ) %>%
  group_by(ID) %>%
  mutate(early_shopper = max(early_shopper)) %>%
  ungroup()
which returns
# A tibble: 6 x 6
ID purchase_type date_joined date_purchase date_diff early_shopper
<fct> <fct> <date> <date> <drtn> <int>
1 a grocery 2014-01-01 2014-04-30 119 days 1
2 a grocery 2014-01-01 2016-07-08 919 days 1
3 b NA 2013-04-30 2013-06-29 60 days 0
4 c auto 2009-03-08 2015-04-07 2221 days 0
5 c grocery 2009-03-08 2017-09-10 3108 days 0
6 d NA 2015-03-04 2017-03-10 737 days 0
If you want the early_shopper column to be boolean/logical, just remove the +.
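(The unary + is just a compact way to coerce TRUE/FALSE to 1/0, for example:)
+(c(TRUE, FALSE, TRUE))
# [1] 1 0 1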
Data
I used the data below; note that date_joined for b is 2013-04-30, as shown in your images, not as in the data you actually posted.
structure(list(ID = structure(c(1L, 1L, 2L, 3L, 3L, 4L), .Label = c("a",
"b", "c", "d"), class = "factor"), purchase_type = structure(c(2L,
2L, NA, 1L, 2L, NA), .Label = c("auto", "grocery"), class = "factor"),
date_joined = structure(c(16071, 16071, 15825, 14311, 14311,
16498), class = "Date"), date_purchase = structure(c(16190,
16990, 15885, 16532, 17419, 17235), class = "Date")), class = "data.frame", row.names = c(NA,
-6L))
Here is my approach using a join to get the early_shopper value to be the same for all rows of the same ID.
library(dplyr)
d <- structure(list(ID = structure(c(1L, 1L, 2L, 3L, 3L, 4L),
.Label = c("a","b", "c", "d"),
class = "factor"),
purchase_type = structure(c(2L, 2L, NA, 1L, 2L, NA),
.Label = c("auto", "grocery"),
class = "factor"),
date_joined = structure(c(16071, 16071, 15825, 14311, 14311, 16498),
class = "Date"),
date_purchase = structure(c(16190, 16990, 15885, 16532, 17419, 17235),
class = "Date")),
class = "data.frame", row.names = c(NA, -6L))
d %>%
  inner_join(d %>%
               mutate(date_diff = date_purchase - date_joined) %>%
               group_by(ID) %>%
               slice_min(date_diff) %>%
               transmute(early_shopper = if_else(!is.na(first(purchase_type,
                                                               order_by = date_joined)) &
                                                   date_diff < 365, 1, 0)) %>%
               ungroup())
ID purchase_type date_joined date_purchase early_shopper
1 a grocery 2014-01-01 2014-04-30 1
2 a grocery 2014-01-01 2016-07-08 1
3 b <NA> 2013-04-30 2013-06-29 0
4 c auto 2009-03-08 2015-04-07 0
5 c grocery 2009-03-08 2017-09-10 0
6 d <NA> 2015-03-04 2017-03-10 0

Dividing each row by the previous one in R

I have R dataframe:
city hour value
0 NY 0 12
1 NY 12 24
2 LA 0 3
3 LA 12 9
I want, for each city, to divide each row by the previous one and write the result into a new dataframe. The desired output is:
city ratio
NY 2
LA 3
You can try aggregate like below
aggregate(value ~ city, df, function(x) x[-1] / x[1])
which gives
city value
1 LA 3
2 NY 2
Data
> dput(df)
structure(list(city = c("NY", "NY", "LA", "LA"), hour = c(0L,
12L, 0L, 12L), value = c(12L, 24L, 3L, 9L)), class = "data.frame", row.names = c("0",
"1", "2", "3"))
You can use lag to get the previous value, divide each value by its previous value within each city, and drop the NA rows.
library(dplyr)
df %>%
arrange(city, hour) %>%
group_by(city) %>%
summarise(value = value/lag(value)) %>%
na.omit()
# city value
# <chr> <dbl>
#1 LA 3
#2 NY 2
In data.table we can do this via shift :
library(data.table)
setDT(df)[order(city, hour), value := value/shift(value), city]
na.omit(df)

Automate date formatting

The data contains a column "Date.Range" holding 2 months, e.g. Oct 31,2019-Nov 30,2019 and Dec 1,2019-Dec 31, 2019. I need to separate the Revenue for these into different columns, "Post Period" for the later range and "Pre Period" for the earlier one. I want to automate this process when I upload a file comparing any 2 months: the earlier month goes under "Pre Period" and the later one under "Post Period". Attached is an example Excel screenshot of the raw data and the processed data.
x <- data.frame("A" = c("book", "mobile", "tablet", "desktop"),
                "B" = c("new york", "chicago", "london", "paris"),
                "Date.Range" = c("Oct 31,2019-Nov 30,2019", "Oct 31,2019-Nov 30,2019",
                                 "Dec 1,2019-Dec 31, 2019", "Dec 1,2019-Dec 31, 2019"),
                "Revenue" = c(542, 837, 1234, 846))
dput(x)
structure(list(A = structure(c(1L, 3L, 4L, 2L), .Label = c("book",
"desktop", "mobile", "tablet"), class = "factor"), B = structure(c(3L,
1L, 2L, 4L), .Label = c("chicago", "london", "new york", "paris"
), class = "factor"), Date.Range = structure(c(2L, 2L, 1L, 1L
), .Label = c("Dec 1,2019-Dec 31, 2019", "Oct 31,2019-Nov 30,2019"
), class = "factor"), Revenue = c(542, 837, 1234, 846)), class = "data.frame", row.names = c(NA,
-4L))
Raw Data.
Processed Data.
Using base R's reshape function (note the timevar must match the column name Date.Range):
df = reshape(data = x, idvar = c("A", "B"), direction = "wide", timevar = "Date.Range")
colnames(df) = c("A", "B", "pre", "post")
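Since each A/B combination appears under only one of the two date ranges, reshape leaves the other period as NA; if you want zeros like in the tidyr answer below, one option (a hedged sketch) is to fill them afterwards:
# replace the NAs left for the missing period with 0
df[is.na(df)] <- 0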
We can extract one date from Date.Range, arrange the data according to it, create a new period column and get the data in wide format.
library(dplyr)
x %>%
  mutate(date = lubridate::mdy(sub("-.*", "", Date.Range))) %>%
  arrange(date) %>%
  mutate(period = rep(c("pre", "post"), each = 2)) %>%
  tidyr::pivot_wider(names_from = period, values_from = Revenue,
                     values_fill = list(Revenue = 0)) %>%
  select(-date)
# A tibble: 4 x 5
# A B Date.Range pre post
# <fct> <fct> <fct> <dbl> <dbl>
#1 book new york Oct 31,2019-Nov 30,2019 542 0
#2 mobile chicago Oct 31,2019-Nov 30,2019 837 0
#3 tablet london Dec 1,2019-Dec 31, 2019 0 1234
#4 desktop paris Dec 1,2019-Dec 31, 2019 0 846
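Because the question asks to automate this for any two months, the hardcoded rep(c("pre", "post"), each = 2) could be made data-driven; a minimal sketch, assuming each file always contains exactly two distinct date ranges:
x %>%
  mutate(date = lubridate::mdy(sub("-.*", "", Date.Range)),
         # the earlier of the two ranges becomes "pre", the later one "post"
         period = ifelse(date == min(date), "pre", "post")) %>%
  tidyr::pivot_wider(names_from = period, values_from = Revenue,
                     values_fill = list(Revenue = 0)) %>%
  select(-date)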

R dplyr perform different aggregation by group

I have a dataframe dat which looks like this:
dat <- structure(list(cell.ID = c(329574L, 329574L, 329574L, 329574L,
329574L, 329574L, 329574L, 329574L, 329574L, 329574L, 329574L,
329574L), Year = c("2010", "2010", "2010", "2010", "2010", "2010",
"2010", "2010", "2010", "2010", "2010", "2010"), month_name = c("June",
"July", "June", "July", "June", "July", "June", "July", "June",
"July", "June", "July"), value = c(459.860986624053, 398.94083733151,
16, 23, 111.69, 453.333, 71.55, 30.38, 31.928, 30.13355, 17.587,
19.7938709677419), variable_name = c("ETo", "ETo", "Rday", "Rday",
"Rsum", "Rsum", "Thdd", "Thdd", "Tmax", "Tmax", "Tmin", "Tmin"
), monthID = c(6L, 7L, 6L, 7L, 6L, 7L, 6L, 7L, 6L, 7L, 6L, 7L
)), row.names = c(NA, -12L), class = "data.frame")
library(dplyr)
dat %>%
dplyr::group_by(Year, variable_name) %>%
dplyr::summarise(variable = sum(value))
To average Tmax and Tmin and sum the rest of the variables, I tried this:
dat %>%
  dplyr::group_by(Year, variable_name) %>%
  dplyr::summarise(variable = ifelse(variable_name %in% c('Tmax', 'Tmin'), mean(value), sum(value)))
Error: Column `variable` must be length 1 (a summary value), not 2
How do I correct this?
Another way to do this in dplyr is to use if and else instead of ifelse (since variable_name is constant within each group, checking its first element is enough):
dat %>%
  group_by(Year, variable_name) %>%
  summarise(variable = if (variable_name[1] %in% c('Tmax', 'Tmin')) mean(value) else sum(value))
# A tibble: 6 x 3
# Groups: Year [1]
Year variable_name variable
<chr> <chr> <dbl>
1 2010 ETo 859.
2 2010 Rday 39
3 2010 Rsum 565.
4 2010 Thdd 102.
5 2010 Tmax 31.0
6 2010 Tmin 18.7
The problem is that ifelse is vectorised: it returns a result as long as its test, so inside summarise() each two-row group produces a length-2 result rather than a single summary value. You can work around this by computing both summary statistics and then conditionally selecting the one you want by variable name, like this:
dat %>%
dplyr::group_by(Year, variable_name) %>%
dplyr::summarise(var_mean = mean(value), var_sum = sum(value)) %>%
dplyr::mutate(variable = ifelse(variable_name %in% c('Tmax', 'Tmin'), var_mean, var_sum)) %>%
dplyr::select(-var_mean, -var_sum)
Result:
# A tibble: 6 x 3
# Groups: Year [1]
Year variable_name variable
<chr> <chr> <dbl>
1 2010 ETo 859.
2 2010 Rday 39
3 2010 Rsum 565.
4 2010 Thdd 102.
5 2010 Tmax 31.0
6 2010 Tmin 18.7

Combining data with Base R

I currently need to translate my dplyr code into base R code. My dplyr code gives me 3 columns: competitor sex, the Olympic season, and the number of different sports. The code looks like this:
olympics %>%
group_by(Sex, Season, Sport) %>%
summarise(n()) %>%
group_by(Sex, Season) %>%
summarise(n()) %>%
setNames(c("Competitor_Sex", "Olympic_Season", "Num_Sports"))
My data structure looks like this.
structure(list(Name = c("A Lamusi", "Juhamatti Tapio Aaltonen",
"Andreea Aanei", "Jamale (Djamel-) Aarrass (Ahrass-)", "Nstor Abad Sanjun",
"Nstor Abad Sanjun"), Sex = c("M", "M", "F", "M", "M", "M"),
Age = c(23L, 28L, 22L, 30L, 23L, 23L), Height = c(170L, 184L,
170L, 187L, 167L, 167L), Weight = c(60, 85, 125, 76, 64,
64), Team = c("China", "Finland", "Romania", "France", "Spain",
"Spain"), NOC = c("CHN", "FIN", "ROU", "FRA", "ESP", "ESP"
), Games = c("2012 Summer", "2014 Winter", "2016 Summer",
"2012 Summer", "2016 Summer", "2016 Summer"), Year = c(2012L,
2014L, 2016L, 2012L, 2016L, 2016L), Season = c("Summer",
"Winter", "Summer", "Summer", "Summer", "Summer"), City = c("London",
"Sochi", "Rio de Janeiro", "London", "Rio de Janeiro", "Rio de Janeiro"
), Sport = c("Judo", "Ice Hockey", "Weightlifting", "Athletics",
"Gymnastics", "Gymnastics"), Event = c("Judo Men's Extra-Lightweight",
"Ice Hockey Men's Ice Hockey", "Weightlifting Women's Super-Heavyweight",
"Athletics Men's 1,500 metres", "Gymnastics Men's Individual All-Around",
"Gymnastics Men's Floor Exercise"), Medal = c(NA, "Bronze",
NA, NA, NA, NA), BMI = c(20.7612456747405, 25.1063327032136,
43.2525951557093, 21.7335354170837, 22.9481157445588, 22.9481157445588
)), .Names = c("Name", "Sex", "Age", "Height", "Weight",
"Team", "NOC", "Games", "Year", "Season", "City", "Sport", "Event",
"Medal", "BMI"), row.names = c(NA, 6L), class = "data.frame")
Does anyone know how to translate this into base R?
Since you are grouping twice in dplyr, you can use a double aggregate in base R:
setNames(aggregate(Name ~ Sex + Season,
                   aggregate(Name ~ Sex + Season + Sport, olympics, length), length),
         c("Competitor_Sex", "Olympic_Season", "Num_Sports"))
# Competitor_Sex Olympic_Season Num_Sports
#1 F Summer 1
#2 M Summer 3
#3 M Winter 1
This gives the same output as the dplyr option
library(dplyr)
olympics %>%
group_by(Sex, Season, Sport) %>%
summarise(n()) %>%
group_by(Sex, Season) %>%
summarise(n()) %>%
setNames(c("Competitor_Sex", "Olympic_Season", "Num_Sports"))
# Competitor_Sex Olympic_Season Num_Sports
# <chr> <chr> <int>
#1 F Summer 1
#2 M Summer 3
#3 M Winter 1
A base R option would be using aggregate twice
out <- aggregate(BMI ~ Sex + Season,
aggregate(BMI ~ Sex + Season + Sport, olympics, length), length)
names(out) <- c("Competitor_Sex", "Olympic_Season", "Num_Sports")
out
# Competitor_Sex Olympic_Season Num_Sports
#1 F Summer 1
#2 M Summer 3
#3 M Winter 1
It is similar to the OP's output
olympics %>%
group_by(Sex, Season, Sport) %>%
summarise(n()) %>%
group_by(Sex, Season) %>%
summarise(n()) %>%
setNames(c("Competitor_Sex", "Olympic_Season", "Num_Sports"))
# A tibble: 3 x 3
# Groups: Sex [2]
# Competitor_Sex Olympic_Season Num_Sports
# <chr> <chr> <int>
#1 F Summer 1
#2 M Summer 3
#3 M Winter 1
Or it can be done in a compact way with table from base R: paste Sex, Season and Sport into a single key per row, take the unique keys via names(table(...)), strip the trailing Sport field with sub, and let the outer table count how many distinct sports remain for each Sex/Season combination.
table(sub(",[^,]+$", "", names(table(do.call(paste,
c(olympics[c("Sex", "Season", "Sport")], sep=","))))))
# F,Summer M,Summer M,Winter
# 1 3 1
