Automate date formatting - r

The data contains a column "date range" that holds two periods, e.g. Oct 31,2019-Nov 30,2019 (November) and Dec 1,2019-Dec 31, 2019 (December). I need to split the "Revenue" column into two columns: Pre Period for the earlier period and Post Period for the later one (December here). I want to automate this so that it works whenever I upload a file comparing any two months, with the earlier month under "Pre Period" and the later one under "Post Period". I have attached example Excel screenshots of the raw data and the processed data.
x<-data.frame("A"=c("book","mobile","tablet","desktop"),
"B"=c("new york","chicago","london","paris"),
"Date.Range"=c("Oct 31,2019-Nov 30,2019","Oct 31,2019-Nov 30,2019","Dec 1,2019-Dec 31, 2019","Dec 1,2019-Dec 31, 2019"),
"Revenue"=c(542,837,1234,846))
dput(x)
structure(list(A = structure(c(1L, 3L, 4L, 2L), .Label = c("book",
"desktop", "mobile", "tablet"), class = "factor"), B = structure(c(3L,
1L, 2L, 4L), .Label = c("chicago", "london", "new york", "paris"
), class = "factor"), Date.Range = structure(c(2L, 2L, 1L, 1L
), .Label = c("Dec 1,2019-Dec 31, 2019", "Oct 31,2019-Nov 30,2019"
), class = "factor"), Revenue = c(542, 837, 1234, 846)), class = "data.frame", row.names = c(NA,
-4L))
Raw Data.
Processed Data.

Using base R's reshape function:
df = reshape(data = x, idvar = c("A", "B"), direction = "wide", timevar = "Date.Range")
colnames(df)=c("A","B","pre","post")

We can extract one date from Date.Range, arrange the data according to it, create a new period column and get the data in wide format.
library(dplyr)
x %>%
  mutate(date = lubridate::mdy(sub("-.*", "", Date.Range))) %>%
  arrange(date) %>%
  mutate(period = rep(c("pre", "post"), each = 2)) %>%
  tidyr::pivot_wider(names_from = period, values_from = Revenue,
                     values_fill = list(Revenue = 0)) %>%
  select(-date)
# A tibble: 4 x 5
# A B Date.Range pre post
# <fct> <fct> <fct> <dbl> <dbl>
#1 book new york Oct 31,2019-Nov 30,2019 542 0
#2 mobile chicago Oct 31,2019-Nov 30,2019 837 0
#3 tablet london Dec 1,2019-Dec 31, 2019 0 1234
#4 desktop paris Dec 1,2019-Dec 31, 2019 0 846
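Both answers above depend on the shape of the sample (two rows per period, earlier range listed first). As a rough base R sketch of the "any two months" case, one could parse the start date of each range and label the earlier one "pre". The helper name parse_start is hypothetical, and the sketch assumes the file contains exactly two distinct ranges written like the ones in the question.

```r
x <- data.frame(
  A = c("book", "mobile", "tablet", "desktop"),
  B = c("new york", "chicago", "london", "paris"),
  Date.Range = c("Oct 31,2019-Nov 30,2019", "Oct 31,2019-Nov 30,2019",
                 "Dec 1,2019-Dec 31, 2019", "Dec 1,2019-Dec 31, 2019"),
  Revenue = c(542, 837, 1234, 846),
  stringsAsFactors = FALSE
)

# Hypothetical helper: return the start date of a range such as
# "Oct 31,2019-Nov 30,2019". month.abb is used instead of "%b" so the
# result does not depend on the session locale.
parse_start <- function(rng) {
  start <- trimws(sub("-.*", "", rng))  # keep the text before the first "-"
  parts <- regmatches(start,
                      regexec("^([A-Za-z]{3}) +([0-9]+), *([0-9]{4})$", start))
  as.Date(vapply(parts, function(p) {
    sprintf("%s-%02d-%02d", p[4], match(p[2], month.abb), as.integer(p[3]))
  }, character(1)))
}

start <- parse_start(as.character(x$Date.Range))

# Assumption: exactly two distinct ranges, so the earliest start is "pre".
x$period <- ifelse(start == min(start), "pre", "post")
x$pre    <- ifelse(x$period == "pre",  x$Revenue, 0)
x$post   <- ifelse(x$period == "post", x$Revenue, 0)
x$period <- NULL
x
```

Because the period is derived from the parsed dates, the row order of the uploaded file and the number of rows per period should not matter.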

Related

In R, make a conditional indicator variable based on (a) the first instance of a record type and (b) a date difference

Background
Here's a df with some data in it from a Costco-like members-only big-box store:
d <- data.frame(ID = c("a","a","b","c","c","d"),
purchase_type = c("grocery","grocery",NA,"auto","grocery",NA),
date_joined = as.Date(c("2014-01-01","2014-01-01","2013-04-30","2009-03-08","2009-03-08","2015-03-04")),
date_purchase = as.Date(c("2014-04-30","2016-07-08","2013-06-29","2015-04-07","2017-09-10","2017-03-10")),
stringsAsFactors=TRUE)
library(dplyr)
d <- d %>%
  mutate(date_diff = date_purchase - date_joined)
This yields the following table:
As you can see, it's got a member ID, purchase types based on the broad category of what people bought, and two dates: the date the member originally became a member, and the date of a given purchase. I've also made a variable date_diff to tally the time between a given purchase and the beginning of membership.
The Problem
I'd like to make a new variable early_shopper that's marked 1 on all of a member's purchases if:
(a) that member's first purchase was made within a year of joining (so date_diff <= 365 days), and
(b) this first purchase doesn't have an NA in purchase_type.
If these criteria aren't met, give a 0.
What I'm looking for is a table that looks like this:
Note that Member a is the only "true" early_shopper: their first purchase is non-NA in purchase_type, and only 119 days passed between their joining the store and making a purchase there. Member b looks like they could be based on my date_diff criterion, but since they don't have a non-NA value in purchase_type, they don't count as an early_shopper.
What I've Tried
So far, I've tried using mutate and first functions like this:
d <- d %>%
mutate(early_shopper = if_else(!is.na(first(purchase_type,order_by = date_joined)) & date_diff < 365, 1, 0))
Which gives me this:
Something's kinda working here, but not fully. As you can see, I get the correct early_shopper = 1 in Member a's first purchase, but not their second. I also get a false positive with member b, who's marked as an early_shopper when I don't want them to be (because their purchase_type is NA).
Any ideas? I can further clarify if need be. Thanks!
You could use
library(dplyr)
d %>%
mutate(date_diff = date_purchase - date_joined) %>%
group_by(ID, purchase_type) %>%
arrange(ID, date_joined) %>%
mutate(
early_shopper = +(!is.na(first(purchase_type)) & date_diff <= 365)
) %>%
group_by(ID) %>%
mutate(early_shopper = max(early_shopper)) %>%
ungroup()
which returns
# A tibble: 6 x 6
ID purchase_type date_joined date_purchase date_diff early_shopper
<fct> <fct> <date> <date> <drtn> <int>
1 a grocery 2014-01-01 2014-04-30 119 days 1
2 a grocery 2014-01-01 2016-07-08 919 days 1
3 b NA 2013-04-30 2013-06-29 60 days 0
4 c auto 2009-03-08 2015-04-07 2221 days 0
5 c grocery 2009-03-08 2017-09-10 3108 days 0
6 d NA 2015-03-04 2017-03-10 737 days 0
If you want the early_shopper column to be boolean/logical, just remove the +.
Data
I used the data below. Here the date_joined for b is 2013-04-30, as shown in your images, rather than the value in the data you posted.
structure(list(ID = structure(c(1L, 1L, 2L, 3L, 3L, 4L), .Label = c("a",
"b", "c", "d"), class = "factor"), purchase_type = structure(c(2L,
2L, NA, 1L, 2L, NA), .Label = c("auto", "grocery"), class = "factor"),
date_joined = structure(c(16071, 16071, 15825, 14311, 14311,
16498), class = "Date"), date_purchase = structure(c(16190,
16990, 15885, 16532, 17419, 17235), class = "Date")), class = "data.frame", row.names = c(NA,
-6L))
Here is my approach using a join to get the early_shopper value to be the same for all rows of the same ID.
library(dplyr)
d <- structure(list(ID = structure(c(1L, 1L, 2L, 3L, 3L, 4L),
.Label = c("a","b", "c", "d"),
class = "factor"),
purchase_type = structure(c(2L, 2L, NA, 1L, 2L, NA),
.Label = c("auto", "grocery"),
class = "factor"),
date_joined = structure(c(16071, 16071, 15825, 14311, 14311, 16498),
class = "Date"),
date_purchase = structure(c(16190, 16990, 15885, 16532, 17419, 17235),
class = "Date")),
class = "data.frame", row.names = c(NA, -6L))
d %>%
inner_join(d %>%
mutate(date_diff = d$date_purchase - d$date_joined) %>%
group_by(ID) %>%
slice_min(date_diff) %>%
transmute(early_shopper = if_else(!is.na(first(purchase_type,
order_by = date_joined)) &
date_diff < 365, 1, 0)) %>%
ungroup()
)
ID purchase_type date_joined date_purchase early_shopper
1 a grocery 2014-01-01 2014-04-30 1
2 a grocery 2014-01-01 2016-07-08 1
3 b <NA> 2013-04-30 2013-06-29 0
4 c auto 2009-03-08 2015-04-07 0
5 c grocery 2009-03-08 2017-09-10 0
6 d <NA> 2015-03-04 2017-03-10 0
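For comparison, here is a small base R sketch that applies the two criteria from the question literally: it looks only at each member's earliest purchase (by date_purchase) and then copies the result to all of that member's rows. It is an illustration under the question's sample data, not a replacement for the answers above; the name first_ok is made up for this example.

```r
d <- data.frame(
  ID = c("a", "a", "b", "c", "c", "d"),
  purchase_type = c("grocery", "grocery", NA, "auto", "grocery", NA),
  date_joined = as.Date(c("2014-01-01", "2014-01-01", "2013-04-30",
                          "2009-03-08", "2009-03-08", "2015-03-04")),
  date_purchase = as.Date(c("2014-04-30", "2016-07-08", "2013-06-29",
                            "2015-04-07", "2017-09-10", "2017-03-10"))
)

# Days between joining and each purchase, as a plain number.
d$date_diff <- as.numeric(d$date_purchase - d$date_joined)

# For each member, take the row of the earliest purchase and test both
# criteria on that row only.
first_ok <- sapply(split(d, d$ID), function(g) {
  f <- g[which.min(as.numeric(g$date_purchase)), ]
  !is.na(f$purchase_type) && f$date_diff <= 365
})

# Copy the per-member result back onto every row of that member.
d$early_shopper <- as.integer(first_ok[as.character(d$ID)])
d
```

On this sample only member a ends up with early_shopper = 1, which matches the table the question asks for.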

How to create a new data frame with grouped transactions in R?

I am trying to create a new data frame in R using an existing data frame of items bought in transactions as shown below:
dput output for the data:
structure(list(Transaction = c(1L, 2L, 2L, 3L, 3L, 3L), Item = c("Bread",
"Scandinavian", "Scandinavian", "Hot chocolate", "Jam", "Cookies"
), date_time = c("30/10/2016 09:58", "30/10/2016 10:05", "30/10/2016 10:05",
"30/10/2016 10:07", "30/10/2016 10:07", "30/10/2016 10:07"),
period_day = c("morning", "morning", "morning", "morning",
"morning", "morning"), weekday_weekend = c("weekend", "weekend",
"weekend", "weekend", "weekend", "weekend"), Year = c("2016",
"2016", "2016", "2016", "2016", "2016"), Month = c("October",
"October", "October", "October", "October", "October")), row.names = c(NA,
6L), class = "data.frame")
As you can see in the example, the rows are due to each individual product bought, not the transactions themselves (hence why Transaction 2 is both rows 2 and 3).
I would like to make a new table where the rows are the different transactions (1, 2, 3, etc.) and the different columns are categorical (Bread = 0, 1) so I can perform apriori analysis.
Any idea how I can group the different transactions together and then create these new columns?
Assuming your dataframe is called df, you can use tidyr's pivot_wider (n_distinct comes from dplyr):
df1 <- tidyr::pivot_wider(df, names_from = Item, values_from = Item,
                          values_fn = dplyr::n_distinct, values_fill = 0)
df1
# Transaction date_time period_day weekday_weekend Year Month Bread Scandinavian `Hot chocolate` Jam Cookies
# <int> <chr> <chr> <chr> <chr> <chr> <int> <int> <int> <int> <int>
#1 1 30/10/2016 09… morning weekend 2016 Octob… 1 0 0 0 0
#2 2 30/10/2016 10… morning weekend 2016 Octob… 0 1 0 0 0
#3 3 30/10/2016 10… morning weekend 2016 Octob… 0 0 1 1 1
Or with data.table's dcast :
library(data.table)
dcast(setDT(df), Transaction+date_time+period_day + weekday_weekend +
Year + Month ~ Item, value.var = 'Item', fun.aggregate = uniqueN)
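Try dummy_cols from the fastDummies package. This will turn the Item column into 0/1 indicator columns. The second line sums them per transaction, so an item that appears twice in one transaction (such as Scandinavian in transaction 2) gets a count of 2 rather than 1.
library(fastDummies)
d <- dummy_cols(df[1:2], remove_selected_columns = TRUE)
d <- aggregate(d[-1], by = list(Transaction = d$Transaction), FUN = sum)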
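If strict 0/1 indicators are needed, one possible base R sketch is to cross-tabulate transactions against items and clamp the counts. Only the Transaction and Item columns of the question's data are reproduced here, and the object names counts and basket are just illustrative.

```r
df <- data.frame(
  Transaction = c(1L, 2L, 2L, 3L, 3L, 3L),
  Item = c("Bread", "Scandinavian", "Scandinavian",
           "Hot chocolate", "Jam", "Cookies")
)

# Rows are transactions, columns are items, cells count how often the item
# appears in that transaction.
counts <- table(df$Transaction, df$Item)

# Turn counts into presence/absence: 1 if the item was bought at all, else 0.
basket <- (counts > 0) + 0L
basket
```

Here basket["2", "Scandinavian"] is 1 even though the item appears twice in transaction 2.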

Pivot dataset and column names [R]

I have a dataset that I want to pivot.
dataset <- data.frame(date = c("01/01/2020","02/01/2020", "02/01/2020", "03/01/2020")
, camp_type = c("acquisition", "acquisition", "newsletter", "acquisition")
, channel_type = c("email", "direct_mail","email","email")
, sent = c(100, 200, 50, 250)
, open = c(30, NA, 14, 148)
, click = c(14, NA, 1, 100)
)
PLEASE NOTE: I have many more camp_types than the ones displayed in this example.
I want to get one row per day, and the rest of the information in different columns such as the picture below (renaming the columns "sent", "open" and "click" based on "channel_type" and "camp_type").
I have tried something not very elegant, and entirely manual, but I get an error when I rename the variables (code below)
dataset %>%
filter(camp_type == 'Acquisition' & channel_type == 'direct_mail') %>%
rename (dm_acq_sent = sent
, dm_acq_open = open
, dm_acq_click = clicked
)
The problem with the code above is that, once I fix the renaming issue, it will still be heavily manual: I have to repeat the same chunk of code several times, and someone needs to check regularly that no new combinations of camp_type and channel_type have appeared.
Any help or advice will be massively appreciated.
With tidyr you can use pivot_wider:
library(tidyr)
pivot_wider(df, id_cols = date, names_from = c(camp_type, channel_type), values_from = c(sent, open, click))
Output
# A tibble: 3 x 10
date sent_acquisition… sent_acquisition_… sent_newsletter_… open_acquisitio… open_acquisition… open_newsletter… click_acquisiti… click_acquisitio… click_newslette…
<date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2020-01-01 100 NA NA 30 NA NA 14 NA NA
2 2020-02-01 NA 200 50 NA NA 14 NA NA 1
3 2020-03-01 250 NA NA 148 NA NA 100 NA NA
Data
df <- structure(list(date = structure(c(18262, 18293, 18293, 18322), class = "Date"),
camp_type = structure(c(1L, 1L, 2L, 1L), .Label = c("acquisition",
"newsletter"), class = "factor"), channel_type = structure(c(2L,
1L, 2L, 2L), .Label = c("direct_email", "email"), class = "factor"),
sent = c(100, 200, 50, 250), open = c(30, NA, 14, 148), click = c(14,
NA, 1, 100)), class = "data.frame", row.names = c(NA, -4L
))
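As a rough base R alternative, reshape() can do the same pivot if the two type columns are first pasted into a single key. This sketch uses the dataset from the question as posted (dates kept as character strings); the column name key and the final renaming pattern, which puts the channel and campaign before the metric as the question's manual names did, are assumptions made for illustration.

```r
dataset <- data.frame(
  date = c("01/01/2020", "02/01/2020", "02/01/2020", "03/01/2020"),
  camp_type = c("acquisition", "acquisition", "newsletter", "acquisition"),
  channel_type = c("email", "direct_mail", "email", "email"),
  sent = c(100, 200, 50, 250),
  open = c(30, NA, 14, 148),
  click = c(14, NA, 1, 100)
)

# One key per channel/campaign combination; new combinations in the data
# produce new columns automatically.
dataset$key <- paste(dataset$channel_type, dataset$camp_type, sep = "_")

wide <- reshape(
  dataset[, c("date", "key", "sent", "open", "click")],
  idvar = "date", timevar = "key",
  v.names = c("sent", "open", "click"),
  direction = "wide", sep = "_"
)

# reshape() names columns metric_key; move the metric to the end instead.
names(wide) <- sub("^(sent|open|click)_(.*)$", "\\2_\\1", names(wide))
wide
```

Combinations that do not occur on a given date are filled with NA, as in the pivot_wider output.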

Get row value corresponding to a column name

I have this dataset
Book2 <- structure(list(meanX3 = c(21.66666667, 21.66666667, 11, 25, 240.3333333
), meanX1 = c(23, 34.5, 10, 25, 233.5), meanX2 = c(24.5, 26.5,
20, 25, 246.5), to_select = structure(c(3L, 1L, 2L, 1L, 1L), .Label = c("meanX1",
"meanX2", "meanX3"), class = "factor"), selected = c(NA, NA,
NA, NA, NA)), .Names = c("meanX3", "meanX1", "meanX2", "to_select",
"selected"), class = "data.frame", row.names = c(NA, -5L))
I want to get, for each row, the value from the column named in to_select.
I have tried
Book2 %>% dplyr::mutate(selected=.[paste0(to_select)])
But it returns all the column values. How can I go about to get a data set like
structure(list(meanX3 = c(21.66666667, 21.66666667, 11, 25, 240.3333333
), meanX1 = c(23, 34.5, 10, 25, 233.5), meanX2 = c(24.5, 26.5,
20, 25, 246.5), to_select = structure(c(3L, 1L, 2L, 1L, 1L), .Label = c("meanX1",
"meanX2", "meanX3"), class = "factor"), selected = c(21.66, 34.5,
20, 25, 240.33)), .Names = c("meanX3", "meanX1", "meanX2", "to_select",
"selected"), class = "data.frame", row.names = c(NA, -5L))
With base R, a safe strategy would be something like
cols <- as.character(unique(Book2$to_select))
row_col <- match(Book2$to_select, cols)
idx <- cbind(seq_along(Book2$to_select), row_col)
selected <- Book2[, cols][idx]
Book2$selected <- selected
Or using tidyverse packages, something like
library(tidyverse)
Book2 %>% mutate(row=1:n()) %>%
gather(prop, val, meanX3:meanX2) %>%
group_by(row) %>%
mutate(selected=val[to_select==prop]) %>%
spread(prop, val) %>% select(-row)
Would be a decent strategy.
One way is to group by row using rowwise() and then get the value of the string in 'to_select' column
Book2 %>%
rowwise() %>%
mutate(selected = get(as.character(to_select)))
# A tibble: 5 × 5
# meanX3 meanX1 meanX2 to_select selected
# <dbl> <dbl> <dbl> <fctr> <dbl>
#1 21.66667 23.0 24.5 meanX3 21.66667
#2 21.66667 34.5 26.5 meanX1 34.50000
#3 11.00000 10.0 20.0 meanX2 20.00000
#4 25.00000 25.0 25.0 meanX1 25.00000
#5 240.33333 233.5 246.5 meanX1 233.50000
In base R you can use match to select the desired column and then matrix subsetting to select the particular element for each row like this
Book2$selected <- as.numeric(Book2[cbind(seq_len(nrow(Book2)),
match(Book2$to_select, names(Book2)))])
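The base R answers rely on indexing with a two-column matrix, where each row of the index is a (row, column) pair. A toy sketch of that mechanism, using a made-up matrix m and selector pick rather than the Book2 data:

```r
# Toy data: column u holds 1, 2, 3 and column v holds 4, 5, 6.
m <- matrix(1:6, nrow = 3, dimnames = list(NULL, c("u", "v")))

# For each row, the name of the column to take the value from.
pick <- c("v", "u", "v")

# Column 1 of idx is the row number, column 2 the matching column number.
idx <- cbind(seq_len(nrow(m)), match(pick, colnames(m)))

# Indexing with a two-column matrix returns one element per index row.
selected <- m[idx]
selected
```

Applied to a data frame that mixes numeric and factor columns, the same indexing goes through a character matrix, which is why the last answer wraps the result in as.numeric.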

Can you merge your data without creating separate dataframe in R?

My data frame is something like the follows:
sex year country value
F 2010 AU 350
F 2011 GE 258
M 2010 AU 250
F 2012 GE 928
In order to create another data frame that is merged by year and country, with sex and value being what you want to compare, you must first create separate data frames, like:
f <- subset(df, sex=="F")
m <- subset(df, sex=="M")
df_new <- merge(f, m, by=c("country", "year"), suffixes=c("_f", "_m"))
In this way, you can obtain a new data frame with year, and country being matched and just the value being different.
However, I'd rather not create separate data frames just to merge them. Is it possible to get this data frame with a single line of code?
Considering dput(dft) as :
structure(list(sex = structure(c(1L, 1L, 2L, 1L), .Label = c("F", "M"), class = "factor"),
year = c(2010, 2011, 2010, 2012),
country = structure(c(1L, 2L, 1L, 2L), .Label = c("AU", "GE"), class = "factor"),
value = c(350, 258, 250, 928)), .Names = c("sex", "year", "country", "value"),row.names = c(NA, -4L), class = "data.frame")
you can use the tidyverse (spread comes from tidyr) and do:
library(tidyverse)
dft %>% spread(sex,value)
which gives:
# year country F M
#1 2010 AU 350 250
#2 2011 GE 258 NA
#3 2012 GE 928 NA
We can split the data by sex and then use Reduce with merge to get the expected output:
Reduce(function(...) merge(..., by = c("country", "year"),
suffixes = c("_f", "_m")), split(df, df$sex))
# country year sex_f value_f sex_m value_m
#1 AU 2010 F 350 M 250
NOTE: This should also work when there are 'n' number of unique elements in the split by column (without the suffixes or its modification)
A reshaping option with data.table is
library(data.table)
na.omit(dcast(setDT(df), country + year ~ rowid(country, year),
value.var = c("sex", "value")))
# country year sex_1 sex_2 value_1 value_2
#1: AU 2010 F M 350 250
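Another option that needs no extra packages is base R's reshape(), followed by complete.cases() to keep only the country/year pairs present for both sexes, which mimics the inner join that merge() performs. This is a sketch on the question's four rows; the names wide and matched are illustrative.

```r
df <- data.frame(
  sex = c("F", "F", "M", "F"),
  year = c(2010, 2011, 2010, 2012),
  country = c("AU", "GE", "AU", "GE"),
  value = c(350, 258, 250, 928)
)

# One row per country/year, one value column per sex (value.F, value.M).
wide <- reshape(df, idvar = c("country", "year"), timevar = "sex",
                direction = "wide")

# Keep only the rows where both sexes have a value, like merge() would.
matched <- wide[complete.cases(wide), ]
matched
```

The reshape call itself is the one-liner; the complete.cases step is only needed if unmatched rows should be dropped.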

Resources