Filtering transaction-level data in R

I am dealing with a data frame containing transaction-level data. It has two fields, bill_id and product.
The data represents products purchased per bill, and a particular bill_id is repeated as many times as the number of products purchased in that bill. For example, if 5 items were purchased in bill_id 12345, the data for this bill will look like this:
bill_id product
12345 A
12345 B
12345 C
12345 D
12345 E
My objective is to extract the data of all bills containing a certain product.
Following is an example of how I am performing this task currently:
library(dplyr)
set.seed(1)
# Sample data
dat <- data.frame(bill_id = sample(1:500, size = 1000, replace = TRUE),
                  product = sample(LETTERS, size = 1000, replace = TRUE),
                  stringsAsFactors = FALSE) %>%
  arrange(bill_id, product)
# vector of bill_ids of product A
bills_productA <- dat %>%
  filter(product == "A") %>%
  pull(bill_id) %>%
  unique()

# data for bill_ids in vector bills_productA
dat_subset <- dat %>%
  filter(bill_id %in% bills_productA)
This requires creating an intermediate vector of bill_ids (bills_productA) and a two-step filtering process (first find the ids of bills containing the product, then find all transactions of those bills).
Is there a more efficient way of performing this task?

A data.table approach:

Preparation:
library(data.table)
setDT(dat)

Actual code:
dat[bill_id %in% dat[product == "A", ][[1]], ]

Output:
# bill_id product
# 1: 14 A
# 2: 14 I
# 3: 19 A
# 4: 19 W
# 5: 22 A
# ---
# 130: 478 A
# 131: 478 V
# 132: 478 Z
# 133: 494 A
# 134: 494 J
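In the actual code, the inner dat[product == "A", ] selects the matching rows and [[1]] then pulls out its first column (bill_id). A slightly more explicit sketch of the same idea:

dat[bill_id %in% unique(dat[product == "A", bill_id])]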

You can filter on bill_id by subsetting it directly:
library(dplyr)
dat_subset1 <- dat %>% filter(bill_id %in% unique(bill_id[product == "A"]))
identical(dat_subset, dat_subset1)
#[1] TRUE
This would also work without unique(), but it is better to keep the lookup vector short.

Another variation:
library(dplyr)
dat_subset2 <- semi_join(dat, filter(dat, product == "A") %>% select(bill_id))
identical(dat_subset, dat_subset2)
#[1] TRUE
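A sketch of the same idea with an explicit join key, which drops the select() step and avoids the Joining, by = "bill_id" message:

dat_subset2 <- semi_join(dat, filter(dat, product == "A"), by = "bill_id")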


What is the best way to rewrite (simplify) the same logic to produce the same result as the code below in R?

I need to extract a sample that has an equal number of people in each experience-level group. For your info, there are 4 groups in total (1, 2, 3, 4 years of experience), and 8 people in total (A, B, C, D, E, F, G, H) in this example scenario. I was trying to come up with a function with loops, but don't know how. Please help me out! Thank you! :)
library(tidyverse)
data <- tibble(
  id = c("A","A","A","B","B","C","C","D","D","D","D","E","E","E","E","F","F","G","G","G","H","H","H","H"),
  year_exp = c(1,2,3,1,2,1,2,1,2,3,4,1,2,3,4,1,2,1,2,3,1,2,3,4),
  pre_year_exp = year_exp - 1)
data_0 <- data %>% filter(year_exp == max(year_exp) - 0) %>% sample_n(2)
data_1 <- data %>% filter(year_exp == max(year_exp) - 1) %>% anti_join(data_0, by = 'id') %>% sample_n(2)
data_2 <- data %>% filter(year_exp == max(year_exp) - 2) %>% anti_join(data_0, by = 'id') %>% anti_join(data_1, by = 'id') %>% sample_n(2)
data_3 <- data %>% filter(year_exp == max(year_exp) - 3) %>% anti_join(data_0, by = 'id') %>% anti_join(data_1, by = 'id') %>% anti_join(data_2, by = 'id')
#Result Table
result <- data_0 %>% bind_rows(data_1, data_2, data_3)
result
The code below produces the same output as your code and extends the idea to allow an arbitrary number of values of year_exp using a for loop.
Please note that because this simply extends your code, it shares the following (possibly undesirable) features with your code:
The code moves sequentially through the groups, sampling from the members of later groups who were not sampled for earlier groups. Accordingly, there is a risk that the code throws an error when it tries to sample from a group whose members have all been sampled for previous groups.
The probabilities of selection are not uniformly distributed across members of a group. Accordingly, the samples drawn from each group are not representative of that group.
If the data were instead a balanced panel, there would be much more efficient and simpler ways to accomplish this (see the sketch after the output below).
library(tibble)
library(dplyr)
set.seed(123)
# Create original data
data <- tibble(
  id = c("A","A","A","B","B","C","C","D","D","D","D","E","E","E","E","F","F","G","G","G","H","H","H","H"),
  year_exp = c(1,2,3,1,2,1,2,1,2,3,4,1,2,3,4,1,2,1,2,3,1,2,3,4),
  pre_year_exp = year_exp - 1)
# Assign values to parameters used by/in the loop.
J <- data$id %>% unique %>% length # unique units/persons (8)
K <- data$year_exp %>% unique %>% length # unique groups/years (4)
N <- 2 # sample size per group (2)
# Initialize objects loop will modify
samples_list <- vector(mode = "list", length = K) # stores each sample
used_ids <- rep(NA_character_, J) # stores used ids
index <- 1:N # initial indices for used ids
# For-loop solution
for (k in 1:K) {
  # Identifier for current group
  cur_group <- 1 + K - k
  # Sample from persons in current group who were not previously sampled
  one_sample <- data %>%
    filter(year_exp == cur_group, !(id %in% used_ids)) %>%
    slice_sample(n = N)
  # Save sample and the id values for those sampled
  samples_list[[k]] <- one_sample
  used_ids[index] <- one_sample$id
  index <- index + N
}
# Bind into a single data.frame
bind_rows(samples_list)
#> # A tibble: 8 x 3
#> id year_exp pre_year_exp
#> <chr> <dbl> <dbl>
#> 1 H 4 3
#> 2 D 4 3
#> 3 G 3 2
#> 4 E 3 2
#> 5 C 2 1
#> 6 B 2 1
#> 7 F 1 0
#> 8 A 1 0
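For reference, a minimal sketch of the balanced-panel case mentioned above. It assumes every id appears in every year (which this example data does not satisfy), so that the J = K * N ids can be split across the K groups after a single shuffle, giving every member a uniform probability of selection:

# Minimal sketch, assuming a balanced panel (every id observed in every year)
set.seed(123)
ids   <- sample(unique(data$id))                         # random permutation of all ids
years <- sort(unique(data$year_exp), decreasing = TRUE)  # groups, latest first
assignment <- tibble(id = ids, year_exp = rep(years, each = N))
inner_join(data, assignment, by = c("id", "year_exp"))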

Extract values according to the result

I have a dataframe that represents characteristics of people, such as occupation, gender, and telework use:
data <- data.frame(profession = sample(c("craftsman", "employee", "senior executive"), 10000, replace = TRUE),
                   sex = sample(c("M", "F"), 10000, replace = TRUE),
                   en_teletjob = sample(c("Yes", "No"), 10000, replace = TRUE))
I would like to create a new dataframe, resulting from an extraction of the values of "data", such that:
there are 20% men and 80% women;
there are 60% craftsmen, 20% employees, and 20% senior executives;
and there are 50% "Yes" for the use of telework.
Is it possible to do this in R?
Thank you
One approach you can try is apply() with prop.table() combined with table() in order to summarise all variables. Here is the code:
#Code
apply(data,2,function(x) prop.table(table(x)))
Output:
$profession
x
       craftsman         employee senior executive
          0.3331           0.3315           0.3354

$sex
x
     F      M
0.4987 0.5013

$en_teletjob
x
   No   Yes
0.503 0.497
You can use lapply() to call proportions() on each variable. It returns a list object.
lapply(data, function(x) proportions(table(x)))
# $profession
# x
#        craftsman         employee senior executive
#           0.3336           0.3318           0.3346
#
# $sex
# x
#      F      M
# 0.5035 0.4965
#
# $en_teletjob
# x
#     No    Yes
# 0.4978 0.5022
Note: prop.table() is an earlier name of proportions(), retained for back-compatibility.
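A quick way to confirm the equivalence (proportions() requires R >= 4.0.1):

identical(prop.table(table(data$sex)), proportions(table(data$sex)))
#> [1] TRUE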
An option with tidyverse would be to use adorn_percentages from janitor.

Code:
library(purrr)
library(dplyr)
library(janitor)

map(names(data), ~ data %>%
      select(.x) %>%
      count(!!rlang::sym(.x)) %>%
      adorn_percentages(denominator = 'col'))

Output:
#[[1]]
#        profession      n
#         craftsman 0.3302
#          employee 0.3320
#  senior executive 0.3378
#
#[[2]]
# sex      n
#   F 0.5108
#   M 0.4892
#
#[[3]]
# en_teletjob      n
#          No 0.4981
#         Yes 0.5019

Rearrange dataframe to fit longitudinal model in R

I have a dataframe where each entry relates to a job posting in the NHS specifying the week the job was posted, and what NHS Trust (and region) the job is in.
At the moment my dataframe looks something like this:
set.seed(1)
df1 <- data.frame(
  NHS_Trust = sample(1:30, 20, TRUE),
  Week = sample(1:10, 20, TRUE),
  Region = sample(1:15, 20, TRUE))
And I would like to count the number of jobs for each week across each NHS Trust and assign that value to a new column 'jobs' so my dataframe looks like this:
set.seed(1)
df2 <- data.frame(
  NHS_Trust = rep(1:30, each = 10),
  Week = rep(seq(1, 10), 30),
  Region = rep(as.integer(runif(30, 1, 15)), 1, each = 10),
  Jobs = rpois(10 * 30, lambda = 2))
The dataframe may then be used to create a Poisson longitudinal multilevel model where I may model the number of jobs.
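For context, a hypothetical sketch of such a model, assuming the lme4 package and Trusts nested within regions:

library(lme4)
fit <- glmer(Jobs ~ Week + (1 | Region/NHS_Trust), data = df2, family = poisson)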
Using the data.table package you can group by, count and assign to a new column in a single expression. The general form of a data.table call is dt[i, j, by]. The i part selects rows (like WHERE in SQL); it is empty here, so all rows are used in their original order. The j part says what to compute: here we count the number of occurrences in each group using .N and assign the result to the new variable count with the assignment operator :=. The by part takes a list of variables, and the j operation is performed within each group.
library(data.table)
setDT(df1)
df1[, count := .N, by = .(NHS_Trust, Week, Region)]
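Note that := adds the count to every row by reference, keeping duplicate rows, whereas the summarising approaches below collapse to one row per group. A sketch of the row-preserving dplyr analogue would be add_count():

library(dplyr)
df1 <- df1 %>% add_count(NHS_Trust, Week, Region, name = "count")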
A tidyverse approach would be
library(tidyverse)
df1 <- df1 %>%
  group_by(NHS_Trust, Week, Region) %>%
  count()
You can use count() to count the number of jobs for each Region, NHS_Trust and Week combination, and use complete() to fill in the missing combinations with 0.
library(dplyr)
df1 %>%
  count(Region, NHS_Trust, Week, name = 'Jobs') %>%
  tidyr::complete(Region, Week = 1:10, fill = list(Jobs = 0))
I guess I'm moving my comment to an answer:
df2 <- df1 %>% group_by(Region, NHS_Trust, Week) %>% count()
colnames(df2)[4] <- "Jobs"
df2$combo <- paste0(df2$Region, "_", df2$NHS_Trust, "_", df2$Week)
for (i in 1:length(unique(df2$Region))) {
  for (j in 1:length(unique(df2$NHS_Trust))) {
    for (k in 1:length(unique(df2$Week))) {
      curr_combo <- paste0(unique(df2$Region)[i], "_",
                           unique(df2$NHS_Trust)[j], "_",
                           unique(df2$Week)[k])
      if (!curr_combo %in% df2$combo) {
        curdat <- data.frame(unique(df2$Region)[i],
                             unique(df2$NHS_Trust)[j],
                             unique(df2$Week)[k],
                             0,
                             curr_combo,
                             stringsAsFactors = FALSE)
        #cat(curdat)
        names(curdat) <- names(df2)
        df2 <- rbind(as.data.frame(df2), curdat)
      }
    }
  }
}
tail(df2)
# Region NHS_Trust Week Jobs combo
# 4495 15 1 4 0 15_1_4
# 4496 15 1 5 0 15_1_5
# 4497 15 1 8 0 15_1_8
# 4498 15 1 3 0 15_1_3
# 4499 15 1 6 0 15_1_6
# 4500 15 1 9 0 15_1_9
The for loop here checks which Region-NHS_Trust-Week combinations are missing from df2 and appends those to df2 with a corresponding Jobs value of 0. The check is done with the help of the new variable combo, which is just a concatenation of the values in the fields mentioned earlier, separated by underscores.
Edit: I am quite sure the people here can come up with something more elegant than this.

data.table alternative for slow group_by() and case_when() function

In my data I have customer ids, order dates and an indicator of whether an order contained a certain type of product.
I want to give each customer an indicator showing whether his first order contained this type of product. But because my data is pretty big I cannot use group_by and case_when; they are way too slow. I think I could speed things up a lot by using data.table.
Could you point me to a solution? I haven't had any contact with data.table until now...
# generate data
id <- round(rnorm(3000, mean = 5000, sd = 400), 0)
date <- seq.Date(as.Date("2018-01-01"), as.Date("2018-12-31"), "day")
date <- sample(date, length(id), replace = TRUE)
indicator <- rbinom(length(id), 1, 0.5)
df <- data.frame(id, date, indicator)
df$id <- as.factor(df$id)
# Does the first order contain X?
df <- df %>% group_by(id) %>% mutate(First_Order_contains_x = case_when(
  date == min(date) & indicator == "1" ~ 1,
  TRUE ~ 0
)) %>% ungroup()

# If the first order contains X ==> all of the customer's orders get the label
df <- df %>% group_by(id) %>% mutate(Customer_type = case_when(
  sum(First_Order_contains_x) > 0 ~ "Customer with X in first order",
  TRUE ~ "Customer without x in first order"
)) %>% ungroup()
Another way:
library(data.table)
DT = data.table(df[, 1:3])
lookupDT = DT[, .(date = min(date)), by=id]
lookupDT[, fx := DT[copy(.SD), on=.(id, date), max(indicator), by=.EACHI]$V1]
DT[, v := "Customer without x in first order"]
DT[lookupDT[fx == 1L], on=.(id), v := "Customer with X in first order"]
# check results
fsetequal(DT[, .(id, v)], data.table(id = df$id, v = df$Customer_type))
# [1] TRUE
If you want more speed improvements, maybe see ?IDate.
The copy on .SD is needed due to an open issue.
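For example, a minimal sketch of that conversion (IDate stores dates as integers, which can speed up grouping and joining on date columns):

DT[, date := as.IDate(date)]
lookupDT[, date := as.IDate(date)]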
Here's how you can improve your existing code using dplyr more efficiently:
lookup <- data.frame(First_Order_contains_x = c(TRUE, FALSE),
                     Customer_Type = c("Customer with X in first order",
                                       "Customer without x in first order"))

df %>%
  group_by(id) %>%
  mutate(First_Order_contains_x = any(date == min(date) & indicator == 1)) %>%
  ungroup() %>%
  left_join(lookup, by = "First_Order_contains_x")
# A tibble: 3,000 x 5
id date indicator First_Order_contains_x Customer_Type
<fct> <date> <dbl> <lgl> <fct>
1 5056 2018-03-10 1 TRUE Customer with X in first order
2 5291 2018-12-28 0 FALSE Customer without x in first order
3 5173 2018-04-19 0 FALSE Customer without x in first order
4 5159 2018-11-13 0 TRUE Customer with X in first order
5 5252 2018-05-30 0 TRUE Customer with X in first order
6 5200 2018-01-20 0 FALSE Customer without x in first order
7 4578 2018-12-18 1 FALSE Customer without x in first order
8 5308 2018-03-24 1 FALSE Customer without x in first order
9 5234 2018-05-29 1 TRUE Customer with X in first order
10 5760 2018-06-12 1 TRUE Customer with X in first order
# … with 2,990 more rows
Another data.table approach: sort the data first so that the earliest date comes first within each id, and we can then use the first indicator to test the condition. Then convert the logical to an integer (FALSE -> 1, TRUE -> 2) and map it to the desired output using a character vector.
library(data.table)
setDT(df)
setorder(df, id, date)
map <- c("Customer without x in first order", "Customer with X in first order")
df[, idx := 1L+any(indicator[1L]==1L), by=.(id)][,
First_Order_contains_x := map[idx]]
If the original order is important, we can store the original order using df[, rn := .I] then finally setorder(df, rn).
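A sketch of that order-preserving variant, which only adds the rn bookkeeping around the code above:

setDT(df)
df[, rn := .I]  # remember the original row order
setorder(df, id, date)  # sort so indicator[1L] is the earliest order
df[, idx := 1L + any(indicator[1L] == 1L), by = .(id)][, First_Order_contains_x := map[idx]]
setorder(df, rn)  # restore the original row order
df[, c("rn", "idx") := NULL]  # drop the helper columns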
data:
set.seed(0L)
id <- round(rnorm(3000, mean = 5000, 5),0)
date <- seq.Date(as.Date("2018-01-01"), as.Date("2018-12-31"), "day")
date <- sample(date, length(id), replace = TRUE)
indicator <- rbinom(length(id), 1, 0.5)
df <- data.frame(id, date, indicator)
df$id <- as.factor(df$id)

Create a loop for pasting or removing elements based on different scenarios

Say I have the following data set:
mydf <- data.frame(MemberID = c("111","0111A","0111B","112","0112A","113","0113B"),
                   resign.date = c("2013/01/01",NA,NA,"2014/03/01",NA,NA,NA))
Note: 111,112 and 113 are the IDs for the family representative.
I would like to do two things:
a) if I have the resign date for a family representative, for instance in the case of 111, I want to paste the same resign date for 0111A and 0111B (these represent the spouse and children of 111, if you're wondering);
b) if I don't have a resign date for the family representative, for instance 113, I would simply like to remove the rows for 113 and 0113B.
My resulting data frame should look like this:
mydf <- data.frame(MemberID = c("111","0111A","0111B","112","0112A"),
                   resign.date = c("2013/01/01","2013/01/01","2013/01/01","2014/03/01","2014/03/01"))
Thanks in advance.
If resign.date is only present for (some) MemberIDs without trailing letters, here is a solution using data.table:
library(data.table)
df <- data.table(MemberID = c("0111","0111A","0111B","0112","0112A","0113","0113B"),
                 resign.date = c("2013/01/01",NA,NA,"2014/03/01",NA,NA,NA))

df <- df[order(MemberID)]  ## order data: MemberIDs without trailing letters come first within each family
df[, myID := gsub("\\D+", "", MemberID)]  ## create myID col: MemberID without trailing letters
df[, my.resign.date := resign.date[1L], by = myID]  ## assign first occurrence of resign.date within each myID
df <- df[!is.na(my.resign.date)]  ## drop rows where my.resign.date is missing
EDIT: If there are inconsistencies in MemberID (some have a leading 0, some don't), you can try a workaround as in what follows:
df <- data.table(MemberID = c("111","0111A","0111B","112","0112A","113","0113B"),
                 resign.date = c("2013/01/01",NA,NA,"2014/03/01",NA,NA,NA))

df[, myID := gsub("(?<![0-9])0+", "", gsub("\\D+", "", MemberID), perl = TRUE)]  ## strip non-digits, then leading zeros
df <- df[order(myID, -MemberID)]
df[, my.resign.date := resign.date[1L], by = myID]
df <- df[!is.na(my.resign.date)]
We can also use tidyverse
library(tidyverse)
mydf %>%
  group_by(grp = parse_number(MemberID)) %>%
  mutate(resign.date = first(resign.date)) %>%
  na.omit() %>%
  ungroup() %>%
  select(-grp)
# A tibble: 5 x 2
# MemberID resign.date
# <fctr> <fctr>
#1 0111 2013/01/01
#2 0111A 2013/01/01
#3 0111B 2013/01/01
#4 0112 2014/03/01
#5 0112A 2014/03/01
