Creating synthetic user data in R; issues with generating user identifier variable

I am trying to generate synthetic user event log data for demonstration purposes. It's going to be very basic feature-wise (about 4 variables altogether). Here is what I have so far:-
require(wakefield)#for generating the Status variable
require(dplyr)
require(stringi)
set.seed(1)
#data<-data.frame()
eventDate<-seq(as.Date("2015-01-01"), as.Date("2015-12-31"), by = "1 day")
eventDate<-sample(rep(eventDate,each=1000),replace = T)
u <- runif(length(eventDate), 0, 60*60*12) # "noise" to add or subtract from some timepoint
eventDateTime<-as.POSIXlt(u, origin = paste0(eventDate,"00:00:00"))
eventDateTime
eventOutcome<-r_sample_factor(x = c("Passed", "Failed", "Ongoing","Unknown"), n=length(eventDate))
eventOutcome
data<-data.frame(eventDate,eventDateTime,eventOutcome)
head(data)
# eventDate eventDateTime eventOutcome
#1 2015-01-25 2015-01-25 04:48:47 Unknown
#2 2015-05-05 2015-05-05 09:35:22 Unknown
#3 2015-11-28 2015-11-28 08:56:16 Failed
#4 2015-05-23 2015-05-23 02:24:52 Ongoing
#5 2015-01-26 2015-01-26 07:43:52 Failed
#6 2015-10-22 2015-10-22 03:07:14 Passed
There are about 365,000 rows of data here. All that is left to do is add a user identifier variable. I would like some users to have only a handful of interactions recorded in the data set, whereas others may have dozens/hundreds/thousands of interactions (I would like this dataset to have that kind of variability).
I can create a user identifier variable no problem:-
UserId<-stri_rand_strings(1300,6)
But if I add this to the data, it doesn't work:-
data$UserId<-stri_rand_strings(1300,6)
Error in `$<-.data.frame`(`*tmp*`, UserId, value = c("k3QlXs", "gK3eBa", :
replacement has 1300 rows, data has 365000
So my request is two-fold: how can I assign a user identifier variable to this kind of data, and how can I make it variable, so that some users have only one or a few interactions whilst others appear frequently (i.e. dozens, hundreds or thousands of times)?
Thank you in advance for any help, always appreciated :)

One option might be to generate a UUID for each user. A UUID looks like this:
c7f2dde5-dfeb-45cb-9720-87b23effd45d
If you use a good UUID generator, then it is almost impossible to generate the same UUID more than once. R has a uuid package which can be used:
library(uuid)
user_uuid <- UUIDgenerate()
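For example, a minimal sketch (my addition, using the n argument of UUIDgenerate and the data frame data built in the question) that gives each of the 1300 users a UUID and assigns one to every row:
library(uuid)
user_ids <- UUIDgenerate(n = 1300)                           # one UUID per user
data$UserId <- sample(user_ids, nrow(data), replace = TRUE)  # assign a user to each event
This gives roughly uniform usage across users; to make some users far more frequent than others, pass a prob vector to sample(), as in the Pareto-based answer below.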

It sounds like the distribution of user IDs you are looking for is something akin to a power law or Pareto distribution. Below is one way to generate a vector of relative probabilities for sampling each user ID that follows a power law, and then sample the user IDs using those relative probabilities.
Here I've used the function Pareto::rPareto to generate the relative probabilities. You can vary the parameters, especially alpha, to get different distributions; as alpha increases the distribution becomes more even. I also supplied a truncation parameter so that you will not get too many users with unrealistically high numbers of purchases (in this example the most frequent ID has ~3,700 cases).
Note you are not guaranteed to sample each of the 1300 user IDs at least once with this particular way of doing it.
Next I use the function uuid::UUIDgenerate to generate 1300 unique strings. Finally I use sample to sample the unique IDs with replacement as many times as you have rows in your data frame. I plot the frequencies of the different IDs in the sample. Again, modify the parameters if this distribution is not what you are looking for.
library(Pareto)
library(uuid)
library(ggplot2)
set.seed(1)
n_users <- 1300
n_rows <- 365000
relative_probs <- rPareto(n = n_users, t = 1, alpha = 0.3, truncation = 500)
unique_ids <- UUIDgenerate(n = n_users)
id_sample <- sample(unique_ids, size = n_rows, prob = relative_probs, replace = TRUE)
# Check the distribution of IDs
ggplot(as.data.frame(table(id_sample)), aes(x = Freq)) +
geom_histogram() +
scale_x_log10()
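To actually attach the sampled IDs to the question's data frame (my addition, assuming data from the question has one row per event, i.e. nrow(data) equals n_rows):
data$UserId <- id_sample
# If every one of the 1300 IDs must appear at least once, one crude option is to
# overwrite the first n_users rows (this slightly distorts the distribution):
# data$UserId[seq_len(n_users)] <- unique_ids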

You are generating 1300 strings, but the number of rows in your data is 365,000, so you can use sample to repeat those 1300 strings randomly.
library(stringi)
data$UserId <- sample(stri_rand_strings(1300,6), nrow(data), replace = TRUE)

Related

Argument is not numeric

I would like to visualize the number of people infected with COVID-19, but when I try to obtain the mortality rate per 100,000 population for each prefecture I cannot, because the number of deaths is not being treated as a numeric value.
What I want to achieve
I want to be able to compute covid19j_20200613$deaths / covid19j_20200613$POP2019 * 100 by setting the data type of covid19j_20200613$deaths to numeric.
Error message.
Error in covid19j_20200613$deaths/covid19j_20200613$POP2019:
Argument of binary operator is not numeric
Source code in question.
library(spdep)
library(sf)
library(spatstat)
library(tidyverse)
library(ggplot2)
needs::prioritize(magrittr)
covid19j <- read.csv("https://raw.githubusercontent.com/kaz-ogiwara/covid19/master/data/prefectures.csv",
header=TRUE)
# Below is an example for May 20, 2020.
# Month and date may be changed
covid19j_20200613 <- dplyr::filter(covid19j,
year==2020,
month==6,
date==13)
covid19j_20200613$CODE <- 1:47
covid19j_20200613[is.na(covid19j_20200613)] <- 0
pop19 <- read.csv("/Users/carlobroschi_imac/Documents/lectures/EGDS/07/covid19_data/covid19_data/pop2019.csv", header=TRUE)
covid19j_20200613 <- dplyr::inner_join(covid19j_20200613, pop19,
by = c("CODE" = "CODE"))
# Load Japan prefecture administrative boundary data
jpn_pref <- sf::st_read("/Users/carlobroschi_imac/Documents/lectures/EGDS/07/covid19_data/covid19_data/jpn_pref.shp")
# Data and concatenation
jpn_pref_cov19 <- dplyr::inner_join(jpn_pref, covid19j_20200613, by=c("PREF_CODE"="CODE"))
ggplot2::ggplot(data = jpn_pref_cov19) +
geom_sf(aes(fill=testedPositive)) +
scale_fill_distiller(palette="RdYlGn") +
theme_bw() +
labs(title = "Tested Positiv of Covid19 (2020/06/13)")
# Mortality rate per 100,000 population
# Population number in units of 1000
as.numeric(covid19j_20200613$deaths)
covid19j_20200613$deaths_rate <- covid19j_20200613$deaths / covid19j_20200613$POP2019 * 100
Data in question.
prefectures.csv
https://docs.google.com/spreadsheets/d/11C2vVo-jdRJoFEP4vAGxgy_AEq7pUrlre-i-zQVYDd4/edit?usp=sharing
pop2019.csv
https://docs.google.com/spreadsheets/d/1CbEX7BADutUPUQijM0wuKUZFq2UUt-jlWVQ1ipzs348/edit?usp=sharing
What we tried
I tried putting as.numeric(covid19j_20200613$deaths) before the calculation to set the number of deaths to numeric, but I got the same error message during the calculation.
Additional information (FW/tool versions, etc.)
iMac M1 2021, R 4.2.0
Translated with www.DeepL.com/Translator (free version)
as.numeric() does not permanently change the data type - it only returns a converted copy.
So when you run as.numeric(covid19j_20200613$deaths), this shows you the column deaths as numeric, but the column itself stays a character.
So if you want to coerce the data type, you need to also reassign:
covid19j_20200613$deaths <- as.numeric(covid19j_20200613$deaths)
covid19j_20200613$POP2019 <- as.numeric(covid19j_20200613$POP2019)
# Now you can do calculations
covid19j_20200613$deaths_rate <- covid19j_20200613$deaths / covid19j_20200613$POP2019 * 100
It's easier to read if you use mutate from dplyr:
covid19j_20200613 <- covid19j_20200613 |>
mutate(
deaths = as.numeric(deaths),
POP2019 = as.numeric(POP2019),
deaths_rate = deaths / POP2019 * 100
)
Result
deaths POP2019 deaths_rate
1 91 5250 1.73333333
2 1 1246 0.08025682
3 0 1227 0.00000000
4 1 2306 0.04336513
5 0 966 0.00000000
PS: your question is really difficult to follow! There is a lot of stuff that we don't actually need to answer it, so that makes it harder for us to identify where the issue is. For example, all the data import, the join, the ggplot...
When writing a question, please only include the minimal elements that lead to a problem. In your case, we only needed a sample dataset with the deaths and POP2019 columns, and the two lines of code that you tried to fix at the end.
If you look at str(covid19j) you'll see that the deaths column is a character column containing a lot of blanks. You need to figure out the structure of that column to read it properly.
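For example (a sketch of my own, reusing the column name from the question), you can make the blanks explicit before converting, so any NAs introduced are intentional rather than a surprise:
covid19j_20200613$deaths[covid19j_20200613$deaths == ""] <- NA   # blanks become explicit NAs
covid19j_20200613$deaths <- as.numeric(covid19j_20200613$deaths) # now a clean numeric conversion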

Backtesting in R for time series

I am new to the backtesting methodology - an algorithm used to assess whether something works based on historical data. Since I am new to it, I am trying to keep things simple in order to understand it. So far I have understood that if, let's say, I have a time series data set:
date = seq(as.Date("2000/1/1"),as.Date("2001/1/31"), by = "day")
n = length(date);n
class(date)
y = rnorm(n)
data = data.frame(date,y)
I will keep the first 365 days as the in-sample period in order to do something with them, and then I will update them with one observation at a time for the next month. Am I correct here?
So if I am correct, I define the in-sample and out-of-sample periods.
T = dim(data)[1];T
outofsampleperiod = 31
initialsample = T-outofsampleperiod
I want, for example, to find the quantile at alpha = 0.01 of the empirical data.
pre = data[1:initialsample,]
ypre = pre$y
quantile(ypre,0.01)
1%
-2.50478
Now the difficult part for me is to update them in a for loop in R.
I want to add one observation at a time, find the empirical quantile at alpha = 0.01 again, print them all, and check whether each is greater than the in-sample quantile obtained previously.
for (i in 1:outofsampleperiod){
qnew = quantile(1:(initialsample+i-1),0.01)
print(qnew)
}
You can create a little function that gets the quantile of column y, over rows 1 to i of a frame df like this:
func <- function(i,df) quantile(df[1:i,"y"],.01)
Then apply this function to each row of data
data$qnew <- sapply(1:nrow(data), func, df = data)
Output (last six rows)
> tail(data)
date y qnew
392 2001-01-26 1.3505147 -2.253655
393 2001-01-27 -0.5096840 -2.253337
394 2001-01-28 -0.6865489 -2.253019
395 2001-01-29 1.0881961 -2.252701
396 2001-01-30 0.1754646 -2.252383
397 2001-01-31 0.5929567 -2.252065
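As a small follow-up sketch (my addition, using data, initialsample and outofsampleperiod as defined in the question), you can then compare each expanding-window quantile against the fixed in-sample quantile, which is the check the question asks about:
insample_q <- quantile(data$y[1:initialsample], 0.01)           # fixed in-sample quantile
expanding_q <- sapply(initialsample:(nrow(data) - 1),
                      function(i) quantile(data$y[1:i], 0.01))  # quantile recomputed as the window grows
expanding_q > insample_q                                        # TRUE where the updated quantile is larger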

Is there a way to import the results of HSD.test from agricolae directly into geom_text() in a ggplot2?

I'm creating figures that show the efficacy of several warning signals relative to the event they warn about. The figure is based on a data frame produced by a function that runs a model multiple times and collates the results like this:
t type label early
4 847 alarm alarm.1 41
2 849 alarm alarm.2 39
6 853 alarm alarm.3 35
5 923 alarm alarm.4 -35
7 1003 alarm alarm.5 -115
But the real data has a dozen alarms and a value for each alarm n times (typically 20-100), with each value being slightly different depending on random and stochastic variables built into the model.
I'm putting the results in an lm
a.lm <- lm(log(early + 500) ~ label, data = alarm.data)
and, after checking the assumptions are met, running a one-way ANOVA
anova(a.lm)
then a Tukey post hoc test
HSD.test(a.lm, trt = "label", console = TRUE)
Which produces
log(early + 500) groups
alarm.1 6.031453 a
alarm.2 6.015221 a
alarm.3 6.008366 b
alarm.4 5.995150 b
alarm.5 5.921384 c
I have a function which generates a ggplot2 figure based on the collated data, to which I am then manually adding + geom_text(label = c("a", "a", "b", "b", "c")) or whatever the appropriate letters are. Is there a way to generalise that last step, calling the letters directly from the result of the HSD.test? If I put the results of the HSD.test into an object
a.test <- HSD.test(a.lm, trt = "label", console = TRUE)
I can call the results using a.test$groups, and the letter groupings specifically using a.test$groups$groups, but I don't know enough about manipulating lists to make that useful to me. Whilst the order of the labels in the ggplot is predictable, the order of the groups in the HSD.test result isn't and can vary a lot between iterations of the model-running function.
If anyone has any insights I would be grateful.
Okay I actually bumped into a solution just after I posted the question.
If you take the output of the HSD.test and make it into an object
a.test <- HSD.test(a.lm, trt = "label")
Then convert the groups list into a dataframe
a.df <- as.data.frame(a.test$groups)
The row names are the alarm names rather than numbers:
a.df
log(early + 500) groups
alarm.1 6.849082 a
alarm.2 6.842465 a
alarm.3 6.837438 a
alarm.4 6.836437 a
alarm.5 6.812714 a
so they can be called specifically into geom_text inside the function
a.plot +
  geom_text(label = c(a.df["alarm.1", 2],
                      a.df["alarm.2", 2],
                      a.df["alarm.3", 2],
                      a.df["alarm.4", 2],
                      a.df["alarm.5", 2]))
Even though this doesn't use the same functions to get the compact letter display, I think it may be a more efficient way of doing it.
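A slightly more general variant (my own sketch, assuming the plot maps label to the x axis): put the letters into a small annotation data frame keyed by the row names of a.df, so the order never has to be hard-coded:
ann <- data.frame(label = rownames(a.df), letters = a.df$groups)
a.plot +
  geom_text(data = ann, aes(x = label, y = Inf, label = letters),
            vjust = 1.5, inherit.aes = FALSE)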

Performance issues in for loop (moving towards vectorization with multiple sample()'s)

I am currently having issues with performance in one of my scripts. I made the script as a result of this question, but I have been unable to improve its performance and figured that improving it is a different question from actually writing the code.
I wrote the code to generate a dummy webshop dataset with a hidden pattern that can be found with clustering, as an example in one of my courses. It, however, does not allow me to go beyond ~40,000 transactions within a reasonable runtime (i.e. a few hours).
The issue is as follows: using these parameters I will build a transaction/customer/product table:
set.seed(1) # Set seed to make reproducible
Parameters <- data.frame(
CustomerType = c("EarlyAdopter", "Pragmatists", "Conservatives", "Dealseeker"),
PropCustTypes = c(.10, .45, .30, .15), # Probability for being in each group.
BySearchEngine = c(0.10, .40, 0.50, 0.6), # Probability for each group
ByDirectCustomer = c(0.60, .30, 0.15, 0.05), # of coming through channel X
ByPartnerBlog = c(0.30, .30, 0.35, 0.35), #
Timeliness = c(1,6,12,12), # Average # of months between purchase & releasedate.
Discount = c(0,0,0.05,0.10), # Average Discount incurred when purchasing.
stringsAsFactors=FALSE)
# Some other parameters for later use.
NumDays = 1000
NumTransactions = 100000 # Note that more than these will be made, it's a starting point (excluding annual growth, weekend increases etc.)
SalesMultiplierWeekends = 1.5 # For example, I want more in weekends
StartDate <- as.Date("2009-01-04")
NumProducts <- 150
AnnualGrowth <- .1 # I also want an annual growth trend
I start with a 'Days' dataframe along with an almost equal division of total transactions over all days.
days <- data.frame( # Define the days
day = StartDate+1:NumDays,
DaysSinceStart = StartDate+1:NumDays - StartDate, # Used to compute a rising trend
CustomerRate = NumTransactions/NumDays)
days$nPurchases <- rpois(NumDays, days$CustomerRate)
days$nPurchases[as.POSIXlt(days$day)$wday %in% c(0,6)] <- # Increase sales in weekends
as.integer(days$nPurchases[as.POSIXlt(days$day)$wday %in% c(0,6)]*SalesMultiplierWeekends)
days$nPurchases <- as.integer(days$nPurchases+days$nPurchases * (days$DaysSinceStart/365)*AnnualGrowth)
Next I generate the transactions using this table:
Transactions <- data.frame(
ID = 1:sum(days$nPurchases),
Date = rep(days$day, times=days$nPurchases),
CustomerType = sample(Parameters$CustomerType, sum(days$nPurchases), replace=TRUE, prob=Parameters$PropCustTypes),
NewCustomer = sample(c(0,1), sum(days$nPurchases),replace=TRUE, prob=c(.8,.2)),
CustomerID = NA, # Will be assigned later, NewCustomer: 0.8 and .2
ProductID = NA, # insinuate new/existing customers above
ReferredBy = NA)
Transactions$CustomerType <- as.character(Transactions$CustomerType)
Now I'd like to dynamically assign products and customers to each transaction in order to make my pattern recognizable in the transaction dataset. I first make a product table from which I can choose, having convenient release dates so that I will be able to select a product for each transaction based on this info.
StartProductRelease <- StartDate-(365*2*max(Parameters$Timeliness)/12)
ReleaseRange <- StartProductRelease + c(1:(StartDate+NumDays-StartProductRelease))
Upper <- max(ReleaseRange)
Lower <- min(ReleaseRange)
Products <- data.frame(
ID = 1:NumProducts,
DateReleased = as.Date(StartProductRelease+c(seq(as.numeric(Upper-Lower)/NumProducts,
as.numeric(Upper-Lower),
as.numeric(Upper-Lower)/NumProducts))),
SuggestedPrice = rnorm(NumProducts, 100, 50))
Products[Products$SuggestedPrice<10,]$SuggestedPrice <- 15 # Floor product prices below $10 at $15
Next I build a table of customers, deriving from the number of 'new customers' in the transaction dataset.
Customers <- data.frame(
ID=(1:sum(Transactions$NewCustomer)),
CustomerType = sample(Parameters$CustomerType, size=sum(Transactions$NewCustomer),
replace=TRUE, prob=Parameters$PropCustTypes)
); Customers$CustomerType <- as.character(Customers$CustomerType)
I want to dynamically assign Customers and Products to each transaction, sampled from the 'Products' and 'Customers' dataframes, in order to maintain the overall parameters I have defined above. I'd like to vectorize this, but I have no idea how I would do so (I've already excluded as much as I could from the for loop). The part outside of the for loop:
ReferredByOptions <- c("BySearchEngine", "Direct Customer", "Partner Blog")
Transactions <- merge(Transactions,Parameters, by="CustomerType") # Parameters are now
Transactions$Discount <- rnorm(length(Transactions$ID), # assigned to each transaction
Transactions$Discount,Transactions$Discount/20)
Transactions$Timeliness <- rnorm(length(Transactions$ID),
Transactions$Timeliness, Transactions$Timeliness/6)
Now for the part where the performance issues arise: the for loop:
for (i in 1:nrow(Transactions)){
# Only sample customers which share the same 'CustomerType' as the transaction
Transactions[i,]$CustomerID <- sample(Customers[Customers$CustomerType==Transactions[i,]$CustomerType,]$ID,
1,replace=FALSE)
# Sample the 'ReferredBy' based upon the proportions described in 'Parameters'
Transactions[i,]$ReferredBy <- sample(ReferredByOptions,1,replace=FALSE,
prob=Transactions[i,c("BySearchEngine", "ByDirectCustomer", "ByPartnerBlog")])
# Only sample products in the required range to maintain the 'timeliness' parameter.
CenteredAround <- as.Date(Transactions[i,]$Date - Transactions[i,]$Timeliness*30)
ProductReleaseRange <- as.Date(CenteredAround+c(-15:15))
Transactions[i,]$ProductID <- sample(Products[as.character(Products$DateReleased) %in% as.character(ProductReleaseRange),]$ID,1,replace=FALSE)
}
This brings me to my final question: how would I vectorize the last part here? I've been able to munge millions of rows with data.table in seconds, so it seems strange that such a relatively simple task runs this slowly.
For loop / filling 100 rows: ~ 18 Seconds
For loop / filling 200 rows: ~ 37 Seconds
For loop / filling 1000 rows: ~ 3 minutes
For loop / filling 300000 rows: No idea, can't get that far?
Why is it running so slow and how can I solve this? Any help is greatly appreciated.
Below is how you would do the first part using data.table, adding CustomerID to the Transactions table. I have changed some names and dropped the placeholder columns as they will be added through the data.table joins.
Tr <- data.table(Transactions)
Tr[, CustomerID:=NULL]
Tr[, ProductID:=NULL]
Tr[, ReferredBy:=NULL] ## see #Arun's comment for a more compact way to do this
Cs <- data.table(Customers)
setnames(Cs, 'ID', 'CustomerID') ## So we avoid duplicate with Tr
## Add customer ID, matching customer types
setkey(Tr, CustomerType)
setkey(Cs, CustomerType)
# Make an index Transaction ID -> Customer ID
# Large interim matrix should not be formed, but I am not sure
TrID2CustID <- Cs[Tr, allow.cartesian=T][, list(CustomerID=sample(CustomerID, 1)), by=ID]
setkey(TrID2CustID, ID)
setkey(Tr, ID)
Tr <- Tr[TrID2CustID]
There is a large matrix that is the cartesian product of your Transactions and Customers tables (about 15M rows) which would exhaust the memory if it is explicitly computed. Judging by the fact that this takes about a second, I'd say it is not computed, but I am not sure.
I will work on the rest and edit the answer if I come up with the solutions quickly, but this ought to show you how to do this using data.table.
UPDATE 1: adding ReferredBy
Since the referral probabilities only vary by CustomerType, you can generate the referrals in blocks with replacement (much faster than by individual ID)
setkey(Tr, CustomerType)
Tr[, ReferredBy:=sample(ReferredByOptions, replace=TRUE, size=.N,
prob=c(BySearchEngine[1],
ByDirectCustomer[1],
ByPartnerBlog[1])),
by=CustomerType]
UPDATE 2: adding ProductID
This is proving trickier to do in a neat cartesian-product sort of way. I cannot think of an elegant way to generate the 31 dates (-15:15) for each purchase (melted matrix would probably be too big). The code below works as intended but is not as fast as the previous 2:
Pr <- data.table(Products)
setnames(Pr, 'ID', 'ProductID') ## not necessary here, but good practice
CenteredAround <- as.Date(Tr$Date - 30*Tr$Timeliness)
setkey(Tr, ID)
Tr[, ProductID:=sample(Pr[abs(Pr$DateReleased -
CenteredAround[.I]) <= 15, ProductID], 1), by=ID]
A very simple optimization is to avoid modifying the data frame in the loop, as others have suggested. At least prior to R3.1, modifying a data frame is really expensive, so that's the last thing you want to be doing in a loop. Also, based on Hadley's comments and release notes for R3.1, it may be the case that modifying data frames is not as expensive with R3.1, but I haven't tested.
Here we get around the data frame modification by storing interim results in vectors, and then only inserting into the data frame after the loop. Consider:
system.time({
  custId <- Transactions$CustomerID
  refBy <- Transactions$ReferredBy
  productID <- Transactions$ProductID
  for (i in 1:100){
    # Only sample customers which share the same 'CustomerType' as the transaction
    custId[i] <- sample(Customers[Customers$CustomerType==Transactions[i,]$CustomerType,]$ID,
                        1, replace=FALSE)
    # Sample the 'ReferredBy' based upon the proportions described in 'Parameters'
    refBy[i] <- sample(ReferredByOptions, 1, replace=FALSE,
                       prob=Transactions[i,c("BySearchEngine", "ByDirectCustomer", "ByPartnerBlog")])
    # Only sample products in the required range to maintain the 'timeliness' parameter.
    CenteredAround <- as.Date(Transactions[i,]$Date - Transactions[i,]$Timeliness*30)
    ProductReleaseRange <- as.Date(CenteredAround+c(-15:15))
    productID[i] <- sample(Products[as.character(Products$DateReleased) %in% as.character(ProductReleaseRange),]$ID,
                           1, replace=FALSE)
  }
  # Write the interim vectors back into the data frame once, after the loop
  Transactions$CustomerID <- custId
  Transactions$ReferredBy <- refBy
  Transactions$ProductID <- productID
})
Which times in at:
user system elapsed
0.66 0.06 0.71
The corresponding time with your original code is:
user system elapsed
5.01 1.78 6.79
So close to a 10x improvement with a minor change (avoiding modifying the data frame repeatedly).
I'm sure you can get further improvements, but this is a real low hanging fruit you can easily implement.
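One further sketch along those lines (my addition, not from the original answer): precompute the pool of candidate customer IDs per CustomerType once with split(), so the loop or an apply call no longer subsets Customers on every iteration; column names are taken from the question:
idPools <- split(Customers$ID, Customers$CustomerType)    # one vector of IDs per customer type
custId <- vapply(Transactions$CustomerType[1:100],
                 function(ct) sample(idPools[[ct]], 1),   # assumes each pool has more than one ID
                 numeric(1))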

Generating dummy webshop data in R: Incorporating parameters when randomly generating transactions

For a course I am currently in, I am trying to build a dummy transaction, customer & product dataset to showcase a machine learning use case in a webshop environment as well as a financial dashboard; unfortunately, we have not been given dummy data. I figured this would be a nice way to improve my R knowledge, but I am experiencing severe difficulties in realizing it.
The idea is that I specify some parameters/rules (arbitrary/fictitious, but applicable for a demonstration of a certain clustering algorithm). I'm basically trying to hide a pattern to then re-find this pattern utilizing machine learning (not part of this question). The pattern I'm hiding is based on the product adoption life cycle, attempting to show how identifying different customer types could be used for targeted marketing purposes.
I'll demonstrate what I'm looking for. I'd like to keep it as realistic as possible. I attempted to do so by assigning the number of transactions per customer and other characteristics to normal distributions; I am completely open to other potential ways of doing this.
The following is how far I have come, first build a table of customers:
# Define Customer Types & Respective probabilities
CustomerTypes <- c("EarlyAdopter","Pragmatists","Conservatives","Dealseekers")
PropCustTypes <- c(.10, .45, .30, .15) # Probability of being in each group.
set.seed(1) # Set seed to make reproducible
Customers <- data.frame(ID=(1:10000),
CustomerType = sample(CustomerTypes, size=10000,
replace=TRUE, prob=PropCustTypes),
NumBought = rnorm(10000,3,2) # Number of Transactions to Generate, open to alternative solutions?
)
Customers[Customers$NumBought<0,]$NumBought <- 0 # Floor NumBought at 0
Next, generate a table of products to choose from:
Products <- data.frame(
ID=(1:50),
DateReleased = rep(as.Date("2012-12-12"),50)+rnorm(50,0,8000),
SuggestedPrice = rnorm(50, 50, 30))
Products[Products$SuggestedPrice<10,]$SuggestedPrice <- 10 # Cap ProductPrice at 10$
Products[Products$DateReleased<as.Date("2013-04-10"),]$DateReleased <- as.Date("2013-04-10") # Cap Releasedate to 1 year ago
Now I would like to generate n transactions (number is in customer table above), based on the following parameters for each variable that is currently relevant).
Parameters <- data.frame(
CustomerType= c("EarlyAdopter", "Pragmatists", "Conservatives", "Dealseeker"),
BySearchEngine = c(0.10, .40, 0.50, 0.6), # Probability of coming through channel X
ByDirectCustomer = c(0.60, .30, 0.15, 0.05),
ByPartnerBlog = c(0.30, .30, 0.35, 0.35),
Timeliness = c(1,6,12,12), # Average # of months between purchase & releasedate.
Discount = c(0,0,0.05,0.10), # Average Discount incurred when purchasing.
stringsAsFactors=FALSE)
Parameters
CustomerType BySearchEngine ByDirectCustomer ByPartnerBlog Timeliness Discount
1 EarlyAdopter 0.1 0.60 0.30 1 0.00
2 Pragmatists 0.4 0.30 0.30 6 0.00
3 Conservatives 0.5 0.15 0.35 12 0.05
4 Dealseeker 0.6 0.05 0.35 12 0.10
The idea is that 'EarlyAdopter' customers would have (on average, normally distributed) 10% of transactions labelled 'BySearchEngine', 60% 'ByDirectCustomer' and 30% 'ByPartnerBlog'; these labels are mutually exclusive: a transaction cannot come via both a PartnerBlog and a Search Engine in the final dataset. The options are:
ObtainedBy <- c("SearchEngine","DirectCustomer","PartnerBlog")
Furthermore, I'd like to generate a discount variable that is normally distributed utilizing the above means. For simplicity, standard deviations may be mean/5.
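To illustrate what that could look like for a single customer type (a rough sketch of my own, using the Parameters and ObtainedBy objects defined above; size = 5 is just for display):
p <- Parameters[Parameters$CustomerType == "EarlyAdopter", ]
referredBy <- sample(ObtainedBy, size = 5, replace = TRUE,
                     prob = c(p$BySearchEngine, p$ByDirectCustomer, p$ByPartnerBlog))
discount <- rnorm(5, mean = p$Discount, sd = p$Discount / 5)   # sd = mean/5 as suggested above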
Next, my most tricky part, I'd like to generate these transactions according to a few rules:
Somewhat evenly distributed over days, maybe slightly more during the weekend;
Spread out between 2006-2014.
Spreading out the # of transactions of customers over the years;
Customers cannot buy products that haven't been released yet.
Other Parameters:
YearlyMax <- 1 # ? How would I specify this, a growing number would be even nicer?
DailyMax <- 1 # Same question? Likely dependent on YearlyMax
The result for CustomerID 2 would be:
Transactions <- data.frame(
ID = c(1,2),
CustomerID = c(2,2), # The customer that bought the item.
ProductID = c(51,100), # Products chosen to approach customer type's Timeliness average
DateOfPurchase = c("2013-01-02", "2012-12-03"), # Date chosen to mimic timeliness average
ReferredBy = c("DirectCustomer", "SearchEngine"), # See above, follows proportions previously identified.
GrossPrice = c(50,52.99), # based on Product Price, no real restrictions other than using it for my financial dashboard.
Discount = c(0.02, 0.0)) # Chosen to mimic customer type's discount behavior.
Transactions
ID CustomerID ProductID DateOfPurchase ReferredBy GrossPrice Discount
1 1 2 51 2013-01-02 DirectCustomer 50.00 0.02
2 2 2 100 2012-12-03 SearchEngine 52.99 0.00
I'm getting more and more confident in writing R code, but I'm having difficulties writing the code to keep the global parameters (daily distributions of transactions, yearly maximum of # transactions per customer) as well as the various linkages in line:
Timeliness: how quick people purchase after release
ReferredBy: how did this customer arrive to my website?
How much discount has the customer had (to illustrate how sensitive one is to discounts)
This causes me to not know whether I should write a for loop over the customer table, generating transactions per customer, or whether I should take a different route. Any contributions are greatly appreciated. Alternative dummy datasets are welcome as well, even though I'm eager to solve this problem by means of R. I'll keep this post updated as I progress.
My current pseudocode:
Assign customer to customer type with sample()
Generate Customers$NumBought transactions
... Still thinking?
EDIT: Generating the transactions table, now I 'just' need to fill it with the right data:
Tr <- data.frame(
ID = 1:sum(Customers$NumBought),
CustomerID = NA,
DateOfPurchase = NA,
ReferredBy = NA,
GrossPrice=NA,
Discount=NA)
Very roughly, set up a database of days, and the number of visits on each day:
days <- data.frame(day=1:8000, customerRate = XtotalNumberOfVisits/8000)
# you could change the customerRate to reflect promotions, time since launch, ...
days$nVisits <- rpois(8000, days$customerRate)
Then catalogue the visits
visits <- data.frame(id=1:sum(days$nVisits), day=rep(days$day, times=days$nVisits))
visits$customerType <- sample(4, nrow(visits), replace=TRUE, prob=XmyWeights)
visits$nPurchases <- rpois(nrow(visits), XpurchaseRate[visits$customerType])
Any of the variables with X in front of them are parameters of your process. You'd similarly go on to generate a transactions database by parametrising the relative likelihood amongst objects available, according to the other columns you have. Or you can generate a visits database including a key to each product available at that day:
productRelease <- data.frame(id=X, releaseDay=sort(X)) # ie df is sorted by releaseDay
visits <- data.frame(id=1:sum(days$nVisits), day=rep(days$day, times=days$nVisits))
visits$customerType <- sample(4, nrow(visits), replace=TRUE, prob=XmyWeights)
days$productsAvailable <- rep(1:nrow(productRelease), times=diff(c(productRelease$releaseDay, nrow(days)+1)))
visits <- visits[(1:nrow(visits))[days$productsAvailable],]
visits$prodID <- with(visits, ave(id == id, id, FUN = cumsum))
You can then decide on a function that gives you, for each row, a probability of the customer purchasing that item (based on day, customer, product), and then fill in the purchase with visits$didTheyPurchase <- runif(nrow(visits)) < XmyProbability.
Sorry, there are probably typos littered throughout this as I was typing it straight, but hopefully this gives you an idea.
Following Gavin, I solved the issue with the following code:
First instantiate the CustomerTypes:
require(lubridate)
CustomerTypes <- c("EarlyAdopter","Pragmatists","Conservatives","Dealseekers")
PropCustTypes <- c(.10, .45, .30, .15) # Probability for being in each group.
Set the parameters for my customer types
set.seed(1) # Set seed to make reproducible
Parameters <- data.frame(
CustomerType= c("EarlyAdopter", "Pragmatists", "Conservatives", "Dealseeker"),
BySearchEngine = c(0.10, .40, 0.50, 0.6), # Probability of choosing channel X
ByDirectCustomer = c(0.60, .30, 0.15, 0.05),
ByPartnerBlog = c(0.30, .30, 0.35, 0.35),
Timeliness = c(1,6,12,12), # Average # of months between purchase & releasedate.
Discount = c(0,0,0.05,0.10), # Average Discount incurred when purchasing.
stringsAsFactors=FALSE)
Describe the number of visitors
TotalVisits <- 20000
NumDays <- 100
StartDate <- as.Date("2009-01-04")
NumProducts <- 100
StartProductRelease <- as.Date("2007-01-04") # As products will be selected based on this, make sure
# we include a few years prior as people will buy products older than 2 years?
AnnualGrowth <- 0.15
Now, as suggested, build a dataset of days. I added DaysSinceStart to use it in growing the business over time.
days <- data.frame(
day = StartDate+1:NumDays,
DaysSinceStart = StartDate+1:NumDays - StartDate,
CustomerRate = TotalVisits/NumDays)
days$nPurchases <- rpois(NumDays, days$CustomerRate)
days$nPurchases[as.POSIXlt(days$day)$wday %in% c(0,6)] <- # Increase sales in weekends
as.integer(days$nPurchases[as.POSIXlt(days$day)$wday %in% c(0,6)]*1.5)
Now build transactions from these days.
Transactions <- data.frame(
ID = 1:sum(days$nPurchases),
Date = rep(days$day, times=days$nPurchases),
CustomerType = sample(CustomerTypes, sum(days$nPurchases), replace=TRUE, prob=PropCustTypes),
NewCustomer = sample(c(0,1), sum(days$nPurchases),replace=TRUE, prob=c(.8,.2)),
CustomerID = NA,
ProductID = NA,
ReferredBy = NA)
Transactions$CustomerType <- as.character(Transactions$CustomerType)
Transactions <- merge(Transactions,Parameters, by="CustomerType") # Append probabilities to table for use in 'sample', haven't found a better way to vlookup?
Initiate some customers we can choose from when not new.
Customers <- data.frame(ID=(1:100),
CustomerType = sample(CustomerTypes, size=100,
replace=TRUE, prob=PropCustTypes)
); Customers$CustomerType <- as.character(Customers$CustomerType)
# Now make a new customer if transaction is with new customer, otherwise choose one with the right type.
Make up a bunch of products to choose from, with evenly divided release dates
ReleaseRange <- StartProductRelease + c(1:(StartDate+NumDays-StartProductRelease))
Upper <- max(ReleaseRange)
Lower <- min(ReleaseRange)
Products <- data.frame(
ID = 1:NumProducts,
DateReleased = as.Date(StartProductRelease+c(seq(as.numeric(Upper-Lower)/NumProducts,
as.numeric(Upper-Lower),
as.numeric(Upper-Lower)/NumProducts))),
SuggestedPrice = rnorm(NumProducts, 50, 30))
Products[Products$SuggestedPrice<10,]$SuggestedPrice <- 10 # Cap ProductPrice at 10$
ReferredByOptions <- c("BySearchEngine", "Direct Customer", "Partner Blog")
Now I loop over the newly created Transactions data.frame, choosing from available products (measured by purchase date minus the average timeliness in months * 30 days, +/- 15 days). I also assign new customers a new CustomerID and choose from existing customers if the customer is not new. Other fields are determined by the parameters above.
Start.time <- Sys.time()
for (i in 1:length(Transactions$ID)){
if (Transactions[i,]$NewCustomer==1){
NewCustomerID <- max(Customers$ID, na.rm=T)+1
Customers[NewCustomerID,]$ID = NewCustomerID
Transactions[i,]$CustomerID <- NewCustomerID
Customers[NewCustomerID,]$CustomerType <- Transactions[i,]$CustomerType
}
if (Transactions[i,]$NewCustomer==0){
Transactions[i,]$CustomerID <- sample(Customers[Customers$CustomerType==Transactions[i,]$CustomerType,]$ID,
1,replace=FALSE)
}
Transactions[i,]$Discount <- rnorm(1,Transactions[i,]$Discount,Transactions[i,]$Discount/20)
Transactions[i,]$Timeliness <- rnorm(1,Transactions[i,]$Timeliness, Transactions[i,]$Timeliness/6)
Transactions[i,]$ReferredBy <- sample(ReferredByOptions,1,replace=FALSE,
prob=Transactions[i,c("BySearchEngine", "ByDirectCustomer", "ByPartnerBlog")])
CenteredAround <- as.Date(Transactions[i,]$Date - Transactions[i,]$Timeliness*30)
ProductReleaseRange <- as.Date(CenteredAround+c(-15:15))
Transactions[i,]$ProductID <- sample(Products[as.character(Products$DateReleased) %in% as.character(ProductReleaseRange),]$ID,1,replace=FALSE)
}
Elapsed <- Sys.time()-Start.time
length(Transactions$ID)
And it's done! Unfortunately it takes ~ 22 minutes on a dataset of 20,000 products sold in 100 days. Not necessarily a problem, but I'm very much interested in potential improvements.
