PDC calculation in R - Removing Loops

I am just starting to learn R, which has been really useful, and I'm trying to use it to calculate Proportion of Days Covered (PDC). This metric measures a person's adherence to their medication. Basically, for a given time period you find all the fills of a drug and use each fill date and days supply to determine which days the person was covered. E.g., if a person gets a 35-day fill on 2/1/2016, they have coverage from 2/1/2016 through 3/6/2016. Easy enough.
This gets tricky when they go back for a refill before the first fill runs out: you don't double-count days (e.g., if the person gets their second fill on 3/1/2016, then 3/1-3/6 are only counted once).
I've actually written some code that seems to be working properly, but it's using for loops, which I've come to learn don't perform well in R, and I'm worried about what will happen when I start throwing a lot of data at it.
Here is the first part of the code that builds the test data and initializes some variables:
#Create test data vectors
Person <- c(rep("Person1",12),rep("Person2",9))
FillDate <- c("2016-1-1", "2016-2-1", "2016-3-1", "2016-4-1", "2016-5-1", "2016-6-1", "2016-7-1", "2016-8-1", "2016-9-1", "2016-10-1", "2016-11-1", "2016-12-1", "2016-2-1", "2016-3-1", "2016-4-20", "2016-5-1", "2016-6-1", "2016-7-1", "2016-8-1", "2016-9-1", "2016-10-1")
DaysSupply <- c(rep("35", 14), "20", "5", "20", rep("35", 4))
#Build into data.frame
PDCTestData <- cbind.data.frame(as.factor(Person),as.Date(FillDate,"%Y-%m-%d"),as.numeric(DaysSupply))
colnames(PDCTestData) <- c("Person","FillDate","DaysSupply")
#Create start and end dates for overall period
StartDate <- as.Date("2016-01-01")
EndDate <- as.Date("2016-12-31")
#Initialize DaysCoveredList, a vector to hold the list of dates that a person has drug coverage
DaysCoveredList <- NULL
#Initialize DaysCoveredTable, a matrix to count the total number of unique days in the DaysCovered List, by person
DaysCoveredTable <- NULL
and the second part that does the actual work:
#Begin looping through individuals
for(p in levels(PDCTestData$Person)){
#Begin looping through drug fills
for(DrugSpan in 1:nrow(PDCTestData[PDCTestData$Person == p,])){
#Create a sequence of the dates covered by that fill, the sequence starts on the fill date and runs for the number of days in Days Supply, Builds a list of all days covered for that person
DaysCoveredList <- c(DaysCoveredList,seq.Date(from = PDCTestData[PDCTestData$Person == p,][DrugSpan,]$FillDate, length.out = PDCTestData[PDCTestData$Person == p,][DrugSpan,]$DaysSupply, by = "day"))
} #Exit drug fill loop
#Counts the number of unique days covered from the DaysCoveredList, within the start and end of the overall period
DaysCovered <- length(unique(DaysCoveredList[DaysCoveredList >= StartDate & DaysCoveredList <= EndDate]))
#Adds the unique count from DaysCovered to the summary DaysCoveredTable
DaysCoveredTable <- rbind(DaysCoveredTable,cbind(p,DaysCovered))
#Clear DaysCovered and DaysCoveredList
DaysCovered <- NULL
DaysCoveredList <- NULL
} #Exit the individual loop
Any help you can offer is appreciated.
Thanks.

library(lubridate)
ptd <- PDCTestData # I get bored writing long variable names
ptd$EndDate <- ptd$FillDate + ptd$DaysSupply - 1 # -1 so the inclusive interval spans exactly DaysSupply days (%within% includes both endpoints)
ptd$DrugInterval <- interval(ptd$FillDate, ptd$EndDate)
all_days <- as.Date(StartDate:EndDate, origin = "1970-01-01")
lapply(unique(ptd$Person), function (y) sum(sapply(all_days, function (x) any(x %within% ptd$DrugInterval[ptd$Person==y]))))
No guarantees about speed, but maybe easier to read.
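If speed does become an issue with more data, one rough, unbenchmarked sketch in base R (using the same PDCTestData, StartDate and EndDate as above) is to expand each fill into the dates it covers and then count unique in-window days per person, rather than testing every calendar day against every interval:
# expand each fill into the dates it covers (one Date vector per fill)
covered <- Map(function(start, len) seq(start, by = "day", length.out = len),
PDCTestData$FillDate, PDCTestData$DaysSupply)
# group the per-fill date vectors by person and count unique covered days in the period
DaysCoveredTable <- sapply(split(covered, PDCTestData$Person), function(x) {
d <- unique(do.call(c, x)) # overlapping fills counted once
sum(d >= StartDate & d <= EndDate)
})
DaysCoveredTable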

Related

How to loop over rows sequentially while picking a specific column in each row

I am a relatively novice R user and am trying to recreate a 'Dogs of the DOW' strategy with the 6 biggest Canadian banks whereby you buy the poorest performing stock/bank from the previous year in the current year. I would like to go through each row and select the column with the preceding year's worst performer. Any suggestions or tips are greatly appreciated!
I have tried writing several versions of for loops but continue to get odd outputs. With the code below I get a vector where every element is the same number.
In the code, I have a data frame (BtBB) whose rows are the annual returns of the bank stocks from 2012 through 2018 and whose columns are the 6 banks. BtBB_min is a vector of 6 entries giving the column that holds the previous year's minimum return (the first value points to column 4, 2012's worst performer; the second value is column 2, 2013's worst performer; and so on). BtBB_ret is meant to be the output vector of returns.
#Entering data
BtBB <- data.frame(
Date = as.Date(c("2012-12-31", "2013-12-31", "2014-12-31", "2015-12-31", "2016-12-31", "2017-12-31", "2018-12-31"), format = "%Y-%m-%d"),
CIBC = c(0.08375119, 0.13442541, 0.10052910, -0.08663862, 0.20144753, 0.11847390, -0.17023013),
RBC = c(0.151981531, 0.192551770, 0.123652150, -0.075897308, 0.225488874, 0.129635743, -0.089722358),
National = c(0.07069587, 0.14422579, 0.11880516, -0.18466828, 0.35276606, 0.15019255, -0.10634566),
BMO = c(0.08911954, 0.16348998, 0.16057054, -0.04989048, 0.23680840, 0.04162783, -0.11333135),
TD = c(0.097771953, 0.195319962, 0.108869357, -0.022878761, 0.220870206, 0.112201752, -0.078615071),
BNS = c(0.130434783, 0.156108597, -0.001806413, -0.155934248, 0.335715562, 0.085072231, -0.161119329))
BtBB_min <- apply(BtBB[-1], 1, which.min) # Finding Minimums
#Adding scalar to min vector so column numbers match properly with BtBB dataframe
BtBB_min <- BtBB_min + 1
#Removing last entry since only minimums from prior years matter, not current years
BtBB_min <- BtBB_min[-length(BtBB_min)]
#Removing first row from data frame since we want to reference current years
BtBB <- BtBB[-1,]
#Creating output vector for for loop
BtBB_ret <- vector("double", length = length(BtBB_min))
#Nested For loop where I'm having issue generating a proper output
for (h in seq_along(BtBB_ret)) {
for (i in nrow(BtBB)) {
for (j in seq_along(BtBB_min)) {
BtBB_ret[h] <- BtBB[i,BtBB_min[j]]
}
}
}
Expect to get a vector of returns as:
.1442258, .10052910, -0.155934248, 0.3527661, 0.11847390, -0.11333135
Actually get BMO's return 6 times (-0.11333135). Can't figure out why. Have worked on this problem for like a week and can't seem to crack it :(
You are using unnecessary loops and overwriting the BtBB_ret values over and over again. One loop should suffice:
#Nested For loop where I'm having issue generating a proper output
for (i in 1:nrow(BtBB)) {
BtBB_ret[i] <- BtBB[i,BtBB_min[i]]
}
BtBB_ret
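If you would rather drop the loop entirely, matrix indexing can pick out one (row, column) pair per year in a single call. A small sketch, assuming BtBB and BtBB_min exactly as built above (the Date column is dropped first so the matrix stays numeric, and the column index is shifted back by one accordingly):
ret_mat <- as.matrix(BtBB[-1]) # returns only, no Date column
BtBB_ret <- ret_mat[cbind(seq_len(nrow(ret_mat)), BtBB_min - 1)] # one (row, column) pair per year
BtBB_ret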

Breaking a continuous variable into categories using dplyr and/or cut

I have a dataset that is a record of price changes, among other variables. I would like to mutate the price column into a categorical variable. I understand that the relevant tools here in R seem to be dplyr's mutate and/or cut.
> head(btc_data)
time btc_price
1 2017-08-27 22:50:00 4,389.6113
2 2017-08-27 22:51:00 4,389.0850
3 2017-08-27 22:52:00 4,388.8625
4 2017-08-27 22:53:00 4,389.7888
5 2017-08-27 22:56:00 4,389.9138
6 2017-08-27 22:57:00 4,390.1663
>dput(btc_data)
("4,972.0700", "4,972.1763", "4,972.6563", "4,972.9188", "4,972.9763",
"4,973.1575", "4,974.9038", "4,975.0913", "4,975.1738", "4,975.9325",
"4,976.0725", "4,976.1275", "4,976.1825", "4,976.1888", "4,979.0025",
"4,979.4800", "4,982.7375", "4,983.1813", "4,985.3438", "4,989.2075",
"4,989.7888", "4,990.1850", "4,991.4500", "4,991.6600", "4,992.5738",
"4,992.6900", "4,992.8025", "4,993.8388", "4,994.7013", "4,995.0788",
"4,995.8800", "4,996.3338", "4,996.4188", "4,996.6725", "4,996.7038",
"4,997.1538", "4,997.7375", "4,997.7750", "5,003.5150", "5,003.6288",
"5,003.9188", "5,004.2113", "5,005.1413", "5,005.2588", "5,007.2788",
"5,007.3125", "5,007.6788", "5,008.8600", "5,009.3975", "5,009.7175",
"5,010.8500", "5,011.4138", "5,011.9838", "5,013.1250", "5,013.4350",
"5,013.9075"), class = "factor")), .Names = c("time", "btc_price"
), class = "data.frame", row.names = c(NA, -10023L))
The difficulty is in the categories I want to create. The categories -1,0,1 should be based upon the % change over the previous time-lag.
So for example, a 20% increase in price over the past 60 minutes would be labeled 1, otherwise 0. A 20% decrease in price over the past 60 minutes should be -1, otherwise 0.
Is this possible in R? What is the most efficient way to implement the change?
There is a similar question here and also here, but these do not answer my question for two reasons:
a) I am trying to calculate % change, not simply the difference
between 2 rows.
b) This calculation should be based on the max/min values for the rolling past time frame (i.e., a 20% decrease in the past hour = -1, a 20% increase in the past hour = 1).
Here's an easy way to do this without having to rely on the data.table package. If you want this for only 60 minute intervals, you would first need to filter btc_data for the relevant 60 minute intervals.
# make sure time is a date that can be sorted properly
btc_data$time = as.POSIXct(btc_data$time)
# sort data frame
btc_data = btc_data[order(btc_data$time),]
# calculate percentage change for 1 minute lag
btc_data$perc_change = NA
btc_data$perc_change[2:nrow(btc_data)] = (btc_data$btc_price[2:nrow(btc_data)] - btc_data$btc_price[1:(nrow(btc_data)-1)])/btc_data$btc_price[1:(nrow(btc_data)-1)]
# create category column
# NOTE: first category entry will be NA
btc_data$category = ifelse(btc_data$perc_change > 0.20, 1, ifelse(btc_data$perc_change < -0.20, -1, 0))
Using the data.table package and converting btc_data to a data.table would be a much more efficient and faster way to do this. There is a learning curve to using the package, but there are great vignettes and tutorials for this package.
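For reference, a minimal data.table sketch of a 60-row lag (roughly 60 minutes, since some minutes are missing from the data), assuming btc_price has already been converted to a numeric column:
library(data.table)
btc_dt <- as.data.table(btc_data)
setorder(btc_dt, time)
# percent change relative to the price 60 rows earlier (NA for the first 60 rows)
btc_dt[, perc_change := (btc_price - shift(btc_price, 60)) / shift(btc_price, 60)]
btc_dt[, category := fifelse(perc_change > 0.20, 1L, fifelse(perc_change < -0.20, -1L, 0L))]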
It is always tricky to work with percentages. Be aware that everything here is a choice: the reference can be a difference, a running mean, a max, or something else, and it has to be chosen carefully, as does the value you set in relation to that reference. Together this gives you an almost infinite number of ways to calculate your percentage, and that choice is the key to your question.
# create the data
dat <- c("4,972.0700", "4,972.1763", "4,972.6563", "4,972.9188", "4,972.9763",
"4,973.1575", "4,974.9038", "4,975.0913", "4,975.1738", "4,975.9325",
"4,976.0725", "4,976.1275", "4,976.1825", "4,976.1888", "4,979.0025",
"4,979.4800", "4,982.7375", "4,983.1813", "4,985.3438", "4,989.2075",
"4,989.7888", "4,990.1850", "4,991.4500", "4,991.6600", "4,992.5738",
"4,992.6900", "4,992.8025", "4,993.8388", "4,994.7013", "4,995.0788",
"4,995.8800", "4,996.3338", "4,996.4188", "4,996.6725", "4,996.7038",
"4,997.1538", "4,997.7375", "4,997.7750", "5,003.5150", "5,003.6288",
"5,003.9188", "5,004.2113", "5,005.1413", "5,005.2588", "5,007.2788",
"5,007.3125", "5,007.6788", "5,008.8600", "5,009.3975", "5,009.7175",
"5,010.8500", "5,011.4138", "5,011.9838", "5,013.1250", "5,013.4350",
"5,013.9075")
dat <- as.numeric(gsub(",","",dat))
# calculate the difference to the last minute
dd <- diff(dat)
# calculate the running ratio to difference of the last minutes
interval = 20
out <- NULL
for(z in interval:length(dd)){
out <- c(out, (dd[z] / mean(dd[(z-interval):z])))
}
# calculate the running ratio to price of the last minutes
out2 <- NULL
for(z in interval:length(dd)){
out2 <- c(out2, (dat[z] / mean(dat[(z-interval):z])))
}
# build categories for difference-ratio
catego <- as.vector(cut(out, breaks=c(-Inf,0.8,1.2,Inf), labels=c(-1,0,1)))
catego <- c(rep(NA,interval+1), as.numeric(catego))
# plot
plot(dat, type="b", main="price original")
plot(dd, main="absolute difference to last minute", type="b")
plot(out, main=paste('difference to last minute, relative to "mean" of the last', interval, 'min'), type="b")
abline(h=c(0.8, 1.2), col="magenta")
plot(catego, main=paste("categories for", interval))
plot(out2, main=paste('price last minute, relative to "mean" of the last', interval, 'min'), type="b")
I think what you are looking for is how to calculate the last plot (price last minute, relative to the "mean" of the last interval). In this example the values vary between about 1.0010 and 1.0025, far away from the 0.8 and 1.2 you expect. You can make the ratio larger by choosing a bigger time interval than 20 minutes; maybe a week (11340 minutes) would work, but even with such a large window it will be hard to get a value above 1.2. The problem is the high price level of around 5000: a change of 10 is very small in relative terms.
You also have to take into account that you supplied a continuously rising price series, so it is impossible to get a value under 1.
In this calculation I used mean() as the running reference over the last minutes. I am not certain, but I suspect that on stock markets both min() and max() are used as references over different time intervals: min() as the reference when the price is rising and max() when it is falling. All of this is possible in R.
I can't completely reproduce your example, but if I had to guess you would want to do something like this:
btc_data$btc_price <- as.numeric(gsub(",", "", as.character(btc_data$btc_price)))
pct_change <- rep(NA_real_, nrow(btc_data))
for (i in 61:nrow(btc_data)){
pct_change[i] <- (btc_data$btc_price[i] - btc_data$btc_price[i - 60]) /
btc_data$btc_price[i - 60]
}
pct_change <- pct_change[61:length(pct_change)]
new_category <- cut(pct_change, breaks = c(min(pct_change), -.2, .2,
max(pct_change)), labels = c(-1,0,1), include.lowest = TRUE)
btc_data.new <- btc_data[61:nrow(btc_data),]
btc_data.new <- data.frame(btc_data.new, new_category)
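The same thing can be done without the explicit loop by offsetting the price vector against itself; a rough, loop-free sketch of the code above (with open-ended breaks so the extreme values are not dropped):
price <- as.numeric(gsub(",", "", as.character(btc_data$btc_price)))
n <- length(price)
# percent change versus the price 60 observations earlier
pct_change <- (price[61:n] - price[1:(n - 60)]) / price[1:(n - 60)]
new_category <- cut(pct_change, breaks = c(-Inf, -0.2, 0.2, Inf), labels = c(-1, 0, 1))
btc_data.new <- data.frame(btc_data[61:n, ], new_category)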

Logical in my R function always returns "TRUE"

I'm trying to write an R function that calculates whether a data subject is eligible for subsidies based on their income (X_INCOMG), the size of their household (household calculated from CHILDREN and NUMADULT), and the federal poverty limit for their household size (fpl_matrix). I use a number of if statements to evaluate whether the record is eligible, but for some reason my code is labeling everyone as eligible, even though I know that's not true. Could someone else take a look at my code?
Note that the coding for the variable X_INCOMG denotes income categories (less than $15000, 25-35000, etc).
#Create a sample data set
sampdf=data.frame(NUMADULT=sample(3,1000,replace=T),CHILDREN=sample(0:5,1000,replace=T),X_INCOMG=sample(5,1000,replace=T))
#Introducing some "impurities" into the data so it's more realistic
sampdf[sample(1000,3),'CHILDREN']=13
sampdf[sample(1000,3),'CHILDREN']=NA
sampdf[sample(1000,3),'X_INCOMG']=9
#this is just a matrix of the federal poverty limit, which is based on household size
fpl_2004=matrix(c(
1,9310,
2,12490,
3,15670,
4,18850,
5,22030,
6,25210,
7,28390,
8,31570,
9,34750,
10,37930,
11,41110),byrow=T,ncol=2)
##################here is the function I'm trying to create
fpl250=function(data,fpl_matrix,add_limit){ #add_limit is the money you add on for every extra person beyond a household size of 11
data[which(is.na(data$CHILDREN)),'CHILDREN']=99 #This code wasn't liking NAs so I'm coding NA as 99
data$household=data$CHILDREN+data$NUMADULT #calculate household size
for(i in seq(nrow(data))){
if(data$household[i]<=11){data$bcccp_cutoff[i]=2.5*fpl_matrix[data$household[i],2]} #this calculates what the subsidy cutoff should be, which is 250% of the FPL
else{data$bcccp_cutoff[i]=2.5*((data$household[i]-11)*add_limit+fpl_matrix[11,2])}}
data$incom_elig='yes' #setting the default value as 'yes', then changing each record to 'no' if the income is definitely more than the eligibility cutoff
for(i in seq(nrow(data))){
if(data$X_INCOMG[i]=='1' | data$X_INCOMG[i]=='9'){data$incom_elig='yes'} #This is the lowest income category and almost all of these people will qualify
if(data$X_INCOMG[i]=='2' & data$bcccp_cutoff[i]<15000){data$incom_elig[i]='no'}
if(data$X_INCOMG[i]=='3' & data$bcccp_cutoff[i]<25000){data$incom_elig[i]='no'}
if(data$X_INCOMG[i]=='4' & data$bcccp_cutoff[i]<35000){data$incom_elig[i]='no'}
if(data$X_INCOMG[i]=='5' & data$bcccp_cutoff[i]<50000){data$incom_elig[i]='no'}
if(data$household[i]>90){data$incom_elig[i]='no'}
}
return(data)
}
dd=fpl250(sampdf,fpl_2004,3180)
with(dd,table(incom_elig)) #it's coding all except one as eligible
I know this is a lot of code to digest, but I appreciate whatever help you have to offer!
I find it easier to get the logic working well outside of a function first, then wrap it in a function once it is all working well. My code below does this.
I think one issue was you had the literal comparisons to X_INCOMG as strings (data$X_INCOMG[i]=='1'). That field is a numeric in your sample code, so remove the quotes. Try using a coded factor for X_INCOMG as well. This will make your code easier to manage later.
There is no need to loop over each row in the data frame.
#put the poverty level data in a data frame for merging
fpl_2004.df<- as.data.frame(fpl_2004)
names(fpl_2004.df)<-c("household","pov.limit")
#Include cutoffs
fpl_2004.df$cutoff = 2.5 * fpl_2004.df$pov.limit
add_limit=3181
#compute household size (if NA's this will skip them)
sampdf$household = numeric(nrow(sampdf))
cc<-which(complete.cases(sampdf))
sampdf$household[cc] = sampdf$NUMADULT[cc] + sampdf$CHILDREN[cc]
#get max household and fill fpl_2004 frame
max.hh<-max(sampdf$household,na.rm=TRUE)
#get the 11 person poverty limit
fpl11=subset(fpl_2004.df,household==11)$pov.limit
#rows to fill out the data frame
append<-data.frame(household=12:max.hh,pov.limit=numeric(max.hh-12+1),
cutoff=2.5 *(((12:max.hh)-11)*add_limit+fpl11))
fpl_2004.df<- rbind(fpl_2004.df,append)
#merge the two data frames
sampdf<- merge(sampdf,fpl_2004.df, by="household",all.x=TRUE)
#Add a logical variable to hold the eligibility
sampdf$elig <- logical(nrow(sampdf))
#compute eligibility
sampdf[!is.na(sampdf$X_INCOMG) & sampdf$X_INCOMG == 1,"elig"] = TRUE
sampdf[!is.na(sampdf$X_INCOMG) & sampdf$X_INCOMG == 9,"elig"] = TRUE
#for clarity define variable of what to subset
lvl2 <-!is.na(sampdf$X_INCOMG) & sampdf$X_INCOMG == 2
lvl2 <- lvl2 & !is.na(sampdf$cutoff) & sampdf$cutoff>=15000
#set the eligibility (note the initial value was false thus cutoff logic reversed)
sampdf[lvl2,"elig"] = TRUE
#continue computing these
lvl3 <-!is.na(sampdf$X_INCOMG) & sampdf$X_INCOMG == 3
lvl3 <- lvl3 & !is.na(sampdf$cutoff) & sampdf$cutoff>=25000
sampdf[lvl3,"elig"] = TRUE
Alternately, you could load a small data frame with the cutoff comparison values (15000; 25000; 35000 etc.) and the X_INCOMG, then merge by X_INCOMG, as I did with the household size, and set all the values in one go as below. You may need to use complete.cases again.
sampdf$elig = sampdf$cutoff >= sampdf$comparison.value
You will then have elig == FALSE for any incomplete cases, which will need further investigation.
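A rough sketch of that merge-based alternative (the threshold values and the treatment of categories 1 and 9 are assumptions lifted from the original if statements):
# hypothetical lookup table of comparison values per income category
income_limits <- data.frame(X_INCOMG = c(1, 2, 3, 4, 5, 9),
comparison.value = c(0, 15000, 25000, 35000, 50000, 0))
sampdf <- merge(sampdf, income_limits, by = "X_INCOMG", all.x = TRUE)
cc <- complete.cases(sampdf[, c("cutoff", "comparison.value")])
sampdf$elig <- FALSE
sampdf$elig[cc] <- sampdf$cutoff[cc] >= sampdf$comparison.value[cc]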

Performance issues in for loop (moving towards vectorization with multiple sample()'s)

I am currently having performance issues with one of my scripts. I wrote the script as a result of this question, but I have been unable to speed it up and figured that improving its performance is a separate question from writing the code in the first place.
I wrote the code to generate a dummy webshop dataset with a hidden pattern that can be found with clustering, as an example for one of my courses. It does not, however, let me go beyond roughly 40,000 transactions with a reasonable runtime (i.e., a few hours).
The issue is as follows: using these parameters I build a transaction/customer/product table:
set.seed(1) # Set seed to make reproducible
Parameters <- data.frame(
CustomerType = c("EarlyAdopter", "Pragmatists", "Conservatives", "Dealseeker"),
PropCustTypes = c(.10, .45, .30, .15), # Probability for being in each group.
BySearchEngine = c(0.10, .40, 0.50, 0.6), # Probability for each group
ByDirectCustomer = c(0.60, .30, 0.15, 0.05), # of coming through channel X
ByPartnerBlog = c(0.30, .30, 0.35, 0.35), #
Timeliness = c(1,6,12,12), # Average # of months between purchase & releasedate.
Discount = c(0,0,0.05,0.10), # Average Discount incurred when purchasing.
stringsAsFactors=FALSE)
# Some other parameters for later use.
NumDays = 1000
NumTransactions = 100000 # Note that more than these will be made, it's a starting point (excluding annual growth, weekend increases etc.)
SalesMultiplierWeekends = 1.5 # For example, I want more in weekends
StartDate <- as.Date("2009-01-04")
NumProducts <- 150
AnnualGrowth <- .1 # I also want an annual growth trend
I start with a 'Days' dataframe along with an almost equal division of total transactions over all days.
days <- data.frame( # Define the days
day = StartDate+1:NumDays,
DaysSinceStart = StartDate+1:NumDays - StartDate, # Used to compute a rising trend
CustomerRate = NumTransactions/NumDays)
days$nPurchases <- rpois(NumDays, days$CustomerRate)
days$nPurchases[as.POSIXlt(days$day)$wday %in% c(0,6)] <- # Increase sales in weekends
as.integer(days$nPurchases[as.POSIXlt(days$day)$wday %in% c(0,6)]*SalesMultiplierWeekends)
days$nPurchases <- as.integer(days$nPurchases+days$nPurchases * (days$DaysSinceStart/365)*AnnualGrowth)
Next I generate the transactions using this table:
Transactions <- data.frame(
ID = 1:sum(days$nPurchases),
Date = rep(days$day, times=days$nPurchases),
CustomerType = sample(Parameters$CustomerType, sum(days$nPurchases), replace=TRUE, prob=Parameters$PropCustTypes),
NewCustomer = sample(c(0,1), sum(days$nPurchases),replace=TRUE, prob=c(.8,.2)),
CustomerID = NA, # Will be assigned later, NewCustomer: 0.8 and .2
ProductID = NA, # insinuate new/existing customers above
ReferredBy = NA)
Transactions$CustomerType <- as.character(Transactions$CustomerType)
Now I'd like to dynamically assign products and customers to each transaction in order to make my pattern recognizable in the transaction dataset. I first make a product table from which I can choose, having convenient release dates so that I will be able to select a product for each transaction based on this info.
StartProductRelease <- StartDate-(365*2*max(Parameters$Timeliness)/12)
ReleaseRange <- StartProductRelease + c(1:(StartDate+NumDays-StartProductRelease))
Upper <- max(ReleaseRange)
Lower <- min(ReleaseRange)
Products <- data.frame(
ID = 1:NumProducts,
DateReleased = as.Date(StartProductRelease+c(seq(as.numeric(Upper-Lower)/NumProducts,
as.numeric(Upper-Lower),
as.numeric(Upper-Lower)/NumProducts))),
SuggestedPrice = rnorm(NumProducts, 100, 50))
Products[Products$SuggestedPrice<10,]$SuggestedPrice <- 15 # Floor product prices: anything under $10 becomes $15
Next I build a table of customers, deriving from the number of 'new customers' in the transaction dataset.
Customers <- data.frame(
ID=(1:sum(Transactions$NewCustomer)),
CustomerType = sample(Parameters$CustomerType, size=sum(Transactions$NewCustomer),
replace=TRUE, prob=Parameters$PropCustTypes)
); Customers$CustomerType <- as.character(Customers$CustomerType)
I want to dynamically assign Customers and Products to each transaction, sampled from the 'Products' and 'Customers' dataframe in order to maintain the overall parameters I have defined above. I'd like to vectorize this, but I have no idea on how I would do so (I've already excluded as much as I could from the for loop). The part outside of the for loop:
ReferredByOptions <- c("BySearchEngine", "Direct Customer", "Partner Blog")
Transactions <- merge(Transactions,Parameters, by="CustomerType") # Parameters are now
Transactions$Discount <- rnorm(length(Transactions$ID), # assigned to each transaction
Transactions$Discount,Transactions$Discount/20)
Transactions$Timeliness <- rnorm(length(Transactions$ID),
Transactions$Timeliness, Transactions$Timeliness/6)
Now the performance issues start to arise, the for loop:
for (i in 1:nrow(Transactions)){
# Only sample customers which share the same 'CustomerType' as the transaction
Transactions[i,]$CustomerID <- sample(Customers[Customers$CustomerType==Transactions[i,]$CustomerType,]$ID,
1,replace=FALSE)
# Sample the 'ReferredBy' based upon the proportions described in 'Parameters'
Transactions[i,]$ReferredBy <- sample(ReferredByOptions,1,replace=FALSE,
prob=Transactions[i,c("BySearchEngine", "ByDirectCustomer", "ByPartnerBlog")])
# Only sample products in the required range to maintain the 'timeliness' parameter.
CenteredAround <- as.Date(Transactions[i,]$Date - Transactions[i,]$Timeliness*30)
ProductReleaseRange <- as.Date(CenteredAround+c(-15:15))
Transactions[i,]$ProductID <- sample(Products[as.character(Products$DateReleased) %in% as.character(ProductReleaseRange),]$ID,1,replace=FALSE)
}
This brings me to my final question: how would I vectorize this last part? I've been able to munge millions of rows with data.table in seconds, so it seems odd that such a relatively simple task runs this slowly.
For loop / filling 100 rows: ~ 18 Seconds
For loop / filling 200 rows: ~ 37 Seconds
For loop / filling 1000 rows: ~ 3 minutes
For loop / filling 300000 rows: No idea, can't get that far?
Why is it running so slow and how can I solve this? Any help is greatly appreciated.
Below is how you would do the first part using data.table, adding CustomerID to the Transactions table. I have changed some names and dropped the placeholder columns as they will be added through the data.table joins.
Tr <- data.table(Transactions)
Tr[, CustomerID:=NULL]
Tr[, ProductID:=NULL]
Tr[, ReferredBy:=NULL] ## see #Arun's comment for a more compact way to do this
Cs <- data.table(Customers)
setnames(Cs, 'ID', 'CustomerID') ## So we avoid duplicate with Tr
## Add customer ID, matching customer types
setkey(Tr, CustomerType)
setkey(Cs, CustomerType)
# Make an index Transaction ID -> Customer ID
# Large interim matrix should not be formed, but I am not sure
TrID2CustID <- Cs[Tr, allow.cartesian=T][, list(CustomerID=sample(CustomerID, 1)), by=ID]
setkey(TrID2CustID, ID)
setkey(Tr, ID)
Tr <- Tr[TrID2CustID]
There is a large matrix that is the cartesian product of your Transactions and Customers tables (about 15M rows) which would exhaust the memory if it is explicitly computed. Judging by the fact that this takes about a second, I'd say it is not computed, but I am not sure.
I will work on the rest and edit the answer if I come up with the solutions quickly, but this ought to show you how to do this using data.table.
UPDATE 1: adding ReferredBy
Since the referral probabilities only vary by CustomerType, you can generate the referrals in blocks with replacement (much faster than by individual ID)
setkey(Tr, CustomerType)
Tr[, ReferredBy:=sample(ReferredByOptions, replace=TRUE, size=.N,
prob=c(BySearchEngine[1],
ByDirectCustomer[1],
ByPartnerBlog[1])),
by=CustomerType]
UPDATE 2: adding ProductID
This is proving trickier to do in a neat cartesian-product sort of way. I cannot think of an elegant way to generate the 31 dates (-15:15) for each purchase (melted matrix would probably be too big). The code below works as intended but is not as fast as the previous 2:
Pr <- data.table(Products)
setnames(Pr, 'ID', 'ProductID') ## not necessary here, but good practice
CenteredAround <- as.Date(Tr$Date - 30*Tr$Timeliness)
setkey(Tr, ID)
Tr[, ProductID:=sample(Pr[abs(Pr$DateReleased -
CenteredAround[.I]) <= 15, ProductID], 1), by=ID]
A very simple optimization is to avoid modifying the data frame in the loop, as others have suggested. At least prior to R3.1, modifying a data frame is really expensive, so that's the last thing you want to be doing in a loop. Also, based on Hadley's comments and release notes for R3.1, it may be the case that modifying data frames is not as expensive with R3.1, but I haven't tested.
Here we get around the data frame modification by storing interim results in vectors, and then only inserting into the data frame after the loop. Consider:
system.time({
custId <- Transactions$CustomerID
refBy <- Transactions$ReferredBy
productID <- Transactions$ProductID
for (i in 1:100){
# Only sample customers which share the same 'CustomerType' as the transaction
custId[i] <- sample(Customers[Customers$CustomerType==Transactions[i,]$CustomerType,]$ID,
1,replace=FALSE)
# Sample the 'ReferredBy' based upon the proportions described in 'Parameters'
refBy[i] <- sample(ReferredByOptions,1,replace=FALSE,
prob=Transactions[i,c("BySearchEngine", "ByDirectCustomer", "ByPartnerBlog")])
# Only sample products in the required range to maintain the 'timeliness' parameter.
CenteredAround <- as.Date(Transactions[i,]$Date - Transactions[i,]$Timeliness*30)
ProductReleaseRange <- as.Date(CenteredAround+c(-15:15))
productID[i] <- sample(Products[as.character(Products$DateReleased) %in% as.character(ProductReleaseRange),]$ID,1,replace=FALSE)
}
Transactions$CustomerID <- custId
Transactions$ReferredBy <- refBy
Transactions$ProductID <- productID
})
Which times in at:
user system elapsed
0.66 0.06 0.71
The corresponding time with your original code is:
user system elapsed
5.01 1.78 6.79
So close to a 10x improvement with a minor change (avoiding modifying the data frame repeatedly).
I'm sure you can get further improvements, but this is a real low hanging fruit you can easily implement.
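One further (untested) tweak in the same spirit: the repeated subset Customers[Customers$CustomerType == ..., ] can be hoisted out of the loop by splitting the customer IDs by type once, so each iteration is just a lookup plus a sample:
cust_by_type <- split(Customers$ID, Customers$CustomerType)
for (i in 1:nrow(Transactions)) {
# each customer type has many IDs here, so sample()'s length-one shortcut is not a risk
custId[i] <- sample(cust_by_type[[Transactions$CustomerType[i]]], 1)
}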

unused arguments error using apply() in R

I get an error message when I attempt to use apply() conditional on a column of dates to return a set of coefficients.
I have a dataset (herein modified for simplicity, but reproducible):
library(data.table)
library(MASS) # rlm() used in the functions below comes from MASS
Adataset <- data.table(Epoch = c("2007-11-15", "2007-11-16", "2007-11-17",
"2007-11-18", "2007-11-19", "2007-11-20", "2007-11-21"),
Distance = c("92336.22", "92336.23", "92336.22", "92336.20",
"92336.19", "92336.21", "92336.18"))
Adataset
Epoch Distance
1: 2007-11-15 92336.22
2: 2007-11-16 92336.23
3: 2007-11-17 92336.22
4: 2007-11-18 92336.20
5: 2007-11-19 92336.19
6: 2007-11-20 92336.21
7: 2007-11-21 92336.18
The analysis begins with establishing start and end dates:
############## Establish dates for analysis
#4.Set date for center of duration
StartDate <- "2007-11-18"
as.numeric(as.Date(StartDate)); StartDate
EndDate <- as.Date(tail(Adataset$Epoch,1)); EndDate
Then I establish time durations for analysis:
#5.Quantify duration of time window
STDuration <- 1
LTDuration <- 3
Then I write functions to regress over both durations and return the slopes:
# Write STS and LTS functions, each with following steps
#6.Define time window - from StartDate less ShortTermDuration to
#  StartDate plus ShortTermDuration
#7.Define Short Term & Long Term datasets
#8. Run regression over dataset
my_STS_Function <- function (StartDate) {
STAhead <- as.Date(StartDate) + STDuration; STAhead
STBehind <- as.Date(StartDate) - STDuration; STBehind
STDataset <- subset(Adataset, as.Date(Epoch) >= STBehind & as.Date(Epoch)<STAhead)
STResults <- rlm( Distance ~ Epoch, data=STDataset); STResults
STSummary <- summary( STResults ); STSummary
# Return coefficient (Slope of regression)
STNum <- STResults$coefficients[2];STNum
}
my_LTS_Function <- function (StartDate) {
LTAhead <- as.Date(StartDate) + LTDuration; LTAhead
LTBehind <- as.Date(StartDate) - LTDuration; LTBehind
LTDataset <- subset(Adataset, as.Date(Epoch) >= LTBehind & as.Date(Epoch)<LTAhead)
LTResults <- rlm( Distance ~ Epoch, data=LTDataset); LTResults
LTSummary <- summary( LTResults ); LTSummary
# Return coefficient (Slope of regression)
LTNum <- LTResults$coefficients[2];LTNum
}
Then I test the function to make sure it works for a single date:
myTestResult <- my_STS_Function("2007-11-18")
It works, so I move on to apply the function over the range of dates in the dataset:
mySTSResult <- apply(Adataset, 1, my_STS_Function, seq(StartDate : EndDate))
...in which my desired result is a list or array or vector of mySTSResult (slopes) (and, subsequently, a separate list/array/vector of myLTSResults so then I can create a STSlope:LTSlope ratio over the duration), something like (mySTSResults fabricated)...
> Adataset
Epoch Distance mySTSResults
1: 2007-11-15 92336.22 3
2: 2007-11-16 92336.23 4
3: 2007-11-17 92336.22 5
4: 2007-11-18 92336.20 6
5: 2007-11-19 92336.19 7
6: 2007-11-20 92336.21 8
7: 2007-11-21 92336.18 9
Only I get this error:
Error in FUN(newX[, i], ...) : unused argument(s) (1:1185)
What is this telling me and how to do correct it? I've done some looking and cannot find the correction.
Hopefully I've explained this sufficiently. Please let me know if you need further details.
Ok, it seems the problem is in the additional arguments to my_STS_Function as stated in your apply function call (as you have defined it with only one parameter). The date range is being passed as an additional parameter to that function, and R is complaining that it is unused (a vector of 1185 elements it seems). Are you rather trying to pull a subset of the rows restricted by date range first, then wishing to apply the my_STS_Function? I'd have to think a bit on an exact solution to that.
Sorry - I did my working out in the comments there. A possible solution is this:
subSet <- Adataset[Adataset[,1] %in% seq(StartDate:EndDate),][order(na.exclude(match(Adataset[,1], seq(StartDate,EndDate)))),]
Adapted from the answer in this question:
R select rows in matrix from another vector (match, %in)
Adding this as a new answer as the previous one was getting confused. A previous commenter was correct, there are bugs in your code, but they aren't a sticking point.
My updated approach was to use seq.Date to generate the date sequence (only works if you have a data point for each day between the start and end - though you could use na.exclude as above):
dates = seq.Date(as.Date(StartDate),as.Date(EndDate),"days")
You then use this as the input to apply, with some munging of types to get things working correctly (I've done this with a lamda function):
mySTSResult <- apply(as.matrix(dates), 1, function(x) {class(x) <- "Date"; my_STS_Function(x)})
Then hopefully you should have a vector of the results, and you should be able to do something similar for LTS, and then manipulate that into another column in your original data frame/matrix.
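As a hypothetical follow-up, assuming every date in the sequence has enough surrounding data for both regressions, the LTS slopes can be produced the same way and both sets merged back into the original table by Epoch:
myLTSResult <- apply(as.matrix(dates), 1, function(x) {class(x) <- "Date"; my_LTS_Function(x)})
slopes <- data.table(Epoch = as.character(dates),
STSlope = mySTSResult,
LTSlope = myLTSResult)
Adataset <- merge(Adataset, slopes, by = "Epoch", all.x = TRUE)
Adataset[, SlopeRatio := STSlope / LTSlope]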
