How can I speed up this R code, in which I use stringdist?

I'm trying to clean up our customer database by identifying customer records that are similar enough to be considered the same customer (and thus given the same customer id). I've concatenated the relevant customer data into one column named customerdata. I found the R package stringdist, and I'm using the following code to calculate the distance between every pair of records:
output <- df$id
for(i in 1:(length(df$customerdata)-1)){
  for(j in (i+1):length(df$customerdata)){
    if(abs(df$customerdataLEN[i] - df$customerdataLEN[j]) < 10){
      if(stringdist(df$customerdata[i], df$customerdata[j]) < 10){
        output[j] <- df$id[i]
      }
    }
  }
}
df$newcustomerid <- output
Here I first initialize a vector named output with the customer id data. Then I loop through all pairs of customers. I have a column called customerdataLEN holding the string length. To reduce calculation time, I first check whether the two lengths differ by 10 or more; if so, I don't bother calculating the stringdist. Otherwise, if the distance between the two customers is < 10, I consider them the same customer and give them the same id.
However, I'm looking to speed up the process. At 2,000 rows this loop takes 2 minutes; at 7,400 rows it takes 32 minutes. I'm looking to run it on around 1,000,000 rows. Does anyone have an idea on how to improve the speed of this loop?
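One direction to try (a sketch, not a drop-in rewrite of the code above): keep the length pre-filter, but replace the inner loop with a single vectorized stringdistmatrix() call per length bucket. The column names (customerdata, id, newcustomerid) are taken from the question, and the bucket width of 10 mirrors the length check.
library(stringdist)

# Sketch: group strings into 10-character length buckets; only strings in the
# same or adjacent bucket can differ in length by less than 10, so every pair
# that passes the original length filter is still compared. The bucket filter
# is conservative (it also admits pairs differing by up to 19 characters), but
# those are rejected by the stringdist check anyway.
df$newcustomerid <- df$id
bucket <- nchar(df$customerdata) %/% 10

for (b in sort(unique(bucket))) {
  idx <- which(bucket == b | bucket == b + 1)
  if (length(idx) < 2) next
  d <- as.matrix(stringdistmatrix(df$customerdata[idx]))  # all pairwise distances in one call
  for (k in seq_len(length(idx) - 1)) {
    later <- (k + 1):length(idx)
    hits  <- idx[later][d[later, k] < 10]                  # same rule as the original inner loop
    df$newcustomerid[hits] <- df$id[idx[k]]
  }
}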

Related

Vectorizing R custom calculation with dynamic day range

I have a big dataset (around 100k rows) with 2 columns referencing a device_id and a date and the rest of the columns being attributes (e.g. device_repaired, device_replaced).
I'm building an ML algorithm to predict when a device will have to be maintained. To do so, I want to calculate certain features (e.g. device_reparations_on_last_3days, device_replacements_on_last_5days).
I have a function that subsets my dataset and returns a calculation:
For the specified device,
That happened before the day in question,
As long as there's enough data (e.g. if I want the last 3 days but only 2 records exist, this returns NA).
Here's a sample of the data and the function outlined above:
data = data.frame(device_id = c(rep(1,5), rep(2,10)),
                  day = c(1:5, 1:10),
                  device_repaired = sample(0:1, 15, replace=TRUE),
                  device_replaced = sample(0:1, 15, replace=TRUE))
# Example: How many times device 1 was repaired over the last 2 days before day 3
# => getCalculation(3,1,data,"device_repaired",2)
getCalculation <- function(fday, fdeviceid, fdata, fattribute, fpreviousdays){
  # Subset dataset
  df = subset(fdata, day < fday & day > (fday - fpreviousdays - 1) & device_id == fdeviceid)
  # Make sure there's enough data; if so, make calculation
  if(nrow(df) < fpreviousdays){
    calculation = NA
  } else {
    calculation = sum(df[, fattribute])
  }
  return(calculation)
}
My problem is that the number of available attributes (e.g. device_repaired) and of features to calculate (e.g. device_reparations_on_last_3days) has grown exponentially, and my script takes around 4 hours to execute, since I need to loop over each row and calculate all these features.
I'd like to vectorize this logic using some apply approach, which would also allow me to parallelize its execution, but I don't know if/how it's possible to pass these extra arguments to an lapply call.
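A sketch of one option, not a definitive fix: mapply() (or parallel::mcmapply()) accepts the per-row arguments as vectors and the fixed arguments through MoreArgs, so the existing getCalculation() can be applied without an explicit loop. The output column name repaired_last_2days is made up for illustration.
# fday and fdeviceid vary per row; the remaining arguments are fixed via MoreArgs.
data$repaired_last_2days <- mapply(
  getCalculation,
  fday      = data$day,
  fdeviceid = data$device_id,
  MoreArgs  = list(fdata = data, fattribute = "device_repaired", fpreviousdays = 2)
)

# Parallel variant (Unix-alikes), same arguments plus mc.cores:
# library(parallel)
# data$repaired_last_2days <- mcmapply(getCalculation,
#     fday = data$day, fdeviceid = data$device_id,
#     MoreArgs = list(fdata = data, fattribute = "device_repaired", fpreviousdays = 2),
#     mc.cores = 4)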

How to quickly check if a given variable has many levels?

I'm writing a script that helps verify the structure and compute basic descriptive statistics of very large data.tables. I need a fast way to tell if a given variable has more than - say - 50 unique levels.
I might do it by
function(DT, colnr)
{
  b <- DT[, list(var = colnames(DT)[[colnr]], .N), by = list(level = DT[[colnr]])]
  if(nrow(b) < 50){
    # fewer than 50 levels, probably we need a dictionary
    setorder(b, level)
    b
  } else {
    # 50 or more levels; just return the number of unique values
    data.table(level = NA, var = colnames(DT)[[colnr]], N = -nrow(b))
  }
}
but DT[, list(var=colnames(DT)[[colnr]],.N), by = list(level=DT[[colnr]])] takes very long for variables with as many unique values as there are cases (I have over 4 million cases in DT).
Is there a way to improve the performance? I don't need to count the number of unique values if there are more than 50 of them; returning "more than 50" would suffice.
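A sketch under the question's assumptions (data.table loaded, threshold of 50): uniqueN() skips building the grouped table, and probing a small prefix of the column first allows an early exit for obviously high-cardinality variables. The function name and the probe size are made up for illustration.
library(data.table)

has_many_levels <- function(DT, colnr, threshold = 50, probe = 10000L) {
  x <- DT[[colnr]]
  # Cheap probe: if the first `probe` values already hold more than `threshold`
  # distinct levels, there is no need to scan all 4M+ rows.
  if (uniqueN(head(x, probe)) > threshold) return(TRUE)
  uniqueN(x) > threshold
}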

Item-based recommender system with R

I'm trying to build a simple item-based recommender system, and I decided to use R due to my lack of programming knowledge.
Still, some issues remain, so I'll try to be as methodical as possible in explaining them.
Log file
I start with a log file imported as a data frame containing many columns, among which: the ID of the customer, the ID of the item, and the transaction date.
Here's an overview of the log file
'data.frame': 539673 obs. of 3 variables:
$ cid: int 1 1 1 1 2 2 3 4 ...
$ iid : int 1002 1345 1346 2435 6421 4356 1002 4212 ...
$ timestamp : int 1330232453 1330233859 13302349437 1330254065 1330436051
I managed to turn this log file into a matrix with the customers in rows, the products in columns, and the timestamp of the transaction (much easier to manipulate than the date) in the cell where a transaction occurs between the two.
So I end up with a matrix of 100000 rows and 3000 columns, which is pretty huge.
Similarity matrix
From that point, I can create my item-based recommender system.
First, I binarize my matrix m in order to be able to compute the similarity:
mbin <- (m > 0) + 0
To compute the similarity, I use the cosine measure by a function:
getCosine <- function(x, y)
{
  cosine <- sum(x*y) / (sqrt(sum(x*x)) * sqrt(sum(y*y)))
  return(cosine)
}
After creating a matrix mbin.sim to receive the different similarity measures, I wrote two loops to compute them:
for(i in 1:ncol(mbin)) {
  for(j in 1:ncol(mbin)) {
    mbin.sim[i,j] <- getCosine(mbin[,i], mbin[,j])
  }
}
This similarity matrix takes too long to compute, which is why I only focus on retrieving the similarities of one particular item.
Note that I've taken an arbitrary index n here; I would also like to be able to pass an item's name instead:
n = 5
for(i in n) {
  for(j in 1:ncol(mbin)) {
    mbin.sim[i,j] <- getCosine(mbin[,i], mbin[,j])
  }
}
How can I achieve that?
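A sketch using only base matrix algebra (no extra packages, in line with the NB at the end): the whole cosine similarity matrix, or a single item's row of it, can be computed without loops via crossprod().
# Column norms of the binarized user-item matrix (items are columns).
col_norms <- sqrt(colSums(mbin^2))

# Full 3000 x 3000 item-item cosine matrix in one shot (fast, but dense):
mbin.sim <- crossprod(mbin) / outer(col_norms, col_norms)

# Similarities of a single item n against all items, without the full matrix:
n <- 5
sim_n <- as.vector(crossprod(mbin, mbin[, n])) / (col_norms * col_norms[n])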
Building and applying the recommender
From this point I'm stuck, because I can't see how to make a simple recommender that takes one item and recommends k users for it.
Testing
Moreover, to test the recommender, I should be able to go back in time and see whether, from a certain date, I can predict the right users.
To do that, I know I have to create a function that gives me the date of the nth transaction. In other words, for a particular column I have to get the nth non-zero element of that column. I tried this, but it only gives me the first one:
firstel <- function(x,n){
  m <- head(x[x!=0],1)
  m[n]
}
How can I do that? And moreover, how can I use this value to discriminate between past and future events with another function?
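A possible fix (a sketch; the function name is made up): head(..., 1) is what limits the result to the first element, so take the nth element of the non-zero subset instead, sorted so that "nth" means nth in time. The commented lines only illustrate splitting past from future events.
nth_transaction_time <- function(x, n){
  nz <- sort(x[x != 0])            # non-zero timestamps, in chronological order
  if(length(nz) < n) NA else nz[n]
}

# Split past from future relative to that cutoff (illustration only):
# cutoff <- nth_transaction_time(x, n)
# past   <- x * (x != 0 & x <= cutoff)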
Sorry for the long post, but I wanted to show that I'm really into this and that I want to get past this step in order to begin the real analysis afterwards.
NB: I'm doing this without specialized packages because of the huge amount of data.

R - Speed up calculation related to a subset of a data.table

I need help speeding up the case below:
I have roughly 8.5 million rows of order history for 1.3M orders. I need to calculate the time it takes between two steps of each order. I use the calculation below:
History[, time_to_next_status :=
            get_time_to_next_step(id_sales_order_item_status_history,
                                  id_sales_order_item, History_subset),
        by = 'id_sales_order_item_status_history']
In the code above:
id_sales_order_item - the id of a sales order item; multiple history records can share the same id_sales_order_item
id_sales_order_item_status_history - the id of a row
History_subset - a subset of History that contains only the 3 columns [id_sales_order_item_status_history, id_sales_order_item, created_at] needed in the calculation
created_at - the time the history record was created
The function get_time_to_next_step is defined as below
get_time_to_next_step <- function(id_sales_order_item_status_history, filter_by,
                                  dataSet){
  dataSet <- dataSet %.% filter(id_sales_order_item == filter_by)
  # locate the current history row within this order item's history
  index <- match(id_sales_order_item_status_history,
                 dataSet$id_sales_order_item_status_history)
  time_to_next_status <- dataSet[index + 1, created_at] - dataSet[index, created_at]
  time_to_next_status
}
The issue is that it takes 15 minutes to run on around 10k records of History, so it would take up to ~9 days to complete the calculation. Is there any way I can speed this up without breaking the data into multiple subsets?
I'll take a shot. Can you try something like this?
History[, Index := 1:.N, by = id_sales_order_item]
History[, time_to_next_status := created_at[Index + 1] - created_at[Index],
        by = id_sales_order_item]
I would think this would be pretty fast.
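An equivalent sketch with data.table::shift(), assuming the rows are ordered by created_at within each order item (the same assumption the snippet above relies on):
setorder(History, id_sales_order_item, created_at)
History[, time_to_next_status := shift(created_at, type = "lead") - created_at,
        by = id_sales_order_item]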

Optimizing Markov chain transition matrix calculations?

As an intermediate R user, I know that for loops can very often be optimized with functions like apply. However, I am not aware of functions that can optimize my current code to generate a Markov chain matrix, which is running quite slowly. Have I maxed out on speed, or are there things I am overlooking? I am trying to find the transition matrix for a Markov chain by counting the number of occurrences of each id in the 24-hour period before a given alert. The vector ids contains all possible ids (about 1700).
The original matrix looks like this, as an example:
>matrix
id time
1 1376084071
1 1376084937
1 1376023439
2 1376084320
2 1372983476
3 1374789234
3 1370234809
And here is my code to try to handle this:
matrixtimesort <- matrix[order(-matrix$time),]
frequency = 86400  # number of seconds in 1 day

# Initialize matrix that will contain probabilities
transprobs <- matrix(data=0, nrow=length(ids), ncol=length(ids))

# Loop through each type of event
for (i in 1:length(ids)){
  localmatrix <- matrix[matrix$id==ids[i],]
  # Loop through each row of the event
  for(j in 1:nrow(localmatrix)) {
    localtime <- localmatrix[j,]$time
    # Find top and bottom row number defining the 1-day window
    indices <- which(matrixtimesort$time < localtime & matrixtimesort$time >= (localtime - frequency))
    # Find IDs that occur within the 1-day window
    positiveids <- unique(matrixtimesort[c(min(indices):max(indices)),]$id)
    # Add one to each cell in the matrix that corresponds to the occurrence of an event
    for (l in 1:length(positiveids)){
      k <- which(ids==positiveids[l])
      transprobs[i,k] <- transprobs[i,k] + 1
    }
  }
  # Divide each row by total number of occurrences to determine probabilities
  transprobs[i,] <- transprobs[i,]/nrow(localmatrix)
}

# Normalize rows so that row sums are equal to 1
normalized <- transprobs/rowSums(transprobs)
Can anyone make any suggestions to optimize this for speed?
Using nested loops seems like a bad idea; your code can be vectorized to speed it up.
For example, why find the top and bottom row numbers at all? You can simply compare the time values with "time_0 + frequency": that comparison is a vectorized operation.
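A sketch of that suggestion (not a drop-in replacement; the outer loop over ids remains): for each alert time, the IDs seen in the preceding 24 hours are found with a single vectorized comparison, and table() replaces the innermost counting loop. The helper name window_ids is made up.
# IDs occurring in the 24h window before time t0 (one vectorized comparison):
window_ids <- function(t0, events, frequency = 86400){
  unique(events$id[events$time < t0 & events$time >= t0 - frequency])
}

for (i in seq_along(ids)){
  localmatrix <- matrix[matrix$id == ids[i], ]
  # one window per alert of this id; each window contributes each id at most once
  seen <- unlist(lapply(localmatrix$time, window_ids, events = matrix))
  counts <- table(factor(seen, levels = ids))
  transprobs[i, ] <- as.numeric(counts) / nrow(localmatrix)
}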
HTH.
