Simple way of showing progress in data.table "by" loops - r

I have a slow computer and some of my R calculations take hours and sometimes days to run. I'm sure they can be made more efficient, but in the meantime I would like to find a simple way to show how far along R is in the needed calculations.
In a loop this can easily be done with print(i). Is something similar available when doing data.table calculations?
For instance, the following code takes about 50 hours to run on my machine
q[, ties := sum(orig[pnum == origpat, inventors] %in% ref[pnum == ref.pat, inventors]), by = idx]
q is a data.table with origpat, ref.pat and idx (an index) as columns. The data.tables orig and ref both contain the columns pnum and inventors. The code simply finds the number of overlapping inventors between the two groups, but given the iterative nature (by = idx) it takes a long time.
I'd like my screen to post progress, e.g. every 1,000 rows (there are about 20 million rows).
Any way to do this simply?

Try
q[, ties := {
    print(.GRP)
    sum(orig[pnum == origpat, inventors] %in% ref[pnum == ref.pat, inventors])
}, by = idx]
This is analogous to print(i) for a by-group operation.
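If printing every group floods the console, a variant of the same idea (a sketch, assuming the column names from the question and a data.table version recent enough to provide .NGRP) prints a status line only every 1,000th group:
q[, ties := {
    if (.GRP %% 1000 == 0) cat("group", .GRP, "of", .NGRP, "\n")
    sum(orig[pnum == origpat, inventors] %in% ref[pnum == ref.pat, inventors])
}, by = idx]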

Related

R data.table - quick comparison of strings

I would like to find a fast solution to the following problem.
The example is very small; the real data are big and speed is an important factor.
I have two vectors of strings, currently stored in data.tables, but that is not so important. I need to find the frequency of occurrence of strings from one vector in the second one and keep these results.
Example
library(data.table)
dt1<-data.table(c("first","second","third"),c(0,0,0))
dt2<-data.table(c("first and second","third and fifth","second and no else","first and second and third"))
Now, for every item in dt1 I need to find in how many items from dt2 it is contained and save the final frequencies to the second column of dt1.
The task itself is not difficult. I have, however, not managed to find a reasonably quick solution.
The solution I have now is this:
pm<-proc.time()
for (l in 1:dim(dt2)[1]) {
  for (k in 1:dim(dt1)[1]) set(dt1, k, 2L, dt1[k,V2] + as.integer(grepl(dt1[k,V1], dt2[l,V1])))
}
proc.time() - pm
The real data are very large and this is pretty slow; on my PC even this larger version takes 2 seconds:
dt1<-data.table(rep(c("first","second","third"),10),rep(c(0,0,0),10))
dt2<-data.table(rep(c("first and second","third and fifth","second and no else","first and second and third"),10))
pm<-proc.time()
for (l in 1:dim(dt2)[1]) {
  for (k in 1:dim(dt1)[1]) set(dt1, k, 2L, dt1[k,V2] + as.integer(grepl(dt1[k,V1], dt2[l,V1])))
}
proc.time() - pm
   user  system elapsed
   1.93    0.06    2.06
Am I missing a better solution to this, I would say quite simple, task?
Actually it is so simple that I am sure it must be a duplicate, but I have not managed to find it, or anything equivalent, here.
Cross merging of the data.tables is not possible due to memory problems (in the real situation).
Thank you.
dt1[, V2 := sapply(V1, function(x) sum(grepl(x, dt2$V1)))]
You can probably also use fixed string matching for speed.
In that case you can use stri_detect_fixed from the stringi package:
dt1[, V2 := sapply(V1, function(x) sum(stri_detect_fixed(dt2$V1, x)))]
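A minimal self-contained sketch on the example data above, assuming the stringi package is installed (the counts come out as 2, 3 and 2):
library(data.table)
library(stringi)

dt1 <- data.table(V1 = c("first","second","third"), V2 = c(0,0,0))
dt2 <- data.table(V1 = c("first and second","third and fifth",
                         "second and no else","first and second and third"))

# for each pattern in dt1$V1, count how many strings in dt2$V1 contain it literally
dt1[, V2 := sapply(V1, function(x) sum(stri_detect_fixed(dt2$V1, x)))]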

How to efficiently chunk data.frame into smaller ones and process them

I have a bigger data.frame which I want to cut into smaller ones, depending on some "unique keys" (in reference to MySQL). At the moment I am doing this with the following loop, but it takes awfully long, ~45 sec for 10k rows.
for (i in 1:nrow(identifiers_test)) {
  data_test_offer = data_test[(identifiers_test[i,"m_id"]==data_test[,"m_id"] &
                               identifiers_test[i,"a_id"]==data_test[,"a_id"] &
                               identifiers_test[i,"condition"]==data_test[,"condition"] &
                               identifiers_test[i,"time_of_change"]==data_test[,"time_of_change"]),]
  # Sort data by highest prediction
  data_test_offer = data_test_offer[order(-data_test_offer[,"prediction"]),]
  if (data_test_offer[1,"is_v"]==1){
    true_counter <- true_counter+1
  }
}
How can I refactor this to make it more "R", and faster?
Before applying groups you are filtering your data.frame using another data.frame. I would use merge then by.
ID <- c("m_id","a_id","condition","time_of_change")
filter_data <- merge(data_test,identifiers_test,by=ID)
by(filter_data, do.call(paste, filter_data[,ID]),
   FUN = function(x) x[order(-x[,"prediction"]),])
Of course the same thing can be written using data.table more efficiently:
library(data.table)
setkeyv(setDT(identifiers_test),ID)
setkeyv(setDT(data_test),ID)
data_test[identifiers_test][rev(order(prediction)),,ID]
NOTE: the code above is not tested since you don't provide data to test it.
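If the end goal is really the true_counter from the question (how many groups have is_v == 1 on their highest-prediction row), a hedged data.table sketch, reusing the question's object and column names, could look like this:
library(data.table)
setDT(data_test)
setDT(identifiers_test)
ID <- c("m_id","a_id","condition","time_of_change")

# keep only the identifier combinations of interest, take the highest-prediction
# row per group, then count how often that row has is_v == 1
filtered <- merge(data_test, identifiers_test, by = ID)
true_counter <- filtered[order(-prediction), .SD[1], by = ID][, sum(is_v == 1)]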

Alternate to using loops to replace values for big datasets in R?

I have a large (~4.5 million records) data frame. Several of the columns have been anonymised by hashing, and I don't have the key, but I do wish to renumber them to something more legible to aid analysis.
To this end, for example, I've deduced that 'campaignID' has 161 unique elements over the 4.5 million records, and have created a vector to hold these. I've then tried writing a FOR/IF loop to search through the full dataset using the unique element vector: each value of 'campaignID' is checked against the unique element vector, and when a match is found, the index of the unique element vector is returned as the new campaign ID.
campaigns_length <- length(unique_campaign)
dataset_length <- length(dataset$campaignId)

for (i in 1:dataset_length){
  for (j in 1:campaigns_length){
    if (dataset$campaignId[[i]] == unique_campaign[[j]]){
      dataset$campaignId[[i]] <- j
    }
  }
}
The problem of course is that, while it works, it takes an enormously long time - I had to stop it after 12 hours! Can anyone think of a better approach that's much, much quicker and computationally less expensive?
You could use match.
dataset$campaignId <- match(dataset$campaignId, unique_campaign)
See Is there an R function for finding the index of an element in a vector?
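A tiny illustration of what match does here, with made-up hashes:
unique_campaign <- c("a9f3", "77bc", "e01d")          # the unique hashed ids (toy version)
campaignId      <- c("77bc", "a9f3", "77bc", "e01d")  # values as they appear in the dataset
match(campaignId, unique_campaign)                    # returns 2 1 2 3, the new legible ids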
You might benefit from using the data.table package in this case:
library(data.table)
n = 10000000
unique_campaign = sample(1:10000, 169)
dataset = data.table(
  campaignId = sample(unique_campaign, n, TRUE),
  profit = round(runif(n, 100, 1000))
)
dataset[, campaignId := match(campaignId, unique_campaign)]
This example with 10 million rows will only take you a few seconds to run.
You could avoid the inner loop with a dictionary-like structure:
id_dict = list()
for (id in seq_along(unique_campaign)) {
  id_dict[[ unique_campaign[[id]] ]] = id
}

for (i in 1:dataset_length) {
  dataset$campaignId[[i]] = id_dict[[ dataset$campaignId[[i]] ]]
}
As pointed out in this post, lists do not have O(1) access, so it will not divide the time required by 161 but by a smaller factor, depending on the distribution of ids in your list.
Also, the main reason your code is so slow is that you are using those inefficient lists (dataset$campaignId[[i]] alone can take a lot of time if i is big). Take a look at the hash package, which provides O(1) access to the elements (see also this thread on hashed structures in R).
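If you do want a constant-time lookup structure without extra packages, base R environments are hashed; a small sketch (still slower and clumsier than the vectorised match above), reusing the question's unique_campaign and dataset:
# build a hashed environment mapping each hashed id to its new integer id
id_env <- new.env(hash = TRUE)
for (j in seq_along(unique_campaign)) {
  assign(as.character(unique_campaign[[j]]), j, envir = id_env)
}

# O(1) lookup of a single value (keys must be character)
get(as.character(dataset$campaignId[[1]]), envir = id_env)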

R speeding up calculation process on 2.5million obs

I have a huge data.frame (2 million obs.) where I calculate the sum of column values grouped by one identical column value, like this (converting to a data.table first):
check <- dt[, sumOB := sum(as.numeric(as.character(OB))), by = "BIK"]
This gives me a new column with the summed values where, if applicable, multiple rows share the same BIK. After that I add the following calculation.
calc <- check[,NewVA := (((as.numeric(as.character(VA)))
/ sumOB) * (as.numeric(as.character(OB)))), by = ""]
This works perfectly fine, giving me a new column with the desired values. My data frame consists, as said, of 2 million observations, and this process is extremely slow and memory intensive (I have 8 GB of RAM and I use all of it).
I would like to speed up this process, is there a more efficient way to reach the same results?
Thanks in advance,
Robert
I don't understand why you wrap everything in as.numeric(as.character(...)). That's a performance cost you shouldn't need.
Also, why do you copy your data.table? That's your biggest mistake. Look at:
dt[, sumOB := sum(as.numeric(as.character(OB))), by = "BIK"]
dt[, NewVA :=
     (((as.numeric(as.character(VA))) / sumOB) * (as.numeric(as.character(OB))))]
print(dt)
(possibly without all those type conversions).
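A minimal self-contained version of the same idea with made-up data, keeping the columns numeric so no conversion is needed:
library(data.table)
dt <- data.table(BIK = c("a","a","b"), OB = c(1, 2, 3), VA = c(10, 20, 30))

dt[, sumOB := sum(OB), by = BIK]   # grouped sum, added by reference (no copy of dt)
dt[, NewVA := VA / sumOB * OB]     # plain row-wise arithmetic on the numeric columns
print(dt)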

R Checking for duplicates is painfully slow, even with mclapply

I've got some data involving repeated sales for a bunch of cars with unique Ids. A car can be sold more than once.
Some of the Ids are erroneous however, so I'm checking, for each Id, if the size is recorded as the same over multiple sales. If it isn't, then I know that the Id is erroneous.
I'm trying to do this with the following code:
library("doMC")
Data <- data.frame(ID=c(15432,67325,34623,15432,67325,34623),Size=c("Big","Med","Small","Big","Med","Big"))
compare <- function(v) all(sapply( as.list(v[-1]), FUN=function(z) {isTRUE(all.equal(z, v[1]))}))
IsGoodId = function(Id){
  Sub = Data[Data$ID==Id,]
  if (length(Sub[,1]) > 1){
    return(compare(Sub[,"Size"]))
  } else {
    return(TRUE)
  }
}
WhichAreGood = mclapply(unique(Data$ID),IsGoodId)
But it's painfully, awfully, terribly slow on my quad-core i5.
Can anyone see where the bottleneck is? I'm a newbie to R optimisation.
Thanks,
-N
Looks like your algorithm makes N^2 comparisons. Maybe something like the following will scale better. We find the duplicate sales, thinking that this is a small subset of the total.
dups = unique(Data$ID[duplicated(Data$ID)])
DupData = Data[Data$ID %in% dups,,drop=FALSE]
The %in% operator scales very well. Then split the size column by id, checking for ids with more than one size:
tapply(DupData$Size, DupData$ID, function(x) length(unique(x)) != 1)
This gives a named logical vector, with TRUE indicating that there is more than one size per id. This scales approximately linearly with the number of duplicate sales; there are clever ways to make this go faster if your duplicated data is itself big.
Hmm, thinking about this a bit more, I guess
u = unique(Data)
u$ID[duplicated(u$ID)]
does the trick.
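Applied to the example Data above, that two-liner returns 34623, the only ID recorded with more than one Size:
u <- unique(Data)            # drop exact duplicate (ID, Size) rows
u$ID[duplicated(u$ID)]       # IDs that still appear more than once, i.e. the erroneous ones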
