I have a huge data.frame (2 million obs.) where I calculate the sum of a column's values over all rows that share the same value in another column, like this (after converting to a data.table first):
check <- dt[, sumOB := sum(as.numeric(as.character(OB))), by = "BIK"]
This gives me a new column with the sum of, where applicable, the multiple values that share the same BIK. After that I add the following calculation:
calc <- check[, NewVA := (as.numeric(as.character(VA)) / sumOB) * as.numeric(as.character(OB))]
This works fine and gives me a new column with the desired values, but as said my data frame contains 2 million observations, and this process is extremely slow and memory intensive (I have 8 GB of RAM and it uses all of it).
I would like to speed up this process; is there a more efficient way to reach the same results?
Thanks in advance,
Robert
I don't understand why you wrap everything in as.numeric(as.character(...)). That's a performance cost you shouldn't need.
Also, why do you copy your data.table? That's your biggest mistake: data.table updates columns by reference, so there is no need for check and calc. Look at
dt[, sumOB := sum(as.numeric(as.character(OB))), by = "BIK"]
dt[, NewVA := (as.numeric(as.character(VA)) / sumOB) * as.numeric(as.character(OB))]
print(dt)
(ideally without all those type conversions).
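A minimal sketch of that idea (assuming OB and VA are currently stored as factor or character and BIK is the grouping column, as in the question): convert the columns to numeric once, by reference, then do both calculations directly on dt without creating check or calc.
library(data.table)
# convert once, by reference, instead of wrapping every expression
dt[, OB := as.numeric(as.character(OB))]
dt[, VA := as.numeric(as.character(VA))]
# group sum and the derived column, still by reference (no copies)
dt[, sumOB := sum(OB), by = BIK]
dt[, NewVA := (VA / sumOB) * OB]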
I was trying to do an extensive computation in R. Eighteen hours have passed but my RStudio seems to continue to work. I'm not sure if I could have written the script in a different way to make it faster. I was trying to implement a Crank–Nicolson type method over a 50000 by 350 matrix as shown below:
#defining the discretization of cells
dt<-1
t<-50000
dz<-0.0075
z<-350*dz
#velocity & diffusion
v<-2/(24*60*60)
D<-0.02475/(24*60*60)
#make the big matrix (all filled with zeros)
m <- as.data.frame(matrix(0, t/dt+1, z/dz+2)) #extra columns/rows for boundary conditions
#fill the first and last columns with constant boundary values
m[,1]<-400
m[,length(m)]<-0
#implement the calculation
for(j in 2:(length(m[1,])-1)){
  for(i in 2:length(m[[1]])){
    m[i,][2:length(m)-1][[j]] <- m[i-1,][[j]] +
      D*dt*(m[i-1,][[j+1]]-2*m[i-1,][[j]]+m[i-1,][[j-1]])/(dz^2) -
      v*dt*(m[i-1,][[j+1]]-m[i-1,][[j-1]])/(2*dz)
  }
}
Is there a way to know how long it would take R to finish? Is there a better way of constructing the numerical calculation? At this point, I feel like Excel could have been faster!
Just making a few simple optimisations really helps here. The original version of your code would take ~5 days on my laptop. Using a matrix instead of a data frame, and calculating values that are reused in the loop only once, brings this down to around 7 minutes.
And think about messy constructions like
m[i,][2:length(m)-1][[j]]
This is equivalent to
m[[i, j]]
which would be faster (as well as much easier to understand). Making this change further reduces the runtime by another factor of more than 2, to around 3 minutes.
Putting this together, we have:
dt<-1
t<-50000
dz<-0.0075
z<-350*dz
#velocity & diffusion
v<-2/(24*60*60)
D<-0.02475/(24*60*60)
#make the big matrix (all filled with zeros)
m <- matrix(0, t/dt+1, z/dz+2) # extra columns/rows for boundary conditions
# cache a few values that get reused many times
NC = NCOL(m)
NR = NROW(m)
C1 = D*dt / dz^2
C2 = v*dt / (2*dz)
#fill the first and last columns with constant boundary values
m[,1]<-400
m[,NC]<-0
#implement the calculation
for (j in 2:(NC-1)) {
  for (i in 2:NR) {
    ma = m[i-1,]
    ma.1 = ma[[j+1]]
    ma.2 = ma[[j-1]]
    m[[i,j]] <- ma[[j]] + C1*(ma.1 - 2*ma[[j]] + ma.2) - C2*(ma.1 - ma.2)
  }
}
If you need to go even faster than this, you can try out some more optimisations. For example see here for how different ways of indexing the same element can have very different execution times. In general it is better to refer to column first, then row.
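A quick way to see those indexing differences for yourself (a small sketch, assuming the microbenchmark package is installed; exact timings will vary by machine):
library(microbenchmark)
mm <- matrix(0, 1000, 1000)
microbenchmark(
  mm[[500, 500]],  # element access with [[
  mm[500, 500],    # element access with [
  mm[500, ][500]   # extract the whole row first, then index into it (slowest)
)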
If all the optimisations you can do in R are not enough for your speed requirements, then you might implement the loop in Rcpp instead.
I have a slow computer and some of my R calculations take hours and sometimes days to run. I'm sure they can be made more efficient, but in the meantime I would like a simple way to show how far along R is in the needed calculations.
In a loop this can easily be done with print(i). Is something similar available when doing data.table calculations?
For instance, the following code takes about 50 hours to run on my machine
q[, ties := sum(orig[pnum == origpat, inventors] %in% ref[pnum == ref.pat, inventors]), by = idx]
q is a data.table with origpat, ref.pat and idx (an index) as columns. The data.tables orig and ref both contain the columns pnum and inventors. The code simply counts the overlapping inventors between the two groups, but given the group-by-group nature of by = idx, it takes a long time.
I'd like my screen to show progress, e.g. every 1,000 rows (there are about 20 million rows).
Any way to do this simply?
Try
q[, ties := {
print(.GRP)
sum(orig[pnum == origpat, inventors] %in% ref[pnum == ref.pat, inventors])
}, by=idx]
This is analogous to print(i) for a by-group operation.
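If printing every single group is too noisy, a variation of the same idea reports only every 1,000th group (.NGRP, the total number of groups, is available in newer data.table versions; drop it if yours doesn't have it):
q[, ties := {
  if (.GRP %% 1000 == 0) cat("group", .GRP, "of", .NGRP, "\n")
  sum(orig[pnum == origpat, inventors] %in% ref[pnum == ref.pat, inventors])
}, by = idx]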
I have a large (~4.5 million records) data frame in which several of the columns have been anonymised by hashing. I don't have the key, but I do wish to renumber them to something more legible to aid analysis.
To this end, for example, I've deduced that campaignID has 161 unique elements over the 4.5 million records, and have created a vector to hold these. I've then tried writing a FOR/IF loop to search through the full dataset using the unique-element vector: each value of campaignID is checked against the unique-element vector, and when a match is found, the index of the match becomes the new campaign ID.
campaigns_length <- length(unique_campaign)
dataset_length <- length(dataset$campaignId)

for (i in 1:dataset_length){
  for (j in 1:campaigns_length){
    if (dataset$campaignId[[i]] == unique_campaign[[j]]){
      dataset$campaignId[[i]] <- j
    }
  }
}
The problem, of course, is that while it works, it takes an enormously long time - I had to stop it after 12 hours! Can anyone think of a better approach that's much, much quicker and computationally less expensive?
You could use match.
dataset$campaignId <- match(dataset$campaignId, unique_campaign)
See Is there an R function for finding the index of an element in a vector?
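For example, with a few made-up hash-like IDs (hypothetical values, just to illustrate what match returns):
unique_campaign <- c("a1f3c0", "9bc27d", "07de55")
campaignId      <- c("9bc27d", "a1f3c0", "9bc27d", "07de55")
match(campaignId, unique_campaign)
# [1] 2 1 2 3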
You might benefit from using the data.table package in this case:
library(data.table)
n = 10000000
unique_campaign = sample(1:10000, 169)
dataset = data.table(
campaignId = sample(unique_campaign, n, TRUE),
profit = round(runif(n, 100, 1000))
)
dataset[, campaignId := match(campaignId, unique_campaign)]
This example with 10 million rows will only take you a few seconds to run.
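If you want to check the timing on your own machine, you can wrap the update in system.time():
system.time(
  dataset[, campaignId := match(campaignId, unique_campaign)]
)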
You could avoid the inner loop with a dictionary-like structure:
id_dict = list()
for (id in seq_along(unique_campaign)) {
  id_dict[[ unique_campaign[[id]] ]] = id
}

for (i in 1:dataset_length) {
  dataset$campaignId[[i]] = id_dict[[ dataset$campaignId[[i]] ]]
}
As pointed out in this post, lists do not have O(1) access, so this will not divide the time required by 161 but by a smaller factor that depends on the distribution of ids in your list.
Also, the main reason your code is so slow is that you are using those inefficient element-by-element operations (dataset$campaignId[[i]] alone can take a lot of time if i is big). Take a look at the hash package, which provides O(1) access to elements (see also this thread on hashed structures in R).
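A minimal sketch of that idea using a base R environment (environments are hash-based, so lookups are roughly O(1)); this assumes the campaign IDs are character strings, as hashed values usually are:
# build the lookup table once
id_env <- new.env(hash = TRUE)
for (i in seq_along(unique_campaign)) {
  id_env[[ unique_campaign[[i]] ]] <- i
}

# then replace each value with its index
dataset$campaignId <- vapply(
  as.character(dataset$campaignId),
  function(x) id_env[[x]],
  integer(1),
  USE.NAMES = FALSE
)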
I have a function theresults which takes a 71x2446 data frame and returns a 2x2446 double matrix. The first number in each of the 2446 pairs is an integer 1-6 indicating which category the line falls into, and the second number is the profit or loss for that category. I want to calculate the sum of profits for each category while also counting the frequency of each category. My question is whether the way I've written it is an efficient use of vectors.
vec <- as.data.frame(t(apply(theData, 1, theresults)))

successCrossed <- vec[2][vec[1] == 1]
failCrossed    <- vec[2][vec[1] == 2]
successFilled  <- vec[2][vec[1] == 3]
failFilled     <- vec[2][vec[1] == 4]
naCount        <- vec[2][vec[1] == 5]
otherCount     <- vec[2][vec[1] == 6]
Then there are a bunch of calls to length() and mean() to summarise the results.
theresults references the original data frame in this sort of way:
theresults <- function(theVector)
{
  if (theVector[['Aggressor']] == "Y")
  {
    if (theVector[['Side']] == "Sell")
      {choice = 6}
    else
      {choice = 3}

    if (!is.na(theVector[['TradePrice']]) && !is.na(theVector[['L1_BidPri_1']]) &&
        !is.na(theVector[['L1_AskPri_1']]) && !is.na(theVector[['L2_BidPri_1']]) &&
        !is.na(theVector[['L2_AskPri_1']]))
    {
      Profit <- switch(choice,
        -as.numeric(theVector[['TradePrice']]) + 10000*as.numeric(theVector[['L1_AskPri_1']])/as.numeric(theVector[['L2_BidPri_1']]),
        -as.numeric(theVector[['TradePrice']]) + 10000*as.numeric(theVector[['L1_BidPri_1']])/as.numeric(theVector[['L2_BidPri_1']]),
You can try combining the rows of the 2x2446 matrix into a single string vector representing the category and profit status, then calling table() on it.
Here's an example:
data <- cbind(sample(1:6, 30, replace = TRUE),
              sample(c("profit", "loss"), 30, replace = TRUE))
x <- apply(data, MARGIN = 1, paste, collapse = "")
table(x)
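If the goal is per-category profit totals as well as counts, a small sketch along these lines may be closer to what the question describes (assuming res is the 2 x 2446 matrix returned by apply(theData, 1, theresults), with the category in row 1 and the profit/loss in row 2):
res <- apply(theData, 1, theresults)

category <- factor(res[1, ], levels = 1:6)
profit   <- res[2, ]

table(category)                  # frequency of each category
tapply(profit, category, sum)    # total profit/loss per category
tapply(profit, category, mean)   # average profit/loss per category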
I'm pretty sure that for this type of operation, even if the data set were in the hundreds of thousands of rows, the correct answer would be to use Uwe's maxim; this code is fast enough and will not be a bottleneck in the program.
(In response to the other answer: cbind is slow and memory-intensive relative to my current solution.)
I have a list with hundreds of columns and rows. What I'm doing is looping through nearly every possible pair of columns and taking their difference: the difference between the 1st and 2nd column, the 1st and 3rd column, ..., the 1st and 500th column, ..., the 499th and 500th column. Once I have those differences I compute some descriptive statistics (mean, standard deviation, kurtosis, skewness, etc.) for output. I know I can use lapply to calculate those statistics for each column individually, but since sd(x) - sd(y) is not equal to sd(x - y), that doesn't really cut down on my looping. I can use the fact that mean(x) - mean(y) = mean(x - y), but that's the only statistic where this property holds.
Here's some pseudo code that I have:
for (n1 in 1:(number of columns)) {
  for (n2 in n1:(number of columns)) {
    temp <- bigdata[n1] - bigdata[n2]
    results[abc] <- (maxdrawdown, mean, skewness, kurtosis, count, st dev,
                     median, downsidedeviation)
  }
}
Doing it this way can literally take days, so I'm looking for some improvements. I'm already using the compiler package with enableJIT(3), which actually does make it noticeably faster. I had a couple of other ideas, and any insights would be helpful. One is trying to use the snowfall package (still trying to get my head around how to implement it), with the thought that one core could compute skew and kurtosis while the other computes the remaining statistics. The other idea is creating big chunks of temp (i.e. 1-2, 1-3, 1-4) as another data.frame (or list) so as to use lapply against it and knock out many iterations at once. Would this make much of a difference? Is there anything else I can do that I'm not even thinking of?
A reproducible example would really help, because the way you describe your problem is confusing (e.g. lists don't have rows/columns). My guess is that bigdata and results are data.frames, in which case converting each of them to a matrix will make your loops appreciably faster.
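A minimal sketch of how that might look, assuming bigdata can be coerced to a numeric matrix and using combn() to enumerate each unique column pair just once:
bigmat <- as.matrix(bigdata)
pairs  <- combn(ncol(bigmat), 2)   # each unique pair of columns, n1 < n2

stats <- apply(pairs, 2, function(p) {
  d <- bigmat[, p[1]] - bigmat[, p[2]]
  c(mean = mean(d), sd = sd(d), median = median(d), n = length(d))
})
# add skewness, kurtosis, drawdown, etc. from whichever package you already use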
The following might make the code a bit easier to read, and it should also get a bit faster because it eliminates the explicit for() loops.
Try using expand.grid(), which I tend to use less often than I probably should.
For instance:
nC <- 3 # Num of cols
nR <- 4 # Num of rows
indices <- expand.grid(1:nC, 1:nC)
# Now you can use apply cleanly; it returns one result per (c1, c2) pair
results <- apply(indices, 1,
  function(x) {
    c1 <- x[1]; c2 <- x[2]
    doYourThing(bigData[, c1], bigData[, c2])
  }
)
Well, you get the idea. :-)