I have two datasets with 24k and 15k rows. I used nested for loops to rewrite some data... however, it takes forever to compute.
Does anyone have a suggestion for how to optimize the code and speed up the process?
my code:
for (i in 1:length(data$kolicina)) {
  for (j in 1:length(df$kolicina)) {
    if (data$LIXcode[i] == df$LIXcode[j]) {
      data$kolicina[i] <- df$kolicina[j]
    }
  }
}
the full code with the input looks like this:
df <- data[grepl("Trennscheiben", data$a_naziv) & data$SestavKolicina > 1,]
for (i in 1:length(df$kolicina)) {
  df$kolicina[i] <- df$kolicina[i] / 10
}
for (i in 1:length(data$kolicina)) {
  for (j in 1:length(df$kolicina)) {
    if (data$LIXcode[i] == df$LIXcode[j]) {
      data$kolicina[i] <- df$kolicina[j]
    }
  }
}
the data:
LIXcode        a_naziv                  RacunCenaNaEM  kolicina
LIX2017396957  MINI HVLP Spritzpistole  20,16          1
LIX2017396957  MINI HVLP Spritzpistole  20,16          1
LIX2017396963  Trennscheiben Ø115 Ø12   12,53          30
LIX2017396963  Trennscheiben Ø115 Ø12   12,53          1
I haven't tried this on my own machine, but something along these lines should work:
fun <- function(x, y) {
  idx <- match(x$LIXcode, y$LIXcode)        # position in y of each code in x, NA if absent
  hit <- !is.na(idx)
  x$kolicina[hit] <- y$kolicina[idx[hit]]   # overwrite only the rows that have a match
  x
}
data <- fun(data, df)
R can do all of these assignments at once in a vectorized way, rather than looping over every pair of rows.
As far as I understand, the question concerns a table "dt1" with key column "a", any number of value columns, and any number of observations. Then we have a "dt2" that provides a mapping: its column "a" has unique values, and its column "b" holds the values that need to be written into "dt1" wherever the "a" columns match.
I would suggest joining tables:
require(data.table)
dt1 <- data.table(a = sample(1:10, 1000, replace = T),
b = sample(letters, 1000, replace = T))
dt2 <- data.table(a = 1:10,
b = letters[1:10])
output <- merge(dt1, dt2, by = "a", all.x = T)
You can also try:
dt1[, new_value := dt2$b[match(a, dt2$a)]]
Both of these solutions are vectorized, therefore almost instant.
Base solution (no data.table syntax, although I'd highly recommend you to learn it):
dt1$new_value <- dt2$b[match(dt1$a, dt2$a)]
And that's if I understood the question correctly...
Here's a working solution to accommodate the expected output:
dt1[a %in% dt2$a, b:=dt2$b[match(a, dt2$a)]]
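For the question's own tables, the same idea can be written as a data.table update join (a sketch, assuming data and df still hold the LIXcode and kolicina columns shown above):
library(data.table)
setDT(data)                                       # convert in place to data.table
setDT(df)
data[df, on = "LIXcode", kolicina := i.kolicina]  # overwrite kolicina wherever LIXcode matches
If df contains several rows per LIXcode, the last match wins, which mirrors the behaviour of the original nested loop.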
Related
I have a data table that provides the length and composition of given vectors
for example:
set.seed(1)
dt = data.table(length = c(100, 150),
n_A = c(30, 30),
n_B = c(20, 100),
n_C = c(50, 20))
I need to randomly split each vector into two subsets with 80% and 20% of observations respectively. I can currently do this using a for loop. For example:
dt_80_list <- list() # create output lists
dt_20_list <- list()
for (i in 1:nrow(dt)){ # for each row in the data.table
sample_vec <- sample( c( rep("A", dt$n_A[i]), # create a randomised vector with the given nnumber of each component.
rep("B", dt$n_B[i]),
rep("C", dt$n_C[i]) ) )
sample_vec_80 <- sample_vec[1:floor(length(sample_vec)*0.8)] # subset 80% of the vector
dt_80_list[[i]] <- data.table( length = length(sample_vec_80), # count the number of each component in the subset and output to list
n_A = length(sample_vec_80[which(sample_vec_80 == "A")]),
n_B = length(sample_vec_80[which(sample_vec_80 == "B")]),
n_C = length(sample_vec_80[which(sample_vec_80 == "C")])
)
dt_20_list[[i]] <- data.table( length = dt$length[i] - dt_80_list[[i]]$length, # subtract the number of each component in the 80% to identify the number in the 20%
n_A = dt$n_A[i] - dt_80_list[[i]]$n_A,
n_B = dt$n_B[i] - dt_80_list[[i]]$n_B,
n_C = dt$n_C[i] - dt_80_list[[i]]$n_C
)
}
dt_80 <- do.call("rbind", dt_80_list) # collapse lists to output data.tables
dt_20 <- do.call("rbind", dt_20_list)
However, the dataset I need to apply this to is very large, and this is too slow. Does anyone have any suggestions for how I could improve performance?
Thanks.
(I assumed your dataset consists of many more rows (but only a few columns).)
Here's a version I came up with, with three main changes:
use .N and by= to count the number of "A","B","C" drawn in each row
use the size argument in sample
join the original dt and dt_80 to calculate dt_20 without a for-loop
## draw training data
dt_80 <- dcast(
dt[,row:=1:nrow(dt)
][, .(draw=sample(c(rep("A80",n_A),
rep("B80",n_B),
rep("C80",n_C)),
size=.8*length) )
, by=row
][,.N,
by=.(row,draw)],
row~draw,value.var="N")[,length80:=A80+B80+C80]
## draw test data
dt_20 <- dt[dt_80,
.(A20=n_A-A80,
B20=n_B-B80,
C20=n_C-C80),on="row"][,length20:=A20+B20+C20]
There is probably still room for optimization, but I hope it already helps :)
EDIT
Here I add my initial idea. I did not post it at first because the code above is much faster, but this one might be more memory-efficient, which seems crucial in your case. So even if you already have a working solution, this might be of interest...
library(data.table)
library(Rfast)
## add row numbers
dt[,row:=1:nrow(dt)]
## sampling function
sampfunc <- function(n_A,n_B,n_C){
draw <- sample(c(rep("A80",n_A),
rep("B80",n_B),
rep("C80",n_C)),
size=.8*(n_A+n_B+n_C))
out <- Rfast::Table(draw)
return(as.list(out))
}
## draw training data
dt_80 <- dt[,sampfunc(n_A,n_B,n_C),by=row]
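If you also need the 20% part from this version, the same join as in the faster solution above should work (a sketch; it assumes dt still has the row column added above and that every row draws at least one of each letter, so Rfast::Table returns all three names):
dt_80[, length80 := A80 + B80 + C80]
dt_20 <- dt[dt_80,
            .(A20 = n_A - A80,
              B20 = n_B - B80,
              C20 = n_C - C80), on = "row"][, length20 := A20 + B20 + C20]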
Given a data frame or matrix with an arbitrary number of rows and columns, what is the fastest way to apply a function to all pairwise combinations of columns?
For example, if I have a data table:
library(data.table)
N <- 3
K <- 3
data <- data.table(id=seq(N))
for (k in seq(K)) {
  data[[k]] <- runif(N)  # note: k = 1 overwrites the id values, leaving K random columns
}
If I then want to compute the simple difference between all pairs of columns, I could loop (or lapply) over the columns:
differences <- data.table(foo=seq(N))
for (var1 in names(data)) {
  for (var2 in names(data)) {
    if (var1 == var2) next
    if (which(names(data) == var1) > which(names(data) == var2)) next
    combo <- paste0(var1, var2)
    differences[[combo]] <- data[[var1]] - data[[var2]]
  }
}
But as K gets larger, this becomes absurdly slow.
One solution I've considered is to make two new data tables using combn and subtract them:
a <- data[,combn(colnames(data),2)[1,],with=F]
b <- data[,combn(colnames(data),2)[2,],with=F]
differences <- a-b
But as N and K get larger, this becomes very memory intensive (though faster than looping).
It seems to me that the outer product of the matrix with itself is probably the best way to go, but I can't piece it together. This is especially hard if I want to apply an arbitrary function (RMSE for example), instead of just the difference.
What's the fastest way?
If it is necessary to have the data in a matrix first, you can do the following:
library(data.table)
data <- matrix(runif(300*500), nrow = 300, ncol = 500)
data.DT <- setkey(data.table(c(data), colId = rep(1:500, each = 300), rowId = rep(1:300, times = 500)), colId)
diff.DT <- data.DT[
, {
ccl <- unique(colId)
vv <- V1
data.DT[colId > ccl, .(col2 = colId, V1 - vv)]
}
, keyby = .(col1 = colId)
]
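If you need an arbitrary pairwise function such as the RMSE mentioned in the question rather than a plain difference, combn() also accepts a function argument. A minimal sketch (pairwise_apply and rmse are hypothetical helper names; data is assumed to be the numeric matrix defined above):
rmse <- function(x, y) sqrt(mean((x - y)^2))

## apply f to every pair of columns; returns one value per pair,
## in the order produced by combn(seq_len(ncol(mat)), 2)
pairwise_apply <- function(mat, f) {
  combn(seq_len(ncol(mat)), 2, function(ix) f(mat[, ix[1]], mat[, ix[2]]))
}

res <- pairwise_apply(data, rmse)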
Is it possible to perform substitution inside an lapply (or similar) function?
I frequently have cases where depending on some key I wish to transform some elements of a data.frame / xts object.
At the moment, I do this using a for loop, as follows:
library(xts)
set.seed(1)
dx2 <- dx <- xts(data.frame(uni = runif(10),
nrm = rnorm(10),
uni2 = runif(10) - 0.5,
nrm2 = rnorm(10) - 0.5),
order.by = Sys.Date() + 1:10)
key_dx <- data.frame(dd = sample(index(dx), 4),
repTest = sample(c(TRUE, FALSE), 4, rep=TRUE),
colNum = 1:4,
refNum = c(3,4,1,2))
for (i in 1:nrow(key_dx)) {
  if (key_dx$repTest[i]) {
    dx[key_dx$dd[i], key_dx$colNum[i]] <- 100 + dx[key_dx$dd[i], key_dx$refNum[i]]^2
  }
}
This feels like the kind of thing that I ought to be able to do using an *apply function.
It would certainly make it more readable; however, I cannot fathom how to test and assign within one.
Is it possible? If so, how might I do this?
The main issue is returning the changed rows separately and then rbinding them to the rows that didn't need to change. I think this is actually more difficult to read than your loop version.
do.call(rbind, # rbind all rows
# only consider rows with repTest=TRUE
c(lapply(which(key_dx$repTest), function(i) {
# change rows
dx[key_dx$dd[i], key_dx$colNum[i]] <-
100 + dx[key_dx$dd[i], key_dx$refNum[i]]^2
# return the changed row
dx[key_dx$dd[i], ]
}),
# return all rows that didn't change
list(dx[!index(dx) %in% key_dx$dd[key_dx$repTest], ])
))
You could also use plyr (neater than working with lapply() results):
require(plyr)
origframe<-data.frame(dd=index(dx),dx) # original data
editframe<-merge(key_dx,origframe,by="dd") # merge with key_dx to bring
# conditional data into the rows
editframe<-editframe[editframe$repTest,] # only test TRUE
editframe<-adply(editframe,1,function(x){ # modify subset rows in adply call
x[as.numeric(x["colNum"])+4]<-100 + # +4 adusts col index
as.numeric(x[as.numeric(x["refNum"])+4])^2 # +4 adusts col index
return(x)
})[,c(1,5:ncol(editframe))]
updatedframe<-rbind(editframe,origframe[!(origframe$dd %in% editframe$dd),])
# then back to ts
dx2<-xts(updatedframe[,c("uni","nrm","uni2","nrm2")],order.by=updatedframe$dd)
All,
Consider a simple problem:
set.seed(1) # if generating sample data, it's helpful to set a seed
idx <- rep(1:4,each=4)
c1 <- rnorm(16)
c2 <- rnorm(16)
tmp <- data.frame(idx,c1,c2)
tmp$delt <- NA  # initialize the column so the group-wise assignments below behave predictably
for (i in 2:4) {
  rows <- which(idx == i)
  tmp$delt[rows] <- (tmp$c2[min(rows)-1] - tmp$c1[min(rows)])/tmp$c2[min(rows)-1]
}
tmp
I would like to know if there is an efficient way to generate the delt column using an apply-class function. This example works well enough, but will likely get bogged down when implemented on a large data set.
Cheers
Here is a solution using ave
FUN <- function(i) {
i1 <- i[1]
if (i1 > 1) 1 - tmp$c1[i1] / tmp$c2[i1 - 1] else NA
}
tmp$delt <- ave(1:nrow(tmp), tmp$idx, FUN = FUN)
You can merge the table with itself.
Especially if the data is large, data.table will be quite fast.
# put your data into a data.table, keying by idx
library(data.table)
tmpDT <- data.table(idx,c1,c2, key="idx")
# merge to itself and calculate, using tail() and head()
tmpDT[ tmpDT[, list(c2prev = tail(c2, 1)), by=(idx+1)]
, delt := (c2prev - head(c1, 1)) / c2prev ]
Here's a base method:
dal <- c(FALSE, as.logical(diff(idx)))
dal_s <- c(as.logical(diff(idx)), FALSE)
d <- data.frame(idx=2:4, delt=1-tmp$c1[dal]/tmp$c2[dal_s])
merge(tmp, d, all=TRUE)
Note that (x - y)/x = 1 - y/x. You could use the former expression above if necessary.
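A quick numerical check of that identity:
x <- 4; y <- 3
(x - y) / x   # 0.25
1 - y / x     # 0.25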
I am trying to reshape/reduce my data. So far, I employ a for loop (very slow), but from what I perceive, this should be quite fast with plyr.
I have many groups (firms, as a factor in the dataset) and I want to drop entirely every firm which shows a 0 entry for value in any of that firm's cells. I thus create a new data.frame but leave out all groups showing 0 for value at some point.
The for loop:
Data Creation:
set.seed(1)
mydf <- data.frame(firmname = sample(LETTERS[1:5], 40, replace = TRUE),
value = rpois(40, 2))
-----------------------------
splitby = mydf$firmname
new.data <- data.frame()
for (i in 1:length(unique(splitby))) {
  temp <- subset(mydf, splitby == unique(splitby)[i])
  if (all(temp$value > 0)) {
    new.data <- rbind(new.data, temp)
  }
}
Delete all empty firm factor levels:
new.data$firmname <- factor(new.data$firmname)
Is there a way to achieve that with the plyr package? Can the subset function be used in that context?
EDIT: To make the problem reproducible, data creation was added as suggested by BenBarnes. Ben, thanks a lot for that. Furthermore, my code has been altered to comply with the answers provided below.
You could supply an anonymous function to the .fun argument in ddply():
set.seed(1)
mydf <- data.frame(firmname = sample(LETTERS[1:5], 40, replace = TRUE),
value = rpois(40, 2))
library(plyr)
ddply(mydf,.(firmname), function(x) if(any(x$value==0)) NULL else x )
Or using [, as suggested by Andrie:
firms0 <- unique(mydf$firmname[which(mydf$value == 0)])
mydf[!(mydf$firmname %in% firms0), ]  # safer than -which(...), which drops all rows when firms0 is empty
Note that the results of ddply are sorted according to firmname
EDIT
For the example in your comments, this approach is again faster than using ddply() to subset, selecting only firms with more than three entries:
firmTable <- table(mydf$firmname)
firmsGT3 <- names(firmTable)[firmTable > 3]
mydf[mydf$firmname %in% firmsGT3, ]