Threshold for using join over merge - r

My understanding regarding the difference between the merge() function (in base R) and the join() functions of plyr and dplyr are that join() is faster and more efficient when working with "large" data sets.
Is there some way to determine a threshold to regarding when to use join() over merge(), without using a heuristic approach?

I am sure you will be hard pressed to find a "hard and fast" rule around when to switch from one function to another. As others have mentioned, there are a set of tools in R to help you measure performance. object.size and system.time are two such function that look at memory usage and performance time, respectively. One general approach is to measure the two directly over an arbitrarily expanding data set. Below is one attempt at this. We will create a data frame with an 'id' column and a random set of numeric values, allowing the data frame to grow and measuring how it changes. I'll use inner_join here as you mentioned dplyr. We will measure time as "elapsed" time.
library(tidyverse)
setseed(424)
#number of rows in a cycle
growth <- c(100,1000,10000,100000,1000000,5000000)
#empty lists
n <- 1
l1 <- c()
l2 <- c()
#test for inner join in dplyr
for(i in growth){
x <- data.frame("id" = 1:i, "value" = rnorm(i,0,1))
y <- data.frame("id" = 1:i, "value" = rnorm(i,0,1))
test <- inner_join(x,y, by = c('id' = 'id'))
l1[[n]] <- object.size(test)
print(system.time(test <- inner_join(x,y, by = c('id' = 'id')))[3])
l2[[n]] <- system.time(test <- inner_join(x,y, by = c('id' = 'id')))[3]
n <- n+1
}
#empty lists
n <- 1
l3 <- c()
l4 <- c()
#test for merge
for(i in growth){
x <- data.frame("id" = 1:i, "value" = rnorm(i,0,1))
y <- data.frame("id" = 1:i, "value" = rnorm(i,0,1))
test <- merge(x,y, by = c('id'))
l3[[n]] <- object.size(test)
# print(object.size(test))
print(system.time(test <- merge(x,y, by = c('id')))[3])
l4[[n]] <- system.time(test <- merge(x,y, by = c('id')))[3]
n <- n+1
}
#ploting output (some coercing may happen, so be it)
plot <- bind_rows(data.frame("size_bytes" = l3, "time_sec" = l4, "id" = "merge"),
data.frame("size_bytes" = l1, "time_sec" = l2, "id" = "inner_join"))
plot$size_MB <- plot$size_bytes/1000000
ggplot(plot, aes(x = size_MB, y =time_sec, color = id)) + geom_line()
merge seems to perform worse out the gate, but really kicks off around ~20MB. Is this the final word on the matter? No. But such testing can give you a idea of how to choose a function.

Related

Loop for doing multiple correlations on dataset in R

I have a dataset with x number of columns, consisting of groups of test results, for example test1_1, test1_2 etc. Each set of tests has a different number of test results associated with it so the actual numbers aren't the same across each test. The final column is my target variable. I'm looking to establish which tests are correlated with the target variable, but I also want to create datasets for each set of tests. I'm also going to be plotting correlation plots of each test against the target variable. I suspect I could probably achieve all of this in a few lines of code within a for/while loop, however, I'm not sure where to begin.
Using lapply this could be achieved like so:
library(dplyr)
library(corrplot)
set.seed(42)
dataset <- data.frame(
test1_1 = runif(20),
test1_2 = runif(20),
test2_1 = runif(20),
test2_2 = runif(20),
Target = runif(20)
)
test_cols <- gsub("_\\d+$", "", names(dataset))
test_cols <- test_cols[grepl("^test", test_cols)]
test_cols <- unique(test_cols)
test_cols <- setNames(test_cols, test_cols)
test_fun <- function(x, test) {
x <- x %>%
select((starts_with(test)) | matches("Target"))
cor(x)
}
cor_test <- lapply(test_cols, test_fun, x = dataset)
cplot <- lapply(cor_test, corrplot)
This is similar to #stefan's answer using split.default to split the columns by pattern in the column names.
tmp <- dplyr::select(dataset, -Target)
list_plot <- lapply(split.default(tmp, sub('_.*', '', names(tmp))), function(x) {
corrplot::corrplot(cor(cbind(x, Target = dataset$Target)))
})

Vectorization of a nested for-loop that inputs all paired combinations

I thought that the following problem must have been answered or a function must exist to do it, but I was unable to find an answer.
I have a nested loop that takes a row from one 3-col. data frame and copies it next to each of the other rows, to form a 6-col. data frame (with all possible combinations). This works fine, but with a medium sized data set (800 rows), the loops take forever to complete the task.
I will demonstrate on a sample data set:
Sdat <- data.frame(
x = c(10,20,30,40),
y = c(15,25,35,45),
ID =c(1,2,3,4)
)
compar <- data.frame(matrix(nrow=0, ncol=6)) # to contain all combinations
names(compar) <- c("x","y", "ID", "x","y", "ID")
N <- nrow(Sdat) # how many different points we have
for (i in 1:N)
{
for (j in 1:N)
{
Temp1 <- Sdat[i,] # data from 1st point
Temp2 <- Sdat[j,] # data from 2nd point
C <- cbind(Temp1, Temp2)
compar <- rbind(C,compar)
}
}
These loops provide exactly the output that I need for further analysis. Any suggestion for vectorizing this section?
You can do:
ind <- seq_len(nrow(Sdat))
grid <- expand.grid(ind, ind)
compar <- cbind(Sdat[grid[, 1], ], Sdat[grid[, 2], ])
A naive solution using rep (assuming you are happy with a data frame output):
compar <- data.frame(x = rep(Sdat$x, each = N),
y = rep(Sdat$y, each = N),
id = rep(1:n, each = N),
x1 = rep(Sdat$x, N),
y1 = rep(Sdat$y, N),
id_1 = rep(1:n, N))

How to use loop to create multiple columns

I am struggling with creating multiple columns in a DRY way.
I have searched google and stack exchange and I am still struggling with the below.
df <- data.frame(red = 1:10, blue=seq(1,30,3))
myfunction <- function(x){
log(x) + 10
}
df$green <- myfunction(df$red)
df$yellow <- myfunction(df$blue)
My questions are:
how can I create the green and yellow columns using a for loop?
how can I create the green and yellow using an apply function?
I've spent a bit of time working on these kinds of things. Most of the time you're going to want to either know all the names of the new variables or have them work in an orderly pattern so you can just paste together the names with an indexing variable.
df <- data.frame(red = 1:10, blue=seq(1,30,3))
myfunction <- function(x){
log(x) + 10
}
newcols = apply(X = df, MARGIN = 2, FUN = myfunction)
colnames(newcols) = c("greeen","yellow")
df = cbind(df,newcols)
# Alternative
df <- data.frame(red = 1:10, blue=seq(1,30,3))
colors = c("green", "yellow")
for(i in 1:length(colors)){
df[,colors[i]] = myfunction(df[,i])
}
As pointed out by Sotos, apply is slower than lapply. So I believe the optimal solution is:
df[,c("green","yellow")] <- lapply(df, myfunction)

How to apply a function that gets data from differents data frames and has conditions in it?

I have a customized function (psup2) that gets data from a data frame and returns a result. The problem is that it takes a while since I am using a "for" loop that runs for every row and column.
Input:
I have a table that contains the ages (table_costumers), an n*m matrix of different terms, and two different mortality tables (for males and females).
The mortality tables i´m using contains one column for ages and another one for its corresponding survival probabilities.
Output:
I want to create a separate dataframe with the same size as that of the term table. The function will take the data from the different mortality tables (depending on the gender) and then apply the function above (psup2) taking the ages from the table X and the terms from the matrix terms.
Up to now I managed to create a very inefficient way to do this...but hopefully by using one of the functions from the apply family this could get faster.
The following code shows the idea of what I am trying to do:
#Function
psup2 <- function(x, age, term) {
P1 = 1
for (i in 1:term) {
P <- x[age + i, 2]
P1 <- P1*P
}
return(P1)
}
#Inputs
terms <- data.frame(V1 = c(1,2,3), V2 = c(1,3,4), V2 = c(2,3,4))
male<- data.frame(age = c(0,1,2,3,4,5), probability = c(0.9981,0.9979,0.9978,.994,.992,.99))
female <- data.frame(age = c(0,1,2,3,4,5), probability = c(0.9983,0.998,0.9979,.9970,.9964,.9950))
table_customers <- data.frame(id = c(1,2,3), age = c(0,0,0), gender = c(1,2,1))
#Loop
output <- data.frame(matrix(NA, nrow = 3, ncol = 0))
for (i in 1:3) {
for (j in 1:3) {
prob <- ifelse(table_customers[j, 3] == 1,
psup2(male, as.numeric(table_customers[j, 2]), as.numeric(terms[j,i])),
psup2(female, as.numeric(table_customers[j, 2]), as.numeric(terms[j,i])))
output[j, i] <- prob
}
}
your psup function can be simplified into:
psup2 <- function(x, age, term) { prod(x$probability[age+(1:term)]) }
So actually, we won't use it, we'll use the formula directly.
We'll put your male and female df next to each other, so we can use the value of the gender column to choose one or another.
mf <- merge(male,female,by="age") # assuming you have the same ages on both sides
input_df <- cbind(table_customers,terms)
output <- t(apply(input_df,1,function(x){sapply(1:3,function(i){prod(mf[x["age"]+(1:x[3+i]),x["gender"]+1])})}))
And that's it :)
The sapply function is used to loop on the columns of terms.
x["age"]+(1:x[3+i]) are the indices of the rows you want to multiply
x["gender"]+1 is the relevant column of the mf data.frame

Fastest way to apply function to all pairwise combinations of columns

Given a data frame or matrix with arbitrary number of rows and columns, what is the fastest way to apply a function to all pairwise combinations of columns?
For example, if I have a data table:
N <- 3
K <- 3
data <- data.table(id=seq(N))
for(k in seq(K)) {
data[[k]] <- runif(N)
}
And I want to compute the simple difference between all pairs of columns, I could loop (or lapply) over columns:
differences = data.table(foo=seq(N))
for(var1 in names(data)) {
for(var2 in names(data)) {
if (var1==var2) next
if (which(names(data)==var1)>which(names(data)==var2)) next
combo <- paste0(var1, var2)
differences[[combo]] <- data[[var1]]-data[[var2]]
}
}
But as K gets larger, this becomes absurdly slow.
One solution I've considered is to make two new data tables using combn and subtract them:
a <- data[,combn(colnames(data),2)[1,],with=F]
b <- data[,combn(colnames(data),2)[2,],with=F]
differences <- a-b
But as N and K get larger, this becomes very memory intensive (though faster than looping).
It seems to me that the outer product of the matrix with itself is probably the best way to go, but I can't piece it together. This is especially hard if I want to apply an arbitrary function (RMSE for example), instead of just the difference.
What's the fastest way?
If it is necessary to have the data in a matrix first, you can do the following:
library(data.table)
data <- matrix(runif(300*500), nrow = 300, ncol = 500)
data.DT <- setkey(data.table(c(data), colId = rep(1:500, each = 300), rowId = rep(1:300, times = 500)), colId)
diff.DT <- data.DT[
, {
ccl <- unique(colId)
vv <- V1
data.DT[colId > ccl, .(col2 = colId, V1 - vv)]
}
, keyby = .(col1 = colId)
]

Resources