Faster way to fill in missing columns in R data frame

Can any R experts provide a faster way to do the following? My code works, but it takes 1 minute to do a 30,000-[column] by 12-[row] data frame. Thanks!
sync.columns = function(old.data, new.colnames)
{
# Given a data frame and a vector of column names,
# makes a new data frame containing exactly the named
# columns in the specified order; any that were not
# present are filled in as columns of zeroes.
if (length(new.colnames) == ncol(old.data) &&
all(new.colnames == colnames(old.data)))
{
old.data # nothing to do
}
else
{
m = matrix(nrow=nrow(old.data),ncol=length(new.colnames))
for (t in 1:length(new.colnames))
{
if (new.colnames[t] %in% colnames(old.data))
{
m[,t] = old.data[,new.colnames[t]] # copy column
}
else
{
m[,t] = rep(0,nrow(m)) # fill with zeroes
}
}
result = as.data.frame(m)
rownames(result) = rownames(old.data)
colnames(result) = new.colnames
result
}
}
Maybe something with cbind?

This seems rather fast. First create a data.frame full of zeroes, then only replace what you can find in the old data:
sync.columns <- function(old.data, new.colnames) {
M <- nrow(old.data)
N <- length(new.colnames)
rn <- rownames(old.data)
cn <- new.colnames
new.data <- as.data.frame(matrix(0, M, N, dimnames = list(rn, cn)))
keep.col <- intersect(cn, colnames(old.data))
new.data[keep.col] <- old.data[keep.col]
new.data
}
M <- 30000
x <- data.frame(b = runif(M), i = runif(M), z = runif(M))
rownames(x) <- paste0("z", 1:M)
system.time(y <- sync.columns(x, letters[1:12]))
# user system elapsed
# 0.031 0.010 0.043
head(y)
# a b c d e f g h i j k l
# z1 0 0.27994248 0 0 0 0 0 0 0.3785181 0 0 0
# z2 0 0.75291520 0 0 0 0 0 0 0.7414294 0 0 0
# z3 0 0.07036461 0 0 0 0 0 0 0.1543653 0 0 0
# z4 0 0.40748957 0 0 0 0 0 0 0.5564374 0 0 0
# z5 0 0.98769595 0 0 0 0 0 0 0.4277466 0 0 0
# z6 0 0.82117781 0 0 0 0 0 0 0.2034743 0 0 0
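To make the "fill with zeroes, then overwrite what exists" idea easier to check by eye, here is a minimal sketch on a two-row data frame. The names `sync_columns_sketch` and `small` are made up for this illustration; the body follows the same steps as the function above.

```r
# Hypothetical helper mirroring the answer's approach on a tiny input.
sync_columns_sketch <- function(old.data, new.colnames) {
  # Start from an all-zero frame with the requested columns, in order.
  new.data <- as.data.frame(matrix(0, nrow(old.data), length(new.colnames),
                                   dimnames = list(rownames(old.data), new.colnames)))
  # Overwrite only the columns that exist in the old data.
  keep.col <- intersect(new.colnames, colnames(old.data))
  new.data[keep.col] <- old.data[keep.col]
  new.data
}

small <- data.frame(b = c(1, 2), z = c(3, 4), row.names = c("r1", "r2"))
out <- sync_columns_sketch(small, c("a", "b", "c"))
print(out)  # columns a and c are zero-filled, b is copied, z is dropped
```

Because the copy is a single column-wise assignment rather than a per-column loop, the cost of the lookup does not grow with each requested column.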
Edit: following comments with the OP below, here is a matrix version:
sync.columns <- function(old.data, new.colnames) {
M <- nrow(old.data)
N <- length(new.colnames)
rn <- rownames(old.data)
cn <- new.colnames
new.data <- matrix(0, M, N, dimnames = list(rn, cn))
keep.col <- intersect(cn, colnames(old.data))
new.data[, keep.col] <- old.data[, keep.col]
new.data
}
x <- t(as.matrix(x)) # a wide matrix
system.time(y <- sync.columns(x, paste0("z", sample(1:50000, 30000))))
# user system elapsed
# 0.049 0.002 0.051

Related

Better way to adding elements in data frame without looping in R

I want to create a dataframe that calculates the odds ratio with the standard error and confidence intervals in R.
I have a dataset similar to the one like so:
dat <- read.table(header = TRUE, text = "
f1 f2 f3 f4 f5 f6 f7 f8 f9 f10 f11 target
0 0 1 0 0 0 0 0 0 0 0 0
1 1 1 0 0 0 0 0 1 0 0 1
0 0 0 0 0 0 0 0 0 0 0 1
1 0 0 1 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 1 1 1 0 1
1 1 1 1 1 0 0 0 0 0 0 0")
And create a function that calculates everything I need in the dataframe for a particular feature in the data set like so:
get_ci <- function(df, feature) {
tab <- table(df[[feature]], df$target)
a <- tab[1,1]
b <- tab[1,2]
c <- tab[2,1]
d <- tab[2,2]
odds_ratio <- (a/b)/(c/d)
standard_error <- sqrt(1/a + 1/b + 1/c + 1/d)
log_ci_lower <- log(odds_ratio) - 1.96 * standard_error # 95% Wald interval on the log scale
log_ci_upper <- log(odds_ratio) + 1.96 * standard_error
ci_lower <- exp(log_ci_lower)
ci_upper <- exp(log_ci_upper)
df <- data.frame(Feature = feature,
`Odds Ratio` = odds_ratio,
`Standard Error` = standard_error,
`Lower Bound CI` = ci_lower,
`Upper Bound CI` = ci_upper
)
}
I want to create a data frame that contains the odds ratio, standard error, and confidence interval for each feature (f1-f11). What is the most efficient way to do this?
I am currently creating an empty data frame and looping through the features to populate it, but I feel this is not the right way to do it. I looked at the apply functions, but I'm not sure how to use them with the function I created.
I think the first table line in the function should be :
tab <- table(factor(df[[feature]], levels = 0:1), df$target)
otherwise, if a particular column contains all 1's or all 0's, the next lines would break.
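A small sketch of why the `factor(..., levels = 0:1)` wrapper matters. The vectors `x` and `target` below are made-up values for illustration, not the question's data.

```r
x <- c(1, 1, 1)        # a feature column that happens to be all 1's
target <- c(0, 1, 0)

# Without fixed levels the table has only one row, so tab[2, 1] would fail.
dim_plain <- dim(table(x, target))

# With fixed levels the table is always 2 x 2; the missing level gets zero counts.
dim_fixed <- dim(table(factor(x, levels = 0:1), target))

print(dim_plain)
print(dim_fixed)
```

Note that a zero count still leads to a division by zero in the odds ratio, so such features may need separate handling.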
With that change, you can use lapply passing the column names
result <- do.call(rbind, lapply(paste0('f', 1:11), get_ci, df = dat))
Or using purrr's map_df
result <- map_df(paste0('f', 1:11), get_ci, df = dat)
Here's another solution.
get_ci <- function(x, target) {
tab <- table(factor(x, levels=0:1), target) #changed
...
ci_upper <- exp(log_ci_upper)
c(`Odds Ratio` = odds_ratio, # changed
`Standard Error` = standard_error,
`Lower Bound CI` = ci_lower,
`Upper Bound CI` = ci_upper
)
}
as.data.frame(apply(dat[,1:11], 2, function(x) { get_ci(x, dat$target) })) #changed
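For reference, the usual 95% Wald interval for an odds ratio is exp(log(OR) ± 1.96 × SE). The sketch below works through it for one 2x2 table. The function name `wald_or_ci` and the counts are hypothetical, and the cell order follows the question's (a/b)/(c/d) convention.

```r
# n00, n01, n10, n11 are the four cell counts (feature level x target level).
wald_or_ci <- function(n00, n01, n10, n11, z = 1.96) {
  odds_ratio <- (n00 / n01) / (n10 / n11)
  se <- sqrt(1 / n00 + 1 / n01 + 1 / n10 + 1 / n11)
  c(odds_ratio = odds_ratio,
    se         = se,
    lower      = exp(log(odds_ratio) - z * se),
    upper      = exp(log(odds_ratio) + z * se))
}

ci <- wald_or_ci(10, 20, 30, 40)  # made-up counts
print(ci)
```

The lower and upper bounds differ only in the sign in front of `z * se`, and the standard error is multiplied by the critical value rather than added to it.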

Multiple for loop time computation very high in R

I have data about machines in the following form
Number of rows - 900k
Data
A B C D E F G H I J K L M N
---- -- --- ---- --- --- --- --- --- --- --- --- --- ---
1 1 1 1 1 1 1 1 1 1 0 1 1 0 0
2 0 0 0 0 1 1 1 0 1 1 0 0 1 0
3 0 0 0 0 0 0 0 1 1 1 1 1 0 0
1 indicates that the machine was active and 0 indicates that it was inactive.
I want my output to look like
A B C D E F G H I J K L M N
---- -- --- ---- --- --- --- --- --- --- --- --- --- ---
1 1 1 1 1 1 1 1 1 1 1 1 1 0 0
2 0 0 0 0 1 1 1 1 1 1 0 0 1 0
3 0 0 0 0 0 0 0 1 1 1 1 1 0 0
Basically all I am trying to do is look for zeros in a particular row and if that zero is surrounded by ones on either side, replace 0 with 1
example -
in row 1 you have zero in column J
but you also have 1 in column I and K
which means I replace that 0 by 1 because it is surrounded by 1s
The code I am using is this
for(i in 1:nrow(data)) {
  for(j in 2:13) {
    if(data[i,j]==0 && data[i,j-1]==1 && data[i,j+1]==1){
      data[i,j] = 1
    }
  }
}
Is there a way to reduce the time computation for this? This takes me almost 30 mins to run in R. Any help would be appreciated.
This is faster because it does not need to iterate over the rows.
for(j in 2:13) {
data[,j] = ifelse(data[,j-1] * data[,j+1]==1,1,data[,j])
}
Or, slightly more optimized, without using ifelse:
for(j in 2:(ncol(data) - 1)) {
data[data[, j - 1] * data[, j + 1] == 1, j] <- 1
}
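As a quick check, the column-wise loop can be run on the three sample rows from the question. This sketch stores them in a matrix `m` (a choice made here for illustration; the question's object is called `data`).

```r
# The three example rows from the question, columns A..N.
m <- rbind(
  c(1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0),
  c(0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0),
  c(0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0)
)
colnames(m) <- LETTERS[1:14]

# One pass per interior column; each pass handles all rows at once.
for (j in 2:(ncol(m) - 1)) {
  m[m[, j - 1] * m[, j + 1] == 1, j] <- 1
}
print(m)
```

Row 1 gets its zero in column J filled and row 2 its zero in column H, which matches the desired output in the question; row 3 is unchanged.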
You could also use gsub to replace any instances of 101 with 111 using the following code:
collapsed <- gsub('101', '111', apply(df1, 1, paste, collapse = ''))
data <- as_tibble(t(matrix(unlist(sapply(collapsed, strsplit, split = '')), nrow = numLetters)))
names(data) <- LETTERS[1:numLetters]
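One caveat with the gsub approach: matches do not overlap, so a run such as 10101 is only partly filled. A possible workaround, shown here as a sketch rather than as part of the original answer, is a lookaround pattern with `perl = TRUE`, which replaces each zero based on its neighbours without consuming them.

```r
row_string <- "10101"

# Plain pattern: the first "101" is consumed, so the second zero is missed.
plain <- gsub("101", "111", row_string)

# Lookbehind/lookahead: every 0 with a 1 on both sides is replaced.
lookaround <- gsub("(?<=1)0(?=1)", "1", row_string, perl = TRUE)

print(plain)       # "11101"
print(lookaround)  # "11111"
```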
Here's a comparison of everyone's solutions:
library(data.table)
library(rbenchmark)
library(tidyverse)
set.seed(1)
numLetters <- 13
df <- as_tibble(matrix(round(runif(numLetters * 100)), ncol = numLetters))
names(df) <- LETTERS[1:numLetters]
benchmark(
'gsub' = {
data <- df
collapsed <- gsub('101', '111', apply(data, 1, paste, collapse = ''))
data <- as_tibble(t(matrix(unlist(sapply(collapsed, strsplit, split = '')), nrow = numLetters)))
names(data) <- LETTERS[1:numLetters]
},
'for_orig' = {
data <- df
for(i in 1:nrow(data)) {
for(j in 2:(ncol(data) - 1)) {
if(data[i, j] == 0 && data[i, j - 1] == 1 && data[i, j + 1] == 1) {
data[i, j] = 1
}
}
}
},
'for_norows' = {
data <- df
for(j in 2:(ncol(data) - 1)) {
data[, j] = ifelse(data[, j - 1] * data[, j + 1] == 1, 1, data[, j])
}
},
'vectorize' = {
data <- df
for(i in seq(ncol(data) - 2) + 1) {
condition <- data[, i - 1] == data[, i + 1] & data[, i - 1] == 1 & data[, i] == 0
data[which(condition), i] <- 1
}
},
'index' = {
data <- df
idx <- apply(data, 1, function(x) c(0, diff(x)))
data[which(idx == -1 & lead(idx == 1), arr.ind = TRUE)[, 2:1]] <- 1
},
replications = 100
)
The indexing solution (which has since been deleted) wins hands-down in terms of computational time for a data frame with 100 rows and 13 columns.
        test replications elapsed relative user.self sys.self user.child sys.child
3 for_norows          100    1.19    7.438      1.19        0         NA        NA
2   for_orig          100    9.29   58.063      9.27        0         NA        NA
1       gsub          100    0.28    1.750      0.28        0         NA        NA
5      index          100    0.16    1.000      0.16        0         NA        NA
4  vectorize          100    0.87    5.438      0.87        0         NA        NA
Cut the time by using vectorized operations. As you are planning to do the same thing for every row, this can be done by utilizing the vectorized conditional statements.
for(i in seq(ncol(data) - 2) + 1){ #<== all but last and first column
#Find all neighbouring columns that are equal, where the center column is equal to 0
condition <- data[, i - 1] == data[, i + 1] & data[, i - 1] == 1 & data[, i] == 0
#Overwrite only the values that satisfy the condition
data[which(condition), i] <- 1
}
You can avoid loops altogether and use indexing to replace all the values at once:
nc <- ncol(df)
df[, 2:(nc - 1)][df[, 1:(nc - 2)] * df[, 3:nc] == 1] <- 1
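A sketch comparing the one-shot indexing form with the column-by-column loop on random 0/1 data. It uses a plain matrix `m` (an assumption made here; the answer indexes a data frame called `df`).

```r
set.seed(1)
m <- matrix(round(runif(13 * 100)), ncol = 13)

# Column-by-column loop, modifying in place.
loop_fill <- m
for (j in 2:(ncol(loop_fill) - 1)) {
  loop_fill[loop_fill[, j - 1] * loop_fill[, j + 1] == 1, j] <- 1
}

# One-shot version: compare the left and right neighbour blocks of the original.
nc <- ncol(m)
index_fill <- m
index_fill[, 2:(nc - 1)][m[, 1:(nc - 2)] * m[, 3:nc] == 1] <- 1

same <- identical(loop_fill, index_fill)
print(same)
```

The two agree because a zero can only be filled when both neighbours are already 1, so filling one cell never creates a new candidate next to it.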

Output of a function into a data.frame

I have two functions:
c <- function(i, n=2009, t=2000){
vect1 <- vector('numeric', length(t:n))
w <- n-t
for(q in 0:w){
a <- 1/((1+i)^(q+1))
b <- q+t
vect1[q+1] <- a
}
return(vect1)
}
and
p <- function(i, n=2009, t=2000){
w <- n-t
for(q in 0:w){
a <- c(i, n, q+t)
print(a)
}
}
Upon defining both functions and running p(0.10), a table similar to the following is obtained, but it is printed. I need the output in a data.frame so that I can join it with other data.
0.909090909 0 0 0 0 0 0 0 0 0
0.826446281 0.909090909 0 0 0 0 0 0 0 0
0.751314801 0.826446281 0.909090909 0 0 0 0 0 0 0
0.683013455 0.751314801 0.826446281 0.909090909 0 0 0 0 0 0
0.620921323 0.683013455 0.751314801 0.826446281 0.909090909 0 0 0 0 0
0.56447393 0.620921323 0.683013455 0.751314801 0.826446281 0.909090909 0 0
etc.
You could do something like this, which gives you a one-column data frame. I'm not exactly sure what you're trying to do, or to what other data you want to join this. Maybe it will help you get to where you're trying to go:
cFunc <- function(i, n=2009, t=2000){
vect1 <- vector('numeric', length(t:n))
w <- n-t
for(q in 0:w){
a <- 1/((1+i)^(q+1))
b <- q+t
vect1[q+1] <- a
}
return(vect1)
}
p <- function(i, n=2009, t=2000){
w <- n-t
a1 <- list()
for(q in 0:w){
a1[[q + 1]] <- cFunc(i, n, q + t) # list index is q + 1; the start year stays q + t
}
return(a1)
}
listp <- p(0.10)
listp <- lapply(listp, as.data.frame)
listp <- data.table::rbindlist(listp, fill = TRUE)
Result:
> head(listp)
X[[i]]
1: 0.9090909
2: 0.8264463
3: 0.7513148
4: 0.6830135
5: 0.6209213
6: 0.5644739
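If the goal is the triangular table shown in the question, it could also be built directly, without printing or looping. This is only a sketch under that assumption; `k` (the number of periods) and the helper argument names are chosen here for illustration.

```r
i <- 0.10   # interest rate, as in p(0.10)
k <- 10     # number of periods (2000 to 2009)

# Entry (r, s) is the discount factor 1 / (1 + i)^(r - s + 1) on or below
# the diagonal, and 0 above it.
d <- outer(1:k, 1:k, function(r, s) ifelse(s <= r, 1 / (1 + i)^(r - s + 1), 0))

discount_df <- as.data.frame(d)
print(round(discount_df[1:3, 1:4], 7))
```

Because the result is an ordinary data frame, it can be joined or bound to other data with the usual tools.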

Converting counts to individual observations in r

I have a data set that looks as follows
df <- data.frame( name = c("a", "b", "c"),
judgement1= c(5, 0, NA),
judgement2= c(1, 1, NA),
judgement3= c(2, 1, NA))
I want to reshape the dataframe to look like this
# name judgement1 judgement2 judgement3
# a 1 0 0
# a 1 0 0
# a 1 0 0
# a 1 0 0
# a 1 0 0
# b 1 0 0
# b 0 1 0
# b 0 0 1
And so on. I have seen that untable is recommended on some other threads, but it does not appear to work with the current version of R. Is there a package that can convert summarised counts into individual observations?
You could try something like this:
df <- data.frame( name = c("a", "b", "c"),
judgement1= c(5, 0, NA),
judgement2= c(1, 1, NA),
judgement3= c(2, 1, NA))
rep.vec <- colSums(df[colnames(df) %in% paste0("judgement", (1:nrow(df)), sep="")], na.rm = TRUE)
want <- data.frame(name=df$name, cbind(diag(nrow(df))))
colnames(want)[-1] <- paste0("judgement", (1:nrow(df)), sep="")
(want <- want[rep(1:nrow(want), rep.vec), ])
I wrote a function that works to give you your desired output:
untabl <- function(df, id.col, count.cols) {
df[is.na(df)] <- 0 # replace NAs
out <- lapply(count.cols, function(x) { # for each column with counts
z <- df[rep(1:nrow(df), df[,x]), ] # replicate rows
z[, -c(id.col)] <- 0 # set all other columns to zero
z[, x] <- 1 # replace the count values with 1
z
})
out <- do.call(rbind, out) # combine the list
out <- out[order(out[,c(id.col)]),] # reorder (you can change this)
rownames(out) <- NULL # return to simple row numbers
out
}
untabl(df = df, id.col = 1, count.cols = c(2,3,4))
# name judgement1 judgement2 judgement3
#1 a 1 0 0
#2 a 1 0 0
#3 a 1 0 0
#4 a 1 0 0
#5 a 1 0 0
#6 a 0 1 0
#7 a 0 0 1
#8 a 0 0 1
#9 b 0 1 0
#10 b 0 0 1
And for your reference, reshape::untable consists of the following code:
function (df, num)
{
df[rep(1:nrow(df), num), ]
}
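The same expansion can be sketched in base R without replicating whole rows per column: find every (row, column) cell with a positive count, repeat it that many times, and mark a 1 in the matching indicator column. The variable names below are made up for this illustration.

```r
df <- data.frame(name = c("a", "b", "c"),
                 judgement1 = c(5, 0, NA),
                 judgement2 = c(1, 1, NA),
                 judgement3 = c(2, 1, NA))

counts <- as.matrix(df[-1])
counts[is.na(counts)] <- 0                  # treat missing counts as zero

idx   <- which(counts > 0, arr.ind = TRUE)  # cells with a positive count
times <- counts[idx]                        # how often each cell is repeated
rows  <- rep(unname(idx[, "row"]), times)
cols  <- rep(unname(idx[, "col"]), times)

# One output row per observation, with a single 1 in the observed column.
ind <- matrix(0, length(rows), ncol(counts),
              dimnames = list(NULL, colnames(counts)))
ind[cbind(seq_along(rows), cols)] <- 1

out <- data.frame(name = df$name[rows], ind)
out <- out[order(out$name), ]
rownames(out) <- NULL
print(out)
```

The column totals of the result equal the original counts, which is a convenient sanity check.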

Split a string column into several dummy variables

As a relatively inexperienced user of the data.table package in R, I've been trying to process one text column into a large number of indicator columns (dummy variables), with a 1 in each column indicating that a particular sub-string was found within the string column. For example, I want to process this:
ID String
1 a$b
2 b$c
3 c
into this:
ID String a b c
1 a$b 1 1 0
2 b$c 0 1 1
3 c 0 0 1
I have figured out how to do the processing, but it takes longer to run than I would like, and I suspect that my code is inefficient. A reproducible version of my code with dummy data is below. Note that in the real data, there are over 2000 substrings to search for, each substring is roughly 30 characters long, and there may be up to a few million rows. If need be, I can parallelize and throw lots of resources at the problem, but I want to optimize the code as much as possible. I have tried running Rprof, which suggested no obvious (to me) improvements.
library(data.table)
set.seed(10)
elements_list <- c(outer(letters, letters, FUN = paste, sep = ""))
random_string <- function(min_length, max_length, separator) {
selection <- paste(sample(elements_list, ceiling(runif(1, min_length, max_length))), collapse = separator)
return(selection)
}
dt <- data.table(id = c(1:1000), messy_string = "")
dt[ , messy_string := random_string(2, 5, "$"), by = id]
create_indicators <- function(search_list, searched_string) {
y <- rep(0, length(search_list))
for(j in 1:length(search_list)) {
x <- regexpr(search_list[j], searched_string)
x <- x[1]
y[j] <- ifelse(x > 0, 1, 0)
}
return(y)
}
timer <- proc.time()
indicators <- matrix(0, nrow = nrow(dt), ncol = length(elements_list))
for(n in 1:nrow(dt)) {
indicators[n, ] <- dt[n, create_indicators(elements_list, messy_string)]
}
indicators <- data.table(indicators)
setnames(indicators, elements_list)
dt <- cbind(dt, indicators)
proc.time() - timer
user system elapsed
13.17 0.08 13.29
EDIT
Thanks for the great responses--all much superior to my method. The results of some speed tests below, with slight modifications to each function to use 0L and 1L in my own code, to store the results in separate tables by method, and to standardize the ordering. These are elapsed times from single speed tests (rather than medians from many tests), but the larger runs each take a long time.
Number of rows in dt     2K    10K    50K    250K     1M
OP                     28.6  149.2  717.0
eddi                    5.1   24.6  144.8  1950.3
RS                      1.8    6.7   29.7   171.9  702.5
Original GT             1.4    7.4   57.5   809.4
Modified GT             0.7    3.9   18.1   115.2  473.9
GT4                     0.1    0.4   2.26    16.9   86.9
Pretty clearly, the modified version of GeekTrader's approach is best. I'm still a bit vague on what each step is doing, but I can go over that at my leisure. Although somewhat out of bounds of the original question, if anyone wants to explain what GeekTrader and Ricardo Saporta's methods are doing more efficiently, it would be appreciated both by me and probably by anyone who visits this page in the future. I'm particularly interested to understand why some methods scale better than others.
*****EDIT # 2*****
I tried to edit GeekTrader's answer with this comment, but that seems not to work. I made two very minor modifications to the GT3 function, to a) order the columns, which adds a small amount of time, and b) replace 0 and 1 with 0L and 1L, which speeds things up a bit. Call the resulting function GT4. Table above edited to add times for GT4 at different table sizes. Clearly the winner by a mile, and it has the added advantage of being intuitive.
UPDATE : VERSION 3
Found an even faster way. This function is also highly memory efficient.
The primary reason the previous function was slow is the copying/assignment happening inside the lapply loop, as well as the rbind-ing of the result.
In the following version, we preallocate a matrix of the appropriate size and then change the values at the appropriate coordinates, which makes it very fast compared to the other looping versions.
funcGT3 <- function() {
#Get list of column names in result
resCol <- unique(dt[, unlist(strsplit(messy_string, split="\\$"))])
#Get dimension of result
nresCol <- length(resCol)
nresRow <- nrow(dt)
#Create empty matrix with dimensions same as desired result
mat <- matrix(rep(0, nresRow * nresCol), nrow = nresRow, dimnames = list(as.character(1:nresRow), resCol))
#split each messy_string by $
ll <- strsplit(dt[,messy_string], split="\\$")
#Get coordinates of mat which we need to set to 1
coords <- do.call(rbind, lapply(1:length(ll), function(i) cbind(rep(i, length(ll[[i]])), ll[[i]] )))
#Set mat to 1 at appropriate coordinates
mat[coords] <- 1
#Bind the mat to original data.table
return(cbind(dt, mat))
}
result <- funcGT3() #result for 1000 rows in dt
result
ID messy_string zn tc sv db yx st ze qs wq oe cv ut is kh kk im le qg rq po wd kc un ft ye if zl zt wy et rg iu
1: 1 zn$tc$sv$db$yx 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2: 2 st$ze$qs$wq 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
3: 3 oe$cv$ut$is 0 0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
4: 4 kh$kk$im$le$qg 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
5: 5 rq$po$wd$kc 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0
---
996: 996 rp$cr$tb$sa 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
997: 997 cz$wy$rj$he 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0
998: 998 cl$rr$bm 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
999: 999 sx$hq$zy$zd 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1000: 1000 bw$cw$pw$rq 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0
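The preallocate-then-index idea can be seen on the three-row example from the question. This sketch is a variation, not GeekTrader's exact code: it uses integer (row, column) coordinates via `match()` instead of character coordinates, and the variable names are chosen here for illustration.

```r
strings <- c("a$b", "b$c", "c")

# Split once, up front; fixed = TRUE treats "$" as a literal character.
parts <- strsplit(strings, "$", fixed = TRUE)
cols  <- sort(unique(unlist(parts)))

# Preallocate the full indicator matrix with zeros.
mat <- matrix(0L, nrow = length(strings), ncol = length(cols),
              dimnames = list(NULL, cols))

# Build (row, column) coordinates and set all the 1's in a single assignment.
rows <- rep(seq_along(parts), lengths(parts))
mat[cbind(rows, match(unlist(parts), cols))] <- 1L
print(mat)
```

There is no per-row binding or copying here, only one split and one vectorised assignment, which is the likely reason this style scales well.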
Benchmark against version 2 suggested by Ricardo (this is for 250K rows in data):
Unit: seconds
expr min lq median uq max neval
GT2 104.68672 104.68672 104.68672 104.68672 104.68672 1
GT3 15.15321 15.15321 15.15321 15.15321 15.15321 1
VERSION 1
Following is version 1 of suggested answer
set.seed(10)
elements_list <- c(outer(letters, letters, FUN = paste, sep = ""))
random_string <- function(min_length, max_length, separator) {
selection <- paste(sample(elements_list, ceiling(runif(1, min_length, max_length))), collapse = separator)
return(selection)
}
dt <- data.table(ID = c(1:1000), messy_string = "")
dt[ , messy_string := random_string(2, 5, "$"), by = ID]
myFunc <- function() {
ll <- strsplit(dt[,messy_string], split="\\$")
COLS <- do.call(rbind,
lapply(1:length(ll),
function(i) {
data.frame(
ID= rep(i, length(ll[[i]])),
COL = ll[[i]],
VAL= rep(1, length(ll[[i]]))
)
}
)
)
res <- as.data.table(tapply(COLS$VAL, list(COLS$ID, COLS$COL), FUN = length ))
dt <- cbind(dt, res)
for (j in names(dt))
set(dt,which(is.na(dt[[j]])),j,0)
return(dt)
}
create_indicators <- function(search_list, searched_string) {
y <- rep(0, length(search_list))
for(j in 1:length(search_list)) {
x <- regexpr(search_list[j], searched_string)
x <- x[1]
y[j] <- ifelse(x > 0, 1, 0)
}
return(y)
}
OPFunc <- function() {
indicators <- matrix(0, nrow = nrow(dt), ncol = length(elements_list))
for(n in 1:nrow(dt)) {
indicators[n, ] <- dt[n, create_indicators(elements_list, messy_string)]
}
indicators <- data.table(indicators)
setnames(indicators, elements_list)
dt <- cbind(dt, indicators)
return(dt)
}
library(plyr)
plyrFunc <- function() {
indicators = do.call(rbind.fill, sapply(1:dim(dt)[1], function(i)
dt[i,
data.frame(t(as.matrix(table(strsplit(messy_string,
split = "\\$")))))
]))
dt = cbind(dt, indicators)
#dt[is.na(dt)] = 0 #THIS DOESN'T WORK. USING FOLLOWING INSTEAD
for (j in names(dt))
set(dt,which(is.na(dt[[j]])),j,0)
return(dt)
}
BENCHMARK
system.time(res <- myFunc())
## user system elapsed
## 1.01 0.00 1.01
system.time(res2 <- OPFunc())
## user system elapsed
## 21.58 0.00 21.61
system.time(res3 <- plyrFunc())
## user system elapsed
## 1.81 0.00 1.81
VERSION 2 : Suggested by Ricardo
I'm posting this here instead of in my answer as the framework is really @GeekTrader's -Rick_
myFunc.modified <- function() {
ll <- strsplit(dt[,messy_string], split="\\$")
## MODIFICATIONS:
# using `rbindlist` instead of `do.call(rbind.. )`
COLS <- rbindlist( lapply(1:length(ll),
function(i) {
data.frame(
ID= rep(i, length(ll[[i]])),
COL = ll[[i]],
VAL= rep(1, length(ll[[i]])),
# MODIFICATION: Not coercing to factors
stringsAsFactors = FALSE
)
}
)
)
# MODIFICATION: Preserve as matrix, the output of tapply
res2 <- tapply(COLS$VAL, list(COLS$ID, COLS$COL), FUN = length )
# FLATTEN into a data.table
resdt <- data.table(r=c(res2))
# FIND & REPLACE NA's of single column
resdt[is.na(r), r:=0L]
# cbind with dt, a matrix, with the same attributes as `res2`
cbind(dt,
matrix(resdt[[1]], ncol=ncol(res2), byrow=FALSE, dimnames=dimnames(res2)))
}
### Benchmarks:
orig = quote({dt <- copy(masterDT); myFunc()})
modified = quote({dt <- copy(masterDT); myFunc.modified()})
microbenchmark(Modified = eval(modified), Orig = eval(orig), times=20L)
# Unit: milliseconds
# expr min lq median uq max
# 1 Modified 895.025 971.0117 1011.216 1189.599 2476.972
# 2 Orig 1953.638 2009.1838 2106.412 2230.326 2356.802
# split the `messy_string` and create a long table, keeping track of the id
DT2 <- setkey(DT[, list(val=unlist(strsplit(messy_string, "\\$"))), by=list(ID, messy_string)], "val")
# add the columns, initialize to 0
DT2[, c(elements_list) := 0L]
# warning expected, re: adding a large amount of columns
# iterate over each value in elements_list, assigning 1's as appropriate
for (el in elements_list)
DT2[el, c(el) := 1L]
# sum by ID
DT2[, lapply(.SD, sum), by=list(ID, messy_string), .SDcols=elements_list]
Note that we are carrying along the messy_string column since it is cheaper than leaving it behind and then joining on ID to get it back.
If you don't need it in the final output, just delete it above.
Benchmarks:
Creating the sample data:
# sample data, using OP's example
set.seed(10)
N <- 1e6 # number of rows
elements_list <- c(outer(letters, letters, FUN = paste, sep = ""))
messy_string_vec <- random_string_fast(N, 2, 5, "$") # Create the messy strings in a single shot.
masterDT <- data.table(ID = c(1:N), messy_string = messy_string_vec, key="ID") # create the data.table
Side Note
It is significantly faster to create the random strings all at once and assign the results as a single column
than to call the function N times and assign each, one by one.
# Faster way to create the `messy_string` 's
random_string_fast <- function(N, min_length, max_length, separator) {
ints <- seq(from=min_length, to=max_length)
replicate(N, paste(sample(elements_list, sample(ints, 1)), collapse=separator)) # draw one length per string
}
Comparing Four Methods:
this answer -- "DT.RS"
@eddi's answer -- "Plyr.eddi"
@GeekTrader's answer -- "DT.GT"
@GeekTrader's answer with some modifications -- "DT.GT_Mod"
Here is the setup:
library(data.table); library(plyr); library(microbenchmark)
# data.table method - RS
usingDT.RS <- quote({DT <- copy(masterDT);
DT2 <- setkey(DT[, list(val=unlist(strsplit(messy_string, "\\$"))), by=list(ID, messy_string)], "val"); DT2[, c(elements_list) := 0L]
for (el in elements_list) DT2[el, c(el) := 1L]; DT2[, lapply(.SD, sum), by=list(ID, messy_string), .SDcols=elements_list]})
# data.table method - GeekTrader
usingDT.GT <- quote({dt <- copy(masterDT); myFunc()})
# data.table method - GeekTrader, modified by RS
usingDT.GT_Mod <- quote({dt <- copy(masterDT); myFunc.modified()})
# plyr method from below
usingPlyr.eddi <- quote({dt <- copy(masterDT); indicators = do.call(rbind.fill, sapply(1:dim(dt)[1], function(i) dt[i, data.frame(t(as.matrix(table(strsplit(messy_string, split = "\\$"))))) ]));
dt = cbind(dt, indicators); dt[is.na(dt)] = 0; dt })
Here are the benchmark results:
microbenchmark( usingDT.RS=eval(usingDT.RS), usingDT.GT=eval(usingDT.GT), usingDT.GT_Mod=eval(usingDT.GT_Mod), usingPlyr.eddi=eval(usingPlyr.eddi), times=5L)
On smaller data:
N = 600
Unit: milliseconds
expr min lq median uq max
1 usingDT.GT 1189.7549 1198.1481 1200.6731 1202.0972 1203.3683
2 usingDT.GT_Mod 581.7003 591.5219 625.7251 630.8144 650.6701
3 usingDT.RS 2586.0074 2602.7917 2637.5281 2819.9589 3517.4654
4 usingPlyr.eddi 2072.4093 2127.4891 2225.5588 2242.8481 2349.6086
N = 1,000
Unit: seconds
expr min lq median uq max
1 usingDT.GT 1.941012 2.053190 2.196100 2.472543 3.096096
2 usingDT.RS 3.107938 3.344764 3.903529 4.010292 4.724700
3 usingPlyr 3.297803 3.435105 3.625319 3.812862 4.118307
N = 2,500
Unit: seconds
expr min lq median uq max
1 usingDT.GT 4.711010 5.210061 5.291999 5.307689 7.118794
2 usingDT.GT_Mod 2.037558 2.092953 2.608662 2.638984 3.616596
3 usingDT.RS 5.253509 5.334890 6.474915 6.740323 7.275444
4 usingPlyr.eddi 7.842623 8.612201 9.142636 9.420615 11.102888
N = 5,000
expr min lq median uq max
1 usingDT.GT 8.900226 9.058337 9.233387 9.622531 10.839409
2 usingDT.GT_Mod 4.112934 4.293426 4.460745 4.584133 6.128176
3 usingDT.RS 8.076821 8.097081 8.404799 8.800878 9.580892
4 usingPlyr.eddi 13.260828 14.297614 14.523016 14.657193 16.698229
# dropping the slower two from the tests:
microbenchmark( usingDT.RS=eval(usingDT.RS), usingDT.GT=eval(usingDT.GT), usingDT.GT_Mod=eval(usingDT.GT_Mod), times=6L)
N = 10,000
Unit: seconds
expr min lq median uq max
1 usingDT.GT_Mod 8.426744 8.739659 8.750604 9.118382 9.848153
2 usingDT.RS 15.260702 15.564495 15.742855 16.024293 16.249556
N = 25,000
... (still running)
-----------------
Functions Used in benchmarking:
# original random string function
random_string <- function(min_length, max_length, separator) {
selection <- paste(sample(elements_list, ceiling(runif(1, min_length, max_length))), collapse = separator)
return(selection)
}
# GeekTrader's function
myFunc <- function() {
ll <- strsplit(dt[,messy_string], split="\\$")
COLS <- do.call(rbind,
lapply(1:length(ll),
function(i) {
data.frame(
ID= rep(i, length(ll[[i]])),
COL = ll[[i]],
VAL= rep(1, length(ll[[i]]))
)
}
)
)
res <- as.data.table(tapply(COLS$VAL, list(COLS$ID, COLS$COL), FUN = length ))
dt <- cbind(dt, res)
for (j in names(dt))
set(dt,which(is.na(dt[[j]])),j,0)
return(dt)
}
# Improvements to @GeekTrader's `myFunc` -RS
myFunc.modified <- function() {
ll <- strsplit(dt[,messy_string], split="\\$")
## MODIFICATIONS:
# using `rbindlist` instead of `do.call(rbind.. )`
COLS <- rbindlist( lapply(1:length(ll),
function(i) {
data.frame(
ID= rep(i, length(ll[[i]])),
COL = ll[[i]],
VAL= rep(1, length(ll[[i]])),
# MODIFICATION: Not coercing to factors
stringsAsFactors = FALSE
)
}
)
)
# MODIFICATION: Preserve as matrix, the output of tapply
res2 <- tapply(COLS$VAL, list(COLS$ID, COLS$COL), FUN = length )
# FLATTEN into a data.table
resdt <- data.table(r=c(res2))
# FIND & REPLACE NA's of single column
resdt[is.na(r), r:=0L]
# cbind with dt, a matrix, with the same attributes as `res2`
cbind(dt,
matrix(resdt[[1]], ncol=ncol(res2), byrow=FALSE, dimnames=dimnames(res2)))
}
### Benchmarks comparing the two versions of GeekTrader's function:
orig = quote({dt <- copy(masterDT); myFunc()})
modified = quote({dt <- copy(masterDT); myFunc.modified()})
microbenchmark(Modified = eval(modified), Orig = eval(orig), times=20L)
# Unit: milliseconds
# expr min lq median uq max
# 1 Modified 895.025 971.0117 1011.216 1189.599 2476.972
# 2 Orig 1953.638 2009.1838 2106.412 2230.326 2356.802
Here's a somewhat newer approach, using cSplit_e() from the splitstackshape package.
library(splitstackshape)
cSplit_e(dt, split.col = "String", sep = "$", type = "character",
mode = "binary", fixed = TRUE, fill = 0)
# ID String String_a String_b String_c
#1 1 a$b 1 1 0
#2 2 b$c 0 1 1
#3 3 c 0 0 1
Here's a ~10x faster version using rbind.fill.
library(plyr)
indicators = do.call(rbind.fill, sapply(1:dim(dt)[1], function(i)
dt[i,
data.frame(t(as.matrix(table(strsplit(messy_string,
split = "\\$")))))
]))
dt = cbind(dt, indicators)
# dt[is.na(dt)] = 0
# faster NA replace (thanks geektrader)
for (j in names(dt))
set(dt, which(is.na(dt[[j]])), j, 0L)
Here is an approach using rapply and table.
I'm sure there would be a slightly faster approach than using table here, but it is still slightly faster than myFunc.modified from @Ricardo's answer.
# a copy with enough column pointers available
dtr <- alloc.col(copy(dt) ,1000L)
rapplyFun <- function(){
ll <- strsplit(dtr[, messy_string], '\\$')
Vals <- rapply(ll, classes = 'character', f= table, how = 'replace')
Names <- unique(rapply(Vals, names))
dtr[, (Names) := 0L]
for(ii in seq_along(Vals)){
for(jj in names(Vals[[ii]])){
set(dtr, i = ii, j = jj, value =Vals[[ii]][jj])
}
}
}
microbenchmark(myFunc.modified(), rapplyFun(),times=5)
Unit: milliseconds
# expr min lq median uq max neval
# myFunc.modified() 395.1719 396.8706 399.3218 400.6353 401.1700 5
# rapplyFun() 308.9103 309.5763 309.9368 310.2971 310.3463 5
Here's another solution, that constructs a sparse matrix object instead of what you have. This shaves off a lot of time AND memory.
It produces ordered results, and even with conversion to data.table it's faster than GT3 with 0L and 1L and without reordering. (This could be because I use a different method for arriving at the required coordinates; I didn't go through the GT3 algorithm.) If you don't convert and keep it as a sparse matrix, it's about 10-20x faster than GT3 and has a much smaller memory footprint.
library(Matrix)
strings = strsplit(dt$messy_string, split = "$", fixed = TRUE)
element.map = data.table(el = elements_list, n = seq_along(elements_list), key = "el")
tmp = data.table(n = seq_along(strings), each = unlist(lapply(strings, length)))
rows = tmp[, rep(n, each = each), by = n][, V1]
cols = element.map[J(unlist(strings))][,n]
dt.sparse = sparseMatrix(rows, cols, x = 1,
dims = c(max(rows), length(elements_list)))
# optional, should be avoided until absolutely necessary
dt = cbind(dt, as.data.table(as.matrix(dt.sparse)))
setnames(dt, c('id', 'messy_string', elements_list))
The idea is to split to strings, then use a data.table as a map object to map each substring to its correct column position. From there on it's just a matter of figuring out the rows correctly and filling in the matrix.
