Identify and Convert to Numeric/Integer - r

I have a situation where I need to look at character data and convert it to numeric or integer. I need to perform this operation on a data.table, and it needs to be so fast that I don't notice it when working with a data.table that has ~1000 columns and 1e6 rows. There is a lot of missing or sparse data, which is a confounding element.
fread from the data.table package does this incredibly quickly, and is well tested, when reading from a csv file (among other options).
Is there a way to apply the column identification used in fread to an existing data.table?
Otherwise, here's the approach I was considering (which is still too slow):
Dummy Data:
library(data.table)
size <- 1e6
resample <- function(x, size = 1e6) sample(x, size, replace = TRUE)
text <- c("Canada", "Peru", "Australia",
          "Angola", "France", "", NA_character_)
text2 <- c("Oh Canada.", "Arriba Peru.",
           "Australia?", "Vive la France.")
numerics <- rnorm(1e6)
dt <- data.table(
  id = as.character(1:1e6),
  i1 = resample(c(as.character(c(0:5, NA)), "")), # sometimes just blank
  i2 = resample(c(as.character(c(100:500, NA)))),
  n1 = as.character(round(rnorm(1e6), 3)),
  t1 = resample(text),
  t2 = resample(text2)
)
str(dt)
My approach so far is to use grep to test the columns for alphabetic characters and for a literal ., and then to write a short function that applies the as.* conversion identified for each column.
decide <- data.frame(
  vars = names(dt),
  character = unlist(lapply(dt, function(x) length(grep("[a-z]", x)))),
  numeric = unlist(lapply(dt, function(x) length(grep("[.]", x))))
)

what_is_it <- function(character, numeric) {
  if (character == 0 & numeric == 0) {
    return("as.integer")
  }
  if (character > 0) {
    return("as.character")
  }
  if (numeric > 0 & character == 0) {
    return("as.numeric")
  }
}

decide$fun <- apply(decide[-1], 1, function(x) what_is_it(x[1], x[2]))

for (var in decide$vars) {
  fun <- get(decide$fun[decide$vars == var])
  dt[, (var) := fun(get(var))]
  dt[]
}
system.time(source("https://gist.githubusercontent.com/1beb/183511b51d615751860204344a02c799/raw/91fcee73f24596ac6bdec00edaad944b5b1b7713/quick_convert.R"))
Running at about 3.5 seconds on my machine, but for only 7 columns.

As pointed out by user20650, the answer is type.convert().
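A minimal sketch of what that could look like on the dummy dt above (applied before the conversion loop), assuming base R's utils::type.convert(); as.is = TRUE keeps text columns as character rather than converting them to factor:
dt[, names(dt) := lapply(.SD, type.convert, as.is = TRUE)]
str(dt)  # id/i1/i2 should come back as integer, n1 as numeric, t1/t2 as character; blanks should become NA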

Related

Quickly apply & operation to pairs of columns in R

Let’s say I have two large data.tables and need to combine their columns pairwise using the & operation. The combinations are dictated by grid (combine dt1 column1 with dt2 column2, etc.)
Right now I'm using a mclapply loop and the script takes hours when I run the full dataset. I tried converting the data to a matrix and using a vectorized approach but that took even longer. Is there a faster and/or more elegant way to do this?
mx1 <- replicate(10, sample(c(T,F), size = 1e6, replace = T)) # 1e6 rows x 10 columns
mx1 <- as.data.table(mx1)
colnames(mx1) <- LETTERS[1:10]
mx2 <- replicate(10, sample(c(T,F), size = 1e6, replace = T)) # 1e6 rows x 10 columns
mx2 <- as.data.table(mx2)
colnames(mx2) <- letters[1:10]
grid <- expand.grid(col1 = colnames(mx1), col2 = colnames(mx2)) # the combinations I want to evaluate
out <- new_layer <- mapply(grid$col1, grid$col2, FUN = function(col1, col2) { # <--- mclapply loop
mx1[[col1]] & mx2[[col2]]
}, SIMPLIFY = F)
setDT(out) # convert output into data table
colnames(out) <- paste(grid$col1, grid$col2, sep = "_")
For context, this data is from a gene expression matrix where 1 row = 1 cell
This can be done directly, with no mapply; just ensure that the with argument is FALSE, i.e.:
mx1[, grid$col1, with = FALSE] & mx2[, grid$col2, with=FALSE]
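If the combined columns should also carry the grid-based names, a small follow-up sketch (converting grid's factor columns to character first, just to be safe when selecting with with = FALSE):
cols1 <- as.character(grid$col1)
cols2 <- as.character(grid$col2)
out <- as.data.table(mx1[, cols1, with = FALSE] & mx2[, cols2, with = FALSE])
setnames(out, paste(cols1, cols2, sep = "_"))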
After some digging around I found a package called bit that is specifically designed for fast boolean operations. Converting each column of my data.table from logical to bit gave me a 100-fold increase in compute speed.
# Load libraries.
library(data.table)
library(bit)
# Create data set.
mx1 <- replicate(10, sample(c(T,F), size = 5e6, replace = T)) # 5e6 rows x 10 columns
colnames(mx1) <- LETTERS[1:10]
mx2 <- replicate(10, sample(c(T,F), size = 5e6, replace = T)) # 5e6 rows x 10 columns
colnames(mx2) <- letters[1:10]
grid <- expand.grid(col1 = colnames(mx1), col2 = colnames(mx2)) # combinations I want to evaluate
# Single operation with logical matrix.
system.time({
out <- mx1[, grid$col1] & mx2[, grid$col2]
}) # 26.014s
# Loop with logical matrix.
system.time({
out <- mapply(grid$col1, grid$col2, FUN = function(col1, col2) {
mx1[, col1] & mx2[, col2]
})
}) # 31.914s
# Single operation with logical data.table.
mx1.dt <- as.data.table(mx1)
mx2.dt <- as.data.table(mx2)
system.time({
out <- mx1.dt[, grid$col1, with = F] & mx2.dt[, grid$col2, with = F]
}) # 32.349s
# Loop with logical data.table.
system.time({
out <- mapply(grid$col1, grid$col2, FUN = function(col1, col2) {
mx1.dt[[col1]] & mx2.dt[[col2]]
})
}) # 15.031s <---- SECOND FASTEST TIME, ~2X IMPROVEMENT
# Loop with bit data.table.
mx1.bit <- mx1.dt[, lapply(.SD, as.bit)]
mx2.bit <- mx2.dt[, lapply(.SD, as.bit)]
system.time({
out <- mapply(grid$col1, grid$col2, FUN = function(col1, col2) {
mx1.bit[[col1]] & mx2.bit[[col2]]
})
}) # 0.383s <---- FASTEST TIME, ~100X IMPROVEMENT
# Convert back to logical table.
out <- setDT(out)
colnames(out) <- paste(grid$col1, grid$col2, sep = "_")
out <- out[, lapply(.SD, as.logical)]
There are also special functions like sum.bit and ri that you can use to aggregate data without converting it back to logical.
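For example, a small sketch of aggregating directly on the bit vectors (columns A and a taken from the example above); sum() dispatches to the bit method here:
sum(mx1.bit[["A"]] & mx2.bit[["a"]])  # count of rows where both columns are TRUE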

Optimizing processing time in the nested for loops - R

I have two datasets with 24k and 15k rows. I used nested for loops in order to rewrite some data... however, it takes forever to compute.
Does anyone have a suggestion for how to optimize the code and speed up the process?
my code:
for(i in 1:length(data$kolicina)){
for(j in 1:length(df$kolicina)){
if(data$LIXcode[i] == df$LIXcode[j]){
data$kolicina[i] <- df$kolicina[j]
}
}
}
The full code with the input looks like this:
df <- data[grepl("Trennscheiben", data$a_naziv) & data$SestavKolicina > 1,]
for(i in 1:length(df$kolicina)){
df$kolicina[i] <- df$kolicina[i] / 10
}
for(i in 1:length(data$kolicina)){
for(j in 1:length(df$kolicina)){
if(data$LIXcode[i] == df$LIXcode[j]){
data$kolicina[i] <- df$kolicina[j]
}
}
}
the data:
LIXcode        a_naziv                  RacunCenaNaEM  kolicina
LIX2017396957  MINI HVLP Spritzpistole  20,16          1
LIX2017396957  MINI HVLP Spritzpistole  20,16          1
LIX2017396963  Trennscheiben Ø115 Ø12   12,53          30
LIX2017396963  Trennscheiben Ø115 Ø12   12,53          1
I haven't tried this on my own machine, but something along these lines should work:
fun <- function(x, y) {
  idx <- x$LIXcode %in% y$LIXcode
  x$kolicina[idx] <- y$kolicina[match(x$LIXcode[idx], y$LIXcode)]
  x
}
data <- fun(data, df)
R has the capability to do all of these comparisons at once, in a vectorised fashion.
As far as I understand, the question concerns table "dt1" with key column "a" and any number of value columns and any number of observations. And then we have a "dt2" that has some sort of mapping - which means that column "a" has unique values and some column "b" has values that need to be written into "dt1" where columns "a" match.
I would suggest joining tables:
require(data.table)
dt1 <- data.table(a = sample(1:10, 1000, replace = T),
b = sample(letters, 1000, replace = T))
dt2 <- data.table(a = 1:10,
b = letters[1:10])
output <- merge(dt1, dt2, by = "a", all.x = T)
Also you can try:
dt1[, new_value := dt2$b[match(a, dt2$a)]]
Both of these solutions are vectorized, therefore almost instant.
A base solution (no data.table syntax, although I'd highly recommend you learn it):
dt1$new_value <- dt2$b[match(dt1$a, dt2$a)]
And that's if I understood the question correctly...
Here's a working solution to accommodate for expected output:
dt1[a %in% dt2$a, b:=dt2$b[match(a, dt2$a)]]
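Translated to the original question's columns (LIXcode and kolicina, with df assumed to hold the corrected Trennscheiben rows), an update join is one possible way to write it:
setDT(data); setDT(df)
data[df, on = "LIXcode", kolicina := i.kolicina]  # overwrite kolicina where LIXcode matches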

R speed up the for loop using apply() or lapply() or etc

I wrote a special "impute" function that replaces the values in columns that have missing (NA) values with either the mean or the mode, based on the specific column name.
The input dataframe is 400,000+ rows and it's very slow; how can I speed up the imputation part using lapply() or apply()?
Here is the function; the section I want optimized is marked with START OPTIMIZE & END OPTIMIZE:
specialImpute <- function(inputDF)
{
discoveredDf <- data.frame(STUDYID_SUBJID=character(), stringsAsFactors=FALSE)
dfList <- list()
counter = 1;
Whilecounter = nrow(inputDF)
#for testing just do 10 iterations,i = 10;
while (Whilecounter >0)
{
studyid_subjid=inputDF[Whilecounter,"STUDYID_SUBJID"]
vect = which(discoveredDf$STUDYID_SUBJID == studyid_subjid)
#was discovered and subset before
if (!is.null(vect))
{
#not subset before
if (length(vect)<1)
{
#subset the dataframe base on regex inputDF$STUDYID_SUBJID
df <- subset(inputDF, regexpr(studyid_subjid, inputDF$STUDYID_SUBJID) > 0)
#START OPTIMIZE
for (i in nrow(df))
{
#impute , add column mean & add to list
#apply(df[,c("y1","y2","y3","etc..")],2,function(x){x[is.na(x)] =mean(x, na.rm=TRUE)})
if (is.na(df[i,"y1"])) {df[i,"y1"] = mean(df[,"y1"], na.rm = TRUE)}
if (is.na(df[i,"y2"])) {df[i,"y2"] =mean(df[,"y2"], na.rm = TRUE)}
if (is.na(df[i,"y3"])) {df[i,"y3"] =mean(df[,"y3"], na.rm = TRUE)}
#impute using mean for CONTINUOUS variables
if (is.na(df[i,"COVAR_CONTINUOUS_2"])) {df[i,"COVAR_CONTINUOUS_2"] =mean(df[,"COVAR_CONTINUOUS_2"], na.rm = TRUE)}
if (is.na(df[i,"COVAR_CONTINUOUS_3"])) {df[i,"COVAR_CONTINUOUS_3"] =mean(df[,"COVAR_CONTINUOUS_3"], na.rm = TRUE)}
if (is.na(df[i,"COVAR_CONTINUOUS_4"])) {df[i,"COVAR_CONTINUOUS_4"] =mean(df[,"COVAR_CONTINUOUS_4"], na.rm = TRUE)}
if (is.na(df[i,"COVAR_CONTINUOUS_5"])) {df[i,"COVAR_CONTINUOUS_5"] =mean(df[,"COVAR_CONTINUOUS_5"], na.rm = TRUE)}
if (is.na(df[i,"COVAR_CONTINUOUS_6"])) {df[i,"COVAR_CONTINUOUS_6"] =mean(df[,"COVAR_CONTINUOUS_6"], na.rm = TRUE)}
if (is.na(df[i,"COVAR_CONTINUOUS_7"])) {df[i,"COVAR_CONTINUOUS_7"] =mean(df[,"COVAR_CONTINUOUS_7"], na.rm = TRUE)}
if (is.na(df[i,"COVAR_CONTINUOUS_10"])) {df[i,"COVAR_CONTINUOUS_10"] =mean(df[,"COVAR_CONTINUOUS_10"], na.rm = TRUE)}
if (is.na(df[i,"COVAR_CONTINUOUS_14"])) {df[i,"COVAR_CONTINUOUS_14"] =mean(df[,"COVAR_CONTINUOUS_14"], na.rm = TRUE)}
if (is.na(df[i,"COVAR_CONTINUOUS_30"])) {df[i,"COVAR_CONTINUOUS_30"] =mean(df[,"COVAR_CONTINUOUS_30"], na.rm = TRUE)}
#impute using mode ordinal & nominal values
if (is.na(df[i,"COVAR_ORDINAL_1"])) {df[i,"COVAR_ORDINAL_1"] =Mode(df[,"COVAR_ORDINAL_1"])}
if (is.na(df[i,"COVAR_ORDINAL_2"])) {df[i,"COVAR_ORDINAL_2"] =Mode(df[,"COVAR_ORDINAL_2"])}
if (is.na(df[i,"COVAR_ORDINAL_3"])) {df[i,"COVAR_ORDINAL_3"] =Mode(df[,"COVAR_ORDINAL_3"])}
if (is.na(df[i,"COVAR_ORDINAL_4"])) {df[i,"COVAR_ORDINAL_4"] =Mode(df[,"COVAR_ORDINAL_4"])}
#nominal
if (is.na(df[i,"COVAR_NOMINAL_1"])) {df[i,"COVAR_NOMINAL_1"] =Mode(df[,"COVAR_NOMINAL_1"])}
if (is.na(df[i,"COVAR_NOMINAL_2"])) {df[i,"COVAR_NOMINAL_2"] =Mode(df[,"COVAR_NOMINAL_2"])}
if (is.na(df[i,"COVAR_NOMINAL_3"])) {df[i,"COVAR_NOMINAL_3"] =Mode(df[,"COVAR_NOMINAL_3"])}
if (is.na(df[i,"COVAR_NOMINAL_4"])) {df[i,"COVAR_NOMINAL_4"] =Mode(df[,"COVAR_NOMINAL_4"])}
if (is.na(df[i,"COVAR_NOMINAL_5"])) {df[i,"COVAR_NOMINAL_5"] =Mode(df[,"COVAR_NOMINAL_5"])}
if (is.na(df[i,"COVAR_NOMINAL_6"])) {df[i,"COVAR_NOMINAL_6"] =Mode(df[,"COVAR_NOMINAL_6"])}
if (is.na(df[i,"COVAR_NOMINAL_7"])) {df[i,"COVAR_NOMINAL_7"] =Mode(df[,"COVAR_NOMINAL_7"])}
if (is.na(df[i,"COVAR_NOMINAL_8"])) {df[i,"COVAR_NOMINAL_8"] =Mode(df[,"COVAR_NOMINAL_8"])}
}#for
#END OPTIMIZE
dfList[[counter]] <- df
#add to discoveredDf since already substed
discoveredDf[nrow(discoveredDf)+1,]<- c(studyid_subjid)
counter = counter +1;
#for debugging to check progress
if (counter %% 100 == 0)
{
print(counter)
}
}
}
Whilecounter = Whilecounter -1;
}#end while
return (dfList)
}
Thanks
It's likely that performance can be improved in many ways, so long as you use a vectorized function on each column. Currently, you're iterating through each row, and then handling each column separately, which really slows you down. Another improvement is to generalize the code so you don't have to keep typing a new line for each variable. In the examples I'll give below, this is handled because continuous variables are numeric, and categorical are factors.
To get straight to an answer: you can replace the code you want optimized with the following (after fixing the variable names), provided that your continuous variables are numeric and your ordinal/categorical variables are not (e.g., they are factors):
impute <- function(x) {
if (is.numeric(x)) { # If numeric, impute with mean
x[is.na(x)] <- mean(x, na.rm = TRUE)
} else { # mode otherwise
x[is.na(x)] <- names(which.max(table(x)))
}
x
}
# Correct cols_to_impute with names of your variables to be imputed
# e.g., c("COVAR_CONTINUOUS_2", "COVAR_NOMINAL_3", ...)
cols_to_impute <- names(df) %in% c("names", "of", "columns")
library(purrr)
df[, cols_to_impute] <- dmap(df[, cols_to_impute], impute)
Below is a detailed comparison of five approaches:
Your original approach using for to iterate on rows; each column then handled separately.
Using a for loop.
Using lapply().
Using sapply().
Using dmap() from the purrr package.
The new approaches all iterate on the data frame by column and make use of a vectorized function called impute, which imputes missing values in a vector with the mean (if numeric) or the mode (otherwise). Beyond that, their differences are relatively minor (except for sapply(), as you'll see), but interesting to check.
Here are the utility functions we'll use:
# Function to simulate a data frame of numeric and factor variables with
# missing values and `n` rows
create_dat <- function(n) {
set.seed(13)
data.frame(
con_1 = sample(c(10:20, NA), n, replace = TRUE), # continuous w/ missing
con_2 = sample(c(20:30, NA), n, replace = TRUE), # continuous w/ missing
ord_1 = sample(c(letters, NA), n, replace = TRUE), # ordinal w/ missing
ord_2 = sample(c(letters, NA), n, replace = TRUE) # ordinal w/ missing
)
}
# Function that imputes missing values in a vector with mean (if numeric) or
# mode (otherwise)
impute <- function(x) {
if (is.numeric(x)) { # If numeric, impute with mean
x[is.na(x)] <- mean(x, na.rm = TRUE)
} else { # mode otherwise
x[is.na(x)] <- names(which.max(table(x)))
}
x
}
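A quick check of impute() on toy vectors (values chosen arbitrarily) to show what it does:
impute(c(1, NA, 3))           # 1 2 3 - NA replaced by the mean
impute(c("a", "a", NA, "b"))  # "a" "a" "a" "b" - NA replaced by the mode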
Now, wrapper functions for each approach:
# Original approach
func0 <- function(d) {
for (i in 1:nrow(d)) {
if (is.na(d[i, "con_1"])) d[i,"con_1"] <- mean(d[,"con_1"], na.rm = TRUE)
if (is.na(d[i, "con_2"])) d[i,"con_2"] <- mean(d[,"con_2"], na.rm = TRUE)
if (is.na(d[i,"ord_1"])) d[i,"ord_1"] <- names(which.max(table(d[,"ord_1"])))
if (is.na(d[i,"ord_2"])) d[i,"ord_2"] <- names(which.max(table(d[,"ord_2"])))
}
return(d)
}
# for loop operates directly on d
func1 <- function(d) {
for(i in seq_along(d)) {
d[[i]] <- impute(d[[i]])
}
return(d)
}
# Use lapply()
func2 <- function(d) {
lapply(d, function(col) {
impute(col)
})
}
# Use sapply()
func3 <- function(d) {
sapply(d, function(col) {
impute(col)
})
}
# Use purrr::dmap()
func4 <- function(d) {
purrr::dmap(d, impute)
}
Now, we'll compare the performance of these approaches with n ranging from 10 to 100 (VERY small):
library(microbenchmark)
ns <- seq(10, 100, by = 10)
times <- sapply(ns, function(n) {
dat <- create_dat(n)
op <- microbenchmark(
ORIGINAL = func0(dat),
FOR_LOOP = func1(dat),
LAPPLY = func2(dat),
SAPPLY = func3(dat),
DMAP = func4(dat)
)
by(op$time, op$expr, function(t) mean(t) / 1000)
})
times <- t(times)
times <- as.data.frame(cbind(times, n = ns))
# Plot the results
library(tidyr)
library(ggplot2)
times <- gather(times, -n, key = "fun", value = "time")
pd <- position_dodge(width = 0.2)
ggplot(times, aes(x = n, y = time, group = fun, color = fun)) +
geom_point(position = pd) +
geom_line(position = pd) +
theme_bw()
It's pretty clear that the original approach is much slower than the new approaches that use the vectorized function impute on each column. What about differences between the new ones? Let's bump up our sample size to check:
ns <- seq(5000, 50000, by = 5000)
times <- sapply(ns, function(n) {
dat <- create_dat(n)
op <- microbenchmark(
FOR_LOOP = func1(dat),
LAPPLY = func2(dat),
SAPPLY = func3(dat),
DMAP = func4(dat)
)
by(op$time, op$expr, function(t) mean(t) / 1000)
})
times <- t(times)
times <- as.data.frame(cbind(times, n = ns))
times <- gather(times, -n, key = "fun", value = "time")
pd <- position_dodge(width = 0.2)
ggplot(times, aes(x = n, y = time, group = fun, color = fun)) +
geom_point(position = pd) +
geom_line(position = pd) +
theme_bw()
Looks like sapply() is not great (as #Martin pointed out). This is because sapply() is doing extra work to get our data into a matrix shape (which we don't need). If you run this yourself without sapply(), you'll see that the remaining approaches are all pretty comparable.
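A quick illustration of the shape difference sapply() introduces (toy vectors, not the benchmark data):
d <- data.frame(con = c(1, NA, 3), ord = c("a", NA, "a"), stringsAsFactors = FALSE)
str(lapply(d, impute))  # a list: each column keeps its own type
str(sapply(d, impute))  # simplified to a character matrix: the numeric column gets coerced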
So the major performance improvement is to use a vectorized function on each column. I suggested using dmap at the beginning because I'm a fan of the function style and the purrr package generally, but you can comfortably substitute for whichever approach you prefer.
Aside, many thanks to #Martin for the very useful comment that got me to improve this answer!
If you are going to be working with what looks like a matrix, then use a matrix instead of a data frame, since indexing into a data frame as if it were a matrix is very costly. You might want to extract the numerical values to a matrix for part of your calculations; this can provide a significant increase in speed.
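A rough sketch of that cost difference (a hypothetical toy benchmark; absolute timings will vary by machine):
d <- data.frame(x = rnorm(1e5), y = rnorm(1e5))
m <- as.matrix(d)
system.time(for (i in 1:1e4) d[i, "x"])  # element access through [.data.frame: slow
system.time(for (i in 1:1e4) m[i, "x"])  # the same access on a matrix: far cheaper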
Here is a really simple and fast solution using data.table.
library(data.table)
# names of the columns to impute
cols <- c("a", "c")
# impute data: mean for numeric columns, mode for character columns
setDT(dt)[, (cols) := lapply(.SD, function(x)
  ifelse(is.na(x) & is.numeric(x), mean(x, na.rm = TRUE),
         ifelse(is.na(x) & is.character(x), names(which.max(table(x))), x))),
  .SDcols = cols]
I haven't compared the performance of this solution to the one provided by #Simon Jackson, but this should be pretty fast.
data from reproducible example
set.seed(25)
dt <- data.table(a=c(1:5,NA,NA,1,1),
b=sample(1:15, 9, replace=TRUE),
c=LETTERS[c(1:6,NA,NA,1)])

How to speed up missing search process in a R data.table

I am writing a general function for missing-value treatment. The data can have character, numeric, factor, and integer columns. An example of the data is as follows:
dt<-data.table(
num1=c(1,2,3,4,NA,5,NA,6),
num3=c(1,2,3,4,5,6,7,8),
int1=as.integer(c(NA,NA,102,105,NA,300,400,700)),
int3=as.integer(c(1,10,102,105,200,300,400,700)),
cha1=c('a','b','c',NA,NA,'c','d','e'),
cha3=c('xcda','b','c','miss','no','c','dfg','e'),
fact1=c('a','b','c',NA,NA,'c','d','e'),
fact3=c('ad','bd','cc','zz','yy','cc','dd','ed'),
allm=as.integer(c(NA,NA,NA,NA,NA,NA,NA,NA)),
miss=as.character(c("","",'c','miss','no','c','dfg','e')),
miss2=as.integer(c('','',3,4,5,6,7,8)),
miss3=as.factor(c(".",".",".","c","d","e","f","g")),
miss4=as.factor(c(NA,NA,'.','.','','','t1','t2')),
miss5=as.character(c(NA,NA,'.','.','','','t1','t2'))
)
I was using this code to flag out missing values:
dt[,flag:=ifelse(is.na(miss5)|!nzchar(miss5),1,0)]
But it turns out to be very slow; additionally, I have to add logic that also treats "." as missing.
So I am planning to write this for missing value identification
dt[miss5 %in% c(NA,'','.'),flag:=1]
but on a 6-million-record set it takes close to 1 second to run, whereas
dt[!nzchar(miss5), flag := 1] takes close to 0.14 seconds to run.
My question is: can we write code that takes as little time as possible while treating NA, blank, and dot (NA, "", ".") as missing?
Any help is highly appreciated.
== and %in% are optimised to use binary search automatically (NEW FEATURE: Auto indexing). To use it, we have to ensure that:
a) we use dt[...] instead of set() as it's not yet implemented in set(), #1196.
b) When RHS to %in% is of higher SEXPTYPE than LHS, auto indexing re-routes to base R to ensure correct results (as binary search always coerces RHS). So for integer columns we need to make sure we pass in just NA and not the "." or "".
Using #akrun's data, here's the code and run time:
in_col = grep("^miss", names(dt), value=TRUE)
out_col = gsub("^miss", "flag", in_col)
system.time({
  dt[, (out_col) := 0L]
  for (j in seq_along(in_col)) {
    if (class(.subset2(dt, in_col[j])) %in% c("character", "factor")) {
      lookup = c("", ".", NA)
    } else lookup = NA
    expr = call("%in%", as.name(in_col[j]), lookup)
    tt = dt[eval(expr), (out_col[j]) := 1L]
  }
})
# user system elapsed
# 1.174 0.295 1.476
How it works:
a) We first initialise all output columns to 0.
b) Then, for each column, we check its type and create lookup accordingly.
c) We then create the corresponding expression for i: miss(.) %in% lookup.
d) Then we evaluate the expression in i, which will use auto indexing to create an index very quickly and use that index to find matching rows with binary search.
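For example, here is roughly what the constructed expression from step c looks like for the miss5 column:
call("%in%", as.name("miss5"), c("", ".", NA))
# miss5 %in% c("", ".", NA)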
Note: if necessary, you can add a set2key(dt, NULL) at the end of the for loop so that the created indices are removed immediately after use (to save space).
Compared to this run, #akrun's fastest answer takes 6.33 seconds, which is a ~4.2x speedup.
Update: On 4 million rows and 100 columns, it takes ~ 9.2 seconds. That's ~0.092 seconds per column.
Calling [.data.table 100 times could be expensive. When auto indexing is implemented in set(), it'd be nice to compare the performance.
You can loop through the 'miss' columns and create corresponding 'flag' columns with set.
library(data.table)#v1.9.5+
ind <- grep('^miss', names(dt))
nm1 <- sub('miss', 'flag', names(dt)[ind])
dt[, (nm1) := 0]
for (j in seq_along(ind)) {
  set(dt, i = which(dt[[ind[j]]] %in% c('.', '', NA)), j = nm1[j], value = 1L)
}
Benchmarks
set.seed(24)
df1 <- as.data.frame(matrix(sample(c(NA,0:9), 6e6*5, replace=TRUE), ncol=5))
set.seed(23)
df2 <- as.data.frame(matrix(sample(c('.','', letters[1:5]), 6e6*5,
replace=TRUE), ncol=5))
set.seed(234)
i1 <- sample(10)
dfN <- setNames(cbind(df1, df2)[i1], paste0('miss',1:10))
dt <- as.data.table(dfN)
system.time({
ind <- grep('^miss', names(dt))
nm1 <- sub('miss', 'flag',names(dt)[ind])
dt[,(nm1) := 0L]
for(j in seq_along(ind)){
set(dt, i=which(dt[[ind[j]]] %in% c('.', '', NA)), j= nm1[j], value=1L)
}
}
)
#user system elapsed
# 8.352 0.150 8.496
system.time({
m1 <- matrix(0, nrow=6e6, ncol=10)
m2 <- sapply(seq_along(dt), function(i) {
ind <- which(dt[[i]] %in% c('.', '', NA))
replace(m1[,i], ind, 1L)})
cbind(dt, m2)})
#user system elapsed
# 14.227 0.362 14.582

Subsetting data by condition

I am trying to reshape/reduce my data. So far I have employed a for loop (very slow), but from what I gather this should be quite fast with plyr.
I have many groups (firms, as a factor in the dataset) and I want to drop entirely every firm that shows a 0 entry for value in any of that firm's cells. I thus create a new data.frame but leave out all groups showing 0 for value at any point.
The for loop:
Data Creation:
set.seed(1)
mydf <- data.frame(firmname = sample(LETTERS[1:5], 40, replace = TRUE),
value = rpois(40, 2))
-----------------------------
splitby = mydf$firmname
new.data <- data.frame()
for (i in 1:length(unique(splitby))) {
  temp <- subset(mydf, splitby == as.character(paste(unique(splitby)[i])))
  if (all(temp$value > 0) == "TRUE") {
    new.data <- rbind(new.data, temp)
  }
}
Then delete all empty firm factor levels:
new.data$splitby <- factor(new.data$splitby)
Is there a way to achieve that with the plyr package? Can the subset function be used in that context?
EDIT: To make the problem reproducible, the data-creation step suggested by BenBarnes has been added. Ben, thanks a lot for that. Furthermore, my code has been altered to comply with the answers provided below.
You could supply an anonymous function to the .fun argument in ddply():
set.seed(1)
mydf <- data.frame(firmname = sample(LETTERS[1:5], 40, replace = TRUE),
value = rpois(40, 2))
library(plyr)
ddply(mydf,.(firmname), function(x) if(any(x$value==0)) NULL else x )
Or using [, as suggested by Andrie:
firms0 <- unique(mydf$firmname[which(mydf$value == 0)])
mydf[-which(mydf$firmname %in% firms0), ]
Note that the results of ddply() are sorted according to firmname.
EDIT
For the example in your comments, this approach is again faster than using ddply() to subset, selecting only firms with more than three entries:
firmTable <- table(mydf$firmname)
firmsGT3 <- names(firmTable)[firmTable > 3]
mydf[mydf$firmname %in% firmsGT3, ]
