Creating indicator variable columns in dplyr chain - r

Updated: With apologies to those who replied, in my original example I overlooked the fact that data.frame() created var as a factor rather than as a character vector, as I had intended. I have corrected the example, and this will break at least one of the answers.
I have a data frame that I'm performing a series of dplyr and tidyr manipulations on, and I would like to add columns for indicator variables that would be encoded as 0 or 1, and do this within the dplyr chain. Each level of a factor (presently stored as character vectors) should be encoded in a separate column, and the column names are a concatenation of a fixed prefix with the variable level, e.g. var has level a, new column var_a will be 1, and all other rows of var_a will be 0.
The following minimal example using base R produces exactly the results that I want (thanks to this blog post), but I'd like to roll it all into the dplyr chain, and can't quite figure out how to do it.
df <- data.frame(var = sample(x = letters[1:4], size = 10, replace = TRUE), stringsAsFactors = FALSE)
for(level in unique(df$var)){
  df[paste("var", level, sep = "_")] <- ifelse(df$var == level, 1, 0)
}
Note that the real data set contains multiple columns, none of which should be altered or dropped when creating the indicator variables, with the exception of the column var, which could be converted to type factor.

It's not pretty, but this function should work
dummy <- function(data, col) {
  for(c in col) {
    # position and values of the (factor) column to expand
    idx <- which(names(data)==c)
    v <- data[[idx]]
    # one 0/1 column per factor level
    m <- matrix(0, nrow=nrow(data), ncol=nlevels(v))
    m[cbind(seq_along(v), as.integer(v))]<-1
    colnames(m) <- paste(c, levels(v), sep="_")
    r <- data.frame(m)
    # splice the indicator columns in where the original column was
    if ( idx>1 ) {
      r <- cbind(data[1:(idx-1)],r)
    }
    if ( idx<ncol(data) ) {
      r <- cbind(r, data[(idx+1):ncol(data)])
    }
    data <- r
  }
  data
}
Here's a sample data.frame
dd <- data.frame(a=runif(30),
and you specify the columns you want to expand as a character vector. You can do
dd %>% dummy("b")
dd %>% dummy(c("b","d"))

It's possible without creating a function, although it does require lapply. If var is a factor, you can work with its levels; we can bind its columns to an lapply which loops over the levels of var and creates the values, names them with setNames, and converts them into a tbl_df.
df %>% bind_cols(as_data_frame(setNames(lapply(levels(df$var),
function(x){as.integer(df$var == x)}),
paste0('var2_', levels(df$var)))))
Source: local data frame [10 x 5]
      var var_d var_c var2_c var2_d
   (fctr) (dbl) (dbl)  (int)  (int)
1       d     1     0      0      1
2       c     0     1      1      0
3       c     0     1      1      0
4       c     0     1      1      0
5       d     1     0      0      1
6       d     1     0      0      1
7       c     0     1      1      0
8       c     0     1      1      0
9       d     1     0      0      1
10      c     0     1      1      0
If var is a character vector, not a factor, you can do the same thing, but using unique instead of levels:
df %>% bind_cols(as_data_frame(setNames(lapply(unique(df$var),
function(x){as.integer(df$var == x)}),
paste0('var2_', unique(df$var)))))
Two notes:
This approach will work regardless of the data type, but will be slower. If your data is big enough that it matters, it likely makes sense to store the data as a factor anyway, as it contains a lot of repeated levels.
Both versions pull data from df$var as it lives in the calling environment, not as it may exist in a larger chain, and assume var is unchanged in whatever it is passed. To reference the dynamic value of var aside from dplyr's normal NSE is rather a pain, insofar as I've seen.
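One way around the calling-environment issue in the second note is to wrap the logic in a function that only looks at the data frame it is handed. The following is a minimal base R sketch, not part of the answer above; add_indicators and its arguments are hypothetical names chosen for illustration.

```r
# Hypothetical helper: add one 0/1 indicator column per distinct value of
# `col`, reading the values from the data frame passed in rather than from
# a variable in the calling environment.
add_indicators <- function(data, col, prefix = col) {
  v <- as.character(data[[col]])   # works for factor or character columns
  for (l in sort(unique(v))) {     # sort() also drops NA values
    data[[paste(prefix, l, sep = "_")]] <- as.integer(v == l)
  }
  data
}

df <- data.frame(id = 1:4, var = c("a", "b", "a", "c"),
                 stringsAsFactors = FALSE)
out <- add_indicators(df, "var")
names(out)   # "id" "var" "var_a" "var_b" "var_c"
out$var_a    # 1 0 1 0
```

Because the first argument is the data frame, it can sit in a chain as df %>% add_indicators("var") and will see whatever the previous step produced.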
One more alternative that's a little simpler and factor-agnostic, using reshape2::dcast:
df %>% cbind(1 * !is.na(reshape2::dcast(df, seq_along(var) ~ var, value.var = 'var')[,-1]))
It still pulls the version of df from the calling environment, so the chain really only determines what you're joining to. Because it uses cbind instead of bind_cols, the result will be a data.frame, too, not tbl_df, so if you want to keep it all tbl_df (smart if the data is big), you'll need to replace the cbind with bind_cols(as_data_frame( ... )); bind_cols doesn't seem to want to do the conversion for you.
Note, however, that while this version is simpler, it is comparatively slower, both on factor data:
Unit: microseconds
expr min lq mean median uq max neval
factor 358.889 384.0010 479.5746 427.9685 501.580 3995.951 100
unique 547.249 585.4205 696.4709 633.4215 696.402 4528.099 100
dcast 2265.517 2490.5955 2721.1118 2628.0730 2824.949 3928.796 100
and string data:
Unit: microseconds
expr min lq mean median uq max neval
unique 307.190 336.422 414.1031 362.6485 419.3625 3693.340 100
dcast 2117.807 2249.077 2517.0417 2402.4285 2615.7290 3793.178 100
For small data it won't matter, but for bigger data, it may be worth putting up with the complication.

The only requirements for a function to be part of a dplyr pipeline are that it takes a data frame as input, and returns a data frame as output. So, leveraging model.matrix:
make_inds <- function(df, cols=names(df))
{
    # do each variable separately to get around model.matrix dropping aliased columns
    # list(df) keeps df as a single argument, so cbind returns a data frame
    do.call(cbind, c(list(df), lapply(cols, function(n) {
        x <- df[[n]]
        mm <- model.matrix(~ x - 1)
        colnames(mm) <- gsub("^x", paste(n, "_", sep=""), colnames(mm))
        mm
    })))
}
# insert into pipeline
data %>% ... %>% make_inds %>% ...
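To see what model.matrix contributes here, it may help to look at a single variable in isolation. This is a small sketch with made-up data; the var_ prefix is just an example name.

```r
# A toy factor with three levels
x <- factor(c("a", "b", "a", "c"))

# "- 1" removes the intercept, so every level gets its own 0/1 column
mm <- model.matrix(~ x - 1)
colnames(mm)               # "xa" "xb" "xc"

# Swap the leading "x" for a prefix of our choosing
colnames(mm) <- gsub("^x", "var_", colnames(mm))
colnames(mm)               # "var_a" "var_b" "var_c"
```

Each row of mm has exactly one 1, in the column for that row's level.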

I landed on this Q&A first because I really wanted to put model.matrix in a magrittr pipe workflow or produce the equivalent output with just tidyverse functions (sorry, baseRs).
Later, I landed on this solution that had the elegant use of the functions that I thought was possible (but I wasn't coming up with on my own):
df <- data_frame(var = sample(x = letters[1:4], size = 10, replace = TRUE))
df %>%
mutate(unique_row_id = 1:n()) %>% #The rows need to be unique for `spread` to work.
mutate(dummy = 1) %>%
spread(var, dummy, fill = 0)
So, I'm adding an updated/modified version of the linked solution so that people who land here first don't have to keep looking (like I did).


Trying to cast molted table to very wide table gives error: "Cross product ... would result in xyz rows which exceeds .Machine$integer.max..."

I am trying to transform my data by applying one hot encoding to some of my columns. I am using data.table's melt and dcast to get there. For small datasets this works. But it breaks on the entire datasets while handling the call to dcast with the following error:
Error in CJ(1:1992389, 1:3228) :
Cross product of elements provided to CJ() would result in 6431431692 rows which exceeds .Machine$integer.max == 2147483647
Calls: create.and.insert.into.table.psms_wide.2 -> dcast -> -> -> CJ
Execution halted
Apparently I am hitting an integer limit in one of the internal operations. My code looks like this:
# record count is 4668480
df <- MY_DATA
#Make sure that the to-be-pivotted columns are of factor type.
df$X = as.factor(df$X)
df$Y = as.factor(df$Y)
# Make sure that X and Y have the same level set so that melt won't complain.
all_levels <- union(levels(df$X), levels(df$Y))
df$X = factor(df$X, levels = all_levels)
df$Y = factor(df$Y, levels = all_levels)
# transforming regular table to the 'melted' form
# non_measure_column_1,..., non_measure_column_N, measure, value
# non_measure_value_11, ..., non_measure_value_1N, X , some_X_value
# non_measure_value_21, ..., non_measure_value_2N, Y , some_Y_value
# non_measure_value_31, ..., non_measure_value_3N, X , some_other_X_value
# ...
df <- melt(df, measure.vars = c("X", "Y"), value.factor=TRUE, na.rm=TRUE)
# now transform the melted form to the wide one-hot encoded form:
# non_measure_column_1,..., non_measure_column_N , X_some_X_value, Y_some_Y_value, X_some_other_X_value, ...
# non_measure_value_11, ..., non_measure_value_1N , 1 , 0 , 0
# non_measure_value_21, ..., non_measure_value_2N , 0 , 1 , 0
# non_measure_value_31, ..., non_measure_value_3N , 0 , 0 , 1
df <- dcast(df, ... ~ variable + value, fun = length)
I realize that my expected result is both very wide and very tall. Because of system resource limits I already had to shift from data.frames to data.tables and from reshape2 to data.table methods. So far I have managed to get my code to work only on subsets of my data smaller than a million rows.
What do I need to do to make this work given the R restrictions?
Chunkwise tall-to-wide + union of intermediate results?
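The numbers in the error message can be checked with plain arithmetic. A minimal sketch, assuming the two ranges passed to CJ() are the number of distinct row keys and the number of output columns (that reading of the message is an assumption, not something confirmed above):

```r
rows <- 1992389                 # first range in the CJ() call
cols <- 3228                    # second range in the CJ() call

# Both values are doubles, so the product itself does not overflow
cells <- rows * cols
cells                           # 6431431692
.Machine$integer.max            # 2147483647
cells > .Machine$integer.max    # TRUE

# Largest number of row keys that stays under the limit for this many columns
floor(.Machine$integer.max / cols)
```

The product is roughly three times the integer limit, which is consistent with the observation that subsets of the data still work.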

extract highest and lowest values for columns in R, as well as row identifiers

Say I have some data of the following kind:
df <- data.frame(matrix(rnorm(10*10000, 1, .5), ncol=10))
I want a new dataframe that keeps the 10 original columns, but for every column retains only the highest 10 and lowest 10 values. Importantly, the rows have names corresponding to id values that need to be kept in the new data frame.
Thus, the end result data.frame is gonna be of dimensions m by 10, where m is very likely to be more than 20. But for every column, I want only 20 valid values.
The only way I can think of doing this is doing it manually per column, using dplyr and arrange, grabbing the top and bottom rows, and then creating a matrix from all the individual vectors. Clearly this is inefficient. Help?
Assuming you want to keep all the rows from the original dataset, where there is at least one value satisfying your condition (value among ten largest or ten smallest in the given column), you could do it like this:
# create a data frame
df <- data.frame(matrix(rnorm(10*10000, 1, .5), ncol=10))
# function to find lowest 10 and highest 10 values
lowHigh <- function(x)
{
  test <- x
  # rank() gives each value's position in sorted order (order() does not)
  r <- rank(x, ties.method = "first")
  test[!(r <= 10 | r > (length(x) - 10))] <- NA
  test
}
# apply the function defined above
test2 <- apply(df, 2, lowHigh)
# use the original rownames
rownames(test2) <- rownames(df)
# keep only rows where there is value of interest
finalData <- test2[apply(apply(test2, 2, is.na), 1, sum) < 10, ]
Please note that there is definitely some smarter way of doing it...
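When masking values by their position in the sorted column, the difference between order() and rank() matters: order() returns the indices that would sort the vector, while rank() returns each element's position in the sorted result. A small sketch, where keep_extremes is a hypothetical helper name:

```r
x <- c(30, 10, 20)
order(x)   # 2 3 1  (the smallest value sits at position 2, and so on)
rank(x)    # 3 1 2  (the first element is the 3rd smallest, and so on)

# Keep only the k smallest and k largest values, set the rest to NA
keep_extremes <- function(x, k) {
  r <- rank(x, ties.method = "first")
  x[!(r <= k | r > length(x) - k)] <- NA
  x
}

keep_extremes(c(5, 1, 9, 3, 7), 1)   # NA 1 9 NA NA
```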
Here is the data matrix with 10 highest and 10 lowest in each column,
x<-apply(df,2,function(k) k[order(k,decreasing=T)[c(1:10,(length(k)-9):length(k))]])
x is your 20 by 10 matrix.
Your requirement for row names conflicts from column to column: this matrix has only 20 rows, and the same row names cannot apply to all 10 columns. Instead, here is the matching matrix of row indices:
x_roworder<-apply(df,2,function(k) order(k,decreasing=T)[c(1:10,(length(k)-9):length(k))])
This will give you corresponding rows in original data matrix within each column.
I offer a couple of answers to this.
A base R implementation ( I have used %>% to make it easier to read)
ix = lapply(df, function(x) order(x)[-(1:(length(x)-20)+10)]) %>%
unlist %>% unique %>% sort
This abuses the fact that data frames are lists, finds the row id satisfying the condition for each column, then takes the unique ones in order as the row indices you want to keep. This should retain any row names attached to df
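The negative index in that expression is compact, so here is a small check of what it keeps. This sketch uses a made-up vector of 100 distinct values:

```r
set.seed(42)
x <- sample(100)                  # a random permutation of 1..100

# 1:(length(x) - 20) + 10 is 11:90, i.e. the middle of the sorted order;
# dropping those positions leaves the 10 lowest and the 10 highest
keep <- order(x)[-(1:(length(x) - 20) + 10)]

length(keep)                      # 20
sort(x[keep])                     # 1..10 followed by 91..100
```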
An alternative using dplyr (since you mentioned it), which, if I remember correctly, doesn't particularly like row names:
# add id as a variable
df$id = 1:nrow(df) # or row names
df %>%
gather("col",value,-id) %>%
group_by(col) %>%
filter(min_rank(value) <= 10 | min_rank(desc(value)) <= 10) %>%
ungroup %>%
select(id) %>%
Edited: To fix code alignment and make a neater filter
I'm not entirely sure what you're expecting for your return / output. But this will get you the appropriate indices
# example data
N <- 1000
df<-data.frame(id= 1:N, matrix(rnorm(10*N, 1, .5), ncol=10))
# for each column, extract ID's for top 10 and bottom 10 values
l1 <- lapply(df[,2:11], function(x,y, n) {
xy <- data.frame(x,y)
xy <- xy[order(xy[,1]),]
return(xy[c(1:10, (n-9):n),2])
}, y= df[,1], n = N)
# check:
xx <- sort(df[,2])
all.equal(sort(df[l1[[1]], 2]), xx[c(1:10, 991:1000)])
[1] TRUE
If you want an m * 10 matrix with these unique values, where m is the number of unique indices, you could do:
l2 <- do.call("c", l1)
l2 <- unique(l2)
df2 <- df[l2,] # in this case, m == 189
This doesn't zero out (or set to NA) the entries that are not among the top or bottom 10 of their own column. But it's unclear what your question is trying to do.
This isn't as efficient as using data.table since you're going to get a copy of the data in xy <- data.frame(x,y)
microbenchmark(ira= {
  test2 <- apply(df[,2:11], 2, lowHigh);
  rownames(test2) <- rownames(df);
  finalData <- test2[apply(apply(test2, 2, is.na), 1, sum) < 10, ]
},
alex= {
  l1 <- lapply(df[,2:11], function(x,y, n) {
    xy <- data.frame(x,y)
    xy <- xy[order(xy[,1]),]
    return(xy[c(1:10, (n-9):n),2])
  }, y= df[,1], n = N);
  l2 <- unique(do.call("c", l1));
  df2 <- df[l2,]
}, times= 50L)
Unit: milliseconds
expr min lq mean median uq max neval cld
ira 4.360452 4.522082 5.328403 5.140874 5.560295 8.369525 50 b
alex 3.771111 3.854477 4.054388 3.936716 4.158801 5.654280 50 a

R Check a row of strings, if equal, assign equal ID, less time consuming

I'm fairly new to R and was wondering if anyone here had a better solution to my problem, as mine is too time consuming. I know R is not very "for-loop-friendly", so I am sure there is a better way to solve this.
I have a data frame where x is a text string and y is a numeric id:
x = c("a", "b", "c", "b", "a")
y = c(1,2,3,4,5)
df <- data.frame(x, y)
I want to find all matches in column x, and assign them the same numeric value as the first in y. I have solved this with the following:
for(i in 1:NROW(df)) {
  for(j in i:NROW(df)) {
    if(df$x[j] == df$x[i]){
      df$y[j] <- df$y[i]
    }
    j = j + 1
  }
  i = i + 1
}
Problem is, I have a fairly large dataset which makes this process take a lot of time! Hope anyone here knows a less time consuming alternative!
If your dataset is indeed large, then data.table will probably be the fastest solution (see benchmarks here).
library(data.table)
setDT(df)[, y := first(y), by = x]
R likes vectorised code, so things like arithmetic operations and assignments can be slow if done in a loop. Consider for example assigning the vector 1, 2, ... 1,000,000 to a variable x in two different ways
x <- 1:1e6
x <- numeric(1e6) # initialise a numeric vector of length 1 million
for (i in 1:1e6) x[i] <- i
If you try this out you will see that the second method takes much longer.
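A quick way to try this is to time both versions with system.time. The sketch below uses a shorter vector (1e5 elements, my choice) so it finishes quickly; the actual timings depend on the machine, so none are shown.

```r
n <- 1e5

# Vectorised: build the whole sequence in one call
t_vec <- system.time(x1 <- seq_len(n))

# Loop: preallocate, then fill one element at a time
t_loop <- system.time({
  x2 <- numeric(n)
  for (i in seq_len(n)) x2[i] <- i
})

all.equal(x1, x2)   # TRUE: same values either way
t_vec["elapsed"]
t_loop["elapsed"]
```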
Coming to your problem, you want to group the data by the value in df$x and replace the values of y within each group by their first element. The call
by(df$x, function(d) transform(d, y = y[1]), data = df)
will set the second column of each subset of df (subsetting based on df$x) equal to its first element. The result is
#df$x: a
# x y
#1 a 1
#5 a 1
#df$x: b
# x y
#2 b 2
#4 b 2
#df$x: c
# x y
#3 c 3
To combine these back into a data frame, pass the result of by() to do.call(rbind, ...). One (possibly unwanted) side effect of this operation is that it will change the order of the rows.
If you are new to R check out the dplyr package, it's got a smooth learning curve and easy to write and read syntax. What you want to do could be accomplished in only a few lines.
df %>% group_by(x) %>% mutate(y = y[1])
will do it!
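For comparison, the same result can be had in base R without a loop or extra packages. A small sketch on the example data from the question: match(x, x) returns, for each element, the position of its first occurrence, which can then be used to index y.

```r
x <- c("a", "b", "c", "b", "a")
y <- c(1, 2, 3, 4, 5)
df <- data.frame(x, y)

first_pos <- match(df$x, df$x)   # 1 2 3 2 1
df$y <- df$y[first_pos]
df$y                             # 1 2 3 2 1

# ave() is another option: apply a function within each group of x
y2 <- ave(y, x, FUN = function(v) v[1])
y2                               # 1 2 3 2 1
```

Unlike the by() approach, both variants keep the original row order.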

Using R's plyr package to reorder groups within a dataframe

I have a data reorganization task that I think could be handled by R's plyr package. I have a dataframe with numeric data organized in groups. Within each group I need to have the data sorted largest to smallest.
The data looks like this (code to generate below)
group value
2 b 0.1408790
6 b 1.1450040 #2nd b is smaller than 1st
1 c 5.7433568
3 c 2.2109819
4 d 0.5384659
5 d 4.5382979
What I would like is this.
group value
b 1.1450040 #1st b is largest
b 0.1408790
c 5.7433568
c 2.2109819
d 4.5382979
d 0.5384659
So, what I need plyr to do is go through each group & apply something like order on the numeric data, reorganize by order, save the reordered subset of data, & put it all back together at the end.
I can process this "by hand" with a list & some loops, but it takes a long long time. Can this be done by plyr in a couple of lines?
Example data
n <- 6  # number of rows (the variable name n is assumed)
groups <- c("a","b","c","d")
df <- data.frame(group = sample(groups, n, replace = TRUE),
                 value = runif(n, 0, 10), stringsAsFactors = FALSE)
df <- df[order(df$group),] #order by group letter
The inefficient approach using loops:
My current approach is to separate the dataframe df into a list by groups, apply order to each element of the list, and overwrite the original list element with the reordered element. I then use a loop to re-assemble the dataframe. (As a learning exercise, I'd also be interested in how to make this code more efficient. In particular, what would be the most efficient way using base R functions to turn a list into a dataframe?)
Vector of the unique groups in the dataframe
groups.u <- unique(df$group)
Create empty list
my.list <- as.list(groups.u); names(my.list) <- groups.u
Break up df by $group into list
for(i in 1:length(groups.u)){
  i.working <- which(df$group == groups.u[i])
  my.list[[i]] <- df[i.working, ]
}
Sort elements within list using order
for(i in 1:length(my.list)){
  order.x <- order(my.list[[i]]$value, na.last = TRUE, decreasing = TRUE)
  my.list[[i]] <- my.list[[i]][order.x, ]
}
Finally rebuild df from the list. 1st, make seed for loop
new.df <- my.list[[1]][1,]; new.df[1,] <- NA
for(i in 1:length(my.list)){
  new.df <- rbind(new.df, my.list[[i]])
}
Remove seed
new.df <- new.df[-1,]
You could use dplyr which is a newer version of plyr that focuses on data frames:
arrange(df, group, desc(value))
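For reference, base R can do the same sort in one line with order(), and a list of data frames can be stacked back together with do.call(rbind, ...). A small sketch using the six rows printed in the question:

```r
df <- data.frame(
  group = c("b", "b", "c", "c", "d", "d"),
  value = c(0.1408790, 1.1450040, 5.7433568, 2.2109819, 0.5384659, 4.5382979),
  stringsAsFactors = FALSE
)

# Sort by group, then by value from largest to smallest
sorted <- df[order(df$group, -df$value), ]
sorted$value   # 1.1450040 0.1408790 5.7433568 2.2109819 4.5382979 0.5384659

# The split / sort / recombine route from the question, without explicit loops
pieces  <- lapply(split(df, df$group), function(d) d[order(-d$value), ])
rebuilt <- do.call(rbind, pieces)
```

The one-line order() call is the simpler of the two; the split version is only shown because the question asks how to turn a list back into a data frame.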
It's virtually sacrilegious to include a "data.table" response in a question tagged "plyr" or "dplyr", but your comment indicates you're looking for fast compact code.
In "data.table", you could use setorder, like this:
setorder(setDT(df), group, -value)
That command does two things:
It converts your data.frame to a data.table without copying.
It sorts your columns by reference (again, no copying).
You mention "> 50k rows". That's actually not very large, and even base R should be able to handle it well. In terms of "dplyr" and "data.table", you're looking at measurements in the milliseconds. That could make a difference as your input datasets become larger.
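library(data.table)
library(dplyr)
library(microbenchmark)
set.seed(1)
n <- 50000  # number of rows (the variable name n is assumed)
groups <- c(letters, LETTERS)
df <- data.frame(
  group = sample(groups, n, replace = TRUE),
  value = runif(n, 0, 10), stringsAsFactors = FALSE)
# each data.table benchmark works on a fresh copy of df (as.data.table assumed)
dt1 <- function() as.data.table(df)[order(group, -value)]
dt2 <- function() setorder(as.data.table(df), group, -value)[]
dp1 <- function() arrange(df, group, desc(value))
microbenchmark(dt1(), dt2(), dp1())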
# Unit: milliseconds
# expr min lq mean median uq max neval
# dt1() 5.749002 5.981274 7.725225 6.270664 8.831899 67.402052 100
# dt2() 4.956020 5.096143 5.750724 5.229124 5.663545 8.620155 100
# dp1() 37.305364 37.779725 39.837303 38.169298 40.589519 96.809736 100

Replace parts of a variable using numeric indices in dplyr. Do I need to create an index column and use ifelse?

At one stage in a longer chain of dplyr functions, I need to replace parts of a variable using numeric indices to specify which elements to replace.
My data looks like this:
df1 <- data.frame(grp = rep(1:2, each = 3),
a = 1:6,
b = rep(c(10, 20), each = 3))
# grp a b
# 1 1 1 10
# 2 1 2 10
# 3 1 3 10
# 4 2 4 20
# 5 2 5 20
# 6 2 6 20
Assume that we, within each group, wish to replace elements in variable a with the corresponding elements in b, at one or more positions. In this simple example I use a single index (id), but this could be a vector of indices. First, here's how I would do it with ddply:
id <- 2
ddply(.data = df1, .variables = .(grp), function(x){
  x$a[id] <- x$b[id]
  x
})
# grp a b
# 1 1 1 10
# 2 1 10 10
# 3 1 3 10
# 4 2 4 20
# 5 2 20 20
# 6 2 6 20
In dplyr I could think of some different ways to perform the replacement. (1) Use do with an anonymous function, similar to the one used in ddply. (2) Use mutate: concatenate a vector where the replacement is 'inserted' using numeric indexing. This is probably only fruitful for a single index. (3) Use mutate: create an index vector and use conditional replacement with ifelse (see e.g. here, here, here, and here).
detach("package:plyr", unload = TRUE)
# (1)
fun_do <- function(df){
  l <- df %.%
    group_by(grp) %.%
    do(function(dat){
      dat$a[id] <- dat$b[id]
      return(dat)
    })
  do.call(rbind, l)
}
# (2)
fun_mut <- function(df){
  df %.%
    group_by(grp) %.%
    mutate(
      a = c(a[1:(id - 1)], b[id], a[(id + 1):length(a)]))
}
# (3)
fun_mut_ifelse <- function(df){
  df %.%
    group_by(grp) %.%
    mutate(
      idx = 1:n(),
      a = ifelse(idx %in% id, b, a)) %.%
    select(-idx)
}
In a benchmark with a slightly larger data set, the 'jigsaw puzzle insertion' is fastest, but again, this method is probably only suited for single replacements. And it doesn't look very clean...
df2 <- data.frame(grp = rep(1:200, each = 3),
                  a = rnorm(600),
                  b = rnorm(600))
microbenchmark(fun_do(df2), fun_mut(df2), fun_mut_ifelse(df2),
               times = 10)
# Unit: microseconds
# expr min lq median uq max neval
# fun_do(df2) 48443.075 49912.682 51356.631 53369.644 55108.769 10
# fun_mut(df2) 891.420 933.996 1019.906 1066.663 1155.235 10
# fun_mut_ifelse(df2) 2503.579 2667.798 2869.270 3027.407 3138.787 10
Just to check the influence of the do.call(rbind, ...) part in the do function, try without it:
fun_do2 <- function(df){
  df %.%
    group_by(grp) %.%
    do(function(dat){
      dat$a[2] <- dat$b[2]
      return(dat)
    })
}
Then a new benchmark on a larger data set:
df3 <- data.frame(grp = rep(1:2000, each = 3),
                  a = rnorm(6000),
                  b = rnorm(6000))
microbenchmark(fun_do(df3), fun_do2(df3), fun_mut(df3), fun_mut_ifelse(df3),
               times = 10)
Again, a simple 'insertion' is fastest, while the do function is losing ground. In the help text do is described as "a general purpose complement" to the other dplyr functions. To me it seemed to be a natural choice for an anonymous function. However, I was surprised that do was so much slower, also when the non-dplyr rbinding part was skipped. Currently, the do documentation is rather scarce, so I wonder if I am abusing the function, and that there may be more appropriate (undocumented?) ways to do it?
I got no hits on index/indices when I searched the dplyr help text or vignette. So now I wonder:
Are there other dplyr methods to replace parts of a variable using numeric indices which I have overlooked? Specifically, is the creation of an index column in combination with ifelse the way to go, or are there more direct a[i] <- b[i]-like alternatives?
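Edit following comment from @G.Grothendieck (Thanks!). Added replace alternative (a candidate for 'See also' in ?[).
fun_replace <- function(df){
  df %.%
    group_by(grp) %.%
    mutate(
      a = replace(a, id, b[id]))
}
microbenchmark(fun_do(df3), fun_do2(df3), fun_mut(df3),
               fun_mut_ifelse(df3), fun_replace(df3), times = 10)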
# Unit: milliseconds
# expr min lq median uq max neval
# fun_do(df3) 685.154605 693.327160 706.055271 712.180410 851.757790 10
# fun_do2(df3) 291.787455 294.047747 297.753888 299.624730 302.368554 10
# fun_mut(df3) 5.736640 5.883753 6.206679 6.353222 7.381871 10
# fun_mut_ifelse(df3) 24.321894 26.091049 29.361553 32.649924 52.981525 10
# fun_replace(df3) 4.616757 4.748665 4.981689 5.279716 5.911503 10
The replace function is fastest, and certainly easier to use than fun_mut when there is more than one index.
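For readers unfamiliar with replace, here is what it does on plain vectors, followed by a base R version of the per-group replacement. This is a sketch using the df1 example from the top of the question; the split/unsplit route is my own illustration, not one of the benchmarked functions.

```r
a <- 1:6
b <- rep(c(10, 20), each = 3)

# replace(x, list, values) returns a copy of x with x[list] set to values
replace(a, 2, b[2])                 # 1 10 3 4 5 6
replace(a, c(1, 3), b[c(1, 3)])     # 10 2 10 4 5 6  (several indices at once)

# Per-group replacement in base R: split by group, replace, put back together
df1 <- data.frame(grp = rep(1:2, each = 3),
                  a = 1:6,
                  b = rep(c(10, 20), each = 3))
id <- 2
pieces <- lapply(split(df1, df1$grp), function(d) {
  d$a <- replace(d$a, id, d$b[id])
  d
})
res <- unsplit(pieces, df1$grp)
res$a                               # 1 10 3 4 20 6
```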
Edit 2: fun_do and fun_do2 no longer work in dplyr 0.2; Error: Results are not data frames at positions:
Here's a much faster modify-in-place approach:
# select rows we want, then assign b to a for those rows, in place
fun_dt = function(dt) dt[dt[, .I[id], by = grp]$V1, a := b]
# benchmark
df4 = data.frame(grp = rep(1:20000, each = 3),
a = rnorm(60000),
b = rnorm(60000))
dt4 = as.data.table(df4)
# using fastest function from OP
microbenchmark(fun_dt(dt4), fun_replace(df4), times = 10)
#Unit: milliseconds
# expr min lq median uq max neval
# fun_dt(dt4) 15.62325 17.22828 18.42445 20.83768 21.25371 10
# fun_replace(df4) 99.03505 107.31529 116.74830 188.89134 286.50199 10
