Drop data frame columns by name - r

I have a number of columns that I would like to remove from a data frame. I know that we can delete them individually using something like:
df$x <- NULL
But I was hoping to do this with fewer commands.
Also, I know that I could drop columns using integer indexing like this:
df <- df[ -c(1, 3:6, 12) ]
But I am concerned that the relative position of my variables may change.
Given how powerful R is, I figured there might be a better way than dropping each column one by one.

You can use a simple list of names :
DF <- data.frame(
x=1:10,
y=10:1,
z=rep(5,10),
a=11:20
)
drops <- c("x","z")
DF[ , !(names(DF) %in% drops)]
Or, alternatively, you can make a list of those to keep and refer to them by name :
keeps <- c("y", "a")
DF[keeps]
EDIT :
For those still not acquainted with the drop argument of the indexing function, if you want to keep one column as a data frame, you do:
keeps <- "y"
DF[ , keeps, drop = FALSE]
drop=TRUE (or not mentioning it) will drop unnecessary dimensions, and hence return a vector with the values of column y.

There's also the subset command, useful if you know which columns you want:
df <- data.frame(a = 1:10, b = 2:11, c = 3:12)
df <- subset(df, select = c(a, c))
UPDATED after comment by #hadley: To drop columns a,c you could do:
df <- subset(df, select = -c(a, c))

within(df, rm(x))
is probably easiest, or for multiple variables:
within(df, rm(x, y))
Or if you're dealing with data.tables (per How do you delete a column by name in data.table?):
dt[, x := NULL] # Deletes column x by reference instantly.
dt[, !"x"] # Selects all but x into a new data.table.
or for multiple variables
dt[, c("x","y") := NULL]
dt[, !c("x", "y")]

You could use %in% like this:
df[, !(colnames(df) %in% c("x","bar","foo"))]

list(NULL) also works:
dat <- mtcars
colnames(dat)
# [1] "mpg" "cyl" "disp" "hp" "drat" "wt" "qsec" "vs" "am" "gear"
# [11] "carb"
dat[,c("mpg","cyl","wt")] <- list(NULL)
colnames(dat)
# [1] "disp" "hp" "drat" "qsec" "vs" "am" "gear" "carb"

If you want remove the columns by reference and avoid the internal copying associated with data.frames then you can use the data.table package and the function :=
You can pass a character vector names to the left hand side of the := operator, and NULL as the RHS.
library(data.table)
df <- data.frame(a=1:10, b=1:10, c=1:10, d=1:10)
DT <- data.table(df)
# or more simply DT <- data.table(a=1:10, b=1:10, c=1:10, d=1:10) #
DT[, c('a','b') := NULL]
If you want to predefine the names as as character vector outside the call to [, wrap the name of the object in () or {} to force the LHS to be evaluated in the calling scope not as a name within the scope of DT.
del <- c('a','b')
DT <- data.table(a=1:10, b=1:10, c=1:10, d=1:10)
DT[, (del) := NULL]
DT <- <- data.table(a=1:10, b=1:10, c=1:10, d=1:10)
DT[, {del} := NULL]
# force or `c` would also work.
You can also use set, which avoids the overhead of [.data.table, and also works for data.frames!
df <- data.frame(a=1:10, b=1:10, c=1:10, d=1:10)
DT <- data.table(df)
# drop `a` from df (no copying involved)
set(df, j = 'a', value = NULL)
# drop `b` from DT (no copying involved)
set(DT, j = 'b', value = NULL)

There is a potentially more powerful strategy based on the fact that grep() will return a numeric vector. If you have a long list of variables as I do in one of my dataset, some variables that end in ".A" and others that end in ".B" and you only want the ones that end in ".A" (along with all the variables that don't match either pattern, do this:
dfrm2 <- dfrm[ , -grep("\\.B$", names(dfrm)) ]
For the case at hand, using Joris Meys example, it might not be as compact, but it would be:
DF <- DF[, -grep( paste("^",drops,"$", sep="", collapse="|"), names(DF) )]

Another dplyr answer.
Use select(-column).
If your variables have some common naming structure, you might try starts_with(). For example
library(dplyr)
df <- data.frame(var1 = rnorm(5), var2 = rnorm(5), var3 = rnorm (5),
var4 = rnorm(5), char1 = rnorm(5), char2 = rnorm(5))
df
# var2 char1 var4 var3 char2 var1
#1 -0.4629512 -0.3595079 -0.04763169 0.6398194 0.70996579 0.75879754
#2 0.5489027 0.1572841 -1.65313658 -1.3228020 -1.42785427 0.31168919
#3 -0.1707694 -0.9036500 0.47583030 -0.6636173 0.02116066 0.03983268
df1 <- df %>% select(-starts_with("char"))
df1
# var2 var4 var3 var1
#1 -0.4629512 -0.04763169 0.6398194 0.75879754
#2 0.5489027 -1.65313658 -1.3228020 0.31168919
#3 -0.1707694 0.47583030 -0.6636173 0.03983268
If you want to drop a sequence of variables in the data frame, you can use :. For example if you wanted to drop var2, var3, and all variables in between, you'd just be left with var1:
df2 <- df1 %>% select(-c(var2:var3) )
df2
# var1
#1 0.75879754
#2 0.31168919
#3 0.03983268

Dplyr Solution
I doubt this will get much attention down here, but if you have a list of columns that you want to remove, and you want to do it in a dplyr chain I use one_of() in the select clause:
Here is a simple, reproducable example:
undesired <- c('mpg', 'cyl', 'hp')
mtcars <- mtcars %>%
select(-one_of(undesired))
Documentation can be found by running ?one_of or here:
http://genomicsclass.github.io/book/pages/dplyr_tutorial.html

Another possibility:
df <- df[, setdiff(names(df), c("a", "c"))]
or
df <- df[, grep('^(a|c)$', names(df), invert=TRUE)]

DF <- data.frame(
x=1:10,
y=10:1,
z=rep(5,10),
a=11:20
)
DF
Output:
x y z a
1 1 10 5 11
2 2 9 5 12
3 3 8 5 13
4 4 7 5 14
5 5 6 5 15
6 6 5 5 16
7 7 4 5 17
8 8 3 5 18
9 9 2 5 19
10 10 1 5 20
DF[c("a","x")] <- list(NULL)
Output:
y z
1 10 5
2 9 5
3 8 5
4 7 5
5 6 5
6 5 5
7 4 5
8 3 5
9 2 5
10 1 5

Out of interest, this flags up one of R's weird multiple syntax inconsistencies. For example given a two-column data frame:
df <- data.frame(x=1, y=2)
This gives a data frame
subset(df, select=-y)
but this gives a vector
df[,-2]
This is all explained in ?[ but it's not exactly expected behaviour. Well at least not to me...

Here is a dplyr way to go about it:
#df[ -c(1,3:6, 12) ] # original
df.cut <- df %>% select(-col.to.drop.1, -col.to.drop.2, ..., -col.to.drop.6) # with dplyr::select()
I like this because it's intuitive to read & understand without annotation and robust to columns changing position within the data frame. It also follows the vectorized idiom using - to remove elements.

I keep thinking there must be a better idiom, but for subtraction of columns by name, I tend to do the following:
df <- data.frame(a=1:10, b=1:10, c=1:10, d=1:10)
# return everything except a and c
df <- df[,-match(c("a","c"),names(df))]
df

There's a function called dropNamed() in Bernd Bischl's BBmisc package that does exactly this.
BBmisc::dropNamed(df, "x")
The advantage is that it avoids repeating the data frame argument and thus is suitable for piping in magrittr (just like the dplyr approaches):
df %>% BBmisc::dropNamed("x")

Another solution if you don't want to use #hadley's above: If "COLUMN_NAME" is the name of the column you want to drop:
df[,-which(names(df) == "COLUMN_NAME")]

Beyond select(-one_of(drop_col_names)) demonstrated in earlier answers, there are a couple other dplyr options for dropping columns using select() that do not involve defining all the specific column names (using the dplyr starwars sample data for some variety in column names):
library(dplyr)
starwars %>%
select(-(name:mass)) %>% # the range of columns from 'name' to 'mass'
select(-contains('color')) %>% # any column name that contains 'color'
select(-starts_with('bi')) %>% # any column name that starts with 'bi'
select(-ends_with('er')) %>% # any column name that ends with 'er'
select(-matches('^f.+s$')) %>% # any column name matching the regex pattern
select_if(~!is.list(.)) %>% # not by column name but by data type
head(2)
# A tibble: 2 x 2
homeworld species
<chr> <chr>
1 Tatooine Human
2 Tatooine Droid
If you need to drop a column that may or may not exist in the data frame, here's a slight twist using select_if() that unlike using one_of() will not throw an Unknown columns: warning if the column name does not exist. In this example 'bad_column' is not a column in the data frame:
starwars %>%
select_if(!names(.) %in% c('height', 'mass', 'bad_column'))

Provide the data frame and a string of comma separated names to remove:
remove_features <- function(df, features) {
rem_vec <- unlist(strsplit(features, ', '))
res <- df[,!(names(df) %in% rem_vec)]
return(res)
}
Usage:
remove_features(iris, "Sepal.Length, Petal.Width")

Drop and delete columns by columns name in data frame.
A <- df[ , c("Name","Name1","Name2","Name3")]

There are a lot of ways you can do...
Option-1:
df[ , -which(names(df) %in% c("name1","name2"))]
Option-2:
df[!names(df) %in% c("name1", "name2")]
Option-3:
subset(df, select=-c(name1,name2))

Find the index of the columns you want to drop using which. Give these indexes a negative sign (*-1). Then subset on those values, which will remove them from the dataframe. This is an example.
DF <- data.frame(one=c('a','b'), two=c('c', 'd'), three=c('e', 'f'), four=c('g', 'h'))
DF
# one two three four
#1 a d f i
#2 b e g j
DF[which(names(DF) %in% c('two','three')) *-1]
# one four
#1 a g
#2 b h

If you have a large data.frame and are low on memory use [ . . . . or rm and within to remove columns of a data.frame, as subset is currently (R 3.6.2) using more memory - beside the hint of the manual to use subset interactively.
getData <- function() {
n <- 1e7
set.seed(7)
data.frame(a = runif(n), b = runif(n), c = runif(n), d = runif(n))
}
DF <- getData()
tt <- sum(.Internal(gc(FALSE, TRUE, TRUE))[13:14])
DF <- DF[setdiff(names(DF), c("a", "c"))] ##
#DF <- DF[!(names(DF) %in% c("a", "c"))] #Alternative
#DF <- DF[-match(c("a","c"),names(DF))] #Alternative
sum(.Internal(gc(FALSE, FALSE, TRUE))[13:14]) - tt
#0.1 MB are used
DF <- getData()
tt <- sum(.Internal(gc(FALSE, TRUE, TRUE))[13:14])
DF <- subset(DF, select = -c(a, c)) ##
sum(.Internal(gc(FALSE, FALSE, TRUE))[13:14]) - tt
#357 MB are used
DF <- getData()
tt <- sum(.Internal(gc(FALSE, TRUE, TRUE))[13:14])
DF <- within(DF, rm(a, c)) ##
sum(.Internal(gc(FALSE, FALSE, TRUE))[13:14]) - tt
#0.1 MB are used
DF <- getData()
tt <- sum(.Internal(gc(FALSE, TRUE, TRUE))[13:14])
DF[c("a", "c")] <- NULL ##
sum(.Internal(gc(FALSE, FALSE, TRUE))[13:14]) - tt
#0.1 MB are used

Another data.table option which hasn't been posted yet is using the special verb .SD, which stands for subset of data. Together with the .SDcols argument you can select/drop columns by name or index.
require(data.table)
# data
dt = data.table(
A = LETTERS[1:5],
B = 1:5,
C = rep(TRUE, 5)
)
# delete B
dt[ , .SD, .SDcols =! 'B' ]
# delete all matches (i.e. all columns)
cols = grep('[A-Z]+', names(dt), value = TRUE)
dt[ , .SD, .SDcols =! cols ]
A summary of all the options for such a task in data.table can be found here

df <- data.frame(
+ a=1:5,
+ b=6:10,
+ c=rep(22,5),
+ d=round(runif(5)*100, 2),
+ e=round(runif(5)*100, 2),
+ f=round(runif(5)*100, 2),
+ g=round(runif(5)*100, 2),
+ h=round(runif(5)*100, 2)
+ )
> df
a b c d e f g h
1 1 6 22 76.31 39.96 66.62 72.75 73.14
2 2 7 22 53.41 94.85 96.02 97.31 85.32
3 3 8 22 98.29 38.95 12.61 29.67 88.45
4 4 9 22 20.04 53.53 83.07 77.50 94.99
5 5 10 22 5.67 0.42 15.07 59.75 31.21
> # remove cols: d g h
> newDf <- df[, c(1:3, 5), drop=TRUE]
> newDf
a b c e
1 1 6 22 39.96
2 2 7 22 94.85
3 3 8 22 38.95
4 4 9 22 53.53
5 5 10 22 0.42

Another option using the function fselect from the collapse package. Here is a reproducible example:
DF <- data.frame(
x=1:10,
y=10:1,
z=rep(5,10),
a=11:20
)
library(collapse)
fselect(DF, -z)
#> x y a
#> 1 1 10 11
#> 2 2 9 12
#> 3 3 8 13
#> 4 4 7 14
#> 5 5 6 15
#> 6 6 5 16
#> 7 7 4 17
#> 8 8 3 18
#> 9 9 2 19
#> 10 10 1 20
Created on 2022-08-26 with reprex v2.0.2

Related

How to delete tibble columns names of which is contained in vector? [duplicate]

I have a number of columns that I would like to remove from a data frame. I know that we can delete them individually using something like:
df$x <- NULL
But I was hoping to do this with fewer commands.
Also, I know that I could drop columns using integer indexing like this:
df <- df[ -c(1, 3:6, 12) ]
But I am concerned that the relative position of my variables may change.
Given how powerful R is, I figured there might be a better way than dropping each column one by one.
You can use a simple list of names :
DF <- data.frame(
x=1:10,
y=10:1,
z=rep(5,10),
a=11:20
)
drops <- c("x","z")
DF[ , !(names(DF) %in% drops)]
Or, alternatively, you can make a list of those to keep and refer to them by name :
keeps <- c("y", "a")
DF[keeps]
EDIT :
For those still not acquainted with the drop argument of the indexing function, if you want to keep one column as a data frame, you do:
keeps <- "y"
DF[ , keeps, drop = FALSE]
drop=TRUE (or not mentioning it) will drop unnecessary dimensions, and hence return a vector with the values of column y.
There's also the subset command, useful if you know which columns you want:
df <- data.frame(a = 1:10, b = 2:11, c = 3:12)
df <- subset(df, select = c(a, c))
UPDATED after comment by #hadley: To drop columns a,c you could do:
df <- subset(df, select = -c(a, c))
within(df, rm(x))
is probably easiest, or for multiple variables:
within(df, rm(x, y))
Or if you're dealing with data.tables (per How do you delete a column by name in data.table?):
dt[, x := NULL] # Deletes column x by reference instantly.
dt[, !"x"] # Selects all but x into a new data.table.
or for multiple variables
dt[, c("x","y") := NULL]
dt[, !c("x", "y")]
You could use %in% like this:
df[, !(colnames(df) %in% c("x","bar","foo"))]
list(NULL) also works:
dat <- mtcars
colnames(dat)
# [1] "mpg" "cyl" "disp" "hp" "drat" "wt" "qsec" "vs" "am" "gear"
# [11] "carb"
dat[,c("mpg","cyl","wt")] <- list(NULL)
colnames(dat)
# [1] "disp" "hp" "drat" "qsec" "vs" "am" "gear" "carb"
If you want remove the columns by reference and avoid the internal copying associated with data.frames then you can use the data.table package and the function :=
You can pass a character vector names to the left hand side of the := operator, and NULL as the RHS.
library(data.table)
df <- data.frame(a=1:10, b=1:10, c=1:10, d=1:10)
DT <- data.table(df)
# or more simply DT <- data.table(a=1:10, b=1:10, c=1:10, d=1:10) #
DT[, c('a','b') := NULL]
If you want to predefine the names as as character vector outside the call to [, wrap the name of the object in () or {} to force the LHS to be evaluated in the calling scope not as a name within the scope of DT.
del <- c('a','b')
DT <- data.table(a=1:10, b=1:10, c=1:10, d=1:10)
DT[, (del) := NULL]
DT <- <- data.table(a=1:10, b=1:10, c=1:10, d=1:10)
DT[, {del} := NULL]
# force or `c` would also work.
You can also use set, which avoids the overhead of [.data.table, and also works for data.frames!
df <- data.frame(a=1:10, b=1:10, c=1:10, d=1:10)
DT <- data.table(df)
# drop `a` from df (no copying involved)
set(df, j = 'a', value = NULL)
# drop `b` from DT (no copying involved)
set(DT, j = 'b', value = NULL)
There is a potentially more powerful strategy based on the fact that grep() will return a numeric vector. If you have a long list of variables as I do in one of my dataset, some variables that end in ".A" and others that end in ".B" and you only want the ones that end in ".A" (along with all the variables that don't match either pattern, do this:
dfrm2 <- dfrm[ , -grep("\\.B$", names(dfrm)) ]
For the case at hand, using Joris Meys example, it might not be as compact, but it would be:
DF <- DF[, -grep( paste("^",drops,"$", sep="", collapse="|"), names(DF) )]
Another dplyr answer.
Use select(-column).
If your variables have some common naming structure, you might try starts_with(). For example
library(dplyr)
df <- data.frame(var1 = rnorm(5), var2 = rnorm(5), var3 = rnorm (5),
var4 = rnorm(5), char1 = rnorm(5), char2 = rnorm(5))
df
# var2 char1 var4 var3 char2 var1
#1 -0.4629512 -0.3595079 -0.04763169 0.6398194 0.70996579 0.75879754
#2 0.5489027 0.1572841 -1.65313658 -1.3228020 -1.42785427 0.31168919
#3 -0.1707694 -0.9036500 0.47583030 -0.6636173 0.02116066 0.03983268
df1 <- df %>% select(-starts_with("char"))
df1
# var2 var4 var3 var1
#1 -0.4629512 -0.04763169 0.6398194 0.75879754
#2 0.5489027 -1.65313658 -1.3228020 0.31168919
#3 -0.1707694 0.47583030 -0.6636173 0.03983268
If you want to drop a sequence of variables in the data frame, you can use :. For example if you wanted to drop var2, var3, and all variables in between, you'd just be left with var1:
df2 <- df1 %>% select(-c(var2:var3) )
df2
# var1
#1 0.75879754
#2 0.31168919
#3 0.03983268
Dplyr Solution
I doubt this will get much attention down here, but if you have a list of columns that you want to remove, and you want to do it in a dplyr chain I use one_of() in the select clause:
Here is a simple, reproducable example:
undesired <- c('mpg', 'cyl', 'hp')
mtcars <- mtcars %>%
select(-one_of(undesired))
Documentation can be found by running ?one_of or here:
http://genomicsclass.github.io/book/pages/dplyr_tutorial.html
Another possibility:
df <- df[, setdiff(names(df), c("a", "c"))]
or
df <- df[, grep('^(a|c)$', names(df), invert=TRUE)]
DF <- data.frame(
x=1:10,
y=10:1,
z=rep(5,10),
a=11:20
)
DF
Output:
x y z a
1 1 10 5 11
2 2 9 5 12
3 3 8 5 13
4 4 7 5 14
5 5 6 5 15
6 6 5 5 16
7 7 4 5 17
8 8 3 5 18
9 9 2 5 19
10 10 1 5 20
DF[c("a","x")] <- list(NULL)
Output:
y z
1 10 5
2 9 5
3 8 5
4 7 5
5 6 5
6 5 5
7 4 5
8 3 5
9 2 5
10 1 5
Out of interest, this flags up one of R's weird multiple syntax inconsistencies. For example given a two-column data frame:
df <- data.frame(x=1, y=2)
This gives a data frame
subset(df, select=-y)
but this gives a vector
df[,-2]
This is all explained in ?[ but it's not exactly expected behaviour. Well at least not to me...
Here is a dplyr way to go about it:
#df[ -c(1,3:6, 12) ] # original
df.cut <- df %>% select(-col.to.drop.1, -col.to.drop.2, ..., -col.to.drop.6) # with dplyr::select()
I like this because it's intuitive to read & understand without annotation and robust to columns changing position within the data frame. It also follows the vectorized idiom using - to remove elements.
I keep thinking there must be a better idiom, but for subtraction of columns by name, I tend to do the following:
df <- data.frame(a=1:10, b=1:10, c=1:10, d=1:10)
# return everything except a and c
df <- df[,-match(c("a","c"),names(df))]
df
There's a function called dropNamed() in Bernd Bischl's BBmisc package that does exactly this.
BBmisc::dropNamed(df, "x")
The advantage is that it avoids repeating the data frame argument and thus is suitable for piping in magrittr (just like the dplyr approaches):
df %>% BBmisc::dropNamed("x")
Another solution if you don't want to use #hadley's above: If "COLUMN_NAME" is the name of the column you want to drop:
df[,-which(names(df) == "COLUMN_NAME")]
Beyond select(-one_of(drop_col_names)) demonstrated in earlier answers, there are a couple other dplyr options for dropping columns using select() that do not involve defining all the specific column names (using the dplyr starwars sample data for some variety in column names):
library(dplyr)
starwars %>%
select(-(name:mass)) %>% # the range of columns from 'name' to 'mass'
select(-contains('color')) %>% # any column name that contains 'color'
select(-starts_with('bi')) %>% # any column name that starts with 'bi'
select(-ends_with('er')) %>% # any column name that ends with 'er'
select(-matches('^f.+s$')) %>% # any column name matching the regex pattern
select_if(~!is.list(.)) %>% # not by column name but by data type
head(2)
# A tibble: 2 x 2
homeworld species
<chr> <chr>
1 Tatooine Human
2 Tatooine Droid
If you need to drop a column that may or may not exist in the data frame, here's a slight twist using select_if() that unlike using one_of() will not throw an Unknown columns: warning if the column name does not exist. In this example 'bad_column' is not a column in the data frame:
starwars %>%
select_if(!names(.) %in% c('height', 'mass', 'bad_column'))
Provide the data frame and a string of comma separated names to remove:
remove_features <- function(df, features) {
rem_vec <- unlist(strsplit(features, ', '))
res <- df[,!(names(df) %in% rem_vec)]
return(res)
}
Usage:
remove_features(iris, "Sepal.Length, Petal.Width")
Drop and delete columns by columns name in data frame.
A <- df[ , c("Name","Name1","Name2","Name3")]
There are a lot of ways you can do...
Option-1:
df[ , -which(names(df) %in% c("name1","name2"))]
Option-2:
df[!names(df) %in% c("name1", "name2")]
Option-3:
subset(df, select=-c(name1,name2))
Find the index of the columns you want to drop using which. Give these indexes a negative sign (*-1). Then subset on those values, which will remove them from the dataframe. This is an example.
DF <- data.frame(one=c('a','b'), two=c('c', 'd'), three=c('e', 'f'), four=c('g', 'h'))
DF
# one two three four
#1 a d f i
#2 b e g j
DF[which(names(DF) %in% c('two','three')) *-1]
# one four
#1 a g
#2 b h
If you have a large data.frame and are low on memory use [ . . . . or rm and within to remove columns of a data.frame, as subset is currently (R 3.6.2) using more memory - beside the hint of the manual to use subset interactively.
getData <- function() {
n <- 1e7
set.seed(7)
data.frame(a = runif(n), b = runif(n), c = runif(n), d = runif(n))
}
DF <- getData()
tt <- sum(.Internal(gc(FALSE, TRUE, TRUE))[13:14])
DF <- DF[setdiff(names(DF), c("a", "c"))] ##
#DF <- DF[!(names(DF) %in% c("a", "c"))] #Alternative
#DF <- DF[-match(c("a","c"),names(DF))] #Alternative
sum(.Internal(gc(FALSE, FALSE, TRUE))[13:14]) - tt
#0.1 MB are used
DF <- getData()
tt <- sum(.Internal(gc(FALSE, TRUE, TRUE))[13:14])
DF <- subset(DF, select = -c(a, c)) ##
sum(.Internal(gc(FALSE, FALSE, TRUE))[13:14]) - tt
#357 MB are used
DF <- getData()
tt <- sum(.Internal(gc(FALSE, TRUE, TRUE))[13:14])
DF <- within(DF, rm(a, c)) ##
sum(.Internal(gc(FALSE, FALSE, TRUE))[13:14]) - tt
#0.1 MB are used
DF <- getData()
tt <- sum(.Internal(gc(FALSE, TRUE, TRUE))[13:14])
DF[c("a", "c")] <- NULL ##
sum(.Internal(gc(FALSE, FALSE, TRUE))[13:14]) - tt
#0.1 MB are used
Another data.table option which hasn't been posted yet is using the special verb .SD, which stands for subset of data. Together with the .SDcols argument you can select/drop columns by name or index.
require(data.table)
# data
dt = data.table(
A = LETTERS[1:5],
B = 1:5,
C = rep(TRUE, 5)
)
# delete B
dt[ , .SD, .SDcols =! 'B' ]
# delete all matches (i.e. all columns)
cols = grep('[A-Z]+', names(dt), value = TRUE)
dt[ , .SD, .SDcols =! cols ]
A summary of all the options for such a task in data.table can be found here
df <- data.frame(
+ a=1:5,
+ b=6:10,
+ c=rep(22,5),
+ d=round(runif(5)*100, 2),
+ e=round(runif(5)*100, 2),
+ f=round(runif(5)*100, 2),
+ g=round(runif(5)*100, 2),
+ h=round(runif(5)*100, 2)
+ )
> df
a b c d e f g h
1 1 6 22 76.31 39.96 66.62 72.75 73.14
2 2 7 22 53.41 94.85 96.02 97.31 85.32
3 3 8 22 98.29 38.95 12.61 29.67 88.45
4 4 9 22 20.04 53.53 83.07 77.50 94.99
5 5 10 22 5.67 0.42 15.07 59.75 31.21
> # remove cols: d g h
> newDf <- df[, c(1:3, 5), drop=TRUE]
> newDf
a b c e
1 1 6 22 39.96
2 2 7 22 94.85
3 3 8 22 38.95
4 4 9 22 53.53
5 5 10 22 0.42
Another option using the function fselect from the collapse package. Here is a reproducible example:
DF <- data.frame(
x=1:10,
y=10:1,
z=rep(5,10),
a=11:20
)
library(collapse)
fselect(DF, -z)
#> x y a
#> 1 1 10 11
#> 2 2 9 12
#> 3 3 8 13
#> 4 4 7 14
#> 5 5 6 15
#> 6 6 5 16
#> 7 7 4 17
#> 8 8 3 18
#> 9 9 2 19
#> 10 10 1 20
Created on 2022-08-26 with reprex v2.0.2

Aggregating data frame rows using an input vector

I have this toy data.frame:
df = data.frame(id = c("a","b","c","d"), value = c(2,3,6,5))
and I'd like to aggregate its rows according to this toy vector:
collapsed.ids = c("a,b","c","d")
where the aggregated data.frame should keep max(df$value) of its aggregated rows.
So for this toy example the output would be:
> aggregated.df
id value
1 a,b 3
2 c 6
3 d 5
I should note that my real data.frame is ~150,000 rows
I would use data.table for this.
Something like the following should work:
library(data.table)
DT <- data.table(df, key = "id") # Main data.table
Key <- data.table(ind = collapsed.ids) # your "Key" table
## We need your "Key" table in a long form
Key <- Key[, list(id = unlist(strsplit(ind, ",", fixed = TRUE))), by = ind]
setkey(Key, id) # Set the key to facilitate a merge
## Merge and aggregate in one step
DT[Key][, list(value = max(value)), by = ind]
# ind value
# 1: a,b 3
# 2: c 6
# 3: d 5
You don't need data.table, you can just use base R.
split.ids <- strsplit(collapsed.ids, ",")
split.df <- data.frame(id = tmp <- unlist(split.ids),
joinid = rep(collapsed.ids, sapply(split.ids, length)))
aggregated.df <- aggregate(value ~ id, data = merge(df, split.df), max)
Result:
# id value
# 1 a,b 3
# 2 c 6
# 3 d 5
Benchmark
df <- df[rep(1:4, 50000), ] # Make a big data.frame
system.time(...) # of the above code
# user system elapsed
# 1.700 0.154 1.947
EDIT: Apparently Ananda's code runs in 0.039, so I'm eating crow. But either are acceptable for this size.

How can I delete column from data frame without causing a memory allocation error? [duplicate]

I have a number of columns that I would like to remove from a data frame. I know that we can delete them individually using something like:
df$x <- NULL
But I was hoping to do this with fewer commands.
Also, I know that I could drop columns using integer indexing like this:
df <- df[ -c(1, 3:6, 12) ]
But I am concerned that the relative position of my variables may change.
Given how powerful R is, I figured there might be a better way than dropping each column one by one.
You can use a simple list of names :
DF <- data.frame(
x=1:10,
y=10:1,
z=rep(5,10),
a=11:20
)
drops <- c("x","z")
DF[ , !(names(DF) %in% drops)]
Or, alternatively, you can make a list of those to keep and refer to them by name :
keeps <- c("y", "a")
DF[keeps]
EDIT :
For those still not acquainted with the drop argument of the indexing function, if you want to keep one column as a data frame, you do:
keeps <- "y"
DF[ , keeps, drop = FALSE]
drop=TRUE (or not mentioning it) will drop unnecessary dimensions, and hence return a vector with the values of column y.
There's also the subset command, useful if you know which columns you want:
df <- data.frame(a = 1:10, b = 2:11, c = 3:12)
df <- subset(df, select = c(a, c))
UPDATED after comment by #hadley: To drop columns a,c you could do:
df <- subset(df, select = -c(a, c))
within(df, rm(x))
is probably easiest, or for multiple variables:
within(df, rm(x, y))
Or if you're dealing with data.tables (per How do you delete a column by name in data.table?):
dt[, x := NULL] # Deletes column x by reference instantly.
dt[, !"x"] # Selects all but x into a new data.table.
or for multiple variables
dt[, c("x","y") := NULL]
dt[, !c("x", "y")]
You could use %in% like this:
df[, !(colnames(df) %in% c("x","bar","foo"))]
list(NULL) also works:
dat <- mtcars
colnames(dat)
# [1] "mpg" "cyl" "disp" "hp" "drat" "wt" "qsec" "vs" "am" "gear"
# [11] "carb"
dat[,c("mpg","cyl","wt")] <- list(NULL)
colnames(dat)
# [1] "disp" "hp" "drat" "qsec" "vs" "am" "gear" "carb"
If you want remove the columns by reference and avoid the internal copying associated with data.frames then you can use the data.table package and the function :=
You can pass a character vector names to the left hand side of the := operator, and NULL as the RHS.
library(data.table)
df <- data.frame(a=1:10, b=1:10, c=1:10, d=1:10)
DT <- data.table(df)
# or more simply DT <- data.table(a=1:10, b=1:10, c=1:10, d=1:10) #
DT[, c('a','b') := NULL]
If you want to predefine the names as as character vector outside the call to [, wrap the name of the object in () or {} to force the LHS to be evaluated in the calling scope not as a name within the scope of DT.
del <- c('a','b')
DT <- data.table(a=1:10, b=1:10, c=1:10, d=1:10)
DT[, (del) := NULL]
DT <- <- data.table(a=1:10, b=1:10, c=1:10, d=1:10)
DT[, {del} := NULL]
# force or `c` would also work.
You can also use set, which avoids the overhead of [.data.table, and also works for data.frames!
df <- data.frame(a=1:10, b=1:10, c=1:10, d=1:10)
DT <- data.table(df)
# drop `a` from df (no copying involved)
set(df, j = 'a', value = NULL)
# drop `b` from DT (no copying involved)
set(DT, j = 'b', value = NULL)
There is a potentially more powerful strategy based on the fact that grep() will return a numeric vector. If you have a long list of variables as I do in one of my dataset, some variables that end in ".A" and others that end in ".B" and you only want the ones that end in ".A" (along with all the variables that don't match either pattern, do this:
dfrm2 <- dfrm[ , -grep("\\.B$", names(dfrm)) ]
For the case at hand, using Joris Meys example, it might not be as compact, but it would be:
DF <- DF[, -grep( paste("^",drops,"$", sep="", collapse="|"), names(DF) )]
Another dplyr answer.
Use select(-column).
If your variables have some common naming structure, you might try starts_with(). For example
library(dplyr)
df <- data.frame(var1 = rnorm(5), var2 = rnorm(5), var3 = rnorm (5),
var4 = rnorm(5), char1 = rnorm(5), char2 = rnorm(5))
df
# var2 char1 var4 var3 char2 var1
#1 -0.4629512 -0.3595079 -0.04763169 0.6398194 0.70996579 0.75879754
#2 0.5489027 0.1572841 -1.65313658 -1.3228020 -1.42785427 0.31168919
#3 -0.1707694 -0.9036500 0.47583030 -0.6636173 0.02116066 0.03983268
df1 <- df %>% select(-starts_with("char"))
df1
# var2 var4 var3 var1
#1 -0.4629512 -0.04763169 0.6398194 0.75879754
#2 0.5489027 -1.65313658 -1.3228020 0.31168919
#3 -0.1707694 0.47583030 -0.6636173 0.03983268
If you want to drop a sequence of variables in the data frame, you can use :. For example if you wanted to drop var2, var3, and all variables in between, you'd just be left with var1:
df2 <- df1 %>% select(-c(var2:var3) )
df2
# var1
#1 0.75879754
#2 0.31168919
#3 0.03983268
Dplyr Solution
I doubt this will get much attention down here, but if you have a list of columns that you want to remove, and you want to do it in a dplyr chain I use one_of() in the select clause:
Here is a simple, reproducable example:
undesired <- c('mpg', 'cyl', 'hp')
mtcars <- mtcars %>%
select(-one_of(undesired))
Documentation can be found by running ?one_of or here:
http://genomicsclass.github.io/book/pages/dplyr_tutorial.html
Another possibility:
df <- df[, setdiff(names(df), c("a", "c"))]
or
df <- df[, grep('^(a|c)$', names(df), invert=TRUE)]
DF <- data.frame(
x=1:10,
y=10:1,
z=rep(5,10),
a=11:20
)
DF
Output:
x y z a
1 1 10 5 11
2 2 9 5 12
3 3 8 5 13
4 4 7 5 14
5 5 6 5 15
6 6 5 5 16
7 7 4 5 17
8 8 3 5 18
9 9 2 5 19
10 10 1 5 20
DF[c("a","x")] <- list(NULL)
Output:
y z
1 10 5
2 9 5
3 8 5
4 7 5
5 6 5
6 5 5
7 4 5
8 3 5
9 2 5
10 1 5
Out of interest, this flags up one of R's weird multiple syntax inconsistencies. For example given a two-column data frame:
df <- data.frame(x=1, y=2)
This gives a data frame
subset(df, select=-y)
but this gives a vector
df[,-2]
This is all explained in ?[ but it's not exactly expected behaviour. Well at least not to me...
Here is a dplyr way to go about it:
#df[ -c(1,3:6, 12) ] # original
df.cut <- df %>% select(-col.to.drop.1, -col.to.drop.2, ..., -col.to.drop.6) # with dplyr::select()
I like this because it's intuitive to read & understand without annotation and robust to columns changing position within the data frame. It also follows the vectorized idiom using - to remove elements.
I keep thinking there must be a better idiom, but for subtraction of columns by name, I tend to do the following:
df <- data.frame(a=1:10, b=1:10, c=1:10, d=1:10)
# return everything except a and c
df <- df[,-match(c("a","c"),names(df))]
df
There's a function called dropNamed() in Bernd Bischl's BBmisc package that does exactly this.
BBmisc::dropNamed(df, "x")
The advantage is that it avoids repeating the data frame argument and thus is suitable for piping in magrittr (just like the dplyr approaches):
df %>% BBmisc::dropNamed("x")
Another solution if you don't want to use #hadley's above: If "COLUMN_NAME" is the name of the column you want to drop:
df[,-which(names(df) == "COLUMN_NAME")]
Beyond select(-one_of(drop_col_names)) demonstrated in earlier answers, there are a couple other dplyr options for dropping columns using select() that do not involve defining all the specific column names (using the dplyr starwars sample data for some variety in column names):
library(dplyr)
starwars %>%
select(-(name:mass)) %>% # the range of columns from 'name' to 'mass'
select(-contains('color')) %>% # any column name that contains 'color'
select(-starts_with('bi')) %>% # any column name that starts with 'bi'
select(-ends_with('er')) %>% # any column name that ends with 'er'
select(-matches('^f.+s$')) %>% # any column name matching the regex pattern
select_if(~!is.list(.)) %>% # not by column name but by data type
head(2)
# A tibble: 2 x 2
homeworld species
<chr> <chr>
1 Tatooine Human
2 Tatooine Droid
If you need to drop a column that may or may not exist in the data frame, here's a slight twist using select_if() that unlike using one_of() will not throw an Unknown columns: warning if the column name does not exist. In this example 'bad_column' is not a column in the data frame:
starwars %>%
select_if(!names(.) %in% c('height', 'mass', 'bad_column'))
Provide the data frame and a string of comma separated names to remove:
remove_features <- function(df, features) {
rem_vec <- unlist(strsplit(features, ', '))
res <- df[,!(names(df) %in% rem_vec)]
return(res)
}
Usage:
remove_features(iris, "Sepal.Length, Petal.Width")
Drop and delete columns by columns name in data frame.
A <- df[ , c("Name","Name1","Name2","Name3")]
There are a lot of ways you can do...
Option-1:
df[ , -which(names(df) %in% c("name1","name2"))]
Option-2:
df[!names(df) %in% c("name1", "name2")]
Option-3:
subset(df, select=-c(name1,name2))
Find the index of the columns you want to drop using which. Give these indexes a negative sign (*-1). Then subset on those values, which will remove them from the dataframe. This is an example.
DF <- data.frame(one=c('a','b'), two=c('c', 'd'), three=c('e', 'f'), four=c('g', 'h'))
DF
# one two three four
#1 a d f i
#2 b e g j
DF[which(names(DF) %in% c('two','three')) *-1]
# one four
#1 a g
#2 b h
If you have a large data.frame and are low on memory use [ . . . . or rm and within to remove columns of a data.frame, as subset is currently (R 3.6.2) using more memory - beside the hint of the manual to use subset interactively.
getData <- function() {
n <- 1e7
set.seed(7)
data.frame(a = runif(n), b = runif(n), c = runif(n), d = runif(n))
}
DF <- getData()
tt <- sum(.Internal(gc(FALSE, TRUE, TRUE))[13:14])
DF <- DF[setdiff(names(DF), c("a", "c"))] ##
#DF <- DF[!(names(DF) %in% c("a", "c"))] #Alternative
#DF <- DF[-match(c("a","c"),names(DF))] #Alternative
sum(.Internal(gc(FALSE, FALSE, TRUE))[13:14]) - tt
#0.1 MB are used
DF <- getData()
tt <- sum(.Internal(gc(FALSE, TRUE, TRUE))[13:14])
DF <- subset(DF, select = -c(a, c)) ##
sum(.Internal(gc(FALSE, FALSE, TRUE))[13:14]) - tt
#357 MB are used
DF <- getData()
tt <- sum(.Internal(gc(FALSE, TRUE, TRUE))[13:14])
DF <- within(DF, rm(a, c)) ##
sum(.Internal(gc(FALSE, FALSE, TRUE))[13:14]) - tt
#0.1 MB are used
DF <- getData()
tt <- sum(.Internal(gc(FALSE, TRUE, TRUE))[13:14])
DF[c("a", "c")] <- NULL ##
sum(.Internal(gc(FALSE, FALSE, TRUE))[13:14]) - tt
#0.1 MB are used
Another data.table option which hasn't been posted yet is using the special verb .SD, which stands for subset of data. Together with the .SDcols argument you can select/drop columns by name or index.
require(data.table)
# data
dt = data.table(
A = LETTERS[1:5],
B = 1:5,
C = rep(TRUE, 5)
)
# delete B
dt[ , .SD, .SDcols =! 'B' ]
# delete all matches (i.e. all columns)
cols = grep('[A-Z]+', names(dt), value = TRUE)
dt[ , .SD, .SDcols =! cols ]
A summary of all the options for such a task in data.table can be found here
df <- data.frame(
+ a=1:5,
+ b=6:10,
+ c=rep(22,5),
+ d=round(runif(5)*100, 2),
+ e=round(runif(5)*100, 2),
+ f=round(runif(5)*100, 2),
+ g=round(runif(5)*100, 2),
+ h=round(runif(5)*100, 2)
+ )
> df
a b c d e f g h
1 1 6 22 76.31 39.96 66.62 72.75 73.14
2 2 7 22 53.41 94.85 96.02 97.31 85.32
3 3 8 22 98.29 38.95 12.61 29.67 88.45
4 4 9 22 20.04 53.53 83.07 77.50 94.99
5 5 10 22 5.67 0.42 15.07 59.75 31.21
> # remove cols: d g h
> newDf <- df[, c(1:3, 5), drop=TRUE]
> newDf
a b c e
1 1 6 22 39.96
2 2 7 22 94.85
3 3 8 22 38.95
4 4 9 22 53.53
5 5 10 22 0.42
Another option using the function fselect from the collapse package. Here is a reproducible example:
DF <- data.frame(
x=1:10,
y=10:1,
z=rep(5,10),
a=11:20
)
library(collapse)
fselect(DF, -z)
#> x y a
#> 1 1 10 11
#> 2 2 9 12
#> 3 3 8 13
#> 4 4 7 14
#> 5 5 6 15
#> 6 6 5 16
#> 7 7 4 17
#> 8 8 3 18
#> 9 9 2 19
#> 10 10 1 20
Created on 2022-08-26 with reprex v2.0.2

More concise way to drop a column than data[ , !colnames( data ) %in% "todrop" ] [duplicate]

I have a number of columns that I would like to remove from a data frame. I know that we can delete them individually using something like:
df$x <- NULL
But I was hoping to do this with fewer commands.
Also, I know that I could drop columns using integer indexing like this:
df <- df[ -c(1, 3:6, 12) ]
But I am concerned that the relative position of my variables may change.
Given how powerful R is, I figured there might be a better way than dropping each column one by one.
You can use a simple list of names :
DF <- data.frame(
x=1:10,
y=10:1,
z=rep(5,10),
a=11:20
)
drops <- c("x","z")
DF[ , !(names(DF) %in% drops)]
Or, alternatively, you can make a list of those to keep and refer to them by name :
keeps <- c("y", "a")
DF[keeps]
EDIT :
For those still not acquainted with the drop argument of the indexing function, if you want to keep one column as a data frame, you do:
keeps <- "y"
DF[ , keeps, drop = FALSE]
drop=TRUE (or not mentioning it) will drop unnecessary dimensions, and hence return a vector with the values of column y.
There's also the subset command, useful if you know which columns you want:
df <- data.frame(a = 1:10, b = 2:11, c = 3:12)
df <- subset(df, select = c(a, c))
UPDATED after comment by #hadley: To drop columns a,c you could do:
df <- subset(df, select = -c(a, c))
within(df, rm(x))
is probably easiest, or for multiple variables:
within(df, rm(x, y))
Or if you're dealing with data.tables (per How do you delete a column by name in data.table?):
dt[, x := NULL] # Deletes column x by reference instantly.
dt[, !"x"] # Selects all but x into a new data.table.
or for multiple variables
dt[, c("x","y") := NULL]
dt[, !c("x", "y")]
You could use %in% like this:
df[, !(colnames(df) %in% c("x","bar","foo"))]
list(NULL) also works:
dat <- mtcars
colnames(dat)
# [1] "mpg" "cyl" "disp" "hp" "drat" "wt" "qsec" "vs" "am" "gear"
# [11] "carb"
dat[,c("mpg","cyl","wt")] <- list(NULL)
colnames(dat)
# [1] "disp" "hp" "drat" "qsec" "vs" "am" "gear" "carb"
If you want remove the columns by reference and avoid the internal copying associated with data.frames then you can use the data.table package and the function :=
You can pass a character vector names to the left hand side of the := operator, and NULL as the RHS.
library(data.table)
df <- data.frame(a=1:10, b=1:10, c=1:10, d=1:10)
DT <- data.table(df)
# or more simply DT <- data.table(a=1:10, b=1:10, c=1:10, d=1:10) #
DT[, c('a','b') := NULL]
If you want to predefine the names as as character vector outside the call to [, wrap the name of the object in () or {} to force the LHS to be evaluated in the calling scope not as a name within the scope of DT.
del <- c('a','b')
DT <- data.table(a=1:10, b=1:10, c=1:10, d=1:10)
DT[, (del) := NULL]
DT <- <- data.table(a=1:10, b=1:10, c=1:10, d=1:10)
DT[, {del} := NULL]
# force or `c` would also work.
You can also use set, which avoids the overhead of [.data.table, and also works for data.frames!
df <- data.frame(a=1:10, b=1:10, c=1:10, d=1:10)
DT <- data.table(df)
# drop `a` from df (no copying involved)
set(df, j = 'a', value = NULL)
# drop `b` from DT (no copying involved)
set(DT, j = 'b', value = NULL)
There is a potentially more powerful strategy based on the fact that grep() will return a numeric vector. If you have a long list of variables as I do in one of my dataset, some variables that end in ".A" and others that end in ".B" and you only want the ones that end in ".A" (along with all the variables that don't match either pattern, do this:
dfrm2 <- dfrm[ , -grep("\\.B$", names(dfrm)) ]
For the case at hand, using Joris Meys example, it might not be as compact, but it would be:
DF <- DF[, -grep( paste("^",drops,"$", sep="", collapse="|"), names(DF) )]
Another dplyr answer.
Use select(-column).
If your variables have some common naming structure, you might try starts_with(). For example
library(dplyr)
df <- data.frame(var1 = rnorm(5), var2 = rnorm(5), var3 = rnorm (5),
var4 = rnorm(5), char1 = rnorm(5), char2 = rnorm(5))
df
# var2 char1 var4 var3 char2 var1
#1 -0.4629512 -0.3595079 -0.04763169 0.6398194 0.70996579 0.75879754
#2 0.5489027 0.1572841 -1.65313658 -1.3228020 -1.42785427 0.31168919
#3 -0.1707694 -0.9036500 0.47583030 -0.6636173 0.02116066 0.03983268
df1 <- df %>% select(-starts_with("char"))
df1
# var2 var4 var3 var1
#1 -0.4629512 -0.04763169 0.6398194 0.75879754
#2 0.5489027 -1.65313658 -1.3228020 0.31168919
#3 -0.1707694 0.47583030 -0.6636173 0.03983268
If you want to drop a sequence of variables in the data frame, you can use :. For example if you wanted to drop var2, var3, and all variables in between, you'd just be left with var1:
df2 <- df1 %>% select(-c(var2:var3) )
df2
# var1
#1 0.75879754
#2 0.31168919
#3 0.03983268
Dplyr Solution
I doubt this will get much attention down here, but if you have a list of columns that you want to remove, and you want to do it in a dplyr chain I use one_of() in the select clause:
Here is a simple, reproducable example:
undesired <- c('mpg', 'cyl', 'hp')
mtcars <- mtcars %>%
select(-one_of(undesired))
Documentation can be found by running ?one_of or here:
http://genomicsclass.github.io/book/pages/dplyr_tutorial.html
Another possibility:
df <- df[, setdiff(names(df), c("a", "c"))]
or
df <- df[, grep('^(a|c)$', names(df), invert=TRUE)]
DF <- data.frame(
x=1:10,
y=10:1,
z=rep(5,10),
a=11:20
)
DF
Output:
x y z a
1 1 10 5 11
2 2 9 5 12
3 3 8 5 13
4 4 7 5 14
5 5 6 5 15
6 6 5 5 16
7 7 4 5 17
8 8 3 5 18
9 9 2 5 19
10 10 1 5 20
DF[c("a","x")] <- list(NULL)
Output:
y z
1 10 5
2 9 5
3 8 5
4 7 5
5 6 5
6 5 5
7 4 5
8 3 5
9 2 5
10 1 5
Out of interest, this flags up one of R's weird multiple syntax inconsistencies. For example given a two-column data frame:
df <- data.frame(x=1, y=2)
This gives a data frame
subset(df, select=-y)
but this gives a vector
df[,-2]
This is all explained in ?[ but it's not exactly expected behaviour. Well at least not to me...
Here is a dplyr way to go about it:
#df[ -c(1,3:6, 12) ] # original
df.cut <- df %>% select(-col.to.drop.1, -col.to.drop.2, ..., -col.to.drop.6) # with dplyr::select()
I like this because it's intuitive to read & understand without annotation and robust to columns changing position within the data frame. It also follows the vectorized idiom using - to remove elements.
I keep thinking there must be a better idiom, but for subtraction of columns by name, I tend to do the following:
df <- data.frame(a=1:10, b=1:10, c=1:10, d=1:10)
# return everything except a and c
df <- df[,-match(c("a","c"),names(df))]
df
There's a function called dropNamed() in Bernd Bischl's BBmisc package that does exactly this.
BBmisc::dropNamed(df, "x")
The advantage is that it avoids repeating the data frame argument and thus is suitable for piping in magrittr (just like the dplyr approaches):
df %>% BBmisc::dropNamed("x")
Another solution if you don't want to use #hadley's above: If "COLUMN_NAME" is the name of the column you want to drop:
df[,-which(names(df) == "COLUMN_NAME")]
Beyond select(-one_of(drop_col_names)) demonstrated in earlier answers, there are a couple other dplyr options for dropping columns using select() that do not involve defining all the specific column names (using the dplyr starwars sample data for some variety in column names):
library(dplyr)
starwars %>%
select(-(name:mass)) %>% # the range of columns from 'name' to 'mass'
select(-contains('color')) %>% # any column name that contains 'color'
select(-starts_with('bi')) %>% # any column name that starts with 'bi'
select(-ends_with('er')) %>% # any column name that ends with 'er'
select(-matches('^f.+s$')) %>% # any column name matching the regex pattern
select_if(~!is.list(.)) %>% # not by column name but by data type
head(2)
# A tibble: 2 x 2
homeworld species
<chr> <chr>
1 Tatooine Human
2 Tatooine Droid
If you need to drop a column that may or may not exist in the data frame, here's a slight twist using select_if() that unlike using one_of() will not throw an Unknown columns: warning if the column name does not exist. In this example 'bad_column' is not a column in the data frame:
starwars %>%
select_if(!names(.) %in% c('height', 'mass', 'bad_column'))
Provide the data frame and a string of comma separated names to remove:
remove_features <- function(df, features) {
rem_vec <- unlist(strsplit(features, ', '))
res <- df[,!(names(df) %in% rem_vec)]
return(res)
}
Usage:
remove_features(iris, "Sepal.Length, Petal.Width")
Drop and delete columns by columns name in data frame.
A <- df[ , c("Name","Name1","Name2","Name3")]
There are a lot of ways you can do...
Option-1:
df[ , -which(names(df) %in% c("name1","name2"))]
Option-2:
df[!names(df) %in% c("name1", "name2")]
Option-3:
subset(df, select=-c(name1,name2))
Find the index of the columns you want to drop using which. Give these indexes a negative sign (*-1). Then subset on those values, which will remove them from the dataframe. This is an example.
DF <- data.frame(one=c('a','b'), two=c('c', 'd'), three=c('e', 'f'), four=c('g', 'h'))
DF
# one two three four
#1 a d f i
#2 b e g j
DF[which(names(DF) %in% c('two','three')) *-1]
# one four
#1 a g
#2 b h
If you have a large data.frame and are low on memory use [ . . . . or rm and within to remove columns of a data.frame, as subset is currently (R 3.6.2) using more memory - beside the hint of the manual to use subset interactively.
getData <- function() {
n <- 1e7
set.seed(7)
data.frame(a = runif(n), b = runif(n), c = runif(n), d = runif(n))
}
DF <- getData()
tt <- sum(.Internal(gc(FALSE, TRUE, TRUE))[13:14])
DF <- DF[setdiff(names(DF), c("a", "c"))] ##
#DF <- DF[!(names(DF) %in% c("a", "c"))] #Alternative
#DF <- DF[-match(c("a","c"),names(DF))] #Alternative
sum(.Internal(gc(FALSE, FALSE, TRUE))[13:14]) - tt
#0.1 MB are used
DF <- getData()
tt <- sum(.Internal(gc(FALSE, TRUE, TRUE))[13:14])
DF <- subset(DF, select = -c(a, c)) ##
sum(.Internal(gc(FALSE, FALSE, TRUE))[13:14]) - tt
#357 MB are used
DF <- getData()
tt <- sum(.Internal(gc(FALSE, TRUE, TRUE))[13:14])
DF <- within(DF, rm(a, c)) ##
sum(.Internal(gc(FALSE, FALSE, TRUE))[13:14]) - tt
#0.1 MB are used
DF <- getData()
tt <- sum(.Internal(gc(FALSE, TRUE, TRUE))[13:14])
DF[c("a", "c")] <- NULL ##
sum(.Internal(gc(FALSE, FALSE, TRUE))[13:14]) - tt
#0.1 MB are used
Another data.table option which hasn't been posted yet is using the special verb .SD, which stands for subset of data. Together with the .SDcols argument you can select/drop columns by name or index.
require(data.table)
# data
dt = data.table(
A = LETTERS[1:5],
B = 1:5,
C = rep(TRUE, 5)
)
# delete B
dt[ , .SD, .SDcols =! 'B' ]
# delete all matches (i.e. all columns)
cols = grep('[A-Z]+', names(dt), value = TRUE)
dt[ , .SD, .SDcols =! cols ]
A summary of all the options for such a task in data.table can be found here
df <- data.frame(
+ a=1:5,
+ b=6:10,
+ c=rep(22,5),
+ d=round(runif(5)*100, 2),
+ e=round(runif(5)*100, 2),
+ f=round(runif(5)*100, 2),
+ g=round(runif(5)*100, 2),
+ h=round(runif(5)*100, 2)
+ )
> df
a b c d e f g h
1 1 6 22 76.31 39.96 66.62 72.75 73.14
2 2 7 22 53.41 94.85 96.02 97.31 85.32
3 3 8 22 98.29 38.95 12.61 29.67 88.45
4 4 9 22 20.04 53.53 83.07 77.50 94.99
5 5 10 22 5.67 0.42 15.07 59.75 31.21
> # remove cols: d g h
> newDf <- df[, c(1:3, 5), drop=TRUE]
> newDf
a b c e
1 1 6 22 39.96
2 2 7 22 94.85
3 3 8 22 38.95
4 4 9 22 53.53
5 5 10 22 0.42
Another option using the function fselect from the collapse package. Here is a reproducible example:
DF <- data.frame(
x=1:10,
y=10:1,
z=rep(5,10),
a=11:20
)
library(collapse)
fselect(DF, -z)
#> x y a
#> 1 1 10 11
#> 2 2 9 12
#> 3 3 8 13
#> 4 4 7 14
#> 5 5 6 15
#> 6 6 5 16
#> 7 7 4 17
#> 8 8 3 18
#> 9 9 2 19
#> 10 10 1 20
Created on 2022-08-26 with reprex v2.0.2

Conditional merge/replacement in R

I have two data frames:
df1
x1 x2
1 a
2 b
3 c
4 d
and
df2
x1 x2
2 zz
3 qq
I want to replace some of the values in df1$x2 with values in df2$x2 based on the conditional match between df1$x1 and df2$x2 to produce:
df1
x1 x2
1 a
2 zz
3 qq
4 d
use match(), assuming values in df1 are unique.
df1 <- data.frame(x1=1:4,x2=letters[1:4],stringsAsFactors=FALSE)
df2 <- data.frame(x1=2:3,x2=c("zz","qq"),stringsAsFactors=FALSE)
df1$x2[match(df2$x1,df1$x1)] <- df2$x2
> df1
x1 x2
1 1 a
2 2 zz
3 3 qq
4 4 d
If the values aren't unique, use :
for(id in 1:nrow(df2)){
df1$x2[df1$x1 %in% df2$x1[id]] <- df2$x2[id]
}
We can use {powerjoin}, and handle the conflicting columns with coalesce_yx
library(powerjoin)
df1 <- data.frame(x1 = 1:4, x2 = letters[1:4], stringsAsFactors = FALSE)
df2 <- data.frame(x1 = 2:3, x2 = c("zz", "qq"), stringsAsFactors = FALSE)
power_left_join(df1, df2, by = "x1", conflict = coalesce_yx)
#> x1 x2
#> 1 1 a
#> 2 2 zz
#> 3 3 qq
#> 4 4 d
The first part of Joris' answer is good, but in the case of non-unique values in df1, the row-wise for-loop will not scale well on large data.frames.
You could use a data.table "update join" to modify in place, which will be quite fast:
library(data.table)
setDT(df1); setDT(df2)
df1[df2, on = .(x1), x2 := i.x2]
Or, assuming you don't care about maintaining row order, you could use SQL-inspired dplyr:
library(dplyr)
union_all(
inner_join( df1["x1"], df2 ), # x1 from df1 with matches in df2, x2 from df2
anti_join( df1, df2["x1"] ) # rows of df1 with no match in df2
) # %>% arrange(x1) # optional, won't maintain an arbitrary row order
Either of these will scale much better than the row-wise for-loop.
I see that Joris and Aaron have both chosen to build examples without factors. I can certainly understand that choice. For the reader with columns that are already factors there would also be to option of coercion to "character". There is a strategy that avoids that constraint and which also allows for the possibility that there may be indices in df2 that are not in df1 which I believe would invalidate Joris Meys' but not Aaron's solutions posted so far:
df1 <- data.frame(x1=1:4,x2=letters[1:4])
df2 <- data.frame(x1=c(2,3,5), x2=c("zz", "qq", "xx") )
It requires that the levels be expanded to include the intersection of both factor variables and then also the need to drop non-matching columns (= NA values) in match(df1$x1, df2$x1)
df1$x2 <- factor(df1$x2 , levels=c(levels(df1$x2), levels(df2$x2)) )
df1$x2[na.omit(match(df2$x1,df1$x1))] <- df2$x2[which(df2$x1 %in% df1$x1)]
df1
#-----------
x1 x2
1 1 a
2 2 zz
3 3 qq
4 4 d
(Note that recent versions of R do not have stringsAsFactors set to TRUE in the data.frame function defaults, unlike it was for most of the history of R.)
You can do it by matching the other way too but it's more complicated. Joris's solution is better but I'm putting this here also as a reminder to think about which way you want to match.
df1 <- data.frame(x1=1:4, x2=letters[1:4], stringsAsFactors=FALSE)
df2 <- data.frame(x1=2:3, x2=c("zz", "qq"), stringsAsFactors=FALSE)
swap <- df2$x2[match(df1$x1, df2$x1)]
ok <- !is.na(swap)
df1$x2[ok] <- swap[ok]
> df1
x1 x2
1 1 a
2 2 zz
3 3 qq
4 4 d
It can be done with dplyr.
library(dplyr)
full_join(df1,df2,by = c("x1" = "x1")) %>%
transmute(x1 = x1,x2 = coalesce(x2.y,x2.x))
x1 x2
1 1 a
2 2 zz
3 3 qq
4 4 d
new here, but using the following dplyr approach seems to work as well
similar but slightly different to one of the answers above
df3 <- anti_join(df1, df2, by = "x1")
df3 <- rbind(df3, df2)
df3
As of dplyr 1.0.0 there is a function specifically for this:
library(dplyr)
df1 <- data.frame(x1=1:4,x2=letters[1:4],stringsAsFactors=FALSE)
df2 <- data.frame(x1=2:3,x2=c("zz","qq"),stringsAsFactors=FALSE)
rows_update(df1, df2, by = "x1")
See https://stackoverflow.com/a/65254214/2738526

Resources