Selecting a subset of columns in a data.table

I'd like to print all the columns of a data.table dt except one named V3, but I want to refer to it by name rather than by number. This is the code I have:
library(data.table)
dt = data.table(matrix(sample(c(0,1),5,rep=T),50,10))
dt[,-3,with=FALSE] # Is this the only way to not print column "V3"?
Using the data frame way, one could do this through the code:
df = data.frame(matrix(sample(c(0,1),5,rep=T),50,10))
df[, !(colnames(df) %in% c("X3"))]
So, my question is: is there another way to leave one column out when printing a data.table, without having to refer to it by number? I'd like something similar to the data.frame syntax above, but for a data.table.

Use syntax very similar to the data.frame version, but add the argument with=FALSE:
dt[, setdiff(colnames(dt),"V9"), with=FALSE]
V1 V2 V3 V4 V5 V6 V7 V8 V10
1: 1 1 1 1 1 1 1 1 1
2: 0 0 0 0 0 0 0 0 0
3: 1 1 1 1 1 1 1 1 1
4: 0 0 0 0 0 0 0 0 0
5: 0 0 0 0 0 0 0 0 0
6: 1 1 1 1 1 1 1 1 1
The use of with=FALSE is nicely explained in the documentation for the j argument in ?data.table:
j: A single column name, single expression of column names, list() of expressions of column names, an expression or function call that evaluates to list (including data.frame and data.table which are lists, too), or (when with=FALSE) same as j in [.data.frame.
From v1.10.2 onwards it is also possible to do this as follows:
keep <- setdiff(names(dt), "V9")
dt[, ..keep]
Prefixing a symbol with .. looks it up in the calling scope (i.e. the global environment), and its value is taken to be column names or numbers.

Edit 2019-09-27 with a more modern approach
You can do this with patterns (see the sketch below); or, you can do it with ! if there's a vector of names already:
dt[ , !'V3']
# or
drop_cols = 'V3'
dt[ , !..drop_cols]
.. means "look up one level"
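The patterns route mentioned above would look like this (a sketch; negated .SDcols requires a reasonably recent data.table):
dt[, .SD, .SDcols = !patterns("^V3$")]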
Older approach using with=FALSE (data.table is steadily moving away from this argument):
Here's a way that uses grep to convert the name to a numeric index, allowing negative column indexing:
dt[, -grep("^V3$", names(dt)), with=FALSE]
You did say "V3" was to be excluded, right?

Maybe it's only in recent versions of data.table (I'm using 1.9.6), but you can do:
dt[, -'V3']
For several columns:
dt[, -c('V3', 'V9')]
Note that the quotes around the variable names are necessary.
Also, if your column names are stored in a variable, say cols, you'll need to do dt[, -cols, with=FALSE].

From version 1.12.0 onwards, it is also possible to select columns using regular expressions on their names:
iris_DT <- as.data.table(iris)
iris_DT[, .SD, .SDcols = patterns(".e.al")]
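For reference, on iris that pattern keeps the four measurement columns and drops Species (a quick sanity check; output omitted):
# ".e.al" matches Sepal.Length, Sepal.Width, Petal.Length and Petal.Width
head(iris_DT[, .SD, .SDcols = patterns(".e.al")], 2)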

To summarize the answer to this question, and also to make it
a) negation-friendly (so that you can also select columns by negation),
b) pipe-line friendly (so that you can use in a pipeline with %>% operator), and
c) so that you can select using both column numbers and column names,
here are the available options:
library(data.table)
library(magrittr)  # needed for %>%
select1 <- function (dt, range) dt[, range, with=F]
select2 <- function (dt, range) dt[, ..range]
select3 <- function (dt, range) dt[, .SD, .SDcols=range]
dt <- as.data.table(ggplot2::diamonds)  # diamonds is a tibble, so convert it first
range <- 1:3 # or
range <- dt %>% names %>% .[1:3]
dt %>% select1(range);
dt %>% select2(range);
dt %>% select3(range);
dt %>% select1(-range);
dt %>% select2(-range);
dt %>% select3(-range); # DOES NOT WORK
We also note that the following does not work:
dt %>% .[, ..(names(dt)[1:3])] # DOES NOT WORK
Therefore, the most universal and fast way to select multiple columns in data.table is the following:
# columns are selected using column numbers:
range <- 1:3
dt %>% select1(range);
dt %>% .[, range, with=F]
# The same works if columns are selected using column names:
range <- names(dt) [1:3]
dt %>% select1(range);
dt %>% .[, range, with=F]
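As an aside, more recent data.table versions also accept negation inside .SDcols via - or !, which would make select3 negation-friendly as well; a sketch worth verifying on your installed version:
dt[, .SD, .SDcols = -(1:3)]           # drop by numeric index
dt[, .SD, .SDcols = !names(dt)[1:3]]  # drop by column name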
PS.
If, instead of selecting multiple columns, you want to efficiently delete multiple columns from a data.table by reference (i.e. without copying the entire data.table), you can use data.table's := operator. But I don't know how to do it for multiple columns in one line.
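For the record, a sketch of how that can be done in one line: wrap the character vector of names in parentheses and assign NULL (standard := usage; the names here are illustrative):
drop_cols <- c("V3", "V9")   # illustrative column names
dt[, (drop_cols) := NULL]    # deletes both columns by reference, no copy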

Related

replacing all NA with a 0 in data.table in R

I have a data.table with many columns. There are 4 columns where I want to replace NA with a 0.
I have a working solution:
claimsMonthly[is.na(claim9month),claim9month := 0
][is.na(claim10month),claim10month := 0
][is.na(claim11month),claim11month := 0
][is.na(claim12month),claim12month := 0]
However, this is quite repetitive and I wanted to reduce it by using a loop (not sure if that is the smartest idea, though?):
for (i in 9:12){
claimsMonthly[is.na(paste0("claim", i, "month")), paste0("claim", i, "month") := 0]
}
When I run this loop nothing happens. I guess it is due to the fact that paste0() returns "claim12month", so I get is.na("claim12month"). The result of that is FALSE despite the fact that there are NAs in my data. I guess this has something to do with the quotes?
This is not the first time I have had issues using paste0() or running loops with data.table, so I must be missing something important here.
Any ideas how to fix this?
We can specify the column names ('nm1') in .SDcols, loop over the .SD (Subset of Data.table), and replace the NAs with 0 (replace_na from tidyr):
library(data.table)
library(tidyr)
nm1 <- paste0("claim", 9:12, "month")
setDT(claimsMonthly)[, (nm1) := lapply(.SD, replace_na, 0), .SDcols = nm1]
Or, as @jangorecki mentioned in the comments, nafill from data.table would be better:
setDT(claimsMonthly)[, (nm1) := lapply(.SD, nafill, fill = 0), .SDcols = nm1]
Or, using a loop with set, assign 0 to the columns of interest based on the NA values in each column, specifying i (the row index) and j (the column index/name):
for(j in nm1){
  set(claimsMonthly, i = which(is.na(claimsMonthly[[j]])), j = j, value = 0)
}
Or with setnafill
setnafill(claimsMonthly, cols = nm1, fill = 0)
You can use:
claimsMonthly[, 9:12][is.na(claimsMonthly[, 9:12])] <- 0
Also you can use variable names:
claimsMonthly[c("claim9month", "claim10month","claim11month","claim12month")][is.na(claimsMonthly[c("claim9month", "claim10month","claim11month","claim12month")])] <- 0
Or, even better, you can build a vector of all the variables matching the "claimXXmonth" pattern.
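For instance, a sketch that builds the vector with grep (the regular expression is an assumption about the naming scheme):
nm1 <- grep("^claim[0-9]+month$", names(claimsMonthly), value = TRUE)
setnafill(claimsMonthly, cols = nm1, fill = 0)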

Save column names when extracting columns by variable name

Let's say I have the following data.table.
dt = data.table(one=rep(2,4), two=rnorm(4))
dt
Now I have created a variable holding the name of one column.
col_name = "one"
If I want to return that column as a data.table, I can do one of the following. The first option will return the column name as V1 and the second will actually set the column name to "one".
dt[,.(get(col_name))]
dt[,col_name, with=FALSE]
I'm wondering if there is a way to specify the column name while using the get command. Something like the following, which doesn't work:
dt[,as.symbol(col_name) = .(get(col_name))]
The reason I need the column names with get is that I have a fairly extensive loop in which I'm filling in empty columns. It could end up looking like this, where I loop through and replace imp_val with the median by the columns in cols:
dat2[is.na(get(imp_val)),
as.symbol(imp_val) := dat2[.BY, median(get(imp_val), na.rm=TRUE), on=get(cols)], by=c(get(cols))]
We can specify it in .SDcols:
dt[, .SD,.SDcols = col_name]
Or with ..
dt[, ..col_name]
If the intention is to rename the column as 'col_name':
setnames(dt[, ..col_name], deparse(substitute(col_name)))[]
# col_name
#1: 2
#2: 2
#3: 2
#4: 2
You could also use the tidyverse approach for this. Setup:
library(data.table)
library(magrittr)
library(dplyr)
dt = data.table(one=rep(2,4), two=rnorm(4))
col_name = "one"
Then use select with the non-standard evaluation operator !! (pronounced bang-bang):
> dt %>% dplyr::select(!!col_name)
one
1: 2
2: 2
3: 2
4: 2
The returned object is still a data.table:
> dt %>%
dplyr::select(!!col_name) %>%
class
[1] "data.table" "data.frame"
I'm not sure what you mean by the second part of your question, on replacing NAs with the median. Maybe you could update your question with a small example?

Creating indicator variable columns in dplyr chain

Updated: With apologies to those who replied, in my original example I overlooked the fact that data.frame() created var as a factor rather than as a character vector, as I had intended. I have corrected the example, and this will break at least one of the answers.
--original--
I have a data frame on which I'm performing a series of dplyr and tidyr manipulations, and I would like to add columns for indicator variables, encoded as 0 or 1, within the dplyr chain. Each level of a factor (presently stored as a character vector) should be encoded in a separate column, with column names formed by concatenating a fixed prefix with the variable level: e.g., where var has level a, the new column var_a will be 1 on those rows and 0 on all others.
The following minimal example using base R produces exactly the results that I want (thanks to this blog post), but I'd like to roll it all into the dplyr chain, and can't quite figure out how to do it.
library(dplyr)
df <- data.frame(var = sample(x = letters[1:4], size = 10, replace = TRUE), stringsAsFactors = FALSE)
for(level in unique(df$var)){
df[paste("var", level, sep = "_")] <- ifelse(df$var == level, 1, 0)
}
Note that the real data set contains multiple columns, none of which should be altered or dropped when creating the indicator variables, with the exception of the column var, which could be converted to type factor.
It's not pretty, but this function should work:
dummy <- function(data, col) {
  for (c in col) {
    idx <- which(names(data) == c)
    v <- data[[idx]]
    stopifnot(class(v) == "factor")
    # one indicator column per level, 1 where the row takes that level
    m <- matrix(0, nrow = nrow(data), ncol = nlevels(v))
    m[cbind(seq_along(v), as.integer(v))] <- 1
    colnames(m) <- paste(c, levels(v), sep = "_")
    r <- data.frame(m)
    # splice the indicator columns in at the position of the original column
    if (idx > 1) {
      r <- cbind(data[1:(idx - 1)], r)
    }
    if (idx < ncol(data)) {
      r <- cbind(r, data[(idx + 1):ncol(data)])
    }
    data <- r
  }
  data
}
Here's a sample data.frame
dd <- data.frame(a=runif(30),
b=sample(letters[1:3],30,replace=T),
c=rnorm(30),
d=sample(letters[10:13],30,replace=T)
)
and you specify the columns you want to expand as a character vector. You can do
dd %>% dummy("b")
or
dd %>% dummy(c("b","d"))
It's possible without creating a function, although it does require lapply. If var is a factor, you can work with its levels: bind the original columns to the result of an lapply that loops over the levels of var and creates the indicator values, names them with setNames, and converts them to a tbl_df.
df %>% bind_cols(as_data_frame(setNames(lapply(levels(df$var),
function(x){as.integer(df$var == x)}),
paste0('var2_', levels(df$var)))))
returns
Source: local data frame [10 x 5]
var var_d var_c var2_c var2_d
(fctr) (dbl) (dbl) (int) (int)
1 d 1 0 0 1
2 c 0 1 1 0
3 c 0 1 1 0
4 c 0 1 1 0
5 d 1 0 0 1
6 d 1 0 0 1
7 c 0 1 1 0
8 c 0 1 1 0
9 d 1 0 0 1
10 c 0 1 1 0
If var is a character vector, not a factor, you can do the same thing, but using unique instead of levels:
df %>% bind_cols(as_data_frame(setNames(lapply(unique(df$var),
function(x){as.integer(df$var == x)}),
paste0('var2_', unique(df$var)))))
Two notes:
This approach will work regardless of the data type, but will be slower. If your data is big enough that it matters, it likely makes sense to store the data as a factor anyway, as it contains a lot of repeated values.
Both versions pull data from df$var as it lives in the calling environment, not as it may exist in a larger chain, and assume var is unchanged in whatever is passed. Referencing the dynamic value of var outside dplyr's normal NSE is rather a pain, insofar as I've seen.
One more alternative that's a little simpler and factor-agnostic, using reshape2::dcast:
library(reshape2)
df %>% cbind(1 * !is.na(dcast(df, seq_along(var) ~ var, value.var = 'var')[,-1]))
It still pulls the version of df from the calling environment, so the chain really only determines what you're joining to. Because it uses cbind instead of bind_cols, the result will be a data.frame rather than a tbl_df. If you want to keep everything tbl_df (smart if the data is big), replace the cbind with bind_cols(as_data_frame( ... )); bind_cols doesn't seem to do the conversion for you.
Note, however, that while this version is simpler, it is comparatively slower, both on factor data:
Unit: microseconds
expr min lq mean median uq max neval
factor 358.889 384.0010 479.5746 427.9685 501.580 3995.951 100
unique 547.249 585.4205 696.4709 633.4215 696.402 4528.099 100
dcast 2265.517 2490.5955 2721.1118 2628.0730 2824.949 3928.796 100
and string data:
Unit: microseconds
expr min lq mean median uq max neval
unique 307.190 336.422 414.1031 362.6485 419.3625 3693.340 100
dcast 2117.807 2249.077 2517.0417 2402.4285 2615.7290 3793.178 100
For small data it won't matter, but for bigger data, it may be worth putting up with the complication.
The only requirements for a function to be part of a dplyr pipeline are that it takes a data frame as input, and returns a data frame as output. So, leveraging model.matrix:
make_inds <- function(df, cols=names(df))
{
    # do each variable separately to get around model.matrix dropping aliased columns;
    # wrap df in list() so cbind dispatches to cbind.data.frame and returns a data frame
    do.call(cbind, c(list(df), lapply(cols, function(n) {
        x <- df[[n]]
        mm <- model.matrix(~ x - 1)
        colnames(mm) <- gsub("^x", paste(n, "_", sep=""), colnames(mm))
        mm
    })))
}
# insert into pipeline
data %>% ... %>% make_inds %>% ...
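A self-contained usage sketch (the data and column name are illustrative):
library(magrittr)  # for %>%
set.seed(1)
dd <- data.frame(x = runif(6),
                 g = factor(sample(letters[1:3], 6, replace = TRUE)))
dd %>% make_inds("g")  # appends indicator columns g_a, g_b, g_c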
I landed on this Q&A first because I really wanted to put model.matrix in a magrittr pipe workflow or produce the equivalent output with just tidyverse functions (sorry, baseRs).
Later, I landed on this solution, which uses the functions as elegantly as I had thought possible (but which I wasn't coming up with on my own):
df <- data_frame(var = sample(x = letters[1:4], size = 10, replace = TRUE))
df %>%
mutate(unique_row_id = 1:n()) %>% #The rows need to be unique for `spread` to work.
mutate(dummy = 1) %>%
spread(var, dummy, fill = 0)
So, I'm adding an updated/modified version of the linked solution so that people who land here first don't have to keep looking (like I did).
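If you're on tidyr 1.0 or later, pivot_wider is the successor to spread; a sketch of the same idea (the scalar values_fill form assumes tidyr >= 1.1):
library(dplyr)
library(tidyr)
df %>%
  mutate(unique_row_id = 1:n(), dummy = 1) %>%
  pivot_wider(names_from = var, values_from = dummy, values_fill = 0)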

Adding multiple columns to a data.table, where column names are held in a vector

I want to add a large number of columns to a data.table in R.
The column names are held in a vector a. How can I do it?
x <- data.table(a=1,b=1)
f <- function(x) {list(0)}
The following works:
x <- x[, c("col1","col2","col3") := f()]
but the following doesn't:
a <- c("col1","col2","col3")
x <- x[, a := f()]
How do I add the columns defined within a?
In order to make that work, you have to wrap a in () like this:
x[, (a) := f()]
This results in the following data.table:
> x
a b col1 col2 col3
1: 1 1 0 0 0
Explanation: when you use x[, a:=f()], you assign the outcome of f() to column a (data.table allows this for convenience). Thus a is treated as a name on this occasion. When you use (a), a is treated as an expression (in this case a vector of column names).
Furthermore, you don't need to assign the result back to x with x <-, as the data.table is updated by reference because the := operator is used.
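A quick sketch of the difference; note that x here already has a column literally named a, so the unparenthesized form targets that column rather than the names held in the vector:
library(data.table)
x <- data.table(a = 1, b = 1)
a <- c("col1", "col2", "col3")
x[, a := 0]    # overwrites the existing column "a" with 0
x[, (a) := 0]  # creates columns col1, col2 and col3, each filled with 0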

Replace NA in a certain column with values from equal key from same column

I created a means column for a group based on a criterion C. Now I want those means to be filled out over the entire column, even where criterion C does not hold. So basically I want to replace the NAs with the mean value calculated for that group. You can see the grp, val and C columns in the following data.table:
grp val C
1: 1 NA 0
2: 1 NA 0
3: 1 42 1
4: 1 42 1
5: 2 16 1
6: 2 16 1
7: 2 NA 0
8: 2 NA 0
9: 3 32 1
10: 3 32 1
11: 3 32 1
12: 3 32 1
So I want to replace the NAs in val with the mean value for the same group.
Here is sample code showing how I attempt to do it.
Basically, I extract another data.table, remove the NAs and duplicates, and then try to merge it with the original table.
x <- data.table(grp=c(1,1,1,1,2,2,2,2,3,3,3,3),val=c(NA,NA,42,42,16,16,NA,NA,32,32,32,32),C=c(0,0,1,1,1,1,0,0,1,1,1,1))
y <- x[!is.na(val),]
y <- y[!duplicated(y),]
setkey(x,grp)
setkey(y,grp)
x[y,val:=val,by=grp]
While this does not give any errors, it leaves the original column val untouched. What am I doing wrong? What would be a better approach?
So it seems like this question is generating lots of "noise", so I'll add this as an answer.
So data.table has an "assignment by reference" operator, := (see the documentation for := for more info and use cases/benchmarks).
This operator assigns values to all the members of the particular group (although you can also use it without grouping by anything), similar to the mutate function in dplyr or ave and transform in base R. Unlike those equivalents, it works by reference, i.e., it updates the data set itself without creating the copies that the <- operator would. (This isn't too important for this question specifically, but it is probably data.table's greatest advantage over the equivalents in other packages/base R.)
To sum things up, if you want to calculate some metric per group and assign it to each value in that particular group, use :=.
On the other hand, if you want just the summary, use = instead (in combination with list() or just .()); or, if you don't want to name the result of the aggregation, you don't have to use anything at all, as in:
x[, .(val = mean(val, na.rm = TRUE)), grp]
Or
x[, list(val = mean(val, na.rm = TRUE)), grp]
Or just
x[, mean(val, na.rm = TRUE), grp] # will call the aggregated variable `V1` by default
The equivalents for this in dplyr would be summarise and in base R it would be aggregate or sometimes tapply.
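For completeness, sketches of those equivalents (using the example data from the question):
library(dplyr)
x %>% group_by(grp) %>% summarise(val = mean(val, na.rm = TRUE))
# base R equivalents:
aggregate(val ~ grp, data = x, FUN = mean)   # the formula interface drops NA rows
tapply(x$val, x$grp, mean, na.rm = TRUE)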
That being said, in your specific case you would use the := operator in order to assign the mean value per group to each value in that particular group as in:
x[, val := mean(val, na.rm = TRUE), grp]
For imputing the NAs with the group mean, both data.table and dplyr work well (data.table vs dplyr is a separate discussion). Refer to @David Arenburg's comment above for the data.table code for replacing NA with the mean.
Using dplyr:
library(dplyr)
df %>% group_by(grp) %>% mutate(val= replace(val, is.na(val), mean(val, na.rm=TRUE))) # ifelse can also be tried instead of replace
A less elegant way is to use a custom function combined with ddply:
library(plyr)
# function to replace NA with mean for that group
impute.mean <- function(x) replace(x, is.na(x), mean(x, na.rm = TRUE))
df <- ddply(df, ~ grp, transform, val = impute.mean(val))
