Save column names when extracting columns by variable name - r

Let's say I have the following data.table.
dt = data.table(one=rep(2,4), two=rnorm(4))
dt
Now I have created a variable with a name of one column.
col_name = "one"
If I want to return that column as a data.table, I can do one of the following. The first option will return the column name as V1 and the second will actually set the column name to "one".
dt[,.(get(col_name))]
dt[,col_name, with=FALSE]
I'm wondering if there is a way to specify the column name which using the get command. Something like the following, which doesn't work.
dt[,as.symbol(col_name) = .(get(col_name))]
The reason that I need the column names with get is that I have pretty extensive loop whereby I'm filling in empty columns. So it could end up looking like this, whereby I loop through and replace imp_val with the median by the columns in cols.
dat2[is.na(get(imp_val)),
as.symbol(imp_val) := dat2[.BY, median(get(imp_val), na.rm=TRUE), on=get(cols)], by=c(get(cols))]

We can specify it in the .SDcols
dt[, .SD,.SDcols = col_name]
Or with ..
dt[, ..col_name]
if the intention is to rename the column as 'col_name'
setnames(dt[, ..col_name], deparse(substitute(col_name)))[]
# col_name
#1: 2
#2: 2
#3: 2
#4: 2

You could also use the tidyverse approach for this. Setup:
library(data.table)
library(magrittr)
library(dplyr)
dt = data.table(one=rep(2,4), two=rnorm(4))
col_name = "one"
Then use select with the non-standard evaluation operator !! (pronounced bang-bang):
> dt %>% dplyr::select(!!col_name)
one
1: 2
2: 2
3: 2
4: 2
The returned object is still a data.table:
> dt %>%
dplyr::select(!!col_name) %>%
class
[1] "data.table" "data.frame"
I'm not sure what you mean with the second part of your question on replacing NAs with the median. Maybe you could update your answer with a small example?

Related

Match rows between dataframe and replace with a value in another column of the second dataframe

I have two dataframes. The first contains a column containing IDs and various other columns while the other contains mapping information for these IDs (ID to Name).
I want to replace the ID in the first dataframe with the Name from the other dataframe.
I am able to do this
for(id in 1:nrow(df1)){
df1$X[df1$X %in% df2$ID[id]] <- df2$Name[id]
}
This works so long as I do not have repeating IDs in the mapping file such as this:
ID,Name
MSTRG.11187,gng7.S
MSTRG.11187,Novel
But this occurs quite a lot. I think my previous code will work if I can get rid of any rows from the mapping file which contain the word Novel in them. I am just struggling to do this. I have tried this :
data = data %>% group_by(GeneID) %>% filter(!("Novel" %in% Gene_Name))
But in the previous example of the repeating IDs with different names, it gets rid of the row with gng7.S as well as getting rid of the row with Novel. I'd like to do this but keep the row with gng7.S and only get rid of the row with Novel.
I'm thinking this might be something to do with the group_by part.
Thanks,
S
Edit: Here are some example dataframes
df1=data.frame(X=c("MSTRG.199","MSTRG.18989","MSTRG.8890","MSTRG.7767"))
df2=data.frame(ID=c("MSTRG.18989","MSTRG.18989","MSTRG.8890","MSTRG.7767", "MSTRG.199"),Name=c("gng7.S", "Novel", "Novel","cdc20", "Novel"))
The question is not fully clear whether any appearances of "Novel" should be removed from df2 or only in cases of duplicate ID. The second case is quite tricky, so I propose a data.table solution which I'm more fluent in (and Q isn't explicitely tagged with dplyr)
df1 <- data.frame(X = c("MSTRG.199", "MSTRG.18989", "MSTRG.8890", "MSTRG.7767"))
df2 <- data.frame(
ID = c("MSTRG.18989", "MSTRG.18989", "MSTRG.8890", "MSTRG.7767", "MSTRG.199"),
Name = c("gng7.S", "Novel", "Novel", "cdc20", "Novel"))
library(data.table)
DT1 <- data.table(df1)
DT2 <- data.table(df2)
# case 1
# remove all rows with Name == Novel before joining
DT2[!Name %in% c("Novel")][DT1, on = .(ID = X)]
ID Name N
1: MSTRG.199 NA NA
2: MSTRG.18989 gng7.S 2
3: MSTRG.8890 NA NA
4: MSTRG.7767 cdc20 1
# case 2
# remove Novel in cases of duplicate appearances of ID
DT2[, N := .N, by = ID][!(N > 1L & Name %in% "Novel")][, N := NULL][DT1, on = .(ID = X)]
ID Name
1: MSTRG.199 Novel
2: MSTRG.18989 gng7.S
3: MSTRG.8890 Novel
4: MSTRG.7767 cdc20

Aggregating in R

I have a data frame with two columns. I want to add an additional two columns to the data set with counts based on aggregates.
df <- structure(list(ID = c(1045937900, 1045937900),
SMS.Type = c("DF1", "WCB14"),
SMS.Date = c("12/02/2015 19:51", "13/02/2015 08:38"),
Reply.Date = c("", "13/02/2015 09:52")
), row.names = 4286:4287, class = "data.frame")
I want to simply count the number of Instances of SMS.Type and Reply.Date where there is no null. So in the toy example below, i will generate the 2 for SMS.Type and 1 for Reply.Date
I then want to add this to the data frame as total counts (Im aware they will duplicate out for the number of rows in the original dataset but thats ok)
I have been playing around with aggregate and count function but to no avail
mytempdf <-aggregate(cbind(testtrain$SMS.Type,testtrain$Response.option)~testtrain$ID,
train,
function(x) length(unique(which(!is.na(x)))))
mytempdf <- aggregate(testtrain$Reply.Date~testtrain$ID,
testtrain,
function(x) length(which(!is.na(x))))
Can anyone help?
Thank you for your time
Using data.table you could do (I've added a real NA to your original data).
I'm also not sure if you really looking for length(unique()) or just length?
library(data.table)
cols <- c("SMS.Type", "Reply.Date")
setDT(df)[, paste0(cols, ".count") :=
lapply(.SD, function(x) length(unique(na.omit(x)))),
.SDcols = cols,
by = ID]
# ID SMS.Type SMS.Date Reply.Date SMS.Type.count Reply.Date.count
# 1: 1045937900 DF1 12/02/2015 19:51 NA 2 1
# 2: 1045937900 WCB14 13/02/2015 08:38 13/02/2015 09:52 2 1
In the devel version (v >= 1.9.5) you also could use uniqueN function
Explanation
This is a general solution which will work on any number of desired columns. All you need to do is to put the columns names into cols.
lapply(.SD, is calling a certain function over the columns specified in .SDcols = cols
paste0(cols, ".count") creates new column names while adding count to the column names specified in cols
:= performs assignment by reference, meaning, updates the newly created columns with the output of lapply(.SD, in place
by argument is specifying the aggregator columns
After converting your empty strings to NAs:
library(dplyr)
mutate(df, SMS.Type.count = sum(!is.na(SMS.Type)),
Reply.Date.count = sum(!is.na(Reply.Date)))

How to remove duplicated (by name) column in data.tables in R?

While reading a data set using fread, I've noticed that sometimes I'm getting duplicated column names, for example (fread doesn't have check.names argument)
> data.table( x = 1, x = 2)
x x
1: 1 2
The question is: is there any way to remove 1 of 2 columns if they have the same name?
How about
dt[, .SD, .SDcols = unique(names(dt))]
This selects the first occurrence of each name (I'm not sure how you want to handle this).
As #DavidArenburg suggests in comments above, you could use check.names=TRUE in data.table() or fread()
.SDcols approaches would return a copy of the columns you're selecting. Instead just remove those duplicated columns using :=, by reference.
dt[, which(duplicated(names(dt))) := NULL]
# x
# 1: 1
Different approaches:
Indexing
my.data.table <- my.data.table[ ,-2]
Subsetting
my.data.table <- subset(my.data.table, select = -2)
Making unique names if 1. and 2. are not ideal (when having hundreds of columns, for instance)
setnames(my.data.table, make.names(names = names(my.data.table), unique=TRUE))
Optionnaly systematize deletion of variables which names meet some criterion (here, we'll get rid of all variables having a name ending with ".X" (X being a number, starting at 2 when using make.names)
my.data.table <- subset(my.data.table,
select = !grepl(pattern = "\\.\\d$", x = names(my.data.table)))

Selecting a subset of columns in a data.table

I'd like to print all the columns of a data table dt except one of them named V3 but don't want to refer to it by number but by name. This is the code that I have:
dt = data.table(matrix(sample(c(0,1),5,rep=T),50,10))
dt[,-3,with=FALSE] # Is this the only way to not print column "V3"?
Using the data frame way, one could do this through the code:
df = data.frame(matrix(sample(c(0,1),5,rep=T),50,10))
df[,!(colnames(df)%in% c("X3"))]
So, my question is: is there another way to not print one column in a data table without the necessity of refer to it by number? I'd like to find something similar to the data frame syntax I used above but using data table.
Use a very similar syntax as for a data.frame, but add the argument with=FALSE:
dt[, setdiff(colnames(dt),"V9"), with=FALSE]
V1 V2 V3 V4 V5 V6 V7 V8 V10
1: 1 1 1 1 1 1 1 1 1
2: 0 0 0 0 0 0 0 0 0
3: 1 1 1 1 1 1 1 1 1
4: 0 0 0 0 0 0 0 0 0
5: 0 0 0 0 0 0 0 0 0
6: 1 1 1 1 1 1 1 1 1
The use of with=FALSE is nicely explained in the documentation for the j argument in ?data.table:
j: A single column name, single expresson of column names, list() of expressions of column names, an expression or function call that evaluates to list (including data.frame and data.table which are lists, too), or (when with=FALSE) same as j in [.data.frame.
From v1.10.2 onwards it is also possible to do this as follows:
keep <- setdiff(names(dt), "V9")
dt[, ..keep]
Prefixing a symbol with .. will look up in calling scope (i.e. the Global Environment) and its value taken to be column names or numbers (source).
Edit 2019-09-27 with a more modern approach
You can do this with patterns as mentioned above; or, you can do it with ! if there's a vector of names already:
dt[ , !'V3']
# or
drop_cols = 'V3'
dt[ , !..drop_cols]
.. means "look up one level"
Older version using with=FALSE (data.table is moving away from this argument steadily)
Here's a way that uses grep to convert to numeric and allow negative column indexing:
dt[, -grep("^V3$", names(dt)), with=FALSE]
You did say "V3" was to be excluded, right?
Maybe it's only in recent versions of data.table (I'm using 1.9.6), but you can do:
dt[, -'V3']
For several columns:
dt[, -c('V3', 'V9')]
Note that the quotes around the variable names are necessary.
Also, if your column names are stored in a variable, say cols, you'll need to do dt[, -cols, with=FALSE].
From version 1.12.0 onwards, it is also possible to select columns using regular expressions on their names:
iris_DT <- as.data.table(iris)
iris_DT[, .SD, .SDcols = patterns(".e.al")]
To summarize the answer to this question, and also to make it
a) negation-friendly (so that you can also select columns by negation),
b) pipe-line friendly (so that you can use in a pipeline with %>% operator), and
c) so that you can select using both column numbers and column names,
these are available options:
library(data.table);
select1 <- function (dt, range) dt[, range, with=F]
select2 <- function (dt, range) dt[, ..range]
select3 <- function (dt, range) dt[, .SD, .SDcols=range]
dt <- ggplot2::diamonds
range <- 1:3 # or
range <- dt %>% names %>% .[1:3]
dt %>% select1(range);
dt %>% select2(range);
dt %>% select3(range);
dt %>% select1(-range);
dt %>% select2(-range);
dt %>% select3(-range); # DOES NOT WORK
Also we note that this
dt %>% .[, ..(names(dt)[1:3])] # DOES NOT WORK
Therefore the best (most universal and fast) way to select multiple columns in data.table is the following:
# columns are selected using column numbers:
range <- 1:3
dt %>% select1(range);
dt %>% .[, range, with=F]
# The same works if columns are selected using column names:
range <- names(dt) [1:3]
dt %>% select1(range);
dt %>% .[, range, with=F]
PS.
If, instead of selecting multiple columns, you want to efficiently delete multiple columns from data.table by reference (i.e. instead of copying the entire data.table), then you can use data.table's := operator. But I don't know how to do it for multiple columns in one line

How do you delete a column by name in data.table?

To get rid of a column named "foo" in a data.frame, I can do:
df <- df[-grep('foo', colnames(df))]
However, once df is converted to a data.table object, there is no way to just remove a column.
Example:
df <- data.frame(id = 1:100, foo = rnorm(100))
df2 <- df[-grep('foo', colnames(df))] # works
df3 <- data.table(df)
df3[-grep('foo', colnames(df3))]
But once it is converted to a data.table object, this no longer works.
Any of the following will remove column foo from the data.table df3:
# Method 1 (and preferred as it takes 0.00s even on a 20GB data.table)
df3[,foo:=NULL]
df3[, c("foo","bar"):=NULL] # remove two columns
myVar = "foo"
df3[, (myVar):=NULL] # lookup myVar contents
# Method 2a -- A safe idiom for excluding (possibly multiple)
# columns matching a regex
df3[, grep("^foo$", colnames(df3)):=NULL]
# Method 2b -- An alternative to 2a, also "safe" in the sense described below
df3[, which(grepl("^foo$", colnames(df3))):=NULL]
data.table also supports the following syntax:
## Method 3 (could then assign to df3,
df3[, !"foo"]
though if you were actually wanting to remove column "foo" from df3 (as opposed to just printing a view of df3 minus column "foo") you'd really want to use Method 1 instead.
(Do note that if you use a method relying on grep() or grepl(), you need to set pattern="^foo$" rather than "foo", if you don't want columns with names like "fool" and "buffoon" (i.e. those containing foo as a substring) to also be matched and removed.)
Less safe options, fine for interactive use:
The next two idioms will also work -- if df3 contains a column matching "foo" -- but will fail in a probably-unexpected way if it does not. If, for instance, you use any of them to search for the non-existent column "bar", you'll end up with a zero-row data.table.
As a consequence, they are really best suited for interactive use where one might, e.g., want to display a data.table minus any columns with names containing the substring "foo". For programming purposes (or if you are wanting to actually remove the column(s) from df3 rather than from a copy of it), Methods 1, 2a, and 2b are really the best options.
# Method 4:
df3[, .SD, .SDcols = !patterns("^foo$")]
Lastly there are approaches using with=FALSE, though data.table is gradually moving away from using this argument so it's now discouraged where you can avoid it; showing here so you know the option exists in case you really do need it:
# Method 5a (like Method 3)
df3[, !"foo", with=FALSE]
# Method 5b (like Method 4)
df3[, !grep("^foo$", names(df3)), with=FALSE]
# Method 5b (another like Method 4)
df3[, !grepl("^foo$", names(df3)), with=FALSE]
You can also use set for this, which avoids the overhead of [.data.table in loops:
dt <- data.table( a=letters, b=LETTERS, c=seq(26), d=letters, e=letters )
set( dt, j=c(1L,3L,5L), value=NULL )
> dt[1:5]
b d
1: A a
2: B b
3: C c
4: D d
5: E e
If you want to do it by column name, which(colnames(dt) %in% c("a","c","e")) should work for j.
I simply do it in the data frame kind of way:
DT$col = NULL
Works fast and as far as I could see doesn't cause any problems.
UPDATE: not the best method if your DT is very large, as using the $<- operator will lead to object copying. So better use:
DT[, col:=NULL]
Very simple option in case you have many individual columns to delete in a data table and you want to avoid typing in all column names #careadviced
dt <- dt[, -c(1,4,6,17,83,104)]
This will remove columns based on column number instead.
It's obviously not as efficient because it bypasses data.table advantages but if you're working with less than say 500,000 rows it works fine
Suppose your dt has columns col1, col2, col3, col4, col5, coln.
To delete a subset of them:
vx <- as.character(bquote(c(col1, col2, col3, coln)))[-1]
DT[, paste0(vx):=NULL]
Here is a way when you want to set a # of columns to NULL given their column names
a function for your usage :)
deleteColsFromDataTable <- function (train, toDeleteColNames) {
for (myNm in toDeleteColNames)
train <- train [,(myNm):=NULL]
return (train)
}
DT[,c:=NULL] # remove column c
For a data.table, assigning the column to NULL removes it:
DT[,c("col1", "col1", "col2", "col2")] <- NULL
^
|---- Notice the extra comma if DT is a data.table
... which is the equivalent of:
DT$col1 <- NULL
DT$col2 <- NULL
DT$col3 <- NULL
DT$col4 <- NULL
The equivalent for a data.frame is:
DF[c("col1", "col1", "col2", "col2")] <- NULL
^
|---- Notice the missing comma if DF is a data.frame
Q. Why is there a comma in the version for data.table, and no comma in the version for data.frame?
A. As data.frames are stored as a list of columns, you can skip the comma. You could also add it in, however then you will need to assign them to a list of NULLs, DF[, c("col1", "col2", "col3")] <- list(NULL).

Resources