How to remove duplicated (by name) column in data.tables in R? - r

While reading a data set using fread, I've noticed that sometimes I'm getting duplicated column names, for example (fread doesn't have check.names argument)
> data.table( x = 1, x = 2)
x x
1: 1 2
The question is: is there any way to remove 1 of 2 columns if they have the same name?

How about
dt[, .SD, .SDcols = unique(names(dt))]
This selects the first occurrence of each name (I'm not sure how you want to handle this).
As @DavidArenburg suggests in comments above, you could use check.names=TRUE in data.table() or fread().

.SDcols approaches would return a copy of the columns you're selecting. Instead just remove those duplicated columns using :=, by reference.
dt[, which(duplicated(names(dt))) := NULL]
# x
# 1: 1
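A slightly more defensive sketch of the same by-reference idea. The toy table with an extra column y is an assumption for illustration, and the guard simply skips the assignment when no name is duplicated, so the code never relies on how an empty left-hand side is handled.

```r
library(data.table)

# Toy table (an assumption for illustration): two columns named "x", one "y".
dt <- data.table(x = 1, x = 2, y = 3)

# Positions of every column whose name has already appeared to its left.
dup_idx <- which(duplicated(names(dt)))

# Delete by position, by reference; skip when there is nothing to delete.
if (length(dup_idx) > 0L) {
  dt[, (dup_idx) := NULL]
}

names(dt)
```

Deleting by position rather than by name matters here, because a name cannot tell the two "x" columns apart.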

Different approaches:
1. Indexing
my.data.table <- my.data.table[ ,-2]
2. Subsetting
my.data.table <- subset(my.data.table, select = -2)
3. Making unique names if 1. and 2. are not ideal (when having hundreds of columns, for instance)
setnames(my.data.table, make.names(names = names(my.data.table), unique=TRUE))
4. Optionally, systematize the deletion of variables whose names meet some criterion. Here, we get rid of all variables whose name ends with ".X", X being the digit that make.names appends to a repeated name (starting at 1).
my.data.table <- subset(my.data.table,
select = !grepl(pattern = "\\.\\d$", x = names(my.data.table)))
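The rename-then-filter route can be sketched end to end as follows. The toy table is an assumption, and the pattern is widened to `\\.\\d+$` so that suffixes of more than one digit are caught too. Note the caveat that a legitimately named column such as "rate.1" would also be dropped by this filter.

```r
library(data.table)

# Toy table (an assumption for illustration) with a repeated name.
dt <- data.table(x = 1, x = 2, y = 3)

# Make names unique: the second "x" gets a numeric suffix.
setnames(dt, make.names(names(dt), unique = TRUE))
renamed <- names(dt)

# Keep only columns whose name does not end in ".<digits>".
keep <- names(dt)[!grepl("\\.\\d+$", names(dt))]
dedup <- dt[, .SD, .SDcols = keep]

names(dedup)
```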

Related

In base or data.table for R, use a function, evaluated on a column, to select rows?

Given a data table DT with a column Col1, select the rows of DT where the values x in Col1 satisfy some boolean expression, for example f(x) == TRUE or another example f(x) <= 4, and then doing more data table operations.
For example, I tried something like
DT[f(Col1) == TRUE, Col2 := 2]
which does not work because f() acts on single values, not vectors. Using lapply() seems to work, but it takes a long time to run with a very large DT.
A workaround would be to create a column and use that to select the rows
DT[, fvalues := f(Col1)][fvalues == TRUE, Col2 := 2]
but it would be better not to increase the size of DT.
EDIT: Here is an example.
map1<-data.table(k1=c("A","B","C"), v=c(-1,2,3))
map2<-data.table(k2=c("A","B","A","A","C","B"), k3=c("A","B","C","B","C","B"))
f <- function(x) map1[k1 == x, v]
To find the rows in map2 using the corresponding value in map1: these do not work (returning an error about a length mismatch around ==)
map2[f(k2) == 2, flag1 := TRUE]
map2[f(k2) + f(k3) == 2, flag2 := TRUE]
With lapply() the first one works, but it is somehow slower (for a large data table) than adding a column to map2 with the values of f and selecting based on that new column
map2[lapply(k2,f) == 2, flag1 := TRUE]
and the second one
map2[lapply(k2,f) + lapply(k3,f) == 2, flag2 := TRUE]
returns an error (non-numeric argument).
The question would be how to do this most efficiently, particularly without making the data table larger.
I think the problem is perhaps in the parts that aim to modify the data.table in place (i.e. your := parts). I don't think you can filter in place, since filtering really requires writing to a new memory location.
This works for filtering, whilst creating a new object:
library(data.table)
f <- function(x) x > 0.5
DT <- data.table(Col1 = runif(10))
DT[f(Col1),]
#> Col1
#> 1: 0.7916055
#> 2: 0.5391773
#> 3: 0.6855657
#> 4: 0.5250881
#> 5: 0.9089948
#> 6: 0.6639571
To do more data.table operations on a filtered table, assign to a new object and work with that one:
DT2 <- DT[f(Col1),]
DT2[, Col2 := 2]
Perhaps I've misunderstood your problem though - what function are you using? Could you post more code so we can replicate your problem more precisely?
If f is a function working only on scalar values, you could Vectorize it:
DT[Vectorize(f)(Col1)]
Not sure this fully answers your question, because Vectorize uses lapply/mapply.
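For the map1/map2 example in the question, one alternative that is not proposed in the thread is to make the lookup itself vectorised with match(), so that neither lapply() nor a helper column is needed. This is only a sketch: f_vec is a hypothetical name, and it assumes each key appears once in map1.

```r
library(data.table)

map1 <- data.table(k1 = c("A", "B", "C"), v = c(-1, 2, 3))
map2 <- data.table(k2 = c("A", "B", "A", "A", "C", "B"),
                   k3 = c("A", "B", "C", "B", "C", "B"))

# Vectorised replacement for f(): look up every key at once with match().
# Assumes keys in map1$k1 are unique; unmatched keys give NA.
f_vec <- function(x) map1$v[match(x, map1$k1)]

# No helper column is added; only the flag columns are created.
map2[f_vec(k2) == 2, flag1 := TRUE]
map2[f_vec(k2) + f_vec(k3) == 2, flag2 := TRUE]

map2
```

Rows that do not satisfy the condition keep NA in the flag columns, as they would with the original := calls.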

Save column names when extracting columns by variable name

Let's say I have the following data.table.
dt = data.table(one=rep(2,4), two=rnorm(4))
dt
Now I have created a variable with a name of one column.
col_name = "one"
If I want to return that column as a data.table, I can do one of the following. The first option will return the column name as V1 and the second will actually set the column name to "one".
dt[,.(get(col_name))]
dt[,col_name, with=FALSE]
I'm wondering if there is a way to specify the column name while using the get command. Something like the following, which doesn't work.
dt[,as.symbol(col_name) = .(get(col_name))]
The reason that I need the column names with get is that I have a pretty extensive loop in which I'm filling in empty columns. So it could end up looking like this, where I loop through and replace missing values of imp_val with the median, grouped by the columns in cols.
dat2[is.na(get(imp_val)),
as.symbol(imp_val) := dat2[.BY, median(get(imp_val), na.rm=TRUE), on=get(cols)], by=c(get(cols))]
We can specify it in the .SDcols
dt[, .SD,.SDcols = col_name]
Or with ..
dt[, ..col_name]
if the intention is to rename the column as 'col_name'
setnames(dt[, ..col_name], deparse(substitute(col_name)))[]
# col_name
#1: 2
#2: 2
#3: 2
#4: 2
You could also use the tidyverse approach for this. Setup:
library(data.table)
library(magrittr)
library(dplyr)
dt = data.table(one=rep(2,4), two=rnorm(4))
col_name = "one"
Then use select with the non-standard evaluation operator !! (pronounced bang-bang):
> dt %>% dplyr::select(!!col_name)
one
1: 2
2: 2
3: 2
4: 2
The returned object is still a data.table:
> dt %>%
dplyr::select(!!col_name) %>%
class
[1] "data.table" "data.frame"
I'm not sure what you mean with the second part of your question on replacing NAs with the median. Maybe you could update your question with a small example?
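One possible reading of the imputation loop in the question, as a hedged sketch rather than the asker's actual code: the table dat2, its columns, and the to_impute vector are all made up for illustration. Wrapping the variable in parentheses on the left of := names the target column, while get() reads it on the right.

```r
library(data.table)

# Toy data (an assumption): one grouping column and two columns with gaps.
dat2 <- data.table(g  = c("a", "a", "a", "b", "b"),
                   v1 = c(1, NA, 3, 10, NA),
                   v2 = c(NA, 2, 4, NA, 8))
cols <- "g"                   # grouping column(s), held in a variable
to_impute <- c("v1", "v2")    # columns to fill, also held in a variable

for (imp_val in to_impute) {
  # (imp_val) on the left names the target column; get() reads it on the right.
  dat2[, (imp_val) := {
    x <- get(imp_val)
    replace(x, is.na(x), median(x, na.rm = TRUE))
  }, by = cols]
}

dat2
```

The sketch uses double columns on purpose: median() can return a double for integer input, which may clash with the column type when assigning by group.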

delete column in data.table in R based on condition [duplicate]

I'm trying to manipulate a number of data.tables in similar ways, and would like to write a function to accomplish this. I would like to pass in a parameter containing a list of columns that would have the operations performed. This works fine when the vector declaration of columns is the left hand side of the := operator, but not if it is declared earlier (or passed into the function). The following code shows the issue.
dt = data.table(a = letters, b = 1:2, c=1:13)
colsToDelete = c('b', 'c')
dt[,colsToDelete := NULL] # doesn't work but I don't understand why not.
dt[,c('b', 'c') := NULL] # works fine, but doesn't allow passing in of columns
The error is "Adding new column 'colsToDelete' then assigning NULL (deleting it)." So clearly, it's interpreting 'colsToDelete' as a new column name.
The same issue occurs when doing something along these lines
dt[, colNames := lapply(.SD, adjustValue, y=factor), .SDcols = colNames]
I'm new to R, but rather more experienced with some other languages, so this may be a silly question.
It's basically because we allow symbols on LHS of := to add new columns, for convenience: ex: DT[, col := val]. So, in order to distinguish col itself being the name from whatever is stored in col being the column names, we check if the LHS is a name or an expression.
If it's a name, it adds the column with the name as such on the LHS, and if expression, then it gets evaluated.
DT[, col := val] # col is the column name.
DT[, (col) := val] # col gets evaluated and replaced with its value
DT[, c(col) := val] # same as above
The preferred idiom is: dt[, (colsToDelete) := NULL]
HTH
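Applied to the function use case from the question, the parenthesised left-hand side might look like the sketch below. The helper names scale_cols and drop_cols are hypothetical, and the multiplication stands in for the asker's adjustValue, whose definition is not given.

```r
library(data.table)

dt <- data.table(a = letters[1:4], b = 1:4, c = 5:8)

# Hypothetical helper: multiply the named columns by k, in place.
scale_cols <- function(DT, cols, k) {
  DT[, (cols) := lapply(.SD, function(x) x * k), .SDcols = cols]
  invisible(DT)
}

# Hypothetical helper: delete the named columns, in place.
drop_cols <- function(DT, cols) {
  DT[, (cols) := NULL]
  invisible(DT)
}

scale_cols(dt, c("b", "c"), 2L)
drop_cols(dt, "c")
dt
```

Because := works by reference, the caller's dt is modified even though the functions return invisibly.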
I am surprised no answer provided uses the set() function.
set(DT, , colsToDelete, NULL)
This should be the easiest.
To extend on previous answer, you can delete columns by reference doing:
# delete columns 10 to 15
dt[ , (10:15) := NULL ]
or
# delete columns 3, 5 and 10 to 15
dt[ , (c(3,5,10:15)) := NULL ]
This code did the job for me. You need the positions of the columns to be deleted, e.g. posvec, as mentioned in ?set:
j: Column name(s) (character) or number(s) (integer) to be assigned value
when column(s) already exist, and only column name(s) if they are to
be created.
DT_removed_slected_col = set(DT, j = posvec, value = NULL)
Also if you want to get the posvec you can try this:
selected_col = c('col_a','col_b',...)
selected_col = unlist(sapply(selected_col, function(x) grep(x,names(DT))))
namvec = names(selected_col) #col names
posvec = unname(selected_col) #col positions
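Since grep() matches substrings, the lookup above can pick up more columns than intended (searching for "col_a" also hits "col_ab"). A hedged alternative, assuming exact names are what is wanted, is to pass the names straight to set() after intersecting them with the existing columns. The table and names here are made up for illustration.

```r
library(data.table)

# Toy table (an assumption): "col_ab" contains "col_a" as a substring.
DT <- data.table(col_a = 1:3, col_ab = 4:6, col_b = 7:9)
selected_col <- c("col_a", "col_b", "not_there")

# Keep only the requested names that actually exist, matched exactly.
to_drop <- intersect(selected_col, names(DT))

# set() accepts column names for j; value = NULL deletes by reference.
if (length(to_drop) > 0L) {
  set(DT, j = to_drop, value = NULL)
}

names(DT)
```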

Aggregating in R

I have a data frame with two columns. I want to add an additional two columns to the data set with counts based on aggregates.
df <- structure(list(ID = c(1045937900, 1045937900),
SMS.Type = c("DF1", "WCB14"),
SMS.Date = c("12/02/2015 19:51", "13/02/2015 08:38"),
Reply.Date = c("", "13/02/2015 09:52")
), row.names = 4286:4287, class = "data.frame")
I simply want to count the number of instances of SMS.Type and Reply.Date that are not null. So in the toy example above, I would get 2 for SMS.Type and 1 for Reply.Date.
I then want to add this to the data frame as total counts (I'm aware they will be duplicated across the rows of the original dataset, but that's OK).
I have been playing around with the aggregate and count functions, but to no avail:
mytempdf <-aggregate(cbind(testtrain$SMS.Type,testtrain$Response.option)~testtrain$ID,
train,
function(x) length(unique(which(!is.na(x)))))
mytempdf <- aggregate(testtrain$Reply.Date~testtrain$ID,
testtrain,
function(x) length(which(!is.na(x))))
Can anyone help?
Thank you for your time
Using data.table you could do (I've added a real NA to your original data).
I'm also not sure if you're really looking for length(unique()) or just length?
library(data.table)
cols <- c("SMS.Type", "Reply.Date")
setDT(df)[, paste0(cols, ".count") :=
lapply(.SD, function(x) length(unique(na.omit(x)))),
.SDcols = cols,
by = ID]
# ID SMS.Type SMS.Date Reply.Date SMS.Type.count Reply.Date.count
# 1: 1045937900 DF1 12/02/2015 19:51 NA 2 1
# 2: 1045937900 WCB14 13/02/2015 08:38 13/02/2015 09:52 2 1
In the devel version (v >= 1.9.5) you could also use the uniqueN function.
Explanation
This is a general solution which will work on any number of desired columns. All you need to do is to put the columns names into cols.
lapply(.SD, is calling a certain function over the columns specified in .SDcols = cols
paste0(cols, ".count") creates new column names while adding count to the column names specified in cols
:= performs assignment by reference, meaning, updates the newly created columns with the output of lapply(.SD, in place
by argument is specifying the aggregator columns
After converting your empty strings to NAs:
library(dplyr)
mutate(df, SMS.Type.count = sum(!is.na(SMS.Type)),
Reply.Date.count = sum(!is.na(Reply.Date)))
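The conversion of empty strings to NA is assumed but not shown in the thread. A minimal base R sketch of that step, followed by per-ID counts via ave(), could look like this; it uses the question's data and counts non-missing values, not unique ones.

```r
df <- data.frame(ID = c(1045937900, 1045937900),
                 SMS.Type = c("DF1", "WCB14"),
                 SMS.Date = c("12/02/2015 19:51", "13/02/2015 08:38"),
                 Reply.Date = c("", "13/02/2015 09:52"),
                 stringsAsFactors = FALSE)

# Turn empty strings into NA in every character column.
is_chr <- vapply(df, is.character, logical(1))
df[is_chr] <- lapply(df[is_chr], function(x) replace(x, x == "", NA))

# Per-ID counts of non-missing values, repeated on every row of that ID.
df$SMS.Type.count   <- ave(as.integer(!is.na(df$SMS.Type)),   df$ID, FUN = sum)
df$Reply.Date.count <- ave(as.integer(!is.na(df$Reply.Date)), df$ID, FUN = sum)

df
```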

How do you delete a column by name in data.table?

To get rid of a column named "foo" in a data.frame, I can do:
df <- df[-grep('foo', colnames(df))]
However, once df is converted to a data.table object, there is no way to just remove a column.
Example:
df <- data.frame(id = 1:100, foo = rnorm(100))
df2 <- df[-grep('foo', colnames(df))] # works
df3 <- data.table(df)
df3[-grep('foo', colnames(df3))]
But once it is converted to a data.table object, this no longer works.
Any of the following will remove column foo from the data.table df3:
# Method 1 (and preferred as it takes 0.00s even on a 20GB data.table)
df3[,foo:=NULL]
df3[, c("foo","bar"):=NULL] # remove two columns
myVar = "foo"
df3[, (myVar):=NULL] # lookup myVar contents
# Method 2a -- A safe idiom for excluding (possibly multiple)
# columns matching a regex
df3[, grep("^foo$", colnames(df3)):=NULL]
# Method 2b -- An alternative to 2a, also "safe" in the sense described below
df3[, which(grepl("^foo$", colnames(df3))):=NULL]
data.table also supports the following syntax:
## Method 3 (you could then assign the result back to df3)
df3[, !"foo"]
though if you were actually wanting to remove column "foo" from df3 (as opposed to just printing a view of df3 minus column "foo") you'd really want to use Method 1 instead.
(Do note that if you use a method relying on grep() or grepl(), you need to set pattern="^foo$" rather than "foo", if you don't want columns with names like "fool" and "buffoon" (i.e. those containing foo as a substring) to also be matched and removed.)
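The difference between the loose and the anchored pattern can be checked on a small table. The column names below are made up to mirror the "fool" and "buffoon" example in the note.

```r
library(data.table)

# Toy table (an assumption) with several names containing "foo".
df3 <- data.table(id = 1:3, foo = 4:6, fool = 7:9, buffoon = 10:12)

loose  <- grep("foo",   names(df3), value = TRUE)  # substring match
strict <- grep("^foo$", names(df3), value = TRUE)  # exact-name match

# Remove only the exact match, by reference.
df3[, grep("^foo$", names(df3)) := NULL]

names(df3)
```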
Less safe options, fine for interactive use:
The next two idioms will also work -- if df3 contains a column matching "foo" -- but will fail in a probably-unexpected way if it does not. If, for instance, you use any of them to search for the non-existent column "bar", you'll end up with a zero-row data.table.
As a consequence, they are really best suited for interactive use where one might, e.g., want to display a data.table minus any columns with names containing the substring "foo". For programming purposes (or if you are wanting to actually remove the column(s) from df3 rather than from a copy of it), Methods 1, 2a, and 2b are really the best options.
# Method 4:
df3[, .SD, .SDcols = !patterns("^foo$")]
Lastly there are approaches using with=FALSE, though data.table is gradually moving away from using this argument so it's now discouraged where you can avoid it; showing here so you know the option exists in case you really do need it:
# Method 5a (like Method 3)
df3[, !"foo", with=FALSE]
# Method 5b (like Method 4)
df3[, !grep("^foo$", names(df3)), with=FALSE]
# Method 5c (another like Method 4)
df3[, !grepl("^foo$", names(df3)), with=FALSE]
You can also use set for this, which avoids the overhead of [.data.table in loops:
dt <- data.table( a=letters, b=LETTERS, c=seq(26), d=letters, e=letters )
set( dt, j=c(1L,3L,5L), value=NULL )
> dt[1:5]
b d
1: A a
2: B b
3: C c
4: D d
5: E e
If you want to do it by column name, which(colnames(dt) %in% c("a","c","e")) should work for j.
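A short sketch of that by-name variant, reusing the table from this answer; the only addition is computing j from the names first.

```r
library(data.table)

dt <- data.table(a = letters, b = LETTERS, c = seq(26), d = letters, e = letters)

# Translate column names into positions, then delete them by reference.
j <- which(colnames(dt) %in% c("a", "c", "e"))
set(dt, j = j, value = NULL)

names(dt)
```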
I simply do it in the data frame kind of way:
DT$col = NULL
Works fast and as far as I could see doesn't cause any problems.
UPDATE: not the best method if your DT is very large, as using the $<- operator will lead to object copying. So better use:
DT[, col:=NULL]
A very simple option in case you have many individual columns to delete in a data table and you want to avoid typing in all the column names (care advised):
dt <- dt[, -c(1,4,6,17,83,104)]
This will remove columns based on column number instead.
It's obviously not as efficient because it bypasses data.table advantages but if you're working with less than say 500,000 rows it works fine
Suppose your DT has columns col1, col2, col3, col4, col5, coln.
To delete a subset of them:
vx <- as.character(bquote(c(col1, col2, col3, coln)))[-1]
DT[, paste0(vx):=NULL]
Here is a way to set a number of columns to NULL given their column names, wrapped in a function for your usage :)
deleteColsFromDataTable <- function (train, toDeleteColNames) {
for (myNm in toDeleteColNames)
train <- train [,(myNm):=NULL]
return (train)
}
DT[,c:=NULL] # remove column c
For a data.table, assigning the column to NULL removes it:
DT[, c("col1", "col2", "col3", "col4")] <- NULL
(Notice the extra comma before the column vector if DT is a data.table.)
... which is the equivalent of:
DT$col1 <- NULL
DT$col2 <- NULL
DT$col3 <- NULL
DT$col4 <- NULL
The equivalent for a data.frame is:
DF[c("col1", "col2", "col3", "col4")] <- NULL
(Notice the missing comma if DF is a data.frame.)
Q. Why is there a comma in the version for data.table, and no comma in the version for data.frame?
A. As data.frames are stored as a list of columns, you can skip the comma. You could also add it in, however then you will need to assign them to a list of NULLs, DF[, c("col1", "col2", "col3")] <- list(NULL).

Resources