R data.table syntax to create and select on the fly

I have a data.table in R and would like to modify one of its columns on the fly (without changing the original object) and afterwards select a limited number of its columns by reference. The best I have managed is the following, but it changes my column names. Any suggestions?
tmp <- data.table(a = 'X', b = 'Y', d = 1)
tmp[,.(d = d * - 1, .SD), .SDcols = colNames]

Try c(list(d = d * -1), .SD) in the j argument.
j expects a list, and .SD is itself a list.
So when adding a new column like this, you just need to put it into a list and combine that list with .SD using c().
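A minimal sketch of that suggestion. The question never defines colNames, so the value c('a', 'b') used here is an assumption for illustration only.

```r
library(data.table)

tmp <- data.table(a = 'X', b = 'Y', d = 1)
colNames <- c('a', 'b')  # assumed: not defined in the question

# j returns one list: the modified d first, then the columns of .SD
res <- tmp[, c(list(d = d * -1), .SD), .SDcols = colNames]

names(res)  # the columns keep their own names
tmp$d       # the original table is not modified
```

Because the result is built in j rather than assigned with :=, tmp itself stays as it was.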

Related

Assign values to multiple columns on subset using ":=" from data.table

Following up on this question, how would you assign values to multiple columns in a data table using the ":=" sign?
For example:
x <- data.table(a = 1:3, b = 1:6, c = 11:16)
I can get what I want using two lines:
x[a>2, b:=NA]
x[a>2, c:=NA]
but would like to be able to do it in one, something like this:
x[a>2, .(b:=NA, c:=NA)]
But unfortunately that doesn't work. Is there another way?
We can use := once, in its functional form:
x[a > 2, `:=`(b = NA, c = NA)]
If there are many columns, another option is set():
for (nm in names(x)[-1]) set(x, i = which(x[["a"]] > 2), j = nm, value = NA)
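As a rough sketch, the functional form can be compared with a third spelling that is not in the answer above: a character vector of column names wrapped in parentheses on the left-hand side. The names y and cols are made up for the example.

```r
library(data.table)

x <- data.table(a = 1:3, b = 1:6, c = 11:16)  # a is recycled to length 6

# Functional form of := assigns several columns in a single call
x[a > 2, `:=`(b = NA, c = NA)]

# Same update, with the target columns held in a character vector
y <- data.table(a = 1:3, b = 1:6, c = 11:16)
cols <- c("b", "c")  # hypothetical helper variable
y[a > 2, (cols) := NA]
```

The second spelling is convenient when the column names are only known at run time.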

delete column in data.table in R based on condition [duplicate]

I'm trying to manipulate a number of data.tables in similar ways, and would like to write a function to accomplish this. I would like to pass in a parameter containing the list of columns on which the operations should be performed. This works fine when the vector of columns is declared on the left-hand side of the := operator, but not when it is declared earlier (or passed into the function). The following code shows the issue.
dt = data.table(a = letters, b = 1:2, c=1:13)
colsToDelete = c('b', 'c')
dt[,colsToDelete := NULL] # doesn't work but I don't understand why not.
dt[,c('b', 'c') := NULL] # works fine, but doesn't allow passing in of columns
The error is "Adding new column 'colsToDelete' then assigning NULL (deleting it)." So clearly, it's interpreting 'colsToDelete' as a new column name.
The same issue occurs when doing something along these lines
dt[, colNames := lapply(.SD, adjustValue, y=factor), .SDcols = colNames]
I am new to R, but rather more experienced with some other languages, so this may be a silly question.
It's basically because we allow symbols on the LHS of := to add new columns, for convenience, e.g. DT[, col := val]. So, in order to distinguish col itself being the column name from whatever is stored in col being the column names, we check whether the LHS is a name or an expression.
If it's a name, a column with that literal name is added; if it's an expression, it gets evaluated.
DT[, col := val] # col is the column name.
DT[, (col) := val] # col gets evaluated and replaced with its value
DT[, c(col) := val] # same as above
The preferred idiom is: dt[, (colsToDelete) := NULL]
HTH
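A small sketch of how the parenthesized form might be used inside a function, which is what the question was after. dropCols and scaleBy are hypothetical names introduced here; scaleBy only stands in for the question's adjustValue, whose definition is not shown.

```r
library(data.table)

# Hypothetical helper: removes the columns named in `cols` from DT by reference
dropCols <- function(DT, cols) {
  DT[, (cols) := NULL]  # parentheses make data.table evaluate `cols`
  invisible(DT)
}

dt <- data.table(a = letters, b = 1:2, c = 1:13)
dropCols(dt, c("b", "c"))
names(dt)

# The same idiom for updating columns listed in a variable
dt2 <- data.table(a = letters, b = 1:2, c = 1:13)
colNames <- c("b", "c")
scaleBy <- function(x, y) x * y  # assumed stand-in for adjustValue
dt2[, (colNames) := lapply(.SD, scaleBy, y = 2), .SDcols = colNames]
```

Since := works by reference, the caller's table is changed without reassigning the function's return value.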
I am surprised no answer provided uses the set() function.
set(DT, , colsToDelete, NULL)
This should be the easiest.
To extend the previous answer, you can delete columns by reference by position:
# delete columns 10 to 15
dt[ , (10:15) := NULL ]
or
# delete columns 3, 5 and 10 to 15
dt[ , (c(3,5,10:15)) := NULL ]
This code did the job for me. You need the positions of the columns to be deleted, e.g. posvec, as mentioned in ?set:
j: Column name(s) (character) or number(s) (integer) to be assigned value
when column(s) already exist, and only column name(s) if they are to
be created.
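# set() modifies DT by reference and returns it invisibly,
# so the new name refers to the same, already modified table
DT_removed_selected_col = set(DT, j = posvec, value = NULL)
Also, if you want to get posvec, you can try this:
selected_col = c('col_a','col_b',...)
# note: grep() matches by regular expression, so 'col_a' would also match a name such as 'col_ab'
selected_col = unlist(sapply(selected_col, function(x) grep(x, names(DT))))
namvec = names(selected_col) # col names
posvec = unname(selected_col) # col positions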
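Since grep() does pattern matching, it can pick up more columns than intended. A possible alternative, sketched here with made-up column names, is to look up exact positions with match() before calling set().

```r
library(data.table)

# Hypothetical table: note that "col_ab" contains the pattern "col_a"
DT <- data.table(col_a = 1:3, col_ab = 4:6, col_b = 7:9, col_c = 10:12)
selected_col <- c("col_a", "col_b")

# match() returns the position of each exact name (NA if a name is absent)
posvec <- match(selected_col, names(DT))

# Delete those columns by reference
set(DT, j = posvec, value = NULL)
names(DT)
```

With grep("col_a", names(DT)) the column col_ab would have been selected as well.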

Using Data Table Define Column From Character Vector

I've been getting used to data.tables and just cannot seem to find the answer to something that feels so simple (or at least is with data frames).
I want to use data.table to aggregate, however, I don't always know which column to aggregate ahead of time (it takes input from the user). I want to define what column to use based off of a character vector. Here's a short example of what I want to do:
require(data.table)
myDT <- data.table(a = 1:10, b = 11:20, n1 = c("first", "second"))
aggWith <- "a"
Now I want to use the aggWith object to define what column to sum on. This does not work:
> myDT.Agg <- myDT[, .(Agg = sum(aggWith)), by = .(n1)]
Error in sum(aggWith) : invalid 'type' (character) of argument
Nor does this:
> myDT.Agg <- myDT[, .(Agg = sum(aggWith)), by = .(n1), with = FALSE]
Error in sum(aggWith) : invalid 'type' (character) of argument
This does:
myDT.Agg <- myDT[, .(Agg = sum(a)), by = .(n1)]
However, I want to be able to choose which column "a" refers to based on a character vector. I've looked through ?data.table, but am just not seeing what I need. Sorry in advance if this is really simple and I'm just overlooking something.
We could specify the 'aggWith' as .SDcols and then get the sum of .SD
myDT[, list(Agg= sum(.SD[[1L]] )), by = n1, .SDcols=aggWith]
If there are multiple columns, then loop with lapply
myDT[, lapply(.SD, sum), by = n1, .SDcols= aggWith]
Another option would be to use eval(as.name(aggWith)):
myDT[, list(Agg= sum(eval(as.name(aggWith)))), by = n1]
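For reference, a short runnable sketch using the question's data. The get() variant is an extra alternative that is not part of the answer above; res1 and res2 are names chosen for the example.

```r
library(data.table)

myDT <- data.table(a = 1:10, b = 11:20, n1 = c("first", "second"))
aggWith <- "a"

# .SDcols restricts .SD to the chosen column; [[1L]] extracts it as a vector
res1 <- myDT[, .(Agg = sum(.SD[[1L]])), by = n1, .SDcols = aggWith]

# get() looks the column up by its name inside j
res2 <- myDT[, .(Agg = sum(get(aggWith))), by = n1]

res1
```

Both forms should give one row per value of n1, with the sum of column a in Agg.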

How to remove duplicated (by name) column in data.tables in R?

While reading a data set using fread, I've noticed that sometimes I'm getting duplicated column names, for example (fread doesn't have check.names argument)
> data.table( x = 1, x = 2)
x x
1: 1 2
The question is: is there any way to remove 1 of 2 columns if they have the same name?
How about
dt[, .SD, .SDcols = unique(names(dt))]
This selects the first occurrence of each name (I'm not sure how you want to handle this).
As @DavidArenburg suggests in the comments above, you could use check.names=TRUE in data.table() or fread().
.SDcols approaches would return a copy of the columns you're selecting. Instead just remove those duplicated columns using :=, by reference.
dt[, which(duplicated(names(dt))) := NULL]
# x
# 1: 1
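A slightly larger sketch of the by-reference removal, assuming a table with one repeated name and one unique name. The extra column y and the variable dupPos are introduced only for illustration.

```r
library(data.table)

dt <- data.table(x = 1, x = 2, y = 3)  # duplicated name "x"

# Positions of every name that has already appeared earlier
dupPos <- which(duplicated(names(dt)))

# Remove those columns by reference; the first occurrence of each name is kept
dt[, (dupPos) := NULL]
names(dt)
```

Deleting by position avoids any ambiguity about which of the identically named columns is meant.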
Different approaches:
1. Indexing
my.data.table <- my.data.table[, -2]
2. Subsetting
my.data.table <- subset(my.data.table, select = -2)
3. Making unique names, if 1. and 2. are not ideal (when having hundreds of columns, for instance)
setnames(my.data.table, make.names(names = names(my.data.table), unique = TRUE))
4. Optionally, systematize the deletion of variables whose names meet some criterion (here, we get rid of all variables whose name ends with ".X", X being a number that starts at 1 when using make.names)
my.data.table <- subset(my.data.table,
select = !grepl(pattern = "\\.\\d+$", x = names(my.data.table)))

assigning a subset of data.table rows and columns by join

I'm trying to do something similar but different enough from what's described here:
Update subset of data.table based on join
Specifically, I'd like to assign column values from table control to the rows with matching key values (person_id is a key in both tables). ci is the column index. The statement below complains that 'with=F' was not used, and when I delete those parts it still doesn't work as expected. Any suggestions?
To rephrase: I'd like to set the subset of flatData that corresponds to control FROM control.
flatData[J(eval(control$person_id)), ci, with=F] = control[, ci, with=F]
To give a reproducible example using classic R:
x = data.frame(a = 1:3, b = 1:3, key = c('a', 'b', 'c'))
y = data.frame(a = c(2, 5), b = c(11, 2), key = c('a', 'b'))
colidx = match(c('a', 'b'), colnames(y))
x[x$key %in% y$key, colidx] = y[, colidx]
As an aside, someone please explain how to easily assign SETS of columns without using indices! Indices and data.table are a marriage made in hell.
You can use the := operator along with the join simultaneously as follows:
First prepare data:
require(data.table) ## >= 1.9.0
setDT(x) ## converts DF to DT by reference
setDT(y)
setkey(x, key) ## set key column
setkey(y, key)
Now the one-liner:
x[y, c("a", "b") := list(i.a, i.b)]
:= modifies by reference (in-place). The rows to modify are provided by the indices computed from the join in i.
i.a and i.b are the column names data.table internally generates for easy access to i's columns when both x and i have identical column names, when performing a join of the form x[i].
HTH
PS: In your example, y's columns a and b are of type numeric and x's are of type integer, so when run on your data you'll get a warning that the types didn't match and a coercion had to take place.
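Putting the pieces together as a self-contained sketch. Two things here are assumptions rather than part of the answer: y is built with integer columns so that no coercion is needed, and the second variant, which takes the column names from a character vector via mget() and joins with on= instead of keys, is one possible way to address the aside about assigning sets of columns without indices.

```r
library(data.table)

x <- data.table(a = 1:3, b = 1:3, key = c("a", "b", "c"))
y <- data.table(a = c(2L, 5L), b = c(11L, 2L), key = c("a", "b"))  # integers, assumed
setkey(x, key)
setkey(y, key)

# Rows of x matched by y receive y's values; the unmatched row is left alone
x[y, c("a", "b") := list(i.a, i.b)]

# Variant driven by a character vector of column names (cols is hypothetical)
x2 <- data.table(a = 1:3, b = 1:3, key = c("a", "b", "c"))
cols <- c("a", "b")
x2[y, (cols) := mget(paste0("i.", cols)), on = "key"]
```

In both variants only the rows with key "a" and "b" change, and x is updated in place.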
