Following up on this question, how would you assign values to multiple columns in a data table using the ":=" sign?
For example:
x <- data.table(a = 1:3, b = 1:6, c = 11:16)
I can get what I want using two lines:
x[a>2, b:=NA]
x[a>2, c:=NA]
but would like to be able to do it in one, something like this:
x[a>2, .(b:=NA, c:=NA)]
But unfortunately that doesn't work. Is there another way?
We can use := just once, in its functional form:
x[a > 2, `:=`(b = NA, c = NA)]
If there are many columns, another option is set() in a loop:
for(nm in names(x)[-1]) set(x, i=which(x[["a"]]>2), j=nm, value = NA)
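When the target columns are already held in a character vector, the left-hand side of := can also take that vector directly. The following is a minimal sketch, assuming the same x as in the question (with a written out via rep() rather than relying on recycling); the helper vector cols is a name introduced here for illustration only.

```r
library(data.table)

# Same data as in the question; rep() makes the recycling of a explicit
x <- data.table(a = rep(1:3, 2), b = 1:6, c = 11:16)

# Hypothetical helper: names of the columns to blank out
cols <- c("b", "c")

# Wrapping cols in parentheses makes := use its contents as column names;
# the single NA is assigned to every listed column for rows where a > 2
x[a > 2, (cols) := NA]
print(x)
```

Writing c("b", "c") := NA inline should behave the same way; the parentheses are only needed when the names live in a variable.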
Related
I have a data.table in R and would like to modify one of its columns on the fly (without changing the original object) and afterwards select a limited number of its columns by reference. The best I have managed is the following, but then my column names are changed. Any suggestions?
tmp <- data.table(a = 'X', b = 'Y', d = 1)
tmp[,.(d = d * - 1, .SD), .SDcols = colNames]
Try c(list(d = d * -1), .SD) in the j argument.
j expects a list, and .SD is itself a list, so to add a new column this way you just put it into a list and combine the two with the c function.
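Put together, the suggestion might look like the sketch below. The question never defines colNames, so the value used here is an assumption made purely for illustration.

```r
library(data.table)

tmp <- data.table(a = "X", b = "Y", d = 1)

# Assumption: colNames was not defined in the question; these are example values
colNames <- c("a", "b")

# c() combines the one-element list holding the modified d with .SD,
# which carries the columns named in .SDcols under their original names
res <- tmp[, c(list(d = d * -1), .SD), .SDcols = colNames]
print(res)

# tmp itself is untouched because no := was used
print(tmp)
```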
I have a data.table with two text columns. I need to use column b to determine which letters in column a to replace with an "x".
I can do it using a for loop, as in the code below. However, my actual data set has 250,000+ rows, so the script takes ages. Is there a more efficient way to do this? I considered lapply but couldn't get my head round it.
DT <- data.table(a = c("ABCD","ABCD","ABCD","ABCD"), b = c("A","B","C", "D"))
DT$c <- ""
for (i in 1 : NROW(DT)){
DT[i]$c <- sub(DT[i,b], "x", DT[i,a])
}
Here is one approach using the tidyverse:
library(tidyverse)
DT <- data.table::data.table(a = c("ABCD","ABCD","ABCD","ABCD"), b = c("A","B","C", "D"))
DT %>%
  mutate(new_vec = str_replace_all(string = a, pattern = b, replacement = "x"))
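For readers who would rather stay within data.table and base R, a possible alternative is to pair the two columns up with mapply() and assign the result with :=. This is only a sketch: it assumes the values in b should be matched literally (hence fixed = TRUE) and that only the first match should be replaced, as with sub() in the question.

```r
library(data.table)

DT <- data.table(a = c("ABCD", "ABCD", "ABCD", "ABCD"),
                 b = c("A", "B", "C", "D"))

# mapply() walks a and b in parallel and calls sub() once per row;
# fixed = TRUE is an assumption that b holds literal text, not a regex
DT[, c := mapply(sub, pattern = b, x = a,
                 MoreArgs = list(replacement = "x", fixed = TRUE),
                 USE.NAMES = FALSE)]
print(DT)
```

This still calls sub() once per row, but it avoids the repeated row-wise subsetting and assignment of the original loop, which is likely where most of the time goes.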
I've been getting used to data.tables and just cannot seem to find the answer to something that feels so simple (or at least is with data frames).
I want to use data.table to aggregate, however, I don't always know which column to aggregate ahead of time (it takes input from the user). I want to define what column to use based off of a character vector. Here's a short example of what I want to do:
require(data.table)
myDT <- data.table(a = 1:10, b = 11:20, n1 = c("first", "second"))
aggWith <- "a"
Now I want to use the aggWith object to define what column to sum on. This does not work:
> myDT.Agg <- myDT[, .(Agg = sum(aggWith)), by = .(n1)]
Error in sum(aggWith) : invalid 'type' (character) of argument
Nor does this:
> myDT.Agg <- myDT[, .(Agg = sum(aggWith)), by = .(n1), with = FALSE]
Error in sum(aggWith) : invalid 'type' (character) of argument
This does:
myDT.Agg <- myDT[, .(Agg = sum(a)), by = .(n1)]
However, I want to be able to define which column "a" is arbitrarily, based on a character vector. I've looked through ?data.table but am just not seeing what I need. Sorry in advance if this is really simple and I'm just overlooking something.
We could specify the 'aggWith' as .SDcols and then get the sum of .SD
myDT[, list(Agg= sum(.SD[[1L]] )), by = n1, .SDcols=aggWith]
If there are multiple columns, then loop with lapply
myDT[, lapply(.SD, sum), by = n1, .SDcols= aggWith]
Another option would be to use eval(as.name(...)):
myDT[, list(Agg= sum(eval(as.name(aggWith)))), by = n1]
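A closely related variant uses get(), which looks a column up by the name stored in a string. The sketch below reuses the question's data, with n1 written out via rep() so that it does not depend on recycling.

```r
library(data.table)

myDT <- data.table(a = 1:10, b = 11:20,
                   n1 = rep(c("first", "second"), 5))
aggWith <- "a"

# get() resolves the string held in aggWith to the column of that name
# within each group before sum() is applied
res <- myDT[, .(Agg = sum(get(aggWith))), by = n1]
print(res)
```

For a single column chosen at run time, get() and eval(as.name()) should give the same result; the .SDcols form is probably the better fit once several columns are involved.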
Say I have the following data.table
dt <- data.table(var = c("a", "b"), val = c(1, 2))
Now I want to add two new columns to dt, named a, and b with the respective values (1, 2). I can do this with a loop, but I want to do it the data.table way.
The result would be a data.table like this:
dt.res <- data.table(var = c("a", "b"), val = c(1, 2), #old vars
a = c(1, NA), b = c(NA, 2)) # newly created vars
So far I came up with something like this
dt[, c(xx) := val, by = var]
where xx would be a data.table-command similar to .N which addresses the value of the by-group.
Thanks for the help!
Appendix: The for-loop way
The non-data.table-way with a for-loop instead of a by-argument would look something like this:
for (varname in dt$var){
dt[var == varname, c(varname) := val]
}
Based on the example shown, we can use dcast from data.table to convert the long format to wide, and then join with the original dataset on the 'val' column.
library(data.table)#v1.9.6+
dt[dcast(dt, val~var, value.var='val'), on='val']
# var val a b
#1: a 1 1 NA
#2: b 2 NA 2
Or, as @CathG mentioned in the comments, for earlier versions either use merge or set the key column and then join.
merge(dt, dcast.data.table(dt, val~var, value.var='val'))
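To see what the join is working with, it can help to look at the intermediate wide table on its own first. The sketch below splits the one-liner into two steps; the names wide and res are introduced here for illustration only.

```r
library(data.table)

dt <- data.table(var = c("a", "b"), val = c(1, 2))

# Step 1: one row per val, one column per distinct value of var,
# with NA wherever a (val, var) combination does not occur
wide <- dcast(dt, val ~ var, value.var = "val")
print(wide)

# Step 2: join the wide table back onto the original rows via val
res <- dt[wide, on = "val"]
print(res)
```

Note that this relies on val identifying the rows; if the same val appeared under several values of var, the join would match more than one row.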
I'm trying to do something similar but different enough from what's described here:
Update subset of data.table based on join
Specifically, I'd like to assign column values from table control to the rows with matching key values (person_id is a key in both tables). ci is the column index. The statement below complains that 'with=F' was not used; when I delete those parts, it still doesn't work as expected. Any suggestions?
To rephrase: I'd like to set the subset of flatData that corresponds to control FROM control.
flatData[J(eval(control$person_id)), ci, with=F] = control[, ci, with=F]
To give a reproducible example using classic R:
x = data.frame(a = 1:3, b = 1:3, key = c('a', 'b', 'c'))
y = data.frame(a = c(2, 5), b = c(11, 2), key = c('a', 'b'))
colidx = match(c('a', 'b'), colnames(y))
x[x$key %in% y$key, colidx] = y[, colidx]
As an aside, someone please explain how to easily assign SETS of columns without using indices! Indices and data.table are a marriage made in hell.
You can use the := operator along with the join simultaneously as follows:
First prepare data:
require(data.table) ## >= 1.9.0
setDT(x) ## converts DF to DT by reference
setDT(y)
setkey(x, key) ## set key column
setkey(y, key)
Now the one-liner:
x[y, c("a", "b") := list(i.a, i.b)]
:= modifies by reference (in-place). The rows to modify are provided by the indices computed from the join in i.
i.a and i.b are names that data.table generates internally in a join of the form x[i], so that i's columns remain accessible when x and i have identically named columns.
HTH
PS: In your example, y's columns a and b are of type numeric while x's are of type integer, so when you run this on your data you may get a warning that the types didn't match and a coercion had to take place.
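On the aside about assigning sets of columns without indices: one commonly used idiom is to keep the column names in a character vector and build the matching i.-prefixed names with mget(). The sketch below assumes a data.table version that supports the on= argument (so no setkey() is needed) and uses numeric columns in x to sidestep the coercion issue; the vector cols is a name introduced here for illustration.

```r
library(data.table)

x <- data.table(a = c(1, 2, 3), b = c(1, 2, 3), key = c("a", "b", "c"))
y <- data.table(a = c(2, 5), b = c(11, 2), key = c("a", "b"))

# Hypothetical helper: the set of columns to copy from y into x
cols <- c("a", "b")

# For rows of x whose key matches a row of y, overwrite each column in cols
# with y's column of the same name, fetched as i.a, i.b via mget()
x[y, on = "key", (cols) := mget(paste0("i.", cols))]
print(x)
```

Rows of x without a match in y (here the one with key "c") are left unchanged.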