Dynamically subsetting a data.table in R

My data.table contains K claims columns, among 30 other columns. I want to subset the data.table so that only the rows with no 0 in any of the claims columns remain.
So, first I get all the column names I need for filtering. For the purpose of this example, I have chosen K = 2:
> claimsCols = c("claimsnext", paste0("claims" , 1:K))
> claimsCols
[1] "claimsnext" "claims1" "claims2"
I have tried subsetting like this:
for (i in claimsCols) {
  BTplan <- BTplan[claimsCols[i] == 0, ]
  i + 1
}
This doesn't work:
Error in i + 1 : non-numeric argument to binary operator
I am sure there is a better way to do this?

I would basically do what akrun does:
idx = BTplan[ , Reduce(`&`, .SD), .SDcols = patterns('claims')]
BTplan = BTplan[idx]
The innovations are:
Use patterns in .SDcols to select the columns to include by a name pattern
& automatically converts numeric to logical: 1.1 & 2.2 is TRUE, and the result becomes FALSE as soon as there is a 0 anywhere, hence filtering out the corresponding row (see the snippet below)
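To see that coercion in isolation (a quick illustrative snippet, separate from the answer):
# non-zero numerics count as TRUE, zero as FALSE
1.1 & 2.2                                  # TRUE
1.1 & 0                                    # FALSE
Reduce(`&`, list(c(1, 0, 3), c(2, 5, 0)))  # TRUE FALSE FALSE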
In a future version of data.table this will be slightly more efficient and concise (and hopefully more readable):
idx = BTplan[ , pall(.SD), .SDcols = patterns('claims')]
BTplan = BTplan[idx]
Keep an eye on this pull request:
https://github.com/Rdatatable/data.table/pull/4448

In the OP's code, i is each element of 'claimsCols', which is character, so i + 1 won't work; in fact, it is not needed. Note also that claimsCols[i] == 0 never looks at the column's values, and the condition needs to be != 0 to drop the zero rows:
for (colnm in claimsCols) {
  BTplan <- BTplan[BTplan[[colnm]] != 0, ]
}
Or, using data.table syntax:
library(data.table)
setDT(BTplan)
BTplan[BTplan[, Reduce(`&`, lapply(.SD, `!=`, 0)), .SDcols = claimsCols]]
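A minimal reproducible sketch of this approach (BTplan isn't shown in the question, so this uses made-up toy data with K = 2):
library(data.table)
# hypothetical stand-in for BTplan
BTplan <- data.table(claimsnext = c(1, 0, 2),
                     claims1    = c(3, 4, 0),
                     claims2    = c(5, 6, 7),
                     other      = c("a", "b", "c"))
claimsCols <- c("claimsnext", paste0("claims", 1:2))
BTplan[BTplan[, Reduce(`&`, lapply(.SD, `!=`, 0)), .SDcols = claimsCols]]
#    claimsnext claims1 claims2 other
# 1:          1       3       5     a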

Related

Why Doesn't dt[,ncol(dt)] Return Column Contents?

I've edited this question based on comments by @akrun (thank you!), realizing I didn't accurately ask my question.
I'm confused why the following doesn't return the contents of the last column in a data table.
> dt <- data.table(A=c(10,10,10),B=c(20,20,20),C=c(30,30,30))
> dt[,ncol(dt)]
[1] 3
If I use with=FALSE it behaves as I would expect, returning the last column as a data.table:
> dt[,ncol(dt),with=F]
C
1: 30
2: 30
3: 30
This returns the same result as dt[,3], which makes sense. But why isn't dt[,ncol(dt)] the same as dt[,3]? From ?data.table:
When j is a vector of column names or positions to select (as in data.frame). There is no need to use with=FALSE anymore.
Doesn't ncol(dt) return a vector of column positions, a vector of length one? Why doesn't dt[,ncol(dt)] return the contents of the last column?
Thanks for your help!
For a data.frame, we need drop = FALSE:
df[, ncol(df), drop = FALSE]
since by default drop is TRUE, as we can check in ?Extract:
x[i, j, ... , drop = TRUE]
For a data.table, we need with = FALSE:
dt[, ncol(dt), with = FALSE]
as mentioned in the ?data.table help:
When j is a character vector of column names, a numeric vector of column positions to select or of the form startcol:endcol, and the value returned is always a data.table. with=FALSE is not necessary anymore to select columns dynamically. Note that x[, cols] is equivalent to x[, ..cols] and to x[, cols, with=FALSE] and to x[, .SD, .SDcols=cols].
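As a further small sketch, you can also bind the position to a variable and use the .. prefix, which tells data.table to look the variable up in the calling scope:
lastcol <- ncol(dt)
dt[, ..lastcol]
#     C
# 1: 30
# 2: 30
# 3: 30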

Replacing all NA with 0 in a data.table in R

I have a data.table with many columns. There are 4 columns where I want to replace NA with 0.
I have a working solution:
claimsMonthly[is.na(claim9month),claim9month := 0
][is.na(claim10month),claim10month := 0
][is.na(claim11month),claim11month := 0
][is.na(claim12month),claim12month := 0]
However, this is quite repetitive, and I wanted to reduce it by using a loop (not sure if that is the smartest idea, though):
for (i in 9:12) {
  claimsMonthly[is.na(paste0("claim", i, "month")), paste0("claim", i, "month") := 0]
}
When I run this loop nothing happens. I guess it is due to the fact that paste0() returns the string "claim12month", so I get is.na("claim12month"). The result of that is FALSE, despite the fact that there are NAs in my data. I guess this has something to do with the quotes?
This is not the first time I have had issues with using paste0() or running loops with data.table, so I must be missing something important here.
Any ideas how to fix this?
We can specify the names of the columns ('nm1') in .SDcols, loop over the .SD (Subset of Data.table), and replace the NAs with 0 (replace_na from tidyr):
library(data.table)
library(tidyr)
nm1 <- paste0("claim", 9:12, "month")
setDT(claimsMonthly)[, (nm1) := lapply(.SD, replace_na, 0), .SDcols = nm1]
Or, as @jangorecki mentioned in the comments, nafill from data.table would be better:
setDT(claimsMonthly)[, (nm1) := lapply(.SD, nafill, fill = 0), .SDcols = nm1]
Or, using a loop with set, assign 0 wherever a column of interest is NA, by specifying i (the row index) and j (the column index/name):
for (j in nm1) {
  set(claimsMonthly, i = which(is.na(claimsMonthly[[j]])), j = j, value = 0)
}
Or with setnafill
setnafill(claimsMonthly, cols = nm1, fill = 0)
You can use:
claimsMonthly[, 9:12][is.na(claimsMonthly[, 9:12])] <- 0
Also you can use variable names:
claimsMonthly[c("claim9month", "claim10month","claim11month","claim12month")][is.na(claimsMonthly[c("claim9month", "claim10month","claim11month","claim12month")])] <- 0
Or, even better, you can use a vector of all variables matching the "claimXXmonth" pattern, as sketched below.
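For instance, a sketch that builds the column vector by pattern and reuses setnafill from the answer above (assuming every matching column is numeric):
# all columns whose names look like claim<number>month
cols <- grep("^claim[0-9]+month$", names(claimsMonthly), value = TRUE)
setnafill(claimsMonthly, cols = cols, fill = 0)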

Using Data Table Define Column From Character Vector

I've been getting used to data.tables and just cannot seem to find the answer to something that feels so simple (or at least is with data frames).
I want to use data.table to aggregate, however, I don't always know which column to aggregate ahead of time (it takes input from the user). I want to define what column to use based off of a character vector. Here's a short example of what I want to do:
require(data.table)
myDT <- data.table(a = 1:10, b = 11:20, n1 = c("first", "second"))
aggWith <- "a"
Now I want to use the aggWith object to define what column to sum on. This does not work:
> myDT.Agg <- myDT[, .(Agg = sum(aggWith)), by = .(n1)]
Error in sum(aggWith) : invalid 'type' (character) of argument
Nor does this:
> myDT.Agg <- myDT[, .(Agg = sum(aggWith)), by = .(n1), with = FALSE]
Error in sum(aggWith) : invalid 'type' (character) of argument
This does:
myDT.Agg <- myDT[, .(Agg = sum(a)), by = .(n1)]
However, I want to be able to define which column "a" is arbitrarily, based on a character vector. I've looked through ?data.table, but am just not seeing what I need. Sorry in advance if this is really simple and I'm just overlooking something.
We could specify the 'aggWith' as .SDcols and then get the sum of .SD
myDT[, list(Agg = sum(.SD[[1L]])), by = n1, .SDcols = aggWith]
If there are multiple columns, then loop with lapply:
myDT[, lapply(.SD, sum), by = n1, .SDcols = aggWith]
Another option would be to use eval(as.name()):
myDT[, list(Agg = sum(eval(as.name(aggWith)))), by = n1]
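A closely related variant (a quick sketch) is get(), which looks the column up by name at evaluation time; with the example data this gives:
myDT[, .(Agg = sum(get(aggWith))), by = n1]
#        n1 Agg
# 1:  first  25
# 2: second  30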

How to remove duplicated (by name) column in data.tables in R?

While reading a data set using fread, I've noticed that sometimes I get duplicated column names, for example (fread doesn't have a check.names argument):
> data.table( x = 1, x = 2)
x x
1: 1 2
The question is: is there any way to remove one of the two columns if they have the same name?
How about
dt[, .SD, .SDcols = unique(names(dt))]
This selects the first occurrence of each name (I'm not sure how you want to handle this).
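A quick check with the table from the question (matching by name keeps the first occurrence):
dt <- data.table(x = 1, x = 2)
dt[, .SD, .SDcols = unique(names(dt))]
#    x
# 1: 1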
As @DavidArenburg suggests in the comments above, you could use check.names = TRUE in data.table() or fread().
The .SDcols approaches would return a copy of the columns you're selecting. Instead, just remove those duplicated columns by reference using :=.
dt[, which(duplicated(names(dt))) := NULL]
# x
# 1: 1
Different approaches:
1. Indexing
my.data.table <- my.data.table[, -2]
2. Subsetting
my.data.table <- subset(my.data.table, select = -2)
3. Making unique names, if 1. and 2. are not ideal (when having hundreds of columns, for instance)
setnames(my.data.table, make.names(names = names(my.data.table), unique = TRUE))
4. Optionally, systematically deleting variables whose names meet some criterion (here, we get rid of all variables having a name ending with ".X", X being a number starting at 1 when using make.names)
my.data.table <- subset(my.data.table,
                        select = !grepl(pattern = "\\.\\d$", x = names(my.data.table)))
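A compact sketch combining options 3 and 4 on a toy table (the final selection uses the ..keep lookup form rather than subset, to stay within data.table syntax):
library(data.table)
dt <- data.table(x = 1, x = 2, y = 3)
setnames(dt, make.names(names(dt), unique = TRUE))
names(dt)
# [1] "x"   "x.1" "y"
keep <- names(dt)[!grepl("\\.\\d$", names(dt))]  # drop names ending in .<digit>
dt[, ..keep]
#    x y
# 1: 1 3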

Elegantly assigning multiple columns in data.table with lapply()

I am trying to figure out an elegant way to use := assignment to replace many columns at once in a data.table by applying a shared function. A typical use of this might be to apply a string function (e.g., gsub) to all character columns in a table. It is not difficult to extend the data.frame way of doing this to a data.table, but I'm looking for a method consistent with the data.table way of doing things.
For example:
library(data.table)
m <- matrix(runif(10000), nrow = 100)
df <- df1 <- df2 <- df3 <- as.data.frame(m)
dt <- as.data.table(df)
head(names(df))
head(names(dt))
## replace V20-V100 with sqrt
# data.frame approach
# by column numbers
df1[20:100] <- lapply(df1[20:100], sqrt)
# by reference to column numbers
v <- 20:100
df2[v] <- lapply(df2[v], sqrt)
# by reference to column names
n <- paste0("V", 20:100)
df3[n] <- lapply(df3[n], sqrt)
# data.table approach
# by reference to column names
n <- paste0("V", 20:100)
dt[, n] <- lapply(dt[, n, with = FALSE], sqrt)
I understand it is more efficient to loop over a vector of column names using := to assign:
for (col in paste0("V", 20:100)) dt[, col := sqrt(dt[[col]]), with = FALSE]
I don't like this because I don't like referencing the data.table in a j expression. I also know that I can use := to assign with lapply given that I know the column names:
dt[, c("V20", "V30", "V40", "V50", "V60") := lapply(list(V20, V30, V40, V50, V60), sqrt)]
(You could extend this by building an expression with unknown column names.)
Below are the ideas I tried on this, but I wasn't able to get them to work. Am I making a mistake, or is there another approach I'm missing?
# possible data.table approaches?
# by reference to column names; assignment works, but not lapply
n <- paste0("V", 20:100)
dt[, n := lapply(n, sqrt), with = FALSE]
# by (smaller for example) list; lapply works, but not assignment
dt[, list(list(V20, V30, V40, V50, V60)) := lapply(list(V20, V30, V40, V50, V60), sqrt)]
# by reference to list; neither assignment nor lapply work
l <- parse(text = paste("list(", paste(paste0("V", 20:100), collapse = ", "), ")"))
dt[, eval(l) := lapply(eval(l), sqrt)]
Yes, you're right in the question here:
I understand it is more efficient to loop over a vector of column names using := to assign:
for (col in paste0("V", 20:100))
  dt[, col := sqrt(dt[[col]]), with = FALSE]
Aside: note that the new way of doing that is:
for (col in paste0("V", 20:100))
  dt[, (col) := sqrt(dt[[col]])]
because it wasn't easy to tell whether with = FALSE referred to the LHS or the RHS of :=. End aside.
As you know, that's efficient because it does each column one by one, so working memory is only needed for one column at a time. That can make the difference between it working and failing with the dreaded out-of-memory error.
The problem with lapply on the RHS of := is that the RHS (the lapply) is evaluated first; i.e., the result for the 80 columns is created. That's 80 columns' worth of new memory which has to be allocated and populated. So you need 80 columns' worth of free RAM for that operation to succeed. That RAM usage dominates compared with the subsequently instant operation of assigning (plonking) those 80 new columns into the data.table's column pointer slots.
As @Frank pointed out, if you have a lot of columns (say 10,000 or more), then the small overhead of dispatching to the [.data.table method starts to add up. To eliminate that overhead, there is data.table::set, which is described under ?set as a "loopable" :=. I use a for loop for this type of operation. It's the fastest way and is fairly easy to write and read.
for (col in paste0("V", 20:100))
  set(dt, j = col, value = sqrt(dt[[col]]))
Although with just 80 columns, it's unlikely to matter. (Note it may be more common to loop set over a large number of rows than a large number of columns.) However, looped set doesn't solve the problem of the repeated reference to the dt symbol name that you mentioned in the question:
I don't like this because I don't like referencing the data.table in a j expression.
Agreed. So the best I can do is revert to your looping of := but use get instead.
for (col in paste0("V", 20:100))
  dt[, (col) := sqrt(get(col))]
However, I fear that using get in j carries an overhead. Benchmarking is being done in #1380. Also, perhaps it is confusing to use get() on the RHS but not on the LHS. To address that, we could sugar the LHS and allow get() as well, #1381:
for (col in paste0("V", 20:100))
  dt[, get(col) := sqrt(get(col))]
Also, maybe the value argument of set could be run within the scope of DT, #1382:
for (col in paste0("V", 20:100))
  set(dt, j = col, value = sqrt(get(col)))
These should work if you want to refer to the columns by string name:
n = paste0("V", 20:100)
dt[, (n) := lapply(n, function(x) {sqrt(get(x))})]
or
dt[, (n) := lapply(n, function(x) {sqrt(dt[[x]])})]
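As a quick sanity check (a sketch; it assumes dt is rebuilt fresh from df before running one of the variants above), the result should match the data.frame approach from the question:
dt <- as.data.table(df)
dt[, (n) := lapply(n, function(x) sqrt(dt[[x]]))]
all.equal(as.data.frame(dt), df3)
# [1] TRUE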
Is this what you are looking for?
dt[, names(dt)[20:100] := lapply(.SD, function(x) sqrt(x)), .SDcols = 20:100]
I have heard tell that using .SD is not so efficient because it makes a copy of the table beforehand, but if your table isn't huge (obviously that's relative depending on your system specs) I doubt it will make much of a difference.
