Subsetting and assignment on several columns of a data table - r

Lets say I have a data table like the one below:
library(data.table)
N = 10
x = data.table(id = 1:N,
segm = sample(c("A","B","C"),N,replace=T), r = rnorm(N,20,5),
aa = sample(0:1,N,replace=T), ab = sample(0:1,N,replace=T),
ba = sample(0:1,N,replace=T), bb = sample(0:1,N,replace=T))
I'd like to know how to substitute the 1 values for NA but only for the columns aa, ab, ba and bb using the data table package. I know how to do this using data frame.
I tried using the following:
f = c("aa","ab","ba","bb")
x[,f,with=F][x[,f,with=F]==1] <- "NA"
but I'm getting an error: Error in [<-.data.table(*tmp*, , f, with = F, value = list(aa = c("0", : unused argument (with = F)
To sum up, my question is: How can I subset and assign on several columns of a data table at the same time.
The line of code:
x[f==1,f:="NA"]
is just not working. Why?
Any help is appreciate.

It's possible to accomplish this in another way, for this particular case:
x[, (f) := lapply(.SD, function(x) x * (x | NA)), .SDcols=f]
We use the fact that TRUE | NA = TRUE and FALSE | NA = NA here. The ( in the LHS of := sees it as an expression (rather than a variable name) and therefore evaluates it to obtain the columns contained in it. Specifying .SDcols provides .SD with just the columns f, what we want. And we apply this hack of a function to replace each column, by reference.
DT[f == 1, f := NA]
doesn't work because:
Let's write your expression as DT[i, LHS := RHS]. i being an expression gets evaluated within the scope of DT. [.data.table tries to find a column f within the scope of DT and since there isn't any, it'll try to find in the calling scope and gets the value stored in it, which then becomes: c("aa", "ab", "ba", "bb") == 1. This evaluates to FALSE, FALSE, FALSE, FALSE resulting in an empty data.table - the assignment in j will have no effect.
Also note the ( in LHS in my answer. This is so that we can still conveniently use DT[, f := val] where f is the column name.

There's nothing wrong with using a for() loop here.
Given the nature of your problem, with a different subset of rows being operated on in each of the four columns, you're going to need to use some sort of loop; you might as well construct an explicit one that allows you to take full advantage of data.table's modify-by-reference := operator.
for (i in f)
x[get(i)==1, (i):=NA]
x
# id segm r aa ab ba bb
# 1: 1 C 15.203246 NA NA 0 0
# 2: 2 B 23.536583 NA 0 0 NA
# 3: 3 A 16.404203 NA 0 NA 0
# 4: 4 A 18.673618 0 0 NA NA
# 5: 5 C 30.528967 NA 0 NA NA
# 6: 6 A 18.887781 0 NA NA NA
# 7: 7 C 24.476124 0 0 NA NA
# 8: 8 B 26.862686 0 0 NA 0
# 9: 9 C 9.047837 0 0 0 NA
# 10: 10 C 17.532379 0 0 NA NA

Related

R data.table creating a custom function using lapply to create and reassign multiple variables

I have the following lines of code :
DT[flag==T, temp:=haz_1.5]
DT[, temp:= na.locf(temp, na.rm = FALSE), "pid"]
DT[agedays==61, haz_1.5_1:=temp]
I need to convert this into a function, so that it will work on a list of variables, instead of just one single one. I have recently learned how to create a function using lapply by passing through a list of columns and conditions for the creation of one set of new columns. However I'm unsure of how to do it when I'm passing through a list of columns as well as carrying through all values of a variable forward on these columns.
For instance, I can code the following :
columns<-c("haz_1.5", "waz_1.5")
new_cols <- paste(columns, "1", sep = "_")
x=61
maled_anthro[(flag==TRUE)&(agedays==x), (new_cols) := lapply(.SD, function(y) na.locf(y, na.rm=F)), .SDcols = columns]
But I am missing the na.locf step and thus am not getting the same output as the original lines of code prior to building the function. How would I incorporate the line of code which utilizes na.locf to carry forward values (DT[, temp:= na.locf(temp, na.rm = FALSE), "pid"]) into this function in a way in which all the data is wrapped up into the single function? Would this work with lapply in the same manner?
Dummy data that's similar to the data table I'm using :
DT <- data.table(pid = c(1,1,2,3,3,4,4,5,5,5),
flag = c(T,T,F,T,T,F,T,T,T,T),
agedays = c(1,61,61,51,61,23,61,1,32,61),
haz_1.5 = c(1,1,1,2,NA,1,3,2,3,4),
waz_1.5 = c(1,NA,NA,NA,NA,2,2,3,4,4))
OP's code can be turned into an anonymous function which is applied to the selected columns:
library(data.table)
columns <- c("haz_1.5", "waz_1.5")
new_cols <- paste0(columns, "_1")
x <- 61
DT[, (new_cols) := lapply(.SD, function(v) {
temp <- fifelse(flag, v, NA_real_)
temp <- nafill(temp, "locf")
fifelse(agedays == x, temp, NA_real_)
}), .SDcols = columns, by = pid][]
pid flag agedays haz_1.5 waz_1.5 haz_1.5_1 waz_1.5_1
1: 1 TRUE 1 1 1 NA NA
2: 1 TRUE 61 1 NA 1 1
3: 2 FALSE 61 1 NA NA NA
4: 3 TRUE 51 2 NA NA NA
5: 3 TRUE 61 NA NA 2 NA
6: 4 FALSE 23 1 2 NA NA
7: 4 TRUE 61 3 2 3 2
8: 5 TRUE 1 2 3 NA NA
9: 5 TRUE 32 3 4 NA NA
10: 5 TRUE 61 4 4 4 4
This is the same result we would get when we manually repeat OP's code for the two columns (note that it is required to clear the temp column before assigning by reference parts of it.)
DT[(flag), temp := haz_1.5]
DT[, temp := zoo::na.locf(temp, na.rm = FALSE), by = pid]
DT[agedays == 61, haz_1.5_1 := temp]
DT[, temp := NULL]
DT[(flag), temp := waz_1.5]
DT[, temp := zoo::na.locf(temp, na.rm = FALSE), by = pid]
DT[agedays == 61, waz_1.5_1 := temp]
DT[, temp := NULL][]
pid flag agedays haz_1.5 waz_1.5 haz_1.5_1 waz_1.5_1
1: 1 TRUE 1 1 1 NA NA
2: 1 TRUE 61 1 NA 1 1
3: 2 FALSE 61 1 NA NA NA
4: 3 TRUE 51 2 NA NA NA
5: 3 TRUE 61 NA NA 2 NA
6: 4 FALSE 23 1 2 NA NA
7: 4 TRUE 61 3 2 3 2
8: 5 TRUE 1 2 3 NA NA
9: 5 TRUE 32 3 4 NA NA
10: 5 TRUE 61 4 4 4 4
Some explanations
There is one important difference between OP's "single column" code and this approach: The anonymous function is called for each item in the grouping variable pid. In OP's code, the first and last assignments are working on the ungrouped (full) vectors (which might be somewhat more efficient, perhaps). However, the result of those assignments is independent of pid and the result is the same.
Instead of zoo::na.locf(), data.table's nafill() function is used (new with data.table v1.12.4, on CRAN 03 Oct 2019)
DT[(flag), ...] is equivalent to DT[flag == TRUE, ...]
When fifelse() is used instead of subsetted assign by reference, the no parameter must be NA to be compliant. Thus, DT[, temp := fifelse(flag, haz_1.5, NA_real_)][] is equivalent to DT[(flag), temp := haz_1.5][]

set() in data.table - matching on names instead of column number

Using set() for efficiency to update values insided of a data.table I ran into problems, when the order of the columns changed. So to prevent that I used a workaround to match the column name instead of the column postion.
I would like to know if there's a better way of addressing the column in the j part of the set query.
DT <- as.data.table(cbind( Period = 1:10,
Col.Name=NA))
set(DT, i = 1L , j = as.integer(match("Col.Name",names(DT))), value = 0)
set(DT, i = 3L , j = 2L, value = 0)
So I would like to ask if there's a data.table workaround for this, perhaps a fast matching on the colnames already available.
We can use column name directly in 'j'
set(DT, i = 1L , j = "Col.Name", value = 0)
DT
# Period Col.Name
# 1: 1 0
# 2: 2 NA
# 3: 3 NA
# 4: 4 NA
# 5: 5 NA
# 6: 6 NA
# 7: 7 NA
# 8: 8 NA
# 9: 9 NA
#10: 10 NA

R: By group, test if for each value of one variable, that value exists in another variable

I have a data frame structured something like:
a <- c(1,1,1,2,2,2,3,3,3,3,4,4)
b <- c(1,2,3,1,2,3,1,2,3,4,1,2)
c <- c(NA, NA, 2, NA, 1, 1, NA, NA, 1, 1, NA, NA)
df <- data.frame(a,b,c)
Where a and b uniquely identify an observation. I want to create a new variable, d, which indicates if each observation's value for b is present at least once in c as grouped by a. Such that d would be:
[1] 0 1 0 1 0 0 1 0 0 0 0 0
I can write a for loop which will do the trick,
attach(df)
for (i in unique(a)) {
for (j in b[a == i]) {
df$d[a == i & b == j] <- ifelse(j %in% c[a == i], 1, 0)
}
}
But surely in R there must be a cleaner/faster way of achieving the same result?
Using data.table:
library(data.table)
setDT(df) #convert df to a data.table without copying
# +() is code golf for as.integer
df[ , d := +(b %in% c), by = a]
# a b c d
# 1: 1 1 NA 0
# 2: 1 2 NA 1
# 3: 1 3 2 0
# 4: 2 1 NA 1
# 5: 2 2 1 0
# 6: 2 3 1 0
# 7: 3 1 NA 1
# 8: 3 2 NA 0
# 9: 3 3 1 0
# 10: 3 4 1 0
# 11: 4 1 NA 0
# 12: 4 2 NA 0
Adding the dplyr version for those of that persuasion. All credit due to #akrun.
library(dplyr)
df %>% group_by(a) %>% mutate(d = +(b %in% c))
And for posterity, a base R version as well (via #thelatemail below)
df <- df[order(df$a, df$b), ]
df$d <- unlist(by(df, df$a, FUN = function(x) (x$b %in% x$c) + 0L ))
The above answer by MichaelChirico apparently works well and is correct. I rarely use data.table so I don't understand the syntax. This is a way to get the same results without data.table.
invisible(lapply(unique(df$a), function(x) {
df$d[df$a==x] <<- 0L + (df$b[df$a==x] %in% df$c[df$a==x])
}))
This code gets all of the unique levels of a and then modifies the data.frame for that level of a using the logic you request. The <<- is necessary because df will otherwise be modified just in the scope of the apply and not in .GlobalEnv. With <<- it finds the parent environment where df is defined and sets df there.
Also, note the slightly different version of the + "trick" where a leading 0 which makes it clearer to the reader that the resulting vector is an integer because it must be cast that way for the addition to work. The L after the 0 indicates that 0 is in integer and not a double. Note that the notation used by MichaelChirico for this casting gives the same results (a column of class integer).

When trying to replace values, "missing values are not allowed in subscripted assignments of data frames"

I have a table that has two columns: whether you were sick (H01) and the number of days sick (H03). However, the number of days sick is NA if H01 == false, and I would like to set it to 0. When I do this:
test <- pe94.person[pe94.person$H01 == 12,]
test$H03 <- 0
It works fine. However, I'd like to replace the values in the original dataframe. This, however, fails:
pe94.person[pe94.person$H01 == 12,]$H03 <- 0
It returns:
> pe94.person[pe94.person$H01 == 12,]$H03 <- 0
Error in `[<-.data.frame`(`*tmp*`, pe94.person$H01 == 12, , value = list( :
missing values are not allowed in subscripted assignments of data frames
Any idea why this is? For what it's worth, here's a frequency table:
> table(pe94.person[pe94.person$H01 == 12,]$H03)
2 3 5 28
3 1 1 1
It is due to missingness in H01 variable.
> x <- data.frame(a=c(NA,2:5), b=c(1:5))
> x
a b
1 NA 1
2 2 2
3 3 3
4 4 4
5 5 5
> x[x$a==2,]$b <- 99
Error in `[<-.data.frame`(`*tmp*`, x$a == 1, , value = list(a = NA_integer_, :
missing values are not allowed in subscripted assignments of data frames
The assignment won't work because x$a has a missing value.
Subsetting first works:
> z <- x[x$a==2,]
> z$b <- 99
> z <- x[x$a==2,]
> z
a b
NA NA NA
2 2 2
But that's because the [<- function apparently can't handle missing values in its extraction indices, even though [ can:
> `[<-`(x,x$a==2,,99)
Error in `[<-.data.frame`(x, x$a == 2, , 99) :
missing values are not allowed in subscripted assignments of data frames
So instead, trying specifying your !is.na(x$a) part when you're doing the assignment:
> `[<-`(x,!is.na(x$a) & x$a==2,'b',99)
a b
1 NA 1
2 2 99
3 3 3
4 4 4
5 5 5
Or, more commonly:
> x[!is.na(x$a) & x$a==2,]$b <- 99
> x
a b
1 NA 1
2 2 99
3 3 3
4 4 4
5 5 5
Note that this behavior is described in the documentation:
The replacement methods can be used to add whole column(s) by specifying non-existent column(s), in which case the column(s) are added at the right-hand edge of the data frame and numerical indices must be contiguous to existing indices. On the other hand, rows can be added at any row after the current last row, and the columns will be in-filled with missing values. Missing values in the indices are not allowed for replacement.
You can use ifelse, like so
pe94.person$foo <- ifelse(!is.na(pe94.person$H01) & pe94.person$H01 == 12, 0, pe94.person$H03)
check if foo meets your criteria and then go ahead and assign it to pe94.person$H03 directly. I find it safer to assign it a new variable and usually use that in subsequent analysis.
There might be an NA somewhere in the column that is causing the error. Run the index on a specific column instead of the entire data frame.
movies[movies$Actors == "N/A",] = NA #ERROR
movies$Actors[movies$Actors == "N/A"] = NA #Works
I realise the question is very old, but I think the most elegant solution is by using the which() function:
pe94.person[which(pe94.person$H01 == 12),]$H03 <- 0
should do what the original poster asked for. Because which() drops the NAs and keeps the (positions of the) TRUE results only.
Simply use the subset() function to exclude all NA from the string.
It works as x[subset & !is.na(subset)]. Look at this data:
> x <- data.frame(a = c(T,F,T,F,NA,F,T, F, NA,NA,T,T,F),
> b = c(F,T,T,F,T, T,NA,NA,F, T, T,F,F))
Subsetting with [ operator returns this:
> x[x$b == T & x$a == F, ]
a b
2 FALSE TRUE
NA NA NA
6 FALSE TRUE
NA.1 NA NA
NA.2 NA NA
And subset() does what we want:
> subset(x, b == T & a == F)
a b
2 FALSE TRUE
6 FALSE TRUE
To change the values of subsetted variables:
> ss <- subset(x, b == T & a == F)
> x[rownames(ss), 'a'] <- T
> x[c(2,6), ]
a b
2 TRUE TRUE
6 TRUE TRUE
Following works. Watch out there is no comma in sub setting:
x <- data.frame(a=c(NA,2:5), b=c(1:5))
x$a[x$a==2] <- 99

Rbind with new columns and data.table

I need to add many large tables to an existing table, so I use rbind with the excellent package data.table. But some of the later tables have more columns than the original one (which need to be included). Is there an equivalent of rbind.fill for data.table?
library(data.table)
aa <- c(1,2,3)
bb <- c(2,3,4)
cc <- c(3,4,5)
dt.1 <- data.table(cbind(aa, bb))
dt.2 <- data.table(cbind(aa, bb, cc))
dt.11 <- rbind(dt.1, dt.1) # Works, but not what I need
dt.12 <- rbind(dt.1, dt.2) # What I need, doesn't work
dt.12 <- rbind.fill(dt.1, dt.2) # What I need, doesn't work either
I need to start rbinding before I have all tables, so no way to know what future new columns will be called. Missing data can be filled with NA.
Since v1.9.2, data.table's rbind function gained fill argument. From ?rbind.data.table documentation:
If TRUE fills missing columns with NAs. By default FALSE. When
TRUE, use.names has to be TRUE, and all items of the input list has to
have non-null column names.
Thus you can do (prior to approx v1.9.6):
data.table::rbind(dt.1, dt.2, fill=TRUE)
# aa bb cc
# 1: 1 2 NA
# 2: 2 3 NA
# 3: 3 4 NA
# 4: 1 2 3
# 5: 2 3 4
# 6: 3 4 5
UPDATE for v1.9.6:
This now works directly:
rbind(dt.1, dt.2, fill=TRUE)
# aa bb cc
# 1: 1 2 NA
# 2: 2 3 NA
# 3: 3 4 NA
# 4: 1 2 3
# 5: 2 3 4
# 6: 3 4 5
Here is an approach that will update the missing columns in
rbind.missing <- function(A, B) {
cols.A <- names(A)
cols.B <- names(B)
missing.A <- setdiff(cols.B,cols.A)
# check and define missing columns in A
if(length(missing.A) > 0L){
# .. means "look up one level"
class.missing.A <- lapply(B[, ..missing.A], class)
nas.A <- lapply(class.missing.A, as, object = NA)
A[,c(missing.A) := nas.A]
}
# check and define missing columns in B
missing.B <- setdiff(names(A), cols.B)
if(length(missing.B) > 0L){
class.missing.B <- lapply(A[, ..missing.B], class)
nas.B <- lapply(class.missing.B, as, object = NA)
B[,c(missing.B) := nas.B]
}
# reorder so they are the same
setcolorder(B, names(A))
rbind(A, B)
}
rbind.missing(dt.1,dt.2)
## aa bb cc
## 1: 1 2 NA
## 2: 2 3 NA
## 3: 3 4 NA
## 4: 1 2 3
## 5: 2 3 4
## 6: 3 4 5
This will not be efficient for many, or large data.tables, as it only works two at a time.
The answers are awesome, but looks like, there are some functions suggested here such as plyr::rbind.fill and gtools::smartbind which seemed to work perfectly for me.
the basic concept is to add missing columns in both directions: from the running master table
to the newTable and back the other way.
As #menl pointed out in the comments, simply assigning an NA is a problem, because that will
make the whole column of class logical.
One solution is to force all columns of a single type (ie as.numeric(NA)), but that is too restrictive.
Instead, we need to analyze each new column for its class. We can then use as(NA, cc) _(cc being the class)
as the vector that we will assign to a new column. We wrap this in an lapply statement on the RHS and use eval(columnName)
on the LHS to assign.
We can then wrap this in a function and use S3 methods so that we can simply call
rbindFill(A, B)
Below is the function.
rbindFill.data.table <- function(master, newTable) {
# Append newTable to master
# assign to Master
#-----------------#
# identify columns missing
colMisng <- setdiff(names(newTable), names(master))
# if there are no columns missing, move on to next part
if (!identical(colMisng, character(0))) {
# identify class of each
colMisng.cls <- sapply(colMisng, function(x) class(newTable[[x]]))
# assign to each column value of NA with appropriate class
master[ , eval(colMisng) := lapply(colMisng.cls, function(cc) as(NA, cc))]
}
# assign to newTable
#-----------------#
# identify columns missing
colMisng <- setdiff(names(master), names(newTable))
# if there are no columns missing, move on to next part
if (!identical(colMisng, character(0))) {
# identify class of each
colMisng.cls <- sapply(colMisng, function(x) class(master[[x]]))
# assign to each column value of NA with appropriate class
newTable[ , eval(colMisng) := lapply(colMisng.cls, function(cc) as(NA, cc))]
}
# reorder columns to avoid warning about ordering
#-----------------#
colOrdering <- colOrderingByOtherCol(newTable, names(master))
setcolorder(newTable, colOrdering)
# rbind them!
#-----------------#
rbind(master, newTable)
}
# implement generic function
rbindFill <- function(x, y, ...) UseMethod("rbindFill")
Example Usage:
# Sample Data:
#--------------------------------------------------#
A <- data.table(a=1:3, b=1:3, c=1:3)
A2 <- data.table(a=6:9, b=6:9, c=6:9)
B <- data.table(b=1:3, c=1:3, d=1:3, m=LETTERS[1:3])
C <- data.table(n=round(rnorm(3), 2), f=c(T, F, T), c=7:9)
#--------------------------------------------------#
# Four iterations of calling rbindFill
master <- rbindFill(A, B)
master <- rbindFill(master, A2)
master <- rbindFill(master, C)
# Results:
master
# a b c d m n f
# 1: 1 1 1 NA NA NA NA
# 2: 2 2 2 NA NA NA NA
# 3: 3 3 3 NA NA NA NA
# 4: NA 1 1 1 A NA NA
# 5: NA 2 2 2 B NA NA
# 6: NA 3 3 3 C NA NA
# 7: 6 6 6 NA NA NA NA
# 8: 7 7 7 NA NA NA NA
# 9: 8 8 8 NA NA NA NA
# 10: 9 9 9 NA NA NA NA
# 11: NA NA 7 NA NA 0.86 TRUE
# 12: NA NA 8 NA NA -1.15 FALSE
# 13: NA NA 9 NA NA 1.10 TRUE
Yet another way to insert the missing columns (with the correct type and NAs) is to merge() the first data.table A with an empty data.table A2[0] which has the structure of the second data.table. This saves the possibility to introduce bugs in user functions (I know merge() is more reliable than my own code ;)). Using mnel's tables from above, do something like the code below.
Also, using rbindlist() should be much faster when dealing with data.tables.
Define the tables (same as mnel's code above):
library(data.table)
A <- data.table(a=1:3, b=1:3, c=1:3)
A2 <- data.table(a=6:9, b=6:9, c=6:9)
B <- data.table(b=1:3, c=1:3, d=1:3, m=LETTERS[1:3])
C <- data.table(n=round(rnorm(3), 2), f=c(T, F, T), c=7:9)
Insert the missing variables in table A: (note the use of A2[0]
A <- merge(x=A, y=A2[0], by=intersect(names(A),names(A2)), all=TRUE)
Insert the missing columns in table A2:
A2 <- merge(x=A[0], y=A2, by=intersect(names(A),names(A2)), all=TRUE)
Now A and A2 should have the same columns, with the same types. Set the column order to match, just in case (possibly not needed, not sure if rbindlist() binds across column names or column positions):
setcolorder(A2, names(A))
DT.ALL <- rbindlist(l=list(A,A2))
DT.ALL
Repeat for the other tables... Maybe it would be better to put this into a function rather than repeat by hand...
DT.ALL <- merge(x=DT.ALL, y=B[0], by=intersect(names(DT.ALL), names(B)), all=TRUE)
B <- merge(x=DT.ALL[0], y=B, by=intersect(names(DT.ALL), names(B)), all=TRUE)
setcolorder(B, names(DT.ALL))
DT.ALL <- rbindlist(l=list(DT.ALL, B))
DT.ALL <- merge(x=DT.ALL, y=C[0], by=intersect(names(DT.ALL), names(C)), all=TRUE)
C <- merge(x=DT.ALL[0], y=C, by=intersect(names(DT.ALL), names(C)), all=TRUE)
setcolorder(C, names(DT.ALL))
DT.ALL <- rbindlist(l=list(DT.ALL, C))
DT.ALL
The result looks the same as mnels' output (except for the random numbers and the column order).
PS1: The original author does not say what to do if there are matching variables -- do we really want to do a rbind() or are we thinking of a merge()?
PS2: (Since I do not have enough reputation to comment) The gist of the question seems a duplicate of this question. Also important for the benchmarking of data.table vs. plyr with large datasets.

Resources