Remove data.table rows whose vector elements contain nested NAs

I need to remove from a data.table any row in which column a contains any NA nested in a vector:
library(data.table)
a <- list(as.numeric(c(NA, NA)), 2, as.numeric(c(3, NA)), c(4, 5))
b <- 11:14
dt <- data.table(a,b)
Thus, rows 1 and 3 should be removed.
I tried three solutions without success:
dt1 <- dt[!is.na(a)]
dt2 <- dt[!is.na(unlist(a))]
dt3 <- dt[dt[,!Reduce(`&`, lapply(a, is.na))]]
Any ideas? Thank you.

You can do the following:
dt[sapply(dt$a, \(l) !any(is.na(l)))]
This alternative also works, but you will get warnings:
dt[sapply(dt$a, all)]
A better approach (thanks to r2evans, see comments):
dt[!sapply(a,anyNA)]
Output:
a b
1: 2 12
2: 4,5 14
A third option you might prefer: move the functionality into a separate helper function that takes a list of vectors (nl) and returns a logical vector of length length(nl), then apply that function as below. In this example I explicitly call unlist() on the result of lapply() rather than letting sapply() simplify for me, but sapply() would also have worked.
f <- \(nl) unlist(lapply(nl,\(l) !any(is.na(l))))
dt[f(a)]
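As an aside (not part of the original answers), vapply() gives the same result as the sapply() variants, with an explicit guarantee about the return type:
dt[!vapply(a, anyNA, logical(1L))]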

An alternative to *apply()
dt[, .SD[!anyNA(a, TRUE)], by = .I][, !"I"]
# a b
# <list> <int>
# 1: 2 12
# 2: 4,5 14

Related

Expressively select rows in a data.table which match rows from another data.table

Given two data tables (tbl_A and tbl_B), I would like to select all the rows in tbl_A which have matching rows in tbl_B, and I would like the code to be expressive. If the %in% operator were defined for data.tables, something like this would be ideal:
subset <- tbl_A[tbl_A %in% tbl_B]
I can think of many ways to accomplish what I want such as:
# double negation (set differences)
subset <- tbl_A[!tbl_A[!tbl_B,1,keyby=a]]
# nomatch with keyby and this annoying `[,V1:=NULL]` bit
subset <- tbl_B[,1,keyby=.(a=x)][,V1:=NULL][tbl_A,nomatch=0L]
# nomatch with !duplicated() and setnames()
subset <- tbl_B[!duplicated(tbl_B),.(x)][tbl_A,nomatch=0L]; setnames(subset,"x","a")
# nomatch with !unique() and setnames()
subset <- unique(tbl_B)[,.(x)][tbl_A,nomatch=0L]; setnames(subset,"x","a")
# use of a temporary variable (Thanks #Frank)
subset <- tbl_A[, found := FALSE][tbl_B, found := TRUE][(found)][,found:=NULL][]
but each expression is difficult to read and it's not obvious at first glance what the code is doing. Is there a more idiomatic / expressive way of accomplishing this task?
For purposes of example, here are some toy data.tables:
# toy tables
tbl_A <- data.table(a = letters[1:5],
                    b = 1:5,
                    c = rnorm(5))
tbl_B <- data.table(x = letters[3:7],
                    y = 13:17,
                    z = rnorm(5))
# both tables might have multiple rows with the same key fields.
tbl_A <- rbind(tbl_A,tbl_A)
tbl_B <- rbind(tbl_B,tbl_B)
setkey(tbl_A,a)
setkey(tbl_B,x)
and an expected result containing the rows in tbl_A which match at least one row in tbl_B:
a b c
1: c 3 -0.5403072
2: c 3 -0.5403072
3: d 4 -1.3353621
4: d 4 -1.3353621
5: e 5 1.1811730
6: e 5 1.1811730
Adding two more options:
tbl_A[fintersect(tbl_A[,.(a)], tbl_B[,.(a=x)])]
and
tbl_A[unique(tbl_A[tbl_B, nomatch=0L, which=TRUE])]
I'm not sure how expressive it is (apologies if not) but this seems to work:
tbl_A[,.(a,b,c,any(a == tbl_B[,x])), by = a][V4==TRUE,.(a,b,c)]
I'm sure it can be improved; I only found out about any() yesterday and am still testing it :)
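For reference (an addition, not one of the original answers), when the match is on a single key column a plain %in% filter also reads naturally as a semi-join:
subset <- tbl_A[a %in% tbl_B$x]
This keeps every row of tbl_A whose a value appears anywhere in tbl_B$x, including duplicates, which matches the expected output above.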

Using a data.table function in lapply on a list with data.frame elements (Answer = setDT)

This is my first question; please let me know in the comments if more info or background is needed.
Many answers on here and elsewhere deal with calling lapply inside a data.table expression. I want to do the opposite, which on paper should be as easy as lapply(list.of.dfs, function(x) x), but I can't get it to work with data.table functions.
I have a list that contains several data.frames with the same columns but differing numbers of rows. This comes from the output of several simulation scenarios, so they must be treated separately and not rbind'ed.
#sample list of data.frames
scenarios <- replicate(5, data.frame(a = sample(letters[1:4], 10, TRUE),
                                     b = sample(1:2, 10, TRUE),
                                     x = sample(1:10, 10),
                                     y = runif(10)), simplify = FALSE)
I want to add a column to every element that is the sum of x/y by a and b.
From the examples section of the data.table documentation, the process to do this for one data.frame is the following (search for "add new column by reference by group" on the doc page):
test <- as.data.table(scenarios[[1]]) #must specify data.table class
test[, newcol := sum(x/y), by = .(a , b)][]
I want to use lapply to do the same thing to every element in the scenarios list and return the list.
My most recent attempt:
lapply(scenarios, function(i) {as.data.table(i[, z := sum(x/y), by=.(a,b)]); i})
but I keep getting the error unused argument (by = .(a, b))
After poring over the results of this and other sites, I have been unable to solve this problem, which I'm fairly sure means there is something I don't understand about calling anonymous functions and/or using data.table functions. Is this one of those cases where you use [ as the function? Or possibly my as.data.table is out of place.
This answer was a step in the right direction (I think); it covers the use of function(x) {...; x} to use an anonymous function and return x.
Thanks!
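An aside on why the attempt fails (not from the original post): inside the anonymous function, i is still a plain data.frame, so the call dispatches to [.data.frame, which has no by argument; the := / by = syntax only exists in [.data.table. A minimal reproduction:
df <- data.frame(a = 1, x = 1, y = 1)
try(df[, z := sum(x/y), by = .(a)])
# Error ... : unused argument (by = .(a))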
You can use setDT here instead.
scenarios <- lapply(scenarios, function(i) setDT(i)[, z := sum(x/y), by=.(a,b)])
scenarios[[1]]
a b x y z
1: c 2 2 0.87002174 2.298793
2: b 2 10 0.19720775 78.611837
3: b 2 8 0.47041670 78.611837
4: b 2 4 0.36705023 78.611837
5: a 1 5 0.78922686 12.774035
6: a 1 6 0.93186209 12.774035
7: b 1 3 0.83118438 3.609307
8: c 1 1 0.08248658 30.047494
9: c 1 7 0.89382050 30.047494
10: c 1 9 0.89172831 30.047494
Using as.data.table, the syntax would be
scenarios <- lapply(scenarios, function(i) {
  i <- as.data.table(i)
  i[, z := sum(x/y), by = .(a, b)]
})
but this wouldn't be recommended as it will create an additional copy, which is avoided by setDT.
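A small illustration of that difference (a sketch, not from the original answer): as.data.table() returns a converted copy and leaves its input untouched, while setDT() converts the object it is given by reference:
df <- data.frame(x = 1:3)
dt <- as.data.table(df)  # dt is a new data.table; df is still a data.frame
class(df)
# [1] "data.frame"
setDT(df)                # converts df itself; no assignment needed
class(df)
# [1] "data.table" "data.frame"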

Function which uses different dataframes with same column names

I have two dataframes which look like follows:
df1 <- data.frame(V1 = 1:4, V2 = rep(2, 4), V3 = 7:4)
df2 <- data.frame(V2 = rep(NA, 4), V1 = rep(NA, 4), V3 = rep(NA, 4))
I need to write a function which assigns the values of df1 to df2, if the column names of both dataframes are the same. The structure of the function should look like this:
fun <- function(x){
  if(# If the name of x is the same as the name of a column in df1)
    out <- df1$? # Here I need to assign df1$"x" somehow
  out
}
fun(df2$V1)
The output should look like this:
[1] 1 2 3 4
Unfortunately I couldn't find a solution by myself. Is there a way I could do this? Thank you very much in advance!
I need to write a function which assigns the values of df1 to df2, if the column names of both dataframes are the same.
Are you sure you need a function?
names_in_common <- intersect(names(df1),names(df2))
df2[,names_in_common] <- df1[,names_in_common]
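A quick check of the result (the values follow deterministically from df1 as defined above):
df2
#   V2 V1 V3
# 1  2  1  7
# 2  2  2  6
# 3  2  3  5
# 4  2  4  4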
Using Joachim Schork's code:
names_in_common <- intersect(names(df1),names(df2))
df2[,names_in_common] <- df1[,names_in_common]
and if you want to change a single column of df2:
names_in_common <- intersect(names(df1), names(df2[, "V1", drop=FALSE]))
df2[,names_in_common] <- df1[,names_in_common]
This is impossible, because when you access a column of a data.frame using the dollar syntax you lose the column name. There's no way for fun() to determine the column name of the vector that was passed in as an argument.
Instead, you can simply call fun() using the column name itself as the argument, rather than the vector of NAs, which are not useful and not used at all inside the function. In other words, the call becomes
fun('V1');
Then you can write the function as follows:
fun <- function(name) df1[[name]];
Demo:
fun('V1');
## [1] 1 2 3 4
Although now that I think about it, you might as well just index df1 directly, since that's all the function does now:
df1$V1;
## [1] 1 2 3 4
Rereading your question, you said you want to assign the column from df1 to df2, although your example code doesn't do that. Assuming you did want to carry out this assignment inside the function, you could do this:
fun <- function(name) df2[[name]] <<- df1[[name]];
Demo:
fun('V1');
df2;
## V2 V1 V3
## 1 NA 1 NA
## 2 NA 2 NA
## 3 NA 3 NA
## 4 NA 4 NA
This makes use of the superassignment operator <<-.
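If you would rather avoid <<-, an alternative (a sketch, not from the original answer) is to pass both data frames in and return the modified one:
fun2 <- function(name, from, to) {
  to[[name]] <- from[[name]]  # copy the matching column by name
  to
}
df2 <- fun2("V1", df1, df2)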

Create a list from a dataframe in R

Consider the following dataframe:
test.df <- data.frame(a = c("1991-01-01","1991-01-01","1991-02-01","1991-02-01"), b = rnorm(4), c = rnorm(4))
I would like to create a list from test.df. Each element of the list would be a subset dataframe of test.df corresponding to a specific value of column a, i.e. each date. In this case, column a takes the unique values 1991-01-01 and 1991-02-01, so the resulting list would have two elements: the subset of test.df when a = 1991-01-01 (excluding column a), and the subset of test.df when a = 1991-02-01 (excluding column a). Here is the output I am looking for:
lst <- list(test.df[1:2,2:3], test.df[3:4,2:3])
Note that the subset dataframes may not have the same number of rows.
In my real practical example, column a is a date column with many more values.
I would appreciate any help! Thanks a lot!
You can use split:
lst <- split(test.df, test.df$a)
If you want to get rid of column a, use split(test.df[-1], test.df$a) (thanks to #akrun for comment).
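A quick check of the structure (illustrative; b and c are random draws, so only names and row counts are shown):
lst <- split(test.df[-1], test.df$a)
names(lst)
# [1] "1991-01-01" "1991-02-01"
sapply(lst, nrow)
# 1991-01-01 1991-02-01
#          2          2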
You can use the following code:
sapply(union(test.df$a,NULL), function(y,x) x[x$a==y,], x=test.df, simplify=FALSE)
You could also use the dlply function in the plyr package:
> library(plyr)
> dlply(test.df, .(a))
$`1991-01-01`
a b c
1 1991-01-01 1.3658775 0.9805356
2 1991-01-01 -0.2292211 2.2812914
$`1991-02-01`
a b c
1 1991-02-01 -0.2678131 0.5323250
2 1991-02-01 0.3736910 0.4988308
Or the data.table package:
> library(data.table)
> setDT(test.df)
> dt <- test.df[, list(list(.SD)), by = a]$V1
> names(dt) <- unique(test.df$a)
> dt
$`1991-01-01`
b c
1: 1.3658775 0.9805356
2: -0.2292211 2.2812914
$`1991-02-01`
b c
1: -0.2678131 0.5323250
2: 0.3736910 0.4988308
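data.table also ships its own split method, split.data.table() (available in recent versions, roughly 1.9.8 onwards), which can drop the grouping column directly; a sketch assuming test.df has already been converted with setDT() as above:
lst <- split(test.df, by = "a", keep.by = FALSE)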

Retain Vector Names as Dataframe Column Names

In my code, I am filling the columns of a dataframe with vectors, as so:
df1[columnNum] <- barWidth
This works fine, except for one thing: I want the name of the vector variable (barWidth above) to be retained as the column header, one column at a time. Furthermore, I do not wish to use cbind, which slows the execution of my code down considerably; consequently, I am using a pre-allocated dataframe.
Can this be done in the vector-to-column assignment? If not, then how do I change it after the fact? I can't find the right syntax to do this with colnames().
TIA
It's being done by the [<-.data.frame function. It could conceivably be replaced by one that looked at the name of the argument but it's such a fundamental function I would be hesitant. Furthermore there appears to be an aversion to that practice signaled by this code at the top of the function definition:
> `[<-.data.frame`
function (x, i, j, value)
{
if (!all(names(sys.call()) %in% c("", "value")))
warning("named arguments are discouraged")
nA <- nargs()
if (nA == 4L) {
<snipped rest of rather long definition>
I don't know why that is there, but it is. Maybe you should either be thinking about using names<- after the column assignment, or using this method:
> dfrm["barWidth"] <- barWidth
> dfrm
a V2 barWidth
1 a 1 1
2 b 2 2
3 c 3 3
4 d 4 4
This can be generalized to a list of new columns:
dfrm <- data.frame(a=letters[1:4])
barWidth <- 1:4
newcols <- list(barWidth=barWidth, bw2 =barWidth)
dfrm[names(newcols)] <- newcols
dfrm
#
a barWidth bw2
1 a 1 1
2 b 2 2
3 c 3 3
4 d 4 4
If you have the list of names of the vectors you want to apply, you could do:
namevec <- c(..., "barWidth", ...)
columnNums <- c(...,10,...)
df1[columnNums[i]] <- get(namevec[i])
names(df1)[columnNums[i]] <- namevec[i]
or even
columnNums <- c(barWidth=4,...)
for (i in seq_along(columnNums)) {
  df1[columnNums[i]] <- get(names(columnNums)[i])
}
names(df1)[columnNums] <- names(columnNums)
but the deeper question would be where this set of vectors is coming from in the first place: could you have them in a list all along?
I'd simply use cbind():
df1 <- cbind( df1, barWidth )
which retains the name. It will, however, end up as the last column of df1.
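As a closing aside (not from the original answers), the "change it after the fact" part of the question can be handled by assigning into names() directly; a minimal sketch reusing the question's df1, columnNum and barWidth:
df1[columnNum] <- barWidth
names(df1)[columnNum] <- "barWidth"  # set the header after the assignment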
