subset data.table by indexed column and rows - R

I am looking to subset a data.table programmatically: select a column by its numeric index z (which changes, e.g., inside a loop) AND at the same time filter rows with an %in%-based vector.
library(data.table)
library(ggplot2)  # for the diamonds data set
dt <- setDT(copy(diamonds))
dt <- setDT(data.frame(lapply(dt, as.character), stringsAsFactors = FALSE))  # all columns to character
z <- 4
subset_by <- unique(dt[,z])[1:2]
### obviously does not work
###dt1<-dt[ z %in% subset_by]
I am looking for the most memory-efficient operation to do this, and I am sure there is a way without using colnames, but I just cannot find it. I looked at a lot of posts, with this being the most relevant.

If we are subsetting based on the index or names, we can specify it in .SDcols:
i1 <- dt[, .I[.SD[[1]] %chin% subset_by], .SDcols = z]  # row numbers where column z matches
dt[i1]
Note that extracting a single column from a data.table/tbl_df/data.frame as a vector is done with either [[ or $:
subset_by <- unique(dt[[z]])[1:2]
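Putting the two pieces together, a compact equivalent (a sketch; %chin% applies here because every column was converted to character, and dt1 is just an assumed name for the result):
subset_by <- unique(dt[[z]])[1:2]    # first two unique values of column z, as a character vector
dt1 <- dt[dt[[z]] %chin% subset_by]  # filter rows without ever naming the column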

Related

efficiently locf by groups in a single R data.table - FromLast

I am wondering how to efficiently locf by groups in a single R data.table from the last, i.e. filling in NA values backward from the last known value.
There is code in efficiently locf by groups in a single R data.table for the forward direction, but I am looking for the opposite direction. Any idea how to adjust the code?
A bit of a workaround, but anyway:
first reverse the row order, apply the forward-fill code to replace the NAs, then restore the original order.
DT <- DT[.N:1]  # reverse the row order (a stable desc-sort by id would keep within-group order, so plain reversal is used)
id_change = DT[, c(TRUE, id[-1] != id[-.N])]  # TRUE at the first row of each id block
DT <- DT[, lapply(.SD, function(x) x[cummax(((!is.na(x)) | id_change) * .I)])]  # the forward-fill (LOCF) step
DT <- DT[.N:1]  # restore the original row order
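A minimal check on hypothetical data (the column names id and value are assumptions):
library(data.table)
DT <- data.table(id = c(1, 1, 1, 2, 2), value = c(NA, 5, NA, NA, 7))
DT <- DT[.N:1]
id_change <- DT[, c(TRUE, id[-1] != id[-.N])]
DT <- DT[, lapply(.SD, function(x) x[cummax(((!is.na(x)) | id_change) * .I)])]
DT <- DT[.N:1]
DT$value  # 5 5 NA 7 7 -- each NA now takes the next known value within its own id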

How to subset a data.table in i (eg finding NAs) based on a character vector of column names

This should be easy but Google and I are failing. Say I have this data:
library(data.table)
mydata <- data.table(a = c(1, NA),
                     b = c(NA, NA),
                     pointer = c(1, 2))
and I want to get the rows where both a and b are NA. Of course I can do this manually like:
mydata[is.na(a) & is.na(b)]
but the issue arises deep in other code and I want to do this based on a character vector (or list, or whatever, this is flexible) of the column names such as:
myvector <- c("a","b")
again I can do this manually if I know how many elements the vector has:
mydata[is.na(get(myvector[1])) & is.na(get(myvector[2]))]
But I don't know how many elements myvector has in my application. How can I do this without specifying the number of entries in myvector? Essentially, I'm looking for something like with = F but for i in data.table. So I want to use myvector like this:
mydata[is.na(somefunction(myvector))]
I tried all kinds of paste0(myvector, collapse = " & ") combinations with get() or as.formula(), but it is getting me nowhere.
We can specify the vector of column names in .SDcols, loop over the .SD (Subset of Data.table) to create a list of logical vectors with is.na, Reduce that list to a single logical vector with & (combining the corresponding elements of each column), and use it to subset the rows of the data:
library(data.table)
mydata[mydata[, Reduce(`&`, lapply(.SD, is.na)), .SDcols = myvector]]
Output:
#     a  b pointer
# 1: NA NA       2
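To see the intermediate step, the inner call builds one logical vector per column, which Reduce then collapses row-wise (a sketch):
mydata[, lapply(.SD, is.na), .SDcols = myvector]               # two logical columns: a = FALSE TRUE, b = TRUE TRUE
mydata[, Reduce(`&`, lapply(.SD, is.na)), .SDcols = myvector]  # FALSE TRUE -- only row 2 is NA in both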
Or use mget
mydata[mydata[, Reduce(`&`, lapply(mget(myvector), is.na))]]
Here is another solution assuming that myvector is a character vector:
library(data.table)
mydata[rowSums(!is.na(mydata[, ..myvector])) == 0]  # 0 non-NA values across myvector's columns means all are NA

Assign a vector to a specific existing row of data table in R

I've been looking through tutorials and documentation, but have not figured out how to assign a vector of values for all columns to one existing row in a data.table.
I start with an empty data.table that already has the correct number of columns and rows:
dt <- data.table(matrix(nrow=10, ncol=5))
Now I calculate some values for one row outside of the data.table and place them in a vector vec, e.g.:
vec <- rnorm(5)
How could I assign the values of vec to, e.g., the first row of the data.table while achieving good performance (since I also want to fill the other rows step by step)?
First you need to get the correct column types, as the NA matrix you've created is logical. The column types won't be magically changed by assigning numerics to them.
dt[, names(dt) := lapply(.SD, as.numeric)]  # convert every column to numeric by reference
Then you can change the first row's values with
dt[1, names(dt) := as.list(vec)]  # one list element per column
That said, if you begin with a numeric matrix you wouldn't have to change the column types.
dt <- data.table(matrix(numeric(), 10, 5))
dt[1, names(dt) := as.list(vec)]
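Since the remaining rows are filled step by step, repeated single-row assignment is where data.table's set() helps: it skips the overhead of [.data.table on every iteration. A sketch of that pattern:
dt <- data.table(matrix(numeric(), 10, 5))
for (i in 1:10) {
  vec <- rnorm(5)
  set(dt, i = i, j = names(dt), value = as.list(vec))  # write row i by reference
}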

Subsetting efficiently on multiple columns and rows

I am trying to subset my data to drop rows with certain values of certain variables. Suppose I have a data frame df with many columns and rows. I want to drop rows based on the values of variables G1 and G9, keeping only rows where those variables take on values of 1, 2, or 3. In this way, I aim to subset on the same values across multiple variables.
I am trying to do this with few lines of code and in a manner that allows quick changes to the variables or values I would like to use. For example, assuming I start with data frame df and want to end with newdf, which excludes observations where G1 and G9 do not take on values of 1, 2, or 3:
# Naive approach (requires manually changing variables and values in each line of code)
newdf <- df[which(df$G1 %in% c(1,2,3)), ]
newdf <- newdf[which(newdf$G9 %in% c(1,2,3)), ]
# Better approach (requires manually changing variable names in each line of code)
vals <- c(1,2,3)
newdf <- df[which(df$G1 %in% vals), ]
newdf <- newdf[which(newdf$G9 %in% vals), ]
If I wanted to subset not only on G1 and G9 but on MANY variables, this manual approach would be time-consuming to modify. I want to simplify this even further by consolidating all of the code into a single line. I know the below is wrong but I am not sure how to implement an alternative.
vals <- c(1,2,3)
vars <- c(df$G1, df$G9)
newdf <- df[which(df$vars %in% vals), ]
It is my understanding I want to use apply() but I am not sure how.
You do not need which with %in%; it already returns logical values that can index rows directly. How about the below:
keepies <- (df$G1 %in% vals) & (df$G9 %in% vals)
newdf <- df[keepies, ]
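If the set of variables grows, the same logic extends to any character vector of column names (a sketch; vars is an assumed name):
vars <- c("G1", "G9")
keepies <- Reduce(`&`, lapply(df[vars], function(col) col %in% vals))
newdf <- df[keepies, ]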
Use data.table
First, melt your data
library(data.table)
DT <- melt(as.data.table(df))  # melt() needs a data.table; with no id.vars every column is stacked into variable/value
Then split into a list of sub-tables, one per original column
DTLists <- split(DT, DT$variable)
Now you can operate on the list elements using lapply
DTresult <- lapply(DTLists, function(x) {
...
})
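As a hypothetical completion of the ... placeholder, each melted sub-table could be filtered against the values of interest:
DTresult <- lapply(DTLists, function(x) {
    x[value %in% c(1, 2, 3)]  # hypothetical filter on the melted value column
})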

Element-wise mean for a list of data frames with NA

I have a list of data frames x and I want to find the mean of each element across the data frames. I found an elegant solution online courtesy of Dimitris Rizopoulos.
x.mean = Reduce("+", x) / length(x)
However this doesn't really work when the data frames contain NA. Is there a good way to accomplish this?
Here is an approach that uses data.table
The steps are: (1) coerce each data.frame element of x to a data.table, with a column (called rn) holding the rownames; (2) on the large stacked data.table, calculate the mean of each column by rowname (with na.rm = TRUE handling the NA values); (3) remove the rn column.
library(data.table)
results <- rbindlist(lapply(x, data.table, keep.rownames = TRUE))[,
    lapply(.SD, mean, na.rm = TRUE), by = rn][, rn := NULL]
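For instance, with a hypothetical list of two data frames that share rownames:
x <- list(data.frame(a = c(1, NA), row.names = c("r1", "r2")),
          data.frame(a = c(3, 5), row.names = c("r1", "r2")))
rbindlist(lapply(x, data.table, keep.rownames = TRUE))[,
    lapply(.SD, mean, na.rm = TRUE), by = rn][, rn := NULL]
#    a
# 1: 2
# 2: 5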
An alternative would be to coerce each element to a matrix, simplify2array the results into a 3-dimensional array, then apply mean over the first two margins:
results <- as.data.frame(apply(simplify2array(lapply(x, as.matrix)), 1:2, mean, na.rm = TRUE))
I like @mnel's solution better, but as an educational exercise, here's how you can modify your expression to work with NA values while keeping the same type of logic: treat NAs as 0 in the running sum, then divide by the per-cell count of non-NA values.
Reduce(function(y, z) { y[is.na(y)] <- 0; z[is.na(z)] <- 0; y + z }, x) /
    Reduce(`+`, lapply(x, function(y) !is.na(y)))
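Applied to the same hypothetical x as above, this reproduces the data.table result (note that a cell which is NA in every frame would come out as NaN, since 0 is divided by 0):
Reduce(function(y, z) { y[is.na(y)] <- 0; z[is.na(z)] <- 0; y + z }, x) /
    Reduce(`+`, lapply(x, function(y) !is.na(y)))
#    a
# r1 2
# r2 5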
