efficiently locf by groups in a single R data.table - FromLast

I am wondering how to efficiently locf by groups in a single R data.table from the last, i.e. filling in NA values backward from the last known value.
There is code in efficiently locf by groups in a single R data.table for the forward direction, but I am looking for the opposite direction. Any idea how to adjust that code?

A bit of a workaround, but anyway:
first, sort the data in reverse order, apply the code to replace the NAs, then sort back.
# arrange()/desc() come from dplyr; setorder(DT, -id) would do the same in data.table
DT <- arrange(DT, desc(id))
# TRUE wherever a new id group starts
id_change = DT[, c(TRUE, id[-1] != id[-.N])]
# carry the last non-NA value forward within each group (i.e. backward in the original order)
DT <- DT[, lapply(.SD, function(x) x[cummax(((!is.na(x)) | id_change) * .I)])]
# restore the original order
DT <- arrange(DT, id)
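Alternatively, on data.table 1.12.4 or later, nafill() with type = "nocb" (next observation carried backward) fills NAs backward directly and avoids the double sort, at least for numeric columns. A minimal sketch, with an assumed id/val layout:
library(data.table)
DT <- data.table(id  = c(1, 1, 1, 2, 2, 2),
                 val = c(NA, 5, NA, NA, 7, NA))
# fill NAs backward within each group; trailing NAs in a group stay NA
DT[, val := nafill(val, type = "nocb"), by = id]
DT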

Related

subset data.table by indexed column and rows

I am looking to subset a data table recursively, by changing the index of the column z and, at the same time, filtering rows based on some vector with %in%.
dt <- setDT(copy(diamonds))
dt <- setDT(data.frame(lapply(dt, as.character), stringsAsFactors=FALSE))
z=4
subset_by <- unique(dt[,z])[1:2]
### obviously does not work
###dt1<-dt[ z %in% subset_by]
I am looking for the most memory-efficient operation to do this, and I am sure there is a way without using colnames, but I just cannot find it. I looked at a lot of posts, with this being the most relevant.
If we are subsetting based on the index or names, we can specify it in .SDcols
i1 <- dt[, .I[.SD[[1]] %chin% subset_by], .SDcols = z]
dt[i1]
Note that to extract a column from a data.table/tbl_df/data_frame you would use either [[ or $:
subset_by <- unique(dt[[z]])[1:2]
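Putting the pieces together, a minimal end-to-end sketch (assuming ggplot2 is installed just to supply the diamonds data):
library(data.table)
library(ggplot2)  # only for the diamonds data set
dt <- setDT(data.frame(lapply(diamonds, as.character), stringsAsFactors = FALSE))
z <- 4                             # positional index of the column to filter on
subset_by <- unique(dt[[z]])[1:2]  # [[ returns the column as a character vector
# find the matching row indices via .SD/.SDcols, then subset once
i1 <- dt[, .I[.SD[[1]] %chin% subset_by], .SDcols = z]
dt1 <- dt[i1]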

data.table: transforming subset of columns with a function, row by row

How can one, having a data.table with mostly numeric values, transform just a subset of columns and put them back to the original data table? Generally, I don't want to add any summary statistic as a separate column, just exchange the transformed ones.
Assume we have a DT. It has 1 column with names and 10 columns with numeric values. I am interested in using the "scale" function of base R on each row of that data table, but only applied to those 10 numeric columns.
And to expand on this: what if I have a data table with more columns and need to use column names to tell scale which columns to apply the function to?
With regular data.frame I would just do:
df[,grep("keyword",colnames(df))] <- t(apply(df[,grep("keyword",colnames(df))],1,scale))
I know this looks cumbersome but always worked for me. However, I can't figure out a simple way to do it in data.tables.
I would imagine something like this would work for data.tables:
dt[,grep("keyword",colnames(dt)) := scale(grep("keyword",colnames(dt)),center=F)]
But it doesn't.
EDIT:
Another example of doing that, updating the columns with their per-row-scaled versions:
# dt is a data.table
dt[, grep("keyword", colnames(dt), value = TRUE) := as.data.table(t(apply(dt[, grep("keyword", colnames(dt)), with = FALSE], 1, scale)))]
Too bad it needs the "as.data.table" part inside, as the transposed value from the apply function is a matrix. Maybe data.table should automatically coerce matrices into data.tables upon updating of columns?
If what you need is really to scale by row, you can try doing it in 2 steps:
# compute mean/sd:
mean_sd <- DT[, .(mean(unlist(.SD)), sd(unlist(.SD))), by=1:nrow(DT), .SDcols=grep("keyword",colnames(DT))]
# scale
DT[, grep("keyword",colnames(DT), value=TRUE) := lapply(.SD, function(x) (x-mean_sd$V1)/mean_sd$V2), .SDcols=grep("keyword",colnames(DT))]
PART 1: The one line solution you requested:
# First, let's take a look at the data in the columns:
DT[, .SD, .SDcols = grep("keyword", colnames(DT))]
One-line Solution Version 1: Use magrittr and the pipe operator:
DT[, (grep("keyword", colnames(DT))) := (lapply(.SD, . %>% scale(., center = F))),
.SDcols = grep("corrupt", colnames(DT))]
One-line Solution Version 2: Explicitly defines the function for the lapply:
DT[, (grep("keyword", colnames(DT))) :=
(lapply(.SD, function(x){scale(x, center = F)})),
.SDcols = grep("corrupt", colnames(DT))]
Modification - If you want to do it by group, just add by =:
DT[, (grep("keyword", colnames(DT))) :=
     (lapply(.SD, function(x){scale(x, center = F)})),
   .SDcols = grep("keyword", colnames(DT)),
   by = Grouping.Variable]
You can verify:
# Verify that the columns have updated values:
DT[, .SD, .SDcols = grep("keyword", colnames(DT))]
PART 2: A Step-by-Step Solution: (more general and easier to follow)
The above solution works clearly for the narrow example given.
As a public service, I am posting this for anyone who is still searching for a way that:
feels a bit less condensed;
is easier to understand;
is more general, in the sense that you can apply any function you wish without having to compute the values into a separate data table first (which, n.b., does work perfectly here).
Here's the step-by-step way of doing the same:
Get the data into Data.Table format:
# You get a data.table called DT
DT <- as.data.table(df)
Then, Handle the Column Names:
# Get the vector of matching column names
# (value = TRUE returns the names themselves, which the paste() below needs)
Reference.Cols <- grep("keyword", colnames(df), value = TRUE)
# FOR PEOPLE who want to store both transformed and untransformed values.
# Create new column names
Reference.Cols.normalized <- Reference.Cols %>% paste(., ".normalized", sep = "")
Define the function you want to apply
#Define the function you wish to apply
# Where, normalize is just a function as defined in the question:
normalize <- function(X,
                      X.mean = mean(X, na.rm = TRUE),
                      X.sd   = sd(X, na.rm = TRUE)) {
  X <- (X - X.mean) / X.sd
  return(X)
}
After that, it is trivial in Data.Table syntax:
# Voila, the newly created set of columns that contain the transformed values:
DT[, (Reference.Cols.normalized) := lapply(.SD, normalize), .SDcols = Reference.Cols]
Verify:
New values are stored in the columns whose names are in Reference.Cols.normalized:
DT[, .SD, .SDcols = Reference.Cols.normalized]
Untransformed values are left unharmed:
DT[, .SD, .SDcols = Reference.Cols]
Hopefully, for those of you who return to look at code after some interval, this more step-by-step / general approach can be helpful.

Normalize values in a column to obtain overheads

I have the following data frame:
df <- data.frame(
Target=rep(LETTERS[1:3],each=8),
Prov=rep(letters[1:4],each=2),
B=rep("5MB"),
S=rep("1MB"),
BUF=rep("8kB"),
M=rep(c('g','p')),
Thr.mean=1:24)
whose column Thr.mean I would like to normalize by the values where Target=='C' (I don't mind attaching a new column).
To clarify, I would like to end up with:
Thr.mean <- c(1/17,2/18,3/19,4/20,5/21,6/22,7/23,8/24,9/17,10/18,11/19,12/20,13/21,14/22,15/23,16/24,1,1,1,1,1,1,1,1)
Now, it may happen that there are rows in this data frame, where Target!='C', and they have values in S or B that are not present in rows where Target=='C', and for these I would also like to calculate the overhead. The most important column for matching is M, then BUF, B, and S.
Any ideas how to do it? I could write several loops and ifs, but I'm looking for a more elegant solution.
For posterity, the way I solved my problem was by using data.table:
DT <- data.table(df)
DT[, Thr.Norm.C := .SD[Target=='C', Thr.mean], by = 'B,BUF,Prov']
DT[, over.thr := Thr.Norm.C/Thr.mean]
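A quick sanity check (the selected columns are just for display); note that over.thr as defined above is Thr.Norm.C / Thr.mean, i.e. the reciprocal of the ratios listed in the question:
# rows where Target == 'C' are normalized by themselves, so over.thr should be exactly 1
DT[Target == 'C', all(over.thr == 1)]
# inspect a few rows
head(DT[, .(Target, Prov, M, Thr.mean, Thr.Norm.C, over.thr)])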

Element wise mean for a list of dataframes with NA

I have a list of data frames x and I want to find the mean of each element across the data frames. I found an elegant solution online courtesy of Dimitris Rizopoulos.
x.mean = Reduce("+", x) / length(x)
However this doesn't really work when the data frames contain NA. Is there a good way to accomplish this?
Here is an approach that uses data.table.
The steps are: (1) coerce each data.frame element in x to a data.table, with a column (called rn) holding the row names; (2) on the stacked data.table, compute the mean of each column by row name (with na.rm = TRUE dealing with the NA values); (3) remove the rn column.
library(data.table)
results <- rbindlist(lapply(x,data.table, keep.rownames = TRUE))[,
lapply(.SD, mean,na.rm = TRUE),by=rn][,rn := NULL]
An alternative would be to coerce to matrix, "simplify" to a 3-dimensional array, then apply a mean over the appropriate margins:
# for example
results <- as.data.frame(apply(simplify2array(lapply(x, as.matrix)),1:2,mean, na.rm = TRUE))
I like @mnel's solution better, but as an educational exercise here's how you can modify your expression to work with NA values while keeping the same type of logic:
Reduce(function(y,z) {y[is.na(y)] <- 0; z[is.na(z)] <- 0; y + z}, x) /
Reduce('+', lapply(x, function(y) !is.na(y)))
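A small illustrative check (the data below is made up) that the three approaches agree when NAs are present:
library(data.table)
x <- list(data.frame(a = c(1, NA, 3), b = c(4, 5, NA)),
          data.frame(a = c(7, 8, NA), b = c(NA, 1, 2)))
# data.table approach
rbindlist(lapply(x, data.table, keep.rownames = TRUE))[, lapply(.SD, mean, na.rm = TRUE), by = rn][, rn := NULL]
# 3-d array approach
apply(simplify2array(lapply(x, as.matrix)), 1:2, mean, na.rm = TRUE)
# Reduce approach
Reduce(function(y, z) {y[is.na(y)] <- 0; z[is.na(z)] <- 0; y + z}, x) /
  Reduce('+', lapply(x, function(y) !is.na(y)))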

data.frame: create column by applying a function to groups of rows

I have a data frame consisting of results from multiple runs of an experiment, each of which serves as a log, with its own ascending counter. I'd like to add another column to the data frame that has the maximum value of iteration for each distinct value of experiment.num in the sample below:
df <- data.frame(
iteration = rep(1:5,5),
experiment.num = c(rep(1,5),rep(2,5),rep(3,5),rep(4,5),rep(5,5)),
some.val=42,
another.val=12
)
In this example, the extra column would look like this (as all the subsets have the same maximum for iteration):
df$max <- rep(5,25)
The naive solution I currently use is:
df$max <- sapply(df$experiment.num,function(exp.num) max(df$iteration[df$experiment.num == exp.num]))
I've also used sapply(unique(df$experiment.num), function(n) c(n,max(df$iteration[df$experiment.num==n]))) to build another frame which I can then merge with the original, but both of these approaches seem more complicated than necessary.
The experiment.num column is a factor, so I think I might be able to exploit that to avoid iteratively doing this naive subsetting for all rows.
Is there a better way to get a column of maximum values for subsets of a data.frame?
Using plyr:
ddply(df, .(experiment.num), transform, max = max(iteration))
Using ave in base R:
df$i_max <- with(df, ave(iteration, experiment.num, FUN=max))
Here's a way in base R:
within(df[order(df$experiment.num), ],
max <- rep(tapply(iteration, experiment.num, max),
rle(experiment.num)$lengths))
I think you can use data.table:
install.packages("data.table")
library("data.table")
dt <- data.table(df) # make your data frame into a data table
dt[, pgIndexBY := .BY, by = list(experiment.num)] #this will add a new column to your data table called pgIndexBY
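The snippet above only tags each row with its grouping value; for the per-group maximum that the question asks for, the usual data.table idiom is an update by reference with by (the column name max.iter is just illustrative):
dt <- data.table(df)
# per-group maximum of iteration, added by reference
dt[, max.iter := max(iteration), by = experiment.num]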
