data.table: replace data using values from another data.table, conditionally

This is similar to Update values in data.table with values from another data.table and R data.table replacing an index of values from another data.table, except in my situation the number of variables is very large so I do not want to list them explicitly.
What I have is a large data.table (let's call it dt_original) and a smaller data.table (let's call it dt_newdata) whose IDs are a subset of the first and it has only some of the variables of the first. I would like to update the values in dt_original with the values from dt_newdata. For an added twist, I only want to update the values conditionally - in this case, only if the values in dt_newdata are larger than the corresponding values in dt_original.
For a reproducible example, here are the data. In the real world the tables are much larger:
library(data.table)
set.seed(0)
## This data.table with 20 rows and many variables is the existing data set
dt_original <- data.table(id = 1:20)
setkey(dt_original, id)
for(i in 2015:2017) {
  varA <- paste0('varA_', i)
  varB <- paste0('varB_', i)
  varC <- paste0('varC_', i)
  dt_original[, (varA) := rnorm(20)]
  dt_original[, (varB) := rnorm(20)]
  dt_original[, (varC) := rnorm(20)]
}
## This table with a strict subset of IDs from dt_original and only a part of
## the variables is our potential replacement data
dt_newdata <- data.table(id = sample(1:20, 3))
setkey(dt_newdata, id)
newdata_vars <- sample(names(dt_original)[-1], 4)
for(var in newdata_vars) {
  dt_newdata[, (var) := rnorm(3)]
}
Here is a way of doing it using a loop and pmax, but there has to be a better way, right?
for(var in newdata_vars) {
  k <- pmax(dt_newdata[, (var), with = FALSE],
            dt_original[id %in% dt_newdata$id, (var), with = FALSE])
  dt_original[id %in% dt_newdata$id, (var) := k]
}
It seems like there should be a way using join syntax, and maybe the prefix i. and/or .SD or something like that, but nothing I've tried comes close enough to warrant repeating here.

This code should work given your criteria.
dt_original[dt_newdata, names(dt_newdata) := Map(pmax, mget(names(dt_newdata)), dt_newdata)]
It joins on the IDs that match between the data.tables and then performs an assignment using :=. Because we want to return a list, I use Map to run pmax pairwise over the columns, matched by the names of dt_newdata. Note that it is necessary that all names of dt_newdata are present in dt_original.
Following Frank's comment, you can drop the first element of both the Map arguments and the column names with [-1], because the ID column doesn't need to be computed. Skipping the ID column avoids one pass of pmax and also preserves the key on id. Thanks to @brian-stamper for pointing out the key preservation in the comments.
dt_original[dt_newdata,
            names(dt_newdata)[-1] := Map(pmax,
                                         mget(names(dt_newdata)[-1]),
                                         dt_newdata[, .SD, .SDcols = -1])]
Note that the use of [-1] assumes that the ID variable is located in the first position of dt_newdata. If it is elsewhere, you could change the index manually or use grep, as in the sketch below.
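For illustration, here is a position-independent sketch of the same update (my variant, not from the original answer); it assumes the ID column is named id, as in the example, and selects the value columns by name instead of by position:
## Same join-update, but computing the non-ID column names up front
val_cols <- setdiff(names(dt_newdata), "id")
dt_original[dt_newdata,
            (val_cols) := Map(pmax,
                              mget(val_cols),
                              dt_newdata[, .SD, .SDcols = val_cols])]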

Understand the meaning of [..., with=F][[1]]

I am doing a practice exercise based on the problems and solutions for data.table in R. The problem was: get the row and column positions of missing values in a data.table. The solution code uses [..., with=F][[1]]. I am not understanding this part of the code, and I am hoping an expert opinion can make the concept clear for me.
for(i in 1:NROW(DT)){
  for(j in 1:NCOL(DT)){
    curr_value <- DT[i, j, with = F][[1]]
    # ... (rest of the loop body omitted)
  }
}
I can understand the first two lines, but I am not understanding the ,with=F and then the [[1]].
What is the meaning of with=F, and why is [[1]] used after it? Why is there a double bracket with the 1?
Generally in data.table, with = FALSE allows you to select columns named in a variable.
Consider the following minimal example,
library(data.table)
dt <- data.table(mtcars)
Let's select the following columns from dt
cols <- c(1, 7)
The following command will not select those columns; since j is evaluated within the frame of the data.table by default, it simply returns the vector cols itself
dt[, cols]
Instead, you can use with = F
dt[, cols, with = F]
From ?data.table
When with=TRUE (default), j is evaluated within the frame of the data.table;
i.e., it sees column names as if they are variables.
A shorter alternative is to use
dt[, ..cols]
See also Why does ".." work to pass column names in a character vector variable?
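As for the [[1]]: DT[i, j, with = F] returns a one-column data.table, not a bare value. A data.table is a list of columns, so [[1]] extracts the first (and here only) column as a plain vector, which is what curr_value ends up holding. A minimal sketch with the dt from above:
dt[1, 1, with = F]        # a 1x1 data.table holding mpg of the first row
dt[1, 1, with = F][[1]]   # [[1]] pulls the column out, giving the bare value 21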

Assign individual vector elements to individual columns by row and by name

I work mostly with data.table, but a data.frame solution would work as well.
I have the result of an apply which returns this data structure
applyres <- structure(c(0.0260, 3.6775, 0.92), .Names = c("a.1", "a.2", "a.3"))
Then I have a data table
coltoadd <- c('a.1', 'a.2', 'a.3')
dt <- data.table(variable1 = factor(c("what", "when", "where")))
dt[, (coltoadd) := as.numeric(NA)]
Now I would like to add the elements of applyres to the corresponding columns, just one row at a time, because applyres is calculated from another function. I have tried different assignments but nothing seems to work. Ideally I would like to assign based on column name, just in case the columns change order in one of the two structures.
This doesn't work
dt[1,coltoadd]=applyres
I also tried
dt[1,coltoadd := applyres]
And I tried changing applyres to a matrix or a data.table and transposing.
I would like to do something like this
dt[1,coltoadd[i]]=applyres[coltoadd[i]]
But I am not sure if it should go in a loop; that doesn't seem like the best way to do it.
How do I avoid doing single assignments if I have a large number of columns?
1) data.frame Convert to data.frame, perform the assignments and convert back.
DF <- as.data.frame(dt)
DF[1, -1] <- applyres
# perform remaining assignments
dt <- as.data.table(DF)
2) loop Another possibility is a for loop:
for(i in 2:ncol(dt)) dt[1, i] <- applyres[i-1]
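3) named := A data.table-native sketch of my own (not from the original answers): := expects a list on the right-hand side, so wrap the named vector in as.list, indexing it by the column names so that values are matched by name rather than by position:
dt[1, (coltoadd) := as.list(applyres[coltoadd])]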

data.table: transforming subset of columns with a function, row by row

How can one, having a data.table with mostly numeric values, transform just a subset of columns and put them back to the original data table? Generally, I don't want to add any summary statistic as a separate column, just exchange the transformed ones.
Assume we have a DT. It has 1 column with names and 10 columns with numeric values. I am interested in using the "scale" function from base R on each row of that data table, but applied only to those 10 numeric columns.
And to expand on this: what if I have a data table with more columns, and I need to use column names to tell scale which datapoints to apply the function to?
With regular data.frame I would just do:
df[,grep("keyword",colnames(df))] <- t(apply(df[,grep("keyword",colnames(df))],1,scale))
I know this looks cumbersome but always worked for me. However, I can't figure out a simple way to do it in data.tables.
I would imagine something like this working for data.tables:
dt[, grep("keyword", colnames(dt)) := scale(grep("keyword", colnames(dt)), center = F)]
But it doesn't.
EDIT:
Another example, updating the columns with their per-row-scaled versions:
## dt is a data.table
dt[, grep("keyword", colnames(dt), value = T) :=
       as.data.table(t(apply(dt[, grep("keyword", colnames(dt)), with = F], 1, scale)))]
Too bad it needs the "as.data.table" part inside, as the transposed value returned by apply is a matrix. Maybe data.table should automatically coerce matrices into data.tables when updating columns?
If what you need is really to scale by row, you can try doing it in 2 steps:
# compute mean/sd per row:
mean_sd <- DT[, .(mean(unlist(.SD)), sd(unlist(.SD))),
              by = 1:nrow(DT), .SDcols = grep("keyword", colnames(DT))]
# scale:
DT[, grep("keyword", colnames(DT), value = TRUE) :=
       lapply(.SD, function(x) (x - mean_sd$V1)/mean_sd$V2),
   .SDcols = grep("keyword", colnames(DT))]
PART 1: The one-line solution you requested:
# First let's take a look at the data in the columns:
DT[, .SD, .SDcols = grep("keyword", colnames(DT))]
One-line Solution Version 1: Use magrittr and the pipe operator:
DT[, (grep("keyword", colnames(DT))) := (lapply(.SD, . %>% scale(., center = F))),
   .SDcols = grep("keyword", colnames(DT))]
One-line Solution Version 2: Explicitly define the function for the lapply:
DT[, (grep("keyword", colnames(DT))) :=
       (lapply(.SD, function(x){scale(x, center = F)})),
   .SDcols = grep("keyword", colnames(DT))]
Modification - If you want to do it by group, just use by =:
DT[, (grep("keyword", colnames(DT))) :=
       (lapply(.SD, function(x){scale(x, center = F)})),
   .SDcols = grep("keyword", colnames(DT)),
   by = Grouping.Variable]
You can verify:
# Verify that the columns have updated values:
DT[, .SD, .SDcols = grep("keyword", colnames(DT))]
PART 2: A Step-by-Step Solution: (more general and easier to follow)
The above solution works for the narrow example given.
As a public service, I am posting this for anyone who is still searching for a way that
feels a bit less condensed;
is easier to understand;
is more general, in the sense that you can apply any function you wish without having to compute the values into a separate data table first (which, n.b., does work perfectly here).
Here's the step-by-step way of doing the same:
Get the data into data.table format:
# You get a data.table called DT
DT <- as.data.table(df)
Then handle the column names:
# Get the vector of relevant column names (value = TRUE returns names, not indices)
Reference.Cols <- grep("keyword", colnames(df), value = TRUE)
# FOR PEOPLE who want to store both transformed and untransformed values:
# create the new column names
Reference.Cols.normalized <- Reference.Cols %>% paste(., ".normalized", sep = "")
Define the function you want to apply:
# normalize is just a standard scaling (z-score) function
normalize <- function(X,
                      X.mean = mean(X, na.rm = TRUE),
                      X.sd = sd(X, na.rm = TRUE)) {
  X <- (X - X.mean) / X.sd
  return(X)
}
After that, it is trivial in data.table syntax:
# Voilà: a newly created set of columns containing the transformed values
DT[, (Reference.Cols.normalized) := lapply(.SD, normalize), .SDcols = Reference.Cols]
Verify:
# New values are stored in the columns named in Reference.Cols.normalized:
DT[, .SD, .SDcols = Reference.Cols.normalized]
# Untransformed values are left unharmed:
DT[, .SD, .SDcols = Reference.Cols]
Hopefully, for those of you who return to look at code after some interval, this more step-by-step, general approach can be helpful.

Using data.table to calculate a function which depends on many columns

There are many posts which discuss applying a function over many columns when using data.table. However I need to calculate a function which depends on many columns. As an example:
# Create a data.table with 26 columns. Variable names are var1, ..., var26
data.mat = matrix(sample(letters, 26*26, replace = TRUE), ncol = 26)
colnames(data.mat) = paste("var", 1:26, sep = "")
data.dt <- data.table(data.mat)
Now, say I would like to count the number of 'a's in columns 5,6,7 and 8.
I cannot see how to do this with .SDcols and end up doing:
data.dt[, numberOfAs := (var5 == 'a') + (var6 == 'a') + (var7 == 'a') + (var8 == 'a')]
Which is very tedious. Is there a more sensible way to do this?
Thanks
I really suggest going through the data.table vignettes. Section 2e of the Introduction to data.table vignette explains .SD and .SDcols.
.SD is just a data.table containing the data for the current group, and .SDcols tells it which columns .SD should have. A useful way to see its content is to print it.
# .SD contains cols 5:8
data.dt[, print(.SD), .SDcols=5:8]
Since there is no by here, .SD contains all the rows of data.dt, corresponding to the columns specified in .SDcols.
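For contrast (my addition), the same idea with a by shows that .SD then holds only the current group's rows:
# .SD is printed once per group of var1, containing just that group's rows
data.dt[, print(.SD), by = var1, .SDcols = 5:8]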
Once you understand this, the task reduces to your knowledge of base R really. You can accomplish this in more than one way.
data.dt[, numberOfAs := rowSums(.SD == "a"), .SDcols=5:8]
We return a logical matrix by comparing all the columns in .SD to "a". And then use rowSums to sum them up.
Another way using Reduce:
data.dt[, numberOfAs := Reduce(`+`, lapply(.SD, function(x) x == "a")), .SDcols=5:8]
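If you prefer selecting by name, .SDcols also accepts a character vector of column names (a small variant of the same idea):
cols <- paste0("var", 5:8)
data.dt[, numberOfAs := rowSums(.SD == "a"), .SDcols = cols]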

Normalize values in a column to obtain overheads

I have the following data frame:
df <- data.frame(
  Target = rep(LETTERS[1:3], each = 8),
  Prov = rep(letters[1:4], each = 2),
  B = rep("5MB"),
  S = rep("1MB"),
  BUF = rep("8kB"),
  M = rep(c('g', 'p')),
  Thr.mean = 1:24)
whose column Thr.mean I would like to normalize by the values where Target=='C' (I don't mind attaching a new column).
To clarify, I would like to end up with:
Thr.mean <- c(1/17,2/18,3/19,4/20,5/21,6/22,7/23,8/24,9/17,10/18,11/19,12/20,13/21,14/22,15/23,16/24,1,1,1,1,1,1,1,1)
Now, it may happen that there are rows in this data frame where Target != 'C' that have values in S or B not present in any row where Target == 'C', and for these I would also like to calculate the overhead. The most important column for matching is M, then BUF, B, and S.
Any ideas how to do it? I could write several loops and ifs, but I'm looking for a more elegant solution.
For posterity, here is how I solved my problem using data.table:
DT <- data.table(df)
DT[, Thr.Norm.C := .SD[Target == 'C', Thr.mean], by = 'B,BUF,Prov']
DT[, over.thr := Thr.mean/Thr.Norm.C]
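An alternative sketch (mine, not part of the original solution) that makes the matching explicit with an update join on the columns named in the question. Note that in this example data Prov must be included as well to make the lookup unique; the names norm.C and over.thr2 are placeholders:
## build a lookup table of the Target == 'C' rows, then update by join
lookup <- DT[Target == 'C', .(Prov, M, BUF, B, S, norm.C = Thr.mean)]
DT[lookup, norm.C := i.norm.C, on = .(Prov, M, BUF, B, S)]
DT[, over.thr2 := Thr.mean/norm.C]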
