Replace Inf/-Inf values from vector of variable names, with values from similarly named vector of variables (substr/grep/gsub) - r

I'm currently stumped making some efficient code. I have a vector of variables (med.vars) that were transformed by the in-year global median. Sometimes the global median is 0, which creates Inf/-Inf values I would like to replace with the pre-transformed variable value (vars). I can't figure out how to do this efficiently with some type of data.table 'dat[,:=lapply(.SD), .SDcols=med.vars] function or a for loop with get(), noquotes(), etc.
dat<-data.table(v1=c(2,10,7),v2=c(5,6,5),v3=c(10,15,20),v1.med=c(1,Inf,5),v2.med=c(5,6,5),v3.med=c(-Inf,2,3))
vars<-c("v1","v2","v3")
med.vars<-c("v1.med","v2.med","v3.med")
v1 v2 v3 v1.med v2.med v3.med
1: 2 5 10 1 5 -Inf
2: 10 6 15 Inf 6 2
3: 7 5 20 5 5 3
In reality these vectors are 50+ vars I pull from names(dat) with grep() and use gsub(".med","",med.vars) to create the second vector of pre-transformed variable names.
I would like to efficiently perform
dat[v1.med==Inf | v1.med==-Inf, v1.med:=v1]
dat[v3.med==Inf | v3.med==-Inf, v3.med:=v3]
for each element, med.vars[i], and its corresponding element, vars[i] such that the resulting data.table is:
v1 v2 v3 v1.med v2.med v3.med
1: 2 5 10 1 5 -10
2: 10 6 15 10 6 2
3: 7 5 20 5 5 3
Thank you for your time

OP mentions efficiency, so maybe move to long form. Then the standard syntax can be used:
DT = melt(dat, meas=list(vars, med.vars), value.name=c("var", "med"))
DT[!is.finite(med), med := sign(med)*var]
variable var med
1: 1 2 1
2: 1 10 10
3: 1 7 5
4: 2 5 5
5: 2 6 6
6: 2 5 5
7: 3 10 -10
8: 3 15 2
9: 3 20 3

As these are corresponding columns, we can make use of Map
dat[, (med.vars) := Map(function(x, y) ifelse(is.finite(y), y,
x * sign(y)), .SD[, vars, with = FALSE],
.SD[, med.vars, with = FALSE])]
dat
# v1 v2 v3 v1.med v2.med v3.med
#1: 2 5 10 1 5 -10
#2: 10 6 15 10 6 2
#3: 7 5 20 5 5 3
Or another option is set by looping through the columns with a for loop
for(j in seq_along(vars)) {
i1 <- !is.finite(dat[[med.vars[j]]])
v1 <- dat[[vars[j]]]
v2 <- dat[[med.vars[j]]]
set(dat, i = which(i1), j = med.vars[j], value = sign(v2[i1]) * v1[i1])
}
This can also be done in base R (on a data.frame)
i1 <- !sapply(dat[med.vars], is.finite)
dat[med.vars][i1] <- dat[vars][i1] * sign(dat[med.vars][i1])

Related

How do we avoid for-loops when we want to conditionally add columns by reference? (condition to be evaluated seperately in each row)

I have a data.table with many numbered columns. As a simpler example, I have this:
dat <- data.table(cbind(col1=sample(1:5,10,replace=T),
col2=sample(1:5,10,replace=T),
col3=sample(1:5,10,replace=T),
col4=sample(1:5,10,replace=T)),
oneMoreCol='a')
I want to create a new column as follows: In each row, we add the values in columns from among col1-col4 if the value is not NA or 1.
My current code for this has two for-loops which is clearly not the way to do it:
for(i in 1:nrow(dat)){
dat[i,'sumCol':={temp=0;
for(j in 1:4){if(!is.na(dat[i,paste0('col',j),with=F])&
dat[i,paste0('col',j),with=F]!=1
){temp=temp+dat[i,paste0('col',j),with=F]}};
temp}]}
I would appreciate any advice on how to remove this for-loops. My code is running on a bigger data.table and it takes a long time to run.
A possible solution:
dat[, sumCol := rowSums(.SD * (.SD != 1), na.rm = TRUE), .SDcols = col1:col4]
which gives:
> dat
col1 col2 col3 col4 oneMoreCol sumCol
1: 4 5 5 3 a 17
2: 4 5 NA 5 a 14
3: 2 3 4 3 a 12
4: 1 2 3 4 a 9
5: 4 3 NA 5 a 12
6: 2 2 1 4 a 8
7: NA 2 NA 5 a 7
8: 4 2 2 4 a 12
9: 4 1 5 4 a 13
10: 2 1 5 1 a 7
Used data:
set.seed(20200618)
dat <- data.table(cbind(col1=sample(c(NA, 1:5),10,replace=T),
col2=sample(1:5,10,replace=T),
col3=sample(c(1:5,NA),10,replace=T),
col4=sample(1:5,10,replace=T)),
oneMoreCol='a')

Creating a new data table for each row of an existing data table R while avoiding memory vector issue

Suppose I have two data tables:
library(data.table)
A=data.table(w=1:3,d=5:7)
B=data.table(K=2:4,m=9:11)
> A
w d
1: 1 5
2: 2 6
3: 3 7
> B
K m
1: 2 9
2: 3 10
3: 4 11
I want to do the following expansion, where I have a new B for each row of A:
C=A[,B[],by=names(A)]
w d K m
1: 1 5 2 9
2: 1 5 3 10
3: 1 5 4 11
4: 2 6 2 9
5: 2 6 3 10
6: 2 6 4 11
7: 3 7 2 9
8: 3 7 3 10
9: 3 7 4 11
However, when I do it with my real data, I get this error:
Error in `[.data.table`(A, , B[], by = names(A)) :
negative length vectors are not allowed
It turns out this is a memory error. However, I think there should be a way to do this without loops, memory is not an issue on my server up to 50gb of ram, which the following data table would certainly be less than.
Does anyone know an efficient way to do this?
A hacky way to handle this might be to add an identical helper column to each table and then to allow cartesian joins:
library(data.table)
A = data.table(w = 1:3, d = 5:7)
B = data.table(K = 2:4, m = 9:11)
A[, j := 1]
B[, j := 1]
C = A[B, on = 'j', allow.cartesian = T]

(Using a custom function to) Sum above N rows in a datatable (dataframe) by groups

I need a function that sums the above N+1 rows in dataframes (data tables) by groups.
An equivalent function for a vector, would be something like below. (Please forgive me if the function below is inefficient)
Function1<-function(x,N){
y<-vector(length=length(x))
for (i in 1:length(x))
if (i<=N)
y[i]<-sum(x[1:i])
else if (i>N)
y[i]<-sum(x[(i-N):i])
return(y)}
Function1(c(1,2,3,4,5,6),3)
#[1] 1 3 6 10 14 18 # Sums previous (above) 4 values (rows)
I wanted to use this function with sapply, like below..
sapply(X=DF<-data.frame(A=c(1:10), B=2), FUN=Function1(N=3))
but couldn't.. because I could not figure out how to set a default for the x in my function. Thus, I built another function for data.frames.
Function2<-function(x, N)
if(is.data.frame(x)) {
y<-data.frame()
for(j in 1:ncol(x))
for(i in 1:nrow(x))
if (i<=N) {
y[i,j]<-sum(x[1:i,j])
} else if (i>N) {
y[i,j]<-sum(x[(i-N):i,j])}
return(y)}
DF<-data.frame(A=c(1:10), B=2)
Function2(DF, 2)
# V1 V2
1 1 2
2 3 4
3 6 6
4 9 6
5 12 6
6 15 6
7 18 6
8 21 6
9 24 6
10 27 6
However, I still need to perform this by groups. For example, for the following data frame with a character column.
DF<-data.frame(Name=rep(c("A","B"),each=5), A=c(1:10), B=2)
I would like to apply my function by group "Name" -- which would result in.
A 1 2
A 3 4
A 6 6
A 9 6
A 12 6
B 6 2
B 13 4
B 21 6
B 24 6
B 27 6
#Perform function2 separately for group A and B.
I was hoping to use function with the data.table package (by=Groups), but couldn't figure out how.
What would be the best way to do this?
(Also, it would be really nice, if I could learn how to make my Function1 to work in sapply)
With data.table, we group by 'Name', loop through the columns of interest specified in .SDcols (here all the columns are of interest so we are not specifying it) and apply the Function1
library(data.table)
setDT(DF)[, lapply(.SD, Function1, 2), Name]
# Name A B
# 1: A 1 2
# 2: A 3 4
# 3: A 6 6
# 4: A 9 6
# 5: A 12 6
# 6: B 6 2
# 7: B 13 4
# 8: B 21 6
# 9: B 24 6
#10: B 27 6

Add marginals to data table?

What is the right way to add marginal sums to a data table?
What I do right now:
> (a <- data.table(x=c(1,2,1,2,2,3,3),y=c(10,10,20,20,30,30,40),z=1:7,key=c("x")))
x y z
1: 1 10 1
2: 1 20 3
3: 2 10 2
4: 2 20 4
5: 2 30 5
6: 3 30 6
7: 3 40 7
> (a <- a[a[,sum(z),by=x]])
x y z V1
1: 1 10 1 4
2: 1 20 3 4
3: 2 10 2 11
4: 2 20 4 11
5: 2 30 5 11
6: 3 30 6 13
7: 3 40 7 13
> setnames(a,"V1","x.z")
> setkeyv(a,"y")
> (a <- a[a[,sum(z),by=y]])
y x z x.z V1
1: 10 1 1 4 3
2: 10 2 2 11 3
3: 20 1 3 4 7
4: 20 2 4 11 7
5: 30 2 5 11 11
6: 30 3 6 13 11
7: 40 3 7 13 7
> setnames(a,"V1","y.z")
I am pretty sure this is not The Right Way.
What is?
One alternative is this one:
> a[,Sum:=sum(z), by="x"]
> a
x y z Sum
1: 1 10 1 4
2: 1 20 3 4
3: 2 10 2 11
4: 2 20 4 11
5: 2 30 5 11
6: 3 30 6 13
7: 3 40 7 13
Edit: Some more explanation on := usage:
The := operator enables add/update by reference. With this, you can:
add new columns or update existing columns by reference
DT[, x2 := x+1] # add one new column
DT[, `:=`(x2 = x+1, y2 = y+1)] # adding more than 1 col
DT[, x := x+1] # modify existing column
add or update certain rows of new or existing columns by reference
DT[x == 1L, y := NA] # modify 'y' just where expression in 'i' matches
DT[x == 1L, `:=`(y = NA, z=NA)] # same but for multiple columns
DT[x == 1L, newcol := 5L] # matched rows for 'newcol' will be 5, all other 'NA'
add or update cols while grouping, by reference - by default, the computed result is recycled within each group.
DT[, zsum := sum(z), by=x]
Here, sum(z) returns 1 value for each group in x. The result is then recycled for length of that group and is added/updated by reference to zsum.
add or update during a by-without-by operation. That is, when you perform a data.table join and you want to add/update column while joining:
X <- data.table(x=rep(1:3, each=2), y=1:6, key="x")
Y <- data.table(x=1:3, y=c(3L, 1L, 2L), key="x")
X[Y, y.gt := y > i.y]
Finally, you can also remove columns by reference (i.e. instantly even it's a 20GB table) :
DT[, x := NULL] # just 1 column
DT[, c("x","y") := NULL] # 1 or more columns
toRemove = c("x","y")
DT[, (toRemove) := NULL] # wrap with brackets to lookup variable
Hope this helps clarify the usage on :=. Also check out ?set. It is similar to :=, but with the limitation that it can not be combined with joins. This allows for it to be faster inside a for loop (due to reduced overhead from not calling [.data.table) for all operations it is capable of than :=.
It can be quite handy, especially, in some scenarios. See this post for a nice usage.

data.table assigning with `sapply` in a merge

I have some data.tables like so:
x <- data.table(id=rep(1:3, 2), a=1:6)
y <- data.table(id=1:3, b=2:4)
I can merge them like this:
setkey(x, id)
setkey(y, id)
x[y]
id a b
1: 1 1 2
2: 1 4 2
3: 2 2 3
4: 2 5 3
5: 3 3 4
6: 3 6 4
Now, I want to create a new column in x based off a and b which is the sum of a and b.
I can do this with:
x[y, val:=a + b]
However, now suppose for some reason that the '+' operator is not vectorised. How can I store a row-wise calculation into x where x[y] is needed for the calculation? Also, assume I cannot use mapply (because for my actual problem, mapply is not suited to the function).
I'm trying to use sapply like so to add in a row-wise manner:
x[y, sapply(1:nrow(x), function (i) a[i] + b[i])]
However this returns the incorrect result:
id V1
1: 1 3
2: 1 NA
3: 1 NA
4: 1 NA
5: 1 NA
6: 1 NA
7: 2 5
8: 2 NA
9: 2 NA
10: 2 NA
11: 2 NA
12: 2 NA
13: 3 7
14: 3 NA
15: 3 NA
16: 3 NA
17: 3 NA
18: 3 NA
If I do this it works:
x[y][, sapply(1:nrow(x), function (i) a[i] + b[i])]
# [1] 3 6 5 8 7 10
BUT when I try and assign this to a column in x, it is not stored (makes sense because it looks like I'm trying to save the new column into x[y]).
x[y][, val:=sapply(1:nrow(x), function (i) a[i] + b[i])]
Is there any way to do the above but save the output into x[, val]?
Is this how I am supposed to do it, or is there a more data.table-y way?
x[, val:=x[y][, sapply(1:nrow(x), function (i) a[i] + b[i])]]
You are doing by-without-by without knowing it, (see below for the description from the help)
Advanced: Aggregation for a subset of known groups is particularly
efficient when passing those groups in i. When i is a data.table,
DT[i,j] evaluates j for each row of i. We call this by without by or
grouping by i. Hence, the self join DT[data.table(unique(colA)),j] is
identical to DT[,j,by=colA].
This means that j is evaluated for each row of i (cylcing through y one row at a time -- so that if you run sapply(1:nrow(x),...) in j it will create a vector of length nrow(x) each time, when this is not what you want.
So your second option is definitely a valid approach (as it is one of the recommended approaches for doing this)
Otherwise you could use .N (When grouping by i, .N is the number of rows in x matched to, for each row of i) not nrow(x), but you will have to think about the length of your objects and how your function is to be vectorized.
Take this as an example
x[y, {browser(); a+b}]
Called from: `[.data.table`(x, y, {
browser()
a + b
})
Browse[1]> a
[1] 1 4
Browse[1]> b
[1] 2
Browse[1]> .N
[1] 2
a has length two, because value of the key matches with 2 rows from x. b only has length 1 because it only has length 1 in y.
I think the best approach is to correctly Vectorize your function (which is hard to give advice upon without more of an example)
another approach would be to replicate b to the length of a eg
x[y, val := {
bl <- rep_len(b, .N)
sapply(seq_len(.N), function(i) a[i] + bl[i])}]
x
id a val
1: 1 1 3
2: 1 4 6
3: 2 2 5
4: 2 5 8
5: 3 3 7
6: 3 6 10
or if you know that y has unique rows for each value of id, then you don't need to try and index any columns from it.
x[y, val2 := sapply(seq_len(.N), function(i) a[i] + b)]
# an alternative would be to use sapply on a (avoid creating another vector)
x[y, val3 := sapply(a, function(ai) ai + b)]
x
# id a val val2 val3
# 1: 1 1 3 3 3
# 2: 1 4 6 6 6
# 3: 2 2 5 5 5
# 4: 2 5 8 8 8
# 5: 3 3 7 7 7
# 6: 3 6 10 10 10

Resources