I am amazed by the blazing speed of data.table. The code below does exactly what I need, but when executed on a large table it does not perform well. I am convinced that this can be done faster with data.table, but I do not see how.
Output
The output needs to be a matrix whose row names form a regular sequence of days.
For each column separately:
All values before the first value need to be NA
All values after the last value need to be NA
Between the first and the last value, 0s need to be inserted for days that do not exist in the input table
The following code shows how the result should look:
M <-
matrix(c(NA, NA, NA, 2, 0, 1, 3, 0, 2 , NA,
NA, NA, 3, 1, 3, 2, 1, 2, NA, NA),
ncol = 2,
dimnames = list(as.character((Sys.Date() + 0:9)),
c("E1", "E2")))
Output example
## E1 E2
## 2017-01-27 NA NA
## 2017-01-28 NA NA
## 2017-01-29 NA 2
## 2017-01-30 2 2
## 2017-01-31 0 2
## 2017-02-01 3 1
## 2017-02-02 1 3
## 2017-02-03 0 3
## 2017-02-04 2 NA
## 2017-02-05 NA NA
Input
The following table shows the source/input for the code:
DS <- data.table(
  E = c(rep("E1", 4), rep("E2", 6)),
  C = c(Sys.Date() + c(3, 5, 6, 8),
        Sys.Date() + c(2, 3, 4, 5, 6, 7)),
  # random: without a seed, the S values below are from one particular run
  S = round(runif(n = 10, min = 1, max = 3), 0),
  key = c("E", "C"))
## E C S
## 1: E1 2017-01-30 3
## 2: E1 2017-02-01 1
## 3: E1 2017-02-02 2
## 4: E1 2017-02-04 1
## 5: E2 2017-01-29 3
## 6: E2 2017-01-30 2
## 7: E2 2017-01-31 3
## 8: E2 2017-02-01 1
## 9: E2 2017-02-02 2
## 10: E2 2017-02-03 3
Working code
The following few lines do exactly what I need and are simple. However, they are not efficient.
The real table has 700 unique C values and 2 million E values.
# Create the regular time line per day
CL <- Sys.Date() + 0:9
# Determine first and last date per E
DM <- DS[, .(MIN = min(C), MAX = max(C)), by = .(E)]
# Generate all combinations
CJ <- CJ(E = DS$E, C = CL, unique = TRUE)
# Join
DC <- DS[CJ, on = .(E, C)][!is.na(E)]
# Replace NA by 0
DC[is.na(S), S := 0]
# Lead-in: blank everything before the first value per E (non-equi join update)
DC[DM, on = .(E, C < MIN), S := NA]
# Lead-out: blank everything after the last value per E
DC[DM, on = .(E, C > MAX), S := NA]
# Cast to wide format
DC2 <- dcast(
  data = DC, formula = C ~ E,
  fun.aggregate = sum, value.var = "S")
# Coerce to matrix
M3 <- as.matrix(DC2[, -1])
# Add row names
rownames(M3) <- format(CL, "%Y-%m-%d")
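Before moving to the real data, a rough timing harness helps locate the bottleneck. This is only a sketch with hypothetical sizes; DS_big and nE are my names, not from the question, and the real data has about 2 million E values.
# hypothetical scaled-up input for timing (sketch, not the real data)
set.seed(1)
nE <- 1e4
DS_big <- data.table(
  E = rep(sprintf("E%05d", seq_len(nE)), each = 4),
  C = Sys.Date() + as.vector(replicate(nE, sort(sample(0:9, 4)))),
  S = sample(1:3, nE * 4, replace = TRUE),
  key = c("E", "C"))
system.time({
  CJ_big <- data.table::CJ(E = DS_big$E, C = CL, unique = TRUE)  # namespaced: CJ is reassigned above
  DC_big <- DS_big[CJ_big, on = .(E, C)][!is.na(E)]
  DC_big[is.na(S), S := 0]
})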
I wrote some long, unreadable, clumsy code which creates the matrix with 1.2B cells in 35 seconds. It must be possible to do this just as fast, but far more elegantly, with data.table; however, not like this.
A data.table, like a data.frame, is underneath everything a list (with length equal to the number of columns).
2 million columns is a lot of columns; this will make anything slow.
The conversion to "wide" format you describe will bloat the data with a large number of NA values. You can almost certainly perform the analysis you need on the "long form", using keys.
It isn't clear from your question what you need, but you can calculate the various sums
# convert to an IDate
DT[, CALDAY := as.IDate(CALDAY)]
# get range of dates
rangeDays <- DT[,range(CALDAY)]
all_days <- as.IDate(seq(rangeDays[1],rangeDays[2], by=1))
# create sums
DT_sum <- DT[, list(VALUE= sum(VALUE)), keyby = list(ENTITY, CALDAY)]
and then index using entity and dates.
DT_sum[.("2a8605e2-e283-11e6-a3bb-bbe3fd226f8d", all_days)]
and if you need to replace NA with 0
na_replace <- function(x,repl=0){x[is.na(x)]<-repl;x}
DT_sum[.("2a8605e2-e283-11e6-a3bb-bbe3fd226f8d", all_days), na_replace(VALUE)]
This does the trick, but the performance is still not good.
It takes DS as an input parameter. The result is a data.table, which can be coerced to a matrix with:
M <- as.matrix(build_timeseries_DT(DS))
Function
build_timeseries_DT <- function(DS){
  # regular time series for the complete range, with a row index
  dtC <- data.table(
    CAL = seq(min(DS$CAL), max(DS$CAL), by = "day"))[, idx := 1:.N]
  # add row index (idx) to sales
  DQ <- dtC[DS, on = "CAL"]
  setkey(DQ, ENT)
  # calculate min and max index per ENT (keyed, so we can join on ENT below)
  DM <- DQ[, .(MIN = min(idx), MAX = max(idx)), keyby = .(ENT)]
  # allocate memory, assign 0 and set rownames by reference
  DT <- dtC[, .(CAL)][, (DM[, ENT]) := 0L][, CAL := NULL]
  setattr(DT, "row.names", format(dtC$CAL, "%Y-%m-%d"))
  # set NA for the lead-in and lead-out, then populate values by reference
  # (the first set() also blanks rows MIN and MAX; the second restores them)
  for (j in colnames(DT)) {
    set(x = DT,
        i = c(1L:(DM[j, MIN]), (DM[j, MAX]):DT[, .N]),
        j = j,
        value = NA)
    set(x = DT,
        i = DQ[j, idx],
        j = j,
        value = DQ[j, SLS])
  }
  return(DT)
}
Test Data
DS <- data.table(
ENT = c("A", "A", "A", "B", "B", "C", "C", "C", "D", "D"),
CAL = c(Sys.Date() + c(0, 5, 6, 3, 8, 1, 2, 9, 3, 5)),
SLS = as.integer(c(1, 2, 1, 2, 3, 1, 2, 3, 2, 1)),
key = c("ENT", "CAL"))
ENT CAL SLS
1: A 2017-01-28 1
2: A 2017-02-02 2
3: A 2017-02-03 1
4: B 2017-01-31 2
5: B 2017-02-05 3
6: C 2017-01-29 1
7: C 2017-01-30 2
8: C 2017-02-06 3
9: D 2017-01-31 2
10: D 2017-02-02 1
Result
as.matrix(build_timeseries_DT(DS))
A B C D
[1,] 1 NA NA NA
[2,] 0 NA 1 NA
[3,] 0 NA 2 NA
[4,] 0 2 0 2
[5,] 0 0 0 0
[6,] 2 0 0 1
[7,] 1 0 0 NA
[8,] NA 0 0 NA
[9,] NA 3 0 NA
[10,] NA NA 3 NA
Related
I guess other people have already looked for this, but I couldn't find what I'm looking for.
I want to replace NA values with the value of the row above, but only when all other values in the two rows are the same. Bonus points for a data.table solution.
Right now, I've managed to do it only with a (very inefficient) loop.
In addition, my current code does not replace an NA when there are two NAs in the same row.
I have a strong feeling that I'm overthinking this problem. Any ideas for making this easier?
ex <- data.table(
id = c(1, 1, 2, 2),
attr1 = c(NA, NA, 3, 3),
attr2 = c(2, 2, NA, 3),
attr3 = c(NA, 2, 2, 1),
attr4 = c(1, 1, 1, 3)
)
desired_ex <- data.table(
id = c(1, 1, 2, 2),
attr1 = c(NA, NA, 3, 3),
attr2 = c(2, 2, NA, 3),
attr3 = c(2, 2, 2, 1),
attr4 = c(1, 1, 1, 3)
)
col_names <- paste0("attr", 1:4)
for (r in 1:nrow(ex)) {
  print(r)
  to_check <- col_names[colSums(is.na(ex[r, .SD, .SDcols = col_names])) > 0]
  if (length(to_check) == 0) {
    print("no NA - next")
    next
  }
  for (col_check in to_check) {
    .ex <- copy(ex)[seq(from = r, to = r + 1), ]
    .ex[[col_check]] <- NULL
    if (nrow(unique(.ex)) == 1) {
      ex[[col_check]][r] <- ex[[col_check]][r + 1]
    }
  }
}
all.equal(ex, desired_ex)
Here is a solution which will work for an arbitrary number of rows and columns within each id, not just pairs of rows:
library(data.table)
ex[,
if (all(unlist(lapply(.SD, \(x) all(first(x) == x, na.rm = TRUE))))) {
lapply(.SD, \(x) rep(fcoalesce(as.list(x)), .N))
} else {
.SD
}, by = id]
or, more compactly,
ex[, if (all(unlist(lapply(.SD, \(x) all(first(x) == x, na.rm = TRUE)))))
lapply(.SD, \(x) rep(fcoalesce(as.list(x)), .N)) else .SD, by = id]
id attr1 attr2 attr3 attr4
1: 1 NA 2 2 1
2: 1 NA 2 2 1
3: 2 3 NA 2 1
4: 2 3 3 1 3
Explanation
For each id it is checked whether the rows fulfill the condition. If not, .SD is returned unchanged. If the condition is fulfilled, a new .SD is created by picking the first non-NA value in each column (or NA in case of all NA) using fcoalesce() and replicating this value as many times as there are rows in .SD.
The check for the condition consists of two parts. First, it is checked for each column in .SD whether all values are identical, ignoring any NA. Then it is checked whether this holds for all columns.
Note that .SD is a data.table containing the Subset of Data for each group, excluding any columns used in by.
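To see the fcoalesce() trick in isolation (a minimal illustration, not part of the original answer):
x <- c(NA, 2, NA)      # a numeric column with gaps
fcoalesce(as.list(x))  # first non-NA element
## [1] 2
rep(fcoalesce(as.list(x)), 3)
## [1] 2 2 2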
Another use case with more rows and columns
ex2 <- fread("
id foo bar baz attr4 attr5
1 NA 2 NA 1 5
1 NA 2 2 1 NA
1 NA 2 NA NA NA
2 3 NA 2 1 2
2 3 3 1 3 2
2 3 3 1 4 2
3 5 2 NA 1 3
3 NA 2 2 1 3
4 NA NA NA NA NA
")
ex2[, if (sum(unlist(lapply(.SD, \(x) all(first(x) == x, na.rm = TRUE)))) == ncol(.SD))
lapply(.SD, \(x) rep(fcoalesce(as.list(x)), .N)) else .SD, by = id]
id foo bar baz attr4 attr5
1: 1 NA 2 2 1 5
2: 1 NA 2 2 1 5
3: 1 NA 2 2 1 5
4: 2 3 NA 2 1 2
5: 2 3 3 1 3 2
6: 2 3 3 1 4 2
7: 3 5 2 2 1 3
8: 3 5 2 2 1 3
9: 4 NA NA NA NA NA
Here is an option mixing base R with data.table:
# lead the values for comparison
cols <- paste0("attr", 1L:4L)
lcols <- paste0("lead_", cols)
ex[, (lcols) := shift(.SD, -1L), id]
# check which rows fulfill the criteria
flags <- apply(ex[, ..cols] == ex[, ..lcols], 1L, all, na.rm = TRUE) &
  apply(ex[, ..lcols], 1L, function(x) !all(is.na(x)))
# update those rows with values from the row below
ex[(flags), (cols) :=
  mapply(function(x, y) fcoalesce(x, y), mget(lcols), mget(cols), SIMPLIFY = FALSE)]
ex[, (lcols) := NULL][]
The solution assumes that there is no recursive populating, where the row after next would be used to fill the current row if the criteria are met.
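A minimal sketch of that limitation, using hypothetical data (ex3 is not from the question): with three stacked rows, a single pass fills the middle row from the bottom one, but the top row still compares against the middle row's old NA and is left unfilled; cascading would need a second pass.
ex3 <- data.table(id = c(1, 1, 1),
                  attr1 = c(NA, NA, 7),
                  attr2 = c(2, 2, 2))
## after one pass of the shift/fcoalesce steps above:
## row 2 gets attr1 = 7 (copied from row 3), row 1 keeps attr1 = NA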
I want to categorize one variable using the following conditions:
0 - 4: "fail"
5 - 7: "good"
8 - 10: "excellent"
None of the above: NA
I tried using the recode function. The values of the variable are numeric.
segur <- data$segur
I created a new variable using recode:
dt1 <- recode(segur, "c(0,4)='suspenso';c(5, 7)='aceptable';c(8,10)='excelente'; else='NA'")
dt1
How can I fix this?
Using factor in base R
Data:
# set random seed
set.seed(1L)
# without any NA
x1 <- sample(x = 1:10, size = 20, replace=TRUE)
# with NA
x2 <- sample(x = c(1:10, NA), size = 20, replace=TRUE)
Code:
# without any NA
as.character(factor(x1, levels = c(0:10), labels = c(rep("fail", 5), rep("good", 3), rep("excellent", 3)), exclude=NA))
# with NA
as.character(factor(x2, levels = c(0:10), labels = c(rep("fail", 5), rep("good", 3), rep("excellent", 3)), exclude=NA))
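Note the positional mapping this relies on: the 11 levels 0:10 line up one-to-one with the 5 + 3 + 3 = 11 labels, so 0-4 become "fail", 5-7 "good", and 8-10 "excellent".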
I guess you can use cut like below. Note include.lowest = TRUE, so that 0 falls into the first bin; without it the default intervals are left-open and 0 would become NA.
cut(segur, c(0, 4, 7, 10), labels = c("fail", "good", "excellent"), include.lowest = TRUE)
Example
> segur
 [1]  6  1  4 -2 -1 10  8  0  5  9
> cut(segur, c(0, 4, 7, 10), labels = c("fail", "good", "excellent"), include.lowest = TRUE)
 [1] good      fail      fail      <NA>      <NA>      excellent excellent
 [8] fail      good      excellent
Levels: fail good excellent
Here is a solution using the fmtr package: create a categorical format with the value() and condition() functions, then apply it to the numeric data with fapply(). For example:
library(fmtr)
# Create sample data
df <- read.table(header = TRUE, text = '
ID segur
1 0
2 8
3 5
4 11
5 7')
# Create format
fmt <- value(condition(x >= 0 & x <=4, "fail"),
condition(x >= 5 & x <=7, "good"),
condition(x >= 8 & x <= 10, "excellent"),
condition(TRUE, NA))
# Apply categorization
df$segur_cat <- fapply(df$segur, fmt)
# View results
df
# ID segur segur_cat
# 1 1 0 fail
# 2 2 8 excellent
# 3 3 5 good
# 4 4 11 <NA>
# 5 5 7 good
I've got a dataset
>view(interval)
# V1 V2 V3 ID
# 1 NA 1 2 1
# 2 2 2 3 2
# 3 3 NA 1 3
# 4 4 2 2 4
# 5 NA 5 1 5
>dput(interval)
structure(list(V1 = c(NA, 2, 3, 4, NA),
V2 = c(1, 2, NA, 2, 5),
V3 = c(2, 3, 1, 2, 1), ID = 1:5), row.names = c(NA, -5L), class = "data.frame")
I would like to extract the previous non-NA value (or the next one, if the NA is in the first row) for every NA, and store it as a local variable in a custom function, because I have to perform other operations on every row based on this value (which should change for every row I'm applying the function to).
I've written this function to print the local variables, but when I apply it the output is not what I want.
myFunction <- function(x){
  position <- as.data.frame(which(is.na(interval), arr.ind = TRUE))
  tempVar <- ifelse(interval$ID == 1,
                    interval[position$row + 1, position$col],
                    interval[position$row - 1, position$col])
  return(tempVar)
}
I was expecting to get something like this
# [1] 2
# [2] 2
# [3] 4
But I get something pretty messed up instead.
Here's attempt number 1:
dat <- read.table(header=TRUE, text='
V1 V2 V3 ID
NA 1 2 1
2 2 3 2
3 NA 1 3
4 2 2 4
NA 5 1 5')
myfunc1 <- function(x) {
ind <- which(is.na(x), arr.ind=TRUE)
# since it appears you want them in row-first sorted order
ind <- ind[order(ind[,1], ind[,2]),]
# catch first-row NA
ind[,1] <- ifelse(ind[,1] == 1L, 2L, ind[,1] - 1L)
x[ind]
}
myfunc1(dat)
# [1] 2 2 4
The problem with this is when there is a second "stacked" NA:
dat2 <- dat
dat2[2,1] <- NA
dat2
# V1 V2 V3 ID
# 1 NA 1 2 1
# 2 NA 2 3 2
# 3 3 NA 1 3
# 4 4 2 2 4
# 5 NA 5 1 5
myfunc1(dat2)
# [1] NA NA 2 4
One fix/safeguard against this is to use zoo::na.locf, which carries the last observation forward. Since the top row is a special case, we do it twice, the second time in reverse. This gives us the next non-NA value in the column (up or down, depending).
library(zoo)
myfunc2 <- function(x) {
ind <- which(is.na(x), arr.ind=TRUE)
# since it appears you want them in row-first sorted order
ind <- ind[order(ind[,1], ind[,2]),]
# this is to guard against stacked NA
x <- apply(x, 2, zoo::na.locf, na.rm = FALSE)
# this special-case is when there are one or more NAs at the top of a column
x <- apply(x, 2, zoo::na.locf, fromLast = TRUE, na.rm = FALSE)
x[ind]
}
myfunc2(dat2)
# [1] 3 3 2 4
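Since the rest of this thread leans on data.table, here is a sketch of the same idea using data.table::nafill instead of zoo (myfunc3 is my name, not from the original answer). nafill() only fills in one direction at a time, so we apply "locf" and then "nocb" per column:
library(data.table)
myfunc3 <- function(x) {
  ind <- which(is.na(x), arr.ind = TRUE)
  ind <- ind[order(ind[, 1], ind[, 2]), , drop = FALSE]
  # fill down, then up, column by column (guards against stacked NA)
  filled <- as.data.frame(lapply(x, function(col)
    nafill(nafill(col, type = "locf"), type = "nocb")))
  filled[ind]
}
myfunc3(dat2)
# [1] 3 3 2 4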
I would like to merge several matrices using their row names.
These matrices do not have the same number of rows and columns.
For instance:
m1 <- matrix(c(1, 2, 3, 4, 5, 6), 3, 2)
rownames(m1) <- c("a","b","c")
m2 <- matrix(c(1, 2, 3, 5, 4, 5, 6, 2), 4, 2)
rownames(m2) <- c("a", "b", "c", "d")
m3 <- matrix(c(1, 2, 3, 4), 2,2)
rownames(m3) <- c("d", "e")
mlist <- list(m1, m2, m3)
For them I would like to get:
Row.names V1.x V2.x V1.y V2.y V1.z V2.z
a 1 4 1 4 NA NA
b 2 5 2 5 NA NA
c 3 6 3 6 NA NA
d NA NA 5 2 1 3
e NA NA NA NA 2 4
I have tried to use lapply with the function merge:
M <- lapply(mlist, merge, mlist, by = "row.names", all = TRUE)
However, it did not work:
Error in data.frame(c(1, 2, 3, 4, 5, 6), c(1, 2, 3, 5, 4, 5, 6, 2), c(1, :
arguments imply differing number of rows: 3, 4, 2
Is there an elegant way to merge these matrices?
You are trying to apply a reduction (?Reduce) to the list of matrices, where the reduction is basically merge. The problem is that merge(m1, m2, by = "row.names", all = TRUE) doesn't give you a new merged matrix with row names, but instead returns the row names in the first column. This is why we need additional logic in the reduction function.
Reduce(function(a, b) {
  res <- merge(a, b, by = "row.names", all = TRUE)
  rn <- res[, 1]        # Row.names column of merge
  res <- res[, -1]      # actual data
  row.names(res) <- rn  # assign row.names
  return(res)           # return the merged data with proper row.names
},
mlist[-1],              # Reduce (left-to-right) by applying function(a, b) repeatedly
init = mlist[[1]]       # start with the first matrix
)
Or alternatively:
df <- mlist[[1]]
for (i in 2:length(mlist)) {
  df <- merge(df, mlist[[i]], by = "row.names", all = TRUE)
  rownames(df) <- df$Row.names
  df <- df[, !(names(df) %in% "Row.names")]
}
# V1.x V2.x V1.y V2.y V1 V2
# a 1 4 1 4 NA NA
# b 2 5 2 5 NA NA
# c 3 6 3 6 NA NA
# d NA NA 5 2 1 3
# e NA NA NA NA 2 4
This could also be conceptualised as a reshape operation if the right long-form data.frame is passed to the function:
tmp <- do.call(rbind, mlist)
tmp <- data.frame(tmp, id=rownames(tmp),
time=rep(seq_along(mlist),sapply(mlist,nrow)) )
reshape(tmp, direction="wide")
# id X1.1 X2.1 X1.2 X2.2 X1.3 X2.3
#a a 1 4 1 4 NA NA
#b b 2 5 2 5 NA NA
#c c 3 6 3 6 NA NA
#d d NA NA 5 2 1 3
#e e NA NA NA NA 2 4
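For completeness, since the other threads here use data.table: a sketch of the same merge via rbindlist() and dcast(). The long/wide column names below are my own, and dcast() with multiple value.var columns names the result V1_1, V2_1, and so on per source matrix.
library(data.table)
long <- rbindlist(lapply(seq_along(mlist), function(i) {
  m <- mlist[[i]]
  data.table(rn = rownames(m), src = i, V1 = m[, 1], V2 = m[, 2])
}))
dcast(long, rn ~ src, value.var = c("V1", "V2"))
#    rn V1_1 V1_2 V1_3 V2_1 V2_2 V2_3
# 1:  a    1    1   NA    4    4   NA
# 2:  b    2    2   NA    5    5   NA
# 3:  c    3    3   NA    6    6   NA
# 4:  d   NA    5    1   NA    2    3
# 5:  e   NA   NA    2   NA   NA    4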
Take the following generic data
A <- c(5,7,11,10,23,30,24,6)
B <- c(1,2,3,1,2,3,1,2)
C <- data.frame(A,B)
and the following intervals
library(intervals)
interval1 <- Intervals(
matrix(
c(
5, 15,
15, 25,
25, 35,
35, 100
),
ncol = 2, byrow = TRUE
),
closed = c( TRUE, FALSE ),
type = "Z"
)
rownames(interval1) <- c("A","B","C", "D")
interval2 <- Intervals(
matrix(
c(
0, 10,
12, 20,
22, 30,
30, 100
),
ncol = 2, byrow = TRUE
),
closed = c( TRUE, FALSE ),
type = "Z"
)
rownames(interval2) <- c("P","Q","R", "S")
Now I want to create the output table shown below.
So where the A value overlaps both intervals, I want to 'copy' all the data to a line below.
We also introduce data$X, which is the interval1 value, and data$Y, which is the interval2 value.
Where the data does not fit within any of the intervals, I want to remove it from the data.frame.
I am not sure whether the break() function would be better used to create the intervals, or whether a dplyr function could be used to make the recurring data rows.
You can use foverlaps in data.table:
library(data.table)
C.DT <- data.table(C)
C.DT[, A1:=A] # required for `foverlaps` so we can do a range search
# `D` and `E` are your interval matrices
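# (assumption, not in the original answer) D and E as plain matrices that
# mirror interval1 and interval2 from the question, endpoints inclusive:
D <- matrix(c(5, 15, 15, 25, 25, 35, 35, 100), ncol = 2, byrow = TRUE)
E <- matrix(c(0, 10, 12, 20, 22, 30, 30, 100), ncol = 2, byrow = TRUE)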
I1 <- data.table(cbind(data.frame(D), idX=LETTERS[1:4], idY=NA))
I2 <- data.table(cbind(data.frame(E), idX=NA, idY=LETTERS[16:19]))
setkey(I1, X1, X2) # set the keys on our interval ranges
setkey(I2, X1, X2)
rbind(
foverlaps(C.DT, I1, by.x=c("A", "A1"), nomatch=0), # match every value in `C.DT$A` to the ranges in `I1`
foverlaps(C.DT, I2, by.x=c("A", "A1"), nomatch=0)
)[order(A, B), .(A, B, X=idX, Y=idY)]
Produces:
A B X Y
1: 5 1 A NA
2: 5 1 NA P
3: 6 2 A NA
4: 6 2 NA P
5: 7 2 A NA
6: 7 2 NA P
7: 10 1 A NA
8: 10 1 NA P
9: 11 3 A NA
10: 23 2 B NA
11: 23 2 NA R
12: 24 1 B NA
13: 24 1 NA R
14: 30 3 C NA
15: 30 3 NA R
16: 30 3 NA S
Note you can easily change what you get instead of NA, by modifying the steps where I1 and I2 are created.
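Design note: nomatch = 0 in the foverlaps() calls is what implements the "remove data outside any interval" requirement; with the default nomatch = NA those rows would be kept, with NA in the interval columns.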