Reshaping data.table with cumulative sum - r

I want to reshape a data.table and include the historic (cumulatively summed) information for each variable. The No variable indicates the chronological order of measurements for an object ID. At each measurement additional information is found. I want to aggregate the known information at each timestamp No for each object ID.
Let me demonstrate with an example:
For the following data.table:
library(data.table)
df <- data.table(ID = c(1, 1, 1, 2, 2, 2, 2),
                 No = c(1, 2, 3, 1, 2, 3, 4),
                 Variable = c('a', 'b', 'a', 'c', 'a', 'a', 'b'),
                 Value = c(2, 1, 3, 3, 2, 1, 5))
df
ID No Variable Value
1: 1 1 a 2
2: 1 2 b 1
3: 1 3 a 3
4: 2 1 c 3
5: 2 2 a 2
6: 2 3 a 1
7: 2 4 b 5
I want to reshape it to this:
ID No a b c
1: 1 1 2 NA NA
2: 1 2 2 1 NA
3: 1 3 5 1 NA
4: 2 1 NA NA 3
5: 2 2 2 NA 3
6: 2 3 3 NA 3
7: 2 4 3 5 3
That is, the Value entries summed per Variable for each (ID, No), cumulatively over No.
I can get the result without the cumulative part by doing
dcast(df, ID+No~Variable, value.var="Value")
which results in the non-cumulative variant:
ID No a b c
1: 1 1 2 NA NA
2: 1 2 NA 1 NA
3: 1 3 3 NA NA
4: 2 1 NA NA 3
5: 2 2 2 NA NA
6: 2 3 1 NA NA
7: 2 4 NA 5 NA
Any ideas how to make this cumulative? The original data.table has over 250,000 rows, so efficiency matters.
EDIT: I just used a, b, c as an example; the original file has about 40 different levels. Furthermore, the NAs are important: there are also Value entries of 0, which mean something different from NA.
POSSIBLE SOLUTION
Okay, so I've found a working solution. It is far from efficient, since it enlarges the original table.
The idea is to repeat each row so that it appears once for every No from its own No up to TotalNo, where TotalNo is the maximum No per ID. Then the original dcast call can be used to build the wide table. In code:
df[, TotalNo := .N, by = ID]
df2 <- df[rep(seq(nrow(df)), (df$TotalNo - df$No + 1))]  # create duplicates
df3 <- df2[order(ID, No)]
df3[, No := seq(from = No[1], to = TotalNo[1], by = 1), by = .(ID, No)]
df4 <- dcast(df3,
             formula = ID + No ~ Variable,
             value.var = "Value", fill = NA, fun.aggregate = sum)
It is not really nice, because the creation of duplicates uses more memory. I think it can be optimized further, but so far it works for my purposes. In the sample code it goes from 7 rows to 16 rows; in the original file it goes from 241,670 rows to a whopping 978,331. That's more than a factor of 4 larger.
SOLUTION
Eddi's answer improves on my solution's computing time on the full dataset (2.08 seconds for Eddi's versus 4.36 seconds for mine). Those are numbers I can work with! Thanks, everybody!

Your solution is good, but you're adding too many rows; they are unnecessary if you compute the cumsum beforehand:
# add useful columns
df[, TotalNo := .N, by = ID][, CumValue := cumsum(Value), by = .(ID, Variable)]
# do a rolling join to extend the missing values, and then dcast
dcast(df[df[, .(No = seq(No[1], TotalNo[1])), by = .(ID, Variable)],
         on = c('ID', 'Variable', 'No'), roll = TRUE],
      ID + No ~ Variable, value.var = 'CumValue')
# ID No a b c
#1: 1 1 2 NA NA
#2: 1 2 2 1 NA
#3: 1 3 5 1 NA
#4: 2 1 NA NA 3
#5: 2 2 2 NA 3
#6: 2 3 3 NA 3
#7: 2 4 3 5 3
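For readers less familiar with rolling joins, here is a sketch (using the same df and the TotalNo/CumValue columns added above) of what the join does before the dcast:
# the inner table lists, for each (ID, Variable), every No from its first
# appearance up to TotalNo, i.e. every timestamp that needs a value
idx <- df[, .(No = seq(No[1], TotalNo[1])), by = .(ID, Variable)]
# joining df onto idx with roll = TRUE carries the last observed CumValue
# forward to the Nos where that Variable was not measured
df[idx, on = c('ID', 'Variable', 'No'), roll = TRUE]
The dcast then simply spreads the carried-forward CumValue into one column per Variable.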

Here's a standard way:
library(zoo)
df[, cv := cumsum(Value), by = .(ID, Variable)]
DT = dcast(df, ID + No ~ Variable, value.var="cv")
lvls = sort(unique(df$Variable))
DT[, (lvls) := lapply(.SD, na.locf, na.rm = FALSE), by=ID, .SDcols=lvls]
ID No a b c
1: 1 1 2 NA NA
2: 1 2 2 1 NA
3: 1 3 5 1 NA
4: 2 1 NA NA 3
5: 2 2 2 NA 3
6: 2 3 3 NA 3
7: 2 4 3 5 3
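If you would rather avoid the zoo dependency, newer versions of data.table (1.12.4 or later, if I recall correctly) provide nafill(), which can do the same last-observation-carried-forward fill on numeric columns; a minimal sketch:
# same fill as na.locf, using data.table's built-in nafill()
DT[, (lvls) := lapply(.SD, nafill, type = "locf"), by = ID, .SDcols = lvls]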

One alternative way to do it is with a custom-built cumulative sum function. This is exactly the method in @David Arenburg's comment, but substituting a custom cumulative summary function.
EDIT: Now using @eddi's much more efficient custom cumulative sum function.
cumsum.na <- function(z) {
  Reduce(function(x, y) if (is.na(x) && is.na(y)) NA else sum(x, y, na.rm = TRUE),
         z, accumulate = TRUE)
}
cols <- sort(unique(df$Variable))
res <- dcast(df, ID + No ~ Variable, value.var = "Value")[
  , (cols) := lapply(.SD, cumsum.na), .SDcols = cols, by = ID]
res
ID No a b c
1: 1 1 2 NA NA
2: 1 2 2 1 NA
3: 1 3 5 1 NA
4: 2 1 NA NA 3
5: 2 2 2 NA 3
6: 2 3 3 NA 3
7: 2 4 3 5 3
This definitely isn't the most efficient, but it gets the job done and gives you an admittedly very slow cumulative summary function that handles NAs the way you want.
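To see how the NA handling differs from plain cumsum, a quick illustration:
cumsum.na(c(2, NA, 3))   # 2 2 5   -- an NA after a value keeps the running total
cumsum.na(c(NA, NA, 1))  # NA NA 1 -- leading NAs stay NA until a value appears
cumsum(c(2, NA, 3))      # 2 NA NA -- plain cumsum propagates the NA instead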

Related

data.table::dcast long to wide data while ignoring NA-Category?

I want to transform my data from long to wide after some joins, which results in a few NAs in the data.
Unfortunately, these NAs also occur in the right-hand side (RHS) of the formula, which defines the newly created columns in the transformation.
Consider this example:
library(data.table)
dt <- data.table(id = c(1, 2, 1, 2, 3, 4),
                 group = c("A", "A", "B", "B", NA, NA),
                 values = c(7, 8, 9, 10, NA, NA))
dt_wide <- dcast(dt,
                 id ~ group,
                 value.var = c("values"))
In the data, rows 5 and 6 do not have any group or associated value:
id group values
1: 1 A 7
2: 2 A 8
3: 1 B 9
4: 2 B 10
5: 3 <NA> NA
6: 4 <NA> NA
If there is an associated value, a group does exist; therefore (group == NA) => (value == NA).
The transformation wrongly treats NA as its own group in the group column, which results in the following wide data table:
id NA A B
1: 1 NA 7 9
2: 2 NA 8 10
3: 3 NA NA NA
4: 4 NA NA NA
I would prefer not to build a possibly buggy workaround where I retroactively delete the NA column by name or values (as the code might have to handle different column names and columns later in production).
Is there a way to tell dcast to ignore the NAs in group and not make an extra column out of it, while preserving all rows in the transformed table?
Like this:
id A B
1: 1 7 9
2: 2 8 10
3: 3 NA NA
4: 4 NA NA
This is tricky, but seems to work:
dcast(dt,
      id ~ ifelse(is.na(group), unique(na.omit(dt$group)), group),
      value.var = c("values"))
Key: <id>
id A B
<num> <num> <num>
1: 1 7 9
2: 2 8 10
3: 3 NA NA
4: 4 NA NA
I don't think it's possible to prevent dcast from doing that. I'd just filter them out afterwards:
dt_wide[, names(dt_wide) != "NA", with = FALSE]
Output:
id A B
1: 1 7 9
2: 2 8 10
3: 3 NA NA
4: 4 NA NA
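Another option (just a sketch of the idea, not necessarily the fastest) is to cast only the non-NA rows and then join the result back onto the full set of ids, so rows 3 and 4 are preserved without an NA column ever being created:
# cast without the NA-group rows, then right-join onto all ids
wide <- dcast(dt[!is.na(group)], id ~ group, value.var = "values")
wide[unique(dt[, .(id)]), on = "id"]
#    id  A  B
# 1:  1  7  9
# 2:  2  8 10
# 3:  3 NA NA
# 4:  4 NA NA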

Skip NAs when using Reduce() in data.table

I'm trying to get the cumulative sum of data.table rows and was able to find this code in another Stack Overflow post:
devDF1[, names(devDF1) := Reduce(`+`, devDF1, accumulate = TRUE)]
It does what I need it to do; however, when it comes across a row that starts off with an NA, it replaces every element in that row with NA (instead of the cumulative sum of the other elements in the row). I don't want to replace the NAs with 0s, because I'll need this output for further processing and don't want the same final cumsum duplicated across rows. Is there any way to adjust this piece of code to ignore the NAs? Or is there alternative code that could be used to get the cumulative sum of the rows of a data.table while ignoring NAs?
Consider this example :
library(data.table)
dt <- data.table(a = 1:5, b = c(3, NA, 1, 2, 4), c = c(NA, 1, NA, 3, 4))
dt
# a b c
#1: 1 3 NA
#2: 2 NA 1
#3: 3 1 NA
#4: 4 2 3
#5: 5 4 4
If you want to carry the previous cumulative value over the NA positions (treating NA as 0), you can use:
dt[, names(dt) := lapply(.SD, function(x) cumsum(replace(x, is.na(x), 0))),
   .SDcols = names(dt)]
dt
# a b c
#1: 1 3 0
#2: 3 3 1
#3: 6 4 1
#4: 10 6 4
#5: 15 10 8
If you want to keep NA as NA:
dt[, names(dt) := lapply(.SD, function(x) {
  x1 <- cumsum(replace(x, is.na(x), 0))
  x1[is.na(x)] <- NA
  x1
}), .SDcols = names(dt)]
dt
# a b c
#1: 1 3 NA
#2: 3 NA 1
#3: 6 4 NA
#4: 10 6 4
#5: 15 10 8
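If what you actually need is the original row-wise accumulation across columns (as in the Reduce() call you quoted) while skipping NAs, the same zero-then-restore idea works there too. A sketch, using data.table's fcoalesce() (available in recent versions) to zero the NAs before accumulating:
# re-create the example so this is self-contained
dt <- data.table(a = 1:5, b = c(3, NA, 1, 2, 4), c = c(NA, 1, NA, 3, 4))
# accumulate across columns with NAs treated as 0 ...
acc <- Reduce(`+`, lapply(dt, function(x) fcoalesce(as.numeric(x), 0)),
              accumulate = TRUE)
# ... then put the NAs back where the original values were missing
dt[, names(dt) := Map(function(orig, s) replace(s, is.na(orig), NA), .SD, acc),
   .SDcols = names(dt)]
dt
#    a  b  c
# 1: 1  4 NA
# 2: 2 NA  3
# 3: 3  4 NA
# 4: 4  6  9
# 5: 5  9 13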

How do we avoid for-loops when we want to conditionally add columns by reference? (condition to be evaluated separately in each row)

I have a data.table with many numbered columns. As a simpler example, I have this:
dat <- data.table(cbind(col1 = sample(1:5, 10, replace = T),
                        col2 = sample(1:5, 10, replace = T),
                        col3 = sample(1:5, 10, replace = T),
                        col4 = sample(1:5, 10, replace = T)),
                  oneMoreCol = 'a')
I want to create a new column as follows: in each row, sum the values of col1 through col4 that are neither NA nor 1.
My current code for this has two for-loops which is clearly not the way to do it:
for (i in 1:nrow(dat)) {
  dat[i, 'sumCol' := {
    temp <- 0
    for (j in 1:4) {
      if (!is.na(dat[i, paste0('col', j), with = F]) &
          dat[i, paste0('col', j), with = F] != 1) {
        temp <- temp + dat[i, paste0('col', j), with = F]
      }
    }
    temp
  }]
}
I would appreciate any advice on how to remove these for-loops. My code runs on a bigger data.table, and it takes a long time.
A possible solution:
dat[, sumCol := rowSums(.SD * (.SD != 1), na.rm = TRUE), .SDcols = col1:col4]
which gives:
> dat
col1 col2 col3 col4 oneMoreCol sumCol
1: 4 5 5 3 a 17
2: 4 5 NA 5 a 14
3: 2 3 4 3 a 12
4: 1 2 3 4 a 9
5: 4 3 NA 5 a 12
6: 2 2 1 4 a 8
7: NA 2 NA 5 a 7
8: 4 2 2 4 a 12
9: 4 1 5 4 a 13
10: 2 1 5 1 a 7
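The trick here is that (.SD != 1) is a logical matrix, so multiplying by it zeroes out the 1s, and na.rm = TRUE in rowSums drops the NAs. On a single row:
x <- c(4, 1, NA, 5)
x * (x != 1)                      # 4 0 NA 5
sum(x * (x != 1), na.rm = TRUE)   # 9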
Used data:
set.seed(20200618)
dat <- data.table(cbind(col1 = sample(c(NA, 1:5), 10, replace = T),
                        col2 = sample(1:5, 10, replace = T),
                        col3 = sample(c(1:5, NA), 10, replace = T),
                        col4 = sample(1:5, 10, replace = T)),
                  oneMoreCol = 'a')

How to do a complex wide-to-long operation for network analysis

I have survey data that includes who the respondent is (iAmX), who they work with (withX), how frequently they work with each partner (freqX), and how satisfied they are with each partner (likeX). Participants can select multiple options for who they are and who they work with.
I would like to go from something like this, with one row per respondent:
df <- read.table(header=T, text='
id iAmA iAmB iAmC withA withB withC freqA freqB freqC likeA likeB likeC
1 X X NA X X NA 3 2 NA 3 2 NA
2 NA NA X X NA NA 5 NA NA 5 NA NA
')
To something like this, with one row per combination, where "from" is who the actor is and "to" is who they work with:
goal <- read.table(header=T, text='
id from to freq like
1 A A 3 3
1 B A 3 3
1 A B 2 2
1 B B 2 2
2 C A 5 5
')
I have tried some melt, gather, and reshape functions but frankly I think I'm just not up to the logic puzzle today. I would really appreciate some help!
Although I must admit I have not fully understood OP's logic, the code below reproduces the expected goal.
The key points here are data.table's incarnation of the melt() function, which can reshape multiple measure columns simultaneously, and the cross join function CJ().
library(data.table)
# reshape multiple measure columns simultaneously
cols <- c("iAm", "with", "freq", "like")
long <- melt(setDT(df), measure.vars = patterns(cols),
             value.name = cols, variable.name = "to")[
  # rename factor levels
  , to := forcats::fct_relabel(to, function(x) LETTERS[as.integer(x)])]
# create combinations for each id
combi <- long[, CJ(from = na.omit(to[iAm == "X"]), to = na.omit(to[with == "X"])), by = id]
# join to append freq and like
result <- combi[long, on = .(id, to), nomatch = 0L][, -c("iAm", "with")]
# reorder result
setorder(result, id)
result
id from to freq like
1: 1 A A 3 3
2: 1 B A 3 3
3: 1 A B 2 2
4: 1 B B 2 2
5: 2 C A 5 5
The intermediate results are
long
id to iAm with freq like
1: 1 A X X 3 3
2: 2 A <NA> X 5 5
3: 1 B X X 2 2
4: 2 B <NA> <NA> NA NA
5: 1 C <NA> <NA> NA NA
6: 2 C X <NA> NA NA
and
combi
id from to
1: 1 A A
2: 1 A B
3: 1 B A
4: 1 B B
5: 2 C A
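For readers unfamiliar with it, CJ() simply builds the (sorted, keyed) cross product of its arguments, which is why every combination of the actor's roles and partners per id shows up in combi:
CJ(from = c("A", "B"), to = c("A", "B"))
#    from to
# 1:    A  A
# 2:    A  B
# 3:    B  A
# 4:    B  B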

Summing the number of times a value appears in either of 2 columns

I have a large data set of around 32 million rows. I have information on the telephone number, the origin of the call, and the destination.
For each telephone number, I want to count the number of times it appeared either as Origin or as Destination.
An example data table is as follows:
library(data.table)
dt <- data.table(Tel=seq(1,5,1), Origin=seq(1,5,1), Destination=seq(3,7,1))
Tel Origin Destination
1: 1 1 3
2: 2 2 4
3: 3 3 5
4: 4 4 6
5: 5 5 7
I have working code, but it takes too long for my data since it involves a for loop. How can I optimize it?
Here it is:
for (i in unique(dt$Tel)) {
  index <- (dt$Origin == i | dt$Destination == i)
  dt[dt$Tel == i, "N"] <- sum(index)
}
Result:
Tel Origin Destination N
1: 1 1 3 1
2: 2 2 4 1
3: 3 3 5 2
4: 4 4 6 2
5: 5 5 7 2
Where N indicates that Tel=1 appears once, Tel=2 appears once, and Tel=3, 4, and 5 each appear twice.
We can do a melt and match
dt[, N := melt(dt, id.var = "Tel")[, tabulate(match(value, Tel))]]
Or, another option is to loop through columns 2 and 3, use %in% to check whether the values in 'Tel' are present, then get the element-wise sum of the logical vectors with Reduce and `+`, and assign (:=) the result to 'N':
dt[, N := Reduce(`+`, lapply(.SD, function(x) Tel %in% x)), .SDcols = 2:3]
dt
# Tel Origin Destination N
#1: 1 1 3 1
#2: 2 2 4 1
#3: 3 3 5 2
#4: 4 4 6 2
#5: 5 5 7 2
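To unpack the melt/match one-liner: melt stacks Origin and Destination into a single value column (with Tel repeated alongside), match maps each stacked value back to a position in Tel, and tabulate counts how many times each position was hit. A sketch of the intermediate steps:
m <- melt(dt[, .(Tel, Origin, Destination)], id.var = "Tel")
match(m$value, m$Tel)            # position in (stacked) Tel for each value, NA if absent
# [1]  1  2  3  4  5  3  4  5 NA NA   (6 and 7 never occur in Tel)
tabulate(match(m$value, m$Tel))  # hits per row of Tel
# [1] 1 1 2 2 2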
A second method constructs a temporary data.table which is then joined to the original. This is longer and likely less efficient than @akrun's, but it can be useful to see.
# get temporary data.table as the sum of origin and destination frequencies
temp <- setnames(data.table(table(unlist(dt[, .(Origin, Destination)], use.names = FALSE))),
                 c("Tel", "N"))
# turn the variables into integers (Tel is the name of the table above, and thus character)
temp <- temp[, lapply(temp, as.integer)]
Now, join this onto the original table:
dt <- temp[dt, on="Tel"]
dt
Tel N Origin Destination
1: 1 1 1 3
2: 2 1 2 4
3: 3 2 3 5
4: 4 2 4 6
5: 5 2 5 7
You can get the desired column order using setcolorder
setcolorder(dt, c("Tel", "Origin", "Destination", "N"))
