Divide by sum by factor and carry back - r

I have a data.table that looks like:
library(data.table)
A <- c(1,3,5,20,21,21)
B <- c(1, 2, 3, 4, 5, 6)
C <- c("I","I","II","II","III","III")
D <- c(0.7, 0.3, 0.5, 0.9, 4, 7)
M <- data.table(A,B,C,D)
My question is similar to R help: divide values by sum produced through factor with a few extra considerations. A specifies a date (I'm simply using integers here). B identifies individuals. C is the classification the individual belongs to. D is a value variable.
For each classification c of C and each day a of A, divide the value D by the sum of the values of all individuals in c, carrying values backward when needed: another individual's observation on day x can be used if 0 < x-a <= N (we pick the smallest such x-a and use that observation as an approximation for the other individual's value in group c on day a).
Let's say N=5. Here's my expected output.
A <- c(1,3,5,20,21,21)
B <- c(1, 2, 3, 4, 5, 6)
C <- c("I","I","II","II","III","III")
D <- c(0.7/(0.7+0.3), 0.3/(0.3), 0.5/(0.5), 0.9/(0.9), 4/(4+7), 7/(4+7))
M <- data.table(A,B,C,D)
Note that the group II value on day 20 (individual 4) is not carried backward for individual 3, because the gap (20-5 = 15) is greater than 5. Is there a nice way of doing this in data.table?
For each value in D, I wish to divide by the sum of all the values of the same group (I, II, or III) on that day. However, you'll notice that for some groups, observations do not exist on that day. I'll try to walk through the logic on a few observations.
Edit: Let me try and walk through a few cases.
For individual 1 (column B) on day 1 (column A), the individual is of group I (column C). Other individuals of group I are: 2. For each of those others, we see that for individual 2, their nearest observation is on day 3 and 3-1<=5, so we'll use 0.3 in the denominator.
For individual 3 (column B) on day 5 (column A), the individual is of group II (column C). Other individuals of group II are: 4. For individual 4, their nearest observation is on day 20 and 20-5>5, so we cannot use their observation in the denominator.
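To make the rule concrete, here is a quick base R check of the denominator for individual 1 (day 1, group I) with N=5, using the original values; this only illustrates the windowing logic, not a full solution:
# original values from the first code block
A <- c(1, 3, 5, 20, 21, 21)
C <- c("I", "I", "II", "II", "III", "III")
D <- c(0.7, 0.3, 0.5, 0.9, 4, 7)
a <- 1; grp <- "I"; N <- 5
# usable observations: same group, dated within [a, a + N]
in_window <- C == grp & A >= a & A <= a + N
sum(D[in_window])        # 0.7 + 0.3 = 1.0
D[1] / sum(D[in_window]) # 0.7, matching the expected output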

This, I think, will give you your answer:
A <- c(1,3,5,20,21,21, 7)
B <- c(1, 2, 3, 4, 5, 6, 7)
C <- c("I","I","II","II","III","III", "I")
V <- c(0.7, 0.3, 0.5, 0.9, 4, 7, 0.1)
N=5
# put the data into a frame
test <- data.frame(A, B, C, V)
# order the data by group, then day
test <- test[order(test$C, test$A), ]
# get the 'rollback' possibilities for each value: rows whose day lies in [x, x + N]
Roll <- sapply(test$A, FUN = function(x){paste(which(test$A <= (x + N) & test$A >= x), collapse = ",")})
# get the groupings: rows belonging to the same group
Group <- sapply(test$C, FUN = function(x){paste(which(test$C == x), collapse = ",")})
# intersect the two: rows in the same group and within the date window
ToGet <- apply(cbind(Roll, Group), MARGIN = 1, FUN = function(x){intersect(unlist(strsplit(x[1], ",")), unlist(strsplit(x[2], ",")))})
# calculate the denominators and the final ratio
test$D <- sapply(ToGet, FUN = function(x){sum(test$V[as.numeric(x)])})
test$Calc <- test$V / test$D
Output:
> test
   A B   C   V    D      Calc
1  1 1   I 0.7  1.0 0.7000000
2  3 2   I 0.3  0.4 0.7500000
7  7 7   I 0.1  0.1 1.0000000
3  5 3  II 0.5  0.5 1.0000000
4 20 4  II 0.9  0.9 1.0000000
5 21 5 III 4.0 11.0 0.3636364
6 21 6 III 7.0 11.0 0.6363636

The question is tagged with data.table, so here is a data.table solution which uses non-equi joins to identify the individuals within each group that can be treated as a cohort because their observations fall within a date window of 5 days.
library(data.table) # CRAN version 1.10.4 used
# set length of date window in days
N <- 5L
# give columns more semantic names according to OP's description
setnames(M, c("day", "id", "grp", "val"))
# prepare data for non-equi join: allowable date range
ranged <- M[, .(start = day, end = day + N, co.id = id, grp)]
# non-equi join to determine cohort
joined <- M[ranged, on = c("grp", "day>=start", "day<=end")]
# compute denominator for each cohort
grouped <- joined[, .(den = sum(val)), by = co.id]
# final update on join and order
result <- M[grouped, on = c("id==co.id"), calc := val / den][order(grp, id)]
result
#    day id grp val      calc
# 1:   1  1   I 0.7 0.7000000
# 2:   3  2   I 0.3 0.7500000
# 3:   7  7   I 0.1 1.0000000
# 4:   5  3  II 0.5 1.0000000
# 5:  20  4  II 0.9 1.0000000
# 6:  21  5 III 4.0 0.3636364
# 7:  21  6 III 7.0 0.6363636
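To see which rows end up in each cohort, the intermediate tables built above can simply be printed:
ranged               # one allowable date window [start, end] per individual
joined[order(co.id)] # same-group observations falling into each individual's window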
Data
A <- c(1,3,5,20,21,21, 7)
B <- c(1, 2, 3, 4, 5, 6, 7)
C <- c("I","I","II","II","III","III", "I")
D <- c(0.7, 0.3, 0.5, 0.9, 4, 7, 0.1)
M <- data.table(A,B,C,D)
Compact versions
For those who prefer compact code, here is a more convoluted version:
joined <- M[M[, .(start = day, end = day + N, co.id = id, grp)],
            on = c("grp", "day>=start", "day<=end")]
M[joined[, .(den = sum(val)), by = co.id], on = c("id==co.id"),
  calc := val / den][order(grp, id)]
Or, as a "one-liner":
M[M[M[, .(start = day, end = day + N, co.id = id, grp)],
    on = c("grp", "day>=start", "day<=end")
    ][, .(den = sum(val)), co.id],
  on = c("id==co.id"), calc := val / den][order(grp, id)]

Related

Sum of data frame's rows in range defined by columns

I have an integer-based data frame with positional coordinates in one column and a variable in the second. The coordinates range from 1 to 10 million and the variable from 0 to 950. I'm interested in returning the sum of the variable over ranges defined in a separate frame containing the start and end points of each desired range.
To make things a bit easier to compute I've shortened the example:
Data:
a = seq(1,5)
b = c(0,0,1,0,2)
df1 <- data.frame(a, b)
c = c(1,1,2,2,3)
d = c(3,4,3,5,4)
df2 <- data.frame(c,d)
df1:
  a b
  1 0
  2 0
  3 1
  4 0
  5 2
df2:
  c d
  1 3
  1 4
  2 3
  2 5
  3 4
magic
output:
1
1
1
3
1
Where magic takes the start and end positions from columns 1 and 2 of df2 and sums the corresponding rows of df1's second column.
Edit: @Frank's data.table solution: short and fast.
library(data.table)
setDT(df1); setDT(df2)
# by = .EACHI computes sum(b) once for each row of df2
df2[, s := df1[df2, on=.(a >= c, a <= d), sum(b), by=.EACHI]$V1]
# output
#    c d s
# 1: 1 3 1
# 2: 1 4 1
# 3: 2 3 1
# 4: 2 5 3
# 5: 3 4 1
Another way (may be slower but works):
library(data.table)
setDT(df1)
setDT(df2)
## magic function
get_magic <- function(x) {
  spell <- c()
  one <- unlist(x[1])
  two <- unlist(x[2])
  a <- df1[between(a, one, two), sum(b)]
  spell <- append(spell, a)
  return(spell)
}
# apply get_magic to each row of df2
d <- apply(df2, 1, get_magic)
print(d)
# output
[1] 1 1 1 3 1
One possible solution is to use mapply. I have used a named helper function, but one can also write an inline (anonymous) function as part of the mapply call, as shown in the sketch after the result below.
row_sum <- function(x, y){
  sum(df1[x:y, 2])
}
mapply(row_sum, df2$c, df2$d)
#Result
#[1] 1 1 1 3 1
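As mentioned above, the helper can also be written inline as an anonymous function; a minimal equivalent call:
# same result with the function defined inline
mapply(function(x, y) sum(df1[x:y, 2]), df2$c, df2$d)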
Data
a = seq(1,5)
b = c(0,0,1,0,2)
df1 <- data.frame(a, b)
c = c(1,1,2,2,3)
d = c(3,4,3,5,4)
df2 <- data.frame(c,d)

R: Grouping data within certain range

I have a data frame with two columns, let's call them X and Y. Here's an example of it:
df <- data.frame(X = LETTERS[1:8],
                 Y = c(14, 12, 12, 11, 9, 6, 4, 1),
                 stringsAsFactors = FALSE)
which produces this:
X Y
A 14
B 12
C 12
D 11
E 9
F 6
G 4
H 1
Note that the data frame will always be ordered in descending order based on Y. I want to group together cases where the Y values lie within a certain range of each other, while updating the X column to reflect the grouping too. For example, if the threshold value is 2, I would like the final output to be:
X new_Y
A 14.00000
B C D 11.66667
E 9.00000
F G 5.00000
H 1.00000
Let me explain how I got that. From the starting df data frame, the closest values were B and C. Joining them would result in:
X new_Y
A 14
B C 12
D 11
E 9
F 6
G 4
H 1
The new_Y value for cases B and C is the average of the original values for B and C i.e. 12. From this second data frame, B C are within 2 from D so they are the next to be grouped together:
X new_Y
A 14.00000
B C D 11.66667
E 9.00000
F 6.00000
G 4.00000
H 1.00000
Note that the Y value for B C D is 11.67 because the original values of B, C and D were 12, 12 and 11 respectively and their average is 11.667. I wouldn't want the code to return the average Y from the previous iteration (which in this case would be 11.5).
Finally, F and G can also be grouped together, producing the final output stated above.
I'm not sure of the code needed to achieve this. My only thoughts were to calculate the distance from the previous and following element, look for the minimum and check whether it exceeds the threshold value (of 2 in the example above). Based on where that minimum appears, join the X column while averaging the Y values from the original table. Repeat this until the minimum becomes larger than the threshold.
But I'm not sure how to write the necessary code to achieve this or whether there's a more efficient solution to the algorithm I'm suggesting above. Any help will be much appreciated.
P.S. I forgot to mention that if the distance to the previous and to the following Y value is the same, then the grouping should be done towards the larger Y value. So
X Y
A 10
B 8
C 6
would be returned as
X new_Y
A B 9
C 6
Thanks in advance for your patience. My apologies if I didn't explain this very well.
This sounds like hierarchical agglomerative clustering.
To get the groups, use dist, hclust and cutree.
Note that centroid clustering with hclust expects the distances as the square of the Euclidean distance.
df <- data.frame(X = LETTERS[1:8],
                 Y = c(14, 12, 12, 11, 9, 6, 4, 1),
                 stringsAsFactors = FALSE)

dCutoff <- 2                            # grouping threshold
d2 <- dist(df$Y)^2                      # squared Euclidean distances
hc <- hclust(d2, method = "centroid")   # centroid linkage clustering
group_id <- cutree(hc, h = dCutoff^2)   # cut the tree at the squared threshold
group_id
#> [1] 1 2 2 2 3 4 4 5
To munge the original table, we can use dplyr.
library('dplyr')
df %>%
  group_by(group_id = group_id) %>%
  summarise(X = paste(X, collapse = ' '),
            Y = mean(Y))
#> # A tibble: 5 x 3
#> group_id X Y
#> <int> <chr> <dbl>
#> 1 1 A 14.00000
#> 2 2 B C D 11.66667
#> 3 3 E 9.00000
#> 4 4 F G 5.00000
#> 5 5 H 1.00000
This gives the average from the previous iteration, though. In any case, I hope it helps:
library(data.table)
df <- data.table(X = LETTERS[1:8],
                 Y = c(14, 12, 12, 11, 9, 6, 4, 1),
                 stringsAsFactors = FALSE)

differences <- c(diff(df$Y), NA)  # NA for the last element
df$difference <- abs(differences) # differences of consecutive elements (works because Y is sorted)
minimum <- min(df$difference[1:(length(df$difference) - 1)]) # get the minimum

while (minimum <= 2){ # <= so that a gap of exactly 2 (e.g. F and G) is still grouped
  index <- which(df$difference == minimum)[1L] # where the minimum occurs (first, i.e. larger-Y, position on ties)
  check <- FALSE
  # the last row cannot have a difference because there is no element after it;
  # if it does not take part in the minimum difference, exclude it here and re-append it later
  if(df[nrow(df) - 1, difference] != minimum){
    last_row <- df[nrow(df)]
    df <- df[-nrow(df)]
    check <- TRUE
  }
  tmp <- df[index:(index + 1)]
  df <- df[-(index:(index + 1))]
  to_bind <- data.table(X = paste0(tmp$X, collapse = " "))
  to_bind$Y <- mean(tmp$Y)
  df <- rbind(df[, .(X, Y)], to_bind)
  if(check){
    df <- rbind(df, last_row[, .(X, Y)])
  }
  setorder(df, -Y)
  differences <- c(diff(df$Y), NA)                             # NA for the last element
  df$difference <- abs(differences)                            # recompute consecutive differences
  minimum <- min(df$difference[1:(length(df$difference) - 1)]) # and the new minimum
}
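After the loop finishes, the grouped result can be viewed without the helper column (df here is the object built in the loop above):
# inspect the final grouping; the difference column is only bookkeeping
df[, .(X, new_Y = Y)]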

Expanding window (cumulative calculation) in data.table: how to improve performance

I have grouped data collected at different time steps. Within each time step, there are several registrations of values. Each value may occur one or more times within and among time steps.
Some toy data:
df <- data.frame(grp = rep(1:2, each = 8),
                 time = c(rep(1, 3), rep(2, 2), rep(3, 3)),
                 val = c(1, 2, 1, 2, 3, 2, 3, 4, 1, 2, 3, 1, 1, 1, 2, 3))
df
# grp time val
# 1 1 1 1
# 2 1 1 2
# 3 1 1 1
# 4 1 2 2
# 5 1 2 3
# 6 1 3 2
# 7 1 3 3
# 8 1 3 4
# 9 2 1 1
# 10 2 1 2
# 11 2 1 3
# 12 2 2 1
# 13 2 2 1
# 14 2 3 1
# 15 2 3 2
# 16 2 3 3
Objectives
I wish to do some calculations within an expanding time window, i.e. within time step 1, within time 1 and 2 together, within 1, 2, and 3 together, and so on. Within each window, I wish to calculate the number of unique values, the number of values which have occurred more than once, and the proportion of values which have occurred more than once.
For example, in my toy data, in group (grp) 1, in the second time window (time = 1 & 2 together) three unique values (val 1, 2, 3) have been registered (n_val = 3). Two of them (1, 2) occur more than once (n_re = 2), resulting in a "re_rate" of 0.67 (see below).
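As a quick check of that example, the window-2 numbers for grp 1 can be computed directly from the toy values (the vector below is just val for grp 1 with time <= 2):
v12 <- c(1, 2, 1, 2, 3)               # val for grp 1, time steps 1 and 2
tab <- table(v12)
length(tab)                           # n_val = 3
sum(tab > 1)                          # n_re = 2
round(sum(tab > 1) / length(tab), 2)  # re_rate = 0.67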
My data.table code produces the desired result. On a small data set it is slower than my base attempt, which I believe is fair enough, given some possible overhead in the data.table code. With a larger data set, the data.table code catches up, but is still slower. I expected (hoped) that the benefits would show up earlier.
What made me post this question is that I believe the relative performance of my code is a strong indicator that I am abusing data.table (I am sure the reason is not data.table performance itself). The main objective of my question is therefore to get some advice on how to code this in a more data.table-esque way. For example, is it possible to avoid the loop over time windows altogether by vectorizing the calculations, as shown e.g. in the nice answer by @Khashaa here? If not, are there ways to make the loop and assignment more efficient?
My data.table code:
library(data.table)
f_dt <- function(df){
  setDT(df, key = c("grp", "time", "val"))[ , {
    # key or not only affects speed marginally
    # unique time steps
    times <- .SD[ , unique(time)]
    # index vector to loop over
    idx <- seq_along(times)
    # pre-allocate data table
    d2 <- data.table(time = times,
                     n_val = integer(1),
                     n_re = integer(1),
                     re_rate = numeric(1))
    # loop to generate expanding window
    for(i in idx){
      # number of registrations per val
      n <- .SD[time %in% times[seq_len(i)], .(n = .N), by = val][ , n]
      # number of unique val
      set(x = d2, i = i, j = 2L, length(n))
      # number of val registered more than once
      set(x = d2, i = i, j = 3L, sum(n > 1))
    }
    # proportion values registered more than once
    d2[ , re_rate := round(n_re / n_val, 2)]
    d2
  }
  , by = grp]
}
...which gives the desired result:
f_dt(df)
# grp time n_val n_re re_rate
# 1: 1 1 2 1 0.50
# 2: 1 2 3 2 0.67
# 3: 1 3 4 3 0.75
# 4: 2 1 3 0 0.00
# 5: 2 2 3 1 0.33
# 6: 2 3 3 3 1.00
Corresponding base code:
f_by <- function(df){
  do.call(rbind,
          by(data = df, df$grp, function(d){
            times <- unique(d$time)
            idx <- seq_along(times)
            d2 <- data.frame(grp = d$grp[1],
                             time = times,
                             n_val = integer(1),
                             n_re = integer(1),
                             re_rate = numeric(1))
            for(i in idx){
              dat <- d[d$time %in% times[seq_len(i)], ]
              tt <- table(dat$val)
              n_re <- sum(tt > 1)
              n_val <- length(tt)
              re_rate <- round(n_re / n_val, 2)
              d2[i, ] <- data.frame(d2$grp[1], time = times[i], n_val, n_re, re_rate)
            }
            d2
          })
  )
}
Timings:
Tiny toy data from above:
library(microbenchmark)
microbenchmark(f_by(df),
f_dt(df),
times = 10,
unit = "relative")
# Unit: relative
# expr min lq mean median uq max neval
# f_by(df) 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 10
# f_dt(df) 1.481724 1.450203 1.474037 1.452887 1.521378 1.502686 10
Some larger data:
set.seed(123)
df <- data.frame(grp = sample(1:100, 100000, replace = TRUE),
                 time = sample(1:100, 100000, replace = TRUE),
                 val = sample(1:100, 100000, replace = TRUE))
microbenchmark(f_by(df),
f_dt(df),
times = 10,
unit = "relative")
# Unit: relative
# expr min lq mean median uq max neval
# f_by(df) 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 10
# f_dt(df) 1.094424 1.099642 1.107821 1.096997 1.097693 1.194983 10
No, the data is still not large, but I would expect data.table to catch up by now. If coded properly... I believe this suggests that there is a large potential for improvement of my code. Any advice is highly appreciated.
f <- function(df){
  setDT(df)[, n_val := cumsum(!duplicated(val)), grp
            ][, occ := 1:.N, .(grp, val)
            ][, occ1 := cumsum(occ == 1) - cumsum(occ == 2), grp
            ][, n_re := n_val - occ1
            ][, re_rate := round(n_re / n_val, 2)
            ][, .(n_val = n_val[.N], n_re = n_re[.N], re_rate = re_rate[.N]), .(grp, time)]
}
where
cumsum(!duplicated(val)) counts the cumulative number of unique values seen so far, n_val.
occ counts the cumulative occurrences of each value (note that it is grouped by val).
occ1 then counts the number of values in val that have occurred only once so far.
The number of values that have occurred only once increases by 1 when occ == 1 and decreases by 1 when occ == 2; hence cumsum(occ == 1) - cumsum(occ == 2).
The number of values which have occurred more than once is n_val - occ1.
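To see what this bookkeeping produces, here is a small base R illustration on a made-up vector (ave is used only so the snippet is self-contained; in the answer the same counts are computed per group with data.table):
val <- c(1, 2, 1, 2, 3)
cumsum(!duplicated(val))                     # 1 2 2 2 3 -> n_val: distinct values so far
occ  <- ave(val, val, FUN = seq_along)       # 1 1 2 2 1 -> running count of each value
occ1 <- cumsum(occ == 1) - cumsum(occ == 2)  # 1 2 1 0 1 -> values seen exactly once so far
cumsum(!duplicated(val)) - occ1              # 0 0 1 2 2 -> n_re: values seen more than once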
Speed Comparison
set.seed(123)
df <- data.frame(grp = sample(1:100, 100000, replace = TRUE),
                 time = sample(1:100, 100000, replace = TRUE),
                 val = sample(1:100, 100000, replace = TRUE))
system.time(f(df))
# user system elapsed
# 0.038 0.000 0.038
system.time(f_dt(df))
# user system elapsed
# 16.617 0.013 16.727
system.time(f_by(df))
# user system elapsed
# 16.077 0.040 16.122
Hope this helps.
I was looking for a better way to code an expanding window over non-duplicated groups and came across this question.
This question seems to be more about an expanding window where the group (i.e. time in the question) is duplicated. Below is a solution making use of between.
#expanding group by where groups are duplicated
library(data.table)
setDT(df)
df[ , {
  # get list of unique time groups to be used in the expanding window
  uniqt <- unique(time)
  c(list(time = uniqt),  # output time as well
    # expanding window over each unique time group
    do.call(rbind, lapply(uniqt, function(n) {
      # tabulate the occurrences
      x <- table(val[between(time, uniqt[1L], n)])
      # calculate desired values
      n_val <- length(x)
      n_re <- sum(x > 1)
      data.frame(n_val = n_val, n_re = n_re, re_rate = n_re / n_val)
    })))
}, by = grp]
result:
# grp time n_val n_re re_rate
# 1: 1 1 2 1 0.5000000
# 2: 1 2 3 2 0.6666667
# 3: 1 3 4 3 0.7500000
# 4: 2 1 3 0 0.0000000
# 5: 2 2 3 1 0.3333333
# 6: 2 3 3 3 1.0000000
I was unable to find in which version of data.table between was first released; hence, between might have been released after this question was posted.

Converting values between a certain range to a letter

Is there an easy way to let values within a certain range equal a letter? So in the following example, how would I convert all values in df so that:
Values less than or equal to 1 = A.
Values greater than 1 and less than or equal to 5 = B.
Values greater than 5 = C.
A small example dataset:
df1 <- rnorm (100, mean = 1, sd = 0.3)
df2 <- rnorm (100, mean = 5, sd = 1.6)
df <- cbind(df1,df2)
as.data.frame(apply(df,2, function(x) cut(x, c(-Inf,1,5,Inf), labels=c('A','B','C'))))
# df1 df2
# 1 A C
# 2 A C
# 3 B B
# 4 A C
# 5 A C
# 6 A B
# 7 A C
# 8 B B
# 9 B C
# 10 A C
Remember to use -Inf and Inf when creating cut points for your outer boundary. It's wrapped in an apply function to repeat the process over each column.
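As an alternative sketch (not part of the answer above), findInterval can produce the same bins; left.open = TRUE mimics cut()'s right-closed intervals:
# map each value to a bin index in 1..3, then to a letter
bins <- function(x) c("A", "B", "C")[findInterval(x, c(-Inf, 1, 5), left.open = TRUE)]
as.data.frame(lapply(as.data.frame(df), bins))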

Groupby bins and aggregate in R

I have data like (a,b,c)
a b c
1 2 1
2 3 1
9 2 2
1 6 2
where the range of 'a' is divided into n (say 3) equal parts, an aggregate function (say max) is computed over the b values, and the result is also grouped by 'c'.
So the output looks like
a_bin  b_m(c=1)  b_m(c=2)
1-3    3         6
4-6    NaN       NaN
7-9    NaN       2
This is an M x N table, where M = the number of a bins and N = the number of unique c values (or the full range).
How do I approach this? Can any R package help me through?
A combination of aggregate, cut and reshape seems to work:
df <- data.frame(a = c(1, 2, 9, 1),
                 b = c(2, 3, 2, 6),
                 c = c(1, 1, 2, 2))
breaks <- c(0, 3, 6, 9)
# aggregate data: max of b by a-bin and c
ag <- aggregate(df$b, FUN = max,
                by = list(a = cut(df$a, breaks, include.lowest = TRUE), c = df$c))
# reshape to wide format: one column per value of c
res <- reshape(ag, idvar = "a", timevar = "c", direction = "wide")
There may be easier ways.
If your dataset is dat
res <- sapply(split(dat[, -3], dat$c), function(x) {
  a_bin <- with(x, cut(a, breaks = c(1, 3, 6, 9), include.lowest = TRUE,
                       labels = c("1-3", "4-6", "7-9")))
  c(by(x$b, a_bin, FUN = max))
})
res1 <- setNames(data.frame(row.names(res), res),
                 c("a_bin", "b_m(c=1)", "b_m(c=2)"))
row.names(res1) <- 1:nrow(res1)
res1
  a_bin b_m(c=1) b_m(c=2)
1   1-3        3        6
2   4-6       NA       NA
3   7-9       NA        2
I would use a combination of data.table and reshape2, which are both fully optimized for speed (no loops from the apply family needed).
The output won't return the unused bins (see the note after the code for one way to add them back).
# the question's data
temp <- data.frame(a = c(1, 2, 9, 1), b = c(2, 3, 2, 6), c = c(1, 1, 2, 2))
v <- c(1, 4, 7, 10) # creating bins
temp$int <- findInterval(temp$a, v)
library(data.table)
temp <- setDT(temp)[, list(b_m = max(b)), by = c("c", "int")]
library(reshape2)
temp <- dcast.data.table(temp, int ~ c, value.var = "b_m")
## colnames(temp) <- c("a_bin", "b_m(c=1)", "b_m(c=2)") # optional, for a prettier table
## temp$a_bin <- c("1-3", "7-9")                        # optional, for a prettier table
##   a_bin b_m(c=1) b_m(c=2)
## 1   1-3        3        6
## 2   7-9       NA        2
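If the empty 4-6 bin should appear as well, one option (a sketch using the objects defined above) is to merge the dcast result onto the complete set of interval indices:
# bins with no data get NA in the value columns
all_bins <- data.table(int = seq_len(length(v) - 1))
temp <- merge(all_bins, temp, by = "int", all.x = TRUE)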
