binned frequency from multiple variables in same column - r

Using R, I have melted my table into two columns, variable and value. There are multiple variables and values represent lengths. I would like to end up with three columns: variable, fixed pre-determined bins and frequency of each bin per variable.
Example Data:
original table/data frame:
Var Value
a 509
a 1000
b 251
b 330
...
new table/data frame:
Var Bin Count
a 0-250 50
a 251-500 30
a 501-1000 1
b 0-250 80
b 251-500 55
b 501-1000 30
...

To my surprise, I haven't found a duplicate question googling the question's title.
So, here are three possible solutions using cut() implemented in base R, dplyr, and data.table.
# pre-define break points of bins
breaks <- seq(0L, 1000L, by = 250L)
Note that cut() creates left-open, right-closed intervals by default, e.g.,
250 < x <= 500.
For the first interval, the parameter include.lowest = TRUE forces a left closed interval as well, e.g.,
0 <= x <= 250
which seems to be in line with OP's expected output.
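As a small illustration of the interval rules described above, here is a sketch that bins a few boundary values. The uneven break points and the labels are assumptions chosen to mimic the bin names in the question (they presume integer lengths); they are not part of the solutions that follow.

```r
# Assumed break points and labels, mimicking the bins shown in the question
breaks <- c(0, 250, 500, 1000)
labels <- c("0-250", "251-500", "501-1000")

# Boundary values: 0 is only kept because include.lowest = TRUE
values <- c(0, 250, 251, 500, 501, 1000)
bins <- cut(values, breaks = breaks, labels = labels, include.lowest = TRUE)

data.frame(values, bins)
```

Without include.lowest = TRUE, the value 0 would fall outside every interval and be returned as NA.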
base R
with(DF, table(Var, cut(Value, breaks, include.lowest = TRUE)))
Var [0,250] (250,500] (500,750] (750,1e+03]
a 0 0 1 1
b 1 2 0 0
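If the long Var / Bin / Count layout from the question is wanted, the contingency table can probably be reshaped with as.data.frame(). A minimal sketch, rebuilding the sample data from the Data section with data.frame() instead of fread(); the column name Count is chosen here to match the question.

```r
# Sample data as in the Data section, built without data.table
DF <- data.frame(Var = c("a", "a", "b", "b", "b"),
                 Value = c(509, 1000, 0, 251, 330))
breaks <- seq(0, 1000, by = 250)

# Wide contingency table: one row per Var, one column per bin
tab <- with(DF, table(Var, Bin = cut(Value, breaks, include.lowest = TRUE)))

# Long format: one row per Var/Bin combination, empty bins kept with Count 0
long <- as.data.frame(tab, responseName = "Count")
long
```

Unlike the dplyr and data.table variants, this keeps the bins that have no observations.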
dplyr
library(dplyr)
DF %>%
group_by(Var, Bin = cut(Value, breaks, include.lowest = TRUE)) %>%
count()
# A tibble: 4 x 3
# Groups: Var, Bin [4]
Var Bin n
<chr> <fctr> <int>
1 a (500,750] 1
2 a (750,1e+03] 1
3 b [0,250] 1
4 b (250,500] 2
data.table
library(data.table)
setDT(DF)[, .N, by = .(Var, Bin = cut(Value, breaks, include.lowest = TRUE))]
Var Bin N
1: a (500,750] 1
2: a (750,1e+03] 1
3: b [0,250] 1
4: b (250,500] 2
Data
library(data.table)
DF <- fread("Var Value
a 509
a 1000
b 0
b 251
b 330")

Related

How can I perform following operation in R? [duplicate]

I want to calculate mean (or any other summary statistics of length one, e.g. min, max, length, sum) of a numeric variable ("value") within each level of a grouping variable ("group").
The summary statistic should be assigned to a new variable which has the same length as the original data. That is, each row of the original data should have a value corresponding to the current group value - the data set should not be collapsed to one row per group. For example, consider group mean:
Before
id group value
1 a 10
2 a 20
3 b 100
4 b 200
After
id group value grp.mean.values
1 a 10 15
2 a 20 15
3 b 100 150
4 b 200 150
You may do this in dplyr using mutate:
library(dplyr)
df %>%
group_by(group) %>%
mutate(grp.mean.values = mean(value))
...or use data.table to assign the new column by reference (:=):
library(data.table)
setDT(df)[ , grp.mean.values := mean(value), by = group]
Have a look at the ave function. Something like
df$grp.mean.values <- ave(df$value, df$group)
If you want to use ave to calculate something else per group, you need to specify FUN = your-desired-function, e.g. FUN = min:
df$grp.min <- ave(df$value, df$group, FUN = min)
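For reference, here is a self-contained sketch of the ave() approach on the four-row example from the question. The column name grp.n is a hypothetical extra, added only to show a different FUN.

```r
# Example data from the question
df <- data.frame(id = 1:4,
                 group = c("a", "a", "b", "b"),
                 value = c(10, 20, 100, 200))

# Default FUN is mean; the result has one entry per original row
df$grp.mean.values <- ave(df$value, df$group)

# Any function returning a single value per group is recycled the same way
df$grp.n <- ave(df$value, df$group, FUN = length)
df
```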
One option is to use plyr. ddply expects a data.frame (the first d) and returns a data.frame (the second d). Other XXply functions work in a similar way; i.e. ldply expects a list and returns a data.frame, dlply does the opposite...and so on and so forth. The second argument is the grouping variable(s). The third argument is the function we want to compute for each group.
require(plyr)
ddply(dat, "group", transform, grp.mean.values = mean(value))
id group value grp.mean.values
1 1 a 10 15
2 2 a 20 15
3 3 b 100 150
4 4 b 200 150
Here is another option using base functions aggregate and merge:
merge(x, aggregate(value ~ group, data = x, mean),
by = "group")
group id value.x value.y
1 a 1 10 15
2 a 2 20 15
3 b 3 100 150
4 b 4 200 150
You can get "better" column names with suffixes:
merge(x, aggregate(value ~ group, data = x, mean),
by = "group", suffixes = c("", ".mean"))
group id value value.mean
1 a 1 10 15
2 a 2 20 15
3 b 3 100 150
4 b 4 200 150

Outer function to identify matches: Issue with scalability

I want to perform matching between two groups in a data frame consisting of 10 million rows, where all rows belonging to one group (binary) are matched with observations from the other group (with replacement) if their difference on another column is smaller than a pre-set threshold. The end result should be a data frame with 2 columns: (1) id number and (2) id number of the matched row. To do this, I use the outer function. See the toy example below:
set.seed(123)
# Creating data
df <- data.frame(id = c(1:10000000),
group = rbinom(10000000,1, 0.3),
value = round(runif(10000000),2))
threshold <- round(sd(df$value)*0.1,2)
#################################################################
# Identifying matches
library(tidyverse)
library(data.table)
# All values
dist_mat <- df$value
# Adding identifier
names(dist_mat) <- df$id
# Dropping combinations that are not of interest
dist_mat_col <-dist_mat[df$group == 0]
dist_mat_row <- dist_mat[df$group == 1]
# Difference between each value
dist_mat <- abs(outer(dist_mat_row, dist_mat_col, "-"))
# Identifying matches that fulfil the criterion
dist_mat <- dist_mat <= threshold
# From matrix to a long dataframe
dist_mat <- melt(dist_mat)
# Tidying up the dataframe and dropping unnecessary columns and rows.
dist_mat <- dist_mat %>%
rename(id = Var1,
matched_id = Var2,
cond = value) %>%
filter(cond == TRUE) %>%
left_join(df, by = "id") %>%
select(id, matched_id)
This code works for smaller datasets but is having issues when scaling up the data size (for obvious reasons). You can try to reduce the data frame size to 100 or 1000 rows and it should run more smoothly. The issue is related to the outer function and is stated as: Error: cannot allocate vector of size 156431.9 Gb.
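The size in that error message can be roughly reproduced with some arithmetic. The sketch below assumes the expected group sizes implied by rbinom(n, 1, 0.3) and 8 bytes per double; the actual group sizes differ slightly from these expectations, which would explain the small difference from the reported figure.

```r
n <- 1e7          # rows in the toy data
p <- 0.3          # probability of group 1 in rbinom()

n1 <- n * p       # expected rows in group 1 (rows of the outer matrix)
n0 <- n * (1 - p) # expected rows in group 0 (columns of the outer matrix)

cells <- n1 * n0              # entries in the full difference matrix
gib <- cells * 8 / 2^30       # doubles take 8 bytes each
gib                           # roughly 156,000 GiB
```

The requirement grows with the product of the two group sizes, so halving the data only cuts the memory needed to a quarter.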
As a way to solve this, I tried to do the matching row-wise, i.e., one row at a time. But this takes a tremendously long time (2500 rows in 8h, where I have 3 million rows to loop through...). See code below:
dist_mat <- df$value
names(dist_mat) <- df$id
# Dropping combinations that are not of interest
dist_mat_col <-dist_mat[df$group == 0]
dist_mat_row <- dist_mat[df$group == 1]
# Difference between each value
matched_df <- data.frame()
for (i in 1:length(dist_mat_row)) {
print(i)
dist_mat <- as.matrix(abs(outer(dist_mat_row[i], dist_mat_col, "-")))
colnames(dist_mat) <- names(dist_mat_col)
rownames(dist_mat) <- names(dist_mat_row[i])
dist_mat <- dist_mat <= threshold
# From matrix to a long dataframe
dist_mat <- melt(dist_mat)
# Tidying up the dataframe and dropping unnecessary columns and rows.
dist_mat <- dist_mat %>%
rename(id = Var1,
matched_id = Var2,
cond = value) %>%
filter(cond == TRUE) %>%
left_join(df, by = "id") %>%
select(id, matched_id)
matched_df <- rbind(matched_df, dist_mat)
rm(dist_mat)
gc()
}
Is there a way of doing this that neither runs out of memory nor takes a tremendous amount of time? So far, I've been trying to "trim some meat" off the data to reduce its size; perhaps there are more ways to do this? Another option is to drop the brute-force approach altogether and use a different method. Does anyone have any suggestions or ideas?
Thanks!
This is my actual answer.
First, we need a function that will generate a data set with the appropriate proportion of the number of unique values. Here it is.
library(tidyverse)
library(collapse)
fdf = function(n, nup=.1) {
vp = 1/n/nup
tibble(
id = c(1:n),
group = rbinom(n, 1, 0.3),
value = round(runif(n)/vp)*vp)
}
For example, let's generate a set of 350 records with a ratio of unique values equal to 0.15
fdf(350, .15) %>% funique(cols=3) %>% nrow()
output
[1] 53
Now for a second example. 1000 lines with approximately 100 unique values.
fdf(1000, .1) %>% funique(cols=3) %>% nrow()
output
[1] 101
Now the most important and crucial part: a binary search function that returns the range of indexes of x whose values lie within tresh of val.
fbin = function(x, val, tresh = 0){
vmin = val - tresh
vmax = val + tresh
n = length(x)
e = .Machine$double.eps
if((x[1]-vmax)>=e | (vmin-x[n])>=e) NULL else{
l = 1
r = n
if(abs(x[1]-vmin)<=e | abs(x[1]-vmax)<=e |
((x[1]-vmin)>=e & (vmax-x[1])>=e)) imin=1 else {
while(l <= r){
i = (l + r) %/% 2
if((vmin-x[i])>e){
l = i + 1
} else {
if(!(vmin-x[i-1])>e){
r = i - 1
} else break
}
}
imin=i
}
l = imin
r = n
if(abs(x[n]-vmin)<=e | abs(x[n]-vmax)<=e |
((x[n]-vmin)>=e & (vmax-x[n])>=e)) imax = n else {
while(l <= r){
i = (l + r) %/% 2
if((x[i]-vmax)>e){
r = i - 1
} else {
if(!((x[i+1]-vmax)>e)){
l = i + 1
} else break
}
}
imax=i
}
imin:imax
}
}
First, a few notes about this function. The val and tresh arguments are of type double, so, because of floating-point inaccuracy, ordinary comparisons
such as x[i]>vmax or x[i]==vmax cannot be used here.
My search function requires the argument x to be sorted in ascending order (the default of sort(), as used in the tests below)!
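To illustrate the point about comparing doubles, here is a minimal base R sketch. The tolerance tol is an assumption for this example; fbin itself uses .Machine$double.eps.

```r
a <- 0.1 + 0.2

# Exact comparison fails because neither side is exactly representable
exact <- (a == 0.3)

# Comparison with a tolerance behaves as expected
tol <- sqrt(.Machine$double.eps)   # assumed tolerance for this sketch
within_tol <- abs(a - 0.3) <= tol

# Base R's own tolerant comparison
tolerant <- isTRUE(all.equal(a, 0.3))

c(exact = exact, within_tol = within_tol, tolerant = tolerant)
```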
Let's do some unit tests.
set.seed(123)
x = sample(1:10, 30, replace=T) %>% sort()
x
#[1] 1 2 3 3 3 3 3 4 4 5 5 5 6 6 7 7 7 8 9 9 9 9 9 9 9 10 10 10 10 10
x[fbin(x, 100, 0)]
#integer(0)
x[fbin(x, -10, 0)]
#integer(0)
x[fbin(x, 1, 0)]
#[1] 1
x[fbin(x, 10, 0)]
#[1] 10 10 10 10 10
x[fbin(x, 1, 1)]
#[1] 1 2
x[fbin(x, 10, 1)]
# [1] 9 9 9 9 9 9 9 10 10 10 10 10
x[fbin(x, 5, 0)]
#[1] 5 5 5
x[fbin(x, 5, 2)]
#[1] 3 3 3 3 3 4 4 5 5 5 6 6 7 7 7
x[fbin(x, 5, 10)]
# [1] 1 2 3 3 3 3 3 4 4 5 5 5 6 6 7 7 7 8 9 9 9 9 9 9 9 10 10 10 10 10
As you can see, the function returns the indexes for which the vector x values fall within the range of <val-tresh, val+tresh>.
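As a cross-check, the same index range can be obtained with base R's findInterval(), which is itself a compiled binary search. This is an alternative sketch, not the fbin function above: range_idx is a hypothetical helper name, it assumes x is sorted in ascending order, and it compares doubles exactly, so it ignores the floating-point tolerance that fbin applies.

```r
# The sorted test vector printed above
x <- c(1, 2, 3, 3, 3, 3, 3, 4, 4, 5, 5, 5, 6, 6, 7, 7, 7, 8,
       9, 9, 9, 9, 9, 9, 9, 10, 10, 10, 10, 10)

# Hypothetical helper: indexes of x with val - tresh <= x <= val + tresh
range_idx <- function(x, val, tresh = 0) {
  # number of elements strictly below the lower bound, plus one
  lo <- findInterval(val - tresh, x, left.open = TRUE) + 1L
  # number of elements less than or equal to the upper bound
  hi <- findInterval(val + tresh, x)
  if (lo > hi) integer(0) else lo:hi
}

x[range_idx(x, 5, 2)]
x[range_idx(x, 10, 1)]
x[range_idx(x, 100)]
```

On this vector the results agree with the fbin outputs listed in the unit tests.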
Now it's time for a specific test. We'll see how fbin does a 10,000,000-element vector search.
set.seed(123)
n = 10000000
x = runif(n) %>% round(6) %>% sort()
funique(x) %>% length()
x[fbin(x, .5)]
#[1] 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5
x[fbin(x, .5, .000001)]
# [1] 0.499999 0.499999 0.499999 0.499999 0.499999 0.499999 0.499999 0.499999 0.499999
# [10] 0.499999 0.500000 0.500000 0.500000 0.500000 0.500000 0.500000 0.500000 0.500000
# [19] 0.500000 0.500000 0.500000 0.500000 0.500000 0.500001 0.500001 0.500001 0.500001
# [28] 0.500001 0.500001 0.500001 0.500001
Now let's see how long such a search will take.
library(microbenchmark)
ggplot2::autoplot(microbenchmark(fbin(x, .5, .001),
fbin(x, .5, .002),
fbin(x, .5, .003),
fbin(x, .5, .004),
times=10))
As you can see, the search takes about 1000 us.
Now let's compare that to the subset functions.
ggplot2::autoplot(microbenchmark(x[fbin(x, .5, .001)],
ss(x, x>=(0.5-0.001) & x<=(0.5+0.001)),
subset(x, x>=(0.5-0.001) & x<=(0.5+0.001)),
times=10))
As you can see, it is two or three orders of magnitude faster!
Now for the function that actually solves your task.
fmatch = function(df, tresh){
#Adding a column with the row number
df = df %>% ftransform(row = 1:nrow(.))
#Splitting into two sorted subsets
df0 = df %>% roworder(value) %>% fsubset(group == 0)
df1 = df %>% roworder(value) %>% fsubset(group == 1)
#Transformations on matrices
M0 = df0 %>% qM()
M1 = df1 %>% qM()
#Prepare unique values from group 1
uM1 = df1$value %>% funique()
out = list()
for(i in 1:length(uM1)){
iM0 = fbin(M0[,3], uM1[i], tresh)
if(length(iM0)>0){
iM1 = fbin(M1[,3], uM1[i])
out[[paste0(uM1[i])]] = list(
row0 = M0[iM0, 4],
row1 = M1[iM1, 4]
)
}
}
out
}
How does this function work?
I will describe it step by step.
1. Add a column with the row numbers to the data frame.
2. Split the frame into two sorted subsets, one for group 1 and one for group 0.
3. Convert them into matrices to speed up the operations (you may not need this).
4. Extract the unique values from the group 1 subset.
5. For each unique value in the group 1 subset, do:
5.1 In the group 0 subset, find all rows whose value differs from the current unique value by at most the threshold.
5.2 Only if such rows exist, store a list containing the row numbers from the group 1 subset whose value equals the current unique value, together with the matching row numbers from the group 0 subset.
Let's see this for an example
#Preparation of data and threshold
set.seed(123)
df = fdf(100)
threshold = round(sd(df$value)*0.1,2)
out = fmatch(df, threshold)
df[out[[1]]$row1,]
# # A tibble: 1 x 3
# id group value
# <int> <int> <dbl>
# 1 16 1 0.1
df[out[[1]]$row0,]
# # A tibble: 6 x 3
# id group value
# <int> <int> <dbl>
# 1 10 0 0.1
# 2 13 0 0.1
# 3 28 0 0.1
# 4 29 0 0.1
# 5 48 0 0.1
# 6 55 0 0.1
df[out[[2]]$row1,]
# # A tibble: 3 x 3
# id group value
# <int> <int> <dbl>
# 1 24 1 0.2
# 2 58 1 0.2
# 3 68 1 0.2
df[out[[2]]$row0,]
# # A tibble: 9 x 3
# id group value
# <int> <int> <dbl>
# 1 27 0 0.2
# 2 44 0 0.2
# 3 46 0 0.2
# 4 47 0 0.2
# 5 49 0 0.2
# 6 54 0 0.2
# 7 60 0 0.2
# 8 72 0 0.2
# 9 99 0 0.2
Now I will change the threshold to 0.2 and repeat the test.
out = fmatch(df, 0.2)
df[out[[1]]$row1,]
# # A tibble: 1 x 3
# id group value
# <int> <int> <dbl>
# 1 16 1 0.1
df[out[[1]]$row0,]
# # A tibble: 24 x 3
# id group value
# <int> <int> <dbl>
# 1 43 0 0
# 2 10 0 0.1
# 3 13 0 0.1
# 4 28 0 0.1
# 5 29 0 0.1
# 6 48 0 0.1
# 7 55 0 0.1
# 8 27 0 0.2
# 9 44 0 0.2
# 10 46 0 0.2
# # ... with 14 more rows
df[out[[2]]$row1,]
# # A tibble: 3 x 3
# id group value
# <int> <int> <dbl>
# 1 24 1 0.2
# 2 58 1 0.2
# 3 68 1 0.2
df[out[[2]]$row0,]
# # A tibble: 32 x 3
# id group value
# <int> <int> <dbl>
# 1 43 0 0
# 2 10 0 0.1
# 3 13 0 0.1
# 4 28 0 0.1
# 5 29 0 0.1
# 6 48 0 0.1
# 7 55 0 0.1
# 8 27 0 0.2
# 9 44 0 0.2
# 10 46 0 0.2
# # ... with 22 more rows
Now it's time to test with 100,000 rows.
set.seed(123)
df = fdf(100000)
threshold = round(sd(df$value)*0.1,2)
start_time <- Sys.time()
out = fmatch(df, threshold)
end_time <- Sys.time()
end_time - start_time
#Time difference of 13.9958 secs
object.size(out)
#319309040 bytes
As you can see, the whole thing took only 14 seconds. The output list is 320 MB. This could be crucial.
I ran another test on a set of 500,000 rows.
set.seed(123)
df = fdf(500000)
threshold = round(sd(df$value)*0.1,2)
start_time <- Sys.time()
out = fmatch(df, threshold)
end_time <- Sys.time()
end_time - start_time
#Time difference of 7.982853 mins
length(out)
#47509
object.size(out)
#7889344576 bytes
As you can see, the fivefold increase in the data set has made the run time 34 times longer. The output list has grown 24 times and now takes almost 8 GB!
There is a very important conclusion from this. Probably for 10,000,000 lines you will not have enough memory to complete the operation. So I suggest slightly modifying the fmatch function so that it returns results only for a specific subset of unique values.
Perhaps we could also optimize the binary search functionality a bit more. But I would need to know what your values are in the variable value in your dataframe.
However, as you can see, the critical factor here is not the execution time, but the memory availability.
I will be waiting for your opinion.
Also write if my solution is clear to you and if you need any additional explanations.
Last update
I did one more test tonight. It required only a minimal modification to my fmatch function: I added two arguments, vmin and vmax. The function now runs only for unique values in the range <vmin, vmax).
fmatch1 = function(df, tresh, vmin=0, vmax=1){
#Adding a column with the row number
df = df %>% ftransform(row = 1:nrow(.))
#Splitting into two sorted subsets
df0 = df %>% roworder(value) %>% fsubset(group == 0)
df1 = df %>% roworder(value) %>% fsubset(group == 1)
#Transformations on matrices
M0 = df0 %>% qM()
M1 = df1 %>% qM()
#Prepare unique values from group 1
uM1 = df1$value %>% funique() %>% ss(.>=vmin & .<vmax)
out = list()
for(i in 1:length(uM1)){
iM0 = fbin(M0[,3], uM1[i], tresh)
if(length(iM0)>0){
iM1 = fbin(M1[,3], uM1[i])
out[[paste0(uM1[i])]] = list(
row0 = M0[iM0, 4],
row1 = M1[iM1, 4]
)
}
}
out
}
Now I was able to perform a data frame test with 10,000,000 rows.
However, I limited myself to values in the range <0, 0.005).
set.seed(123)
df = fdf(10000000)
threshold = round(sd(df$value)*0.1,2)
start_time <- Sys.time()
out = fmatch1(df, threshold, 0, .005)
end_time <- Sys.time()
end_time - start_time
#Time difference of 6.865934 mins
length(out)
#4706
object.size(out)
#8557061840 bytes
The whole thing took almost 7 minutes and the result took up almost 9 GB of memory!
If we now assume that it will be relatively linear, we can expect that for all unique values in the data frame with 10,000,000 lines, the function runtime will be approx. 24 hours and the result should be approx. 1,800 GB. Unfortunately, my computer does not have that much memory.
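The extrapolation above can be written out as a short calculation. It rests on the same linearity assumption stated in the text: the test covered the value range <0, 0.005), i.e. 0.5% of the full range <0, 1).

```r
frac <- 0.005          # share of the value range covered by the test
minutes <- 6.865934    # measured run time for that share
bytes <- 8557061840    # measured size of the result for that share

# Linear scaling to the full range of values
hours_total <- minutes / frac / 60
gb_total <- bytes / frac / 1e9

c(hours = hours_total, GB = gb_total)
```

This gives a little under 23 hours and about 1,700 GB, in line with the rounded figures quoted above.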
In fact, what I am writing now will not be the actual answer. This is going to be quite a long comment. Unfortunately, I would not fit it in one or even several comments. Therefore, I am asking everyone to be understanding and not to criticize what I am writing here.
Now to the point.
I looked at your problem. I was even able to write a program that does the job in much less time. With 100,000 rows, the program ran for only a few minutes. Compared with the 8 hours you needed for 2,500 rows, that is a clear difference. The problem, however, probably lies in the assumptions of the task itself.
As you write yourself, you have 10,000,000 rows. However, among those 10,000,000 rows you have only about 100 unique values, which is due to round(runif(n), 2). So the first question to ask: is it the same for your real data?
Next, you say you want to match an id from group 0 to an id from group 1 if the difference between the values is less than the specified threshold (let's assume for a moment that the threshold is 0.3). So let's check what that gives in the output. If you have only 100 unique values and 10,000,000 rows, you can expect group 0 to contain around 50,000 rows with the value 0.99. Each of these rows, of course, has a different id. In group 1, however, you will have approximately 3,450,000 rows with values less than 0.69. Now, if you want to match each of these 50,000 ids to the 3,450,000 group 1 ids, you will get 172,500,000,000 matches in total! Recall that we matched only the ids from group 0 for which the value was 0.99.
Finally, my code, run on 100,000 rows, generated a result set of 10,000,000 rows! And although it did so in minutes, it strained my computer's memory a lot.
In addition, I wonder whether you perhaps meant to match the ids not as you describe, but when the absolute value of the difference between the values is less than the accepted threshold, i.e. abs(value1 - value0) < threshold?
If you are very curious, here is my code that I wrote about above.
library(tidyverse)
library(collapse)
set.seed(123)
n = 100000
df = tibble(
id = c(1:n),
group = rbinom(n,1, 0.3),
value = round(runif(n),2))
threshold = round(sd(df$value)*0.1,2)
m1 = df %>%
fsubset(group == 1) %>%
roworder(value) %>%
ftransform(row = 1:nrow(.))
m1.idx = m1 %>% funique(cols=3)
m1.M = m1 %>% qM()
m0 = df %>%
fsubset(group == 0) %>%
roworder(value)
m0.idx = m0 %>% funique(cols=3)
m0.M = m0 %>% qM
out = list()
for(i in 1:nrow(m0.M)){
id0 = m0.M[i,1]
value0 = m0.M[i,3]
value1 = round(value0 - threshold, 2)
idx = m1.idx %>% fsubset(value<=value1) %>% qM
if(nrow(idx)>1){
last.row = idx[nrow(idx), 4]-1
out[[paste0(id0)]] = m0 %>% ss(1:last.row,1)
}
}
dfout = unlist2d(out) %>% frename(.id = id0, id = id1) %>% qTBL()
However, I would suggest a slightly different solution. Perhaps it is enough to store only each of the 100 unique values from one of the groups and to assign to each of them all ids from group 0 that have this value, and all ids from group 1 for which the difference in value is less than the set threshold, or for which the absolute difference in values is smaller than this threshold.
Unfortunately, I do not know if such a solution would be acceptable for you. I will be waiting for a comment from you.
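The idea of working at the level of unique values can be sketched in base R as follows. This is an illustration under assumptions, not the code above: it uses a small n, a fixed threshold of 0.03, the absolute-difference reading of the matching rule, and a tolerance of 1e-9 for comparing doubles. It stores the ids once per unique value and counts the matched pairs without ever listing them.

```r
set.seed(123)
n <- 1000                      # small n so a brute-force check stays feasible
df <- data.frame(id = seq_len(n),
                 group = rbinom(n, 1, 0.3),
                 value = round(runif(n), 2))
threshold <- 0.03              # assumed threshold for this sketch

# ids stored once per unique value, separately for each group
ids0 <- split(df$id[df$group == 0], df$value[df$group == 0])
ids1 <- split(df$id[df$group == 1], df$value[df$group == 1])
v0 <- as.numeric(names(ids0))
v1 <- as.numeric(names(ids1))

# Value-level match table: at most about 100 x 100 cells, whatever n is
close <- abs(outer(v1, v0, "-")) <= threshold + 1e-9

# Number of id pairs implied by the table, without materialising them
n_pairs <- sum(outer(lengths(ids1), lengths(ids0)) * close)
n_pairs

# ids from group 0 matching the first unique value of group 1
matched0 <- unlist(ids0[close[1, ]], use.names = FALSE)
```

The memory needed depends on the number of unique values rather than on the number of rows; the full pair list is only expanded for the values that are actually requested.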

Check whether the data in column1 lies in a range And add data in column 2

I have a data frame in R containing 2 columns. I want to check whether the data in column one lies in the following ranges: x>80, 70<x<80, 60<x<70, x<60
count1
Var1 Freq
1 0.00000 7
2 10.00000 1
3 16.66667 1
4 30.95238 1
5 33.33333 2
Data frame contains 32 rows in total with values in column 1 ranging from 0 to 100.
output should be something like this :
Var1 Freq
1 x<60 12
2 60<x<70 *something*
3 70<x<80 *something*
4 x>80 *something*
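With the data.table library
df is your data frame (the count1 table above, with columns Var1 and Freq):
library(data.table)
breaks <- c(0,60,70,80,Inf)
setDT(df)
# include.lowest = TRUE keeps the rows where Var1 is exactly 0
df[,list(SUM = sum(Freq)),by = list(VAR=cut(Var1,breaks = breaks,include.lowest = TRUE))][order(VAR)]
With the dplyr library:
library(dplyr)
df %>%
group_by(VAR = cut(Var1, breaks = breaks, include.lowest = TRUE)) %>%
summarise(SUM = sum(Freq)) %>%
arrange(VAR)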
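For comparison, here is a base R sketch that also produces the range labels from the question. It uses only the five rows of count1 shown in the question (the full data has 32 rows), assumes Var1 is numeric, and the labels vector is an assumption made to match the desired output. Note that with right-closed intervals a value of exactly 60 is counted in the first range.

```r
# The five rows of count1 shown in the question
count1 <- data.frame(Var1 = c(0, 10, 16.66667, 30.95238, 33.33333),
                     Freq = c(7, 1, 1, 1, 2))

breaks <- c(0, 60, 70, 80, Inf)
labels <- c("x<60", "60<x<70", "70<x<80", "x>80")  # assumed labels

# include.lowest = TRUE keeps the rows where Var1 is exactly 0
count1$Range <- cut(count1$Var1, breaks = breaks, labels = labels,
                    include.lowest = TRUE)

# Sum Freq per range; ranges without rows are reported as 0
res <- xtabs(Freq ~ Range, data = count1)
as.data.frame(res)
```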

calculate rolling sum based on row index in R

I am trying to calculate a grouped rolling sum based on a window size k but, in the event that the within group row index (n) is less than k, I want to calculate the rolling sum using the condition k=min(n,k).
My issue is similar to this question R dplyr rolling sum but I am looking for a solution that provides a non-NA value for each row.
I can get part of the way there using dplyr and rollsum:
library(zoo)
library(dplyr)
df <- data.frame(Date=rep(seq(as.Date("2000-01-01"),
as.Date("2000-12-01"),by="month"),2),
ID=c(rep(1,12),rep(2,12)),value=1)
df <- tbl_df(df)
df <- df %>%
group_by(ID) %>%
mutate(total3mo=rollsum(x=value,k=3,align="right",fill=NA))
df
Source: local data frame [24 x 4]
Groups: ID [2]
Date ID value total3mo
(date) (dbl) (dbl) (dbl)
1 2000-01-01 1 1 NA
2 2000-02-01 1 1 NA
3 2000-03-01 1 1 3
4 2000-04-01 1 1 3
5 2000-05-01 1 1 3
6 2000-06-01 1 1 3
7 2000-07-01 1 1 3
8 2000-08-01 1 1 3
9 2000-09-01 1 1 3
10 2000-10-01 1 1 3
.. ... ... ... ...
In this case, what I would like is to return the value 1 for observations on 2000-01-01 and the value 2 for observations on 2000-02-01. More generally, I would like the rolling sum to be calculated over the largest window possible but no larger than k.
In this particular case it's not too difficult to change some NA values by hand. However, ultimately I would like to add several more columns to my data frame that will be rolling sums calculated over various windows. In this more general case it will get quite tedious to go back and change many NA values by hand.
Using the partial=TRUE argument of rollapplyr :
df %>%
group_by(ID) %>%
mutate(roll = rollapplyr(value, 3, sum, partial = TRUE)) %>%
ungroup()
or without dplyr (still need zoo):
roll <- function(x) rollapplyr(x, 3, sum, partial = TRUE)
transform(df, roll = ave(value, ID, FUN = roll))
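To show what the partial window computes, here is a sketch without zoo that builds the same right-aligned sum from cumulative sums. The helper name roll_sum_partial is hypothetical, and the sketch assumes there are no NA values in x.

```r
# Hypothetical helper: right-aligned rolling sum over min(row index, k) values
roll_sum_partial <- function(x, k) {
  n <- length(x)
  cs <- cumsum(x)
  # cumulative sum k positions earlier; zero for the first k rows
  lagged <- c(rep(0, min(k, n)), cs[seq_len(max(n - k, 0))])
  cs - lagged
}

roll_sum_partial(1:5, 3)

# Grouped use with the data from the question
df <- data.frame(Date = rep(seq(as.Date("2000-01-01"),
                                as.Date("2000-12-01"), by = "month"), 2),
                 ID = c(rep(1, 12), rep(2, 12)), value = 1)
df$roll <- ave(df$value, df$ID, FUN = function(v) roll_sum_partial(v, 3))
head(df, 4)
```

Within each ID the first two rows get 1 and 2, and every later row gets 3.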

Getting a summary data frame for all the combinations of categories represented in two columns

I am working with a data frame corresponding to the example below:
set.seed(1)
dta <- data.frame("CatA" = rep(c("A","B","C"), 4), "CatNum" = rep(1:2,6),
"SomeVal" = runif(12))
I would like to quickly build a data frame that would have sum values for all the combinations of the categories derived from the CatA and CatNum as well as for the categories derived from each column separately. On the primitive example above, for the first couple of combinations, this can be achieved with use of simple code:
df_sums <- data.frame(
"Category" = c("Total for A",
"Total for A and 1",
"Total for A and 2"),
"Sum" = c(sum(dta$SomeVal[dta$CatA == 'A']),
sum(dta$SomeVal[dta$CatA == 'A' & dta$CatNum == 1]),
sum(dta$SomeVal[dta$CatA == 'A' & dta$CatNum == 2]))
)
This produces an informative data frame of sums:
Category Sum
1 Total for A 2.1801780
2 Total for A and 1 1.2101839
3 Total for A and 2 0.9699941
This solution would be grossly inefficient when applied to a data frame with multiple categories. I would like to achieve the following:
Cycle through all the categories, including categories derived from each column separately as well as from both columns at the same time
Achieve some flexibility with respect to how the function is applied; for instance, I may want to apply mean instead of sum
Save the Total for string as a separate object that I could easily edit when applying a function other than sum.
I was initially thinking of using dplyr, along the lines of:
require(dplyr)
df_sums_experiment <- dta %>%
group_by(CatA, CatNum) %>%
summarise(TotVal = sum(SomeVal))
But it's not clear to me how I could apply multiple groupings simultaneously. As stated, I'm interested in grouping by each column separately and by the combination of both columns. I would also like to create a string column that would indicate what is combined and in what order.
You could use tidyr to unite the columns and gather the data. Then use dplyr to summarise:
library(dplyr)
library(tidyr)
dta %>% unite(measurevar, CatA, CatNum, remove=FALSE) %>%
gather(key, val, -SomeVal) %>%
group_by(val) %>%
summarise(sum(SomeVal))
val sum(SomeVal)
(chr) (dbl)
1 1 2.8198078
2 2 3.0778622
3 A 2.1801780
4 A_1 1.2101839
5 A_2 0.9699941
6 B 1.4405782
7 B_1 0.4076565
8 B_2 1.0329217
9 C 2.2769138
10 C_1 1.2019674
11 C_2 1.0749464
Just loop over the column combinations, compute the quantities you want and then rbind them together:
library(data.table)
dt = as.data.table(dta) # or setDT to convert in place
cols = c('CatA', 'CatNum')
rbindlist(apply(combn(c(cols, ""), length(cols)), 2,
function(i) dt[, sum(SomeVal), by = c(i[i != ""])]), fill = T)
# CatA CatNum V1
# 1: A 1 1.2101839
# 2: B 2 1.0329217
# 3: C 1 1.2019674
# 4: A 2 0.9699941
# 5: B 1 0.4076565
# 6: C 2 1.0749464
# 7: A NA 2.1801780
# 8: B NA 1.4405782
# 9: C NA 2.2769138
#10: NA 1 2.8198078
#11: NA 2 3.0778622
Split, then use lapply:
#result
res <- do.call(rbind,
lapply(
c(split(dta,dta$CatA),
split(dta,dta$CatNum),
split(dta,dta[,1:2])),
function(i)sum(i[,"SomeVal"])))
#prettify the result
res1 <- data.frame(Category=paste0("Total for ",rownames(res)),
Sum=res[,1])
res1$Category <- sub("."," and ",res1$Category,fixed=TRUE)
row.names(res1) <- seq_along(row.names(res1))
res1
# Category Sum
# 1 Total for A 2.1801780
# 2 Total for B 1.4405782
# 3 Total for C 2.2769138
# 4 Total for 1 2.8198078
# 5 Total for 2 3.0778622
# 6 Total for A and 1 1.2101839
# 7 Total for B and 1 0.4076565
# 8 Total for C and 1 1.2019674
# 9 Total for A and 2 0.9699941
# 10 Total for B and 2 1.0329217
# 11 Total for C and 2 1.0749464
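The flexibility asked for in the question (a different summary function and an editable label prefix) could be wrapped in a small function. This is a base R sketch under assumptions: summarise_combos and its arguments are hypothetical names, and it relies on interaction() to build the combined category labels.

```r
# Hypothetical helper: apply FUN to val for every non-empty subset of cols
summarise_combos <- function(data, cols, val, FUN = sum,
                             prefix = "Total for ") {
  # all non-empty subsets of the grouping columns
  subsets <- unlist(lapply(seq_along(cols),
                           function(m) combn(cols, m, simplify = FALSE)),
                    recursive = FALSE)
  pieces <- lapply(subsets, function(s) {
    # one label per observed combination, e.g. "A and 1"
    key <- interaction(data[s], sep = " and ", drop = TRUE)
    res <- tapply(data[[val]], key, FUN)
    data.frame(Category = paste0(prefix, names(res)),
               Value = as.vector(res),
               stringsAsFactors = FALSE)
  })
  do.call(rbind, pieces)
}

set.seed(1)
dta <- data.frame(CatA = rep(c("A", "B", "C"), 4), CatNum = rep(1:2, 6),
                  SomeVal = runif(12))

out <- summarise_combos(dta, c("CatA", "CatNum"), "SomeVal")
out_mean <- summarise_combos(dta, c("CatA", "CatNum"), "SomeVal",
                             FUN = mean, prefix = "Mean for ")
out
```

Adding a third grouping column only requires extending cols, at the cost of one row per observed combination in every subset.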
