looking for mutate_if, but for rows not columns [duplicate] - r

I'm in the process of trying out a dplyr-based workflow (rather than using mostly data.table, which I'm used to), and I've come across a problem that I can't find an equivalent dplyr solution to. I commonly run into the scenario where I need to conditionally update/replace several columns based on a single condition. Here's some example code, with my data.table solution:
library(data.table)
# Create some sample data
set.seed(1)
dt <- data.table(site = sample(1:6, 50, replace=T),
                 space = sample(1:4, 50, replace=T),
                 measure = sample(c('cfl', 'led', 'linear', 'exit'), 50,
                                  replace=T),
                 qty = round(runif(50) * 30),
                 qty.exit = 0,
                 delta.watts = sample(10.5:100.5, 50, replace=T),
                 cf = runif(50))
# Replace the values of several columns for rows where measure is "exit"
dt <- dt[measure == 'exit',
         `:=`(qty.exit = qty,
              cf = 0,
              delta.watts = 13)]
Is there a simple dplyr solution to this same problem? I'd like to avoid using ifelse because I don't want to have to type the condition multiple times - this is a simplified example, but there are sometimes many assignments based on a single condition.

These solutions (1) maintain the pipeline, (2) do not overwrite the input and (3) only require that the condition be specified once:
1a) mutate_cond Create a simple function for data frames or data tables that can be incorporated into pipelines. This function is like mutate but only acts on the rows satisfying the condition:
mutate_cond <- function(.data, condition, ..., envir = parent.frame()) {
  condition <- eval(substitute(condition), .data, envir)
  .data[condition, ] <- .data[condition, ] %>% mutate(...)
  .data
}
DF %>% mutate_cond(measure == 'exit', qty.exit = qty, cf = 0, delta.watts = 13)
1b) mutate_last This is an alternative function for data frames or data tables which again is like mutate but is only used within group_by (as in the example below) and only operates on the last group rather than every group. Note that TRUE > FALSE so if group_by specifies a condition then mutate_last will only operate on rows satisfying that condition.
mutate_last <- function(.data, ...) {
  n <- n_groups(.data)
  indices <- attr(.data, "indices")[[n]] + 1
  .data[indices, ] <- .data[indices, ] %>% mutate(...)
  .data
}
DF %>%
  group_by(is.exit = measure == 'exit') %>%
  mutate_last(qty.exit = qty, cf = 0, delta.watts = 13) %>%
  ungroup() %>%
  select(-is.exit)
2) factor out condition Factor out the condition by making it an extra column which is later removed. Then use ifelse, replace or arithmetic with logicals as illustrated. This also works for data tables.
library(dplyr)
DF %>% mutate(is.exit = measure == 'exit',
              qty.exit = ifelse(is.exit, qty, qty.exit),
              cf = (!is.exit) * cf,
              delta.watts = replace(delta.watts, is.exit, 13)) %>%
  select(-is.exit)
3) sqldf We could use SQL update via the sqldf package in the pipeline for data frames (but not data tables unless we convert them -- this may represent a bug in dplyr. See dplyr issue 1579). It may seem that we are undesirably modifying the input in this code due to the existence of the update but in fact the update is acting on a copy of the input in the temporarily generated database and not on the actual input.
library(sqldf)
DF %>%
  do(sqldf(c("update '.'
                set 'qty.exit' = qty, cf = 0, 'delta.watts' = 13
                where measure = 'exit'",
             "select * from '.'")))
4) row_case_when Also check out row_case_when defined in "Returning a tibble: how to vectorize with case_when?". It uses a syntax similar to case_when but applies to rows.
library(dplyr)
DF %>%
  row_case_when(
    measure == "exit" ~ data.frame(qty.exit = qty, cf = 0, delta.watts = 13),
    TRUE ~ data.frame(qty.exit, cf, delta.watts)
  )
Note 1: We used this as DF
set.seed(1)
DF <- data.frame(site = sample(1:6, 50, replace=T),
                 space = sample(1:4, 50, replace=T),
                 measure = sample(c('cfl', 'led', 'linear', 'exit'), 50,
                                  replace=T),
                 qty = round(runif(50) * 30),
                 qty.exit = 0,
                 delta.watts = sample(10.5:100.5, 50, replace=T),
                 cf = runif(50))
Note 2: The problem of how to easily specify updating a subset of rows is also discussed in dplyr issues 134, 631, 1518 and 1573 with 631 being the main thread and 1573 being a review of the answers here.

You can do this with magrittr's two-way pipe %<>%:
library(dplyr)
library(magrittr)
dt[dt$measure=="exit",] %<>% mutate(qty.exit = qty,
cf = 0,
delta.watts = 13)
This reduces the amount of typing, but is still much slower than data.table.

Here's a solution I like:
mutate_when <- function(data, ...) {
  dots <- eval(substitute(alist(...)))
  for (i in seq(1, length(dots), by = 2)) {
    condition <- eval(dots[[i]], envir = data)
    mutations <- eval(dots[[i + 1]], envir = data[condition, , drop = FALSE])
    data[condition, names(mutations)] <- mutations
  }
  data
}
It lets you write things like:
mtcars %>% mutate_when(
  mpg > 22,    list(cyl = 100),
  disp == 160, list(cyl = 200)
)
which is quite readable -- although it may not be as performant as it could be.
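Applied to the question's data it would look something like this (a sketch, assuming the DF from Note 1 above):

DF %>% mutate_when(
  measure == 'exit', list(qty.exit = qty, cf = 0, delta.watts = 13)
)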

As eipi10 shows above, there's not a simple way to do a subset replacement in dplyr because DT uses pass-by-reference semantics vs dplyr using pass-by-value. dplyr requires the use of ifelse() on the whole vector, whereas DT will do the subset and update by reference (returning the whole DT). So, for this exercise, DT will be substantially faster.
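For reference, this is roughly what that whole-vector approach looks like (a sketch using the dt from the question; note the condition has to be repeated once per column):

dt %>%
  mutate(qty.exit    = ifelse(measure == 'exit', qty, qty.exit),
         cf          = ifelse(measure == 'exit', 0, cf),
         delta.watts = ifelse(measure == 'exit', 13, delta.watts))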
You could alternatively subset first, then update, and finally recombine:
dt.sub <- dt[dt$measure == "exit",] %>%
mutate(qty.exit= qty, cf= 0, delta.watts= 13)
dt.new <- rbind(dt.sub, dt[dt$measure != "exit",])
But DT is going to be substantially faster:
(edited to use eipi10's new answer)
library(data.table)
library(dplyr)
library(microbenchmark)
microbenchmark(dt = {dt <- dt[measure == 'exit',
                              `:=`(qty.exit = qty,
                                   cf = 0,
                                   delta.watts = 13)]},
               eipi10 = {dt[dt$measure == "exit", ] %<>% mutate(qty.exit = qty,
                                                                cf = 0,
                                                                delta.watts = 13)},
               alex = {dt.sub <- dt[dt$measure == "exit", ] %>%
                         mutate(qty.exit = qty, cf = 0, delta.watts = 13)
                       dt.new <- rbind(dt.sub, dt[dt$measure != "exit", ])})
Unit: microseconds
   expr      min        lq      mean    median       uq      max neval cld
     dt  591.480  672.2565  747.0771  743.341  780.973 1837.539   100  a
 eipi10 3481.212 3677.1685 4008.0314 3796.909 3936.796 6857.509   100   b
   alex 3412.029 3637.6350 3867.0649 3726.204 3936.985 5424.427   100   b

I just stumbled across this and really like mutate_cond() by @G. Grothendieck, but thought it might come in handy to also handle new variables. So, below has two additions:
Unrelated: Second last line made a bit more dplyr by using filter()
Three new lines at the beginning get variable names for use in mutate() and initialize any new variables in the data frame before mutate() occurs. New variables are initialized for the remainder of the data frame using new_init, which defaults to missing (NA).
mutate_cond <- function(.data, condition, ..., new_init = NA, envir = parent.frame()) {
  # Initialize any new variables as new_init
  new_vars <- substitute(list(...))[-1]
  new_vars %<>% sapply(deparse) %>% names %>% setdiff(names(.data))
  .data[, new_vars] <- new_init

  condition <- eval(substitute(condition), .data, envir)
  .data[condition, ] <- .data %>% filter(condition) %>% mutate(...)
  .data
}
Here are some examples using the iris data:
Change Petal.Length to 88 where Species == "setosa". This will work in the original function as well as this new version.
iris %>% mutate_cond(Species == "setosa", Petal.Length = 88)
Same as above, but also create a new variable x (NA in rows not included in the condition). Not possible before.
iris %>% mutate_cond(Species == "setosa", Petal.Length = 88, x = TRUE)
Same as above, but rows not included in the condition for x are set to FALSE.
iris %>% mutate_cond(Species == "setosa", Petal.Length = 88, x = TRUE, new_init = FALSE)
This example shows how new_init can be set to a list to initialize multiple new variables with different values. Here, two new variables are created with excluded rows being initialized using different values (x initialised as FALSE, y as NA)
iris %>% mutate_cond(Species == "setosa" & Sepal.Length < 5,
x = TRUE, y = Sepal.Length ^ 2,
new_init = list(FALSE, NA))

One concise solution would be to do the mutation on the filtered subset and then add back the non-exit rows of the table:
library(dplyr)
dt %>%
  filter(measure == 'exit') %>%
  mutate(qty.exit = qty, cf = 0, delta.watts = 13) %>%
  rbind(dt %>% filter(measure != 'exit'))

You could split the dataset and do a regular mutate call on the TRUE part.
The split can be done with either dplyr::group_split() or base::split(). I like the base version better here since it preserves names; see the discussion at https://github.com/tidyverse/dplyr/issues/4223 .
df1 <- data.frame(site = sample(1:6, 50, replace=T),
                  space = sample(1:4, 50, replace=T),
                  measure = sample(c('cfl', 'led', 'linear', 'exit'), 50,
                                   replace=T),
                  qty = round(runif(50) * 30),
                  qty.exit = 0,
                  delta.watts = sample(10.5:100.5, 50, replace=T),
                  cf = runif(50),
                  stringsAsFactors = F)
library(tidyverse)
df1 %>%
  group_split(measure == "exit", .keep = FALSE) %>%
  modify_at(2, ~ mutate(., qty.exit = qty, cf = 0, delta.watts = 13)) %>%
  bind_rows()
#> # A tibble: 50 × 7
#> site space measure qty qty.exit delta.watts cf
#> <int> <int> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 5 1 linear 22 0 100. 0.126
#> 2 3 3 led 12 0 61.5 0.161
#> 3 6 1 led 26 0 25.5 0.307
#> 4 5 2 cfl 16 0 26.5 0.865
#> 5 6 3 linear 19 0 57.5 0.684
#> 6 1 4 led 12 0 14.5 0.802
#> 7 6 4 led 5 0 90.5 0.547
#> 8 5 4 linear 28 0 54.5 0.171
#> 9 1 2 linear 5 0 24.5 0.775
#> 10 1 2 cfl 24 0 96.5 0.144
#> # … with 40 more rows
df1 %>%
  split(~ measure == "exit") %>%
  modify_at("TRUE", ~ mutate(., qty.exit = qty, cf = 0, delta.watts = 13)) %>%
  bind_rows()
#> site space measure qty qty.exit delta.watts cf
#> 1 5 1 linear 22 0 100.5 0.125646491
#> 2 3 3 led 12 0 61.5 0.160692291
#> 3 6 1 led 26 0 25.5 0.307239765
#> 4 5 2 cfl 16 0 26.5 0.864969074
#> 5 6 3 linear 19 0 57.5 0.683945200
#> 6 1 4 led 12 0 14.5 0.802398642
#> 7 6 4 led 5 0 90.5 0.547211378
#> 8 5 4 linear 28 0 54.5 0.170614207
#> 9 1 2 linear 5 0 24.5 0.774603932
#> 10 1 2 cfl 24 0 96.5 0.144310557
#> 11 3 4 linear 21 0 93.5 0.682622390
#> 12 4 4 led 2 0 48.5 0.941718646
#> 13 4 4 cfl 2 0 100.5 0.918448627
#> 14 5 2 led 11 0 63.5 0.998143780
#> 15 4 1 led 21 0 53.5 0.644740176
#> 16 1 3 cfl 5 0 28.5 0.110610285
#> 17 1 3 linear 24 0 41.5 0.538868200
#> 18 4 3 led 29 0 19.5 0.998474289
#> 19 2 3 cfl 4 0 22.5 0.008167536
#> 20 5 1 led 20 0 56.5 0.740833476
#> 21 3 2 led 5 0 44.5 0.223967706
#> 22 1 4 led 27 0 32.5 0.199850583
#> 23 3 4 cfl 17 0 61.5 0.104023080
#> 24 1 3 cfl 11 0 34.5 0.399036247
#> 25 2 3 linear 29 0 65.5 0.600678235
#> 26 2 4 cfl 23 0 29.5 0.291611352
#> 27 6 2 linear 13 0 37.5 0.225021614
#> 28 2 3 led 17 0 62.5 0.879606956
#> 29 2 4 led 29 0 51.5 0.301759669
#> 30 5 1 led 11 0 54.5 0.793816856
#> 31 2 3 led 20 0 29.5 0.514759195
#> 32 3 4 linear 6 0 68.5 0.475085443
#> 33 1 4 led 21 0 34.5 0.133207588
#> 34 2 4 linear 25 0 80.5 0.164279355
#> 35 5 3 led 7 0 73.5 0.252937836
#> 36 6 2 led 15 0 99.5 0.554864929
#> 37 3 2 linear 6 0 44.5 0.377257874
#> 38 4 4 exit 15 15 13.0 0.000000000
#> 39 3 3 exit 10 10 13.0 0.000000000
#> 40 5 1 exit 15 15 13.0 0.000000000
#> 41 4 2 exit 1 1 13.0 0.000000000
#> 42 5 3 exit 10 10 13.0 0.000000000
#> 43 1 3 exit 14 14 13.0 0.000000000
#> 44 5 2 exit 12 12 13.0 0.000000000
#> 45 2 2 exit 30 30 13.0 0.000000000
#> 46 6 3 exit 28 28 13.0 0.000000000
#> 47 1 1 exit 14 14 13.0 0.000000000
#> 48 3 3 exit 21 21 13.0 0.000000000
#> 49 4 2 exit 13 13 13.0 0.000000000
#> 50 4 3 exit 12 12 13.0 0.000000000
Created on 2022-10-07 by the reprex package (v2.0.1)

mutate_cond is a great function, but it gives an error if there is an NA in the column(s) used to create the condition. I feel that a conditional mutate should simply leave such rows alone. This matches the behavior of filter(), which returns rows when the condition is TRUE, but omits both rows with FALSE and NA.
With this small change the function works like a charm:
mutate_cond <- function(.data, condition, ..., envir = parent.frame()) {
  condition <- eval(substitute(condition), .data, envir)
  condition[is.na(condition)] <- FALSE
  .data[condition, ] <- .data[condition, ] %>% mutate(...)
  .data
}

I don't actually see any changes to dplyr that would make this much easier. case_when is great for when there are multiple different conditions and outcomes for one column but it doesn't help for this case where you want to change multiple columns based on one condition. Similarly, recode saves typing if you are replacing multiple different values in one column but doesn't help with doing so in multiple columns at once. Finally, mutate_at etc. only apply conditions to the column names not the rows in the dataframe. You could potentially write a function for mutate_at that would do it but I can't figure out how you would make it behave differently for different columns.
That said, here is how I would approach it using nest from tidyr and map from purrr.
library(data.table)
library(dplyr)
library(tidyr)
library(purrr)
# Create some sample data
set.seed(1)
dt <- data.table(site = sample(1:6, 50, replace=T),
                 space = sample(1:4, 50, replace=T),
                 measure = sample(c('cfl', 'led', 'linear', 'exit'), 50,
                                  replace=T),
                 qty = round(runif(50) * 30),
                 qty.exit = 0,
                 delta.watts = sample(10.5:100.5, 50, replace=T),
                 cf = runif(50))

dt2 <- dt %>%
  nest(-measure) %>%
  mutate(data = if_else(
    measure == "exit",
    map(data, function(x) mutate(x, qty.exit = qty, cf = 0, delta.watts = 13)),
    data
  )) %>%
  unnest()

With the creation of rlang, a slightly modified version of Grothendieck's 1a example is possible, eliminating the need for the envir argument, as enquo() captures the environment that .p is created in automatically.
mutate_rows <- function(.data, .p, ...) {
  .p <- rlang::enquo(.p)
  .p_lgl <- rlang::eval_tidy(.p, .data)
  .data[.p_lgl, ] <- .data[.p_lgl, ] %>% mutate(...)
  .data
}
dt %>% mutate_rows(measure == "exit", qty.exit = qty, cf = 0, delta.watts = 13)

I think this approach has not been mentioned before. It runs almost as fast as the 'default' data.table solution.
Use base::replace()
df %>% mutate(qty.exit = replace(qty.exit, measure == 'exit', qty[measure == 'exit']),
              cf = replace(cf, measure == 'exit', 0),
              delta.watts = replace(delta.watts, measure == 'exit', 13))
replace recycles the replacement value, so when you want the values of column qty entered into column qty.exit, you have to subset qty as well... hence the qty[measure == 'exit'] in the first replacement.
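A small illustration of that recycling (not part of the original answer):

replace(c(1, 2, 3, 4), c(2, 4), 0)          # 1 0 3 0   -- the single 0 is recycled
replace(c(1, 2, 3, 4), c(2, 4), c(20, 40))  # 1 20 3 40 -- replacement taken position by position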
Now, you will probably not want to retype measure == 'exit' all the time, so you can create an index vector containing that selection and use it in the calls above.
# build an index vector matching the condition
index.v <- which(df$measure == 'exit')

df %>% mutate(qty.exit = replace(qty.exit, index.v, qty[index.v]),
              cf = replace(cf, index.v, 0),
              delta.watts = replace(delta.watts, index.v, 13))
benchmarks
# Unit: milliseconds
#          expr      min       lq     mean   median       uq      max neval
#    data.table 1.005018 1.053370 1.137456 1.112871 1.186228 1.690996   100
#        wimpel 1.061052 1.079128 1.218183 1.105037 1.137272 7.390613   100
#  wimpel.index 1.043881 1.064818 1.131675 1.085304 1.108502 4.192995   100

At the expense of breaking with the usual dplyr syntax, you can use within from base:
dt %>% within({
  qty.exit[measure == 'exit'] <- qty[measure == 'exit']
  delta.watts[measure == 'exit'] <- 13
})
It seems to integrate well with the pipe, and you can do pretty much anything you want inside it.

Related

Outer function to identify matches: Issue with scalability

I want to perform matching between two groups in a data frame consisting of 10 million rows, where all rows belonging to one group (binary) are matched with observations from the other group (with replacement) if their difference on another column is smaller than a pre-set threshold. The end result should be a data frame with 2 columns: (1) id number and (2) id number of the matched row. To do this, I use the outer function. See the toy example below:
set.seed(123)
# Creating data
df <- data.frame(id = c(1:10000000),
                 group = rbinom(10000000, 1, 0.3),
                 value = round(runif(10000000), 2))
threshold <- round(sd(df$value)*0.1,2)
#################################################################
# Identifying matches
library(tidyverse)
library(data.table)
# All values
dist_mat <- df$value
# Adding identifier
names(dist_mat) <- df$id
# Dropping combinations that are not of interest
dist_mat_col <- dist_mat[df$group == 0]
dist_mat_row <- dist_mat[df$group == 1]
# Difference between each value
dist_mat <- abs(outer(dist_mat_row, dist_mat_col, "-"))
# Identifying matches that fulfill the criteria
dist_mat <- dist_mat <= threshold
# From matrix to a long dataframe
dist_mat <- melt(dist_mat)
# Tidying up the dataframe and dropping unnecessary columns and rows.
dist_mat <- dist_mat %>%
  rename(id = Var1,
         matched_id = Var2,
         cond = value) %>%
  filter(cond == TRUE) %>%
  left_join(df, by = "id") %>%
  select(id, matched_id)
This code works for smaller datasets but is having issues when scaling up the data size (for obvious reasons). You can try to reduce the data frame size to 100 or 1000 rows and it should run more smoothly. The issue is related to the outer function and is stated as: Error: cannot allocate vector of size 156431.9 Gb.
As a way to solve this, I tried to do the matching row-wise, i.e., one row at a time. But this takes a tremendously long time (2500 rows in 8h, where I have 3 million rows to loop through...). See code below:
dist_mat <- df$value
names(dist_mat) <- df$id
# Dropping combinations that are not of interest
dist_mat_col <- dist_mat[df$group == 0]
dist_mat_row <- dist_mat[df$group == 1]
# Difference between each value
matched_df <- data.frame()
for (i in 1:length(dist_mat_row)) {
  print(i)
  dist_mat <- as.matrix(abs(outer(dist_mat_row[i], dist_mat_col, "-")))
  colnames(dist_mat) <- names(dist_mat_col)
  rownames(dist_mat) <- names(dist_mat_row[i])
  dist_mat <- dist_mat <= threshold
  # From matrix to a long dataframe
  dist_mat <- melt(dist_mat)
  # Tidying up the dataframe and dropping unnecessary columns and rows.
  dist_mat <- dist_mat %>%
    rename(id = Var1,
           matched_id = Var2,
           cond = value) %>%
    filter(cond == TRUE) %>%
    left_join(df, by = "id") %>%
    select(id, matched_id)
  matched_df <- rbind(matched_df, dist_mat)
  rm(dist_mat)
  gc()
}
Is there any way of doing this that does not run out of memory or takes a tremendous time? So far, I've been trying to "trim some meat" off the data to reduce the size, and perhaps there are any more ways to do this? An alternative is to not do this the "brute" way but to find an alternative. Does anyone have any suggestions or ideas?
Thanks!
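As an aside (a sketch, not one of the answers below): a data.table non-equi join can express the same matching without ever building the full outer matrix, although it still materialises every qualifying pair, so memory can still blow up when the number of matches is huge. It assumes the df and threshold defined above.

library(data.table)

dtab <- as.data.table(df)
g1 <- dtab[group == 1, .(id1 = id, lo = value - threshold, hi = value + threshold)]
g0 <- dtab[group == 0, .(id0 = id, value)]

# join every group-1 row to the group-0 rows whose value lies in [lo, hi]
matches <- g0[g1, on = .(value >= lo, value <= hi),
              .(id = id1, matched_id = id0), allow.cartesian = TRUE]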
This is my actual answer.
First, we need a function that will generate a data set with the appropriate proportion of the number of unique values. Here it is.
library(tidyverse)
library(collapse)
fdf = function(n, nup = .1) {
  vp = 1/n/nup
  tibble(
    id = c(1:n),
    group = rbinom(n, 1, 0.3),
    value = round(runif(n)/vp)*vp)
}
For example, let's generate a set of 350 records with a ratio of unique values equal to 0.15
fdf(350, .15) %>% funique(cols=3) %>% nrow()
output
[1] 53
Now for a second example. 1000 lines with approximately 100 unique values.
fdf(1000, .1) %>% funique(cols=3) %>% nrow()
output
[1] 101
Now the most important and crucial part: a binary search function, fbin, that finds the range of values in x that differ from val by at most tresh.
fbin = function(x, val, tresh = 0){
  vmin = val - tresh
  vmax = val + tresh
  n = length(x)
  e = .Machine$double.eps
  if((x[1]-vmax)>=e | (vmin-x[n])>=e) NULL else {
    l = 1
    r = n
    if(abs(x[1]-vmin)<=e | abs(x[1]-vmax)<=e |
       ((x[1]-vmin)>=e & (vmax-x[1])>=e)) imin=1 else {
      while(l <= r){
        i = (l + r) %/% 2
        if((vmin-x[i])>e){
          l = i + 1
        } else {
          if(!(vmin-x[i-1])>e){
            r = i - 1
          } else break
        }
      }
      imin=i
    }
    l = imin
    r = n
    if(abs(x[n]-vmin)<=e | abs(x[n]-vmax)<=e |
       ((x[n]-vmin)>=e & (vmax-x[n])>=e)) imax = n else {
      while(l <= r){
        i = (l + r) %/% 2
        if((x[i]-vmax)>e){
          r = i - 1
        } else {
          if(!((x[i+1]-vmax)>e)){
            l = l + 1
          } else break
        }
      }
      imax=i
    }
    imin:imax
  }
}
First, a few notes about this function. Because val and tresh are doubles, floating-point inaccuracy means that ordinary comparisons such as x[i] > vmax or x[i] == vmax cannot be used here; the comparisons are therefore done against .Machine$double.eps.
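A tiny illustration of why (not from the original answer):

0.1 + 0.2 == 0.3                                # FALSE, due to floating-point representation
abs((0.1 + 0.2) - 0.3) <= .Machine$double.eps   # TRUE, the tolerance fbin compares against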
My search function requires the argument x to be sorted in ascending order!
Let's do some unit tests.
set.seed(123)
x = sample(1:10, 30, replace=T) %>% sort()
x
#[1] 1 2 3 3 3 3 3 4 4 5 5 5 6 6 7 7 7 8 9 9 9 9 9 9 9 10 10 10 10 10
x[fbin(x, 100, 0)]
#integer(0)
x[fbin(x, -10, 0)]
#integer(0)
x[fbin(x, 1, 0)]
#[1] 1
x[fbin(x, 10, 0)]
#[1] 10 10 10 10 10
x[fbin(x, 1, 1)]
#[1] 1 2
x[fbin(x, 10, 1)]
# [1] 9 9 9 9 9 9 9 10 10 10 10 10
x[fbin(x, 5, 0)]
#[1] 5 5 5
x[fbin(x, 5, 2)]
#[1] 3 3 3 3 3 4 4 5 5 5 6 6 7 7 7
x[fbin(x, 5, 10)]
# [1] 1 2 3 3 3 3 3 4 4 5 5 5 6 6 7 7 7 8 9 9 9 9 9 9 9 10 10 10 10 10
As you can see, the function returns the indices for which the values of x fall within the range [val - tresh, val + tresh].
Now it's time for a specific test. We'll see how fbin does a 10,000,000-element vector search.
set.seed(123)
n = 10000000
x = runif(n) %>% round(6) %>% sort()
funique(x) %>% length()
x[fbin(x, .5)]
#[1] 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5
x[fbin(x, .5, .000001)]
# [1] 0.499999 0.499999 0.499999 0.499999 0.499999 0.499999 0.499999 0.499999 0.499999
# [10] 0.499999 0.500000 0.500000 0.500000 0.500000 0.500000 0.500000 0.500000 0.500000
# [19] 0.500000 0.500000 0.500000 0.500000 0.500000 0.500001 0.500001 0.500001 0.500001
# [28] 0.500001 0.500001 0.500001 0.500001
Now let's see how long such a search will take.
library(microbenchmark)
ggplot2::autoplot(microbenchmark(fbin(x, .5, .001),
                                 fbin(x, .5, .002),
                                 fbin(x, .5, .003),
                                 fbin(x, .5, .004),
                                 times = 10))
As you can see, the search takes about 1000 us.
Now let's compare that to the subset functions.
ggplot2::autoplot(microbenchmark(x[fbin(x, .5, .001)],
                                 ss(x, x >= (0.5 - 0.001) & x <= (0.5 + 0.001)),
                                 subset(x, x >= (0.5 - 0.001) & x <= (0.5 + 0.001)),
                                 times = 10))
As you can see, it is two to three orders of magnitude faster!
It's time for the function that solves your task.
fmatch = function(df, tresh){
  # Adding a column with the row number
  df = df %>% ftransform(row = 1:nrow(.))

  # Splitting into two sorted subsets
  df0 = df %>% roworder(value) %>% fsubset(group == 0)
  df1 = df %>% roworder(value) %>% fsubset(group == 1)

  # Transformations on matrices
  M0 = df0 %>% qM()
  M1 = df1 %>% qM()

  # Prepare unique values from group 1
  uM1 = df1$value %>% funique()

  out = list()
  for(i in 1:length(uM1)){
    iM0 = fbin(M0[,3], uM1[i], tresh)
    if(length(iM0)>0){
      iM1 = fbin(M1[,3], uM1[i])
      out[[paste0(uM1[i])]] = list(
        row0 = M0[iM0, 4],
        row1 = M1[iM1, 4]
      )
    }
  }
  out
}
How does this function work? I will describe it step by step.
1. Add a column with the row number to the data frame.
2. Split the frame into two subsets, sorted by value, one for group 0 and one for group 1.
3. Convert them into matrices to speed up the operations (maybe you don't need this).
4. Prepare the unique values from the group 1 subset.
5. For each unique value in the group 1 subset, do:
5.1 In the group 0 set, search for all rows whose value does not differ from the current unique value by more than the threshold.
5.2 If any such rows exist, write one list entry containing the row numbers from the group 1 subset whose value equals the current value, and the row numbers from the group 0 subset.
Let's see an example.
#Preparation of data and threshold
set.seed(123)
df = fdf(100)
threshold = round(sd(df$value)*0.1,2)
out = fmatch(df, threshold)
df[out[[1]]$row1,]
# # A tibble: 1 x 3
# id group value
# <int> <int> <dbl>
# 1 16 1 0.1
df[out[[1]]$row0,]
# # A tibble: 6 x 3
# id group value
# <int> <int> <dbl>
# 1 10 0 0.1
# 2 13 0 0.1
# 3 28 0 0.1
# 4 29 0 0.1
# 5 48 0 0.1
# 6 55 0 0.1
df[out[[2]]$row1,]
# # A tibble: 3 x 3
# id group value
# <int> <int> <dbl>
# 1 24 1 0.2
# 2 58 1 0.2
# 3 68 1 0.2
df[out[[2]]$row0,]
# # A tibble: 9 x 3
# id group value
# <int> <int> <dbl>
# 1 27 0 0.2
# 2 44 0 0.2
# 3 46 0 0.2
# 4 47 0 0.2
# 5 49 0 0.2
# 6 54 0 0.2
# 7 60 0 0.2
# 8 72 0 0.2
# 9 99 0 0.2
Now I will change the threshold to 0.2 and repeat the test.
out = fmatch(df, 0.2)
df[out[[1]]$row1,]
# # A tibble: 1 x 3
# id group value
# <int> <int> <dbl>
# 1 16 1 0.1
df[out[[1]]$row0,]
# # A tibble: 24 x 3
# id group value
# <int> <int> <dbl>
# 1 43 0 0
# 2 10 0 0.1
# 3 13 0 0.1
# 4 28 0 0.1
# 5 29 0 0.1
# 6 48 0 0.1
# 7 55 0 0.1
# 8 27 0 0.2
# 9 44 0 0.2
# 10 46 0 0.2
# # ... with 14 more rows
df[out[[2]]$row1,]
# # A tibble: 3 x 3
# id group value
# <int> <int> <dbl>
# 1 24 1 0.2
# 2 58 1 0.2
# 3 68 1 0.2
df[out[[2]]$row0,]
# # A tibble: 32 x 3
# id group value
# <int> <int> <dbl>
# 1 43 0 0
# 2 10 0 0.1
# 3 13 0 0.1
# 4 28 0 0.1
# 5 29 0 0.1
# 6 48 0 0.1
# 7 55 0 0.1
# 8 27 0 0.2
# 9 44 0 0.2
# 10 46 0 0.2
# # ... with 22 more rows
Now it's time to test with 100,000 rows.
set.seed(123)
df = fdf(100000)
threshold = round(sd(df$value)*0.1,2)
start_time <- Sys.time()
out = fmatch(df, threshold)
end_time <- Sys.time()
end_time - start_time
#Time difference of 13.9958 secs
object.size(out)
#319309040 bytes
As you can see, the whole thing took only 14 seconds. The output list is 320 MB. This could be crucial.
I ran another test on a set of 500,000 rows.
set.seed(123)
df = fdf(500000)
threshold = round(sd(df$value)*0.1,2)
start_time <- Sys.time()
out = fmatch(df, threshold)
end_time <- Sys.time()
end_time - start_time
#Time difference of 7.982853 mins
length(out)
#47509
object.size(out)
#7889344576 bytes
As you can see, the fivefold increase in the data set made the runtime 34 times longer. The output list has grown 24-fold and now takes almost 8 GB!
There is a very important conclusion from this. Probably for 10,000,000 lines you will not have enough memory to complete the operation. So I suggest slightly modifying the fmatch function so that it returns results only for a specific subset of unique values.
Perhaps we could also optimize the binary search functionality a bit more. But I would need to know what your values are in the variable value in your dataframe.
However, as you can see, the critical factor here is not the execution time, but the memory availability.
I will be waiting for your opinion.
Also write if my solution is clear to you and if you need any additional explanations.
Last update
I did one more test tonight. However, it required a minimal modification to my fmatch function: I added two additional arguments, vmin and vmax. The function now only runs for unique values in the range [vmin, vmax).
fmatch1 = function(df, tresh, vmin = 0, vmax = 1){
  # Adding a column with the row number
  df = df %>% ftransform(row = 1:nrow(.))

  # Splitting into two sorted subsets
  df0 = df %>% roworder(value) %>% fsubset(group == 0)
  df1 = df %>% roworder(value) %>% fsubset(group == 1)

  # Transformations on matrices
  M0 = df0 %>% qM()
  M1 = df1 %>% qM()

  # Prepare unique values from group 1
  uM1 = df1$value %>% funique() %>% ss(.>=vmin & .<vmax)

  out = list()
  for(i in 1:length(uM1)){
    iM0 = fbin(M0[,3], uM1[i], tresh)
    if(length(iM0)>0){
      iM1 = fbin(M1[,3], uM1[i])
      out[[paste0(uM1[i])]] = list(
        row0 = M0[iM0, 4],
        row1 = M1[iM1, 4]
      )
    }
  }
  out
}
Now I was able to perform a data frame test with 10,000,000 rows.
However, I limited myself to values in the range [0, 0.005).
set.seed(123)
df = fdf(10000000)
threshold = round(sd(df$value)*0.1,2)
start_time <- Sys.time()
out = fmatch1(df, threshold, 0, .005)
end_time <- Sys.time()
end_time - start_time
#Time difference of 6.865934 mins
length(out)
#4706
object.size(out)
#8557061840 bytes
The whole thing took almost 7 minutes and the result was as much as 9 GB of memory!
If we now assume that the scaling is roughly linear, we can expect that for all unique values in the 10,000,000-row data frame the function runtime will be approx. 24 hours and the result approx. 1,800 GB. Unfortunately, my computer does not have that much memory.
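A quick back-of-the-envelope check of that extrapolation (assuming roughly linear scaling in the fraction of the value range processed):

(1 / 0.005) * 6.865934 / 60      # ~23 hours of runtime for the whole [0, 1) range
(1 / 0.005) * 8557061840 / 1e9   # ~1,700 GB for the full output list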
In fact, what I am writing now is not the actual answer; it is more of a long comment. Unfortunately, it would not fit in one or even several comments, so I am asking everyone to be understanding and not to criticize what I write here.
Now to the point.
I looked at your problem. I was even able to write a program that does your job in much less time. With 100,000 rows, the program ran for only a few minutes, which compared to the 8 hours you spent on 2,500 rows is a clear difference. The problem, however, probably lies in the assumptions of the task itself.
As you write yourself, you have 10,000,000 rows. However, those 10,000,000 rows contain only about 100 unique values, which is due to round(runif(n), 2). So the first question to ask is: is it the same for your real data?
Later you say you want to match a group 0 id to a group 1 id if the difference between the values is less than the specified threshold (let's assume for a moment the threshold is 0.3). So let's check what that gives in the output. If you only have 100 unique values and 10,000,000 rows, you can expect group 0 to contain around 50,000 values of 0.99, each of course with a different id. In group 1, you will have approximately 3,450,000 rows with values less than 0.69. Now, if you want to match each of those 50,000 ids to the 3,450,000 group 1 ids, you get 172,500,000,000 matches in total!! And recall that we matched only the ids from group 0 whose value was 0.99.
Finally, my 100,000-row code generated a result set of only 10,000,000 rows! And although it ran in minutes, it strained my computer's memory a lot.
In addition, I wonder whether you actually wanted to match ids not as you wrote, but when the absolute value of the difference between the values is less than the accepted threshold, i.e. abs(value1 - value0) < threshold?
If you are very curious, here is my code that I wrote about above.
library(tidyverse)
library(collapse)
set.seed(123)
n = 100000
df = tibble(
  id = c(1:n),
  group = rbinom(n, 1, 0.3),
  value = round(runif(n), 2))

threshold = round(sd(df$value)*0.1, 2)

m1 = df %>%
  fsubset(group == 1) %>%
  roworder(value) %>%
  ftransform(row = 1:nrow(.))

m1.idx = m1 %>% funique(cols = 3)
m1.M = m1 %>% qM()

m0 = df %>%
  fsubset(group == 0) %>%
  roworder(value)

m0.idx = m0 %>% funique(cols = 3)
m0.M = m0 %>% qM

out = list()
for(i in 1:nrow(m0.M)){
  id0 = m0.M[i, 1]
  value0 = m0.M[i, 3]
  value1 = round(value0 - threshold, 2)
  idx = m1.idx %>% fsubset(value <= value1) %>% qM
  if(nrow(idx) > 1){
    last.row = idx[nrow(idx), 4] - 1
    out[[paste0(id0)]] = m0 %>% ss(1:last.row, 1)
  }
}

dfout = unlist2d(out) %>% frename(.id = id0, id = id1) %>% qTBL()
However, I would suggest a slightly different solution. Perhaps it will be enough to keep only each of the ~100 unique values from one of the groups and, for each of them, store all ids from group 0 that have this value together with all ids from group 1 whose value differs from it by less than the threshold (or whose absolute difference is smaller than the threshold).
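A rough sketch of that idea (my illustration, not code from this answer; it assumes the df and threshold created above and uses the absolute-difference reading):

library(dplyr)

# one row per unique value in group 0, with the ids carrying that value
ids0 <- df %>%
  filter(group == 0) %>%
  group_by(value) %>%
  summarise(ids0 = list(id), .groups = "drop")

# for every unique group-1 value, store the two id sets instead of all id pairs
vals1 <- sort(unique(df$value[df$group == 1]))
compact <- lapply(vals1, function(v) {
  list(value = v,
       ids1  = df$id[df$group == 1 & df$value == v],
       ids0  = unlist(ids0$ids0[abs(ids0$value - v) < threshold]))
})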
Unfortunately, I do not know if such a solution would be acceptable for you. I will be waiting for a comment from you.

R: Count number of times B follows A using dplyr

I have a data.frame of monthly averages of radon measured over a few months. I have labeled each value either "below" or "above" a threshold and would like to count the number of times the average value does: "below to above", "above to below", "above to above" or "below to below".
df <- data.frame(value = c(130, 200, 240, 230, 130),
                 level = c("below", "above", "above", "above", "below"))
A bit of digging into a Matlab answer on here suggests that we could use the Matrix package:
require(Matrix)
sparseMatrix(i=c(2,2,2,1), j=c(2,2,2))
Produces this result which I can't yet interpret.
[1,] | |
[2,] | .
Any thoughts about a tidyverse method?
Sure, just use group by and count the values
library(dplyr)
df <- data.frame(value = c(130, 200, 240, 230, 130),
                 level = c("below", "above", "above", "above", "below"))

df %>%
  group_by(grp = paste(level, lead(level))) %>%
  summarise(n = n()) %>%
  # drop the observation that does not have a "next" value
  filter(!grepl(pattern = "NA", x = grp))
#> # A tibble: 3 × 2
#> grp n
#> <chr> <int>
#> 1 above above 2
#> 2 above below 1
#> 3 below above 1
You could use table from base R:
table(df$level[-1], df$level[-nrow(df)])
        above below
  above     2     1
  below     1     0
EDIT in response to @HCAI's comment: applying table to multiple columns:
First, generate some data:
set.seed(1)
U = matrix(runif(4*20), nrow = 20)
dfU = data.frame(round(U))

library(plyr) # for mapvalues
df2 = data.frame(apply(dfU,
                       FUN = function(x) mapvalues(x, from = 0:1, to = c('below', 'above')),
                       MARGIN = 2))
so that df2 contains random 'above' and 'below':
X1 X2 X3 X4
1 below above above above
2 below below above below
3 above above above below
4 above below above below
5 below below above above
6 above below above below
7 above below below below
8 above below below above
9 above above above below
10 below below above above
11 below below below below
12 below above above above
13 above below below below
14 below below below below
15 above above below below
16 below above below above
17 above above below above
18 above below above below
19 below above above above
20 above below below above
Now apply table to each column and vectorize the output:
apply(df2,
      FUN = function(x) as.vector(table(x[-1],
                                        x[-nrow(df2)])),
      MARGIN = 2)
which gives us
X1 X2 X3 X4
[1,] 5 2 7 2
[2,] 5 6 4 6
[3,] 6 5 3 6
[4,] 3 6 5 5
All that's left is a bit of care in labeling the rows of the output. Maybe someone can come up with a clever way to merge/join the data frames resulting from apply(df2, FUN=function(x) melt(table(x[-1],x[-nrow(df2)])),2), which would maintain the row names. (I spent some time looking into it but couldn't work out how to do it easily.)
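One possible way to keep those labels (a sketch of the merge idea, not from the original answer; it assumes reshape2's melt and dplyr joins):

library(reshape2)
library(dplyr)
library(purrr)

# melt each column's transition table, keeping the (from, to) labels
tabs <- lapply(df2, function(x) melt(table(from = x[-length(x)], to = x[-1])))

# name each count column after its source column, then join on (from, to)
tabs <- imap(tabs, ~ setNames(.x, c("from", "to", .y)))
reduce(tabs, full_join, by = c("from", "to"))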
Not run, so there may be a typo, but you get the idea. I'll leave it to you to deal with NA and the first observation. Single pass through the vector.
library(dplyr)

df %>%
  summarize(increase = sum(case_when(value > lag(value) ~ 1, TRUE ~ 0)),
            decrease = sum(case_when(value < lag(value) ~ 1, TRUE ~ 0)),
            constant = sum(case_when(value == lag(value) ~ 1, TRUE ~ 0)))
A slightly different version:
library(dplyr)
library(stringr)
df %>%
  group_by(level = str_c(level, lead(level), sep = " ")) %>%
  count(level) %>%
  na.omit()
level n
<chr> <int>
1 above above 2
2 above below 1
3 below above 1
Another possible solution, based on tidyverse:
library(tidyverse)
df <- data.frame(value = c(130, 200, 240, 230, 130),
                 level = c("below", "above", "above", "above", "below"))

df %>%
  mutate(changes = str_c(lag(level), level, sep = "_")) %>%
  count(changes) %>%
  drop_na(changes)
#> changes n
#> 1 above_above 2
#> 2 above_below 1
#> 3 below_above 1
Yet another solution, based on data.table:
library(data.table)
dt <- data.table(value = c(130, 200, 240, 230, 130),
                 level = c("below", "above", "above", "above", "below"))

dt[, changes := paste(shift(level), level, sep = "_")
   ][2:.N][, .(n = .N), keyby = .(changes)]
#> changes n
#> 1: above_above 2
#> 2: above_below 1
#> 3: below_above 1

linear interpolation (approx) by group in a dplyr pipe in R

I have a question that I find kind of hard to explain with an MRE and in an easy way to answer, mostly because I don't fully understand where the problem lies myself. So that's my sorry-for-being-vague preamble.
I have a tibble with many sample and reference measurements, for which I want
to do some linear interpolation for each sample. I do this now by taking out
all the reference measurements, rescaling them to sample measurements using
approx, and then patching it back in. But because I take it out first, I
cannot do it nicely in a group_by dplyr pipe way. Right now I do it with a
really ugly workaround where I add empty (NA) newly created columns to the
sample tibble, then do it with a for-loop.
So my question is really: how can I implement the approx part within groups
into the pipe, so that I can do everything within groups? I've experimented
with dplyr::do(), and ran into the vignette on "programming with dplyr", but
searching mostly gives me broom::augment and lm stuff that I think operates
differently... (e.g. see
Using approx() with groups in dplyr). This thread also seems promising: How do you use approx() inside of mutate_at()?
Somebody on irc recommended using a conditional mutate, with case_when, but I
don't fully understand where and how within this context yet.
I think the problem lies in the fact that I want to filter out part of the data
for the following mutate operations, but the mutate operations rely on the
grouped data that I just filtered out, if that makes any sense.
Here's a MWE:
library(tidyverse) # or just dplyr, tibble
# create fake data
data <- data.frame(
  # in reality a dttm with the measurement time
  timestamp = c(rep("a", 7), rep("b", 7), rep("c", 7)),
  # measurement cycle, normally 40 for sample, 41 for reference
  cycle = rep(c(rep(1:3, 2), 4), 3),
  # whether the measurement is a reference or a sample
  isref = rep(c(rep(FALSE, 3), rep(TRUE, 4)), 3),
  # measurement intensity for mass 44
  r44 = c(28:26, 30:26, 36, 33, 31, 38, 34, 33, 31, 18, 16, 15, 19, 18, 17)) %>%
  # measurement intensity for mass 45, normally also masses up to mass 49
  mutate(r45 = r44 + rnorm(21, 20))
# of course this could be tidied up to "intensity" with a new column "mass"
# (44, 45, ...), but that would make making comparisons even harder...
# overview plot
data %>%
  ggplot(aes(x = cycle, y = r44, colour = isref)) +
  geom_line() +
  geom_line(aes(y = r45), linetype = 2) +
  geom_point() +
  geom_point(aes(y = r45), shape = 1) +
  facet_grid(~ timestamp)
# what I would like to do
data %>%
  group_by(timestamp) %>%
  do(target_cycle = approx(x = data %>% filter(isref) %>% pull(r44),
                           y = data %>% filter(isref) %>% pull(cycle),
                           xout = data %>% filter(!isref) %>% pull(r44))$y) %>%
  unnest()
# immediately append this new column to the original dataframe for all the
# samples (!isref) and then apply another approx for those values.
# here's my current attempt for one of the timestamps
matchref <- function(dat) {
  # split the data into sample gas and reference gas
  ref <- filter(dat, isref)
  smp <- filter(dat, !isref)

  # calculate the "target cycle", the points at which the reference intensity
  # 44 matches the sample intensity 44 with linear interpolation
  target_cycle <- approx(x = ref$r44,
                         y = ref$cycle, xout = smp$r44)

  # append the target cycle to the sample gas
  smp <- smp %>%
    group_by(timestamp) %>%
    mutate(target = target_cycle$y)

  # linearly interpolate each reference gas to the target cycle
  ref <- ref %>%
    group_by(timestamp) %>%
    # this is needed because the reference has one more cycle
    mutate(target = c(target_cycle$y, NA)) %>%
    # filter out all the failed ones (no interpolation possible)
    filter(!is.na(target)) %>%
    # calculate interpolated value based on r44 interpolation (i.e., don't
    # actually interpolate this value but shift it based on the 44
    # interpolation)
    mutate(r44 = approx(x = cycle, y = r44, xout = target)$y,
           r45 = approx(x = cycle, y = r45, xout = target)$y) %>%
    select(timestamp, target, r44:r45)

  # add new reference gas intensities to the correct sample gasses by the target cycle
  left_join(smp, ref, by = c("time", "target"))
}
matchref(data)
# and because now "target" must be length 3 (the group size) or one, not 9
# I have to create this ugly for-loop
# for which I create a copy of data that has the new columns to be created
mr <- data %>%
  # filter the sample gasses (since we convert ref to sample)
  filter(!isref) %>%
  # add empty new columns
  mutate(target = NA, r44 = NA, r45 = NA)

# apply matchref for each group timestamp
for (grp in unique(data$timestamp)) {
  mr[mr$timestamp == grp, ] <- matchref(data %>% filter(timestamp == grp))
}
Here's one approach that spreads the references and samples to new columns. I drop r45 for simplicity in this example.
data %>%
  select(-r45) %>%
  mutate(isref = ifelse(isref, "REF", "SAMP")) %>%
  spread(isref, r44) %>%
  group_by(timestamp) %>%
  mutate(target_cycle = approx(x = REF, y = cycle, xout = SAMP)$y) %>%
  ungroup
gives,
# timestamp cycle REF SAMP target_cycle
# <fct> <dbl> <dbl> <dbl> <dbl>
# 1 a 1 30 28 3
# 2 a 2 29 27 4
# 3 a 3 28 26 NA
# 4 a 4 27 NA NA
# 5 b 1 31 26 NA
# 6 b 2 38 36 2.5
# 7 b 3 34 33 4
# 8 b 4 33 NA NA
# 9 c 1 15 31 NA
# 10 c 2 19 18 3
# 11 c 3 18 16 2.5
# 12 c 4 17 NA NA
Edit to address comment below
To retain r45 you can use a gather-unite-spread approach like this:
df %>%
  mutate(isref = ifelse(isref, "REF", "SAMP")) %>%
  gather(r, value, r44:r45) %>%
  unite(ru, r, isref, sep = "_") %>%
  spread(ru, value) %>%
  group_by(timestamp) %>%
  mutate(target_cycle_r44 = approx(x = r44_REF, y = cycle, xout = r44_SAMP)$y) %>%
  ungroup
giving,
# # A tibble: 12 x 7
# timestamp cycle r44_REF r44_SAMP r45_REF r45_SAMP target_cycle_r44
# <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 a 1 30 28 49.5 47.2 3
# 2 a 2 29 27 48.8 48.7 4
# 3 a 3 28 26 47.2 46.8 NA
# 4 a 4 27 NA 47.9 NA NA
# 5 b 1 31 26 51.4 45.7 NA
# 6 b 2 38 36 57.5 55.9 2.5
# 7 b 3 34 33 54.3 52.4 4
# 8 b 4 33 NA 52.0 NA NA
# 9 c 1 15 31 36.0 51.7 NA
# 10 c 2 19 18 39.1 37.9 3
# 11 c 3 18 16 39.2 35.3 2.5
# 12 c 4 17 NA 39.0 NA NA

R dplyr with multiple columns with same stem name

I have some data columns that I need to set to NA whenever a corresponding column is >0.
I can do this with mutate and the names of the two columns, but I want a scoped version where I create the name of the corresponding column from the name of the 1st column
(x <- data.frame(x1 = (1:4), map.x1 = c(0, 0, 7, 0), x2 = c(2, 2, 2, 2), map.x2 = c(0, 7, 0, 0)))

mutate(x, x1 = ifelse(map.x1 > 0, NA, x1))

mutate_at(x, vars(starts_with("x")), function(v) {
  m.name <- paste0("map.", deparse(substitute(v)))
  ifelse(get(m.name) > 0, NA, v)
})
I can see ifelse() is unhappy because it wants the 1st argument to be an object, and I have given an expression.
I could not find a way. I even wondered if there was some way to avoid the function(v) and use (.) in paste0() or get()
I also am considering reshaping so I can do one mutate. What is the best practice here?
This is a solution without reshaping the data.
library(dplyr)
library(rlang)
custom_mutate <- function(df, v){
  v <- enquo(v)
  map.v <- paste0("map.", quo_name(v))
  df %>%
    mutate(UQE(v) := ifelse((!!sym(map.v)) > 0, NA, (!!v))) %>%
    pull(UQE(v))
}

mutate_at(x, vars(starts_with("x")), funs(custom_mutate(df = x, v = .)))
# x1 map.x1 x2 map.x2
# 1 1 0 2 0
# 2 2 0 NA 7
# 3 NA 7 2 0
# 4 4 0 2 0
The function in the mutate_at call is only applied to the columns and not to the whole dataframe. Therefore you have to explicitly tell the function where to look for your map.x1 column.
To get the name of the column you're working with, first you need to use enquo to turn v into a quosure. Then you can use quo_name to construct the map.-name. In the following mutate call it is important that you tell dplyr that v is a quosure (therefore the UQE wrapped around it, which is similar to the !! in front of it in the FALSE part of the ifelse statement).
For the map.x1 column you have to use the sym function from the rlang package to get the bare name (without quotation marks) and then again use !! to tell dplyr to take this as a column name.
I tried to explain my solution without being too technical. For a great explanation of how to program with dplyr see here: Programming with dplyr
Here is one way to get the output you want. No need to write a custom function. Reshaping the file should be sufficient.
library(tibble)
library(dplyr)
library(stats)
# creating dataframe with proper names
x <-
  tibble::as_data_frame(cbind(
    x_1 = c(1:4),
    map.x_1 = c(0, 0, 7, 0),
    x_2 = c(2, 2, 2, 2),
    map.x_2 = c(0, 7, 0, 0)
  )) %>%
  tibble::rownames_to_column(df = ., var = 'id')

# converting to long format
x_long <- stats::reshape(
  as.data.frame(x),
  timevar = "level",
  varying = dput(as.character(as.vector(names(
    x[, base::grep("^x|^map", names(x))]
  )))),
  direction = "long",
  idvar = c("id"),
  sep = "_"
)
#> c("x_1", "map.x_1", "x_2", "map.x_2")
# converting the dataframe based on condition
x_long %>%
  group_by(.data = ., level) %>%
  dplyr::mutate(.data = .,
                x = base::ifelse(test = map.x > 0,
                                 yes = NA,
                                 no = x))
#> # A tibble: 8 x 4
#> # Groups: level [2]
#> id level x map.x
#> <chr> <dbl> <dbl> <dbl>
#> 1 1 1.00 1.00 0
#> 2 2 1.00 2.00 0
#> 3 3 1.00 NA 7.00
#> 4 4 1.00 4.00 0
#> 5 1 2.00 2.00 0
#> 6 2 2.00 NA 7.00
#> 7 3 2.00 2.00 0
#> 8 4 2.00 2.00 0
Created on 2018-02-14 by the reprex
package (v0.1.1.9000).

Create balanced subsets based on similarity of multiple columns

I'm searching for a possibility to find subsets of rows (one subset should contain 6 rows), where the value means for multiple columns are most similar. So, I would like R to search through my data.frame and create subsets of 6 rows each, so that finally these subsets are most similar to each other. Similarity could be measured as the Euclidean distance (as pointed out by @David Robinson).
My data looks like that:
TID  Cue1  Cue2  Cue3
  1  2.06  1.90  3.82
  2  5.18  4.13  5.10
  3  5.09  2.85  2.80
  4  1.93  4.14  4.75
...   ...   ...   ...
I'd now like to know if there is a way in R, that I find the following:
-give me e.g. 4 subsets containing 6 rows each, whereby the 4 subsets have the most possible similarity in the Cue1, Cue2 and Cue3 means (SD isn't important), while each subset contains unique rows (no duplicate rows between the subsets).
One example would be (not matching the data in my example):
-subset 1 contains TID 1, TID 6, TID 14, TID 28, TID 39, TID 50 and this subset has the cue means (Cue1 = 3.2, Cue2 = 2.5, Cue3 = 4)
-subset 2 contains TID 3, TID 12, TID 20, TID 40, TID 54, TID 59 and this subset has the cue means (Cue1 = 3.3, Cue2 = 2.6, Cue3 = 4.1).
So that the two subsets are very (most) similar in the cue means. R should then tell me the row numbers (or the TID values) forming the subsets.
Is there any possibilty to do this in R?
Here is an reproducible example of how my data looks like:
mysamp <- function(n, m, s, lwr, upr, nnorm) {
  set.seed(1)
  samp <- rnorm(nnorm, m, s)
  samp <- samp[samp >= lwr & samp <= upr]
  if (length(samp) >= n) {
    return(sample(samp, n))
  }
}

Cue1 <- mysamp(n=60, m=3, s=1.5, lwr=1, upr=6, nnorm=1000)
Cue2 <- mysamp(n=60, m=3, s=2.5, lwr=1, upr=6, nnorm=1000)
Cue3 <- mysamp(n=60, m=4, s=1.5, lwr=1, upr=6, nnorm=1000)

df <- data.frame(TID = 1:60, Cue1 = Cue1, Cue2 = Cue2, Cue3 = Cue3)
This is a clustering problem, so you'd want to approach it by:
Calculating a distance matrix
Using that to construct a "tree" of similar groups of nodes
Extracting sub-clusters of your size that appear lowest on the tree
The distance matrix and hierarchical clustering can be done as:
distances <- dist(df[, -1])
h <- hclust(distances)
There are many approaches to algorithmically pulling off low clusters on the tree; since I'm accustomed to working with dplyr/purrr/tidyr I'll show one solution. This takes the approach of using cutree to break the tree apart at every possible level, then find the first time each group of six appears.
library(dplyr)
library(tidyr)
library(purrr)
clusterings <- data_frame(ncluster = seq(nrow(df), 1)) %>%
  unnest(membership = map(ncluster, ~ cutree(h, .))) %>%
  group_by(ncluster) %>%
  mutate(row = row_number()) %>%
  ungroup() %>%
  nest(-ncluster, -membership) %>%
  mutate(size = map_dbl(data, nrow)) %>%
  filter(size == 6) %>%
  distinct(membership, .keep_all = TRUE) %>%
  unnest(data) %>%
  mutate(TID = df$TID[row])
On your data, this returns:
# A tibble: 42 × 5
ncluster membership size row TID
<int> <int> <dbl> <int> <int>
1 29 9 6 9 9
2 29 9 6 30 30
3 29 9 6 39 39
4 29 9 6 41 41
5 29 9 6 43 43
6 29 9 6 57 57
7 21 13 6 15 15
8 21 13 6 20 20
9 21 13 6 25 25
10 21 13 6 29 29
# ... with 32 more rows
Thus, (9, 30, 39, 41, 43, 57) make up your first group of 6, while the second group starts with (15, 20, 25, 29...)
