All combinations of values between 0 and 1 that sum to 1 in R

Simple question: I'm trying to get all combinations where 3 weights (each between 0.1 and 0.9) sum to 1.
Example:
c(0.20,0.20,0.60)
c(0.35,0.15,0.50)
.................
with weights differing by 0.05
I have tried this:
library(gregmisc)
permutations(n = 9, r = 3, v = seq(0.1,0.9,0.05))
combn(seq(0.1,0.9,0.05),c(3))
However, I need the 3 numbers (weights) to sum to 1. How can I do this?

x <- expand.grid(seq(0.1, 1, 0.05),
                 seq(0.1, 1, 0.05),
                 seq(0.1, 1, 0.05))
x <- x[rowSums(x)==1,]
Edit: Use this instead to avoid floating point errors:
x <- x[abs(rowSums(x)-1) < .Machine$double.eps ^ 0.5,]
# if order doesn't matter
unique(apply(x,1,sort), MARGIN=2)
# 15 33 51 69 87 105 123 141 393 411 429 447 465 483 771 789 807 825 843 1149 1167 1185 1527 1545
#[1,] 0.1 0.10 0.1 0.10 0.1 0.10 0.1 0.10 0.15 0.15 0.15 0.15 0.15 0.15 0.2 0.20 0.2 0.20 0.2 0.25 0.25 0.25 0.3 0.30
#[2,] 0.1 0.15 0.2 0.25 0.3 0.35 0.4 0.45 0.15 0.20 0.25 0.30 0.35 0.40 0.2 0.25 0.3 0.35 0.4 0.25 0.30 0.35 0.3 0.35
#[3,] 0.8 0.75 0.7 0.65 0.6 0.55 0.5 0.45 0.70 0.65 0.60 0.55 0.50 0.45 0.6 0.55 0.5 0.45 0.4 0.50 0.45 0.40 0.4 0.35
This will run into performance and memory problems if the possible number of combinations gets huge.
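If that becomes an issue, a hedged sketch of a cheaper route: since the third weight is fully determined by the first two (w3 = 1 - w1 - w2), you only need to enumerate pairs, 17^2 candidates instead of 17^3 on this grid. This assumes the same 0.05 step and 0.1-0.9 range as the question:
w <- seq(0.1, 0.9, 0.05)
x <- expand.grid(w1 = w, w2 = w)   # enumerate only the first two weights
x$w3 <- 1 - x$w1 - x$w2            # the third is fixed by the sum-to-1 constraint
tol <- .Machine$double.eps^0.5
x <- x[x$w3 >= 0.1 - tol & x$w3 <= 0.9 + tol, ]  # keep w3 inside the allowed range
Because the 0.05 step divides 1 evenly, w3 automatically lands on the same grid (up to floating-point noise).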

This was an easier-to-read solution for me (the same floating-point tolerance trick applies here too):
x_grid <- data.frame(expand.grid(seq(0.1, 1, 0.05),
                                 seq(0.1, 1, 0.05),
                                 seq(0.1, 1, 0.05)))
x_combinations <- x_grid[rowSums(x_grid) == 1, ]


Is there a way to modify specific cells in a data.frame with an apply-statement?

I have a data set
V1 V2 V3 V4
1 0.2 0.1 0.0 0.8
2 0.3 0.4 0.3 0.0
3 0.1 0.3 0.2 0.0
4 0.2 0.1 0.4 0.1
5 0.2 0.1 0.1 0.1
in which each column has one cell to which I would like to add a fraction (10%) of the other values in the same column.
This indicates the row in each variable that should receive the bonus:
bonus<-c(2,3,1,4)
And the desired output is this:
V1 V2 V3 V4
1 0.18 0.09 0.10 0.72
2 0.37 0.36 0.27 0.00
3 0.09 0.37 0.18 0.00
4 0.18 0.09 0.36 0.19
5 0.18 0.09 0.09 0.09
I do this with a for-loop:
for (i in 1:ncol(tab)) {
  tab[bonus[i], i]  <- tab[bonus[i], i]  + sum(0.1 * tab[-bonus[i], i])
  tab[-bonus[i], i] <- tab[-bonus[i], i] - 0.1 * tab[-bonus[i], i]
}
The first line inside the braces adds 0.1 * sum_of_other_values to the cell indexed by bonus; the second line subtracts 10% from every cell except the bonus one.
But I need to do this with a lot of columns in a lot of matrices and am struggling with including the information from the external vector bonus into a loop-less function.
Is there a way to vectorise this and then apply it across the datasets to make it faster?
Thanks very much!
( Example data:
tab <- data.frame(V1 = c(0.2, 0.3, 0.1, 0.2, 0.2),
                  V2 = c(0.1, 0.4, 0.3, 0.1, 0.1),
                  V3 = c(0.0, 0.3, 0.2, 0.4, 0.1),
                  V4 = c(0.8, 0.0, 0.0, 0.1, 0.1))
)
Try this:
mapply(
  function(vec, bon) {
    more <- vec / 10
    vec + ifelse(seq_along(vec) %in% bon, sum(more[-bon]), -more)
  }, asplit(tab, 2), bonus)
# V1 V2 V3 V4
# [1,] 0.18 0.09 0.10 0.72
# [2,] 0.37 0.36 0.27 0.00
# [3,] 0.09 0.37 0.18 0.00
# [4,] 0.18 0.09 0.36 0.19
# [5,] 0.18 0.09 0.09 0.09
Sometimes I prefer to compute the change separately from the original table (for instance, to inspect the magnitude of the change or some other summary statistic before updating it); if that appeals, this can be shifted slightly:
changes <- mapply(
  function(vec, bon) {
    more <- vec / 10
    ifelse(seq_along(vec) %in% bon, sum(more[-bon]), -more)
  }, asplit(tab, 2), bonus)
changes
# V1 V2 V3 V4
# [1,] -0.02 -0.01 0.10 -0.08
# [2,] 0.07 -0.04 -0.03 0.00
# [3,] -0.01 0.07 -0.02 0.00
# [4,] -0.02 -0.01 -0.04 0.09
# [5,] -0.02 -0.01 -0.01 -0.01
tab + changes
# V1 V2 V3 V4
# 1 0.18 0.09 0.10 0.72
# 2 0.37 0.36 0.27 0.00
# 3 0.09 0.37 0.18 0.00
# 4 0.18 0.09 0.36 0.19
# 5 0.18 0.09 0.09 0.09
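A fully vectorised variant is also possible; this is a minimal sketch assuming the same tab and bonus as above, using a two-column (row, column) matrix index to pick out the bonus cells:
m   <- as.matrix(tab)
dec <- m / 10                          # 10% of every cell
pos <- cbind(bonus, seq_along(bonus))  # (row, column) position of each bonus cell
dec[pos] <- 0                          # the bonus cell itself gives nothing away
res <- m - dec                         # all other cells lose their 10%
res[pos] <- m[pos] + colSums(dec)      # each bonus cell gains what its column lost
res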

R: need help matching up table rows and getting differences

I have chromatographic data in a table organized by peak position and integration value of various samples. Every sample also has a repeated measurement, logged under a different sample log number.
What I'm interested in is the repeatability of the measurements of the various peaks; ideally, the difference in peak integration would be 0 for each sample.
The data
Sample Log1 Log2 Peak1 Peak2 Peak3 Peak4 Peak5
A 100 104 0.20 0.80 0.30 0.00 0.00
B 101 106 0.25 0.73 0.29 0.01 0.04
C 102 103 0.20 0.80 0.30 0.00 0.07
C 103 102 0.22 0.81 0.31 0.04 0.00
A 104 100 0.21 0.70 0.33 0.00 0.10
B 106 101 0.20 0.73 0.37 0.00 0.03
where Log1 is the original sample log number and Log2 is the repeat log number.
How can I construct a new variable for every peak (being the difference PeakX_Log1 - PeakX_Log2)?
Mind that in my example I only have 5 peaks. The real-life situation is a complex mixture involving >20 peaks, so it is very hard to do by hand.
If you will only have two values for each sample, something like this could work:
df <- data.table::fread(
"Sample Log1 Log2 Peak1 Peak2 Peak3 Peak4 Peak5
A 100 104 0.20 0.80 0.30 0.00 0.00
B 101 106 0.25 0.73 0.29 0.01 0.04
C 102 103 0.20 0.80 0.30 0.00 0.07
C 103 102 0.22 0.81 0.31 0.04 0.00
A 104 100 0.21 0.70 0.33 0.00 0.10
B 106 101 0.20 0.73 0.37 0.00 0.03"
)
library(tidyverse)
new_df <- df %>%
  mutate(Log = ifelse(Log1 < Log2, "Log1", "Log2")) %>%
  select(-Log1, -Log2) %>%
  pivot_longer(cols = starts_with("Peak"), names_to = "Peak") %>%
  pivot_wider(values_from = value, names_from = Log) %>%
  mutate(Variation = Log1 - Log2)
new_df
# A tibble: 15 × 5
Sample Peak Log1 Log2 Variation
<chr> <chr> <dbl> <dbl> <dbl>
1 A Peak1 0.2 0.21 -0.0100
2 A Peak2 0.8 0.7 0.100
3 A Peak3 0.3 0.33 -0.0300
4 A Peak4 0 0 0
5 A Peak5 0 0.1 -0.1
6 B Peak1 0.25 0.2 0.05
7 B Peak2 0.73 0.73 0
8 B Peak3 0.29 0.37 -0.08
9 B Peak4 0.01 0 0.01
10 B Peak5 0.04 0.03 0.01
11 C Peak1 0.2 0.22 -0.0200
12 C Peak2 0.8 0.81 -0.0100
13 C Peak3 0.3 0.31 -0.0100
14 C Peak4 0 0.04 -0.04
15 C Peak5 0.07 0 0.07
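For a base-R take on the same idea, a hedged sketch (it assumes exactly two rows per Sample, as stated, and the df read in above; as.data.frame() is used because fread() returns a data.table):
df2 <- as.data.frame(df)
first  <- df2[df2$Log1 < df2$Log2, ]                    # original measurements
second <- df2[df2$Log1 > df2$Log2, ]                    # repeat measurements
second <- second[match(first$Sample, second$Sample), ]  # align repeats to originals
peaks  <- grep("^Peak", names(df2), value = TRUE)
cbind(Sample = first$Sample, first[peaks] - second[peaks])
This returns one row per sample with the per-peak differences in wide form, rather than the long format above.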

Repeat data.frame N times with adding column

I have the following data frame and I want to repeat it N times
dc <- read.table(text = "from 1 2 3 4 5
1 0.01 0.02 0.03 0.04 0.05
2 0.06 0.07 0.08 0.09 0.10
3 0.11 0.12 0.13 0.14 0.15
4 0.16 0.17 0.18 0.19 0.20
5 0.21 0.22 0.23 0.24 0.25", header = TRUE)
n <- 20
ddr <- NA
for (i in 1:n) {
  ddr <- rbind(ddr, cbind(dc, i))
}
As a result, I would like to receive:
from X1 X2 X3 X4 X5 i
1 0.01 0.02 0.03 0.04 0.05 1
2 0.06 0.07 0.08 0.09 0.10 1
3 0.11 0.12 0.13 0.14 0.15 1
4 0.16 0.17 0.18 0.19 0.20 1
5 0.21 0.22 0.23 0.24 0.25 1
1 0.01 0.02 0.03 0.04 0.05 2
2 0.06 0.07 0.08 0.09 0.10 2
3 0.11 0.12 0.13 0.14 0.15 2
4 0.16 0.17 0.18 0.19 0.20 2
5 0.21 0.22 0.23 0.24 0.25 2
.............................
1 0.01 0.02 0.03 0.04 0.05 20
2 0.06 0.07 0.08 0.09 0.10 20
3 0.11 0.12 0.13 0.14 0.15 20
4 0.16 0.17 0.18 0.19 0.20 20
5 0.21 0.22 0.23 0.24 0.25 20
The data frame must be repeated N times, with the repeat number added as a column.
Is there a cleaner solution (an easy function in R) to this? In my case, if ddr is not declared first (ddr <- NA), the script does not work. Thanks!
You can use rep() to replicate the row indexes, and also to create the repeat number column.
cbind(dc[rep(1:nrow(dc), n), ], i = rep(1:n, each = nrow(dc)))
Let's break it down:
dc[rep(1:nrow(dc), n), ] uses replicated row indexes as the i value in data-frame row indexing with [
rep(1:n, each = nrow(dc)) repeats each element of the sequence 1:n nrow(dc) times
cbind(...) combines the two into a single data frame
As #HubertL points out in the comments, this can be further simplified to
cbind(dc, i = rep(1:n, each = nrow(dc)))
thanks to the magic of recycling. Please go give him a vote.
Here is also a more intuitive way, about identical in speed to the other top answer:
n <- 3
data.frame(dc, i = rep(1:n, each = NROW(dc)))
Output (repeated 3x):
from X1 X2 X3 X4 X5 i
1 1 0.01 0.02 0.03 0.04 0.05 1
2 2 0.06 0.07 0.08 0.09 0.10 1
3 3 0.11 0.12 0.13 0.14 0.15 1
4 4 0.16 0.17 0.18 0.19 0.20 1
5 5 0.21 0.22 0.23 0.24 0.25 1
6 1 0.01 0.02 0.03 0.04 0.05 2
7 2 0.06 0.07 0.08 0.09 0.10 2
8 3 0.11 0.12 0.13 0.14 0.15 2
9 4 0.16 0.17 0.18 0.19 0.20 2
10 5 0.21 0.22 0.23 0.24 0.25 2
11 1 0.01 0.02 0.03 0.04 0.05 3
12 2 0.06 0.07 0.08 0.09 0.10 3
13 3 0.11 0.12 0.13 0.14 0.15 3
14 4 0.16 0.17 0.18 0.19 0.20 3
15 5 0.21 0.22 0.23 0.24 0.25 3
EDIT: Top Answer Speed Test
This test was scaled up to n=1e+05, iterations=100:
func1 <- function() {
  data.frame(dc, i = rep(1:n, each = NROW(dc)))
}
func2 <- function() {
  cbind(dc, i = rep(1:n, each = nrow(dc)))
}
func3 <- function() {
  cbind(dc[rep(1:nrow(dc), n), ], i = rep(1:n, each = nrow(dc)))
}
microbenchmark::microbenchmark(
  func1(), func2(), func3())
Unit: milliseconds
expr min lq mean median uq max neval cld
func1() 15.58709 21.69143 28.62695 22.01692 23.85648 117.9012 100 a
func2() 15.99023 21.59375 28.37328 22.18298 23.99953 136.1209 100 a
func3() 414.18741 436.51732 473.14571 453.26099 498.21576 666.8515 100 b
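For completeness, a hedged base-R alternative to both: merge() with no shared column names performs a cross join, pairing every row of dc with every value of i (note the row order can differ from the loop's output):
ddr <- merge(dc, data.frame(i = seq_len(n)))  # cross join: every row of dc for each i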

Dynamically create new columns where, based on the numbers in their names, the LHS becomes 0 and the RHS keeps its values

I have a dataset (nm) which has the following columns:
nm
X24_TT_1.1 X35_FTT_2.1 X55_FTT_3.1 X70_FTT_4.1 X1085_TT_5.1 Mean Median
0.09 0.87 0.89 0.15 0.1 35 55
0.94 0.12 0.09 0.92 0.82 55 55
0.89 0.11 0.86 0.08 0.08 70 35
0.12 0.8 0.15 0.18 0.12 35 35
0.08 0.09 0.15 0.88 0.12 85 24
I want to have new dataset (df) with following conditions:
a) Whichever is higher between Mean and Median becomes the threshold: in the new columns, values to the left of the threshold column (LHS) become "0" and values from the threshold column rightward (RHS) keep their previous values.
b) New columns must have "_P" appended to their names.
c) As the number of columns is very large, dynamic coding is needed.
For example:
In row 1: since 55 > 35, select 55 as the threshold. Columns whose names carry numbers below 55 (X24_TT and X35_FTT, i.e. 24 and 35) should show "0", while columns at or above 55 (X55_FTT, X70_FTT and X1085_TT) keep their previous values.
Similarly for row 3: since 70 > 35, show "0" for every column below 70 (LHS) and the previous values from 70 onward (RHS).
The new dataset(df) will show like this:
df
X24_TT_1.1 X35_FTT_2.1 X55_FTT_3.1 X70_FTT_4.1 X1085_TT_5.1 Mean Median X24_TT_1.1_P X35_FTT_2.1_P X55_FTT_3.1_P X70_FTT_4.1_P X1085_TT_5.1_P
0.09 0.87 0.89 0.15 0.1 35 55 0 0 0.89 0.15 0.1
0.94 0.12 0.09 0.92 0.82 55 55 0 0 0.09 0.92 0.82
0.89 0.11 0.86 0.08 0.08 70 35 0 0 0 0.08 0.08
0.12 0.8 0.15 0.18 0.12 35 35 0 0.8 0.15 0.18 0.12
0.08 0.09 0.15 0.88 0.12 85 24 0 0 0 0 0.12
I have tried several methods but dynamic coding doesn't work for me.
Thanks in advance for providing the solution!
One option would be to extract the numeric substring from the column names with gsub ('j1'), loop through the rows with apply(nm, 1, ...), get the max of 'Mean' and 'Median' ('i1'), set the first 5 elements to 0 where 'j1' is less than 'i1', subset the first 5 values, transpose the output, and assign it to the new columns.
j1 <- as.numeric(gsub("\\D+", "", names(nm)[1:5]))
nm[paste0(names(nm)[1:5], "_P")] <- t(apply(nm, 1, FUN = function(x) {
  i1 <- max(x[6:7])
  x[1:5][j1 < i1] <- 0
  x[1:5]
}))
nm
# X24_TT X35_FTT X55_FTT X70_FTT X85_TT Mean Median X24_TT_P X35_FTT_P
#1 0.09 0.87 0.89 0.15 0.10 35 55 0 0.0
#2 0.94 0.12 0.09 0.92 0.82 55 55 0 0.0
#3 0.89 0.11 0.86 0.08 0.08 70 35 0 0.0
#4 0.12 0.80 0.15 0.18 0.12 35 35 0 0.8
#5 0.08 0.09 0.15 0.88 0.12 85 24 0 0.0
# X55_FTT_P X70_FTT_P X85_TT_P
#1 0.89 0.15 0.10
#2 0.09 0.92 0.82
#3 0.00 0.08 0.08
#4 0.15 0.18 0.12
#5 0.00 0.00 0.12
Or we can vectorize: use rep to create a logical matrix, multiply it with the first 5 columns of 'nm', and assign the output to the new columns.
res <- nm[1:5] * matrix(rep(j1, nrow(nm)) >= rep(do.call(pmax, nm[6:7]), each = 5),
                        ncol = 5, byrow = TRUE)
nm[paste0(names(nm)[1:5], "_P")] <- res
nm
#X24_TT X35_FTT X55_FTT X70_FTT X85_TT Mean Median X24_TT_P X35_FTT_P X55_FTT_P X70_FTT_P X85_TT_P
#1 0.09 0.87 0.89 0.15 0.10 35 55 0 0.0 0.89 0.15 0.10
#2 0.94 0.12 0.09 0.92 0.82 55 55 0 0.0 0.09 0.92 0.82
#3 0.89 0.11 0.86 0.08 0.08 70 35 0 0.0 0.00 0.08 0.08
#4 0.12 0.80 0.15 0.18 0.12 35 35 0 0.8 0.15 0.18 0.12
#5 0.08 0.09 0.15 0.88 0.12 85 24 0 0.0 0.00 0.00 0.12
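The same logical matrix can also be built with outer(); a small sketch assuming j1 and nm as defined above:
keep <- outer(do.call(pmax, nm[6:7]), j1, `<=`)  # TRUE where the column's number is at least the row's threshold
nm[paste0(names(nm)[1:5], "_P")] <- nm[1:5] * keep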
Update
For the new column names, change the 'j1' to
j1 <- as.numeric(sub("\\D+(\\d+)_.*", "\\1", names(nm)[1:5]))
and copy-pasting the above code gives:
res <- nm[1:5] * matrix(rep(j1, nrow(nm)) >= rep(do.call(pmax, nm[6:7]), each = 5),
                        ncol = 5, byrow = TRUE)
nm[paste0(names(nm)[1:5], "_P")] <- res
nm
# X24_TT_1.1 X35_FTT_2.1 X55_FTT_3.1 X70_FTT_4.1 X1085_TT_5.1 Mean Median X24_TT_1.1_P X35_FTT_2.1_P X55_FTT_3.1_P X70_FTT_4.1_P X1085_TT_5.1_P
#1 0.09 0.87 0.89 0.15 0.10 35 55 0 0.0 0.89 0.15 0.10
#2 0.94 0.12 0.09 0.92 0.82 55 55 0 0.0 0.09 0.92 0.82
#3 0.89 0.11 0.86 0.08 0.08 70 35 0 0.0 0.00 0.08 0.08
#4 0.12 0.80 0.15 0.18 0.12 35 35 0 0.8 0.15 0.18 0.12
#5 0.08 0.09 0.15 0.88 0.12 85 24 0 0.0 0.00 0.00 0.12
data
nm <- structure(list(X24_TT = c(0.09, 0.94, 0.89, 0.12, 0.08), X35_FTT = c(0.87,
0.12, 0.11, 0.8, 0.09), X55_FTT = c(0.89, 0.09, 0.86, 0.15, 0.15
), X70_FTT = c(0.15, 0.92, 0.08, 0.18, 0.88), X85_TT = c(0.1,
0.82, 0.08, 0.12, 0.12), Mean = c(35L, 55L, 70L, 35L, 85L), Median = c(55L,
55L, 35L, 35L, 24L)), .Names = c("X24_TT", "X35_FTT", "X55_FTT",
"X70_FTT", "X85_TT", "Mean", "Median"), class = "data.frame",
row.names = c(NA, -5L))

R: grouping numbers into bins

I am looking to find the smallest number in a data frame column that is larger than a given number from another array.
Example
DistrDF
Bin Freq CumSum
0.1 0.05 0.05
0.2 0.07 0.12
0.3 0.20 0.32
0.4 0.10 0.42
0.5 0.00 0.42
0.6 0.15 0.57
0.7 0.00 0.57
0.8 0.30 0.87
0.9 0.11 0.98
1.0 0.02 1.0
Then I have an array of, say, 10 random numbers between 0 and 1 (i.e. each random number will fall into one of the bins in the DistrDF)
RandNums
0.13
0.50
0.11
0.10
0.70
0.05
0.12
0.80
0.88
0.40
I would like to use these two tables to create a third table, which indicates into which bin each of the random numbers falls, as below:
ResultDF
0.30 (because 0.13 < 0.32 and 0.13 > 0.12)
0.60 (because 0.50 < 0.57 and 0.50 > 0.42)
...
0.30 (because 0.40 < 0.42 and 0.40 > 0.32)
Does anyone have any ideas? I feel like an aggregate or something might be in order, but I'm not sure.
The cut function does what you want:
DistrDF <- DistrDF[DistrDF$Freq > 0, ]           # remove empty bins: duplicated CumSum values would make the breaks non-unique
DistrDF$Bin[cut(RandNums, c(0, DistrDF$CumSum))] # the factor from cut() indexes Bin by its integer codes
# [1] 0.3 0.6 0.2 0.2 0.8 0.1 0.2 0.8 0.9 0.4
You can manipulate the include.lowest and right parameters to change how you handle points that fall on the border of bins.
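An alternative sketch uses findInterval(), which returns integer bin indexes directly; this assumes RandNums is a plain numeric vector and that the empty bins were removed as above:
idx <- findInterval(RandNums, DistrDF$CumSum, left.open = TRUE) + 1
DistrDF$Bin[idx]
# [1] 0.3 0.6 0.2 0.2 0.8 0.1 0.2 0.8 0.9 0.4
Here left.open = TRUE reproduces the right-closed intervals that cut() uses by default.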
