I'm trying to create a function that will spread a number into an X number of groups that are approximately equal. For example, splitting 32 into 3 separate groups would result in 11, 11, and 10. Splitting 32 into 5 separate groups would result in 7, 7, 6, 6, 6.
I've found a lot of Python approaches, and lots of R approaches that split samples. But I haven't found any R-specific approaches that split a given count rather than a sample.
Any help would be much appreciated!
A transcription of the Python code provided by @Poe Dator:
int_split <- function(n, p) n %/% p + (sequence(p) - 1 < n %% p)
int_split(32, 3)
[1] 11 11 10
int_split(32, 5)
[1] 7 7 6 6 6
You could do:
split_count <- function(x, n){
  grp <- rep(x%/%n, n)
  y <- x%%n
  grp[seq_len(y)] <- grp[seq_len(y)] + 1
  grp
}
split_count(32, 2)
[1] 16 16
split_count(32, 5)
[1] 7 7 6 6 6
split_count(32, 3)
[1] 11 11 10
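A quick way to convince yourself that this kind of split behaves as intended is to check two properties: the groups add up to the original count, and no two groups differ by more than one. The sketch below repeats split_count from above and wraps the checks in a helper, check_split, which is a name made up here for illustration.

```r
# split_count as defined in the answer above
split_count <- function(x, n) {
  grp <- rep(x %/% n, n)
  y <- x %% n
  grp[seq_len(y)] <- grp[seq_len(y)] + 1
  grp
}

# Hypothetical helper: TRUE if the groups sum to x and differ by at most 1
check_split <- function(x, n) {
  grp <- split_count(x, n)
  sum(grp) == x && diff(range(grp)) <= 1
}

check_split(32, 3)   # remainder 2: two groups get one extra
check_split(32, 4)   # remainder 0: seq_len(0) selects nothing
check_split(3, 5)    # more groups than items: some groups are 0
```

The seq_len(y) indexing is what makes the zero-remainder case safe; 1:y would count down from 1 to 0 instead.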
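Here's a "Monte Carlo" approach. I generate N random vectors of grps non-negative integers that each sum to Num, and then choose the vector with the smallest total difference between neighbouring elements. The result is not guaranteed to be the most even split, and its elements are not sorted.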
Num <- 32
grps <- 4
N <- 1000
tmp <- rmultinom(N,Num,rep(1/grps,grps))
i <- which.min(apply(tmp,2,function(x) sum(abs(diff(x)))))
tmp[,i]
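Since the sampler is random, it can help to fix a seed and compare its pick with the deterministic split. The following is only a sketch: the seed value and the names mc and exact are arbitrary choices made here, and the scoring function is the same one used above.

```r
set.seed(42)                      # arbitrary seed, only for reproducibility
Num  <- 32
grps <- 4
N    <- 1000

# Each column of tmp is one random split of Num into grps counts
tmp <- rmultinom(N, Num, rep(1 / grps, grps))

# Same score as above: total absolute difference between neighbours
score <- apply(tmp, 2, function(x) sum(abs(diff(x))))
mc <- sort(tmp[, which.min(score)], decreasing = TRUE)

# Deterministic split for comparison
exact <- Num %/% grps + (seq_len(grps) - 1 < Num %% grps)

sum(mc) == Num         # always TRUE: every column sums to Num
identical(as.numeric(mc), as.numeric(exact))  # TRUE only if the sampler hit the even split
```

With more groups or a larger Num, the chance that any of the N draws is perfectly even drops quickly, which is the main argument for the arithmetic versions.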
I have a vector of nodes taken from a binary regression tree. These are in level order, for example, 1,2,4,5,10,11. I would like to place them in infix order like so: 4,2,10,5,11,1. Thanks to Alistaire I have a solution that uses recursion. But as they point out, "There has to be a better way". I was hoping someone might be able to help me out with a non-recursive approach. The recursive version is very slow for vectors of any reasonable length. I have also tried creating a binary tree using igraph and data.tree but I cannot seem to get the ordering I want from these.
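Yes, it's possible to do this without recursion, since you are dealing with a binary tree whose nodes are numbered in level order. Such a tree has a fixed structure: node k has children 2k and 2k + 1, so a tree of depth 5, for example, holds nodes 1 to 31.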
Suppose we have a vector of your nodes:
nodes <- c(1, 2, 4, 5, 10, 11)
First of all, we only want a binary tree that is deep enough to accommodate your largest node. A tree of depth d holds nodes 1 to 2^d - 1, so the smallest sufficient depth is floor(log2(max(nodes))) + 1:
depth <- floor(log2(max(nodes))) + 1
And a data frame that gives the node number, depth and 'leftness' of a sufficiently large binary tree like this:
df <- data.frame(node = seq(2^(depth) - 1),
                 depth = rep(seq(depth), times = 2^(seq(depth) - 1)),
                 leftness = unlist(sapply(2^seq(depth) - 1,
                                          function(x) (seq(x)[seq(x) %% 2 ==1])/(x + 1))))
However, we only need the subset of this tree that matches your nodes:
df <- df[match(nodes, df$node),]
df
#>    node depth leftness
#> 1     1     1   0.5000
#> 2     2     2   0.2500
#> 4     4     3   0.1250
#> 5     5     3   0.3750
#> 10   10     4   0.3125
#> 11   11     4   0.4375
And we can sort the nodes in order according to leftness:
df$node[order(df$leftness)]
#> [1] 4 2 10 5 11 1
Which is your expected result.
To generalize this, just put the above steps in a function:
sort_left <- function(nodes) {
  # Smallest depth whose complete tree (nodes 1 to 2^depth - 1) contains max(nodes)
  depth <- floor(log2(max(nodes))) + 1
  df <- data.frame(node = seq(2^(depth) - 1),
                   depth = rep(seq(depth), times = 2^(seq(depth) - 1)),
                   leftness = unlist(sapply(2^seq(depth) - 1,
                                            function(x) (seq(x)[seq(x) %% 2 == 1])/(x + 1))))
  df <- df[match(nodes, df$node),]
  df$node[order(df$leftness)]
}
So we can do:
sort_left( c(1, 2, 4, 5, 10, 11))
#> [1] 4 2 10 5 11 1
Or, given the example in your original question,
sort_left(c(1,2,4,5,10,11,20,21))
#> [1] 4 2 20 10 21 5 11 1
Which was the desired result. All without recursion.
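If the node numbers get large, building a data frame for the whole tree can become expensive, because it has 2^depth - 1 rows. The 'leftness' of a node can also be computed directly from its number. The sketch below does that; sort_left_direct is a name made up for this example, and it assumes the same level-order numbering (root 1, children of k are 2k and 2k + 1).

```r
sort_left_direct <- function(nodes) {
  d   <- floor(log2(nodes)) + 1     # depth of each node (root has depth 1)
  pos <- nodes - 2^(d - 1)          # 0-based position within its level
  leftness <- (2 * pos + 1) / 2^d   # same fractions as the leftness column above
  nodes[order(leftness)]
}

sort_left_direct(c(1, 2, 4, 5, 10, 11))
#> [1]  4  2 10  5 11  1
sort_left_direct(c(1, 2, 4, 5, 10, 11, 20, 21))
#> [1]  4  2 20 10 21  5 11  1
```

For node 10, for instance, d is 4 and pos is 2, giving 5/16 = 0.3125, which matches the table above.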
I have a numeric vector x of length N and would like to create a vector of the within-set sums of all of the following sets: any possible combination of the x elements with at most M elements in each combination. I put together a slow iterative approach; what I am looking for here is a way without using any loops.
Consider the approach I have been taking, in the following example with N=5 and M=4
M <- 4
x <- 11:15
y <- as.matrix(expand.grid(rep(list(0:1), length(x))))
result <- y[rowSums(y) <= M, ] %*% x
However, as N gets large (above 22 for me), the expand.grid output becomes too big and gives an error (replace x above with x <- 11:55 to observe this). Ideally there would be an expand.grid function that permits restrictions on the rows before constructing the full matrix, which (at least for what I want) would keep the matrix size within memory limits.
Is there a way to achieve this without causing problems for large N?
Your problem has to do with the sheer number of combinations.
What you appear to be doing is listing every combination of 0's and 1's in a sequence with the same length as x.
In your example x has length 5, so you have 2^5 = 32 combinations.
When x has length 22 you have 2^22 = 4194304 combinations.
Couldn't you use a binary encoding instead?
In your case that would mean
0 stands for 00000
1 stands for 00001
2 stands for 00010
3 stands for 00011
...
It will not solve your problem completely, but you should be able to get a bit further than now.
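To make the encoding idea concrete, here is a rough sketch in which each subset is represented by a single integer code and decoded only when needed, so the full 0/1 matrix is never stored. The helper names bits_of and sum_for_code are invented for this illustration. Note that it still visits all 2^N codes, so it saves memory rather than time.

```r
M <- 4
x <- 11:15
n <- length(x)

# Bits of the integer code i, least significant first:
# bit j says whether x[j] is part of the subset
bits_of <- function(i, n) (i %/% 2^(0:(n - 1))) %% 2

# Sum for one code, or NA if the subset has more than M elements
sum_for_code <- function(i) {
  b <- bits_of(i, n)
  if (sum(b) <= M) as.numeric(sum(x[b == 1])) else NA_real_
}

sums <- vapply(0:(2^n - 1), sum_for_code, numeric(1))
sums <- sums[!is.na(sums)]
length(sums)   # 31: all 32 subsets except the one with 5 elements
```

Code 3 is 00011 in binary, so it stands for the subset {x[1], x[2]} with sum 11 + 12 = 23.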
Try this:
c(0, unlist(lapply(1:M, function(k) colSums(combn(x, k)))))
It generates the same result as with your expand.grid approach, shown below for the test data.
M <- 4
x <- 11:15
# expand.grid approach
y <- as.matrix(expand.grid(rep(list(0:1), length(x))))
result <- y[rowSums(y) <= M, ] %*% x
# combn approach
result1 <- c(0, unlist(lapply(1:M, function(k) colSums(combn(x, k)))))
all(sort(result[,1]) == sort(result1))
# [1] TRUE
This should be fast (it takes 0.227577 secs on my machine, with N=22, M=4):
x <- 1:22 # N = 22
M <- 4
c(0, unlist(lapply(1:M, function(k) colSums(combn(x, k)))))
# [1] 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 3 4 5 6 7
You may want to keep only the unique values of the sums with
unique(c(0, unlist(lapply(1:M, function(k) colSums(combn(x, k))))))
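The reason this scales so much better is that it only ever builds the combinations with at most M elements. As a rough check, the number of sums is the sum of choose(N, k) for k from 0 to M, which can be compared with the 2^N rows that expand.grid would create. The helper name n_sums is made up for this sketch.

```r
# Number of subsets of an N-element vector with at most M elements
n_sums <- function(N, M) sum(choose(N, 0:M))

n_sums(5, 4)    # 31
n_sums(22, 4)   # 9109
2^22            # 4194304 rows in the expand.grid approach
```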
Surely there has to be a function out there in some package for this?
I've searched and I've found this function to calculate the mode:
Mode <- function(x) {
ux <- unique(x)
ux[which.max(tabulate(match(x, ux)))]
}
But I'd like a function that lets me easily calculate the 2nd/3rd/4th/nth most common value in a column of data.
Ultimately I will apply this function to a large number of dplyr::group_by()s.
Thank you for your help!
Maybe you could try
f <- function (x) with(rle(sort(x)), values[order(lengths, decreasing = TRUE)])
This gives unique vector values sorted by decreasing frequency. The first will be the mode, the 2nd will be 2nd most common, etc.
Another method is based on table():
g <- function (x) as.numeric(names(sort(table(x), decreasing = TRUE)))
But this is not recommended, as the input vector x will be coerced to a factor first. If you have a large vector, this is very slow. Also, on exit we have to extract the character names of the table and coerce them to numeric.
Example
set.seed(0); x <- rpois(100, 10)
f(x)
# [1] 11 12 7 9 8 13 10 14 5 15 6 2 3 16
Let's compare with the contingency table from table:
tab <- sort(table(x), decreasing = TRUE)
# 11 12 7 9 8 13 10 14 5 15 6 2 3 16
# 14 14 11 11 10 10 9 7 5 4 2 1 1 1
as.numeric(names(tab))
# [1] 11 12 7 9 8 13 10 14 5 15 6 2 3 16
So the results are the same.
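To get just the n-th most common value, you can index the result of f. A minimal sketch, where nth_most_common and the toy data are made up for this example; ties are broken by whatever order() returns, and asking for more distinct values than exist gives NA.

```r
nth_most_common <- function(x, n = 1) {
  r <- rle(sort(x))                                   # distinct values and their counts
  r$values[order(r$lengths, decreasing = TRUE)][n]    # n-th by decreasing count
}

x <- c(3, 1, 2, 1, 2, 1)
nth_most_common(x, 1)   # 1
nth_most_common(x, 2)   # 2
nth_most_common(x, 4)   # NA: only three distinct values

# Per-group use in base R; the same call should also work inside dplyr::summarise()
df <- data.frame(g = c("a", "a", "a", "b", "b", "b"),
                 v = c(5, 5, 6, 7, 8, 8))
by_group <- tapply(df$v, df$g, nth_most_common, n = 2)
by_group                # a: 6, b: 7
```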
Here is an R function that I made (inspired by several other SO posts), which may work for your goal (and I use a local dataset on religious affiliation to illustrate it):
It's simple; only R base functions are involved: length, match, sort, tabulate, table, unique, which, as.character.
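Find_Nth_Mode = function(d, N = 2) {
  # Helper: N-th largest value of x (N is capped at length(x) with a warning)
  maxN = function(x, N){
    len = length(x)
    if(N > len){
      warning('N greater than length(x). Setting N=length(x)')
      N = length(x)
    }
    sort(x, partial = len-N+1)[len-N+1]
  }
  ux = unique(as.character(d))     # distinct values
  a1 = tabulate(match(d, ux))      # count of each distinct value
  a2 = maxN(a1, N)                 # N-th largest count
  a3 = which(a1 == a2)             # position(s) with that count; several if counts tie
  ux[a3]
}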
Sample Output
> table(religion_data$relig11)
                   0.None 1.Protestant_Conservative      2.Protestant_Liberal                3.Catholic
                    34486                      6134                     19678                     36880
               4.Orthodox             5.Islam_Sunni              6.Islam_Shia                   7.Hindu
                    20702                     28170                       668                      4653
               8.Buddhism                  9.Jewish                  10.Other
                     9983                       381                      6851
> Find_Nth_Mode(religion_data$relig11, 1)
[1] "3.Catholic"
> Find_Nth_Mode(religion_data$relig11, 2)
[1] "0.None"
> Find_Nth_Mode(religion_data$relig11, 3)
[1] "5.Islam_Sunni"
Reference:
I want to express my gratitude to these posts, from which I took the two functions and integrated them into one:
function to find the N th largest value: Fastest way to find second (third...) highest/lowest value in vector or column
how to find the second largest mode value?
Calculating the mode or 2nd/3rd/4th most common value
I'm trying to create a tool in R that will calculate the atomic composition (i.e. number of carbon, hydrogen, nitrogen and oxygen atoms) of a peptide chain that is input in single-letter amino acid code. For example, the peptide KGHLY consists of the amino acids lysine (K), glycine (G), histidine (H), leucine (L) and tyrosine (Y). Lysine is made of 6 carbon, 14 hydrogen, 2 nitrogen and 2 oxygen. Glycine is made of 2 carbon, 5 hydrogen, 1 nitrogen and 2 oxygen. etc. etc.
I would like the R code to either read the peptide string (KGHLY) from a data frame or take input from the keyboard using readline().
I am new to R and new to programming. I am able to make objects for each amino acid, e.g. G <- c(2, 5, 1, 2) or build a data frame containing all 20 amino acids and their respective atomic compositions.
The bit that I am struggling with is that I don't know how to get R to index from a data frame in response to a string of letters. I have a feeling the solution is probably very simple but so far I have not been able to find a function that is suited to this task.
There are two main components to take care of here: the choice of
a method for storing the basic data, and the algorithm that
computes the result you desire.
For the computation, it might be preferable to have your data
stored in a matrix, due to the way R recycles the shorter vector
when multiplying two vectors. This recycling also kicks in if you
want to multiply a matrix with a vector, since a matrix is a
vector with some additional attributes (that is to say, dimension
and dimension-names). Consider the example below to see how it
works
test_matrix <- matrix(data = 1:12, nrow = 3)
test_vec <- c(3, 0, 1)
test_matrix
     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    2    5    8   11
[3,]    3    6    9   12
test_matrix * test_vec
     [,1] [,2] [,3] [,4]
[1,]    3   12   21   30
[2,]    0    0    0    0
[3,]    3    6    9   12
Based on this observation, it's possible to deduce that a solution
where each amino acid has one row in a matrix might be a good way
to store the lookup data; when we have a counting vector
specifying the desired amount of contribution from each row, it
is sufficient to multiply our matrix by our counting
vector and then sum the columns, the last part being done with
colSums.
colSums(test_matrix * test_vec)
[1] 6 18 30 42
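As a side note, "multiply each row by a weight and sum the columns" is the same thing as a vector-matrix product, so %*% gives the same numbers. This is only an alternative sketch, not what the rest of the answer uses; the names via_recycling and via_product are made up here.

```r
test_matrix <- matrix(data = 1:12, nrow = 3)
test_vec <- c(3, 0, 1)

via_recycling <- colSums(test_matrix * test_vec)   # recycling, then column sums
via_product   <- drop(test_vec %*% test_matrix)    # row vector times matrix

via_product
all.equal(via_recycling, via_product)
```

The recycling version relies on the vector having exactly as many elements as the matrix has rows; %*% will instead raise an error if the dimensions do not conform.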
It's in general a "pain" to store this kind of information in a
matrix, since it might be a "lot of work" to update the
information later on. However, I guess it's not that often it's
required to add new amino acids, so that might not be an issue in
this case.
So let's create a matrix for the five amino acids needed
for the peptide you mentioned in your example. The numbers were
found on Wikipedia, and hopefully I didn't mess up when I copied
them. Just follow suit to add all the other amino acids too.
amino_acids <- rbind(
G = c(C = 2, H = 5, N = 1, O = 2),
L = c(C = 6, H = 13, N = 1, O = 2),
H = c(C = 6, H = 9, N = 3, O = 2),
K = c(C = 6, H = 14, N = 2, O = 2),
Y = c(C = 9, H = 11, N = 1, O = 3))
amino_acids
  C  H N O
G 2  5 1 2
L 6 13 1 2
H 6  9 3 2
K 6 14 2 2
Y 9 11 1 3
This matrix contains the information we want, but it might be
preferable to have them in lexicographic order - and it would be
nice to ensure that we haven't by mistake added the same row
twice. The code below takes care of both of these issues.
amino_acids <-
amino_acids[sort(unique(rownames(amino_acids))), ]
amino_acids
  C  H N O
G 2  5 1 2
H 6  9 3 2
K 6 14 2 2
L 6 13 1 2
Y 9 11 1 3
The next part is to figure out how to deal with the peptides. This
will here be done by first using strsplit to split the string
into separate characters, and then use a table-solution upon the
result to get the vector that we want to multiply with the matrix.
peptide <- "KGHLY"
peptide_2 <- unlist(strsplit(x = peptide, split = ""))
peptide_2
[1] "K" "G" "H" "L" "Y"
Using table upon peptide_2 gives us
table(peptide_2)
peptide_2
G H K L Y
1 1 1 1 1
This can thus be used to define a vector to play the role of test_vec in the first example. However, in general the resulting vector will contain fewer components than there are rows in the matrix amino_acids, so the matrix must first be restricted to the matching rows in order to get the format we want for our computation.
Several options are available, and the simplest one might be to use the names from the table to subset the required rows from amino_acids, such that the computation can proceed without any further fuss.
peptide_vec <- table(peptide_2)
colSums(amino_acids[names(peptide_vec), ] * as.vector(peptide_vec))
C H N O
29 52 8 11
This outlines one possible solution for the core of your problem,
and this can be collected into a function that takes care of all
the steps for us.
peptide_function <- function(peptide, amino_acids) {
  peptide_vec <- table(
    unlist(strsplit(x = peptide, split = "")))
  ## Compute the result and return it to the work flow.
  ## drop = FALSE keeps a matrix even when only one row is selected,
  ## e.g. for a peptide such as "GGG".
  colSums(
    amino_acids[names(peptide_vec), , drop = FALSE] *
      as.vector(peptide_vec))
}
And finally a test to see that we get the same answer as before.
peptide_function(peptide = "GHKLY",
amino_acids = amino_acids)
C H N O
29 52 8 11
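Another of the options hinted at above is to count against the full set of row names, so that the count vector always lines up with the whole matrix and no subsetting is needed. The sketch below does this with a factor whose levels are the row names, and it also stops on letters that are not in the table. The function name composition and the error message are made up for this example.

```r
amino_acids <- rbind(
  G = c(C = 2, H = 5,  N = 1, O = 2),
  H = c(C = 6, H = 9,  N = 3, O = 2),
  K = c(C = 6, H = 14, N = 2, O = 2),
  L = c(C = 6, H = 13, N = 1, O = 2),
  Y = c(C = 9, H = 11, N = 1, O = 3))

composition <- function(peptide, amino_acids) {
  chars <- strsplit(peptide, split = "")[[1]]
  unknown <- setdiff(chars, rownames(amino_acids))
  if (length(unknown) > 0)
    stop("unknown amino acid code(s): ", paste(unknown, collapse = ", "))
  ## One count per row of amino_acids, zero for letters that do not occur
  counts <- table(factor(chars, levels = rownames(amino_acids)))
  drop(as.vector(counts) %*% amino_acids)
}

composition("KGHLY", amino_acids)   # 29 52 8 11, as before
composition("GGG", amino_acids)     # 6 15 3 6
```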
What next? Well, that depends on how you have stored your
peptides, and what you would like to do with the result. If, for
example, you have the peptides stored in a vector and would like
to have the result stored in a matrix, then one option is to
use vapply as given below.
data_vector <- c("GHKLY", "GGLY", "HKLGL")
result <- t(vapply(
  X = data_vector,
  FUN = peptide_function,
  FUN.VALUE = numeric(4),
  amino_acids = amino_acids))
result
       C  H N  O
GHKLY 29 52 8 11
GGLY  19 34 4  9
HKLGL 26 54 8 10
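Since you mentioned reading the peptides from a data frame, here is a sketch of how the same vapply call could be used on a column and the result attached as new columns. The data frame peptides and its column names id and sequence are invented for this example, and the function is repeated in compact form (with drop = FALSE) so that the block runs on its own.

```r
amino_acids <- rbind(
  G = c(C = 2, H = 5,  N = 1, O = 2),
  H = c(C = 6, H = 9,  N = 3, O = 2),
  K = c(C = 6, H = 14, N = 2, O = 2),
  L = c(C = 6, H = 13, N = 1, O = 2),
  Y = c(C = 9, H = 11, N = 1, O = 3))

peptide_function <- function(peptide, amino_acids) {
  peptide_vec <- table(unlist(strsplit(x = peptide, split = "")))
  colSums(amino_acids[names(peptide_vec), , drop = FALSE] *
            as.vector(peptide_vec))
}

## The sequence column must be character, not factor, for strsplit to work
peptides <- data.frame(id = 1:3,
                       sequence = c("GHKLY", "GGLY", "HKLGL"),
                       stringsAsFactors = FALSE)

## USE.NAMES = FALSE avoids duplicated row names if a sequence occurs twice
atoms <- t(vapply(peptides$sequence, peptide_function,
                  FUN.VALUE = numeric(4),
                  amino_acids = amino_acids,
                  USE.NAMES = FALSE))
peptides <- cbind(peptides, atoms)
peptides
```

For keyboard input in an interactive session, something like peptide <- readline("Peptide: ") followed by peptide_function(peptide, amino_acids) should do.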
I am trying to perform following kind of summation on a matrix:
Let's say the matrix is:
mat <- matrix(c(1:5,rep(0,7),c(1:7),rep(0,5),c(1:10), 0,0), 12,3)
I want to do cumulative sum on rows up to row numbers 5, 7, 10 for column numbers 1,2,3 respectively. (The real data can have arbitrary number of rows and columns).
For now, I have been using following code:
sum1 <- matrix(rep(0, 36), 12, 3)
row_index <- c(5,7,10)
for (k in 1:3) {
  sum1[1:row_index[k], k] <- cumsum(mat[1:row_index[k], k])
}
sum1 <- matrix(apply(sum1,1,sum))
To start with, I have the matrix and row_index. I want to avoid using the loop as the data has a lot of columns. I am wondering if there is a way to do that.
depth <- c(5,7,10)
mapply( function(x,y) cumsum(mat[1:x, y]), depth, seq_along(depth) )
[[1]]
[1] 1 3 6 10 15
[[2]]
[1] 1 3 6 10 15 21 28
[[3]]
[1] 1 3 6 10 15 21 28 36 45 55
First, define a function:
sumcolumn <- function(rows, columns, mat){
  cumsum(mat[1:rows, columns])
}
Then use mapply on your vectors of columns/rows:
mapply(sumcolumn, rows = c(5, 7, 10), columns = c(1, 2, 3), MoreArgs = list(mat = mat))
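Both mapply versions return a list of cumulative sums of different lengths, so they stop one step short of the row-wise total (sum1) from the question. One way to get all the way there without an explicit loop is to take the cumulative sum of every full column and then zero out the rows past each column's cutoff with a logical mask. This is a sketch of that idea; the names cs and mask are made up here.

```r
mat <- matrix(c(1:5, rep(0, 7), c(1:7), rep(0, 5), c(1:10), 0, 0), 12, 3)
row_index <- c(5, 7, 10)

cs   <- apply(mat, 2, cumsum)               # cumulative sum down every column
mask <- row(mat) <= row_index[col(mat)]     # TRUE up to each column's cutoff row
sum1 <- rowSums(cs * mask)                  # masked cells contribute 0

sum1
#>  [1]  3  9 18 30 45 42 56 36 45 55  0  0
```

This gives a plain vector; wrap it in matrix() if you need the 12 x 1 shape produced by the loop version.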