I am trying to do sampling of variables for a statistical analysis. I have 10 variables, and I want to examine every possible combination of 5 of them. However, I only want those that follow certain rules. I only want those with 1 xor 2, 3 xor 4, 5 xor 6, 7 xor 8 and 9 xor 10. In other words, all combinations given 5 binary choices (32).
Any idea how to do this efficiently?
A simple idea is to find all the 5 out 10 using:
library(gtools)
sets = combinations(10,5) # choose 5 out of 10, all possibilities
sets = split(sets, seq.int(nrow(sets))) #so it's loopable
And then loop over these keeping only the ones that meet the criteria and thus ending up with the 32 ones desired.
But surely there is a more efficient way than this.
This will construct a matrix whose 32 rows enumerate all the possible combinations satisfying your contraint:
m <- as.matrix(expand.grid(1:2, 3:4, 5:6, 7:8, 9:10))
## Inspect a few of the rows to see that this works:
m[c(1,4,9,16,25),]
# Var1 Var2 Var3 Var4 Var5
# [1,] 1 3 5 7 9
# [2,] 2 4 5 7 9
# [3,] 1 3 5 8 9
# [4,] 2 4 6 8 9
# [5,] 1 3 5 8 10
I found a solution too, but it's not quite as elegant as Josh O'Brien's above.
library(R.utils) #for intToBin()
binaries = intToBin(0:31) #binary numbers 0 to 31
sets = list() #empty list
for (set in binaries) { #loop over each binary number string
vars = numeric() #empty vector
for (cif in 1:5) { #loop over each char in the string
if (substr(set,cif,cif)=="0"){ #if its 0
vars = c(vars,cif*2-1) #add the first var
}
else {
vars = c(vars,cif*2) #else, add the second var
}
}
sets[[set]] = as.vector(vars) #add result to list
}
Based on the idea in your answer, an alternative for the record:
n = 5
sets = matrix(1:10, ncol = 2, byrow = TRUE)
#the "on-off" combinations for each position
combs = lapply(0:(2^n - 1), function(x) as.integer(intToBits(x)[seq_len(n)]))
#a way to get the actual values
matrix(sets[cbind(seq_len(n), unlist(combs) + 1L)], ncol = n, byrow = TRUE)
Related
I'd like to make a data frame using only the last computed values from a Repeat loop.
For the repeat and sample functions, I'm using this data. The numbers in Prob column are the probabilities of each number to occur.
enter image description here
b <- 1
repeat {
c <- sample(a$Plus, size=1, prob=(a$Prob))
cat(b, '\t', c, '\n')
b <- b + 1
if (c >= 10) {
{
break
}
}
}
#I'm interested in the result greater than 10 only
If I run the code above, then it will compute something like
1 4
2 8
3 13
If I run this again, it will compute different results like..
1 9
2 3
3 7
4 3
5 11
What I'd like to do is to make a data frame using only the last outputs of each loop.
For example, using the computed data above, I'd like to make a frame that looks like
Trial Result
3 13
5 11
Is there any way to repeat this loop the number of times I want to and make a data frame using only the last outputs of each repeated function?
You can use a user defined function to do this. Since you haven't given your dataframe a, I've defined it as follows:
library(tidyverse)
a <- tibble(
Plus = 1:15,
Prob = seq(from = 15, to = 1, by = -1)
)
The following function does the same thing as your repeat loop, but stores the relevant results in a tibble. I've left your variable b out of this because as far as I can see, it doesn't contribute to your desired output.
samplefun <- function(a) {
c <- sample(a$Plus, size=length(a$Plus), prob=a$Prob)
res <- tibble(
Trial = which(c >= 10)[1],
Result = c[which(c >= 10)[1]]
)
return(res)
}
Then use map_dfr to return as many samples as you like:
nsamples <- 5
map_dfr(1:nsamples, ~ samplefun(a))
Output:
# A tibble: 5 x 2
Trial Result
<int> <int>
1 4 11
2 6 14
3 5 11
4 2 10
5 4 15
This seems like it must be a duplicate but I can't find a solution, probably because I don't know exactly what to search for.
Say I have a bucket of 8 numbered marbles and 10 people who will each sample 1 marble from the bucket.
How can I write a sampling procedure where each person draws a marble from the bucket without replacement until the bucket is empty, at which point all the marbles are put back into the bucket, and sampling continues without replacement? Is there a name for this kind of sampling?
For instance, a hypothetical result from this sampling with our 10 people and bucket of 8 marbles would be:
person marble
1 1 1
2 2 2
3 3 3
4 4 4
5 5 5
6 6 6
7 7 7
8 8 8
9 9 1
10 10 2
Note that the marbles are drawn randomly, so not necessarily in numerical order. This is just an example output to get my point across.
Building on the answer from MÃ¥nsT, here is a function to do this programmatically. I put in functionality to smoothly handle cases where the number of samples take is less than the population size (and return the expected behavior).
sample_more <- function(pop, n, seed = NULL) {
set.seed(seed) # For reproducibility, if desired. Defaults to NULL, which is no seed
m <- length(pop)
if(n <= m) { # handles case when n is smaller than the population size
sample(pop, n, replace = FALSE)
} else { # handle case when n is greater than population size
c(sample(pop, m, replace = FALSE), sample(pop, n-m, replace = FALSE))
}
}
marbles <- 1:8
sample_more(marbles, 10, seed = 1)
[1] 1 4 8 2 6 3 7 5 2 3
sample_more(marbles, 3, seed = 1)
[1] 1 4 8
Not sure if there is a name for this, but a simple solution would be to just use sample several times:
# Create a bucket with 8 marbles:
bucket <- 1:8
# First draw 8 marbles without replacement, then refill the bucket at draw 2 more:
marbles <- c(sample(bucket, 8, replace = FALSE), sample(bucket, 2, replace = FALSE))
marbles
You can dynamically use sample in a for loop to generate the list.
For marbles 1:8 and over n people
Example:
bucket <- 1:8
n <- 100
marbleList <- c()
for(i in 1:ceiling(n / length(bucket))){
marbleList <- c(marbleList, sample(bucket))
}
marbleList <- marbleList[1:n]
I have multiple vectors Di, where i = 1, 2,..., 40. Now in a for-loop, I want to do some operations on these. The following pseudo-code summarizes my objective.
for i in 1:40
D = Di # How to do this?
# ... do some operations on D #
Edit: Please note that each Di is a separate vector.
put them all in a list, each list object (the vector) can be accessed by using the index notation.
MyVectors = list(D1 = c(1:10),
D2 = c(11:20))
> MyVectors[[1]]
[1] 1 2 3 4 5 6 7 8 9 10
> MyVectors[[2]]
[1] 11 12 13 14 15 16 17 18 19 20
therefore you can access them as such:
for(i in 1:2){
MyVectors[[i]] = MyVectors[[i]] + 2
}
Funnily, I just answered a similar question about 45 minutes ago. I stand by the philosophy I described in that answer with respect to this question. But because you have 40 loose objects, instead of just 2, the "separateness" approach really doesn't make sense. You should use the "systematicness" approach, as follows:
Ds <- list(
c(...), ## 1st vector
c(...), ## 2nd vector
...
c(...) ## 40th vector
);
for (i in seq_along(Ds)) {
## do some operations on Ds[[i]]
};
Funnily, I answered another question an hour ago. In the same way, we can place the vectors in a list and then do the operation with in each list element
MyVectors = list(D1 = c(1:10),
D2 = c(11:20))
lapply(MyVectors, function(x) x +2)
I'm trying to create a tool in R that will calculate the atomic composition (i.e. number of carbon, hydrogen, nitrogen and oxygen atoms) of a peptide chain that is input in single letter amino acid code. For example, the peptide KGHLY consists of the amino acids lysine (K), glycine (G), histidine (H), leucine (L) and tyrosine (Y). Lysine is made of 6 carbon, 13 hydrogen, 1 nitrogen and 2 oxygen. Glycine is made of 2 carbon, 5 hydrogen, 1 nitrogen and 2 oxygen. etc. etc.
I would like the r code to either read the peptide string (KGHLY) from a data frame or take input from the keyboard using readline()
I am new to R and new to programming. I am able to make objects for each amino acid, e.g. G <- c(2, 5, 1, 2) or build a data frame containing all 20 amino acids and their respective atomic compositions.
The bit that I am struggling with is that I don't know how to get R to index from a data frame in response to a string of letters. I have a feeling the solution is probably very simple but so far I have not been able to find a function that is suited to this task.
There's two main components to take care of here: The selection of
a method for the storing of the basic data and the algorithm that
computes the result you desire.
For the computation, it might be preferable to have your data
stored in a matrix, due to the way R recycles the shorter vector
when multiplying two vectors. This recycling also kicks in if you
want to multiply a matrix with a vector, since a matrix is a
vector with some additional attributes (that is to say, dimension
and dimension-names). Consider the example below to see how it
works
test_matrix <- matrix(data = 1:12, nrow = 3)
test_vec <- c(3, 0, 1)
test_matrix
[,1] [,2] [,3] [,4]
[1,] 1 4 7 10
[2,] 2 5 8 11
[3,] 3 6 9 12
test_matrix * test_vec
[,1] [,2] [,3] [,4]
[1,] 3 12 21 30
[2,] 0 0 0 0
[3,] 3 6 9 12
Based on this observation, it's possible to deduce that a solution
where each amino acid has one row in a matrix might be a good way
to store the look up data; when we have a counting vector with
specifying the desired amount of contribution from each row, it
will be sufficient to multiply our matrix with our counting
vector, and then sum the columns - the last part solved using
colSums.
colSums(test_matrix * test_vec)
[1] 6 18 30 42
It's in general a "pain" to store this kind of information in a
matrix, since it might be a "lot of work" to update the
information later on. However, I guess it's not that often it's
required to add new amino acids, so that might not be an issue in
this case.
So let's create a matrix for the the five amino acids needed
for the peptide you mentioned in your example. The numbers was
found on Wikipedia, and hopefully I didn't mess up when I copied
them. Just follow suit to add all the other amino acids too.
amino_acids <- rbind(
G = c(C = 2, H = 5, N = 1, O = 2),
L = c(C = 6, H = 13, N = 1, O = 2),
H = c(C = 6, H = 9, N = 3, O = 2),
K = c(C = 6, H = 14, N = 2, O = 2),
Y = c(C = 9, H = 11, N = 1, O = 3))
amino_acids
C H N O
G 2 5 1 2
L 6 13 1 2
H 6 9 3 2
K 6 14 2 2
Y 9 11 1 3
This matrix contains the information we want, but it might be
preferable to have them in lexicographic order - and it would be
nice to ensure that we haven't by mistake added the same row
twice. The code below takes care of both of these issues.
amino_acids <-
amino_acids[sort(unique(rownames(amino_acids))), ]
amino_acids
C H N O
G 2 5 1 2
H 6 9 3 2
K 6 14 2 2
L 6 13 1 2
Y 9 11 1 3
The next part is to figure out how to deal with the peptides. This
will here be done by first using strsplit to split the string
into separate characters, and then use a table-solution upon the
result to get the vector that we want to multiply with the matrix.
peptide <- "KGHLY"
peptide_2 <- unlist(strsplit(x = peptide, split = ""))
peptide_2
[1] "K" "G" "H" "L" "Y"
Using table upon peptide_2 gives us
table(peptide_2)
peptide_2
G H K L Y
1 1 1 1 1
This can thus be used to define a vector to play the role of test_vec in the first example. However, in general the resulting vector will contain fewer components than the rows of the matrix amino_acids; so a restriction must be performed first, in order to get the correct format we want for our computation.
Several options is available, and the simplest one might be to use the names from the table to subset the required rows from amino_acids, such that the computation can proceed without any further fuzz.
peptide_vec <- table(peptide_2)
colSums(amino_acids[names(peptide_vec), ] * as.vector(peptide_vec))
C H N O
29 52 8 11
This outlines one possible solution for the core of your problem,
and this can be collected into a function that takes care of all
the steps for us.
peptide_function <- function(peptide, amino_acids) {
peptide_vec <- table(
unlist(strsplit(x = peptide, split = "")))
## Compute the result and return it to the work flow.
colSums(
amino_acids[names(peptide_vec), ] *
as.vector(peptide_vec))
}
And finally a test to see that we get the same answer as before.
peptide_function(peptide = "GHKLY",
amino_acids = amino_acids)
C H N O
29 52 8 11
What next? Well that depends on how you have stored your
peptides, and what you would like to do with the result. If for
example you have the peptides stored in a vector, and would like
to have the result stored in a matrix, then it might e.g. be
possible to use vapply as given below.
data_vector <- c("GHKLY", "GGLY", "HKLGL")
result <- t(vapply(
X = data_vector,
FUN = peptide_function,
FUN.VALUE = numeric(4),
amino_acids = amino_acids))
result
C H N O
GHKLY 29 52 8 11
GGLY 19 34 4 9
HKLGL 26 54 8 10
I'm using R, and I have two data.frames, A and B. They both have 6 rows, but A has 25000 columns (genes), and B has 30 columns. I'd like to apply a function with two arguments f(x,y) where x is every column of A and y is every column of B. So far it looks like this:
i = 1
for (x in A){
j = 1
for (y in B){
out[i,j] <- f(x,y)
j = j + 1
}
i = i + 1
}
I have two issues with this: from my Python programming I associate keeping track of counters like this as crufty, and from my R programming I am nervous of for loops. However, I can't quite see how to apply apply (or even if I should apply apply) to this problem and was hoping someone might enlighten me. I need to treat f() as atomic (it's actually cor.test()) for now.
Since you are using data frames, it might be faster to use lapply or sapply to do this (specially given the scope of your data frames). For example,
x <- data.frame(col1=c(1,2,3,4), col2=c(5,6,7,8), col3=c(9,10,11,12))
y <- data.frame(col1=c(1,2,3,4), col2=c(5,6,7,8))
bl <- lapply(x, function(u){
lapply(y, function(v){
f(u,v) # Function with column from x and column from y as inputs
})
})
out = matrix(unlist(bl), ncol=ncol(y), byrow=T)
Some data
nrows <- 6
A <- data.frame(a = runif(nrows), b = runif(nrows), c = runif(nrows))
B <- data.frame(z = rnorm(nrows), y = rnorm(nrows))
The trick: remember columns with expand.grid
counter <- expand.grid(seq_along(A), seq_along(B))
f <- function(x)
{
cor.test(A[, x["Var1"]], B[, x["Var2"]])$estimate
}
Now we only need 1 call to apply.
stats <- apply(counter, 1, f)
names(stats) <- paste(names(A)[counter$Var1], names(B)[counter$Var2], sep = ",")
stats
Nesting the applies works, not the easiest syntax, though.
x<-data.frame(col1=c(1,2,3,4), col2=c(5,6,7,8), col3=c(9,10,11,12))
y<-data.frame(col1=c(1,2,3,4), col2=c(5,6,7,8))
z<-apply(x,2,function(col,df2)
{
apply(df2,2,function(col2,col1)
{
col2+col1
},col)
},y)
z
col1 col2 col3
[1,] 2 6 10
[2,] 4 8 12
[3,] 6 10 14
[4,] 8 12 16
[5,] 6 10 14
[6,] 8 12 16
[7,] 10 14 18
[8,] 12 16 20