Limiting Duplication of Specified Columns - r

I'm trying to find a way to add some constraints to a linear programme to force the solution to have a certain level of uniqueness. I'll try to explain what I mean. In the example below, the linear programme returns the maximum possible Score for a combination of 2 males and 1 female.
Looking at the Team/Grade/Rep columns however we can see that there is a lot of duplication from row to row. In fact Shana and Jason are identical.
Name<-c("Jane","Brad","Harry","Shana","Debra","Jason")
Sex<-c("F","M","M","F","F","M")
Score<-c(25,50,36,40,39,62)
Team<-c("A","A","A","B","B","B")
Grade<-c(1,2,1,2,1,2)
Rep<-c("C","D","C","D","D","D")
df<-data.frame(Name,Sex,Score,Team,Grade,Rep)
df
Name Sex Score Team Grade Rep
1 Jane F 25 A 1 C
2 Brad M 50 A 2 D
3 Harry M 36 A 1 C
4 Shana F 40 B 2 D
5 Debra F 39 B 1 D
6 Jason M 62 B 2 D
library(Rglpk)
num <- length(df$Name)
obj<-df$Score
var.types<-rep("B",num)
matrix <- rbind(as.numeric(df$Sex == "M"),as.numeric(df$Sex == "F"))
direction <- c("==","==")
rhs<-c(2,1)
sol <- Rglpk_solve_LP(obj = obj, mat = matrix, dir = direction, rhs = rhs,types = var.types, max = TRUE)
df[sol$solution==1,]
Name Sex Score Team Grade Rep
2 Brad M 50 A 2 D
4 Shana F 40 B 2 D
6 Jason M 62 B 2 D
What I am trying to work out is how to limit the level of duplication across those last three columns. For example, I would like no two selected rows to share more than 2 of those columns. This would mean that either the Shana row or the Jason row would be replaced in the solution with an alternative.
I'm not sure if this is something that can be easily added into the Rglpk model? Appreciate any help that can be offered.

It sounds like you're asking how to prevent having a pair of individuals who are "too similar" from being returned by your optimization model. Once you have determined a rule for what makes a pair of people "too similar", you can simply add a constraint for each pair, limiting your solution to have no more than one of those two people.
For instance, if we use your rule of having no more than 2 columns the same, we could easily identify all pairs that we want to block:
pairs <- t(combn(nrow(df), 2))
(blocked <- pairs[rowSums(sapply(df[,c("Team", "Grade", "Rep")], function(x) {
x[pairs[,1]] == x[pairs[,2]]
})) >= 3, , drop = FALSE]) # drop = FALSE keeps a matrix even if only one pair is blocked
# [,1] [,2]
# [1,] 1 3
# [2,] 4 6
We want to block the pairs Jane/Harry and Shana/Jason. This is easy to do with linear constraints:
library(Rglpk)
num <- length(df$Name)
obj<-df$Score
var.types<-rep("B",num)
matrix <- rbind(as.numeric(df$Sex == "M"), as.numeric(df$Sex == "F"),
outer(blocked[,1], seq_len(num), "==") + outer(blocked[,2], seq_len(num), "=="))
direction <- rep(c("==", "<="), c(2, nrow(blocked)))
rhs<-c(2, 1, rep(1, nrow(blocked)))
sol <- Rglpk_solve_LP(obj = obj, mat = matrix, dir = direction, rhs = rhs,types = var.types, max = TRUE)
df[sol$solution==1,]
# Name Sex Score Team Grade Rep
# 2 Brad M 50 A 2 D
# 5 Debra F 39 B 1 D
# 6 Jason M 62 B 2 D
The approach of computing every pair to block is attractive because we could have a much more complicated rule for which pairs to block, since we don't need to encode the rule into the linear program. All we need to be able to do is to compute every pair that needs to be blocked.
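Because the rule lives outside the model, it can be wrapped in a small helper and varied freely. A minimal base R sketch, where `blocked_pairs` and its `min_same` argument are names made up for illustration (they are not part of Rglpk):

```r
# Data from the question (only the columns needed here)
df <- data.frame(
  Name  = c("Jane", "Brad", "Harry", "Shana", "Debra", "Jason"),
  Team  = c("A", "A", "A", "B", "B", "B"),
  Grade = c(1, 2, 1, 2, 1, 2),
  Rep   = c("C", "D", "C", "D", "D", "D")
)

# Hypothetical helper: all pairs of rows that agree on at least
# `min_same` of the columns named in `cols`
blocked_pairs <- function(df, cols, min_same) {
  pairs <- t(combn(nrow(df), 2))
  # one logical column per attribute: TRUE where the two rows match
  same <- do.call(cbind, lapply(df[cols], function(x) {
    x[pairs[, 1]] == x[pairs[, 2]]
  }))
  pairs[rowSums(same) >= min_same, , drop = FALSE]
}

cols   <- c("Team", "Grade", "Rep")
strict <- blocked_pairs(df, cols, min_same = 3)  # identical on all three columns
loose  <- blocked_pairs(df, cols, min_same = 2)  # a tighter uniqueness rule
strict
loose
```

Either matrix can then be used in place of `blocked` when building the constraint rows with `outer()`.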

For each group of rows having the same last 3 columns we construct a constraint such that at most one of those rows may appear. If a is an indicator vector of the rows of such a group then the constraint would look like this:
a'x <= 1
To do that, split the row numbers by the last 3 columns into a list of vectors s, each of whose components is a vector of row numbers for rows having the same last 3 columns. Keep only those components having more than 1 row number, giving s1. In this case the first component of s1 is c(1, 3), referring to the Jane and Harry rows, and the second component is c(4, 6), referring to the Shana and Jason rows. In this particular data there are 2 rows in each of the groups, but in other data there could be more than 2 rows in a group. excl has one row (constraint) for each element of s1.
The data in the question only has groups of size 2 but in general if there were k rows in some group one would need k choose 2 constraint rows to ensure that only one of the k were chosen if this were done pairwise whereas the approach here only requires one constraint row for the entire group. For example, if k = 10 then choose(10, 2) = 45 so this uses 1 constraint in place of 45.
Finally rbind excl to matrix giving matrix2 and adjust the other Rglpk_solve_LP arguments accordingly giving:
nr <- nrow(df)
s <- split(1:nr, df[4:6])
s1 <- s[lengths(s) > 1]
excl <-t(sapply(s1, "%in%", x = 1:nr)) + 0
matrix2 <- rbind(matrix, excl)
direction2 <- c(direction, rep("<=", nrow(excl)))
rhs2 <- c(rhs, rep(1, nrow(excl)))
sol2 <- Rglpk_solve_LP(obj = obj, mat = matrix2,
dir = direction2, rhs = rhs2, types = "B", max = TRUE)
df[ sol2$solution == 1, ]
giving:
Name Sex Score Team Grade Rep
2 Brad M 50 A 2 D
5 Debra F 39 B 1 D
6 Jason M 62 B 2 D
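As a quick check that needs no solver, the group constraints can be built and evaluated directly against the two selections shown on this page. This is only a sketch: `picked_before` and `picked_after` are illustrative names, and the row numbers are copied from the outputs above.

```r
df <- data.frame(
  Team  = c("A", "A", "A", "B", "B", "B"),
  Grade = c(1, 2, 1, 2, 1, 2),
  Rep   = c("C", "D", "C", "D", "D", "D")
)
nr <- nrow(df)

# Row numbers grouped by identical Team/Grade/Rep; keep groups of size > 1
s  <- split(seq_len(nr), df[c("Team", "Grade", "Rep")])
s1 <- s[lengths(s) > 1]

# One 0/1 indicator row per group (rbind keeps this a matrix even for one group)
excl <- do.call(rbind, lapply(s1, function(rows) as.numeric(seq_len(nr) %in% rows)))

# Selections as 0/1 vectors: rows 2, 4, 6 (original) and 2, 5, 6 (constrained)
picked_before <- as.numeric(seq_len(nr) %in% c(2, 4, 6))
picked_after  <- as.numeric(seq_len(nr) %in% c(2, 5, 6))

excl %*% picked_before  # the Shana/Jason group sums to 2, so a'x <= 1 is violated
excl %*% picked_after   # every group sums to at most 1

# Pairwise constraints needed for a group of 10 rows, versus 1 group constraint
choose(10, 2)
```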

Related

how to create a row that is calculated from another row automatically like how we do it in excel?

does anyone know how to have a row in R that is calculated from another row automatically? i.e.
lets say in excel, i want to make a row C, which is made up of (B2/B1)
e.g. C1 = B2/B1
C2 = B3/B2
...
Cn = Bn+1/Bn
but in excel, we only need to do one calculation then drag it down. how do we do it in R?
In R you work with columns as vectors so the operations are vectorized. The calculations as described could be implemented by the following commands, given a data.frame df (i.e. a table) and the respective column names as mentioned:
df["C1"] <- df["B2"]/df["B1"]
df["C2"] <- df["B3"]/df["B2"]
In R you usually would name the columns according to the content they hold. With that, you refer to the columns by their name, although you can also address the first column as df[, 1], the first row as df[1, ] and so on.
EDIT 1:
There are multiple ways - and certainly some more elegant ways to get it done - but for understanding I kept it in simple base R:
Example dataset for demonstration:
df <- data.frame("B1" = c(1, 2, 3),
"B2" = c(2, 4, 6),
"B3" = c(4, 8, 12))
Column calculation:
for (i in seq_len(ncol(df) - 1)) { # note: 1:ncol(df)-1 would evaluate to 0:(ncol(df)-1)
col_name <- paste0("C", i)
df[col_name] <- df[, i+1]/df[, i]
}
Output:
B1 B2 B3 C1 C2
1 1 2 4 2 2
2 2 4 8 2 2
3 3 6 12 2 2
So you iterate through the available columns B1/B2/B3. Dynamically create a column name in every iteration, based on the number of the current iteration, and then calculate the respective column contents.
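The same ratios can also be computed without a loop by dividing the table by a copy of itself shifted one column to the left. A minimal base R sketch, assuming all columns are numeric; `m` and `ratios` are illustrative names:

```r
df <- data.frame(B1 = c(1, 2, 3),
                 B2 = c(2, 4, 6),
                 B3 = c(4, 8, 12))

m <- as.matrix(df)
# columns 2..n divided elementwise by columns 1..(n-1)
ratios <- m[, -1, drop = FALSE] / m[, -ncol(m), drop = FALSE]
colnames(ratios) <- paste0("C", seq_len(ncol(ratios)))

df <- cbind(df, ratios)
df
```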
EDIT 2:
Rowwise, as you actually meant it apparently, works similarly:
a <- c(10,15,20, 1)
df <- data.frame(a)
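df$b <- NA_real_ # create the column first; df$b[i] <- ... fails if b does not exist yet
for (i in seq_len(nrow(df))) {
df$b[i] <- df$a[i+1]/df$a[i] # a[i+1] is NA for the last row
}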
Output:
a b
1 10 1.500000
2 15 1.333333
3 20 0.050000
4 1 NA
You can do this just using vectors, without a for loop.
a <- c(10,15,20, 1)
df <- data.frame(a)
df$b <- c(df$a[-1], 0) / df$a
print(df)
a b
1 10 1.500000
2 15 1.333333
3 20 0.050000
4 1 0.000000
Explanation:
In the example data, df$a is the vector 10 15 20 1.
df$a[-1] is the same vector with its first element removed, 15 20 1.
c() then appends a new element to the end so that the vector has the same length as before:
c(df$a[-1],0) which is 15 20 1 0
What we want for column b is this vector divided by the original df$a.
So:
df$b <- c(df$a[-1], 0) / df$a
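If a missing value is preferred to 0 in the last row (there is no following value to divide), the vector can be padded with NA instead, which matches the loop-based output above. A small sketch; `next_a` is an illustrative name:

```r
a <- c(10, 15, 20, 1)
df <- data.frame(a)

# the next row's value for every row; the last row has none, so use NA
next_a <- c(df$a[-1], NA)
df$b <- next_a / df$a
df
```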

How to calculate mean of a certain number of rows in Column B, given they equal a certain value in column A?

I have 569 rows of data related to breast cancer. In column A, each row either has a value of 'M' or 'B' in the cell (malignant or benign). In column B, the concavity of the nucleus of each tumour is given. I want to find the mean concavity for all malignant tumours, and for all benign tumours, separately.
Edit: first 25 rows of columns A and B given below as an example
> df2
data2.diagnosis data2.concavity_mean
1 M 0.3001000
2 M 0.0869000
3 M 0.1974000
4 M 0.2414000
5 M 0.1980000
6 M 0.1578000
7 M 0.1127000
8 M 0.0936600
9 M 0.1859000
10 M 0.2273000
11 M 0.0329900
12 M 0.0995400
13 M 0.2065000
14 M 0.0993800
15 M 0.2128000
16 M 0.1639000
17 M 0.0739500
18 M 0.1722000
19 M 0.1479000
20 B 0.0666400
21 B 0.0456800
22 B 0.0295600
23 M 0.2077000
24 M 0.1097000
25 M 0.1525000
How do I ask R to give me "the mean of rows in column B, given their value in column A is M" and then "given their value in column A is B"?
Assuming your variable A is a factor, a base R approach for the example dataframe example would be
example <- data.frame(A = as.factor(c('M','B','M', 'B')), B=c(1,2,3,4))
mean(example$B[example$A == 'M'])
#> [1] 2
# for both factor levels simultaneously you can use
by(example$B, example$A, mean)
#> example$A: B
#> [1] 3
# ---- #
#> example$A: M
#> [1] 2
Note. Created on 2022-01-16 by the reprex package (v2.0.1)
Borrowing the example from one of the answers above (which are valid solutions), here is an alternative using dplyr from the tidyverse:
library(dplyr)
example <- data.frame(A = as.factor(c('M','B','M', 'B')), B=c(1,2,3,4))
# creates a new table with summarized values
example %>% # takes your data table
group_by(A) %>% # groups it by the factor levels in column A
summarize(mean_B = mean(B)) # finds the mean of B within each group
If you found this or any of the other answers helpful, please mark it as accepted.
As pointed out in the comments, it would be nice to have a reproducible example and your data (or at least a subset of it) to see what you are dealing with.
Anyway, the solution to your problem should resemble the following (I am using simulated data):
set.seed(1986)
dta <- data.frame(type = rep(c("B", "M"), each = 5), nucleus = rnorm(10))
mean(dta$nucleus[dta$type == "B"]) # Mean concavity for benign.
mean(dta$nucleus[dta$type == "M"]) # Mean concavity for malignant.
Basically, I am just applying the mean() function to two subsets of the data, by selecting rows with the [] operator.
EDIT
Now that we have an idea of your actual data, I can provide a complete solution:
mean(df2$data2.concavity_mean[df2$data2.diagnosis == "B"]) # Mean concavity for benign.
mean(df2$data2.concavity_mean[df2$data2.diagnosis == "M"]) # Mean concavity for malignant.
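Both group means can also be obtained in one call with `tapply()`. A sketch using six of the rows posted in the question (rows 18 to 23), with the column names taken from the `df2` printout; `group_means` is an illustrative name:

```r
df2 <- data.frame(
  data2.diagnosis      = c("M", "M", "B", "B", "B", "M"),
  data2.concavity_mean = c(0.1722, 0.1479, 0.06664, 0.04568, 0.02956, 0.2077)
)

# mean of the concavity column within each diagnosis group
group_means <- tapply(df2$data2.concavity_mean, df2$data2.diagnosis, mean)
group_means
```

On the full 569 rows the call is the same; only the data frame changes.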

Index and assign multiple sets of rows at once

I have an imported dataframe Measurements that contains many observations from an experiment.
Measurements <- data.frame(X = 1:4,
Data = c(90, 85, 100, 105))
X Data
1 90
2 85
3 100
4 105
I want to add another column Condition that specifies the treatment group for each datapoint. I know which observation ranges are from which condition (e.g. observations 1:2 are from the control and observations 3:4 are from the experimental group).
I have devised two solutions already that give the desired output but neither are ideal. First:
Measurements["Condition"] <- c(rep("Cont", 2), rep("Exp", 2))
X Data Condition
1 90 Cont
2 85 Cont
3 100 Exp
4 105 Exp
The benefit of this is it is one line of code/one command. But this is not ideal since I need to do math outside separately (e.g. 3:4 = 2 obs, etc) which can be tricky/unclear/indirect with larger datasets and more conditions (e.g. 47:83 = ? obs, etc) and would be liable to perpetuating errors since a small error in length for an early assignment would also shift the assignment of later groups (e.g. if rep of Cont is mistakenly 1, then Exp gets mistakenly assigned to 2:3 too).
I also thought of assigning like this, which gives the desired output too:
Measurements[1:2, "Condition"] <- "Cont"
Measurements[3:4, "Condition"] <- "Exp"
X Data Condition
1 90 Cont
2 85 Cont
3 100 Exp
4 105 Exp
This makes it more clear/simple/direct which rows will receive which assignment, but this requires separate assignments and repetition. I feel like there should be a way to "vectorize" this assignment, which is the solution I'm looking for.
I'm having trouble finding complex indexing rules from online. Here is my first intuitive guess of how to achieve this:
Measurements[c(1:2, 3:4), "Condition"] <- list("Cont", "Exp")
X Data Condition
1 90 Cont
2 85 Cont
3 100 Cont
4 105 Cont
But this doesn't work. It seems to combine 1:2 and 3:4 into a single equivalent range (1:4) and assigns only the first condition to this range, which suggests I also need to specify the column again. When I try to specify the column again:
Measurements[c(1:2, 3:4), c("Condition", "Condition")] <- list("Cont", "Exp")
X Data Condition Condition.1
1 90 Cont Exp
2 85 Cont Exp
3 100 Cont Exp
4 105 Cont Exp
For some reason this creates a second new column (??), and it again seems to combine 1:2 and 3:4 into essentially 1:4. So I think I need to index the two row ranges in a way that keeps them separate and only specify the column once, but I'm stuck on how to do this. I assume the solution is simple but I can't seem to find an example of what I'm trying to do. Maybe to keep them separate I do have to assign them separately, but I'm hoping there is a way.
Can anyone help? Thank you a ton in advance from an R noobie!
If you already have a list of observations which belong to each condition you could use dplyr::case_when to do a conditional mutate. Depending on how you have this information stored you could use something like the following:
library(dplyr)
Measurements <- data.frame(X = 1:4,
Data = c(90, 85, 100, 105))
# set which observations belong to each condition
Cont <- 1:2
Exp <- 3:4
Measurements %>%
mutate(Condition = case_when(
X %in% Cont ~ "Cont",
X %in% Exp ~ "Exp"
))
# X Data Condition
# 1 90 Cont
# 2 85 Cont
# 3 100 Exp
# 4 105 Exp
Note that this does not require the observations to be in consecutive rows.
I normally see this done with a merge operation. The trick is getting your conditions data into a nice shape.
composeConditions <- function(...) {
  conditions <- list(...)
  # one row per observation: X is the observation id,
  # condition is the name of the argument it was listed under
  data.frame(
    X = unlist(conditions, use.names = FALSE),
    condition = rep(names(conditions), times = lengths(conditions))
  )
}
conditions <- composeConditions(Cont = 1:2, Exp = 3:4)
> conditions
X condition
1 1 Cont
2 2 Cont
3 3 Exp
4 4 Exp
merge(Measurements, conditions, by = "X")
X Data condition
1 1 90 Cont
2 2 85 Cont
3 3 100 Exp
4 4 105 Exp
An approach that scales to larger datasets is to keep a vector of labels (dat) and a vector of indices (pattern) that says which label each row receives.
Measurements <- data.frame(X = 1:4, Data = c(90, 85, 100, 105))
dat <- c("Cont","Exp")
pattern <- c(1,1,2,2)
Or derive the pattern from the data, e.g. conditionally on Measurements$Data:
pattern <- ifelse(Measurements$Data >= 100, 2, 1)
# [1] 1 1 2 2
Then you can add the data simply by doing:
Measurements$Condition <- dat[pattern]
# X Data Condition
#1 1 90 Cont
#2 2 85 Cont
#3 3 100 Exp
#4 4 105 Exp
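When each condition covers a consecutive run of observations, it can be enough to state where each run starts and let `findInterval()` do the counting. A base R sketch; `starts` is an illustrative name and the boundaries are those from the question:

```r
Measurements <- data.frame(X = 1:4,
                           Data = c(90, 85, 100, 105))

# first observation number of each condition, in increasing order
starts <- c(Cont = 1, Exp = 3)

# findInterval() returns, for each X, the index of the last start that is <= X
Measurements$Condition <- names(starts)[findInterval(Measurements$X, starts)]
Measurements
```

Adding a condition is then one more entry in `starts`, and a mistake in one boundary does not shift the others.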

Resample with replacement by cluster

I want to draw clusters (defined by the variable id) with replacement from a dataset, and in contrast to previously answered questions, I want clusters that are chosen K times to have each observation repeated K times. That is, I'm doing cluster bootstrapping.
For example, the following samples id=1 twice, but repeats the observations for id=1 only once in the new dataset s. I want all observations from id=1 to appear twice.
f <- data.frame(id=c(1, 1, 2, 2, 2, 3, 3), X=rnorm(7))
set.seed(451)
new.ids <- sample(unique(f$id), replace=TRUE)
s <- f[f$id %in% new.ids, ]
One option would be to lapply over each new.id and save it in a list. Then you can stack that all together:
library(data.table)
rbindlist(lapply(new.ids, function(x) f[f$id %in% x,]))
# id X
#1: 1 1.20118333
#2: 1 -0.01280538
#3: 1 1.20118333
#4: 1 -0.01280538
#5: 3 -0.07302158
#6: 3 -1.26409125
Just in case one would need to have a "new_id" that corresponded to the index number (i.e. sample order) -- (I needed to have "new_id" so that i could run mixed effects models without having several instances of a cluster treated as one cluster because they shared the same id):
library(data.table)
f = data.frame( id=c(1,1,2,2,2,3,3), X = rnorm(7) )
set.seed(451); new.ids = sample( unique(f$id), replace=TRUE )
## ss has unique valued `new_id` for each cluster
ss = rbindlist(mapply(function(x, index) cbind(f[f$id %in% x,], new_id=index),
new.ids,
seq_along(new.ids),
SIMPLIFY=FALSE
))
ss
which gives:
> ss
id X new_id
1: 1 -0.3491670 1
2: 1 1.3676636 1
3: 1 -0.3491670 2
4: 1 1.3676636 2
5: 3 0.9051575 3
6: 3 -0.5082386 3
Note that the values of X differ because set.seed() is not called before rnorm(), but the sampled ids are the same as in the answer by @Mike H.
This link was useful to me in constructing this answer: R lapply statement with index [duplicate]
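For reference, the same stacking can be done in base R without data.table by collecting row indices per cluster. This is a sketch, not the approach of either answer above: `cluster_resample` and its arguments are hypothetical names, and the sampled ids are fixed by hand so the result does not depend on the random number generator.

```r
f <- data.frame(id = c(1, 1, 2, 2, 2, 3, 3), X = rnorm(7))

# Hypothetical helper: stack the rows of each sampled cluster, once per draw
cluster_resample <- function(data, ids, id_col = "id") {
  rows_by_id <- split(seq_len(nrow(data)), data[[id_col]])
  picked <- rows_by_id[as.character(ids)]
  out <- data[unlist(picked, use.names = FALSE), , drop = FALSE]
  # new_id numbers the draws, so a cluster drawn twice gets two distinct labels
  out$new_id <- rep(seq_along(picked), times = lengths(picked))
  rownames(out) <- NULL
  out
}

# ids fixed here for illustration; in practice use
# sample(unique(f$id), replace = TRUE)
s <- cluster_resample(f, ids = c(1, 1, 3))
s
```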

Assign numbers to each letter so that r calculates the sum of the letters in a word

I'm trying to create a tool in R that will calculate the atomic composition (i.e. number of carbon, hydrogen, nitrogen and oxygen atoms) of a peptide chain that is input in single letter amino acid code. For example, the peptide KGHLY consists of the amino acids lysine (K), glycine (G), histidine (H), leucine (L) and tyrosine (Y). Lysine is made of 6 carbon, 14 hydrogen, 2 nitrogen and 2 oxygen. Glycine is made of 2 carbon, 5 hydrogen, 1 nitrogen and 2 oxygen. etc. etc.
I would like the R code to either read the peptide string (KGHLY) from a data frame or take input from the keyboard using readline().
I am new to R and new to programming. I am able to make objects for each amino acid, e.g. G <- c(2, 5, 1, 2) or build a data frame containing all 20 amino acids and their respective atomic compositions.
The bit that I am struggling with is that I don't know how to get R to index from a data frame in response to a string of letters. I have a feeling the solution is probably very simple but so far I have not been able to find a function that is suited to this task.
There are two main components to take care of here: the choice of
a method for storing the basic data, and the algorithm that
computes the result you desire.
For the computation, it might be preferable to have your data
stored in a matrix, due to the way R recycles the shorter vector
when multiplying two vectors. This recycling also kicks in if you
want to multiply a matrix with a vector, since a matrix is a
vector with some additional attributes (that is to say, dimension
and dimension-names). Consider the example below to see how it
works
test_matrix <- matrix(data = 1:12, nrow = 3)
test_vec <- c(3, 0, 1)
test_matrix
[,1] [,2] [,3] [,4]
[1,] 1 4 7 10
[2,] 2 5 8 11
[3,] 3 6 9 12
test_matrix * test_vec
[,1] [,2] [,3] [,4]
[1,] 3 12 21 30
[2,] 0 0 0 0
[3,] 3 6 9 12
Based on this observation, it's possible to deduce that a solution
where each amino acid has one row in a matrix might be a good way
to store the lookup data; when we have a counting vector
specifying the desired amount of contribution from each row, it
will be sufficient to multiply our matrix by our counting
vector, and then sum the columns - the last part solved using
colSums.
colSums(test_matrix * test_vec)
[1] 6 18 30 42
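The multiply-then-colSums step is the same computation as a vector-matrix product, which may be a more familiar way to read it. A short sketch with the same toy objects; `by_recycling` and `by_product` are illustrative names:

```r
test_matrix <- matrix(data = 1:12, nrow = 3)
test_vec <- c(3, 0, 1)

by_recycling <- colSums(test_matrix * test_vec)
# a length-3 vector times a 3 x 4 matrix is treated as a 1 x 4 row-vector product
by_product <- drop(test_vec %*% test_matrix)

by_recycling
by_product
```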
It's in general a "pain" to store this kind of information in a
matrix, since it might be a "lot of work" to update the
information later on. However, I guess it's not that often it's
required to add new amino acids, so that might not be an issue in
this case.
So let's create a matrix for the five amino acids needed
for the peptide you mentioned in your example. The numbers were
found on Wikipedia, and hopefully I didn't mess up when I copied
them. Just follow suit to add all the other amino acids too.
amino_acids <- rbind(
G = c(C = 2, H = 5, N = 1, O = 2),
L = c(C = 6, H = 13, N = 1, O = 2),
H = c(C = 6, H = 9, N = 3, O = 2),
K = c(C = 6, H = 14, N = 2, O = 2),
Y = c(C = 9, H = 11, N = 1, O = 3))
amino_acids
C H N O
G 2 5 1 2
L 6 13 1 2
H 6 9 3 2
K 6 14 2 2
Y 9 11 1 3
This matrix contains the information we want, but it might be
preferable to have them in lexicographic order - and it would be
nice to ensure that we haven't by mistake added the same row
twice. The code below takes care of both of these issues.
amino_acids <-
amino_acids[sort(unique(rownames(amino_acids))), ]
amino_acids
C H N O
G 2 5 1 2
H 6 9 3 2
K 6 14 2 2
L 6 13 1 2
Y 9 11 1 3
The next part is to figure out how to deal with the peptides. This
will here be done by first using strsplit to split the string
into separate characters, and then use a table-solution upon the
result to get the vector that we want to multiply with the matrix.
peptide <- "KGHLY"
peptide_2 <- unlist(strsplit(x = peptide, split = ""))
peptide_2
[1] "K" "G" "H" "L" "Y"
Using table upon peptide_2 gives us
table(peptide_2)
peptide_2
G H K L Y
1 1 1 1 1
This can thus be used to define a vector to play the role of test_vec in the first example. However, in general the resulting vector will contain fewer components than the rows of the matrix amino_acids; so a restriction must be performed first, in order to get the correct format we want for our computation.
Several options are available, and the simplest one might be to use the names from the table to subset the required rows from amino_acids, such that the computation can proceed without any further fuss.
peptide_vec <- table(peptide_2)
colSums(amino_acids[names(peptide_vec), ] * as.vector(peptide_vec))
C H N O
29 52 8 11
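Another of those options is to count the letters against the full set of row names, so the count vector always has one entry per row of amino_acids and letters missing from the table are caught early. A sketch only: `count_atoms` is an illustrative name, and the matrix is repeated here so the block runs on its own.

```r
amino_acids <- rbind(
  G = c(C = 2, H = 5,  N = 1, O = 2),
  H = c(C = 6, H = 9,  N = 3, O = 2),
  K = c(C = 6, H = 14, N = 2, O = 2),
  L = c(C = 6, H = 13, N = 1, O = 2),
  Y = c(C = 9, H = 11, N = 1, O = 3))

count_atoms <- function(peptide, amino_acids) {
  residues <- strsplit(peptide, split = "")[[1]]
  unknown <- setdiff(residues, rownames(amino_acids))
  if (length(unknown) > 0) {
    stop("no composition stored for: ", paste(unknown, collapse = ", "))
  }
  # counts in the same order as the rows, with zeros for unused amino acids
  counts <- table(factor(residues, levels = rownames(amino_acids)))
  colSums(amino_acids * as.vector(counts))
}

count_atoms("KGHLY", amino_acids)
count_atoms("GGG", amino_acids)  # a single repeated letter also works
```

Counting against fixed factor levels avoids subsetting the matrix, so the result does not depend on how many distinct letters the peptide contains.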
This outlines one possible solution for the core of your problem,
and this can be collected into a function that takes care of all
the steps for us.
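peptide_function <- function(peptide, amino_acids) {
  ## Count how often each letter occurs in the peptide.
  peptide_vec <- table(
    unlist(strsplit(x = peptide, split = "")))
  ## Compute the result and return it to the work flow.
  ## drop = FALSE keeps a matrix when only one distinct letter occurs.
  colSums(
    amino_acids[names(peptide_vec), , drop = FALSE] *
      as.vector(peptide_vec))
}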
And finally a test to see that we get the same answer as before.
peptide_function(peptide = "GHKLY",
amino_acids = amino_acids)
C H N O
29 52 8 11
What next? Well that depends on how you have stored your
peptides, and what you would like to do with the result. If for
example you have the peptides stored in a vector, and would like
to have the result stored in a matrix, then it might e.g. be
possible to use vapply as given below.
data_vector <- c("GHKLY", "GGLY", "HKLGL")
result <- t(vapply(
X = data_vector,
FUN = peptide_function,
FUN.VALUE = numeric(4),
amino_acids = amino_acids))
result
C H N O
GHKLY 29 52 8 11
GGLY 19 34 4 9
HKLGL 26 54 8 10
