R function to search and replace text in a column - r

I'm seeking to write code that takes a column of character values (e.g. ALA3=VAL20) and converts each three-letter code to its corresponding single letter (e.g. A3=V20). I attempted this with the following function, but it does not work as intended:
substitute_codes <- function(data, col_name) {
# Create a dictionary of the code substitutions
code_dict <- c("ALA" = "A", "ARG" = "R", "ASN" = "N", "ASP" = "D",
"CYS" = "C", "GLU" = "E", "GLN" = "Q", "GLY" = "G",
"HIS" = "H", "ILE" = "I", "LEU" = "L", "LYS" = "K",
"MET" = "M", "PHE" = "F", "PRO" = "P", "SER" = "S",
"THR" = "T", "TRP" = "W", "TYR" = "Y", "VAL" = "V")
# Apply the substitutions using gsub()
data[[col_name]] <- gsub(paste(names(code_dict), collapse = "|"),
paste(code_dict, collapse = ""),
data[[col_name]])
return(data)
}
But ALA3=VAL20 is converted to ARNDCEQGHILKMFPSTWYV3=ARNDCEQGHILKMFPSTWYV20 instead of A3=V20.

As Darren Tsai points out in the comments, we can just pass code_dict to str_replace_all():
library(stringr)
set.seed(123)
x <- sample(names(code_dict), 10)
y <- sample(names(code_dict), 10)
my_string <- paste0(x, sample(10), "=", y, sample(10))
my_string
#> [1] "PRO4=PHE9" "TYR1=CYS10" "PHE3=HIS2" "ASN7=ASN7" "ILE5=GLY3"
#> [6] "ARG10=ILE4" "GLU8=GLN1" "LEU2=PRO6" "CYS9=TRP5" "ASP6=THR8"
str_replace_all(my_string,
code_dict)
#> [1] "P4=F9" "Y1=C10" "F3=H2" "N7=N7" "I5=G3" "R10=I4" "E8=Q1" "L2=P6"
#> [9] "C9=W5" "D6=T8"
For more complex replacements, we could pass a custom function to str_replace_all() as the replacement argument. The custom function below, shorten_str, just uses a classical lookup with base R's match(), but we could add any kind of complexity here.
library(stringr)
# our dictionary
code_dict <- c("ALA" = "A", "ARG" = "R", "ASN" = "N", "ASP" = "D",
"CYS" = "C", "GLU" = "E", "GLN" = "Q", "GLY" = "G",
"HIS" = "H", "ILE" = "I", "LEU" = "L", "LYS" = "K",
"MET" = "M", "PHE" = "F", "PRO" = "P", "SER" = "S",
"THR" = "T", "TRP" = "W", "TYR" = "Y", "VAL" = "V")
# let's create a toy string
set.seed(123)
x <- sample(names(code_dict), 10)
y <- sample(names(code_dict), 10)
my_string <- paste0(x, sample(10), "=", y, sample(10))
my_string
#> [1] "PRO4=PHE9" "TYR1=CYS10" "PHE3=HIS2" "ASN7=ASN7" "ILE5=GLY3"
#> [6] "ARG10=ILE4" "GLU8=GLN1" "LEU2=PRO6" "CYS9=TRP5" "ASP6=THR8"
# custom function to replace string
shorten_str <- function(abr) {
code_dict[match(abr, names(code_dict))]
}
# implementation with `str_replace_all()`
str_replace_all(my_string,
paste(names(code_dict), collapse = "|"),
shorten_str)
#> [1] "P4=F9" "Y1=C10" "F3=H2" "N7=N7" "I5=G3" "R10=I4" "E8=Q1" "L2=P6"
#> [9] "C9=W5" "D6=T8"
Created on 2023-02-16 by the reprex package (v2.0.1)
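The original attempt fails because gsub() accepts only a single replacement string, so every match is replaced by the whole collapsed dictionary. For readers who prefer to avoid the stringr dependency, here is a minimal base R sketch of the same per-match lookup, wrapped back into the data-frame function from the question. The example data frame and its column name `mutation` are assumptions made up for illustration.

```r
code_dict <- c("ALA" = "A", "ARG" = "R", "ASN" = "N", "ASP" = "D",
               "CYS" = "C", "GLU" = "E", "GLN" = "Q", "GLY" = "G",
               "HIS" = "H", "ILE" = "I", "LEU" = "L", "LYS" = "K",
               "MET" = "M", "PHE" = "F", "PRO" = "P", "SER" = "S",
               "THR" = "T", "TRP" = "W", "TYR" = "Y", "VAL" = "V")

substitute_codes <- function(data, col_name) {
  x <- data[[col_name]]
  # one alternation pattern that matches any three-letter code
  pattern <- paste(names(code_dict), collapse = "|")
  hits <- gregexpr(pattern, x)
  # look up each matched code individually instead of using one fixed replacement
  regmatches(x, hits) <- lapply(regmatches(x, hits),
                                function(codes) unname(code_dict[codes]))
  data[[col_name]] <- x
  data
}

# hypothetical example data
df <- data.frame(mutation = c("ALA3=VAL20", "PRO4=PHE9"),
                 stringsAsFactors = FALSE)
substitute_codes(df, "mutation")$mutation
#> [1] "A3=V20" "P4=F9"
```

All matches are replaced in a single pass, so a letter produced by one substitution cannot be picked up again by a later one.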

Related

Find repeated elements in a list and remove those objects

I've got a long list, each object of which is itself a list containing headers and data. Some of the objects are repeated. I'd like to find the repeated objects and remove them.
Ideally this would find objects that are identical (name and contents). If both the name and contents are identical then the repeat is removed. If the name is the same, but the contents are different, then the object is renamed.
Alternatively I'd settle for finding names that are repeated and removing the objects without checking their content.
Here's a simplified example
my.list <- list(sample1 = list(header = c("a","b","c","k"),
data = c("a","b","c","k")),
sample2 = list(header = c("d", "k", "x"),
data = c("d", "k", "x")),
sample3 = list(header = c("z", "r", "v"),
data = c("z", "r", "v")),
sample1 = list(header = c("a","b","c","k"),
data = c("a","b","c","k")),
sample2 = list(header = c("h", "j", "l"),
data = c("h", "j", "l")))
table(names(my.list))
sample1 sample2 sample3
2 2 1
In the above example, the second sample1 would be removed, but the second sample2 would be renamed, e.g. sample2_2.
I've read around, but can't find an example which uses objects that are themselves lists. The other solutions don't seem to cover it, e.g. Remove duplicate in a large list while keeping the named number in R
This is relatively simple to do in two steps, but I'm not sure it can be done in one. The first step is removing exact duplicates (with duplicated) and the second one is name repair (with make.names):
my.list <- list(sample1 = list(header = c("a","b","c","k"),
data = c("a","b","c","k")),
sample2 = list(header = c("d", "k", "x"),
data = c("d", "k", "x")),
sample3 = list(header = c("z", "r", "v"),
data = c("z", "r", "v")),
sample1 = list(header = c("a","b","c","k"),
data = c("a","b","c","k")),
sample2 = list(header = c("h", "j", "l"),
data = c("h", "j", "l")))
my.list.dedup <- my.list[!duplicated(my.list)]
names(my.list.dedup) <- make.names(names(my.list.dedup), unique = TRUE)
which returns
list(
sample1 = list(
header = c("a", "b", "c", "k"),
data = c("a", "b", "c", "k")
),
sample2 = list(
header = c("d", "k", "x"),
data = c("d", "k", "x")
),
sample3 = list(
header = c("z", "r", "v"),
data = c("z", "r", "v")
),
sample2.1 = list(
header = c("h", "j", "l"),
data = c("h", "j", "l")
)
)
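One caveat with the two-step approach: duplicated() on a list compares the contents of the elements, not their names, so an element with a different name but identical contents would also be dropped. A minimal sketch that only removes an element when both name and contents repeat could look like this; the helper name `dedup_named` and the toy list are assumptions for illustration, and make.unique() is used so that the suffix is `_1` rather than `.1`.

```r
dedup_named <- function(lst) {
  # run duplicated() separately within each group of identically named elements,
  # then put the flags back into the original element order
  is_dup <- unsplit(lapply(split(lst, names(lst)), duplicated), names(lst))
  out <- lst[!is_dup]
  # repeated names that survive have different contents, so rename them
  names(out) <- make.unique(names(out), sep = "_")
  out
}

# toy input: second "a" is an exact repeat, second "b" differs,
# "c" has the same contents as "a" but a different name
x <- list(a = 1:3, b = 4:6, a = 1:3, b = 7:9, c = 1:3)
res <- dedup_named(x)
names(res)
#> [1] "a"   "b"   "b_1" "c"
```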
I would convert it to a data.frame with
library(dplyr) # provides %>% and distinct()
do.call(rbind, unname(my.list)) %>% data.frame
then we can find the distinct elements with dplyr::distinct
do.call(rbind, unname(my.list)) %>% data.frame %>% distinct

How to find optimal string content that minimizes the MSE of character count vectors with its reference string

I have the following reference sequence:
ref_seq <- "MGHQQLYWSHPRKFGQGSRSCRVTSNRHGLIRKYGLNMSRQSFR"
and this seed pattern string:
seed_pattern <- "FKDHKHIDVKDRHRTRHLAK??????????"
There are 10 wildcards (?) in that pattern.
Given these functions:
aa_count_normalized <- function(x) {
AADict <- c(
"A", "R", "N", "D", "C", "E", "Q", "G", "H",
"I", "L", "K", "M", "F", "P", "S", "T", "W", "Y", "V"
)
AAC <- summary(factor(strsplit(x, split = "")[[1]], levels = AADict),
maxsum = 21
) / nchar(x)
AAC
}
aa_count <- function(x) {
AADict <- c(
"A", "R", "N", "D", "C", "E", "Q", "G", "H",
"I", "L", "K", "M", "F", "P", "S", "T", "W", "Y", "V"
)
AAC <- summary(factor(strsplit(x, split = "")[[1]], levels = AADict),
maxsum = 21
)
AAC
}
I can get:
# we need to normalize refseq_aa content with respect
# to the length of seed_pattern, to accommodate the length
# difference between the two.
> refseq_aa_content <- aa_count_normalized(ref_seq) * nchar(seed_pattern)
> refseq_aa_content
A R N D C E
0.0000000 4.7727273 1.3636364 0.0000000 0.6818182 0.0000000
Q G H I L K
2.7272727 3.4090909 2.0454545 0.6818182 2.0454545 1.3636364
M F P S T W
1.3636364 1.3636364 0.6818182 4.0909091 0.6818182 0.6818182
Y V
1.3636364 0.6818182
What I want to do is replace the wildcards of the seed pattern, while keeping the non-wildcards as they are, with combinations of residues taken from:
AADict <- c(
"A", "R", "N", "D", "C", "E", "Q", "G", "H",
"I", "L", "K", "M", "F", "P", "S", "T", "W", "Y", "V"
)
such that the mean squared error (MSE) of the final amino acid count of the seed sequence and the normalized reference sequence count is minimized.
With this MSE function:
mse <- function (ref, new_seq) {
return(mean((ref - new_seq)^2))
}
and with these final seed sequences:
seed_final.1 <- aa_count("FKDHKHIDVKDRHRTRHLAKRQQGGGSSSY")
seed_final.2 <- aa_count("FKDHKHIDVKDRHRTRHLAKRQQQGGGSSY")
seed_final.3 <- aa_count("FKDHKHIDVKDRHRTRHLAKSSSGGGRRQQ") # onyambu's
I get
> mse(refseq_aa_content, seed_final.1 )
[1] 1.501446
> mse(refseq_aa_content, seed_final.2 )
[1] 1.63781
> mse(refseq_aa_content, seed_final.3 )
[1] 1.560537
seed_final.1 is the optimal exact solution, because it has the lowest MSE. Namely, the 10 ?s are to be replaced with:
G Q R S Y
3 2 1 3 1 (total 10)
How can I write efficient R code that returns FKDHKHIDVKDRHRTRHLAKRQQGGGSSSY as the answer?
You can model your problem as an integer quadratic problem where you want to minimize:
sum(r^2) - 2 sum(z * r)
with constraints:
sum(r) = k
r[i] non-negative integer
where:
r[i] how many ith letters of AADict you need to add to seed_pattern
z[i] = n(y)/n(x) * x[i] - y[i]
x[i] counts of ith letter of AADict in ref_seq
y[i] counts of ith letter of AADict in seed_pattern
n(x) number of characters in ref_seq
n(y) number of characters in seed_pattern
k number of wild card characters in seed_pattern
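For completeness, here is a short derivation of why that objective is equivalent to minimizing the MSE. The symbol t_i for the rescaled reference count is introduced here only for the derivation and is not part of the original answer.

```latex
\begin{aligned}
t_i &= \frac{n(y)}{n(x)}\,x_i, \qquad z_i = t_i - y_i, \\
\sum_i \bigl(t_i - (y_i + r_i)\bigr)^2
  &= \sum_i (z_i - r_i)^2
   = \sum_i z_i^2 \;-\; 2\sum_i z_i r_i \;+\; \sum_i r_i^2 .
\end{aligned}
```

The term \sum_i z_i^2 does not depend on r, and dividing by the number of letters to get the mean does not change the minimizer, so minimizing the MSE is the same as minimizing sum(r^2) - 2 sum(z * r).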
I didn't manage to find a free mixed-integer quadratic solver in R, so here is a heuristic using DEoptimR:
ref_seq <- "MGHQQLYWSHPRKFGQGSRSCRVTSNRHGLIRKYGLNMSRQSFR"
seed_pattern <- "FKDHKHIDVKDRHRTRHLAK??????????"
AADict <- c(
"A", "R", "N", "D", "C", "E", "Q", "G", "H",
"I", "L", "K", "M", "F", "P", "S", "T", "W", "Y", "V"
)
aminoSummary <- function(x){
f <- factor(strsplit(x, split = "")[[1]], levels = AADict)
list(
l = nchar(x),
k = sum(is.na(f)),
z = table(f)
)
}
x <- aminoSummary(ref_seq)
y <- aminoSummary(seed_pattern)
M <- length(AADict)
res <- DEoptimR::JDEoptim(
lower = rep(0, M),
upper = rep(y$k, M) + 1,
fn = function(r, z, k){
r <- floor(r)
sum(r * r) - 2 * sum(z * r)
},
constr = function(r, z, k) sum(floor(r)) - k,
meq = 1,
z = as.vector(x$z * y$l / x$l - y$z),
k = y$k
)
rep(AADict, floor(res$par))
[1] "R" "Q" "Q" "G" "G" "G" "S" "S" "S" "Y"
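Because the objective above is a sum of convex terms in each r[i] with a single constraint on sum(r), a greedy allocation should also reach an optimum without any solver: add one letter at a time, always picking the letter whose increment raises the objective the least (the increase for letter i is 2 * r[i] + 1 - 2 * z[i]). The sketch below is an assumption-laden illustration, not part of the original answer: the names `count_aa` and `fill_wildcards` are made up, and it assumes the wildcards are trailing, so the chosen letters are simply appended. Ties are broken by position in AADict, so it may return a different sequence than the one in the question with the same MSE.

```r
AADict <- c("A", "R", "N", "D", "C", "E", "Q", "G", "H", "I",
            "L", "K", "M", "F", "P", "S", "T", "W", "Y", "V")

# counts of each AADict letter in a string (other characters are ignored)
count_aa <- function(s) {
  as.vector(table(factor(strsplit(s, "")[[1]], levels = AADict)))
}

fill_wildcards <- function(ref_seq, seed_pattern) {
  # k: number of "?" wildcards to fill
  k <- lengths(regmatches(seed_pattern,
                          gregexpr("?", seed_pattern, fixed = TRUE)))
  # z: rescaled reference counts minus counts already fixed in the seed
  z <- count_aa(ref_seq) * nchar(seed_pattern) / nchar(ref_seq) -
    count_aa(seed_pattern)
  r <- integer(length(AADict))
  for (step in seq_len(k)) {
    # adding one more copy of letter i changes the objective by 2*r[i] + 1 - 2*z[i]
    i <- which.min(2 * r + 1 - 2 * z)
    r[i] <- r[i] + 1L
  }
  fixed_part <- gsub("?", "", seed_pattern, fixed = TRUE)
  paste0(fixed_part, paste(rep(AADict, r), collapse = ""))
}

ref_seq <- "MGHQQLYWSHPRKFGQGSRSCRVTSNRHGLIRKYGLNMSRQSFR"
seed_pattern <- "FKDHKHIDVKDRHRTRHLAK??????????"
filled <- fill_wildcards(ref_seq, seed_pattern)
```

The result can be checked against the question's mse() by comparing `count_aa(filled)` with the rescaled reference counts.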

Minimize a function of characters inputs in R

I have the following function whose minimum I want to find:
model <- Create(parameter1 = list(model = "a" , "b"),
parameter2 = list(distribution = "x" , "y"))
The four inputs of this function are characters, and have as possible values:
parameter1: "a", "b", "c", "d", "e"
parameter2: "x", "y", "z", "w", "t", "v"
I've tried the optim function a few times without success.
Any help is appreciated.
Evaluate the function at every possible set of input values and take the least.
# test function
Create <- function(parameter1, parameter2) {
sum(match(unlist(parameter1), p1), match(unlist(parameter2), p2))
}
p1 <- c("a", "b", "c", "d", "e")
p2 <- c("x", "y", "z", "w", "t", "v")
g <- expand.grid(p1, p1, p2, p2, stringsAsFactors = FALSE)
obj <- function(x) Create(x[1:2], x[3:4])
ix <- which.min(apply(g, 1, obj))
g[ix, ]
## Var1 Var2 Var3 Var4
## 1 a a x x
obj(g[ix, ])
## [1] 4

Create a dataframe of combinations of data with proper labels in R

I am trying to create a data frame that pulls combinations of values from another data frame, applies a function to the data, and finally assigns the labels of the combinations as identifiers.
To create a simple mockup of the data:
A = (1:20)
B = seq(1, 40, 2)
To create a mockup of each identifier:
L = c("q", "r", "q", "p", "q", "p", "r", "r", "r", "r", "r", "r", "r", "r", "o", "p", "q", "r", "r", "q")
The mockup data frame is then created with:
dat.example = data.frame(A, B, L)
Here, the difference between the values of two rows is computed for every possible pair of rows.
diffA = combn(dat.example$A, 2, FUN = diff)
diffB = combn(dat.example$B, 2, FUN = diff)
After all possible combination of differences within columns A and B are calculated, a function is applied on the generated output.
calc.diff = sqrt((diffA^2) + (diffB^2))
Now comes the tricky part, where I pull out the identifiers for the combinations.
(L.list = combn(dat.example$L, 2, simplify = F))
L.list.1 = sapply(L.list, "[[", 1)
L.list.2 = sapply(L.list, "[[", 2)
L.dat = data.frame(
X = L.list.1,
Y = L.list.2
)
L.dat$comp = with(
L.dat,
interaction(X, Y, sep = "_")
)
And combine them into a master dataset
master = data.frame(
calc.diff,
L.dat$comp
)
Is there a simpler way to perform this that I am missing?
Update: I just realised that the labels created this way are duplicated: p_q is equivalent to q_p, since only the magnitude of the value matters after the calc.diff calculation. Therefore, this question still needs a better answer. My present solution is to simply substitute each occurrence with its twin in the dataset, but this becomes substantially more difficult the more labels there are.
Update 2: I have found a pure R solution for properly assigning unique combinations of labels to the dataset. Below is the entire code for generating the above dataset and assigning the right labels. Improvements are of course welcome.
A = (1:20)
B = seq(1, 40, 2)
L = c("q", "r", "q", "p", "q", "p", "r", "r", "r", "r", "r", "r", "r", "r", "o", "p", "q", "r", "r", "q")
dat.example = data.frame(A, B, L)
diffA = combn(dat.example$A, 2, FUN = diff)
diffB = combn(dat.example$B, 2, FUN = diff)
calc.diff = sqrt((diffA^2) + (diffB^2))
L.dat.comp <- combn(dat.example$L, 2, paste0, collapse = "_", simplify=F)
master <- data.frame(
calc.diff,
id = unlist(L.dat.comp))
Function to help with combinations
expand.grid.unique <- function(x, y, include.equals=FALSE)
{
x <- unique(x)
y <- unique(y)
g <- function(i)
{
z <- setdiff(y, x[seq_len(i-include.equals)])
if(length(z)) cbind(x[i], z, deparse.level=0)
}
do.call(rbind, lapply(seq_along(x), g))
}
Create dataframe with unique combinations of id labels.
dat.combo =
as.data.frame(
expand.grid.unique(L, L, include.equals = T)
)
dat.combo$c1 = with(dat.combo, interaction(V1, V2, sep = "_"))
dat.combo$c2 = with(dat.combo, interaction(V2, V1, sep = "_"))
master$id2 = master$id
Where ids match dat.combo$c2, replace with dat.combo$c1
master$id2[] = setNames(dat.combo$c1, dat.combo$c2)[as.character(unlist(master$id2))]
Where NAs are present in id2, replace with values in id
master$id2[is.na(master$id2)] = master$id[is.na(master$id2)]
Based on Jay's comment for the "tricky part":
L.dat.comp <- combn(dat.example$L, 2, paste0, collapse = "_", simplify=F)
# pass the columns to data.frame() directly; cbind() would coerce calc.diff to character
master <- data.frame(calc.diff, id = unlist(L.dat.comp))
So the whole thing looks like:
A = (1:20)
B = seq(1, 40, 2)
L = c("q", "r", "q", "p", "q", "p", "r", "r", "r", "r", "r", "r", "r", "r", "o", "p", "q", "r", "r", "q")
dat.example = data.frame(A, B, L)
diffA = combn(dat.example$A, 2, FUN = diff)
diffB = combn(dat.example$B, 2, FUN = diff)
calc.diff = sqrt((diffA^2) + (diffB^2))
L.dat.comp <- combn(dat.example$L, 2, paste0, collapse = "_", simplify=F)
master <- data.frame(calc.diff, id = unlist(L.dat.comp))
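The duplicated-label problem from the update (p_q versus q_p) can probably be avoided at the source by sorting each pair of labels before pasting, so that both orders produce the same identifier. This is a minimal sketch under that assumption; the name `pair_id` is made up for illustration, and the data are the mockup from the question.

```r
A <- 1:20
B <- seq(1, 40, 2)
L <- c("q", "r", "q", "p", "q", "p", "r", "r", "r", "r",
       "r", "r", "r", "r", "o", "p", "q", "r", "r", "q")

diffA <- combn(A, 2, FUN = diff)
diffB <- combn(B, 2, FUN = diff)
calc.diff <- sqrt(diffA^2 + diffB^2)

# sort the two labels of each pair so "q","p" and "p","q" both become "p_q"
pair_id <- combn(L, 2, FUN = function(p) paste(sort(p), collapse = "_"))

master <- data.frame(calc.diff, id = pair_id)
```

Since combn() enumerates the pairs in the same order for A, B and L, the rows of calc.diff and pair_id line up without any later matching step.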

ggsave() Error: Unknown input R

I am having trouble with ggsave() from the ggplot2 library. I wrote a function that I pass arguments to, and that is supposed to produce and then save the results with ggsave().
Here is some example data and code to reproduce the error:
example.df.1 <- data.frame(matrix(1:100, nrow = 20, ncol = 5))
colnames(example.df.1) <- c("var1", "var2", "var3", "var4", "var5")
rownames(example.df.1) <- c("A", "B", "C", "D", "E", "F", "G", "H",
"I", "J", "K", "L", "M", "N", "O", "P",
"Q", "R", "S", "T")
example.df.2 <- data.frame(matrix(ncol = 2, nrow = 24))
example.df.2[,1] <- c("A", "B", "C", "D", "E", "F", "G", "H",
"I", "J", "K", "L", "M", "N", "O", "P",
"Q", "R", "S", "T", "U", "V", "W", "X")
example.df.2[,2] <- rnorm(24, 10, 2)
problematic_func <- function(data1, col, title, var, data2) {
# only include rows without missing values
loc1 <- subset(data1, rowSums(is.na(data1)) == 0)
loc1 <- cbind(loc1, rank(-as.data.frame(loc1[,1]), ties.method = "first"))
# reduce data2 to only those rows that correspond to rows in data1
loc2 <- data2[data2[,1] %in% rownames(loc1),]
# order loc2
loc2.ordered <- loc2[order(loc2[,1]),]
# correlation between loc1 and loc2.ordered
corr <- cor(loc1[,1], loc2.ordered[,2])
# creating the plot
i <- ggplot(loc1, aes_q(x = loc1[,1], y = loc2.ordered))
i <- i + geom_point(colour = col, size = 4)
i <- i + ggtitle(title)
i <- i + xlab(var)
i <- i + ylab("y-axis")
i <- i + coord_cartesian(xlim = c(0, max(loc1[,1])),
ylim = c(0, max(loc2.ordered[,2])*1.2))
i <- i + annotate("text", x = max(loc1[,1])*.5, y = 1,
label = paste("Correlation coef: ", as.character(corr)), size = 3)
# saving the plot - this is where the error occurs according
# to the debugger
ggsave(filename = paste("my_example_plot_", var, ".png", sep = ""),
plot = i, device = png, width = 625, height = 625, limitsize = FALSE)
}
for (i in 1:ncol(example.df.1)) {
sv <- as.data.frame(example.df.1[,i])
rownames(sv) <- rownames(example.df.1)
problematic_func(sv, "orange", colnames(example.df.1[i]),
colnames(example.df.1[i]), data2 = example.df.2)
}
Edit: Sorry, I forgot to add the error I get:
Error in FUN(X[[2L]], ...) : Unknown input:data.frame
I found the mistake myself. The error was not in the ggsave() function, but in the ggplot() function with which I created the plot in the first place. The correct code must be:
i <- ggplot(loc1, aes_q(x = loc1[,1], y = loc2.ordered[,2]))
The difference is that the y aesthetic needs to be passed a column of loc2.ordered, not the entire dataframe.
