I have an Excel file containing a list of sequences. How would I go about counting the number of times a letter appears before a letter in square brackets? An example entry is below.
GTCCTGGTTGTAGCTGAAGCTCTTCCC[A]CTCCTCCCGATCACTGGGACGTCCTATGT
I'd also like to do this for the letter after the square brackets.
Edit: Apologies for the confusion. Take the example below. I'd like to count how many times A, C, G, and T appear immediately before and after the letter in square brackets (of which there is only one per line). That is, I want to count the occurrences of A[A]A, A[A]C, C[A]A, and so on. The file is in Excel, and I'm happy to use any method in Excel, R or Linux.
CCCACCCGCCAGGAAGCCGCTATCACTGTCCAAGTTGTCATCGGAACTCC[A]CCAGCCTGTGGACTTGGCCTGGTGCCGCCCATCCCCCTTGCGGTCCTTGC
ACCACTACCCCCTTCCCCACCATCCACCTCAGAAGCAGTCCCAGCCTGCC[A]CCCGCCAGCCCCTGCCCAGCCCTGGCTTTTTGGAAACGGGTCAGGATTGG
TTTGCTTTAAAATACTGCAACCACTCCAGGTAAATCTTCCGCTGCCTATA[A]CCCCGCCAATGAGCCTGCACATCAGGAGAGAAAGGGAAGTAACTCAAGCA
GAAATCTTCTGAAACAGTCTCCAGAAGACTGTCTCCAAATACACAGCAGA[A]CCAGCCAGTCCACAGCACTTTACCTTCTCTATTCTCAGATGGCAATTGAG
GGACTGCCCCAAGGCCCGCAGGGAGGTGGAGCTGCACTGGCGGGCCTCCC[A]GTGCCCGCACATCGTACGGATCGTGGATGTGTACGAGAATCTGTACGCAG
GGCCCAACGCCATCCTGAAACTCACTGACTTTGGCTTTGCCAAGGAAACC[A]CCAGCCACAACTCTTTGACCACTCCTTGTTATACACCGTACTATGTGGGT
TCTGCCTGGTCCGCTGGAGCTGGGCATTGAAGCCCCGCAGCTGCTCAGCC[A]CCTGCCCCGCCATCAAGAAGGCCCCACCGGCCCTGGGAAGGACACCCCTG
TTTGAAGCCCTTATGAACCAAGAAACCTTCGTTCAGGACCTCAAAATCAA[A]CCCCGCCACATGCAGCTCGCAGGCCTGCAGGAGGAAAGACAGGTTAGCAA
CTGCAGCCTACCTGTCCATGTCCCAGGGGGCCGTTGCCAACGCCAACAGC[A]CCCCGCCGCCCTATGAGCGTACCCGCCTCTCCCCACCCCGGGCCAGCTAC
ACTGGCAAACATGTTGAGGACAATGATGGAGGGGATGAGCTTGCATAGGA[A]CCTGCCGTAGGGCCACTGTCCCTGGAGAGCCAAGTGAGCCAGCGAGAAGG
CACCCTCAGAGAAGAAGAAAGGAGCTGAGGAGGAGAAGCCAAAGAGGAGG[A]GGCAGGAGAAGCAGGCAGCCTGCCCCTTCTACAACCACGAGCAGATGGGC
CCAGCCCTGTATGAGGACCCCCCAGATCAGAAAACCTCACCCAGTGGCAA[A]CCTGCCACACTCAAGATCTGCTCTTGGAATGTGGATGGGCTTCGAGCCTG
TTCCTGTGCGCCCCAACAACTCCTTTAGCTGGCCTAAAGTGAAAGGACGG[A]CCTGCCAATGAAAATAGACTTTCAGGGTCTAGCAGAAGGCAAGACCACCA
CTAACACCCGCACGAGCTGCTGGTAGATCTGAATGGCCAAGTCACTCAGC[A]CCTGCCGATACTCAGCCAGGTCAAAATTGGTGAGGCAGTGTTCATTCTGG
AGTTCTGCATCTGGAGCAAATCCTTGGCACTCCCTCATGCTGGCTATCAC[A]CCTGCCACGAATGTGCCATGGCCCAACCCTGCAGTCCATAAAGAAAACAA
CGTGCCCATGCAGCTAGTGCTCTTCCGAGAGGCTATTGAACACAGTGAGC[A]CCTGCCACGCCTATCCCCTTCCCCATCATCTCAGTGATGGGGTATGTCTA
ACAAGGACCTGGCCCTGGGGCAGCCCCTCAGCCCACCTGGTCCCTGCCTT[A]CCCAGCCAGTACTCTCCATCAGCACGGCCGAAGCCCAGCTTGTAGTCATT
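You could split the original string into two parts: from the start of the string to the first [, and from the first ] to the end of the string. Then count the letter in each part (C# with System.Linq, assuming the sequence is held in a string named seq; the sequences are upper case, so compare against 'A'):
string firstPart = seq.Substring(0, seq.IndexOf('['));
string secondPart = seq.Substring(seq.IndexOf(']') + 1);
int count = firstPart.Count(f => f == 'A');
count += secondPart.Count(f => f == 'A');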
Option Explicit
Sub test()
Dim seq As String
seq = "GTCCTGGTTGTAGCTGAAGCTCTTCCC[A]CTCCTCCCGATCACTGGGACGTCCTATGT"
Debug.Print CountLetter("A", seq)
End Sub
Function CountLetter(letter As String, ByVal sequence As String) As Long
'--- assumes the letter in the brackets is the same as that being counted
Dim allLetters() As String
allLetters = Split("A,C,G,T", ",")
Dim letterToDelete As Variant
For Each letterToDelete In allLetters
If letterToDelete <> letter Then
sequence = Replace(sequence, letterToDelete, "")
End If
Next letterToDelete
'--- what remains is the counted letter plus "[", "]" and the bracketed letter itself
CountLetter = Len(sequence) - 3
End Function
x = "GTCCTGGTTGTAGCTGAAGCTCTTCCC[A]CTCCTCCCGATCACTGGGACGTCCTATGT"
#COUNT 'A'
sapply(unlist(strsplit(x,"\\[[A-Za-z]\\]")), function(a) lengths(regmatches(a, gregexpr("A", a)))) # lengths(regmatches()) gives 0 when there is no match
# GTCCTGGTTGTAGCTGAAGCTCTTCCC CTCCTCCCGATCACTGGGACGTCCTATGT
# 3 4
#COUNT 'G'
sapply(unlist(strsplit(x,"\\[[A-Za-z]\\]")), function(a) lengths(regmatches(a, gregexpr("G", a)))) # lengths(regmatches()) gives 0 when there is no match
# GTCCTGGTTGTAGCTGAAGCTCTTCCC CTCCTCCCGATCACTGGGACGTCCTATGT
# 7 6
New R solution (after clarification by OP)
Let's assume the data have been read from Excel into a data.table called los (list of sequences) which has only one column called sequence. Then, the occurrences can be counted as follows:
library(data.table)
los[, .N, by = stringr::str_extract(sequence, "[ACGT]\\[[ACGT]\\][ACGT]")]
# stringr N
#1: C[A]C 8
#2: A[A]C 5
#3: C[A]G 1
#4: G[A]G 1
#5: G[A]C 1
#6: T[A]C 1
str_extract() looks in column sequence for one of the letters A, C, G, T, followed by [, one of the letters A, C, G, T, ], and again one of the letters A, C, G, T, and extracts the matching substring. Then, los is grouped by these substrings and the number of occurrences in each group is counted (.N).
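For readers without data.table or stringr, the same idea can be sketched in base R. This is only an illustration: the three shortened sequences below are made-up stand-ins for the real column, and regmatches() silently drops rows that do not match the pattern, so it assumes every row has exactly one bracketed base with a neighbour on each side.

```r
# Hypothetical example input: three shortened sequences, one bracketed base each
sequence <- c("GGAACTCC[A]CCAGCC", "CCTATA[A]CCCCGC", "CCTCCC[A]GTGCCC")

# Match <base>[<base>]<base>; regmatches() returns the matched substrings
pattern <- "[ACGT]\\[[ACGT]\\][ACGT]"
context <- regmatches(sequence, regexpr(pattern, sequence))

# Tabulate each three-letter context, then the flanking bases on their own
table(context)
table(before = substr(context, 1, 1))
table(after = substr(context, 5, 5))
```

The flanking bases sit at positions 1 and 5 of each match because the brackets occupy positions 2 and 4.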
Data
If the Excel file is stored in CSV format then it can be read using data.table's fread() function like this
los <- fread("your_file_name.csv")
(Perhaps, some parameters to fread() might need to be adjusted for the specific file.)
However, some data already are provided in the question. These can be read as character string using fread() as well:
los <- fread("sequence
CCCACCCGCCAGGAAGCCGCTATCACTGTCCAAGTTGTCATCGGAACTCC[A]CCAGCCTGTGGACTTGGCCTGGTGCCGCCCATCCCCCTTGCGGTCCTTGC
ACCACTACCCCCTTCCCCACCATCCACCTCAGAAGCAGTCCCAGCCTGCC[A]CCCGCCAGCCCCTGCCCAGCCCTGGCTTTTTGGAAACGGGTCAGGATTGG
TTTGCTTTAAAATACTGCAACCACTCCAGGTAAATCTTCCGCTGCCTATA[A]CCCCGCCAATGAGCCTGCACATCAGGAGAGAAAGGGAAGTAACTCAAGCA
GAAATCTTCTGAAACAGTCTCCAGAAGACTGTCTCCAAATACACAGCAGA[A]CCAGCCAGTCCACAGCACTTTACCTTCTCTATTCTCAGATGGCAATTGAG
GGACTGCCCCAAGGCCCGCAGGGAGGTGGAGCTGCACTGGCGGGCCTCCC[A]GTGCCCGCACATCGTACGGATCGTGGATGTGTACGAGAATCTGTACGCAG
GGCCCAACGCCATCCTGAAACTCACTGACTTTGGCTTTGCCAAGGAAACC[A]CCAGCCACAACTCTTTGACCACTCCTTGTTATACACCGTACTATGTGGGT
TCTGCCTGGTCCGCTGGAGCTGGGCATTGAAGCCCCGCAGCTGCTCAGCC[A]CCTGCCCCGCCATCAAGAAGGCCCCACCGGCCCTGGGAAGGACACCCCTG
TTTGAAGCCCTTATGAACCAAGAAACCTTCGTTCAGGACCTCAAAATCAA[A]CCCCGCCACATGCAGCTCGCAGGCCTGCAGGAGGAAAGACAGGTTAGCAA
CTGCAGCCTACCTGTCCATGTCCCAGGGGGCCGTTGCCAACGCCAACAGC[A]CCCCGCCGCCCTATGAGCGTACCCGCCTCTCCCCACCCCGGGCCAGCTAC
ACTGGCAAACATGTTGAGGACAATGATGGAGGGGATGAGCTTGCATAGGA[A]CCTGCCGTAGGGCCACTGTCCCTGGAGAGCCAAGTGAGCCAGCGAGAAGG
CACCCTCAGAGAAGAAGAAAGGAGCTGAGGAGGAGAAGCCAAAGAGGAGG[A]GGCAGGAGAAGCAGGCAGCCTGCCCCTTCTACAACCACGAGCAGATGGGC
CCAGCCCTGTATGAGGACCCCCCAGATCAGAAAACCTCACCCAGTGGCAA[A]CCTGCCACACTCAAGATCTGCTCTTGGAATGTGGATGGGCTTCGAGCCTG
TTCCTGTGCGCCCCAACAACTCCTTTAGCTGGCCTAAAGTGAAAGGACGG[A]CCTGCCAATGAAAATAGACTTTCAGGGTCTAGCAGAAGGCAAGACCACCA
CTAACACCCGCACGAGCTGCTGGTAGATCTGAATGGCCAAGTCACTCAGC[A]CCTGCCGATACTCAGCCAGGTCAAAATTGGTGAGGCAGTGTTCATTCTGG
AGTTCTGCATCTGGAGCAAATCCTTGGCACTCCCTCATGCTGGCTATCAC[A]CCTGCCACGAATGTGCCATGGCCCAACCCTGCAGTCCATAAAGAAAACAA
CGTGCCCATGCAGCTAGTGCTCTTCCGAGAGGCTATTGAACACAGTGAGC[A]CCTGCCACGCCTATCCCCTTCCCCATCATCTCAGTGATGGGGTATGTCTA
ACAAGGACCTGGCCCTGGGGCAGCCCCTCAGCCCACCTGGTCCCTGCCTT[A]CCCAGCCAGTACTCTCCATCAGCACGGCCGAAGCCCAGCTTGTAGTCATT")
Old solution (before clarification by OP) - left here for reference
This is a solution in base R with help of the stringr package which will work with a "list" of sequences (a data.frame), any single letter enclosed in square brackets, and arbitrary lengths of the sequences. It assumes that the data already have been read from file into a data.frame which is named los here.
# create data: data frame with two sequences
los <- data.frame(
sequence = c("GTCCTGGTTGTAGCTGAAGCTCTTCCC[A]CTCCTCCCGATCACTGGGACGTCCTATGT",
"GTCCTGGTTGTAGCTGAAGCTCTTCCCACT[C]CTCCCGATCACTGGGACGTCCTATGT"))
# split sequences in three parts
mat <- stringr::str_split_fixed(los$sequence, "[\\[\\]]", n = 3)
los$letter <- mat[, 2]
los$n_before <- stringr::str_count(mat[, 1], mat[, 2])
los$n_after <- stringr::str_count(mat[, 3], mat[, 2])
print(los)
# sequence letter n_before n_after
#1 GTCCTGGTTGTAGCTGAAGCTCTTCCC[A]CTCCTCCCGATCACTGGGACGTCCTATGT A 3 4
#2 GTCCTGGTTGTAGCTGAAGCTCTTCCCACT[C]CTCCCGATCACTGGGACGTCCTATGT C 9 9
Note this code works best if there is exactly one pair of square brackets in each sequence. Any additional brackets will be ignored.
It will also work if there is more than just one letter enclosed in brackets, e.g., [GT].
I confess that I'm addicted to Hadley Wickham's stringr package because I have difficulty remembering the inconsistently named base R functions for string manipulation like strsplit, grepl, sub, match, gregexpr, etc. To understand what I mean, please have a look at the Usage and See Also sections of ?grep and compare them to stringr.
I would think that R packages for bioinformatics, such as seqinr or Biostrings, would be a good starting point. However, here's a "roll your own" solution.
First step: get your data from Excel into R. I will assume that file mydata.xlsx contains one sheet with a column of sequence and no header. You need to adapt this for your file and sheet format.
library(readxl)
sequences <- read_excel("mydata.xlsx", col_names = FALSE)
colnames(sequences) <- "sequence"
Now you need a function to extract the base in square brackets and the bases at -1 and +1. This function uses the stringr package to extract bases using regular expressions.
get_bases <- function(seq) {
require(stringr)
require(magrittr)
subseqs <- str_match(seq, "^([ACGT]+)\\[([ACGT])\\]([ACGT]+)$")
bases <- list(
before = subseqs[, 2] %>% str_sub(-1, -1),
base = subseqs[, 3],
after = subseqs[, 4] %>% str_sub(1, 1)
)
return(bases)
}
Now you can pass the column of sequences to the function to generate a list of lists, which can be converted to a data frame.
library(purrr)
sequences_df <- lapply(sequences, get_bases) %>%
map_df(as.data.frame, stringsAsFactors = FALSE)
head(sequences_df, 3)
before base after
1 C A C
2 C A C
3 A A C
The last step is to use functions from dplyr and tidyr to count up the bases.
library(dplyr) # group_by(), tally(), select()
library(tidyr) # gather(), spread()
sequences_df %>%
gather(position, letter, -base) %>%
group_by(base, position, letter) %>%
tally() %>%
spread(position, n) %>%
select(base, letter, before, after)
Result using your 17 example sequences. I would use better names than I did if I were you: base = the base in square brackets, letter = the base being counted, before = count at -1, after = count at +1.
base letter before after
* <chr> <chr> <int> <int>
1 A A 5 NA
2 A C 9 15
3 A G 2 2
4 A T 1 NA
Related
Here is a simplified version of data I am working with:
a<-c("There are 5 programs", "2 - adult programs, 3- youth programs","25", " ","there are a number of programs","other agencies run our programs")
b<-c("four", "we don't collect this", "5 from us, more from others","","","")
c<-c(2,6,5,8,2,"")
df<-cbind.data.frame(a,b,c)
df$c<-as.numeric(df$c)
I want to keep both the text and numbers from the data b/c some of the text is important
expected output:
What I think makes sense is the following:
1. Identify all columns that have text in them, perhaps in a list (because some columns are just numbers).
2. Subset the columns from step 1 into a new data frame; let's call this df1.
3. Delete the columns that were subsetted into df1 from df.
4. Split each column in df1 into two columns, one that keeps the text and one that has the number.
5. Bind the new split columns from df1 into the original df.
What I am struggling with is steps 1-2 and 4. I am okay with the characters (e.g., - and ') being excluded or included. There is additional processing I have to do after (e.g., when there are multiple numbers in a column after splitting I will need to split and add these and also address the written numbers), but those are things I can do.
Here's a dplyr solution using regular expressions:
library(stringr)
library(dplyr)
df %>%
mutate(
a.text = gsub("(^|\\s)\\d+", "", a),
a.num = str_extract_all(a, "\\d+"),
b.text = gsub("(^|\\s)\\d+", "", b),
b.num = str_extract_all(b, "\\d+")
) %>%
select(c(4:7,3))
a.text a.num b.text b.num c
1 There are programs 5 four 2
2 - adult programs,- youth programs 2, 3 we don't collect this 6
3 25 from us, more from others 5 5
4 8
5 there are a number of programs 2
6 other agencies run our programs NA
Here is what I would do with my preferred tools. The solution will work with arbitrary numbers of arbitrarily named character and non-character columns.
library(data.table) # development version 1.14.3 used here
library(magrittr) # piping used to improve readability
num <- \(x) stringr::str_extract_all(x, "\\d+", simplify = TRUE) %>%
apply(1L, \(x) sum(as.integer(x), na.rm = TRUE))
txt <- \(x) stringr::str_remove_all(x, "\\d+") %>%
stringr::str_squish()
setDT(df)[, lapply(
.SD, \(x) if (is.character(x)) data.table(txt = txt(x), num = num(x)) else x)]
which returns
a.txt a.num b.txt b.num c
<char> <int> <char> <int> <num>
1: There are programs 5 four 0 2
2: - adult programs, - youth programs 5 we don't collect this 0 6
3: 25 from us, more from others 5 5
4: 0 0 8
5: there are a number of programs 0 0 2
6: other agencies run our programs 0 0 NA
Explanation
num() is a function which uses the regular expression \\d+ to extract all strings which consist of contiguous digits (aka integer numbers), coerces them to type integer, and computes the rowwise sum of the extracted numbers (as requested in OP's last sentence).
txt() is a function which removes all strings which consist of contiguous digits (aka integer numbers), removes whitespace from start and end of the strings and reduces repeated whitespace inside the strings.
\(x) is a new shortcut for function(x) introduced with R version 4.1
The next steps implement OP's proposed approach in data.table syntax, by and large:
lapply(.SD, ...) loops over each column of df.
if the column is character both functions txt() and num() are applied. The two resulting vectors are turned into a data.table as a partial result. Note that cbind() cannot be used here as it would return a character matrix.
if the column is non-character it is returned as is.
The final result is a data.table where the column names have been renamed automagically.
This approach keeps the relative position of columns.
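The two helper functions described above can also be sketched in base R, which may help to see what each step does. The names num_base() and txt_base() are made up for this illustration, and the sketch only covers a character vector, not the data.table column loop.

```r
x <- c("There are 5 programs", "2 - adult programs, 3- youth programs", "25", " ")

# Sum all runs of digits per element (0 when an element has none)
num_base <- function(x) {
  digits <- regmatches(x, gregexpr("[0-9]+", x))
  vapply(digits, function(d) sum(as.integer(d)), integer(1))
}

# Remove digits, then collapse repeated whitespace and trim both ends
txt_base <- function(x) {
  x <- gsub("[0-9]+", "", x)
  trimws(gsub("\\s+", " ", x))
}

num_base(x)
txt_base(x)
```

As in the stringr version, an element without digits gets 0 rather than NA, because summing an empty integer vector yields 0.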
Can someone help with these regular expressions?
d_total_v_conf.int.low_all
I want three expressions: total_v, conf.int.low, all
I can't just capture elements before the third _, it is more complex than that:
d_share_v_hskill_wc_mean_plus
Should yield share_v_hskill_wc, mean and plus
The first match is for all characters between the second and the penultimate _, the second match takes all between the penultimate and the last _ and the third takes everything after the last _
We can use sub to capture the three groups and join them with a delimiter, then split the result with scan:
f1 <- function(str_input) {
scan(text = sub("^[^_]+_(.*)_([^_]+)_([^_]+)$",
"\\1,\\2,\\3", str_input), what = "", sep=",")
}
f1(str1)
#[1] "total_v" "conf.int.low" "all"
f1(str2)
#[1] "share_v_hskill_wc" "mean" "plus"
If it is a data.frame column
library(tidyr)
library(dplyr)
df1 %>%
extract(col1, into = c('col1', 'col2', 'col3'),
"^[^_]+_(.*)_([^_]+)_([^_]+)$")
# col1 col2 col3
#1 total_v conf.int.low all
#2 share_v_hskill_wc mean plus
data
str1 <- "d_total_v_conf.int.low_all"
str2 <- "d_share_v_hskill_wc_mean_plus"
df1 <- data.frame(col1 = c(str1, str2))
Here is a single regex that yields the three groups as requested:
(?<=^[^_]_)((?:(?:(?!_).)+)|_)+(_[^_]+$)
The idea is to use a lookaround, plus an explicit match for the first group, an everything-but match in the middle, and another explicit match for the last part.
You may need to adjust the start and end anchors if those strings show up in free text.
You can use {unglue} for this task:
library(unglue)
x <- c("d_total_v_conf.int.low_all", "d_share_v_hskill_wc_mean_plus")
pattern <- "d_{a}_{b=[^_]+}_{c=[^_]+}"
unglue_data(x, pattern)
#> a b c
#> 1 total_v conf.int.low all
#> 2 share_v_hskill_wc mean plus
What you want, basically, is to extract a, b and c from a pattern looking like "d_{a}_{b}_{c}", where b and c are made of one or more non-underscore characters, which is what "[^_]+" means in regex.
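The same "greedy middle, underscore-free last two fields" idea can be sketched in base R with regexec() and regmatches(). The column names a, b and c simply mirror the unglue pattern; the literal d_ prefix is an assumption taken from the two example strings.

```r
x <- c("d_total_v_conf.int.low_all", "d_share_v_hskill_wc_mean_plus")

# Greedy (.*) takes everything up to the last two underscore-free fields
pattern <- "^d_(.*)_([^_]+)_([^_]+)$"
m <- regmatches(x, regexec(pattern, x))

# Drop the full match (first element) and bind the groups into a matrix
parts <- do.call(rbind, lapply(m, `[`, -1))
colnames(parts) <- c("a", "b", "c")
parts
```

Because the last two groups cannot contain an underscore, the greedy first group is forced to stop at the penultimate underscore.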
I have a large text file that I want to import in R with multimodal data encoded as such :
A=1,B=1,C=2,...
A=2,B=1,C=1,...
A=1,B=2,C=1,...
What I'd like to have is a dataframe similar to this :
A B C
1 1 2
2 1 1
1 2 1
Because the column names are repeated in every row, I was wondering if there was a way to import that text file with fscanf-like functionality that would parse the A, B, C column names, such as "A=%d,B=%d,C=%d,...."
Or maybe there's a simpler way using read.table or scan ? But I couldn't figure out how.
Thanks for any tip
1) read.pattern read.pattern in the gsubfn package is very close to what you are asking. Instead of %d use (\\d+) when specifying the pattern. If the column names are not important the col.names argument could be omitted.
library(gsubfn)
L <- c("A=1,B=1,C=2", "A=1,B=1,C=2", "A=1,B=1,C=2") # test input
pat <- "A=(\\d+),B=(\\d+),C=(\\d+)"
read.pattern(text = L, pattern = pat, col.names = unlist(strsplit(pat, "=.*?(,|$)")))
giving:
A B C
1 1 1 2
2 1 1 2
3 1 1 2
1a) percent format Just for fun we could implement it using exactly the format given in the question.
fmt <- "A=%d,B=%d,C=%d"
pat <- gsub("%d", "(\\\\d+)", fmt)
Now run the read.pattern statement above.
2) strapply Using the same input and the gsubfn package, again, an alternative is to pull out all strings of digits eliminating the need for the pat shown in (1) reducing the pattern to just "\\d+".
DF <- strapply(L, "\\d+", as.numeric, simplify = data.frame)
names(DF) <- unlist(strsplit(L[1], "=.*?(,|$)"))
3) read.csv Even simpler is this base only solution which deletes the headings and reads in what is left setting the column names as in the prior solution. Again, omit the col.names argument if column names are not important.
read.csv(text = gsub("\\w*=", "", L), header = FALSE,
col.names = unlist(strsplit(L[1], "=.*?(,|$)")))
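If the keys are not known in advance, the column names can be taken from the first line instead of being hard-coded in a pattern. The following base R sketch uses the three rows from the question as test input and assumes that every line carries the same keys in the same order and that all values are integers.

```r
L <- c("A=1,B=1,C=2", "A=2,B=1,C=1", "A=1,B=2,C=1")

# Split each line into "key=value" fields
fields <- strsplit(L, ",", fixed = TRUE)

# Column names come from the first line; values are whatever follows "="
keys   <- sub("=.*$", "", fields[[1]])
values <- lapply(fields, function(f) as.integer(sub("^[^=]*=", "", f)))

DF <- as.data.frame(do.call(rbind, values))
names(DF) <- keys
DF
```

For a large file, L would come from readLines(); the splitting itself is vectorised, so no explicit loop over lines is needed.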
I have a very simple assignment for a project that requires processing a large amount of information; my professor's first words were "this will take a while to run", so I figured it'd be a good opportunity to spend the time I would otherwise be waiting on my program making it super efficient :P
Basically, I have an input file where each line is either a node or details. It might look something like:
#NODE1_length_17_2309482.2394832.2
val1 5 18
val2 6 21
val3 100 23
val4 9 6
#NODE2_length_1298_23948349.23984.2
val1 2 293
...
and so on. Basically, I want to know how I can efficiently use R to output, line by line, something like:
NODE1_length_17 val1 18
NODE1_length_17 val2 21
...
So, as you can see, I want the node name, the value name, and the third column of the value line. I have implemented it using an ultra slow for loop that uses strsplit a whole bunch of times, and obviously this is not ideal. My current implementation looks like:
nodevals <- which(substring(data, 1, 1) == "#") # find lines with nodes
vallines <- which(substring(data, 1, 3) == "val")
out <- vector(mode="character", length=length(vallines))
for (i in vallines) {
line_ra <- strsplit(data[i], "\\s+")[[1]]
... and so on using a bunch of str splits and pastes to reformat
out[i] <- paste(node, val, value, sep="\t")
}
Does anybody know how I can optimize this using data frames or crafty vector manipulations?
EDIT: I'm implementing vector-wise splitting for everything, and so far I've found that the main thing I can't split correctly is the names of each node. I'm trying to do something like,
names <- data[max(nodes[nodelines < vallines])]
where nodes are the names of each line containing a node and vallines are the numbers of each line containing a val. The return vector should have the same number of elements as vallines. The goal is to find the maximum nodelines that is less than the line number of vallines for each vallines. Any thoughts?
I suggest using the data.table package; it has a very fast string split function, tstrsplit.
library(data.table)
#read from file
data <- scan('data.txt', 'character', sep = '\n')
#create separate objects for nodes and values
dt <- data.table(data)
dt[, c('IsNode', 'NodeId') := list(IsNode <- substr(data, 1, 1) == '#', cumsum(IsNode))]
nodes <- dt[IsNode == TRUE, list(NodeId, data)]
values <- dt[IsNode == FALSE, list(data, NodeId)]
#split string and join back values and nodes
tmp <- values[, tstrsplit(data, '\\s+')]
values <- data.table(values[, list(NodeId)], tmp[, list(val = V1, value = V3)], key = 'NodeId')
res <- values[nodes]
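The cumsum() step above assigns each value line to the most recent node line. The lookup asked about in the question's edit (the largest node line number below each value line number) can also be sketched in base R with findInterval(). The five input lines are copied from the question; the regular expression that trims the node name to its first three underscore-separated fields is an assumption based on the desired output, and the sketch assumes the first line of the file is a node line.

```r
data <- c("#NODE1_length_17_2309482.2394832.2",
          "val1 5 18", "val2 6 21",
          "#NODE2_length_1298_23948349.23984.2",
          "val1 2 293")

nodelines <- which(startsWith(data, "#"))
vallines  <- which(startsWith(data, "val"))

# For each value line, the last node line that precedes it
owner <- nodelines[findInterval(vallines, nodelines)]

# Keep "NODE<k>_length_<n>"; drop the leading "#" and the trailing id
node <- sub("^#([^_]+_[^_]+_[^_]+)_.*$", "\\1", data[owner])

# Split all value lines at once; keep the first and third fields
parts <- do.call(rbind, strsplit(data[vallines], "\\s+"))
out <- paste(node, parts[, 1], parts[, 3], sep = "\t")
out
```

Everything here is vectorised, so there is no per-line loop; out has one element per value line.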
I have the following string:
str1<-"{a{c}{b{{e}{d}}}}"
In addition, I have a list of integers:
str_d <- c(1, 2, 2, 4, 4)
There is a one-to-one relation between the list and the string.
It means:
a 1
c 2
b 2
e 4
d 4
I would like to sort in alphabetic order only the characters of str1 that have same level.
It means to sort c, b (which have the same value 2) will yield b,c
and to sort e, d (which have the same value 4) will yield d,e.
The required result will be:
str2<-"{a{b}{c{{d}{e}}}}"
In addition a,b,c,d and e can be not only characters, but might be words, such as:
str1<-"{NSP{ARD}{BOS{{DUD}{COR}}}}"
How can I do it with keeping the { in their place?
brkts <- gsub("\\w+", "%s", str1)
strings <- regmatches(str1,gregexpr("[^{}]+",str1))[[1]]
fixed <- ave(strings, str_d, FUN=function(x) sort(x))
do.call(sprintf, as.list(c(brkts, fixed)))
[1] "{a{b}{c{{d}{e}}}}"
and
[1] "{NSP{ARD}{BOS{{COR}{DUD}}}}"
It will work for the first and second case. We first use gsub to replace each word with %s, which gives a format string that will be used later for sprintf. Next we extract the strings with regmatches and gregexpr, matching every run of characters that is not a brace. We then sort the strings within each level given by the vector str_d (using ave) and save the result in the vector fixed. Lastly, we call sprintf on the brkts variable that we created at the beginning and the sorted strings.
Data
str_d <- c(1, 2, 2, 4, 4)
str1<-"{a{c}{b{{e}{d}}}}"
str1<-"{NSP{ARD}{BOS{{DUD}{COR}}}}"
One possible solution (using the stringr package):
library(stringr)
words <- str_extract_all(str1, '\\w+')[[1]]
ordered <- words[order(paste(str_d, words))]
formatter <- str_replace_all(str1, '\\w+', '%s')
do.call(sprintf, as.list(c(formatter, ordered)))
words is an extract of the words between the braces. I ordered those by sorting the combination of the words with str_d. E.g. the words will become:
1 a
2 c
2 b
4 e
4 d
Then I slap it all back together with sprintf().
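One caveat with pasting the level and the word into a single sort key is that the key is compared as text, so a level of 10 would sort before a level of 2. A minimal base R sketch that passes the numeric level and the word to order() as separate keys avoids this. Like the approach above, it assumes the levels appear in non-decreasing order along the string.

```r
str1  <- "{NSP{ARD}{BOS{{DUD}{COR}}}}"
str_d <- c(1, 2, 2, 4, 4)

# Extract the words: every run of characters that is not a brace
words <- regmatches(str1, gregexpr("[^{}]+", str1))[[1]]

# Sort by numeric level first, then alphabetically within a level
ordered <- words[order(str_d, words)]

# Replace each word with %s and fill the slots with the sorted words
formatter <- gsub("[^{}]+", "%s", str1)
result <- do.call(sprintf, as.list(c(formatter, ordered)))
result
```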