Sorting a string by specific values - r

I have the following string:
str1<-"{a{c}{b{{e}{d}}}}"
In addition, I have a list of integers:
str_d <- c(1, 2, 2, 4, 4)
There is a one-to-one correspondence between the integers and the tokens in the string.
It means:
a 1
c 2
b 2
e 4
d 4
I would like to sort alphabetically only those tokens of str1 that share the same level.
For example, sorting c and b (both level 2) yields b, c,
and sorting e and d (both level 4) yields d, e.
The required result will be:
str2<-"{a{b}{c{{d}{e}}}}"
In addition, a, b, c, d and e need not be single characters; they may be words, such as:
str1<-"{NSP{ARD}{BOS{{DUD}{COR}}}}"
How can I do this while keeping the braces in place?

brkts <- gsub("\\w+", "%s", str1)
strings <- regmatches(str1,gregexpr("[^{}]+",str1))[[1]]
fixed <- ave(strings, str_d, FUN=function(x) sort(x))
do.call(sprintf, as.list(c(brkts, fixed)))
[1] "{a{b}{c{{d}{e}}}}"
and
[1] "{NSP{ARD}{BOS{{COR}{DUD}}}}"
This works for both the first and the second case. We first replace each run of text with %s using gsub; the result serves later as the format string for sprintf. Next we extract the strings with regmatches and gregexpr, matching every run of characters that is not a brace. We then sort the strings within each group defined by str_d using ave and store them in the vector fixed. Lastly, we call sprintf with brkts as the format and the sorted strings as arguments.
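The same idea can be sketched without sprintf, which sidesteps problems if the input ever contains a literal % sign. This is only a sketch under the assumption that there is exactly one level per token; sort_within_levels is a hypothetical helper name, not part of the answer above.

```r
# Hypothetical helper: sort tokens that share a level, leaving braces untouched.
sort_within_levels <- function(s, levels) {
  m <- gregexpr("[^{}]+", s)              # locate every run of non-brace characters
  tokens <- regmatches(s, m)[[1]]
  stopifnot(length(tokens) == length(levels))
  # ave() applies sort() within each level and puts the results back in the same slots
  regmatches(s, m) <- list(ave(tokens, levels, FUN = sort))
  s
}

sort_within_levels("{a{c}{b{{e}{d}}}}", c(1, 2, 2, 4, 4))
# [1] "{a{b}{c{{d}{e}}}}"
```

Because the tokens are written back in place with the replacement form of regmatches, no format string is needed.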
Data
str_d <- c(1, 2, 2, 4, 4)
str1<-"{a{c}{b{{e}{d}}}}"
str1<-"{NSP{ARD}{BOS{{DUD}{COR}}}}"

One possible solution (using stringr package):
library(stringr)
words <- str_extract_all(str1, '\\w+')[[1]]
ordered <- words[order(paste(str_d, words))]
formatter <- str_replace_all(str1, '\\w+', '%s')
do.call(sprintf, as.list(c(formatter, ordered)))
words is an extract of the words between the braces. I ordered those by sorting the combination of the words with str_d. E.g. the words will become:
1 a
2 c
2 b
4 e
4 d
Then I slap it all back together with sprintf().
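One caveat that may be worth illustrating: paste() turns the levels into text, so a level such as 10 would sort before 2. A hedged alternative is to hand both keys to order() directly. The vectors lv and w below are made up for illustration only.

```r
lv <- c(2, 10, 2)         # hypothetical levels, including a two-digit one
w  <- c("b", "z", "a")    # hypothetical words

w[order(paste(lv, w))]    # text comparison: "10 z" is placed before "2 a"
w[order(lv, w)]           # numeric level first, then word
# [1] "a" "b" "z"
```

With single-digit levels, as in the question, both forms give the same ordering.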

Related

Extract three groups: between second and second to last, between second to last and last, and after last underscores

can someone help with these regular expressions?
d_total_v_conf.int.low_all
I want three expressions: total_v, conf.int.low, all
I can't just capture elements before the third _, it is more complex than that:
d_share_v_hskill_wc_mean_plus
Should yield share_v_hskill_wc, mean and plus
The first match is for all characters between the second and the penultimate _, the second match takes all between the penultimate and the last _ and the third takes everything after the last _
We can use sub to capture the groups and create a delimiter, to scan
f1 <- function(str_input) {
scan(text = sub("^[^_]+_(.*)_([^_]+)_([^_]+)$",
"\\1,\\2,\\3", str_input), what = "", sep=",")
}
f1(str1)
#[1] "total_v" "conf.int.low" "all"
f1(str2)
#[1] "share_v_hskill_wc" "mean" "plus"
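If the input might itself contain commas, the comma-delimited round trip through scan could split in the wrong places. A possible alternative, sketched here with the same pattern, reads the capture groups directly with regexec and regmatches. The name split_three is hypothetical.

```r
# Hypothetical helper: one row per input string, one column per captured group.
split_three <- function(x) {
  m <- regmatches(x, regexec("^[^_]+_(.*)_([^_]+)_([^_]+)$", x))
  out <- vapply(m, function(g) {
    # g holds the full match followed by the three groups; empty when nothing matched
    if (length(g) == 4) g[-1] else rep(NA_character_, 3)
  }, character(3))
  t(out)
}

split_three(c("d_total_v_conf.int.low_all", "d_share_v_hskill_wc_mean_plus"))
```

Strings that do not fit the pattern come back as a row of NA values rather than raising an error.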
If it is a data.frame column
library(tidyr)
library(dplyr)
df1 %>%
extract(col1, into = c('col1', 'col2', 'col3'),
"^[^_]+_(.*)_([^_]+)_([^_]+)$")
# col1 col2 col3
#1 total_v conf.int.low all
#2 share_v_hskill_wc mean plus
data
str1 <- "d_total_v_conf.int.low_all"
str2 <- "d_share_v_hskill_wc_mean_plus"
df1 <- data.frame(col1 = c(str1, str2))
Here is a single regex that yields the three groups as requested:
(?<=^[^_]_)((?:(?:(?!_).)+)|_)+(_[^_]+$)
The idea is to use a lookaround plus an explicit match for the first group, an everything-but match in the middle, and another explicit match for the last part.
You may need to adjust the start and end anchors if those strings show up in free text.
You can use {unglue} for this task :
library(unglue)
x <- c("d_total_v_conf.int.low_all", "d_share_v_hskill_wc_mean_plus")
pattern <- "d_{a}_{b=[^_]+}_{c=[^_]+}"
unglue_data(x, pattern)
#> a b c
#> 1 total_v conf.int.low all
#> 2 share_v_hskill_wc mean plus
Basically, what you want is to extract a, b and c from a pattern looking like "d_{a}_{b}_{c}", but where b and c are made of one or more non-underscore characters, which is what "[^_]+" means in regex.

How to read a comma-separated numerical string and perform various functions on it

I have a column with numerical comma-separated strings, e.g., '0,1,17,200,6,0,1'.
I want to create new columns holding the count of those numbers (substrings) in each string that are not equal to 0.
I can use something like this to count the non-zero numbers in the whole string:
df1$F1 <- sapply(strsplit(df1$a, ","), function(x) length(which(x>0)))
[1] 5
This outputs '5' as the number of non-zero substrings in the example string above, which is correct: '0,1,17,200,6,0,1' contains five non-zero values.
The challenge, however, is to restrict the count to a given number of substrings. For example, how can I get the count for only the first 3 or 6 substrings in the string?
You can use gsub and backreference to cut the string to the desired length before you count how many substrings are > 0:
DATA:
df1 <- data.frame(a = "0,1,17,200,6,0,1")
df1$a <- as.character(df1$a)
SOLUTION:
First cut the string to whatever number of substrings you want (here, I'm cutting it to the first three numbers, the first two of which are followed by a comma) and store the result in a new vector:
df1$a_3 <- gsub("^(\\d+,\\d+,\\d+)(.*)", "\\1", df1$a)
df1$a_3
[1] "0,1,17"
Now insert the new vector into your sapply statement to count how many substrings are greater than 0:
sapply(strsplit(df1$a_3, ","), function(x) length(which(x>0)))
[1] 2
To vary the number of substrings, vary the number of repetitions of \\d+ in the pattern accordingly. For example, this works for 6 substrings:
df1$a_6 <- gsub("^(\\d+,\\d+,\\d+,\\d+,\\d+,\\d+)(.*)", "\\1", df1$a)
sapply(strsplit(df1$a_6, ","), function(x) length(which(x>0)))
[1] 4
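Repeating \\d+ in the pattern gets unwieldy for larger counts. A small helper that splits first and then keeps only the first n pieces may be easier to vary. This is a minimal sketch; count_nonzero is a hypothetical name, and it assumes the strings hold comma-separated non-negative numbers.

```r
# Hypothetical helper: count values > 0 among the first n comma-separated pieces.
count_nonzero <- function(x, n = Inf) {
  vapply(strsplit(x, ",", fixed = TRUE), function(parts) {
    keep <- parts[seq_len(min(n, length(parts)))]  # first n pieces, or all of them
    sum(as.numeric(keep) > 0)                      # compare as numbers, not as text
  }, integer(1))
}

a <- "0,1,17,200,6,0,1"
count_nonzero(a)       # 5
count_nonzero(a, 3)    # 2
count_nonzero(a, 6)    # 4
```

Converting with as.numeric before comparing avoids relying on how R compares character strings with numbers.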
EDIT TO ACCOUNT FOR NEW SET OF QUESTIONS:
To compute the maximum value of substrings > 0, exemplified here for df1$a, the string as a whole (for the restricted strings, just use the relevant vector accordingly, e.g., df1$a_3, df1$a_6 etc.):
First split the string using strsplit, then unlist the resulting list using unlist, and finally convert the resulting vector from character to numeric, storing the result in a vector, e.g., string_a:
string_a <- as.numeric(unlist(strsplit(df1$a, ",")))
string_a
[1] 0 1 17 200 6 0 1
On that vector you can perform all sorts of functions, including max for the maximum value, and sum for the sum of the values:
max(string_a)
[1] 200
sum(string_a)
[1] 225
Re the number of values that are equal to 0, adjust your sapply statement by setting x == 0:
sapply(strsplit(df1$a, ","), function(x) length(which(x == 0)))
[1] 2
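For a column with more than one row, unlist would pool all rows into one vector. A sketch that keeps the rows separate is shown below; the second row of the data is made up for illustration.

```r
df1 <- data.frame(a = c("0,1,17,200,6,0,1", "3,0,0"),   # second row is hypothetical
                  stringsAsFactors = FALSE)

nums <- lapply(strsplit(df1$a, ","), as.numeric)         # one numeric vector per row
df1$max    <- vapply(nums, max, numeric(1))
df1$sum    <- vapply(nums, sum, numeric(1))
df1$n_zero <- vapply(nums, function(v) sum(v == 0), integer(1))
df1
```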
Hope this helps!

Counting letters before and after a letter

I have an excel file of a list of sequences. How would I go about getting the number of times a letter appears before a letter in square brackets? An example of an entry is below.
GTCCTGGTTGTAGCTGAAGCTCTTCCC[A]CTCCTCCCGATCACTGGGACGTCCTATGT
I'd also like to do this for the letter after the square brackets.
Edit: Apologies for the confusion. Take the example below. I'd like to count how many times A, C, G, and T appear immediately before and after the letter in square brackets (of which there is only one per line). So I want to count the occurrences of A[A]A, A[A]C, C[A]A, and so on. The file is in Excel, and I'm happy to use any method in Excel, R or Linux.
CCCACCCGCCAGGAAGCCGCTATCACTGTCCAAGTTGTCATCGGAACTCC[A]CCAGCCTGTGGACTTGGCCTGGTGCCGCCCATCCCCCTTGCGGTCCTTGC
ACCACTACCCCCTTCCCCACCATCCACCTCAGAAGCAGTCCCAGCCTGCC[A]CCCGCCAGCCCCTGCCCAGCCCTGGCTTTTTGGAAACGGGTCAGGATTGG
TTTGCTTTAAAATACTGCAACCACTCCAGGTAAATCTTCCGCTGCCTATA[A]CCCCGCCAATGAGCCTGCACATCAGGAGAGAAAGGGAAGTAACTCAAGCA
GAAATCTTCTGAAACAGTCTCCAGAAGACTGTCTCCAAATACACAGCAGA[A]CCAGCCAGTCCACAGCACTTTACCTTCTCTATTCTCAGATGGCAATTGAG
GGACTGCCCCAAGGCCCGCAGGGAGGTGGAGCTGCACTGGCGGGCCTCCC[A]GTGCCCGCACATCGTACGGATCGTGGATGTGTACGAGAATCTGTACGCAG
GGCCCAACGCCATCCTGAAACTCACTGACTTTGGCTTTGCCAAGGAAACC[A]CCAGCCACAACTCTTTGACCACTCCTTGTTATACACCGTACTATGTGGGT
TCTGCCTGGTCCGCTGGAGCTGGGCATTGAAGCCCCGCAGCTGCTCAGCC[A]CCTGCCCCGCCATCAAGAAGGCCCCACCGGCCCTGGGAAGGACACCCCTG
TTTGAAGCCCTTATGAACCAAGAAACCTTCGTTCAGGACCTCAAAATCAA[A]CCCCGCCACATGCAGCTCGCAGGCCTGCAGGAGGAAAGACAGGTTAGCAA
CTGCAGCCTACCTGTCCATGTCCCAGGGGGCCGTTGCCAACGCCAACAGC[A]CCCCGCCGCCCTATGAGCGTACCCGCCTCTCCCCACCCCGGGCCAGCTAC
ACTGGCAAACATGTTGAGGACAATGATGGAGGGGATGAGCTTGCATAGGA[A]CCTGCCGTAGGGCCACTGTCCCTGGAGAGCCAAGTGAGCCAGCGAGAAGG
CACCCTCAGAGAAGAAGAAAGGAGCTGAGGAGGAGAAGCCAAAGAGGAGG[A]GGCAGGAGAAGCAGGCAGCCTGCCCCTTCTACAACCACGAGCAGATGGGC
CCAGCCCTGTATGAGGACCCCCCAGATCAGAAAACCTCACCCAGTGGCAA[A]CCTGCCACACTCAAGATCTGCTCTTGGAATGTGGATGGGCTTCGAGCCTG
TTCCTGTGCGCCCCAACAACTCCTTTAGCTGGCCTAAAGTGAAAGGACGG[A]CCTGCCAATGAAAATAGACTTTCAGGGTCTAGCAGAAGGCAAGACCACCA
CTAACACCCGCACGAGCTGCTGGTAGATCTGAATGGCCAAGTCACTCAGC[A]CCTGCCGATACTCAGCCAGGTCAAAATTGGTGAGGCAGTGTTCATTCTGG
AGTTCTGCATCTGGAGCAAATCCTTGGCACTCCCTCATGCTGGCTATCAC[A]CCTGCCACGAATGTGCCATGGCCCAACCCTGCAGTCCATAAAGAAAACAA
CGTGCCCATGCAGCTAGTGCTCTTCCGAGAGGCTATTGAACACAGTGAGC[A]CCTGCCACGCCTATCCCCTTCCCCATCATCTCAGTGATGGGGTATGTCTA
ACAAGGACCTGGCCCTGGGGCAGCCCCTCAGCCCACCTGGTCCCTGCCTT[A]CCCAGCCAGTACTCTCCATCAGCACGGCCGAAGCCCAGCTTGTAGTCATT
You could split the original string into two parts: from the start of the string to the first [, and from the first ] to the end of the string.
// firstPart and secondPart hold the two substrings described above
int count = firstPart.Count(f => f == 'A');
count += secondPart.Count(f => f == 'A');
Option Explicit
Sub test()
Dim seq As String
seq = "GTCCTGGTTGTAGCTGAAGCTCTTCCC[A]CTCCTCCCGATCACTGGGACGTCCTATGT"
Debug.Print CountLetter("A", seq)
End Sub
Function CountLetter(letter As String, ByVal sequence As String) As Long
'--- assumes the letter in the brackets is the same as that being counted
Dim allLetters() As String
allLetters = Split("A,C,G,T", ",")
Dim letterToDelete As Variant
For Each letterToDelete In allLetters
If letterToDelete <> letter Then
sequence = Replace(sequence, letterToDelete, "")
End If
Next letterToDelete
'--- subtract 3: the bracketed letter plus the two bracket characters that remain
CountLetter = Len(sequence) - 3
End Function
x = "GTCCTGGTTGTAGCTGAAGCTCTTCCC[A]CTCCTCCCGATCACTGGGACGTCCTATGT"
#COUNT 'A'
sapply(unlist(strsplit(x,"\\[[A-Za-z]\\]")), function(a) length(unlist(gregexpr("A", a))))
# GTCCTGGTTGTAGCTGAAGCTCTTCCC CTCCTCCCGATCACTGGGACGTCCTATGT
# 3 4
#COUNT 'G'
sapply(unlist(strsplit(x,"\\[[A-Za-z]\\]")), function(a) length(unlist(gregexpr("G", a))))
# GTCCTGGTTGTAGCTGAAGCTCTTCCC CTCCTCCCGATCACTGGGACGTCCTATGT
# 7 6
New R solution (after clarification by OP)
Let's assume the data have been read from Excel into a data.table called los (list of sequences) which has only one column called sequence. Then, the occurrences can be counted as follows:
library(data.table)
los[, .N, by = stringr::str_extract(sequence, "[ACGT]\\[[ACGT]\\][ACGT]")]
# stringr N
#1: C[A]C 8
#2: A[A]C 5
#3: C[A]G 1
#4: G[A]G 1
#5: G[A]C 1
#6: T[A]C 1
str_extract() looks for one of the letters A, C, G, T followed by [ followed by one of the letters A, C, G, T followed by ] followed by one of the letters A, C, G, T in column sequence and extracts the matching substrings. Then, los is grouped by the substrings and the number of occurrences is counted (.N).
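For readers without data.table or stringr, roughly the same count can be sketched in base R with regexpr, regmatches and table. The sequences below are shortened, made-up stand-ins for the real data.

```r
# Shortened, hypothetical sequences; each has exactly one bracketed base.
seqs <- c("GGAACTCC[A]CCAGCC", "CTATA[A]CCCCG", "CCTCCC[A]GTGCC", "AACC[A]CCAG")

# Extract the bracketed base together with its immediate neighbours.
ctx <- regmatches(seqs, regexpr("[ACGT]\\[[ACGT]\\][ACGT]", seqs))
table(ctx)
```

Note that regmatches silently drops sequences without a match, so the counts cover only lines that fit the pattern.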
Data
If the Excel file is stored in CSV format then it can be read using data.table's fread() function like this
los <- fread("your_file_name.csv")
(Perhaps, some parameters to fread() might need to be adjusted for the specific file.)
However, some data already are provided in the question. These can be read as character string using fread() as well:
los <- fread("sequence
CCCACCCGCCAGGAAGCCGCTATCACTGTCCAAGTTGTCATCGGAACTCC[A]CCAGCCTGTGGACTTGGCCTGGTGCCGCCCATCCCCCTTGCGGTCCTTGC
ACCACTACCCCCTTCCCCACCATCCACCTCAGAAGCAGTCCCAGCCTGCC[A]CCCGCCAGCCCCTGCCCAGCCCTGGCTTTTTGGAAACGGGTCAGGATTGG
TTTGCTTTAAAATACTGCAACCACTCCAGGTAAATCTTCCGCTGCCTATA[A]CCCCGCCAATGAGCCTGCACATCAGGAGAGAAAGGGAAGTAACTCAAGCA
GAAATCTTCTGAAACAGTCTCCAGAAGACTGTCTCCAAATACACAGCAGA[A]CCAGCCAGTCCACAGCACTTTACCTTCTCTATTCTCAGATGGCAATTGAG
GGACTGCCCCAAGGCCCGCAGGGAGGTGGAGCTGCACTGGCGGGCCTCCC[A]GTGCCCGCACATCGTACGGATCGTGGATGTGTACGAGAATCTGTACGCAG
GGCCCAACGCCATCCTGAAACTCACTGACTTTGGCTTTGCCAAGGAAACC[A]CCAGCCACAACTCTTTGACCACTCCTTGTTATACACCGTACTATGTGGGT
TCTGCCTGGTCCGCTGGAGCTGGGCATTGAAGCCCCGCAGCTGCTCAGCC[A]CCTGCCCCGCCATCAAGAAGGCCCCACCGGCCCTGGGAAGGACACCCCTG
TTTGAAGCCCTTATGAACCAAGAAACCTTCGTTCAGGACCTCAAAATCAA[A]CCCCGCCACATGCAGCTCGCAGGCCTGCAGGAGGAAAGACAGGTTAGCAA
CTGCAGCCTACCTGTCCATGTCCCAGGGGGCCGTTGCCAACGCCAACAGC[A]CCCCGCCGCCCTATGAGCGTACCCGCCTCTCCCCACCCCGGGCCAGCTAC
ACTGGCAAACATGTTGAGGACAATGATGGAGGGGATGAGCTTGCATAGGA[A]CCTGCCGTAGGGCCACTGTCCCTGGAGAGCCAAGTGAGCCAGCGAGAAGG
CACCCTCAGAGAAGAAGAAAGGAGCTGAGGAGGAGAAGCCAAAGAGGAGG[A]GGCAGGAGAAGCAGGCAGCCTGCCCCTTCTACAACCACGAGCAGATGGGC
CCAGCCCTGTATGAGGACCCCCCAGATCAGAAAACCTCACCCAGTGGCAA[A]CCTGCCACACTCAAGATCTGCTCTTGGAATGTGGATGGGCTTCGAGCCTG
TTCCTGTGCGCCCCAACAACTCCTTTAGCTGGCCTAAAGTGAAAGGACGG[A]CCTGCCAATGAAAATAGACTTTCAGGGTCTAGCAGAAGGCAAGACCACCA
CTAACACCCGCACGAGCTGCTGGTAGATCTGAATGGCCAAGTCACTCAGC[A]CCTGCCGATACTCAGCCAGGTCAAAATTGGTGAGGCAGTGTTCATTCTGG
AGTTCTGCATCTGGAGCAAATCCTTGGCACTCCCTCATGCTGGCTATCAC[A]CCTGCCACGAATGTGCCATGGCCCAACCCTGCAGTCCATAAAGAAAACAA
CGTGCCCATGCAGCTAGTGCTCTTCCGAGAGGCTATTGAACACAGTGAGC[A]CCTGCCACGCCTATCCCCTTCCCCATCATCTCAGTGATGGGGTATGTCTA
ACAAGGACCTGGCCCTGGGGCAGCCCCTCAGCCCACCTGGTCCCTGCCTT[A]CCCAGCCAGTACTCTCCATCAGCACGGCCGAAGCCCAGCTTGTAGTCATT")
Old solution (before clarification by OP) - left here for reference
This is a solution in base R with help of the stringr package which will work with a "list" of sequences (a data.frame), any single letter enclosed in square brackets, and arbitrary lengths of the sequences. It assumes that the data already have been read from file into a data.frame which is named los here.
# create data: data frame with two sequences
los <- data.frame(
sequence = c("GTCCTGGTTGTAGCTGAAGCTCTTCCC[A]CTCCTCCCGATCACTGGGACGTCCTATGT",
"GTCCTGGTTGTAGCTGAAGCTCTTCCCACT[C]CTCCCGATCACTGGGACGTCCTATGT"))
# split sequences in three parts
mat <- stringr::str_split_fixed(los$sequence, "[\\[\\]]", n = 3)
los$letter <- mat[, 2]
los$n_before <- stringr::str_count(mat[, 1], mat[, 2])
los$n_after <- stringr::str_count(mat[, 3], mat[, 2])
print(los)
# sequence letter n_before n_after
#1 GTCCTGGTTGTAGCTGAAGCTCTTCCC[A]CTCCTCCCGATCACTGGGACGTCCTATGT A 3 4
#2 GTCCTGGTTGTAGCTGAAGCTCTTCCCACT[C]CTCCCGATCACTGGGACGTCCTATGT C 9 9
Note this code works best if there is exactly one pair of square brackets in each sequence. Any additional brackets will be ignored.
It will also work if there is more than just one letter enclosed in brackets, e.g., [GT].
I confess that I'm addicted to Hadley Wickham's stringr package because I have difficulty remembering the inconsistently named base R functions for string manipulation like strsplit, grepl, sub, match, gregexpr, etc. To understand what I mean, please have a look at the Usage and See Also sections of ?grep and compare them to stringr.
I would think that R packages for bioinformatics, such as seqinr or Biostrings, would be a good starting point. However, here's a "roll your own" solution.
First step: get your data from Excel into R. I will assume that file mydata.xlsx contains one sheet with a column of sequence and no header. You need to adapt this for your file and sheet format.
library(readxl)
sequences <- read_excel("mydata.xlsx", col_names = FALSE)
colnames(sequences) <- "sequence"
Now you need a function to extract the base in square brackets and the bases at -1 and +1. This function uses the stringr package to extract bases using regular expressions.
get_bases <- function(seq) {
require(stringr)
require(magrittr)
subseqs <- str_match(seq, "^([ACGT]+)\\[([ACGT])\\]([ACGT]+)$")
bases <- list(
before = subseqs[, 2] %>% str_sub(-1, -1),
base = subseqs[, 3],
after = subseqs[, 4] %>% str_sub(1, 1)
)
return(bases)
}
Now you can pass the column of sequences to the function to generate a list of lists, which can be converted to a data frame.
library(purrr)
sequences_df <- lapply(sequences, get_bases) %>%
map_df(as.data.frame, stringsAsFactors = FALSE)
head(sequences_df, 3)
before base after
1 C A C
2 C A C
3 A A C
The last step is to use functions from dplyr and tidyr to count up the bases.
library(dplyr)
library(tidyr)
sequences_df %>%
gather(position, letter, -base) %>%
group_by(base, position, letter) %>%
tally() %>%
spread(position, n) %>%
select(base, letter, before, after)
Result using your 17 example sequences. I would use better names than I did if I were you: base = the base in square brackets, letter = the base being counted, before = count at -1, after = count at +1.
base letter before after
* <chr> <chr> <int> <int>
1 A A 5 NA
2 A C 9 15
3 A G 2 2
4 A T 1 NA

Extract only values with a decimal point in between from strings

I have a dataframe with strings such as:
id <- c(1,2)
x <- c("...14.....5.......................395.00.........................14.........1..",
"......114.99....................124.99................")
df <- data.frame(id,x)
df$x <- as.character(df$x)
How can I extract only values with a decimal point in between such as 395.00, 114.99 and 124.99 and not 14, 5, or 1 for each row, and put them in a new column separated by a comma?
The ideal result would be:
id x2
1 395.00
2 114.99,124.99
The number of periods separating the values is random.
library(stringr)
df$x2 = str_extract_all(df$x, "[0-9]+\\.[0-9]+")
df[c(1, 3)]
# id x2
# 1 1 395.00
# 2 2 114.99, 124.99
Explanation: [0-9]+ matches one or more digits, \\. matches a single decimal point. str_extract_all extracts all matches.
The new column is a list column, not a string with an inserted comma. This allows you access to the individual elements, if needed:
df$x2[2]
# [[1]]
# [1] "114.99" "124.99"
If you prefer a character vector as the column, do this:
df$x3 = sapply(str_extract_all(df$x, "[0-9]+\\.[0-9]+"), paste, collapse = ",")
df$x3[2]
#[1] "114.99,124.99"
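If adding a package dependency is not desirable, the same extraction can be sketched in base R with gregexpr and regmatches. The strings below are shortened versions of the ones in the question.

```r
x <- c("...14.....5...395.00....14...1..",
       "......114.99....124.99....")

# All substrings of the form digits, one dot, digits; one character vector per element
m <- regmatches(x, gregexpr("[0-9]+\\.[0-9]+", x))

# Collapse each vector into a single comma-separated string
x2 <- vapply(m, paste, character(1), collapse = ",")
x2
# [1] "395.00"        "114.99,124.99"
```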

sort vector based on another partial matching vector

I have two character vectors A and B. Most of A has matching strings in B, matched by the first 6 characters.
A strings always end with 'd', and B strings always end with 'z'. I'd like to sort B based on A, and put any non-matches in C.
Original data:
A <- c("ABCD01d", "DEFG10d", "ZYXW43d")
B <- c("ABCD01z", "ZYXW43z", "DEFG10z", "DFGS88z")
I'd like to end up with:
A <- c("ABCD01d", "DEFG10d", "ZYXW43d")
B <- c("ABCD01z", "DEFG10z", "ZYXW43z")
C <- c("DFGS88z")
What's the best way to do this?
Try this:
m <- match(substr(B,1,6), substr(A,1,6))
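# m[i] is the position in A of B[i]; ordering B by m puts it in A's order and drops non-matches
B[order(m, na.last = NA)]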
#[1] "ABCD01z" "DEFG10z" "ZYXW43z"
B[is.na(m)]
#[1] "DFGS88z"
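A sketch that matches in the other direction, looking up each element of A in B, may make the intent easier to see. B is deliberately shuffled here (a made-up ordering) and key is a hypothetical helper for the 6-character prefix.

```r
A <- c("ABCD01d", "DEFG10d", "ZYXW43d")
B <- c("DEFG10z", "ZYXW43z", "DFGS88z", "ABCD01z")   # hypothetical ordering

key <- function(x) substr(x, 1, 6)     # first 6 characters identify a pair

idx <- match(key(A), key(B))           # where each A element sits in B
B_sorted <- B[idx[!is.na(idx)]]        # B in the order of A
C <- B[!key(B) %in% key(A)]            # B elements with no partner in A

B_sorted
# [1] "ABCD01z" "DEFG10z" "ZYXW43z"
C
# [1] "DFGS88z"
```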

Resources