What's the best way to count the number of words between occurrences of a predefined delimiter (in my case '/')?
Dataset:
df <- data.frame(v1 = c('A DOG//1//',
'CAT/WHITE///',
'A HORSE/BROWN & BLACK/2//',
'DOG////'))
The expected results are the following counts:
2 (which are A DOG and 1)
2 (which are CAT and WHITE)
3 (A HORSE, BROWN & BLACK, 2)
1 (DOG)
Thank you!
Use strsplit to split at one or more slashes ("/+") and count the resulting strings:
lengths(strsplit(as.character(df$v1), "/+"))
#[1] 2 2 3 1
Assuming your data doesn't have cases where a string (a) begins with "/" or (b) doesn't end with "/", you can just count the runs of slashes to get the number of chunks between them. So the following works for the data you've provided.
stringr::str_count(df$v1, "/+")
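If the two assumptions above might not hold, a slightly more defensive variant is to split on runs of slashes and count only the non-empty pieces. This is a minimal base R sketch; the input vector x is made up for illustration and includes one string that starts with "/" and one with no slash at all.

```r
# Hypothetical input: the second string begins with "/", the third has no "/"
x <- c("A DOG//1//", "/CAT/WHITE", "DOG")

# Split on one or more slashes; a leading slash yields an empty first piece
parts <- strsplit(x, "/+")

# Count only the non-empty pieces in each element
counts <- vapply(parts, function(p) sum(nzchar(p)), integer(1))
print(counts)
```

Counting the slash runs directly would be off for the second and third strings here, which is why the pieces are counted instead.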
Using stringr::str_split() and counting the number of nonblank strings...
df <- data.frame(v1 = c('A DOG//1//',
'CAT/WHITE///',
'A HORSE/BROWN & BLACK/2//',
'DOG////'))
sapply(stringr::str_split(df$v1, '/'), function(x) sum(x != ''))
[1] 2 2 3 1
Related
Can someone help with these regular expressions?
d_total_v_conf.int.low_all
I want three expressions: total_v, conf.int.low, all
I can't just capture elements before the third _, it is more complex than that:
d_share_v_hskill_wc_mean_plus
Should yield share_v_hskill_wc, mean and plus
The first match is for all characters between the second and the penultimate _, the second match takes all between the penultimate and the last _ and the third takes everything after the last _
We can use sub to capture the groups, join them with a delimiter, and then split on that delimiter with scan:
f1 <- function(str_input) {
scan(text = sub("^[^_]+_(.*)_([^_]+)_([^_]+)$",
"\\1,\\2,\\3", str_input), what = "", sep=",")
}
f1(str1)
#[1] "total_v" "conf.int.low" "all"
f1(str2)
#[1] "share_v_hskill_wc" "mean" "plus"
If it is a data.frame column
library(tidyr)
library(dplyr)
df1 %>%
extract(col1, into = c('col1', 'col2', 'col3'),
"^[^_]+_(.*)_([^_]+)_([^_]+)$")
# col1 col2 col3
#1 total_v conf.int.low all
#2 share_v_hskill_wc mean plus
data
str1 <- "d_total_v_conf.int.low_all"
str2 <- "d_share_v_hskill_wc_mean_plus"
df1 <- data.frame(col1 = c(str1, str2))
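As an alternative that avoids going through a temporary delimiter, the same capture groups can probably be pulled out directly with regexec and regmatches. This is only a sketch in base R using the same pattern as above; the names x, m and parts are made up for illustration.

```r
x <- c("d_total_v_conf.int.low_all", "d_share_v_hskill_wc_mean_plus")

# regexec() records the positions of the full match and of each capture group
m <- regmatches(x, regexec("^[^_]+_(.*)_([^_]+)_([^_]+)$", x))

# Drop the full match (first element) and keep the three groups, one row per input
parts <- t(vapply(m, function(g) g[-1], character(3)))
print(parts)
```

This keeps working when the captured text itself contains a comma, which would confuse the scan approach.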
Here is a single regex that yields the three groups as requested:
(?<=^[^_]_)((?:(?:(?!_).)+)|_)+(_[^_]+$)
The idea is to use a lookaround, plus an explicit match for the first group, an everything-but match in the middle, and another explicit match for the last part.
You may need to adjust the start and end anchors if those strings show up in free text.
You can use {unglue} for this task:
library(unglue)
x <- c("d_total_v_conf.int.low_all", "d_share_v_hskill_wc_mean_plus")
pattern <- "d_{a}_{b=[^_]+}_{c=[^_]+}"
unglue_data(x, pattern)
#> a b c
#> 1 total_v conf.int.low all
#> 2 share_v_hskill_wc mean plus
What you basically want is to extract a, b and c from a pattern looking like "d_{a}_{b}_{c}", but where b and c are made of one or more non-underscore characters, which is what "[^_]+" means in regex.
I need support with RegEx filtering!
I have a list of keywords and many rows that should be checked.
In this example, the keyword "-book-" can be (1) in the middle of the sentence or (2) at the end, which would mean that the last hyphen is not present.
I need a regular expression that identifies "-book-" and "-book".
I don't want similar keywords like "-booking-" etc to be identified.
library(dplyr)
keywords = c( "-album-", "-book-", "-castle-")
search_terms = paste(keywords, collapse ="|")
number = c(1:5)
sentences = c("the-best-album-in-shop", "this-book-is-fantastic", "that-is-the-best-book", "spacespacespace", "unwanted-sentence-with-booking")
data = data.frame(number, sentences)
output = data %>% filter(., grepl( search_terms, sentences) )
# Current output:
number sentences
1 1 the-best-album-in-shop
2 2 this-book-is-fantastic
# DESIRED output:
number sentences
1 1 the-best-album-in-shop
2 2 this-book-is-fantastic
3 3 that-is-the-best-book
You could also do:
subset(data, grepl(paste0(sprintf("%s?\\b",keywords),collapse = "|"), sentences))
number sentences
1 1 the-best-album-in-shop
2 2 this-book-is-fantastic
3 3 that-is-the-best-book
Note that this only matches -book- (1) in the middle of the sentence or (2) at the end, not at the beginning.
The -book- pattern will match a whole word book with hyphen on the left and right.
To match a whole word book with a hyphen on the left or right, you need an alternation \bbook-|-book\b.
Thus, you can use
keywords = c( "-album-", "\\bbook-", "-book\\b", "-castle-" )
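If the keyword list is long, writing the alternations by hand gets tedious. One way to build the pattern programmatically is sketched below, assuming each keyword should count as a whole hyphen-delimited word. Unlike the question's requirement, this also matches a keyword at the very start of the sentence; the names words, pattern and hits are made up for illustration.

```r
keywords <- c("-album-", "-book-", "-castle-")
sentences <- c("the-best-album-in-shop", "this-book-is-fantastic",
               "that-is-the-best-book", "spacespacespace",
               "unwanted-sentence-with-booking")

# Strip the surrounding hyphens to get the bare words
words <- gsub("^-|-$", "", keywords)

# A word must be preceded by start-of-string or "-" and followed by "-" or end-of-string
pattern <- paste0("(^|-)(", paste(words, collapse = "|"), ")(-|$)")

hits <- grepl(pattern, sentences)
print(sentences[hits])
```

The logical vector hits can then be used inside filter() or subset() in place of the grepl call from the question.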
Another solution you could consider:
library(stringr)
data %>%
filter(str_detect(sentences, regex("-castle-|-album-|-book$|-book-\\w{1,}")))
# number sentences
# 1 1 the-best-album-in-shop
# 2 2 this-book-is-fantastic
# 3 3 that-is-the-best-book
I have a data frame with a text column. I need to ignore (or remove) the first 2 words and then count the occurrences of the remaining string in that column.
b <- data.frame(text = c("hello sunitha what can I do for you?",
"hi john what can I do for you?"))
Expected output for data frame 'b': how can we remove the first 2 words, so that the count of 'what can I do for you?' is 2?
You can use gsub to remove the first two words and then tapply and count, i.e.
i1 <- gsub("^\\w*\\s*\\w*\\s*", "", b$text)
tapply(i1, i1, length)
#what can I do for you?
# 2
If you need to remove any range of words, we can amend i1 as follows,
i1 <- sapply(strsplit(as.character(b$text), ' '), function(i)paste(i[-c(2:4)], collapse = ' '))
tapply(i1, i1, length)
#hello I do for you? hi I do for you?
# 1 1
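A variant of the first approach, sketched here under the assumption that "word" means any run of non-whitespace characters (so names with apostrophes or hyphens are also dropped as one word). It uses sub and table; the names rest and counts are made up for illustration.

```r
b <- data.frame(text = c("hello sunitha what can I do for you?",
                         "hi john what can I do for you?"),
                stringsAsFactors = FALSE)

# Remove the first two whitespace-delimited words and the spaces after them
rest <- sub("^\\s*\\S+\\s+\\S+\\s*", "", b$text)

# Tabulate how often each remaining string occurs
counts <- table(rest)
print(counts)
```

table() gives the same counts as tapply(i1, i1, length) here, with one entry per distinct remaining string.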
library(magrittr) # provides %>%
b=data.frame(text=c("hello sunitha what can I do for you?","hi john what can I do for you?"),stringsAsFactors = FALSE)
b$processed = sapply(b$text, function(x) (strsplit(x," ")[[1]]%>%.[-c(1:2)])%>%paste0(.,collapse=" "))
b$count = sapply(b$processed, function(x) length(strsplit(x," ")[[1]]))
> b
text processed count
1 hello sunitha what can I do for you? what can I do for you? 6
2 hi john what can I do for you? what can I do for you? 6
Are you looking for something like this? Watch out for stringsAsFactors = FALSE; otherwise your texts will be of type factor and harder to work with.
I have an Excel file containing a list of sequences. How would I go about getting the number of times a letter appears before a letter in square brackets? An example of an entry is below.
GTCCTGGTTGTAGCTGAAGCTCTTCCC[A]CTCCTCCCGATCACTGGGACGTCCTATGT
I'd also like to do this for the letter after the square brackets.
Edit: Apologies for the confusion. Take the example below. I'd like to count how many times A, C, G, and T appear immediately before and after the letter in square brackets (of which there is only one per line). So I want to count the occurrences of A[A]A, A[A]C, C[A]A, and so on. The file is in Excel, and I'm happy to use any method in Excel, R or Linux.
CCCACCCGCCAGGAAGCCGCTATCACTGTCCAAGTTGTCATCGGAACTCC[A]CCAGCCTGTGGACTTGGCCTGGTGCCGCCCATCCCCCTTGCGGTCCTTGC
ACCACTACCCCCTTCCCCACCATCCACCTCAGAAGCAGTCCCAGCCTGCC[A]CCCGCCAGCCCCTGCCCAGCCCTGGCTTTTTGGAAACGGGTCAGGATTGG
TTTGCTTTAAAATACTGCAACCACTCCAGGTAAATCTTCCGCTGCCTATA[A]CCCCGCCAATGAGCCTGCACATCAGGAGAGAAAGGGAAGTAACTCAAGCA
GAAATCTTCTGAAACAGTCTCCAGAAGACTGTCTCCAAATACACAGCAGA[A]CCAGCCAGTCCACAGCACTTTACCTTCTCTATTCTCAGATGGCAATTGAG
GGACTGCCCCAAGGCCCGCAGGGAGGTGGAGCTGCACTGGCGGGCCTCCC[A]GTGCCCGCACATCGTACGGATCGTGGATGTGTACGAGAATCTGTACGCAG
GGCCCAACGCCATCCTGAAACTCACTGACTTTGGCTTTGCCAAGGAAACC[A]CCAGCCACAACTCTTTGACCACTCCTTGTTATACACCGTACTATGTGGGT
TCTGCCTGGTCCGCTGGAGCTGGGCATTGAAGCCCCGCAGCTGCTCAGCC[A]CCTGCCCCGCCATCAAGAAGGCCCCACCGGCCCTGGGAAGGACACCCCTG
TTTGAAGCCCTTATGAACCAAGAAACCTTCGTTCAGGACCTCAAAATCAA[A]CCCCGCCACATGCAGCTCGCAGGCCTGCAGGAGGAAAGACAGGTTAGCAA
CTGCAGCCTACCTGTCCATGTCCCAGGGGGCCGTTGCCAACGCCAACAGC[A]CCCCGCCGCCCTATGAGCGTACCCGCCTCTCCCCACCCCGGGCCAGCTAC
ACTGGCAAACATGTTGAGGACAATGATGGAGGGGATGAGCTTGCATAGGA[A]CCTGCCGTAGGGCCACTGTCCCTGGAGAGCCAAGTGAGCCAGCGAGAAGG
CACCCTCAGAGAAGAAGAAAGGAGCTGAGGAGGAGAAGCCAAAGAGGAGG[A]GGCAGGAGAAGCAGGCAGCCTGCCCCTTCTACAACCACGAGCAGATGGGC
CCAGCCCTGTATGAGGACCCCCCAGATCAGAAAACCTCACCCAGTGGCAA[A]CCTGCCACACTCAAGATCTGCTCTTGGAATGTGGATGGGCTTCGAGCCTG
TTCCTGTGCGCCCCAACAACTCCTTTAGCTGGCCTAAAGTGAAAGGACGG[A]CCTGCCAATGAAAATAGACTTTCAGGGTCTAGCAGAAGGCAAGACCACCA
CTAACACCCGCACGAGCTGCTGGTAGATCTGAATGGCCAAGTCACTCAGC[A]CCTGCCGATACTCAGCCAGGTCAAAATTGGTGAGGCAGTGTTCATTCTGG
AGTTCTGCATCTGGAGCAAATCCTTGGCACTCCCTCATGCTGGCTATCAC[A]CCTGCCACGAATGTGCCATGGCCCAACCCTGCAGTCCATAAAGAAAACAA
CGTGCCCATGCAGCTAGTGCTCTTCCGAGAGGCTATTGAACACAGTGAGC[A]CCTGCCACGCCTATCCCCTTCCCCATCATCTCAGTGATGGGGTATGTCTA
ACAAGGACCTGGCCCTGGGGCAGCCCCTCAGCCCACCTGGTCCCTGCCTT[A]CCCAGCCAGTACTCTCCATCAGCACGGCCGAAGCCCAGCTTGTAGTCATT
You could split the original string into parts. From the start of the string to the first [ and from the first ] to the end of the string.
int count = firstPart.Count(f => f == 'A');
count += secondPart.Count(f => f == 'A');
Option Explicit
Sub test()
Dim seq As String
seq = "GTCCTGGTTGTAGCTGAAGCTCTTCCC[A]CTCCTCCCGATCACTGGGACGTCCTATGT"
Debug.Print CountLetter("A", seq)
End Sub
Function CountLetter(letter As String, ByVal sequence As String) As Long
'--- assumes the letter in the brackets is the same as that being counted
Dim allLetters() As String
allLetters = Split("A,C,G,T", ",")
Dim letterToDelete As Variant
For Each letterToDelete In allLetters
If letterToDelete <> letter Then
sequence = Replace(sequence, letterToDelete, "")
End If
Next letterToDelete
'--- what remains also holds the two brackets and the bracketed letter, so subtract 3
CountLetter = Len(sequence) - 3
End Function
x = "GTCCTGGTTGTAGCTGAAGCTCTTCCC[A]CTCCTCCCGATCACTGGGACGTCCTATGT"
#COUNT 'A'
sapply(unlist(strsplit(x,"\\[[A-Za-z]\\]")), function(a) length(unlist(gregexpr("A", a))))
# GTCCTGGTTGTAGCTGAAGCTCTTCCC CTCCTCCCGATCACTGGGACGTCCTATGT
# 3 4
#COUNT 'G'
sapply(unlist(strsplit(x,"\\[[A-Za-z]\\]")), function(a) length(unlist(gregexpr("G", a))))
# GTCCTGGTTGTAGCTGAAGCTCTTCCC CTCCTCCCGATCACTGGGACGTCCTATGT
# 7 6
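One caveat with the approach above: gregexpr returns -1 when there is no match, so taking the length of its result reports 1 for a letter that never occurs. A possible way around this, sketched in base R, is to count the matched substrings via regmatches instead; count_letter is a hypothetical helper name.

```r
x <- "GTCCTGGTTGTAGCTGAAGCTCTTCCC[A]CTCCTCCCGATCACTGGGACGTCCTATGT"

# Split into the part before and the part after the bracketed letter
halves <- strsplit(x, "\\[[A-Za-z]\\]")[[1]]

# Hypothetical helper: count fixed-string matches, giving 0 when there are none
count_letter <- function(letter, s) {
  lengths(regmatches(s, gregexpr(letter, s, fixed = TRUE)))
}

print(count_letter("A", halves))  # before, after
print(count_letter("N", halves))  # a letter that does not occur
```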
New R solution (after clarification by OP)
Let's assume the data have been read from Excel into a data.table called los (list of sequences) which has only one column called sequence. Then, the occurrences can be counted as follows:
library(data.table)
los[, .N, by = stringr::str_extract(sequence, "[ACGT]\\[[ACGT]\\][ACGT]")]
# stringr N
#1: C[A]C 8
#2: A[A]C 5
#3: C[A]G 1
#4: G[A]G 1
#5: G[A]C 1
#6: T[A]C 1
str_extract() looks for one of the letters A, C, G, T followed by [ followed by one of the letters A, C, G, T followed by ] followed by one of the letters A, C, G, T in column sequence and extracts the matching substrings. Then, los is grouped by the substrings and the number of occurrences is counted (.N).
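For readers without data.table or stringr, roughly the same extract-then-count idea can be sketched in base R with regexpr, regmatches and table. The three short sequences below are invented for illustration and are not taken from the question's data.

```r
# Hypothetical, shortened sequences with exactly one bracketed base each
seqs <- c("CC[A]CC", "GA[A]CT", "TC[A]CG")

# Extract the base before the bracket, the bracketed base, and the base after
triplets <- regmatches(seqs, regexpr("[ACGT]\\[[ACGT]\\][ACGT]", seqs))

# Count how often each triplet occurs
tab <- table(triplets)
print(tab)
```

Note that regmatches silently drops sequences without a match, so the counts only cover lines where the pattern was found.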
Data
If the Excel file is stored in CSV format then it can be read using data.table's fread() function like this
los <- fread("your_file_name.csv")
(Perhaps, some parameters to fread() might need to be adjusted for the specific file.)
However, some data already are provided in the question. These can be read as character string using fread() as well:
los <- fread("sequence
CCCACCCGCCAGGAAGCCGCTATCACTGTCCAAGTTGTCATCGGAACTCC[A]CCAGCCTGTGGACTTGGCCTGGTGCCGCCCATCCCCCTTGCGGTCCTTGC
ACCACTACCCCCTTCCCCACCATCCACCTCAGAAGCAGTCCCAGCCTGCC[A]CCCGCCAGCCCCTGCCCAGCCCTGGCTTTTTGGAAACGGGTCAGGATTGG
TTTGCTTTAAAATACTGCAACCACTCCAGGTAAATCTTCCGCTGCCTATA[A]CCCCGCCAATGAGCCTGCACATCAGGAGAGAAAGGGAAGTAACTCAAGCA
GAAATCTTCTGAAACAGTCTCCAGAAGACTGTCTCCAAATACACAGCAGA[A]CCAGCCAGTCCACAGCACTTTACCTTCTCTATTCTCAGATGGCAATTGAG
GGACTGCCCCAAGGCCCGCAGGGAGGTGGAGCTGCACTGGCGGGCCTCCC[A]GTGCCCGCACATCGTACGGATCGTGGATGTGTACGAGAATCTGTACGCAG
GGCCCAACGCCATCCTGAAACTCACTGACTTTGGCTTTGCCAAGGAAACC[A]CCAGCCACAACTCTTTGACCACTCCTTGTTATACACCGTACTATGTGGGT
TCTGCCTGGTCCGCTGGAGCTGGGCATTGAAGCCCCGCAGCTGCTCAGCC[A]CCTGCCCCGCCATCAAGAAGGCCCCACCGGCCCTGGGAAGGACACCCCTG
TTTGAAGCCCTTATGAACCAAGAAACCTTCGTTCAGGACCTCAAAATCAA[A]CCCCGCCACATGCAGCTCGCAGGCCTGCAGGAGGAAAGACAGGTTAGCAA
CTGCAGCCTACCTGTCCATGTCCCAGGGGGCCGTTGCCAACGCCAACAGC[A]CCCCGCCGCCCTATGAGCGTACCCGCCTCTCCCCACCCCGGGCCAGCTAC
ACTGGCAAACATGTTGAGGACAATGATGGAGGGGATGAGCTTGCATAGGA[A]CCTGCCGTAGGGCCACTGTCCCTGGAGAGCCAAGTGAGCCAGCGAGAAGG
CACCCTCAGAGAAGAAGAAAGGAGCTGAGGAGGAGAAGCCAAAGAGGAGG[A]GGCAGGAGAAGCAGGCAGCCTGCCCCTTCTACAACCACGAGCAGATGGGC
CCAGCCCTGTATGAGGACCCCCCAGATCAGAAAACCTCACCCAGTGGCAA[A]CCTGCCACACTCAAGATCTGCTCTTGGAATGTGGATGGGCTTCGAGCCTG
TTCCTGTGCGCCCCAACAACTCCTTTAGCTGGCCTAAAGTGAAAGGACGG[A]CCTGCCAATGAAAATAGACTTTCAGGGTCTAGCAGAAGGCAAGACCACCA
CTAACACCCGCACGAGCTGCTGGTAGATCTGAATGGCCAAGTCACTCAGC[A]CCTGCCGATACTCAGCCAGGTCAAAATTGGTGAGGCAGTGTTCATTCTGG
AGTTCTGCATCTGGAGCAAATCCTTGGCACTCCCTCATGCTGGCTATCAC[A]CCTGCCACGAATGTGCCATGGCCCAACCCTGCAGTCCATAAAGAAAACAA
CGTGCCCATGCAGCTAGTGCTCTTCCGAGAGGCTATTGAACACAGTGAGC[A]CCTGCCACGCCTATCCCCTTCCCCATCATCTCAGTGATGGGGTATGTCTA
ACAAGGACCTGGCCCTGGGGCAGCCCCTCAGCCCACCTGGTCCCTGCCTT[A]CCCAGCCAGTACTCTCCATCAGCACGGCCGAAGCCCAGCTTGTAGTCATT")
Old solution (before clarification by OP) - left here for reference
This is a solution in base R with help of the stringr package which will work with a "list" of sequences (a data.frame), any single letter enclosed in square brackets, and arbitrary lengths of the sequences. It assumes that the data already have been read from file into a data.frame which is named los here.
# create data: data frame with two sequences
los <- data.frame(
sequence = c("GTCCTGGTTGTAGCTGAAGCTCTTCCC[A]CTCCTCCCGATCACTGGGACGTCCTATGT",
"GTCCTGGTTGTAGCTGAAGCTCTTCCCACT[C]CTCCCGATCACTGGGACGTCCTATGT"))
# split sequences in three parts
mat <- stringr::str_split_fixed(los$sequence, "[\\[\\]]", n = 3)
los$letter <- mat[, 2]
los$n_before <- stringr::str_count(mat[, 1], mat[, 2])
los$n_after <- stringr::str_count(mat[, 3], mat[, 2])
print(los)
# sequence letter n_before n_after
#1 GTCCTGGTTGTAGCTGAAGCTCTTCCC[A]CTCCTCCCGATCACTGGGACGTCCTATGT A 3 4
#2 GTCCTGGTTGTAGCTGAAGCTCTTCCCACT[C]CTCCCGATCACTGGGACGTCCTATGT C 9 9
Note this code works best if there is exactly one pair of square brackets in each sequence. Any additional brackets will be ignored.
It will also work if there is more than just one letter enclosed in brackets, e.g., [GT].
I confess that I'm addicted to Hadley Wickham's stringr package because I have difficulty remembering the inconsistently named base R functions for string manipulation like strsplit, grepl, sub, match, gregexpr, etc. To understand what I mean, please have a look at the Usage and See Also sections of ?grep and compare them to stringr.
I would think that R packages for bioinformatics, such as seqinr or Biostrings, would be a good starting point. However, here's a "roll your own" solution.
First step: get your data from Excel into R. I will assume that file mydata.xlsx contains one sheet with a column of sequence and no header. You need to adapt this for your file and sheet format.
library(readxl)
sequences <- read_excel("mydata.xlsx", col_names = FALSE)
colnames(sequences) <- "sequence"
Now you need a function to extract the base in square brackets and the bases at -1 and +1. This function uses the stringr package to extract bases using regular expressions.
get_bases <- function(seq) {
require(stringr)
require(magrittr)
subseqs <- str_match(seq, "^([ACGT]+)\\[([ACGT])\\]([ACGT]+)$")
bases <- list(
before = subseqs[, 2] %>% str_sub(-1, -1),
base = subseqs[, 3],
after = subseqs[, 4] %>% str_sub(1, 1)
)
return(bases)
}
Now you can pass the column of sequences to the function to generate a list of lists, which can be converted to a data frame.
library(purrr)
sequences_df <- lapply(sequences, get_bases) %>%
map_df(as.data.frame, stringsAsFactors = FALSE)
head(sequences_df, 3)
before base after
1 C A C
2 C A C
3 A A C
The last step is to use functions from dplyr and tidyr to count up the bases.
library(dplyr)
library(tidyr)
sequences_df %>%
gather(position, letter, -base) %>%
group_by(base, position, letter) %>%
tally() %>%
spread(position, n) %>%
select(base, letter, before, after)
Result using your 17 example sequences. I would use better names than I did if I were you: base = the base in square brackets, letter = the base being counted, before = count at -1, after = count at +1.
base letter before after
* <chr> <chr> <int> <int>
1 A A 5 NA
2 A C 9 15
3 A G 2 2
4 A T 1 NA
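The NA entries above come from combinations that never occur. If zeros are preferred, one option is to tabulate against a fixed set of bases. The following is a rough base R sketch that locates the bracket by position instead of using a regular expression; the sequences are invented for illustration, and it assumes exactly one single-letter bracket per sequence that is not at either end.

```r
# Hypothetical, shortened sequences with one bracketed base each
seqs <- c("CC[A]CC", "GA[A]CT", "TC[A]GG")

# Position of the opening bracket in each sequence
p <- as.integer(regexpr("[", seqs, fixed = TRUE))

# Bases at -1 and +1 relative to the bracketed base
before <- substr(seqs, p - 1, p - 1)
after  <- substr(seqs, p + 3, p + 3)

# Tabulate against all four bases so absent ones show up as 0
bases <- c("A", "C", "G", "T")
counts <- rbind(before = table(factor(before, levels = bases)),
                after  = table(factor(after,  levels = bases)))
print(counts)
```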
I have the following string:
str1<-"{a{c}{b{{e}{d}}}}"
In addition, I have a list of integers:
str_d <- c(1, 2, 2, 4, 4)
There is a one-to-one relation between the list and the string.
That is:
a 1
c 2
b 2
e 4
d 4
I would like to sort in alphabetical order only those elements of str1 that have the same level.
That means sorting c, b (which have the same value 2) will yield b, c
and sorting e, d (which have the same value 4) will yield d, e.
The required result will be:
str2<-"{a{b}{c{{d}{e}}}}"
In addition, a, b, c, d and e need not be single characters; they might be words, such as:
str1<-"{NSP{ARD}{BOS{{DUD}{COR}}}}"
How can I do this while keeping the { and } in their place?
brkts <- gsub("\\w+", "%s", str1)
strings <- regmatches(str1,gregexpr("[^{}]+",str1))[[1]]
fixed <- ave(strings, str_d, FUN=function(x) sort(x))
do.call(sprintf, as.list(c(brkts, fixed)))
[1] "{a{b}{c{{d}{e}}}}"
and
[1] "{NSP{ARD}{BOS{{COR}{DUD}}}}"
It works for both the first and second case. We first replace each run of word characters with %s using gsub; the result is used later as the format string for sprintf. Next we extract the strings between the braces with regmatches and gregexpr. We then sort them within each group of the given vector str_d using ave and save the result in the vector fixed. Lastly, we call sprintf on the brkts variable that we created at the beginning and the sorted strings.
Data
str_d <- c(1, 2, 2, 4, 4)
str1<-"{a{c}{b{{e}{d}}}}" # first case
str1<-"{NSP{ARD}{BOS{{DUD}{COR}}}}" # second case
One possible solution (using the stringr package):
library(stringr)
words <- str_extract_all(str1, '\\w+')[[1]]
ordered <- words[order(paste(str_d, words))]
formatter <- str_replace_all(str1, '\\w+', '%s')
do.call(sprintf, as.list(c(formatter, ordered)))
words holds the words extracted from between the braces. I order them by sorting the strings formed by pasting str_d and the words together, i.e. the sort keys are:
1 a
2 c
2 b
4 e
4 d
Then I slap it all back together with sprintf().
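To try either idea on both example strings without reassigning str1, the steps can be wrapped in a small function. The sketch below uses the group-wise ave approach in base R, since sorting within each level in place does not depend on the levels already being in ascending order; sort_within_levels is a hypothetical name, and it assumes the tokens consist only of word characters and that the string contains no literal %.

```r
# Hypothetical helper: sort the tokens of s alphabetically within each level
sort_within_levels <- function(s, levels) {
  # Replace every token with a %s placeholder, keeping the braces in place
  template <- gsub("\\w+", "%s", s)
  # Extract the tokens in their original order
  tokens <- regmatches(s, gregexpr("\\w+", s))[[1]]
  stopifnot(length(tokens) == length(levels))
  # Sort tokens within each level, leaving the positions of the levels unchanged
  sorted <- ave(tokens, levels, FUN = sort)
  # Fill the sorted tokens back into the template
  do.call(sprintf, as.list(c(template, sorted)))
}

str_d <- c(1, 2, 2, 4, 4)
res1 <- sort_within_levels("{a{c}{b{{e}{d}}}}", str_d)
res2 <- sort_within_levels("{NSP{ARD}{BOS{{DUD}{COR}}}}", str_d)
print(res1)
print(res2)
```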