I want to replace each 'COL' in column 'b' of the 'test' data frame with the corresponding element of column 'a' (the first 'COL' with the first element, the second 'COL' with the second, and so on), and put the result in another column, preserving both the order and the structure of the character string in column 'b'.
test <- data.frame(a = c("COL167", "COL2010;COL2012"),
b = c("COL;MO, K", "P;COL, NY, S, COL"))
I have tried the following, but it is not the result that I need:
for(i in 1:length(test$a)){
test$c[i] <- gsub(pattern = "COL", x = test$b[i], replacement = test$a[i])
}
> test
a b c
1 COL167 COL;MO, K COL167;MO, K
2 COL2010;COL2012 P;COL, NY, S, COL P;COL2010;COL2012, NY, S, COL2010;COL2012
I expect the following result:
a b c
1 COL167 COL;MO, K COL167;MO, K
2 COL2010;COL2012 P;COL, NY, S, COL P;COL2010, NY, S, COL2012
Building on what you have already done, I think this would work, but note that you might see performance issues if your table is large. Also note that this assumes the number of 'COL' matches in column b equals the number of replacement values in column a.
Because gsub doesn't allow vectorized replacement (it replaces every match with the first replacement value), I have converted both the strings and the replacements into vectors, so that each matched substring can be replaced individually.
test <- data.frame(a = c("COL167", "COL2010;COL2012"),
b = c("COL;MO, K", "P;COL, NY, S, COL"))
re = function(string, replacement){
  gsub('COL', replacement, string)
}
for(i in 1:nrow(test)){
  #split the values of column a into a vector, this is required for replacement
  replacement = unlist(strsplit(test$a[i], ';'))
  #split the values of column b into a vector on spaces, this is required for replacement
  b_value = unlist(strsplit(test$b[i], ' '))
  #select the pieces that contain the 'COL' substring
  ind_to_replace = which(grepl('COL', b_value))
  #replace the matched pieces, pairing each one with its own replacement
  result = mapply(re, b_value[ind_to_replace], replacement)
  #put the replaced pieces back in their original positions
  b_value[ind_to_replace] = result
  #join the pieces back into a single string
  test$results[i] = paste(b_value, collapse = ' ')
}
test
#> a b results
#> 1 COL167 COL;MO, K COL167;MO, K
#> 2 COL2010;COL2012 P;COL, NY, S, COL P;COL2010, NY, S, COL2012
Created on 2020-09-05 by the reprex package (v0.3.0)
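As a possible alternative to the loop, base R's `regmatches<-` can give each match its own replacement in one vectorized call. This is only a sketch: the column name `c_new` is made up for illustration, and it assumes, like the loop above, that each row has as many 'COL' matches in `b` as there are `;`-separated values in `a`.

```r
test <- data.frame(a = c("COL167", "COL2010;COL2012"),
                   b = c("COL;MO, K", "P;COL, NY, S, COL"),
                   stringsAsFactors = FALSE)

# One vector of replacement values per row, split on ';'
replacements <- strsplit(test$a, ";", fixed = TRUE)

# Locate every literal 'COL' in each string of column b
matches <- gregexpr("COL", test$b, fixed = TRUE)

# Copy b, then overwrite the i-th match in each row with the i-th replacement
c_new <- test$b
regmatches(c_new, matches) <- replacements
test$c_new <- c_new

test
```

Because the matches are replaced positionally, the second 'COL' in a row receives the second value from `a`, which is the behaviour asked for in the question.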
I'll propose a solution using the rowwise function of dplyr.
While it's true that gsub isn't vectorized, the mgsub function in the package of the same name is. My approach is for each row:
1. turn all of the instances of COL in column b into a vector
2. make a vector from all the COL+ entries from column a
3. use vector 2 to replace the old values of COL from b. mutate creates a new column with the result.
library(mgsub)
library(stringr)
library(dplyr)
test %>%
rowwise() %>%
mutate(new_col =
unlist((mgsub(b,
unlist(str_extract_all(b,"COL")),
unlist(str_extract_all(a,"COL.*?\\b")))
)))
# A tibble: 2 x 3
# Rowwise:
a b new_col
<chr> <chr> <chr>
1 COL167 COL;MO, K COL167;MO, K
2 COL2010;COL2012 P;COL, NY, S, COL P;COL2010, NY, S, COL2010
mgsub takes 3 arguments: the string you're working on, the pattern you want to replace within that string, and the expression you want to use as the replacement. The package lets you supply multiple patterns and multiple replacements, so both can be vectors.
I applied this function to each row. First, I designated the b column as the string of interest. Second, the instances of COL in column b are what we want to replace, so I collected them into a vector using stringr::str_extract_all; the output has to be unlisted because str_extract_all returns a list. Third, I used the same process to extract the COL+ entries from column a. In summary, we use the entries in column a to replace the characters of interest within column b.
The pattern "COL.*?\\b" selects the letters COL followed by as few characters as possible before reaching a word boundary, which allows us to turn the entries in column a into multiple items (COL2010, COL2012 etc).
We have to unlist the mutated row (i.e. the first "unlist") because dplyr outputs a list-column.
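To see what the extraction pattern picks up on its own, here is a small sketch. It swaps in base R's regmatches() with gregexpr(..., perl = TRUE) for stringr::str_extract_all(), so it is an illustration of the pattern rather than the exact code used above; the variable name `ids` is made up.

```r
a <- c("COL167", "COL2010;COL2012")

# Lazy match: 'COL' plus as few characters as possible up to a word boundary
ids <- regmatches(a, gregexpr("COL.*?\\b", a, perl = TRUE))

# One character vector per input string; the second has two entries
ids
lengths(ids)
```

The `;` separator is not a word character, so the boundary before it ends the first match and the second identifier is picked up as a separate match.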
The first column in my data.frame consists of strings, and the second column are unique keys.
I want to extract all words after the nth word from each string, and if the string has <= n words, extract the entire string.
I have over 10k rows in my data.frame and was wondering if there is a quick way of doing this other than using for loops?
Thanks.
How about the following:
# Generate some sample data
library(tidyverse)
df <- data.frame(
one = c("Entries from row one", "Entries from row two", "Entries from row three"),
two = runif(3))
# Define function to extract all words after the first n words
# (or return the full string if it has n or fewer words)
crop_string <- function(ss, n) {
  # vapply returns a character vector, so mutate() gets a plain column, not a list-column
  vapply(strsplit(as.character(ss), "\\s"), function(v) {
    if (length(v) > n) paste(v[(n + 1):length(v)], collapse = " ")
    else paste(v, collapse = " ")
  }, character(1))
}
# Let's crop strings from column one by removing the first 3 words (n = 3)
n <- 3;
df %>%
mutate(words_after_n = crop_string(one, n))
# one two words_after_n
#1 Entries from row one 0.5120053 one
#2 Entries from row two 0.1873522 two
#3 Entries from row three 0.0725107 three
# If n > # of words, return the full string
n <- 10;
df %>%
mutate(words_after_n = crop_string(one, n))
# one two words_after_n
#1 Entries from row one 0.9363278 Entries from row one
#2 Entries from row two 0.3024628 Entries from row two
#3 Entries from row three 0.6666226 Entries from row three
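Since the question asks for something quick on 10k+ rows, a single vectorized sub() call might also do the job. The following is a sketch under the assumption that words are separated by whitespace and strings have no leading whitespace; `words_after_n` is a hypothetical helper name.

```r
# Hypothetical helper: drop the first n whitespace-separated words of each string
words_after_n <- function(x, n) {
  # Matches exactly n leading "word + whitespace" groups; strings with
  # n or fewer words do not match the pattern and are returned unchanged
  pattern <- sprintf("^(?:\\S+\\s+){%d}", n)
  sub(pattern, "", as.character(x), perl = TRUE)
}

one <- c("Entries from row one", "Entries from row two", "Entries from row three")

words_after_n(one, 3)    # only the last word of each string remains
words_after_n(one, 10)   # fewer than 10 words, so the strings come back unchanged
```

The pattern only matches when an (n+1)-th word exists, because each of the n groups must be followed by whitespace, so no explicit length check is needed.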
Here I use nchar(), so make sure your data has been converted to character first:
YOUR_DATA <- as.character(YOUR_DATA)
as.character(sapply(YOUR_DATA,function(x,y){
if(nchar(x)>=y){
substr(x,y,nchar(x))
}
else{x}
},y= nth_data_you_want))
Assume the data looks like this:
"gene#seq"
"Cblb#TAGTCCCGAAGGCATCCCGA"
"Fbxo27#CCCACGTGTTCTCCGGCATC"
"Fbxo11#GGAATATACGTCCACGAGAA"
"Pwp1#GCCCGACCCAGGCACCGCCT"
Using 10 as the starting character position, the result is:
"gene#seq"
"CCCGAAGGCATCCCGA"
"CACGTGTTCTCCGGCATC"
"AATATACGTCCACGAGAA"
"GACCCAGGCACCGCCT"
Following a previous question I asked, I got an awesome answer.
Here is a quick summary:
I want to compute a multidimensional development index based on South Africa Data for several years. My list is composed of individual information for each year, so basically df1 is about year 1 and df2 about year2.
df1<-data.frame(var1=c(1, 1,1), var2=c(0,0,1), var3=c(1,1,0))
df2<-data.frame(var1=c(1, 0,1), var2=c(1,0,1), var3=c(0,1,0))
mylist <-list (df1,df2)
var1 could be the stance on religion of each person, var2 how she voted in last national election, etc. In my very simple case, I have the data for 3 different persons each year.
From there, I compute an index based on a number of variables (not all of them)
You can find here a very simplified working index function, with only 2 of 3 variables, named dimX and dimY:
myindex <- function(x, dimX, dimY){
econ_i<- ( x[dimX]+ x[dimY] )
return ( (1/length(econ_i))*sum(econ_i) )
}
myindex(df1, "var2", "var3")
and
myindex2 = function(x, d) {
myindex(x, d[1], d[2])
}
Then I have my dataframe of variables I want to use for my index. I am trying to compute the index for several sets of variables.
args <- data.frame(set1=c("var1", "var2"), set2=c("var2", "var3"), stringsAsFactors = F)
I'd like to have the result structured as (a) list(set1 = list(df1, df2), set2 = list(df1, df2)) instead of (b) list(df1 = list(set1, set2), df2 = list(set1, set2)).
Case (a) represents a time series, meaning I have a list of results of my indexes each year for only one set of variables. Case (b) is the opposite where I have the index results of one year for every set of variables. Each individual result should be a unique numeric value. Hence, I am expecting to get a list of 2 sublists df1 and df2, each sublist containing 3 numeric values.
I've been advised to use this great command:
lapply(mylist, function(m) lapply(args, myindex2, x = m))
It's working great, but I get the result in the "wrong" format, namely the second one (b) I showed.
How could I get the results ordered per set (i.e. case (a) as time series) instead of per year?
Thanks a lot for your help!
PJ
EDIT: I've managed to find a solution that doesn't answer the question, but still allows me to get my data in desired order.
Namely, I'm transforming my list of lists to a matrix that I simply transpose.
This answer will be edited!
Currently, your function myindex() does this:
myindex <- function(x, dimX, dimY){
econ_i<- ( x[dimX]+ x[dimY] )
return ( (1/length(econ_i))*sum(econ_i) )
}
Aren't you after this, however?
myindex <- function(x, dimX, dimY){
econ_i<- ( x[,dimX]+ x[,dimY] )
return ( (1/length(econ_i))*sum(econ_i) )
}
The way you have it right now, length(econ_i) always returns 1 because econ_i is a one-column data.frame() and not a vector. The length of a data.frame() is its number of columns, while the length of a vector is the number of elements within it.
For reference, here is what the two kinds of subsetting return in R.
df1["var1"]
var1
1 1
2 1
3 1
returns a data.frame()
df1[,"var1"]
[1] 1 1 1
returns a vector.
I will adjust this post to answer your question when you respond. I think it's important to solve this part first.
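Assuming the vector-based version of myindex() is what is wanted, one way to get the per-set ordering (case (a)) might be to swap the two lapply() calls, so that the outer loop runs over the variable sets and the inner loop over the years. This is a sketch on the toy data from the question; the names given to `mylist` and the result name `by_set` are my additions, and `[[ ]]` is used as an equivalent way of pulling a column out as a vector.

```r
df1 <- data.frame(var1 = c(1, 1, 1), var2 = c(0, 0, 1), var3 = c(1, 1, 0))
df2 <- data.frame(var1 = c(1, 0, 1), var2 = c(1, 0, 1), var3 = c(0, 1, 0))
mylist <- list(df1 = df1, df2 = df2)   # names added here for readable output

# Vector-based index: [[ ]] extracts a column as a plain vector
myindex <- function(x, dimX, dimY) {
  econ_i <- x[[dimX]] + x[[dimY]]
  (1 / length(econ_i)) * sum(econ_i)
}
myindex2 <- function(x, d) myindex(x, d[1], d[2])

args <- data.frame(set1 = c("var1", "var2"), set2 = c("var2", "var3"),
                   stringsAsFactors = FALSE)

# Outer loop over variable sets, inner loop over years
by_set <- lapply(args, function(d) lapply(mylist, myindex2, d = d))
str(by_set)
```

Each element of `by_set` is then one variable set, holding one numeric value per year.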
In case it helps, here is my actual index function, taken from this article:
RCI_a_3det <-function(x, econ1, econ2, econ3, perso1, perso2, perso3, civic1, civic2, civic3){
econ_i<- (1/3) *( x[econ1]+ x[econ2] + x[econ3])
perso_i<- (1/3)*( x[perso1] + x[perso2] + x[perso3])
civic_i<- (1/3)*(x[civic1] + x[civic2] + x[civic3])
daf <- data.frame(econ_i, perso_i, civic_i)
colnames(daf)<- c("econ_i", "perso_i", "civic_i")
df1 <- subset(daf, daf$econ_i !=1 & daf$perso_i !=1 & daf$civic_i!=1 )
sum_xik <- (df1$econ_i + df1$perso_i + df1$civic_i)
return ( 1/(3*nrow(df1)) * sum(sum_xik, na.rm=T))
}
Edit:
x is a list of all personal information, for every variable and for every year. It is pretty large.
I am using 9 variables to compute this index, but I actually have 30 such variables in my data, so I have set up a dataframe of sets of variables I could use to compute this index. This is the equivalent of my args df in the simple example. I am actually using 200 such combinations.
I have a character vector (x, see below) whose entries come in many different formats. They are all positions on a genome but have different names. These names were given to me and belong to a list of about 6 million, so it's not easy for me to change them manually. This is a subset; other entries in the list look like X1 or chr 13.
x <- c("rs62224609.16051249.T.C", "rs376238049.16052962.C.T","rs62224614.16053862.C.T","X22.17028719.G.A", "rs4535153", "X22.17028719.G.A", "kgp3171179", "rs375850426.17029070.GCAGTGGC.G" , "chr22.17030620.G.A")
I'd like all the strings to look like this:
y <- c("rs62224609", "rs376238049", "rs62224614", "chr22:17028719", "rs4535153", "chr22:17028719", "kgp3171179", "rs375850426", "chr22:17030620")
I've tried the following, but everything after the first "." is removed... which isn't exactly what I want.
x.test = gsub(pattern = "\\.\\S+$", replacement = "", x = x)
Any help would be greatly appreciated!
If all your data corresponds to the examples you've given:
x = c("rs62224609.16051249.T.C", "rs376238049.16052962.C.T","rs62224614.16053862.C.T","X22.17028719.G.A", "rs4535153", "X22.17028719.G.A", "kgp3171179", "rs375850426.17029070.GCAGTGGC.G" , "chr22.17030620.G.A")
There are two types of ids, the ones with SNP ids (starting with rs or kgp), and the ones giving a chromosomal position (starting with the chromosome name).
You could start off by identifying your SNP ids, with something like:
x1 = gsub("((rs|kgp)\\d+).*","\\1",x)
This returns:
[1] "rs62224609" "rs376238049" "rs62224614" "X22.17028719.G.A" "rs4535153" "X22.17028719.G.A" "kgp3171179" "rs375850426" "chr22.17030620.G.A"
Then format the chromosome positions with (I've assumed that you had chromosomes from 1 to 22, X,Y and M, but this depends on your data):
## We look for [(chr OR X) (1 or 2 digits or X or Y or M) 1 or more punctuation marks (1 or more digits) anything] and
## we transform it into: [chr (the second captured element) : (the third captured element)]
x2 = gsub("(chr|X)(\\d{1,2}|X|Y|M)[[:punct:]]+(\\d+).*","chr\\2:\\3",x1)
This returns:
[1] "rs62224609" "rs376238049" "rs62224614" "chr22:17028719" "rs4535153" "chr22:17028719" "kgp3171179" "rs375850426" "chr22:17030620"
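The two substitutions can be wrapped in one function and checked against the desired output from the question. This is a sketch: `normalize_ids` is a hypothetical name, and I have added `^` and `$` anchors so the patterns must describe the whole entry, which is slightly stricter than the calls above and relies on the same assumption about chromosome names.

```r
# Hypothetical wrapper around the two substitutions
normalize_ids <- function(ids) {
  # Step 1: keep only the SNP id for entries starting with rs or kgp
  ids <- sub("^((rs|kgp)\\d+).*$", "\\1", ids)
  # Step 2: rewrite chromosome positions as chr<name>:<position>
  sub("^(chr|X)(\\d{1,2}|X|Y|M)[[:punct:]]+(\\d+).*$", "chr\\2:\\3", ids)
}

x <- c("rs62224609.16051249.T.C", "rs376238049.16052962.C.T", "rs62224614.16053862.C.T",
       "X22.17028719.G.A", "rs4535153", "X22.17028719.G.A", "kgp3171179",
       "rs375850426.17029070.GCAGTGGC.G", "chr22.17030620.G.A")
y <- c("rs62224609", "rs376238049", "rs62224614", "chr22:17028719", "rs4535153",
       "chr22:17028719", "kgp3171179", "rs375850426", "chr22:17030620")

# Compare the result with the desired vector from the question
identical(normalize_ids(x), y)
```

Entries that fit neither pattern are returned unchanged, which makes it easy to list the formats that still need a rule, for example with `setdiff()` on the output.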
I have an excel file of a list of sequences. How would I go about getting the number of times a letter appears before a letter in square brackets? An example of an entry is below.
GTCCTGGTTGTAGCTGAAGCTCTTCCC[A]CTCCTCCCGATCACTGGGACGTCCTATGT
I'd also like to do this for the letter after the square brackets.
Edit: Apologies for the confusion. Take the example below. I'd like to count how many times A, C, G, and T appear immediately before and after the letter in square brackets (of which there is only one per line). So I want to count the occurrences of A[A]A, A[A]C, C[A]A, and so on. The file is in Excel, and I'm happy to use any method in Excel, R or Linux.
CCCACCCGCCAGGAAGCCGCTATCACTGTCCAAGTTGTCATCGGAACTCC[A]CCAGCCTGTGGACTTGGCCTGGTGCCGCCCATCCCCCTTGCGGTCCTTGC
ACCACTACCCCCTTCCCCACCATCCACCTCAGAAGCAGTCCCAGCCTGCC[A]CCCGCCAGCCCCTGCCCAGCCCTGGCTTTTTGGAAACGGGTCAGGATTGG
TTTGCTTTAAAATACTGCAACCACTCCAGGTAAATCTTCCGCTGCCTATA[A]CCCCGCCAATGAGCCTGCACATCAGGAGAGAAAGGGAAGTAACTCAAGCA
GAAATCTTCTGAAACAGTCTCCAGAAGACTGTCTCCAAATACACAGCAGA[A]CCAGCCAGTCCACAGCACTTTACCTTCTCTATTCTCAGATGGCAATTGAG
GGACTGCCCCAAGGCCCGCAGGGAGGTGGAGCTGCACTGGCGGGCCTCCC[A]GTGCCCGCACATCGTACGGATCGTGGATGTGTACGAGAATCTGTACGCAG
GGCCCAACGCCATCCTGAAACTCACTGACTTTGGCTTTGCCAAGGAAACC[A]CCAGCCACAACTCTTTGACCACTCCTTGTTATACACCGTACTATGTGGGT
TCTGCCTGGTCCGCTGGAGCTGGGCATTGAAGCCCCGCAGCTGCTCAGCC[A]CCTGCCCCGCCATCAAGAAGGCCCCACCGGCCCTGGGAAGGACACCCCTG
TTTGAAGCCCTTATGAACCAAGAAACCTTCGTTCAGGACCTCAAAATCAA[A]CCCCGCCACATGCAGCTCGCAGGCCTGCAGGAGGAAAGACAGGTTAGCAA
CTGCAGCCTACCTGTCCATGTCCCAGGGGGCCGTTGCCAACGCCAACAGC[A]CCCCGCCGCCCTATGAGCGTACCCGCCTCTCCCCACCCCGGGCCAGCTAC
ACTGGCAAACATGTTGAGGACAATGATGGAGGGGATGAGCTTGCATAGGA[A]CCTGCCGTAGGGCCACTGTCCCTGGAGAGCCAAGTGAGCCAGCGAGAAGG
CACCCTCAGAGAAGAAGAAAGGAGCTGAGGAGGAGAAGCCAAAGAGGAGG[A]GGCAGGAGAAGCAGGCAGCCTGCCCCTTCTACAACCACGAGCAGATGGGC
CCAGCCCTGTATGAGGACCCCCCAGATCAGAAAACCTCACCCAGTGGCAA[A]CCTGCCACACTCAAGATCTGCTCTTGGAATGTGGATGGGCTTCGAGCCTG
TTCCTGTGCGCCCCAACAACTCCTTTAGCTGGCCTAAAGTGAAAGGACGG[A]CCTGCCAATGAAAATAGACTTTCAGGGTCTAGCAGAAGGCAAGACCACCA
CTAACACCCGCACGAGCTGCTGGTAGATCTGAATGGCCAAGTCACTCAGC[A]CCTGCCGATACTCAGCCAGGTCAAAATTGGTGAGGCAGTGTTCATTCTGG
AGTTCTGCATCTGGAGCAAATCCTTGGCACTCCCTCATGCTGGCTATCAC[A]CCTGCCACGAATGTGCCATGGCCCAACCCTGCAGTCCATAAAGAAAACAA
CGTGCCCATGCAGCTAGTGCTCTTCCGAGAGGCTATTGAACACAGTGAGC[A]CCTGCCACGCCTATCCCCTTCCCCATCATCTCAGTGATGGGGTATGTCTA
ACAAGGACCTGGCCCTGGGGCAGCCCCTCAGCCCACCTGGTCCCTGCCTT[A]CCCAGCCAGTACTCTCCATCAGCACGGCCGAAGCCCAGCTTGTAGTCATT
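You could split the original string into two parts: from the start of the string to the first [, and from the first ] to the end of the string. Then count the letter in each part.
// Requires `using System.Linq;`. `seq` holds one sequence line.
string firstPart = seq.Substring(0, seq.IndexOf('['));
string secondPart = seq.Substring(seq.IndexOf(']') + 1);
int count = firstPart.Count(f => f == 'A') + secondPart.Count(f => f == 'A');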
Option Explicit
Sub test()
Dim seq As String
seq = "GTCCTGGTTGTAGCTGAAGCTCTTCCC[A]CTCCTCCCGATCACTGGGACGTCCTATGT"
Debug.Print CountLetter("A", seq)
End Sub
Function CountLetter(letter As String, ByVal sequence As String) As Long
'--- assumes the letter in the brackets is the same as that being counted
Dim allLetters() As String
allLetters = Split("A,C,G,T", ",")
Dim letterToDelete As Variant
For Each letterToDelete In allLetters
If letterToDelete <> letter Then
sequence = Replace(sequence, letterToDelete, "")
End If
Next letterToDelete
'--- subtract 3 for the two brackets and the bracketed letter left in the string
CountLetter = Len(sequence) - 3
End Function
x = "GTCCTGGTTGTAGCTGAAGCTCTTCCC[A]CTCCTCCCGATCACTGGGACGTCCTATGT"
# Split on the bracketed letter, then count the matches in each part;
# sum(... > 0) gives 0 rather than 1 when the letter does not occur at all
#COUNT 'A'
sapply(unlist(strsplit(x,"\\[[A-Za-z]\\]")), function(a) sum(gregexpr("A", a)[[1]] > 0))
# GTCCTGGTTGTAGCTGAAGCTCTTCCC CTCCTCCCGATCACTGGGACGTCCTATGT
# 3 4
#COUNT 'G'
sapply(unlist(strsplit(x,"\\[[A-Za-z]\\]")), function(a) sum(gregexpr("G", a)[[1]] > 0))
# GTCCTGGTTGTAGCTGAAGCTCTTCCC CTCCTCCCGATCACTGGGACGTCCTATGT
# 7 6
New R solution (after clarification by OP)
Let's assume the data have been read from Excel into a data.table called los (list of sequences) which has only one column called sequence. Then, the occurrences can be counted as follows:
library(data.table)
los[, .N, by = stringr::str_extract(sequence, "[ACGT]\\[[ACGT]\\][ACGT]")]
# stringr N
#1: C[A]C 8
#2: A[A]C 5
#3: C[A]G 1
#4: G[A]G 1
#5: G[A]C 1
#6: T[A]C 1
str_extract() looks for one of the letters A, C, G, T followed by [ followed by one of the letters A, C, G, T followed by ] followed by one of the letters A, C, G, T in column sequence and extracts the matching substrings. Then, los is grouped by the substrings and the number of occurrences is counted (.N).
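The same extract-then-count idea can be sketched in base R, in case data.table and stringr are not available. The short sequences below are made-up toy strings, not the data from the question, and `seqs`, `context` and `counts` are illustrative names.

```r
# Toy sequences for illustration only (not the question's data)
seqs <- c("ACC[A]CCA", "TGA[A]CGT", "GGC[A]CTT", "TTC[A]GAA")

# Find the first 'letter[letter]letter' context in each sequence
m <- regexpr("[ACGT]\\[[ACGT]\\][ACGT]", seqs)

# regmatches() silently drops sequences without a match
context <- regmatches(seqs, m)

# Tabulate how often each context occurs
counts <- table(context)
counts
```

Note that sequences with no bracketed letter simply disappear from the tally here, whereas the grouped data.table call would report them under NA.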
Data
If the Excel file is stored in CSV format then it can be read using data.table's fread() function like this
los <- fread("your_file_name.csv")
(Perhaps, some parameters to fread() might need to be adjusted for the specific file.)
However, some data already are provided in the question. These can be read as character string using fread() as well:
los <- fread("sequence
CCCACCCGCCAGGAAGCCGCTATCACTGTCCAAGTTGTCATCGGAACTCC[A]CCAGCCTGTGGACTTGGCCTGGTGCCGCCCATCCCCCTTGCGGTCCTTGC
ACCACTACCCCCTTCCCCACCATCCACCTCAGAAGCAGTCCCAGCCTGCC[A]CCCGCCAGCCCCTGCCCAGCCCTGGCTTTTTGGAAACGGGTCAGGATTGG
TTTGCTTTAAAATACTGCAACCACTCCAGGTAAATCTTCCGCTGCCTATA[A]CCCCGCCAATGAGCCTGCACATCAGGAGAGAAAGGGAAGTAACTCAAGCA
GAAATCTTCTGAAACAGTCTCCAGAAGACTGTCTCCAAATACACAGCAGA[A]CCAGCCAGTCCACAGCACTTTACCTTCTCTATTCTCAGATGGCAATTGAG
GGACTGCCCCAAGGCCCGCAGGGAGGTGGAGCTGCACTGGCGGGCCTCCC[A]GTGCCCGCACATCGTACGGATCGTGGATGTGTACGAGAATCTGTACGCAG
GGCCCAACGCCATCCTGAAACTCACTGACTTTGGCTTTGCCAAGGAAACC[A]CCAGCCACAACTCTTTGACCACTCCTTGTTATACACCGTACTATGTGGGT
TCTGCCTGGTCCGCTGGAGCTGGGCATTGAAGCCCCGCAGCTGCTCAGCC[A]CCTGCCCCGCCATCAAGAAGGCCCCACCGGCCCTGGGAAGGACACCCCTG
TTTGAAGCCCTTATGAACCAAGAAACCTTCGTTCAGGACCTCAAAATCAA[A]CCCCGCCACATGCAGCTCGCAGGCCTGCAGGAGGAAAGACAGGTTAGCAA
CTGCAGCCTACCTGTCCATGTCCCAGGGGGCCGTTGCCAACGCCAACAGC[A]CCCCGCCGCCCTATGAGCGTACCCGCCTCTCCCCACCCCGGGCCAGCTAC
ACTGGCAAACATGTTGAGGACAATGATGGAGGGGATGAGCTTGCATAGGA[A]CCTGCCGTAGGGCCACTGTCCCTGGAGAGCCAAGTGAGCCAGCGAGAAGG
CACCCTCAGAGAAGAAGAAAGGAGCTGAGGAGGAGAAGCCAAAGAGGAGG[A]GGCAGGAGAAGCAGGCAGCCTGCCCCTTCTACAACCACGAGCAGATGGGC
CCAGCCCTGTATGAGGACCCCCCAGATCAGAAAACCTCACCCAGTGGCAA[A]CCTGCCACACTCAAGATCTGCTCTTGGAATGTGGATGGGCTTCGAGCCTG
TTCCTGTGCGCCCCAACAACTCCTTTAGCTGGCCTAAAGTGAAAGGACGG[A]CCTGCCAATGAAAATAGACTTTCAGGGTCTAGCAGAAGGCAAGACCACCA
CTAACACCCGCACGAGCTGCTGGTAGATCTGAATGGCCAAGTCACTCAGC[A]CCTGCCGATACTCAGCCAGGTCAAAATTGGTGAGGCAGTGTTCATTCTGG
AGTTCTGCATCTGGAGCAAATCCTTGGCACTCCCTCATGCTGGCTATCAC[A]CCTGCCACGAATGTGCCATGGCCCAACCCTGCAGTCCATAAAGAAAACAA
CGTGCCCATGCAGCTAGTGCTCTTCCGAGAGGCTATTGAACACAGTGAGC[A]CCTGCCACGCCTATCCCCTTCCCCATCATCTCAGTGATGGGGTATGTCTA
ACAAGGACCTGGCCCTGGGGCAGCCCCTCAGCCCACCTGGTCCCTGCCTT[A]CCCAGCCAGTACTCTCCATCAGCACGGCCGAAGCCCAGCTTGTAGTCATT")
Old solution (before clarification by OP) - left here for reference
This is a solution in base R with help of the stringr package which will work with a "list" of sequences (a data.frame), any single letter enclosed in square brackets, and arbitrary lengths of the sequences. It assumes that the data already have been read from file into a data.frame which is named los here.
# create data: data frame with two sequences
los <- data.frame(
sequence = c("GTCCTGGTTGTAGCTGAAGCTCTTCCC[A]CTCCTCCCGATCACTGGGACGTCCTATGT",
"GTCCTGGTTGTAGCTGAAGCTCTTCCCACT[C]CTCCCGATCACTGGGACGTCCTATGT"))
# split sequences in three parts
mat <- stringr::str_split_fixed(los$sequence, "[\\[\\]]", n = 3)
los$letter <- mat[, 2]
los$n_before <- stringr::str_count(mat[, 1], mat[, 2])
los$n_after <- stringr::str_count(mat[, 3], mat[, 2])
print(los)
# sequence letter n_before n_after
#1 GTCCTGGTTGTAGCTGAAGCTCTTCCC[A]CTCCTCCCGATCACTGGGACGTCCTATGT A 3 4
#2 GTCCTGGTTGTAGCTGAAGCTCTTCCCACT[C]CTCCCGATCACTGGGACGTCCTATGT C 9 9
Note this code works best if there is exactly one pair of square brackets in each sequence. Any additional brackets will be ignored.
It will also work if there is more than just one letter enclosed in brackets, e.g., [GT].
I confess that I'm addicted to Hadley Wickham's stringr package because I find it difficult to remember the inconsistently named base R functions for string manipulation like strsplit, grepl, sub, match, gregexpr, etc. To see what I mean, have a look at the Usage and See Also sections of ?grep and compare them to stringr.
I would think that R packages for bioinformatics, such as seqinr or Biostrings, would be a good starting point. However, here's a "roll your own" solution.
First step: get your data from Excel into R. I will assume that file mydata.xlsx contains one sheet with a column of sequence and no header. You need to adapt this for your file and sheet format.
library(readxl)
sequences <- read_excel("mydata.xlsx", col_names = FALSE)
colnames(sequences) <- "sequence"
Now you need a function to extract the base in square brackets and the bases at -1 and +1. This function uses the stringr package to extract bases using regular expressions.
get_bases <- function(seq) {
require(stringr)
require(magrittr)
subseqs <- str_match(seq, "^([ACGT]+)\\[([ACGT])\\]([ACGT]+)$")
bases <- list(
before = subseqs[, 2] %>% str_sub(-1, -1),
base = subseqs[, 3],
after = subseqs[, 4] %>% str_sub(1, 1)
)
return(bases)
}
Now you can pass the column of sequences to the function to generate a list of lists, which can be converted to a data frame.
library(purrr)
sequences_df <- lapply(sequences, get_bases) %>%
map_df(as.data.frame, stringsAsFactors = FALSE)
head(sequences_df, 3)
before base after
1 C A C
2 C A C
3 A A C
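For readers without stringr, the extraction step might be sketched in base R with regexec(), which returns the capture groups alongside the full match. `get_bases_base` is a hypothetical name, the two sequences are the ones used in another answer in this thread, and the sketch assumes every sequence contains exactly one bracketed base with a base on either side.

```r
# Hypothetical base R counterpart of get_bases()
get_bases_base <- function(seq) {
  # Each list element holds: full match, base before, bracketed base, base after
  m <- regmatches(seq, regexec("([ACGT])\\[([ACGT])\\]([ACGT])", seq))
  parts <- do.call(rbind, m)
  data.frame(before = parts[, 2], base = parts[, 3], after = parts[, 4],
             stringsAsFactors = FALSE)
}

seqs <- c("GTCCTGGTTGTAGCTGAAGCTCTTCCC[A]CTCCTCCCGATCACTGGGACGTCCTATGT",
          "GTCCTGGTTGTAGCTGAAGCTCTTCCCACT[C]CTCCCGATCACTGGGACGTCCTATGT")

bases <- get_bases_base(seqs)
bases

# Separate tallies for the -1 and +1 positions
table(bases$before)
table(bases$after)
```

Capturing only the single flanking bases avoids the extra str_sub() step, at the cost of not validating the rest of the sequence.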
The last step is to use functions from dplyr and tidyr to count up the bases.
library(dplyr)
library(tidyr)
sequences_df %>%
gather(position, letter, -base) %>%
group_by(base, position, letter) %>%
tally() %>%
spread(position, n) %>%
select(base, letter, before, after)
Here is the result using your 17 example sequences. You may want to choose better names than I did: base = the base in square brackets, letter = the base being counted, before = count at -1, after = count at +1.
base letter before after
* <chr> <chr> <int> <int>
1 A A 5 NA
2 A C 9 15
3 A G 2 2
4 A T 1 NA