Search Position of a word in String - r

I need to do a small exercise in R: I need to know how many times a specific word appears in a string and the position of each occurrence.
I have this:
string = 'a b a b c d a a g'
splitstring = strsplit(string, ' ')
sapply(gregexpr("a", splitstring, fixed= TRUE), function(x) sum(x>-1))
My output is: [1] 4, so I have four 'a' in my string, and now I want to know their positions.

gregexpr gives you the positions:
gregexpr("a", string, f=T)[[1]]
# [1] 1 5 13 15
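As a minimal base-R sketch (the names hits, char_pos, n_hits and word_idx are mine, not from the question), the count and both kinds of position can be read off the same data. Note that gregexpr("a", ...) matches every character a, whereas comparing the split tokens only matches the whole word:

```r
string <- 'a b a b c d a a g'

# Character offsets of every "a" (fixed = TRUE: literal match, no regex)
hits <- gregexpr("a", string, fixed = TRUE)[[1]]
char_pos <- as.integer(hits)   # drops the match.length attributes
n_hits <- sum(hits > 0)        # gregexpr returns -1 when nothing matches

# Index of each word that is exactly "a" among the space-separated words
words <- strsplit(string, " ", fixed = TRUE)[[1]]
word_idx <- which(words == "a")

char_pos  # 1 5 13 15
n_hits    # 4
word_idx  # 1 3 7 8
```

Which of the two you want depends on whether "position" means the character offset or the word number.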

One solution is to use the stringr package's str_locate_all function, as follows:
library(stringr)
string = 'a b a b c d a a g'
l <- str_locate_all(string, 'a')
l
This returns a list of matrices holding the start and end position of every match:
[[1]]
start end
[1,] 1 1
[2,] 5 5
[3,] 13 13
[4,] 15 15
If you want to extract just the start positions, you can do as follows:
l[[1]][, 'start']
[1] 1 5 13 15

Related

Count substring matches within start/stop positions of pattern in R

From a given string "GAACCCACTAGTATAAAATTTGGGAGTCCCAAACCCTTTGGGAGT", I want a matrix that counts triplets in substrings that
start with “AAA” or “GAA”
end with “AGT”
and
have at least 2 and at most N other triplets between the start and the end.
For my problem, N = 10.
So far I have the code and result below:
library( stringr )
#sample data
dna <- c("GAACCCACTAGTATAAAATTTGGGAGTCCCAAACCCTTTGGGAGT")
#set constants
start <- c("AAA", "GAA")
end <- "AGT"
n <- 10 # << set as desired
#build regex
regex <- paste0( "(", paste0( start, collapse = "|" ), ")", paste0( "([A-Z]{3}){2,", n, "}" ), end )
str_extract_all( dna, regex )
Result:
[[1]]
[1] "GAACCCACTAGTATAAAATTTGGGAGT" "AAACCCTTTGGGAGT"
Now, how can I modify the code so that it returns a matrix containing the starting position of each match along with the number of triplets (every 3 characters count as one triplet) between the start and the end of each match?
For the results above, the output should be:
1 7
31 3
Here 1 is the starting position of GAACCCACTAGTATAAAATTTGGGAGT and 7 is the number of triplets between the starting pattern GAA and the ending pattern AGT.
The same goes for AAACCCTTTGGGAGT:
31 is the starting position of AAA and 3 is the number of triplets between AAA and AGT.
Here's a function that might give you what I think you need, though it doesn't provide the matrix you said you wanted.
func <- function(dna, n = 10, starts, stops) {
if (length(n) == 1L) n <- c(2L, n)
startptn <- paste0("(", paste(starts, collapse = "|"), ")")
stopptn <- paste0("(", paste(stops, collapse = "|"), ")")
starts_ind <- gregexpr(startptn, dna)
stops_ind <- gregexpr(stopptn, dna)
# stops_ind ends on the first char of the triplet, so add 2
stops_ind <- lapply(stops_ind, `+`, 2L)
candidates <- Map(function(bgn, end, txt) {
mtx <- outer(bgn, end, FUN = function(b, e) substring(txt, b, e))
vec <- mtx[nzchar(mtx)]
vec
}, starts_ind, stops_ind, dna)
# "6L" is the first/last triplet defined in starts/stops
cand_triplets <- lapply(candidates, function(z) nchar(z) - 6L)
lens <- lengths(candidates)
df <- data.frame( id = rep(seq_along(dna), lens), dna = rep(dna, lengths(candidates)) )
df$match <- unlist(candidates)
df$inner <- substring(df$match, 4, nchar(df$match) - 3)
df$ntriplets <- nchar(df$inner) / 3
if (nrow(df) > 0) {
df <- df[ abs(df$ntriplets %% 1) < 1e-5 &
n[1] <= df$ntriplets &
df$ntriplets <= n[2], , drop = FALSE ]
}
df
}
Demo:
dna <- c("GAACCCACTAGTATAAAATTTGGGAGTCCCAAACCCTTTGGGAGT")
func(c(dna, dna), 10, c("AAA","GAA"), "AGT")
# id dna match inner ntriplets
# 1 1 GAACCCACTAGTATAAAATTTGGGAGTCCCAAACCCTTTGGGAGT GAACCCACTAGT CCCACT 2
# 2 1 GAACCCACTAGTATAAAATTTGGGAGTCCCAAACCCTTTGGGAGT GAACCCACTAGTATAAAATTTGGGAGT CCCACTAGTATAAAATTTGGG 7
# 6 1 GAACCCACTAGTATAAAATTTGGGAGTCCCAAACCCTTTGGGAGT AAACCCTTTGGGAGT CCCTTTGGG 3
# 7 2 GAACCCACTAGTATAAAATTTGGGAGTCCCAAACCCTTTGGGAGT GAACCCACTAGT CCCACT 2
# 8 2 GAACCCACTAGTATAAAATTTGGGAGTCCCAAACCCTTTGGGAGT GAACCCACTAGTATAAAATTTGGGAGT CCCACTAGTATAAAATTTGGG 7
# 12 2 GAACCCACTAGTATAAAATTTGGGAGTCCCAAACCCTTTGGGAGT AAACCCTTTGGGAGT CCCTTTGGG 3
func(c(dna, dna), 20, c("AAA","GAA"), "AGT")
# id dna match inner ntriplets
# 1 1 GAACCCACTAGTATAAAATTTGGGAGTCCCAAACCCTTTGGGAGT GAACCCACTAGT CCCACT 2
# 2 1 GAACCCACTAGTATAAAATTTGGGAGTCCCAAACCCTTTGGGAGT GAACCCACTAGTATAAAATTTGGGAGT CCCACTAGTATAAAATTTGGG 7
# 4 1 GAACCCACTAGTATAAAATTTGGGAGTCCCAAACCCTTTGGGAGT GAACCCACTAGTATAAAATTTGGGAGTCCCAAACCCTTTGGGAGT CCCACTAGTATAAAATTTGGGAGTCCCAAACCCTTTGGG 13
# 6 1 GAACCCACTAGTATAAAATTTGGGAGTCCCAAACCCTTTGGGAGT AAACCCTTTGGGAGT CCCTTTGGG 3
# 7 2 GAACCCACTAGTATAAAATTTGGGAGTCCCAAACCCTTTGGGAGT GAACCCACTAGT CCCACT 2
# 8 2 GAACCCACTAGTATAAAATTTGGGAGTCCCAAACCCTTTGGGAGT GAACCCACTAGTATAAAATTTGGGAGT CCCACTAGTATAAAATTTGGG 7
# 10 2 GAACCCACTAGTATAAAATTTGGGAGTCCCAAACCCTTTGGGAGT GAACCCACTAGTATAAAATTTGGGAGTCCCAAACCCTTTGGGAGT CCCACTAGTATAAAATTTGGGAGTCCCAAACCCTTTGGG 13
# 12 2 GAACCCACTAGTATAAAATTTGGGAGTCCCAAACCCTTTGGGAGT AAACCCTTTGGGAGT CCCTTTGGG 3
In the input:
each input dna can produce 0 or more rows in the output; I use c(dna, dna) merely to demonstrate that this works on multiple dna strings, vectorized;
if n is length 1, then it defaults to ranging between 2 and n; if length 2, then both are used appropriately;
In the output:
id numbers each of the dna input strings;
dna is the actual string; I include both id and dna in case a string is accidentally repeated (minor);
match is the matching substring, including the start and stop triplets;
inner is the same substring with the start/stop triplets removed;
ntriplets is the nchar of $inner divided by 3;
the code first keeps only rows where the inner string is a whole number of triplets (nchar / 3 must have no fractional part, i.e. %% 1 is 0), and then keeps only rows whose triplet count lies within the range given by n.
(If you want to see all matches, set n to Inf, though it'll still filter out non-triplet inner strings.)
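To make the filtering step concrete, here is a small sketch that applies the same arithmetic to three candidate substrings taken from the demo string. The vector names matches and keep are mine, and the bounds 2 and 10 are simply the defaults discussed above:

```r
# Three candidates: the second one is 13 characters long, so its inner
# part is not a whole number of triplets and should be rejected.
matches <- c("GAACCCACTAGT", "AAAATTTGGGAGT", "AAACCCTTTGGGAGT")

# Strip the 3-character start and stop patterns
inner <- substring(matches, 4, nchar(matches) - 3)
ntriplets <- nchar(inner) / 3

# Keep whole triplet counts that fall within [2, 10]
keep <- abs(ntriplets %% 1) < 1e-5 & ntriplets >= 2 & ntriplets <= 10

data.frame(match = matches, inner = inner, ntriplets = ntriplets, keep = keep)
```

The second candidate has an inner part of 7 characters (7 / 3 is not an integer), which is why it never shows up in the function's output.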
As to your request of a matrix of lengths, if we insert a browser() in the Map and form mtx, we'll see that we are using matrices:
bgn
# [1] 1 15 31
# attr(,"match.length")
# [1] 3 3 3
# attr(,"index.type")
# [1] "chars"
# attr(,"useBytes")
# [1] TRUE
end
# [1] 12 27 45
# attr(,"match.length")
# [1] 3 3 3
# attr(,"index.type")
# [1] "chars"
# attr(,"useBytes")
# [1] TRUE
txt
# [1] "GAACCCACTAGTATAAAATTTGGGAGTCCCAAACCCTTTGGGAGT"
mtx <- outer(bgn, end, FUN = function(b, e) substring(txt, b, e))
mtx
# [,1] [,2] [,3]
# [1,] "GAACCCACTAGT" "GAACCCACTAGTATAAAATTTGGGAGT" "GAACCCACTAGTATAAAATTTGGGAGTCCCAAACCCTTTGGGAGT"
# [2,] "" "AAAATTTGGGAGT" "AAAATTTGGGAGTCCCAAACCCTTTGGGAGT"
# [3,] "" "" "AAACCCTTTGGGAGT"
nchar(mtx) - 6
# [,1] [,2] [,3]
# [1,] 6 21 39
# [2,] -6 7 25
# [3,] -6 -6 9
(The negatives come from the empty strings in mtx: nchar("") - 6 is -6. Inside the function those empty strings are dropped by nzchar() before anything is counted, so they do not reflect a bug in the function.)
To me, this matrix suggests we have 2, 7, and 13 triplets in the top row (6, 21, and 39 characters, each divided by 3).
I have an ugly solution. You just need to analyze your pattern.
library( stringr )
#sample data
dna <- c("GAACCCACTAGTATAAAATTTGGGAGTCCCAAACCCTTTGGGAGT")
#set constants
start <- c("AAA", "GAA")
end <- "AGT"
n <- 10 # << set as desired
#build regex
regex <- paste0( "(", paste0( start, collapse = "|" ), ")", paste0( "([A-Z]{3}){2,", n, "}" ), end )
triplet_string <- str_extract_all( dna, regex )
triplet_matrix <- str_locate_all(dna, regex)
# first solution:
# difference between end and start position (inclusive), divided by 3 to get triplets, minus the first and last triplet
triplet_length_1 <- (triplet_matrix[[1]][,2] - (triplet_matrix[[1]][,1]-1))/3 - 2
df <- data.frame(startpos = triplet_matrix[[1]][,1],
lengthtriplet = triplet_length_1)
# second solution:
# get the length of each ORF in triplets and subtract 2 for the start and stop codons
triplet_length_2 <- nchar(triplet_string[[1]])/3 - 2
df <- data.frame(startpos = triplet_matrix[[1]][,1],
lengthtriplet = triplet_length_2)
I hope this helps.
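If the result really has to be a matrix, a base-R sketch along the same lines is shown below. It swaps stringr for gregexpr() with perl = TRUE and reads the match lengths from the "match.length" attribute; the names m, hit and result are mine, and I am assuming the same non-overlapping, greedy matching as in the question:

```r
dna <- "GAACCCACTAGTATAAAATTTGGGAGTCCCAAACCCTTTGGGAGT"
start <- c("AAA", "GAA")
end <- "AGT"
n <- 10

# Same pattern as in the question: start, 2..n inner triplets, end
regex <- paste0("(", paste(start, collapse = "|"), ")([A-Z]{3}){2,", n, "}", end)

m <- gregexpr(regex, dna, perl = TRUE)[[1]]
hit <- m > 0   # gregexpr reports -1 when there is no match

# One row per match: start position and number of inner triplets
result <- cbind(
  startpos  = as.integer(m)[hit],
  ntriplets = attr(m, "match.length")[hit] / 3 - 2
)
result
#      startpos ntriplets
# [1,]        1         7
# [2,]       31         3
```

Because the matches are non-overlapping, shorter candidates nested inside a longer match (such as GAACCCACTAGT) are not reported here.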

Extract certain part of name in string

I'm trying to extract a particular part of the names in a column of a DF
DF
a b
a.b.c_tot 1
b.c.d_tot 2
d.e.g_tot 3
I need to extract letter between . and _tot, so that
DF
a b c
a.b.c_tot 1 c
b.c.d_tot 2 d
d.e.g_tot 3 g
I suppose it could be done with sub, as I learned today how to extract the letter before the first ., but how do I extract the "middle" part of the name?
I have read the sub documentation and help, but all my attempts just copy the full name from a to c.
Thank you for any tips.
We can call sub() to match the entire string, starting with (1) any number of any characters, then (2) a literal dot, then (3) use a capture group to capture the following character, then (4) a literal _tot. We can then use the \1 backreference atom (with the backslash properly backslash-escaped as per R's string encoding rules) to replace the entire string with the captured character.
DF$c <- sub('^.*\\.(.)_tot$','\\1',DF$a);
DF;
## a b c
## 1 a.b.c_tot 1 c
## 2 b.c.d_tot 2 d
## 3 d.e.g_tot 3 g
Yes, I see the problem; if DF$a were to contain values that do not match the expected pattern, the sub() call would pass them through to the new DF$c column. Here's a hacky solution using the Perl branch reset feature:
DF <- data.frame(a=c('a.b.c_tot','b.c.d_tot','d.e.g_tot','non-matching'),b=c(1L,2L,3L,4L),stringsAsFactors=F);
DF$c <- sub(perl=T,'(?|^.*\\.(.)_tot$|^.*$())','\\1',DF$a);
DF;
## a b c
## 1 a.b.c_tot 1 c
## 2 b.c.d_tot 2 d
## 3 d.e.g_tot 3 g
## 4 non-matching 4
Here's a better solution, involving storing the regex in a variable in advance, and using grepl() and replace() to replace non-matching values with NA prior to calling sub():
re <- '^.*\\.(.)_tot$';
DF$c <- sub(re,'\\1',replace(DF$a,!grepl(re,DF$a),NA));
DF;
## a b c
## 1 a.b.c_tot 1 c
## 2 b.c.d_tot 2 d
## 3 d.e.g_tot 3 g
## 4 non-matching 4 <NA>
Use regexpr and regmatches with a lookbehind and lookahead regex.
x <- c("a.b.c_tot", "b.c.d_tot", "d.e.g_tot")
regmatches(x, regexpr("(?<=\\.).(?=_tot)", x, perl = TRUE))
#[1] "c" "d" "g"
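One caveat with this approach: regmatches() with regexpr() returns only the elements that matched, so the result can be shorter than the input. A sketch of one way to keep the result aligned with the input, filling non-matches with NA (the vector out and the "non-matching" entry are illustrative additions of mine):

```r
x <- c("a.b.c_tot", "non-matching", "d.e.g_tot")

# regexpr returns -1 for elements without a match
m <- regexpr("(?<=\\.).(?=_tot)", x, perl = TRUE)

# Start with all NA, then fill in only the positions that matched
out <- rep(NA_character_, length(x))
out[m != -1] <- regmatches(x, m)
out
# [1] "c" NA  "g"
```

This makes the result safe to assign to a data frame column even when some rows do not follow the pattern.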
We can use str_extract
library(stringr)
DF$c <- str_extract(DF$a, "\\w(?=_tot)")
DF$c
#[1] "c" "d" "g"

How to avoid strings changing to numbers automatically in R

I was trying to save some strings into a matrix, but they were automatically changed to numbers (levels). How can I avoid this?
Here is the table:
trt means M
1 0 12.16673 a
2 111 11.86369 ab
3 125 11.74433 ab
4 14 11.54073 b
I want to save it to a matrix like:
J0001 a ab ab b
But what I get is:
J0001 1 2 2 3
How can I avoid this?
Your M column is defined as a factor. You can save it as-is by wrapping it with as.character
> dat <- read.table(header = TRUE, text = "trt means M
1 0 12.16673 a
2 111 11.86369 ab
3 125 11.74433 ab
4 14 11.54073 b", stringsAsFactors = TRUE)  # needed in R >= 4.0.0 to reproduce the factor column
> as.numeric(dat$M)
# [1] 1 2 2 3
> as.character(dat$M)
# [1] "a" "ab" "ab" "b"
You can avoid this in the first place by using stringsAsFactors = FALSE when you read the data into R (this has been the default since R 4.0.0), or by using the colClasses argument of the read-in functions.
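As a small sketch of what is going on, assuming the M column arrived as a factor (the object names M and res, and building the one-row matrix with matrix(), are my own illustration of the asker's goal):

```r
# A factor stores integer codes plus a table of labels (levels)
M <- factor(c("a", "ab", "ab", "b"))

as.integer(M)     # the underlying codes: 1 2 2 3
as.character(M)   # the labels: "a" "ab" "ab" "b"

# Convert to character before putting the values into the matrix
res <- matrix(as.character(M), nrow = 1, dimnames = list("J0001", NULL))
res
```

Storing the factor directly is what produces 1 2 2 3; converting with as.character() first keeps the labels.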

match values in dataframes with values in a column

I have two data.frames that look like these:
>df1
V1
a
b
c
d
e
>df2
V1 V2
1 a,k,l
2 c,m,n
3 z,b,s
4 l,m,e
5 t,r,d
I would like to match the values in df1$V1 against those in df2$V2 and add a new column to df1 holding the value of df2$V1 from the matching row. The desired output would be:
>df1
V1 V2
a 1
b 3
c 2
d 5
e 4
I've tried this approach, but it only works if df2$V2 contains just one element:
match(as.character(df1[,1]), strsplit(as.character(df2[,2]), ",")) -> idx
df1$V2 <- df2[idx,1]
Many thanks
You can just use grep, which will return the position of the string found:
sapply(df1$V1, grep, x = df2$V2)
# a b c d e
# 1 3 2 5 4
If you expect repeats, you can use paste.
Let's modify your data so that there is a repeat:
df2$V2[3] <- "z,b,s,a"
And modify the solution accordingly:
sapply(df1$V1, function(z) paste(grep(z, x = df2$V2), collapse = ";"))
# a b c d e
# "1;3" "3" "2" "5" "4"
Similar to Tyler's answer, but in base using stack:
df.stack <- stack(setNames(strsplit(as.character(df2$V2), ","), df2$V1))
transform(df1, V2=df.stack$ind[match(V1, df.stack$values)])
produces:
V1 V2
1 a 1
2 b 3
3 c 2
4 d 5
5 e 4
One advantage of splitting over grep is that with grep you run the risk of searching for a and matching things like alabama (though you can mitigate this by being careful with the patterns, e.g. by including word boundaries).
Note this will only find the first matching value.
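To illustrate the partial-match risk, here is a sketch with made-up data in which one entry contains "alabama" (the data and the names tokens and hit are mine). It compares plain grep, an anchored pattern, and exact matching on the split tokens:

```r
df1 <- data.frame(V1 = c("a", "b"), stringsAsFactors = FALSE)
df2 <- data.frame(V1 = 1:2, V2 = c("alabama,k", "a,b"), stringsAsFactors = FALSE)

# Plain grep matches "a" inside "alabama" as well
grep("a", df2$V2)             # 1 2

# Anchoring on the comma separators avoids the partial match
grep("(^|,)a(,|$)", df2$V2)   # 2

# Exact comparison against the split tokens; [1] keeps the first hit only
tokens <- strsplit(df2$V2, ",", fixed = TRUE)
hit <- sapply(df1$V1, function(z) {
  which(vapply(tokens, function(tk) z %in% tk, logical(1)))[1]
})
df1$V2 <- df2$V1[hit]
df1
```

The anchored pattern is built by hand here; if the lookup values can contain regex metacharacters, the token comparison is the safer of the two.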
Here's an approach:
library(qdap)
key <- setNames(strsplit(as.character(df2$V2), ","), df2$V1)
df1$V2 <- as.numeric(df1$V1 %l% key)
df1
## V1 V2
## 1 a 1
## 2 b 3
## 3 c 2
## 4 d 5
## 5 e 4
First we used strsplit to create a named list. Then we used qdap's lookup operator %l% to match values and create a new column (I converted to numeric though this may not be necessary).

How to disable R from changing the "-" character into '.' character when writing to a file?

I want to write the results of a table into a file and names of my columns contain the '-' character, but when I write it to a file it replaces all the '-' by a '.'.
function(Result,Path)
{
Num<-0
for(i in 1:length(Result[,1]))
{
if(length(which(Result[i,]>0))>0)
{
temp<-which(Result[i,]>0)
val<-paste(names(temp),sep="\n")
write(val,file=paste(Path,"/Result",Num,".txt",sep=""))
Num<-Num+1
}
}
}
Does anyone know how to disable this behavior?
My column names are protein names, and some of them are written like this: YER060W-A
Thank you in advance.
It takes special care to make such an entity. But then you can use the col.names argument and assign colnames(dfrm) to it.
> tst <- data.frame(`a-b`= letters[1:10], `1-2`=1:10, check.names=FALSE)
> colnames(tst)
[1] "a-b" "1-2"
> write.table(tst, file="tst.txt", col.names=colnames(tst) )
Pasted from my editor:
"a-b" "1-2"
"1" "a" 1
"2" "b" 2 snipped the rest...
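In my experience the renaming typically happens when the data frame is created or read rather than when it is written, because data.frame() and read.table() default to check.names = TRUE, which passes the names through make.names(). A round-trip sketch under that assumption (the temporary file and the names back and mangled are mine):

```r
# Build a data frame whose column names contain "-"
tst <- data.frame(`YER060W-A` = 1:3, `a-b` = letters[1:3], check.names = FALSE)

path <- tempfile(fileext = ".txt")
write.table(tst, file = path, row.names = FALSE)

# Reading with check.names = FALSE keeps the names as written
back <- read.table(path, header = TRUE, check.names = FALSE)
colnames(back)      # "YER060W-A" "a-b"

# The default check.names = TRUE is what turns "-" into "."
mangled <- read.table(path, header = TRUE)
colnames(mangled)   # "YER060W.A" "a.b"
```

So it is worth checking names(Result) right after the table is built, before blaming the write step.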
No success with write.table? I see that write() doesn't have the same repertoire of options as write.table. Let's make an array (rather than a dataframe) with dim names having "-"'s:
DCtest <- array(1:27, c(3,3,3))
dimnames(DCtest) <- list(dim1 =c(a="a-b",b="b-c",c="c%d"),
dim2=letters[4:6],
dim3= letters[7:9])
One quick way to get output is just capture.output():
capture.output(DCtest, file="testm.txt")
testm.txt now looks like this in an editor:
, , dim3 = g
dim2
dim1 d e f
a-b 1 4 7
b-c 2 5 8
c%d 3 6 9
, , dim3 = h
dim2
dim1 d e f
a-b 10 13 16
b-c 11 14 17
c%d 12 15 18
, , dim3 = i
dim2
dim1 d e f
a-b 19 22 25
b-c 20 23 26
c%d 21 24 27
Also remember that capture.output() has an append= parameter, in case you want to append successive slices of an array to the same file.
