Why does my while loop get stuck? - Programming in R

I am trying to write a function that counts the characters between "a" "t" "g" and "t" "a" "g", "t" "g" "a", or "t" "a" "a" inside a vector, but my code gets stuck in the while loop. An example would be x = "a" "a" "a" "t" "a" "t" "g" "t" "c" "g" "t" "t" "t" "t" "a" "g". In this example the code should count 6 characters between "a" "t" "g" and "t" "a" "g". Any help will be appreciated :)
orfs <- function(x, p) {
  count <- 0
  cntorfs <- 0
  n <- length(x)
  v <- n - 2
  for (i in 1:v) {
    if (x[i] == "a" && x[i+1] == "t" && x[i+2] == "g") {
      k <- i + 3
      w <- x[k]
      y <- x[k+1]
      z <- x[k+2]
      while (((w != "t") && (y != "a") && (z != "g")) || ((w != "t") && (y != "a") && (z != "a")) || ((w != "t") && (y != "g") && (z != "a")) || (i + 2 > v)) {
        count <- count + 1
        k <- k + 1
        w <- x[k]
        y <- x[k+1]
        z <- x[k+2]
      }
    }
    if (count > p) {
      cntorfs <- cntorfs + 1
    }
    if (count != 0) {
      count <- 0
    }
  }
  cat("orf:", cntorfs)
}

This is a very inefficient and un-R-like way to count the number of characters between two patterns.
Here is an alternative using gsub that should get you started and can be extended to account for the other stop codons:
x <- c("a", "a", "a", "t", "a", "t", "g", "t", "c", "g", "t", "t", "t", "t", "a", "g")
nchar(gsub("[actg]*atg([actg]*)tag[actg]*", "\\1", paste0(x, collapse = "")))
#[1] 6
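To cover the other stop codons as well, one option is a lazy match from the first atg to the nearest of tag, tga or taa. This is only a sketch: the helper name count_between is made up for illustration, and like the gsub call above it ignores the reading frame.

```r
# Hypothetical helper: characters between the first "atg" and the nearest stop codon
count_between <- function(x) {
  s <- paste0(x, collapse = "")
  # "*?" is lazy, so the match ends at the first stop codon after "atg"
  hit <- regmatches(s, regexpr("atg[acgt]*?(tag|tga|taa)", s, perl = TRUE))
  if (length(hit) == 0) return(NA_integer_)  # no start/stop pair found
  nchar(hit) - 6L                            # drop the 3 + 3 flanking characters
}

x <- c("a", "a", "a", "t", "a", "t", "g", "t", "c", "g", "t", "t", "t", "t", "a", "g")
count_between(x)
#[1] 6
```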
A more robust and general approach makes use of Biostrings::matchPattern. I would strongly advise against reinventing the wheel here, and instead recommend using one of the standard Bioconductor packages that were developed for exactly this kind of task.
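As for why the original loop misbehaves, as far as I can tell there are three problems. The condition never checks that k stays inside the vector, so x[k] can run past the end and become NA. The term (i+2>v) cannot change inside the loop, so once it is TRUE the loop never ends. And "not a stop codon" is negated incorrectly: it should be the negation of the whole three-letter comparison. If you do want an explicit loop, a bounded version could look like the sketch below. The names count_orfs and stops are made up, and it assumes p is the number of characters an ORF must exceed, as in the question.

```r
# Sketch: count "atg ... stop" stretches with more than p characters in between
count_orfs <- function(x, p) {
  stops <- c("tag", "tga", "taa")
  n <- length(x)
  cntorfs <- 0
  if (n < 6) return(cntorfs)          # too short for a start plus a stop codon
  for (i in seq_len(n - 5)) {
    if (paste(x[i:(i + 2)], collapse = "") != "atg") next
    k <- i + 3
    # advance k until a stop codon starts at k or the vector ends
    while (k + 2 <= n && !(paste(x[k:(k + 2)], collapse = "") %in% stops)) {
      k <- k + 1
    }
    if (k + 2 <= n) {                 # a stop codon was actually found
      count <- k - (i + 3)            # characters between start and stop
      if (count > p) cntorfs <- cntorfs + 1
    }
  }
  cntorfs
}

x <- c("a", "a", "a", "t", "a", "t", "g", "t", "c", "g", "t", "t", "t", "t", "a", "g")
count_orfs(x, 5)
#[1] 1
```

The bounds check comes first in the while condition, so the loop stops before it ever indexes past the end of x.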

Related

Complementary sequence using gsub

I'm trying to make the complementary sequence of a DNA chain stored in a vector.
It should swap "A" with "T" and "C" with "G" (and vice versa). I need this applied to the first vector, with the complementary sequence printed correctly. This is what I tried before getting stuck:
pilot_sequence <- c("C","G","A","T","C","C","T","A","T")
complement_sequence_display <- function(pilot_sequence){
  complement_chain_Incom <- gsub("A", "T", pilot_sequence)
  complement_chain <- paste(complement_chain_Incom, collapse = "")
  cat("Complement sequence: ", complement_chain, "\n")
}
complement_chain_Incom <- gsub("A","T", pilot_sequence)
complement_chain <- paste(complement_chain_Incom, collapse= "")
complement_sequence_display(pilot_sequence)
The answer I got was CGTTCCTTT; only the second and penultimate T are correct. How do I fix the rest of the letters?
The pilot_sequence vector is of character type and the function runs without errors.
This is an ideal use case for the chartr function:
chartr("ATGC","TACG",pilot_sequence)
output:
[1] "G" "C" "T" "A" "G" "G" "A" "T" "A"
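Plugged back into the function from the question, that could look like the sketch below. Chaining gsub calls would not work here, because gsub("A", "T", ...) followed by gsub("T", "A", ...) turns every A and T into A; chartr translates all four letters in a single pass. Returning the string invisibly is my own addition so the result can be reused.

```r
pilot_sequence <- c("C", "G", "A", "T", "C", "C", "T", "A", "T")

complement_sequence_display <- function(pilot_sequence) {
  # chartr swaps all four bases in one pass, so no substitution overwrites another
  complement_chain <- paste(chartr("ATGC", "TACG", pilot_sequence), collapse = "")
  cat("Complement sequence: ", complement_chain, "\n")
  invisible(complement_chain)  # assumption: also return the string for later use
}

complement_sequence_display(pilot_sequence)  # prints the complement GCTAGGATA
```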
You can do this with purrr::map_chr and dplyr::case_when:
pilot_sequence |> purrr::map_chr(~ dplyr::case_when(
  .x == "T" ~ "A",
  .x == "G" ~ "C",
  .x == "A" ~ "T",
  .x == "C" ~ "G"
))
#> [1] "G" "C" "T" "A" "G" "G" "A" "T" "A"
You can use recode from dplyr
library(dplyr)
recode(pilot_sequence, "C" = "G", "G" = "C", "A" = "T", "T" = "A")
Or in base R, create a named vector, use match to find the position of each value in that vector, and then call names to get the corresponding names:
pilot_sequence <- c("C","G","A","T","C","C","T","A","T")
values = c("G" = "C", "C" = "G", "A" = "T", "T" = "A")
names(values[match(pilot_sequence, values)])
"G" "C" "T" "A" "G" "G" "A" "T" "A"

R - List manipulation element concatenation

Assume I have a list with 5 elements:
list <- list("A", "B", "C", "D", c("E", "F"))
I am trying to return this to a simple character vector using purrr with the need to combine list elements that have two strings into one, separated by a delimiter such as '-'. The output should look like this:
chr [1:5] "A" "B" "C" "D" "E-F"
I've tried a ton of approaches including paste, paste0 and str_c. Where I am getting hung up is that map seems to apply the function to each individual string of a list element rather than to the group of strings in that element (when there is more than one). The closest I've gotten is:
list2 <- unlist(map(list, str_flatten))
str(list2)
This returns:
chr [1:5] "A" "B" "C" "D" "EF"
where I need a hyphen between E and F:
chr [1:5] "A" "B" "C" "D" "E-F"
When I try to pass an argument to str_flatten(), such as str_flatten(list, collapse = "-"), it doesn't work. The big problem is that I can't figure out what to pass as an argument to str_flatten so that it joins the two strings of a given list element.
You almost had it. Try
library(purrr)
library(stringr)
unlist(map(lst, str_flatten, collapse = "-"))
#[1] "A" "B" "C" "D" "E-F"
You could also use map_chr
map_chr(lst, str_flatten, collapse = "-")
Without additional packages, and with thanks to @G.Grothendieck, you could do
sapply(lst, paste, collapse = "-")
data
lst <- list("A", "B", "C", "D", c("E", "F"))
We can also use map_chr and paste.
library(purrr)
lst <- list("A", "B", "C", "D", c("E", "F"))
map_chr(lst, ~paste(.x, collapse = "-"))
# [1] "A" "B" "C" "D" "E-F"

R - How to intersect() and include duplicates?

I have the following character fields I am trying to intersect. These should be equal.
> char.y[[892]]
[1] "E" "d" "w" "a" "r" "d" "s" " " "L" "i" "f" "e" "s" "c" "i" "e" "n" "c" "e" "s"
> char.x[[892]]
[1] "E" "d" "w" "a" "r" "d" "s" " " "L" "i" "f" "e" "s" "c" "i" "e" "n" "c" "e" "s"
> intersect(char.x[[892]], char.y[[892]])
[1] "E" "d" "w" "a" "r" "s" " " "L" "i" "f" "e" "c" "n"
>
expected result:
"E" "d" "w" "a" "r" "d" "s" " " "L" "i" "f" "e" "s" "c" "i" "e" "n" "c" "e"
Using intersect will return the common elements, but will not have them duplicated. For example, s is in there 3 times, but will be in the intersect only once.
If you want to see the same layout, with non intersect values removed, for example, you can use the following:
a <- c("E", "d", "w", "a", "r", "d", "s", " ", "L", "i", "f", "e", "s", "c", "i", "e", "n", "c", "e", "s")
b <- c("E", "d", "w", "a", "r", "d", "s", " ", "L", "i", "f", "e", "s", "c", "i", "e", "n", "c", "e", "s")
a[a %in% intersect(a, b)]
# [1] "E" "d" "w" "a" "r" "d" "s" " " "L" "i" "f" "e" "s" "c" "i" "e" "n" "c" "e" "s"
This would entirely depend on the vectors you are comparing (and which order) but would this be sufficient?
b <- a <- c('E', 'd', 'w', 'a', 'r', 'd', 's', '', 'L', 'i', 'f', 'e', 's', 'c', 'i', 'e', 'n', 'c', 'e')
c <- letters[sample(1:26,100, rep=T)]
a[is.element(a,b)]
# [1] "E" "d" "w" "a" "r" "d" "s" "" "L" "i" "f" "e" "s" "c" "i" "e" "n" "c" "e"
a[is.element(a,c)]
# [1] "d" "w" "a" "r" "d" "s" "i" "f" "e" "s" "c" "i" "e" "n" "c" "e"
I had the exact same problem and didn't find a solution, so I created my own little function "intersectdup":
intersectdup <- function(vektor1, vektor2) {
  result <- c()
  for (i in seq_along(vektor2)) {
    if (is.element(vektor2[i], vektor1)) {
      result <- c(result, vektor2[i])
      # remove the matched element so it cannot be matched a second time
      foundAt <- match(vektor2[i], vektor1)
      vektor1 <- vektor1[-foundAt]
    }
  }
  return(result)
}
Picking up on Clemens' answer, here is a simple function written in a C-like style:
intersectMe = function(x, y, duplicates=TRUE)
{
  xyi = intersect(x,y);
  if(!duplicates) { return (xyi); }
  res = c();
  for(xy in xyi)
  {
    y.xy = which(y == xy); ny.xy = length(y.xy);
    x.xy = which(x == xy); nx.xy = length(x.xy);
    min.xy = min(ny.xy, nx.xy);
    res = c(res, rep(xy, min.xy) );
  }
  res;
}
The vecsets package also helps (using the example created by Eric):
vecsets::vintersect(a, b)
[1] "E" "d" "d" "w" "a" "r" "s" "s" "s" " " "L" "i" "i" "f" "e" "e" "e" "c" "c" "n"
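If you would rather stay in base R and avoid an explicit loop, the same "keep each value as often as it occurs in both vectors" idea can be sketched with tabulate. The function name intersect_counts is made up for this example; the result is grouped by value (in order of first appearance in x), not in the original positions.

```r
# Hypothetical helper: multiset intersection, each value repeated min(count in x, count in y) times
intersect_counts <- function(x, y) {
  common <- intersect(x, y)
  # count how often each common value occurs in x and in y (non-matches give NA and are ignored)
  n_x <- tabulate(match(x, common), nbins = length(common))
  n_y <- tabulate(match(y, common), nbins = length(common))
  rep(common, times = pmin(n_x, n_y))
}

intersect_counts(c("a", "b", "a", "c"), c("a", "a", "a", "b"))
#[1] "a" "a" "b"
```

This differs from a[a %in% b], which keeps every duplicate in a no matter how often the value occurs in b.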

Replace text values in a vector [duplicate]

Here's my data :
dataset <- c("h", "H", "homme", "masculin", "f", "femme", "épouse")
How can I replace text values of the vector like :
"femme" -> "f"
"épouse" ->"f"
"Homme"-> "h"
"masculin" -> "h"
What I tried for "femme" -> "f"
test_out <- sapply(dataset, switch,
"f"="femme")
test_out
Expected result :
"h" "h" "h" "masculin" "f" "f" "f"
Try gsub with regular expressions:
dataset = gsub("^((?!h).*)$", "f", gsub("^((h|H|m).*)$", "h", dataset), perl=TRUE)
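A dictionary-style lookup with a named vector is another option and avoids having to encode the rules in a regular expression. This is a sketch under two assumptions of mine: matching should be case-insensitive (so "H" and "Homme" both map to "h"), and values without an entry in the lookup are kept as they are. The names lookup, key and recoded are illustrative.

```r
dataset <- c("h", "H", "homme", "masculin", "f", "femme", "épouse")

# Named vector used as a dictionary: names are the old values, elements the new ones
lookup <- c(homme = "h", masculin = "h", femme = "f", "épouse" = "f")

key <- tolower(dataset)                                   # assumption: ignore case
recoded <- ifelse(key %in% names(lookup), lookup[key], key)
recoded
#[1] "h" "h" "h" "h" "f" "f" "f"
```

Adding another mapping later only means adding one more entry to lookup.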

Histogram of dates with a specific attribute

I have some data:
> (dput(head(data$Date,10)))
c("18.12.2003", "06.04.2005", "06.04.2005", "07.04.2005", "27.05.2005",
"16.06.2009", "16.06.2009", "21.12.2009", "22.12.2009", "09.06.2011"
)
[1] "18.12.2003" "06.04.2005" "06.04.2005" "07.04.2005" "27.05.2005"
[6] "16.06.2009" "16.06.2009" "21.12.2009" "22.12.2009" "09.06.2011"
> (dput(head(data$Art,10)))
c("V", "K", "K", "K", "Zuteilung", "V", "K", "K", "K", "V")
[1] "V" "K" "K" "K" "Zuteilung" "V"
[7] "K" "K" "K" "V"
As you can see, every date has a string value.
I can count all the K values with:
> (length(grep("K", data$Art)))
I want to plot the frequency of K for each date.
With the following I can plot all dates, but it does not single out the K strings:
hist(as.Date(data$Date, '%d.%m.%Y'), breaks="days", freq=TRUE)
I really appreciate your answers!
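One possible approach, sketched on the ten rows shown above: keep only the dates whose Art is "K", count them per day with table, and draw the counts as a bar plot. The data frame is rebuilt from the dput output, and k_dates and k_counts are names chosen for this example. Note that data$Art == "K" matches exactly, whereas grep("K", ...) would also match any value that merely contains a K.

```r
data <- data.frame(
  Date = c("18.12.2003", "06.04.2005", "06.04.2005", "07.04.2005", "27.05.2005",
           "16.06.2009", "16.06.2009", "21.12.2009", "22.12.2009", "09.06.2011"),
  Art  = c("V", "K", "K", "K", "Zuteilung", "V", "K", "K", "K", "V"),
  stringsAsFactors = FALSE
)

# Dates of the K rows only, converted from dd.mm.yyyy strings
k_dates <- as.Date(data$Date[data$Art == "K"], format = "%d.%m.%Y")

# Number of K entries per date
k_counts <- table(k_dates)

barplot(k_counts, las = 2, ylab = "Number of K entries")
```

If gaps between dates should be visible on the axis, hist(k_dates, breaks = "days", freq = TRUE) on the filtered dates is an alternative.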
