how to split a sequence in R into multiple sub parts [duplicate] - r

This question already has answers here:
Chopping a string into a vector of fixed width character elements
(13 answers)
Closed 6 years ago.
Given seq = "GAGTAGGAGGAG", how can I split this sequence into the subsequences "GAG", "TAG", "GAG", "GAG", i.e. how can I split the sequence into groups of three characters?

We can create a function called fixed_split that splits a character string into equal-width parts. The regular expression is a lookbehind that matches the position immediately preceded by n characters:
fixed_split <- function(text, n) {
strsplit(text, paste0("(?<=.{",n,"})"), perl=TRUE)
}
fixed_split("GAGTAGGAGGAG", 3)
[[1]]
[1] "GAG" "TAG" "GAG" "GAG"
Edit
In your comment you say that sequence = "ATGATGATG" does not work, but the same pattern splits it as expected:
strsplit(sequence,"(?<=.{3})", perl=TRUE)
[[1]]
[1] "ATG" "ATG" "ATG"
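As an alternative that avoids regular expressions, the same split can be sketched with substring(). The helper name fixed_split_substring is hypothetical and not part of the answer above, and the sketch assumes a single non-empty input string. Unlike strsplit, it returns a plain character vector rather than a list.

```r
# Hypothetical helper (not from the answer above): split `text` into
# chunks of `n` characters using substring() instead of a regex.
# Assumes `text` is a single non-empty string.
fixed_split_substring <- function(text, n) {
  starts <- seq(1, nchar(text), by = n)    # start index of each chunk
  substring(text, starts, starts + n - 1)  # a trailing chunk may be shorter
}

fixed_split_substring("GAGTAGGAGGAG", 3)
# [1] "GAG" "TAG" "GAG" "GAG"
```

If the length of the string is not a multiple of n, the last element simply holds the leftover characters.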

Related

How to get the most frequent character within a character string? [duplicate]

This question already has answers here:
Finding the most repeated character in a string in R
(2 answers)
Closed 1 year ago.
Suppose the next character string:
test_string <- "A A B B C C C H I"
Is there any way to extract the most frequent value within test_string?
Something like:
extract_most_frequent_character(test_string)
Output:
#C
We can use scan to read the string as a vector of individual elements split at whitespace, get the frequency counts with table, find the index of the maximum count with which.max, and take its name:
extract_most_frequent_character <- function(x) {
names(which.max(table(scan(text = x, what = '', quiet = TRUE))))
}
-testing
extract_most_frequent_character(test_string)
[1] "C"
Or with strsplit
extract_most_frequent_character <- function(x) {
names(which.max(table(unlist(strsplit(x, "\\s+")))))
}
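Note that which.max returns only the first maximum, so ties are silently resolved in favour of the alphabetically first value. A minimal sketch of a variant that returns every tied value is shown below; the function name most_frequent_all is hypothetical, and it assumes the string has no leading whitespace.

```r
# Hypothetical variant: return all values that share the highest count.
# Assumes `x` is a single string of whitespace-separated tokens with no
# leading whitespace.
most_frequent_all <- function(x) {
  counts <- table(strsplit(x, "\\s+")[[1]])       # frequency of each token
  names(counts)[as.vector(counts == max(counts))] # keep every token at the max
}

most_frequent_all("A A B B C C C H I")
# [1] "C"
most_frequent_all("A A B B")
# [1] "A" "B"
```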
Here is another base R option (not as elegant as @akrun's answer)
> intToUtf8(names(which.max(table(utf8ToInt(gsub("\\s", "", test_string))))))
[1] "C"
One possibility involving stringr could be:
names(which.max(table(str_extract_all(test_string, "[A-Z]", simplify = TRUE))))
[1] "C"
Or marginally shorter:
names(which.max(table(str_extract_all(test_string, "[A-Z]")[[1]])))
Here is a solution using the stringr package, table and which.max:
library(stringr)
test_string <- str_split(test_string, " ")
test_string <- table(test_string)
names(test_string)[which.max(test_string)]
[1] "C"

Move characters from beginning of column name to end of column name (additional question) [duplicate]

This question already has an answer here:
regular expression match digit and characters
(1 answer)
Closed 2 years ago.
I need an additional solution to the previous question/answer
Move characters from beginning of column name to end of column name
I have a dataset where column names have two parts divided by _ e.g.
pal036a_lon
pal036a_lat
pal036a_elevation
I would like to convert the prefixes into suffixes so that it becomes:
lon_pal036a
lat_pal036a
elevation_pal036a
The answer to the previous question
names(df) <- sub("([a-z])_([a-z]+)", "\\2_\\1", names(df))
does not work for numbers within the prefixes.
Assuming your names have a single _, you could also use strsplit():
sapply(strsplit(names(df), '_'), function(x) paste(rev(x), collapse = '_'))
If the names contain more than one _, you could modify the above as suggested by jay.sf, so that only the last part moves to the front:
sapply(strsplit(x, "_"), function(x) paste(c(x[length(x)], x[-length(x)]), collapse="_"))
You can include alphanumeric characters in the first group:
names(df) <- sub("([a-z0-9]+)_([a-z]+)", "\\2_\\1", names(df))
For example :
x <- c("pal036a_lon","pal036a_lat","pal036a_elevation")
sub("([a-z0-9]+)_([a-z]+)", "\\2_\\1",x)
#[1] "lon_pal036a" "lat_pal036a" "elevation_pal036a"
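If the prefix may also contain uppercase letters or other characters, a more permissive pattern can be sketched by matching "anything up to the first underscore" instead of listing character classes. This is an assumption about the data rather than part of the answers above: it treats everything before the first _ as the prefix.

```r
# Sketch: treat everything before the first "_" as the prefix,
# whatever characters it contains, and move it to the end.
x <- c("pal036a_lon", "pal036a_lat", "pal036a_elevation")
moved <- sub("^([^_]+)_(.+)$", "\\2_\\1", x)
moved
# [1] "lon_pal036a"       "lat_pal036a"       "elevation_pal036a"

# With a data frame this would be: names(df) <- sub("^([^_]+)_(.+)$", "\\2_\\1", names(df))
```

Names without any underscore do not match the pattern and are returned unchanged.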

Match all elements with punctuation mark except asterisk in r [duplicate]

This question already has answers here:
in R, use gsub to remove all punctuation except period
(4 answers)
Closed 2 years ago.
I have a vector vec whose elements may contain punctuation marks. I want to return all elements that contain a punctuation mark, except those containing an asterisk.
vec <- c("a,","abc","ef","abc-","abc|","abc*01")
> vec[grepl("[^*][[:punct:]]", vec)]
[1] "a," "abc-" "abc|" "abc*01"
Why does it return "abc*01" if the pattern contains the negation [^*]?
Maybe you can try grep like below
grep("\\*",grep("[[:punct:]]",vec,value = TRUE), value = TRUE,invert = TRUE) # nested `grep`s for double filtering
or
grep("[^\\*[:^punct:]]",vec,perl = TRUE, value = TRUE) # but this will fail for case `abc*01|` (thanks for feedback from @Tim Biegeleisen)
which gives
[1] "a," "abc-" "abc|"
You could use grepl here:
vec <- c("a,","abc-","abc|","abc*01")
vec[grepl("^(?!.*\\*).*[[:punct:]].*$", vec, perl=TRUE)]
[1] "a," "abc-" "abc|"
The regex pattern ^(?!.*\\*).*[[:punct:]].*$ matches only strings that contain no asterisk and at least one punctuation character:
^ from the start of the string
(?!.*\*) assert that no * occurs anywhere in the string
.* match any content
[[:punct:]] match any single punctuation character (but not *)
.* match any content
$ end of the string
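As for why the original attempt returned "abc*01": the pattern [^*][[:punct:]] matches any non-asterisk character followed by a punctuation character, and * is itself punctuation, so the pair "c*" satisfies it. The sketch below shows that match, followed by a version of the filter that combines two plain grepl calls instead of a lookahead. The variable names has_punct and has_asterisk are illustrative only.

```r
vec <- c("a,", "abc", "ef", "abc-", "abc|", "abc*01")

# What the original pattern actually matches inside "abc*01"
first_match <- regmatches("abc*01", regexpr("[^*][[:punct:]]", "abc*01"))
first_match
# [1] "c*"

# Two simple conditions combined: has punctuation AND has no asterisk
has_punct    <- grepl("[[:punct:]]", vec)
has_asterisk <- grepl("*", vec, fixed = TRUE)  # fixed = TRUE: literal "*"
vec[has_punct & !has_asterisk]
# [1] "a,"   "abc-" "abc|"
```

Because the asterisk test looks at the whole string, an element such as "abc*01|" is excluded as well.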

Delete everything after second comma from string [duplicate]

This question already has answers here:
How to delete everything after nth delimiter in R?
(2 answers)
Closed 3 years ago.
I would like to remove everything after the second comma in a string, including the second comma itself. Here is an example:
x <- 'Day, Bobby, Jean, Gav'
gsub("(.*),.*", "\\1", x)
and it gives:
[1] "Day, Bobby, Jean"
while I want:
[1] "Day, Bobby"
regardless of the number of names that may exist in x
Use
> x <- 'Day, Bobby, Jean, Gav'
> sub("^([^,]*,[^,]*),.*", "\\1", x)
[1] "Day, Bobby"
The ^([^,]*,[^,]*),.* pattern matches
^ - start of string
([^,]*,[^,]*) - Group 1: 0+ non-commas, a comma, and 0+ non-commas
,.* - a comma and the rest of the string.
The \1 in the replacement pattern will keep Group 1 value in the result.
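The same idea extends to the nth comma by repeating the ",[^,]*" part n - 1 times. The helper below is a sketch under that assumption; the name keep_first_n_fields is hypothetical, and it uses perl = TRUE for the non-capturing group.

```r
# Hypothetical helper: keep everything before the n-th comma.
# The pattern repeats ",<non-commas>" (n - 1) times inside Group 1.
keep_first_n_fields <- function(x, n) {
  pattern <- sprintf("^([^,]*(?:,[^,]*){%d}),.*$", n - 1)
  sub(pattern, "\\1", x, perl = TRUE)
}

x <- 'Day, Bobby, Jean, Gav'
keep_first_n_fields(x, 2)
# [1] "Day, Bobby"
keep_first_n_fields(x, 3)
# [1] "Day, Bobby, Jean"
```

Strings with fewer than n commas do not match the pattern and come back unchanged.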
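We can also use strsplit and then paste the first two pieces back together (splitting on ",\\s*" so that it works with or without a space after each comma):
toString(head(strsplit(x, ",\\s*")[[1]], 2))
#[1] "Day, Bobby"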

"sapply(name, "[", 2)", what's the "[" means? [duplicate]

This question already has answers here:
The difference between bracket [ ] and double bracket [[ ]] for accessing the elements of a list or dataframe
(11 answers)
Using '[' square bracket as a function for lapply in R
(2 answers)
Closed 3 years ago.
name
[[1]]
[1] "John" "Davis"
[[2]]
[1] "Angela" "Williams"
[[3]]
[1] "Bullwinkle" "Moose"
The data is as above, I want to take last and first name from the list. The code is:
lastname <- sapply(name, "[", 2)
My question: what does the [ mean?
It is the extraction operator (see ?Extract). Here, sapply applies it to each element of the list, extracting the 2nd element of each one.
sapply(name, `[`, 2)
In the OP's post, the list elements are vectors, so `[` extracts the 2nd element of each vector and sapply simplifies the results into a single character vector.
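To make the equivalence concrete, the sketch below rebuilds the list from the question and compares the "[" form with an explicit anonymous function. The names lastname2 and firstname are illustrative.

```r
name <- list(c("John", "Davis"),
             c("Angela", "Williams"),
             c("Bullwinkle", "Moose"))

# "[" is an ordinary function, so x[2] can be written as `[`(x, 2)
`[`(c("John", "Davis"), 2)
# [1] "Davis"

lastname  <- sapply(name, "[", 2)             # extra argument 2 is passed to `[`
lastname2 <- sapply(name, function(x) x[2])   # the same thing, spelled out
identical(lastname, lastname2)
# [1] TRUE

firstname <- sapply(name, "[", 1)
firstname
# [1] "John"       "Angela"     "Bullwinkle"
```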
