Shuffle characters within a string in R

I would like to move one part within a string to the beginning of the string. Please see example below. Can this be done using regex?
in:
c("41_exo","47_exo","48_exo")
out:
c("Exo_41","Exo_47","Exo_48")

Yes, you can do this with regex.
vec <- c("41_exo","47_exo","48_exo")
# using base R
gsub("(.*)_(.*)", "\\2_\\1", vec)
#> [1] "exo_41" "exo_47" "exo_48"
# using stringr
stringr::str_replace_all(vec, "(.*)_(.*)", "\\2_\\1")
#> [1] "exo_41" "exo_47" "exo_48"
Created on 2018-07-08 by the reprex package (v0.2.0).
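Note that both calls above return lowercase "exo_41", while the requested output starts with a capital letter. A minimal sketch of one way to capitalize within the same regex, assuming base R with perl = TRUE (the three-group pattern is my own variation, not part of the answer above):

```r
vec <- c("41_exo", "47_exo", "48_exo")

# Capture the number, the first letter of the word, and the rest of the word.
# With perl = TRUE, \\U upper-cases what follows and \\E ends that conversion.
out <- sub("^(.*)_(.)(.*)$", "\\U\\2\\E\\3_\\1", vec, perl = TRUE)
out
#> [1] "Exo_41" "Exo_47" "Exo_48"
```

The \\U and \\E escapes are only honoured when perl = TRUE, so this would not work with the default regex engine.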

Or without regex:
sapply(
strsplit(vec, "_"),
function(x) {
paste0(toupper(substring(x[2], 1, 1)), substring(x[2], 2), "_", x[1])
}
)
[1] "Exo_41" "Exo_47" "Exo_48"

Convert character string to number in R

I am trying to convert a character string to a number in R. Example:
a=1; b=2
If my input is "abba", I want my output to be a+b+b+a = 1+2+2+1 = 6.
Here's my attempt so far:
str= "abba"
paste(unlist(strsplit(unlist(str_extract_all(str, "[aA-zZ]+")), split = "")),collapse="+")
[1] "a+b+b+a"
I don't know how to convert this to numeric since as.numeric() returns NA. Any help is appreciated!
Another option is to set the factor levels to the letters of the alphabet, so that each letter converts to its position (a = 1, b = 2, ...):
str= "abba"
sum(as.numeric(factor(unlist(strsplit(str, "")), levels = letters)))
#> [1] 6
Created on 2022-09-28 with reprex v2.0.2
You could use a data.frame to translate your letters to numbers, and match after using strsplit:
translation <- data.frame(letter = letters,
number = 1:26)
str <- "abba"
sum(match(strsplit(str, "")[[1]], translation$letter))
#> [1] 6
Another option uses chartr to map the letters to digits and then evaluates the resulting expression:
eval(parse(text = chartr('ab', '12', gsub("(?<=\\w)(?=\\w)", "+",
str, perl = TRUE))))
[1] 6
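If the values are not simply alphabet positions, a named vector can serve as the lookup table. This is only a sketch; the names values, chars and total are illustrative and do not come from the answers above:

```r
# Lookup table: each name is a letter, each element its numeric value.
values <- c(a = 1, b = 2)

# Split the input into single characters and index the table by name.
chars <- strsplit("abba", "")[[1]]
total <- sum(values[chars])
total
#> [1] 6
```

Any character missing from the lookup table would produce NA, so the sum would be NA as well unless na.rm = TRUE is passed.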

How can I remove certain characters from my rownames?

I want to remove part of the rownames in my data frame: the "." and every character after it. Does anyone know how?
head(rownames(data))
[1] "ENSG00000000003.15" "ENSG00000000005.6" "ENSG00000000419.13" "ENSG00000000457.14"
[5] "ENSG00000000460.17" "ENSG00000000938.13"
I want to change them to
[1] "ENSG00000000003" "ENSG00000000005" "ENSG00000000419" "ENSG00000000457"
[5] "ENSG00000000460" "ENSG00000000938"
Try this,
rownames(data) <- sub("\\..*", "", rownames(data))
You can try this (take the first piece of each split name, not just the first piece overall):
rownames(data) <- sapply(strsplit(rownames(data), split = ".", fixed = TRUE), `[`, 1)
The IDs appear to have a fixed length (15 characters in the examples shown). If that is true, then:
rownames(data) = substr(rownames(data),1,15)
A slightly more sophisticated method is this:
sub("([^.]+).*", "\\1", rownames(data))
Here, we define a capture group using a negated character class that matches any character except the period, and use the backreference \\1 in sub's replacement argument to keep just the captured run of characters before the first period.
Another option using str_split like this:
library(stringr)
rows <- c("ENSG00000000003.15", "ENSG00000000005.6", "ENSG00000000419.13", "ENSG00000000457.14", "ENSG00000000460.17", "ENSG00000000938.13")
str_split(rows, "\\.", simplify = TRUE)[, 1]
#> [1] "ENSG00000000003" "ENSG00000000005" "ENSG00000000419" "ENSG00000000457"
#> [5] "ENSG00000000460" "ENSG00000000938"
Created on 2022-07-29 by the reprex package (v2.0.1)
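Since row names must be unique, it may be worth checking for collisions before assigning the stripped names (for instance, if two versions of the same ID were present). A rough sketch on a made-up data frame; the count column and new_names are purely illustrative:

```r
# Hypothetical data frame with versioned IDs as row names.
data <- data.frame(
  count = c(10, 20, 30),
  row.names = c("ENSG00000000003.15", "ENSG00000000005.6", "ENSG00000000419.13")
)

# Drop the first "." and everything after it.
new_names <- sub("\\..*", "", rownames(data))

# Stop early if stripping the version suffix created duplicate names.
stopifnot(!anyDuplicated(new_names))

rownames(data) <- new_names
rownames(data)
#> [1] "ENSG00000000003" "ENSG00000000005" "ENSG00000000419"
```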

Count number of dots in character string with str_count?

I am trying to count the number of dots in a character string.
I have tried to use str_count but it gives me the number of letters of the string instead.
ex_str <- "This.is.a.string"
str_count(ex_str, '.')
nchar(ex_str)
. is a special regex symbol, so you need to escape it:
str_count(ex_str, '\\.')
# [1] 3
Using just base R you could do:
nchar(gsub("[^.]", "", ex_str))
Using stringi:
library(stringi)
stri_count_fixed(ex_str, '.')
Another base R solution could be:
length(grepRaw(".", ex_str, fixed = TRUE, all = TRUE))
[1] 3
You may also use the base function gregexpr:
sum(gregexpr(".", ex_str, fixed=TRUE)[[1]] > 0)
[1] 3
You can use stringr::str_count with a fixed(...) argument to avoid treating it as a regular expression:
str_count(ex_str, fixed('.'))
See the online R demo:
library(stringr)
ex_str <- "This.is.a.string"
str_count(ex_str, fixed('.'))
## => [1] 3
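For a whole vector of strings, the base R approach can be written so that strings without any dot come out as 0. A small sketch; the vector strings and the name dot_counts are made up for illustration:

```r
strings <- c("This.is.a.string", "nodots", "a..b")

# gregexpr() finds every literal "." (fixed = TRUE disables regex),
# regmatches() extracts the matches, and lengths() counts them per string.
dot_counts <- lengths(regmatches(strings, gregexpr(".", strings, fixed = TRUE)))
dot_counts
#> [1] 3 0 2
```

Going through regmatches avoids having to special-case the -1 that gregexpr returns when nothing matches.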

gsub / sub to extract between certain characters

How can I extract the numbers / ID from the following string in R?
link <- "D:/temp/sample_data/0000098618-13-000011.htm"
I want to just extract 0000098618-13-000011
That is discard the .htm and the D:/temp/sample_data/.
I have tried grep and gsub without much luck.
1) basename. Use basename followed by sub:
sub("\\..*", "", basename(link))
## [1] "0000098618-13-000011"
2) file_path_sans_ext. Apply it to basename(link), since on its own it keeps the directory part:
library(tools)
file_path_sans_ext(basename(link))
## [1] "0000098618-13-000011"
3) sub
sub(".*/(.*)\\..*", "\\1", link)
## [1] "0000098618-13-000011"
4) gsub
gsub(".*/|\\.[^.]*$", "", link)
## [1] "0000098618-13-000011"
5) strsplit
sapply(strsplit(link, "[/.]"), function(x) tail(x, 2)[1])
## [1] "0000098618-13-000011"
6) read.table. If link is a vector, this will only work if all elements have the same number of /-separated components. It also assumes that the only dot is the one separating the extension.
DF <- read.table(text = link, sep = "/", comment = ".", as.is = TRUE)
DF[[ncol(DF)]]
## [1] "0000098618-13-000011"
Using stringr:
library(stringr)
str_extract(link , "[0-9-]+")
# "0000098618-13-000011"
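The approaches above agree on this particular link but can differ when a file name contains more than one dot. A short comparison sketch; the second path is invented for illustration:

```r
links <- c("D:/temp/sample_data/0000098618-13-000011.htm",
           "D:/temp/sample_data/report.v2.htm")  # hypothetical second file

files <- basename(links)

# Removes only the final extension.
last_ext_removed <- tools::file_path_sans_ext(files)
last_ext_removed
#> [1] "0000098618-13-000011" "report.v2"

# Removes everything from the first dot onwards.
first_dot_removed <- sub("\\..*", "", files)
first_dot_removed
#> [1] "0000098618-13-000011" "report"
```

Which behaviour is wanted depends on whether dots can appear inside the ID itself.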

Use substr until condition is met

I have a vector from which I just need the first word of each element. The words have different lengths and are separated by a symbol (. or _). How can I use the substr() function to get a new vector with just the first word?
I was thinking of something like this
x <- c("wooombel.ab","mugran.cd","friendly_ef.ab","hungry_kd.xy")
y <- substr(x,0, ???)
I think sub with some regular expressions would be the easiest solution:
sub(pattern = "[._].*", replacement = "", x = x)
# [1] "wooombel" "mugran" "friendly" "hungry"
Try:
sapply(strsplit(x,'[._]'), function(x) x[1])
[1] "wooombel" "mugran" "friendly" "hungry"
You could also use package stringr. It has some really handy functions for string manipulation.
One that comes to mind for this problem is word. It has a sep argument that allows the use of a regular expression.
> x <- c("wooombel.ab","mugran.cd","friendly_ef.ab","hungry_kd.xy")
> library(stringr)
> word(x, sep = "[._]")
# [1] "wooombel" "mugran" "friendly" "hungry"
Another option that lets you keep using substr is str_locate. It returns a matrix of start and end positions, so subtracting 1 from the start column gives the position of the last character of each first word.
> substr(x, 1, str_locate(x, "[._]")[, "start"] - 1)
# [1] "wooombel" "mugran" "friendly" "hungry"
An extraction approach with stringi:
library(stringi)
stri_extract_first_regex(x, "[a-z]+(?=[._])")
## [1] "wooombel" "mugran" "friendly" "hungry"
Though "^[a-z]+(?=[._])" may be more explicit.
Regex explanation:
^        the beginning of the string
[a-z]+   any character of: 'a' to 'z' (1 or more times)
(?=      look ahead to see if there is:
[._]     any character of: '.', '_'
)        end of look-ahead
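A base R extraction along the same lines, which also copes with elements that contain no separator at all, might look like the sketch below. The extra element "plain" and the name first_word are added purely for illustration:

```r
x <- c("wooombel.ab", "mugran.cd", "friendly_ef.ab", "hungry_kd.xy", "plain")

# Match the leading run of characters that are neither "." nor "_".
first_word <- regmatches(x, regexpr("^[^._]+", x))
first_word
#> [1] "wooombel" "mugran"   "friendly" "hungry"   "plain"
```

One caveat: regmatches drops elements with no match, so a string that starts with a separator would disappear from the result rather than come back as an empty string.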