I am trying to convert a character string to a number in R. Example:
a=1; b=2
If my input is "abba", I want my output to be a+b+b+a = 1+2+2+1 = 6.
Here's my attempt so far:
str= "abba"
paste(unlist(strsplit(unlist(str_extract_all(str, "[aA-zZ]+")), split = "")),collapse="+")
[1] "a+b+b+a"
I don't know how to convert this to numeric since as.numeric() returns NA. Any help is appreciated!
Another option setting factor levels for the letters like this:
str= "abba"
sum(as.numeric(factor(unlist(strsplit(str, "")), levels = letters)))
#> [1] 6
Created on 2022-09-28 with reprex v2.0.2
You could use a data.frame to translate your letters to numbers, and match after using strsplit:
translation <- data.frame(letter = letters,
number = 1:26)
str <- "abba"
sum(match(strsplit(str, "")[[1]], translation$letter))
#> [1] 6
An option with chartr
eval(parse(text = chartr('ab', '12', gsub("(?<=\\w)(?=\\w)", "+",
str, perl = TRUE))))
[1] 6
Related
I have a tibble and the vectors within the tibble are character strings with a mix of English and Mandarin characters. I want to split the tibble into two, with one column returning the English, the other column returning the Mandarin. However, I had to resort to the following code in order to accomplish this:
tb <- tibble(x = c("I我", "love愛", "you你")) #create tibble
en <- str_split(tb[[1]], "[^A-Za-z]+", simplify = T) #split string when R reads a character that is not a-z
ch <- str_split(tb[[1]], "[A-Za-z]+", simplify = T) #split string after R reads all the a-z characters
tb <- tb %>%
mutate(EN = en[,1],
CH = ch[,2]) %>%
select(-x)#subset the matrices created above, because the matrices create a column of blank/"" values and also remove x column
tb
I'm guessing there's something wrong with my RegEx that's causing this to occur. Ideally, I would like to write one str_split line that would return both of the columns.
We can use strsplit from base R
do.call(rbind, strsplit(tb$x, "(?<=[A-Za-z])(?=[^A-Za-z])", perl = TRUE))
Or we can use
library(stringr)
tb$en <- str_extract(tb$x,"[[:alpha:]]+")
tb$ch <- str_extract(tb$x,"[^[:alpha:]]+")
We can use str_match and get data for English and rest of the characters separately.
stringr::str_match(tb$x, "([A-Za-z]+)(.*)")[, -1]
# [,1] [,2]
#[1,] "I" "我"
#[2,] "love" "愛"
#[3,] "you" "你"
A simple solution using str_extract from package stringr:
library(stringr)
tb$en <- str_extract(tb$x,"[A-z]+")
tb$ch <- str_extract(tb$x,"[^A-z]")
In case there's more than one Chinese character, just add +to [^A-z].
Alternatively, use gsuband backreference:
tb$en <- gsub("(\\w+).$", "\\1", tb$x)
tb$ch <- gsub("\\w+(.$)", "\\1", tb$x)
Yet another solution macthes unicode characters with [ -~]+ and excludes them with [^ -~]+:
tb$en <- str_extract(tb$x, "[ -~]+")
tb$ch <- str_extract(tb$x, "[^ -~]+")
Result:
tb
# A tibble: 3 x 3
x en ch
<chr> <chr> <chr>
1 I我 I 我
2 love愛 love 愛
3 you你 you 你
I am trying to count the number of dots in a character string.
I have tried to use str_count but it gives me the number of letters of the string instead.
ex_str <- "This.is.a.string"
str_count(ex_str, '.')
nchar(ex_str)
. is a special regex symbol, so you need to escape it:
str_count(ex_str, '\\.')
# [1] 3
Using just base R you could do:
nchar(gsub("[^.]", "", ex_str))
Using stringi:
stri_count_fixed(ex_str, '.')
Another base R solution could be:
length(grepRaw(".", ex_str, fixed = TRUE, all = TRUE))
[1] 3
You may also use the base function gregexpr:
sum(gregexpr(".", ex_str, fixed=TRUE)[[1]] > 0)
[1] 3
You can use stringr::str_count with a fixed(...) argument to avoid treating it as a regular expression:
str_count(ex_str, fixed('.'))
See the online R demo:
library(stringr)
ex_str <- "This.is.a.string"
str_count(ex_str, fixed('.'))
## => [1] 3
I have the following strings:
strings <- c("ABBSDGNHNGA", "AABSDGDRY", "AGNAFG", "GGGDSRTYHG")
I want to cut off the string, as soon as the number of occurances of A, G and N reach a certain value, say 3. In that case, the result should be:
some_function(strings)
c("ABBSDGN", "AABSDG", "AGN", "GGG")
I tried to use the stringi, stringr and regex expressions but I can't figure it out.
You can accomplish your task with a simple call to str_extract from the stringr package:
library(stringr)
strings <- c("ABBSDGNHNGA", "AABSDGDRY", "AGNAFG", "GGGDSRTYHG")
str_extract(strings, '([^AGN]*[AGN]){3}')
# [1] "ABBSDGN" "AABSDG" "AGN" "GGG"
The [^AGN]*[AGN] portion of the regex pattern says to look for zero or more consecutive characters that are not A, G, or N, followed by one instance of A, G, or N. The additional wrapping with parenthesis and braces, like this ([^AGN]*[AGN]){3}, means look for that pattern three times consecutively. You can change the number of occurrences of A, G, N, that you are looking for by changing the integer in the curly braces:
str_extract(strings, '([^AGN]*[AGN]){4}')
# [1] "ABBSDGNHN" NA "AGNA" "GGGDSRTYHG"
There are a couple ways to accomplish your task using base R functions. One is to use regexpr followed by regmatches:
m <- regexpr('([^AGN]*[AGN]){3}', strings)
regmatches(strings, m)
# [1] "ABBSDGN" "AABSDG" "AGN" "GGG"
Alternatively, you can use sub:
sub('(([^AGN]*[AGN]){3}).*', '\\1', strings)
# [1] "ABBSDGN" "AABSDG" "AGN" "GGG"
Here is a base R option using strsplit
sapply(strsplit(strings, ""), function(x)
paste(x[1:which.max(cumsum(x %in% c("A", "G", "N")) == 3)], collapse = ""))
#[1] "ABBSDGN" "AABSDG" "AGN" "GGG"
Or in the tidyverse
library(tidyverse)
map_chr(str_split(strings, ""),
~str_c(.x[1:which.max(cumsum(.x %in% c("A", "G", "N")) == 3)], collapse = ""))
Identify positions of pattern using gregexpr then extract n-th position (3) and substring everything from 1 to this n-th position using subset.
nChars <- 3
pattern <- "A|G|N"
# Using sapply to iterate over strings vector
sapply(strings, function(x) substr(x, 1, gregexpr(pattern, x)[[1]][nChars]))
PS:
If there's a string that doesn't have 3 matches it will generate NA, so you just need to use na.omit on the final result.
This is just a version without strsplit to Maurits Evers neat solution.
sapply(strings,
function(x) {
raw <- rawToChar(charToRaw(x), multiple = TRUE)
idx <- which.max(cumsum(raw %in% c("A", "G", "N")) == 3)
paste(raw[1:idx], collapse = "")
})
## ABBSDGNHNGA AABSDGDRY AGNAFG GGGDSRTYHG
## "ABBSDGN" "AABSDG" "AGN" "GGG"
Or, slightly different, without strsplit and paste:
test <- charToRaw("AGN")
sapply(strings,
function(x) {
raw <- charToRaw(x)
idx <- which.max(cumsum(raw %in% test) == 3)
rawToChar(raw[1:idx])
})
Interesting problem. I created a function (see below) that solves your problem. It's assumed that there are just letters and no special characters in any of your strings.
reduce_strings = function(str, chars, cnt){
# Replacing chars in str with "!"
chars = paste0(chars, collapse = "")
replacement = paste0(rep("!", nchar(chars)), collapse = "")
str_alias = chartr(chars, replacement, str)
# Obtain indices with ! for each string
idx = stringr::str_locate_all(pattern = '!', str_alias)
# Reduce each string in str
reduce = function(i) substr(str[i], start = 1, stop = idx[[i]][cnt, 1])
result = vapply(seq_along(str), reduce, "character")
return(result)
}
# Example call
str = c("ABBSDGNHNGA", "AABSDGDRY", "AGNAFG", "GGGDSRTYHG")
chars = c("A", "G", "N") # Characters that are counted
cnt = 3 # Count of the characters, at which the strings are cut off
reduce_strings(str, chars, cnt) # "ABBSDGN" "AABSDG" "AGN" "GGG"
I have a column in a data.table full of strings in the format string+integer. e.g.
string1, string2, string3, string4, string5,
When I use sort(), I put these strings in the wrong order.
string1, string10, string11, string12, string13, ..., string2, string20,
string21, string22, string23, ....
How would I sort these to be in the order
string01, string02, string03, string04, strin0g5, ... , string10,, string11,
string12, etc.
One method could be to add a 0 to each integer <10, 1-9?
I suspect you would extract the string with str_extract(dt$string_column, "[a-z]+") and then add a 0 to each single-digit integer...somehow with sprintf()
We can remove the characters that are not numbers to do the sorting
dt[order(as.integer(gsub("\\D+", "", col1)))]
You could go for mixedsort in gtools:
vec <- c("string1", "string10", "string11", "string12", "string13","string2",
"string20", "string21", "string22", "string23")
library(gtools)
mixedsort(vec)
#[1] "string1" "string2" "string10" "string11" "string12" "string13"
# "string20" "string21" "string22" "string23"
You could use the str_extract of stringr package to obtain the digits and order according to that
x = c("string1","string3","stringZ","string2","stringX","string10")
library(stringr)
c(x[grepl("\\d+",x)][order(as.integer(str_extract(x[grepl("\\d+",x)],"\\d+")))],
sort(x[!grepl("\\d+",x)]))
#[1] "string1" "string2" "string3" "string10" "stringX" "stringZ"
Assuming the string is something like below:
library(data.table)
library(stringr)
xstring <- data.table(x = c("string1","string11","string2",'string10',"stringx"))
extracts <- str_extract(xstring$x,"(?<=string)(\\d*)")
y_string <- ifelse(nchar(extracts)==2 | extracts=="",extracts,paste0("0",extracts))
fin_string <- str_replace(xstring$x,"(?<=string)(\\d*)",y_string)
sort(fin_string)
Output:
> sort(fin_string)
[1] "string01" "string02" "string10" "string11"
[5] "stringx"
I have a list:
a <- ["12file.txt", "8file.txt", "66file.txt"]
I would like to sort by number:
a would be: ["8file.txt", "12file.txt", "66file.txt"]
Now I could get only this:
a = ["12file.txt", "66file.txt", "8file.txt"]
Thanks
I'm assuming you have a character vector:
a <- c("12file.txt", "8file.txt", "66file.txt")
I would approach this by pulling out the number at the start of each string and sorting on that:
num <- as.numeric(sub("([0-9]+).*", "\\1", a))
a[order(num)]
#[1] "8file.txt" "12file.txt" "66file.txt"
You could also pad your strings with spaces by setting a field length to sprintf to achieve the sorting you want:
a[order(sprintf("%10s",a))]
[1] "8file.txt" "12file.txt" "66file.txt"
You can use str_sort(..., numeric = TRUE) function from stringr package:
library(stringr)
a <- c("12file.txt", "8file.txt", "66file.txt")
str_sort(a, numeric = TRUE)
#> [1] "8file.txt" "12file.txt" "66file.txt"