How to get the most frequent character within a character string? [duplicate]

How to get the most frequent character within a character string? [duplicate] - r

This question already has answers here:
Finding the most repeated character in a string in R
(2 answers)
Closed 1 year ago.
Suppose the next character string:
test_string <- "A A B B C C C H I"
Is there any way to extract the most frequent value within test_string?
Something like:
extract_most_frequent_character(test_string)
Output:
#C

We can use scan to read the string as a vector of individual elements by splitting at the space, get the frequency count with table, return the named index that have the max count (which.count), get its name
extract_most_frequent_character <- function(x) {
names(which.max(table(scan(text = x, what = '', quiet = TRUE))))
}
-testing
extract_most_frequent_character(test_string)
[1] "C"
Or with strsplit
extract_most_frequent_character <- function(x) {
names(which.max(table(unlist(strsplit(x, "\\s+")))))
}

Here is another base R option (not as elegant as #akrun's answer)
> intToUtf8(names(which.max(table(utf8ToInt(gsub("\\s", "", test_string))))))
[1] "C"

One possibility involving stringr could be:
names(which.max(table(str_extract_all(test_string, "[A-Z]", simplify = TRUE))))
[1] "C"
Or marginally shorter:
names(which.max(table(str_extract_all(test_string, "[A-Z]")[[1]])))

Here is solution using stringr package, table and which:
library(stringr)
test_string <- str_split(test_string, " ")
test_string <- table(test_string)
names(test_string)[which.max(test_string)]
[1] "C"

Related

Extract last digit [duplicate]

How can I get the last n characters from a string in R?
Is there a function like SQL's RIGHT?

I'm not aware of anything in base R, but it's straight-forward to make a function to do this using substr and nchar:
x <- "some text in a string"
substrRight <- function(x, n){
substr(x, nchar(x)-n+1, nchar(x))
}
substrRight(x, 6)
[1] "string"
substrRight(x, 8)
[1] "a string"
This is vectorised, as #mdsumner points out. Consider:
x <- c("some text in a string", "I really need to learn how to count")
substrRight(x, 6)
[1] "string" " count"

If you don't mind using the stringr package, str_sub is handy because you can use negatives to count backward:
x <- "some text in a string"
str_sub(x,-6,-1)
[1] "string"
Or, as Max points out in a comment to this answer,
str_sub(x, start= -6)
[1] "string"

Use stri_sub function from stringi package.
To get substring from the end, use negative numbers.
Look below for the examples:
stri_sub("abcde",1,3)
[1] "abc"
stri_sub("abcde",1,1)
[1] "a"
stri_sub("abcde",-3,-1)
[1] "cde"
You can install this package from github: https://github.com/Rexamine/stringi
It is available on CRAN now, simply type
install.packages("stringi")
to install this package.

str = 'This is an example'
n = 7
result = substr(str,(nchar(str)+1)-n,nchar(str))
print(result)
> [1] "example"
>

Another reasonably straightforward way is to use regular expressions and sub:
sub('.*(?=.$)', '', string, perl=T)
So, "get rid of everything followed by one character". To grab more characters off the end, add however many dots in the lookahead assertion:
sub('.*(?=.{2}$)', '', string, perl=T)
where .{2} means .., or "any two characters", so meaning "get rid of everything followed by two characters".
sub('.*(?=.{3}$)', '', string, perl=T)
for three characters, etc. You can set the number of characters to grab with a variable, but you'll have to paste the variable value into the regular expression string:
n = 3
sub(paste('.+(?=.{', n, '})', sep=''), '', string, perl=T)

UPDATE: as noted by mdsumner, the original code is already vectorised because substr is. Should have been more careful.
And if you want a vectorised version (based on Andrie's code)
substrRight <- function(x, n){
sapply(x, function(xx)
substr(xx, (nchar(xx)-n+1), nchar(xx))
)
}
> substrRight(c("12345","ABCDE"),2)
12345 ABCDE
"45" "DE"
Note that I have changed (nchar(x)-n) to (nchar(x)-n+1) to get n characters.

A simple base R solution using the substring() function (who knew this function even existed?):
RIGHT = function(x,n){
substring(x,nchar(x)-n+1)
}
This takes advantage of basically being substr() underneath but has a default end value of 1,000,000.
Examples:
> RIGHT('Hello World!',2)
[1] "d!"
> RIGHT('Hello World!',8)
[1] "o World!"

Try this:
x <- "some text in a string"
n <- 5
substr(x, nchar(x)-n, nchar(x))
It shoudl give:
[1] "string"

An alternative to substr is to split the string into a list of single characters and process that:
N <- 2
sapply(strsplit(x, ""), function(x, n) paste(tail(x, n), collapse = ""), N)

I use substr too, but in a different way. I want to extract the last 6 characters of "Give me your food." Here are the steps:
(1) Split the characters
splits <- strsplit("Give me your food.", split = "")
(2) Extract the last 6 characters
tail(splits[[1]], n=6)
Output:
[1] " " "f" "o" "o" "d" "."
Each of the character can be accessed by splits[[1]][x], where x is 1 to 6.

someone before uses a similar solution to mine, but I find it easier to think as below:
> text<-"some text in a string" # we want to have only the last word "string" with 6 letter
> n<-5 #as the last character will be counted with nchar(), here we discount 1
> substr(x=text,start=nchar(text)-n,stop=nchar(text))
This will bring the last characters as desired.

For those coming from Microsoft Excel or Google Sheets, you would have seen functions like LEFT(), RIGHT(), and MID(). I have created a package known as forstringr and its development version is currently on Github.
if(!require("devtools")){
install.packages("devtools")
}
devtools::install_github("gbganalyst/forstringr")
library(forstringr)
the str_left(): This counts from the left and then extract n characters
the str_right()- This counts from the right and then extract n characters
the str_mid()- This extract characters from the middle
Examples:
x <- "some text in a string"
str_left(x, 4)
[1] "some"
str_right(x, 6)
[1] "string"
str_mid(x, 6, 4)
[1] "text"

I used the following code to get the last character of a string.
substr(output, nchar(stringOfInterest), nchar(stringOfInterest))
You can play with the nchar(stringOfInterest) to figure out how to get last few characters.

A little modification on #Andrie solution gives also the complement:
substrR <- function(x, n) {
if(n > 0) substr(x, (nchar(x)-n+1), nchar(x)) else substr(x, 1, (nchar(x)+n))
}
x <- "moSvmC20F.5.rda"
substrR(x,-4)
[1] "moSvmC20F.5"
That was what I was looking for. And it invites to the left side:
substrL <- function(x, n){
if(n > 0) substr(x, 1, n) else substr(x, -n+1, nchar(x))
}
substrL(substrR(x,-4),-2)
[1] "SvmC20F.5"

Just in case if a range of characters need to be picked:
# For example, to get the date part from the string
substrRightRange <- function(x, m, n){substr(x, nchar(x)-m+1, nchar(x)-m+n)}
value <- "REGNDATE:20170526RN"
substrRightRange(value, 10, 8)
[1] "20170526"

Unexpected regex behavior when executed in base R [duplicate]

This question already has answers here:
Extract a regular expression match
(12 answers)
Closed 1 year ago.
As per this link, I wrote a regex that does not give the expected result when executed for a specific string in R:
string <- "0,9% BB"
regex <- "^ ?\\d+[\\d ,\\.]*[B-DF-HJ-NP-TV-Z\\/]*%?"
grep(regex, string, value = T, perl = T)
The result output is
[1] "0,9% BB"
instead of the desired (and outputed by the link)
[1] "0,9%"
What am I missing to get the desired output? Preferably base R, please.

This returns "0,9%" using only base R
string <- "0,9% BB"
regex <- "^ ?\\d+[\\d ,\\.]*[B-DF-HJ-NP-TV-Z\\/]*%?"
regmatches(x = string, m = regexpr(regex,string,perl = TRUE))

How to obtain character at a specific place? [duplicate]

This question already has answers here:
str_extract: Extracting exactly nth word from a string
(5 answers)
Closed 3 years ago.
example:
"A.B.C.D"
"apple.good.sad.sea"
"X1.AN2.ED3.LK8"
What I need is to obtain the string specifically between the second dot and the third dot.
result:
"C"
"sad"
"ED3"
How can I do this?

You can use base::strsplit, loop thr the elements to get the 3rd one
v <- c("A.B.C.D", "apple.good.sad.sea", "X1.AN2.ED3.LK8")
sapply(strsplit(v, "\\."), `[[`, 3L)
output:
[1] "C" "sad" "ED3"

You can use unlist(strsplit(str,split = "."))[3] to get the third sub-string, where the original string is split by "." when you apply strsplit

I'd use
sub("^([^.]*\\.){2}([^.]*)\\..*", "\\2", x)
# [1] "C" "sad" "ED3"

Using regex in gsub.
v <- c("A.B.C.D", "apple.good.sad.sea", "X1.AN2.ED3.LK8", "A.B.C.D.E")
gsub("(.*?\\.){2}(.*?)(\\..*)", "\\2", v)
# [1] "C" "sad" "ED3" "C"

Finding number of r's in the vector (Both R and r) before the first u

rquote <- "R's internals are irrefutably intriguing"
chars <- strsplit(rquote, split = "")[[1]]
in the above code we need to find the number of r's(R and r) in rquote

You could use substrings.
## find position of first 'u'
u1 <- regexpr("u", rquote, fixed = TRUE)
## get count of all 'r' or 'R' before 'u1'
lengths(gregexpr("r", substr(rquote, 1, u1), ignore.case = TRUE))
# [1] 5
This follows what you ask for in the title of the post. If you want the count of all the "r", case insensitive, then simplify the above to
lengths(gregexpr("r", rquote, ignore.case = TRUE))
# [1] 6
Then there's always stringi
library(stringi)
## count before first 'u'
stri_count_regex(stri_sub(rquote, 1, stri_locate_first_regex(rquote, "u")[,1]), "r|R")
# [1] 5
## count all R or r
stri_count_regex(rquote, "r|R")
# [1] 6

To get the number of R's before the first u, you need to make an intermediate step. (You probably don't need to. I'm sure akrun knows some incredibly cool regular expression to get the job done, but it won't be as easy to understand as this).
rquote <- "R's internals are irrefutably intriguing"
before_u <- gsub("u[[:print:]]+$", "", rquote)
length(stringr::str_extract_all(before_u, "(R|r)")[[1]])

You may try this,
> length(str_extract_all(rquote, '[Rr]')[[1]])
[1] 6
To get the count of all r's before the first u
> length(str_extract_all(rquote, perl('u.*(*SKIP)(*F)|[Rr]'))[[1]])
[1] 5

EDIT: Just saw before the first u. In that case, we can get the position of the first 'u' from either which or match.
Then use grepl in the 'chars' up to the position (ind) to find the logical index of 'R' with ignore.case=TRUE and use sum using the strsplit output from the OP's code.
ind <- which(chars=='u')[1]
Or
ind <- match('u', chars)
sum(grepl('r', chars[seq(ind)], ignore.case=TRUE))
#[1] 5
Or we can use two gsubs on the original string ('rquote'). First one removes the characters starting with u until the end of the string (u.$) and the second matches all characters except R, r ([^Rr]) and replace it with ''. We can use nchar to get count of the characters remaining.
nchar(gsub('[^Rr]', '', sub('u.*$', '', rquote)))
#[1] 5
Or if we want to count the 'r' in the entire string, gregexpr to get the position of matching characters from the original string ('rquote') and get the length
length(gregexpr('[rR]', rquote)[[1]])
#[1] 6

Replace specific characters within strings

I would like to remove specific characters from strings within a vector, similar to the Find and Replace feature in Excel.
Here are the data I start with:
group <- data.frame(c("12357e", "12575e", "197e18", "e18947")
I start with just the first column; I want to produce the second column by removing the e's:
group group.no.e
12357e 12357
12575e 12575
197e18 19718
e18947 18947

With a regular expression and the function gsub():
group <- c("12357e", "12575e", "197e18", "e18947")
group
[1] "12357e" "12575e" "197e18" "e18947"
gsub("e", "", group)
[1] "12357" "12575" "19718" "18947"
What gsub does here is to replace each occurrence of "e" with an empty string "".
See ?regexp or gsub for more help.

Regular expressions are your friends:
R> ## also adds missing ')' and sets column name
R> group<-data.frame(group=c("12357e", "12575e", "197e18", "e18947")) )
R> group
group
1 12357e
2 12575e
3 197e18
4 e18947
Now use gsub() with the simplest possible replacement pattern: empty string:
R> group$groupNoE <- gsub("e", "", group$group)
R> group
group groupNoE
1 12357e 12357
2 12575e 12575
3 197e18 19718
4 e18947 18947
R>

Summarizing 2 ways to replace strings:
group<-data.frame(group=c("12357e", "12575e", "197e18", "e18947"))
1) Use gsub
group$group.no.e <- gsub("e", "", group$group)
2) Use the stringr package
group$group.no.e <- str_replace_all(group$group, "e", "")
Both will produce the desire output:
group group.no.e
1 12357e 12357
2 12575e 12575
3 197e18 19718
4 e18947 18947

You do not need to create data frame from vector of strings, if you want to replace some characters in it. Regular expressions is good choice for it as it has been already mentioned by #Andrie and #Dirk Eddelbuettel.
Pay attention, if you want to replace special characters, like dots, you should employ full regular expression syntax, as shown in example below:
ctr_names <- c("Czech.Republic","New.Zealand","Great.Britain")
gsub("[.]", " ", ctr_names)
this will produce
[1] "Czech Republic" "New Zealand" "Great Britain"

Use the stringi package:
require(stringi)
group<-data.frame(c("12357e", "12575e", "197e18", "e18947"))
stri_replace_all(group[,1], "", fixed="e")
[1] "12357" "12575" "19718" "18947"

> library(stringi)
> group <- c('12357e', '12575e', '12575e', ' 197e18', 'e18947')
> pattern <- "e"
> replacement <- ""
> group <- str_replace(group, pattern, replacement)
> group
[1] "12357" "12575" "12575" " 19718" "18947"

You can use chartr as well:
group$group.no.e <- chartr("e", "", group$group)

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

How to get the most frequent character within a character string? [duplicate] - r

Here is another base R option (not as elegant as #akrun's answer) > intToUtf8(names(which.max(table(utf8ToInt(gsub("\\s", "", test_string)))))) [1] "C"

One possibility involving stringr could be: names(which.max(table(str_extract_all(test_string, "[A-Z]", simplify = TRUE)))) [1] "C" Or marginally shorter: names(which.max(table(str_extract_all(test_string, "[A-Z]")[[1]])))

Here is solution using stringr package, table and which: library(stringr) test_string <- str_split(test_string, " ") test_string <- table(test_string) names(test_string)[which.max(test_string)] [1] "C"

Related

Extract last digit [duplicate]

Unexpected regex behavior when executed in base R [duplicate]

How to obtain character at a specific place? [duplicate]

Finding number of r's in the vector (Both R and r) before the first u

Replace specific characters within strings

Categories

Resources