Given the string "This has 4 words!" I would like to count only the letters and digits. I would like to exclude whitespace and punctuation. As such, the string above should return 13.
I'm not sure why, but I cannot get this to work in R.
We can use [[:alnum:]] in str_count to count only the letters and digits
library(stringr)
str_count(str1, "[[:alnum:]]")
#[1] 13
Or in base R, use gsub to remove punctuation and whitespace ([[:punct:]] and [[:space:]]) and then get the number of remaining characters with nchar
nchar(gsub("[[:punct:][:space:]]+", "", str1))
Or match every character that is not alphanumeric with a negated class ([^[:alnum:]]), replace it with blank ("") and get the nchar
nchar(gsub("[^[:alnum:]]+", "", str1))
#[1] 13
data
str1 <- "This has 4 words!"
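To apply the negated-class approach above to several strings at once, it can be wrapped in a small function. This is only a sketch: count_alnum is a hypothetical helper name, not part of base R or stringr, and what [[:alnum:]] matches depends on the current locale.

```r
# Hypothetical helper (name is an assumption): count letters and digits
# in each element of a character vector.
count_alnum <- function(x) {
  # Remove every character that is not alphanumeric, then count what is left.
  nchar(gsub("[^[:alnum:]]", "", x))
}

counts <- count_alnum(c("This has 4 words!", "", "a-b c"))
counts
```

For these three inputs the counts should be 13, 0 and 3, since gsub and nchar are both vectorised.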
Related
I have a character string of names which look like
"_6302_I-PAL_SPSY_000237_001"
I need to remove the first occurred underscore, so that it will be as
"6302_I-PAL_SPSY_000237_001"
I am aware of gsub, but it removes all of the underscores. Thank you for any suggestions.
gsub can do the same: use the ^ symbol to anchor the pattern to the start of the string
x <- "_6302_I-PAL_SPSY_000237_001"
x <- gsub("^\\_","",x)
[1] "6302_I-PAL_SPSY_000237_001"
We can use sub with pattern as _ and replacement as blanks (""). This will remove the first occurrence of '_'.
sub("_", "", str1)
#[1] "6302_I-PAL_SPSY_000237_001"
NOTE: This will remove the first occurrence of _ wherever it is in the string; it is not restricted to the start of the string.
For example, suppose we have string
str2 <- "6302_I-PAL_SPSY_000237_001"
sub("_", "", str2)
#[1] "6302I-PAL_SPSY_000237_001"
As the example has _ at the beginning, another option is substring
substring(str1, 2)
#[1] "6302_I-PAL_SPSY_000237_001"
data
str1 <- "_6302_I-PAL_SPSY_000237_001"
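The difference between the anchored and unanchored patterns is easier to see on a vector that mixes the cases. A minimal sketch, where the vector x below is made up for illustration and is not from the question:

```r
# Made-up inputs: one leading underscore, none, and two leading underscores.
x <- c("_6302_I-PAL", "6302_I-PAL", "__a_b")

# Anchored: removes an underscore only if it is the first character.
anchored <- sub("^_", "", x)

# Unanchored: removes the first underscore wherever it occurs.
unanchored <- sub("_", "", x)

# Anchored with +: removes the whole run of leading underscores.
all_leading <- sub("^_+", "", x)
```

Only the unanchored version changes the second element (to "6302I-PAL"), which is the behaviour the note above warns about.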
This can be done with base R's trimws() too (the whitespace argument requires R 3.6.0 or later)
string1<-"_6302_I-PAL_SPSY_000237_001"
trimws(string1, which='left', whitespace = '_')
[1] "6302_I-PAL_SPSY_000237_001"
In case we have multiple words with leading underscores, we may have to include a word boundary (\\b) in our regex, and use either gsub or stringr::str_remove_all:
string2<-paste(string1, string1)
string2
[1] "_6302_I-PAL_SPSY_000237_001 _6302_I-PAL_SPSY_000237_001"
library(stringr)
str_remove_all(string2, "\\b_")
[1] "6302_I-PAL_SPSY_000237_001 6302_I-PAL_SPSY_000237_001"
I have a data frame. One of the columns is in string format. Various letters and numbers, but always ending in a string of numbers. Sadly this string isn't always the same length.
I'd like to know how to write a bit of code to extract just the numbers at the end. So for example:
x <- c("AB ABC 19012301927 / XX - 4625",
"BC - AB / 827 / 9765",
"XXXX-9276"
)
And I'd like to get from this: (4625, 9765, 9276)
Is there any easy way to do this please?
Thank you.
A
We can use sub to capture one or more digits (\\d+) at the end ($) of the string. The pattern first matches any characters (.*) followed by a non-digit ([^0-9]), so the captured group holds the whole trailing run of digits. In the replacement, specify the backreference (\\1) of the captured group
sub(".*[^0-9](\\d+)$", "\\1", x)
#[1] "4625" "9765" "9276"
Or with word from stringr
library(stringr)
word(x, -1, sep="[- ]")
#[1] "4625" "9765" "9276"
Or with stri_extract_last_regex from stringi
library(stringi)
stri_extract_last_regex(x, "\\d+")
#[1] "4625" "9765" "9276"
Replace everything up to the last non-digit with a zero length string.
sub(".*\\D", "", x)
giving:
[1] "4625" "9765" "9276"
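The answers above return character strings, and the sub-based ones return the input unchanged or empty when a string does not end in digits. A base R sketch that extracts only a trailing run of digits, gives NA when there is none, and converts the result to numeric; the fourth element of x is made up to show the no-match case:

```r
x <- c("AB ABC 19012301927 / XX - 4625",
       "BC - AB / 827 / 9765",
       "XXXX-9276",
       "no digits here")  # made-up element without trailing digits

# regexpr returns -1 where the pattern does not match.
m <- regexpr("[0-9]+$", x)
hit <- as.integer(m) != -1

# regmatches returns only the matched pieces, so fill them in by position.
tail_digits <- rep(NA_character_, length(x))
tail_digits[hit] <- regmatches(x, m)

tail_numbers <- as.numeric(tail_digits)
tail_numbers
```

The first three elements should come out as 4625, 9765 and 9276, and the last as NA.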
How can I change #ManuelaSchwesig#sigmargabriel#nahles into #ManuelaSchwesig, #sigmargabriel, #nahles using R?
We could use regex lookarounds to split at the junction of a lower case letter and the # character, which creates a vector of strings. Here, the pattern for strsplit is a positive lookbehind ((?<=[a-z])) followed by a positive lookahead ((?=#)). In the string, it matches at two positions, i.e. between g and # (Schwesig#sigmar) and between l and # (gabriel#nahles), and the string is split at those positions
strsplit(str1, "(?<=[a-z])(?=#)", perl = TRUE)[[1]]
#[1] "#ManuelaSchwesig" "#sigmargabriel" "#nahles"
If we need to keep it as a single string and the objective is to insert a , before each # that follows a lower case letter
gsub("([a-z])#", "\\1,#", str1)
#[1] "#ManuelaSchwesig,#sigmargabriel,#nahles"
data
str1 <- "#ManuelaSchwesig#sigmargabriel#nahles"
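The question asks for a comma followed by a space between the hashtags. A sketch of two ways to get exactly that form. The first only changes the replacement string of the gsub call above. The second extracts each hashtag with regmatches and does not rely on the preceding character being a lower case letter. The variable names are for illustration only.

```r
str1 <- "#ManuelaSchwesig#sigmargabriel#nahles"

# Same pattern as above, but the replacement inserts ", " instead of ",".
with_sep <- gsub("([a-z])#", "\\1, #", str1)

# Alternative: pull out each run of "#" followed by non-"#" characters,
# then join the pieces with ", ".
tags <- regmatches(str1, gregexpr("#[^#]+", str1))[[1]]
joined <- paste(tags, collapse = ", ")
```

Both with_sep and joined should hold "#ManuelaSchwesig, #sigmargabriel, #nahles".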
I want to write a regex in R to remove all words of a string containing numbers.
For example:
first_text = "a2c if3 clean 001mn10 string asw21"
second_text = "clean string"
Try with gsub
trimws(gsub("\\w*[0-9]+\\w*\\s*", "", first_text))
#[1] "clean string"
It is easier to select words with no numbers than to select and delete words with numbers:
> library(stringr)
> str1 <- "a2c if3 clean 001mn10 string asw21"
> paste(unlist(str_extract_all(str1, "(\\b[^\\s\\d]+\\b)")), collapse = " ")
[1] "clean string"
Note:
Backslashes have to be escaped in R to work properly, hence double backslashes
\b is word boundary
\s is white space
\d is digit character
a caret (^) at the start of a square-bracket group negates it: match characters that are not in the group
"+" after the character group [] means "1 or more" occurrences of those (non white space and non digit) characters
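Just another alternative using gsub (the trailing \\s* also removes the whitespace after each deleted word, so no double spaces are left behind)
trimws(gsub("\\S*[0-9]\\S*\\s*", "", first_text, perl = TRUE))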
#[1] "clean string"
A bit longer than some of the answers but very tractable is to first convert the string to a vector of words, then check word by word if there are any numbers and use standard R subsetting.
first_text_vec <- strsplit(first_text, " ")[[1]]
first_text_vec
[1] "a2c" "if3" "clean" "001mn10" "string" "asw21"
paste(first_text_vec[!grepl("[0-9]", first_text_vec)], collapse = " ")
[1] "clean string"
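The word-by-word approach above can be wrapped in a function so that it works on a whole character vector, for example a data frame column. A minimal sketch, where drop_words_with_digits is a hypothetical name and words are assumed to be separated by whitespace:

```r
# Hypothetical helper (name is an assumption): remove every
# whitespace-separated word that contains at least one digit.
drop_words_with_digits <- function(x) {
  # One character vector of words per input string.
  words <- strsplit(x, "\\s+")
  vapply(
    words,
    function(w) paste(w[!grepl("[0-9]", w)], collapse = " "),
    character(1)
  )
}

cleaned <- drop_words_with_digits(
  c("a2c if3 clean 001mn10 string asw21", "no digits", "123")
)
cleaned
```

A string whose words all contain digits comes back as an empty string rather than NA.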
I have the following string in R: "xxx, yyy. zzz"
I want to get the yyy part only, which are in between "," and "."
I don't want to use regex.
I searched half a day, found many string functions in R but none which deal with "cut before/after a character" function.
Is there such?
We can use gsub with two alternatives separated by | (or). The first matches zero or more characters that are not a , ([^,]*) from the start (^) of the string, followed by a , and zero or more spaces (\\s*). The second matches a dot (\\. - escaped because an unescaped . is a metacharacter that matches any character) followed by all remaining characters (.*) until the end of the string ($). Both matches are replaced with blank ("")
gsub("^[^,]*,\\s*|\\..*$", "", str1)
#[1] "yyy"
If the goal is to avoid gsub, strsplit the string at a , followed by zero or more spaces or at a . (the split pattern is still a regular expression), convert the list output to a vector ([[1]]) and select the second entry
strsplit(str1, ",\\s*|\\.")[[1]][2]
#[1] "yyy"
data
str1 <- "xxx, yyy. zzz"
It could be that this suffices:
unlist(strsplit("xxx, yyy. zzz","[,.]"))[2] # get yyy with space, or:
gsub(" ","",unlist(strsplit("xxx, yyy. zzz","[,.]")))[2] # remove space
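Since the question asks for a solution without regex, here is a sketch that locates the delimiters with fixed = TRUE, so they are treated as literal text, and cuts with substring. between_fixed is a hypothetical helper name. It assumes both delimiters are present, that each is a single character, and that the first comma comes before the first dot.

```r
# Hypothetical helper (name is an assumption): return the text between the
# first occurrence of `left` and the first occurrence of `right`.
between_fixed <- function(x, left = ",", right = ".") {
  # fixed = TRUE makes regexpr search for the literal characters.
  start <- as.integer(regexpr(left, x, fixed = TRUE))
  end <- as.integer(regexpr(right, x, fixed = TRUE))
  # Take what lies strictly between the delimiters and strip the padding.
  trimws(substring(x, start + 1, end - 1))
}

middle <- between_fixed("xxx, yyy. zzz")
middle
```

For the example string this should give "yyy".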