Extracting characters from entries in a vector in R - r

There are functions in Excel called left, right, and mid, where you can extract part of the entry from a cell. For example, =left(A1, 3), would return the 3 left most characters in cell A1, and =mid(A1, 3, 4) would start with the the third character in cell A1 and give you characters number 3 - 6. Are there similar functions in R or similarly straightforward ways to do this?
As a simplified sample problem I would like to take a vector
sample<-c("TRIBAL","TRISTO", "RHOSTO", "EUGFRI", "BYRRAT")
and create 3 new vectors that contain the first 3 characters in each entry, the middle 2 characters in each entry, and the last 4 characters in each entry.
A slightly more complicated question that Excel doesn't have a function for (that I know of) would be how to create a new vector with the 1st, 3rd, and 5th characters from each entry.

You are looking for the function substr or its close relative substring:
The leading characters are straight-forward:
substr(sample, 1, 3)
[1] "TRI" "TRI" "RHO" "EUG" "BYR"
So is extracting some characters at a defined position:
substr(sample, 2, 3)
[1] "RI" "RI" "HO" "UG" "YR"
To get the trailing characters, you have two options:
substr(sample, nchar(sample)-3, nchar(sample))
[1] "IBAL" "ISTO" "OSTO" "GFRI" "RRAT"
substring(sample, nchar(sample)-3)
[1] "IBAL" "ISTO" "OSTO" "GFRI" "RRAT"
And your final "complicated" question:
characters <- function(x, pos){
sapply(x, function(x)
paste(sapply(pos, function(i)substr(x, i, i)), collapse=""))
}
characters(sample, c(1,3,5))
TRIBAL TRISTO RHOSTO EUGFRI BYRRAT
"TIA" "TIT" "ROT" "EGR" "BRA"

Related

R alphanumeric string range

I have managed to extract from a string all values starting with "N_", however I cant extracted precisely those with a certain range of numbers.
Is an R data frame and I have something like this
V1 N_words
(N_R33A, N_R35B, N_T44N, J_T7B) N_R33A, N_R35B, N_T44N
My desired output would be having a specific range of all the N_
V1 N_words (range 30-35)
(N_R33A, N_R35B, N_T44N, J_T7B) N_R33A, N_R35B
The code I am using is but is only extracting N_ and I dont seem to be able to select a range, I am also creating a new column to my x data frame with the extracted words :
x$N_words = str_extract_all(x$V1, "N_([A-Z]|[0-9])+")
One option is modifying the regex:
x = "(N_R33A, N_R35B, N_T44N, J_T7B)"
str_extract_all(x, "N_[A-Z]3[0-5][A-Z]")
# [[1]]
# [1] "N_R33A" "N_R35B"
matches N_
followed by an uppercase letter ([A-Z])
followed by 3
followed by 0, 1, 2, 3, 4 or 5 ([0-5]) .
followed by an uppercase letter ([A-Z])

Search every n characters in a string for a pattern

Let's say I have the string ABCC321BB321A. I want to search for a pattern that consists of ABC...321, where ... can be any character(s). However, I want to only return results in which characters in the substring can be grouped into sets of 3.
E.g., I don't want ABCC321 (ABC - C32 - 1), but I do want ABCC321BB321 (ABC - C32 - 1BB - 321).
How would I do this in R? Is it possible to achieve using regular expressions? I guess I could possibly split the string up into a list containing groups of 3 or use conditionals to only return matches that are divisible by 3 to get the answer I want, but I'm assuming there's a more efficient method.
Try this:
x <- "ABCC321BB321A"
threes <- regmatches(x, gregexpr(".{3}", x))[[1]]
threes
paste(threes, collapse = "-")
which produces:
[[1]]
[1] "ABC" "C32" "1BB" "321"
and
[1] "ABC-C32-1BB-321"

How to develop a function that accepts a vectors of character which corresponds to the column component of a dataframe?

This is my current dataset called details.
> details$names<- c("James Johnson","Michael Jones","Robert Miller","Christopher Smith","Richard Nolan","Constantine Wilson","Mountabatteen Keizman")
I want to extract the part of names considering these 2 aspects:
1) Starting from the left, extract all characters until a space or a hypen (or minus sign) is reached.
2) Extract no more than ten characters.
I tried to do this by using this code:
> abrevStrings<- function(details$names)
{
gsub("([a-z])([A-Z])","([a-z])([A-Z])<= 10",details$names)
}
But I didn't get the output I wanted.
My desired output can be seen below:
James
Michael
Robert
Christophe
Richard
Constantin
Mountabatt
One way would using sub and substr by removing everything after whitespace or hyphen and then select only first 10 characters.
abrevStrings <- function(x) {
substr(sub("\\s+.*|-.*", "", x), 1, 10)
}
abrevStrings(details$names)
#[1] "James" "Michael" "Robert" "Christophe" "Richard"
# "Constantin" "Mountabatt"
Or another option is to split the strings on whitespace or hyphen and take the substring of the first part of the string.
sapply(strsplit(details$names, "\\s+|-"), function(x) substr(x[1], 1, 10))
data
details <- data.frame(names = c("James Johnson","Michael Jones","Robert Miller",
"Christopher Smith","Richard Nolan","Constantine Wilson",
"Mountabatteen Keizman"), stringsAsFactors = FALSE)

Count string length and remove characters if a certain length [duplicate]

There are functions in Excel called left, right, and mid, where you can extract part of the entry from a cell. For example, =left(A1, 3), would return the 3 left most characters in cell A1, and =mid(A1, 3, 4) would start with the the third character in cell A1 and give you characters number 3 - 6. Are there similar functions in R or similarly straightforward ways to do this?
As a simplified sample problem I would like to take a vector
sample<-c("TRIBAL","TRISTO", "RHOSTO", "EUGFRI", "BYRRAT")
and create 3 new vectors that contain the first 3 characters in each entry, the middle 2 characters in each entry, and the last 4 characters in each entry.
A slightly more complicated question that Excel doesn't have a function for (that I know of) would be how to create a new vector with the 1st, 3rd, and 5th characters from each entry.
You are looking for the function substr or its close relative substring:
The leading characters are straight-forward:
substr(sample, 1, 3)
[1] "TRI" "TRI" "RHO" "EUG" "BYR"
So is extracting some characters at a defined position:
substr(sample, 2, 3)
[1] "RI" "RI" "HO" "UG" "YR"
To get the trailing characters, you have two options:
substr(sample, nchar(sample)-3, nchar(sample))
[1] "IBAL" "ISTO" "OSTO" "GFRI" "RRAT"
substring(sample, nchar(sample)-3)
[1] "IBAL" "ISTO" "OSTO" "GFRI" "RRAT"
And your final "complicated" question:
characters <- function(x, pos){
sapply(x, function(x)
paste(sapply(pos, function(i)substr(x, i, i)), collapse=""))
}
characters(sample, c(1,3,5))
TRIBAL TRISTO RHOSTO EUGFRI BYRRAT
"TIA" "TIT" "ROT" "EGR" "BRA"

Finding number of r's in the vector (Both R and r) before the first u

rquote <- "R's internals are irrefutably intriguing"
chars <- strsplit(rquote, split = "")[[1]]
in the above code we need to find the number of r's(R and r) in rquote
You could use substrings.
## find position of first 'u'
u1 <- regexpr("u", rquote, fixed = TRUE)
## get count of all 'r' or 'R' before 'u1'
lengths(gregexpr("r", substr(rquote, 1, u1), ignore.case = TRUE))
# [1] 5
This follows what you ask for in the title of the post. If you want the count of all the "r", case insensitive, then simplify the above to
lengths(gregexpr("r", rquote, ignore.case = TRUE))
# [1] 6
Then there's always stringi
library(stringi)
## count before first 'u'
stri_count_regex(stri_sub(rquote, 1, stri_locate_first_regex(rquote, "u")[,1]), "r|R")
# [1] 5
## count all R or r
stri_count_regex(rquote, "r|R")
# [1] 6
To get the number of R's before the first u, you need to make an intermediate step. (You probably don't need to. I'm sure akrun knows some incredibly cool regular expression to get the job done, but it won't be as easy to understand as this).
rquote <- "R's internals are irrefutably intriguing"
before_u <- gsub("u[[:print:]]+$", "", rquote)
length(stringr::str_extract_all(before_u, "(R|r)")[[1]])
You may try this,
> length(str_extract_all(rquote, '[Rr]')[[1]])
[1] 6
To get the count of all r's before the first u
> length(str_extract_all(rquote, perl('u.*(*SKIP)(*F)|[Rr]'))[[1]])
[1] 5
EDIT: Just saw before the first u. In that case, we can get the position of the first 'u' from either which or match.
Then use grepl in the 'chars' up to the position (ind) to find the logical index of 'R' with ignore.case=TRUE and use sum using the strsplit output from the OP's code.
ind <- which(chars=='u')[1]
Or
ind <- match('u', chars)
sum(grepl('r', chars[seq(ind)], ignore.case=TRUE))
#[1] 5
Or we can use two gsubs on the original string ('rquote'). First one removes the characters starting with u until the end of the string (u.$) and the second matches all characters except R, r ([^Rr]) and replace it with ''. We can use nchar to get count of the characters remaining.
nchar(gsub('[^Rr]', '', sub('u.*$', '', rquote)))
#[1] 5
Or if we want to count the 'r' in the entire string, gregexpr to get the position of matching characters from the original string ('rquote') and get the length
length(gregexpr('[rR]', rquote)[[1]])
#[1] 6

Resources