There are functions in Excel called left, right, and mid, which let you extract part of the entry from a cell. For example, =left(A1, 3) would return the 3 leftmost characters in cell A1, and =mid(A1, 3, 4) would start with the third character in cell A1 and give you characters 3 through 6. Are there similar functions in R, or similarly straightforward ways to do this?
As a simplified sample problem I would like to take a vector
sample<-c("TRIBAL","TRISTO", "RHOSTO", "EUGFRI", "BYRRAT")
and create 3 new vectors that contain the first 3 characters in each entry, the middle 2 characters in each entry, and the last 4 characters in each entry.
A slightly more complicated question that Excel doesn't have a function for (that I know of) would be how to create a new vector with the 1st, 3rd, and 5th characters from each entry.
You are looking for the function substr or its close relative substring:
The leading characters are straightforward:
substr(sample, 1, 3)
[1] "TRI" "TRI" "RHO" "EUG" "BYR"
So is extracting some characters at a defined position (here, characters 2 to 3):
substr(sample, 2, 3)
[1] "RI" "RI" "HO" "UG" "YR"
To get the trailing characters, you have two options:
substr(sample, nchar(sample)-3, nchar(sample))
[1] "IBAL" "ISTO" "OSTO" "GFRI" "RRAT"
substring(sample, nchar(sample)-3)
[1] "IBAL" "ISTO" "OSTO" "GFRI" "RRAT"
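If you want names that mirror Excel, thin wrappers around substr are enough. This is a minimal sketch: left, right, and mid are hypothetical helper names (not base R functions), and mid is assumed to take a start position and a length, as Excel's MID does.

```r
# Hypothetical Excel-style helpers built on base R's substr()
left  <- function(x, n) substr(x, 1, n)
right <- function(x, n) substr(x, nchar(x) - n + 1, nchar(x))
mid   <- function(x, start, n) substr(x, start, start + n - 1)

sample <- c("TRIBAL", "TRISTO", "RHOSTO", "EUGFRI", "BYRRAT")

first3  <- left(sample, 3)    # first 3 characters
middle2 <- mid(sample, 3, 2)  # characters 3 and 4 of each 6-character entry
last4   <- right(sample, 4)   # last 4 characters
```

Because substr is vectorised, these helpers work on a whole character vector at once.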
And your final "complicated" question:
characters <- function(x, pos){
sapply(x, function(x)
paste(sapply(pos, function(i)substr(x, i, i)), collapse=""))
}
characters(sample, c(1,3,5))
TRIBAL TRISTO RHOSTO EUGFRI BYRRAT
"TIA" "TIT" "ROT" "EGR" "BRA"
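An alternative sketch for picking arbitrary positions splits each string into single characters and indexes them directly. The name pick_chars is hypothetical, and the sketch assumes every position in pos exists in every string.

```r
# Split each string into characters, keep the requested positions, and re-join
pick_chars <- function(x, pos) {
  vapply(strsplit(x, ""),
         function(ch) paste(ch[pos], collapse = ""),
         character(1))
}

sample <- c("TRIBAL", "TRISTO", "RHOSTO", "EUGFRI", "BYRRAT")
picked <- pick_chars(sample, c(1, 3, 5))
```

Unlike the sapply version, vapply over the strsplit list returns an unnamed character vector.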
Related
I have managed to extract from a string all values starting with "N_"; however, I can't extract precisely those within a certain range of numbers.
It is an R data frame, and I have something like this:
V1 N_words
(N_R33A, N_R35B, N_T44N, J_T7B) N_R33A, N_R35B, N_T44N
My desired output would keep only the N_ values whose number falls within a specific range:
V1 N_words (range 30-35)
(N_R33A, N_R35B, N_T44N, J_T7B) N_R33A, N_R35B
The code I am using is below, but it extracts every N_ value and I don't seem to be able to select a range. I am also creating a new column in my x data frame with the extracted words:
x$N_words = str_extract_all(x$V1, "N_([A-Z]|[0-9])+")
One option is modifying the regex:
library(stringr)
x <- "(N_R33A, N_R35B, N_T44N, J_T7B)"
str_extract_all(x, "N_[A-Z]3[0-5][A-Z]")
# [[1]]
# [1] "N_R33A" "N_R35B"
matches N_
followed by an uppercase letter ([A-Z])
followed by 3
followed by 0, 1, 2, 3, 4 or 5 ([0-5])
followed by an uppercase letter ([A-Z])
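Encoding a numeric range in a regex gets awkward for ranges that do not line up with digit boundaries. As a hedged alternative, the sketch below extracts every N_ code first and then filters on the number itself. It uses base R only, the variable names are illustrative, and it assumes each code contains a single run of digits.

```r
x <- "(N_R33A, N_R35B, N_T44N, J_T7B)"

# Extract every code that starts with N_
codes <- regmatches(x, gregexpr("N_[A-Z][0-9]+[A-Z]", x))[[1]]

# Strip everything except the digits and convert to integer
nums <- as.integer(gsub("\\D", "", codes))

# Keep only codes whose number lies in the range 30-35
in_range <- codes[nums >= 30 & nums <= 35]
```

Changing the range then only means changing the two numeric bounds, not the pattern.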
Let's say I have the string ABCC321BB321A. I want to search for a pattern that consists of ABC...321, where ... can be any character(s). However, I want to only return results in which characters in the substring can be grouped into sets of 3.
E.g., I don't want ABCC321 (ABC - C32 - 1), but I do want ABCC321BB321 (ABC - C32 - 1BB - 321).
How would I do this in R? Is it possible to achieve using regular expressions? I guess I could possibly split the string up into a list containing groups of 3 or use conditionals to only return matches that are divisible by 3 to get the answer I want, but I'm assuming there's a more efficient method.
Try this:
x <- "ABCC321BB321A"
threes <- regmatches(x, gregexpr(".{3}", x))[[1]]
threes
paste(threes, collapse = "-")
which produces:
[1] "ABC" "C32" "1BB" "321"
and
[1] "ABC-C32-1BB-321"
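Splitting into threes shows the grouping but does not by itself enforce the ABC...321 pattern. One possible way to do both in a single regular expression, sketched here under the assumption that a PCRE pattern (perl = TRUE) is acceptable, is to let the middle part consist only of whole 3-character groups:

```r
x <- "ABCC321BB321A"

# ABC, then as few whole 3-character groups as possible, then 321
pattern <- "ABC(?:.{3})*?321"
hit <- regmatches(x, regexpr(pattern, x, perl = TRUE))
```

Since ABC and 321 are each three characters long and the middle is a multiple of three, any match has a length divisible by 3. Here the shorter candidate ABCC321 is rejected and ABCC321BB321 is returned.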
This is my current dataset called details.
> details$names<- c("James Johnson","Michael Jones","Robert Miller","Christopher Smith","Richard Nolan","Constantine Wilson","Mountabatteen Keizman")
I want to extract the part of names considering these 2 aspects:
1) Starting from the left, extract all characters until a space or a hyphen (or minus sign) is reached.
2) Extract no more than ten characters.
I tried to do this by using this code:
> abrevStrings<- function(details$names)
{
gsub("([a-z])([A-Z])","([a-z])([A-Z])<= 10",details$names)
}
But I didn't get the output I wanted.
My desired output can be seen below:
James
Michael
Robert
Christophe
Richard
Constantin
Mountabatt
One way would be to use sub and substr: remove everything after the first whitespace or hyphen, then keep only the first 10 characters.
abrevStrings <- function(x) {
substr(sub("\\s+.*|-.*", "", x), 1, 10)
}
abrevStrings(details$names)
#[1] "James" "Michael" "Robert" "Christophe" "Richard"
# "Constantin" "Mountabatt"
Or another option is to split the strings on whitespace or hyphen and take the substring of the first part of the string.
sapply(strsplit(details$names, "\\s+|-"), function(x) substr(x[1], 1, 10))
data
details <- data.frame(names = c("James Johnson","Michael Jones","Robert Miller",
"Christopher Smith","Richard Nolan","Constantine Wilson",
"Mountabatteen Keizman"), stringsAsFactors = FALSE)
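Both rules can also be folded into a single sub call by capturing at most ten leading characters that are neither whitespace nor a hyphen. This is only a sketch: the name "Anne-Marie Lee" is made-up test data added to exercise the hyphen rule, and abbrev is an illustrative variable name.

```r
details <- data.frame(
  names = c("James Johnson", "Christopher Smith", "Anne-Marie Lee"),
  stringsAsFactors = FALSE
)

# Capture 1 to 10 leading characters that are not whitespace or "-",
# then drop the rest of the string
abbrev <- sub("^([^[:space:]-]{1,10}).*$", "\\1", details$names)
```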
Consider the vectors below:
ID <- c("A1","B1","C1","A12","B2","C2","Av1")
names <- c("ALPHA","BRAVO","CHARLIE","AVOCADO")
I want to replace the first character of each element in vector ID with vector names based on the first letter of vector names. I also want to add a _0 before each number between 0:9.
Note that the elements Av1 and AVOCADO throw things off a bit, especially with the lowercase v in Av1.
The result should look like this:
res <- c("ALPHA_01","BRAVO_01","CHARLIE_01","ALPHA_12","BRAVO_02","CHARLIE_02", "AVOCADO_01")
I know it should be done with regex but I've been trying for 2 days now and haven't got anywhere.
We can use gsubfn.
library(gsubfn)
#remove the number part from 'ID' (using `sub`) and get the unique elements
nm1 <- unique(sub("\\d+", "", ID))
#using gsubfn, replace the non-numeric elements with the matching
#key/value pair in the replacement
#finally format to add the "_" with sub
sub("(\\d+)$", "_0\\1", gsubfn("(\\D+)", as.list(setNames(names, nm1)), ID))
#[1] "ALPHA_01" "BRAVO_01" "CHARLIE_01" "ALPHA_012"
#[5] "BRAVO_02" "CHARLIE_02" "AVOCADO_01"
The (\\d+) matches one or more digits, and (\\D+) matches one or more non-digit characters. We wrap each in parentheses to capture it as a group, which can then be referred to in the replacement with a backreference (\\1, as it is the first captured group).
Update
If the condition would be to append 0 only to those 'ID's that have numbers less than 10, then we can do this with a second gsubfn and sprintf
gsubfn("(\\d+)", ~sprintf("_%02d", as.numeric(x)),
gsubfn("(\\D+)", as.list(setNames(names, nm1)), ID))
#[1] "ALPHA_01" "BRAVO_01" "CHARLIE_01" "ALPHA_12"
#[5] "BRAVO_02" "CHARLIE_02" "AVOCADO_01"
Doing this in base R, we can check whether the second character of each name is V (as in AVOCADO) and take the first 2 characters if so, or the first character otherwise. This distinguishes AVOCADO from ALPHA. We then match those substrings against the letters extracted from ID (converted with toupper so that Av matches AV). Finally, paste _0 along with the number found in each ID.
paste0(names[match(toupper(sub('\\d+', '', ID)),
ifelse(substr(names, 2, 2) == 'V', substr(names, 1, 2),
substr(names, 1, 1)))],'_0', sub('\\D+', '', ID))
#[1] "ALPHA_01" "BRAVO_01" "CHARLIE_01" "ALPHA_012" "BRAVO_02" "CHARLIE_02" "AVOCADO_01"
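Pasting a fixed _0 puts a zero in front of every number, including two-digit ones such as the 12 in A12. A base R sketch that pads only single-digit numbers, assuming every ID is a letter prefix followed by an integer, could use sprintf instead (prefix, number, key, and res are illustrative names):

```r
ID <- c("A1", "B1", "C1", "A12", "B2", "C2", "Av1")
names <- c("ALPHA", "BRAVO", "CHARLIE", "AVOCADO")

# Split each ID into an upper-cased letter prefix and an integer
prefix <- toupper(sub("\\d+$", "", ID))
number <- as.integer(sub("^\\D+", "", ID))

# Lookup key per name: two letters when the second is V, else one letter
key <- ifelse(substr(names, 2, 2) == "V",
              substr(names, 1, 2),
              substr(names, 1, 1))

# %02d zero-pads to two digits, so 1 becomes 01 and 12 stays 12
res <- sprintf("%s_%02d", names[match(prefix, key)], number)
```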