Consider the vectors below:
ID <- c("A1","B1","C1","A12","B2","C2","Av1")
names <- c("ALPHA","BRAVO","CHARLIE","AVOCADO")
I want to replace the first character of each element in vector ID with vector names based on the first letter of vector names. I also want to add a _0 before each number between 0:9.
Note that the elements Av1 and AVOCADO throw things off a bit, especially with the lowercase v in Av1.
The result should look like this:
res <- c("ALPHA_01","BRAVO_01","CHARLIE_01","ALPHA_12","BRAVO_02","CHARLIE_02", "AVOCADO_01")
I know it should be done with regex but I've been trying for 2 days now and haven't got anywhere.
We can use gsubfn.
library(gsubfn)
#remove the number part from 'ID' (using `sub`) and get the unique elements
nm1 <- unique(sub("\\d+", "", ID))
#using gsubfn, replace the non-numeric elements with the matching
#key/value pair in the replacement
#finally format to add the "_" with sub
sub("(\\d+)$", "_0\\1", gsubfn("(\\D+)", as.list(setNames(names, nm1)), ID))
#[1] "ALPHA_01" "BRAVO_01" "CHARLIE_01" "ALPHA_02"
#[5] "BRAVO_02" "CHARLIE_02" "AVOCADO_01"
The (\\d+) indicates one or more numeric elements, and (\\D+) is one or more non-numeric elements. We are wrapping it within the brackets to capture as a group and replace it with the backreference (\\1 - as it is the first backreference for the captured group).
Update
If the condition would be to append 0 only to those 'ID's that have numbers less than 10, then we can do this with a second gsubfn and sprintf
gsubfn("(\\d+)", ~sprintf("_%02d", as.numeric(x)),
gsubfn("(\\D+)", as.list(setNames(names, nm1)), ID))
#[1] "ALPHA_01" "BRAVO_01" "CHARLIE_01" "ALPHA_12"
#[5] "BRAVO_02" "CHARLIE_02" "AVOCADO_01"
Doing this via base R, we can search for second character being V (as in AVOCADO) and substring 2 characters if that's true or 1 character if not. This will capture both AVOCADO and ALPHA. We then match those substrings with the letters extracted from ID (also convert toupper to capture Av with AV). Finally paste _0 along with the number found in each ID
paste0(names[match(toupper(sub('\\d+', '', ID)),
ifelse(substr(names, 2, 2) == 'V', substr(names, 1, 2),
substr(names, 1, 1)))],'_0', sub('\\D+', '', ID))
#[1] "ALPHA_01" "BRAVO_01" "CHARLIE_01" "ALPHA_02" "BRAVO_02" "CHARLIE_02" "AVOCADO_01"
Related
I have a chr vector:
> head(strings)
[1] "10_88517_0" "10_88521_1" "10_88542_2" "10_280230_3" "10_280258_4" "10_280310_5"
I want to create a new vector of substrings, obtained by cutting each element of this vector at the second _. E.g.:
> head(cut_strings)
[1] "10_88517" "10_88521" "10_88542" "10_280230" "10_280258" "10_280310"
My idea was to first grep for the position of second _ in each string:
cut_pts <- sapply(stringr::str_locate_all(strings, "_"), "[", 2)
All I can come up with though is an awkward for loop that goes through the strings vector and calls substr for each element, e.g.:
cut_strings <- strings
for(i in 1:length(strings)){
string <- strings[i]
cut_pt <- cut_pts[i]
string <- substr(string, 1, cut_pt-1)
cut_strings[i] <- string
}
I'm thinking maybe there's a way to use apply in this context, to cut each element of strings based on the appropriate element of cut_pts?
We could capture those characters in sub and remove the substring afterwards i..e below pattern matches the one or more characters not an underscore ([^_]+) followed by an underscore, then characters not an underscore and remove the character starting from second underscore by not including in the capture group ((...)). Note that we specified the start of the string (^). In the replacement, use the backreference (\\1) of the captured group
sub("^([^_]+_[^_]+)_.*", "\\1", strings)
I have a vector
myVec <- c('1.2','asd','gkd','232','4343','1.3zyz','fva','3213','1232','dasd')
In this vector, I want to do two things:
Remove any numbers from an element that contains both numbers and letters and then
If a group of letters is followed by another group of letters, merge them into one.
So the above vector will look like this:
'1.2','asdgkd','232','4343','zyzfva','3213','1232','dasd'
I thought I will first find the alphanumeric elements and remove the numbers from them using gsub.
I tried this
gsub('[0-9]+', '', myVec[grepl("[A-Za-z]+$", myVec, perl = T)])
"asd" "gkd" ".zyz" "fva" "dasd"
i.e. it retains the . which I don't want.
This seems to return what you are after
myVec <- c('1.2','asd','gkd','232','4343','1.3zyz','fva','3213','1232','dasd')
clean <- function (x) {
is_char <- grepl("[[:alpha:]]", x)
has_number <- grepl("\\d", x)
mixed <- is_char & has_number
x[mixed] <- gsub("[\\d\\.]+","", x[mixed], perl=T)
grp <- cumsum(!is_char | (is_char & !c(FALSE, head(is_char, -1))))
unname(tapply(x, grp, paste, collapse=""))
}
clean(myVec)
# [1] "1.2" "asdgkd" "232" "4343" "zyzfva" "3213" "1232" "dasd"
Here we look for numbers and letters mixed together and remove the numbers. Then we defined groups for collapsing, looking for characters that come after other characters to put them in the same group. Then we finally collapse all the values in the same group.
Here's my regex-only solution:
myVec <- c('1.2','asd','gkd','232','4343','1.3zyz','fva','3213','1232','dasd')
# find all elemnts containing letters
lettrs = grepl("[A-Za-z]", myVec)
# remove all non-letter characters
myVec[lettrs] = gsub("[^A-Za-z]" ,"", myVec[lettrs])
# paste all elements together, remove delimiter where delimiter is surrounded by letters and split string to new vector
unlist(strsplit(gsub("(?<=[A-Za-z])\\|(?=[A-Za-z])", "", paste(myVec, collapse="|"), perl=TRUE), split="\\|"))
I am trying to get the host of an IP address from a list of strings.
ips <- c('140.112.204.42', '132.212.14.139', '31.2.47.93', '7.112.221.238')
I want to get the first 2 digits from the ips. output:
ips <- c('140.112', '132.212', '31.2', '7.112')
This is the code that I wrote to convert them:
cat(unlist(strsplit(ips, "\\.", fixed = FALSE))[1:2], sep = ".")
When I check the type of individual ips in the end I get something like this:
140.112 NULL
Not sure what I am doing wrong. If you have some other ideas completely different from this that is completely fine too.
With sub:
ips <- c('140.112.204.42', '132.212.14.139', '31.2.47.93', '7.112.221.238')
sub('\\.\\d+\\.\\d+$', '', ips)
# [1] "140.112" "132.212" "31.2" "7.112"
With str_extract from stringr:
library(stringr)
str_extract(ips, '^\\d+\\.\\d+')
# [1] "140.112" "132.212" "31.2" "7.112"
With strsplit + sapply:
sapply(strsplit(ips, '\\.'), function(x) paste(x[1:2], collapse = '.'))
# [1] "140.112" "132.212" "31.2" "7.112"
With read.table + apply:
apply(read.table(textConnection(ips), sep='.')[1:2], 1, paste, collapse = '.')
#[1] "140.112" "132.212" "31.2" "7.112"
Notes:
sub('\\.\\d+\\.\\d+$', '', ips):
i. \\.\\d+\\.\\d+$ matches a literal dot, a digit one or more times, a literal dot again, and a digit one or more times at the end of the string
ii. sub removes the above match from the string
str_extract(ips, '^\\d+\\.\\d+'):
i. ^\\d+\\.\\d+ matches a digit one or more times, a literal dot and a digit one or more times in the beginning of the string
ii. str_extract extracts the above match from the string
sapply(strsplit(ips, '\\.'), function(x) paste(x[1:2], collapse = '.')):
i. strsplit(ips, '\\.') splits each ip using a literal dot as the delimiter. This returns a list of vectors after the split
ii. With sapply, paste(x[1:2], collapse = '.') is applied to every element of the list, thus taking only the first two numbers from each vector, and collapsing them with a dot as the separator. sapply then coerces the list to a vector, thus returning a vector of the desired ips.
apply(read.table(textConnection(ips), sep='.')[1:2], 1, paste, collapse = '.'):
i. read.table(textConnection(ips), sep='.')[1:2] treats ips as text input and reads it in with dot as a delimiter. Only taking the first two columns.
ii. apply enables paste to be operated on each row, and collapses with a dot.
Could you please try following.
gsub("([0-9]+.[0-9]+)(.*)","\\1",ips)
Explanation: Using gsub function and putting regex there to match digits then DOT then digits in memory's 1st place holder and keeping .* everything after it in 2nd place holder of memory. Then substituting these with \\1 with first regex's value which will be first 2 fields.
One solution is the following:
vapply(strsplit(ips, ".", fixed = TRUE),
function(x) paste(x[1:2], collapse = "."),
character(1L))
vapply applies function(x) to each element of the output of strsplit
strsplit produces a list where each element of the list is the components of the IP addresses separated by "."; setting fixed = TRUE requests to split using the exact value of the splitting string (i.e., "."), not using regex
function(x) takes the first two elements (x[1:2]) of each item coming out of strsplit and pastes them together, seperated by "."
character(1L) tells vapply that each element of the output (i.e., returned from function(x) should be a string of length 1.
Edit: #useR posted this solution right before me (using sapply).
substr is vectorised on the stop argument, so you can use this with a vector of positions before the second dot. regexpr gives the positions of the first match, so if you sub out the first one you can match on the second - which will be conveniently one before it's true position as needed (since you removed the first one).
substr(ips,1,regexpr("\\.",sub("\\.","",ips)))
[1] "140.112" "132.212" "31.2" "7.112"
We can convert the ip addresses to numeric_version class and then format using this base R one-liner that employs no regular expressions:
format(numeric_version(ips)[, 1:2])
[1] "140.112" "132.212" "31.2" "7.112"
I would like to move the suffix of a column name to its beginning so that it becomes its prefix. I have many columns with changing names (except the suffix), so manually renaming is not an option.
Example:
set.seed(1)
dat <- data.frame(ID = 1:5,
speed.x.alpha = runif(5),
power.x.alpha = rpois(5, 1),
force.x.alpha = rexp(5),
speed.y.beta = runif(5),
power.y.beta = rpois(5, 1),
force.y.beta = rexp(5))
In the end end the dataframe should have the following column names:
ID, alpha.speed.x, alpha.power.x, alpha.force.x, beta.speed.x, beta.power.x, force.power.x.
I strongly assume I need a gsub/sub expression which allows me to select the characters after the last dot, which I would then paste to the colnames, and eventually remove from the end. So far without success though...
A couple of gsubs and paste0 will do the trick:
gsub("y$", "x", gsub("(^.*)\\.(.*a$)", paste0("\\2", ".", "\\1"), names(dat)))
[1] "ID" "alpha.speed.x" "alpha.power.x" "alpha.force.x" "beta.speed.x"
[6] "beta.power.x" "beta.force.x"
The () in the regular expression capture the characters that match the subexpression. "\." is used to match the literal "." and the "$" anchors the expression to the end of the string. The second argument pastes together the captured sub-expressions. This result is fed to a second gsub which replaces the ending "y" with an "x" if one is found.
to rename the variables, use
names(dat) <- gsub("y$", "x", gsub("(^.*)\\.(.*a$)", paste0("\\2", ".", "\\1"), names(dat)))
Here is one option with sub. We match one or more characters that are not a . ([^.]+) from the start (^) of the string, capture it as group ((...)- inside the braces), followed by a dot (\\. - note that . is a metacharacter which signifies for any character. So, it needs to be escaped (\\) to read it as the literal character or place it inside square brackets), followed by another set of characters that are not a dot (inside the second capture group) followed by another dot and the rest of the characters until the end of the string. In the replacement, we change the order of backreferences of capture groups to get the expected output.
names(dat) <- sub("^([^.]+)\\.([^.]+)\\.(.*)", "\\3.\\1.\\2", names(dat))
names(dat)
#[1] "ID" "alpha.speed.x" "alpha.power.x" "alpha.force.x"
#[5] "beta.speed.y" "beta.power.y" "beta.force.y"
I have two groups of paired samples that could be separated by the first two letters. I would like to make two groups based on the pairing using something like [tn][abc].
Example of paired samples:
nb-008 ta-008
na015 ta-015
data:
> colnames(data)
"nb-008" "nb-014" "na015" "na-018" "ta-008" "tc-014" "ta-015" "ta-018"
patient <- factor(sapply(str_split(colnames(data), '[tn][abc]'), function(x) x[[1]]))
We can create a grouping variable with sub. We match the pattern of 2 characters (..) from the beginning of the string (^) followed by - (if present), followed by one or more characters (.*) that we capture as a group (inside the brackets), and replace by the backreference (\\1). This can be used to split the column names.
split(colnames(data), sub('^..-?(.*)', '\\1', colnames(data))))
#$`008`
#[1] "nb-008" "ta-008"
#$`014`
#[1] "nb-014" "tc-014"
#$`015`
#[1] "na015" "ta-015"
#$`018`
#[1] "na-018" "ta-018"
data
v1 <- c("nb-008", "nb-014", "na015", "na-018",
"ta-008", "tc-014", "ta-015", "ta-018" )
set.seed(24)
data <- setNames(as.data.frame(matrix(sample(0:8, 8*5,
replace=TRUE), ncol=8)), v1)