If a name is stored as last name, then first name, like "nandan, vivek", I want to display it as "vivek nandan".
n<-("nandan,vivek")
result:
[1] vivek nandan
where first name:vivek
last name:nandan
this is the author name.
We can try using sub here:
input <- "nankin,vivek"
sub("([^,]+),\\s*(.*)", "\\2 \\1", input)
[1] "vivek nankin"
The regex pattern used above matches the last name followed by the first name, in separate capture groups. It then replaces with those capture groups, in reverse order, separated by a single space.
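Since sub is vectorised, the same pattern can be applied to a whole vector of names at once. A minimal sketch, assuming a made-up vector authors (not from the question) with and without a space after the comma:

```r
# hypothetical input: "last,first" with optional whitespace after the comma
authors <- c("nandan,vivek", "nankin, vivek")

# group 1 = text before the comma, group 2 = text after it;
# \\s* sits outside the groups, so any space after the comma is dropped
flipped <- sub("([^,]+),\\s*(.*)", "\\2 \\1", authors)
flipped
#[1] "vivek nandan" "vivek nankin"
```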
One option is to use sub to capture the run of letters ([a-z]+) before the comma and, as a second group, the run of letters ([a-z]+) after it. In the replacement, reverse the order of the backreferences
sub("([a-z]+),([a-z]+)", "\\2 \\1", n)
#[1] "vivek nandan"
A non-regex option would be to split the string and then paste the reversed words
paste(rev(strsplit(n, ",")[[1]]), collapse=" ")
#[1] "vivek nandan"
Or extract the word and paste
library(stringr)
paste(word(n, 2, sep=","), word(n, 1, sep=","))
#[1] "vivek nandan"
data
n<- "nandan,vivek"
Related
I have a character string of names which look like
"_6302_I-PAL_SPSY_000237_001"
I need to remove the first occurred underscore, so that it will be as
"6302_I-PAL_SPSY_000237_001"
I am aware of gsub, but it removes all of the underscores. Thank you for any suggestions.
gsub can do this too: anchor the pattern with ^ so that it only matches an underscore at the start of the string
x <- "_6302_I-PAL_SPSY_000237_001"
x <- gsub("^\\_","",x)
[1] "6302_I-PAL_SPSY_000237_001"
We can use sub with pattern as _ and replacement as blanks (""). This will remove the first occurrence of '_'.
sub("_", "", str1)
#[1] "6302_I-PAL_SPSY_000237_001"
NOTE: This removes the first occurrence of _ wherever it appears; it is not restricted to the start of the string.
For example, suppose we have string
str2 <- "6302_I-PAL_SPSY_000237_001"
sub("_", "", str2)
#[1] "6302I-PAL_SPSY_000237_001"
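If the removal should only ever apply to a leading underscore, the pattern can be anchored with ^. A small sketch reusing the two strings from this answer:

```r
str1 <- "_6302_I-PAL_SPSY_000237_001"
str2 <- "6302_I-PAL_SPSY_000237_001"

# ^ ties the match to the start of the string, so str2 is left untouched
anchored <- sub("^_", "", c(str1, str2))
anchored
#[1] "6302_I-PAL_SPSY_000237_001" "6302_I-PAL_SPSY_000237_001"
```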
As the example has _ at the beginning, another option is substring
substring(str1, 2)
#[1] "6302_I-PAL_SPSY_000237_001"
data
str1 <- "_6302_I-PAL_SPSY_000237_001"
This can be done with base R's trimws() too
string1<-"_6302_I-PAL_SPSY_000237_001"
trimws(string1, which='left', whitespace = '_')
[1] "6302_I-PAL_SPSY_000237_001"
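One difference worth knowing: trimws strips every leading character that matches whitespace, not only the first one. A minimal sketch with a made-up string that has two leading underscores (this assumes an R version whose trimws accepts the whitespace argument, R 3.6.0 or later as far as I know):

```r
x <- "__6302_I-PAL"   # hypothetical input with two leading underscores

all_leading <- trimws(x, which = "left", whitespace = "_")  # drops both
first_only  <- sub("^_", "", x)                             # drops only one

all_leading
#[1] "6302_I-PAL"
first_only
#[1] "_6302_I-PAL"
```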
In case we have multiple words with leading underscores, we may have to include a word boundary (\\b) in our regex, and use either gsub or stringr::str_remove_all:
string2<-paste(string1, string1)
string2
[1] "_6302_I-PAL_SPSY_000237_001 _6302_I-PAL_SPSY_000237_001"
library(stringr)
str_remove_all(string2, "\\b_")
[1] "6302_I-PAL_SPSY_000237_001 6302_I-PAL_SPSY_000237_001"
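The gsub variant is mentioned but not shown. A sketch of how it might look in base R, passing perl = TRUE so that \\b follows PCRE semantics:

```r
string1 <- "_6302_I-PAL_SPSY_000237_001"
string2 <- paste(string1, string1)

# \\b_ only matches an underscore that starts a word, i.e. one at the
# start of the string or right after the space; inner underscores stay
cleaned <- gsub("\\b_", "", string2, perl = TRUE)
cleaned
#[1] "6302_I-PAL_SPSY_000237_001 6302_I-PAL_SPSY_000237_001"
```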
I need to remove the text before the leading period (as well as the leading period) and the text following the last period from a string.
Given this string for example:
"ABCD.EF.GH.IJKL.MN"
I'd like to get the output:
[1] "IJKL"
I have tried the following:
split_string <- sub("^.*?\\.","", string)
split_string <- sub("^\\.+|\\.[^.]*$", "", string)
I believe the second line handles the last period and the text after it. However, the first line has to be executed multiple times to remove all of the text before the period in question, e.g. '.I'.
One option in base R is to capture as a group ((...)) the word followed by the dot (\\.) and the word (\\w+) till the end ($) of the string. In the replacement, use the backreference (\\1) of the captured word
sub(".*\\.(\\w+)\\.\\w+$", "\\1", str1)
#[1] "IJKL"
Here, we match characters (.*) up to a . (\\. is escaped to get the literal value, because . is a metacharacter that matches any character if not escaped), followed by the captured word ((\\w+)), followed by a dot and another word at the end ($) of the string. The replacement is the backreference described above
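Because the pattern is anchored at the end of the string, it does not depend on how many fields come first. A short sketch with two made-up extra inputs; note that sub returns a string unchanged when the pattern does not match (here, when there are fewer than two dots):

```r
# the second and third elements are hypothetical, not from the question
x <- c("ABCD.EF.GH.IJKL.MN", "A.B.C", "X.Y")

res <- sub(".*\\.(\\w+)\\.\\w+$", "\\1", x)
res
#[1] "IJKL" "B"    "X.Y"
```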
Or another option is regmatches/regexpr from base R
regmatches(str1, regexpr("\\w+(?=\\.\\w+$)", str1, perl = TRUE))
#[1] "IJKL"
Or another option is word from stringr
library(stringr)
word(str1, -2, sep="[.]")
#[1] "IJKL"
data
str1 <- "ABCD.EF.GH.IJKL.MN"
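Here is a janky tidyverse version (separate comes from tidyr, select from dplyr) in case the other values are of importance and you want to select them later on; just include them in the select call.
library(dplyr)
library(tidyr)
df <- data.frame(x = c("ABCD.EF.GH.IJKL.MN"))
# split x at the dots into five columns, then keep the fourth
df2 <- df %>%
  separate(x, into = c("var1", "var2", "var3", "var4", "var5"), sep = "\\.") %>%
  select(var4)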
Split into groups at period and take the second one from last.
sapply(strsplit(str1, "\\."), function(x) x[length(x) - 1])
#[1] "IJKL"
Get indices of the periods and use substr to extract the relevant portion
sapply(str1, function(x){
ind = gregexpr("\\.", x)[[1]]
substr(x, ind[length(ind) - 1] + 1, ind[length(ind)] - 1)
}, USE.NAMES = FALSE)
#[1] "IJKL"
These alternatives all use no packages or regular expressions.
1) basename/dirname Assuming the test input s shown in the Note at the end, convert the dots to slashes and then use dirname and basename.
basename(dirname(chartr(".", "/", s)))
## [1] "IJKL" "IJKL"
2) strsplit Using strsplit split the strings at dot creating a list of character vectors, one vector per input string, and then for each such vector take the last 2 elements using tail and the first of those using indexing.
sapply(strsplit(s, ".", fixed = TRUE), function(x) tail(x, 2)[1])
## [1] "IJKL" "IJKL"
3) read.table It is not clear from the question what the general case is but if all the components of s have the same number of dot separated fields then we can use read.table to create a data.frame with one row per input string and one column per dot-separated component. Then take the column just before the last.
dd <- read.table(text = s, sep = ".", as.is = TRUE)
dd[[ncol(dd)-1]]
## [1] "IJKL" "IJKL"
4) substr Again, the general case is not clear but if the string of interest is always at character positions 12-15 then a simple solution is:
substr(s, 12, 15)
## [1] "IJKL" "IJKL"
Note
s <- c("ABCD.EF.GH.IJKL.MN", "ABCD.EF.GH.IJKL.MN")
I am trying to get the host of an IP address from a list of strings.
ips <- c('140.112.204.42', '132.212.14.139', '31.2.47.93', '7.112.221.238')
I want to get the first two numbers (octets) from each of the ips. Desired output:
ips <- c('140.112', '132.212', '31.2', '7.112')
This is the code that I wrote to convert them:
cat(unlist(strsplit(ips, "\\.", fixed = FALSE))[1:2], sep = ".")
When I check the type of individual ips in the end I get something like this:
140.112 NULL
Not sure what I am doing wrong. If you have some other ideas completely different from this that is completely fine too.
With sub:
ips <- c('140.112.204.42', '132.212.14.139', '31.2.47.93', '7.112.221.238')
sub('\\.\\d+\\.\\d+$', '', ips)
# [1] "140.112" "132.212" "31.2" "7.112"
With str_extract from stringr:
library(stringr)
str_extract(ips, '^\\d+\\.\\d+')
# [1] "140.112" "132.212" "31.2" "7.112"
With strsplit + sapply:
sapply(strsplit(ips, '\\.'), function(x) paste(x[1:2], collapse = '.'))
# [1] "140.112" "132.212" "31.2" "7.112"
With read.table + apply:
apply(read.table(textConnection(ips), sep='.')[1:2], 1, paste, collapse = '.')
#[1] "140.112" "132.212" "31.2" "7.112"
Notes:
sub('\\.\\d+\\.\\d+$', '', ips):
i. \\.\\d+\\.\\d+$ matches a literal dot, a digit one or more times, a literal dot again, and a digit one or more times at the end of the string
ii. sub removes the above match from the string
str_extract(ips, '^\\d+\\.\\d+'):
i. ^\\d+\\.\\d+ matches a digit one or more times, a literal dot and a digit one or more times in the beginning of the string
ii. str_extract extracts the above match from the string
sapply(strsplit(ips, '\\.'), function(x) paste(x[1:2], collapse = '.')):
i. strsplit(ips, '\\.') splits each ip using a literal dot as the delimiter. This returns a list of vectors after the split
ii. With sapply, paste(x[1:2], collapse = '.') is applied to every element of the list, thus taking only the first two numbers from each vector, and collapsing them with a dot as the separator. sapply then coerces the list to a vector, thus returning a vector of the desired ips.
apply(read.table(textConnection(ips), sep='.')[1:2], 1, paste, collapse = '.'):
i. read.table(textConnection(ips), sep='.')[1:2] treats ips as text input and reads it in with dot as a delimiter. Only taking the first two columns.
ii. apply enables paste to be operated on each row, and collapses with a dot.
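To make the strsplit notes concrete, and to see why the attempt in the question printed only one address, it can help to look at the intermediate values. A sketch:

```r
ips <- c('140.112.204.42', '132.212.14.139', '31.2.47.93', '7.112.221.238')

parts <- strsplit(ips, "\\.")   # a list with one character vector per ip
parts[[3]]
#[1] "31" "2"  "47" "93"

# unlist() flattens the list into a single vector of 16 pieces, so [1:2]
# keeps only the first two pieces of the first address
flat <- unlist(parts)
flat[1:2]
#[1] "140" "112"

# working on each list element separately keeps one result per address
hosts <- sapply(parts, function(x) paste(x[1:2], collapse = "."))
hosts
#[1] "140.112" "132.212" "31.2"    "7.112"
```

cat only prints its arguments and returns NULL, which is where the NULL in the question comes from.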
Could you please try the following.
gsub("([0-9]+\\.[0-9]+)(.*)","\\1",ips)
Explanation: The regex captures digits, a literal dot (escaped as \\., since an unescaped . matches any character), and more digits in the first group, and everything after that (.*) in the second group. Replacing the whole match with \\1 keeps only the first group, i.e. the first 2 fields.
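A quick way to see why the dot matters in a pattern like this: an unescaped . matches any character, while \\. matches only a literal dot. A sketch with a made-up input:

```r
x <- "12a34"   # hypothetical string that contains no dot at all

any_char <- grepl("[0-9]+.[0-9]+", x)     # . is happy to match the "a"
literal  <- grepl("[0-9]+\\.[0-9]+", x)   # \\. needs a real dot

any_char
#[1] TRUE
literal
#[1] FALSE
```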
One solution is the following:
vapply(strsplit(ips, ".", fixed = TRUE),
function(x) paste(x[1:2], collapse = "."),
character(1L))
vapply applies function(x) to each element of the output of strsplit
strsplit produces a list where each element of the list is the components of the IP addresses separated by "."; setting fixed = TRUE requests to split using the exact value of the splitting string (i.e., "."), not using regex
function(x) takes the first two elements (x[1:2]) of each item coming out of strsplit and pastes them together, separated by "."
character(1L) tells vapply that each element of the output (i.e., each value returned from function(x)) should be a character vector of length 1.
Edit: #useR posted this solution right before me (using sapply).
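One reason to prefer vapply over sapply is that its return type stays fixed even for empty input. A minimal sketch, with a hypothetical helper first_two wrapping the code above:

```r
# first_two is an illustrative name, not part of the original answer
first_two <- function(ips) {
  vapply(strsplit(ips, ".", fixed = TRUE),
         function(x) paste(x[1:2], collapse = "."),
         character(1L))
}

two <- first_two(c("140.112.204.42", "31.2.47.93"))
two
#[1] "140.112" "31.2"

# with no input, vapply still returns a character vector ...
empty_v <- first_two(character(0))
# ... whereas sapply has nothing to simplify and returns an empty list
empty_s <- sapply(strsplit(character(0), ".", fixed = TRUE),
                  function(x) paste(x[1:2], collapse = "."))
```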
substr is vectorised on the stop argument, so you can use this with a vector of positions before the second dot. regexpr gives the position of the first match, so if you sub out the first dot you can match on the second, which will conveniently be one before its true position as needed (since you removed the first one).
substr(ips,1,regexpr("\\.",sub("\\.","",ips)))
[1] "140.112" "132.212" "31.2" "7.112"
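Broken into steps, the one-liner above works roughly like this (the intermediate names no_first_dot and pos are only for illustration):

```r
ips <- c('140.112.204.42', '132.212.14.139', '31.2.47.93', '7.112.221.238')

# 1. drop the first dot, so the former second dot becomes the first one
no_first_dot <- sub("\\.", "", ips)
no_first_dot[1]
#[1] "140112.204.42"

# 2. its position in the shortened string is one less than in the original,
#    i.e. exactly the index of the last character we want to keep
pos <- as.integer(regexpr("\\.", no_first_dot))
pos
#[1] 7 7 4 5

# 3. cut each original string at that position
res <- substr(ips, 1, pos)
res
#[1] "140.112" "132.212" "31.2"    "7.112"
```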
We can convert the ip addresses to numeric_version class and then format using this base R one-liner that employs no regular expressions:
format(numeric_version(ips)[, 1:2])
[1] "140.112" "132.212" "31.2" "7.112"
I have a vector of strings, and I would like to remove everything from each string except the word containing a certain pattern (here, mir).
s <- c("a mir-96 line (kk27)", "mir-133a cell",
"d mir-14-3p in", "m mir133 (sas)", "mir_23_5p r 27")
I want to obtain:
mir-96, mir-133a, mir-14-3p, mir133, mir_23_5p
I know the idea: use gsub() with a pattern for a word beginning with (or including) mir.
But I have no idea how to construct such a pattern.
Or other idea?
Any help will be appreciated!
One way in base R would be splitting every string into words and then extracting only those with mir in it
unlist(lapply(strsplit(s, " "), function(x) grep("mir", x, value = TRUE)))
#[1] "mir-96" "mir-133a" "mir-14-3p" "mir133" "mir_23_5p"
We can save the unlist step in lapply by using sapply as suggested by #Rich Scriven in comments
sapply(strsplit(s, " "), function(x) grep("mir", x, value = TRUE))
We can use sub to match zero or more characters (.*) followed by a word boundary (\\b), followed by the string mir and one or more characters that are not white space (\\S+). We capture that as a group by placing it inside (...), match the remaining characters (.*), and in the replacement use the backreference of the captured group (\\1)
sub(".*\\b(mir\\S+).*", "\\1", s)
#[1] "mir-96" "mir-133a" "mir-14-3p" "mir133" "mir_23_5p"
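One thing to keep in mind with the sub approach is that strings without a match are returned unchanged. If such strings should be dropped instead, regmatches with regexpr is one alternative. A sketch, where the second input element is made up:

```r
s2 <- c("a mir-96 line (kk27)", "no match here")   # 2nd element is hypothetical

# sub leaves a non-matching string exactly as it was
kept <- sub(".*\\b(mir\\S+).*", "\\1", s2)
kept
#[1] "mir-96"        "no match here"

# regmatches returns only the matches, so the result can be shorter than s2
dropped <- regmatches(s2, regexpr("mir\\S+", s2, perl = TRUE))
dropped
#[1] "mir-96"
```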
Update
If there are multiple 'mir.*' substrings, then we want to extract the one that has a numeric part
sub(".*\\b(mir[^0-9]*[0-9]+\\S*).*", "\\1", s1)
#[1] "mir-96" "mir-133a" "mir-14-3p" "mir133" "mir_23_5p" "mir_23-5p"
data
s1 <- c("a mir-96 line (kk27)", "mir-133a cell", "d mir-14-3p in", "m mir133 (sas)",
"mir_23_5p r 27", "a mir_23-5p 1 mir-net")
There are some strings which show the following pattern
ABC, DEF.JHI
AB,DE.(JH)
Generally, each string has three sections, separated by , and . The last character can be either a normal character or something like ). I would like to extract the last part. For example, I would like to generate the following two strings from the ones above
JHI
(JH)
Is there a way to do that in R?
library(stringr)
str1 <- c("ABC, DEF.JHI","AB,DE.(JH)")
str_extract(str1, '(?<=\\.).*')
#[1] "JHI" "(JH)"
(?<=\\.) is a lookbehind that requires the match to be preceded by a literal . and .* then matches all remaining characters
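The same lookbehind also works without stringr. A base R sketch using regexpr with perl = TRUE (needed for lookbehind support) together with regmatches:

```r
str1 <- c("ABC, DEF.JHI", "AB,DE.(JH)")

# find where the text after the dot starts, then pull it out
m <- regexpr("(?<=\\.).*", str1, perl = TRUE)
last_part <- regmatches(str1, m)
last_part
#[1] "JHI"  "(JH)"
```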
You can just split on the . using strsplit and extract the second element.
str1 <- c("ABC, DEF.JHI","AB,DE.(JH)")
unlist(lapply(strsplit(str1, "\\."), "[", 2))
# [1] "JHI" "(JH)"
Here's another possibility (note that this one also drops the parentheses, returning "JH" for the second string):
sapply(strsplit(str1, "\\.\\(|\\.|\\)"), "[[", 2)
Riffing on #josiber's answer you could remove the part of the string before the .
str1 <- c("ABC, DEF.JHI","AB,DE.(JH)")
gsub(".*\\.", "", str1)
# [1] "JHI" "(JH)"
EDIT
In case your third element is not always preceded by a ., to extract the final part
str1 <- c("ABC, DEF.JHI","AB,DE.(JH)", "ABC.DE, (JH)")
gsub(".*[,.]", "" , str1)
# [1] "JHI" "(JH)" " (JH)"
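The third result keeps the space that followed the comma. If that is unwanted, one way to tidy it up is to wrap the call in trimws; a small sketch:

```r
str1 <- c("ABC, DEF.JHI", "AB,DE.(JH)", "ABC.DE, (JH)")

# remove everything up to the last , or . and then strip surrounding spaces
last_part <- trimws(gsub(".*[,.]", "", str1))
last_part
# [1] "JHI"  "(JH)" "(JH)"
```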