I have a column containing different names, and I would like to get everything after the second space in each string.
My example:
df <- data.frame(col = c("Adenia macrophylla", "Adinobotrys atropurpureus (Wall.) Dunn", "Ardisia purpurea Reinw. ex Blume"))
My desired outcome looks like this:
col
1
2 (Wall.) Dunn
3 Reinw. ex Blume
Any suggestions for me? What I did before was to separate the column and unite the pieces, but I wonder whether there is a neater or better way to do it, since I already have many columns.
Update
Just solved it:
xx %>%
  mutate(col = str_pad(col, 20, "right")) %>%
  mutate(col = str_remove(col, '\\w+\\s\\w+\\s'))
Thanks to Ronak and U12-Forward for providing the regex.
You may use sub -
sub('\\w+\\s\\w+\\s', '', df$col)
#[1] "Adenia macrophylla" "(Wall.) Dunn" "Reinw. ex Blume"
# (the first name has no second space, so it is left unchanged)
#Also
#sub('.*?\\s.*?\\s', '', df$col)
If you want a tidyverse answer.
library(dplyr)
library(stringr)
df %>% mutate(val = str_remove(col, '\\w+\\s\\w+\\s'))
In case you want to select the string after n spaces, it might be good to use repetition in sub.
sub("([^ ]* ){2}(.*)|.*", "\\2", df$col)
#sub("([^ ]* ){2}|.*", "", df$col, perl=TRUE) #Alternative
#[1] "" "(Wall.) Dunn" "Reinw. ex Blume"
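[^ ]* matches any run of non-space characters (zero or more) and is followed by a space; {2} repeats that group twice; (.*) captures everything that remains. The trailing |.* alternative matches strings with fewer than two spaces, so they are replaced by an empty string.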
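The repetition count can be turned into a parameter. The sketch below wraps the pattern above in a small helper; the name after_nth_space and its argument n are made up for illustration and are not part of any package.

```r
# Hypothetical helper (not from any package): return what follows the
# n-th space in each string, or "" when a string has fewer than n spaces.
after_nth_space <- function(x, n) {
  # Same pattern as above, with the repetition count filled in by sprintf()
  pattern <- sprintf("([^ ]* ){%d}(.*)|.*", n)
  sub(pattern, "\\2", x)
}

df <- data.frame(col = c("Adenia macrophylla",
                         "Adinobotrys atropurpureus (Wall.) Dunn",
                         "Ardisia purpurea Reinw. ex Blume"))
after_nth_space(df$col, 2)
#[1] ""                "(Wall.) Dunn"    "Reinw. ex Blume"
```

Building the pattern with sprintf() keeps the regex itself unchanged, so only the count varies between calls.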
Or use this regex (the trailing \\s* also removes the space that follows the second word):
df$col <- sub('^\\S+\\s+\\S+\\s*', '', df$col)
Output df:
> df
col
1
2 (Wall.) Dunn
3 Reinw. ex Blume
>
I would like to remove the second space in several names (after sp.) in R using tidyverse.
My example:
df <- data.frame(x = c("Araceae sp. 22", "Arecaceae sp. 02"))
My desired output
x
Araceae sp.22
Arecaceae sp.02
Any suggestions for me, please?
We may use sub to capture one or more non-space characters (\\S+), followed by one or more spaces (\\s+) and another run of non-space characters, then match the spaces that follow and replace the whole match with the backreference to the captured group (\\1).
df$x <- sub("^(\\S+\\s+\\S+)\\s+", "\\1", df$x)
df$x
[1] "Araceae sp.22" "Arecaceae sp.02"
Or we can use str_replace
library(dplyr)
library(stringr)
df %>%
mutate(x = str_replace(x, "^(\\S+\\s+\\S+)\\s+", "\\1"))
I have this regex to separate letters from numbers (and symbols) of a word: (?<=[a-zA-Z])(?=([[0-9]|[:punct:]])). My test string is: "CALLE15 CRA22".
I want to apply this regex only to the first word of that sentence (words are delimited by spaces). Namely, I want to apply it only to "CALLE15".
One solution is to split the string (sentence) into words and then apply the regex to the first word, but I want to do it all in one regex. Another solution is to use stringr::str_replace() (or sub()), which replaces only the first match, but I need stringr::str_replace_all() (or gsub()) for other reasons.
What I need is to insert a space between the two parts, which I do with the replacement function. The outcome I want is "CALLE 15 CRA22", with the possibility of "CALLE15 CRA 22". I tried many positions for the space, as well as ^ at the beginning, and nothing worked.
https://rubular.com/r/7dxsHdOA3avTdX
Thanks for your help!!!!
I am unsure about your problem statement (see my comment above), but the following reproduces your expected output and uses str_replace_all
ss <- "CALLE15 CRA22"
library(stringr)
str_replace_all(ss, "^([A-Za-z]+)(\\d+)(\\s.+)$", "\\1 \\2\\3")
#[1] "CALLE 15 CRA22"
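If gsub() has to stay for other reasons, anchoring the pattern with ^ should be enough to confine it to the first word, because only one match can start at the beginning of the string. A base R sketch under that assumption (it inserts a space only when the first word is letters followed by digits):

```r
ss <- "CALLE15 CRA22"
# ^ ties the match to the start of the string, so even the "replace all"
# function gsub() can only touch the first word.
out <- gsub("^([A-Za-z]+)(\\d+)", "\\1 \\2", ss)
out
#[1] "CALLE 15 CRA22"
```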
Update
To reproduce the output of the sample string from the comment above
ss <- "CLL.6 N 5-74NORTE"
pat <- c(
"(?<=[A-Za-z])(?![A-Za-z])",
"(?<![A-Za-z])(?=[A-Za-z])",
"(?<=[0-9])(?![0-9])",
"(?<![0-9])(?=[0-9])")
library(stringr)
str_split(ss, sprintf("(%s)", paste(pat, collapse = "|"))) %>%
unlist() %>%
.[nchar(trimws(.)) > 0] %>%
paste(collapse = " ")
#[1] "CLL . 6 N 5 - 74 NORTE"
I need to replace the 6th, 7th and 8th positions with "_". I gave substring the start and stop positions, but it didn't work.
> a=c("UHI786KJRH2V", "TYR324FHASJKDG","DHA927NFSYFN34")
> substring(a, 6,8) <- "_"
> a
[1] "UHI78_KJRH2V" "TYR32_FHASJKDG" "DHA92_NFSYFN34"
I need UHI78_RH2V TYR32_ASJKDG DHA92_SYFN34
Using sub, we can match on the pattern (?<=^.{5}).{3}, and then replace it by a single underscore:
a <- c("UHI786KJRH2V", "TYR324FHASJKDG","DHA927NFSYFN34")
out <- sub("(?<=^.{5}).{3}", "_", a, perl=TRUE)
out
[1] "UHI78_RH2V" "TYR32_ASJKDG" "DHA92_SYFN34"
We could also try doing substring operations here, but we would have to do some splicing:
out <- paste0(substr(a, 1, 5), "_", substr(a, 9, nchar(a)))
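The splicing can be wrapped in a small function so the positions are not hard-coded. This is only a sketch; replace_span is a made-up name, and it assumes 1 <= start <= stop.

```r
# Hypothetical helper: replace characters start..stop of each string
# with `replacement`, which may have any length.
replace_span <- function(x, start, stop, replacement) {
  paste0(substr(x, 1, start - 1), replacement, substr(x, stop + 1, nchar(x)))
}

a <- c("UHI786KJRH2V", "TYR324FHASJKDG", "DHA927NFSYFN34")
replace_span(a, 6, 8, "_")
#[1] "UHI78_RH2V"   "TYR32_ASJKDG" "DHA92_SYFN34"
```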
1) str_sub<- The str_sub<- replacement function in the stringr package can do that.
library(stringr)
str_sub(a, 6, 8) <- "_"
a
## [1] "UHI78_RH2V" "TYR32_ASJKDG" "DHA92_SYFN34"
2) Base R With only base R you could do this. It replaces the entire string with the match to the first capture group, an underscore and the match to the second capture group.
sub("(.....)...(.*)", "\\1_\\2", a)
## [1] "UHI78_RH2V" "TYR32_ASJKDG" "DHA92_SYFN34"
That regex could also be written as "(.{5}).{3}(.*)" .
3) separate/unite If a is a column in a data frame then we could use dplyr and tidyr to do this:
library(dplyr)
library(tidyr)
DF <- data.frame(a)
DF %>%
separate(a, into = c("pre", "junk", "post"), sep = c(5, 8)) %>%
select(-junk) %>%
unite(a)
giving:
a
1 UHI78_RH2V
2 TYR32_ASJKDG
3 DHA92_SYFN34
From the documentation:
If the portion to be replaced is longer than the replacement string, then only the portion the length of the string is replaced.
So we could do something like this:
substring(a, 6,8) <- "_##"
sub("#+", "", a)
[1] "UHI78_RH2V" "TYR32_ASJKDG" "DHA92_SYFN34"
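To see the documented behaviour directly: the assignment form never changes the length of the string, so a one-character replacement overwrites one position only. A short check using the first value from the question; note that the sub("#+", ...) step assumes the original strings contain no # characters of their own.

```r
x <- "UHI786KJRH2V"

# Replacement is 1 character long, so only position 6 is overwritten
y <- x
substring(y, 6, 8) <- "_"
y
#[1] "UHI78_KJRH2V"
nchar(y) == nchar(x)
#[1] TRUE

# Replacement is 3 characters long, so positions 6 to 8 are overwritten
z <- x
substring(z, 6, 8) <- "_##"
z
#[1] "UHI78_##RH2V"

# Drop the placeholder characters afterwards
sub("#+", "", z)
#[1] "UHI78_RH2V"
```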
I am looking for a regex for gsub to remove all the unwanted commas:
Data:
,,,,,,,12345
12345,1345,1354
123,,,,,,
12345,
,12354
Desired result:
12345
12345,1345,1354
123
12345
12354
This is the progress I have made so far:
(,(?!\d+))
You seem to want to remove all leading and trailing commas.
You may do it with
gsub("^,+|,+$", "", x)
The regex contains two alternatives: ^,+ matches one or more commas at the start and ,+$ matches one or more commas at the end; gsub replaces these matches with empty strings.
x <- c(",,,,,,,12345","12345,1345,1354","123,,,,,,","12345,",",12354")
gsub("^,+|,+$", "", x)
## [1] "12345" "12345,1345,1354" "123" "12345"
## [5] "12354"
You can also use str_extract from stringr. Thanks to greedy matching, you don't have to specify how many times a digit occurs; the longest match is chosen automatically:
library(dplyr)
library(stringr)
df %>%
mutate(V1 = str_extract(V1, "\\d.+\\d"))
or if you prefer base R:
df$V1 = regmatches(df$V1, gregexpr("\\d.+\\d", df$V1))
Result:
V1
1 12345
2 12345,1345,1354
3 123
4 12345
5 12354
Data:
df = read.table(text = ",,,,,,,12345
12345,1345,1354
123,,,,,,
12345,
,12354")
How do I remove part of a string? For example in ATGAS_1121 I want to remove everything before _.
Use regular expressions. In this case, you can use gsub:
gsub("^.*?_","_","ATGAS_1121")
[1] "_1121"
This regular expression matches the beginning of the string (^), any character (.) repeated zero or more times (*), and an underscore (_). The ? makes the match "lazy" so that it only matches as far as the first underscore. That match is replaced with just an underscore. See ?regex for more details and references.
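The effect of the lazy ? only shows up when the string has more than one underscore. A small comparison, using a made-up input with two underscores (not from the question):

```r
x <- "ATGAS_1121_X"   # hypothetical input with two underscores

# Lazy: the match stops at the first underscore
lazy <- sub("^.*?_", "_", x)
lazy
#[1] "_1121_X"

# Greedy: the match runs to the last underscore
greedy <- sub("^.*_", "_", x)
greedy
#[1] "_X"
```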
You can use a built-in for this, strsplit:
> s = "TGAS_1121"
> s1 = unlist(strsplit(s, split='_', fixed=TRUE))[2]
> s1
[1] "1121"
strsplit returns the pieces of the string, split on the split parameter, as a list. That's probably not what you want, so wrap the call in unlist, then index the resulting vector so that only the second of the two elements is returned.
Finally, the fixed parameter should be set to TRUE to indicate that the split parameter is not a regular expression, but a literal matching character.
If you're a Tidyverse kind of person, here's the stringr solution:
R> library(stringr)
R> strings = c("TGAS_1121", "MGAS_1432", "ATGAS_1121")
R> strings %>% str_replace(".*_", "_")
[1] "_1121" "_1432" "_1121"
# Or:
R> strings %>% str_replace("^[A-Z]*", "")
[1] "_1121" "_1432" "_1121"
Here's the strsplit solution if s is a vector:
> s <- c("TGAS_1121", "MGAS_1432")
> s1 <- sapply(strsplit(s, split='_', fixed=TRUE), function(x) (x[2]))
> s1
[1] "1121" "1432"
Probably the most intuitive solution is the stringr function str_remove, which is even easier than str_replace because it takes only a pattern and no replacement argument.
The only tricky part in your example is that you want to keep the underscore, but it's possible: use a lookahead, (?=pattern), so that the match runs up to the specified pattern without including it.
See example:
strings = c("TGAS_1121", "MGAS_1432", "ATGAS_1121")
strings %>% stringr::str_remove(".+?(?=_)")
[1] "_1121" "_1432" "_1121"
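The same lookahead also works in base R, as long as perl = TRUE is set (the default regex engine does not support lookarounds). A sketch with the same input:

```r
strings <- c("TGAS_1121", "MGAS_1432", "ATGAS_1121")
# (?=_) is a lookahead: it requires an underscore next without consuming it
res <- sub(".+?(?=_)", "", strings, perl = TRUE)
res
#[1] "_1121" "_1432" "_1121"
```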
Here is the strsplit solution for a data frame, using the dplyr package:
col1 = c("TGAS_1121", "MGAS_1432", "ATGAS_1121")
col2 = c("T", "M", "A")
df = data.frame(col1, col2)
df
col1 col2
1 TGAS_1121 T
2 MGAS_1432 M
3 ATGAS_1121 A
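library(dplyr)
# make sure col1 is character (it may be a factor in older R versions)
df <- mutate(df, col1 = as.character(col1))
# keep only the part after the underscore
df2 <- mutate(df, col1 = sapply(strsplit(col1, split = '_', fixed = TRUE), function(x) x[2]))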
df2
col1 col2
1 1121 T
2 1432 M
3 1121 A