Finding and removing the second space in R

I would like to remove the second space (the one after "sp.") from several names in R using the tidyverse.
My example:
df <- data.frame(x = c("Araceae sp. 22", "Arecaceae sp. 02"))
My desired output
x
Araceae sp.22
Arecaceae sp.02
Any suggestions for me, please?

We may use sub to capture one or more non-space characters (\\S+), followed by one or more spaces (\\s+) and another run of non-space characters, then match the spaces that follow and replace the whole match with the backreference to the captured group (\\1).
df$x <- sub("^(\\S+\\s+\\S+)\\s+", "\\1", df$x)
df$x
[1] "Araceae sp.22" "Arecaceae sp.02"
Or we can use str_replace
library(dplyr)
library(stringr)
df %>%
mutate(x = str_replace(x, "^(\\S+\\s+\\S+)\\s+", "\\1"))
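As a quick check of the pattern above, here is a minimal base R sketch on a slightly extended vector. The third element ("Poaceae") is made up for illustration and is only there to show that strings without a second space should come back untouched.

```r
# Sample names; "Poaceae" is a hypothetical extra entry with no spaces at all
x <- c("Araceae sp. 22", "Arecaceae sp. 02", "Poaceae")

# ^(\\S+\\s+\\S+) captures "word<space>word" at the start of the string;
# the trailing \\s+ is the second run of spaces, which is dropped by
# replacing the whole match with the captured group only
res <- sub("^(\\S+\\s+\\S+)\\s+", "\\1", x)
print(res)
```

Because the pattern is anchored with ^, at most one match per string is possible, so sub is sufficient here.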

Related

select string after second space

I have a column containing different names, and I would like to get everything after the second space in each string.
My example.
df <- data.frame(col = c("Adenia macrophylla", "Adinobotrys atropurpureus (Wall.) Dunn", "Ardisia purpurea Reinw. ex Blume"))
My desired outcome like this
col
1
2 (Wall.) Dunn
3 Reinw. ex Blume
Any suggestions for me? What I did before was to separate the columns and unite them again, but I wonder whether there is a neater or better way to do it, since I already have many columns.
Update
I just solved it:
xx %>%
mutate(col = str_pad(col, 20,"right")) %>%
mutate(col = str_remove(col, '\\w+\\s\\w+\\s'))
Thanks @Ronak and @U12-Forward for providing me with the regex.
You may use sub -
sub('\\w+\\s\\w+\\s', '', df$col)
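#[1] "Adenia macrophylla" "(Wall.) Dunn" "Reinw. ex Blume"
#The first element has no space after its second word, so nothing matches and it is returned unchanged.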
#Also
#sub('.*?\\s.*?\\s', '', df$col)
If you want a tidyverse answer.
library(dplyr)
library(stringr)
df %>% mutate(val = str_remove(col, '\\w+\\s\\w+\\s'))
If you want to select the string after the n-th space, it may be better to use a repetition quantifier in sub.
sub("([^ ]* ){2}(.*)|.*", "\\2", df$col)
#sub("([^ ]* ){2}|.*", "", df$col, perl=TRUE) #Alternative
#[1] "" "(Wall.) Dunn" "Reinw. ex Blume"
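[^ ]* matches a run of zero or more non-space characters and is followed by a literal space; {2} repeats that group twice; (.*) then captures the rest. If the string has fewer than two spaces, the alternative .* matches the whole string instead, so \\2 is empty.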
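The repetition idea can be wrapped in a small function so that n is a parameter. This is only a sketch: the helper name after_nth_space is hypothetical, and it assumes the separator is a single literal space, as in the pattern above.

```r
# Hypothetical helper: keep what follows the n-th space, or "" if there are fewer
after_nth_space <- function(x, n) {
  # Builds "^(?:[^ ]* ){n}|.*": either n groups of "non-spaces + space" at the
  # start of the string, or (as a fallback) the whole string
  pattern <- sprintf("^(?:[^ ]* ){%d}|.*", as.integer(n))
  # perl = TRUE so the alternation is tried left to right
  sub(pattern, "", x, perl = TRUE)
}

col <- c("Adenia macrophylla",
         "Adinobotrys atropurpureus (Wall.) Dunn",
         "Ardisia purpurea Reinw. ex Blume")

after_nth_space(col, 2)
after_nth_space(col, 1)
```

With n = 2 this should reproduce the result shown above, including the empty string for the two-word name.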
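Or use this regex, which removes the first two words together with the whitespace that follows them (without the trailing \\s* the results would keep a leading space):
df$col <- sub('^\\S+\\s+\\S+\\s*', '', df$col)
Output df:
> df
              col
1
2    (Wall.) Dunn
3 Reinw. ex Blume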

Replacing multiple punctuation marks in a column of data

Column in a df:
chr10:123453:A:C
chr10:2345543:TTTG:CG
chr10:3454757:G:C
chr10:4567875765:C:G
Desired output:
chr10:123453_A/C
chr10:2345543_TTTG/CG
chr10:3454757_G/C
chr10:4567875765_C/G
I think I could use strsplit, but I wanted to try to do it all in an R one-liner. Any ideas would be greatly welcome!
Try this:
gsub(":([A-Z]+):([A-Z]+)$", "_\\1/\\2", x, perl = TRUE)
[1] "chr10:123453_A/C" "chr10:2345543_TTTG/CG"
Here we use two backreferences: \\1 refers to what is between the penultimate and the last :, whereas \\2 refers to what comes after the last :.
Data:
x <- c("chr10:123453:A:C","chr10:2345543:TTTG:CG")
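For comparison, here is a sketch of the strsplit route mentioned in the question, checked against the regex answer on all four sample values. It assumes every string has exactly four colon-separated fields; the variable names are only illustrative.

```r
x <- c("chr10:123453:A:C", "chr10:2345543:TTTG:CG",
       "chr10:3454757:G:C", "chr10:4567875765:C:G")

# Split each string on the literal ":" into its four fields
parts <- strsplit(x, ":", fixed = TRUE)

# Reassemble as field1:field2_field3/field4
by_split <- vapply(
  parts,
  function(p) sprintf("%s:%s_%s/%s", p[1], p[2], p[3], p[4]),
  character(1)
)

# The regex version; sub is enough because the pattern is anchored with $
by_regex <- sub(":([A-Z]+):([A-Z]+)$", "_\\1/\\2", x)

print(by_split)
identical(by_split, by_regex)
```

The regex version is shorter, while the split version makes the individual fields available if they are needed for anything else.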

Apply a regex only to the first word of a phrase (defined with spaces)

I have this regex to separate the letters from the numbers (and symbols) of a word: (?<=[a-zA-Z])(?=([[0-9]|[:punct:]])). My test string is: "CALLE15 CRA22".
I want to apply this regex only to the first word of that sentence (words are delimited by spaces). Namely, I want to apply it only to "CALLE15".
One solution is to split the string (sentence) into words and then apply the regex to the first word, but I want to do it all in one regex. Another solution is to use stringr::str_replace() (or sub()), which replaces only the first match, but I need stringr::str_replace_all() (or gsub()) for other reasons.
What I need is to insert a space between the two parts, which I do with the replacement function. The outcome I want is "CALLE 15 CRA22", with the possibility of also getting "CALLE15 CRA 22". I have tried many positions for the space, and also the ^ at the beginning, without success.
https://rubular.com/r/7dxsHdOA3avTdX
Thanks for your help!!!!
I am unsure about your problem statement (see my comment above), but the following reproduces your expected output and uses str_replace_all
ss <- "CALLE15 CRA22"
library(stringr)
str_replace_all(ss, "^([A-Za-z]+)(\\d+)(\\s.+)$", "\\1 \\2\\3")
#[1] "CALLE 15 CRA22"
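The same anchored pattern can be tried in base R on a few more inputs. This is a sketch only: the second and third strings are invented test cases, not taken from the question.

```r
# "AV7 SUR3" and "CALLE CRA22" are hypothetical extra inputs
ss <- c("CALLE15 CRA22", "AV7 SUR3", "CALLE CRA22")

# Group 1: leading letters, group 2: the digits that follow them,
# group 3: the rest of the string, starting at the first whitespace
pat <- "^([A-Za-z]+)(\\d+)(\\s.+)$"

res <- sub(pat, "\\1 \\2\\3", ss, perl = TRUE)
print(res)

# The pattern is anchored at both ends, so it can match at most once per
# string and gsub gives the same result as sub
identical(res, gsub(pat, "\\1 \\2\\3", ss, perl = TRUE))
```

Strings whose first word has no digits directly after the letters do not match and are returned unchanged.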
Update
To reproduce the output of the sample string from the comment above
ss <- "CLL.6 N 5-74NORTE"
pat <- c(
"(?<=[A-Za-z])(?![A-Za-z])",
"(?<![A-Za-z])(?=[A-Za-z])",
"(?<=[0-9])(?![0-9])",
"(?<![0-9])(?=[0-9])")
library(stringr)
str_split(ss, sprintf("(%s)", paste(pat, collapse = "|"))) %>%
unlist() %>%
.[nchar(trimws(.)) > 0] %>%
paste(collapse = " ")
#[1] "CLL . 6 N 5 - 74 NORTE"

substring replace nth positions R

I need to replace positions 6 to 8 with a single "_". I passed the start and stop positions to substring, but it didn't work:
> a=c("UHI786KJRH2V", "TYR324FHASJKDG","DHA927NFSYFN34")
> substring(a, 6,8) <- "_"
> a
[1] "UHI78_KJRH2V" "TYR32_FHASJKDG" "DHA92_NFSYFN34"
What I need is: UHI78_RH2V TYR32_ASJKDG DHA92_SYFN34
Using sub, we can match on the pattern (?<=^.{5}).{3}, and then replace it by a single underscore:
a <- c("UHI786KJRH2V", "TYR324FHASJKDG","DHA927NFSYFN34")
out <- sub("(?<=^.{5}).{3}", "_", a, perl=TRUE)
out
[1] "UHI78_RH2V" "TYR32_ASJKDG" "DHA92_SYFN34"
We could also try doing substring operations here, but we would have to do some splicing:
out <- paste0(substr(a, 1, 5), "_", substr(a, 9, nchar(a)))
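The splicing approach generalizes to any span. A minimal sketch, where replace_span is a hypothetical helper name and positions are 1-based and inclusive, as in substr:

```r
# Hypothetical helper: replace characters start..stop of each string
# with `replacement`, whatever its length
replace_span <- function(x, start, stop, replacement) {
  paste0(
    substr(x, 1, start - 1),        # everything before the span
    replacement,
    substr(x, stop + 1, nchar(x))   # everything after the span
  )
}

a <- c("UHI786KJRH2V", "TYR324FHASJKDG", "DHA927NFSYFN34")
out <- replace_span(a, 6, 8, "_")
print(out)
```

Unlike the `substring<-` assignment, this does not require the replacement to have the same length as the span it replaces.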
1) str_sub<- The str_sub<- replacement function in the stringr package can do that.
library(stringr)
str_sub(a, 6, 8) <- "_"
a
## [1] "UHI78_RH2V" "TYR32_ASJKDG" "DHA92_SYFN34"
2) Base R With only base R you could do this. It replaces the entire string with the match to the first capture group, an underscore and the match to the second capture group.
sub("(.....)...(.*)", "\\1_\\2", a)
## [1] "UHI78_RH2V" "TYR32_ASJKDG" "DHA92_SYFN34"
That regex could also be written as "(.{5}).{3}(.*)" .
3) separate/unite If a is a column in a data frame then we could use dplyr and tidyr to do this:
library(dplyr)
library(tidyr)
DF <- data.frame(a)
DF %>%
separate(a, into = c("pre", "junk", "post"), sep = c(5, 8)) %>%
select(-junk) %>%
unite(a)
giving:
a
1 UHI78_RH2V
2 TYR32_ASJKDG
3 DHA92_SYFN34
From the documentation:
If the portion to be replaced is longer than the replacement string, then only the portion the length of the string is replaced.
So we could do something like this:
substring(a, 6,8) <- "_##"
sub("#+", "", a)
[1] "UHI78_RH2V" "TYR32_ASJKDG" "DHA92_SYFN34"
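To see the documented behaviour side by side, here is a small sketch on a single string. It assumes that "#" does not otherwise occur in the data, since sub removes the first run of "#" it finds.

```r
a <- "UHI786KJRH2V"

# Replacement shorter than the 3-character span: only position 6 is overwritten
short <- a
substring(short, 6, 8) <- "_"
print(short)

# Replacement padded to the full span length, then the padding is stripped
full <- a
substring(full, 6, 8) <- "_##"
print(full)
cleaned <- sub("#+", "", full)
print(cleaned)
```

The first result matches the unwanted output from the question, which is why the padding trick (or one of the other approaches) is needed.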

Regular expression on separate function of Tidyr

I need to separate a column into two columns with tidyr.
The column has text like: I am Sam. That is, the text always has exactly two white spaces, and it can contain any other symbols: [a-z][0-9][!\ºª, etc.].
The problem is that I need to split it into two columns: column one: I am, and column two: Sam.
I can't find a regular expression to separate at the second blank space.
Could you help me please?
We can use extract from tidyr. We match zero or more characters in a greedy capture group ((.*)), followed by one or more whitespace characters (\\s+) and a second capture group that contains only non-whitespace characters (\\S+), to separate the original column into two columns. Because (.*) is greedy, the split happens at the last run of whitespace.
library(tidyr)
extract(df1, Col1, into = c("Col1", "Col2"), "(.*)\\s+(\\S+)")
# Col1 Col2
#1 I am Sam
#2 He is Sam
data
df1 <- data.frame(Col1 = c("I am Sam", "He is Sam"), stringsAsFactors=FALSE)
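For readers without tidyr, the same two capture groups can be pulled out with base R. This is a sketch of an equivalent, not what extract does internally; the pattern is anchored here so that the whole string is consumed by sub.

```r
df1 <- data.frame(Col1 = c("I am Sam", "He is Sam"), stringsAsFactors = FALSE)

# Greedy (.*) takes everything up to the last run of whitespace,
# (\\S+) takes the final word
pattern <- "^(.*)\\s+(\\S+)$"

res <- data.frame(
  Col1 = sub(pattern, "\\1", df1$Col1, perl = TRUE),
  Col2 = sub(pattern, "\\2", df1$Col1, perl = TRUE),
  stringsAsFactors = FALSE
)
print(res)
```

Strings that do not match the pattern (for example, a single word) would be returned unchanged in both columns, so they may need separate handling.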
As an alternative, given:
library(tidyr)
df <- data.frame(txt = "I am Sam")
you can use
separate(df, txt, c("a", "b"), sep="(?<=\\s\\S{1,100})\\s")
# a b
# 1 I am Sam
with separate using stringi::stri_split_regex (ICU engine), or
separate(df, txt, c("a", "b"), sep="^.*?\\s(*SKIP)(*FAIL)|\\s", perl=TRUE)
with the older (?) separate using base::strsplit (Perl engine). See also
strsplit("I am Sam", "^.*?\\s(*SKIP)(*FAIL)|\\s", perl=TRUE)
# [[1]]
# [1] "I am" "Sam"
But it might be a bit "esoterique"...

Resources