Regular expression in the separate function of tidyr - r

I need to separate one column into two with tidyr.
The column has text like: I am Sam. The text always has exactly two white spaces, and it can contain any other symbols: [a-z][0-9][!\ºª, etc...].
The problem is that I need to split it into two columns: column one: I am, and column two: Sam.
I can't find a regular expression to separate at the second blank space.
Could you help me, please?

We can use extract from tidyr. We match zero or more characters and place them in a capture group ((.*)), followed by one or more spaces (\\s+) and another capture group that contains only non-white space characters (\\S+), to separate the original column into two columns.
library(tidyr)
extract(df1, Col1, into = c("Col1", "Col2"), "(.*)\\s+(\\S+)")
# Col1 Col2
#1 I am Sam
#2 He is Sam
data
df1 <- data.frame(Col1 = c("I am Sam", "He is Sam"), stringsAsFactors=FALSE)
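To see why this pattern splits at the last run of white space, here is a minimal base R sketch (not part of the answer above) that applies the same regex with sub. The names x, pattern, first and second are only for illustration. The first group (.*) is greedy, so it takes as much as it can and leaves only the final word for (\\S+).

```r
# Same pattern as in the extract() call, applied with base R sub().
x <- c("I am Sam", "He is Sam")
pattern <- "(.*)\\s+(\\S+)"

# Keep only the first capture group (everything before the last space run).
first <- sub(pattern, "\\1", x)
# Keep only the second capture group (the last word).
second <- sub(pattern, "\\2", x)

print(data.frame(Col1 = first, Col2 = second))
```

If the text could ever contain more than two spaces, this would still split at the last one, which may or may not be what you want.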

As an alternative, given:
library(tidyr)
df <- data.frame(txt = "I am Sam")
you can use
separate(df, txt, c("a", "b"), sep="(?<=\\s\\S{1,100})\\s")
# a b
# 1 I am Sam
with separate using stringi::stri_split_regex (ICU engine), or
separate(df, txt, c("a", "b"), sep="^.*?\\s(*SKIP)(*FAIL)|\\s", perl=TRUE)
with the older (?) separate using base::strsplit (Perl engine). See also
strsplit("I am Sam", "^.*?\\s(*SKIP)(*FAIL)|\\s", perl=TRUE)
# [[1]]
# [1] "I am" "Sam"
But it might be a bit "esoterique"...

Related

Extract words enclosed within asterisks in a column in R

I have a dataframe, col1 contains text and within the text there are words enclosed by double asterisks. I want to extract all of these words and put them in another column called col2. If there is more than 1 word enclosed with double asterisks, I would like them to be separated by a comma. col2 in the example shows the desired result.
col1<-c("**sometimes** i code in python",
"I like **walks** in the park",
"I **often** do **exercise**")
col2<-c("**sometimes**","**walks**","**often**,**exercise**")
df<-data.frame(col1, col2, stringsAsFactors = FALSE)
Can anyone suggest a solution?
You may use stringr::str_match_all -
df$col3 <- sapply(stringr::str_match_all(df$col1, '(\\*+.*?\\*+)'),
function(x) toString(x[, 2]))
df
# col1 col2 col3
#1 **sometimes** i code in python **sometimes** **sometimes**
#2 I like **walks** in the park **walks** **walks**
#3 I **often** do **exercise** **often**,**exercise** **often**, **exercise**
* has a special meaning in regex. Here we want to match an actual * so we escape it with \\. We extract all the values which come between 1 or more than 1 *.
str_match_all returns a list of matrices. We are interested in the capture group between (...), which is the 2nd column, hence x[, 2]. Finally, for more than one value we collapse them into one comma-separated string using toString.
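If you would rather not load stringr, the same idea can be sketched in base R with gregexpr and regmatches. This is only an illustration of the technique described above, not the answerer's code. The names hits and col3 are made up, and perl = TRUE is passed so the lazy .*? is handled by the PCRE engine.

```r
col1 <- c("**sometimes** i code in python",
          "I like **walks** in the park",
          "I **often** do **exercise**")

# gregexpr() finds every match; regmatches() pulls the matched text out
# as a list with one character vector per input string.
hits <- regmatches(col1, gregexpr("\\*+.*?\\*+", col1, perl = TRUE))

# Collapse each list element into one comma separated string.
col3 <- vapply(hits, paste, character(1), collapse = ",")
print(col3)
```

Using paste(collapse = ",") instead of toString gives the separator without a trailing space, which matches the col2 format in the question.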
You can use str_extract_all:
library(stringr)
library(dplyr)
df %>%
mutate(col2 = str_extract_all(col1, "\\*\\*[^* ]+\\*\\*"))
col1 col2
1 **sometimes** i code in python **sometimes**
2 I like **walks** in the park **walks**
3 I **often** do **exercise** **often**, **exercise**
How the regex works:
\\*\\* matches two asterisks
[^* ]+ matches any character occurring one or more times which is not a literal * and not a whitespace
\\*\\* matches two asterisks
If you don't need the asterisks in col2, then this is how you can extract the strings without them:
df %>%
mutate(col2 = str_extract_all(col1, "(?<=\\*\\*)[^* ]+(?=\\*\\*)"))
col1 col2
1 **sometimes** i code in python sometimes
2 I like **walks** in the park walks
3 I **often** do **exercise** often, exercise
How this regex works:
(?<=\\*\\*): positive lookbehind asserting that there must be two asterisks to the left
[^* ]+ matches any character occurring one or more times which is not a literal * and not a whitespace
(?=\\*\\*): positive lookahead asserting that there must be two asterisks to the right
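The lookaround version also works outside stringr. Here is a rough base R sketch, assuming you are fine with regmatches and gregexpr; the names inner and words are hypothetical. Note that base R needs perl = TRUE here, because the default regex engine does not support lookbehind or lookahead.

```r
col1 <- c("**sometimes** i code in python",
          "I like **walks** in the park",
          "I **often** do **exercise**")

# The lookbehind and lookahead check for the asterisks without consuming
# them, so only the word itself ends up in the match.
inner <- regmatches(
  col1,
  gregexpr("(?<=\\*\\*)[^* ]+(?=\\*\\*)", col1, perl = TRUE)
)

# One comma separated string per row.
words <- vapply(inner, toString, character(1))
print(words)
```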

finding second space and remove in R

I would like to remove the second space in several names (the one after sp.) in R using tidyverse
My example:
df <- data.frame(x = c("Araceae sp. 22", "Arecaceae sp. 02"))
My desired output
x
Araceae sp.22
Arecaceae sp.02
Any suggestions for me, please?
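We may use sub to capture one or more non-space characters (\\S+), followed by one or more spaces (\\s+) and another run of non-space characters (\\S+). The pattern then matches the spaces that follow, and the whole match is replaced with the backreference to the captured group (\\1), which drops those spaces.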
df$x <- sub("^(\\S+\\s+\\S+)\\s+", "\\1", df$x)
df$x
[1] "Araceae sp.22" "Arecaceae sp.02"
Or we can use str_replace
library(dplyr)
library(stringr)
df %>%
mutate(x = str_replace(x, "^(\\S+\\s+\\S+)\\s+", "\\1"))
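To make the capture group more concrete, here is a small base R sketch that shows what the pattern matches before anything is replaced. It is only an illustration; the names x, pattern, parts and cleaned are made up for this example.

```r
x <- c("Araceae sp. 22", "Arecaceae sp. 02")
pattern <- "^(\\S+\\s+\\S+)\\s+"

# regexec() returns the full match followed by each capture group.
# Element 1 includes the trailing space, element 2 (the group) does not.
parts <- regmatches(x, regexec(pattern, x))
print(parts[[1]])

# Replacing the full match with group 1 drops only the second space.
cleaned <- sub(pattern, "\\1", x)
print(cleaned)
```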

Split String with second (single) Backslash / R Emojis (Unicode) without Modifier

I have a tribble with a chr column that contains the Unicode escapes for emojis. I want to split these strings into two columns where needed, i.e. if there is more than one backslash in the whole string. So I need to split at the 2nd backslash. It would also be enough to just delete everything from the 2nd backslash on.
Here is what I tried:
df <- tibble::tribble(
~RUser, ~REmoji,
"User1", "\U0001f64f\U0001f3fb",
"User2", "\U0001f64f",
"User2", "\U0001f64f\U0001f3fc"
)
df %>% mutate(newcol = gsub("\\\\*", "", REmoji))
I found the solution Replace single backslash in R. But in my case I have only one backslash, and I don't understand how to separate the column here.
The result should look like this output:
df2 <- tibble::tribble(
~RUser, ~REmoji1, ~newcol,
"User1", "\U0001f64f", "\U0001f3fb",
"User2", "\U0001f64f", "", #This Field is empty, since there was no Emoji-Modification
"User2", "\U0001f64f", "\U0001f3fc"
)
Thanks a lot!
We could also use substring from base R
df$newcol <- substring(df$REmoji, 2)
Note these \U... are single Unicode code points, not just a backslash + digits/letters.
Using the ^. PCRE regex with sub provides the expected results:
> df %>% mutate(newcol = sub("^.", "", REmoji, perl=TRUE))
# A tibble: 3 x 3
RUser REmoji newcol
<chr> <chr> <chr>
1 User1 "\U0001f64f\U0001f3fb" "\U0001f3fb"
2 User2 "\U0001f64f" ""
3 User2 "\U0001f64f\U0001f3fc" "\U0001f3fc"
Make sure you pass the perl=TRUE argument.
And in order to do the reverse, i.e. keep the first code point only, you can use:
df %>% mutate(newcol = sub("^(.).+", "\\1", REmoji, perl=TRUE))
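A quick way to convince yourself that these strings hold code points rather than backslashes is to count the characters. The sketch below is only an illustration and assumes a UTF-8 session; the variable names are made up.

```r
emoji <- "\U0001f64f\U0001f3fb"

# Two characters (code points), not a backslash followed by letters.
n_chars <- nchar(emoji, type = "chars")
code_points <- utf8ToInt(emoji)
print(n_chars)
print(code_points)

# Split into the base emoji and the modifier by character position.
base_emoji <- substr(emoji, 1, 1)
modifier <- substring(emoji, 2)
```

This is why substring(df$REmoji, 2) and the ^. regex both work: each operates on whole characters, and there is no literal backslash in the data to split on.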

extracting names and numbers using regex

I think I might have some issues with understanding the regular expressions in R.
I need to extract phone numbers and names from a sample vector and create a data-frame with corresponding columns for names and numbers using stringr package functionality.
The following is my sample vector.
phones <- c("Ann 077-789663", "Johnathan 99656565",
"Maria2 099-65-6569 office")
The code that I came up with to extract those is as follows
numbers <- str_remove_all(phones, pattern = "[^0-9]")
numbers <- str_remove_all(numbers, pattern = "[a-zA-Z]")
numbers <- trimws(numbers)
names <- str_remove_all(phones, pattern = "[A-Za-z]+", simplify = T)
phones_data <- data.frame("Name" = names, "Phone" = numbers)
It doesn't work, as it takes the digit in the name and joins with the phone number. (not optimal code as well)
I would appreciate some help in explaining the simplest way for accomplishing this task.
Not a regex expert, however with stringr package we can extract a number pattern with optional "-" in it and replace the "-" with empty string to extract numbers without any "-". For names, we extract the first word at the beginning of the string.
library(stringr)
data.frame(Name = str_extract(phones, "^[A-Za-z]+"),
Number = gsub("-","",str_extract(phones, "[0-9]+[-]?[0-9]+[-]?[0-9]+")))
# Name Number
#1 Ann 077789663
#2 Johnathan 99656565
#3 Maria 099656569
If you want to stick completely with stringr we can use str_replace_all instead of gsub
data.frame(Name = str_extract(phones, "[A-Za-z]+"),
Number=str_replace_all(str_extract(phones, "[0-9]+[-]?[0-9]+[-]?[0-9]+"), "-",""))
# Name Number
#1 Ann 077789663
#2 Johnathan 99656565
#3 Maria 099656569
I think Ronak's answer is good for the name part, I don't really have a good alternative to offer there.
For numbers, I would go with "numbers and hyphens, with a word boundary at either end", i.e.
numbers = str_extract(phones, "\\b[-0-9]+\\b") %>%
str_remove_all("-")
# Can also specify that you need at least 5 numbers/hyphens
# in a row to match
numbers2 = str_extract(phones, "\\b[-0-9]{5,}\\b") %>%
str_remove_all("-")
That way, you're not locked into a fixed format for the number of hyphens that appear in the number (my suggested regex allows for any number).
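For comparison, the word boundary idea can be sketched in base R as well. This is not the answerer's code, just an illustration under the assumption that every element contains a number; regmatches with regexpr silently drops elements that have no match. The names raw, numbers and names_only are hypothetical.

```r
phones <- c("Ann 077-789663", "Johnathan 99656565",
            "Maria2 099-65-6569 office")

# First run of digits/hyphens that starts and ends on a word boundary.
# The 2 in "Maria2" is skipped because there is no boundary between a and 2.
raw <- regmatches(phones, regexpr("\\b[-0-9]+\\b", phones, perl = TRUE))
numbers <- gsub("-", "", raw, fixed = TRUE)

# Leading letters only, so the 2 is also dropped from the name.
names_only <- regmatches(phones, regexpr("^[A-Za-z]+", phones))

print(data.frame(Name = names_only, Phone = numbers))
```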
If you (like me) prefer to use base-R and want to keep the regex as simple as possible you could do something like this:
phone_split <- lapply(
strsplit(phones, " "),
function(x) {
name_part <- grepl("[^-0-9]", x)
c(
name = paste(x[name_part], collapse = " "),
phone = x[!name_part]
)
}
)
phone_split
[[1]]
name phone
"Ann" "077-789663"
[[2]]
name phone
"Johnathan" "99656565"
[[3]]
name phone
"Maria2 office" "099-65-6569"
do.call(rbind, phone_split)
name phone
[1,] "Ann" "077-789663"
[2,] "Johnathan" "99656565"
[3,] "Maria2 office" "099-65-6569"

Remove part of a string

How do I remove part of a string? For example in ATGAS_1121 I want to remove everything before _.
Use regular expressions. In this case, you can use gsub:
gsub("^.*?_","_","ATGAS_1121")
[1] "_1121"
This regular expression matches the beginning of the string (^), any character (.) repeated zero or more times (*), and an underscore (_). The ? makes the match "lazy" so that it only matches as far as the first underscore. That match is replaced with just an underscore. See ?regex for more details and references.
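The effect of the lazy ? only shows up when there is more than one underscore. Here is a small sketch with a made-up input, "AT_GAS_1121", that is not from the question. perl = TRUE is added here just to pin the regex engine for the example.

```r
s <- "AT_GAS_1121"  # hypothetical input with two underscores

# Lazy: stops at the first underscore.
lazy <- sub("^.*?_", "_", s, perl = TRUE)
# Greedy: runs on to the last underscore.
greedy <- sub("^.*_", "_", s, perl = TRUE)

print(lazy)
print(greedy)
```

With a single underscore, as in ATGAS_1121, both versions give the same result.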
You can use a built-in for this, strsplit:
> s = "TGAS_1121"
> s1 = unlist(strsplit(s, split='_', fixed=TRUE))[2]
> s1
[1] "1121"
strsplit returns the pieces of the string, split on the split parameter, as a list. That's probably not what you want, so wrap the call in unlist, then index the resulting vector so that only the second of the two elements is returned.
Finally, the fixed parameter should be set to TRUE to indicate that the split parameter is not a regular expression, but a literal matching character.
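For an underscore the fixed argument makes little practical difference, but it matters as soon as the separator is a regex metacharacter. A minimal sketch, using a made-up string "v1.2" split on a dot:

```r
s <- "v1.2"  # hypothetical input, split on a literal dot

# fixed = TRUE treats "." as a plain character.
parts_fixed <- strsplit(s, ".", fixed = TRUE)[[1]]
print(parts_fixed)

# Without fixed = TRUE, "." is a regex that matches any character,
# so the result is not the two pieces you probably expected.
parts_regex <- strsplit(s, ".")[[1]]
print(parts_regex)
```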
If you're a Tidyverse kind of person, here's the stringr solution:
R> library(stringr)
R> strings = c("TGAS_1121", "MGAS_1432", "ATGAS_1121")
R> strings %>% str_replace(".*_", "_")
[1] "_1121" "_1432" "_1121"
# Or:
R> strings %>% str_replace("^[A-Z]*", "")
[1] "_1121" "_1432" "_1121"
Here's the strsplit solution if s is a vector:
> s <- c("TGAS_1121", "MGAS_1432")
> s1 <- sapply(strsplit(s, split='_', fixed=TRUE), function(x) (x[2]))
> s1
[1] "1121" "1432"
Probably the most intuitive solution is the stringr function str_remove, which is even easier than str_replace because it needs only the pattern and no replacement argument.
The only tricky part in your example is that you want to keep the underscore, but it's possible: match up to, but not including, the underscore by using a positive lookahead, (?=pattern).
See example:
strings = c("TGAS_1121", "MGAS_1432", "ATGAS_1121")
strings %>% stringr::str_remove(".+?(?=_)")
[1] "_1121" "_1432" "_1121"
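The same lookahead also works in base R, as long as perl = TRUE is set, because the default engine does not support lookarounds. A rough equivalent, with kept as a made-up variable name:

```r
strings <- c("TGAS_1121", "MGAS_1432", "ATGAS_1121")

# Remove everything up to, but not including, the first underscore.
# The lookahead (?=_) checks for the underscore without consuming it.
kept <- sub(".+?(?=_)", "", strings, perl = TRUE)
print(kept)
```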
Here is the strsplit solution for a data frame, using the dplyr package:
library(dplyr)
col1 = c("TGAS_1121", "MGAS_1432", "ATGAS_1121")
col2 = c("T", "M", "A")
df = data.frame(col1, col2)
df
col1 col2
1 TGAS_1121 T
2 MGAS_1432 M
3 ATGAS_1121 A
df<-mutate(df,col1=as.character(col1))
df2<-mutate(df,col1=sapply(strsplit(df$col1, split='_', fixed=TRUE),function(x) (x[2])))
df2
col1 col2
1 1121 T
2 1432 M
3 1121 A
