Separate columns with different structure - r

I need to extract the surname and the city from a column.
The surname is always the first part of the value and is followed by ",".
The city is the last part of the value and is followed by ".".
Rows don't share a common structure, so I want to extract the part before the first comma, and the last part: the text before the final dot (.) and after the last space.
I tried:
df<-separate(df, Name, into=c("v1","v2", "v3", "v4"), sep=",")
v1 looks fine (it is the surname), but I can't separate the city (the last part of the column).
Please help me separate the surname into one column and the city into another.

We can use tidyr::extract, and specify two capture groups in the regex - one for surname (beginning of string, followed by a word) (^\\w+\\b), the other for city (one or two words that follow a comma and a space (, ) followed by a literal dot (.) and the end of the string) ((?<=, )\\w+ *\\w+(?=\\.$)):
library(tidyr)
df %>% extract(Name, into = c("Surname", "City"), regex = "(^\\w+\\b).*((?<=, )\\w+ *\\w+(?=\\.$))", remove = FALSE)
We can also extract the words (\\w+) into a list, then take the first and last elements of the list (this only works if every city is a single word):
library(dplyr)
library(tidyr)
library(stringr)
library(purrr) # provides map()
df %>% mutate(output=str_extract_all(Name, "\\w+") %>%
map(~list(surname=first(.x), city=last(.x))))%>%
unnest_wider(output)
output
# A tibble: 2 x 3
Name Surname City
<chr> <chr> <chr>
1 Ivanov, Petr, Ivanovich, Novosibirsk. Ivanov Novosibirsk
2 Lipenko, Daria, Nizhniy Novgorod. Lipenko Nizhniy Novgorod
data
df<-tibble(Name=c("Ivanov, Petr, Ivanovich, Novosibirsk.", "Lipenko, Daria, Nizhniy Novgorod."))
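For comparison, here is a minimal base R sketch of the same idea that avoids lookarounds. It assumes the city is everything after the last comma and that every value ends with a dot; the variable names name, surname and city are illustrative only.

```r
name <- c("Ivanov, Petr, Ivanovich, Novosibirsk.",
          "Lipenko, Daria, Nizhniy Novgorod.")

# Surname: drop everything from the first comma onwards
surname <- sub(",.*$", "", name)

# City: the greedy ^.*, consumes up to the last comma; the capture group
# keeps what lies between that comma and the trailing dot
city <- sub("^.*,\\s*(.*)\\.$", "\\1", name)

data.frame(surname, city)
```

Because the split is anchored on the last comma rather than the last space, two-word cities such as Nizhniy Novgorod stay intact.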

Related

Extract words enclosed within asterisks in a column in R

I have a dataframe, col1 contains text and within the text there are words enclosed by double asterisks. I want to extract all of these words and put them in another column called col2. If there is more than 1 word enclosed with double asterisks, I would like them to be separated by a comma. col2 in the example shows the desired result.
col1<-c("**sometimes** i code in python",
"I like **walks** in the park",
"I **often** do **exercise**")
col2<-c("**sometimes**","**walks**","**often**,**exercise**")
df<-data.frame(col1, col2, stringsAsFactors = FALSE)
Can anyone suggest a solution?
You may use stringr::str_match_all -
df$col3 <- sapply(stringr::str_match_all(df$col1, '(\\*+.*?\\*+)'),
function(x) toString(x[, 2]))
df
# col1 col2 col3
#1 **sometimes** i code in python **sometimes** **sometimes**
#2 I like **walks** in the park **walks** **walks**
#3 I **often** do **exercise** **often**,**exercise** **often**, **exercise**
* has a special meaning in regex. Here we want to match an actual * so we escape it with \\. We extract all the values which come between 1 or more than 1 *.
str_match_all returns a list of matrices. We are interested in the capture group between (...), which is the 2nd column, hence x[, 2]. Finally, when there is more than one value, we collapse them into one comma-separated string using toString.
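If stringr is not available, roughly the same result can be sketched in base R with gregexpr() and regmatches(). The pattern below is a simplified assumption (two asterisks, any run of non-asterisk characters, two asterisks), and col3 is just an illustrative name.

```r
col1 <- c("**sometimes** i code in python",
          "I like **walks** in the park",
          "I **often** do **exercise**")

# gregexpr() locates every match; regmatches() pulls them out as a list
hits <- regmatches(col1, gregexpr("\\*\\*[^*]+\\*\\*", col1))

# Collapse each list element into one comma-separated string
col3 <- vapply(hits, paste, character(1), collapse = ",")
col3
```

Using paste(collapse = ",") instead of toString gives the separator without a trailing space, which matches the col2 format in the question.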
You can use str_extract_all:
library(stringr)
library(dplyr)
df %>%
mutate(col2 = str_extract_all(col1, "\\*\\*[^* ]+\\*\\*"))
col1 col2
1 **sometimes** i code in python **sometimes**
2 I like **walks** in the park **walks**
3 I **often** do **exercise** **often**, **exercise**
How the regex works:
\\*\\* matches two asterisks
[^* ]+ matches one or more characters that are neither a literal * nor a space
\\*\\* matches two asterisks
If you don't need the asterisks in col2, then this is how you can extract the strings without them:
df %>%
mutate(col2 = str_extract_all(col1, "(?<=\\*\\*)[^* ]+(?=\\*\\*)"))
col1 col2
1 **sometimes** i code in python sometimes
2 I like **walks** in the park walks
3 I **often** do **exercise** often, exercise
How this regex works:
(?<=\\*\\*): positive lookbehind asserting that there must be two asterisks to the left
[^* ]+ matches one or more characters that are neither a literal * nor a space
(?=\\*\\*): positive lookahead asserting that there must be two asterisks to the right
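The same lookaround pattern can also be tried in base R, as a rough sketch. The one assumption worth flagging is that base R needs perl = TRUE for lookbehind and lookahead; the names inner and col2 are illustrative.

```r
col1 <- c("**sometimes** i code in python",
          "I like **walks** in the park",
          "I **often** do **exercise**")

# perl = TRUE switches to the PCRE engine, which supports (?<=...) and (?=...)
inner <- regmatches(col1,
                    gregexpr("(?<=\\*\\*)[^* ]+(?=\\*\\*)", col1, perl = TRUE))

# One string per row; multiple hits are joined with ", "
col2 <- vapply(inner, toString, character(1))
col2
```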

Selecting longest string from each value of table column

Hello everyone I hope you guys are having a good one,
I have the following data frame:
ID TX                                        GROUP
HUDJDUDOOD--BANNK2--OLDODOLD985555545UIJF    1
UJDID YUH23498 IDX09                         2
854 UIJSAZXC                                 3
I would like to extract the longest string from each value in the ID TX column. Each cell may contain several strings or just one, and they may be separated by punctuation such as "," or "--", etc., or by a space " ".
My idea is to first replace the punctuation with a space " ", then split each cell on " ", then calculate the length of each piece, perhaps with nchar() or str_length(), and select the piece with the greatest length. I have not managed to do this yet, because after splitting I don't know at which index the longest string will be. My desired output would be:
OUTPUT
OLDODOLD985555545UIJF
YUH23498
UIJSAZXC
sidenote: no worries there will not be ties.
Thank you so much guys for your help I will be very alert to award you for your response!
# Your data
dat <- structure(list(ID_TX = c("HUDJDUDOOD--BANNK2--OLDODOLD985555545UIJF",
"UJDID YUH23498 IDX09", "854 UIJSAZXC"), GROUP = 1:3), class = "data.frame", row.names = c(NA,
-3L))
# Splitting strings in the data
spl <- strsplit(dat$ID_TX, "--|\\s")
# Identify the position of the longest string in each row
idx <- spl|> lapply(nchar) |> lapply(which.max) |> unlist()
# Select the longest string and bind them to a data.frame
mapply(function(x,y) spl[[x]][y], seq_along(idx),idx) |>
as.data.frame() |>
setNames("OUTPUT")
# The result
# OUTPUT
#1 OLDODOLD985555545UIJF
#2 YUH23498
#3 UIJSAZXC
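A more compact variant of the same steps, offered as a sketch: it assumes that any run of punctuation or whitespace counts as a separator (broader than the "--|\\s" used above), and the names id_tx, pieces and longest are illustrative.

```r
id_tx <- c("HUDJDUDOOD--BANNK2--OLDODOLD985555545UIJF",
           "UJDID YUH23498 IDX09",
           "854 UIJSAZXC")

# Split on any run of punctuation or whitespace
pieces <- strsplit(id_tx, "[[:punct:][:space:]]+")

# For each row, keep the piece with the most characters
# (which.max returns the first maximum, so ties go to the earliest piece)
longest <- vapply(pieces, function(p) p[which.max(nchar(p))], character(1))

data.frame(OUTPUT = longest)
```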

split cell at special character if comma found after first word

Hi, I've got some budget data with names and titles that read "Last, First - Title", and other rows in the same column that read "anything really - ,asd;flkajsd". I'd like to split the column at the "-" only IF the first word ends in a ",".
I've tried this:
C22$ITEM2 <- ifelse(grepl(",", C22$ITEM), C22$ITEM, NA)
test <- str_split_fixed(C22$ITEM2, "-", 2)
C22 <- cbind(C22, test)
but this also picks up cells with commas elsewhere; I need to limit it to just "if first word ends in comma".
library(tidyverse)
data <- tibble(data = c("Doe, John - Mr", "Anna, Anna - Ms", " ,asd;flkajsd"))
data
data %>%
# first word must end with a comma
filter(data %>% str_detect("^[A-Za-z]+,")) %>%
separate(data, into = c("Last", "First", "Title"), sep = "[,-]") %>%
mutate_all(str_trim)
# A tibble: 2 × 3
# Last First Title
# <chr> <chr> <chr>
#1 Doe John Mr
#2 Anna Anna Ms
We may use extract for this. The regex has two capture groups ((...)): the first captures a word (\\w+) at the start (^) of the string followed by a ",", zero or more spaces (\\s*), and another word (\\w+); then comes the "-", with zero or more spaces on either side; the second capture group holds the word (\\w+) before the end ($) of the string.
library(tidyr)
library(dplyr)
extract(C22, ITEM, into = c("Name", "Title"),
"^(\\w+,\\s*\\w+)\\s*-\\s*(\\w+)$", remove = FALSE) %>%
mutate(Name = coalesce(Name, ITEM), .keep = 'unused')
NOTE: remove = FALSE keeps ITEM so that mutate can still refer to it. Where the regex does not match, extract returns NA; coalesce then fills Name with the original ITEM value, and .keep = 'unused' drops ITEM afterwards.
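The "split only if the first word ends in a comma" rule can also be sketched in base R. This is an illustration under stated assumptions: the sample items are made up, the split happens at the first "-", and names that themselves contain a hyphen would need a stricter pattern.

```r
items <- c("Doe, John - Director",
           "anything really - ,asd;flkajsd",
           "Office supplies, misc - bulk")

# TRUE only when the very first word is immediately followed by a comma
is_name <- grepl("^\\w+,", items)

# Matching rows: text before the first "-" is the name, text after is the title.
# Non-matching rows are left untouched and get an NA title.
name  <- ifelse(is_name, sub("\\s*-\\s*.*$", "", items), items)
title <- ifelse(is_name, sub("^[^-]*-\\s*", "", items), NA_character_)

data.frame(name, title)
```

The third item has a comma, but not after its first word, so it is not split.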

Split String with second (single) Backslash / R Emojis (Unicode) without Modifier

I have a tibble with a chr column that contains the Unicode escapes for emojis. I want to split these strings into two columns where needed, i.e. when the string contains more than one backslash. So I need to split at the 2nd backslash. It would also be enough to just delete everything from the 2nd backslash on.
Here is what I tried:
df <- tibble::tribble(
~RUser, ~REmoji,
"User1", "\U0001f64f\U0001f3fb",
"User2", "\U0001f64f",
"User2", "\U0001f64f\U0001f3fc"
)
df %>% mutate(newcol = gsub("\\\\*", "", REmoji))
I found the solution Replace single backslash in R. But in my case I have only one backslash, and I don't understand how to separate the column here.
The result should look like this output:
df2 <- tibble::tribble(
~RUser, ~REmoji1, ~newcol,
"User1", "\U0001f64f", "\U0001f3fb",
"User2", "\U0001f64f", "", #This Field is empty, since there was no Emoji-Modification
"User2", "\U0001f64f", "\U0001f3fc"
)
Thanks a lot!
We could also use substring from base R
df$newcol <- substring(df$REmoji, 2)
Note these \U... are single Unicode code points, not just a backslash + digits/letters.
Using the ^. PCRE regex with sub provides the expected results:
> df %>% mutate(newcol = sub("^.", "", REmoji, perl=TRUE))
# A tibble: 3 x 3
RUser REmoji newcol
<chr> <chr> <chr>
1 User1 "\U0001f64f\U0001f3fb" "\U0001f3fb"
2 User2 "\U0001f64f" ""
3 User2 "\U0001f64f\U0001f3fc" "\U0001f3fc"
Make sure you pass the perl=TRUE argument.
And in order to do the reverse, i.e. keep the first code point only, you can use:
df %>% mutate(newcol = sub("^(.).+", "\\1", REmoji, perl=TRUE))
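To make the code-point remark concrete, here is a small base R sketch. It assumes each string holds a base emoji optionally followed by one modifier; the names emoji, base_emoji and modifier are illustrative.

```r
emoji <- c("\U0001f64f\U0001f3fb", "\U0001f64f", "\U0001f64f\U0001f3fc")

# nchar() counts characters (code points), not backslashes or bytes
n_points <- nchar(emoji)

# First code point is the base emoji; anything after it is the modifier
base_emoji <- substr(emoji, 1, 1)
modifier   <- substring(emoji, 2)

# utf8ToInt() shows the numeric code points behind a single string
utf8ToInt(emoji[1])
```

Since the backslash only exists in the source-code escape and not in the stored string, splitting by character position is enough here.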

Regular expression on separate function of Tidyr

I need to separate one column into two with tidyr.
The column has text like: I am Sam. The text always has exactly two white spaces, and it can contain any other symbols: [a-z][0-9][!\ºª, etc...].
I need to split it into two columns: column one I am, and column two Sam.
I can't find a regular expression to separate at the second blank space.
Could you help me please?
We can use extract from tidyr. We match zero or more characters and place them in a capture group ((.*)), followed by one or more spaces (\\s+) and another capture group that contains only non-whitespace characters (\\S+), to separate the original column into two columns. Because .* is greedy, the split happens at the last run of whitespace.
library(tidyr)
extract(df1, Col1, into = c("Col1", "Col2"), "(.*)\\s+(\\S+)")
# Col1 Col2
#1 I am Sam
#2 He is Sam
data
df1 <- data.frame(Col1 = c("I am Sam", "He is Sam"), stringsAsFactors=FALSE)
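The same split at the last whitespace can be sketched without tidyr, using two sub() calls. The names txt, left and right are illustrative.

```r
txt <- c("I am Sam", "He is Sam")

# Left part: remove the final whitespace run and the last word
left <- sub("\\s+\\S+$", "", txt)

# Right part: the greedy .* swallows everything up to the last whitespace
right <- sub("^.*\\s+", "", txt)

data.frame(Col1 = left, Col2 = right)
```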
As an alternative, given:
library(tidyr)
df <- data.frame(txt = "I am Sam")
you can use
separate(df, txt, c("a", "b"), sep="(?<=\\s\\S{1,100})\\s")
# a b
# 1 I am Sam
with separate using stringi::stri_split_regex (ICU engine), or
separate(df, txt, c("a", "b"), sep="^.*?\\s(*SKIP)(*FAIL)|\\s", perl=TRUE)
with the older (?) separate using base::strsplit (Perl engine). See also
strsplit("I am Sam", "^.*?\\s(*SKIP)(*FAIL)|\\s", perl=TRUE)
# [[1]]
# [1] "I am" "Sam"
But it might be a bit "esoterique"...
