hi i've got some budget data with names and titles that read "Last, First - Title" and other rows in same column position that read "anything really - ,asd;flkajsd". I'd like to split the column IF first word ends in a "," at the "-" position that follows it.
ive tried this:
C22$ITEM2 <- ifelse(grepl(",", C22$ITEM), C22$ITEM, NA)
test <- str_split_fixed(C22$ITEM2, "-", 2)
C22 <- cbind(C22, test)
but i'm getting other cells with commas elsewhere, need to limit to just "if first word ends in comma"
library(tidyverse)
data <- tibble(data = c("Doe, John - Mr", "Anna, Anna - Ms", " ,asd;flkajsd"))
data
data %>%
# first word must ed with a
filter(data %>% str_detect("^[A-z]+a")) %>%
separate(data, into = c("Last", "First", "Title"), sep = "[,-]") %>%
mutate_all(str_trim)
# A tibble: 1 × 3
# Last First Title
# <chr> <chr> <chr>
#1 Anna Anna Ms
We may use extract to do this - capture the regex pattern as two groups ((...)) where the first group would return word (\\w+) from the start (^) of the string followed by a ,, zero or more space (\\s*), another word (\\w+), then the - (preceding or succeeding zero or more space and the second capture group with the word (\\w+) before the end ($) of the string
library(tidyr)
library(dplyr)
extract(C22, ITEM, into = c("Name", "Title"),
"^(\\w+,\\s*\\w+)\\s*-\\s*(\\w+)$") %>%
mutate(Name = coalesce(Name, ITEM), .keep = 'unused')
NOTE: The mutate is added in case the regex didn't match and return NA elements, we coalesce with the original column to return the value that corresponds to NA
Related
I would like to prepare a table from raw text using readr::read_fwf. There is an argument col_position responsible for determining columns width which in my case could differ.
Table always includes 4 columns and is based on 4 first words from the string like besides one:
category variable description value sth
> text_for_column_width = "category variable description value sth"
> nchar("category ")
[1] 12
> nchar("variable ")
[1] 11
> nchar("description ")
[1] 17
> nchar("value ")
[1] 11
I want obtain 4 first words but keeping spaces to have category with 8[a-b]+4[spaces] characters and finally create a vector including number of characters for each of four names c(12,11,17,11). I tried using strsplit with space split argument and then calculate existing zeros however I believe there is faster way just using proper regular expression.
A possible solution, using stringr:
library(tidyverse)
text_for_column_width = "category variable description value sth"
strings <- text_for_column_width %>%
str_remove("sth$") %>%
str_split("(?<=\\s)(?=\\S)") %>%
unlist
strings
#> [1] "category " "variable " "description "
#> [4] "value "
strings %>% str_count
#> [1] 12 11 17 11
You can use utils::strcapture:
text_for_column_width = "category variable description value sth"
pattern <- "^(\\S+\\s+)(\\S+\\s+)(\\S+\\s+)(\\S+\\s*)"
result <- utils::strcapture(pattern, text_for_column_width, list(f1 = character(), f2 = character(), f3 = character(), f4 = character()))
nchar(as.character(as.vector(result[1,])))
## => [1] 12 11 17 11
See the regex demo. The ^(\S+\s+)(\S+\s+)(\S+\s+)(\S+\s*) matches
^ - start of string
(\S+\s+) - Group 1: one or more non-whitespace chars and then one or more whitespaces
(\S+\s+) - Group 2: one or more non-whitespace chars and then one or more whitespaces
(\S+\s+) - Group 3: one or more non-whitespace chars and then one or more whitespaces
(\S+\s*) - Group 4: one or more non-whitespace chars and then zero or more whitespaces
You can also use this pattern:
stringr::str_split("category variable description value sth", "\\s+") %>%
unlist() %>%
purrr::map_int(nchar)
I need to extract surname and city from the column.
Surname is always the first part of the column and then is ",".
City is the last part of the column and before it is "."
Raws don't have similar structure, so I want to extract the first part before the first comma and the last part before dot (.) and after the last space.
I tried:
df<-separate(df, Name, into=c("v1","v2", "v3", "v4"), sep=",")
v1 seems OK and it's a surname but I can't separate the city (the last part of the column)
Please help to separate surname as one column, city as another column.enter image description here
We can use tidyr::extract, and specify two capture groups in the regex - one for surname (beginning of string, followed by a word) (^\\w+\\b), the other for city (one or two words that follow a comma and a space (, ) followed by a literal dot (.) and the end of the string) ((?<=, )\\w+ *\\w+(?=\\.$)):
library(tidyr)
df %>% extract(Name, into = c("Surname", "City"), regex = "(^\\w+\\b).*((?<=, )\\w+ *\\w+(?=\\.$))", remove = FALSE)
We can also extract the words (\\w+) into a list, then subset the first and last elements of the list (this will only work if cities have only one word each:
library(dplyr)
library(tidyr)
library(stringr)
df %>% mutate(output=str_extract_all(Name, "\\w+") %>%
map(~list(surname=first(.x), city=last(.x))))%>%
unnest_wider(output)
output
# A tibble: 2 x 3
Name Surname City
<chr> <chr> <chr>
1 Ivanov, Petr, Ivanovich, Novosibirsk. Ivanov Novosibirsk
2 Lipenko, Daria, Nizhniy Novgorod. Lipenko Nizhniy Novgorod
data
df<-tibble(Name=c("Ivanov, Petr, Ivanovich, Novosibirsk.", "Lipenko, Daria, Nizhniy Novgorod."))
I have a tribble with a chr column that contains the unicode to emojis. I want to split these strings into two columns in case of need, if there are more than two backslash in the whole string. So I need a split with the 2nd backslash. It would also be enough to just delete everything from the 2nd backslash on.
Here is what I tried:
df <- tibble::tribble(
~RUser, ~REmoji,
"User1", "\U0001f64f\U0001f3fb",
"User2", "\U0001f64f",
"User2", "\U0001f64f\U0001f3fc"
)
df %>% mutate(newcol = gsub("\\\\*", "", REmoji))
I found the solution Replace single backslash in R. But in my case I have only one backslash, and I don't understand how to separate the column here.
The result should look like this output:
df2 <- tibble::tribble(
~RUser, ~REmoji1, ~newcol,
"User1", "\U0001f64f", "\U0001f3fb",
"User2", "\U0001f64f", "", #This Field is empty, since there was no Emoji-Modification
"User2", "\U0001f64f", "\U0001f3fc"
)
Thanks a lot!
We could also use substring from base R
df$newcol <- substring(df$REmoji, 2)
Note these \U... are single Unicode code points, not just a backslash + digits/letters.
Using the ^. PCRE regex with sub provides the expected results:
> df %>% mutate(newcol = sub("^.", "", REmoji, perl=TRUE))
# A tibble: 3 x 3
RUser REmoji newcol
<chr> <chr> <chr>
1 User1 "\U0001f64f\U0001f3fb" "\U0001f3fb"
2 User2 "\U0001f64f" ""
3 User2 "\U0001f64f\U0001f3fc" "\U0001f3fc"
Make sure you pass the perl=TRUE argument.
And in order to do the reverse, i.e. keep the first code point only, you can use:
df %>% mutate(newcol = sub("^(.).+", "\\1", REmoji, perl=TRUE))
I am trying to write a function to detect capitalized words that are all capitalised
currently, code:
df <- data.frame(title = character(), id = numeric())%>%
add_row(title= "THIS is an EXAMPLE where I DONT get the output i WAS hoping for", id = 6)
df <- df %>%
mutate(sec_code_1 = unlist(str_extract_all(title," [A-Z]{3,5} ")[[1]][1])
, sec_code_2 = unlist(str_extract_all(title," [A-Z]{3,5} ")[[1]][2])
, sec_code_3 = unlist(str_extract_all(title," [A-Z]{3,5} ")[[1]][3]))
df
Where output is:
title
id
sec_code_1
sec_code_2
sec_code_3
THIS is an EXAMPLE where I DONT get the output i WAS hoping for
6
DONT
WAS
The first 3-5 letter capitalized word is "THIS", second should skip example (>5) and be "DONT", third example should be "WAS".
ie:
title
id
sec_code_1
sec_code_2
sec_code_3
THIS is an EXAMPLE where I DONT get the output i WAS hoping for
6
THIS
DONT
WANT
does anyone know where Im going wrong? specifically how I can denote "space or beginning of string" or "space or end of string" logically using stringr.
If you run the code with your regex you'll realise 'THIS' is not included in the output at all.
str_extract_all(df$title," [A-Z]{3,5} ")[[1]]
#[1] " DONT " " WAS "
This is because you are extracting words with leading and lagging whitespace. 'THIS' does not have lagging whitespace because it is start of the sentence, hence it does not satisfy the regex pattern. You can use word boundaries (\\b) instead.
str_extract_all(df$title,"\\b[A-Z]{3,5}\\b")[[1]]
#[1] "THIS" "DONT" "WAS"
Your code would work if you use the above pattern in it.
Or you could also use :
library(tidyverse)
df %>%
mutate(code = str_extract_all(title,"\\b[A-Z]{3,5}\\b")) %>%
unnest_wider(code) %>%
rename_with(~paste0('sec_code_', seq_along(.)), starts_with('..'))
# title id sec_code_1 sec_code_2 sec_code_3
# <chr> <dbl> <chr> <chr> <chr>
#1 THIS is an EXAMPLE where I DONT get t… 6 THIS DONT WAS
Consider this simple example
dataframe <- data_frame(text = c('WAFF;WOFF;WIFF200;WIFF12',
'WUFF;WEFF;WIFF2;BIGWIFF'))
> dataframe
# A tibble: 2 x 1
text
<chr>
1 WAFF;WOFF;WIFF200;WIFF12
2 WUFF;WEFF;WIFF2;BIGWIFF
Here I want to extract the words containing WIFF, that is I want to end up with a dataframe like this
> output
# A tibble: 2 x 1
text
<chr>
1 WIFF200;WIFF12
2 WIFF2;BIGWIFF
I tried to use
dataframe %>%
mutate( mystring = str_extract(text, regex('\bwiff\b', ignore_case=TRUE)))
but this only retuns NAs. Any ideas?
Thanks!
A classic, non-regex approach via base R would be,
sapply(strsplit(me$text, ';', fixed = TRUE), function(i)
paste(grep('WIFF', i, value = TRUE, fixed = TRUE), collapse = ';'))
#[1] "WIFF200;WIFF12" "WIFF2;BIGWIFF"
You seem to want to remove all words containing WIFF and the trailing ; if there is any. Use
> dataframedataframe <- data.frame(text = c('WAFF;WOFF;WIFF200;WIFF12', 'WUFF;WEFF;WIFF2;BIGWIFF'))
> dataframe$text <- str_replace_all(dataframe$text, "(?i)\\b(?!\\w*WIFF)\\w+;?", "")
> dataframe
text
1 WIFF200;WIFF12
2 WIFF2;BIGWIFF
The pattern (?i)\\b(?!\\w*WIFF)\\w+;? matches:
(?i) - a case insensitive inline modifier
\\b - a word boundary
(?!\\w*WIFF) - the negative lookahead fails any match where a word contains WIFF anywhere inside it
\\w+ - 1 or more word chars
;? - an optional ; (? matches 1 or 0 occurrences of the pattern it modifies)
If for some reason you want to use str_extract, note that your regex could not work because \bWIFF\b matches a whole word WIFF and nothing else. You do not have such words in your DF. You may use "(?i)\\b\\w*WIFF\\w*\\b" to match any words with WIFF inside (case insensitively) and use str_extract_all to get multiple occurrences, and do not forget to join the matches into a single "string":
> df <- data.frame(text = c('WAFF;WOFF;WIFF200;WIFF12', 'WUFF;WEFF;WIFF2;BIGWIFF'))
> res <- str_extract_all(df$text, "(?i)\\b\\w*WIFF\\w*\\b")
> res
[[1]]
[1] "WIFF200" "WIFF12"
[[2]]
[1] "WIFF2" "BIGWIFF"
> df$text <- sapply(res, function(s) paste(s, collapse=';'))
> df
text
1 WIFF200;WIFF12
2 WIFF2;BIGWIFF
You may "shrink" the code by placing str_extract_all into the sapply function, I separated them for better visibility.