Split string keeping spaces in R - r

I would like to prepare a table from raw text using readr::read_fwf. Its col_positions argument determines the column widths, which in my case can differ.
The table always has 4 columns, based on the first 4 words of a header string like the one below (the last word is ignored):
category    variable   description      value      sth
> text_for_column_width = "category    variable   description      value      sth"
> nchar("category    ")
[1] 12
> nchar("variable   ")
[1] 11
> nchar("description      ")
[1] 17
> nchar("value      ")
[1] 11
I want to obtain the first 4 words while keeping their trailing spaces, so that category, for example, counts as 8 letters + 4 spaces, and finally create a vector with the number of characters for each of the four names: c(12, 11, 17, 11). I tried strsplit with a space as the split argument and then counting the resulting empty strings, but I believe there is a faster way using a proper regular expression.

A possible solution, using stringr:
library(tidyverse)
text_for_column_width = "category    variable   description      value      sth"
strings <- text_for_column_width %>%
  str_remove("sth$") %>%
  str_split("(?<=\\s)(?=\\S)") %>%
  unlist
strings
#> [1] "category    "      "variable   "       "description      "
#> [4] "value      "
strings %>% str_count
#> [1] 12 11 17 11

You can use utils::strcapture:
text_for_column_width = "category    variable   description      value      sth"
pattern <- "^(\\S+\\s+)(\\S+\\s+)(\\S+\\s+)(\\S+\\s*)"
result <- utils::strcapture(pattern, text_for_column_width, list(f1 = character(), f2 = character(), f3 = character(), f4 = character()))
nchar(as.character(as.vector(result[1,])))
## => [1] 12 11 17 11
The pattern ^(\S+\s+)(\S+\s+)(\S+\s+)(\S+\s*) matches:
^ - start of string
(\S+\s+) - Group 1: one or more non-whitespace chars and then one or more whitespaces
(\S+\s+) - Group 2: one or more non-whitespace chars and then one or more whitespaces
(\S+\s+) - Group 3: one or more non-whitespace chars and then one or more whitespaces
(\S+\s*) - Group 4: one or more non-whitespace chars and then zero or more whitespaces
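If the number of leading columns may change, the four explicit groups can be replaced by one repeated token pattern. The following is a minimal base R sketch, assuming each column is a run of non-whitespace followed by its padding. The sample header is rebuilt with sprintf so the padding is explicit, and `n_cols` is a name introduced here for illustration only.

```r
# Rebuild the sample header with explicit padding (12, 11, 17 and 11 characters).
header <- sprintf("%-12s%-11s%-17s%-11s%s",
                  "category", "variable", "description", "value", "sth")

# Each token is a word plus the whitespace that follows it.
tokens <- regmatches(header, gregexpr("\\S+\\s*", header))[[1]]

n_cols <- 4  # hypothetical: number of leading columns to keep
widths <- nchar(tokens[seq_len(n_cols)])
widths
#> [1] 12 11 17 11
```

The resulting vector can then be passed to readr::fwf_widths() to build the col_positions argument.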

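You can also split on whitespace, but note that this discards the spaces, so it returns the length of every word rather than the column widths:
library(magrittr)  # provides %>%
stringr::str_split("category    variable   description      value      sth", "\\s+") %>%
  unlist() %>%
  purrr::map_int(nchar)
#> [1]  8  8 11  5  3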

Related

How to drop specific characters in strings in a column?

How can I drop "-" or double "--" only at the beginning of the value in the text column?
df <- data.frame(x = c(12, 14, 15, 178),
                 text = c("--Car", "-Transport", "Big-Truck", "--Plane"))
    x       text
1  12      --Car
2  14 -Transport
3  15  Big-Truck
4 178    --Plane
Expected output:
    x      text
1  12       Car
2  14 Transport
3  15 Big-Truck
4 178     Plane
You can use gsub with the regex "^\\-+". The ^ anchors the match at the beginning of the string, and \\-+ matches one or more (+) hyphens.
gsub("^\\-+", "", df$text)
# [1] "Car" "Transport" "Big-Truck" "Plane"
If the string may also start with whitespace that you want to remove, use [ -]+ in your regex instead. It matches any run of spaces or hyphens at the beginning of the string.
gsub("^[ -]+", "", df$text)
To apply this to the data frame, assign the result back to the column. In the tidyverse, you can also use str_remove:
df$text <- gsub("^\\-+", "", df$text)
# or, in dplyr
library(tidyverse)
df %>%
  mutate(text1 = gsub("^\\-+", "", text),
         text2 = str_remove(text, "^\\-+"))
You could use trimws to remove certain leading/trailing characters.
trimws(df$text, whitespace = '[ -]')
# [1] "Car" "Transport" "Big-Truck" "Plane"
# a more complex situation
x <- " -- Car - -"
trimws(x, whitespace = '[ -]')
# [1] "Car"
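Note that trimws strips both ends by default, while the question only asks about the beginning of the value. The following sketch restricts it with which = "left". The whitespace argument assumes R 3.6.0 or later.

```r
text <- c("--Car", "-Transport", "Big-Truck", "--Plane")

# Strip hyphens and spaces from the left side only.
left_trimmed <- trimws(text, which = "left", whitespace = "[ -]")
left_trimmed
#> [1] "Car"       "Transport" "Big-Truck" "Plane"

# Trailing hyphens are kept, unlike with the default which = "both".
trailing_kept <- trimws(" -- Car - -", which = "left", whitespace = "[ -]")
trailing_kept
#> [1] "Car - -"
```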

R: parse_number fails if the string contains a dot

parse_number from readr fails if the character string contains a .
It works well with special characters.
library(readr)
#works
parse_number("%ç*%&23")
#does not work
parse_number("art. 23")
Warning: 1 parsing failure.
row col expected actual
1 -- a number .
[1] NA
attr(,"problems")
# A tibble: 1 x 4
row col expected actual
<int> <int> <chr> <chr>
1 1 NA a number .
Why is this happening?
Update:
The expected result would be 23
There is a space after the dot, which is causing the error. What is the expected number from this sequence (0.23 or 23)?
parse_number seems to look for decimal and grouping separators as defined by your locale, see the documentation here https://www.rdocumentation.org/packages/readr/versions/1.3.1/topics/parse_number
You can opt to change the locale using the following (grouping_mark is a dot with a space):
parse_number("art. 23", locale=locale(grouping_mark=". ", decimal_mark=","))
Output: 23
or remove the space in front:
parse_number(gsub(" ", "" , "art. 23"))
Output: 0.23
Edit: To handle dots as abbreviations and numbers use the following:
library(stringr)
> as.numeric(str_extract("art. 23", "\\d+\\.*\\d*"))
[1] 23
> as.numeric(str_extract("%ç*%&23", "\\d+\\.*\\d*"))
[1] 23
The above uses a regular expression to pick the number pattern out of the string:
\\d+ matches one or more digits
\\.* matches zero or more dots
\\d* matches any remaining digits (zero or more)
Note: I am no expert on regex but there are plenty of other resources that will make you one
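The pattern \\d+\\.*\\d* would also accept several dots in a row, such as "1..5", which as.numeric cannot convert. Below is a slightly stricter sketch in base R that only treats a dot as a decimal point when digits follow it. `extract_first_number` is a hypothetical helper name, not part of readr or stringr.

```r
extract_first_number <- function(x) {
  # One or more digits, optionally followed by a dot and more digits.
  m <- regexpr("[0-9]+(\\.[0-9]+)?", x)
  hit <- as.integer(m) > 0          # TRUE where a number was found
  out <- rep(NA_real_, length(x))   # NA for strings without any digits
  out[hit] <- as.numeric(regmatches(x, m))
  out
}

extract_first_number(c("art. 23", "%&*23", "v 3.75 kg", "none"))
#> [1] 23.00 23.00  3.75    NA
```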

Replace matched patterns in a string based on condition

I have a text string containing digits, letters and spaces. Some of its substrings are month abbreviations. I want to perform a condition-based pattern replacement, namely to enclose a month abbreviation in whitespace if and only if a given condition is fulfilled. As an example, let the condition be as follows: "preceded by a digit and followed by a letter".
I tried the stringr package, but I failed to combine the functions str_replace_all() and str_locate_all():
# Input:
txt = "START1SEP2 1DECX JANEND"
# Desired output:
# "START1SEP2 1 DEC X JANEND"
# (A) What I could do without checking the condition:
library(stringr)
patt_month = paste("(", paste(toupper(month.abb), collapse = "|"), ")", sep='')
str_replace_all(string = txt, pattern = patt_month, replacement = " \\1 ")
# "START1 SEP 2 1 DEC X JAN END"
# (B) But I actually only need replacements inside the condition-based bounds:
str_locate_all(string = txt, pattern = paste("[0-9]", patt_month, "[A-Z]", sep=''))[[1]]
# start end
# [1,] 12 16
# To combine (A) and (B), I'm currently using an ugly for() loop not shown here and want to get rid of it
You are looking for lookarounds:
(?<=\d)DEC(?=[A-Z])
See a demo on regex101.com.
Lookarounds assert that a certain position matches without consuming any characters. A lookbehind checks what comes immediately before the current position, and a lookahead checks what follows it. Each comes in a positive and a negative form, so there are four types in total (pos./neg. lookbehind/-ahead).
A short memo:
(?=...) is a pos. lookahead
(?!...) is a neg. lookahead
(?<=...) is a pos. lookbehind
(?<!...) is a neg. lookbehind
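Applied to the question's input, the lookarounds can be combined with the full list of month abbreviations. This is a minimal sketch using base R's gsub with perl = TRUE; the variable names `months`, `pat` and `result` are chosen here for illustration.

```r
txt <- "START1SEP2 1DECX JANEND"

# Alternation of all upper-case month abbreviations: JAN|FEB|...|DEC
months <- paste(toupper(month.abb), collapse = "|")

# Month preceded by a digit (lookbehind) and followed by a letter (lookahead).
pat <- paste0("(?<=[0-9])(", months, ")(?=[A-Z])")

result <- gsub(pat, " \\1 ", txt, perl = TRUE)
result
#> [1] "START1SEP2 1 DEC X JANEND"
```

Because the digit and the letter are only asserted and not consumed, the replacement needs just one capture group for the month itself.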
A Base R version
patt_month <- paste(toupper(month.abb), collapse = "|")  # concatenate all month.abb with OR
pat <- paste0("(\\s\\d)(", patt_month, ")([A-Z]\\s)")    # three groups: digit, month, letter
gsub(pattern = pat, replacement = "\\1 \\2 \\3", txt, perl = TRUE)  # same result as above
Also works for txt2 <- "START1SEP2 1JANY JANEND" out of the box.
[1] "START1SEP2 1 JAN Y JANEND"

Apply a regex only to the first word of a phrase (defined with spaces)

I have this regex to separate letters from numbers (and symbols) of a word: (?<=[a-zA-Z])(?=([[0-9]|[:punct:]])). My test string is: "CALLE15 CRA22".
I want to apply this regex only to the first word of that sentence (words are delimited by spaces). Namely, I want to apply it only to "CALLE15".
One solution is to split the string (sentence) into words and then apply the regex to the first word, but I want to do it all in one regex. Another solution is to use stringr::str_replace() (or sub()), which replaces only the first match, but I need stringr::str_replace_all() (or gsub()) for other reasons.
What I need is to insert a space between the letters and the digits, which I do with the replacement function. The outcome I want is "CALLE 15 CRA22", with the possibility of "CALLE15 CRA 22". I have tried many positions for the space, as well as ^ at the beginning, and nothing works.
https://rubular.com/r/7dxsHdOA3avTdX
Thanks for your help!!!!
I am unsure about your problem statement (see my comment above), but the following reproduces your expected output and uses str_replace_all
ss <- "CALLE15 CRA22"
library(stringr)
str_replace_all(ss, "^([A-Za-z]+)(\\d+)(\\s.+)$", "\\1 \\2\\3")
#[1] "CALLE 15 CRA22"
Update
To reproduce the output of the sample string from the comment above
ss <- "CLL.6 N 5-74NORTE"
pat <- c(
  "(?<=[A-Za-z])(?![A-Za-z])",
  "(?<![A-Za-z])(?=[A-Za-z])",
  "(?<=[0-9])(?![0-9])",
  "(?<![0-9])(?=[0-9])")
library(stringr)
str_split(ss, sprintf("(%s)", paste(pat, collapse = "|"))) %>%
  unlist() %>%
  .[nchar(trimws(.)) > 0] %>%
  paste(collapse = " ")
#[1] "CLL . 6 N 5 - 74 NORTE"
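For the original requirement (one regex, with gsub, touching only the first word), the ^ anchor can be combined with a lookahead. This is a sketch under the assumption that the first word starts with letters; `first_word_split` is a name introduced here for illustration.

```r
ss <- "CALLE15 CRA22"

# ^ pins the match to the start of the string, so gsub can only match once:
# the letters of the first word, followed (lookahead) by a digit or punctuation.
first_word_split <- gsub("^([A-Za-z]+)(?=[0-9[:punct:]])", "\\1 ", ss, perl = TRUE)
first_word_split
#> [1] "CALLE 15 CRA22"
```

Later words such as "CRA22" are left alone because no match can begin anywhere other than the start of the string.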

stringr: extract words containing a specific word

Consider this simple example
dataframe <- data_frame(text = c('WAFF;WOFF;WIFF200;WIFF12',
'WUFF;WEFF;WIFF2;BIGWIFF'))
> dataframe
# A tibble: 2 x 1
text
<chr>
1 WAFF;WOFF;WIFF200;WIFF12
2 WUFF;WEFF;WIFF2;BIGWIFF
Here I want to extract the words containing WIFF, that is I want to end up with a dataframe like this
> output
# A tibble: 2 x 1
text
<chr>
1 WIFF200;WIFF12
2 WIFF2;BIGWIFF
I tried to use
dataframe %>%
mutate( mystring = str_extract(text, regex('\bwiff\b', ignore_case=TRUE)))
but this only returns NAs. Any ideas?
Thanks!
A classic, non-regex approach via base R would be,
sapply(strsplit(dataframe$text, ';', fixed = TRUE), function(i)
  paste(grep('WIFF', i, value = TRUE, fixed = TRUE), collapse = ';'))
#[1] "WIFF200;WIFF12" "WIFF2;BIGWIFF"
You seem to want to remove all words that do not contain WIFF, along with the trailing ; if there is one. Use
> dataframe <- data.frame(text = c('WAFF;WOFF;WIFF200;WIFF12', 'WUFF;WEFF;WIFF2;BIGWIFF'))
> dataframe$text <- str_replace_all(dataframe$text, "(?i)\\b(?!\\w*WIFF)\\w+;?", "")
> dataframe
text
1 WIFF200;WIFF12
2 WIFF2;BIGWIFF
The pattern (?i)\\b(?!\\w*WIFF)\\w+;? matches:
(?i) - a case insensitive inline modifier
\\b - a word boundary
(?!\\w*WIFF) - the negative lookahead fails any match where a word contains WIFF anywhere inside it
\\w+ - 1 or more word chars
;? - an optional ; (? matches 1 or 0 occurrences of the pattern it modifies)
If for some reason you want to use str_extract, note that your regex could not work because \bWIFF\b matches a whole word WIFF and nothing else. You do not have such words in your DF. You may use "(?i)\\b\\w*WIFF\\w*\\b" to match any words with WIFF inside (case insensitively) and use str_extract_all to get multiple occurrences, and do not forget to join the matches into a single "string":
> df <- data.frame(text = c('WAFF;WOFF;WIFF200;WIFF12', 'WUFF;WEFF;WIFF2;BIGWIFF'))
> res <- str_extract_all(df$text, "(?i)\\b\\w*WIFF\\w*\\b")
> res
[[1]]
[1] "WIFF200" "WIFF12"
[[2]]
[1] "WIFF2" "BIGWIFF"
> df$text <- sapply(res, function(s) paste(s, collapse=';'))
> df
text
1 WIFF200;WIFF12
2 WIFF2;BIGWIFF
You may "shrink" the code by placing str_extract_all inside the sapply call; I separated them for better visibility.
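The same extract-and-rejoin idea can be sketched without stringr, using base R's gregexpr and regmatches. The names `hits` and `kept` are introduced here for illustration.

```r
text <- c("WAFF;WOFF;WIFF200;WIFF12", "WUFF;WEFF;WIFF2;BIGWIFF")

# Find every word containing WIFF (case-insensitively), one list element per row.
hits <- regmatches(text, gregexpr("\\w*WIFF\\w*", text, ignore.case = TRUE, perl = TRUE))

# Re-join the matches of each row into a single ";"-separated string.
kept <- vapply(hits, paste, character(1), collapse = ";")
kept
#> [1] "WIFF200;WIFF12" "WIFF2;BIGWIFF"
```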

Resources