How to drop specific characters in strings in a column? - r

How can I drop "-" or double "--" only at the beginning of the value in the text column?
df <- data.frame (x = c(12,14,15,178),
text = c("--Car","-Transport","Big-Truck","--Plane"))
x text
1 12 --Car
2 14 -Transport
3 15 Big-Truck
4 178 --Plane
Expected output:
x text
1 12 Car
2 14 Transport
3 15 Big-Truck
4 178 Plane

You can use gsub with the regex "^\\-+". The ^ anchors the match at the beginning of the string, and \\-+ matches one or more (+) hyphens (\\-).
gsub("^\\-+", "", df$text)
# [1] "Car" "Transport" "Big-Truck" "Plane"
If the string may also start with whitespace that you want to remove, use [ -]+ in the regex instead. It matches any run of spaces or hyphens at the beginning of the string.
gsub("^[ -]+", "", df$text)
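None of the values in df$text start with a space, so the effect of [ -]+ is easier to see on a small made-up vector. The values below are illustrative assumptions, not taken from the question:

```r
# Hypothetical values with mixed leading spaces and hyphens
messy <- c(" --Car", "- -Transport", "Big-Truck", "  Plane ")

# Strip any run of spaces or hyphens at the start of each string only
cleaned <- gsub("^[ -]+", "", messy)
cleaned
# [1] "Car"       "Transport" "Big-Truck" "Plane "
```

The inner hyphen in "Big-Truck" and the trailing space in "Plane " are kept, because the pattern is anchored at the start.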
To apply this to the data frame, assign the result back to the column. In the tidyverse, you can also use str_remove:
df$text <- gsub("^\\-+", "", df$text)
# or, in dplyr
library(tidyverse)
df %>%
mutate(text1 = gsub("^\\-+", "", text),
text2 = str_remove(text, "^\\-+"))
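Because the pattern is anchored with ^, it can match at most once per string, so sub() should behave the same as gsub() here. The hyphen also does not need escaping outside a character class. A minimal base R sketch on the question's values:

```r
text <- c("--Car", "-Transport", "Big-Truck", "--Plane")

# sub() replaces only the first match, which is enough for an anchored pattern
res_sub  <- sub("^-+", "", text)
res_gsub <- gsub("^\\-+", "", text)

identical(res_sub, res_gsub)
# [1] TRUE
```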

You could use trimws to remove certain leading/trailing characters.
trimws(df$text, whitespace = '[ -]')
# [1] "Car" "Transport" "Big-Truck" "Plane"
# a more complex situation
x <- " -- Car - -"
trimws(x, whitespace = '[ -]')
# [1] "Car"
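Note that trimws strips both ends by default, while the question asks about the beginning only. A small sketch using which = "left"; the trailing hyphen in "--Plane-" is a made-up value added to show the difference, and the whitespace argument needs R 3.6.0 or later:

```r
x <- c("--Car", "-Transport", "Big-Truck", "--Plane-")

# Only trim spaces and hyphens on the left-hand side
left_only <- trimws(x, which = "left", whitespace = "[ -]")
left_only
# [1] "Car"       "Transport" "Big-Truck" "Plane-"
```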

Related

Parsing text in r without separator

I need help with ideas for parsing this text.
I want to do it in the most automatic way possible.
This is the text:
text <- "JOHN DEERE: PMWF2126 NEW HOLLAND: 441702A1 HIFI: WE 2126 CUMMINS: 4907485"
I need this result:
a            b
JOHN DEERE   PMWF2126
NEW HOLLAND  441702A1
HIFI         WE 2126
CUMMINS      4907485
This is just an example; the real data contains different brands and item IDs.
I tried:
str_split(text, " ")
[[1]]
[1] "JOHN" "DEERE:" "PMWF2126" "NEW" "HOLLAND:" "441702A1" "HIFI:" "WE" "2126"
[10] "CUMMINS:" "4907485" "CUMMINS:" "3680433" "CUMMINS:" "3680315" "CUMMINS:" "3100310"
Thanks!
Edit:
Thanks for your answers, very helpful.
But there is another case, where the ID can end with a letter too:
text <- "LANSS: EF903R DARMET: VP-2726/S CASE: 133721A1 JOHN DEERE: RE68049 JCB: 32917302 WIX: 46490 TURBO: TR25902 HIFI: SA 16080 CATERPILLAR: 4431570 KOMATSU: Z7602BXK06 KOMATSU: Z7602BX106 KOMATSU: YM12991012501 KOMATSU: YM12991012500 KOMATSU: YM11900512571 KOMATSU: 6001851320 KOMATSU: 6001851300 KOMATSU: 3EB0234790 KOMATSU: 11900512571"
We can use separate_rows and separate from tidyr for this task:
library(tidyverse)
data.frame(text) %>%
# separate into rows:
separate_rows(text, sep = "(?<=\\d)\\s") %>%
# separate into columns:
separate(text,
into = c("a", "b"),
sep = ":\\s")
# A tibble: 4 × 2
a b
<chr> <chr>
1 JOHN DEERE PMWF2126
2 NEW HOLLAND 441702A1
3 HIFI WE 2126
4 CUMMINS 4907485
The split point for separate_rows uses the look-behind (?<=\\d) to assert that the whitespace \\s on which the string is broken must be preceded by a digit (\\d).
Data:
text <- "JOHN DEERE: PMWF2126 NEW HOLLAND: 441702A1 HIFI: WE 2126 CUMMINS: 4907485"
The solution assumes (as in your sample data) that the second value always ends with a number and the first column does not.
If this is not the case, you'll have to adapt the regex part (?<=[0-9] )(?=[A-Z]), so that the splitting point lies between the two round-bracketed parts.
text <- "JOHN DEERE: PMWF2126 NEW HOLLAND: 441702A1 HIFI: WE 2126 CUMMINS: 4907485"
lapply(
strsplit(
unlist(strsplit(text, "(?<=[0-9] )(?=[A-Z])", perl = TRUE)),
":"), trimws)
[[1]]
[1] "JOHN DEERE" "PMWF2126"
[[2]]
[1] "NEW HOLLAND" "441702A1"
[[3]]
[1] "HIFI" "WE 2126"
[[4]]
[1] "CUMMINS" "4907485"
The key part is the strsplit(text, "(?<=[0-9] )(?=[A-Z])", perl = TRUE) call.
It looks for positions that are preceded by a digit and a space, (?<=[0-9] ), and followed by a capital letter, (?=[A-Z]).
These positions are then used as splitting points.
Since the second field always ends in a digit and the first field does not, replace a digit followed by a space with that digit and a newline, and then use read.table with a colon separator. (The _ placeholder for the native pipe requires R 4.2 or later.)
text |>
gsub("(\\d) ", "\\1\n", x = _) |>
read.table(text = _, sep = ":", strip.white = TRUE)
giving
V1 V2
1 JOHN DEERE PMWF2126
2 NEW HOLLAND 441702A1
3 HIFI WE 2126
4 CUMMINS 4907485
If the second field can contain a digit but the first cannot, and the digit is not necessarily at the end of the last word of the second field but may be anywhere in that word, we can use this variation, which gives the same result here. gsubfn is like gsub except that the second argument can be a function instead of a replacement string. The function takes the match as input (here the whole word, since the pattern has no capture group) and its output replaces the entire match. The function can be expressed in formula notation, as is done here.
library(gsubfn)
text |>
gsubfn("\\w+", ~ if (grepl("[0-9]", x)) paste(x, "\n") else x, x = _) |>
read.table(text = _, sep = ":", strip.white = TRUE)
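The follow-up sample in the question contains IDs such as EF903R and VP-2726/S, which end in a letter or contain punctuation. A base R sketch for that case is shown below. It assumes that the last token of every value contains at least one digit and that no brand token does; the shortened input string is an excerpt from the question's second example, and the variable names are only illustrative:

```r
text2 <- "LANSS: EF903R DARMET: VP-2726/S HIFI: SA 16080 JOHN DEERE: RE68049"

# Put a newline after every whitespace-delimited token that contains a digit
with_breaks <- gsub("(\\S*\\d\\S*)\\s+", "\\1\n", text2, perl = TRUE)
lines <- strsplit(with_breaks, "\n", fixed = TRUE)[[1]]

# Everything before the first colon is the brand, the rest is the ID
res <- data.frame(
  a = sub(":.*$", "", lines),
  b = sub("^[^:]*:\\s*", "", lines)
)
res
```

Using \\S* rather than \\w+ keeps tokens such as VP-2726/S in one piece, so the line break lands after the whole ID.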

Split string keeping spaces in R

I would like to prepare a table from raw text using readr::read_fwf. Its col_positions argument determines the column widths, which in my case can differ.
The table always has 4 columns, defined by the first 4 words of a header string like the one below:
category    variable   description      value      sth
> text_for_column_width = "category    variable   description      value      sth"
> nchar("category    ")
[1] 12
> nchar("variable   ")
[1] 11
> nchar("description      ")
[1] 17
> nchar("value      ")
[1] 11
I want to obtain the first 4 words with their trailing spaces kept, so that category, for example, has 8 [letters] + 4 [spaces] characters, and finally create a vector with the number of characters for each of the four names: c(12,11,17,11). I tried using strsplit with a space as the split argument and then counting the resulting empty strings, but I believe there is a faster way using a proper regular expression.
A possible solution, using stringr:
library(tidyverse)
text_for_column_width = "category    variable   description      value      sth"
strings <- text_for_column_width %>%
  str_remove("sth$") %>%
  str_split("(?<=\\s)(?=\\S)") %>%
  unlist
strings
#> [1] "category    "      "variable   "       "description      "
#> [4] "value      "
strings %>% str_count
#> [1] 12 11 17 11
You can use utils::strcapture:
text_for_column_width = "category    variable   description      value      sth"
pattern <- "^(\\S+\\s+)(\\S+\\s+)(\\S+\\s+)(\\S+\\s*)"
result <- utils::strcapture(pattern, text_for_column_width, list(f1 = character(), f2 = character(), f3 = character(), f4 = character()))
nchar(as.character(as.vector(result[1,])))
## => [1] 12 11 17 11
The pattern ^(\S+\s+)(\S+\s+)(\S+\s+)(\S+\s*) matches:
^ - start of string
(\S+\s+) - Group 1: one or more non-whitespace chars and then one or more whitespaces
(\S+\s+) - Group 2: one or more non-whitespace chars and then one or more whitespaces
(\S+\s+) - Group 3: one or more non-whitespace chars and then one or more whitespaces
(\S+\s*) - Group 4: one or more non-whitespace chars and then zero or more whitespaces
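You can also split on whitespace, but note that this drops the spaces and keeps every word (including sth), so it returns the word lengths 8 8 11 5 3 rather than the column widths:
stringr::str_split("category    variable   description      value      sth", "\\s+") %>%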
unlist() %>%
purrr::map_int(nchar)
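For comparison, the widths can also be obtained in base R by matching each word together with the whitespace that follows it. This is a sketch, not one of the posted answers; the header is rebuilt with strrep() so that the padding (4, 3, 6 and 6 spaces, as implied by the nchar output in the question) is explicit:

```r
# Rebuild the header with explicit padding so the widths are unambiguous
header <- paste0("category", strrep(" ", 4),
                 "variable", strrep(" ", 3),
                 "description", strrep(" ", 6),
                 "value", strrep(" ", 6),
                 "sth")

# Each match is a word plus the whitespace that follows it
pieces <- regmatches(header, gregexpr("\\S+\\s*", header, perl = TRUE))[[1]]

# Keep the first four columns and measure their widths
widths <- nchar(head(pieces, 4))
widths
# [1] 12 11 17 11
```

A vector like this could then be passed to readr::fwf_widths() to build the col_positions argument for read_fwf.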

regex commas not between two numbers

I am looking for a regex for gsub to remove all the unwanted commas:
Data:
,,,,,,,12345
12345,1345,1354
123,,,,,,
12345,
,12354
Desired result:
12345
12345,1345,1354
123
12345
12354
This is the progress I have made so far:
(,(?!\d+))
You seem to want to remove all leading and trailing commas.
You may do it with
gsub("^,+|,+$", "", x)
The regex contains two alternatives: ^,+ matches one or more commas at the start and ,+$ matches one or more commas at the end. gsub replaces these matches with empty strings.
x <- c(",,,,,,,12345","12345,1345,1354","123,,,,,,","12345,",",12354")
gsub("^,+|,+$", "", x)
## [1] "12345" "12345,1345,1354" "123" "12345"
## [5] "12354"
You can also use str_extract from stringr. Thanks to greedy matching, you don't have to specify how many digits or commas occur in between: the longest match, from the first digit to the last digit, is chosen automatically:
library(dplyr)
library(stringr)
df %>%
mutate(V1 = str_extract(V1, "\\d.+\\d"))
or if you prefer base R:
df$V1 <- regmatches(df$V1, regexpr("\\d.+\\d", df$V1)) # regexpr (not gregexpr) keeps V1 a character vector; assumes every row matches
Result:
V1
1 12345
2 12345,1345,1354
3 123
4 12345
5 12354
Data:
df = read.table(text = ",,,,,,,12345
12345,1345,1354
123,,,,,,
12345,
,12354")
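One caveat with "\\d.+\\d" is that it needs at least three characters to match, so values with only one or two digits are missed. The sketch below uses made-up values (not from the question) to show this, together with one possible variation that makes the tail optional:

```r
x <- c(",5,", ",12,", ",1,2,")

# "\\d.+\\d" needs a digit, at least one more character, and a closing digit
has_match <- grepl("\\d.+\\d", x, perl = TRUE)
has_match
# [1] FALSE FALSE  TRUE

# Making the tail optional also accepts one- and two-digit values
res <- regmatches(x, regexpr("\\d(.*\\d)?", x, perl = TRUE))
res
# [1] "5"   "12"  "1,2"
```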

stringr: extract words containing a specific word

Consider this simple example
dataframe <- tibble(text = c('WAFF;WOFF;WIFF200;WIFF12',
'WUFF;WEFF;WIFF2;BIGWIFF'))
> dataframe
# A tibble: 2 x 1
text
<chr>
1 WAFF;WOFF;WIFF200;WIFF12
2 WUFF;WEFF;WIFF2;BIGWIFF
Here I want to extract the words containing WIFF, that is I want to end up with a dataframe like this
> output
# A tibble: 2 x 1
text
<chr>
1 WIFF200;WIFF12
2 WIFF2;BIGWIFF
I tried to use
dataframe %>%
mutate( mystring = str_extract(text, regex('\bwiff\b', ignore_case=TRUE)))
but this only returns NAs. Any ideas?
Thanks!
A classic, non-regex approach via base R would be,
sapply(strsplit(dataframe$text, ';', fixed = TRUE), function(i)
paste(grep('WIFF', i, value = TRUE, fixed = TRUE), collapse = ';'))
#[1] "WIFF200;WIFF12" "WIFF2;BIGWIFF"
You seem to want to remove all words that do not contain WIFF, together with the trailing ; if there is one. Use
> dataframe <- data.frame(text = c('WAFF;WOFF;WIFF200;WIFF12', 'WUFF;WEFF;WIFF2;BIGWIFF'))
> dataframe$text <- str_replace_all(dataframe$text, "(?i)\\b(?!\\w*WIFF)\\w+;?", "")
> dataframe
text
1 WIFF200;WIFF12
2 WIFF2;BIGWIFF
The pattern (?i)\\b(?!\\w*WIFF)\\w+;? matches:
(?i) - a case insensitive inline modifier
\\b - a word boundary
(?!\\w*WIFF) - the negative lookahead fails any match where a word contains WIFF anywhere inside it
\\w+ - 1 or more word chars
;? - an optional ; (? matches 1 or 0 occurrences of the pattern it modifies)
If for some reason you want to use str_extract, note that your regex could not work because \bWIFF\b matches a whole word WIFF and nothing else. You do not have such words in your DF. You may use "(?i)\\b\\w*WIFF\\w*\\b" to match any words with WIFF inside (case insensitively) and use str_extract_all to get multiple occurrences, and do not forget to join the matches into a single "string":
> df <- data.frame(text = c('WAFF;WOFF;WIFF200;WIFF12', 'WUFF;WEFF;WIFF2;BIGWIFF'))
> res <- str_extract_all(df$text, "(?i)\\b\\w*WIFF\\w*\\b")
> res
[[1]]
[1] "WIFF200" "WIFF12"
[[2]]
[1] "WIFF2" "BIGWIFF"
> df$text <- sapply(res, function(s) paste(s, collapse=';'))
> df
text
1 WIFF200;WIFF12
2 WIFF2;BIGWIFF
You may "shrink" the code by placing str_extract_all into the sapply function, I separated them for better visibility.
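The extract-and-join steps above can also be written without stringr and folded into a single expression. This is a base R sketch using the same pattern, with regmatches and gregexpr standing in for str_extract_all:

```r
text <- c("WAFF;WOFF;WIFF200;WIFF12", "WUFF;WEFF;WIFF2;BIGWIFF")
pattern <- "(?i)\\b\\w*WIFF\\w*\\b"

# Extract every word containing WIFF, then join the matches per element
matches <- regmatches(text, gregexpr(pattern, text, perl = TRUE))
res <- vapply(matches, paste, character(1), collapse = ";")
res
# [1] "WIFF200;WIFF12" "WIFF2;BIGWIFF"
```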

How to remove all whitespace from a string?

So " xx yy 11 22 33 " will become "xxyy112233". How can I achieve this?
In general, we want a solution that is vectorised, so here's a better test example:
whitespace <- " \t\n\r\v\f" # space, tab, newline,
# carriage return, vertical tab, form feed
x <- c(
" x y ", # spaces before, after and in between
" \u2190 \u2192 ", # contains unicode chars
paste0( # varied whitespace
whitespace,
"x",
whitespace,
"y",
whitespace,
collapse = ""
),
NA # missing
)
## [1] " x y "
## [2] " ← → "
## [3] " \t\n\r\v\fx \t\n\r\v\fy \t\n\r\v\f"
## [4] NA
The base R approach: gsub
gsub replaces all instances of a string (fixed = TRUE) or regular expression (fixed = FALSE, the default) with another string. To remove all spaces, use:
gsub(" ", "", x, fixed = TRUE)
## [1] "xy" "←→"
## [3] "\t\n\r\v\fx\t\n\r\v\fy\t\n\r\v\f" NA
As DWin noted, in this case fixed = TRUE isn't necessary but provides slightly better performance since matching a fixed string is faster than matching a regular expression.
If you want to remove all types of whitespace, use:
gsub("[[:space:]]", "", x) # note the double square brackets
## [1] "xy" "←→" "xy" NA
gsub("\\s", "", x) # same; note the double backslash
library(regex)
gsub(space(), "", x) # same
"[:space:]" is a POSIX character class matching all whitespace characters; it is only valid inside a bracket expression, hence "[[:space:]]". \s is a shorthand supported by most regular-expression flavours that does the same thing.
The stringr approach: str_replace_all and str_trim
stringr provides more human-readable wrappers around the base R functions (though as of Dec 2014, the development version has a branch built on top of stringi, mentioned below). The equivalents of the above commands, using str_replace_all, are:
library(stringr)
str_replace_all(x, fixed(" "), "")
str_replace_all(x, space(), "")
stringr also has a str_trim function which removes only leading and trailing whitespace.
str_trim(x)
## [1] "x y" "← →" "x \t\n\r\v\fy" NA
str_trim(x, "left")
## [1] "x y " "← → "
## [3] "x \t\n\r\v\fy \t\n\r\v\f" NA
str_trim(x, "right")
## [1] " x y" " ← →"
## [3] " \t\n\r\v\fx \t\n\r\v\fy" NA
The stringi approach: stri_replace_all_charclass and stri_trim
stringi is built upon the platform-independent ICU library, and has an extensive set of string manipulation functions. The equivalents of the above are:
library(stringi)
stri_replace_all_fixed(x, " ", "")
stri_replace_all_charclass(x, "\\p{WHITE_SPACE}", "")
Here "\\p{WHITE_SPACE}" is an alternate syntax for the set of Unicode code points considered to be whitespace, equivalent to "[[:space:]]", "\\s" and space(). For more complex regular expression replacements, there is also stri_replace_all_regex.
stringi also has trim functions.
stri_trim(x)
stri_trim_both(x) # same
stri_trim(x, "left")
stri_trim_left(x) # same
stri_trim(x, "right")
stri_trim_right(x) # same
I just learned about the "stringr" package to remove white space from the beginning and end of a string with str_trim( , side="both") but it also has a replacement function so that:
a <- " xx yy 11 22 33 "
str_replace_all(string=a, pattern=" ", replacement="")
[1] "xxyy112233"
x = "xx yy 11 22 33"
gsub(" ", "", x)
> [1] "xxyy112233"
Use [[:blank:]] to match any kind of horizontal whitespace character (spaces and tabs).
gsub("[[:blank:]]", "", " xx yy 11 22 33 ")
# [1] "xxyy112233"
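The difference between a literal space, [[:blank:]] and [[:space:]] only shows up when the string contains tabs or newlines. A small sketch on a made-up string:

```r
x <- " xx\tyy\n11 "

only_spaces <- gsub(" ", "", x, fixed = TRUE)  # removes spaces only
no_blank    <- gsub("[[:blank:]]", "", x)      # removes spaces and tabs
no_space    <- gsub("[[:space:]]", "", x)      # removes all whitespace

only_spaces
# [1] "xx\tyy\n11"
no_blank
# [1] "xxyy\n11"
no_space
# [1] "xxyy11"
```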
Please note that the solutions above remove only spaces. If you also want to remove tabs or newlines, use stri_replace_all_charclass from the stringi package.
library(stringi)
stri_replace_all_charclass(" ala \t ma \n kota ", "\\p{WHITE_SPACE}", "")
## [1] "alamakota"
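The function str_squish() from the stringr package (part of the tidyverse) trims whitespace at both ends and collapses each internal run of whitespace to a single space. Note that it does not remove all whitespace:
library(dplyr)
library(stringr)
df <- data.frame(a = c(" aZe aze s", "wxc s aze "),
b = c(" 12 12 ", "34e e4 "),
stringsAsFactors = FALSE)
# str_squish is vectorised, so it can be applied to every column directly
df <- df %>%
mutate(across(everything(), str_squish)) %>%
as_tibble()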
df
# A tibble: 2 x 2
a b
<chr> <chr>
1 aZe aze s 12 12
2 wxc s aze 34e e4
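To make the difference between squishing and removing all whitespace concrete, here is a rough base R approximation of str_squish next to a full removal. The input values are made up, and the approximation may differ from stringr for unusual Unicode whitespace:

```r
x <- c("  aZe   aze s", "wxc  s   aze  ")

# Collapse runs of whitespace to one space, then trim both ends
squished <- trimws(gsub("\\s+", " ", x))
squished
# [1] "aZe aze s" "wxc s aze"

# Remove every whitespace character instead
stripped <- gsub("\\s+", "", x)
stripped
# [1] "aZeazes" "wxcsaze"
```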
Another approach that can be considered:
library(stringr)
str_replace_all(" xx yy 11 22 33 ", regex("\\s*"), "")
#[1] "xxyy112233"
\\s: Matches Space, tab, vertical tab, newline, form feed, carriage return
*: Matches at least 0 times
income<-c("$98,000.00 ", "$90,000.00 ", "$18,000.00 ", "")
To remove the trailing space after .00, use the trimws() function:
income<-trimws(income)
From the stringr library you could try this:
1. Remove the leading and trailing blanks (str_trim)
2. Remove the remaining blanks (str_replace_all)
library(stringr)
# 2.            1.
# |             |
# V             V
str_replace_all(str_trim(" xx yy 11 22 33 "), " ", "")

Resources