Parsing text in R without separator

I need help with ideas for parsing this text.
I want to do it in the most automatic way possible.
This is the text:
text <- "JOHN DEERE: PMWF2126 NEW HOLLAND: 441702A1 HIFI: WE 2126 CUMMINS: 4907485"
I need this result:
a            b
JOHN DEERE   PMWF2126
NEW HOLLAND  441702A1
HIFI         WE 2126
CUMMINS      4907485
This is just an example; the real data contains different brands and item IDs.
I tried:
str_split(text, " ")
[[1]]
[1] "JOHN" "DEERE:" "PMWF2126" "NEW" "HOLLAND:" "441702A1" "HIFI:" "WE" "2126"
[10] "CUMMINS:" "4907485" "CUMMINS:" "3680433" "CUMMINS:" "3680315" "CUMMINS:" "3100310"
Thanks!
Edit:
Thanks for your answers, very helpful.
But there is another case, where the value can end with a letter too:
text <- "LANSS: EF903R DARMET: VP-2726/S CASE: 133721A1 JOHN DEERE: RE68049 JCB: 32917302 WIX: 46490 TURBO: TR25902 HIFI: SA 16080 CATERPILLAR: 4431570 KOMATSU: Z7602BXK06 KOMATSU: Z7602BX106 KOMATSU: YM12991012501 KOMATSU: YM12991012500 KOMATSU: YM11900512571 KOMATSU: 6001851320 KOMATSU: 6001851300 KOMATSU: 3EB0234790 KOMATSU: 11900512571"

We can use separate_rows and separate from tidyr for this task:
library(tidyverse)
data.frame(text) %>%
# separate into rows:
separate_rows(text, sep = "(?<=\\d)\\s") %>%
# separate into columns:
separate(text,
into = c("a", "b"),
sep = ":\\s")
# A tibble: 4 × 2
a b
<chr> <chr>
1 JOHN DEERE PMWF2126
2 NEW HOLLAND 441702A1
3 HIFI WE 2126
4 CUMMINS 4907485
The split pattern for separate_rows uses the look-behind (?<=\\d) to assert that the whitespace \\s on which the string is split must be preceded by a digit (\\d).
Data:
text <- "JOHN DEERE: PMWF2126 NEW HOLLAND: 441702A1 HIFI: WE 2126 CUMMINS: 4907485"

This solution assumes (as in your sample data) that the second value always ends with a digit and that the first column does not.
If this is not the case, you'll have to adapt the regex (?<=[0-9] )(?=[A-Z]) so that the splitting point lies between the two parenthesized parts.
text <- "JOHN DEERE: PMWF2126 NEW HOLLAND: 441702A1 HIFI: WE 2126 CUMMINS: 4907485"
lapply(
strsplit(
unlist(strsplit(text, "(?<=[0-9] )(?=[A-Z])", perl = TRUE)),
":"), trimws)
[[1]]
[1] "JOHN DEERE" "PMWF2126"
[[2]]
[1] "NEW HOLLAND" "441702A1"
[[3]]
[1] "HIFI" "WE 2126"
[[4]]
[1] "CUMMINS" "4907485"
The key part is strsplit(text, "(?<=[0-9] )(?=[A-Z])", perl = TRUE).
It looks for positions where a digit followed by a space, (?<=[0-9] ), is immediately followed by a new part starting with a capital letter, (?=[A-Z]).
These positions are then used as splitting points.
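The list returned above can be bound into a two-column data frame. This is a minimal base R sketch, assuming every piece contains exactly one colon; the names pieces, pairs and parsed are illustrative and not part of the answer:

```r
text <- "JOHN DEERE: PMWF2126 NEW HOLLAND: 441702A1 HIFI: WE 2126 CUMMINS: 4907485"

# Split between "digit + space" and the next capital letter
pieces <- strsplit(text, "(?<=[0-9] )(?=[A-Z])", perl = TRUE)[[1]]

# Split each piece at the colon and trim the surrounding blanks
pairs <- lapply(strsplit(pieces, ":", fixed = TRUE), trimws)

# Stack the pairs into rows (assumes exactly two fields per piece)
parsed <- as.data.frame(do.call(rbind, pairs), stringsAsFactors = FALSE)
names(parsed) <- c("a", "b")
parsed
```

If a piece could contain more than one colon, the rbind step would no longer produce two clean columns, so that case would need separate handling.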

Since the second field always ends in a digit and the first field does not, replace a digit followed by space with that digit and a newline and then use read.table with a colon separator.
text |>
gsub("(\\d) ", "\\1\n", x = _) |>
read.table(text = _, sep = ":", strip.white = TRUE)
giving
V1 V2
1 JOHN DEERE PMWF2126
2 NEW HOLLAND 441702A1
3 HIFI WE 2126
4 CUMMINS 4907485
If in your data the second field can contain a digit but the first cannot, and the digit is not necessarily at the end of the last word in field two but could be anywhere in that word, then we can use this variation, which gives the same result here. gsubfn is like gsub except that the second argument can be a function instead of a replacement string; the function takes the capture groups (or the whole match when the pattern has none, as here) as input, and the entire match is replaced with its output. The function can be expressed in formula notation, as is done here.
library(gsubfn)
text |>
gsubfn("\\w+", ~ if (grepl("[0-9]", x)) paste(x, "\n") else x, x = _) |>
read.table(text = _, sep = ":", strip.white = TRUE)
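For the edited case in the question, where a value may end with a letter (EF903R, VP-2726/S), splitting on a trailing digit no longer works. One possible variation is to locate the keys instead. The sketch below assumes that a key is a run of capital-letter-only words preceded by a space (or the start of the string) and followed by a colon, and that no value ends in a letters-only word; the sample is a shortened version of the edited text and the variable names are illustrative:

```r
text <- "LANSS: EF903R DARMET: VP-2726/S CASE: 133721A1 JOHN DEERE: RE68049 HIFI: SA 16080"

# A key: capital-letter words, not preceded by a non-space character,
# ending in a colon (assumption about the data, see above)
key_pattern <- "(?<!\\S)[A-Z]+(?: [A-Z]+)*:"
m <- gregexpr(key_pattern, text, perl = TRUE)

keys <- regmatches(text, m)[[1]]
# invert = TRUE returns the text between the keys; the first piece is the
# (empty) text before the first key, so drop it
values <- regmatches(text, m, invert = TRUE)[[1]][-1]

result <- data.frame(a = sub(":$", "", keys),
                     b = trimws(values),
                     stringsAsFactors = FALSE)
result
```

A value such as "WE ABC" directly before a key would be ambiguous under these assumptions, so real data should be checked for that pattern first.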

Related

How to drop specific characters in strings in a column?

How can I drop "-" or double "--" only at the beginning of the value in the text column?
df <- data.frame (x = c(12,14,15,178),
text = c("--Car","-Transport","Big-Truck","--Plane"))
x text
1 12 --Car
2 14 -Transport
3 15 Big-Truck
4 178 --Plane
Expected output:
x text
1 12 Car
2 14 Transport
3 15 Big-Truck
4 178 Plane
You can use gsub and the following regex "^\\-+". ^ states that the match should be at the beginning of the string, and that it should be 1 or more (+) hyphen (\\-).
gsub("^\\-+", "", df$text)
# [1] "Car" "Transport" "Big-Truck" "Plane"
If there may be whitespace at the beginning of the string that you also want to remove, use [ -]+ in your regex. It matches any run of spaces or hyphens at the beginning of the string.
gsub("^[ -]+", "", df$text)
To apply this to the dataframe, just do this. In tidyverse, you can also use str_remove:
df$text <- gsub("^\\-+", "", df$text)
# or, in dplyr
library(tidyverse)
df %>%
mutate(text1 = gsub("^\\-+", "", text),
text2 = str_remove(text, "^\\-+"))
You could use trimws to remove certain leading/trailing characters.
trimws(df$text, whitespace = '[ -]')
# [1] "Car" "Transport" "Big-Truck" "Plane"
# a more complex situation
x <- " -- Car - -"
trimws(x, whitespace = '[ -]')
# [1] "Car"
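Note that trimws() strips both ends by default, while the question only asks about the beginning of the string. A small sketch of the difference, assuming R 3.6.0 or later (where trimws has the whitespace argument) and a made-up vector with one trailing hyphen:

```r
x <- c("--Car", "-Transport", "Big-Truck", "--Plane-")

# Strip hyphens from the left side only
left_only <- trimws(x, which = "left", whitespace = "[-]")
left_only
# [1] "Car"       "Transport" "Big-Truck" "Plane-"

# Default which = "both" also removes the trailing hyphen
both_sides <- trimws(x, whitespace = "[-]")
both_sides
# [1] "Car"       "Transport" "Big-Truck" "Plane"
```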

replacing "." with a character in R

I have a string that is
NAME = "Brad.Pitt"
I want to replace "." with a space (" "). I tried using sub, gsub, str_replace and str_replace_all, but they aren't working. Is there anything that I am missing?
Try:
NAME = "Brad.Pitt"
gsub("\\.", " ", NAME)
The main thing missing is that all the commands you mention regard the pattern as a regular expression, not a fixed string and, in particular, in a regular expression dot matches any character -- not just dot. You can find out more about these by reading ?"regular expression"
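To see the effect described here, it may help to compare the unescaped pattern with the escaped and fixed variants. A short sketch using only base R; the variable names are illustrative:

```r
NAME <- "Brad.Pitt"

# Unescaped dot: matches every character, so the whole string is blanked out
all_blank <- gsub(".", " ", NAME)
nchar(all_blank)          # still 9 characters, all of them spaces

# sub() with an unescaped dot replaces only the first character
first_only <- sub(".", " ", NAME)
first_only
# [1] " rad.Pitt"

# Escaped dot, or a fixed string: only the literal dot is replaced
escaped <- gsub("\\.", " ", NAME)
fixed   <- gsub(".", " ", NAME, fixed = TRUE)
escaped
# [1] "Brad Pitt"
```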
Here are some alternatives.
(1) The first one uses fixed = TRUE so that special characters in the first argument are not interpreted as regular expression characters. We used sub, which substitutes one occurrence; if there were multiple occurrences and you wanted to replace them all with a space, use gsub in place of sub.
(2) The second escapes the dot with square brackets so that it is not interpreted as a regular expression character. Another way of escaping the dot is to preface it with a backslash, so using the R 4.0 r"{...}" raw string notation we can use r"{\.}" as the pattern; for the version without that notation see the other answer.
(3) chartr does not use regular expressions at all. It replaces every dot with a space.
(4) substr<- replaces the 5th character with a space.
(5) scan reads in the words separately, creating a character vector with two components, and then paste pastes them back together with a space between them. If there were multiple dots, each would be replaced with a space.
# 1
sub(".", " ", NAME, fixed = TRUE)
## [1] "Brad Pitt"
# 2
sub("[.]", " ", NAME)
## [1] "Brad Pitt"
# 3
chartr(".", " ", NAME)
## [1] "Brad Pitt"
# 4
`substr<-`(NAME, 5, 5, ' ')
## [1] "Brad Pitt"
# 5
paste(scan(text = NAME, what = "", sep = ".", quiet = TRUE), collapse = " ")
## [1] "Brad Pitt"
Headings
Also note that if the reason you have the dot is that you read it in as a heading then you can use check.names = FALSE to avoid that. Compare the headings in the output of the two read.csv commands below.
Lines <- "BRAD PITT
1
2"
read.csv(text = Lines)
## BRAD.PITT
## 1 1
## 2 2
read.csv(text = Lines, check.names = FALSE)
## BRAD PITT
## 1 1
## 2 2

Match two character strings by location in R

string <- paste(append(rep(" ", 7), append("A", append(rep(" ", 8), append("B", append(rep(" ", 17), "C"))))), collapse = "")
text <- paste(append(rep(" ", 7), append("I love", append(rep(" ", 3), append("chocolate", append(rep(" ", 9), "pudding"))))), collapse = "")
string
[1] "       A        B                 C"
text
[1] "       I love   chocolate         pudding"
I am trying to match the letters in "string" with the text in "text" by position, so that A corresponds to "I love", B to "chocolate" and C to "pudding". Ideally, I would like to put A, B, C in three rows of column 1 of a data frame (or tibble) and the text in the corresponding rows of column 2. Any suggestions?
It is hard to know whether the strings you are trying to manipulate and then collate into columns of a data.frame follow a pattern. But for the example you posted, I suggest creating a list (strings) with the two strings:
strings <- list(string, text)
Then use lapply(), which returns a list with one result per element of strings.
res <-lapply(strings, function(x){
grep(x=trimws(unlist(strsplit(x, "\\s\\s"))), pattern="[[:alpha:]]", value=TRUE)
})
In the code above, strsplit() splits the string wherever two whitespace characters are found (\\s\\s). The result of the split is a list with the strings as inner elements, so you need unlist() before passing it to grep(). grep() then keeps only those strings that contain an alphabetic character, which is what you want.
You can then use do.call(cbind, list) to bind the elements of the resulting lapply() list into columns. The dimensions must match for this to work.
do.call(cbind, res)
Result:
> do.call(cbind, res)
[,1] [,2]
[1,] "A" "I love"
[2,] "B" "chocolate"
[3,] "C" "pudding"
You can wrap it in as.data.frame(), for instance, to get the desired result:
> as.data.frame(do.call(cbind, res), stringsAsFactors = FALSE)
V1 V2
1 A I love
2 B chocolate
3 C pudding
You can use read.fwf and get the field positions using gregexpr and nchar.
read.fwf(file=textConnection(text),
widths=c(diff(c(1, gregexpr("\\w", string)[[1]])), nchar(text)))[-1]
# V2 V3 V4
#1 I love chocolate pudding
If the white space should be removed, also use trimws:
trimws(read.fwf(file=textConnection(text),
widths=c(diff(c(1, gregexpr("\\w", string)[[1]])), nchar(text)))[-1])
#[1] "I love" "chocolate" "pudding"
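The same positional idea can be written with substring(), which also yields the two-column data frame asked for. This is a sketch under the assumption that each label in string sits exactly above the first character of its text and that labels are single characters; starts, ends and matched are illustrative names:

```r
string <- paste0(strrep(" ", 7), "A", strrep(" ", 8), "B", strrep(" ", 17), "C")
text   <- paste0(strrep(" ", 7), "I love", strrep(" ", 3), "chocolate",
                 strrep(" ", 9), "pudding")

# Start column of every label in `string`
starts <- as.integer(gregexpr("[^ ]", string)[[1]])
# Each field runs up to the column before the next label (or the end of `text`)
ends <- c(starts[-1] - 1L, nchar(text))

matched <- data.frame(
  key   = substring(string, starts, starts),
  value = trimws(substring(text, starts, ends)),
  stringsAsFactors = FALSE
)
matched
```

Here starts is 8, 17, 35, so the fields are cut at columns 8 to 16, 17 to 34 and 35 to 41 of text.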
Based on your data, I came up with this workaround by using the package stringr. This only works with that kind of pattern, so in case you have erratic ones you need to adjust it.
The output is a data.frame with two columns given by your two input data and rows according to the matches.
library(stringr)
string <- paste(append(rep(" ", 7), append("A", append(rep(" ", 8), append("B", append(rep(" ", 17), "C"))))), collapse = "")
text <- paste(append(rep(" ", 7), append("I love", append(rep(" ", 3), append("chocolate", append(rep(" ", 9), "pudding"))))), collapse = "")
string_nospace <- str_replace_all( string, "\\s{1,20}", " " )
string_nospace <- str_trim( string_nospace )
string_nospace <- data.frame( string = t(str_split(string_nospace, "\\s", simplify = TRUE)))
text_nospace <- str_replace_all( text, "\\s{2,20}", "_" )
text_nospace <- str_sub(text_nospace, start = 2)
text_nospace <- data.frame(text = t(str_split(text_nospace, "_", simplify = TRUE)))
df = data.frame(string = string_nospace,
text = text_nospace )
df
#> string text
#> 1 A I love
#> 2 B chocolate
#> 3 C pudding
Created on 2020-06-08 by the reprex package (v0.3.0)

How to extract text using delimiters when some delimiters missing

I am trying to extract text according to the headers in a semi-structured text document.
Input
Column<-"Order:1223442 Subject:History Name Bilbo Johnson Grade: Bad Report: Need to complete Conclusion: Dud"
The output here is
Order    Subject  Name           Grade  Report            Conclusion
1223442  History  Bilbo Johnson  Bad    Need to complete  Dud
I can achieve this with the following (messy but it works) function:
dataframeIn<-data.frame(Column,stringsAsFactors=FALSE)
delim<-c("Order","Subject","Name","Grade","Report","Conclusion")
Extractor <- function(dataframeIn, Column, delim) {
dataframeInForLater<-dataframeIn
ColumnForLater<-Column
Column <- rlang::sym(Column)
dataframeIn <- data.frame(dataframeIn)
dataframeIn<-dataframeIn %>%
tidyr::separate(!!Column, into = c("added_name",delim),
sep = paste(delim, collapse = "|"),
extra = "drop", fill = "right")
names(dataframeIn) <- gsub(".", "", names(dataframeIn), fixed = TRUE)
dataframeIn<-data.frame(dataframeIn)
#Add the original column back in so have the original reference
dataframeIn<-cbind(dataframeInForLater[,ColumnForLater],dataframeIn)
dataframeIn<-data.frame(dataframeIn)
return(dataframeIn)
}
Extractor(dataframeIn, "Column", delim)
However, sometimes the delimiters are missing, e.g.
Order:1223442 Subject:History Name Bilbo Johnson Grade: Bad Conclusion: Dud
In which case the desired output is
Order    Subject  Name           Grade  Conclusion
1223442  History  Bilbo Johnson  Bad    Dud
but the actual output becomes:
Order     Subject   Name           Grade  Report  Conclusion
:1223442  :History  Bilbo Johnson  : Bad  : Dud   <NA>
How can I account for missing delimiters, given that the ones that are present always appear in the same order (including delimiters that are missing in the middle of the text as well as at the end, as in the example above)?
We may do the following (it's only text extraction, I leave constructing the output for you):
library(stringr)
Extractor <- function(x, delim) {
pattern <- paste0(delim, ":{0,1}(.*?)(", paste(c(delim, "$"), collapse = "|"), ")")
trimws(str_match(x, pattern)[, 2])
}
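# Column1 is the complete example from the question; Column2 lacks "Report"
Column1 <- "Order:1223442 Subject:History Name Bilbo Johnson Grade: Bad Report: Need to complete Conclusion: Dud"
Column2 <- "Order:1223442 Subject:History Name Bilbo Johnson Grade: Bad Conclusion: Dud"
Extractor(Column1, delim)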
# [1] "1223442" "History" "Bilbo Johnson" "Bad" "Need to complete" "Dud"
Extractor(Column2, delim)
# [1] "1223442" "History" "Bilbo Johnson" "Bad" NA "Dud"
Column3 <- "Subject:History Name Bilbo Johnson"
Extractor(Column3, delim)
# [1] NA "History" "Bilbo Johnson" NA NA NA
Since we have NAs, it's clear which delimiters were missing and which weren't.
The way it works in your case is that we have a series of patterns
pattern
# [1] "Order:{0,1}(.*?)(Order|Subject|Name|Grade|Report|Conclusion|$)"
# [2] "Subject:{0,1}(.*?)(Order|Subject|Name|Grade|Report|Conclusion|$)"
# [3] "Name:{0,1}(.*?)(Order|Subject|Name|Grade|Report|Conclusion|$)"
# [4] "Grade:{0,1}(.*?)(Order|Subject|Name|Grade|Report|Conclusion|$)"
# [5] "Report:{0,1}(.*?)(Order|Subject|Name|Grade|Report|Conclusion|$)"
# [6] "Conclusion:{0,1}(.*?)(Order|Subject|Name|Grade|Report|Conclusion|$)"
Then str_match nicely extracts the (.*?) part into the second output column and we get rid of any surrounding spaces with trimws. We use lazy matching in (.*?) so as not to match too much.
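Constructing the output is left open above. One way it could be done is sketched below in base R, using regexec() in place of str_match() so that no package is needed. The helper name extract_fields is hypothetical, and the sketch assumes the delimiter words never occur inside a value:

```r
delim <- c("Order", "Subject", "Name", "Grade", "Report", "Conclusion")
Column2 <- "Order:1223442 Subject:History Name Bilbo Johnson Grade: Bad Conclusion: Dud"

# Hypothetical helper: one lazy pattern per delimiter, as in the answer
extract_fields <- function(x, delim) {
  stop_at <- paste(c(delim, "$"), collapse = "|")
  vapply(delim, function(d) {
    pattern <- paste0(d, ":?(.*?)(?:", stop_at, ")")
    m <- regmatches(x, regexec(pattern, x, perl = TRUE))[[1]]
    # No match means the delimiter is absent from the text
    if (length(m) == 0) NA_character_ else trimws(m[2])
  }, character(1))
}

fields <- extract_fields(Column2, delim)

# Keep only the delimiters that were present and build a one-row data frame
out <- as.data.frame(as.list(fields[!is.na(fields)]), stringsAsFactors = FALSE)
out
```

Dropping the NA entries gives one column per delimiter that is present; keeping them instead would give a fixed set of columns across many input rows, which is usually easier to bind together.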

How to remove all whitespace from a string?

So " xx yy 11 22 33 " will become "xxyy112233". How can I achieve this?
In general, we want a solution that is vectorised, so here's a better test example:
whitespace <- " \t\n\r\v\f" # space, tab, newline,
# carriage return, vertical tab, form feed
x <- c(
" x y ", # spaces before, after and in between
" \u2190 \u2192 ", # contains unicode chars
paste0( # varied whitespace
whitespace,
"x",
whitespace,
"y",
whitespace,
collapse = ""
),
NA # missing
)
## [1] " x y "
## [2] " ← → "
## [3] " \t\n\r\v\fx \t\n\r\v\fy \t\n\r\v\f"
## [4] NA
The base R approach: gsub
gsub replaces all instances of a string (fixed = TRUE) or regular expression (fixed = FALSE, the default) with another string. To remove all spaces, use:
gsub(" ", "", x, fixed = TRUE)
## [1] "xy" "←→"
## [3] "\t\n\r\v\fx\t\n\r\v\fy\t\n\r\v\f" NA
As DWin noted, in this case fixed = TRUE isn't necessary but provides slightly better performance since matching a fixed string is faster than matching a regular expression.
If you want to remove all types of whitespace, use:
gsub("[[:space:]]", "", x) # note the double square brackets
## [1] "xy" "←→" "xy" NA
gsub("\\s", "", x) # same; note the double backslash
library(rebus) # the answer originally loaded "regex"; to my knowledge that package is now published as rebus
gsub(space(), "", x) # same
"[:space:]" is a POSIX character class (used inside a bracket expression, hence the double square brackets) matching all whitespace characters. \s is the Perl-style shorthand that does the same thing.
The stringr approach: str_replace_all and str_trim
stringr provides more human-readable wrappers around the base R functions (though as of Dec 2014, the development version has a branch built on top of stringi, mentioned below). The equivalents of the above commands, using str_replace_all, are:
library(stringr)
str_replace_all(x, fixed(" "), "")
str_replace_all(x, space(), "")
stringr also has a str_trim function which removes only leading and trailing whitespace.
str_trim(x)
## [1] "x y" "← →" "x \t\n\r\v\fy" NA
str_trim(x, "left")
## [1] "x y " "← → "
## [3] "x \t\n\r\v\fy \t\n\r\v\f" NA
str_trim(x, "right")
## [1] " x y" " ← →"
## [3] " \t\n\r\v\fx \t\n\r\v\fy" NA
The stringi approach: stri_replace_all_charclass and stri_trim
stringi is built upon the platform-independent ICU library, and has an extensive set of string manipulation functions. The equivalents of the above are:
library(stringi)
stri_replace_all_fixed(x, " ", "")
stri_replace_all_charclass(x, "\\p{WHITE_SPACE}", "")
Here "\\p{WHITE_SPACE}" is an alternate syntax for the set of Unicode code points considered to be whitespace, equivalent to "[[:space:]]", "\\s" and space(). For more complex regular expression replacements, there is also stri_replace_all_regex.
stringi also has trim functions.
stri_trim(x)
stri_trim_both(x) # same
stri_trim(x, "left")
stri_trim_left(x) # same
stri_trim(x, "right")
stri_trim_right(x) # same
I just learned about the "stringr" package to remove white space from the beginning and end of a string with str_trim( , side="both") but it also has a replacement function so that:
a <- " xx yy 11 22 33 "
str_replace_all(string = a, pattern = " ", replacement = "")
[1] "xxyy112233"
x = "xx yy 11 22 33"
gsub(" ", "", x)
> [1] "xxyy112233"
Use [[:blank:]] to match any kind of horizontal white_space characters.
gsub("[[:blank:]]", "", " xx yy 11 22 33 ")
# [1] "xxyy112233"
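The difference between a literal space, [[:blank:]] and [[:space:]] only shows up when the input contains tabs or newlines. A small base R comparison on a made-up string:

```r
x <- " a \t b \n c "

# Literal space only: the tab and the newline survive
only_spaces <- gsub(" ", "", x, fixed = TRUE)
only_spaces
# [1] "a\tb\nc"

# [[:blank:]] covers space and tab, but not the newline
horizontal <- gsub("[[:blank:]]", "", x)
horizontal
# [1] "ab\nc"

# [[:space:]] covers all whitespace characters
all_ws <- gsub("[[:space:]]", "", x)
all_ws
# [1] "abc"
```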
Please note that the solutions written above remove only spaces. If you also want to remove tabs or newlines, use stri_replace_all_charclass from the stringi package.
library(stringi)
stri_replace_all_charclass(" ala \t ma \n kota ", "\\p{WHITE_SPACE}", "")
## [1] "alamakota"
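The function str_squish() from the stringr package (part of the tidyverse) trims leading and trailing whitespace and collapses internal runs of whitespace into a single space: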
library(dplyr)
library(stringr)
df <- data.frame(a = c(" aZe aze s", "wxc s aze "),
b = c(" 12 12 ", "34e e4 "),
stringsAsFactors = FALSE)
df <- df %>%
rowwise() %>%
mutate(across(everything(), str_squish)) %>% # funs() is deprecated; across() needs dplyr >= 1.0.0
ungroup()
df
# A tibble: 2 x 2
a b
<chr> <chr>
1 aZe aze s 12 12
2 wxc s aze 34e e4
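As the output shows, str_squish() keeps one space between words rather than deleting all whitespace. For comparison, a base R sketch of both behaviours on made-up input; the gsub/trimws combination is an approximation of str_squish(), not its actual implementation:

```r
x <- c("  aZe  aze s", "wxc  s     aze   ")

# Squish: trim the ends, then collapse internal runs to a single space
squished <- gsub("\\s+", " ", trimws(x))
squished
# [1] "aZe aze s" "wxc s aze"

# Remove: delete every whitespace character
removed <- gsub("\\s+", "", x)
removed
# [1] "aZeazes" "wxcsaze"
```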
Another approach to consider:
library(stringr)
str_replace_all(" xx yy 11 22 33 ", regex("\\s*"), "")
#[1] "xxyy112233"
\\s: Matches Space, tab, vertical tab, newline, form feed, carriage return
*: Matches at least 0 times
income<-c("$98,000.00 ", "$90,000.00 ", "$18,000.00 ", "")
To remove space after .00 use the trimws() function.
income<-trimws(income)
From the stringr library you could try this:
1. Trim leading and trailing blanks
2. Remove the remaining blanks
library(stringr)
# 2.              1.
# |               |
# V               V
str_replace_all(str_trim(" xx yy 11 22 33 "), " ", "")

Resources