How to truncate specific part of string if present - r

Consider the following vector:
x <- c("GDP_UK", "GDP_US", "GDP_UK_diff2_L2",
"INC","GDP_UK_L2", "GDP_US_level", "INC_UK", "INC_L1", "INC_diff1")
As you can see, the vector contains several strings.
I want to find the elements that contain "_diff(number)", "_L(number)" or "_level" and remove that part of the string.
The result should be the following vector:
c("GDP_UK", "GDP_US", "GDP_UK", "INC", "GDP_UK", "GDP_US", "INC_UK", "INC", "INC")
All _diff, _L and _level parts have been removed to obtain the raw names.
I'm not sure how to do it. I tried this code:
x[grepl(paste(c("diff", "level", "_L"), collapse = "|"), x)]
to obtain only the elements that contain diff, level or _L, but I have no idea how to cut that part out. I tried substring, but wasn't sure how to specify where the deletion should start. Do you have any idea how this can be done?
** EDIT **
We can use the following code:
x <- gsub(pattern = "_L", replacement = "", x)
x <- gsub(pattern = "_diff", replacement = "", x)
x <- gsub(pattern = "_level", replacement = "", x)
However, this leaves the numbers at the end of the strings:
"GDP_UK" "GDP_US" "GDP_UK22" "INC" "GDP_UK2" "GDP_US" "INC_UK" "INC2" "INC1"

What you are looking for is the regex "_L\\d*", etc. This matches an underscore, L and zero or more digits.
In full
x <- c("GDP_UK", "GDP_US", "GDP_UK_diff2_L2",
"INC","GDP_UK_L2", "GDP_US_level", "INC_UK", "INC_L1", "INC_diff1")
# apply the three replacements one after another
y <- gsub("_L\\d*", "", x)
y <- gsub("_diff\\d*", "", y)
y <- gsub("_level\\d*", "", y)
y
# or in one go:
library(stringr)
x %>%
str_replace_all("_L\\d*", "") %>%
str_replace_all("_diff\\d*", "") %>%
str_replace_all("_level\\d*", "")
#> [1] "GDP_UK" "GDP_US" "GDP_UK" "INC" "GDP_UK" "GDP_US" "INC_UK" "INC"
#> [9] "INC"
## or even in one go:
gsub("_(L|diff|level)\\d*", "", x)
#> [1] "GDP_UK" "GDP_US" "GDP_UK" "INC" "GDP_UK" "GDP_US" "INC_UK" "INC"
#> [9] "INC"
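One caveat with the unanchored pattern: it removes "_L" anywhere in the string, so a name such as "GDP_LUX" (a made-up example, not from the question) would be damaged. A minimal sketch, assuming the suffixes only ever appear at the end of the string, anchors the match with $ and lets the group repeat; strip_suffix is a helper name introduced here for illustration:

```r
x <- c("GDP_UK", "GDP_US", "GDP_UK_diff2_L2",
       "INC", "GDP_UK_L2", "GDP_US_level", "INC_UK", "INC_L1", "INC_diff1")

# One or more trailing suffixes (_L<digits>, _diff<digits>, _level),
# anchored at the end of the string
strip_suffix <- function(s) sub("(_(L|diff|level)\\d*)+$", "", s)

res <- strip_suffix(x)
res

# "GDP_LUX_L2" is a made-up name to show the difference from the unanchored version
lux_anchored   <- strip_suffix("GDP_LUX_L2")
lux_unanchored <- gsub("_(L|diff|level)\\d*", "", "GDP_LUX_L2")
```

The anchored version leaves the "_L" inside "GDP_LUX" alone, while the unanchored gsub removes it as well.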

Related

Removing hyphens in http but preserving hyphenated words in corpus

I am trying to modify a stemming function that is able to 1) remove hyphens in http (that appeared in the corpus) but, meanwhile, 2) preserve hyphens that appeared in meaningful hyphenated expressions (e.g., time-consuming, cost-prohibitive, etc.).
I actually asked similar questions a few months ago on a different question thread, the code looks like this:
# load stringr to use str_replace_all
require(stringr)
clean.text = function(x)
{
# remove rt
x = gsub("rt ", "", x)
# remove at
x = gsub("#\\w+", "", x)
x = gsub("[[:punct:]]", "", x)
x = gsub("[[:digit:]]", "", x)
# remove http
x = gsub("http\\w+", "", x)
x = gsub("[ |\t]{2,}", "", x)
x = gsub("^ ", "", x)
x = gsub(" $", "", x)
x = str_replace_all(x, "[^[:alnum:][:space:]'-]", " ")
#return(x)
}
# example
my_text <- "accident-prone"
new_text <- clean.text(my_text)
new_text
[1] "accidentprone"
but I could not get a satisfactory answer, so I shifted my attention to other projects until resuming work on this. It appears that the "[^[:alnum:][:space:]'-]" in the last line of the code block is the culprit that also removes - from the non-http part of the corpus.
I could not figure out how to achieve the desired output, and I would appreciate any insights on this.
The actual culprit is the [[:punct:]] removing pattern as it matches - anywhere in the string.
You may use
clean.text <- function(x)
{
# remove rt
x <- gsub("rt\\s", "", x)
# remove at
x <- gsub("#\\w+", "", x)
x <- gsub("\\b-\\b(*SKIP)(*F)|[[:punct:]]", "", x, perl=TRUE)
x <- gsub("[[:digit:]]+", "", x)
# remove http
x <- gsub("http\\w+", "", x)
x <- gsub("\\h{2,}", "", x, perl=TRUE)
x <- trimws(x)
x <- gsub("[^[:alnum:][:space:]'-]", " ", x)
return(x)
}
Then,
my_text <- " accident-prone http://www.some.com rt "
new_text <- clean.text(my_text)
new_text
## => [1] "accident-prone"
See the R demo.
Note:
x = gsub("^ ", "", x) and x = gsub(" $", "", x) can be replaced with trimws(x)
gsub("\\b-\\b(*SKIP)(*F)|[[:punct:]]", "", x, perl=TRUE) removes any punctuation BUT hyphens in between word chars (you may adjust this further in the part before (*SKIP)(*F))
gsub("[^[:alnum:][:space:]'-]", " ", x) is a base R equivalent for str_replace_all(x, "[^[:alnum:][:space:]'-]", " ").
gsub("\\h{2,}", "", x, perl=TRUE) removes any 2 or more horizontal whitespace characters. If by "[ |\t]{2,}" you meant to match any 2 or more whitespace characters, use \\s instead of \\h here.
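To see the (*SKIP)(*F) idiom in isolation, here is a small sketch on made-up input strings (not from the question); keep_inner_hyphens is a name introduced only for this illustration. The left alternative matches a hyphen between word characters and then discards that match, so only the remaining punctuation is removed:

```r
keep_inner_hyphens <- function(s) {
  # \b-\b matches a hyphen with a word character on each side;
  # (*SKIP)(*F) makes the engine give up that match and move on,
  # so only punctuation matched by the right alternative is replaced.
  gsub("\\b-\\b(*SKIP)(*F)|[[:punct:]]", "", s, perl = TRUE)
}

a <- keep_inner_hyphens("time-consuming, cost-prohibitive!")
b <- keep_inner_hyphens("-foo- bar")
a
b
```

Hyphens inside words survive, while leading or trailing hyphens and other punctuation are dropped.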

gsub: Keep a given character only if between two digits/letters

I have a list of addresses that I would like to split into two arrays:
Address line (keeping special characters such as "-" whenever between two letters - c.f. text.2)
House number (keeping special characters such as "-" whenever between two digits)
Here is an example:
text.1 <- "CALLE COMPOSITOR LEHMBERG RUIZ 19-21"
text.2 <- "CALLE COMPOSITOR LEHMBERG-RUIZ 19-21"
To extract the house numbers, I tried using gsub("[^0-9\\-]", "", x) which works fine for text.1 but not as well as desired for text.2:
> gsub("[^0-9\\-]", "", text.1)
[1] "19-21"
> gsub("[^0-9\\-]", "", text.2)
[1] "-19-21"
To extract the address line I used gsub("[0-9]", "", x) yielding a similar problem.
I could circumvent this issue with the following code:
ifelse(substr(gsub("[^0-9\\-]", "", x), 1, 1) == "-",
       substr(gsub("[^0-9\\-]", "", x), 2, nchar(gsub("[^0-9\\-]", "", x))),
       gsub("[^0-9\\-]", "", x))
yielding "19-21" for both x = text.1 and x = text.2. However, as one can tell it is not very elegant.
My question would be if there is an "elegant" way to solve this issue (e.g. using gsub in a cleverer fashion)?
We can use a regular expression to SKIP when the pattern is true and remove all other characters
gsub("(\\d+)-(\\d+)(*SKIP)(*F)|.", "", text.1, perl = TRUE)
#[1] "19-21"
gsub("(\\d+)-(\\d+)(*SKIP)(*F)|.", "", text.2, perl = TRUE)
#[1] "19-21"
I would advise using str_extract instead of gsub in your case. You could do as follows:
library(stringr)
str_extract(text.1,"[0-9]{1,3}\\-[0-9]{1,3}")
[1] "19-21"
str_extract(text.2,"[0-9]{1,3}\\-[0-9]{1,3}")
[1] "19-21"
str_extract(text.1,"[^0-9][A-Z\\-\\s]+")
[1] "CALLE COMPOSITOR LEHMBERG RUIZ "
str_extract(text.2,"[^0-9][A-Z\\-\\s]+")
[1] "CALLE COMPOSITOR LEHMBERG-RUIZ "
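A base R alternative, as a minimal sketch assuming the house number (a single number or a hyphenated range) always sits at the end of the string; addr and house_pat are names introduced here for illustration:

```r
text.1 <- "CALLE COMPOSITOR LEHMBERG RUIZ 19-21"
text.2 <- "CALLE COMPOSITOR LEHMBERG-RUIZ 19-21"
addr <- c(text.1, text.2)

# Digits, optionally followed by "-digits", at the very end of the string
house_pat <- "[0-9]+(-[0-9]+)?$"

# regmatches() returns the matched part; entries without a match are dropped
house <- regmatches(addr, regexpr(house_pat, addr))

# Remove the house number and the whitespace before it to get the address line
street <- sub(paste0("\\s*", house_pat), "", addr)

house
street
```

Because the pattern is anchored at the end, the hyphen in "LEHMBERG-RUIZ" is never considered part of the house number.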

Handling string search and substitution in R

I am a beginner in R (I used Matlab before). I have been searching for a solution to my problem, but I cannot find one.
I have a very large vector with text entries. Something like
CAT06
6CAT
CAT 6
DOG3
3DOG
I would like to be able to find a function such that: If an entry is found and it contains "CAT" & "6" (no matter position), substitute cat6. If an entry is found and it contains "DOG" & "3" (no matter position) substitute dog3. So the outcome should be:
cat6 cat6 cat6 dog3 dog3
Can anybody help with this? Thank you very much, I find myself a bit lost!
First remove the spaces, i.e. turn elements like "CAT 6" into "CAT6":
sp = gsub(" ", "", c("CAT06", "6CAT", "CAT 6", "DOG3", "3DOG"))
Then use some regex magic to find any combination of "CAT", "0", "6" and replace these matches with "cat6" as follows:
sp = gsub("^(?:CAT|0|6)*$", "cat6", sp)
Same here with DOG case:
sp = gsub("^(?:DOG|0|3)*$", "dog3", sp)
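A more literal reading of the question (an entry contains both "CAT" and "6", in any position) can be sketched with two grepl calls per rule. This is only a sketch; contains_both is a helper name introduced here for illustration:

```r
x <- c("CAT06", "6CAT", "CAT 6", "DOG3", "3DOG")

# TRUE where an entry contains both fixed substrings, regardless of order
contains_both <- function(s, a, b) {
  grepl(a, s, fixed = TRUE) & grepl(b, s, fixed = TRUE)
}

out <- x
out[contains_both(x, "CAT", "6")] <- "cat6"
out[contains_both(x, "DOG", "3")] <- "dog3"
out
```

Entries that match neither rule are left unchanged.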
The input shown in the question is ambiguous as per my comment under the question. We show how to calculate it depending on which of three assumptions was intended.
1) vector input with embedded spaces Remove the digits and spaces ("[0-9 ]") in the first gsub, and remove the non-digits ("\\D") in the second gsub, converting the result to numeric to drop leading zeros. Then paste the two together:
x1 <- c("CAT06", "6CAT", "CAT 6", "DOG3", "3DOG") # test input
paste0(gsub("[0-9 ]", "", x1), as.numeric(gsub("\\D", "", x1)))
## [1] "CAT6" "CAT6" "CAT6" "DOG3" "DOG3"
2) single string Form chars by removing all digits and scanning the result in. Then form nums by removing everything except digits and spaces and scanning the result. Finally paste these together.
x2 <- "CAT06 6CAT CAT 6 DOG3 3DOG" # test input
chars <- scan(textConnection(gsub("\\d", "", x2)), what = "", quiet = TRUE)
nums <- scan(textConnection(gsub("[^ 0-9]", "", x2)), , quiet = TRUE)
y <- paste0(chars, nums)
y
## [1] "CAT6" "CAT6" "CAT6" "DOG3" "DOG3"
or, if a single output string is wanted, add this:
paste(y, collapse = " ")
3) vector input without embedded spaces Reduce this to case (2) and then apply (2).
x3 <- c("CAT06", "6CAT", "CAT", "6", "DOG3", "3DOG") # test input
xx <- paste(x3, collapse = " ")
chars <- scan(textConnection(gsub("\\d", "", xx)), what = "", quiet = TRUE)
nums <- scan(textConnection(gsub("[^ 0-9]", "", xx)), , quiet = TRUE)
y <- paste0(chars, nums)
y
## [1] "CAT6" "CAT6" "CAT6" "DOG3" "DOG3"
Note that this actually works for all three inputs. That is, if we replace x3 with x1 or x2 it still works and, as with (2), if a single output string is wanted, add paste(y, collapse = " ").

R Match And Sub On Space Between Specific Characters

I need a little help with a regular expression using gsub. Take this object:
x <- "4929A 939 8229"
I want to remove the space in between "A" and "9", but I am not sure how to match on only the space between them and not on the second space. I essentially need something like this:
x <- gsub("A 9", "", x)
But I am not sure how to write the regular expression to not match on the "A" and "9" and only the space between them.
Thanks in advance!
You may use the following regex in sub:
> x <- "4929A 939 8229"
> sub("\\s+", "", x)
[1] "4929A939 8229"
The \\s+ will match 1 or more whitespace symbols.
The replacement part is an empty string.
See the online R demo
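Note that sub("\\s+", "", x) removes the first whitespace run wherever it occurs. If the space must sit specifically between "A" and "9", as the question asks, a sketch with capture groups or with lookarounds (the latter needs perl = TRUE) would be:

```r
x <- "4929A 939 8229"

# Capture the neighbours and put them back without the space
y1 <- gsub("(A) (9)", "\\1\\2", x)

# Or match only the space, using a lookbehind for "A" and a lookahead for "9"
y2 <- gsub("(?<=A) (?=9)", "", x, perl = TRUE)

y1
y2
```

Both variants leave the second space untouched because it is not preceded by "A".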
gsub replaces every match of the pattern, whereas sub replaces only the first one. So
sub(" ", "", "4929A 939 8229") # returns "4929A939 8229"
will do the job.
Removing second/nth occurrence
You can do that e.g. by using strsplit as follows:
x <- c("4929A 939 8229", "4929A 9398229")
collapse_nth <- function(x_split, split, nth, replacement){
left <- paste(x_split[seq_len(nth)], collapse = split)
right <- paste(x_split[-seq_len(nth)], collapse = split)
paste(left, right, sep = replacement)
}
remove_nth <- function(x, nth, split, replacement = ""){
x_split <- strsplit(x, split, fixed = TRUE)
x_len <- vapply(x_split, length, integer(1))
out <- x
out[x_len>nth] <- vapply(x_split[x_len>nth], collapse_nth, character(1), split, nth, replacement)
out
}
Which gives you:
# > remove_nth(x, 2, " ")
# [1] "4929A 9398229" "4929A 9398229"
and
# > remove_nth(x, 2, " ", "---")
# [1] "4929A 939---8229" "4929A 9398229"
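For a single-character separator such as a space, the same effect can be sketched with one regular expression instead of strsplit. remove_nth_space is a hypothetical helper written for this illustration; it assumes nth >= 1, a plain space as the separator, and a replacement without backslashes:

```r
x <- c("4929A 939 8229", "4929A 9398229")

remove_nth_space <- function(s, nth, replacement = "") {
  # Group 1 captures everything up to (but not including) the nth space:
  # (nth - 1) repetitions of "non-spaces followed by a space", then non-spaces.
  pattern <- sprintf("^((?:[^ ]* ){%d}[^ ]*) ", nth - 1L)
  sub(pattern, paste0("\\1", replacement), s, perl = TRUE)
}

r2 <- remove_nth_space(x, 2)
r3 <- remove_nth_space(x, 2, "---")
r2
r3
```

Strings with fewer than nth spaces do not match the pattern and are returned unchanged, which mirrors the x_len > nth check in remove_nth.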

R simplify gsub() to make sample names from longer string

I have a list of sample names
name <- c("GOM_13M_TB-01_S.HM (Q)30",
          "GOM_13M_PS-06_S.HM (Q)30",
          "GOM_13O_PS-06_3C_HM (Q)30",
          "GOM_14O_GI-02_B3 (Q)30",
          "GOM_14O_PS-03_A3 (Q)30",
          "GOM_12J_GI-01_MS (Q)30")
that need to be simplified into
13M_TB-01_MS (MS for consistency)
13M_PS-06_MS
13O_PS-06_3C (I am not too concerned about the last 2 digits order)
14O_GI-02_B3
14O_PS-03_A3
12J_GI-01_MS
I have tried the following uses of gsub(), but I'm trying to simplify the solution.
x <- gsub("GOM_", "", name)
x <- gsub("\\(Q\\)30", "", x)
x <- gsub("_S", "_MS", x)
x <- gsub(".HM", "", x)
Any suggestions?
Maybe you can try something like the following:
gsub("GOM_(.*) .*", "\\1", gsub("S.HM", "MS", name))
# [1] "13M_TB-01_MS" "13M_PS-06_MS" "13O_PS-06_3C_HM" "14O_GI-02_B3"
# [5] "14O_PS-03_A3" "12J_GI-01_MS"
Or, perhaps:
## I think this matches what you're expecting...
substr(gsub("S.HM", "MS", name), 5, 16)
# [1] "13M_TB-01_MS" "13M_PS-06_MS" "13O_PS-06_3C" "14O_GI-02_B3"
# [5] "14O_PS-03_A3" "12J_GI-01_MS"
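Another option, sketched under the assumption that every name has the form GOM_<field>_<field>_<code>..., is to capture the pieces by their position between underscores, so the first two fields may vary in length. The variable short is a name introduced here for illustration:

```r
name <- c("GOM_13M_TB-01_S.HM (Q)30",
          "GOM_13M_PS-06_S.HM (Q)30",
          "GOM_13O_PS-06_3C_HM (Q)30",
          "GOM_14O_GI-02_B3 (Q)30",
          "GOM_14O_PS-03_A3 (Q)30",
          "GOM_12J_GI-01_MS (Q)30")

# Keep the two fields after "GOM_" plus the first two characters of the next field
short <- sub("^GOM_([^_]+_[^_]+)_(.{2}).*$", "\\1_\\2", name)

# Rename the "S." code to "MS" for consistency (the dot is escaped so it is literal)
short <- sub("_S\\.$", "_MS", short)
short
```

Escaping the dot avoids the accidental matches that an unescaped "S.HM" or ".HM" pattern could produce on other names.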
