Split string in R based on punctuation - r

Does anyone know how to split a string in R based on punctuation or how can I remove everything before punctuation, but not the punctuation?
x <- c("a>1", "b2<0", "yy01>10")
The following is the desired result:
"a", "b2", "yy01"
">1", "<0", ">10"
To get the first part I can do:
gsub("\\b\\d+\\b|[[:punct:]]", "", x)
"a" "b2" "yy01"
But I am not sure how to get the second one. Does anyone have an idea?
Thanks

using strsplit from base R with regex specified to split at the word boundary preceding the operators <>
do.call(cbind, strsplit(x, "\\b(?=[<>])", perl = TRUE))
# [,1] [,2] [,3]
#[1,] "a" "b2" "yy01"
#[2,] ">1" "<0" ">10"

Related

how to remove non alphabetic characters and columns from an csv file

I have a csv file that looks like this:
And in some portions the data in the columns is like this:
so as you can see, and because the "=" sign is present it wants to convert it into a formula, but what I need is the word in this case "rama...
I have extracted this term from a spam file and with R converted into a sparse matrix. So the question that I have is how can I get rid of the non-alphanumeric characters from this header in R, and then convert it again into a csv file?
Thanks
If you want a literal answer, you could try using gsub to replace any entry having one or more non alphanumeric characters:
df <- data.frame(v1=c(1,2,3), v2=c("#NAME?", "two", "#NAME?"),
stringsAsFactors=FALSE)
df <- data.frame(sapply(df, function(x) gsub(".*[^A-Za-z0-9].*", "", x)))
df
v1 v2
1 1
2 2 two
3 3
Demo
But the best/easiest thing to do here is probably to just fix your Excel formulas such that you catch these errors, and just display empty string, or some other sensible message. From what I can see, this is basically an Excel, not R, problem.
You can use gsub for that:
## A dummy matrix
example <- matrix(paste0("=", letters[1:9]),3,3)
# [,1] [,2] [,3]
#[1,] "= a" "= d" "= g"
#[2,] "= b" "= e" "= h"
#[3,] "= c" "= f" "= i"
You can remove the "=" by replacing it by "" in gsub
## Replacing the "=" by "" (nothing)
gsub("=", "", example)
# [,1] [,2] [,3]
#[1,] "a" "d" "g"
#[2,] "b" "e" "h"
#[3,] "c" "f" "i"
Or only in the first row (or in the column name, etc.)
## Removing the "=" in the first row
example <- gsub("=", "", example[,1])
# [,1] [,2] [,3]
#[1,] "a" "d" "g"
#[2,] "=b" "=e" "=h"
#[3,] "=c" "=f" "=i"

R: trim consecutive trailing and leading special characters from set of strings

I have a list of character vectors, all equal lengths. Example data:
> a = list('**aaa', 'bb*bb', 'cccc*')
> a = sapply(a, strsplit, '')
> a
[[1]]
[1] "*" "*" "a" "a" "a"
[[2]]
[1] "b" "b" "*" "b" "b"
[[3]]
[1] "c" "c" "c" "c" "*"
I would like to identify the indices of all leading and trailing consecutive occurrences of the character *. Then I would like to remove these indices from all three vectors in the list. By trailing and leading consecutive characters I mean e.g. either only a single occurrence as in the third one (cccc*) or multiple consecutive ones as in the first one (**aaa).
After the removal, all three character vectors should still have the same length.
So the first two and the last character should be removed from all three vectors.
[[1]]
[1] "a" "a"
[[2]]
[1] "*" "b"
[[3]]
[1] "c" "c"
Note that the second vector of the desired result will still have a leading *, which, however became the first character after the operation, so it should be in.
I tried using which to identify the indices (sapply(a, function(x)which(x=='*'))) but this would still require some code to detect the trailing ones.
Any ideas for a simple solution?
I would replace the lead and lag stars with NA:
aa <- lapply(setNames(a,seq_along(a)), function(x) {
star = x=="*"
toNA = cumsum(!star) == 0 | rev(cumsum(rev(!star))) == 0
replace(x, toNA, NA)
})
Store in a data.frame:
DF <- do.call(data.frame, c(aa, list(stringsAsFactors=FALSE)) )
Omit all rows with NA:
res <- na.omit(DF)
# X1 X2 X3
# 3 a * c
# 4 a b c
If you hate data.frames and want your list back: lapply(res,I) or c(unclass(res)), which gives
$X1
[1] "a" "a"
$X2
[1] "*" "b"
$X3
[1] "c" "c"
First of, like Richard Scriven asked in his comment to your question, your output is not the same as the thing you asked for. You ask for removal of leading and trailing characters, but your given ideal output is just the 3rd and 4th element of the character lists.
This would be easily achievable by something like
a <- list('**aaa', 'bb*bb', 'cccc*')
alist = sapply(a, strsplit, '')
lapply(alist, function(x) x[3:4])
Now for an answer as you asked it:
IMHO, sapply() isn't necessary here.
You need a function of the grep family to operate directly on your characters, which all share a help page in R opened by ?grep.
I would propose gsub() and a bit of Regular Expressions for your problem:
a <- list('**aaa', 'bb*bb', 'cccc*')
b <- gsub(pattern = "^(\\*)*", x = a, replacement = "")
c <- gsub(pattern = "(\\*)*$", x = b, replacement = "")
> c
[1] "aaa" "bb*bb" "cccc"
This is doable in one regex, but then you need a backreference for the stuff in between i think, and i didn't get this to work.
If you are familiar with the magrittr package and its excellent pipe operator, you can do this more elegantly:
library(magrittr)
gsub(pattern = "^(\\*)*", x = a, replacement = "") %>%
gsub(pattern = "(\\*)*$", x = ., replacement = "")

in R, use gsub to remove all punctuation except period

I am new to R so I hope you can help me.
I want to use gsub to remove all punctuation except for periods and minus signs so I can keep decimal points and negative symbols in my data.
Example
My data frame z has the following data:
[,1] [,2]
[1,] "1" "6"
[2,] "2#" "7.235"
[3,] "3" "8"
[4,] "4" "$9"
[5,] "£5" "-10"
I want to use gsub("[[:punct:]]", "", z) to remove the punctuation.
Current output
> gsub("[[:punct:]]", "", z)
[,1] [,2]
[1,] "1" "6"
[2,] "2" "7235"
[3,] "3" "8"
[4,] "4" "9"
[5,] "5" "10"
I would like, however, to keep the "-" sign and the "." sign.
Desired output
PSEUDO CODE:
> gsub("[[:punct:]]", "", z, except(".", "-") )
[,1] [,2]
[1,] "1" "6"
[2,] "2" "7.235"
[3,] "3" "8"
[4,] "4" "9"
[5,] "5" "-10"
Any ideas how I can make some characters exempt from the gsub() function?
You can put back some matches like this:
sub("([.-])|[[:punct:]]", "\\1", as.matrix(z))
X..1. X..2.
[1,] "1" "6"
[2,] "2" "7.235"
[3,] "3" "8"
[4,] "4" "9"
[5,] "5" "-10"
Here I am keeping the . and -.
And I guess , the next step is to coerce you result to a numeric matrix, SO here I combine the 2 steps like this:
matrix(as.numeric(sub("([.-])|[[:punct:]]", "\\1", as.matrix(z))),ncol=2)
[,1] [,2]
[1,] 1 6.000
[2,] 2 7.235
[3,] 3 8.000
[4,] 4 9.000
[5,] 5 -10.000
You may try this code. I found it quite handy.
x <- c('6,345', '7.235', '8', '$9', '-10')
gsub("[^[:alnum:]\\-\\.\\s]", "", x)
[1] "6345" "7.235" "8" "9" "-10"
x <- c('1', '2#', '3', '4', '£5')
gsub("[^[:alnum:]\\-\\.\\s]", "", x)
[1] "1" "2" "3" "4" "5"
This code{gsub("[^[:alnum:]]", "", x))} removes everything that does not include alphanumeric terms. Then we add to the exception list. Here we add hyphen(\-), full-stop(\.) and space(\s) to get gsub("[^[:alnum:]\-\.\s]", "", x). Now it removes everything that is not alphanumeric, hyphen, full stop and space.
Here are some options to restrict a generic character class in R using both base R (g)sub and the stringr remove/replace functions:
(g)sub with perl=TRUE
You may use the [[:punct:]] bracket expression with the [:punct:] POSIX character class and restrict it with the (?!\.) negative lookahead that will require that the immediately following char on the right is not equal to .:
(?!\.)[[:punct:]] # Excluding a dot only
(?![.-])[[:punct:]] # Excluding a dot and hyphen
To match one or more occurrences, wrap it with a non-capturing group and then set the + quantifier to the group:
(?:(?!\.)[[:punct:]])+ # Excluding a dot only
(?:(?![.-])[[:punct:]])+ # Excluding a dot and hyphen
Note that when you remove found matches, both expressions will yield the same results, however, when you need to replace with some other string/char, the quantification will allow changing whole consecutive character chunks with a single occurrence of the replacement pattern.
With stringr replace/remove functions
Before going into details, mind that the PCRE [[:punct:]] used with (g)sub will not match the same chars in the stringr regex functions that are powered by the ICU regex library. You need to use [\p{P}\p{S}] instead, see R/regex with stringi/ICU: why is a '+' considered a non-[:punct:] character?
The ICU regex library has a nice feature that can be used with character classes, called character class subtraction.
So, you write your character class, say, all punctuation matching class like [\p{P}\p{S}], and then you want to "exclude" (=subtract) a char or two or three, or a whole subclass of chars. You may use two notations:
[\p{P}\p{S}&&[^.]] # Excluding a dot
[\p{P}\p{S}--[.]] # Excluding a dot
[\p{P}\p{S}&&[^.-]] # Excluding a dot and hyphen
[\p{P}\p{S}--[.-]] # Excluding a dot and hyphen
To match 1+ consecutive occurrences with this approach, you do not need any wrapping groups, simply use +:
[\p{P}\p{S}&&[^.]]+ # Excluding a dot
[\p{P}\p{S}--[.]]+ # Excluding a dot
[\p{P}\p{S}&&[^.-]]+ # Excluding a dot and hyphen
[\p{P}\p{S}--[.-]]+ # Excluding a dot and hyphen
See R demo tests with outputs:
x <- "Abc.123#&*xxx(x-y-z)???? some#other!chars."
gsub("(?!\\.)[[:punct:]]", "", x, perl=TRUE)
## => [1] "Abc.123xxxxyz someotherchars."
gsub("(?!\\.)[[:punct:]]", "~", x, perl=TRUE)
## => [1] "Abc.123~~~xxx~x~y~z~~~~~ some~other~chars."
gsub("(?:(?!\\.)[[:punct:]])+", "~", x, perl=TRUE)
## => [1] "Abc.123~xxx~x~y~z~ some~other~chars."
library(stringr)
stringr::str_remove_all(x, "[\\p{P}\\p{S}&&[^.]]") # Same as "[\\p{P}\\p{S}--[.]]"
## => [1] "Abc.123xxxxyz someotherchars."
stringr::str_replace_all(x, "[\\p{P}\\p{S}&&[^.]]", "~")
## => [1] "Abc.123~~~xxx~x~y~z~~~~~ some~other~chars."
stringr::str_replace_all(x, "[\\p{P}\\p{S}&&[^.]]+", "~") # Same as "[\\p{P}\\p{S}--[.]]+"
## => [1] "Abc.123~xxx~x~y~z~ some~other~chars."
Another way to think about it is what do you want to keep? You can use regular expressions to both keep information as well as omit it. I have a lot of data frames that I need to clean units out of and convert from multiple rows in one pass and I find it easiest to use something from the apply family in these instances.
Recreating the example:
a <- c('1', '2#', '3', '4', '£5')
b <- c('6', '7.235', '8', '$9', '-10')
z <- matrix(data = c(a, b), nrow = length(a), ncol=2)
Then use apply in conjunction with gsub.
apply(z, 2, function(x) as.numeric(gsub('[^0-9\\.\\-]', '', x)))
[,1] [,2]
[1,] 1 6.000
[2,] 2 7.235
[3,] 3 8.000
[4,] 4 9.000
[5,] 5 -10.000
This instructs R to match everything except digits, periods, and hyphens/dashes. Personally, I find it much cleaner and easier to use in these situations and gives the same output.
Also, the documentation has a good explanation of these powerful but confusing regular expressions.
https://stat.ethz.ch/R-manual/R-devel/library/base/html/regex.html
Or ?regex

a sequence of str_match in a data.table

I have a string variable to parse into two parts. I figured I'd approach this using str_match from the stringr package, which returns a matrix with the original string in the first column and each extracted part in the other columns.
I found about a dozen regular expressions to extract these two parts. (The parts are a ladder and rung on a pay schedule, and it's very messy. I've verified that my regexes work by defining a function with a bunch of nested ifelse statements.)
library(stringr)
library(data.table)
my_strs <- c("A 01","G 00","A 2")
mydt <- data.table(strs = my_strs)
rx1 <- '^([[:alpha:]] )([[:digit:]]{2})$'
rx2 <- '(A) ([[:digit:]])'
I want to check the regexes in sequence and extract the parts using the first one that checks out. If I only had one regex, I could do this:
myfun <- function(x){
y <- str_match(x,rx1)
return(y)
}
mydt[,myfun(strs)]
# [,1] [,2] [,3]
# [1,] "A 01" "A " "01"
# [2,] "G 00" "G " "00"
# [3,] NA NA NA
(It took me a long time to even get that to work, trying all combinations of Vectorize and as.list on the function and *applying in the call.)
My best attempt at checking the regexes in sequence is this rather ugly kludge:
myfun2 <- function(x){
y <- str_match(x,rx1)
ifelse(!is.na(y[1]),"",(y <- str_match(x,rx2))[1])
return(y)
}
mydt[1:2,myfun2(strs)]
# [,1] [,2] [,3]
# [1,] "A 01" "A " "01"
# [2,] "G 00" "G " "00"
mydt[3,myfun2(strs)]
# [,1] [,2] [,3]
# [1,] "A 2" "A" "2"
mydt[1:3,myfun2(strs)]
# [,1] [,2] [,3]
# [1,] "A 01" "A " "01"
# [2,] "G 00" "G " "00"
# [3,] NA NA NA
As you can see, it doesn't quite work yet.
Do you have any idea about a better way to approach this? I have about 3.5 m rows in my data set, but only about 2000 unique values for this string, so I'm not really worried about efficiency.
Try this using strapply from the gsubfn package. We define a function that accepts the matches and returns the first two non-empty ones. Then use it with the regular expression paste(rx1, rx2, sep = "|") for each component of my_str :
library(gsubfn)
# test data
# there was an addition to the question in the comments. It asked to be able to handle
# one regular expression which has only a single capture. Make sure its at the end.
rx3 <- "^([[:digit:]]{2})$"
my_strs2 <- c(my_strs, "99")
# code
first2 <- function(...) { x <- c(..., NA); head(x[x != ""], 2) }
strapply(my_strs2, paste(rx1, rx2, rx3, sep = "|"), first2, simplify = TRUE)
The last line returns:
[,1] [,2] [,3] [,4]
[1,] "A " "G " "A" "99"
[2,] "01" "00" "2" NA
(If there are components of my_strs that do not match at all then a list will be returned in which those components are NULL. In that case you may prefer to drop the simplify = TRUE and always have it return a list.)
Note: strapplyc in the same package is much faster than strapply since the guts of it are written in tcl (a string processing language) whereas strapply is written in R. Thus you might want to break it up this way to leverage off of the faster routine:
L <- strapplyc(my_strs2, paste(rx1, rx2, rx3, sep = "|"))
sapply(L, first2)
For posterity, here is another solution I found today:
mydt[,{
i_rx <- min(which(unlist(sapply(rx_list,function(x)grepl(x,strs)))))
as.list(str_match(strs,rx_list[[i_rx]]))
},by=1:nrow(mydt)]
I made some minor alterations to the regexes and put them in a list.
rx1 <- '^([[:alpha:]] )([[:digit:]]{2})$'
rx2a <- "^(A) ([[:digit:]])$"
rx3a <- "^()([[:digit:]]{2})$"
rx_list <- list(rx1,rx2a,rx3a)

In R, how can a string be split without using a seperator

i am try split method and i want to have the second element of a string containing only 2 elemnts. The size of the string is 2.
examples :
string= "AC"
result shouldbe a split after the first letter ("A"), that I get :
res= [,1] [,2]
[1,] "A" "C"
I tryed it with split, but I have no idea how to split after the first element??
strsplit() will do what you want (if I understand your Question). You need to split on "" to split the string on it's elements. Here is an example showing how to do what you want on a vector of strings:
strs <- rep("AC", 3) ## your string repeated 3 times
next, split each of the three strings
sstrs <- strsplit(strs, "")
which produces
> sstrs
[[1]]
[1] "A" "C"
[[2]]
[1] "A" "C"
[[3]]
[1] "A" "C"
This is a list so we can process it with lapply() or sapply(). We need to subset each element of sstrs to select out the second element. Fo this we apply the [ function:
sapply(sstrs, `[`, 2)
which produces:
> sapply(sstrs, `[`, 2)
[1] "C" "C" "C"
If all you have is one string, then
strsplit("AC", "")[[1]][2]
which gives:
> strsplit("AC", "")[[1]][2]
[1] "C"
split isn't used for this kind of string manipulation. What you're looking for is strsplit, which in your case would be used something like this:
strsplit(string,"",fixed = TRUE)
You may not need fixed = TRUE, but it's a habit of mine as I tend to avoid regular expressions. You seem to indicate that you want the result to be something like a matrix. strsplit will return a list, so you'll want something like this:
strsplit(string,"",fixed = TRUE)[[1]]
and then pass the result to matrix.
If you sure that it's always two char string (check it by all(nchar(x)==2)) and you want only second then you could use sub or substr:
x <- c("ab", "12")
sub(".", "", x)
# [1] "b" "2"
substr(x, 2, 2)
# [1] "b" "2"

Resources