in R, use gsub to remove all punctuation except period - r

I am new to R so I hope you can help me.
I want to use gsub to remove all punctuation except for periods and minus signs so I can keep decimal points and negative symbols in my data.
Example
My data frame z has the following data:
[,1] [,2]
[1,] "1" "6"
[2,] "2#" "7.235"
[3,] "3" "8"
[4,] "4" "$9"
[5,] "£5" "-10"
I want to use gsub("[[:punct:]]", "", z) to remove the punctuation.
Current output
> gsub("[[:punct:]]", "", z)
[,1] [,2]
[1,] "1" "6"
[2,] "2" "7235"
[3,] "3" "8"
[4,] "4" "9"
[5,] "5" "10"
I would like, however, to keep the "-" sign and the "." sign.
Desired output
PSEUDO CODE:
> gsub("[[:punct:]]", "", z, except(".", "-") )
[,1] [,2]
[1,] "1" "6"
[2,] "2" "7.235"
[3,] "3" "8"
[4,] "4" "9"
[5,] "5" "-10"
Any ideas how I can make some characters exempt from the gsub() function?

You can put back some matches like this:
gsub("([.-])|[[:punct:]]", "\\1", as.matrix(z))
X..1. X..2.
[1,] "1" "6"
[2,] "2" "7.235"
[3,] "3" "8"
[4,] "4" "9"
[5,] "5" "-10"
Here I am keeping the . and -.
I guess the next step is to coerce your result to a numeric matrix, so here I combine the two steps like this:
matrix(as.numeric(gsub("([.-])|[[:punct:]]", "\\1", as.matrix(z))),ncol=2)
[,1] [,2]
[1,] 1 6.000
[2,] 2 7.235
[3,] 3 8.000
[4,] 4 9.000
[5,] 5 -10.000
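One thing to watch with this alternation pattern: sub() replaces only the first match in each string, while gsub() replaces every match. Here is a minimal sketch of the difference. The input strings are made up for illustration and are not from the question's data.

```r
# Hypothetical inputs with more than one punctuation character per string
x <- c("$9.50#", "-10", "2#")
pattern <- "([.-])|[[:punct:]]"

# sub() stops after the first match, so the trailing "#" survives
first_only <- sub(pattern, "\\1", x)

# gsub() handles every match: "." and "-" are put back via \\1,
# any other punctuation is replaced by the (empty) group
all_matches <- gsub(pattern, "\\1", x)

print(first_only)
print(all_matches)
```

With data like the question's, where each string has at most one unwanted character, both calls give the same result.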

You may try this code. I found it quite handy.
x <- c('6,345', '7.235', '8', '$9', '-10')
gsub("[^[:alnum:]\\-\\.\\s]", "", x)
[1] "6345" "7.235" "8" "9" "-10"
x <- c('1', '2#', '3', '4', '£5')
gsub("[^[:alnum:]\\-\\.\\s]", "", x)
[1] "1" "2" "3" "4" "5"
The call gsub("[^[:alnum:]]", "", x) removes every character that is not alphanumeric. We then add exceptions to the negated class: hyphen (\\-), full stop (\\.) and whitespace (\\s), which gives gsub("[^[:alnum:]\\-\\.\\s]", "", x). This removes everything that is not alphanumeric, a hyphen, a full stop or whitespace.

Here are some options to restrict a generic character class in R using both base R (g)sub and the stringr remove/replace functions:
(g)sub with perl=TRUE
You may use the [[:punct:]] bracket expression with the [:punct:] POSIX character class and restrict it with the (?!\.) negative lookahead that will require that the immediately following char on the right is not equal to .:
(?!\.)[[:punct:]] # Excluding a dot only
(?![.-])[[:punct:]] # Excluding a dot and hyphen
To match one or more occurrences, wrap it with a non-capturing group and then set the + quantifier to the group:
(?:(?!\.)[[:punct:]])+ # Excluding a dot only
(?:(?![.-])[[:punct:]])+ # Excluding a dot and hyphen
Note that when you remove the matches, both expressions yield the same result. When you replace them with some other string or char, however, the quantified version replaces each chunk of consecutive characters with a single occurrence of the replacement.
With stringr replace/remove functions
Before going into details, mind that the PCRE [[:punct:]] used with (g)sub will not match the same chars in the stringr regex functions that are powered by the ICU regex library. You need to use [\p{P}\p{S}] instead, see R/regex with stringi/ICU: why is a '+' considered a non-[:punct:] character?
The ICU regex library has a nice feature that can be used with character classes, called character class subtraction.
So, you write your character class, say, all punctuation matching class like [\p{P}\p{S}], and then you want to "exclude" (=subtract) a char or two or three, or a whole subclass of chars. You may use two notations:
[\p{P}\p{S}&&[^.]] # Excluding a dot
[\p{P}\p{S}--[.]] # Excluding a dot
[\p{P}\p{S}&&[^.-]] # Excluding a dot and hyphen
[\p{P}\p{S}--[.-]] # Excluding a dot and hyphen
To match 1+ consecutive occurrences with this approach, you do not need any wrapping groups, simply use +:
[\p{P}\p{S}&&[^.]]+ # Excluding a dot
[\p{P}\p{S}--[.]]+ # Excluding a dot
[\p{P}\p{S}&&[^.-]]+ # Excluding a dot and hyphen
[\p{P}\p{S}--[.-]]+ # Excluding a dot and hyphen
See R demo tests with outputs:
x <- "Abc.123#&*xxx(x-y-z)???? some#other!chars."
gsub("(?!\\.)[[:punct:]]", "", x, perl=TRUE)
## => [1] "Abc.123xxxxyz someotherchars."
gsub("(?!\\.)[[:punct:]]", "~", x, perl=TRUE)
## => [1] "Abc.123~~~xxx~x~y~z~~~~~ some~other~chars."
gsub("(?:(?!\\.)[[:punct:]])+", "~", x, perl=TRUE)
## => [1] "Abc.123~xxx~x~y~z~ some~other~chars."
library(stringr)
stringr::str_remove_all(x, "[\\p{P}\\p{S}&&[^.]]") # Same as "[\\p{P}\\p{S}--[.]]"
## => [1] "Abc.123xxxxyz someotherchars."
stringr::str_replace_all(x, "[\\p{P}\\p{S}&&[^.]]", "~")
## => [1] "Abc.123~~~xxx~x~y~z~~~~~ some~other~chars."
stringr::str_replace_all(x, "[\\p{P}\\p{S}&&[^.]]+", "~") # Same as "[\\p{P}\\p{S}--[.]]+"
## => [1] "Abc.123~xxx~x~y~z~ some~other~chars."
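Applied to the kind of data in the question, the lookahead version that spares both the dot and the hyphen could look like the sketch below. The matrix is rebuilt by hand here as an assumption about the original data. I left out the "£" entry because non-ASCII symbols may be treated differently depending on the regex engine and locale.

```r
# Assumed reconstruction of the question's matrix (ASCII entries only)
z <- matrix(c("1", "2#", "3", "4", "5",
              "6", "7.235", "8", "$9", "-10"), ncol = 2)

# Remove any punctuation char that is not "." or "-"
cleaned <- gsub("(?![.-])[[:punct:]]", "", z, perl = TRUE)

# Convert the cleaned strings to a numeric matrix of the same shape
num <- matrix(as.numeric(cleaned), ncol = ncol(z))
print(num)
```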

Another way to think about it is: what do you want to keep? Regular expressions can be used to keep information as well as to omit it. I have a lot of data frames where I need to strip units and convert several columns in one pass, and I find it easiest to use something from the apply family in these instances.
Recreating the example:
a <- c('1', '2#', '3', '4', '£5')
b <- c('6', '7.235', '8', '$9', '-10')
z <- matrix(data = c(a, b), nrow = length(a), ncol=2)
Then use apply in conjunction with gsub.
apply(z, 2, function(x) as.numeric(gsub('[^0-9\\.\\-]', '', x)))
[,1] [,2]
[1,] 1 6.000
[2,] 2 7.235
[3,] 3 8.000
[4,] 4 9.000
[5,] 5 -10.000
This instructs R to match everything except digits, periods, and hyphens/dashes. Personally, I find it much cleaner and easier to use in these situations and gives the same output.
Also, the documentation has a good explanation of these powerful but confusing regular expressions.
https://stat.ethz.ch/R-manual/R-devel/library/base/html/regex.html
Or ?regex
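If the data really is a data frame rather than a matrix, the same "keep only what you want" idea can be sketched with lapply, which works column by column and keeps the data frame structure. The column names and values here are made up for illustration.

```r
# Hypothetical data frame with stray symbols in character columns
df <- data.frame(a = c("1", "2#", "3"),
                 b = c("7.235", "$9", "-10"),
                 stringsAsFactors = FALSE)

# Keep only digits, "." and "-" in every column, then convert to numeric.
# Assigning into df[] keeps the result a data frame.
df[] <- lapply(df, function(col) as.numeric(gsub("[^0-9.-]", "", col)))

str(df)
```

Inside a bracket expression the dot needs no escaping, and a hyphen placed last is taken literally, so "[^0-9.-]" should behave like the escaped version above.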

Related

how to remove non alphabetic characters and columns from an csv file

I have a csv file that looks like this:
And in some portions the data in the columns is like this:
so as you can see, and because the "=" sign is present it wants to convert it into a formula, but what I need is the word in this case "rama...
I have extracted this term from a spam file and with R converted into a sparse matrix. So the question that I have is how can I get rid of the non-alphanumeric characters from this header in R, and then convert it again into a csv file?
Thanks
If you want a literal answer, you could try using gsub to replace any entry having one or more non alphanumeric characters:
df <- data.frame(v1=c(1,2,3), v2=c("#NAME?", "two", "#NAME?"),
stringsAsFactors=FALSE)
df <- data.frame(sapply(df, function(x) gsub(".*[^A-Za-z0-9].*", "", x)))
df
v1 v2
1 1
2 2 two
3 3
But the best/easiest thing to do here is probably to just fix your Excel formulas such that you catch these errors, and just display empty string, or some other sensible message. From what I can see, this is basically an Excel, not R, problem.
You can use gsub for that:
## A dummy matrix
example <- matrix(paste0("=", letters[1:9]),3,3)
# [,1] [,2] [,3]
#[1,] "=a" "=d" "=g"
#[2,] "=b" "=e" "=h"
#[3,] "=c" "=f" "=i"
You can remove the "=" by replacing it by "" in gsub
## Replacing the "=" by "" (nothing)
gsub("=", "", example)
# [,1] [,2] [,3]
#[1,] "a" "d" "g"
#[2,] "b" "e" "h"
#[3,] "c" "f" "i"
Or only in the first row (or in the column name, etc.)
## Removing the "=" in the first row only
example[1, ] <- gsub("=", "", example[1, ])
# [,1] [,2] [,3]
#[1,] "a" "d" "g"
#[2,] "=b" "=e" "=h"
#[3,] "=c" "=f" "=i"
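For the header part of the question, a minimal sketch along the same lines: strip non-alphanumeric characters from the column names with gsub and write the result back out with write.csv. The data frame, the header strings and the temporary file path are all hypothetical stand-ins, since the original file is not shown.

```r
# Hypothetical data with headers that a spreadsheet would misread
df <- data.frame(a = 1:2, b = 3:4)
names(df) <- c("=offer", "#free!")

# Drop everything that is not a letter or digit from the headers
names(df) <- gsub("[^[:alnum:]]", "", names(df))

# Write the cleaned data back to a csv file (temporary path for the example)
out <- tempfile(fileext = ".csv")
write.csv(df, out, row.names = FALSE)

print(names(read.csv(out)))
```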

How do I extract the first number that occurs after a matching pattern

Consider these examples:
examples <- c(
"abc foo",
"abc foo 17",
"0 abc defg foo 5 121",
"abc 12 foo defg 11"
)
Here I would like to return the first number that occurs after "foo". In this case: NA, 17, 5, 11. How can I do this? I tried using a look-behind, but with no luck.
library(stringr)
str_extract(examples, "(?<=foo.*)[0-9]+")
Error in stri_extract_first_regex(string, pattern, opts_regex = opts(pattern)) :
Look-Behind pattern matches must have a bounded maximum length. (U_REGEX_LOOK_BEHIND_LIMIT)
This seems to work:
str_match(examples, "foo.*?(\\d+)")
[,1] [,2]
[1,] NA NA
[2,] "foo 17" "17"
[3,] "foo 5" "5"
[4,] "foo defg 11" "11"
From ?regex:
By default repetition is greedy, so the maximal possible number of repeats is used. This can be changed to ‘minimal’ by appending ? to the quantifier.
From ?str_extract:
See Also
?str_match to extract matched groups; ?stri_extract for the underlying implementation.
You may use a base R solution like this:
> res <- gsub(".*?foo\\D*(\\d+).*|.*", "\\1", examples)
> res[nchar(res)==0] <- NA
> res
[1] NA "17" "5" "11"
As the regex will always match any string, you do not need to run a regex replacement twice, just fill out empty values with NA as the second step.
The pattern matches:
.*?foo - any 0+ chars as few as possible (since *? is lazy) up to the first occurrence of foo and then foo itself
\\D* - zero or more non-digit chars
(\\d+) - Group 1 that captures 1 or more digits (later, the value stored in the group can be referred with \1 backreference)
.* - the rest of the string
| - OR
.* - the whole string even if empty.
Base R gsub can do it:
# pulls the first run of digits in each string
gsub('^\\D*(\\d*).*', '\\1', examples)
[1] "" "17" "0" "12"
Edit: actual solution using base R
ifelse(
grepl('foo\\D*\\d', examples),
gsub('^\\D*(\\d+).*', '\\1', gsub('.*foo\\s*', '', examples)),
NA)
[1] NA "17" "5" "11"
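As another possible base R variant (not one of the approaches above), the PCRE \K operator can stand in for the unbounded look-behind: everything matched before \K is dropped from the reported match. This is a sketch, assuming perl = TRUE is acceptable in your setup.

```r
examples <- c(
  "abc foo",
  "abc foo 17",
  "0 abc defg foo 5 121",
  "abc 12 foo defg 11"
)

# Match "foo", skip non-digits, then \K resets the match start
# so that only the digits are reported
m <- regexpr("foo\\D*\\K\\d+", examples, perl = TRUE)

# regmatches() drops non-matching elements, so fill a vector of NAs
res <- rep(NA_character_, length(examples))
res[m > 0] <- regmatches(examples, m)
print(res)
```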

extract numerical suffixes from strings in R

I have this character vector:
variables <- c("ret.SMB.l1", "ret.mkt.l1", "ret.mkt.l4", "vix.l4", "ret.mkt.l5", "vix.l6", "slope.l11", "slope.l12", "us2yy.l2")
Desired output:
> suffixes(variables)
[1] 1 1 4 4 5 6 11 12 2
In other words, I need a function that will return a numeric vector showing the suffixes (each of which is 1 or 2 digits long). Note, I need something that can work with a much larger number of strings, which may or may not have numbers somewhere in the middle. The numerical suffixes range from 1 to 99.
Many thanks
Just use gsub:
> gsub(".*?([0-9]+)$", "\\1", variables)
[1] "1" "1" "4" "4" "5" "6" "11" "12" "2"
Wrap it in as.numeric if you want the result as a number.
You could use sub function.
> variables <- c("ret.SMB.l1", "ret.mkt.l1", "ret.mkt.l4", "vix.l4", "ret.mkt.l5" ,"vix.l6", "slope.l11", "slope.l12", "us2yy.l2")
> sub(".*\\D", "", variables)
[1] "1" "1" "4" "4" "5" "6" "11" "12" "2"
.*\\D matches all the characters from the start up to the last non-digit character. Replacing those matched characters with an empty string will give you the desired output.
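To get the suffixes(variables) call from the question, the extraction can be wrapped in a small function. This is only a sketch: the function name comes from the question, and returning NA for strings without a numeric suffix is my assumption about the desired behaviour.

```r
# Return the trailing run of digits of each string as a number,
# or NA when a string does not end in a digit
suffixes <- function(x) {
  m <- regexpr("[0-9]+$", x)
  out <- rep(NA_real_, length(x))
  out[m > 0] <- as.numeric(regmatches(x, m))
  out
}

variables <- c("ret.SMB.l1", "ret.mkt.l1", "ret.mkt.l4", "vix.l4", "ret.mkt.l5",
               "vix.l6", "slope.l11", "slope.l12", "us2yy.l2")
print(suffixes(variables))
print(suffixes("vix"))
```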

Why does as.matrix add extra spaces when converting numeric to character?

If you use apply over rows on a data.frame with character and numeric columns, apply uses as.matrix internally to convert the data.frame to only characters. But if the numeric column consists of numbers of different lengths, as.matrix adds spaces to match the highest/"longest" number.
An example:
df <- data.frame(id1=c(rep("a",3)),id2=c(100,90,8), stringsAsFactors = FALSE)
df
## id1 id2
## 1 a 100
## 2 a 90
## 3 a 8
as.matrix(df)
## id1 id2
## [1,] "a" "100"
## [2,] "a" " 90"
## [3,] "a" "  8"
I would have expected the result to be:
id1 id2
[1,] "a" "100"
[2,] "a" "90"
[3,] "a" "8"
Why the extra spaces?
They can create unexpected results when using apply on a data.frame:
myfunc <- function(row){
paste(row[1], row[2], sep = "")
}
> apply(df, 1, myfunc)
[1] "a100" "a 90" "a  8"
>
While looping gives the expected result.
> for (i in 1:nrow(df)){
print(myfunc(df[i,]))
}
[1] "a100"
[1] "a90"
[1] "a8"
and
> paste(df[,1], df[,2], sep = "")
[1] "a100" "a90" "a8"
Are there any situations where the extra spaces that are added with as.matrix are useful?
This is because of the way numeric columns are converted in the as.matrix.data.frame method when the data frame also contains non-numeric columns. There is a simple work-around, shown below.
Details
?as.matrix notes that conversion is done via format(), and it is here that the additional spaces are added. Specifically, ?as.matrix has this in the Details section:
‘as.matrix’ is a generic function. The method for data frames
will return a character matrix if there is only atomic columns and
any non-(numeric/logical/complex) column, applying ‘as.vector’ to
factors and ‘format’ to other non-character columns. Otherwise,
the usual coercion hierarchy (logical < integer < double <
complex) will be used, e.g., all-logical data frames will be
coerced to a logical matrix, mixed logical-integer will give a
integer matrix, etc.
?format also notes that
Character strings are padded with blanks to the display width of the widest.
Consider this example which illustrates the behaviour
> format(df[,2])
[1] "100" " 90" "  8"
> nchar(format(df[,2]))
[1] 3 3 3
format doesn't have to work this way as it has trim:
trim: logical; if ‘FALSE’, logical, numeric and complex values are
right-justified to a common width: if ‘TRUE’ the leading
blanks for justification are suppressed.
e.g.
> format(df[,2], trim = TRUE)
[1] "100" "90" "8"
but there is no way to pass this argument along to the as.matrix.data.frame method.
Workaround
A way to work around this is to apply format() yourself, manually, via sapply. There you can pass in trim = TRUE
> sapply(df, format, trim = TRUE)
id1 id2
[1,] "a" "100"
[2,] "a" "90"
[3,] "a" "8"
or, using vapply we can state what we expect to be returned (here character vectors of length 3 [nrow(df)]):
> vapply(df, format, FUN.VALUE = character(nrow(df)), trim = TRUE)
id1 id2
[1,] "a" "100"
[2,] "a" "90"
[3,] "a" "8"
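To tie this back to the apply() problem in the question, here is a minimal sketch that compares apply on the data frame directly with apply on the trimmed character matrix. It reuses df and myfunc from the question.

```r
df <- data.frame(id1 = rep("a", 3), id2 = c(100, 90, 8),
                 stringsAsFactors = FALSE)

myfunc <- function(row) {
  paste(row[1], row[2], sep = "")
}

# apply() converts df with as.matrix(), so the numbers arrive padded
padded <- apply(df, 1, myfunc)

# Building the character matrix ourselves with trim = TRUE avoids the padding
trimmed <- apply(sapply(df, format, trim = TRUE), 1, myfunc)

print(padded)
print(trimmed)
```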
It does seem a little strange. In the manual (?as.matrix) it explains that format is called for the conversion to a character matrix:
The method for data frames will return a character matrix if there is
only atomic columns and any non-(numeric/logical/complex) column,
applying as.vector to factors and format to other non-character
columns.
And you can see that if you call format directly, it does what as.matrix does:
format(df$id2)
[1] "100" " 90" "  8"
What you need to do is pass the trim argument:
format(df$id2,trim=TRUE)
[1] "100" "90" "8"
But, unfortunately, the as.matrix.data.frame function doesn't allow you to do that.
else if (non.numeric) {
for (j in pseq) {
if (is.character(X[[j]]))
next
xj <- X[[j]]
miss <- is.na(xj)
xj <- if (length(levels(xj)))
as.vector(xj)
else format(xj) # This could have ... as an argument
# else format(xj,...)
is.na(xj) <- miss
X[[j]] <- xj
}
}
So, you could modify as.matrix.data.frame. I think it would be a nice feature addition, however, to include this in base.
But, a quick solution would be to simply:
as.matrix(data.frame(lapply(df,as.character)))
id1 id2
[1,] "a" "100"
[2,] "a" "90"
[3,] "a" "8"
# As mentioned in the comments, this also works:
sapply(df,as.character)
as.matrix calls format internally:
> format(df$id2)
[1] "100" " 90" "  8"
That's where the extra spaces come from. format has an extra argument trim to remove those:
> format(df$id2, trim = TRUE)
[1] "100" "90" "8"
However you cannot supply this argument to as.matrix.
The reason for this behaviour is already explained in previous answers, but I'd like to offer another way of circumventing this:
df <- data.frame(id1=c(rep("a",3)),id2=c(100,90,8), stringsAsFactors = FALSE)
do.call(cbind,df)
id1 id2
[1,] "a" "100"
[2,] "a" "90"
[3,] "a" "8"
Note that if using stringsAsFactors = TRUE, this doesn't work as factor levels are converted to numbers.
Just another solution: trimWhiteSpace(x) (from the limma R package) also does the job if you don't mind downloading the package.
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install("limma")
library(limma)
df <- data.frame(id1=c(rep("a",3)),id2=c(100,90,8), stringsAsFactors = FALSE)
as.matrix(df)
id1 id2
[1,] "a" "100"
[2,] "a" " 90"
[3,] "a" "  8"
trimWhiteSpace(as.matrix(df))
id1 id2
[1,] "a" "100"
[2,] "a" "90"
[3,] "a" "8"

How can I use grep with parameters in R?

Obviously I don't get the way grep works in R. If I use grep in my OS X terminal, I am able to use the parameter -o, which makes grep return only the matching part. In R, I can't find how to do the corresponding thing. Reading the manual, I thought value was the right approach, which is better inasmuch as it returns characters rather than indices, but it still returns the whole string.
# some string: fasdjlk465öfsdj123
# R
test <- "fasdjlk465öfsdj123"
grep("[0-9]", test, value = TRUE) # returns "fasdjlk465öfsdj123"
# shell
echo 'fasdjlk465öfsdj123' | grep -o '[0-9]'
# returns 4 6 5 1 2 3
What's the parameter I am missing in R ?
EDIT: Joris Meys' suggestion comes really close to what I am trying to do. I get a vector as a result of readLines, and I'd like to check every element of the vector for numbers and return these numbers. I am really surprised there's no standard solution for that. I thought of using some regexp function that works on a string and returns the match like grep -o, and then using lapply on that vector. grep.custom comes closest; I'll try to make that work for me.
Spacedman said it already. If you really want to simulate grep in the shell, you have to work on the characters themselves, using strsplit():
> chartest <- unlist(strsplit(test,""))
> chartest
[1] "f" "a" "s" "d" "j" "l" "k" "4" "6" "5" "ö" "f" "s" "d" "j" "1" "2" "3"
> grep("[0-9]",chartest,value=T)
[1] "4" "6" "5" "1" "2" "3"
EDIT :
As Nico said, if you want to do this for complete regular expressions, you need to use gregexpr() and substr(). I'd make a custom function like this one:
grep.custom <- function(x,pattern){
strt <- gregexpr(pattern,x)[[1]]
lngth <- attributes(strt)$match.length
stp <- strt + lngth - 1
apply(cbind(strt,stp),1,function(i){substr(x,i[1],i[2])})
}
Then :
> grep.custom(test,"sd")
[1] "sd" "sd"
> grep.custom(test,"[0-9]")
[1] "4" "6" "5" "1" "2" "3"
> grep.custom(test,"[a-z]s[a-z]")
[1] "asd" "fsd"
EDIT2 :
for vectors, use the function Vectorize(), eg:
> X <- c("sq25dfgj","sqd265jfm","qs55d26fjm" )
> v.grep.custom <- Vectorize(grep.custom)
> v.grep.custom(X,"[0-9]+")
$sq25dfgj
[1] "25"
$sqd265jfm
[1] "265"
$qs55d26fjm
[1] "55" "26"
and if you want to call grep from the shell, see ?system
That's because 'grep' for R works on vectors - it will do the search on every element and return the element indices that match. It says 'which elements in this vector match this pattern?' For example, here we make a vector of 3 and then ask 'which elements in this vector have a single number in them?'
> test = c("fasdjlk465öfsdj123","nonumbers","123")
> grep("[0-9]",test)
[1] 1 3
Elements 1 and 3 - not 2, which is only characters.
You probably want gsub - substitute anything that doesn't match digits with nothing:
> gsub("[^0-9]","",test)
[1] "465123" "" "123"
All this dancing around with strings is the problem the stringr package was designed to solve.
library(stringr)
str_extract_all('fasdjlk465fsdj123', '[0-9]')
[[1]]
[1] "4" "6" "5" "1" "2" "3"
# It is vectorized too
str_extract_all(rep('fasdjlk465fsdj123',3), '[0-9]')
[[1]]
[1] "4" "6" "5" "1" "2" "3"
[[2]]
[1] "4" "6" "5" "1" "2" "3"
[[3]]
[1] "4" "6" "5" "1" "2" "3"
The motivation behind stringr is to unify string operations in R under two principles:
Use a sane and consistent naming scheme for functions (str_do_something).
Make it so that all the string operations that take one step in other programming languages, yet fifty steps in R, take only one step in R.
grep will only tell you whether the string matches or not.
For instance if you have:
values <- c("abcde", "12345", "abc123", "123abc")
Then
> grep("[0-9]", values)
[1] 2 3 4
This tells you that elements 2, 3 and 4 of the vector match the regexp. You can pass value=TRUE to return the strings rather than the indices.
If you want to check where the match is happening you can use regexpr instead
> regexpr("[0-9]", values)
[1] -1 1 4 1
attr(,"match.length")
[1] -1 1 1 1
which tells you where the first match is happening.
Even better, you can use gregexpr for multiple matches
> gregexpr("[0-9]", values)
[[1]]
[1] -1
attr(,"match.length")
[1] -1
[[2]]
[1] 1 2 3 4 5
attr(,"match.length")
[1] 1 1 1 1 1
[[3]]
[1] 4 5 6
attr(,"match.length")
[1] 1 1 1
[[4]]
[1] 1 2 3
attr(,"match.length")
[1] 1 1 1
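The positions returned by gregexpr can be turned into the matched substrings themselves with regmatches(), which gets close to what grep -o does. A minimal sketch on the same values vector, matching whole runs of digits:

```r
values <- c("abcde", "12345", "abc123", "123abc")

# gregexpr() finds all matches; regmatches() extracts the matched text.
# Elements without a match come back as character(0).
matches <- regmatches(values, gregexpr("[0-9]+", values))
print(matches)
```

Using "[0-9]" instead of "[0-9]+" would return the digits one by one, as grep -o '[0-9]' does.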
No idea where you get the impression that
> test <- "fasdjlk465öfsdj123"
> grep("[0-9]",test)
[1] 1
returns "fasdjlk465öfsdj123"
If you want to return the matches, you need to break test into its component parts, grep on those, and then use the result returned from grep to index test.
> test <- strsplit("fasdjlk465öfsdj123", "")[[1]]
> matched <- grep("[0-9]", test)
> test[matched]
[1] "4" "6" "5" "1" "2" "3"
Or just return the matched strings directly, depends what you want:
> grep("[0-9]", test, value = TRUE)
[1] "4" "6" "5" "1" "2" "3"
strapply in the gsubfn package can do such extraction:
> library(gsubfn)
> strapply(c("ab34de123", "55x65"), "\\d+", as.numeric, simplify = TRUE)
[,1] [,2]
[1,] 34 55
[2,] 123 65
It's based on the apply paradigm, where the first argument is the object, the second is the modifier (margin for apply, regular expression for strapply) and the third argument is the function to apply to the matches.
str_extract_all(obj, re) in the stringr package is similar to strapply specialized to use c for the function, i.e. it is similar to strapply(obj, re, c).
strapply supports the sets of regular expressions supported by R and also supports tcl regular expressions.
See the gsubfn home page at http://gsubfn.googlecode.com

Resources