How to remove non-alphabetic characters and columns from a CSV file - R

I have a csv file that looks like this:
And in some portions the data in the columns is like this:
As you can see, because the "=" sign is present it wants to convert the value into a formula, but what I need is the word, in this case "rama...
I extracted this term from a spam file and converted it into a sparse matrix with R. So my question is: how can I get rid of the non-alphanumeric characters in this header in R, and then write the result back to a csv file?
Thanks

If you want a literal answer, you could try using gsub to replace any entry having one or more non alphanumeric characters:
df <- data.frame(v1=c(1,2,3), v2=c("#NAME?", "two", "#NAME?"),
stringsAsFactors=FALSE)
df <- data.frame(sapply(df, function(x) gsub(".*[^A-Za-z0-9].*", "", x)))
df
v1 v2
1 1
2 2 two
3 3
But the best/easiest thing to do here is probably to just fix your Excel formulas such that you catch these errors, and just display empty string, or some other sensible message. From what I can see, this is basically an Excel, not R, problem.

You can use gsub for that:
## A dummy matrix
example <- matrix(paste0("=", letters[1:9]),3,3)
# [,1] [,2] [,3]
#[1,] "=a" "=d" "=g"
#[2,] "=b" "=e" "=h"
#[3,] "=c" "=f" "=i"
You can remove the "=" by replacing it by "" in gsub
## Replacing the "=" by "" (nothing)
gsub("=", "", example)
# [,1] [,2] [,3]
#[1,] "a" "d" "g"
#[2,] "b" "e" "h"
#[3,] "c" "f" "i"
Or only in the first row (or in the column name, etc.)
## Removing the "=" in the first row
example[1, ] <- gsub("=", "", example[1, ])
example
# [,1] [,2] [,3]
#[1,] "a" "d" "g"
#[2,] "=b" "=e" "=h"
#[3,] "=c" "=f" "=i"
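To tie this back to the original question (cleaning the header and writing a csv again), here is a minimal sketch. The data frame and its column names are hypothetical stand-ins for the real sparse-matrix export, and tempfile() is used only so the example does not touch a real path.

```r
# Hypothetical data standing in for the exported term matrix
df <- data.frame(a = 1:2, b = 3:4)
names(df) <- c("=term1", "#term2!")

# Drop every character that is not a letter or digit from the column names
names(df) <- gsub("[^[:alnum:]]", "", names(df))

# Write the cleaned table back out (temporary file, for illustration only)
out_file <- tempfile(fileext = ".csv")
write.csv(df, out_file, row.names = FALSE)

# Read it back to confirm the header survived the round trip
reread <- read.csv(out_file)
names(reread)
```

Only the names are touched here; apply the same gsub call to a column if the cell values need cleaning too.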

Related

How to grab the number and hashtag out of the quotes using R ? <hashtag count="5" value="#starbucks"/>

[1] <hashtag count="5" value="#starbucks"/>
Count Hashtags
[1] 5 #starbucks
The column I have now holds a single string like the one shown above. How can I get the number and the text of the hashtag out and split them into two columns?
This seems to be a simple regex question:
library(stringr)
strings <- c('<hashtag count="5" value="#starbucks"/>',
'<hashtag count="99" value="#peets coffee"/>')
str_match(strings, 'count=\\"(\\d+).*value=\\"#([^"]+)')[,2:3]
[,1] [,2]
[1,] "5" "starbucks"
[2,] "99" "peets coffee"
If strings is a data.frame, you'll need to apply the function by row and choose the correct column to extract the values from:
strings <- data.frame(str = c('<hashtag count="5" value="#starbucks"/>',
'<hashtag count="99" value="#peets coffee"/>'),
col2 = c(2,4))
apply(strings, 1, function(x) str_match(x['str'],
'count=\\"(\\d+).*value=\\"#([^"]+)')[,2:3])
[,1] [,2]
[1,] "5" "99"
[2,] "starbucks" "peets coffee"
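If the goal is two new columns rather than a transposed matrix, the column can also be matched in one vectorised call without apply. The following is a base R sketch using regexec and regmatches instead of stringr; the new column names count and hashtag are assumptions, and it assumes every row matches the pattern.

```r
strings <- data.frame(
  str = c('<hashtag count="5" value="#starbucks"/>',
          '<hashtag count="99" value="#peets coffee"/>'),
  col2 = c(2, 4),
  stringsAsFactors = FALSE
)

# Group 1 captures the digits of count, group 2 the text after the '#'
pattern <- 'count="([0-9]+)".*value="#([^"]+)"'

# One element per row: full match followed by the two captured groups
m <- regmatches(strings$str, regexec(pattern, strings$str))
parts <- do.call(rbind, m)

# Column names are illustrative
strings$count <- as.integer(parts[, 2])
strings$hashtag <- parts[, 3]
strings[, c("count", "hashtag")]
```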

Split character vector with vector of patterns in R

I'm trying to write a function that builds a matrix by splitting a character vector repeatedly using successive elements in a vector of patterns.
Let's call the function I'm trying to write str_split_vector(). Here's an example of the output I'm looking for:
char <- c("A & P | B & C # D",
"E & Q | F & G # H",
"I & R | J & K # L")
splits <- c(" \\| ", " & ", " # ")
str_split_vector(char, splits)
# [,1] [,2] [,3] [,4]
# [1,] "A & P" "B" "C" "D"
# [2,] "E & Q" "F" "G" "H"
# [3,] "I & R" "J" "K" "L"
The char vector is split by each pattern in turn, leaving "A & P" intact. (Although it might be easiest to manage that last bit with particular regex patterns.)
I've been able to accomplish this task only iteratively, with a pretty ad hoc loop:
for(ii in 1:length(splits)) {
if(ii == 1) {
char_mat <- matrix(char)
char_mat <- do.call(rbind, strsplit(char_mat[ , ii], splits[ii]))
} else {
char_mat <- cbind(char_mat[ , 1:ii - 1],
do.call(rbind,
strsplit(char_mat[ , ii], splits[ii])
)
)
}
}
That process looks inefficient to me, since I'm "growing" char_mat with the repeated cbind() calls. Even worse, I find it almost impossible to understand what's going on without actually running the code.
Is there a simpler way to write this, potentially ignoring the requirement that "A & P" not be split?
Maybe the following is what you want. No loops.
str_split_vector <- function(x, y){
s <- strsplit(x, paste(y, collapse = "|"))
do.call(rbind, s)
}
str_split_vector(char, splits)
# [,1] [,2] [,3] [,4] [,5]
#[1,] "A" "P" "B" "C" "D"
#[2,] "E" "Q" "F" "G" "H"
#[3,] "I" "R" "J" "K" "L"
An approach that uses grouping and won't perform any splitting on the first & is the following:
do.call(rbind, strsplit(gsub("(.*) \\| (.*) & (.*) # (.*)", "\\1_\\2_\\3_\\4", char), "_"))
It basically replaces the characters you wish to split on with an underscore and then splits on those underscores.
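For the behaviour originally asked for (each pattern applied in turn to whatever is left, so "A & P" stays intact), here is one possible sketch. It cuts the remaining text at the first match of each pattern using regexpr and substr, and it assumes every string contains every pattern in the given order; this is not the method from either answer above.

```r
str_split_vector <- function(x, patterns) {
  pieces <- vector("list", length(patterns) + 1L)
  rest <- x
  for (i in seq_along(patterns)) {
    m <- regexpr(patterns[i], rest)
    start <- as.integer(m)
    len <- attr(m, "match.length")
    # Text before the first match of this pattern becomes one column
    pieces[[i]] <- substr(rest, 1L, start - 1L)
    # Text after the match is the input for the next pattern
    rest <- substr(rest, start + len, nchar(rest))
  }
  # Whatever is left after the last pattern is the final column
  pieces[[length(pieces)]] <- rest
  do.call(cbind, pieces)
}

char <- c("A & P | B & C # D",
          "E & Q | F & G # H",
          "I & R | J & K # L")
splits <- c(" \\| ", " & ", " # ")
res <- str_split_vector(char, splits)
res
```

Each column is filled once, so the matrix is not grown by repeated cbind() calls inside the loop.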

in R, use gsub to remove all punctuation except period

I am new to R so I hope you can help me.
I want to use gsub to remove all punctuation except for periods and minus signs so I can keep decimal points and negative symbols in my data.
Example
My data frame z has the following data:
[,1] [,2]
[1,] "1" "6"
[2,] "2#" "7.235"
[3,] "3" "8"
[4,] "4" "$9"
[5,] "£5" "-10"
I want to use gsub("[[:punct:]]", "", z) to remove the punctuation.
Current output
> gsub("[[:punct:]]", "", z)
[,1] [,2]
[1,] "1" "6"
[2,] "2" "7235"
[3,] "3" "8"
[4,] "4" "9"
[5,] "5" "10"
I would like, however, to keep the "-" sign and the "." sign.
Desired output
PSEUDO CODE:
> gsub("[[:punct:]]", "", z, except(".", "-") )
[,1] [,2]
[1,] "1" "6"
[2,] "2" "7.235"
[3,] "3" "8"
[4,] "4" "9"
[5,] "5" "-10"
Any ideas how I can make some characters exempt from the gsub() function?
You can put back some matches like this:
sub("([.-])|[[:punct:]]", "\\1", as.matrix(z))
X..1. X..2.
[1,] "1" "6"
[2,] "2" "7.235"
[3,] "3" "8"
[4,] "4" "9"
[5,] "5" "-10"
Here I am keeping the . and -.
I guess the next step is to coerce your result to a numeric matrix, so here I combine the two steps:
matrix(as.numeric(sub("([.-])|[[:punct:]]", "\\1", as.matrix(z))),ncol=2)
[,1] [,2]
[1,] 1 6.000
[2,] 2 7.235
[3,] 3 8.000
[4,] 4 9.000
[5,] 5 -10.000
You may try this code. I found it quite handy.
x <- c('6,345', '7.235', '8', '$9', '-10')
gsub("[^[:alnum:]\\-\\.\\s]", "", x)
[1] "6345" "7.235" "8" "9" "-10"
x <- c('1', '2#', '3', '4', '£5')
gsub("[^[:alnum:]\\-\\.\\s]", "", x)
[1] "1" "2" "3" "4" "5"
The call gsub("[^[:alnum:]]", "", x) removes every character that is not alphanumeric. We then add to the exception list: a hyphen (\\-), a full stop (\\.) and a space (\\s), which gives gsub("[^[:alnum:]\\-\\.\\s]", "", x). Now it removes everything that is not alphanumeric, a hyphen, a full stop or a space.
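The same negated-class idea can be written without backslashes inside the brackets, by listing the dot literally and placing the hyphen last so it cannot be read as a range. This is a sketch that keeps only alphanumerics, dots and hyphens (no space exception), using ASCII-only sample values.

```r
x <- c("6,345", "7.235", "8", "$9", "-10")

# Inside [...] a dot is literal, and a hyphen placed last is literal too
cleaned <- gsub("[^[:alnum:].-]", "", x)
cleaned

# The cleaned strings can then be converted to numbers
nums <- as.numeric(cleaned)
nums
```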
Here are some options to restrict a generic character class in R using both base R (g)sub and the stringr remove/replace functions:
(g)sub with perl=TRUE
You may use the [[:punct:]] bracket expression with the [:punct:] POSIX character class and restrict it with the (?!\.) negative lookahead that will require that the immediately following char on the right is not equal to .:
(?!\.)[[:punct:]] # Excluding a dot only
(?![.-])[[:punct:]] # Excluding a dot and hyphen
To match one or more occurrences, wrap it with a non-capturing group and then set the + quantifier to the group:
(?:(?!\.)[[:punct:]])+ # Excluding a dot only
(?:(?![.-])[[:punct:]])+ # Excluding a dot and hyphen
Note that when you remove found matches, both expressions will yield the same results, however, when you need to replace with some other string/char, the quantification will allow changing whole consecutive character chunks with a single occurrence of the replacement pattern.
With stringr replace/remove functions
Before going into details, mind that the PCRE [[:punct:]] used with (g)sub will not match the same chars in the stringr regex functions that are powered by the ICU regex library. You need to use [\p{P}\p{S}] instead, see R/regex with stringi/ICU: why is a '+' considered a non-[:punct:] character?
The ICU regex library has a nice feature that can be used with character classes, called character class subtraction.
So, you write your character class, say, all punctuation matching class like [\p{P}\p{S}], and then you want to "exclude" (=subtract) a char or two or three, or a whole subclass of chars. You may use two notations:
[\p{P}\p{S}&&[^.]] # Excluding a dot
[\p{P}\p{S}--[.]] # Excluding a dot
[\p{P}\p{S}&&[^.-]] # Excluding a dot and hyphen
[\p{P}\p{S}--[.-]] # Excluding a dot and hyphen
To match 1+ consecutive occurrences with this approach, you do not need any wrapping groups, simply use +:
[\p{P}\p{S}&&[^.]]+ # Excluding a dot
[\p{P}\p{S}--[.]]+ # Excluding a dot
[\p{P}\p{S}&&[^.-]]+ # Excluding a dot and hyphen
[\p{P}\p{S}--[.-]]+ # Excluding a dot and hyphen
See R demo tests with outputs:
x <- "Abc.123#&*xxx(x-y-z)???? some#other!chars."
gsub("(?!\\.)[[:punct:]]", "", x, perl=TRUE)
## => [1] "Abc.123xxxxyz someotherchars."
gsub("(?!\\.)[[:punct:]]", "~", x, perl=TRUE)
## => [1] "Abc.123~~~xxx~x~y~z~~~~~ some~other~chars."
gsub("(?:(?!\\.)[[:punct:]])+", "~", x, perl=TRUE)
## => [1] "Abc.123~xxx~x~y~z~ some~other~chars."
library(stringr)
stringr::str_remove_all(x, "[\\p{P}\\p{S}&&[^.]]") # Same as "[\\p{P}\\p{S}--[.]]"
## => [1] "Abc.123xxxxyz someotherchars."
stringr::str_replace_all(x, "[\\p{P}\\p{S}&&[^.]]", "~")
## => [1] "Abc.123~~~xxx~x~y~z~~~~~ some~other~chars."
stringr::str_replace_all(x, "[\\p{P}\\p{S}&&[^.]]+", "~") # Same as "[\\p{P}\\p{S}--[.]]+"
## => [1] "Abc.123~xxx~x~y~z~ some~other~chars."
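Applied to data shaped like the question's matrix, the dot-and-hyphen lookahead variant can be sketched as follows. The matrix here is a reduced, ASCII-only version of the example (the "£" entry is left out), so treat it as illustrative.

```r
a <- c("1", "2#", "3", "4")
b <- c("6", "7.235", "$9", "-10")
z <- matrix(c(a, b), ncol = 2)

# Remove punctuation unless the character is a dot or a hyphen;
# assigning into cleaned[] keeps the matrix dimensions
cleaned <- z
cleaned[] <- gsub("(?![.-])[[:punct:]]", "", z, perl = TRUE)

# Convert to a numeric matrix with the same shape
num <- matrix(as.numeric(cleaned), ncol = ncol(cleaned))
num
```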
Another way to think about it is what do you want to keep? You can use regular expressions to both keep information as well as omit it. I have a lot of data frames that I need to clean units out of and convert from multiple rows in one pass and I find it easiest to use something from the apply family in these instances.
Recreating the example:
a <- c('1', '2#', '3', '4', '£5')
b <- c('6', '7.235', '8', '$9', '-10')
z <- matrix(data = c(a, b), nrow = length(a), ncol=2)
Then use apply in conjunction with gsub.
apply(z, 2, function(x) as.numeric(gsub('[^0-9\\.\\-]', '', x)))
[,1] [,2]
[1,] 1 6.000
[2,] 2 7.235
[3,] 3 8.000
[4,] 4 9.000
[5,] 5 -10.000
This instructs R to match everything except digits, periods, and hyphens/dashes. Personally, I find it much cleaner and easier to use in these situations and gives the same output.
Also, the documentation has a good explanation of these powerful but confusing regular expressions.
https://stat.ethz.ch/R-manual/R-devel/library/base/html/regex.html
Or ?regex

a sequence of str_match in a data.table

I have a string variable to parse into two parts. I figured I'd approach this using str_match from the stringr package, which returns a matrix with the original string in the first column and each extracted part in the other columns.
I found about a dozen regular expressions to extract these two parts. (The parts are a ladder and rung on a pay schedule, and it's very messy. I've verified that my regexes work by defining a function with a bunch of nested ifelse statements.)
library(stringr)
library(data.table)
my_strs <- c("A 01","G 00","A 2")
mydt <- data.table(strs = my_strs)
rx1 <- '^([[:alpha:]] )([[:digit:]]{2})$'
rx2 <- '(A) ([[:digit:]])'
I want to check the regexes in sequence and extract the parts using the first one that checks out. If I only had one regex, I could do this:
myfun <- function(x){
y <- str_match(x,rx1)
return(y)
}
mydt[,myfun(strs)]
# [,1] [,2] [,3]
# [1,] "A 01" "A " "01"
# [2,] "G 00" "G " "00"
# [3,] NA NA NA
(It took me a long time to even get that to work, trying all combinations of Vectorize and as.list on the function and *applying in the call.)
My best attempt at checking the regexes in sequence is this rather ugly kludge:
myfun2 <- function(x){
y <- str_match(x,rx1)
ifelse(!is.na(y[1]),"",(y <- str_match(x,rx2))[1])
return(y)
}
mydt[1:2,myfun2(strs)]
# [,1] [,2] [,3]
# [1,] "A 01" "A " "01"
# [2,] "G 00" "G " "00"
mydt[3,myfun2(strs)]
# [,1] [,2] [,3]
# [1,] "A 2" "A" "2"
mydt[1:3,myfun2(strs)]
# [,1] [,2] [,3]
# [1,] "A 01" "A " "01"
# [2,] "G 00" "G " "00"
# [3,] NA NA NA
As you can see, it doesn't quite work yet.
Do you have any idea about a better way to approach this? I have about 3.5 m rows in my data set, but only about 2000 unique values for this string, so I'm not really worried about efficiency.
Try this using strapply from the gsubfn package. We define a function that accepts the matches and returns the first two non-empty ones. Then we use it with the combined regular expression paste(rx1, rx2, rx3, sep = "|") for each component of my_strs2:
library(gsubfn)
# test data
# there was an addition to the question in the comments. It asked to be able to handle
# one regular expression which has only a single capture. Make sure it's at the end.
rx3 <- "^([[:digit:]]{2})$"
my_strs2 <- c(my_strs, "99")
# code
first2 <- function(...) { x <- c(..., NA); head(x[x != ""], 2) }
strapply(my_strs2, paste(rx1, rx2, rx3, sep = "|"), first2, simplify = TRUE)
The last line returns:
[,1] [,2] [,3] [,4]
[1,] "A " "G " "A" "99"
[2,] "01" "00" "2" NA
(If there are components of my_strs that do not match at all then a list will be returned in which those components are NULL. In that case you may prefer to drop the simplify = TRUE and always have it return a list.)
Note: strapplyc in the same package is much faster than strapply since the guts of it are written in tcl (a string processing language) whereas strapply is written in R. Thus you might want to break it up this way to leverage off of the faster routine:
L <- strapplyc(my_strs2, paste(rx1, rx2, rx3, sep = "|"))
sapply(L, first2)
For posterity, here is another solution I found today:
mydt[,{
i_rx <- min(which(unlist(sapply(rx_list,function(x)grepl(x,strs)))))
as.list(str_match(strs,rx_list[[i_rx]]))
},by=1:nrow(mydt)]
I made some minor alterations to the regexes and put them in a list.
rx1 <- '^([[:alpha:]] )([[:digit:]]{2})$'
rx2a <- "^(A) ([[:digit:]])$"
rx3a <- "^()([[:digit:]]{2})$"
rx_list <- list(rx1,rx2a,rx3a)
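A vectorised alternative is to loop over the regexes rather than over the rows: try each pattern only on the strings that are still unmatched. The sketch below uses base R regexec and regmatches instead of str_match; match_first is a hypothetical helper name, and it assumes every pattern has exactly two capture groups.

```r
my_strs <- c("A 01", "G 00", "A 2")
rx_list <- list("^([[:alpha:]] )([[:digit:]]{2})$",
                "^(A) ([[:digit:]])$")

match_first <- function(x, patterns) {
  out <- matrix(NA_character_, nrow = length(x), ncol = 2)
  for (rx in patterns) {
    # Only strings that no earlier pattern matched are tried again
    todo <- which(is.na(out[, 1]))
    if (length(todo) == 0L) break
    m <- regmatches(x[todo], regexec(rx, x[todo]))
    # A hit has the full match plus two captured groups
    hit <- lengths(m) == 3L
    if (any(hit)) {
      out[todo[hit], ] <- do.call(rbind, m[hit])[, 2:3, drop = FALSE]
    }
  }
  out
}

parts <- match_first(my_strs, rx_list)
parts
```

With only about 2000 unique values, it may be simplest to run this on the unique strings and merge the two resulting columns back onto the full table.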

Why does as.matrix add extra spaces when converting numeric to character?

If you use apply over rows on a data.frame with character and numeric columns, apply uses as.matrix internally to convert the data.frame to only characters. But if the numeric column consists of numbers of different lengths, as.matrix adds spaces to match the highest/"longest" number.
An example:
df <- data.frame(id1=c(rep("a",3)),id2=c(100,90,8), stringsAsFactors = FALSE)
df
## id1 id2
## 1 a 100
## 2 a 90
## 3 a 8
as.matrix(df)
## id1 id2
## [1,] "a" "100"
## [2,] "a" " 90"
## [3,] "a" "  8"
I would have expected the result to be:
id1 id2
[1,] "a" "100"
[2,] "a" "90"
[3,] "a" "8"
Why the extra spaces?
They can create unexpected results when using apply on a data.frame:
myfunc <- function(row){
paste(row[1], row[2], sep = "")
}
> apply(df, 1, myfunc)
[1] "a100" "a 90" "a  8"
>
While looping gives the expected result.
> for (i in 1:nrow(df)){
print(myfunc(df[i,]))
}
[1] "a100"
[1] "a90"
[1] "a8"
and
> paste(df[,1], df[,2], sep = "")
[1] "a100" "a90" "a8"
Are there any situations where the extra spaces that are added with as.matrix is useful?
This is because of the way non-character columns are converted in the as.matrix.data.frame method when the data frame contains any non-numeric column. There is a simple work-around, shown below.
Details
?as.matrix notes that conversion is done via format(), and it is here that the additional spaces are added. Specifically, ?as.matrix has this in the Details section:
‘as.matrix’ is a generic function. The method for data frames
will return a character matrix if there is only atomic columns and
any non-(numeric/logical/complex) column, applying ‘as.vector’ to
factors and ‘format’ to other non-character columns. Otherwise,
the usual coercion hierarchy (logical < integer < double <
complex) will be used, e.g., all-logical data frames will be
coerced to a logical matrix, mixed logical-integer will give a
integer matrix, etc.
?format also notes that
Character strings are padded with blanks to the display width of the widest.
Consider this example which illustrates the behaviour
> format(df[,2])
[1] "100" " 90" "  8"
> nchar(format(df[,2]))
[1] 3 3 3
format doesn't have to work this way as it has trim:
trim: logical; if ‘FALSE’, logical, numeric and complex values are
right-justified to a common width: if ‘TRUE’ the leading
blanks for justification are suppressed.
e.g.
> format(df[,2], trim = TRUE)
[1] "100" "90" "8"
but there is no way to pass this argument along to the as.matrix.data.frame method.
Workaround
A way to work around this is to apply format() yourself, manually, via sapply. There you can pass in trim = TRUE
> sapply(df, format, trim = TRUE)
id1 id2
[1,] "a" "100"
[2,] "a" "90"
[3,] "a" "8"
or, using vapply we can state what we expect to be returned (here character vectors of length 3 [nrow(df)]):
> vapply(df, format, FUN.VALUE = character(nrow(df)), trim = TRUE)
id1 id2
[1,] "a" "100"
[2,] "a" "90"
[3,] "a" "8"
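To connect the workaround to the row-wise apply call from the question, here is a short sketch: build the trimmed character matrix first, then apply over its rows. The function body mirrors myfunc from the question.

```r
df <- data.frame(id1 = rep("a", 3), id2 = c(100, 90, 8),
                 stringsAsFactors = FALSE)

# Character matrix without the padding that as.matrix() would add
trimmed <- vapply(df, format, FUN.VALUE = character(nrow(df)), trim = TRUE)

# Row-wise paste now sees "90" and "8" rather than padded strings
res <- apply(trimmed, 1, function(row) paste(row[1], row[2], sep = ""))
res
```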
It does seem a little strange. In the manual (?as.matrix) it explains that format is called for the conversion to a character matrix:
The method for data frames will return a character matrix if there is
only atomic columns and any non-(numeric/logical/complex) column,
applying as.vector to factors and format to other non-character
columns.
And you can see that if you call format directly, it does what as.matrix does:
format(df$id2)
[1] "100" " 90" "  8"
What you need to do is pass the trim argument:
format(df$id2,trim=TRUE)
[1] "100" "90" "8"
But, unfortunately, the as.matrix.data.frame function doesn't allow you to do that.
else if (non.numeric) {
for (j in pseq) {
if (is.character(X[[j]]))
next
xj <- X[[j]]
miss <- is.na(xj)
xj <- if (length(levels(xj)))
as.vector(xj)
else format(xj) # This could have ... as an argument
# else format(xj,...)
is.na(xj) <- miss
X[[j]] <- xj
}
}
So, you could modify as.matrix.data.frame. I think it would be a nice feature addition, however, to include this in base.
But, a quick solution would be to simply:
as.matrix(data.frame(lapply(df,as.character)))
id1 id2
[1,] "a" "100"
[2,] "a" "90"
[3,] "a" "8"
# As mentioned in the comments, this also works:
sapply(df,as.character)
as.matrix calls format internally:
> format(df$id2)
[1] "100" " 90" "  8"
That's where the extra spaces come from. format has an extra argument trim to remove those:
> format(df$id2, trim = TRUE)
[1] "100" "90" "8"
However you cannot supply this argument to as.matrix.
The reason for this behaviour is already explained in previous answers, but I'd like to offer another way of circumventing this:
df <- data.frame(id1=c(rep("a",3)),id2=c(100,90,8), stringsAsFactors = FALSE)
do.call(cbind,df)
id1 id2
[1,] "a" "100"
[2,] "a" "90"
[3,] "a" "8"
Note that if using stringsAsFactors = TRUE, this doesn't work as factor levels are converted to numbers.
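One way around that caveat, sketched here as an assumption rather than as part of the answer above, is to convert any factor columns to character before binding the columns together.

```r
df <- data.frame(id1 = rep("a", 3), id2 = c(100, 90, 8),
                 stringsAsFactors = TRUE)

# Convert factor columns to character so cbind() sees labels, not integer codes
is_fac <- vapply(df, is.factor, logical(1))
df[is_fac] <- lapply(df[is_fac], as.character)

m <- do.call(cbind, df)
m
```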
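Just another solution: trimWhiteSpace(x) (from the limma R package) also does the job, if you don't mind installing the package.
# biocLite() has been retired; Bioconductor packages are now installed with BiocManager
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install("limma")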
library(limma)
df <- data.frame(id1=c(rep("a",3)),id2=c(100,90,8), stringsAsFactors = FALSE)
as.matrix(df)
id1 id2
[1,] "a" "100"
[2,] "a" " 90"
[3,] "a" "  8"
trimWhiteSpace(as.matrix(df))
id1 id2
[1,] "a" "100"
[2,] "a" "90"
[3,] "a" "8"

Resources