If you use apply over the rows of a data.frame that has both character and numeric columns, apply calls as.matrix internally, which converts the whole data.frame to a character matrix. But if the numeric column contains numbers with different numbers of digits, as.matrix pads the shorter ones with leading spaces to match the width of the longest number.
An example:
df <- data.frame(id1=c(rep("a",3)),id2=c(100,90,8), stringsAsFactors = FALSE)
df
## id1 id2
## 1 a 100
## 2 a 90
## 3 a 8
as.matrix(df)
## id1 id2
## [1,] "a" "100"
## [2,] "a" " 90"
## [3,] "a" "  8"
I would have expected the result to be:
id1 id2
[1,] "a" "100"
[2,] "a" "90"
[3,] "a" "8"
Why the extra spaces?
They can create unexpected results when using apply on a data.frame:
myfunc <- function(row){
paste(row[1], row[2], sep = "")
}
> apply(df, 1, myfunc)
[1] "a100" "a 90" "a  8"
While looping gives the expected result.
> for (i in 1:nrow(df)){
print(myfunc(df[i,]))
}
[1] "a100"
[1] "a90"
[1] "a8"
and
> paste(df[,1], df[,2], sep = "")
[1] "a100" "a90" "a8"
Are there any situations where the extra spaces added by as.matrix are useful?
This is because of the way non-character columns are converted by the as.matrix.data.frame method when the data frame contains at least one non-numeric column. There is a simple work-around, shown below.
Details
?as.matrix notes that conversion is done via format(), and it is here that the additional spaces are added. Specifically, ?as.matrix has this in the Details section:
‘as.matrix’ is a generic function. The method for data frames
will return a character matrix if there is only atomic columns and
any non-(numeric/logical/complex) column, applying ‘as.vector’ to
factors and ‘format’ to other non-character columns. Otherwise,
the usual coercion hierarchy (logical < integer < double <
complex) will be used, e.g., all-logical data frames will be
coerced to a logical matrix, mixed logical-integer will give a
integer matrix, etc.
?format also notes that
Character strings are padded with blanks to the display width of the widest.
Consider this example which illustrates the behaviour
> format(df[,2])
[1] "100" " 90" "  8"
> nchar(format(df[,2]))
[1] 3 3 3
format doesn't have to work this way as it has trim:
trim: logical; if ‘FALSE’, logical, numeric and complex values are
right-justified to a common width: if ‘TRUE’ the leading
blanks for justification are suppressed.
e.g.
> format(df[,2], trim = TRUE)
[1] "100" "90" "8"
but there is no way to pass this argument along to the as.matrix.data.frame method.
Workaround
A way to work around this is to apply format() yourself, manually, via sapply. There you can pass in trim = TRUE
> sapply(df, format, trim = TRUE)
id1 id2
[1,] "a" "100"
[2,] "a" "90"
[3,] "a" "8"
or, using vapply we can state what we expect to be returned (here character vectors of length 3 [nrow(df)]):
> vapply(df, format, FUN.VALUE = character(nrow(df)), trim = TRUE)
id1 id2
[1,] "a" "100"
[2,] "a" "90"
[3,] "a" "8"
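If you want something closer to a drop-in replacement for as.matrix, the same idea can be wrapped in a small function. This is only a sketch: to_char_matrix is a made-up name, not a base R function, and it assumes every column is an atomic vector or a factor. It leaves character columns alone, trims the formatted non-character columns, and restores NA the way as.matrix does.

```r
# Hypothetical helper (not part of base R): build a character matrix
# from a data frame without the padding that format() adds by default.
to_char_matrix <- function(d) {
  cols <- lapply(d, function(col) {
    if (is.character(col) || is.factor(col)) {
      as.character(col)                # leave text columns untouched
    } else {
      out <- format(col, trim = TRUE)  # suppress the leading blanks
      out[is.na(col)] <- NA            # keep missing values as NA
      out
    }
  })
  do.call(cbind, cols)
}

df <- data.frame(id1 = rep("a", 3), id2 = c(100, 90, 8),
                 stringsAsFactors = FALSE)
m <- to_char_matrix(df)
apply(m, 1, function(row) paste(row[1], row[2], sep = ""))
```

With the example data, the apply call should now give the same strings as the for loop in the question.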
It does seem a little strange. In the manual (?as.matrix) it explains that format is called for the conversion to a character matrix:
The method for data frames will return a character matrix if there is
only atomic columns and any non-(numeric/logical/complex) column,
applying as.vector to factors and format to other non-character
columns.
And you can see that if you call format directly, it does what as.matrix does:
format(df$id2)
[1] "100" " 90" "  8"
What you need to do is pass the trim argument:
format(df$id2,trim=TRUE)
[1] "100" "90" "8"
But, unfortunately, the as.matrix.data.frame function doesn't allow you to do that.
else if (non.numeric) {
for (j in pseq) {
if (is.character(X[[j]]))
next
xj <- X[[j]]
miss <- is.na(xj)
xj <- if (length(levels(xj)))
as.vector(xj)
else format(xj) # This could have ... as an argument
# else format(xj,...)
is.na(xj) <- miss
X[[j]] <- xj
}
}
So, you could modify as.matrix.data.frame. I think it would be a nice feature addition, however, to include this in base.
But, a quick solution would be to simply:
as.matrix(data.frame(lapply(df,as.character)))
id1 id2
[1,] "a" "100"
[2,] "a" "90"
[3,] "a" "8"
# As mentioned in the comments, this also works:
sapply(df,as.character)
as.matrix calls format internally:
> format(df$id2)
[1] "100" " 90" "  8"
That's where the extra spaces come from. format has an extra argument trim to remove those:
> format(df$id2, trim = TRUE)
[1] "100" "90" "8"
However you cannot supply this argument to as.matrix.
The reason for this behaviour is already explained in previous answers, but I'd like to offer another way of circumventing this:
df <- data.frame(id1=c(rep("a",3)),id2=c(100,90,8), stringsAsFactors = FALSE)
do.call(cbind,df)
id1 id2
[1,] "a" "100"
[2,] "a" "90"
[3,] "a" "8"
Note that if using stringsAsFactors = TRUE, this doesn't work as factor levels are converted to numbers.
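To make that caveat concrete, here is a small sketch. The data frame df_f is invented for the illustration; converting each column with as.character first is one possible way around the problem, not the only one.

```r
# Illustrative data: id1 is a factor, as it would be with stringsAsFactors = TRUE
df_f <- data.frame(id1 = factor(c("a", "b", "a")), id2 = c(100, 90, 8))

# cbind() sees the factor as its integer codes, so the result is numeric
m_bad <- do.call(cbind, df_f)

# Converting every column to character first keeps the labels
m_ok <- do.call(cbind, lapply(df_f, as.character))
```

Here m_bad is a numeric matrix whose id1 column holds the codes 1, 2, 1, while m_ok is a character matrix with the original labels and no padding.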
Just another solution: trimWhiteSpace(x) (from the limma R package) also does the job if you don't mind installing the package.
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install("limma")
library(limma)
df <- data.frame(id1=c(rep("a",3)),id2=c(100,90,8), stringsAsFactors = FALSE)
as.matrix(df)
id1 id2
[1,] "a" "100"
[2,] "a" " 90"
[3,] "a" "  8"
trimWhiteSpace(as.matrix(df))
id1 id2
[1,] "a" "100"
[2,] "a" "90"
[3,] "a" "8"
Related
I have a single character string in which each letter is followed by a number. For instance: A1B10C5
I would like to split it into letter <- c("A", "B", "C") and number <- c(1, 10, 5) using R.
We can use regex lookarounds to split between the letters and numbers
v1 <- strsplit(str1, "(?<=[A-Za-z])(?=[0-9])|(?<=[0-9])(?=[A-Za-z])", perl = TRUE)[[1]]
v1[c(TRUE, FALSE)]
#[1] "A" "B" "C"
as.numeric(v1[c(FALSE, TRUE)])
#[1] 1 10 5
data
str1 <- "A1B10C5"
str_extract_all is another way to do this:
library(stringr)
> str <- "A1B10C5"
> str
[1] "A1B10C5"
> str_extract_all(str, "[0-9]+")
[[1]]
[1] "1" "10" "5"
> str_extract_all(str, "[a-zA-Z]+")
[[1]]
[1] "A" "B" "C"
To extract letters and numbers at same time, you can use str_match_all to get letters and numbers in two separate columns:
library(stringr)
str_match_all("A1B10C5", "([a-zA-Z]+)([0-9]+)")[[1]][,-1]
# [,1] [,2]
#[1,] "A" "1"
#[2,] "B" "10"
#[3,] "C" "5"
You can also use the base R regmatches with gregexpr:
this <- "A1B10C5"
regmatches(this, gregexpr("[0-9]+", this))
[[1]]
[1] "1" "10" "5"
regmatches(this, gregexpr("[A-Z]+", this))
[[1]]
[1] "A" "B" "C"
These return lists with a single element, a character vector. As akrun does, you can extract the list item using [[1]] and can also convert the vector of digits to numeric like this:
as.numeric(regmatches(this, gregexpr("[0-9]+", this))[[1]])
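Putting the two regmatches calls together, a minimal base R sketch that returns the letters and numbers side by side could look like this. The names letters_part, numbers_part and result are just illustrative, and the sketch assumes the string strictly alternates upper-case letters and digits as in the example.

```r
this <- "A1B10C5"

# Letters and digit runs, each extracted as a character vector
letters_part <- regmatches(this, gregexpr("[A-Z]+", this))[[1]]
numbers_part <- as.numeric(regmatches(this, gregexpr("[0-9]+", this))[[1]])

# Guard against strings that do not pair every letter with a number
stopifnot(length(letters_part) == length(numbers_part))

result <- data.frame(letter = letters_part, number = numbers_part,
                     stringsAsFactors = FALSE)
result
```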
When converting a data frame to a matrix, R pads spaces into numeric columns:
> d=data.frame(x=c(10000,1),a=c("a","bbbbb"))
> as.matrix(d)
x a
[1,] "10000" "a"
[2,] "    1" "bbbbb"
the source code for as.matrix.data.frame shows this is because it uses format to convert to character (rather than as.character), so you get:
> format(d$x)
[1] "10000" "    1"
instead of
> as.character(d$x)
[1] "10000" "1"
Character columns aren't formatted with format so they don't get padded.
Is there an easy way to convert the DF to a matrix without padding? Better than running str_trim all over it?
This seems to work:
as.matrix(format(d, trim=TRUE))
# x a
# 1 "10000" "a"
# 2 "1" "bbbbb"
I am new to R so I hope you can help me.
I want to use gsub to remove all punctuation except for periods and minus signs so I can keep decimal points and negative symbols in my data.
Example
My data frame z has the following data:
[,1] [,2]
[1,] "1" "6"
[2,] "2#" "7.235"
[3,] "3" "8"
[4,] "4" "$9"
[5,] "£5" "-10"
I want to use gsub("[[:punct:]]", "", z) to remove the punctuation.
Current output
> gsub("[[:punct:]]", "", z)
[,1] [,2]
[1,] "1" "6"
[2,] "2" "7235"
[3,] "3" "8"
[4,] "4" "9"
[5,] "5" "10"
I would like, however, to keep the "-" sign and the "." sign.
Desired output
PSEUDO CODE:
> gsub("[[:punct:]]", "", z, except(".", "-") )
[,1] [,2]
[1,] "1" "6"
[2,] "2" "7.235"
[3,] "3" "8"
[4,] "4" "9"
[5,] "5" "-10"
Any ideas how I can make some characters exempt from the gsub() function?
You can put back some matches like this:
gsub("([.-])|[[:punct:]]", "\\1", as.matrix(z))
X..1. X..2.
[1,] "1" "6"
[2,] "2" "7.235"
[3,] "3" "8"
[4,] "4" "9"
[5,] "5" "-10"
Here I am keeping the . and -.
I guess the next step is to coerce your result to a numeric matrix, so here I combine the two steps like this:
matrix(as.numeric(gsub("([.-])|[[:punct:]]", "\\1", as.matrix(z))),ncol=2)
[,1] [,2]
[1,] 1 6.000
[2,] 2 7.235
[3,] 3 8.000
[4,] 4 9.000
[5,] 5 -10.000
You may try this code. I found it quite handy.
x <- c('6,345', '7.235', '8', '$9', '-10')
gsub("[^[:alnum:]\\-\\.\\s]", "", x)
[1] "6345" "7.235" "8" "9" "-10"
x <- c('1', '2#', '3', '4', '£5')
gsub("[^[:alnum:]\\-\\.\\s]", "", x)
[1] "1" "2" "3" "4" "5"
The pattern in gsub("[^[:alnum:]]", "", x) removes every character that is not alphanumeric. We then add exceptions to the negated class: a hyphen (\-), a full stop (\.) and whitespace (\s), which gives gsub("[^[:alnum:]\-\.\s]", "", x). (In R source code each backslash has to be doubled, as in the calls above.) Now it removes everything that is not alphanumeric, a hyphen, a full stop or whitespace.
Here are some options to restrict a generic character class in R using both base R (g)sub and the stringr remove/replace functions:
(g)sub with perl=TRUE
You may use the [[:punct:]] bracket expression with the [:punct:] POSIX character class and restrict it with the (?!\.) negative lookahead that will require that the immediately following char on the right is not equal to .:
(?!\.)[[:punct:]] # Excluding a dot only
(?![.-])[[:punct:]] # Excluding a dot and hyphen
To match one or more occurrences, wrap it with a non-capturing group and then set the + quantifier to the group:
(?:(?!\.)[[:punct:]])+ # Excluding a dot only
(?:(?![.-])[[:punct:]])+ # Excluding a dot and hyphen
Note that when you remove found matches, both expressions will yield the same results, however, when you need to replace with some other string/char, the quantification will allow changing whole consecutive character chunks with a single occurrence of the replacement pattern.
With stringr replace/remove functions
Before going into details, mind that the PCRE [[:punct:]] used with (g)sub will not match the same chars in the stringr regex functions that are powered by the ICU regex library. You need to use [\p{P}\p{S}] instead, see R/regex with stringi/ICU: why is a '+' considered a non-[:punct:] character?
The ICU regex library has a nice feature that can be used with character classes, called character class subtraction.
So, you write your character class, say, all punctuation matching class like [\p{P}\p{S}], and then you want to "exclude" (=subtract) a char or two or three, or a whole subclass of chars. You may use two notations:
[\p{P}\p{S}&&[^.]] # Excluding a dot
[\p{P}\p{S}--[.]] # Excluding a dot
[\p{P}\p{S}&&[^.-]] # Excluding a dot and hyphen
[\p{P}\p{S}--[.-]] # Excluding a dot and hyphen
To match 1+ consecutive occurrences with this approach, you do not need any wrapping groups, simply use +:
[\p{P}\p{S}&&[^.]]+ # Excluding a dot
[\p{P}\p{S}--[.]]+ # Excluding a dot
[\p{P}\p{S}&&[^.-]]+ # Excluding a dot and hyphen
[\p{P}\p{S}--[.-]]+ # Excluding a dot and hyphen
See R demo tests with outputs:
x <- "Abc.123#&*xxx(x-y-z)???? some#other!chars."
gsub("(?!\\.)[[:punct:]]", "", x, perl=TRUE)
## => [1] "Abc.123xxxxyz someotherchars."
gsub("(?!\\.)[[:punct:]]", "~", x, perl=TRUE)
## => [1] "Abc.123~~~xxx~x~y~z~~~~~ some~other~chars."
gsub("(?:(?!\\.)[[:punct:]])+", "~", x, perl=TRUE)
## => [1] "Abc.123~xxx~x~y~z~ some~other~chars."
library(stringr)
stringr::str_remove_all(x, "[\\p{P}\\p{S}&&[^.]]") # Same as "[\\p{P}\\p{S}--[.]]"
## => [1] "Abc.123xxxxyz someotherchars."
stringr::str_replace_all(x, "[\\p{P}\\p{S}&&[^.]]", "~")
## => [1] "Abc.123~~~xxx~x~y~z~~~~~ some~other~chars."
stringr::str_replace_all(x, "[\\p{P}\\p{S}&&[^.]]+", "~") # Same as "[\\p{P}\\p{S}--[.]]+"
## => [1] "Abc.123~xxx~x~y~z~ some~other~chars."
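Applied to values like the ones in the question, the lookahead version that exempts both the dot and the hyphen could be used as below. This is only a sketch: the vector x is made up, and it deliberately contains ASCII punctuation only, on the assumption that perl = TRUE is used without Unicode properties.

```r
# Made-up sample values with stray ASCII punctuation
x <- c("2#", "7.235", "$9", "-10", "1,000")

# Remove every punctuation character except "." and "-"
cleaned <- gsub("(?![.-])[[:punct:]]", "", x, perl = TRUE)
cleaned
## => [1] "2"     "7.235" "9"     "-10"   "1000"

# The cleaned strings can then be converted to numbers
nums <- as.numeric(cleaned)
```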
Another way to think about it is what do you want to keep? You can use regular expressions to both keep information as well as omit it. I have a lot of data frames that I need to clean units out of and convert from multiple rows in one pass and I find it easiest to use something from the apply family in these instances.
Recreating the example:
a <- c('1', '2#', '3', '4', '£5')
b <- c('6', '7.235', '8', '$9', '-10')
z <- matrix(data = c(a, b), nrow = length(a), ncol=2)
Then use apply in conjunction with gsub.
apply(z, 2, function(x) as.numeric(gsub('[^0-9\\.\\-]', '', x)))
[,1] [,2]
[1,] 1 6.000
[2,] 2 7.235
[3,] 3 8.000
[4,] 4 9.000
[5,] 5 -10.000
This instructs R to match everything except digits, periods, and hyphens/dashes. Personally, I find it much cleaner and easier to use in these situations and gives the same output.
Also, the documentation has a good explanation of these powerful but confusing regular expressions.
https://stat.ethz.ch/R-manual/R-devel/library/base/html/regex.html
Or ?regex
I would like to generate all combinations of two vectors, given two constraints: there can never be more than 3 characters from the first vector, and there must always be at least one character from the second vector. I would also like to vary the final number of characters in the combination.
For instance, here are two vectors:
vec1=c("A","B","C","D")
vec2=c("W","X","Y","Z")
Say I wanted 3 characters in the combination. Possible acceptable permutations would be: "A" "B" "X"or "A" "Y" "Z". An unacceptable permutation would be: "A" "B" "C" since there is not at least one character from vec2.
Now say I wanted 5 characters in the combination. Possible acceptable permutations would be: "A" "C" "Z" "Y" or "A" "Y" "Z" "X". An unacceptable permutation would be: "A" "C" "D" "B" "X" since there are >3 characters from vec1.
I suppose I could use expand.grid to generate all combinations and then somehow subset, but there must be an easier way. Thanks in advance!
I'm not sure whether this is easier, but you can leave out the permutations that do not satisfy your conditions with this strategy:
1. generate all combinations from vec1 that are acceptable.
2. generate all combinations from vec2 that are acceptable.
3. generate all combinations taking one solution from 1. + one solution from 2. Here I'd do the filtering with condition 3 (the total number of characters) afterwards.
4. (if you're looking for combinations, you're done, otherwise:) produce all permutations of letters within each result.
Now, let's have
vec1 <- LETTERS [1:4]
vec2 <- LETTERS [23:26]
## lists can eat up lots of memory, so use character vectors instead.
combine <- function (x, y)
combn (y, x, paste, collapse = "")
res1 <- unlist (lapply (0:3, combine, vec1))
res2 <- unlist (lapply (1:length (vec2), combine, vec2))
now we have:
> res1
[1] "" "A" "B" "C" "D" "AB" "AC" "AD" "BC" "BD" "CD" "ABC"
[13] "ABD" "ACD" "BCD"
> res2
[1] "W" "X" "Y" "Z" "WX" "WY" "WZ" "XY" "XZ" "YZ"
[11] "WXY" "WXZ" "WYZ" "XYZ" "WXYZ"
res3 <- outer (res1, res2, paste0)
res3 <- res3 [nchar (res3) == 5]
So here you are:
> res3
[1] "ABCWX" "ABDWX" "ACDWX" "BCDWX" "ABCWY" "ABDWY" "ACDWY" "BCDWY" "ABCWZ"
[10] "ABDWZ" "ACDWZ" "BCDWZ" "ABCXY" "ABDXY" "ACDXY" "BCDXY" "ABCXZ" "ABDXZ"
[19] "ACDXZ" "BCDXZ" "ABCYZ" "ABDYZ" "ACDYZ" "BCDYZ" "ABWXY" "ACWXY" "ADWXY"
[28] "BCWXY" "BDWXY" "CDWXY" "ABWXZ" "ACWXZ" "ADWXZ" "BCWXZ" "BDWXZ" "CDWXZ"
[37] "ABWYZ" "ACWYZ" "ADWYZ" "BCWYZ" "BDWYZ" "CDWYZ" "ABXYZ" "ACXYZ" "ADXYZ"
[46] "BCXYZ" "BDXYZ" "CDXYZ" "AWXYZ" "BWXYZ" "CWXYZ" "DWXYZ"
If you prefer the results split into single letters:
res <- matrix (unlist (strsplit (res3, "")), nrow = length (res3), byrow = TRUE)
> res
[,1] [,2] [,3] [,4] [,5]
[1,] "A" "B" "C" "W" "X"
[2,] "A" "B" "D" "W" "X"
[3,] "A" "C" "D" "W" "X"
[4,] "B" "C" "D" "W" "X"
(snip)
[51,] "C" "W" "X" "Y" "Z"
[52,] "D" "W" "X" "Y" "Z"
Which are your combinations.
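Since the question also asks to vary the final number of characters, the same strategy can be wrapped in a function. This is a rough sketch, not a tuned implementation: constrained_combos and its arguments k and max1 are names invented here, and it assumes all elements are single characters. Instead of filtering on nchar afterwards, it only pairs i letters from vec1 with k - i letters from vec2.

```r
# Hypothetical helper: all combinations of length k with at most max1
# elements from vec1 and at least one element from vec2.
constrained_combos <- function(vec1, vec2, k, max1 = 3) {
  stopifnot(k >= 1)
  out <- character(0)
  for (i in 0:min(max1, length(vec1), k - 1)) {  # i = elements taken from vec1
    j <- k - i                                   # j = elements taken from vec2
    if (j > length(vec2)) next                   # not enough elements in vec2
    part1 <- if (i == 0) "" else combn(vec1, i, paste, collapse = "")
    part2 <- combn(vec2, j, paste, collapse = "")
    out <- c(out, as.vector(outer(part1, part2, paste0)))
  }
  out
}

vec1 <- LETTERS[1:4]
vec2 <- LETTERS[23:26]
res5 <- constrained_combos(vec1, vec2, 5)
res3 <- constrained_combos(vec1, vec2, 3)
```

For k = 5 this should reproduce the 52 strings listed above, though not necessarily in the same order.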
Obviously I don't get the way grep works in R. If I use grep in my OS X terminal, I can use the -o option, which makes grep return only the matching part. In R, I can't find a way to do the same thing. Reading the manual, I thought value = TRUE was the right approach; it is better inasmuch as it returns the strings rather than the indexes, but it still returns the whole string.
# some string fasdjlk465öfsdj123
# R
test <- "fasdjlk465öfsdj123"
grep("[0-9]",test,value=TRUE) # returns "fasdjlk465öfsdj123"
# shell
echo 'fasdjlk465öfsdj123' | grep -o '[0-9]'
# returns 4 6 5 1 2 3
What's the parameter I am missing in R ?
EDIT: Joris Meys' suggestion comes really close to what I am trying to do. I get a vector as a result of readLines, and I'd like to check every element of the vector for numbers and return these numbers. I am really surprised there's no standard solution for that. I thought of using some regexp function that works on a string and returns the match like grep -o, and then using lapply on that vector. grep.custom comes closest – I'll try to make that work for me.
Spacedman said it already. If you really want to simulate grep in the shell, you have to work on the characters themselves, using strsplit():
> chartest <- unlist(strsplit(test,""))
> chartest
[1] "f" "a" "s" "d" "j" "l" "k" "4" "6" "5" "ö" "f" "s" "d" "j" "1" "2" "3"
> grep("[0-9]",chartest,value=T)
[1] "4" "6" "5" "1" "2" "3"
EDIT :
As Nico said, if you want to do this for complete regular expressions, you need to use the gregexpr() and substr(). I'd make a custom function like this one :
grep.custom <- function(x,pattern){
strt <- gregexpr(pattern,x)[[1]]
lngth <- attributes(strt)$match.length
stp <- strt + lngth - 1
apply(cbind(strt,stp),1,function(i){substr(x,i[1],i[2])})
}
Then :
> grep.custom(test,"sd")
[1] "sd" "sd"
> grep.custom(test,"[0-9]")
[1] "4" "6" "5" "1" "2" "3"
> grep.custom(test,"[a-z]s[a-z]")
[1] "asd" "fsd"
EDIT2 :
for vectors, use the function Vectorize(), eg:
> X <- c("sq25dfgj","sqd265jfm","qs55d26fjm" )
> v.grep.custom <- Vectorize(grep.custom)
> v.grep.custom(X,"[0-9]+")
$sq25dfgj
[1] "25"
$sqd265jfm
[1] "265"
$qs55d26fjm
[1] "55" "26"
and if you want to call grep from the shell, see ?system
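For what it's worth, base R's regmatches() can be combined with gregexpr() to get much the same result as grep.custom without handling the positions by hand. A minimal sketch, where grep_o is just an illustrative name:

```r
# Hypothetical wrapper: return the matched substrings for every element of x
grep_o <- function(x, pattern) {
  regmatches(x, gregexpr(pattern, x))
}

X <- c("sq25dfgj", "sqd265jfm", "qs55d26fjm", "nodigits")
res <- grep_o(X, "[0-9]+")
res
```

The result is a list with one character vector per input element; an element without any match gives an empty vector, so it is already vectorised and needs no Vectorize().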
That's because 'grep' for R works on vectors - it will do the search on every element and return the indices of the elements that match. It says 'which elements in this vector match this pattern?' For example, here we make a vector of 3 elements and then ask 'which elements in this vector contain at least one digit?'
> test = c("fasdjlk465öfsdj123","nonumbers","123")
> grep("[0-9]",test)
[1] 1 3
Elements 1 and 3 - not 2, which is only characters.
You probably want gsub - substitute anything that doesn't match digits with nothing:
> gsub("[^0-9]","",test)
[1] "465123" "" "123"
All this dancing around with strings is the problem the stringr package was designed to solve.
library(stringr)
str_extract_all('fasdjlk465fsdj123', '[0-9]')
[[1]]
[1] "4" "6" "5" "1" "2" "3"
# It is vectorized too
str_extract_all(rep('fasdjlk465fsdj123',3), '[0-9]')
[[1]]
[1] "4" "6" "5" "1" "2" "3"
[[2]]
[1] "4" "6" "5" "1" "2" "3"
[[3]]
[1] "4" "6" "5" "1" "2" "3"
The motivation behind stringr is to unify string operations in R under two principles:
Use a sane and consistent naming scheme for functions (str_do_something).
Make it so that all the string operations that take one step in other programming languages, yet fifty steps in R, take only one step in R.
grep will only tell you whether the string matches or not.
For instance if you have:
values <- c("abcde", "12345", "abc123", "123abc")
Then
grep("[0-9]", values)
[1] 2 3 4
This tells you that elements 2, 3 and 4 of the vector match the regexp. You can pass value=TRUE to return the strings rather than the indices.
If you want to check where the match is happening you can use regexpr instead
> regexpr("[0-9]", values)
[1] -1 1 4 1
attr(,"match.length")
[1] -1 1 1 1
which tells you where the first match is happening.
Even better, you can use gregexpr for multiple matches
> gregexpr("[0-9]", values)
[[1]]
[1] -1
attr(,"match.length")
[1] -1
[[2]]
[1] 1 2 3 4 5
attr(,"match.length")
[1] 1 1 1 1 1
[[3]]
[1] 4 5 6
attr(,"match.length")
[1] 1 1 1
[[4]]
[1] 1 2 3
attr(,"match.length")
[1] 1 1 1
No idea where you get the impression that
> test <- "fasdjlk465öfsdj123"
> grep("[0-9]",test)
[1] 1
returns "fasdjlk465öfsdj123"
If you want to return the matches, you need to break test into its component parts, grep on those and then use the thing returned from grep to index test.
> test <- strsplit("fasdjlk465öfsdj123", "")[[1]]
> matched <- grep("[0-9]", test)
> test[matched]
[1] "4" "6" "5" "1" "2" "3"
Or just return the matched strings directly, depends what you want:
> grep("[0-9]", test, value = TRUE)
[1] "4" "6" "5" "1" "2" "3"
strapply in the gsubfn package can do such extraction:
> library(gsubfn)
> strapply(c("ab34de123", "55x65"), "\\d+", as.numeric, simplify = TRUE)
[,1] [,2]
[1,] 34 55
[2,] 123 65
It's based on the apply paradigm, where the first argument is the object, the second is the modifier (margin for apply, regular expression for strapply) and the third argument is the function to apply to the matches.
str_extract_all(obj, re) in the stringr package is similar to strapply specialized to use c for the function, i.e. it is similar to strapply(obj, re, c).
strapply supports the sets of regular expressions supported by R and also supports tcl regular expressions.
See the gsubfn home page at http://gsubfn.googlecode.com