When converting a data frame to a matrix, R pads spaces into numeric columns:
> d=data.frame(x=c(10000,1),a=c("a","bbbbb"))
> as.matrix(d)
     x       a
[1,] "10000" "a"
[2,] "    1" "bbbbb"
The source code for as.matrix.data.frame shows that this happens because it uses format to convert numeric columns to character (rather than as.character), so you get:
> format(d$x)
[1] "10000" "    1"
instead of
> as.character(d$x)
[1] "10000" "1"
Character columns aren't formatted with format so they don't get padded.
Is there an easy way to convert the DF to a matrix without padding? Better than running str_trim all over it?
This seems to work:
as.matrix(format(d, trim=TRUE))
# x a
# 1 "10000" "a"
# 2 "1" "bbbbb"
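A variation on the same idea, as a minimal sketch: format only the non-character columns with trim = TRUE and leave character columns untouched. The helper name to_char is hypothetical and not part of base R; stringsAsFactors = FALSE is set explicitly so the sketch behaves the same on old and new R versions.

```r
# Sketch only: `to_char` is an illustrative helper, not a base R function.
d <- data.frame(x = c(10000, 1), a = c("a", "bbbbb"), stringsAsFactors = FALSE)

# Character columns pass through unchanged; everything else is
# formatted without the justification padding.
to_char <- function(col) if (is.character(col)) col else format(col, trim = TRUE)

# vapply checks that every column yields nrow(d) character values
# and binds the results into a character matrix.
m <- vapply(d, to_char, FUN.VALUE = character(nrow(d)))
m
```

Using vapply rather than sapply makes the call fail loudly if a column does not come back as a character vector of the expected length.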
I am trying to create permutations of the alphabet {0,1,2,3} using combinat::permn.
I want each permutation converted to the form '%s-%s-%s-%s' and stored in a list. For example,
> library(combinat)
> permn(numbers[1:4])
[[1]]
[1] "0" "1" "2" "3"
[[2]]
[1] "0" "1" "3" "2"
.
.
. and so on
But I want to convert the output for all permutations into a list of strings in my specific format, i.e. '0-1-2-3', '0-1-3-2', etc.
Use lapply to apply paste on each of the vectors and collapse them with the delimiter you want (in this case "-").
lapply(permn(0:3), paste, collapse = "-")
If you just want the output as a vector instead of a list, you could use sapply in place of lapply.
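To make the list-versus-vector difference concrete, here is a small sketch. It uses a hand-written list of three permutations (the name perms is an assumption for illustration) in place of the full permn(0:3) output, so it runs without the combinat package.

```r
# Stand-in for permn(0:3): a few permutations written out by hand.
perms <- list(c(0, 1, 2, 3), c(0, 1, 3, 2), c(0, 3, 1, 2))

# lapply returns a list with one collapsed string per permutation.
as_list <- lapply(perms, paste, collapse = "-")

# vapply (a stricter sapply) returns a plain character vector instead.
as_vec <- vapply(perms, paste, FUN.VALUE = character(1), collapse = "-")

as_list[[1]]
as_vec
```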
I am using R and want to convert the following:
"A,B"
to
"A","B" OR 'A','B'
I tried str_replace(), but that's not working out.
Please suggest, thanks.
Update
I tried the answer suggested by d.b. It works, but I didn't realize I should have mentioned that I am going to use the result as a vector. I need the values in data, such as "A,B", to be split so that I can use them as a vector.
Using strsplit
> data
[1] "A,B"
> test <- strsplit(x = data, split = ",")
> test
[[1]]
[1] "A" "B"
The result test above isn't useful, because I can't use it for the following:
> output_1 <- c(test)
> outputFinalData <- outputFinal[outputFinal$Column %in% output_1,]
outputFinalData is empty with the above process, but it is not empty when I do:
> output_2 <- c("A", "B")
> outputFinalData <- outputFinal[outputFinal$Column %in% output_2,]
Also, output_1 and output_2 are not the same:
> output_1
[[1]]
[1] "Bin_14" "Bin_15"
> output_2
[1] "Bin_14" "Bin_15"
> output_1 == output_2
[1] FALSE FALSE
Use strsplit:
> data = "A,B"
> strsplit(x=data,split=",")
[[1]]
[1] "A" "B"
Note that it returns a list with a vector. The list is length one because you asked it to split one string. If you ask it to split two strings you get a list of length 2:
> data = c("A,B","Foo,bar")
> strsplit(x=data,split=",")
[[1]]
[1] "A" "B"
[[2]]
[1] "Foo" "bar"
So if you know you are only going to have one thing to split you can get a vector of the parts by taking the first element:
> data = "A,B"
> strsplit(x=data,split=",")[[1]]
[1] "A" "B"
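Applied to the filtering problem in the question, taking the first list element gives a plain character vector that works with %in%. This is a minimal sketch; the data frame lookup and its contents are made up for illustration.

```r
data <- "A,B"

# [[1]] extracts the character vector from the length-one list.
# fixed = TRUE treats "," as a literal string rather than a regex.
parts <- strsplit(data, split = ",", fixed = TRUE)[[1]]

# Hypothetical data frame standing in for outputFinal.
lookup <- data.frame(Column = c("A", "B", "C"), value = 1:3,
                     stringsAsFactors = FALSE)

# %in% now compares against c("A", "B"), not against a list.
matched <- lookup[lookup$Column %in% parts, ]
matched
```

unlist(strsplit(data, ",")) gives the same vector when only one string is split.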
However it might be more efficient to do a load of splits in one go and put the bits in a matrix. As long as you can be sure everything splits into the same number of parts, then something like:
> data = c("A,B","Foo,bar","p1,p2")
> do.call(rbind,(strsplit(x=data,split=",")))
[,1] [,2]
[1,] "A" "B"
[2,] "Foo" "bar"
[3,] "p1" "p2"
This gets you the two parts in columns of a matrix, which you can then add to a data frame if that's what you need.
I want to isolate a value in the summary of a data frame, so I wrote:
> summary(pf$mobile_likes > 0)[2]
FALSE
"35056"
The response to my command is a character vector, and I can convert it to an integer:
> typeof(summary(pf$mobile_likes > 0)[2])
[1] "character"
> strtoi(summary(pf$mobile_likes > 0)[2])
[1] 35056
Still, I don't understand why that FALSE header shows up on top. What is it, and how can I isolate my character vector from it?
Your summary is a vector, and what you're seeing there is an element name.
You can wrap the call in unname to get rid of the names.
> x <- 1:5
> (summ <- summary(x > 2)[2:3])
# FALSE TRUE
# "2" "3"
> names(summ)
# [1] "FALSE" "TRUE"
> unname(summ)
# [1] "2" "3"
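If the goal is the count itself rather than the summary string, a sketch like the following may be simpler. It reuses the toy vector x from above; the variable names n_false and n_false_direct are illustrative.

```r
x <- 1:5
summ <- summary(x > 2)

# Position 2 holds the FALSE count here (position 1 is "Mode").
# as.integer() drops the element name and converts the string.
n_false <- as.integer(summ[2])

# Counting directly avoids the character round trip and does not
# depend on which position FALSE happens to occupy in the summary.
n_false_direct <- sum(!(x > 2))

n_false
n_false_direct
```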
If you use apply over rows on a data.frame with character and numeric columns, apply uses as.matrix internally to convert the data.frame to only characters. But if the numeric column consists of numbers of different lengths, as.matrix adds spaces to match the highest/"longest" number.
An example:
df <- data.frame(id1=c(rep("a",3)),id2=c(100,90,8), stringsAsFactors = FALSE)
df
## id1 id2
## 1 a 100
## 2 a 90
## 3 a 8
as.matrix(df)
##      id1 id2
## [1,] "a" "100"
## [2,] "a" " 90"
## [3,] "a" "  8"
I would have expected the result to be:
id1 id2
[1,] "a" "100"
[2,] "a" "90"
[3,] "a" "8"
Why the extra spaces?
They can create unexpected results when using apply on a data.frame:
myfunc <- function(row){
paste(row[1], row[2], sep = "")
}
> apply(df, 1, myfunc)
[1] "a100" "a 90" "a  8"
>
While looping gives the expected result.
> for (i in 1:nrow(df)){
print(myfunc(df[i,]))
}
[1] "a100"
[1] "a90"
[1] "a8"
and
> paste(df[,1], df[,2], sep = "")
[1] "a100" "a90" "a8"
Are there any situations where the extra spaces that are added by as.matrix are useful?
This is because of the way data frames containing non-numeric columns are converted in the as.matrix.data.frame method. There is a simple workaround, shown below.
Details
?as.matrix notes that conversion is done via format(), and it is here that the additional spaces are added. Specifically, ?as.matrix has this in the Details section:
‘as.matrix’ is a generic function. The method for data frames
will return a character matrix if there is only atomic columns and
any non-(numeric/logical/complex) column, applying ‘as.vector’ to
factors and ‘format’ to other non-character columns. Otherwise,
the usual coercion hierarchy (logical < integer < double <
complex) will be used, e.g., all-logical data frames will be
coerced to a logical matrix, mixed logical-integer will give a
integer matrix, etc.
?format also notes that
Character strings are padded with blanks to the display width of the widest.
Consider this example which illustrates the behaviour
> format(df[,2])
[1] "100" " 90" "  8"
> nchar(format(df[,2]))
[1] 3 3 3
format doesn't have to work this way as it has trim:
trim: logical; if ‘FALSE’, logical, numeric and complex values are
right-justified to a common width: if ‘TRUE’ the leading
blanks for justification are suppressed.
e.g.
> format(df[,2], trim = TRUE)
[1] "100" "90" "8"
but there is no way to pass this argument along to the as.matrix.data.frame method.
Workaround
A way to work around this is to apply format() yourself, manually, via sapply. There you can pass in trim = TRUE
> sapply(df, format, trim = TRUE)
id1 id2
[1,] "a" "100"
[2,] "a" "90"
[3,] "a" "8"
or, using vapply we can state what we expect to be returned (here character vectors of length 3 [nrow(df)]):
> vapply(df, format, FUN.VALUE = character(nrow(df)), trim = TRUE)
id1 id2
[1,] "a" "100"
[2,] "a" "90"
[3,] "a" "8"
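With the trimmed matrix in hand, the row-wise apply from the question gives the expected result. A minimal sketch, reusing df and myfunc as defined in the question:

```r
df <- data.frame(id1 = rep("a", 3), id2 = c(100, 90, 8),
                 stringsAsFactors = FALSE)

myfunc <- function(row) {
  paste(row[1], row[2], sep = "")
}

# Build the character matrix ourselves so that trim = TRUE is applied,
# then apply over the rows of that matrix instead of the data frame.
m <- vapply(df, format, FUN.VALUE = character(nrow(df)), trim = TRUE)
res <- apply(m, 1, myfunc)
res
```

Because apply receives a matrix here, it does not call as.matrix on a data frame and no padding is introduced.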
It does seem a little strange. In the manual (?as.matrix) it explains that format is called for the conversion to a character matrix:
The method for data frames will return a character matrix if there is
only atomic columns and any non-(numeric/logical/complex) column,
applying as.vector to factors and format to other non-character
columns.
And you can see that if you call format directly, it does what as.matrix does:
format(df$id2)
[1] "100" " 90" "  8"
What you need to do is pass the trim argument:
format(df$id2,trim=TRUE)
[1] "100" "90" "8"
But, unfortunately, the as.matrix.data.frame function doesn't allow you to do that.
else if (non.numeric) {
    for (j in pseq) {
        if (is.character(X[[j]]))
            next
        xj <- X[[j]]
        miss <- is.na(xj)
        xj <- if (length(levels(xj)))
            as.vector(xj)
        else format(xj) # This could have ... as an argument
        # else format(xj,...)
        is.na(xj) <- miss
        X[[j]] <- xj
    }
}
So, you could modify as.matrix.data.frame. I think it would be a nice feature addition, however, to include this in base.
But, a quick solution would be to simply:
as.matrix(data.frame(lapply(df,as.character)))
id1 id2
[1,] "a" "100"
[2,] "a" "90"
[3,] "a" "8"
# As mentioned in the comments, this also works:
sapply(df,as.character)
as.matrix calls format internally:
> format(df$id2)
[1] "100" " 90" "  8"
That's where the extra spaces come from. format has an extra argument trim to remove those:
> format(df$id2, trim = TRUE)
[1] "100" "90" "8"
However you cannot supply this argument to as.matrix.
The reason for this behaviour is already explained in previous answers, but I'd like to offer another way of circumventing this:
df <- data.frame(id1=c(rep("a",3)),id2=c(100,90,8), stringsAsFactors = FALSE)
do.call(cbind,df)
id1 id2
[1,] "a" "100"
[2,] "a" "90"
[3,] "a" "8"
Note that if using stringsAsFactors = TRUE, this doesn't work as factor levels are converted to numbers.
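If the data frame does contain factors, one possible way around this caveat is to convert them to character before binding. This is a sketch under that assumption; the name df_f is illustrative.

```r
# Same data as above, but with the string column stored as a factor.
df_f <- data.frame(id1 = rep("a", 3), id2 = c(100, 90, 8),
                   stringsAsFactors = TRUE)

# Replace each factor column with its labels; other columns are kept.
# Assigning into df_f[] preserves the data frame structure.
df_f[] <- lapply(df_f, function(col) {
  if (is.factor(col)) as.character(col) else col
})

# cbind now sees a character and a numeric vector and produces a
# character matrix without padding.
m <- do.call(cbind, df_f)
m
```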
Just another solution: trimWhiteSpace(x) (from the limma R package) also does the job if you don't mind installing the package.
# biocLite() has since been retired; on current R versions, install via BiocManager:
install.packages("BiocManager")
BiocManager::install("limma")
library(limma)
df <- data.frame(id1=c(rep("a",3)),id2=c(100,90,8), stringsAsFactors = FALSE)
as.matrix(df)
id1 id2
[1,] "a" "100"
[2,] "a" " 90"
[3,] "a" "  8"
trimWhiteSpace(as.matrix(df))
id1 id2
[1,] "a" "100"
[2,] "a" "90"
[3,] "a" "8"
I'm reading a text file like this in R 2.10.0.
248585_at 250887_at 245638_s_at AFFX-BioC-5_at
248585_at 250887_at 264488_s_at 245638_s_at AFFX-BioC-5_at AFFX-BioC-3_at AFFX-BioDn-5_at
248585_at 250887_at
Using the command
clusters<-read.delim("test",sep="\t",fill=TRUE,header=FALSE)
Now, I must pass every row in this file to a BioConductor function that takes only character vectors as input.
My problem is that using as.character on this "clusters" object turns everything into numeric strings.
> clusters[1,]
V1 V2 V3 V4 V5 V6 V7
1 248585_at 250887_at 245638_s_at AFFX-BioC-5_at
But
> as.character(clusters[1,])
[1] "1" "1" "2" "3" "1" "1" "1"
Is there any way to keep the original names and put them into a character vector?
Maybe it helps: my "clusters" object given by the "read.delim" file belongs to the "list" type.
Thanks a lot :-)
Federico
By default (in R versions before 4.0.0, including the 2.10.0 used here), character columns are converted to factors. You can avoid this by setting the as.is=TRUE argument:
clusters <- read.delim("test", sep="\t", fill=TRUE, header=FALSE, as.is=TRUE)
If you only need to pass each line of the text file on as a character vector, you could do something like:
x <- readLines("test")
xx <- strsplit(x,split="\t")
xx[[1]] # xx is a list
# [1] "248585_at" "250887_at" "245638_s_at" "AFFX-BioC-5_at"
I never would have expected that to happen, but trying a small test case produces the same results you're giving.
Since the result of df[1,] is itself a data.frame, one fix I thought to try was to use unlist -- seems to work:
> df <- data.frame(a=LETTERS[1:10], b=LETTERS[11:20], c=LETTERS[5:14])
> df[1,]
a b c
1 A K E
> as.character(df[1,])
[1] "1" "1" "1"
> as.character(unlist(df[2,]))
[1] "B" "L" "F"
I think turning the data.frame into a matrix first would also get around this:
m <- as.matrix(df)
> as.character(m[2,])
[1] "B" "L" "F"
To avoid issues with factors in your data.frame you might want to set stringsAsFactors=FALSE when reading in your data from the text file, e.g.:
clusters <- read.delim("test", sep="\t", fill=TRUE, header=FALSE,
                       stringsAsFactors=FALSE)
And, after all that, the unexpected behavior seems to come from the fact that the original affy probes in your data.frame are treated as factors. So, doing the stringsAsFactors=FALSE thing will side-step the fanfare:
df <- data.frame(a=LETTERS[1:10], b=LETTERS[11:20],
c=LETTERS[5:14], stringsAsFactors=FALSE)
> as.character(df[1,])
[1] "A" "K" "E"
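If the data frame has already been read in with factor columns, a row can still be pulled out as a character vector by converting each cell individually. A minimal sketch, with stringsAsFactors = TRUE set explicitly to reproduce the factor case; the name row_chr is illustrative.

```r
df <- data.frame(a = LETTERS[1:10], b = LETTERS[11:20],
                 c = LETTERS[5:14], stringsAsFactors = TRUE)

# df[2, ] is a one-row data frame, i.e. a list of length-one factors.
# as.character() on each factor returns its label, not its integer code.
row_chr <- vapply(df[2, ], as.character, FUN.VALUE = character(1))

# Drop the column names if the downstream function wants a bare vector.
row_chr <- unname(row_chr)
row_chr
```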