cbind coerces a data frame to matrix - r

I'm having trouble when using cbind. Prior to using cbind, the object is a data.frame of two character vectors.
After I add a column using cbind, the data.frame object changes class to matrix. I've tried as.vector, declaring h as an empty character vector, etc. but couldn't fix it. Thank you for any suggestions and help.
output <- data.frame(h = character(), st = character()) ## empty dataframe
st <- state.abb
h <- (rep("a", 50))
output <- cbind(output$h, h) ## output changes to matrix class here
output <- cbind(output, st) ## adding a second column

I guess you may not need cbind().
output <- data.frame(state = state.abb, h = rep("a", 50))
head(output)
state h
1 AL a
2 AK a
3 AZ a
4 AR a
5 CA a
6 CO a
@Ken, I'm not sure what you actually want to obtain, but it may be easier if the variables are kept in a list. Below is an example.
state <- state.abb
h <- rep("a", 50)
lst <- list(state = state, h = h)
mat <- as.matrix(do.call(cbind, lst))
head(mat)
state h
[1,] "AL" "a"
[2,] "AK" "a"
[3,] "AZ" "a"
[4,] "AR" "a"
[5,] "CA" "a"
[6,] "CO" "a"
df <- as.data.frame(do.call(cbind, lst))
head(df)
state h
1 AL a
2 AK a
3 AZ a
4 AR a
5 CA a
6 CO a

As a complement, note that you can use single-bracket notation to make something close to your original code work:
data
output <- data.frame(h = letters[1:5],st = letters[6:10])
h2 <- (rep("a", 5))
This won't work
cbind(output$h, h2)
# h2
# [1,] "1" "a"
# [2,] "2" "a"
# [3,] "3" "a"
# [4,] "4" "a"
# [5,] "5" "a"
class(cbind(output$h, h2)) # matrix
It's a matrix, and the factor has been coerced to its integer codes (this happens when h is a factor, which was the data.frame() default before R 4.0.0)
This will work
cbind(output["h"], h2)
# h h2
# 1 a a
# 2 b a
# 3 c a
# 4 d a
# 5 e a
class(cbind(output["h"], h2)) # data.frame
Note that with double brackets (output[["h"]]) you'll have the same inadequate result as when using the dollar notation.
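The difference between the three extraction forms can be checked directly. The following is a minimal sketch that reuses the illustrative data from this answer; it only inspects classes and does not depend on whether h is stored as a factor or as a character vector.

```r
# Illustrative data; names mirror the answer above
output <- data.frame(h = letters[1:5], st = letters[6:10])
h2 <- rep("a", 5)

class(output["h"])    # single brackets keep the data.frame
class(output[["h"]])  # double brackets extract the bare column
class(output$h)       # same object as output[["h"]]

# cbind() dispatches on its arguments: if one of them is a data.frame
# the result is a data.frame, while bare vectors are bound into a matrix
is.data.frame(cbind(output["h"], h2))
is.matrix(cbind(output[["h"]], h2))
```

If the goal is simply a data.frame with an extra column, output$h2 <- h2 or data.frame(output, h2) avoids cbind() altogether.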

Related

R data.frame manipulation: convert to NA after specific column

I have a large data.frame and need a row-wise conversion: in each row, every value that comes after a specific character should be converted to NA.
For example I provide little sample from my real data set:
sample_df <- data.frame( a = c("V","I","V","V"), b = c("I","V","V","V"), c = c("V","V","I","V"), d = c("V","V","I","V"))
result_df <- data.frame( a = c("V","I","V","V"), b = c("I",NA,"V","V"), c = c(NA,NA,"I","V"), d = c(NA,NA,NA,"V"))
In sample_df, for each row, I want to turn every value after the first "I" into NA; result_df shows the desired output.
I tried base, dplyr, and purrr but could not come up with an algorithm.
Thanks for your help.
Try this:
Find "I" values
I_true<-sample_df=="I"
I_true
a b c d
[1,] FALSE TRUE FALSE FALSE
[2,] TRUE FALSE FALSE FALSE
[3,] FALSE FALSE TRUE TRUE
[4,] FALSE FALSE FALSE FALSE
Find positions from the first "I" seen
out<-t(apply(t(I_true),2,cumsum))
out
a b c d
[1,] 0 1 1 1
[2,] 1 1 1 1
[3,] 0 0 1 2
[4,] 0 0 0 0
Replace needed values
output<-out
output[out>=1]<-NA
output[output==0]<-"V"
output[I_true]<-"I"
output[out>=2]<-NA
Your output
output
a b c d
[1,] "V" "I" NA NA
[2,] "I" NA NA NA
[3,] "V" "V" "I" NA
[4,] "V" "V" "V" "V"
Example 2:
sample_df <- data.frame( a = c("V","I","I","V"), b = c("I","V","V","V"), c = c("V","V","I","V"), d = c("V","V","I","V"))
sample_df
a b c d
1 V I V V
2 I V V V
3 I V I I
4 V V V V
output
a b c d
[1,] "V" "I" NA NA
[2,] "I" NA NA NA
[3,] "I" NA NA NA
[4,] "V" "V" "V" "V"
Here is a brute force approach, which should be the easiest to come up with but the least preferred. Anyway, here it is:
df <- data.frame( a = c("V","I","V","V"), b = c("I","V","V","V"), c = c("V","V","I","V"), d = c("V","V","I","V"), stringsAsFactors=FALSE)
rowlength <- ncol(df)
for (i in seq_len(nrow(df))) {
  # column right after the first "I" in this row (NA if the row has no "I")
  first <- which(as.character(df[i,]) == "I")[1] + 1
  # skip rows without "I" and rows whose first "I" is in the last column
  if (!is.na(first) && first <= rowlength) {
    df[i, first:rowlength] <- NA
  }
}
Here's a possible answer using ddply from the plyr package
ddply(sample_df,.(a,b,c,d), function(x){
idx<-which(x=='I')[1]+1 #ID after first 'I'
if(!is.na(idx)){ #Check if found
if(idx<=ncol(x)){ # Prevent out of bounds
x[,idx:ncol(x)]<-NA
}
}
x
})
The plyr approach :
plyr::adply(sample_df, 1L, function(x) {
if (all(x != "I"))
return(x)
x[1L:min(which(x == "I"))]
})
You need the if because, for rows without any "I", which(x == "I") returns integer(0), so min() returns Inf (with a warning) and the subsetting fails.
My Solution:
Following @Julien Navarre's recommendation, I first created a toNA() function:
toNA <- function(x) {
temp <- grep("INVALID", unlist(x)) # which can be generalized for any string
lt <- length(x)
loc <- min(temp,100)+1 #100 is arbitrary number bigger than actual column count
#print(lt) #Debug purposes
if( (loc < lt+1) ) {
x[ (loc):(lt)] <-NA
}
x
}
First, I tried the plyr::adply() and purrrlyr::by_row() functions to apply toNA() to my data.frame, which has over 3 million rows.
Both are very slow (for 1000 rows they take 9 and 6 seconds respectively). They are slow even with a trivial function(x) x, and I am not sure where the overhead comes from.
So I tried base::apply() instead (result is my data set):
as_tibble(t(apply(result, 1, toNA)))
It only takes 0.2 seconds for 1000 rows.
I am not sure about programming style but for now this solution works for me.
Thanks for all your recommendations.
A pure base solution: build a logical matrix marking the cells equal to "I", then apply a double cumsum by row to find where the NAs must be placed:
result_df <- sample_df
is.na(result_df) <- t(apply(sample_df == "I",1,function(x) cumsum(cumsum(x)))) >1
result_df
# a b c d
# 1 V I <NA> <NA>
# 2 I <NA> <NA> <NA>
# 3 V V I <NA>
# 4 V V V V
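To see why the cumsum is applied twice, it may help to trace a single row by hand. This is a small sketch; the vector names (one_hit, two_hits, and so on) are made up for illustration and are not part of the answer.

```r
# Row with a single "I" in the second position (row 1 of sample_df == "I")
one_hit <- c(FALSE, TRUE, FALSE, FALSE)
# Row with "I" in the third and fourth positions (row 3 of sample_df == "I")
two_hits <- c(FALSE, FALSE, TRUE, TRUE)

# First pass: running count of "I" seen so far (0 before the first "I")
pass1_one <- cumsum(one_hit)
pass1_two <- cumsum(two_hits)

# Second pass: the total is exactly 1 at the first "I" and grows afterwards
pass2_one <- cumsum(pass1_one)
pass2_two <- cumsum(pass1_two)

# Cells strictly after the first "I" are those where the second pass exceeds 1
na_one <- pass2_one > 1
na_two <- pass2_two > 1
```

A single cumsum is not enough: testing it with >= 1 would also blank the first "I" itself, and testing it with > 1 would miss the cells after a lone "I".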

How do I make two data columns into one longer data column

Let me be clear: I do not want to add, multiply, subtract or divide the data. I want the new column to include all the information from both the first column and the second column. Here is an example of what I mean.
   data 1  data 2  new data
1  Q       1       Q
2  T       5       T
3  R       3       R
4                  1
5                  5
6                  3
The values I am looking at are not categorical but I used them as an example to show the difference.
cbind.fill <- function(...){
# From a SO answer by Tyler Rinker
nm <- list(...)
nm <- lapply(nm, as.matrix)
n <- max(sapply(nm, nrow))
do.call(cbind, lapply(nm, function (x)
rbind(x, matrix(, n-nrow(x), ncol(x)))))
}
df1 <- data.frame("data 1"=c("Q","T","R"),"data 2"=c(1,5,3), stringsAsFactors = F)
df2 <- cbind.fill(df1, c(df1$data.1, df1$data.2))
colnames(df2) <- c(colnames(df2)[1:2], "new data")
df2
data.1 data.2 new data
[1,] "Q" "1" "Q"
[2,] "T" "5" "T"
[3,] "R" "3" "R"
[4,] NA NA "1"
[5,] NA NA "5"
[6,] NA NA "3"
Source of cbind.fill function: cbind a df with an empty df (cbind.fill?)
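Note that cbind.fill() returns a character matrix. If a data.frame is preferred, one possible alternative (a sketch, not part of the original answer) is to pad each existing column with NA by extending its length. The names padded and new.data are chosen here purely for illustration.

```r
df1 <- data.frame(data.1 = c("Q", "T", "R"), data.2 = c(1, 5, 3),
                  stringsAsFactors = FALSE)

# Stack the two columns into one longer character vector
new_data <- c(df1$data.1, as.character(df1$data.2))
n <- length(new_data)

# Extending the length of a vector pads it with NA
padded <- lapply(df1, function(col) {
  length(col) <- n
  col
})

# Rebuild a data.frame; data.2 keeps its numeric type
df2 <- data.frame(padded, new.data = new_data, stringsAsFactors = FALSE)
df2
```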

count of records within levels of a factor

I am trying to populate a field in a table (or create a separate vector altogether, whichever is easier) with consecutive numbers from 1 to n, where n is the total number of records that share the same factor level, and then back to 1 for the next level, etc. That is, for a table like this
data<-matrix(c(rep('A',4),rep('B',3),rep('C',4),rep('D',2)),ncol=1)
the result should be a new column (e.g. "sample") as follows:
sample<-c(1,2,3,4,1,2,3,1,2,3,4,1,2)
You can get it as follows, using ave:
data <- data.frame(data)
new <- ave(rep(1,nrow(data)),data$data,FUN=cumsum)
all.equal(new,sample) # check if it's right.
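The same ave() idea also works on a plain vector, without converting to a data.frame first. The sketch below uses illustrative names (g, counter) and seq_along in place of a cumulative sum of ones.

```r
# Grouping values, same layout as the question's matrix column
g <- c(rep("A", 4), rep("B", 3), rep("C", 4), rep("D", 2))

# ave() splits the first argument by g, applies FUN to each piece,
# and puts the results back in the original positions
counter <- ave(seq_along(g), g, FUN = seq_along)
counter
```

Because ave() groups by value rather than by consecutive runs, it numbers records correctly even when the levels are not sorted; an approach based on run lengths restarts the count every time the value changes.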
You can use rle function together with lapply :
sample <- unlist(lapply(rle(data[,1])$lengths,FUN=function(x){1:x}))
data <- cbind(data,sample)
Or even better, you can combine rle and sequence in the following one-liner (thanks to @Arun's suggestion):
data <- cbind(data,sequence(rle(data[,1])$lengths))
> data
[,1] [,2]
[1,] "A" "1"
[2,] "A" "2"
[3,] "A" "3"
[4,] "A" "4"
[5,] "B" "1"
[6,] "B" "2"
[7,] "B" "3"
[8,] "C" "1"
[9,] "C" "2"
[10,] "C" "3"
[11,] "C" "4"
[12,] "D" "1"
[13,] "D" "2"
There are lots of different ways of achieving this, but I prefer to use ddply() from plyr because the logic seems very consistent to me. I think it makes more sense to be working with a data.frame (your title talks about levels of a factor):
dat <- data.frame(ID = c(rep('A',4),rep('B',3),rep('C',4),rep('D',2)))
library(plyr)
ddply(dat, .(ID), summarise, sample = 1:length(ID))
# ID sample
# 1 A 1
# 2 A 2
# 3 A 3
# 4 A 4
# 5 B 1
# 6 B 2
# 7 B 3
# 8 C 1
# 9 C 2
# 10 C 3
# 11 C 4
# 12 D 1
# 13 D 2
My answer:
sample <- unlist(lapply(levels(factor(data)), function(x)seq_len(sum(factor(data)==x))))
factors <- unique(data)
f1 <- length(which(data == factors[1]))
...
fn <- length(which(data == factors[length(factors)]))
You can use a for loop or 'apply' family to speed that part up.
Then,
sample <- c(1:f1, 1:f2, ..., 1:fn)
Once again you can use a for loop for that part. Here is the full script you can use:
data<-matrix(c(rep('A',4),rep('B',3),rep('C',4),rep('D',2)),ncol=1)
factors <- unique(data)
f <- c()
for(i in 1:length(factors)) {
f[i] <- length(which(data == factors[i]))
}
sample <- c()
for(i in 1:length(f)) {
sample <- c(sample, 1:f[i])
}
> sample
[1] 1 2 3 4 1 2 3 1 2 3 4 1 2

Count the occurrence of specific combinations of characters in a list

My question is very simple, but I can't manage to work it out.
I have run a variable selection method in R on 2000 genes using 1000 iterations and in each iteration I got a combination of genes. I would like to count the number of times each combination of genes occurs in R.
For example I have
# for iteration 1
genes[1] "a" "b" "c"
# for iteration 2
genes[2] "a" "b"
# for iteration 3
genes[3] "a" "c"
# for iteration 4
genes [4] "a" "b"
and this would give me
"a" "b" "c" 1
"a" "b" 2
"a" "c" 1
I have unlisted the list and counted how often each gene occurs, but what I am interested in is the combinations. I tried to create a table, but the gene vectors have unequal lengths. Thanks in advance.
The way I could immediately think of is to paste them and then use table as follows:
genes_p <- sapply(my_genes, paste, collapse=";")
freq <- as.data.frame(table(genes_p))
# Var1 Freq
# 1 a;b 2
# 2 a;b;c 1
# 3 c 1
The above solution assumes that the genes are sorted by names and the same gene id doesn't occur more than once within an element of the list. If you want to account for both, then:
# sort genes before pasting
genes_p <- sapply(my_genes, function(x) paste(sort(x), collapse=";"))
# sort + unique
genes_p <- sapply(my_genes, function(x) paste(sort(unique(x)), collapse=";"))
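The answer never defines my_genes, so here is a self-contained sketch with a hypothetical list, chosen so that the counts match the frequencies printed above. One element is deliberately stored in a different order to show why sorting before pasting matters.

```r
# Hypothetical input (an assumption, not the asker's real data)
my_genes <- list(c("a", "b", "c"), c("a", "b"), "c", c("b", "a"))

# Without sorting, c("b", "a") and c("a", "b") become different keys
unsorted_keys <- sapply(my_genes, paste, collapse = ";")

# Sorting (and de-duplicating) first gives one key per combination
genes_p <- sapply(my_genes, function(x) paste(sort(unique(x)), collapse = ";"))

freq <- as.data.frame(table(genes_p))
freq
```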
Edit: Following OP's question in comment, the idea is to get all combinations of 2'ers (so to say), wherever possible and then take the table. First I'll break down the code and write them separate for understanding. Then I'll group them together to get a one-liner.
# you first want all possible combinations of length 2 here
# that is, if vector is:
v <- c("a", "b", "c")
combn(v, 2)
# [,1] [,2] [,3]
# [1,] "a" "a" "b"
# [2,] "b" "c" "c"
This gives all the combinations taken 2 at a time. Now you can paste them in the same way; combn also accepts a function argument that is applied to each combination.
combn(v, 2, function(y) paste(y, collapse=";"))
# [1] "a;b" "a;c" "b;c"
So, for each set of genes in your list, you can do the same by wrapping this around a sapply as follows:
sapply(my_genes, function(x) combn(x, min(length(x), 2), function(y)
paste(y, collapse=";")))
The min(length(x), 2) is required because some elements of your gene list may contain just one gene.
# [[1]]
# [1] "a;b" "a;c" "b;c"
# [[2]]
# [1] "a;b"
# [[3]]
# [1] "c"
# [[4]]
# [1] "a;b"
Now, you can unlist this to get a vector and then use table to get frequency:
table(unlist(sapply(my_genes, function(x) combn(x, min(length(x), 2), function(y)
paste(y, collapse=";")))))
# a;b a;c b;c c
# 3 1 1 1
You can wrap this in turn with as.data.frame(.) to get a data.frame:
as.data.frame(table(unlist(sapply(my_genes, function(x) combn(x, min(length(x), 2),
function(y) paste(y, collapse=";"))))))
# Var1 Freq
# 1 a;b 3
# 2 a;c 1
# 3 b;c 1
# 4 c 1

Row names & column names in R

Do the following function pairs generate exactly the same results?
Pair 1) names() & colnames()
Pair 2) rownames() & row.names()
As Oscar Wilde said: "Consistency is the last refuge of the unimaginative."
R is more of an evolved rather than designed language, so these things happen. names() and colnames() work on a data.frame but names() does not work on a matrix:
R> DF <- data.frame(foo=1:3, bar=LETTERS[1:3])
R> names(DF)
[1] "foo" "bar"
R> colnames(DF)
[1] "foo" "bar"
R> M <- matrix(1:9, ncol=3, dimnames=list(1:3, c("alpha","beta","gamma")))
R> names(M)
NULL
R> colnames(M)
[1] "alpha" "beta" "gamma"
R>
Just to expand a little on Dirk's example:
It helps to think of a data frame as a list with equal length vectors. That's probably why names works with a data frame but not a matrix.
The other useful function is dimnames which returns the names for every dimension. You will notice that the rownames function actually just returns the first element from dimnames.
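Both points can be checked in a few lines. This is a small sketch with made-up objects (DF and M, with illustrative row and column labels), not output from the answers above.

```r
DF <- data.frame(foo = 1:3, bar = LETTERS[1:3])
M <- matrix(1:9, ncol = 3,
            dimnames = list(c("r1", "r2", "r3"), c("alpha", "beta", "gamma")))

# A data frame is a list of columns, so names() gives the column names
is.list(DF)
identical(names(DF), colnames(DF))

# For a matrix, the names live in the dimnames attribute:
# element 1 holds the row names, element 2 the column names
identical(rownames(M), dimnames(M)[[1]])
identical(colnames(M), dimnames(M)[[2]])

# A matrix is not a list and has no names attribute by default
is.null(names(M))
```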
Regarding rownames and row.names: I can't tell the difference, although rownames uses dimnames while row.names was written outside of R. They both also seem to work with higher dimensional arrays:
>a <- array(1:5, 1:4)
> a[1,,,]
> rownames(a) <- "a"
> row.names(a)
[1] "a"
> a
, , 1, 1
[,1] [,2]
a 1 2
> dimnames(a)
[[1]]
[1] "a"
[[2]]
NULL
[[3]]
NULL
[[4]]
NULL
I think that using colnames and rownames makes the most sense; here's why.
Using names has several disadvantages. You have to remember that it means "column names", and it only works with data frames, so you'll need to call colnames whenever you use matrices. By calling colnames, you only have to remember one function. Finally, if you look at the code for colnames, you will see that it calls names in the case of a data frame anyway, so the output is identical.
rownames and row.names return the same values for data frames and matrices; the only difference that I have spotted is that where there aren't any names, rownames will print "NULL" (as does colnames), but row.names returns it invisibly. Since there isn't much to choose between the two functions, rownames wins on the grounds of aesthetics, since it pairs more prettily with colnames. (Also, for the lazy programmer, you save a character of typing.)
And another expansion:
# create dummy matrix
set.seed(10)
m <- matrix(round(runif(25, 1, 5)), 5)
d <- as.data.frame(m)
If you want to assign new column names, you can do the following on a data.frame:
# an identical effect can be achieved with colnames()
names(d) <- LETTERS[1:5]
> d
A B C D E
1 3 2 4 3 4
2 2 2 3 1 3
3 3 2 1 2 4
4 4 3 3 3 2
5 1 3 2 4 3
If, however, you run the previous command on a matrix, you'll mess things up:
names(m) <- LETTERS[1:5]
> m
[,1] [,2] [,3] [,4] [,5]
[1,] 3 2 4 3 4
[2,] 2 2 3 1 3
[3,] 3 2 1 2 4
[4,] 4 3 3 3 2
[5,] 1 3 2 4 3
attr(,"names")
[1] "A" "B" "C" "D" "E" NA NA NA NA NA NA NA NA NA NA NA NA NA NA
[20] NA NA NA NA NA NA
Since a matrix can be regarded as a two-dimensional vector, this assigns names only to the first five values (you don't want to do that, do you?). In this case, you should stick with colnames().
So there...
