Removing gene duplicates from a heatmap in R

I have drawn heatmaps from a microarray expression data set, and in the heatmaps I see duplicates and triplicates of many of the genes I am interested in.
I am very new to R. Is there a way to remove these duplicate or triplicate genes?
For example, I see the name of one gene, say BMP1, 2 or 3 times in the heatmap.
Kindly suggest some solutions.
Regards,
Ram

I'll try to guess at an answer, but it would be better if you gave an example of your problem:
> tmp <- data.frame("numbers" = 1:3, "letters" = letters[1:3])
> tmp
  numbers letters
1       1       a
2       2       b
3       3       c
> tmp <- rbind(tmp,tmp)
> tmp
  numbers letters
1       1       a
2       2       b
3       3       c
4       1       a
5       2       b
6       3       c
> unique(tmp)
  numbers letters
1       1       a
2       2       b
3       3       c
From the base R help (?unique):
unique returns a vector, data frame or array like x but with duplicate elements/rows removed.
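Applied to the heatmap case, a minimal sketch might look like this (the expr data frame and its gene column are made up for illustration; substitute your own objects):
# Toy expression data: BMP1 appears twice, e.g. two probes mapping to one gene
expr <- data.frame(gene    = c("BMP1", "BMP1", "TP53"),
                   sample1 = c(1.2, 1.4, 2.0),
                   sample2 = c(0.8, 1.0, 2.5))
# Option 1: keep only the first row seen for each gene symbol
expr_unique <- expr[!duplicated(expr$gene), ]
# Option 2: average the expression values of duplicated probes per gene
expr_avg <- aggregate(. ~ gene, data = expr, FUN = mean)
# Rebuild the matrix for heatmap() with one row per gene
m <- as.matrix(expr_avg[, -1])
rownames(m) <- expr_avg$gene
heatmap(m)
Whether dropping or averaging duplicates is appropriate depends on why a gene appears several times (e.g. multiple probes per gene), so check that before choosing.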

Related

I want to eliminate duplicates in a variable but only within a certain group of values in R

Not an extremely proficient programmer here, so bear with me.
I want to eliminate duplicates in variable 'B', but only within the same values of variable 'A'. That is, I should get only one 'a' for the group of 1's, without eliminating the 'a' in the group of 2's.
A <- c(1,1,1,2,2,2)
B <- c('a','b','a','c','a','d')
ab <- cbind(A,B)
AB <- as.data.frame(ab)
Thank you beforehand! Hope it was clear enough.
You may also want to take a look at the duplicated() function. Your example
a <- c(1,1,1,2,2,2)
b <- c('a','b','a','c','a','d')
ab <- cbind(a,b)
ab_df <- as.data.frame(ab)
gives you the following data frame:
> ab_df
a b
1 1 a
2 1 b
3 1 a
4 2 c
5 2 a
6 2 d
Obviously row 3 duplicates row 1. duplicated(ab_df) returns a logical vector indicating duplicated rows:
> duplicated(ab_df)
[1] FALSE FALSE TRUE FALSE FALSE FALSE
This in turn could be used to eliminate the duplicated rows from your original data frame:
> d <- duplicated(ab_df)
> ab_df[!d, ]
a b
1 1 a
2 1 b
4 2 c
5 2 a
6 2 d
You may use unique, which removes the duplicated rows of your data frame.
AB <- unique(AB)
AB
# A B
# 1 1 a
# 2 1 b
# 4 2 c
# 5 2 a
# 6 2 d

data frame column names no longer unique when subsetting

I have a data frame that contains duplicate column names. I'm aware that it's non-standard to use duplicated column names, but these names are actually being reassigned downstream using user inputs. For now, I'm attempting to positionally subset a data frame, but the column names become deduplicated. Here's an example.
> df <- data.frame(x = 1:4, y = 2:5, y = LETTERS[2:5], y = (2+(2:5)), check.names = F)
> df
x y y y
1 1 2 B 4
2 2 3 C 5
3 3 4 D 6
4 4 5 E 7
However, when I attempt to subset, the names change...
> df[, 1:3]
x y y.1
1 1 2 B
2 2 3 C
3 3 4 D
4 4 5 E
Is there any way to prevent this from happening? It only occurs when I subset on columns, not rows.
> df[1:3,]
x y y y
1 1 2 B 4
2 2 3 C 5
3 3 4 D 6
Edit for others noticing this behavior:
I've done some digging and found the relevant section of the help page for Extract.data.frame (type ?'[').
It states:
If [ returns a data frame it will have unique (and non-missing) row
names, if necessary transforming the row names using make.unique.
Similarly, if columns are selected column names will be transformed to
be unique if necessary (e.g., if columns are selected more than once,
or if more than one column of a given name is selected if the data
frame has duplicate column names).
This explains the why, appreciate the comments so far on addressing how to best navigate this.
Here is an option, although I think it is not a good idea to have duplicated column names.
as.data.frame(as.list(df)[1:3], check.names = F)
# x y y
# 1 1 2 B
# 2 2 3 C
# 3 3 4 D
# 4 4 5 E
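Another small base-R workaround (just a sketch; it relies on the fact that names<- does not enforce uniqueness) is to subset first and then put the original names back:
df2 <- df[, 1:3]               # names get deduplicated here (y, y.1)
names(df2) <- names(df)[1:3]   # reassign the originals; duplicates are allowed
df2
#   x y y
# 1 1 2 B
# 2 2 3 C
# 3 3 4 D
# 4 4 5 E
Keep in mind that with duplicated names, df2$y and df2[["y"]] will only ever find the first y column.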

Count of unique values across all columns in a data frame

We have a data frame as below:
raw<-data.frame(v1=c("A","B","C","D"),v2=c(NA,"B","C","A"),v3=c(NA,"A",NA,"D"),v4=c(NA,"D",NA,NA))
I need a result data frame in the following format :
result<-data.frame(v1=c("A","B","C","D"), v2=c(3,2,2,3))
I used the following code to get the count for one particular column:
count_raw <- sqldf("SELECT DISTINCT(v1) AS V1, COUNT(v1) AS count FROM raw GROUP BY v1")
This returns the count of unique values within an individual column only.
Any help would be highly appreciated.
Use this
table(unlist(raw))
Output
A B C D
3 2 2 3
For data-frame-style output, wrap this with as.data.frame.table:
as.data.frame.table(table(unlist(raw)))
Output
Var1 Freq
1 A 3
2 B 2
3 C 2
4 D 3
If you want a total count for each value across all columns,
sapply(unique(raw[!is.na(raw)]), function(i) length(which(raw == i)))
#A B C D
#3 2 2 3
We can use apply with MARGIN = 1
cbind(raw[1], v2=apply(raw, 1, function(x) length(unique(x[!is.na(x)]))))
If it is needed for each column separately:
sapply(raw, function(x) length(unique(x[!is.na(x)])))
Or, if we need the count based on all the columns, convert to a matrix and use table:
table(as.matrix(raw))
# A B C D
# 3 2 2 3
If you have only character values in your data frame, as in your example, you can unlist it and use unique; or, to count the frequencies, use count from plyr:
> library(plyr)
> raw<-data.frame(v1=c("A","B","C","D"),v2=c(NA,"B","C","A"),v3=c(NA,"A",NA,"D"),v4=c(NA,"D",NA,NA))
> unique(unlist(raw))
[1] A B C D <NA>
Levels: A B C D
> count(unlist(raw))
     x freq
1    A    3
2    B    2
3    C    2
4    D    3
5 <NA>    6
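None of the outputs above has the exact column names of your desired result (v1, v2). If that matters, one small sketch is to rename the table-based version:
result <- setNames(as.data.frame(table(unlist(raw))), c("v1", "v2"))
result
#   v1 v2
# 1  A  3
# 2  B  2
# 3  C  2
# 4  D  3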

Data frame "expand" procedure in R?

This is not a real statistical question, but rather a data preparation question before performing the actual statistical analysis. I have a data frame which consists of sparse data. I would like to "expand" this data to include zeroes for missing values, group by group.
Here is an example of the data (a and b are two factors defining the group, t is the sparse timestamp, and x is the value):
test <- data.frame(
a=c(1,1,1,1,1,1,1,1,1,1,1),
b=c(1,1,1,1,1,2,2,2,2,2,2),
t=c(0,2,3,4,7,3,4,6,7,8,9),
x=c(1,2,1,2,2,1,1,2,1,1,3))
Assuming I would like to expand the values between t=0 and t=9, this is the result I'm hoping for:
test.expanded <- data.frame(
a=c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1),
b=c(1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2),
t=c(0,1,2,3,4,5,6,7,8,9,0,1,2,3,4,5,6,7,8,9),
x=c(1,0,2,1,2,0,0,2,0,0,0,0,0,1,1,0,2,1,1,3))
Zeroes have been inserted for all missing values of t. This makes it easier to use.
I have a quick and dirty implementation that sorts the data frame and loops through its rows, adding the missing rows one at a time, but I'm not entirely satisfied with the solution. Is there a better way to do it?
For those who are familiar with SAS, it is similar to proc expand.
Thanks!
As you noted in a comment to the other answer, doing it by group is easy with plyr, which just leaves how to "fill in" each group's data. My approach is to use merge.
library("plyr")
test.expanded <- ddply(test, c("a","b"), function(DF) {
  DF <- merge(data.frame(t=0:9), DF[,c("t","x")], all.x=TRUE)
  DF[is.na(DF$x),"x"] <- 0
  DF
})
merge with all.x=TRUE will make the missing values NA, so the second line of the function is needed to replace those NAs with 0's.
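For readers coming to this later: the tidyr package (which postdates this question) can express the same fill-in-by-group idea in one call. A sketch:
library(tidyr)
# keep the existing (a, b) pairs, expand t to 0:9, fill missing x with 0
test.expanded <- complete(test, nesting(a, b), t = 0:9, fill = list(x = 0))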
This is convoluted but works fine:
test <- data.frame(
a=c(1,1,1,1,1,1,1,1,1,1,1),
b=c(1,1,1,1,1,2,2,2,2,2,2),
t=c(0,2,3,4,7,3,4,6,7,8,9),
x=c(1,2,1,2,2,1,1,2,1,1,3))
my.seq <- seq(0,9)
not.t <- !(my.seq %in% test$t)
test[nrow(test)+seq(length(my.seq[not.t])),"t"] <- my.seq[not.t]
test
#------------
    a  b t  x
1   1  1 0  1
2   1  1 2  2
3   1  1 3  1
4   1  1 4  2
5   1  1 7  2
6   1  2 3  1
7   1  2 4  1
8   1  2 6  2
9   1  2 7  1
10  1  2 8  1
11  1  2 9  3
12 NA NA 1 NA
13 NA NA 5 NA
Not sure if you want it sorted by t afterwards or not. If so, that's easy enough to do.
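A quick sketch (also zero-filling x for the added rows, in case that's wanted):
test$x[is.na(test$x)] <- 0      # replace the NA fills with zeroes
test <- test[order(test$t), ]   # reorder the rows by timestamp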

Generating random numbers by length of blocks of data in an R data frame

I am trying to simulate the measuring order n times to see how the measuring order affects my study subject. To do this, I am trying to generate random integers in a new column of a data frame. I have a big data frame, and I would like to add a column that contains a random number according to the number of observations in each block.
Example of the data (each row is an observation):
df <- data.frame(A=c(1,1,1,2,2,3,3,3,3),
B=c("x","b","c","g","h","g","g","u","l"),
C=c(1,2,4,1,5,7,1,2,5))
A B C
1 1 x 1
2 1 b 2
3 1 c 4
4 2 g 1
5 2 h 5
6 3 g 7
7 3 g 1
8 3 u 2
9 3 l 5
What I'd like to do is add a D column and generate random integer numbers according to the length of each block. Blocks are defined in column A.
Result should look something like this:
df <- data.frame(A=c(1,1,1,2,2,3,3,3,3),
B=c("x","b","c","g","h","g","g","u","l"),
C=c(1,2,4,1,5,7,1,2,5),
D=c(2,1,3,2,1,4,3,1,2))
> df
A B C D
1 1 x 1 2
2 1 b 2 1
3 1 c 4 3
4 2 g 1 2
5 2 h 5 1
6 3 g 7 4
7 3 g 1 3
8 3 u 2 1
9 3 l 5 2
I have tried to use R's sample() function to generate the random numbers, but my problem is splitting the data according to block length and adding the new column. Any help is greatly appreciated.
It can be done easily with ave:
df$D <- ave(df$A, df$A, FUN = function(x) sample(length(x)))
(You could replace length() with max(), or whatever, but length() will work even if the values of A are not numbers matching the lengths of their blocks.)
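Since sample() is random, call set.seed() first if the simulated orders need to be reproducible, e.g.:
set.seed(42)   # any fixed seed makes the permutations repeatable
df$D <- ave(df$A, df$A, FUN = function(x) sample(length(x)))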
This is really easy with ddply from plyr.
ddply(df, .(A), transform, D = sample(length(A)))
The longer manual version is:
Use split to split the data frame by the first column.
split_df <- split(df, df$A)
Then call sample on each member of the list.
split_df <- lapply(split_df, function(df) {
  df$D <- sample(nrow(df))
  df
})
Then recombine with
df <- do.call(rbind, split_df)
One simple way:
df$D <- 0
counts <- table(df$A)                        # block sizes, named by the values of A
for (i in seq_along(counts)) {
  # permute 1:block_size within block i
  df$D[df$A == names(counts)[i]] <- sample(counts[i])
}
