I have a dataframe containing gene expression data.
I'm trying to extract all rows where ANY of the columns has a value >= 2 (the data is already log2-transformed), but I can't seem to get there. My data is:
A B C D
Gene1 1 2 3 1
Gene2 2 1 1 4
Gene3 1 1 0 1
Gene4 1 2 0 1
I would like to retain only Gene1, Gene2 and Gene4 without naming every column (this is just a toy example).
You could use rowSums on the logical matrix derived from df >= 2 and double negate (!!) to get a logical index of the rows to keep.
df[!!rowSums(df >=2),]
# A B C D
#Gene1 1 2 3 1
#Gene2 2 1 1 4
#Gene4 1 2 0 1
Or use the reverse condition df < 2 to get the logical matrix, apply rowSums, then check whether the result is less than ncol(df)
df[rowSums(df <2) < ncol(df),]
# A B C D
#Gene1 1 2 3 1
#Gene2 2 1 1 4
#Gene4 1 2 0 1
Or
df[apply(t(df>=2),2, any), ]
data
df <- structure(list(A = c(1L, 2L, 1L, 1L), B = c(2L, 1L, 1L, 2L),
C = c(3L, 1L, 0L, 0L), D = c(1L, 4L, 1L, 1L)), .Names = c("A",
"B", "C", "D"), class = "data.frame", row.names = c("Gene1",
"Gene2", "Gene3", "Gene4"))
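The rowSums approach above can be unpacked into its intermediate steps. This is only a sketch on the toy data; the names hits, n_hits, keep and kept are made up for illustration.

```r
# Toy data from the question (rows are genes, columns are samples)
df <- data.frame(
  A = c(1L, 2L, 1L, 1L),
  B = c(2L, 1L, 1L, 2L),
  C = c(3L, 1L, 0L, 0L),
  D = c(1L, 4L, 1L, 1L),
  row.names = c("Gene1", "Gene2", "Gene3", "Gene4")
)

hits   <- df >= 2         # logical matrix, TRUE where a value is >= 2
n_hits <- rowSums(hits)   # number of qualifying columns per gene
keep   <- n_hits > 0      # same index that !!rowSums(df >= 2) produces
kept   <- df[keep, ]      # rows with at least one value >= 2
print(kept)
```

Writing n_hits > 0 instead of !! is purely a readability choice; both coerce the counts to TRUE/FALSE.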
Related
Dataframe A:
Tree Apple Orange Pear
1 0 0 1
0 0 1 1
1 1 0 1
1 0 0 0
Dataframe B:
WK1 WK2 WK3 WK4
1 2 3 8
3 4 2 1
1 3 2 5
6 2 5 8
Both dataframe A and B have the same dimensions. What I am trying to do is to sum the cells across the rows in dataframe B only if the corresponding cell in dataframe A is equal to one.
The expected output is:
WK1 WK2 WK3 WK4 SUM
1 2 3 8 9
3 4 2 1 3
1 3 2 5 9
6 2 5 8 6
Since (row 1 column 1) and (row 1 column 4) of dataframe A are equal to one, (row 1 column 1) and (row 1 column 4) of dataframe B are summed. The full versions of dataframes A and B have over 883 columns and 12000 rows, so I can't write out the name of each column.
Since dataframe A contains only 1/0 values, you can multiply A by B and calculate the row-wise sum.
B$SUM <- rowSums(A * B)
B
# WK1 WK2 WK3 WK4 SUM
#1 1 2 3 8 9
#2 3 4 2 1 3
#3 1 3 2 5 9
#4 6 2 5 8 6
If A can contain values other than 0 and 1, you can compare A with 1 first and then multiply.
B$SUM <- rowSums(+(A == 1) * B)
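As a sketch of why the multiplication works, the mask and the values can be inspected separately. The data below are the two toy dataframes from the question; mask, vals and sums are illustrative names, and the dimension check is an added assumption about how you would want to guard the call.

```r
A <- data.frame(Tree = c(1L, 0L, 1L, 1L), Apple = c(0L, 0L, 1L, 0L),
                Orange = c(0L, 1L, 0L, 0L), Pear = c(1L, 1L, 1L, 0L))
B <- data.frame(WK1 = c(1L, 3L, 1L, 6L), WK2 = c(2L, 4L, 3L, 2L),
                WK3 = c(3L, 2L, 2L, 5L), WK4 = c(8L, 1L, 5L, 8L))

# Element-wise multiplication only makes sense if the shapes agree
stopifnot(identical(dim(A), dim(B)))

mask <- as.matrix(A) == 1     # TRUE where the cell of B should be counted
vals <- as.matrix(B)
sums <- rowSums(mask * vals)  # TRUE/FALSE act as 1/0 in arithmetic
print(sums)
```

Computing sums before attaching it as a column avoids accidentally including an existing SUM column if the code is run twice.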
One option is to multiply the two datasets so that 0s remain 0 and 1s are replaced by the corresponding value of the second dataset. If there are NAs, we can use na.rm in rowSums
df2$SUM <- rowSums((df1 == 1) * df2, na.rm = TRUE)
df2
# WK1 WK2 WK3 WK4 SUM
#1 1 2 3 8 9
#2 3 4 2 1 3
#3 1 3 2 5 9
#4 6 2 5 8 6
Or another option is Map/Reduce
df2$SUM <- Reduce(`+`, Map(`*`, df1, df2))
Or we can replace the elements of 'df2' where 'df1' is 0 with NA and use rowSums to create the 'SUM' column in base R
df2$SUM <- rowSums(replace(df2, df1 == 0, NA), na.rm = TRUE)
Or a slightly more compact option is
df2$SUM <- rowSums(df2 *NA^(df1== 0), na.rm = TRUE)
NOTE: This would also work when there are non-binary elements
data
df1 <- structure(list(Tree = c(1L, 0L, 1L, 1L), Apple = c(0L, 0L, 1L,
0L), Orange = c(0L, 1L, 0L, 0L), Pear = c(1L, 1L, 1L, 0L)), class = "data.frame", row.names = c(NA,
-4L))
df2 <- structure(list(WK1 = c(1L, 3L, 1L, 6L), WK2 = c(2L, 4L, 3L, 2L
), WK3 = c(3L, 2L, 2L, 5L), WK4 = c(8L, 1L, 5L, 8L)), class = "data.frame",
row.names = c(NA,
-4L))
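The compact NA^(df1 == 0) form relies on how R treats NA under exponentiation. A minimal sketch on made-up vectors (mask, vals, masked and kept_sum are illustrative names):

```r
NA^0   # 1: the result is 1 regardless of the base, so NA does not propagate
NA^1   # NA

mask <- c(TRUE, FALSE, TRUE)   # TRUE marks values to drop
vals <- c(10, 20, 30)

masked   <- vals * NA^mask     # dropped values become NA, the rest are multiplied by 1
kept_sum <- sum(masked, na.rm = TRUE)
print(masked)
print(kept_sum)
```

Because the kept values are multiplied by 1 rather than by the mask itself, they pass through unchanged.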
DF
ID B C D
1 A 1 1 3
2 B 2 3 1
3 C 1 1 1
4 D 3 1 1
5 E 1 0 0
Given a dataframe such as the one above, how can I quickly calculate the means for each row in one column and store them in another column of the dataframe? For example, the average of column B would be: 0.5, 1, 0.5, 1.5, 0.5.
And is it possible to have a function that does it automatically for several columns at once?
One option is to find the row whose 'ID' matches the column name and divide the column by the value in that row
f1 <- function(dat, colNm) transform(dat,
newCol = dat[[colNm]]/dat[match(colNm, ID), colNm])
f1(DF, 'B')
# ID B C D newCol
#1 A 1 1 3 0.5
#2 B 2 3 1 1.0
#3 C 1 1 1 0.5
#4 D 3 1 1 1.5
#5 E 1 0 0 0.5
If it is to divide by a constant value, then just do
DF[-1] <- DF[-1]/2
data
DF <- structure(list(ID = c("A", "B", "C", "D", "E"), B = c(1L, 2L,
1L, 3L, 1L), C = c(1L, 3L, 1L, 1L, 0L), D = c(3L, 1L, 1L, 1L,
0L)), class = "data.frame", row.names = c("1", "2", "3", "4",
"5"))
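For the "several columns at once" part of the question, here is a hedged sketch of the constant-divisor case that keeps the original columns and stores the results in new ones. The divisor 2 is taken from the example output in the question; cols and the "_half" suffix are hypothetical choices.

```r
DF <- data.frame(ID = c("A", "B", "C", "D", "E"),
                 B = c(1L, 2L, 1L, 3L, 1L),
                 C = c(1L, 3L, 1L, 1L, 0L),
                 D = c(3L, 1L, 1L, 1L, 0L))

cols      <- c("B", "C")             # hypothetical: columns to process
new_names <- paste0(cols, "_half")   # hypothetical naming scheme

# Divide all selected columns in one step and store them under new names
DF[new_names] <- DF[cols] / 2
print(DF)
```

Unlike DF[-1] <- DF[-1]/2, this leaves the original columns untouched.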
I have a dataframe in R which looks like the one below.
a b c d e f
0 1 1 0 0 0
1 1 1 1 0 1
0 0 0 1 0 1
1 0 0 1 0 1
1 1 1 0 0 0
The dataset is big, spanning over 100 columns and 5000 rows, and contains only binary values (0s and 1s). I want to compute the overlap between every pair of columns in R, something like the matrix below. This overlap dataframe will be a square matrix whose number of rows and columns equals the number of columns in the first dataframe.
a b c d e f
a 3 2 2 2 0 2
b 2 3 3 1 0 1
c 2 3 3 1 0 1
d 2 1 1 3 0 3
e 0 0 0 0 0 0
f 2 1 1 3 0 3
Each cell of the second dataframe is populated by the number of cases where both the row variable and the column variable have a 1 in the first dataframe.
I'm thinking of constructing an empty matrix like this:
df <- matrix(ncol = ncol(data), nrow = ncol(data))
colnames(df) <- names(data)
rownames(df) <- names(data)
... and iterating over each cell of this matrix using an apply command, reading the corresponding row name (say, x) and column name (say, y) and running a function like the one below.
summation <- function(x, y) sum(data[[x]] * data[[y]])
The problem is that I can't find out the row name and column name from within an apply function. Any help will be appreciated.
Any more efficient way than what I'm thinking is more than welcome.
You are looking for crossprod
crossprod(as.matrix(df1))
# a b c d e f
#a 3 2 2 2 0 2
#b 2 3 3 1 0 1
#c 2 3 3 1 0 1
#d 2 1 1 3 0 3
#e 0 0 0 0 0 0
#f 2 1 1 3 0 3
data
df1 <- structure(list(a = c(0L, 1L, 0L, 1L, 1L), b = c(1L, 1L, 0L, 0L,
1L), c = c(1L, 1L, 0L, 0L, 1L), d = c(0L, 1L, 1L, 1L, 0L), e = c(0L,
0L, 0L, 0L, 0L), f = c(0L, 1L, 1L, 1L, 0L)), .Names = c("a",
"b", "c", "d", "e", "f"), class = "data.frame", row.names = c(NA,
-5L))
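To see why crossprod answers the question: crossprod(m) computes t(m) %*% m, so entry (i, j) is the sum over rows of m[, i] * m[, j], which for 0/1 data is exactly the co-occurrence count. The sketch below checks this against an explicit pairwise loop on the toy data; pairwise is an illustrative name, and the loop is only for verification, not a recommended approach for large data.

```r
df1 <- data.frame(a = c(0L, 1L, 0L, 1L, 1L), b = c(1L, 1L, 0L, 0L, 1L),
                  c = c(1L, 1L, 0L, 0L, 1L), d = c(0L, 1L, 1L, 1L, 0L),
                  e = c(0L, 0L, 0L, 0L, 0L), f = c(0L, 1L, 1L, 1L, 0L))
m <- as.matrix(df1)

cp <- crossprod(m)   # same as t(m) %*% m

# Explicit version of the asker's idea: one sum of products per column pair
pairwise <- sapply(colnames(m), function(y)
  sapply(colnames(m), function(x) sum(m[, x] * m[, y])))

stopifnot(all(cp == pairwise))   # both approaches agree
print(cp)
```

The diagonal holds the number of 1s in each column, and the matrix is symmetric.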
I would like to summarize my data by counting the entities and creating a count column for each entity.
Let's say:
df:
id class
1 A
1 B
1 A
1 A
1 B
1 C
2 A
2 B
2 B
2 D
I want to create a table like
id A B C D
1 3 2 1 0
2 1 2 0 1
How can I do this in R using the apply function?
df <- structure(list(id = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L),
class = structure(c(1L, 2L, 1L, 1L, 2L, 3L, 1L, 2L, 2L, 4L
), .Label = c("A", "B", "C", "D"), class = "factor")), .Names = c("id",
"class"), class = "data.frame", row.names = c(NA, -10L))
with(df, table(id, class))
# class
#id A B C D
# 1 3 2 1 0
# 2 1 2 0 1
xtabs(~ id + class, df)
# class
#id A B C D
# 1 3 2 1 0
# 2 1 2 0 1
tapply(rep(1, nrow(df)), df, length, default = 0)
# class
#id A B C D
# 1 3 2 1 0
# 2 1 2 0 1
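The results above are table or matrix objects. If a plain data frame with an id column is needed, as in the desired output, one possible sketch is the following; tab and wide are illustrative names, and the conversion step is an addition, not part of the answer above.

```r
df <- data.frame(
  id = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L),
  class = factor(c("A", "B", "A", "A", "B", "C", "A", "B", "B", "D"))
)

tab  <- table(df$id, df$class)    # 2 x 4 contingency table of counts
wide <- as.data.frame.matrix(tab) # keep the wide layout (one column per class)
wide <- cbind(id = rownames(wide), wide)
print(wide)
```

Calling as.data.frame(tab) instead would give the long format with one row per id/class pair.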
This seems like a very strange requirement, but if you insist on using apply: the function count counts the number of rows for which id equals x and class equals y. It is applied to every combination of id and class, using nested apply calls, to get the matrix a. Finally we add the row and column names.
uid <- unique(DF$id)
uclass <- unique(DF$class)
count <- function(x, y, DF) sum(x == DF$id & y == DF$class)
a <- apply(matrix(uclass), 1, function(u) apply(matrix(uid), 1, count, u, DF))
dimnames(a) <- list(uid, uclass)
giving:
> a
A B c D
1 3 2 1 0
2 1 2 0 1
Note
We used this for DF
Lines <- "id class
1 A
1 B
1 A
1 A
1 B
1 c
2 A
2 B
2 B
2 D"
DF <- read.table(text = Lines, header = TRUE)
Hello, I have a data frame and I need to remove every row that contains the maximum value of any column.
Example
A B C
1 2 3 5
2 4 1 1
3 1 4 3
4 2 1 1
So the output is:
A B C
4 2 1 1
Is there any quick way to do this?
We can do this with %in%
df1[!seq_len(nrow(df1)) %in% sapply(df1, which.max),]
# A B C
#4 2 1 1
If there are ties for the maximum value within a column, then do
df1[!Reduce(`|`, lapply(df1, function(x) x== max(x))),]
df[-sapply(df, which.max),]
# A B C
#4 2 1 1
DATA
df = structure(list(A = c(2L, 4L, 1L, 2L), B = c(3L, 1L, 4L, 1L),
C = c(5L, 1L, 3L, 1L)), .Names = c("A", "B", "C"),
class = "data.frame", row.names = c(NA,-4L))
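The difference between the which.max versions and the tie-aware version only shows up when a column's maximum occurs more than once. The sketch below uses a made-up data frame, df_ties, which is not from the question, to illustrate it.

```r
# Hypothetical data: column A has its maximum (4) in rows 1 and 2
df_ties <- data.frame(A = c(4L, 4L, 1L, 2L),
                      B = c(3L, 1L, 4L, 1L),
                      C = c(1L, 1L, 3L, 1L))

# which.max returns only the FIRST position of each column maximum
first_only <- df_ties[-unique(sapply(df_ties, which.max)), ]

# Tie-aware: flag every cell equal to its column maximum, then drop flagged rows
is_max    <- sapply(df_ties, function(x) x == max(x))  # 4 x 3 logical matrix
tie_aware <- df_ties[rowSums(is_max) == 0, ]

print(first_only)  # row 2 survives even though A is at its maximum there
print(tie_aware)   # only row 4 remains
```

Note that negative indexing with an empty index vector would return zero rows, so the which.max form assumes at least one row is removed, which always holds for a non-empty data frame.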