I have a data set like this:
A B C D E F G
12 1 0 0 0 0 0
Hey 0 1 0 0 0 0
No 0 0 0 0 0 1
Yes 0 0 0 0 1 0
I want to build a scenario: what happens if a column has 10% more YES values (yes = 1)? In my scenario, this should be done for three columns at the same time.
So, let's say the rows of interest are those where B=1 or C=1 or D=1. If one of these columns already equals 1, the row is fine as it is. But I want to randomly pick 10% of the remaining rows (where B=0 & C=0 & D=0) and give each of them a 1 in one of these columns. Of course, when a randomly chosen row gets its 1, all of its other columns should be set to 0 (except column A).
Sorry, I had a hard time explaining this problem. Hopefully it is clear.
The result should be something like this (it does not represent the 10%, since the example is too small):
A B C D E F G
12 1 0 0 0 0 0
Hey 0 1 0 0 0 0
No 0 0 0 0 0 1
Yes 0 1 0 0 0 0
where you can see that "Yes" is randomly assigned as C=1, and its original value is set back to 0.
I believe this is what you want:
data:
df1<-
structure(list(A = c("12", "Hey", "No", "Yes"), B = c(1L, 0L,
0L, 0L), C = c(0L, 1L, 0L, 0L), D = c(0L, 0L, 0L, 0L), E = c(0L,
0L, 0L, 0L), F = c(0L, 0L, 0L, 1L), G = c(0L, 0L, 1L, 0L)), row.names = c(NA,
-4L), class = "data.frame")
code:
m <- `rownames<-`(df1[,-1],df1[,1]) # make your life simple, add character col as rownames
percentage = .5 # choose any percentage you like from 0 to 1, .1 for 10%
amountOf1 = floor(percentage * ncol(m)) # get the amount of ones based on percentage
IND <- which(rowSums(m[,1:3]) == 0) # get those rows having B, C, D with 0
for(i in IND) {
m[i,] = sample(rep(1:0,c(amountOf1,ncol(m)-amountOf1)) )
}
result: (now 50% are 1 in rows where B,C,D is 0)
# B C D E F G
#12 1 0 0 0 0 0
#Hey 0 1 0 0 0 0
#No 1 0 0 0 1 1
#Yes 1 0 1 0 0 1
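The code above sets a share of the columns to 1 within each selected row. If the goal is instead the one described in the question (pick a share of the rows where B, C and D are all 0, give each of them a single 1 in one of those columns, and zero out the rest), a minimal sketch could look like the following. The names `target`, `pct`, `remaining`, `flip` and `df2` are made up for this illustration, and `pct` is set to 0.5 only so that the 4-row toy data actually changes; use 0.1 for 10%.

```r
# Toy data from the question (columns B..G are 0/1 indicators).
df1 <- data.frame(
  A = c("12", "Hey", "No", "Yes"),
  B = c(1L, 0L, 0L, 0L), C = c(0L, 1L, 0L, 0L), D = c(0L, 0L, 0L, 0L),
  E = c(0L, 0L, 0L, 0L), F = c(0L, 0L, 0L, 1L), G = c(0L, 0L, 1L, 0L)
)

set.seed(1)                  # assumption: fixed seed, only for reproducibility
target <- c("B", "C", "D")   # columns of interest
pct    <- 0.5                # assumption: 0.5 for the toy data; 0.1 would be 10%

df2 <- df1
remaining <- which(rowSums(df2[target]) == 0)   # rows where B = C = D = 0
n_flip    <- round(pct * length(remaining))     # number of rows to change
# sample.int() on the length avoids sample()'s special case for a single number
flip      <- remaining[sample.int(length(remaining), n_flip)]

for (i in flip) {
  df2[i, -1] <- 0L                  # reset every indicator in the row (all but A)
  df2[i, sample(target, 1)] <- 1L   # switch on one randomly chosen target column
}
df2
```

Which row and which column get changed depends on the seed, so no particular output is shown here.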
I have a dataset covering several diseases, with 0 indicating not having the disease and 1 indicating having it.
To illustrate with an example: I am interested in Disease A and in whether the people in the dataset have this disease on its own or as a result of another disease. Therefore I want to create a new variable "Type" with the values "NotDiseasedWithA", "Primary" and "Secondary". The diseases that can cause A are contained in a vector "SecondaryCauses":
SecondaryCauses = c("DiseaseB", "DiseaseD")
"NotDiseasedWithA" means that they do not have disease A.
"Primary" means that they have disease A but not any of the known diseases that can cause it.
"Secondary" means that they have disease A and a disease that probably caused it.
Sample data
ID DiseaseA DiseaseB DiseaseC DiseaseD DiseaseE
1 0 1 0 0 0
2 1 0 0 0 1
3 1 0 1 1 0
4 1 0 1 1 1
5 0 0 0 0 0
My question is:
How do I select the columns I am interested in? I have more than 20 columns that are not ordered. Therefore I created the vector.
How do I create the condition based on the content of the diseases I am interested in?
I tried something like the following, but this did not work:
DF %>% mutate(Type = ifelse(DiseaseA == 0, "NotDiseasedWithA", ifelse(sum(names(DF) %in% SecondaryCauses) > 0, "Secondary", "Primary")))
So in the end I want to have this result:
ID DiseaseA DiseaseB DiseaseC DiseaseD DiseaseE Type
1 0 1 0 0 0 NotDiseasedWithA
2 1 0 0 0 1 Primary
3 1 0 1 1 0 Secondary
4 1 0 1 1 1 Secondary
5 0 0 0 0 0 NotDiseasedWithA
Using data.table:
df <- structure(list(ID = 1:5, DiseaseA = c(0L, 1L, 1L, 1L, 0L), DiseaseB = c(1L,
0L, 0L, 0L, 0L), DiseaseC = c(0L, 0L, 1L, 1L, 0L), DiseaseD = c(0L,
0L, 1L, 1L, 0L), DiseaseE = c(0L, 1L, 0L, 1L, 0L)), row.names = c(NA,
-5L), class = c("data.frame"))
library(data.table)
setDT(df) # make it a data.table
SecondaryCauses = c("DiseaseB", "DiseaseD")
df[DiseaseA == 0, Type := "NotDiseasedWithA"][DiseaseA == 1, Type := ifelse(rowSums(.SD) > 0, "Secondary", "Primary"), .SDcols = SecondaryCauses]
df
# ID DiseaseA DiseaseB DiseaseC DiseaseD DiseaseE Type
# 1: 1 0 1 0 0 0 NotDiseasedWithA
# 2: 2 1 0 0 0 1 Primary
# 3: 3 1 0 1 1 0 Secondary
# 4: 4 1 0 1 1 1 Secondary
# 5: 5 0 0 0 0 0 NotDiseasedWithA
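For comparison, here is a minimal base R sketch of the same logic. The attempt in the question fails because `sum(names(DF) %in% SecondaryCauses)` counts how many column names match (a single number for the whole table), not the per-row values in those columns. The helper name `has_cause` is made up for this illustration.

```r
# Sample data from the question
DF <- data.frame(
  ID = 1:5,
  DiseaseA = c(0L, 1L, 1L, 1L, 0L), DiseaseB = c(1L, 0L, 0L, 0L, 0L),
  DiseaseC = c(0L, 0L, 1L, 1L, 0L), DiseaseD = c(0L, 0L, 1L, 1L, 0L),
  DiseaseE = c(0L, 1L, 0L, 1L, 0L)
)
SecondaryCauses <- c("DiseaseB", "DiseaseD")

# Select the columns by name, then test row by row whether any cause is present
has_cause <- rowSums(DF[SecondaryCauses]) > 0

DF$Type <- ifelse(DF$DiseaseA == 0, "NotDiseasedWithA",
                  ifelse(has_cause, "Secondary", "Primary"))
DF
```

Selecting with `DF[SecondaryCauses]` works regardless of where those columns sit among the 20+ unordered columns.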
I have a dataframe in R which looks like the one below.
a b c d e f
0 1 1 0 0 0
1 1 1 1 0 1
0 0 0 1 0 1
1 0 0 1 0 1
1 1 1 0 0 0
The dataset is big, spanning over 100 columns and 5000 rows, and contains only binary values (0's and 1's). I want to compute the overlap between every pair of columns in R, something like the table given below. This overlap dataframe will be a square matrix whose number of rows and columns both equal the number of columns in the first dataframe.
a b c d e f
a 3 2 2 2 0 2
b 2 3 3 1 0 1
c 2 3 3 1 0 1
d 2 1 1 3 0 3
e 0 0 0 0 0 0
f 2 1 1 3 0 3
Each cell of the second dataframe is populated by the number of cases where both row and column have 1 in the first dataframe.
I'm thinking of constructing an empty matrix like this:
df <- matrix(ncol = ncol(data), nrow = ncol(data))
colnames(df) <- names(data)
rownames(df) <- names(data)
.. and iterating over each cell of this matrix using an apply command reading the corresponding row name (say, x) and column name (say, y) and running a function like the one below.
summation <- function (x,y) (return (sum(data$x * data$y)))
The problem is that I can't find out the row name and column name from within an apply function. Any help will be appreciated.
Any approach more efficient than what I'm thinking of is more than welcome.
You are looking for crossprod
crossprod(as.matrix(df1))
# a b c d e f
#a 3 2 2 2 0 2
#b 2 3 3 1 0 1
#c 2 3 3 1 0 1
#d 2 1 1 3 0 3
#e 0 0 0 0 0 0
#f 2 1 1 3 0 3
data
df1 <- structure(list(a = c(0L, 1L, 0L, 1L, 1L), b = c(1L, 1L, 0L, 0L,
1L), c = c(1L, 1L, 0L, 0L, 1L), d = c(0L, 1L, 1L, 1L, 0L), e = c(0L,
0L, 0L, 0L, 0L), f = c(0L, 1L, 1L, 1L, 0L)), .Names = c("a",
"b", "c", "d", "e", "f"), class = "data.frame", row.names = c(NA,
-5L))
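To see why `crossprod` gives the overlap counts: for a 0/1 matrix X, `crossprod(X)` is t(X) %*% X, so cell (i, j) is the sum over rows of X[, i] * X[, j], which is exactly the number of rows where both columns are 1. A small sketch that checks this against the explicit pairwise loop the question had in mind (the names `X` and `manual` are made up for this illustration):

```r
df1 <- data.frame(
  a = c(0L, 1L, 0L, 1L, 1L), b = c(1L, 1L, 0L, 0L, 1L),
  c = c(1L, 1L, 0L, 0L, 1L), d = c(0L, 1L, 1L, 1L, 0L),
  e = c(0L, 0L, 0L, 0L, 0L), f = c(0L, 1L, 1L, 1L, 0L)
)
X <- as.matrix(df1)

# Explicit version: loop over every pair of column names
manual <- sapply(colnames(X), function(j)
  sapply(colnames(X), function(i) sum(X[, i] * X[, j])))

all(manual == crossprod(X))  # TRUE: both approaches agree
```

On a 5000 x 100 matrix the `crossprod` call should be much faster than the nested loop, since it is a single matrix product.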
I am trying to create a summary table and having a mental hang up. Essentially, what I think I want is a summaryBy statement getting colSums for the subsets for ALL columns except the factor to summarize on.
My data frame looks like this:
Cluster GO:0003677 GO:0003700 GO:0046872 GO:0008270 GO:0043565 GO:0005524
comp103680_c0 10 0 0 0 0 0 1
comp103947_c0 3 0 0 0 0 0 0
comp104660_c0 1 1 1 0 0 0 0
comp105255_c0 10 0 0 0 0 0 0
What I would like to do is get colSums for all columns after Cluster using Cluster as the grouping factor.
I have tried a bunch of things. The last was plyr's ddply:
> groupColumns = "Cluster"
> dataColumns = colnames(GO_matrix_MF[,2:ncol(GO_matrix_MF)])
> res = ddply(GO_matrix_MF, groupColumns, function(x) colSums(GO_matrix_MF[dataColumns]))
> head(res)
Cluster GO:0003677 GO:0003700 GO:0046872 GO:0008270 GO:0043565 GO:0005524 GO:0004674 GO:0045735
1 1 121 138 196 94 43 213 97 20
2 2 121 138 196 94 43 213 97 20
I am not sure what the return values represent, but they are not the per-cluster colSums.
Try:
> aggregate(.~Cluster, data=ddf, sum)
Cluster GO.0003677 GO.0003700 GO.0046872 GO.0008270 GO.0043565 GO.0005524
1 1 1 1 0 0 0 0
2 3 0 0 0 0 0 0
3 10 0 0 0 0 0 1
I think you are looking for something like this. I modified your data a bit. There are other options too.
# Modified data
foo <- structure(list(Cluster = c(10L, 3L, 1L, 10L), GO.0003677 = c(11L,
0L, 1L, 5L), GO.0003700 = c(0L, 0L, 1L, 0L), GO.0046872 = c(0L,
9L, 0L, 0L), GO.0008270 = c(0L, 0L, 0L, 0L), GO.0043565 = c(0L,
0L, 0L, 0L), GO.0005524 = c(1L, 0L, 0L, 0L)), .Names = c("Cluster",
"GO.0003677", "GO.0003700", "GO.0046872", "GO.0008270", "GO.0043565",
"GO.0005524"), class = "data.frame", row.names = c("comp103680_c0",
"comp103947_c0", "comp104660_c0", "comp105255_c0"))
library(dplyr)
foo %>%
group_by(Cluster) %>%
summarise_each(funs(sum))
# Cluster GO.0003677 GO.0003700 GO.0046872 GO.0008270 GO.0043565 GO.0005524
#1 1 1 1 0 0 0 0
#2 3 0 0 9 0 0 0
#3 10 16 0 0 0 0 1
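Two further notes, offered as a sketch rather than a definitive answer. First, the `ddply` call in the question returns the same numbers for every cluster because the anonymous function ignores its argument `x` and calls `colSums` on the whole `GO_matrix_MF`; using `x[dataColumns]` inside the function would sum within each group. Second, base R has `rowsum()`, which computes column sums by group directly. The example below reuses the modified `foo` data from above; `res` is a name made up for this illustration.

```r
foo <- data.frame(
  Cluster    = c(10L, 3L, 1L, 10L),
  GO.0003677 = c(11L, 0L, 1L, 5L), GO.0003700 = c(0L, 0L, 1L, 0L),
  GO.0046872 = c(0L, 9L, 0L, 0L),  GO.0008270 = c(0L, 0L, 0L, 0L),
  GO.0043565 = c(0L, 0L, 0L, 0L),  GO.0005524 = c(1L, 0L, 0L, 0L),
  row.names  = c("comp103680_c0", "comp103947_c0", "comp104660_c0", "comp105255_c0")
)

# Sum every column except Cluster, grouped by Cluster;
# the group labels become the row names of the result
res <- rowsum(foo[-1], group = foo$Cluster)
res
```

Unlike the `aggregate` and dplyr results, the grouping values end up in the row names rather than in a `Cluster` column.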
I have two matrices I want to sum based on their row and column names. The matrices will not necessarily have all rows and columns in common - some may be missing from either matrix.
For example, consider two matrices A and B:
A= B=
a b c d a c d e
v 1 1 1 0 v 0 0 0 1
w 1 1 0 1 w 0 0 1 0
x 1 0 1 1 y 0 1 0 0
y 0 1 1 1 z 1 0 0 0
Column e is missing from matrix A and column b is missing from matrix B.
Row z is missing from matrix A and row x is missing from matrix B.
The summed table I'm looking for is:
Sum=
a b c d e
v 1 1 1 0 1
w 1 1 0 2 0
x 1 0 1 1 na
y 0 1 2 1 0
z 1 na 0 0 0
The row and column ordering in the final matrix don't matter, as long as the matrix is complete, i.e. has all the data. Missing values don't have to be "Na", but could be "0" instead.
I'm not sure if there is a way to do this that doesn't involve for loops. Any help would be much appreciated.
My solution
I managed to do this easily by converting the matrices to dataframes, binding the dataframes by row and then casting the resulting dataframe back into a matrix. This looks like it works, but maybe someone could double check or let me know if there is a better way.
library(reshape2)
A_df=as.data.frame(as.table(A))
B_df=as.data.frame(as.table(B))
merged_df=rbind(A_df,B_df)
Summed_matrix=acast(merged_df, Var1 ~ Var2, sum)
merged_df looks like this:
Var1 Var2 Freq
1 v a 1
2 w a 1
3 x a 1
4 y a 0
5 v b 1
6 w b 1
etc...
Maybe you can try:
cAB <- union(colnames(A), colnames(B))
rAB <- union(rownames(A), rownames(B))
A1 <- matrix(0, ncol=length(cAB), nrow=length(rAB), dimnames=list(rAB, cAB))
B1 <- A1
indxA <- outer(rAB, cAB, FUN=paste) %in% outer(rownames(A), colnames(A), FUN=paste)
indxB <- outer(rAB, cAB, FUN=paste) %in% outer(rownames(B), colnames(B), FUN=paste)
A1[indxA] <- A
B1[indxB] <- B
A1+B1 #because it was mentioned to have `0` as missing values
# a b c d e
#v 1 1 1 0 1
#w 1 1 0 2 0
#x 1 0 1 1 0
#y 0 1 2 1 0
#z 1 0 0 0 0
If you want to get the NA as missing values
A1 <- matrix(NA, ncol=length(cAB), nrow=length(rAB), dimnames=list(rAB, cAB))
B1 <- A1
A1[indxA] <- A
B1[indxB] <- B
indxNA <- is.na(A1) & is.na(B1)
A1[is.na(A1)!= indxNA] <- 0
B1[is.na(B1)!= indxNA] <- 0
A1+B1
# a b c d e
#v 1 1 1 0 1
#w 1 1 0 2 0
#x 1 0 1 1 NA
#y 0 1 2 1 0
#z 1 NA 0 0 0
Or using reshape2
library(reshape2)
acast(rbind(melt(A), melt(B)), Var1~Var2, sum) #Inspired from the OP's idea
# a b c d e
#v 1 1 1 0 1
#w 1 1 0 2 0
#x 1 0 1 1 0
#y 0 1 2 1 0
#z 1 0 0 0 0
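As a further alternative, here is a minimal sketch that fills the padded matrices by row and column name instead of with a logical index. The `%in%` approach above relies on the rows and columns of A and B appearing in the same relative order as in the unions; indexing by name does not. The helper `pad` and the result name `S` are made up for this illustration.

```r
A <- matrix(c(1L, 1L, 1L, 0L, 1L, 1L, 0L, 1L, 1L, 0L, 1L, 1L, 0L, 1L, 1L, 1L),
            nrow = 4, dimnames = list(c("v", "w", "x", "y"), c("a", "b", "c", "d")))
B <- matrix(c(0L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 0L),
            nrow = 4, dimnames = list(c("v", "w", "y", "z"), c("a", "c", "d", "e")))

rAB <- union(rownames(A), rownames(B))
cAB <- union(colnames(A), colnames(B))

# Expand a matrix to the full set of row/column names, filling gaps with 0
pad <- function(M) {
  out <- matrix(0, nrow = length(rAB), ncol = length(cAB),
                dimnames = list(rAB, cAB))
  out[rownames(M), colnames(M)] <- M   # place values by name, not by position
  out
}

S <- pad(A) + pad(B)
S
```

This produces the zero-filled version of the sum; the NA variant would need the same extra step as shown earlier.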
data
A <- structure(c(1L, 1L, 1L, 0L, 1L, 1L, 0L, 1L, 1L, 0L, 1L, 1L, 0L,
1L, 1L, 1L), .Dim = c(4L, 4L), .Dimnames = list(c("v", "w", "x",
"y"), c("a", "b", "c", "d")))
B <- structure(c(0L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L,
0L, 0L, 0L), .Dim = c(4L, 4L), .Dimnames = list(c("v", "w", "y",
"z"), c("a", "c", "d", "e")))
I'm really a beginner in R so, sorry if my code shocks you guys.
My data resembles something like this:
a b c d e f g h i j
t1 0 0 0 0 3 0 0 0 0 0
t2 0 0 0 0 0 6 0 0 0 0
t3 0 0 0 0 0 0 0 0 0 8
t4 0 0 0 0 0 0 0 0 9 0
For each row, I'd like to find the column with the maximum value and then take the columns from 3 before it to 3 after it.
I wrote the following script to perform exactly that:
M<-c(1)
for (row in 1: length(D[,1])) {
max<-which.max(D[row,])
D<-D[,c(max-3,max-2,max-1,max,max+1,max+2,max+3)]
M<- cbind(M,D)
}
M<-M[,-1]
It would work, except when the maximum value is in a column near the beginning or end of a row (like rows t3 and t4 in the example above). In that case I'd like to get the 7 columns closest to the column with the maximum value, like this:
t1 0 0 0 3 0 0 0
t2 0 0 0 6 0 0 0
t3 0 0 0 0 0 0 8
t4 0 0 0 0 0 9 0
Help would be really appreciated!
dput() version of example data:
structure(list(a = c(0L, 0L, 0L, 0L), b = c(0L, 0L, 0L, 0L),
c = c(0L, 0L, 0L, 0L), d = c(0L, 0L, 0L, 0L), e = c(3L, 0L,
0L, 0L), f = c(0L, 6L, 0L, 0L), g = c(0L, 0L, 0L, 0L), h = c(0L,
0L, 0L, 0L), i = c(0L, 0L, 0L, 9L), j = c(0L, 0L, 8L, 0L)), .Names = c("a",
"b", "c", "d", "e", "f", "g", "h", "i", "j"), class = "data.frame",
row.names = c("t1", "t2", "t3", "t4"))
This should work nicely:
t(apply(D,
MARGIN = 1,
FUN = function(X) {
n <- which.max(X)
i <- seq(min(max(1, n-3), ncol(D)-6), len=7)
X[i]
}))
# [,1] [,2] [,3] [,4] [,5] [,6] [,7]
# t1 0 0 0 3 0 0 0
# t2 0 0 0 6 0 0 0
# t3 0 0 0 0 0 0 8
# t4 0 0 0 0 0 9 0
To test that the key column-selecting bit works as you'd like it to, you can try the following:
n <- 2
seq(min(max(1, n-3), ncol(D)-6), len=7)
n <- 10
seq(min(max(1, n-3), ncol(D)-6), len=7)
n <- 6
seq(min(max(1, n-3), ncol(D)-6), len=7)
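The same clamping logic can be wrapped in a small function so the window half-width is not hard-coded. This is only a sketch: `window_idx`, `p` (the number of columns) and `k` (the half-width) are names made up for this illustration.

```r
# Indices of a window of width 2*k + 1 around position n in 1..p,
# shifted inward when it would run past column 1 or column p.
window_idx <- function(n, p, k = 3) {
  width <- 2 * k + 1
  stopifnot(p >= width)                      # the window must fit at all
  start <- min(max(1, n - k), p - width + 1) # clamp the left edge on both sides
  seq(start, length.out = width)
}

window_idx(2, 10)   # 1 2 3 4 5 6 7   (clamped at the left edge)
window_idx(10, 10)  # 4 5 6 7 8 9 10  (clamped at the right edge)
window_idx(6, 10)   # 3 4 5 6 7 8 9   (centred on column 6)
```

With `k = 3` and `p = ncol(D)` this gives the same indices as the `seq(min(max(1, n-3), ncol(D)-6), len=7)` expression above.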