Match combinations of row values between 2 different data frames - r

I have a data.frame with 16 different combinations of 4 different cell markers
combinations_df
FITC Cy3 TX_RED Cy5
a 0 0 0 0
b 1 0 0 0
c 0 1 0 0
d 1 1 0 0
e 0 0 1 0
f 1 0 1 0
g 0 1 1 0
h 1 1 1 0
i 0 0 0 1
j 1 0 0 1
k 0 1 0 1
l 1 1 0 1
m 0 0 1 1
n 1 0 1 1
o 0 1 1 1
p 1 1 1 1
I have my "main" data.frame with 10 columns and thousands of rows.
> main_df
a b FITC d Cy3 f TX_RED h Cy5 j
1 0 1 1 1 1 0 1 1 1 1
2 0 1 0 1 1 0 1 0 1 1
3 1 1 0 0 0 1 1 0 0 0
4 0 1 1 1 1 0 1 1 1 1
5 0 0 0 0 0 0 0 0 0 0
....
I want to use all the possible 16 combinations from combinations_df to compare with each row of main_df. Then I want to create a new vector to later cbind to main_df as column 11.
sample output
> phenotype
[1] "g" "i" "a" "p" "g"
I thought about doing a while loop within a for loop checking each combinations_df row through each main_df row.
Sounds like it could work, but I have close to 1 000 000 rows in main_df, so I wanted to see if anybody had a better idea.
EDIT: I forgot to mention that I want to compare combinations_df only to columns 3,5,7,9 from main_df. They have the same name, but it might not be that obvious.
EDIT: Changin the sample data output, since no "t" should be present

The dplyr solution is outrageously simple. First you need to put phenotype in combinations_df as an explicit variable like this:
# phenotype FITC Cy3 TX_RED Cy5
#1 a 0 0 0 0
#2 b 1 0 0 0
#3 c 0 1 0 0
#4 d 1 1 0 0
# etc
dplyr lets you join on multiple variables, so from here it's a one-liner to look up the phenotypes.
library(dplyr)
left_join(main_df, combinations_df, by=c("FITC", "Cy3", "TX_RED", "Cy5"))
# a b FITC d Cy3 f TX_RED h Cy5 j phenotype
#1 0 1 1 1 1 0 1 1 1 1 p
#2 0 1 0 1 1 0 1 0 1 1 o
#3 1 1 0 0 0 1 1 0 0 0 e
#4 0 1 1 1 1 0 1 1 1 1 p
#5 0 0 0 0 0 0 0 0 0 0 a
I originally thought you'd have to concatenate columns with tidyr::unite but this was not the case.

Its not very elegant but this method works just fine. There are no loops in loops here so it should run just fine. Might trying to match using the dataframe rows and do away with the loops all together but this was just the fastest way I could figure it out. You might look at packages plyr or data.table. Very powerful packages for this kind of thing.
main_text=NULL
for(i in 1:length(main_df[,1])){
main_text[i]<-paste(main_df[i,3],main_df[i,5],main_df[i,7],main_df[i,9],sep="")
}
comb_text=NULL
for(i in 1:length(combinations_df[,1])){
comb_text[i]<-paste(combinations_df[i,1],combinations_df[i,2],combinations_df[i,3],combinations_df[i,4],sep="")
}
rownames(combinations_df)[match(main_text,comb_text)]

How about something like this? My results are different than yours as there is no "t" in the combination_df. You could do it without assigning a new column to if you wanted. This is mainly for illustrative purposes.
combination_df <- read.table("Documents/comb.txt.txt", header=T)
main_df <- read.table("Documents/main.txt", header=T)
main_df
combination_df
main_df$key <- do.call(paste0, main_df[,c(3,5,7,9)])
combination_df$key <- do.call(paste0, combination_df)
rownames(combination_df)[match(main_df$key, combination_df$key)]

Related

Create a new column based on several conditions

I want to create a new column based on some conditions imposed on several columns. For example, here is an example dataset:
a <- data.frame(x=c(1,0,1,0,0), y=c(0,0,0,0,0), z=c(1,1,0,0,0))
a
x y z
1 1 0 1
2 0 0 1
3 1 0 0
4 0 0 0
5 0 0 0
Specifically, if for any particular row 1 is present, then the new column returns 1. If all are 0, then the new column returns 0. So the dataset with the new column will be
x y z w
1 1 0 1 1
2 0 0 1 1
3 1 0 0 1
4 0 0 0 0
5 0 0 0 0
My initial thought was to use %in% but couldn't get the result I want. Thank you for your help!
If your data frame consists of binary values, e.g., only 0 and 1, you can try the code below with rowSums
a$w <- +(rowSums(a)>0)
such that
> a
x y z w
1 1 0 1 1
2 0 0 1 1
3 1 0 0 1
4 0 0 0 0
5 0 0 0 0
We can use rowMaxs from matrixStats
library(matrixStats)
a$w <- rowMaxs(as.matrix(a))
a$w
#[1] 1 1 1 0 0
You can find max of each row :
a$w <- do.call(pmax, a)
a
# x y z w
#1 1 0 1 1
#2 0 0 1 1
#3 1 0 0 1
#4 0 0 0 0
#5 0 0 0 0
which can also be done with apply :
a$w <- apply(a, 1, max)

Selecting specific columns from dataset

I have a dataset which looks this this:
A B X50_TT_1.0 X50_TT_1.1 X60_DD_2.0 X60_DD_2.1 X100_L2V_7.0 X100_L2V_7.1
3 1 1 0 0 1 1 0
6 3 0 1 0 1 0 1
2 3 1 0 0 1 1 0
10 5 0 1 1 0 1 0
0 0 1 0 1 0 0 1
I want to have new data frame (df) which only contains columns which ends with 1.1, 2.1 i.e.
df
X50_TT_1.1 X60_DD_2.1 X100_L2V_7.1
0 1 0
1 1 1
0 1 0
1 0 0
0 0 1
As here I only shows few columns but actually it contains more than 100 columns. Therefore, kindly provide the solution which can be applicable to as many columns dataset consists.
Thanks in advance.
I guess the pattern is, that the column ends on ".1" may you need to adapt it at that point.
My data I am using
original_data
A B X50_TT_1.0 X50_TT_1.1 X60_DD_2.0 X60_DD_2.1 X100_L2V_7.0 X100_L2V_7.1
1 3 1 1 0 0 1 1 0
Actually this is for everything ending with "1"
df <- original_data[which(grepl(".1$", names(original_data)))]
For ending with ".1" you have to use:
df <- original_data[which(grepl("\\.1$", names(original_data)))]
For original_data both gave me the same result:
df
X50_TT_1.1 X60_DD_2.1 X100_L2V_7.1
1 0 1 0

calculating sum of all against all columns with matching row count

I have a df with several columns having values 0 or 1. Something like:
a b c d e
1 0 0 0 0
0 1 0 1 0
0 1 0 1 0
1 0 1 0 1
I would like to create a 5 by 5 matrix showing total count if columns have 1 in same row. I only want to consider 1's and in case of diagonal it would automatically reflect total row in that column with 1. Output something like:
a b c d e
a 2 0 1 0 1
b 0 2 0 2 0
c 1 0 1 0 1
d 0 2 0 2 0
e 1 0 1 0 1
Thanks.
Sudhir
Convert to matrix and take cross product:
m <- as.matrix(d)
crossprod(m,m)

reverse lexicographic order after using expand.grid

I'm trying to generate the following matrix, based on a multinomial framework. For example, if I had three columns, I'd get:
0 0 0
1 0 0
0 1 0
0 0 1
1 1 0
1 0 1
0 1 1
1 1 1
But, I want many more columns. I know I can use expand.grid, like:
u <- list(0:1)
expand.grid(rep(u,3))
But, it returns what I want in the wrong order:
0 0 0
1 0 0
0 1 0
1 1 0
0 0 1
1 0 1
0 1 1
1 1 1
Any ideas? Thanks.
You can reorder your rows to match your expected output:
u <- list(0:1)
g <- expand.grid(rep(u,3))
g <- g[order(rowSums(g)), ]

How to exclude cases that do not repeat X times in R?

I have a long format unbalanced longitudinal data. I would like to exclude all the cases that do not contain complete information. By that I mean all cases that do not repeat 8 times. Someone can help me finding a solution?
Below an example: I have three subjects {A, B, and C}. I have 8 information for A and B, but only 2 for C. How can I delete rows in which C is present based on the information it has less than 8 repeated measurements?
temp = scan()
A 1 1 1 0
A 1 1 0 1
A 1 0 0 0
A 1 1 1 1
A 0 1 0 0
A 1 1 1 0
A 1 1 0 1
A 1 0 0 0
B 1 1 1 0
B 1 1 0 1
B 1 0 0 0
B 1 1 1 1
B 0 1 0 0
B 1 1 1 0
B 1 1 0 1
B 1 0 0 0
C 1 1 1 1
C 0 1 0 0
Any help?
Assuming your variable names are V1, V2... and so on, here's one approach:
temp[temp$V1 %in% names(which(table(temp$V1) == 8)), ]
The table(temp$V1) == 8 matches the values in the V1 column that have exactly 8 cases. The names(which(... part creates a basic character vector that we can match using %in%.
And another:
temp[ave(as.character(temp$V1), temp$V1, FUN = length) == "8", ]
Here's another approach:
temp <- read.table(text="
A 1 1 1 0
A 1 1 0 1
A 1 0 0 0
A 1 1 1 1
A 0 1 0 0
A 1 1 1 0
A 1 1 0 1
A 1 0 0 0
B 1 1 1 0
B 1 1 0 1
B 1 0 0 0
B 1 1 1 1
B 0 1 0 0
B 1 1 1 0
B 1 1 0 1
B 1 0 0 0
C 1 1 1 1
C 0 1 0 0", header=FALSE)
do.call(rbind,
Filter(function(subgroup) nrow(subgroup) == 8,
split(temp, temp[[1]])))
split breaks the data.frame up by its first column, then Filter drops the subgroups that don't have 8 rows. Finally, do.call(rbind, ...) collapses the remaining subgroups back into a single data.frame.
If the first column of temp is character (rather than factor, which you can verify with str(temp)) and the rows are ordered by subgroup, you could also do:
with(rle(temp[[1]]), temp[rep(lengths==8, times=lengths), ])

Resources