reverse lexicographic order after using expand.grid - r

I'm trying to generate the following matrix, based on a multinomial framework. For example, if I had three columns, I'd get:
0 0 0
1 0 0
0 1 0
0 0 1
1 1 0
1 0 1
0 1 1
1 1 1
But, I want many more columns. I know I can use expand.grid, like:
u <- list(0:1)
expand.grid(rep(u,3))
But, it returns what I want in the wrong order:
0 0 0
1 0 0
0 1 0
1 1 0
0 0 1
1 0 1
0 1 1
1 1 1
Any ideas? Thanks.

You can reorder your rows to match your expected output:
u <- list(0:1)
g <- expand.grid(rep(u,3))
g <- g[order(rowSums(g)), ]
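A minimal sketch of the same idea wrapped in a function for any number of columns. The name binary_rows is hypothetical, not from the original answer. It relies on order() leaving ties in their original order, so rows with the same number of 1s stay in the sequence expand.grid produced. For three columns this reproduces the matrix in the question; whether the order within each group is what you need for more columns is worth checking on a small case first.

```r
# Hypothetical helper: all 0/1 rows for n columns, grouped by the number of 1s.
binary_rows <- function(n) {
  g <- expand.grid(rep(list(0:1), n))
  # order() leaves tied rows in their original order, so rows with
  # equal sums keep the sequence that expand.grid generated.
  g <- g[order(rowSums(g)), ]
  rownames(g) <- NULL  # renumber the rows 1..2^n
  g
}

g3 <- binary_rows(3)
g3
```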

Related

Create a new column based on several conditions

I want to create a new column based on some conditions imposed on several columns. For example, here is an example dataset:
a <- data.frame(x=c(1,0,1,0,0), y=c(0,0,0,0,0), z=c(1,1,0,0,0))
a
x y z
1 1 0 1
2 0 0 1
3 1 0 0
4 0 0 0
5 0 0 0
Specifically, if for any particular row 1 is present, then the new column returns 1. If all are 0, then the new column returns 0. So the dataset with the new column will be
x y z w
1 1 0 1 1
2 0 0 1 1
3 1 0 0 1
4 0 0 0 0
5 0 0 0 0
My initial thought was to use %in% but couldn't get the result I want. Thank you for your help!
If your data frame contains only binary values (0 and 1), you can use rowSums:
a$w <- +(rowSums(a)>0)
such that
> a
x y z w
1 1 0 1 1
2 0 0 1 1
3 1 0 0 1
4 0 0 0 0
5 0 0 0 0
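If the columns might hold values other than 0 and 1, a small variation is to count how many entries in each row equal 1 instead of summing the raw values. This is a sketch under that assumption, using the example data from the question; as.integer() does the same logical-to-integer conversion as the unary + above.

```r
a <- data.frame(x = c(1, 0, 1, 0, 0), y = c(0, 0, 0, 0, 0), z = c(1, 1, 0, 0, 0))

# a == 1 is a logical matrix; rowSums() counts the TRUE entries per row.
a$w <- as.integer(rowSums(a == 1) > 0)
a$w
```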
We can use rowMaxs from matrixStats
library(matrixStats)
a$w <- rowMaxs(as.matrix(a))
a$w
#[1] 1 1 1 0 0
You can find the max of each row:
a$w <- do.call(pmax, a)
a
# x y z w
#1 1 0 1 1
#2 0 0 1 1
#3 1 0 0 1
#4 0 0 0 0
#5 0 0 0 0
which can also be done with apply :
a$w <- apply(a, 1, max)

R - Creating a new column within a data frame when two or more columns are a match in a row

I'm currently stuck on a part of my code that feels intuitive but I can't figure a way to do it. I have a very big data frame (nrows = 34036, ncol = 43) in which I want to create a continuous sequence of the variables where the value of the row is 1 (without having multiple columns with 1). It consists of only zeros and ones similar to the following:
A B C D
1 0 0 0
0 0 0 1
0 0 0 1
0 0 0 0
0 0 0 0
1 0 1 0
1 0 1 0
0 1 0 0
0 1 0 0
1 0 0 1
I was able to remove the zeroes using:
#find the sum of each row
placeholderData <- transform(placeholderData, sum=rowSums(placeholderData))
placeholderData <- placeholderData[!(placeholderData$sum <= 0),]
And the data frame now looks like:
A B C D sum
1 0 0 0 1
0 0 0 1 1
0 0 0 1 1
1 0 1 0 2
1 0 1 0 2
0 1 0 0 1
0 1 0 0 1
1 0 0 1 2
My main problem comes when there are two or more 1's in a row. To try to solve this, I used the following code to list, for each row, the names of the columns that contain a 1:
placeholderData$Matches <- lapply(apply(placeholderData == 1, 1, which), names)
Which added the following column to the data frame:
A B C D sum Matches
1 0 0 0 1 A
0 0 0 1 1 D
0 0 0 1 1 D
1 0 1 0 2 c("A","C")
1 0 1 0 2 c("A","C")
0 1 0 0 1 B
0 1 0 0 1 B
1 0 0 1 2 c("A", "D")
I added the Matches column as an approach to solve the problem, but I'm not sure how would I do it without using a lot of logical operators (I don't know what columns have matches or not). What I would like to do is to aggregate the rows that have more than (or equal to) two 1's into a new column, to be able to have a data frame like this:
A B C D AC AD sum Matches
1 0 0 0 0 0 1 A
0 0 0 1 0 0 1 D
0 0 0 1 0 0 1 D
0 0 0 0 1 0 1 c("A","C")
0 0 0 0 1 0 1 c("A","C")
0 1 0 0 0 0 1 B
0 1 0 0 0 0 1 B
0 0 0 0 0 1 1 c("A", "D")
Then, I would be able to use my code as normal (It works just fine when there are no repeated values in rows). I tried searching to find similar questions, but I'm not sure if I was even asking the right question. I was wondering if anyone could provide some help or some ideas that I could try.
Thank you very much!
This seems a lot like making dummy variables, so I would use the model.matrix function commonly used for dummy variables (one-hot encoding):
m = read.table(header = TRUE, text = "A B C D
1 0 0 0
0 0 0 1
0 0 0 1
0 0 0 0
0 0 0 0
1 0 1 0
1 0 1 0
0 1 0 0
0 1 0 0
1 0 0 1")
m = m[rowSums(m) > 0, ]
d = factor(sapply(apply(m == 1, 1, which), function(x) paste(names(m)[x], collapse = "")))
result = data.frame(model.matrix(~ d + 0))
names(result) = levels(d)
# A AC AD B D
# 1 1 0 0 0 0
# 2 0 0 0 0 1
# 3 0 0 0 0 1
# 4 0 1 0 0 0
# 5 0 1 0 0 0
# 6 0 0 0 1 0
# 7 0 0 0 1 0
# 8 0 0 1 0 0
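One caveat with the labelling step: apply(m == 1, 1, which) returns a list only when rows have different numbers of 1s. A sketch of a variant that builds each label inside apply(), so it does not depend on that, is shown here on a small made-up data frame (the data and the name labels are assumptions for illustration):

```r
# Made-up example data, not from the question.
m <- data.frame(A = c(1, 0, 1), B = c(0, 1, 0), C = c(0, 0, 1))

# For each row, paste together the names of the columns that equal 1.
labels <- apply(m == 1, 1, function(r) paste(names(m)[r], collapse = ""))

d <- factor(labels)
result <- as.data.frame(model.matrix(~ d + 0))  # one indicator column per label
names(result) <- levels(d)
result
```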

Match combinations of row values between 2 different data frames

I have a data.frame with 16 different combinations of 4 different cell markers
combinations_df
FITC Cy3 TX_RED Cy5
a 0 0 0 0
b 1 0 0 0
c 0 1 0 0
d 1 1 0 0
e 0 0 1 0
f 1 0 1 0
g 0 1 1 0
h 1 1 1 0
i 0 0 0 1
j 1 0 0 1
k 0 1 0 1
l 1 1 0 1
m 0 0 1 1
n 1 0 1 1
o 0 1 1 1
p 1 1 1 1
I have my "main" data.frame with 10 columns and thousands of rows.
> main_df
a b FITC d Cy3 f TX_RED h Cy5 j
1 0 1 1 1 1 0 1 1 1 1
2 0 1 0 1 1 0 1 0 1 1
3 1 1 0 0 0 1 1 0 0 0
4 0 1 1 1 1 0 1 1 1 1
5 0 0 0 0 0 0 0 0 0 0
....
I want to use all the possible 16 combinations from combinations_df to compare with each row of main_df. Then I want to create a new vector to later cbind to main_df as column 11.
sample output
> phenotype
[1] "g" "i" "a" "p" "g"
I thought about doing a while loop within a for loop checking each combinations_df row through each main_df row.
Sounds like it could work, but I have close to 1 000 000 rows in main_df, so I wanted to see if anybody had a better idea.
EDIT: I forgot to mention that I want to compare combinations_df only to columns 3,5,7,9 from main_df. They have the same name, but it might not be that obvious.
EDIT: Changing the sample data output, since no "t" should be present
The dplyr solution is outrageously simple. First you need to put phenotype in combinations_df as an explicit variable like this:
# phenotype FITC Cy3 TX_RED Cy5
#1 a 0 0 0 0
#2 b 1 0 0 0
#3 c 0 1 0 0
#4 d 1 1 0 0
# etc
dplyr lets you join on multiple variables, so from here it's a one-liner to look up the phenotypes.
library(dplyr)
left_join(main_df, combinations_df, by=c("FITC", "Cy3", "TX_RED", "Cy5"))
# a b FITC d Cy3 f TX_RED h Cy5 j phenotype
#1 0 1 1 1 1 0 1 1 1 1 p
#2 0 1 0 1 1 0 1 0 1 1 o
#3 1 1 0 0 0 1 1 0 0 0 e
#4 0 1 1 1 1 0 1 1 1 1 p
#5 0 0 0 0 0 0 0 0 0 0 a
I originally thought you'd have to concatenate columns with tidyr::unite but this was not the case.
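A self-contained sketch of both steps, assuming the phenotype labels are stored as the row names of combinations_df as in the question. The name lookup and the three-row main_df are made up for illustration.

```r
library(dplyr)

# Rebuild the 16 marker combinations; expand.grid varies FITC fastest,
# which matches the a..p ordering shown in the question.
combinations_df <- expand.grid(FITC = 0:1, Cy3 = 0:1, TX_RED = 0:1, Cy5 = 0:1)
rownames(combinations_df) <- letters[1:16]

# Turn the row names into an explicit phenotype column.
lookup <- data.frame(phenotype = rownames(combinations_df), combinations_df,
                     row.names = NULL, stringsAsFactors = FALSE)

# Made-up stand-in for main_df with one extra, unrelated column.
main_df <- data.frame(b = c(1L, 1L, 1L),
                      FITC = c(1L, 0L, 0L), Cy3 = c(1L, 1L, 0L),
                      TX_RED = c(1L, 1L, 1L), Cy5 = c(1L, 1L, 0L))

joined <- left_join(main_df, lookup, by = c("FITC", "Cy3", "TX_RED", "Cy5"))
joined$phenotype
```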
It's not very elegant, but this method works. There are no nested loops here, so it should run fine. You might try matching on the data frame rows directly and do away with the loops altogether, but this was the fastest way I could figure out. You might also look at the plyr or data.table packages, which are very powerful for this kind of thing.
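# Build one text key per row of main_df from the four marker columns
main_text <- character(nrow(main_df))
for (i in seq_len(nrow(main_df))) {
  main_text[i] <- paste(main_df[i, 3], main_df[i, 5], main_df[i, 7], main_df[i, 9], sep = "")
}
# Build the same kind of key for each combination
comb_text <- character(nrow(combinations_df))
for (i in seq_len(nrow(combinations_df))) {
  comb_text[i] <- paste(combinations_df[i, 1], combinations_df[i, 2], combinations_df[i, 3], combinations_df[i, 4], sep = "")
}
# Look up each row's key among the combination keys
rownames(combinations_df)[match(main_text, comb_text)]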
How about something like this? My results are different from yours because there is no "t" in combination_df. You could do it without adding a new column if you wanted; the key column is mainly for illustrative purposes.
combination_df <- read.table("Documents/comb.txt.txt", header=TRUE)
main_df <- read.table("Documents/main.txt", header=TRUE)
main_df
combination_df
main_df$key <- do.call(paste0, main_df[,c(3,5,7,9)])
combination_df$key <- do.call(paste0, combination_df)
rownames(combination_df)[match(main_df$key, combination_df$key)]
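A self-contained sketch of the same key-and-match idea that does not read from files and selects the marker columns by name rather than by position. The data and the names markers, main_key and comb_key are made up for illustration.

```r
# Rebuild the 16 combinations with phenotype letters as row names.
combination_df <- expand.grid(FITC = 0:1, Cy3 = 0:1, TX_RED = 0:1, Cy5 = 0:1)
rownames(combination_df) <- letters[1:16]

# Made-up stand-in for main_df with one extra, unrelated column.
main_df <- data.frame(a = c(0L, 1L, 0L),
                      FITC = c(1L, 0L, 0L), Cy3 = c(1L, 0L, 0L),
                      TX_RED = c(1L, 1L, 0L), Cy5 = c(1L, 0L, 0L))

markers <- c("FITC", "Cy3", "TX_RED", "Cy5")

# Collapse the four marker values of each row into one string such as "0010".
main_key <- do.call(paste0, unname(as.list(main_df[markers])))
comb_key <- do.call(paste0, unname(as.list(combination_df[markers])))

phenotype <- rownames(combination_df)[match(main_key, comb_key)]
phenotype
```

Because paste0 and match are vectorised, this avoids row-by-row loops, which matters with around a million rows.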

Selecting specific columns from dataset

I have a dataset which looks like this:
A B X50_TT_1.0 X50_TT_1.1 X60_DD_2.0 X60_DD_2.1 X100_L2V_7.0 X100_L2V_7.1
3 1 1 0 0 1 1 0
6 3 0 1 0 1 0 1
2 3 1 0 0 1 1 0
10 5 0 1 1 0 1 0
0 0 1 0 1 0 0 1
I want a new data frame (df) containing only the columns whose names end in 1.1, 2.1, and so on, i.e.
df
X50_TT_1.1 X60_DD_2.1 X100_L2V_7.1
0 1 0
1 1 1
0 1 0
1 0 0
0 0 1
I have shown only a few columns here, but the actual dataset contains more than 100, so I need a solution that works for any number of columns.
Thanks in advance.
I guess the pattern is that the column name ends in ".1"; you may need to adapt it if that is not the case.
My data I am using
original_data
A B X50_TT_1.0 X50_TT_1.1 X60_DD_2.0 X60_DD_2.1 X100_L2V_7.0 X100_L2V_7.1
1 3 1 1 0 0 1 1 0
Note that an unescaped "." matches any character, so this pattern selects every name ending in "1":
df <- original_data[which(grepl(".1$", names(original_data)))]
To match only names ending in a literal ".1", escape the dot:
df <- original_data[which(grepl("\\.1$", names(original_data)))]
For original_data both gave me the same result:
df
X50_TT_1.1 X60_DD_2.1 X100_L2V_7.1
1 0 1 0
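The two patterns only differ on names that end in "1" without a dot before it. A small sketch with made-up column names (nms is hypothetical) shows the difference, along with base R's endsWith() as an alternative that treats the suffix as plain text rather than a regular expression:

```r
# Made-up names; "B1" and "X_21" end in 1 but not in ".1".
nms <- c("A", "B1", "X50_TT_1.0", "X50_TT_1.1", "X_21")

any_char     <- nms[grepl(".1$", nms)]    # "." matches any single character
literal_dot  <- nms[grepl("\\.1$", nms)]  # "\\." matches only a real dot
fixed_suffix <- nms[endsWith(nms, ".1")]  # no regular expression involved

any_char
literal_dot
fixed_suffix
```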

Permutation position of numbers in R

I'm looking for a function in R which can do the permutation. For example, I have a vector with five 1 and ten 0 like this:
> status=c(rep(1,5),rep(0,10))
> status
[1] 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
Now I'd like to randomly permute the positions of these numbers while keeping the same number of 0s and 1s in the vector, to get a new series of numbers, for example something like this:
1 1 0 1 0 1 0 0 0 0 0 1 0 0 0
or
1 0 0 0 0 0 0 1 1 0 0 1 0 1 0
I found the function sample() can help us to sample, but the number of 1 and 0 is not the same each time. Do you know how can I do this with R? Thanks in advance.
We can use sample
sample(status)
#[1] 1 0 0 1 0 0 1 0 0 0 0 1 0 1 0
sample(status)
#[1] 0 0 0 0 1 1 0 0 1 1 0 0 0 1 0
When sample is called with only the vector, it returns a random permutation of all its elements, so the count of each unique value stays the same:
colSums(replicate(5, sample(status)))
#[1] 5 5 5 5 5
That is, every permutation contains five 1s, and therefore ten 0s.
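A short sketch that checks this directly. The seed value is arbitrary and only there to make the run repeatable. One possible reason for seeing varying counts, which is an assumption about the question rather than something it states, is sampling with replace = TRUE, since that draws each position independently.

```r
status <- c(rep(1, 5), rep(0, 10))

set.seed(42)            # arbitrary seed, only for reproducibility
perm <- sample(status)  # permutation: same elements, new order

sum(perm)     # number of 1s, always 5
length(perm)  # always 15

# Sampling with replacement does NOT guarantee five 1s.
with_repl <- sample(status, replace = TRUE)
```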
