R spread columns with a specific pattern

Got a data.frame with a column like this:
Column_1
AAA
B
BBB
AAA_FACE
CCC
BBB_AAA
I want to spread the column into new columns, not for all my unique values (that would give me far too many columns), but only for the values containing a specific pattern: "AAA".
After spreading the values, I want to make them binary, so ideally my new data.frame looks like this:
AAA AAA_FACE BBB_AAA
1 0 0
0 0 0
0 0 0
0 1 0
0 0 0
0 0 1
I tried tidyr's spread() function, but it spread the data into a great many columns instead of only the columns matching the 'AAA' pattern.

One option with tidyverse would be
library(tidyverse)
df1 %>%
mutate(i1 = as.integer(str_detect(Column_1, "AAA")),
rn = row_number()) %>%
spread(Column_1, i1, fill = 0) %>%
select(matches("AAA"))
# AAA AAA_FACE BBB_AAA
#1 1 0 0
#2 0 0 0
#3 0 0 0
#4 0 1 0
#5 0 0 0
#6 0 0 1
It can be made a bit more efficient by replacing the other values with NA and then doing the spread
df1 %>%
mutate(i1 = as.integer(str_detect(Column_1, "AAA")),
Column_1 = replace(Column_1, !i1, NA),
rn = row_number()) %>%
spread(Column_1, i1, fill = 0) %>%
select(matches("AAA"))

Using base R code:
Your data
db<-data.frame(Column_1=c("AAA","B","BBB","AAA_FACE","CCC","BBB_AAA"))
Identify "AAA" pattern
AAA_names<-as.character(db[grep("AAA",db$Column_1),"Column_1"])
Output dataframe creation:
out<-data.frame(lapply(AAA_names, f<-function(x,y){ return(x == y) }, y=as.character(db$Column_1)))
colnames(out)<-AAA_names
out[,AAA_names] <- lapply(out[,AAA_names], as.numeric)
Your output
out
AAA AAA_FACE BBB_AAA
1 1 0 0
2 0 0 0
3 0 0 0
4 0 1 0
5 0 0 0
6 0 0 1
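A more compact base R variant of the same idea, as a rough sketch: build one indicator column per matching value with sapply(). The names keep and out are only illustrative, and the unique() call is an added assumption to guard against repeated values in the column.

```r
# Sample data from the question
db <- data.frame(Column_1 = c("AAA", "B", "BBB", "AAA_FACE", "CCC", "BBB_AAA"),
                 stringsAsFactors = FALSE)

# Distinct values that contain the literal pattern "AAA"
keep <- unique(grep("AAA", db$Column_1, value = TRUE, fixed = TRUE))

# One 0/1 column per kept value; sapply() names the columns after `keep`
out <- as.data.frame(sapply(keep, function(k) as.integer(db$Column_1 == k)))
out
```

Rows whose value does not contain the pattern simply end up as all zeros, so no filtering step is needed afterwards.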

Related

How to select columns whose elements are all 0 in R?

I am struggling to select all columns whose elements are all 0. Here is a simple example:
df <- data.frame(A=c(1,0,1,0),B=c(0,0,0,0),C=c(1,1,0,0),D=c(0,0,0,0))
df
A B C D
1 1 0 1 0
2 0 0 1 0
3 1 0 0 0
4 0 0 0 0
I want to select columns B and D. I know that dplyr::select allows selecting by column names. But how do I select columns by their values? Something like select(across(everything(), ~.x == 0)), but across() cannot be used in select().
We can use where inside the select statement to find the columns that have a sum of 0.
library(dplyr)
df %>%
select(where(~sum(.) == 0))
Or you could use colSums in base R:
df[,colSums(df) == 0]
Output
B D
1 0 0
2 0 0
3 0 0
4 0 0
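One caveat worth sketching: a sum of 0 is not quite the same as "all elements are 0", because positive and negative values can cancel. The example below is a minimal base R illustration of a stricter check; column E is a hypothetical addition to the question's data, and all_zero and zero_cols are illustrative names.

```r
df <- data.frame(A = c(1, 0, 1, 0), B = c(0, 0, 0, 0),
                 C = c(1, 1, 0, 0), D = c(0, 0, 0, 0),
                 E = c(1, -1, 0, 0))  # hypothetical column: sums to 0 but is not all zero

# TRUE only for columns in which every element equals 0
all_zero <- vapply(df, function(col) all(col == 0), logical(1))

# drop = FALSE keeps a data.frame even if only one column qualifies
zero_cols <- df[, all_zero, drop = FALSE]
names(zero_cols)
```

With non-negative data such as the original example, the sum-based versions give the same answer; the element-wise test only matters when negative values are possible.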

A problem with ifelse according to two variables from different dataframes in R

I have two data frames. The first one (A) contain information about GOALS, and the second one (B) contains the specific information about the IDs which had that GOAL:
> A
GOAL
1 A116642173
2 A116642174
3 A116642175
4 A116642176
5 A116642178
6 A116642181
> B
ID GOAL
1 1873 A116433509
2 478 A116642178
3 2165 A116192937
4 165 A116192937
5 313 A116433701
6 475 A116367456
I would like to create new columns in one of these according to the other data frame. So, first I create additional columns:
> idkids=c(313,475,165,478,1873,2165)
> ids<-c(idkids)
> A[ ,paste0(ids)]<-0
> A
GOAL 313 475 165 478 1873 2165
1 A116642173 0 0 0 0 0 0
2 A116642174 0 0 0 0 0 0
3 A116642175 0 0 0 0 0 0
4 A116642176 0 0 0 0 0 0
5 A116642178 0 0 0 0 0 0
6 A116642181 0 0 0 0 0 0
I tried to use ifelse to find the GOAL for a specific ID, but without success. I have tried to do this in two ways:
for (i in 1:kids){
A[ ,i+1]<-ifelse(A[ ,i+1]%in%B$ID,"",
ifelse(A$GOAL%in%B$GOAL, 1, 0))
}
for (i in 1:kids){
A[ ,i+1]<-ifelse(A[,i+1]%in%B$ID & A$GOAL%in%B$GOAL,1,0)
}
But my code didn't recognize the specific ID and didn't give me 1 (TRUE) or 0 (FALSE). It gives me 0 for all the columns. Can anyone help me, please?
Here is one method to reshape the 'B' data into 'wide' and then do a join
library(dplyr)
library(tidyr)
pivot_wider(B, names_from = ID, values_from = ID, values_fn = length,
values_fill = 0) %>%
right_join(A)
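For comparison, here is a minimal base R sketch of the same lookup without reshaping, assuming the data frames shown in the question. The name flags is illustrative; each column asks "does B link this ID to this GOAL?".

```r
A <- data.frame(GOAL = c("A116642173", "A116642174", "A116642175",
                         "A116642176", "A116642178", "A116642181"))
B <- data.frame(ID = c(1873, 478, 2165, 165, 313, 475),
                GOAL = c("A116433509", "A116642178", "A116192937",
                         "A116192937", "A116433701", "A116367456"))
ids <- c(313, 475, 165, 478, 1873, 2165)

# For each ID, flag the goals in A that B associates with that ID
flags <- sapply(ids, function(id) as.integer(A$GOAL %in% B$GOAL[B$ID == id]))
colnames(flags) <- as.character(ids)

result <- cbind(A, flags)
result
```

With this sample data only ID 478 has a GOAL that also appears in A, so a single cell is 1 and the rest are 0.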

Extract common values pairs for multiple dataframes to create a new binary dataframe based on them

I have 3 dataframes
Drug<-c("ab","bc","cd","ef","gh")
Target<-c("qwewr","saff","cxzcc","sadda","sadd")
fileA<-data.frame(Drug,Target)
Drug<-c("ab","bc","cdD","efc","ghg","hj")
Target<-c("qwewr","saff","cxzccf","saddav","sadd","bn")
fileB<-data.frame(Drug,Target)
Drug<-c("abB","bcv","cdD","efc")
Target<-c("qwewrm","saff","cxzccfh","saddav")
fileC<-data.frame(Drug,Target)
As you can see each one contains a pair "Drug"-"Target". Every dataframe contains only unique pairs. But you can find exactly the same pair in the other dataframes. What I want to achieve is to create a new dataframe which will extract all the unique pairs in the first column and then in the other 3 columns will have the fileA, fileB and fileC which will be filled with 1 if the pair exists and 0 if the pair does not exist. Something like:
Pairs fileA fileB fileC
1: abqwewr 1 1 1
2: bcsaff 1 1 1
3: cdcxzcc 1 1 1
4: efsadda 1 1 1
5: ghsadd 1 1 0
6: cdDcxzccf 0 0 0
7: efcsaddav 0 0 0
8: ghgsadd 0 0 0
9: hjbn 0 0 0
10: abBqwewrm 0 0 0
11: bcvsaff 0 0 0
12: cdDcxzccfh 0 0 0
But here the dataframe is not correct since in the first column there is only the drug name and also each row should have had at least one 1.
My method:
# Create composite dataset by combining all files
compositeDataD <- rbind(fileA,fileB,fileC)
# Get unique (drug, target) pairs
# Connect Drug Names and Target Gene Symbols into one vector of pairs
compositeDataD <- na.omit(compositeDataD)
DrugTargetPairsD <- paste(compositeDataD$Drug,compositeDataD$Target,sep="")
uniquePairsD<-unique(DrugTargetPairsD)
PairsA <- DrugTargetPairsD[1:nrow(na.omit(fileA))]
PairsB <- DrugTargetPairsD[1:nrow(na.omit(fileB))]
PairsC <- DrugTargetPairsD[1:nrow(na.omit(fileC))]
# Create binary matrix for unique (drug, target) pairs
binaryA <- as.numeric(uniquePairsD %in% PairsA) # This function returns a binary value for each unique (Drug, Target) Pair compared with the content of file1
binaryB <- as.numeric(uniquePairsD %in% PairsB)
binaryC <- as.numeric(uniquePairsD %in% PairsC)
table33 <- data.table(Pairs=uniquePairsD,
fileA=binaryA,fileB=binaryB,
fileC=binaryC)
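One likely reason the attempt above goes wrong: PairsA, PairsB and PairsC are all sliced from position 1 of the combined vector, so each of them mostly contains the pairs of fileA. A hedged sketch of the same approach with the pairs built per file instead (pairs_of is a hypothetical helper, and a plain data.frame is used in place of data.table):

```r
fileA <- data.frame(Drug = c("ab", "bc", "cd", "ef", "gh"),
                    Target = c("qwewr", "saff", "cxzcc", "sadda", "sadd"))
fileB <- data.frame(Drug = c("ab", "bc", "cdD", "efc", "ghg", "hj"),
                    Target = c("qwewr", "saff", "cxzccf", "saddav", "sadd", "bn"))
fileC <- data.frame(Drug = c("abB", "bcv", "cdD", "efc"),
                    Target = c("qwewrm", "saff", "cxzccfh", "saddav"))

# Hypothetical helper: paste Drug and Target of one file into pair strings
pairs_of <- function(d) paste0(d$Drug, d$Target)

PairsA <- pairs_of(na.omit(fileA))
PairsB <- pairs_of(na.omit(fileB))
PairsC <- pairs_of(na.omit(fileC))
uniquePairs <- unique(c(PairsA, PairsB, PairsC))

# 1 if the pair occurs in the given file, 0 otherwise
fixed <- data.frame(Pairs = uniquePairs,
                    fileA = as.integer(uniquePairs %in% PairsA),
                    fileB = as.integer(uniquePairs %in% PairsB),
                    fileC = as.integer(uniquePairs %in% PairsC))
fixed
```

Because every unique pair comes from at least one file, each row now contains at least one 1.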
Form list L from the three objects and use lapply to paste their columns together and then stack to create a 2 column data frame with the pasted values and an indicator of which object it came from. Finally use table to provide the counts.
L <- mget(ls(pattern = "file"))
s <- stack(lapply(L, function(x) paste0(x[[1]], x[[2]])))
table(s)
giving:
ind
values fileA fileB fileC
abBqwewrm 0 0 1
abqwewr 1 1 0
bcsaff 1 1 0
bcvsaff 0 0 1
cdcxzcc 1 0 0
cdDcxzccf 0 1 0
cdDcxzccfh 0 0 1
efcsaddav 0 1 1
efsadda 1 0 0
ghgsadd 0 1 0
ghsadd 1 0 0
hjbn 0 1 0
A variation of this is to express it as this pipeline:
library(magrittr)
mget(ls(pattern = "file")) %>%
lapply(function(x) paste0(x[[1]], x[[2]])) %>%
stack %>%
table
You can first create the Pairs and then merge on them, while carrying a column where the data came from:
Create the indicator column in each file:
fileA$fileA <- 1
fileB$fileB <- 1
fileC$fileC <- 1
Create the pairs in each file:
fileA$DrugTargetPair <- paste0(fileA$Drug, fileA$Target)
fileB$DrugTargetPair <- paste0(fileB$Drug, fileB$Target)
fileC$DrugTargetPair <- paste0(fileC$Drug, fileC$Target)
Select only the indicator column and the Pairs column:
fileA <- fileA[, c("DrugTargetPair", "fileA")]
fileB <- fileB[, c("DrugTargetPair", "fileB")]
fileC <- fileC[, c("DrugTargetPair", "fileC")]
Merge on the Pairs column, keeping all pairs with all = TRUE:
file_new <- merge(fileA, fileB, by = "DrugTargetPair", all = TRUE)
file_new <- merge(file_new, fileC, by = "DrugTargetPair", all = TRUE)
file_new[is.na(file_new)] <- 0
file_new
DrugTargetPair fileA fileB fileC
1 abBqwewrm 0 0 1
2 abqwewr 1 1 0
3 bcsaff 1 1 0
4 bcvsaff 0 0 1
5 cdcxzcc 1 0 0
6 cdDcxzccf 0 1 0
7 cdDcxzccfh 0 0 1
8 efcsaddav 0 1 1
9 efsadda 1 0 0
10 ghgsadd 0 1 0
11 ghsadd 1 0 0
12 hjbn 0 1 0
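If there are more than three files, the repeated steps above can be folded into a loop. The following is a sketch only, assuming the files are collected in a named list; files, flagged and merged are illustrative names.

```r
fileA <- data.frame(Drug = c("ab", "bc", "cd", "ef", "gh"),
                    Target = c("qwewr", "saff", "cxzcc", "sadda", "sadd"))
fileB <- data.frame(Drug = c("ab", "bc", "cdD", "efc", "ghg", "hj"),
                    Target = c("qwewr", "saff", "cxzccf", "saddav", "sadd", "bn"))
fileC <- data.frame(Drug = c("abB", "bcv", "cdD", "efc"),
                    Target = c("qwewrm", "saff", "cxzccfh", "saddav"))
files <- list(fileA = fileA, fileB = fileB, fileC = fileC)

# For each file: the pasted pair plus an indicator column named after the file
flagged <- Map(function(d, nm) {
  out <- data.frame(DrugTargetPair = paste0(d$Drug, d$Target))
  out[[nm]] <- 1
  out
}, files, names(files))

# Full outer merge across all files, then turn missing indicators into 0
merged <- Reduce(function(x, y) merge(x, y, by = "DrugTargetPair", all = TRUE),
                 flagged)
merged[is.na(merged)] <- 0
merged
```

Adding another file then only means adding one more element to the list.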
data:
Drug<-c("ab","bc","cd","ef","gh")
Target<-c("qwewr","saff","cxzcc","sadda","sadd")
fileA<-data.frame(I(Drug),I(Target))
Drug<-c("ab","bc","cdD","efc","ghg","hj")
Target<-c("qwewr","saff","cxzccf","saddav","sadd","bn")
fileB<-data.frame(I(Drug),I(Target))
Drug<-c("abB","bcv","cdD","efc")
Target<-c("qwewrm","saff","cxzccfh","saddav")
fileC<-data.frame(I(Drug),I(Target))
code:
all_list <- list(fileA, fileB, fileC)
all1 <- rbind(fileA,fileB,fileC)
all1 <- as.data.frame(unique(all1))
ans <- t(apply(all1, 1, function(drgT){ sapply(all_list, function(x) {(list(drgT) %in% unlist(apply(x,1,list), recursive = F))*1} ) }))
ans[rowSums(ans) == 1,] <- 0
cbind(all1, ans)
result:
# Drug Target 1 2 3
#1 ab qwewr 1 1 0
#2 bc saff 1 1 0
#3 cd cxzcc 0 0 0
#4 ef sadda 0 0 0
#5 gh sadd 0 0 0
#8 cdD cxzccf 0 0 0
#9 efc saddav 0 1 1
#10 ghg sadd 0 0 0
#11 hj bn 0 0 0
#12 abB qwewrm 0 0 0
#13 bcv saff 0 0 0
#14 cdD cxzccfh 0 0 0
Please note:
Please revise your example data and desired outcome.
Please read up on stringsAsFactors.

Create a dataframe from a dataframe

I'd like to create a dataframe from a dataframe that I created before. My first dataframe is:
Sample motif chromosome
1 CT-G.A 1
1 TA-C.C 1
1 TC-G.C 2
2 CG-A.T 2
2 CA-G.T 2
Then I want to create a dataframe like the one below, for all 96 × 24 motif-chromosome combinations:
Sample CT-G.A,chr1 TA-C.C,chr1 TC-G.C,chr1 CG-A.T,ch1 CA-G.T,ch1 CT-G.A,chr2 TA-C.C,chr2 TC-G.C,chr2 CG-A.T,ch2 CA-G.T,ch2
1 1 1 0 0 0 0 0 1 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0 1 1
Here is a possible solution using dplyr and tidyr.
We add a column value that indicates if a chromosome is present, then complete the data.frame, making sure we have rows for each motif-chromosome-Sample combination, where missing combinations get a 0 in the value column. We create a key out of the motif and chromosome columns, and then discard those columns. Lastly, we reshape the data.frame from long to wide to get your desired format. Hope this helps!
df = read.table(text="Sample motif chromosome
1 CT-G.A 1
1 TA-C.C 1
1 TC-G.C 2
2 CG-A.T 2
2 CA-G.T 2
2 CA-G.T 2",header=T)
library(tidyr)
library(dplyr)
df %>% mutate(value=1) %>% complete(motif,chromosome,Sample,fill=list(value=0)) %>%
mutate(key=paste0(motif,',chr',chromosome)) %>%
group_by(Sample,key) %>%
summarize(value = sum(value)) %>%
spread(key,value) %>%
as.data.frame
Output:
Sample CA-G.T,chr1 CA-G.T,chr2 CG-A.T,chr1 CG-A.T,chr2 CT-G.A,chr1 CT-G.A,chr2 TA-C.C,chr1 TA-C.C,chr2 TC-G.C,chr1 TC-G.C,chr2
1 1 0 0 0 0 1 0 1 0 0 1
2 2 0 2 0 1 0 0 0 0 0 0
This seems to be a classic case of when you would want to use factors and ensure that the empty factor levels aren't dropped (which dcast and other functions might do unless explicitly told not to).
Using @Florian's sample data, you can try:
library(data.table)
cols <- c("motif", "chromosome")
setDT(df)[, (cols) := lapply(.SD, factor), .SDcols = cols][
, dcast(unique(.SD)[, value := 1L],
Sample ~ motif + chromosome, value.var = "value",
fill = 0L, drop = FALSE)]
# Sample CA-G.T_1 CA-G.T_2 CG-A.T_1 CG-A.T_2 CT-G.A_1 CT-G.A_2 TA-C.C_1 TA-C.C_2 TC-G.C_1 TC-G.C_2
# 1 1 0 0 0 0 1 0 1 0 0 1
# 2 2 0 1 0 1 0 0 0 0 0 0
I've moved "cols" outside of the transformation to save some typing and make things look a little more tidy.
Using the "tidyverse", I'd take a slightly different approach from @Florian, perhaps something like:
library(tidyverse)
df %>%
mutate_at(c("motif", "chromosome"), factor) %>%
mutate(value = 1) %>%
distinct() %>%
mutate(key = interaction(motif, chromosome)) %>%
select(-motif, -chromosome) %>%
spread(key, value, fill = 0, drop = FALSE)
Benchmarks
Benchmarks for these approaches and @Florian's, run on 10,000 rows with 20 resulting columns, can be found at this Gist.
This will work for you. I have used the tidyr and dplyr packages. I preferred to build the full grid with expand.grid from base R, then combine the columns with unite and finally reshape with spread.
df <- read.table(text = "Sample motif chromosome
1 CT-G.A 1
1 TA-C.C 1
1 TC-G.C 2
2 CG-A.T 2
2 CA-G.T 2", header = TRUE)
#add a column to represent presence of chromosome
df$val <- 1
library(tidyr)
library(dplyr)
#Complete missing rows
df_complete <- left_join(
expand.grid(Sample=unique(df$Sample), motif=unique(df$motif),
chromosome=unique(df$chromosome)),
df, by = c("Sample", "motif", "chromosome"), copy = TRUE)
#Additional rows should have val = 0
df_complete$val[is.na(df_complete$val)] <- 0
df_complete %>%
unite(motif, c("motif", "chromosome"), sep = ",chr" ) %>%
spread(motif, val)
#Result
Sample CA-G.T,chr1 CA-G.T,chr2 CG-A.T,chr1 CG-A.T,chr2 CT-G.A,chr1 CT-G.A,chr2 TA-C.C,chr1 TA-C.C,chr2 TC-G.C,chr1 TC-G.C,chr2
1 1 0 0 0 0 1 0 1 0 0 1
2 2 0 1 0 1 0 0 0 0 0 0
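The point made in an earlier answer about keeping empty factor levels can also be sketched in base R alone: interaction() creates a level for every motif-chromosome combination, and table() then counts occurrences per Sample, including zero counts. This is a rough illustration; key, counts and wide are made-up names.

```r
df <- read.table(text = "Sample motif chromosome
1 CT-G.A 1
1 TA-C.C 1
1 TC-G.C 2
2 CG-A.T 2
2 CA-G.T 2", header = TRUE)

# One factor level per motif/chromosome combination, observed or not
key <- interaction(df$motif, df$chromosome, sep = ",chr")

# Count rows per Sample and key; unobserved combinations are counted as 0
counts <- table(Sample = df$Sample, key = key)

# Reduce counts to a 0/1 presence matrix
wide <- (unclass(counts) > 0) * 1L
wide
```

The result is a matrix with one row per Sample and ten columns; wrapping it in as.data.frame.matrix() would give a data.frame if that is preferred.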

Convert a factor column to multiple boolean columns

Given data that looks like:
library(data.table)
DT <- data.table(x=rep(1:5, 2))
I would like to split this data into 5 boolean columns that indicate the presence of each number.
I can do this like this:
new.names <- sort(unique(DT$x))
DT[, paste0('col', new.names) := lapply(new.names, function(i) DT$x==i), with=FALSE]
But this uses a pesky lapply, which is probably slower than the data.table alternative, and this solution strikes me as not very "data.table-ish".
Is there a better and/or faster way to create these new columns?
How about model.matrix?
model.matrix(~factor(x)-1,data=DT)
factor(x)1 factor(x)2 factor(x)3 factor(x)4 factor(x)5
1 1 0 0 0 0
2 0 1 0 0 0
3 0 0 1 0 0
4 0 0 0 1 0
5 0 0 0 0 1
6 1 0 0 0 0
7 0 1 0 0 0
8 0 0 1 0 0
9 0 0 0 1 0
10 0 0 0 0 1
attr(,"assign")
[1] 1 1 1 1 1
attr(,"contrasts")
attr(,"contrasts")$`factor(x)`
[1] "contr.treatment"
Apparently, you can put model.matrix into [.data.table to give the same results. Not sure if it would be faster:
DT[,model.matrix(~factor(x)-1)]
There is also nnet::class.ind
library(nnet)
# class.ind() orders its columns by sorted level, so sort the new names to match
cbind(DT, setnames(as.data.table(DT[, class.ind(x)]), paste0('col', sort(unique(DT$x)))))
library(data.table)
DT <- data.table(x=rep(1:5, 2))
# add column with id
DT[, id := seq.int(nrow(DT))]
# cast long table into wide
DT.wide <- dcast(DT, id ~ x, value.var = "x", fill = 0, fun.aggregate = function(x) 1)
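For reference, the same indicator columns can be sketched in base R without any package, using outer() to compare every element with every distinct value. The vector x stands in for DT$x, and lev and ind are illustrative names.

```r
x <- rep(1:5, 2)  # stands in for DT$x

# Distinct values in sorted order, one indicator column each
lev <- sort(unique(x))

# Logical matrix: ind[i, j] is TRUE when x[i] equals lev[j]
ind <- outer(x, lev, "==")
colnames(ind) <- paste0("col", lev)
ind
```

The resulting logical matrix can be attached to the original table with cbind() if the columns are needed alongside x.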
