I have a table (mydf), shown below. My for loop in R (my code) works for a single column only, ALT1 in this instance. I would like it to loop over all the columns ALT1 through ALTn and store the output in separate variables, final1 through finaln.
The purpose is to match each of ALT1 through ALTn against the nucleotide columns (A, C, G, T, N) and get the corresponding values, as shown in the result below. Thank you for your help!
mycode
final1 <- {}
i <- 1
result =merge(coverage.bam, rows.concat.alt, by="start")
for(i in 1:nrow(result)){
final1[i] = paste(paste(result$chr[i], result$start[i], result$end[i],sep=":"),"-",
result$REF[i],"(",result[,(as.character(result$REF[i]))][i],")",",", result$ALT1[i],
"(",result[,(as.character(result$ALT1[i]))][i][!is.na(result[,(as.character(result$ALT1[i]))][i])],")",sep="")
}
final1
I have tried to extend this code to ALT1 through ALTn, but it does not work. Could you help me solve this, please?
final <- list()
setValue<-function(element){
print(element)
for(i in 1:nrow(result)){
final[[i]] = paste(paste(result$chr[i], result$start[i], result$end[i],sep=":"),"-",
result$REF[i],"(",result[,(as.character(result$REF[i]))][i],")",",", result[,element][i],
"(",result[,(as.character(result[,element][i])))][i][!is.na(result[,(as.character(result[,element][i])][i])],")",sep="")
}
}
for(i in colnames(result)){
if(grepl('ALT', i)){
setValue(i)
}
}
mydf
chr start end A C G T N = - REF ALT ALT1 ALT2 ALT3 ALTn
1 chr10 102022031 102022031 NA 34 NA NA NA NA NA C G G NA NA NA
2 chr10 102220574 102220574 2 22 2 3 NA NA NA C AGT A G T NA
3 chr10 115322228 115322228 NA 25 NA NA NA NA NA C A A NA NA NA
4 chr10 122222925 122222925 30 NA NA NA NA NA NA A C C NA NA NA
5 chr10 121111042 121111042 NA 48 NA NA NA NA NA C T T NA NA NA
6 chr10 124444484 124444484 NA 60 NA NA NA NA NA C T T NA NA NA
Result
"chr10:102022031:102022031-C(34),G()" "chr10:102220574:102220574-C(22),A(2),G(2),T(3)" "chr10:115322228:115322228-C(25),A()"
[4] "chr10:122222925:122222925-A(30),C()" "chr10:121111042:121111042-C(48),T()" "chr10:124444484:124444484-C(60),T()"
Try
p1 <- do.call(paste,c(mydf[1:3], sep=":"))
p2 <- apply(mydf[c(4:8, 11:16)], 1, function(x) {
Un1 <- unique(match( x[7:11], names(x)[1:4], nomatch=0))
i1 <- match(x[6], names(x))
v1 <- paste0(names(x[i1]),'(', x[i1], ')')
v2 <- as.numeric(x[Un1])
v2[is.na(v2)] <- ''
v3 <-paste(names(x[Un1]), '(', v2, ')', sep='', collapse=",")
paste(v1, v3, sep=",") })
paste(p1, p2, sep="-")
#[1] "chr10:102022031:102022031-C(34),G()"
#[2] "chr10:102220574:102220574-C(22),A(2),G(2),T(3)"
#[3] "chr10:115322228:115322228-C(25),A()"
#[4] "chr10:122222925:122222925-A(30),C()"
#[5] "chr10:121111042:121111042-C(48),T()"
#[6] "chr10:124444484:124444484-C(60),T()"
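If one result per ALT column is still wanted (final1 through finaln in the question), a named list is easier to handle than separate variables. The question's setValue stores nothing because the assignment to final inside the function only changes a local copy. The following is a minimal sketch rather than the method of the answer above: the toy result data frame and the helper names lookup and set_value are hypothetical, and it assumes every REF/ALT value is either NA or the name of a count column.

```r
# Toy stand-in for `result` (assumed layout; values taken from rows 1-2 of mydf)
result <- data.frame(
  chr   = c("chr10", "chr10"),
  start = c(102022031L, 102220574L),
  end   = c(102022031L, 102220574L),
  A = c(NA, 2), C = c(34, 22), G = c(NA, 2), T = c(NA, 3),
  REF  = c("C", "C"),
  ALT1 = c("G", "A"),
  ALT2 = c(NA, "G"),
  stringsAsFactors = FALSE
)

nuc    <- c("A", "C", "G", "T")
counts <- as.matrix(result[nuc])  # numeric matrix of per-base counts

# For each row, fetch the count in the column named by `bases`;
# an NA base or an NA count becomes an empty string.
lookup <- function(bases) {
  v <- counts[cbind(seq_len(nrow(counts)), match(bases, nuc))]
  ifelse(is.na(v), "", as.character(v))
}

# Build the strings for one ALT column and RETURN them
# (assigning to a global from inside the function is what failed before).
set_value <- function(alt_col) {
  alt <- result[[alt_col]]
  out <- paste0(result$chr, ":", result$start, ":", result$end, "-",
                result$REF, "(", lookup(result$REF), "),",
                alt, "(", lookup(alt), ")")
  out[is.na(alt)] <- NA  # no alternate allele in this column
  out
}

alt_cols <- grep("^ALT[0-9]+$", names(result), value = TRUE)
final <- setNames(lapply(alt_cols, set_value), alt_cols)
final$ALT1
```

Each element of final (final$ALT1, final$ALT2, ...) plays the role of final1, final2, and so on; rows without an allele in that column are NA.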
Related
I have many samples, each one of which has a corresponding abundance matrix. From these abundance matrices, I would like to create a large matrix that contains abundance information for each sample in rows.
For example, a single abundance matrix would look like:
A B C D
sample1 1 3 4 2
where A, B, C, and D represent colnames, and the abundances are the row values.
I would like to populate my larger matrix, which has as colnames all possible letters (A:Z) and all possible samples (sample1:sampleN) as rows, by matching the colname values.
For example:
A B C D E F G .... Z
sample1 1 3 4 2 NA NA NA ....
sample2 NA NA 2 5 7 NA NA ....
sample3 4 NA 6 9 2 NA 2 .....
....
sampleN
Different samples have a varying mix of abundances, in no guaranteed order.
When iteratively adding to this larger matrix, how can I ensure that the correct columns are populated with the right abundance values (e.g. column "A" is only filled with the abundances of "A" from the different samples)? Thanks!
Starting data, changing just a little to highlight differences:
m1 <- as.matrix(read.table(header=TRUE, text="
A B C Z
sample1 1 3 4 2"))
m2 <- as.matrix(read.table(header=TRUE, text="
A B C D E F G
sample2 NA NA 2 5 7 NA NA
sample3 4 NA 6 9 2 NA 2"))
First, we need to make sure both matrices have the same column names:
newcols <- setdiff(colnames(m2), colnames(m1))
m1 <- cbind(m1, matrix(NA, nrow=nrow(m1), ncol=length(newcols), dimnames=list(NULL, newcols)))
newcols <- setdiff(colnames(m1), colnames(m2))
m2 <- cbind(m2, matrix(NA, nrow=nrow(m2), ncol=length(newcols), dimnames=list(NULL, newcols)))
m1
# A B C Z D E F G
# sample1 1 3 4 2 NA NA NA NA
m2
# A B C D E F G Z
# sample2 NA NA 2 5 7 NA NA NA
# sample3 4 NA 6 9 2 NA 2 NA
And now we combine them; regular rbind needs the column names to be aligned as well:
rbind(m2, m1[,colnames(m2),drop=FALSE])
# A B C D E F G Z
# sample2 NA NA 2 5 7 NA NA NA
# sample3 4 NA 6 9 2 NA 2 NA
# sample1 1 3 4 NA NA NA NA 2
You should be able to take advantage of matrix indexing, like so:
big[cbind(rownames(abun),colnames(abun))] <- abun
Using this example abundance matrix, and a big matrix to fill:
abun <- matrix(c(1,3,4,2),nrow=1,dimnames=list("sample1",LETTERS[1:4]))
big <- matrix(NA,nrow=5,ncol=26,dimnames=list(paste0("sample",1:5),LETTERS))
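The two-column index above pairs the row name with each column name, which fits a single-row abundance matrix. For an abundance matrix holding several samples, indexing the block by row and column names is one option. A minimal sketch with made-up values, assuming every name in abun already exists in big:

```r
# Big matrix to fill: 3 samples x 26 letters, all NA to start with
big <- matrix(NA_real_, nrow = 3, ncol = 26,
              dimnames = list(paste0("sample", 1:3), LETTERS))

# Abundance matrix with two samples and columns in arbitrary positions
abun <- matrix(c(2, 5, 7,
                 4, 6, 9),
               nrow = 2, byrow = TRUE,
               dimnames = list(c("sample2", "sample3"), c("C", "D", "E")))

# Character indices select the matching rows and columns by name,
# so the values land in the right cells regardless of column order
big[rownames(abun), colnames(abun)] <- abun

big[, c("A", "C", "D", "E")]
```

Because the lookup is by name, the columns of abun can arrive in any order.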
Another solution using reduce from purrr package and union_all from dplyr package:
library(purrr)
library(dplyr)
sample_names <- c("sample1","sample2","sample3")
Generating 3 random abundance dataframes:
num1 <- round(runif(runif(1,min = 1, max = 10),min = 1, max = 10))
df1 <- data.frame(t(num1))
colnames(df1) <- sample(LETTERS,length(num1))
num2 <- round(runif(runif(1,min = 1, max = 10),min = 1, max = 10))
df2 <- data.frame(t(num2))
colnames(df2) <- sample(LETTERS,length(num2))
num3 <- round(runif(runif(1,min = 1, max = 10),min = 1, max = 10))
df3 <- data.frame(t(num3))
colnames(df3) <- sample(LETTERS,length(num3))
This is actually the code that does all the magic:
A <- reduce(list(df1,df2,df3),union_all)
col_order <- sort(colnames(A),decreasing = FALSE)
A <- A[,col_order]
rownames(A) <- sample_names
Output:
> A
A C E F O P Q U W Y
sample1 9 NA NA NA 9 NA 5 6 NA NA
sample2 NA NA NA NA 5 4 NA NA 5 NA
sample3 NA 6 5 9 NA NA 3 NA 5 7
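For readers who prefer to avoid the package dependency, the same alignment can be sketched in base R. This is an alternative under stated assumptions, not the code of the answer above: the two one-row data frames and their values are made up for illustration.

```r
# Hypothetical one-row abundance data frames, one per sample
dfs <- list(
  sample1 = data.frame(A = 1, B = 3),
  sample2 = data.frame(C = 2, A = 5)
)

# Union of all column names, in alphabetical order
all_cols <- sort(unique(unlist(lapply(dfs, names))))

# Add the missing columns as NA, then put every data frame in the same column order
aligned <- lapply(dfs, function(d) {
  for (nm in setdiff(all_cols, names(d))) d[[nm]] <- NA_real_
  d[all_cols]
})

# Stack the aligned rows and label them with the sample names
A <- do.call(rbind, aligned)
rownames(A) <- names(dfs)
A
```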
I have 15 datasets. The 1st column is "subject" and is identical in all sets. The number of the rest of the columns is not the same in all datasets. I need to combine all of this data in a single dataframe. I found the command "Reduce" but I am just starting with R and I couldn't understand if this is what I need and if so, what is the syntax? Thanks!
I suggest including a reproducible example in the future so that others can see the format of data you're working with and what you're trying to do.
Here is some randomly generated example data, each with the "Subject" column:
list_of_dfs <- list(
df1 = data.frame(Subject = 1:4, a = rnorm(4), b = rnorm(4)),
df2 = data.frame(Subject = 5:8, c = rnorm(4), d = rnorm(4), e = rnorm(4)),
df3 = data.frame(Subject = 7:10, f = rnorm(4)),
df4 = data.frame(Subject = 2:5, g = rnorm(4), h = rnorm(4))
)
Reduce with merge is a good choice:
combined_df <- Reduce(
function(x, y) { merge(x, y, by = "Subject", all = TRUE) },
list_of_dfs
)
And the output:
> combined_df
Subject a b c d e f g h
1 1 1.1106594 1.2530046 NA NA NA NA NA NA
2 2 -1.0275630 0.6437101 NA NA NA NA -1.9393347 -0.4361952
3 3 0.1558639 1.2792212 NA NA NA NA -0.8861966 1.0137530
4 4 0.4283585 -0.1045530 NA NA NA NA 1.8924896 -0.3788198
5 5 NA NA 0.08261190 0.77058804 -1.165042 NA 0.7950784 -1.3467386
6 6 NA NA 2.51214598 0.62024328 1.496520 NA NA NA
7 7 NA NA 0.01581309 -0.04777196 -1.327884 1.5111734 NA NA
8 8 NA NA 0.80448136 -0.33347573 -2.290428 -0.3863564 NA NA
9 9 NA NA NA NA NA -1.2371795 NA NA
10 10 NA NA NA NA NA 1.6819063 NA NA
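Since the question asks what Reduce actually does: it applies a two-argument function to the first two list elements, then to that result and the third element, and so on. A small sketch with made-up data frames (the name merge_all is hypothetical) showing that Reduce gives the same result as nesting the merge calls by hand:

```r
dfs <- list(
  data.frame(Subject = 1:2, a = c(10, 20)),
  data.frame(Subject = 2:3, b = c(30, 40)),
  data.frame(Subject = 3:4, c = c(50, 60))
)

# Full outer join on Subject, so no subject is dropped
merge_all <- function(x, y) merge(x, y, by = "Subject", all = TRUE)

# What Reduce does, written out by hand: ((df1 merge df2) merge df3)
by_hand   <- merge_all(merge_all(dfs[[1]], dfs[[2]]), dfs[[3]])
by_reduce <- Reduce(merge_all, dfs)

identical(by_hand, by_reduce)
```

With 15 datasets the nested form becomes unwieldy, which is why Reduce over a list is convenient.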
I created an empty data frame, something like this:
id Alyr Crub Lala Brap Bole Spar Esal Aara Thas
1 XLOC_003940_TBH_1 NA NA NA NA NA NA NA NA NA
I want to check whether the id and the column name match and, if they do, replace the NA with a certain value. Here is an example:
ex1 <- "Alyr_XLOC_003940_TBH_1_Ortholog_Known_Gene_Sense"
sp <- sub("([A-Za-z]+)_(XLOC_\\d+_TBH_1)_([A-Za-z_]+)","\\1", ex1)
gene <- sub("([A-Za-z]+)_(XLOC_\\d+_TBH_1)_([A-Za-z_]+)","\\2", ex1)
fun <- sub("([A-Za-z]+)_(XLOC_\\d+_TBH_1)_([A-Za-z_]+)","\\3", ex1)
Based on the above example, I want to get something like this:
id Alyr Crub Lala Brap Bole Spar Esal Aara Thas
1 XLOC_003940_TBH_1 Ortholog_Known_Gene_Sense NF NF NF NF NF NF NF NF
I am stuck here and can't figure out how to do this.
Use matrix subsetting:
df1$id <- gene
df1[cbind(1:nrow(df1), match(sp, names(df1)))] <- fun
Check this answer for more on subsetting a data frame by a two-column matrix.
##Example
nms <- scan(what="character", text="id Alyr Crub Lala Brap Bole Spar Esal Aara Thas")
df1 <- as.data.frame(matrix(NA, 3, 10))
names(df1) <- nms
df1
# id Alyr Crub Lala Brap Bole Spar Esal Aara Thas
#1 NA NA NA NA NA NA NA NA NA NA
#2 NA NA NA NA NA NA NA NA NA NA
#3 NA NA NA NA NA NA NA NA NA NA
ex1 <- c("Alyr_XLOC_003940_TBH_1_Ortholog_Gene",
"Lala_XLOC_1234_TBH_1_Lalala_Gene",
"Thas_XLOC_5678_TBH_1_Thasthas_Gene")
sp <- sub("([A-Za-z]+)_(XLOC_\\d+_TBH_1)_([A-Za-z_]+)","\\1", ex1)
gene <- sub("([A-Za-z]+)_(XLOC_\\d+_TBH_1)_([A-Za-z_]+)","\\2", ex1)
fun <- sub("([A-Za-z]+)_(XLOC_\\d+_TBH_1)_([A-Za-z_]+)","\\3", ex1)
df1$id <- gene
df1[cbind(1:nrow(df1), match(sp, names(df1)))] <- fun
df1
# id Alyr Crub Lala Brap Bole Spar Esal Aara Thas
# 1 XLOC_003940_TBH_1 Ortholog_Gene NA <NA> NA NA NA NA NA <NA>
# 2 XLOC_1234_TBH_1 <NA> NA Lalala_Gene NA NA NA NA NA <NA>
# 3 XLOC_5678_TBH_1 <NA> NA <NA> NA NA NA NA NA Thasthas_Gene
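The question asked for NF rather than NA in the cells that do not match. One way to get that, sketched under the assumption that only three species columns are needed for illustration, is to fill a character matrix with "NF" first and then overwrite the matching cells with the same two-column index:

```r
species <- c("Alyr", "Crub", "Lala")  # shortened column set, for illustration only
ex1 <- c("Alyr_XLOC_003940_TBH_1_Ortholog_Gene",
         "Lala_XLOC_1234_TBH_1_Lalala_Gene")

pattern <- "([A-Za-z]+)_(XLOC_\\d+_TBH_1)_([A-Za-z_]+)"
sp   <- sub(pattern, "\\1", ex1)  # species prefix
gene <- sub(pattern, "\\2", ex1)  # gene id
fun  <- sub(pattern, "\\3", ex1)  # annotation

# Start with "NF" everywhere, then overwrite cell (row i, column of species i)
out <- matrix("NF", nrow = length(ex1), ncol = length(species),
              dimnames = list(NULL, species))
out[cbind(seq_along(ex1), match(sp, species))] <- fun

df1 <- data.frame(id = gene, out, stringsAsFactors = FALSE)
df1
```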
I have a problem with conditional replacement. Let's assume I have the following code for a dataframe
a=c("0","1","0","B","NA","NA","NA","NA","NA")
b=c(0,1,0,0,1,0,1,0,1)
c=c(0,0,0,0,1,0,0,1,1)
d=c("0","1","0","0","1","0","B","NA","NA")
dat=data.frame(rbind(a,b,c,d))
names(dat)=c("P1","P2","P3","P4","C1","C2","C3","C4","C5")
Now I want to replace the row values of P1:P4 with NA if one of these values is B and I also want to replace the row values of C1:C5 with NA if one of these values is B. So I want the Dataframe to look like this:
a=c("NA","NA","NA","NA","NA","NA","NA","NA","NA")  # P1:P4 replaced
b=c(0,1,0,0,1,0,1,0,1)
c=c(0,0,0,0,1,0,0,1,1)
d=c("0","1","0","0","NA","NA","NA","NA","NA")  # C1:C3 replaced
dat=data.frame(rbind(a,b,c,d))
names(dat)=c("P1","P2","P3","P4","C1","C2","C3","C4","C5")
I hope the problem is understandable and I would appreciate any help.
Considering dat to be the original provided dataframe, I'm providing a comparatively lengthy code for better understanding. Hope it helps.
dat2 <- data.frame()
for(i in seq_len(nrow(dat))){
  datSubset <- dat[i, ]
  # Column position of the "B" in this row (NA when the row has none)
  col.num.of.B <- which(datSubset == "B", arr.ind = TRUE)[2]
  if(!is.na(col.num.of.B)){
    if(col.num.of.B < 5) {
      datSubset[, 1:4] <- NA  # "B" found in P1:P4
    } else {
      datSubset[, 5:9] <- NA  # "B" found in C1:C5
    }
  }
  dat2 <- rbind(dat2, datSubset)
}
dat2
# P1 P2 P3 P4 C1 C2 C3 C4 C5
# a <NA> <NA> <NA> <NA> NA NA NA NA NA
# b 0 1 0 0 1 0 1 0 1
# c 0 0 0 0 1 0 0 1 1
# d 0 1 0 0 <NA> <NA> <NA> <NA> <NA>
As I understood it... If the value B is found in columns P1 to P4, then set all the values within P1 to P4 to NA.
You can try:
nm <- c("P1", "P2", "P3", "P4")
cols <- which(names(dat) %in% nm)
dat[,cols][any(dat[,cols] == "B")] <- NA
dat
# P1 P2 P3 P4 C1 C2 C3 C4 C5
# a NA NA NA NA NA NA NA NA NA
# b NA NA NA NA 1 0 1 0 1
# c NA NA NA NA 1 0 0 1 1
# d NA NA NA NA 1 0 B NA NA
If you want to apply this to only the first row, then use dat[1,cols][any(dat[,cols] == "B")] <- NA. Note that the condition still looks for B in every row of those columns, not just the first.
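To get the row-by-row behaviour the question describes without an explicit loop over rows, one possible sketch is to test each column group per row with rowSums. The list name groups is hypothetical, and the data are rebuilt here so that the block runs on its own:

```r
dat <- data.frame(rbind(
  c("0", "1", "0", "B", "NA", "NA", "NA", "NA", "NA"),
  c(0, 1, 0, 0, 1, 0, 1, 0, 1),
  c(0, 0, 0, 0, 1, 0, 0, 1, 1),
  c("0", "1", "0", "0", "1", "0", "B", "NA", "NA")
), stringsAsFactors = FALSE)
names(dat) <- c("P1", "P2", "P3", "P4", "C1", "C2", "C3", "C4", "C5")
rownames(dat) <- c("a", "b", "c", "d")

groups <- list(P = paste0("P", 1:4), C = paste0("C", 1:5))

for (cols in groups) {
  # TRUE for rows that contain at least one "B" within this column group
  has_B <- rowSums(dat[cols] == "B", na.rm = TRUE) > 0
  # Blank out the whole group, but only in those rows
  dat[has_B, cols] <- NA
}
dat
```

Note that the "NA" entries in the original data are character strings, whereas the replaced cells become real missing values.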
I'm trying to tabulate/map the counts of 2 factor-class vectors (b1 & b2) into a bigger dataframe. Summaries of the vectors are shown below:
> summary(b1)
(4,6] (6,8] NA's
16 3 1
> summary(b2)
(4,6] (6,8] NA's
9 0 11
I would like to map the above counts into a bigger dataframe:
Intervals b1 b2
1 (-Inf,0] NA NA
2 (0,2] NA NA
3 (2,4] NA NA
4 (4,6] NA NA
5 (6,8] NA NA
6 (8,10] NA NA
7 (10,12] NA NA
8 (12, Inf] NA NA
My question: is there a vectorized or more direct way to do the above without resorting to a 'for' loop + if-else condition checking? It seems like something easily done, but I have a mental block and I haven't been successful in finding relevant help online. Any help/hint is appreciated. Thanks in advance!
My sample code is attached:
NoOfElement <- 20
MyBreaks <- c(seq(4, 8, by=2))
MyBigBreaks <- c(-Inf, seq(0,12, by=2), Inf)
set.seed(1)
a1 <- rnorm(NoOfElement, 5); a2 <- rnorm(NoOfElement, 4)
b1 <- cut(a1, MyBreaks); b2 <- cut(a2, MyBreaks)
c <- seq(-10, 10)
d <- cut(c, MyBigBreaks)
e <- data.frame( Intervals=levels(d), b1=NA, b2=NA )
The table function does the tabulation that you need. It returns a named vector, and you can compare the names against the column e$Intervals to assign the correct values.
This relies on the fact that the order of the factor levels is the same in e$Intervals and b1 and b2. This is so because these all come from cut.
e$b1[e$Intervals %in% names(table(b1))] <- table(b1)
e$b2[e$Intervals %in% names(table(b2))] <- table(b2)
e
## Intervals b1 b2
## 1 (-Inf,0] NA NA
## 2 (0,2] NA NA
## 3 (2,4] NA NA
## 4 (4,6] 16 9
## 5 (6,8] 3 0
## 6 (8,10] NA NA
## 7 (10,12] NA NA
## 8 (12, Inf] NA NA
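If the order of the levels cannot be relied on, the counts can be placed by name with match instead of %in%. A minimal sketch with deterministic made-up values (the names small_breaks, big_breaks, x and counts are illustrative, not from the question):

```r
small_breaks <- seq(4, 8, by = 2)
big_breaks   <- c(-Inf, seq(0, 12, by = 2), Inf)

x <- c(4.5, 5, 5.5, 7, 9)  # 9 falls outside the small breaks and becomes NA
b <- cut(x, small_breaks)

intervals <- levels(cut(0, big_breaks))  # all interval labels of the big table
counts    <- table(b)                    # counts per small interval

# match() gives, for each big interval, the position of its count (or NA)
e <- data.frame(
  Intervals = intervals,
  b = as.vector(counts)[match(intervals, names(counts))]
)
e
```

Intervals that never occur in b stay NA, as in the output above.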