Adding multiple columns to data.table in R

I have a data.table with sequences and number of reads, like so:
sequence num_reads
1: AACCTGCCG 1
2: CGCGCTCAA 12
3: AGTGTGAGC 3
4: TGGGTACAC 11
5: GGCCGCGTG 15
6: CCTTAAGAG 2
7: GCGGAACTG 9
8: GCGTTGTAG 17
9: GTTGTAGCG 20
10: ACACGTGAC 16
I'd like to use data.table to add two new columns to this table, based on the results of applying dpois() with two weights and two lambdas. The correct output should be this (based on using data.frame):
sequence num_reads clus1 clus2
1 AACCTGCCG 1 2.553269503552647000377e-03 1.610220613932057849571e-03
2 CGCGCTCAA 12 1.053993989051599418361e-02 2.887608256917401083896e-02
3 AGTGTGAGC 3 2.085170094567994833468e-02 1.717568654860860896672e-02
4 TGGGTACAC 11 1.806846838374168498498e-02 4.331412385376097462508e-02
5 GGCCGCGTG 15 1.324248858039188620275e-03 5.415587646672919558410e-03
6 CCTTAAGAG 2 8.936443262434262332916e-03 6.440882455728230530922e-03
7 GCGGAACTG 9 4.056186780023639942838e-02 7.444615037365168164207e-02
8 GCGTTGTAG 17 2.385595369261770803265e-04 1.274255916864215588610e-03
9 GTTGTAGCG 20 1.196285397159046524451e-05 9.538289904012846548518e-05
10 ACACGTGAC 16 5.793588753921446012421e-04 2.707793823336458478163e-03
But when I try to use data.table I can't seem to get the right result. Here is what I tried (based on similar questions asked around this topic):
pois = function(n, p, l){return(dpois(as.numeric(as.character(n)), l)*p) }
x = x[, c(paste("clus", seq(1,2), sep = '')) := pois(num_reads, c(0.4,0.6), c(7,8)), by = seq_len(nrow(x))]
And here is the result:
sequence num_reads clus1 clus2
1: AACCTGCCG 1 2.553269503552647000377e-03 2.553269503552647000377e-03
2: CGCGCTCAA 12 1.053993989051599418361e-02 1.053993989051599418361e-02
3: AGTGTGAGC 3 2.085170094567994833468e-02 2.085170094567994833468e-02
4: TGGGTACAC 11 1.806846838374168498498e-02 1.806846838374168498498e-02
5: GGCCGCGTG 15 1.324248858039188620275e-03 1.324248858039188620275e-03
6: CCTTAAGAG 2 8.936443262434262332916e-03 8.936443262434262332916e-03
7: GCGGAACTG 9 4.056186780023639942838e-02 4.056186780023639942838e-02
8: GCGTTGTAG 17 2.385595369261770803265e-04 2.385595369261770803265e-04
9: GTTGTAGCG 20 1.196285397159046524451e-05 1.196285397159046524451e-05
10: ACACGTGAC 16 5.793588753921446012421e-04 5.793588753921446012421e-04
The reason I'm using data.table and not data.frame is that my real data has hundreds of thousands of rows. I studied the answers to two similar questions, but I haven't been able to come up with a solution.
Any tips you have would be much appreciated. Thanks!

We can try
x[, paste("clus", seq(1,2), sep = ''):=
as.list(pois(num_reads, c(0.4,0.6), c(7,8))), by = seq_len(nrow(x))]
x
# sequence num_reads clus1 clus2
#1: AACCTGCCG 1 0.00255326950 0.0016102206
#2: CGCGCTCAA 12 0.01053993989 0.0288760826
#3: AGTGTGAGC 3 0.02085170095 0.0171756865
#4: TGGGTACAC 11 0.01806846838 0.0433141239
#5: GGCCGCGTG 15 0.00132424886 0.0054155876
#6: CCTTAAGAG 2 0.00893644326 0.0064408825
#7: GCGGAACTG 9 0.04056186780 0.0744461504
#8: GCGTTGTAG 17 0.00023855954 0.0012742559
#9: GTTGTAGCG 20 0.00001196285 0.0000953829
#10: ACACGTGAC 16 0.00057935888 0.0027077938
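A note on why the as.list() wrapper matters, plus a sketch of a variant that avoids grouping by row. As far as I can tell, `:=` treats a plain numeric vector on the right-hand side as the values for a single column, so both new columns end up with the first element; as.list() hands `:=` one list element per new column instead. Because dpois() is vectorized over its first argument, the same result can probably be had without `by` at all. The three-row table and the names `weights` and `lambdas` below are stand-ins chosen for illustration.

```r
library(data.table)

# Small stand-in for the question's table (first three rows)
x <- data.table(sequence  = c("AACCTGCCG", "CGCGCTCAA", "AGTGTGAGC"),
                num_reads = c(1, 12, 3))

weights <- c(0.4, 0.6)  # illustrative names for the two weights
lambdas <- c(7, 8)      # illustrative names for the two lambdas

# Map() returns a list with one full-length vector per (weight, lambda) pair,
# so := assigns each list element to its own column with no row-wise grouping
x[, paste0("clus", seq_along(weights)) :=
      Map(function(p, l) p * dpois(num_reads, l), weights, lambdas)]
x
```

With hundreds of thousands of rows, skipping `by = seq_len(nrow(x))` should matter, since the function is then called twice in total rather than once per row.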

Related

How to swap values between variables in a data frame by column name in R

I would like to swap values in the following data set by column name. For example, for rows 10 to 15, swap the values of beta0_C1 and beta0_C2 and leave all other values unchanged. Similarly, for rows 10 to 15, swap the values of beta1_C1 and beta1_C2, of beta2_C1 and beta2_C2, and of beta3_C1 and beta3_C2.
beta0_C1 beta1_C1 beta2_C1 beta3_C1 beta0_C2 beta1_C2 beta2_C2
1 6.010537 0.2826006 0.001931834 -0.0014162495 6.862525 -0.7267671 0.12065368
2 6.182425 0.1633226 0.025748699 -0.0028515529 6.780775 -0.6686269 0.10548767
3 6.222667 0.1109463 0.036438064 -0.0034054813 6.891512 -0.7372192 0.11895311
4 5.980246 0.3095103 -0.002670511 -0.0011975572 6.677035 -0.5774936 0.08990028
5 6.146192 0.1661733 0.024968028 -0.0027346213 6.881571 -0.7439543 0.11835484
6 6.056259 0.2374753 0.010833872 -0.0019526540 7.094971 -0.8504940 0.13648015
7 6.051281 0.2265750 0.017030676 -0.0024138722 6.829044 -0.7180662 0.12121844
8 5.911484 0.3628966 -0.014161483 -0.0005192893 6.784079 -0.6090060 0.09075940
9 5.956709 0.3486160 -0.011525364 -0.0006776760 6.934137 -0.7821656 0.12996924
10 6.010721 0.2821788 0.002475369 -0.0014508507 6.810553 -0.7140603 0.12471406
11 6.021261 0.3180654 -0.004986709 -0.0010968281 6.708342 -0.6259794 0.10697798
12 6.171459 0.2020801 0.015380862 -0.0021379484 6.592252 -0.5040888 0.07813420
13 6.103334 0.2432321 0.010022319 -0.0019386513 6.831204 -0.6854066 0.11129609
14 5.989656 0.3026038 -0.003007319 -0.0011073984 6.782081 -0.6822204 0.10769549
15 6.024628 0.2786942 0.001861784 -0.0014022176 6.864881 -0.7299905 0.12030466
16 6.023082 0.2707312 0.008308583 -0.0019947781 6.850565 -0.7136916 0.11551886
17 5.988829 0.3267394 -0.007576506 -0.0008493887 6.882956 -0.7739330 0.13467615
18 6.072949 0.2744519 0.002846329 -0.0014917373 6.886863 -0.7853582 0.13512568
19 6.030894 0.2693881 0.006378019 -0.0017875603 6.842824 -0.7238131 0.11835479
20 6.197286 0.1311579 0.036005746 -0.0035338268 6.807729 -0.6549960 0.10400631
beta3_C2
1 -0.005112708
2 -0.003982831
3 -0.004824895
4 -0.003356916
5 -0.004724677
6 -0.005657009
7 -0.005200557
8 -0.003065364
9 -0.005408715
10 -0.005551546
11 -0.004516814
12 -0.002726879
13 -0.004493288
14 -0.004053661
15 -0.004913402
16 -0.004609239
17 -0.006101912
18 -0.005945182
19 -0.004801623
20 -0.004151904
Any help is appreciated.

Given this input:
(df1 <- as.data.frame(matrix(1:12, ncol = 3)))
# V1 V2 V3
#1 1 5 9
#2 2 6 10
#3 3 7 11
#4 4 8 12
You can use rev
df1[1:2, c("V2", "V3")] <- rev(df1[1:2, c("V2", "V3")])
Result
df1
# V1 V2 V3
#1 1 9 5
#2 2 10 6
#3 3 7 11
#4 4 8 12
Written as a function of rows and cols
f <- function(data, rows, cols) {
data[rows, cols] <- rev(data[rows, cols])
data
}
f(df1, 1:2, c("V2", "V3"))
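Applied to the question's naming scheme, the same idea can be looped over each betaN pair. This is a minimal sketch on a made-up data frame with only two pairs; it uses rev() on the column names rather than on the data frame, which relies on the assignment being positional.

```r
# Hypothetical stand-in with the question's column naming scheme
df <- data.frame(beta0_C1 = 1:20,    beta1_C1 = 21:40,
                 beta0_C2 = 101:120, beta1_C2 = 121:140)
rows <- 10:15

for (b in c("beta0", "beta1")) {
  cols <- paste0(b, c("_C1", "_C2"))
  # Values are assigned by position, so reading the pair in reversed order
  # and writing it back in the original order swaps the two columns
  df[rows, cols] <- df[rows, rev(cols)]
}
df[9:11, ]
```

For the full data set, the loop vector would be paste0("beta", 0:3).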

How can this R code be sped up with the apply (lapply, mapply etc.) functions?

I am not too proficient with the apply functions, or with R. But I know I overuse for loops, which makes my code slow. How can the following code be sped up with apply functions, or in any other way?
sum_store = NULL
for (col in 1:ncol(cazy_fams)){ # for each column in cazy_fams (so for each master family, e.g. GH, AA etc.)
  for (row in 1:nrow(cazy_fams)){ # for each row in cazy_fams (so the specific family number, e.g. GH1, AA7 etc.)
    # Isolating the rows that pertain to the current cazy family for every dataframe in the list
    filt_fam = lapply(family_summary, function(sample){
      sample[as.character(sample$Family) %in% paste(colnames(cazy_fams[col]),cazy_fams[row,col], sep = ""),]
    })
    row_cat = do.call(rbind, filt_fam) # concatenating the lapply list output into a dataframe
    if (nrow(row_cat) > 0){
      fam_sum = aggregate(proteins ~ Family, data=row_cat, FUN=sum) # collapsing the dataframe into one row and summing the proteins count
      sum_store = rbind(sum_store, fam_sum) # storing the results for that family
    } else if (grepl("NA", paste(colnames(cazy_fams[col]),cazy_fams[row,col], sep = "")) == FALSE) {
      Family = paste(colnames(cazy_fams[col]),cazy_fams[row,col], sep = "")
      proteins = 0
      sum_store = rbind(sum_store, data.frame(Family, proteins))
    } else {
      next
    }
  }
}
family_summary is just a list of 18 two column dataframes that look like this:
Family proteins
CE0 2
CE1 9
CE4 15
CE7 1
CE9 1
CE14 10
GH0 5
GH1 1
GH3 4
GH4 1
GH8 1
GH9 2
GH13 2
GH15 5
GH17 1
with different cazy families.
cazy_fams is just a dataframe with each column being a cazy class (e.g. GH, AA etc.) and each row being a family number, all taken from the linked website:
GH GT PL CE AA CBM
1 1 1 1 1 1
2 2 2 2 2 2
3 3 3 3 3 3
4 4 4 4 4 4
5 5 5 5 5 5
6 6 6 6 6 6
7 7 7 7 7 7
8 8 8 8 8 8
9 9 9 9 9 9
10 10 10 10 10 10
11 11 11 11 11 11
12 12 12 12 12 12
13 13 13 13 13 13
14 14 14 14 14 14
15 15 15 15 15 15
The reason behind the else if (grepl("NA", paste(colnames(cazy_fams[col]),cazy_fams[row,col], sep = "")) == FALSE) statement is that not all classes have the same number of families, so when looping over my dataframe I end up with some names such as GHNA and AANA, with NA on the end.
The output sum_store is this:
Family proteins
GH1 54
GH2 51
GH3 125
GH4 29
GH5 40
GH6 25
GH7 0
GH8 16
GH9 25
GH10 19
GH11 5
GH12 5
GH13 164
GH14 3
GH15 61
That is, a dataframe with all listed cazy families and the total number of appearances of each across the family_summary list.
Please let me know if you need anything else to help answer my question.
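One possible way to avoid the nested loops, offered as a sketch rather than a tested drop-in replacement: stack the list once, sum proteins per family in a single pass, then look up every family listed in cazy_fams and fill in 0 for those never seen. The small `family_summary` and `cazy_fams` objects below are made-up stand-ins for the real data, and `all_fams` and `totals` are names introduced here.

```r
# Made-up stand-ins for the question's objects
family_summary <- list(
  data.frame(Family = c("GH1", "GH3", "CE1"), proteins = c(1, 4, 9)),
  data.frame(Family = c("GH1", "CE4"),        proteins = c(2, 15))
)
cazy_fams <- data.frame(GH = c(1, 2, 3), CE = c(1, 4, NA))

# Every family name listed in cazy_fams, column by column, skipping NA padding
all_fams <- unlist(lapply(names(cazy_fams), function(cl) {
  nums <- cazy_fams[[cl]]
  paste0(cl, nums[!is.na(nums)])
}))

# Stack the list once and sum proteins per family in one pass
stacked <- do.call(rbind, family_summary)
totals  <- vapply(split(stacked$proteins, as.character(stacked$Family)),
                  sum, numeric(1))

# Look up each listed family; families that never appear get 0
proteins <- unname(totals[match(all_fams, names(totals))])
proteins[is.na(proteins)] <- 0
sum_store <- data.frame(Family = all_fams, proteins = proteins)
sum_store
```

The expensive part of the original is calling lapply(), rbind() and aggregate() once per cell of cazy_fams and growing sum_store with rbind(); here each of those steps happens once.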

Column-specific arguments to lapply in data.table .SD when applying rbinom

I have a data.table for which I want to add columns of random binomial numbers based on one column as number of trials and multiple probabilities based on other columns:
require(data.table)
DT = data.table(
ID = letters[sample.int(26,10, replace = T)],
Quantity=as.integer(100*runif(10))
)
prob.vecs <- LETTERS[1:5]
DT[,(prob.vecs):=0]
set.seed(123)
DT[,(prob.vecs):=lapply(.SD, function(x){runif(.N,0,0.2)}), .SDcols=prob.vecs]
DT
ID Quantity A B C D E
1: b 66 0.05751550 0.191366669 0.17790786 0.192604847 0.02856000
2: l 9 0.15766103 0.090666831 0.13856068 0.180459809 0.08290927
3: u 38 0.08179538 0.135514127 0.12810136 0.138141056 0.08274487
4: d 27 0.17660348 0.114526680 0.19885396 0.159093484 0.07376909
5: o 81 0.18809346 0.020584937 0.13114116 0.004922737 0.03048895
6: f 44 0.00911130 0.179964994 0.14170609 0.095559194 0.02776121
7: d 81 0.10562110 0.049217547 0.10881320 0.151691908 0.04660682
8: t 81 0.17848381 0.008411907 0.11882840 0.043281587 0.09319249
9: x 79 0.11028700 0.065584144 0.05783195 0.063636202 0.05319453
10: j 43 0.09132295 0.190900730 0.02942273 0.046325157 0.17156554
Now I want to add five columns, Quantity_A, Quantity_B, Quantity_C, Quantity_D and Quantity_E, which apply rbinom with the corresponding probability and the quantity from the second column.
So for example the first entry for Quantity_A would be:
set.seed(741)
sum(rbinom(66,1,0.05751550))
> 2
This problem seems very similar to this post: How do I pass column-specific arguments to lapply in data.table .SD? but I cannot seem to make it work. My try:
DT[,(paste0("Quantity_", prob.vecs)):= mapply(function(x, Quantity){sum(rbinom(Quantity, 1 , x))}, .SD), .SDcols = prob.vecs]
Error in rbinom(Quantity, 1, x) :
argument "Quantity" is missing, with no default
Any ideas?

I seem to have found a work-around, though I am not quite sure why this works (it probably has something to do with the function rbinom not being vectorized in both arguments).
First define an index:
DT[,Index:=.I]
and then do it by index, setting the seed before the call so that it takes effect:
set.seed(789)
DT[,(paste0("Quantity_", prob.vecs)):= lapply(.SD,function(x){sum(rbinom(Quantity, 1 , x))}), .SDcols = prob.vecs, by=Index]
ID Quantity A B C D E Index Quantity_A Quantity_B Quantity_C Quantity_D Quantity_E
1: c 37 0.05751550 0.191366669 0.17790786 0.192604847 0.02856000 1 0 4 7 8 0
2: c 51 0.15766103 0.090666831 0.13856068 0.180459809 0.08290927 2 3 5 9 19 3
3: r 7 0.08179538 0.135514127 0.12810136 0.138141056 0.08274487 3 0 0 2 2 0
4: v 53 0.17660348 0.114526680 0.19885396 0.159093484 0.07376909 4 8 4 16 12 3
5: d 96 0.18809346 0.020584937 0.13114116 0.004922737 0.03048895 5 17 3 12 0 4
6: u 52 0.00911130 0.179964994 0.14170609 0.095559194 0.02776121 6 1 3 8 6 0
7: m 43 0.10562110 0.049217547 0.10881320 0.151691908 0.04660682 7 6 1 7 6 2
8: z 3 0.17848381 0.008411907 0.11882840 0.043281587 0.09319249 8 1 0 2 1 1
9: m 3 0.11028700 0.065584144 0.05783195 0.063636202 0.05319453 9 1 0 0 0 0
10: o 4 0.09132295 0.190900730 0.02942273 0.046325157 0.17156554 10 0 0 0 0 0
The numbers look about right to me.
A solution without the index would still be appreciated.
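A possible index-free variant, as a sketch. The first argument of rbinom() is the number of draws, and when it is a vector its length is used, so rbinom(Quantity, 1, x) on a whole column returns one draw per row instead of Quantity draws. Since the sum of Quantity Bernoulli(p) draws is distributed as a single Binomial(Quantity, p) draw, the trial count can go in the `size` argument, which is vectorized together with `prob`. The four-row table below is made up for illustration.

```r
library(data.table)

set.seed(123)
# Made-up stand-in for the question's table
DT <- data.table(ID = letters[1:4], Quantity = c(66L, 9L, 38L, 27L))
prob.vecs <- LETTERS[1:3]
DT[, (prob.vecs) := lapply(1:3, function(i) runif(.N, 0, 0.2))]

# rbinom(n, size, prob): n = .N asks for one draw per row; size and prob are
# matched element-wise, so row i gets one Binomial(Quantity[i], p[i]) draw
DT[, paste0("Quantity_", prob.vecs) :=
       lapply(.SD, function(p) rbinom(.N, size = Quantity, prob = p)),
   .SDcols = prob.vecs]
DT
```

The draws have the same distribution as the summed Bernoulli version but will not reproduce its exact numbers for a given seed, because the random number stream is consumed differently.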

data.table foverlaps how to return all ranges from both x and y?

I would like to use foverlaps() to perform an overlapping range join. A problem I've run into is that the maxgap argument has not been implemented, and I just noticed that the merge does not return all values from the two tables to be joined even when selecting type = "any", mult="all". See here:
range_one =
  data.table(
    seqnames = rep(11, 8),
    start = c(5617339, 5624476, 5625764, 5626567, 5629602, 5631375, 5631765, 5632007),
    end = c(5617411, 5624965, 5625859, 5626797, 5629624, 5631475, 5631791, 5634182),
    exon_rank = c(1:8)
  )
range_two =
  data.table(
    seqnames = rep(11, 8),
    start = c(5617886, 5624476, 5625764, 5626567, 5629602, 5631375, 5631765, 5632007),
    end = c(5618144, 5624965, 5625859, 5626797, 5629624, 5631475, 5631791, 5634182),
    exon_rank = c(1:8)
  )
> range_one
seqnames start end exon_rank
1: 11 5617339 5617411 1
2: 11 5624476 5624965 2
3: 11 5625764 5625859 3
4: 11 5626567 5626797 4
5: 11 5629602 5629624 5
6: 11 5631375 5631475 6
7: 11 5631765 5631791 7
8: 11 5632007 5634182 8
> range_two
seqnames start end exon_rank
1: 11 5617886 5618144 1
2: 11 5624476 5624965 2
3: 11 5625764 5625859 3
4: 11 5626567 5626797 4
5: 11 5629602 5629624 5
6: 11 5631375 5631475 6
7: 11 5631765 5631791 7
8: 11 5632007 5634182 8
setkey(range_one, seqnames,start,end)
setkey(range_two, seqnames,start,end)
> foverlaps(range_one,range_two, type = "any", mult="all")
seqnames start end exon_rank i.start i.end i.exon_rank
1: 11 NA NA NA 5617339 5617411 1
2: 11 5624476 5624965 2 5624476 5624965 2
3: 11 5625764 5625859 3 5625764 5625859 3
4: 11 5626567 5626797 4 5626567 5626797 4
5: 11 5629602 5629624 5 5629602 5629624 5
6: 11 5631375 5631475 6 5631375 5631475 6
7: 11 5631765 5631791 7 5631765 5631791 7
8: 11 5632007 5634182 8 5632007 5634182 8
Notice how the first row from range_two is lost in the foverlaps result.
How can I make sure that ALL ranges from both sides are returned whether or not they overlap fully or partially, or not at all?
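One way this might be approached, sketched on a made-up two-row example: foverlaps() with the default nomatch = NA already keeps every row of its first argument, so what is missing are the rows of the second table that overlapped nothing. Those can be found with which = TRUE and appended with rbind(fill = TRUE). The names `ov`, `idx`, `unmatched` and `full` are introduced here for illustration.

```r
library(data.table)

# Made-up ranges: one overlapping pair, plus one unmatched range on each side
range_one <- data.table(seqnames = 11L, start = c(100L, 200L),
                        end = c(150L, 250L), exon_rank = 1:2)
range_two <- data.table(seqnames = 11L, start = c(200L, 300L),
                        end = c(250L, 350L), exon_rank = 1:2)
setkey(range_one, seqnames, start, end)
setkey(range_two, seqnames, start, end)

# Every row of range_one, with NA in range_two's columns where nothing overlaps
ov <- foverlaps(range_one, range_two, type = "any", mult = "all")

# Row numbers of range_two (yid) that overlapped at least one row of range_one
idx <- foverlaps(range_one, range_two, type = "any", mult = "all", which = TRUE)
hit <- unique(idx$yid[!is.na(idx$yid)])

# Rows of range_two that never matched, appended with NA in the i.* columns
unmatched <- range_two[setdiff(seq_len(nrow(range_two)), hit)]
full <- rbind(ov, unmatched, fill = TRUE)
full
```

This behaves like a full outer join on overlap. It does not address the missing maxgap argument; widening the intervals before the join would be a separate step.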

R - Skip columns in pmax command if they do not exist

I'd like to use the pmax command to create a new column. My code looks like this:
Master <- Master %>%
mutate(RAM = pmax(RAM1, RAM2, RAM3, RAM4, RAM5, RAM6, RAM7, RAM8, RAM9, RAM10,
RAM11, RAM12, RAM13, RAM14, RAM15, RAM16, RAM17, RAM18,
RAM19, RAM20, RAM21, RAM22, RAM23, RAM24, RAM25, RAM26,
RAM27, RAM28, RAM29, RAM30, RAM31, RAM32, RAM33, RAM34,
RAM35, RAM36, RAM37, RAM38, RAM39, RAM40, RAM41, RAM42,
RAM43, RAM44, RAM45, RAM46, RAM47, RAM48, RAM49, RAM50,
RAM51, RAM52, RAM53, RAM54, RAM55, RAM56, RAM57, RAM58,
RAM59, RAM60, RAM61, RAM62, RAM63, RAM64, RAM65, RAM66,
RAM67, RAM68, RAM69, RAM70, RAM71, RAM72, RAM73, RAM74,
RAM75, RAM76, RAM77, RAM78, RAM79, RAM80, RAM81, RAM82,
RAM83, RAM84, RAM85, RAM86, RAM87, RAM88, RAM89, RAM90,
RAM91, RAM92, na.rm =T))
In my current database, however, only the columns RAM1 to RAM8 exist. In this case, I want R to skip all the other columns mentioned in the statement and use only columns RAM1 to RAM8 (it is okay if R displays an error message, but I don't want the program to stop running the code).
Any ideas how to do so?
Thanks!

One way to do this would be as follows.
Set up some data to make a reproducible example:
set.seed(0)
Master <- data.frame(Other=100,RAM1=1:10, RAM2=1:10, RAM3=1:10, RAM4=1:10,
RAM5=1:10, RAM6=1:10, RAM7=1:10, RAM8=rnorm(10)+5)
Master[5,5] <- NA
Select required columns of the dataframe:
Master[colnames(Master) %in% paste0("RAM",1:92)]
Use do.call to run pmax using the selected columns as arguments, and adding the argument na.rm=TRUE
Master$RAM <- do.call(pmax, c(Master[colnames(Master) %in% paste0("RAM",1:92)], na.rm=TRUE))
Sample output:
Master
# Other RAM1 RAM2 RAM3 RAM4 RAM5 RAM6 RAM7 RAM8 RAM
#1 100 1 1 1 1 1 1 1 6.262954 6.262954
#2 100 2 2 2 2 2 2 2 4.673767 4.673767
#3 100 3 3 3 3 3 3 3 6.329799 6.329799
#4 100 4 4 4 4 4 4 4 6.272429 6.272429
#5 100 5 5 5 NA 5 5 5 5.414641 5.414641
#6 100 6 6 6 6 6 6 6 3.460050 6.000000
#7 100 7 7 7 7 7 7 7 4.071433 7.000000
#8 100 8 8 8 8 8 8 8 4.705280 8.000000
#9 100 9 9 9 9 9 9 9 4.994233 9.000000
#10 100 10 10 10 10 10 10 10 7.404653 10.000000
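The same approach can be wrapped in a small helper so that the candidate column list is stated once. This is a sketch: `pmax_existing` is a hypothetical name, and returning NA when none of the candidates exist is an assumption about the desired behaviour.

```r
# Hypothetical helper: row-wise maximum over whichever candidate columns exist
pmax_existing <- function(data, candidates) {
  present <- intersect(candidates, colnames(data))
  # Assumption: with no matching columns, return NA for every row
  if (length(present) == 0) return(rep(NA_real_, nrow(data)))
  do.call(pmax, c(unname(as.list(data[present])), na.rm = TRUE))
}

# Only RAM1 and RAM2 exist here; RAM3 to RAM92 are silently skipped
Master <- data.frame(Other = 100, RAM1 = c(1, 5, NA), RAM2 = c(4, 2, NA))
Master$RAM <- pmax_existing(Master, paste0("RAM", 1:92))
Master
```

With na.rm = TRUE, pmax() returns NA for a row only when every selected column is NA in that row.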
