Strange behaviour of gtools::mixedorder combined with dplyr::arrange?

I noticed an unexpected outcome while using dplyr::arrange together with gtools::mixedorder.
Consider:
library(tidyverse)
test <- data.frame(V1 = c("all13_LG1", "all13_LG10", "all13_LG11",
                          "all13_LG12", "all13_LG13", "all13_LG14", "all13_LG15", "all13_LG16",
                          "all13_LG2", "all13_LG3", "all13_LG4", "all13_LG5", "all13_LG6",
                          "all13_LG7", "all13_LG8", "all13_LG9"),
                   V2 = 1:16)
test2 <- test %>% arrange(gtools::mixedorder(V1))
test3 <- test %>% slice(gtools::mixedorder(V1))
In test2 the first column comes out in this order: "all13_LG1", "all13_LG3", "all13_LG4", "all13_LG5", "all13_LG6", "all13_LG7", "all13_LG8", "all13_LG9", "all13_LG10", "all13_LG11", "all13_LG12", "all13_LG13", "all13_LG14", "all13_LG15", "all13_LG16", "all13_LG2"
In test3, by contrast, the rows are sorted as one would expect when using gtools::mixedorder.
Why does this happen when I combine arrange and mixedorder? Is this a bug?
Many thanks,
Anneke

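To use the result of mixedorder in arrange, you need to wrap it in order(). mixedorder returns row positions (a permutation), which is what slice expects, whereas arrange sorts the rows by the values it is given. Calling order() on that permutation yields each row's rank, and arranging by rank gives the natural order.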
library(dplyr)
test %>% arrange(order(gtools::mixedorder(V1)))
# V1 V2
#1 all13_LG1 1
#2 all13_LG2 9
#3 all13_LG3 10
#4 all13_LG4 11
#5 all13_LG5 12
#6 all13_LG6 13
#7 all13_LG7 14
#8 all13_LG8 15
#9 all13_LG9 16
#10 all13_LG10 2
#11 all13_LG11 3
#12 all13_LG12 4
#13 all13_LG13 5
#14 all13_LG14 6
#15 all13_LG15 7
#16 all13_LG16 8
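To see why the extra order() is needed, here is a minimal base R sketch. It does not call gtools; the vector p is a hand-written stand-in for what mixedorder would return on this toy input, and arrange(key) is modelled as indexing by order(key). Both are assumptions made for illustration.

```r
x <- c("LG1", "LG10", "LG11", "LG2")

# Hand-written stand-in for gtools::mixedorder(x):
# the positions of x's elements in natural order.
p <- c(1L, 4L, 2L, 3L)

by_index   <- x[p]                # what slice(p) does: pick rows by position
by_value   <- x[order(p)]         # what arrange(p) does: sort rows by the values of p
by_ordered <- x[order(order(p))]  # what arrange(order(p)) does

by_index    # natural order
by_value    # not the natural order
by_ordered  # natural order again
```

order(p) is the inverse permutation of p, so applying order() twice gets back to p itself.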

We can also slice with mixedsort and match: mixedsort returns the sorted values, and match looks up the row position of each one.
library(dplyr)
test %>%
  slice(match(gtools::mixedsort(V1), V1))
Output:
V1 V2
1 all13_LG1 1
2 all13_LG2 9
3 all13_LG3 10
4 all13_LG4 11
5 all13_LG5 12
6 all13_LG6 13
7 all13_LG7 14
8 all13_LG8 15
9 all13_LG9 16
10 all13_LG10 2
11 all13_LG11 3
12 all13_LG12 4
13 all13_LG13 5
14 all13_LG14 6
15 all13_LG15 7
16 all13_LG16 8
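One caveat with the match approach: match returns only the first position of each value, so it assumes the values in V1 are unique. A small base R sketch of what happens otherwise (the vector sorted is written by hand here and stands in for the result of a natural sort):

```r
x <- c("a2", "a10", "a2")

# Hand-written stand-in for a natural sort of x.
sorted <- c("a2", "a2", "a10")

idx <- match(sorted, x)
idx     # position 1 appears twice, position 3 never
x[idx]  # looks fine for the key, but row 3 of a data frame would be lost
```

With duplicated keys, the order(mixedorder(...)) form or slice(mixedorder(...)) is the safer choice.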

Related

How to swap values between variables in a data frame by column name in R

I would like to swap values in the following data set by column name. For example, for rows 10 to 15, swap the values of beta0_C1 and beta0_C2, leaving all other values unchanged. Likewise, for rows 10 to 15, swap beta1_C1 with beta1_C2, beta2_C1 with beta2_C2, and beta3_C1 with beta3_C2.
beta0_C1 beta1_C1 beta2_C1 beta3_C1 beta0_C2 beta1_C2 beta2_C2
1 6.010537 0.2826006 0.001931834 -0.0014162495 6.862525 -0.7267671 0.12065368
2 6.182425 0.1633226 0.025748699 -0.0028515529 6.780775 -0.6686269 0.10548767
3 6.222667 0.1109463 0.036438064 -0.0034054813 6.891512 -0.7372192 0.11895311
4 5.980246 0.3095103 -0.002670511 -0.0011975572 6.677035 -0.5774936 0.08990028
5 6.146192 0.1661733 0.024968028 -0.0027346213 6.881571 -0.7439543 0.11835484
6 6.056259 0.2374753 0.010833872 -0.0019526540 7.094971 -0.8504940 0.13648015
7 6.051281 0.2265750 0.017030676 -0.0024138722 6.829044 -0.7180662 0.12121844
8 5.911484 0.3628966 -0.014161483 -0.0005192893 6.784079 -0.6090060 0.09075940
9 5.956709 0.3486160 -0.011525364 -0.0006776760 6.934137 -0.7821656 0.12996924
10 6.010721 0.2821788 0.002475369 -0.0014508507 6.810553 -0.7140603 0.12471406
11 6.021261 0.3180654 -0.004986709 -0.0010968281 6.708342 -0.6259794 0.10697798
12 6.171459 0.2020801 0.015380862 -0.0021379484 6.592252 -0.5040888 0.07813420
13 6.103334 0.2432321 0.010022319 -0.0019386513 6.831204 -0.6854066 0.11129609
14 5.989656 0.3026038 -0.003007319 -0.0011073984 6.782081 -0.6822204 0.10769549
15 6.024628 0.2786942 0.001861784 -0.0014022176 6.864881 -0.7299905 0.12030466
16 6.023082 0.2707312 0.008308583 -0.0019947781 6.850565 -0.7136916 0.11551886
17 5.988829 0.3267394 -0.007576506 -0.0008493887 6.882956 -0.7739330 0.13467615
18 6.072949 0.2744519 0.002846329 -0.0014917373 6.886863 -0.7853582 0.13512568
19 6.030894 0.2693881 0.006378019 -0.0017875603 6.842824 -0.7238131 0.11835479
20 6.197286 0.1311579 0.036005746 -0.0035338268 6.807729 -0.6549960 0.10400631
beta3_C2
1 -0.005112708
2 -0.003982831
3 -0.004824895
4 -0.003356916
5 -0.004724677
6 -0.005657009
7 -0.005200557
8 -0.003065364
9 -0.005408715
10 -0.005551546
11 -0.004516814
12 -0.002726879
13 -0.004493288
14 -0.004053661
15 -0.004913402
16 -0.004609239
17 -0.006101912
18 -0.005945182
19 -0.004801623
20 -0.004151904
Any help is appreciated.
Given this input:
(df1 <- as.data.frame(matrix(1:12, ncol = 3)))
# V1 V2 V3
#1 1 5 9
#2 2 6 10
#3 3 7 11
#4 4 8 12
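You can use rev, which on a data frame reverses the column order, so assigning the result back swaps the two columns: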
df1[1:2, c("V2", "V3")] <- rev(df1[1:2, c("V2", "V3")])
Result
df1
# V1 V2 V3
#1 1 9 5
#2 2 10 6
#3 3 7 11
#4 4 8 12
Written as a function of rows and cols
f <- function(data, rows, cols) {
  data[rows, cols] <- rev(data[rows, cols])
  data
}
f(df1, 1:2, c("V2", "V3"))
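For several pairs at once, as in the question, rev on the whole selection would not pair the columns up correctly. A sketch of one alternative is to build the two sets of names and assign them crosswise. The small data frame below is made up for illustration and uses only two of the four pairs and rows 2 to 3 instead of 10 to 15.

```r
# Toy data with the question's naming scheme (values are invented).
df <- data.frame(
  beta0_C1 = 1:4,     beta1_C1 = 11:14,
  beta0_C2 = 101:104, beta1_C2 = 111:114
)

rows <- 2:3
c1 <- paste0("beta", 0:1, "_C1")  # "beta0_C1" "beta1_C1"
c2 <- paste0("beta", 0:1, "_C2")  # "beta0_C2" "beta1_C2"

# The right-hand side is evaluated first, then assigned column by column
# in the order given on the left, so each _C1 column receives its _C2 partner.
df[rows, c(c1, c2)] <- df[rows, c(c2, c1)]
df
```

For the original data this would become rows <- 10:15 and 0:3 in the two paste0 calls.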

How can this R code be sped up with the apply (lapply, mapply, etc.) functions?

I am not too proficient with the apply functions, or with R, but I know I overuse for loops, which makes my code slow. How can the following code be sped up with apply functions, or in any other way?
sum_store = NULL
for (col in 1:ncol(cazy_fams)) { # for each column in cazy_fams (i.e. each master family, e.g. GH, AA, etc.)
  for (row in 1:nrow(cazy_fams)) { # for each row in cazy_fams (i.e. the specific family number, e.g. GH1, AA7, etc.)
    # Isolate the rows that pertain to the current cazy family in every data frame in the list
    filt_fam = lapply(family_summary, function(sample) {
      sample[as.character(sample$Family) %in% paste(colnames(cazy_fams[col]), cazy_fams[row, col], sep = ""), ]
    })
    row_cat = do.call(rbind, filt_fam) # concatenate the lapply list output into one data frame
    if (nrow(row_cat) > 0) {
      fam_sum = aggregate(proteins ~ Family, data = row_cat, FUN = sum) # collapse the data frame into one row, summing the protein counts
      sum_store = rbind(sum_store, fam_sum) # store the result for that family
    } else if (grepl("NA", paste(colnames(cazy_fams[col]), cazy_fams[row, col], sep = "")) == FALSE) {
      Family = paste(colnames(cazy_fams[col]), cazy_fams[row, col], sep = "")
      proteins = 0
      sum_store = rbind(sum_store, data.frame(Family, proteins))
    } else {
      next
    }
  }
}
family_summary is just a list of 18 two-column data frames that look like this:
Family proteins
CE0 2
CE1 9
CE4 15
CE7 1
CE9 1
CE14 10
GH0 5
GH1 1
GH3 4
GH4 1
GH8 1
GH9 2
GH13 2
GH15 5
GH17 1
with different cazy families.
cazy_fams is just a data frame with each column being a cazy class (e.g. GH, AA, etc.) and each row being a family number, all taken from the linked website:
GH GT PL CE AA CBM
1 1 1 1 1 1
2 2 2 2 2 2
3 3 3 3 3 3
4 4 4 4 4 4
5 5 5 5 5 5
6 6 6 6 6 6
7 7 7 7 7 7
8 8 8 8 8 8
9 9 9 9 9 9
10 10 10 10 10 10
11 11 11 11 11 11
12 12 12 12 12 12
13 13 13 13 13 13
14 14 14 14 14 14
15 15 15 15 15 15
The else if (grepl("NA", paste(colnames(cazy_fams[col]),cazy_fams[row,col], sep = "")) == FALSE) statement is there because not all classes have the same number of families, so when looping over my data frame I end up with some names such as GHNA and AANA, with NA on the end.
The output sum_store is this:
Family proteins
GH1 54
GH2 51
GH3 125
GH4 29
GH5 40
GH6 25
GH7 0
GH8 16
GH9 25
GH10 19
GH11 5
GH12 5
GH13 164
GH14 3
GH15 61
A data frame with all listed cazy families and the total number of appearances across the family_summary list.
Please let me know if you need anything else to help answer my question.
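One possible way to avoid the nested loops, sketched here on toy data: stack all the data frames once, sum the proteins per family in a single pass, then look up every family name built from cazy_fams and fill in 0 for those that never appear. The toy family_summary and cazy_fams below are invented for illustration and much smaller than the real ones, and NA is assumed to mark the padding in shorter classes.

```r
# Toy stand-ins for the real inputs (values are invented).
family_summary <- list(
  data.frame(Family = c("GH1", "GH3", "CE1"), proteins = c(1, 4, 9)),
  data.frame(Family = c("GH1", "CE4"),        proteins = c(2, 15))
)
cazy_fams <- data.frame(GH = c(1, 2, 3), CE = c(1, 4, NA))

# Every family name, class by class, skipping the NA padding.
all_fams <- unlist(lapply(names(cazy_fams), function(cls) {
  nums <- cazy_fams[[cls]]
  paste0(cls, nums[!is.na(nums)])
}))

# Stack the list once and sum the proteins per family.
combined <- do.call(rbind, family_summary)
totals <- vapply(split(combined$proteins, as.character(combined$Family)),
                 sum, numeric(1))

# Look up each listed family; families that never appear get 0.
proteins <- unname(totals[all_fams])
proteins[is.na(proteins)] <- 0
sum_store <- data.frame(Family = all_fams, proteins = proteins)
sum_store
```

The rows come out in the same order as the loop version (column by column, then row by row), and the repeated rbind inside the loop, which is usually the slow part, is gone.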

R - Skip columns in pmax command if they do not exist

I'd like to use the pmax command to create a new column. My code looks like this:
Master <- Master %>%
mutate(RAM = pmax(RAM1, RAM2, RAM3, RAM4, RAM5, RAM6, RAM7, RAM8, RAM9, RAM10,
RAM11, RAM12, RAM13, RAM14, RAM15, RAM16, RAM17, RAM18,
RAM19, RAM20, RAM21, RAM22, RAM23, RAM24, RAM25, RAM26,
RAM27, RAM28, RAM29, RAM30, RAM31, RAM32, RAM33, RAM34,
RAM35, RAM36, RAM37, RAM38, RAM39, RAM40, RAM41, RAM42,
RAM43, RAM44, RAM45, RAM46, RAM47, RAM48, RAM49, RAM50,
RAM51, RAM52, RAM53, RAM54, RAM55, RAM56, RAM57, RAM58,
RAM59, RAM60, RAM61, RAM62, RAM63, RAM64, RAM65, RAM66,
RAM67, RAM68, RAM69, RAM70, RAM71, RAM72, RAM73, RAM74,
RAM75, RAM76, RAM77, RAM78, RAM79, RAM80, RAM81, RAM82,
RAM83, RAM84, RAM85, RAM86, RAM87, RAM88, RAM89, RAM90,
RAM91, RAM92, na.rm =T))
In my current data set, however, only the columns RAM1 to RAM8 exist. In this case, I want R to skip all the other columns mentioned in the statement and use only columns RAM1 to RAM8 (it is okay if R displays an error message, but I don't want it to interrupt the running code).
Any ideas how to do so?
Thanks!
One way to do this would be as follows:
Set up some data to make a reproducible example
set.seed(0)
Master <- data.frame(Other=100,RAM1=1:10, RAM2=1:10, RAM3=1:10, RAM4=1:10,
RAM5=1:10, RAM6=1:10, RAM7=1:10, RAM8=rnorm(10)+5)
Master[5,5] <- NA
Select required columns of the dataframe:
Master[colnames(Master) %in% paste0("RAM",1:92)]
Use do.call to run pmax using the selected columns as arguments, and adding the argument na.rm=TRUE
Master$RAM <- do.call(pmax, c(Master[colnames(Master) %in% paste0("RAM",1:92)], na.rm=TRUE))
Sample output:
Master
# Other RAM1 RAM2 RAM3 RAM4 RAM5 RAM6 RAM7 RAM8 RAM
#1 100 1 1 1 1 1 1 1 6.262954 6.262954
#2 100 2 2 2 2 2 2 2 4.673767 4.673767
#3 100 3 3 3 3 3 3 3 6.329799 6.329799
#4 100 4 4 4 4 4 4 4 6.272429 6.272429
#5 100 5 5 5 NA 5 5 5 5.414641 5.414641
#6 100 6 6 6 6 6 6 6 3.460050 6.000000
#7 100 7 7 7 7 7 7 7 4.071433 7.000000
#8 100 8 8 8 8 8 8 8 4.705280 8.000000
#9 100 9 9 9 9 9 9 9 4.994233 9.000000
#10 100 10 10 10 10 10 10 10 7.404653 10.000000
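The same idea can be wrapped in a small function that also copes with the case where none of the candidate columns exist. This is only a sketch: row_max_existing is a hypothetical helper name, and returning NA when nothing matches is an assumption about what you would want.

```r
# Hypothetical helper: row-wise maximum over whichever candidate columns exist.
row_max_existing <- function(data, candidates, na.rm = TRUE) {
  present <- intersect(candidates, names(data))
  if (length(present) == 0) {
    # Assumption: with no matching column, return NA for every row.
    return(rep(NA_real_, nrow(data)))
  }
  do.call(pmax, c(data[present], na.rm = na.rm))
}

# Small made-up example with only RAM1 and RAM2 present.
Master <- data.frame(Other = 100, RAM1 = c(1, 5, NA), RAM2 = c(4, 2, 3))
Master$RAM <- row_max_existing(Master, paste0("RAM", 1:92))
Master

# No RAM columns at all: the helper returns NA instead of failing.
no_ram <- row_max_existing(data.frame(a = 1:2), paste0("RAM", 1:92))
```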

Adding multiple columns to data.table in R

I have a data.table with sequences and number of reads, like so:
sequence num_reads
1: AACCTGCCG 1
2: CGCGCTCAA 12
3: AGTGTGAGC 3
4: TGGGTACAC 11
5: GGCCGCGTG 15
6: CCTTAAGAG 2
7: GCGGAACTG 9
8: GCGTTGTAG 17
9: GTTGTAGCG 20
10: ACACGTGAC 16
I'd like to use data.table to add two new columns to this table, based on the results of applying dpois() with two weights and two lambdas. The correct output should be this (based on using data.frame):
sequence num_reads clus1 clus2
1 AACCTGCCG 1 2.553269503552647000377e-03 1.610220613932057849571e-03
2 CGCGCTCAA 12 1.053993989051599418361e-02 2.887608256917401083896e-02
3 AGTGTGAGC 3 2.085170094567994833468e-02 1.717568654860860896672e-02
4 TGGGTACAC 11 1.806846838374168498498e-02 4.331412385376097462508e-02
5 GGCCGCGTG 15 1.324248858039188620275e-03 5.415587646672919558410e-03
6 CCTTAAGAG 2 8.936443262434262332916e-03 6.440882455728230530922e-03
7 GCGGAACTG 9 4.056186780023639942838e-02 7.444615037365168164207e-02
8 GCGTTGTAG 17 2.385595369261770803265e-04 1.274255916864215588610e-03
9 GTTGTAGCG 20 1.196285397159046524451e-05 9.538289904012846548518e-05
10 ACACGTGAC 16 5.793588753921446012421e-04 2.707793823336458478163e-03
But when I try to use data.table I can't seem to get the right result. Here is what I tried (based on similar questions asked around this topic):
pois = function(n, p, l){return(dpois(as.numeric(as.character(n)), l)*p) }
x = x[, c(paste("clus", seq(1,2), sep = '')) := pois(num_reads, c(0.4,0.6), c(7,8)), by = seq_len(nrow(x))]
And here is the result:
sequence num_reads clus1 clus2
1: AACCTGCCG 1 2.553269503552647000377e-03 2.553269503552647000377e-03
2: CGCGCTCAA 12 1.053993989051599418361e-02 1.053993989051599418361e-02
3: AGTGTGAGC 3 2.085170094567994833468e-02 2.085170094567994833468e-02
4: TGGGTACAC 11 1.806846838374168498498e-02 1.806846838374168498498e-02
5: GGCCGCGTG 15 1.324248858039188620275e-03 1.324248858039188620275e-03
6: CCTTAAGAG 2 8.936443262434262332916e-03 8.936443262434262332916e-03
7: GCGGAACTG 9 4.056186780023639942838e-02 4.056186780023639942838e-02
8: GCGTTGTAG 17 2.385595369261770803265e-04 2.385595369261770803265e-04
9: GTTGTAGCG 20 1.196285397159046524451e-05 1.196285397159046524451e-05
10: ACACGTGAC 16 5.793588753921446012421e-04 5.793588753921446012421e-04
The reason I'm using data.table and not data.frame is that my real data has 100,000s of rows. I studied the answers to this and this but I haven't been able to come up with a solution.
Any tips you have would be much appreciated. Thanks!
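We can wrap the result in as.list, so that each element of the vector returned by pois is assigned to its own column:
x[, paste("clus", seq(1,2), sep = '') :=
    as.list(pois(num_reads, c(0.4,0.6), c(7,8))), by = seq_len(nrow(x))]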
x
# sequence num_reads clus1 clus2
#1: AACCTGCCG 1 0.00255326950 0.0016102206
#2: CGCGCTCAA 12 0.01053993989 0.0288760826
#3: AGTGTGAGC 3 0.02085170095 0.0171756865
#4: TGGGTACAC 11 0.01806846838 0.0433141239
#5: GGCCGCGTG 15 0.00132424886 0.0054155876
#6: CCTTAAGAG 2 0.00893644326 0.0064408825
#7: GCGGAACTG 9 0.04056186780 0.0744461504
#8: GCGTTGTAG 17 0.00023855954 0.0012742559
#9: GTTGTAGCG 20 0.00001196285 0.0000953829
#10:ACACGTGAC 16 0.00057935888 0.0027077938
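Since dpois is vectorised over its first argument, the per-row grouping may not be needed at all. A sketch of that idea in base R, building one full column per (weight, lambda) pair; the names weights, lambdas and cols are made up for this example, and only the first three num_reads values from the question are used:

```r
num_reads <- c(1, 12, 3)
weights <- c(0.4, 0.6)
lambdas <- c(7, 8)

# One whole column per (weight, lambda) pair, computed in a single
# vectorised call each instead of once per row.
cols <- Map(function(p, l) dpois(num_reads, l) * p, weights, lambdas)
names(cols) <- paste0("clus", seq_along(cols))
cols

# With a data.table x holding num_reads, the same list could then be
# assigned in one step (not run here, as it needs data.table):
# x[, names(cols) := cols]
```

For 100,000s of rows this should be considerably faster than grouping by every row, because the grouping overhead is paid only once.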

R dplyr column sort with alphanumeric characters

Why does dplyr not sort the first column properly when it's composed of numeric and alphabetic characters?
> library(dplyr)
> y <- read.table("file.csv", sep = ",")
> arrange(y, V1)
V1 V2 V3 V4 V5 V6
1 1 0.97348999 0.11047091 0.95841014 0.61826620 0.43164420
2 10 0.82178167 0.21619067 0.11993356 0.06335101 0.28703842
3 11 0.35952632 0.27595845 0.24760335 0.63887200 0.47491472
4 12 0.43775624 0.08852486 0.06870304 0.63670202 0.55432641
5 13 0.83894086 0.40484966 0.96735507 0.86764578 0.02588688
6 14 0.95258399 0.65029909 0.97183605 0.87688243 0.97729517
7 15 0.62839615 0.52999000 0.05722874 0.40709867 0.56039580
8 2 0.22754619 0.16812359 0.39432991 0.68562992 0.43066861
9 3 0.33318220 0.21108688 0.60911213 0.64475379 0.98617404
10 4 0.57208511 0.58709229 0.29435093 0.78603855 0.81185551
11 5 0.35548490 0.15229426 0.42423263 0.72963238 0.04401239
12 6 0.08575802 0.33310521 0.09671737 0.90820671 0.33289880
13 7 0.05743798 0.20439928 0.56411860 0.54859270 0.81053637
14 8 0.99056584 0.29960046 0.20765701 0.45722997 0.51354034
15 9 0.35839568 0.11667019 0.56498996 0.43971051 0.23968955
16 A 0.25645249 0.07045102 0.17046681 0.75700118 0.50269449
17 B 0.57722865 0.31544398 0.33129932 0.44173772 0.11600295
18 C 0.94242373 0.55745376 0.01542128 0.01723924 0.11413310
I'd like to see:
V1 V2 V3 V4 V5 V6
1 1 0.97348999 0.11047091 0.95841014 0.61826620 0.43164420
2 2 0.22754619 0.16812359 0.39432991 0.68562992 0.43066861
3 3 0.33318220 0.21108688 0.60911213 0.64475379 0.98617404
4 4 0.57208511 0.58709229 0.29435093 0.78603855 0.81185551
5 5 0.35548490 0.15229426 0.42423263 0.72963238 0.04401239
6 6 0.08575802 0.33310521 0.09671737 0.90820671 0.33289880
7 7 0.05743798 0.20439928 0.56411860 0.54859270 0.81053637
8 8 0.99056584 0.29960046 0.20765701 0.45722997 0.51354034
9 9 0.35839568 0.11667019 0.56498996 0.43971051 0.23968955
10 10 0.82178167 0.21619067 0.11993356 0.06335101 0.28703842
11 11 0.35952632 0.27595845 0.24760335 0.63887200 0.47491472
12 12 0.43775624 0.08852486 0.06870304 0.63670202 0.55432641
13 13 0.83894086 0.40484966 0.96735507 0.86764578 0.02588688
14 14 0.95258399 0.65029909 0.97183605 0.87688243 0.97729517
15 15 0.62839615 0.52999000 0.05722874 0.40709867 0.56039580
16 A 0.25645249 0.07045102 0.17046681 0.75700118 0.50269449
17 B 0.57722865 0.31544398 0.33129932 0.44173772 0.11600295
18 C 0.94242373 0.55745376 0.01542128 0.01723924 0.11413310
Your disregard of alpha is a bit problematic, but how about:
library(dplyr)
arrange(y, as.numeric(V1))
# Warning in order(as.numeric(y$V1)) : NAs introduced by coercion
# V1 V2 V3 V4 V5 V6
# 1 1 0.97348999 0.11047091 0.95841014 0.61826620 0.43164420
# 8 2 0.22754619 0.16812359 0.39432991 0.68562992 0.43066861
# 9 3 0.33318220 0.21108688 0.60911213 0.64475379 0.98617404
# 10 4 0.57208511 0.58709229 0.29435093 0.78603855 0.81185551
# 11 5 0.35548490 0.15229426 0.42423263 0.72963238 0.04401239
# 12 6 0.08575802 0.33310521 0.09671737 0.90820671 0.33289880
# 13 7 0.05743798 0.20439928 0.56411860 0.54859270 0.81053637
# 14 8 0.99056584 0.29960046 0.20765701 0.45722997 0.51354034
# 15 9 0.35839568 0.11667019 0.56498996 0.43971051 0.23968955
# 2 10 0.82178167 0.21619067 0.11993356 0.06335101 0.28703842
# 3 11 0.35952632 0.27595845 0.24760335 0.63887200 0.47491472
# 4 12 0.43775624 0.08852486 0.06870304 0.63670202 0.55432641
# 5 13 0.83894086 0.40484966 0.96735507 0.86764578 0.02588688
# 6 14 0.95258399 0.65029909 0.97183605 0.87688243 0.97729517
# 7 15 0.62839615 0.52999000 0.05722874 0.40709867 0.56039580
# 16 A 0.25645249 0.07045102 0.17046681 0.75700118 0.50269449
# 17 B 0.57722865 0.31544398 0.33129932 0.44173772 0.11600295
# 18 C 0.94242373 0.55745376 0.01542128 0.01723924 0.11413310
This also works with base:
y[ order(as.numeric(y$V1)), ]
Edit: the OP then asked (despite having said "I don't really care" ;-)) how to sort the non-numeric fields as well.
The reason the first command works is that the non-numeric fields are all converted to NA, which conveniently puts them after the numbers in a sort. Both dplyr::arrange and base::order accept any number of sort keys, where ties in the first key are broken by the second, and so on. So, in order to sort among the NAs (the non-numeric V1 elements), just add a second key that makes sense for them, such as V1 itself:
arrange(y, as.numeric(V1), V1)
y[ order(as.numeric(y$V1), y$V1), ]
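A tiny base R sketch of the two-key idea, on a made-up character vector rather than the file from the question (suppressWarnings just hides the coercion warning):

```r
v <- c("10", "B", "2", "A", "1")

# Non-numeric entries become NA, and order() puts NAs last.
num <- suppressWarnings(as.numeric(v))

one_key <- v[order(num)]     # letters stay in their original relative order
two_key <- v[order(num, v)]  # ties among the NAs are broken by v itself

one_key
two_key
```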
I recommend making V1 a factor and sorting the levels with a stringr package function before arranging:
library(dplyr)
library(stringr)
y <- tibble(V1 = c("B", "A", 2, 1), V2 = c(1,2,3,4), V3 = c(1,2,3,4))
y %>%
  dplyr::mutate(V1_fac = factor(V1, levels = str_sort(V1, numeric = TRUE))) %>%
  dplyr::arrange(V1_fac)
The numeric = TRUE option makes str_sort order the digits in V1 numerically instead of as strings.
If some entries in V1 are not unique, you might want to use:
y %>%
  dplyr::mutate(V1_fac = factor(V1, levels = str_sort(unique(V1), numeric = TRUE))) %>%
  dplyr::arrange(V1_fac)

Resources