R dplyr column sort with alphanumeric characters

Why does dplyr not sort the first column properly when it's composed of numeric and alphabetic characters?
> library(dplyr)
> y <- read.table("file.csv", sep = ",")
> arrange(y, V1)
V1 V2 V3 V4 V5 V6
1 1 0.97348999 0.11047091 0.95841014 0.61826620 0.43164420
2 10 0.82178167 0.21619067 0.11993356 0.06335101 0.28703842
3 11 0.35952632 0.27595845 0.24760335 0.63887200 0.47491472
4 12 0.43775624 0.08852486 0.06870304 0.63670202 0.55432641
5 13 0.83894086 0.40484966 0.96735507 0.86764578 0.02588688
6 14 0.95258399 0.65029909 0.97183605 0.87688243 0.97729517
7 15 0.62839615 0.52999000 0.05722874 0.40709867 0.56039580
8 2 0.22754619 0.16812359 0.39432991 0.68562992 0.43066861
9 3 0.33318220 0.21108688 0.60911213 0.64475379 0.98617404
10 4 0.57208511 0.58709229 0.29435093 0.78603855 0.81185551
11 5 0.35548490 0.15229426 0.42423263 0.72963238 0.04401239
12 6 0.08575802 0.33310521 0.09671737 0.90820671 0.33289880
13 7 0.05743798 0.20439928 0.56411860 0.54859270 0.81053637
14 8 0.99056584 0.29960046 0.20765701 0.45722997 0.51354034
15 9 0.35839568 0.11667019 0.56498996 0.43971051 0.23968955
16 A 0.25645249 0.07045102 0.17046681 0.75700118 0.50269449
17 B 0.57722865 0.31544398 0.33129932 0.44173772 0.11600295
18 C 0.94242373 0.55745376 0.01542128 0.01723924 0.11413310
I'd like to see:
V1 V2 V3 V4 V5 V6
1 1 0.97348999 0.11047091 0.95841014 0.61826620 0.43164420
2 2 0.22754619 0.16812359 0.39432991 0.68562992 0.43066861
3 3 0.33318220 0.21108688 0.60911213 0.64475379 0.98617404
4 4 0.57208511 0.58709229 0.29435093 0.78603855 0.81185551
5 5 0.35548490 0.15229426 0.42423263 0.72963238 0.04401239
6 6 0.08575802 0.33310521 0.09671737 0.90820671 0.33289880
7 7 0.05743798 0.20439928 0.56411860 0.54859270 0.81053637
8 8 0.99056584 0.29960046 0.20765701 0.45722997 0.51354034
9 9 0.35839568 0.11667019 0.56498996 0.43971051 0.23968955
10 10 0.82178167 0.21619067 0.11993356 0.06335101 0.28703842
11 11 0.35952632 0.27595845 0.24760335 0.63887200 0.47491472
12 12 0.43775624 0.08852486 0.06870304 0.63670202 0.55432641
13 13 0.83894086 0.40484966 0.96735507 0.86764578 0.02588688
14 14 0.95258399 0.65029909 0.97183605 0.87688243 0.97729517
15 15 0.62839615 0.52999000 0.05722874 0.40709867 0.56039580
16 A 0.25645249 0.07045102 0.17046681 0.75700118 0.50269449
17 B 0.57722865 0.31544398 0.33129932 0.44173772 0.11600295
18 C 0.94242373 0.55745376 0.01542128 0.01723924 0.11413310

Your disregard of alpha is a bit problematic, but how about:
library(dplyr)
arrange(y, as.numeric(V1))
# Warning in order(as.numeric(y$V1)) : NAs introduced by coercion
# V1 V2 V3 V4 V5 V6
# 1 1 0.97348999 0.11047091 0.95841014 0.61826620 0.43164420
# 8 2 0.22754619 0.16812359 0.39432991 0.68562992 0.43066861
# 9 3 0.33318220 0.21108688 0.60911213 0.64475379 0.98617404
# 10 4 0.57208511 0.58709229 0.29435093 0.78603855 0.81185551
# 11 5 0.35548490 0.15229426 0.42423263 0.72963238 0.04401239
# 12 6 0.08575802 0.33310521 0.09671737 0.90820671 0.33289880
# 13 7 0.05743798 0.20439928 0.56411860 0.54859270 0.81053637
# 14 8 0.99056584 0.29960046 0.20765701 0.45722997 0.51354034
# 15 9 0.35839568 0.11667019 0.56498996 0.43971051 0.23968955
# 2 10 0.82178167 0.21619067 0.11993356 0.06335101 0.28703842
# 3 11 0.35952632 0.27595845 0.24760335 0.63887200 0.47491472
# 4 12 0.43775624 0.08852486 0.06870304 0.63670202 0.55432641
# 5 13 0.83894086 0.40484966 0.96735507 0.86764578 0.02588688
# 6 14 0.95258399 0.65029909 0.97183605 0.87688243 0.97729517
# 7 15 0.62839615 0.52999000 0.05722874 0.40709867 0.56039580
# 16 A 0.25645249 0.07045102 0.17046681 0.75700118 0.50269449
# 17 B 0.57722865 0.31544398 0.33129932 0.44173772 0.11600295
# 18 C 0.94242373 0.55745376 0.01542128 0.01723924 0.11413310
This also works with base:
y[ order(as.numeric(y$V1)), ]
Edit: the OP then asked (despite having said "I don't really care" ;-) how to sort the non-numeric fields as well.
The first command works because the non-numeric fields are all converted to NA, and NA values sort after the numbers. Both dplyr::arrange and base::order accept any number of sort keys, with ties in the first key broken by the second, and so on. So, to sort among the NAs (the non-numeric V1 elements), add a second key that orders them sensibly, such as V1 itself:
arrange(y, as.numeric(V1), V1)
y[ order(as.numeric(y$V1), y$V1), ]
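The two-key idea can be checked on a small vector. This is a minimal base R sketch; the vector v1 is made-up illustration data, not the original file. It assumes V1 is a character column: if it were a factor, as.numeric would return the level codes instead, so as.character would have to be applied first.

```r
# Hypothetical stand-in for the V1 column (not the original data)
v1 <- c("1", "10", "2", "B", "A", "11")

# Letters cannot be parsed as numbers, so they become NA (warning suppressed)
key <- suppressWarnings(as.numeric(v1))

# order() puts NA last by default; ties among the NAs are broken by v1 itself
sorted <- v1[order(key, v1)]
sorted
```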

I recommend making V1 a factor and sorting the levels with a stringr package function before arranging:
> library(dplyr)
> library(stringr)
> y <- tibble(V1 = c("B", "A", 2, 1), V2 =c(1,2,3,4), V3=c(1,2,3,4))
> y %>%
dplyr::mutate(V1_fac = factor(V1, levels= str_sort(V1, numeric=TRUE))) %>%
dplyr::arrange(V1_fac)
The numeric = TRUE option sorts the digits in V1 numerically rather than as strings.
If some entries in V1 are not unique, use unique() when building the levels:
y %>%
dplyr::mutate(V1_fac = factor(V1, levels= str_sort(unique(V1), numeric=TRUE))) %>%
dplyr::arrange(V1_fac)

Related

Find the mean of several indexed lists at index values in R

I have 4 predicted y values presented as an indexed list in R:
> y_a
2 12 15 19 20 22 3 4
26.05434 24.33894 38.57935 37.94003 23.87608 46.20327 18.43043 24.96521
5 8 13 21 1 7 10 11
17.34129 30.41087 28.49836 39.02917 21.96358 30.41087 23.61032 30.41087
16 18
35.31196 35.85652
> y_b
6 9 14 17 23 24 3 4
36.87726 35.30301 40.48044 38.24398 42.67726 41.31053 32.32106 33.81204
5 8 13 21 1 7 10 11
32.07257 35.05451 40.31655 44.74850 38.82558 35.05451 27.80451 35.05451
16 18
36.17274 36.29699
> y_c
6 9 14 17 23 24 2 12
30.24043 35.33617 39.18723 33.63404 42.76170 39.36809 32.25106 24.04894
15 19 20 22 1 7 10 11
39.34681 38.28298 31.01702 43.66596 33.19787 34.71915 27.60213 34.71915
16 18
37.49574 37.80426
> y_d
6 9 14 17 23 24 2 12
26.48159 35.12368 38.41591 31.00840 40.54660 36.01979 31.00840 22.70478
15 19 20 22 3 4 5 8
40.47355 32.72757 29.36229 46.23494 25.24701 30.18534 24.42395 34.30063
13 21
32.72757 33.55063
I would like to create a list that returns the average of the points that share the same index across the lists; in other words, the average of the points at index 2, index 12, index 15, and so on.
> y_mean
2 6 9 12....
26.05434 31.8664 ...... ......
Any ideas on how to do that?
We can collect the vectors in a list, stack each one into a two-column data.frame, rbind the results, and then compute the mean by group:
dat <- do.call(rbind,
lapply(mget(ls(pattern = "^y_[a-z]$")), stack))
aggregate(values ~ ind, dat, FUN = mean)
Or use tapply
with(dat, tapply(values, ind, FUN = mean))
Or if there are only four vectors, just do
v1 <- c(y_a, y_b, y_c, y_d)
tapply(v1, names(v1), FUN = mean)
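As a small check of the tapply approach, here is a sketch with two made-up named vectors (the values are illustrative, not the poster's). Note that tapply groups by the names as strings, so the result comes back in alphabetical order ("10" before "2"); the last step reorders it numerically.

```r
# Hypothetical named vectors standing in for y_a and y_b
y_a <- c("2" = 10, "10" = 20, "5" = 30)
y_b <- c("10" = 40, "5" = 50, "6" = 60)

# Concatenate, then average all values that share a name
v <- c(y_a, y_b)
y_mean <- tapply(v, names(v), FUN = mean)

# Reorder the result by the numeric value of the index
y_mean <- y_mean[order(as.numeric(names(y_mean)))]
y_mean
```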

`rbind` elements with same index in a list

I have a list of three lists, each of which contains two data.frames, like the following:
d <- list(list(data.frame(height = rnorm(10), weight = runif(10)), data.frame(nr = rnorm(10), qr = rchisq(10, 10))),
list(data.frame(height = rnorm(6), weight = runif(6)), data.frame(nr = rnorm(6), qr = rchisq(6, 10))),
list(data.frame(height = rnorm(8), weight = runif(8)), data.frame(nr = rnorm(8), qr = rchisq(8, 10))))
[[1]]
[[1]][[1]]
height weight
1 -0.49424331 0.023996582
2 -0.80320654 0.029460558
3 -0.89797434 0.932508002
4 -0.25267069 0.790625104
5 0.27474082 0.859495769
6 0.14285128 0.009731295
7 -0.86224008 0.343969165
8 0.07358127 0.106006154
9 -1.61474408 0.302890840
10 2.23920173 0.133115944
[[1]][[2]]
nr qr
1 -1.24342871 10.278033
2 1.37520549 13.246929
3 -0.06046197 6.267480
4 0.73643661 14.084240
5 -0.01897590 1.323470
6 1.10877385 11.739945
7 -1.09511298 10.616714
8 -1.03525533 9.992008
9 -0.04301281 12.943073
10 -0.79446848 8.670066
[[2]]
[[2]][[1]]
height weight
1 -2.7323741 0.8825884
2 0.4745896 0.1813869
3 0.9158570 0.9660507
4 0.8927806 0.1156805
5 -0.8443665 0.3079322
6 -0.4703602 0.2345349
[[2]][[2]]
nr qr
1 -1.6915651 9.532319
2 -1.9810859 15.145930
3 -0.4890531 10.013549
4 0.2163449 13.407265
5 1.0770555 6.676846
6 -0.5431102 7.688177
[[3]]
[[3]][[1]]
height weight
1 -1.2671410 0.48468152
2 -0.7792946 0.04499799
3 -0.6976782 0.10917336
4 0.8274744 0.69698260
5 -0.9456592 0.64183451
6 -1.2882436 0.29868696
7 0.6424889 0.86165232
8 -0.8255187 0.16430852
[[3]][[2]]
nr qr
1 -0.4160331 5.341376
2 -1.0321303 8.947948
3 -0.3380597 7.937599
4 -2.1520878 11.740298
5 -0.8979710 2.393419
6 -1.1172138 11.780884
7 -1.0309391 2.673642
8 0.8822399 12.351724
I want to transform it so that all the data.frames with (height, weight) columns are row-bound together and all the (nr, qr) data.frames are row-bound together. In other words, the first elements of the inner lists should be bound together, and the second elements of the inner lists should be bound together.
The expected output is another list containing two data.frames, like the following:
[[1]]
height weight
1 -0.49424331 0.023996582
2 -0.80320654 0.029460558
3 -0.89797434 0.932508002
4 -0.25267069 0.790625104
5 0.27474082 0.859495769
6 0.14285128 0.009731295
7 -0.86224008 0.343969165
8 0.07358127 0.106006154
9 -1.61474408 0.302890840
10 2.23920173 0.133115944
11 -2.7323741 0.8825884
12 0.4745896 0.1813869
13 0.9158570 0.9660507
14 0.8927806 0.1156805
15 -0.8443665 0.3079322
16 -0.4703602 0.2345349
17 -1.2671410 0.48468152
18 -0.7792946 0.04499799
19 -0.6976782 0.10917336
20 0.8274744 0.69698260
21 -0.9456592 0.64183451
22 -1.2882436 0.29868696
23 0.6424889 0.86165232
24 -0.8255187 0.16430852
[[2]]
nr qr
1 -1.24342871 10.278033
2 1.37520549 13.246929
3 -0.06046197 6.267480
4 0.73643661 14.084240
5 -0.01897590 1.323470
6 1.10877385 11.739945
7 -1.09511298 10.616714
8 -1.03525533 9.992008
9 -0.04301281 12.943073
10 -0.79446848 8.670066
11 -1.6915651 9.532319
12 -1.9810859 15.145930
13 -0.4890531 10.013549
14 0.2163449 13.407265
15 1.0770555 6.676846
16 -0.5431102 7.688177
17 -0.4160331 5.341376
18 -1.0321303 8.947948
19 -0.3380597 7.937599
20 -2.1520878 11.740298
21 -0.8979710 2.393419
22 -1.1172138 11.780884
23 -1.0309391 2.673642
24 0.8822399 12.351724
This should do it.
dd <- list(do.call(rbind, lapply(d, "[[", 1)), do.call(rbind, lapply(d, "[[", 2)))
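If the inner lists hold more than two data.frames, the same pattern can be looped over the positions instead of being written out by hand. A minimal sketch, using tiny made-up data.frames and assuming every inner list has the same length and matching column layouts at each position:

```r
# Hypothetical nested list: two inner lists, two data.frames each
d <- list(
  list(data.frame(a = 1:2), data.frame(b = 1:3)),
  list(data.frame(a = 3:4), data.frame(b = 4:5))
)

# For each position i, pull the i-th data.frame from every inner list and rbind them
dd <- lapply(seq_along(d[[1]]), function(i) {
  do.call(rbind, lapply(d, `[[`, i))
})
dd
```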

Strange behaviour of gtools::mixedorder combined with dplyr::arrange?

I noticed an unexpected outcome while using dplyr::arrange together with gtools::mixedorder.
Consider:
library(tidyverse)
test <- data.frame(V1 = c("all13_LG1", "all13_LG10", "all13_LG11",
"all13_LG12", "all13_LG13", "all13_LG14", "all13_LG15", "all13_LG16",
"all13_LG2", "all13_LG3", "all13_LG4", "all13_LG5", "all13_LG6",
"all13_LG7", "all13_LG8", "all13_LG9"),
V2 = c(rep(1:16)))
test2 <- test %>% arrange(gtools::mixedorder(V1))
test3 <- test %>% slice(gtools::mixedorder(V1))
In test2 the 1st column is sorted: "all13_LG1", "all13_LG3", "all13_LG4", "all13_LG5", "all13_LG6", "all13_LG7", "all13_LG8", "all13_LG9", "all13_LG10", "all13_LG11", "all13_LG12", "all13_LG13", "all13_LG14", "all13_LG15", "all13_LG16", "all13_LG2"
Whereas in test3, the rows are sorted as one would expect when using gtools::mixedorder.
Why is this happening when I combine arrange and mixedorder? Is this a bug?
Many thanks,
Anneke
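mixedorder returns a permutation of row indices (like order), whereas arrange treats its argument as a sort key for each row. To use the result of mixedorder in arrange, wrap it in order, which turns the permutation into each row's rank: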
library(dplyr)
test %>% arrange(order(gtools::mixedorder(V1)))
# V1 V2
#1 all13_LG1 1
#2 all13_LG2 9
#3 all13_LG3 10
#4 all13_LG4 11
#5 all13_LG5 12
#6 all13_LG6 13
#7 all13_LG7 14
#8 all13_LG8 15
#9 all13_LG9 16
#10 all13_LG10 2
#11 all13_LG11 3
#12 all13_LG12 4
#13 all13_LG13 5
#14 all13_LG14 6
#15 all13_LG15 7
#16 all13_LG16 8
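The difference between a permutation and a sort key can be seen with base order alone, which behaves like mixedorder in this respect. A minimal sketch on a made-up three-element vector:

```r
x <- c("b", "c", "a")

# A permutation: positions of the elements in sorted order (what mixedorder returns)
perm <- order(x)

# Ordering the permutation gives the rank of each element, usable as a sort key
rnk <- order(perm)

by_index    <- x[perm]        # indexing with the permutation, as slice() does
by_perm_key <- x[order(perm)] # sorting with the permutation as key, as arrange() does
by_rank_key <- x[order(rnk)]  # sorting with the ranks as key
```

Here by_index and by_rank_key are both sorted, while by_perm_key is not, which mirrors the test2 result in the question.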
We can slice with mixedsort and match
library(dplyr)
test %>%
slice(match(gtools::mixedsort(V1), V1))
Output:
V1 V2
1 all13_LG1 1
2 all13_LG2 9
3 all13_LG3 10
4 all13_LG4 11
5 all13_LG5 12
6 all13_LG6 13
7 all13_LG7 14
8 all13_LG8 15
9 all13_LG9 16
10 all13_LG10 2
11 all13_LG11 3
12 all13_LG12 4
13 all13_LG13 5
14 all13_LG14 6
15 all13_LG15 7
16 all13_LG16 8

How to swap values between variables in a data frame by column name in R

I would like to swap values between columns of the following data set by column name. For example, for rows 10 to 15, swap the values of beta0_C1 and beta0_C2 and leave all other values unchanged. Likewise, for rows 10 to 15, swap beta1_C1 with beta1_C2, beta2_C1 with beta2_C2, and beta3_C1 with beta3_C2.
beta0_C1 beta1_C1 beta2_C1 beta3_C1 beta0_C2 beta1_C2 beta2_C2
1 6.010537 0.2826006 0.001931834 -0.0014162495 6.862525 -0.7267671 0.12065368
2 6.182425 0.1633226 0.025748699 -0.0028515529 6.780775 -0.6686269 0.10548767
3 6.222667 0.1109463 0.036438064 -0.0034054813 6.891512 -0.7372192 0.11895311
4 5.980246 0.3095103 -0.002670511 -0.0011975572 6.677035 -0.5774936 0.08990028
5 6.146192 0.1661733 0.024968028 -0.0027346213 6.881571 -0.7439543 0.11835484
6 6.056259 0.2374753 0.010833872 -0.0019526540 7.094971 -0.8504940 0.13648015
7 6.051281 0.2265750 0.017030676 -0.0024138722 6.829044 -0.7180662 0.12121844
8 5.911484 0.3628966 -0.014161483 -0.0005192893 6.784079 -0.6090060 0.09075940
9 5.956709 0.3486160 -0.011525364 -0.0006776760 6.934137 -0.7821656 0.12996924
10 6.010721 0.2821788 0.002475369 -0.0014508507 6.810553 -0.7140603 0.12471406
11 6.021261 0.3180654 -0.004986709 -0.0010968281 6.708342 -0.6259794 0.10697798
12 6.171459 0.2020801 0.015380862 -0.0021379484 6.592252 -0.5040888 0.07813420
13 6.103334 0.2432321 0.010022319 -0.0019386513 6.831204 -0.6854066 0.11129609
14 5.989656 0.3026038 -0.003007319 -0.0011073984 6.782081 -0.6822204 0.10769549
15 6.024628 0.2786942 0.001861784 -0.0014022176 6.864881 -0.7299905 0.12030466
16 6.023082 0.2707312 0.008308583 -0.0019947781 6.850565 -0.7136916 0.11551886
17 5.988829 0.3267394 -0.007576506 -0.0008493887 6.882956 -0.7739330 0.13467615
18 6.072949 0.2744519 0.002846329 -0.0014917373 6.886863 -0.7853582 0.13512568
19 6.030894 0.2693881 0.006378019 -0.0017875603 6.842824 -0.7238131 0.11835479
20 6.197286 0.1311579 0.036005746 -0.0035338268 6.807729 -0.6549960 0.10400631
beta3_C2
1 -0.005112708
2 -0.003982831
3 -0.004824895
4 -0.003356916
5 -0.004724677
6 -0.005657009
7 -0.005200557
8 -0.003065364
9 -0.005408715
10 -0.005551546
11 -0.004516814
12 -0.002726879
13 -0.004493288
14 -0.004053661
15 -0.004913402
16 -0.004609239
17 -0.006101912
18 -0.005945182
19 -0.004801623
20 -0.004151904
Any help is appreciated.
Given this input :
(df1 <- as.data.frame(matrix(1:12, ncol = 3)))
# V1 V2 V3
#1 1 5 9
#2 2 6 10
#3 3 7 11
#4 4 8 12
You can use rev
df1[1:2, c("V2", "V3")] <- rev(df1[1:2, c("V2", "V3")])
Result
df1
# V1 V2 V3
#1 1 9 5
#2 2 10 6
#3 3 7 11
#4 4 8 12
Written as a function of rows and cols
f <- function(data, rows, cols) {
data[rows, cols] <- rev(data[rows, cols])
data
}
f(df1, 1:2, c("V2", "V3"))
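The rev approach swaps exactly one pair per call; passing four columns at once would reverse all of them and pair beta0_C1 with beta1_C2. For several pairs in one go, a possible variant is sketched below. The function name swap_pairs and its arguments left and right are hypothetical, and the data frame is made-up illustration data with the question's column naming.

```r
# Hypothetical helper: swap the columns in `left` with the matching columns
# in `right` (same position in both vectors) for the given rows
swap_pairs <- function(data, rows, left, right) {
  tmp <- data[rows, left]
  data[rows, left] <- data[rows, right]
  data[rows, right] <- tmp
  data
}

# Made-up data using the question's naming scheme
df <- data.frame(beta0_C1 = 1:4, beta1_C1 = 5:8,
                 beta0_C2 = 11:14, beta1_C2 = 15:18)

out <- swap_pairs(df, 2:3,
                  left  = c("beta0_C1", "beta1_C1"),
                  right = c("beta0_C2", "beta1_C2"))
out
```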

Replacing each value in a vector with its rank number for a data.frame

In this hypothetical scenario, I have performed 5 different analyses on 13 chemicals, resulting in a score assigned to each chemical within each analysis. I have created a table as follows:
---- Analysis1 Analysis2 Analysis3 Analysis4 Analysis5
Chem_1 3.524797844 4.477695034 4.524797844 4.524797844 4.096698498
Chem_2 2.827511555 3.827511555 3.248136118 3.827511555 3.234398548
Chem_3 2.682144761 3.474646298 3.017780505 3.682144761 3.236152242
Chem_4 2.134137304 2.596921333 2.95181339 2.649076603 2.472875191
Chem_5 2.367736454 3.027814219 2.743137896 3.271122346 2.796607809
Chem_6 2.293110565 2.917318708 2.724156207 3.293110565 2.530967343
Chem_7 2.475709113 3.105794018 2.708222528 3.475709113 3.088819908
Chem_8 2.013451822 2.259454085 2.683273938 2.723554966 2.400976121
Chem_9 2.345123123 3.050074893 2.682845391 3.291851228 2.700844104
Chem_10 2.327658894 2.848729452 2.580415233 3.327658894 2.881490893
Chem_11 2.411243882 2.98131398 2.554456095 3.411243882 3.109205453
Chem_12 2.340778276 2.576860244 2.549707035 3.340778276 3.236545826
Chem_13 2.394698249 2.90682524 2.542599327 3.394698249 3.12936843
I would like to create columns corresponding to each analysis which contain the rank position for each chemical. For instance, under Analysis1, Chem_1 would have value "1", Chem_2 would have value "2", Chem_3 would have value "3", Chem_7 would have value "4", Chem_11 would have value "5", and so on.
We can use dense_rank from dplyr
library(dplyr)
# across() replaces the deprecated mutate_each(funs(...)) idiom
df %>%
mutate(across(where(is.numeric), ~ dense_rank(desc(.x))))
In base R, we can do
df[] <- lapply(-df, rank, ties.method="min")
In data.table, we can use
library(data.table)
setDT(df)[, lapply(-.SD, frank, ties.method="dense")]
To avoid the copies from multiplying with -, as @Arun mentioned in the comments
lapply(.SD, frankv, order=-1L, ties.method="dense")
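The tie-handling choices differ only when values repeat. A minimal base R sketch on a made-up two-column data frame (not the poster's table) showing descending "min" ranks next to dense ranks; the dense version is built with match on the sorted unique values, which is one possible base R equivalent of dense_rank rather than a built-in option of rank:

```r
# Made-up scores with a tie in Analysis1
df <- data.frame(Analysis1 = c(3.5, 2.8, 2.7, 2.8),
                 Analysis2 = c(4.4, 3.8, 3.9, 2.5))

# Descending ranks; tied values share the lowest rank and the next rank is skipped
min_ranks <- as.data.frame(lapply(-df, rank, ties.method = "min"))

# Descending dense ranks; tied values share a rank and no rank is skipped
dense_ranks <- as.data.frame(lapply(df, function(x) {
  match(x, sort(unique(x), decreasing = TRUE))
}))

min_ranks
dense_ranks
```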
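You can also do this in base R with order. Note that order returns, for each position, the row index of the chemical in that place, not the rank of each chemical:
cbind("..." = df[,1], data.frame(do.call(cbind,
lapply(df[,-1], order, decreasing = TRUE))))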
... Analysis1 Analysis2 Analysis3 Analysis4 Analysis5
1 Chem_1 1 1 1 1 1
2 Chem_2 2 2 2 2 12
3 Chem_3 3 3 3 3 3
4 Chem_4 7 7 4 7 2
5 Chem_5 11 9 5 11 13
6 Chem_6 13 5 6 13 11
7 Chem_7 5 11 7 12 7
8 Chem_8 9 6 8 10 10
9 Chem_9 12 13 9 6 5
10 Chem_10 10 10 10 9 9
11 Chem_11 6 4 11 5 6
12 Chem_12 4 12 12 8 4
13 Chem_13 8 8 13 4 8
If I'm not mistaken, you want to have the column-wise rank of your table. Here is my solution:
m=data.matrix(df) # converts data frame to matrix, convert your data to matrix accordingly
apply(m, 2, function(c) rank(c)) # increasingly
apply(m, 2, function(c) rank(-c)) # decreasingly
However, I believe you could solve it by yourself with the help of the answers to this question
Get rank of matrix entries?
