R sorts a vector on its own accord

df.sorted <- c("binned_walker1_1.grd", "binned_walker1_2.grd", "binned_walker1_3.grd",
"binned_walker1_4.grd", "binned_walker1_5.grd", "binned_walker1_6.grd",
"binned_walker2_1.grd", "binned_walker2_2.grd", "binned_walker3_1.grd",
"binned_walker3_2.grd", "binned_walker3_3.grd", "binned_walker3_4.grd",
"binned_walker3_5.grd", "binned_walker4_1.grd", "binned_walker4_2.grd",
"binned_walker4_3.grd", "binned_walker4_4.grd", "binned_walker4_5.grd",
"binned_walker5_1.grd", "binned_walker5_2.grd", "binned_walker5_3.grd",
"binned_walker5_4.grd", "binned_walker5_5.grd", "binned_walker5_6.grd",
"binned_walker6_1.grd", "binned_walker7_1.grd", "binned_walker7_2.grd",
"binned_walker7_3.grd", "binned_walker7_4.grd", "binned_walker7_5.grd",
"binned_walker8_1.grd", "binned_walker8_2.grd", "binned_walker9_1.grd",
"binned_walker9_2.grd", "binned_walker9_3.grd", "binned_walker9_4.grd",
"binned_walker10_1.grd", "binned_walker10_2.grd", "binned_walker10_3.grd")
One would expect the order of this vector to be 1:length(df.sorted), but that appears not to be the case. It looks like R internally sorts the vector according to its own logic but tries really hard to display it the way it was created (as seen in the output).
order(df.sorted)
[1] 37 38 39 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22
[26] 23 24 25 26 27 28 29 30 31 32 33 34 35 36
Is there a way to "reset" the ordering to 1:length(df.sorted)? That way, ordering, and the output of the vector would be in sync.

Use the mixedsort (or mixedorder) functions in package gtools:
require(gtools)
mixedorder(df.sorted)
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27
[28] 28 29 30 31 32 33 34 35 36 37 38 39
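As a quick sketch of the difference on a shortened vector (assuming gtools is installed):

```r
library(gtools)

v <- c("walker1.grd", "walker2.grd", "walker10.grd")
order(v)       # plain character comparison ranks "walker10.grd" before "walker2.grd"
mixedorder(v)  # natural order: embedded numbers are compared as numbers
mixedsort(v)   # the sorted vector itself
```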

Construct it as an ordered factor:
> df.new <- ordered(df.sorted,levels=df.sorted)
> order(df.new)
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 ...
EDIT:
After @DWin's comment, I want to add that it is not even necessary to make it an ordered factor; a plain factor is enough if you give the right order of levels:
> df.new2 <- factor(df.sorted,levels=df.sorted)
> order(df.new2)
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 ...
The difference will be noticeable when you use those factors in a regression analysis, as they can be treated differently. The advantage of ordered factors is that they let you use comparison operators such as < and >. This sometimes makes life a lot easier.
> df.new2[5] < df.new2[10]
[1] NA
Warning message:
In Ops.factor(df.new2[5], df.new2[10]) : < not meaningful for factors
> df.new[5] < df.new[10]
[1] TRUE

Isn't this simply the same thing you get with all lexicographic sorts (as e.g. ls on directories), where walker10_foo sorts higher than walker1_foo?
The easiest way around this, in my book, is to use a consistent number of digits, i.e. I would change to binned_walker01_1.grd and so on, inserting a 0 for the one-digit counts.
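If you control how the file names are generated, sprintf can do the padding up front (a sketch; %02d assumes walker numbers stay below 100):

```r
# zero-padded names sort correctly under plain lexicographic order
walkers <- c(1, 2, 10)
sprintf("binned_walker%02d_1.grd", walkers)
# "binned_walker01_1.grd" "binned_walker02_1.grd" "binned_walker10_1.grd"
```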

In response to @DWin's comment on Dirk's answer: the data are always putty in your hands. "This is R. There is no if. Only how." -- Simon Blomberg
You can add 0 like so:
df.sorted <- gsub("(walker)([[:digit:]]{1}_)", "\\10\\2", df.sorted)
If you needed to add 00, you would do it like this:
df.sorted <- gsub("(walker)([[:digit:]]{1}_)", "\\10\\2", df.sorted)
df.sorted <- gsub("(walker)([[:digit:]]{2}_)", "\\10\\2", df.sorted)
...and so on.


rpart -- number of splits

Using printcp I got output resembling the following (this is only a portion):
CP nsplit rel error xerror xstd
1 3.254666e-01 0 1.0000000 1.0000000 0.003976889
2 5.395058e-02 1 0.6745334 0.6745334 0.003567289
3 4.125633e-02 3 0.5666322 0.5878145 0.003401065
4 1.726150e-02 4 0.5253759 0.5492028 0.003317552
5 1.222830e-02 7 0.4735914 0.4925069 0.003183022
6 1.193864e-02 10 0.4364909 0.4744730 0.003137010
7 9.243634e-03 12 0.4126137 0.4489081 0.003068901
8 5.238899e-03 13 0.4033700 0.4277007 0.003009687
9 3.878800e-03 14 0.3981311 0.4183311 0.002982702
10 3.664710e-03 16 0.3903735 0.4115054 0.002962714
11 3.261718e-03 18 0.3830441 0.4098935 0.002957953
12 2.934287e-03 20 0.3765207 0.4063421 0.002947406
13 2.871320e-03 24 0.3647835 0.4044783 0.002941839
14 2.770571e-03 25 0.3619122 0.4000201 0.002928437
15 2.052742e-03 26 0.3591416 0.3973503 0.002920351
16 1.989774e-03 28 0.3550361 0.3924892 0.002905511
17 1.813465e-03 29 0.3530464 0.3911795 0.002901486
18 1.763091e-03 30 0.3512329 0.3880563 0.002891845
19 1.737904e-03 31 0.3494698 0.3863688 0.002886609
20 1.674936e-03 32 0.3477319 0.3832708 0.002876947
21 1.670739e-03 35 0.3422915 0.3830693 0.002876317
22 1.662343e-03 39 0.3355666 0.3827167 0.002875212
23 1.653947e-03 40 0.3339042 0.3824900 0.002874502
Which value shows the total number of splits in the tree -- nsplit, or the largest index (left-most column)? (I.e., 23 or 40?)
The table you are seeing from the printcp function is the $cptable object from your CART model. The "nsplit" column indeed shows the number of splits.
So you can get the total number of splits in the tree with
max(carttree$cptable[,"nsplit"])
where carttree is the name of your CART tree.
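A minimal sketch on the kyphosis data that ships with rpart (the model itself is just for illustration):

```r
library(rpart)

fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis)
printcp(fit)                  # prints the $cptable in the format shown above
max(fit$cptable[, "nsplit"])  # total number of splits in the fully grown tree
</imports>
```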

How can I create a matrix in R with random numbers that are unique within each row but may repeat across rows?

How can I create a matrix with random numbers, sampled without replacement within each row (repeats across rows are fine), like this:
5 29 24 20 31 33
2 18 35 4 11 21
30 40 22 14 2 28
33 14 4 18 5 10
10 33 15 2 28 18
7 22 9 25 31 20
12 29 31 22 37 26
7 31 34 28 19 23
7 34 11 6 31 28
my code :
matrix(sample(1:42, 60, replace = FALSE), ncol = 6)
But I receive this error message:
Error in sample.int(length(x), size, replace, prob) : cannot take a
sample larger than the population when 'replace = FALSE'
I see why: with only the numbers 1 to 42, I can't draw 60 values without replacement.
You cannot generate all 60 numbers with one sample call, since you want to allow numbers to repeat across rows. Therefore you have to do one sample per row. @Jav provided very neat code to accomplish this in a comment on the question:
t(sapply(1:10, function(x) sample(1:42, 6, replace = FALSE)))
If you want a different sample in each row, then replicate can help you -- but replicate (like pretty much everything else in R) works naturally column-wise, so you have to transpose the result:
t(replicate(10, sample(1:42, 6)))
replace = FALSE is the default for sample, so I didn't include it
after transposing, 10 becomes the number of rows and 6 the number of columns
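A quick check that the result has the right shape and no repeats within any row (values may still repeat across rows):

```r
set.seed(1)  # for reproducibility only
m <- t(replicate(10, sample(1:42, 6)))
dim(m)                                # 10 rows, 6 columns
all(apply(m, 1, anyDuplicated) == 0)  # TRUE: each row is duplicate-free
```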

How do I get rid of commas and periods, etc in R? [duplicate]

This is my data set:
Depth.Fe
1 0,14.21
2 3,19.35
3 10,17.22
4 14,15.87
5 23,13.62
6 30,16.31
7 36,14.13
8 48,13.95
9 59,15
10 66,14.23
11 68,16.81
12 81,15.93
13 94,16.02
14 96,17.85
15 102,17.02
16 115,15.87
17 121,19.84
18 130,16.94
19 163,16.72
20 168,19.2
21 205,20.41
22 239,16.88
23 251,18.74
24 283,16.67
25 297,18.56
26 322,18.87
27 335,20.81
28 351,24.52
29 370,25.03
30 408,25.11
31 416,23.28
32 419,22.56
33 425,19
34 429,20.53
35 443,19.08
36 447,22.83
37 465,21.06
38 474,24.96
39 493,19.12
40 502,22.24
41 522,26.88
42 550,21.15
43 558,28.92
44 571,27.96
45 586,25.03
46 596,26.27
I want Depth and Fe to be separated into individual columns, but nothing I try is working.
Please help.
First of all, @akrun is definitely right in his comment on your post. If this is a dataset imported from somewhere, then follow his comment.
Assuming that somehow you were handed this weird dataset, I would try this:
df <- data.frame(matrix(as.numeric(unlist(strsplit(df$Depth.Fe, split = ","))), ncol = 2, byrow = TRUE), stringsAsFactors = FALSE)
colnames(df) <- c("Depth","Fe")
This would take a dataset that looks like this:
Depth.Fe
1 0,14.21
2 3,19.35
to this:
Depth Fe
1 0 14.21
2 3 19.35
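Since each entry is really a two-field CSV record, another sketch is to hand the column straight to read.csv (this assumes df$Depth.Fe is a character vector; wrap it in as.character first if it is a factor):

```r
df <- data.frame(Depth.Fe = c("0,14.21", "3,19.35", "10,17.22"),
                 stringsAsFactors = FALSE)
# each element is parsed as one comma-separated line
out <- read.csv(text = df$Depth.Fe, header = FALSE,
                col.names = c("Depth", "Fe"))
out$Depth  # 0 3 10
out$Fe     # 14.21 19.35 17.22
```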

R efficiently add up tables in different order

At some point in my code, I get a list of tables that looks much like this:
[[1]]
cluster_size start end number p_value
13 2 12 13 131 4.209645e-233
12 1 12 12 100 6.166824e-185
22 11 12 22 132 6.916323e-143
23 12 12 23 133 1.176194e-139
13 1 13 13 31 3.464284e-38
13 68 13 117 34 3.275941e-37
23 78 23 117 2 4.503111e-32
....
[[2]]
cluster_size start end number p_value
13 2 12 13 131 4.209645e-233
12 1 12 12 100 6.166824e-185
22 11 12 22 132 6.916323e-143
23 12 12 23 133 1.176194e-139
13 1 13 13 31 3.464284e-38
....
While I don't show the full tables here, I know they are all the same size. What I want to do is make one table where I add up the p-values. The problem is that the $cluster_size, $start, $end and $number columns don't necessarily correspond to the same row across list elements, so I can't just do a simple sum.
The brute-force way to do this is to: 1) make a blank table, 2) copy in the appropriate $cluster_size, $start, $end and $number columns from the first table, and 3) pull the matching p-values from all the tables using which() statements. Is there a cleverer way of doing this? Or is this pretty much it?
Edit: I was asked for a dput file of the data. It's located here:
http://alrig.com/code/
In the sample case, the order of the rows happens to match. That will not always be the case.
Seems like you can do this in two steps:
Convert your list to a data.frame
Use any of the split-apply-combine approaches to summarize.
Assuming your data was named X, here's what you could do:
library(plyr)
#need to convert to data.frame since all of your list objects are of class matrix
XDF <- as.data.frame(do.call("rbind", X))
ddply(XDF, .(cluster_size, start, end, number), summarize, sump = sum(p_value))
#-----
cluster_size start end number sump
1 1 12 12 100 5.550142e-184
2 1 13 13 31 3.117856e-37
3 1 22 22 1 9.000000e+00
...
29 105 23 117 2 6.271469e-16
30 106 22 146 13 7.266746e-25
31 107 23 146 12 1.382328e-25
Lots of other aggregation techniques are covered here. I'd look at data.table package if your data is large.
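The same summary is also available in base R via aggregate; a self-contained sketch with made-up numbers (the real XDF comes from the do.call("rbind", X) step above):

```r
# toy stand-in for the stacked data frame
XDF <- data.frame(cluster_size = c(13, 13, 12, 12),
                  start        = c(2, 2, 1, 1),
                  end          = c(13, 13, 12, 12),
                  number       = c(131, 131, 100, 100),
                  p_value      = c(1e-10, 2e-10, 3e-5, 4e-5))

# sum p_value within each unique (cluster_size, start, end, number) key
res <- aggregate(p_value ~ cluster_size + start + end + number,
                 data = XDF, FUN = sum)
res  # one row per key, p_value summed across the original tables
```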

Multiple unions

I am trying to do unions on several lists (these are actually GRanges objects, not integer lists, but the principle is the same), basically one big union.
x<-sort(sample(1:20, 9))
y<-sort(sample(10:30, 9))
z<-sort(sample(20:40, 9))
mylists<-c("x","y","z")
emptyList<-list()
sapply(mylists,FUN=function(x){emptyList<-union(emptyList,get(x))})
That is just returning the list contents.
I need the equivalent of
union(x,union(y,z))
[1] 2 3 5 6 7 10 13 15 20 14 19 21 24 27 28 29 26 31 36 39
but written in an extensible and non-"variable explicit" form
A not necessarily memory efficient paradigm that will work with GRanges is
Reduce(union, list(x, y, z))
The argument might also be a GRangesList(x, y, z) for appropriate values of x etc.
x<-sort(sample(1:20, 9))
y<-sort(sample(10:30, 9))
z<-sort(sample(20:40, 9))
Both of the below produce the same output
unique(c(x,y,z))
[1] 1 2 4 6 7 8 11 15 17 14 16 18 21 23 26 28 29 20 22 25 31 32 35
union(x,union(y,z))
[1] 1 2 4 6 7 8 11 15 17 14 16 18 21 23 26 28 29 20 22 25 31 32 35
unique(unlist(mget(mylists, globalenv())))
will do the trick. (Possibly changing the environment given in the call to mget, as required.)
I think it would be cleaner to separate the "dereference" part from the n-ary union part, e.g.
dereflist <- function(l) lapply(l, get)
nunion <- function(l) Reduce(union,l)
But if you look at how union works, you'll see that you could also do
nunion <- function(l) unique(do.call(c,l))
which is faster in all the cases I've tested (much faster for long lists).
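A quick check that the two n-ary formulations agree up to element order:

```r
set.seed(7)
x <- sort(sample(1:20, 9))
y <- sort(sample(10:30, 9))
z <- sort(sample(20:40, 9))

a <- Reduce(union, list(x, y, z))
b <- unique(do.call(c, list(x, y, z)))
setequal(a, b)  # TRUE: same elements, possibly in a different order
```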
This can be done by using the reduce function in the purrr package.
purrr::reduce(list(x, y, z),union)
OK, this works, but I am curious why sapply seems to have its own scope:
x<-sort(sample(1:20, 9))
y<-sort(sample(10:30, 9))
z<-sort(sample(20:40, 9))
mylists<-c("x","y","z")
emptyList<-vector()
for(f in mylists){emptyList<-union(emptyList,get(f))}
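For what it's worth, sapply does not have special scoping rules: the <- inside the anonymous function assigns to a variable local to that function call, which is discarded when the function returns, so the outer emptyList is never updated. Accumulating via Reduce sidesteps assignment altogether; a sketch:

```r
x <- sort(sample(1:20, 9))
y <- sort(sample(10:30, 9))
z <- sort(sample(20:40, 9))
mylists <- c("x", "y", "z")

# Reduce threads the running union through the calls explicitly,
# so nothing needs to be written back to an outer environment
result <- Reduce(union, lapply(mylists, get))
setequal(result, union(x, union(y, z)))  # TRUE
```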
