How to sort a matrix by all columns - r

Suppose I have
arr = 2 1 3
1 2 3
1 1 2
How can I sort this into the below?
arr = 1 1 2
1 2 3
2 1 3
That is, first by column one, then by column two etc.

The function you're after is order. (How I arrived at this conclusion: my first thought was "well, sorting, what about sort?". I tried sort(arr), which sorts arr as a vector instead of row-wise. Looking at ?sort, I see under "See Also": "order for sorting on or reordering multiple variables".)
Looking at ?order, I see that order(x,y,z, ...) will order by x, breaking ties by y, breaking further ties by z, and so on. Great - all I have to do is pass in each column of arr to order to do this. (There is even an example for this in the examples section of ?order):
order( arr[,1], arr[,2], arr[,3] )
# gives 3 2 1: row 3 first, then row 2, then row 1.
# Hence:
arr[ order( arr[,1], arr[,2], arr[,3] ), ]
# [,1] [,2] [,3]
#[1,] 1 1 2
#[2,] 1 2 3
#[3,] 2 1 3
Great!
But it is a bit annoying that I have to write out arr[,i] for each column in arr - what if I don't know how many columns it has in advance?
Well, the examples show how you can do this too: using do.call. Basically, you do:
do.call( order, args )
where args is a list of arguments into order. So if you can make a list out of each column of arr then you can use this as args.
One way to do this is to convert arr into a data frame and then into a list -- this will automagically put one column per element of the list:
arr[ do.call( order, as.list(as.data.frame(arr)) ), ]
The as.list(as.data.frame(arr)) is a bit kludgy - there are certainly other ways to create a list such that list[[i]] is the ith column of arr, but this is just one.
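As one possible alternative, here is a minimal sketch that assumes arr is the numeric matrix from the question. It builds the list of columns with split(arr, col(arr)) and wraps it in unname() so that the list names cannot be mistaken for named arguments of order:

```r
# The example matrix from the question
arr <- matrix(c(2, 1, 3,
                1, 2, 3,
                1, 1, 2), nrow = 3, byrow = TRUE)

# col(arr) labels every cell with its column index, so split()
# returns one vector per column
cols <- unname(split(arr, col(arr)))

# Order rows by column 1, then column 2, and so on
sorted <- arr[do.call(order, cols), , drop = FALSE]
sorted
```

The drop = FALSE is only there so the result stays a matrix if arr happens to have a single row.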

This would work:
arr[do.call(order, lapply(1:NCOL(arr), function(i) arr[, i])), ]
What it is doing is:
arr[order(arr[, 1], arr[, 2], arr[ , 3]), ]
except it allows an arbitrary number of columns in the matrix.

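I wrote this little function that also handles decreasing order.
The cols argument lets you choose which columns to sort by and their priority.
ord.mat = function(M, decr = FALSE, cols = NULL) {
  # Default: sort by every column, left to right
  if (is.null(cols))
    cols = seq_len(ncol(M))
  # drop = FALSE keeps a matrix even when a single column is selected;
  # unname() stops column names from being matched to order()'s own arguments
  keys = unname(as.list(as.data.frame(M[, cols, drop = FALSE])))
  out = do.call("order", keys)
  if (decr)
    out = rev(out)
  # drop = FALSE keeps the result a matrix even if M has one row
  return(M[out, , drop = FALSE])
}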
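For numeric matrices, a common way to mix ascending and descending keys is to negate the columns that should be descending. The following is a small sketch on the question's matrix; it is an alternative to the function above, not part of it, and the variable name mixed is just illustrative:

```r
# The example matrix from the question
arr <- matrix(c(2, 1, 3,
                1, 2, 3,
                1, 1, 2), nrow = 3, byrow = TRUE)

# Column 1 ascending; ties broken by column 2 descending
# (negation only works for numeric columns)
mixed <- arr[order(arr[, 1], -arr[, 2]), , drop = FALSE]
mixed
```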

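I had a similar problem, and the solution seems simple and elegant. Note that this sorts the values within each row independently; it does not reorder whole rows as the question asks: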
t(apply(t(yourMatrix),2,sort))
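Running this expression on the example matrix from the question suggests it solves a slightly different problem: each row is sorted on its own, whereas the order-based approach moves whole rows. A quick comparison (the variable names are illustrative):

```r
# The example matrix from the question
arr <- matrix(c(2, 1, 3,
                1, 2, 3,
                1, 1, 2), nrow = 3, byrow = TRUE)

# Sorts the values inside each row independently
row_sorted <- t(apply(t(arr), 2, sort))

# Reorders whole rows by column 1, then 2, then 3
rows_ordered <- arr[order(arr[, 1], arr[, 2], arr[, 3]), ]

row_sorted
rows_ordered
```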

Related

Extracting single row from data.frame without loss of names [duplicate]

I am simply extracting a single row from a data.frame. Consider for example
d=data.frame(a=1:3,b=1:3)
d[1,] # returns a data.frame
# a b
# 1 1 1
The output matched my expectation. The result was not as I expected though when dealing with a data.frame that contains a single column.
d=data.frame(a=1:3)
d[1,] # returns an integer
# [1] 1
Indeed, here the extracted data is no longer a data.frame but an integer! To me, it seems a little strange that the same function on the same data type returns different data types. One of the issues with this conversion is the loss of the column name.
To solve the issue, I did
extractRow = function(d,index)
{
  if (ncol(d) > 1)
  {
    return(d[index,])
  } else
  {
    d2 = as.data.frame(d[index,])
    names(d2) = names(d)
    return(d2)
  }
}
d=data.frame(a=1:3,b=1:3)
extractRow(d,1)
# a b
# 1 1 1
d=data.frame(a=1:3)
extractRow(d,1)
# a
# 1 1
But it seems unnecessarily cumbersome. Is there a better solution?
Just subset with the drop = FALSE option:
extractRow = function(d, index) {
return(d[index, , drop=FALSE])
}
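A short check of what drop = FALSE changes, using the single-column data.frame from the question (the names plain and kept are just for illustration):

```r
d <- data.frame(a = 1:3)

plain <- d[1, ]                # simplified to a bare integer
kept  <- d[1, , drop = FALSE]  # stays a one-row data.frame

is.data.frame(plain)
is.data.frame(kept)
names(kept)                    # the column name survives
```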
R simplifies data.frame subsets by default; the same thing happens with columns:
d[, "a"]
# [1] 1 2 3
Alternatives are:
d[1, , drop = FALSE]
tibble::tibble which has drop = FALSE by default
I can't tell you why that happens - it seems weird. One workaround would be to use slice from dplyr (although using a library seems unnecessary for such a simple task).
library(dplyr)
slice(d, 1)
a
1 1
Data frames will simplify to vectors or scalars with base subsetting [,].
If you want to avoid that, you can use tibbles instead:
> tibble(a=1:2)[1,]
# A tibble: 1 x 1
a
<int>
1 1
tibble(a=1:2)[1,] %>% class
[1] "tbl_df" "tbl" "data.frame"

return indices of duplicated elements corresponding to the unique elements in R

Does anyone know if there's a built-in function in R that can return the indices of duplicated elements corresponding to the unique elements?
For instance I have a vector
a <- ["A","B","B","C","C"]
unique(a) will give ["A","B","C"]
duplicated(a) will give [F,F,T,F,T]
Is there a built-in function that returns a vector of indices, the same length as the original vector a, giving the location of each of a's elements in the unique vector (which is [1,2,2,3,3] in this example)?
i.e., something like the output variable "ic" in the matlab function "unique". (which is, if we let c = unique(a), then a = c(ic,:)).
http://www.mathworks.com/help/matlab/ref/unique.html
Thank you!
We can use match
match(a, unique(a))
#[1] 1 2 2 3 3
Or convert to factor and coerce to integer
as.integer(factor(a, levels = unique(a)))
#[1] 1 2 2 3 3
data
a <- c("A","B","B","C","C")
This should work:
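cumsum( !duplicated( sort( a)) ) # once you replace MATLAB syntax with R syntax; the indices line up with sort(a), so this matches a only when a is already sorted
Or just:
as.numeric(factor(a) ) # indices refer to the sorted unique values, which is the same here because a is already sorted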
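The approaches above agree on the example because a is already sorted. A small sketch with an unsorted vector (my own illustrative input, not from the question) shows where match and factor differ, and how either index vector reconstructs the original:

```r
# Illustrative unsorted input
a <- c("B", "A", "B", "C", "A")

# Indices into unique(a), i.e. order of first appearance
ic_match <- match(a, unique(a))

# Indices into the factor levels, which are sorted
f <- factor(a)
ic_factor <- as.integer(f)

# Both reconstruct a, each against its own lookup vector
identical(unique(a)[ic_match], a)
identical(levels(f)[ic_factor], a)
```

If the goal is to mirror MATLAB's a = c(ic,:) relation against unique(a), the match version is the closer fit, since unique() in R keeps first-appearance order.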

Remove quotes from vector element in order to use it as a value

Suppose that I have a vector x whose elements I want to use to extract columns from a matrix or data frame M.
If x[1] = "A", I cannot use M$x[1] to extract the column with header name A, because M$A is recognized while M$"A" is not. How can I remove the quotes so that M$x[1] is M$A rather than M$"A" in this instance?
Don't use $ in this case; use [ instead. Here's a minimal example (if I understand what you're trying to do).
mydf <- data.frame(A = 1:2, B = 3:4)
mydf
# A B
# 1 1 3
# 2 2 4
x <- c("A", "B")
x
# [1] "A" "B"
mydf[, x[1]] ## As a vector
# [1] 1 2
mydf[, x[1], drop = FALSE] ## As a single column `data.frame`
# A
# 1 1
# 2 2
I think you would find your answer in the R Inferno. Start around Circle 8: "Believing it does as intended", one of the "string not the name" sub-sections. You might also find some explanation in this line from the help page at ?Extract: "The main difference is that $ does not allow computed indices, whereas [[ does."
Note that this approach is taken because the question specified using the approach to extract columns from a matrix or data frame, in which case, the [row, column] mode of extraction is really the way to go anyway (and the $ approach would not work with a matrix).
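To illustrate the point about computed indices, here is a brief sketch using the same mydf and x as above, plus a small matrix m that I made up for the example:

```r
mydf <- data.frame(A = 1:2, B = 3:4)
x <- c("A", "B")

# [[ accepts a computed name, unlike $
col_df <- mydf[[x[1]]]

# The [row, column] form also works on a matrix with column names
m <- cbind(A = 1:2, B = 3:4)
col_m <- m[, x[1]]

col_df
col_m
```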

Add a column of ranks

I have some data:
test <- data.frame(A=c("aaabbb",
"aaaabb",
"aaaabb",
"aaaaab",
"bbbaaa")
)
and so on. All the elements are the same length, and are already sorted before I get them.
I need to make a new column of ranks, "First", "Second", "Third", anything after that can be left blank, and it needs to account for ties. So in the above case, I'd like to get the following output:
A B
aaabbb First
aaaabb Second
aaaabb Second
aaaaab Third
bbbaaa
bbbbaa
I looked at rank() and some other posts that used it, but I wasn't able to get it to do what I was looking for.
How about this:
test$B <- match(test$A , unique(test$A)[1:3] )
test
A B
1 aaabbb 1
2 aaaabb 2
3 aaaabb 2
4 aaaaab 3
5 bbbaaa NA
6 bbbbaa NA
One of many ways to do this. Possibly not the best, but one that readily springs to mind and is fairly intuitive. You can use unique because you receive the data pre-sorted.
As the data are sorted, another suitable function worth considering is rle, although it's slightly more obscure in this example:
rnk <- rle(as.character(test$A))$lengths
rnk
# [1] 1 2 1 1 1
test$B <- c( rep( 1:3 , times = rnk[1:3] ) , rep(NA, sum( rnk[-c(1:3)] ) ) )
rle computes the lengths (and values which we don't really care about here) of runs of equal values in a vector - so again this works because your data are already sorted.
And if you don't have to have blanks after the third ranked item it's even simpler (and more readable):
test$B <- rep(1:length(rnk),times=rnk)
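For readers who have not used rle before, a tiny sketch on a made-up sorted vector shows the two pieces it returns and how the run lengths turn into ranks:

```r
# Illustrative, already sorted input
v <- c("x", "y", "y", "z")

r <- rle(v)
r$lengths   # length of each run of equal values
r$values    # the value repeated in each run

# One rank per run, repeated for the length of that run
ranks <- rep(seq_along(r$lengths), times = r$lengths)
ranks
```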
This seems like a good application for factors:
test$B <- as.numeric(factor(test$A, levels = unique(test$A)))
cumsum also comes to mind, where we add 1 every time the value changes:
test$B <- cumsum(c(TRUE, tail(test$A, -1) != head(test$A, -1)))
(Like @Simon said, there are many ways to do this...)
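The answers above produce numeric ranks, while the question asked for the labels "First", "Second", "Third" and blanks afterwards. One way to get there, sketched under the assumption that the data are pre-sorted as stated (rank_labels and idx are names I introduced, and the sixth value is taken from the expected output in the question):

```r
test <- data.frame(A = c("aaabbb", "aaaabb", "aaaabb",
                         "aaaaab", "bbbaaa", "bbbbaa"))

rank_labels <- c("First", "Second", "Third")

# Position of each value among the distinct values, in order of appearance
idx <- match(test$A, unique(test$A))

# Label the first three ranks, leave everything after that blank
test$B <- ifelse(idx <= length(rank_labels), rank_labels[idx], "")
test
```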

Retain Vector Names as Dataframe Column Names

In my code, I am filling the columns of a dataframe with vectors, like so:
df1[columnNum] <- barWidth
This works fine, except for one thing: I want the name of the vector variable (barWidth above) to be retained as the column header, one column at a time. Furthermore, I do not wish to use cbind. This slows the execution of my code down considerably. Consequently, I am using a pre-allocated dataframe.
Can this be done in the vector-to-column assignment? If not, then how do I change it after the fact? I can't find the right syntax to do this with colNames().
TIA
It's being done by the [<-.data.frame function. It could conceivably be replaced by one that looked at the name of the argument but it's such a fundamental function I would be hesitant. Furthermore there appears to be an aversion to that practice signaled by this code at the top of the function definition:
> `[<-.data.frame`
function (x, i, j, value)
{
if (!all(names(sys.call()) %in% c("", "value")))
warning("named arguments are discouraged")
nA <- nargs()
if (nA == 4L) {
<snipped rest of rather long definition>
I don't know why that is there, but it is. Maybe you should either be thinking about using names<- after the column assignment, or using this method:
> dfrm["barWidth"] <- barWidth
> dfrm
a V2 barWidth
1 a 1 1
2 b 2 2
3 c 3 3
4 d 4 4
This can be generalized to a list of new columns:
dfrm <- data.frame(a=letters[1:4])
barWidth <- 1:4
newcols <- list(barWidth=barWidth, bw2 =barWidth)
dfrm[names(newcols)] <- newcols
dfrm
#
a barWidth bw2
1 a 1 1
2 b 2 2
3 c 3 3
4 d 4 4
If you have the list of names of vectors you want to apply you could do:
namevec <- c(...,"barWidth"...,)
columnNums <- c(...,10,...)
df1[columnNums[i]] <- get(namevec[i])
names(df1)[columnNums[i]] <- namevec[i]
or even
columnNums <- c(barWidth=4,...)
for (i in seq_along(columnNums)) {
df1[columnNums[i]] <- get(names(columnNums)[i])
}
names(df1)[columnNums] <- names(columnNums)
but the deeper question would be where this set of vectors is coming from in the first place: could you have them in a list all along?
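As a rough sketch of the "rename after the assignment" idea with a pre-allocated data frame: the frame shape, the values, and the helper set_col are all hypothetical, introduced only for illustration. The helper uses deparse(substitute()) to pick up the name of the variable passed in.

```r
# Hypothetical pre-allocated data frame with default names X1, X2, X3
df1 <- data.frame(matrix(NA_real_, nrow = 4, ncol = 3))
barWidth <- c(0.5, 1, 1.5, 2)
columnNum <- 2

# Assign by position, then set that one column name explicitly
df1[columnNum] <- barWidth
names(df1)[columnNum] <- "barWidth"

# Hypothetical helper: fills column j and names it after the
# variable that was passed as v
set_col <- function(df, j, v) {
  df[j] <- v
  names(df)[j] <- deparse(substitute(v))
  df
}

barHeight <- c(10, 20, 30, 40)
df1 <- set_col(df1, 3, barHeight)
names(df1)
```

Whether the helper is fast enough for a tight loop is untested here; it copies the data frame on each call, so the explicit two-line version may be preferable where speed matters.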
I'd simply use cbind():
df1 <- cbind( df1, barWidth )
which retains the name. It will, however, end up as the last column in df1.

Resources