Sort dataframe with multiple columns for multiple years - r

I have a data.frame with multiple columns, the first being Year. I want to sort the values in descending order within each year (that is, within each row). I have fifteen years of data and over 3000 columns.
To illustrate:
Year A B C D
2000 2 3 4 NA
2001 3 4 NA 1
Desired output (my data frame has NAs as well, but I cannot remove those):
Year C B A
2000 4 3 2
Year B A D
2001 4 3 1
And this version as well:
Year
2000 C B A
2001 B A D
I have scripted this code
Asc <-order(df[-1], decreasing=True)
But I'm unable to obtain my desired output. I have looked at "In R sort row data in ascending order", but it is still different from what I'm looking for.
Would appreciate your help in this regard.

We can use apply with MARGIN=1. We loop through the rows of the dataset (excluding the first column) with apply, get the index of non-NA elements ('i1'), order the non-NA values descendingly ('i2'), and use that to rearrange the column names of the dataset.
m1 <- t(apply(df1[-1], 1, function(x) {
  i1 <- !is.na(x)
  i2 <- order(-x[i1])
  names(df1)[-1][i1][i2]
}))
m1
# [,1] [,2] [,3]
#[1,] "C" "B" "A"
#[2,] "B" "A" "D"
If we need both the values and the names, a list is more suitable, because it keeps the values numeric and the names character instead of coercing everything to a single class:
lst <- apply(df1[-1], 1, function(x) {
  i1 <- !is.na(x)
  list(sort(x[i1], decreasing = TRUE))
})
lst
#[[1]]
#[[1]][[1]]
#C B A
#4 3 2
#[[2]]
#[[2]][[1]]
#B A D
#4 3 1
We can extract the names or the elements from the 'lst'
do.call(rbind, do.call(`c`,rapply(lst, names,
how='list')))
# [,1] [,2] [,3]
#[1,] "C" "B" "A"
#[2,] "B" "A" "D"
Or
t(sapply(do.call(c, lst), names))
and the values as
t(simplify2array(do.call(c, lst)))
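As a minimal sketch of the same idea, assuming the question's example is rebuilt as df1 (the name used in this answer) and using `sorted` as a purely illustrative variable name: sort() drops NA values by default, so it can handle both the NA removal and the ordering, and a plain list also copes with rows that have different numbers of non-NA values.

```r
# Rebuild the example data from the question
df1 <- data.frame(Year = 2000:2001,
                  A = c(2, 3), B = c(3, 4), C = c(4, NA), D = c(NA, 1))

# One named vector per row: values sorted in decreasing order, NAs dropped
sorted <- lapply(seq_len(nrow(df1)), function(i) {
  x <- unlist(df1[i, -1])        # named numeric vector for this row
  sort(x, decreasing = TRUE)     # sort() removes NA by default (na.last = NA)
})
names(sorted) <- df1$Year        # label each element by its year

sorted[["2000"]]                 # values with their column names
lapply(sorted, names)            # column names only, per year
```

Keeping the result as a list avoids forcing every year into the same number of columns.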

Related

Display identical columns in R dataframe

Suppose I have the following dataframe :
df <- data.frame(A=c(1,2,3),B=c("a","b","c"),C=c(2,1,3),D=c(1,2,3),E=c("a","b","c"),F=c(1,2,3))
> df
A B C D E F
1 1 a 2 1 a 1
2 2 b 1 2 b 2
3 3 c 3 3 c 3
I want to filter out the columns that are identical. I know that I can do it with
DuplCols <- df[duplicated(as.list(df))]
UniqueCols <- df[ ! duplicated(as.list(df))]
In the real world my dataframe has more than 500 columns; I do not know how many identical columns of the same kind I have, and I do not know the names of the columns. However, each column name is unique (as in df). My desired result is (optimally) a dataframe where each row stores the column names of one kind of identical columns. The number of columns in the DesiredResult dataframe is the maximal number of identical columns of one kind in the original dataframe, and if there are fewer identical columns of another kind, NA should be stored:
> DesiredResult
X1 X2 X3
1 A D F
2 B E NA
3 C NA NA
(With "identical column of the same kind" I mean the following: in df the columns A, D, F are identical columns of the same kind and B, E are identical columns of the same kind.)
You can use unique to get the distinct columns and then test with %in% which columns match each of them, to extract the column names.
tt <- lapply(unique(as.list(df)), function(x) {colnames(df)[as.list(df) %in% list(x)]})
tt
#[[1]]
#[1] "A" "D" "F"
#
#[[2]]
#[1] "B" "E"
#
#[[3]]
#[1] "C"
t(sapply(tt, "length<-", max(lengths(tt)))) # pad each group with NA; returns a character matrix
# [,1] [,2] [,3]
#[1,] "A" "D" "F"
#[2,] "B" "E" NA
#[3,] "C" NA NA
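The result above is a character matrix. A hedged sketch of one way to get the data frame with X1, X2, X3 columns that the question asks for, assuming the same df as in the question; `width` and `padded` are illustrative names only:

```r
df <- data.frame(A = c(1, 2, 3), B = c("a", "b", "c"), C = c(2, 1, 3),
                 D = c(1, 2, 3), E = c("a", "b", "c"), F = c(1, 2, 3),
                 stringsAsFactors = FALSE)

# Group the column names by identical column content (as in the answer above)
tt <- lapply(unique(as.list(df)), function(x) names(df)[as.list(df) %in% list(x)])

# Pad every group to the longest length; indexing past the end yields NA
width  <- max(lengths(tt))
padded <- do.call(rbind, lapply(tt, function(v) v[seq_len(width)]))

DesiredResult <- as.data.frame(padded, stringsAsFactors = FALSE)
names(DesiredResult) <- paste0("X", seq_len(width))
DesiredResult
```

Using do.call(rbind, ...) rather than sapply() keeps the result a matrix even when every group happens to contain a single column.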

How can I order a column of a matrix?

I have created a matrix out of two vectors
x<-c(1,118,3,220)
y<-c("A","B","C","D")
z<-c(x,y)
m<-matrix(z,ncol=2)
Now I want to order the second row, but it doesn't work properly.
I tried:
m[order(m[,2]),]
The order should be 1,3,118,220, but it shows 1,118,220,3
A matrix can only hold one class, which in this case is character, since you have "A","B","C","D".
So if you still want to order the rows of the matrix, you need to take the first column, convert it to numeric, call order on it, and use the result to reorder the rows.
m[order(as.numeric(m[, 1])), ]
# [,1] [,2]
#[1,] "1" "A"
#[2,] "3" "C"
#[3,] "118" "B"
#[4,] "220" "D"
Since your data has mixed types, why not store it in a data frame instead?
x<-c(1,118,3,220)
y<-c("A","B","C","D")
df <- data.frame(x,y)
df[order(df[,1]),]
# x y
#1 1 A
#3 3 C
#2 118 B
#4 220 D
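To see where the unexpected order comes from, here is a small sketch using the same x and y as in the question. It only inspects the storage type and compares the two orderings:

```r
x <- c(1, 118, 3, 220)
y <- c("A", "B", "C", "D")
m <- matrix(c(x, y), ncol = 2)

typeof(m)                      # "character": the numbers were coerced to strings
order(m[, 1])                  # string comparison, so "118" and "220" come before "3"
order(as.numeric(m[, 1]))      # numeric comparison: 1 3 2 4
m[order(as.numeric(m[, 1])), ] # rows reordered by numeric value
```

The conversion with as.numeric() is applied only to compute the ordering; the matrix itself stays character.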

returning matrix column indices matching value(s) in R

I'm looking for a fast way to return the indices of columns of a matrix that match values provided in a vector (ideally of length 1 or the same as the number of rows in the matrix)
for instance:
mat <- matrix(1:100,10)
values <- c(11,2,23,12,35,6,97,3,9,10)
the desired function, which I call rowMatches() would return:
rowMatches(mat, values)
[1] 2 1 3 NA 4 1 10 NA 1 1
Indeed, value 11 is first found at the 2nd column of the first row, value 2 appears at the 1st column of the 2nd row, value 23 is at the 3rd column of the 3rd row, value 12 is not in the 4th row... and so on.
Since I haven't found any solution in package matrixStats, I came up with this function:
rowMatches <- function(mat, values) {
  res <- integer(nrow(mat))
  matches <- mat == values
  # go from the last column to the first so the lowest matching column wins
  for (col in ncol(mat):1) {
    res[matches[, col]] <- col
  }
  res[res == 0] <- NA
  res
}
For my intended use, there will be millions of rows and few columns. So splitting the matrix into rows (in a list called, say, rows) and calling Map(match, as.list(values), rows) would be way too slow.
But I'm not satisfied by my function because there is a loop, which may be slow if there are many columns. It should be possible to use apply() on columns, but it won't make it faster.
Any ideas?
res <- arrayInd(match(values, mat), .dim = dim(mat))
res[res[, 1] != seq_len(nrow(res)), 2] <- NA
# [,1] [,2]
# [1,] 1 2
# [2,] 2 1
# [3,] 3 3
# [4,] 2 NA
# [5,] 5 4
# [6,] 6 1
# [7,] 7 10
# [8,] 3 NA
# [9,] 9 1
#[10,] 10 1
Roland's answer is good, but I'll post an alternative solution:
res <- which(mat == values, arr.ind = TRUE)
res <- res[match(seq_len(nrow(mat)), res[,1]), 2]
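A sketch that wraps the which()-based alternative into the rowMatches() function from the question and runs it on the question's example. It relies on mat == values recycling values down each column, so row i is compared with values[i]; `hits` is an illustrative name.

```r
rowMatches <- function(mat, values) {
  # all (row, col) positions where row i equals values[i], in column-major order
  hits <- which(mat == values, arr.ind = TRUE)
  # first hit per row is the lowest matching column; rows without a hit give NA
  unname(hits[match(seq_len(nrow(mat)), hits[, "row"]), "col"])
}

mat <- matrix(1:100, 10)
values <- c(11, 2, 23, 12, 35, 6, 97, 3, 9, 10)
rowMatches(mat, values)
# [1]  2  1  3 NA  4  1 10 NA  1  1
```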

Split matrix to a list of matrix by vector

I am trying to split my matrix into a list by the unique values in a vector. The vector has as many values as there are rows in the matrix.
Here is an example:
#matrix
b <- cbind(c(2,2,1,0), c(2,2,1,5), c(2,2,5,6))
#vector
a <- c(5,5,4,1)
#??
#my outcome should look like
v <- list(cbind(c(2,2), c(2,2), c(2,2)), c(1,1,5), c(0,5,6))
So basically, I want to split my matrix by rows into multiple matrices, one per unique value in the vector. More specifically, my vector is sorted from highest to lowest value, and I need to keep that order in the list. As you can see in the example, v[[1]] is the matrix for unique(a)[1], and so on.
lapply(split(seq_along(a), a), # split indices by a
       function(m, ind) m[ind, ], m = b)[order(unique(a))]
#$`5`
# [,1] [,2] [,3]
#[1,] 2 2 2
#[2,] 2 2 2
#
#$`4`
#[1] 1 1 5
#
#$`1`
#[1] 0 5 6
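In the output above, the single-row groups come back as plain vectors. A hedged variant, assuming every element should stay a matrix, uses drop = FALSE and selects the groups by name in the order of unique(a); `idx` and `parts` are illustrative names:

```r
b <- cbind(c(2, 2, 1, 0), c(2, 2, 1, 5), c(2, 2, 5, 6))
a <- c(5, 5, 4, 1)

idx <- split(seq_along(a), a)               # row indices per value of a (names "1", "4", "5")
parts <- lapply(idx[as.character(unique(a))],
                function(i) b[i, , drop = FALSE])  # drop = FALSE keeps 1-row matrices

names(parts)       # "5" "4" "1", the order in which the values appear in a
dim(parts[["4"]])  # 1 3
```

Selecting by as.character(unique(a)) follows the order of appearance in a, so it does not depend on a being sorted.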

Row names & column names in R

Do the following function pairs generate exactly the same results?
Pair 1) names() & colnames()
Pair 2) rownames() & row.names()
As Oscar Wilde said:
"Consistency is the last refuge of the unimaginative."
R is more of an evolved language than a designed one, so these things happen. names() and colnames() both work on a data.frame, but names() does not work on a matrix:
R> DF <- data.frame(foo=1:3, bar=LETTERS[1:3])
R> names(DF)
[1] "foo" "bar"
R> colnames(DF)
[1] "foo" "bar"
R> M <- matrix(1:9, ncol=3, dimnames=list(1:3, c("alpha","beta","gamma")))
R> names(M)
NULL
R> colnames(M)
[1] "alpha" "beta" "gamma"
R>
Just to expand a little on Dirk's example:
It helps to think of a data frame as a list with equal length vectors. That's probably why names works with a data frame but not a matrix.
The other useful function is dimnames which returns the names for every dimension. You will notice that the rownames function actually just returns the first element from dimnames.
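A short sketch of the relationship between dimnames, rownames and colnames, reusing the matrix M from the example above:

```r
M <- matrix(1:9, ncol = 3, dimnames = list(1:3, c("alpha", "beta", "gamma")))

dimnames(M)                               # list of row names, then column names
identical(rownames(M), dimnames(M)[[1]])  # TRUE
identical(colnames(M), dimnames(M)[[2]])  # TRUE
```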
Regarding rownames and row.names: I can't tell the difference, although rownames uses dimnames while row.names was written outside of R. They both also seem to work with higher dimensional arrays:
> a <- array(1:5, 1:4)
> a[1,,,]
> rownames(a) <- "a"
> row.names(a)
[1] "a"
> a
, , 1, 1
[,1] [,2]
a 1 2
> dimnames(a)
[[1]]
[1] "a"
[[2]]
NULL
[[3]]
NULL
[[4]]
NULL
I think that using colnames and rownames makes the most sense; here's why.
Using names has several disadvantages. You have to remember that it means "column names", and it only works with data frames, so you'll need to call colnames whenever you use matrices. By calling colnames, you only have to remember one function. Finally, if you look at the code for colnames, you will see that it calls names in the case of a data frame anyway, so the output is identical.
rownames and row.names return the same values for data frames and matrices; the only difference that I have spotted is that where there aren't any names, rownames will print "NULL" (as does colnames), but row.names returns it invisibly. Since there isn't much to choose between the two functions, rownames wins on the grounds of aesthetics, since it pairs more prettily with colnames. (Also, for the lazy programmer, you save a character of typing.)
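These equivalences can be checked directly with identical(). A small sketch, where d and m are throwaway example objects rather than anything from the answers above:

```r
d <- data.frame(foo = 1:3, bar = LETTERS[1:3])
m <- matrix(1:6, ncol = 2,
            dimnames = list(c("r1", "r2", "r3"), c("a", "b")))

identical(names(d), colnames(d))      # TRUE: same column names for a data frame
identical(rownames(d), row.names(d))  # TRUE
identical(rownames(m), row.names(m))  # TRUE
is.null(names(m))                     # TRUE: a matrix has no names attribute by default
```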
And another expansion:
# create dummy matrix
set.seed(10)
m <- matrix(round(runif(25, 1, 5)), 5)
d <- as.data.frame(m)
If you want to assign new column names, you can do the following on a data.frame:
# an identical effect can be achieved with colnames()
names(d) <- LETTERS[1:5]
> d
A B C D E
1 3 2 4 3 4
2 2 2 3 1 3
3 3 2 1 2 4
4 4 3 3 3 2
5 1 3 2 4 3
If, however, you run the previous command on a matrix, you'll mess things up:
names(m) <- LETTERS[1:5]
> m
[,1] [,2] [,3] [,4] [,5]
[1,] 3 2 4 3 4
[2,] 2 2 3 1 3
[3,] 3 2 1 2 4
[4,] 4 3 3 3 2
[5,] 1 3 2 4 3
attr(,"names")
[1] "A" "B" "C" "D" "E" NA NA NA NA NA NA NA NA NA NA NA NA NA NA
[20] NA NA NA NA NA NA
Since a matrix can be regarded as a two-dimensional vector, you'll assign names only to the first five values (you don't want to do that, do you?). In this case, you should stick with colnames().
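A compact sketch of the contrast on a small matrix (m and m2 are illustrative objects): colnames<- sets the column dimnames, so columns can be selected by name, whereas names<- attaches element-wise names and leaves the column names unset.

```r
m <- matrix(1:6, nrow = 2)
colnames(m) <- c("A", "B", "C")
m[, "B"]                # second column, selected by name: 3 4

m2 <- matrix(1:6, nrow = 2)
names(m2) <- c("A", "B", "C")
is.null(colnames(m2))   # TRUE: no column names were set
names(m2)               # "A" "B" "C" NA NA NA, one name per element
```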
So there...
