Select names of columns which contain specific values in row - r

I'm using a data.frame:
data.frame("A"=c(NA,5,NA,NA,NA),
"B"=c(1,2,3,4,NA),
"C"=c(NA,NA,NA,2,3),
"D"=c(NA,NA,NA,7,NA))
This delivers a data.frame in this form:
A B C D
1 NA 1 NA NA
2 5 2 NA NA
3 NA 3 NA NA
4 NA 4 2 7
5 NA NA 3 NA
My aim is to check, for each row of the data.frame, whether any value is greater than a given threshold (let's assume 2), and to get the names of the columns where this is the case.
The desired output (values greater than 2) should be:
for row 1 of the data.frame
x[1,]: c()
for row 2
x[2,]: c("A")
for row 3
x[3,]: c("B")
for row 4
x[4,]: c("B","D")
and for row5 of the data.frame
x[5,]: c("C")
Thanks for your help!

You can use which:
lapply(apply(dat, 1, function(x)which(x>2)), names)
with dat being your data frame.
[[1]]
character(0)
[[2]]
[1] "A"
[[3]]
[1] "B"
[[4]]
[1] "B" "D"
[[5]]
[1] "C"
EDIT
Shorter version suggested by flodel:
lapply(apply(dat > 2, 1, which), names)
Edit: (from Arun)
First, there's no need for lapply and apply. You can get the same just with apply:
apply(dat > 2, 1, function(x) names(which(x)))
Note, however, that apply coerces a data.frame into a matrix, which can be costly if the data.frame is huge.
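If the row-wise apply call is the concern, one possible alternative is to locate all matching cells at once with which(..., arr.ind = TRUE) and then group the column names by row. This is only a sketch: dat is the question's data frame, while hits and res are names made up for this example. Note that dat > 2 still builds a logical matrix, so this avoids the per-row function calls rather than the coercion itself.

```r
dat <- data.frame(A = c(NA, 5, NA, NA, NA),
                  B = c(1, 2, 3, 4, NA),
                  C = c(NA, NA, NA, 2, 3),
                  D = c(NA, NA, NA, 7, NA))

# Row/column positions of every cell greater than 2 (NA cells are skipped)
hits <- which(dat > 2, arr.ind = TRUE)

# Group the matching column names by row; the explicit levels keep rows
# without any match as empty character vectors
res <- split(names(dat)[hits[, "col"]],
             factor(hits[, "row"], levels = seq_len(nrow(dat))))
res <- unname(res)
```

The result is always a list with one element per row, regardless of how many columns match in each row.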

To answer @flodel's concerns, I'll write it as a separate answer:
1) Using lapply gets a list and apply doesn't guarantee this always:
A fair point. I'll illustrate the issue with an example:
df <- structure(list(A = c(3, 5, NA, NA, NA), B = c(1, 2, 3, 1, NA),
C = c(NA, NA, NA, 2, 3), D = c(NA, NA, NA, 7, NA)), .Names = c("A",
"B", "C", "D"), row.names = c(NA, -5L), class = "data.frame")
A B C D
1 3 1 NA NA
2 5 2 NA NA
3 NA 3 NA NA
4 NA 1 2 7
5 NA NA 3 NA
# here `apply` simplifies the result to a vector, because every row yields exactly one name:
apply(df, 1, function(x) names(which(x>2)))
# [1] "A" "A" "B" "D" "C"
So, how can we guarantee a list with apply?
By wrapping the result in a list inside the function and then calling unlist with recursive = FALSE, as shown below:
unlist(apply(df, 1, function(x) list(names(which(x>2)))), recursive=FALSE)
[[1]]
[1] "A"
[[2]]
[1] "A"
[[3]]
[1] "B"
[[4]]
[1] "D"
[[5]]
[1] "C"
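Another way to get a list unconditionally, sketched here under the assumption that one comparison on the whole matrix is acceptable, is to loop over row indices with lapply, which never simplifies its result. The names m and res are made up for this example.

```r
df <- data.frame(A = c(3, 5, NA, NA, NA), B = c(1, 2, 3, 1, NA),
                 C = c(NA, NA, NA, 2, 3), D = c(NA, NA, NA, 7, NA))

# Compare once, then loop over row indices; lapply always returns a list
m <- as.matrix(df) > 2
res <- lapply(seq_len(nrow(m)), function(i) colnames(m)[which(m[i, ])])
```

Here which() drops the NA entries of each row before the column names are looked up.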
2) lapply is overall shorter, and does not require an anonymous function:
Yes, but it's slower. Let me illustrate this on a big example.
set.seed(45)
df <- as.data.frame(matrix(sample(c(1:10, NA), 1e5 * 100, replace=TRUE),
ncol = 100))
system.time(t1 <- lapply(apply(df > 2, 1, which), names))
user system elapsed
5.025 0.342 5.651
system.time(t2 <- unlist(apply(df, 1, function(x)
list(names(which(x>2)))), recursive=FALSE))
user system elapsed
2.860 0.181 3.065
identical(t1, t2) # TRUE
3) All answers are wrong, and this is the answer that'll work with all inputs:
lapply(split(df, rownames(df)), function(x)names(x)[which(x > 2)])
First, I don't see what's wrong. If you mean that the list is unnamed, that can be fixed by setting the names just once at the end.
Second, unfortunately, calling split on a huge data.frame so that it produces one element per row is terribly slow (the grouping factor ends up with a huge number of levels).
# testing on huge data.frame
system.time(t3 <- lapply(split(df, rownames(df)), function(x)names(x)[which(x > 2)]))
user system elapsed
517.545 0.312 517.872
Third, this orders the elements as 1, 10, 100, 1000, 10000, 100000, ... instead of 1 .. 1e5, because the row names are sorted as text. Instead, one could use setNames, or setattr from the data.table package, to set the names just once at the end, as shown below:
# setting names just once
t2 <- setNames(t2, rownames(df)) # by copy
# or even better using `data.table` `setattr` function to
# set names by reference
require(data.table)
tracemem(t2)
setattr(t2, 'names', rownames(df))
tracemem(t2)
Comparing the output doesn't show any other difference between the two (t3 and t2). You could run this to verify that the outputs are the same (time consuming):
all(sapply(names(t2), function(x) all(t2[[x]] == t3[[x]])) == TRUE) # TRUE
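The ordering issue mentioned above can be seen on a tiny data frame. This is just an illustrative sketch; small, by_text and by_row are made-up names. It also shows one possible way to keep the original row order if split is used anyway, by fixing the factor levels explicitly.

```r
small <- data.frame(x = 1:12)

# Splitting on the character row names sorts the groups as text
by_text <- split(small, rownames(small))
names(by_text)[1:4]   # "1" "10" "11" "12"

# Fixing the factor levels to the original order keeps rows 1, 2, 3, ...
by_row <- split(small, factor(rownames(small), levels = rownames(small)))
names(by_row)
```

This does not address the speed problem; it only illustrates where the 1, 10, 100, ... ordering comes from.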

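Why not index the column names directly?
colnames(df)[which(df[i, ] > 2)]
for each row, where df is your data frame and i is the row number ;) Wrapping the comparison in which() drops the NAs, and indexing colnames(df) rather than df itself still works when only one column matches.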

Related

Alternate elements of a vector with multiple NAs

I have a character vector in R, and want to make a new vector with multiple NAs between the elements of the character vector. To simplify, the character vector is:
cv <- c( "A", "B", "C" )
Let's say we just want 3 NAs (actually need much more). Desired output vector would be:
"A", NA, NA, NA, "B", NA, NA, NA, "C", NA, NA, NA
I'm guessing this has been asked before, but it's very difficult to search for. I've tried various permutations and combinations of rep and rbind with no success. Be gentle; my first question :-)
Use sapply to concatenate c(NA, NA, NA) to each element of cv so that for each element of cv we get a 4-vector. sapply will arrange these into a 4 x n matrix (where n is the length of cv) and c on the left will unravel that matrix into a vector.
c(sapply(cv, c, rep(NA, 3)))
## [1] "A" NA NA NA "B" NA NA NA "C" NA NA NA
You can also build it with matrix() and as.vector():
v <- as.vector(rbind(cv,matrix(nrow = 3,ncol = length(cv))))
such that
> v
[1] "A" NA NA NA "B" NA NA NA "C" NA NA
[12] NA
We could create a vector of NAs and assign the elements of cv at the positions generated by seq.
n <- 3
vec <- rep(NA, (n + 1) * length(cv))
vec[seq(1, length(vec), n + 1)] <- cv
vec
#[1] "A" NA NA NA "B" NA NA NA "C" NA NA NA
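Since the real use case needs many more NAs, the position-based approach can be wrapped in a small function. This is a minimal sketch; pad_with_na is a hypothetical helper name, not an existing R function. It computes the positions arithmetically so that an empty input also works.

```r
# Hypothetical helper: put n NAs after every element of a character vector
pad_with_na <- function(cv, n) {
  out <- rep(NA_character_, (n + 1) * length(cv))
  # element i of cv lands at position (i - 1) * (n + 1) + 1
  out[seq_along(cv) * (n + 1) - n] <- cv
  out
}

pad_with_na(c("A", "B", "C"), 3)
```

Using NA_character_ keeps the result a character vector even before the elements of cv are assigned.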

Aggregate dataframe to list by sorted column values

I have a matrix of the following form:
adj <- matrix(c(2, 3, 335, 337, 6, 7, 10,
1, 1, 1, 1, 3, 3, 3), nrow = 7)
adj
[,1] [,2]
[1,] 2 1
[2,] 3 1
[3,] 335 1
[4,] 337 1
[5,] 6 3
[6,] 7 3
[7,] 10 3
The matrix is sorted first by column 2, next by column 1.
I would want to convert this to an (adjacency) list of the form:
[[1]] 2 3 335 337
[[2]] integer(0)
[[3]] 6 7 10
I'm fairly new to R (and Stack Overflow) and know that the choice of implementation can drastically affect the speed of computation.
My first naive implementation to perform this task was
adj <- lapply(1:(tail(adj, 1)[2]), function(x) {
as.integer(adj[which(adj[,2] == x), 1])
})
which unfortunately does not exploit the knowledge that column 2 is sorted and seems to be quite slow when 'adj' is a large matrix (more specifically, 68.2 Mb), whereas I was able to construct the original matrix completely in a fraction of a second.
Hence, I was wondering what's a more 'R-friendly' way of implementing such code. (I have mostly been avoiding for loops so far.)
Convert the second column to a factor, fac, having all levels and then split the first column on that. (If adj[, 2] were not sorted then use min(adj[, 2]) and max(adj[, 2]) as the arguments of seq.)
nr <- nrow(adj)
fac <- factor(adj[, 2], levels = seq(adj[1, 2], adj[nr, 2]))
split(adj[, 1], fac)
giving:
$`1`
[1] 2 3 335 337
$`2`
numeric(0)
$`3`
[1] 6 7 10
Note that if you want integers convert adj to integer first and then run the above code.
mode(adj) <- "integer"
Another base R option also uses split. Create a list of length 3 whose elements are integer(0), then assign the values of the first column, split by the second column, to those elements whose index occurs in the second column of 'adj':
lst <- setNames(rep(list(integer(0)), 3), 1:3)
lst[unique(adj[,2])] <- split(adj[,1], adj[,2])
lst
#$`1`
#[1] 2 3 335 337
#$`2`
#integer(0)
#$`3`
#[1] 6 7 10
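The two ideas above can be combined so that the list length is not hard-coded and the elements are integers. This is a sketch under the assumption that node ids run from 1 to the largest value in column 2; n_nodes and lst are names chosen for this example.

```r
adj <- matrix(c(2, 3, 335, 337, 6, 7, 10,
                1, 1, 1, 1, 3, 3, 3), nrow = 7)

n_nodes <- max(adj[, 2])   # assumes node ids run from 1 to the largest id
lst <- split(as.integer(adj[, 1]),
             factor(adj[, 2], levels = seq_len(n_nodes)))
lst <- unname(lst)         # positional list, as in the desired output
```

Ids that never occur in column 2 come out as integer(0), matching the desired adjacency list.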

Find indices of vector elements in another vector

This extends a previous question I asked.
I have 2 vectors:
a <- c("a","b","c","d","e","d,e","f")
b <- c("a","b","c","d,e","f")
I created b from a by eliminating elements of a that are contained in other, comma separated, elements in a (e.g., "d" and "e" in a are contained in "d,e" and therefore only "d,e" is represented in b).
I am looking for an efficient way to map between indices of the elements of a and b.
Specifically, I would like to have a list of the length of b where each element is a vector with the indices of the elements in a that map to that b element.
For this example the output should be:
list(1, 2, 3, c(4,5,6), 7)
Modifying slightly from my answer at your previous question, try:
a <- c("a","b","c","d","e","d,e","f")
b <- c("a","b","c","d,e","f")
B <- setNames(lapply(b, gsub, pattern = ",", replacement = "|"), seq_along(b))
lapply(B, function(x) which(grepl(x, a)))
# $`1`
# [1] 1
#
# $`2`
# [1] 2
#
# $`3`
# [1] 3
#
# $`4`
# [1] 4 5 6
#
# $`5`
# [1] 7
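Note that grepl does regular-expression substring matching, so a pattern such as "a" would also match an element like "ab". If whole tokens should be compared instead, one possible variant splits both vectors on the comma and uses %in%. This is a sketch, assuming the elements are plain comma-separated tokens; parts_a, parts_b and idx are made-up names.

```r
a <- c("a", "b", "c", "d", "e", "d,e", "f")
b <- c("a", "b", "c", "d,e", "f")

# Split every element into its comma-separated tokens
parts_a <- strsplit(a, ",", fixed = TRUE)
parts_b <- strsplit(b, ",", fixed = TRUE)

# An element of a maps to an element of b if all of its tokens occur there
idx <- lapply(parts_b, function(pb) {
  which(vapply(parts_a, function(pa) all(pa %in% pb), logical(1)))
})
```

For the example vectors this gives the same mapping as the grepl version, without relying on regular expressions.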

R print out matrix with row and column names using apply

I have the following matrix 'x'
a b
a 1 3
b 2 4
It is a really large matrix (trimmed down for this question)
I would like to print out this matrix by each row name and column name combination along with the value in that cell. So the expected output would be
a,a,1
a,b,3
b,a,2
b,b,4
I could loop through them, but I'm pretty sure this can be avoided (apply?). Any pointers much appreciated.
One way is to use the melt function from the reshape2 package.
x <- matrix(1:4, nrow = 2, ncol = 2,
dimnames = list(dim1 = c("a", "b"), dim2 = c("a", "b")))
library(reshape2)
melt(x)
# dim1 dim2 value
# 1 a a 1
# 2 b a 2
# 3 a b 3
# 4 b b 4
Edit
If your data is so big that speed is an issue, I would also suggest:
data.frame(dim1 = rep(rownames(x), ncol(x)),
dim2 = rep(colnames(x), each = nrow(x)),
value = c(x))
Edit2
After testing with relatively big data, I would not rule out melt:
x <- matrix(runif(9e6), nrow = 3000, ncol = 3000,
dimnames = list(dim1 = paste0("x", runif(3000)),
dim2 = paste0("y", runif(3000))))
system.time(y1 <- melt(x))
# user system elapsed
# 1.17 0.44 1.61
system.time(y2 <- data.frame(dim1 = rep(rownames(x), ncol(x)),
dim2 = rep(colnames(x), each = nrow(x)),
value = c(x)))
# user system elapsed
# 1.98 0.37 2.36
You can also use the base R functions row and col.
If you want to reference the row and column names, use as.factor = TRUE. Using as.character and as.numeric flattens the matrix.
do.call(data.frame, list(lapply(list(row = row(x, TRUE), col = col(x, TRUE)), as.character),
value = as.numeric(x)))
## row col value
## 1 a a 1
## 2 b a 2
## 3 a b 3
## 4 b b 4
If you want a matrix, all the columns need to be of the same class (character or numeric). You could then use
do.call(cbind, lapply(list(row = row(x), col = col(x), value = x), as.numeric))
## row col value
## [1,] 1 1 1
## [2,] 2 1 2
## [3,] 1 2 3
## [4,] 2 2 4
Or as character
do.call(cbind, lapply(list(row = row(x, T), col = col(x, T), value = x), as.character))
## row col value
## [1,] "a" "a" "1"
## [2,] "b" "a" "2"
## [3,] "a" "b" "3"
## [4,] "b" "b" "4"
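To get the exact comma-separated lines from the question, in row-by-row order, the long format can be built from the transposed matrix and pasted together. This is a sketch using the small example matrix; long and out are names made up here.

```r
x <- matrix(1:4, nrow = 2, ncol = 2,
            dimnames = list(c("a", "b"), c("a", "b")))

# Walk the matrix row by row: c(t(x)) lists the values in row-major order
long <- data.frame(row   = rep(rownames(x), each = ncol(x)),
                   col   = rep(colnames(x), times = nrow(x)),
                   value = c(t(x)))

out <- paste(long$row, long$col, long$value, sep = ",")
writeLines(out)
# a,a,1
# a,b,3
# b,a,2
# b,b,4
```

For a really large matrix, transposing makes a copy, so keeping the column-major order of the answers above may be preferable if the line order does not matter.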

Row names & column names in R

Do the following function pairs generate exactly the same results?
Pair 1) names() & colnames()
Pair 2) rownames() & row.names()
As Oscar Wilde said
Consistency is the last refuge of the unimaginative.
R is an evolved rather than a designed language, so these things happen. names() and colnames() both work on a data.frame, but names() does not return the column names of a matrix:
R> DF <- data.frame(foo=1:3, bar=LETTERS[1:3])
R> names(DF)
[1] "foo" "bar"
R> colnames(DF)
[1] "foo" "bar"
R> M <- matrix(1:9, ncol=3, dimnames=list(1:3, c("alpha","beta","gamma")))
R> names(M)
NULL
R> colnames(M)
[1] "alpha" "beta" "gamma"
R>
Just to expand a little on Dirk's example:
It helps to think of a data frame as a list of equal-length vectors. That's probably why names works with a data frame but not a matrix.
The other useful function is dimnames which returns the names for every dimension. You will notice that the rownames function actually just returns the first element from dimnames.
Regarding rownames and row.names: I can't tell the difference, although rownames uses dimnames while row.names was written outside of R. They both also seem to work with higher dimensional arrays:
> a <- array(1:5, 1:4)
> a[1,,,]
> rownames(a) <- "a"
> row.names(a)
[1] "a"
> a
, , 1, 1
[,1] [,2]
a 1 2
> dimnames(a)
[[1]]
[1] "a"
[[2]]
NULL
[[3]]
NULL
[[4]]
NULL
I think that using colnames and rownames makes the most sense; here's why.
Using names has several disadvantages. You have to remember that it means "column names", and it only works with data frames, so you'll need to call colnames whenever you use matrices. By calling colnames, you only have to remember one function. Finally, if you look at the code for colnames, you will see that it calls names in the case of a data frame anyway, so the output is identical.
rownames and row.names return the same values for data frames and matrices; the only difference that I have spotted is that where there aren't any names, rownames will print "NULL" (as does colnames), but row.names returns it invisibly. Since there isn't much to choose between the two functions, rownames wins on the grounds of aesthetics, since it pairs more prettily with colnames. (Also, for the lazy programmer, you save a character of typing.)
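The claims about the two pairs can be checked quickly on a named data frame and a named matrix. This is a small sketch reusing the DF and M objects from the earlier example; it only covers these two objects, not every possible input.

```r
DF <- data.frame(foo = 1:3, bar = LETTERS[1:3])
M  <- matrix(1:9, ncol = 3,
             dimnames = list(1:3, c("alpha", "beta", "gamma")))

# Pair 1: same result on a data frame, but names() is NULL on a matrix
identical(names(DF), colnames(DF))
is.null(names(M))

# Pair 2: same result on both the data frame and the matrix
identical(rownames(DF), row.names(DF))
identical(rownames(M), row.names(M))
```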
And another expansion:
# create dummy matrix
set.seed(10)
m <- matrix(round(runif(25, 1, 5)), 5)
d <- as.data.frame(m)
If you want to assign new column names, you can do the following on a data.frame:
# an identical effect can be achieved with colnames()
names(d) <- LETTERS[1:5]
> d
A B C D E
1 3 2 4 3 4
2 2 2 3 1 3
3 3 2 1 2 4
4 4 3 3 3 2
5 1 3 2 4 3
If, however, you run the previous command on a matrix, you'll mess things up:
names(m) <- LETTERS[1:5]
> m
[,1] [,2] [,3] [,4] [,5]
[1,] 3 2 4 3 4
[2,] 2 2 3 1 3
[3,] 3 2 1 2 4
[4,] 4 3 3 3 2
[5,] 1 3 2 4 3
attr(,"names")
[1] "A" "B" "C" "D" "E" NA NA NA NA NA NA NA NA NA NA NA NA NA NA
[20] NA NA NA NA NA NA
Since a matrix can be regarded as a two-dimensional vector, you'll assign names only to the first five values (you don't want to do that, do you?). In this case, you should stick with colnames().
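For comparison, here is what the colnames() route looks like on the same dummy matrix. It is only a short sketch that rebuilds m from scratch so it is not affected by the names() assignment above.

```r
set.seed(10)
m <- matrix(round(runif(25, 1, 5)), 5)

colnames(m) <- LETTERS[1:5]   # one name per column
m[, "C"]                      # columns can now be selected by name
names(m)                      # still NULL: no per-element names were added
```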
So there...
