Apply, dataframes, and booleans don't work together? - r

In the following, logical operators don't seem to work properly.
a = c(TRUE, FALSE, TRUE, FALSE, TRUE, TRUE)
b = c('a', 'b', 'c', 'de', 'f', 'g')
c = c(1, 2, 3, 4, 5, 6)
d = c(0, 0, 0, 0, 0, 1)
wtf = data.frame(a, b, c, d)
wtf$huh = apply(wtf, 1, function(row) {
if (row['a'] == T) { return('we win') }
if (row['c'] < 5) { return('hooray') }
if (row['d'] == 1) { return('a thing') }
return('huh?')
})
Producing:
> wtf
a b c d huh
1 TRUE a 1 0 hooray
2 FALSE b 2 0 hooray
3 TRUE c 3 0 hooray
4 FALSE de 4 0 hooray
5 TRUE f 5 0 huh?
6 TRUE g 6 1 a thing
Where naively one would expect that in rows 1, 3, 5, and 6, there would be we win.
Can someone explain to me (1) why it does this, (2) how can this be fixed such that it doesn't happen, (3) why all my logical columns are seemingly changed to characters, and (4) how can a function be type-safely applied to rows in a data frame?

Why does this happen? Because apply is made for matrices. When you give it a data frame, the first thing that happens is that it gets converted to a matrix:
m = as.matrix(wtf)
m
# a b huh huh1
# [1,] " TRUE" "a" "huh?" "hooray"
# [2,] "FALSE" "b" "huh?" "huh?"
# [3,] " TRUE" "c" "huh?" "hooray"
# [4,] "FALSE" "de" "huh?" "huh?"
# [5,] " TRUE" "f" "huh?" "hooray"
# [6,] " TRUE" "g" "huh?" "hooray"
When that happens, your different data types are lost and your data frame-style indexing doesn't work anymore:
m['a']
# [1] NA
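To see the coercion in isolation, here is a small sketch on a made-up data frame (df and its columns are hypothetical, not the question's data). It uses only base R and shows that the matrix is character, that a matrix needs [row, column] indexing, and why a comparison such as c < 5 misbehaves once the value is a string:

```r
# Hypothetical three-column data frame with mixed types
df <- data.frame(a = c(TRUE, FALSE), b = c("x", "y"), c = c(1, 10))

m <- as.matrix(df)
typeof(m)        # "character": every column was coerced

# m["a"] looks for an element *named* "a" and finds none
m["a"]           # NA

# Columns of a matrix are selected with [row, column]
m[, "a"]         # the logical values, now stored as strings

# Comparing a string with a number coerces the number to a string,
# so the comparison is alphabetical rather than numeric
"10" < 5         # TRUE
```

This is also why row['a'] == T fails inside apply: the row holds strings, and the string for TRUE may not even be exactly "TRUE" (older R versions pad it with a space, as in the output above).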
Solution? Use a simple for loop:
wtf$huh1 = NA
for (i in 1:nrow(wtf)) {
wtf$huh1[i] = if(wtf[i, 'a']) "hooray" else "huh?"
}
If you have a function foo, then:
wtf$huh2 = NA
for (i in 1:nrow(wtf)) {
wtf$huh2[i] = foo(wtf[i, 'a'])
}
Functions that aren't vectorized can be vectorized to avoid the need for loops:
foov = Vectorize(foo)
# then you can
wtf$huh4 = foov(wtf$a)
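As a sketch of how Vectorize behaves, assume foo is a scalar-only function like the one below (this foo is a stand-in for illustration, since the answer does not define it):

```r
# Stand-in for a function that only works on a single value,
# because if() needs a length-one condition
foo <- function(flag) if (flag) "hooray" else "huh?"

# Vectorize wraps foo so it is called once per element
foov <- Vectorize(foo)

a <- c(TRUE, FALSE, TRUE)
foov(a)   # "hooray" "huh?" "hooray"
```

Under the hood Vectorize still calls foo once per element, so it is a convenience rather than a speed-up.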

Probably the easiest way to fix this is using ifelse which is vectorized, so you don't need to deal with loops, or apply:
myfunc <- function(row) {
ifelse (row['a'] == T,'hooray','huh?')
}
wtf$huh <- myfunc(wtf)
a b a
1 TRUE a hooray
2 FALSE b huh?
3 TRUE c hooray
4 FALSE de huh?
5 TRUE f hooray
6 TRUE g hooray
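A variant that may be simpler, assuming the condition depends only on column a: pass the column itself to ifelse rather than indexing the whole data frame inside a function. In this sketch the new column is a plain character vector:

```r
# Only the logical column from the question is needed here
wtf <- data.frame(a = c(TRUE, FALSE, TRUE, FALSE, TRUE, TRUE))

# wtf$a is already logical, so no comparison with TRUE is required
wtf$huh <- ifelse(wtf$a, "hooray", "huh?")
wtf$huh
```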

One advantage of a data.frame is that it can contain variables of different types.
lapply(wtf, class)
$a
[1] "logical"
$b
[1] "factor"
$huh
[1] "character"
As noted by Gregor, apply requires a matrix and will convert the object you give it to one if possible. But matrices cannot contain multiple variable types and so as.matrix will look for a lowest common denominator that can represent the data, in this case, character.
typeof(as.matrix(wtf))
[1] "character"
class(as.matrix(wtf))
[1] "matrix"
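On the type-safety part of the question, one possible approach (a sketch, not the only way) is to hand the columns to mapply as separate vectors, so each value keeps its own type when it reaches the function. The helper name classify is made up for illustration; the data is the question's:

```r
wtf <- data.frame(
  a = c(TRUE, FALSE, TRUE, FALSE, TRUE, TRUE),
  b = c("a", "b", "c", "de", "f", "g"),
  c = c(1, 2, 3, 4, 5, 6),
  d = c(0, 0, 0, 0, 0, 1)
)

# Hypothetical row-wise rule: a is logical, c and d are numeric
classify <- function(a, c, d) {
  if (a) return("we win")
  if (c < 5) return("hooray")
  if (d == 1) return("a thing")
  "huh?"
}

# mapply walks the three columns in parallel, one row at a time
wtf$huh <- mapply(classify, wtf$a, wtf$c, wtf$d)
wtf$huh   # "we win" "hooray" "we win" "hooray" "we win" "we win"
```

Because nothing is converted to a matrix, rows 1, 3, 5, and 6 come out as we win, which is what the question expected.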

Related

Mapply on a function with conditional expressions

Background: PDF Parse My program looks for data in scanned PDF documents. I've created a CSV with rows representing various parameters to be searched for in a PDF, and columns for the different flavors of document that might contain those parameters. There are different identifiers for each parameter depending on the type of document. The column headers use dot separation to uniquely identify the document by type, subtype... , like so: type.subtype.s_subtype.s_s_subtype.
t.s.s2.s3 t.s.s2.s3 t.s.s2.s3 t.s.s2.s3 ...
p1 str1 str2
p2 str3 str4
p3 str5 str6
p4 str7
...
I'm reading in PDF files, and based on the filepaths they can be uniquely categorized into one of these types. I can apply various logical conditions to a substring of a given filepath, and based on that I'd like to output an NxM Boolean matrix, where N = NROW(filepath_vector), and M = ncol(params_csv). This matrix would show membership of a given file in a type with TRUE, and FALSE elsewhere.
t.s.s2.s3 t.s.s2.s3 t.s.s2.s3 t.s.s2.s3 ...
fpath1 FALSE FALSE TRUE FALSE
fpath2 FALSE TRUE FALSE FALSE
fpath3 FALSE TRUE FALSE FALSE
fpath4 FALSE FALSE FALSE TRUE
...
My solution: I'm trying to apply a function to a matrix that takes a vector as argument, and applies the first element of the vector to the first row, the second element to the second row, etc... however, the function has conditional behavior depending on the element of the vector being applied.
I know this is very similar to the question below (my reference point), but the conditionals in my function are tripping me up. I've provided a simplified reproducible example of the issue below.
R: Apply function to matrix with elements of vector as argument
set.seed(300)
x <- y <- 5
m <- matrix(rbinom(x*y,1,0.5),x,y)
v <- c("321", "", "A160470", "7IDJOPLI", "ACEGIKM")
f <- function(x) {
sapply(v, g <- function(y) {
if(nchar(y)==8) {x=x*2
} else if (nchar(y)==7) {
if(grepl("^[[:alpha:]]*$", substr(y, 1, 1))) {x=x*3}
else {x}
} else if (nchar(y)<3) {x=x*4
} else {x=x-2}
})
}
mapply(f, as.data.frame(t(m)))
Desired output:
# [,1] [,2] [,3] [,4] [,5]
# [1,] -1 0 -1 -1 -1
# [2,] 4 4 0 4 0
# [3,] 3 0 3 3 0
# [4,] 2 0 2 2 0
# [5,] 1 1 1 1 0
But I get this error:
Error in if (y == 8) { : missing value where TRUE/FALSE needed
Can't seem to figure out the error or if I'm misguided elsewhere in my entire approach, any thoughts are appreciated.
Update (03April2018):
I had provided this as a toy example for the sake of reproducibility, but I think it would be more informative to use something similar to my actual code with @grand_chat's excellent solution. Hopefully this helps someone who's struggling with a similar issue.
chk <- c(NA, "abc.TRO", "def.TRO", "ghi.TRO", "kjl.TRO", "mno.TRO")
len <- c(8, NA, NA)
seed <- c(FALSE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, FALSE, FALSE, FALSE)
A = matrix(seed, nrow=3, ncol=6, byrow=TRUE)
pairs <- mapply(list, as.data.frame(t(A)), len, SIMPLIFY=F)
f <- function(pair) {
x = unlist(pair[[1]])
y = pair[[2]]
if(y==8 & !is.na(y)) {
x[c(grep("TRO", chk))] <- (x[c(grep("TRO", chk))] & TRUE)
} else {x <- (x & FALSE)}
return(x)
}
t(mapply(f, pairs))
Output:
# $v1
# [1,] FALSE TRUE TRUE FALSE FALSE FALSE
# $v2
# [2,] FALSE FALSE FALSE FALSE FALSE FALSE
# $v3
# [3,] FALSE FALSE FALSE FALSE FALSE FALSE
You're processing the elements of vector v and the rows of your matrix m (columns of data frame t(m)) in parallel, so you could zip the corresponding elements into a list of pairs and process the pairs. Try this:
x <- y <- 5
m <- matrix(rbinom(x*y,1,0.5),x,y)
v <- c("321", "", "A160470", "7IDJOPLI", "ACEGIKM")
# Zip into pairs:
pairs <- mapply(list, as.data.frame(t(m)), v, SIMPLIFY=F)
# Define a function that acts on pairs:
f <- function(pair) {
x = pair[[1]]
y = pair[[2]]
if(nchar(y)==8) {x=x*2
} else if (nchar(y)==7) {
if(grepl("^[[:alpha:]]*$", substr(y, 1, 1))) {x=x*3}
else {x}
} else if (nchar(y)<3) {x=x*4
} else {x=x-2}
}
# Apply it:
mapply(f, pairs, SIMPLIFY=F)
with result:
$V1
[1] -2 -1 -2 -2 -1
$V2
[1] 4 4 0 0 4
$V3
[1] 3 3 3 3 0
$V4
[1] 2 0 2 2 0
$V5
[1] 0 0 3 0 3
(This doesn't agree with your desired output because you don't seem to have applied your function f properly.)
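A possible variation on the same idea, as a sketch: mapply (or Map) can walk the rows and the strings in parallel directly, so the explicit zip step is optional. The names transform_row, rows, and res are made up here, and the seed is taken from the question:

```r
set.seed(300)
m <- matrix(rbinom(25, 1, 0.5), 5, 5)
v <- c("321", "", "A160470", "7IDJOPLI", "ACEGIKM")

# Hypothetical helper: transform one row x according to its string y
transform_row <- function(x, y) {
  n <- nchar(y)
  if (n == 8) {
    x * 2
  } else if (n == 7) {
    # multiply by 3 only if the string starts with a letter
    if (grepl("^[[:alpha:]]", y)) x * 3 else x
  } else if (n < 3) {
    x * 4
  } else {
    x - 2
  }
}

# split() turns the matrix into a list of its rows;
# Map() pairs row i with v[i]; rbind() stacks the results again
rows <- split(m, row(m))
res <- do.call(rbind, Map(transform_row, rows, v))
dim(res)   # 5 5
```

Returning the value from each branch, instead of assigning to x inside the branch, keeps it clear what the function hands back.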

Check if all rows are the same in a matrix

I'm looking to see if sample(..., replace=T) results in sampling the same row n times. I see the duplicated function checks if elements are repeated by returning a logical vector for each index, but I need to see if one element is repeated n times (single boolean value). What's the best way to go about this?
Here's just an example. Some function on this matrix should return TRUE
t(matrix(c(rep(c(rep(4,1),rep(5,1)),8)),nrow=2,ncol=8))
[,1] [,2]
[1,] 4 5
[2,] 4 5
[3,] 4 5
[4,] 4 5
[5,] 4 5
[6,] 4 5
[7,] 4 5
[8,] 4 5
Here is one solution that works to produce the true/false result you are looking for:
m <- t(matrix(c(rep(c(rep(4,1),rep(5,1)),8)),nrow=2,ncol=8))
apply(m, 2, function(x) length(unique(x)) == 1)
[1] TRUE TRUE
m <- rbind(m, c(4, 6))
apply(m, 2, function(x) length(unique(x)) == 1)
[1] TRUE FALSE
If you want a single boolean value saying whether every column contains just one distinct value, you can do:
all(apply(m, 2, function(x) length(unique(x)) == 1) == TRUE)
[1] FALSE
A little cleaner looking (and easier to tell what the code is doing):
m <- t(matrix(c(rep(c(rep(4,1),rep(5,1)),8)),nrow=2,ncol=8))
apply(m, 2, function(x) all(x==x[1]))
[1] TRUE TRUE
Think I've got my solution.
B <- t(matrix(c(rep(c(rep(4,1),rep(5,1)),8)),nrow=2,ncol=8))
length(table(B)) == ncol(B)
[1] TRUE
B <- rbind(B,c(4,6)) # different sample
length(table(B)) == ncol(B)
[1] FALSE
We could also replicate the first row, compare with the original matrix, get the colSums and check whether it is equal to nrow of 'm'
colSums(m[1,][col(m)]==m)==nrow(m)
[1] TRUE TRUE
Or another option would be to check the variance
!apply(m, 2, var)
#[1] TRUE TRUE
For a matrix you can apply unique directly to your matrix, no need for apply:
nrow(unique(m)) == 1L
[1] TRUE
nrow(unique(rbind(m, c(6,7)))) == 1L
[1] FALSE
From the documentation ?unique:
The array method calculates for each element of the dimension specified by MARGIN if the remaining dimensions are identical to those for an earlier element (in row-major order). This would most commonly be used for matrices to find unique rows (the default) or columns (with MARGIN = 2).
Alternatively you can transpose your matrix and leverage vectorized comparison:
all(m[1,] == t(m))
[1] TRUE
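For reuse, the unique-based check could be wrapped in a small function. This is a sketch; the name all_rows_identical and the test matrices are made up. The last matrix is a case where the rows differ but contain the same set of values, which is where a check based on table() can be fooled:

```r
# Hypothetical wrapper around the unique() approach
all_rows_identical <- function(m) {
  stopifnot(is.matrix(m))
  nrow(unique(m)) == 1L
}

same    <- matrix(rep(c(4, 5), 8), ncol = 2, byrow = TRUE)
mixed   <- rbind(same, c(4, 6))
swapped <- rbind(c(4, 5), c(5, 4))

all_rows_identical(same)     # TRUE
all_rows_identical(mixed)    # FALSE
all_rows_identical(swapped)  # FALSE

# Counting distinct values ignores where they sit in the row
length(table(swapped)) == ncol(swapped)  # TRUE, although the rows differ
```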

R get index satisfying the condition [duplicate]

I am looking for a condition which will return the index of a vector satisfying a condition.
For example-
I have a vector b = c(0.1, 0.2, 0.7, 0.9)
I want to know the first index of b for which say b >0.65. In this case the answer should be 3
I tried which.min(subset(b, b > 0.65))
But this gives me 1 instead of 3.
Please help
Use which and take the first element of the result:
which(b > 0.65)[1]
#[1] 3
Be careful: which.max gives a misleading answer if the condition is never met, because it returns 1 rather than NA:
> a <- c(1, 2, 3, 2, 5)
> a >= 6
[1] FALSE FALSE FALSE FALSE FALSE
> which(a >= 6)[1]
[1] NA # desirable
> which.max(a >= 6)
[1] 1 # not desirable
Why? When all elements are equal, which.max returns 1:
> b <- c(2, 2, 2, 2, 2)
> which.max(b)
[1] 1
Note: FALSE < TRUE
You may use which.max:
which.max(b > 0.65)
# [1] 3
From ?which.max: "For a logical vector x, [...] which.max(x) return[s] the index of the first [...] TRUE".
b > 0.65
# [1] FALSE FALSE TRUE TRUE
You should also have a look at the result of your code subset(b, b > 0.65) to see why it can't give you the desired result.
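A short sketch that puts the pieces side by side (first_index is a made-up helper name): subset() returns a new, shorter vector, so positions in it no longer refer to b, while which() reports positions in the original vector.

```r
b <- c(0.1, 0.2, 0.7, 0.9)

# Hypothetical helper: first position where cond is TRUE, NA if none
first_index <- function(cond) which(cond)[1]

first_index(b > 0.65)   # 3
first_index(b > 1)      # NA, the condition is never met

# subset() keeps only the matching values: 0.7 0.9
# The smallest of those sits at position 1 of the *subset*, not of b
kept <- subset(b, b > 0.65)
which.min(kept)         # 1

# which.max on an all-FALSE vector still returns 1
which.max(b > 1)        # 1
```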

How to ignore NA in ifelse statement

I came to R from SAS, where numeric missing is set to infinity. So we can just say:
positiveA = A > 0;
In R, I have to be verbose like:
positiveA <- ifelse(is.na(A),0, ifelse(A > 0, 1, 0))
I find this syntax hard to read. Is there any way I can modify the ifelse function to treat NA as a special value that is always false in every comparison? If not, treating NA as -Inf would work too.
Similarly, I'd like ifelse to set NA to '' (blank) for character variables.
Thanks.
This syntax is easier to read:
x <- c(NA, 1, 0, -1)
(x > 0) & (!is.na(x))
# [1] FALSE TRUE FALSE FALSE
(The outer parentheses aren't necessary, but will make the statement easier to read for almost anyone other than the machine.)
Edit:
## If you want 0s and 1s
((x > 0) & (!is.na(x))) * 1
# [1] 0 1 0 0
Finally, you can make the whole thing into a function:
isPos <- function(x) {
((x > 0) & (!is.na(x))) * 1
}
isPos(x)
# [1] 0 1 0 0
Replacing an NA value with zero seems rather strange behaviour to expect. R treats NA values as missing (although, far behind the scenes where you should never need to go, an integer NA is stored as a very large negative number).
All you need to do is A>0 or as.numeric(A>0) if you want 0,1 not TRUE , FALSE
# some dummy data
A <- seq(-1,1,l=11)
# add NA value as second value
A[2] <- NA
positiveA <- A>0
positiveA
[1] FALSE NA FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE
as.numeric(positiveA) #
[1] 0 NA 0 0 0 0 1 1 1 1 1
note that
ifelse(A>0, 1,0) would also work.
The NA values are "retained", or dealt with appropriately. R is sensible here.
Try this:
positiveA <- ifelse(!is.na(A) & A > 0, 1, 0)
If you are working with integers you can use %in%
For example, if your numbers can go up to 2
test <- c(NA, 2, 1, 0, -1)
other people have suggested using
(test > 0) & (!is.na(test))
or
ifelse(!is.na(test) & test > 0, 1, 0)
My solution is simpler and gives you the same result:
test %in% 1:2
You can use the missing argument in if_else_ from hablar:
library(hablar)
x <- c(NA, 1, 0, -1)
if_else_(x > 0, T, F, missing = F)
which gives you
[1] FALSE TRUE FALSE FALSE
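If the same test is needed in many places, a base-R helper keeps the call sites short. This is only a sketch; is_positive is a made-up name. The second half shows one way to handle the character case from the question, assuming NA should become an empty string:

```r
# Hypothetical helper: TRUE only where x is present and positive
is_positive <- function(x) !is.na(x) & x > 0

x <- c(NA, 1, 0, -1)
is_positive(x)               # FALSE TRUE FALSE FALSE
as.integer(is_positive(x))   # 0 1 0 0

# Character analogue: swap NA for an empty string
s <- c("a", NA, "c")
s[is.na(s)] <- ""
s                            # "a" "" "c"
```

This works because FALSE & NA is FALSE in R, so the is.na() check masks the missing comparison result.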

How could I make this R snippet faster and more R-ish?

Coming from various other languages, I find R powerful and intuitive, but I am not thrilled with its performance. So I decided to try to improve some snippet I wrote and learn how to code better in R.
Here's a function I wrote, trying to determine if a vector is binary-valued (two distinct values or just one value) or not:
isBinaryVector <- function(v) {
if (length(v) == 0) {
return (c(0, 1))
}
a <- v[1]
b <- a
lapply(v, function(x) { if (x != a && x != b) {if (a != b) { return (c()) } else { b = x }}})
if (a < b) {
return (c(a, b))
} else {
return (c(b, a))
}
}
EDIT: This function is expected to look through a vector, then return c() if it is not binary-valued and return c(a, b) if it is, a being the smaller value and b the larger one (if a == b, then just c(a, a)). E.g., for
A B C
1 1 1 0
2 2 2 0
3 3 1 0
I will lapply this isBinaryVector and get:
$A
[1] 1 1
$B
[1] 1 1
$C
[1] 0 0
The time it took on a moderate sized dataset (about 1800 * 3500, 2/3 of them are binary-valued) is about 15 seconds. The set contains only floating-point numbers.
Is there any way I could do this faster?
Thanks for any inputs!
You are essentially trying to write a function that returns TRUE if a vector has exactly two unique values, and FALSE otherwise.
Try this:
> dat <- data.frame(
+ A = 1:3,
+ B = c(1, 2, 1),
+ C = 0
+ )
>
> sapply(dat, function(x)length(unique(x))==2)
A B C
FALSE TRUE FALSE
Next, you want to get the min and max value. The function range does this. So:
> sapply(dat, range)
A B C
[1,] 1 1 0
[2,] 3 2 0
And there you have all the ingredients to make a small function that is easy to understand and should be extremely quick, even on large amounts of data:
isBinary <- function(x)length(unique(x))==2
binaryValues <- function(x){
if(isBinary(x)) range(x) else NA
}
sapply(dat, binaryValues)
$A
[1] NA
$B
[1] 1 2
$C
[1] NA
This function returns true or false for vectors (or columns of a data frame):
is.binary <- function(v) {
x <- unique(v)
length(x) - sum(is.na(x)) == 2L
}
Also take a look at this post
I'd use something like this to get column indices:
bivalued <- apply(my.data.frame, 2, is.binary)
nominal <- my.data.frame[,!bivalued]
binary <- my.data.frame[,bivalued]
Sample data:
my.data.frame <- data.frame(c(0,1), rnorm(100), c(5, 19), letters[1:5], c('a', 'b'))
> apply(my.data.frame, 2, is.binary)
c.0..1. rnorm.100. c.5..19. letters.1.5. c..a....b..
TRUE FALSE TRUE FALSE TRUE
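For the return convention described in the question (c(a, b) for one or two distinct values, c() otherwise, c(0, 1) for an empty vector), a sketch could look like the following. The name binary_values is made up, and dropping NA before counting is an assumption, not something the question specifies. Using lapply rather than apply means each column is tested in its own type instead of being converted to a character matrix first:

```r
# Hypothetical helper following the question's return convention
binary_values <- function(v) {
  if (length(v) == 0) return(c(0, 1))        # convention from the question
  u <- unique(v[!is.na(v)])                  # assumption: ignore NA
  if (length(u) < 1 || length(u) > 2) return(NULL)
  range(u)                                   # c(min, max); c(a, a) if one value
}

dat <- data.frame(A = 1:3, B = c(1, 2, 1), C = 0)
res <- lapply(dat, binary_values)

res$A   # NULL: three distinct values
res$B   # 1 2
res$C   # 0 0
```

unique() and range() are both vectorized, so this avoids the element-by-element lapply inside the original function.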
