R apply function across rows, unexpected answer

I don't understand what is going on here:
Set up:
> df = data.frame(x1= rnorm(10), x2= rnorm(10))
> df[3,1] <- "the"
> df[6,2] <- "NA"
## I want to create values that will be challenging to coerce to numeric
> df$x1.fixed <- as.numeric(df$x1)
> df$x2.fixed <- as.numeric(df$x2)
## Here is the DF
> df
                  x1                 x2   x1.fixed   x2.fixed
1  0.955965351551298 -0.320454533088042  0.9559654 -0.3204545
2  -1.87960909714257   1.61618672247496 -1.8796091  1.6161867
3                the -0.855930398468875         NA -0.8559304
4 -0.400879592905882 -0.698655375066432 -0.4008796 -0.6986554
5  0.901252404134257  -1.08020133150191  0.9012524 -1.0802013
6   0.97786920899034                 NA  0.9778692         NA
.
.
.
> table(is.na(df[,c(3,4)]))
FALSE  TRUE
   18     2
I wanted to find the rows that got converted to NAs, so I put in a complex apply that did not work as expected. I then simplified and tried again...
Question:
Simpler call:
> apply(df, 1, function(x) (any(is.na(df[x,3]), is.na(df[x,4]))))
which unexpectedly yielded:
[1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
Instead, I'd expected:
[1] FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
to highlight the rows (3 & 6) where an NA existed. To verify that non-apply'ed functions would work, I tried:
> any(is.na(df[3,1]), is.na(df[3,2]))
[1] FALSE
> any(is.na(df[3,3]), is.na(df[3,4]))
[1] TRUE
as expected. Furthering my confusion about what apply is doing, I tried:
> apply(df, 1, function(x) is.na(df[x,1]))
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[2,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[3,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[4,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
Why is this traversing the entire DF, when I have clearly indicated both (a) that I want it applied in the row direction (I passed 1 as the second argument), and (b) that the value x is used only as the row index, not the column index?
I understand there are other, and perhaps better, ways to do what I am trying to do (find the rows that have been changed to NAs in the new columns). But please don't supply that in the answer. Instead, please explain why apply did not work as I'd expected, and what I could do to fix it.

To find the columns that have NA's you can do:
sapply(df, function(x) any(is.na(x)))
#       x1       x2 x1.fixed x2.fixed
#    FALSE    FALSE     TRUE     TRUE
A data.frame is a list of vectors, so the function above inside sapply evaluates any(is.na(x)) for each element of that list, i.e. for each column.
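A quick check (using the df built in the question) confirms this:
is.list(df)
# [1] TRUE
names(df)
# [1] "x1"       "x2"       "x1.fixed" "x2.fixed"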
As per the OP's edit - to get the rows that have NAs, use apply(df, 1, ...) instead:
apply(df, 1, function(x) any(is.na(x)))
# [1] FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
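If you want the row numbers rather than the logical vector (a small addition, not part of the original answer), wrap the result in which:
which(apply(df, 1, function(x) any(is.na(x))))
# [1] 3 6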

apply is working exactly as it is supposed to. It is your expectations that are wrong.
apply(df, 1, function(x) is.na(df[x,1]))
The first thing that apply does (per the documentation) is coerce your data frame to a matrix. In the process, all numeric columns are coerced to character.
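You can see that coercion for yourself (a quick check, using the df from the question):
m <- as.matrix(df)   # apply() performs this coercion first
typeof(m)            # "character" -- every cell, numeric or not, is now a string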
Next, each individual row of df is passed as the argument x to your function. In what sense is it meaningful to index df by the character values in the first row of df? So you just get a bunch of NAs. You can test this via:
> df[as.character(df[1,]),]
       x1   x2 x1.fixed x2.fixed
NA   <NA> <NA>       NA       NA
NA.1 <NA> <NA>       NA       NA
NA.2 <NA> <NA>       NA       NA
NA.3 <NA> <NA>       NA       NA
You say you want to know which columns introduced NAs, and yet you are applying over rows. If you really want to use apply (I recommend @eddi's method), you could do:
apply(df,2,function(x) any(is.na(x)))

You could use
rowSums(is.na(df))>0
[1] FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
to find the rows containing NAs.
rowSums is a vectorized operation, so it will typically be faster than apply when you are working with large data.
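A rough way to check this on your own machine (a sketch; big is illustrative data and the timings will vary):
big <- as.data.frame(matrix(sample(c(rnorm(9), NA), 1e6, replace = TRUE), ncol = 10))
system.time(rowSums(is.na(big)) > 0)
system.time(apply(big, 1, function(x) any(is.na(x))))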

for loop inside sapply to return string match

I wrote these lines of code to add a new column to a data frame, containing a case-insensitive match against the elements of a list of strings.
However, the resulting column works for the first element of the list only ('seed' in this case), and for no other match. I'm not sure what is wrong in the for loop.
Here is the sample dataframe you may want to check results for.
input.strings <- c('seed', 'fertilizer', 'fertiliser', 'loan', 'interest', 'feed', 'insurance')
polic = data.frame(policy_label=c('seed supply','energy subsidy','fertilizer distribution','loan guarantee','Interest waiver','feed purchase'))
polic$policy_class <- sapply(polic$policy_label, function(x){
  for (i in input.strings){
    if (grepl(i, tolower(x))){
      return(i)
    }
    else{
      return("others")
    }
  }
})
base R alternative
Here's a somewhat faster and more-direct approach using sapply (and no for loops), relying on the fact that grepl can be vectorized on x=. (It is not vectorized on pattern=, requiring that to be length 1, which is one reason why we need the sapply at all.)
matches <- sapply(input.strings, grepl, x = polic$policy_label)
matches
# seed fertilizer fertiliser loan interest feed insurance
# [1,] TRUE FALSE FALSE FALSE FALSE FALSE FALSE
# [2,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
# [3,] FALSE TRUE FALSE FALSE FALSE FALSE FALSE
# [4,] FALSE FALSE FALSE TRUE FALSE FALSE FALSE
# [5,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
# [6,] FALSE FALSE FALSE FALSE FALSE TRUE FALSE
Because we want to assign "others" to everything without a match (and because the next step needs at least one TRUE in each row), we add an others column:
matches <- cbind(matches, others = rowSums(matches) == 0)
matches
# seed fertilizer fertiliser loan interest feed insurance others
# [1,] TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
# [2,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
# [3,] FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
# [4,] FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
# [5,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
# [6,] FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE
From here, we can find the names associated with the TRUE values and assign them (comma-collapsed, where needed) into polic:
polic$policy_class <- apply(matches, 1, function(z) toString(colnames(matches)[z]))
polic
# policy_label policy_class
# 1 seed supply seed
# 2 energy subsidy others
# 3 fertilizer distribution fertilizer
# 4 loan guarantee loan
# 5 Interest waiver others
# 6 feed purchase feed
FYI, the reason I used toString is that I did not want to assume there would never be more than one match; that is, if two input.strings matched one policy_label for whatever reason, then toString combines them into one string, e.g. "seed, feed" for multi-match policies.
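As an aside, the reason the original loop only ever produced "seed" or "others" is that both branches of its if/else call return() on the very first iteration, so the remaining patterns are never tested. A minimal fix (a sketch, keeping the OP's loop and just moving "others" out of it):
polic$policy_class <- sapply(polic$policy_label, function(x) {
  for (i in input.strings) {
    if (grepl(i, tolower(x))) {
      return(i)    # stop at the first pattern that matches
    }
  }
  "others"         # reached only when no pattern matched
})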
fuzzyjoin alternative
If you're familiar with merges/joins (and What's the difference between INNER JOIN, LEFT JOIN, RIGHT JOIN and FULL JOIN?), then this should seem familiar. If not, the concept of joining data in this way can be transformative to data-munging/cleaning.
library(fuzzyjoin)
out <- regex_left_join(
  polic, data.frame(policy_class = input.strings),
  by = c("policy_label" = "policy_class"))
out
# policy_label policy_class
# 1 seed supply seed
# 2 energy subsidy <NA>
# 3 fertilizer distribution fertilizer
# 4 loan guarantee loan
# 5 Interest waiver <NA>
# 6 feed purchase feed
### clean up the NAs for "others"
out$policy_class[is.na(out$policy_class)] <- "others"
In contrast to the base-R variant above, there is no safeguard here (yet!) for when multiple input.strings match one policy_label; when that happens, the matching row is duplicated, so you'd see (e.g.) seed supply and all the other columns of that row twice. This can be mitigated with a little effort, as sketched below.
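One possible mitigation (a sketch, reusing the toString idea from the base-R variant; note that aggregate re-sorts the rows by policy_label):
### collapse any duplicated matches back to one row per policy
out <- aggregate(policy_class ~ policy_label, data = out, FUN = toString)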

How to select elements of a vector properly?

Why does filtering elements of a vector with '[]' result in NA, while the 'which' function does not return any NA?
Here is an example:
setor <- c('residencial','residencial',NA,'comercial')
setor[setor == 'residencial']
#"residencial" "residencial" NA`
setor[which(setor=='residencial')]
#[1] "residencial" "residencial"
Your help would be much appreciated!
Because when you use == for comparison it returns NA for NA values.
setor == 'residencial'
#[1] TRUE TRUE NA FALSE
and subsetting with NA returns NA
setor[setor=='residencial']
#[1] "residencial" "residencial" NA
However, when we use which, it doesn't count NAs and returns the indices of only the TRUE values.
which(setor=='residencial')
#[1] 1 2
setor[which(setor=='residencial')]
#[1] "residencial" "residencial"
We could use %in%, which returns FALSE where there are NA elements
setor %in% 'residencial'
#[1] TRUE TRUE FALSE FALSE
It also works when we need to subset more than one element, i.e.
setor %in% c('residencial', 'comercial')
#[1] TRUE TRUE FALSE TRUE
and this can be directly used to subset
setor[setor %in% 'residencial']
#[1] "residencial" "residencial"

Which indices are FALSE?

which() conveniently gives all the indices which are TRUE in x. What is a simple way to get all the indices of x which are FALSE?
Sample data
x <- c(T,T,F,F)
x
[1]  TRUE  TRUE FALSE FALSE
The which function gives the indices where we have a TRUE value
which(x)
[1] 1 2
If we need the indices of only the FALSE values, we negate x first:
which(!x)
[1] 3 4
Note that !which(x) does not give the FALSE positions; it merely negates the numeric indices themselves (every non-zero number is treated as TRUE):
!which(x)
[1] FALSE FALSE

Count by row with variable criteria

I have a data.frame in which I want to perform a count by row versus a specified criterion. The part I cannot figure out is that I want a different count criterion for each row.
Say I have 10 rows, I want 10 different criteria for the 10 rows.
I tried count.above <- rowSums(Data > rate), where rate is a vector with the 10 criteria, but R used only the first one as the criterion for the whole frame.
I imagine I could split my frame into 10 vectors and perform this task, but I thought there would be some simple way to do this without resorting to that.
Edit: this depends on whether you want to operate over rows or columns. See below:
This is a job for mapply and Reduce. Suppose you have a data frame along the lines of
df1 <- data.frame(a=1:10,b=2:11,c=3:12)
Let's say we want to count the rows where a>6, b>3 and c>5. This is done with mapply:
mapply(">",df1,c(6,3,5),SIMPLIFY=FALSE)
$a
[1] FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE
$b
[1] FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
$c
[1] FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
Now we use Reduce to find those which are all TRUE:
Reduce("&",mapply(">",df1,c(6,3,5),SIMPLIFY=FALSE))
[1] FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE
Lastly, we use sum to add them all up:
sum(Reduce("&",mapply(">",df1,c(6,3,5),SIMPLIFY=FALSE)))
[1] 4
If you want a result for each row rather than a global aggregate, then apply is the function to use:
apply(df1,1,function(v) sum(v>c(6,3,5)))
[1] 0 0 1 2 2 2 3 3 3 3
Given the dummy data (from @zx8754's solution)
# dummy data
df1 <- data.frame(matrix(1:15, nrow = 3))
myRate <- c(7, 5, 1)
Solution using apply
Courtesy of @JDL
rowSums(apply(df1, 2, function(v) v > myRate))
Alternative solution using the Reduce pattern
Reduce(function(l, v) cbind(l[,1] + (l[,2] > myRate), l[,-2:-1]),
       1:ncol(df1),
       cbind(0, df1))
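For what it's worth, when the data frame is all numeric, the comparison the OP tried can be made to work directly (a sketch, assuming myRate has one entry per row): comparing a matrix with a vector recycles the vector down each column, so row i is compared against myRate[i] in every column.
rowSums(as.matrix(df1) > myRate)
# [1] 2 3 5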

R: Choosing specific number of combinations from all possible combinations

Let's say we have the following dataset
set.seed(144)
dat <- matrix(rnorm(100), ncol=5)
The following expression creates all possible combinations of columns and removes the first row (which selects no columns):
(cols <- do.call(expand.grid, rep(list(c(F, T)), ncol(dat)))[-1,])
# Var1 Var2 Var3 Var4 Var5
# 2 TRUE FALSE FALSE FALSE FALSE
# 3 FALSE TRUE FALSE FALSE FALSE
# 4 TRUE TRUE FALSE FALSE FALSE
# ...
# 31 FALSE TRUE TRUE TRUE TRUE
# 32 TRUE TRUE TRUE TRUE TRUE
My question is: how can I calculate the single, binary and triple combinations only?
Choosing the rows with no more than 3 TRUE values, cols[rowSums(cols) < 4L, ], works for this example.
However, it fails for larger vectors, mainly because expand.grid errors out on long vectors:
Error in rep.int(seq_len(nx), rep.int(rep.fac, nx)) :
invalid 'times' value
In addition: Warning message:
In rep.fac * nx : NAs produced by integer overflow
Any suggestion that would allow me to compute single, binary and triple combinations only?
You could try either
cols[rowSums(cols) < 4L, ]
Or
cols[Reduce(`+`, cols) < 4L, ]
You can use this solution:
col.i <- do.call(c,lapply(1:3,combn,x=5,simplify=F))
# [[1]]
# [1] 1
#
# [[2]]
# [1] 2
#
# <...skipped...>
#
# [[24]]
# [1] 2 4 5
#
# [[25]]
# [1] 3 4 5
Here, col.i is a list every element of which contains column indices.
How it works: combn generates all combinations of the numbers from 1 to 5 (requested by x=5) taken m at a time (simplify=FALSE ensures that the result has a list structure). lapply runs an implicit loop that iterates m from 1 to 3 and returns a list of lists. do.call(c,...) flattens that list of lists into a plain list.
You can use col.i to get certain columns from dat, e.g. dat[,col.i[[1]],drop=F] (here 1 is the index of the column combination, so you could use any number from 1 to 25; drop=F makes sure that when you pick just one column from dat, the result is not simplified to a vector, which might cause unexpected program behavior). Another option is to use lapply, e.g.
lapply(col.i, function(cols) dat[,cols])
which will return a list of matrices, each containing a certain subset of the columns of dat.
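For instance (a hypothetical use of that list), you could summarise every column subset in one pass:
subset.means <- sapply(col.i, function(cols) mean(dat[, cols]))
length(subset.means)
# [1] 25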
In case you want to get column indices as a boolean matrix, you can use:
col.b <- t(sapply(col.i,function(z) 1:5 %in% z))
# [,1] [,2] [,3] [,4] [,5]
# [1,] TRUE FALSE FALSE FALSE FALSE
# [2,] FALSE TRUE FALSE FALSE FALSE
# [3,] FALSE FALSE TRUE FALSE FALSE
# ...
[UPDATE]
More efficient realization:
library("gRbase")
coli <- function(x=5, m=3) {
  col.i <- do.call(c, lapply(1:m, combnPrim, x=x, simplify=F))
  z <- lapply(seq_along(col.i), function(i) x*(i-1) + col.i[[i]])
  v.b <- rep(F, x*length(col.i))
  v.b[unlist(z)] <- TRUE
  matrix(v.b, ncol=x, byrow = TRUE)
}
coli(70,5) # takes about 30 sec on my desktop
