Create indicator variables within a list in R

I have a list containing sequences of numbers. I want to create a list that indicates all non-zero elements up to and including the first element that matches a defined limit. I also want to create a list that indicates all non-zero elements after the first element that matches the defined limit.
I prefer a base R solution. Presumably the solution will use lapply, but I have not been able to come up with a simple one.
Below is a minimal reproducible example in which the limit is 2:
my.limit <- 2
my.samples <- list(0,c(1,2),0,c(0,1,1),0,0,0,0,0,c(1,1,2,2,3,4),c(0,1,2),0,c(0,0,1,1,2,2,3))
Here are the two desired lists:
within.limit <- list(0,c(1,1),0,c(0,1,1),0,0,0,0,0,c(1,1,1,0,0,0),c(0,1,1),0,c(0,0,1,1,1,0,0))
outside.limit <- list(0,c(0,0),0,c(0,0,0),0,0,0,0,0,c(0,0,0,1,1,1),c(0,0,0),0,c(0,0,0,0,0,1,1))

We can use match with the nomatch argument set to a very big number (it should be greater than the length of any vector in the list; for some reason Inf couldn't be used here):
within.limit1 <- lapply(my.samples, function(x)
+(x > 0 & seq_along(x) <= match(my.limit, x, nomatch = 1000)))
outside.limit1 <- lapply(my.samples, function(x)
+(seq_along(x) > match(my.limit, x, nomatch = 1000)))
Checking that the output matches the desired lists:
all(mapply(function(x, y) all(x == y), within.limit, within.limit1))
#[1] TRUE
all(mapply(function(x, y) all(x == y), outside.limit, outside.limit1))
#[1] TRUE

I would do
within.limit2 <- lapply(my.samples, function(x)
+(x != 0 & (x < my.limit | cumsum(x == my.limit) == 1)))
outside.limit2 <- lapply(my.samples, function(x)
+(x != 0 & (x > my.limit | cumsum(x == my.limit) > 1)))
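Checking these against the desired lists (a check added here, not part of the original answer):
all(mapply(function(x, y) all(x == y), within.limit, within.limit2))
# [1] TRUE
all(mapply(function(x, y) all(x == y), outside.limit, outside.limit2))
# [1] TRUE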

foo <- function(samples, limit, within = TRUE) {
  `%cp%` <- if (within) `<=` else `>`
  lapply(samples, function(x) pmin(x, seq_along(x) %cp% match(limit, x, nomatch = 1e8)))
}
> all.equal(foo(my.samples, my.limit, FALSE), outside.limit)
# [1] TRUE
> all.equal(foo(my.samples, my.limit, TRUE), within.limit)
# [1] TRUE

We can use findInterval: with left.open = TRUE it counts the elements strictly below my.limit, so adding 1 gives the position of the first element equal to my.limit (this relies on each vector being non-decreasing, as in the example data):
lapply(my.samples, function(x)
+(x > 0 & seq_along(x) <= findInterval(my.limit, x, left.open = TRUE) + 1))
and
lapply(my.samples, function(x) +(seq_along(x) > findInterval(my.limit, x, left.open = TRUE) + 1))
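A quick check of the findInterval variant against the desired lists (again added here, not part of the original answer):
within.limit3 <- lapply(my.samples, function(x)
+(x > 0 & seq_along(x) <= findInterval(my.limit, x, left.open = TRUE) + 1))
outside.limit3 <- lapply(my.samples, function(x) +(seq_along(x) > findInterval(my.limit, x, left.open = TRUE) + 1))
all(mapply(function(x, y) all(x == y), within.limit, within.limit3))
# [1] TRUE
all(mapply(function(x, y) all(x == y), outside.limit, outside.limit3))
# [1] TRUE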

Related

Convert values outside a range to the range's bounds

If I have a series of values
set.seed(123)
x <- rnorm(100)
and a given range (a, b), e.g.
a <- -1; b <- 2
How could I move those values less than a to a and those greater than b to b?
The following basic method works, but I'm looking for a function or a one-liner.
x[x < a] <- a
x[x > b] <- b
If we need a single line, use pmin/pmax
out <- pmin(pmax(x, a), b)
Checking against the basic method:
x[x < a] <- a
x[x > b] <- b
identical(out, x)
[1] TRUE
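For reuse, the one-liner can be wrapped in a tiny helper; a sketch (the name clamp is just illustrative, not an existing function here):
clamp <- function(x, lo, hi) pmin(pmax(x, lo), hi)
clamp(c(-5, 0, 3), a, b)
# [1] -1  0  2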

Problem applying a function to a list in R

I have created a function that converts "YYYYQQ" to an integer YYYYMMDD. The function works well with individual values in a list but not on the whole list, and I am unable to understand the warning message.
GetProperDate <- function(x) {
  x <- as.character(x)
  q <- substr(x, 5, 6)
  y <- substr(x, 1, 4) %>% as.numeric()
  if (q == "Q1") {
    x <- as.integer(paste0(y, "03", "31"))
  }
  if (q == "Q2") {
    x <- as.integer(paste0(y, "06", "30"))
  }
  if (q == "Q3") {
    x <- as.integer(paste0(y, "09", "30"))
  }
  if (q == "Q4") {
    x <- as.integer(paste0(y, "12", "31"))
  }
  return(x)
}
> GetProperDate("2019Q1")
[1] 20190331
> GetProperDate("2019Q2")
[1] 20190630
> GetProperDate("2019Q3")
[1] 20190930
> GetProperDate("2019Q4")
[1] 20191231
> date.list<-c("2019Q1","2019Q2","2019Q3","2019Q4")
> date.list.converted<- date.list %>% GetProperDate()
Warning messages:
1: In if (q == "Q1") { :
the condition has length > 1 and only the first element will be used
2: In if (q == "Q2") { :
the condition has length > 1 and only the first element will be used
3: In if (q == "Q3") { :
the condition has length > 1 and only the first element will be used
4: In if (q == "Q4") { :
the condition has length > 1 and only the first element will be used
> date.list.converted
[1] 20190331 20190331 20190331 20190331
As shown above I am getting a warning message and the output is not as expected.
The issue is that your function GetProperDate is not vectorised: if works on a single (scalar) condition, not on a vector. You can switch to ifelse, which is vectorised, and rewrite your function, for example as sketched below.
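A minimal vectorised rewrite using nested ifelse calls (the name GetProperDateVec is just illustrative, not from the original thread):
GetProperDateVec <- function(x) {
  x <- as.character(x)
  q <- substr(x, 5, 6)
  y <- substr(x, 1, 4)
  # each branch builds "YYYYMMDD" for the quarter end; ifelse keeps it vectorised
  as.integer(ifelse(q == "Q1", paste0(y, "0331"),
                    ifelse(q == "Q2", paste0(y, "0630"),
                           ifelse(q == "Q3", paste0(y, "0930"),
                                  paste0(y, "1231")))))
}
GetProperDateVec(c("2019Q1", "2019Q2", "2019Q3", "2019Q4"))
# [1] 20190331 20190630 20190930 20191231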
Apart from that, you can also use as.yearqtr from zoo, which handles quarterly dates, and get the last date of the quarter by using frac = 1.
as.Date(zoo::as.yearqtr(date.list), frac = 1)
#[1] "2019-03-31" "2019-06-30" "2019-09-30" "2019-12-31"
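If the integer YYYYMMDD form is wanted, those dates can be converted back (a note added here, not from the answer):
as.integer(format(as.Date(zoo::as.yearqtr(date.list), frac = 1), "%Y%m%d"))
# [1] 20190331 20190630 20190930 20191231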
When you pass a vector to the function, it compares a vector with a scalar, and R automatically takes only the first element of the vector. That is why you get the warning "the condition has length > 1 and only the first element will be used". Try this:
date.list<-c("2019Q1","2019Q2","2019Q3","2019Q4")
date.list.converted <- sapply(date.list, function(s) GetProperDate(s))
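As an aside (not part of the original answer): sapply() over a character vector names the result by its inputs, so pass USE.NAMES = FALSE (or wrap the result in unname()) if a plain vector is wanted:
date.list.converted <- sapply(date.list, GetProperDate, USE.NAMES = FALSE)
date.list.converted
# [1] 20190331 20190630 20190930 20191231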
Try this:
library(tidyverse)
GetProperDate <- function(x) {
  x <- as.character(x)
  q <- substr(x, 5, 6)
  y <- substr(x, 1, 4) %>% as.numeric()
  x <- case_when(
    q == "Q1" ~ as.integer(paste0(y, "03", "31")),
    q == "Q2" ~ as.integer(paste0(y, "06", "30")),
    q == "Q3" ~ as.integer(paste0(y, "09", "30")),
    TRUE ~ as.integer(paste0(y, "12", "31")))
  return(x)
}
date.list<-c("2019Q1","2019Q2","2019Q3","2019Q4")
> GetProperDate(date.list)
[1] 20190331 20190630 20190930 20191231

Identifying source of FALSE

My question is, does there exist a function that, given a logical statement, identifies the source of FALSE (if it is false)?
For example,
x=1; y=1; z=1;
x==1 & y==1 & z==2
Obviously it is the value of z that makes the statement false. In general though, is there a function that lets me identify the variable(s) whose values make a logical statement false?
Instead of writing x==1 & y==1 & z==2 you could define
cn <- c(x == 1, y == 1, z == 2)
or
cn <- c(x, y, z) == c(1, 1, 2)
and use all(cn). Then
which(!cn)
# [1] 3
gives the source(s) of FALSE.
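If this comes up often, the same idea can be wrapped in a small helper (a sketch added here, not part of the answer; the name which_false is made up):
which_false <- function(...) {
  conds <- c(...)          # named logical vector of conditions
  names(conds)[!conds]     # names of the conditions that are FALSE
}
which_false(x = x == 1, y = y == 1, z = z == 2)
# [1] "z"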
In general, no, there is no such function as the one you are looking for, but for different logical statements a similar approach should work, although it might be too lengthy to pursue.
Considering (!(x %in% c(1,2,3)) & y==3) | z %in% c(4,5), we get FALSE if z %in% c(4,5) is FALSE and (!(x %in% c(1,2,3)) & y==3) is FALSE simultaneously. So, if (!(x %in% c(1,2,3)) & y==3) | z %in% c(4,5) returns FALSE, we are sure about z and still need to check x and y, so that the list of problematic variables can be obtained as follows:
if(!((!(x %in% c(1,2,3)) & y==3) | z %in% c(4,5)))
c("x", "y", "z")[c(x %in% c(1,2,3), !y == 3, TRUE)]
# [1] "x" "y" "z"
or
a <- !(x %in% c(1,2,3))
b <- y == 3
c <- z %in% c(4,5)
if(!((a & b) | c))
c("x", "y", "z")[c(!a, !b, TRUE)]
# [1] "x" "y" "z"
I like #julius's answer but there is also the stopifnot function.
x <- 1; y <- 1; z <- 2
stopifnot(x == 1, y == 1, z == 1)
#Error: z == 1 is not TRUE
Note that the result is an error if there are any false statements and nothing if they're all true. It also stops at the first false statement, so if you had something like
x <- T; y <- F; z <- F
stopifnot(x, y, z)
#Error: y is not TRUE
you would not be told that z is FALSE in this case.
So the result isn't a logical or an index but instead is either nothing or an error. This doesn't seem desirable but it is useful if the reason you're using it is for checking inputs to a function or something similar where you want to produce an error on invalid inputs and just keep on moving if everything is fine. I mention stopifnot because it seems like this might be the situation you're in. I'm not sure.
Here is a silly example where you might use it. In this case you apparently only want positive numbers as input and reject everything else:
doublePositiveNumber <- function(x){
stopifnot(is.numeric(x), x >= 0)
return(2*x)
}
which results in
> doublePositiveNumber("hey")
Error: is.numeric(x) is not TRUE
> doublePositiveNumber(-2)
Error: x >= 0 is not TRUE
> doublePositiveNumber(2)
[1] 4
So here you guarantee you get the inputs you want and produce an error message for the user that hopefully tells them what the issue is.

How to deal with NA when using lapply in R

I have a data frame err consisting of 796 rows and 54432 columns.
I have to find the columns whose values all lie strictly between -20 and 20.
This is my approach:
do.call(cbind, (lapply(err, function(x) if((all(x<20) & all(x>-20))) return(x) )))
I have NA values in all of the columns, and I got:
Error in if ((all(x < 20) & all(x > -20))) return(x) :
missing value where TRUE/FALSE needed
I updated the command using !is.na as:
do.call(cbind, (lapply(err, function(x) if(!is.na(all(x<20) & all(x>-20))) return(x) )))
But in this case all the columns are reported and the filter does not work.
Any help?
Since I don't have an example data frame, check if this works for you:
do.call("cbind", lapply(err, function(x) if(min(x, na.rm=T) > -20 & max(x, na.rm=T) < 20) return(x) ))
Using apply
err[apply(err, 2, function(x) min(x,na.rm=T) > -20 & max(x,na.rm=T) < 20)]
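Another base R sketch (not from the original answers) uses Filter(), which keeps only the columns for which the predicate is TRUE; err2 is a small made-up data frame for illustration:
err2 <- data.frame(a = c(1, NA, 5),    # stays within (-20, 20), kept
                   b = c(25, 2, NA),   # 25 exceeds 20, dropped
                   c = c(-3, -19, 0))  # stays within (-20, 20), kept
names(Filter(function(x) all(x > -20 & x < 20, na.rm = TRUE), err2))
# [1] "a" "c"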

Fastest way to find *the index* of the second (third...) highest/lowest value in vector or column

Fastest way to find the index of the second (third...) highest/lowest value in vector or column ?
i.e. what
sort(x,partial=n-1)[n-1]
is to
max()
but for
which.max()
Best,
Fastest way to find second (third...) highest/lowest value in vector or column
One possible route is to use the index.return argument to sort. I'm not sure if this is fastest though.
set.seed(21)
x <- rnorm(10)
ind <- 2
sapply(sort(x, index.return=TRUE), `[`, length(x)-ind+1)
# x ix
# 1.746222 3.000000
EDIT 2:
As Joshua pointed out, none of the given solutions actually performs correctly when there is a tie on the maxima, so
X <- c(11:19,19)
n <- length(unique(X))
which(X == sort(unique(X),partial=n-1)[n-1])
is then the fastest way of doing it correctly. I deleted the order approach, as it doesn't work with ties and is a lot slower, so it is not a good answer according to the OP.
To illustrate the issue we ran into:
> X <- c(11:19,19)
> n <- length(X)
> which(X == sort(X,partial=n-1)[n-1])
[1] 9 10 # which are the indices of the tied maximum 19
> n <- length(unique(X))
> which(X == sort(unique(X),partial=n-1)[n-1])
[1] 8 # which is the correct index of 18
The timings of the valid solutions:
> x <- runif(1000000)
> ind <- 2
> n <- length(unique(x))
> system.time(which(x == sort(unique(x),partial=n-ind+1)[n-ind+1]))
user system elapsed
0.11 0.00 0.11
> system.time(sapply(sort(unique(x), index.return=TRUE), `[`, n-ind+1))
user system elapsed
0.69 0.00 0.69
The Rfast library has implemented the nth element function with a return-index option.
UPDATE (28/FEB/21): the kit package offers a faster implementation (topn), as shown in the simulations below.
x <- runif(1e+6)
n <- 2
which_nth_highest_richie <- function(x, n)
{
for(i in seq_len(n - 1L)) x[x == max(x)] <- -Inf
which(x == max(x))
}
which_nth_highest_joris <- function(x, n)
{
ux <- unique(x)
nux <- length(ux)
which(x == sort(ux, partial = nux - n + 1)[nux - n + 1])
}
microbenchmark::microbenchmark(
topn = kit::topn(x, n,decreasing = T)[n],
Rfast = Rfast::nth(x,n,descending = T,index.return = T),
order = order(x, decreasing = TRUE)[n],
richie = which_nth_highest_richie(x,n),
joris = which_nth_highest_joris(x,n))
Unit: milliseconds
expr min lq mean median uq max neval
topn 3.741101 3.7917 4.517201 4.060752 5.108901 7.403901 100
Rfast 15.8121 16.7586 20.64204 17.73010 20.7083 47.6832 100
order 110.5416 113.4774 120.45807 116.84005 121.2291 164.5618 100
richie 22.7846 24.1552 39.35303 27.10075 42.0132 179.289 100
joris 131.7838 140.4611 158.20704 156.61610 165.1735 243.9258 100
topn is the clear winner at finding the index of the 2nd biggest value among 1 million numbers.
Further, simulations were run to estimate the running time of finding the nth biggest number for varying n.
The variable x was repopulated for each n, but its size was always 1 million numbers.
As shown, topn is the best option for finding the nth biggest element and its index, given that n is not too big. In the accompanying plot (not reproduced here), topn becomes slower than Rfast's nth for bigger n.
It is worth noting that topn has not been implemented for n > 1000 and will throw an error in such cases.
Method: Set all max values to -Inf, then find the indices of the max. No sorting required.
X <- runif(1e7)
system.time(
{
X[X == max(X)] <- -Inf
which(X == max(X))
})
Works with ties and is very fast.
If you can guarantee no ties, then an even faster version is
system.time(
{
X[which.max(X)] <- -Inf
which.max(X)
})
EDIT: As Joris mentioned, this method doesn't scale that well for finding third, fourth, etc., highest values.
which_nth_highest_richie <- function(x, n)
{
for(i in seq_len(n - 1L)) x[x == max(x)] <- -Inf
which(x == max(x))
}
which_nth_highest_joris <- function(x, n)
{
ux <- unique(x)
nux <- length(ux)
which(x == sort(ux, partial = nux - n + 1)[nux - n + 1])
}
Using x <- runif(1e7) and n = 2, Richie wins
system.time(which_nth_highest_richie(x, 2)) #about half a second
system.time(which_nth_highest_joris(x, 2)) #about 2 seconds
For n = 100, Joris wins
system.time(which_nth_highest_richie(x, 100)) #about 20 seconds, ouch!
system.time(which_nth_highest_joris(x, 100)) #still about 2 seconds
The balance point, where they take the same length of time, is about n = 10.
No ties: which() is probably your friend here. Combine the output from the sort() solution with which() to find the index that matches the output from the sort() step.
> set.seed(1)
> x <- sample(1000, 250)
> n <- length(x)
> sort(x,partial=n-1)[n-1]
[1] 992
> which(x == sort(x,partial=n-1)[n-1])
[1] 145
Ties handling: The solution above doesn't work properly (and wasn't intended to) if there are ties among the values that are the ith largest or larger. We need to take the unique values of the vector before sorting those values, and then the above solution works:
> set.seed(1)
> x <- sample(1000, 1000, replace = TRUE)
> length(unique(x))
[1] 639
> n <- length(x)
> i <- which(x == sort(x,partial=n-1)[n-1])
> sum(x > x[i])
[1] 0
> x.uni <- unique(x)
> n.uni <- length(x.uni)
> i <- which(x == sort(x.uni, partial = n.uni-1)[n.uni-1])
> sum(x > x[i])
[1] 2
> tail(sort(x))
[1] 994 996 997 997 1000 1000
order() is also very useful here:
> head(ord <- order(x, decreasing = TRUE))
[1] 220 145 209 202 211 163
So the solution here is ord[2] for the index of the 2nd highest/largest element of x.
Some timings:
> set.seed(1)
> X <- sample(1e7, 1e7)
> system.time({n <- length(X); which(X == sort(X, partial = n-1)[n-1])})
user system elapsed
0.319 0.058 0.378
> system.time({ord <- order(X, decreasing = TRUE); ord[2]})
user system elapsed
14.578 0.084 14.708
> system.time({order(X, decreasing = TRUE)[2]})
user system elapsed
14.647 0.084 14.779
But as the linked post was getting at and the timings above show, order() is much slower, but both provide the same results:
> all.equal(which(X == sort(X, partial = n-1)[n-1]),
+ order(X, decreasing = TRUE)[2])
[1] TRUE
And for the ties-handling version:
foo <- function(x, i) {
X <- unique(x)
N <- length(X)
i <- i-1
which(x == sort(X, partial = N-i)[N-i])
}
> system.time(foo(X, 2))
user system elapsed
1.249 0.176 1.454
So the extra steps slow this solution down a bit, but it is still very competitive with order().
Use the maxN() function given by Zach to find the nth highest value and use which() with arr.ind = TRUE.
which(x == maxN(x, 4), arr.ind = TRUE)
Using arr.ind = TRUE will return the index position in any of the above solutions as well and simplifies the code.
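maxN() itself is not defined in this thread; a minimal sketch of such a helper (an assumed definition based on partial sorting, along the lines of Zach's answer) is:
maxN <- function(x, N = 2) {
  len <- length(x)
  if (N > len) stop("N is greater than the length of x")
  # partial sorting guarantees the (len - N + 1)-th position holds the N-th largest value
  sort(x, partial = len - N + 1)[len - N + 1]
}
maxN(c(10, 40, 30, 20), 2)
# [1] 30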
This is my solution for finding the indices of the top N highest values in a vector (not exactly what the OP wanted, but it might help other people):
index.top.N <- function(xs, N = 10) {
  if (length(xs) > 0) {
    o <- order(xs, na.last = FALSE)
    o.length <- length(o)
    if (N > o.length) N <- o.length
    o[(o.length - N + 1):o.length]
  } else {
    0
  }
}
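A small usage sketch (added here, not part of the original answer); the indices are returned in increasing order of value:
v <- c(5, 1, 9, 3, 7)
index.top.N(v, 2)
# [1] 5 3
v[index.top.N(v, 2)]
# [1] 7 9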
