How does R remember which values were indexed? - r

Code:
x <- c(1, 1, 3, 2, 3, 5)
x[x < 3] <- x[x < 3] * 100
x
Output:
[1] 100 100 3 200 3 5
I expected to get the output “[1] 100 100 200”. How does R remember the indexes where x < 3? Because when running x[x < 3], you get the output “1,1,2” (and the indexes of those numbers are 1,2,4).
Hence, R remember the values where x < 3, and manipulates those values. But how does R know the indexes of those values?

If you want to get the result of your call you need to assign it to a new variable:
x2 <- (x[x < 3] <- x[x<3] * 100)
x2
#[1] 100 100 200
Note that x will still be changed (as a result of your assignment x[x < 3] <- x[x<3] * 100).
If that is not desired, you can simply do:
x2 <- x[x<3]*100
x2
#[1] 100 100 200
x
#[1] 1 1 3 2 3 5

If you look at what is x < 3 you'll find it yields
[1] TRUE TRUE FALSE TRUE FALSE FALSE
so x[x<3] <- x[x < 3] * 100 means multiply by 100 the first, second and fourth value of x (x[x < 3]*100) and replace the first, second and fourth value of x by these new values (x[x<3] <-).

It's not remembering the locations of the x < 3 values. The left-hand side of your equation is explicitly the locations of the x < 3 values.
So the thing on the right might be (100, 100, 200), but the thing on the left is the placeholders at places 1, 2 and 4.

In your code, you ask R to go to x and multiple each value less than 3 by 100 that is in this code x[x < 3]*100. R, in this case, looks for the place of your element. That is, the code goes to x values and takes them one by one. The one that obeys the condition is then multiplied by 100. Then, the code returns you all the values of x. The correct way is to assign a name to your specific values as #Cath answer.
In addition, it is a good idea to write a general function for your convenience. Here is a function.
myfun <- function(x, n){
res <- lapply(x, function(i) (if (i < n) {i *100}else{0}))
res
}
or simply use the comment of #Cath.
myfun <- function(x, n){
res <- x[x < n]*100
res
}
myfun(x, 3)
x <- list(1, 1, 3, 2, 3, 5)
n=3
Then,
> myfun(x, 3)
[[1]]
[1] 100
[[2]]
[1] 100
[[3]]
[1] 0
[[4]]
[1] 200
[[5]]
[1] 0
[[6]]
[1] 0

Related

How to add possible divisor numbers?

How do I retrieve maximum sum of possible divisors numbers
I have a below function which will give possible divisors of number
Code
divisors <- function(x) {
y <- seq_len(ceiling(x / 2))
y[x %% y == 0]
}
Example
Divisors of 99 will give the below possible values.
divisors(99)
[1] 1 3 9 11 33
My expected Logic :
Go from last digit to first digit in the divisors value
The last number is 33, Here next immediate number divisible by 33 is 11 . So I selected 11 , now traversing from 11 the next immediate number divisible by 11 is 1. So selected 1. Now add all the numbers.
33 + 11 + 1 = 45
Move to next number 11, Now next immediate number divisible by 11 is 1. So selected 1. Now add all the numbers.
11 + 1 = 12
Here immediate
Move to next number 9, Now next immediate number divisible by 11 is 1. So selected 1. Now add all the numbers.
9 + 3 + 1 = 13
Move to next number 3, Now next immediate number divisible by 3 is 1. So selected 1. Now add all the numbers.
3+1=4
Now maximum among these is 45.
Now I am struggling to write this logic in R . Help / Advice much appreciated.
Note : Prime numbers can be ignored.
update
For large integers, e.g., the maximum integer .Machine$integer.max (prime number), you can run the code below (note that I modified functions divisors and f a bit)
divisors <- function(x) {
y <- seq(x / 2)
y[as.integer(x) %% y == 0]
}
f <- function(y) {
if (length(y) <= 2) {
return(as.integer(sum(y)))
}
l <- length(y)
h <- y[l]
yy <- y[-l]
h + f(yy[h %% yy == 0])
}
and you will see
> n <- .Machine$integer.max - 1
> x <- divisors(n)
> max(sapply(length(x):2, function(k) f(head(x, k))))
[1] 1569603656
You can define a recursive function f that gives successive divisors
f <- function(y) {
if (length(y) == 1) {
return(y)
}
h <- y[length(y)]
yy <- y[-length(y)]
c(f(yy[h %% yy == 0]), h)
}
and you will see all possible successive divisor tuples
> sapply(rev(seq_along(x)), function(k) f(head(x, k)))
[[1]]
[1] 1 11 33
[[2]]
[1] 1 11
[[3]]
[1] 1 3 9
[[4]]
[1] 1 3
[[5]]
[1] 1
Then, we apply f within sapply like below
> max(sapply(rev(seq_along(x)), function(k) sum(f(head(x, k)))))
[1] 45
which gives the desired output.
You can also use the following solution. It may sound a little bit complicated and of course there is always an easier, more efficient solution. However, I thought this could be useful to you. I will take it from your divisors output:
> x
[1] 1 3 9 11 33
# First I created a list whose first element is our original x and from then on
# I subset the first element till the last element of the list
lst <- lapply(0:(length(x)-1), function(a) x[1:(length(x)-a)])
> lst
[[1]]
[1] 1 3 9 11 33
[[2]]
[1] 1 3 9 11
[[3]]
[1] 1 3 9
[[4]]
[1] 1 3
[[5]]
[1] 1
Then I wrote a custom function in order to implement your conditions and gather your desired output. For this purpose I created a function factory which in fact is a function that creates a function:
As you might have noticed the outermost function does not take any argument. It only sets up an empty vector out to save our desired elements in. It is created in the execution environment of the outermost function to shield it from any changes that might affect it in the global environment
The inner function is the one that takes our vector x so in general we call the whole setup like fnf()(x). First element of of our out vector is in fact the first element of the original x(33). Then I found all divisors of the first element whose quotient were 0. After I fount them I took the second element (11) as the first one was (33) and stored it in our out vector. Then I modified the original x vector and omitted the max value (33) and repeated the same process
Since we were going to repeat the process over again, I thought this might be a good case to use recursion. Recursion is a programming technique that a function actually calls itself from its body or from inside itself. As you might have noticed I used fn inside the function to repeat the process again but each time with one fewer value
This may sound a bit complicated but I believed there may be some good points for you to pick up for future exploration, since I found them very useful, hoped that's the case for you too.
fnf <- function() {
out <- c()
fn <- function(x) {
out <<- c(out, x[1])
z <- x[out[length(out)]%%x == 0]
if(length(z) >= 2) {
out[length(out) + 1] <<- z[2]
} else {
return(out)
}
x <- x[!duplicated(x)][which(x[!duplicated(x)] == z[2]):length(x[!duplicated(x)])]
fn(x)
out[!duplicated(out)]
}
}
# The result of applying the custom function on `lst` would result in your
# divisor values
lapply(lst, function(x) fnf()(sort(x, decreasing = TRUE)))
[[1]]
[1] 33 11 1
[[2]]
[1] 11 1
[[3]]
[1] 9 3 1
[[4]]
[1] 3 1
[[5]]
[1] 1
In the end we sum each element and extract the max value
Reduce(max, lapply(lst, function(x) sum(fnf()(sort(x, decreasing = TRUE)))))
[1] 45
Testing a very large integer number, I used dear #ThomasIsCoding's modified divisors function:
divisors <- function(x) {
y <- seq(x / 2)
y[as.integer(x) %% y == 0]
}
x <- divisors(.Machine$integer.max - 1)
lst <- lapply(0:(length(x)-1), function(a) x[1:(length(x)-a)])
Reduce(max, lapply(lst, function(x) sum(fnf()(sort(x, decreasing = TRUE)))))
[1] 1569603656
You'll need to recurse. If I understand correctly, this should do what you want:
fact <- function(x) {
x <- as.integer(x)
div <- seq_len(abs(x)/2)
factors <- div[x %% div == 0L]
return(factors)
}
maxfact <- function(x) {
factors <- fact(x)
if (length(factors) < 3L) {
return(sum(factors))
} else {
return(max(factors + mapply(maxfact, factors)))
}
}
maxfact(99)
[1] 45

How to concisely deal with subsets when their lengths become zero?

To exclude elements from a vector x,
x <- c(1, 4, 3, 2)
we can subtract a vector of positions:
excl <- c(2, 3)
x[-excl]
# [1] 1 2
This also works dynamically,
(excl <- which(x[-which.max(x)] > quantile(x, .25)))
# [1] 2 3
x[-excl]
# [1] 1 2
until excl is of length zero:
excl.nolength <- which(x[-which.max(x)] > quantile(x, .95))
length(excl.nolength)
# [1] 0
x[-excl.nolength]
# integer(0)
I could kind of reformulate that, but I have many objects to which excl is applied, say:
letters[1:4][-excl.nolength]
# character(0)
I know I could use setdiff, but that's rather long and hard to read:
x[setdiff(seq(x), excl.nolength)]
# [1] 1 4 3 2
letters[1:4][setdiff(seq(letters[1:4]), excl.nolength)]
# [1] "a" "b" "c" "d"
Now, I could exploit the fact that nothing is excluded if the element number is greater than the number of elements:
length(x)
# [1] 4
x[-5]
# [1] 1 4 3 2
To generalize that I should probably use .Machine$integer.max:
tmp <- which(x[-which.max(x)] > quantile(x, .95))
excl <- if (!length(tmp) == 0) tmp else .Machine$integer.max
x[-excl]
# [1] 1 4 3 2
Wrapped into a function,
e <- function(x) if (!length(x) == 0) x else .Machine$integer.max
that's quite handy and clear:
x[-e(excl)]
# [1] 1 2
x[-e(excl.nolength)]
# [1] 1 4 3 2
letters[1:4][-e(excl.nolength)]
# [1] "a" "b" "c" "d"
But it seems a little fishy to me...
Is there a better equally concise way to deal with a subset of length zero in base R?
Edit
excl comes out as dynamic result of a function before (as shown with which above) and might be of length zero or not. If length(excl) == 0 nothing should be excluded. Following lines of code, e.g. x[-excl] should not have to be changed at best or as little as possible.
You can overwrite [ with your own function.
"[" <- function(x,y) {if(length(y)==0) x else .Primitive("[")(x,y)}
x <- c(1, 4, 3, 2)
excl <- c(2, 3)
x[-excl]
#[1] 1 2
excl <- integer()
x[-excl]
#[1] 1 4 3 2
rm("[") #Go back to normal mode
I would argue this is somewhat opinion based.
For example i find:
x <- x[-if(length(excl <- which(x[-which.max(x)] > quantile(x, .95))) == 0) .Machine$integer.max else excl]
very unreadable, but some people like one-liners. Reading package code you'll often find this is instead split up into one of the many suggestions you gave
excl <- which(x[-which.max(x)] > quantile(x, .95))
if(length(excl) != 0)
x <- x[-excl]
Alternatively, you could avoid which, and simply use the logical vector for subsetting, and this would likely be considered more clean by most
x <- x[!x[-which.max(x)] > quantile(x, .95)]
This would avoid zero-length index problem, at the cost of some loss of efficiency.
As a side note, the very example used above and in the question seems somewhat off. First which.max only returns the first index which is equal to the max value, and in addition the index will be offset for every value removed. More likely the expected example would be
x <- x[!(x > quantile(x, .95))[-which(x == max(x))]]
How bout this?
a <- letters[1:3]
excl1 <- c(1,3)
excl2 <- c()
a[!(seq_along(a) %in% excl1)]
a[!(seq_along(a) %in% excl2)]

Find closest value with condition

I have a function that finds me the nearest values for each row in a matrix. It then reports a list with an index of the nearest rows. However, I want it to exclude values if they are +1 in the first AND +1 in the second column away from a particular set of values (-1 in the first and -1 in the second column should also be removed). Moreover, +1 in first column and -1 in second column with respect to the values of interest should also be avoided.
As an example, if I want things closes to c(2, 1), it should accept c(3,1) or (2,2) or (1,1), but NOT c(3,2) and not c(1,0).
Basically, for an output to be reported either column 1 or column 2 should be a value of 1 away from a row of interest, but not both.
input looks like this
x
v1 v2
[1,] 3 1
[2,] 2 1
[3,] 3 2
[4,] 1 2
[5,] 8 5
myfunc(x)
The output looks like this. Notice that the closest thing to row 2 ($V2 in output) is row 1,3,4. The answer should only be 1 though.
$V1
[1] 2 3
$V2
[1] 1 3 4
$V3
[1] 1 2
$V4
[1] 2
$V5
integer(0)
Here is myfunc
myfunc = function(t){
d1 <- dist(t[,1])
d2 <- dist(t[,2])
dF <- as.matrix(d1) <= 1 & as.matrix(d2) <= 1
diag(dF) <- NA
colnames(dF) <- NULL
dF2 <- lapply(as.data.frame(dF), which)
return(dF2)
}
Basically, the rows that you want to find should differ from your reference element by +1 or -1 in one column and be identical in the other column. That means that the sum over the absolute values of the differences is exactly one. For your example c(2, 1), this works as follows:
c(3, 1): difference is c(1, 0), thus sum(abs(c(1, 0))) = 1 + 0 = 1
c(1, 1): difference is c(-1, 0), thus sum(abs(c(-1, 0))) = 1 + 0 = 1
etc.
The following function checks exactly this:
myfunc <- function(x) {
do_row <- function(r) {
r_mat <- matrix(rep(r, length = length(x)), ncol = ncol(x), byrow = TRUE)
abs_dist <- abs(r_mat - x)
return(which(rowSums(abs_dist) == 1))
}
return(apply(x, 1, do_row))
}
do_row() does the job for a single row, and then apply() is used to do this with each row. For your example, I get:
myfunc(x)
## [[1]]
## [1] 2 3
##
## [[2]]
## [1] 1
##
## [[3]]
## [1] 1
##
## [[4]]
## integer(0)
##
## [[5]]
## integer(0)
Using sweep(), one can write a shorter function:
myfunc2 <- function(x) {
apply(x, 1, function(r) which(rowSums(abs(sweep(x, 2, r))) == 1))
}
But this seems harder to understand and it turns out that it is slower by about a factor two for your matrix x. (I have also tried it with a large matrix, and there, the efficiency seems about the same.)

The number of data points in matrix and vector forms

Supposed that X contains 1000 rows with m columns, where m equal to 3 as follows:
set.seed(5)
X <- cbind(rnorm(1000,0,0.5), rnorm(1000,0,0.5), rnorm(1000,0,0.5))
Variable selection is performed, then the condition will be checked before performing the next operation as follows.
if(nrow(X) < 1000){print(a+b)}
,where a is 5 and b is 15, so if nrow(X) < 1000 is TRUE, then 20 will be printed out.
However, in case that X happens to be a vector because only one column is selected,
how can I check the number of data points when X can be either a matrix or vector ?
What I can think of is that
if(is.matrix(X)){
n <- nrow(X)
} else {
n <- length(X)}
if(n < 1000){print(a+b)}
Anyone has a better idea ?
Thank you
You can use NROW for both cases. From ?NROW
nrow and ncol return the number of rows or columns present in x. NCOL and NROW do the same treating a vector as 1-column matrix.
So that means that even if the subset is dropped down to a vector, as long as x is an array, vector, or data frame NROW will treat it as a one-column matrix.
sub1 <- X[,2:3]
is.matrix(sub1)
# [1] TRUE
NROW(sub1)
# [1] 1000
sub2 <- X[,1]
is.matrix(sub2)
# [1] FALSE
NROW(sub2)
# [1] 1000
So if(NROW(X) < 1000L) a + b should work regardless of whether X is a matrix or a vector. I use <= below, since X has exactly 1000 rows in your example.
a <- 5; b <- 15
if(NROW(sub1) <= 1000L) a + b
# [1] 20
if(NROW(sub2) <= 1000L) a + b
# [1] 20
A second option would be to use drop=FALSE when you make the variable selection. This will make the subset remain a matrix when the subset is only one column. This way you can use nrow with no worry. An example of this is
X[, 1, drop = FALSE]

How to check if a vector contains n consecutive numbers

Suppose that my vector numbers contains c(1,2,3,5,7,8), and I wish to find if it contains 3 consecutive numbers, which in this case, are 1,2,3.
numbers = c(1,2,3,5,7,8)
difference = diff(numbers) //The difference output would be 1,1,2,2,1
To verify that there are 3 consecutive integers in my numbers vector, I've tried the following with little reward.
rep(1,2)%in%difference
The above code works in this case, but if my difference vector = (1,2,2,2,1), it would still return TRUE even though the "1"s are not consecutive.
Using diff and rle, something like this should work:
result <- rle(diff(numbers))
any(result$lengths>=2 & result$values==1)
# [1] TRUE
In response to the comments below, my previous answer was specifically only testing for runs of length==3 excluding longer lengths. Changing the == to >= fixes this. It also works for runs involving negative numbers:
> numbers4 <- c(-2, -1, 0, 5, 7, 8)
> result <- rle(diff(numbers4))
> any(result$lengths>=2 & result$values==1)
[1] TRUE
Benchmarks!
I am including a couple functions of mine. Feel free to add yours. To qualify, you need to write a general function that tells if a vector x contains n or more consecutive numbers. I provide a unit test function below.
The contenders:
flodel.filter <- function(x, n, incr = 1L) {
if (n > length(x)) return(FALSE)
x <- as.integer(x)
is.cons <- tail(x, -1L) == head(x, -1L) + incr
any(filter(is.cons, rep(1L, n-1L), sides = 1, method = "convolution") == n-1L,
na.rm = TRUE)
}
flodel.which <- function(x, n, incr = 1L) {
is.cons <- tail(x, -1L) == head(x, -1L) + incr
any(diff(c(0L, which(!is.cons), length(x))) >= n)
}
thelatemail.rle <- function(x, n, incr = 1L) {
result <- rle(diff(x))
any(result$lengths >= n-1L & result$values == incr)
}
improved.rle <- function(x, n, incr = 1L) {
result <- rle(diff(as.integer(x)) == incr)
any(result$lengths >= n-1L & result$values)
}
carl.seqle <- function(x, n, incr = 1) {
if(!is.numeric(x)) x <- as.numeric(x)
z <- length(x)
y <- x[-1L] != x[-z] + incr
i <- c(which(y | is.na(y)), z)
any(diff(c(0L, i)) >= n)
}
Unit tests:
check.fun <- function(fun)
stopifnot(
fun(c(1,2,3), 3),
!fun(c(1,2), 3),
!fun(c(1), 3),
!fun(c(1,1,1,1), 3),
!fun(c(1,1,2,2), 3),
fun(c(1,1,2,3), 3)
)
check.fun(flodel.filter)
check.fun(flodel.which)
check.fun(thelatemail.rle)
check.fun(improved.rle)
check.fun(carl.seqle)
Benchmarks:
x <- sample(1:10, 1000000, replace = TRUE)
library(microbenchmark)
microbenchmark(
flodel.filter(x, 6),
flodel.which(x, 6),
thelatemail.rle(x, 6),
improved.rle(x, 6),
carl.seqle(x, 6),
times = 10)
# Unit: milliseconds
# expr min lq median uq max neval
# flodel.filter(x, 6) 96.03966 102.1383 144.9404 160.9698 177.7937 10
# flodel.which(x, 6) 131.69193 137.7081 140.5211 185.3061 189.1644 10
# thelatemail.rle(x, 6) 347.79586 353.1015 361.5744 378.3878 469.5869 10
# improved.rle(x, 6) 199.35402 200.7455 205.2737 246.9670 252.4958 10
# carl.seqle(x, 6) 213.72756 240.6023 245.2652 254.1725 259.2275 10
After diff you can check for any consecutive 1s -
numbers = c(1,2,3,5,7,8)
difference = diff(numbers) == 1
## [1] TRUE TRUE FALSE FALSE TRUE
## find alteast one consecutive TRUE
any(tail(difference, -1) &
head(difference, -1))
## [1] TRUE
It's nice to see home-grown solutions here.
Fellow Stack Overflow user Carl Witthoft posted a function he named seqle() and shared it here.
The function looks like this:
seqle <- function(x,incr=1) {
if(!is.numeric(x)) x <- as.numeric(x)
n <- length(x)
y <- x[-1L] != x[-n] + incr
i <- c(which(y|is.na(y)),n)
list(lengths = diff(c(0L,i)),
values = x[head(c(0L,i)+1L,-1L)])
}
Let's see it in action. First, some data:
numbers1 <- c(1, 2, 3, 5, 7, 8)
numbers2 <- c(-2, 2, 3, 5, 6, 7, 8)
numbers3 <- c(1, 2, 2, 2, 1, 2, 3)
Now, the output:
seqle(numbers1)
# $lengths
# [1] 3 1 2
#
# $values
# [1] 1 5 7
#
seqle(numbers2)
# $lengths
# [1] 1 2 4
#
# $values
# [1] -2 2 5
#
seqle(numbers3)
# $lengths
# [1] 2 1 1 3
#
# $values
# [1] 1 2 2 1
#
Of particular interest to you is the "lengths" in the result.
Another interesting point is the incr argument. Here we can set the increment to, say, "2" and look for sequences where the difference between the numbers are two. So, for the first vector, we would expect the sequence of 3, 5, and 7 to be detected.
Let's try:
> seqle(numbers1, incr = 2)
$lengths
[1] 1 1 3 1
$values
[1] 1 2 3 8
So, we can see that we have a sequence of 1 (1), 1 (2), 3 (3, 5, 7), and 1 (8) if we set incr = 2.
How does it work with ECII's second challenge? Seems OK!
> numbers4 <- c(-2, -1, 0, 5, 7, 8)
> seqle(numbers4)
$lengths
[1] 3 1 2
$values
[1] -2 5 7
Simple but works
numbers = c(-2,2,3,4,5,10,6,7,8)
x1<-c(diff(numbers),0)
x2<-c(0,diff(numbers[-1]),0)
x3<-c(0,diff(numbers[c(-1,-2)]),0,0)
rbind(x1,x2,x3)
colSums(rbind(x1,x2,x3) )==3 #Returns TRUE or FALSE where in the vector the consecutive intervals triplet takes place
[1] FALSE TRUE TRUE FALSE FALSE FALSE TRUE FALSE FALSE
sum(colSums(rbind(x1,x2,x3) )==3) #How many triplets of consecutive intervals occur in the vector
[1] 3
which(colSums(rbind(x1,x2,x3) )==3) #Returns the location of the triplets consecutive integers
[1] 2 3 7
Note that this will not work for consecutive negative intervals c(-2,-1,0) because of how diff() works

Resources