Related
I have 3 vectors
x <- c(1,3,5,7,3,8)
y <- c(3,5,7)
z <- c(3,3,8)
I want to find the elements of x that are not in y and not in z. Is there a function f that would give me the following output:
> f(x,y)
1 3 8
> f(x,z)
1 5 7
In other words, I want to find the "set difference" between 2 vectors, either of which may have repeated values. The functions %in%, match and setdiff do not work in this case for obvious reasons.
There should be some better ways to do this but here is one option
get_diff_vectors <- function(x, y) {
count_x <- table(x)
count_y <- table(y)
same_counts <- match(names(count_y), names(count_x))
count_x[same_counts] <- count_x[same_counts] - count_y
as.numeric(rep(names(count_x), count_x))
}
get_diff_vectors(x, y)
#[1] 1 3 8
get_diff_vectors(x, z)
#[1] 1 5 7
get_diff_vectors(x, c(5, 7))
#[1] 1 3 3 8
We count the frequency of x and y using table, match the numbers which occur in both and subtract the counts y from x. Finally recreate the remaining vector using rep.
Still not able to find a better way but here is dplyr way using the somewhat similar logic.
library(dplyr)
get_diff_vectors_dplyr <- function(x, y) {
df1 <- data.frame(x) %>% count(x)
df2 <- data.frame(y) %>% count(y)
final <- left_join(df1, df2, by = c("x" = "y")) %>%
mutate_at(c("n.x", "n.y"), funs(replace(., is.na(.), 0))) %>%
mutate(n = n.x - n.y)
rep(final$x, final$n)
}
get_diff_vectors_dplyr(x, y)
#[1] 1 3 8
get_diff_vectors_dplyr(x, z)
#[1] 1 5 7
get_diff_vectors_dplyr(x, c(5, 7))
#[1] 1 3 3 8
The vecsets package mentioned by OP has function vsetdiff which does this very easily
vecsets::vsetdiff(x, y)
#[1] 1 3 8
vecsets::vsetdiff(x, z)
#[1] 1 5 7
vecsets::vsetdiff(x, c(5, 7))
#[1] 1 3 3 8
Here's an attempt using make.unique to account for duplicates:
dupdiff <- function(x,y) x[-match(
make.unique(as.character(y)),
make.unique(as.character(x)),
nomatch=0
)]
Testing:
dupdiff(x,y)
#[1] 1 3 8
dupdiff(x,z)
#[1] 1 5 7
dupdiff(x, c(5, 7))
#[1] 1 3 3 8
dupdiff(x, c(5, 7, 9))
#[1] 1 3 3 8
match with a little for-loop does work:
> f(x, y)
[1] 1 3 8
> f(x, z)
[1] 1 5 7
Code
f <- function(s, r) {
for(i in 1:length(s)){
j <- match(s[i], r)
if(!is.na(j)) {
s[i] <- NA
r[j] <- NA
}
}
print(s[complete.cases(s)])
}
There is the new Hadley-verse package waldo
Makes real nice and concise overviews of the differences of objects, not only vectors
library(waldo)
compare(x, y)
#> `old`: 1 3 5 7 3 8
#> `new`: 3 5 7
compare(x, z)
#> `old`: 1 3 5 7 3 8
#> `new`: 3 3 8
What is the optimal way to get the index of all elements that are repeated # times? I want to identify the elements that are duplicated more than 2 times.
rle() and rleid() both hint to the values I need but neither method directly gives me the indices.
I came up with this code:
t1 <- c(1, 10, 10, 10, 14, 37, 3, 14, 8, 8, 8, 8, 39, 12)
t2 <- lag(t1,1)
t2[is.na(t2)] <- 0
t3 <- ifelse(t1 - t2 == 0, 1, 0)
t4 <- rep(0, length(t3))
for (i in 2:length(t3)) t4[i] <- ifelse(t3[i] > 0, t3[i - 1] + t3[i], 0)
which(t4 > 1)
returns:
[1] 4 11 12
and those are the values I need.
Are there any R-functions that are more appropriate?
Ben
One option with data.table. No real reason to use this instead of lag/shift when n = 2, but for larger n this would save you from creating a large number of new lagged vectors.
library(data.table)
which(rowid(rleid(t1)) > 2)
# [1] 4 11 12
Explanation:
rleid will produce a unique value for each "run" of equal values, and rowid will mark how many elements "into" the run each element is. What you want is elements more than 2 "into" a run.
data.table(
t1,
rleid(t1),
rowid(t1))
# t1 V2 V3
# 1: 1 1 1
# 2: 10 2 1
# 3: 10 2 2
# 4: 10 2 3
# 5: 14 3 1
# 6: 37 4 1
# 7: 3 5 1
# 8: 14 6 2
# 9: 8 7 1
# 10: 8 7 2
# 11: 8 7 3
# 12: 8 7 4
# 13: 39 8 1
# 14: 12 9 1
Edit: If, as in the example posed by this question, no two runs (even length-1 "runs") are of the same value (or if you don't care whether the duplicates are next to eachother), you can just use which(rowid(t1) > 2) instead. (This is noted by Frank in the comments)
Hopefully this example clarifies the differences
a <- c(1, 1, 1, 2, 2, 1)
which(rowid(a) > 2)
# [1] 3 6
which(rowid(rleid(a)) > 2)
# [1] 3
You can use dplyr::lag or data.table::shift (note, default for shift is to lag, so shift(t1, 1) is equal to shift(t1, 1, type = "lag"):
which(t1 == lag(t1, 1) & lag(t1, 1) == lag(t1, 2))
[1] 4 11 12
# Or
which(t1 == shift(t1, 1) & shift(t1, 1) == shift(t1, 2))
[1] 4 11 12
If you need it to scale for several duplicates you can do the following (thanks for the tip #IceCreamToucan):
n <- 2
df1 <- sapply(0:n, function(x) shift(t1, x))
which(rowMeans(df1 == df1[,1]) == 1)
[1] 4 11 12
This is usually a case that rle is useful, i.e.
v1 <- rle(t1)
i1 <- seq_along(t1)[t1 %in% v1$values[v1$lengths > 2]]
i2 <- t1[t1 %in% v1$values[v1$lengths > 2]]
tapply(i1, i2, function(i) tail(i, -2))
#$`8`
#[1] 11 12
#$`10`
#[1] 4
You can unlist and get it as a vector,
unlist(tapply(i1, i2, function(i) tail(i, -2)))
#81 82 10
#11 12 4
There is also a function called rleid in data.table package which we can use,
unlist(lapply(Filter(function(i) length(i) > 2, split(seq_along(t1), data.table::rleid(t1))),
function(i) tail(i, -2)))
#2 71 72
#4 11 12
Another possibility involving rle() could be:
pseudo_rleid <- with(rle(t1), rep(seq_along(values), lengths))
which(ave(t1, pseudo_rleid, FUN = function(x) seq_along(x) > 2) != 0)
[1] 4 11 12
First, I simplify my question. I want to extract certain ranges from a numeric vector. For example, extracting 3 ranges from 1:20 at the same time :
1 < x < 5
8 < x < 12
17 < x < 20
Therefore, the expected output is 2, 3, 4, 9, 10, 11, 18, 19.
I try to use the function findInterval() and control arguments rightmost.closed and left.open to do that, but any arguments sets cannot achieve the goal.
x <- 1:20
v <- c(1, 5, 8, 12, 17, 20)
x[findInterval(x, v) %% 2 == 1]
# [1] 1 2 3 4 8 9 10 11 17 18 19
x[findInterval(x, v, rightmost.closed = T) %% 2 == 1]
# [1] 1 2 3 4 8 9 10 11 17 18 19 20
x[findInterval(x, v, left.open = T) %% 2 == 1]
# [1] 2 3 4 5 9 10 11 12 18 19 20
By the way, the conditions can also be a matrix like that :
[,1] [,2]
[1,] 1 5
[2,] 8 12
[3,] 17 20
I don't want to use for loop if it's not necessary.
I am grateful for any helps.
I'd probably do it using purrr::map2 or Map, passing your lower-bounds and upper-bounds as arguments and filtering your dataset with a custom function
library(purrr)
x <- 1:20
lower_bounds <- c(1, 8, 17)
upper_bounds <- c(5, 12, 20)
map2(
lower_bounds, upper_bounds, function(lower, upper) {
x[x > lower & x < upper]
}
)
You may use data.table::inrange and its incbounds argument. Assuming ranges are in a matrix 'm', as shown in your question:
x[data.table::inrange(x, m[ , 1], m[ , 2], incbounds = FALSE)]
# [1] 2 3 4 9 10 11 18 19
m <- matrix(v, ncol = 2, byrow = TRUE)
You were on the right path, and left.open indeed helps, but rightmost.closed actually concerns only the last interval rather than the right "side" of each interval. Hence, we need to use left.open twice. As you yourself figured out, it looks like an optimal way to do that is
x[findInterval(x, v) %% 2 == 1 & findInterval(x, v, left.open = TRUE) %% 2 == 1]
# [1] 2 3 4 9 10 11 18 19
Clearly there are alternatives. E.g.,
fun <- function(x, v)
if(length(v) > 1) v[1] < x & x < v[2] | fun(x, v[-1:-2]) else FALSE
x[fun(x, v)]
# [1] 2 3 4 9 10 11 18 19
I found an easy way just with sapply() :
x <- 1:20
v <- c(1, 5, 8, 12, 17, 20)
(v.df <- as.data.frame(matrix(v, 3, 2, byrow = T)))
# V1 V2
# 1 1 5
# 2 8 12
# 3 17 20
y <- sapply(x, function(x){
ind <- (x > v.df$V1 & x < v.df$V2)
if(any(ind)) x else NA
})
y[!is.na(y)]
# [1] 2 3 4 9 10 11 18 19
I have a numeric vector in R, say
v= c(2,3,5,6,7,6,3,2,3,4,5,7,8,9,6,1,1,2,5,6,7,11,2,3,4)
Now, I have to find all the consecutive sub-vector of size 4 out of it with the condition that each element of the sub-vector must be greater than 2 and all sub-vector must be disjoint in the sense that non of the two sub-vector can contain same index element. So my output will be:
(3,5,6,7),(3,4,5,7),(5,6,7,11)
Edited:
Other examples for illustration purpose: for,
v=c(3,3,3,3,1,3,3,3,3,3,3,3,3)
output will be :
(3,3,3,3), (3,3,3,3),(3,3,3,3).
and for,
v= c(2,3,5,5,7,6,3,2,3,4,5,7,8,9,6,1,1,2,5,6,7,11,2,3,4)
output will be
(3,5,5,7),(3,4,5,7),(5,6,7,11)
The second condition on the output simply says that if we found any sub- array say (v[m],v[m+1],v[m+2],v[m+3]) with each element greater than > 2 then it will goes into my output and the next sub-array can only be start from v[m+4](if possible)
This solution uses embed() to create a matrix of lags and then extracts the desired rows from this matrix:
v <- c(2,3,5,6,7,6,3,2,3,4,5,7,8,9,6,1,1,2,5,6,7,11,2,3,4)
e <- embed(v, 4)
ret <- which(
apply(e, 1, function(x)all(x > 2)) &
apply(e, 1, function(x)length(unique(x)) == 4)
)
rows <- ret[c(1, 1 + which(diff(ret) > 4))]
e[rows, 4:1]
[,1] [,2] [,3] [,4]
[1,] 3 5 6 7
[2,] 3 4 5 7
[3,] 5 6 7 11
Try:
fun1 <- function(vec, n, cond1) {
lst1 <- lapply(1:(length(vec) - n+1), function(i) {
x1 <- vec[i:(i + (n-1))]
if (all(diff(x1) >= 0) & all(x1 > cond1))
x1
})
indx <- which(sapply(lst1, length) == n)
indx2 <- unlist(lapply(split(indx, cumsum(c(TRUE, diff(indx) != 1))), function(x) x[seq(1,
length(x), by = n-1)]))
lst1[indx2]
}
v1 <- c(3,3,3,3,1,3,3,3,3,3,3,3,3)
v2 <- c(2,3,5,5,7,6,3,2,3,4,5,7,8,9,6,1,1,2,5,6,7,11,2,3,4)
v3 <- c(2,3,5,6,7,6,3,2,3,4,5,7,8,9,6,1,1,2,5,6,7,11,2,3,4)
fun1(v1,4,2)
#[[1]]
#[1] 3 3 3 3
#[[2]]
#[1] 3 3 3 3
#[[3]]
#[1] 3 3 3 3
fun1(v2,4,2)
#[[1]]
#[1] 3 5 5 7
#[[2]]
#[1] 3 4 5 7
#[[3]]
#[1] 5 6 7 11
fun1(v3,4,2)
#[[1]]
#[1] 3 5 6 7
#[[2]]
#[1] 3 4 5 7
#[[3]]
#[1] 5 6 7 11
Here is another idea based on rle:
ff = function(x, size, thres)
{
valid_subsets = sapply(head(seq_along(x), -(size - 1)),
function(i) all(x[i:(i + (size - 1))] > thres))
r = rle(valid_subsets)
lapply(unlist(mapply(function(a, b) a + (seq_len(b) - 1) * size,
(cumsum(r$lengths) - r$lengths + 1)[which(r$values)],
(r$lengths[which(r$values)] + size - 1) %/% size)),
function(i) x[i:(i + (size - 1))])
}
ff(c(3,3,3,3,1,3,3,3,3,3,3,3,3), 4, 2)
ff(c(2,3,5,6,7,6,3,2,3,4,5,7,8,9,6,1,1,2,5,6,7,11,2,3,4), 4, 2)
Testing on another vector (I assume this is the correct output):
set.seed(4); xx = sample(1:10, 20, T)
xx
# [1] 6 1 3 3 9 3 8 10 10 1 8 3 2 10 5 5 10 6 10 8
ff(xx, 4, 2)
#[[1]]
#[1] 3 3 9 3
#
#[[2]]
#[1] 10 5 5 10
Unless I'm missing something, on "xx" (as well as on other cases) the other posted answers do not seem to work:
fun1(xx, 4, 2)
#[[1]]
#[1] 3 8 10 10
#e[rows, 4:1]
#[1] 9 3 8 10
I have a zoo object of 12 sets of monthly returns on stock tickers. I want to get the symbol, which is the name of the series, or at least the column, of each month's best performing stock. I've been trying to do this with applying the max function, by row. How do I get the column name?
#Apply 'max' function across each row. I need to get the col number out of this.
apply(tsPctChgs, 1, max, na.rm = TRUE)
The usual answer would be via which.max() however, do note that this will return only the first of the maximums if there are two or more observations taking the maximum value.
An alternative is which(x == max(x)), which would return all value taking the maximum in the result of a tie.
You can then use the index returned to select the series maximum. Handling NAs is covered below to try to keep the initial discussion simple.
require("zoo")
set.seed(1)
m <- matrix(runif(50), ncol = 5)
colnames(m) <- paste0("Series", seq_len(ncol(m)))
ind <- seq_len(nrow(m))
mz <- zoo(m, order.by = ind)
> apply(mz, 1, which.max)
1 2 3 4 5 6 7 8 9 10
3 5 5 1 4 1 1 2 3 2
> apply(mz, 1, function(x) which(x == max(x)))
1 2 3 4 5 6 7 8 9 10
3 5 5 1 4 1 1 2 3 2
So use that to select the series name
i1 <- apply(mz, 1, function(x) which(x == max(x)))
colnames(mz)[i1]
> i1 <- apply(mz, 1, function(x) which(x == max(x)))
> colnames(mz)[i1]
[1] "Series3" "Series5" "Series5" "Series1" "Series4" "Series1" "Series1"
[8] "Series2" "Series3" "Series2"
Handling tied maximums
To illustrate the different behaviour, copy the maximum from month 1 (series 3) into series 1
mz2 <- mz ## copy
mz2[1,1] <- mz[1,3]
mz2[1,]
> mz2[1,]
1 0.9347052 0.2059746 0.9347052 0.4820801 0.8209463
Now try the two approaches again
> apply(mz2, 1, which.max)
1 2 3 4 5 6 7 8 9 10
1 5 5 1 4 1 1 2 3 2
> apply(mz2, 1, function(x) which(x == max(x)))
$`1`
Series1 Series3
1 3
.... ## truncated output ###
Notice how which.max only returns the maximum in series 1.
To use this approach to select the series name, you need to apply something to the list returned by apply(), e.g.
i2 <- apply(mz2, 1, function(x) which(x == max(x)))
lapply(i2, function (i, zobj) colnames(zobj)[i], zobj = mz2)
$`1`
[1] "Series1" "Series3"
$`2`
[1] "Series5"
$`3`
[1] "Series5"
$`4`
[1] "Series1"
$`5`
[1] "Series4"
$`6`
[1] "Series1"
$`7`
[1] "Series1"
$`8`
[1] "Series2"
$`9`
[1] "Series3"
$`10`
[1] "Series2"
Handling NAs
As you have potential for NAs, I would do the following:
apply(mz, 1, which.max, na.rm = TRUE) ## as you did already
apply(mz, 1, function(x, na.rm = TRUE) {
if(na.rm) {
x <- x[!is.na(x)]
}
which(x == max(x))
})
Since apply converts to matrix, I would use rollapply with width=1:
require("zoo")
set.seed(1)
m <- matrix(runif(50), ncol=5)
mz <- setNames(zoo(m, seq(nrow(m))), paste0("Series",seq(ncol(m))))
rollapply(mz, 1, function(r) colnames(mz)[which.max(r)], by.column=FALSE)