select multiple head() and tail() values in a vector - r

I have a vector as follows:
v <- c(1,3,4,5,6,7,8,9,NA,NA,NA,NA,27,25,30,41,NA,NA)
How can I extract the values 1, 9, 27 and 41 (i.e. the first and last value of each run without NAs)?
I thought about using head(v, 1) and tail(v, 1) in combination. However, I don't know how to 'stop' at the NAs and restart after them.

We create a grouping variable with rleid() from the logical vector is.na(v), use it in tapply() to select the first and last values of each group, unlist() the result, drop the NA elements with na.omit(), and strip the attributes with c().
library(data.table)
c(na.omit(unlist(tapply(v, rleid(is.na(v)), function(x) c(x[1],
x[length(x)])), use.names=FALSE)))
#[1] 1 9 27 41
Or another option is rle from base R
v[with(rle(!is.na(v)), {
i1 <- cumsum(lengths)
i2 <- lengths[values]
c(rbind(i1[values] - i2 + 1 , i1[values]))
})]
#[1] 1 9 27 41
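The rle() answer above can be unpacked step by step. This is only a sketch of the same idea; the names r, ends, starts, keep and first_last are chosen here for illustration and are not from the original answer.

```r
v <- c(1, 3, 4, 5, 6, 7, 8, 9, NA, NA, NA, NA, 27, 25, 30, 41, NA, NA)

r      <- rle(!is.na(v))          # runs of non-NA (TRUE) and NA (FALSE)
ends   <- cumsum(r$lengths)       # last position of every run
starts <- ends - r$lengths + 1    # first position of every run
keep   <- r$values                # TRUE for the non-NA runs

# Interleave first and last value of each non-NA run
first_last <- c(rbind(v[starts[keep]], v[ends[keep]]))
first_last
# [1]  1  9 27 41
```

Note that a non-NA run of length one would contribute its single value twice, since it is both the first and the last element of its run.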

Another possible solution via base R could be to split based on NA entries in the vector, lapply the head and tail functions and remove NA's, i.e.
ind <- unname(unlist(lapply(split(v, cumsum(c(1, diff(is.na(v)) != 0))), function(i)
c(head(i, 1), tail(i, 1)))))
ind[!is.na(ind)]
#[1] 1 9 27 41
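To see why the split() call works, it may help to look at the grouping vector on its own. In this sketch, grp and runs are illustrative names: the counter increases by one every time is.na(v) changes, so each run of NA or non-NA values gets its own group.

```r
v <- c(1, 3, 4, 5, 6, 7, 8, 9, NA, NA, NA, NA, 27, 25, 30, 41, NA, NA)

# diff(is.na(v)) is non-zero exactly where a run starts
grp <- cumsum(c(1, diff(is.na(v)) != 0))
grp
# [1] 1 1 1 1 1 1 1 1 2 2 2 2 3 3 3 3 4 4

runs <- split(v, grp)   # groups 1 and 3 hold the values, 2 and 4 the NAs
lengths(runs)
```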

A base R solution:
x <- na.omit(v[is.na(c(NA, diff(v))) | is.na(c(diff(v), NA))])
as.numeric(x)
# [1] 1 9 27 41


Selecting number from string based on criteria

I have the following data set:
PATH = c("5-8-10-8-17-20",
"56-85-89-89-0-15-88-10",
"58-85-89-65-49-51")
INDX = c(18, 89, 50)
data.frame(PATH, INDX)
PATH                    INDX
5-8-10-8-17-20          18
56-85-89-89-0-15-88-10  89
58-85-89-65-49-51       50
The column PATH holds strings that each represent a series of numbers. For each row I want to pick the largest number in PATH that satisfies PATH <= INDX, that is, either a number equal to INDX or the largest number in PATH that is still less than INDX.
My desired output would look like this:
PATH                    INDX  PICK
5-8-10-8-17-20          18    17
56-85-89-89-0-15-88-10  89    88
58-85-89-65-49-51       50    49
Some of my thought-process behind the answer:
I know that with a function such as strsplit I could split each string on "-", sort the numbers, subtract INDX from them and then select the negative difference closest to zero (or zero itself). However, the original dataset is quite large, and I wonder if there is a faster or more efficient way to perform this task.
Another option:
mapply(
\(x, y) max(x[x <= y]),
strsplit(PATH, "-") |> lapply(as.integer),
INDX
)
# [1] 17 89 49
Using purrr::map2_dbl():
library(purrr)
PICK <- map2_dbl(
strsplit(PATH, "-"),
INDX,
~ max(
as.numeric(.x)[as.numeric(.x) <= .y]
)
)
# 17 89 49
The code below should be reasonably efficient; there is nothing wrong with your approach.
# lapply() always returns a list, even when every path has the same length
numpath <- lapply(strsplit(PATH, "-"), as.numeric)
maxindexes <- lapply(seq_along(numpath), function(x) which(numpath[[x]] <= INDX[x]))
result <- sapply(seq_along(numpath), function(x) max(numpath[[x]][maxindexes[[x]]]))
result
# [1] 17 89 49
Using dplyr
library(dplyr)
df <- data.frame(PATH, INDX)  # the question's data, assigned to df
df |>
rowwise() |>
mutate(across(PATH, ~ {
a = unlist(strsplit(.x, split = "-"))
max(as.numeric(a)[which(as.numeric(a) <= INDX)])
}, .names = "PICK"))
#   PATH                    INDX  PICK
#   <chr>                  <dbl> <dbl>
# 1 5-8-10-8-17-20            18    17
# 2 56-85-89-89-0-15-88-10    89    89
# 3 58-85-89-65-49-51         50    49
You can create a custom function like below:
my_func <- function(vec1, vec2) {
sort(as.numeric(unlist(strsplit(vec1, split = "-")))) -> x
return(x[max(cumsum(x <= vec2))])
}
df$PICK <- sapply(seq_len(nrow(df)), function(i) my_func(df$PATH[i], df$INDX[i]))
which will yield the following output:
#                     PATH INDX PICK
# 1         5-8-10-8-17-20   18   17
# 2 56-85-89-89-0-15-88-10   89   89
# 3      58-85-89-65-49-51   50   49
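One edge case the answers above do not address is a row in which no number in PATH is less than or equal to INDX; max() on an empty vector then returns -Inf with a warning. A minimal sketch that returns NA instead follows. The helper name pick_le is hypothetical, and returning NA is an assumption about the desired behaviour.

```r
PATH <- c("5-8-10-8-17-20",
          "56-85-89-89-0-15-88-10",
          "58-85-89-65-49-51")
INDX <- c(18, 89, 50)

# Hypothetical helper: largest number in each path that is <= the index,
# or NA when no such number exists
pick_le <- function(path, indx) {
  nums <- lapply(strsplit(path, "-", fixed = TRUE), as.numeric)
  mapply(function(x, y) {
    ok <- x[x <= y]
    if (length(ok) == 0) NA_real_ else max(ok)
  }, nums, indx)
}

pick_le(PATH, INDX)
# [1] 17 89 49
pick_le("5-8", 3)
# [1] NA
```

With the <= rule the second row yields 89, because 89 itself appears in that path.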

R: How to slice a window of elements in named vector

Given the following named vector:
x <- c(54, 36, 67, 25, 76)
names(x) <- c('a', 'b', 'c', 'd', 'e')
How can one extract the elements between 'b' and 'd'? I can do that for data tables with dplyr::select(dt, b:d), but for some reason I cannot find a solution for named vectors (all the examples I find extract elements by giving all the names, not a range of names).
You could do
x[which(names(x) == "b"):which(names(x) == "d")]
#> b c d
#> 36 67 25
The problem is that there is no guarantee that the names in a named vector are unique, and if there are duplicate names the whole concept becomes meaningless.
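To illustrate the duplicate-name concern, here is a small sketch with a made-up vector y (not from the question) in which "b" appears twice. Checking anyDuplicated(names(y)) before slicing is one possible guard.

```r
y <- c(a = 1, b = 2, c = 3, b = 4, d = 5)

b_pos <- which(names(y) == "b")
b_pos
# [1] 2 4    (two candidate start positions, so "b to d" is ambiguous)

# anyDuplicated() returns 0 when all names are unique,
# otherwise the position of the first duplicate
first_dup <- anyDuplicated(names(y))
first_dup
# [1] 4
```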
If you wanted a complete solution that allows for tidyverse-style non-standard evaluation and sensible error messages you could have
subset_named <- function(data, exp)
{
if(missing(exp)) return(data)
exp <- as.list(match.call())$exp
if(is.numeric(exp)) return(data[exp])
if(is.character(exp)) return(data[exp])
tryCatch({
ss <- suppressWarnings(eval(exp))
return(data[ss])},
error = function(e)
{
if(as.character(exp[[1]]) != ":")
stop("`exp` must be a sequence created by ':'")
n <- names(data)
first <- as.character(exp[[2]])
second <- as.character(exp[[3]])
first_match <- which(n == first)
second_match <- which(n == second)
if(length(first_match) == 0)
stop("\"", first, "\" not found in names(",
deparse(substitute(data)), ")")
if(length(second_match) == 0)
stop("\"", second, "\" not found in names(",
deparse(substitute(data)), ")")
if(length(first_match) > 1) {
warning("\"", first,
"\" found more than once. Using first occurrence only")
first_match <- first_match[1]
}
if(length(second_match) > 1) {
warning("\"", second,
"\" found more than once. Using first occurrence only")
second_match <- second_match[1]
}
return(data[first_match:second_match])
})
}
That allows the following behaviour:
subset_named(x, "b":"d")
#> b c d
#> 36 67 25
subset_named(x, b:d)
#> b c d
#> 36 67 25
subset_named(x, 1:3)
#> a b c
#> 54 36 67
subset_named(x, "e")
#> e
#> 76
subset_named(x)
#> a b c d e
#> 54 36 67 25 76
One option could be:
x[Reduce(`:`, which(names(x) %in% c("b", "d")))]
b c d
36 67 25
You can use match in base R :
x[match('b', names(x)):match('d', names(x))]
# b c d
#36 67 25
Or, if you want to use something like b:d, convert the vector into a one-row data frame so that the names become columns:
library(dplyr)
t(x) %>%
as.data.frame() %>%
select(b:d)
1) subset In base R this can be done using the select argument of subset. The only catch is that only the data.frame method of subset supports the select argument but we can convert x to a data.frame and then convert back. It also allows more complex specifications such as c(b:d, d) .
unlist(subset(data.frame(as.list(x)), select = b:d))
## b c d
## 36 67 25
2) evalq Another base R possibility is to create a list with the values 1, 2, 3, ... and the same names as x and then evaluate b:d with respect to it giving the desired indexes which can then be indexed into x. This also allows complex specifications as in (1).
x[ evalq(b:d, setNames(as.list(seq_along(x)), names(x))) ]
## b c d
## 36 67 25
We could turn this into a function like this:
sel <- function(x, select, envir = parent.frame()) {
ix <- setNames(as.list(seq_along(x)), names(x))
x[ eval(substitute(select), ix, envir) ]
}
sel(x, b:d)
sel(x, c(b:c, d))
sel(x, d:b) # reverse order
3) logical condition Again with only base R, if the names are in sorted order, as in the question, then we can check for names between the endpoints:
x[names(x) >= "b" & names(x) <= "d"]
## b c d
## 36 67 25
4) zoo If the names are in ascending order, as in the question, we could create a zoo series with those names as the times and then use window.zoo to pick out the subseries and finally convert back.
library(zoo)
coredata(window(zoo(x, names(x)), start = "b", end = "d"))
## b c d
## 36 67 25
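Option (3) above depends on the names being sorted. As a sketch of what can go wrong otherwise, the made-up vector x2 below has an out-of-order name; comparing names alphabetically then drops an element that a positional slice keeps.

```r
x2 <- c(b = 1, z = 2, d = 3)   # illustrative vector with unsorted names

# Alphabetical comparison: "z" is not <= "d", so it is dropped
by_cmp <- x2[names(x2) >= "b" & names(x2) <= "d"]
names(by_cmp)
# [1] "b" "d"

# Positional slice between the first "b" and the first "d"
by_pos <- x2[match("b", names(x2)):match("d", names(x2))]
names(by_pos)
# [1] "b" "z" "d"
```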

R - Grabbing the index of the first occurrence of a value after a refresh point?

Let's say I have a vector:
vec <- c(3,0,1,3,0,1,0,1,2,3,0,0,1,3,1,3)
I want to obtain the index of the first occurrence of 1 after every 3. So, the output of indices I want is
3,6,13,15
How would I do this in R?
One approach would be to use cumsum to keep track of 3s.
mat <- cbind(cumsum(vec==3), vec == 1)
which(!duplicated(mat) & mat[,2] & mat[,1] > 0)
[1] 3 6 13 15
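The cumsum idea can also be written without the intermediate matrix. In this sketch the names grp, idx and first_idx are illustrative: grp counts how many 3s have been seen so far, and the first 1 within each value of grp is kept.

```r
vec <- c(3, 0, 1, 3, 0, 1, 0, 1, 2, 3, 0, 0, 1, 3, 1, 3)

grp <- cumsum(vec == 3)            # increases by one at every 3
idx <- which(vec == 1 & grp > 0)   # positions of 1s that follow at least one 3

# Keep only the first 1 within each group
first_idx <- idx[!duplicated(grp[idx])]
first_idx
# [1]  3  6 13 15
```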
We can also use rleid
library(data.table)
na.omit(as.vector(tapply(seq_along(vec) * (vec== 1), rleid(vec == 3), FUN = function(x)x[x > 0][1])))
#[1] 3 6 13 15

How to count unique values from vector? [duplicate]

Let's say I have:
v = rep(c(1,2, 2, 2), 25)
Now, I want to count the number of times each unique value appears. unique(v) returns the unique values, but not how many times each one occurs.
> unique(v)
[1] 1 2
I want something that gives me
length(v[v==1])
[1] 25
length(v[v==2])
[1] 75
but as a more general one-liner :) Something close (but not quite) like this:
#<doesn't work right> length(v[v==unique(v)])
Perhaps table is what you are after?
dummyData = rep(c(1,2, 2, 2), 25)
table(dummyData)
# dummyData
# 1 2
# 25 75
## or another presentation of the same data
as.data.frame(table(dummyData))
# dummyData Freq
# 1 1 25
# 2 2 75
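Once the table exists, individual counts can be looked up by name. A short sketch follows, with counts as an illustrative variable name; note that the names of a table are character strings, so they need converting if the original numeric values are wanted.

```r
v <- rep(c(1, 2, 2, 2), 25)
counts <- table(v)

counts[["2"]]               # count for the value 2
# [1] 75
as.numeric(names(counts))   # the distinct values, converted back to numbers
# [1] 1 2
```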
If you have multiple factors (= a multi-dimensional data frame), you can use the dplyr package to count unique values in each combination of factors:
library("dplyr")
data %>% group_by(factor1, factor2) %>% summarize(count=n())
It uses the pipe operator %>% to chain method calls on the data frame data.
Here is a one-line approach using aggregate:
aggregate(data.frame(count = v), list(value = v), length)
value count
1 1 25
2 2 75
length(unique(df$col)) is the simplest way I can see, although it returns the number of distinct values rather than a count for each value.
table() function is a good way to go, as Chase suggested.
If you are analyzing a large dataset, an alternative is to use the .N symbol in the data.table package.
Make sure you have installed the data.table package:
install.packages("data.table")
Code:
# Import the data.table package
library(data.table)
# Generate a data table object, which draws a number 10^7 times
# from 1 to 10 with replacement
DT<-data.table(x=sample(1:10,1E7,TRUE))
# Count Frequency of each factor level
DT[,.N,by=x]
To get an un-dimensioned integer vector that contains the count of unique values, use c().
dummyData = rep(c(1, 2, 2, 2), 25) # Chase's reproducible data
c(table(dummyData)) # get un-dimensioned integer vector
1 2
25 75
str(c(table(dummyData)) ) # confirm structure
Named int [1:2] 25 75
- attr(*, "names")= chr [1:2] "1" "2"
This may be useful if you need to feed the counts of unique values into another function, and is shorter and more idiomatic than the t(as.data.frame(table(dummyData))[,2]) posted in a comment to Chase's answer. Thanks to Ricardo Saporta, who pointed this out to me.
This works for me. Take your vector v
length(summary(as.factor(v),maxsum=50000))
Comment: set maxsum to be large enough to capture the number of unique values
or with the magrittr package
v %>% as.factor %>% summary(maxsum=50000) %>% length
Also making the values categorical and calling summary() would work.
> v = rep(as.factor(c(1,2, 2, 2)), 25)
> summary(v)
1 2
25 75
You can also try a tidyverse approach:
library(tidyverse)
dummyData %>%
as_tibble() %>%
count(value)
# A tibble: 2 x 2
value n
<dbl> <int>
1 1 25
2 2 75
If you need to have the number of unique values as an additional column in the data frame containing your values (a column which may represent sample size for example), plyr provides a neat way:
data_frame <- data.frame(v = rep(c(1,2, 2, 2), 25))
library("plyr")
data_frame <- ddply(data_frame, .(v), transform, n = length(v))
You can also try dplyr::count
df <- tibble(x=c('a','b','b','c','c','d'), y=1:6)
dplyr::count(df, x, sort = TRUE)
# A tibble: 4 x 2
x n
<chr> <int>
1 b 2
2 c 2
3 a 1
4 d 1
If you want to run unique on a data.frame (e.g., train.data), and also get the counts (which can be used as the weight in classifiers), you can do the following:
unique.count = function(train.data, all.numeric=FALSE) {
# first convert each row in the data.frame to a string
train.data.str = apply(train.data, 1, function(x) paste(x, collapse=','))
# use table to index and count the strings
train.data.str.t = table(train.data.str)
# get the unique data string from the row.names
train.data.str.uniq = row.names(train.data.str.t)
weight = as.numeric(train.data.str.t)
# convert the unique data string to data.frame
if (all.numeric) {
train.data.uniq = as.data.frame(t(apply(cbind(train.data.str.uniq), 1,
function(x) as.numeric(unlist(strsplit(x, split=","))))))
} else {
train.data.uniq = as.data.frame(t(apply(cbind(train.data.str.uniq), 1,
function(x) unlist(strsplit(x, split=",")))))
}
names(train.data.uniq) = names(train.data)
list(data=train.data.uniq, weight=weight)
}
I know there are many other answers, but here is another way to do it using the sort and rle functions. The function rle stands for Run Length Encoding. It can be used for counts of runs of numbers (see the R man docs on rle), but can also be applied here.
test.data = rep(c(1, 2, 2, 2), 25)
rle(sort(test.data))
## Run Length Encoding
## lengths: int [1:2] 25 75
## values : num [1:2] 1 2
If you capture the result, you can access the lengths and values as follows:
## rle returns a list with two items.
result.counts <- rle(sort(test.data))
result.counts$lengths
## [1] 25 75
result.counts$values
## [1] 1 2
# Count how often each word occurs in a character vector
count_unique_words <- function(wlist) {
  ucountlist <- list()
  unamelist <- c()
  for (i in wlist) {
    if (is.element(i, unamelist)) {
      # word seen before: increment its count
      ucountlist[[i]] <- ucountlist[[i]] + 1
    } else {
      # first occurrence: start the count at 1 and remember the word
      ucountlist[[i]] <- 1
      unamelist <- c(unamelist, i)
    }
  }
  ucountlist
}
expt_counts <- count_unique_words(population)
for (i in names(expt_counts))
  cat(i, expt_counts[[i]], "\n")

Using pmatch for checking columns of a matrix into another

I have two matrices and I want to check which (column) vectors of the first one are also in the second one, and if so to get their index.
I tried to use pmatch, but I have to tweak it a bit because it first converts the matrices into vectors; see the MWE:
X <- matrix(rnorm(12), 3, 4)
x <- X[, c(2, 4)]
pm <- pmatch(x, X)
print(pm)
[1] 4 5 6 10 11 12
d1 <- dim(X)[1]
d2 <- length(pm)/d1
ind <- pmatch(x, X)[d1*c(1:d2)]/d1
print(ind)
[1] 2 4
ind is what I want, but I guess there might be a prebuilt function to do it. I'm also concerned with computational efficiency.
We can loop over the columns of 'x' and use ==
sapply(seq_len(ncol(x)), function(i) which(!colSums(X != x[,i])))
#[1] 2 4
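An alternative sketch, not taken from the answer above, collapses each column into a single string key and then uses match() once. The helper col_key and the separator are choices made here for illustration. Because paste() formats doubles with up to 15 significant digits, columns that differ only beyond that precision would be treated as equal, so this is an approximation of exact comparison.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible
X <- matrix(rnorm(12), 3, 4)
x <- X[, c(2, 4)]

# Hypothetical helper: one string key per column
col_key <- function(m) apply(m, 2, paste, collapse = "\r")

ind <- match(col_key(x), col_key(X))
ind
# [1] 2 4
```

match() returns NA for a column of x that does not occur in X, and only the first matching position when X contains duplicate columns.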
