R: How to slice a window of elements in a named vector

Given the following named vector:
x <- c(54, 36, 67, 25, 76)
names(x) <- c('a', 'b', 'c', 'd', 'e')
How can one extract the elements between 'b' and 'd'? I can do that for data tables with dplyr::select(dt, b:d), but I cannot find a solution for named vectors: all the examples I find extract elements by listing every name, not by giving a range of names.

You could do
x[which(names(x) == "b"):which(names(x) == "d")]
#> b c d
#> 36 67 25
The problem being that there is no guarantee in a named vector that names are unique, and if there are duplicate names the entire concept becomes meaningless.
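To make the duplicate-name caveat concrete, here is a minimal sketch of a guard that could run before slicing. The helper name slice_between is hypothetical (it is not from any package); it simply refuses to slice when an endpoint name is missing or appears more than once.

```r
# Hypothetical helper: positional slice between two names, with checks.
slice_between <- function(v, from, to) {
  n <- names(v)
  for (nm in c(from, to)) {
    hits <- sum(n == nm, na.rm = TRUE)
    if (hits == 0) stop("name '", nm, "' not found")
    if (hits > 1) stop("name '", nm, "' is duplicated, so the range is ambiguous")
  }
  v[match(from, n):match(to, n)]
}

x <- c(a = 54, b = 36, c = 67, d = 25, e = 76)
slice_between(x, "b", "d")

# With a duplicated endpoint the helper stops instead of guessing.
y <- c(a = 1, b = 2, c = 3, b = 4, d = 5)
tryCatch(slice_between(y, "b", "d"), error = function(e) conditionMessage(e))
```

Whether to stop or to fall back to the first occurrence is a design choice; the longer function in this answer warns and uses the first match instead.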
If you wanted a complete solution that allows for tidyverse-style non-standard evaluation and sensible error messages you could have
subset_named <- function(data, exp)
{
if(missing(exp)) return(data)
exp <- as.list(match.call())$exp
if(is.numeric(exp)) return(data[exp])
if(is.character(exp)) return(data[exp])
tryCatch({
ss <- suppressWarnings(eval(exp))
return(data[ss])},
error = function(e)
{
if(as.character(exp[[1]]) != ":")
stop("`exp` must be a sequence created by ':'")
n <- names(data)
first <- as.character(exp[[2]])
second <- as.character(exp[[3]])
first_match <- which(n == first)
second_match <- which(n == second)
if(length(first_match) == 0)
stop("\"", first, "\" not found in names(",
deparse(substitute(data)), ")")
if(length(second_match) == 0)
stop("\"", second, "\" not found in names(",
deparse(substitute(data)), ")")
if(length(first_match) > 1) {
warning("\"", first,
"\" found more than once. Using first occurrence only")
first_match <- first_match[1]
}
if(length(second_match) > 1) {
warning("\"", second,
"\" found more than once. Using first occurrence only")
second_match <- second_match[1]
}
return(data[first_match:second_match])
})
}
That allows the following behaviour:
subset_named(x, "b":"d")
#> b c d
#> 36 67 25
subset_named(x, b:d)
#> b c d
#> 36 67 25
subset_named(x, 1:3)
#> a b c
#> 54 36 67
subset_named(x, "e")
#> e
#> 76
subset_named(x)
#> a b c d e
#> 54 36 67 25 76

One option could be:
x[Reduce(`:`, which(names(x) %in% c("b", "d")))]
b c d
36 67 25

You can use match in base R:
x[match('b', names(x)):match('d', names(x))]
# b c d
#36 67 25
Or, if you want to use something like b:d, convert the vector into a one-row data frame so that each name becomes a column:
library(dplyr)
t(x) %>%
as.data.frame() %>%
select(b:d)

1) subset In base R this can be done using the select argument of subset. The only catch is that only the data.frame method of subset supports the select argument but we can convert x to a data.frame and then convert back. It also allows more complex specifications such as c(b:d, d) .
unlist(subset(data.frame(as.list(x)), select = b:d))
## b c d
## 36 67 25
2) evalq Another base R possibility is to create a list with the values 1, 2, 3, ... and the same names as x and then evaluate b:d with respect to it giving the desired indexes which can then be indexed into x. This also allows complex specifications as in (1).
x[ evalq(b:d, setNames(as.list(seq_along(x)), names(x))) ]
## b c d
## 36 67 25
We could turn this into a function like this:
sel <- function(x, select, envir = parent.frame()) {
ix <- setNames(as.list(seq_along(x)), names(x))
x[ eval(substitute(select), ix, envir) ]
}
sel(x, b:d)
sel(x, c(b:c, d))
sel(x, d:b) # reverse order
3) logical condition Again with only base R, if the names are in sorted order, as in the question, then we can check for names between the endpoints:
x[names(x) >= "b" & names(x) <= "d"]
## b c d
## 36 67 25
4) zoo If the names are in ascending order, as in the question, we could create a zoo series with those names as the times and then use window.zoo to pick out the subseries and finally convert back.
library(zoo)
coredata(window(zoo(x, names(x)), start = "b", end = "d"))
## b c d
## 36 67 25
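Options (3) and (4) both depend on the names being sorted. As a small illustration with a made-up vector y whose names are not in order, comparing names and slicing by position give different results:

```r
y <- c(d = 1, a = 2, c = 3, b = 4)

# Name comparison keeps every element whose name sorts between "b" and "d",
# wherever it sits in the vector.
by_name <- y[names(y) >= "b" & names(y) <= "d"]
names(by_name)      # "d" "c" "b"

# Positional slicing walks from the position of "b" to the position of "d".
by_position <- y[match("b", names(y)):match("d", names(y))]
names(by_position)  # "b" "c" "a" "d"
```

So the comparison-based approach is a filter on names, not a window over positions; the two only coincide when the names are already in ascending order.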

Related

Selecting number from string based on criteria

I have the following data set:
PATH = c("5-8-10-8-17-20",
"56-85-89-89-0-15-88-10",
"58-85-89-65-49-51")
INDX = c(18, 89, 50)
data.frame(PATH, INDX)
PATH                    INDX
5-8-10-8-17-20          18
56-85-89-89-0-15-88-10  89
58-85-89-65-49-51       50
The column PATH holds strings that each represent a numerical series. From each string I want to pick the largest number that satisfies PATH <= INDX, that is, either a number equal to INDX or the largest number that is still less than INDX.
My desired output would look like this:
PATH                    INDX  PICK
5-8-10-8-17-20          18    17
56-85-89-89-0-15-88-10  89    88
58-85-89-65-49-51       50    49
Some of my thought-process behind the answer:
I know that with a function such as strsplit I could split each string on "-", sort the numbers, subtract INDX from them, and then select the difference closest to zero that is negative or zero. However, the original dataset is quite large, and I wonder if there is a faster or more efficient way to perform this task.
Another option:
mapply(
\(x, y) max(x[x <= y]),
strsplit(PATH, "-") |> lapply(as.integer),
INDX
)
# [1] 17 89 49
Using purrr::map2_dbl():
library(purrr)
PICK <- map2_dbl(
strsplit(PATH, "-"),
INDX,
~ max(
as.numeric(.x)[as.numeric(.x) <= .y]
)
)
# 17 89 49
The code below should be reasonably efficient; there is nothing wrong with your approach.
# lapply always returns a list, even when every path has the same length
numpath <- lapply(strsplit(PATH, "-"), as.numeric)
maxindexes <- lapply(seq_along(numpath), function(x) which(numpath[[x]] <= INDX[x]))
result <- sapply(seq_along(numpath), function(x) max(numpath[[x]][maxindexes[[x]]]))
> result
[1] 17 89 49
Using dplyr
library(dplyr)
df <- data.frame(PATH, INDX)
df |>
rowwise() |>
mutate(across(PATH, ~ {
a = unlist(strsplit(.x, split = "-"))
max(as.numeric(a)[which(as.numeric(a) <= INDX)])
}, .names = "PICK"))
PATH INDX PICK
<chr> <dbl> <dbl>
1 5-8-10-8-17-20 18 17
2 56-85-89-89-0-15-88-10 89 89
3 58-85-89-65-49-51 50 49
You can create a custom function like below:
my_func <- function(vec1, vec2) {
sort(as.numeric(unlist(strsplit(vec1, split = "-")))) -> x
return(x[max(cumsum(x <= vec2))])
}
df$PICK <- sapply(seq_len(nrow(df)), function(i) my_func(df$PATH[i], df$INDX[i]))
which will yield the following output:
# PATH INDX PICK
# 1 5-8-10-8-17-20 18 17
# 2 56-85-89-89-0-15-88-10 89 89
# 3 58-85-89-65-49-51 50 49
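Since the question asks about large data, one possible direction is to split all strings once and then work on flat vectors instead of calling a function per row. This is only a sketch and has not been benchmarked here; the names vals, row_id and pick are illustrative. It also returns NA (rather than -Inf with a warning) when no number in a path is <= INDX.

```r
PATH <- c("5-8-10-8-17-20",
          "56-85-89-89-0-15-88-10",
          "58-85-89-65-49-51")
INDX <- c(18, 89, 50)

parts  <- strsplit(PATH, "-", fixed = TRUE)
vals   <- as.numeric(unlist(parts))                  # all numbers in one vector
row_id <- rep(seq_along(parts), lengths(parts))      # which row each number came from
keep   <- vals <= INDX[row_id]                       # candidates not exceeding INDX

pick <- rep(NA_real_, length(PATH))
best <- tapply(vals[keep], row_id[keep], max)        # largest candidate per row
pick[as.integer(names(best))] <- as.vector(best)
pick
# [1] 17 89 49
```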

Creating a vector of numbers based on letters

So, this is the question:
"Create a function that given one word, return the position
of word’s letters on letters vector. For example, if the word
is ‘abba’, the function will return 1 2 2 1."
What I have so far is this:
l <- function(word) {
chr <- c()
y <- c()
strsplit(chr,word)
i<-1
while(i<length) {
o<-letters[i]
x<-chr[i]
if(o==x) {
y[i]<-i
}
i+1
}
y
}
I have tried running l("hello") and it returns NULL. I'm very lost and would appreciate any help! Thank you!
With base R:
x <- "abba"
lapply(strsplit(x, "", fixed = TRUE), match, letters)
[[1]]
[1] 1 2 2 1
I provide another interesting function in base:
x <- "abcxyz"
strtoi(strsplit(x, "")[[1]], 36) - 9
# [1] 1 2 3 24 25 26
strtoi() converts a string written in a base-n numeral system to a base-10 (decimal) integer. Taking base 16 (hexadecimal) as an example, strtoi("12", base = 16) returns 18, because 12 in hexadecimal is 18 in decimal. With base 36, strtoi() maps the digits (1~9, a~z) to 1~35, so a~z in a base-36 system correspond to 10~35 in decimal. Subtracting 9 in my code shifts 10~35 to 1~26, which is what the OP requires. Another common use is converting a binary number to decimal, e.g. strtoi("01001", base = 2) returns 9.
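As a quick check of the explanation above, the following sketch compares the strtoi() trick with match() against letters. For lowercase letters they agree; for other characters they differ, since base-36 parsing also accepts uppercase letters and the digits 0~9.

```r
chars <- strsplit("abcxyz", "")[[1]]

strtoi(chars, base = 36L) - 9L   # 1 2 3 24 25 26
match(chars, letters)            # 1 2 3 24 25 26

# Input that is not a lowercase letter:
strtoi(c("A", "7"), base = 36L) - 9L   # 1 -2  ("A" parses as 10, "7" as 7)
match(c("A", "7"), letters)            # NA NA
```

If the input may contain anything other than lowercase letters, match() gives the more predictable result.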
library(purrr)
my_fun <- function(x) {
x %>%
strsplit("") %>%
map(factor, levels = letters) %>%
map(as.numeric)
}
x <- c("abba", "hello")
my_fun(x)
#> [[1]]
#> [1] 1 2 2 1
#>
#> [[2]]
#> [1] 8 5 12 12 15
Here we use that factors are integers under the hood.
Let str be a character vector, e.g. str <- c('a', 'b', 'b', 'a'). When we run factor(str, levels = letters) we convert it to a factor with 26 levels: 'a', 'b', 'c', and so on. If we then apply as.integer (or as.numeric, as in the code above) to it, 'a' becomes 1 because it is the first level, 'b' becomes 2, and so on.
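The factor explanation can be reproduced step by step; this short example uses only base R:

```r
str <- c("a", "b", "b", "a")
f <- factor(str, levels = letters)

nlevels(f)      # 26
as.integer(f)   # 1 2 2 1

# Characters that are not among the levels become NA.
as.integer(factor(c("a", "?"), levels = letters))   # 1 NA
```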

select multiple head() and tail() values in a vector

I have a vector as follows:
v <- c(1,3,4,5,6,7,8,9,NA,NA,NA,NA,27,25,30,41,NA,NA)
How can I extract the values 1, 9, 27 and 41 (i. e. the first and last position of each subset without NAs)?
I thought about using head(v, 1) and tail(v, 1) in combination. However I don't have an idea how to 'stop' at the NAs and restart again after them.
We create a grouping variable with rleid based on the logical vector (is.na(v)), use that in tapply to select the first and last values of each group, unlist the list output, remove the NA elements with na.omit and remove the attributes with c.
library(data.table)
c(na.omit(unlist(tapply(v, rleid(is.na(v)), function(x) c(x[1],
x[length(x)])), use.names=FALSE)))
#[1] 1 9 27 41
Or another option is rle from base R
v[with(rle(!is.na(v)), {
i1 <- cumsum(lengths)
i2 <- lengths[values]
c(rbind(i1[values] - i2 + 1 , i1[values]))
})]
#[1] 1 9 27 41
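To see why the rle version works, it may help to look at the intermediate values. The sketch below is the same computation unrolled, with the illustrative names r, starts and ends:

```r
v <- c(1,3,4,5,6,7,8,9,NA,NA,NA,NA,27,25,30,41,NA,NA)

r <- rle(!is.na(v))
r$lengths   # 8 4 4 2
r$values    # TRUE FALSE TRUE FALSE

# End position of each non-NA run, then its start position.
ends   <- cumsum(r$lengths)[r$values]          # 8 16
starts <- ends - r$lengths[r$values] + 1       # 1 13

# rbind() interleaves start and end so they come out in order.
v[c(rbind(starts, ends))]
# [1]  1  9 27 41
```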
Another possible solution via base R could be to split based on NA entries in the vector, lapply the head and tail functions and remove NA's, i.e.
ind <- unname(unlist(lapply(split(v, cumsum(c(1, diff(is.na(v)) != 0))), function(i)
c(head(i, 1), tail(i, 1)))))
ind[!is.na(ind)]
#[1] 1 9 27 41
A base R solution:
x = na.omit( v[is.na(c(NA,diff(v))) | is.na(c(diff(v),NA))] )
as.numeric(x)
# [1] 1 9 27 41

Referencing a dataframe recursively

Is there a way to have a dataframe refer to itself?
I find myself spending a lot of time writing things like y$Category1[is.na(y$Category1)]<-NULL which are hard to read and feel like a lot of slow repetitive typing. I wondered if there was something along the lines of:
y$Category1[is.na(self)] <- NULL I could use instead.
Thanks
What a great question. Unfortunately, as @user295691 pointed out in the comments, the issue is that the vector has to be referenced twice: once as the object being indexed and once as the subject of the condition. It does appear impossible to avoid the double reference.
numericVector[cond(numericVector)] <- newVal
What I think we can do is have a nice and neat function so that instead of
# this
y$Category1[is.na(y$Category1)] <- list(NULL)
# we can have this:
NAtoNULL(y$Category1)
For example, the following functions wrap selfAssign() (below):
NAtoNULL(obj) # Replaces NA values in obj with NULL.
NAtoVal(obj, val) # Replaces NA values in obj with val.
selfReplace(obj, toReplace, val) # Replaces toReplace values in obj with val
# and selfAssign can be called directly, but I'm not sure there would be a good reason to
selfAssign(obj, ind, val) # equivalent to obj[ind] <- val
Example:
# sample df
df <- structure(list(subj=c("A",NA,"C","D","E",NA,"G"),temp=c(111L,112L,NA,114L,115L,116L,NA),size=c(0.7133,NA,0.7457,NA,0.0487,NA,0.8481)),.Names=c("subj","temp","size"),row.names=c(NA,-7L),class="data.frame")
df
subj temp size
1 A 111 0.7133
2 <NA> 112 NA
3 C NA 0.7457
4 D 114 NA
5 E 115 0.0487
6 <NA> 116 NA
7 G NA 0.8481
# Make some replacements
NAtoNULL(df$size) # Replace all NAs in df$size with NULLs
NAtoVal(df$temp, 0) # Replace all NAs in df$temp with 0s
NAtoVal(df$subj, c("B", "E")) # Replace all NAs in df$subj with alternating "B" and "E"
# the modified df is now:
df
subj temp size
1 A 111 0.7133
2 B 112 NULL
3 C 0 0.7457
4 D 114 NULL
5 E 115 0.0487
6 E 116 NULL
7 G 0 0.8481
# replace the 0's in temp for NA
selfReplace(df$temp, 0, NA)
# replace NULL's in size for 1's
selfReplace(df$size, NULL, 1)
# replace all "E"'s in subj with alternate c("E", "F")
selfReplace(df$subj, c("E"), c("E", "F"))
df
subj temp size
1 A 111 0.7133
2 B 112 1
3 C NA 0.7457
4 D 114 1
5 E 115 0.0487
6 F 116 1
7 G NA 0.8481
Right now this works for vectors, but will fail with *apply. I would love to get it working fully, especially with applying plyr. The key would be to modify
FUNCTIONS
The code for the functions is below.
An important point: this does not (yet!) work with *apply / plyr.
I believe it can be made to work by modifying the value of n and adjusting sys.parent(.) in match.call(), but it still needs some fiddling.
Any suggestions / modifications would be greatly appreciated.
selfAssign <- function(self, ind, val, n=1, silent=FALSE) {
## assigns val to self[ind] in environment parent.frame(n)
## self should be a vector. Currently will not work for matrices or data frames
## GRAB THE CORRECT MATCH CALL
#--------------------------------------
# if nested function, match.call appropriately
if (class(match.call()) == "call") {
mc <- (match.call(call=sys.call(sys.parent(1))))
} else {
mc <- match.call()
}
# needed in case self is complex (ie df$name)
mc2 <- paste(as.expression(mc[[2]]))
## CLEAN UP ARGUMENT VALUES
#--------------------------------------
# replace logical indices with numeric indices
if (is.logical(ind))
ind <- which(ind)
# if no indices will be selected, stop here
if(identical(ind, integer(0)) || is.null(ind)) {
if(!silent) warning("No indices selected")
return()
}
# if val is a string, we need to wrap it in quotes
if (is.character(val))
val <- paste('"', val, '"', sep="")
# val cannot directly be NULL, must be list(NULL)
if(is.null(val))
val <- "list(NULL)"
## CREATE EXPRESSIONS AND EVAL THEM
#--------------------------------------
# create expressions to evaluate
ret <- paste0("'[['(", mc2, ", ", ind, ") <- ", val)
# evaluate in parent.frame(n)
eval(parse(text=ret), envir=parent.frame(n))
}
NAtoNULL <- function(obj, n=1) {
selfAssign(match.call()[[2]], is.na(obj), NULL, n=n+1)
}
NAtoVal <- function(obj, val, n=1) {
selfAssign(match.call()[[2]], is.na(obj), val, n=n+1)
}
selfReplace <- function(obj, toReplace, val, n=1) {
## replaces occurrences of toReplace within obj with val
# determine ind based on value & length of toReplace
# TODO: this will not work properly for data frames, but neither will selfAssign, yet.
if (is.null(toReplace)) {
ind <- sapply(obj, function(x) is.null(x[[1]]))
} else if (is.na(toReplace)) {
ind <- is.na(obj)
} else {
if (length(obj) > 1) { # note, this wont work for data frames
ind <- obj %in% toReplace
} else {
ind <- obj == toReplace
}
}
selfAssign(match.call()[[2]], ind, val, n=n+1)
}
## THIS SHOULD GO INSIDE NAtoNULL, NAtoVal etc.
# todo: modify for use with *apply
if(substr(paste(as.expression(x1)), 1, 10) == "FUN(obj = ") {
# PASS. This should identify when the call is coming from *apply.
# in such a case, need to increase n by 1 for apply & lapply. Increase n by 2 for sapply
# I'm not sure the increase required for plyr functions
}
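For comparison, a much simpler pattern avoids repeating the condition without rewriting calls in the parent frame: a plain function that returns the modified vector, which is then assigned back. This is only a sketch; na_to is a hypothetical name, and the column is still named twice (once on each side of the assignment).

```r
# Hypothetical helper: returns a copy of x with NA entries replaced by val.
na_to <- function(x, val) {
  x[is.na(x)] <- val
  x
}

df <- data.frame(subj = c("A", NA, "C"),
                 temp = c(111L, NA, 113L),
                 stringsAsFactors = FALSE)

df$temp <- na_to(df$temp, 0L)
df$subj <- na_to(df$subj, "B")
df
```

Because it takes and returns ordinary values, a helper like this also works inside lapply() and similar functions, for example df[] <- lapply(df, na_to, val = 0) when a single replacement value suits every column.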

Aggregate over categories that contain NAs with ddply and lapply?

I would like to aggregate a data.frame over 3 categories, with one of them varying. Unfortunately this one varying category contains NAs (actually it's the reason why it needs to vary). Thus I created a list of data.frames. Every data.frame within this list contains only complete cases with respect to three variables (with only one of them changing).
Let's reproduce this:
library(plyr)
mydata <- warpbreaks
names(mydata) <- c("someValue","group","size")
mydata$category <- c(1,2,3)
mydata$categoryA <- c("A","A","X","X","Z","Z")
# add some NA
mydata$category[c(8,10,19)] <- NA
mydata$categoryA[c(14,1,20)] <- NA
# create a list of dfs that contains TRUE FALSE
noNAList <- function(vec){
res <- !is.na(vec)
return(res)
}
testTF <- lapply(mydata[,c("category","categoryA")],noNAList)
# create a list of data.frames
selectDF <- function(TFvec){
res <- mydata[TFvec,]
return(res)
}
# check x and see that it may contain NAs as long
# as it's not in one of the 3 categories I want to aggregate over
x <-lapply(testTF,selectDF)
## let's ddply get to work
doddply <- function(df){
ddply(df,.(group,size),summarize,sumTest = sum(someValue))
}
y <- lapply(x, doddply);y
y comes very close to what I want to get
$category
group size sumTest
1 A L 375
2 A M 198
3 A H 185
4 B L 254
5 B M 259
6 B H 169
$categoryA
group size sumTest
1 A L 375
2 A M 204
3 A H 200
4 B L 254
5 B M 259
6 B H 169
But I need to implement aggregation over a third varying variable, which is in this case category and categoryA. Just like:
group size category sumTest sumTestTotal
1 A H 1 46 221
2 A H 2 46 221
3 A H 3 93 221
and so forth. How can I add names(x) to lapply, or do I need a loop or environment here?
EDIT:
Note that I want EITHER category OR categoryA added to the mix. In reality I have about 15 mutually exclusive categorical vars.
I think you might be making this really hard on yourself, if I understand your question correctly.
If you want to aggregate the data.frame 'mydata' by three (or four) variables, you would simply do this:
aggregate(someValue ~ group + size + category + categoryA, sum, data=mydata)
group size category categoryA someValue
1 A L 1 A 51
2 B L 1 A 19
3 A M 1 A 17
4 B M 1 A 63
aggregate will automatically remove rows that include NA in any of the categories. If someValue is sometimes NA, then you can add the parameter na.rm=T.
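The NA handling described here can be checked on the question's sample data. The sketch below rebuilds mydata with only the category column and confirms that the formula interface of aggregate drops the rows where category is NA:

```r
mydata <- warpbreaks
names(mydata) <- c("someValue", "group", "size")
mydata$category <- c(1, 2, 3)
mydata$category[c(8, 10, 19)] <- NA

res <- aggregate(someValue ~ group + size + category, data = mydata, FUN = sum)

# The aggregated total equals the total over rows with a non-missing category.
total_complete <- sum(mydata$someValue[!is.na(mydata$category)])
sum(res$someValue) == total_complete   # TRUE
sum(res$someValue) <  sum(mydata$someValue)   # TRUE: the NA rows were left out
```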
I also noted that you put a lot of unnecessary code into functions. For example:
# create a list of data.frames
selectDF <- function(TFvec){
res <- mydata[TFvec,]
return(res)
}
Can be written like:
selectDF <- function(TFvec) mydata[TFvec,]
Also, using lapply to create a list of two data frames without the NA is overkill. Try this code:
x = list(mydata[!is.na(mydata$category),],mydata[!is.na(mydata$categoryA),])
I know the question explicitly requests a ddply()/lapply() solution.
But ... if you are willing to come on over to the dark side, here is a data.table()-based function that should do the trick:
# Convert mydata to a data.table
library(data.table)
dt <- data.table(mydata, key = c("group", "size"))
# Define workhorse function
myfunction <- function(dt, VAR) {
E <- as.name(substitute(VAR))
dt[i = !is.na(eval(E)),
j = {n <- sum(.SD[,someValue])
.SD[, list(sumTest = sum(someValue),
sumTestTotal = n,
share = sum(someValue)/n),
by = VAR]
},
by = key(dt)]
}
# Test it out
s1 <- myfunction(dt, "category")
s2 <- myfunction(dt, "categoryA")
ADDED ON EDIT
Here's how you could run this for a vector of different categorical variables:
catVars <- c("category", "categoryA")
ll <- lapply(catVars,
FUN = function(X) {
do.call(myfunction, list(dt, X))
})
names(ll) <- catVars
lapply(ll, head, 3)
# $category
# group size category sumTest sumTestTotal share
# [1,] A H 2 46 185 0.2486486
# [2,] A H 3 93 185 0.5027027
# [3,] A H 1 46 185 0.2486486
#
# $categoryA
# group size categoryA sumTest sumTestTotal share
# [1,] A H A 79 200 0.395
# [2,] A H X 68 200 0.340
# [3,] A H Z 53 200 0.265
Finally, I found a solution that might not be as slick as Josh's, but it works without any dark forces (data.table). You may laugh, but here's my reproducible example using the same sample data as in the question.
qual <- c("category","categoryA")
# get T / F vectors
noNAList <- function(vec){
res <- !is.na(vec)
return(res)
}
selectDF <- function(TFvec) mydata[TFvec,]
NAcheck <- lapply(mydata[,qual],noNAList)
# create a list of data.frames
listOfDf <- lapply(NAcheck,selectDF)
workhorse <- function(charVec,listOfDf){
dfs <- list2env(listOfDf)
# create expression list
exlist <- list()
for(i in 1:length(qual)){
exlist[[qual[i]]] <- parse(text=paste("ddply(",qual[i],
",.(group,size,",qual[i],"),summarize,sumTest = sum(someValue))",
sep=""))
}
res <- lapply(exlist,eval,envir=dfs)
return(res)
}
Is this more like what you mean? I find your example extremely difficult to understand. In the below code, the method can take any column, and then aggregate by it. It can return multiple aggregation functions of someValue. I then find all the column names you would like to aggregate by, and then apply the function to that vector.
# Build a method to aggregate by column.
agg.by.col = function (column) {
by.list=list(mydata$group,mydata$size,mydata[,column])
names(by.list) = c('group','size',column)
aggregate(mydata$someValue, by=by.list, function(x) c(sum=sum(x),mean=mean(x)))
}
# Find all the column names you want to aggregate by
cols = names(mydata)[!(names(mydata) %in% c('someValue','group','size'))]
# Apply the method to each column name.
lapply (cols, agg.by.col)
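A variant of the same idea, as a sketch in base R only: build the formula for each category column with reformulate() and pass a named vector to lapply, so that the result list carries the category names (which is what the question asked about with names(x)). The function name agg_by is illustrative.

```r
# Sample data as in the question
mydata <- warpbreaks
names(mydata) <- c("someValue", "group", "size")
mydata$category  <- c(1, 2, 3)
mydata$categoryA <- c("A", "A", "X", "X", "Z", "Z")
mydata$category[c(8, 10, 19)]  <- NA
mydata$categoryA[c(14, 1, 20)] <- NA

# Aggregate someValue by group, size and one further column given by name.
agg_by <- function(col, data) {
  f <- reformulate(c("group", "size", col), response = "someValue")
  aggregate(f, data = data, FUN = sum)   # rows with NA in `col` are dropped
}

cats <- c("category", "categoryA")
res  <- lapply(setNames(cats, cats), agg_by, data = mydata)

names(res)              # "category" "categoryA"
head(res$categoryA, 3)
```

Each list element only loses the rows that are NA in its own category column, so the columns stay independent of one another.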

Resources