Function that splits numeric vector in the natural sequences it contains - r

I have a vector as the following:
example <- c(1, 2, 3, 8, 10, 11)
And I am trying to write a function that returns an output as the one you would get from:
desired_output <- list(first_sequence = c(1, 2, 3),
second_sequence = 8,
third_sequence = c(10, 11)
)
Actually, what I want is to count how many such sequences there are in my vector, and the length of each one. A list like the one in "desired_output" would be sufficient.
The end goal is to construct another vector, let's call it "b", that contains the following:
b <- c(3, 3, 3, 1, 2, 2)
The real world problem behind this is to measure the height of 3d objects contained in a 3D pointcloud.
I've tried to write both a function that returns the list in "desired_output" and a recursive function that directly outputs vector "b", but succeeded at neither.
Does anyone have an idea?
Thank you very much.

We can split to a list by creating a grouping by difference of adjacent elements
out <- split(example, cumsum(c(TRUE, abs(diff(example)) != 1)))
Then, we get the lengths and replicate
unname(rep(lengths(out), lengths(out)))
[1] 3 3 3 1 2 2
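As a minimal sketch building on the grouping above, the counting part of the question (how many runs there are, and how long each one is) can be wrapped in a small helper. The name count_runs is made up for illustration, and the sketch assumes a non-empty numeric vector in which a "sequence" means consecutive values increasing by exactly 1 (it does not use abs(), so a step of -1 starts a new run):

```r
# Hypothetical helper; the name is an assumption, not from the thread
count_runs <- function(x) {
  # A new run starts at the first element and wherever the step is not +1
  run_id <- cumsum(c(TRUE, diff(x) != 1))
  # run_id never decreases, so rle() yields exactly one length per run
  len <- rle(run_id)$lengths
  list(n_runs = length(len), run_lengths = len, b = rep(len, len))
}

example <- c(1, 2, 3, 8, 10, 11)
res <- count_runs(example)
res$n_runs       # 3
res$run_lengths  # 3 1 2
res$b            # 3 3 3 1 2 2
```

This skips building the list when only the counts are needed.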

You could do:
out <- split(example, example - seq_along(example))
To get the lengths:
ln <- unname(lengths(out))
rep(ln, ln)
[1] 3 3 3 1 2 2
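A caveat worth checking before relying on the subtraction trick: x - seq_along(x) is constant within a run, but it is only guaranteed to differ between runs when the vector is increasing, so unsorted input can put unrelated values into the same group. A small sketch with a made-up vector (not from the question), compared against the diff-based grouping:

```r
x <- c(1, 2, 7, 4)  # made-up, deliberately unsorted input

# Subtraction trick: 1-1, 2-2, 7-3, 4-4 gives 0 0 4 0, so 4 joins the run 1, 2
by_offset <- unname(split(x, x - seq_along(x)))

# diff-based grouping starts a new group at every step that is not +1
by_diff <- unname(split(x, cumsum(c(TRUE, diff(x) != 1))))

lengths(by_offset)  # 3 1
lengths(by_diff)    # 2 1 1
```

For sorted input such as the example in the question, both approaches give the same groups.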

Here is one more. Not elegant but a different approach:
Create a dataframe of the example vector
Assign the elements to groups
aggregate with tapply
example_df <- data.frame(example = example)
example_df$group <- cumsum(ifelse(c(1, diff(example) - 1), 1, 0))
tapply(example_df$example, example_df$group, function(x) x)
$`1`
[1] 1 2 3
$`2`
[1] 8
$`3`
[1] 10 11

One other option is to use ave:
ave(example, cumsum(c(1, diff(example) != 1)), FUN = length)
# [1] 3 3 3 1 2 2
#or just
ave(example, example - seq_along(example), FUN = length)


Is there a R function equivalent to the subset operator `[ ]`, in order to slice by row index?

I know that [] is a function itself, but is there a function that does the following ?
vect = c(1, 5, 4)
# Slicing by row index with []
vect[2]
# [1] 5
# Does this kind of function exist ?
slicing_func(vect, 2)
# [1] 5
# And for dataframes ?
To understand the deeper meaning of "[] is actually a function" —
vect[2]
# [1] 5
is equivalent to:
`[`(vect, 2)
# [1] 5
Seems you have already used the function you are looking for.
Note, that it also works for data frames/matrices.
dat
# X1 X2 X3 X4
# 1 1 4 7 10
# 2 2 5 8 11
# 3 3 6 9 12
`[`(dat, 2, 3)
# [1] 8
`[`(dat, 2, 3, drop=FALSE) ## to get a data frame back
# X3
# 2 8
Data:
vect <- c(1, 5, 4)
dat <- data.frame(matrix(1:12, 3, 4))
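One place where calling `[` by name is handy is when it is passed as an argument to another function. A minimal sketch, using a made-up list of vectors, that picks the second element of each one:

```r
# Made-up example data: three vectors stored in a list
vectors <- list(c(1, 5, 4), c(9, 8, 7), c(2, 6, 3))

# `[` is passed like any other function; 2 is forwarded as the index
second <- sapply(vectors, `[`, 2)
second
# [1] 5 8 6

# The same works with a data frame, here row 2 of column 3
dat <- data.frame(matrix(1:12, 3, 4))
cell <- do.call(`[`, list(dat, 2, 3))
cell
# [1] 8
```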
You can use getElement function
vect = c(1, 5, 4)
getElement(vect, 2)
#> 5
Or you can use
vctrs::vec_slice(vect , 2)
#> 5
which works for slices and data.frames too.
For a data frame you can use slice:
library(dplyr)
vect = c(1, 5, 4)
vect %>% as.data.frame() %>% slice(2)
#> .
#> 1 5
nth(vect, 2)
#> [1] 5
Created on 2022-07-10 by the reprex package (v2.0.1)
slice according to documentation:
slice() lets you index rows by their (integer) locations. It allows
you to select, remove, and duplicate rows.
We could use pluck or chuck from purrr package:
pluck() and chuck() implement a generalised form of [[ that allow you to index deeply and flexibly into data structures. pluck() consistently returns NULL when an element does not exist, chuck() always throws an error in that case.
library(purrr)
pluck(vect, 2)
chuck(vect, 2)
> pluck(vect, 2)
[1] 5
> chuck(vect, 2)
[1] 5

R: pass multiple arguments to accumulate/reduce

This is related to R: use the newly generated data in the previous row
I realized the actual problem I was faced with is a bit more complicated than the example I gave in the thread above - it seems I have to pass 3 arguments to the recursive calculation to achieve what I want. Thus, accumulate2 or reduce may not work. So I open a new question here to avoid possible confusion.
I have the following dataset grouped by ID:
ID <- c(1, 2, 2, 3, 3, 3)
pw <- c(1:6)
add <- c(1, 2, 3, 5, 7, 8)
x <- c(1, 2, NA, 4, NA, NA)
df <- data.frame(ID, pw, add, x)
df
ID pw add x
1 1 1 1 1
2 2 2 2 2
3 2 3 3 NA
4 3 4 5 4
5 3 5 7 NA
6 3 6 8 NA
Within each group, I want to keep the first row of column x as it is and fill in the remaining rows recursively: take the lagged value of x, raise it to the power stored in pw, and add the value in add to the result. The lagged values should be updated as I proceed. So I would like to have:
ID pw add x
1 1 1 1 1
2 2 2 2 2
3 2 3 3 2^3 + 3
4 3 4 5 4
5 3 5 7 4^5 + 7
6 3 6 8 (4^5 + 7)^6 + 8
I have to apply this calculation to a large dataset, so it would be perfect if there is a fast way to do this!
If we want to use accumulate2, we need to specify the arguments correctly: it takes two input vectors, here 'pw' and 'add', plus an initialization value (.init), which is the first value of 'x'. Since the calculation is grouped by 'ID', do the grouping before calling accumulate2. Inside the lambda, ..1 is the accumulated value, and ..2 and ..3 are the current elements of 'pw' and 'add'; the recursive function is built from these.
library(dplyr)
library(purrr)
out <- df %>%
group_by(ID) %>%
mutate(x1 = accumulate2(pw[-1], add[-1], ~ ..1^..2 + ..3,
.init = first(x)) %>%
flatten_dbl ) %>%
ungroup
out$x1
#[1] 1 2 11
#[4] 4 1031 1201024845477409792
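To sanity-check the output above, the recursion can be replayed by hand. This is only a sketch, and next_x is a made-up helper name. It also shows that the last value is larger than 2^53, so it is stored as a double and the trailing digits printed above should not be read as exact:

```r
# Hypothetical helper: one step of the recursion described in the question
next_x <- function(prev, pw, add) prev^pw + add

x3 <- next_x(2, 3, 3)    # row 3: 2^3 + 3 = 11
x5 <- next_x(4, 5, 7)    # row 5: 4^5 + 7 = 1031
x6 <- next_x(x5, 6, 8)   # row 6: 1031^6 + 8, roughly 1.2e18

# Doubles represent every integer exactly only up to 2^53
x6 > 2^53
# [1] TRUE
```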
With more than 3 arguments, a for loop would be better
# // initialize an empty vector
out <- c()
# // loop over the `unique` ID
for(id in unique(df$ID)) {
# // create a temporary subset of data based on that id
tmp_df <- subset(df, ID == id)
# // initialize a temporary storage output
tmp_out <- numeric(nrow(tmp_df))
# // initialize first value with the first element of x
tmp_out[1] <- tmp_df$x[1]
# // if the number of rows is greater than 1
if(nrow(tmp_df) > 1) {
# // loop over the rows
for(i in 2:nrow(tmp_df)) {
#// do the recursive calculation and update
tmp_out[i] <- tmp_out[i - 1]^ tmp_df$pw[i] + tmp_df$add[i]
}
}
out <- c(out, tmp_out)
}
out
#[1] 1 2 11
#[4] 4 1031 1201024845477409792
In base R we could use the following solution, which also works for more than two arguments.
In this solution I first subset the original data set by ID value.
Then I take the row indices through seq_len(nrow(tmp))[-1], omitting the first row since its value is provided by init.
In the anonymous function passed to Reduce, the b argument represents the accumulated (previous) value, starting from init, and c represents the current element of the vector we iterate over, which is the row index.
So in every iteration the previous value (starting from init) is raised to the power of the current value of pw, and the current value of add is added to it.
cbind(df[-length(df)], unlist(lapply(unique(df$ID), function(a) {
tmp <- subset(df, df$ID == a)
Reduce(function(b, c) {
b ^ tmp$pw[c] + tmp$add[c]
}, init = tmp$x[1],
seq_len(nrow(tmp))[-1], accumulate = TRUE)
}))) |> setNames(c(names(df)))
ID pw add x
1 1 1 1 1.000000e+00
2 2 2 2 2.000000e+00
3 2 3 3 1.100000e+01
4 3 4 5 4.000000e+00
5 3 5 7 1.031000e+03
6 3 6 8 1.201025e+18
Data
structure(list(ID = c(1, 2, 2, 3, 3, 3), pw = 1:6, add = c(1,
2, 3, 5, 7, 8), x = c(1, 2, NA, 4, NA, NA)), class = "data.frame", row.names = c(NA,
-6L))
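The same Reduce idea can also be pulled into a small per-group helper, which some may find easier to read. This is a sketch under the assumption that only the first x of each group is non-missing; fill_x and x_filled are names made up here:

```r
df <- data.frame(ID = c(1, 2, 2, 3, 3, 3), pw = 1:6,
                 add = c(1, 2, 3, 5, 7, 8), x = c(1, 2, NA, 4, NA, NA))

# Hypothetical helper: fill one group's x from its first value
fill_x <- function(x, pw, add) {
  idx <- seq_along(x)[-1]  # rows 2..n; row 1 is supplied through init
  Reduce(function(prev, i) prev^pw[i] + add[i], idx,
         init = x[1], accumulate = TRUE)
}

# Apply per ID, then put the pieces back in the original row order
filled <- lapply(split(df, df$ID), function(g) fill_x(g$x, g$pw, g$add))
df$x_filled <- unsplit(filled, df$ID)
df$x_filled[1:5]
# [1]    1    2   11    4 1031
```

For a group with a single row, idx is empty and Reduce simply returns init, so no special case is needed.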
Base R, not using Reduce() but rather a while() Loop:
# Split-apply-combine while loop: res => data.frame
res <- do.call(rbind, lapply(with(df, split(df, ID)), function(y){
# While there are any NAs in x:
while(any(is.na(y$x))){
# Store the index of the first NA value: idx => integer scalar
idx <- with(y, head(which(is.na(x)), 1))
# Calculate x at that index using the business rule provided:
# x => numeric vector
y$x[idx] <- with(y, x[(idx-1)] ** pw[idx] + add[idx])
}
# Explicitly define the return object: y => GlobalEnv
y
}
)
)
OR recursive function:
# Recursive function: estimation_func => function()
estimation_func <- function(value_vec, exponent_vec, add_vec){
# Specify the termination condition; when all elements
# of value_vec are no longer NA:
if(all(!(is.na(value_vec)))){
# Return value_vec: numeric vector => GlobalEnv
return(value_vec)
# Otherwise recursively apply the below:
}else{
# Store the index of the first na value: idx => integer vector
idx <- Position(is.na, value_vec)
# Calculate the value of the value_vec at that index;
# using the provided business logic: value_vec => numeric vector
value_vec[idx] <- (value_vec[(idx-1)] ** exponent_vec[idx]) + add_vec[idx]
# Recursively apply function: function => Local Env
return(estimation_func(value_vec, exponent_vec, add_vec))
}
}
# Split data.frame into a list on ID;
# Overwrite x values, applying recursive function;
# Combine list into a data.frame
# res => data.frame
res <- data.frame(
do.call(
rbind,
Map(function(y){y$x <- estimation_func(y$x, y$pw, y$add); y}, split(df, df$ID))
), row.names = NULL
)

Function to find sub ID's of an ID in a data frame

I have a data frame that contains two columns, an ID column and a column with sub ID's that are related to the corresponding ID. The sub ID's can again have sub ID's (in this case the previous sub ID is now an ID).
library(tibble)
df <- tibble(id = c(1, 1, 2, 2, 3, 7), sub_id = c(2, 3, 4, 5, 6, 8))
df
# A tibble: 6 x 2
id sub_id
<dbl> <dbl>
1 1 2
2 1 3
3 2 4
4 2 5
5 3 6
6 7 8
I would like to write a function that finds all sub ID's that are related to an ID. It should return a vector with all sub ID's.
find_all_sub_ids <- function (data, id) {
data %>% ...
}
find_all_sub_ids(df, id = 1)
[1] 2 3 4 5 6
find_all_sub_ids(df, id = 2)
[1] 4 5
find_all_sub_ids(df, id = 9)
[1] NULL
This is very different from everything I have done in R so far and it was hard for me to phrase a good title for this question. So it is possible that with the right phrasing I could have already found an answer by just googling.
My first intuition for solving this was while loops. Since I also do not know how many sublevels there could be the function should continue until all are found. I never used while loops though and don't really know how I could implement them here.
Maybe someone knows a good solution for this problem. Thanks!
Edit: Forgot to assign the tibble to df and to use this argument in the function call.
With igraph:
library(igraph)
g <- graph_from_data_frame(df, directed = TRUE)
find_all_subs <- function(g,id){
#find child nodes, first one being origin
r <- igraph::subcomponent(g,match(id, V(g)$name),"out")$name
#remove origin
as.numeric(r[-1])
}
find_all_subs(g,1)
[1] 2 3 4 5 6
find_all_subs(g,2)
[1] 4 5
I think it's easiest to formulate this as a graph problem.
Your data.frame describes a directed graph (vertices going from id to sub_id), and you are interested in which nodes are reachable from a certain vertex.
Using tidygraph, this can be achieved as such:
library(tidyverse)
library(tidygraph)
df <- tibble(id = c(1, 1, 2, 2, 3, 7), sub_id = c(2, 3, 4, 5, 6, 8))
find_all_sub_ids <- function (id) {
if (!(id %in% df$id)) {
return(NULL)
}
grph <- df %>%
as_tbl_graph(directed = TRUE)
id <- which(grph %>% pull(name) == as.character(id))
grph %>%
activate(nodes) %>%
mutate(reachable = !is.na(bfs_dist(id))) %>%
as_tibble() %>%
filter(reachable) %>%
pull(name) %>%
as.numeric()
}
We check which nodes are reachable (they have a non-NA distance to your given node) using bfs_dist.
This gives
> find_all_sub_ids(1)
[1] 1 2 3 4 5 6
> find_all_sub_ids(2)
[1] 2 4 5
> find_all_sub_ids(9)
NULL
The advantage of such an approach is that it can search many levels deep without you needing to write a loop explicitly.
Edit
There was a bug in my code: tidygraph::bfs_dist uses a different id than I expected. Fixed it now.
On the new example:
> find_all_sub_ids(10)
[1] 10 200 300
I did it using a dataframe. The following works.
x= c(1,1,2,2,3,7)
y = c(2, 3, 4, 5, 6, 8)
df <- data.frame(cbind(x,y))
colnames(df) =c('id', 'sub_id')
find_all_sub_ids <- function (df, id_requested) {
si <- df[df$id==id_requested,]$sub_id
return(si)
}
find_all_sub_ids(df,id=2)
[1] 4 5
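Note that the function above returns only the direct sub IDs. Since the question mentions while loops, here is a minimal base R sketch that keeps following sub IDs until no new ones turn up. The name find_all_sub_ids_loop is made up, and the columns are assumed to be called id and sub_id as in the question:

```r
df <- data.frame(id = c(1, 1, 2, 2, 3, 7), sub_id = c(2, 3, 4, 5, 6, 8))

# Hypothetical helper: collect sub IDs level by level
find_all_sub_ids_loop <- function(data, id) {
  found <- c()
  frontier <- id  # IDs whose sub IDs still need to be looked up
  while (length(frontier) > 0) {
    children <- data$sub_id[data$id %in% frontier]
    # Keep only IDs not seen before, so a cycle cannot loop forever
    frontier <- setdiff(children, found)
    found <- c(found, frontier)
  }
  if (length(found) > 0) found else NULL
}

find_all_sub_ids_loop(df, 1)
# [1] 2 3 4 5 6
find_all_sub_ids_loop(df, 2)
# [1] 4 5
find_all_sub_ids_loop(df, 9)
# NULL
```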

New to R, probably confused with lists and how to convert them [duplicate]

I have a dataframe such as:
a1 = c(1, 2, 3, 4, 5)
a2 = c(6, 7, 8, 9, 10)
a3 = c(11, 12, 13, 14, 15)
aframe = data.frame(a1, a2, a3)
I tried the following to convert one of the columns to a vector, but it doesn't work:
avector <- as.vector(aframe['a2'])
class(avector)
[1] "data.frame"
This is the only solution I could come up with, but I'm assuming there has to be a better way to do this:
class(aframe['a2'])
[1] "data.frame"
avector = c()
for(atmp in aframe['a2']) { avector <- atmp }
class(avector)
[1] "numeric"
Note: My vocabulary above may be off, so please correct me if so. I'm still learning the world of R. Additionally, any explanation of what's going on here is appreciated (i.e. relating to Python or some other language would help!)
I'm going to attempt to explain this without making any mistakes, but I'm betting this will attract a clarification or two in the comments.
A data frame is a list. When you subset a data frame using the name of a column and [, what you're getting is a sublist (or a sub data frame). If you want the actual atomic column, you could use [[, or somewhat confusingly (to me) you could do aframe[,2] which returns a vector, not a sublist.
So try running this sequence and maybe things will be clearer:
avector <- as.vector(aframe['a2'])
class(avector)
avector <- aframe[['a2']]
class(avector)
avector <- aframe[,2]
class(avector)
There's now an easy way to do this using dplyr.
dplyr::pull(aframe, a2)
You could use $ extraction:
class(aframe$a1)
[1] "numeric"
or the double square bracket:
class(aframe[["a1"]])
[1] "numeric"
You do not need as.vector(), but you do need correct indexing: avector <- aframe[ , "a2"]
The one other thing to be aware of is the drop=FALSE option to [:
R> aframe <- data.frame(a1=1:5, a2=6:10, a3=11:15)
R> aframe
a1 a2 a3
1 1 6 11
2 2 7 12
3 3 8 13
4 4 9 14
5 5 10 15
R> avector <- aframe[, "a2"]
R> avector
[1] 6 7 8 9 10
R> avector <- aframe[, "a2", drop=FALSE]
R> avector
a2
1 6
2 7
3 8
4 9
5 10
R>
You can try something like this-
as.vector(unlist(aframe$a2))
Another advantage of using the '[[' operator is that it works both with data.frame and data.table. So if the function has to be made running for both data.frame and data.table, and you want to extract a column from it as a vector then
data[["column_name"]]
is best.
as.vector(unlist(aframe['a2']))
a1 = c(1, 2, 3, 4, 5)
a2 = c(6, 7, 8, 9, 10)
a3 = c(11, 12, 13, 14, 15)
aframe = data.frame(a1, a2, a3)
avector <- as.vector(aframe['a2'])
avector<-unlist(avector)
#this returns a named numeric vector (names a21 ... a25); wrap it in unname() to drop the names
If you just use the extract operator it will work. By default, [] sets option drop=TRUE, which is what you want here. See ?'[' for more details.
> a1 = c(1, 2, 3, 4, 5)
> a2 = c(6, 7, 8, 9, 10)
> a3 = c(11, 12, 13, 14, 15)
> aframe = data.frame(a1, a2, a3)
> aframe[,'a2']
[1] 6 7 8 9 10
> class(aframe[,'a2'])
[1] "numeric"
I use lists to filter dataframes by whether or not they have a value %in% a list.
I had been manually creating lists by exporting a 1 column dataframe to Excel where I would add " ", around each element, before pasting into R: list <- c("el1", "el2", ...) which was usually followed by FilteredData <- subset(Data, Column %in% list).
After searching stackoverflow and not finding an intuitive way to convert a 1 column dataframe into a list, I am now posting my first ever stackoverflow contribution:
# assuming you have a 1 column dataframe called "df"
list <- c()
for(i in 1:nrow(df)){
list <- append(list, df[i,1])
}
View(list)
# This is not a dataframe, it is a plain vector of values
# You can filter a dataframe using subset([Data], [Column] %in% list)
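The loop above can probably be replaced by a single extraction with [[, which returns the column as a plain vector. A minimal sketch with made-up data; the names lookup, Data, Column and keys are placeholders, not from the post:

```r
# Made-up one-column data frame holding the values to filter by
lookup <- data.frame(key = c("el1", "el3"))

# Made-up data frame to be filtered
Data <- data.frame(Column = c("el1", "el2", "el3", "el4"), value = 1:4)

# [[1]] pulls out the first (and only) column as a vector, no loop needed
keys <- lookup[[1]]

FilteredData <- subset(Data, Column %in% keys)
FilteredData$value
# [1] 1 3
```

Using a name other than list for the vector also avoids masking the base function list().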
We can also convert data.frame columns generically to a simple vector. as.vector is not enough as it retains the data.frame class and structure, so we also have to pull out the first (and only) element:
df_column_object <- aframe[2]
simple_column <- df_column_object[[1]]
All the solutions suggested so far require hardcoding column titles. This makes them non-generic (imagine applying this to function arguments).
Alternatively, you could, of course, read the column names from the data frame first and then insert them into the code of the other solutions.
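To illustrate the point about function arguments: [[ accepts either a position or a name held in a variable, so a generic wrapper needs no hardcoded column title. A minimal sketch; get_column is a made-up name:

```r
aframe <- data.frame(a1 = c(1, 2, 3, 4, 5),
                     a2 = c(6, 7, 8, 9, 10),
                     a3 = c(11, 12, 13, 14, 15))

# Hypothetical helper: column may be a position or a name
get_column <- function(data, column) data[[column]]

col_name <- "a2"
get_column(aframe, col_name)
# [1]  6  7  8  9 10
get_column(aframe, 2)
# [1]  6  7  8  9 10
class(get_column(aframe, 2))
# [1] "numeric"
```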
Another option is using as.matrix with as.vector. This can be done for one column but is also possible if you want to convert all columns to one vector. Here is a reproducible example with first converting one column to a vector and second convert complete dataframe to one vector:
a1 = c(1, 2, 3, 4, 5)
a2 = c(6, 7, 8, 9, 10)
a3 = c(11, 12, 13, 14, 15)
aframe = data.frame(a1, a2, a3)
# Convert one column to vector
avector <- as.vector(as.matrix(aframe[,"a2"]))
class(avector)
#> [1] "numeric"
avector
#> [1] 6 7 8 9 10
# Convert all columns to one vector
avector <- as.vector(as.matrix(aframe))
class(avector)
#> [1] "numeric"
avector
#> [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Created on 2022-08-27 with reprex v2.0.2

Removing values from a vector that are not duplicated at least x number of times

Given a vector:
eg.:
a = c(1, 2, 2, 4, 5, 3, 5, 3, 2, 1, 5, 3)
Using a[a%in%a[duplicated(a)]] I can remove values that are not duplicated. However, this only removes values that are present exactly once.
How would I go about removing all values that aren't present at least three times (or more, in other situations)?
The expected result would be:
2 2 5 3 5 3 2 5 3
with 1 and 4 removed, as they are only present twice and once, respectively.
You can do this in one line with the ave function:
a[ave(a, a, FUN=length) >= 3]
# [1] 2 2 5 3 5 3 2 5 3
The call to ave(a, a, FUN=length) returns, for each element a[i] in vector a, the total number of times a[i] appears within a. Then you can subset a, limiting to the indices where the total number of times is 3 or more.
Reasonably straightforward (longer than using ave but possibly more comprehensible):
x <- c(1,2,2,4,5,3,5,3,2,1,5,3)
tt <- table(x) ## tabulate
## find relevant values
ttr <- as.numeric(names(tt)[tt>=3])
x[x %in% ttr] ## subset
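Since the question mentions other thresholds, the ave approach can be wrapped in a function with the minimum count as a parameter. This is a sketch; keep_frequent is a made-up name, and seq_along(x) is passed to ave instead of x so that the counts stay integers even when x is not numeric:

```r
# Hypothetical helper: keep values that occur at least n times
keep_frequent <- function(x, n) {
  # For each position, count how often its value occurs in x
  counts <- ave(seq_along(x), x, FUN = length)
  x[counts >= n]
}

a <- c(1, 2, 2, 4, 5, 3, 5, 3, 2, 1, 5, 3)
keep_frequent(a, 3)
# [1] 2 2 5 3 5 3 2 5 3
keep_frequent(a, 2)
# [1] 1 2 2 5 3 5 3 2 1 5 3
```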
