Repeating patterns in a vector in R

Suppose a vector is produced by repeating a shorter vector, of unknown length and with unique elements, an unknown number of times:
small_v <- c("as","d2","GI","Worm")
big_v <- rep(small_v, 3)
How can I determine how long the original vector was and how many times it was repeated?
So in this example the original length was 4 and it repeats 3 times.
Realistically in my case the vectors will be fairly small and will be repeated only a few times.

1) Assuming that there is at least one unique element in small_v (which is the case in the question since it assumes all elements in small_v are unique):
min(table(big_v))
## [1] 3
or using pipes
big_v |> table() |> min()
## [1] 3
Here is a more difficult test but it still works because small_v2[2] is unique in small_v2 even though the other elements of small_v2 are not unique.
# test data
small_v2 <- c(small_v, small_v[-2])
big_v2 <- rep(small_v2, 3)
min(table(big_v2))
## [1] 3
2) If we knew that the first element of small_v were unique (which is the case in the question since it assumes all elements in small_v are unique) then this would work:
sum(big_v[1] == big_v)
## [1] 3

1) If the elements are all repeating and no other values are there, then use
length(big_v)/length(unique(big_v))
[1] 3
2) Or use
library(data.table)
max(rowid(big_v))
[1] 3

Alternatively, we could use rle() together with with() to count the repeats:
with(rle(sort(big_v)), max(lengths))
[1] 3
Created on 2023-02-04 with reprex v2.0.2
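The answers above return the number of repeats but not the length of the original vector, and they rely on at least one element being unique. As a minimal sketch that recovers both numbers without that assumption, one could search for the shortest prefix that reproduces the whole vector when repeated. The function name find_period is hypothetical, not from any package.

```r
# Hypothetical helper: find the shortest prefix of v that, repeated a whole
# number of times, reproduces v exactly.
find_period <- function(v) {
  n <- length(v)
  for (p in seq_len(n)) {
    # p must divide n, and repeating the first p elements must give v back
    if (n %% p == 0 && identical(v, rep(v[seq_len(p)], n / p))) {
      return(list(length = p, times = n / p))
    }
  }
  list(length = 0L, times = 0L)  # only reached for an empty input
}

small_v <- c("as", "d2", "GI", "Worm")
big_v <- rep(small_v, 3)
res <- find_period(big_v)
res$length  # 4
res$times   # 3
```

Because it checks candidate lengths from shortest to longest, this returns the smallest repeating unit. If small_v were itself periodic, that smaller unit would be reported instead.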

Related

How to detect and distinguish between first, second, and third increases in a vector of numbers

Say I have the following vector:
vector <- c(5,5,11,11,7,16,16,16,12,8,20,24,20)
As we can see, there are 4 separate increases in this vector, though let's just say we only care about the first 3. These 3 increases are: from the 2nd element to the 3rd element (increase from 5 to 11, which is an increase of 6), from the 5th element to the 6th element (increase from 7 to 16, which is an increase of 9), and from the 10th element to the 11th element (increase from 8 to 20, which is an increase of 12). I am looking for some help on constructing some sort of for loop algorithm to store the amount of each of these increases in separate variables. Thus far, I know how to detect each of the increases, but I have no idea how to distinguish between them in order to store them separately. Any help would be much appreciated!
You can use diff() and subset the result to find the increases, take the first 3 using head() and convert to a list to separate.
dv <- diff(vector)
as.list(head(dv[dv > 0], 3))
[[1]]
[1] 6
[[2]]
[1] 9
[[3]]
[1] 12
Another option could be to compare each value in the vector to the value before it (using the lag() function from the dplyr package) and only keep positive differences:
library(dplyr)
vector <- c(5,5,11,11,7,16,16,16,12,8,20,24,20)
# differences between numbers
diffs <- vector - lag(vector)
# only keep the first three positive differences
head(diffs[which(diffs > 0)], 3)
#> [1] 6 9 12
Created on 2021-07-23 by the reprex package (v2.0.0)
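If the position of each increase matters as well as its size, a small variation on the diff() approach keeps both. This is only a sketch: the names v, idx and increases are chosen here for illustration, and v is used instead of vector to avoid shadowing the base function.

```r
v <- c(5, 5, 11, 11, 7, 16, 16, 16, 12, 8, 20, 24, 20)

dv <- diff(v)        # dv[i] is v[i + 1] - v[i]
idx <- which(dv > 0) # positions where the next element is larger

# One row per increase: where it starts, where it ends, and by how much
increases <- data.frame(from = idx, to = idx + 1, amount = dv[idx])

first_three <- head(increases, 3)
first_three$amount  # 6 9 12
first_three$from    # 2 5 10
```

Each increase can then be addressed by row, for example first_three$amount[2] for the second one, rather than being stored in a separate variable.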

How to find missing numbers in a sequence?

I have a vector containing a list of numbers. How do I find numbers that are missing from the vector?
For example:
sequence <- c(12:17,1:4,6:10,19)
The missing numbers are 5, 11 and 18.
sequence <- c(12:17,1:4,6:10,19)
seq2 <- min(sequence):max(sequence)
seq2[!seq2 %in% sequence]
...and the output:
[1] 5 11 18
You can use the setdiff() function to compute set differences. You want the difference between the complete sequence (from min(sequence) to max(sequence)) and the sequence with missing values.
setdiff(min(sequence):max(sequence), sequence)
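This answer just gets all of the numbers from the lowest to the highest in the sequence, then asks which are not present in the original sequence. Note that which() returns positions, and these equal the missing values here only because the sequence starts at 1.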
which(!(seq(min(sequence), max(sequence)) %in% sequence))
[1] 5 11 18
c(1:max(sequence))[!duplicated(c(sequence,1:max(sequence)))[-(1:length(sequence))]]
[1] 5 11 18
Not a particularly elegant solution, I admit, but it determines which elements of the vector 1:max(sequence) are duplicates of sequence, and then selects the remaining ones out of that same vector.
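To compare the approaches on a sequence that does not start at 1, here is a small sketch. The helper name find_missing and the test vector shifted are made up for illustration.

```r
# Hypothetical wrapper around the setdiff() approach
find_missing <- function(x) setdiff(seq(min(x), max(x)), x)

sequence <- c(12:17, 1:4, 6:10, 19)
find_missing(sequence)  # 5 11 18

# A sequence whose minimum is not 1
shifted <- c(10:12, 14, 16)
find_missing(shifted)   # 13 15

# The which() version returns positions within 10:16, not the values
pos <- which(!(seq(min(shifted), max(shifted)) %in% shifted))
pos                     # 4 6
```

The setdiff() and %in% subsetting versions return the missing values themselves, so they are the safer choice when the range does not begin at 1.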

Split a list of elements into two unique lists (and get all combinations) in R

I have a list of elements (my real list has 11 elements, this is just an example):
x <- c(1, 2, 3)
and want to split them into two lists (using all entries) but I want to get all possible combinations of that list to be returned e.g.:
(1,2)(3) & (1)(2,3) & (2)(1,3)
Does anyone know an efficient way to do this for a more complex list?
Thanks in advance for your help!
List with 3 elements:
vec <- 1:3
Note that for each element we have two possibilities: it is either in 1st split or in 2nd split. So we define a matrix of all possible splits (in rows) using expand.grid which produces all possible combinations:
groups <- as.matrix(expand.grid(rep(list(1:2), length(vec))))
However, this will treat scenarios where the groups are flipped as different splits. It will also include scenarios where all the observations are in the same group (but there will only be 2 of them).
If you want to remove them, we need to remove the rows of the groups matrix that have only one group (2 such rows) and all the rows that split the vector in the same way, only with the groups switched.
One-group entries are on top and bottom so removing them is easy:
groups <- groups[-c(1, nrow(groups)),]
Duplicated entries are a bit trickier. But note that we can get rid of them by removing all the rows where the first element is assigned to group 2. In effect this imposes the requirement that the first element is always assigned to group 1.
groups <- groups[groups[,1]==1,]
Then the job is to split the list we have using each of the rows in the groups matrix. For that we use Map to call split() function on our list vec and each row of groups matrix:
splits <- Map(split, list(vec), split(groups, row(groups)))
> splits
[[1]]
[[1]]$`1`
[1] 1 3
[[1]]$`2`
[1] 2
[[2]]
[[2]]$`1`
[1] 1 2
[[2]]$`2`
[1] 3
[[3]]
[[3]]$`1`
[1] 1
[[3]]$`2`
[1] 2 3
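The steps above can be collected into one function, sketched below. The name two_way_splits is hypothetical, and the sketch assumes the input has at least two elements. Instead of building all rows and filtering, it fixes the first element in group 1 from the start, which gives 2^(n - 1) - 1 splits for n elements.

```r
# Hypothetical helper: all ways to split vec into two non-empty groups,
# ignoring which group is called 1 and which 2. Assumes length(vec) >= 2.
two_way_splits <- function(vec) {
  n <- length(vec)
  # The first element is always in group 1; the others choose freely
  rest <- as.matrix(expand.grid(rep(list(1:2), n - 1)))
  groups <- cbind(1L, rest)
  # Drop the single row in which nothing was assigned to group 2
  groups <- groups[rowSums(groups == 2) > 0, , drop = FALSE]
  lapply(seq_len(nrow(groups)), function(i) split(vec, groups[i, ]))
}

length(two_way_splits(1:3))   # 3
length(two_way_splits(1:11))  # 1023
```

For the 11-element list mentioned in the question this yields 1023 splits, which is small enough to enumerate directly.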

using seq_along() to handle the empty case

I read that using seq_along() handles the empty case much better, but this concept is not so clear in my mind.
For example, I have this data frame:
df
a b c d
1 1.2767671 0.133558438 1.5582137 0.6049921
2 -1.2133819 -0.595845408 -0.9492494 -0.9633872
3 0.4512179 0.425949910 0.1529301 -0.3012190
4 1.4945791 0.211932487 -1.2051334 0.1218442
5 2.0102918 0.135363711 0.2808456 1.1293810
6 1.0827021 0.290615747 2.5339719 -0.3265962
7 -0.1107592 -2.762735937 -0.2428827 -0.3340126
8 0.3439831 0.323193841 0.9623515 -0.1099747
9 0.3794022 -1.306189542 0.6185657 0.5889456
10 1.2966537 -0.004927108 -1.3796625 -1.1577800
Considering these three different code snippets:
# Case 1
for (i in 1:ncol(df)) {
print(median(df[[i]]))
}
# Case 2
for (i in seq_along(df)) {
print(median(df[[i]]))
}
# Case 3
for(i in df) print(median(i))
What is the difference between these different procedures when a full data.frame exists or in the presence of an empty data.frame?
Under the condition that df <- data.frame(), we have:
Case 1 falling victim to...
Error in .subset2(x, i, exact = exact) : subscript out of bounds
while the loop bodies in Cases 2 and 3 are never executed.
In essence, the error in Case 1 is due to ncol(df) being 0. The sequence 1:ncol(df) then becomes 1:0, which creates the vector c(1, 0). The for loop therefore starts with i = 1 and tries to access column 1, which does not exist. Hence, the subscript is found to be out of bounds.
Meanwhile, in Case 2 and 3 the for loop is never executed since there are no elements to process within their respective collections since the vectors are empty. Principally, this means that they have length of 0.
As this question specifically relates to what the heck is happening to seq_along(), let's take a traditional seq_along example by constructing a full vector a and seeing the results:
set.seed(111)
a <- runif(5)
seq_along(a)
#[1] 1 2 3 4 5
In essence, for each element of the vector a, there is a corresponding index that was created by seq_along to be accessed.
If we apply seq_along now to the empty df in the above case, we get:
seq_along(df)
# integer(0)
Thus, what was created was a zero-length vector. It's mighty hard to move along a zero-length vector.
Ergo, Case 1 poorly protects against the empty case.
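The difference between the index sequences can be seen directly. The following is a small sketch: df_empty, n_iter and col_medians are names invented here for illustration.

```r
df_empty <- data.frame()

1:ncol(df_empty)         # 1 0  (counts down, so a loop would run twice)
seq_len(ncol(df_empty))  # integer(0)
seq_along(df_empty)      # integer(0)

# Count how many times a seq_along() loop body runs on the empty data frame
n_iter <- 0
for (i in seq_along(df_empty)) n_iter <- n_iter + 1
n_iter                   # 0

# A loop-free alternative that also copes with zero columns
col_medians <- function(d) vapply(d, median, numeric(1))
length(col_medians(df_empty))  # 0
```

seq_len(ncol(df)) is the safe counterpart of 1:ncol(df) when the index has to come from a count rather than from the object itself.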
Now, under the traditional assumption that there is some data within the data.frame (which is a very bad assumption for any kind of developer to make):
set.seed(1234)
df <- data.frame(matrix(rnorm(40), ncol = 4))
All three cases would be operating as expected. That is, you would receive a median per column of the data.frame.
[1] -0.5555419
[1] -0.4941011
[1] -0.4656169
[1] -0.605349

Determining minimum values in a vector in R

I need some help in determining more than one minimum value in a vector. Let's suppose, I have a vector x:
x<-c(1,10,2, 4, 100, 3)
and would like to determine the indexes of the smallest 3 elements, i.e. 1, 2 and 3. I need the indexes because I will be using them to access the corresponding elements in another vector. Of course, sorting will provide the minimum values, but I want to know the indexes of their actual occurrence prior to sorting.
In order to find the index try this
which(x %in% sort(x)[1:3]) # this gives you an index vector
[1] 1 3 6
This says that the first, third and sixth elements are the three lowest values in your vector. To see which values these are, try:
x[ which(x %in% sort(x)[1:3])] # this gives the vector of values
[1] 1 2 3
or just
x[c(1,3,6)]
[1] 1 2 3
If you have any duplicated values, you may want to select the unique values first and then sort them in order to find the indexes, like this (suggested by @Jeffrey Evans in his answer):
which(x %in% sort(unique(x))[1:3])
I think you mean you want to know what are the indices of the bottom 3 elements? In that case you want order(x)[1:3]
You can use unique to account for duplicate minimum values.
x<-c(1,10,2,4,100,3,1)
which(x %in% sort(unique(x))[1:3])
Here's another way with rank that includes duplicates.
x <- c(1, 10, 2, 4, 100, 3, 3) # the original x with a duplicate 3 appended
# [1] 1 10 2 4 100 3 3
which(rank(x, ties.method='min') <= 3)
# [1] 1 3 6 7
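Since the stated goal is to use the indexes on another vector, here is a short sketch of the order() approach applied that way. The companion vector y is invented for illustration.

```r
x <- c(1, 10, 2, 4, 100, 3)
y <- c("a", "b", "c", "d", "e", "f")  # hypothetical companion vector

# order() returns the positions of x from smallest to largest value
idx <- head(order(x), 3)
idx     # 1 3 6
y[idx]  # "a" "c" "f"
```

Unlike the which() versions, order() returns the indexes in order of increasing value, and head() avoids NA entries when x has fewer than three elements.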
