Identify sequences between identical values - r

I have a large matrix:
id v1 v2 v3 v4 v5 v6 v7 v8
1001 37 15 30 37 4 11 35 37
2111 44 31 44 30 24 39 44 18
3121 43 49 39 34 44 43 26 24
4532 45 31 26 33 12 47 37 15
5234 23 27 34 23 30 34 23 4
6345 9 46 39 34 8 43 26 24
For each row (id), I would like to identify intervals of numbers in columns v1 to v8. An interval is defined here as a sequence of numbers which starts and ends with the same number.
For example, in the first row, there are two sequences which both start and end with 37: from column 1 to column 4 (37, 15, 30, 37) and from column 4 to column 8 (37, 4, 11, 35, 37).
The focal value should only occur in the start and end positions. For example, in the first row, the sequence from 37 at v1 to 37 at v8 is not included, because 37 also occurs at v4.
For each interval, I want the index of the start and end columns, the focal start and end value, and the sequence of numbers in between.
Desired output:
1001 [v1] to [v4] 37 to 37: 15,30
1001 [v4] to [v8] 37 to 37: 4, 11, 35
2111 [v1] to [v3] 44 to 44: 31
2111 [v3] to [v7] 44 to 44: 30, 24, 39
Any suggestions? Algorithm?
I managed to code the indices for a vector, but not for a matrix:
a <- which(x == 37)
from <- a[!(a-1) %in% a]
to <- a[!(a+1) %in% a]
rbind(from, to)

A very brute-force method: get the unique elements of the given row, check whether they occur more than once but not only side by side, then lapply through each, extracting the elements of the row x between successive occurrences.
apply(m, 1, function(x) {
  u <- unique(x)
  u <- u[sapply(u, function(u) any(diff(which(x == u)) > 1))]
  lapply(setNames(u, u), function(u){
    ind <- which(x == u)
    lapply(seq(length(ind) - 1),
           function(i) x[seq(ind[i] + 1, ind[i + 1] - 1)])
  })
})
Output:
# [[1]]
# [[1]]$`37`
# [[1]]$`37`[[1]]
# [1] 15 30
#
# [[1]]$`37`[[2]]
# [1] 4 11 35
#
#
#
# [[2]]
# [[2]]$`44`
# [[2]]$`44`[[1]]
# [1] 31
#
# [[2]]$`44`[[2]]
# [1] 30 24 39
#
#
#
# [[3]]
# [[3]]$`43`
# [[3]]$`43`[[1]]
# [1] 49 39 34 44
#
#
#
# [[4]]
# named list()
#
# [[5]]
# [[5]]$`23`
# [[5]]$`23`[[1]]
# [1] 27 34
#
# [[5]]$`23`[[2]]
# [1] 30 34
#
#
# [[5]]$`34`
# [[5]]$`34`[[1]]
# [1] 23 30
#
#
#
# [[6]]
# named list()
Edit: Henrik's answer inspired me to do a join-based version
library(data.table)
library(magrittr)
d <- melt(as.data.table(m), "id", variable.name = 'ci')[, ci := rowid(id)]
setorder(d, id)
options(datatable.nomatch = 0)
d[d, on = .(id, value, ci > ci)
  , .(id, value, i.ci, x.ci)
  , mult = 'first'] %>%
  .[d, on = .(id, i.ci < ci, x.ci > ci)
    , .(id, value, from_ci = x.i.ci, to_ci = x.x.ci, i.value)] %>%
  .[, .(val = .(i.value))
    , by = setdiff(names(.), 'i.value')]
# id value from_ci to_ci val
# 1: 1001 37 1 4 15,30
# 2: 1001 37 4 8 4,11,35
# 3: 2111 44 1 3 31
# 4: 2111 44 3 7 30,24,39
# 5: 3121 43 1 6 49,39,34,44
# 6: 5234 23 1 4 27,34
# 7: 5234 34 3 6 23,30
# 8: 5234 23 4 7 30,34

Here's a data.table alternative.
Convert matrix to data.table and melt to long format. Create a column index 'ci' to keep track of the original columns (rowid(id)). Order by 'id'.
For each 'id' and 'value' (by = .(id, value)), check if the number of rows is larger than one (if(.N > 1)), i.e. if there is at least one sequence. If so, grab the row indices (.I) of the sequences and their column indices in the original data. For each sequence, grab the corresponding values between the start and end index. Wrap in list twice (.(.(...))) to create a list column.
library(data.table)
d <- melt(as.data.table(m), id.vars = "id")
d[ , `:=`(
  ci = rowid(id),
  variable = NULL)]
setorder(d, id)
d2 <- d[ , if(.N > 1){
  .(from = .I[-.N], to = .I[-1],
    from_ci = ci[-.N], to_ci = ci[-1])
}, by = .(id, value)]
d2[ , val := .(.(d$value[seq(from + 1, to - 1)])), by = 1:nrow(d2)]
d2[ , `:=`(from = NULL, to = NULL)]
# id value from_ci to_ci val
# 1: 1001 37 1 4 15,30
# 2: 1001 37 4 8 4,11,35
# 3: 2111 44 1 3 31
# 4: 2111 44 3 7 30,24,39
# 5: 3121 43 1 6 49,39,34,44
# 6: 5234 23 1 4 27,34
# 7: 5234 23 4 7 30,34
# 8: 5234 34 3 6 23,30
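If the exact text layout shown in the question is wanted, the list column can be pasted together. A minimal sketch (an addition, not part of the original answer), assuming the d2 table built above:
# paste each interval into "id [vFrom] to [vTo] value to value: in-between values"
d2[, paste0(id, " [v", from_ci, "] to [v", to_ci, "] ",
            value, " to ", value, ": ",
            sapply(val, paste, collapse = ","))]
# [1] "1001 [v1] to [v4] 37 to 37: 15,30" "1001 [v4] to [v8] 37 to 37: 4,11,35" ...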

Related

R: Counting the Frequencies of Coin Flips

I am working with the R programming language.
I simulated this dataset which contains 1000 coin flips - then I calculated the number of "2 Flip Sequences":
Coin <- c('H', 'T')
Results = sample(Coin,1000, replace = TRUE)
My_Data = data.frame(id = 1:1000, Results)
Pairs = data.frame(first = head(My_Data$Results, -1), second = tail(My_Data$Results, -1))
Final = as.data.frame(table(Pairs))
first second Freq
1 H H 255
2 T H 245
3 H T 246
4 T T 253
I am curious - is it possible to extend the above code for "3 Flip Sequences"?
For example - I tried modifying parts of the code to see how the results change (and hoped to stumble across the correct way to write this code):
# First Attempt
Pairs = data.frame(first = head(My_Data$Results, -1), second = head(My_Data$Results, -1) , third = tail(My_Data$Results, -1))
Final = as.data.frame(table(Pairs))
first second third Freq
1 H H H 255
2 T H H 245
3 H T H 0
4 T T H 0
5 H H T 0
6 T H T 0
7 H T T 246
8 T T T 253
# Second Attempt
Pairs = data.frame(first = head(My_Data$Results, -1), second = tail(My_Data$Results, -1) , third = tail(My_Data$Results, -1))
Final = as.data.frame(table(Pairs))
first second third Freq
1 H H H 255
2 T H H 0
3 H T H 0
4 T T H 245
5 H H T 246
6 T H T 0
7 H T T 0
8 T T T 253
I am not sure which of these options is correct.
In general, I am looking to understand the logic of how I can adapt the above code for an "arbitrary number of coin flips" (e.g. "4 flip sequences", "5 flip sequences", etc.).
Also, this might not be the most efficient way to calculate these frequencies - I would also be interested in learning about other ways that might be more efficient (e.g. as the overall size of the data increases).
Thanks!
It might be helpful to work with strings.
coin <- c("H", "T")
results <- sample(coin, 1000, replace = TRUE)
Then to get sequence counts (assuming overlapping sequences also count) for triples, we could do something like:
triples <- table(
  sapply(
    1:(length(results) - 2),  # the last complete triple starts at position length - 2
    function(i) sprintf(
      "%s%s%s",
      results[i],
      results[i + 1],
      results[i + 2]
    )
  )
)
which gives me something like:
HHH HHT HTH HTT THH THT TTH TTT
132 129 138 115 129 124 116 114
This idea could be generalized fairly easily, for example:
n_sequences <- function(n, results) {
  helper <- function(i, n) if (n < 1) "" else sprintf(
    "%s%s",
    helper(i, n - 1),
    results[i + n - 1]
  )
  result <- data.frame(
    table(
      sapply(
        1:(length(results) - n + 1),
        function(i) helper(i, n)
      )
    )
  )
  colnames(result) <- c("Sequence", "Frequency")
  result
}
For example:
n_sequences(5, results)
Gives me something like:
Sequence Frequency
1 HHHHH 34
2 HHHHT 31
3 HHHTH 36
4 HHHTT 31
5 HHTHH 35
6 HHTHT 36
7 HHTTH 20
8 HHTTT 37
9 HTHHH 35
10 HTHHT 34
11 HTHTH 41
12 HTHTT 27
13 HTTHH 27
14 HTTHT 24
15 HTTTH 34
16 HTTTT 30
17 THHHH 31
18 THHHT 36
19 THHTH 36
20 THHTT 26
21 THTHH 34
22 THTHT 32
23 THTTH 31
24 THTTT 27
25 TTHHH 32
26 TTHHT 28
27 TTHTH 25
28 TTHTT 31
29 TTTHH 33
30 TTTHT 31
31 TTTTH 30
32 TTTTT 20
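As an aside (not part of the original answer): for overlapping counts, the element-by-element sapply loop can also be avoided with base R's embed(), which builds all length-n windows at once. A minimal sketch:
count_sequences <- function(results, n) {
  # embed() returns the lagged columns in reverse order, so flip them
  # to read each sequence left to right before pasting row-wise
  m <- embed(results, n)[, n:1, drop = FALSE]
  table(do.call(paste0, as.data.frame(m)))
}
count_sequences(results, 3)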
You could first cut the row indices along 3 + 1 breaks and split the results along the resulting levels. The interaction of the splits can then be tabled to get the result.
My_Data$cut3 <- cut(seq_len(nrow(My_Data)), seq.int(1, nrow(My_Data), length.out=3 + 1), include.lowest=TRUE)
(res <- interaction(split(My_Data$Results, My_Data$cut3)) |> table() |> as.data.frame())
# Var1 Freq
# 1 H.H.H 51
# 2 T.H.H 58
# 3 H.T.H 43
# 4 T.T.H 49
# 5 H.H.T 38
# 6 T.H.T 51
# 7 H.T.T 64
# 8 T.T.T 46
To get the desired output, we can strsplit Var1.
strsplit(as.character(res$Var1), '\\.') |> do.call(what=rbind) |>
cbind.data.frame(res$Freq) |> setNames(c('first', 'second', 'third', 'Freq'))
# first second third Freq
# 1 H H H 51
# 2 T H H 58
# 3 H T H 43
# 4 T T H 49
# 5 H H T 38
# 6 T H T 51
# 7 H T T 64
# 8 T T T 46
Note that the number of rows of your data should be divisible by 3.
Edit
To generalize, we may write a small function.
f <- \(x, n) {
  ct <- cut(seq_len(nrow(x)), seq.int(1L, nrow(x), length.out=n + 1L), include.lowest=TRUE)
  res <- interaction(split(x$Results, ct)) |> table() |> as.data.frame()
  strsplit(as.character(res$Var1), '\\.') |> do.call(what=rbind) |>
    cbind.data.frame(res$Freq) |> setNames(c(LETTERS[seq_len(n)], 'Freq'))
}
f(My_Data, 4)
# A B C D Freq
# 1 H H H H 13
# 2 T H H H 25
# 3 H T H H 18
# 4 T T H H 17
# 5 H H T H 18
# 6 T H T H 15
# 7 H T T H 21
# 8 T T T H 24
# 9 H H H T 26
# 10 T H H T 15
# 11 H T H T 16
# 12 T T H T 18
# 13 H H T T 22
# 14 T H T T 18
# 15 H T T T 10
# 16 T T T T 24
Data:
set.seed(42)
My_Data <- data.frame(id=1:1200, Results=sample(c('H', 'T'), 1200, replace=TRUE))
A slightly generalized solution with tidyverse tools. Change the sets variable for longer or shorter sequences.
coin <- c("H", "T")
sets <- 4
rolls <- 10000
results <- sample(coin, sets * rolls, rep = TRUE)
named_results <- purrr::map_chr(
  0:(rolls - 1),
  ~ paste0(results[(sets * .x + 1):(sets * .x + sets)],
           collapse = ""
  )
)
dplyr::count(tibble::tibble(x = named_results), x)
with output
# A tibble: 16 x 2
x n
<chr> <int>
1 HHHH 629
2 HHHT 627
3 HHTH 638
4 HHTT 599
5 HTHH 602
6 HTHT 633
7 HTTH 596
8 HTTT 661
9 THHH 631
10 THHT 589
11 THTH 633
12 THTT 647
13 TTHH 660
14 TTHT 637
15 TTTH 623
16 TTTT 595
sets = 8 would give something like
# A tibble: 256 x 2
x n
<chr> <int>
1 HHHHHHHH 37
2 HHHHHHHT 36
3 HHHHHHTH 43
4 HHHHHHTT 35
5 HHHHHTHH 38
6 HHHHHTHT 27
7 HHHHHTTH 32
8 HHHHHTTT 28
9 HHHHTHHH 33
10 HHHHTHHT 38
# ... with 246 more rows

Add number to vector repeatedly and duplicate vector

I have two values,
3 and 12,
and I make a vector:
num1 <- 3
num2 <- 12
a <- c(num1, num2)
I want to repeatedly add the number 12 to vector "a" and
append the results to make a new vector,
like this:
3, 12, 15, 24, 27, 36, 39, 48 ....
The repeat number "n" is 6.
I don't have any idea how to do this.
Here are two methods in base R.
With outer, you could do:
c(outer(c(3, 12), (12 * 0:4), "+"))
[1] 3 12 15 24 27 36 39 48 51 60
Or with sapply, you can explicitly loop through and calculate the pairs of sums:
c(sapply(0:4, function(i) c(3, 12) + (12 * i)))
[1] 3 12 15 24 27 36 39 48 51 60
outer returns a matrix in which every pair of elements from the two vectors has been added together; c is used to flatten it into a vector. sapply loops through 0:4 and calculates the element-wise sums. It also returns a matrix in this instance, so c is again used to return a vector.
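To see why c is needed, here is the intermediate matrix that outer builds before it is flattened column by column (a small illustration, not from the original answer):
outer(c(3, 12), 12 * 0:4, "+")
#      [,1] [,2] [,3] [,4] [,5]
# [1,]    3   15   27   39   51
# [2,]   12   24   36   48   60
# reading the matrix column by column gives 3, 12, 15, 24, ...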
Here is a somewhat generic function that takes as input your original vector a, the number to add (12), and n:
f1 <- function(vec, x, n){
  len1 <- length(vec)
  v1 <- sapply(seq(n/len1), function(i) x*i)
  v2 <- rep(v1, each = n/length(v1))
  v3 <- rep(vec, n/len1)
  return(c(vec, v3 + v2))
}
f1(a, 12, 6)
#[1] 3 12 15 24 27 36 39 48
f1(a, 11, 12)
#[1] 3 12 14 23 25 34 36 45 47 56 58 67 69 78
f1(a, 3, 2)
#[1] 3 12 6 15
EDIT
If by n = 6 you mean 6 repetitions of the whole vector, then:
f1 <- function(vec, x, n){
  len1 <- length(vec)
  v1 <- sapply(seq(n), function(i) x*i)
  v2 <- rep(v1, each = len1)
  v3 <- rep(vec, n)
  return(c(vec, v3 + v2))
}
f1(a, 12, 6)
#[1] 3 12 15 24 27 36 39 48 51 60 63 72 75 84
Using rep for repeating and cumsum for the addition:
n = 6
rep(a, n) + cumsum(rep(c(12, 0), n))
# [1] 15 24 27 36 39 48 51 60 63 72 75 84
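The result above starts at 15 because the first increment is applied immediately. If the original pair 3, 12 should lead the result, as in the desired output, the vector can simply be prepended (a small adjustment to the answer above):
c(a, rep(a, n) + cumsum(rep(c(12, 0), n)))
# [1]  3 12 15 24 27 36 39 48 51 60 63 72 75 84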

Creating new columns from long strings split into 300 substrings?

I have a column containing 1200-character strings. In each one, every four-character group is the hexadecimal representation of a number, i.e. 300 hexadecimal numbers crammed into a 1200-character string in every row. I need to get each number out into decimal and into its own column (300 new columns) named 1-300.
Here's what I've figured out so far:
Data.frame:
BigString
[1] 0043003E803C0041004A...(etc...)
Here's what I've done so far:
decimal.fours <- function(x) {
  strtoi(substring(BigString[x], seq(1,1197,4), seq(4,1197,4)), 16L)
}
decimal.fours(1)
[1] 283 291 239 177 ...
But now I'm stuck. How can I output these individual numbers (and the remaining 296) into new columns? I have fifty total rows/strings. It would be great to do them all at once, i.e. 300 new columns containing the split-up substrings from the 50 strings.
You can use read.fwf, which reads in files with a fixed width for each column:
# an example vector of big strings
BigString = c("0043003E803C0041004A", "0043003E803C0041004A", "0043003E803C0041004A")
n = 5 # n is the number of columns for your result (300 for your real case)
as.data.frame(
  lapply(read.fwf(file = textConnection(BigString),
                  widths = rep(4, n),
                  colClasses = "character"),
         strtoi, base = 16))
# V1 V2 V3 V4 V5
#1 67 62 32828 65 74
#2 67 62 32828 65 74
#3 67 62 32828 65 74
If you'd like to keep the decimal.fours function, you can modify it as follows and call lapply to convert your BigString values to a list of integer vectors, which can then be combined into a data.frame with the do.call(rbind, ...) pattern:
decimal.fours <- function(x) {
  # end positions run to 1200 so the last 4-character group is included
  strtoi(substring(x, seq(1, 1197, 4), seq(4, 1200, 4)), 16L)
}
do.call(rbind, lapply(BigString, decimal.fours))
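To end up with 300 columns named 1 to 300, as asked for, the result can be wrapped in a data.frame and renamed. A minimal sketch (assuming the real 1200-character BigString values, so decimal.fours returns 300 numbers per string; not part of the original answer):
res <- as.data.frame(do.call(rbind, lapply(BigString, decimal.fours)))
names(res) <- 1:300   # or paste0("new_col", 1:300) for syntactic names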
Obligatory tidyverse example:
library(tidyverse)
Setup some data
set.seed(1492)
bet <- c(0:9, LETTERS[1:6]) # alphabet for hex digit sequences
i <- 8 # number of rows
n <- 10 # number of 4-hex-digit sequences
df <- data_frame(
  some_other_col=LETTERS[1:i],
  big_str=map_chr(1:i, ~sample(bet, 4*n, replace=TRUE) %>% paste0(collapse=""))
)
df
## # A tibble: 8 × 2
## some_other_col big_str
## <chr> <chr>
## 1 A 432100D86CAA388C15AEA6291E985F2FD3FB6104
## 2 B BC2673D112925EBBB3FD175837AF7176C39B4888
## 3 C B4E99FDAABA47515EADA786715E811EE0502ABE8
## 4 D 64E622D7037D35DE6ADC40D0380E1DC12D753CBC
## 5 E CF7CDD7BBC610443A8D8FCFD896CA9730673B181
## 6 F ED86AEE8A7B65F843200B823CFBD17E9F3CA4EEF
## 7 G 2B9BCB73941228C501F937DA8E6EF033B5DD31F6
## 8 H 40823BBBFDF9B14839B7A95B6E317EBA9B016ED5
Do the manipulation
read_fwf(paste0(df$big_str, collapse="\n"),
         fwf_widths(rep(4, n)),
         col_types=paste0(rep("c", n), collapse="")) %>%
  mutate_all(strtoi, base=16) %>%
  bind_cols(df) %>%
  select(some_other_col, everything(), -big_str)
## # A tibble: 8 × 11
## some_other_col X1 X2 X3 X4 X5 X6 X7 X8 X9
## <chr> <int> <int> <int> <int> <int> <int> <int> <int> <int>
## 1 A 17185 216 27818 14476 5550 42537 7832 24367 54267
## 2 B 48166 29649 4754 24251 46077 5976 14255 29046 50075
## 3 C 46313 40922 43940 29973 60122 30823 5608 4590 1282
## 4 D 25830 8919 893 13790 27356 16592 14350 7617 11637
## 5 E 53116 56699 48225 1091 43224 64765 35180 43379 1651
## 6 F 60806 44776 42934 24452 12800 47139 53181 6121 62410
## 7 G 11163 52083 37906 10437 505 14298 36462 61491 46557
## 8 H 16514 15291 65017 45384 14775 43355 28209 32442 39681
## # ... with 1 more variables: X10 <int>
Just a try using base R:
BigString = c("0043003E803C0041004A", "0043003E803C0041004A", "0043003E803C0041004A")
df = data.frame(BigString)
t(sapply(df$BigString, function(x) strtoi(substring(x, seq(1, 297, 4)[1:5],
                                                    seq(4, 300, 4)[1:5]), base = 16)))
# [,1] [,2] [,3] [,4] [,5]
#[1,] 67 62 32828 65 74
#[2,] 67 62 32828 65 74
#[3,] 67 62 32828 65 74
# you can set the column names at the end using `paste0("new_col", 1:300)`
# [1:5] was just used for this example, because I had strings of length 20 characters

Create combinations within a split group in r

With the data frame below of Locations, Days, and Quantities, I'm searching for a solution to create combinations of quantities by Location across each Day. In production, these combinations may grow pretty large, so a data.table or plyr approach would be appreciated.
library(gtools)
dat <- data.frame(Loc = c(51,51,51,51,51), Day = c("Mon","Mon","Tue","Tue","Wed"),
Qty = c(1,2,3,4,5))
The output for this example should be:
Loc Day Qty
1 51 Mon 1
2 51 Tue 3
3 51 Wed 5
4 51 Mon 1
5 51 Tue 4
6 51 Wed 5
7 51 Mon 2
8 51 Tue 3
9 51 Wed 5
10 51 Mon 2
11 51 Tue 4
12 51 Wed 5
I've tried a few nested lapply()s, which get me close, but then I'm not sure how to take it to the next step and use the combn() function within each store.
lapply(split(dat, dat$Loc), function(x) {
  lapply(split(x, x$Day), function(y) {
    y$Qty
  })
})
I'm able to get the correct combinations if each Store > Day group is in its own list, but I am struggling with how to get there from a data frame using a split-apply-combine method.
loc51_mon <- c(1,2)
loc51_tue <- c(3,4)
loc51_wed <- c(5)
unlist(lapply(loc51_mon, function(x) {
  lapply(loc51_tue, function(y) {
    lapply(loc51_wed, function(z) {
      combn(c(x,y,z), 3)
    })
  })
}), recursive = FALSE)
[[1]]
[[1]][[1]]
[,1]
[1,] 1
[2,] 3
[3,] 5
[[2]]
[[2]][[1]]
[,1]
[1,] 1
[2,] 4
[3,] 5
[[3]]
[[3]][[1]]
[,1]
[1,] 2
[2,] 3
[3,] 5
[[4]]
[[4]][[1]]
[,1]
[1,] 2
[2,] 4
[3,] 5
This should work; however, further complexity would require changes to the function:
library(data.table)
dat <- data.frame(Loc = c(51,51,51,51,51), Day = c("Mon","Mon","Tue","Tue","Wed"),
Qty = c(1,2,3,4,5), stringsAsFactors = F)
setDT(dat)
comb_in <- function(Qty_In, Day_In){
  temp_df <- aggregate(Qty_In ~ Day_In, cbind(Qty_In, as.character(Day_In)), paste, collapse = "|")
  temp_list <- strsplit(temp_df$Qty_In, split = "|", fixed = T)
  names(temp_list) <- as.character(temp_df$Day)
  melt(as.data.table(expand.grid(temp_list))[, case_group := .I], id.vars = "case_group", variable.name = "Day", value.name = "Qty")
}
dat[, comb_in(Qty_In = Qty, Day_In = Day), by = Loc][order(Loc,case_group,Day)]
Loc case_group Day Qty
1: 51 1 Mon 1
2: 51 1 Tue 3
3: 51 1 Wed 5
4: 51 2 Mon 2
5: 51 2 Tue 3
6: 51 2 Wed 5
7: 51 3 Mon 1
8: 51 3 Tue 4
9: 51 3 Wed 5
10: 51 4 Mon 2
11: 51 4 Tue 4
12: 51 4 Wed 5
You can now filter by case_group to get each combination.
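For instance (a small illustration, assuming the result above is assigned to res):
res <- dat[, comb_in(Qty_In = Qty, Day_In = Day), by = Loc][order(Loc, case_group, Day)]
res[case_group == 2]
#    Loc case_group Day Qty
# 1:  51          2 Mon   2
# 2:  51          2 Tue   3
# 3:  51          2 Wed   5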
This question is quite similar to How to expand.grid on vectors sets rather than single elements.
For a general approach (performance is likely to be slower than a problem-specific approach):
permu.sets <- function(listoflist) {
  #assumes that each list within listoflist contains vectors of equal lengths
  temp <- expand.grid(listoflist)
  do.call(cbind, lapply(temp, function(x) do.call(rbind, x)))
} #permu.sets

#for the problem posted in OP
dat <- data.frame(Loc = c(51,51,51,51,51), Day = c("Mon","Mon","Tue","Tue","Wed"),
                  Qty = c(1,2,3,4,5))
vecsets <- lapply(split(dat, dat$Day), function(x) split(as.matrix(x), row(x)))
res <- permu.sets(vecsets)
lapply(split(res, seq(nrow(res))), function(x) matrix(x, ncol=3, byrow=T))
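For comparison, a plainer base-R sketch along the same expand.grid lines (an addition, not one of the original answers; it assumes the single-Loc example data, so for several locations it would be wrapped in lapply(split(dat, dat$Loc), ...)):
# one row per combination of quantities, columns Mon/Tue/Wed
wide <- expand.grid(split(dat$Qty, dat$Day))
# stack into the long Loc/Day/Qty layout of the question, with a combination id
long <- data.frame(Loc = 51,   # the single example location
                   case_group = rep(seq_len(nrow(wide)), times = ncol(wide)),
                   Day = rep(names(wide), each = nrow(wide)),
                   Qty = unlist(wide, use.names = FALSE))
long[order(long$case_group, long$Day), ]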

Create an update function

I would like to create an update function using lazy evaluation and the mutate_if function from dplyrExtras by skranz.
It would work something like this:
data %>%
  update(variable1_original = variable1_update,
         variable2_original = variable2_update)
would be evaluated as
data %>%
  mutate_if(!is.na(variable1_update),
            variable1_original = variable1_update) %>%
  mutate_if(!is.na(variable2_update),
            variable2_original = variable2_update) %>%
  select(-variable1_update, -variable2_update)
Yikes, that package isn't very fun to use. mutate_if doesn't seem to work with data.frames, and the package doesn't have standard-evaluation alternatives for its functions the way standard dplyr does. Here's an attempt to re-create the function:
myupdate <- function(.data, ...) {
  dots <- as.list(substitute(...()))
  dx <- Reduce(function(a, b) {
    upd <- b[[1]]
    ifc <- bquote(!is.na(.(upd)))
    do.call("mutate_if", c(list(a, ifc), b))
  }, split(dots, seq_along(dots)), .data)
  select_(dx, .dots = sapply(dots, function(x) bquote(-.(x))))
}
To test it, I used:
library(data.table)
dd <- data.table(
  a = c(1:3, NA, 5:8) + 0,
  b = c(1:2, NA, 4:5, NA, 7:8) + 100,
  x = 1:8 + 20,
  y = 1:8 + 30
)
dd
# a b x y
# 1: 1 101 21 31
# 2: 2 102 22 32
# 3: 3 NA 23 33
# 4: NA 104 24 34
# 5: 5 105 25 35
# 6: 6 NA 26 36
# 7: 7 107 27 37
# 8: 8 108 28 38
and then I ran
myupdate(dd, x=b, y=a)
# x y
# 1: 101 1
# 2: 102 2
# 3: 23 3
# 4: 104 34
# 5: 105 5
# 6: 26 6
# 7: 107 7
# 8: 108 8
Notice how columns "a" and "b" disappear. Also see how the values in rows 3 and 6 of column "x" and the value in row 4 of column "y" were preserved, because the corresponding values in columns "b" and "a" were NA.
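For reference, the same update semantics (take the update value unless it is NA, then drop the update columns) can be sketched with current dplyr's coalesce(); this is an alternative under that reading of the question, not a fix to the dplyrExtras approach:
library(dplyr)
library(rlang)
# sketch only: update_cols(data, old = new) keeps the old value where the
# update column is NA, then drops the update columns (mirrors myupdate above)
update_cols <- function(.data, ...) {
  pairs <- enquos(...)
  for (nm in names(pairs)) {
    upd <- eval_tidy(pairs[[nm]], .data)
    .data[[nm]] <- coalesce(upd, .data[[nm]])
  }
  drop_cols <- vapply(pairs, as_name, character(1))
  .data[, setdiff(names(.data), drop_cols), drop = FALSE]
}
update_cols(as.data.frame(dd), x = b, y = a)
# same result as myupdate(dd, x = b, y = a) above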
