Finding sublists in lists in R - r

I have a matrix of booleans. Most rows look like this
1 1 1 1 1 0 0 0 0
but some of them look like this
1 1 1 1 0 0 1 1 1
I want to find the ones that have a 0 followed by a 1. How can I do this? My naive attempt is to try
c(0, 1) %in% my_list
but that returns
[1] TRUE TRUE
since both 0 and 1 are in the list. D'oh! Any help?

How about looking at diffs instead
x <- c(1,1,1,1,0,0,1,1,1)
1 %in% diff(x)
Then create a function, and apply it to the rows of your matrix.
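One way to do that, as a minimal sketch (the helper name `has_zero_then_one` and the two-row toy matrix are made up for illustration):

```r
# Hypothetical helper: TRUE if a 0 is immediately followed by a 1
has_zero_then_one <- function(x) 1 %in% diff(x)

# Toy matrix: one "normal" row and one row containing a 0 -> 1 step
m <- rbind(c(1, 1, 1, 1, 1, 0, 0, 0, 0),
           c(1, 1, 1, 1, 0, 0, 1, 1, 1))

flagged <- apply(m, 1, has_zero_then_one)
which(flagged)   # indices of rows with a 0 followed by a 1
```

This relies on the entries being only 0 and 1, so that a difference of exactly 1 can only come from a 0 followed by a 1.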

This is probably pretty pointless as the other answer is already pretty quick, but this will scale better for very big matrices to identify the rows you want. E.g.:
no <- c(1, 1, 1, 1, 1, 0, 0, 0, 0)
yes <- c(1, 1, 1, 1, 0, 0, 1, 1, 1)
m <- rbind(no,yes,no,yes,no,yes,yes)
# 1 2 3 4 5 6 7
# result should thus be c(2,4,6,7)
col(t(m[,-1]))[diff(t(m))==1]
#[1] 2 4 6 7
1 million row matrix benchmark:
m <- m[sample(1:2,1000000,replace=TRUE),]
system.time(apply(m, 1, function(x) 1 %in% diff(x) ))
# user system elapsed
# 12.09 0.00 12.09
system.time(col(t(m[,-1]))[diff(t(m))==1])
# user system elapsed
# 0.61 0.05 0.65
Alternative suggested by @MatthewLundberg, which is probably a really good balance of speed and readability:
system.time(apply(diff(t(m))==1, 2, any))
# user system elapsed
# 1.85 0.00 1.84
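A further variation, offered as a sketch rather than a benchmarked recommendation: compare each column with its left neighbour directly, which avoids the transposes. It assumes the matrix holds only 0s and 1s; `step_up` and `rows_with_step` are illustrative names.

```r
no  <- c(1, 1, 1, 1, 1, 0, 0, 0, 0)
yes <- c(1, 1, 1, 1, 0, 0, 1, 1, 1)
m <- rbind(no, yes, no, yes, no, yes, yes)

# TRUE where an entry is larger than its left neighbour, i.e. a 0 -> 1 step
step_up <- m[, -1] > m[, -ncol(m)]

# Rows with at least one such step
rows_with_step <- unname(which(rowSums(step_up) > 0))
rows_with_step
```

Unlike the `col()` indexing version, this reports each row once even if it contains several 0 -> 1 steps.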

Related

How do I create a conditional variable based on another variable in R?

I'm back to using R after using SAS for a few years, and I'm relearning everything again.
I have a dataset with variable Lot_Size, which contains continuous data from 0.1980028 - 1.2000000 acres. I'd like to categorize this variable based on these demarcations:
0 - 1/3 acre = 0
1/3 - 2/3 acre = 1
2/3 - 1 acre = 2
1+ acre = 3
Into a new variable LS_cat.
I've explored the mutate command but I keep returning errors. Anyone have any ideas?
UPDATE
Thanks for responding - both solutions worked perfectly. Since this was a learning experience for me, I'll add to the question.
I actually misunderstood the question posed to me - if I were to make dummy variables for each category previously noted, how would I do that? For example, if Lot_Size is 0 - 1/3 of an acre, I want variable ls_1_3 to be 1, if it's not then I'd like it to be 0. Would I use ifelse command?
Use case_when().
library(tidyverse)
set.seed(123)
my_df <- tibble(
lot_size = runif(n = 10, min = 0.1980028, max = 1.2)
)
my_df |> mutate(
ls_cat = case_when(lot_size < 1 / 3 ~ 0,
lot_size < 2 / 3 ~ 1,
lot_size < 1 ~ 2,
TRUE ~ 3)
)
#> A tibble: 10 x 2
#> lot_size ls_cat
#> <dbl> <dbl>
#> 1 0.486 1
#> 2 0.988 2
#> 3 0.608 1
#> 4 1.08 3
#> 5 1.14 3
#> 6 0.244 0
#> 7 0.727 2
#> 8 1.09 3
#> 9 0.751 2
#>10 0.656 1
case_when() is usually a sound solution when there are more than two options (if_else() if there are just two), but in this case there's a simpler math(s) solution.
my_df <- tibble(lot_size = seq(0, 1.2, by = 0.1))
# number of whole thirds of an acre, capped at category 3
my_df$ls_cat <- pmin(floor(my_df$lot_size * 3), 3)
Though, this may be less instructive on R programming.
For your follow on question, ifelse() works well, e.g.
Base:
my_df$ls_1_3 <- ifelse(my_df$lot_size < 1/3, 1, 0)
Or Tidyverse:
my_df <- my_df %>%
mutate(ls_1_3 = if_else(lot_size < 1/3, 1, 0))
NB: if_else() is a more pedantic version of ifelse(). Both should work equally well here, but if_else() is better at catching possible errors.
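If a dummy is wanted for every category rather than just the first, one possible sketch in base R is to bin with cut() and then compare against each level. The column names other than ls_1_3 are invented here for illustration, a plain data.frame stands in for the tibble, and intervals are closed on the left to match the `<` comparisons used above.

```r
lot_size <- c(0.20, 0.40, 0.70, 1.10)

# Bin into the four categories; right = FALSE gives [0, 1/3), [1/3, 2/3), ...
bins <- cut(lot_size, breaks = c(0, 1/3, 2/3, 1, Inf),
            labels = c("ls_1_3", "ls_2_3", "ls_3_3", "ls_1_plus"),
            right = FALSE)

# One 0/1 column per category
dummies <- sapply(levels(bins), function(lv) as.integer(bins == lv))
my_df <- data.frame(lot_size, dummies)
my_df
```

Each row has exactly one dummy set to 1, so if these go into a regression one of them would normally be left out as the reference category.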
We can use findInterval:
Lot_Size <- seq(0.2, 1.2, len=10)
Lot_Size
# [1] 0.2000000 0.3111111 0.4222222 0.5333333 0.6444444 0.7555556 0.8666667 0.9777778 1.0888889 1.2000000
findInterval(Lot_Size, c(0, 1/3, 2/3, 1, Inf), rightmost.closed = TRUE) - 1L
# [1] 0 0 1 1 1 2 2 2 3 3
findInterval() returns the 1-based index of the interval each value falls into, which the trailing - 1L (integer 1) converts to your 0-based categories.
cut it.
dat <- transform(dat, Lot_Size_cat=
cut(Lot_Size, breaks=c(0, 1/3, 2/3, 1, Inf), labels=0:3,
include.lowest=TRUE))
dat
# X1 Lot_Size Lot_Size_cat
# 1 0.77436849 1.0509024 3
# 2 0.19722419 0.2819626 0
# 3 0.97801384 0.8002238 2
# 4 0.20132735 0.9272001 2
# 5 0.36124443 0.6396998 1
# 6 0.74261194 1.0990851 3
# 7 0.97872844 1.1648617 3
# 8 0.49811371 0.7221819 2
# 9 0.01331584 1.1915689 3
# 10 0.25994613 0.4076475 1
Data:
set.seed(666)
n <- 10
dat <- data.frame(X1=runif(n),
Lot_Size=sample(seq(0.1980028, 1.2, 1e-7), n, replace=TRUE))
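One detail worth checking when choosing between findInterval() and cut(), sketched below: findInterval() treats intervals as closed on the left, whereas cut() closes them on the right by default, so a value sitting exactly on a break point lands in different categories. The two test values are chosen purely for illustration.

```r
breaks <- c(0, 1/3, 2/3, 1, Inf)
x <- c(1/3, 1)   # values lying exactly on break points

fi       <- findInterval(x, breaks) - 1L                 # left-closed intervals
ct_right <- cut(x, breaks, labels = 0:3)                 # right-closed (the default)
ct_left  <- cut(x, breaks, labels = 0:3, right = FALSE)  # left-closed

fi                      # 1 3
as.character(ct_right)  # "0" "2"
as.character(ct_left)   # "1" "3"
```

With continuous measurements exact ties on a break are rare, but it is worth picking the convention deliberately.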

Cluster groups based on pairwise distances

I have an n x n matrix with pairwise distances as entries. The matrix looks for example like this:
m = matrix (c(0, 0, 1, 1, 1, 1,0, 0, 1, 1, 0, 1,1, 1, 0, 1, 1, 0,1, 1, 1, 0, 1, 1,1, 0, 1, 1, 0, 1,1, 1, 0, 1, 1, 0),ncol=6, byrow=TRUE)
colnames(m) <- c("A","B","C","D","E","F")
rownames(m) <- c("A","B","C","D","E","F")
Now I want to put every letter in the same cluster if the distance to any other letter is 0. For the example above, I should get three clusters consisting of:
(A,B,E)
(C,F)
(D)
I would be interested in the number of entries in each cluster. At the end, I want to have a vector like:
clustersizes = c(3,2,1)
I assume it is possible by using the hclust function, but I'm not able to extract the three clusters. I also tried the cutree function, but if I don't know the number of clusters before and also not the cutoff for the height, how should I do it?
This is what I tried:
h <- hclust(dist(m),method="single")
plot(h)
Thanks!
Welcome to SO.
There are several ways to handle this but an easy choice is to use the igraph package.
First we convert your matrix m to an adjacency matrix, in which 1 marks a connection between two nodes and 0 means no connection. Your matrix has 0 for connected pairs, so we subtract it from 1:
mm <- 1 - m
diag(mm) <- 0 # We don't allow loops
This gives
> mm
A B C D E F
A 0 1 0 0 0 0
B 1 0 0 0 1 0
C 0 0 0 0 0 1
D 0 0 0 0 0 0
E 0 1 0 0 0 0
F 0 0 1 0 0 0
Then we just need to feed it to igraph to compute communities
library("igraph")
fastgreedy.community(as.undirected(graph.adjacency(mm)))
which produces
IGRAPH clustering fast greedy, groups: 3, mod: 0.44
+ groups:
$`1`
[1] "A" "B" "E"
$`2`
[1] "C" "F"
$`3`
[1] "D"
Now if you save that result you can get the community sizes right away
res <- fastgreedy.community(as.undirected(graph.adjacency(mm)))
sizes(res)
which yields
Community sizes
1 2 3
3 2 1
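Strictly speaking, the grouping rule in the question (link any two letters at distance 0) describes connected components rather than communities, and the two need not coincide on other data. As a hedged alternative that needs no packages, the components can be found by repeatedly growing each group along 0-distance links. The names `linked`, `cluster` and `cluster_sizes` and the loop itself are an illustrative sketch, not part of the answer above.

```r
m <- matrix(c(0, 0, 1, 1, 1, 1,
              0, 0, 1, 1, 0, 1,
              1, 1, 0, 1, 1, 0,
              1, 1, 1, 0, 1, 1,
              1, 0, 1, 1, 0, 1,
              1, 1, 0, 1, 1, 0), ncol = 6, byrow = TRUE,
            dimnames = list(LETTERS[1:6], LETTERS[1:6]))

linked  <- m == 0                      # TRUE where two letters are directly linked
cluster <- rep(NA_integer_, nrow(m))   # cluster id per letter, NA = not yet assigned
k <- 0L
for (i in seq_len(nrow(m))) {
  if (!is.na(cluster[i])) next         # already placed in a cluster
  k <- k + 1L
  members <- i
  repeat {
    # add every letter linked to any current member
    grown <- union(members, which(colSums(linked[members, , drop = FALSE]) > 0))
    if (length(grown) == length(members)) break
    members <- grown
  }
  cluster[members] <- k
}
names(cluster) <- rownames(m)

cluster_sizes <- sort(as.vector(table(cluster)), decreasing = TRUE)
cluster_sizes   # 3 2 1
```

On this example it gives the same three groups, (A,B,E), (C,F) and (D), without needing a cutoff or a number of clusters in advance.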

finding the length and positions of sub-series within a series of numbers

I have a vector made of 0 and non-zero numbers. I would like to know the length and starting-position of each of the non-zero number series:
a = c(0.0000000, 0.0000000, 0.0000000, 0.0000000, 0.0000000, 2.6301334, 1.8372030, 0.0000000, 0.0000000, 0.0000000, 1.5632647, 1.1433757, 0.0000000, 1.5412216, 0.8762267, 0.0000000, 1.3087967, 0.0000000, 0.0000000, 0.0000000)
based on a previous post it is easy to find the starting positions of the non-zero regions:
Finding the index of first changes in the elements of a vector in R
c(1,1+which(diff(a)!=0))
However I cannot seem to configure a way of finding the length of these regions....
I have tried the following:
dif=diff(which(a==0))
dif_corrected=dif-1 # to correct for the added lengths
row=rbind(position=seq(length(a)), length=c(1, dif_corrected))
position 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
length 1 0 0 0 0 2 0 0 2 2 1 0 0 1 0
NOTE: not all columns are displayed ( there are actually 20)
Then I subset this to take away 0 values:
> row[,-which(row[2,]==0)]
[,1] [,2] [,3] [,4] [,5] [,6] [,7]
position 1 6 9 10 11 14 19
length 1 2 2 2 1 1 2
This seems like a decent way of coming up with the positions and lengths of each non-zero series in the series, but it is incorrect:
The position 9 (identified as the start of a non-zero series) is a 0 and instead 10 and 11 are non-zero so I would expect the position 10 and a length of 2 to appear here....
The only result that is correct is position 6 which is the start of the first non-zero series- it is correctly identified as having a length of 2- all other positions are incorrect.
Can anyone tell me how to index correctly to identify the starting-position of each of the non-zero series and the corresponding lengths?
NOTE I only did this in R because of the usefulness of the which command, but it would also be good to know how to do this in numpy and create a dictionary of position and length values.
It seems like rle could be useful here.
# a slightly simpler vector
a <- c(0, 0, 1, 2, 0, 2, 1, 2, 0, 0, 0, 1)
# runs of zero and non-zero elements
r <- rle(a != 0)
# lengths of non-zero elements
r$lengths[r$values]
# [1] 2 3 1
# start of non-zero runs
cumsum(r$lengths)[r$values] - r$lengths[r$values] + 1
# [1] 3 6 12
This also works on vectors with only 0 or non-0, and does not depend on whether or not the vector starts/ends with 0 or non-0. E.g.:
a <- c(1, 1)
a <- c(0, 0)
a <- c(1, 1, 0, 1, 1)
a <- c(0, 0, 1, 1, 0, 0)
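For convenience, the same idea can be wrapped in a small function and run on the vector from the question. The name `nonzero_runs` is made up for this sketch.

```r
# Hypothetical wrapper around the rle() approach
nonzero_runs <- function(a) {
  r <- rle(a != 0)
  ends <- cumsum(r$lengths)            # last position of every run
  data.frame(start  = (ends - r$lengths + 1)[r$values],
             length = r$lengths[r$values])
}

a <- c(0, 0, 0, 0, 0, 2.6301334, 1.8372030, 0, 0, 0, 1.5632647, 1.1433757,
       0, 1.5412216, 0.8762267, 0, 1.3087967, 0, 0, 0)
runs <- nonzero_runs(a)
runs
# start:  6 11 14 17
# length: 2  2  2  1
```

For an all-zero input such as c(0, 0) the function returns a data frame with zero rows.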
A possible data.table alternative, using rleid to create groups and .I to get the start index and calculate the length.
library(data.table)
d <- data.table(a)
d[ , .(start = min(.I), len = max(.I) - min(.I) + 1, nonzero = (a != 0)[1]),
by = .(run = rleid(a != 0))]
# run start len nonzero
# 1: 1 1 2 FALSE
# 2: 2 3 2 TRUE
# 3: 3 5 1 FALSE
# 4: 4 6 3 TRUE
# 5: 5 9 3 FALSE
# 6: 6 12 1 TRUE
If desired, the runs can then easily be sliced by the 'nonzero' column.
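For numpy this is a parallel method to @Maple (with a fix for arrays ending with a nonzero):
import numpy as np
def subSeries(a):
    # 1 where a is (numerically) non-zero, 0 elsewhere
    d = np.logical_not(np.isclose(a, np.zeros_like(a))).astype(int)
    # pad with zeros so runs touching either end are detected
    edges = np.diff(np.r_[0, d, 0])
    starts = np.where(edges == 1)[0]   # 0-based index of the first element of each run
    ends = np.where(edges == -1)[0]    # 0-based index one past the last element
    return np.c_[starts, ends - starts]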
Definition:
sublistLen = function(list) {
z_list <- c(0, list, 0)
ids_start <- which(diff(z_list != 0) == 1)
ids_end <- which(diff(z_list != 0) == - 1)
lengths <- ids_end - ids_start
return(
list(
'ids_start' = ids_start,
'ids_end' = ids_end - 1,
'lengths' = lengths)
)
}
Example:
> a <- c(-2,0,0,12,5,0,124,0,0,0,0,4,48,24,12,2,0,9,1)
> sublistLen(a)
$ids_start
[1] 1 4 7 12 18
$ids_end
[1] 1 5 7 16 19
$lengths
[1] 1 2 1 5 2

How to utilize recursive functions to help rank matrix rows - R

I would like some advice as to how best solve this puzzle. I have got some of the way to solving it using manually written long-hand code. I feel as if I need to utilize recursive functions, but I am still not very good at using them. I hope this question is not too long, I'm trying to be as succinct as possible whilst giving enough information. Sorry if it's too long - though hopefully somebody finds it of interest.
I have a matrix mat1
# A B C D E F G
# A 0 2 1 1 0 1 1
# B 0 0 0 1 2 2 1
# C 1 2 0 0 0 2 1
# D 1 1 2 0 1 2 1
# E 2 0 2 1 0 2 1
# F 1 0 0 0 0 0 1
# G 1 1 1 1 1 1 0
This represents the results of contests between individuals in rows and columns. Numbers refer to how often the individual in the row 'won' against the individual in the column.
I wish to rank individuals A-G from 1-7 using the following criteria:
number of wins against all others (most wins should be ranked 1, least wins 7, 2nd most wins 2, etc.)
if number of wins are tied, then ranks should be based on the number of wins obtained when considering contests only between those individuals with the same number of wins.
if individuals still have a tied number of wins, then ranks should be applied randomly.
I realize that this is not a very good ranking system, but that's not the issue here. According to the above scheme, ranks should be the following:
1 - D or E - D & E have joint highest overall wins (8), and equal wins also in contests between them.
2 - E or D - pick randomly D or E for rank 1 and rank 2
3 - A or C - tied with A,B,C,G for overall 6 wins, both have 4 wins in contests with ABCG
4 - C or A - considering contests between C&A both have 1 win, so randomly pick for rank3 and rank4
5 - G - tied with A,B,C,G for overall 6 wins, has 3 wins in contests between A,B,C,G
6 - B - tied with A,B,C,G for overall 6 wins, but only has 1 win in contests between A,B,C,G
7 - F - has the fewest wins of all in the overall win matrix
What I have tried:
storeresults <- vector("list") #use this to store results of the following
Step 1: Use winsfun function (see below) to identify number of wins of each individual & whether wins are unique (as noted by dupes column):
w1 <- winsfun(mat1)
storeresults[[1]] <- w1 #store results
w1
Only "F" has a unique number of wins and so can be ranked (7th) in the first instance:
# wins ranks dupes
#A 6 4.5 TRUE
#B 6 4.5 TRUE
#C 6 4.5 TRUE
#D 8 1.5 TRUE
#E 8 1.5 TRUE
#F 2 7.0 FALSE
#G 6 4.5 TRUE
Step 2: For individuals with non-unique wins (i.e. duplicated ranks) subset them into matrices considering only contests against others with the same number of wins, and determine new ranks if possible.
allSame(w1[,3]) #FALSE - this says that not all wins/ranks are unique so need to subset
s2 <- subsetties(w1) #this just splits the data into groups by number of wins (see below)
w2 <- lapply(s2, winsfun, m=mat1)
storeresults[[2]] <- w2 # store results
w2
As can be seen, those individuals with 8 wins (the most of anyone) from Step 1 ("D" and "E") each have one win versus each other. They cannot be teased apart, so will be ranked 1 and 2 randomly. Those individuals with 6 wins (A, B, C, G) have different numbers of wins when only considering contests between each other. "B" and "G" can be ranked 6th overall and 5th overall respectively. We need to reconsider "A" and "C" in contests against only each other:
$`6`
wins ranks dupes
A 4 1.5 TRUE
B 1 4.0 FALSE
C 4 1.5 TRUE
G 3 3.0 FALSE
$`8`
wins ranks dupes
D 1 1.5 TRUE
E 1 1.5 TRUE
Step 3: Repeat Step 2 where required
allSame(w2[[1]][,3]) #FALSE - need to subset again as not everyone has same number of wins
allSame(w2[[2]][,3]) #TRUE - no more action required
s3 <- subsetties(w2[[1]])
w3 <- winsfun(s3[[1]], m=mat1)
storeresults[[3]] <- w3 #store results
w3
When considering "A" and "C" together, they have one win each, so should now be ranked randomly in 3rd and 4th place. They cannot be teased apart.
wins ranks dupes
A 1 1.5 TRUE
C 1 1.5 TRUE
allSame(w3[,3]) #TRUE - no more action required - both have same number of wins
Step 4 Processing Stored Results
storeresults
# I can manually work out ranks from this, but have yet to work out how to do it in R
Below are the functions used in the above:
Function to calculate wins and ranks of subsetted matrices
winsfun <- function(m, out = NULL) {
  # if 'out' is supplied, restrict the contests to the individuals listed in it
  if (!is.null(out)) {
    m <- m[rownames(out), rownames(out)]
  }
  wins <- rowSums(m)
  ranks <- rank(-wins)
  # TRUE for every individual whose number of wins is shared with someone else
  dupes <- duplicated(wins) | duplicated(wins, fromLast = TRUE)
  data.frame(wins, ranks, dupes)
}
Function to subset those rows with duplicated ranks
subsetties <- function(df){
df1 <- df[df[,3]==T,]
df1.sp <- split(df1, df1$wins)
return(df1.sp)
}
Function to test if all elements of vector are identical
allSame <- function(x) length(unique(x)) == 1
Code to recreate above matrix:
structure(c(0, 0, 1, 1, 2, 1, 1, 2, 0, 2, 1, 0, 0, 1, 1, 0, 0,
2, 2, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 2, 0, 1, 0, 0, 1, 1, 2, 2,
2, 2, 0, 1, 1, 1, 1, 1, 1, 1, 0), .Dim = c(7L, 7L), .Dimnames = list(
c("A", "B", "C", "D", "E", "F", "G"), c("A", "B", "C", "D",
"E", "F", "G")))
I hope this question is clear. I am trying to work out how to perform this algorithm iteratively. I am not too sure how to achieve this, but hopefully by writing this out long-hand and providing the functions I have been using, it may be obvious to somebody. One extra thing is that it's best to have the proposed solution be generally applicable (i.e. to matrices of different sizes).
calc_gain<-function(mat=mat1){
if(nrow(mat)==1) {
return(row.names(mat))
} else {
classement<-sort(rowSums(mat),decreasing=T)
diffgains<-diff(classement)
if (all(diffgains!=0)){
return(names(classement))
} else {
if (all(diffgains==0)){
return(sample(names(classement)))
} else {
parex<-split(classement,factor(classement,levels=unique(classement)))
class_parex<-lapply(parex,function(vect){calc_gain(mat[names(vect),names(vect),drop=F])})
return(unlist(class_parex))
}
}
}
}
Here is what the function does:
if there is only one element, it returns its name (there is only one "player");
otherwise, it calculates the scores:
if there are no ties, it returns the "players" in order from first to last;
if all "players" have the same score, it returns them in random order;
otherwise, it splits the ordered list by score and applies itself (this is the recursive part) to each subset of "players" with tied scores.
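To see the recursion at work on the matrix from the question, here is a condensed, self-contained sketch of the same idea. `order_players` is a hypothetical name and the body is a simplified restatement, not the answer's exact code; the last lines turn the returned ordering into ranks 1 to 7.

```r
mat1 <- structure(c(0, 0, 1, 1, 2, 1, 1, 2, 0, 2, 1, 0, 0, 1, 1, 0, 0,
  2, 2, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 2, 0, 1, 0, 0, 1, 1, 2, 2,
  2, 2, 0, 1, 1, 1, 1, 1, 1, 1, 0), .Dim = c(7L, 7L), .Dimnames = list(
  c("A", "B", "C", "D", "E", "F", "G"), c("A", "B", "C", "D", "E", "F", "G")))

order_players <- function(mat) {
  if (nrow(mat) == 1) return(rownames(mat))
  wins <- sort(rowSums(mat), decreasing = TRUE)
  # everyone tied: nothing left to separate them, so order at random
  if (length(unique(wins)) == 1) return(sample(names(wins)))
  # otherwise split into groups of equal wins (best first) and recurse,
  # counting only contests within each tied group
  groups <- split(names(wins), factor(wins, levels = unique(wins)))
  unlist(lapply(groups, function(g) order_players(mat[g, g, drop = FALSE])),
         use.names = FALSE)
}

set.seed(1)  # only affects how ties that cannot be broken are ordered
ord <- order_players(mat1)
ranks <- setNames(seq_along(ord), ord)
ranks
```

Whatever the seed, D and E take ranks 1 and 2 in some order, A and C take 3 and 4, and G, B and F are always 5, 6 and 7, which matches the ranking described in the question.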
Here's a start:
Step0:
> split(rownames(m), -rowSums( m ) )
$`-8`
[1] "D" "E"
$`-6`
[1] "A" "B" "C" "G"
$`-2`
[1] "F"
Step1:
> m <- m[ order( -rowSums(m) ), ] # order within overall wins
> m
A B C D E F G
D 1 1 2 0 1 2 1
E 2 0 2 1 0 2 1
A 0 2 1 1 0 1 1
B 0 0 0 1 2 2 1
C 1 2 0 0 0 2 1
G 1 1 1 1 1 1 0
F 1 0 0 0 0 0 1
> rowSums( m )
D E A B C G F
8 8 6 6 6 6 2
Step2: Order within the group that has 6 overall wins
> mred <- m[c("A","B","C","G"), c("A","B","C","G") ]
> mred
A B C G
A 0 2 1 1
B 0 0 0 1
C 1 2 0 1
G 1 1 1 0
> rowSums(mred)
A B C G
4 1 4 3
> rownames(mred)[order(-rowSums(mred))]
[1] "A" "C" "G" "B"

Create counter within consecutive runs of certain values

I have an hourly value. I want to count how many consecutive hours the value has been zero since the last time it was not zero. This is an easy job for a spreadsheet or for loop, but I am hoping for a snappy vectorized one-liner to accomplish the task.
x <- c(1, 0, 1, 0, 0, 0, 1, 1, 0, 0)
df <- data.frame(x, zcount = NA)
df$zcount[1] <- ifelse(df$x[1] == 0, 1, 0)
for(i in 2:nrow(df))
df$zcount[i] <- ifelse(df$x[i] == 0, df$zcount[i - 1] + 1, 0)
Desired output:
R> df
x zcount
1 1 0
2 0 1
3 1 0
4 0 1
5 0 2
6 0 3
7 1 0
8 1 0
9 0 1
10 0 2
William Dunlap's posts on R-help are the place to look for all things related to run lengths. His f7 from this post is
f7 <- function(x){ tmp<-cumsum(x);tmp-cummax((!x)*tmp)}
and in the current situation you would call f7(!x). In terms of performance:
> x <- sample(0:1, 1000000, TRUE)
> system.time(res7 <- f7(!x))
user system elapsed
0.076 0.000 0.077
> system.time(res0 <- cumul_zeros(x))
user system elapsed
0.345 0.003 0.349
> identical(res7, res0)
[1] TRUE
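To see why f7 works, it can help to print the intermediate vectors for the small example from the question. This is only an unpacking of the one-liner above; the variable names are illustrative.

```r
x <- c(1, 0, 1, 0, 0, 0, 1, 1, 0, 0)
z <- !x                       # TRUE where the hourly value is zero

tmp    <- cumsum(z)           # running count of all zeros seen so far
reset  <- cummax((!z) * tmp)  # that count as of the most recent non-zero hour
zcount <- tmp - reset         # zeros since the last non-zero value

rbind(x, tmp, reset, zcount)
# zcount: 0 1 0 1 2 3 0 0 1 2
```

The subtraction cancels every zero counted before the latest non-zero hour, which is what resets the counter without any loop.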
Here's a way, building on Joshua's rle approach: (EDITED to use seq_len and lapply as per Marek's suggestion)
> (!x) * unlist(lapply(rle(x)$lengths, seq_len))
[1] 0 1 0 1 2 3 0 0 1 2
UPDATE. Just for kicks, here's another way to do it, around 5 times faster:
cumul_zeros <- function(x) {
x <- !x
rl <- rle(x)
len <- rl$lengths
v <- rl$values
cumLen <- cumsum(len)
z <- x
# replace the 0 at the end of each zero-block in z by the
# negative of the length of the preceding 1-block....
iDrops <- c(0, diff(v)) < 0
z[ cumLen[ iDrops ] ] <- -len[ c(iDrops[-1],FALSE) ]
# ... to ensure that the cumsum below does the right thing.
# We zap the cumsum with x so only the cumsums for the 1-blocks survive:
x*cumsum(z)
}
Try an example:
> cumul_zeros(c(1,1,1,0,0,0,0,0,1,1,1,0,0,1,1))
[1] 0 0 0 1 2 3 4 5 0 0 0 1 2 0 0
Now compare times on a million-length vector:
> x <- sample(0:1, 1000000,T)
> system.time( z <- cumul_zeros(x))
user system elapsed
0.15 0.00 0.14
> system.time( z <- (!x) * unlist( lapply( rle(x)$lengths, seq_len)))
user system elapsed
0.75 0.00 0.75
Moral of the story: one-liners are nicer and easier to understand, but not always the fastest!
rle will "count how many consecutive hours the value has been zero since the last time it was not zero", but not in the format of your "desired output".
Note the lengths for the elements where the corresponding values are zero:
rle(x)
# Run Length Encoding
# lengths: int [1:6] 1 1 1 3 2 2
# values : num [1:6] 1 0 1 0 1 0
A simple base R approach:
ave(!x, cumsum(x), FUN = cumsum)
#[1] 0 1 0 1 2 3 0 0 1 2
One-liner, not exactly super elegant:
x <- c(1, 0, 1, 0, 0, 0, 1, 1, 0, 0)
unlist(lapply(split(x, c(0, cumsum(abs(diff(!x == 0))))), function(x) (x[1] == 0) * seq(length(x))))
Using purrr::accumulate() is very straightforward, so this tidyverse solution may add some value here. I must acknowledge it is definitely not the fastest, as it calls the same function length(x) times.
library(purrr)
accumulate(x==0, ~ifelse(.y!=0, .x+1, 0))
[1] 0 1 0 1 2 3 0 0 1 2
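As a quick sanity check (a sketch, with an arbitrarily chosen seed and vector length), the explicit loop from the question can be compared against two of the base R one-liners on random input.

```r
set.seed(42)
x <- sample(0:1, 1000, replace = TRUE)

# Reference: the explicit loop from the question
loop_count <- numeric(length(x))
loop_count[1] <- if (x[1] == 0) 1 else 0
for (i in 2:length(x)) {
  loop_count[i] <- if (x[i] == 0) loop_count[i - 1] + 1 else 0
}

via_rle <- (!x) * unlist(lapply(rle(x)$lengths, seq_len))
via_ave <- ave(!x, cumsum(x), FUN = cumsum)

# TRUE if both vectorized versions reproduce the loop
all(loop_count == via_rle) && all(loop_count == via_ave)
```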
