Split vector into contiguous runs of equal values - r

I have a data table and one of the columns is a bunch of 0's and 1's, just like vec below.
vec = c(rep(1, times = 6), rep(0, times = 10), rep(1, times = 11), rep(0, times = 4))
> vec
[1] 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0
What I want to do is split the data every time that column changes from 0 to 1 or vice versa. Here is what I have done so far:
b = c(vec[1],diff(vec))
rowby = numeric(0)
for (i in 2:(length(b))) {
if (b[i] != 0) {
rowby <- c(rowby, i-1)
}
}
splitted_data <- split(vec, cumsum(c(TRUE,(1:length(vec) %in% rowby)[-length(vec)])))
There must be something right under my nose that I can't see. What is the correct way to do this? My code works for the example above, but not in general.

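Try the following. It relies on vec holding only 0s and 1s, so abs(diff(vec)) is 1 exactly where the value changes: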
split(vec,cumsum(c(1, abs(diff(vec)))))
#$`1`
#[1] 1 1 1 1 1 1
#$`2`
#[1] 0 0 0 0 0 0 0 0 0 0
#$`3`
#[1] 1 1 1 1 1 1 1 1 1 1 1
#$`4`
#[1] 0 0 0 0
Or use rle
split(vec,inverse.rle(within.list(rle(vec), values <- seq_along(values))))
With current versions of data.table, rleid is one function which can be used for this job:
library(data.table)#v1.9.5+
split(vec,rleid(vec))
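The abs(diff(vec)) trick depends on the values being 0 and 1. Here is a minimal base R sketch that should also work for arbitrary numeric values by comparing neighbours with != instead. The vector x and the names run_id, runs_diff and runs_rle are made up for illustration.

```r
# Hypothetical example vector with values other than 0 and 1
x <- c(3, 3, 7, 7, 7, 3, 5, 5)

# A new run starts wherever a value differs from its predecessor
run_id <- cumsum(c(TRUE, diff(x) != 0))
runs_diff <- split(x, run_id)

# The same grouping built from rle(): repeat each run's index by its length
r <- rle(x)
runs_rle <- split(x, rep(seq_along(r$lengths), r$lengths))

lengths(runs_diff)
```

For character or factor input, diff() is not available, so x[-1] != x[-length(x)] could take its place.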

Related

Alternative ways to create a repetitive vector in R

I am tasked to create the vector
0 1 0 1 0 1 0 1 0 1
using two approaches without using c() or rep() in R.
I have tried a bunch of methods, but none of them seem to work.
Here are some of my attempts (all of which have failed) -
vector(0, 1, 0, 1, 0, 1, 0, 1, 0, 1)
a<-seq(from = 0, to = 1 , by = 1)
a
replicate(5, a)
b<-1*(0:1)
do.call(cbind, replicate(5, b, simplify=FALSE))
Any help on this would be appreciated! Thank you.
We can use bitwAnd
> bitwAnd(0:9, 1)
[1] 0 1 0 1 0 1 0 1 0 1
or kronecker
> kronecker(as.vector(matrix(1, 5)), 0:1)
[1] 0 1 0 1 0 1 0 1 0 1
> kronecker((1:5)^0, 0:1)
[1] 0 1 0 1 0 1 0 1 0 1
or outer
> as.vector(outer(0:1, (1:5)^0))
[1] 0 1 0 1 0 1 0 1 0 1
Solution 1: Generalized Function my_rep()
A generalized solution my_rep() for any vector x you wish repeated n times
my_rep <- function(x, n) {
return(
# Use modulo '%%' to subscript the original vector (whose length I'll call "m"), by
# cycling 'n' times through its indices.
x[0:(length(x) * n - 1) %% length(x) + 1]
# 1 2 ... m 1 2 ... m 1 2 ... m
# | 1st cycle | | 2nd cycle | ... | nth cycle |
)
}
which can solve this case
my_rep(x = 0:1, n = 5)
# [1] 0 1 0 1 0 1 0 1 0 1
and many others
# Getting cute, to make a vector of strings without using 'c()'.
str_vec <- strsplit("a b ", split = " ")[[1]]
str_vec
# [1] "a" "b" ""
my_rep(x = str_vec, n = 3)
# [1] "a" "b" "" "a" "b" "" "a" "b" ""
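The index arithmetic inside my_rep() can be traced on a small case. This is only a sketch; the intermediate names m, zero_based, idx and repeated are introduced here for readability and are not part of the original function.

```r
x <- 0:1   # vector to repeat
n <- 3     # number of repetitions (kept small so the indices are easy to read)
m <- length(x)

zero_based <- 0:(m * n - 1)   # positions counted from zero
idx <- zero_based %% m + 1    # wraps back to 1 after every m positions
repeated <- x[idx]            # x indexed by the cycling positions

idx
repeated
```

Because `:` binds more tightly than `%%`, which in turn binds more tightly than `+`, the one-line version in my_rep() evaluates in the same order as these separate steps.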
Solution 2: Binary Vector of Arbitrary Length
Another quick solution, for a 0 1 0 1 ... 0 1 vector of arbitrary length l
# Whatever length you desire.
l <- 10
# Generate a vector of alternating 0s and 1s, of length 'l'.
(1:l - 1) %% 2
which yields the output:
[1] 0 1 0 1 0 1 0 1 0 1
Note
Special thanks to @Adam, who figured out 0:9 %% 2 on their own, shortly after my comment with that same solution; and who gracefully retracted their initial answer in favor of mine. :)
Exploiting boolean coercion.
+(1:10*c(-1, 1) > 0)
# [1] 0 1 0 1 0 1 0 1 0 1
Or without c().
+(1:10*(0:1*2) - 1 > 0)
# [1] 0 1 0 1 0 1 0 1 0 1
Here is a way using the apply functions.
unlist(lapply(1:5, function(x) 0:1))
# [1] 0 1 0 1 0 1 0 1 0 1
Similar but with replicate.
as.vector(replicate(5, 0:1))
# [1] 0 1 0 1 0 1 0 1 0 1
And just in case you love trig.
abs(as.integer(cos((1:10 * pi) / 2)))
# [1] 0 1 0 1 0 1 0 1 0 1
And here is one last one that I consider cheating just because. This one generalizes to any vector you want!
unlist(unname(read.table(textConnection("0 1 0 1 0 1 0 1 0 1"))))
We can use purrr::accumulate and a simple negation (!) operation.
accumulate will perform the same operation recursively over its data argument and output all intermediate results.
In this case, it can be broken down into:
output[1] <-0
output[2] <-!output[1]
output[3] <-!output[2]
...
The output would then be c(0, TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, TRUE), which is coerced to numeric.
purrr::accumulate(0:9, ~!.x)
[1] 0 1 0 1 0 1 0 1 0 1
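The same running negation can be sketched in base R with Reduce() and accumulate = TRUE, which avoids the purrr dependency. This swaps in a different function from the one used above; the names acc, nxt and alternating are made up for illustration.

```r
# Start from 0 and negate the running value nine times.
# The second argument of the function (the next element) is deliberately ignored.
alternating <- Reduce(function(acc, nxt) !acc, seq_len(9), 0, accumulate = TRUE)

# Coerce explicitly so the result is numeric rather than a mix of 0 and logicals
alternating <- as.numeric(alternating)
alternating
```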
First, make a list of the given numbers, then apply unlist() to the list to convert it into a vector, as shown in the code below:
my_list = list(0, 1, 0, 1, 0, 1, 0, 1, 0, 1)
v = unlist(my_list)
print(v)
[1] 0 1 0 1 0 1 0 1 0 1

How to randomly replace a value

I have a vector in which I want to randomly replace every 2 with 0 or 1, with a probability of 0.4 of drawing 1. I have used the code below. I expected each replaced 2 to get its own value (0 or 1), but every 2 is replaced by the same value, either all 0 or all 1.
vec<-c(rep(2,18),1,0)
ifelse(vec==2, rbinom(1,1,0.40), vec)
Here is one output
[1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
and another output
[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0
When you go into the source code of ifelse via typing View(ifelse), you will see a piece of code showing that
len <- length(ans)
ypos <- which(test)
npos <- which(!test)
if (length(ypos) > 0L)
ans[ypos] <- rep(yes, length.out = len)[ypos]
if (length(npos) > 0L)
ans[npos] <- rep(no, length.out = len)[npos]
ans
That means, once you have one single value for yes or no in ifelse, that single value is repeated len times and placed to the corresponding logical positions.
In your case, rbinom(1,1,0.40) yields just a single value for yes, so that one realization is repeated for every position where the test is TRUE.
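A small sketch of that behaviour, using the vec from the question. The seed and the names single_draw, n_distinct_single and per_element are arbitrary choices made here for illustration.

```r
set.seed(1)  # arbitrary seed, only so that reruns give the same draws
vec <- c(rep(2, 18), 1, 0)

# rbinom(1, ...) is evaluated once, so every 2 receives the same value
single_draw <- ifelse(vec == 2, rbinom(1, 1, 0.40), vec)
n_distinct_single <- length(unique(single_draw[vec == 2]))  # always 1

# Drawing one value per element of vec gives each 2 its own draw
per_element <- ifelse(vec == 2, rbinom(length(vec), 1, 0.40), vec)

n_distinct_single
per_element
```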
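One workaround is shown below. Note that ifelse recycles yes to the length of vec and then picks the elements at the positions where the test is TRUE, so this gives independent draws here only because all the 2s come first; rbinom(length(vec), 1, 0.40) is safer in general.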
> ifelse(vec == 2, rbinom(sum(vec == 2), 1, 0.40), vec)
[1] 1 1 0 0 0 0 1 0 0 0 1 0 1 0 0 1 1 1 1 0
This replaces all 2 values with either 0 or 1
vec[vec == 2] <- rbinom(sum(vec == 2), 1, prob = .4)
If you draw a 0 and want the value to remain 2 then you could use sample, which would be equivalent to a binomial draw:
vec[vec == 2] <- sample(c(1, 2), sum(vec == 2), prob = c(0.4, 0.6), replace = TRUE)
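Try the following code, which replaces every second element (positions 2, 4, ...) rather than every value equal to 2: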
#Code
vec<-c(rep(2,18),1,0)
vec2 <- unlist(lapply(seq(2,length(vec),by=2), function(x) {vec[x] <- rbinom(1,1,0.40)}))
vec[seq(2,length(vec),by=2)] <-vec2
Output:
vec
[1] 2 0 2 0 2 1 2 0 2 0 2 0 2 1 2 0 2 0 1 1

Loop for Correlation Item-Score without containing Item

I have a huge dataset and I want to compute the correlation of each item with the total score of the scale, with that item excluded from the total. I could do this separately for each item, but I am trying to write a loop to make it easier.
Example dataset:
dat <- read.table(header=TRUE, text="
ItemX1 ItemX2 ItemX3 ItemX4 ItemX5 ItemX6 ItemY1 ItemY2 ItemY3 ItemY4 ItemY5 ItemY6
1 1 0 1 0 1 1 1 0 1 0 1
0 1 0 0 0 1 0 1 0 0 0 1
1 1 0 1 0 1 1 1 0 1 0 0
1 0 1 0 0 1 1 0 1 0 0 1
1 1 0 1 1 1 1 1 0 1 1 0
0 0 1 1 0 0 0 0 1 1 0 0
")
library(dplyr)  # provides select() and starts_with()
xscore <- rowSums(select(dat, starts_with("ItemX")))
Now I could do it like the following, but as I have 107 Items it is a bit much.
cor(dat$ItemX1,rowSums(select(dat, starts_with("ItemX") & -"ItemX1")),use="pairwise.complete.obs")
cor(dat$ItemX2,rowSums(select(dat, starts_with("ItemX") & -"ItemX2")),use="pairwise.complete.obs")
cor(dat$ItemX3,rowSums(select(dat, starts_with("ItemX") & -"ItemX3")),use="pairwise.complete.obs")
cor(dat$ItemX4,rowSums(select(dat, starts_with("ItemX") & -"ItemX4")),use="pairwise.complete.obs")
cor(dat$ItemX5,rowSums(select(dat, starts_with("ItemX") & -"ItemX5")),use="pairwise.complete.obs")
cor(dat$ItemX6,rowSums(select(dat, starts_with("ItemX") & -"ItemX6")),use="pairwise.complete.obs")
That's why I'm trying out the following loop, but now I don't know how to specify that the rowSums is calculated without the item which is in use for the correlation.
variables <- names(dat)
names.item <- c(grep("ItemX", variables, value = TRUE))
item.diff.p <- data.frame(matrix(NA, ncol=2, nrow=(length(names.item)-1)))
names(item.diff.p) <- c("Item", "cor")
length(names.item)
for(i in 1:(length(names.item))-1){
item <- names.item[i]
par <- cor(dat[,names(dat)[grepl("ItemX",names(dat))]],
rowSums(select(dat, starts_with("ItemX"))),use="pairwise.complete.obs")
item.diff.p[i, c("cor")]
}
par
Thank you all!
You can iterate through the columns of a subsetted dataframe, and calculate:
X_dat = dat[,grep("^ItemX",colnames(dat))]
res = sapply(1:ncol(X_dat),function(i){
cor(X_dat[,i],rowSums(X_dat[,-i]),use="p")
})
names(res) = colnames(X_dat)
res
ItemX1 ItemX2 ItemX3 ItemX4 ItemX5 ItemX6
0.6324555 0.1250000 -0.7500000 0.1250000 0.4152274 0.2335497
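An equivalent way to get the same item-rest correlations is to compute the total once and subtract each item from it. This is a sketch under the assumption that the X items contain no missing values (with NAs, the row totals would need extra care); the names x_items, total and item_rest are made up here, and only the ItemX columns of the example data are reproduced.

```r
# ItemX columns copied from the example dataset in the question
dat <- read.table(header = TRUE, text = "
ItemX1 ItemX2 ItemX3 ItemX4 ItemX5 ItemX6
1 1 0 1 0 1
0 1 0 0 0 1
1 1 0 1 0 1
1 0 1 0 0 1
1 1 0 1 1 1
0 0 1 1 0 0
")

x_items <- dat[, grep("^ItemX", names(dat)), drop = FALSE]
total <- rowSums(x_items)

# Correlate each item with the total score minus that item
item_rest <- vapply(
  x_items,
  function(item) cor(item, total - item, use = "pairwise.complete.obs"),
  numeric(1)
)
item_rest
```

vapply() keeps the column names, so no separate names() assignment is needed.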

Find a numeric pattern R

I would like to find the pattern of a 0 or 1 followed by a 2, occurring three or more times in a row. I would like to find this pattern and transform the 2s in it into 1s, for example:
Input:
Y <- c(0,1,0,3,2,5,2,1,2,0,2,1,2,0,1,2,1,3,1,2,1)
Some Function findPattern that finds the pattern:
findPattern(Y)
And Outputs the following:
[1] 0 0 0 0 0 0 0 0 1 0 1 0 1 0 0 0 0 0 0 0 0
I have tried the following:
as.numeric(Y == 2 & lead(Y) %in% 1:2)
1. Find 0/1 followed by 2s
findPattern<-function(Y){
as.numeric(Y==2 & (c(NA,Y[-length(Y)])==0 |c(NA,Y[-length(Y)])==1 ))
}
I add an NA at the start and remove the last item so that the vector is "shifted" by one position but keeps the same length. This way you avoid for loops.
If you want to use %in%, which avoids a second pass:
findPattern<-function(Y){
as.numeric(Y==2 & (c(NA,Y[-length(Y)]) %in% c(0,1)))
}
2. Select only those that have at least three 1s every other position
findPattern<-function(Y){
w <- which(Y==2 & (c(NA,Y[-length(Y)]) %in% c(0,1)))
centers<- w[((w - 2) %in% w) & ((w+2) %in% w)]
result<-rep(0, times = length(Y))
result[c(centers,centers-2,centers+2)]<-1
return(result)
}
Testing:
findPattern(c(0,1,0,3,2,5,2,1,2,0,2,1,2,0,1,2,1,3,1,2,1))
[1] 0 0 0 0 0 0 0 0 1 0 1 0 1 0 0 0 0 0 0 0 0
Here is a possible approach that uses a regular expression to find the pattern. It relies on lag() from dplyr.
Starting vector:
> Y
[1] 0 2 0 3 2 5 2 1 2 0 2 1 2 0 1
1) Find out all the 2s preceded by 0 or 1;
> ind <- as.integer(lag(Y %in% c(0, 1)) & (Y == 2) )
> ind
[1] 0 1 0 0 0 0 0 0 1 0 1 0 1 0 0
2) Paste the resulting vector into a string and use a regular expression to find the location and length of the required pattern, i.e., 0 and 1 alternating three or more times;
> id <- gregexpr("(01){3,}", paste0(ind, collapse = ""))
> id
[[1]]
[1] 8
attr(,"match.length")
[1] 6
attr(,"useBytes")
[1] TRUE
3) Extract the location and length from the regular expression result and convert them into a vector of indices;
> start <- as.numeric(id[[1]])
> end <- start + attr(id[[1]], "match.length") - 1
> indArray <- unlist(Map(`:`, start, end))
> indArray
[1] 8 9 10 11 12 13
4) Set every 1 that lies outside the matched stretch to 0, which discards 01 pairs repeated fewer than three times;
> ind[-indArray] <- 0
> ind
[1] 0 0 0 0 0 0 0 0 1 0 1 0 1 0 0
Wrap them into a function:
library(dplyr)
findPattern <- function(Y) {
ind <- as.integer(lag(Y %in% c(0, 1)) & (Y == 2) )
id <- gregexpr("(01){3,}", paste0(ind, collapse = ""))
start <- as.numeric(id[[1]])
end <- start + attr(id[[1]], "match.length") - 1
indArray <- unlist(Map(`:`, start, end))
ind[-indArray] <- 0
ind
}
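The four steps above can also be sketched in base R only, without dplyr. When gregexpr() finds no match it returns -1, and this version handles that case explicitly. The function name find_pattern_base and its internal names are hypothetical, and the sketch assumes Y contains no NA values.

```r
find_pattern_base <- function(Y) {
  # Step 1: flag every 2 whose predecessor is 0 or 1
  prev <- c(NA, Y[-length(Y)])
  ind <- as.integer(Y == 2 & prev %in% c(0, 1))

  # Step 2: look for "01" repeated at least three times in the flag string
  m <- gregexpr("(01){3,}", paste0(ind, collapse = ""))[[1]]

  # Step 3: mark the positions covered by each match (none if m is -1)
  keep <- rep(FALSE, length(Y))
  if (m[1] != -1) {
    len <- attr(m, "match.length")
    for (k in seq_along(m)) {
      keep[m[k]:(m[k] + len[k] - 1)] <- TRUE
    }
  }

  # Step 4: drop flags that are not part of a long enough run
  ind[!keep] <- 0L
  ind
}

find_pattern_base(c(0,1,0,3,2,5,2,1,2,0,2,1,2,0,1,2,1,3,1,2,1))
```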
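Using the stringi package (note that this marks every 2 from the third 2 onward and does not check the value that precedes it):
library(stringi)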
Y <- c(0,1,0,3,2,5,2,1,2,0,2,1,2,0,1)
matchVec = stri_count(Y,fixed=2)
remapVec = as.integer(matchVec & (cumsum(matchVec)>=3))
remapVec
#[1] 0 0 0 0 0 0 0 0 1 0 1 0 1 0 0

Convert a factor column to multiple boolean columns

Given data that looks like:
library(data.table)
DT <- data.table(x=rep(1:5, 2))
I would like to split this data into 5 boolean columns that indicate the presence of each number.
I can do this like this:
new.names <- sort(unique(DT$x))
DT[, paste0('col', new.names) := lapply(new.names, function(i) DT$x==i), with=FALSE]
But this uses a pesky lapply, which is probably slower than the data.table alternative, and this solution strikes me as not very "data.table-ish".
Is there a better and/or faster way to create these new columns?
How about model.matrix?
model.matrix(~factor(x)-1,data=DT)
factor(x)1 factor(x)2 factor(x)3 factor(x)4 factor(x)5
1 1 0 0 0 0
2 0 1 0 0 0
3 0 0 1 0 0
4 0 0 0 1 0
5 0 0 0 0 1
6 1 0 0 0 0
7 0 1 0 0 0
8 0 0 1 0 0
9 0 0 0 1 0
10 0 0 0 0 1
attr(,"assign")
[1] 1 1 1 1 1
attr(,"contrasts")
attr(,"contrasts")$`factor(x)`
[1] "contr.treatment"
Apparently, you can put model.matrix into [.data.table to give the same results. Not sure if it would be faster:
DT[,model.matrix(~factor(x)-1)]
There is also nnet::class.ind
library(nnet)
cbind(DT, setnames(as.data.table(DT[, class.ind(x)]),paste0('col', sort(unique(DT$x)))))
Another option is to cast the table from long to wide with dcast:
library(data.table)
DT <- data.table(x=rep(1:5, 2))
# add column with id
DT[, id := seq.int(nrow(DT))]
# cast long table into wide
DT.wide <- dcast(DT, id ~ x, value.var = "x", fill = 0, fun.aggregate = function(x) 1)
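For comparison, here is a minimal base R sketch that builds the indicator columns with outer(), comparing every value of x against every distinct level. It works on a plain vector rather than a data.table, and the names lv and ind are made up for illustration.

```r
x <- rep(1:5, 2)

# Distinct values, sorted so the columns come out in a predictable order
lv <- sort(unique(x))

# Logical matrix: one row per element of x, one column per level
ind <- outer(x, lv, "==")
colnames(ind) <- paste0("col", lv)

ind
```

The resulting matrix can be bound to the table with cbind() if the columns are needed alongside x.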

Resources