How many times does each run of 1s occur in a vector - r

I have a problem.
I have a vector that consists of 0s and 1s, for example (011011111011100001111). In R, I need to count how many times a run of two 1s, three 1s, four 1s, and so on appears in the vector. In this example vector, 11 appears once, 111 once, 1111 once, and 11111 once.
Thanks a lot, Peter

I'm assuming you have an actual vector like c(0, 1, 1, 0...).
Here is a solution using table and rle. I've also provided some longer sample data to make it a bit more interesting.
set.seed(1)
myvec <- sample(c(0, 1), 100, replace = TRUE)
temp <- rle(myvec)
table(temp$lengths[temp$values == 1])
#
# 1 2 3 4 6
# 15 8 1 2 1
If, indeed, you are dealing with a crazy-long character string of ones and zeroes, just use strsplit and follow the same logic as above.
myvec <- "00110111100010101101101000001001001110101111110011010000011010001001"
myvec <- as.numeric(strsplit(myvec, "")[[1]])
Here, I've converted to numeric, but that's just so you can use the same code as earlier. You can use rle on a character vector too.
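To make the rle step concrete, here is a minimal sketch applied to the example string from the question. The variable names (x, bits, runs, one_runs, counts) are only illustrative, and the string is left as characters rather than converted to numeric.

```r
# The example string from the question
x <- "011011111011100001111"

# Split into single characters: "0" "1" "1" "0" ...
bits <- strsplit(x, "")[[1]]

# rle() returns run lengths and the value that each run repeats
runs <- rle(bits)

# Keep only the lengths of runs made of "1"
one_runs <- runs$lengths[runs$values == "1"]

# Tabulate how often each run length occurs
counts <- table(one_runs)
print(counts)
```

For this string the runs of 1s have lengths 2, 5, 3 and 4, so each of those lengths is counted once, which matches what the question expects.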

rle is your friend:
vec <-c(0,1,1,0,1,1,1,1,1,0,1,1,1,0,0,0,0,1,1,1,1)
res <-data.frame(table(rle(vec)))
res[res$values==1,]
   lengths values Freq
6        1      1    0
7        2      1    1
8        3      1    1
9        4      1    1
10       5      1    1

Related

How to repeat the indices of a vector based on the values of that same vector?

Given a random integer vector below:
z <- c(3, 2, 4, 2, 1)
I'd like to create a new vector that contains each of z's indices, repeated the number of times given by the corresponding element of z. To illustrate, the desired result in this case would be:
[1] 1 1 1 2 2 3 3 3 3 4 4 5
There must be a simple way to do this.
You can use rep and seq to repeat the indices of a vector based on the values of that same vector: seq gets the indices and rep repeats them.
rep(seq(z), z)
# [1] 1 1 1 2 2 3 3 3 3 4 4 5
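One caveat worth sketching: seq(z) treats a length-one z as an upper bound rather than as a vector to index, so seq_along(z) is the safer way to get the indices. The names z1, bad and good below are only illustrative.

```r
z <- c(3, 2, 4, 2, 1)

# seq_along() always returns 1..length(z), whatever z contains
idx <- rep(seq_along(z), times = z)
print(idx)

# Edge case: a vector with a single element
z1 <- 5
bad  <- rep(seq(z1), z1)        # seq(5) is 1:5, so this repeats 1:5 five times
good <- rep(seq_along(z1), z1)  # the only index, 1, repeated five times
print(length(bad))
print(good)
```

For vectors with more than one element, both forms give the same result.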
Start with all the indices of the vector z, which are given by:
1:length(z)
Then repeat each of these indices. The number of repetitions is given by the values of z. This can be done by combining lapply or sapply with rep:
unlist(lapply(X = 1:length(z), FUN = function(x) rep(x = x, times = z[x])))
[1] 1 1 1 2 2 3 3 3 3 4 4 5
unlist(sapply(X = 1:length(z), FUN = function(x) rep(x = x, times = z[x])))
[1] 1 1 1 2 2 3 3 3 3 4 4 5
Both alternatives give the same result.

Optimization of for loop in R

I'm trying to optimize a for loop in my R code.
Summary:
I have a data table with 47 million rows and 4 columns (the column count is 'nvars' in the code).
I want to compare the values within each row across columns and, if any two are equal, set a delete flag to 1, otherwise 0.
I need to delete every row in which at least two of the 4 column values are equal. (Values are numeric in all columns, e.g. 1, 2, 3, ...)
I tried optimising with vectorisation, but it still takes about 1.5 hours.
Can this be optimised further?
test2 <- as.data.table(test2)
delete_output <- numeric(nrow(test2))
for (i in 1:nrow(test2)){
  for (j in 1:(nvars-1)){
    k=j+1
    if (test2[i,..j] == test2[i,..k]){
      delete_output[i] <- 1
      next
    }
  }
}
If any two values in a particular row are equal, the delete flag should be set to 1.
My file should look like the one in the image. This is an example with 3 input variables and the corresponding output variable (delete). If V1, V2 and V3 are all distinct in a particular row, the delete flag is 0, otherwise 1.
We can use apply (but I fear it might not be fast enough) and check for any duplicated value.
df$delete <- +(apply(df, 1, function(x) any(duplicated(x))))
df
#   V1 V2 V3 V4 delete
#1   3  3  3  1      1
#2   1  4  4  3      1
#3   2  2  1  4      1
#4   2  2  3  3      1
#5   2  4  4  2      1
#6   1  3  2  4      0
#7   1  1  1  3      1
#8   4  2  1  1      1
#9   3  4  2  2      1
#10  1  2  2  4      1
data
set.seed(1432)
df <- as.data.frame(matrix(sample(1:4, 40, replace = TRUE), ncol = 4))
You can do:
set.seed(1432)
test2 <- as.data.frame(matrix(sample(1:4, 40, replace = TRUE), ncol = 4))
test2
test2[apply(test2, 1, function(x) all(table(x)==1)), ]
This will select only those rows, in which all elements are unique.
If you need the extra column you can do:
set.seed(1432)
test2 <- as.data.frame(matrix(sample(1:4, 40, replace = TRUE), ncol = 4))
test2
test2$delete <- !apply(test2, 1, function(x) all(table(x)==1))
test2
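Since apply works row by row, it may still be slow on 47 million rows. A possible alternative, sketched here under the assumption that the columns contain no NA values, is to compare whole columns pairwise so that every comparison is vectorised. The names col_pairs, has_dup and ref are only illustrative, and this has not been timed on data of that size.

```r
set.seed(1432)
test2 <- as.data.frame(matrix(sample(1:4, 40, replace = TRUE), ncol = 4))

# All unordered pairs of column positions: (1,2), (1,3), ..., (3,4)
col_pairs <- combn(ncol(test2), 2, simplify = FALSE)

# One vectorised comparison per pair of columns, OR-ed together across pairs
has_dup <- Reduce(`|`, lapply(col_pairs, function(p) test2[[p[1]]] == test2[[p[2]]]))

# Reference result from the row-wise apply approach, computed for comparison
ref <- apply(test2, 1, function(x) any(duplicated(x)))

# Flag rows to delete (1) or keep (0)
test2$delete <- as.integer(has_dup)
print(test2)
```

With 4 columns this needs only 6 column comparisons regardless of the number of rows. Rows to keep can then be selected with test2[test2$delete == 0, ].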

Count the maximum of consecutive letters in a string

I have this vector:
vector <- c("XXXX-X-X", "---X-X-X", "--X---XX", "--X-X--X", "-X---XX-", "-X--X--X", "X-----XX", "X----X-X", "X---XX--", "XX--X---", "---X-XXX", "--X-XX-X")
I want to find the maximum number of consecutive X characters in each string. So, my expected vector would be:
4, 1, 2, 1, 2, 1, 2, 1, 2, 2, 3, 2
In base R, we can split each string into separate characters and then use rle to find the maximum run length for "X".
sapply(strsplit(vector, ""), function(x) {
  inds <- rle(x)
  max(inds$lengths[inds$values == "X"])
})
#[1] 4 1 2 1 2 1 2 1 2 2 3 2
Here is a slightly different approach. We can split each term in the input vector on any number of dashes. Then, find the substring with the greatest length.
sapply(vector, function(x) {
  max(nchar(unlist(strsplit(x, "-+"))))
})
XXXX-X-X ---X-X-X --X---XX --X-X--X -X---XX- -X--X--X X-----XX X----X-X
       4        1        2        1        2        1        2        1
X---XX-- XX--X--- ---X-XXX --X-XX-X
       2        2        3        2
I suspect that X really just represents any non-dash character, so we don't need to check for it explicitly. If you really do want to count only X, then we can try removing all non-X characters before we count:
sapply(vector, function(x) {
  max(nchar(gsub("[^X]", "", unlist(strsplit(x, "-+")))))
})
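Note that stripping non-X characters from each dash-separated piece joins X runs that were separated by some other letter, so a piece like XXYXX would be counted as 4. If that matters, a base R sketch that extracts only the runs of X might look like the following. The names runs and max_run are illustrative, and returning 0 for strings without any X is an assumption about the desired behaviour.

```r
vector <- c("XXXX-X-X", "---X-X-X", "--X---XX", "--X-X--X", "-X---XX-", "-X--X--X",
            "X-----XX", "X----X-X", "X---XX--", "XX--X---", "---X-XXX", "--X-XX-X")

# Extract every maximal run of X from each string
runs <- regmatches(vector, gregexpr("X+", vector))

# Longest run per string; 0 when a string contains no X at all
max_run <- vapply(runs, function(r) if (length(r)) max(nchar(r)) else 0L, integer(1))
print(max_run)
# [1] 4 1 2 1 2 1 2 1 2 2 3 2
```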
Use strapply from the gsubfn package to extract the runs of X, applying nchar to each match to count its characters. This produces a list of vectors of run lengths; then sapply the max function over each such vector.
library(gsubfn)
sapply(strapply(vector, "X+", nchar), max)
## [1] 4 1 2 1 2 1 2 1 2 2 3 2
Here are a couple of tidyverse alternatives:
library(purrr)
library(stringr)
map_dbl(vector, ~sum(str_detect(., strrep("X", 1:8))))
# [1] 4 1 2 1 2 1 2 1 2 2 3 2
map_dbl(strsplit(vector,"-"), ~max(nchar(.)))
# [1] 4 1 2 1 2 1 2 1 2 2 3 2

Assigning values to consecutive series in R

I hope you can help me with this issue.
I have a big data frame; simplified, it looks like this:
df <- data.frame(radius = c (2,3,5,7,4,6,9,8,3,7,8,9,2,4,5,2,6,7,8,9,1,10,8))
df$num <- c(1,2,3,4,5,6,7,8,9,10,11,1,12,13,1,14,15,16,17,18,19,1,1)
df
The column num contains consecutive series (1-11, 1, 12-13, 1, 14-19, 1, 1).
I would like to assign a sequential group number to each consecutive series, as a new column. The outcome should look like this:
df$outcome <- c(1,1,1,1,1,1,1,1,1,1,1,2,3,3,4,5,5,5,5,5,5,6,7)
df
thanks a lot!
A.
We can take the difference between adjacent elements of 'num' with diff and check whether it is not equal to 1. The resulting logical vector is one element shorter than 'num', so we prepend TRUE and then take the cumsum to get the expected output.
df$outcome <- cumsum(c(TRUE,diff(df$num)!=1))
df$outcome
#[1] 1 1 1 1 1 1 1 1 1 1 1 2 3 3 4 5 5 5 5 5 5 6 7
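The intermediate steps are easier to follow on a shorter vector. This is a small illustrative sketch; the vector num and the names step, new_series and outcome are made up for the example.

```r
num <- c(1, 2, 3, 1, 4, 5, 1, 1)

# Difference between each element and the one before it
step <- diff(num)                  # 1  1 -2  3  1 -4  0

# A new series starts wherever the step is not exactly 1;
# the first element always starts a series, hence the leading TRUE
new_series <- c(TRUE, step != 1)   # TRUE FALSE FALSE TRUE TRUE FALSE TRUE TRUE

# Counting the starts seen so far gives the series number
outcome <- cumsum(new_series)      # 1 1 1 2 3 3 4 5
print(outcome)
```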

Operations on data frame with a variable containing vectors

When I create a dataframe for which one variable contains a vector of integers, like
id <- 1:5
meas <- list(NA,c(1,2),c(1),c(1,2,3),c(1,2,3,4))
myDf <- data.frame(cbind(id,meas))
I can easily copy the vector into another variable or check if it contains a NA
myDf$copyMeas <- myDf$meas
myDf$naMeas <- is.na(myDf$meas)
but when I want to get the length of the vectors I get the number of observations in the data frame
myDf$lengthMeas <- length(myDf$meas)
  id       meas   copyMeas naMeas lengthMeas
1  1         NA         NA   TRUE          5
2  2       1, 2       1, 2  FALSE          5
3  3          1          1  FALSE          5
4  4    1, 2, 3    1, 2, 3  FALSE          5
5  5 1, 2, 3, 4 1, 2, 3, 4  FALSE          5
Why does this happen? What should I use when I want the length of each vector in another variable?
Because that column is a list. If you ask for the length of a list, you'll get how many elements it has. You seem to want the length of each element:
sapply(myDf$meas,length)
[1] 1 2 1 3 4
This does the trick:
sapply(myDf$meas, length)
[1] 1 2 1 3 4
length is not vectorized; it returns the length of the object you pass to it. With sapply, length is applied to each entry in myDf$meas.
Have a look at
str(myDf)
and you will see that myDf$meas is still a list. Accordingly, the result of length(myDf$meas) is the length of this list, which is 5.
You are looking for
myDf$lengthMeas <- sapply(myDf$meas, length)
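Base R also provides lengths() (available since R 3.2.0), which returns the length of each element of a list and can stand in for sapply(x, length). The sketch below builds the list column by direct assignment rather than with cbind, which is an assumption made to keep the example simple.

```r
id <- 1:5
meas <- list(NA, c(1, 2), c(1), c(1, 2, 3), c(1, 2, 3, 4))

# Build the data frame, then attach the list as a list column
myDf <- data.frame(id = id)
myDf$meas <- meas

# length() sees one list with 5 elements
print(length(myDf$meas))

# lengths() returns the length of each element
myDf$lengthMeas <- lengths(myDf$meas)
print(myDf$lengthMeas)
# [1] 1 2 1 3 4
```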
