If I have a vector like
"a": 0 0 1 1 1 0 0 0 0 1 1 0 0 0
How can I generate a vector of the same length containing the count of consecutive elements, like so:
"b": 2 2 3 3 3 4 4 4 4 2 2 3 3 3
I tried rle, but I did not manage to stretch it out this way.
Another option using rle and rep
with(rle(a), rep(lengths, times = lengths))
# [1] 2 2 3 3 3 4 4 4 4 2 2 3 3 3
data
a <- c(0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0)
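To see why this stretches the run lengths back out to the original length, it can help to look at the two pieces that rle returns. A minimal sketch in base R (the names r and b are introduced only for this illustration):

```r
a <- c(0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0)

# rle() compresses the vector into runs: one length and one value per run
r <- rle(a)
r$lengths  # 2 3 4 2 3
r$values   # 0 1 0 1 0

# Repeating each run length by itself yields one count per original element
b <- rep(r$lengths, times = r$lengths)
b
```

Because the run lengths sum to length(a), the result always has the same length as the input.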
Create a grouping variable using diff and use it in ave to calculate the length of each group.
ave(x, cumsum(c(0, diff(x) != 0)), FUN = length)
# [1] 2 2 3 3 3 4 4 4 4 2 2 3 3 3
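The grouping variable is the key step here, so a small sketch that prints it separately may help (grp and res are names introduced only for this illustration):

```r
x <- c(0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0)

# diff(x) != 0 is TRUE wherever the value changes; cumsum() turns those
# change points into a run id that increases by 1 at every change
grp <- cumsum(c(0, diff(x) != 0))
grp
# [1] 0 0 1 1 1 2 2 2 2 3 3 4 4 4

# ave() then replaces every element by the size of its group
res <- ave(x, grp, FUN = length)
res
```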
You can do the same with dplyr lag
library(dplyr)
ave(x,cumsum(x != lag(x, default = FALSE)), FUN = length)
#[1] 2 2 3 3 3 4 4 4 4 2 2 3 3 3
And for completeness with data.table rleid
library(data.table)
ave(x, rleid(x), FUN = length)
#[1] 2 2 3 3 3 4 4 4 4 2 2 3 3 3
data
x <- c(0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0)
Here is another solution using vapply
count_consec <- function (a) {
# creating output vector out
out <- integer(length(a))
# consecutive differences
diffs <- which(diff(a) != 0)
# returning 0 just in order to have a return statement in vapply - you can return anything else
vapply(1:(length(diffs)+1), function (k) {
if (k == 1) {
out[1:diffs[1]] <<- diffs[1]
return (0L)
}
if (k == length(diffs)+1) {
out[(diffs[k-1]+1):length(out)] <<- length(out) - diffs[k-1]
return (0L)
}
out[(diffs[k-1]+1):diffs[k]] <<- diffs[k] - diffs[k-1]
return (0L)
}, integer(1))
out
}
count_consec(a)
# [1] 2 2 3 3 3 4 4 4 4 2 2 3 3 3
with the data a <- as.integer(unlist(strsplit('0 0 1 1 1 0 0 0 0 1 1 0 0 0', ' ')))
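The same boundary idea can probably be written without vapply and <<-. This is only a sketch, and count_consec2 is a hypothetical name, not part of the answer above. As far as I can tell it also covers a vector with no change at all, where which(diff(a) != 0) is empty:

```r
count_consec2 <- function(a) {
  # positions where a run ends: every change point, plus the last element
  ends <- c(which(diff(a) != 0), length(a))
  # run lengths are the gaps between consecutive run ends
  run_len <- diff(c(0L, ends))
  # stretch each run length over its own run
  rep(run_len, times = run_len)
}

a <- as.integer(unlist(strsplit('0 0 1 1 1 0 0 0 0 1 1 0 0 0', ' ')))
count_consec2(a)
# [1] 2 2 3 3 3 4 4 4 4 2 2 3 3 3
```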
Good afternoon, friends!
I'm currently performing some calculations in R (df is displayed below). My goal is to display in a new column the first non-null value from selected cells for each row.
My df is:
MD <- c(100, 200, 300, 400, 500)
liv <- c(0, 0, 1, 3, 4)
liv2 <- c(6, 2, 0, 4, 5)
liv3 <- c(1, 1, 1, 1, 1)
liv4 <- c(1, 0, 0, 3, 5)
liv5 <- c(0, 2, 7, 9, 10)
df <- data.frame(MD, liv, liv2, liv3, liv4, liv5)
I want to display (in a column called "liv6") the first non-null (here: non-zero) value from the 5 cells of each row. For the first row, liv = 0, liv2 = 6, liv3 = 1, liv4 = 1 and liv5 = 0, so the result should be 6. This calculation should be repeated for each row in my data frame.
I do know how to do this in Python, but not in R.
Any help is highly appreciated!
One option with dplyr could be:
library(dplyr)
df %>%
rowwise() %>%
mutate(liv6 = with(rle(c_across(liv:liv5)), values[which.max(values != 0)]))
MD liv liv2 liv3 liv4 liv5 liv6
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 100 0 6 1 1 0 6
2 200 0 2 1 0 2 2
3 300 1 0 1 0 7 1
4 400 3 4 1 3 9 3
5 500 4 5 1 5 10 4
A Base R solution:
df$liv6 <- apply(df[-1], 1, function(x) x[min(which(x != 0))])
output
df
MD liv liv2 liv3 liv4 liv5 liv6
1 100 0 6 1 1 0 6
2 200 0 2 1 0 2 2
3 300 1 0 1 0 7 1
4 400 3 4 1 3 9 3
5 500 4 5 1 5 10 4
A simple base R option is to apply across relevant columns (I exclude MD here, you can use any data frame subsetting style you want), then just take the first value of the non-zero values of that row.
df$liv6 <- apply(df[-1], 1, \(x) head(x[x > 0], 1))
df
#> MD liv liv2 liv3 liv4 liv5 liv6
#> 1 100 0 6 1 1 0 6
#> 2 200 0 2 1 0 2 2
#> 3 300 1 0 1 0 7 1
#> 4 400 3 4 1 3 9 3
#> 5 500 4 5 1 5 10 4
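It is not obvious what either base R version should return when a row contains only zeros. A hedged sketch that returns NA for such rows (first_nonzero_or_na is a hypothetical helper, and the small data frame is made up for this illustration):

```r
# Made-up example data: the second row has no non-zero value
df2 <- data.frame(MD   = c(100, 200, 300),
                  liv  = c(0, 0, 1),
                  liv2 = c(6, 0, 0),
                  liv3 = c(1, 0, 1))

first_nonzero_or_na <- function(x) {
  idx <- which(x != 0)          # positions of the non-zero values
  if (length(idx) > 0) x[[idx[1]]] else NA_real_
}

df2$liv6 <- apply(df2[-1], 1, first_nonzero_or_na)
df2$liv6
# [1]  6 NA  1
```

Returning a value of the same type in both branches keeps the result a plain numeric vector, so it can be assigned directly as a column.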
One approach is to use purrr::detect to detect the first non-zero element of each row.
We define a function which takes a numeric vector (row) and returns a boolean indicating whether each element is non-zero:
is_nonzero <- function(x) x != 0
We use this function to detect the first non-zero element in each row via purrr::detect:
first_nonzero <- apply(dplyr::select(df, liv:liv5), 1, function(x) {
purrr::detect(x, is_nonzero, .dir = "forward")
})
We finally create the new column:
df$liv6 <- first_nonzero
As a result, we have
> df
MD liv liv2 liv3 liv4 liv5 liv6
100 0 6 1 1 0 6
200 0 2 1 0 2 2
300 1 0 1 0 7 1
400 3 4 1 3 9 3
500 4 5 1 5 10 4
Another straightforward solution is:
Reduce(function(x, y) ifelse(!x, y, x), df[, -1])
#[1] 6 2 1 3 4
This way should be very efficient, since we "scan" by column, as, presumably, the data have much fewer columns than rows.
The Reduce approach is a more functional form of a simple, old-school, loop:
ans = df[, 2]
for(j in 3:ncol(df)) {
i = !ans
ans[i] = df[i, j]
}
ans
#[1] 6 2 1 3 4
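To see how the answer is filled in column by column, Reduce can be asked to keep its intermediate results. A small sketch (steps is a name introduced only for this illustration):

```r
df <- data.frame(
  MD   = c(100, 200, 300, 400, 500),
  liv  = c(0, 0, 1, 3, 4),
  liv2 = c(6, 2, 0, 4, 5),
  liv3 = c(1, 1, 1, 1, 1),
  liv4 = c(1, 0, 0, 3, 5),
  liv5 = c(0, 2, 7, 9, 10)
)

# accumulate = TRUE keeps the running result after each column
steps <- Reduce(function(x, y) ifelse(!x, y, x), df[, -1], accumulate = TRUE)
steps[[1]]              # after liv:  0 0 1 3 4
steps[[2]]              # after liv2: 6 2 1 3 4 (zeros filled from liv2)
steps[[length(steps)]]  # final answer, same as without accumulate
```

Once an element is non-zero it is never overwritten, which is why the first non-zero value of each row survives to the end.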
I have a series of 14 Boolean variables and I would like to find the top 3 combinations of 3 or more variables (where the value == 1).
Sample data:
df <- data.frame(ID = c(1, 2, 3, 4, 5, 6, 7, 8),
var1 = c(0, 0, 1, 1, 1, 0, 0, 1),
var2 = c(1, 0, 0, 1, 1, 1, 1, 0),
var3 = c(0, 0, 1, 1, 1, 1, 0, 0),
var4 = c(1, 1, 1, 1, 1, 0, 1, 1),
var5 = c(0, 0, 0, 1, 1, 0, 1, 1)
)
df
> df
ID var1 var2 var3 var4 var5
1 1 0 1 0 1 0
2 2 0 0 0 1 0
3 3 1 0 1 1 0
4 4 1 1 1 1 1
5 5 1 1 1 1 1
6 6 0 1 1 0 0
7 7 0 1 0 1 1
8 8 1 0 0 1 1
I found a solution to bring all column names together per unique occurrence:
# Assumed packages: reshape2 for melt(), plyr for ddply(), magrittr for %>%
library(reshape2)
library(plyr)
library(magrittr)
# Bring to long format
df_long <- df %>%
melt(id.vars = "ID")
# Collapse the variables that have a '1' together per row
df_combo <- ddply(df_long, "ID", summarize,
combos = paste(variable[value == 1], collapse = "/"))
> df_combo
ID combos
1 1 var2/var4
2 2 var4
3 3 var1/var3/var4
4 4 var1/var2/var3/var4/var5
5 5 var1/var2/var3/var4/var5
6 6 var2/var3
7 7 var2/var4/var5
8 8 var1/var4/var5
If I only wanted counts on unique combinations this would be fine, but I would like to know the number of times each combination of 3 or more variables occurs, even in cases where other variables also occur. The combination (var1/var4/var5) occurs 3 times in the above example, but twice it occurs next to two other variables.
There must be an easy way to extract this information, just can't think of it. Thank you for your help!!
An attempt, using combn as the workhorse function.
arr <- which(df[-1] == 1, arr.ind=TRUE)
tmp <- tapply(arr[,"col"], arr[,"row"],
FUN=function(x) if (length(x) >= 3) combn(x,3, simplify=FALSE) )
tmp <- data.frame(do.call(rbind, unlist(tmp, recursive = FALSE)))
aggregate(count ~ . , cbind(tmp, count=1), sum)
## X1 X2 X3 count
##1 1 2 3 2
##2 1 2 4 2
##3 1 3 4 3
##4 2 3 4 2
##5 1 2 5 2
##6 1 3 5 2
##7 2 3 5 2
##8 1 4 5 3
##9 2 4 5 3
##10 3 4 5 2
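In the output above, X1 to X3 are column positions within df[-1] (so 1 stands for var1, and so on). A possible variation that keeps the variable names and ranks the combinations is sketched below. It counts only combinations of exactly 3 variables, as the answer above does; vars, combos and counts are names introduced only for this illustration.

```r
df <- data.frame(ID = c(1, 2, 3, 4, 5, 6, 7, 8),
                 var1 = c(0, 0, 1, 1, 1, 0, 0, 1),
                 var2 = c(1, 0, 0, 1, 1, 1, 1, 0),
                 var3 = c(0, 0, 1, 1, 1, 1, 0, 0),
                 var4 = c(1, 1, 1, 1, 1, 0, 1, 1),
                 var5 = c(0, 0, 0, 1, 1, 0, 1, 1))
vars <- names(df)[-1]

# For each row, list every 3-variable combination among the variables set to 1
combos <- unlist(lapply(seq_len(nrow(df)), function(i) {
  on <- vars[df[i, -1] == 1]
  if (length(on) < 3) return(character(0))
  combn(on, 3, FUN = paste, collapse = "/")
}))

# Count each combination and show the three most frequent ones
counts <- sort(table(combos), decreasing = TRUE)
head(counts, 3)
```

With the sample data, three combinations occur 3 times (var1/var3/var4, var1/var4/var5 and var2/var4/var5) and all others occur twice.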
I have a vector of 1s and 0s. I would like to replace each 1 with the "spot" (index) of its run of 1s in the vector.
For example I would like to change
x = c(1, 1, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1)
to
1 1 1 0 2 2 2 0 0 3 0 4
The number of consecutive 1s and 0s can vary.
Here is one way to do it...
x = c(1, 1, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1)
x[x==1] <- cumsum(c(x[1], diff(x)) == 1)[x==1]
x
[1] 1 1 1 0 2 2 2 0 0 3 0 4
Another way with rle
x * with(rle(x), rep(cumsum(values), lengths))
#[1] 1 1 1 0 2 2 2 0 0 3 0 4
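We compute the run-length encoding of x, take the cumulative sum of its values (which increases by 1 at every run of 1s), repeat each cumulative sum by the corresponding run length, and multiply the result by x so that the 0s remain 0 and only the 1s are changed.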
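The intermediate values can be printed one at a time to follow that explanation. A small sketch (r and run_id are names introduced only for this illustration):

```r
x <- c(1, 1, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1)

r <- rle(x)
r$values          # 1 0 1 0 1 0 1  (one value per run)
cumsum(r$values)  # 1 1 2 2 3 3 4  (goes up by 1 at each run of 1s)

# Stretch the cumulative sums back to the original length
run_id <- rep(cumsum(r$values), r$lengths)
run_id
# [1] 1 1 1 1 2 2 2 2 2 3 3 4

# Multiplying by x zeroes out the positions that were 0 in the input
x * run_id
# [1] 1 1 1 0 2 2 2 0 0 3 0 4
```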
Here is an option with rle and inverse.rle
inverse.rle(within.list(rle(x), values[values==1] <- seq_along(values[values==1])))
#[1] 1 1 1 0 2 2 2 0 0 3 0 4
Or an option using rleid from data.table
library(data.table)
x1 <- rleid(x)
x1[x!= 0] <- rleid(x1[x!=0])
x1 * x
#[1] 1 1 1 0 2 2 2 0 0 3 0 4
vector A:
a = c(0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1)
vector B: (only used for initialization)
b = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)
Dataframe:
dft <- data.frame(a,b)
The following for-loop compares for each row "i" the value A[i] with A[i+1] in vector A.
If i+1 is different -> write "count"
else check i+2 and increment "count" ...
The idea is to know for each row, the number of rows until the value in A changes.
count = 0
# takes forever on a large data set, but does its job
for(i in 1:nrow(dft)) {
for(j in i+1:nrow(dft)-1) {
j_value <- dft[j,"a"]
i_value <- dft[i,"a"]
if (!is.na(j_value) & !is.na(i_value)){
tmp_value <- abs(i_value - j_value)
if(tmp_value > 0) {
dft[i,"b"] <- count
count = 0
break
} else {
count = count + 1
}
}
}
}
Results should be:
b
1: 5
2: 4
3: 3
4: 2
5: 1
6: 1
7: 2
8: 1
9: 3
10: 2
11: 1
12: 5
13: 4
14: 3
15: 2
16: 1
17: 0
The following should work:
b = rle(a)
unlist(mapply(":", b$lengths, 1))
# [1] 5 4 3 2 1 1 2 1 3 2 1 5 4 3 2 1 1
Or in one line:
with(rle(a), unlist(Map(":", lengths, 1)))
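A variation on the same idea that avoids building a list is sketched below, using sequence() on the reversed run lengths (b2 is a name introduced only for this illustration). Like the versions above, it gives 1 for the last element, whereas the expected result in the question shows 0 there.

```r
a <- c(0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1)

r <- rle(a)
# sequence() counts 1..n within each run; reversing the run lengths first
# and the result afterwards turns that into a countdown n..1 per run
b2 <- rev(sequence(rev(r$lengths)))
b2
# [1] 5 4 3 2 1 1 2 1 3 2 1 5 4 3 2 1 1
```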
Using "data.table", you can do the following:
library(data.table)
data.table(a)[, b := .N:1, rleid(a)][]
# a b
# 1: 0 5
# 2: 0 4
# 3: 0 3
# 4: 0 2
# 5: 0 1
# 6: 1 1
# 7: 0 2
# 8: 0 1
# 9: 1 3
# 10: 1 2
# 11: 1 1
# 12: 0 5
# 13: 0 4
# 14: 0 3
# 15: 0 2
# 16: 0 1
# 17: 1 1
How about this, using data.table. There's a bit of reverse ordering, and use of shift to compare values with subsequent values. It might be a little convoluted, but it seems to work.
library( data.table )
dft <- data.table(a)
dft[ , f := shift( a, 1L, fill = FALSE, type = "lead" ) != a
][ .N:1, b := seq_len(.N), by = cumsum(f)
][ , f := NULL ]
dft
a b
1: 0 5
2: 0 4
3: 0 3
4: 0 2
5: 0 1
6: 1 1
7: 0 2
8: 0 1
9: 1 3
10: 1 2
11: 1 1
12: 0 5
13: 0 4
14: 0 3
15: 0 2
16: 0 1
17: 1 1
Here is another approach with lapply:
# The data
a=c(0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1)
# An index of the data
ind <- 1:length(a)
# The function to apply
f <- function(x){ifelse(!is.na(which(a[x]!=a[x:length(a)])[1] - 1), # Check if we are in the last group before series ends
which(a[x]!=a[x:length(a)])[1] - 1, # if not return distance to nearest value change
ind[length(a)] - x + 1) # if we are return length of last block of values
}
unlist(lapply(ind, f)) # Apply and unlist to vector
#> [1] 5 4 3 2 1 1 2 1 3 2 1 5 4 3 2 1 1
If you wanted, you could reduce it to just the which() statement, in which case the last block of homogeneous values would be assigned NA. Depending on the context, there are different ways you might want to treat the last block, since the number of repetitions until the value changes is censored there (for example, you could supply a string such as '1+' as the final argument of the ifelse).
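One way to make the censoring explicit is to compute the countdown per run and then overwrite the last run with NA. This is only a sketch under that assumption (b_cens and last_run are names introduced for this illustration):

```r
a <- c(0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1)

r <- rle(a)
# countdown within each run: distance to the next change of value
b_cens <- rev(sequence(rev(r$lengths)))

# positions of the last run, where no later change is observed
last_len <- r$lengths[length(r$lengths)]
last_run <- seq(to = length(a), length.out = last_len)
b_cens[last_run] <- NA
b_cens
```

Any other marker for the censored block could be assigned in place of NA, at the cost of the vector no longer being purely numeric if a string is used.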