Remove zeros in the start and end of a vector - r

I have a vector like this:
x <- c(0, 0, 0, 0, 4, 5, 0, 0, 3, 2, 7, 0, 0, 0)
I want to keep only the elements from position 5 to 11, that is, delete the zeros at the start and end. For this vector it is easy because it is small.
I have very large data and need a general solution that works for any vector.

Try this:
x[ min( which ( x != 0 )) : max( which( x != 0 )) ]
Find the indices of all values that are not zero, then take the first (min) and the last (max) of them to subset x.
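One case the one-liner does not cover is a vector with no non-zero values, where which() returns nothing to take the min or max of. A minimal sketch of a guarded version follows; the name trim_zeros is hypothetical and not part of the answer above.

```r
# Hypothetical helper: drop leading and trailing zeros, keep interior zeros
trim_zeros <- function(x) {
  nz <- which(x != 0)                    # positions of non-zero values (NA is skipped)
  if (length(nz) == 0) return(x[0])      # all zeros or empty input: nothing to keep
  x[nz[1]:nz[length(nz)]]                # first to last non-zero position
}

x <- c(0, 0, 0, 0, 4, 5, 0, 0, 3, 2, 7, 0, 0, 0)
trimmed <- trim_zeros(x)
trimmed
# [1] 4 5 0 0 3 2 7
all_zero <- trim_zeros(c(0, 0, 0))       # returns an empty numeric vector
```

Returning an empty vector for all-zero input is a design choice made here for illustration; returning the input unchanged would be equally defensible.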

You can try something like:
x=c(0,0,0,0,4,5,0,0,3,2,7,0,0,0)
rl <- rle(x)
if(rl$values[1] == 0)
x <- tail(x, -rl$lengths[1])
if(tail(rl$values,1) == 0)
x <- head(x, -tail(rl$lengths,1))
x
## 4 5 0 0 3 2 7
Hope it helps,
alex

This would also work :
x[cumsum(x) & rev(cumsum(rev(x)))]
# [1] 4 5 0 0 3 2 7

I would probably define two functions, and compose them:
trim_leading <- function(x, value=0) {
w <- which.max(cummax(x != value))
x[seq.int(w, length(x))]
}
trim_trailing <- function(x, value=0) {
w <- which.max(cumsum(x != value))
x[seq.int(w)]
}
And then pipe your data through (%>% needs library(magrittr) or library(dplyr)):
x %>% trim_leading %>% trim_trailing

Related

How to use a loop with mutate dplyr

I have a question regarding a for-loop within R's dplyr. Imagine I have the following dataframe:
id <- c(rep(8, 9))
check <- c(0,1,1,0,0,1,0,0,0)
df <- data.frame(id, check)
df$count_x <- cumsum(df$check)
df$count_y <- NA
df$count_y[1] <- ifelse(df$check[1] == 0, 0, 1)
co <- df$count_y[1]
I want to fill the variable count_y based on an adjusted cumulative function below:
for (idx in 2:nrow(df)){
if(df[idx, 2] == 1 & df[idx - 1, 2] == 0){
co <- 1
df[idx, 4] <- co
} else if (df[idx, 2] == 1 & df[idx - 1, 2] == 1){
co <- co + 1
df[idx, 4] <- co
} else if (df[idx, 2] == 0){
df[idx, 4] <- co
}
}
The output of this for-loop is correct. However, in my current data set, I have many IDs, and using a for loop to iterate over the IDs will take too much time. I'm trying to use the functionality of dplyr to speed up the process.
id <- c(rep(8, 9))
check <- c(0,1,1,0,0,1,0,0,0)
df <- data.frame(id, check)
df <- df %>% group_by(id) %>% mutate(count_x = cumsum(check),
count_y = NA) %>% ungroup()
df <- df %>% group_by(id) %>% mutate(count_y = replace(count_y, 1, ifelse(check[1] == 0, 0 , 1)))
count_n <- function(df){
co <- df$count_y[1]
for (idx in 2:nrow(df)){
if(df[idx, 2] == 1 & df[idx - 1, 2] == 0){
co <- 1
df[idx, 4] <- co
} else if (df[idx, 2] == 1 & df[idx - 1, 2] == 1){
co <- co + 1
df[idx, 4] <- co
} else if (df[idx, 2] == 0){
df[idx, 4] <- co
}
}
}
I want to use mutate to call the function count_n to fill count_y as described above. I'm aware that I'm passing just one variable, whereas I need to pass a data frame, because the function relies on the columns 'check' (col number 2) and 'count_y' (col number 4). I have tried multiple options (mutate_at, mutate_all, etc.) but I couldn't make it work. What can I do differently?
df <- df %>% group_by(id) %>% mutate_at(vars(count_y), ~count_n(.))
I think this is the perfect case to use purrr::accumulate2().
purrr::accumulate() is often used to calculate conditional cumulative sums. It takes a function as the second argument. This function should have 2 arguments: the cumulative output co, and the currently evaluated value x.
purrr::accumulate2() allows us to use a second variable to iterate on, and here we use lag(check) as lx. The tricky part is that this second variable should be one item shorter, as it does not matter for the initial value.
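The two-argument accumulator pattern described above can be tried without purrr. As a rough base R analogue (a swapped-in technique, not the purrr call itself), Reduce() with accumulate = TRUE also passes the running value and the current element to the function. The example below is a simplified streak counter, not the count_y logic from the question.

```r
check <- c(0, 1, 1, 0, 0, 1, 0, 0, 0)

# co is the running value, x is the element currently being processed.
# Without an init argument, the first element of check seeds the accumulator.
streak <- Reduce(function(co, x) if (x == 1) co + 1 else 0,
                 check, accumulate = TRUE)
streak
# [1] 0 1 2 0 0 1 0 0 0
```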
Here is the code, matching your expected output.
library(tidyverse)
df = structure(list(id = c(8, 8, 8, 8, 8, 8, 8, 8, 8),
check = c(0, 1, 1, 0, 0, 1, 0, 0, 0),
count_x = c(0, 1, 2, 2, 2, 3, 3, 3, 3)),
row.names = c(NA, -9L), class = "data.frame")
df %>%
mutate(
count_y = accumulate2(check, lag(check)[-1], function(co, x, lx){
case_when(
x==0 ~ co,
x==1 & lx==0 ~ 1,
x==1 & lx==1 ~ co+1,
TRUE ~ 999 #error value in case of unexpected input
)
})
)
#> id check count_x count_y
#> 1 8 0 0 0
#> 2 8 1 1 1
#> 3 8 1 2 2
#> 4 8 0 2 2
#> 5 8 0 2 2
#> 6 8 1 3 1
#> 7 8 0 3 1
#> 8 8 0 3 1
#> 9 8 0 3 1
Created on 2021-05-05 by the reprex package (v2.0.0)
The first issue is that you weren't returning anything in your function. The second issue is that you don't need to use a mutate_at (or even a mutate as would be more appropriate for a single variable) when you're writing the function that modifies the entire tibble. The simplest way to get it working is adding a return statement and running it in line like so:
count_n <- function(df){
co <- df$count_y[1]
for (idx in 2:nrow(df)){
if(df[idx, 2] == 1 & df[idx - 1, 2] == 0){
co <- 1
df[idx, 4] <- co
} else if (df[idx, 2] == 1 & df[idx - 1, 2] == 1){
co <- co + 1
df[idx, 4] <- co
} else if (df[idx, 2] == 0){
df[idx, 4] <- co
}
}
return(df)
}
df %>% group_by(id) %>% count_n(.)
However, I would use Dan's answer above because it's much cleaner and has the advantage of not running a for loop, which isn't very "R". :)
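As a further alternative (a sketch of my own, not taken from either answer above), count_y can be read as "the position inside the most recent run of 1s", which base R can compute without an explicit loop via rle() and sequence(). The helper name run_count is hypothetical, and the per-id step assumes rows are already ordered within each id.

```r
# Hypothetical helper: position within the current or most recent run of 1s
run_count <- function(check) {
  pos <- sequence(rle(check)$lengths)    # position of each element within its run
  pos[check == 0] <- NA                  # only positions inside runs of 1s matter
  # index of the latest 1 seen so far (0 while no 1 has appeared yet)
  last_one <- cummax(ifelse(is.na(pos), 0L, seq_along(pos)))
  c(0, pos)[last_one + 1L]               # look up that position; 0 before the first 1
}

df <- data.frame(id = rep(8, 9), check = c(0, 1, 1, 0, 0, 1, 0, 0, 0))
df$count_y <- ave(df$check, df$id, FUN = run_count)   # applied separately per id
df$count_y
# [1] 0 1 2 2 2 1 1 1 1
```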

What solves my problem: Map, reduce or a recursion?

I really need some help to write a recursion in R.
The function that I want changes a certain observation according to a set of comparisons between different rows in a data frame, which I shall call g. One of these comparisons depends on the previous value of this same observation.
Suppose first that I want to update the value of column index, row i in my data df in the following way:
j <- 1:4
g <- (df$dom[i] > 0 &
abs(df$V2009[i] - df$V2009[j]) <= w) |
df$index[i] == df$index[j]
df$index[i] <- ifelse(any(g), which(g)[[1]], df$index[[i]])
The thing is, the object w is actually a list:
w = list(0, 1, 2, df$age[i])
So, as you can see, I want to create a function foo() that updates df$index iteratively. It changes it by looping through w and comparisons depend on updated values.
Here is some data:
df <- data.frame(dom = c(0, 0, 6, 6),
V2009 = c(9, 11, 9, 11),
index = c(1, 2, 1, 2),
age = c(2, 2, 2, 2))
I am not sure if a recursive function is actually needed or if something like reduce or map would do it.
Thank you!
The following function uses a double for loop to change the values of column index according to the condition defining g. It accepts a data.frame as input and returns the updated data.frame.
foo <- function(x){
change_index <- function(x, i, w){
j <- seq_len(nrow(x))
(x$dom[i] > 0 & abs(x$V2009[i] - x$V2009[j]) <= w) |
x$index[i] == x$index[j]
}
for(i in seq_len(nrow(x))){
W <- list(0, 1, 2, x$age[i])
for(w in W){
g <- change_index(x, i, w)
if(any(g)) x$index[i] <- which(g)[1]
}
}
x
}
foo(df)
# dom V2009 index age
#1 0 9 1 2
#2 0 11 2 2
#3 6 9 1 2
#4 6 11 1 2
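Since the question asks whether reduce would do, here is a sketch of the same double loop written as two nested folds with base R's Reduce(). The names update_once, update_row and foo_reduce are hypothetical; the condition is the g from the question, and the data frame itself is the accumulator so that each comparison sees the already-updated index.

```r
df <- data.frame(dom = c(0, 0, 6, 6),
                 V2009 = c(9, 11, 9, 11),
                 index = c(1, 2, 1, 2),
                 age = c(2, 2, 2, 2))

# One update of row i for a single threshold w
update_once <- function(x, i, w) {
  j <- seq_len(nrow(x))
  g <- (x$dom[i] > 0 & abs(x$V2009[i] - x$V2009[j]) <= w) |
    x$index[i] == x$index[j]
  if (any(g)) x$index[i] <- which(g)[1]
  x
}

# Inner fold: apply every threshold to row i, carrying the data frame along
update_row <- function(x, i) {
  Reduce(function(acc, w) update_once(acc, i, w),
         list(0, 1, 2, x$age[i]), x)
}

# Outer fold: process the rows in order
foo_reduce <- function(x) Reduce(update_row, seq_len(nrow(x)), x)

res <- foo_reduce(df)
res$index
# [1] 1 2 1 1
```

This does the same work as the loops, so it is unlikely to be faster; it mainly shows that no recursion is needed.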
One can define w inside a function and use lexical scoping (closure).
Using your instructions, the function index_value calculates for any given i the index value.
correct_index_col returns the corrected df.
df <- data.frame(dom = c(0, 0, 6, 6),
V2009 = c(9, 11, 9, 11),
index = c(1, 2, 1, 2),
age = c(2, 2, 2, 2))
index_value <- function(df, i) {
j <- nrow(df)
w <- c(0, 1, 2, df$age[i])
g <- (df$dom[i] > 0 & abs(df$V2009[i] - df$V2009[j]) <= w) |
df$index[i] == df$index[j]
ifelse(any(g), which(g)[[1]], df$index[[i]])
}
correct_index_col <- function(df) {
indexes <- Vectorize(function(i) {
index_value(df, i)
})
df$index <- indexes(1:nrow(df))
df
}
# > correct_index_col(df)
# dom V2009 index age
# 1 0 9 1 2
# 2 0 11 1 2
# 3 6 9 3 2
# 4 6 11 1 2
#
If you want to really update (mutate) your df, then you have to do
df <- correct_index_col(df).
Here is an attempt of my own. I guess I figured out a way to use recursion over mutate:
test <- function(i, df, k){
j <- 1:nrow(df)
w <- list(0, 1, 2, df$age[i])
g <- (df$dom[i] > 0 & abs(df$V2009[i] - df$V2009[j]) <= w[k]) |
df$index[i] == df$index[j]
l <- ifelse(any(g), which(g)[1], df$index[i])
return(l)
}
loop <- function(data,
k = 1) {
data <- data %>%
mutate(index = map_dbl(seq(n()),
~ test(.x, df = cur_data(), k)))
if (k == 4) {
return(data)
} else {
return(loop(data, k + 1))
}
}
df %>% loop()
I welcome any comments, in case this is inefficient for large datasets.

Applying a Logical Calculation to Two Vectors and Returning the Result in a Third Vector

I'm fairly new to R and am having trouble implementing something that should be very basic. Can someone point me in the right direction?
I need to apply a logical calculation based on the values of two vectors and return the value of that function in a third vector.
I want to do this in a user defined function so I can easily apply this in several other areas of the algorithm and make modifications to the implementation with ease.
Here's what I have tried, but I cannot get this implementation to work. I believe it is because I cannot send vectors as parameters to this function.
calcSignal <- function(fVector, sVector) {
if(!is.numeric(fVector) || !is.numeric(sVector)) {
0
}
else if (fVector > sVector) {
1
}
else if (fVector < sVector) {
-1
}
else {
0 # is equal case
}
}
# set up data frame
df <- data.frame(x=c("NA", 2, 9, 7, 0, 5), y=c(4, 1, 5, 9, 0, "NA"))
# call function
df$z <- calcSignal(df$x, df$y)
I want the output to be a vector with the following values, but I am not implementing the function correctly.
[0,-1,1,-1,0,0]
Can someone help explain how to implement this function to properly perform the logic outlined?
I appreciate your assistance!
There are some misunderstandings in your code:
In R, "NA" in quotes is a character value (strings are called character in R); the correct form is NA without quotes.
Note that before R 4.0.0, data.frame automatically converted character columns to factors; this can be disabled with data.frame(..., stringsAsFactors = FALSE).
Each column of a data.frame has a type, not each element. So when a column contains numbers and NA, its class is numeric, and is.numeric gives TRUE for the whole column, NA elements included. Use is.na to detect the missing values.
|| only looks at the first element of each vector (and is an error for longer vectors from R 4.3.0 on); | does elementwise comparison.
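The points above can be checked directly at the console. This is a small illustration with made-up vector names (x_chr, x_num, y_num), not part of the original answer.

```r
x_chr <- c("NA", 2, 9)   # the quoted "NA" turns every element into character
x_num <- c(NA, 2, 9)     # a real missing value keeps the vector numeric

class(x_chr)             # "character"
class(x_num)             # "numeric"
is.numeric(x_num)        # TRUE: one answer for the whole vector
is.na(x_num)             # TRUE FALSE FALSE: one answer per element

y_num <- c(4, 1, NA)
either_missing <- is.na(x_num) | is.na(y_num)   # elementwise OR
either_missing           # TRUE FALSE TRUE
```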
Now let's implement what you wanted:
Implementation 1:
#set up data frame
df <- data.frame(x=c(NA, 2, 9, 7, 0, 5), y=c(4, 1, 5, 9, 0, NA))
calcSignal <- function(f,s){
if(is.na(f) | is.na(s))
return(0)
else if(f>s)
return(1)
else if(f<s)
return(-1)
else
return(0)
}
df$z = mapply(calcSignal, df$x, df$y, SIMPLIFY = T)
To run a function on two or more vectors element-wise, we can use mapply.
Implementation 2
Not much different from the previous one, but here the function is easier to call because mapply is wrapped inside it.
#set up data frame
df <- data.frame(x=c(NA, 2, 9, 7, 0, 5), y=c(4, 1, 5, 9, 0, NA))
calcSignal <- function(fVector, sVector) {
res = mapply(function(f,s){
if(is.na(f) | is.na(s))
return(0)
else if(f>s)
return(1)
else if(f<s)
return(-1)
else
return(0)
},fVector,sVector,SIMPLIFY = T)
return(res)
}
df$z = calcSignal(df$x,df$y)
Implementation 3 (Vectorized)
This one is much better because it is vectorized and therefore much faster:
calcSignal <- function(fVector, sVector) {
res = rep(0,length(fVector))
res[fVector>sVector] = 1
res[fVector<sVector] = -1
# This line isn't necessary (NA comparisons leave res at 0); it is just for clarification
res[(is.na(fVector) | is.na(sVector))] = 0
return(res)
}
df$z = calcSignal(df$x,df$y)
Output:
> df
x y z
1 NA 4 0
2 2 1 1
3 9 5 1
4 7 9 -1
5 0 0 0
6 5 NA 0
No need for loopage as ?sign has your back:
# fixing the "NA" issue:
df <- data.frame(x=c(NA, 2, 9, 7, 0, 5), y=c(4, 1, 5, 9, 0, NA))
s <- sign(df$x - df$y)
s[is.na(s)] <- 0
s
#[1] 0 1 1 -1 0 0
ifelse is another handy function, though less elegant here than sign:
df <- data.frame(x=c(NA, 2, 9, 7, 0, 5), y=c(4, 1, 5, 9, 0, NA))
cs <- function(x, y){
a <- x > y
b <- x < y
out <- ifelse(a, 1, ifelse(b, -1, 0))
ifelse(is.na(out), 0, out)
}
cs(df$x, df$y)

Cut elements from the beginning and end of an R vector

For time series analysis I handle data that often contains leading and trailing zero elements. In this example, there are 3 zeros at the beginning and 2 at the end. I want to get rid of these elements and keep the contents in the middle (which may also contain zeros).
vec <- c(0, 0, 0, 1, 2, 0, 3, 4, 0, 0)
I did this by looping from the beginning and end, and masking out the unwanted elements.
mask <- rep(TRUE, length(vec))
# from begin
i <- 1
while(vec[i] == 0 && i <= length(vec)) {
mask[i] <- FALSE
i <- i+1
}
# from end
i <- length(vec)
while(i >= 1 && vec[i] == 0) {
mask[i] <- FALSE
i <- i-1
}
cleanvec <- vec[mask]
cleanvec
[1] 1 2 0 3 4
This works, but I wonder if there is a more efficient way to do this, avoiding the loops.
vec[ min(which(vec != 0)) : max(which(vec != 0)) ]
Basically the which(vec != 0) part gives the positions of the numbers that are different from 0, and then you take the min and max of them.
We could use the range and Reduce to get the sequence
vec[Reduce(`:`, range(which(vec != 0)))]
#[1] 1 2 0 3 4
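To see what each piece of that one-liner contributes, the intermediate values can be printed step by step. The names nz, bounds and keep are introduced here only for illustration.

```r
vec <- c(0, 0, 0, 1, 2, 0, 3, 4, 0, 0)

nz <- which(vec != 0)          # positions of the non-zero values: 4 5 7 8
bounds <- range(nz)            # first and last of those positions: 4 8
keep <- Reduce(`:`, bounds)    # same as bounds[1]:bounds[2], i.e. 4 5 6 7 8
vec[keep]
# [1] 1 2 0 3 4
```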
Take the cumsum of abs(vec) forward and backward, and keep only the elements where both are > 0. If it were known that all elements of vec were non-negative, as in the question, then we could optionally omit abs.
vec[cumsum(abs(vec)) > 0 & rev(cumsum(rev(abs(vec)))) > 0]
## [1] 1 2 0 3 4
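A small example of why abs matters: with mixed signs, positive and negative values can cancel in the running sum, so interior elements get dropped. The vector v below is made up for this illustration.

```r
v <- c(0, 1, -1, 0, 2, 0)

# Without abs: 1 and -1 cancel, so the forward cumsum falls back to 0
without_abs <- v[cumsum(v) > 0 & rev(cumsum(rev(v))) > 0]
without_abs
# [1] 1 2

# With abs: the running sum can only grow once a non-zero value is seen
with_abs <- v[cumsum(abs(v)) > 0 & rev(cumsum(rev(abs(v)))) > 0]
with_abs
# [1]  1 -1  0  2
```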

counting numbers between two zeros

Suppose I have a vector, say:
x <- c(1, 0, 1, 0, 0, 0, 1, 1, 0, 0,1,1,1,1,0,1,1,0,1,0,0,0,0,0,1,0,1,0,1,0)
and I would like to obtain a vector that sums the values that fall between two zeros, i.e. the output should look like:
y = c(1,2,4,1,1,1)
Note that a run of ones must have a zero at the beginning and a zero at the end, otherwise it is not counted. So the string 01010 produces only a single 1.
I tried to use run length with an index of zeros.
Thanks in Advance
sum.between.zeroes <- function(x) {
library(stringr)
x.str <- paste(x, collapse = "")
nchar(str_extract_all(x.str, "01+0")[[1]]) - 2L
}
sum.between.zeroes(c(1,0,1,0,0,0,1,1,0,0,1,1,1,1,0,1,1,0,1,0,0,0,0,0,1,0,1,0,1,0))
# [1] 1 2 4 1 1 1
sum.between.zeroes(c(0,1,0,1,0))
# [1] 1
sum.between.zeroes(c(1,1))
# integer(0)
If you want to remain within the base package, you can use gregexpr and regmatches:
sum.between.zeroes <- function(x) {
x.str <- paste(x, collapse = "")
nchar(regmatches(x.str, gregexpr("01+0", x.str))[[1]]) - 2L
}
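It may help to look at the matched substrings themselves. Regex matches do not overlap, so a zero that closes one match cannot also open the next; in the question's vector this is why the run of two ones right after the run of four is not counted. The sketch below uses only the base R calls from the answer above.

```r
x <- c(1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0,
       1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0)
x.str <- paste(x, collapse = "")

# Each match is a zero, one or more ones, and a closing zero
m <- regmatches(x.str, gregexpr("01+0", x.str))[[1]]
m
# [1] "010"    "0110"   "011110" "010"    "010"    "010"

counts <- nchar(m) - 2L    # subtract the two bounding zeros
counts
# [1] 1 2 4 1 1 1
```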

Resources