R Select N evenly spaced out elements in vector, including first and last

I'm looking for a way to extract evenly spaced elements in a vector. I'd like a general way to do this because I am trying to specify the values that I want in a plotly chart. I tried using pretty but that only seems to work with ggplot2.
I'm pretty much looking for an R version of this question that was answered for python.
Here's a sample set. This sample is a vector of 23 elements; 23 is prime, so the length cannot be divided evenly.
x <- 1:23
Ideally, there would be a function that takes a number for the spacing (n) and splits x into a subset of values spaced n apart that also includes the first and last elements. For example:
split_func(x, n = 4)
[1] 1 4 8 12 16 20 23
The output elements are centered between the first and last elements and are spaced by 4, with the exception of the first/second and second-to-last/last elements.
A couple other examples:
split_func(x, n = 5)
[1] 1 5 10 15 20 23 # either this
[1] 1 4 9 14 19 23 # or this would work
split_func(1:10, n = 3)
[1] 1 3 6 9 10 # either this
[1] 1 2 5 8 10 # or this would work
split_func(1:27, n = 6)
[1] 1 5 11 17 23 27
Is there a function that does this already?

Try this:
split_func <- function(x, by) {
r <- diff(range(x))
out <- seq(0, r - by - 1, by = by)
c(round(min(x) + c(0, out - 0.51 + (max(x) - max(out)) / 2), 0), max(x))
}
split_func(1:23, 4)
# [1] 1 4 8 12 16 20 23
split_func(1:23, 5)
# [1] 1 4 9 14 19 23
split_func(1:10, 3)
# [1] 1 4 7 10
split_func(1:27, 6)
# [1] 1 5 11 17 23 27
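If the goal is a fixed number of elements rather than a fixed spacing, a minimal sketch is to round evenly spaced positions produced by seq(). The helper name pick_evenly and its argument k are hypothetical and not part of the answer above, and the gaps it produces differ from those of split_func.

```r
# Hypothetical helper: pick k elements of x at (roughly) evenly spaced
# positions, always keeping the first and the last element.
pick_evenly <- function(x, k) {
  stopifnot(k >= 2, k <= length(x))
  # Evenly spaced (possibly fractional) positions, rounded to valid indices
  idx <- round(seq(1, length(x), length.out = k))
  x[idx]
}

pick_evenly(1:23, 7)
# [1] 1 5 8 12 16 19 23
```

Because it indexes by position, this also works when x does not hold consecutive integers.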

Related

How to implement extract/separate functions (from dplyr and tidyr) to separate a column into multiple columns. based on arbitrary values?

I have a column:
Y = c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20)
I would like to split into multiple columns, based on the positions of the column values. For instance, I would like:
Y1=c(1,2,3,4,5)
Y2=c(6,7,8,9,10)
Y3=c(11,12,13,14,15)
Y4=c(16,17,18,19,20)
Since I am working with a big data time series set, the divisions will be arbitrary depending on the length of one time period.
You can use the base split to split this vector into vectors that are each 5 items long. You could also use a variable to store this interval length.
Using rep with each = 5, and creating a sequence programmatically, gets you a sequence of the numbers 1, 2, ... up to the length divided by 5 (in this case, 4), each 5 times consecutively. Then split returns a list of vectors.
It's worth noting that a variety of SO posts will recommend you store similar data in lists such as this, rather than creating multiple variables, so I'm leaving it in list form here.
Y <- 1:20
breaks <- rep(1:(length(Y) / 5), each = 5)
split(Y, breaks)
#> $`1`
#> [1] 1 2 3 4 5
#>
#> $`2`
#> [1] 6 7 8 9 10
#>
#> $`3`
#> [1] 11 12 13 14 15
#>
#> $`4`
#> [1] 16 17 18 19 20
Created on 2019-02-12 by the reprex package (v0.2.1)
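To keep the interval length in a variable, and to cope with vectors whose length is not a multiple of it, one possible sketch builds the grouping index with ceiling(seq_along(v) / size). The names chunk_vector and size are illustrative assumptions, not taken from the answer above.

```r
# Hypothetical helper: split v into consecutive chunks of `size` elements.
# The last chunk is shorter when length(v) is not a multiple of `size`.
chunk_vector <- function(v, size) {
  split(v, ceiling(seq_along(v) / size))
}

chunks <- chunk_vector(1:20, 5)
chunks[["2"]]
# [1] 6 7 8 9 10

# 7 elements in chunks of 3: the third chunk holds the single leftover value
chunk_vector(1:7, 3)[["3"]]
# [1] 7
```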
Not a dplyr solution, but I believe the easiest way would involve using matrices.
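foo = function(data, sep.in = 5) {
  # matrix() fills column by column, so nrow = sep.in puts each run of
  # sep.in consecutive values into its own column
  data.matrix = matrix(data, nrow = sep.in)
  data.df = as.data.frame(data.matrix)
  return(data.df)
}
I have not tested it, but this function should create a data.frame that can be merged into an existing one using cbind().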
We can use split (posting the code from my comment as an answer) to split the vector into a list of vectors.
lst <- split(Y, as.integer(gl(length(Y), 5, length(Y))))
lst
#$`1`
#[1] 1 2 3 4 5
#$`2`
#[1] 6 7 8 9 10
#$`3`
#[1] 11 12 13 14 15
#$`4`
#[1] 16 17 18 19 20
Here, gl creates a grouping index from the n, k, and length arguments, where n is an integer giving the number of levels, k is an integer giving the number of replications, and length is an integer giving the length of the result.
In our case, we want to have 'k' as 5.
as.integer(gl(length(Y), 5, length(Y)))
#[1] 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 4 4 4 4 4
If we want to have multiple objects in the global environment, use list2env
list2env(setNames(lst, paste0("Y", seq_along(lst))), envir = .GlobalEnv)
Y1
#[1] 1 2 3 4 5
Y2
#[1] 6 7 8 9 10
Y3
#[1] 11 12 13 14 15
Y4
#[1] 16 17 18 19 20
Or as the OP mentioned dplyr/tidyr in the question, we can use those packages as well
library(tidyverse)
tibble(Y) %>%
group_by(grp = (row_number()-1) %/% 5 + 1) %>%
summarise(Y = list(Y)) %>%
pull(Y)
#[[1]]
#[1] 1 2 3 4 5
#[[2]]
#[1] 6 7 8 9 10
#[[3]]
#[1] 11 12 13 14 15
#[[4]]
#[1] 16 17 18 19 20
data
Y <- 1:20

calculating column sum for certain row

For each row, I am trying to calculate the sum of a column over a window of 5 rows ending at that row, using the following R code:
df <- data.frame(count=1:10)
for (loop in (1:nrow(df)))
{df[loop,"acc_sum"] <- sum(df[max(1,loop-5):loop,"count"])}
But I don't like the explicit loop here, how can I modify it? Thanks.
According to your question, your desired result is:
df
# count acc_sum
# 1 1 1
# 2 2 3
# 3 3 6
# 4 4 10
# 5 5 15
# 6 6 21
# 7 7 27
# 8 8 33
# 9 9 39
# 10 10 45
This can be done like this:
df <- data.frame(count=1:10)
library(zoo)
df$acc_sum <- rev(rollapply(rev(df$count), 6, sum, partial = TRUE, align = "left"))
To obtain this result, we reverse df$count, compute the rolling sums (partial = TRUE and align = "left" are both important here), and reverse the result to put the sums back in the original order.
rev(rollapply(rev(df$count), 6, sum, partial = TRUE, align = "left"))
# [1] 1 3 6 10 15 21 27 33 39 45
Note that this sums 6 elements, not 5. According to the code in your question, this gives the same output. If you just want to sum 5 rows, just replace the 6 with a 5.
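If you would rather avoid the zoo dependency, the same trailing sums can be sketched in base R by subtracting a lagged copy of cumsum(). The helper name trailing_sum and its width argument are hypothetical. As above, width = 6 reproduces the output of the loop in the question.

```r
# Hypothetical helper: sum of the current element and the (width - 1)
# elements before it, using partial windows at the start of the vector.
trailing_sum <- function(v, width) {
  n <- length(v)
  cs <- cumsum(v)
  # Cumulative sum as it stood `width` positions earlier (0 before the start)
  lagged <- c(rep(0, min(width, n)), head(cs, max(n - width, 0)))
  cs - lagged
}

df <- data.frame(count = 1:10)
df$acc_sum <- trailing_sum(df$count, 6)
df$acc_sum
# [1] 1 3 6 10 15 21 27 33 39 45
```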

Function for lagged sums

I know how to take the lagged difference:
delX = diff(x)
But the only way I know to take the lagged sum is:
sumY = apply(embed(c(0,y),2),1, sum)
Is there a function that can take the lagged sum? This way (or sliding the index in some other fashion) is not very intuitive.
You're looking for filter:
x <- 1:10
filter(x, filter=c(1,1), sides=1)
# [1] NA 3 5 7 9 11 13 15 17 19
You could also use head and tail:
head(x, -1) + tail(x, -1)
# [1] 3 5 7 9 11 13 15 17 19
Two more options:
x <- 1:10
x + dplyr::lag(x)
# [1] NA 3 5 7 9 11 13 15 17 19
x + data.table::shift(x)
# [1] NA 3 5 7 9 11 13 15 17 19
Note that you can easily change the number of lags in both functions. Instead of lagging, you can also create a leading vector by using dplyr::lead() or data.table::shift(x, 1L, type = "lead"). Both functions also allow you to specify default values (which are NA by default).
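For sums over more than two consecutive values, the filter approach generalises by lengthening the vector of ones. The sketch below assumes a hypothetical wrapper name, lagged_sum, and calls stats::filter explicitly, because dplyr::filter masks it once dplyr is attached.

```r
# Hypothetical helper: sum of each element and the (width - 1) elements
# before it; the first (width - 1) results are NA.
lagged_sum <- function(x, width = 2) {
  # The stats:: prefix avoids picking up dplyr::filter by accident
  as.numeric(stats::filter(x, rep(1, width), sides = 1))
}

lagged_sum(1:10)
# [1] NA 3 5 7 9 11 13 15 17 19
lagged_sum(1:10, width = 3)
# [1] NA NA 6 9 12 15 18 21 24 27
```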

R vector :removing values conditioned on surrounding values

This may not be the best title, feel free to edit it.
x=c(NA,NA,NA,1,NA,NA,NA,NA,0,NA,NA,NA,1,NA,NA,0,NA,NA,NA,NA,1,NA,NA,NA,0,NA....)
or
x=c(NA,NA,NA,0,NA,NA,NA,NA,1,NA,NA,NA,0,NA,NA,1,NA,NA,NA,NA,0,NA,NA,NA,1,NA....)
y=c(seq(1:length(x)))
I would like z to be a new vector that is equal to y except when NAs are between 0 and 1 (not 1 and 0) where it should repeat the value taken when x=0
[1] 1 2 3 4 5 6 7 8 9 9 9 9 13 14 15 16 16 16 16 16 21 22 23 24 25 25
or
[1] 1 2 3 4 4 4 4 4 9 10 11 12 13 13 13 16 17 18 19 20 21 21 21 21 25 26
depending on x
I really don't know how to translate this condition in R.
My solution is clunkier than @James's (now deleted) answer but maybe (?) it's more flexible:
## identify strings of NAs preceded by 0
library(zoo)
na_following_zero <- na.locf(c(1,x))[-1]==0 & is.na(x)
## now identify the 'chunks' to reset
## (there may be a more elegant way to do this)
rr <- rle(na_following_zero)
startvals <- cumsum(c(0,rr$lengths))+1
endvals <- cumsum(rr$lengths)
values <- c(NA,y[startvals-1])
z <- y
## replace values in chunks
for (i in seq_along(rr$values)[rr$values])
z[startvals[i]:endvals[i]] <- values[i]
If time isn't prohibitive, you can just use a "for" loop:
z <- y
between.0.1 <- rep(FALSE, length(x))
for(i in 2:length(x)){
if(!is.na(x[i-1]) && x[i-1]==0){ # switch on after a 0
between.0.1[i] <- TRUE
value.at.0 <- y[i-1]
z[i] <- value.at.0
}
if(between.0.1[i-1]){ # if switched on, stay switched on
between.0.1[i] <- TRUE
z[i] <- value.at.0
}
if(!is.na(x[i]) && x[i]==1){ # switch off if at a 1
between.0.1[i] <- FALSE
value.at.0 <- NA
}
}
z[between.0.1] # diagnostic check
Another approach:
y0 <- which(x==0)
y1<-which(x==1)
# need a kicker to start with first zero
y1<-y1[y1>y0[1]]
# and check for end of sequence
if(y1[length(y1)] < length(x)) y1[length(y1)+1] <- length(x)+1
#now y0 and y1 better be same length
z<-y
#now do a loop any way you want
for (jj in 1: length(y0) ) z[y0[jj]:(y1[jj]-1)]<-y[y0[jj]]
Rgames> z
[1] 1 2 3 4 4 4 4 4 9 10 11 12 13 13 13 16 17 18 19 20 21 21 21 21 25
[26] 26
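A loop-free variant can be sketched with cummax(), which tracks the position of the most recent non-NA value of x. The function name fill_after_zero is hypothetical, and because the x vectors in the question are truncated, the example assumes only the 26 elements that are visible there.

```r
# Hypothetical helper: copy y, but wherever the most recent non-NA value
# of x is 0, repeat the y value found at the position of that 0.
fill_after_zero <- function(x, y = seq_along(x)) {
  pos <- seq_along(x)
  # Position of the last non-NA value seen so far (0 if none yet)
  last_obs <- cummax(ifelse(is.na(x), 0L, pos))
  # TRUE where that last observed value is a 0
  after_zero <- last_obs > 0 & x[pmax(last_obs, 1L)] == 0
  z <- y
  z[after_zero] <- y[last_obs[after_zero]]
  z
}

# Assumption: the first 26 elements of the first x shown in the question
x <- c(NA,NA,NA,1,NA,NA,NA,NA,0,NA,NA,NA,1,NA,NA,0,NA,NA,NA,NA,1,NA,NA,NA,0,NA)
fill_after_zero(x)
# [1] 1 2 3 4 5 6 7 8 9 9 9 9 13 14 15 16 16 16 16 16 21 22 23 24 25 25
```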

Conditional cumulative sum

I have this data frame
t<-data.frame(v1=c(1,2,1,4,6,7,8,2,3,4,8,1,2), v2=c(2,3,6,1,-3,-2,1,2,-3,6,7,-2,1))
Scanning the data.frame from top to bottom, I want to get the cumulative sum of v1 for as long as v2 is positive. When v2 becomes negative, it should stop, record the value (of the cumulative sum up to then), and restart from the next positive v2, and so on, so that in the end the result for the above data frame would be the vector
8, 10 , 12, 2
Any ideas?
I changed the name of the data.frame because t is a function (transpose). I don't get why you want to use cumsum if you only want the sum.
dtf<-data.frame(v1=c(1,2,1,4,6,7,8,2,3,4,8,1,2), v2=c(2,3,6,1,-3,-2,1,2,-3,6,7,-2,1))
groups <- rle(dtf$v2 > 0)
dtf$groups<- rep(seq_along(groups$values), groups$lengths)
library(plyr)
daply(dtf, .(groups), function(x) sum(x$v1))[groups$values]
1 3 5 7
8 10 12 2
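The same grouping idea can be sketched without plyr by using tapply() from base R. This is an alternative written under that assumption, not the code from the answer above; run_id and run_sums are illustrative names.

```r
dtf <- data.frame(v1 = c(1,2,1,4,6,7,8,2,3,4,8,1,2),
                  v2 = c(2,3,6,1,-3,-2,1,2,-3,6,7,-2,1))

# Label each run of consecutive positive / non-positive v2 values
runs <- rle(dtf$v2 > 0)
run_id <- rep(seq_along(runs$values), runs$lengths)

# Sum v1 within every run, then keep only the runs where v2 was positive
run_sums <- tapply(dtf$v1, run_id, sum)
as.vector(run_sums[runs$values])
# [1] 8 10 12 2
```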
Here's one way:
t <- data.frame(v1=c(1,2,1,4,6,7,8,2,3,4,8,1,2), v2=c(2,3,6,1,-3,-2,1,2,-3,6,7,-2,1))
unname(with(t, tapply(v1[v2>0], cumsum(abs(diff(sign(c(0,v2)))))[v2>0], sum)))
[1] 8 10 12 2
It might seem a bit complicated at first :)
The cumsum(abs(diff(sign(c(0,v2))))) generates a unique group id for each run of positive or negative values. Using diff and cumsum for this is a "common" trick that's good to know about... A snag is that diff produces a shorter vector - that's why the c(0, v2) is used.
Here's another way.
> r <- rle(sign(t$v2))
> diff(c(0,cumsum(t$v1)[cumsum(r$lengths)]))[r$values==1]
[1] 8 10 12 2
It's easier to understand if you split it up; it works by picking out the right elements of the cumulative sum and subtracting them.
> (s <- cumsum(t$v1))
[1] 1 3 4 8 14 21 29 31 34 38 46 47 49
> (r <- rle(sign(t$v2)))
Run Length Encoding
lengths: int [1:7] 4 2 2 1 2 1 1
values : num [1:7] 1 -1 1 -1 1 -1 1
> (k <- cumsum(r$lengths))
[1] 4 6 8 9 11 12 13
> (a <- c(0,s[k]))
[1] 0 8 21 31 34 46 47 49
> (d <- diff(a))
[1] 8 13 10 3 12 1 2
> d[r$values==1]
[1] 8 10 12 2
Similarly, but without rle:
> k <- which(diff(c(sign(t$v2),0))!=0)
> diff(c(0,cumsum(t$v1)[k]))[t$v2[k]>0]
[1] 8 10 12 2
