Conditional cumulative sum - r

I have this data frame
t<-data.frame(v1=c(1,2,1,4,6,7,8,2,3,4,8,1,2), v2=c(2,3,6,1,-3,-2,1,2,-3,6,7,-2,1))
Scanning the data.frame from top to bottom, I want to get the cumulative sum of v1 for as long as v2 is positive. When v2 becomes negative, it should stop, record the value (of the cum.sum up to then) and the cumulative sum should restart again from the next first positive v2 and so on. So that in the end for the above data frame would be be the vector
8, 10 , 12, 2
Any ideas?

I changed the name of the data.frame because t is a function (transpose). I don't get why you want to use cumsum if you only want the sum.
dtf<-data.frame(v1=c(1,2,1,4,6,7,8,2,3,4,8,1,2), v2=c(2,3,6,1,-3,-2,1,2,-3,6,7,-2,1))
groups <- rle(dtf$v2 > 0)
dtf$groups<- rep(seq_along(groups$values), groups$lengths)
library(plyr)
daply(dtf, .(groups), function(x) sum(x$v1))[groups$values]
1 3 5 7
8 10 12 2

Here's one way:
t <- data.frame(v1=c(1,2,1,4,6,7,8,2,3,4,8,1,2), v2=c(2,3,6,1,-3,-2,1,2,-3,6,7,-2,1))
unname(with(t, tapply(v1[v2>0], cumsum(abs(diff(sign(c(0,v2)))))[v2>0], sum)))
[1] 8 10 12 2
It might seem a bit complicated at first :)
The cumsum(abs(diff(sign(c(0,v2))))) generates a unique group id for each run of positive or negative values. Using diff and cumsum for this is a "common" trick that's good to know about... A snag is that diff produces a shorter vector - that's why the c(0, v2) is used.

Here's another way.
> r <- rle(sign(t$v2))
> diff(c(0,cumsum(t$v1)[cumsum(r$lengths)]))[r$values==1]
[1] 8 10 12 2
It's easier to understand if you split it up; it works by picking out the right elements of the cumulative sum and subtracting them.
> (s <- cumsum(t$v1))
[1] 1 3 4 8 14 21 29 31 34 38 46 47 49
> (r <- rle(sign(t$v2)))
Run Length Encoding
lengths: int [1:7] 4 2 2 1 2 1 1
values : num [1:7] 1 -1 1 -1 1 -1 1
> (k <- cumsum(r$lengths))
[1] 4 6 8 9 11 12 13
> (a <- c(0,s[k]))
[ 1] 0 8 21 31 34 46 47 49
> (d <- diff(a))
[1] 8 13 10 3 12 1 2
> d[r$values==1]
[1] 8 10 12 2
Similarly, but without rle:
> k <- which(diff(c(sign(t$v2),0))!=0)
> diff(c(0,cumsum(t$v1)[k]))[t$v2[k]>0]
[1] 8 10 12 2

Related

R Select N evenly spaced out elements in vector, including first and last

I'm looking for a way to extract evenly spaced elements in a vector. I'd like a general way to do this because I am trying to specify the values that I want in a plotly chart. I tried using pretty but that only seems to work with ggplot2.
I'm pretty much looking for an R version of this question that was answered for python.
Here's a sample set. This sample is a vector of 23 elements, a prime that cannot be factored.
x <- 1:23
Ideally, there would be a function that takes a number for the spacing (n) and that splits x into a subset of n evenly spaced values that also includes the first and last element. For example:
split_func(x, n = 4)
[1] 1 4 8 12 16 20 23
The output elements centered between the first and last elements and are spaced by 4, with the exception of the first/second and second-to-last/last elements.
A couple other examples:
split_func(x, n = 5)
[1] 1 5 10 15 20 23 # either this
[1] 1 4 9 14 19 23 # or this would work
split_func(1:10, n = 3)
[1] 1 3 6 9 10 # either this
[1] 1 2 5 8 10 # or this would work
split_func(1:27, n = 6)
[1] 1 5 11 17 23 27
Is there a function that does this already?
Try this:
split_func <- function(x, by) {
r <- diff(range(x))
out <- seq(0, r - by - 1, by = by)
c(round(min(x) + c(0, out - 0.51 + (max(x) - max(out)) / 2), 0), max(x))
}
split_func(1:23, 4)
# [1] 1 4 8 12 16 20 23
split_func(1:23, 5)
# [1] 1 4 9 14 19 23
split_func(1:10, 3)
# [1] 1 4 7 10
split_func(1:27, 6)
# [1] 1 5 11 17 23 27

Referring to elements of a matrices list

I've got a matrices list created as following:
#create the database
vect_date <- c(1,2,3,4,5,6,7,8,9,10,11,12,13,14)
vect <- c(48,40,32,36,37,37,20,15,15,24,24,10,10,10)
vect <- as.data.frame(cbind(vect_date, vect))
vect <- vect[order(vect$vect_date),]
#create levels depending on vect$vect value
vect$level <- 1
for(i in 2:length(vect$vect)){vect$level[i] <- ifelse(vect$vect[i]==vect$vect[i-1], vect$level[i- 1],vect$level[i-1]+1)}
#create the list
monotone <- split(vect, f=vect$level)
Now, I would like to change vect$vect value of each of these levels/matrices depending on the vect$vect value of the subsequent matrix. I guess the issue consists of indexing elements and using for loops, but I don't know how to do that.
As an example, I would like to change the value of vect$vect depending on the fact that the subsequent is 10. In that case, the vect$vect value of that level should be multiplied by 100, obtaining:
vect <- c(48,40,37,36,37,37,20,15,15,2400,2400,10,10,10)
Any help would be great!
I think you can use factor in R first to get your levels:
vect_date <- c(1,2,3,4,5,6,7,8,9,10,11,12,13,14)
vect <- c(48,40,32,36,37,37,20,15,15,24,24,10,10,10)
vect <- as.data.frame(cbind(vect_date, vect))
vect <- vect[order(vect$vect_date),]
vect$level = factor(vect$vect,levels=unique(vect$vect))
vect$level = as.numeric(vect$level)
So if we want to change the level that comes before vect that have values 10, we can do:
level_tochange = vect$level[vect$vect==10] - 1
level_tochange
[1] 8 8 8
This tells us we need to change rows with level == 8. Note I use %in% because in events where you have more than 2 levels with vect==10, this will still work:
rows_tochange = which(vect$level %in% level_tochange)
vect$vect[rows_tochange] = vect$vect[rows_tochange]*100
vect
vect_date vect level
1 1 48 1
2 2 40 2
3 3 32 3
4 4 36 4
5 5 37 5
6 6 37 5
7 7 20 6
8 8 15 7
9 9 15 7
10 10 2400 8
11 11 2400 8
12 12 10 9
13 13 10 9
14 14 10 9

Get specific column value for each row

I want to get a "m" length vector that, considering a m x n matrix, for each row, gives the value on the column identified by another column (say column "Z").
I made it using a for loop:
for (i in 1:dim(data.frame)[1]){vector[i] <- data.frame[i,data.frame$Z[i]]}
Do you see a simpler way to code it avoiding the loop?
"apply" is a possibility:
> M <- cbind( matrix(1:15,3,5), "Z"=c(3,1,2) )
> M
Z
[1,] 1 4 7 10 13 3
[2,] 2 5 8 11 14 1
[3,] 3 6 9 12 15 2
> v <- apply(M,1,function(x){x[x["Z"]]})
> v
[1] 7 2 6
>

cumsum the opposite of diff in r

I have a question and I'm not sure if I'm being totally stupid here or if this is a genuine problem, or if I've misunderstood what these functions do.
Is the opposite of diff the same as cumsum? I thought it was. However, using this example:
dd <- c(17.32571,17.02498,16.71613,16.40615,
16.10242,15.78516,15.47813,15.19073,
14.95551,14.77397)
par(mfrow = c(1,2))
plot(dd)
plot(cumsum(diff(dd)))
> dd
[1] 17.32571 17.02498 16.71613 16.40615 16.10242 15.78516 15.47813 15.19073 14.95551
[10] 14.77397
> cumsum(diff(dd))
[1] -0.30073 -0.60958 -0.91956 -1.22329 -1.54055 -1.84758 -2.13498 -2.37020 -2.55174
These aren't the same. Where have I gone wrong?
AHHH! Fridays.
Obviously
The functions are quite different: diff(x) returns a vector of length (length(x)-1) which contains the difference between one element and the next in a vector x, while cumsum(x) returns a vector of length equal to the length of x containing the sum of the elements in x
Example:
x <- c(1:10)
#[1] 1 2 3 4 5 6 7 8 9 10
> diff(x)
#[1] 1 1 1 1 1 1 1 1 1
v <- cumsum(x)
> v
#[1] 1 3 6 10 15 21 28 36 45 55
The function cumsum() is the cumulative sum and therefore the entries of the vector v[i] that it returns are a result of all elements in x between x[1] and x[i]. In contrast, diff(x) only takes the difference between one element x[i] and the next, x[i+1].
The combination of cumsum and diff leads to different results, depending on the order in which the functions are executed:
> cumsum(diff(x))
# 1 2 3 4 5 6 7 8 9
Here the result is the cumulative sum of a sequence of nine "1". Note that if this result is compared with the original vector x, the last entry 10 is missing.
On the other hand, by calculating
> diff(cumsum(x))
# 2 3 4 5 6 7 8 9 10
one obtains a vector that is again similar to the original vector x, but now the first entry 1 is missing.
In none of the cases the original vector is restored, therefore it cannot be stated that cumsum() is the opposite or inverse function of diff()
You forgot to account for the impact of the first element
dd == c(dd[[1]], dd[[1]] + cumsum(diff(dd)))
#RHertel answered it well, stating that diff() returns a vector with length(x)-1.
Therefore, another simple workaround would be to add 0 to the beginning of the original vector so that diff() computes the difference between x[1] and 0.
> x <- 5:10
> x
#[1] 5 6 7 8 9 10
> diff(x)
#[1] 1 1 1 1 1
> diff(c(0,x))
#[1] 5 1 1 1 1 1
This way it is possible to use diff() with c() as a representation of the inverse of cumsum()
> cumsum(diff(c(0,x)))
#[1] 1 2 3 4 5 6 7 8 9 10
> diff(c(0,cumsum(x)))
#[1] 1 2 3 4 5 6 7 8 9 10
If you know the value of "lag" and "difference".
x<-5:10
y<-diff(x,lag=1,difference=1)
z<-diffinv(y,lag=1,differences = 1,xi=5) #xi is first value.
k<-as.data.frame(cbind(x,z))
k
x z
1 5 5
2 6 6
3 7 7
4 8 8
5 9 9
6 10 10

R self reference

In R I find myself doing something like this a lot:
adataframe[adataframe$col==something]<-adataframe[adataframe$col==something)]+1
This way is kind of long and tedious. Is there some way for me
to reference the object I am trying to change such as
adataframe[adataframe$col==something]<-$self+1
?
Try package data.table and its := operator. It's very fast and very short.
DT[col1==something, col2:=col3+1]
The first part col1==something is the subset. You can put anything here and use the column names as if they are variables; i.e., no need to use $. Then the second part col2:=col3+1 assigns the RHS to the LHS within that subset, where the column names can be assigned to as if they are variables. := is assignment by reference. No copies of any object are taken, so is faster than <-, =, within and transform.
Also, soon to be implemented in v1.8.1, one end goal of j's syntax allowing := in j like that is combining it with by, see question: when should I use the := operator in data.table.
UDPDATE : That was indeed released (:= by group) in July 2012.
You should be paying more attention to Gabor Grothendeick (and not just in this instance.) The cited inc function on Matt Asher's blog does all of what you are asking:
(And the obvious extension works as well.)
add <- function(x, inc=1) {
eval.parent(substitute(x <- x + inc))
}
# Testing the `inc` function behavior
EDIT: After my temporary annoyance at the lack of approval in the first comment, I took the challenge of adding yet a further function argument. Supplied with one argument of a portion of a dataframe, it would still increment the range of values by one. Up to this point has only been very lightly tested on infix dyadic operators, but I see no reason it wouldn't work with any function which accepts only two arguments:
transfn <- function(x, func="+", inc=1) {
eval.parent(substitute(x <- do.call(func, list(x , inc)))) }
(Guilty admission: This somehow "feels wrong" from the traditional R perspective of returning values for assignment.) The earlier testing on the inc function is below:
df <- data.frame(a1 =1:10, a2=21:30, b=1:2)
inc <- function(x) {
eval.parent(substitute(x <- x + 1))
}
#---- examples===============>
> inc(df$a1) # works on whole columns
> df
a1 a2 b
1 2 21 1
2 3 22 2
3 4 23 1
4 5 24 2
5 6 25 1
6 7 26 2
7 8 27 1
8 9 28 2
9 10 29 1
10 11 30 2
> inc(df$a1[df$a1>5]) # testing on a restricted range of one column
> df
a1 a2 b
1 2 21 1
2 3 22 2
3 4 23 1
4 5 24 2
5 7 25 1
6 8 26 2
7 9 27 1
8 10 28 2
9 11 29 1
10 12 30 2
> inc(df[ df$a1>5, ]) #testing on a range of rows for all columns being transformed
> df
a1 a2 b
1 2 21 1
2 3 22 2
3 4 23 1
4 5 24 2
5 8 26 2
6 9 27 3
7 10 28 2
8 11 29 3
9 12 30 2
10 13 31 3
# and even in selected rows and grepped names of columns meeting a criterion
> inc(df[ df$a1 <= 3, grep("a", names(df)) ])
> df
a1 a2 b
1 3 22 1
2 4 23 2
3 4 23 1
4 5 24 2
5 8 26 2
6 9 27 3
7 10 28 2
8 11 29 3
9 12 30 2
10 13 31 3
Here is what you can do. Let us say you have a dataframe
df = data.frame(x = 1:10, y = rnorm(10))
And you want to increment all the y by 1. You can do this easily by using transform
df = transform(df, y = y + 1)
I'd be partial to (presumably the subset is on rows)
ridx <- adataframe$col==something
adataframe[ridx,] <- adataframe[ridx,] + 1
which doesn't rely on any fancy / fragile parsing, is reasonably expressive about the operation being performed, and is not too verbose. Also tends to break lines into nicely human-parse-able units, and there is something appealing about using standard idioms -- R's vocabulary and idiosyncrasies are already large enough for my taste.

Resources