cumsum the opposite of diff in R

I have a question and I'm not sure if I'm being totally stupid here or if this is a genuine problem, or if I've misunderstood what these functions do.
Is the opposite of diff the same as cumsum? I thought it was. However, using this example:
dd <- c(17.32571,17.02498,16.71613,16.40615,
16.10242,15.78516,15.47813,15.19073,
14.95551,14.77397)
par(mfrow = c(1,2))
plot(dd)
plot(cumsum(diff(dd)))
> dd
[1] 17.32571 17.02498 16.71613 16.40615 16.10242 15.78516 15.47813 15.19073 14.95551
[10] 14.77397
> cumsum(diff(dd))
[1] -0.30073 -0.60958 -0.91956 -1.22329 -1.54055 -1.84758 -2.13498 -2.37020 -2.55174
These aren't the same. Where have I gone wrong?
AHHH! Fridays.
Obviously

The functions are quite different: diff(x) returns a vector of length (length(x)-1) containing the differences between consecutive elements of x, while cumsum(x) returns a vector of the same length as x containing the cumulative sums of its elements.
Example:
x <- c(1:10)
#[1] 1 2 3 4 5 6 7 8 9 10
> diff(x)
#[1] 1 1 1 1 1 1 1 1 1
v <- cumsum(x)
> v
#[1] 1 3 6 10 15 21 28 36 45 55
The function cumsum() is the cumulative sum and therefore the entries of the vector v[i] that it returns are a result of all elements in x between x[1] and x[i]. In contrast, diff(x) only takes the difference between one element x[i] and the next, x[i+1].
The combination of cumsum and diff leads to different results, depending on the order in which the functions are executed:
> cumsum(diff(x))
# 1 2 3 4 5 6 7 8 9
Here the result is the cumulative sum of a sequence of nine "1". Note that if this result is compared with the original vector x, the last entry 10 is missing.
On the other hand, by calculating
> diff(cumsum(x))
# 2 3 4 5 6 7 8 9 10
one obtains a vector that is again similar to the original vector x, but now the first entry 1 is missing.
In neither case is the original vector restored, so cumsum() on its own cannot be called the opposite or inverse function of diff().

You forgot to account for the first element, which diff() drops:
dd == c(dd[[1]], dd[[1]] + cumsum(diff(dd)))
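Since dd holds floating-point values, an element-wise == can report FALSE for entries that differ only by rounding error. A minimal sketch of a tolerance-based check using base R's all.equal() (the name rebuilt is only illustrative):

```r
dd <- c(17.32571, 17.02498, 16.71613, 16.40615,
        16.10242, 15.78516, 15.47813, 15.19073,
        14.95551, 14.77397)

# Rebuild dd from its differences by restoring the first element
rebuilt <- c(dd[1], dd[1] + cumsum(diff(dd)))

# Compare with a numerical tolerance rather than exact equality
isTRUE(all.equal(dd, rebuilt))
```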

@RHertel answered it well, stating that diff() returns a vector with length(x)-1.
Therefore, another simple workaround would be to add 0 to the beginning of the original vector so that diff() computes the difference between x[1] and 0.
> x <- 5:10
> x
#[1] 5 6 7 8 9 10
> diff(x)
#[1] 1 1 1 1 1
> diff(c(0,x))
#[1] 5 1 1 1 1 1
This way it is possible to use diff() with c() as a representation of the inverse of cumsum()
> cumsum(diff(c(0,x)))
#[1] 5 6 7 8 9 10
> diff(c(0,cumsum(x)))
#[1] 5 6 7 8 9 10

If you know the values of "lag" and "differences" and the first element of the original vector, diffinv() reverses diff():
x <- 5:10
y <- diff(x, lag = 1, differences = 1)
z <- diffinv(y, lag = 1, differences = 1, xi = 5) # xi is the first value of x
k <- as.data.frame(cbind(x, z))
k
   x  z
1  5  5
2  6  6
3  7  7
4  8  8
5  9  9
6 10 10

Related

Operation between two dataframe with different size in R

I'd like to sum two data frames of different sizes in R.
> x = data.frame(a=c(1,2,3),b=c(5,6,7))
> y = data.frame(x=c(1,1,1))
> x
a b
1 1 5
2 2 6
3 3 7
> y
x
1 1
2 1
3 1
The result I want is,
>
a b
1 2 6
2 3 7
3 4 8
How can I do this?
Maybe easiest to convert y to a vector with unlist and then perform the operation. Here, the vector in unlist(y) will be recycled over the columns of the data.frame x.
x + unlist(y)
a b
1 2 6
2 3 7
3 4 8
As a side note, data.frames are a special type of list object, and performing operations on lists can sometimes be a bit more involved. On the other hand, they tend to work fairly well with vectors as long as the dimensions line up (here, as long as the vector has the same length as the number of rows in the data.frame).
We can make the dimensions same and then get the sum
x + rep(y, ncol(x))
# a b
#1 2 6
#2 3 7
#3 4 8
Or another option is sweep
sweep(x, 1, y$x, `+`)
# a b
#1 2 6
#2 3 7
#3 4 8
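As a quick sanity check, the three approaches can be compared directly. This is only a sketch; the names r1, r2 and r3 are illustrative, and check.attributes = FALSE restricts the comparison to the values.

```r
x <- data.frame(a = c(1, 2, 3), b = c(5, 6, 7))
y <- data.frame(x = c(1, 1, 1))

r1 <- x + unlist(y)          # vector recycled down each column of x
r2 <- x + rep(y, ncol(x))    # one copy of y's column per column of x
r3 <- sweep(x, 1, y$x, `+`)  # MARGIN = 1: add y$x[i] to every entry in row i

# Compare values only, ignoring names and other attributes
isTRUE(all.equal(r1, r2, check.attributes = FALSE))
isTRUE(all.equal(r1, r3, check.attributes = FALSE))
```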

Summing a column to a certain value

I have a data.frame with 2 variables and 177 observations. I would like to sum up one variable until it reaches a certain value, and then get the value of the other variable when that threshold is reached. I will try to add a reproducible example. I am new here, so forgive me if I do it wrong.
> df <- data.frame(x=10:1,y=1:10)
> print(df)
x y
1 10 1
2 9 2
3 8 3
4 7 4
5 6 5
6 5 6
7 4 7
8 3 8
9 2 9
10 1 10
How can I sum column y until it reaches a certain value, let's say 7, and then have it return either the value of x (4) or the row number (7)? I am sure it is pretty straightforward, but I seem to be drawing a blank.
Here is my solution.
df[cumsum(df$y) <= 7,]
x y
1 10 1
2 9 2
3 8 3
The OP just asked for the relevant value of x which would be done using:
df$x[which(cumsum(df$y) >= 10)[1]]
Also note that this finds the first row where cumsum(df$y) is at least 10, whereas the other answers find the last row where it is <= 7, which is potentially different (though not for this dataset). For the original question (pre-comment) it would need to be:
df$x[which(cumsum(df$y) >= 7)[1]]
If you want to stay with base R, try this
> df$x[df$y >= 7][1]
[1] 4
> max(cumsum(df$y[df$y <= 7]))
[1] 28
Or if you need this in a matrix form:
> cbind(df$x[df$y >= 7][1], max(cumsum(df$y[df$y <= 7])))
[,1] [,2]
[1,] 4 28
I would still look into switching to data.table or at least dplyr packages for data manipulation.
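If the threshold applies to the running total of y rather than to y itself, the lookup can be wrapped in a small function. This is a sketch only; first_reached is a hypothetical helper name, not part of base R.

```r
# Hypothetical helper: index of the first element at which the running
# total of `values` reaches `threshold` (NA if it never does)
first_reached <- function(values, threshold) {
  which(cumsum(values) >= threshold)[1]
}

df <- data.frame(x = 10:1, y = 1:10)

i <- first_reached(df$y, 7)
i        # running totals are 1, 3, 6, 10, ... so this is row 4
df$x[i]  # value of x in that row, here 7
```

Under this reading the answer is row 4 with x equal to 7, which differs from the df$y >= 7 reading used above.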

Finding the minimum positive value

I guess I don't know which.min as well as I thought.
I'm trying to find the occurrence in a vector of a minimum value that is positive.
TIME <- c(0.00000, 4.47104, 6.10598, 6.73993, 8.17467, 8.80862, 10.00980, 11.01080, 14.78110, 15.51520, 16.51620, 17.11680)
I want to know for the values z of 1 to 19, the index of the above vector TIME containing the value that is closest to but above z. I tried the following code:
vec <- sapply(seq(1,19,1), function(z) which.min((z-TIME > 0)))
vec
#[1] 2 2 2 2 3 3 5 5 7 7 8 9 9 9 10 11 12 1 1
To my mind, the last two values of vec should be '12, 12'. The reason it's doing this is because it thinks that '0.0000' is closest to 0.
So, I thought that maybe it was because I exported the data from external software and that 0.0000 wasn't really 0. But,
TIME[1]==0 #TRUE
Then I got further confused. Why do these give the answer of index 1, when really they should be an ERROR?
which.min(0 > 0 ) #1
which.min(-1 > 0 ) #1
I'll be glad to be put right.
EDIT:
I guess in a nutshell, what is the better way to get this result:
#[1] 2 2 2 2 3 3 5 5 7 7 8 9 9 9 10 11 12 12 12
which shows the index of TIME that gives the smallest possible positive value, when subtracting each element of TIME from the values of 1 to 19.
The natural function to use here (both to limit typing and for efficiency) is actually not which.min + sapply but the cut function, which will determine which range of times each of the values 1:19 falls into:
cut(1:19, breaks=TIME, right=FALSE)
# [1] [0,4.47) [0,4.47) [0,4.47) [0,4.47) [4.47,6.11) [4.47,6.11) [6.74,8.17)
# [8] [6.74,8.17) [8.81,10) [8.81,10) [10,11) [11,14.8) [11,14.8) [11,14.8)
# [15] [14.8,15.5) [15.5,16.5) [16.5,17.1) <NA> <NA>
# 11 Levels: [0,4.47) [4.47,6.11) [6.11,6.74) [6.74,8.17) [8.17,8.81) ... [16.5,17.1)
From this, you can easily determine what you're looking for, which is the index of the smallest element in TIME greater than the cutoff:
(x <- as.numeric(cut(1:19, breaks=TIME, right=FALSE))+1)
# [1] 2 2 2 2 3 3 5 5 7 7 8 9 9 9 10 11 12 NA NA
The last two entries appear as NA because there is no element in TIME that exceeds 18 or 19. If you wanted to replace these with the index of the last (largest) element of TIME, you could do so with replace:
replace(x, is.na(x), length(TIME))
# [1] 2 2 2 2 3 3 5 5 7 7 8 9 9 9 10 11 12 12 12
Here's one way:
x <- t(outer(TIME,1:19,`-`))
max.col(ifelse(x<0,x,Inf),ties="first")
# [1] 2 2 2 2 3 3 5 5 7 7 8 9 9 9 10 11 12 12 12
It's computationally wasteful to take all the differences in this way, since both vectors are ordered.
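Because TIME is sorted, the same indices can be obtained without forming every pairwise difference. A sketch using base R's findInterval(), which does a fast lookup in a sorted vector (the names z and idx are illustrative):

```r
TIME <- c(0.00000, 4.47104, 6.10598, 6.73993, 8.17467, 8.80862,
          10.00980, 11.01080, 14.78110, 15.51520, 16.51620, 17.11680)
z <- 1:19

# findInterval() counts how many elements of the sorted TIME are <= z;
# adding 1 gives the index of the first element strictly above z
idx <- findInterval(z, TIME) + 1

# Cap at the last index for values of z beyond max(TIME)
idx <- pmin(idx, length(TIME))
idx
# [1]  2  2  2  2  3  3  5  5  7  7  8  9  9  9 10 11 12 12 12
```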

Excel OFFSET function in r

I am trying to simulate the OFFSET function from Excel. I understand that this can be done for a single value but I would like to return a range. I'd like to return a group of values with an offset of 1 and a group size of 2. For example, on row 4, I would like to have a group with values of column a, rows 3 & 2. Sorry but I am stumped.
Is it possible to add this result to the data frame as another column using cbind or similar? Alternatively, could I use this in a vectorized function so I could sum or mean the result?
Mockup Example:
> df <- data.frame(a=1:10)
> df
a
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10
> #PROCESS
> df
a b
1 1 NA
2 2 (1)
3 3 (1,2)
4 4 (2,3)
5 5 (3,4)
6 6 (4,5)
7 7 (5,6)
8 8 (6,7)
9 9 (7,8)
10 10 (8,9)
This should do the trick:
df$b1 <- c(rep(NA, 1), head(df$a, -1))
df$b2 <- c(rep(NA, 2), head(df$a, -2))
Note that the result will have to live in two columns, as columns in data frames only support simple data types. (Unless you want to resort to complex numbers.) head with a negative argument drops that many elements from the tail; try head(1:10, -2). rep is repetition, c is concatenation. The <- assignment adds a new column if it's not there yet.
What Excel calls OFFSET is sometimes also referred to as lag.
EDIT: Following Greg Snow's comment, here's a version that's more elegant, but also more difficult to understand:
df <- cbind(df, as.data.frame((embed(c(NA, NA, df$a), 3))[,c(3,2)]))
Try it component by component to see how it works.
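Taken step by step, the embed() version looks roughly like this. The sketch below uses illustrative names (a, padded, m, lags) and also shows row-wise sums and means over the lagged values, which the question asks about.

```r
a <- 1:10

padded <- c(NA, NA, a)   # two leading NAs, one per lag
m <- embed(padded, 3)    # row i holds a[i], a[i-1], a[i-2]
m[1:4, ]

lags <- m[, c(3, 2)]     # keep a[i-2] and a[i-1], oldest first
lags[4, ]                # row 4: the values 2 and 3

# Row-wise summaries over each group of lagged values
rowSums(lags)                  # NA wherever a lag is missing
rowMeans(lags, na.rm = TRUE)   # NaN for row 1, where both lags are missing
```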
Do you want something like this?
> df <- data.frame(a=1:10)
> b=t(sapply(1:10, function(i) c(df$a[(i+2)%%10+1], df$a[(i+4)%%10+1])))
> s = sapply(1:10, function(i) sum(b[i,]))
> df = data.frame(df, b, s)
> df
a X1 X2 s
1 1 4 6 10
2 2 5 7 12
3 3 6 8 14
4 4 7 9 16
5 5 8 10 18
6 6 9 1 10
7 7 10 2 12
8 8 1 3 4
9 9 2 4 6
10 10 3 5 8

Binning a numeric variable

I have a vector X that contains positive numbers that I want to bin/discretize. For this vector, I want the numbers [0, 10) to show up just as they exist in the vector, but numbers [10,∞) to be 10+.
I'm using:
x <- c(0,1,3,4,2,4,2,5,43,432,34,2,34,2,342,3,4,2)
binned.x <- as.factor(ifelse(x > 10,"10+",x))
but this feels klugey to me. Does anyone know a better solution or a different approach?
How about cut:
binned.x <- cut(x, breaks = c(-1:9, Inf), labels = c(as.character(0:9), '10+'))
Which yields:
# [1] 0 1 3 4 2 4 2 5 10+ 10+ 10+ 2 10+ 2 10+ 3 4 2
# Levels: 0 1 2 3 4 5 6 7 8 9 10+
Your question is inconsistent.
In the description, 10 belongs to the "10+" group, but in the code 10 is a separate level.
If 10 should be in the "10+" group, then your code should be
as.factor(ifelse(x >= 10,"10+",x))
In this case you could truncate data to 10 (if you don't want a factor):
pmin(x, 10)
# [1] 0 1 3 4 2 4 2 5 10 10 10 2 10 2 10 3 4 2
x[x>=10]<-"10+"
This will give you a vector of strings. You can use as.numeric(x) to convert back to numbers ("10+" become NA), or as.factor(x) to get your result above.
Note that this will modify the original vector itself, so you may want to copy to another vector and work on that.
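To match the intervals [0, 10) and [10, ∞) as worded in the question, including for values that are not whole numbers, cut() can be asked for left-closed intervals. A minimal sketch, assuming each value below 10 should be labelled by its integer part (the name binned is illustrative):

```r
x <- c(0, 1, 3, 4, 2, 4, 2, 5, 43, 432, 34, 2, 34, 2, 342, 3, 4, 2)

# right = FALSE gives left-closed intervals: [0,1), [1,2), ..., [9,10), [10,Inf)
binned <- cut(x, breaks = c(0:10, Inf), right = FALSE,
              labels = c(as.character(0:9), "10+"))

table(binned)
```

With these breaks a value of exactly 10 lands in "10+", and a value such as 9.5 is labelled "9".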

Resources