R - Range of number sequences, or how can I get the reverse of 1:10 - r

I can't seem to find an elegant solution to finding ranges. For me, it would come down to this:
> seq(1:10)
[1] 1 2 3 4 5 6 7 8 9 10
I would like to get the reverse:
function(c(1,2,3,4,5,6,7,8,9,10))
result 1:10
Real world problem is that I have 1200 indices, some are 0, some are 1:
c(0,0,0,0,0,0,1,1,1,1,1,1,0,0,0,0,1,0,0,1,1,1,1,1,1,0,0,0,0,1,1,1,1,1)
And I would like the ranges/coordinates within the vector for each set of 0s and 1s.

Will this simple solution work?
> rev(seq(1:10))
[1] 10 9 8 7 6 5 4 3 2 1
> range(seq(1:10))
[1] 1 10
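For the second part of the question (the coordinates of each run of 0s and 1s), one possible approach is base R's rle(). This is a minimal sketch, not taken from the answers above; the names v, r, start, end and runs are made up for illustration.

```r
# Indicator vector from the question
v <- c(0,0,0,0,0,0,1,1,1,1,1,1,0,0,0,0,1,0,0,1,1,1,1,1,1,0,0,0,0,1,1,1,1,1)

# rle() collapses consecutive equal values into (value, length) pairs
r <- rle(v)

# The end of each run is the running total of the run lengths;
# the start follows from the end and the length
end   <- cumsum(r$lengths)
start <- end - r$lengths + 1

runs <- data.frame(value = r$values, start = start, end = end)
runs
```

Each row of runs then describes one block, e.g. the first block of 1s spans positions 7 to 12.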

Related

Finding matching position of numeric values in R

The numeric variable weitage is given like,
> weitage
[1] 20 10 50 10 5 5
Then,
sort_wei<-sort(weitage,decreasing = T)
sort_wei
[1] 50 20 10 10 5 5
match(sort_wei,weitage)
results in 3 1 2 2 5 5. But the positions I actually need are 3 1 2 4 5 6. How can I get these positions? Can I use match() in R?
We can try using the order function, which returns the indices of the input vector according to some sort order:
order(weitage, decreasing=TRUE)
#[1] 3 1 2 4 5 6
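A short sketch of why the two functions differ, using the vector from the question (idx is just an illustrative name): match() always reports the first position of a value, so duplicates collapse onto one index, while order() returns a permutation in which every position appears exactly once.

```r
weitage <- c(20, 10, 50, 10, 5, 5)

# match() returns the position of the FIRST occurrence of each value,
# so both 10s map to position 2
match(c(10, 10), weitage)
# [1] 2 2

# order() returns a permutation of 1..length(weitage)
idx <- order(weitage, decreasing = TRUE)

# Indexing with idx reproduces the sorted vector
weitage[idx]
# [1] 50 20 10 10  5  5
```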

cumsum the opposite of diff in r

I have a question and I'm not sure if I'm being totally stupid here or if this is a genuine problem, or if I've misunderstood what these functions do.
Is the opposite of diff the same as cumsum? I thought it was. However, using this example:
dd <- c(17.32571,17.02498,16.71613,16.40615,
16.10242,15.78516,15.47813,15.19073,
14.95551,14.77397)
par(mfrow = c(1,2))
plot(dd)
plot(cumsum(diff(dd)))
> dd
[1] 17.32571 17.02498 16.71613 16.40615 16.10242 15.78516 15.47813 15.19073 14.95551
[10] 14.77397
> cumsum(diff(dd))
[1] -0.30073 -0.60958 -0.91956 -1.22329 -1.54055 -1.84758 -2.13498 -2.37020 -2.55174
These aren't the same. Where have I gone wrong?
AHHH! Fridays.
Obviously
The functions are quite different: diff(x) returns a vector of length (length(x)-1) which contains the differences between each element and the next in a vector x, while cumsum(x) returns a vector of the same length as x containing the running (cumulative) sums of the elements in x.
Example:
x <- c(1:10)
#[1] 1 2 3 4 5 6 7 8 9 10
> diff(x)
#[1] 1 1 1 1 1 1 1 1 1
v <- cumsum(x)
> v
#[1] 1 3 6 10 15 21 28 36 45 55
The function cumsum() computes the cumulative sum, so each entry v[i] of the vector it returns is the sum of all elements of x from x[1] to x[i]. In contrast, diff(x) only takes the difference between one element x[i] and the next, x[i+1].
The combination of cumsum and diff leads to different results, depending on the order in which the functions are executed:
> cumsum(diff(x))
# 1 2 3 4 5 6 7 8 9
Here the result is the cumulative sum of a sequence of nine "1". Note that if this result is compared with the original vector x, the last entry 10 is missing.
On the other hand, by calculating
> diff(cumsum(x))
# 2 3 4 5 6 7 8 9 10
one obtains a vector that is again similar to the original vector x, but now the first entry 1 is missing.
In neither case is the original vector restored, so it cannot be stated that cumsum() is the opposite or inverse function of diff().
You forgot to account for the impact of the first element
dd == c(dd[[1]], dd[[1]] + cumsum(diff(dd)))
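Because these are floating-point values, an element-wise == comparison may be fragile. A minimal sketch of the same idea using all.equal(), which compares within a small tolerance (rebuilt is an illustrative name):

```r
dd <- c(17.32571, 17.02498, 16.71613, 16.40615,
        16.10242, 15.78516, 15.47813, 15.19073,
        14.95551, 14.77397)

# Put the first element back: the running sum of the differences
# only describes how far each value has moved from dd[1]
rebuilt <- dd[1] + c(0, cumsum(diff(dd)))

# Compare with a numerical tolerance instead of exact equality
isTRUE(all.equal(dd, rebuilt))
# [1] TRUE
```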
@RHertel answered it well, stating that diff() returns a vector with length(x)-1.
Therefore, another simple workaround would be to add 0 to the beginning of the original vector so that diff() computes the difference between x[1] and 0.
> x <- 5:10
> x
#[1] 5 6 7 8 9 10
> diff(x)
#[1] 1 1 1 1 1
> diff(c(0,x))
#[1] 5 1 1 1 1 1
This way, diff() combined with c() acts as the inverse of cumsum(), in either order:
> cumsum(diff(c(0,x)))
#[1] 5 6 7 8 9 10
> diff(c(0,cumsum(x)))
#[1] 5 6 7 8 9 10
If you know the values of lag and differences, you can use diffinv():
x<-5:10
y<-diff(x,lag=1,differences=1)
z<-diffinv(y,lag=1,differences = 1,xi=5) #xi is the first value of x.
k<-as.data.frame(cbind(x,z))
k
x z
1 5 5
2 6 6
3 7 7
4 8 8
5 9 9
6 10 10

Summing a column to a certain value

I have a data.frame with 2 variables and 177 observations. I would like to sum up one variable to a certain value, and then get the value of the other variable when that threshold is reached. I will try to add a reproducible example. I am new here, so forgive me if I do it wrong.
> df <- data.frame(x=10:1,y=1:10)
> print(df)
x y
1 10 1
2 9 2
3 8 3
4 7 4
5 6 5
6 5 6
7 4 7
8 3 8
9 2 9
10 1 10
How can I sum column y until it reaches a certain value, let's say 7, and then have it return either the value of x (4) or the row number (7)? I am sure it is pretty straightforward, but I seem to be drawing a blank.
Here is my solution.
df[cumsum(df$y) <= 7,]
x y
1 10 1
2 9 2
3 8 3
The OP just asked for the relevant value of x which would be done using:
df$x[which(cumsum(df$y) >= 10)[1]]
Also note that this finds the first row where cumsum(df$y) is at least 10, whereas the other answers find the last row where it is <= 7, which is potentially different (though not for this dataset). For the original question (pre-comment), it would need to be:
df$x[which(cumsum(df$y) > 7)[1]]
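A step-by-step sketch of the cumulative-sum lookup, here with a threshold of 7 and the >= comparison (running and row are illustrative names; whether > or >= is wanted depends on how "reaches" is read):

```r
df <- data.frame(x = 10:1, y = 1:10)

# Running total of y: 1 3 6 10 15 21 28 36 45 55
running <- cumsum(df$y)

# First row whose running total is at least 7
# (which(...)[1] is NA if the threshold is never reached)
row <- which(running >= 7)[1]
row
# [1] 4

# Value of x in that row
df$x[row]
# [1] 7
```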
If you want to stay with base R, try this
> df$x[df$y >= 7][1]
[1] 4
> max(cumsum(df$y[df$y <= 7]))
[1] 28
Or if you need this in a matrix form:
> cbind(df$x[df$y >= 7][1], max(cumsum(df$y[df$y <= 7])))
[,1] [,2]
[1,] 4 28
I would still look into switching to data.table or at least dplyr packages for data manipulation.

Finding the minimum positive value

I guess I don't know which.min as well as I thought.
I'm trying to find the occurrence in a vector of a minimum value that is positive.
TIME <- c(0.00000, 4.47104, 6.10598, 6.73993, 8.17467, 8.80862, 10.00980, 11.01080, 14.78110, 15.51520, 16.51620, 17.11680)
I want to know for the values z of 1 to 19, the index of the above vector TIME containing the value that is closest to but above z. I tried the following code:
vec <- sapply(seq(1,19,1), function(z) which.min((z-TIME > 0)))
vec
#[1] 2 2 2 2 3 3 5 5 7 7 8 9 9 9 10 11 12 1 1
To my mind, the last two values of vec should be '12, 12'. The reason it's doing this is because it thinks that '0.0000' is closest to 0.
So, I thought that maybe it was because I exported the data from external software and that 0.0000 wasn't really 0. But,
TIME[1]==0 #TRUE
Then I got further confused. Why do these give the answer of index 1, when really they should be an ERROR?
which.min(0 > 0 ) #1
which.min(-1 > 0 ) #1
I'll be glad to be put right.
EDIT:
I guess in a nutshell, what is the better way to get this result:
#[1] 2 2 2 2 3 3 5 5 7 7 8 9 9 9 10 11 12 12 12
which shows the index of TIME that gives the smallest possible positive value, when subtracting each element of TIME from the values of 1 to 19.
The natural function to use here (both to limit typing and for efficiency) is actually not which.min + sapply but the cut function, which will determine which range of times each of the values 1:19 falls into:
cut(1:19, breaks=TIME, right=FALSE)
# [1] [0,4.47) [0,4.47) [0,4.47) [0,4.47) [4.47,6.11) [4.47,6.11) [6.74,8.17)
# [8] [6.74,8.17) [8.81,10) [8.81,10) [10,11) [11,14.8) [11,14.8) [11,14.8)
# [15] [14.8,15.5) [15.5,16.5) [16.5,17.1) <NA> <NA>
# 11 Levels: [0,4.47) [4.47,6.11) [6.11,6.74) [6.74,8.17) [8.17,8.81) ... [16.5,17.1)
From this, you can easily determine what you're looking for, which is the index of the smallest element in TIME greater than the cutoff:
(x <- as.numeric(cut(1:19, breaks=TIME, right=FALSE))+1)
# [1] 2 2 2 2 3 3 5 5 7 7 8 9 9 9 10 11 12 NA NA
The last two entries appear as NA because there is no element in TIME that exceeds 18 or 19. If you want to replace these with the index of the largest (last) element in TIME, you can do so with replace:
replace(x, is.na(x), length(TIME))
# [1] 2 2 2 2 3 3 5 5 7 7 8 9 9 9 10 11 12 12 12
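On the side question of why which.min(0 > 0) returns 1 rather than an error: which.min() accepts logical input and treats FALSE as 0 and TRUE as 1, so it returns the position of the first FALSE, or 1 when every element is the same. A small sketch (flags is an illustrative name):

```r
# A logical vector is treated as 0/1, so the "minimum" is the first FALSE
flags <- c(TRUE, TRUE, FALSE, TRUE, FALSE)
which.min(flags)
# [1] 3

# With a single FALSE, position 1 is the minimum
which.min(0 > 0)
# [1] 1

# When every element is TRUE (as for z = 18 and z = 19 in the question),
# all values tie and the first position is returned
which.min(c(TRUE, TRUE, TRUE))
# [1] 1
```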
Here's one way:
x <- t(outer(TIME,1:19,`-`))
max.col(ifelse(x<0,x,Inf),ties="first")
# [1] 2 2 2 2 3 3 5 5 7 7 8 9 9 9 10 11 12 12 12
It's computationally wasteful to take all the differences in this way, since both vectors are ordered.
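Since TIME is sorted, one alternative that avoids the full matrix of differences is findInterval(), which looks up each value in the sorted vector. This is a sketch rather than one of the posted answers; idx is an illustrative name, and capping at length(TIME) is an assumption that mirrors the desired output in the question.

```r
TIME <- c(0.00000, 4.47104, 6.10598, 6.73993, 8.17467, 8.80862,
          10.00980, 11.01080, 14.78110, 15.51520, 16.51620, 17.11680)

# findInterval() counts how many elements of TIME are <= each z,
# so adding 1 gives the index of the first element above z
idx <- findInterval(1:19, TIME) + 1

# For z beyond the last element, fall back to the last index
idx <- pmin(idx, length(TIME))
idx
# [1]  2  2  2  2  3  3  5  5  7  7  8  9  9  9 10 11 12 12 12
```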

Binning a numeric variable

I have a vector X that contains positive numbers that I want to bin/discretize. For this vector, I want the numbers [0, 10) to show up just as they exist in the vector, but numbers [10,∞) to be 10+.
I'm using:
x <- c(0,1,3,4,2,4,2,5,43,432,34,2,34,2,342,3,4,2)
binned.x <- as.factor(ifelse(x > 10,"10+",x))
but this feels klugey to me. Does anyone know a better solution or a different approach?
How about cut:
binned.x <- cut(x, breaks = c(-1:9, Inf), labels = c(as.character(0:9), '10+'))
Which yields:
# [1] 0 1 3 4 2 4 2 5 10+ 10+ 10+ 2 10+ 2 10+ 3 4 2
# Levels: 0 1 2 3 4 5 6 7 8 9 10+
Your question is inconsistent.
In the description, 10 belongs to the "10+" group, but in the code, 10 is a separate level.
If 10 should be in the "10+" group, then your code should be
as.factor(ifelse(x >= 10,"10+",x))
In this case you could truncate data to 10 (if you don't want a factor):
pmin(x, 10)
# [1] 0 1 3 4 2 4 2 5 10 10 10 2 10 2 10 3 4 2
x[x>=10]<-"10+"
This will give you a vector of strings. You can use as.numeric(x) to convert back to numbers ("10+" becomes NA, with a warning), or as.factor(x) to get your result above.
Note that this will modify the original vector itself, so you may want to copy to another vector and work on that.
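A sketch that leaves x untouched and fixes the level order explicitly, assuming (as in the example data) that the values below 10 are whole numbers; binned is an illustrative name:

```r
x <- c(0, 1, 3, 4, 2, 4, 2, 5, 43, 432, 34, 2, 34, 2, 342, 3, 4, 2)

# Build the labels in a new vector so that x itself stays numeric
labels <- ifelse(x >= 10, "10+", x)

# Spell out the levels so they sort as 0, 1, ..., 9, 10+
# even when some of them do not occur in the data
binned <- factor(labels, levels = c(0:9, "10+"))

table(binned)
```

Listing the levels explicitly avoids the alphabetical ordering that as.factor() would apply to the strings.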
