Say I apply the cut function to seq(15) like this: cut(seq(15), 5).
I get a factor giving the bin into which each element falls. What if I want to extract the elements of the third level? How can I refer to the elements that fall in the 3rd bin after cutting the sequence?
Addressing Arun's comment: I will provide the cut function a vector of breaks like this: temp <- cut(seq(15), c(.9,4,8,12,15)). I am looking for the elements of seq(15) that fall in the 3rd level. They are 9, 10, 11, 12. There is already an answer below that worked.
You can use labels=F to get
cut(seq(15),5,labels=F)
[1] 1 1 1 2 2 2 3 3 3 4 4 4 5 5 5
Then
x <- seq(15)
> x[cut(x,5,labels=F)==3]
[1] 7 8 9
Your question is poorly worded and somewhat ambiguous, but you can use basic indexing for this:
temp <- cut(seq(15), 5)
temp[temp == levels(temp)[3]]
# [1] (6.6,9.4] (6.6,9.4] (6.6,9.4]
# Levels: (0.986,3.79] (3.79,6.6] (6.6,9.4] (9.4,12.2] (12.2,15]
Or, if you wanted the relevant values from seq(15):
seq(15)[temp == levels(temp)[3]]
# [1] 7 8 9
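The same indexing idea carries over to the explicit breaks mentioned in the question's edit. A minimal sketch, assuming the breaks c(.9, 4, 8, 12, 15) given there; the names third and by_bin are only illustrative:

```r
x <- seq(15)
temp <- cut(x, c(.9, 4, 8, 12, 15))

# as.integer() on a factor returns the level codes, so this keeps
# the elements of x whose bin is the 3rd level, (8,12]
third <- x[as.integer(temp) == 3]
third
# [1]  9 10 11 12

# split() groups the original values by bin; the 3rd list element is the same set
by_bin <- split(x, temp)
by_bin[[3]]
# [1]  9 10 11 12
```

Comparing level codes avoids having to retype the level label, which depends on how cut formats the breaks.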
I have the following vector in R:
> A<-c(8.1915935, 3.0138083, 0.3245712, 10.7353747, 13.7505131 ,63.2337407, 16.7505131, 5.7781297)
I want to sort it and, at the same time, know each element's position in the sorted vector. So I use the following function:
sort(A, index.return=T)
And I get the following output, which I don't clearly understand:
$x
[1] 0.3245712 3.0138083 5.7781297 8.1915935 10.7353747 13.7505131 16.7505131 63.2337407
$ix
[1] 3 2 8 1 4 5 7 6
Looking at the original vector A, the first element goes in the 4th position of the sorted vector. So the first element of "$ix" should be 4. Why is it 3?
Then, the biggest number in the vector is the 6th element of A. But the 6th element of $ix is not 8 (the length of the vector), as I expected, but 5. Why?
And so on, for all the elements. Clearly, there is something I don't understand about this output.
$ix is indicating the position of the elements of x in the original vector; you were hoping for the reverse -- the location of the elements in the original vector in x. The difference is between order() and rank()
> order(A)
[1] 3 2 8 1 4 5 7 6
> rank(A)
[1] 4 2 1 5 6 8 7 3
Note that order(order(A)) == rank(A) when A has no ties, so one way to get the answer you're looking for is
result <- sort(A, index.return = TRUE)
order(result$ix)
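To make the order/rank distinction concrete, here is a small sketch using the vector A from the question. The second vector B is a made-up example with ties, added only to show where the identity stops holding with rank's default tie handling:

```r
A <- c(8.1915935, 3.0138083, 0.3245712, 10.7353747,
       13.7505131, 63.2337407, 16.7505131, 5.7781297)

# position of each original element in the sorted vector
pos <- order(order(A))
pos
# [1] 4 2 1 5 6 8 7 3

# without ties this matches rank(), and sort(A)[pos] rebuilds A
all(pos == rank(A))
all(sort(A)[pos] == A)

# with ties, rank() averages by default while order(order()) does not
B <- c(2, 1, 2)
rank(B)
# [1] 2.5 1.0 2.5
order(order(B))
# [1] 2 1 3
rank(B, ties.method = "first")
# [1] 2 1 3
```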
When we want a sequence in R, we use either construction:
> 1:5
[1] 1 2 3 4 5
> seq(1,5)
[1] 1 2 3 4 5
This produces a sequence from start to stop (inclusive).
Is there a way to generate a sequence from start to stop (exclusive)? Like:
[1] 1 2 3 4
Also, I don't want to use a workaround like a minus operator, like:
seq(1,5-1)
This is because I would like the statements in my code to be elegant and concise. In my real-world example the start and stop are not hardcoded integers but descriptive variable names. Using the variable_name - 1 construction just makes my script uglier and harder for a reviewer to read.
PS: The difference between this question and the one at "remove the last element of a vector" is that I am asking about sequence generation, while the other question focuses on removing the last element of a vector.
Moreover, the answers provided here are different and relevant to my problem.
One possible solution would be
head(1:5, -1)
# [1] 1 2 3 4
or you could define your own function
seq_last_exclusive <- function(x) return(x[-length(x)])
seq_last_exclusive(1:5)
# [1] 1 2 3 4
We can use the following function
f <- function(start, stop, ...) {
  if(identical(start, stop)) {
    return(vector("integer", 0))
  }
  seq.int(from = start, to = stop - 1L, ...)
}
Test
f(1, 5)
# [1] 1 2 3 4
f(1, 1)
# integer(0)
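Another way to sketch a half-open sequence is to build it from its length with seq_len, which returns an empty vector for a length of zero. The helper name seq_excl is hypothetical, and the sketch assumes integer-valued endpoints with a step of 1:

```r
# Half-open sequence [from, to): 'to' itself is excluded.
# Assumes whole-number endpoints; returns an empty vector when to <= from.
seq_excl <- function(from, to) {
  n <- max(to - from, 0)   # number of elements to generate
  from + seq_len(n) - 1L
}

seq_excl(1, 5)
# [1] 1 2 3 4
length(seq_excl(1, 1))
# [1] 0
length(seq_excl(5, 1))
# [1] 0
```

Because the length is computed numerically, a call such as seq_excl(1, 1L) also comes back empty, even though identical(1, 1L) is FALSE.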
First of all, sorry for this question. I suppose it's super basic but I can't find the right search terms. For a vector a, let's say:
a<-c(1,1,3,2,1)
I want to get a vector b which results from summing element by element:
>b
1 2 5 7 8
it would be something like:
x<-2
b<-as.vector(a[1])
while(x<=length(a)) {
  c<-a[x]+b[x-1]
  b=c(b,c)
  x=x+1
}
rm(x,c)
but isn't there a built-in function for this?
You are looking for cumsum:
a = c(1,1,3,2,1)
R> cumsum(a)
[1] 1 2 5 7 8
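For comparison, the running total can also be written as a fold with Reduce and accumulate = TRUE, which generalizes to operations that have no dedicated cumulative function. This is only a sketch to show the equivalence; cumsum is the idiomatic choice here:

```r
a <- c(1, 1, 3, 2, 1)

b <- cumsum(a)
b
# [1] 1 2 5 7 8

# the same running total expressed as an accumulating fold
b_reduce <- Reduce(`+`, a, accumulate = TRUE)
b_reduce
# [1] 1 2 5 7 8
```

Base R also provides cumprod, cummax and cummin for the analogous running product, maximum and minimum.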
I'm trying to get a handle on the ubiquitous which function. Until I started reading questions/answers on SO I never found the need for it. And I still don't.
As I understand it, which takes a Boolean vector and returns a weakly shorter vector containing the indices of the elements which were true:
> seq(10)
[1] 1 2 3 4 5 6 7 8 9 10
> x <- seq(10)
> tf <- (x == 6 | x == 8)
> tf
[1] FALSE FALSE FALSE FALSE FALSE TRUE FALSE TRUE FALSE FALSE
> w <- which(tf)
> w
[1] 6 8
So why would I ever use which instead of just using the Boolean vector directly? I could maybe see some memory issues with huge vectors, since length(w) << length(tf), but that's hardly compelling. And there are some options in the help file which don't add much to my understanding of possible uses of this function. The examples in the help file aren't of much help either.
Edit for clarity-- I understand that which returns the indices. My question is about two things: 1) why would you ever need to use the indices instead of just using the Boolean selector vector? and 2) what interesting behaviors of which might make it preferable to just using a vectorized Boolean comparison?
Okay, here is something where it proved useful last night:
In a given vector of values what is the index of the 3rd non-NA value?
> x <- c(1,NA,2,NA,3)
> which(!is.na(x))[3]
[1] 5
A little different from DWin's use, although I'd say his is compelling too!
The title of the man page ?which provides a motivation. The title is:
Which indices are TRUE?
Which I interpret as being the function one might use if you want to know which elements of a logical vector are TRUE. This is inherently different to just using the logical vector itself. That would select the elements that are TRUE, not tell you which of them was TRUE.
Common use cases were to get the position of the maximum or minimum values in a vector:
> set.seed(2)
> x <- runif(10)
> which(x == max(x))
[1] 5
> which(x == min(x))
[1] 7
Those were so commonly used that which.max() and which.min() were created:
> which.max(x)
[1] 5
> which.min(x)
[1] 7
However, note that the specific forms are not exact replacements for the generic form. See ?which.min for details. One example is below:
> x <- c(4,1,1)
> which.min(x)
[1] 2
> which(x==min(x))
[1] 2 3
Two very compelling reasons not to forget which:
1) When you use "[" to extract from a dataframe, any calculation in the row position that results in NA will get a junk row returned. Using which removes the NA's. You can use subset or %in%, which do not create the same problem.
> dfrm <- data.frame( a=sample(c(1:3, NA), 20, replace=TRUE), b=1:20)
> dfrm[dfrm$a >0, ]
a b
1 1 1
2 3 2
NA NA NA
NA.1 NA NA
NA.2 NA NA
6 1 6
NA.3 NA NA
8 3 8
# Snipped remaining rows
2) When you need the array indicators.
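Both points can be reproduced deterministically with small made-up objects (the data frame and matrix below are illustrative, not the random sample used above):

```r
dfrm <- data.frame(a = c(1, NA, 3, NA), b = 1:4)

# 1) a logical index containing NA yields all-NA rows ...
with_na <- dfrm[dfrm$a > 0, ]
nrow(with_na)
# [1] 4

# ... while which() drops the NA positions before indexing
clean <- dfrm[which(dfrm$a > 0), ]
nrow(clean)
# [1] 2

# 2) arr.ind = TRUE returns (row, col) pairs for a matrix
m <- matrix(c(1, NA, 3, 4), ncol = 2)
hits <- which(m > 2, arr.ind = TRUE)
hits
```

Here hits has one row per matching cell, with columns named row and col.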
which can be useful (it saves both computer and human resources), e.g. if you have to filter the elements of a data frame/matrix by a given variable/column and update other variables/columns based on that. Example:
df <- mtcars
Instead of:
df$gear[df$hp > 150] <- mean(df$gear[df$hp > 150])
You could do:
p <- which(df$hp > 150)
df$gear[p] <- mean(df$gear[p])
Another case is when you have to filter already-filtered elements in a way that cannot be done with a simple & or |, e.g. when you have to update some parts of a data frame based on other data tables. In that case you need to store (at least temporarily) the indexes of the filtered elements.
Another use that comes to mind is when you have to loop through part of a data frame/matrix, or do other kinds of transformations that require knowing the indexes of several cases. Example:
urban <- which(USArrests$UrbanPop > 80)
> USArrests[urban, ] - USArrests[urban-1, ]
Murder Assault UrbanPop Rape
California 0.2 86 41 21.1
Hawaii -12.1 -165 23 -5.6
Illinois 7.8 129 29 9.8
Massachusetts -6.9 -151 18 -11.5
Nevada 7.9 150 19 29.5
New Jersey 5.3 102 33 9.3
New York -0.3 -31 16 -6.0
Rhode Island -2.9 68 15 -6.6
Sorry for the dummy examples; I know it does not make much sense to compare the most urbanized states of the USA with the states just before them in the alphabet, but I hope this makes sense :)
Checking out which.min and which.max gives some clue also, as they save typing. Example:
> row.names(mtcars)[which.max(mtcars$hp)]
[1] "Maserati Bora"
Well, I found one possible reason. At first I thought it might be the ,useNames option, but it turns out that simple boolean selection does that too.
However, if your object of interest is a matrix, you can use the ,arr.ind option to return the result as (row,column) ordered pairs:
> x <- matrix(seq(10),ncol=2)
> x
[,1] [,2]
[1,] 1 6
[2,] 2 7
[3,] 3 8
[4,] 4 9
[5,] 5 10
> which((x == 6 | x == 8),arr.ind=TRUE)
row col
[1,] 1 2
[2,] 3 2
> which((x == 6 | x == 8))
[1] 6 8
That's a handy trick to know about, but hardly seems to justify its constant use.
Surprised no one has answered this: how about memory efficiency?
If you have a long vector of very sparse TRUE's, then keeping track of only the indices of the TRUE values will probably be much more compact.
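A rough way to check this claim is to compare object.size for a sparse logical vector and for its indices. The vector length and the two TRUE positions below are arbitrary choices for illustration, and exact sizes depend on the R build, so none are quoted:

```r
# one million FALSE values with just two TRUEs
tf <- logical(1e6)
tf[c(10, 500000)] <- TRUE

idx <- which(tf)
idx
# [1]     10 500000

# the index vector stores 2 integers; the logical vector stores 1e6 elements
size_tf  <- as.numeric(object.size(tf))
size_idx <- as.numeric(object.size(idx))
size_idx < size_tf
# [1] TRUE
```

The advantage disappears when most elements are TRUE, since the index vector then has about as many elements as the logical one.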
I use it quite often in data exploration. For example, if I have a dataset of kids' data and see from summary that the max age is 23 (and should be 18), I might go:
sum(dat$age>18)
If that was 67, and I wanted to look closer I might use:
dat[which(dat$age>18)[1:10], ]
Also useful if you're making a presentation and want to pull out a snippet of data to demonstrate a certain oddity or what not.
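One caveat with the [1:10] form: if fewer than ten rows match, the missing positions become NA and produce all-NA rows. A small sketch with a made-up dat (the columns id and age are assumptions for illustration) shows the difference when head is used instead:

```r
dat <- data.frame(id = 1:6, age = c(12, 23, 17, 19, NA, 25))

sum(dat$age > 18, na.rm = TRUE)
# [1] 3

over <- which(dat$age > 18)

# indexing past the end of 'over' pads the result with NA rows
padded <- dat[over[1:10], ]
nrow(padded)
# [1] 10

# head() returns at most 10 indices and never pads
trimmed <- dat[head(over, 10), ]
nrow(trimmed)
# [1] 3
```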
I'm having some trouble with an rle command that is designed to find the point at which participants reach 8 contiguous ones in a row.
For example, if:
x <- c(0,1,0,1,1,1,1,1,1,1,1,1)
I want to return a value of 11.
Thanks to DWin, I've been using this piece of code:
which( rle(x2)$values==1 & rle(x2)$lengths >= 8)
sum(rle(x)$lengths[ 1:(min(which(rle(x)$lengths >= 8))-1) ]) + 8
I've been using this code successfully to process my data. However, I noticed that it made a mistake when processing one of my data files.
For example, if
x <- c(1,1,1,1,0,0,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,1,1,1,0,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1)
the code returns 19, which is the point at which eight contiguous zeros in a row are reached. I'm not sure what is going wrong or how to fix it.
Thanks in advance for your help.
Will
You need to paste the first line of code in its entirety into the second:
sum(rle(x)$lengths[ 1:(min(which( rle(x2)$values==1 & rle(x2)$lengths >= 8))-1) ]) + 8
[1] 39
However, here is another approach, using the function filter. This yields the same result in what I consider to be much more readable code:
which(filter(x2, rep(1/8, 8), sides=1) == 1)[1]
[1] 39
The filter function when used in this way essentially computes a moving average over a block of 8 values in the vector. I then return the position of the first value where the moving average equals 1.
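The moving-window idea can be checked on the short vector from the question, where the expected answer is 11. This sketch calls stats::filter explicitly (an assumption made here so that a loaded package with its own filter does not mask it) and also shows a rolling-sum variant, which avoids comparing fractions when the window length is not a power of two:

```r
x <- c(0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1)
n <- 8

# moving average over the current and the previous n - 1 values;
# the first n - 1 results are NA, and which() ignores them
ma <- stats::filter(x, rep(1 / n, n), sides = 1)
first_hit <- which(ma == 1)[1]
first_hit
# [1] 11

# same test with a rolling sum: a window of all ones sums to n
rs <- stats::filter(x, rep(1, n), sides = 1)
first_hit_sum <- which(rs == n)[1]
first_hit_sum
# [1] 11
```

If no window qualifies, which returns an empty vector and the [1] subscript gives NA.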
In the basic programming course I teach, I advise students to give proper names to subresults, and to inspect these subresults:
lengthOfrepeatsOfAnything<-rle(x)$lengths
#4 2 5 11 2 2 3 2 17
whichRepeatsAreOfOnes<-rle(x)$values==1
#TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE
repeatsOfOnesLength<-lengthOfrepeatsOfAnything * whichRepeatsAreOfOnes #TRUE = 1, FALSE=0
#4 0 5 0 2 0 3 0 17
whichRepeatOfOneAreLongerThanEight<-which(repeatsOfOnesLength >= 8)
#9
result<-NA
if(length(whichRepeatOfOneAreLongerThanEight)>0){
  firstRepeatOfOneAreLongerThanEight<-whichRepeatOfOneAreLongerThanEight[1]
  #9
  if(firstRepeatOfOneAreLongerThanEight==1){
    result<-8
  }
  else{
    repeatsBeforeFirstEightOnes<-1:(firstRepeatOfOneAreLongerThanEight-1)
    #1 2 3 4 5 6 7 8
    lengthsOfRepeatsBeforeFirstEightOnes<-lengthOfrepeatsOfAnything[repeatsBeforeFirstEightOnes]
    #4 2 5 11 2 2 3 2
    result<-sum(lengthsOfRepeatsBeforeFirstEightOnes) + 8
  }
}
I know it doesn't look as dandy as a oneline solution, but it helps to make things clear and to pick up errors... Besides: what if you look back at this code in 4 months? Which one will be easier to understand again?
My advice would be to break the code up into simpler pieces. As suggested by @Nick, you want to write code which can be easily debugged, and modular coding allows you to do that.
# find runs of 0s and 1s
run_01 = rle(x)
# find runs of 1's with length >= 8 and keep the first one
run_1 = with(run_01, which(values == 1 & lengths >= 8))[1]
# total length of the runs before it (seq_len also handles run_1 == 1)
start_pos = sum(run_01$lengths[seq_len(run_1 - 1)])
# add 8 to it
end_pos = start_pos + 8
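The rle-based steps can also be wrapped in a small function that covers the edge cases discussed in this thread: a qualifying run at the very start, several qualifying runs, and no qualifying run at all. The function name first_run_end and its arguments are hypothetical; this is a sketch, not the code from any of the answers:

```r
# Position at which the first run of `value` reaches length n, or NA if none does.
first_run_end <- function(x, n = 8, value = 1) {
  r <- rle(x)
  ends   <- cumsum(r$lengths)        # last position of each run
  starts <- ends - r$lengths + 1L    # first position of each run
  hit <- which(r$values == value & r$lengths >= n)
  if (length(hit) == 0) return(NA_integer_)
  starts[hit[1]] + n - 1
}

x_short <- c(0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1)
x_long  <- c(1,1,1,1,0,0,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,1,1,1,0,0,
             1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1)

first_run_end(x_short)
# [1] 11
first_run_end(x_long)
# [1] 39
first_run_end(c(0, 1, 1, 0))
# [1] NA
```

Testing the run value and the run length together is what keeps the long run of zeros in x_long from being picked up.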