rank() function in R is ranking objects with floating points rather than integers

I'm quite new to R so this may seem quite trivial to many experienced programmers, sorry in advance!
I've got a numeric vector of length 8 that looks like this:
data <- c(45, 67, 23, 24, 5, 23, 45, 23)
When I type in: rank(data), R returns: [1] 6.5 8.0 3.0 5.0 1.0 3.0 6.5 3.0
However with my (very basic) understanding of rank, I expect R to return to me only whole numbers... such as:
[1] 6 8 2 5 1 3 7 4
How can rank() tell me the first element in data has a floating point ranking rather than a whole number ranking? Is it because there are values in data that are repeated and so rank() is trying to handle ties in a way that I am not expecting? If so, please tell me how I can fix this so I can get output that looks like what I previously expected. Also, any information on how rank() deals with NA values would be much appreciated. A basic description on rank() and what bells and whistles can be used would be fantastic! I've looked for videos on youtube and searched stackoverflow to no avail! Thanks so much.

From ?rank:
With some values equal (called ‘ties’), the argument ties.method determines the result at the corresponding indices. The "first" method results in a permutation with increasing values at each index set of ties. The "random" method puts these in random order whereas the default, "average", replaces them by their mean, and "max" and "min" replaces them by their maximum and minimum respectively, the latter being the typical sports ranking.
Sounds like you're using the default setting of "average" for tie breaking, which uses the mean, which is not necessarily an integer.
The built-in documentation should always be your first stop in looking for help. In this case (and most cases), it details all the "bells and whistles"---here there aren't many: just tie-handling and NA-handling. It also has examples at the bottom.
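As a minimal sketch of the two options the help page describes, using the vector from the question (ties.method and na.last are rank()'s own argument names; the with_na vector is made up here purely for illustration):

```r
data <- c(45, 67, 23, 24, 5, 23, 45, 23)

# Default "average": tied values share the mean of the ranks they occupy
avg_ranks <- rank(data)

# "first": ties are broken by order of appearance, so ranks are whole numbers
first_ranks <- rank(data, ties.method = "first")

# "min": every tied value gets the lowest rank of its group (sports ranking)
min_ranks <- rank(data, ties.method = "min")

# NA handling: by default NAs are ranked last; na.last = "keep" leaves them NA
with_na <- c(3, NA, 1)
na_last_ranks <- rank(with_na)
na_keep_ranks <- rank(with_na, na.last = "keep")

print(avg_ranks)      # 6.5 8.0 3.0 5.0 1.0 3.0 6.5 3.0
print(first_ranks)    # 6 8 2 5 1 3 7 4
print(min_ranks)      # 6 8 2 5 1 2 6 2
print(na_last_ranks)  # 2 3 1
print(na_keep_ranks)  # 2 NA 1
```

With ties.method = "first" the output matches the whole-number ranking the question expected.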

Related

Finding the first significant figure of difference between two very similar values

I'm trying to reproduce the computations that led to a data set data.ref. I'd like to test how well my current implementation does by comparing the reference data to my computed results, data.my. Since each column of the data should have comparable magnitudes within the column, but not necessarily between columns, I've been looking at
(data.ref - data.my) / data.ref
to put errors on a comparable scale. However, since the data is ultimately going to be rounded off, what I'd really like to do is just run a quick and dirty check of how many significant figures' worth of agreement the data has. That is, since I expect data.ref and data.my to be quite close to each other, I'd like to answer the question: what is the first significant figure at which each pair of corresponding entries differs?
Is there an R function that does this?
ceiling(log10(abs(data.ref - data.my))) seems to do the trick.
Example:
> data.my <- c(20, 30, 32, 32.01, 32.012)
> data.ref <- rep(32, length(data.my))
> ceiling(log10(abs(data.my - data.ref)))
[1] 2 1 -Inf -2 -1
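The result above is the decimal position of the difference, not a count of agreeing digits. If a rough count of agreeing significant figures is wanted instead, one possible variation (my own assumption, not part of the answer above) is to take the relative error from the question and count its leading zeros. It is only an estimate: it can be off by one when the two values straddle a power of ten, and it is undefined wherever data.ref is zero.

```r
data.my  <- c(20, 30, 32, 32.01, 32.012)
data.ref <- rep(32, length(data.my))

# Relative error, as in the question, but in absolute value
rel.err <- abs(data.ref - data.my) / abs(data.ref)

# Rough number of leading significant figures on which the values agree;
# identical entries give Inf because log10(0) is -Inf
agree <- floor(-log10(rel.err))

print(agree)  # 0 1 Inf 3 3
```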

R print cutoff values under a certain value

I am trying to print only values over 1.1 for a factor analysis. I assumed the print command was what I wanted, but the cutoff argument didn't work.
Reproducible example:
print(c(1,2,3,.5),digits=2,cutoff=1.1,sort=T)
#returns: [1] 1.0 2.0 3.0 0.5
How can I get it to return only values over 1.1?
This is three years too late, but the cutoff argument does work when printing factor analysis loadings.
For example, in a factor analysis where I only want loadings larger than 0.3:
print(factanal(df,factoramount)$loadings, cutoff=0.3)
The print function normally doesn't have cutoff - you are probably looking at a special implementation of print since it is generic, which means it can have different implementations for different data types (see documentation).
To select elements by a criterion, you can do num[{criterion}], in this case num[num > 1.1] as @DatamineR suggested.
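Applied to the vector from the question, the subsetting approach might look like this (the names x and above are just illustrative):

```r
x <- c(1, 2, 3, 0.5)

# Logical subsetting keeps only the elements that satisfy the condition
above <- x[x > 1.1]

# Sorting is a separate step; the default print method has no sort argument
above_sorted <- sort(above, decreasing = TRUE)

print(above)         # 2 3
print(above_sorted)  # 3 2
```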

Find all combinations 6 numbers using only 0,1 and 2

Is there a simple way of finding all the combinations of 6 digits using only 0, 1 and 2?
So it starts with 000000 and finishes with 222222.
I have looked online, but all I can find is the formula for how many there are, and I need a list of all of them.
If there is code in R for this, that would be even better.
It is not completely necessary, but is there also a way to create a list where the 1st and 4th digits sum to a maximum of 2, the 2nd and 5th digits sum to a maximum of 2, and the 3rd and 6th digits sum to a maximum of 2?
Thank you
You can do:
do.call(paste0, expand.grid(rep(list(0:2), 6)))
Adding a rev in there gives a different order that might feel more natural:
do.call(paste0, rev(expand.grid(rep(list(0:2), 6))))
I will only give you a hint for your new (added) question as I am now worried I might be doing your homework. expand.grid returns a data.frame. With a little work on it, you can probably extract the subset of rows that only matter to you.
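For readers who want the hint spelled out, here is one way it could go. This is a sketch under my own naming (digits, keep, combos), not the answerer's code: filter the rows of the data.frame with a logical condition before pasting them together.

```r
# All 3^6 = 729 combinations; rev() makes the first column the leftmost digit
digits <- rev(expand.grid(rep(list(0:2), 6)))

# Keep rows where digits 1+4, 2+5 and 3+6 each sum to at most 2
keep <- digits[[1]] + digits[[4]] <= 2 &
        digits[[2]] + digits[[5]] <= 2 &
        digits[[3]] + digits[[6]] <= 2

combos <- do.call(paste0, digits[keep, ])

length(combos)  # 216
head(combos)
```

The count is a useful sanity check: each constrained pair has 6 allowed value combinations out of 9, so 6^3 = 216 strings should survive.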

R: Sample into bins of predefined sizes (partition sample vector)

I'm working on a dataset that consists of ~10^6 values which are clustered into a variable number of bins. In the course of my analysis, I am trying to randomize my clustering while keeping the bin sizes constant. As a toy example (in pseudocode), this would look something like this:
data <- list(c(1,5,6,3), c(2,4,7,8), c(9), c(10,11,15), c(12,13,14));
sizes <- lapply(data, length);
for (rand in 1:no.of.randomizations) {
rand.data <- partition.sample(seq(1,15), partitions=sizes, replace=F)
}
So, I am looking for a function like "partition.sample" that will take a vector (like seq(1,15)) and randomly sample from it, returning a list with the data partitioned into the right bin sizes given already by "sizes".
I've been trying to write one such function myself, since the task seems to be not so hard. However, the partitioning of a vector into given bin sizes looks like it would be a lot faster and more efficient if done "under the hood", meaning probably not in native R. So I wonder whether I have just missed the name of the appropriate function, or whether someone could please point me to a smart solution that is around :-)
Your help & time are very much appreciated! :-)
Best,
Lymond
UPDATE:
By "no.of.randomizations" I mean the actual number of times I run through the whole "randomization loop". This will, later on, obviously include more steps than just the actual sampling.
Moreover, I would in addition be interested in a trick to do the above feat for sampling without replacement.
Thanks in advance, your help is very much appreciated!
Revised: This should be fairly efficient. Its complexity should be primarily in the permutation step:
# A single step:
x <- sample( unlist(data))
list( one=x[1:4], two=x[5:8], three=x[9], four=x[10:12], five=x[13:15])
As mentioned above, "no.of.randomizations" may have been the number of repeated applications of this process, in which case you may want to wrap replicate around that:
replic <- replicate(n=4, { x <- sample(unlist(data))
list( x[1:4], x[5:8], x[9], x[10:12], x[13:15]) } )
After some more thinking and googling, I have come up with a feasible solution. However, I am still not convinced that this is the fastest and most efficient way to go.
In principle, I can generate one long vector holding a unique permutation of "data" and then split it into a list of vectors of lengths "sizes" by supplying a grouping factor to split. For this, I need an additional ID scheme for the different groups of "data", which I happen to have in my case.
It becomes clearer when viewed as code:
data <- list(c(1,5,6,3), c(2,4,7,8), c(9), c(10,11,15), c(12,13,14));
sizes <- lapply(data, length);
So far, everything as above
names <- c("set1", "set2", "set3", "set4", "set5");
In my case, I am lucky enough to have "names" already provided from the data. Otherwise, I would have to obtain them as (e.g.)
names <- seq(1, length(data));
This "names" vector can then be expanded by "sizes" using rep:
cut.by <- rep(names, times = sizes);
[1] 1 1 1 1 2 2 2 2 3 4 4 4 5
[14] 5 5
This new vector "cut.by" can then be provided as an argument to split()
rand.data <- split(sample(1:15, 15), cut.by)
$`1`
[1] 8 9 14 4
$`2`
[1] 10 2 15 13
$`3`
[1] 12
$`4`
[1] 11 3 5
$`5`
[1] 7 6 1
This does the job I was looking for alright. It samples from the background "1:15" and splits the result into vectors of lengths "sizes" through the vector "cut.by".
However, I am still not happy to have to go via an additional (possibly) long vector to indicate the split positions, such as "cut.by" in the code above. This definitely works, but for very long data vectors, it could become quite slow, I guess.
Thank you anyway for the answers and pointers provided! Your help is very much appreciated :-)
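Putting the pieces of this thread together, the helper the question asked for could be sketched roughly as follows. The name partition_sample and its arguments are hypothetical (no such function exists in base R); it simply wraps the sample() plus split() approach described above. Note that sample(x) behaves differently when x is a single number, so x is assumed to have length greater than one.

```r
# Hypothetical helper: shuffle x and cut it into bins of the given sizes
partition_sample <- function(x, sizes, replace = FALSE) {
  # Without replacement we cannot draw more values than x contains
  stopifnot(replace || sum(sizes) <= length(x))
  shuffled <- sample(x, size = sum(sizes), replace = replace)
  # One group label per drawn value; the factor levels keep empty bins
  groups <- factor(rep(seq_along(sizes), times = sizes),
                   levels = seq_along(sizes))
  unname(split(shuffled, groups))
}

data  <- list(c(1, 5, 6, 3), c(2, 4, 7, 8), c(9), c(10, 11, 15), c(12, 13, 14))
sizes <- lengths(data)

set.seed(42)  # only to make the illustration repeatable
rand.data <- partition_sample(1:15, sizes)
lengths(rand.data)  # 4 4 1 3 3
```

The grouping vector costs one extra integer per element, which for ~10^6 values is small next to the cost of the permutation itself, so the concern about "cut.by" being slow is probably not a practical problem; timing it on the real data would settle that.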

Cut function in R - exclusive or am I double counting?

Following on from a previous question of mine, which @Andrie answered, I have a question about the usage of the cut function and labels.
I'd like to get summary statistics based on the range of the number of times a user logs in.
Here is my data:
# Get random numbers
NumLogin <- round(runif(100,1,50))
# Set the login range
LoginRange <- cut(NumLogin,
c(0,1,3,5,10,15,20,Inf),
labels=c('1','2','3-5','6-10','11-15','16-20','20+')
)
Now I have my LoginRange, but I'm unsure how the cut function actually works. I want to find users who have logged in 1 time, 2 times, 3-5 times, etc, while only including the user if they are in that range. Is the cut function including 3 twice (In the 2 bucket and the 3-5 bucket)? If I look in my example, I can see a user who logged in 3 times, but they are cut as '2'. I've looked at the documentation and every R book I own, but no luck. What am I doing wrong?
Also - As a usage question - should I attach the LoginRange to my data frame? If so, what's the best way to do so?
DF <- data.frame(NumLogin, LoginRange)
?
Thanks
The intervals defined by the cut() function are (by default) closed on the right. To see what that means, try this:
cut(1:2, breaks=c(0,1,2))
# [1] (0,1] (1,2]
As you can see, the integer 1 gets included in the range (0,1], not in the range (1,2]. It doesn't get double-counted, and for any input value falling outside of the bins you define, cut() will return a value of NA.
When dealing with integer-valued data, I tend to set break points between the integers, just to avoid tripping myself up. In fact, doing this with your data (as shown below), reveals that the 2nd and 3rd bins were actually incorrectly named, which illustrates the point quite nicely!
LoginRange <- cut(NumLogin,
c(0.5, 1.5, 3.5, 5.5, 10.5, 15.5, 20.5, Inf),
# c(0,1,3,5,10,15,20,Inf) + 0.5,
labels=c('1','2-3','4-5','6-10','11-15','16-20','20+')
)
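A small check on integer input makes the interval behaviour easy to see. This sketch uses made-up values (logins) with the first few breaks from the question; right = FALSE is cut()'s own argument for closing intervals on the left instead:

```r
logins <- 1:6

# Default (right = TRUE): intervals are open on the left, closed on the right
default_bins <- cut(logins, breaks = c(0, 1, 3, 5, Inf))
as.character(default_bins)
# "(0,1]" "(1,3]" "(1,3]" "(3,5]" "(3,5]" "(5,Inf]"

# right = FALSE: intervals are closed on the left, open on the right
left_closed <- cut(logins, breaks = c(0, 1, 3, 5, Inf), right = FALSE)
as.character(left_closed)
# "[1,3)" "[1,3)" "[3,5)" "[3,5)" "[5,Inf)" "[5,Inf)"

# Each value lands in exactly one bin, so nothing is double counted
table(default_bins)
```

With the default settings the value 3 falls in (1,3], which is why a user with three logins ended up under the label '2' in the question.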
