How do I preserve continuous (1,2,3,...n) ranking notation when ranking in R?

If I want to rank a set of numbers using the minimum rank for shared cases (aka ties):
dat <- c(13,13,14,15,15,15,15,15,15,16,17,22,45,46,112)
rank(dat, ties.method = 'min')
I get the results:
1 1 3 4 4 4 4 4 4 10 11 12 13 14 15
However, I want the rank to be a continuous series consisting of 1,2,3,...n, where n is the number of unique ranks.
Is there a way to make rank (or a similar function) assign ties the lowest rank as above, but continue from the previous rank instead of skipping ahead by the number of tied values?
For example, I would like the above ranking to result in:
1 1 2 3 3 3 3 3 3 4 5 6 7 8 9

You could do it using dplyr:
library(dplyr)
dense_rank(dat)
[1] 1 1 2 3 3 3 3 3 3 4 5 6 7 8 9
If you don't want to load a whole package, you can do it in base R:
match(dat, sort(unique(dat)))
[1] 1 1 2 3 3 3 3 3 3 4 5 6 7 8 9

Use a factor and then bring it back to numeric format:
as.numeric(factor(rank(dat)))
# [1] 1 1 2 3 3 3 3 3 3 4 5 6 7 8 9
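To make the difference between the two kinds of ranking concrete, here is a minimal base R sketch that wraps the match/sort/unique idea from above in a function. The name dense_rank_base is only illustrative and not part of any package.

```r
dat <- c(13,13,14,15,15,15,15,15,15,16,17,22,45,46,112)

# Hypothetical helper: rank by position among the sorted unique values,
# so tied values share a rank and no rank numbers are skipped.
dense_rank_base <- function(x) match(x, sort(unique(x)))

gapped <- rank(dat, ties.method = "min")  # 1 1 3 4 4 4 4 4 4 10 ...
dense  <- dense_rank_base(dat)            # 1 1 2 3 3 3 3 3 3 4 ...
```

Because sort() drops missing values by default, an NA in the input gets an NA rank with this helper.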

Related

replace a given value within a column with the next different number in a row in R

I have a data set that will ultimately have about 30,000 observations. I have formatted a variable in such a way that the numerical values 1:4 are of interest, while the value 5 is a placeholder for a value that our testing instrument could not collect for one reason or another (I am not worried about why, or about missingness, etc.).
I am looking to turn any observation of 5, or series of observations of 5, into the next number in the observations. As can be seen in the example data set below, the first four observations have the number 5 while the next four observations are the number 4. In this situation I would like the first 4 observations to be changed from 5 to 4.
Note that after the 8th observation another series of 5s occurs, followed by a series of 3s. In this case the 5s should be changed to 3s.
In the code block below I have provided an example of what the current data look like, delineated by the column "Current." I have also provided a column of the desired output, delineated by the column name "Desired." The obs variable was helpful to create just to show the row number of the changes in values for the case of this post.
df <- data.frame(Current = c(5,5,5,5,4,4,4,4,5,5,3,3,3,5,3,3,5,5,2,5,5,5,1),
Desired = c(4,4,4,4,4,4,4,4,3,3,3,3,3,3,3,3,2,2,2,1,1,1,1))
df$obs = seq(1,nrow(df), by = 1)
You could use
library(tidyr)
library(dplyr)
df %>%
mutate(new_column = na_if(Current, 5)) %>%
fill(new_column, .direction = "up")
This returns
Current Desired new_column
1 5 4 4
2 5 4 4
3 5 4 4
4 5 4 4
5 4 4 4
6 4 4 4
7 4 4 4
8 4 4 4
9 5 3 3
10 5 3 3
11 3 3 3
12 3 3 3
13 3 3 3
14 5 3 3
15 3 3 3
16 3 3 3
17 5 2 2
18 5 2 2
19 2 2 2
20 5 1 1
21 5 1 1
22 5 1 1
23 1 1 1
We use dplyr's na_if function to convert the 5s into missing values.
Next we use tidyr's fill function to replace each NA with the next non-missing value below it.
You can use the following solution. I made use of the zoo::na.locf function, which takes the most recent non-NA value and carries it forward over all the NAs that follow. To fit this to your data set, I first replaced all values equal to 5 with NA, then reversed the vector, filled in the NAs, and finally reversed the result back to its original order:
library(dplyr)
library(zoo)
df %>%
mutate(Desired2 = ifelse(Current == 5, NA, Current),
Desired2 = rev(na.locf(rev(Desired2))))
Current Desired Desired2
1 5 4 4
2 5 4 4
3 5 4 4
4 5 4 4
5 4 4 4
6 4 4 4
7 4 4 4
8 4 4 4
9 5 3 3
10 5 3 3
11 3 3 3
12 3 3 3
13 3 3 3
14 5 3 3
15 3 3 3
16 3 3 3
17 5 2 2
18 5 2 2
19 2 2 2
20 5 1 1
21 5 1 1
22 5 1 1
23 1 1 1
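The same reverse, carry forward, reverse back idea can also be sketched in base R without zoo. This is only an illustration: fill_up is a hypothetical helper name, and it assumes, as in the question, that 5 is the only placeholder value.

```r
# Hypothetical helper: replace each NA with the next non-NA value below it.
fill_up <- function(x) {
  r <- rev(x)                             # reverse so "next" becomes "previous"
  ok <- !is.na(r)
  filled <- c(NA, r[ok])[cumsum(ok) + 1]  # carry the last seen value forward
  rev(filled)                             # restore the original order
}

Current <- c(5,5,5,5,4,4,4,4,5,5,3,3,3,5,3,3,5,5,2,5,5,5,1)
Desired2 <- fill_up(ifelse(Current == 5, NA, Current))
```

Any trailing 5s with no later value to borrow from would stay NA under this sketch.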

R Create column that provides grouping number for each distinct group [duplicate]

I need to add a column to my data that contains a number grouping for each distinct combination of other columns. It will likely be more clear with this example:
# Make data
df <- data.frame(x = c(1,1,2,3,4,5,2,3,4,5),
y = c(2, 2,3,4,5,1,3,4,5,1),
value = c(1,2,3,4,5,6,7,8,9,10))
# Print the data
df
x y value
1 1 2 1
2 1 2 2
3 2 3 3
4 3 4 4
5 4 5 5
6 5 1 6
7 2 3 7
8 3 4 8
9 4 5 9
10 5 1 10
I need to add a "Location" column that numbers each unique (or distinct) combination of x and y. Duplicated x and y combinations should all use the same number. In my example there are 5 unique combinations of x and y, so I have a maximum of 5 Locations. My goal output is this:
x y value Location
1 1 2 1 1
2 1 2 2 1
3 2 3 3 2
4 3 4 4 3
5 4 5 5 4
6 5 1 6 5
7 2 3 7 2
8 3 4 8 3
9 4 5 9 4
10 5 1 10 5
I imagine doing something like this:
df <- df %>%
group_by(x,y) %>%
mutate(Location = ndistinct(x,y)
But this doesn't work. Any help is appreciated!
Thanks!
df %>% mutate(., Location=group_indices(., x,y))
x y value Location
1 1 2 1 1
2 1 2 2 1
3 2 3 3 2
4 3 4 4 3
5 4 5 5 4
6 5 1 6 5
7 2 3 7 2
8 3 4 8 3
9 4 5 9 4
10 5 1 10 5
See here and here.
Not quite as straightforward as I thought to start with.
Update
To answer OP's question: the dot . is a placeholder for "the object on the left hand side of the pipe" (%>%). Normally you don't need it because, by default, magrittr (the package which defines the pipe) assumes that you want to use the object on the left hand side of the pipe as the first argument to the function on the right hand side of the pipe, and makes the substitution for you. This is very helpful because the tidyverse is designed so that the object on the left hand side of the pipe is always the first argument to the function on the right hand side - so you don't have to use the dot.
If you use functions that don't belong to the tidyverse, you sometimes need the dot to override magrittr's default behaviour.
I wrote my first version of this answer without testing the code because the solution seemed "obvious". But I did test it afterwards (at the same time as OP reported the error) and found that it didn't work. A quick Google brought me to the github issue in the second link above, and hence to the correct answer.
I don't yet understand why, in this particular case, a tidyverse function doesn't work as I expect. (Other than taking the easy way out and saying that my expectation was wrong!)
In base R we can use:
df$location <- as.numeric(factor(paste(df$x,df$y)))
x y value location
1 1 2 1 1
2 1 2 2 1
3 2 3 3 2
4 3 4 4 3
5 4 5 5 4
6 5 1 6 5
7 2 3 7 2
8 3 4 8 3
9 4 5 9 4
10 5 1 10 5
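One thing to keep in mind with the factor approach is that the levels of the pasted strings are sorted as text, so with larger values (for example 10 versus 2) the numbering may not follow numeric order. A minimal base R sketch that instead numbers the combinations in order of first appearance, assuming that ordering is acceptable, could look like this (key is just an illustrative name):

```r
df <- data.frame(x = c(1,1,2,3,4,5,2,3,4,5),
                 y = c(2,2,3,4,5,1,3,4,5,1),
                 value = 1:10)

# Build one key per row; the separator keeps (1, 23) distinct from (12, 3).
key <- paste(df$x, df$y, sep = "_")

# Number each key by the position of its first occurrence among the unique keys.
df$Location <- match(key, unique(key))
df$Location
# [1] 1 1 2 3 4 5 2 3 4 5
```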

Change the order of numerically named columns in r

If I have a dataframe like the one below which has numerical column names
example <- data.frame(`1`=c(1,8,3,9), `2`=c(3,2,3,3), `3`=c(5,2,5,4), `4`=c(1,2,3,4), `5`=c(2,5,7,8), check.names = FALSE)
Which looks like this:
1 2 3 4 5
1 3 5 1 2
8 2 2 2 5
3 3 5 3 7
9 3 4 4 8
And I want to arrange it so that the column names start with three and proceed through five and back to one, like this:
3 4 5 1 2
5 1 2 1 3
2 2 5 8 2
5 3 7 3 3
4 4 8 9 3
I know how to rearrange the position of a single column in a dataset, but I'm not sure how to do this with more than one column in this particular order.
We can index the columns by position, concatenating (c) two sequences (:) in the desired order
example[c(3:5, 1:2)]
# 3 4 5 1 2
#1 5 1 2 1 3
#2 2 2 5 8 2
#3 5 3 7 3 3
#4 4 4 8 9 3
As the column names are all numeric, just convert to numeric and use that for ordering
v1 <- as.numeric(names(example))
example[c(v1[3:5], v1[1:2])]
Or simply do
example[c(names(example)[3:5], names(example)[1:2])]
Or another way is with head and tail
example[c(tail(names(example), 3), head(names(example), 2))]
Data:
example <- data.frame(`1`=c(1,8,3,9), `2`=c(3,2,3,3),
`3`=c(5,2,5,4), `4`=c(1,2,3,4), `5`=c(2,5,7,8), check.names = FALSE)
R does not make it easy to create columns with numbers as names. If you do manage to create such columns, you can use match to get the positions of the column names in the order you want.
example[match(c(3:5, 1:2), names(example))]
# 3 4 5 1 2
#1 5 1 2 1 3
#2 2 2 5 8 2
#3 5 3 7 3 3
#4 4 4 8 9 3
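If the starting column changes from case to case, the same positional idea can be wrapped in a small function. This is a sketch only: rotate_cols is a hypothetical helper, and it assumes the columns should keep their current order and simply wrap around.

```r
example <- data.frame(`1` = c(1,8,3,9), `2` = c(3,2,3,3), `3` = c(5,2,5,4),
                      `4` = c(1,2,3,4), `5` = c(2,5,7,8), check.names = FALSE)

# Hypothetical helper: rotate the columns so that `start` comes first.
rotate_cols <- function(df, start) {
  i <- match(as.character(start), names(df))  # position of the starting column
  stopifnot(!is.na(i))                        # fail early on an unknown name
  df[c(i:ncol(df), seq_len(i - 1))]           # start..last, then first..start-1
}

names(rotate_cols(example, 3))
# [1] "3" "4" "5" "1" "2"
```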

Transforming a looping factor variable into a sequence of numerics

I have a factor variable with 6 levels, which simplified looks like:
1 1 2 2 2 3 3 3 4 4 4 4 5 5 5 6 6 6 1 1 1 2 2 2 2... 1 1 1 2 2... (with n = 78)
Note, that each number is repeated mostly but not always three times.
I need to transform this variable into the following pattern:
1 1 2 2 2 3 3 3 4 4 4 4 5 5 5 6 6 6 7 7 7 8 8 8 8...
where the count keeps ascending across each repetition of the 6 levels instead of starting again at 1.
Is there any way / any function that lets me do that?
Sorry for my bad description!
Assuming you have a numeric vector that represents the simplified version you posted, e.g. x = c(1,1,1,2,2,3,3,3,1,1,2,2), you can use this:
library(dplyr)
cumsum(x != lag(x, default = 0))
# [1] 1 1 1 2 2 3 3 3 4 4 5 5
This compares each value with the previous one and adds 1 whenever they differ (starting from 1).
Maybe you can try rle, i.e.,
r <- rle(x)
v <- rep(seq_along(r$values), r$lengths)
Example with dummy data
x = c(1,1,1,2,2,3,3,3,4,4,5,6,1,1,2,2,3,3,3,4,4)
then we can get
> v
[1] 1 1 1 2 2 3 3 3 4 4 5 6 7 7 8 8 9 9
[19] 9 10 10
In base you can use diff and cumsum.
c(1, cumsum(diff(x)!=0)+1)
# [1] 1 1 2 2 2 3 3 3 4 4 4 4 5 5 5 6 6 6 7 7 7 8 8 8 8
Data:
x <- c(1,1,2,2,2,3,3,3,4,4,4,4,5,5,5,6,6,6,1,1,1,2,2,2,2)
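The question describes a factor, while the snippets above use a numeric x. rle(), for one, does not accept a factor, so a cautious option is to work on the integer codes. The following base R sketch does that; run_id is a hypothetical helper name, and it assumes a new run starts whenever the level changes from one element to the next.

```r
# Hypothetical helper: give each run of identical consecutive levels its own id.
run_id <- function(f) {
  x <- as.integer(f)                       # integer codes of the factor levels
  if (length(x) == 0L) return(integer(0))  # nothing to number
  # TRUE at the first element and wherever a value differs from its predecessor
  cumsum(c(TRUE, x[-1] != x[-length(x)]))
}

f <- factor(c(1,1,2,2,2,3,3,3,4,4,4,4,5,5,5,6,6,6,1,1,1,2,2,2,2))
run_id(f)
# [1] 1 1 2 2 2 3 3 3 4 4 4 4 5 5 5 6 6 6 7 7 7 8 8 8 8
```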

What does subset(df, !duplicated(x)) do?

Looking for a detailed answer.
When we have a data frame (df) that contains three variables x, y, and z, what does the following command do?
subset(df, !duplicated(x))
The duplicated function traverses its argument sequentially and returns TRUE for each element that is identical to an earlier one. It is a generic function, so it has a default method (for vectors) as well as methods for other classes, such as data.frame. The subset function evaluates the expressions passed as its second or third argument inside the data frame, so column names can be used as if they were ordinary variables. This is called "non-standard evaluation". Note the negation operator: !duplicated(x) is TRUE only for the first occurrence of each value of x. So this call to subset returns the rows of the data frame in which each value of x appears for the first time, and the result has as many rows as there are unique values in the x column.
> dat <- data.frame( x =sample(1:5, 20, repl=TRUE), y=1:5, z=1:4)
> dat
x y z
1 2 1 1
2 2 2 2
3 2 3 3
4 5 4 4
5 4 5 1
6 1 1 2
7 2 2 3
8 2 3 4
9 5 4 1
10 1 5 2
11 2 1 3
12 4 2 4
13 5 3 1
14 4 4 2
15 3 5 3
16 3 1 4
17 4 2 1
18 4 3 2
19 1 4 3
20 1 5 4
> subset(dat, !duplicated(x))
x y z
1 2 1 1
4 5 4 4
5 4 5 1
6 1 1 2
15 3 5 3
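Since the example above uses sample(), its rows will differ from run to run. A small deterministic sketch (the data and the names first_rows and same_rows are made up for illustration) shows the logical vector that duplicated produces and the bracket-indexing call that selects the same rows:

```r
x <- c(2, 2, 5, 4, 2, 5)
duplicated(x)
# [1] FALSE  TRUE FALSE FALSE  TRUE  TRUE

dat2 <- data.frame(x = x, y = 1:6)

# Keep only the first row seen for each distinct value of x.
first_rows <- subset(dat2, !duplicated(x))

# Same selection with standard evaluation: the column is named explicitly.
same_rows <- dat2[!duplicated(dat2$x), ]
```

Both results keep rows 1, 3 and 4, one per unique value of x.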
