Group values by unique elements [duplicate] - r

I have a vector that looks like this:
a <- c("A110","A110","A110","B220","B220","C330","D440","D440","D440","D440","D440","D440","E550")
I would like to create another vector, based on a, that should look like:
b <- c(1,1,1,2,2,3,4,4,4,4,4,4,5)
In other words, b should assign a value (starting from 1) to each different element of a.

First of all, (I assume) this is your vector
a <- c("A110","A110","A110","B220","B220","C330","D440","D440","D440","D440","D440","D440","E550")
As for possible solutions, here are a few (can't find a good dupe right now)
as.integer(factor(a))
# [1] 1 1 1 2 2 3 4 4 4 4 4 4 5
Or
cumsum(!duplicated(a))
# [1] 1 1 1 2 2 3 4 4 4 4 4 4 5
Or
match(a, unique(a))
# [1] 1 1 1 2 2 3 4 4 4 4 4 4 5
Also, rle will work similarly in your specific scenario
with(rle(a), rep(seq_along(values), lengths))
# [1] 1 1 1 2 2 3 4 4 4 4 4 4 5
Or (which is practically the same)
data.table::rleid(a)
# [1] 1 1 1 2 2 3 4 4 4 4 4 4 5
Be advised, though, that the 4 solutions (counting rle and rleid as one) each behave differently in other scenarios. Consider the following vector:
a <- c("B110","B110","B110","A220","A220","C330","D440","D440","B110","B110","E550")
And the results of the 4 different solutions:
1.
as.integer(factor(a))
# [1] 2 2 2 1 1 3 4 4 2 2 5
The factor solution begins with 2 because a is unsorted: factor sorts the levels, so "A220" becomes level 1 and "B110", although it appears first, becomes level 2. Hence, this solution is only valid if your vector is sorted, so don't use it otherwise.
2.
cumsum(!duplicated(a))
# [1] 1 1 1 2 2 3 4 4 4 4 5
This cumsum/duplicated solution got confused because "B110" had already been present at the beginning, and hence it grouped "D440","D440","B110","B110" into the same group.
3.
match(a, unique(a))
# [1] 1 1 1 2 2 3 4 4 1 1 5
This match/unique solution added ones at the end because it is sensitive to "B110" showing up in more than one sequence (because of unique), and hence it groups all occurrences into the same group regardless of where they appear.
4.
with(rle(a), rep(seq_along(values), lengths))
# [1] 1 1 1 2 2 3 4 4 5 5 6
This solution only cares about sequences, hence different sequences of "B110" were grouped into different groups
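To see the differences side by side, here is a minimal base R sketch that runs three of the approaches on the unsorted vector above. The variable names are illustrative. The last line is an assumption on my part rather than something from the answer: passing levels = unique(a) looks like one way to keep the factor approach while numbering groups in order of first appearance.

```r
a <- c("B110","B110","B110","A220","A220","C330","D440","D440","B110","B110","E550")

# Numbering follows the sorted factor levels ("A220" < "B110" < ...)
by_sorted_level <- as.integer(factor(a))
# [1] 2 2 2 1 1 3 4 4 2 2 5

# Numbering follows the order of first appearance; repeated values share an id
by_first_seen <- match(a, unique(a))
# [1] 1 1 1 2 2 3 4 4 1 1 5

# Numbering follows contiguous runs; a value that reappears later gets a new id
r <- rle(a)
by_run <- rep(seq_along(r$values), r$lengths)
# [1] 1 1 1 2 2 3 4 4 5 5 6

# Assumption: fixing the level order makes factor() behave like match/unique
by_first_seen_factor <- as.integer(factor(a, levels = unique(a)))
```

Which one is appropriate depends on whether a repeated value that shows up again later should rejoin its earlier group (match/unique) or start a new one (rle).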

Related

R Create column that provides grouping number for each distinct group [duplicate]

I need to add a column to my data that contains a number grouping for each distinct combination of other columns. It will likely be more clear with this example:
# Make data
df <- data.frame(x = c(1,1,2,3,4,5,2,3,4,5),
                 y = c(2,2,3,4,5,1,3,4,5,1),
                 value = c(1,2,3,4,5,6,7,8,9,10))
# Print the data
df
   x y value
1  1 2     1
2  1 2     2
3  2 3     3
4  3 4     4
5  4 5     5
6  5 1     6
7  2 3     7
8  3 4     8
9  4 5     9
10 5 1    10
I need to add a "Location" column that numbers each unique (or distinct) combination of x and y. Duplicated x and y combinations should all use the same number. In my example there are 5 unique combinations of x and y, so I only have a maximum of 5 Locations. My goal output is this:
   x y value Location
1  1 2     1        1
2  1 2     2        1
3  2 3     3        2
4  3 4     4        3
5  4 5     5        4
6  5 1     6        5
7  2 3     7        2
8  3 4     8        3
9  4 5     9        4
10 5 1    10        5
I imagine doing something like this:
df <- df %>%
group_by(x,y) %>%
mutate(Location = ndistinct(x,y)
But this doesn't work. Any help is appreciated!
Thanks!
df %>% mutate(., Location=group_indices(., x,y))
   x y value Location
1  1 2     1        1
2  1 2     2        1
3  2 3     3        2
4  3 4     4        3
5  4 5     5        4
6  5 1     6        5
7  2 3     7        2
8  3 4     8        3
9  4 5     9        4
10 5 1    10        5
See here and here.
Not quite as straightforward as I thought to start with.
Update
To answer OP's question: the dot . is a placeholder for "the object on the left hand side of the pipe" (%>%). Normally you don't need it because, by default, magrittr (the package which defines the pipe) assumes that you want to use the object on the left hand side of the pipe as the first argument to the function on the right hand side of the pipe, and makes the substitution for you. This is very helpful because the tidyverse is designed so that the object on the left hand side of the pipe is always the first argument to the function on the right hand side - so you don't have to use the dot.
If you use functions that don't belong to the tidyverse, you sometimes need the dot to override magrittr's default behaviour.
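As a small illustration of the placeholder behaviour described above, here is a sketch assuming the magrittr package is installed. The values and variable names are made up for the example.

```r
library(magrittr)

# Default: the left-hand side becomes the first argument.
roots <- c(1, 4, 9) %>% sqrt()        # same as sqrt(c(1, 4, 9))

# With an explicit dot as an argument, the left-hand side is placed
# where the dot is instead of being inserted as the first argument.
evens <- 10 %>% seq(2, ., by = 2)     # same as seq(2, 10, by = 2)
```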
I wrote my first version of this answer without testing the code because the solution seemed "obvious". But I did test it afterwards (at the same time as OP reported the error) and found that it didn't work. A quick Google brought me to the github issue in the second link above, and hence to the correct answer.
I don't yet understand why, in this particular case, a tidyverse function doesn't work as I expect. (Other than taking the easy way out and saying that my expectation was wrong!)
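For readers on newer dplyr versions: as far as I know, passing grouping columns to group_indices() has been deprecated since dplyr 1.0.0, and the suggested replacement is cur_group_id() inside a grouped mutate(). A minimal sketch under that assumption (dplyr 1.0.0 or later installed), using the question's data:

```r
library(dplyr)

df <- data.frame(x = c(1,1,2,3,4,5,2,3,4,5),
                 y = c(2,2,3,4,5,1,3,4,5,1),
                 value = c(1,2,3,4,5,6,7,8,9,10))

result <- df %>%
  group_by(x, y) %>%                     # one group per (x, y) combination
  mutate(Location = cur_group_id()) %>%  # integer id of the current group
  ungroup()

result$Location
# [1] 1 1 2 3 4 5 2 3 4 5
```

Note that the ids follow the sorted order of the group keys, not the order in which combinations first appear; in this example the two happen to coincide.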
In base R we can use:
df$location <- as.numeric(factor(paste(df$x,df$y)))
   x y value location
1  1 2     1        1
2  1 2     2        1
3  2 3     3        2
4  3 4     4        3
5  4 5     5        4
6  5 1     6        5
7  2 3     7        2
8  3 4     8        3
9  4 5     9        4
10 5 1    10        5
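One thing to keep in mind with the factor(paste(...)) approach is that the levels are sorted as strings, so the numbering may not follow the order of appearance (or the numeric order) once values have more than one digit. A hedged base R sketch of an alternative that numbers combinations in order of first appearance; x2 and y2 are hypothetical values used only to show the difference:

```r
df <- data.frame(x = c(1,1,2,3,4,5,2,3,4,5),
                 y = c(2,2,3,4,5,1,3,4,5,1),
                 value = 1:10)

# Build one key per row, then number keys in order of first appearance
key <- paste(df$x, df$y)
df$Location <- match(key, unique(key))
# [1] 1 1 2 3 4 5 2 3 4 5

# Hypothetical values where the two numberings differ
x2 <- c(2, 10, 2)
y2 <- c(1, 1, 1)
key2 <- paste(x2, y2)
sorted_ids <- as.numeric(factor(key2))  # "10 1" sorts before "2 1" as a string
# [1] 2 1 2
seen_ids <- match(key2, unique(key2))
# [1] 1 2 1
```

Both numberings identify the same groups; they only differ in which group gets which number.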

How do I preserve continuous (1,2,3,...n) ranking notation when ranking in R?

If I want to rank a set of numbers using the minimum rank for shared cases (aka ties):
dat <- c(13,13,14,15,15,15,15,15,15,16,17,22,45,46,112)
rank(dat, ties.method = 'min')
I get the results:
1 1 3 4 4 4 4 4 4 10 11 12 13 14 15
However, I want the rank to be a continuous series consisting of 1,2,3,...n, where n is the number of unique ranks.
Is there a way to make rank (or a similar function) assign ties the lowest rank as above but, instead of skipping subsequent rank values by the number of previous ties, continue ranking from the previous rank?
For example, I would like the above ranking to result in:
1 1 2 3 3 3 3 3 3 4 5 6 7 8 9
You could do it using dplyr:
library(dplyr)
dense_rank(dat)
[1] 1 1 2 3 3 3 3 3 3 4 5 6 7 8 9
If you don't want to load the whole library, you can do it in base R:
match(dat, sort(unique(dat)))
[1] 1 1 2 3 3 3 3 3 3 4 5 6 7 8 9
Use a factor and then bring it back to numeric format:
as.numeric(factor(rank(dat)))
# [1] 1 1 2 3 3 3 3 3 3 4 5 6 7 8 9
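To make the contrast with rank() explicit, here is a base R sketch that puts the gapped ranks next to the dense ones. The descending variant at the end is my own extension and not part of the answers above.

```r
dat <- c(13,13,14,15,15,15,15,15,15,16,17,22,45,46,112)

# Minimum rank for ties: later ranks skip ahead by the number of ties
gapped <- rank(dat, ties.method = "min")
# 1 1 3 4 4 4 4 4 4 10 11 12 13 14 15

# Dense rank: position of each value among the sorted unique values
dense <- match(dat, sort(unique(dat)))
# 1 1 2 3 3 3 3 3 3 4 5 6 7 8 9

# Assumption: the same idea, with the largest value ranked first
dense_desc <- match(dat, sort(unique(dat), decreasing = TRUE))
# 9 9 8 7 7 7 7 7 7 6 5 4 3 2 1
```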

Proportion of dataset equal to a value

I have the following dataset called asteroids
3 4 3 3 1 4 1 3 2 3
1 1 4 2 3 3 2 6 1 1
3 3 2 2 2 2 1 3 2 1
6 1 3 2 2 1 2 2 4 2
I need to find out what proportion of this dataset is 1.
If you have a specific value in mind you can just do an equality comparison and then use mean on the resulting logical vector.
> asteroids <- scan(what=numeric())
1: 3 4 3 3 1 4 1 3 2 3 1 1 4 2 3 3 2 6 1 1 3 3 2 2 2 2 1 3 2 1 6 1 3 2 2 1 2 2 4 2
41:
Read 40 items
> mean(asteroids == 1)
[1] 0.25
This works since the equality comparison will give TRUE and FALSE and when T/F are coerced numerically they become 1s and 0s so mean ends up giving us the proportion of TRUEs.
I assumed asteroids was a vector. You don't specify in your question but if it's a different type of structure you'll probably need to coerce it into a vector in some way or another.
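The coercion argument can be checked step by step. A minimal sketch, assuming asteroids is a plain numeric vector holding the 40 values from the question (the intermediate names are illustrative):

```r
asteroids <- c(3,4,3,3,1,4,1,3,2,3,
               1,1,4,2,3,3,2,6,1,1,
               3,3,2,2,2,2,1,3,2,1,
               6,1,3,2,2,1,2,2,4,2)

is_one <- asteroids == 1                    # logical vector of TRUE/FALSE
n_ones <- sum(is_one)                       # TRUE counts as 1, FALSE as 0
prop_manual <- n_ones / length(asteroids)   # count divided by total
prop_mean <- mean(is_one)                   # the same thing in one call
```

Here n_ones is 10 out of 40 values, so both proportions come out as 0.25.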
Assuming that 'asteroids' is a data.frame, unlist it, get the table and find the proportion with prop.table.
prop.table(table(unlist(asteroids)==1))
# FALSE TRUE
# 0.75 0.25
Or, as @Richard Scriven mentioned, we can compare the data.frame to 1 to get a logical matrix, and use table directly on it, as a 'matrix' is a vector with dim attributes.
prop.table(table(asteroids == 1))
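A self-contained sketch of the data.frame case. The 4-row by 10-column layout is an assumption based on how the data is printed in the question, and the variable names are illustrative.

```r
vals <- c(3,4,3,3,1,4,1,3,2,3,
          1,1,4,2,3,3,2,6,1,1,
          3,3,2,2,2,2,1,3,2,1,
          6,1,3,2,2,1,2,2,4,2)
# Assumed layout: 4 rows of 10 values each
asteroids <- as.data.frame(matrix(vals, nrow = 4, byrow = TRUE))

# Flatten to a vector, then take the mean of the logical comparison
prop_unlist <- mean(unlist(asteroids) == 1)

# Or convert to a matrix first; the comparison gives a logical matrix
prop_matrix <- mean(as.matrix(asteroids) == 1)

# The table-based version from the answer, kept for comparison
tab <- prop.table(table(unlist(asteroids) == 1))
```

All three agree on a proportion of 0.25 for the value 1.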

Recoding an arbitrary grouping variable or factor in R

Suppose I have a vector or column of arbitrary length representing some grouping/factor variable with an arbitrary number of groups and arbitrary values for same along the lines of this:
a <- c(2,2,2,2,2,7,7,7,7,10,10,10,10,10)
a
[1] 2 2 2 2 2 7 7 7 7 10 10 10 10 10
How would I most easily turn that into this:
a
[1] 1 1 1 1 1 2 2 2 2 3 3 3 3 3
a <- c(2,2,2,2,2,7,7,7,7,10,10,10,10,10)
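as.integer(factor(a))
#[1] 1 1 1 1 1 2 2 2 2 3 3 3 3 3
Explanation:
A factor is just an integer vector with a levels attribute and a class attribute, and as.integer returns the underlying integer codes without those attributes (as.numeric gives a similar result, as doubles). In R versions before 4.1.0, c(factor(a)) worked too, because c removed the attributes as a side effect; since R 4.1.0, c applied to factors returns a factor, so it no longer gives this result.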
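To make the relationship between a factor and its integer codes concrete, here is a small base R sketch. The variable names are illustrative; the last line shows how the original values could be recovered from the codes.

```r
a <- c(2,2,2,2,2,7,7,7,7,10,10,10,10,10)

f <- factor(a)
lv <- levels(f)          # sorted unique values, stored as strings: "2" "7" "10"
codes <- as.integer(f)   # position of each element's level: 1 1 1 1 1 2 2 2 2 3 3 3 3 3

# The same codes without going through a factor
codes_match <- match(a, sort(unique(a)))

# Map the codes back to the original values
restored <- as.numeric(lv)[codes]
```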

Repetitive vectors in R [duplicate]

To create the vector 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 is easy in one line, just type this into the command line and the appropriate output comes out immediately:
c(rep(1:3, 5))
But is there a similarly easy way to produce the vector 1 1 1 2 2 2 3 3 3 4 4 4 5 5 5 ?
The pattern of the repetition is different but it's not obvious to me why it's not amenable to a very simple solution. It's possible to do this with a "for" loop without too much difficulty, but can it be all compressed into one "line"?
You need the each parameter within rep:
> rep(1:5, each = 3)
[1] 1 1 1 2 2 2 3 3 3 4 4 4 5 5 5
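For reference, a short sketch of how the times and each arguments of rep differ and combine. The variable names are illustrative.

```r
# times repeats the whole vector
cycled <- rep(1:3, times = 5)
# 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3

# each repeats every element before moving to the next
blocked <- rep(1:5, each = 3)
# 1 1 1 2 2 2 3 3 3 4 4 4 5 5 5

# a vector of times gives a different repeat count per element
uneven <- rep(1:3, times = c(2, 1, 3))
# 1 1 2 3 3 3

# each is applied first, then the result is repeated times times
both <- rep(1:2, each = 2, times = 2)
# 1 1 2 2 1 1 2 2
```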

Resources