Number Duplicated Cases - r

I want to identify duplicate cases and number them as a vector (such as with an ID variable). Any case without any direct matches should be labeled as a fixed value (such as zero). Any case with a corresponding duplicate should be labeled 1, with each subsequent case being labeled n+1. So, if I have an ID variable like this 1, 2, 2, 2, 3, 4, 4, 5, I'd want the corresponding vector to produce: 0, 1, 2, 3, 0, 1, 2, 0.
How can I do this?
duplicated() treats the first case of each group as a non-duplicate, so on its own it doesn't work.

Base R, using ave() with seq_along():
x <- c(1, 2, 2, 2, 3, 4, 4, 5)
ave(seq_along(x), x, FUN = function(g) if (length(g) > 1) seq_along(g) else 0)
#> [1] 0 1 2 3 0 1 2 0
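To see why duplicated() alone falls short, the same idea can be split into separate steps. This is only a sketch; the names in_group, running and result are illustrative and not part of the answer above.

```r
x <- c(1, 2, 2, 2, 3, 4, 4, 5)

# duplicated() skips the first member of each group
duplicated(x)

# Scanning from both ends flags every member of a repeated value
in_group <- duplicated(x) | duplicated(x, fromLast = TRUE)

# Running count within each value (1, 2, 3, ... per group)
running <- ave(seq_along(x), x, FUN = seq_along)

# Keep the running count for repeated values, 0 for singletons
result <- ifelse(in_group, running, 0)
result
```

Splitting it up this way makes it easy to swap the fixed value 0 for something else, such as NA.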

Related

Function that reveals the position in R [duplicate]

What is the difference between sort(), rank(), and order() in R?
Can you explain with examples?
sort() sorts the vector in an ascending order.
rank() gives the respective rank of the numbers present in the vector, the smallest number receiving the rank 1.
order() returns the indices that put the vector in sorted order.
For example, if these functions are applied to the vector c(3, 1, 2, 5, 4):
sort(c(3, 1, 2, 5, 4)) will give c(1, 2, 3, 4, 5)
rank(c(3, 1, 2, 5, 4)) will give c(3, 1, 2, 5, 4)
order(c(3, 1, 2, 5, 4)) will give c(2, 3, 1, 5, 4).
If you index the vector with these indices in this order, you get the sorted vector. Notice how v[2] = 1, v[3] = 2, v[1] = 3, v[5] = 4 and v[4] = 5.
R also has tie-handling methods. If you run rank(c(3, 1, 2, 5, 4, 2)), the value 1 gets rank 1. Since there are two 2s, they occupy positions 2 and 3, and by default (ties.method = "average") each of them is assigned the average rank 2.5. Next, 3 gets rank 4.0, so
rank(c(3, 1, 2, 5, 4, 2)) will give you the output [4.0 1.0 2.5 6.0 5.0 2.5]
Hope this is helpful.
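As a quick check of the relationships described above, here is a small sketch in base R. It also shows two of the other ties.method options next to the default; the vector is the one from the tie example.

```r
v <- c(3, 1, 2, 5, 4, 2)

# Indexing by order() reproduces sort()
identical(v[order(v)], sort(v))

# Different tie-handling rules for the two 2s
r_avg   <- rank(v)                         # default "average": both get 2.5
r_min   <- rank(v, ties.method = "min")    # both get the lower rank, 2
r_first <- rank(v, ties.method = "first")  # 2 and 3, in order of appearance
r_avg
r_min
r_first
```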

R - find all possible combinations of numbers WITH constraints on combination length

Let's say you have the following vector of numbers:
1, 2, 3, 4, 5
I want to find all possible combinations of numbers with the combination length 3. Order must not matter, i.e. 1, 2, 3 is the same as 1, 3, 2, and only one of those should appear in the output!
So, the answers would be:
1, 2, 3
1, 2, 4
1, 2, 5
1, 3, 4
1, 3, 5
1, 4, 5
2, 3, 4
2, 3, 5
2, 4, 5
3, 4, 5
This is just a simple example. In reality I have a vector of length 10000 and I need to find all combinations with length 8000. What code would you use to generate those combinations in R?
@chinsoon12 suggested the package RcppAlgos. I investigated it and found that the following works:
comboIter(1:10000, 8000)
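For the small example, base R's combn() is enough. For the real problem it is worth checking the size of the result first: the count has over two thousand digits, so the combinations can only be visited lazily and never stored all at once. A minimal sketch using only base R (the name combos is illustrative):

```r
# All 3-element combinations of 1:5, one per row
combos <- t(combn(1:5, 3))
combos

# The row count matches the binomial coefficient
nrow(combos) == choose(5, 3)

# Approximate number of decimal digits in choose(10000, 8000),
# computed on the log scale to avoid overflow
digits <- lchoose(10000, 8000) / log(10)
digits
```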

Remove continuously repeating values [duplicate]

Does anyone know how to remove continuously repeating values? Not just repeating values with unique() function.
So for example, I want:
0,0,0,0,1,1,1,2,2,2,3,3,3,3,2,2,1,2
to become
0,1,2,3,2,1,2
and not just
0,1,2,3
Is there a word to describe this? I'm sure that the solution is out there somewhere and I just can't find it because I don't know the word for it.
Keep a value when its difference from the previous value is not zero (and keep the first one):
x <- c(0,0,0,0,1,1,1,2,2,2,3,3,3,3,2,2,1,2)
x[c(1, diff(x)) != 0]
# [1] 0 1 2 3 2 1 2
v <- c(0, 0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 2, 2, 1, 2)
rle(v)$values
Output:
[1] 0 1 2 3 2 1 2
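On the naming question: such stretches of identical consecutive values are usually called runs, and rle() stands for run-length encoding. The diff() approach needs numeric input. As a sketch for other types such as character vectors, comparing each element with its predecessor works as well (the names s and keep are illustrative, and the vector is assumed to be non-empty):

```r
s <- c("a", "a", "b", "b", "b", "a", "c", "c")

# Keep the first element, then every element that differs from the one before it
keep <- c(TRUE, s[-1] != s[-length(s)])
s[keep]

# rle() gives the same result
rle(s)$values
```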

how to fill in values in a vector?

I have vectors in R containing a lot of 0's, and a few non-zero numbers. Each vector starts with a non-zero number.
For example <1,0,0,0,0,0,2,0,0,0,0,0,4,0,0,0>
I would like to set all of the zeros equal to the most recent non-zero number.
I.e. this vector would become <1,1,1,1,1,1,2,2,2,2,2,2,4,4,4,4>
I need to do this for about 100 vectors containing around 6 million entries each. Currently I am using a for loop:
for (k in seq_along(vector)) {
  if (vector[k] == 0) {
    vector[k] <- vector[k - 1]
  }
}
Is there a more efficient way to do this?
Thanks!
One option would be to replace those 0s with NA, then use zoo::na.locf:
x <- c(1,0,0,0,0,0,2,0,0,0,0,0,4,0,0,0)
x[x == 0] <- NA
zoo::na.locf(x) ## you possibly need: `install.packages("zoo")`
# [1] 1 1 1 1 1 1 2 2 2 2 2 2 4 4 4 4
Thanks to Richard for showing me how to use replace,
zoo::na.locf(replace(x, x == 0, NA))
You could try this:
k <- c(1,0,0,0,0,0,2,0,0,0,0,0,4,0,0,0)
k[which(k != 0)[cumsum(k != 0)]]
or, for another case where cummax would not be appropriate because the non-zero values are not increasing:
k <- c(1,0,0,0,0,0,2,0,0,0,0,0,1,0,0,0)
k[which(k != 0)[cumsum(k != 0)]]
Logic:
I keep track of the indices of the vector elements that are non-zero with which(k != 0). Let's denote this new vector as x: x = c(1, 7, 13).
Next I "sample" this new vector. How? From k I create a new vector that increments every time there is a non-zero element, cumsum(k != 0). Let's denote this new vector as y: y = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3).
I "sample" from vector x with x[y], i.e. I take the first element of x 6 times, then the second element 6 times and the third element 4 times. Let's denote this new vector as z: z = c(1, 1, 1, 1, 1, 1, 7, 7, 7, 7, 7, 7, 13, 13, 13, 13).
Finally I "sample" from vector k with k[z], i.e. I take the 1st element 6 times, then the 7th element 6 times, then the 13th element 4 times.
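The steps above can be run one at a time to inspect each intermediate vector. This sketch reuses the names x, y and z from the explanation:

```r
k <- c(1, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 4, 0, 0, 0)

x <- which(k != 0)    # positions of the non-zero values
y <- cumsum(k != 0)   # which non-zero value is the most recent one so far
z <- x[y]             # position to copy from, for every element of k
filled <- k[z]        # the filled vector
filled
```

Because every step is vectorized, no explicit loop over the 6 million entries is needed.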
Adding to @李哲源's answer:
If it is required to replace the leading NAs with the nearest non-NA value, and to replace the other NAs with the last non-NA value, the code can be:
x <- c(0,0,1,0,0,0,0,0,2,0,0,0,0,0,4,0,0,0)
zoo::na.locf(zoo::na.locf(replace(x, x == 0, NA),na.rm=FALSE),fromLast=TRUE)
# you possibly need: `install.packages("zoo")`
# [1] 1 1 1 1 1 1 1 1 2 2 2 2 2 2 4 4 4 4

Reduce each consecutive sequence to its value and length

Assume you have a vector with runs of consecutive values:
v <- c(1, 1, 1, 2, 2, 2, 2, 1, 1, 3, 3, 3, 3)
How can it best be reduced to one value per run and the length of each run? I.e. the first run is 1 repeated three times; 2nd run: 2 repeated four times; 3rd run: 1 repeated two times, and so on:
v.df <- data.frame(value = c(1, 2, 1, 3),
repetitions = c(3, 4, 2, 4))
In a procedural language I might just iterate through a loop and build the data.frame as I go, but with a large dataset in R such an approach is inefficient. Any advice?
with(rle(v), data.frame(values, lengths))
should get you what you need.
  values lengths
1      1       3
2      2       4
3      1       2
4      3       4
Or, more simply, drop the rle class first (the columns then come out as lengths, then values):
data.frame(unclass(rle(v)))
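As a sanity check that nothing is lost in the reduction, the run-length form can be expanded back into the original vector. A minimal sketch using the column names from the question (r is an illustrative name):

```r
v <- c(1, 1, 1, 2, 2, 2, 2, 1, 1, 3, 3, 3, 3)

r <- rle(v)
v.df <- data.frame(value = r$values, repetitions = r$lengths)
v.df

# Both of these rebuild the original vector
from_rle <- inverse.rle(r)
from_df  <- rep(v.df$value, v.df$repetitions)
identical(from_rle, v)
identical(from_df, v)
```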
