Avoid duplicates in numeric vector shifting numbers

Avoid duplicates in numeric vector shifting numbers - r

I'm looking for the optimal way to go from a numeric vector containing duplicate entries, like this one:
a=c(1,3,4,4,4,5,7,9,27,28,28,30,42,43)
to this one, avoiding the duplicates by shifting +1 if appropriate:
b=c(1,3,4,5,6,7,8,9,27,28,29,30,42,43)
side to side comparison:
> data.frame(a=a, b=b)
a b
1 1 1
2 3 3
3 4 4
4 4 5
5 4 6
6 5 7
7 7 8
8 9 9
9 27 27
10 28 28
11 28 29
12 30 30
13 42 42
14 43 43
is there any easy and quick way to do it? Thanks!

In case you want it to be done only once (there may still be duplicates):
a=c(1,3,4,4,4,5,7,9,27,28,28,30,42,43)
a <- ifelse(duplicated(a),a+1,a)
output:
> a
[1] 1 3 4 5 5 5 7 9 27 28 29 30 42 43
Loop that will lead to a state without any duplicates:
a=c(1,3,4,4,4,5,7,9,27,28,28,30,42,43)
while(length(a[duplicated(a)])) {
a <- ifelse(duplicated(a),a+1,a)
}
output:
> a
[1] 1 3 4 5 6 7 8 9 27 28 29 30 42 43

An alternative is to use a recursive function:
no_dupes <- function(x) {
if (anyDuplicated(x) == 0)
x
else
no_dupes(x + duplicated(x))
}
no_dupes(a)
[1] 1 3 4 5 6 7 8 9 27 28 29 30 42 43

A tidyverse option using purrr::accumulate.
library(dplyr)
library(purrr)
accumulate(a, ~ if_else(.y <= .x, .x+1, .y))
# [1] 1 3 4 5 6 7 8 9 27 28 29 30 42 43

Related

Adding a value based on a sequence of numbers

Please simplify my code. The result should be the same. The script works but R shows warning messages:
1: In data$sygnature[seq(first[v], last[v])] <- paste0(n[v], "/", syg) :
number of items to replace is not a multiple of replacement length
etc.
The idea is to assign each sequence in the column the same value.
data <- data.frame(sygnature = c(seq(1:8),seq(1:3),seq(1:11),seq(1:6),seq(1:9),seq(1:5)))
n <- c(44:49)
k<-c()
for(i in (1: nrow(data))){
s<- data$sygnature[i]
z<-data$sygnature[i+1]
if(
if(is.na(z)){
z<-1
s > z
}else{
s > z
}
){
k<- c(k, s)
}
}
last<- cumsum(k)
first<-(last-k)+1
syg <- data$sygnature
for(v in 1:6)
{
data$sygnature[seq(first[v],last[v])] <- paste0(n[v],"/",syg)
}

You can do,
data$res <- paste0(rep(n, aggregate(sygnature ~ cumsum(sygnature == 1), data, length)[[2]]),
'/',
data$sygnature)
data
sygnature res
1 1 44/1
2 2 44/2
3 3 44/3
4 4 44/4
5 5 44/5
6 6 44/6
7 7 44/7
8 8 44/8
9 1 45/1
10 2 45/2
11 3 45/3
12 1 46/1
13 2 46/2
14 3 46/3
15 4 46/4
16 5 46/5
17 6 46/6
18 7 46/7
19 8 46/8
20 9 46/9
21 10 46/10
22 11 46/11
23 1 47/1
24 2 47/2
25 3 47/3
26 4 47/4
27 5 47/5
28 6 47/6
29 1 48/1
30 2 48/2
31 3 48/3
32 4 48/4
33 5 48/5
34 6 48/6
35 7 48/7
36 8 48/8
37 9 48/9
38 1 49/1
39 2 49/2
40 3 49/3
41 4 49/4
42 5 49/5

Finding the k-largest clusters in dbscan result

I have a dataframe df, consists of 2 columns: x and y coordinates.
Each row refers to a point.
I feed it into dbscan function to obtain the clusters of the points in df.
library("fpc")
db = fpc::dbscan(df, eps = 0.08, MinPts = 4)
plot(db, df, main = "DBSCAN", frame = FALSE)
By using print(db), I can see the result returned by dbscan.
> print(db)
dbscan Pts=13131 MinPts=4 eps=0.08
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21
border 401 38 55 5 2 3 0 0 0 8 0 6 1 3 1 3 3 2 1 2 4 3
seed 0 2634 8186 35 24 561 99 7 22 26 5 75 17 9 9 54 1 2 74 21 3 15
total 401 2672 8241 40 26 564 99 7 22 34 5 81 18 12 10 57 4 4 75 23 7 18
22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44
border 4 1 2 6 2 1 3 7 2 1 2 3 11 1 3 1 3 2 5 5 1 4 3
seed 14 9 4 48 2 4 38 111 5 11 5 14 111 6 1 5 1 8 3 15 10 15 6
total 18 10 6 54 4 5 41 118 7 12 7 17 122 7 4 6 4 10 8 20 11 19 9
45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68
border 2 4 2 1 3 2 1 1 3 1 0 2 2 3 0 3 3 3 3 0 0 2 3 1
seed 15 2 9 11 4 8 12 4 6 8 7 7 3 3 4 3 3 4 2 9 4 2 1 4
total 17 6 11 12 7 10 13 5 9 9 7 9 5 6 4 6 6 7 5 9 4 4 4 5
69 70 71
border 3 3 3
seed 1 1 1
total 4 4 4
From the above summary, I can see cluster 2 consists of 8186 seed points (core points), cluster 1 consists of 2634 seed points and cluster 5 consists of 561 points.
I define the largest cluster as the one contains the largest amount of seed points. So, in this case, the largest cluster is cluster 2. And the 1st, 2nd, 3th largest clusters are 2, 1 and 5.
Are they any direct way to return the rows (points) in the largest cluster or the k-largest cluster in general?
I can do it in an indirect way.
I can obtain the assigned cluster number of each point by
db$cluster.
Hence, I can create a new dataframe df2 with db$cluster as the
new additional column besides the original x column and y
column.
Then, I can aggregate the df2 according to the cluster numbers in
the third column and find the number of points in each cluster.
After that, I can find the k-largest groups, which are 2, 1 and 5
again.
Finally, I can select the rows in df2 with third column value equals to 2 to return the points in the largest cluster.
But the above approach re-computes many known results as stated in the summary of print(db).

The dbscan function doesn't appear to retain the data.
library(fpc)
set.seed(665544)
n <- 600
df <- data.frame(x=runif(10, 0, 10)+rnorm(n, sd=0.2), y=runif(10, 0, 10)+rnorm(n,sd=0.2))
(dbs <- dbscan(df, 0.2))
#dbscan Pts=600 MinPts=5 eps=0.2
# 0 1 2 3 4 5 6 7 8 9 10 11
#border 28 4 4 8 5 3 3 4 3 4 6 4
#seed 0 50 53 51 52 51 54 54 54 53 51 1
#total 28 54 57 59 57 54 57 58 57 57 57 5
attributes(dbs)
#$names
#[1] "cluster" "eps" "MinPts" "isseed"
#$class
#[1] "dbscan"
Your indirect steps are not that indirect (only two lines needed), and these commands won't recalculate the clusters. So just run those commands, or put them in a function and then call the function in one command.
cluster_k <- function(dbs, data, k){
kth <- names(rev(sort(table(dbs$cluster)))[k])
data[dbs$cluster == kth,]
}
cluster_k(dbs=dbs, data=df, k=1)
## x y
## 3 6.580695 8.715245
## 13 6.704379 8.528486
## 23 6.809558 8.160721
## 33 6.375842 8.756433
## 43 6.603195 8.640206
## 53 6.728533 8.425067
## a data frame with 59 rows

Create partition based in two variables

I have a data set with two outcome variables, case1 and case2. Case1 has 4 levels, while case2 has 50 (levels in case2 could increase later). I would like to create data partition for train and test keeping the ratio in both cases. The real data is imbalanced for both case1 and case2. As an example,
library(caret)
set.seed(123)
matris=matrix(rnorm(10),1000,20)
case1 <- as.factor(ceiling(runif(1000, 0, 4)))
case2 <- as.factor(ceiling(runif(1000, 0, 50)))
df <- as.data.frame(matris)
df$case1 <- case1
df$case2 <- case2
split1 <- createDataPartition(df$case1, p=0.2)[[1]]
train1 <- df[-split1,]
test1 <- df[split1,]
length(split1)
201
split2 <- createDataPartition(df$case2, p=0.2)[[1]]
train2 <- df[-split2,]
test2 <- df[split2,]
length(split2)
220
If I do separate splitting, I get different length for the data frame. If I do one splitting based on case2 (one with more classes), I lose the ratio of classes for case1.
I will be predicting the two cases separately, but at the end my accuracy will be given by having the exact match for both cases (e.g., ix = which(pred1 == case1 & pred2 == case2), so I need the arrays to be the same size.
Is there a smart way to do this?
Thank you!

If I understand correctly (which I do not guarantee) I can offer the following approach:
Group by case1 and case2 and get the group indices
library(tidyverse)
df %>%
select(case1, case2) %>%
group_by(case1, case2) %>%
group_indices() -> indeces
use these indeces as the outcome variable in create data partition:
split1 <- createDataPartition(as.factor(indeces), p=0.2)[[1]]
check if satisfactory:
table(df[split1,22])
#output
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33
5 6 5 8 5 5 6 6 4 6 6 6 6 6 5 5 5 4 4 7 5 6 5 6 7 5 5 8 6 7 6 6 7
34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
4 5 6 6 6 5 5 6 5 6 6 5 4 5 6 4 6
table(df[-split1,22])
#output
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33
15 19 13 18 12 13 16 15 8 13 13 15 21 14 11 13 12 9 12 20 17 15 16 19 16 11 14 21 13 20 18 13 16
34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
9 6 12 19 14 10 16 19 17 17 16 14 4 15 14 9 19
table(df[split1,21])
#output
1 2 3 4
71 70 71 67
table(df[-split1,21])
1 2 3 4
176 193 174 178

Cumulative function for a specific range of values

I have a table with a column "Age" that has a values from 1 to 10, and a column "Population" that has values specified for each of the "age" values. I want to generate a cumulative function for population such that resultant values start from ages at least 1 and above, 2 and above, and so on. I mean, the resultant array should be (203,180..and so on). Any help would be appreciated!
Age Population Withdrawn
1 23 3
2 12 2
3 32 2
4 33 3
5 15 4
6 10 1
7 19 2
8 18 3
9 19 1
10 22 5

You can use cumsum and rev:
df$sum_above <- rev(cumsum(rev(df$Population)))
The result:
> df
Age Population sum_above
1 1 23 203
2 2 12 180
3 3 32 168
4 4 33 136
5 5 15 103
6 6 10 88
7 7 19 78
8 8 18 59
9 9 19 41
10 10 22 22

Sampling loop from vector

I am playing around to develop a sampling function to do randomization to make days easier:
Question:
pln <- 1:80
bcap <- cumsum(c(20, 12, 16, 16, 16))
bcap
[1] 20 32 48 64 80
I want to randomize pln such that 1:20, 21:32, 33:48, 49:64, 65:80, for this example. This might vary for different scenarios.
newpln <- c(sample(1:20), sample(21:32), sample(33:48),
sample(49:64), sample(65:80))
I want create a general function where length of bcap can be of any number, however the pln should run 1: max(bcap).

Is this what you want?
> unlist(sapply(mapply(seq, c(1, bcap[1:(length(bcap)-1)]+1), bcap), sample))
[1] 13 19 4 16 11 2 5 20 9 14 10 3 1 7 6 8 17 12 15 18 27 24 30 32 23 25 28 21 31 26 29 22 39 41 48 36 37 45 42 47 43 38 40 34 35
[46] 44 46 33 60 52 50 58 51 54 62 55 64 61 59 49 63 53 56 57 72 74 76 78 67 69 70 66 73 79 68 80 77 71 75 65
Testing:
> pln <- 1:12
> pln
[1] 1 2 3 4 5 6 7 8 9 10 11 12
> bcap <- cumsum(c(4, 3, 2, 3))
> bcap
[1] 4 7 9 12
> unlist(sapply(mapply(seq, c(1, bcap[1:(length(bcap)-1)]+1), bcap), sample))
[1] 4 2 3 1 6 5 7 8 9 12 11 10
> unlist(sapply(mapply(seq, c(1, bcap[1:(length(bcap)-1)]+1), bcap), sample))
[1] 4 2 3 1 6 5 7 9 8 10 12 11
> unlist(sapply(mapply(seq, c(1, bcap[1:(length(bcap)-1)]+1), bcap), sample))
[1] 2 3 1 4 7 6 5 8 9 11 10 12

You can do this with one call to mapply. You just need an object that contains what's inside the cumsum call of your bcap object.
bvec <- c(20, 12, 16, 16, 16)
mapply(function(x,y) sample(x)+y-x, bvec, cumsum(bvec))
A small example:
bvec <- c(2,1,3,1)
set.seed(21)
unlist(mapply(function(x,y) sample(x)+y-x, bvec, cumsum(bvec)))
# [1] 2 1 3 4 5 6 7

library("plyr")
unlist(
llply(
mlply(
data.frame(from=c(1,bcap[-length(bcap)]), to=bcap),
seq),
sample),
use.names = FALSE)
Make a data.frame with each ranges from/to, use that to make a list with the sequences, sample each list, and then combine them together.
UPDATE:
worked for me:
> library("plyr")
> bcap <- cumsum(c(4, 3, 2, 3))
> unlist(llply(mlply(data.frame(from=c(1,bcap[-length(bcap)]), to=bcap),seq),sample),use.names=FALSE)
[1] 4 2 3 1 7 4 5 6 9 7 8 12 9 11 10
> unlist(llply(mlply(data.frame(from=c(1,bcap[-length(bcap)]), to=bcap),seq),sample),use.names=FALSE)
[1] 3 1 2 4 5 6 4 7 9 7 8 9 12 10 11
> unlist(llply(mlply(data.frame(from=c(1,bcap[-length(bcap)]), to=bcap),seq),sample),use.names=FALSE)
[1] 2 3 4 1 6 5 4 7 8 9 7 11 10 12 9