I am playing around with a sampling function that randomizes a vector within blocks, to make my work easier:
Question:
pln <- 1:80
bcap <- cumsum(c(20, 12, 16, 16, 16))
bcap
[1] 20 32 48 64 80
I want to randomize pln within the blocks 1:20, 21:32, 33:48, 49:64, and 65:80 in this example. The blocks might vary for different scenarios.
newpln <- c(sample(1:20), sample(21:32), sample(33:48),
sample(49:64), sample(65:80))
I want to create a general function where the length of bcap can be any number; however, pln should always run from 1 to max(bcap).
Is this what you want?
> unlist(sapply(mapply(seq, c(1, bcap[1:(length(bcap)-1)]+1), bcap), sample))
[1] 13 19 4 16 11 2 5 20 9 14 10 3 1 7 6 8 17 12 15 18 27 24 30 32 23 25 28 21 31 26 29 22 39 41 48 36 37 45 42 47 43 38 40 34 35
[46] 44 46 33 60 52 50 58 51 54 62 55 64 61 59 49 63 53 56 57 72 74 76 78 67 69 70 66 73 79 68 80 77 71 75 65
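The same one-liner, unpacked step by step (a sketch; the intermediate names are just illustrative):
starts <- c(1, head(bcap, -1) + 1)                     # first index of each block
blocks <- mapply(seq, starts, bcap, SIMPLIFY = FALSE)  # one index sequence per block
newpln <- unlist(lapply(blocks, sample))               # shuffle within blocks, flatten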
Testing:
> pln <- 1:12
> pln
[1] 1 2 3 4 5 6 7 8 9 10 11 12
> bcap <- cumsum(c(4, 3, 2, 3))
> bcap
[1] 4 7 9 12
> unlist(sapply(mapply(seq, c(1, bcap[1:(length(bcap)-1)]+1), bcap), sample))
[1] 4 2 3 1 6 5 7 8 9 12 11 10
> unlist(sapply(mapply(seq, c(1, bcap[1:(length(bcap)-1)]+1), bcap), sample))
[1] 4 2 3 1 6 5 7 9 8 10 12 11
> unlist(sapply(mapply(seq, c(1, bcap[1:(length(bcap)-1)]+1), bcap), sample))
[1] 2 3 1 4 7 6 5 8 9 11 10 12
You can do this with one call to mapply. You just need an object that contains what's inside the cumsum call of your bcap object.
bvec <- c(20, 12, 16, 16, 16)
mapply(function(x,y) sample(x)+y-x, bvec, cumsum(bvec))
A small example:
bvec <- c(2,1,3,1)
set.seed(21)
unlist(mapply(function(x,y) sample(x)+y-x, bvec, cumsum(bvec)))
# [1] 2 1 3 4 5 6 7
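Wrapped up as the general function the question asks for (a minimal sketch; block_shuffle is a hypothetical name, and bvec holds the block lengths, i.e. what goes inside cumsum):
block_shuffle <- function(bvec) {
  # permute 1:sum(bvec) within consecutive blocks of the given lengths
  unlist(mapply(function(x, y) sample(x) + y - x, bvec, cumsum(bvec)))
}
block_shuffle(c(20, 12, 16, 16, 16))  # randomizes 1:80 within the five blocks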
library("plyr")
unlist(
  llply(
    mlply(
      data.frame(from = c(1, bcap[-length(bcap)] + 1), to = bcap),
      seq),
    sample),
  use.names = FALSE)
Make a data.frame with each range's from/to, use that to build a list of the sequences, sample each sequence, and then combine them together.
UPDATE:
Note the + 1 on the from vector above. Without it, i.e. with from = c(1, bcap[-length(bcap)]), the blocks overlap at their boundaries; for bcap <- cumsum(c(4, 3, 2, 3)) you get 15 values instead of 12, with the boundary values repeated:
> library("plyr")
> bcap <- cumsum(c(4, 3, 2, 3))
> unlist(llply(mlply(data.frame(from=c(1,bcap[-length(bcap)]), to=bcap),seq),sample),use.names=FALSE)
[1] 4 2 3 1 7 4 5 6 9 7 8 12 9 11 10
> unlist(llply(mlply(data.frame(from=c(1,bcap[-length(bcap)]), to=bcap),seq),sample),use.names=FALSE)
[1] 3 1 2 4 5 6 4 7 9 7 8 9 12 10 11
> unlist(llply(mlply(data.frame(from=c(1,bcap[-length(bcap)]), to=bcap),seq),sample),use.names=FALSE)
[1] 2 3 4 1 6 5 4 7 8 9 7 11 10 12 9
Related
I'm looking for the optimal way to go from a numeric vector containing duplicate entries, like this one:
a=c(1,3,4,4,4,5,7,9,27,28,28,30,42,43)
to this one, avoiding the duplicates by shifting +1 if appropriate:
b=c(1,3,4,5,6,7,8,9,27,28,29,30,42,43)
Side-by-side comparison:
> data.frame(a=a, b=b)
a b
1 1 1
2 3 3
3 4 4
4 4 5
5 4 6
6 5 7
7 7 8
8 9 9
9 27 27
10 28 28
11 28 29
12 30 30
13 42 42
14 43 43
Is there any easy and quick way to do it? Thanks!
In case you want it done only once (there may still be duplicates afterwards):
a=c(1,3,4,4,4,5,7,9,27,28,28,30,42,43)
a <- ifelse(duplicated(a),a+1,a)
output:
> a
[1] 1 3 4 5 5 5 7 9 27 28 29 30 42 43
A loop that repeats until no duplicates remain:
a=c(1,3,4,4,4,5,7,9,27,28,28,30,42,43)
while (any(duplicated(a))) {   # repeat until no duplicates remain
  a <- ifelse(duplicated(a), a + 1, a)
}
output:
> a
[1] 1 3 4 5 6 7 8 9 27 28 29 30 42 43
An alternative is to use a recursive function:
no_dupes <- function(x) {
  if (anyDuplicated(x) == 0)
    x
  else
    no_dupes(x + duplicated(x))
}
no_dupes(a)
[1] 1 3 4 5 6 7 8 9 27 28 29 30 42 43
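The trick is that duplicated(x) is a logical vector that gets coerced to 0/1 on addition, so each pass bumps every duplicate up by one:
x <- c(4, 4, 4)
duplicated(x)       # FALSE  TRUE  TRUE
x + duplicated(x)   # 4 5 5 -- the next pass then yields 4 5 6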
A tidyverse option using purrr::accumulate.
library(dplyr)
library(purrr)
accumulate(a, ~ if_else(.y <= .x, .x+1, .y))
# [1] 1 3 4 5 6 7 8 9 27 28 29 30 42 43
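If you'd rather avoid the dependency, the same scan can be written in base R with Reduce (a sketch):
Reduce(function(x, y) if (y <= x) x + 1 else y, a, accumulate = TRUE)
# [1]  1  3  4  5  6  7  8  9 27 28 29 30 42 43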
I would like to apply a function that selects the best transformation of certain variables in a data frame and then adds new columns to the data frame with the transformed data. I can currently get the transformation to run as follows; however, this overwrites the existing data instead of adding new, transformed variables. I have seen the other Stack Overflow posts about dynamically added variables, but can't quite seem to get it to work. Here is what I have:
df <- data.frame(study_id = c(1:10),
v1 = (sample(1:100, 10)),
v2 = (sample(1:100, 10)),
v3 = (sample(1:100, 10)),
v4 = (sample(1:100, 10)))
require(bestNormalize)
transformed <- function(x) {
bn <- bestNormalize(x)
return(bn$x.t)
}
df <- df %>%
  mutate(across(c(2,4:5), transformed))
Current output:
study_id v1 v2 v3 v4
1 1 -0.001846842 43 0.6559159 0.37893888
2 2 -2.416625847 81 -1.2998111 -0.64356058
3 3 1.012132345 95 -1.5086228 -0.48845289
4 4 0.798561562 2 0.8301299 0.30168982
5 5 -0.257460026 35 0.1322051 0.78737617
6 6 -0.179681789 42 -1.1352463 -2.42438347
7 7 0.378206706 22 -0.3635088 0.79583687
8 8 0.909304988 70 1.0748401 0.63712357
9 9 0.325879668 32 0.9041796 -0.09711216
10 10 -0.568470765 7 0.7099185 0.75254380
Desired output:
study_id v1 v2 v3 v4 v1_transformed v3_transformed v4_transformed
1 1 72 7 87 100 4 3 2
2 2 57 78 64 69 10 8 6
3 3 35 65 83 96 3 5 4
4 4 24 58 94 53 6 10 10
5 5 100 62 82 63 -1 7 3
6 6 47 55 4 50 8 4 1
7 7 83 97 35 41 7 2 -1
8 8 78 86 22 73 1 -1 9
9 9 11 39 93 68 2 0 7
10 10 36 49 8 72 0 1 0
Many thanks in advance.
Use the .names= argument of across:
df %>%
  mutate(across(c(2,4:5), transformed, .names = "{.col}_transformed"))
giving:
study_id v1 v2 v3 v4 v1_transformed v3_transformed v4_transformed
1 1 50 72 12 7 0.3850197 -0.7916019 -1.9775107
2 2 53 82 61 42 0.4425318 0.6132865 0.6790496
3 3 3 12 90 20 -2.3661268 0.9496526 -0.4232995
4 4 20 84 37 21 -0.5190229 0.1809655 -0.3508475
5 5 55 54 4 23 0.4790925 -1.7301008 -0.2157362
6 6 61 96 85 74 0.5812924 0.9002185 1.5209888
7 7 52 94 22 38 0.4237308 -0.2683955 0.5302984
8 8 72 41 57 35 0.7449435 0.5546340 0.4080778
9 9 13 67 6 45 -0.9434502 -1.3866702 0.7815968
10 10 74 48 93 14 0.7719892 0.9780114 -0.9526174
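The same call with column names instead of positions, in case the column order changes (a sketch; assumes the target columns are v1, v3 and v4 as in the question):
df %>%
  mutate(across(c(v1, v3, v4), transformed, .names = "{.col}_transformed"))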
I have a dataframe df, consists of 2 columns: x and y coordinates.
Each row refers to a point.
I feed it into dbscan function to obtain the clusters of the points in df.
library("fpc")
db = fpc::dbscan(df, eps = 0.08, MinPts = 4)
plot(db, df, main = "DBSCAN", frame = FALSE)
By using print(db), I can see the result returned by dbscan.
> print(db)
dbscan Pts=13131 MinPts=4 eps=0.08
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21
border 401 38 55 5 2 3 0 0 0 8 0 6 1 3 1 3 3 2 1 2 4 3
seed 0 2634 8186 35 24 561 99 7 22 26 5 75 17 9 9 54 1 2 74 21 3 15
total 401 2672 8241 40 26 564 99 7 22 34 5 81 18 12 10 57 4 4 75 23 7 18
22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44
border 4 1 2 6 2 1 3 7 2 1 2 3 11 1 3 1 3 2 5 5 1 4 3
seed 14 9 4 48 2 4 38 111 5 11 5 14 111 6 1 5 1 8 3 15 10 15 6
total 18 10 6 54 4 5 41 118 7 12 7 17 122 7 4 6 4 10 8 20 11 19 9
45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68
border 2 4 2 1 3 2 1 1 3 1 0 2 2 3 0 3 3 3 3 0 0 2 3 1
seed 15 2 9 11 4 8 12 4 6 8 7 7 3 3 4 3 3 4 2 9 4 2 1 4
total 17 6 11 12 7 10 13 5 9 9 7 9 5 6 4 6 6 7 5 9 4 4 4 5
69 70 71
border 3 3 3
seed 1 1 1
total 4 4 4
From the above summary, I can see that cluster 2 consists of 8186 seed points (core points), cluster 1 of 2634 seed points, and cluster 5 of 561.
I define the largest cluster as the one containing the largest number of seed points. So, in this case, the largest cluster is cluster 2, and the 1st, 2nd and 3rd largest clusters are 2, 1 and 5.
Is there any direct way to return the rows (points) in the largest cluster, or in the k-th largest cluster in general?
I can do it in an indirect way:
1. Obtain the assigned cluster number of each point via db$cluster.
2. Create a new dataframe df2 with db$cluster as a new additional column besides the original x and y columns.
3. Aggregate df2 according to the cluster numbers in the third column and find the number of points in each cluster.
4. Find the k-largest groups, which are 2, 1 and 5 again.
5. Select the rows in df2 with the third column value equal to 2 to return the points in the largest cluster.
But the above approach re-computes many known results, as stated in the summary of print(db).
The dbscan function doesn't appear to retain the data.
library(fpc)
set.seed(665544)
n <- 600
df <- data.frame(x=runif(10, 0, 10)+rnorm(n, sd=0.2), y=runif(10, 0, 10)+rnorm(n,sd=0.2))
(dbs <- dbscan(df, 0.2))
#dbscan Pts=600 MinPts=5 eps=0.2
# 0 1 2 3 4 5 6 7 8 9 10 11
#border 28 4 4 8 5 3 3 4 3 4 6 4
#seed 0 50 53 51 52 51 54 54 54 53 51 1
#total 28 54 57 59 57 54 57 58 57 57 57 5
attributes(dbs)
#$names
#[1] "cluster" "eps" "MinPts" "isseed"
#$class
#[1] "dbscan"
Your indirect steps are not that indirect (only two lines needed), and these commands won't recalculate the clusters. So just run those commands, or put them in a function and then call the function in one command.
cluster_k <- function(dbs, data, k) {
  # label of the k-th largest cluster by total point count
  kth <- names(rev(sort(table(dbs$cluster)))[k])
  # rows of the original data assigned to that cluster
  data[dbs$cluster == kth, ]
}
cluster_k(dbs=dbs, data=df, k=1)
## x y
## 3 6.580695 8.715245
## 13 6.704379 8.528486
## 23 6.809558 8.160721
## 33 6.375842 8.756433
## 43 6.603195 8.640206
## 53 6.728533 8.425067
## a data frame with 59 rows
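Note that cluster_k above ranks clusters by total point count (border plus seed, and the noise cluster 0 takes part in the ranking). If you want the question's definition, largest by seed points only, you can rank using the isseed attribute instead (a sketch; cluster_k_seed is a hypothetical name):
cluster_k_seed <- function(dbs, data, k) {
  seeds <- table(dbs$cluster[dbs$isseed])          # seed-point counts per cluster
  kth <- names(sort(seeds, decreasing = TRUE))[k]  # label of the k-th largest
  data[dbs$cluster == kth, ]
}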
I am trying to set up a linear programming solution using lpSolveAPI and R to solve a scheduling problem. Below is a small sample of the data: the minutes required for each session id, and their 'preferred' order/weight.
id <- 1:100
min <- sample(0:500, 100)
weight <- (1:100)/sum(1:100)
data <- data.frame(id, min, weight)
What I want to do is arrange/schedule these session IDs so that there is a maximum number of sessions in a day, preferably ordered by their weight, with each day capped at a total of 400 minutes.
This is how I have set it up currently in R:
require(lpSolveAPI)
#Set up a matrix to hold the results; each row represents a day
r <- 5
c <- 10
row <- 1
results <- matrix(0, nrow = r, ncol = c)
rownames(results) <- format(seq(Sys.Date(), by = "days", length.out = r), "%Y-%m-%d")
for (i in 1:r) {
  for (j in 1:c) {
    lp <- make.lp(0, nrow(data))
    set.type(lp, 1:nrow(data), "binary")
    set.objfn(lp, rep(1, nrow(data)))
    lp.control(lp, sense = "max")
    add.constraint(lp, data$min, "<=", 400)
    set.branch.weights(lp, data$weight)
    solve(lp)
    a <- get.variables(lp) * data$id
    b <- a[a != 0]
    tryCatch(results[row, 1:length(b)] <- b, error = function(x) 0)
    if (dim(data[!data$id == a, ])[1] > 0) {
      data <- data[!data$id == a, ]
      row <- row + 1
    }
    break
  }
}
sum(results > 0)
barplot(results) #View of scheduled IDs
A quick look at the results matrix tells me that while the setup works to maximise the number of sessions so that the total minutes in a day are as close to 400 as possible, it doesn't follow the given weights. I expect my results matrix to be filled with increasing session IDs.
I have tried assigning different weights, weights in reverse order, etc., but for some reason my setup doesn't seem to enforce "set.branch.weights".
I have read the documentation for "set.branch.weights" from lpSolveAPI but I think I am doing something wrong here.
Example - Data:
id min weight
1 67 1
2 72 2
3 36 3
4 91 4
5 80 5
6 44 6
7 76 7
8 58 8
9 84 9
10 96 10
11 21 11
12 1 12
13 41 13
14 66 14
15 89 15
16 62 16
17 11 17
18 42 18
19 68 19
20 25 20
21 44 21
22 90 22
23 4 23
24 33 24
25 31 25
The expected output is:
Day 1 67 72 36 91 80 44 76
Day 2 58 84 96 21 1 41 66 89
Day 3 62 11 42 68 25 44 90 4 33 31
Each day has a cumulative sum of <= 480 minutes.
My simple-minded approach:
df = read.table(header=T,text="
id min weight
1 67 1
2 72 2
3 36 3
4 91 4
5 80 5
6 44 6
7 76 7
8 58 8
9 84 9
10 96 10
11 21 11
12 1 12
13 41 13
14 66 14
15 89 15
16 62 16
17 11 17
18 42 18
19 68 19
20 25 20
21 44 21
22 90 22
23 4 23
24 33 24
25 31 25")
# assume sorted by weight
daynr  <- 1     # current day number
daymax <- 480   # capacity per day in minutes
dayusd <- 0     # minutes used so far on the current day
for (i in 1:nrow(df)) {
  v <- df$min[i]
  dayusd <- dayusd + v
  if (dayusd > daymax) {   # doesn't fit: start a new day
    daynr  <- daynr + 1
    dayusd <- v
  }
  df$day[i] <- daynr
}
This will give:
> df
id min weight day
1 1 67 1 1
2 2 72 2 1
3 3 36 3 1
4 4 91 4 1
5 5 80 5 1
6 6 44 6 1
7 7 76 7 1
8 8 58 8 2
9 9 84 9 2
10 10 96 10 2
11 11 21 11 2
12 12 1 12 2
13 13 41 13 2
14 14 66 14 2
15 15 89 15 2
16 16 62 16 3
17 17 11 17 3
18 18 42 18 3
19 19 68 19 3
20 20 25 20 3
21 21 44 21 3
22 22 90 22 3
23 23 4 23 3
24 24 33 24 3
25 25 31 25 3
I will concentrate on the first solve. We basically solve a knapsack problem (an objective plus a single constraint):
max sum(i, x[i])
subject to sum(i, min[i] * x[i]) <= 400
x[i] in {0, 1}
When I run this model as is I get:
> solve(lp)
[1] 0
> x <- get.variables(lp)
> weightx <- data$weight * x
> sum(x)
[1] 14
> sum(weightx)
[1] 0.5952381
Now when I change the objective so that the weights act as a tie-breaker on top of the session count, say max sum(i, (1 + w[i]) * x[i]) (the weights sum to 1, so the weight term can never outweigh an extra session),
I get:
> solve(lp)
[1] 0
> x <- get.variables(lp)
> weightx <- data$weight * x
> sum(x)
[1] 14
> sum(weightx)
[1] 0.7428571
I.e. the count stayed at 14, but the weight improved.
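In lpSolveAPI terms such a change is just a different objective vector; a minimal sketch, assuming the tie-breaker objective above:
# composite objective: 1 per selected session plus its weight, so the
# session count is still maximized and the weights break ties
set.objfn(lp, rep(1, nrow(data)) + data$weight)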
I would like to calculate the standard deviation of every 4 values down a column, from the first to the last observation. I have found lots of answers for moving-SD functions, but I simply need a line of code that will calculate the sd() for every 4 data values and write the answers into a new column in the data frame, as below.
Example data:
Obs Count
1 56
2 29
3 66
4 62
5 49
6 12
7 65
8 81
9 73
10 66
11 71
12 59
Desired output:
Obs Count SD
1 56 16.68
2 29 16.68
3 66 16.68
4 62 16.68
5 49 29.55
6 12 29.55
7 65 29.55
8 81 29.55
9 73 6.24
10 66 6.24
11 71 6.24
12 59 6.24
I tried the code below, but it is obviously incorrect:
a <- for(i in 1:length(df)) sd(df$Count[i:(i+3)])
This should be a very easy task, but I have not been able to find an answer. I am still learning and any help would be appreciated.
In base R, you can use the following to create an index of "every 4 rows":
(seq_len(nrow(mydf))-1) %/% 4
# [1] 0 0 0 0 1 1 1 1 2 2 2 2
Using that, you can use ave to get the desired result:
mydf$SD <- ave(mydf$Count, (seq_len(nrow(mydf))-1) %/% 4, FUN = sd)
mydf
# Obs Count SD
# 1 1 56 16.680827
# 2 2 29 16.680827
# 3 3 66 16.680827
# 4 4 62 16.680827
# 5 5 49 29.545163
# 6 6 12 29.545163
# 7 7 65 29.545163
# 8 8 81 29.545163
# 9 9 73 6.238322
# 10 10 66 6.238322
# 11 11 71 6.238322
# 12 12 59 6.238322
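By the way, the same grouping index can also be built with gl() (a small sketch):
gl(n = ceiling(nrow(mydf) / 4), k = 4, length = nrow(mydf))
# [1] 1 1 1 1 2 2 2 2 3 3 3 3
# Levels: 1 2 3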
An alternative is to use rollapply from the zoo package in combination with rep.
> library(zoo)
> N <- 4 # every four values
> SDs <- rollapply(df[,2], width=N, by=N, sd)
> df$SD <- rep(SDs, each=N)
> df
Obs Count SD
1 1 56 16.680827
2 2 29 16.680827
3 3 66 16.680827
4 4 62 16.680827
5 5 49 29.545163
6 6 12 29.545163
7 7 65 29.545163
8 8 81 29.545163
9 9 73 6.238322
10 10 66 6.238322
11 11 71 6.238322
12 12 59 6.238322
You might want to do it all at once:
df$SD <- rep( rollapply(df[,2], width=N, by=N, sd), each=N)
This looks faster (I didn't test it, though):
# mydf = your data
idxs = rep(1:nrow(mydf), each = 4, length = nrow(mydf))
mydf = within(mydf, {
  Sd = rep(tapply(Count, idxs, sd), each = 4)
})
print(mydf)