Generating interaction variables in R dataframes - r

Is there a way - other than a for loop - to generate new variables in an R dataframe, which will be all the possible 2-way interactions between the existing ones?
i.e. supposing a dataframe with three numeric variables V1, V2, V3, I would like to generate the following new variables:
Inter.V1V2 (= V1 * V2)
Inter.V1V3 (= V1 * V3)
Inter.V2V3 (= V2 * V3)
Example using for loop :
x <- read.table(textConnection('
V1 V2 V3 V4
1 9 25 18
2 5 20 10
3 4 30 12
4 4 34 16'
), header=TRUE)
dim.init <- dim(x)[2]
for (i in 1: (dim.init - 1) ) {
for (j in (i + 1) : (dim.init) ) {
x[dim(x)[2] + 1] <- x[i] * x[j]
names(x)[dim(x)[2]] <- paste("Inter.V",i,"V",j,sep="")
}
}

Here is a one liner for you that also works if you have factors:
> model.matrix(~(V1+V2+V3+V4)^2,x)
(Intercept) V1 V2 V3 V4 V1:V2 V1:V3 V1:V4 V2:V3 V2:V4 V3:V4
1 1 1 9 25 18 9 25 18 225 162 450
2 1 2 5 20 10 10 40 20 100 50 200
3 1 3 4 30 12 12 90 36 120 48 360
4 1 4 4 34 16 16 136 64 136 64 544
attr(,"assign")
[1] 0 1 2 3 4 5 6 7 8 9 10

Here you go, using combn and apply:
> x2 <- t(apply(x, 1, combn, 2, prod))
Setting the column names can be done with two paste commands:
> colnames(x2) <- paste("Inter.V", combn(1:4, 2, paste, collapse="V"), sep="")
Lastly, if you want all your variables together, just cbind them:
> x <- cbind(x, x2)
> V1 V2 V3 V4 Inter.V1V2 Inter.V1V3 Inter.V1V4 Inter.V2V3 Inter.V2V4 Inter.V3V4
1 1 9 25 18 9 25 18 225 162 450
2 2 5 20 10 10 40 20 100 50 200
3 3 4 30 12 12 90 36 120 48 360
4 4 4 34 16 16 136 64 136 64 544

I think this question should be complemented with the poly/polym function, which goes futher: it generates not only interactions between the variables, but its power until the selected degree. And orthogonal iteractions, which may be very usefull.
The directly solution to the asked problem would be:
> polym(x$V1, x$V2, x$V3, x$V4, degree = 2, raw = T)
1.0.0.0 2.0.0.0 0.1.0.0 1.1.0.0 0.2.0.0 0.0.1.0 1.0.1.0 0.1.1.0 0.0.2.0 0.0.0.1 1.0.0.1 0.1.0.1 0.0.1.1 0.0.0.2
[1,] 1 1 9 9 81 25 25 225 625 18 18 162 450 324
[2,] 2 4 5 10 25 20 40 100 400 10 20 50 200 100
[3,] 3 9 4 12 16 30 90 120 900 12 36 48 360 144
[4,] 4 16 4 16 16 34 136 136 1156 16 64 64 544 256
attr(,"degree")
[1] 1 2 1 2 2 1 2 2 2 1 2 2 2 2
The columns 4, 7, 8, 11, 12, 13 has the requested in the question. Other columns have other kinds of interactions. If you would like to get orthogonal interactions, just set raw = FALSE.

Related

Finding the k-largest clusters in dbscan result

I have a dataframe df, consists of 2 columns: x and y coordinates.
Each row refers to a point.
I feed it into dbscan function to obtain the clusters of the points in df.
library("fpc")
db = fpc::dbscan(df, eps = 0.08, MinPts = 4)
plot(db, df, main = "DBSCAN", frame = FALSE)
By using print(db), I can see the result returned by dbscan.
> print(db)
dbscan Pts=13131 MinPts=4 eps=0.08
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21
border 401 38 55 5 2 3 0 0 0 8 0 6 1 3 1 3 3 2 1 2 4 3
seed 0 2634 8186 35 24 561 99 7 22 26 5 75 17 9 9 54 1 2 74 21 3 15
total 401 2672 8241 40 26 564 99 7 22 34 5 81 18 12 10 57 4 4 75 23 7 18
22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44
border 4 1 2 6 2 1 3 7 2 1 2 3 11 1 3 1 3 2 5 5 1 4 3
seed 14 9 4 48 2 4 38 111 5 11 5 14 111 6 1 5 1 8 3 15 10 15 6
total 18 10 6 54 4 5 41 118 7 12 7 17 122 7 4 6 4 10 8 20 11 19 9
45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68
border 2 4 2 1 3 2 1 1 3 1 0 2 2 3 0 3 3 3 3 0 0 2 3 1
seed 15 2 9 11 4 8 12 4 6 8 7 7 3 3 4 3 3 4 2 9 4 2 1 4
total 17 6 11 12 7 10 13 5 9 9 7 9 5 6 4 6 6 7 5 9 4 4 4 5
69 70 71
border 3 3 3
seed 1 1 1
total 4 4 4
From the above summary, I can see cluster 2 consists of 8186 seed points (core points), cluster 1 consists of 2634 seed points and cluster 5 consists of 561 points.
I define the largest cluster as the one contains the largest amount of seed points. So, in this case, the largest cluster is cluster 2. And the 1st, 2nd, 3th largest clusters are 2, 1 and 5.
Are they any direct way to return the rows (points) in the largest cluster or the k-largest cluster in general?
I can do it in an indirect way.
I can obtain the assigned cluster number of each point by
db$cluster.
Hence, I can create a new dataframe df2 with db$cluster as the
new additional column besides the original x column and y
column.
Then, I can aggregate the df2 according to the cluster numbers in
the third column and find the number of points in each cluster.
After that, I can find the k-largest groups, which are 2, 1 and 5
again.
Finally, I can select the rows in df2 with third column value equals to 2 to return the points in the largest cluster.
But the above approach re-computes many known results as stated in the summary of print(db).
The dbscan function doesn't appear to retain the data.
library(fpc)
set.seed(665544)
n <- 600
df <- data.frame(x=runif(10, 0, 10)+rnorm(n, sd=0.2), y=runif(10, 0, 10)+rnorm(n,sd=0.2))
(dbs <- dbscan(df, 0.2))
#dbscan Pts=600 MinPts=5 eps=0.2
# 0 1 2 3 4 5 6 7 8 9 10 11
#border 28 4 4 8 5 3 3 4 3 4 6 4
#seed 0 50 53 51 52 51 54 54 54 53 51 1
#total 28 54 57 59 57 54 57 58 57 57 57 5
attributes(dbs)
#$names
#[1] "cluster" "eps" "MinPts" "isseed"
#$class
#[1] "dbscan"
Your indirect steps are not that indirect (only two lines needed), and these commands won't recalculate the clusters. So just run those commands, or put them in a function and then call the function in one command.
cluster_k <- function(dbs, data, k){
kth <- names(rev(sort(table(dbs$cluster)))[k])
data[dbs$cluster == kth,]
}
cluster_k(dbs=dbs, data=df, k=1)
## x y
## 3 6.580695 8.715245
## 13 6.704379 8.528486
## 23 6.809558 8.160721
## 33 6.375842 8.756433
## 43 6.603195 8.640206
## 53 6.728533 8.425067
## a data frame with 59 rows

R - Issues while calling a user-defined function

I have the following dataframe named "dataset"
> dataset
V1 V2 V3 V4 V5 V6 V7
1 A 29 27 0 14 21 163
2 W 70 40 93 63 44 1837
3 E 11 1 11 49 17 315
4 S 20 59 36 23 14 621
5 C 12 7 48 24 25 706
6 B 14 8 78 27 17 375
7 G 12 7 8 4 4 257
8 T 0 0 0 0 0 0
9 N 32 6 9 14 17 264
10 R 28 46 49 55 38 608
11 O 12 2 8 12 11 450
I have two helper functions as below
get_A <- function(p){
return(data.frame(Scorecard = p,
Results = dataset[nrow(dataset),(p+1)]))
} #Pulls the value from the last row for a given value of (p and offset by 1)
get_P <- function(p){
return(data.frame(Scorecard= p,
Results = dataset[p,ncol(dataset)]))
} #Pulls the value from the last column for a given value of p
I have the following dataframe on which I need to run the above helper functions. There will be NAs because I'm reading this "data_sub" dataframe from an excel file which can have unequal rows for the two columns.
> data_sub
Key_P Key_A
1 2 1
2 3 3
3 4 5
4 NA NA
When I call the helper functions, I get some strange results as shown below:
> get_P(data_sub[complete.cases(data_sub$Key_P),]$Key_P)
Scorecard Results
1 2 1837
2 3 315
3 4 621
> get_A(data_sub[complete.cases(data_sub$Key_A),]$Key_A)
Scorecard Results.V2 Results.V4 Results.V6
1 1 12 8 11
2 3 12 8 11
3 5 12 8 11
Warning message:
In data.frame(Scorecard = p, Results = dataset[nrow(dataset), (p + :
row names were found from a short variable and have been discarded
The call to the get_P() helper function is working the way I want. I'm getting the "Results" for each non-NA value in data_sub$Key_P as a dataframe.
But the call to the get_A() helper function is giving strange results and also a warning.I was expecting it to give a similar dataframe as given the call to get_P(). Why is this happening and how can I make get_A() to give the correct dataframe? Basically, the output of this should be
Scorecard Results
1 1 12
2 3 8
3 5 11
I found this link related to the warning but it's unhelpful in solving my issue.
The following works
get_P <- function(df, data_sub) {
data_sub <- data_sub[complete.cases(data_sub), ]
data.frame(
Scorecard = data_sub$Key_P,
Results = df[data_sub$Key_P, ncol(df)])
}
get_P(df, data_sub)
# Scorecard Results
#1 2 1837
#2 3 315
#3 4 621
get_A <- function(df, data_sub) {
data_sub <- data_sub[complete.cases(data_sub), ];
data.frame(
Scorecard = data_sub$Key_A,
Results = as.numeric(df[nrow(df), data_sub$Key_A + 1]))
}
get_A(df, data_sub)
# Scorecard Results
#1 1 12
#2 3 8
#3 5 11
To avoid the warning, we need to strip rownames with as.numeric in get_A.
Another tip: It's better coding practice to make get_P and get_A a function of both df and data_sub to avoid global variables.
Sample data
df <- read.table(text =
" V1 V2 V3 V4 V5 V6 V7
1 A 29 27 0 14 21 163
2 W 70 40 93 63 44 1837
3 E 11 1 11 49 17 315
4 S 20 59 36 23 14 621
5 C 12 7 48 24 25 706
6 B 14 8 78 27 17 375
7 G 12 7 8 4 4 257
8 T 0 0 0 0 0 0
9 N 32 6 9 14 17 264
10 R 28 46 49 55 38 608
11 O 12 2 8 12 11 450", header = T, row.names = 1)
data_sub <- read.table(text =
" Key_P Key_A
1 2 1
2 3 3
3 4 5
4 NA NA", header = T, row.names = 1)

Assign weights in lpSolveAPI to prioritise variables

I am trying to set up a linear programming solution using lpSolveAPI and R to solve a scheduling problem. Below is a small sample of the data; the minutes required for each session id, and their 'preferred' order/weight.
id <- 1:100
min <- sample(0:500, 100)
weight <- (1:100)/sum(1:100)
data <- data.frame(id, min, weight)
What I want to do is arrange/schedule these session IDs so that there are maximum number sessions in a day, preferably by their weight and each day is capped by a total of 400 minutes.
This is how I have set it up currently in R:
require(lpSolveAPI)
#Set up matrix to hold results; each row represents day
r <- 5
c <- 10
row <- 1
results <- matrix(0, nrow = r, ncol = c)
rownames(results) <- format(seq(Sys.Date(), by = "days", length.out = r), "%Y-%m-%d")
for (i in 1:r){
for(j in 1:c){
lp <- make.lp(0, nrow(data))
set.type(lp, 1:nrow(data), "binary")
set.objfn(lp, rep(1, nrow(data)))
lp.control(lp, sense = "max")
add.constraint(lp, data$min, "<=", 400)
set.branch.weights(lp, data$weight)
solve(lp)
a <- get.variables(lp)*data$id
b <- a[a!=0]
tryCatch(results[row, 1:length(b)] <- b, error = function(x) 0)
if(dim(data[!data$id == a,])[1] > 0) {
data <- data[!data$id== a,]
row <- row + 1
}
break
}
}
sum(results > 0)
barplot(results) #View of scheduled IDs
A quick look at the results matrix tells me that while the setup works to maximise number of sessions so that the total minutes in a day are close to 400 as possible, the setup doesn't follow the weights given. I expect my results matrix to be filled with increasing session IDs.
I have tried assigning different weights, weights in reverse order etc. but for some reason my setup doesn't seem to enforce "set.branch.weights".
I have read the documentation for "set.branch.weights" from lpSolveAPI but I think I am doing something wrong here.
Example - Data:
id min weight
1 67 1
2 72 2
3 36 3
4 91 4
5 80 5
6 44 6
7 76 7
8 58 8
9 84 9
10 96 10
11 21 11
12 1 12
13 41 13
14 66 14
15 89 15
16 62 16
17 11 17
18 42 18
19 68 19
20 25 20
21 44 21
22 90 22
23 4 23
24 33 24
25 31 25
Should be
Day 1 67 72 36 91 80 44 76
Day 2 58 84 96 21 1 41 66 89
Day 3 62 11 42 68 25 44 90 4 33 31
Each day has a cumulative sum of <= 480m.
My simple minded approach:
df = read.table(header=T,text="
id min weight
1 67 1
2 72 2
3 36 3
4 91 4
5 80 5
6 44 6
7 76 7
8 58 8
9 84 9
10 96 10
11 21 11
12 1 12
13 41 13
14 66 14
15 89 15
16 62 16
17 11 17
18 42 18
19 68 19
20 25 20
21 44 21
22 90 22
23 4 23
24 33 24
25 31 25")
# assume sorted by weight
daynr = 1
daymax = 480
dayusd = 0
for (i in 1:nrow(df))
{
v = df$min[i]
dayusd = dayusd + v
if (dayusd>daymax)
{
daynr = daynr + 1
dayusd = v
}
df$day[[i]] = daynr
}
This will give:
> df
id min weight day
1 1 67 1 1
2 2 72 2 1
3 3 36 3 1
4 4 91 4 1
5 5 80 5 1
6 6 44 6 1
7 7 76 7 1
8 8 58 8 2
9 9 84 9 2
10 10 96 10 2
11 11 21 11 2
12 12 1 12 2
13 13 41 13 2
14 14 66 14 2
15 15 89 15 2
16 16 62 16 3
17 17 11 17 3
18 18 42 18 3
19 19 68 19 3
20 20 25 20 3
21 21 44 21 3
22 22 90 22 3
23 23 4 23 3
24 24 33 24 3
25 25 31 25 3
>
I will concentrate on the first solve. We basically solve a knapsack problem (objective + one constraint):
When I run this model as is I get:
> solve(lp)
[1] 0
> x <- get.variables(lp)
> weightx <- data$weight * x
> sum(x)
[1] 14
> sum(weightx)
[1] 0.5952381
Now when I change the objective to
I get:
> solve(lp)
[1] 0
> x <- get.variables(lp)
> weightx <- data$weight * x
> sum(x)
[1] 14
> sum(weightx)
[1] 0.7428571
I.e. the count stayed at 14, but the weight improved.

adding rnorm to a column in loop

I am doing simulations and am trying to add error to a column repeatedly, specifically to the column titled Ao. In my output, the first 30 rows are correct; we have the initial data, the first year of altered data (error added to Ao), but then afterwards, where I would like to have 30 years of added error, I get repeats of Year 2 for Ao up to year 30. My goal is that I add error after each year of sampling. Ie. Year 2 is Year 1 Ao + error. Year 3 is Year 2 Ao + error, so on and so forth. Any helpers? Cheers.
for(t in 1:30){
Error<-rnorm(1000,0,1)
m<-rep(year1data$m,30)
r<-rep(year1data$r,30)
a<-rep(year1data$a,30)
g<-rep(year1data$g,30)
Year<-rep(2:31, each=TotSpecies)
Species<-1:TotSpecies
Ao<-year1data$Ao+sample(Error,TotSpecies,replace=FALSE)
TotSpeciesdata<-data.frame(Species,Year,Ao,m,r,a,g)
TotSpeciesdata<-rbind(year1data,TotSpeciesdata)
}
> TotSpeciesdata
Species Year Ao m r a g
1 1 1 25.770783 43 119.110786 3.2305180 2.6526471
2 2 1 53.908914 138 161.894541 0.7342070 0.1151602
3 3 1 2.010732 226 193.820489 2.2890904 3.6248105
4 4 1 23.742254 332 17.315335 1.4009572 2.0037931
5 5 1 4.291080 63 187.591209 0.2563995 2.1553908
6 6 1 4.691113 343 116.267867 0.3899113 3.3950085
7 7 1 604.133044 224 132.240197 3.0410743 0.7985524
8 8 1 13.332567 166 5.367118 0.7921644 1.7861011
9 9 1 3.759268 141 212.340970 2.8733737 2.7123141
10 10 1 3.647390 209 259.400858 0.1249936 0.6594659
11 11 1 23.731109 10 114.171147 2.2437372 0.9867591
12 12 1 85.116996 69 167.412993 0.8306823 2.8905148
13 13 1 31.684280 277 177.025460 2.7618332 2.9245554
14 14 1 30.657523 205 21.710438 2.7661347 1.5911379
15 15 1 12.240410 85 210.121109 2.8827455 3.0418454
16 1 2 27.038097 43 119.110786 3.2305180 2.6526471
17 2 2 54.251600 138 161.894541 0.7342070 0.1151602
18 3 2 2.010636 226 193.820489 2.2890904 3.6248105
19 4 2 22.699369 332 17.315335 1.4009572 2.0037931
20 5 2 4.542589 63 187.591209 0.2563995 2.1553908
21 6 2 3.607833 343 116.267867 0.3899113 3.3950085
22 7 2 604.480756 224 132.240197 3.0410743 0.7985524
23 8 2 13.663513 166 5.367118 0.7921644 1.7861011
24 9 2 2.138715 141 212.340970 2.8733737 2.7123141
25 10 2 3.642769 209 259.400858 0.1249936 0.6594659
26 11 2 22.897993 10 114.171147 2.2437372 0.9867591
27 12 2 85.490897 69 167.412993 0.8306823 2.8905148
28 13 2 31.689202 277 177.025460 2.7618332 2.9245554
29 14 2 30.644419 205 21.710438 2.7661347 1.5911379
30 15 2 12.050207 85 210.121109 2.8827455 3.0418454
31 1 3 27.038097 43 119.110786 3.2305180 2.6526471
32 2 3 54.251600 138 161.894541 0.7342070 0.1151602
33 3 3 2.010636 226 193.820489 2.2890904 3.6248105
34 4 3 22.699369 332 17.315335 1.4009572 2.0037931
35 5 3 4.542589 63 187.591209 0.2563995 2.1553908
36 6 3 3.607833 343 116.267867 0.3899113 3.3950085
37 7 3 604.480756 224 132.240197 3.0410743 0.7985524
38 8 3 13.663513 166 5.367118 0.7921644 1.7861011
39 9 3 2.138715 141 212.340970 2.8733737 2.7123141
40 10 3 3.642769 209 259.400858 0.1249936 0.6594659
41 11 3 22.897993 10 114.171147 2.2437372 0.9867591
42 12 3 85.490897 69 167.412993 0.8306823 2.8905148
43 13 3 31.689202 277 177.025460 2.7618332 2.9245554
44 14 3 30.644419 205 21.710438 2.7661347 1.5911379
45 15 3 12.050207 85 210.121109 2.8827455 3.0418454
The main problem you have with your approach is the line:
TotSpeciesdata<-data.frame(Species,Year,Ao,m,r,a,g)
Because Year is a 30 * TotSpecies vector, but all the others are just TotSpecies long. So in effect, you are recycling all columns except Year 30 times when you create the data frame, which will lead to the year 2 data repeated 30 times, among other things. If you just have Year <- rep(i + 1, TotSpecies) I think your logic will work fine. That said, here is an alternate approach:
This will, for each species, create an incrementing random walk starting with Ao for that species for 5 years (just did that for display purposes):
set.seed(1)
year1data <- data.frame(species=1:10, year=1, Ao=runif(10, 1, 700))
TotSpeciesData <- do.call(
rbind,
lapply(
split(year1data, year1data$species),
function(data)
with(
data,
data.frame(species=species, year=year, Ao=c(Ao, Ao + cumsum(rnorm(5)))
) ) ) )
head(TotSpeciesData, 15)
Note I excluded columns m-g since they don't seem directly relevant to your particular question, but you can add them relatively easily. I also only did 5 years in addition to year 1 so you can see the results here, but that is also easy to change:
species year Ao
1.1 1 1 186.5906
1.2 1 1 185.7701
1.3 1 1 186.2575
1.4 1 1 186.9958
1.5 1 1 187.5716
1.6 1 1 187.2662
2.1 2 1 261.1146
2.2 2 1 262.6264
2.3 2 1 263.0162
2.4 2 1 262.3950
2.5 2 1 260.1803
2.6 2 1 261.3052
3.1 3 1 401.4245
3.2 3 1 401.3796
3.3 3 1 401.3634
It has been pointed out that the code that you provided above, or at least that I have edited, repeats itself every 15 years, rather than being unique year year in a step-wise fashion. I edited it as shown below:
TotSpeciesData <- do.call(
rbind, #bind the table by rows
lapply( #applying the function in list form
split(year1data, year1data$Species), #splits data into groups by species
function(data)
with(
data,
data.frame(Species=Species, Year=1:Community, Ao=c(Ao, Ao + cumsum(rnorm((TotSpecies-1),0,2))),m=m, r=r, a=a, g=g) #data frame is Species, Year,
) ) )
TotSpeciesData$Ao[TotSpeciesData$Ao<0]<-0 #any values less than 0 go to 0
TotSpeciesData<-TotSpeciesData[order(TotSpeciesData$Year),] #orders the data frame by Year
When I do this code:
TotSpeciesData[TotSpeciesData$Species==1 & TotSpeciesData$Year %in% c(1,2,16,17),]
I end up with an output showing that the data is repeating itself.
Species Year Ao m r a g
1.1 1 1 48.49161 239 332.9625 3.791778 2.723104
1.2 1 2 49.62851 239 332.9625 3.791778 2.723104
1.16 1 16 48.49161 239 332.9625 3.791778 2.723104
1.17 1 17 49.62851 239 332.9625 3.791778 2.723104
Any comments toward this?

Sampling loop from vector

I am playing around to develop a sampling function to do randomization to make days easier:
Question:
pln <- 1:80
bcap <- cumsum(c(20, 12, 16, 16, 16))
bcap
[1] 20 32 48 64 80
I want to randomize pln such that 1:20, 21:32, 33:48, 49:64, 65:80, for this example. This might vary for different scenarios.
newpln <- c(sample(1:20), sample(21:32), sample(33:48),
sample(49:64), sample(65:80))
I want create a general function where length of bcap can be of any number, however the pln should run 1: max(bcap).
Is this what you want?
> unlist(sapply(mapply(seq, c(1, bcap[1:(length(bcap)-1)]+1), bcap), sample))
[1] 13 19 4 16 11 2 5 20 9 14 10 3 1 7 6 8 17 12 15 18 27 24 30 32 23 25 28 21 31 26 29 22 39 41 48 36 37 45 42 47 43 38 40 34 35
[46] 44 46 33 60 52 50 58 51 54 62 55 64 61 59 49 63 53 56 57 72 74 76 78 67 69 70 66 73 79 68 80 77 71 75 65
Testing:
> pln <- 1:12
> pln
[1] 1 2 3 4 5 6 7 8 9 10 11 12
> bcap <- cumsum(c(4, 3, 2, 3))
> bcap
[1] 4 7 9 12
> unlist(sapply(mapply(seq, c(1, bcap[1:(length(bcap)-1)]+1), bcap), sample))
[1] 4 2 3 1 6 5 7 8 9 12 11 10
> unlist(sapply(mapply(seq, c(1, bcap[1:(length(bcap)-1)]+1), bcap), sample))
[1] 4 2 3 1 6 5 7 9 8 10 12 11
> unlist(sapply(mapply(seq, c(1, bcap[1:(length(bcap)-1)]+1), bcap), sample))
[1] 2 3 1 4 7 6 5 8 9 11 10 12
You can do this with one call to mapply. You just need an object that contains what's inside the cumsum call of your bcap object.
bvec <- c(20, 12, 16, 16, 16)
mapply(function(x,y) sample(x)+y-x, bvec, cumsum(bvec))
A small example:
bvec <- c(2,1,3,1)
set.seed(21)
unlist(mapply(function(x,y) sample(x)+y-x, bvec, cumsum(bvec)))
# [1] 2 1 3 4 5 6 7
library("plyr")
unlist(
llply(
mlply(
data.frame(from=c(1,bcap[-length(bcap)]), to=bcap),
seq),
sample),
use.names = FALSE)
Make a data.frame with each ranges from/to, use that to make a list with the sequences, sample each list, and then combine them together.
UPDATE:
worked for me:
> library("plyr")
> bcap <- cumsum(c(4, 3, 2, 3))
> unlist(llply(mlply(data.frame(from=c(1,bcap[-length(bcap)]), to=bcap),seq),sample),use.names=FALSE)
[1] 4 2 3 1 7 4 5 6 9 7 8 12 9 11 10
> unlist(llply(mlply(data.frame(from=c(1,bcap[-length(bcap)]), to=bcap),seq),sample),use.names=FALSE)
[1] 3 1 2 4 5 6 4 7 9 7 8 9 12 10 11
> unlist(llply(mlply(data.frame(from=c(1,bcap[-length(bcap)]), to=bcap),seq),sample),use.names=FALSE)
[1] 2 3 4 1 6 5 4 7 8 9 7 11 10 12 9

Resources