R: How to create equal blocks of variables randomly?

I have a data frame of n = 20 variables (number of columns) spread over b = 5 blocks (4 variables per block).
I would like to create p = 4 random and equal-sized blocks of variables from the 5 blocks of variables.
I tried:
sample(x = 1:p, size = n, replace = TRUE)
[1] 1 1 1 1 1 1 1 1 1 2 2 2 3 3 3 4 4 4 4 4
Example of expected result (5 variables per block):
[1] 4 1 2 1 4 2 3 1 2 3 2 1 4 3 1 2 3 3 4 4
Thanks for your help!

You can try:
sample(x = rep(1:p,n/p), size = n, replace = FALSE)

Having discussed this in comments below, here is a solution:
Create a vector that looks like what you want, and then use sample to randomly sort it by sampling the whole vector without replacement:
p <- 4
b <- 5
sample(rep(1:p, b), size = p * b)
[1] 3 1 4 3 3 4 1 1 4 2 2 4 3 2 1 2 2 4 3 1
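To double-check that the blocks come out equal-sized, you can tabulate the result. A minimal sketch, reusing p and b from above (set.seed is only there to make the draw reproducible):
set.seed(1)                       # optional, for a reproducible draw
blocks <- sample(rep(1:p, b), size = p * b)
table(blocks)                     # each of the p block labels appears b times
# blocks
# 1 2 3 4
# 5 5 5 5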

Related

R: Calculation combinations and variable iteration for loop

I want to calculate combinations in R.
I want to calculate combinations and get results like those produced by the code below, but the number of nested for loops in my code depends on the number of blocks (i.e., length(ncomb)).
How do I make the number of for loops variable?
Or is there a better way to calculate the combinations that I want?
# Block sizes
nblock = c(1, 2, 3)
num_nblock = length(nblock)
# Positions
tol = c(1:6)
total = length(tol)
# Calculate the number of combinations
# 6C1 * 5C2 * 3C3
t1 = total
ncomb = c()
for (i in 1:num_nblock) {
  ncomb[i] = choose(t1, nblock[i])
  t1 = t1 - nblock[i]
}
# Calculate the combinations
Clist = data.frame()
for (i in 1:ncomb[1]) {
  comb1 = combn(total, nblock[1])
  remain = setdiff(tol, comb1[, i])
  for (j in 1:ncomb[2]) {
    comb2 = combn(remain, nblock[2])
    remain2 = setdiff(remain, comb2[, j])
    for (k in 1:ncomb[3]) {
      comb3 = combn(remain2, nblock[3])
      ans = c(comb1[, i], comb2[, j], comb3[, k])
      Clist = rbind(Clist, ans)
    }
  }
}
# Result: Clist
X1L X2L X3L X4L X5L X6L
1 1 2 3 4 5 6
2 1 2 4 3 5 6
3 1 2 5 3 4 6
4 1 2 6 3 4 5
5 1 3 4 2 5 6
6 1 3 5 2 4 6
7 1 3 6 2 4 5
8 1 4 5 2 3 6
9 1 4 6 2 3 5
10 1 5 6 2 3 4
.....
50 5 4 6 1 2 3
51 6 1 2 3 4 5
52 6 1 3 2 4 5
53 6 1 4 2 3 5
54 6 1 5 2 3 4
55 6 2 3 1 4 5
56 6 2 4 1 3 5
57 6 2 5 1 3 4
58 6 3 4 1 2 5
59 6 3 5 1 2 4
60 6 4 5 1 2 3
Here is an idea that may be a little harder to follow, but it solves your problem of needing a variable number of for loops.
Before I show my code, let me explain the idea using your example of dividing 1 through 6 into blocks of sizes 1, 2, and 3. As you said, the total number of combinations is 6C1*5C2*3C3 = 60. The question is how to fill up those 60 rows.
Think of a tree running from Block 1 to Block 3: each branch of Block 1 corresponds to 5C2 branches of Block 2, and each branch of Block 2 corresponds to 3C3 branches of Block 3, so the total number of branches is 6C1*5C2*3C3 = 60. To fill the output matrix, you repeat each branch of Block 1 5C2*3C3 = 10 times, repeat each branch of Block 2 3C3 = 1 time, and let each branch of Block 3 appear exactly once. In short, every branch is repeated as many times as the product of the "cardinalities" of the blocks to its right.
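Concretely, those repetition factors can be read off the ncomb vector; a small sketch (the full fill code follows below):
ncomb <- c(choose(6, 1), choose(5, 2), choose(3, 3))  # 6 10 1
prod(ncomb)         # 60 rows in total
prod(ncomb[2:3])    # each Block 1 branch is repeated 10 times
prod(ncomb[3:3])    # each Block 2 branch is repeated 1 time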
This is what the following code is doing.
# ++++ Using your example and initialization ++++
# Block sizes
nblock = c(1, 2, 3)
num_nblock = length(nblock)
# Positions
tol = c(1:6)
total = length(tol)
t1 = total
ncomb = c()
for (i in 1:num_nblock) {
  ncomb[i] = choose(t1, nblock[i])
  t1 = t1 - nblock[i]
}
# ++++++++
# Initialize the result matrix
Clist = matrix(nrow = prod(ncomb), ncol = total)
# Block column IDs: the list (1), (2,3), (4,5,6) of column indices in the output matrix
block_cols = list()
start = 1
for (i in 1:num_nblock) {
  block_cols[[i]] = start:(start + nblock[i] - 1)
  start = start + nblock[i]
}
# Fill the output matrix: iterate over each (row, block) of the matrix
for (i in 1:prod(ncomb)) {
  for (j in 1:num_nblock) {
    # First column ID of each block; in this example always 1, 2, 4
    block_first_col_id = block_cols[[j]][1]
    # Only fill the position when it is still NA
    if (is.na(Clist[i, block_first_col_id])) {
      # Candidates are all combinations of the numbers not already used in the blocks to the left
      remain = setdiff(tol, Clist[i, 0:(block_first_col_id - 1)])
      com = combn(remain, nblock[j])
      # Repetition factor: product of the "cardinalities" of the blocks to the right
      # (1 for the last block, which guards against indexing past the end of ncomb)
      each_n = if (j < num_nblock) prod(ncomb[(j + 1):num_nblock]) else 1
      # Key step: replicate each branch to fill the remaining cardinality
      filler = apply(com, 1, function(x) rep(x, each = each_n))
      # Store filler in the output.
      # Filler may be a vector, in which case dim() returns NULL
      filler_nrow = ifelse(is.null(dim(filler)[1]), 1, dim(filler)[1])
      Clist[i:(i + filler_nrow - 1), block_cols[[j]]] = filler
    }
  }
}
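As a quick check on the result (a sketch using the same example), the filled matrix should have 60 rows and every row should use each of 1 through 6 exactly once:
dim(Clist)                                              # expect: 60 6
all(apply(Clist, 1, function(r) all(sort(r) == 1:6)))   # expect: TRUE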

R, dplyr: Is there a way to add order of groups when there are multiple rows per group without creating a new data frame? [duplicate]

This question already has answers here:
How to create a consecutive group number
(13 answers)
Closed 2 years ago.
I have data from an experiment with multiple rows per item (each row has the reading time for one word of a sentence of n words) and multiple items per subject. Items can have varying numbers of rows. Items were presented in a random order, and their order in the data as initially read in reflects the sequence in which each subject saw them. What I'd like to do is add a column that contains the order in which the subject saw that item (i.e., 1 for the first item, 2 for the second, etc.).
Here's an example of some input data that has the relevant properties:
d <- data.frame(Subject = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2),
                Item = c(2, 2, 2, 1, 1, 1, 1, 2, 2, 2))
Subject Item
1 2
1 2
1 2
1 1
1 1
2 1
2 1
2 2
2 2
2 2
And here's the output I want:
Subject Item order
1 2 1
1 2 1
1 2 1
1 1 2
1 1 2
2 1 1
2 1 1
2 2 2
2 2 2
2 2 2
I know I can do this by setting up a temp data frame that filters d to unique combinations of Subject and Item, adding order to that as something like 1:n() or row_number(), and then using a join function to put it back together with the main data frame. What I'd like to know is whether there's a way to do this without having to create a new data frame just to store the order---can this be done inside dplyr's mutate somehow if I group by Subject and Item, for instance?
Here's one way:
library(dplyr)

d %>%
  group_by(Subject) %>%
  mutate(order = match(Item, unique(Item))) %>%
  ungroup()
# # A tibble: 10 x 3
# Subject Item order
# <dbl> <dbl> <int>
# 1 1 2 1
# 2 1 2 1
# 3 1 2 1
# 4 1 1 2
# 5 1 1 2
# 6 2 1 1
# 7 2 1 1
# 8 2 2 2
# 9 2 2 2
# 10 2 2 2
Here is a base R option
transform(d,
  order = ave(Item, Subject, FUN = function(x) as.integer(factor(x, levels = unique(x))))
)
or
transform(d,
  order = ave(Item, Subject, FUN = function(x) match(x, unique(x)))
)
both giving
Subject Item order
1 1 2 1
2 1 2 1
3 1 2 1
4 1 1 2
5 1 1 2
6 2 1 1
7 2 1 1
8 2 2 2
9 2 2 2
10 2 2 2
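If each item's rows are guaranteed to be contiguous within a subject (as in this data), a run-length id gives the same order-of-appearance index. A sketch, assuming the data.table package is available for rleid():
library(dplyr)

d %>%
  group_by(Subject) %>%
  mutate(order = data.table::rleid(Item)) %>%   # new index each time Item changes within a Subject
  ungroup()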

create variable conditionally by group in R (write function)

I want to create a variable by group, conditioned on an existing individual-level variable. Each individual has an outlier variable taking the value 1, 2, or 3. I want to create a new variable by group so that the new variable = 2 whenever at least one individual in that group has outlier = 2, and the new variable = 3 whenever at least one individual in that group has outlier = 3.
The data looks like this
grpid id outlier
1 1 1
1 2 1
1 3 2
2 4 1
2 5 3
2 6 1
3 7 1
3 8 1
3 9 1
Ideal output like this
grpid id outlier goutlier
1 1 1 2
1 2 1 2
1 3 2 2
2 4 1 3
2 5 3 3
2 6 1 3
3 7 1 1
3 8 1 1
3 9 1 1
Any suggestions?
Thanks!
It is easy with dplyr
library(dplyr)
df <- read.table(header = TRUE, sep = ",",
text = "grpid,id,outlier
1,1,1
1,2,1
1,3,2
2,4,1
2,5,3
2,6,1
3,7,1
3,8,1
3,9,1")
df %>% group_by(grpid) %>% mutate(goutlier = max(outlier))
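Because outlier is coded so that larger values dominate (3 over 2 over 1), taking the group maximum gives exactly the requested column. The same idea in base R, as a sketch, using ave():
df$goutlier <- ave(df$outlier, df$grpid, FUN = max)   # group-wise maximum, repeated for every row
df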

Percolation clustering

Consider the following groupings:
> data.frame(x = c(3:5,7:9,12:14), grp = c(1,1,1,2,2,2,3,3,3))
x grp
1 3 1
2 4 1
3 5 1
4 7 2
5 8 2
6 9 2
7 12 3
8 13 3
9 14 3
Let's say I don't know the grp values but only have a vector x. What is the easiest way to generate the grp values, essentially an id field for groups of values within a threshold of each other? Is this a percolation algorithm?
One option would be to compare each value with the previous one, check whether the difference is greater than 1, and take the cumulative sum.
df1$grp <- cumsum(c(TRUE, diff(df1$x) > 1))
df1$grp
#[1] 1 1 1 2 2 2 3 3 3
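The same idea generalizes to any gap threshold, as long as x is sorted. A small sketch with a hypothetical threshold thr:
x <- c(3:5, 7:9, 12:14)
thr <- 1                                # maximum gap allowed within a group
grp <- cumsum(c(TRUE, diff(x) > thr))   # start a new group whenever the gap exceeds thr
grp
# [1] 1 1 1 2 2 2 3 3 3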

Calculating the occurrences of numbers in the subsets of a data.frame

I have a data frame in R similar to the following one. My real ’df’ data frame is much bigger than this, but to keep things clear I have simplified it as much as possible.
So here’s the data frame.
id <-c(1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3,3)
a <-c(3,1,3,3,1,3,3,3,3,1,3,2,1,2,1,3,3,2,1,1,1,3,1,3,3,3,2,1,1,3)
b <-c(3,2,1,1,1,1,1,1,1,1,1,2,1,3,2,1,1,1,2,1,3,1,2,2,1,3,3,2,3,2)
c <-c(1,3,2,3,2,1,2,3,3,2,2,3,1,2,3,3,3,1,1,2,3,3,1,2,2,3,2,2,3,2)
d <-c(3,3,3,1,3,2,2,1,2,3,2,2,2,1,3,1,2,2,3,2,3,2,3,2,1,1,1,1,1,2)
e <-c(2,3,1,2,1,2,3,3,1,1,2,1,1,3,3,2,1,1,3,3,2,2,3,3,3,2,3,2,1,3)
df <-data.frame(id,a,b,c,d,e)
df
Basically what I would like to do is to get the occurrences of numbers for each column (a,b,c,d,e) and for each id group (1,2,3) (for this latter grouping see my column ’id’).
So, for column ’a’ and for id number ’1’ (for the latter see column ’id’) the code would be something like this:
as.numeric(table(df[1:10,2]))
##The results are:
[1] 3 7
Just to briefly explain my results: in column ’a’ (and regarding only those records which have number ’1’ in column ’id’) we can say that number '1' occurred 3 times and number '3' occurred 7 times.
Again, just to show you another example. For column ’a’ and for id number ’2’ (for the latter grouping see again column ’id’):
as.numeric(table(df[11:20,2]))
##After running the code the results are:
[1] 4 3 3
Let me explain a little again: in column ’a’, and regarding only those observations which have number ’2’ in column ’id’, we can say that number '1' occurred 4 times, number '2' occurred 3 times and number '3' occurred 3 times.
So this is what I would like to do: calculate the occurrences of numbers for each custom-defined subset (and then collect these values into a data frame). I know it is not a difficult task, but the PROBLEM is that I will have to change the input ’df’ dataframe on a regular basis, and hence both the overall number of rows and columns might change over time…
What I have done so far is that I have separated the ’df’ dataframe by columns, like this:
for (z in (2:ncol(df))) assign(paste("df",z,sep="."),df[,z])
So df.2 will refer to df$a, df.3 will equal df$b, df.4 will equal df$c etc. But I’m really stuck now and I don’t know how to move forward…
Is there a proper, "automatic" way to solve this problem?
How about -
> library(reshape)
> dftab <- table(melt(df,'id'))
> dftab
, , value = 1
variable
id a b c d e
1 3 8 2 2 4
2 4 6 3 2 4
3 4 2 1 5 1
, , value = 2
variable
id a b c d e
1 0 1 4 3 3
2 3 3 3 6 2
3 1 4 5 3 4
, , value = 3
variable
id a b c d e
1 7 1 4 5 3
2 3 1 4 2 4
3 5 4 4 2 5
So to get the number of '1's in column 'a' for id group '3'
you could just do
> dftab[3,'a',1]
[1] 4
A combination of tapply and apply can create the data you want:
tapply(df$id, df$id, function(x) apply(df[df$id == x[1], -1], 2, table))
However, when a column within a group doesn't contain all the possible values, as column a does for id 1, the result for that id group will be a list rather than a nice table (matrix).
$`1`
$`1`$a
1 3
3 7
$`1`$b
1 2 3
8 1 1
$`1`$c
1 2 3
2 4 4
$`1`$d
1 2 3
2 3 5
$`1`$e
1 2 3
4 3 3
$`2`
a b c d e
1 4 6 3 2 4
2 3 3 3 6 2
3 3 1 4 2 4
$`3`
a b c d e
1 4 2 1 5 1
2 1 4 5 3 4
3 5 4 4 2 5
I'm sure someone will have a more elegant solution than this, but you can cobble it together with a simple function and dlply from the plyr package.
library(plyr)

ColTables <- function(df) {
  counts <- list()
  for (a in names(df)[names(df) != "id"]) {
    counts[[a]] <- table(df[a])
  }
  return(counts)
}
results <- dlply(df, "id", ColTables)
This gets you back a list - the first "layer" of the list will be the id variable; the second the table results for each column for that id variable. For example:
> results[['2']]['a']
$a
1 2 3
4 3 3
For id variable = 2, column = a, per your above example.
One way to do it is using the aggregate function, but you have to add a column to your data frame:
> df$freq <- 0
> aggregate(freq~a+id,df,length)
a id freq
1 1 1 3
2 3 1 7
3 1 2 4
4 2 2 3
5 3 2 3
6 1 3 4
7 2 3 1
8 3 3 5
Of course you can write a function to do it, so it's easier to do it frequently, and you don't have to add a column to your actual data frame
> frequency <- function(df,groups) {
+ relevant <- df[,groups]
+ relevant$freq <- 0
+ aggregate(freq~.,relevant,length)
+ }
> frequency(df,c("b","id"))
b id freq
1 1 1 8
2 2 1 1
3 3 1 1
4 1 2 6
5 2 2 3
6 3 2 1
7 1 3 2
8 2 3 4
9 3 3 4
You didn't say how you'd like the data. The by function might give you the output you like.
by(df, df$id, function(x) lapply(x[,-1], table))
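For completeness, here is a sketch of the same counting done with dplyr and tidyr (assuming those packages are available); it returns the counts in long format, one row per id/column/value combination:
library(dplyr)
library(tidyr)

df %>%
  pivot_longer(-id, names_to = "column", values_to = "value") %>%  # reshape to long format
  count(id, column, value, name = "freq")                          # occurrences per id/column/value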
