Generating random number by length of blocks of data in R data frame - r

I am trying to simulate n times the measuring order and see how measuring order effects my study subject. To do this I am trying to generate integer random numbers to a new column in a dataframe. I have a big dataframe and i would like to add a column into the dataframe that consists a random number according to the number of observations in a block.
Example of data(each row is an observation):
df <- data.frame(A=c(1,1,1,2,2,3,3,3,3),
B=c("x","b","c","g","h","g","g","u","l"),
C=c(1,2,4,1,5,7,1,2,5))
A B C
1 1 x 1
2 1 b 2
3 1 c 4
4 2 g 1
5 2 h 5
6 3 g 7
7 3 g 1
8 3 u 2
9 3 l 5
What I'd like to do is add a D column and generate random integer numbers according to the length of each block. Blocks are defined in column A.
Result should look something like this:
df <- data.frame(A=c(1,1,1,2,2,3,3,3,3),
B=c("x","b","c","g","h","g","g","u","l"),
C=c(1,2,4,1,5,7,1,2,5),
D=c(2,1,3,2,1,4,3,1,2))
> df
A B C D
1 1 x 1 2
2 1 b 2 1
3 1 c 4 3
4 2 g 1 2
5 2 h 5 1
6 3 g 7 4
7 3 g 1 3
8 3 u 2 1
9 3 l 5 2
I have tried to use R:s sample() function to generate random numbers but my problem is splitting the data according to block length and adding the new column. Any help is greatly appreciated.

It can be done easily with ave
df$D <- ave( df$A, df$A, FUN = function(x) sample(length(x)) )
(you could replace length() with max(), or whatever, but length will work even if A is not numbers matching the length of their blocks)

This is really easy with ddply from plyr.
ddply(df, .(A), transform, D = sample(length(A)))
The longer manual version is:
Use split to split the data frame by the first column.
split_df <- split(df, df$A)
Then call sample on each member of the list.
split_df <- lapply(split_df, function(df)
{
df$D <- sample(nrow(df))
df
})
Then recombine with
df <- do.call(rbind, split_df)

One simple way:
df$D = 0
counts = table(df$A)
for (i in 1:length(counts)){
df$D[df$A == names(counts)[i]] = sample(counts[i])
}

Related

R: all combinations of nested list of variable length [duplicate]

I'm not sure if permutations is the correct word for this. I want to given a set of n vectors (i.e. [1,2],[3,4] and [2,3]) permute them all and get an output of
[1,3,2],[1,3,3],[1,4,2],[1,4,3],[2,3,2] etc.
Is there an operation in R that will do this?
This is a useful case for storing the vectors in a list and using do.call() to arrange for an appropriate function call for you. expand.grid() is the standard function you want. But so you don't have to type out or name individual vectors, try:
> l <- list(a = 1:2, b = 3:4, c = 2:3)
> do.call(expand.grid, l)
a b c
1 1 3 2
2 2 3 2
3 1 4 2
4 2 4 2
5 1 3 3
6 2 3 3
7 1 4 3
8 2 4 3
However, for all my cleverness, it turns out that expand.grid() accepts a list:
> expand.grid(l)
a b c
1 1 3 2
2 2 3 2
3 1 4 2
4 2 4 2
5 1 3 3
6 2 3 3
7 1 4 3
8 2 4 3
This is what expand.grid does.
Quoting from the help page: Create a data frame from all combinations of the supplied vectors or factors. The result is a data.frame with a row for each combination.
expand.grid(
c(1, 2),
c(3, 4),
c(2, 3)
)
Var1 Var2 Var3
1 1 3 2
2 2 3 2
3 1 4 2
4 2 4 2
5 1 3 3
6 2 3 3
7 1 4 3
8 2 4 3
As an alternative to expand.grid() you could use rep() to produce the desired combination. Consider the following simplified example using the original data from this question:
a <- c(1,2)
b <- c(3,4)
c <- c(2,3)
To get the expand.grid()-like effect, use rep() with a times= argument equal to the product of the length of the other vectors (or 4). The middle vector would use a nested rep() with products of vector length to either side (or 2 and 2). The end vector is like the first but with each= argument in order to pattern correctly. This is trivial to calculate when each vector is length of 2. Example:
#tibble of all combinations of a, b and c
tibble::tibble(
var1 = rep(a, times = 4),
var2 = rep(rep(b, each= 2), times = 2), #nested rep()
var3 = rep(c, each= 4)
)
For an unknown number of input vectors (or unknown vector lengths), we can get all combinations with rep() in a function like this:
#Produces a tibble of all combinations of input vectors
expand_tibble <- function(...){
x <- list(...) #all input vectors stored here
l <- lapply(x,length)|> unlist() #vector showing length of each input vector
t <- length(l) #total input vector count
r <-list() #empty list
for(i in 1:t){
if(i==1){ #first input vector
first <-l[2:length(l)] |> prod()
r[[i]]<-rep(x[[i]], each = first)
}else{ #last input vector
if(i==t){
last <- l[1:t-1] |> prod()
r[[i]]<-rep(x[[i]], last)
}else{ #all middle input vectors
m1 <- l[1:(i-1)] |> prod()
m2 <- l[(i+1):t] |> prod()
r[[i]] <- rep(rep(x[[i]], each=m1),m2)
}
}
names(r)[i]<-paste0("var",i)
}
tibble::as_tibble(r)
}
output:
expand_tibble(a,b,c)
var1 var2 var3
<dbl> <dbl> <dbl>
1 1 3 2
2 1 3 3
3 1 4 2
4 1 4 3
5 2 3 2
6 2 3 3
7 2 4 2
8 2 4 3

How to get permutations by selecting one member of a subset with multiple subsets in R? [duplicate]

I'm not sure if permutations is the correct word for this. I want to given a set of n vectors (i.e. [1,2],[3,4] and [2,3]) permute them all and get an output of
[1,3,2],[1,3,3],[1,4,2],[1,4,3],[2,3,2] etc.
Is there an operation in R that will do this?
This is a useful case for storing the vectors in a list and using do.call() to arrange for an appropriate function call for you. expand.grid() is the standard function you want. But so you don't have to type out or name individual vectors, try:
> l <- list(a = 1:2, b = 3:4, c = 2:3)
> do.call(expand.grid, l)
a b c
1 1 3 2
2 2 3 2
3 1 4 2
4 2 4 2
5 1 3 3
6 2 3 3
7 1 4 3
8 2 4 3
However, for all my cleverness, it turns out that expand.grid() accepts a list:
> expand.grid(l)
a b c
1 1 3 2
2 2 3 2
3 1 4 2
4 2 4 2
5 1 3 3
6 2 3 3
7 1 4 3
8 2 4 3
This is what expand.grid does.
Quoting from the help page: Create a data frame from all combinations of the supplied vectors or factors. The result is a data.frame with a row for each combination.
expand.grid(
c(1, 2),
c(3, 4),
c(2, 3)
)
Var1 Var2 Var3
1 1 3 2
2 2 3 2
3 1 4 2
4 2 4 2
5 1 3 3
6 2 3 3
7 1 4 3
8 2 4 3
As an alternative to expand.grid() you could use rep() to produce the desired combination. Consider the following simplified example using the original data from this question:
a <- c(1,2)
b <- c(3,4)
c <- c(2,3)
To get the expand.grid()-like effect, use rep() with a times= argument equal to the product of the length of the other vectors (or 4). The middle vector would use a nested rep() with products of vector length to either side (or 2 and 2). The end vector is like the first but with each= argument in order to pattern correctly. This is trivial to calculate when each vector is length of 2. Example:
#tibble of all combinations of a, b and c
tibble::tibble(
var1 = rep(a, times = 4),
var2 = rep(rep(b, each= 2), times = 2), #nested rep()
var3 = rep(c, each= 4)
)
For an unknown number of input vectors (or unknown vector lengths), we can get all combinations with rep() in a function like this:
#Produces a tibble of all combinations of input vectors
expand_tibble <- function(...){
x <- list(...) #all input vectors stored here
l <- lapply(x,length)|> unlist() #vector showing length of each input vector
t <- length(l) #total input vector count
r <-list() #empty list
for(i in 1:t){
if(i==1){ #first input vector
first <-l[2:length(l)] |> prod()
r[[i]]<-rep(x[[i]], each = first)
}else{ #last input vector
if(i==t){
last <- l[1:t-1] |> prod()
r[[i]]<-rep(x[[i]], last)
}else{ #all middle input vectors
m1 <- l[1:(i-1)] |> prod()
m2 <- l[(i+1):t] |> prod()
r[[i]] <- rep(rep(x[[i]], each=m1),m2)
}
}
names(r)[i]<-paste0("var",i)
}
tibble::as_tibble(r)
}
output:
expand_tibble(a,b,c)
var1 var2 var3
<dbl> <dbl> <dbl>
1 1 3 2
2 1 3 3
3 1 4 2
4 1 4 3
5 2 3 2
6 2 3 3
7 2 4 2
8 2 4 3

Using two grouping designations to create one 'combined' grouping variable

Given a data.frame:
df <- data.frame(grp1 = c(1,1,1,2,2,2,3,3,3,4,4,4),
grp2 = c(1,2,3,3,4,5,6,7,8,6,9,10))
#> df
# grp1 grp2
#1 1 1
#2 1 2
#3 1 3
#4 2 3
#5 2 4
#6 2 5
#7 3 6
#8 3 7
#9 3 8
#10 4 6
#11 4 9
#12 4 10
Both coluns are grouping variables, such that all 1's in column grp1 are known to be grouped together, and so on with all 2's, etc. Then the same goes for grp2. All 1's are known to be the same, all 2's the same.
Thus, if we look at the 3rd and 4th row, based on column 1 we know that the first 3 rows can be grouped together and the second 3 rows can be grouped together. Then since rows 3 and 4 share the same grp2 value, we know that all 6 rows, in fact, can be grouped together.
Based off the same logic we can see that the last six rows can also be grouped together (since rows 7 and 10 share the same grp2).
Aside from writing a fairly involved set of for() loops, is there a more straight forward approach to this? I haven't been able to think one one yet.
The final output that I'm hoping to obtain would look something like:
# > df
# grp1 grp2 combinedGrp
# 1 1 1 1
# 2 1 2 1
# 3 1 3 1
# 4 2 3 1
# 5 2 4 1
# 6 2 5 1
# 7 3 6 2
# 8 3 7 2
# 9 3 8 2
# 10 4 6 2
# 11 4 9 2
# 12 4 10 2
Thank you for any direction on this topic!
I would define a graph and label nodes according to connected components:
gmap = unique(stack(df))
gmap$node = seq_len(nrow(gmap))
oldcols = unique(gmap$ind)
newcols = paste0("node_", oldcols)
df[ newcols ] = lapply(oldcols, function(i) with(gmap[gmap$ind == i, ],
node[ match(df[[i]], values) ]
))
library(igraph)
g = graph_from_edgelist(cbind(df$node_grp1, df$node_grp2), directed = FALSE)
gmap$group = components(g)$membership
df$group = gmap$group[ match(df$node_grp1, gmap$node) ]
grp1 grp2 node_grp1 node_grp2 group
1 1 1 1 5 1
2 1 2 1 6 1
3 1 3 1 7 1
4 2 3 2 7 1
5 2 4 2 8 1
6 2 5 2 9 1
7 3 6 3 10 2
8 3 7 3 11 2
9 3 8 3 12 2
10 4 6 4 10 2
11 4 9 4 13 2
12 4 10 4 14 2
Each unique element of grp1 or grp2 is a node and each row of df is an edge.
One way to do this is via a matrix that defines links between rows based on group membership.
This approach is related to #Frank's graph answer but uses an adjacency matrix rather than using edges to define the graph. An advantage of this approach is it can deal immediately with many > 2 grouping columns with the same code. (So long as you write the function that determines links flexibly.) A disadvantage is you need to make all pair-wise comparisons between rows to construct the matrix, so for very long vectors it could be slow. As is, #Frank's answer would work better for very long data, or if you only ever have two columns.
The steps are
compare rows based on groups and define these rows as linked (i.e., create a graph)
determine connected components of the graph defined by the links in 1.
You could do 2 a few ways. Below I show a brute force way where you 2a) collapse links, till reaching a stable link structure using matrix multiplication and 2b) convert the link structure to a factor using hclust and cutree. You could also use igraph::clusters on a graph created from the matrix.
1. construct an adjacency matrix (matrix of pairwise links) between rows
(i.e., if they in the same group, the matrix entry is 1, otherwise it's 0). First making a helper function that determines whether two rows are linked
linked_rows <- function(data){
## helper function
## returns a _function_ to compare two rows of data
## based on group membership.
## Use Vectorize so it works even on vectors of indices
Vectorize(function(i, j) {
## numeric: 1= i and j have overlapping group membership
common <- vapply(names(data), function(name)
data[i, name] == data[j, name],
FUN.VALUE=FALSE)
as.numeric(any(common))
})
}
which I use in outer to construct a matrix,
rows <- 1:nrow(df)
A <- outer(rows, rows, linked_rows(df))
2a. collapse 2-degree links to 1-degree links. That is, if rows are linked by an intermediate node but not directly linked, lump them in the same group by defining a link between them.
One iteration involves: i) matrix multiply to get the square of A, and
ii) set any non-zero entry in the squared matrix to 1 (as if it were a first degree, pairwise link)
## define as a function to use below
lump_links <- function(A) {
A <- A %*% A
A[A > 0] <- 1
A
}
repeat this till the links are stable
oldA <- 0
i <- 0
while (any(oldA != A)) {
oldA <- A
A <- lump_links(A)
}
2b. Use the stable link structure in A to define groups (connected components of the graph). You could do this a variety of ways.
One way, is to first define a distance object, then use hclust and cutree. If you think about it, we want to define linked (A[i,j] == 1) as distance 0. So the steps are a) define linked as distance 0 in a dist object, b) construct a tree from the dist object, c) cut the tree at zero height (i.e., zero distance):
df$combinedGrp <- cutree(hclust(as.dist(1 - A)), h = 0)
df
In practice you can encode steps 1 - 2 in a single function that uses the helper lump_links and linked_rows:
lump <- function(df) {
rows <- 1:nrow(df)
A <- outer(rows, rows, linked_rows(df))
oldA <- 0
while (any(oldA != A)) {
oldA <- A
A <- lump_links(A)
}
df$combinedGrp <- cutree(hclust(as.dist(1 - A)), h = 0)
df
}
This works for the original df and also for the structure in #rawr's answer
df <- data.frame(grp1 = c(1,1,1,2,2,2,3,3,3,4,4,4,5,5,6,7,8,9),
grp2 = c(1,2,3,3,4,5,6,7,8,6,9,10,11,3,12,3,6,12))
lump(df)
grp1 grp2 combinedGrp
1 1 1 1
2 1 2 1
3 1 3 1
4 2 3 1
5 2 4 1
6 2 5 1
7 3 6 2
8 3 7 2
9 3 8 2
10 4 6 2
11 4 9 2
12 4 10 2
13 5 11 1
14 5 3 1
15 6 12 3
16 7 3 1
17 8 6 2
18 9 12 3
PS
Here's a version using igraph, which makes the connection with #Frank's answer more clear:
lump2 <- function(df) {
rows <- 1:nrow(df)
A <- outer(rows, rows, linked_rows(df))
cluster_A <- igraph::clusters(igraph::graph.adjacency(A))
df$combinedGrp <- cluster_A$membership
df
}
Hope this solution helps you a bit:
Assumption: df is ordered on the basis of grp1.
## split dataset using values of grp1
split_df <- split.default(df$grp2,df$grp1)
parent <- vector('integer',length(split_df))
## find out which combinations have values of grp2 in common
for (i in seq(1,length(split_df)-1)){
for (j in seq(i+1,length(split_df))){
inter <- intersect(split_df[[i]],split_df[[j]])
if (length(inter) > 0){
parent[j] <- i
}
}
}
ans <- vector('list',length(split_df))
index <- which(parent == 0)
## index contains indices of elements that have no element common
for (i in seq_along(index)){
ans[[index[i]]] <- rep(i,length(split_df[[i]]))
}
rest_index <- seq(1,length(split_df))[-index]
for (i in rest_index){
val <- ans[[parent[i]]][1]
ans[[i]] <- rep(val,length(split_df[[i]]))
}
df$combinedGrp <- unlist(ans)
df
grp1 grp2 combinedGrp
1 1 1 1
2 1 2 1
3 1 3 1
4 2 3 1
5 2 4 1
6 2 5 1
7 3 6 2
8 3 7 2
9 3 8 2
10 4 6 2
11 4 9 2
12 4 10 2
Based on https://stackoverflow.com/a/35773701/2152245, I used a different implementation of igraph because I already had an adjacency matrix of sf polygons from st_intersects():
library(igraph)
library(sf)
# Use example data
nc <- st_read(system.file("shape/nc.shp", package="sf"))
nc <- nc[-sample(1:nrow(nc),nrow(nc)*.75),] #drop some polygons
# Find intersetions
b <- st_intersects(nc, sparse = F)
g <- graph.adjacency(b)
clu <- components(g)
gr <- groups(clu)
# Quick loop to assign the groups
for(i in 1:nrow(nc)){
for(j in 1:length(gr)){
if(i %in% gr[[j]]){
nc[i,'group'] <- j
}
}
}
# Make a new sfc object
nc_un <- group_by(nc, group) %>%
summarize(BIR74 = mean(BIR74), do_union = TRUE)
plot(nc_un['BIR74'])

Working with long data format in R

Good day,
d <- c(1,1,1,2,2,2,3,3,3)
e <- c(5,6,7,5,6,7,5,6,7)
f <- c(0,0,1,0,1,0,0,0,1)
df <- data.frame(d,e,f)
I have data the looks like the above. What I need to do is for each unique element of d find the first non-zero value in f, and find the corresponding value in e. To be specific, I want another vector g so it looks like this:
d <- c(1,1,1,2,2,2,3,3,3)
e <- c(5,6,7,5,6,7,5,6,7)
f <- c(0,0,1,0,1,0,0,0,1)
g <- c(7,7,7,6,6,6,7,7,7)
df <- data.frame(d,e,f,g)
Suggestions to do this easily? I thought I could use split(), but I am having trouble using which() after the split. I can use ave like this:
foo <- function(x){which(x>0)[1]}
df$t <- ave(df$f,df$d,FUN=foo)
But I am having trouble finding the value of e. Any help is appreciated.
Someone else can provide a base R solution, but here's a way to do this using plyr:
> ddply(df,.(d),transform,g = head(e[f != 0],1))
d e f g
1 1 5 0 7
2 1 6 0 7
3 1 7 1 7
4 2 5 0 6
5 2 6 1 6
6 2 7 0 6
7 3 5 0 7
8 3 6 0 7
9 3 7 1 7
Note that I took your note about the "first nonzero element" literally, even though your example data only had a single unique nonzero element in the column (by group).
here's a way in base R
g <- inverse.rle(list(lengths=rle(d)$lengths, values=e[f != 0]))

Combinations of multiple vectors in R

I'm not sure if permutations is the correct word for this. I want to given a set of n vectors (i.e. [1,2],[3,4] and [2,3]) permute them all and get an output of
[1,3,2],[1,3,3],[1,4,2],[1,4,3],[2,3,2] etc.
Is there an operation in R that will do this?
This is a useful case for storing the vectors in a list and using do.call() to arrange for an appropriate function call for you. expand.grid() is the standard function you want. But so you don't have to type out or name individual vectors, try:
> l <- list(a = 1:2, b = 3:4, c = 2:3)
> do.call(expand.grid, l)
a b c
1 1 3 2
2 2 3 2
3 1 4 2
4 2 4 2
5 1 3 3
6 2 3 3
7 1 4 3
8 2 4 3
However, for all my cleverness, it turns out that expand.grid() accepts a list:
> expand.grid(l)
a b c
1 1 3 2
2 2 3 2
3 1 4 2
4 2 4 2
5 1 3 3
6 2 3 3
7 1 4 3
8 2 4 3
This is what expand.grid does.
Quoting from the help page: Create a data frame from all combinations of the supplied vectors or factors. The result is a data.frame with a row for each combination.
expand.grid(
c(1, 2),
c(3, 4),
c(2, 3)
)
Var1 Var2 Var3
1 1 3 2
2 2 3 2
3 1 4 2
4 2 4 2
5 1 3 3
6 2 3 3
7 1 4 3
8 2 4 3
As an alternative to expand.grid() you could use rep() to produce the desired combination. Consider the following simplified example using the original data from this question:
a <- c(1,2)
b <- c(3,4)
c <- c(2,3)
To get the expand.grid()-like effect, use rep() with a times= argument equal to the product of the length of the other vectors (or 4). The middle vector would use a nested rep() with products of vector length to either side (or 2 and 2). The end vector is like the first but with each= argument in order to pattern correctly. This is trivial to calculate when each vector is length of 2. Example:
#tibble of all combinations of a, b and c
tibble::tibble(
var1 = rep(a, times = 4),
var2 = rep(rep(b, each= 2), times = 2), #nested rep()
var3 = rep(c, each= 4)
)
For an unknown number of input vectors (or unknown vector lengths), we can get all combinations with rep() in a function like this:
#Produces a tibble of all combinations of input vectors
expand_tibble <- function(...){
x <- list(...) #all input vectors stored here
l <- lapply(x,length)|> unlist() #vector showing length of each input vector
t <- length(l) #total input vector count
r <-list() #empty list
for(i in 1:t){
if(i==1){ #first input vector
first <-l[2:length(l)] |> prod()
r[[i]]<-rep(x[[i]], each = first)
}else{ #last input vector
if(i==t){
last <- l[1:t-1] |> prod()
r[[i]]<-rep(x[[i]], last)
}else{ #all middle input vectors
m1 <- l[1:(i-1)] |> prod()
m2 <- l[(i+1):t] |> prod()
r[[i]] <- rep(rep(x[[i]], each=m1),m2)
}
}
names(r)[i]<-paste0("var",i)
}
tibble::as_tibble(r)
}
output:
expand_tibble(a,b,c)
var1 var2 var3
<dbl> <dbl> <dbl>
1 1 3 2
2 1 3 3
3 1 4 2
4 1 4 3
5 2 3 2
6 2 3 3
7 2 4 2
8 2 4 3

Resources