I'm trying to iterate through a df with a number of addresses in different neighborhoods, and for each neighborhood I would like to randomly assign each address to one of two equally sized groups. My df looks roughly like this:
neighborhood <- c("armatage", "armatage", "armatage", "windom", "windom", "windom", "windom")
address <- c("a", "b", "c", "d", "e", "f", "g")
df <- data.frame(address, neighborhood)
but with many more neighborhoods, each with a varying number of addresses. Using the randomizr package, I have so far come up with this script, which iterates through each neighborhood name and generates a randomized vector of 0s and 1s whose length is the number of rows in that neighborhood. The problem seems to be the second for loop, which should assign the randomized value to each row:
for (i in df$neighborhood) {
  n <- nrow(df[df$neighborhood == i, ])
  z <- complete_ra((n))
  for (row in 1:nrow(df[df$neighborhood == i, ])) {
    df$group[row] <- z[row]
  }
}
where df$group is where I would like to store the randomly assigned value. I would greatly appreciate any advice anyone might have. Thanks!
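For what it's worth, the loop in the question can be repaired in base R. The sketch below assumes the goal is a 0/1 label per row. In the original inner loop, `row` counts from 1 within the whole data frame rather than within the current neighborhood, so every pass overwrites the first few rows; indexing with `which()` avoids that. Here `sample(n) %% 2L` stands in for `complete_ra(n)` so the snippet has no package dependency, and `nb` and `idx` are illustrative names.

```r
neighborhood <- c("armatage", "armatage", "armatage", "windom", "windom", "windom", "windom")
address <- c("a", "b", "c", "d", "e", "f", "g")
df <- data.frame(address, neighborhood)

df$group <- NA_integer_
# Visit each neighborhood once (not once per row, as df$neighborhood would)
for (nb in unique(df$neighborhood)) {
  idx <- which(df$neighborhood == nb)  # positions of this neighborhood's rows in df
  # Shuffle 1..n and keep the parity: a random, near-even 0/1 split
  df$group[idx] <- sample(length(idx)) %% 2L
}
df
```

Because the assignment targets `df$group[idx]`, each neighborhood writes only to its own rows.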
Here is another way that avoids the double loop:
library(data.table)
dt = as.data.table(df)
dt[, .(grp = sample(.N) %% 2,
address)
, by = neighborhood]
#> neighborhood grp address
#> 1: armatage 1 a
#> 2: armatage 0 b
#> 3: armatage 1 c
#> 4: windom 0 d
#> 5: windom 1 e
#> 6: windom 1 f
#> 7: windom 0 g
Basically, if we shuffle the sequence from 1 to the number of addresses in each neighborhood and take each value modulo 2, we get a random assignment into two groups whose sizes differ by at most one.
Background
Let's take a look at what the modulo operator %% does to the number sequence 1 to 4:
seq(from = 1, to = 4) ## or 1:4 or seq(4)
## [1] 1 2 3 4
seq(from = 1, to = 4) %% 2
## [1] 1 0 1 0
Mathematically, it tells us the remainder after division. That is, 1 / 2 has a remainder of 1; 2 / 2 has a remainder of 0; and so on. We can use this to make groupings. The problem is that this isn't random. That's where sample() comes into play:
sample(4) ## or sample(1:4) or sample(seq(1, 4))
## [1] 2 1 4 3
So if we combine modulo with sample(), we can effectively randomize within groups, provided we know how many rows are in each group. That's where grouped operations help: either the data.table dt[i, j, by] syntax or the dplyr tibble %>% group_by() %>% mutate() syntax. Yes, we could subset the unique neighborhoods in a loop, but grouped operations are more efficient.
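A quick way to convince yourself that the split stays balanced, sketched in base R with a made-up group size of 7 and an arbitrary seed:

```r
set.seed(1)            # arbitrary seed, only so the run is repeatable
n <- 7                 # hypothetical group size
grp <- sample(n) %% 2  # shuffle 1..n, then keep the parity of each value
table(grp)             # always four 1s and three 0s; only their positions vary
```

The shuffle changes which rows get which label, but the counts are fixed by how many odd and even numbers there are in 1..n.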
Since dplyr is what helped me initially, let's take a look at that version:
library(dplyr)
df %>%
group_by(neighborhood) %>%
mutate(group = sample(n()) %% 2)
## # A tibble: 7 x 3
## # Groups: neighborhood [2]
## address neighborhood group
## <chr> <chr> <dbl>
## 1 a armatage 1
## 2 b armatage 0
## 3 c armatage 1
## 4 d windom 1
## 5 e windom 0
## 6 f windom 1
## 7 g windom 0
An approach using dplyr, purrr
neighborhood <- c("armatage", "armatage", "armatage", "windom", "windom", "windom", "windom")
address <- c("a", "b", "c", "d", "e", "f", "g")
df <- data.frame(address, neighborhood)
library(dplyr)
library(purrr)
df %>%
# split the original data into one tibble per neighborhood with group_split() from dplyr
group_split(neighborhood) %>%
# then, for each neighborhood, apply a function that splits its rows into 2 groups
# based on their row number
map(.f = function(x) {
x %>% group_by((row_number() - 1) %/% (n() / 2)) %>%
nest %>% pull(.)
})
Result of above code
[[1]]
[[1]][[1]]
# A tibble: 2 x 2
address neighborhood
<chr> <chr>
1 a armatage
2 b armatage
[[1]][[2]]
# A tibble: 1 x 2
address neighborhood
<chr> <chr>
1 c armatage
[[2]]
[[2]][[1]]
# A tibble: 2 x 2
address neighborhood
<chr> <chr>
1 d windom
2 e windom
[[2]][[2]]
# A tibble: 2 x 2
address neighborhood
<chr> <chr>
1 f windom
2 g windom
In case you just want to add an index column that assigns each row to a separate group:
df %>%
group_by(neighborhood) %>%
# cur_group_id gave group index + some math to calculate proper index
# for each group base on their row number.
mutate(group = (cur_group_id() - 1) * 2 + (row_number() - 1) %/% (n() / 2) + 1)
Output
# A tibble: 7 x 3
# Groups: neighborhood [2]
address neighborhood group
<chr> <chr> <dbl>
1 a armatage 1
2 b armatage 1
3 c armatage 2
4 d windom 3
5 e windom 3
6 f windom 4
7 g windom 4
Related
Suppose we have a data frame:
Event <- c("A", "A", "A", "B", "B", "C" , "C", "C")
Model <- c( 1, 2, 3, 1, 2, 1, 2, 3)
df <- data.frame(Event, Model)
Which looks like this:
Event  Model
A      1
A      2
A      3
B      1
B      2
C      1
C      2
C      3
We can see that event B only has 2 models of data. As the actual data frame I am using has thousands of rows and 17 columns, how can I remove all events that do not have 3 models? My guess is to use a subset however I am not sure how to do it when we have more than one condition.
I tried the suggested code from YH Jang below:
df %>% group_by(Event) %>%
filter(max(Model)==3)
However, this would fail for entries in the data that look like this, because max(Model) is still 3 even though a model is missing:
Event  Model
A      1
A      3
example:
# A tibble: 6 × 2
# Groups: Event [2]
Event Model
<chr> <dbl>
1 A 1
2 A 3
4 C 1
5 C 2
6 C 3
Using dplyr,
df %>% group_by(Event) %>%
filter(max(Model) == 3)
the result would be
# A tibble: 6 × 2
# Groups: Event [2]
Event Model
<chr> <dbl>
1 A 1
2 A 2
3 A 3
4 C 1
5 C 2
6 C 3
or using data.table (this assumes df has already been converted to a data.table, e.g. with setDT(df)),
df[df[,.I[max(Model)==3],by=Event]$V1]
the result is the same:
Event Model
1: A 1
2: A 2
3: A 3
4: C 1
5: C 2
6: C 3
EDIT
I misunderstood the question.
Here's the edited answer.
# with dplyr
df %>% group_by(Event) %>%
filter(length(Model)>=3)
or
# with data.table
df[df[,.I[length(Model)>=3],by=Event]$V1]
Try this:
library(dplyr)
df %>% group_by(Event) %>%
filter(length(Model) >= 3)
or, more concisely:
df %>% group_by(Event) %>%
filter(n() >= 3)
This removes every Event that has fewer than three rows, i.e. fewer than three Model entries.
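Counting rows works when each Event lists a model at most once. If duplicates are possible, or if the requirement is specifically "models 1, 2 and 3 are all present", a stricter check can be sketched in base R. The data below is a made-up variant of the example, and `has_all` and `complete` are illustrative names.

```r
# Hypothetical data: A is missing model 2, B is missing model 3, C is complete
Event <- c("A", "A", "B", "B", "C", "C", "C")
Model <- c(1, 3, 1, 2, 1, 2, 3)
df <- data.frame(Event, Model)

# For each Event, flag whether models 1, 2 and 3 all occur;
# ave() recycles the single TRUE/FALSE across the group's rows (stored as 1/0)
has_all <- ave(df$Model, df$Event, FUN = function(m) all(1:3 %in% m))

complete <- df[has_all == 1, ]
complete
```

Only Event C survives here, whereas a max(Model) == 3 test would also keep A.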
Essentially, I need to alter every row that occurs after a certain condition has been met, and the operation also needs to respect a grouping variable. A simplified version of my data (shown below) consists of the grouping variable (Groups), followed by a value (N) and then the conditional variable (R). You can create it as follows:
Groups <- c("A", "A", "A", "A", "B", "B", "B", "B", "C", "C")
N <- c(1,1,1,1,1,1,1,1,1,1)
R <- c("N", "N", "Y", "N", "N", "N", "Y", "N", "N", "N")
Dat <- as.data.frame(cbind(Groups, N, R))
What I need is that, when R == "Y", that row and every row after it in the same group has 1 added to the N variable. So the solution should look like this:
Groups N R
1 A 1 N
2 A 1 N
3 A 2 Y
4 A 2 N
5 B 1 N
6 B 1 N
7 B 2 Y
8 B 2 N
9 C 1 N
10 C 1 N
So the loop needs to restart with each new group. Ideally, a solution within dplyr is preferred but I have not been able to find one yet.
Any help or guidance would be much appreciated!
Do a grouped cumsum on the logical vector R == "Y" and add the result to N:
library(dplyr)
Dat %>%
group_by(Groups) %>%
mutate(N = cumsum(R == "Y") + N) %>%
ungroup()
-output
# A tibble: 10 × 3
Groups N R
<chr> <dbl> <chr>
1 A 1 N
2 A 1 N
3 A 2 Y
4 A 2 N
5 B 1 N
6 B 1 N
7 B 2 Y
8 B 2 N
9 C 1 N
10 C 1 N
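The same idea can be sketched in base R with ave(), which applies a function within each group. One caveat worth spelling out: cumsum() keeps counting, so a group with two "Y" rows would end up with +2 after the second one. If the intent is to add 1 only once, cummax() caps the bump at 1. The names `bump` and `bump_once` are illustrative.

```r
Groups <- c("A", "A", "A", "A", "B", "B", "B", "B", "C", "C")
N <- c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1)
R <- c("N", "N", "Y", "N", "N", "N", "Y", "N", "N", "N")
Dat <- data.frame(Groups, N, R)

is_y <- as.integer(Dat$R == "Y")

# Running count of "Y" rows within each group
bump <- ave(is_y, Dat$Groups, FUN = cumsum)
Dat$N_new <- Dat$N + bump

# Variant that adds at most 1 per group, however many "Y" rows follow
bump_once <- ave(is_y, Dat$Groups, FUN = cummax)
Dat
```

With this data the two variants agree, because no group contains more than one "Y".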
data
Dat <- data.frame(Groups, N, R)
# NOTE: `cbind` converts its arguments to a `matrix`, and a matrix can hold only a single type, so N ends up non-numeric. Build the data with `data.frame` directly instead of going through `cbind`.
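A small check of that point, using a shortened, made-up version of the data (`bad` and `good` are illustrative names):

```r
Groups <- c("A", "B")
N <- c(1, 1)

bad  <- as.data.frame(cbind(Groups, N))  # cbind() builds a character matrix first
good <- data.frame(Groups, N)            # each column keeps its own type

class(bad$N)   # not numeric: "character" in R >= 4.0, "factor" in older versions
class(good$N)  # "numeric"
```

Arithmetic such as `N + 1` only works on the second version.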
You may do this with the help of match.
library(dplyr)
Dat %>%
group_by(Groups) %>%
mutate(N = N + as.integer(row_number() >= match('Y', R, nomatch = n() + 1))) %>%
ungroup
# Groups N R
# <chr> <dbl> <chr>
# 1 A 1 N
# 2 A 1 N
# 3 A 2 Y
# 4 A 2 N
# 5 B 1 N
# 6 B 1 N
# 7 B 2 Y
# 8 B 2 N
# 9 C 1 N
#10 C 1 N
I'm trying to order the rows of a data.frame conditional upon the value of another column.
Here's an example below:
library(magrittr)
library(dplyr)
df <- data.frame(grp = c(1,1,1,2,2,2),
ori = c("f","f","f","r","r","r"),
ite = c("A","B","C","A","B","C"))
df
# # grp ori ite
# 1 1 f A
# 2 1 f B
# 3 1 f C
# 4 2 r A
# 5 2 r B
# 6 2 r C
df %>%
group_by(grp) %>%
arrange(ifelse(ori == "f", ite, desc(ite)), .by_group = TRUE) %>%
ungroup()
# # A tibble: 6 × 3
# # Groups: grp [2]
# grp ori ite
# <dbl> <chr> <chr>
# 1 1 f A
# 2 1 f B
# 3 1 f C
# 4 2 r A
# 5 2 r B
# 6 2 r C
The expected output is:
# # grp ori ite
# 1 1 f A
# 2 1 f B
# 3 1 f C
# 4 2 r C
# 5 2 r B
# 6 2 r A
I have a general idea of why it doesn't work: arrange() cannot look at things on a per-row basis, which is what the ifelse() is asking it to do.
Is there a better way of accomplishing this?
The idea of using ifelse(ori == "f", ite, desc(ite)) is basically sound. Unfortunately, desc(ite) returns a negative numeric vector, whereas ite is a character vector, so ifelse() coerces everything to character:
ifelse(df$ori == "f", df$ite, dplyr::desc(df$ite))
#> [1] "A" "B" "C" "-1" "-3" "-5"
To get sort keys of the same (numeric) type for both directions, we can write a function asc() which does the opposite of desc(), i.e. returns xtfrm(x) without negating it:
asc <- function(x) {
xtfrm(x)
}
Now we can use both inside ifelse():
library(dplyr)
df <- data.frame(grp = c(1,1,1,2,2,2),
ori = c("f","f","f","r","r","r"),
ite = c("A","B","C","A","B","C"))
df %>%
arrange(ori, ifelse(ori == "r", desc(ite), asc(ite)))
#> grp ori ite
#> 1 1 f A
#> 2 1 f B
#> 3 1 f C
#> 4 2 r C
#> 5 2 r B
#> 6 2 r A
Created on 2022-08-21 by the reprex package (v2.0.1)
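For comparison, the same trick can be sketched in base R without dplyr: xtfrm() turns ite into numeric sort keys, and negating the keys for the "r" rows reverses their order. The names `key` and `out` are illustrative.

```r
df <- data.frame(grp = c(1, 1, 1, 2, 2, 2),
                 ori = c("f", "f", "f", "r", "r", "r"),
                 ite = c("A", "B", "C", "A", "B", "C"))

# Numeric sort key: ascending for "f" rows, descending (negated) for "r" rows
key <- ifelse(df$ori == "f", xtfrm(df$ite), -xtfrm(df$ite))

# Sort by group first, then by the direction-aware key
out <- df[order(df$grp, key), ]
out
```

Both branches of ifelse() are numeric here, so no coercion to character takes place.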
One possible way is to split the data on the column ori, sort each piece with its own rule inside a function, and then combine the results as follows (map() comes from purrr):
df %>%
split(.$ori) %>%
map(function(x) {
if ('f' %in% x$ori) {
x %>%
group_by(grp) %>%
arrange(ite, .by_group = TRUE)
}
else {
x %>%
group_by(grp) %>%
arrange(desc(ite), .by_group = TRUE)
}
}) %>%
bind_rows()
# A tibble: 6 x 3
## Groups: grp [2]
# grp ori ite
# <dbl> <chr> <chr>
#1 1 f A
#2 1 f B
#3 1 f C
#4 2 r C
#5 2 r B
#6 2 r A
One option to achieve your desired result would be to make use of split + arrange + bind_rows like so:
library(dplyr)
library(purrr)
df %>%
split(.$ori) %>%
purrr::imap(
~ if (.y == "f") arrange(.x, grp, ite) else arrange(.x, grp, desc(ite))
) %>%
dplyr::bind_rows()
#> grp ori ite
#> 1 1 f A
#> 2 1 f B
#> 3 1 f C
#> 4 2 r C
#> 5 2 r B
#> 6 2 r A
And thanks to the suggestion by @MartinGal, we could save the bind_rows step by making use of purrr::imap_dfr:
df %>%
split(.$ori) %>%
purrr::imap_dfr(
~ if (.y == "f") arrange(.x, grp, ite) else arrange(.x, grp, desc(ite))
)
Here's another solution.
df %>%
arrange(
ifelse(ori == 'f', ite, NA),
desc(ifelse(ori != 'f', ite, NA))
)
I have a dataframe...
df <- tibble(
id = 1:7,
family = c("a","a","b","b","c", "d", "e")
)
Families will only contain 2 members at most (so they're either individuals or pairs).
I need a new column 'random' that assigns the number 1 to families where there is only one member (e.g. c, d and e) and randomly assigns 0 or 1 to families containing 2 members (a and b in the example).
By the end the data should look like the following (depending on the random assignment of 0/1)...
df <- tibble(
id = 1:7,
family = c("a","a","b","b","c", "d", "e"),
random = c(1, 0, 0, 1, 1, 1, 1)
)
I would like to be able to do this with a combination of group_by and mutate since I am mostly using Tidyverse.
I tried the following (but this didn't randomly assign 0/1 within families)...
df %>%
group_by(family) %>%
mutate(
random = if_else(
condition = n() == 1,
true = 1,
false = as.double(sample(0:1,1,replace = T))
)
)
You could sample along the sequence length of the family group and take the answer modulo 2:
df %>%
group_by(family) %>%
mutate(random = sample(seq(n())) %% 2)
#> # A tibble: 7 x 3
#> # Groups: family [5]
#> id family random
#> <int> <chr> <dbl>
#> 1 1 a 0
#> 2 2 a 1
#> 3 3 b 0
#> 4 4 b 1
#> 5 5 c 1
#> 6 6 d 1
#> 7 7 e 1
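The attempt in the question fails because sample(0:1, 1) is evaluated once per group and the single result is recycled to both members. Shuffling a whole sequence per group avoids that. A base-R sketch of the same idea follows, using a plain data.frame instead of a tibble so it needs no packages; `pair_sums` is an illustrative name.

```r
df <- data.frame(id = 1:7,
                 family = c("a", "a", "b", "b", "c", "d", "e"))

# Per family: shuffle 1..n and keep the parity.
# n = 1 gives 1 %% 2 = 1; n = 2 gives 1 and 0 in random order.
df$random <- ave(df$id, df$family,
                 FUN = function(i) sample(length(i)) %% 2)

# Each pair should contain exactly one 1, each singleton exactly 1
pair_sums <- tapply(df$random, df$family, sum)
pair_sums
```

This relies on the stated assumption that a family has at most two members; larger families would get a near-even split of 0s and 1s instead.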
We can use if/else
library(dplyr)
df %>%
group_by(family) %>%
mutate(random = if(n() == 1) 1 else sample(rep(0:1, length.out = n())))
# A tibble: 7 x 3
# Groups: family [5]
# id family random
# <int> <chr> <dbl>
#1 1 a 0
#2 2 a 1
#3 3 b 1
#4 4 b 0
#5 5 c 1
#6 6 d 1
#7 7 e 1
Another option
df %>%
group_by(family) %>%
mutate(random = 2 - sample(1:n()))
# A tibble: 7 x 3
# Groups: family [5]
# id family random
# <int> <chr> <dbl>
# 1 1 a 1
# 2 2 a 0
# 3 3 b 1
# 4 4 b 0
# 5 5 c 1
# 6 6 d 1
# 7 7 e 1
I have a situation where I am trying to find the number of intersections with a vector per group in another tibble.
Data example
a <- tibble(EXPERIMENT = rep(c("a","b","c"),each =4),
ECOTYPE = rep(1:12))
b <- tibble(ECOTYPE = c(1,1,5,4,8,7,6,1,4,4,2,5,6,7,1))
I want to find the number of intersections between ECOTYPE in b and ECOTYPE per EXPERIMENT in a.
I wonder if I can use dplyr to solve this, as the group_by function seems to fit this problem, but when I run:
a %>%
group_by(EXPERIMENT) %>%
summarise(INTERSECTIONS = length(intersect(b$ECOTYPE, .$ECOTYPE)))
I only get the total number of intersections between a and b.
Am I missing something?
Edit:
Sorry for not posting my desired output. I would like something like this:
# A tibble: 3 x 2
EXPERIMENT INTERSECTIONS
<chr> <dbl>
1 a 8
2 b 7
3 c 0
Depending on how you want to count, this will give the number of rows in b matching a:
b %>% mutate(b_flag = 1) %>%
right_join(a) %>%
group_by(EXPERIMENT) %>%
summarize(INTERSECTIONS = sum(b_flag, na.rm = T))
# # A tibble: 3 x 2
# EXPERIMENT INTERSECTIONS
# <fctr> <dbl>
# 1 a 8
# 2 b 7
# 3 c 0
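I think the only problem with your code is the unnecessary .$ (inside the pipe, . refers to the whole of a rather than the current group). Note, though, that without it you get the count of distinct ecotypes shared with b, ignoring the fact that b has four ECOTYPE = 1 rows, for example.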
a %>%
group_by(EXPERIMENT) %>%
summarise(INTERSECTIONS = length(intersect(b$ECOTYPE, ECOTYPE)))
# # A tibble: 3 x 2
# EXPERIMENT INTERSECTIONS
# <fctr> <int>
# 1 a 3
# 2 b 4
# 3 c 0
This is a result of how intersect works:
intersect(c(1, 2, 3), c(1, 1, 1))
# [1] 1
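To make the difference between the two ways of counting concrete, here is a base-R sketch on the example data (plain data frames rather than tibbles; `n_rows` and `n_distinct_shared` are illustrative names). `%in%` keeps duplicates, while intersect() drops them.

```r
a <- data.frame(EXPERIMENT = rep(c("a", "b", "c"), each = 4),
                ECOTYPE = 1:12)
b <- data.frame(ECOTYPE = c(1, 1, 5, 4, 8, 7, 6, 1, 4, 4, 2, 5, 6, 7, 1))

# Rows of b whose ECOTYPE belongs to each experiment (duplicates counted)
n_rows <- tapply(a$ECOTYPE, a$EXPERIMENT,
                 function(e) sum(b$ECOTYPE %in% e))

# Distinct ecotypes shared between b and each experiment
n_distinct_shared <- tapply(a$ECOTYPE, a$EXPERIMENT,
                            function(e) length(intersect(b$ECOTYPE, e)))

n_rows             # a = 8, b = 7, c = 0
n_distinct_shared  # a = 3, b = 4, c = 0
```

The first vector matches the desired output in the question; the second matches what the corrected summarise() call returns.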
Join the two and count how many are left:
inner_join(a,b, by='ECOTYPE') %>% group_by(EXPERIMENT) %>% count()
# A tibble: 2 x 2
# Groups: EXPERIMENT [2]
EXPERIMENT n
<chr> <int>
1 a 8
2 b 7
Now, if you add an indicator column to b, you can start to count absences as well:
b %>% mutate(present=TRUE) %>% right_join(a, by='ECOTYPE') %>% group_by(EXPERIMENT) %>% summarise(n(), missing=sum(is.na(present)))
# A tibble: 3 x 3
EXPERIMENT `n()` missing
<chr> <int> <int>
1 a 9 1
2 b 7 0
3 c 4 4