How to sample across a dataset with two factors in it?

I have a data frame with two species, A and B, and two variables, a and b, for a total of 100 rows.
I want to create a sampler that randomly picks 6 rows (reps) from df to form one set. The samples for species A must come only from rows where sp is A, and likewise for B. I want to do this 500 times for each of species A and B.
I attempted a for loop, but the result is a single row with 6 columns. I would appreciate any guidance.
a <- rnorm(100, 2,1)
b <- rnorm(100, 2,1)
sp <- rep(c("A","B"), each = 50)
df <- data.frame(a,b,sp)
df.sample <- for(i in 1:1000){
sampling <- sample(df[i,],6,replace = TRUE)
}
#Output in a single row
a a.1 sp b sp.1 a.2
1000 1.68951 1.68951 B 1.395995 B 1.68951
#Expected dataframe
df.sample
set rep a b sp
1 1 1 9 A
1 2 3 2 A
1 3 0 2 A
1 4 1 2 A
1 5 1 6 A
1 6 4 2 A
2 1 1 2 B
2 2 5 2 B
2 3 1 2 B
2 4 1 6 B
2 5 1 8 B
2 6 9 2 B
....

Here's how I would do it (using tidyverse):
data:
a <- rnorm(100, 2,1)
b <- rnorm(100, 2,1)
sp <- rep(c("A","B"), each = 50)
df <- data.frame(a,b,sp)
# create an empty table with desired columns
library(tidyverse)
output <- tibble(a = numeric(),
b = numeric(),
sp = character(),
set = numeric())
# sampling in a loop
set.seed(42)
for(i in 1:500){
samp1 <- df %>% filter(sp == 'A') %>% sample_n(6, replace = TRUE) %>% mutate(set = i)
samp2 <- df %>% filter(sp == 'B') %>% sample_n(6, replace = TRUE) %>% mutate(set = i)
output %>% add_row(bind_rows(samp1, samp2)) -> output
}
Result
> head(output, 20)
# A tibble: 20 × 4
a b sp set
<dbl> <dbl> <chr> <dbl>
1 2.59 3.31 A 1
2 1.84 1.66 A 1
3 2.35 1.17 A 1
4 2.33 1.95 A 1
5 0.418 1.11 A 1
6 1.19 2.54 A 1
7 2.35 0.899 B 1
8 1.19 1.63 B 1
9 0.901 0.986 B 1
10 3.12 1.75 B 1
11 2.28 2.61 B 1
12 1.37 3.47 B 1
13 2.33 1.95 A 2
14 1.84 1.66 A 2
15 3.76 1.26 A 2
16 2.96 3.10 A 2
17 1.03 1.81 A 2
18 1.42 2.00 A 2
19 0.901 0.986 B 2
20 2.37 1.39 B 2

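You could split df by species first. Within each species, x[sample(nrow(x), 6), ] draws 6 random rows (without replacement; add replace = TRUE to sample() to match the question). Wrapping that expression in replicate() repeats the draw as many times as needed; 3 repetitions are shown here, so use 500 for the full run. dplyr::bind_rows() then combines the samples, and its .id argument adds a new column set holding the index of each draw.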
lapply(split(df, df$sp), function(x) {
dplyr::bind_rows(
replicate(3, x[sample(nrow(x), 6), ], FALSE),
.id = "set"
)
})
Output
$A
set a b sp
1 1 1.52480034 3.41257975 A
2 1 1.82542370 2.08511584 A
3 1 1.80019901 1.39279162 A
4 1 2.20765154 2.11879412 A
5 1 1.61295185 2.04035172 A
6 1 1.92936567 2.90362816 A
7 2 0.88903679 2.46948106 A
8 2 3.19223788 2.81329767 A
9 2 1.28629416 2.69275525 A
10 2 2.61044815 0.82495427 A
11 2 2.30928735 1.67421328 A
12 2 -0.09789704 2.62434719 A
13 3 2.10386603 1.78157862 A
14 3 2.17542841 0.84016203 A
15 3 3.22202227 3.49863423 A
16 3 1.07929909 -0.02032945 A
17 3 2.95271838 2.34460193 A
18 3 1.90414536 1.54089645 A
$B
set a b sp
1 1 3.5130317 -0.4704879 B
2 1 3.0053072 1.6021795 B
3 1 4.1167657 1.1123342 B
4 1 1.5460589 3.2915979 B
5 1 0.8742753 0.9132530 B
6 1 2.0882660 1.5588471 B
7 2 1.2444645 1.8199525 B
8 2 2.7960117 2.6657735 B
9 2 2.5970774 0.9984187 B
10 2 1.1977317 3.7360884 B
11 2 2.2830643 1.0452440 B
12 2 3.1047150 1.5609482 B
13 3 2.9309124 1.5679255 B
14 3 0.8631965 1.3501631 B
15 3 1.5460589 3.2915979 B
16 3 2.7960117 2.6657735 B
17 3 3.1047150 1.5609482 B
18 3 2.8735390 0.6329279 B
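The split-and-sample idea above can also be pushed all the way to the layout asked for in the question (set, rep, a, b, sp) in base R. This is only a sketch: the seed, the constants n_sets and n_reps, and the helper sample_species() are illustrative names, not part of the original answers. It draws all 500 x 6 row indices per species in one call, which is equivalent to 500 separate draws of 6 because sampling is with replacement.

```r
# Toy data shaped like the question's (the seed value is arbitrary)
set.seed(1)
df <- data.frame(a = rnorm(100, 2, 1),
                 b = rnorm(100, 2, 1),
                 sp = rep(c("A", "B"), each = 50))

n_sets <- 500  # sets per species (assumed from the question)
n_reps <- 6    # rows drawn per set

# Hypothetical helper: draw every set for the rows of one species
sample_species <- function(x) {
  idx <- sample(nrow(x), n_sets * n_reps, replace = TRUE)
  data.frame(set = rep(seq_len(n_sets), each = n_reps),
             rep = rep(seq_len(n_reps), times = n_sets),
             x[idx, ])
}

# Apply the helper to each species and stack the results
df.sample <- do.call(rbind, lapply(split(df, df$sp), sample_species))
rownames(df.sample) <- NULL
head(df.sample)
```

Here set is numbered within each species, so set 1 exists once for A and once for B; combine set and sp if a globally unique identifier is needed.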

If I understood correctly what you want, it could be done with the following code:
# Create the initial data frame
a <- rnorm(100, 2,1)
b <- rnorm(100, 2,1)
sp <- rep(c("A","B"), each = 50)
df <- data.frame(a,b,sp)
# Rows with sp=A
row.A <- which(df$sp=="A")
row.B <- which(df$sp=="B")
# Sampling data.frame
sampling <- data.frame(matrix(ncol = 5, nrow = 0))
# "rep" column for each iteration
rep1 <- rep(1:6,2)
# Build the data.frame
for(i in 1:500){
# Sampling row.A
s.A <- sample(row.A,6,replace = T)
# Sampling row.B
s.B <- sample(row.B,6,replace = T)
# Data frame with the subset of df and "set" and "rep" values
sampling <- rbind(sampling, cbind(set=rep(i,12),rep=rep1,df[c(s.A,s.B),]))
}
# Delete row.names of sampling and redefine sampling's column names
row.names(sampling) <- NULL
colnames(sampling) <- c("set", "rep", "a", "b", "sp")
And the output looks like this:
set rep a b sp
1 1 3.713663 2.717456 A
1 2 2.456070 2.803443 A
1 3 2.166655 1.395556 A
1 4 1.453738 5.662969 A
1 5 2.692518 2.971156 A
1 6 2.699634 3.016791 A

Related

How can I compute the reverse rank abundance of a species matrix?

I have a dataset in R that contains species abundance data ordered by station and replicate sample. So, column one contains the station number, column two contains the replicate number, column three contains the species name, and column four contains the species abundance.
I want to add a new fifth column that contains the reverse rank abundance of a species per station/replicate combination (i.e., If there are four species in a station/replicate, I want the species with the lowest abundance to be given a value of 1, and the species with the highest abundance to be given a value of 4).
Here a sample code of the type of dataset I am working with:
library(tidyverse)
dat <- as.data.frame(matrix(c(1,1,"A",2.34,
1,1,"B",4.32,
1,1,"C",2.46,
1,1,"D",6.32,
1,2,"A",3.54,
1,2,"B",7.67,
1,2,"D",3.45,
2,1,"D",4.67,
2,1,"E",6.54,
2,1,"G",5.67,
2,2,"B",2.31,
2,2,"G",1.12), ncol = 4, nrow = 12, byrow = TRUE
))
names(dat)[1] <- "station"
names(dat)[2] <- "replicate"
names(dat)[3] <- "taxa"
names(dat)[4] <- "abundance"
dat %>%
mutate(abundance = parse_number(abundance))
station replicate taxa abundance
1 1 A 2.34
1 1 B 4.32
1 1 C 2.46
1 1 D 6.32
1 2 A 3.54
1 2 B 7.67
1 2 D 3.45
2 1 D 4.67
2 1 E 6.54
2 1 G 5.67
2 2 B 2.31
2 2 G 1.12
And here is some code to reorder the dataset so that it goes from the species with the lowest abundance to the species with the highest abundance per station/replicate:
dat %>%
arrange(abundance) %>%
arrange(replicate) %>%
arrange(station)
For some reason, I am unsure how to continue from here. Any help would be greatly appreciated!

If you want to rank by station/replicate, you first group_by this combination, and then create a new column with the rank value.
library(tidyverse)
dat %>%
group_by(station, replicate) %>%
mutate(abundance = as.numeric(abundance),
rank = rank(abundance))
Output
station replicate taxa abundance rank
<chr> <chr> <chr> <dbl> <dbl>
1 1 1 A 2.34 1
2 1 1 B 4.32 3
3 1 1 C 2.46 2
4 1 1 D 6.32 4
5 1 2 A 3.54 2
6 1 2 B 7.67 3
7 1 2 D 3.45 1
8 2 1 D 4.67 1
9 2 1 E 6.54 3
10 2 1 G 5.67 2
11 2 2 B 2.31 2
12 2 2 G 1.12 1
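The same group-then-rank step can be sketched in base R with ave(), which applies a function within each combination of grouping vectors. The data frame below rebuilds the question's data with numeric columns directly; the column name rank_desc is only an illustrative addition.

```r
dat <- data.frame(
  station   = c(1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2),
  replicate = c(1, 1, 1, 1, 2, 2, 2, 1, 1, 1, 2, 2),
  taxa      = c("A", "B", "C", "D", "A", "B", "D", "D", "E", "G", "B", "G"),
  abundance = c(2.34, 4.32, 2.46, 6.32, 3.54, 7.67, 3.45,
                4.67, 6.54, 5.67, 2.31, 1.12)
)

# rank() gives 1 to the smallest value; ave() runs it per station/replicate
dat$rank <- ave(dat$abundance, dat$station, dat$replicate, FUN = rank)

# For comparison: 1 for the most abundant taxon in each group
dat$rank_desc <- ave(-dat$abundance, dat$station, dat$replicate, FUN = rank)
dat
```

By default rank() averages tied values; if ties matter, a wrapper such as function(x) rank(x, ties.method = "min") can be passed as FUN instead.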

Calculate standard deviation across multiple rows grouped by ID

I want to calculate the standard deviation across multiple rows (not per row) and then save the results into a new data frame. Best to explain using an example.
Data:
ID <- c("a","a","a","a","b","b","b","b","c","c","c","c")
y1 <- c(8,9,3,6,6,4,5,8,7,5,8,1)
y2 <- c(3,6,6,1,7,3,8,7,5,8,1,7)
y3 <- c(9,3,1,8,4,6,3,8,4,6,5,7)
df <- data.frame(ID, y1, y2, y3)
ID y1 y2 y3
1 a 8 3 9
2 a 9 6 3
3 a 3 6 1
4 a 6 1 8
5 b 6 7 4
6 b 4 3 6
7 b 5 8 3
8 b 8 7 8
9 c 7 5 4
10 c 5 8 6
11 c 8 1 5
12 c 1 7 7
I want to calculate the standard deviation of ID$a, ID$b and ID$c and store in a new data frame. I know I can do this:
sd_a <- sd(as.matrix(subset(df, ID == "a")), na.rm = TRUE)
sd_b <- sd(as.matrix(subset(df, ID == "b")), na.rm = TRUE)
sd_c <- sd(as.matrix(subset(df, ID == "c")), na.rm = TRUE)
ID <- c("a","b","c")
sd <- c(sd_a,sd_b,sd_c)
df2 <- data.frame(ID, sd)
ID sd
1 a 2.958040
2 b 1.912875
3 c 2.386833
But is there a more straightforward way of achieving this?

You can use pivot_longer() to stack y1 to y3 and then calculate the sd.
library(dplyr)
library(tidyr)
df %>%
pivot_longer(y1:y3) %>%
group_by(ID) %>%
summarise(sd = sd(value))
# # A tibble: 3 x 2
# ID sd
# <chr> <dbl>
# 1 a 2.96
# 2 b 1.91
# 3 c 2.39

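One dplyr solution could be the following (cur_data() was deprecated in dplyr 1.1.0 in favour of pick(), so newer versions emit a warning):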
df %>%
group_by(ID) %>%
summarise(sd = sd(unlist(cur_data())))
ID sd
<fct> <dbl>
1 a 2.96
2 b 1.91
3 c 2.39

In base R you can do:
aggregate(values ~ ID, cbind(df[1], stack(df[-1])), sd)
ID values
1 a 2.958040
2 b 1.912875
3 c 2.386833
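Another base R variant, shown here as a sketch, splits the numeric columns by ID and pools each piece into one vector before calling sd(). The intermediate name sds is illustrative only.

```r
df <- data.frame(ID = rep(c("a", "b", "c"), each = 4),
                 y1 = c(8, 9, 3, 6, 6, 4, 5, 8, 7, 5, 8, 1),
                 y2 = c(3, 6, 6, 1, 7, 3, 8, 7, 5, 8, 1, 7),
                 y3 = c(9, 3, 1, 8, 4, 6, 3, 8, 4, 6, 5, 7))

# split() the y columns by ID, then pool each group's 12 values with unlist()
sds <- sapply(split(df[-1], df$ID), function(x) sd(unlist(x)))

df2 <- data.frame(ID = names(sds), sd = unname(sds))
df2
```

Because df[-1] drops the ID column before splitting, every value passed to sd() is numeric and no na.rm workaround is needed.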

R: refer to var in different list element on a looped condition

I've got a dataset with a number of vars (t01-t05 in a dummy example, but many more in the real dataset). I calculate the pred variable as the proportion target == 1 / n() for every combination of group levels (5th element of ns_by_group_list). However, if the total number of people in a combination (the s var) is less than 6, I need to use the pred value from the equivalent t01-t04 combination (4th element of ns_by_group_list). If that one is also less than 6, then from the t01-t03 combination (3rd element of ns_by_group_list), etc. The final output should look like ns_by_group_list[[5]] but with pred values coming from different ns_by_group_list elements.
I was thinking of renaming pred and s vars in different list elements to pred1, pred2, .. pred5 and then pulling it all together to one data.frame, then create a long case_when statement... But surely there's a better/more elegant way to do it?
library(tibble)
library(dplyr)
library(purrr)
library(stringr)
library(tidyr)
## functions ####
create_t_labels <- function(n) {
paste0('t', str_pad(1:n, 2, 'left', '0'))
}
ns_by_group <- function(group_vars) {
input %>%
group_by_at(.vars = vars(group_vars)) %>%
summarise(n = n()) %>% # total number of people in each group
ungroup() %>%
spread(key = target, value = n) %>%
mutate(`0` = replace_na(`0`, 0),
n = replace_na(`1`, 0),
s = n + `0`,
pred = round(n/s, 3)
) %>%
select(-c(`1`, `0`))
}
### input data ####
set.seed(1)
input <- tibble(
target = sample(0:1, 50, replace = TRUE),
t01 = sample(1:3, 50, replace = TRUE),
t02 = rep(1:2, each = 25),
t03 = rep(1:5, each = 10),
t04 = rep(1, 50),
t05 = rep(1:2, each = 25)
)
## calculations ####
group_combo_list <- map(1:5, create_t_labels)
group_combo_list <- map(group_combo_list, function(x) c(x, 'target'))
ns_by_group_list <- map(group_combo_list, ns_by_group)

Recursively joining and replacing:
reduce(
ns_by_group_list,
~ {
left_join(.y, .x, by = grep("^t\\d+$", names(.x), value = TRUE),
suffix = c("", ".replacement")) %>%
mutate(pred = if_else(s < 6, pred.replacement, pred),
s = if_else(s < 6, s.replacement, s)) %>%
select(-ends_with(".replacement"))
},
.dir = "backward"
)
# # A tibble: 16 x 8
# t01 t02 t03 t04 t05 n s pred
# <int> <int> <int> <dbl> <int> <dbl> <dbl> <dbl>
# 1 1 1 1 1 1 1 16 0.562
# 2 1 1 2 1 1 1 16 0.562
# 3 1 2 3 1 2 2 12 0.583
# 4 1 2 4 1 2 4 6 0.667
# 5 1 2 5 1 2 1 12 0.583
# 6 2 1 1 1 1 3 13 0.385
# 7 2 1 2 1 1 2 6 0.333
# 8 2 1 3 1 1 0 13 0.385
# 9 2 2 4 1 2 1 6 0.5
# 10 2 2 5 1 2 2 6 0.5
# 11 3 1 1 1 1 0 8 0.125
# 12 3 1 2 1 1 1 8 0.125
# 13 3 1 3 1 1 0 8 0.125
# 14 3 2 3 1 2 0 7 0.714
# 15 3 2 4 1 2 1 7 0.714
# 16 3 2 5 1 2 4 7 0.714
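A single step of this join-and-replace logic may be easier to follow on a toy example. The sketch below uses base R and two made-up summary tables (fine and coarse, with invented values) rather than the question's data; the reduce() call above repeats the same step across all list elements, starting from the coarsest grouping.

```r
# Hypothetical summaries at two grouping levels (values are made up)
fine <- data.frame(t01  = c(1, 1, 2, 2),
                   t02  = c(1, 2, 1, 2),
                   s    = c(10, 3, 7, 2),
                   pred = c(0.5, 0.333, 0.429, 1))
coarse <- data.frame(t01  = c(1, 2),
                     s    = c(13, 9),
                     pred = c(0.462, 0.556))

# Rename the coarse columns so they sit next to the fine ones after the join
names(coarse) <- c("t01", "s.fallback", "pred.fallback")
out <- merge(fine, coarse, by = "t01")

# Wherever the fine group has fewer than 6 people, take the coarser values
small <- out$s < 6
out$pred[small] <- out$pred.fallback[small]
out$s[small] <- out$s.fallback[small]

out <- out[order(out$t01, out$t02), c("t01", "t02", "s", "pred")]
out
```

Replacing s along with pred matters in the recursive version: the next, finer level then tests the group size that the borrowed pred is actually based on.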

bind_rows to each group of tibble

Consider the following two tibbles:
library(tidyverse)
a <- tibble(time = c(-1, 0), value = c(100, 200))
b <- tibble(id = rep(letters[1:2], each = 3), time = rep(1:3, 2), value = 1:6)
So a and b have the same columns and b has an additional column called id.
I want to do the following: group b by id and then add tibble a on top of each group.
So the output should look like this:
# A tibble: 10 x 3
id time value
<chr> <int> <int>
1 a -1 100
2 a 0 200
3 a 1 1
4 a 2 2
5 a 3 3
6 b -1 100
7 b 0 200
8 b 1 4
9 b 2 5
10 b 3 6
Of course there are multiple workarounds to achieve this (like loops for example). But in my case I have a large number of IDs and a very large number of columns.
I would be thankful if anyone could point me towards the direction of a solution within the tidyverse.
Thank you

We can expand the data frame a with id from b and then bind_rows them together.
library(tidyverse)
a2 <- expand(a, id = b$id, nesting(time, value))
b2 <- bind_rows(a2, b) %>% arrange(id, time)
b2
# # A tibble: 10 x 3
# id time value
# <chr> <dbl> <dbl>
# 1 a -1 100
# 2 a 0 200
# 3 a 1 1
# 4 a 2 2
# 5 a 3 3
# 6 b -1 100
# 7 b 0 200
# 8 b 1 4
# 9 b 2 5
# 10 b 3 6
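The expansion step is a cross join of the distinct ids with the rows of a. As a sketch of the same idea in base R, merge() with by = NULL returns the Cartesian product of its two inputs; plain data frames are used here in place of tibbles, and out is an illustrative name.

```r
a <- data.frame(time = c(-1, 0), value = c(100, 200))
b <- data.frame(id = rep(letters[1:2], each = 3),
                time = rep(1:3, 2),
                value = 1:6)

# Every id paired with every row of a (Cartesian product)
a2 <- merge(data.frame(id = unique(b$id)), a, by = NULL)

# Stack on top of b, then sort so the rows of a lead each group
out <- rbind(a2, b)
out <- out[order(out$id, out$time), ]
rownames(out) <- NULL
out
```

Sorting by time puts the rows of a first only because their times are smaller than those in b; with other data, an explicit ordering column would be a safer choice.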

split from base R will divide a data frame into a list of subsets based on an index.
b %>%
split(b[["id"]]) %>%
lapply(bind_rows, a) %>%
lapply(select, -"id") %>%
bind_rows(.id = "id")
# # A tibble: 10 x 3
# id time value
# <chr> <dbl> <dbl>
# 1 a 1 1
# 2 a 2 2
# 3 a 3 3
# 4 a -1 100
# 5 a 0 200
# 6 b 1 4
# 7 b 2 5
# 8 b 3 6
# 9 b -1 100
# 10 b 0 200

An idea (via base R) is to split your data frame and create a new one with id + the other data frame and rbind, i.e.
df = do.call(rbind, lapply(split(b, b$id), function(i)rbind(data.frame(id = i$id[1], a), i)))
which gives
id time value
a.1 a -1 100
a.2 a 0 200
a.3 a 1 1
a.4 a 2 2
a.5 a 3 3
b.1 b -1 100
b.2 b 0 200
b.3 b 1 4
b.4 b 2 5
b.5 b 3 6
NOTE: You can remove the rownames by simply calling rownames(df) <- NULL

We can nest and add the relevant rows to each nested item :
library(tidyverse)
b %>%
nest(data = -id) %>%
mutate(data = map(data, ~ bind_rows(a, .x))) %>%
unnest(data)
# # A tibble: 10 x 3
# id time value
# <chr> <dbl> <dbl>
# 1 a -1 100
# 2 a 0 200
# 3 a 1 1
# 4 a 2 2
# 5 a 3 3
# 6 b -1 100
# 7 b 0 200
# 8 b 1 4
# 9 b 2 5
# 10 b 3 6

Maybe not the most efficient way, but easy to follow:
library(tidyverse)
a <- tibble(time = c(-1, 0), value = c(100, 200))
b <- tibble(id = rep(letters[1:2], each = 3), time = rep(1:3, 2), value = 1:6)
# one id value per row of a (nrow(a), not length(a), which counts columns)
a.a <- a %>% add_column(id = rep("a", nrow(a)))
a.b <- a %>% add_column(id = rep("b", nrow(a)))
joint <- bind_rows(b,a.a,a.b)
# sort by time within id so that the rows from a come first in each group
(joint <- arrange(joint, id, time))

Creating a starting value variable with longitudinal data (conditional)

I am trying to create a new variable that is basically the starting value of another variable in my dataframe. Example data:
id <- rep(c(1, 2), each = 8)
outcome <- rep(1:5, length.out = 16)
time <- rep(c(0, 1, 3, 4),4)
Attitude <- rep(c('A1', 'A2', 'A1', 'A2'), each = 4)
df <- data.frame(id, Attitude, outcome, time)
What I'd like to get is a new column named new_var (or whatever) that is equal to the value of outcome at time == 0 for id = id and also depends on Attitude. Thus what I'd like to extend the dataframe to is:
df$new_var <- c(1,1,1,1,5,5,5,5,4,4,4,4,3,3,3,3)
I'd like to get there with decent code rather than by typing the values in by hand. In SAS I know I can do this with the lag function. I would really appreciate a solution that is not a workaround imitating SAS, but rather the proper R solution. In the end I want to get stronger in R too.
Related: Retain and lag function in R as SAS
However I prefer some solution that is based on indices or the 'usual' r way. And here it's also not dependent on other conditions.
So, important here is that the coding works for the different ids, attitude levels / variables (A1, A2, ...) and that the outcome value at time == 0 is basically copied to new_var.
I hope I am clear in conveying my message. If not I think the small piece of example code and how I'd like to extend it should be clear enough. Looking forward to suggestions.
EDIT: Another example for @jogo's answer.
ID <- rep(1, 36)
Attitude <- rep(c('A1', 'A2','A3', 'A4', 'A5', 'A6', 'A7', 'A8', 'A9'),
length.out =36)
Answer_a <- rep(1:5, length.out = 36)
time <- as.character(rep(c(0, 1, 3, 4), each = 9))
df <- data.frame(ID, Attitude, Answer_a, time)
df$time <- as.character(df$time)

I think this is what you mean - assuming the data is always in the correct order?
EDIT Added an arrange step to ensure the data is always correctly ordered.
library(tidyverse)
df %>% group_by(id, Attitude) %>%
arrange(time, .by_group = TRUE) %>%
mutate(new_var2 = first(outcome[!is.na(outcome)]))
# A tibble: 16 x 6
# Groups: id, Attitude [4]
id Attitude outcome time new_var new_var2
<dbl> <fct> <int> <dbl> <dbl> <int>
1 1.00 A1 1 0 1.00 1
2 1.00 A1 2 1.00 1.00 1
3 1.00 A1 3 3.00 1.00 1
4 1.00 A1 4 4.00 1.00 1
5 1.00 A2 5 0 5.00 5
6 1.00 A2 1 1.00 5.00 5
7 1.00 A2 2 3.00 5.00 5
8 1.00 A2 3 4.00 5.00 5
9 2.00 A1 4 0 4.00 4
10 2.00 A1 5 1.00 4.00 4
11 2.00 A1 1 3.00 4.00 4
12 2.00 A1 2 4.00 4.00 4
13 2.00 A2 3 0 3.00 3
14 2.00 A2 4 1.00 3.00 3
15 2.00 A2 5 3.00 3.00 3
16 2.00 A2 1 4.00 3.00 3

Here is a solution with data.table:
library("data.table")
setDT(df)
df[, new_var:=outcome[1], rleid(Attitude)][] # or
# df[, new_var:=outcome[time==0], rleid(Attitude)][]
For testing I named the new column new_var2:
id <- rep(c(1, 2), each = 8)
outcome <- rep(1:5, length.out = 16)
time <- rep(c(0, 1, 3, 4),4)
Attitude <- rep(c('A1', 'A2', 'A1', 'A2'), each = 4)
df <- data.frame(id, Attitude, outcome, time)
df$new_var <- c(1,1,1,1,5,5,5,5,4,4,4,4,3,3,3,3)
library("data.table")
setDT(df)
df[, new_var2:=outcome[1], rleid(Attitude)][]
# > df[, new_var2:=outcome[1], rleid(Attitude)][]
# id Attitude outcome time new_var new_var2
# 1: 1 A1 1 0 1 1
# 2: 1 A1 2 1 1 1
# 3: 1 A1 3 3 1 1
# 4: 1 A1 4 4 1 1
# 5: 1 A2 5 0 5 5
# 6: 1 A2 1 1 5 5
# 7: 1 A2 2 3 5 5
# 8: 1 A2 3 4 5 5
# 9: 2 A1 4 0 4 4
# 10: 2 A1 5 1 4 4
# 11: 2 A1 1 3 4 4
# 12: 2 A1 2 4 4 4
# 13: 2 A2 3 0 3 3
# 14: 2 A2 4 1 3 3
# 15: 2 A2 5 3 3 3
# 16: 2 A2 1 4 3 3
Your second example shows that you have to reorder the rows of the data. Using data.table this can be done with setkey():
ID <- rep(1, 36)
Attitude <- rep(c('A1', 'A2','A3', 'A4', 'A5', 'A6', 'A7', 'A8', 'A9'),
length.out =36)
Answer_a <- rep(1:5, length.out = 36)
time <- as.character(rep(c(0, 1, 3, 4), each = 9))
df <- data.frame(ID, Attitude, Answer_a, time)
df$time <- as.character(df$time)
library("data.table")
setDT(df)
setkey(df, ID, Attitude, time)
df[, new_var:=Answer_a[1], rleid(Attitude)]
df
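For an index-based approach that does not depend on row order, the baseline rows can be looked up by key with match(). This is a base R sketch on the first example data; it assumes exactly one time == 0 row per id/Attitude combination, and the names base, key and base_key are illustrative.

```r
id <- rep(c(1, 2), each = 8)
outcome <- rep(1:5, length.out = 16)
time <- rep(c(0, 1, 3, 4), 4)
Attitude <- rep(c("A1", "A2", "A1", "A2"), each = 4)
df <- data.frame(id, Attitude, outcome, time)

# One baseline row per id/Attitude combination (assumed to exist exactly once)
base <- df[df$time == 0, ]

# Build a key for every row and find the position of its baseline row
key <- paste(df$id, df$Attitude)
base_key <- paste(base$id, base$Attitude)
df$new_var <- base$outcome[match(key, base_key)]
df
```

If time is stored as character, as in the second example, the filter becomes time == "0"; groups without a baseline row get NA.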
