Bootstrapping in hierarchical data in R - r

I have a dataset of the following form:
dat <- expand.grid(cat=factor(1:4), lab=factor(1:10))
dat <- cbind(dat, x=runif(nrow(dat)), y=runif(nrow(dat), 2, 5))
where I have observations of 4 cats in 10 labs.
Now I want to simulate samples from this dataset by resampling in order to have:
each cat observed in 5 (random) labs AND each lab with 50% (or 2) random cats observed.
Honestly I cannot figure my way out of this... Thanks in advance

This type of thing is generally easiest with a function.
This function takes the data and first filters the number of labs to sample from, then samples the cats for each lab.
library(dplyr)
dat <- expand.grid(cat=factor(1:4), lab=factor(1:10)) %>%
  mutate(x = runif(nrow(.)),
         y = runif(nrow(.), 2, 5))
samplr <- function(dat, nlab = 5, ncat = 2){
  dat %>%
    filter(lab %in% sample(unique(dat$lab), nlab)) %>%
    group_by(lab) %>%
    filter(cat %in% sample(unique(dat$cat), ncat))
}
samplr(dat)
You can then change the number of cats or labs being sampled:
samplr(dat, nlab = 4, ncat = 3)
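As a quick sanity check, the sketch below rebuilds the example data and counts what the function returns. It assumes dplyr is installed; the call to set.seed() and the name check are additions for illustration only. Note that the function keeps only nlab labs, so it does not by itself guarantee that every cat ends up in the same number of labs.

```r
library(dplyr)

set.seed(1)  # added only so the sketch is reproducible

# Same structure as the example data: 4 cats observed in each of 10 labs
dat <- expand.grid(cat = factor(1:4), lab = factor(1:10)) %>%
  mutate(x = runif(n()), y = runif(n(), 2, 5))

# The function from the answer, unchanged in behaviour
samplr <- function(dat, nlab = 5, ncat = 2) {
  dat %>%
    filter(lab %in% sample(unique(dat$lab), nlab)) %>%
    group_by(lab) %>%
    filter(cat %in% sample(unique(dat$cat), ncat))
}

check <- samplr(dat) %>% ungroup()  # 'check' is a hypothetical name

count(check, lab)  # one row per sampled lab; n is the number of cats kept
count(check, cat)  # how many of the sampled labs each cat appears in
```

If the second table is too uneven for your purposes, one option is to call the function repeatedly and keep only the draws that meet your constraint.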

Related

Grouping and Identifying the Maximum During a Simulation in R

I am trying to do some simulations in R and I am stuck on the loop that I need to write. I am able to get what I need in one iteration, but coding the loop is throwing me off. This is what I am doing for one iteration.
Subjects <- c(1,2,3,4,5,6)
Group <- c('A','A','B','B','C','C')
Score <- rnorm(6,mean=5,sd=1)
Example <- data.frame(Subjects,Group,Score)
library(dplyr)
Score_by_Group <- Example %>% group_by(Group) %>% summarise(SumGroup = sum(Score))
Score_by_Group$Top_Group <- ifelse(Score_by_Group[,2] == max(Score_by_Group[,2]),1,0)
Group SumGroup Top_Group
1 A 8.77 0
2 B 6.22 0
3 C 9.38 1
What I need my loop to do is, run the above 'X' times and every time that group has the Top Score, add it to the previous result. So for example, if the loop was to be x=10, I would need a result like this:
Group Top_Group
1 A 3
2 B 5
3 C 2
If you don't mind forgoing the for loop, we can use replicate to repeat the code, then bind the output together, and then summarize.
library(tidyverse)
run_sim <- function()
{
  Subjects <- c(1, 2, 3, 4, 5, 6)
  Group <- c('A', 'A', 'B', 'B', 'C', 'C')
  Score <- rnorm(6, mean = 5, sd = 1)
  Example <- data.frame(Subjects, Group, Score)
  Score_by_Group <- Example %>%
    group_by(Group) %>%
    summarise(SumGroup = sum(Score)) %>%
    mutate(Top_Group = +(SumGroup == max(SumGroup))) %>%
    select(-SumGroup)
}
results <- bind_rows(replicate(10, run_sim(), simplify = FALSE)) %>%
  group_by(Group) %>%
  summarise(Top_Group = sum(Top_Group))
Output
Group Top_Group
<chr> <int>
1 A 3
2 B 3
3 C 4
I think this should work:
library(dplyr)
Subjects <- c(1,2,3,4,5,6)
Group <- c('A','A','B','B','C','C')
Groups <- c('A','B','C')
Top_Group <- c(0,0,0)
x <- 10
for(i in 1:x) {
  Score <- rnorm(6,mean=5,sd=1)
  Example <- data.frame(Subjects,Group,Score)
  Score_by_Group <- Example %>% group_by(Group) %>% summarise(SumGroup = sum(Score))
  # compare the SumGroup vector directly so Top_Group stays a plain numeric vector
  Score_by_Group$Top_Group <- ifelse(Score_by_Group$SumGroup == max(Score_by_Group$SumGroup),1,0)
  # add this iteration's winner to the running tally
  Top_Group <- Top_Group + Score_by_Group$Top_Group
}
tibble(Groups, Top_Group)
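The same tally can also be sketched in base R without any packages. This is only an illustration under stated assumptions: the names n_sims, winners and tally are hypothetical, the seed is added for reproducibility, and ties (which are practically impossible with continuous scores) would be resolved in favour of the first group.

```r
set.seed(42)  # assumption: fixed seed, only for reproducibility

Group <- c("A", "A", "B", "B", "C", "C")
n_sims <- 10  # number of simulation runs ('X' in the question)

# For each run, draw new scores, sum them by group and record the winner's name
winners <- replicate(n_sims, {
  Score <- rnorm(length(Group), mean = 5, sd = 1)
  sums <- tapply(Score, Group, sum)
  names(sums)[which.max(sums)]
})

# Count wins per group; fixing the levels keeps groups that never won (count 0)
tally <- table(factor(winners, levels = unique(Group)))
tally
```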

How to divide combinations of rows using dplyr or another method in R?

site <- rep(1:4, each = 8, len = 32)
rep <- rep(1:8, times = 4, len = 32)
treatment <- rep(c("A.low","A.low","A.high","A.high","A.mix","A.mix","B.mix","B.mix"), 4)
sp.1 <- sample(0:3,size=32,replace=TRUE)
sp.2 <- sample(0:2,size=32,replace=TRUE)
df.dummy <- data.frame(site, rep, treatment, sp.1, sp.2)
The final dataframe looks like this
For each site, I want to summarize various groups. Two for example: "A.low / A.high" = "sp.1/sp.1"; "A.low/ A.mix" = "sp.1/sp.2". As you will notice, there are two for each site and I want all permutations of that in my final columns. My final product would resemble something like:
site rep treatment value
1. 1/3. A.low/A.high. Inf
1. 1/4. A.low/A.high. 1
I started to use dplyr but I am really not sure how to proceed especially with all the combinations
df.dummy %>%
group_by(site) %>%
summarise(value.1 = sp.1[treatment = "A.low"] / sp.1[treatment = "A.high"])
You could use reshape2 to get the data in a format that is easier to work with.
The code below separates out the sp.1 and sp.2 data. acast is used so that each dataframe consists of a single row per site, and each column is a unique sample with the values being from sp.1 and sp.2.
Name the columns something unique and combine the dataframes with cbind.
Now each column can be compared based on your requirements.
library(dplyr)
library(reshape2)
##your setup
site <- rep(1:4, each = 8, len = 32)
rep <- rep(1:8, times = 4, len = 32)
treatment <- rep(c("A.low","A.low","A.high","A.high","A.mix","A.mix","B.mix","B.mix"), 4)
sp.1 <- sample(0:3,size=32,replace=TRUE)
sp.2 <- sample(0:2,size=32,replace=TRUE)
df.dummy <- data.frame(site, rep, treatment, sp.1, sp.2)
##create unique ids and create a dataframe containing 1 value column
sp1 <- df.dummy %>% mutate(id = paste(rep, treatment, sep = "_")) %>% select(id, site, rep, treatment, sp.1)
sp2 <- df.dummy %>% mutate(id = paste(rep, treatment, sep = "_")) %>% select(id, site, rep, treatment, sp.2)
##reshape the data so that each treatment and replicate is assigned a single column
##each row will be a single site
##each column will contain the values from sp.1 or sp.2
sp1 <- reshape2::acast(data = sp1, formula = site ~ id)
sp2 <- reshape2::acast(data = sp2, formula = site ~ id)
##rename columns something sensible and unique
colnames(sp1) <- c("low.1.sp1", "low.2.sp1", "high.3.sp1", "high.4.sp1",
"mix.5.sp1", "mix.6.sp1", "mix.7.sp1", "mix.8.sp1")
colnames(sp2) <- c("low.1.sp2", "low.2.sp2", "high.3.sp2", "high.4.sp2",
"mix.5.sp2", "mix.6.sp2", "mix.7.sp2", "mix.8.sp2")
##combine datasets
##acast returns matrices, so convert to a data frame before using mutate
dat <- as.data.frame(cbind(sp1, sp2))
##choose which columns to compare. Some examples shown below
dat <- dat %>% mutate(low.1.sp1/high.3.sp1, low.1.sp1/high.4.sp1,
low.2.sp1/high.3.sp2)
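If the long layout from the question (one row per site and pair of replicates) is preferred, a self-join within each site is another option. The following is a minimal base R sketch, not the method used above; the names low, high and low_high are hypothetical, and the seed is added only for reproducibility.

```r
set.seed(1)  # assumption: fixed seed, only for reproducibility

# Same structure as the question's dummy data
df.dummy <- data.frame(
  site = rep(1:4, each = 8),
  rep = rep(1:8, times = 4),
  treatment = rep(c("A.low", "A.low", "A.high", "A.high",
                    "A.mix", "A.mix", "B.mix", "B.mix"), 4),
  sp.1 = sample(0:3, size = 32, replace = TRUE),
  sp.2 = sample(0:2, size = 32, replace = TRUE)
)

# Split out the two treatments to be compared
low  <- df.dummy[df.dummy$treatment == "A.low",  c("site", "rep", "sp.1")]
high <- df.dummy[df.dummy$treatment == "A.high", c("site", "rep", "sp.1")]

# Merging on site pairs every A.low replicate with every A.high replicate
low_high <- merge(low, high, by = "site", suffixes = c(".low", ".high"))

# Ratio sp.1 / sp.1; dividing by zero gives Inf or NaN, as in the question
low_high$value <- low_high$sp.1.low / low_high$sp.1.high
low_high$rep <- paste(low_high$rep.low, low_high$rep.high, sep = "/")
low_high$treatment <- "A.low/A.high"

low_high[, c("site", "rep", "treatment", "value")]
```

For a comparison such as "A.low/A.mix" = "sp.1/sp.2", the same pattern should work by selecting sp.2 from the A.mix rows and adjusting the column names accordingly.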

Apply a function within list-column to another column (compare to reference ecdf by group)

I have a dataset that is organized by groups (site) and has baseline observations (trt == 0) and observations collected from a modified environment (trt == 1, although it's not experimental data which is why I'm doing this). For the trt == 1 observations, I would like to calculate the quantile of each observation within the baseline ecdf for that group (i.e. site). My instinct was to use map2_dbl() but the ecdf to compare to is within the list-column itself, not external to the data. I'm struggling to get the correct syntax (in the R tidyverse).
df <- tibble(site = rep(letters[1:4], length.out = 2000),
trt = rep(c(0, 1), each = 1000),
value = c(rnorm(n = 1000), rnorm(n = 1000, mean = 0.1)))
# calculate ecdf for baseline:
baseline <- df %>%
filter(trt == 0) %>%
group_by(site) %>%
summarize(ecdf0 = list(ecdf(value)))
# compare each trt = 1 observation to ecdf for that site:
trtQuantile <- df %>%
filter(trt == 1) %>%
inner_join(baseline)
# what would be next line is where I'm struggling to get the correct map syntax
head(trtQuantile)
# for the first row I am aiming for the result given by:
trtQuantile$ecdf0[[1]](trtQuantile$value[[1]])
Any advice from the purrr masters is appreciated! Thanks.
You can use map2_dbl :
library(dplyr)
library(purrr)
trtQuantile %>% mutate(out = map2_dbl(ecdf0, value, ~.x(.y)))
Or mapply in base R :
trtQuantile$out <- mapply(function(x, y) x(y),trtQuantile$ecdf0,trtQuantile$value)
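For reference, here is a self-contained sketch that puts the question's setup and the map2_dbl() call together and checks the first row against the direct evaluation the question was aiming for. It assumes dplyr and purrr are installed; the smaller sample size and the seed are changes made only to keep the example quick and reproducible.

```r
library(dplyr)
library(purrr)

set.seed(1)  # assumption: fixed seed, only for reproducibility

# Smaller version of the question's data: 4 sites, baseline (0) and modified (1)
df <- tibble(site  = rep(letters[1:4], length.out = 200),
             trt   = rep(c(0, 1), each = 100),
             value = c(rnorm(n = 100), rnorm(n = 100, mean = 0.1)))

# One baseline ecdf per site, stored in a list-column
baseline <- df %>%
  filter(trt == 0) %>%
  group_by(site) %>%
  summarize(ecdf0 = list(ecdf(value)))

# Apply each row's ecdf (.x) to that row's value (.y)
trtQuantile <- df %>%
  filter(trt == 1) %>%
  inner_join(baseline, by = "site") %>%
  mutate(out = map2_dbl(ecdf0, value, ~ .x(.y)))

# The first row should match evaluating the ecdf directly
all.equal(trtQuantile$out[1], trtQuantile$ecdf0[[1]](trtQuantile$value[1]))
```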

Select top n columns (based on an aggregation)

I have a data set with hundreds of columns, and I want to keep the 20 columns with the highest average (or another aggregate such as the sum or SD).
How can I do this efficiently?
One way I can think of is to create a vector of the averages of all columns, sort it in descending order, keep the top n values, and then use it to subset my data set.
I am looking for a more elegant way, ideally something that can also be part of a dplyr pipe (%>%) flow.
The code below creates a dummy data set; I would also appreciate suggestions for more elegant ways to create one.
#initialize data set
set.seed(101)
df <- as.data.frame(matrix(round(runif(25,2,5),0), nrow = 5, ncol = 5))
# add more columns
for (i in 1:5){
set.seed (101)
df_stage <-
as.data.frame(matrix(
round(runif(25,5*i , 10*i), 0), nrow = 5, ncol = 5
))
colnames(df_stage) <- paste("v",(10*i):(10*i+4))
df <- cbind(df, df_stage)
}
Another tidyverse approach with a bit of reshaping:
library(tidyverse)
n = 3
df %>%
summarise_all(mean) %>%
gather() %>%
top_n(n, value) %>%
pull(key) %>%
df[.]
We can do this with
library(dplyr)
n <- 3
df %>%
summarise_all(mean) %>%
unlist %>%
order(., decreasing = TRUE) %>%
head(n) %>%
df[.]
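The idea in both answers (aggregate each column, rank, subset) can also be wrapped in a small helper so that the aggregation is easy to swap. This is a sketch in base R; top_n_cols is a hypothetical name, not a function from dplyr, and it assumes every column is numeric and that fun returns a single number per column.

```r
# Keep the n columns with the largest value of fun (mean by default)
top_n_cols <- function(data, n, fun = mean) {
  stats <- vapply(data, fun, numeric(1))  # one summary value per column
  keep <- names(sort(stats, decreasing = TRUE))[seq_len(min(n, length(stats)))]
  data[keep]                              # columns ordered from highest to lowest
}

# Small made-up example: column means are a = 1.5, b = 15, c = 5, d = 0
toy <- data.frame(a = c(1, 2), b = c(10, 20), c = c(5, 5), d = c(0, 0))

top_n_cols(toy, 2)             # keeps b and c
top_n_cols(toy, 2, fun = sd)   # same helper with a different aggregation
```

Because the data frame is the first argument, the helper can also be used inside a pipe, for example df %>% top_n_cols(3).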

How to restrict full_join() duplicates? - R

I am a novice R programmer. Below is the dataframe I am using.
I am currently running into a filtering problem with the full_join() from tidyverse.
library(tidyverse)
set.seed(1234)
df <- data.frame(
trial = rep(0:1, each = 8),
sex = rep(c('M','F'), 4),
participant = rep(1:4, 4),
x = runif(16, 1, 10),
y = runif(16, 1, 10))
df
I am currently doing the following operation to do the full_join()
df <- df %>% mutate(k = 1)
df <- df %>%
full_join(df, by = "k")
I am restricting the results to obtain the combination of points for the same participant between the trials
df2 <- filter(df, sex.x == sex.y, participant.x == participant.y, trial.x != trial.y)
df3 <- filter(df2, participant.x == 1)
df3
Here, at this step, I am running into trouble. I do not care about the order of the points. How do I condense the duplicates into one row?
Thank you
Depending on which columns you are considering, use the duplicated() function. The first line below removes duplicates based on the first 5 columns (the .x columns); the second removes duplicates based on columns 7 to 11 (the .y columns).
df3[!duplicated(df3[,1:5]),]
df3[!duplicated(df3[,7:11]),]
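Another way to look at the problem: because the pairs are restricted to different trials, every unordered pair shows up exactly twice, once in each order. Keeping only the rows where trial.x is smaller than trial.y therefore leaves one row per pair. The sketch below illustrates this with base R's merge() for the cross join instead of the k = 1 trick; the names pairs_all, same, ordered and unordered are hypothetical.

```r
set.seed(1234)
df <- data.frame(
  trial = rep(0:1, each = 8),
  sex = rep(c("M", "F"), 8),
  participant = rep(1:4, 4),
  x = runif(16, 1, 10),
  y = runif(16, 1, 10)
)

# by = NULL gives the Cartesian product; shared names get .x / .y suffixes
pairs_all <- merge(df, df, by = NULL)

# Same participant (and sex) on both sides of the pair
same <- pairs_all$sex.x == pairs_all$sex.y &
  pairs_all$participant.x == pairs_all$participant.y

# Original filter: each pair appears in both orders
ordered <- pairs_all[same & pairs_all$trial.x != pairs_all$trial.y, ]

# Order-insensitive version: keep only one direction of each pair
unordered <- pairs_all[same & pairs_all$trial.x < pairs_all$trial.y, ]

nrow(ordered[ordered$participant.x == 1, ])      # both orders
nrow(unordered[unordered$participant.x == 1, ])  # half as many rows
```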
