Related
I need to extract a sample that has equal distribution in each experience-level group. For your info, there are total 4 groups (1, 2, 3, 4 years of exp), and total 8 people (A, B, C, D, E, F, G, H) in this example scenario. I was trying to come up with a function with loops, but don't know how to. Please help me out! Thank you! :)
library(tidyverse)
data <- tibble(id = c("A","A","A","B","B","C","C","D","D","D","D","E","E","E","E","F","F","G","G","G","H","H","H","H"),year_exp = c(1,2,3,1,2,1,2,1,2,3,4,1,2,3,4,1,2,1,2,3,1,2,3,4), pre_year_exp = year_exp - 1)
data_0 <- data %>% filter(year_exp == max(year_exp) - 0) %>% sample_n(2)
data_1 <- data %>% filter(year_exp == max(year_exp) - 1) %>% anti_join(data_0, by = 'id') %>% sample_n(2)
data_2 <- data %>% filter(year_exp == max(year_exp) - 2) %>% anti_join(data_0, by = 'id') %>% anti_join(data_1, by = 'id') %>% sample_n(2)
data_3 <- data %>% filter(year_exp == max(year_exp) - 3) %>% anti_join(data_0, by = 'id') %>% anti_join(data_1, by = 'id') %>% anti_join(data_2, by = 'id')
#Result Table
result <- data_0 %>% bind_rows(data_1, data_2, data_3)
result
The below produces the same output as your code and extends the idea to allow for an arbitrary number of values of year_exp using a for loop.
Please note that because this simply extends your code, it must share the following (possibly-undesirable) features with your code:
The code moves sequentially through groups, sampling from the members of later groups who were not sampled for early groups. Accordingly, there is a risk that the code throws an error because it tries to sample from groups whose members were already sampled from previous, other groups.
The probabilities of selection are not uniformly distributed across members of a group. Accordingly, the samples drawn from each group are not representative of that group.
In the event that there data were instead a balanced panel, there are much more efficient and simpler ways to accomplish this.
library(tibble)
library(dplyr)
set.seed(123)
# Create original data
data <- tibble(id = c("A","A","A","B","B","C","C","D","D","D","D","E","E","E","E","F","F","G","G","G","H","H","H","H"),
year_exp = c(1,2,3,1,2,1,2,1,2,3,4,1,2,3,4,1,2,1,2,3,1,2,3,4),
pre_year_exp = year_exp - 1)
# Assign values to parameters used by/in the loop.
J <- data$id %>% unique %>% length # unique units/persons (8)
K <- data$year_exp %>% unique %>% length # unique groups/years (4)
N <- 2 # sample size per group (2)
# Initialize objects loop will modify
samples_list <- vector(mode = "list", length = K) # stores each sample
used_ids <- rep(NA_character_, J) # stores used ids
index <- 1:N # initial indices for used ids
# For-loop solution
for (k in 1:K) {
# Identifier for current group
cur_group <- 1 + K - k
# Sample from persons in current group who were not previously sampled
one_sample <- data %>%
filter(year_exp == cur_group, !(id %in% used_ids)) %>%
slice_sample(n = N)
# Save sample and the id values for those sampled
samples_list[[k]] <- one_sample
used_ids[index] <- one_sample$id
index <- index + N
}
# Bind into a single data.frame
bind_rows(samples_list)
#> # A tibble: 8 x 3
#> id year_exp pre_year_exp
#> <chr> <dbl> <dbl>
#> 1 H 4 3
#> 2 D 4 3
#> 3 G 3 2
#> 4 E 3 2
#> 5 C 2 1
#> 6 B 2 1
#> 7 F 1 0
#> 8 A 1 0
I am new to R and we have been given a dataset about flies with column heading such as species and sex. In total there are 111 species. The goal is to know how many males and females are in each species and to have it in a form that can be used for further analysis (t-test).
Ideally I would have one data frame with 3 columns (Species, number of females, number of males). I used the split function which has given me the best result thus far the problem is that I don't know how to do it for 111 species in a reasonable amount of time. I though about using a for loop but am unsure about how I could do that. This is the split code that I used:
data_split <- split(data, data$Species)
data_split
sp1 <- data_split$D_acutila
data.frame(table(sp1$Sex))
I created a simple data frame that I believe is similar to the one you are working with
df <- data.frame(species = sample(c('A', 'B', 'C'), 100, replace = T),
sex = sample(c('M', 'F'), 100, replace = T))
df
You can solve this problem using the dplyr package:
library(dplyr)
#create an auxiliar column to be summed
df$aux <- 1
#group by species and sex, summarise the groups by the sum of the auxiliar column
summary <- df %>%
group_by(species, sex) %>%
summarise(count_subjects = sum(aux))
summary
Using base R you could do:
set.seed(42)
data <- data.frame(
Species = rep(c("A", "B", "C"), 20),
Sex = sample(c("m", "f"), 60, replace = TRUE)
)
data_split <- split(data, data$Species)
data_table <- lapply(data_split, function(x) table(x$Sex))
data_table <- do.call("rbind", data_table)
cbind(Species = rownames(data_table), as.data.frame(data_table))
#> Species f m
#> A A 11 9
#> B B 12 8
#> C C 10 10
or usingtapply instead of split + lapply:
data_table <- tapply(data$Sex, data$Species, table)
data_table <- do.call("rbind", data_table)
cbind(Species = rownames(data_table), as.data.frame(data_table))
#> Species f m
#> A A 11 9
#> B B 12 8
#> C C 10 10
Created on 2021-06-06 by the reprex package (v2.0.0)
I have a data frame with N vars, M categorical and 2 numeric. I would like to create M data frames, one for each categorical variable.
Eg.,
data %>%
group_by(var1) %>%
summarise(sumVar5 = sum(var5),
meanVar6 = mean(var6))
data %>%
group_by(varM) %>%
summarise(sumVar5 = sum(var5),
meanVar6 = mean(var6))
etc...
Is there a way to iterate through the categorical variables and generate each of the summary tables? That is, without needing to repeat the above chunks M times.
Alternatively, these summary tables don't have to be individual objects, as long as I can easily reference / pull the summaries for each of the M variables.
Here is a solution (I hope). Creates a list of data frames with the formula you have:
library(tidyverse)
# Create sample data frame
data <- data.frame(var1 = sample(1:2, 5, replace = T),
var2 = sample(1:2, 5, replace = T),
var3 = sample(1:2, 5, replace = T),
varM = sample(1:2, 5, replace = T),
var5 = rnorm(5, 3, 6),
var6 = rnorm(5, 3, 6))
# Vars to be grouped (var1 until varM in this example)
vars_to_be_used <- names(select(data, var1:varM))
# Function to be used
group_fun <- function(x, .df = data) {
.df %>%
group_by_(.x) %>%
summarise(sumVar5 = sum(var5),
meanVar6 = mean(var6))
}
# Loop over vars
results <- map(vars_to_be_used, group_fun)
# Nice list names
names(results) <- vars_to_be_used
print(results)
You didn't supply a sample data.set so I created a small example to show how it works.
data <- data_frame(var1 = rep(letters[1:5], 2),
var2 = rep(LETTERS[11:15], 2),
var3 = 1:10,
var4 = 11:20)
A combination of tidyverse packages can get you where you need to be.
Steps used: First we gather all the columns we want to group by on in a cols column and keep the numeric vars separate. Next we split the data.frame in a list of data.frames so that every column we want to group by on has it's own table with the 2 numeric vars. Now that everything is in a list, we need to use the map functionality from the purrr package. Using map, we spread the data.frame again so the column names are as we expect them to be. Finally using map we use group_by_if to group by on the character column and summarise the rest. All the outcomes are stored in a list where you can access what you need.
Run the code in pieces to see what every step does.
library(dplyr)
library(purrr)
library(tidyr)
outcomes <- data %>%
gather(cols, value, -c(var3, var4)) %>%
split(.$cols) %>%
map(~ spread(.x, cols, value)) %>%
map(~ group_by_if(.x, is.character) %>%
summarise(sumvar3 = sum(var3),
meanvar4 = mean(var4)))
outcomes
$`var1`
# A tibble: 5 x 3
var1 sumvar3 meanvar4
<chr> <int> <dbl>
1 a 7 13.5
2 b 9 14.5
3 c 11 15.5
4 d 13 16.5
5 e 15 17.5
$var2
# A tibble: 5 x 3
var2 sumvar3 meanvar4
<chr> <int> <dbl>
1 K 7 13.5
2 L 9 14.5
3 M 11 15.5
4 N 13 16.5
5 O 15 17.5
I would like to run a KW-test over certain numerical variables from a data frame, using one grouping variable. I'd prefer to do this in a loop, instead of typing out all the tests, as they are many variables (more than in the example below).
Simulated data:
library(dplyr)
set.seed(123)
Data <- tbl_df(
data.frame(
muttype = as.factor(rep(c("missense", "frameshift", "nonsense"), each = 80)),
ados.tsc = runif(240, 0, 10),
ados.sa = runif(240, 0, 10),
ados.rrb = runif(240, 0, 10))
) %>%
group_by(muttype)
ados.sim <- as.data.frame(Data)
The following code works just fine outside of the loop.
kruskal.test(formula (paste((colnames(ados.sim)[2]), "~ muttype")), data =
ados.sim)
But it doesn't inside the loop:
for(i in names(ados.sim[,2:4])){
ados.mtp <- kruskal.test(formula (paste((colnames(ados.sim)[i]), "~ muttype")),
data = ados.sim)
}
I get the error:
Error in terms.formula(formula, data = data) :
invalid term in model formula
Anybody who knows how to solve this?
Much appreciated!!
Try:
results <- list()
for(i in names(ados.sim[,2:4])){
results[[i]] <- kruskal.test(formula(paste(i, "~ muttype")), data = ados.sim)
}
This also saves your results in a list and avoids overwriting your results as ados.mtp in every iteration, which I think is not what you intended to do.
Note the following:
for(i in names(ados.sim[,2:4])){
print(i)
}
[1] "ados.tsc"
[1] "ados.sa"
[1] "ados.rrb"
That is, i already gives you the name of the column. The problem in your code was that you tried to use it like an integer for subsetting, which turned the outcome into NA.
for(i in names(ados.sim[,2:4])){
print(paste((colnames(ados.sim)[i]), "~ muttype"))
}
[1] "NA ~ muttype"
[1] "NA ~ muttype"
[1] "NA ~ muttype"
And just for reference, all of this could also be done in the following two ways that I often prefer since it makes subsequent analysis slightly easier:
First, store all test objects in a dataframe:
library(tidyr)
df <- ados.sim %>% gather(key, value, -muttype) %>%
group_by(key) %>%
do(test = kruskal.test(x= .$value, g = .$muttype))
You can then subset the dataframe to get the test outcomes:
df[df$key == "ados.rrb",]$test
[[1]]
Kruskal-Wallis rank sum test
data: .$value and .$muttype
Kruskal-Wallis chi-squared = 2.2205, df = 2, p-value = 0.3295
Alternatively, get all results directly in a dataframe, without storing the test objects:
library(broom)
df2 <- ados.sim %>% gather(key, value, -muttype) %>%
group_by(key) %>%
do(tidy(kruskal.test(x= .$value, g = .$muttype)))
df2
# A tibble: 3 x 5
# Groups: key [3]
key statistic p.value parameter method
<chr> <dbl> <dbl> <int> <fctr>
1 ados.rrb 2.2205031 0.3294761 2 Kruskal-Wallis rank sum test
2 ados.sa 0.1319554 0.9361517 2 Kruskal-Wallis rank sum test
3 ados.tsc 0.3618102 0.8345146 2 Kruskal-Wallis rank sum test
I have a dataframe that looks something like this:
time id trialNum trialType accX gravX
1 1 6 7 low -0.38876217 10.185266
2 2 1 6 low 0.68254705 10.741545
3 3 3 15 high -0.21906854 9.466929
4 4 2 15 none -0.03370001 9.490829
5 5 4 1 high 0.16511542 10.986796
6 6 9 2 none -0.10441621 9.915561
You can generate something similar using this:
testDF <- data.frame(time = 1:50,
id = sample(1:10, size=50, replace=T),
trialNum = sample(1:15, size = 50, replace=T),
trialType = sample(c("none", "low", "high"),
size = 50, replace=T),
accX = sin(seq(1,50,1)),
gravX = 0.1)
And a function to calculate the average time between peaks in a filtered signal (returning mean time, and variance of the time differences):
library(dplyr)
library(signal)
library(quantmod)
calcStepTime <- function(df){
bf <- butter(1, c(0.03,0.05), type="pass")
filtered <- filtfilt(bf, df$accX - df$gravX)
peaks <- findPeaks(filtered)
peakValue <- filtered[peaks]
peakTime <- df$time[peaks]
timeDifferences <- diff(peakTime)
meanStepTime <- mean(timeDifferences)
varianceStepTime <- var(timeDifferences)
return(c(meanStepTime, varianceStepTime))
}
What I'm trying to do apply this function to each combination of id, trialNum, and trialType using groupby:
tempTrial <-
group_by(testDF, id, trialNum, trialType) %>%
summarise(meanTime = calcStepTime(.)[1],
varianceTime= calcStepTime(.)[2])
The problem is that in the output dataframe (tempTrial) every row of meanTime and varianceTime is identical
In this toy dataset, sometimes the columns all show NA (this doesnt happen in my actual dataset)
Am I doing something incorrectly to cause each row to be identical for the 2 columns? It should be taking each combination of id, trialNum and trialType, and calculating peak times for each of those separately. However, it seems its only storing a single value for each combination?
The chain is working properly in the sense that . refers to the grouped data frame group_by(testDF, id, trialNum, trialType). Since your defined function has no way of using the group information in ., the results are what you see (i.e. the function applied to the whole data frame).
So your problem here is the incorrect use of summarise. Latrunculia's answer shows you that the proper way to use summarise in the way you expect is to apply the function to combinations of columns in your data frame, in which case the function applies by group in each variable.
dplyr has a do function for applications where you wish to apply a function to the data frame subset implied by group_by. Simply replace your summarise with do:
tempTrial <- group_by(testDF, id, trialNum, trialType) %>% do(meanTime = calcStepTime(.)[1], varianceTime= calcStepTime(.)[2])
The documentation for do is not terribly clear, but this post describes the application very well.
What you get right now is the result of calcStepTime applied on the whole (ungrouped) data frame for each group.
Try rewriting the function such that it depends on the variables, but not on the data frame.
alcStepTime <- function(var1, var2, var3){
bf <- butter(1, c(0.03,0.05), type="pass")
filtered <- filtfilt(bf, var1 - var2)
peaks <- findPeaks(filtered)
peakValue <- filtered[peaks]
peakTime <- var3[peaks]
timeDifferences <- diff(peakTime)
meanStepTime <- mean(timeDifferences)
varianceStepTime <- var(timeDifferences)
return(c(meanStepTime, varianceStepTime))
}
testDF %>% group_by(testDF, id, trialNum, trialType) %>%
summarise(meanTime = calcStepTime( accX, gravX, time)[1],
varianceTime= calcStepTime(accX, gravX, time)[2])
It gives the right result if you just pipe the testDF data frame into it. It breaks for the grouped DF but I can't find if that's because the function is not defined for the subsets or if it's a problem with the function.
let me know if it works for the full data
As noted by yourself and Latrunculia, calcStepTime is very likely to return NaN/NA on the 50 observation datasets. This occurs when either no peak or a single peak was found within a group of observations. You may want to defend against this in your analysis code. I used this for testing:
testDF <- data.frame(time = 1:200,
id = sample(1:2, size=200, replace=T),
trialNum = sample(1:1, size = 200, replace=T),
trialType = sample(c("low"), size = 200, replace=T),
accX = sin(seq(1,200,1)),
gravX = 0.1)
If you change the return type of your function of data_frame (tibble), like so:
calcStepTime <- function(df){
bf <- butter(1, c(0.03,0.05), type="pass")
filtered <- filtfilt(bf, df$accX - df$gravX)
peaks <- findPeaks(filtered)
peakValue <- filtered[peaks]
peakTime <- df$time[peaks]
timeDifferences <- diff(peakTime)
meanStepTime <- mean(timeDifferences)
varianceStepTime <- var(timeDifferences)
return (data_frame("meanStepTime" = meanStepTime,
"varianceStepTime" = varianceStepTime))
}
Then you can take advantage of purrr::by_slice() for a fairly elegant solution:
library(purrr)
testDF %>%
group_by(id, trialNum, trialType) %>%
by_slice(calcStepTime, .collate="cols")
I got this from my test sample:
# A tibble: 2 x 5
id trialNum trialType meanStepTime1 varianceStepTime1
<int> <int> <fctr> <dbl> <dbl>
1 1 1 low 42.75 802.2500
2 2 1 low 39.75 616.9167
Note that .collate="cols" is the important argument that tells by_slice() to create the named columns for the results in the output. I'm a little curious myself as to why the "1" has been appended to the names we set in the data_frame returned by your function.