I don't understand how weighting works in the dplyr::sample_n function. I have a list of very similar small numbers (ranging from 0.1020457 to 0.1789028) and I need to weight my sampling so that I get some from the lower end, some from the upper end, and some from the middle. Because the numbers are so similar, I'm not sure how to do that. I also don't want to restrict my sampling to a certain range (e.g. numbers > 0.16); I just want those values to be far more likely to be sampled.
I can make the range larger (-1.552115 to 2.008253), but that means scaling my data, and I can't weight with negative numbers, so I have to do things like abs(numbers - maximum). Here's an example of what I'm doing:
sample_n(data.frame(scaledMeasurement$V1), 4,
         replace = FALSE,
         weight = abs(scaledMeasurement$V1 - max(scaledMeasurement$V1)))
Here's a section of my data:
Measurement ID
0.8022473 1
1.6991193 2
0.7262765 3
0.3903775 4
-1.5521155 5
-0.7821887 6
If your goal is to get a sample that contains some values at the low end, some near the median, and some at the high end, it's far easier to avoid weights and just use group_by() + sample_n().
library(tidyverse)
df = tibble(my_nums = runif(10,0.1020457,0.1789028))
df %>%
mutate(quantile = case_when(
my_nums <= quantile(my_nums, probs = c(0.33)) ~ "a",
my_nums <= quantile(my_nums, probs = c(0.67)) ~ "b",
TRUE ~ "c"
)) %>%
group_by(quantile) %>%
sample_n(2)
Produces:
my_nums quantile
<dbl> <chr>
1 0.105 a
2 0.105 a
3 0.151 b
4 0.124 b
5 0.173 c
6 0.172 c
However, if you want to use weights, sample_n() requires that the weights be the same length as the data being sampled and that they sum to 1. You could add a weight column based on a subdivision of your data (the quantile groups shown above): group by that column, assign each row a random integer from 1 to the group size, ungroup, and then divide the weight column by its sum. Like so:
df %>%
mutate(quantile = case_when(
my_nums <= quantile(my_nums, probs = c(0.33)) ~ "a",
my_nums <= quantile(my_nums, probs = c(0.67)) ~ "b",
TRUE ~ "c"
)) %>%
group_by(quantile) %>%
mutate(weight = sample(seq(1,length(my_nums)),length(my_nums))) %>%
ungroup %>% arrange(quantile) %>%
mutate(weight = weight / sum(weight)) %>%
sample_n(6, weight = weight)
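For reference, weights can also be passed to sample_n() directly without any grouping; here is a minimal sketch, assuming the same df as above, where the weights favour values far from the median so that both tails become more likely to be drawn:
# sketch only: weight each value by its distance from the median
df %>%
  mutate(weight = abs(my_nums - median(my_nums)),
         weight = weight / sum(weight)) %>% # normalise so the weights sum to 1
  sample_n(4, weight = weight)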
I have a fairly large dataset and I am running a for loop that removes one line per transect and calculates the frequency of each category. I am now trying to make it remove a whole transect each iteration instead of one line per transect. Is it possible to do this?
Here is a sample dataset with the same columns I have
Transect<- c(1,1,1,1,1,2,2,2,2,2,3,3,3,3,3)
Category<- c("S","S","S","C","T","S","SP","T", "C", "T","S","SP","T","S","C")
dat<- data.frame(Transect,Category)
The current code below removes one line per transect. How could I change it so that it randomly deletes a whole transect each time (i.e. in the first iteration all of transect 3 is removed, in the second all of transect 1, and so on)?
for (q in 1:2) {
  for (i in 0:5) {
    # drop i rows at random from each transect
    df <- dat %>%
      group_by(Transect) %>%
      sample_n(n() - i, replace = TRUE) %>%
      ungroup()
    # tabulate category frequencies for this subsample
    c <- df %>%
      group_by(Category) %>%
      summarise(n = n()) %>%
      mutate(freq = n / sum(n),
             total = 55 - i)
    if (i == 0) {
      tot_1 = c
    } else {
      tot_1 = bind_rows(tot_1, c)
    }
  }
  tot_1$rep = q
  if (q == 1) {
    dftot = tot_1
  } else {
    dftot = bind_rows(dftot, tot_1)
  }
}
It seems your goal is to iteratively assess increasingly small subsamples of your data and measure how much representation of the whole is lost. This code tries dropping a random 1, then 2, then 3... transects and reports the distribution of categories. The last few lines normalize the counts to fractions of the total for easy comparison between iterations.
Note that I used set.seed() because the random sampling would otherwise return a different result each run.
To break down this answer a bit:
It's important that Category is a factor so that table() won't drop any Category values that have no count in a particular iteration. It would run to a point but then the rowbinding operation within map_dfr() would fail.
First I enumerate the numbers of transects to leave out using 0:length(unique(d$Transect)) (0:3 in this example). I included 0 so that we can see what it looks like with the full dataset.
I used set_names() so that it becomes a named vector. This allows us to use .id inside map_dfr() so that we get an extra column which stores the value of the leaveout.
purrr::map_dfr() will iteratively apply a function over some list. In this case I piped in the list of leaveout values (which we just named) and the function we apply is given as an rlang-style lambda function which begins with ~ and operates on the argument .x.
Working from the inside of the filter() call outward, the function first randomly samples .x values of Transect to exclude and then removes the rows with those Transect values, using %in% and negating the whole result with ! at the start.
Then we just use dplyr::pull() to take the Category column as a vector and run table() on it to tabulate the occurrence of each value.
The rest just calculates the total count for each iteration and then divides the values by that to get a fraction.
library(tidyverse)
d <- tibble(
Transect = as.character(c(1,1,1,1,1,2,2,2,2,2,3,3,3,3,3)),
Category = factor(c("S","S","S","C","T","S","SP","T", "C", "T","S","SP","T","S","C"))
)
set.seed(1)
0:length(unique(d$Transect)) %>% set_names() %>%
map_dfr( ~ d %>%
filter(!Transect %in% sample(unique(d$Transect), size = .x)) %>%
pull(Category) %>%
table(),
.id = "leaveout_transects") %>%
rowwise() %>%
mutate(total_count = sum(c_across(-1)), .after = 1) %>%
mutate(across(-c(1:2), ~.x/total_count))
#> # A tibble: 4 × 6
#> # Rowwise:
#> leaveout_transects total_count C S SP T
#> <chr> <int> <table> <table> <table> <table>
#> 1 0 15 0.2 0.4 0.1333333 0.2666667
#> 2 1 10 0.2 0.3 0.2000000 0.3000000
#> 3 2 5 0.2 0.2 0.2000000 0.4000000
#> 4 3 0 NaN NaN NaN NaN
It would probably be more rigorous to simulate each leaveout condition multiple times and look at the distribution of performance you get at each value to assess what's likely to happen in the future with a given subsample.
Base R has the built-in function replicate(), which is great for this purpose. Here I'm just wrapping the code above in replicate() and then reformatting the data a bit to graph it.
# use replicate to make many simulations
n_reps <- 20
replicate(
n_reps,
0:length(unique(d$Transect)) %>% set_names() %>%
map_dfr(
~ d %>%
filter(!Transect %in% sample(unique(d$Transect), size = .x)) %>%
pull(Category) %>%
table(),
.id = "leaveout_transects"
) %>%
rowwise() %>%
mutate(total_count = sum(c_across(-1)), .after = 1) %>%
mutate(across(-c(1:2), ~ .x / total_count)) %>%
select(3:6) %>%
t() %>%
cor() %>%
.[, 1]) %>%
  as_tibble(.name_repair = "unique") %>%
  mutate(leaveout_transects = factor(0:length(unique(d$Transect)))) %>%
  pivot_longer(-leaveout_transects, values_to = "correlation") %>%
  select(-name) %>%
  ggplot(aes(leaveout_transects, correlation)) +
  geom_boxplot()
Created on 2022-09-22 by the reprex package (v2.0.1)
Starting data
I'm working in R and I have a set of data generated from groups (cohorts) of animals treated with different doses of different drugs. A simplified reproducible example of my dataset follows:
# set starting values for simulation of animal cohorts across doses of various drugs with a few numeric endpoints
cohort_size <- 3
animals <- letters[1:cohort_size]
drugs <- factor(c("A", "B", "C"))
doses <- factor(c(0, 10, 100))
total_size <- cohort_size * length(drugs) * length(doses)
# simulate data based on above parameters
df <- cbind(expand.grid(drug = drugs, dose = doses, animal = animals),
data.frame(
other_metadata = sample(LETTERS[24:26], size = total_size, replace = TRUE),
num1 = rnorm(total_size, mean = 10, sd = 3),
num2 = rnorm(total_size, mean = 60, sd = 9),
num3 = runif(total_size, min = 1, max = 5)))
This produces something like:
## drug dose animal other_metadata num1 num2 num3
## 1 A 0 a X 6.448411 54.49473 4.111368
## 2 B 0 a Y 9.439396 67.39118 4.917354
## 3 C 0 a Y 8.519773 67.11086 3.969524
## 4 A 10 a Z 6.286326 69.25982 2.194252
## 5 B 10 a Y 12.428265 70.32093 1.679301
## 6 C 10 a X 13.278707 68.37053 1.746217
My goal
For each drug treatment, I consider the dose == 0 animals as my control group for that drug (let's say each drug was run at a different time and has its own control group). I wish to calculate the mean of each numeric endpoint (columns 5:7 in this example) for the control group. Next I want to normalize (divide) every numeric endpoint (columns 5:7) for every animal by the mean of its respective control group.
In other words num1 for all animals where drug == "A" should be divided by the mean of num1 for all animals where drug == "A" AND dose == 0 and so on for each endpoint.
The final output should be the same size as the original data.frame with all of the non-numeric metadata columns remaining unchanged on the left side and all the numeric data columns now with the normalized values.
Naturally I'd like to find the simplest solution possible: minimal creation of new variables, ideally in a single dplyr pipeline.
What I've tried so far
I should say that I have technically solved this, but the solution is super ugly and takes a ton of steps, so I'm hoping for help finding a more elegant one.
I know I can easily get the averages for the control groups into a new data.frame using:
df %>%
filter(dose == 0) %>%
group_by(drug, dose) %>%
summarise_all(mean)
I've looked into several things but can't figure out how to implement them. In order of what seems most promising to me:
dplyr::group_modify()
dplyr::rowwise()
sweep() in some type of loop
Thanks in advance for any help you can offer!
If the intention is to divide the numeric columns by the mean of the control-group values within each 'drug', group by 'drug' and use mutate() with across() (from dplyr 1.0.0), dividing each column's values (.) by the mean of the values where 'dose' is 0:
library(dplyr) # 1.0.0
df %>%
group_by(drug) %>%
mutate(across(where(is.numeric), ~ ./mean(.[dose == 0])))
If the dplyr version is < 1.0.0, use mutate_if:
df %>%
group_by(drug) %>%
mutate_if(is.numeric, ~ ./mean(.[dose == 0]))
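As a quick sanity check, here is a minimal sketch assuming the simulated df from the question: after normalisation, the control rows should average 1 for every endpoint.
# hedged check: control-group means should all be ~1 after dividing by themselves
df %>%
  group_by(drug) %>%
  mutate(across(where(is.numeric), ~ ./mean(.[dose == 0]))) %>%
  filter(dose == 0) %>%
  summarise(across(where(is.numeric), mean))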
I want to collapse the following data frame, using both summation and weighted averages, according to groups.
I have the following data frame
group_id = c(1,1,1,2,2,3,3,3,3,3)
var_1 = sample.int(20, 10)
var_2 = sample.int(20, 10)
var_percent_1 =rnorm(10,.5,.4)
var_percent_2 =rnorm(10,.5,.4)
weighting =sample.int(50, 10)
df_to_collapse = data.frame(group_id,var_1,var_2,var_percent_1,var_percent_2,weighting)
I want to collapse my data according to the groups identified by group_id. However, in my data, I have variables in absolute levels (var_1, var_2) and in percentage terms (var_percent_1, var_percent_2).
I create two lists for each type of variable (my real data is much bigger, making this necessary). I also have a weighting variable (weighting).
to_be_weighted =df_to_collapse[, 4:5]
to_be_summed = df_to_collapse[,2:3]
to_be_weighted_2=colnames(to_be_weighted)
to_be_summed_2=colnames(to_be_summed)
My goal is to collapse my data using either a sum or a weighted average, according to the type of variable (i.e. if it's in percentage terms, I use the weighted average).
Here is my best attempt:
df_to_collapse %>% group_by(group_id) %>% summarise_at(.vars = c(to_be_summed_2,to_be_weighted_2), .funs=c(sum, mean))
But, as you can see, it is not a weighted average.
I have tried many different ways of using the weighted.mean function, but have had no luck. Here is an example of one such attempt:
df_to_collapse %>% group_by(group_id) %>% summarise_at(.vars = c(to_be_weighted_2,to_be_summed_2), .funs=c(weighted.mean(to_be_weighted_2, weighting), sum))
And the corresponding error:
Error in weighted.mean.default(to_be_weighted_2, weighting) :
'x' and 'w' must have the same length
Here's a way to do it by reshaping into long data, adding a dummy variable called type for whether it's a percentage (optional, but handy), applying a function in summarise based on whether it's a percentage, then spreading back to wide shape. If you can change column names, you could come up with a more elegant way of doing the type column, but that's really more for convenience.
The trick for me was the type[1] == "percent"; I had to use [1] because everything in each group has the same type, but without it == operates over every value in the vector and returns multiple logical values when you really just need one.
library(tidyverse)
set.seed(1234)
group_id = c(1,1,1,2,2,3,3,3,3,3)
var_1 = sample.int(20, 10)
var_2 = sample.int(20, 10)
var_percent_1 =rnorm(10,.5,.4)
var_percent_2 =rnorm(10,.5,.4)
weighting =sample.int(50, 10)
df_to_collapse <- data.frame(group_id,var_1,var_2,var_percent_1,var_percent_2,weighting)
df_to_collapse %>%
gather(key = var, value = value, -group_id, -weighting) %>%
mutate(type = ifelse(str_detect(var, "percent"), "percent", "int")) %>%
group_by(group_id, var) %>%
summarise(sum_or_avg = ifelse(type[1] == "percent", weighted.mean(value, weighting), sum(value))) %>%
ungroup() %>%
spread(key = var, value = sum_or_avg)
#> # A tibble: 3 x 5
#> group_id var_1 var_2 var_percent_1 var_percent_2
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 26 31 0.269 0.483
#> 2 2 32 21 0.854 0.261
#> 3 3 29 49 0.461 0.262
Created on 2018-05-04 by the reprex package (v0.2.0).
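In tidyr 1.0.0 and later, gather()/spread() are superseded by pivot_longer()/pivot_wider(); a minimal sketch of the same pipeline with the newer verbs, assuming the same df_to_collapse:
# same reshape-summarise-reshape idea, written with the newer pivot verbs
df_to_collapse %>%
  pivot_longer(-c(group_id, weighting), names_to = "var", values_to = "value") %>%
  mutate(type = ifelse(str_detect(var, "percent"), "percent", "int")) %>%
  group_by(group_id, var) %>%
  summarise(sum_or_avg = ifelse(type[1] == "percent",
                                weighted.mean(value, weighting),
                                sum(value))) %>%
  ungroup() %>%
  pivot_wider(names_from = var, values_from = sum_or_avg)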
I have a large dataset, and I have multiple groups I want to sample. Each group has a certain number of positive cases, with a value of 1, and a lot more negative cases, with a value of zero.
For each group, I want to select all the positive cases, and then a random amount of negative cases equal to 4x the amount of positive cases in that group.
I also need something that runs quickly on a lot of data.
Semi-Update:
stratified_sample = data %>%
group_by(group) %>%
mutate(n_pos = sum(response == 1),
n_neg = 4 * n_pos) %>%
group_by(group,response) %>%
mutate(rec_num = n(),
random_val = runif(n()),
random_order = rank(random_val)) %>%
filter(response == 1 | random_order <= n_neg)
This should work if you sub in the correct names. If you have issues, provide a reproducible example.
library(dplyr)
stratified_sample = your_large_dataset %>%
group_by(whatever_your_grouping_variable_is) %>%
mutate(n_pos = sum(column_name_of_your_label == 1),
n_neg = sum(column_name_of_your_label == 0),
cutoff = 4 * n_pos / n_neg) %>%
filter(column_name_of_your_label == 1 | runif(n()) < cutoff)
This gives each negative case a probability of (4 * number of positive cases / number of negative cases) of being selected, so the sampled fraction won't be exact, but it has the expected value that you want.
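To check the realised ratio afterwards, a quick per-group count works; this is a small sketch reusing the placeholder names from the answer above:
# count positives and sampled negatives within each group
stratified_sample %>%
  count(whatever_your_grouping_variable_is, column_name_of_your_label)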
Problem
Make a new column containing percent
Data
df<- data.frame(
species = c ("A","A","A","A","B","B","B","B","A","A","A","A","B","B","B","B"),
number = c(1,1,2,2,1,1,2,2,1,1,2,2,1,1,2,2),
treatment = c(0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1),
variable = c ("x","y","x","y","x","y","x","y","x","y","x","y","x","y","x","y"),
value = sample(1:16)
)
Question
I would like to calculate the percent for a species at a given number and treatment, i.e. variables x and y (the first two lines) should sum to 100%.
I tried with dplyr:
result <- df%>%
group_by(variable) %>%
mutate(percent = value*100/sum(value))
test<-subset(result,variable=="x")
sum(test[,6]) # sums to 100%
"test" is wrong because it’s the percent of all x across both species and both treatment.
desired output
species number treatment variable value percent
A 1 0 x 40 40
A 1 0 y 60 60
A 2 0 x 1 10
A 2 0 y 9 90
Here is an answer that uses tidyr:
require(tidyr)
require(dplyr)
df %>% spread(variable, value) %>%
mutate(percent.x = x / (x+y),
percent.y = y / (x+y))
Here also is a dplyr-only solution:
df %>% group_by(number, treatment, species) %>%
mutate(percent = 100 * value / sum(value))
Your problem was that you were doing group_by() on exactly the wrong variable. Since you want the percentage defined within a particular (number, treatment, species) combination, but varying across variable, you should group_by() the former, not the latter.
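A quick way to confirm the grouping is right, as a sketch assuming the df from the question: every (species, number, treatment) group should now sum to 100.
# each group's percent values should total 100
df %>%
  group_by(number, treatment, species) %>%
  mutate(percent = 100 * value / sum(value)) %>%
  summarise(total = sum(percent))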
Is this what you're looking for? I'm using the data.table package:
library(data.table)
DT <- as.data.table(df)
DT_output <- DT[,list(value=sum(value)),by=c('species', 'number', 'treatment', 'variable')]
DT_temp <- DT[,list(sum=sum(value)),by=c('species', 'number', 'treatment' )]
DT_output <- merge(DT_output, DT_temp, by = c('species', 'number', 'treatment'))
DT_output[, percent := 100 * value / sum]
setorder(DT_output, species,treatment,number,variable)
DT_output