Related
My aim is to eliminate duplicates from a dataset.
For that I wrote a program that calculates correlations.
I want to take the name of the variables that have a correlation higher than a specific value I determine.
Here's one of the results I got.
M926T709 M927T709_1 M927T709_2 M929T709
M926T709 1.0000000 0.9947082 0.9879702 0.8716944
M927T709_1 0.9947082 1.0000000 0.9955145 0.8785669
M927T709_2 0.9879702 0.9955145 1.0000000 0.8621052
M929T709 0.8716944 0.8785669 0.8621052 1.0000000
Let's say i want to obtain the name of variables that have percentage high than 95%
so i should obtain this result
M926T709 , M927T709_1 , M927T709_2
Edit : the answer given by Ronak Shah worked well , but i need to obtain the results as vector so i can use the names after
Note, that I shouldn't analyze orthogonal results because they always equal to 1.
Please tell me if you need any clarification, also tell me if you want to see my entire program.
Using rowSums and colSums you can count how many values are more than 0.95 in each row and column respectively and then return the names.
tmp <- mat > 0.95
diag(tmp) <- FALSE
names(Filter(function(x) x > 0, rowSums(tmp) > 0 | colSums(tmp) > 0))
#[1] "M926T709" "M927T709_1" "M927T709_2"
Sample data: the limit and the correlation matrix m (with an added negative correlation for demonstration purposes):
limit <- 0.95
m <- as.matrix( read.table(text = "
M926T709 M927T709_1 M927T709_2 M929T709
M926T709 1.0000000 -0.9947082 0.9879702 0.8716944
M927T709_1 -0.9947082 1.0000000 0.9955145 0.8785669
M927T709_2 0.9879702 0.9955145 1.0000000 0.8621052
M929T709 0.8716944 0.8785669 0.8621052 1.0000000"))
Create a subset of the desired matrix and extract the row/column names.
Target <- unique( # Remove any duplicates
unlist( # merge subvectors of the `dimnames` list into one
dimnames( # gives all names of rows and columns of the matrix below
# Create a subset of the matrix that ignores correlations < limit
m[rowSums(abs(m) * upper.tri(m) > limit) > 0, # Rows
colSums(abs(m) * upper.tri(m) > limit) > 0] # Columns
),
recursive = FALSE))
Target
#> [1] "M926T709" "M927T709_1" "M927T709_2"
Created on 2021-10-25 by the reprex package (v2.0.1)
I am trying to create multiple confusion matrices from one dataframe, with each matrix generated based off a different condition in the dataframe.
So for the dataframe below, I want a confusion matrix for when Value = 1, Value = 2, Value =3
observed predicted Value
1 1 1
0 1 1
1 0 2
0 0 2
1 1 3
0 0 3
and see the results like:
Value Sensitivity Specificity PPV NPV
1 .96 .71 .84 .95
2 .89 .63 .30 .45
3 .88 .95 .28 .80
This is what I tried with a reproducible example. I am trying to write a loop that looks at every row, determines if Age = 1, and then pulls the values from the predicted and observed columns to generate a confusion matrix. Then I manually pull out the values from the confusion matrix to write out sen, spec, ppv, and npv and tried to combine all the matrices together. And then the loop starts again with Age = 2.
data(scat)
df<-scat %>% transmute(observed=ifelse(Site=="YOLA","case", "control"), predicted=ifelse(Location=="edge","case", "control"),Age)
x<-1 #evaluate at ages 1 through 5
for (i in dim(df)[1]) { #for every row in df
while(x<6) { #loop stops at Age=5
if(x=df$Age) {
q<-confusionMatrix(data = df$predicted, reference = df$observed, positive = "case")
sensitivity = q$table[1,1]/(q$table[1,1]+q$table[2,1])
specificity = q$table[2,2]/(q$table[2,2]+q$table[1,2])
ppv = q$table[1,1]/(q$table[1,1]+q$table[1,2])
npv = q$table[2,2]/(q$table[2,2]+q$table[2,1])
matrix(c(sensitivity, specificity, ppv, npv),ncol=4,byrow=TRUE)
}
}
x <- x + 1 #confusion matrix at next Age value
}
final<- rbind(matrix) #combine all the matrices together
However, this loop is completely non-functional. I'm not sure where the error is.
Your code can be simplified and the desired output achieved like this:
library(caret)
library(dplyr)
data(scat)
df <- scat %>%
transmute(observed = factor(ifelse(Site == "YOLA","case", "control")),
predicted = factor(ifelse(Location == "edge","case", "control")),
Age)
final <- t(sapply(sort(unique(df$Age)), function(i) {
q <- confusionMatrix(data = df$predicted[df$Age == i],
reference = df$observed[df$Age == i],
positive = "case")$table
c(sensitivity = q[1, 1] / (q[1, 1] + q[2, 1]),
specificity = q[2, 2] / (q[2, 2] + q[1, 2]),
ppv = q[1, 1] / (q[1, 1] + q[1, 2]),
npv = q[2, 2] / (q[2, 2] + q[2, 1]))
}))
Resulting in
final
#> sensitivity specificity ppv npv
#> [1,] 0.0 0.5625000 0.00000000 0.8181818
#> [2,] 0.0 1.0000000 NaN 0.8000000
#> [3,] 0.2 0.5882353 0.06666667 0.8333333
#> [4,] 0.0 0.6923077 0.00000000 0.6923077
#> [5,] 0.5 0.6400000 0.25000000 0.8421053
However, it's nice to know why your own code didn't work, so here are a few issues that might be useful to consider:
You need factor columns rather than character columns for confusionMatrix
You were incrementing through the rows of df, but you need one iteration for each unique age, not each row in your data frame.
Your line to increment x happens outside of the while loop, so x never increments and the loop never terminates, so the console just hangs.
You are doing if(x = df$Age), but you need a == to test equality.
It doesn't make sense to compare x to df$Age anyway, because x is length 1 and df$Age is a long vector.
You have unnecessary repetition by doing q$table each time. You can just make q equal to q$table to make your code more readable and less error-prone.
You call matrix at the end of the loop, but you don't store it anywhere, so the whole loop doesn't actually do anything.
You are trying to rbind an object called matrix in the last line which doesn't exist
Your lack of spaces between math operators, commas and variables make the code less readable and harder to debug. I'm not just saying this as a stylistic point; it is a major source of errors I see frequently here on SO.
I have to calculate cosine similarity (patient similarity metric) in R between 48k patients data with some predictive variables. Here is the equation: PSM(P1,P2) = P1.P2/ ||P1|| ||P2||
where P1 and P2 are the predictor vectors corresponding to two different patients, where for example P1 index patient and P2 will be compared with index (P1) and finally pairwise patient similarity metric PSM(P1,P2) will be calculated.
This process will go on for all 48k patients.
I have added sample data-set for 300 patients in a .csv file. Please find the sample data-set here.https://1drv.ms/u/s!AhoddsPPvdj3hVTSbosv2KcPIx5a
First things first: You can find more rigorous treatments of cosine similarity at either of these posts:
Find cosine similarity between two arrays
Creating co-occurrence matrix
Now, you clearly have a mixture of data types in your input, at least
decimal
integer
categorical
I suspect that some of the integer values are Booleans or additional categoricals. Generally, it will be up to you to transform these into continuous numerical vectors if you want to use them as input into the similarity calculation. For example, what's the distance between admission types ELECTIVE and EMERGENCY? Is it a nominal or ordinal variable? I will only be modelling the columns that I trust to be numerical dependent variables.
Also, what have you done to ensure that some of your columns don't correlate with others? Using just a little awareness of data science and biomedical terminology, it seems likely that the following are all correlated:
diasbp_max, diasbp_min, meanbp_max, meanbp_min, sysbp_max and sysbp_min
I suggest going to a print shop and ordering a poster-size printout of psm_pairs.pdf. :-) Your eyes are better at detecting meaningful (but non-linear) dependencies between variable. Including multiple measurements of the same fundamental phenomenon may over-weight that phenomenon in your similarity calculation. Don't forget that you can derive variables like
diasbp_rage <- diasbp_max - diasbp_min
Now, I'm not especially good at linear algebra, so I'm importing a cosine similarity function form the lsa text analysis package. I'd love to see you write out the formula in your question as an R function. I would write it to compare one row to another, and use two nested apply loops to get all comparisons. Hopefully we'll get the same results!
After calculating the similarity, I try to find two different patients with the most dissimilar encounters.
Since you're working with a number of rows that's relatively large, you'll want to compare various algorithmic methodologies for efficiency. In addition, you could use SparkR/some other Hadoop solution on a cluster, or the parallel package on a single computer with multiple cores and lots of RAM. I have no idea whether the solution I provided is thread-safe.
Come to think of it, the transposition alone (as I implemented it) is likely to be computationally costly for a set of 1 million patient-encounters. Overall, (If I remember my computational complexity correctly) as the number of rows in your input increases, the performance could degrade exponentially.
library(lsa)
library(reshape2)
psm_sample <- read.csv("psm_sample.csv")
row.names(psm_sample) <-
make.names(paste0("patid.", as.character(psm_sample$subject_id)), unique = TRUE)
temp <- sapply(psm_sample, class)
temp <- cbind.data.frame(names(temp), as.character(temp))
names(temp) <- c("variable", "possible.type")
numeric.cols <- (temp$possible.type %in% c("factor", "integer") &
(!(grepl(
pattern = "_id$", x = temp$variable
))) &
(!(
grepl(pattern = "_code$", x = temp$variable)
)) &
(!(
grepl(pattern = "_type$", x = temp$variable)
))) | temp$possible.type == "numeric"
psm_numerics <- psm_sample[, numeric.cols]
row.names(psm_numerics) <- row.names(psm_sample)
psm_numerics$gender <- as.integer(psm_numerics$gender)
psm_scaled <- scale(psm_numerics)
pair.these.up <- psm_scaled
# checking for independence of variables
# if the following PDF pair plot is too big for your computer to open,
# try pair-plotting some random subset of columns
# keep.frac <- 0.5
# keep.flag <- runif(ncol(psm_scaled)) < keep.frac
# pair.these.up <- psm_scaled[, keep.flag]
# pdf device sizes are in inches
dev <-
pdf(
file = "psm_pairs.pdf",
width = 50,
height = 50,
paper = "special"
)
pairs(pair.these.up)
dev.off()
#transpose the dataframe to get the
#similarity between patients
cs <- lsa::cosine(t(psm_scaled))
# this is super inefficnet, because cs contains
# two identical triangular matrices
cs.melt <- melt(cs)
cs.melt <- as.data.frame(cs.melt)
names(cs.melt) <- c("enc.A", "enc.B", "similarity")
extract.pat <- function(enc.col) {
my.patients <-
sapply(enc.col, function(one.pat) {
temp <- (strsplit(as.character(one.pat), ".", fixed = TRUE))
return(temp[[1]][[2]])
})
return(my.patients)
}
cs.melt$pat.A <- extract.pat(cs.melt$enc.A)
cs.melt$pat.B <- extract.pat(cs.melt$enc.B)
same.pat <- cs.melt[cs.melt$pat.A == cs.melt$pat.B ,]
different.pat <- cs.melt[cs.melt$pat.A != cs.melt$pat.B ,]
most.dissimilar <-
different.pat[which.min(different.pat$similarity),]
dissimilar.pat.frame <- rbind(psm_numerics[rownames(psm_numerics) ==
as.character(most.dissimilar$enc.A) ,],
psm_numerics[rownames(psm_numerics) ==
as.character(most.dissimilar$enc.B) ,])
print(t(dissimilar.pat.frame))
which gives
patid.68.49 patid.9
gender 1.00000 2.00000
age 41.85000 41.79000
sysbp_min 72.00000 106.00000
sysbp_max 95.00000 217.00000
diasbp_min 42.00000 53.00000
diasbp_max 61.00000 107.00000
meanbp_min 52.00000 67.00000
meanbp_max 72.00000 132.00000
resprate_min 20.00000 14.00000
resprate_max 35.00000 19.00000
tempc_min 36.00000 35.50000
tempc_max 37.55555 37.88889
spo2_min 90.00000 95.00000
spo2_max 100.00000 100.00000
bicarbonate_min 22.00000 26.00000
bicarbonate_max 22.00000 30.00000
creatinine_min 2.50000 1.20000
creatinine_max 2.50000 1.40000
glucose_min 82.00000 129.00000
glucose_max 82.00000 178.00000
hematocrit_min 28.10000 37.40000
hematocrit_max 28.10000 45.20000
potassium_min 5.50000 2.80000
potassium_max 5.50000 3.00000
sodium_min 138.00000 136.00000
sodium_max 138.00000 140.00000
bun_min 28.00000 16.00000
bun_max 28.00000 17.00000
wbc_min 2.50000 7.50000
wbc_max 2.50000 13.70000
mingcs 15.00000 15.00000
gcsmotor 6.00000 5.00000
gcsverbal 5.00000 0.00000
gcseyes 4.00000 1.00000
endotrachflag 0.00000 1.00000
urineoutput 1674.00000 887.00000
vasopressor 0.00000 0.00000
vent 0.00000 1.00000
los_hospital 19.09310 4.88130
los_icu 3.53680 5.32310
sofa 3.00000 5.00000
saps 17.00000 18.00000
posthospmort30day 1.00000 0.00000
Usually I wouldn't add a second answer, but that might be the best solution here. Don't worry about voting on it.
Here's the same algorithm as in my first answer, applied to the iris data set. Each row contains four spatial measurements of the flowers form three different varieties of iris plants.
Below that you will find the iris analysis, written out as nested loops so you can see the equivalence. But that's not recommended for production with large data sets.
Please familiarize yourself with starting data and all of the intermediate dataframes:
The input iris data
psm_scaled (the spatial measurements, scaled to mean=0, SD=1)
cs (the matrix of pairwise similarities)
cs.melt (the pairwise similarities in long format)
At the end I have aggregated the mean similarities for all comparisons between one variety and another. You will see that comparisons between individuals of the same variety have mean similarities approaching 1, and comparisons between individuals of the same variety have mean similarities approaching negative 1.
library(lsa)
library(reshape2)
temp <- iris[, 1:4]
iris.names <- paste0(iris$Species, '.', rownames(iris))
psm_scaled <- scale(temp)
rownames(psm_scaled) <- iris.names
cs <- lsa::cosine(t(psm_scaled))
# this is super inefficient, because cs contains
# two identical triangular matrices
cs.melt <- melt(cs)
cs.melt <- as.data.frame(cs.melt)
names(cs.melt) <- c("enc.A", "enc.B", "similarity")
names(cs.melt) <- c("flower.A", "flower.B", "similarity")
class.A <-
strsplit(as.character(cs.melt$flower.A), '.', fixed = TRUE)
cs.melt$class.A <- sapply(class.A, function(one.split) {
return(one.split[1])
})
class.B <-
strsplit(as.character(cs.melt$flower.B), '.', fixed = TRUE)
cs.melt$class.B <- sapply(class.B, function(one.split) {
return(one.split[1])
})
cs.melt$comparison <-
paste0(cs.melt$class.A , '_vs_', cs.melt$class.B)
cs.agg <-
aggregate(cs.melt$similarity, by = list(cs.melt$comparison), mean)
print(cs.agg[order(cs.agg$x),])
which gives
# Group.1 x
# 3 setosa_vs_virginica -0.7945321
# 7 virginica_vs_setosa -0.7945321
# 2 setosa_vs_versicolor -0.4868352
# 4 versicolor_vs_setosa -0.4868352
# 6 versicolor_vs_virginica 0.3774612
# 8 virginica_vs_versicolor 0.3774612
# 5 versicolor_vs_versicolor 0.4134413
# 9 virginica_vs_virginica 0.7622797
# 1 setosa_vs_setosa 0.8698189
If you’re still not comfortable with performing lsa::cosine() on a scaled, numerical dataframe, we can certainly do explicit pairwise calculations.
The formula you gave for PSM, or cosine similarity of patients, is expressed in two formats at Wikipedia
Remembering that vectors A and B represent the ordered list of attributes for PatientA and PatientB, the PSM is the dot product of A and B, divided by (the scalar product of [the magnitude of A] and [the magnitude of B])
The terse way of saying that in R is
cosine.sim <- function(A, B) { A %*% B / sqrt(A %*% A * B %*% B) }
But we can rewrite that to look more similar to your post as
cosine.sim <- function(A, B) { A %*% B / (sqrt(A %*% A) * sqrt(B %*% B)) }
I guess you could even re-write that (the calculations of similarity between a single pair of individuals) as a bunch of nested loops, but in the case of a manageable amount of data, please don’t. R is highly optimized for operations on vectors and matrices. If you’re new to R, don’t second guess it. By the way, what happened to your millions of rows? This will certainly be less stressful now that your down to tens of thousands.
Anyway, let’s say that each individual only has two elements.
individual.1 <- c(1, 0)
individual.2 <- c(1, 1)
So you can think of individual.1 as a line that passes between the origin (0,0) and (0, 1) and individual.2 as a line that passes between the origin and (1, 1).
some.data <- rbind.data.frame(individual.1, individual.2)
names(some.data) <- c('element.i', 'element.j')
rownames(some.data) <- c('individual.1', 'individual.2')
plot(some.data, xlim = c(-0.5, 2), ylim = c(-0.5, 2))
text(
some.data,
rownames(some.data),
xlim = c(-0.5, 2),
ylim = c(-0.5, 2),
adj = c(0, 0)
)
segments(0, 0, x1 = some.data[1, 1], y1 = some.data[1, 2])
segments(0, 0, x1 = some.data[2, 1], y1 = some.data[2, 2])
So what’s the angle between vector individual.1 and vector individual.2? You guessed it, 0.785 radians, or 45 degrees.
cosine.sim <- function(A, B) { A %*% B / (sqrt(A %*% A) * sqrt(B %*% B)) }
cos.sim.result <- cosine.sim(individual.1, individual.2)
angle.radians <- acos(cos.sim.result)
angle.degrees <- angle.radians * 180 / pi
print(angle.degrees)
# [,1]
# [1,] 45
Now we can use the cosine.sim function I previously defined, in two nested loops, to explicitly calculate the pairwise similarities between each of the iris flowers. Remember, psm_scaled has already been defined as the scaled numerical values from the iris dataset.
cs.melt <- lapply(rownames(psm_scaled), function(name.A) {
inner.loop.result <-
lapply(rownames(psm_scaled), function(name.B) {
individual.A <- psm_scaled[rownames(psm_scaled) == name.A, ]
individual.B <- psm_scaled[rownames(psm_scaled) == name.B, ]
similarity <- cosine.sim(individual.A, individual.B)
return(list(name.A, name.B, similarity))
})
inner.loop.result <-
do.call(rbind.data.frame, inner.loop.result)
names(inner.loop.result) <-
c('flower.A', 'flower.B', 'similarity')
return(inner.loop.result)
})
cs.melt <- do.call(rbind.data.frame, cs.melt)
Now we repeat the calculation of cs.melt$class.A, cs.melt$class.B, and cs.melt$comparison as above, and calculate cs.agg.from.loops as the mean similarity between the various types of comparisons:
cs.agg.from.loops <-
aggregate(cs.agg.from.loops$similarity, by = list(cs.agg.from.loops $comparison), mean)
print(cs.agg.from.loops[order(cs.agg.from.loops$x),])
# Group.1 x
# 3 setosa_vs_virginica -0.7945321
# 7 virginica_vs_setosa -0.7945321
# 2 setosa_vs_versicolor -0.4868352
# 4 versicolor_vs_setosa -0.4868352
# 6 versicolor_vs_virginica 0.3774612
# 8 virginica_vs_versicolor 0.3774612
# 5 versicolor_vs_versicolor 0.4134413
# 9 virginica_vs_virginica 0.7622797
# 1 setosa_vs_setosa 0.8698189
Which, I believe is identical to the result we got with lsa::cosine.
So what I'm trying to say is... why wouldn't you use lsa::cosine?
Maybe you should be more concerned with
selection of variables, including removal of highly correlated variables
scaling/normalizing/standardizing the data
performance with a large input data set
identifying known similars and dissimilars for quality control
as previously addressed
This is the first time I am working with graphs and R igraph package and I need some help with processing graph objects.
What I want to achieve:
From a given contact matrix extract shortest confident path between nodes. By confident I mean that edge weights are higher then neighbouring edges.
Examples:
A
m <- read.table(row.names = 1, header = TRUE, text =
" A B C D E F
A 0 1 1 1 1 5
B 1 0 1 1e2 1e2 1
C 1 1 0 1 1 1
D 1 1e2 1 0 1e2 1
E 1 1e2 1 1e2 0 1
F 5 1 1 1 1 0")
m <- as.matrix(m)
ig <- graph.adjacency(m, mode = "undirected", weighted = TRUE, diag = FALSE)
sp <- shortest.paths(ig, algorithm = "dijkstra")
In matrix m there is one cluster (clique?) between B-D-E (ie., egde weights between those nodes are high). However, as there is weight between A and F I am also getting cluster there, even though edge weight is low (only 5).
Question A: How to extract only those clusters that have high edge weight? I can transform those contacts to 0 with m[which(m <= 5)] <- 0, but I hope that there is more "mathy" solution for this in igraph package.
B
m <- read.table(row.names = 1, header = TRUE, text =
" A B C D E F
A 0 1 1 5 1 1
B 1 0 1 1e2 1e2 1
C 1 1 0 1 1 1
D 5 1e2 1 0 1e2 1
E 1 1e2 1 1e2 0 1
F 1 1 1 1 1 0")
m <- as.matrix(m)
ig <- graph.adjacency(m, mode = "undirected", weighted = TRUE, diag = FALSE)
sp <- shortest.paths(ig, algorithm = "dijkstra")
In matrix m there is cluster between B-D-E, but as there is low weight between A and B - A is also connected to this cluster.
Question B: How to not assign nodes to a cluster if edge weight is low?
This is my first question here, if you need clarification or better examples I will improve my questions.
First, it is good to know that when looking up paths, igraph understands weights as costs, i.e. on edges with higher weight it costs more to travel, so it will consider shorter the paths with lower sum weight. It is easy to turn this into the opposite, just take the reciprocal of your weights (1 / E(ig)$weight). Between 2 vertices there might be only one shortest path, but sometimes there are more equally short paths. You can look up all of them (all_shortest_paths), or tell igraph to return only one of the shortests for each pairs of vertices (shortest_paths). Each call of these methods returns the paths from one selected vertex, to have the paths between all pairs, you need to call these once for each vertex (ok, at an undirected graph, it is enough to call for half of the vertices). To formulate what I explained until this point:
spaths <- lapply(V(ig),
function(v){
all_shortest_paths(ig, v,
weight = 1 / E(ig)$weight
)
}
)
Here spaths will be a list of lists, access the paths from C to all the vertices like this:
spaths$C$res
[[1]]
+ 2/6 vertices, named:
[1] C A
[[2]]
+ 2/6 vertices, named:
[1] C B
[[3]]
+ 1/6 vertex, named:
[1] C
[[4]]
+ 2/6 vertices, named:
[1] C D
[[5]]
+ 2/6 vertices, named:
[1] C E
[[6]]
+ 2/6 vertices, named:
[1] C F
spaths$C$res[[2]] # this is the path from `C` to `B`,
# a vector of 2 vertices
Note, the third element is actually from C to itself, you can either ignore it, or provide a vector of all other vertices to the to parameter of all_shortest_paths. Also, in your example all shortest paths will be of length 1, but if I set for example the weight of B--E to 1 instead of 100, we see that the method works, and from B to E the shortest path will be B-D-E.
Regarding your second question, here it is not completely clear what do you want to achieve, especially how do you get these clusters? If you want to find communities, i.e. more closely connected group of vertices, taking into account also the edge weights, there are many methods for this, all those named cluster_[...] or community.[...] in igraph. For example, if we run the fastgreedy method on your graph, it will detect the cluster you mentioned:
fg <- fastgreedy.community(ig, weights = E(ig)$weight)
IGRAPH clustering fast greedy, groups: 2, mod: 0.059
+ groups:
$`1`
[1] "A" "C" "F"
$`2`
[1] "B" "D" "E"
So here we have the B, D, E cluster, what is connected with higher weight edges. If we run the same method without weights, all the vertices will belong to one group (fastgreedy.community(ig, weights = NULL)). Note, at community detection, igraph understands weights as strength, so vertices connected with higher weight edges more likely to cluster together, this is kind of opposite like how it works at calculating paths.
I have an existing data frame with a variable "grade" indicating the type of row/observation. My goal is to select from another dataframe more of these types of rows while not exceeding a maximum percentage for each grade type in my existing data frame. I have defined a named vector with the grade allocations:
gradeAllocation <- c("A" = 0, "B" = 0, "C" = .25, "D" = .40, "E" = .20, "F" = .10, "G" = .05)
This represents the maximum percent of each type of grade in my data frame. Now, lets say I want to select from another data frame a mixture of grades but I dont want to select too many where after the selection would give me more than the max percentage per grade type. I would be basically doing this process in a loop for each new data set that becomes available but want to keep the max distribution given by the gradeAllocation vector.
Is there a package/function that can help here? Any thoughts for custom code?
Thanks, John
So as #Mr.Flick points out, there is no guarantee that this will be possible. In your gradeAllocation the sampling distribution sums to 1. If your test dataset has no "D", for example, it will not be possible to create a sample with at most 25% C, 15% E, 10% F, 5% G, and no A or B.
Also, because the sampling distribution sums to 1, if the sample size you want is N, then the number of samples of each grade must be given by N * gradeAllocation. Here is a method that takes advantage of that fact, starting with a dataset that has 700 samples and is uniformly distributed (same number in each grade), and we extract a random sample of 100 with the distribution given by gradeAllocation.
# sample dataset: 700 observations, grade distribution is uniform
set.seed(1) # for reproducible example
data <- data.frame(grade=rep(LETTERS[1:7],each=100),x=rnorm(700))
# desired distribution in the sample
gradeAllocation <- c(A=0, B=0, C=.25, D=.40, E=.20, F=.10, G=.05)
# you start here...
N <- 100 # sample size
get.sample<- function(g) data[sample(which(data$grade==g),N*gradeAllocation[g]),]
result <- do.call(rbind,lapply(LETTERS[1:7],get.sample))
# confirm distribution of grades in the sample
table(result$grade)
# A B C D E F G
# 0 0 25 40 20 10 5
Here's one approach
Generate some data
nOriginal <- 1000
df1 <- data.frame(grade=sample(c('A','B','C','D','E','F','G'),1000,replace=TRUE),
indx=seq(1:nOriginal))
Get the rows that correspond to each grade
idx_a=which(df1$grade=='A')
idx_b=which(df1$grade=='B')
idx_c=which(df1$grade=='C')
idx_d=which(df1$grade=='D')
idx_e=which(df1$grade=='E')
idx_f=which(df1$grade=='F')
idx_g=which(df1$grade=='G')
Sample the rows based on the prescribed distribution which should sum to one.
location <- c("A" = 0, "B" = 0, "C" = .25, "D" = .40, "E" = .20, "F" = .10, "G" = .05)
nSamples = 200
samp_idx_a <- sample(idx_a,nSamples*location["A"])
samp_idx_b <- sample(idx_b,nSamples*location["B"])
samp_idx_c <- sample(idx_c,nSamples*location["C"])
samp_idx_d <- sample(idx_d,nSamples*location["D"])
samp_idx_e <- sample(idx_e,nSamples*location["E"])
samp_idx_f <- sample(idx_f,nSamples*location["F"])
samp_idx_g <- sample(idx_g,nSamples*location["G"])
df_2 <- df1[c(samp_idx_a, samp_idx_b, samp_idx_c, samp_idx_d,
samp_idx_e, samp_idx_f, samp_idx_g),]
Check the results
(percent_A = sum(df_2$grade=="A")/nrow(df_2)*100)
(percent_B = sum(df_2$grade=="B")/nrow(df_2)*100)
(percent_C = sum(df_2$grade=="C")/nrow(df_2)*100)
(percent_D = sum(df_2$grade=="D")/nrow(df_2)*100)
(percent_E = sum(df_2$grade=="E")/nrow(df_2)*100)
(percent_F = sum(df_2$grade=="F")/nrow(df_2)*100)
(percent_G = sum(df_2$grade=="G")/nrow(df_2)*100)