Cosine similarity (patient similarity metric) between 48k patients' data with predictive variables - r
I have to calculate cosine similarity (a patient similarity metric) in R for data on 48k patients with some predictive variables. Here is the equation: PSM(P1, P2) = (P1 · P2) / (||P1|| ||P2||)
where P1 and P2 are the predictor vectors corresponding to two different patients. For example, P1 is the index patient and P2 is the patient compared against it; the result is the pairwise patient similarity metric PSM(P1, P2).
This process is repeated for all 48k patients.
I have added a sample data set of 300 patients as a .csv file. Please find the sample data set here: https://1drv.ms/u/s!AhoddsPPvdj3hVTSbosv2KcPIx5a
First things first: You can find more rigorous treatments of cosine similarity at either of these posts:
Find cosine similarity between two arrays
Creating co-occurrence matrix
Now, you clearly have a mixture of data types in your input, at least
decimal
integer
categorical
I suspect that some of the integer values are Booleans or additional categoricals. Generally, it will be up to you to transform these into continuous numerical vectors if you want to use them as input into the similarity calculation. For example, what's the distance between admission types ELECTIVE and EMERGENCY? Is it a nominal or ordinal variable? I will only be modelling the columns that I trust to be numerical dependent variables.
Also, what have you done to ensure that some of your columns don't correlate with others? Using just a little awareness of data science and biomedical terminology, it seems likely that the following are all correlated:
diasbp_max, diasbp_min, meanbp_max, meanbp_min, sysbp_max and sysbp_min
I suggest going to a print shop and ordering a poster-size printout of psm_pairs.pdf. :-) Your eyes are better at detecting meaningful (but non-linear) dependencies between variables. Including multiple measurements of the same fundamental phenomenon may over-weight that phenomenon in your similarity calculation. Don't forget that you can derive variables like
diasbp_range <- diasbp_max - diasbp_min
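One way to screen for the correlated columns mentioned above is to look at the correlation matrix and list the pairs that exceed a cutoff. The sketch below uses synthetic data, not the poster's file: the column names are borrowed from the question, but the values and the 0.9 cutoff are assumptions chosen for illustration. Note that this only catches linear dependence, so it complements the pair plot rather than replacing it.

```r
# Synthetic stand-in for a few vital-sign columns (hypothetical values).
set.seed(1)
n <- 200
sysbp_min <- rnorm(n, mean = 100, sd = 15)
vitals <- data.frame(
  sysbp_min    = sysbp_min,
  sysbp_max    = sysbp_min + rnorm(n, mean = 40, sd = 3),  # tracks sysbp_min closely
  meanbp_min   = 0.7 * sysbp_min + rnorm(n, sd = 2),       # also tracks sysbp_min
  resprate_min = rnorm(n, mean = 14, sd = 3)               # generated independently
)

cor.mat <- cor(vitals)
# Look at each pair once: blank out the lower triangle and the diagonal.
cor.mat[!upper.tri(cor.mat)] <- NA
high <- which(abs(cor.mat) > 0.9, arr.ind = TRUE)  # assumed cutoff of 0.9
high.pairs <- data.frame(
  var.A = rownames(cor.mat)[high[, "row"]],
  var.B = colnames(cor.mat)[high[, "col"]],
  r     = cor.mat[high]
)
print(high.pairs)
```

Which member of a flagged pair to drop, or whether to replace both with a derived variable such as a range, is a judgment call that depends on the clinical meaning of the columns.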
Now, I'm not especially good at linear algebra, so I'm importing a cosine similarity function from the lsa text analysis package. I'd love to see you write out the formula in your question as an R function. I would write it to compare one row to another, and use two nested apply loops to get all comparisons. Hopefully we'll get the same results!
After calculating the similarity, I try to find two different patients with the most dissimilar encounters.
Since you're working with a number of rows that's relatively large, you'll want to compare various algorithmic methodologies for efficiency. In addition, you could use SparkR/some other Hadoop solution on a cluster, or the parallel package on a single computer with multiple cores and lots of RAM. I have no idea whether the solution I provided is thread-safe.
Come to think of it, the transposition alone (as I implemented it) is likely to be computationally costly for a set of 1 million patient-encounters. Overall (if I remember my computational complexity correctly), the number of pairwise comparisons grows with the square of the number of rows in your input, so performance will degrade quickly as the input grows.
library(lsa)
library(reshape2)
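# stringsAsFactors = TRUE keeps text columns (e.g. gender) as factors in R >= 4.0,
# which the column selection below relies on
psm_sample <- read.csv("psm_sample.csv", stringsAsFactors = TRUE)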
row.names(psm_sample) <-
make.names(paste0("patid.", as.character(psm_sample$subject_id)), unique = TRUE)
temp <- sapply(psm_sample, class)
temp <- cbind.data.frame(names(temp), as.character(temp))
names(temp) <- c("variable", "possible.type")
numeric.cols <- (temp$possible.type %in% c("factor", "integer") &
(!(grepl(
pattern = "_id$", x = temp$variable
))) &
(!(
grepl(pattern = "_code$", x = temp$variable)
)) &
(!(
grepl(pattern = "_type$", x = temp$variable)
))) | temp$possible.type == "numeric"
psm_numerics <- psm_sample[, numeric.cols]
row.names(psm_numerics) <- row.names(psm_sample)
psm_numerics$gender <- as.integer(psm_numerics$gender)
psm_scaled <- scale(psm_numerics)
pair.these.up <- psm_scaled
# checking for independence of variables
# if the following PDF pair plot is too big for your computer to open,
# try pair-plotting some random subset of columns
# keep.frac <- 0.5
# keep.flag <- runif(ncol(psm_scaled)) < keep.frac
# pair.these.up <- psm_scaled[, keep.flag]
# pdf device sizes are in inches
dev <-
pdf(
file = "psm_pairs.pdf",
width = 50,
height = 50,
paper = "special"
)
pairs(pair.these.up)
dev.off()
#transpose the dataframe to get the
#similarity between patients
cs <- lsa::cosine(t(psm_scaled))
# this is super inefficient, because cs contains
# two identical triangular matrices
cs.melt <- melt(cs)
cs.melt <- as.data.frame(cs.melt)
names(cs.melt) <- c("enc.A", "enc.B", "similarity")
extract.pat <- function(enc.col) {
my.patients <-
sapply(enc.col, function(one.pat) {
temp <- (strsplit(as.character(one.pat), ".", fixed = TRUE))
return(temp[[1]][[2]])
})
return(my.patients)
}
cs.melt$pat.A <- extract.pat(cs.melt$enc.A)
cs.melt$pat.B <- extract.pat(cs.melt$enc.B)
same.pat <- cs.melt[cs.melt$pat.A == cs.melt$pat.B ,]
different.pat <- cs.melt[cs.melt$pat.A != cs.melt$pat.B ,]
most.dissimilar <-
different.pat[which.min(different.pat$similarity),]
dissimilar.pat.frame <- rbind(psm_numerics[rownames(psm_numerics) ==
as.character(most.dissimilar$enc.A) ,],
psm_numerics[rownames(psm_numerics) ==
as.character(most.dissimilar$enc.B) ,])
print(t(dissimilar.pat.frame))
which gives
patid.68.49 patid.9
gender 1.00000 2.00000
age 41.85000 41.79000
sysbp_min 72.00000 106.00000
sysbp_max 95.00000 217.00000
diasbp_min 42.00000 53.00000
diasbp_max 61.00000 107.00000
meanbp_min 52.00000 67.00000
meanbp_max 72.00000 132.00000
resprate_min 20.00000 14.00000
resprate_max 35.00000 19.00000
tempc_min 36.00000 35.50000
tempc_max 37.55555 37.88889
spo2_min 90.00000 95.00000
spo2_max 100.00000 100.00000
bicarbonate_min 22.00000 26.00000
bicarbonate_max 22.00000 30.00000
creatinine_min 2.50000 1.20000
creatinine_max 2.50000 1.40000
glucose_min 82.00000 129.00000
glucose_max 82.00000 178.00000
hematocrit_min 28.10000 37.40000
hematocrit_max 28.10000 45.20000
potassium_min 5.50000 2.80000
potassium_max 5.50000 3.00000
sodium_min 138.00000 136.00000
sodium_max 138.00000 140.00000
bun_min 28.00000 16.00000
bun_max 28.00000 17.00000
wbc_min 2.50000 7.50000
wbc_max 2.50000 13.70000
mingcs 15.00000 15.00000
gcsmotor 6.00000 5.00000
gcsverbal 5.00000 0.00000
gcseyes 4.00000 1.00000
endotrachflag 0.00000 1.00000
urineoutput 1674.00000 887.00000
vasopressor 0.00000 0.00000
vent 0.00000 1.00000
los_hospital 19.09310 4.88130
los_icu 3.53680 5.32310
sofa 3.00000 5.00000
saps 17.00000 18.00000
posthospmort30day 1.00000 0.00000
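The efficiency concerns raised above (the transposition and the duplicated triangle in cs) can be sidestepped with plain matrix algebra. The sketch below is not part of the original answer: it uses a small random matrix with hypothetical row names in place of psm_scaled, divides each row by its length, and takes all row-by-row dot products with tcrossprod(). It should give the same values as lsa::cosine(t(x)), though I have not benchmarked it on the real data.

```r
# Hypothetical small matrix: rows are patients, columns are scaled predictors.
set.seed(42)
x <- scale(matrix(rnorm(5 * 4), nrow = 5,
                  dimnames = list(paste0("patid.", 1:5), paste0("var", 1:4))))

# Divide each row by its Euclidean length, then take all row-by-row dot products.
row.norms <- sqrt(rowSums(x^2))
x.unit <- x / row.norms          # a length-nrow vector recycles down columns, i.e. per row
cs.fast <- tcrossprod(x.unit)    # n x n matrix of cosine similarities

# Keep each unordered pair once (upper triangle, no diagonal).
idx <- which(upper.tri(cs.fast), arr.ind = TRUE)
pairs.once <- data.frame(
  enc.A      = rownames(cs.fast)[idx[, "row"]],
  enc.B      = colnames(cs.fast)[idx[, "col"]],
  similarity = cs.fast[idx]
)
print(pairs.once)
```

For 48k patients the full matrix still has about 2.3 billion entries, so in practice the rows would probably need to be processed in blocks; that part is left out of this sketch.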
Usually I wouldn't add a second answer, but that might be the best solution here. Don't worry about voting on it.
Here's the same algorithm as in my first answer, applied to the iris data set. Each row contains four spatial measurements of the flowers from three different varieties of iris plants.
Below that you will find the iris analysis, written out as nested loops so you can see the equivalence. But that's not recommended for production with large data sets.
Please familiarize yourself with starting data and all of the intermediate dataframes:
The input iris data
psm_scaled (the spatial measurements, scaled to mean=0, SD=1)
cs (the matrix of pairwise similarities)
cs.melt (the pairwise similarities in long format)
At the end I have aggregated the mean similarities for all comparisons between one variety and another. You will see that comparisons between individuals of the same variety have the highest mean similarities (approaching 1), while comparisons between individuals of different varieties have lower mean similarities (approaching negative 1 for setosa versus virginica).
library(lsa)
library(reshape2)
temp <- iris[, 1:4]
iris.names <- paste0(iris$Species, '.', rownames(iris))
psm_scaled <- scale(temp)
rownames(psm_scaled) <- iris.names
cs <- lsa::cosine(t(psm_scaled))
# this is super inefficient, because cs contains
# two identical triangular matrices
cs.melt <- melt(cs)
cs.melt <- as.data.frame(cs.melt)
names(cs.melt) <- c("flower.A", "flower.B", "similarity")
class.A <-
strsplit(as.character(cs.melt$flower.A), '.', fixed = TRUE)
cs.melt$class.A <- sapply(class.A, function(one.split) {
return(one.split[1])
})
class.B <-
strsplit(as.character(cs.melt$flower.B), '.', fixed = TRUE)
cs.melt$class.B <- sapply(class.B, function(one.split) {
return(one.split[1])
})
cs.melt$comparison <-
paste0(cs.melt$class.A , '_vs_', cs.melt$class.B)
cs.agg <-
aggregate(cs.melt$similarity, by = list(cs.melt$comparison), mean)
print(cs.agg[order(cs.agg$x),])
which gives
# Group.1 x
# 3 setosa_vs_virginica -0.7945321
# 7 virginica_vs_setosa -0.7945321
# 2 setosa_vs_versicolor -0.4868352
# 4 versicolor_vs_setosa -0.4868352
# 6 versicolor_vs_virginica 0.3774612
# 8 virginica_vs_versicolor 0.3774612
# 5 versicolor_vs_versicolor 0.4134413
# 9 virginica_vs_virginica 0.7622797
# 1 setosa_vs_setosa 0.8698189
If you’re still not comfortable with performing lsa::cosine() on a scaled, numerical dataframe, we can certainly do explicit pairwise calculations.
The formula you gave for PSM, or cosine similarity of patients, is expressed in two formats at Wikipedia
Remembering that vectors A and B represent the ordered list of attributes for PatientA and PatientB, the PSM is the dot product of A and B, divided by (the scalar product of [the magnitude of A] and [the magnitude of B])
The terse way of saying that in R is
cosine.sim <- function(A, B) { A %*% B / sqrt(A %*% A * B %*% B) }
But we can rewrite that to look more similar to your post as
cosine.sim <- function(A, B) { A %*% B / (sqrt(A %*% A) * sqrt(B %*% B)) }
I guess you could even re-write that (the calculations of similarity between a single pair of individuals) as a bunch of nested loops, but in the case of a manageable amount of data, please don’t. R is highly optimized for operations on vectors and matrices. If you’re new to R, don’t second guess it. By the way, what happened to your millions of rows? This will certainly be less stressful now that you're down to tens of thousands.
Anyway, let’s say that each individual only has two elements.
individual.1 <- c(1, 0)
individual.2 <- c(1, 1)
So you can think of individual.1 as a line that passes between the origin (0, 0) and (1, 0), and individual.2 as a line that passes between the origin and (1, 1).
some.data <- rbind.data.frame(individual.1, individual.2)
names(some.data) <- c('element.i', 'element.j')
rownames(some.data) <- c('individual.1', 'individual.2')
plot(some.data, xlim = c(-0.5, 2), ylim = c(-0.5, 2))
text(
some.data,
rownames(some.data),
adj = c(0, 0)
)
segments(0, 0, x1 = some.data[1, 1], y1 = some.data[1, 2])
segments(0, 0, x1 = some.data[2, 1], y1 = some.data[2, 2])
So what’s the angle between vector individual.1 and vector individual.2? You guessed it, 0.785 radians, or 45 degrees.
cosine.sim <- function(A, B) { A %*% B / (sqrt(A %*% A) * sqrt(B %*% B)) }
cos.sim.result <- cosine.sim(individual.1, individual.2)
angle.radians <- acos(cos.sim.result)
angle.degrees <- angle.radians * 180 / pi
print(angle.degrees)
# [,1]
# [1,] 45
Now we can use the cosine.sim function I previously defined, in two nested loops, to explicitly calculate the pairwise similarities between each of the iris flowers. Remember, psm_scaled has already been defined as the scaled numerical values from the iris dataset.
cs.melt <- lapply(rownames(psm_scaled), function(name.A) {
inner.loop.result <-
lapply(rownames(psm_scaled), function(name.B) {
individual.A <- psm_scaled[rownames(psm_scaled) == name.A, ]
individual.B <- psm_scaled[rownames(psm_scaled) == name.B, ]
similarity <- cosine.sim(individual.A, individual.B)
return(list(name.A, name.B, similarity))
})
inner.loop.result <-
do.call(rbind.data.frame, inner.loop.result)
names(inner.loop.result) <-
c('flower.A', 'flower.B', 'similarity')
return(inner.loop.result)
})
cs.melt <- do.call(rbind.data.frame, cs.melt)
Now we repeat the calculation of cs.melt$class.A, cs.melt$class.B, and cs.melt$comparison as above, and calculate cs.agg.from.loops as the mean similarity between the various types of comparisons:
cs.agg.from.loops <-
aggregate(cs.melt$similarity, by = list(cs.melt$comparison), mean)
print(cs.agg.from.loops[order(cs.agg.from.loops$x),])
# Group.1 x
# 3 setosa_vs_virginica -0.7945321
# 7 virginica_vs_setosa -0.7945321
# 2 setosa_vs_versicolor -0.4868352
# 4 versicolor_vs_setosa -0.4868352
# 6 versicolor_vs_virginica 0.3774612
# 8 virginica_vs_versicolor 0.3774612
# 5 versicolor_vs_versicolor 0.4134413
# 9 virginica_vs_virginica 0.7622797
# 1 setosa_vs_setosa 0.8698189
Which, I believe, is identical to the result we got with lsa::cosine.
So what I'm trying to say is... why wouldn't you use lsa::cosine?
Maybe you should be more concerned with
selection of variables, including removal of highly correlated variables
scaling/normalizing/standardizing the data
performance with a large input data set
identifying known similars and dissimilars for quality control
as previously addressed
Related
How to loop and use if else on this example with logical expressions using R
I have two lengthy data sets with several columns and different lengths. For this example, let's subset to a few rows and just 3 columns:
Temp <- c(12.9423, 12.9446, 12.9412, 12.9617, 12.9742, 12.9652, 12.9463, 12.9847, 12.9778,
          12.9589, 12.9305, 12.9275, 12.8569, 12.8531, 12.9092, 12.9471, 12.9298, 12.9266,
          12.9374, 12.9385, 12.9505, 12.9510, 12.9632, 12.9621, 12.9571, 12.9492, 12.8988,
          12.8895, 12.8777, 12.8956, 12.8748, 12.7850, 12.7323, 12.7546, 12.7375, 12.7020,
          12.7172, 12.7015, 12.6960, 12.6944, 12.6963, 12.6928, 12.6930, 12.6883, 12.6913)
Density <- c(26.38635, 26.38531, 26.38429, 26.38336, 26.38268, 26.38242, 26.38265, 26.38343,
             26.38486, 26.38697, 26.38945, 26.39188, 26.39365, 26.39424, 26.39376, 26.39250,
             26.39084, 26.38912, 26.38744, 26.38587, 26.38456, 26.38367, 26.38341, 26.38398,
             26.38547, 26.38793, 26.39120, 26.39509, 26.39955, 26.40455, 26.41002, 26.41578,
             26.42126, 26.42593, 26.42968, 26.43255, 26.43463, 26.43603, 26.43693, 26.43750,
             26.43787, 26.43815, 26.43841, 26.43871, 26.43904)
po4 <- c(0.4239840, 0.4351156, 0.4456128, 0.4542392, 0.4608510, 0.4656445, 0.4690847,
         0.4717291, 0.4742391, 0.4774904, 0.4831152, 0.4922122, 0.5029904, 0.5128720,
         0.5190209, 0.5191368, 0.5133212, 0.5027542, 0.4905301, 0.4796467, 0.4708035,
         0.4638879, 0.4578364, 0.4519745, 0.4481336, 0.4483697, 0.4531310, 0.4622930,
         0.4750474, 0.4905152, 0.5082183, 0.5278212, 0.5491580, 0.5720519, 0.5961127,
         0.6207716, 0.6449603, 0.6675704, 0.6878331, 0.7051851, 0.7195461, 0.7305200,
         0.7359634, 0.7343541, 0.7283988)
PP14 <- data.frame(Temp, Density, po4) ##df1
temp <- c(13.13875, 13.13477, 13.12337, 13.10662, 13.09798, 13.09542, 13.08734, 13.07616,
          13.06671, 13.05899, 13.05890, 13.05293, 13.03322, 13.01515, 13.02552, 13.01668,
          12.99829, 12.97075, 12.95572, 12.95045, 12.94541, 12.94365, 12.94609, 12.94256,
          12.93565, 12.93258, 12.93489, 12.93209, 12.92219, 12.90730, 12.90416, 12.89974,
          12.89749, 12.89626, 12.89395, 12.89315, 12.89274, 12.89276, 12.89293, 12.89302)
density <- c(26.35897, 26.36274, 26.36173, 26.36401, 26.36507, 26.36662, 26.36838, 26.36996,
             26.37286, 26.37452, 26.37402, 26.37571, 26.37776, 26.38008, 26.37959, 26.38178,
             26.38642, 26.39158, 26.39350, 26.39467, 26.39601, 26.39601, 26.39596, 26.39517,
             26.39728, 26.39766, 26.39774, 26.39699, 26.40081, 26.40328, 26.40416, 26.40486,
             26.40513, 26.40474, 26.40552, 26.40584, 26.40613, 26.40602, 26.40595, 26.40498)
krho <- c(-9.999999e+06, -1.786843e+00, -9.142976e-01, -9.650734e-01, -2.532397e+00,
          -3.760537e+00, -2.622484e+00, -1.776506e+00, -2.028391e+00, -2.225910e+00,
          -3.486826e+00, -2.062341e-01, -3.010643e+00, -3.878437e+00, -3.796426e+00,
          -3.227138e+00, -3.335446e+00, -3.738037e+00, -4.577778e+00, -3.818099e+00,
          -3.891467e+00, -4.585045e+00, -3.150283e+00, -4.371089e+00, -3.902601e+00,
          -4.546019e+00, -3.932538e+00, -4.331247e+00, -4.508137e+00, -4.789201e+00,
          -4.383820e+00, -4.423486e+00, -4.334641e+00, -4.330544e+00, -4.838604e+00,
          -4.729123e+00, -4.381797e+00, -4.207365e+00, -4.276804e+00, -4.001305e+00)
MS14 <- data.frame(temp, density, krho) ##df2
So now I would like to loop through both data sets and check whether MS14$density equals PP14$Density. If it does, I would like to multiply the krho value in that row by the delta po4 that corresponds to the same density, i.e. diff(po4) in that row or range. Something like:
# MS14$krho[i] * diff(PP14$po4)[i]
BUT when I run PP14$Density == MS14$density, of course it is always FALSE: because of the many decimal places, no two values are exactly the same. I solved that by rounding the numbers to the 3rd decimal, but there should be a way to include a tolerance in the code, e.g. density ± 0.005.
Or just round it, like:
PP14$Density_round2 <- round(PP14$Density, digits = 2)
In any case, I am not sure if I should use a nested loop to check both columns and do the operations accordingly, or if it would be better to create a new data.frame with the intersection of the two data.frames:
common <- intersect(PP14$Density, MS14$density)
and then make the calculations (?). So I would probably need a nested loop like:
# for i in PP14
#   for j in MS14
#     new <- PP14$Density[i] == MS14$density[j]
#     if new is true then PP14$krho[i] * MS14$diff(po4)[j]   # for that particular row
#     and print it into a new data.frame df3
So please, feel free to suggest the best way to proceed; there might be several ways to do it. Thank you so much in advance!
PS: suggestions using Matlab are also welcome.
Something like this?
compareDec <- function(x, y, digits = NULL, tol = .Machine$double.eps^0.5){
  if(is.null(digits)){
    abs(x - y) < tol
  } else {
    round(x, digits = digits) == round(y, digits = digits)
  }
}
icomp <- outer(MS14$density, PP14$Density, compareDec, digits = 2)
m <- outer(MS14$krho, c(0, diff(PP14$po4)))
new <- which(icomp, arr.ind = TRUE)
df3 <- cbind.data.frame(new, Prod = m[new])
head(df3)
#  row col        Prod
#1  17   1  0.00000000
#2  18   1  0.00000000
#3  19   1  0.00000000
#4  20   1  0.00000000
#5  17   2 -0.03712885
#6  18   2 -0.04161033
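The question also asks for a "density ± 0.005" match. Rounding and a tolerance are not quite the same thing: two values such as 26.3849 and 26.3851 differ by only 0.0002 yet round to different second decimals. The sketch below shows the tolerance variant on short made-up vectors (dens.ms and dens.pp are hypothetical stand-ins for MS14$density and PP14$Density, and tol = 0.005 is taken from the question).

```r
# Hypothetical short vectors standing in for MS14$density and PP14$Density.
dens.ms <- c(26.3590, 26.3864, 26.3960)
dens.pp <- c(26.3863, 26.3843, 26.3950, 26.4100)
tol <- 0.005

# TRUE where the two densities differ by no more than tol.
close.enough <- outer(dens.ms, dens.pp, function(x, y) abs(x - y) <= tol)

# Row/column index pairs of the matches, one row per matching pair.
matches <- which(close.enough, arr.ind = TRUE)
colnames(matches) <- c("ms.row", "pp.row")
print(matches)
```

The resulting index matrix can be used the same way as `new` in the answer above, for example to pick out the corresponding krho and diff(po4) products.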
How to code the permutation equivalent of Mood's Median Test in R? (get the p values using permutation)
I can do it for the two-sample t test, but not for the Median test, the Wilcoxon test, or the Hodges-Lehmann test.
data_2000 <- c(500,450,600,700,550,551,552)
data_2019 <- c(560,460,620,720,540,600,750)
mean(data_2000)
mean(data_2019)
mean(data_2019) - mean(data_2000)
combined_data <- c(data_2000, data_2019)
set.seed(123)
null_dist <- c()
for (i in 1:100000) {
  shuffled_data <- sample(combined_data)
  shuffled_2000 <- shuffled_data[1:7]
  shuffled_2019 <- shuffled_data[8:14]
  null_dist[i] <- mean(shuffled_2019) - mean(shuffled_2000)
}
(p_value <- (sum(null_dist >= 49.57143) + sum(null_dist <= -49.57143))/length(null_dist))
I think this is what you're trying to do. I altered your code as little as possible. There are packages like infer that will do this for you, and the for loop is not the most efficient, but it's plenty good enough and may help you learn. As long as we're looping, I did mean and median at the same time, since all other parts of the code are identical. ifelse is a nice easy way to make 1s and 0s to sum.
data_2000 <- c(500,450,600,700,550,551,552)
data_2019 <- c(560,460,620,720,540,600,750)
delta_mean <- mean(data_2019) - mean(data_2000)
delta_median <- median(data_2019) - median(data_2000)
combined_data <- c(data_2000, data_2019)
trials <- 100000
set.seed(123)
mean_diff <- c()
median_diff <- c()
for (i in 1:trials) {
  shuffled_data <- sample(combined_data)
  shuffled_2000 <- shuffled_data[1:7]
  shuffled_2019 <- shuffled_data[8:14]
  mean_diff[i] <- mean(shuffled_2019) - mean(shuffled_2000)
  median_diff[i] <- median(shuffled_2019) - median(shuffled_2000)
}
p_mean <- sum(ifelse(mean_diff > delta_mean | mean_diff < -1 * delta_mean, 1, 0)) / trials
p_median <- sum(ifelse(median_diff > delta_median | median_diff < -1 * delta_median, 1, 0)) / trials
p_mean
#> [1] 0.31888
p_median
#> [1] 0.24446
Following up on your question about the HL test. Quoting Wikipedia:
The Hodges–Lehmann statistic also estimates the difference between two populations. For two sets of data with m and n observations, the set of two-element sets made of them is their Cartesian product, which contains m × n pairs of points (one from each set); each such pair defines one difference of values. The Hodges–Lehmann statistic is the median of the m × n differences.
You could run it on your data with the following code. Do NOT run it 100,000 times: the answer is the same every time, because you're already making all 49 possible pairings.
hl_df <- expand.grid(data_2019, data_2000)
hl_df$pair_diffs <- hl_df$Var1 - hl_df$Var2
median(hl_df$pair_diffs)
# [1] 49
You can do the Wilcoxon test with wilcox.test in the stats package (loaded by default as part of R core). You need to set exact = FALSE because an exact p-value is not possible if there are ties.
wilcox.test(data_2019, data_2000, exact = FALSE)

    Wilcoxon rank sum test with continuity correction

data:  data_2019 and data_2000
W = 33.5, p-value = 0.2769
alternative hypothesis: true location shift is not equal to 0

I'll update this when I figure out how to do the other tests.
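The answers above permute the difference in medians, which is not quite the Mood's median statistic named in the question title. Mood's test classifies each observation as above or not above the pooled median and compares the counts between groups. The sketch below permutes that statistic. It is my own illustration, not taken from either answer, and it assumes two conventions: values equal to the pooled median count as "not above", and the statistic is the uncorrected chi-square on the 2 x 2 table.

```r
data_2000 <- c(500, 450, 600, 700, 550, 551, 552)
data_2019 <- c(560, 460, 620, 720, 540, 600, 750)
combined  <- c(data_2000, data_2019)
group     <- rep(c("y2000", "y2019"),
                 times = c(length(data_2000), length(data_2019)))

# Statistic: chi-square (no continuity correction) on the 2 x 2 table of
# group versus "above the pooled median".
mood.stat <- function(values, labels) {
  above <- values > median(values)
  tab <- table(factor(labels), factor(above, levels = c(FALSE, TRUE)))
  expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)
  sum((tab - expected)^2 / expected)
}

observed <- mood.stat(combined, group)

# Permutation null: shuffle the group labels, keep the values fixed.
set.seed(123)
trials <- 10000
perm.stats <- replicate(trials, mood.stat(combined, sample(group)))

# Small tolerance guards against floating-point noise when statistics tie.
p.mood <- mean(perm.stats >= observed - 1e-9)
p.mood
```

With only 14 observations the statistic takes very few distinct values, so the permutation p-value is coarse; an exact calculation over all label assignments would also be feasible at this sample size.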
How to reduce dimension of gene expression matrix by calculating correlation coefficients?
I am interested in finding Pearson correlation coefficients between a list of genes. Basically, I have an Affymetrix gene-level expression matrix (genes in the rows and sample IDs in the columns), and I have annotation data for the microarray experiment, with sample IDs in the rows and descriptive fields in the columns.
data
> expr_mat[1:8, 1:3]
             Tarca_001_P1A01 Tarca_003_P1A03 Tarca_004_P1A04
1_at                6.062215        6.125023        5.875502
10_at               3.796484        3.805305        3.450245
100_at              5.849338        6.191562        6.550525
1000_at             3.567779        3.452524        3.316134
10000_at            6.166815        5.678373        6.185059
100009613_at        4.443027        4.773199        4.393488
100009676_at        5.836522        6.143398        5.898364
10001_at            6.330018        5.601745        6.137984
> anodat[1:8, 1:3]
               V1   V2    V3
1        SampleID   GA Batch
2 Tarca_001_P1A01   11     1
3 Tarca_013_P1B01 15.3     1
4 Tarca_025_P1C01 21.7     1
5 Tarca_037_P1D01 26.7     1
6 Tarca_049_P1E01 31.3     1
7 Tarca_061_P1F01 32.1     1
8 Tarca_051_P1E03 19.7     1
goal: I intend to see how the genes in each sample are correlated with the GA value of the corresponding samples in the annotation data, then generate a sub expression matrix that keeps the genes highly correlated with the target observation data anodat$GA.
my attempt:
gene_corrs <- function(expr_mat, anno_mat){
  stopifnot(ncol(expr_mat)==nrow(anno_mat))
  res <- list()
  lapply(colnames(expr_mat), function(x){
    lapply(x, rownames(y){
      if(colnames(x) %in% rownames(anno_mat)){
        cor_mat <- stats::cor(y, anno_mat$GA, method = "pearson")
        ncor <- ncol(cor_mat)
        cmatt <- col(cor_mat)
        ord <- order(-cmat, cor_mat, decreasing = TRUE)- (ncor*cmatt - ncor)
        colnames(ord) <- colnames(cor_mat)
        res <- cbind(ID=c(cold(ord), ID2=c(ord)))
        res <- as.data.frame(cbind(out, cor=cor_mat[res]))
        res <- cbind(res, cor=cor_mat[out])
        res <- as.dara.frame(res)
      }
    })
  })
  return(res)
}
However, my implementation above didn't return what I expected. I need to filter the genes by finding those which have a strong correlation with anodat$GA.
Another attempt: I read a few posts about similar issues, and some people discussed using the limma package. Here is my attempt using limma, with anodat$GA as a covariate to fit the limma linear model:
library(limma)
fit <- limma::lmFit(expr_mat, design = model.matrix( ~ 0 + anodat$GA)
fit <- eBayes(fit)
topTable(fit, coef=2)
Then I am expecting to get a correlation matrix from the above code, and would like to do the following in order to get the filtered sub expression matrix:
idx <- which( (abs(cor) > 0.8) & (upper.tri(cor)), arr.ind=TRUE)
idx <- unique(c(idx[, 1],idx[, 2])
correlated.genes <- matrix[idx, ]
But I still didn't get the right answer. I am confident about using the limma approach, but I couldn't figure out what went wrong in the above code. Can anyone point out how to make this work? Is there an efficient way to make this happen?
Don't have your data, so it's hard to double check, but in the abstract I would try this:
library(matrixTests)
cors <- row_cor_pearson(expr_mat, anodat$GA)
which(cors$cor > 0.9) # to get the indices of genes with correlation > 0.9
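The same filtering can also be sketched in base R, since cor() accepts a matrix and a vector. The example below uses synthetic data rather than the poster's expr_mat: the gene names, the GA values, and the 0.8 cutoff (borrowed from the question's abs(cor) > 0.8) are all assumptions for illustration. With the real data, the GA vector would first need to be numeric and ordered to match colnames(expr_mat), which the printed anodat (with its header in the first row) does not yet guarantee.

```r
# Synthetic stand-ins: 6 genes x 10 samples, plus one GA value per sample.
set.seed(7)
ga <- seq(10, 37, length.out = 10)
expr <- matrix(rnorm(6 * 10), nrow = 6,
               dimnames = list(paste0("gene", 1:6), paste0("sample", 1:10)))
expr["gene2", ] <- 0.2 * ga + rnorm(10, sd = 0.3)    # made to rise with GA
expr["gene5", ] <- -0.2 * ga + rnorm(10, sd = 0.3)   # made to fall with GA

# cor() works column-wise, so transpose to put genes in columns.
gene.cor <- cor(t(expr), ga)[, 1]

# Keep genes whose absolute correlation with GA exceeds the assumed cutoff.
keep <- abs(gene.cor) > 0.8
sub.expr <- expr[keep, , drop = FALSE]

print(round(gene.cor, 3))
print(rownames(sub.expr))
```

Using the absolute value keeps strongly negatively correlated genes as well, which a one-sided filter such as cor > 0.9 would drop.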
trying to perform a t.test for each row and count all rows where p-value is less than 0.05
I've been wrecking my head for the past four hours trying to find the solution to an R problem, which is driving me nuts. I've been searching everywhere for a decent answer, but so far I've been hitting wall after wall. I am now appealing to the good will of this fine community for help. Consider the following dataset:
set.seed(2112)
DataSample <- matrix(rnorm(24000),nrow=1000)
colnames(DataSample) <- c(paste("Trial",1:12,sep=""),paste("Control",13:24,sep=""))
I need to perform a t-test for every row in DataSample in order to find out if groups TRIAL and CONTROL differ (equal variance applies). Then I need to count the number of rows with a p-value equal to, or lower than, 0.05. So here is the code I tried, which I know is wrong:
set.seed(2112)
DataSample <- matrix(rnorm(24000),nrow=1000)
colnames(DataSample) <- c(paste("Trial",1:12,sep=""),paste("Control",13:24,sep=""))
pValResults <- apply(
  DataSample[,1:12],1,function(x) t.test(x,DataSample[,13:24], var.equal=T)$p.value
)
sum(pValResults < 0.05) # Returns the wrong answer (so I was told)
I did try looking at many similar questions around stackoverflow, but I would often end up with syntax errors or a dimensional mismatch. The code above is the best I could get without R returning an error, but since the code is returning the wrong answer I have nothing to feel proud of. Any advice will be greatly appreciated! Thanks in advance for your time.
One option is to loop over the data set calculating the t test for each row, but it is not as elegant.
set.seed(2112)
DataSample <- matrix(rnorm(24000),nrow=1000)
colnames(DataSample) <- c(paste("Trial",1:12,sep=""),paste("Control",13:24,sep=""))
# initialize vector of stored p-values
pvalue <- rep(0,nrow(DataSample))
for (i in seq_len(nrow(DataSample))){
  # var.equal = TRUE because the question states that equal variance applies
  pvalue[i] <- t.test(DataSample[i,1:12],DataSample[i,13:24], var.equal = TRUE)$p.value
}
# finding number that are significant
sum(pvalue < 0.05)
I converted to a data.table, and the answer I got was 45:
library(data.table)
DataSample.dt <- as.data.table(DataSample)
sum(sapply(seq_len(nrow(DataSample.dt)), function(x)
  t.test(DataSample.dt[x, paste0('Trial', 1:12), with=F],
         DataSample.dt[x, paste0('Control', 13:24), with=F],
         var.equal=T)$p.value) < 0.05)
To do a paired t test, you need to supply the paired = TRUE parameter. The t.test function isn't vectorised, but it's quite simple to do t tests on a whole matrix at a time. Here are three methods (including using apply):
library("genefilter")
library("matrixStats")
library("microbenchmark")
dd <- DataSample[, 1:12] - DataSample[, 13:24]
microbenchmark::microbenchmark(
  manual = {ps1 <- 2 * pt(-abs(rowMeans(dd) / sqrt(rowVars(dd) / ncol(dd))), ncol(dd) - 1)},
  apply = {ps2 <- apply(DataSample, 1, function(x) t.test(x[1:12], x[13:24], paired=TRUE)$p.value)},
  rowttests = {ps3 <- rowttests(dd)[, "p.value"]})
#Unit: milliseconds
#      expr        min         lq       mean     median         uq        max neval
#    manual   1.611808   1.641783   1.677010   1.663122   1.709401   1.852347   100
#     apply 390.869635 398.720930 404.391487 401.508382 405.715668 634.932675   100
# rowttests   2.368823   2.417837   2.639671   2.574320   2.757870   7.207135   100
You can see the manual method is over 200x faster than apply. If you actually meant an unpaired test, here's the equivalent comparison:
microbenchmark::microbenchmark(
  manual = {x <- DataSample[, 1:12]; y <- DataSample[, 13:24];
    ps1 <- 2 * pt(-abs((rowMeans(x) - rowMeans(y)) / sqrt((rowVars(x) + rowVars(y)) / ncol(x))), ncol(DataSample) - 2)},
  apply = { ps2 <- apply(DataSample, 1, function(x) t.test(x[1:12], x[13:24], var.equal = TRUE)$p.value)},
  rowttests = {ps3 <- rowttests(DataSample, factor(rep(1:2, each = 12)))[, "p.value"]})
Note that the manual method assumes that the two groups are the same size.
Adding an alternative using an external library.
Performing the test:
library(matrixTests)
res <- row_t_equalvar(DataSample[,1:12], DataSample[,13:24])
Format of the result:
res
  obs.x obs.y obs.tot      mean.x       mean.y    mean.diff     var.x     var.y var.pooled    stderr df
1    12    12      24  0.30569721  0.160622830  0.145074376 0.5034806 1.0769678  0.7902242 0.3629105 22
2    12    12      24 -0.27463354 -0.206396781 -0.068236762 0.8133311 0.2807800  0.5470556 0.3019535 22
3    12    12      24 -0.19805092 -0.023207888 -0.174843032 0.4278359 0.5604078  0.4941219 0.2869733 22
     statistic     pvalue   conf.low  conf.high alternative mean.null conf.level
1  0.399752487 0.69319351 -0.6075559 0.89770469   two.sided         0       0.95
2 -0.225984324 0.82329990 -0.6944500 0.55797651   two.sided         0       0.95
3 -0.609265949 0.54858909 -0.7699891 0.42030307   two.sided         0       0.95
Number of rows with p <= 0.05:
> sum(res$pvalue <= 0.05)
[1] 4
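For readers who want to see what the vectorised approaches compute, the pooled-variance t test can be written out in base R and checked against t.test() on a handful of rows. The sketch below is an illustration under my own naming (row.var, pooled, p.manual are not from the answers above); unlike the manual benchmark formula earlier, it uses the general pooled-variance expression, so it should also hold when the two groups differ in size.

```r
set.seed(2112)
m <- matrix(rnorm(5 * 24), nrow = 5)   # 5 rows only, to keep the check quick
x <- m[, 1:12]
y <- m[, 13:24]
nx <- ncol(x)
ny <- ncol(y)

# Pooled-variance (equal variance) two-sample t statistic, one per row.
row.var <- function(mat) apply(mat, 1, var)
pooled <- ((nx - 1) * row.var(x) + (ny - 1) * row.var(y)) / (nx + ny - 2)
t.stat <- (rowMeans(x) - rowMeans(y)) / sqrt(pooled * (1 / nx + 1 / ny))
p.manual <- 2 * pt(-abs(t.stat), df = nx + ny - 2)

# Reference: t.test() applied to each row separately.
p.ttest <- apply(m, 1, function(r)
  t.test(r[1:12], r[13:24], var.equal = TRUE)$p.value)

all.equal(p.manual, p.ttest)
```

The key point for the original question is that each row's Trial values are compared with the same row's Control values, not with the whole Control block.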
Running 'prop.test' multiple times in R
I have some data showing a long list of regions, the population of each region, and the number of people in each region with a certain disease. I'm trying to show the confidence intervals for each proportion (but I'm not testing whether the proportions are statistically different). One approach is to manually calculate the standard errors and confidence intervals, but I'd like to use a built-in tool like prop.test, because it has some useful options. However, when I use prop.test with vectors, it runs a chi-square test across all the proportions. I've solved this with a while loop (see dummy data below), but I sense there must be a better and simpler way to approach this problem. Would apply work here, and how? Thanks!
dat <- data.frame(1:5, c(10, 50, 20, 30, 35))
names(dat) <- c("X", "N")
dat$Prop <- dat$X / dat$N
ConfLower = 0
x = 1
while (x < 6) {
  a <- prop.test(dat$X[x], dat$N[x])$conf.int[1]
  ConfLower <- c(ConfLower, a)
  x <- x + 1
}
ConfUpper = 0
x = 1
while (x < 6) {
  a <- prop.test(dat$X[x], dat$N[x])$conf.int[2]
  ConfUpper <- c(ConfUpper, a)
  x <- x + 1
}
dat$ConfLower <- ConfLower[2:6]
dat$ConfUpper <- ConfUpper[2:6]
Here's an attempt using Map, essentially stolen from a previous answer here: https://stackoverflow.com/a/15059327/496803
res <- Map(prop.test,dat$X,dat$N)
dat[c("lower","upper")] <- t(sapply(res,"[[","conf.int"))
#  X  N      Prop       lower     upper
#1 1 10 0.1000000 0.005242302 0.4588460
#2 2 50 0.0400000 0.006958623 0.1485882
#3 3 20 0.1500000 0.039566272 0.3886251
#4 4 30 0.1333333 0.043597084 0.3164238
#5 5 35 0.1428571 0.053814457 0.3104216
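On the "would apply work here" part of the question: a member of the apply family that walks over two vectors in parallel is mapply. The sketch below is a variation on the Map answer, not a different method; it reuses the question's dummy data and its ConfLower/ConfUpper column names, and extracts the interval inside the function so that the result simplifies straight to a 2-row matrix.

```r
dat <- data.frame(X = 1:5, N = c(10, 50, 20, 30, 35))
dat$Prop <- dat$X / dat$N

# One prop.test() per row; each call returns a length-2 confidence interval,
# so mapply() simplifies the results to a 2 x nrow(dat) matrix.
ci <- mapply(function(x, n) prop.test(x, n)$conf.int, dat$X, dat$N)

dat$ConfLower <- ci[1, ]
dat$ConfUpper <- ci[2, ]
print(dat)
```

With counts this small, prop.test() may warn that the chi-squared approximation is unreliable; extra arguments such as conf.level or correct can be passed inside the anonymous function if different options are wanted.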