Winsorize a dataframe in R

I want to perform winsorization in a dataframe like this:
event_date beta_before beta_after
2000-05-05 1.2911707054 1.3215648954
1999-03-30 0.5089734305 0.4269575657
2000-05-05 0.5414700258 0.5326762272
2000-02-09 1.5491034852 1.2839988507
1999-03-30 1.9380674599 1.6169735009
1999-03-30 1.3109909155 1.4468207148
2000-05-05 1.2576420753 1.3659492507
1999-03-30 1.4393018341 0.7417777965
2000-05-05 0.2624037804 0.3860641307
2000-05-05 0.5532216441 0.2618245169
2000-02-08 2.6642931822 2.3815576738
2000-02-09 2.3007578964 2.2626960407
2001-08-14 3.2681270302 2.1611010935
2000-02-08 2.2509121123 2.9481325199
2000-09-20 0.6624503316 0.947935581
2006-09-26 0.6431111805 0.8745333151
By winsorization I mean to find the max and min for beta_before, for example. That value should be replaced by the second highest or second lowest value in the same column, without losing the rest of the details in the observation. For example, in this case the max value in beta_before is 3.2681270302 and should be replaced by 3.2681270302. The same process will be followed for the min and then for the beta_after variable. Therefore, only 2 values per column will be changed, the highest and the lowest; the rest will remain the same.
Any advice? I tried different approaches in plyr, but I ended up replacing the whole observation, which I don't want to do. I would like to create 2 new variables, for example beta_before_winsorized and beta_after_winsorized.

I thought winsorizing usually finds the value x% (typically 10%, 15%, or 20%) from the bottom of the ordered list and replaces all the values below it with that value, and likewise at the top. Here you're just replacing the single top and bottom values, whereas winsorizing usually involves specifying a percentage of values at each end to replace.
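For reference, a minimal sketch of that percentile-style winsorization, assuming you simply want to clamp at chosen quantiles (the function name and cut-offs here are just illustrative):
winsorize_pct <- function(x, probs = c(0.05, 0.95)) {
  # clamp everything below the lower quantile / above the upper quantile
  qs <- quantile(x, probs = probs, na.rm = TRUE)
  pmin(pmax(x, qs[1]), qs[2])
}
# e.g. dat$beta_before_winsorized <- winsorize_pct(dat$beta_before, c(0.1, 0.9))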

Here is a function that does the winsorization you describe:
winsorize <- function(x) {
  Min <- which.min(x)
  Max <- which.max(x)
  ord <- order(x)
  x[Min] <- x[ord][2]
  x[Max] <- x[ord][length(x) - 1]
  x
}
If your data are in a data frame dat, then we can winsorize the data using your procedure via:
dat2 <- dat
dat2[, -1] <- sapply(dat[,-1], winsorize)
which results in:
R> dat2
event_date beta_before beta_after
1 2000-05-05 1.2911707 1.3215649
2 1999-03-30 0.5089734 0.4269576
3 2000-05-05 0.5414700 0.5326762
4 2000-02-09 1.5491035 1.2839989
5 1999-03-30 1.9380675 1.6169735
6 1999-03-30 1.3109909 1.4468207
7 2000-05-05 1.2576421 1.3659493
8 1999-03-30 1.4393018 0.7417778
9 2000-05-05 0.5089734 0.3860641
10 2000-05-05 0.5532216 0.3860641
11 2000-02-08 2.6642932 2.3815577
12 2000-02-09 2.3007579 2.2626960
13 2001-08-14 2.6642932 2.1611011
14 2000-02-08 2.2509121 2.3815577
15 2000-09-20 0.6624503 0.9479356
16 2006-09-26 0.6431112 0.8745333
I'm not sure where you got the value you suggest should replace the max in beta_before: the second highest in the snippet of data provided is 2.6642932, and that is what my function has used to replace the maximum value.
Note the function will only work if there is a single minimum and a single maximum value in each column, owing to the way which.min() and which.max() are documented to work. If you have multiple entries taking the same max or min value, then we need something different:
winsorize2 <- function(x) {
  Min <- which(x == min(x))
  Max <- which(x == max(x))
  ord <- order(x)
  x[Min] <- x[ord][length(Min) + 1]
  x[Max] <- x[ord][length(x) - length(Max)]
  x
}
should do it (the latter is not tested).
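Since it is untested in the answer above, here is a quick sanity check of winsorize2() on a small vector with tied extremes (purely illustrative):
winsorize2(c(1, 1, 2, 3, 9, 9))
# both 1s are pulled up to 2 and both 9s are pulled down to 3:
# [1] 2 2 2 3 3 3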

Strictly speaking, "winsorization" is the act of replacing the most extreme data points with an acceptable percentile (as mentioned in some of the other answers). One fairly standard R function to do this is winsor from the psych package. Try:
dat$beta_before <- psych::winsor(dat$beta_before, trim = 0.0625)
dat$beta_after  <- psych::winsor(dat$beta_after,  trim = 0.0625)
I chose trim = 0.0625 (the 6.25th and 93.75th percentiles) because you only have 16 data points and you want to "rein in" just the top and bottom ones: 1/16 = 0.0625.
Note that this might make the extreme data equal to a percentile number which may or may not exist in your data set: the theoretical n-th percentile of the data.
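A tiny illustration of that last point, on generic numbers rather than your data: the 6.25th percentile of 16 equally spaced points is an interpolated value that never occurs in the input.
quantile(1:16, probs = c(0.0625, 0.9375))
#  6.25%  93.75%
# 1.9375 15.0625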

The statar package works very well for this. Copying the relevant snippet from the readme file:
library(statar)
# winsorize (default based on 5 x interquartile range)
v <- c(1:4, 99)
winsorize(v)
winsorize(v, replace = NA)
winsorize(v, probs = c(0.01, 0.99))
winsorize(v, cutpoints = c(1, 50))
https://github.com/matthieugomez/statar

Follow-up from my previous point about actually replacing the to-be-trimmed values with the value at the trim position:
winsorized.sample <- function(x, trim = 0, na.rm = FALSE, ...)
{
  if (!is.numeric(x) && !is.complex(x) && !is.logical(x)) {
    warning("argument is not numeric or logical: returning NA")
    return(NA_real_)
  }
  if (na.rm)
    x <- x[!is.na(x)]
  if (!is.numeric(trim) || length(trim) != 1L)
    stop("'trim' must be numeric of length one")
  n <- length(x)
  if (trim > 0 && n) {
    if (is.complex(x))
      stop("trimmed sample is not defined for complex data")
    if (any(is.na(x)))
      return(NA_real_)
    if (trim >= 0.5) {
      warning("trim >= 0.5 is odd... trying it anyway")
    }
    lo <- floor(n * trim) + 1
    hi <- n + 1 - lo
    # this line would work for just trimming
    # x <- sort.int(x, partial = unique(c(lo, hi)))[lo:hi]
    # instead, we're going to replace what would be trimmed
    # with the value at the trim position using the next few lines
    idx <- seq(1, n)
    myframe <- data.frame(idx, x)
    myframe <- myframe[order(myframe$x, myframe$idx), ]
    myframe$x[1:lo] <- myframe$x[lo]  # lo-th smallest value (the sorted value, not x[lo] of the unsorted vector)
    myframe$x[hi:n] <- myframe$x[hi]  # hi-th smallest value
    myframe <- myframe[order(myframe$idx), ]  # restore the original order
    x <- myframe$x
  }
  x
}
# test it
mydist <- c(1, 20, 1, 5, 2, 40, 5, 2, 6, 1, 5)
mydist2 <- winsorized.sample(mydist, trim = .2)
mydist
mydist2
descStat(mydist)   # descStat() is not in base R; summary() gives similar descriptive statistics
descStat(mydist2)

Related

How to loop and use if else on this example with logical expressions using R

I have two lengthy data sets with several columns and different lengths; for this example let's subset to a few rows and just 3 columns:
Temp <- c(12.9423 ,12.9446 ,12.9412 ,12.9617 ,12.9742 ,12.9652 ,12.9463, 12.9847 ,12.9778,
12.9589, 12.9305, 12.9275 ,12.8569 ,12.8531 ,12.9092, 12.9471, 12.9298, 12.9266,
12.9374 ,12.9385, 12.9505, 12.9510, 12.9632 ,12.9621 ,12.9571, 12.9492 ,12.8988,
12.8895 ,12.8777, 12.8956, 12.8748 ,12.7850 ,12.7323, 12.7546 ,12.7375 ,12.7020,
12.7172, 12.7015, 12.6960, 12.6944, 12.6963, 12.6928, 12.6930 ,12.6883 ,12.6913)
Density <- c(26.38635 ,26.38531 ,26.38429, 26.38336, 26.38268 ,26.38242, 26.38265, 26.38343,
26.38486, 26.38697 ,26.38945, 26.39188, 26.39365, 26.39424 ,26.39376 ,26.39250,
26.39084 ,26.38912 ,26.38744 ,26.38587, 26.38456 ,26.38367, 26.38341 ,26.38398,
26.38547 ,26.38793 ,26.39120 ,26.39509, 26.39955 ,26.40455, 26.41002, 26.41578,
26.42126, 26.42593 ,26.42968, 26.43255 ,26.43463, 26.43603 ,26.43693 ,26.43750,
26.43787, 26.43815, 26.43841 ,26.43871 ,26.43904)
po4 <- c(0.4239840 ,0.4351156, 0.4456128, 0.4542392, 0.4608510, 0.4656445, 0.4690847,
0.4717291, 0.4742391 ,0.4774904 ,0.4831152, 0.4922122, 0.5029904, 0.5128720,
0.5190209, 0.5191368 ,0.5133212, 0.5027542 ,0.4905301 ,0.4796467 ,0.4708035,
0.4638879, 0.4578364 ,0.4519745, 0.4481336, 0.4483697, 0.4531310, 0.4622930,
0.4750474 ,0.4905152 ,0.5082183 ,0.5278212 ,0.5491580 ,0.5720519, 0.5961127,
0.6207716 ,0.6449603, 0.6675704 ,0.6878331 ,0.7051851,0.7195461, 0.7305200,
0.7359634 ,0.7343541, 0.7283988)
PP14 <- data.frame(Temp,Density,po4) ##df1
temp <- c(13.13875, 13.13477 ,13.12337 ,13.10662 ,13.09798 ,13.09542 ,13.08734 ,13.07616,
13.06671 ,13.05899, 13.05890 ,13.05293 ,13.03322, 13.01515, 13.02552 ,13.01668,
12.99829, 12.97075 ,12.95572 ,12.95045 ,12.94541 ,12.94365 ,12.94609 ,12.94256,
12.93565 ,12.93258 ,12.93489 ,12.93209 ,12.92219 ,12.90730 ,12.90416 ,12.89974,
12.89749 ,12.89626 ,12.89395, 12.89315 ,12.89274, 12.89276 ,12.89293 ,12.89302)
density <- c( 26.35897, 26.36274 ,26.36173 ,26.36401 ,26.36507 ,26.36662 ,26.36838,
26.36996,
26.37286 ,26.37452 ,26.37402, 26.37571 ,26.37776, 26.38008 ,26.37959 ,26.38178,
26.38642 ,26.39158 ,26.39350, 26.39467, 26.39601, 26.39601, 26.39596 ,26.39517,
26.39728 ,26.39766, 26.39774, 26.39699 ,26.40081 ,26.40328 ,26.40416, 26.40486,
26.40513 ,26.40474 ,26.40552 ,26.40584, 26.40613, 26.40602 ,26.40595 ,26.40498)
krho <- c( -9.999999e+06, -1.786843e+00, -9.142976e-01, -9.650734e-01, -2.532397e+00,
-3.760537e+00, -2.622484e+00, -1.776506e+00, -2.028391e+00, -2.225910e+00,
-3.486826e+00, -2.062341e-01, -3.010643e+00, -3.878437e+00, -3.796426e+00,
-3.227138e+00, -3.335446e+00, -3.738037e+00, -4.577778e+00, -3.818099e+00,
-3.891467e+00, -4.585045e+00 ,-3.150283e+00 ,-4.371089e+00 ,-3.902601e+00,
-4.546019e+00, -3.932538e+00, -4.331247e+00, -4.508137e+00, -4.789201e+00,
-4.383820e+00, -4.423486e+00, -4.334641e+00, -4.330544e+00, -4.838604e+00,
-4.729123e+00, -4.381797e+00, -4.207365e+00, -4.276804e+00, -4.001305e+00)
MS14 <- data.frame(temp,density,krho) ##df2
So now I would like to loop through both data sets and check whether MS14$density == PP14$Density; if it is true, I would like to take the krho value in that row and multiply it by the delta po4 that corresponds to the same density, i.e. diff(po4) in that row or range. Something like
#MS14$krho[i] * diff(PP14$po4)[i]
BUT when I run
PP14$Density == MS14$density
of course it is always FALSE, because with such long decimals none of them are exactly the same. I solved that by rounding the numbers to the 3rd decimal, but there should be a way to build a tolerance into the code, so density ± 0.005 for example. Or just rounding to the 3rd decimal like:
PP14$Density_round2 <- round(PP14$Density, digits = 2)
In any case I am not sure if I should use a nested loop to check both columns and make the operations accordingly or if it would be better to create a new data.frame with the intersect of each data.frame:
common <- intersect(PP14$Density, MS14$density)
and then make calculations....(??)
So I would probably need a nested loop like:
{for i:PP14
for j:MS14
new-> PP14$Density[i] == MS14$density[j]
#if new is true then PP14$krho[i]* MS14$diff(po4)[j]#[for that particular row]
#and print it into a new data.frame df3
#}
So please, feel free to suggest the best way to proceed.. there might be several ways to do it..
Thank you so much in advance!!
Ps: suggestions using Matlab are also welcome
Something like this?
compareDec <- function(x, y, digits = NULL, tol = .Machine$double.eps^0.5) {
  if (is.null(digits)) {
    abs(x - y) < tol
  } else {
    round(x, digits = digits) == round(y, digits = digits)
  }
}
icomp <- outer(MS14$density, PP14$Density, compareDec, digits = 2)
m <- outer(MS14$krho, c(0, diff(PP14$po4)))
new <- which(icomp, arr.ind = TRUE)
df3 <- cbind.data.frame(new, Prod = m[new])
head(df3)
# row col Prod
#1 17 1 0.00000000
#2 18 1 0.00000000
#3 19 1 0.00000000
#4 20 1 0.00000000
#5 17 2 -0.03712885
#6 18 2 -0.04161033
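As a possible follow-up (not part of the answer above), the row and col indices in df3 index into MS14 and PP14 respectively, so you can attach the matched densities back on for easier inspection:
df3$density_MS14 <- MS14$density[df3$row]
df3$Density_PP14 <- PP14$Density[df3$col]
head(df3)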

Variable length formula construction

I am trying to apply Simpson's Diversity Index across a number of different datasets with a variable number of species ('nuse') captured. As such, I am trying to construct code which can cope with this automatically, without needing to construct a formula manually each time. An example dataset for a manual formula is below:
diverse <- data.frame(nuse1=c(0,20,40,20), nuse2=c(5,5,3,20), nuse3=c(0,2,8,20), nuse4=c(5,8,2,20), total=c(10,35,53,80))
simp <- function(x) {
  total <- x[, "total"]
  nuse1 <- x[, "nuse1"]
  nuse2 <- x[, "nuse2"]
  nuse3 <- x[, "nuse3"]
  nuse4 <- x[, "nuse4"]
  div <- round(1 - ((nuse1 * (nuse1 - 1)) + (nuse2 * (nuse2 - 1)) + (nuse3 * (nuse3 - 1)) + (nuse4 * (nuse4 - 1))) / (total * (total - 1)), digits = 4)
  return(div)
}
diverse$Simpson <- simp(diverse)
diverse
As you can see this works fine. However, how would I be able to create a function which could automatically adjust to, for example, 9 species (so up to nuse9)?
I have experimented with the paste function + as.formula as indicated here: Formula with dynamic number of variables; however, it is the expanded form of (nuse1 * (nuse1 - 1)) that I'm struggling with. Does anyone have any suggestions please? Thanks.
How about something like:
diverse <- data.frame(nuse1=c(0,20,40,20), nuse2=c(5,5,3,20), nuse3=c(0,2,8,20), nuse4=c(5,8,2,20), total=c(10,35,53,80))
simp <- function(x, species) {
  spcs <- grep(species, colnames(x))  # which column names contain "nuse"
  total <- rowSums(x[, spcs])         # sum by row
  div <- round(1 - rowSums(apply(x[, spcs], 2, function(s) s * (s - 1))) / (total * (total - 1)), digits = 4)
  return(div)
}
diverse$Simpson2 <- simp(diverse, species = "nuse")
diverse
# nuse1 nuse2 nuse3 nuse4 total Simpson2
# 1 0 5 0 5 10 0.5556
# 2 20 5 2 8 35 0.6151
# 3 40 3 8 2 53 0.4107
# 4 20 20 20 20 80 0.7595
All it does is find which column names contain "nuse" (or whatever species prefix you have in your dataset). It constructs the "total" value within the function and does not require a total column in the dataset.
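For completeness, the paste()/parse() route mentioned in the question can also be made to work, although the rowSums() approach above is simpler. A rough sketch using the column names from diverse (Simpson_paste is just an illustrative name):
nuse_cols <- paste0("nuse", 1:4)
terms <- paste0("(", nuse_cols, " * (", nuse_cols, " - 1))", collapse = " + ")
expr <- paste0("round(1 - (", terms, ") / (total * (total - 1)), digits = 4)")
diverse$Simpson_paste <- eval(parse(text = expr), envir = diverse)
# Simpson_paste should match the Simpson2 column produced by simp() above.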

cosine similarity(patient similarity metric) between 48k patients data with predictive variables

I have to calculate cosine similarity (a patient similarity metric) in R between 48k patients' data with some predictive variables. Here is the equation: PSM(P1, P2) = P1 · P2 / (||P1|| ||P2||),
where P1 and P2 are the predictor vectors corresponding to two different patients; for example, P1 is the index patient, P2 is compared against it, and the pairwise patient similarity metric PSM(P1, P2) is calculated.
This process will go on for all 48k patients.
I have added a sample data-set for 300 patients in a .csv file. Please find the sample data-set here: https://1drv.ms/u/s!AhoddsPPvdj3hVTSbosv2KcPIx5a
First things first: You can find more rigorous treatments of cosine similarity at either of these posts:
Find cosine similarity between two arrays
Creating co-occurrence matrix
Now, you clearly have a mixture of data types in your input, at least
decimal
integer
categorical
I suspect that some of the integer values are Booleans or additional categoricals. Generally, it will be up to you to transform these into continuous numerical vectors if you want to use them as input into the similarity calculation. For example, what's the distance between admission types ELECTIVE and EMERGENCY? Is it a nominal or ordinal variable? I will only be modelling the columns that I trust to be numerical dependent variables.
Also, what have you done to ensure that some of your columns don't correlate with others? Using just a little awareness of data science and biomedical terminology, it seems likely that the following are all correlated:
diasbp_max, diasbp_min, meanbp_max, meanbp_min, sysbp_max and sysbp_min
I suggest going to a print shop and ordering a poster-size printout of psm_pairs.pdf. :-) Your eyes are better at detecting meaningful (but non-linear) dependencies between variables. Including multiple measurements of the same fundamental phenomenon may over-weight that phenomenon in your similarity calculation. Don't forget that you can derive variables like
diasbp_range <- diasbp_max - diasbp_min
Now, I'm not especially good at linear algebra, so I'm importing a cosine similarity function from the lsa text analysis package. I'd love to see you write out the formula in your question as an R function. I would write it to compare one row to another, and use two nested apply loops to get all comparisons. Hopefully we'll get the same results!
After calculating the similarity, I try to find two different patients with the most dissimilar encounters.
Since you're working with a relatively large number of rows, you'll want to compare various algorithmic approaches for efficiency. In addition, you could use SparkR or some other Hadoop solution on a cluster, or the parallel package on a single computer with multiple cores and lots of RAM. I have no idea whether the solution I provided is thread-safe.
Come to think of it, the transposition alone (as I implemented it) is likely to be computationally costly for a set of 1 million patient-encounters. Overall, the work grows roughly quadratically with the number of rows, since every pair must be compared, and a 48,000 × 48,000 matrix of doubles alone is on the order of 18 GB; see the base-R sketch after the output below.
library(lsa)
library(reshape2)
psm_sample <- read.csv("psm_sample.csv")
row.names(psm_sample) <-
make.names(paste0("patid.", as.character(psm_sample$subject_id)), unique = TRUE)
temp <- sapply(psm_sample, class)
temp <- cbind.data.frame(names(temp), as.character(temp))
names(temp) <- c("variable", "possible.type")
numeric.cols <- (temp$possible.type %in% c("factor", "integer") &
                   !grepl(pattern = "_id$",   x = temp$variable) &
                   !grepl(pattern = "_code$", x = temp$variable) &
                   !grepl(pattern = "_type$", x = temp$variable)) |
  temp$possible.type == "numeric"
psm_numerics <- psm_sample[, numeric.cols]
row.names(psm_numerics) <- row.names(psm_sample)
psm_numerics$gender <- as.integer(psm_numerics$gender)
psm_scaled <- scale(psm_numerics)
pair.these.up <- psm_scaled
# checking for independence of variables
# if the following PDF pair plot is too big for your computer to open,
# try pair-plotting some random subset of columns
# keep.frac <- 0.5
# keep.flag <- runif(ncol(psm_scaled)) < keep.frac
# pair.these.up <- psm_scaled[, keep.flag]
# pdf device sizes are in inches
dev <-
pdf(
file = "psm_pairs.pdf",
width = 50,
height = 50,
paper = "special"
)
pairs(pair.these.up)
dev.off()
#transpose the dataframe to get the
#similarity between patients
cs <- lsa::cosine(t(psm_scaled))
# this is super inefficient, because cs contains
# two identical triangular matrices
cs.melt <- melt(cs)
cs.melt <- as.data.frame(cs.melt)
names(cs.melt) <- c("enc.A", "enc.B", "similarity")
extract.pat <- function(enc.col) {
my.patients <-
sapply(enc.col, function(one.pat) {
temp <- (strsplit(as.character(one.pat), ".", fixed = TRUE))
return(temp[[1]][[2]])
})
return(my.patients)
}
cs.melt$pat.A <- extract.pat(cs.melt$enc.A)
cs.melt$pat.B <- extract.pat(cs.melt$enc.B)
same.pat <- cs.melt[cs.melt$pat.A == cs.melt$pat.B ,]
different.pat <- cs.melt[cs.melt$pat.A != cs.melt$pat.B ,]
most.dissimilar <-
different.pat[which.min(different.pat$similarity),]
dissimilar.pat.frame <- rbind(psm_numerics[rownames(psm_numerics) ==
as.character(most.dissimilar$enc.A) ,],
psm_numerics[rownames(psm_numerics) ==
as.character(most.dissimilar$enc.B) ,])
print(t(dissimilar.pat.frame))
which gives
patid.68.49 patid.9
gender 1.00000 2.00000
age 41.85000 41.79000
sysbp_min 72.00000 106.00000
sysbp_max 95.00000 217.00000
diasbp_min 42.00000 53.00000
diasbp_max 61.00000 107.00000
meanbp_min 52.00000 67.00000
meanbp_max 72.00000 132.00000
resprate_min 20.00000 14.00000
resprate_max 35.00000 19.00000
tempc_min 36.00000 35.50000
tempc_max 37.55555 37.88889
spo2_min 90.00000 95.00000
spo2_max 100.00000 100.00000
bicarbonate_min 22.00000 26.00000
bicarbonate_max 22.00000 30.00000
creatinine_min 2.50000 1.20000
creatinine_max 2.50000 1.40000
glucose_min 82.00000 129.00000
glucose_max 82.00000 178.00000
hematocrit_min 28.10000 37.40000
hematocrit_max 28.10000 45.20000
potassium_min 5.50000 2.80000
potassium_max 5.50000 3.00000
sodium_min 138.00000 136.00000
sodium_max 138.00000 140.00000
bun_min 28.00000 16.00000
bun_max 28.00000 17.00000
wbc_min 2.50000 7.50000
wbc_max 2.50000 13.70000
mingcs 15.00000 15.00000
gcsmotor 6.00000 5.00000
gcsverbal 5.00000 0.00000
gcseyes 4.00000 1.00000
endotrachflag 0.00000 1.00000
urineoutput 1674.00000 887.00000
vasopressor 0.00000 0.00000
vent 0.00000 1.00000
los_hospital 19.09310 4.88130
los_icu 3.53680 5.32310
sofa 3.00000 5.00000
saps 17.00000 18.00000
posthospmort30day 1.00000 0.00000
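On the performance point above, here is a hedged base-R sketch of the same row-by-row cosine computation without lsa (row_cosine is just an illustrative name); it may be easier to profile or chunk on larger inputs:
row_cosine <- function(m) {
  m_unit <- m / sqrt(rowSums(m^2))  # scale each row of the numeric matrix to unit length
  tcrossprod(m_unit)                # entry [i, j] is the cosine similarity of rows i and j
}
# all.equal(row_cosine(psm_scaled), lsa::cosine(t(psm_scaled)))  # should agree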
Usually I wouldn't add a second answer, but that might be the best solution here. Don't worry about voting on it.
Here's the same algorithm as in my first answer, applied to the iris data set. Each row contains four spatial measurements of the flowers from three different varieties of iris plants.
Below that you will find the iris analysis, written out as nested loops so you can see the equivalence. But that's not recommended for production with large data sets.
Please familiarize yourself with starting data and all of the intermediate dataframes:
The input iris data
psm_scaled (the spatial measurements, scaled to mean=0, SD=1)
cs (the matrix of pairwise similarities)
cs.melt (the pairwise similarities in long format)
At the end I have aggregated the mean similarities for all comparisons between one variety and another. You will see that comparisons between individuals of the same variety have mean similarities approaching 1, and comparisons between individuals of different varieties have mean similarities approaching negative 1.
library(lsa)
library(reshape2)
temp <- iris[, 1:4]
iris.names <- paste0(iris$Species, '.', rownames(iris))
psm_scaled <- scale(temp)
rownames(psm_scaled) <- iris.names
cs <- lsa::cosine(t(psm_scaled))
# this is super inefficient, because cs contains
# two identical triangular matrices
cs.melt <- melt(cs)
cs.melt <- as.data.frame(cs.melt)
names(cs.melt) <- c("enc.A", "enc.B", "similarity")
names(cs.melt) <- c("flower.A", "flower.B", "similarity")
class.A <-
strsplit(as.character(cs.melt$flower.A), '.', fixed = TRUE)
cs.melt$class.A <- sapply(class.A, function(one.split) {
return(one.split[1])
})
class.B <-
strsplit(as.character(cs.melt$flower.B), '.', fixed = TRUE)
cs.melt$class.B <- sapply(class.B, function(one.split) {
return(one.split[1])
})
cs.melt$comparison <-
paste0(cs.melt$class.A , '_vs_', cs.melt$class.B)
cs.agg <-
aggregate(cs.melt$similarity, by = list(cs.melt$comparison), mean)
print(cs.agg[order(cs.agg$x),])
which gives
# Group.1 x
# 3 setosa_vs_virginica -0.7945321
# 7 virginica_vs_setosa -0.7945321
# 2 setosa_vs_versicolor -0.4868352
# 4 versicolor_vs_setosa -0.4868352
# 6 versicolor_vs_virginica 0.3774612
# 8 virginica_vs_versicolor 0.3774612
# 5 versicolor_vs_versicolor 0.4134413
# 9 virginica_vs_virginica 0.7622797
# 1 setosa_vs_setosa 0.8698189
If you’re still not comfortable with performing lsa::cosine() on a scaled, numerical dataframe, we can certainly do explicit pairwise calculations.
The formula you gave for PSM, or cosine similarity of patients, is expressed in two formats at Wikipedia
Remembering that vectors A and B represent the ordered lists of attributes for PatientA and PatientB, the PSM is the dot product of A and B, divided by the product of [the magnitude of A] and [the magnitude of B].
The terse way of saying that in R is
cosine.sim <- function(A, B) { A %*% B / sqrt(A %*% A * B %*% B) }
But we can rewrite that to look more similar to your post as
cosine.sim <- function(A, B) { A %*% B / (sqrt(A %*% A) * sqrt(B %*% B)) }
I guess you could even re-write that (the calculation of similarity between a single pair of individuals) as a bunch of nested loops, but in the case of a manageable amount of data, please don't. R is highly optimized for operations on vectors and matrices. If you're new to R, don't second-guess it. By the way, what happened to your millions of rows? This will certainly be less stressful now that you're down to tens of thousands.
Anyway, let’s say that each individual only has two elements.
individual.1 <- c(1, 0)
individual.2 <- c(1, 1)
So you can think of individual.1 as a line that passes between the origin (0,0) and (1, 0), and individual.2 as a line that passes between the origin and (1, 1).
some.data <- rbind.data.frame(individual.1, individual.2)
names(some.data) <- c('element.i', 'element.j')
rownames(some.data) <- c('individual.1', 'individual.2')
plot(some.data, xlim = c(-0.5, 2), ylim = c(-0.5, 2))
text(
some.data,
rownames(some.data),
xlim = c(-0.5, 2),
ylim = c(-0.5, 2),
adj = c(0, 0)
)
segments(0, 0, x1 = some.data[1, 1], y1 = some.data[1, 2])
segments(0, 0, x1 = some.data[2, 1], y1 = some.data[2, 2])
So what’s the angle between vector individual.1 and vector individual.2? You guessed it, 0.785 radians, or 45 degrees.
cosine.sim <- function(A, B) { A %*% B / (sqrt(A %*% A) * sqrt(B %*% B)) }
cos.sim.result <- cosine.sim(individual.1, individual.2)
angle.radians <- acos(cos.sim.result)
angle.degrees <- angle.radians * 180 / pi
print(angle.degrees)
# [,1]
# [1,] 45
Now we can use the cosine.sim function I previously defined, in two nested loops, to explicitly calculate the pairwise similarities between each of the iris flowers. Remember, psm_scaled has already been defined as the scaled numerical values from the iris dataset.
cs.melt <- lapply(rownames(psm_scaled), function(name.A) {
inner.loop.result <-
lapply(rownames(psm_scaled), function(name.B) {
individual.A <- psm_scaled[rownames(psm_scaled) == name.A, ]
individual.B <- psm_scaled[rownames(psm_scaled) == name.B, ]
similarity <- cosine.sim(individual.A, individual.B)
return(list(name.A, name.B, similarity))
})
inner.loop.result <-
do.call(rbind.data.frame, inner.loop.result)
names(inner.loop.result) <-
c('flower.A', 'flower.B', 'similarity')
return(inner.loop.result)
})
cs.melt <- do.call(rbind.data.frame, cs.melt)
Now we repeat the calculation of cs.melt$class.A, cs.melt$class.B, and cs.melt$comparison as above, and calculate cs.agg.from.loops as the mean similarity between the various types of comparisons:
cs.agg.from.loops <-
  aggregate(cs.melt$similarity, by = list(cs.melt$comparison), mean)
print(cs.agg.from.loops[order(cs.agg.from.loops$x),])
# Group.1 x
# 3 setosa_vs_virginica -0.7945321
# 7 virginica_vs_setosa -0.7945321
# 2 setosa_vs_versicolor -0.4868352
# 4 versicolor_vs_setosa -0.4868352
# 6 versicolor_vs_virginica 0.3774612
# 8 virginica_vs_versicolor 0.3774612
# 5 versicolor_vs_versicolor 0.4134413
# 9 virginica_vs_virginica 0.7622797
# 1 setosa_vs_setosa 0.8698189
Which, I believe is identical to the result we got with lsa::cosine.
So what I'm trying to say is... why wouldn't you use lsa::cosine?
Maybe you should be more concerned with:
selection of variables, including removal of highly correlated variables
scaling/normalizing/standardizing the data
performance with a large input data set
identifying known similars and dissimilars for quality control
as previously addressed

Binning data in R

I have a vector with around 4000 values. I would just need to bin it into 60 equal intervals for which I would then have to calculate the median (for each of the bins).
v<-c(1:4000)
v is really just a vector. I read about cut, but that needs me to specify the breakpoints; I just want 60 equal intervals.
Use cut and tapply:
> tapply(v, cut(v, 60), median)
(-3,67.7] (67.7,134] (134,201] (201,268]
34.0 101.0 167.5 234.0
(268,334] (334,401] (401,468] (468,534]
301.0 367.5 434.0 501.0
(534,601] (601,668] (668,734] (734,801]
567.5 634.0 701.0 767.5
(801,867] (867,934] (934,1e+03] (1e+03,1.07e+03]
834.0 901.0 967.5 1034.0
(1.07e+03,1.13e+03] (1.13e+03,1.2e+03] (1.2e+03,1.27e+03] (1.27e+03,1.33e+03]
1101.0 1167.5 1234.0 1301.0
(1.33e+03,1.4e+03] (1.4e+03,1.47e+03] (1.47e+03,1.53e+03] (1.53e+03,1.6e+03]
1367.5 1434.0 1500.5 1567.0
(1.6e+03,1.67e+03] (1.67e+03,1.73e+03] (1.73e+03,1.8e+03] (1.8e+03,1.87e+03]
1634.0 1700.5 1767.0 1834.0
(1.87e+03,1.93e+03] (1.93e+03,2e+03] (2e+03,2.07e+03] (2.07e+03,2.13e+03]
1900.5 1967.0 2034.0 2100.5
(2.13e+03,2.2e+03] (2.2e+03,2.27e+03] (2.27e+03,2.33e+03] (2.33e+03,2.4e+03]
2167.0 2234.0 2300.5 2367.0
(2.4e+03,2.47e+03] (2.47e+03,2.53e+03] (2.53e+03,2.6e+03] (2.6e+03,2.67e+03]
2434.0 2500.5 2567.0 2634.0
(2.67e+03,2.73e+03] (2.73e+03,2.8e+03] (2.8e+03,2.87e+03] (2.87e+03,2.93e+03]
2700.5 2767.0 2833.5 2900.0
(2.93e+03,3e+03] (3e+03,3.07e+03] (3.07e+03,3.13e+03] (3.13e+03,3.2e+03]
2967.0 3033.5 3100.0 3167.0
(3.2e+03,3.27e+03] (3.27e+03,3.33e+03] (3.33e+03,3.4e+03] (3.4e+03,3.47e+03]
3233.5 3300.0 3367.0 3433.5
(3.47e+03,3.53e+03] (3.53e+03,3.6e+03] (3.6e+03,3.67e+03] (3.67e+03,3.73e+03]
3500.0 3567.0 3633.5 3700.0
(3.73e+03,3.8e+03] (3.8e+03,3.87e+03] (3.87e+03,3.93e+03] (3.93e+03,4e+03]
3767.0 3833.5 3900.0 3967.0
In the past, I've used this function:
evenbins <- function(x, bin.count = 10, order = TRUE) {
  bin.size <- rep(length(x) %/% bin.count, bin.count)
  bin.size <- bin.size + ifelse(1:bin.count <= length(x) %% bin.count, 1, 0)
  bin <- rep(1:bin.count, bin.size)
  if (order) {
    bin <- bin[rank(x, ties.method = "random")]
  }
  return(factor(bin, levels = 1:bin.count, ordered = order))
}
and then I can run it with
v.bin <- evenbins(v, 60)
and check the sizes with
table(v.bin)
and see they all contain 66 or 67 elements. By default this will order the values just as cut does, so each of the factor levels will contain increasing values. If you want to bin them based on their original order, use
v.bin <- evenbins(v, 60, order=F)
instead. This just splits the data up in the order it appears.
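Once the bins exist, the per-bin medians the question asks for can be computed from the v.bin factor above, for example:
# median of the original values within each of the 60 bins
tapply(v, v.bin, median)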
This result shows the 59 midpoints of consecutive break-points (the median of each pair). The 60 break-points give bins that are probably as close to equal-width as possible (but probably not exactly equal).
> sq <- seq(1, 4000, length = 60)
> sapply(2:length(sq), function(i) median(c(sq[i-1], sq[i])))
# [1] 34.88983 102.66949 170.44915 238.22881 306.00847 373.78814
# [7] 441.56780 509.34746 577.12712 644.90678 712.68644 780.46610
# ......
Actually, after checking, the bins are pretty darn close to being equal.
> unique(diff(sq))
# [1] 67.77966 67.77966 67.77966

Running 'prop.test' multiple times in R

I have some data showing a long list of regions, the population of each region and the number of people in each region with a certain disease. I'm trying to show the confidence intervals for each proportion (but I'm not testing whether the proportions are statistically different).
One approach is to manually calculate the standard errors and confidence intervals but I'd like to use a built-in tool like prop.test, because it has some useful options. However, when I use prop.test with vectors, it runs a chi-square test across all the proportions.
I've solved this with a while loop (see dummy data below), but I sense there must be a better and simpler way to approach this problem. Would apply work here, and how? Thanks!
dat <- data.frame(1:5, c(10, 50, 20, 30, 35))
names(dat) <- c("X", "N")
dat$Prop <- dat$X / dat$N
ConfLower <- 0
x <- 1
while (x < 6) {
  a <- prop.test(dat$X[x], dat$N[x])$conf.int[1]
  ConfLower <- c(ConfLower, a)
  x <- x + 1
}
ConfUpper <- 0
x <- 1
while (x < 6) {
  a <- prop.test(dat$X[x], dat$N[x])$conf.int[2]
  ConfUpper <- c(ConfUpper, a)
  x <- x + 1
}
dat$ConfLower <- ConfLower[2:6]
dat$ConfUpper <- ConfUpper[2:6]
Here's an attempt using Map, essentially stolen from a previous answer here:
https://stackoverflow.com/a/15059327/496803
res <- Map(prop.test,dat$X,dat$N)
dat[c("lower","upper")] <- t(sapply(res,"[[","conf.int"))
# X N Prop lower upper
#1 1 10 0.1000000 0.005242302 0.4588460
#2 2 50 0.0400000 0.006958623 0.1485882
#3 3 20 0.1500000 0.039566272 0.3886251
#4 4 30 0.1333333 0.043597084 0.3164238
#5 5 35 0.1428571 0.053814457 0.3104216
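A closely related sketch, in case you prefer a single pass that returns both bounds at once (same dat as above; this variant is not from the original answer):
ci <- mapply(function(x, n) prop.test(x, n)$conf.int, dat$X, dat$N)
dat$ConfLower <- ci[1, ]  # lower bound of each confidence interval
dat$ConfUpper <- ci[2, ]  # upper bound
dat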
