I have a vector as below
data <- c("6X75ML","24X37.5ML(KKK)", "6X2X75ML", "168X5CL (UUU)", "168X5CLKK (BUU)")
The data above represents pack sizes of bottles in a case. For example, "6X75ML" means there are 6 bottles of 75ML (milliliters) of liquid each. "6X2X75ML", which is a promotion pack, has 6X2 = 12 bottles of 75ML in a case.
I need to find the volume in liters available in each case:
e.g -
"6X75ML" should be (6 * 75 * 0.001) = 0.45 Liters
"24X37.5ML(KKK)" should be (6 * 37.5 * 0.001) = 0.9 Liters
"6X2X75ML" should be (6 * 2 * 75 * 0.001) = 0.9 Liters
[there can only be a maximum of 3 digits between the X]
Conversions:
ML - Milli liters
CL - Centi liters
LTR - Liters
1ML = 0.001LTR
1CL = 0.01LTR
In some cases, as in "168X5CLKK (BUU)" above, there can be extra letters after the unit, and only the CL needs to be taken.
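To make the conversion rule concrete, here is a small illustrative sketch (not part of my working code) that only identifies which unit applies to each element and maps it to a liter factor:
units <- regmatches(data, regexpr("ML|CL", data))
units
# [1] "ML" "ML" "ML" "CL" "CL"
litre_factor <- ifelse(units == "ML", 0.001, 0.01)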
I have the below code helping me to find the quantity of bottles in a case
dataList <- strsplit(data, split="X")
Pack <- sapply(dataList, function(x) prod(as.numeric(head(x, -1))))
eg. "6X2X75ML" ll give 12; "168X5CL (UUU)" ll give 168 etc
Working:
strsplit breaks up each string along "X". The resulting list is fed to sapply, which performs an operation on all but the final element of each vector in the list. The operation transforms the elements into numerics and then multiplies them; the final element is dropped using head(x, -1).
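To see the two steps concretely, here are the intermediate pieces for this data:
strsplit("6X2X75ML", split = "X")
# [[1]]
# [1] "6"    "2"    "75ML"
Pack
# [1]   6  24  12 168 168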
I am not able to find an efficient way to split the last element to get the volume.
data <- c("6X75ML","24X37.5ML(KKK)", "6X2X75ML", "168X5CL (UUU)", "168X5CLKK (BUU)")
Replace ML with X0.001
data <- gsub("ML", "X0.001", data)
Replace CL with X0.01
data <- gsub("CL", "X0.01", data)
split the string and do the multiplication
unlist(lapply(strsplit(gsub("[A-Z() ]*$", "", data), "X"), function(x){ prod(as.numeric(x))}))
Output:
[1] 0.45 0.90 0.90 8.40 8.40
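For reference, after the two gsub() calls above, data holds the conversion factors inline, and the final gsub("[A-Z() ]*$", "", data) strips everything after the last number before the split:
data
# [1] "6X75X0.001"         "24X37.5X0.001(KKK)" "6X2X75X0.001"
# [4] "168X5X0.01 (UUU)"   "168X5X0.01KK (BUU)"
gsub("[A-Z() ]*$", "", data)
# [1] "6X75X0.001"    "24X37.5X0.001" "6X2X75X0.001"  "168X5X0.01"    "168X5X0.01"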
Sathish's solution is straightforward.
If you need to extract and store the volume and the number of bottles separately, you could also use this code:
# Data
data <- c("6X75ML","24X37.5ML(KKK)", "6X2X75ML", "168X5CL (UUU)", "168X5CLKK (BUU)")
# (1) Calculate volume per unit/bottle
# ------------
# Get volume measurement:
# Extract bottle volume and define conversion factor
bottle.volumes.units <- substr(sub(".*X[0-9,.]+", "", data), 1, 2)
bottle.volumes.in.liter <- gsub("ML", 0.001, bottle.volumes.units)
bottle.volumes.in.liter <- gsub("CL", 0.01, bottle.volumes.in.liter)
# Get volume per bottle
# All numbers in every vector
all.numbers <- regmatches(data, gregexpr('\\(?[0-9,.]+', data))
# The volume information for each bottle
bottle.volumes <- as.numeric(sapply(all.numbers, function(x) tail(x, 1)))
# Harmonize volume measurement to "liters"
bottle.volumes.in.liters <- as.numeric(bottle.volumes.in.liter) * bottle.volumes
# (2) Get the number of units/bottles
# ------------
number.of.bottles.temp <- lapply(all.numbers, function(x)as.numeric(setdiff(x, as.list(bottle.volumes))))
number.of.bottles <- sapply(number.of.bottles.temp, prod)
# (3) Calculate volume
# ------------
number.of.bottles * bottle.volumes.in.liters
# [1] 0.45 0.90 0.90 8.40 8.40
I essentially have two columns (vectors) with speed and accel in a data.frame as such:
speed acceleration
1 3.2694444 2.6539535522
2 3.3388889 2.5096979141
3 3.3888889 2.2722134590
4 3.4388889 1.9815256596
5 3.5000000 1.6777544022
6 3.5555556 1.3933215141
7 3.6055556 1.1439051628
8 3.6527778 0.9334115982
9 3.6722222 0.7561602592
I need to find, for each value of speed on the x axis, the top 10% of values from the y axis (acceleration). This also needs to be done within specific intervals, for example speed 3.2-3.4, 3.4-3.6, and so on. Can you please show me what a for loop would look like in this situation?
As #alistaire already pointed out, you have provided a very limited amount of data. So we first have to simulate a bit more data on which we can test our code.
set.seed(1)
# your data
speed <- c(3.2694444, 3.3388889, 3.3388889, 3.4388889, 3.5,
3.5555556, 3.6055556, 3.6527778, 3.6722222)
acceleration <- c(2.6539535522, 2.5096979141, 2.2722134590,
1.9815256596, 1.6777544022, 1.3933215141,
1.1439051628, 0.9334115982, 0.7561602592)
df <- data.frame(speed, acceleration)
# expand data.frame and add a little bit of noise to all values
# to make them 'unique'
df <- as.data.frame(do.call(
rbind,
replicate(15L, apply(df, 2, \(x) (x + runif(length(x), -1e-1, 1e-1) )),
simplify = FALSE)
))
The function create_intervals, as the name suggests, creates user-defined intervals. The rest of the code does the 'heavy lifting' and stores the desired result in out.
If you would like to have intervals of speed with equal widths, simply specify the number of groups (n_groups) you would like to have and leave the rest of the arguments (i.e. lwr, upr, and interval_span) unspecified.
# Cut speed into user-defined intervals
create_intervals <- \(n_groups = NULL, lwr = NULL, upr = NULL, interval_span = NULL) {
if (!is.null(lwr) & !is.null(upr) & !is.null(interval_span) & is.null(n_groups)) {
speed_low <- subset(df, speed < lwr, select = speed)
first_interval <- with(speed_low, c(min(speed), lwr))
middle_intervals <- seq(lwr + interval_span, upr - interval_span, interval_span)
speed_upp <- subset(df, speed > upr, select = speed)
last_interval <- with(speed_upp, c(upr, max(speed)))
intervals <- c(first_interval, middle_intervals, last_interval)
} else {
step <- with(df, c(max(speed) - min(speed))/n_groups)
intervals <- array(0L, dim = n_groups)
for(i in seq_len(n_groups)) {
intervals[i] <- min(df$speed) + i * step
}
}
return(intervals)
}
# three intervals with equal width
my_intervals <- create_intervals(n_groups = 3L)
# Compute values of speed when acceleration is greater than
# or equal to the 90th percentile
out <- lapply(1:(length(my_intervals)-1L), \(i) {
x <- subset(df, speed >= my_intervals[i] & speed <= my_intervals[i+1L])
x[x$acceleration >= quantile(x$acceleration, 0.9), ]
})
# function to round values to two decimal places
r <- \(x) format(round(x, 2), nsmall = 2L)
# assign names to each element of out
for(i in seq_along(out)) {
names(out)[i] <- paste0(r(my_intervals[i]), '-', r(my_intervals[i+1L]))
}
Output 1
> out
$`3.38-3.57`
speed acceleration
11 3.394378 2.583636
21 3.383631 2.267659
57 3.434123 2.300234
83 3.394886 2.580924
101 3.395459 2.460971
$`3.57-3.76`
speed acceleration
6 3.635234 1.447290
41 3.572868 1.618293
51 3.615017 1.420020
95 3.575412 1.763215
We could also compute the desired values of speed based on intervals that make more 'sense' than just equally spaced speed intervals, e.g. [min(speed), 3.3), [3.3, 3.45), [3.45, 3.6), and [3.6, max(speed)).
This can be accomplished by leaving n_groups unspecified and instead specifying lwr, upr, and an interval_span that makes sense. For instance, it makes sense to have an interval span of 0.15 when the lower limit is 3.3 and the upper limit is 3.6.
# custom boundaries based on a lower limit and upper limit
my_intervals <- create_intervals(lwr = 3.3, upr = 3.6, interval_span = 0.15)
Output 2
> out
$`3.18-3.30`
speed acceleration
37 3.238781 2.696456
82 3.258691 2.722076
$`3.30-3.45`
speed acceleration
11 3.394378 2.583636
19 3.328292 2.711825
73 3.315306 2.644580
83 3.394886 2.580924
$`3.45-3.60`
speed acceleration
4 3.520530 2.018930
40 3.517329 2.032943
58 3.485247 2.079893
67 3.458031 2.078545
$`3.60-3.76`
speed acceleration
6 3.635234 1.447290
34 3.688131 1.218969
51 3.615017 1.420020
78 3.628465 1.348873
Note: use function(x) instead of \(x) if you use a version of R <4.1.0
I've looked at several questions like this on SO and still cannot resolve it. I'm looking to add a constraint where the change in volume (New Volume / Old Volume - 1) cannot be less than -10%.
Example:
Customer  Old_volume  Elasticity  Price
X         100         -0.68       15.00
#example dataset
df <- data.frame(customer = c("X"),
old_volume = c(100),
elasticity = c(-0.68),
price = c(15.00))
#function
f3 <- function(x) {
new_vol = 100 * (1+(-0.68 * x))
new_rev = new_vol * (15.00 * (1+x))
new_cost = new_vol * 11.25
return(new_rev - new_cost) }
n_vol <- function(x) {
new_vol = 100 * (1+(-0.68 * x))
return(new_vol) }
#example run function
f3(0.25)
>>>[1] 622.5
#running the optimization
res <- optimize(f3, lower=0, upper=10, maximum = TRUE)
res$maximum
>>>0.6102941
res$objective
>>>[1] 754.9081
n_vol(0.6102941)
>>> 58.5
Therefore vol change % = 58.5/100-1 = -0.415, however I want to limit this to -0.1.
Now I want to add a constraint where new_vol/old_vol - 1 >= -0.1. However, I'm not sure how to add this in using optimize(), optim(), or lpSolve. I was reading through the lpSolve documentation and it seemed like the way to go, but I am confused as to what my vector of coefficients would be in objective.in. Also, looking at other packages like constrOptim, they seemed too complex for something one-dimensional.
Thanks
As an example, normally in Excel I would have a bunch of columns with these calculations and run Solver with the constraint of volume change >= -0.1. However, running a macro takes too long with more than 1,000 rows.
EDIT: Added input example
You can calculate your new boundary algebraically.
change_inv <- function(frac, oldvol = 100) { oldvol * (1 + frac) }
n_vol_inv <- function(vol) { (vol / 100 - 1) / -0.68 }
n_vol_inv(vol = change_inv(frac = -0.1))
# [1] 0.1470588
max_x <- optimize(f3,
lower = 0,
upper = n_vol_inv(vol = change_inv(-0.1)),
maximum = TRUE)
max_x
# $maximum
# [1] 0.1469922
#
# $objective
# [1] 535.9664
n_vol(max_x$maximum) / 100 - 1
# [1] -0.09995469
Unfortunately, this means that the maximum is just the boundary.
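If you need this for many customers (you mention more than 1,000 rows), here is a hedged sketch of the same boundary idea applied row-wise; it assumes the unit cost of 11.25 from f3() is constant, that elasticity is always negative, and that the objective has the same form for every customer:
unit_cost <- 11.25                        # assumed constant, as in f3()
row_opt <- function(old_volume, elasticity, price) {
  obj <- function(x) {
    new_vol <- old_volume * (1 + elasticity * x)
    new_vol * price * (1 + x) - new_vol * unit_cost
  }
  upper_x <- -0.1 / elasticity            # x at which volume has dropped exactly 10%
  opt <- optimize(obj, lower = 0, upper = upper_x, maximum = TRUE)
  c(x = opt$maximum, objective = opt$objective)
}
t(mapply(row_opt, df$old_volume, df$elasticity, df$price))
# one row per customer; for the example data, x is about 0.147 and the objective about 536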
I'm trying to write a piece of optimization code in R to calculate a set of unknown values for a biological question concerning drosophila flies.
The dataframe is composed of 13 columns (just showing 9 in the code below for clarity) with varying numbers of rows. The first three columns contain collected data and the rest of the columns are calculated using various formulas. Two of the columns, Miss_C and Miss_D, are initially populated with zeros and in the optimization problem represent the initial values.
Time.min. Prob_C Prob_D Miss_C Miss_D Event_C Event_D Risk_C Risk_D
1 0 1.00 1.00 0 0 0.00 0.00 86.00 78.00
2 5 0.98 0.97 0 0 1.93 1.98 84.07 76.02
3 16 0.84 0.95 0 0 10.67 1.90 73.40 74.12
4 17 0.50 0.75 0 0 21.02 12.85 52.38 61.27
5 20 0.30 0.50 0 0 14.97 15.32 37.42 45.95
As an example of some of the formulas used, Event_C and Risk_C are calculated with a for loop as follows:
#define values for events_c and risk_c with for loops
temp_events_c <-vector()
temp_risk_c <-vector()
for (i in 2:no_rows) {
temp_events_c <- ((prob_c[i] * risk_c[i-1]) - (prob_c[i] * miss_c[i-1]) - (prob_c[i-1] * risk_c[i-1]) + (prob_c[i-1] * miss_c[i-1])) / (prob_c[i] - (2 * prob_c[i-1]))
events_c[i] <- temp_events_c
temp_risk_c <- risk_c[i-1] - miss_c[i-1] - events_c[i]
risk_c[i] <- temp_risk_c
}
From this data, I also have a single collected value (9.1 in this instance) which relates to the values in the table. The following function defines the relationship of this value to the columns Event_C, Event_D and two columns not shown above, Expected_C and Expected_D, where the sums of those columns are represented by x[1], x[2], x[3], x[4]:
fn <- function(x) ((x[1]-x[2])^2)/x[2] + ((x[3]-x[4])^2)/x[4]
I would then like to use a minimization algorithm, i.e. slsqp from nloptr, to calculate the values in Miss_C and Miss_D which ultimately satisfy this single value. The extra code for the optimization would be something like this:
x0 <- c(Miss_C,Miss_D)
heq <- function(x) (((x[1]-x[2])^2)/x[2] + ((x[3]-x[4])^2)/x[4]) - 9.1 # heq == 0
slsqp(x0, fn, gr = NULL,
hin = NULL, heq = heq)
Obviously, that doesn't work as the initial values are not directly contained within the function that needs to be solved and that is the point I am stuck at! I'm not sure if this is an optimization problem as such or more of a general R coding question - either way, any help would be much appreciated.
Cheers, Andrew
* Edit - complete code added as per request *
#input variables
time_vector <- c(0,5,16,17,20)
prob_c <- c(1,0.977,0.835,0.5,0.30)
prob_d <- c(1,0.974,0.949,0.75,0.50)
miss_c <- c(0,0,0,0,0)
miss_d <- c(0,0,0,0,0)
#get number of rows
no_rows <- length(time_vector)
#fill events columns with dummy data
events_c <- c(0:(no_rows - 1))
events_d <- c(0:(no_rows - 1))
#define starting number at risk
risk_c_t0 <- 86
risk_d_t0 <- 78
#add t0 risk to each column
risk_c <- risk_c_t0
risk_d <-risk_d_t0
#fill risk columns with dummy data
risk_c[2:no_rows] <- c(2:no_rows)
risk_d[2:no_rows] <- c(2:no_rows)
#re-define values for events_c and risk_c with for loops
temp_events_c <-vector()
temp_risk_c <-vector()
for (i in 2:no_rows) {
temp_events_c <- ((prob_c[i] * risk_c[i-1]) - (prob_c[i] * miss_c[i-1]) - (prob_c[i-1] * risk_c[i-1]) + (prob_c[i-1] * miss_c[i-1])) / (prob_c[i] - (2 * prob_c[i-1]))
events_c[i] <- temp_events_c
temp_risk_c <- risk_c[i-1] - miss_c[i-1] - events_c[i]
risk_c[i] <- temp_risk_c
}
#re-define values for events_t with for loops
temp_events_d <-vector()
temp_risk_d <-vector()
for (j in 2:no_rows) {
temp_events_d <- ((prob_d[j] * risk_d[j-1]) - (prob_d[j] * miss_d[j-1]) - (prob_d[j-1] * risk_d[j-1]) + (prob_d[j-1] * miss_d[j-1])) / (prob_d[j] - (2 * prob_d[j-1]))
events_d[j] <- temp_events_d
temp_risk_d <- risk_d[j-1] - miss_d[j-1] - events_d[j]
risk_d[j] <- temp_risk_d
}
#calculate total risk, events and expected
total_risk <- risk_c + risk_d
total_events <- events_c + events_d
expected_c <- (risk_c * (total_events/total_risk))
expected_d <- (risk_d * (total_events/total_risk))
#place values into dataframe
df1 <- data.frame(time_vector,prob_c,prob_d, miss_c, miss_d, events_c, events_d, risk_c, risk_d, total_risk, total_events, expected_c, expected_d)
#sum of values
sum_events_C <- sum(events_c)
sum_events_d <- sum(events_d)
sum_expected_c <- sum(expected_c)
sum_expected_d <- sum(expected_d)
#chi_sq formula
chi_sq_combo <- (((sum_events_C - sum_expected_c)^2)/sum_expected_c) + (((sum_events_d - sum_expected_d)^2)/sum_expected_d)
#### end of table calculations before sim
#x <- c(sum_events_C, sum_expected_c, sum_events_d, sum_expected_d)
#x0 <- c(miss_c,miss_d) #inital values
#fn <- function(x) ((x[1]-x[2])^2)/x[2] + ((x[3]-x[4])^2)/x[4]
#heq <- function(x) (((x[1]-x[2])^2)/x[2] + ((x[3]-x[4])^2)/x[4]) - 6.5 # heq == 0
#slsqp(x0, fn, gr = NULL,
# hin = NULL, heq = heq)
Rephrasing the comments above, I believe the problem is to use optimization to find
two values which yield a target chi-square value. A complication which may cause problems is that there are likely many solutions that yield the target, so it might be necessary to add some other requirement to make the answer unique.
To do this, you need a function of two variables which calculates the square of the difference between the chi-square value using those variables and the target value, and then you minimize that.
For example,
fn2 <- function(x) {
c <- x[1]
d <- x[2]
chisq <- (((c - sum_expected_c)^2)/sum_expected_c) +
(((d - sum_expected_d)^2)/sum_expected_d)
(chisq - 6.5)^2
}
for (i in 1:no_rows) {
x0 <- c(df1$miss_c[i],df1$miss_d[i]) #initial values
res <- nloptr::slsqp(x0, fn2)
miss_c[i] <- res$par[1]
miss_d[i] <- res$par[2]
}
This gives the same values all 5 times, so I might not have understood you completely.
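As a hedged follow-up to the uniqueness caveat above: nloptr::slsqp() also accepts lower and upper bounds, so one way to narrow the solution set is to at least keep the missing counts non-negative (the upper bound of 50 is purely illustrative):
res <- nloptr::slsqp(x0, fn2,
                     lower = c(0, 0),     # missing counts cannot be negative
                     upper = c(50, 50))   # illustrative cap; adjust to your data
res$par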
I have two data files, one raw data set and one calibration file. What I'm trying to do is find the time between the first and second calibration run and match it to the chunk of time in the raw file in order to QC the data using a linear regression (known slope and intercept).
I've been messing around with for loops, and I know it's not ideal in R, but I see no other way right now.
Here is an example of the files
> cal
time slope intercept pressure
24.51 2.4 -71 101
25.61 1.7 -12 101
...
27.12 1.8 -23 102
And an example of the rawdata
> raw
time CO2
24.53 320
24.54 322
24.55 322
....
....
....
25.65 433
25.73 431
Because the calibration file has variable length and the raw files never have a perfectly matching timestamp, I've been using a for loop to filter through the raw data set, apply a slope and intercept fit, and produce a new column in the original raw dataframe which has calibrated values.
Here's what I have so far
mylist <- list()
for (i in 1:length(cal)) {
if (raw$time > cal[i,1] & raw$time < cal[(i+1),1]) {
initial_t = cal[i,1]
final_t = cal[(i+1),1]
initial_b = cal[i,3]
final_b = cal[(i+1),3]
initial_m = cal[i,2]
final_m = cal[(i+1),2]
initial_p = cal[i,4]
final_p = cal[(i+1), 4]
b_factor = (final_b - initial_b)/(final_t - initial_t)
m_factor = (final_m - initial_m)/(final_t - initial_t)
p_factor = (final_p - initial_p)/(final_t - initial_t)
vec <- raw %>%
  filter(between(time, initial_t, final_t))
cal_m <- initial_m + m_factor*(vec$time - initial_t)
cal_b <- initial_b + b_factor*(vec$time - initial_t)
cal_p <- initial_p + p_factor*(vec$time - initial_t)
new_value <- ((cal_m * vec$CO2) + cal_b)*(cal_p/101.325)
vec$new_value <- new_value
mylist[[i]] <- vec
}
}
mylist <- as.data.frame(do.call("rbind", mylist))
And I receive an error message "the condition has length > 1 and only the first element will be used"
Any hints or suggestions on how to sort through one data frame to match the window of time on another?
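Not a complete answer, but a hedged sketch of one vectorised way to line the two files up: findInterval() assigns each raw timestamp to the calibration window it falls in, after which the same linear interpolation as in the loop can be applied without any if(). Column names follow the example data:
idx  <- findInterval(raw$time, cal$time)   # calibration row preceding each raw time
keep <- idx >= 1 & idx < nrow(cal)         # drop raw rows outside the calibration span
raw2 <- raw[keep, ]
i    <- idx[keep]
frac  <- (raw2$time - cal$time[i]) / (cal$time[i + 1] - cal$time[i])
cal_m <- cal$slope[i]     + frac * (cal$slope[i + 1]     - cal$slope[i])
cal_b <- cal$intercept[i] + frac * (cal$intercept[i + 1] - cal$intercept[i])
cal_p <- cal$pressure[i]  + frac * (cal$pressure[i + 1]  - cal$pressure[i])
raw2$new_value <- (cal_m * raw2$CO2 + cal_b) * (cal_p / 101.325)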
I have to calculate the cosine similarity (patient similarity metric) in R between 48k patients' data with some predictive variables. Here is the equation: PSM(P1, P2) = (P1 · P2) / (||P1|| ||P2||)
where P1 and P2 are the predictor vectors corresponding to two different patients; for example, P1 is the index patient and P2 is the patient being compared with the index, and the pairwise patient similarity metric PSM(P1, P2) is calculated from them.
This process will go on for all 48k patients.
I have added a sample data set for 300 patients in a .csv file. Please find the sample data set here: https://1drv.ms/u/s!AhoddsPPvdj3hVTSbosv2KcPIx5a
First things first: You can find more rigorous treatments of cosine similarity at either of these posts:
Find cosine similarity between two arrays
Creating co-occurrence matrix
Now, you clearly have a mixture of data types in your input, at least
decimal
integer
categorical
I suspect that some of the integer values are Booleans or additional categoricals. Generally, it will be up to you to transform these into continuous numerical vectors if you want to use them as input into the similarity calculation. For example, what's the distance between admission types ELECTIVE and EMERGENCY? Is it a nominal or ordinal variable? I will only be modelling the columns that I trust to be numerical dependent variables.
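For example, here is a minimal sketch of dummy (one-hot) coding with base R, using a hypothetical admission_type column; it is one common way to turn a nominal variable into numeric columns, though it does not answer the question of what distances between the levels should mean:
admission_type <- factor(c("ELECTIVE", "EMERGENCY", "EMERGENCY", "URGENT"))  # hypothetical values
model.matrix(~ admission_type - 1)   # one 0/1 indicator column per level, no intercept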
Also, what have you done to ensure that some of your columns don't correlate with others? Using just a little awareness of data science and biomedical terminology, it seems likely that the following are all correlated:
diasbp_max, diasbp_min, meanbp_max, meanbp_min, sysbp_max and sysbp_min
I suggest going to a print shop and ordering a poster-size printout of psm_pairs.pdf. :-) Your eyes are better at detecting meaningful (but non-linear) dependencies between variables. Including multiple measurements of the same fundamental phenomenon may over-weight that phenomenon in your similarity calculation. Don't forget that you can derive variables like
diasbp_range <- diasbp_max - diasbp_min
Now, I'm not especially good at linear algebra, so I'm importing a cosine similarity function from the lsa text analysis package. I'd love to see you write out the formula in your question as an R function. I would write it to compare one row to another, and use two nested apply loops to get all comparisons. Hopefully we'll get the same results!
After calculating the similarity, I try to find two different patients with the most dissimilar encounters.
Since you're working with a number of rows that's relatively large, you'll want to compare various algorithmic methodologies for efficiency. In addition, you could use SparkR/some other Hadoop solution on a cluster, or the parallel package on a single computer with multiple cores and lots of RAM. I have no idea whether the solution I provided is thread-safe.
Come to think of it, the transposition alone (as I implemented it) is likely to be computationally costly for a set of 1 million patient-encounters. Overall, as the number of rows in your input increases, the number of pairwise comparisons (and therefore run time and memory) grows with the square of the row count, so performance will degrade quickly.
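As a hedged sketch of a more scalable route than melting the full matrix: cosine similarity is just the cross-product of row-normalised vectors, so blocks of rows can be compared against everything without ever materialising the complete 48k x 48k object (psm_scaled is built in the code below):
row_norms  <- sqrt(rowSums(psm_scaled ^ 2))
normalised <- psm_scaled / row_norms                 # each row scaled to unit length
block      <- normalised[1:100, , drop = FALSE]      # say, the first 100 patients
sim_block  <- tcrossprod(block, normalised)          # 100 x nrow(psm_scaled) cosine similarities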
library(lsa)
library(reshape2)
psm_sample <- read.csv("psm_sample.csv")
row.names(psm_sample) <-
make.names(paste0("patid.", as.character(psm_sample$subject_id)), unique = TRUE)
temp <- sapply(psm_sample, class)
temp <- cbind.data.frame(names(temp), as.character(temp))
names(temp) <- c("variable", "possible.type")
numeric.cols <- (temp$possible.type %in% c("factor", "integer") &
(!(grepl(
pattern = "_id$", x = temp$variable
))) &
(!(
grepl(pattern = "_code$", x = temp$variable)
)) &
(!(
grepl(pattern = "_type$", x = temp$variable)
))) | temp$possible.type == "numeric"
psm_numerics <- psm_sample[, numeric.cols]
row.names(psm_numerics) <- row.names(psm_sample)
psm_numerics$gender <- as.integer(psm_numerics$gender)
psm_scaled <- scale(psm_numerics)
pair.these.up <- psm_scaled
# checking for independence of variables
# if the following PDF pair plot is too big for your computer to open,
# try pair-plotting some random subset of columns
# keep.frac <- 0.5
# keep.flag <- runif(ncol(psm_scaled)) < keep.frac
# pair.these.up <- psm_scaled[, keep.flag]
# pdf device sizes are in inches
dev <-
pdf(
file = "psm_pairs.pdf",
width = 50,
height = 50,
paper = "special"
)
pairs(pair.these.up)
dev.off()
#transpose the dataframe to get the
#similarity between patients
cs <- lsa::cosine(t(psm_scaled))
# this is super inefficient, because cs contains
# two identical triangular matrices
cs.melt <- melt(cs)
cs.melt <- as.data.frame(cs.melt)
names(cs.melt) <- c("enc.A", "enc.B", "similarity")
extract.pat <- function(enc.col) {
my.patients <-
sapply(enc.col, function(one.pat) {
temp <- (strsplit(as.character(one.pat), ".", fixed = TRUE))
return(temp[[1]][[2]])
})
return(my.patients)
}
cs.melt$pat.A <- extract.pat(cs.melt$enc.A)
cs.melt$pat.B <- extract.pat(cs.melt$enc.B)
same.pat <- cs.melt[cs.melt$pat.A == cs.melt$pat.B ,]
different.pat <- cs.melt[cs.melt$pat.A != cs.melt$pat.B ,]
most.dissimilar <-
different.pat[which.min(different.pat$similarity),]
dissimilar.pat.frame <- rbind(psm_numerics[rownames(psm_numerics) ==
as.character(most.dissimilar$enc.A) ,],
psm_numerics[rownames(psm_numerics) ==
as.character(most.dissimilar$enc.B) ,])
print(t(dissimilar.pat.frame))
which gives
patid.68.49 patid.9
gender 1.00000 2.00000
age 41.85000 41.79000
sysbp_min 72.00000 106.00000
sysbp_max 95.00000 217.00000
diasbp_min 42.00000 53.00000
diasbp_max 61.00000 107.00000
meanbp_min 52.00000 67.00000
meanbp_max 72.00000 132.00000
resprate_min 20.00000 14.00000
resprate_max 35.00000 19.00000
tempc_min 36.00000 35.50000
tempc_max 37.55555 37.88889
spo2_min 90.00000 95.00000
spo2_max 100.00000 100.00000
bicarbonate_min 22.00000 26.00000
bicarbonate_max 22.00000 30.00000
creatinine_min 2.50000 1.20000
creatinine_max 2.50000 1.40000
glucose_min 82.00000 129.00000
glucose_max 82.00000 178.00000
hematocrit_min 28.10000 37.40000
hematocrit_max 28.10000 45.20000
potassium_min 5.50000 2.80000
potassium_max 5.50000 3.00000
sodium_min 138.00000 136.00000
sodium_max 138.00000 140.00000
bun_min 28.00000 16.00000
bun_max 28.00000 17.00000
wbc_min 2.50000 7.50000
wbc_max 2.50000 13.70000
mingcs 15.00000 15.00000
gcsmotor 6.00000 5.00000
gcsverbal 5.00000 0.00000
gcseyes 4.00000 1.00000
endotrachflag 0.00000 1.00000
urineoutput 1674.00000 887.00000
vasopressor 0.00000 0.00000
vent 0.00000 1.00000
los_hospital 19.09310 4.88130
los_icu 3.53680 5.32310
sofa 3.00000 5.00000
saps 17.00000 18.00000
posthospmort30day 1.00000 0.00000
Usually I wouldn't add a second answer, but that might be the best solution here. Don't worry about voting on it.
Here's the same algorithm as in my first answer, applied to the iris data set. Each row contains four spatial measurements of a flower from one of three different varieties of iris plants.
Below that you will find the iris analysis, written out as nested loops so you can see the equivalence. But that's not recommended for production with large data sets.
Please familiarize yourself with the starting data and all of the intermediate dataframes:
The input iris data
psm_scaled (the spatial measurements, scaled to mean=0, SD=1)
cs (the matrix of pairwise similarities)
cs.melt (the pairwise similarities in long format)
At the end I have aggregated the mean similarities for all comparisons between one variety and another. You will see that comparisons between individuals of the same variety have mean similarities approaching 1, while comparisons between individuals of different varieties have much lower (even negative) mean similarities.
library(lsa)
library(reshape2)
temp <- iris[, 1:4]
iris.names <- paste0(iris$Species, '.', rownames(iris))
psm_scaled <- scale(temp)
rownames(psm_scaled) <- iris.names
cs <- lsa::cosine(t(psm_scaled))
# this is super inefficient, because cs contains
# two identical triangular matrices
cs.melt <- melt(cs)
cs.melt <- as.data.frame(cs.melt)
names(cs.melt) <- c("enc.A", "enc.B", "similarity")
names(cs.melt) <- c("flower.A", "flower.B", "similarity")
class.A <-
strsplit(as.character(cs.melt$flower.A), '.', fixed = TRUE)
cs.melt$class.A <- sapply(class.A, function(one.split) {
return(one.split[1])
})
class.B <-
strsplit(as.character(cs.melt$flower.B), '.', fixed = TRUE)
cs.melt$class.B <- sapply(class.B, function(one.split) {
return(one.split[1])
})
cs.melt$comparison <-
paste0(cs.melt$class.A , '_vs_', cs.melt$class.B)
cs.agg <-
aggregate(cs.melt$similarity, by = list(cs.melt$comparison), mean)
print(cs.agg[order(cs.agg$x),])
which gives
# Group.1 x
# 3 setosa_vs_virginica -0.7945321
# 7 virginica_vs_setosa -0.7945321
# 2 setosa_vs_versicolor -0.4868352
# 4 versicolor_vs_setosa -0.4868352
# 6 versicolor_vs_virginica 0.3774612
# 8 virginica_vs_versicolor 0.3774612
# 5 versicolor_vs_versicolor 0.4134413
# 9 virginica_vs_virginica 0.7622797
# 1 setosa_vs_setosa 0.8698189
If you’re still not comfortable with performing lsa::cosine() on a scaled, numerical dataframe, we can certainly do explicit pairwise calculations.
The formula you gave for PSM, or cosine similarity of patients, is expressed in two formats at Wikipedia.
Remembering that vectors A and B represent the ordered lists of attributes for PatientA and PatientB, the PSM is the dot product of A and B, divided by the product of [the magnitude of A] and [the magnitude of B].
The terse way of saying that in R is
cosine.sim <- function(A, B) { A %*% B / sqrt(A %*% A * B %*% B) }
But we can rewrite that to look more similar to your post as
cosine.sim <- function(A, B) { A %*% B / (sqrt(A %*% A) * sqrt(B %*% B)) }
I guess you could even re-write that (the calculation of similarity between a single pair of individuals) as a bunch of nested loops, but in the case of a manageable amount of data, please don’t. R is highly optimized for operations on vectors and matrices. If you’re new to R, don’t second-guess it. By the way, what happened to your millions of rows? This will certainly be less stressful now that you're down to tens of thousands.
Anyway, let’s say that each individual only has two elements.
individual.1 <- c(1, 0)
individual.2 <- c(1, 1)
So you can think of individual.1 as a line that passes between the origin (0, 0) and (1, 0), and individual.2 as a line that passes between the origin and (1, 1).
some.data <- rbind.data.frame(individual.1, individual.2)
names(some.data) <- c('element.i', 'element.j')
rownames(some.data) <- c('individual.1', 'individual.2')
plot(some.data, xlim = c(-0.5, 2), ylim = c(-0.5, 2))
text(
some.data,
rownames(some.data),
xlim = c(-0.5, 2),
ylim = c(-0.5, 2),
adj = c(0, 0)
)
segments(0, 0, x1 = some.data[1, 1], y1 = some.data[1, 2])
segments(0, 0, x1 = some.data[2, 1], y1 = some.data[2, 2])
So what’s the angle between vector individual.1 and vector individual.2? You guessed it, 0.785 radians, or 45 degrees.
cosine.sim <- function(A, B) { A %*% B / (sqrt(A %*% A) * sqrt(B %*% B)) }
cos.sim.result <- cosine.sim(individual.1, individual.2)
angle.radians <- acos(cos.sim.result)
angle.degrees <- angle.radians * 180 / pi
print(angle.degrees)
# [,1]
# [1,] 45
Now we can use the cosine.sim function I previously defined, in two nested loops, to explicitly calculate the pairwise similarities between each of the iris flowers. Remember, psm_scaled has already been defined as the scaled numerical values from the iris dataset.
cs.melt <- lapply(rownames(psm_scaled), function(name.A) {
inner.loop.result <-
lapply(rownames(psm_scaled), function(name.B) {
individual.A <- psm_scaled[rownames(psm_scaled) == name.A, ]
individual.B <- psm_scaled[rownames(psm_scaled) == name.B, ]
similarity <- cosine.sim(individual.A, individual.B)
return(list(name.A, name.B, similarity))
})
inner.loop.result <-
do.call(rbind.data.frame, inner.loop.result)
names(inner.loop.result) <-
c('flower.A', 'flower.B', 'similarity')
return(inner.loop.result)
})
cs.melt <- do.call(rbind.data.frame, cs.melt)
Now we repeat the calculation of cs.melt$class.A, cs.melt$class.B, and cs.melt$comparison as above, and calculate cs.agg.from.loops as the mean similarity between the various types of comparisons:
cs.agg.from.loops <-
aggregate(cs.melt$similarity, by = list(cs.melt$comparison), mean)
print(cs.agg.from.loops[order(cs.agg.from.loops$x),])
# Group.1 x
# 3 setosa_vs_virginica -0.7945321
# 7 virginica_vs_setosa -0.7945321
# 2 setosa_vs_versicolor -0.4868352
# 4 versicolor_vs_setosa -0.4868352
# 6 versicolor_vs_virginica 0.3774612
# 8 virginica_vs_versicolor 0.3774612
# 5 versicolor_vs_versicolor 0.4134413
# 9 virginica_vs_virginica 0.7622797
# 1 setosa_vs_setosa 0.8698189
Which, I believe is identical to the result we got with lsa::cosine.
So what I'm trying to say is... why wouldn't you use lsa::cosine?
Maybe you should be more concerned with
selection of variables, including removal of highly correlated variables (a small sketch follows this list)
scaling/normalizing/standardizing the data
performance with a large input data set
identifying known similars and dissimilars for quality control
as previously addressed
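On the first point, a minimal sketch (the 0.9 cutoff is an arbitrary illustration, not a recommendation): compute the correlation matrix of the scaled columns and drop one member of each highly correlated pair before the similarity calculation.
cors <- cor(psm_scaled)
cors[upper.tri(cors, diag = TRUE)] <- 0              # consider each pair only once
high <- which(abs(cors) > 0.9, arr.ind = TRUE)
drop_cols   <- unique(colnames(psm_scaled)[high[, "col"]])
psm_reduced <- psm_scaled[, setdiff(colnames(psm_scaled), drop_cols)]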