I'm trying to replicate an Excel Solver model in R. It's a simple problem to start with: maximize points, with the only constraint being a limit on the number of events that can be played. I have a two-column data frame with a tournament number and projected points. In Excel we have a binary Play Yes/No column, multiply it by the points, and maximize the total, allowing the model to set each Play Yes/No value to 0 or 1. The constraint limits the sum of the Play Yes/No column to a given value, for example 25.
library(lpSolve)
tournament<-rep(1:48,1)
mean<-c(12.2,30.4,30.9,44.1,31.3,27.6,31.5,25.0,31.2,24.0,28.0,23.9,14.1,9.5,17.2,37.8,30.5,43.0,32.1,30.7,30.2,37.0,32.1,28.9,23.7,4.6,29.0,29.1,30.7,31.6,49.5,25.1,30.2,10.3,30.3,21.8,88.5,31.0,30.9,2.9,31.1,30.3,29.7,63.7,31.6,91.6,30.6,31.0)
aggdata<-data.frame(tournament,mean)
maxevents <-25
obj<-aggdata$mean
con <- rep(1,nrow(aggdata))
dir <- c("==")
rhs <- maxevents
result <- lp("max", obj, con, dir, rhs, all.bin = TRUE)
The result selects only 3 rows of the data frame, but it should select the top 25. I know lp is not required for this simple example, but I'll eventually add more constraints and need to get past this roadblock first.
library(lpSolve)
#objective function
obj <- rep(1, nrow(aggdata))
#constraints
con <- matrix(c(rep(1, nrow(aggdata)),
as.vector(aggdata$point)), nrow = 2, byrow = TRUE) #you can add more constraints here; set 'nrow' to the total number of constraints
dir <- c("==", "<=")
rhs <- c(25, #total number of tournaments
1000) #let's assume that total points can't exceed 1000
#optimization solution
result <- lp ("max", obj, con, dir, rhs, all.bin=TRUE)
result$solution
Sample data:
aggdata <- data.frame(tournament = rep(1:48,1),
point = c(12.2,30.4,30.9,44.1,31.3,27.6,31.5,25.0,31.2,24.0,28.0,23.9,14.1,
9.5,17.2,37.8,30.5,43.0,32.1,30.7,30.2,37.0,32.1,28.9,23.7,4.6,
29.0,29.1,30.7,31.6,49.5,25.1,30.2,10.3,30.3,21.8,88.5,31.0,30.9,
2.9,31.1,30.3,29.7,63.7,31.6,91.6,30.6,31.0))
# tournament point
#1 1 12.2
#2 2 30.4
#3 3 30.9
#4 4 44.1
#5 5 31.3
#6 6 27.6
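For the single-constraint version in the question, the result can be cross-checked without a solver: with only a cap on the number of events, the best selection is simply the highest-scoring events. A minimal base-R sketch, assuming the point values from the sample data above (the names max_events and play are illustrative):

```r
# Points from the sample data above
point <- c(12.2, 30.4, 30.9, 44.1, 31.3, 27.6, 31.5, 25.0, 31.2, 24.0, 28.0, 23.9,
           14.1, 9.5, 17.2, 37.8, 30.5, 43.0, 32.1, 30.7, 30.2, 37.0, 32.1, 28.9,
           23.7, 4.6, 29.0, 29.1, 30.7, 31.6, 49.5, 25.1, 30.2, 10.3, 30.3, 21.8,
           88.5, 31.0, 30.9, 2.9, 31.1, 30.3, 29.7, 63.7, 31.6, 91.6, 30.6, 31.0)
max_events <- 25                      # illustrative name for the cap on events

# Rank events by points and flag the top `max_events` as played (1) or not (0)
top <- order(point, decreasing = TRUE)[seq_len(max_events)]
play <- integer(length(point))
play[top] <- 1L

sum(play)            # number of events selected
sum(point * play)    # total points; an LP that maximizes points should match this
```

If lp() is called with the points as the objective and a single row of ones as the constraint, its objval should agree with this total. A mismatch would suggest a problem in how the objective or constraint matrix was passed.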
I've got this data processing:
library(text2vec)
##Using perplexity for hold out set
t1 <- Sys.time()
perplex <- c()
for (i in 3:25){
set.seed(17)
lda_model2 <- LDA$new(n_topics = i)
doc_topic_distr2 <- lda_model2$fit_transform(x = dtm, progressbar = F)
set.seed(17)
sample.dtm2 <- itoken(rawsample$Abstract,
preprocessor = prep_fun,
tokenizer = tok_fun,
ids = rawsample$id,
progressbar = F) %>%
create_dtm(vectorizer,vtype = "dgTMatrix", progressbar = FALSE)
set.seed(17)
new_doc_topic_distr2 <- lda_model2$transform(sample.dtm2, n_iter = 1000,
convergence_tol = 0.001, n_check_convergence = 25,
progressbar = FALSE)
perplex[i] <- text2vec::perplexity(sample.dtm2, topic_word_distribution =
lda_model2$topic_word_distribution,
doc_topic_distribution = new_doc_topic_distr2)
}
print(difftime(Sys.time(), t1, units = 'sec'))
I know there are a lot of questions like this, but I haven't been able to find an answer that fits my situation. Above you see a perplexity calculation for topic numbers 3 to 25 of a Latent Dirichlet Allocation model. I want to pick the best trade-off among those values, meaning that I want to find the elbow or knee. The values can be treated as a simple numeric vector, which looks like this:
1 NA
2 NA
3 222.6229
4 210.3442
5 200.1335
6 190.3143
7 180.4195
8 174.2634
9 166.2670
10 159.7535
11 153.7785
12 148.1623
13 144.1554
14 141.8250
15 138.8301
16 134.4956
17 131.0745
18 128.8941
19 125.8468
20 123.8477
21 120.5155
22 118.4426
23 116.4619
24 113.2401
25 114.1233
plot(perplex)
This is what the plot looks like:
I would say that the elbow is at 13 or 16, but I'm not completely sure, and I want the exact number as an outcome. I saw in this paper that f''(x) / (1+f'(x)^2)^1.5 is the knee formula. I tried it like this, and it says 18:
> d1 <- diff(perplex) # first derivative
> d2 <- diff(d1) / diff(perplex[-1]) # second derivative
> knee <- (d2)/((1+(d1)^2)^1.5)
Warning message:
In (d2)/((1 + (d1)^2)^1.5) :
longer object length is not a multiple of shorter object length
> which.min(knee)
[1] 18
I can't fully figure this thing out. Would someone like to share how I could get the exact ideal topics number according to perplexity as an outcome?
Found this: "The LDA model with the optimal coherence score, obtained with an elbow method (the point with maximum absolute second derivative) (...)" in this paper, so this code does the job: d1 <- diff(perplex); k <- which.max(abs(diff(d1) / diff(perplex[-1])))
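For comparison, here is a sketch of the curvature formula quoted in the question, using plain finite differences on the values listed above. It assumes unit spacing between topic numbers, so the second difference is not divided by anything; topics, inner, d1c and curvature are illustrative names.

```r
# Perplexity for 3..25 topics, copied from the output above
topics <- 3:25
perplex <- c(222.6229, 210.3442, 200.1335, 190.3143, 180.4195, 174.2634,
             166.2670, 159.7535, 153.7785, 148.1623, 144.1554, 141.8250,
             138.8301, 134.4956, 131.0745, 128.8941, 125.8468, 123.8477,
             120.5155, 118.4426, 116.4619, 113.2401, 114.1233)
n <- length(perplex)

# Interior points only (topics 4..24): both neighbours are needed
inner <- 2:(n - 1)
d1c <- (perplex[inner + 1] - perplex[inner - 1]) / 2                 # central first difference
d2  <- perplex[inner + 1] - 2 * perplex[inner] + perplex[inner - 1]  # second difference
curvature <- d2 / (1 + d1c^2)^1.5                                    # knee formula from the question

elbow_by_d2   <- topics[inner][which.max(abs(d2))]
elbow_by_curv <- topics[inner][which.max(abs(curvature))]
```

On these values both criteria pick 24, driven by the uptick at 25 topics. That shows how sensitive raw second differences are to noise; smoothing the curve first (for example with lowess()) may give a more useful elbow.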
I am interested in finding Pearson correlation coefficients for a list of genes. I have an Affymetrix gene-level expression matrix (genes in the rows, sample IDs in the columns) and annotation data for the microarray experiment (sample IDs in the rows, descriptive variables in the columns).
data
> expr_mat[1:8, 1:3]
Tarca_001_P1A01 Tarca_003_P1A03 Tarca_004_P1A04
1_at 6.062215 6.125023 5.875502
10_at 3.796484 3.805305 3.450245
100_at 5.849338 6.191562 6.550525
1000_at 3.567779 3.452524 3.316134
10000_at 6.166815 5.678373 6.185059
100009613_at 4.443027 4.773199 4.393488
100009676_at 5.836522 6.143398 5.898364
10001_at 6.330018 5.601745 6.137984
> anodat[1:8, 1:3]
V1 V2 V3
1 SampleID GA Batch
2 Tarca_001_P1A01 11 1
3 Tarca_013_P1B01 15.3 1
4 Tarca_025_P1C01 21.7 1
5 Tarca_037_P1D01 26.7 1
6 Tarca_049_P1E01 31.3 1
7 Tarca_061_P1F01 32.1 1
8 Tarca_051_P1E03 19.7 1
goal:
I want to see how each gene's expression across samples correlates with the GA value of the corresponding samples in the annotation data, then build a sub-matrix of the expression data that keeps only the genes highly correlated with the target variable anodat$GA.
my attempt:
gene_corrs <- function(expr_mat, anno_mat){
stopifnot(ncol(expr_mat)==nrow(anno_mat))
res <- list()
lapply(colnames(expr_mat), function(x){
lapply(x, rownames(y){
if(colnames(x) %in% rownames(anno_mat)){
cor_mat <- stats::cor(y, anno_mat$GA, method = "pearson")
ncor <- ncol(cor_mat)
cmatt <- col(cor_mat)
ord <- order(-cmat, cor_mat, decreasing = TRUE)- (ncor*cmatt - ncor)
colnames(ord) <- colnames(cor_mat)
res <- cbind(ID=c(cold(ord), ID2=c(ord)))
res <- as.data.frame(cbind(out, cor=cor_mat[res]))
res <- cbind(res, cor=cor_mat[out])
res <- as.dara.frame(res)
}
})
})
return(res)
}
However, the implementation above didn't return what I expected. I need to filter the genes, keeping those that have a strong correlation with anodat$GA.
Another attempt:
I read a few posts about similar issues, and some people suggested using the limma package. Here is my attempt with limma, using anodat$GA as a covariate to fit a limma linear model:
library(limma)
fit <- limma::lmFit(expr_mat, design = model.matrix( ~ 0 + anodat$GA))
fit <- eBayes(fit)
topTable(fit, coef=2)
I was expecting to get a correlation matrix from the code above, and would then like to do the following to get the filtered sub-expression matrix:
idx <- which( (abs(cor) > 0.8) & (upper.tri(cor)), arr.ind=TRUE)
idx <- unique(c(idx[, 1],idx[, 2]))
correlated.genes <- matrix[idx, ]
But I still didn't get the right answer. I am fairly confident about the limma approach, but I couldn't figure out what went wrong in the code above. Can anyone point out how to make this work? Is there an efficient way to do this?
Don't have your data so hard to double check, but in the abstract I would try this:
library(matrixTests)
cors <- row_cor_pearson(expr_mat, anodat$GA)
which(cors$cor > 0.9) # to get the indices of genes with correlation > 0.9
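A base-R sketch of the same idea on simulated data, in case matrixTests is not available. Everything here (anno, expr, the 0.8 cutoff) is made up for illustration. It also aligns the annotation rows to the matrix columns by sample ID and filters on the absolute correlation so that strongly negative genes are kept too. The printed anodat appears to hold its header in the first row, in which case GA would need converting with as.numeric() first; that is an assumption based on the printout.

```r
set.seed(42)
n_samples <- 30
sample_ids <- paste0("S", seq_len(n_samples))

# Simulated annotation table, deliberately in a different order from the matrix
anno <- data.frame(SampleID = sample(sample_ids),
                   GA = runif(n_samples, 10, 40))

# Simulated expression matrix: genes in rows, samples in columns
expr <- matrix(rnorm(5 * n_samples), nrow = 5,
               dimnames = list(paste0("gene", 1:5), sample_ids))

# Align GA to the matrix columns by sample ID
ga <- anno$GA[match(colnames(expr), anno$SampleID)]

# Plant one positively and one negatively associated gene
expr["gene2", ] <-  0.5 * ga + rnorm(n_samples, sd = 0.5)
expr["gene4", ] <- -0.5 * ga + rnorm(n_samples, sd = 0.5)

# Pearson correlation of every gene (row) with GA
gene_cor <- cor(t(expr), ga)[, 1]

# Keep genes whose absolute correlation exceeds the cutoff
keep <- abs(gene_cor) > 0.8
sub_expr <- expr[keep, , drop = FALSE]
rownames(sub_expr)
```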
I have to calculate cosine similarity (a patient similarity metric) in R between 48k patients, using some predictive variables. Here is the equation: PSM(P1,P2) = P1.P2 / (||P1|| ||P2||)
where P1 and P2 are the predictor vectors of two different patients. For example, P1 is the index patient and P2 is compared with it, giving the pairwise patient similarity metric PSM(P1,P2).
This process is repeated for all 48k patients.
I have added a sample data set for 300 patients in a .csv file. Please find the sample data set here: https://1drv.ms/u/s!AhoddsPPvdj3hVTSbosv2KcPIx5a
First things first: You can find more rigorous treatments of cosine similarity at either of these posts:
Find cosine similarity between two arrays
Creating co-occurrence matrix
Now, you clearly have a mixture of data types in your input, at least
decimal
integer
categorical
I suspect that some of the integer values are Booleans or additional categoricals. Generally, it will be up to you to transform these into continuous numerical vectors if you want to use them as input into the similarity calculation. For example, what's the distance between admission types ELECTIVE and EMERGENCY? Is it a nominal or ordinal variable? I will only be modelling the columns that I trust to be numerical dependent variables.
Also, what have you done to ensure that some of your columns don't correlate with others? Using just a little awareness of data science and biomedical terminology, it seems likely that the following are all correlated:
diasbp_max, diasbp_min, meanbp_max, meanbp_min, sysbp_max and sysbp_min
I suggest going to a print shop and ordering a poster-size printout of psm_pairs.pdf. :-) Your eyes are better at detecting meaningful (but non-linear) dependencies between variables. Including multiple measurements of the same fundamental phenomenon may over-weight that phenomenon in your similarity calculation. Don't forget that you can derive variables like
diasbp_range <- diasbp_max - diasbp_min
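One way to screen for such redundancy numerically is to list the column pairs whose absolute correlation exceeds a cutoff. A minimal sketch on simulated data; the column names mirror the ones above, but the values and the 0.8 cutoff are assumptions for illustration only.

```r
set.seed(1)
n <- 200
sysbp_min <- rnorm(n, 110, 15)
vitals <- data.frame(
  sysbp_min    = sysbp_min,
  sysbp_max    = sysbp_min + rnorm(n, 30, 5),  # tied to sysbp_min by construction
  resprate_min = rnorm(n, 14, 3)               # independent of the other two
)

cor_mat <- cor(vitals)

# Upper triangle only, so each pair is reported once and the diagonal is skipped
high <- which(abs(cor_mat) > 0.8 & upper.tri(cor_mat), arr.ind = TRUE)
correlated_pairs <- data.frame(
  var_a = rownames(cor_mat)[high[, "row"]],
  var_b = colnames(cor_mat)[high[, "col"]],
  r     = cor_mat[high]
)
print(correlated_pairs)
```

A linear correlation screen like this only catches linear dependence, so it complements the pair plot rather than replacing it.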
Now, I'm not especially good at linear algebra, so I'm importing a cosine similarity function from the lsa text analysis package. I'd love to see you write out the formula in your question as an R function. I would write it to compare one row to another, and use two nested apply loops to get all comparisons. Hopefully we'll get the same results!
After calculating the similarity, I try to find two different patients with the most dissimilar encounters.
Since you're working with a number of rows that's relatively large, you'll want to compare various algorithmic methodologies for efficiency. In addition, you could use SparkR/some other Hadoop solution on a cluster, or the parallel package on a single computer with multiple cores and lots of RAM. I have no idea whether the solution I provided is thread-safe.
Come to think of it, the transposition alone (as I implemented it) is likely to be computationally costly for a set of 1 million patient-encounters. Overall, as the number of rows in your input grows, the number of pairwise comparisons grows quadratically, so performance will degrade quickly.
library(lsa)
library(reshape2)
psm_sample <- read.csv("psm_sample.csv")
row.names(psm_sample) <-
make.names(paste0("patid.", as.character(psm_sample$subject_id)), unique = TRUE)
temp <- sapply(psm_sample, class)
temp <- cbind.data.frame(names(temp), as.character(temp))
names(temp) <- c("variable", "possible.type")
numeric.cols <- (temp$possible.type %in% c("factor", "integer") &
(!(grepl(
pattern = "_id$", x = temp$variable
))) &
(!(
grepl(pattern = "_code$", x = temp$variable)
)) &
(!(
grepl(pattern = "_type$", x = temp$variable)
))) | temp$possible.type == "numeric"
psm_numerics <- psm_sample[, numeric.cols]
row.names(psm_numerics) <- row.names(psm_sample)
psm_numerics$gender <- as.integer(psm_numerics$gender)
psm_scaled <- scale(psm_numerics)
pair.these.up <- psm_scaled
# checking for independence of variables
# if the following PDF pair plot is too big for your computer to open,
# try pair-plotting some random subset of columns
# keep.frac <- 0.5
# keep.flag <- runif(ncol(psm_scaled)) < keep.frac
# pair.these.up <- psm_scaled[, keep.flag]
# pdf device sizes are in inches
dev <-
pdf(
file = "psm_pairs.pdf",
width = 50,
height = 50,
paper = "special"
)
pairs(pair.these.up)
dev.off()
#transpose the dataframe to get the
#similarity between patients
cs <- lsa::cosine(t(psm_scaled))
# this is super inefficient, because cs contains
# two identical triangular matrices
cs.melt <- melt(cs)
cs.melt <- as.data.frame(cs.melt)
names(cs.melt) <- c("enc.A", "enc.B", "similarity")
extract.pat <- function(enc.col) {
my.patients <-
sapply(enc.col, function(one.pat) {
temp <- (strsplit(as.character(one.pat), ".", fixed = TRUE))
return(temp[[1]][[2]])
})
return(my.patients)
}
cs.melt$pat.A <- extract.pat(cs.melt$enc.A)
cs.melt$pat.B <- extract.pat(cs.melt$enc.B)
same.pat <- cs.melt[cs.melt$pat.A == cs.melt$pat.B ,]
different.pat <- cs.melt[cs.melt$pat.A != cs.melt$pat.B ,]
most.dissimilar <-
different.pat[which.min(different.pat$similarity),]
dissimilar.pat.frame <- rbind(psm_numerics[rownames(psm_numerics) ==
as.character(most.dissimilar$enc.A) ,],
psm_numerics[rownames(psm_numerics) ==
as.character(most.dissimilar$enc.B) ,])
print(t(dissimilar.pat.frame))
which gives
patid.68.49 patid.9
gender 1.00000 2.00000
age 41.85000 41.79000
sysbp_min 72.00000 106.00000
sysbp_max 95.00000 217.00000
diasbp_min 42.00000 53.00000
diasbp_max 61.00000 107.00000
meanbp_min 52.00000 67.00000
meanbp_max 72.00000 132.00000
resprate_min 20.00000 14.00000
resprate_max 35.00000 19.00000
tempc_min 36.00000 35.50000
tempc_max 37.55555 37.88889
spo2_min 90.00000 95.00000
spo2_max 100.00000 100.00000
bicarbonate_min 22.00000 26.00000
bicarbonate_max 22.00000 30.00000
creatinine_min 2.50000 1.20000
creatinine_max 2.50000 1.40000
glucose_min 82.00000 129.00000
glucose_max 82.00000 178.00000
hematocrit_min 28.10000 37.40000
hematocrit_max 28.10000 45.20000
potassium_min 5.50000 2.80000
potassium_max 5.50000 3.00000
sodium_min 138.00000 136.00000
sodium_max 138.00000 140.00000
bun_min 28.00000 16.00000
bun_max 28.00000 17.00000
wbc_min 2.50000 7.50000
wbc_max 2.50000 13.70000
mingcs 15.00000 15.00000
gcsmotor 6.00000 5.00000
gcsverbal 5.00000 0.00000
gcseyes 4.00000 1.00000
endotrachflag 0.00000 1.00000
urineoutput 1674.00000 887.00000
vasopressor 0.00000 0.00000
vent 0.00000 1.00000
los_hospital 19.09310 4.88130
los_icu 3.53680 5.32310
sofa 3.00000 5.00000
saps 17.00000 18.00000
posthospmort30day 1.00000 0.00000
Usually I wouldn't add a second answer, but that might be the best solution here. Don't worry about voting on it.
Here's the same algorithm as in my first answer, applied to the iris data set. Each row contains four spatial measurements of flowers from three different varieties of iris plants.
Below that you will find the iris analysis, written out as nested loops so you can see the equivalence. But that's not recommended for production with large data sets.
Please familiarize yourself with starting data and all of the intermediate dataframes:
The input iris data
psm_scaled (the spatial measurements, scaled to mean=0, SD=1)
cs (the matrix of pairwise similarities)
cs.melt (the pairwise similarities in long format)
At the end I have aggregated the mean similarities for all comparisons between one variety and another. You will see that comparisons between individuals of the same variety have mean similarities approaching 1, and comparisons between individuals of different varieties have lower mean similarities, approaching negative 1 for the most distinct pair.
library(lsa)
library(reshape2)
temp <- iris[, 1:4]
iris.names <- paste0(iris$Species, '.', rownames(iris))
psm_scaled <- scale(temp)
rownames(psm_scaled) <- iris.names
cs <- lsa::cosine(t(psm_scaled))
# this is super inefficient, because cs contains
# two identical triangular matrices
cs.melt <- melt(cs)
cs.melt <- as.data.frame(cs.melt)
names(cs.melt) <- c("flower.A", "flower.B", "similarity")
class.A <-
strsplit(as.character(cs.melt$flower.A), '.', fixed = TRUE)
cs.melt$class.A <- sapply(class.A, function(one.split) {
return(one.split[1])
})
class.B <-
strsplit(as.character(cs.melt$flower.B), '.', fixed = TRUE)
cs.melt$class.B <- sapply(class.B, function(one.split) {
return(one.split[1])
})
cs.melt$comparison <-
paste0(cs.melt$class.A , '_vs_', cs.melt$class.B)
cs.agg <-
aggregate(cs.melt$similarity, by = list(cs.melt$comparison), mean)
print(cs.agg[order(cs.agg$x),])
which gives
# Group.1 x
# 3 setosa_vs_virginica -0.7945321
# 7 virginica_vs_setosa -0.7945321
# 2 setosa_vs_versicolor -0.4868352
# 4 versicolor_vs_setosa -0.4868352
# 6 versicolor_vs_virginica 0.3774612
# 8 virginica_vs_versicolor 0.3774612
# 5 versicolor_vs_versicolor 0.4134413
# 9 virginica_vs_virginica 0.7622797
# 1 setosa_vs_setosa 0.8698189
If you’re still not comfortable with performing lsa::cosine() on a scaled, numerical dataframe, we can certainly do explicit pairwise calculations.
The formula you gave for PSM, or cosine similarity of patients, is expressed in two formats at Wikipedia
Remembering that vectors A and B represent the ordered list of attributes for PatientA and PatientB, the PSM is the dot product of A and B, divided by (the scalar product of [the magnitude of A] and [the magnitude of B])
The terse way of saying that in R is
cosine.sim <- function(A, B) { A %*% B / sqrt(A %*% A * B %*% B) }
But we can rewrite that to look more similar to your post as
cosine.sim <- function(A, B) { A %*% B / (sqrt(A %*% A) * sqrt(B %*% B)) }
I guess you could even re-write that (the calculation of similarity between a single pair of individuals) as a bunch of nested loops, but in the case of a manageable amount of data, please don’t. R is highly optimized for operations on vectors and matrices. If you’re new to R, don’t second-guess it. By the way, what happened to your millions of rows? This will certainly be less stressful now that you’re down to tens of thousands.
Anyway, let’s say that each individual only has two elements.
individual.1 <- c(1, 0)
individual.2 <- c(1, 1)
So you can think of individual.1 as a line that passes between the origin (0, 0) and (1, 0), and individual.2 as a line that passes between the origin and (1, 1).
some.data <- rbind.data.frame(individual.1, individual.2)
names(some.data) <- c('element.i', 'element.j')
rownames(some.data) <- c('individual.1', 'individual.2')
plot(some.data, xlim = c(-0.5, 2), ylim = c(-0.5, 2))
text(
some.data,
rownames(some.data),
xlim = c(-0.5, 2),
ylim = c(-0.5, 2),
adj = c(0, 0)
)
segments(0, 0, x1 = some.data[1, 1], y1 = some.data[1, 2])
segments(0, 0, x1 = some.data[2, 1], y1 = some.data[2, 2])
So what’s the angle between vector individual.1 and vector individual.2? You guessed it, 0.785 radians, or 45 degrees.
cosine.sim <- function(A, B) { A %*% B / (sqrt(A %*% A) * sqrt(B %*% B)) }
cos.sim.result <- cosine.sim(individual.1, individual.2)
angle.radians <- acos(cos.sim.result)
angle.degrees <- angle.radians * 180 / pi
print(angle.degrees)
# [,1]
# [1,] 45
Now we can use the cosine.sim function I previously defined, in two nested loops, to explicitly calculate the pairwise similarities between each of the iris flowers. Remember, psm_scaled has already been defined as the scaled numerical values from the iris dataset.
cs.melt <- lapply(rownames(psm_scaled), function(name.A) {
inner.loop.result <-
lapply(rownames(psm_scaled), function(name.B) {
individual.A <- psm_scaled[rownames(psm_scaled) == name.A, ]
individual.B <- psm_scaled[rownames(psm_scaled) == name.B, ]
similarity <- cosine.sim(individual.A, individual.B)
return(list(name.A, name.B, similarity))
})
inner.loop.result <-
do.call(rbind.data.frame, inner.loop.result)
names(inner.loop.result) <-
c('flower.A', 'flower.B', 'similarity')
return(inner.loop.result)
})
cs.melt <- do.call(rbind.data.frame, cs.melt)
Now we repeat the calculation of cs.melt$class.A, cs.melt$class.B, and cs.melt$comparison as above, and calculate cs.agg.from.loops as the mean similarity between the various types of comparisons:
cs.agg.from.loops <-
aggregate(cs.melt$similarity, by = list(cs.melt$comparison), mean)
print(cs.agg.from.loops[order(cs.agg.from.loops$x),])
# Group.1 x
# 3 setosa_vs_virginica -0.7945321
# 7 virginica_vs_setosa -0.7945321
# 2 setosa_vs_versicolor -0.4868352
# 4 versicolor_vs_setosa -0.4868352
# 6 versicolor_vs_virginica 0.3774612
# 8 virginica_vs_versicolor 0.3774612
# 5 versicolor_vs_versicolor 0.4134413
# 9 virginica_vs_virginica 0.7622797
# 1 setosa_vs_setosa 0.8698189
Which, I believe, is identical to the result we got with lsa::cosine.
So what I'm trying to say is... why wouldn't you use lsa::cosine?
Maybe you should be more concerned with
selection of variables, including removal of highly correlated variables
scaling/normalizing/standardizing the data
performance with a large input data set
identifying known similars and dissimilars for quality control
as previously addressed
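On the performance point, the same pairwise cosine similarities can be computed in base R with one matrix product, by scaling each row to unit length first. This is a sketch, not a replacement for lsa::cosine(); cosine_matrix is an illustrative name, and the function assumes that no row is all zeros.

```r
cosine_matrix <- function(X) {
  X <- as.matrix(X)
  unit_rows <- X / sqrt(rowSums(X^2))   # divide each row by its Euclidean norm
  tcrossprod(unit_rows)                 # entry [i, j] is the cosine of rows i and j
}

scaled <- scale(iris[, 1:4])
cs_base <- cosine_matrix(scaled)

# Each unordered pair once: use the upper triangle only
pairs_idx <- which(upper.tri(cs_base), arr.ind = TRUE)
most_dissimilar <- pairs_idx[which.min(cs_base[pairs_idx]), ]
print(most_dissimilar)
```

Working on the upper triangle avoids the duplicated comparisons that melting the full matrix produces. For 48k patients the full matrix would have about 2.3 billion entries (roughly 18 GB as doubles), so in practice it would probably need to be computed in blocks of rows.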
I have the following dataset (obtained here):
  item         survivalpoints weight
1 pocketknife              10      1
2 beans                    20      5
3 potatoes                 15     10
4 unions                    2      1
5 sleeping bag             30      7
6 rope                     10      5
7 compass                  30      1
I can cluster this dataset into three clusters with kmeans() using a binary string as my initial choice of centers. For example:
## 1 represents the initial centers
chromosome = c(1,1,1,0,0,0,0)
## exclude first column (kmeans only supports continuous data)
cl <- kmeans(dataset[, -1], dataset[chromosome == 1, -1])
## check the memberships
cl$cluster
# [1] 1 3 3 1 2 1 2
Building on this idea, I tried using the GA package to run the search, where I am trying to minimize the Davies-Bouldin (DB) index.
library(GA) ## for ga() function
library(clusterSim) ## for index.DB() function
## defining my fitness function (Davies-Bouldin)
DBI <- function(x) {
## converting matrix to vector to access each row
binary_rep <- split(x, row(x))
## evaluate the fitness of each chromsome
for(each in 1:nrow(x){
cl <- kmeans(dataset, dataset[binary_rep[[each]] == 1, -1])
dbi <- index.DB(dataset, cl$cluster, centrotypes = "centroids")
## minimizing db
return(-dbi)
}
}
g<- ga(type = "binary", fitness = DBI, popSize = 100, nBits = nrow(dataset))
Of course (I have no idea what's happening), I received this error message:
Warning messages:
Error in row(x) : a matrix-like object is required as argument to 'row'
Here are my questions:
How can I correctly use the GA package to solve my problem?
How can I make sure the randomly generated chromosomes contain the same number of 1s, corresponding to the number of clusters k (e.g. if k=3 then the chromosome must contain exactly three 1s)?
I can't comment on the sense of combining k-means with ga, but I can point out that you had an issue in your fitness function. Also, errors are produced when all genes are on or all are off, so fitness is only calculated when that is not the case:
DBI <- function(x) {
if(sum(x)==nrow(dataset) | sum(x)==0){
score <- 0
} else {
cl <- kmeans(dataset[, -1], dataset[x==1, -1])
dbi <- index.DB(dataset[,-1], cl=cl$cluster, centrotypes = "centroids")
score <- dbi$DB
}
return(score)
}
g <- ga(type = "binary", fitness = DBI, popSize = 100, nBits = nrow(dataset))
plot(g)
g@solution
g@fitnessValue
Looks like several gene combinations produced the same "best" fitness value
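On the second question (chromosomes with exactly k ones), one option is to repair each chromosome inside the fitness function before clustering. The helper below is a hypothetical sketch in base R, not part of the GA package: it switches randomly chosen bits off or on until exactly k are set, and it assumes k is no larger than the chromosome length.

```r
repair_chromosome <- function(x, k) {
  ones  <- which(x == 1)
  zeros <- which(x == 0)
  if (length(ones) > k) {
    # Too many centers: switch off randomly chosen surplus bits
    drop <- ones[sample.int(length(ones), length(ones) - k)]
    x[drop] <- 0
  } else if (length(ones) < k) {
    # Too few centers: switch on randomly chosen zero bits
    add <- zeros[sample.int(length(zeros), k - length(ones))]
    x[add] <- 1
  }
  x
}

set.seed(1)
too_many <- repair_chromosome(c(1, 1, 1, 1, 0, 0, 0), k = 3)
too_few  <- repair_chromosome(c(0, 0, 0, 0, 0, 0, 1), k = 3)
sum(too_many)
sum(too_few)
```

Calling x <- repair_chromosome(x, 3) at the top of DBI would make every evaluated chromosome use three initial centers. Two caveats: the repair is random, so the same chromosome can score differently between calls, and the bits stored by ga() are not changed, so the reported solution would need the same repair. Note also that ga() maximizes the fitness value, so a score meant to be minimized (such as the DB index) may need to be negated.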
I have a vector as below
data <- c("6X75ML","24X37.5ML(KKK)", "6X2X75ML", "168X5CL (UUU)", "168X5CLKK (BUU)")
The data above are pack sizes of bottles in a case. For example, "6X75ML" means there are 6 bottles of 75 ML (milliliters) of liquid each. "6X2X75ML", which is a promotion pack, has 6X2 = 12 bottles of 75 ML in a case.
I need to find the volume in liters available in each case:
e.g.
"6X75ML" should be (6 * 75 * 0.001) = 0.45 Liters
"24X37.5ML(KKK)" should be (24 * 37.5 * 0.001) = 0.9 Liters
"6X2X75ML" should be (6 * 2 * 75 * 0.001) = 0.9 Liters
[there can only be a maximum of 3 digits between the X]
Conversions:
ML - Milli liters
CL - Centi liters
LTR - Liters
1ML = 0.001LTR
1CL = 0.01LTR
In some cases as in the above there could be values like "168X5CLKK (BUU)" where only CL needs to be taken.
I have the below code helping me to find the quantity of bottles in a case
dataList <- strsplit(data, split="X")
Pack <- sapply(dataList, function(x) prod(as.numeric(head(x, -1))))
e.g. "6X2X75ML" will give 12; "168X5CL (UUU)" will give 168, etc.
Working:
strsplit breaks up each string at "X". The resulting list is fed to sapply, which performs an operation on all but the final element of each vector in the list. The operation converts the elements to numeric and then multiplies them. The final element is dropped using head(x, -1).
I am not able to find an efficient way to split the last element to get the volume.
data <- c("6X75ML","24X37.5ML(KKK)", "6X2X75ML", "168X5CL (UUU)", "168X5CLKK (BUU)")
Replace ML with X0.001
data <- gsub("ML", "X0.001", data)
Replace CL with X0.01
data <- gsub("CL", "X0.01", data)
split the string and do the multiplication
unlist(lapply(strsplit(gsub("[A-Z() ]*$", "", data), "X"), function(x){ prod(as.numeric(x))}))
Output:
[1] 0.45 0.90 0.90 8.40 8.40
Sathish's solution is straightforward.
If you need to extract and store the volume and the number of bottles separately, you could also use this code:
# Data
data <- c("6X75ML","24X37.5ML(KKK)", "6X2X75ML", "168X5CL (UUU)", "168X5CLKK (BUU)")
# (1) Calculate volume per unit/bottle
# ------------
# Get volume measurement:
# Extract bottle volume and define conversion factor
bottle.volumes.units <- substr(sub(".*X[0-9,.]+", "", data), 0, 2)
bottle.volumes.in.liter <- gsub("ML", 0.001, bottle.volumes.units)
bottle.volumes.in.liter <- gsub("CL", 0.01, bottle.volumes.in.liter)
# Get volume per bottle
# All numbers in every vector
all.numbers <- regmatches(data, gregexpr('\\(?[0-9,.]+', data))
# The volume information for each bottle
bottle.volumes <- as.numeric(sapply(all.numbers, function(x) x[length(x)])) # base R; last() would need dplyr or data.table
# Harmonize volume measurement to "liters"
bottle.volumes.in.liters <- as.numeric(bottle.volumes.in.liter) * bottle.volumes
# (2) Get the number of units/bottles
# ------------
number.of.bottles.temp <- lapply(all.numbers, function(x) as.numeric(head(x, -1))) # every number except the last one (the bottle volume)
number.of.bottles <- sapply(number.of.bottles.temp, prod)
# (3) Calculate volume
# ------------
number.of.bottles * bottle.volumes.in.liters
# [1] 0.45 0.90 0.90 8.40 8.40
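A further variant, as a sketch: a single regular expression can capture the leading "numbers joined by X" part and the unit that follows it, which also covers the LTR unit from the conversion list in the question. pack_volume_liters and unit_factor are illustrative names, and strings that do not start with that pattern come back as NA.

```r
pack_volume_liters <- function(x) {
  unit_factor <- c(ML = 0.001, CL = 0.01, LTR = 1)
  # Group 1: digits, dots and X separators; group 2: the unit right after them
  parts <- regmatches(x, regexec("^([0-9.X]+)(ML|CL|LTR)", x))
  vapply(parts, function(p) {
    if (length(p) != 3) return(NA_real_)        # no match for this string
    nums <- as.numeric(strsplit(p[2], "X", fixed = TRUE)[[1]])
    prod(nums) * unit_factor[[p[3]]]
  }, numeric(1))
}

data <- c("6X75ML", "24X37.5ML(KKK)", "6X2X75ML", "168X5CL (UUU)", "168X5CLKK (BUU)")
pack_volume_liters(data)
```

Because the unit is matched directly after the numeric part, trailing text such as "KK (BUU)" is ignored without a separate clean-up step.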