Calculating molecular formulas from the masses of certain elements in R

For a chemistry project at school I want to calculate the molecular masses of all possible combinations of molecular formulas containing carbon (1 atom up to 100), oxygen (1 up to 50), hydrogen (1 up to 200), nitrogen (1 up to 20) and sulfur (1 up to 10), and save the resulting masses in one vector and the corresponding molecular formula strings in another vector. The masses are the numeric values 12, 16, 1, 14 and 32. The strings are "C", "O", "H", "N", "S".
I also want to delete molecular formulas that make no sense, like C1 O100 H0 N20 S10, along with their corresponding masses. To be more specific, I want to keep only the formulas with an O/C ratio between 0 and 1, an H/C ratio between 1 and 2, an N/C ratio between 0 and 0.2, and an S/C ratio between 0 and 0.1.
Is there an easy way to do this? Is a for loop the only way, or is there a faster one (maybe arrays?), and how can I take the element ratios into account?
I would be very happy for some ideas or basic code to solve this.
Edit: @Gregor, so it would probably be better to exclude the ratios that don't make sense before the whole list is created? @Barker, yes, atoms like nitrogen should go from 0 to the maximum. I am very new to R, so when I try a loop I end up with only the last value calculated... (reduced number of dimensions).
z = matrix(0, 1, 5*20*10*2*2)
C = 12
O = 16
H = 1
N = 14
S = 32
for (u in 1:length(z)) {
  for (i in 1:5) {
    for (j in 1:20) {
      for (k in 1:10) {
        for (l in 0:1) {
          for (m in 0:1) {
            z[1, u] <- C*i + H*j + O*k + N*l + S*m
          }
        }
      }
    }
  }
}
Does anyone know where the mistake is here?

expand.grid is a good place to start for generating combinations. For example, to create a data.frame with combinations of C and H you could do this:
mol = expand.grid(C = 1:3, H = 1:4)
mol
# C H
# 1 1 1
# 2 2 1
# 3 3 1
# 4 1 2
# 5 2 2
# 6 3 2
# 7 1 3
# 8 2 3
# 9 3 3
# 10 1 4
# 11 2 4
# 12 3 4
You can add the other elements in expand.grid as well, and also adjust the inputs up to 1:200 or however many you want. Note the size, though: the grid as specified in your question has 100 * 50 * 200 * 20 * 10 = 200 million rows, which is probably more than your computer's memory can comfortably hold. If you can reduce the total number of combinations to around 1MM, it will be much easier on your memory.
The next step would be to delete rows that don't meet your ratio criteria. Here's one example, to make sure that the number of H is between 1 and 2 times the number of C:
mol = mol[mol$H >= mol$C & mol$H <= 2 * mol$C, ]
mol
# C H
# 1 1 1
# 4 1 2
# 5 2 2
# 8 2 3
# 9 3 3
# 11 2 4
# 12 3 4
Repeat steps like that for all your conditions.
Finally you can calculate the weights and put it in a new column:
mol$weight = with(mol, C * 12 + H * 1)
mol
# C H weight
# 1 1 1 13
# 4 1 2 14
# 5 2 2 26
# 8 2 3 27
# 9 3 3 39
# 11 2 4 28
# 12 3 4 40
You could use matrix multiplication for the weight calculation, but there's no need with a small number of possible elements. If you had 20 or more possible input elements it would make sense to do it that way.
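For reference, a minimal sketch of what the matrix-multiplication version could look like (equivalent to the with() call above; the column selection is an assumption for this two-element example):
# multiply the element-count columns by a vector of atomic masses
mol$weight = drop(as.matrix(mol[, c("C", "H")]) %*% c(12, 1))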
Bonus! Formulas can be created with paste or paste0:
mol$formula = paste0("C", mol$C, " H", mol$H)
mol
# C H weight formula
# 1 1 1 13 C1 H1
# 4 1 2 14 C1 H2
# 5 2 2 26 C2 H2
# 8 2 3 27 C2 H3
# 9 3 3 39 C3 H3
# 11 2 4 28 C2 H4
# 12 3 4 40 C3 H4
Of course, most of these still won't make chemical sense - C1 H1 isn't something that would really exist, but maybe you can come up with even smarter conditions to get rid of more of the impossibilities!
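Putting the pieces together for all five elements, a sketch might look like this (the reduced ranges are an assumption of mine to keep the grid small; scale them up, or loop over chunks of C and filter each chunk, as memory allows):
mol = expand.grid(C = 1:20, O = 0:10, H = 1:40, N = 0:4, S = 0:2)
# keep only rows satisfying the ratio constraints from the question
mol = mol[with(mol, O <= C & H >= C & H <= 2*C & N <= 0.2*C & S <= 0.1*C), ]
mol$weight = with(mol, 12*C + 16*O + 1*H + 14*N + 32*S)
mol$formula = with(mol, paste0("C", C, " O", O, " H", H, " N", N, " S", S))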

Related

Redundant variable naming issue with purrr::map and dplyr

I would guess that this is a basic issue that isn't specific to purrr at all, but it caught me off guard in this context. A general answer would be great if this isn't about how purrr and dplyr play together.
I tried to name the variable I was "mapping" over the same as the variable in the data frame I wanted to match, and it led to problems. Can someone explain why my first attempt to generate pairwise differences fails?
It seems like a variable scoping issue or something to do with the duplicated names, but I don't know exactly what is wrong. Obviously, I found a workaround.
Imagine I have data like mydf below, with many variables, and I want to compute the difference in the values of those variables between each pair of sites:
library(dplyr)
library(purrr)

# four sites
site <- rep(c("j", "k", "l", "m"), 3)
# some measurement
val <- 1:12
# some variable
vari <- c(rep(1, 4), rep(2, 4), rep(3, 4))
mydf <- data.frame(site, val, vari)

# compute pairwise differences between values at each site for each variable
outp <- map_dfr(1:3, function(vari){
  dists <- as.numeric(dist(mydf %>% filter(vari == vari) %>% select(val), method = "manhattan"))
  names(dists) <- c("jk", "jl", "jm", "kl", "km", "lm")
  return(data.frame(t(dists), vari = vari))
})
# looks like there was an issue with using "vari"
outp
# but use a different name for the same variable and it works fine
outp2 <- map_dfr(1:3, function(a){
  dists <- as.numeric(dist(mydf %>% filter(vari == a) %>% select(val), method = "manhattan"))
  names(dists) <- c("jk", "jl", "jm", "kl", "km", "lm")
  return(data.frame(t(dists), vari = a))
})
outp2
Edit: as noted in the comments and the answer below, the issue here is with variable usage in dplyr::filter, not with purrr.
If you run some simplified code it might make sense. For example:
# Remove the vector `vari` to avoid confusion.
rm(vari)
# Run using `map` and a simplified function.
map(1:3, function(vari) filter(mydf, vari==vari))
The above call to map returns a list of three dataframes, each identical to mydf:
[[1]]
site val vari
1 j 1 1
2 k 2 1
3 l 3 1
4 m 4 1
5 j 5 2
6 k 6 2
7 l 7 2
8 m 8 2
9 j 9 3
10 k 10 3
11 l 11 3
12 m 12 3
[[2]]
site val vari
1 j 1 1
2 k 2 1
3 l 3 1
4 m 4 1
5 j 5 2
6 k 6 2
7 l 7 2
8 m 8 2
9 j 9 3
10 k 10 3
11 l 11 3
12 m 12 3
[[3]]
site val vari
1 j 1 1
2 k 2 1
3 l 3 1
4 m 4 1
5 j 5 2
6 k 6 2
7 l 7 2
8 m 8 2
9 j 9 3
10 k 10 3
11 l 11 3
12 m 12 3
It's obvious that filter(vari == vari) is comparing mydf$vari with itself, which will simply return an exact copy of mydf. This is good behavior, because we always know what filter will compare. Try the same thing with a temporary variable x:
map(1:3, function(x) filter(mydf, vari==x))
Which returns the expected subsets:
[[1]]
site val vari
1 j 1 1
2 k 2 1
3 l 3 1
4 m 4 1
[[2]]
site val vari
1 j 5 2
2 k 6 2
3 l 7 2
4 m 8 2
[[3]]
site val vari
1 j 9 3
2 k 10 3
3 l 11 3
4 m 12 3
This is basically what you did in your "workaround" – which I would describe as valid code using proper conventions, i.e. not a workaround at all.
aosmith already pointed out that you can use tidy evaluation. This is pretty neat, and it definitely has its use cases, but I think it would reflect bad practice in this particular context. Using a temporary variable makes your code less ambiguous and thus more readable. It also makes sense because we are really dealing with two different things: vari is a vector containing the (repeated) values 1, 2, and 3, while x is in essence a temporary loop variable that is either 1, 2, or 3, depending on the iteration.
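For completeness, a minimal sketch of the tidy-evaluation route (assuming a recent dplyr/rlang; the .env pronoun forces the lookup to happen in the function environment rather than among mydf's columns):
library(dplyr)
library(purrr)
# .env$vari is the function argument; plain vari is the column
map(1:3, function(vari) filter(mydf, vari == .env$vari))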

Using two grouping designations to create one 'combined' grouping variable

Given a data.frame:
df <- data.frame(grp1 = c(1,1,1,2,2,2,3,3,3,4,4,4),
                 grp2 = c(1,2,3,3,4,5,6,7,8,6,9,10))
#> df
# grp1 grp2
#1 1 1
#2 1 2
#3 1 3
#4 2 3
#5 2 4
#6 2 5
#7 3 6
#8 3 7
#9 3 8
#10 4 6
#11 4 9
#12 4 10
Both columns are grouping variables, such that all 1's in column grp1 are known to be grouped together, and so on with all 2's, etc. The same goes for grp2: all 1's are known to be the same, all 2's the same.
Thus, if we look at the 3rd and 4th row, based on column 1 we know that the first 3 rows can be grouped together and the second 3 rows can be grouped together. Then since rows 3 and 4 share the same grp2 value, we know that all 6 rows, in fact, can be grouped together.
Based off the same logic we can see that the last six rows can also be grouped together (since rows 7 and 10 share the same grp2).
Aside from writing a fairly involved set of for() loops, is there a more straightforward approach to this? I haven't been able to think of one yet.
The final output that I'm hoping to obtain would look something like:
# > df
# grp1 grp2 combinedGrp
# 1 1 1 1
# 2 1 2 1
# 3 1 3 1
# 4 2 3 1
# 5 2 4 1
# 6 2 5 1
# 7 3 6 2
# 8 3 7 2
# 9 3 8 2
# 10 4 6 2
# 11 4 9 2
# 12 4 10 2
Thank you for any direction on this topic!
I would define a graph and label nodes according to connected components:
gmap = unique(stack(df))
gmap$node = seq_len(nrow(gmap))
oldcols = unique(gmap$ind)
newcols = paste0("node_", oldcols)
df[ newcols ] = lapply(oldcols, function(i) with(gmap[gmap$ind == i, ],
  node[ match(df[[i]], values) ]
))
library(igraph)
g = graph_from_edgelist(cbind(df$node_grp1, df$node_grp2), directed = FALSE)
gmap$group = components(g)$membership
df$group = gmap$group[ match(df$node_grp1, gmap$node) ]
grp1 grp2 node_grp1 node_grp2 group
1 1 1 1 5 1
2 1 2 1 6 1
3 1 3 1 7 1
4 2 3 2 7 1
5 2 4 2 8 1
6 2 5 2 9 1
7 3 6 3 10 2
8 3 7 3 11 2
9 3 8 3 12 2
10 4 6 4 10 2
11 4 9 4 13 2
12 4 10 4 14 2
Each unique element of grp1 or grp2 is a node and each row of df is an edge.
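For comparison, a more compact sketch of the same idea (my own variant, not part of the answer above): prefix the two columns so node labels from grp1 and grp2 cannot collide, then read the group straight off the components:
library(igraph)
# prefix values so grp1 and grp2 nodes can't collide
edges <- cbind(paste0("g1_", df$grp1), paste0("g2_", df$grp2))
g2 <- graph_from_edgelist(edges, directed = FALSE)
df$combinedGrp <- components(g2)$membership[paste0("g1_", df$grp1)]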
One way to do this is via a matrix that defines links between rows based on group membership.
This approach is related to @Frank's graph answer but uses an adjacency matrix rather than an edge list to define the graph. An advantage of this approach is that the same code immediately handles more than two grouping columns (so long as you write the function that determines links flexibly). A disadvantage is that you need to make all pairwise comparisons between rows to construct the matrix, so for very long vectors it could be slow. As is, @Frank's answer would work better for very long data, or if you only ever have two columns.
The steps are:
1. compare rows based on groups and define linked rows (i.e., create a graph)
2. determine the connected components of the graph defined by the links in 1.
You could do step 2 a few ways. Below I show a brute-force way where you 2a) collapse links until reaching a stable link structure using matrix multiplication, and 2b) convert the link structure to a factor using hclust and cutree. You could also use igraph::clusters on a graph created from the matrix.
1. construct an adjacency matrix (matrix of pairwise links) between rows (i.e., if two rows share a group, the matrix entry is 1; otherwise it's 0). First, make a helper function that determines whether two rows are linked:
linked_rows <- function(data){
  ## helper function
  ## returns a _function_ to compare two rows of data
  ## based on group membership.
  ## Use Vectorize so it works even on vectors of indices
  Vectorize(function(i, j) {
    ## numeric: 1 = i and j have overlapping group membership
    common <- vapply(names(data), function(name)
      data[i, name] == data[j, name],
      FUN.VALUE = FALSE)
    as.numeric(any(common))
  })
}
which I use in outer to construct a matrix:
rows <- 1:nrow(df)
A <- outer(rows, rows, linked_rows(df))
2a. collapse 2-degree links to 1-degree links. That is, if two rows are linked via an intermediate node but not directly linked, lump them into the same group by defining a link between them. One iteration involves: i) matrix multiplication to get the square of A, and ii) setting any non-zero entry in the squared matrix to 1 (as if it were a first-degree, pairwise link):
## define as a function to use below
lump_links <- function(A) {
  A <- A %*% A
  A[A > 0] <- 1
  A
}
Repeat this until the links are stable:
oldA <- 0
while (any(oldA != A)) {
  oldA <- A
  A <- lump_links(A)
}
2b. Use the stable link structure in A to define groups (connected components of the graph). You could do this a variety of ways.
One way is to first define a distance object, then use hclust and cutree. If you think about it, we want to define linked (A[i,j] == 1) as distance 0. So the steps are: a) define linked as distance 0 in a dist object, b) construct a tree from the dist object, and c) cut the tree at zero height (i.e., zero distance):
df$combinedGrp <- cutree(hclust(as.dist(1 - A)), h = 0)
df
In practice you can encode steps 1 and 2 in a single function that uses the helpers linked_rows and lump_links:
lump <- function(df) {
  rows <- 1:nrow(df)
  A <- outer(rows, rows, linked_rows(df))
  oldA <- 0
  while (any(oldA != A)) {
    oldA <- A
    A <- lump_links(A)
  }
  df$combinedGrp <- cutree(hclust(as.dist(1 - A)), h = 0)
  df
}
This works for the original df and also for the structure in @rawr's answer:
df <- data.frame(grp1 = c(1,1,1,2,2,2,3,3,3,4,4,4,5,5,6,7,8,9),
                 grp2 = c(1,2,3,3,4,5,6,7,8,6,9,10,11,3,12,3,6,12))
lump(df)
grp1 grp2 combinedGrp
1 1 1 1
2 1 2 1
3 1 3 1
4 2 3 1
5 2 4 1
6 2 5 1
7 3 6 2
8 3 7 2
9 3 8 2
10 4 6 2
11 4 9 2
12 4 10 2
13 5 11 1
14 5 3 1
15 6 12 3
16 7 3 1
17 8 6 2
18 9 12 3
PS
Here's a version using igraph, which makes the connection with @Frank's answer clearer:
lump2 <- function(df) {
  rows <- 1:nrow(df)
  A <- outer(rows, rows, linked_rows(df))
  cluster_A <- igraph::clusters(igraph::graph.adjacency(A))
  df$combinedGrp <- cluster_A$membership
  df
}
Hope this solution helps you a bit.
Assumption: df is ordered by grp1.
## split the dataset using the values of grp1
split_df <- split.default(df$grp2, df$grp1)
parent <- vector('integer', length(split_df))

## find out which combinations have values of grp2 in common
for (i in seq(1, length(split_df) - 1)) {
  for (j in seq(i + 1, length(split_df))) {
    inter <- intersect(split_df[[i]], split_df[[j]])
    if (length(inter) > 0) {
      parent[j] <- i
    }
  }
}

ans <- vector('list', length(split_df))
index <- which(parent == 0)

## index contains indices of elements that have no elements in common
for (i in seq_along(index)) {
  ans[[index[i]]] <- rep(i, length(split_df[[index[i]]]))
}

rest_index <- seq(1, length(split_df))[-index]
for (i in rest_index) {
  val <- ans[[parent[i]]][1]
  ans[[i]] <- rep(val, length(split_df[[i]]))
}

df$combinedGrp <- unlist(ans)
df
grp1 grp2 combinedGrp
1 1 1 1
2 1 2 1
3 1 3 1
4 2 3 1
5 2 4 1
6 2 5 1
7 3 6 2
8 3 7 2
9 3 8 2
10 4 6 2
11 4 9 2
12 4 10 2
Based on https://stackoverflow.com/a/35773701/2152245, I used a different igraph approach, because I already had an adjacency matrix of sf polygons from st_intersects():
library(igraph)
library(sf)
library(dplyr)

# Use example data
nc <- st_read(system.file("shape/nc.shp", package = "sf"))
nc <- nc[-sample(1:nrow(nc), nrow(nc) * .75), ] # drop some polygons

# Find intersections
b <- st_intersects(nc, sparse = F)
g <- graph.adjacency(b)
clu <- components(g)
gr <- groups(clu)

# Quick loop to assign the groups
for (i in 1:nrow(nc)) {
  for (j in 1:length(gr)) {
    if (i %in% gr[[j]]) {
      nc[i, 'group'] <- j
    }
  }
}

# Make a new sfc object
nc_un <- group_by(nc, group) %>%
  summarize(BIR74 = mean(BIR74), do_union = TRUE)
plot(nc_un['BIR74'])
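As an aside (an observation of mine, not part of the original answer), the double loop can likely be replaced by a direct assignment, since components() returns one membership id per node and the nodes are in row order:
# one membership entry per polygon, in row order
nc$group <- clu$membership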

Project R: Variable "depth" of for-loops; generalization

Thank you for helping me out with this one. It's a problem that has been bothering me for quite some time now. I feel like I am so close to the answer, but not quite there yet.
The issue is the following:
Suppose I want to combine every possible combination of m elements from n vectors and store the result of, let's say, a multiplication. For the 2-D problem I need two nested for loops:
dim2_Matrix <- matrix(0, nrow = 2, ncol = 3)
for (i in 1:2){
  for (j in 1:3){
    dim2_Matrix[i,j] <- i*j
  }
}
The inner loop runs through all 3 items, multiplying each by the current item of the outer loop. Once that is done, i is increased and the inner loop starts from j=1 again. We get 2*3 = 6 combinations.
Now let's raise that to a 3-D problem. We need a third loop for that:
dim3_Matrix <- array(0, dim = c(2,3,4))
for (i in 1:2){
  for (j in 1:3){
    for (k in 1:4){
      dim3_Matrix[i,j,k] <- i*j*k
    }
  }
}
It runs the innermost loop 4 times, increases the middle one, runs the 4 inner iterations again... until we get 2*3*4 = 24 combinations in a 3-D array.
I could continue like this with dim4, dim5 etc.
My problem now is that I want to keep the script flexible. Sometimes I'll want to combine 2 vectors, sometimes 3, sometimes n. Suppose I know how many "layers" there are before the loops are run: how do I get a generalized form of this?
I have gotten as far as this:
n_dimensions <- 3       # specify n° of dimensions
m_Elements <- c(2,3,4)  # 2 elements in 1st dim, 3 in 2nd, 4 in 3rd
for (i in 1:n_dimensions){
  for (j in 1:m_Elements[i]){
    # ...
  }
}
But this will go like:
i1 j1 --> i1 j2
i2 j1 --> i2 j2 --> i2 j3
i3 j1 --> i3 j2 --> i3 j3 --> i3 j4
so this is 2 + 3 + 4 combinations instead of 2*3*4.
Please note: multiplying is only an example. Storing the results in a matrix/tensor is not the main problem; the question is how to nest the loops and how to generalize that.
Thanks for reading through, I hope you get what I mean!
You can try something like this:
X <- list(1:2, 1:3, 1:4) # one entry for each dimension
Z <- expand.grid(X)
Z looks like:
Var1 Var2 Var3
1 1 1 1
2 2 1 1
3 1 2 1
4 2 2 1
5 1 3 1
6 2 3 1
7 1 1 2
8 2 1 2
9 1 2 2
10 2 2 2
11 1 3 2
12 2 3 2
13 1 1 3
14 2 1 3
15 1 2 3
16 2 2 3
17 1 3 3
18 2 3 3
19 1 1 4
20 2 1 4
21 1 2 4
22 2 2 4
23 1 3 4
24 2 3 4
So now you have every combination in a data.frame, and you can use apply functions or something similar to do what you need to do. Such as:
apply(Z,1,prod)
[1] 1 2 2 4 3 6 2 4 4 8 6 12 3 6 6 12 9 18 4 8 8 16 12 24
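If you'd rather have those results back in an n-dimensional array, note that expand.grid varies its first column fastest, which matches R's column-major array filling order, so (a sketch; dim3_Matrix is the loop-filled array from the question):
res <- array(apply(Z, 1, prod), dim = c(2, 3, 4))
all.equal(res, dim3_Matrix)
# [1] TRUE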
Your code is equivalent to:
dim2_Matrix = outer(1:2, 1:3)
dim3_Matrix = outer(dim2_Matrix, 1:4)
Which can be generalized to:
dim_n_Matrix <- function(n) {
  x <- 1:2
  if (n > 1) {
    for (i in 2:n) {
      x <- outer(x, 1:(i + 1))
    }
  } else {
    x <- matrix(1:2, nrow = 1)
  }
  return(x)
}
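To decouple this from the hard-coded dimension sizes, you could also chain outer with Reduce (a generalization of my own, not part of the original answer; make_prod_array is a hypothetical name):
# build the product array for any vector of dimension sizes
make_prod_array <- function(m_elements) Reduce(outer, lapply(m_elements, seq_len))
all.equal(make_prod_array(c(2, 3, 4)), dim3_Matrix)
# [1] TRUE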

In R, generating every possible solution to a model, based on constraints

In R, I'm trying to generate a matrix that shows the results from a model and the values used to solve it, all of which are constrained. Every possible solution. An example model:
Model = a^2 + b^2 + c^2 + d^2
Where:
20 ≤ Model ≤ 30
a = 1
2 ≤ b ≤ 3
2 ≤ c ≤ 3
3 ≤ d ≤ 4
I’d like the output to look like this:
[a] [b] [c] [d] [Model]
[1] 1 3 2 3 23
[2] 1 2 2 4 25
[3] 1 3 3 3 28
[4] 1 2 3 3 23
Order doesn't matter. I just want the full permutation of feasible [integer] values. Any packages or help you could point my way?
In my example case, I want to generate all possible inputs (a, b, c, d) that remain valid, based on the parameters I set. I only want values of my output equation (Model) between 20 and 30. In this case, only 4 solutions are possible based on the criteria I'm setting.
Assuming you're only looking for integer solutions, you can use expand.grid()
dd <- expand.grid(a=1, b=2:3, c=2:3, d=3:4)
m <- with(dd, a^2+b^2+c^2+d^2)
inside <- function(x, a,b) a<=x & x<=b
cbind(dd, m)[inside(m, 20, 30),]
# a b c d m
# 2 1 3 2 3 23
# 3 1 2 3 3 23
# 4 1 3 3 3 28
# 5 1 2 2 4 25
# 6 1 3 2 4 30
# 7 1 2 3 4 30
(You said you want values ≤ 30, but you seem to have left out the 30s in your example; you can change the inside() function if you want an interval that excludes the upper bound.)
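For instance, a variant that excludes the upper bound reproduces exactly the four rows listed in the question (inside_open is a hypothetical name):
inside_open <- function(x, a, b) a <= x & x < b
cbind(dd, m)[inside_open(m, 20, 30), ]
# a b c d m
# 2 1 3 2 3 23
# 3 1 2 3 3 23
# 4 1 3 3 3 28
# 5 1 2 2 4 25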

Assign pass/fail value based on mean in large dataset

This might be a simple question, but I was hoping someone could point me in the right direction. I have a sample dataset:
dfrm <- list(L = c("A","B","P","C","D","E","P","F"), J = c(2,2,1,2,2,2,1,2), K = c(4,3,10,16,21,3,17,2))
dfrm <- as.data.frame(dfrm)
dfrm
L J K
1 A 2 4
2 B 2 3
3 P 1 10
4 C 2 16
5 D 2 21
6 E 2 3
7 P 1 17
8 F 2 2
Column J specifies the type of variable that is defined in K. I want to be able to take the mean of the K values that have a 1 assigned next to them. In this example those would be 10 and 17:
T = c(10,17)
mean(T)
13.5
Next I want to be able to assign a pass/fail rank, where pass = 1 and fail = 0, to identify whether the number in column K is larger than the mean.
The final data set should look like:
cdfrm <- list(L = c("A","B","P","C","D","E","P","F"), J=c(2,2,1,2,2,2,1,2), K=c(4,3,10,16,21,3,17,2),C = c(0,0,0,1,1,0,1,0))
cdfrm <-as.data.frame(cdfrm)
cdfrm
L J K C
1 A 2 4 0
2 B 2 3 0
3 P 1 10 0
4 C 2 16 1
5 D 2 21 1
6 E 2 3 0
7 P 1 17 1
8 F 2 2 0
This seems so basic. I am sorry guys, I just don't know why I am overthinking it.
There are two steps in the solution. The first is to calculate the mean for the values you are interested in, in other words the mean of a subset of values in your data.frame. R has a handy function to calculate subsets, called subset. Here it is in action (note drop = TRUE, which returns a plain vector instead of a one-column data.frame, since mean() needs a numeric vector):
meanK <- mean(subset(dfrm, subset = J == 1, select = K, drop = TRUE))
meanK
[1] 13.5
Next, you want to compare column K in your data frame with the mean value we have just calculated. This is a straightforward vector comparison:
dfrm$Pass <- dfrm$K>meanK
dfrm
L J K Pass
1 A 2 4 FALSE
2 B 2 3 FALSE
3 P 1 10 FALSE
4 C 2 16 TRUE
5 D 2 21 TRUE
6 E 2 3 FALSE
7 P 1 17 TRUE
8 F 2 2 FALSE
Here's how to do it in one line:
transform(dfrm, C = K > sapply(split(dfrm$K, dfrm$J), mean)[J])
split groups the values of K according to the values of J, and sapply(..., mean) calculates the group-wise means.
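To see the intermediate piece (a quick illustration, run on the dfrm above):
sapply(split(dfrm$K, dfrm$J), mean)
#         1         2
# 13.500000  8.166667
The [J] at the end then expands this lookup table back to one value per row, so each K is compared against the mean of its own J group.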
