I have a list of elemental compositions and I'd like to display a count for the number of times an element is included in a composition mapped onto the periodic table (e.g. CH4 would increase the count on H and C by one).
How can I do this with ggplot? Is there a map I can use?
With a bit of searching I found information about the periodic table in this example code project. They had an Access Database with element information. I've exported it to this gist. You can import the data using the httr library with
library(httr)
dd <- read.table(text=content(GET("https://gist.githubusercontent.com/MrFlick/c1183c911bc5398105d4/raw/715868fba2d0d17a61a8081de17c468bbc525ab1/elements.txt")), sep=",", header=TRUE)
(You should probably create your own local version for easier loading in the future.)
Then your other challenge is decomposing something like "CH4" into the raw element counts. I've created this helper function which I think does what you need.
decompose <- function(x) {
    # find every element symbol together with its optional count
    m <- gregexpr("([A-Z][a-z]?)(\\d*)", x, perl=TRUE)
    dx <- Map(function(x, y) {
        ElementSymbol <- gsub("\\d","", x)
        cnt <- as.numeric(gsub("\\D","", x))
        cnt[is.na(cnt)] <- 1   # a symbol without a number counts once
        cbind(Sym=y, as.data.frame(xtabs(cnt~ElementSymbol)))
    }, regmatches(x,m), x)
    do.call(rbind, dx)
}
Here I test the function:
test_input <- c("H2O","CH4")
decompose(test_input)
#   Sym ElementSymbol Freq
# 1 H2O             H    2
# 2 H2O             O    1
# 3 CH4             C    1
# 4 CH4             H    4
Now we can combine the data and the reference information to make a plot
library(ggplot2)
ggplot(merge(decompose("CH4"), dd), aes(Column, -Row)) +
    geom_tile(data=dd, aes(fill=GroupName), color="black") +
    geom_text(aes(label=Freq))
Clearly there are opportunities for improvement but this should give you a good start.
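The question asks for the number of compositions that contain each element, rather than the atom counts within a single formula. A minimal sketch of that tally follows, assuming simple formulas without parentheses or hydrates; the `compositions` vector and the `counts` data frame are made up for illustration.

```r
# Hypothetical input: one formula per composition.
compositions <- c("CH4", "H2O", "CO2")

# Extract the element symbols found in each formula.
symbols <- regmatches(compositions, gregexpr("[A-Z][a-z]?", compositions))

# Count each element at most once per composition, then tabulate.
presence <- table(unlist(lapply(symbols, unique)))
counts <- data.frame(ElementSymbol = names(presence),
                     Count = as.vector(presence))
counts
```

If `dd` has an `ElementSymbol` column, as the `merge()` call in the answer implies, `merge(counts, dd)` should give a data frame that can be plotted the same way with `aes(label=Count)`.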
You might look for a more robust decomposition function. Looks like the CHNOSZ package has one
library(CHNOSZ)
decompose <- function(x) {
    do.call(`rbind`, lapply(x, function (x) {
        z <- makeup(x)
        cbind(data.frame(ElementSymbol = names(z),Freq=z), Sym=x)
    }))
}
ggplot(merge(decompose("CaAl2Si2O7(OH)2*H2O"), dd), aes(Column, -Row)) +
    geom_tile(data=dd, aes(fill=GroupName), color="black") +
    geom_text(aes(label=Freq))
I'm trying to write a function that takes in p1 probabilities for Mahalanobis distances and returns p2 probabilities. The formula for p2, along with a worked example, is given on the IBM website. I have written a function (below) that solves the problem and allows me to reproduce the p2 values given in the worked example on that webpage.
p1_to_p2 <- function(p1,N) {
    p2 <- numeric(length(p1))
    for (i in seq_along(p1)) {
        k <- i
        p1_value <- p1[i]
        start_value <- 1
        while (k >= 1) {
            start_value <- start_value - choose(N,N-k+1) * (1-p1_value)^(N-k+1) * (p1_value)^(k-1)
            k <- k-1
        }
        p2[i] <- start_value
    }
    p2
}
p1 <- c(.0046132,.0085718,.0390278,.0437704,.0475222)
N <- 73
Although the function works, it's been suggested to me by a colleague that it's inefficient/poorly written as it's not vectorized. This is indeed potentially relevant since in general we will be converting a lot more than just 5 p1 values to p2 values.
I have some limited experience vectorizing code, but I am wondering if a vectorized solution is possible in this context since within the loop the variable start_value constantly needs to update itself. If vectorization is not possible, is there some other way I should improve the code so that it works better?
Here is one way to do it. Breaking the calculation into steps can help (please read the comments):
N <- 73
p1 <- c(.0046132,.0085718,.0390278,.0437704,.0475222)
n <- N:(N-length(p1)+1)
# code:
mahalanobis_dist <- function(x, n){
    m <- max(n)
    # running products that give choose(m, 0), choose(m, 1), ...
    max_min <- Reduce(`*`,c(1, n[-length(n)]), accumulate = TRUE)
    acc <- c(1, Reduce(`*`, seq_along(n), accumulate = TRUE)[-length(n)])
    comns <- max_min/acc
    terms <- comns*((1 - x)^n)*(x^(m - n))
    return(1 - sum(terms))
} ## computes p2 from one p1 value and its sequence of exponents
## This is just an iterator for each of the sequences we have to run the above function
ls <- lapply(n, function(x)(max(n):x))
## creating a list of iterators
## applying mapply, mapply or Map can iterate multiple inputs of the function,
## here the input p1 and ls , p1 is your input points, ls is the iterator created above
mapply(mahalanobis_dist,p1, ls)
## Applying the function on each iterators
#> mapply(mahalanobis_dist,p1, ls)
#[1] 0.2864785 0.1299047 0.5461263 0.3973690
#[5] 0.2662369
The last two steps can also be combined into a single call:
mapply(mahalanobis_dist,p1, lapply(n, function(x)(max(n):x)))
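Each term subtracted in the question's loop has the form choose(N, j) * p^j * (1 - p)^(N - j) with j = k - 1, which is the binomial probability `dbinom(j, N, p)`. The whole loop for position i therefore collapses to an upper-tail binomial probability. The sketch below relies on that identity and assumes, as in the question, that `p1` is already ordered so that position i is the rank; the names `p1_to_p2_loop` and `p1_to_p2_vec` are made up for illustration.

```r
# Loop version from the question, kept as a reference implementation.
p1_to_p2_loop <- function(p1, N) {
  p2 <- numeric(length(p1))
  for (i in seq_along(p1)) {
    total <- 1
    for (k in seq_len(i)) {
      total <- total - choose(N, N - k + 1) * (1 - p1[i])^(N - k + 1) * p1[i]^(k - 1)
    }
    p2[i] <- total
  }
  p2
}

# Vectorized version: the k-th term equals dbinom(k - 1, N, p1[i]),
# so the running sum is the binomial CDF evaluated at i - 1.
p1_to_p2_vec <- function(p1, N) {
  pbinom(seq_along(p1) - 1, size = N, prob = p1, lower.tail = FALSE)
}

p1 <- c(.0046132, .0085718, .0390278, .0437704, .0475222)
N <- 73
all.equal(p1_to_p2_loop(p1, N), p1_to_p2_vec(p1, N))
```

Because `pbinom()` is vectorized over both the quantile and the probability, no explicit loop or `mapply()` is needed.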
I am using the code below to list peaks and valleys.
x_last <- as.numeric(series[1])
x <- as.numeric(series[2])
d_last <- (x-x_last)
series[1:2] <- NULL
output <- list()
for (x_next in series){
    if (x_next == x){
        next   # skip repeated values
    }
    d_next <- (x_next - x)
    # a sign change in the slope marks a peak or a valley
    if (d_last * d_next < 0){
        output <- append(output, x)
    }
    x_last <- x
    x <- x_next
    d_last <- d_next
}
Here Output (a list) contains "continuous peaks and valleys":
Output <- c(41.49916, 37.92029, 39.86477, 39.86432, 39.95672, 39.95465, 39.96144, 39.83994, 40.43357, 40.11285, 40.82250, 39.37034, 58.82975, 42.19894)
and so on.
[Figure: graph plotted from the Output list]
My question is: how can I add a threshold to this code? In other words, how can I remove small peaks and valleys (values less than 1)? I need continuous peaks and valleys.
Looking for answers. Thank you in advance.
If you just want to plot your data:
You could plot this with ggplot2 and add a geom_smooth() layer. It defaults to method "loess" which is kind of a "do-the-right-thing" smoother for small datasets.
library(ggplot2)
dat <- data.frame(y=c(41.49916, 37.92029, 39.86477, 39.86432, 39.95672, 39.95465, 39.96144, 39.83994, 40.43357, 40.11285, 40.82250, 39.37034, 58.82975, 42.19894))
dat$x <- seq_along(dat$y)
ggplot(dat, aes(x, y)) +
    geom_line() +
    geom_smooth(method="loess", se=FALSE)
Or would you rather smooth the data yourself? (Your data series is quite short for that.) Do you need an equation for the fit? It's easy to spend quite some time on that.
I don't fully understand this "peak/valley" stuff. In any case, take a look at the diff() function. Maybe this helps:
dat <- data.frame(y=c(41.49916, 37.92029, 39.86477, 39.86432, 39.95672, 39.95465, 39.96144, 39.83994, 40.43357, 40.11285, 40.82250, 39.37034, 58.82975, 42.19894))
dat[which(diff(dat$y) < 0.01)+1,"y"] <- NA
round(dat$y, 2)
#  [1] 41.50    NA 39.86    NA 39.96    NA    NA    NA 40.43    NA 40.82    NA
# [13] 58.83    NA
Here I've used a threshold of 0.01.
I'm not sure if it's the right thing. But you can adapt this code for your needs.
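As a rough illustration of how `diff()` can locate peaks and valleys without an explicit loop, the sketch below flags the interior points where the slope changes sign. It assumes no two neighbouring values are equal; `turning_points` is a hypothetical helper name, not something from the question.

```r
# Indices of interior points where the first difference changes sign.
turning_points <- function(y) {
  s <- sign(diff(y))        # +1 for a rise, -1 for a fall
  which(diff(s) != 0) + 1   # shift by one to index into y
}

y <- c(1, 3, 2, 5, 4)
turning_points(y)
# [1] 2 3 4
```

A threshold could then be applied to the differences between successive turning points, for example with `abs(diff(y[turning_points(y)]))`.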
In the end I wrote a function that removes small cycles while keeping the peak/valley sequence intact. It works perfectly for me.
hysteresis <- function(series, min_range){
    # hysteresis function will remove cycles within the magnitude of min_range
    # series: list of values with continuous peaks & valleys.
    series <- unlist(series)
    f <- series[1]
    org <- f
    series <- series[2:length(series)]
    for(i in series){
        val <- abs(i-f)
        if(val > min_range){
            org <- c(org,i)
            f <- i
        } else {
            # else branch is used to maintain peak and valley
            org <- org[1:(length(org)-1)]
            f <- org[length(org)]
        }
    }
    org
}
I have a data frame that looks like this:
data <- runif(1000)
utility <- sample(c("abc","bcd","cde","def"),1000,replace=TRUE)
stage <- sample(c("vwx","wxy","xyz"),1000,replace=TRUE)
x <- data.frame(data,utility,stage)
data utility stage
1 0.9148060 def xyz
2 0.9370754 abc wxy
3 0.2861395 def xyz
4 0.8304476 cde xyz
5 0.6417455 bcd xyz
6 0.5190959 abc xyz
and I want to generate cumulative distribution functions for the unique combinations of utility and stage. In my real application I'll end up generating about 100 cdfs but this random data will have 12 (4x3) unique combinations. But I'll be using each of those cdfs thousands of times, so I don't want to calculate the cdf on the fly each time. The ecdf() function works exactly as I'd like, except I'd need to vectorize it. The following code doesn't work, but it's the gist of what I'm trying to do:
ecdf_multiple <- function(x) {
    utilities <- levels(x$utilities)
    stages <- levels(x$stages)
    for(utility in utilities) {
        for(stage in stages) {
            i <- i + 1
            y <- ecdf(x[x$utilities == utility & x$stage == stage,1])
            # calculate ecdf for the unique util/stage combo
            z[i] <- list(y,utility,stage)
            # then assign it to a data element (list, data frame, json, whatever) note-this doesn't actually work
        }
    }
    z # return value
}
So after running ecdf_multiple and assigning the result to a variable, I'd reference that variable somehow by passing a value (for which I want the cdf), the utility and the stage.
Is there a way to vectorize the ecdf function (or use/build another) so that I can use the output several times without needing to generate the distributions over and over?
-------Added to respond to @Pascal's excellent suggestion.-------
How might one expand this to a more general case of taking "n" dimensions of categories? This is my stab, based on Pascal's case of two dimensions. Notice how I tried to assign "y":
data <- runif(1000)
utility <- sample(c("abc","bcd","cde","def"),1000,replace=TRUE)
stage <- sample(c("vwx","wxy","xyz"),1000,replace=TRUE)
openclose <- sample(c("open","close"),1000,replace=TRUE)
x <- data.frame(data,utility,stage,openclose)
numlabels <- length(names(x))-1
y <- split(x, list(x[,2:(numlabels+1)]))
l <- lapply(y,function(x) ecdf(x[,"data"]))
utility <- "abc"
stage <- "xyz"
openclose <- "close"
comb <- paste(utility, stage, openclose, sep = ".")
# call the function
During the assignment of "y" above, I get this error message:
"Error in sort.list(y) : 'x' must be atomic for 'sort.list'
Have you called 'sort' on a list?"
The following might help:
# we create a list of criteria by excluding
# the first column of the data.frame
y <- split(x, as.list(x[,-1]))
l <- lapply(y, function(x) ecdf(x[,"data"]))
utility <- "abc"
stage <- "xyz"
comb <- paste(utility, stage, sep = ".")
# evaluate the stored ecdf at the value of interest v, i.e. l[[comb]](v)
# [1] 0.2613636
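A self-contained sketch of the same idea follows: build every ecdf once, store them in a named list, and look them up by label afterwards. The seed, the value 0.25 and the helper `cdf_lookup` are assumptions made for illustration; they are not taken from the answer.

```r
set.seed(42)  # arbitrary seed, only for reproducibility
x <- data.frame(
  data    = runif(1000),
  utility = sample(c("abc", "bcd", "cde", "def"), 1000, replace = TRUE),
  stage   = sample(c("vwx", "wxy", "xyz"), 1000, replace = TRUE)
)

# One ecdf per utility/stage combination, computed once.
# split() names each piece "<utility>.<stage>".
cdfs <- lapply(split(x$data, x[c("utility", "stage")]), ecdf)

# Hypothetical helper: find the stored ecdf by its labels and evaluate it.
cdf_lookup <- function(cdfs, value, ...) {
  key <- paste(..., sep = ".")
  cdfs[[key]](value)
}

cdf_lookup(cdfs, 0.25, "abc", "xyz")
```

Because the labels are passed through `...`, the same helper works for any number of category columns, provided they are given in the order used in `split()`.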
We are given a huge set of points in 2D plane. We need to find, for each point the closest point within the set. For instance suppose the initial set is as follows:
foo <- data.frame(x=c(1,2,4,4,10),y=c(1,2,4,4,10))
The output should be like this:
2
1
4
3
3 # (could be 4 also)
Any idea?
The traditional approach is to preprocess the data
and put it in a data structure, often a K-d tree,
for which the "nearest point" query is very fast.
There is an implementation in the nnclust package.
library(nnclust)
foo <- cbind(x=c(1,2,4,4,10),y=c(1,2,4,4,10))
i <- nnfind(foo)$neighbour
plot(foo)
arrows( foo[,1], foo[,2], foo[i,1], foo[i,2] )
Here is an example; all wrapped into a single function. You might want to split it a bit for optimization.
ClosesPair <- function(foo) {
    # Euclidean distance between rows i and j
    dist <- function(i, j) {
        sqrt((foo[i,1]-foo[j,1])**2 + (foo[i,2]-foo[j,2])**2)
    }
    foo <- as.matrix(foo)
    ClosestPoint <- function(i) {
        indices <- 1:nrow(foo)
        indices <- indices[-i]
        distances <- sapply(indices, dist, i=i, USE.NAMES=TRUE)
        closest <- indices[which.min(distances)]
        closest
    }
    sapply(1:nrow(foo), ClosestPoint)
}
ClosesPair(foo)
# [1] 2 1 4 3 3
Of course, it does not handle ties very well.
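For moderately sized sets, the same brute-force search can be written with a full distance matrix in base R. This is only a sketch: it needs memory proportional to the square of the number of points, so it is not suitable for the "huge" case in the question, and `nearest_neighbour` is a made-up name.

```r
nearest_neighbour <- function(pts) {
  d <- as.matrix(dist(pts))   # all pairwise Euclidean distances
  diag(d) <- Inf              # a point must not match itself
  unname(apply(d, 1, which.min))
}

foo <- data.frame(x = c(1, 2, 4, 4, 10), y = c(1, 2, 4, 4, 10))
nearest_neighbour(foo)
# [1] 2 1 4 3 3
```

As with the function above, `which.min()` returns the first minimum, so ties are resolved in favour of the lower index.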
Use the spatstat package. It has built-in functions for this sort of thing.
As we all know R isn't the most efficient platform to run large analyses.
If I had a large data frame containing three parameters:
GROUP X Y
A     1 2
A     2 2
A     2 3
B     1 1
B     2 3
B     1 4
... (millions of rows)
and I wanted to run a computation on each group (e.g. compute Pearson's r on X,Y) and store the results in a new data frame, I can do it like this:
df <- loadDataFrameFrom( someFile )
results <- data.frame()
for ( g in unique( df$GROUP ) ){
    gdf <- subset( df, df$GROUP == g )
    partialRes <- slowStuff( gdf$X, gdf$Y )
    results <- rbind( results, data.frame( GROUP = g, RES = partialRes ) )
}
# results contains all the results here.
The obvious problem is that this is VERY slow, even on a powerful multi-core machine.
My question is: is it possible to parallelise this computation, having for example a separate thread for each group or a block of groups?
Is there a clean R pattern to solve this simple divide et impera problem?
First off, R is not necessarily slow. Its speed depends largely on using it correctly, just like any language. There are a few things that can speed up your code without altering much: preallocate your results data.frame before you begin; use a list and matrix or vector construct instead of a data.frame; switch to use data.table; the list goes on, but The R Inferno is an excellent place to start.
Also, take a look here. It provides a good summary on how to take advantage of multi-core machines.
The "clean R pattern" was succinctly solved by Hadley Wickham with his plyr package and specifically ddply:
ddply(df, .(GROUP), your.function, .parallel=TRUE)
However, it is not necessarily fast. You can use something like:
mclapply(unique(df$GROUP), function(x, df) ...)
Or finally, you can use the foreach package:
foreach(g = unique(df$GROUP), ...) %dopar% { ... }
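To make the `mclapply` variant concrete, here is a minimal sketch that splits the data by group and computes one correlation per piece. The toy data, the choice of two cores and the use of `cor()` in place of the question's `slowStuff` are all assumptions; `mclapply` relies on forking, so the sketch falls back to a single core on Windows.

```r
library(parallel)

set.seed(1)  # arbitrary seed, only for reproducibility
df <- data.frame(GROUP = sample(c("A", "B", "C"), 300, replace = TRUE),
                 X = runif(300), Y = runif(300))

# Forking is not available on Windows, so use one core there.
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L

# Split once, then process each group independently.
pieces <- split(df, df$GROUP)
res <- mclapply(pieces, function(d) cor(d$X, d$Y), mc.cores = n_cores)

# Assemble the results in one step instead of rbind-ing inside a loop.
results <- data.frame(GROUP = names(res), RES = unlist(res), row.names = NULL)
results
```

Whether this beats the serial version depends on how expensive the per-group function is; for something as cheap as `cor()` the forking overhead may dominate.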
To back up my comment: 10 million rows, 26 groups. Done in < 3 seconds on a single-core 3.3 GHz CPU, using only base R. No parallelization needed.
> set.seed(21)
> x <- data.frame(GROUP=sample(LETTERS,1e7,TRUE),X=runif(1e7),Y=runif(1e7))
> system.time( y <- do.call(rbind, lapply(split(x,x$GROUP),
+ function(d) data.frame(GROUP=d$GROUP[1],cor=cor(d$X,d$Y)))) )
   user  system elapsed
   2.37    0.56    2.94
> y
  GROUP           cor
A     A  2.311493e-03
B     B -1.020239e-03
C     C -1.735044e-03
D     D  1.355110e-03
E     E -8.027199e-04
F     F  8.234086e-04
G     G  2.337217e-04
H     H -5.861781e-04
I     I  7.799191e-04
J     J  1.063772e-04
K     K  7.174137e-04
L     L  4.151059e-04
M     M  4.440694e-04
N     N  2.568411e-03
O     O -3.827366e-04
P     P -1.239380e-03
Q     Q -1.057020e-03
R     R  1.079676e-03
S     S -1.819232e-03
T     T -3.577533e-04
U     U -1.084114e-03
V     V  6.686503e-05
W     W -1.631912e-03
X     X  8.668508e-04
Y     Y -6.460281e-04
Z     Z  1.614978e-03
By the way, parallelization will only help if your slowStuff function is the bottleneck. Your use of rbind in a loop is likely the bottleneck, unless you do something similar in slowStuff.
I think the slowness is partly due to writing non-idiomatic R. The following gives you the correlations per group (I used the mtcars data set and split it by cyl) and does it pretty fast:
by(mtcars, mtcars$cyl, cor)