R: Histogram with large data set

Goal
From the data.frame d, I am trying to make a histogram of the column cMPerSite weighted by bpInPiece. In other words, bpInPiece is the number of observations at each cMPerSite value.
The Y-axis should represent densities and the X-axis should be on a log scale.
Attempts
I could expand the data so that each cMPerSite value is repeated bpInPiece times (the loop could be improved by pre-allocating x):
x = c()
for (row in 1:nrow(d))
{
x = c(x, rep(d$cMPerSite[row],d$bpInPiece[row]))
}
hist(x,breaks=100, freq=FALSE)
but this becomes completely impractical with too much data (my full data set has about 10 million rows) because x becomes too large to fit in RAM. Also, putting the X-axis on a log scale with this approach is, I think, bound to be a bit messy.
Alternatively, I would have thought I could do
ggplot(d) + geom_histogram(aes(x = cMPerSite, y=bpInPiece), stat="identity") + scale_x_log10() + theme_classic(25)
Warning: Ignoring unknown parameters: binwidth, bins, pad
but, for some reason I do not understand, nothing gets displayed. Also, I am not sure how to put the Y-axis in density rather than count.
I suppose the bin width should vary logarithmically along the X-axis, but that confuses me, as the wider bins would gather an "artificially" high number of observations. I am not sure how histograms are typically displayed with a log-scale X-axis. Note that ggplot(d) + geom_histogram(aes(x = cMPerSite, y=bpInPiece), stat="identity") does not display anything either, so the problem is not only the log scale on the X-axis.
Can you help me to make this histogram?
Subset of my data
structure(list(chrom = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1), end = c(241608,
612298, 715797, 956634, 983330, 1190613, 1236417, 1330208, 1391915,
1464000, 1911436, 1913462, 2092038, 2169783, 2354812, 2363639,
2544241, 2551672, 2575287, 2589721, 2659117, 2884565, 3037319,
3100967, 3152276, 4319658, 4335072, 6301896, 6550219, 6596684,
7132319, 7435267, 7469158, 7604030, 7937619, 8131876, 9359659,
9598491, 9945959, 10262757, 10392172, 10646861, 10816847, 11094415,
11360199, 11964985, 12220179, 12222166, 12389943), cMInPiece = c(0,
1e-07, 1e-07, 0.7118558, 9.99999999473644e-08, 0.9540829, 9.99999998363421e-08,
0.4967211, 1.244988, 0.2137991, 8.808171, 0.500545200000001,
1.5721302, 1.6856566, 2.2552469, 1.0000000116861e-07, 2.6973586,
0.355113100000001, 0.355233800000001, 1.0000000116861e-07, 1.4903822,
2.8174978, 1.0000000116861e-07, 0.355231, 1.0000000116861e-07,
8.2735924, 0.425817699999996, 6.4568106, 0.372779399999999, 0.363684999999997,
0.181640399999999, 0.177473599999999, 1.0000000116861e-07, 0.177463800000005,
0.355294099999995, 1.0000000116861e-07, 1.6101482, 1.0000000116861e-07,
0.533477099999999, 0.355287800000006, 9.99999940631824e-08, 1.0000000116861e-07,
1.0000000116861e-07, 1.0000000116861e-07, 1.0000000116861e-07,
1.0000000116861e-07, 9.99999940631824e-08, 1.0000000116861e-07,
1.0000000116861e-07), bpInPiece = c(241608, 370690, 103499, 240837,
26696, 207283, 45804, 93791, 61707, 72085, 447436, 2026, 178576,
77745, 185029, 8827, 180602, 7431, 23615, 14434, 69396, 225448,
152754, 63648, 51309, 1167382, 15414, 1966824, 248323, 46465,
535635, 302948, 33891, 134872, 333589, 194257, 1227783, 238832,
347468, 316798, 129415, 254689, 169986, 277568, 265784, 604786,
255194, 1987, 167777), cMPerSite = c(1e-16, 2.69767190914241e-13,
9.66192910076426e-13, 2.95575762860358e-06, 3.74587953054257e-12,
4.60280341369046e-06, 2.18321543612659e-12, 5.29604226418313e-06,
2.01757985317711e-05, 2.96593049871679e-06, 1.96858790977928e-05,
0.000247060809476802, 8.80370374518411e-06, 2.16818650717088e-05,
1.21886131363192e-05, 1.13288774406491e-11, 1.49353750235324e-05,
4.77880635176962e-05, 1.50427186110523e-05, 6.92808654348135e-12,
2.14764856764078e-05, 1.24973288740641e-05, 6.54647349127419e-13,
5.58118086978381e-06, 1.94897583598608e-12, 7.08730509807415e-06,
2.76253860127155e-05, 3.28286140498591e-06, 1.50118756619403e-06,
7.82707414182711e-06, 3.39112268615754e-07, 5.85821989252278e-07,
2.95063589650969e-12, 1.31579423453352e-06, 1.06506539484214e-06,
5.14781970114898e-13, 1.31142734506016e-06, 4.18704366117646e-13,
1.53532728193675e-06, 1.1214963478305e-06, 7.72707909154135e-13,
3.92635728942395e-13, 5.88283747888707e-13, 3.60272081683082e-13,
3.76245376578762e-13, 1.65347744770232e-13, 3.91858719496471e-13,
5.03271269092148e-11, 5.96029260080999e-13)), .Names = c("chrom",
"end", "cMInPiece", "bpInPiece", "cMPerSite"), row.names = c(NA,
-49L), class = "data.frame")

This might get you started
Assuming your data is too large to process in one step, the idea is to build the histogram manually; a histogram is essentially the number of observations per bin. In the code below, df stands for your data.frame (d in the question).
1) Split your data.frame into N chunks that are small enough for your memory (N can be any number)
N <- 10
L <- split(df, cut(seq_len(nrow(df)), breaks=N))
2) For each chunk, sum bpInPiece within each bin, where the bin G is the order of magnitude of cMPerSite - { i %>% group_by(G = floor(-log10(cMPerSite))) %>% summarise(sum=sum(bpInPiece)) }
3) Then combine the per-chunk sums - %>% group_by(G) %>% summarise(sum = sum(sum))
4) Then plot - ggplot(...)
library(tidyverse)
counts <- map_df(L, function(i) { i %>% group_by(G = floor(-log10(cMPerSite))) %>% summarise(sum=sum(bpInPiece)) }) %>%
group_by(G) %>% summarise(sum = sum(sum)) %>%
ggplot(., aes(G, sum)) + geom_col()
counts
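The same idea can be sketched in base R without expanding the data and without chunking, since only vectors of the original length are created. This is a minimal sketch, not the answer's method: the helper name weighted_log_hist, the choice of equal-width bins on the log10 scale, and the toy values are all assumptions made for illustration. The density is normalised so that it integrates to 1 over log10(x).

```r
# Weighted histogram on equal-width log10 bins (hypothetical helper).
# x: values to bin (must be > 0); w: number of observations at each x.
weighted_log_hist <- function(x, w, n_bins = 20) {
  w <- as.numeric(w)                     # avoid integer overflow in the sums
  lx <- log10(x)
  breaks <- seq(min(lx), max(lx), length.out = n_bins + 1)
  # all.inside = TRUE keeps every value in a bin from 1 to n_bins
  bin <- findInterval(lx, breaks, rightmost.closed = TRUE, all.inside = TRUE)
  weight <- tapply(w, factor(bin, levels = seq_len(n_bins)), sum)
  weight[is.na(weight)] <- 0             # empty bins
  weight <- as.numeric(weight)
  width <- diff(breaks)
  data.frame(
    left = breaks[-length(breaks)],
    right = breaks[-1],
    weight = weight,
    density = weight / (sum(weight) * width)
  )
}

# Toy values standing in for d$cMPerSite and d$bpInPiece
x <- c(1e-16, 3e-13, 3e-06, 2e-05, 2e-04)
w <- c(241608, 370690, 240837, 61707, 2026)
h <- weighted_log_hist(x, w, n_bins = 6)

# Draw the bars only in an interactive session
if (interactive()) {
  plot(NULL, xlim = range(c(h$left, h$right)), ylim = c(0, max(h$density)),
       xlab = "log10(cMPerSite)", ylab = "density")
  rect(h$left, 0, h$right, h$density, col = "grey")
}
```

With the real data this would be called as weighted_log_hist(d$cMPerSite, d$bpInPiece). In ggplot2, mapping the weight aesthetic in geom_histogram(), for example aes(x = cMPerSite, weight = bpInPiece), together with scale_x_log10(), may give a comparable picture without stat = "identity".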

Related

How to code and plot the performance of two trading strategies

I have the monthly returns on the SP500, together with the risk-free rate, in a zoo time series object called SP500.df. I want to backtest two different trading strategies and plot them in one graph, either as cumulative returns or as the value of an initial $1 investment at the end of the period.
Strategy 1: buy and hold the SP500 for the whole period.
Strategy 2: hold the SP500 from November to April (winter), then switch to the risk-free rate from May to October (summer).
Note that I have also extracted the winter and summer returns into two vectors, called Winter_returns and summer_returns.
MKT = market returns
RF = risk-free rate
This is how SP500.df looks:
dput(head(SP500.df, 10))
structure(c(0.0286195, 0.03618317, -0.01363269, 0.02977401, 0.04461314,
0.0015209, -0.03207303, -0.0079275, 0.01882991, 0.00584478, 0.02372219,
0.03299206, -0.017908, 0.02540426, 0.04163062, -0.00317315, -0.03732322,
-0.0109474, 0.0147047, 0.00087712, 0.00608527826274047, 0.00495046849033236,
0.00503506482970477, 0.00481634688889247, 0.00424210936461577,
0.00358500724272255, 0.00424210936461577, 0.00480928182207086,
0.00485872460615713, 0.00487990531586144, 1, 1, 1, 1, 0, 0, 0,
0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0,
0, 0), .Dim = c(10L, 6L), .Dimnames = list(NULL, c("MKT", "CAP",
"RF", "dummy", "dummyJAN", "adjdummy")), index = structure(c(-36494,
-36466, -36435, -36405, -36374, -36344, -36313, -36282, -36252,
-36221), class = "Date"), class = "zoo")
Here is a way.
Define a calculation function calc and apply it to the column MKT to get strategy 1. Then build a vector rates that holds the market return in winter months and the risk-free rate otherwise, and apply the same function to it to get strategy 2. The plot uses base R graphics.
library(zoo)
calc <- function(x, returns) x*cumprod((1 + returns))
strat1 <- calc(1, SP500.df$MKT)
strat1
#> 1870-01-31 1870-02-28 1870-03-31 1870-04-30 1870-05-31 1870-06-30 1870-07-31
#> 1.028620 1.065838 1.051308 1.082610 1.130908 1.132628 1.096301
#> 1870-08-31 1870-09-30 1870-10-31
#> 1.087610 1.108090 1.114567
i_winter <- SP500.df$dummy == 1
rates <- numeric(NROW(SP500.df))
rates[i_winter] <- SP500.df$MKT[i_winter]
rates[!i_winter] <- SP500.df$RF[!i_winter]
strat2 <- calc(1, rates)
strat2
#> [1] 1.028620 1.065838 1.051308 1.082610 1.087202 1.091100 1.095728 1.100998
#> [9] 1.106347 1.111746
matplot(cbind(strat1, strat2), pch = 19)
matlines(cbind(strat1, strat2), lty = "solid")
legend("bottomright", legend = c("strategy 1", "strategy 2"),
col = 1:2, lty = "solid", pch = 19)
Created on 2022-02-07 by the reprex package (v2.0.1)
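The switching logic can also be checked on a few made-up numbers, independent of the zoo object. This is a minimal sketch: the vectors mkt, rf and winter and the helper growth are hypothetical stand-ins for SP500.df$MKT, SP500.df$RF, the dummy column and calc.

```r
# Made-up monthly returns, for illustration only
mkt    <- c(0.02, -0.01, 0.03, 0.01)      # market return
rf     <- c(0.005, 0.005, 0.005, 0.005)   # risk-free rate
winter <- c(TRUE, TRUE, FALSE, FALSE)     # TRUE in winter months

# Value over time of an initial investment compounded at the given returns
growth <- function(returns, initial = 1) initial * cumprod(1 + returns)

buy_hold  <- growth(mkt)                      # strategy 1
switching <- growth(ifelse(winter, mkt, rf))  # strategy 2
```

Here ifelse() picks the market return in winter and the risk-free rate otherwise, which is the same selection the rates vector performs with logical indexing.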
With ggplot2 graphics, create a temporary data.frame and pipe it to package tidyr's reshaping function pivot_longer. Then pipe the result to ggplot.
library(ggplot2)
data.frame(
date = index(SP500.df),
strategy1 = strat1,
strategy2 = strat2
) |>
tidyr::pivot_longer(-date) |>
ggplot(aes(date, value, color = name)) +
geom_line() +
geom_point() +
scale_color_manual(values = c("black", "red")) +
theme_bw()

Interactively identify 3D object in rgl plot

I want to identify 3d cylinders in an rgl plot to obtain one attribute of the nearest / selected cylinder. I tried using labels to simply spell out the attribute, but I work on data with more than 10,000 cylinders. Therefore, it gets so crowded that the labels are unreadable and it takes ages to render.
I tried to understand the documentation of rgl and I guess the solution to my issue is selecting the cylinder in the plot manually. I believe the function selectpoints3d() is probably the way to go. I believe it returns all vertices within the drawn rectangle, but I don't know how to go back to the cylinder data? I could calculate which cylinder is closest to the mean of the selected vertices, but this seems like a "quick & dirty" way to do the job.
Is there a better way to go? I noticed the argument value=FALSE to get the indices only, but I don't know how to go back to the cylinders.
Here is some dummy data and my code:
library(rgl)
# dummy data
cylinder <- data.frame(
start_X = rep(1:3, 2)*2,
start_Y = rep(1:2, each = 3)*2,
start_Z = 0,
end_X = rep(1:3, 2)*2 + round(runif(6, -1, 1), 2),
end_Y = rep(1:2, each = 3)*2 + round(runif(6, -1, 1), 2),
end_Z = 0.5,
radius = 0.25,
attribute = sample(letters[1:6], 6)
)
# calculate centers
cylinder$center_X <- rowMeans(cylinder[,c("start_X", "end_X")])
cylinder$center_Y <- rowMeans(cylinder[,c("start_Y", "end_Y")])
cylinder$center_Z <- rowMeans(cylinder[,c("start_Z", "end_Z")])
# create cylinders
cylinder_list <- list()
for (i in 1:nrow(cylinder)) {
cylinder_list[[i]] <- cylinder3d(
center = cbind(
c(cylinder$start_X[i], cylinder$end_X[i]),
c(cylinder$start_Y[i], cylinder$end_Y[i]),
c(cylinder$start_Z[i], cylinder$end_Z[i])),
radius = cylinder$radius[i],
closed = -2)
}
# plot cylinders
open3d()
par3d()
shade3d(shapelist3d(cylinder_list, plot = FALSE), col = "blue")
text3d(cylinder$center_X+0.5, cylinder$center_Y+0.5, cylinder$center_Z+0.5, cylinder$attribute, color="red")
# get attribute
nearby <- selectpoints3d(value=TRUE, button = "right")
nearby <- colMeans(nearby)
cylinder$dist <- sqrt(
(nearby["x"]-cylinder$center_X)**2 +
(nearby["y"]-cylinder$center_Y)**2 +
(nearby["z"]-cylinder$center_Z)**2)
cylinder$attribute[which.min(cylinder$dist)]
If you call selectpoints3d(value = FALSE), you get two columns. The first column is the id of the object that was found. Your cylinders get two ids each. One way to mark the cylinders is to use "tags". For example, this modification of your code:
library(rgl)
# dummy data
cylinder <- data.frame(
start_X = rep(1:3, 2)*2,
start_Y = rep(1:2, each = 3)*2,
start_Z = 0,
end_X = rep(1:3, 2)*2 + round(runif(6, -1, 1), 2),
end_Y = rep(1:2, each = 3)*2 + round(runif(6, -1, 1), 2),
end_Z = 0.5,
radius = 0.25,
attribute = sample(letters[1:6], 6)
)
# calculate centers
cylinder$center_X <- rowMeans(cylinder[,c("start_X", "end_X")])
cylinder$center_Y <- rowMeans(cylinder[,c("start_Y", "end_Y")])
cylinder$center_Z <- rowMeans(cylinder[,c("start_Z", "end_Z")])
# create cylinders
cylinder_list <- list()
for (i in 1:nrow(cylinder)) {
cylinder_list[[i]] <- cylinder3d(
center = cbind(
c(cylinder$start_X[i], cylinder$end_X[i]),
c(cylinder$start_Y[i], cylinder$end_Y[i]),
c(cylinder$start_Z[i], cylinder$end_Z[i])),
radius = cylinder$radius[i],
closed = -2)
# Add tag here:
cylinder_list[[i]]$material$tag <- cylinder$attribute[i]
}
# plot cylinders
open3d()
par3d()
shade3d(shapelist3d(cylinder_list, plot = FALSE), col = "blue")
text3d(cylinder$center_X+0.5, cylinder$center_Y+0.5, cylinder$center_Z+0.5, cylinder$attribute, color="red")
# Don't get values, get the ids
nearby <- selectpoints3d(value=FALSE, button = "right", closest = FALSE)
ids <- nearby[, "id"]
# Convert them to tags. If you select one of the labels, you'll get
# a blank in the list of tags, because we didn't tag the text.
unique(tagged3d(id = ids))
When I was trying this, I found that using closest = TRUE in selectpoints3d seemed to get too many ids; there may be a bug there.
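If the coordinates from selectpoints3d(value = TRUE) are used instead of tags, the distance to the cylinder axis may be a more robust criterion than the distance to the mean of the selected vertices. The following is only a geometric sketch in base R under that assumption: the helpers dist_to_segment and nearest_cylinder are hypothetical, and the toy cylinders reuse the column names of the question's data.frame.

```r
# Distance from each row of pts (n x 3 matrix) to the segment from a to b
dist_to_segment <- function(pts, a, b) {
  ab <- b - a
  ap <- sweep(pts, 2, a)                  # subtract a from every row
  t <- as.vector(ap %*% ab) / sum(ab^2)   # projection onto the axis
  t <- pmin(pmax(t, 0), 1)                # clamp to the segment
  closest <- matrix(a, nrow(pts), 3, byrow = TRUE) + outer(t, ab)
  sqrt(rowSums((pts - closest)^2))
}

# Row index of the cylinder whose axis is nearest to most selected points
nearest_cylinder <- function(pts, cyl) {
  d <- sapply(seq_len(nrow(cyl)), function(i) {
    a <- c(cyl$start_X[i], cyl$start_Y[i], cyl$start_Z[i])
    b <- c(cyl$end_X[i], cyl$end_Y[i], cyl$end_Z[i])
    dist_to_segment(pts, a, b)
  })
  d <- matrix(d, nrow = nrow(pts))        # keep a matrix for a single point
  votes <- apply(d, 1, which.min)         # nearest cylinder per point
  as.integer(names(which.max(table(votes))))
}

# Toy data: two vertical cylinders and two selected points near the second
cyl <- data.frame(start_X = c(0, 5), start_Y = c(0, 0), start_Z = c(0, 0),
                  end_X = c(0, 5), end_Y = c(0, 0), end_Z = c(1, 1),
                  attribute = c("a", "b"))
pts <- rbind(c(4.8, 0.1, 0.5), c(5.2, 0, 0.9))
picked <- cyl$attribute[nearest_cylinder(pts, cyl)]
```

Each selected point votes for its nearest axis, so a few stray vertices from a neighbouring cylinder do not change the result.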

label certain points with textxy()

I am trying to draw a volcano plot in R using the base plot function and the calibrate package, and I want to use the textxy function to label only certain points.
Here is some data:
Metabolites <- data.frame(Metabolite = c("Glucose", "Galactose", "Creatine", "Lactose", "N-Acetylputrescine", "Tyramine", "Adenine", "Glycine", "Erythritol", "Choline"), Neg_pvalue = c(10, 8, 2, 1, 0.5, 0.7, 5, 3, 5.8, 4), LogFC = c(4, -3, 2, -1, 0.5, 0.7, 1, -2, -4, -1), padjust = c(1.453557e-19, 5.312771e-08, 4.983176e-02, 9.585447e-01, 2.449707e-01, 3.058580e-01, 4.223173e-02, 1.002379e-03, 4.466316e-27, 1.003879e-01))
Here is my code:
with(Metabolites, plot(LogFC, Neg_pvalue, pch=20, main="CNL", xlim=c(-5,6)))
with(subset(Metabolites, padjust <.05 ), points(LogFC, Neg_pvalue, pch=20, col="blue"))
with(subset(Metabolites, padjust <.05 & abs(LogFC) > 2), points(LogFC, Neg_pvalue, pch=20, col="red"))
Now here is the issue:
with(subset(Metabolites, padjust <.05 & abs(LogFC) > 2), textxy(LogFC, Neg_pvalue, labs=Metabolite[1:3], cex=.5, offset = 0.2))
If I plot this code, I get only the top 3 data points, as is indicated with the labs=Metabolite[1:3] part of the code. Alternatively, if I plot labs=Metabolite, then I get all labels.
If I wanted to plot the labels of only: Glycine, Lactose, and Erythritol as given in the Metabolites$Metabolite, am I able to do this?
Also, say I wanted to keep my top 3 data points labeled (labs=Metabolite[1:3]), but also want to label other metabolites of interest, say Tyramine and N-Acetylputrescine too; how can I do this?
This seems to work by selecting the rows whose Metabolite is in that set and using the Metabolite values of those rows as labels, so that each label stays with its own point (subset keeps the row order of the data, not the order of the set):
library(calibrate)
with(subset(Metabolites, Metabolite %in% c( 'Glycine', 'Lactose', 'Erythritol' )),
textxy(LogFC, Neg_pvalue, labs=Metabolite, cex=.5, offset = 0.2))
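For the second part of the question (keeping the top 3 significant points labeled and adding a few metabolites of interest), the same %in% selection can be applied to a combined set of names. This is a minimal sketch; the names sig, top3, extra and to_label are introduced here for illustration, and the plotting calls run only in an interactive session with calibrate installed.

```r
Metabolites <- data.frame(
  Metabolite = c("Glucose", "Galactose", "Creatine", "Lactose",
                 "N-Acetylputrescine", "Tyramine", "Adenine", "Glycine",
                 "Erythritol", "Choline"),
  Neg_pvalue = c(10, 8, 2, 1, 0.5, 0.7, 5, 3, 5.8, 4),
  LogFC = c(4, -3, 2, -1, 0.5, 0.7, 1, -2, -4, -1),
  padjust = c(1.453557e-19, 5.312771e-08, 4.983176e-02, 9.585447e-01,
              2.449707e-01, 3.058580e-01, 4.223173e-02, 1.002379e-03,
              4.466316e-27, 1.003879e-01),
  stringsAsFactors = FALSE
)

# First three significant points with a large fold change
sig  <- subset(Metabolites, padjust < .05 & abs(LogFC) > 2)
top3 <- head(sig$Metabolite, 3)

# Additional metabolites of interest
extra <- c("Tyramine", "N-Acetylputrescine")

# Rows to label; labels are taken from the rows themselves
to_label <- subset(Metabolites, Metabolite %in% c(top3, extra))

if (interactive() && requireNamespace("calibrate", quietly = TRUE)) {
  with(Metabolites, plot(LogFC, Neg_pvalue, pch = 20, main = "CNL", xlim = c(-5, 6)))
  with(to_label, calibrate::textxy(LogFC, Neg_pvalue, labs = Metabolite,
                                   cex = .5, offset = 0.2))
}
```

Because labs comes from the subset, the labels follow the row order of the data rather than the order in which the names were typed.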

midasR forecast produces NA value

I want a one-step-ahead forecast with the midasr package, but when I try, the result is NA. The code is as follows.
model1=midas_r(TC ~
mls(TC, 1, 1)+
mls(EMCI_MXN, 2:5, 4, nealmon) ,
data=list(dlog_men,
EMCI_MXN=dlog_dS$EMCI_MXN),
start = list(EMCI_MXN=rep(0,3)))
model2=midas_r(TC ~
mls(TC, 1, 1)+
mls(EMCI_MXN, 2:5, 4, lcauchyp) ,
data=list(dlog_men,
EMCI_MXN=dlog_dS$EMCI_MXN),
start = list(EMCI_MXN=c(1,1,1)))
model_frcst=midasr::forecast(model2, newdata=list(
EMCI_MXN =c(-.02,rep(NA,3)) #Adding new value
),method="dynamic")
I've tried this, but the result is the same:
average_forecast(list(model1, model2),data=list(
TC=c(dlog_men$TC, NA),
EMCI_MXN =c(dlog_dS$EMCI_MXN,rep(NA,4))
), insample= 1:146, outsample = 147
)
Thanks a lot.

Highlighting the maximum value of each column of data frame in R

I have a data.frame in R :
p=structure(list(WSbin01 = c(214.98151752527, -46.9493685420515,
154.726947679253), WSbin02 = c(1093.46050365665, 420.318207941967,
927.97317496775), WSbin03 = c(2855.24990411661, 2035.57575481323,
2662.2595957214), WSbin04 = c(5863.91399544626, 4881.81544665127,
5625.17650575444), WSbin05 = c(9891.70254019722, 8845.32506336827,
9666.14583347469), WSbin06 = c(14562.1527820802, 13401.1727730953,
14321.601249974), WSbin07 = c(19091.1307681137, 18003.2115315665,
18903.0179613827), WSbin08 = c(24422.7094972645, 23694.5453703207,
24357.8071162775), WSbin09 = c(30215.4088114124, 30214.3195264298,
30310.242671113), WSbin10 = c(36958.2122031382, 37964.9044838778,
37239.6908819524), WSbin11 = c(41844.810779792, 43701.2643596447,
42343.7442683171), WSbin12 = c(37616.8187087318, 39348.3188777835,
38178.9009247311), WSbin13 = c(20953.0973658833, 21720.1930292221,
21251.8654076726), WSbin14 = c(7155.3786781173, 7262.61983182254,
7233.60584469268), WSbin15 = c(2171.61052809769, 2120.97045661101,
2173.49396732091), WSbin16 = c(779.72276608943, 745.52198490267,
767.81436310063)), .Names = c("WSbin01", "WSbin02", "WSbin03",
"WSbin04", "WSbin05", "WSbin06", "WSbin07", "WSbin08", "WSbin09",
"WSbin10", "WSbin11", "WSbin12", "WSbin13", "WSbin14", "WSbin15",
"WSbin16"), class = "data.frame", row.names = c(NA, -3L))
I would like to set a background color for the maximum value of each column.
Using DT::datatable would return the table but I don't know how to set the formatStyle parameters to return the max value in each column in different color.
Furthermore, I have a vector z= c(1, 1, 1, 1, 1, 1, 1, 1, 3, 2, 2, 2, 2, 2, 3, 1) . I would like the background color of each column to depend on z: if z[i]=1, column i should be green; if z[i]=2, column i should be red; and if z[i]=3, column i should be blue.
Combining parts of the DT guide (https://rstudio.github.io/DT/010-style.html) and this question (Datatable: apply different formatStyle to each column), I get this:
library(DT)
colors <- apply(col2rgb(rainbow(n=ncol(p))),2,function(x)paste0("rgb(",paste(x,collapse=","),")"))
data <- datatable(p)
sapply(c(1:ncol(p)),function(x){
data <<- data %>% formatStyle(colnames(p)[[x]],backgroundColor = styleEqual(max(p[[x]]), colors[x]))
})
data
The answer to your second question is similar:
z= c(1, 1, 1, 1, 1, 1, 1, 1, 3, 2, 2, 2, 2, 2, 3, 1)
colors <- c("green", "red", "blue")  # colors for z = 1, 2, 3, as requested in the question
data <- datatable(p)
sapply(c(1:ncol(p)),function(x){
data <<- data %>% formatStyle(
colnames(p)[[x]],
backgroundColor = colors[z[x]]
)
})
data
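The two parts can be combined so that only the maximum of each column is highlighted, in the color chosen by z. This is a minimal sketch on a small made-up table: p_small, z_small and z_palette are hypothetical stand-ins for p, z and the requested colors, and a plain loop replaces the sapply with <<-. The DT calls are skipped if DT is not installed.

```r
# Made-up data, for illustration only
p_small   <- data.frame(a = c(1, 5, 3), b = c(9, 2, 4), c = c(0, 7, 8))
z_small   <- c(1, 3, 2)
z_palette <- c("green", "red", "blue")   # colors for z = 1, 2, 3

col_max   <- vapply(p_small, max, numeric(1))  # maximum of each column
col_color <- z_palette[z_small]                # color of each column

if (requireNamespace("DT", quietly = TRUE)) {
  tbl <- DT::datatable(p_small)
  for (i in seq_along(p_small)) {
    # Color only the cell equal to the column maximum
    tbl <- DT::formatStyle(
      tbl, names(p_small)[i],
      backgroundColor = DT::styleEqual(col_max[[i]], col_color[i])
    )
  }
  # Print tbl in an interactive session to display the table
}
```

Reassigning tbl inside the loop keeps the state local instead of modifying a global object from within sapply.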
