How to code and plot the performance of two trading strategies - r

So I have the monthly returns on the SP500, in addition to the risk-free rate, in a zoo time series object called SP500.df. I want to backtest two different trading strategies and plot them in one graph, either as cumulative return or as the end-of-period value of a $1 initial investment.
Strategy 1: buy and hold the SP500 for the whole period.
Strategy 2: hold the SP500 from November to April (winter), then switch to the risk-free rate from May to October (summer).
Note that I have also extracted the winter and summer returns into two separate vectors, called Winter_returns and summer_returns.
MKT = market returns
RF = risk-free rate
This is how the object SP500.df looks:
dput(head(SP500.df, 10))
structure(c(0.0286195, 0.03618317, -0.01363269, 0.02977401, 0.04461314,
0.0015209, -0.03207303, -0.0079275, 0.01882991, 0.00584478, 0.02372219,
0.03299206, -0.017908, 0.02540426, 0.04163062, -0.00317315, -0.03732322,
-0.0109474, 0.0147047, 0.00087712, 0.00608527826274047, 0.00495046849033236,
0.00503506482970477, 0.00481634688889247, 0.00424210936461577,
0.00358500724272255, 0.00424210936461577, 0.00480928182207086,
0.00485872460615713, 0.00487990531586144, 1, 1, 1, 1, 0, 0, 0,
0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0,
0, 0), .Dim = c(10L, 6L), .Dimnames = list(NULL, c("MKT", "CAP",
"RF", "dummy", "dummyJAN", "adjdummy")), index = structure(c(-36494,
-36466, -36435, -36405, -36374, -36344, -36313, -36282, -36252,
-36221), class = "Date"), class = "zoo")

Here is a way.
Define a calculation function calc and apply it to the column MKT to get strategy 1. Then create a vector that holds the market return in winter months and the risk-free rate otherwise, and apply the function to it to get strategy 2. The plot uses base R graphics.
library(zoo)
calc <- function(x, returns) x*cumprod((1 + returns))
strat1 <- calc(1, SP500.df$MKT)
strat1
#> 1870-01-31 1870-02-28 1870-03-31 1870-04-30 1870-05-31 1870-06-30 1870-07-31
#> 1.028620 1.065838 1.051308 1.082610 1.130908 1.132628 1.096301
#> 1870-08-31 1870-09-30 1870-10-31
#> 1.087610 1.108090 1.114567
i_winter <- SP500.df$dummy == 1
rates <- numeric(NROW(SP500.df))
rates[i_winter] <- SP500.df$MKT[i_winter]
rates[!i_winter] <- SP500.df$RF[!i_winter]
strat2 <- calc(1, rates)
strat2
#> [1] 1.028620 1.065838 1.051308 1.082610 1.087202 1.091100 1.095728 1.100998
#> [9] 1.106347 1.111746
matplot(cbind(strat1, strat2), pch = 19)
matlines(cbind(strat1, strat2), lty = "solid")
legend("bottomright", legend = c("strategy 1", "strategy 2"),
col = 1:2, lty = "solid", pch = 19)
Created on 2022-02-07 by the reprex package (v2.0.1)
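The same calculation can also be sketched on plain numeric vectors, which makes the switching rule explicit. This is only a minimal illustration, assuming the first ten rows shown in the dput() output of the question; the names mkt, rf, winter and growth are made up for this example and are not part of the original answer.

```r
# First ten monthly observations, copied from the dput() output of SP500.df
mkt <- c(0.0286195, 0.03618317, -0.01363269, 0.02977401, 0.04461314,
         0.0015209, -0.03207303, -0.0079275, 0.01882991, 0.00584478)
rf  <- c(0.00608527826274047, 0.00495046849033236, 0.00503506482970477,
         0.00481634688889247, 0.00424210936461577, 0.00358500724272255,
         0.00424210936461577, 0.00480928182207086, 0.00485872460615713,
         0.00487990531586144)
winter <- c(1, 1, 1, 1, 0, 0, 0, 0, 0, 0) == 1   # the "dummy" column

# Value of an initial investment after compounding the periodic returns
# (hypothetical helper, equivalent in spirit to calc() in the answer)
growth <- function(returns, initial = 1) initial * cumprod(1 + returns)

strat1 <- growth(mkt)                      # buy and hold
strat2 <- growth(ifelse(winter, mkt, rf))  # market in winter, risk-free otherwise

# Cumulative return over the whole sample, as an alternative to the $1 path
cum_return <- c(strategy1 = tail(strat1, 1), strategy2 = tail(strat2, 1)) - 1
round(cum_return, 4)
```

Because both series start from the same initial value, the final element of each vector is the end-of-period value of $1, and subtracting 1 gives the cumulative return.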
With ggplot2 graphics, create a temporary data.frame and pipe it to package tidyr's reshaping function pivot_longer. Then pipe the result to ggplot.
library(ggplot2)
data.frame(
date = index(SP500.df),
strategy1 = strat1,
strategy2 = strat2
) |>
tidyr::pivot_longer(-date) |>
ggplot(aes(date, value, color = name)) +
geom_line() +
geom_point() +
scale_color_manual(values = c("black", "red")) +
theme_bw()
Created on 2022-02-07 by the reprex package (v2.0.1)

Related

Grouped bar charts for modal split values

I want to plot my data frame containing different modal split values (% of car usage, % of bike usage, ...) for different path lengths (under 5 km, 5-10 km, 10-30 km, ...).
Each element in my data frame contains the % of vehicle usage for each path length.
My goal is to plot all values in one plot.
I want to create a bar chart with one bar for each path length, representing all vehicle percentages (the modal split).
My data frame contains the vehicle modes (car, bike, ...) in the first column, and columns 2-10 contain the percentages for each path length group.
I tried:
testtest <- ggplot() + geom_col(data = ms_gruppen_d,
aes(x = colnames(ms_gruppen_d)[2:9],
y = ms_gruppen_d[,2:9],
fill = ms_gruppen_d[,1]))
My values are not categorical, so I cannot use the "count" function.
Can someone help?
Thanks
ms_gruppen_d <- structure(list(VM = c("Fußverkehr", "Fahrrad", "Motorrad/Moped/Mofa",
"Privater_pkw", "Gewerb_pkw", "Lkw_bis_3_5_", "Lkw_ab_3_5_",
"Sattelzug", "ÖPNV"), `Laenge unter 5km` = c(0.218428835651906,
0.208360071967382, 0, 0.337471470224058, 0.195785602540656, 0.0103830833919553,
0.0123737357892543, 0, 0.0171972004347874), `Laenge 5 - 10km` = c(0.138928420064367,
0.140725324716725, 0.00988051174398964, 0.289334484453904, 0.308718256514345,
0.0356902893023975, 0.00988051174398964, 0.0222528559093808,
0.044589345550901), `Laenge 10-20km` = c(0.0667063809168976,
0.172327489225668, 0, 0.271668790053295, 0.346741728107974, 0.0573103622018356,
0.0292526145926873, 0.0149058863164426, 0.0410867485852005),
`Laenge 20-30km` = c(0.0405426428226048, 0.1463357744637,
0.0236972749606593, 0.271246395715663, 0.354248166536575,
0.0855256681459516, 0.0173953892663395, 0.0432292937973128,
0.0177793942911947), `Laenge 30-50km` = c(0.0213163894963155,
0.0503758065644924, 0.0159090254544127, 0.178916279908378,
0.485985672387571, 0.148087763700495, 0.0378558845704386,
0.026693520571143, 0.0348596573467541), `Laenge 50-100km` = c(0.00652604845092996,
0.0123285212525124, 0, 0.177307097376991, 0.380919125770432,
0.154233838933756, 0.213479807823156, 0.0441531204824327,
0.0110524399097905), `Laenge 100-200km` = c(0, 0.00431357399129567,
0, 0.087013827371374, 0.173016082279325, 0.203265193001196,
0.399659385606215, 0.0655495360275712, 0.0671824017230226
), `Laenge 200-300km` = c(0, 0, 0, 0.00953852353026925, 0.147233787704061,
0.130598939323796, 0.518334554408677, 0.146338992010429,
0.0479552030227669), `Laenge 300km+` = c(0, 0, 0, 0.0333890118493603,
0.0876659311982381, 0.0979219742771943, 0.420951006142259,
0.297349051156633, 0.062723025376315)), row.names = c(NA,
-9L), class = "data.frame")
I think the main problem is that your data is in wide format instead of long format. You can reshape the data using tidyr::pivot_longer(). Here is how you can use that function to make a grouped bar chart:
library(ggplot2)
# Reshape data, excluding column 1
df <- tidyr::pivot_longer(ms_gruppen_d, -1, names_to = "Laenge")
# Make the distance labels prettier to print
df$Laenge <- factor(df$Laenge, levels = colnames(ms_gruppen_d)[-1])
levels(df$Laenge) <- gsub("Laenge ", "", levels(df$Laenge))
# A grouped bar chart
ggplot(df, aes(Laenge, value, fill = VM)) +
geom_col(position = "dodge")
However, I think a stacked bar chart might make more sense in this case, as all fractions should add up to 1.
ggplot(df, aes(Laenge, value, fill = VM)) +
geom_col(position = "stack")
Created on 2021-09-10 by the reprex package (v2.0.1)
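The wide-to-long reshape, and the claim that the fractions in each length class add up to 1, can also be checked without extra packages. The following is a rough base-R sketch on a small made-up data frame (ms, with invented modes and values), not on the asker's data; it only illustrates what the long format looks like.

```r
# Toy wide table: one row per mode, one column per length class (made-up values)
ms <- data.frame(
  VM          = c("walk", "bike", "car"),
  `under 5km` = c(0.5, 0.3, 0.2),
  `5 - 10km`  = c(0.2, 0.3, 0.5),
  check.names = FALSE
)

# Long format: one row per (mode, length class) pair
long <- data.frame(
  VM     = rep(ms$VM, times = ncol(ms) - 1),
  Laenge = rep(colnames(ms)[-1], each = nrow(ms)),
  value  = unlist(ms[-1], use.names = FALSE)
)

# Each length class should sum to 1 if the columns are complete modal splits
totals <- tapply(long$value, long$Laenge, sum)
totals
```

If the totals are all close to 1, a stacked chart shows the full split per length class; if some are not, position = "fill" in geom_col() would rescale each bar to 1.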

Error while running WTC (Wavelet Coherence) Codes in R

I am doing Wavelet Analysis in R using Biwavelet. However, I receive the error message:
Error in check.datum(y) :
The step size must be constant (see approx function to interpolate)
When I run the following code:
wtc.AB = wtc(t1, t2, nrands = nrands)
Any help is appreciated. The complete code is:
library(biwavelet)
# Import your data
Data <- read.csv("https://dl.dropboxusercontent.com/u/18255955/Tutorials/Commodities.csv")
# Attach your data so that you can access variables directly using their
# names
attach(Data)
# Define two sets of variables with time stamps
t1 = cbind(DATE, ISLX)
t2 = cbind(DATE, GOLD)
# Specify the number of iterations. The more, the better (>1000). For the
# purpose of this tutorial, we just set it = 10
nrands = 10
wtc.AB = wtc(t1, t2, nrands = nrands)
# Plotting a graph
par(oma = c(0, 0, 0, 1), mar = c(5, 4, 5, 5) + 0.1)
plot(wtc.AB, plot.phase = TRUE, lty.coi = 1, col.coi = "grey", lwd.coi = 2,
lwd.sig = 2, arrow.lwd = 0.03, arrow.len = 0.12, ylab = "Scale", xlab = "Period",
plot.cb = TRUE, main = "Wavelet Coherence: A vs B")
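The error message says the time column (the first column of t1 and t2) does not advance in equal steps. Whether DATE is really the cause cannot be confirmed from the question alone, so the sketch below uses made-up data (dates, values) to show how one might test for a constant step and, as one possible workaround, substitute an evenly spaced observation index. The error message itself points to approx() for interpolating onto a regular grid instead.

```r
# Hypothetical daily series with a weekend gap (toy data, not the asker's file)
dates  <- as.Date(c("2020-01-02", "2020-01-03", "2020-01-06", "2020-01-07"))
values <- c(1.0, 1.2, 0.9, 1.1)

# A constant step means all successive differences of the time column are equal
steps <- diff(as.numeric(dates))
constant_step <- length(unique(steps)) == 1
constant_step

# One possible workaround: use an evenly spaced index as the time column
t1 <- cbind(seq_along(values), values)
```

With an index in the first column the step is 1 everywhere; the trade-off is that the x-axis then counts observations rather than calendar time.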

How to display text out of color figure and easy to read in visNetwork?

I'm building a network map in R with the visNetwork library, but I don't like that the legend displays the text inside the shape: the size of each shape is then determined by how long the name is, which can be confusing. The library documentation lists shapes that display the text outside, but you have to set them group by group with a visGroups call (shape = "triangle", for example) for each of the groups. I want a setup that works without knowing in advance how many groups there will be, because the number won't necessarily be the same every time.
The code below is what I'm using (the real data is much bigger; I simplified it by selecting only some columns and rows):
library(igraph)
library(visNetwork)
twd2 <- structure(c(0.0854374047081175, 0.116200039661793, 0.0289142580616779,
0.12768720590989, 0.273786051264039, 0, 0.000593902599973604,
0, 0.00184397276348455, 0, 0, 0, 0, 0, 0, 0, 0.106048390315551,
0, 0, 0, 0.0142648455772593, 0, 0.0197857551361577, 0.0290239379534046,
0, 0, 0, 0, 0, 0, 0, 0, 0.00197967638677129, 0, 0.000296951299986802,
0, 0.0111915576381184, 0, 0.00111081782587656, 0.0276104933163398,
0, 0, 0.00487220095904272, 0.0149921777316026, 0, 0, 8.79855703664599e-05,
0.00674104365369398, 0, 0, 0.0330935726540847, 0, 0, 0.0362094142674287,
0, 0, 0, 0.00172168238114983, 0.00232061941841538, 0.0248983709144504
), .Dim = c(4L, 15L), .Dimnames = list(NULL, c("water_treatment",
"waste_water", "utility_model", "water_inlet", "waste_water_treatment",
"treatment_system", "model_discloses", "utility_model_discloses",
"water_outlet", "water_treatment_system", "treatment_device",
"water_tank", "sludge_treatment", "reverse_osmosis", "raw_water")))
twd2_num_col <- ncol(twd2)
twd2_cor <- cor(twd2, method = "pearson")
twd2_cor[ abs(twd2_cor) < 0.75 ] <- 0
twd2_cor[ abs(twd2_cor) > 0.925 ] <- 0
diag(twd2_cor) <- 0
graph <- graph.adjacency(twd2_cor, weighted=TRUE, mode="lower")
E(graph)$edge.width <- E(graph)$weight
V(graph)$group <- apply(twd2, 2, which.max) # Max topic prob for colors
V(graph)$betweenness <- betweenness(graph, v = V(graph), directed = F)
V(graph)$degree <- degree(graph, v = V(graph))
# Fit data for visNetwork
nm_data <<- toVisNetworkData(graph)
nodes <<- as.data.frame(nm_data[[1]], stringsAsFactors = F)
nodes <<- nodes[nodes$degree != 0,] # Bye topics that don't have a connection (degree = 0)
nodes$group <<- swap(nodes$group , 1:length(topic_names), topic_names) # Swap long real names
nodes$label <<- rep("")
# Plot
# Graph
set.seed(17);visNetwork(as.data.frame(nodes, stringsAsFactors = F),
as.data.frame(nm_data[[2]], stringsAsFactors = F),
main = "Relation between topics") %>%
visOptions(highlightNearest = TRUE, selectedBy = "group") %>%
visInteraction(dragNodes = FALSE) %>%
visLegend(useGroups = TRUE, main = "Topic") %>%
visNodes(shape = "dot",label = NULL) %>%
visIgraphLayout(randomSeed = 17)
Additionally, here they refer to what I said about selecting the type of shape, but they don't add any code, and I can't figure out whether there is a way to do it directly and apply it to all groups, or what other options I have.
ledges <- data.frame(color = c("lightblue", "red"),
label = c("reverse", "depends"), arrows =c("to", "from"),
font.align = "top")
visNetwork(nodes, edges) %>%
visGroups(groupname = "A", color = "red") %>%
visGroups(groupname = "B", color = "lightblue") %>%
visLegend(addNodes = lnodes, addEdges = ledges, useGroups = FALSE)
font.align = "top" should do the job.
For future visitors:
See https://www.rdocumentation.org/packages/visNetwork/versions/2.0.9/topics/visNodes
"The location of the label depends on the chosen shape; for the default ellipse this will be inside the node. The types with the label inside of it are: ellipse, circle, database, box, text. The ones with the label outside of it are: image, circularImage, diamond, dot, star, triangle, triangleDown, square and icon."
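One way to avoid writing a visGroups call per group might be to build the legend entries from whatever groups are present in the data. The sketch below is an assumption-laden illustration, not something taken from the answer: the toy nodes table, the pal colour lookup and the lnodes legend table are all hypothetical, and colours are assigned explicitly so that nodes and legend entries agree. It only prepares the data frames; the final call, shown as a comment, follows the visLegend(addNodes = ..., useGroups = FALSE) pattern quoted in the question.

```r
# Toy node table with an unknown number of groups (made-up data)
nodes <- data.frame(id = 1:5,
                    group = c("a", "b", "a", "c", "b"),
                    stringsAsFactors = FALSE)

# One colour per group, however many groups there are
groups <- unique(nodes$group)
pal    <- setNames(rainbow(length(groups)), groups)

# Give every node an explicit colour and a shape whose label sits outside
nodes$color <- unname(pal[nodes$group])
nodes$shape <- "dot"

# Legend entries generated from the groups instead of being written by hand
lnodes <- data.frame(label = groups,
                     shape = "dot",
                     color = unname(pal),
                     stringsAsFactors = FALSE)

# With visNetwork loaded and an edges table available, the legend could then be
# requested as in the question:
# visNetwork(nodes, edges) %>% visLegend(addNodes = lnodes, useGroups = FALSE)
lnodes
```

Since "dot" is one of the shapes with the label outside, the legend symbols keep the same size regardless of how long the group names are.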

label certain points with textxy()

I am trying to draw a volcano plot in R using the plot function and the calibrate package, and I want to use the textxy function to label only certain points.
Here is some data:
Metabolites <- data.frame(Metabolite = c("Glucose", "Galactose", "Creatine", "Lactose", "N-Acetylputrescine", "Tyramine", "Adenine", "Glycine", "Erythritol", "Choline"), Neg_pvalue = c(10, 8, 2, 1, 0.5, 0.7, 5, 3, 5.8, 4), LogFC = c(4, -3, 2, -1, 0.5, 0.7, 1, -2, -4, -1), padjust = c(1.453557e-19, 5.312771e-08, 4.983176e-02, 9.585447e-01, 2.449707e-01, 3.058580e-01, 4.223173e-02, 1.002379e-03, 4.466316e-27, 1.003879e-01))
Here is my code:
with(Metabolites, plot(LogFC, Neg_pvalue, pch=20, main="CNL", xlim=c(-5,6)))
with(subset(Metabolites, padjust <.05 ), points(LogFC, Neg_pvalue, pch=20, col="blue"))
with(subset(Metabolites, padjust <.05 & abs(LogFC) > 2), points(LogFC, Neg_pvalue, pch=20, col="red"))
Now here is the issue:
with(subset(Metabolites, padjust <.05 & abs(LogFC) > 2), textxy(LogFC, Neg_pvalue, labs=Metabolite[1:3], cex=.5, offset = 0.2))
If I run this code, I get labels for only the top 3 data points, as specified by the labs=Metabolite[1:3] part of the code. Alternatively, if I use labs=Metabolite, then I get all labels.
If I wanted to plot the labels of only: Glycine, Lactose, and Erythritol as given in the Metabolites$Metabolite, am I able to do this?
Also, say I wanted to keep my top 3 data points labeled (labs=Metabolite[1:3]), but also want to label other metabolites of interest, say Tyramine and N-Acetylputrescine too; how can I do this?
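This seems to work by selecting the rows whose names are in that set and using the matching character values as labels:
library(calibrate)
# Take the labels from the subset itself: subset() keeps the row order of the data
# (Lactose, Glycine, Erythritol), so a hand-typed vector in another order would mislabel points
with(subset(Metabolites, Metabolite %in% c( 'Glycine', 'Lactose', 'Erythritol' )),
textxy(LogFC, Neg_pvalue, labs=Metabolite, cex=.5, offset = 0.2))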
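For the second part of the question (keeping the thresholded points labelled and adding a few named metabolites), one option is to combine two logical conditions with | before subsetting. The sketch below is only an illustration: the names top, extra and to_label are invented here, and base text() is swapped in for calibrate::textxy so that the example runs without extra packages. Replacing the last line with textxy(LogFC, Neg_pvalue, labs = Metabolite, cex = .5, offset = 0.2) should give the calibrate version.

```r
Metabolites <- data.frame(
  Metabolite = c("Glucose", "Galactose", "Creatine", "Lactose",
                 "N-Acetylputrescine", "Tyramine", "Adenine", "Glycine",
                 "Erythritol", "Choline"),
  Neg_pvalue = c(10, 8, 2, 1, 0.5, 0.7, 5, 3, 5.8, 4),
  LogFC      = c(4, -3, 2, -1, 0.5, 0.7, 1, -2, -4, -1),
  padjust    = c(1.453557e-19, 5.312771e-08, 4.983176e-02, 9.585447e-01,
                 2.449707e-01, 3.058580e-01, 4.223173e-02, 1.002379e-03,
                 4.466316e-27, 1.003879e-01)
)

# Points that pass both thresholds used for the red points in the question
top <- with(Metabolites, padjust < .05 & abs(LogFC) > 2)
# Extra metabolites of interest, chosen by name
extra <- Metabolites$Metabolite %in% c("Tyramine", "N-Acetylputrescine")
# Rows to label: either condition is enough
to_label <- Metabolites[top | extra, ]

with(Metabolites, plot(LogFC, Neg_pvalue, pch = 20, main = "CNL", xlim = c(-5, 6)))
# Labels come from the same rows as the coordinates, so they cannot get out of step
with(to_label, text(LogFC, Neg_pvalue, labels = Metabolite, cex = .5, pos = 3))
```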

R: Histogram with large data set

Goal
From the data.frame d, I am trying to make a histogram of the column cMPerSite weighted by bpInPiece. In other words, bpInPiece is the number of observations at each cMPerSite value.
The Y-axis should represent densities and the X-axis should be on a log scale.
Attempts
I could do something like the following (which could be improved by pre-allocating the memory for x):
x = c()
for (row in 1:nrow(d))
{
x = c(x, rep(d$cMPerSite[row],d$bpInPiece[row]))
}
hist(x,breaks=100, freq=FALSE)
but this becomes completely impractical when there is too much data (I have about 10 million rows in my full data set) because x becomes too large to be stored in RAM. Also, putting the X-axis on a log scale is, I think, necessarily a bit of a mess.
Alternatively, I would have thought I could do
ggplot(d) + geom_histogram(aes(x = cMPerSite, y=bpInPiece), stat="identity") + scale_x_log10() + theme_classic(25)
Warning: Ignoring unknown parameters: binwidth, bins, pad
but, for some reason I do not understand, nothing gets displayed. Also, I am not sure how to put the Y-axis in density rather than count.
I suppose the bin size should vary logarithmically as the X-axis varies, but that confuses me, as it would result in bins gathering an "artificially" high number of observations. I am not sure how histograms are typically displayed with a log-scale X-axis. Note that ggplot(d) + geom_histogram(aes(x = cMPerSite, y=bpInPiece), stat="identity") does not display anything either, so the problem is not only a question of the log scale on the X-axis.
Can you help me to make this histogram?
Subset of my data
structure(list(chrom = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1), end = c(241608,
612298, 715797, 956634, 983330, 1190613, 1236417, 1330208, 1391915,
1464000, 1911436, 1913462, 2092038, 2169783, 2354812, 2363639,
2544241, 2551672, 2575287, 2589721, 2659117, 2884565, 3037319,
3100967, 3152276, 4319658, 4335072, 6301896, 6550219, 6596684,
7132319, 7435267, 7469158, 7604030, 7937619, 8131876, 9359659,
9598491, 9945959, 10262757, 10392172, 10646861, 10816847, 11094415,
11360199, 11964985, 12220179, 12222166, 12389943), cMInPiece = c(0,
1e-07, 1e-07, 0.7118558, 9.99999999473644e-08, 0.9540829, 9.99999998363421e-08,
0.4967211, 1.244988, 0.2137991, 8.808171, 0.500545200000001,
1.5721302, 1.6856566, 2.2552469, 1.0000000116861e-07, 2.6973586,
0.355113100000001, 0.355233800000001, 1.0000000116861e-07, 1.4903822,
2.8174978, 1.0000000116861e-07, 0.355231, 1.0000000116861e-07,
8.2735924, 0.425817699999996, 6.4568106, 0.372779399999999, 0.363684999999997,
0.181640399999999, 0.177473599999999, 1.0000000116861e-07, 0.177463800000005,
0.355294099999995, 1.0000000116861e-07, 1.6101482, 1.0000000116861e-07,
0.533477099999999, 0.355287800000006, 9.99999940631824e-08, 1.0000000116861e-07,
1.0000000116861e-07, 1.0000000116861e-07, 1.0000000116861e-07,
1.0000000116861e-07, 9.99999940631824e-08, 1.0000000116861e-07,
1.0000000116861e-07), bpInPiece = c(241608, 370690, 103499, 240837,
26696, 207283, 45804, 93791, 61707, 72085, 447436, 2026, 178576,
77745, 185029, 8827, 180602, 7431, 23615, 14434, 69396, 225448,
152754, 63648, 51309, 1167382, 15414, 1966824, 248323, 46465,
535635, 302948, 33891, 134872, 333589, 194257, 1227783, 238832,
347468, 316798, 129415, 254689, 169986, 277568, 265784, 604786,
255194, 1987, 167777), cMPerSite = c(1e-16, 2.69767190914241e-13,
9.66192910076426e-13, 2.95575762860358e-06, 3.74587953054257e-12,
4.60280341369046e-06, 2.18321543612659e-12, 5.29604226418313e-06,
2.01757985317711e-05, 2.96593049871679e-06, 1.96858790977928e-05,
0.000247060809476802, 8.80370374518411e-06, 2.16818650717088e-05,
1.21886131363192e-05, 1.13288774406491e-11, 1.49353750235324e-05,
4.77880635176962e-05, 1.50427186110523e-05, 6.92808654348135e-12,
2.14764856764078e-05, 1.24973288740641e-05, 6.54647349127419e-13,
5.58118086978381e-06, 1.94897583598608e-12, 7.08730509807415e-06,
2.76253860127155e-05, 3.28286140498591e-06, 1.50118756619403e-06,
7.82707414182711e-06, 3.39112268615754e-07, 5.85821989252278e-07,
2.95063589650969e-12, 1.31579423453352e-06, 1.06506539484214e-06,
5.14781970114898e-13, 1.31142734506016e-06, 4.18704366117646e-13,
1.53532728193675e-06, 1.1214963478305e-06, 7.72707909154135e-13,
3.92635728942395e-13, 5.88283747888707e-13, 3.60272081683082e-13,
3.76245376578762e-13, 1.65347744770232e-13, 3.91858719496471e-13,
5.03271269092148e-11, 5.96029260080999e-13)), .Names = c("chrom",
"end", "cMInPiece", "bpInPiece", "cMPerSite"), row.names = c(NA,
-49L), class = "data.frame")
This might get you started.
Assuming your data is too large to process in one step, the idea is to generate the histogram manually; a histogram is essentially the number of observations per bin.
1) Split your data.frame d into chunks of a size that is manageable for your memory (N can be any number):
N <- 10
L <- split(d, cut(seq_len(nrow(d)), breaks=N))
2) For each chunk, sum bpInPiece for each group: { i %>% group_by(G = floor(-log10(cMPerSite))) %>% summarise(sum=sum(bpInPiece)) }
3) Then aggregate all chunks: %>% group_by(G) %>% summarise(sum = sum(sum))
4) Then plot: ggplot(...)
library(tidyverse)
# counts holds the finished ggplot object; printing it draws the chart
counts <- map_df(L, function(i) { i %>% group_by(G = floor(-log10(cMPerSite))) %>% summarise(sum=sum(bpInPiece)) }) %>%
group_by(G) %>% summarise(sum = sum(sum)) %>%
ggplot(., aes(G, sum)) + geom_col()
counts
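The per-bin sums above are counts; the question also asks for densities on a log-scaled X-axis. A rough base-R sketch of that last step follows, using a handful of made-up rows in place of the full data set. It bins log10(cMPerSite) into bins one decade wide, sums bpInPiece per bin as the weight, and divides by the total weight times the bin width so that the bars integrate to 1 on the log10 scale. The decade-wide bins and all variable names are choices made for this illustration only.

```r
# Small stand-in for d (values loosely modelled on the sample in the question)
d <- data.frame(
  cMPerSite = c(1e-16, 2.7e-13, 3e-06, 2e-05, 2.5e-04),
  bpInPiece = c(241608, 370690, 240837, 61707, 2026)
)

# Bin on the log10 scale, one bin per decade
log_x  <- log10(d$cMPerSite)
breaks <- seq(floor(min(log_x)), ceiling(max(log_x)), by = 1)
bin    <- cut(log_x, breaks = breaks, include.lowest = TRUE)

# Weighted count per bin: total bpInPiece falling into each bin (0 for empty bins)
w <- tapply(d$bpInPiece, bin, sum)
w[is.na(w)] <- 0

# Density on the log10 scale: weight share divided by bin width
dens <- as.numeric(w) / (sum(w) * diff(breaks))

barplot(dens, names.arg = paste0("1e", head(breaks, -1)), las = 2,
        xlab = "cMPerSite (lower bin edge)", ylab = "density (per log10 unit)")
```

Because only the per-bin sums are kept, the same weights can be accumulated chunk by chunk for a data set that does not fit in memory, and the density step is applied once at the end.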

Resources