R: Finance - Compute Beta via CAPM for Panel Data

I have the following data for three funds (A, B and C), with each fund's (Return minus Risk Free Rate) and (Market Return minus Risk Free Rate):
structure(list(`Fund Name` = c("A", "A", "A", "A", "A", "A",
"A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A",
"A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A",
"A", "A", "A", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B",
"B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B",
"B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B",
"B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "C", "C",
"C", "C", "C", "C", "C", "C", "C", "C", "C", "C", "C", "C", "C",
"C", "C", "C", "C", "C", "C", "C", "C", "C", "C", "C"), Date = c("2018-08-01",
"2018-08-02", "2018-08-03", "2018-10-22", "2018-10-23", "2018-10-24",
"2018-12-18", "2019-01-08", "2019-01-09", "2019-01-10", "2019-01-11",
"2019-01-14", "2019-01-15", "2019-01-16", "2019-02-07", "2019-02-08",
"2019-02-11", "2019-02-12", "2019-02-13", "2019-02-14", "2019-02-15",
"2019-02-18", "2019-02-19", "2019-02-20", "2019-03-15", "2019-03-18",
"2019-03-19", "2019-04-01", "2019-04-02", "2019-04-03", "2019-04-04",
"2019-04-10", "2019-04-11", "2019-04-12", "2019-04-15", "2018-08-01",
"2018-08-02", "2018-08-10", "2018-08-13", "2018-08-14", "2018-08-16",
"2018-08-17", "2018-10-23", "2018-10-24", "2018-10-25", "2018-10-26",
"2018-10-29", "2018-10-30", "2018-10-31", "2018-11-13", "2018-11-14",
"2018-11-22", "2018-11-23", "2018-12-06", "2018-12-07", "2018-12-10",
"2018-12-11", "2018-12-12", "2018-12-13", "2018-12-14", "2018-12-17",
"2018-12-18", "2019-02-06", "2019-02-07", "2019-02-08", "2019-02-11",
"2019-02-12", "2019-02-13", "2019-02-14", "2019-02-15", "2019-03-04",
"2019-03-05", "2019-03-06", "2019-03-07", "2019-03-08", "2019-03-11",
"2019-03-26", "2019-03-27", "2019-04-05", "2019-04-08", "2019-04-12",
"2019-04-15", "2018-08-01", "2018-08-02", "2018-08-03", "2018-08-06",
"2018-08-07", "2018-08-08", "2018-08-09", "2018-08-10", "2018-08-13",
"2018-08-14", "2018-08-23", "2019-01-29", "2019-03-01", "2019-03-04",
"2019-03-05", "2019-03-06", "2019-03-07", "2019-03-27", "2019-03-28",
"2019-03-29", "2019-04-01", "2019-04-02", "2019-04-03", "2019-04-04",
"2019-04-12", "2019-04-15"), `Return-RF` = c(NA, -0.031053409,
-0.004149784, -0.019431914, -0.025985785, -0.022325086, -0.013000177,
-0.005969802, 0.003743827, -0.005973689, -0.012279585, -0.012621233,
-0.014248868, -0.000850313, -0.038296552, -0.020249538, -0.002319941,
-0.003117846, -0.006643616, -0.012684205, 0.00480718, -0.000708029,
-0.007510481, -0.001464912, -0.008793153, -0.003356718, -0.005595538,
0.00592619, -0.006444843, 0.007778815, -0.01019018, -0.008793842,
-0.003549589, 0.000596707, -0.005270976, NA, -0.024337163, -0.030609843,
-0.012780354, -0.011857873, NA, -0.00906015, -0.035681946, -0.007920997,
-0.020963305, -0.013154577, 0.002038879, -0.019934722, 0.007708796,
-0.019404458, 0.000443959, -0.008925886, -0.017543139, -0.033810649,
-0.002362211, -0.02975915, -0.002819632, -0.000687416, -0.006733802,
-0.02423122, -0.017747687, -0.009444599, -0.006353213, -0.020454878,
-0.028563249, -0.005726489, -0.003094262, -0.001040783, -0.012626742,
-0.001097087, -0.009497361, -0.015542972, 5.53889e-05, -0.020560822,
-0.023744172, -0.00744049, -0.00193544, -0.013016594, -0.008529772,
-0.005602241, -0.004651093, -0.005644803, NA, -0.02207606, -0.006369491,
-0.012551725, -0.003201358, -0.01153393, -0.010203346, -0.033352688,
-0.01224557, -0.011346633, -0.012929118, -0.006728953, -0.004243723,
-0.012659234, -0.009103863, -0.011760838, -0.023812576, -0.013908016,
-0.013459074, -0.004005417, 0.004751808, -0.007972052, 0.006040872,
-0.011324789, -0.000427748, -0.007779257), `Mkt-RF` = c(-0.64,
-1.36, 0.36, -0.85, -1.53, -1.26, -0.41, 0.61, 1.51, -0.13, -0.21,
-0.6, -0.01, 0.19, -1.63, -0.75, 0.33, 0.94, 0.07, 0.01, 1.22,
0.46, 0.12, 0.55, 0.93, 0.39, 0.62, 1.09, 0.45, 1.01, -0.28,
0.25, 0.11, 0.63, 0.3, -0.64, -1.36, -2.01, -0.28, -0.54, 0.71,
0.41, -1.53, -1.26, 0.5, -0.61, 0.65, -0.07, 1.37, 1.01, -0.28,
-0.44, -0.29, -2.49, 0.45, -1.98, 0.8, 1.98, -0.13, -1.23, -0.93,
-0.41, -0.28, -1.63, -0.75, 0.33, 0.94, 0.07, 0.01, 1.22, 0.03,
-0.03, -0.19, -1.44, -0.47, 0.85, 0.31, -0.14, 0.15, 0.24, 0.63,
0.3, -0.64, -1.36, 0.36, -0.18, 0.73, -0.08, -0.42, -2.01, -0.28,
-0.54, -0.54, 0.43, 0.52, 0.03, -0.03, -0.19, -1.44, -0.14, -0.34,
0.67, 1.09, 0.45, 1.01, -0.28, 0.63, 0.3)), class = "data.frame", row.names = c(NA,
-108L))
Now I would like to compute the beta via the CAPM for the three different funds.
I tried with the lm function, but it gives only one beta for all three funds together.
I tried with the following code:
Panel <- Panel %>%
group_by(`Fund Name`)
Regression <- lm(Panel$`Return-RF`~ Panel$`Mkt-RF`)
Could someone help me here with the code?

You can split() your dataframe by fund, then run the regression on each subset using lapply():
Panel_Funds <- split(Panel, Panel$`Fund Name`)
Regressions <- lapply(
Panel_Funds,
\(x) lm(`Return-RF` ~ `Mkt-RF`, data = x)
)
Regressions
Output:
$A
Call:
lm(formula = `Return-RF` ~ `Mkt-RF`, data = x)
Coefficients:
(Intercept) `Mkt-RF`
-0.00964 0.01205
$B
Call:
lm(formula = `Return-RF` ~ `Mkt-RF`, data = x)
Coefficients:
(Intercept) `Mkt-RF`
-0.010538 0.008266
$C
Call:
lm(formula = `Return-RF` ~ `Mkt-RF`, data = x)
Coefficients:
(Intercept) `Mkt-RF`
-0.009401 0.010676
If you want to save the coefficients to a table, you can use broom::tidy(); see my answer here for an example.
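As a dependency-free alternative to broom::tidy(), the coefficients of each per-fund model can also be collected with coef(). The sketch below uses a small made-up panel (not the data from the question); the names toy_panel, fits and beta_table are hypothetical:

```r
# Toy panel with made-up numbers (NOT the asker's data).
# check.names = FALSE keeps the non-syntactic column names intact.
toy_panel <- data.frame(
  `Fund Name` = rep(c("A", "B"), each = 5),
  `Mkt-RF`    = c(-1, 0, 1, 2, 3, -1, 0, 1, 2, 3),
  `Return-RF` = c(-2, 0, 2, 4, 6, -0.5, 0, 0.5, 1, 1.5),
  check.names = FALSE
)

# One regression per fund, as in the split()/lapply() approach above
fits <- lapply(
  split(toy_panel, toy_panel$`Fund Name`),
  function(x) lm(`Return-RF` ~ `Mkt-RF`, data = x)
)

# Collect intercept (alpha) and slope (beta) into one table
beta_table <- data.frame(
  fund  = names(fits),
  alpha = vapply(fits, function(m) coef(m)[[1]], numeric(1)),
  beta  = vapply(fits, function(m) coef(m)[[2]], numeric(1)),
  row.names = NULL
)
beta_table
```

Each row of beta_table then holds one fund, so the table can be merged back onto other fund-level data.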

Are you trying to calculate the variance and covariance to compute the beta?
I would turn your data into a tibble and drop the NA values,
data <- data %>% as_tibble() %>% drop_na()
then you can easily extract the rows for each fund,
fundA <- data %>% filter(`Fund Name` == "A")
and then get the variance,
var(fundA$`Return-RF`)
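To finish the variance/covariance route: the CAPM beta is the covariance of the fund's excess return with the market excess return, divided by the variance of the market excess return. A minimal sketch on made-up numbers (the column names fund, mkt_rf and ret_rf are hypothetical stand-ins for the columns in the question), which also checks that the result matches the lm() slope:

```r
# Made-up excess returns for two funds (illustrative only)
panel <- data.frame(
  fund   = rep(c("A", "B"), each = 5),
  mkt_rf = c(-1.0, 0.5, 1.2, NA, 0.3, -0.4, 0.9, -1.1, 0.2, 0.6),
  ret_rf = c(-0.9, 0.7, 1.0, 0.2, 0.1, -0.1, 0.5, -0.8, NA, 0.4)
)

# Drop incomplete rows so cov() and var() use the same observations
panel <- panel[complete.cases(panel), ]

# beta = cov(fund excess return, market excess return) / var(market excess return)
beta_by_fund <- sapply(
  split(panel, panel$fund),
  function(d) cov(d$ret_rf, d$mkt_rf) / var(d$mkt_rf)
)

# Same quantity obtained as the OLS slope, for comparison
slope_by_fund <- sapply(
  split(panel, panel$fund),
  function(d) coef(lm(ret_rf ~ mkt_rf, data = d))[[2]]
)

beta_by_fund
```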

Related

Identifying misclassified values in a confusion matrix in R

I am using the caret package along with the confusionMatrix function, and I would like to know whether it is possible to find out exactly which values were not classified properly.
Here is a subset of my train data
train_sub <- structure(
list(
corr = c(
0.629922866893549,
0.632354159559817,
0.656112138936032,
0.4469719807955,
0.598136079870775,
0.314461239093862,
0.379065842199838,
0.347331370037428,
0.310270891798492,
0.361064451331448,
0.335628455451358
),
rdist = c(
0.775733824285612,
0.834148208687529,
0.884167982488944,
0.633989717138057,
0.850225777237626,
0.626197919283803,
0.649597055761598,
0.680382136363523,
0.627828985862852,
0.713674404108905,
0.646094473468118
),
CCF2 = c(
0.634465565134314,
0.722096802135009,
0.792385621105087,
0.46497582143802,
0.739612023831014,
0.470724554509749,
0.505961260826622,
0.527876803999064,
0.461724328071479,
0.564117580569802,
0.490084457081904
),
Wcorr = c(
0.629,
0.613,
0.812,
0.424,
0.593,
0.36,
0.346,
0.286,
0.333,
0.381,
0.333
),
Wcorr2 = c(
0.735,
0.743,
0.802,
0.588,
0.691,
0.632,
0.61,
0.599,
0.599,
0.632,
0.613
),
Wcorr3 = c(
0.21,
0.301,
0.421,
-0.052,
0.169,
-0.032,
-0.042,-0.048,
-0.035,
0.006,
-0.004
),
Var = c("W", "W", "W", "W",
"W", "B", "B", "B", "B", "B", "B")
),
row.names = c(1L, 2L,
3L, 5L, 7L, 214L, 215L, 216L, 217L, 218L, 221L),
class = "data.frame"
)
and here is a subset of my test data
test_sub <- structure(
list(
corr = c(
0.636658204667785,
0.5637857758104,
0.540558984461647,
0.392647603023863,
0.561801911406989,
0.297187412065481,
0.278864501603015,
0.505277007007347,
0.403811785308709,
0.510158398354856,
0.459607853624603
),
rdist = c(
0.887270722679019,
0.843656768956754,
0.815806338767273,
0.732093571145576,
0.832944903081762,
0.485497073465096,
0.454461718498521,
0.69094669881886,
0.627667080657035,
0.705558894672344,
0.620838398507191
),
CCF2 = c(
0.802017782695131,
0.731763898271157,
0.689402284804853,
0.577932997250877,
0.715111899030751,
0.324826043263382,
0.298456267077388,
0.544808216945995,
0.458148923874818,
0.551160266327893,
0.461228649848996
),
Wcorr = c(
0.655,
0.536,
0.677,
0.556,
0.571,
0.29,
0.25,
0.484,
0.25,
0.515,
0.314
),
Wcorr2 = c(
0.779,
0.682,
0.734,
0.675,
0.736,
0.5,
0.529,
0.611,
0.555,
0.639,
0.572
),
Wcorr3 = c(
0.368,
0.154,
0.266,
0.103,
0.224,
-0.204,
-0.16,
-0.026,
-0.149,
0.032,
-0.097
),
Var = c("W", "W", "W", "W", "W", "B", "B", "B", "B", "B",
"B")
),
row.names = c(4L, 6L, 8L, 13L, 15L, 321L, 322L, 329L,
334L, 341L, 344L),
class = "data.frame"
)
When I use this line,
confusionMatrix(reference=as.factor(test$Var),data=fittedTL,mode = "everything")
With this I train a model using the glmnet method (it gives the best accuracy in my case)
classCtrl <- trainControl(method = "repeatedcv", number=10,repeats=5,classProbs = TRUE,savePredictions = "final")
set.seed(355)
glmnetTL <- train(Var~., train_sub, method= "glmnet", trControl=classCtrl)
glmnetTL
And finally I compute the confusion matrix on my test set:
predict_glmnet <- predict(glmnetTL,test_sub)
predict_glmnet
CM_glmnet <- confusionMatrix(reference=as.factor(test_sub$Var),data=predict_glmnet,mode = "everything")
CM_glmnet
The output of the confusion matrix is a table like so (rows are predictions, columns are the reference):
    B  W
B   4  0
W   2  5
So here I have two predictions/classifications that are wrong.
Is there any way I can trace them back to the corresponding rows of my test set?
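One possible approach is to compare the predicted classes with the true classes element by element and use the resulting positions to subset the test set. The sketch below uses a made-up test set and made-up predictions (test_df, pred and truth are hypothetical names, not the model output from the question):

```r
# Hypothetical test set; row names mimic the original data's row indices
test_df <- data.frame(
  x   = c(0.9, 0.8, 0.2, 0.6, 0.1),
  Var = c("W", "W", "B", "B", "B"),
  row.names = c(4L, 6L, 321L, 329L, 334L)
)

# Hypothetical predictions, e.g. what predict(model, test_df) might return
pred  <- factor(c("W", "W", "B", "W", "B"), levels = c("B", "W"))
truth <- factor(test_df$Var, levels = c("B", "W"))

# Positions where prediction and reference disagree
wrong <- which(pred != truth)

# Rows of the test set that were misclassified, with the predicted label attached
misclassified <- test_df[wrong, , drop = FALSE]
misclassified$predicted <- pred[wrong]
misclassified
```

Because subsetting keeps the row names, the result points back to the original rows of the test set.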

How to accurately estimate the start of an increasing value of a variable in time?

Goal
I have brake force (kg) data for many drivers, and I want to find when the brake application started in time. Particularly, I need the time frame of brake start. Following are three examples of brake pedal force and the desired location of the brake start of time frames:
Estimating Brake start
I estimated the brake start by assuming that it is a changepoint. So, I used the changepoint package in R. But I get some of them right and others wrong (the vertical red line below represents the estimated changepoint):
You can see the changepoints for participants B and C are (almost) correct, but incorrect for participant A. In my full dataset, there are many incorrect values so manually estimating them is going to be very time consuming.
Do you have any suggestions to accurately estimate the brake start? Thank you for your time.
The data and code for the above figure are provided below.
Data and Code
Data
foo <- structure(list(participant = c("A", "A", "A", "A", "A", "A",
"A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A",
"A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A",
"A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A",
"A", "A", "A", "A", "A", "A", "A", "A", "B", "B", "B", "B", "B",
"B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B",
"B", "B", "B", "B", "C", "C", "C", "C", "C", "C", "C", "C", "C",
"C", "C", "C", "C", "C", "C", "C", "C", "C", "C", "C", "C", "C",
"C", "C", "C", "C", "C", "C", "C", "C", "C", "C", "C", "C", "C",
"C", "C", "C", "C", "C", "C", "C", "C", "C", "C", "C", "C", "C",
"C", "C", "C", "C"), frames = c(39614, 39644, 39674, 39704, 39734,
39764, 39794, 39824, 39854, 39884, 39914, 39944, 39974, 40004,
40034, 40064, 40094, 40124, 40154, 40184, 40214, 40244, 40274,
40304, 40334, 40364, 40394, 40424, 40454, 40484, 40514, 40544,
40574, 40604, 40634, 40664, 40694, 40724, 40754, 40784, 40814,
40844, 40874, 40904, 40934, 40964, 40994, 41024, 41054, 41084,
41114, 41144, 41174, 45296, 45326, 45356, 45386, 45416, 45446,
45476, 45506, 45536, 45566, 45596, 45626, 45656, 45686, 45716,
45746, 45776, 45806, 45836, 45866, 45896, 45926, 63792, 63822,
63852, 63882, 63912, 63942, 63972, 64002, 64032, 64062, 64092,
64122, 64152, 64182, 64212, 64242, 64272, 64302, 64332, 64362,
64392, 64422, 64452, 64482, 64512, 64542, 64572, 64602, 64632,
64662, 64692, 64722, 64752, 64782, 64812, 64842, 64872, 64902,
64932, 64962, 64992, 65022, 65052, 65082, 65112, 65142, 65172,
65202, 65232, 65262, 65292, 65322), ED_brake_pedal_force_kg = c(0.34,
0.34, 0.34, 0.33, 0.33, 0.34, 0.32, 0.34, 0.34, 0.34, 0.34, 0.32,
0.34, 0.34, 0.37, 0.32, 0.32, 0.33, 0.34, 0.32, 0.33, 0.34, 0.34,
0.72, 2.01, 2.91, 4.57, 5.73, 5.84, 5.82, 5.21, 5.23, 5.23, 4.41,
4, 3.57, 3.09, 2.28, 1.37, 0.33, 0.33, 0.65, 1.21, 3.36, 4.91,
5.2, 5.96, 6.24, 7.6, 14.13, 25.8, 32.37, 37.71, 0.32, 0.34,
0.33, 0.32, 1.72, 8.93, 18.83, 22.78, 39.5, 66.63, 9.46, 2.24,
0.33, 0.34, 1.9, 5.5, 8.55, 10.66, 12.24, 12.24, 12.24, 12.27,
0.29, 0.29, 0.31, 0.31, 0.3, 0.29, 0.3, 0.3, 0.3, 0.29, 0.3,
0.31, 0.3, 0.29, 0.29, 0.91, 2.79, 3.67, 4.24, 5.61, 5.91, 6.08,
5.4, 4.46, 3.74, 3.85, 4, 4.43, 2.08, 0.7, 0.3, 0.29, 0.31, 0.32,
0.34, 0.69, 0.83, 0.83, 0.84, 1.36, 1.68, 2.04, 3.87, 5.21, 7.28,
9.84, 13.49, 14.83, 14.79, 14.79, 14.79, 14.71)), row.names = c(NA,
-127L), class = c("tbl_df", "tbl", "data.frame"))
Code
Estimation of changepoint and plotting:
library(changepoint)
library(tidyverse)
foo %>%
group_by(participant) %>%
mutate(brake_start_frame = frames[cpts(cpt.meanvar(ED_brake_pedal_force_kg,
Q = 8,
method = "BinSeg"))][1]) %>%
ungroup() %>%
ggplot() +
geom_line(aes(x = frames, y = ED_brake_pedal_force_kg)) +
geom_vline(aes(xintercept = brake_start_frame), color="red") +
facet_wrap(~ participant, scales = "free_x")
Since this is a time-series problem, you can explore the TTR::momentum function. Whenever the momentum rises above a chosen threshold, it triggers the event.
library(TTR)
library(data.table)
library(ggplot2)
setDT(foo)
foo[, momentum := TTR::momentum(ED_brake_pedal_force_kg, 5), by = participant]
ggplot(foo) +
geom_line(aes(x = momentum, y = ED_brake_pedal_force_kg)) +
facet_wrap(~ participant, scales = "free_x")
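The threshold step described above could be sketched as follows. This is only an illustration: base R diff() stands in for TTR::momentum() (both give the change over n steps), the helper first_rise() is hypothetical, and the 0.3 kg threshold is an assumed value that would need tuning on the real data.

```r
# Hypothetical helper: first frame at which the n-step increase in force
# exceeds `threshold`. Returns NA if the threshold is never exceeded.
first_rise <- function(frames, force, threshold, n = 1) {
  # n-step change, padded with NA so it aligns with `frames`
  mom <- c(rep(NA_real_, n), diff(force, lag = n))
  idx <- which(mom > threshold)[1]
  frames[idx]
}

# Made-up signal: flat baseline around 0.33 kg, then the brake is applied
frames <- seq(100, by = 30, length.out = 9)
force  <- c(0.34, 0.33, 0.34, 0.32, 0.34, 0.72, 2.01, 2.91, 4.57)

brake_start_frame <- first_rise(frames, force, threshold = 0.3)
brake_start_frame
```

With grouped data, the same helper could be applied per participant, for example inside a data.table `by = participant` call.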

How do I remove white space from between letters and not numbers?

I have character strings that I want to convert to tables. The identifier in each row can have white spaces and I need them removed without also removing spaces between the numbers. Is it possible to use a regular expression to achieve this?
For example, the data would look like this:
A B C 5.65 7.8
DC 5.65 7.8
D AB 7.9 12.2
D AB C 7.9 1.2
A BC 13.88 2.4
AB C 7.9 12.2
And I want to get to this:
ABC 5.65 7.8
DC 5.65 7.8
DAB 7.9 12.2
DABC 7.9 1.2
ABC 13.88 2.4
ABC 7.9 12.2
EDIT: As requested, this is an example of the data type and the form in which I receive it. This has 17 rows, each with 6 columns of data preceded by an alphabetic identifier in the first column.
# Data as I receive it.
data <- c("A", "a", "2.07", "2.35", "39.00", "82.20", "8.8", "3.80",
"B", "2.26", "2.25", "40.00", "80.80", "8.1", "1.86", "D",
"Et", "2.07", "2.22", "41.00", "83.80", "8.8", "3.87", "F",
"2.05", "2.15", "43.00", "82.20", "8.4", "3.11", "Bc", "2.08",
"2.12", "48.00", "82.60", "8.3", "2.47", "Gf", "H", "I",
"2.08", "2.10", "46.00", "82.20", "8.1", "2.90", "J", "K",
"1.95", "2.08", "38.00", "83.40", "8.7", "1.63", "L", "M",
"1.89", "2.07", "45.00", "83.80", "9.0", "1.84", "N", "2.06",
"2.05", "41.00", "80.60", "9.0", "4.09", "O", "P", "1.86",
"2.04", "48.00", "81.60", "8.6", "2.60", "Qst", "R", "1.95",
"2.03", "44.00", "82.80", "8.8", "1.40", "S", "2.03", "2.02",
"40.00", "81.40", "8.2", "1.74", "T", "1.95", "2.01", "43.00",
"81.80", "9.0", "2.30", "Unh", "1.96", "2.00", "44.00", "82.60",
"9.2", "2.40", "V", "W", "C", "1.98", "1.97", "40.00",
"82.00", "8.1", "1.15", "Yu", "1.90", "1.96", "41.00", "82.80",
"9.6", "2.08", "Z", "a", "bi", "1.90", "1.95", "42.00",
"84.20", "9.6", "1.69")
# Required format
data2 <- c("Aa", "2.07", "2.35", "39.00", "82.20", "8.8", "3.80",
"B", "2.26", "2.25", "40.00", "80.80", "8.1", "1.86",
"DEt", "2.07", "2.22", "41.00", "83.80", "8.8", "3.87", "F",
"2.05", "2.15", "43.00", "82.20", "8.4", "3.11", "Bc", "2.08",
"2.12", "48.00", "82.60", "8.3", "2.47", "GfHI",
"2.08", "2.10", "46.00", "82.20", "8.1", "2.90", "JK",
"1.95", "2.08", "38.00", "83.40", "8.7", "1.63", "LM",
"1.89", "2.07", "45.00", "83.80", "9.0", "1.84", "N", "2.06",
"2.05", "41.00", "80.60", "9.0", "4.09", "OP", "1.86",
"2.04", "48.00", "81.60", "8.6", "2.60", "QstR", "1.95",
"2.03", "44.00", "82.80", "8.8", "1.40", "S", "2.03", "2.02",
"40.00", "81.40", "8.2", "1.74", "T", "1.95", "2.01", "43.00",
"81.80", "9.0", "2.30", "Unh", "1.96", "2.00", "44.00", "82.60",
"9.2", "2.40", "VWC", "1.98", "1.97", "40.00",
"82.00", "8.1", "1.15", "Yu", "1.90", "1.96", "41.00", "82.80",
"9.6", "2.08", "Zabi", "1.90", "1.95", "42.00",
"84.20", "9.6", "1.69")
df <- data.frame(matrix(data2, ncol=7, byrow=T))
To do as you request within your R environment, one approach is to convert the vector to a string, apply a regular expression filter to the string, then convert the string back to a vector.
See details below, hopefully this points you in the right direction.
Solution
data <- c("A", "a", "2.07", "2.35", "39.00", "82.20", "8.8", "3.80",
"B", "2.26", "2.25", "40.00", "80.80", "8.1", "1.86", "D",
"Et", "2.07", "2.22", "41.00", "83.80", "8.8", "3.87", "F",
"2.05", "2.15", "43.00", "82.20", "8.4", "3.11", "Bc", "2.08",
"2.12", "48.00", "82.60", "8.3", "2.47", "Gf", "H", "I",
"2.08", "2.10", "46.00", "82.20", "8.1", "2.90", "J", "K",
"1.95", "2.08", "38.00", "83.40", "8.7", "1.63", "L", "M",
"1.89", "2.07", "45.00", "83.80", "9.0", "1.84", "N", "2.06",
"2.05", "41.00", "80.60", "9.0", "4.09", "O", "P", "1.86",
"2.04", "48.00", "81.60", "8.6", "2.60", "Qst", "R", "1.95",
"2.03", "44.00", "82.80", "8.8", "1.40", "S", "2.03", "2.02",
"40.00", "81.40", "8.2", "1.74", "T", "1.95", "2.01", "43.00",
"81.80", "9.0", "2.30", "Unh", "1.96", "2.00", "44.00", "82.60",
"9.2", "2.40", "V", "W", "C", "1.98", "1.97", "40.00",
"82.00", "8.1", "1.15", "Yu", "1.90", "1.96", "41.00", "82.80",
"9.6", "2.08", "Z", "a", "bi", "1.90", "1.95", "42.00",
"84.20", "9.6", "1.69")
# Use stringi base regular expression engine
require(stringi)
# Convert the vector data to be a string sequence - so we can manipulate as text
data1 <- toString(data)
# Now we can apply the regular expression substitution to the data (formatted as a string).
# The pattern consists of:
#
# (?<!\d) - Negative look behind: the previous character must not be a digit.
# ', ' - A literal comma and space, the separator that toString() puts between elements.
# (?!\d) - Negative look ahead: the next character must not be a digit.
#
data3 = stri_replace_all_regex(str = data1, pattern = '(?<!\\d), (?!\\d)', replacement = '')
# OK, check the string data...
data3
# Now we convert the string back to be a vector...
newData = strsplit(data3, " ")[[1]]
newData
# Now we convert to a dataframe...
df <- data.frame(matrix(newData, ncol=7, byrow=T))
df
# Done
Output
> data <- c("A", "a", "2.07", "2.35", "39.00", "82.20", "8.8", "3.80",
+ "B", "2.26", "2.25", "40.00", "80.80", "8.1", "1.86", "D",
+ "Et", "2.07", "2.22", "41.00", "83.80", "8.8", "3.87", "F",
+ "2.05", "2.15", "43.00", "82.20", "8.4", "3.11", "Bc", "2.08",
+ "2.12", "48.00", "82.60", "8.3", "2.47", "Gf", "H", "I",
+ "2.08", "2.10", "46.00", "82.20", "8.1", "2.90", "J", "K",
+ "1.95", "2.08", "38.00", "83.40", "8.7", "1.63", "L", "M",
+ "1.89", "2.07", "45.00", "83.80", "9.0", "1.84", "N", "2.06",
+ "2.05", "41.00", "80.60", "9.0", "4.09", "O", "P", "1.86",
+ "2.04", "48.00", "81.60", "8.6", "2.60", "Qst", "R", "1.95",
+ "2.03", "44.00", "82.80", "8.8", "1.40", "S", "2.03", "2.02",
+ "40.00", "81.40", "8.2", "1.74", "T", "1.95", "2.01", "43.00",
+ "81.80", "9.0", "2.30", "Unh", "1.96", "2.00", "44.00", "82.60",
+ "9.2", "2.40", "V", "W", "C", "1.98", "1.97", "40.00",
+ "82.00", "8.1", "1.15", "Yu", "1.90", "1.96", "41.00", "82.80",
+ "9.6", "2.08", "Z", "a", "bi", "1.90", "1.95", "42.00",
+ "84.20", "9.6", "1.69")
>
> # Use stringi base regular expression engine
> require(stringi)
>
> # Convert the vector data to be a string sequence - so we can manipulate as text
> data1 <- toString(data)
>
> # Now we can apply the regular expression substitution to the data (formatted as a string...
> # Here we do a:
> #
> # (?<!\d) - Negative look behind to prevent a digit.
> # , - A literal combination of quotes, comma and space. We drop the ", " in conversion to string...
> # (?!\d) - Negative look ahead to prevent a digit.
> #
> data3 = stri_replace_all_regex(str = data1, pattern = '(?<!\\d), (?!\\d)', replacement = '')
> # OK, check the string data...
> data3
[1] "Aa, 2.07, 2.35, 39.00, 82.20, 8.8, 3.80, B, 2.26, 2.25, 40.00, 80.80, 8.1, 1.86, DEt, 2.07, 2.22, 41.00, 83.80, 8.8, 3.87, F, 2.05, 2.15, 43.00, 82.20, 8.4, 3.11, Bc, 2.08, 2.12, 48.00, 82.60, 8.3, 2.47, GfHI, 2.08, 2.10, 46.00, 82.20, 8.1, 2.90, JK, 1.95, 2.08, 38.00, 83.40, 8.7, 1.63, LM, 1.89, 2.07, 45.00, 83.80, 9.0, 1.84, N, 2.06, 2.05, 41.00, 80.60, 9.0, 4.09, OP, 1.86, 2.04, 48.00, 81.60, 8.6, 2.60, QstR, 1.95, 2.03, 44.00, 82.80, 8.8, 1.40, S, 2.03, 2.02, 40.00, 81.40, 8.2, 1.74, T, 1.95, 2.01, 43.00, 81.80, 9.0, 2.30, Unh, 1.96, 2.00, 44.00, 82.60, 9.2, 2.40, VWC, 1.98, 1.97, 40.00, 82.00, 8.1, 1.15, Yu, 1.90, 1.96, 41.00, 82.80, 9.6, 2.08, Zabi, 1.90, 1.95, 42.00, 84.20, 9.6, 1.69"
>
> # Now we convert the string back to be a vector...
> newData = strsplit(data3, " ")[[1]]
> newData
[1] "Aa," "2.07," "2.35," "39.00," "82.20," "8.8," "3.80," "B," "2.26," "2.25," "40.00," "80.80,"
[13] "8.1," "1.86," "DEt," "2.07," "2.22," "41.00," "83.80," "8.8," "3.87," "F," "2.05," "2.15,"
[25] "43.00," "82.20," "8.4," "3.11," "Bc," "2.08," "2.12," "48.00," "82.60," "8.3," "2.47," "GfHI,"
[37] "2.08," "2.10," "46.00," "82.20," "8.1," "2.90," "JK," "1.95," "2.08," "38.00," "83.40," "8.7,"
[49] "1.63," "LM," "1.89," "2.07," "45.00," "83.80," "9.0," "1.84," "N," "2.06," "2.05," "41.00,"
[61] "80.60," "9.0," "4.09," "OP," "1.86," "2.04," "48.00," "81.60," "8.6," "2.60," "QstR," "1.95,"
[73] "2.03," "44.00," "82.80," "8.8," "1.40," "S," "2.03," "2.02," "40.00," "81.40," "8.2," "1.74,"
[85] "T," "1.95," "2.01," "43.00," "81.80," "9.0," "2.30," "Unh," "1.96," "2.00," "44.00," "82.60,"
[97] "9.2," "2.40," "VWC," "1.98," "1.97," "40.00," "82.00," "8.1," "1.15," "Yu," "1.90," "1.96,"
[109] "41.00," "82.80," "9.6," "2.08," "Zabi," "1.90," "1.95," "42.00," "84.20," "9.6," "1.69"
>
> # Now we convert to a dataframe...
> df <- data.frame(matrix(newData, ncol=7, byrow=T))
> df
X1 X2 X3 X4 X5 X6 X7
1 Aa, 2.07, 2.35, 39.00, 82.20, 8.8, 3.80,
2 B, 2.26, 2.25, 40.00, 80.80, 8.1, 1.86,
3 DEt, 2.07, 2.22, 41.00, 83.80, 8.8, 3.87,
4 F, 2.05, 2.15, 43.00, 82.20, 8.4, 3.11,
5 Bc, 2.08, 2.12, 48.00, 82.60, 8.3, 2.47,
6 GfHI, 2.08, 2.10, 46.00, 82.20, 8.1, 2.90,
7 JK, 1.95, 2.08, 38.00, 83.40, 8.7, 1.63,
8 LM, 1.89, 2.07, 45.00, 83.80, 9.0, 1.84,
9 N, 2.06, 2.05, 41.00, 80.60, 9.0, 4.09,
10 OP, 1.86, 2.04, 48.00, 81.60, 8.6, 2.60,
11 QstR, 1.95, 2.03, 44.00, 82.80, 8.8, 1.40,
12 S, 2.03, 2.02, 40.00, 81.40, 8.2, 1.74,
13 T, 1.95, 2.01, 43.00, 81.80, 9.0, 2.30,
14 Unh, 1.96, 2.00, 44.00, 82.60, 9.2, 2.40,
15 VWC, 1.98, 1.97, 40.00, 82.00, 8.1, 1.15,
16 Yu, 1.90, 1.96, 41.00, 82.80, 9.6, 2.08,
17 Zabi, 1.90, 1.95, 42.00, 84.20, 9.6, 1.69
> # Done
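Note that the values in the output above still carry trailing commas, because the string is split on " " rather than on ", ". A minimal variation that avoids this is sketched below; it uses base R gsub() with perl = TRUE in place of stringi (an assumption made only to keep the example dependency-free) and a shortened, made-up input vector with three columns per row:

```r
# Shortened example input: an identifier spread over 1-2 elements, then 2 numbers
raw <- c("A", "a", "2.07", "2.35",
         "B", "2.26", "2.25",
         "D", "Et", "2.07", "2.22")

# Collapse to one string: "A, a, 2.07, 2.35, B, ..."
as_text <- toString(raw)

# Remove ", " only where neither neighbour is a digit,
# i.e. between the pieces of an identifier
joined <- gsub("(?<!\\d), (?!\\d)", "", as_text, perl = TRUE)

# Split on the full separator ", " so no commas are left in the values
clean <- strsplit(joined, ", ", fixed = TRUE)[[1]]

df_clean <- data.frame(matrix(clean, ncol = 3, byrow = TRUE))
df_clean
```

The numeric columns are still character at this point and can be converted with as.numeric() if needed.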

Interrelated columns in R data.table

To keep track of cash flow I have a number of interrelated columns in a data.table:
"Amount_spent" is always 5% of the "Balance".
"Revenue" is "Amount_spent" * "Price"
"Balance" is the cumulative sum of "Revenue" (starting at 100.00).
Transactions only happen on "Day" "a"
I am struggling to calculate these interrelated columns concurrently.
Example as I would like:
library(data.table)
Day <- c( "a", "c", "b", "a", "b", "a", "b", "c", "a", "a" )
Price <- c( 0.6, 0.4, 0.9, -0.3, 0.8, 0.2, 0.3, 0.9, 0.9, -0.7 )
Balance <- c( 100.00, 103.00, 103.00, 103.00, 101.46, 101.46, 102.47, 102.47, 102.47, 107.08 )
Amount_spent <- c( 5.00, 0.00, 0.00, 5.15, 0.00, 5.07, 0.00, 0.00, 5.12, 5.35 )
Revenue <- c( 3.00, 0.00, 0.00, -1.55, 0.00, 1.01, 0.00, 0.00, 4.61, -3.75 )
DT <- data.table( Day, Price, Balance, Amount_spent, Revenue )
DT
Here is my attempt so far:
# set initial balance
Balance2 <- 100.00
Day2 <- c( "a", "c", "b", "a", "b", "a", "b", "c", "a", "a" )
Price2 <- c( 0.6, 0.4, 0.9, -0.3, 0.8, 0.2, 0.3, 0.9, 0.9, -0.7 )
my.try <- data.table( Day2, Price2 )
my.try[, Balance2 := cumsum( Revenue2 )]
my.try[ Day2 == "a", Amount_spent2 := Balance2 * 0.05 ]
my.try[is.na(Amount_spent2), Amount_spent2 := 0]
my.try[, Revenue2 := Price2 * Amount_spent2 ]
my.try
As you will see, it fails with the error message "Error in eval(expr, envir, enclos) : object 'Revenue2' not found", because the "Revenue2" column has not been created yet.
Thank you
You get the error at the line my.try[, Balance2 := cumsum( Revenue2 )], which tries to use the column Revenue2 before it exists in my.try.
library(data.table)
Day <- c( "a", "c", "b", "a", "b", "a", "b", "c", "a", "a" )
Price <- c( 0.6, 0.4, 0.9, -0.3, 0.8, 0.2, 0.3, 0.9, 0.9, -0.7 )
Balance <- c( 100.00, 103.00, 103.00, 103.00, 101.46, 101.46, 102.47, 102.47, 102.47, 107.08 )
Amount_spent <- c( 5.00, 0.00, 0.00, 5.15, 0.00, 5.07, 0.00, 0.00, 5.12, 5.35 )
Revenue <- c( 3.00, 0.00, 0.00, -1.55, 0.00, 1.01, 0.00, 0.00, 4.61, -3.75 )
DT <- data.table( Day, Price, Balance, Amount_spent, Revenue )
Balance2 <- 100.00
Day2 <- c( "a", "c", "b", "a", "b", "a", "b", "c", "a", "a" )
Price2 <- c( 0.6, 0.4, 0.9, -0.3, 0.8, 0.2, 0.3, 0.9, 0.9, -0.7 )
my.try <- data.table( Day2, Price2 )
my.try[, Balance2 := cumsum( Revenue2 )]
#Error in eval(expr, envir, enclos) : object 'Revenue2' not found
"Revenue2" %in% names(my.try)
#[1] FALSE
Your attempt does not produce the expected results. I'm not sure what you mean by calculating the columns concurrently. If you want to assign or update multiple columns by reference in a single step, you can use the `:=`() function the same way you would use .() or list() in data.table's j argument. For example: `:=`(col1 = 1+2, col2 = 2+3).
You can read more about update by reference in the Reference semantics vignette.
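For the interrelated columns themselves, one way to avoid the circular dependency is to notice that, under the rules stated in the question, the balance is multiplied by (1 + 0.05 * Price) on every "a" day and is unchanged otherwise. The sketch below is an assumption-based illustration in base R (the names rate, start and growth are hypothetical); the same expressions could be used inside data.table's `:=`:

```r
Day   <- c("a", "c", "b", "a", "b", "a", "b", "c", "a", "a")
Price <- c(0.6, 0.4, 0.9, -0.3, 0.8, 0.2, 0.3, 0.9, 0.9, -0.7)

rate  <- 0.05   # share of the balance spent on each "a" day
start <- 100    # initial balance

# Growth factor applied to the balance after each row
growth <- ifelse(Day == "a", 1 + rate * Price, 1)

# Balance at the start of each row: initial balance times all earlier growth factors
Balance <- start * c(1, cumprod(growth)[-length(growth)])

# The remaining columns now follow directly from the balance
Amount_spent <- ifelse(Day == "a", rate * Balance, 0)
Revenue      <- Amount_spent * Price

result <- data.frame(Day, Price, Balance, Amount_spent, Revenue)
result
```

Up to rounding to two decimals, these columns agree with the desired example table in the question.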

Show only a certain part of the x-axis when using plot(density(mydf))

There's an extreme value in my data. How can I show the density plot only for the "important" part of my data? I'd like to show the x-axis only from, let's say, -5 to +5 percent.
COMP <- c("A", "A", "A", "A", "A", "A", "A", "B", "B", "B", "B", "B", "B", "B")
RET <- c(-80,1.1,3,1.4,-0.2, 0.6, 0.1, -0.21, -1.2, 0.9, 0.3, -0.1,0.3,-0.12)
mydf <- data.frame(COMP, RET, stringsAsFactors=F)
plot(density(mydf$RET))
and the same with boxplot on the y-axis
boxplot(mydf$RET)
I know
boxplot(mydf$RET, outline=FALSE)
but here I want the range of the y-axis even smaller. How is that possible?
Thank you!
Use the arguments xlim and ylim to adjust the axis limits in base R graphics.
COMP <- c("A", "A", "A", "A", "A", "A", "A", "B", "B", "B", "B", "B", "B", "B")
RET <- c(-80,1.1,3,1.4,-0.2, 0.6, 0.1, -0.21, -1.2, 0.9, 0.3, -0.1,0.3,-0.12)
mydf <- data.frame(COMP, RET, stringsAsFactors=F)
par(mfrow = c(1,2)) # arrange plots side by side: 1 row, 2 columns
plot(density(mydf$RET),xlim=c(-5,5), main="")
boxplot(mydf$RET, ylim = c(-2,2), ylab="RET")
