How to insert a distance matrix into R and run hierarchical clustering

I know that if I have raw data I can create a distance matrix; however, for this problem I already have a distance matrix and I want to be able to run commands in R on it, like hclust. Below is the distance matrix I want in R. I am not sure storing this data in matrix form will work, as I would be unable to run hclust on a plain matrix.
I have tried to create it using the as.dist function, to no avail. My faulty code:
test=as.dist(c(.76,2.97,4.88,3.86,.8,4.17,1.96,.21,1.51,.51), diag = FALSE, upper = FALSE)
test
1 2 3 4 5 6 7 8 9
2 2.97
3 4.88 2.97
4 3.86 4.88 0.51
5 0.80 3.86 2.97 0.21
6 4.17 0.80 4.88 1.51 0.80
7 1.96 4.17 3.86 0.51 4.17 0.51
8 0.21 1.96 0.80 2.97 1.96 2.97 0.80
9 1.51 0.21 4.17 4.88 0.21 4.88 4.17 0.21
10 0.51 1.51 1.96 3.86 1.51 3.86 1.96 1.51 0.51

Since you already have the distance values, you don't need to use dist() to calculate them. The data can be stored in a regular matrix:
test <- matrix(ncol=5,nrow=5)
test[lower.tri(test)] <- c(.76,2.97,4.88,3.86,.8,4.17,1.96,.21,1.51,.51)
diag(test) <- 0
> test
[,1] [,2] [,3] [,4] [,5]
[1,] 0.00 NA NA NA NA
[2,] 0.76 0.00 NA NA NA
[3,] 2.97 0.80 0.00 NA NA
[4,] 4.88 4.17 0.21 0.00 NA
[5,] 3.86 1.96 1.51 0.51 0.00
In order to apply hclust(), this matrix can then be converted to a distance matrix with as.dist():
> test <- as.dist(test, diag = TRUE)
> test
1 2 3 4 5
1 0.00
2 0.76 0.00
3 2.97 0.80 0.00
4 4.88 4.17 0.21 0.00
5 3.86 1.96 1.51 0.51 0.00
> hclust(test)
#
#Call:
#hclust(d = test)
#
#Cluster method : complete
#Number of objects: 5
> plot(hclust(test))
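If your objects have names, you can set them as dimnames before converting, so the dendrogram leaves are labelled (a self-contained sketch; the obj1..obj5 labels are hypothetical):
# same distances as above, with hypothetical object names attached
test <- matrix(0, ncol = 5, nrow = 5)
test[lower.tri(test)] <- c(.76, 2.97, 4.88, 3.86, .8, 4.17, 1.96, .21, 1.51, .51)
dimnames(test) <- list(paste0("obj", 1:5), paste0("obj", 1:5))
plot(hclust(as.dist(test)))  # leaves are now labelled obj1..obj5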

Related

R: Error making time series stationary using diff()

I have a dataset of monthly data, with observations on the 1st of every month over a period of time.
Here's head() of my Date column:
> class(gas_data$Date)
[1] "Date"
> head(gas_data$Date)
[1] "2010-10-01" "2010-11-01" "2010-12-01" "2011-01-01" "2011-02-01" "2011-03-01"
Trying to make the time series stationary, I'm using diff() from the base package:
> gas_data_diff <- diff(gas_data, differences = 1, lag = 12)
> head(gas_data_diff)
data frame with 0 columns and 6 rows
> names(gas_data_diff)
character(0)
> gas_data_diff %>%
+ ggplot(aes(x=Date, y=Price.Gas)) +
+ geom_line(color="darkorchid4")
Error in FUN(X[[i]], ...) : object 'Price.Gas' not found
As you can see I get an error, and when I try to inspect the data with head() or look up the column names I get unexpected output.
Why am I getting this error and how can I fix it?
Here's head() of my original data:
> head(gas_data)
Date Per.Change Domestic.Production.from.UKCS Import Per.GDP.Growth Average.Temperature Price.Electricity Price.Gas
1 2010-10-01 2.08 3.54 5.40 0.2 10.44 43.50 46.00
2 2010-11-01 -3.04 3.46 6.74 -0.1 5.52 46.40 49.66
3 2010-12-01 0.31 3.54 9.00 -0.9 0.63 58.03 62.26
4 2011-01-01 2.65 3.59 7.58 0.6 4.05 48.43 55.98
5 2011-02-01 1.52 3.20 5.68 0.4 6.29 46.47 53.74
6 2011-03-01 -1.38 3.40 5.93 0.5 6.59 51.41 60.39
For instance, this is what the non-stationary plot of the original data looks like for gas prices (plot not shown).
Explanation
As per the help page for diff, the argument x must be:
x: a numeric vector or matrix containing the values to be differenced.
As in your case, diff returns an empty data.frame if x is a data.frame.
Best approach IMO
I don't see much point in using the date column in diff, so I would follow this approach:
# set the dates as row names, then drop the Date column
rownames(df) <- df$Date
# convert to a matrix and apply diff
diff_mat <- diff(as.matrix(df[, -1]), lag = 1)
# convert back to a data frame and restore the Date column
# (the first date is lost to differencing, hence df$Date[-1])
diff_df <- as.data.frame(diff_mat)
diff_df$Date <- df$Date[-1]
# now the plot call should work
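With the Date column restored, the ggplot call from the question should now run on diff_df (a quick usage sketch, assuming ggplot2 is installed):
library(ggplot2)
ggplot(diff_df, aes(x = Date, y = Price.Gas)) +
  geom_line(color = "darkorchid4")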
Using the given data to address the comments.
Convert the Date vector to numeric and then convert the frame to a matrix inside diff:
df$Date <- as.numeric(df$Date)
diff(as.matrix(df), 1, 2)
# Date Per.Change Domestic.Production.from.UKCS Import Per.GDP.Growth
# [1,] -1 8.47 0.16 0.92 -0.5
# [2,] 1 -1.01 -0.03 -3.68 2.3
# [3,] 0 -3.47 -0.44 -0.48 -1.7
# [4,] -3 -1.77 0.59 2.15 0.3
# Average.Temperature Price.Electricity Price.Gas
# [1,] 0.03 8.73 8.94
# [2,] 8.31 -21.23 -18.88
# [3,] -1.18 7.64 4.04
# [4,] -1.94 6.90 8.89
Creating the Data
df <- read.table(text = "Date Per.Change Domestic.Production.from.UKCS Import Per.GDP.Growth Average.Temperature Price.Electricity Price.Gas
2010-10-01 2.08 3.54 5.40 0.2 10.44 43.50 46.00
2010-11-01 -3.04 3.46 6.74 -0.1 5.52 46.40 49.66
2010-12-01 0.31 3.54 9.00 -0.9 0.63 58.03 62.26
2011-01-01 2.65 3.59 7.58 0.6 4.05 48.43 55.98
2011-02-01 1.52 3.20 5.68 0.4 6.29 46.47 53.74
2011-03-01 -1.38 3.40 5.93 0.5 6.59 51.41 60.39",
header = T, sep ="")
df$Date <- as.Date(df$Date)
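If you would rather stay with base R's time-series tooling, another option (my suggestion, not part of the answer above) is to hold the numeric columns in a ts object, for which diff() is well-defined:
# monthly series starting October 2010; ts() coerces the data frame to a matrix
gas_ts <- ts(df[, -1], start = c(2010, 10), frequency = 12)
head(diff(gas_ts, lag = 1))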

Using apply and abind with for loop

I am using apply and abind to create a dataframe with the average of all the individual values from three similar data frames. I want to loop this code where the only thing that changes is the name of the instrument I am using (CAI.600, Thermo.1, etc).
This is what I have so far:
FIDs <- c('CAI.600', 'Thermo.1')
for (Instrument in FIDs) {
A.avg <- apply(abind::abind(paste0('FID.Eval.A.1.', Instrument),
paste0('FID.Eval.A.2.', Instrument),
paste0('FID.Eval.A.3.', Instrument),
along = 3), 1:2, mean)
assign(paste0('FID.Eval.A.', Instrument), A.avg)
}
where all the df's look similar to this (same number of rows and columns):
> FID.Eval.A.1.CAI.600
FTIR O2 H2O CAI.600 CAI.600.bias
1 84.98 20.90 0.06 254.96 0.01
2 49.98 20.90 0.09 150.09 0.09
3 25.00 20.94 0.09 75.24 0.31
4 85.03 10.00 0.08 251.99 -1.22
5 50.03 10.00 0.09 148.51 -1.06
6 24.99 10.00 0.07 74.00 -1.27
7 84.99 0.10 0.06 246.99 -3.13
8 50.03 0.10 0.14 146.50 -2.39
9 24.96 0.10 0.10 72.97 -2.55
> FID.Eval.A.2.CAI.600
FTIR O2 H2O CAI.600 CAI.600.bias
1 85.45 21.37 0.53 255.43 0.48
2 50.45 21.37 0.56 150.56 0.56
3 25.47 21.41 0.56 75.71 0.78
4 85.50 10.47 0.55 252.46 -0.75
5 50.50 10.47 0.56 148.98 -0.59
6 25.46 10.47 0.54 74.47 -0.80
7 85.46 0.57 0.53 247.46 -2.66
8 50.50 0.57 0.61 146.97 -1.92
9 25.43 0.57 0.57 73.44 -2.08
> FID.Eval.A.3.CAI.600
FTIR O2 H2O CAI.600 CAI.600.bias
1 85.32 21.24 0.40 255.30 0.35
2 50.32 21.24 0.43 150.43 0.43
3 25.34 21.28 0.43 75.58 0.65
4 85.37 10.34 0.42 252.33 -0.88
5 50.37 10.34 0.43 148.85 -0.72
6 25.33 10.34 0.41 74.34 -0.93
7 85.33 0.44 0.40 247.33 -2.79
8 50.37 0.44 0.48 146.84 -2.05
9 25.30 0.44 0.44 73.31 -2.21
I either get an error message stating "along must be between 0 and 2", or when I adjust along I get a warning stating "argument is not numeric or logical: returning NA".
Should I be using something other than a for loop?
When I run abind without the for loop, the end result looks like this:
## Average of repeat tests
FID.Eval.A.CAI.600 <- apply(abind::abind(FID.Eval.A.1.CAI.600,
FID.Eval.A.2.CAI.600,
FID.Eval.A.3.CAI.600,
along = 3), 1:2, mean)
FID.Eval.A.CAI.600 <- as.data.frame(FID.Eval.A.CAI.600)
> FID.Eval.A.CAI.600
FTIR O2 H2O CAI.600 CAI.600.bias
1 85.25 21.17 0.33 255.23 0.28
2 50.25 21.17 0.36 150.36 0.36
3 25.27 21.21 0.36 75.51 0.58
4 85.30 10.27 0.35 252.26 -0.95
5 50.30 10.27 0.36 148.78 -0.79
6 25.26 10.27 0.34 74.27 -1.00
7 85.26 0.37 0.33 247.26 -2.86
8 50.30 0.37 0.41 146.77 -2.12
9 25.23 0.37 0.37 73.24 -2.28
Where 'FID.Eval.A.CAI.600' displays the average for each value from the three df's.
To fix the immediate problem, use get() to return an object by character reference. As it stands, your paste0 calls only return character strings, not the actual objects.
abind::abind(get(paste0('FID.Eval.A.1.', Instrument), envir=.GlobalEnv),
get(paste0('FID.Eval.A.2.', Instrument), envir=.GlobalEnv),
get(paste0('FID.Eval.A.3.', Instrument), envir=.GlobalEnv),
along = 3)
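Dropping that fix into the loop from the question gives (a sketch of the original approach; it keeps the assign() pattern, though see below for why assign is better avoided):
FIDs <- c('CAI.600', 'Thermo.1')
for (Instrument in FIDs) {
  # get() resolves each constructed name to the actual data frame
  A.avg <- apply(abind::abind(get(paste0('FID.Eval.A.1.', Instrument)),
                              get(paste0('FID.Eval.A.2.', Instrument)),
                              get(paste0('FID.Eval.A.3.', Instrument)),
                              along = 3), 1:2, mean)
  assign(paste0('FID.Eval.A.', Instrument), as.data.frame(A.avg))
}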
In fact, for a more dynamic solution, consider mget to retrieve all objects matching a name pattern without hard-coding each of the three objects.
Also, in R it is best to avoid assign as much as possible. Instead, consider building one list of many objects with functional assignment, and avoid flooding the global environment with many separate objects. The code below iterates with sapply to build a named list of average matrices.
FIDs <- c('CAI.600', 'Thermo.1')
mat_list <- sapply(FIDs, function(Instrument) {
  # grab every data frame whose name contains the instrument
  FIDs_list <- mget(ls(pattern=Instrument, envir=.GlobalEnv), envir=.GlobalEnv)
  # stack them into a 3-D array, one slice per data frame
  FIDs_arry <- do.call(abind::abind, c(FIDs_list, along=3))
  return(apply(FIDs_arry, 1:2, mean))
}, simplify = FALSE)
# OUTPUT ITEMS
mat_list$CAI.600
mat_list$Thermo.1
You can even update the names to conform to your original naming scheme:
names(mat_list) <- paste0("FID.Eval.A.", names(mat_list))
mat_list$FID.Eval.A.CAI.600
mat_list$FID.Eval.A.Thermo.1

Applying a custom function repeatedly to same dataframe using purrr

Suppose I have a dataframe as follows:
df <- data.frame(
alpha = 0:20,
beta = 30:50,
gamma = 100:120
)
I have a custom function that makes new columns. (Note, my actual function is a lot more complex and can't be vectorized without a custom function, so please ignore the substance of the transformation here.) For example:
newfun <- function(var = NULL) {
newname <- paste0(var, "NEW")
df[[newname]] <- df[[var]]/100
return(df)
}
I want to apply this over many columns of the dataset repeatedly and have the dataset "build up." This happens just fine when I do the following:
df <- newfun("alpha")
df <- newfun("beta")
df <- newfun("gamma")
Obviously this is redundant and a case for map. But when I do the following I get back a list of dataframes, which is not what I want:
df <- data.frame(
alpha = 0:20,
beta = 30:50,
gamma = 100:120
)
out <- c("alpha", "beta", "gamma") %>%
map(function(x) newfun(x))
How can I iterate over a vector of column names AND see the changes repeatedly applied to the same dataframe?
Writing the function to reach outside of its scope to find some df is risky and will eventually bite you, especially when you see something like:
df[['a']] <- 2
# Error in df[["a"]] <- 2 : object of type 'closure' is not subsettable
You will get this error when it doesn't find your variable named df, and instead finds the base function named df. Two morals from this discovery:
While I admit to using df myself, it's generally bad practice to name variables the same as R functions (especially from base); and
Scope-breach is sloppy; it renders a workflow unreproducible and makes problems or changes difficult to troubleshoot.
To remedy this, and since your function relies on knowing what the old/new variable names are or should be, I think pmap or base R Map may work better. Further, I suggest that you name the new variables outside of the function, making it "data-only".
myfunc <- function(x) x/100
dat <- df; cols <- c("alpha", "beta", "gamma")  # the example frame and the columns to transform
setNames(lapply(dat[,cols], myfunc), paste0("new", cols))
# $newalpha
# [1] 0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 0.10 0.11 0.12 0.13 0.14 0.15 0.16 0.17
# [19] 0.18 0.19 0.20
# $newbeta
# [1] 0.30 0.31 0.32 0.33 0.34 0.35 0.36 0.37 0.38 0.39 0.40 0.41 0.42 0.43 0.44 0.45 0.46 0.47
# [19] 0.48 0.49 0.50
# $newgamma
# [1] 1.00 1.01 1.02 1.03 1.04 1.05 1.06 1.07 1.08 1.09 1.10 1.11 1.12 1.13 1.14 1.15 1.16 1.17
# [19] 1.18 1.19 1.20
From here, we just need to column-bind (cbind) it:
cbind(dat, setNames(lapply(dat[,cols], myfunc), paste0("new", cols)))
# alpha beta gamma newalpha newbeta newgamma
# 1 0 30 100 0.00 0.30 1.00
# 2 1 31 101 0.01 0.31 1.01
# 3 2 32 102 0.02 0.32 1.02
# 4 3 33 103 0.03 0.33 1.03
# 5 4 34 104 0.04 0.34 1.04
# ...
Special note: if you plan on doing this iteratively (repeatedly), it is generally bad to grow a frame by repeatedly adding rows, and I suspect (without proof at the moment) that the same holds for columns. For that reason, if you do this a lot, consider using do.call(cbind, c(list(dat), ...)) where ... is the list of things to add. This results in a single call to cbind and therefore only a single memory-copy of the original dat. (Contrast that with iteratively calling the *bind functions, which make a complete copy with each pass and scale poorly.)
additions <- lapply(1:3, function(i) setNames(lapply(dat[,cols], myfunc), paste0("new", i, cols)))
str(additions)
# List of 3
# $ :List of 3
# ..$ new1alpha: num [1:21] 0 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 ...
# ..$ new1beta : num [1:21] 0.3 0.31 0.32 0.33 0.34 0.35 0.36 0.37 0.38 0.39 ...
# ..$ new1gamma: num [1:21] 1 1.01 1.02 1.03 1.04 1.05 1.06 1.07 1.08 1.09 ...
# $ :List of 3
# ..$ new2alpha: num [1:21] 0 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 ...
# ..$ new2beta : num [1:21] 0.3 0.31 0.32 0.33 0.34 0.35 0.36 0.37 0.38 0.39 ...
# ..$ new2gamma: num [1:21] 1 1.01 1.02 1.03 1.04 1.05 1.06 1.07 1.08 1.09 ...
# $ :List of 3
# ..$ new3alpha: num [1:21] 0 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 ...
# ..$ new3beta : num [1:21] 0.3 0.31 0.32 0.33 0.34 0.35 0.36 0.37 0.38 0.39 ...
# ..$ new3gamma: num [1:21] 1 1.01 1.02 1.03 1.04 1.05 1.06 1.07 1.08 1.09 ...
do.call(cbind, c(list(dat), additions))
# alpha beta gamma new1alpha new1beta new1gamma new2alpha new2beta new2gamma new3alpha new3beta new3gamma
# 1 0 30 100 0.00 0.30 1.00 0.00 0.30 1.00 0.00 0.30 1.00
# 2 1 31 101 0.01 0.31 1.01 0.01 0.31 1.01 0.01 0.31 1.01
# 3 2 32 102 0.02 0.32 1.02 0.02 0.32 1.02 0.02 0.32 1.02
# 4 3 33 103 0.03 0.33 1.03 0.03 0.33 1.03 0.03 0.33 1.03
# 5 4 34 104 0.04 0.34 1.04 0.04 0.34 1.04 0.04 0.34 1.04
# 6 5 35 105 0.05 0.35 1.05 0.05 0.35 1.05 0.05 0.35 1.05
# ...
An alternative approach is to change your function to only return a vector:
newfun2 <- function(var = NULL) {
df[[var]] / 100
}
newfun2('alpha')
# [1] 0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 0.10 0.11 0.12 0.13
#[15] 0.14 0.15 0.16 0.17 0.18 0.19 0.20
Then, using base R, you can use lapply() to loop over the vector of column names:
cols <- c("alpha", "beta", "gamma")
df[, paste0(cols, 'NEW')] <- lapply(cols, newfun2)
#or
#df[, paste0(cols, 'NEW')] <- purrr::map(cols, newfun2)
df
alpha beta gamma alphaNEW betaNEW gammaNEW
1 0 30 100 0.00 0.30 1.00
2 1 31 101 0.01 0.31 1.01
3 2 32 102 0.02 0.32 1.02
4 3 33 103 0.03 0.33 1.03
5 4 34 104 0.04 0.34 1.04
6 5 35 105 0.05 0.35 1.05
7 6 36 106 0.06 0.36 1.06
8 7 37 107 0.07 0.37 1.07
9 8 38 108 0.08 0.38 1.08
10 9 39 109 0.09 0.39 1.09
11 10 40 110 0.10 0.40 1.10
12 11 41 111 0.11 0.41 1.11
13 12 42 112 0.12 0.42 1.12
14 13 43 113 0.13 0.43 1.13
15 14 44 114 0.14 0.44 1.14
16 15 45 115 0.15 0.45 1.15
17 16 46 116 0.16 0.46 1.16
18 17 47 117 0.17 0.47 1.17
19 18 48 118 0.18 0.48 1.18
20 19 49 119 0.19 0.49 1.19
21 20 50 120 0.20 0.50 1.20
Based on the way you wrote your function, a for loop that repeatedly assigns the result of newfun to df works pretty well.
vars <- names(df)
for (i in vars){
df <- newfun(i)
}
df
# alpha beta gamma alphaNEW betaNEW gammaNEW
# 1 0 30 100 0.00 0.30 1.00
# 2 1 31 101 0.01 0.31 1.01
# 3 2 32 102 0.02 0.32 1.02
# 4 3 33 103 0.03 0.33 1.03
# 5 4 34 104 0.04 0.34 1.04
# 6 5 35 105 0.05 0.35 1.05
# 7 6 36 106 0.06 0.36 1.06
# 8 7 37 107 0.07 0.37 1.07
# 9 8 38 108 0.08 0.38 1.08
# 10 9 39 109 0.09 0.39 1.09
# 11 10 40 110 0.10 0.40 1.10
# 12 11 41 111 0.11 0.41 1.11
# 13 12 42 112 0.12 0.42 1.12
# 14 13 43 113 0.13 0.43 1.13
# 15 14 44 114 0.14 0.44 1.14
# 16 15 45 115 0.15 0.45 1.15
# 17 16 46 116 0.16 0.46 1.16
# 18 17 47 117 0.17 0.47 1.17
# 19 18 48 118 0.18 0.48 1.18
# 20 19 49 119 0.19 0.49 1.19
# 21 20 50 120 0.20 0.50 1.20
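Since the question asks about purrr: the for loop above is a fold, and purrr::reduce() expresses the same accumulation directly (a sketch; the anonymous function inlines newfun's logic so it works on its accumulator argument instead of a global df):
library(purrr)
df <- reduce(c("alpha", "beta", "gamma"), function(acc, var) {
  acc[[paste0(var, "NEW")]] <- acc[[var]] / 100  # same transformation as newfun
  acc
}, .init = df)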

R dividing dataset into ranged bins?

I am having some problems sorting my dataset into bins based on the numeric value of each data point. I tried doing it with the function shingle from the lattice package, which seems to split it accurately.
However, I can't seem to extract the desired output, namely how the data is divided into the predefined bins; I seem only able to print it.
bin_interval = matrix(c(0.38,0.42,0.46,0.50,0.54,0.58,0.62,0.66,0.70,0.74,0.78,0.82,0.86,0.90,0.94,0.98,
0.40,0.44,0.48,0.52,0.56,0.60,0.64,0.68,0.72,0.76,0.80,0.84,0.88,0.92,0.96,1.0),
ncol = 2, nrow = 16)
bin_1 = shingle(data_1,intervals = bin_interval)
How do I extract the intervals output by the shingle function, rather than only printing them?
The intervals being this output:
Intervals:
min max count
1 0.38 0.40 0
2 0.42 0.44 6
3 0.46 0.48 46
4 0.50 0.52 251
5 0.54 0.56 697
6 0.58 0.60 1062
7 0.62 0.64 1215
8 0.66 0.68 1227
9 0.70 0.72 1231
10 0.74 0.76 1293
11 0.78 0.80 1330
12 0.82 0.84 1739
13 0.86 0.88 2454
14 0.90 0.92 3048
15 0.94 0.96 8936
16 0.98 1.00 71446
As a variable that can be fed to another function.
The shingle() function stores the intervals as attributes, which you can list with attributes().
The levels are specifically given by attr(bin_1, "levels").
So:
set.seed(1337)
data_1 = runif(100)
bin_interval = matrix(c(0.38,0.42,0.46,0.50,0.54,0.58,0.62,0.66,0.70,0.74,0.78,0.82,0.86,0.90,0.94,0.98,
0.40,0.44,0.48,0.52,0.56,0.60,0.64,0.68,0.72,0.76,0.80,0.84,0.88,0.92,0.96,1.0),
ncol = 2, nrow = 16)
bin_1 = shingle(data_1,intervals = bin_interval)
attr(bin_1,"levels")
This gives:
[,1] [,2]
[1,] 0.38 0.40
[2,] 0.42 0.44
[3,] 0.46 0.48
[4,] 0.50 0.52
[5,] 0.54 0.56
[6,] 0.58 0.60
[7,] 0.62 0.64
[8,] 0.66 0.68
[9,] 0.70 0.72
[10,] 0.74 0.76
[11,] 0.78 0.80
[12,] 0.82 0.84
[13,] 0.86 0.88
[14,] 0.90 0.92
[15,] 0.94 0.96
[16,] 0.98 1.00
Edit
The count information for each interval is only computed within the print.shingle method. Thus, you would need to run the following code:
count.shingle <- function(x){
  l <- levels(x)
  n <- nlevels(x)
  int <- data.frame(min = numeric(n), max = numeric(n),
                    count = numeric(n))
  for (i in 1:n) {
    int$min[i]   <- l[[i]][1]
    int$max[i]   <- l[[i]][2]
    int$count[i] <- length(x[x >= l[[i]][1] & x <= l[[i]][2]])
  }
  int
}
a <- count.shingle(bin_1)
This gives:
> a
min max count
1 0.38 0.40 0
2 0.42 0.44 1
3 0.46 0.48 3
4 0.50 0.52 1
5 0.54 0.56 2
6 0.58 0.60 2
7 0.62 0.64 2
8 0.66 0.68 4
9 0.70 0.72 1
10 0.74 0.76 3
11 0.78 0.80 2
12 0.82 0.84 2
13 0.86 0.88 5
14 0.90 0.92 1
15 0.94 0.96 1
16 0.98 1.00 2
where a$min is the lower bound, a$max is the upper bound, and a$count is the number of values in each bin.
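If all you need are the counts, a base-R sketch with findInterval() avoids the loop entirely (my own alternative, assuming non-overlapping bins; boundary handling differs slightly from shingle, which closes both ends of each interval):
breaks <- sort(as.vector(bin_interval))
# odd-numbered slots are the 16 bins; even-numbered slots are the gaps between them
slot <- findInterval(data_1, breaks, rightmost.closed = TRUE)
counts <- tabulate(slot, nbins = length(breaks))[seq(1, length(breaks) - 1, by = 2)]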

How to pick only efficient frontier points in a plot of portfolio performance?

The name of this question does not do it justice. This is best explained by numerical example. Let's say I have the following portfolio data, called data.
> data
Stdev AvgReturn
1 1.92 0.35
2 1.53 0.34
3 1.39 0.31
4 1.74 0.31
5 1.16 0.30
6 1.27 0.29
7 1.78 0.28
8 1.59 0.27
9 1.05 0.27
10 1.17 0.26
11 1.62 0.25
12 1.33 0.25
13 0.96 0.24
14 1.47 0.24
15 1.09 0.24
16 1.20 0.24
17 1.49 0.23
18 1.01 0.23
19 0.88 0.22
20 1.21 0.22
21 1.37 0.22
22 1.09 0.22
23 0.95 0.21
24 0.81 0.21
I have already sorted the data data.frame by AvgReturn, which I believe makes this easier. My goal is essentially to eliminate all the points that make no sense to choose, i.e., I would not want a portfolio with a lower AvgReturn but a higher Stdev (assuming for now that stdev is an appropriate measure of risk).
Essentially, does anyone know of an efficient (in the code sense) way to choose the "rational" portfolio choices? I have manually added a third column to this data frame to show you which portfolio choices should be kept. I would want to remove portfolio 4 because I would never choose it: portfolio 3 offers the same return with a lower stdev. Similarly, I would never choose 8, because 5 offers a higher return with a lower stdev.
> res
Stdev AvgReturn Keep
1 1.92 0.35 TRUE
2 1.53 0.34 TRUE
3 1.39 0.31 TRUE
4 1.74 0.31 FALSE
5 1.16 0.30 TRUE
6 1.27 0.29 FALSE
7 1.78 0.28 FALSE
8 1.59 0.27 FALSE
9 1.05 0.27 TRUE
10 1.17 0.26 FALSE
11 1.62 0.25 FALSE
12 1.33 0.25 FALSE
13 0.96 0.24 TRUE
14 1.47 0.24 FALSE
15 1.09 0.24 FALSE
16 1.20 0.24 FALSE
17 1.49 0.23 FALSE
18 1.01 0.23 FALSE
19 0.88 0.22 TRUE
20 1.21 0.22 FALSE
21 1.37 0.22 FALSE
22 1.09 0.22 FALSE
23 0.95 0.21 FALSE
24 0.81 0.21 TRUE
The only way I can think of to solve this is looping through and checking each condition, which will be relatively inefficient in R, my preferred language for this solution. I am having difficulty thinking of a vectorized solution. Any help is appreciated!
EDIT
Here I believe is a solution:
domstrat <- function(data){
  # rows are already sorted by AvgReturn (descending); keep each row
  # whose Stdev sets a new running minimum
  keep <- c(-1, sign(diff(cummin(data[[1]]))))
  data <- data[which(keep != 0), ]
  return(data)
}
domstrat(data)
Stdev AvgReturn
1 1.92 0.35
2 1.53 0.34
3 1.39 0.31
5 1.16 0.30
9 1.05 0.27
13 0.96 0.24
19 0.88 0.22
24 0.81 0.21
This solution uses the function cummax to identify the qualifying points, testing the running maximum against the original data:
> data <- data[order(data$Stdev),]
> data[ which(data$AvgReturn == cummax(data$AvgReturn)) , ]
Stdev AvgReturn
24 0.81 0.21
19 0.88 0.22
13 0.96 0.24
9 1.05 0.27
5 1.16 0.30
3 1.39 0.31
2 1.53 0.34
1 1.92 0.35
> plot(data)
> points( data[ which(data$AvgReturn == cummax(data$AvgReturn)) , ] , col="green")
It's not actually the convex hull but what might be called the "monotonically increasing hull".
You can define a custom R function that decides whether or not to keep a certain portfolio depending on the standard deviation and the average return:
portfolioKeep <- function(x){
  # x[1] holds the Stdev, x[2] the AvgReturn for the input row.
  # One reasonable rule (an illustration, not the only choice): keep the
  # row unless some other portfolio in `input` offers an equal-or-higher
  # return at strictly lower risk, or a strictly higher return at
  # equal-or-lower risk.
  !any((input$AvgReturn >= x[2] & input$Stdev <  x[1]) |
       (input$AvgReturn >  x[2] & input$Stdev <= x[1]))
}
Next we can use an apply function on your input data frame to come up with the Keep column you want:
# your 'input' data frame
input.mat <- data.matrix(input)
# apply custom function to rows
keep <- apply(input.mat, 1, portfolioKeep)
# bind keep vector to input data frame
input <- cbind(input, keep)
The above code first converts the input data frame into a numeric matrix so that we can use the apply function on it. The apply function will run portfolioKeep on each row, returning either "TRUE" or "FALSE." Finally, we roll the Keep column up into the original data frame for convenience.
And now you can do your reporting easily with the data frame input with which you started.
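For instance, to pull out only the rows flagged as efficient (a short usage sketch following on from the code above):
subset(input, keep)  # keeps the rows for which portfolioKeep returned TRUE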
